Bug #1144

lxml not finding elements

Added by Luke Murphey almost 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 12/27/2015
Due date:
% Done: 100%


Description

https://answers.splunk.com/answers/337109/extracting-http-served-xml-datascraping-http-provi.html

Sample code:

import lxml.html
from lxml.cssselect import CSSSelector

content = """<nutcallstatus>
    <!--this is similar to status.xml, but with more values-->
    <!--all temperatures are displayed in tenths F, regardless of setting of unit-->
    <!--all temperatures sent by browser to unit should be in F.  you can send tenths F with a decimal place, ex: 123.5-->  
    <COOK>
       <COOK_NAME>Cook</COOK_NAME>
       <COOK_TEMP>695</COOK_TEMP>
       <COOK_SET>1000</COOK_SET>
       <COOK_STATUS>0</COOK_STATUS>
    </COOK>""" 

tree = lxml.html.fromstring(content)

selector = CSSSelector("COOK_TEMP")
matches = selector(tree)

# Expected the COOK_TEMP element here, but no elements are found
print(matches)

History

#1 Updated by Luke Murphey almost 10 years ago

  • Target version set to 1.2.0

#2 Updated by Luke Murphey almost 10 years ago

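It turns out that lxml's HTML parser converts tag names to lowercase, so the upper-case selector "COOK_TEMP" no longer matches anything in the parsed tree. This can be confirmed by parsing the sample content with lxml.html.fromstring and then calling: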

lxml.etree.tostring(tree)

The output is:

<nutcallstatus>\n    <!--this is similar to status.xml, but with more values-->\n    <!--all temperatures are displayed in tenths F, regardless of setting of unit-->\n    <!--all temperatures sent by browser to unit should be in F.  you can send tenths F with a decimal place, ex: 123.5-->  \n    <cook>\n       <cook_name>Cook</cook_name>\n       <cook_temp>695</cook_temp>\n       <cook_set>1000</cook_set>\n       <cook_status>0</cook_status>\n    </cook></nutcallstatus>

#3 Updated by Luke Murphey almost 10 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100
