Bug #1144

lxml not finding elements

Added by Luke Murphey about 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 12/27/2015
Due date:
% Done: 100%


Description

https://answers.splunk.com/answers/337109/extracting-http-served-xml-datascraping-http-provi.html

Sample code:

import lxml.html
from lxml.cssselect import CSSSelector

content = """<nutcallstatus>
    <!--this is similar to status.xml, but with more values-->
    <!--all temperatures are displayed in tenths F, regardless of setting of unit-->
    <!--all temperatures sent by browser to unit should be in F.  you can send tenths F with a decimal place, ex: 123.5-->  
    <COOK>
       <COOK_NAME>Cook</COOK_NAME>
       <COOK_TEMP>695</COOK_TEMP>
       <COOK_SET>1000</COOK_SET>
       <COOK_STATUS>0</COOK_STATUS>
    </COOK>""" 

tree = lxml.html.fromstring(content)

selector = CSSSelector("COOK_TEMP")
matches = selector(tree)

print(matches)  # prints [] even though the content contains <COOK_TEMP>

History

#1 Updated by Luke Murphey about 8 years ago

  • Target version set to 1.2.0

#2 Updated by Luke Murphey about 8 years ago

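It turns out that lxml.html lowercases tag names when it parses the content as HTML, so the uppercase selector "COOK_TEMP" no longer matches anything. This can be confirmed by parsing the sample XML with lxml.html.fromstring() and then serializing the tree: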

lxml.etree.tostring(tree)

The output is:

<nutcallstatus>\n    <!--this is similar to status.xml, but with more values-->\n    <!--all temperatures are displayed in tenths F, regardless of setting of unit-->\n    <!--all temperatures sent by browser to unit should be in F.  you can send tenths F with a decimal place, ex: 123.5-->  \n    <cook>\n       <cook_name>Cook</cook_name>\n       <cook_temp>695</cook_temp>\n       <cook_set>1000</cook_set>\n       <cook_status>0</cook_status>\n    </cook></nutcallstatus>
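
The lowercasing described above can be checked directly. The following is a minimal sketch, not the fix that shipped: it uses a shortened copy of the sample content and lxml's built-in findall() instead of CSSSelector so that it does not depend on the separate cssselect package. The variable names are illustrative only.

```python
import lxml.etree
import lxml.html

# Shortened, well-formed copy of the sample content from the description.
content = """<nutcallstatus>
    <COOK>
       <COOK_NAME>Cook</COOK_NAME>
       <COOK_TEMP>695</COOK_TEMP>
    </COOK>
</nutcallstatus>"""

# Parsed as HTML: tag names are lowercased, so only the lowercase name matches.
html_tree = lxml.html.fromstring(content)
html_upper = html_tree.findall(".//COOK_TEMP")
html_lower = html_tree.findall(".//cook_temp")

# Parsed as XML: tag names keep their original case.
xml_tree = lxml.etree.fromstring(content)
xml_upper = xml_tree.findall(".//COOK_TEMP")

print(len(html_upper), len(html_lower), len(xml_upper))
```

By the same reasoning, a lowercase selector such as CSSSelector("cook_temp") should match against the HTML-parsed tree, while parsing the content with lxml.etree should let the original uppercase selector work unchanged.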

#3 Updated by Luke Murphey about 8 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100
