Bug #1144: lxml not finding elements - Website Input - LukeMurphey.net

Bug #1144

lxml not finding elements

Added by Luke Murphey over 9 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Luke Murphey

Category:

Target version:

1.2.0

Start date:

12/27/2015

Due date:

% Done:

100%

Description

https://answers.splunk.com/answers/337109/extracting-http-served-xml-datascraping-http-provi.html

Sample code:

import lxml.html
from lxml.cssselect import CSSSelector

content = """<nutcallstatus>
    <!--this is similar to status.xml, but with more values-->
    <!--all temperatures are displayed in tenths F, regardless of setting of unit-->
    <!--all temperatures sent by browser to unit should be in F.  you can send tenths F with a decimal place, ex: 123.5-->  
    <COOK>
       <COOK_NAME>Cook</COOK_NAME>
       <COOK_TEMP>695</COOK_TEMP>
       <COOK_SET>1000</COOK_SET>
       <COOK_STATUS>0</COOK_STATUS>
    </COOK>""" 

tree = lxml.html.fromstring(content)

selector = CSSSelector("COOK_TEMP")
matches = selector(tree)

# Prints an empty list: the parsed tree contains <cook_temp>, not <COOK_TEMP>
print(matches)

History

#1 Updated by Luke Murphey over 9 years ago

Target version set to 1.2.0

#2 Updated by Luke Murphey over 9 years ago

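It turns out that lxml's HTML parser converts tag names to lowercase, so the selector "COOK_TEMP" no longer matches anything in the parsed tree. This can be confirmed by parsing the sample content with lxml.html.fromstring() and then serializing the tree: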

lxml.etree.tostring(tree)

The output is:

<nutcallstatus>\n    <!--this is similar to status.xml, but with more values-->\n    <!--all temperatures are displayed in tenths F, regardless of setting of unit-->\n    <!--all temperatures sent by browser to unit should be in F.  you can send tenths F with a decimal place, ex: 123.5-->  \n    <cook>\n       <cook_name>Cook</cook_name>\n       <cook_temp>695</cook_temp>\n       <cook_set>1000</cook_set>\n       <cook_status>0</cook_status>\n    </cook></nutcallstatus>
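A minimal sketch of two possible workarounds, assuming lxml is installed: query the HTML-parsed tree with a lowercase name, or parse the payload with lxml.etree so the original case is kept. The payload is a shortened, well-formed variant of the sample above, and the variable names are illustrative only. XPath is used here to avoid the extra cssselect dependency; the same lowercase name ("cook_temp") should apply to the CSSSelector call in the sample code.

```python
import lxml.etree
import lxml.html

# Shortened, well-formed variant of the payload from the description
# (closing </nutcallstatus> added here so the XML parser accepts it).
content = """<nutcallstatus>
    <COOK>
       <COOK_NAME>Cook</COOK_NAME>
       <COOK_TEMP>695</COOK_TEMP>
    </COOK>
</nutcallstatus>"""

# HTML parser: tag names are lowercased, so the uppercase query finds nothing
# and the lowercase query finds the element.
html_tree = lxml.html.fromstring(content)
html_upper = html_tree.xpath(".//COOK_TEMP")
html_lower = [el.text for el in html_tree.xpath(".//cook_temp")]

# XML parser: tag names keep their original case, so the uppercase query works.
xml_tree = lxml.etree.fromstring(content)
xml_upper = [el.text for el in xml_tree.xpath(".//COOK_TEMP")]

print(html_upper)
print(html_lower)
print(xml_upper)
```

Which option fits depends on the input: the XML parser rejects malformed markup, while the HTML parser is forgiving but loses the original tag case.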

#3 Updated by Luke Murphey over 9 years ago

Status changed from New to Closed
% Done changed from 0 to 100
