Python / JavaScript gurus - I need some assistance...

YFZblu · December 2013

Hey all, I have an issue I'm trying to work out in Python. I'm pretty new to programming/scripting, so bear with me on this.

There's a site my team uses quite a bit at work; the Bluecoat Site Review page. This page takes user input in the form of a URL, and then displays that website's categorization according to Bluecoat. It's located here : WebPulse Site Review Request

Anyway, what I want to do is write a command-line script that sends a request to this page in the form of an HTTP POST, and parses the response to display only the site categorization of the website submitted by the User.

In a perfect world my script would receive the response and search the document for a certain string; if found, that string would be printed to standard out. The issue I'm having is that Bluecoat's response does not give the site categorization as a string literal - the response is actually sent back as a variable value. That value is not directly reported back in the source code of the page, only the variable is. And I don't know how to grab the value of this variable with Python to display to the User. Here is the return result as it appears in the source code of the result:

"This page is currently categorized as <span data-bind="html: categorization"></span>"

I've been using Python's "Requests" module to make the HTTP request because it seems to have the most elegant syntax; however I'm open to learning/using something else like urllib2 or mechanize if it is more appropriate for this situation. Any ideas?

NotHackingYou · December 2013

in C# this is e.Response.BodyString - do you have a similar property in your response object?

YFZblu · December 2013

[ObjectName].content() is how the requests module would handle that in Python, if I'm interpreting your question correctly.

NotHackingYou · December 2013

Try [ObjectName].text

YFZblu · December 2013

.text produces the same result unfortunately

NotHackingYou · December 2013

Maybe I misunderstood - if you cannot view the value you need in source code - where is it stored - JSON or ? Try examining the traffic with Fiddler.

YFZblu · December 2013

Yeah it looks like this page uses the KnockoutJS library, and JSON. I'm currently reimaging my primary box at home with Windows 8.1, I'll post my Fiddler findings in a bit.

YFZblu · December 2013

Indeed, it appears to be JSON:

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json
Date: Tue, 03 Dec 2013 22:12:08 GMT
Content-Length: 1130

YFZblu · December 2013

It looks like Python 2.7 contains a standard json library, I'm going to try to use that with urllib2 to parse this information out. Thanks for the assistance I'll post my results.

YFZblu · December 2013

So, I'm still confused - Even if it's formatted in JSON, shouldn't the results still be in the source code of the page as a key/value pair or something, just formatted to follow the rules of JSON?

Or would it be normal to store it in some type of variable, the value of which is presented to the User when in the DOM environment? If this is the case, I'm still stuck on how to use Python to present this value to the User.

NotHackingYou · December 2013

import json
import pprint
data = json.loads([ObjectName].content()) //decode the json into an object you can work with.
pprint(data) //if this looks funny, decode it again with data = json.loads(json.loads([ObjectName].content()))

You should be able to go by name from there

Source: Python accessing data in JSON object - Stack Overflow

YFZblu · December 2013

Thank you - I'll run through your link

NotHackingYou · December 2013

I think the code i posted will get you pretty close

NotHackingYou · December 2013

So how did it turn out?

YFZblu · February 2014

Apologies for the bump on this old thread - I actually gave up on this project a couple of months ago; however after more reading, banging my head against the wall, and expanding my capabilities overall I went back to this project and was able to solve it in a very Pythonic six lines of code. That being said the response is a JSON object and once I add the parsing portion it will be a big longer. Additionally I'll need to add the user-input portion to the script - right now techexams.net is hardcoded as the site I want category info on.

Ultimately my issue was that I was making a POST request to the .jsp front-end webpage: 'sitereview.bluecoat.com/siterevew.jsp' - Which doesn't work for obvious reasons (obvious NOW, not then).

After viewing the traffic in a web proxy debugger, I noticed the POST request was actually being made to: "sitereview.bluecoat.com/rest/categorization". So that's where I made my change. Here is the solution:

================================================

import requests

url = "http://sitereview.bluecoat.com/rest/categorization"
payload = {'url':'www.techexams.net'}
headers = {'Referer':'http://www.sitereview.bluecoat.com/siterevew.jsp'}
r = requests.post(url, data=payload, headers=headers)

print r.content

================================================

and the raw response output - the data I want is in bold:

================================================

{"url":"http://www.techexams.net/","unrated":false,"curtrackingid":1771060,"locked":false,"ratedate":"Last Time Rated/Reviewed: > 7 days <img onmouseover='document.getElementById(\"dtsDiv\").style.display=\"block\";' onmouseout='document.getElementById(\"dtsDiv\").style.display=\"none\";'src='images/info24.gif' width=10 height=10></img><div id='dtsDiv' style='background-color: white; position:absolute; display:none'><table cellspacing=\"0\" cellpadding=\"4\" width=\"600\" style=\"border: 1px solid black;\"><thead><th bgcolor=003366> </th></thead><tbody><tr><td class='bodytext'>The URL submitted for review was rated more than 7 days ago. The default setting for Blue Coat SG clients to download rating changes is once a day. There is no need to show ratings older than this.<br><br>Since Blue Coat's desktop client K9 and certain OEM partners update differently, ratings may differ from those of a Blue Coat SG as well as those present on the Site Review Tool.</td></tr></tbody></table></div>","categorization":"<a href=\"javascript:showPopupWindow('catdesc.jsp?catnum=38')\">Technology/Internet</a>","linkable":true}

================================================

YFZblu · February 2014

Ultimately, had I observed the traffic in a proxy debugger earlier, like CarlSaiyed suggested, this would have been an easy 10 minute endeavor

NotHackingYou · February 2014

Glad you got it to work! Fiddler is an amazing tool - I use it daily.

Python / JavaScript gurus - I need some assistance...

Comments