Python / JavaScript gurus - I need some assistance...

YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
Hey all, I have an issue I'm trying to work out in Python. I'm pretty new to programming/scripting, so bear with me on this.

There's a site my team uses quite a bit at work; the Bluecoat Site Review page. This page takes user input in the form of a URL, and then displays that website's categorization according to Bluecoat. It's located here : WebPulse Site Review Request

Anyway, what I want to do is write a command-line script that sends a request to this page in the form of an HTTP POST, and parses the response to display only the site categorization of the website submitted by the User.

In a perfect world my script would receive the response and search the document for a certain string; if found, that string would be printed to standard out. The issue I'm having is that Bluecoat's response does not give the site categorization as a string literal - the response is actually sent back as a variable value. That value is not directly reported back in the source code of the page, only the variable is. And I don't know how to grab the value of this variable with Python to display to the User. Here is the return result as it appears in the source code of the result:

"This page is currently categorized as <span data-bind="html: categorization"></span>"

I've been using Python's "Requests" module to make the HTTP request because it seems to have the most elegant syntax; however I'm open to learning/using something else like urllib2 or mechanize if it is more appropriate for this situation. Any ideas?

Comments

  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    in C# this is e.Response.BodyString - do you have a similar property in your response object?
    When you go the extra mile, there's no traffic.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    [ObjectName].content() is how the requests module would handle that in Python, if I'm interpreting your question correctly.
  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    Try [ObjectName].text
    When you go the extra mile, there's no traffic.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    .text produces the same result unfortunately
  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    Maybe I misunderstood - if you cannot view the value you need in source code - where is it stored - JSON or ? Try examining the traffic with Fiddler.
    When you go the extra mile, there's no traffic.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    Yeah it looks like this page uses the KnockoutJS library, and JSON. I'm currently reimaging my primary box at home with Windows 8.1, I'll post my Fiddler findings in a bit.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    Indeed, it appears to be JSON:

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Content-Type: application/json
    Date: Tue, 03 Dec 2013 22:12:08 GMT
    Content-Length: 1130
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    It looks like Python 2.7 contains a standard json library, I'm going to try to use that with urllib2 to parse this information out. Thanks for the assistance I'll post my results.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    So, I'm still confused - Even if it's formatted in JSON, shouldn't the results still be in the source code of the page as a key/value pair or something, just formatted to follow the rules of JSON?

    Or would it be normal to store it in some type of variable, the value of which is presented to the User when in the DOM environment? If this is the case, I'm still stuck on how to use Python to present this value to the User.
  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    import json
    import pprint
    data = json.loads([ObjectName].content()) //decode the json into an object you can work with.
    pprint(data) //if this looks funny, decode it again with data = json.loads(json.loads([ObjectName].content()))

    You should be able to go by name from there

    Source: Python accessing data in JSON object - Stack Overflow
    When you go the extra mile, there's no traffic.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    Thank you - I'll run through your link
  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    I think the code i posted will get you pretty close
    When you go the extra mile, there's no traffic.
  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    So how did it turn out?
    When you go the extra mile, there's no traffic.
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    Apologies for the bump on this old thread - I actually gave up on this project a couple of months ago; however after more reading, banging my head against the wall, and expanding my capabilities overall I went back to this project and was able to solve it in a very Pythonic six lines of code. That being said the response is a JSON object and once I add the parsing portion it will be a big longer. Additionally I'll need to add the user-input portion to the script - right now techexams.net is hardcoded as the site I want category info on.

    Ultimately my issue was that I was making a POST request to the .jsp front-end webpage: 'sitereview.bluecoat.com/siterevew.jsp' - Which doesn't work for obvious reasons (obvious NOW, not then).

    After viewing the traffic in a web proxy debugger, I noticed the POST request was actually being made to: "sitereview.bluecoat.com/rest/categorization". So that's where I made my change. Here is the solution:

    ================================================

    import requests

    url = "http://sitereview.bluecoat.com/rest/categorization&quot;
    payload = {'url':'www.techexams.net'}
    headers = {'Referer':'http://www.sitereview.bluecoat.com/siterevew.jsp'}
    r = requests.post(url, data=payload, headers=headers)

    print r.content

    ================================================


    and the raw response output - the data I want is in bold:


    ================================================

    {"url":"http://www.techexams.net/","unrated":false,"curtrackingid":1771060,"locked":false,"ratedate":"Last Time Rated/Reviewed: > 7 days <img onmouseover='document.getElementById(\"dtsDiv\").style.display=\"block\";' onmouseout='document.getElementById(\"dtsDiv\").style.display=\"none\";'src='images/info24.gif' width=10 height=10></img><div id='dtsDiv' style='background-color: white; position:absolute; display:none'><table cellspacing=\"0\" cellpadding=\"4\" width=\"600\" style=\"border: 1px solid black;\"><thead><th bgcolor=003366> </th></thead><tbody><tr><td class='bodytext'>The URL submitted for review was rated more than 7 days ago. The default setting for Blue Coat SG clients to download rating changes is once a day. There is no need to show ratings older than this.<br><br>Since Blue Coat's desktop client K9 and certain OEM partners update differently, ratings may differ from those of a Blue Coat SG as well as those present on the Site Review Tool.</td></tr></tbody></table></div>","categorization":"<a href=\"javascript:showPopupWindow('catdesc.jsp?catnum=38')\">Technology/Internet</a>","linkable":true}

    ================================================
  • YFZbluYFZblu Member Posts: 1,462 ■■■■■■■■□□
    Ultimately, had I observed the traffic in a proxy debugger earlier, like CarlSaiyed suggested, this would have been an easy 10 minute endeavor :)
  • NotHackingYouNotHackingYou Member Posts: 1,460 ■■■■■■■■□□
    Glad you got it to work! Fiddler is an amazing tool - I use it daily.
    When you go the extra mile, there's no traffic.
Sign In or Register to comment.