javascript - Scrapy returning empty list for xpath -


I am using Scrapy to scrape abstracts from OpenReview URLs. For example, I want the abstract from http://openreview.net/forum?id=bk0fwvcgx, and upon doing

$ scrapy shell "http://openreview.net/forum?id=bk0fwvcgx"
>>> response.xpath('//span[@class="note_content_value"]').extract()

I get []. In addition, calling view(response) leads to a blank page at file:///var/folders/1j/_gkykr316td7f26fv1775c3w0000gn/t/tmpbehkh8.html.

Further, inspecting the OpenReview page shows me there are script elements, which I've never dealt with before. When I call

response.xpath('//script').extract() I get things like u'<script src="static/libs/search.js"></script>', for example.
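A quick way to confirm that the visible content is injected by scripts rather than present in the raw HTML is to list the script src attributes in the page source. A minimal sketch using only the standard-library html.parser; the sample HTML is hypothetical except for the one script tag taken from the shell output above:

```python
from html.parser import HTMLParser

class ScriptSrcParser(HTMLParser):
    """Collects the src attribute of every <script> tag it sees."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)

# Stand-in for response.body; the single script tag is the one
# returned by the scrapy shell above, the rest is made up.
html = '<html><body><script src="static/libs/search.js"></script></body></html>'
parser = ScriptSrcParser()
parser.feed(html)
print(parser.srcs)  # ['static/libs/search.js']
```

If the script list is long and the body is otherwise nearly empty, the data is almost certainly fetched by JavaScript after page load, which is why the XPath against the initial response comes back empty.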

I've read a little bit about this being caused by JavaScript, but I'm a beginner with Scrapy and unsure how to get around it and extract what I want.

I found that the page uses JavaScript/AJAX to load the information from the address
http://openreview.net/notes?forum=bk0fwvcgx&trash=true

But it needs two cookies to access that information. First the server sends the cookie gclb. Later the page loads http://openreview.net/token and gets the second cookie openreview:sid. After that, the page can load the JSON data.

Here is a working example using requests:

import requests

s = requests.Session()

# `gclb` cookie
r = s.get('http://openreview.net/forum?id=bk0fwvcgx')
print(r.cookies)

# `openreview:sid` cookie
r = s.get('http://openreview.net/token')
print(r.cookies)

# JSON data
r = s.get('http://openreview.net/notes?forum=bk0fwvcgx&trash=true')
data = r.json()
print(data['notes'][0]['content']['title'])
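Once you have the JSON, you can walk the same notes → content path for every note. The payload below is a hypothetical stand-in mirroring only the data['notes'][0]['content']['title'] shape used above; the 'abstract' key and all values are assumptions, not confirmed fields:

```python
# Hypothetical payload mirroring the shape accessed above; the
# 'abstract' key and the values are assumptions for illustration.
data = {
    "notes": [
        {
            "content": {
                "title": "Example paper title",
                "abstract": "Example abstract text.",
            }
        }
    ]
}

# Collect the abstract (if present) from every note.
abstracts = [
    note["content"].get("abstract", "")
    for note in data["notes"]
]
print(abstracts)  # ['Example abstract text.']
```

Using .get() on the content dict avoids a KeyError on notes (e.g. review comments) that may not carry every field.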

Another solution: use Selenium or a similar tool to run the JavaScript, so you get the full rendered HTML. Scrapy can be combined with Selenium or PhantomJS to run JavaScript, but I have never tried that with Scrapy.

