javascript - Scrapy returning empty list for xpath
I am using Scrapy to scrape abstracts from openreview URLs. For example, I want the abstract from http://openreview.net/forum?id=bk0fwvcgx, and upon doing

$ scrapy shell "http://openreview.net/forum?id=bk0fwvcgx"
>>> response.xpath('//span[@class="note_content_value"]').extract()

I get []. In addition, view(response) leads to a blank site at file:///var/folders/1j/_gkykr316td7f26fv1775c3w0000gn/t/tmpbehkh8.html.
Further, inspecting the openreview webpage shows me that there are script elements, which I've never seen before. When I call

response.xpath('//script').extract()

I get things like

u'<script src="static/libs/search.js"></script>'

for example. I've read a little bit about this having to do with JavaScript, but I'm kind of a beginner with Scrapy and unsure how to bypass it and get what I want.
I found that the page uses JavaScript/AJAX to load the information from the address

http://openreview.net/notes?forum=bk0fwvcgx&trash=true

but it needs two cookies to access that information. First the server sends the GCLB cookie; later the page loads http://openreview.net/token and gets the second cookie, openreview:sid. After that the page can load the JSON data.
Here is a working example using requests:

import requests

s = requests.Session()

# get the `GCLB` cookie
r = s.get('http://openreview.net/forum?id=bk0fwvcgx')
print(r.cookies)

# get the `openreview:sid` cookie
r = s.get('http://openreview.net/token')
print(r.cookies)

# get the JSON data
r = s.get('http://openreview.net/notes?forum=bk0fwvcgx&trash=true')
data = r.json()
print(data['notes'][0]['content']['title'])
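Once you have the `data` dict, other fields can be pulled out the same way. A minimal sketch of extracting abstracts, assuming each note's content dict carries an `abstract` field next to `title` (verify the field name against the real response); the sample JSON below is hand-made for illustration:

```python
import json

# Hand-made sample shaped like the /notes response (assumed structure).
sample = json.loads('''
{"notes": [{"content": {"title": "Some paper",
                        "abstract": "Short abstract text."}}]}
''')

def extract_abstracts(data):
    # Collect the abstract of every note that has one.
    return [note['content']['abstract']
            for note in data['notes']
            if 'abstract' in note['content']]

print(extract_abstracts(sample))  # ['Short abstract text.']
```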
Other solution: use Selenium or another tool that runs JavaScript code, so you can get the full HTML with all the information. Scrapy can use Selenium or PhantomJS to run the JavaScript, but I never tried it with Scrapy.
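Once the JavaScript has run (via Selenium or PhantomJS), the original XPath from the question should match the rendered HTML. A small sketch using only the stdlib ElementTree on a hand-made snippet, which assumes the rendered markup looks the way the question's selector implies:

```python
import xml.etree.ElementTree as ET

# Hand-made snippet standing in for the JS-rendered page
# (assumed shape, matching the question's selector).
rendered = '''
<div>
  <span class="note_content_value">An example abstract.</span>
  <span class="other">ignored</span>
</div>
'''

root = ET.fromstring(rendered)
# Same predicate as the question's XPath, in ElementTree's
# limited XPath dialect.
values = [span.text for span in
          root.findall(".//span[@class='note_content_value']")]
print(values)  # ['An example abstract.']
```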