ajax - Python requests module gets the same results despite incrementing page number
The only thing that changes in the URL is the page number, which is incremented after each request.
Other than Selenium or related tools, I'm not sure what approach could be used to traverse the pages. My instinct is that there may be a header/query combination that returns the data directly, but I don't know how to find it.
import requests
from bs4 import BeautifulSoup

url = 'http://therunningbug.co.uk/events/find-races.aspx?eventname=&addressregion=&addresscounty=&date=&surface=#sort=date&page='
page = 1
while True:
    pagedata = BeautifulSoup(requests.get(url + str(page)).content, 'html.parser')
    articles = pagedata.find('div', {'class': "items-content"})
    for a in articles.find_all('article'):
        name = a.find('span', {'itemprop': "name"}).text
        # The datetime attribute is split into its date and time halves.
        d, t = a.find('time').get('datetime').split('T')
        timedata = t[:-3]
        datedata = d.split('-')
        date = (datedata[1] + '/' + datedata[2] + '/' + datedata[0][2:]).strip()
        description = a.find('p', {'itemprop': "description"}).text.strip()
        weblink = 'http://therunningbug.co.uk' + a.find('a', {'itemprop': "url"}).get('href')
        category = a.find('span', {'class': "surface"}).text
        location = a.find('span', {'class': "region"}).text + ', ' + a.find('span', {'class': "county"}).text
        print(name, ' -- name')
        print(date, ', ', timedata, ' -- date, time')
        print(description, ' -- description')
        print(weblink, ' -- website link')
        print(category, ' -- category')
        print(location, ' -- location\n')
    page += 1
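The problem is the URL encoding. In your URL the page number comes after the '#', so it is part of the fragment, which is not sent to the server. You can let requests URL-encode the parameters instead: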
url = 'http://therunningbug.co.uk/events/find-races.aspx'
payload = {'page': page}
pagedata = BeautifulSoup(requests.get(url, params=payload).content)
This also works, because there are no complex characters in the URI that need URL encoding:
url = 'http://therunningbug.co.uk/events/find-races.aspx'
pagedata = BeautifulSoup(requests.get(url + '?page=' + str(page)).content)
See the requests documentation on URL encoding: http://docs.python-requests.org/en/master/user/quickstart/
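To see why the original URL kept returning the same page, it may help to split both URLs into their parts. The sketch below uses only the standard library's urllib.parse (not requests itself), and the page number 3 is just an illustrative value. It shows that in the question's URL the page number ends up in the fragment, while in the fixed URL it is part of the query string.

```python
from urllib.parse import urlencode, urlsplit

# URL from the question, with an example page number appended.
question_url = ('http://therunningbug.co.uk/events/find-races.aspx'
                '?eventname=&addressregion=&addresscounty=&date=&surface='
                '#sort=date&page=3')
parts = urlsplit(question_url)
print(parts.query)     # the query string, which contains no page number
print(parts.fragment)  # everything after '#', including page=3

# Fixed version: the page number is encoded into the query string.
base_url = 'http://therunningbug.co.uk/events/find-races.aspx'
fixed_url = base_url + '?' + urlencode({'page': 3})
print(fixed_url)
print(urlsplit(fixed_url).query)
```

Passing params=payload to requests.get builds the query string in a comparable way, so the page number actually reaches the server.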
Complete code:
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = 'http://therunningbug.co.uk/events/find-races.aspx'
page = 1
while True:
    # Let requests build the query string (?page=N) from the payload.
    payload = {'page': page}
    pagedata = BeautifulSoup(requests.get(url, params=payload).content, 'html.parser')
    articles = pagedata.find('div', {'class': "items-content"})
    if articles is None:
        # Assumption: a missing listing container means there are no more pages.
        break
    for a in articles.find_all('article'):
        name = a.find('span', {'itemprop': "name"}).text
        # The datetime attribute is split into its date and time halves.
        d, t = a.find('time').get('datetime').split('T')
        timedata = t[:-3]
        datedata = d.split('-')
        date = (datedata[1] + '/' + datedata[2] + '/' + datedata[0][2:]).strip()
        description = a.find('p', {'itemprop': "description"}).text.strip()
        weblink = 'http://therunningbug.co.uk' + a.find('a', {'itemprop': "url"}).get('href')
        category = a.find('span', {'class': "surface"}).text
        location = a.find('span', {'class': "region"}).text + ', ' + a.find('span', {'class': "county"}).text
        print(name, ' -- name')
        print(date, ', ', timedata, ' -- date, time')
        print(description, ' -- description')
        print(weblink, ' -- website link')
        print(category, ' -- category')
        print(location, ' -- location\n')
    page += 1
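The date and time handling in the loop could also be done with the datetime module rather than manual string slicing. This is only a sketch: split_datetime is a hypothetical helper name, and it assumes the datetime attribute looks like '2016-05-01T09:30:00' (the sample value is made up, not taken from the site).

```python
from datetime import datetime


def split_datetime(value):
    """Return (date, time) as 'MM/DD/YY' and 'HH:MM' strings.

    Assumes `value` has the form 'YYYY-MM-DDTHH:MM:SS'; any other
    format raises ValueError.
    """
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%S')
    return parsed.strftime('%m/%d/%y'), parsed.strftime('%H:%M')


# Hypothetical sample value, for illustration only.
date, timedata = split_datetime('2016-05-01T09:30:00')
print(date, timedata)
```

Compared with the slicing version, this fails loudly if the attribute ever arrives in a different format, which may or may not be what you want in a scraper.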