python - Delete h2 until you reach the next h2 in BeautifulSoup
Consider the following HTML:
<h2 id="example">cool stuff</h2> <ul> <li>hi</li> </ul> <div> <h2 id="cool"></h2> <ul><li>zz</li> </ul> </div>
and the following list:
ignore_list = ['example','lalala']
My goal is, while going through the HTML with BeautifulSoup, to find every h2 whose id is in the list (ignore_list) and delete the ul and lis under it until the next h2 is found. Then check whether that next h2 is in the ignore list; if it is, delete the ul and lis until the next h2 is reached (or, if there are no h2s left, delete the ul and lis under the current one and stop).
How I see the process going: read the h2s from the top down in the DOM. If the id of one is in ignore_list, delete the ul and li under that h2 until the next h2 is reached. If there is no next h2, delete the ul and li and stop.
Here is the full HTML I am trying to work with: http://pastebin.com/z3ev9c8n
I am trying to delete the ul and lis after "see_also". How can I accomplish this in Python?
Below is the solution I came up with.
# Remove the content we don't want.
# `body` is assumed to be an already-parsed BeautifulSoup tag.
for element in body.find_all('h2'):
    try:
        current_h2 = element.get_text()
        current_h2 = current_h2.replace('[edit]', '')
        # print(current_h2)
        if current_h2 in ignore_list:
            if element.find_next_sibling('div') is not None:
                element.find_next_sibling('div').decompose()
            if element.find_next_sibling('ul') is not None:
                element.find_next_sibling('ul').decompose()
    except (AttributeError, TypeError):
        # Skip this heading and move on to the next one.
        continue
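One limitation of the attempt above is that find_next_sibling('ul') returns the first following ul sibling wherever it is, even past the next h2, and it removes at most one element per heading. It also compares the heading text, not the id, against ignore_list. A minimal sketch of the process described in the question might instead match on the id and walk the following siblings one by one, stopping at the next h2. The function name strip_ignored_sections and the tags_to_remove parameter are hypothetical names chosen for illustration, and the sketch assumes the h2 and the elements to delete are siblings, as in the sample HTML.

```python
from bs4 import BeautifulSoup

# Sample HTML from the question (with the second h2 closed properly).
html = """
<h2 id="example">cool stuff</h2>
<ul><li>hi</li></ul>
<div>
  <h2 id="cool"></h2>
  <ul><li>zz</li></ul>
</div>
"""
ignore_list = ['example', 'lalala']


def strip_ignored_sections(soup, ignore_ids, tags_to_remove=('ul',)):
    """Remove matching siblings after each ignored h2, up to the next h2.

    Hypothetical helper: assumes the elements to delete are siblings of
    the h2, not nested somewhere else in the tree.
    """
    for h2 in soup.find_all('h2'):
        if h2.get('id') not in ignore_ids:
            continue
        # Take a snapshot of the siblings before changing the tree.
        for sibling in list(h2.find_next_siblings()):
            if sibling.name == 'h2':
                break  # reached the next heading, stop deleting
            if sibling.name in tags_to_remove:
                # extract() detaches the node but keeps the object usable,
                # so headings nested in a removed node do not break the loop.
                sibling.extract()
    return soup


soup = BeautifulSoup(html, 'html.parser')
strip_ignored_sections(soup, ignore_list)

remaining_items = [li.get_text() for li in soup.find_all('li')]
print(remaining_items)  # ['zz']
```

If there is no later h2 among the siblings, the inner loop simply runs to the end, which covers the "no h2s left" case. Passing tags_to_remove=('ul', 'div') would also drop div wrappers, but in the sample HTML the div contains the next h2, so that would remove the following section as well.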