python - Pattern of regular expressions while using Look Behind or Look Ahead Functions to find a match
I am trying to split sentences correctly, based on normal grammatical rules, in Python.
The sentence I want to split is:
s = """mr. smith bought cheapsite.com 1.5 million dollars, i.e. paid lot it. did mind? adam jones jr. thinks didn't. in case, isn't true... well, probability of .9 isn't."""
The expected output is:
mr. smith bought cheapsite.com 1.5 million dollars, i.e. paid lot it.
did mind?
adam jones jr. thinks didn't.
in case, isn't true...
well, probability of .9 isn't.
To achieve this using regular expressions, after a lot of searching I came upon the following regex trick. new_str is just s with the \n characters removed.
import re
m = re.split(r'(?<!\w\.\w.)(?<![a-z][a-z]\.)(?<=\.|\?)\s', new_str)
for i in m:
    print(i)
mr. smith bought cheapsite.com 1.5 million dollars,i.e. paid lot it.
did mind?
adam jones jr. thinks didn't.
in case, isn't true...
well, aprobability of .9 isn't.
So the way I understand the regex, we are first selecting:
1) all the characters like i.e
2) from the filtered spaces of the first selection, those that don't have words like mr., mrs., etc. before them
3) from the filtered second step, only those where we have either a dot or a question mark, preceded by a space.
So I tried to change the order as below:
1) filter out all the titles first
2) from the filtered step, select those that are preceded by a space
3) remove all phrases like i.e
But when I do this, the blank after i.e. is split as well:
m = re.split(r'(?<![a-z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', new_str)
for i in m:
    print(i)
mr. smith bought cheapsite.com 1.5 million dollars,i.e.
paid lot it.
did mind?
adam jones jr. thinks didn't.
in case, isn't true...
well, aprobability of .9 isn't.
Shouldn't the last step in the modified procedure be capable of identifying phrases like i.e.? Why is it failing to detect them?
First, the last . in (?<!\w\.\w.) looks suspicious. If you need to match a literal dot with it, escape it: (?<!\w\.\w\.).
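To see what the unescaped dot changes, here is a minimal sketch. The sample sentence and the variable names are made up for illustration, and the title lookbehind is left out to keep the patterns short.

```python
import re

# Hypothetical sample text; it is not the sentence from the question.
text = "Was it 1.5? Yes."

# Unescaped final dot: the lookbehind rejects any position preceded by
# word char, dot, word char, ANY char, so "1.5?" blocks the split.
unescaped = re.split(r'(?<!\w\.\w.)(?<=\.|\?)\s', text)

# Escaped final dot: only a literal dot in fourth place blocks the split.
escaped = re.split(r'(?<!\w\.\w\.)(?<=\.|\?)\s', text)

print(unescaped)  # ['Was it 1.5? Yes.']
print(escaped)    # ['Was it 1.5?', 'Yes.']
```

Whether the stricter escaped form is what you want depends on which abbreviations you expect in the input.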
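Coming to the question: when you use the r'(?<![a-z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind checks whether the position after the whitespace is not preceded by a word char, a dot, a word char and any char (since the . is unescaped). This condition is true, because what comes before that position is a dot, e, another dot and a space, and the dot in the first place is not a word char.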
To make the lookbehind work the same way as when it was before \s, put \s into the lookbehind pattern, too:
(?<![a-z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)
See the regex demo.
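The effect of the lookbehind's position can be checked with a short sketch. The sample text is invented for illustration and the title lookbehind is omitted, so only the i.e. check is compared.

```python
import re

# Hypothetical sample text, chosen so that only the "i.e." check matters.
text = "i.e. paid a lot. Did he mind?"

# Lookbehind BEFORE \s: it inspects the four chars before the space ("i.e.").
before = re.split(r'(?<!\w\.\w.)(?<=\.|\?)\s', text)

# Lookbehind AFTER \s: it inspects ".e. " (dot, e, dot, space), which does
# not start with a word char, so the negative lookbehind never blocks.
after = re.split(r'(?<=\.|\?)\s(?<!\w\.\w.)', text)

# Adding \s to the lookbehind shifts its window back onto "i.e. ".
fixed = re.split(r'(?<=\.|\?)\s(?<!\w\.\w.\s)', text)

print(before)  # ['i.e. paid a lot.', 'Did he mind?']
print(after)   # ['i.e.', 'paid a lot.', 'Did he mind?']
print(fixed)   # ['i.e. paid a lot.', 'Did he mind?']
```

Python's re module requires fixed-width lookbehinds; the extended pattern still qualifies because it always spans exactly five characters.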
Another enhancement can be using a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
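As a quick check that the two forms behave the same, here is a small sketch on an invented sentence (the text and names are assumptions for illustration only):

```python
import re

# Hypothetical sample text with both a question mark and dots.
text = "Did he mind? He did not. Fine."

# Alternation form and character-class form of the same lookbehind.
with_alternation = re.split(r'(?<=\.|\?)\s', text)
with_char_class = re.split(r'(?<=[.?])\s', text)

print(with_alternation)  # ['Did he mind?', 'He did not.', 'Fine.']
print(with_char_class)   # ['Did he mind?', 'He did not.', 'Fine.']
```

Inside a character class the dot and question mark are literal, so no escaping is needed there.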