python - Removing chars/signs from string
I'm preparing text for a word cloud and I'm stuck.
I need to remove digits and signs such as . , - ? = / ! @ etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?
Here is my concept of what I have to do:
- concatenate the texts into one string
- set all chars to lowercase <--- I'm here
- now I want to delete specific signs and divide the text into words (a list)
- calculate the frequency of the words
- next, write the stopwords script...
abstracts_list = open('new', 'r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
lower = allab.lower()
Example text:
micrornas (mirnas) class of noncoding rna molecules approximately 19 25 nucleotides in length downregulate expression of target genes @ post-transcriptional level binding 3'-untranslated region (3'-utr). epstein-barr virus (ebv) generates @ least 44 mirnas, functions of of these mirnas have not yet been identified. previously, reported bruce target of mir-bart15-3p, mirna produced ebv, our data suggested there might other apoptosis-associated target genes of mir-bart15-3p. thus, in study, searched new target genes of mir-bart15-3p using in silico analyses. found possible seed match site in 3'-utr of tax1-binding protein 1 (tax1bp1). luciferase activity of reporter vector including 3'-utr of tax1bp1 decreased mir-bart15-3p. mir-bart15-3p downregulated expression of tax1bp1 mrna , protein in ags cells, while inhibitor against mir-bart15-3p upregulated expression of tax1bp1 mrna , protein in ags-ebv cells. mir-bart15-3p modulated nf-κb activity in gastric cancer cell lines. moreover, mir-bart15-3p promoted chemosensitivity 5-fluorouracil (5-fu). our results suggest mir-bart15-3p targets anti-apoptotic tax1bp1 gene in cancer cells, causing increased apoptosis , chemosensitivity 5-fu.
To convert the upper-case characters to lower case, store the text in a string variable, for example string, and then call its lower method (no regular expression is needed for this step):
string = string.lower()
Now the string is free of capital letters.
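To remove the special characters, you can use the sub function of the re module (this needs import re):
string = re.sub('[^A-Za-z0-9-_*.]', ' ', string)
After this command every character other than letters, digits and - _ * . has been replaced by a space; remove characters from the set inside the brackets if they should be dropped as well.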
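Since the question also asks to drop digits and signs such as . and -, a stricter variant may be closer to what is wanted. The following is only a sketch of one possible approach: it uses str.translate with a table built from string.punctuation and string.digits, and replaces each of those characters with a space so that hyphenated tokens split into separate words. The function name strip_signs_and_digits and the sample sentence are made up for illustration.

```python
import string

# Map every ASCII punctuation character and every digit to a space.
# Non-ASCII characters (for example the Greek letter in "nf-κb") are left alone.
_TABLE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})

def strip_signs_and_digits(text):
    """Lowercase text and replace ASCII punctuation and digits with spaces."""
    return text.lower().translate(_TABLE)

# Illustrative input, not taken from the real file.
sample = "miR-BART15-3p targets TAX1BP1 (3'-UTR)!"
cleaned = strip_signs_and_digits(sample)
words = cleaned.split()  # split() collapses the runs of spaces left behind
```

This does the whole clean-up in one pass over the string, so there is no need to chain several replace calls.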
To determine the word frequencies, import Counter from the collections module (from collections import Counter).
Then use the following command to count how often each word occurs, most frequent first:
Counter(string.split()).most_common()
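Putting the steps together, the remaining items from the question (cleaning, splitting, counting, stopwords) could look roughly like the sketch below. It is an illustration under assumptions rather than a finished script: the function name word_frequencies and the tiny STOPWORDS set are hypothetical, and the pattern keeps only ASCII letters, so a character such as κ would be dropped too.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; a real word cloud needs a fuller list.
STOPWORDS = {"of", "the", "in", "and", "a", "to"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count the words in text after lowercasing and removing non-letters."""
    # Replace everything that is not an ASCII letter or whitespace with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Split on whitespace and skip the stopwords.
    words = [w for w in cleaned.split() if w not in stopwords]
    return Counter(words)

# Illustrative input, not taken from the real file.
freq = word_frequencies("The expression of target genes, and the target of miR-BART15-3p.")
top = freq.most_common(2)  # list of (word, count) pairs, highest count first
```

Filtering the stopwords before counting keeps them out of the Counter entirely, which is usually what a word cloud needs.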