python - Removing chars/signs from string


I'm preparing text for a word cloud, but I'm stuck.

I need to remove all digits and all signs like . , - ? = / ! @ etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?

Here is my concept and what I have to do:

  • concatenate the texts into one string
  • set the chars to lowercase <--- I'm here
  • now I want to delete specific signs and divide the text into words (list)
  • calculate the frequency of words
  • next do the stopwords script...

abstracts_list = open('new', 'r')
abstracts = []
allab = ''
# collect every line of the file
for ab in abstracts_list:
    abstracts.append(ab)
# concatenate all lines into one string
for ab in abstracts:
    allab += ab
lower = allab.lower()

Text example:

micrornas (mirnas) class of noncoding rna molecules approximately 19 25 nucleotides in length downregulate expression of target genes @ post-transcriptional level binding 3'-untranslated region (3'-utr). epstein-barr virus (ebv) generates @ least 44 mirnas, functions of of these mirnas have not yet been identified. previously, reported bruce target of mir-bart15-3p, mirna produced ebv, our data suggested there might other apoptosis-associated target genes of mir-bart15-3p. thus, in study, searched new target genes of mir-bart15-3p using in silico analyses. found possible seed match site in 3'-utr of tax1-binding protein 1 (tax1bp1). luciferase activity of reporter vector including 3'-utr of tax1bp1 decreased mir-bart15-3p. mir-bart15-3p downregulated expression of tax1bp1 mrna , protein in ags cells, while inhibitor against mir-bart15-3p upregulated expression of tax1bp1 mrna , protein in ags-ebv cells. mir-bart15-3p modulated nf-κb activity in gastric cancer cell lines. moreover, mir-bart15-3p promoted chemosensitivity 5-fluorouracil (5-fu). our results suggest mir-bart15-3p targets anti-apoptotic tax1bp1 gene in cancer cells, causing increased apoptosis , chemosensitivity 5-fu.

To convert the uppercase characters to lowercase, do the following: store the text in a string variable, for example string, and then use the command

string = string.lower()

Now the string is free of capital letters.

To remove the special characters, you can use the sub function from the re module:

import re
string = re.sub('[^A-Za-z0-9-_*.]', ' ', string)

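With this command, every special character in the string is replaced by a space, except for -, _, * and ., which the pattern keeps (digits are kept as well).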
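
The pattern above keeps digits, hyphens, underscores, asterisks and periods. If those should go as well, as the question asks, a stricter character class might look like the sketch below. The function name strip_digits_and_signs and the sample sentence are made up for illustration. Note that this variant also drops non-ASCII letters (for example the κ in nf-κb) and splits hyphenated names such as mir-bart15-3p into fragments, which may or may not be what you want for a word cloud.

```python
import re

def strip_digits_and_signs(text):
    """Lowercase the text and replace everything except a-z and whitespace with a space."""
    return re.sub(r'[^a-z\s]', ' ', text.lower())

# Made-up sample sentence, loosely based on the abstract in the question
sample = "EBV generates at least 44 miRNAs (3'-UTR), e.g. miR-BART15-3p!"
cleaned = strip_digits_and_signs(sample)

# split() without arguments collapses the runs of spaces left behind
words = cleaned.split()
print(words)
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together, so split() still finds the word boundaries.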

To determine the word frequency, you can use the collections module, from which you have to import Counter.

Then use the following command to determine how often each word occurs:

from collections import Counter
Counter(string.split()).most_common()
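
Putting the steps together, including the stopword step mentioned in the question, could look roughly like this. It is only a sketch: the STOPWORDS set is a tiny hypothetical list for illustration (a real word cloud would need a much fuller one), word_frequencies is a made-up helper name, and re.findall is used here in place of sub plus split to pull out runs of letters directly.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; replace with a real list
STOPWORDS = {"the", "of", "in", "a", "and", "to"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, extract runs of a-z as words, and count the non-stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Made-up sample sentence, loosely based on the abstract in the question
freq = word_frequencies(
    "The expression of TAX1BP1 mRNA and protein in AGS cells, and in AGS-EBV cells."
)

# most_common(n) returns the n most frequent (word, count) pairs
print(freq.most_common(3))
```

The resulting Counter behaves like a dictionary of word counts, so it can be inspected directly or passed on to whatever draws the word cloud.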

