Artificial Intelligence Programming Lab (AIPLab) Forum

Topic: [Tool] RBPmap - Motifs Analysis and Prediction of RNA Binding Proteins


RBPmap Overview
RBPmap is a computational tool that enables accurate prediction and mapping of RNA binding protein (RBP) binding sites on any RNA sequence or list of sequences of interest provided by the user (as either sequences or genomic coordinates). RBPmap was developed specifically for mapping RBPs in the human, mouse and Drosophila melanogaster genomes, though it also supports mapping RBP binding sites in other organisms.
RBPmap lets users select motifs from a database of 114 human/mouse and 51 Drosophila melanogaster experimentally defined motifs, extracted from the literature as either a consensus motif or a Position Specific Scoring Matrix (PSSM). In addition, the user can provide any motif of interest, given as either a consensus (over IUPAC symbols) or a PSSM in MEME format.
The algorithm for mapping the motifs on the RNA sequences is based on the Weighted-Rank (WR) approach, previously exploited in the SFmap web-server for mapping splicing factor binding sites. The mapping algorithm considers the clustering propensity of the motif and the overall tendency of regulatory regions to be conserved (Akerman et al., Genome Biology 2009).
A detailed description of the RBPmap algorithm:
Processing the query sequence (for human, mouse and Drosophila genomes)
The mandatory input parameters for RBPmap are a query sequence and at least one motif of interest to be mapped to the sequence. The query sequence can be provided as a sequence in FASTA format or as genomic coordinates. If the query sequence is provided in FASTA format, RBPmap uses the BLAT utility to map the sequence to the chosen genome and retrieve the genomic coordinates. The sequence is then expanded by 25 nt upstream and downstream to include the sequence environment in the WR calculation (see below).
Further, the sequence is mapped to the genome and assigned to one of five different genomic regions: intronic regions flanking the splice sites (80 nt long), internal exons, exons in 5’ and 3’ UTR regions, non-coding RNA and mid-intron/intergenic regions. The category of the sequence is then used to choose the region-specific background model (see below).
Calculating a match score for the motif
Given a query sequence and a motif (defined as either a consensus sequence or a PSSM, selected from the RBPmap database or provided by the user), a match score for the motif is calculated for each k-mer of the motif size in the query sequence, in overlapping windows. The match score S_consensus, for a motif provided as a consensus sequence, is defined as follows:

Where L is the motif length and H is the Hamming distance between each k-mer and the motif.
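The consensus match score can be sketched in a few lines of Python. This is only an illustration, not RBPmap's own code: it assumes the score takes the form (L - H) / L, which is the simplest expression in L and H that ranges from 0 to 1 and grows as the Hamming distance shrinks. The function names and the IUPAC lookup table are hypothetical helpers introduced for this example.

```python
# IUPAC nucleotide symbols mapped to the RNA bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def consensus_match_score(kmer, motif):
    """Score one k-mer against a consensus motif.

    ASSUMPTION: score = (L - H) / L, where L is the motif length and
    H is the Hamming distance (positions where the base is not
    allowed by the motif's IUPAC symbol).
    """
    if len(kmer) != len(motif):
        raise ValueError("k-mer and motif must have the same length")
    length = len(motif)
    bases = kmer.upper().replace("T", "U")  # treat DNA input as RNA
    hamming = sum(
        1 for base, symbol in zip(bases, motif.upper())
        if base not in IUPAC[symbol]
    )
    return (length - hamming) / length


def scan_consensus(sequence, motif):
    """Score every overlapping k-mer; returns (position, score) pairs."""
    k = len(motif)
    return [
        (i, consensus_match_score(sequence[i:i + k], motif))
        for i in range(len(sequence) - k + 1)
    ]
```

For example, scan_consensus("AUGCAUGA", "UGCAUG") yields one score per overlapping window, with the exact match at position 1 scoring 1.0.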
The match score S_pssm, for a PSSM (Position Specific Scoring Matrix), is defined as follows:

Where L is the motif length, N_i is the specific nucleotide in position i and f(N_i) is the frequency of the nucleotide as defined in the PSSM.
The values of both match scores range between 0 and 1, increasing as the distance between the motif and the k-mer decreases.
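A PSSM-based score could be computed along the following lines. The exact formula is not reproduced in this post, so the sketch assumes one plausible form: the mean of the per-position frequencies, (1/L) * sum of f(N_i), which stays between 0 and 1 as described. The PSSM layout (a list of per-position dictionaries) and the toy matrix are assumptions made for illustration and do not come from the RBPmap database.

```python
def pssm_match_score(kmer, pssm):
    """Score one k-mer against a PSSM.

    ASSUMPTION: score = (1 / L) * sum(f(N_i)), the mean frequency of
    the observed nucleotide over all L positions. RBPmap's actual
    formula may combine the frequencies differently.

    pssm is a list with one dict per motif position, mapping each
    base (A, C, G, U) to its frequency at that position.
    """
    if len(kmer) != len(pssm):
        raise ValueError("k-mer and PSSM must have the same length")
    bases = kmer.upper().replace("T", "U")
    total = sum(column.get(base, 0.0) for base, column in zip(bases, pssm))
    return total / len(pssm)


# Toy two-position matrix with made-up values, not an RBPmap motif.
TOY_PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.0, "C": 0.0, "G": 1.0, "U": 0.0},
]

best = pssm_match_score("AG", TOY_PSSM)   # matches the preferred bases
worst = pssm_match_score("CA", TOY_PSSM)  # matches disfavoured bases
```

Under this assumed form, a k-mer made of the most frequent base at every position gets the highest score, mirroring the behaviour of the consensus score.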
Comparing the match scores to a background model
To define a significant match, the match scores of all the sites in the query sequence are compared to the mean match score for the motif, calculated for a background of randomly chosen regulatory regions (composed of exonic and intronic regions around splice sites and exons in UTR regions). Z-scores are calculated and coupled to a P-value, which represents the probability of obtaining a specific Z-score under a one-tailed normal distribution. The sites are filtered according to two thresholds (set by the user as the stringency level parameter): a significant threshold (default P-value<0.005) and a suboptimal threshold (default P-value<0.01). The significant threshold is used to define the putative binding site. The suboptimal threshold, which is less stringent, selects the sites that are clustered around the putative binding site and will be considered in the Weighted Rank (WR) score (see below).
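The Z-score and one-tailed P-value step can be illustrated with the Python standard library. This is a sketch under stated assumptions: the background mean and standard deviation are taken as precomputed inputs (the text mentions only the mean, but a Z-score also needs a spread), and the function names are hypothetical. The default thresholds are the ones quoted above.

```python
import math


def one_tailed_p(score, bg_mean, bg_sd):
    """Return (z, p) for a score against a background distribution.

    p is the upper-tail probability of a standard normal variable
    exceeding z, computed as 0.5 * erfc(z / sqrt(2)).
    ASSUMPTION: bg_mean and bg_sd were estimated beforehand from the
    background regions.
    """
    z = (score - bg_mean) / bg_sd
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


def classify_site(score, bg_mean, bg_sd, significant=0.005, suboptimal=0.01):
    """Label a site 'significant', 'suboptimal', or None (filtered out)."""
    _, p = one_tailed_p(score, bg_mean, bg_sd)
    if p < significant:
        return "significant"
    if p < suboptimal:
        return "suboptimal"
    return None
```

With a background mean of 0.5 and standard deviation of 0.1, a match score of 0.8 (Z of about 3) passes the significant threshold, 0.74 (Z of about 2.4) passes only the suboptimal one, and 0.7 (Z of about 2) is filtered out.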
Calculating a Weighted Rank (WR) score for windows around each putative binding site
To calculate a multiplicity score, which reflects the propensity of suboptimal motifs to cluster around the significant motif, a Weighted Rank (WR) function is employed. The WR score is calculated for each candidate significant site by summing all suboptimal match scores within a window of 50 nt around the site (25 nt on each side), weighted by their match to the motif of interest (the significant site is ranked first). The WR score S_WR is defined as follows:

Where rank_max is the number of suboptimal sites within the 50 nt window and S_rank is the match score of each ranked suboptimal site.
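A rough sketch of the WR calculation follows. The formula itself is not reproduced in this post, so the code assumes a common weighted-rank form in which each score is divided by its rank: S_WR = sum over rank of S_rank / rank, with the significant site at rank 1 and the neighbouring suboptimal sites ordered by decreasing match score. The function name and the (position, score) input layout are hypothetical.

```python
def weighted_rank_score(site_pos, site_score, suboptimal_sites, half_window=25):
    """Weighted Rank score for one candidate significant site.

    ASSUMPTION: S_WR = sum(S_rank / rank). The significant site takes
    rank 1; suboptimal sites within half_window nucleotides on either
    side follow, ordered by decreasing match score.

    suboptimal_sites is a list of (position, match_score) pairs that
    passed the suboptimal threshold.
    """
    neighbours = sorted(
        (score for pos, score in suboptimal_sites
         if pos != site_pos and abs(pos - site_pos) <= half_window),
        reverse=True,
    )
    ranked = [site_score] + neighbours
    return sum(score / rank for rank, score in enumerate(ranked, start=1))
```

For a significant site at position 100 with score 1.0 and suboptimal neighbours at positions 90 (0.8) and 120 (0.6), this gives 1.0 + 0.8/2 + 0.6/3 = 1.6; a site at position 200 lies outside the window and does not contribute. An isolated site simply keeps its own match score, so clustered sites score higher than lone ones.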
Comparing the WR scores to a region-specific background model
To reduce false positive predictions, the final WR scores are compared to a background model, which is calculated independently for the five genomic regions (see above). The WR score of each putative binding site is compared to the mean WR score of its pre-defined genomic region. Z-scores are calculated and coupled to a P-value, which represents the probability of obtaining a specific Z-score under a one-tailed normal distribution. The sites are reported as predicted binding sites if their P-value<0.05. The Z-score and P-value of the predicted binding sites are reported in the output of RBPmap.
Conservation-based filtering
The conservation-based filtering is optional and can be applied only to binding sites that are mapped to intronic/intergenic regions. It is based on the tendency of regulatory regions to be evolutionarily conserved. These sites are removed from the final results if the mean conservation score calculated for their window is lower than the mean conservation score calculated for intronic regulatory regions. For sequences from human and mouse, the conservation information is retrieved from the UCSC phyloP conservation table (Siepel et al., Genome Res. 2005), based on the conservation of all placental mammals. For Drosophila sequences, the phastCons insect conservation table is used (Siepel et al., Genome Res. 2005). Conservation filtering can be applied only to input sequences from human, mouse or Drosophila.
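The filtering rule reduces to comparing two means, as in the minimal sketch below. It assumes the per-nucleotide conservation scores for the site's window have already been retrieved (for example from a phyloP or phastCons table) and that the reference mean for intronic regulatory regions is supplied by the caller. Both inputs and the function name are assumptions for illustration.

```python
def passes_conservation_filter(window_scores, intronic_regulatory_mean):
    """Keep a site unless its window is less conserved than the reference.

    window_scores: per-nucleotide conservation scores for the window
    around the binding site (ASSUMED to be fetched beforehand).
    intronic_regulatory_mean: mean conservation score of intronic
    regulatory regions (ASSUMED to be precomputed).
    """
    if not window_scores:
        raise ValueError("window_scores must not be empty")
    window_mean = sum(window_scores) / len(window_scores)
    # The site is removed only when its mean is strictly lower.
    return window_mean >= intronic_regulatory_mean
```

A site whose window averages 0.5 is kept against a reference mean of 0.4 and dropped against a reference mean of 0.6.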
RBPmap calculation for genomes other than human, mouse or Drosophila
If the query sequence comes from an organism other than human, mouse or Drosophila melanogaster, RBPmap cannot use any background genomic information. Thus, all the above steps are performed, except for the comparison to the genome-specific background model and the conservation-based filtering. Nevertheless, to reduce false-positive predictions, the WR scores are compared to a theoretical threshold, calculated for each motif based on the motif length and complexity (Paz et al., Nucleic Acids Res. 2010). In such cases, the reported output of RBPmap includes the final WR scores (‘Score’) and the theoretical threshold for each motif (‘Cutoff’).
