RNA-Protein Interactions

    This page is under construction!

     

    This page describes the CRF code (in the CVS repository RPI) that predicts protein-binding regions in RNA.

     

    File Format

    For training data, the RNA sequence, structure, and label information are provided in the following format:

    >Seq_id : bases

    AUCGAUCGUAGCUA

    >Seq_id : str

    ABCDACABAHCBAH

    >Seq_id : label

    SNAXNANANXNAN

    Here the bases come from the set {A, U, C, G}; the structure codes come from the set {A= Watson-Crick, B=, C=, D=, E=, F=, G=, H= unpaired}; and the labels come from the set {S= specific binding, N= nonspecific binding, A= van der Waals interaction, X= no binding}.

    Please see any .seqstrlbl file in RPI/data/biological for examples.

    For test sequences, the label string can be anything; the code ignores it.
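
    As an illustration of the format described above, here is a minimal Python sketch of a reader for .seqstrlbl-style text. It is not part of the RPI repository; the function name parse_seqstrlbl and the record id r1 are hypothetical, and it assumes each header has the form ">Seq_id : kind" with kind being bases, str, or label.

```python
def parse_seqstrlbl(text):
    """Parse .seqstrlbl-style text into {seq_id: {"bases": ..., "str": ..., "label": ...}}."""
    records = {}
    seq_id, kind = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between entries
        if line.startswith(">"):
            # Header such as ">Seq_id : bases"; split on the last colon.
            seq_id, _, kind = line[1:].rpartition(":")
            seq_id, kind = seq_id.strip(), kind.strip()
            if not seq_id or kind not in ("bases", "str", "label"):
                raise ValueError(f"unrecognized header: {line!r}")
            records.setdefault(seq_id, {})
        else:
            if seq_id is None:
                raise ValueError("data line found before any header")
            # Append, so a string wrapped over several lines is joined.
            records[seq_id][kind] = records[seq_id].get(kind, "") + line
    return records


def lengths_agree(record):
    """True if the bases, str and label strings all have the same length."""
    return len({len(record[k]) for k in ("bases", "str", "label")}) == 1


example = ">r1 : bases\nAUCG\n>r1 : str\nAHHA\n>r1 : label\nSNAX\n"
parsed = parse_seqstrlbl(example)
print(parsed["r1"], lengths_agree(parsed["r1"]))
```

    The length check is an extra sanity test on training data, under the assumption that each position carries one base, one structure code, and one label.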

     

    Data

    Location: RPI/data/

    Synthetic 

    contains synthetic RNA data with simplistic and realistic sequence-structural features. The simple artificial data is generated by introducing four sequence-structural features for each of the four labels, with high (0.9) and low (0.7) probabilities. The realistic synthetic data is generated by following the first-order Markov chain probabilities observed in the biological data.

    The scripts that generate these datasets, and the datasets themselves, are at RPI/data/artificial.
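
    To illustrate what following first-order Markov chain probabilities means, the Python sketch below estimates start and transition probabilities from a few strings and then samples a new string from them. It is not the generation script in RPI/data/artificial. The function names and the toy input strings are hypothetical, and it assumes the chain runs over a single alphabet (here the label alphabet), which the page does not specify.

```python
import random
from collections import Counter, defaultdict


def estimate_first_order(strings):
    """Estimate start and transition probabilities from observed strings."""
    starts = Counter(s[0] for s in strings if s)
    pairs = defaultdict(Counter)
    for s in strings:
        for prev, cur in zip(s, s[1:]):
            pairs[prev][cur] += 1
    total = sum(starts.values())
    initial = {sym: n / total for sym, n in starts.items()}
    transitions = {
        prev: {cur: n / sum(row.values()) for cur, n in row.items()}
        for prev, row in pairs.items()
    }
    return initial, transitions


def sample_first_order(initial, transitions, length, rng):
    """Sample a string in which each symbol depends only on the previous one."""
    symbols = list(initial)
    state = rng.choices(symbols, weights=[initial[s] for s in symbols])[0]
    out = [state]
    for _ in range(length - 1):
        # Fall back to the start distribution if a symbol was only seen last.
        row = transitions.get(state) or initial
        nxt = list(row)
        state = rng.choices(nxt, weights=[row[s] for s in nxt])[0]
        out.append(state)
    return "".join(out)


# Toy label strings, made up for illustration only.
initial, transitions = estimate_first_order(["SNNX", "SNXX"])
print(sample_first_order(initial, transitions, 10, random.Random(0)))
```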

    Biological

    contains RNA data in .seqstrlbl format from RNA-protein complexes in the PDB, as outlined in the JMB paper (Role of RNA Sequence and Structure in RNA-Protein Interactions, Gupta and Gribskov, 2010).

    The ribosomal RNA sequences are also fragmented methodically to allow cross-validation of the prediction tools. The details for fragmenting rRNA sequences are in RPI/data/biological/README. The biological data and the script to fragment rRNA sequences are at RPI/data/biological.

     

    CRF Code

    The CRF code for predicting protein-binding regions in RNA is at RPI/code/crf/. Three Perl package files are required to run the tool: crf.pm, data.pm, and optimize.pm.

    crf.pm calls data.pm to get the set of sequence-structural features present in the RNA data, calls optimize.pm to run gradient ascent optimization with L1 regularization to learn weights for the features selected by data.pm, and makes predictions using its own subroutines.

    Please refer to documentation within each .pm file to get more information about individual subroutines.
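
    For readers unfamiliar with gradient ascent under an L1 penalty, the following Python sketch shows the general idea on a toy concave objective. It is not a translation of optimize.pm. It uses a soft-thresholding (proximal) update, which is one common way to handle the L1 term and may differ from what optimize.pm does. The toy objective stands in for the CRF log likelihood, and all names and values are hypothetical.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t, returning exactly 0.0 when |x| <= t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def l1_gradient_ascent(grad_fn, objective_fn, w0, step, penalty, max_iter):
    """Maximize objective_fn(w) - penalty * sum(|w_i|) by proximal gradient ascent."""
    w = list(w0)
    history = []  # (objective, objective minus L1 penalty) per iteration
    for _ in range(max_iter):
        g = grad_fn(w)
        # Ascent step on the smooth part, then shrink each weight toward zero.
        w = [soft_threshold(wi + step * gi, step * penalty) for wi, gi in zip(w, g)]
        obj = objective_fn(w)
        history.append((obj, obj - penalty * sum(abs(wi) for wi in w)))
    return w, history


# Toy concave objective -0.5 * sum((w - target)^2), a stand-in for a log likelihood.
target = [1.0, 0.05, -0.5]
objective = lambda w: -0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))
gradient = lambda w: [ti - wi for wi, ti in zip(w, target)]

weights, history = l1_gradient_ascent(gradient, objective, [0.0] * 3,
                                      step=0.5, penalty=0.1, max_iter=100)
print(weights)
```

    In this toy run the second weight stays at exactly zero, which is the sparsity effect an L1 penalty is typically chosen for. The step, penalty, and max_iter arguments play the roles that the -s, -r, and -i flags of run_crf.pl play on this page.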

     

    Scripts to run crf

    Location: RPI/scripts/

    run_crf.pl: identifies features in the training data and learns weights for those features. It reads the following flags:

    -f: training data file in .seqstrlbl format

    -s: step size for optimization algorithm

    -r: penalty for L1-regularization

    -i: maximum number of iterations for optimization algorithm

     

    run_crf.pl prints the feature frequencies in the dataset, the number of features for each of the four labels, and the feature weights at each optimization iteration. Every 20 iterations, it also computes a confusion matrix and class-wise performance measures (precision, recall, F1-measure) by making predictions on the training data with the feature weights at that iteration.
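
    The class-wise measures mentioned above can be sketched in Python as follows. This is a generic illustration of a confusion matrix with per-class precision, recall, and F1 over the four labels, not the code in crf.pm; the function names and the toy true/predicted strings are hypothetical.

```python
LABELS = "SNAX"


def confusion_matrix(true, pred, labels=LABELS):
    """Return m[true_label][predicted_label] counts for two equal-length strings."""
    if len(true) != len(pred):
        raise ValueError("true and predicted label strings differ in length")
    m = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(true, pred):
        m[t][p] += 1
    return m


def class_metrics(m, labels=LABELS):
    """Return {label: (precision, recall, f1)}; undefined ratios are reported as 0.0."""
    out = {}
    for c in labels:
        tp = m[c][c]
        predicted = sum(m[t][c] for t in labels)  # column sum
        actual = sum(m[c].values())               # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = (precision, recall, f1)
    return out


matrix = confusion_matrix("SSNNAX", "SNNNAA")
for label, (p, r, f1) in class_metrics(matrix).items():
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```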

    batch_crf.pl runs run_crf.pl on multiple training datasets in parallel.
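
    As a rough illustration of how the run_crf.pl flags fit together, the Python sketch below builds one command line per training file. The flags -f, -s, -r, and -i are those listed on this page. Invoking the script through perl, the space-separated flag values, the numeric values, and the file names fold1.seqstrlbl and fold2.seqstrlbl are assumptions for illustration. The sketch only prints the commands and does not reproduce batch_crf.pl.

```python
import shlex


def run_crf_command(train_file, step, penalty, max_iter,
                    script="RPI/scripts/run_crf.pl"):
    """Build an argument list for run_crf.pl from the flags documented on this page."""
    return [
        "perl", script,
        "-f", train_file,     # training data in .seqstrlbl format
        "-s", str(step),      # step size for the optimization algorithm
        "-r", str(penalty),   # penalty for L1 regularization
        "-i", str(max_iter),  # maximum number of optimization iterations
    ]


# Hypothetical training files and parameter values.
for train_file in ["fold1.seqstrlbl", "fold2.seqstrlbl"]:
    print(shlex.join(run_crf_command(train_file, 0.01, 0.1, 200)))
```

    Each argument list could be handed to subprocess.run to execute it, with the output redirected to a file for the post-processing scripts described on this page.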

     

    getLL.pl: reads the run_crf.pl output and prints the log-likelihood values (with and without the regularization term) for each iteration. batch_getLL.pl runs this script in parallel on several run_crf.pl output files.

     

    getWtUpdates.pl: reads the run_crf.pl output and prints the weight of each feature at every optimization iteration.

     

    printCrfWts.pl: prints the feature weights at a specific iteration from the file containing run_crf.pl output, and saves these weights in a format that can be used by the prediction subroutines.

     

    predictOnTestdata.pl: reads a weight file (written by printCrfWts.pl) and a test data file in .seqstrlbl format, and outputs the prediction performance on the test file.

     

    predAllIter.pl: reads an RNA data file and a run_crf.pl output file, and prints the prediction performance on the RNA data using the feature weights at each iteration in the run_crf.pl output.
