The Genome Access Course

Gene Prediction I



  1. Intro

  2. ORFs and Gene Prediction

  3. Ab initio Gene Prediction


    Exercise 1 - Use of ORFs for prokaryotic gene prediction

    • (Roughly) annotate this non-eukaryotic genome and determine the nature of the organism using NCBI's "Open Reading Frame Finder":


      • Go to the NCBI ORF Finder at http://www.ncbi.nih.gov/gorf/gorf.html
      • Paste the sequence into the window.
      • Select 'OrfFind'.
      • On the result page change sensitivity from 100 to 300. Select 'Redraw'.
      • Locate the exact start and end points of an ORF by clicking it. Record the nucleotide positions for each ORF.
      • Identify the nature of the gene for any ORF by clicking on it. Then find and select 'Blast' above the ORF display.
      • Wait for your search results to be displayed. If this takes too long, write down the 'Request ID', start a Blast search for another ORF, and return later to retrieve your results.
      • View the Blast results for several ORFs and determine what the genes are and the nature of the organism.

    • Does this map confirm your suspicion? Which genes do the ORFs you found before represent? Which genes did you miss?

    • Re-run the NCBI ORF Finder, but this time set the sensitivity on the output page to 100 and then to 50. Select 'Redraw' each time and try to detect ORFs for the genes that are on the map but that you did not see before.
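
The effect of the sensitivity setting can be reproduced offline with a minimal sketch. It assumes that the setting acts as a minimum ORF length in nucleotides and that an ORF runs from an ATG to the next in-frame stop codon; the real ORF Finder may use other start codons and options. The function name `find_orfs` and its `min_len` parameter are hypothetical, chosen for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_len=300):
    """Return ATG-to-stop ORFs as (strand, frame, start, end).

    Coordinates are 1-based and refer to the strand that was scanned,
    so '-' strand positions count from the start of the reverse complement.
    min_len is an assumed stand-in for the ORF Finder sensitivity setting.
    """
    seq = seq.upper()
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    orfs = []
    for strand, s in strands:
        for frame in range(3):
            start = None  # index of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame + 1, start + 1, i + 3))
                    start = None
    return orfs
```

Calling `find_orfs(genome, 300)`, then `find_orfs(genome, 100)` and `find_orfs(genome, 50)` shows how lowering the threshold reveals shorter ORFs, at the cost of more spurious ones.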



    Exercise 2 - Ab initio eukaryotic gene prediction

      Identify the gene in this human nucleotide sequence. Record your results on this worksheet, in a word processor, and in the form of screen shots. On most PCs, take a screen shot by pressing the Print Screen key and pasting the image into Photoshop or PowerPoint (use Ctrl+V to paste). On Macs, apple+shift+3 captures the entire screen and apple+shift+4 a selected area; the images are saved to the hard drive. To copy a Mac screen shot to the clipboard instead, hold ctrl while taking it.

      In order to predict genes in a DNA sequence follow these steps:


      1. Narrow down the gene region and direction by determining the strand that encodes it ('Watson' strand vs. 'Crick' strand; the 'Crick' strand is the reverse complement of the 'Watson' strand), as well as the start and the end of the gene.
      2. Predict exon/intron borders.
      3. Use these data to build all possible models of the gene in the DNA sequence.
      4. To determine which of the models most likely describes the putative gene, identify those parts of the sequence that contain the coding sequences of the gene's exons. For each exon, identify which reading frames are "open". Determine putative start and stop codons for the initial and terminal exon (note: sometimes the translational start site of a gene is located not in the first but in a subsequent exon). Adjust exon/intron borders to yield one contiguous CDS over the run of the gene that agrees with all data.

      Here is how to do it:


      • Determine the start and the end of the prospective gene:

        • To determine the beginning of the gene, run the sequence through a Transcriptional Start Site Finder (here: http://argon.cshl.org/genefinder/CPROMOTER/index.htm) and determine possible start regions. Save your results in your notepad file or as screen shots. (Should you experience problems with the program, view the results here .)

        • To determine the end of the gene, run the sequence through this Poly A Signal Predictor (here: http://argon.cshl.org/tabaska/polyadq_form.html). Save your results in your notepad file or as screen shots. (Should you experience problems with the program, view the results here .)

        • In your worksheet mark the promoter region, and the prospective translational start and polyA-signal sites. Which direction is the gene most likely transcribed in?

        • For the sake of time you have not made any predictions for the reverse (Crick) strand. If you wish to do so, build the reverse complement of the sequence here (http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm#permutator) and run the promoter and polyA-signal prediction programs again.
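
Building the reverse complement, and relating a position predicted on it back to the original sequence, can be sketched as follows. This is an illustration only, not the method used by the linked tool; the function names are hypothetical and positions are assumed to be 1-based.

```python
def reverse_complement(seq):
    """Return the reverse complement ('Crick' strand) of a DNA sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def to_forward_coord(pos_on_reverse, seq_len):
    """Map a 1-based position on the reverse complement to the forward strand."""
    return seq_len - pos_on_reverse + 1
```

For example, a polyA signal reported at position 120 of the reverse complement of a 1000-nucleotide sequence lies at `to_forward_coord(120, 1000)`, that is position 881, of the original sequence.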


      • Discern exons and introns by identifying splice sites (i.e. the borders between exons and introns):


        • Run the sequence through this splice site prediction program at UC Berkeley (http://www.fruitfly.org/seq_tools/splice.html), changing the donor score cutoff to 0.88 and the acceptor score cutoff to 0.94 (the donor site lies at the beginning of an intron, the acceptor site at its end). (Should you experience problems with the program, view the results here.)

        • What does the output page tell you about the prediction mechanism?

        • Use the output to determine the splice site positions and record the results in the worksheet.

        • Build Models: From the total of six splice site predictions build a number of alternative maps that show how exons and introns could follow each other in this gene.
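
The model-building step can also be done systematically. The sketch below enumerates every ordered, non-overlapping combination of introns that a set of predicted sites allows. It assumes that a donor position marks the first nucleotide of an intron and an acceptor position the last; the function names and the example positions are hypothetical and are not the predictions for the exercise sequence.

```python
def intron_chains(donors, acceptors, min_intron=1):
    """Enumerate all ordered, non-overlapping sets of (donor, acceptor) introns.

    The empty tuple (a model without introns) is always included.
    """
    introns = sorted(
        (d, a) for d in donors for a in acceptors if a - d + 1 >= min_intron
    )
    models = []

    def extend(chain, last_end):
        models.append(tuple(chain))
        for d, a in introns:
            if d > last_end:  # next intron must start after the previous one
                extend(chain + [(d, a)], a)

    extend([], 0)
    return models


def exons_from_introns(gene_start, gene_end, chain):
    """Turn one intron chain into a list of (start, end) exon intervals."""
    exons, start = [], gene_start
    for d, a in chain:
        exons.append((start, d - 1))
        start = a + 1
    exons.append((start, gene_end))
    return exons
```

With two donors and two acceptors, for instance `intron_chains([100, 300], [200, 400])`, this yields five models, from the intron-free gene to the two-intron gene.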


      • Determine which of the predicted splice sites truly border exons and introns:

        Relate the predicted splice sites to start and stop codons: donor sites (where exons end) should be preceded by ORFs and acceptor sites (where exons begin) should be followed by ORFs. Work in the following order: first identify the exact location of the internal exon, then the final exon, and finally the first exon.

        • Internal Exon: Identify the positions of start and stop codons by running the sequence through this tool (http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm#translator). At the bottom of the output page is a table listing the codon positions. Use the positions listed for the stop codons in this sequence to identify a putative internal exon that is flanked by predicted Donor and Acceptor Sites and is not interrupted by a stop codon. (Should you experience problems with the program, view the position of stop codons here.)

          • At what nucleotide position does the internal exon start? Where does it end?
          • Which of the three reading frames RF1, RF2, RF3 (counting from nucleotide 1 of the unknown sequence) is open for this putative exon?
          • At what nucleotide position exactly does the coding sequence (CDS) within this ORF begin?
          • So, how many nucleotides are not matched to a full triplet at the beginning/end of this putative exon?


        • Last Exon: Check your model above to identify where the CDS ends. Determine the appropriate stop codon from the table above by checking which reading frame provides a whole number of triplets between the exon start and the stop codon, taking into account the number of nucleotides needed to complete the partial triplet at the end of the internal exon. What is the position of the stop codon? In what reading frame is the ORF for the last exon?


        • First exon: Check your model above to identify where the CDS begins. Determine the appropriate start codon from the table above by checking which reading frame provides a whole number of triplets between the start codon and the end of this exon, taking into account how many nucleotides are needed to complete the partial triplet at the beginning of the internal exon. What is the position of the start codon? What reading frame is the ORF in?
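
The frame bookkeeping in the three exon steps above can be checked with a short sketch. It assumes 1-based positions, reading frames RF1 to RF3 counted from nucleotide 1 of the sequence, and the stop codons TAA, TAG and TGA; all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_of(pos):
    """Reading frame (1, 2 or 3) of a codon starting at 1-based position pos."""
    return (pos - 1) % 3 + 1


def open_frames(seq, start, end):
    """Frames without a stop codon lying entirely inside [start, end]."""
    seq = seq.upper()
    closed = set()
    for pos in range(start, end - 1):  # codon occupies pos .. pos + 2
        if seq[pos - 1:pos + 2] in STOP_CODONS:
            closed.add(frame_of(pos))
    return [f for f in (1, 2, 3) if f not in closed]


def partial_codons(start, end, frame):
    """Nucleotides not in a full triplet at the (beginning, end) of an exon."""
    leading = (frame - start) % 3
    trailing = (end - start + 1 - leading) % 3
    return leading, trailing


def junction_in_frame(trailing_prev, leading_next):
    """True if the partial triplets on both sides of an intron add up to a codon."""
    return (trailing_prev + leading_next) % 3 == 0
```

An exon spanning positions 5 to 12 and read in RF1 gives `partial_codons(5, 12, 1)`, that is two unmatched nucleotides at its beginning and none at its end, so the preceding exon has to contribute exactly one nucleotide to that split codon.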



      • Characterize the gene answering the following questions:


        • What characteristics does this gene have (length, exons, introns, splice sites, promoter, etc.)?
        • What do the mRNA and amino acid sequences for this gene look like?
        • Is it possible that the gene in this sequence consists of only one CDS and does not contain any introns?
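
Once exon coordinates are settled, the coding sequence, its mRNA form and the protein can be derived as sketched below. The sketch assumes 1-based inclusive exon intervals on the forward strand, exons that begin at the start codon and end at the stop codon, and the standard genetic code; the function names and the toy sequence in the usage note are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def splice_cds(seq, exons):
    """Concatenate the exon intervals (1-based, inclusive) into one CDS."""
    return "".join(seq[start - 1:end] for start, end in exons).upper()


def to_mrna(cds):
    """Write the coding strand as RNA (T replaced by U)."""
    return cds.replace("T", "U")


def translate(cds):
    """Translate a CDS up to, but not including, the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For the toy sequence `"ATGGCCGTAAGTAAATAG"` with exons `[(1, 6), (13, 18)]`, `splice_cds` returns `ATGGCCAAATAG` and `translate` returns `MAK`. Translating the unspliced sequence from the start codon as well gives a quick way to test the single-CDS hypothesis in the last question.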