Exercise 1 - Use of ORFs for prokaryotic gene prediction
- (Roughly) annotate this non-eukaryotic genome and determine the nature of the organism using NCBI's "Open Reading Frame Finder":
- Go to the NCBI ORF Finder window at http://www.ncbi.nih.gov/gorf/gorf.html
- Paste the sequence into the window.
- Select 'OrfFind'.
- On the results page, change the sensitivity from 100 to 300. Select 'Redraw'.
- Locate the exact start and end points of an ORF by clicking it. Record the nucleotide positions for each ORF.
- Identify the nature of the gene for any ORF by clicking on the ORF, then selecting 'Blast' above the ORF display.
- Wait for your search results to be displayed. If this takes too long, write down the 'Request ID', start a Blast search for another ORF, and return later to retrieve your results.
- View the Blast results for several ORFs and
determine what the genes are and the nature of the organism.
- Does this map confirm your suspicion? Which genes do the ORFs you found before represent? Which genes did you miss?
- Re-run the NCBI ORF Finder, but this time set the sensitivity on the output page to 100 and then to 50. Select 'Redraw' each time and try to detect ORFs for the genes that are on the map but which you did not see before.
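To see why lowering the sensitivity reveals more ORFs, it can help to scan a sequence yourself. The following is only a minimal sketch, not NCBI's implementation: it assumes that the 'sensitivity' value acts as a minimum ORF length in nucleotides, that an ORF runs from an ATG to the next in-frame stop codon, and that the standard stop codons apply. The function names are made up for this illustration.

```python
# Minimal six-frame ORF scan (illustrative sketch, not the NCBI algorithm).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_len):
    """Yield (start, end, frame) for ATG..stop ORFs on one strand.

    Positions are 1-based and inclusive; the stop codon is included.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    yield (start + 1, i + 3, frame + 1)
                start = None

def find_orfs(seq, min_len=100):
    """Return (strand, start, end, frame) tuples in forward-strand coordinates."""
    n = len(seq)
    results = [("+", s, e, f) for s, e, f in orfs_in_strand(seq, min_len)]
    for s, e, f in orfs_in_strand(reverse_complement(seq), min_len):
        # Map reverse-strand positions back onto the forward strand.
        results.append(("-", n - e + 1, n - s + 1, f))
    return results
```

Calling find_orfs with min_len set to 300, 100 and 50 on the same sequence shows how shorter ORFs only appear once the threshold is lowered.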
Exercise 2 - Ab initio eukaryotic gene prediction
Identify the gene in this human nucleotide sequence. Record your results on this worksheet, in a word processor, and in the form of screen shots.
Screen shots on most PCs can be taken by pressing the Print Screen key on the keyboard and pasting the saved image into Photoshop or Powerpoint (use Ctrl V to paste).
On Macs, apple+shift+3 saves the entire screen and apple+shift+4 saves a selected area. The images are saved to the hard drive.
To save images on a Mac to the clipboard instead, press ctrl while taking the screen shot.
In order to predict genes in a DNA sequence follow
these steps:
- Narrow down the gene region and direction by determining the strand the gene is encoded by ('Watson' strand vs. 'Crick' strand; the 'Crick' strand is the reverse complement of the 'Watson' strand), as well as the start and the end of the gene.
- Predict exon/intron borders.
- Use these data to build all possible models of the gene in the DNA sequence.
- In order to determine which of the models most likely describes the putative gene, identify those parts of the sequence that contain the coding sequences of the gene's exons. For each exon, identify which reading frames are "open". Determine putative start and stop codons for the initial and terminal exon (note: sometimes the translational start site of a gene is not located in the first but in a subsequent exon). Adjust exon/intron borders to yield one contiguous CDS over the run of the gene that agrees with all data.
Here is how to do it:
- Determine the start and the end of the prospective gene:
- To determine the beginning of the gene, run the sequence through a Transcriptional Start Site Finder (here: http://argon.cshl.org/genefinder/CPROMOTER/index.htm) and determine possible start regions. Save your results in your notepad file or as screen shots. (Should you experience problems with the program, view the results here.)
- To determine the end of the gene, run the sequence through this Poly A Signal Predictor (here: http://argon.cshl.org/tabaska/polyadq_form.html). Save your results in your notepad file or as screen shots. (Should you experience problems with the program, view the results here.)
- In your worksheet, mark the promoter region and the prospective translational start and polyA-signal sites. Which direction is the gene most likely transcribed in?
- For the sake of time you have not made any predictions for the reverse (Crick) strand. If you wish to do so, build the reverse complement of the sequence here (http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm#permutator) and run the promoter and polyA-signal prediction programs again.
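As a rough cross-check of the strand question, you can look for the canonical polyadenylation hexamer AATAAA on both strands yourself. This is a deliberately naive sketch and not what the Poly A Signal Predictor does: real predictors score the sequence context around a hit, whereas this code only reports exact matches. The function names are hypothetical, and positions on the Crick strand are given in Crick-strand coordinates.

```python
# Naive scan for the canonical AATAAA hexamer on both strands (sketch only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement (the 'Crick' strand) of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_motif(seq, motif="AATAAA"):
    """Return 1-based start positions of every (possibly overlapping) exact hit."""
    seq = seq.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos + 1)
        pos = seq.find(motif, pos + 1)
    return hits

def polya_candidates(seq):
    """Report hexamer hits on each strand, each in that strand's own coordinates."""
    return {
        "watson": find_motif(seq),
        "crick": find_motif(reverse_complement(seq)),
    }
```

A hit at position p on the Crick strand starts at position n - p + 1 on the Watson strand, where n is the sequence length, so results from both strands can be marked on the same worksheet.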
- Discern exons and introns by identifying splice sites (i.e. the borders between exons and introns):
- Run the sequence through this splice site prediction program at UC Berkeley (http://www.fruitfly.org/seq_tools/splice.html), changing the donor score cutoff to 0.88 and the acceptor score cutoff to 0.94 (the donor site is at the beginning of an intron, the acceptor site at its end). (Should you experience problems with the program, view the results here.)
- What does the output page tell you about the prediction mechanism?
- Use the output to determine the splice site positions and record the results in the worksheet.
- Build Models: From the total of six splice site predictions, build a number of alternative maps that show how exons and introns could follow each other in this gene.
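Building the alternative maps by hand is easy to get wrong, so here is a small sketch that enumerates them. It rests on assumptions that are not part of the exercise: a donor position is taken as the first nucleotide of an intron, an acceptor position as the last nucleotide of an intron, and the gene start and end are already known. The positions in the usage note are invented, and the splice site program may report positions using a different convention.

```python
# Enumerate exon/intron maps from predicted splice sites (illustrative sketch).

def intron_chains(donors, acceptors, after):
    """Yield every ordered list of non-overlapping introns (donor, acceptor).

    'after' is the last position already used; the next donor must leave
    room for at least one exon nucleotide before it.
    """
    yield []  # the model with no (further) introns
    for d in donors:
        if d <= after + 1:
            continue
        for a in acceptors:
            if a > d:
                for rest in intron_chains(donors, acceptors, a):
                    yield [(d, a)] + rest

def gene_models(start, end, donors, acceptors):
    """Return each model as a list of exons (start, end), 1-based and inclusive."""
    donors = sorted(p for p in donors if start < p < end)
    acceptors = sorted(p for p in acceptors if start < p < end)
    models = []
    for chain in intron_chains(donors, acceptors, start - 1):
        exons, left = [], start
        for d, a in chain:
            exons.append((left, d - 1))  # exon ends just before the donor
            left = a + 1                 # next exon starts just after the acceptor
        exons.append((left, end))
        models.append(exons)
    return models
```

For example, gene_models(1, 100, [20, 60], [40, 80]) returns five maps, ranging from a single exon with no introns to three exons separated by two introns. Most of these are later ruled out by checking the reading frames.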
- Determine which of the predicted splice sites truly border exons and introns: Relate the predicted splice sites to start and stop codons: donor sites (where exons end) should be preceded by ORFs and acceptor sites (where exons begin) should be followed by ORFs. Follow the sequence below: first identify the exact location of the internal exon, then the exact location of the final exon, and then the exact position of the first exon.
- Internal Exon: Identify the positions of start and stop codons by running the sequence through this tool (http://www.dnalc.org/bioinformatics/dnalc_nucleotide_analyzer.htm#translator). At the bottom of the output page is a table listing the codon positions. Use the positions listed for the stop codons in this sequence to identify a putative internal exon which is flanked by predicted Donor and Acceptor Sites and is not interrupted by a stop codon. (Should you experience problems with the program, view the position of stop codons here.)
- At what nucleotide position does the internal
exon start? Where does it end?
- Which of the three reading frames RF1,
RF2, RF3 (counting from nucleotide 1 of
the unknown sequence) is open for this putative
exon?
- At what nucleotide position exactly does
the coding sequence (CDS) within this ORF
begin?
- So, how many nucleotides are not matched
to a full triplet at the beginning/end of
this putative exon?
- Last Exon: Check your model above to identify where the CDS ends. Determine the appropriate stop codon from the table above by checking which reading frame provides a full number of triplets between the exon start and the stop codon, taking into account the number of nucleotides which need to be completed to a full triplet at the end of the internal exon. What is the position of the stop codon? In what reading frame is the ORF for the last exon?
- First exon: Check your model above to identify where the CDS begins. Determine the appropriate start codon from the table above by checking which reading frame provides a full number of triplets between the start codon and the end of this exon, taking into account how many nucleotides may have to be completed to full triplets at the beginning of the internal exon. What is the position of the start codon? What reading frame is the ORF in?
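The reading-frame bookkeeping in the three exon steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a replacement for the codon table from the tool: frames RF1 to RF3 are counted from nucleotide 1 of the sequence, positions are 1-based and inclusive, and a frame counts as open for an exon if no stop codon of that frame lies entirely inside the exon. The function names are hypothetical.

```python
# Reading-frame checks for a candidate exon (illustrative sketch).
STOPS = {"TAA", "TAG", "TGA"}

def stop_positions(seq):
    """Map each reading frame (1, 2, 3) to the 1-based starts of its stop codons."""
    seq = seq.upper()
    stops = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOPS:
            stops[i % 3 + 1].append(i + 1)
    return stops

def open_frames(seq, start, end):
    """Return the frames with no stop codon lying entirely within start..end."""
    stops = stop_positions(seq)
    return [f for f in (1, 2, 3)
            if not any(start <= p and p + 2 <= end for p in stops[f])]

def partial_codons(start, end, frame):
    """Count exon nucleotides not matched to a full triplet at each end.

    Returns (lead, trail): bases before the first complete codon and
    bases after the last complete codon of 'frame' within start..end.
    """
    lead = (frame - start) % 3
    trail = (end - frame + 1) % 3
    return lead, trail
```

For an exon spanning positions 5 to 10 read in RF1, partial_codons(5, 10, 1) gives (2, 1): two nucleotides at the start and one at the end must be completed to full triplets by the neighbouring exons.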
- Characterize the gene by answering the following questions:
- What characteristics does this gene have (length, exons, introns, splice sites, promoter, etc.)?
- What do the mRNA and amino acid sequences
for this gene look like?
- Is it possible that the gene in this sequence consists of only one CDS and does not contain any introns?
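Once you have settled on a model, the mRNA and amino acid sequences can be derived from it. The sketch below assumes the standard genetic code, a gene on the forward strand, and exon coordinates that cover only the coding part of each exon (1-based, inclusive). The function names and the example coordinates in the usage note are invented for illustration.

```python
# Splice the coding exons and translate them (illustrative sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built by walking the codons in TCAG order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def spliced_cds(seq, cds_exons):
    """Join the coding parts of the exons, given as (start, end) pairs."""
    return "".join(seq[s - 1:e] for s, e in cds_exons).upper()

def to_mrna(cds):
    """Write the coding sequence as RNA (T replaced by U)."""
    return cds.replace("T", "U")

def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

With a toy sequence such as "ATGGCCGTAAGTTTTCAGAAATAA" and coding exons at (1, 6) and (19, 24), spliced_cds gives ATGGCCAAATAA, which translates to MAK. Translating the unspliced sequence from the same start codon is one way to approach the last question: if no in-frame stop codon interrupts it before the predicted end of the gene, a single-CDS model cannot be ruled out on reading frame alone.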