Characteristics of DNA Sequences I - Randomness vs. Patterns
Concept: Nucleotide and amino acid sequences within living organisms are not random;
they carry information. Different sequence patterns are correlated with different functions.
Genes consist of sequences of nucleotides. These nucleotides, in the form of a triplet code,
determine the position of individual amino acids in peptides. The sequence of amino acids within a
peptide determines the characteristics and, thus, the function of the resulting protein. Since genes and
proteins are tailored to serve specific functions, their elements, nucleotides and amino acids,
are not lined up randomly, but in very specific patterns.
The exercises in the handout provided will help you to better understand the concept
of randomness and will provide you and your students with tools to examine the randomness of
DNA sequences. They will also prepare you to better understand how bioinformatics tools work
and help you to answer questions like this:
How often would you expect a sequence of 16 nucleotides
(16-mer) to be repeated in the human genome? How often a
sequence of 300 nucleotides?
Or like this:
Alu elements are nucleotide stretches of ca. 350 bp which occur repetitively in the human
genome and make up about 10% of the entire human DNA sequence. How many copies of Alu
does this amount to? Would you think that this is due to chance? What kind of information
could be contained in the occurrence of so many Alu elements?
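One way to approach both questions numerically is sketched below. It is a minimal sketch under stated assumptions, not part of the course material: the genome length of 3.2 billion bp is an assumed round figure, the four bases are treated as equally frequent and independent, only one strand is counted, and the function names are made up for illustration.

```python
# Assumed haploid human genome length in bp (a round figure, not from the text).
GENOME_SIZE = 3_200_000_000


def expected_occurrences(k, genome_size=GENOME_SIZE):
    """Expected number of times one specific k-mer appears on one strand
    of a random sequence in which A, C, G and T are equally likely."""
    positions = genome_size - k + 1   # number of possible start positions
    return positions / 4 ** k         # each position matches with probability (1/4)^k


def copies_from_fraction(fraction, element_length, genome_size=GENOME_SIZE):
    """Number of copies of an element of the given length that would be
    needed to fill the given fraction of the genome."""
    return fraction * genome_size / element_length


if __name__ == "__main__":
    for k in (16, 300):
        print(f"{k}-mer: expected occurrences = {expected_occurrences(k):.3g}")
    # Figures from the text: elements of ca. 350 bp making up about 10%.
    print(f"Alu copies: about {copies_from_fraction(0.10, 350):,.0f}")
```

Comparing the expected count for a random 300-mer with the Alu copy number is a quick way to judge whether chance alone is a plausible explanation.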
Characteristics of DNA Sequences II - Bioinformatics
Concept: Informatics tools for the analysis of genomes are being developed by analyzing the
occurrence of patterns in DNA and their putative correlation with biological function. This
highly mathematical approach is expected to lead to the development of tools that allow not
only the identification of genes, introns, and exons, but also the identification of protein
binding sites and the classification of proteins according to their biochemical activities
(motif search).
DNA sequences are provided in a variety of different formats: in lower case letters, in upper
case letters, with digits, or interspersed with blank spaces (to separate long sequences into
strings of 10 nucleotides). Most sequence analysis programs accept nucleotide sequences that
have been formatted in the so-called "FASTA" format.
A sequence in FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a
greater-than (">") symbol in the first column. It is recommended that all lines of text be
shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E description
wxyz
where wxyz denotes a sequence of nucleotides or amino acids; FASTA is used for DNA as well
as for protein sequences.
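The format rules above can also be applied with a few lines of code. The following is a minimal sketch, not one of the web tools used in this course: the function names `to_fasta` and `parse_fasta` and the default line width of 70 are assumptions chosen for illustration. It strips digits and blanks from a raw sequence, writes it in FASTA format, and reads FASTA text back.

```python
import re


def to_fasta(raw, description, width=70):
    """Turn a raw sequence (any case, possibly with digits and blanks)
    into FASTA text with sequence lines of at most `width` characters."""
    residues = re.sub(r"[^A-Za-z]", "", raw).upper()   # keep letters only
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + description + "\n" + "\n".join(lines) + "\n"


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):            # a new description line
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:            # sequence data for the current record
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    raw = "1 acgtacgtac gtacgtacgt\n21 acgt"   # made-up example sequence
    fasta = to_fasta(raw, "demo sequence", width=10)
    print(fasta)
    print(parse_fasta(fasta))
```

Keeping only letters is a deliberate simplification; gap or stop symbols that some programs allow in sequence data would be dropped by this sketch.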
Tools for the transformation of a sequence into FASTA format can be found at a variety
of websites. Here, we would like to introduce you to two ways to perform this transformation:
- Use the website of the Dolan DNA Learning Center:
  - Open a browser and go to Bioserver at http://vector.cshl.org/bioserver/
  - Log into 'Sequence Server' (logging in as a user will allow you to save your sequences).
  - Select 'Create A Sequence'.
  - Paste the sequence into the window and give it a name you will remember.
  - Select 'OK'.
  - To save this sequence, select 'Save', then 'Add Group', create a new group with your
    name, and select 'OK'.
- Use the sequence utilities provided through the website of the Baylor College of Medicine:
  - Go to the course links at http://vector.cshl.org/bioinformatics/links.htm and open the
    website 'Baylor College of Medicine'.
  - Select 'Sequence Utilities'.
  - Paste the sequence into the window.
  - Select 'ReadSeq', then 'Search'.
  - Open a text editor and paste the result into a text file. Make sure that you save
    this file in a location you will remember (e.g. directory 'Workshops').
Characteristics of DNA Sequences III - Coding vs. Non-coding DNA
Concept: Different genomic DNA sequences serve different functions (e.g. promoters, genes, etc.).
Each type of DNA is associated with specific sequence characteristics (motifs).
DNA sequences that are translated into amino acid sequences are called coding DNA; all other
DNA sequences are summarized as non-coding DNA. In eukaryotes, coding DNA therefore refers to
the translated DNA sequences in exons, while non-coding DNA comprises introns, non-translated
5'- and 3'-DNA stretches (5'-UTR, 3'-UTR), and promoters. In addition to these "intragenic"
non-coding DNA sequences, there is a wide array of non-coding DNA located in intergenic
regions, e.g. in centromeres and telomeres.
Examples of characteristics that have been found associated with different DNA sequences are:
- coding sequences (exons): e.g. open reading frames (ORFs).
- introns: e.g. stop codons.
- non-coding, regulatory regions: e.g. TATA box, Shine-Dalgarno sequence, Pribnow box, Kozak consensus, CpG-rich regions.
- translational start/stop sites: e.g. ATG.
- protein binding sites: e.g. TATA.
- non-coding, intergenic sequences: e.g. absence of coherent ORFs, AT-rich.
Exercise: Characteristics of coding and non-coding DNA
Use the programs in the window on the right to examine the characteristics of exon, intron,
and intergenic DNA sequences. For representatives of each sequence type, determine the
numbers of the four nucleotides A, T, G, C, the numbers and ratios of A+T and C+G, the
numbers and ratios of CpGs and all other dimers, as well as the numbers and ratios of all
trimers (all possible combinations). Complete the worksheet provided (sample:
coding/non-coding) and answer the following questions:
- What differences can you identify between intra- and intergenic regions?
Between coding and non-coding sequences within genes (intragenic)?
- Can you identify differences among first, internal and last exons?
The original version of the programs utilized for this analysis can be viewed here; more
streamlined versions are available for viewing through the links provided at the right.
Try to identify functions for some of the components of these programs.
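The kind of counting this exercise asks for can be sketched as follows. This is a minimal illustration and not the course's own programs: the names `kmer_counts` and `composition` are assumptions, k-mers are counted in overlapping windows, and the input is assumed to be a raw sequence (a FASTA description line would have to be removed first).

```python
from collections import Counter

BASES = set("ACGT")


def kmer_counts(sequence, k):
    """Count overlapping k-mers that consist only of A, C, G and T."""
    # Drop blanks and digits from raw-sequence layouts and normalise case.
    seq = "".join(ch for ch in sequence.upper() if ch.isalpha())
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= BASES:          # skip windows containing N or other codes
            counts[kmer] += 1
    return counts


def composition(sequence):
    """Summarise base composition: A+T, C+G, their fractions, and CpG count."""
    mono = kmer_counts(sequence, 1)
    at = mono["A"] + mono["T"]
    gc = mono["G"] + mono["C"]
    total = at + gc                      # counts A, C, G and T only
    return {
        "length": total,
        "AT": at,
        "GC": gc,
        "AT_fraction": at / total if total else 0.0,
        "GC_fraction": gc / total if total else 0.0,
        "CpG": kmer_counts(sequence, 2)["CG"],
    }


if __name__ == "__main__":
    example = "acgcgt tatata"            # made-up sequence for illustration
    print(composition(example))
    print(kmer_counts(example, 3).most_common(5))
```

Calling `kmer_counts` with k = 1, 2, 3 and 4 gives monomer, dimer, trimer and four-mer tables; dividing each count by the total number of windows gives the ratios asked for in the worksheet.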
Paste your sequence into the window (FASTA format or raw sequence) and select a program for analysis.
View the programs: length.pl; monomers.pl; dimers.pl; trimers.pl; fourmers.pl; all.pl
Characteristics of DNA Sequences IV - Exon-Intron Borders
Concept: Proteins which remove introns from primary transcripts recognize the borders
between introns and exons (i.e. "splice sites") by their characteristic nucleotide sequences.
Exon-intron borders (i.e. "donor sites") and intron-exon borders (i.e. "acceptor sites")
are associated with specific sequence characteristics (motifs).
Splice site examination
During the transcription of a gene, its exon and intron sequences are transcribed together
into one primary transcript (pre-mRNA). Subsequently, the introns are removed and the exons
are spliced together to form a mature mRNA, which serves as the template for translation.
The following exercise examines intron/exon borders (i.e. splice sites) of human and plant
genes and is designed to help you and your students identify characteristic features of
these important sites in eukaryotic genes.
- Use the worksheet provided or create a table following this example to identify
sequence characteristics at splice sites (i.e. exon/intron borders) for Arabidopsis and
human genes. Are you able to identify any patterns that recur at each splice site and
may be used in splice site prediction programs?
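The table described above can also be filled in programmatically. The sketch below is an illustration under stated assumptions rather than part of the worksheet: exon coordinates are assumed to be known and given as 0-based, end-exclusive (start, end) pairs, the function names are hypothetical, and the example sequence is invented. It cuts a small window around every donor and acceptor site and tallies which nucleotide occurs at each window position.

```python
from collections import Counter


def splice_windows(genomic, exons, flank=3):
    """Return (donor_windows, acceptor_windows) for consecutive exons.

    Each window holds `flank` bases on either side of a border. Assumes the
    exons are sorted and every window lies fully inside the sequence."""
    genomic = genomic.upper()
    donors, acceptors = [], []
    for (_, exon_end), (next_start, _) in zip(exons, exons[1:]):
        # Donor site: last exon bases followed by the first intron bases.
        donors.append(genomic[exon_end - flank:exon_end + flank])
        # Acceptor site: last intron bases followed by the first exon bases.
        acceptors.append(genomic[next_start - flank:next_start + flank])
    return donors, acceptors


def position_counts(windows):
    """Tally the nucleotides seen at each position across aligned windows."""
    return [Counter(column) for column in zip(*windows)]


if __name__ == "__main__":
    # Invented toy gene: exons in upper case, intron in lower case.
    toy = "ATGGCCAAG" + "gtaagtttttcag" + "GTTCTGTAA"
    toy_exons = [(0, 9), (22, 31)]
    donors, acceptors = splice_windows(toy, toy_exons)
    print("donor windows:   ", donors)
    print("acceptor windows:", acceptors)
    for i, counts in enumerate(position_counts(donors)):
        print("donor position", i, dict(counts))
```

With windows collected from many real genes, positions at which one nucleotide dominates the tally are the recurring patterns the exercise asks about.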
Characteristics of DNA Sequences V - Open Reading Frames
Concept: Genes provide the building plan for proteins. The genetic code describes a
relationship between nucleotide triplets and amino acids, whereby the 64 possible triplets
comprise 61 codons that specify the placement of 20 amino acids and 3 stop codons that
terminate mRNA translation (find a refresher here). Non-coding gene sequences are usually
highly interspersed with stop codons in all three reading frames, while coding gene
sequences (exons) usually allow a continuous read of coding nucleotide triplets in one
reading frame.
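The numbers in the concept statement can be checked with a short sketch. It assumes the standard genetic code, written here in the usual compact form with the bases ordered T, C, A, G; the variable names are illustrative only.

```python
from itertools import product

# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, TTG,
# TCT, ... GGG ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
SENSE_CODONS = [c for c, aa in CODON_TABLE.items() if aa != "*"]
DISTINCT_AMINO_ACIDS = set(AMINO_ACIDS) - {"*"}

if __name__ == "__main__":
    print("triplets:", len(CODON_TABLE))
    print("stop codons:", STOP_CODONS)
    print("sense codons:", len(SENSE_CODONS))
    print("amino acids:", len(DISTINCT_AMINO_ACIDS))
```

Because several triplets map to the same amino acid, the code is redundant, which is why the number of coding triplets exceeds the number of amino acids.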
Run some of the exon and intron sequences from the exercise above through NCBI's ORF
detection program as well as through the 6 Frame Translation utility at the Baylor College
of Medicine's 'Sequence Utilities'.
- What differences do you observe between the outputs provided by
the two programs?
- What differences do you observe between exons and introns?
- Can you observe differences in how ORFs are distributed throughout
exons? Do all ORFs start with the first nucleotide and run through the
last? Comment on your observations.
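To see what a six-frame scan looks at, the following minimal sketch may help. It is not the algorithm used by the NCBI or Baylor tools: it only counts stop codons and measures the longest stop-free run of triplets in each of the six reading frames, without requiring a start codon, and the function names are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def frame_report(sequence):
    """Map each frame label ('+1'..'+3', '-1'..'-3') to a tuple
    (number of stop codons, longest stop-free run in codons)."""
    seq = sequence.upper()
    report = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            stops = longest = run = 0
            # Walk the frame codon by codon, ignoring an incomplete last triplet.
            for i in range(offset, len(strand_seq) - 2, 3):
                if strand_seq[i:i + 3] in STOP_CODONS:
                    stops += 1
                    run = 0
                else:
                    run += 1
                    longest = max(longest, run)
            report[f"{strand}{offset + 1}"] = (stops, longest)
    return report


if __name__ == "__main__":
    example = "ATGAAATAGCC"              # made-up sequence for illustration
    for frame, (stops, longest) in frame_report(example).items():
        print(frame, "stops:", stops, "longest open run (codons):", longest)
```

For an exon, one frame would be expected to show a long stop-free run while the others are interrupted; comparing this with intron sequences addresses the second question above.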