Bioinformatics in the Classroom

Characteristics of DNA Sequences

Characteristics of DNA Sequences I - Randomness vs. Patterns
Concept: Nucleotide and amino acid sequences within living organisms are not random; they carry information. Different sequence patterns are correlated with different functions.

Genes consist of sequences of nucleotides. These nucleotides, in the form of a triplet code, determine the position of individual amino acids in peptides. The sequence of amino acids within a peptide determines the characteristics and, thus, the function of the resulting protein. Since genes and proteins are tailored to serve specific functions, their elements, nucleotides and amino acids, are not lined up randomly but in very specific patterns.

The exercises in the hand-out provided will help you to better understand the concept of randomness and will provide you and your students with tools to examine the randomness of DNA sequences. They will also prepare you to better understand how bioinformatics tools work and help you to answer questions like this:

  • How often would you expect a sequence of 16 nucleotides (16-mer) to be repeated in the human genome? How often a sequence of 300 nucleotides?

  Or like this:
  • Alu elements are nucleotide stretches of ca. 350 bp which occur repetitively in the human genome and add up to about 10% of the entire human DNA sequence. How many copies of Alu does this amount to? Would you think that this is due to chance? What kind of information could be carried by the occurrence of so many Alu elements?
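
As a rough way to frame both questions, the sketch below computes how often one specific k-mer would be expected in a random sequence, and how many copies a repeat of a given length needs in order to make up a given fraction of a genome. It assumes a genome size of about 3 billion bp, equal frequencies of the four nucleotides, and a single strand. None of these assumptions come from the hand-out, and the function names are hypothetical.

```python
# Assumption: approximate haploid human genome size in base pairs.
GENOME_SIZE = 3_000_000_000


def expected_occurrences(k, genome_size=GENOME_SIZE):
    """Expected count of ONE specific k-mer in a random sequence.

    Assumes A, C, G and T are equally likely at every position and
    looks at one strand only.
    """
    positions = genome_size - k + 1   # number of windows of length k
    return positions / 4 ** k         # each window matches with probability 1/4^k


def copies_from_fraction(fraction, element_length, genome_size=GENOME_SIZE):
    """Number of copies needed for a repeat to cover `fraction` of the genome."""
    return fraction * genome_size / element_length


if __name__ == "__main__":
    print("expected 16-mer occurrences: ", expected_occurrences(16))
    print("expected 300-mer occurrences:", expected_occurrences(300))
    print("approx. Alu copies:          ", round(copies_from_fraction(0.10, 350)))
```

Comparing the expected number of chance repeats with the number of Alu copies is one way to judge whether chance alone is a plausible explanation.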
  • Characteristics of DNA Sequences II - Bioinformatics
    Concept: Informatics tools for the analysis of genomes are being developed by analyzing the occurrence of patterns in DNA and their putative correlation with biological function. This highly mathematical approach is expected to lead to tools that allow not only the identification of genes, introns, and exons, but also the identification of protein binding sites and the classification of proteins according to their biochemical activities (motif search).

    DNA sequences are provided in a variety of different formats: lower case letters, upper case letters, digits, interspersed with blank spaces (to separate long sequences into strings of 10 nucleotides). Most sequence analysis programs accept nucleotide sequences that have been formatted in the so-called "FASTA" format.

    A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

    >gi|532319|pir|TVFV2E|TVFV2E description
    wxyz

    whereby wxyz denotes a sequence of nucleotides or amino acids; FASTA is used for DNA as well as for protein sequences.
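
A minimal sketch of such a transformation, independent of the two web tools described in this section: it strips digits, blanks, and line breaks from a raw sequence, converts it to upper case, and writes a description line followed by sequence lines of fixed width. The function name `to_fasta` and the default width of 70 characters are assumptions made for this illustration.

```python
def to_fasta(raw, description, width=70):
    """Return `raw` as a FASTA record.

    Everything that is not a letter (digits, blanks, line breaks) is
    dropped, the remaining letters are converted to upper case, and the
    sequence is split into lines of at most `width` characters.
    """
    letters = "".join(ch for ch in raw if ch.isalpha()).upper()
    lines = [letters[i:i + width] for i in range(0, len(letters), width)]
    return ">" + description + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A raw sequence with position numbers and blocks of 10, as described above.
    raw = "1 acgtacgtac gtacgtacgt 21 ac"
    print(to_fasta(raw, "example sequence", width=10), end="")
```

The sketch does not check whether the remaining letters are valid nucleotide or amino acid codes; a stricter version would reject unexpected characters.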

    Tools for the transformation of a sequence into FASTA format can be found at a variety of web sites. Here, we would like to introduce you to two ways to perform this transformation:

    1. Use the web-site of the Dolan DNA Learning Center:
      • Open browser
        Go to Bioserver at http://vector.cshl.org/bioserver/
      • Log into 'Sequence Server' (logging in as a user will allow you to save your sequences).
      • Select 'Create A Sequence'.
      • Paste sequence into window, name sequence with a name you will remember.
      • Select 'OK'.
      • To save this sequence select 'Save', 'Add Group', create a new group with your name, select 'OK'.

    2. Use the sequence utilities provided through the web site of the Baylor College of Medicine.
      • Go to course links at http://vector.cshl.org/bioinformatics/links.htm, open website 'Baylor College of Medicine'.
      • Select 'Sequence Utilities'.
      • Paste sequence into window.
      • Select 'ReadSeq', then 'Search'.
      • Open a text editor and paste result into text file. Make sure that you save this file in a location you will remember (e.g. directory 'Workshops').

  • Characteristics of DNA Sequences III - Coding vs. Non-coding DNA
    Concept: Different genomic DNA sequences serve different functions (e.g. promoters, genes, etc.). Each DNA type is associated with specific sequence characteristics (motifs).

    DNA sequences that are translated into amino acid sequences are called coding DNA; all other DNA sequences are summarized as non-coding DNA. In eukaryotes, coding DNA therefore refers to the DNA sequences in exons, while non-coding DNA comprises introns, non-translated 5'- and 3'-DNA stretches (5'-UTR, 3'-UTR), and promoters. In addition to these "intragenic" non-coding DNA sequences, there is a wide array of non-coding DNA that is located in intergenic regions, e.g. in centromeres and telomeres.

    Examples of characteristics that have been found associated with different DNA sequences are:

    • coding sequences (exons): e.g. open reading frames (ORFs).
    • introns: e.g. stop codons.
    • non-coding, regulatory regions; e.g. TATA box, Shine-Dalgarno sequence, Pribnow box, Kozak consensus, CpG-rich regions.
    • translational start/stop sites; e.g. ATG.
    • protein binding sites; e.g. TATA.
    • non-coding, intergenic sequences; e.g. absence of coherent ORFs, AT-rich.

  • Exercise: Characteristics of coding and non-coding DNA


    Use the programs in the window on the right to examine the characteristics of exon, intron, and intergenic DNA sequences. For representatives of each sequence type, determine the numbers of the four nucleotides A, T, G, and C, the numbers and ratios of A+T and C+G, the numbers and ratios of CpGs and all other dimers, as well as the numbers and ratios of all trimers (all possible combinations). Complete the worksheet provided (sample: coding/non-coding) and answer the following questions:
    • What differences can you identify between intra- and intergenic regions? Between coding and non-coding sequences within genes (intragenic)?
    • Can you identify differences among first, internal and last exons?
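
The counts asked for in this exercise can be sketched in a few lines. The example below is not one of the course's programs; it is an illustrative version with hypothetical function names (`kmer_counts`, `composition`) that counts overlapping monomers, dimers, or trimers and reports A+T and G+C totals and ratios.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping substrings of length k (k=1 monomers, k=2 dimers, ...)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition(seq):
    """Return A+T and G+C totals and their share of all A, C, G, T letters."""
    mono = kmer_counts(seq, 1)
    at = mono["A"] + mono["T"]
    gc = mono["G"] + mono["C"]
    total = at + gc
    return {
        "A+T": at,
        "G+C": gc,
        "A+T ratio": at / total if total else 0.0,
        "G+C ratio": gc / total if total else 0.0,
    }


if __name__ == "__main__":
    seq = "ATGCGCGTATA"  # short toy sequence, for illustration only
    print(kmer_counts(seq, 1))
    print(kmer_counts(seq, 2))
    print("CpG dimers:", kmer_counts(seq, 2)["CG"])
    print(composition(seq))
```

Dividing each dimer or trimer count by the total number of windows (sequence length minus k plus 1) gives the corresponding ratio.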

    The original version of the programs utilized for this analysis can be viewed here; more streamlined versions are available for viewing through the links provided at the right. Try to identify functions for some of the components of these programs.

    Paste your sequence into the window (FASTA format or raw sequence) and select a program for analysis.

    Programs available for viewing: length.pl; monomers.pl; dimers.pl; trimers.pl; fourmers.pl; all.pl

     

  • Characteristics of DNA Sequences IV - Exon-Intron Borders
    Concept: The borders between introns and exons (i.e. "splice sites") are recognized through their characteristic nucleotide sequences by the proteins that remove introns from primary transcripts. Exon-intron borders (i.e. "donor sites") and intron-exon borders (i.e. "acceptor sites") are associated with specific sequence characteristics (motifs).
  • Splice site examination

    During the transcription of a gene, its exon and intron sequences are transcribed together into one primary transcript (pre-mRNA). Subsequently, the introns are removed and the exons are spliced together to form a mature mRNA, which serves as the template for translation. The following exercise examines intron/exon borders (i.e. splice sites) of human and plant genes and is designed to help you and your students identify characteristic features of these important sites in eukaryotic genes.


    • Use the worksheet provided or create a table following this example to identify sequence characteristics at splice sites (i.e. exon/intron borders) for Arabidopsis and human genes. Are you able to identify any patterns that recur at each splice site and could be used in a splice site prediction program?
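
One way to build such a table is to align the border sequences and count the nucleotides in each column. The sketch below assumes that all sequences have been cut to the same length around the border; the function names are hypothetical and the input strings are made-up placeholders, not real splice sites.

```python
from collections import Counter


def position_counts(aligned_sites):
    """Count nucleotides in each column of equally long, aligned sequences."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sequences must have the same length")
    return [Counter(site[i].upper() for site in aligned_sites)
            for i in range(length)]


def print_table(aligned_sites):
    """Print one row per position with the counts of A, C, G and T."""
    print("pos  " + "  ".join(f"{base:>3}" for base in "ACGT"))
    for pos, column in enumerate(position_counts(aligned_sites), start=1):
        print(f"{pos:>3}  " + "  ".join(f"{column[base]:>3}" for base in "ACGT"))


if __name__ == "__main__":
    # Placeholder strings for illustration only; replace them with the
    # border sequences collected on the worksheet.
    placeholder_sites = ["ACGTAC", "ACGTTC", "TCGTAC", "ACGAAC"]
    print_table(placeholder_sites)
```

Columns in which one nucleotide dominates across many genes are candidates for the recurring patterns the exercise asks about.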

     

  • Characteristics of DNA Sequences V - Open Reading Frames
    Concept: Genes provide the building plan for proteins. The genetic code describes the relationship between nucleotide triplets and amino acids: the 64 possible triplets code for the placement of 20 amino acids and for 3 stop codons, which terminate mRNA translation (find a refresher here). Non-coding gene sequences are usually highly interspersed with stop codons in all three reading frames, while coding gene sequences (exons) usually allow a continuous read of coding nucleotide triplets in one reading frame.

    Run some of the exon and intron sequences from the exercise above through NCBI's ORF detection program as well as through the 6 Frame Translation utility at the Baylor College of Medicine's 'Sequence Utilities'.

    • What differences do you observe between the outputs provided by the two programs?
    • What differences do you observe between exons and introns?
    • Can you observe differences in how ORFs are distributed throughout exons? Do all ORFs start with the first nucleotide and run through the last? Comment on your observations.
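
To see what a six-frame scan does in principle, the sketch below reports, for each of the three forward and three reverse-complement frames, the longest stretch that is free of stop codons. It is not the algorithm of either program named above: it assumes the standard stop codons TAA, TAG, and TGA, defines an open stretch simply as stop-free (without requiring an ATG start), and uses hypothetical function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq, frame):
    """Yield (start, end) of maximal stop-free codon runs in frame 0, 1 or 2."""
    seq = seq.upper()
    start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i > start:
                yield (start, i)
            start = i + 3                    # continue after the stop codon
    end = frame + ((len(seq) - frame) // 3) * 3   # end of last complete codon
    if end > start:
        yield (start, end)


def longest_open_stretch(seq):
    """Longest stop-free stretch (in nucleotides) for each of the six frames."""
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            lengths = [e - b for b, e in stop_free_stretches(s, frame)]
            result[f"{strand}{frame + 1}"] = max(lengths, default=0)
    return result


if __name__ == "__main__":
    print(longest_open_stretch("ATGAAATAGCCC"))  # toy sequence
```

Running exon and intron sequences through such a scan gives a simple number to compare: a frame with one long stop-free stretch versus frames that are interrupted frequently.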