Bioinformatics in the Classroom

Single Nucleotide Polymorphisms
Variations on the Human Genome

SNPs I - Introduction
Concepts: The genomes of different humans are highly similar, yet sufficiently different to guarantee that each human has a unique genome. The most prevalent differences between human genomes are differences in single nucleotides, known as single nucleotide polymorphisms or SNPs (pronounced "snip"). It is estimated that the genomes of two individual humans differ, on average, by one SNP per 1,300 bp.

Literature: Paper by the International SNP Map Working Group

The genomes of two humans are about 99.9% identical. Yet the name 'Human Genome Project', which implies the existence of a single human genome, has always been somewhat of a misnomer: every person, with the exception of identical twins, has a unique genome. Even though two genomes are roughly 99.9% identical, the remaining difference of 0.1% amounts to roughly 3,200,000 differences among the 3.2 billion base pairs comprising each individual's (haploid) genome. It is precisely these differences, or polymorphisms, that account for the heritable variation among individuals, including susceptibility to diseases and responsiveness to cures.
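The arithmetic behind these figures can be checked in a few lines. The following is a minimal sketch in Python; the constant and function names are illustrative only, and the inputs are simply the round numbers quoted in this chapter (0.1% difference, 3.2 billion bp, one SNP per 1,300 bp).

```python
# Round figures quoted in the text; the names are illustrative, not standard.
HAPLOID_GENOME_BP = 3_200_000_000   # about 3.2 billion base pairs
DIFFERING_PER_1000_BP = 1           # 0.1% difference = 1 base pair in 1,000
BP_PER_SNP = 1_300                  # "one SNP per 1,300 bp"


def differences_from_percentage(genome_bp, differing_per_1000_bp):
    """Expected number of differing base pairs, given a per-1,000-bp rate."""
    return genome_bp * differing_per_1000_bp // 1000


def differences_from_spacing(genome_bp, bp_per_snp):
    """Expected number of SNPs, given an average spacing between SNPs."""
    return genome_bp // bp_per_snp


if __name__ == "__main__":
    by_percent = differences_from_percentage(HAPLOID_GENOME_BP, DIFFERING_PER_1000_BP)
    by_spacing = differences_from_spacing(HAPLOID_GENOME_BP, BP_PER_SNP)
    print(f"From the 0.1% figure:          {by_percent:,} differences")
    print(f"From one SNP per 1,300 bp:     {by_spacing:,} SNPs")
```

Both estimates are rough and land in the same range of a few million differences between any two haploid genomes.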

Each somatic cell within a human being has two sets of autosomal chromosomes, one maternal and one paternal. Thus all of us carry two versions of each autosome (the situation is different for the sex chromosomes, X and Y, and for the genetic information stored on the DNA of the mitochondria, the "powerplants" within our cells). The two versions of a given genetic locus are called "alleles". The paternal and the maternal allele of a locus are usually not identical but differ because of mutations that arose in the separate lineages that gave rise to the father and the mother, respectively. In addition, the alleles of different genetic loci within an individual's genome are recombined during meiosis, so that offspring do not express pure traits of their parents and grandparents but recombined mixes thereof.

Differences between alleles that are passed on to offspring in a Mendelian fashion are called polymorphisms. While polymorphisms come in a variety of types (e.g. insertions, deletions, inversions, duplications), academic and industrial research currently focuses on the most prevalent form of variation in the human genome: differences in single nucleotides, or Single Nucleotide Polymorphisms (SNPs). The pharmaceutical industry in particular expects that the identification of meaningful SNPs will lead to breakthroughs in diagnostic tools, to optimized drug discovery processes, and to the development of "individualized" drugs.

In the era of molecular biology, genetic analysis remains important for finding out where on the DNA to look for information relevant to particular phenomena. Some genes have very complex phenotypic consequences, and we may never be able to look at a DNA sequence and infer directly that it regulates some aspect of facial features or mathematical reasoning ability. Genetic analysis, however, offers a totally independent approach to determining the location of genes responsible for inherited traits.

In most organisms, genetics is carried out by breeding specific pairs of parents and examining the characteristics of their offspring. Clearly, this approach is not practical in humans. Instead, inheritance must be analyzed retrospectively in families.

Statistical analysis of the pattern of inheritance is therefore used in place of direct genetic manipulation to test hypotheses about the genetic mechanism underlying particular traits in humans. Linkage is the tendency for two observable genetic traits, called markers, to be coinherited if they lie near each other on the same chromosome. To be distinguishable genetically, markers must occur in more than one form (allele) in different members of the population, as eye color does. Several factors make it difficult to identify markers that are linked to certain phenotypes (e.g. the susceptibility to a certain disease):

  • Human families are generally too small to allow identification of linkage between genes that are closer together than 10 Mb. If two markers are 1 Mb apart, there is roughly a 1% chance of recombination between them, so it would statistically take a family with 100 offspring to detect a single recombination event separating two genes 1 Mb apart.
  • The low resolution of linkage analysis often requires the research to be performed on several families. This type of research is most promising in populations that have been somewhat isolated, so that even different families are related to some degree (e.g. families on islands such as Iceland, Sicily, or Corsica; the Ashkenazi Jews; etc.). If genetic analysis is performed on unrelated families, it is very difficult to detect linkage.
  • If both chromosomes of a parent carry the same marker A, there is no way to tell from this marker alone which chromosome the offspring received. To distinguish successfully among the four homologous chromosomes originally carried by the two parents, a large number of closely spaced, polymorphic markers needs to be available.
  • Misdiagnosis can cause individuals to be wrongly included in or excluded from a linkage analysis. This can obscure the findings, either by hiding a real linkage or by suggesting linkage to markers that have nothing to do with the disease.
  • Mispaternity may lead a study to falsely assume that a man is the father of a child, and/or to falsely exclude the true father.
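The family-size problem in the first point can be made concrete with a short calculation. The sketch below is only an illustration: it assumes the rule of thumb quoted above (about 1% recombination per Mb), treats that relationship as linear (reasonable only for short distances), and caps the recombination fraction at 0.5, the value for unlinked markers. All function names are hypothetical.

```python
def recombination_fraction(distance_mb, rate_per_mb=0.01):
    """Approximate chance of recombination between two markers.

    Assumes about 1% per Mb (the rule of thumb from the text) and a linear
    relationship, capped at 0.5 because unlinked markers recombine half the time.
    """
    return min(0.5, distance_mb * rate_per_mb)


def expected_recombinants(n_offspring, distance_mb):
    """Average number of recombinant offspring expected in a family."""
    return n_offspring * recombination_fraction(distance_mb)


def prob_at_least_one_recombinant(n_offspring, distance_mb):
    """Chance that a family shows at least one recombination event."""
    r = recombination_fraction(distance_mb)
    return 1.0 - (1.0 - r) ** n_offspring


if __name__ == "__main__":
    for n_offspring, distance_mb in [(100, 1), (10, 10), (4, 1)]:
        expected = expected_recombinants(n_offspring, distance_mb)
        chance = prob_at_least_one_recombinant(n_offspring, distance_mb)
        print(f"{n_offspring:>3} offspring, markers {distance_mb:>2} Mb apart: "
              f"expected recombinants = {expected:.2f}, "
              f"P(at least one) = {chance:.2f}")
```

Under these assumptions, even a hypothetical family of 100 offspring would show a recombination between markers 1 Mb apart only about 63% of the time, which is why data from many families usually have to be pooled.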

Recent efforts by the SNP Consortium, the International SNP Map Working Group, as well as a number of individual companies have led to the identification of about 1.4 million SNPs. On average, two haploid human genomes differ in 1 nucleotide per 1330 bp, a rate that is expected to vary somewhat between ethnic groups.

There are two basic types of SNPs: those within the coding regions of genes, which are called cSNPs, and those outside of genes. Previous studies have found that genes contain, on average, about four SNPs each. This observation, together with the estimate of a total of ca. 30,000 genes in the human genome, indicates the existence of roughly 120,000 cSNPs. Of these, about 60% are estimated to be synonymous, i.e. they do not change an amino acid, and 40% are expected to change an amino acid. These roughly 50,000 non-synonymous cSNPs, together with an unknown number of regulatory and other non-coding but functional polymorphisms, comprise the bulk of common molecular variation with potential phenotypic consequences. In addition to these causal SNPs, linkage studies try to identify SNPs that are associated with the inheritance of metabolic phenotypes, yet are located outside of genes and do not directly cause changes in phenotype.
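Whether a cSNP is synonymous (no amino acid change) or non-synonymous (amino acid change) can be decided by translating the codon before and after the substitution. The following sketch assumes the standard genetic code, written in the compact TCAG ordering used by NCBI translation table 1; the function name `classify_csnp` is hypothetical, and a real analysis would also need the gene's reading frame and strand.

```python
from itertools import product

# Standard genetic code: amino acids listed for codons TTT, TTC, TTA, TTG, TCT, ...
# (first base varies slowest, third base fastest; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_csnp(ref_codon, alt_codon):
    """Classify a single-nucleotide change within a codon.

    Returns "synonymous" if both codons encode the same amino acid (or both
    are stop codons), otherwise "non-synonymous".
    """
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be exactly 3 bases long")
    if sum(a != b for a, b in zip(ref_codon, alt_codon)) != 1:
        raise ValueError("a SNP changes exactly one base of the codon")
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "non-synonymous"


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAG", "GTG"), ("TGG", "TGA")]:
        print(f"{ref} ({CODON_TABLE[ref]}) -> {alt} ({CODON_TABLE[alt]}): "
              f"{classify_csnp(ref, alt)}")
```

Changes in the third base of a codon are often synonymous because of the redundancy of the genetic code, which fits the estimate that a majority of cSNPs do not alter the protein.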

SNPs appear to occur at different frequencies in different DNA sequences. For example, an analysis of a 1 Mb interleukin gene cluster on human chromosome 5 (5q31) in 40 individual Northern Europeans yielded the following results (Banerjee et al., 2001, CSHL Genome Meeting):

  Region                            bp sequenced   # SNPs   Frequency (SNPs per bp)
  Coding regions/exons                      1977        4   1/494
  Introns                                  14700       40   1/367
  UTRs                                      1105        4   1/276
  Conserved non-coding sequences           11295       28   1/403
  Intergenic                               13998       36   1/389
  Total                                    31778       85   1/374
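The frequency column is simply the number of base pairs sequenced divided by the number of SNPs found. As a minimal sketch (the variable and function names are illustrative), the values can be recomputed from the other two columns; the Total row is taken from the table as given rather than summed from the rows above it.

```python
# (region, bp sequenced, SNPs found, spacing reported in the table as 1/x)
SNP_TABLE = [
    ("Coding regions/exons", 1977, 4, 494),
    ("Introns", 14700, 40, 367),
    ("UTRs", 1105, 4, 276),
    ("Conserved non-coding sequences", 11295, 28, 403),
    ("Intergenic", 13998, 36, 389),
    ("Total", 31778, 85, 374),
]


def bp_per_snp(bp_sequenced, n_snps):
    """Average number of base pairs per SNP in a sequenced region."""
    return bp_sequenced / n_snps


if __name__ == "__main__":
    for region, bp, snps, reported in SNP_TABLE:
        spacing = bp_per_snp(bp, snps)
        print(f"{region:<32} 1 SNP per {spacing:6.1f} bp (table: 1/{reported})")
```

Each recomputed value agrees with the table to within one base pair; the small deviations come from rounding.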

Due to their genetic make-up, individuals differ in how they react to disease-causing events, in the course and form a disease may take, and in how they react to specific medications. For example, while smoking increases the risk of developing lung cancer in most people, some heavy smokers live to a very high age. Also, while most people who suffer from pain can find relief through codeine treatment, some cannot transform codeine into the corresponding morphine-like structure and will not experience efficient pain relief when treated with codeine. In other cases, differential reactions to certain drugs can mean the difference between healing and suffering and, often, death.

The sequencing of the human genome has led to the discovery of a bounty of SNPs; on average, two copies of a human chromosome differ by one SNP per 1,300 nucleotides. This holds for the comparison of chromosomes between humans as well as for the comparison of the two copies of each chromosome within each of our body cells. Several groups are involved in SNP discovery, the most prominent of which are the SNP Consortium and Celera. The SNP Consortium consists of several corporations with interests in pharmaceuticals as well as in informatics (see http://snp.cshl.org/about/members.html). The Consortium works with four major academic centers for molecular genetics and stores its data in its database at Cold Spring Harbor Laboratory.

SNPs II - SNPs and Bioinformatics
Concept: SNPs carry information about differences between humans. The occurrence of SNPs throughout the human genome follows certain principles that are not yet entirely understood.

SNP websites

Follow these links to Exercises and to Chapter 2: SNPs in Biomedicine:

SNP Exploration Exercises

SNPs in Biomedicine