Genotyping, Haplotype Assembly Problem, Haplotype Map,
Functional Genomics and Proteomics
February 19, 2002
Prepared by Kaleigh Smith
The following document provides an introduction to single nucleotide polymorphisms (SNPs), the motivation they provide for current research in pharmacogenomics (and other areas), and the technology that facilitates this research. The main motivation behind SNP-related work is that genetic differences between people can be used to predict phenotypes and phylogeny. We will generally discuss SNPs as they occur in human genomic sequences unless specified otherwise. This text is accompanied by a set of slides on SNPs also available at this location.
A Single Nucleotide Polymorphism is a source of variance in a genome. A
SNP ("snip") is a single base mutation in DNA. SNPs are the simplest
form and
most common source of genetic polymorphism
in the human genome (90% of all human DNA polymorphisms).
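There are two types of nucleotide base substitutions resulting in SNPs: transitions, in which a purine is replaced by a purine (A, G) or a pyrimidine by a pyrimidine (C, T), and transversions, in which a purine is replaced by a pyrimidine or vice versa.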
Sequence Variation
Sequence variation caused by SNPs can be measured in terms of nucleotide diversity, the ratio of
the number of base differences between two genomes over the number of
bases compared. This is approximately 1/1000 (1/1350) base pairs between
two equivalent chromosomes.
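The ratio described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the two sequences are already aligned and of equal length; the function name and the example sequences are made up for the example and do not come from any dataset.

```python
def nucleotide_diversity(seq_a, seq_b):
    """Return the fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count positions where the bases differ, then divide by bases compared.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Hypothetical 10-base fragments differing at one position.
example_a = "ACGTACGTAC"
example_b = "ACGTACGTAT"
print(nucleotide_diversity(example_a, example_b))  # 0.1
```

At the human rate quoted above, roughly one differing position per thousand bases would be expected rather than one in ten.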
Distribution of SNPs
SNPs are not uniformly distributed over the human genome, neither across chromosomes nor within a single chromosome. There are one third as many SNPs within coding regions as within non-coding regions. It has also been shown that sequence variation is much lower for the sex chromosomes. Within a single chromosome, SNPs can be concentrated about a specific region, usually implying a region of medical or research interest. For instance, the sequence on chromosome 6 that encodes proteins that present antigens to the immune system displays very high nucleotide diversity compared to the other areas of that chromosome.
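A SNP in a coding region may have one of two effects on the resulting protein: it may be synonymous, where the substitution does not change the encoded amino acid, or non-synonymous, where the substitution changes the amino acid sequence of the protein.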
SNPs may also occur in regulatory regions of genes. These SNPs are capable
of changing the amount or timing of a protein's production. Such SNPs are
much more difficult to find and understand, and gene regulation itself is
not yet clearly understood.
Phenotype, genotype and haplotype are the most important and basic concepts related to SNPs. It is important to have a clear understanding of each term and the processes of genotyping and haplotyping.
There are over one million SNPs identified (1,255,326 mapped SNPs at the SNP
Consortium Organization). Validation experiments have shown that 95% of
these are unique and valid polymorphisms (not the product of error or
redundancy).
Methods for SNP discovery/detection involve a set of
biochemical reactions that isolate the precise location of a suspected
SNP and then directly determine the identity of the SNP, using an
enzyme called DNA polymerase. [Source: http://www.orchid.com/]
Also, many SNPs were initially detected by comparing different sequenced
genomes. This work has now been extended to a much larger-scale effort to
determine the SNPs (genotypes) of many genomes from different populations.
Notice the difference between SNP discovery/detection and SNP scoring or SNP genotyping. One strives to identify new SNP locations on the genome, while the other "involves methods to determine the genotypes of many individuals for particular SNPs that have already been discovered" [NIH, Methods for Discovering and Scoring Single Nucleotide Polymorphisms, Request for Applications Jan. 9, 1998]. This ends our discussion of SNP detection. What follows is an overview of "post-genomic" SNP-related applications such as high-throughput genotyping, determining haplotypes from genotypes, and haplotype mapping.
The second phase of human genomics (the first being the sequencing of the
human genome) involves large-scale screening of different human
populations for significant DNA polymorphisms. The information gathered
will lead to accurate associations between genotype and phenotype.
High-throughput SNP genotyping is the process of quickly and cost-effectively identifying the SNP values in as many different individual human genomes as possible. Steps of SNP genotyping involve DNA sample preparation, PCR amplification,
and microarray assays. For the last step, the technology must
label SNP locations of both alleles in the DNA sample and determine the
base values using microarray technology.
Orchid Biocomputer and Affymetrix are the leaders in providing SNP genotyping technology. They have developed single nucleotide polymorphism (SNP) genotyping assays that combine Orchid's proprietary GBA® primer extension technology with an Affymetrix GeneChip® universal array. Their technologies can provide ultra-high throughput of 100,000 genotypes/day. It is interesting to note that nearly all SNP genotyping uses Affymetrix equipment (including GenFlex, Affymetrix's universal microarray).
High-throughput SNP genotyping achieves the goal of recording the SNP
location base values of thousands of people from a given population. The genotype data gathered from high-throughput genotyping is then used to determine the related haplotypes. This information is then used for SNP mapping which is discussed further on in this summary.
SNP genotyping has introduced a complex computational problem (luckily). This problem was first published by Lancia et al. in the paper "SNPs Problems, Complexity and Algorithms", and followed by "Algorithmic Strategies for the single nucleotide polymorphism haplotype assembly problem" by Lippert et al. Before understanding the motivation for this problem, it is important to be familiar with the following terms.
Large-scale human genotyping technology introduces an interesting
algorithmic problem of partitioning SNP genotype data into
haplotype partitions. The problem arises because the genomic fragments that constitute the genotype of a diploid organism contain two copies of each location or chromosome (two haplotypes). The polymorphic DNA fragments must then be assembled into their original haplotype.
The papers describe algorithms to determine haplotypes over long regions from short sequence
fragments. The aligned fragments must be partitioned into two sets according
to their original homologue. This partitioning is accomplished by using
conflicting SNP values between fragments to create a SNP conflict graph or a fragment
conflict graph. The following is a summary of the two aforementioned papers.
The SNP Assembly Problem (Lancia)
A SNP assembly is a tuple $(S, F, R)$ where $S$ is a set of $n$ SNPs, $F$ is a set of $m$ fragments and $R$ is a relation $R: S \times F \to \{0, A, B\}$ indicating whether a SNP $s \in S$ does not occur on a fragment $f \in F$ (marked by 0) or, if occurring, the "score" of $s$ ($A$ or $B$).
Computational Haplotyping (Lancia)
A haplotyping is a partition of $F$ into two blocks, $H_1$ and $H_2$, called haplotypes. A SNP assembly is 'feasible' when there exists a haplotyping such that:
$\forall s \in S$ and $\forall f, f' \in H_i$: $R(s,f) = R(s,f')$ or $R(s,f) = 0$ or $R(s,f') = 0$.
A present SNP $s \in S$ has value either $A$ or $B$. Recall that the organism is diploid and there are copies of that SNP in fragments taken from both homologues of the sample's chromosome. If the SNP is heterozygous, then the SNP will have value $A$ in $H_1$ and $B$ in $H_2$ (or vice versa). If the SNP is homozygous, it cannot be used to help determine the fragment's haplotype and is therefore not considered.
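The feasibility condition can be checked directly for a proposed partition. The sketch below is only an illustration: it assumes R is stored as a Python dictionary keyed by (SNP, fragment) pairs, with missing keys standing for 0, and the function name and toy data are hypothetical rather than taken from Lancia et al.

```python
def is_feasible_haplotyping(snps, R, h1, h2):
    """Check that no SNP takes both values A and B within one block.

    R maps (snp, fragment) -> "A" or "B"; a missing key means 0 (SNP absent).
    """
    for block in (h1, h2):
        for s in snps:
            # Collect the non-zero values of SNP s over the block's fragments.
            values = {R.get((s, f), 0) for f in block} - {0}
            if len(values) > 1:
                return False
    return True


# Toy assembly (invented): three fragments covering two SNPs.
snps = ["s1", "s2"]
R = {
    ("s1", "f1"): "A", ("s2", "f1"): "B",
    ("s1", "f2"): "A",
    ("s1", "f3"): "B", ("s2", "f3"): "A",
}
print(is_feasible_haplotyping(snps, R, ["f1", "f2"], ["f3"]))  # True
print(is_feasible_haplotyping(snps, R, ["f1", "f3"], ["f2"]))  # False
```

Requiring at most one non-zero value per SNP within a block is equivalent to the pairwise condition stated above.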
$S$ and $F$ are both used to create conflict graphs $G_S$ and $G_F$ so that an algorithm can determine the fewest $s \in S$ or $f \in F$ that must be removed as errors to render the graphs conflict-free. Two fragments $f_i$ and $f_j$ are in conflict when there exists a SNP $s \in S$ such that $R(s, f_i) \ne 0$ and $R(s, f_j) \ne 0$ and $R(s, f_i) \ne R(s, f_j)$. Two SNPs $s_1$ and $s_2$ are in conflict when there exist two fragments $f_1$ and $f_2$ such that three of $R(s_1, f_1)$, $R(s_1, f_2)$, $R(s_2, f_1)$ and $R(s_2, f_2)$ have the same non-zero value and one has the opposing non-zero value.
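As a rough illustration of the fragment conflict definition, the following sketch builds the edge set of the fragment conflict graph. It assumes the same dictionary encoding of R used for illustration here ((SNP, fragment) keys, missing keys meaning 0); the names and toy data are hypothetical.

```python
from itertools import combinations


def fragment_conflict_graph(snps, fragments, R):
    """Return the set of fragment pairs that disagree on some shared SNP."""
    edges = set()
    for fi, fj in combinations(fragments, 2):
        for s in snps:
            a = R.get((s, fi), 0)
            b = R.get((s, fj), 0)
            # Conflict: both fragments cover s and report different values.
            if a != 0 and b != 0 and a != b:
                edges.add((fi, fj))
                break
    return edges


# Toy assembly (invented): f3 disagrees with both f1 and f2 on s1.
snps = ["s1", "s2"]
fragments = ["f1", "f2", "f3"]
R = {
    ("s1", "f1"): "A", ("s2", "f1"): "B",
    ("s1", "f2"): "A",
    ("s1", "f3"): "B", ("s2", "f3"): "A",
}
print(sorted(fragment_conflict_graph(snps, fragments, R)))
# [('f1', 'f3'), ('f2', 'f3')]
```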
The SNP haplotype assembly problem has now become either the Minimum Fragment Removal (MFR) Problem or the Minimum SNP Removal (MSR) Problem. These problems take $G_F$ or $G_S$, respectively, as input and return a maximally large $G_F'$ that is bipartite (MAX Induced Bipartite Subgraph problem) or a maximally large $G_S'$ that is a stable set (Vertex Cover). Note that $G_F$ is bipartite iff $G_S$ is a stable set. Both of these problems are NP-hard.
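When the fragment conflict graph is already bipartite, a two-colouring of its vertices gives the two haplotype blocks directly, since conflicting fragments always land in different blocks. The sketch below uses a standard breadth-first two-colouring; the function name and example graphs are illustrative assumptions, not code from either paper.

```python
from collections import deque


def two_color(vertices, edges):
    """Return {vertex: 0 or 1} if the graph is bipartite, otherwise None."""
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    color = {}
    for start in vertices:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in color:
                    # Place each neighbour in the opposite block.
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle found: not bipartite
    return color


# Invented conflict graph: f3 conflicts with f1 and f2.
print(two_color(["f1", "f2", "f3"], [("f1", "f3"), ("f2", "f3")]))
# {'f1': 0, 'f3': 1, 'f2': 0}

# A triangle is an odd cycle, so no valid partition exists.
print(two_color(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]))
# None
```

Fragments coloured 0 would form one block and fragments coloured 1 the other.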
Maximum stable set (Vertex Cover) is a very well studied problem that can be attacked using one of many approximation algorithms. We will therefore focus on the problem of finding the largest bipartite subgraph in $G_F$. A graph $G$ is bipartite iff it contains no odd cycles. We will therefore consider the MFR problem in terms of a cover problem, the Odd Cycle Cover (OCC) problem.
Clearly, OCC determines the minimal set of vertices $F'$ to remove from $G_F$ to eliminate all odd cycles, and therefore render the graph bipartite. OCC has a 9/4-approximation algorithm. It has not yet been shown to be Fixed Parameter Tractable (FPT). We would like to consider an algorithm with bounded exponential time to solve a parameterized version of this problem, $k$-OCC. The problem of $k$-Odd Cycle Cover is as follows: given a graph $G = (V, E)$, find $V' \subseteq V$ with $|V'| \le k$ and $|V'|$ minimal such that the subgraph $G'$ induced by $V - V'$ is bipartite. This is the central function to establish when solving MFR. The second paper describes a naive branch and bound approach based on searching a tree of possible solutions. Our approach involves creating a set of reduction rules to condense $G_F \to G_{FR}$, making it possible to apply efficient (parameterized) algorithms to find a small ($\le k$) odd cycle cover of $G_{FR}$, or to determine that $G_{FR}$ contains more than $k$ disjoint odd cycles (and therefore does not have a small enough OCC) or contains one other graph structure that disallows a small enough OCC.
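To make the k-OCC problem statement concrete, here is a deliberately naive exhaustive search. It is neither the reduction-rule approach outlined above nor the branch and bound method of the second paper; it simply tries every vertex subset of size 0 up to k, which is exponential and only usable on tiny graphs. All names and example data are hypothetical.

```python
from itertools import combinations


def is_bipartite(vertices, edges):
    """Two-colour the graph by depth-first search; False if an odd cycle exists."""
    adjacency = {v: [] for v in vertices}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    color = {}
    for start in vertices:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def k_odd_cycle_cover(vertices, edges, k):
    """Return a smallest vertex set of size <= k whose removal leaves a
    bipartite graph, or None if no such set exists."""
    vertices = list(vertices)
    for size in range(k + 1):  # smallest sizes first, so the result is minimal
        for removed in combinations(vertices, size):
            removed_set = set(removed)
            kept = [v for v in vertices if v not in removed_set]
            kept_edges = [(u, v) for u, v in edges
                          if u not in removed_set and v not in removed_set]
            if is_bipartite(kept, kept_edges):
                return removed_set
    return None


# Invented example: a triangle needs exactly one vertex removed.
triangle = [(1, 2), (2, 3), (1, 3)]
print(k_odd_cycle_cover([1, 2, 3], triangle, 0))  # None
print(k_odd_cycle_cover([1, 2, 3], triangle, 1))  # {1}
```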
This is a work in progress. However, it is good to read Lancia's papers to understand the algorithmic problems related to SNP haplotyping. Let it be known that there are strong statistical methods for performing haplotyping, and as the costs of SNP assays go down, there are also biological methods for determining the haplotypes.
SNP mapping (association mapping) is one of the most active areas of
SNP post-genomics research. This work involves identifying SNP sites along
the genome to track disease genes. A human
SNP map specifies the contributions of individual genes to diseases
and other phenotypes.
The SNP Consortium Ltd. is a non-profit foundation that provides over one
million detected SNPs and their annotations to the public. Its mission is to develop up to
300,000 SNPs distributed evenly throughout the human genome and to make
the information related to these SNPs available to the public without
intellectual property restrictions.
http://snp.cshl.org/
SNP mapping succeeds in identifying individual genes responsible for
monogenic diseases such as Huntington's and cystic fibrosis; however,
the majority of traits are influenced by multiple genes and environmental
factors. As an extension to basic SNP mapping, human genetic variation
research
determines how variation among
individuals or groups contributes to the health status of that individual
or group. This type of research aims to develop a
haplotype map of the human genome. The map's purpose is to relate human
genetic
variation with disease predisposition, specifically common or complex
disorders.
Scientists have discovered that
there is a small number of different versions of certain genetic blocks (small number of haplotypes). This
means that there is a small number of SNP patterns at each chromosomal
position. For instance, for some blocks, only four or five patterns of
SNPs
were found (four or five different haplotypes) that account for 80%-90%
of the entire population. This finding greatly simplifies the search for
associations
between DNA variations and disease.
Also, as many such haplotype patterns are specific to populations
(groups), the map will facilitate the conduct of association
studies in selected populations where certain diseases are more or less
prevalent.
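The observation that a handful of haplotype patterns account for most of a population can be illustrated by counting how many of the most frequent patterns are needed to reach a given coverage. The sample below is entirely invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def patterns_needed(haplotypes, coverage):
    """Return how many of the most common patterns cover the given fraction."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= coverage:
            return rank
    return len(counts)


# Invented sample of 100 haplotypes over three SNP positions in one block.
sample = (["ACG"] * 40 + ["ATG"] * 30 + ["GCG"] * 15
          + ["ACT"] * 10 + ["GTT"] * 5)
print(patterns_needed(sample, 0.8))  # 3
print(patterns_needed(sample, 0.9))  # 4
```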
The following information from CHI puts haplotype mapping in context.
Haplotype trees provide methods for examining the phylogeny of individuals based on their haplotypes and also provide methods for understanding molecular (genomic) natural selection. They are constructed to understand human evolution and historical timelines and to determine genealogy genetically. These trees can be created for one species or can be created to represent inter-species haplotypic phylogenies. Recall that there are a small number of haplotypes (unique patterns) for a chosen location of interest. A population or haplotype group is a set of highly similar haplotypes. Often the haplotype under consideration is a maternally inherited gene or a set of locations on one of the sex chromosomes. It has been shown that members of a population generally share the same haplotype pattern. These trees are often combined with homology-based trees to provide a more reliable portrait of genealogies.
Haplotype trees are constructed parsimoniously with unique haplotypes being represented by the nodes of the tree. Haplotypes develop from older ancestral haplotypes. These older haplotypes are believed to be more widespread over the species, and are therefore generally represented by internal nodes, whereas newer haplotypes (more recently emerged patterns) will be represented by leaf nodes. Note that it is essential to select a haplotype that is robust to recombination and mutations.
Jin et al. describe the use of a haplotype tree for studying migratory history, genetic drift and natural selection in their 1999 paper "Distribution of haplotypes from a chromosome 21 region distinguishes multiple prehistoric human migrations". The haplotypes considered in their research originate from a 565-bp chromosome 21 region near the MX1 gene that contains 12 SNP locations and lacks recombination and recurrent mutation events. The paper notes that some haplotypes occur much more frequently than others and that the variety of haplotypes present in different populations varies, with older populations displaying higher haplotype variety.
An example of a commercial application of haplotype trees is the Internet company "DNA Family Tree" (http://www.familytreedna.com/), a self-declared "DNA-driven genealogical testing company". The scientists at this company use a set of statistical and phylogenetic (tree-building) methods that show the exact genealogical linkages amongst the haplotypes (either Y chromosome or mitochondrial DNA) to determine the ancient lineage of an interested client. In particular, one of the products is used to determine if a female client has Native American ancestry. The company can determine this because "genetic studies have shown that Native American mtDNAs belong to one of five distinct maternal lineages. These have been designated haplogroups A, B, C, D and X. Each of these is defined by a specific set of mutations/markers that occur in the coding and non-coding regions of the mtDNA genome." This use of haplotypes again introduces many ethical concerns and it is not difficult to imagine how haplotype technology could facilitate genomic discrimination.
The following are fields that utilize SNP information and haplotype maps. The major application of SNP information is towards improved and futuristic health care. Genomics and specifically SNP research can be used to improve health care through gene therapy, to yield new targets for drug discovery, to refine the process of drug development and to discover new diagnostics.
All aspects of pharmacogenomics require data from high-throughput
genotyping, specifically the target population for a drug or the
population of people who react poorly with the drug. Also, this type of
research may lead to population-specific treatments. The high cost of drug
recalls has provided an incentive for advanced drug
design involving drug-target validation studies as well as studies to predict adverse
events and lack of efficacy.
A sample pharmacogenomic experiment may proceed as follows:
An individual's genotype can be determined and then analysed according to a haplotype map to determine the patient's disease risk or response to different treatments.
SNP-related functional proteomics involves the identification of functional
SNPs that modify proteins and the structure and function of protein active sites.
Functional proteomics is closely tied to modern (post-genomic) drug design
and functional SNP information helps to discover new therapeutic targets.
Most interestingly, by developing a database of the modifications
generated
by functional (coding) SNPs in disease related proteins, "new compounds
can
be designed for correcting or enhancing the effects of those mutations in
the population." [Source: Genodyssee]
What are these compounds and how can knowledge of SNP effects be used to
correct populations with undesirable SNPs or enhance populations by
introducing the advantages of a desirable SNP? Aside from drugs, here are
some interesting genomic therapies that may become more feasible as SNP information in the form of trees and maps becomes more detailed.
The following is an excerpt from the Cambridge Healthtech Institute Article:
"SNP-Research Market Could Reach $1.2 Billion By 2005, If Pharmacogenomics
Advances
January 1, 2002". By Malorye Allison Branca. [Source:
http://www.chireports.com/content/articles/snpresearch.asp]
"Annual expenditures on single nucleotide polymorphism (SNP) research could
increase sevenfold by 2005, growing to more than $1.2 billion, from $158
million in 2001, according to a new Cambridge Healthtech Institute (CHI)
report, Commercial Implications of Advances in the Identification,
Mapping, and Application of Single Nucleotide Polymorphisms. (For more on
this report, go to www.chireports.com/content/reports/snpupdate01.asp.)"
"Three major factors will fuel this growth:
"Currently, the most common applications for SNP-related research tools are
gene-disease association studies and drug-target validation. Other popular
applications are disease-susceptibility studies or diagnostics,
pharmacogenomic studies for clinical trials, drug-target screening, and
new technology development. We anticipate that target validation and
disease association studies will continue to be the most common
SNP-related tasks in drug discovery and development; however, we expect
that by 2003, the application of SNP studies in pharmacogenomics will
begin to increase steadily and could quickly become a multibillion-dollar
market itself."
Affymetrix: www.affymetrix.com
Cambridge Healthtech Institute Articles: http://www.chireports.com
Cambridge Healthtech Institute Glossary: http://www.genomicglossaries.com/
DNA Family Tree company: http://www.familytreedna.com/
Lancia, G., Bafna, V., Istrail, S., Lippert, R. and Schwartz, R. SNPs Problems, Complexity and Algorithms. ESA 2001, LNCS 2161, pp. 182-193. Springer-Verlag Berlin Heidelberg 2001.
Jin, Li, et al. Distribution of haplotypes from a chromosome 21 region distinguishes multiple prehistoric human migrations. Proc. Natl. Acad. Sci. USA Vol. 96, pp. 3796-3800, March 1999.
Lippert, R., Schwartz, R., Lancia, G. and Istrail, S. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in Bioinformatics. Volume 3, NO 1, 1-9. February 2002.
Map of the Human Genome 3.0, BioSINO, Laura Helmuth
http://www.biosino.org/bioinformatics/010814-4.htm
Orchid BioSciences: www.orchid.com
Pharmaceutical Research and Manufacturers of America (PhRMA) Genomics glossary: http://genomics.phrma.org/lexicon/
PolyGenyx: http://www.polygenyx.com/
SNP details: SNP Web Source, Xuan Chen (offline)
Studies of the ethical, legal and social implications of human genetic variation research for individuals and diverse racial and ethnic groups: http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-02-003.html