
Last updated September 30, 2003

The material here is intended to provide a brief background in information theory, as part of an overall framework for understanding "Karlin-Altschul" statistics.

- References
- Shannon's Entropy, I
- Uniform Probability Distribution
- Nonuniform Probability Distribution
- Relative Entropy, H
- Why Is Relative Entropy *Relevant*
- Karlin-Altschul (or Dembo-Karlin) Statistics
- The "Edge Effect"
- Statistical Inference in Database Searching

- Karlin, S, and SF Altschul (1990). *Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.* Proc. Natl. Acad. Sci. USA **87**:2264-68.

Additional covered or supporting material is contained in:

- Altschul, SF (1991). *Amino acid substitution matrices from an information theoretic perspective.* J. Mol. Biol. **219**:555-65.
- Altschul, SF (1993). *A protein alignment scoring system sensitive at all evolutionary distances.* J. Mol. Evol. **36**:290-300.
- Altschul, SF, and W Gish (1996). *Local alignment statistics.* Methods Enzymol. **266**:460-80.
- Altschul, SF, Bundschuh, R, Olsen, R, and T Hwa (2001). *The estimation of statistical parameters for local alignment score distributions.* Nucleic Acids Res. **29**:351-61.
- Karlin, S, and SF Altschul (1993). *Applications and statistics for multiple high-scoring segments in molecular sequences.* Proc. Natl. Acad. Sci. USA **90**:5873-77.
- Pearson, WR (1998). *Empirical statistical estimates for sequence similarity searches.* J. Mol. Biol. **276**:71-84.
- Shannon, CE (1948). *A mathematical theory of communication.* The Bell System Tech. J. **27**:379-423.
- States, DJ, W Gish, and SF Altschul (1991). *Improved sensitivity of nucleic acid database similarity searches using application specific scoring matrices.* Methods: A companion to Methods Enzymol. **3**:66-70.

For detailed coverage of the extreme value distribution, see:

- Lawless, JF (1982). *Statistical models and methods for lifetime data.* Wiley & Sons, New York, NY. (*Note*: this book is available in the Becker Library.)

For a modern treatment of gapped alignment scores and the "edge effect", see:

- Altschul, SF, Bundschuh, R, Olsen, R, and T Hwa (2001). *The estimation of statistical parameters for local alignment score distributions.* Nucleic Acids Res. **29**:351-61.
- Yu, YK, and T Hwa (2001). *Statistical significance of probabilistic sequence alignment and related local hidden Markov models.* J. Comput. Biol. **8**:249-82.

For detailed coverage of information theory, see:

- Cover, TM, and JA Thomas (1991). *Elements of information theory.* Wiley & Sons, New York, NY.

A good introductory text on algorithms (such as Huffman coding) is:

- Sedgewick, R (1992). *Algorithms in C++.* Addison-Wesley.

Abraham Wald's Sequential Probability Ratio Test (SPRT) may help one to grasp the meaning of the sums of log-odds scores that are evaluated with Karlin-Altschul statistics and the drop-off score, X, used in the BLAST algorithm.

An excellent treatise on practical aspects of numerical computing (including arithmetic encoding) is:

- Press, WH, Teukolsky, SA, Vetterling, WT, and BP Flannery (1992). *Numerical Recipes in C.* Cambridge University Press, Cambridge, UK.

For a combined historical and philosophical perspective on information theory and the study of genetics and molecular biology, see:

- Kay, LE (2000). *Who wrote the book of life: a history of the genetic code.* Stanford University Press.

Information theory deals with messages that are numerically coded for their
description, storage and communication. For our work, messages will consist of
sequences of letters or symbols chosen from some alphabet, S. For example, to encode DNA sequences, which constitute one
kind of message, we are accustomed to using the set of IUPAC/IUB codes where
S = {A, C, G, T}. Later on, we'll see that sequence
alignments are themselves messages, selected for having certain properties (such
as the highest possible score) and where the symbols used to spell out these
messages are the aligned letter-*pairs*.

To perform statistical analyses on messages or sequences, we need a reference
model. The model we will use is a simple one that merely requires that each
letter in a sequence be selected from a defined alphabet in an independent and
identically distributed (*i.i.d.*) manner. This choice of model system will
allow us to compute the statistical significance of certain characteristics of a
sequence or its subsequences. In addition, we'll see that the model defines the
context or assumptions under which our statistical interpretations are valid.

Given a probability distribution, *P_{i}*, for the letters in an alphabet, the
probability of a particular sequence of letters *i, j, k, ···, n* under the
*i.i.d.* model is simply the product of the individual letter probabilities,
*P_{i} P_{j} P_{k} ··· P_{n}*.

As an alternative to multiplication of the probabilities, we could sum their
logarithms and exponentiate the result. The probability of the same sequence of
letters can be computed by exponentiating *log P_{i} + log P_{j} + log
P_{k} + ··· + log P_{n}*.

Log-probabilities and their sums represent measures of **information**.
Conversely, information can be thought of as log-probabilities. Since
probabilities are fractional values between zero and one, their logarithms have
negative values (or zero in the uninteresting case that only one possible
outcome exists). For convenience, information can be expressed as a negative
log-probability, so increasing quantities of information become increasingly
positive-valued. If this convention is followed, then we need to remember to use
a minus sign when converting from information space to probability space, as in
*P = e ^{-I}*.
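
As a minimal sketch of these conventions (the letter probabilities below are arbitrary examples, not values from the text), the following Python snippet computes a sequence probability by direct multiplication, by exponentiating a sum of log-probabilities, and by converting to and from information expressed as a negative log:

```python
import math

# Arbitrary example probabilities for a few letters (hypothetical values).
letter_probs = [0.25, 0.45, 0.05, 0.25]

# Probability of the sequence by direct multiplication.
p_product = math.prod(letter_probs)

# The same probability via summing log-probabilities and exponentiating.
p_from_logs = math.exp(sum(math.log(p) for p in letter_probs))

# Information (in nats) as a negative log-probability: I = -log P.
info_nats = -sum(math.log(p) for p in letter_probs)

# Converting back from information space to probability space: P = e^{-I}.
p_from_info = math.exp(-info_nats)

print(p_product, p_from_logs, p_from_info)   # all three agree
print("information:", info_nats, "nats =", info_nats / math.log(2), "bits")
```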

Information is measured in units that depend on the base of the logarithm.
When taking logarithms to the base 2 (*i.e.,* log_{2}), the resultant information is expressed in units of
*bits* (a contraction of *bi*nary digi*ts*). When taking Natural
logarithms of probabilities, the resultant information is expressed in units of
*nats* (a contraction of *Na*tural digi*ts*). Bits and nats are
interconvertible by the proportionality constant log(2).

nats = bits · log(2)

log(2) ~= 0.693

1/log(2) ~= 1.44

Generalizing, for a given base of the logarithm, *b*,

log(x) = log(b) · log_{b}(x)

Using logarithms to arbitrary bases, information can be expressed in
arbitrary units, not just bits and nats, such that
-log_{b}(*P*) = -*k* log(*P*), where
*1/k* = log(*b*). When working with information in units of bits,
*k* = 1/log(2). To keep the exposition a little simpler, we'll confine
ourselves to working in units of nats, so *k* can be omitted.

Shannon's entropy, *I* = - sum_i *P_{i} log P_{i}*, is
the expected (weighted arithmetic mean) value of the information, *-log P_{i}*, per letter.
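
As a small illustrative sketch (not code from the original text), Shannon's entropy can be computed directly from a letter distribution; the uniform and nonuniform DNA distributions used later in this section are convenient test cases:

```python
import math

def shannon_entropy(probs, base=2.0):
    """Shannon's entropy I = -sum_i P_i log P_i, in units set by `base`
    (base 2 gives bits, base e gives nats)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]      # P_A = P_C = P_G = P_T = 0.25
nonuniform = [0.05, 0.05, 0.45, 0.45]   # the biased distribution discussed below

print(shannon_entropy(uniform))     # 2.0 bits per letter
print(shannon_entropy(nonuniform))  # about 1.47 bits per letter
```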

Huffman codes are often used for data compression because of their
simplicity. A binary Huffman code provides an optimal, MDL encoding for
sequences of *i.i.d.* letters. While better compression can often be
obtained by identifying higher-order patterns within a message or by adapting
the encoding algorithm to a changing distribution as the message is read, such
methods are more complex and, if we are constrained to considering letters
one-by-one in a strictly *i.i.d.* manner, David Huffman's method is as good as it gets. If you are
interested, however, a perfect, entropy-bounded encoding that requires precisely
*-log_{2} P_{i}* bits per letter can be approached with arithmetic coding (see the Numerical Recipes reference above).

The description length of a message when using a Huffman code is expected to
be about equal to the Shannon or arithmetic code length. Thus, a Huffman-encoded
200-letter sequence will average about 200*I* bits in length. For some
letters in the alphabet, the Huffman code may be more or less efficient in its
use of bits than the Shannon code (and vice versa), but the expected behavior
averaged over all letters in the alphabet is the same for both (within one bit).

Consider the 4-letter DNA alphabet S = {A, C, G, T}.
Each letter in a DNA sequence can be one of 4 possibilities. With a uniform
probability distribution for the letters, *P_{A} = P_{C} = P_{G} = P_{T} = 0.25*. The negative
logarithm (base 2) of 0.25 is 2, so each letter carries 2 bits of information and
Shannon's entropy for the uniform distribution is *I* = 2 bits per letter.

Consider the nonuniform probability distribution

*P_{A}* = 0.05, *P_{C}* = 0.05, *P_{G}* = 0.45, *P_{T}* = 0.45

Shannon's entropy for this distribution is *I* = -(2 · 0.45 log_{2} 0.45 + 2 · 0.05 log_{2} 0.05) ≈ 1.47 bits per letter.

If we build a binary Huffman code (tree) for the above distribution, 1 bit
would be required to code for *G* and 2 bits to code for *T* (or vice
versa); and 3 bits each to code for *A* and *C*. The "Huffman entropy"
in this case is 1·0.45 + 2·0.45 + 3·0.05 + 3·0.05 = 1.65 bits per letter --
which is not quite as efficient as the previous Shannon code.
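
To make the tree construction concrete, here is a brief illustrative sketch (not from the original text) that builds a binary Huffman code for the distribution above and reports its expected code length:

```python
import heapq

def huffman_code_lengths(probs):
    """Build a binary Huffman tree for a {symbol: probability} dict and
    return the code length (in bits) assigned to each symbol."""
    # Each heap entry: (probability, tie-breaker, list of (symbol, depth)).
    heap = [(p, i, [(sym, 0)]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        merged = [(sym, d + 1) for sym, d in leaves1 + leaves2]
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return dict(heap[0][2])

probs = {"A": 0.05, "C": 0.05, "G": 0.45, "T": 0.45}
lengths = huffman_code_lengths(probs)
expected_bits = sum(probs[s] * lengths[s] for s in probs)
print(lengths)         # G and T get 1 and 2 bits (in some order); A and C get 3 bits each
print(expected_bits)   # 1.65 bits per letter
```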

When a message exhibits a nonuniform letter distribution, some letters
require fewer bits and others require more bits to encode than if the
distribution were uniform; but Shannon's entropy, which is an average over all
letters in the alphabet, is always less for a nonuniform distribution than for a
uniform one. Continuing with the Huffman tree analogy, *on average* we
expect to need fewer bits to encode each letter in a message (or sequence) when
the Huffman tree is built precisely according to the message's nonuniform letter
distribution.

As a corollary of the above, uniform letter distributions are expected to
yield suboptimal description lengths (*i.e.,* longer descriptions) whenever
the actual letter distribution of the message is *non*uniform. Uniform
distributions are advised for use when reliable, *a priori* knowledge of
the true letter distribution is unavailable. Assumption of a uniform
distribution provides a conservative bound on the expected description length:
*log _{2}(1/N)* bits per letter, where

In general, when we know of a bias in a sequence's letter composition, it can be used to our advantage by permitting the sequence to be described (encoded) using fewer bits. If the composition is biased and we don't know it -- or if it's biased and we just don't know what that bias is -- conservative use of the uniform distribution will yield a longer encoding than if we had known the true distribution, but at least the encoded length will be well-behaved regardless of the actual bias. If the composition is biased one way and we incorrectly assume it is biased in another, the encoded sequence may be far longer than its optimally encoded length, thus making inefficient use of valuable bits (although bits have dropped in price considerably over time!).

Relative entropy, *H*, is a measure of the expected coding
*in*efficiency per letter incurred by describing a message with an assumed letter
distribution *P* when the true distribution is *Q*. If a binary
Huffman code is used to describe the message, the relative entropy concerns the
inefficiency of using a Huffman code built for the wrong distribution.

Whereas **Shannon's entropy**, *I*, is the expected log-likelihood
for a single distribution, **relative entropy** is the expected logarithm of
the likelihood *ratio* (log-likelihood ratio or **LLR**) between two
distributions.

The relative entropy is defined as:

H(Q || P) = sum_i Q_{i} log(Q_{i} / P_{i})

The numerical sign of an individual log-odds ratio *log(Q_{i} / P_{i})* tells us:

- LLR greater than 0 => letter *i* was more likely chosen using *Q* than *P*;
- LLR equal to 0 => letter *i* was as likely selected using *Q* as it was using *P*;
- LLR less than 0 => letter *i* was more likely chosen using *P* than *Q*.

Relative entropy is sometimes called the *Kullback-Leibler distance*
between the distributions *P* and *Q*. Note that relative entropy is
always nonnegative and will be zero if and only if *P = Q*. Furthermore,
*H(P || Q)* is generally not equal to *H(Q || P)*.
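
A short sketch (illustrative only, with the distributions from the earlier DNA example) computes relative entropy and demonstrates both its nonnegativity and its asymmetry:

```python
import math

def relative_entropy(q, p, base=2.0):
    """H(Q || P) = sum_i Q_i log(Q_i / P_i); units set by `base` (2 = bits)."""
    return sum(qi * math.log(qi / pi, base) for qi, pi in zip(q, p) if qi > 0)

P = [0.25, 0.25, 0.25, 0.25]   # assumed (background) distribution
Q = [0.05, 0.05, 0.45, 0.45]   # true (target) distribution

print(relative_entropy(Q, P))  # H(Q || P) > 0: extra bits paid for assuming P
print(relative_entropy(P, Q))  # H(P || Q): generally a different value
print(relative_entropy(P, P))  # 0.0 when the two distributions are identical
```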

Relative entropy is symbolized in the present text by *H* and should not
be confused with Shannon's entropy, which is represented in some texts by
*H*, as well. As set forth earlier, *I* is used here to represent
Shannon's entropy.

What does relative entropy have to do with biological sequence features and alignments? Let's consider a simple situation...

Given a scoring matrix (or scoring *vector*, in this one-dimensional
case) and a model sequence generated *i.i.d.* according to some
**background** distribution, *P*, we can write a program that finds the
maximal scoring segment (**MSS**) within the sequence. Here the complete,
full-length sequence represents one message, while the MSS reported by our
program constitutes another. For example, we could create a hydrophobicity
scoring matrix that assigns increasingly positive scores to more hydrophobic
amino acids, zero score to amphipathic residues, and increasingly negative
scores to more hydrophilic residues. (Karlin and Altschul (1990) provide other
examples of scoring systems, as well). By searching for the MSS -- the "most
hydrophobic" segment in this case -- we select for a portion of the sequence
whose letters exhibit a different, hydrophobic distribution than the random
background. By convention, the distribution of letters in the MSS is called the
**target** distribution and is signified here by *Q*. (Note: some texts
may signify the background distribution as *Q* and the target distribution
as *P*).
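
As a minimal sketch of such a program (the score values and sequence below are invented for illustration, not a published hydrophobicity scale), the maximal scoring segment of a sequence can be found with a single linear scan:

```python
def maximal_scoring_segment(sequence, score):
    """Return (best_score, start, end) for the maximal scoring segment (MSS),
    i.e. the contiguous run of letters whose summed scores are highest.
    `score` maps each letter to its (possibly negative) score."""
    best, best_start, best_end = float("-inf"), 0, 0
    running, run_start = 0.0, 0
    for i, letter in enumerate(sequence):
        if running <= 0:              # a negative prefix never helps; restart here
            running, run_start = 0.0, i
        running += score[letter]
        if running > best:
            best, best_start, best_end = running, run_start, i + 1
    return best, best_start, best_end

# Hypothetical hydrophobicity-style scores: positive = hydrophobic,
# zero = amphipathic, negative = hydrophilic (values are illustrative only).
scores = {"I": 3, "L": 3, "V": 2, "A": 1, "G": 0, "S": -1, "K": -3, "E": -3}
seq = "KESAGILVLIAGSKEE"
print(maximal_scoring_segment(seq, scores))   # the "most hydrophobic" segment
```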

If the letters in the MSS are described using an optimal Shannon code for the background distribution, we expect a longer description than if a code optimized for the target distribution was used instead. These extra bits -- the extra description length -- tell us the relative odds that the MSS arose by chance from the background distribution rather than the target.

Why would we consider a target distribution, *Q*, when our original
sequence was generated from the background distribution, *P*? We'll soon
see that the target frequencies are indeed special; that the background and
target frequencies are formally related to one another by the scoring system;
and how the odds of chance occurrence of the MSS are related to its score.

Database similarity searches typically yield high-scoring pairwise sequence alignments such as the following:

```
Score = 67 (28.6 bits), Expect = 69., P = 1.0
Identities = 13/33 (39%), Positives = 23/33 (69%)

Query: 164 ESLKISQAVHGAFMELSEDGIEMAGSTGVIEDI 196
           E+L + Q+  G+  ELSED ++  G +GV+E++
Sbjct: 381 ETLTLRQSSFGSKCELSEDFLKKVGKSGVVENL 413

           +++-+-++--++--+++++-++- + +++++++
   Scores: 514121512161315445632110601643512
```

The total score for an alignment (67 in the above case) is simply the sum of pairwise scores. The individual pairwise scores are listed beneath the alignment above (+5 on the left through +2 on the right).

Note that some pairs of letters may yield the same score. For example, in the
BLOSUM62 matrix used to find the above alignment, *S(S,T) = S(A,S) = S(E,K) =
S(T,S) = S(D,N) = +1*. While alignments are usually reported without the
individual pairwise scores shown, the statistics of alignment scores depend
implicitly on the probabilities of occurrence of the scores, not the letters.
Dependency on the letter frequencies is factored out by the scoring matrix. We
see then that the "message" obtained from a similarity search consists of a
sequence of pairwise *scores* indicated by the high-scoring alignment.
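
The total alignment score is just such a sum; a small sketch (using a toy score lookup with only a few invented entries, not the full BLOSUM62 matrix) makes this explicit:

```python
# Toy lookup of pairwise scores; a real application would load the full
# (symmetric) substitution matrix. Values here are illustrative only.
pair_score = {("E", "E"): 5, ("S", "T"): 1, ("L", "L"): 4, ("K", "T"): -1}

def alignment_score(query, sbjct, matrix):
    """Sum the pairwise scores over the aligned (gap-free) letter pairs."""
    return sum(matrix.get((q, s), matrix.get((s, q), 0))
               for q, s in zip(query, sbjct))

print(alignment_score("ESLK", "ETLT", pair_score))   # 5 + 1 + 4 - 1 = 9
```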

If we search two sequences *X* and *Y* using a scoring matrix
*S_{ij}* to identify the maximal-scoring segment pair (**MSP**), then provided that:

- the two sequences are *i.i.d.* and have respective background distributions *P_{X}* and *P_{Y}* (which may be the same),
- the two sequences are effectively "long" or infinite and not too dissimilar in length,
- the expected pairwise score *sum_i,j P_{X}(i) P_{Y}(j) S_{ij}* is negative, and
- a positive score is possible, *i.e., P_{X}(i) P_{Y}(j) S_{ij} > 0* for some letters *i* and *j*,

Karlin and Altschul tell us that:

- the scores in the scoring matrix are implicitly **log-odds scores** of the form

  S_{ij} = log(Q_{ij} / (P_{X}(i) P_{Y}(j))) / l

  where *Q_{ij}* is the limiting target distribution of the letter pairs *(i,j)* in the MSP and *l* is the unique positive-valued solution to the equation *sum_i,j P_{X}(i) P_{Y}(j) e^{l S_{ij}} = 1* (a numerical sketch for solving this equation appears after this list); and

- the expected frequency of chance occurrence of an MSP with score *S or greater* is

  E = K M N e^{-lS}

  where *M* and *N* are the lengths of the two sequences, *M N* is the size of the search space, and *K* is a measure of the relative independence of the points in this space in the context of accumulating the MSP score. (While the complete sequences *X* and *Y* must be *i.i.d.*, the letter pairs comprising the MSP do exhibit an interdependency.)
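
As an illustrative numerical sketch (the two-letter alphabet, scores, and background frequencies below are invented for demonstration), *l* can be found by bisection, after which the implied target frequencies *Q_{ij}* can be recovered and checked to sum to 1:

```python
import math

def solve_lambda(bg_x, bg_y, scores, lo=1e-6, hi=10.0, iters=100):
    """Find the unique positive l satisfying
    sum_{i,j} P_X(i) P_Y(j) exp(l * S_ij) = 1, by bisection.
    Assumes a negative expected score and at least one positive score."""
    def f(lam):
        return sum(bg_x[i] * bg_y[j] * math.exp(lam * s)
                   for (i, j), s in scores.items()) - 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Invented example: a match/mismatch scoring scheme over a two-letter alphabet.
bg = {"R": 0.5, "Y": 0.5}
scores = {("R", "R"): 1, ("Y", "Y"): 1, ("R", "Y"): -2, ("Y", "R"): -2}

lam = solve_lambda(bg, bg, scores)
target = {pair: bg[pair[0]] * bg[pair[1]] * math.exp(lam * s)
          for pair, s in scores.items()}
print(lam)                   # the scale parameter l for this scoring system
print(sum(target.values()))  # the target frequencies Q_ij sum to 1
```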

Since gaps are disallowed in the MSP, a pairwise sequence comparison for the
MSP is analogous to a linear search for the MSS along the diagonals of a 2-d
search space. The sum total length of all the diagonals searched is just *M
N*.

Another way to express the pairwise scores is:

S_{ij} = log_{b}(Q_{ij} / (P_{X}(i) P_{Y}(j)))

where logarithms are taken to some base *b* and l = log_{e} b.

By summing the scores *S_{ij}* for the
aligned pairs of letters in the MSP, one obtains the sum of log-odds ratios that
is the MSP score. The MSP score is then seen to be the log of the odds that the MSP arose from the target distribution *Q* rather than the background distribution, expressed in logarithms to the base *b = e^{l}*.

Considering the size of the examined area alone, the expected description
length of the MSP (measured in information) is *log(K M N)*. As the
relative entropy, *H*, has units of information per length (or letter
pair), the expected length of the MSP (measured in letter pairs) is

E(L) = log(K M N) / H,

where *H* is the relative entropy between the target and background distributions of the aligned letter pairs. Here

H = sum_i,j Q_{ij} log(Q_{ij} / (P_{X}(i) P_{Y}(j)))

By definition, the MSP has the highest observed score. Since this score is
expected to occur just once, the (expected) MSP score has an (expected)
frequency of occurrence, *E*, of 1.

The appearance of MSPs can be modeled as a Poisson process with
characteristic parameter *E*, as it is possible for multiple, independent
segments of the background sequences to achieve the same high score. The Poisson
"events" are the individual MSPs having the same score *S* or greater. The
probability of one or more MSPs having score *S* or greater is simply one
minus the probability that *no* MSPs appear having score *S* or
greater. Thus,

P = 1 - e^{-E}

Note that in the limit as *E* approaches 0, *P = E*; and for values of *E* or *P* less than about 0.05, *E* and *P* are practically speaking equal. (This relationship can be derived from the fact that *x* itself is the limiting value of *log(1 + x)*, as *x* approaches 0).
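
A tiny sketch (illustrative only) of the conversion and of the small-value approximation:

```python
import math

def p_from_e(e_value):
    """P-value of one or more MSPs at this score: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)

for e in (10.0, 1.0, 0.05, 0.001):
    print(e, p_from_e(e))   # for E below ~0.05, P and E are nearly identical
```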

For convenience in comparing scores from different database searches, which
may have been performed with different score matrices and/or different
background distributions, we can compute the **normalized** or
**adjusted** score:

S' = lS - log K.

Expressing the "Karlin-Altschul equation" as a function of the adjusted score, we obtain:

E = M N e^{-S'}

Note how the search-dependent values for *l*
and *K* have been factored out of the equation by using adjusted scores. We
can therefore assess the relative significance of two alignments found under
different conditions, merely by comparing their adjusted scores. To assess their
absolute statistical significance, we need only know the size of the search
space, *M N*.
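
As a brief sketch (with invented values for *l*, *K*, the raw score, and the search-space dimensions), normalized scores and the corresponding E-values follow directly from the Karlin-Altschul equation:

```python
import math

def normalized_score(raw_score, lam, K):
    """Adjusted (normalized) score in nats: S' = l*S - log K."""
    return lam * raw_score - math.log(K)

def expected_hits(m, n, s_prime):
    """Karlin-Altschul equation in terms of the adjusted score: E = M N e^{-S'}."""
    return m * n * math.exp(-s_prime)

# Invented example parameters (not taken from the text).
lam, K = 0.30, 0.10              # scale parameter and K for some scoring system
raw, M, N = 50, 250, 1_000_000   # raw score, query length, search-space length

s_prime = normalized_score(raw, lam, K)
print(s_prime / math.log(2))           # the same adjusted score expressed in bits
print(expected_hits(M, N, s_prime))    # expected chance occurrences at this score
```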

Biological sequences have finite length, as do the MSPs found between them.
MSPs will tend not to appear near the outer (lower and right) edges of the
search space. The chance of achieving the highest score of all is reduced in
that region because the end of one or both sequences may be reached before a
higher score is obtained than might be found elsewhere. Effectively, the search
space is reduced by the expected length, *E(L)*, for the MSP. We can
therefore modify the Karlin-Altschul equation to use effective lengths for the
sequences compared.

M' = M - E(L)
N' = N - E(L)

E' = K M' N' e^{-lS}

Since *M'* is less than *M* and *N'* is less than *N*,
the edge-corrected *E'* is less than *E*, indicating greater
statistical significance.
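
A short sketch of the edge correction (again with invented parameter values), including the expected MSP length *E(L) = log(K M N) / H*:

```python
import math

def edge_corrected_e(m, n, raw_score, lam, K, H):
    """Edge-corrected E-value: shrink each sequence by the expected MSP
    length E(L) = log(K M N) / H, then apply E' = K M' N' e^{-l S}."""
    expected_len = math.log(K * m * n) / H
    m_eff = max(m - expected_len, 1.0)
    n_eff = max(n - expected_len, 1.0)
    return K * m_eff * n_eff * math.exp(-lam * raw_score)

# Invented values for illustration only.
print(edge_corrected_e(m=250, n=1_000_000, raw_score=50, lam=0.30, K=0.10, H=0.40))
```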

A raw MSP or alignment score is meaningless (uninformative) unless we know
the associated value of the scaling parameter *l*,
to a lesser extent *K*, and the size of the search space. Even if we have been
told the name of the scoring matrix someone used (*e.g.,* BLOSUM62),
the matrix they used may have been scaled differently from the version of
the matrix we normally use. For instance, a given database search could be
performed twice, with the second search using the same scoring matrix as the
first's but with all scores multiplied by 100. While the first search's scores
will be *100*-fold lower, the alignments produced by the two searches will
be the same and their significance should rightfully be expected to be the same.
This is a consequence of the scaling parameter *l*
being *100*-fold lower for the second search, so that the product *lS* is identical for both searches. If we did not know two
different scoring matrices had been used and hadn't learned the lessons of
Karlin-Altschul statistics, we might have been tempted to say that the higher
scores from the second search are more significant. (At the same time, we might
have wondered: if the scores are so different, why are the alignments
identical?!)

Karlin-Altschul statistics suggest we should be careful about drawing conclusions from raw alignment scores. When someone says they used the "BLOSUM62" matrix to obtain a high score, it is possible their matrix values were scaled differently than the BLOSUM62 matrix we have come to know. Generally in our work with different programs, and in communications with other people, we may find "BLOSUM62" refers to the exact same matrix, but this doesn't happen automatically; it only happens through efforts to standardize and avoid confusion. Miscommunication still occasionally happens, and people do sometimes experiment with matrix values and scaling factors, so consider yourself forewarned! Hopefully you see how the potential for this pitfall further motivates the use of adjusted or normalized scores, as well as P-values and E-values, instead of raw scores.

Statistical interpretations of results often involve weighing a null model
against an alternative model. In the case of sequence comparisons and the use of
Karlin-Altschul statistics, the two models being weighed are the frequencies
with which letters are expected to be paired in unselected, random alignments
from the background versus the target frequencies of their pairing in the MSP.
Karlin and Altschul tell us that when we search for the MSP we are implicitly
selecting for alignments with a specific target distribution, *Q*, defined
for seeing letter *i* paired with letter *j*, where

Q_{ij} = P_{X}(i) P_{Y}(j) e^{l S_{ij}}

We can interpret a probability (or, more precisely, odds) of the MSP having been sampled from the background distribution rather than the target distribution.

If the odds of being sampled from the background are low (*i.e.,* much
less than 1), then they are approximately equal to the P-value of the MSP having
been created from the background distribution. (In the limit as an odds ratio
A/B approaches 0, P(A) approaches A/B). Low P-values do *not* necessarily
mean the score is biologically significant, only that the MSP was more likely to
have been generated from the target distribution, which presumably was chosen on
the basis of some interesting biological phenomena (such as multiple alignments
of families of protein sequences). Further interpretations, such as the
biological significance of an MSP score, are not formally covered by the theory
and are often made with a rosy view of the situation.

Even if (biological) sequences are not random according to the conditions required for proper application of Karlin-Altschul statistics, we can still search for high-scoring alignments. It may just happen that the results still provide some biological insight, but their statistical significance cannot be believed. Even so, if the scoring system and search algorithm are not carefully chosen, the results may be uninformative or irrelevant -- and the software provides no warning.
