# Introduction to Alignment Scoring Statistics

Copyright (C) 1999-2002 Warren R. Gish, Saint Louis, Missouri 63108 USA.

Last updated September 30, 2003

The material here is intended to provide a brief background in information theory, as part of an overall framework for understanding "Karlin-Altschul" statistics.

## Topics

### References

The frequently cited reference is:

Additional covered or supporting material is contained in:

For detailed coverage of the extreme value distribution, see:

For a modern treatment of gapped alignment scores and the "edge effect", see:

For detailed coverage of information theory, see:

A good introductory text on algorithms (such as Huffman coding) is:

Abraham Wald's Sequential Probability Ratio Test (SPRT) may help one to grasp the meaning of the sums of log-odds scores that are evaluated with Karlin-Altschul statistics and the drop-off score, X, used in the BLAST algorithm.

An excellent treatise on practical aspects of numerical computing (including arithmetic encoding) is:

• Press, WH, Teukolsky, SA, Vetterling, WT, and BP Flannery (1992). Numerical Recipes in C. Cambridge University Press, Cambridge, UK.

For a combined historical and philosophical perspective on information theory and the study of genetics and molecular biology, see:

### Shannon's Entropy, I

Information theory deals with messages that are numerically coded for their description, storage and communication. For our work, messages will consist of sequences of letters or symbols chosen from some alphabet, S. For example, to encode DNA sequences, which constitute one kind of message, we are accustomed to using the set of IUPAC/IUB codes where S = {A, C, G, T}. Later on, we'll see that sequence alignments are themselves messages, selected for having certain properties (such as the highest possible score) and where the symbols used to spell out these messages are the aligned letter-pairs.

To perform statistical analyses on messages or sequences, we need a reference model. The model we will use is a simple one that merely requires that each letter in a sequence be selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence or its subsequences. In addition, we'll see that the model defines the context or assumptions under which our statistical interpretations are valid.

Given a probability distribution, Pi, for the letters in an i.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ... n is simply Pi Pj Pk···Pn.

As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log Pi + log Pj + log Pk+ ··· + log Pn.

Log-probabilities and their sums represent measures of information. Conversely, information can be thought of as log-probabilities. Since probabilities are fractional values between zero and one, their logarithms have negative values (or zero in the uninteresting case that only one possible outcome is exists). For convenience, information can be expressed as a negative log-probability, so increasing quantitities of information become increasingly positive-valued. If this convention is followed, then we need to remember to use a minus sign when converting from information space to probability space, as in P = e-I.

Information is measured in units that depend on the base of the logarithm. When taking logarithms to the base 2 (i.e., log2), the resultant information is expressed in units of bits (a contraction of binary digits). When taking Natural logarithms of probabilities, the resultant information is expressed in units of nats (a contraction of Natural digits). Bits and nats are interconvertible by the proportionality constant log(2).

nats = bits · log(2)

log(2) ~= 0.693

1/log(2) ~= 1.44

Generalizing, for a given base of the logarithm, b,

log(x) = log(b) · logb(x)

Using logarithms to arbitrary bases, information can be expressed in arbitrary units, not just bits and nats, such that -logbP = -klog(P), where 1/k=log(b). When working with information in units of bits, k=1/log(2). To keep the exposition a little simpler, we'll confine ourselves to working in units of nats, so k can be omitted.

Shannon's entropy, I = - sum_i Pi log Pi, is the expected (weighted arithmetic mean) value of log Pi computed over all letters in the alphabet, using weights that are simply the probabilities of the letters themselves, the Pis. The information, I, is expressed in units per letter. Shannon's entropy allows us to compute the expected description length of a message, given an a priori assumption (or knowledge) that the letters in the message will appear with frequencies Pi. If a message is 200 letters in length, 200I is the expected description length for that message. Furthermore, the theory tells us there is no shorter encoding for the message than 200I, if the symbols do indeed appear at the prescribed frequencies. Shannon's entropy thus informs us of the minimum description length, or MDL, for a message.

Huffman codes are often used for data compression because of their simplicity. A binary Huffman code provides an optimal, MDL encoding for sequences of i.i.d. letters. While better compression can often be obtained by identifying higher-order patterns within a message or by adapting the encoding algorithm to a changing distribution as the message is read, such methods are more complex and, if we are constrained to considering letters one-by-one in a strictly i.i.d. manner, David Huffman's method is as good as it gets. If you are interested, however, a perfect, entropy-bounded encoding that requires precisely -log2 Pi bits per letter i can be achieved using a somewhat more involved technique called "arithmetic coding" (e.g., see W. H. Press et al. (1992), pp. 910-915).

The description length of a message when using a Huffman code is expected to be about equal to the Shannon or arithmetic code length. Thus, a Huffman-encoded 200 letter sequence will average about 200I bits in length. For some letters in the alphabet, the Huffman code may be more or less efficient in its use of bits than the Shannon code (and vice versa), but the expected behavior averaged over all letters in the alphabet is the same for both (within one bit).

### Uniform Probability Distribution

Consider the 4-letter DNA alphabet S = {A, C, G, T}. Each letter in a DNA sequence can be one of 4 possibilities. With a uniform probability distribution for the letters, PA = PC = PG = PT = 0.25. The negative log2 probability for all letters in this case is 2 bits. Shannon's entropy is then simply 2 bits per letter. In this case of a uniform probability distribution and a 4-letter alphabet, each letter in a sequence minimally requires 2 bits to encode or to describe. This makes sense, since with 2 bits (binary digits) of precision we can uniquely map the 4 different 2-bit values 002, 012, 102, and 112 to the 4 different letters. Obviously we could use more than 2 bits to encode each letter, but any bits beyond 2 would be wasted.

### Nonuniform Probability Distribution

Consider the nonuniform probability distribution

• PA = 0.05
• PC = 0.05
• PG = 0.45
• PT = 0.45
Computing Shannon's entropy, I, for this nonuniform distribution, we obtain an average of 1.47 bits per letter, or more than 0.5 bit less than for the uniform distribution.

If we build a binary Huffman code (tree) for the above distribution, 1 bit would be required to code for G and 2 bits to code for T (or vice versa); and 3 bits each to code for A and C. The "Huffman entropy" in this case is 1*0.45 + 2*0.45 + 3*0.05 + 3*0.05 = 1.65 bits per letter -- which is not quite as efficient as the previous Shannon code.

When a message exhibits a nonuniform letter distribution, some letters require fewer bits and others require more bits to encode than if the distribution were uniform; but Shannon's entropy, which is an average over all letters in the alphabet, is always less for a nonuniform distribution than for a uniform one. Continuing with the Huffman tree analogy, on average we expect to need fewer bits to encode each letter in a message (or sequence) when the Huffman tree is built precisely according to the message's nonuniform letter distribution.

As a corollary of the above, uniform letter distributions are expected to yield suboptimal description lengths (i.e., longer descriptions) whenever the actual letter distribution of the message is nonuniform. Uniform distributions are advised for use when reliable, a priori knowledge of the true letter distribution is unavailable. Assumption of a uniform distribution provides a conservative bound on the expected description length: log2(1/N) bits per letter, where N is the size of the alphabet. Faced with the alternative -- not knowing the true distribution and guessing what it might be instead of using the uniform distribution -- we could end up with message descriptions that average considerably more than 2 bits per letter for DNA.

In general, when we know of a bias in a sequence's letter composition, it can be used to our advantage, by permitting the sequence to be described (encoded) using fewer bits. If the composition is biased and we don't know it -- or if it's biased and we just don't know what that bias is -- conservative use of the uniform distribution will indeed yield a longer encoding under all circumstances than if we had known the true distribution, but at least the encoded length will be well-behaved regardless of the actual bias. If the composition is biased one way and we incorrectly assume it is biased in another, the encoded sequence may be far longer than its optimally encoded length, thus making inefficient use of valuable bits (although bits have dropped in price considerably over time!)

### Relative Entropy, H

Relative entropy, H, is a measure of the expected coding inefficiency per letter, when describing a message with an assumed letter distribution P, when the true distribution is Q. If a binary Huffman code is used to describe the message, the relative entropy concerns the inefficiency of using a Huffman code built for the wrong distribution.

Whereas Shannon's entropy, I, is the expected log-likelihood for a single distribution, relative entropy is the expected logarithm of the likelihood ratio (log-likelihood ratio or LLR) between two distributions.

H(Q || P) = sum_i   Qi log(Qi / Pi)
The odds ratio Qi / Pi is the relative odds that letter i was chosen from the distribution Q versus P. In computing the relative entropy, the log-odds ratios are weighted by Qi; thus the weighted average is relative to the letter frequencies expected for a message generated with Q.

The numerical sign of individual log-odds ratios log(Qi / Pi) tell us:

• LLR greater than 0 => i was more likely chosen using Q than P;
• LLR equal to 0 => i was as likely selected using Q as it was using P;
• LLR less than 0 => i was more likely chosen using P than Q.

Relative entropy is sometimes called the Kullback Leibler distance between the distributions P and Q. Note that relative entropy is always nonnegative and will be zero if and only if P = Q. Furthermore, H(P || Q) is generally not equal to H(Q || P).

Relative entropy is symbolized in the present text by H and should not be confused with Shannon's entropy, which is represented in some texts by H, as well. As set forth earlier, I is used here to represent Shannon's entropy.

### Why Is Relative Entropy Relevant?

What does relative entropy have to do with biological sequence features and alignments? Let's consider a simple situation...

Given a scoring matrix (or scoring vector, in this one-dimensional case) and a model sequence generated i.i.d. according to some background distribution, P, we can write a program that finds the maximal scoring segment (MSS) within the sequence. Here the complete, full-length sequence represents one message, while the MSS reported by our program constitutes another. For example, we could create a hydrophobicity scoring matrix that assigns increasingly positive scores to more hydrophobic amino acids, zero score to amphipathic residues, and increasingly negative scores to more hydrophilic residues. (Karlin and Altschul (1990) provide other examples of scoring systems, as well). By searching for the MSS -- the "most hydrophobic" segment in this case -- we select for a portion of the sequence whose letters exhibit a different, hydrophobic distribution than the random background. By convention, the distribution of letters in the MSS is called the target distribution and is signified here by Q. (Note: some texts may signify the background distribution as Q and the target distribution as P).

If the letters in the MSS are described using an optimal Shannon code for the background distribution, we expect a longer description than if a code optimized for the target distribution was used instead. These extra bits -- the extra description length -- tell us the relative odds that the MSS arose by chance from the background distribution rather than the target.

Why would we consider a target distribution, Q, when our original sequence was generated from the background distribution, P? We'll soon see that the target frequencies are indeed special; that the background and target frequencies are formally related to one another by the scoring system; and how the odds of chance occurrence of the MSS are related to its score.

### Karlin-Altschul (or Karlin-Dembo) Statistics

Database similarity searches typically yield high-scoring pairwise sequence alignments such as the following:

Score = 67 (28.6 bits), Expect = 69., P = 1.0
Identities = 13/33 (39%), Positives = 23/33 (69%)

Query:   164 ESLKISQAVHGAFMELSEDGIEMAGSTGVIEDI 196
E+L + Q+  G+  ELSED ++  G +GV+E++
Sbjct:   381 ETLTLRQSSFGSKCELSEDFLKKVGKSGVVENL 413

+++-+-++--++--+++++-++- + +++++++
Scores:   514121512161315445632110601643512

The total score for an alignment (67 in the above case) is simply the sum of pairwise scores. The individual pairwise scores are listed beneath the alignment above (+5 on the left through +2 on the right).

Note that some pairs of letters may yield the same score. For example, in the BLOSUM62 matrix used to find the above alignment, S(S,T) = S(A,S) = S(E,K) = S(T,S) = S(D,N) = +1. While alignments are usually reported without the individual pairwise scores shown, the statistics of alignment scores depend implicitly on the probabilities of occurrence of the scores, not the letters. Dependency on the letter frequencies is factored out by the scoring matrix. We see then that the "message" obtained from a similarity search consists of a sequence of pairwise scores indicated by the high-scoring alignment.

If we search two sequences X and Y using a scoring matrix Sij to identify the maximal-scoring segment pair (MSP), and if the following conditions hold:

1. the two sequences are i.i.d. and have respective background distributions PX and PY (which may be the same),
2. the two sequences are effectively "long" or infinite and not too dissimilar in length,
3. the expected pairwise score sum_i,j PX(i) PY(j) Sij is negative,
4. a positive score is possible, i.e., PX(i)PY(j)Sij > 0 for some letters i and j,
then Karlin-Altschul statistics tell us:
• the scores in the scoring matrix are implicitly log-odds scores of the form:
Sij = log(Qij / (PX(i)PY(j))) / l
where Qij is the limiting target distribution of the letter pairs (i,j) in the MSP and l is the unique positive-valued solution to the equation sum_i,j PX(i) PY(j) elSij = 1.
• the expected frequency of chance occurrence of an MSP with score S or greater is:
E = K M N e-lS
where M and N are the lengths of the two sequences, M N is the size of the search space, and K is a measure of the relative independence of the points in this space in the context of accumulating the MSP score. (While the complete sequences X and Y must be i.i.d., the letter pairs comprising the MSP do exhibit an interdependency).

Since gaps are disallowed in the MSP, a pairwise sequence comparison for the MSP is analogous to a linear search for the MSS along the diagonals of a 2-d search space. The sum total length of all the diagonals searched is just M N.

Another way to express the pairwise scores is:

Sij = logb(Qij / (PX(i)PY(j)))
where logarithms to some base b are used instead of Natural logarithms. l, which was used in the earlier expression for Sij, is often called the scale of the scoring matrix; it is related to the base of the logarithm as follows:
l = loge b.

By summing the scores Sij for the aligned pairs of letters in the MSP, one obtains the sum of log-odds ratios that is the MSP score. The MSP score is then seen to be the logb (logarithm to some base b) of the odds that an MSP with its score occurs by chance at any given starting location within the random background -- not considering yet how large an area, or how many starting locations, were actually examined in finding the MSP.

Considering the size of the examined area alone, the expected description length of the MSP (measured in information) is log(K M N). As the relative entropy, H, has units of information per length (or letter pair), the expected length of the MSP (measured in letter pairs) is

E(L) = log(K M N) / H,
where H is the relative entropy of the target and background frequencies. H can be computed as:
H = sum_i,j Qij log(Qij / (PX(i) PY(j)))
Here PX(i) PY(j) is the product frequency expected for letter i paired with j in the background search space; and Qij is the frequency at which we expect to observe i paired with j in the MSP.

By definition, the MSP has the highest observed score. Since this score is expected to occur just once, the (expected) MSP score has an (expected) frequency of occurrence, E, of 1.

The appearance of MSPs can be modeled as a Poisson process with characteristic parameter E, as it is possible for multiple, independent segments of the background sequences to achieve the same high score. The Poisson "events" are the individual MSPs having the same score S or greater. The probability of one or more MSPs having score S or greater is simply one minus the probability that no MSPs appear having score S or greater. Thus,

P = 1 - e-E
Note that in the limit as E approaches 0, P = E; and for values of E or P less than about 0.05, E and P are practically speaking equal. (This relationship can be derived from the fact that x itself is the limiting value of log(1 + x), as x approaches 0).

For convenience in comparing scores from different database searches, which may have been performed with different score matrices and/or different background distributions, we can compute the normalized or adjusted score:

S' = lS - log K.

Expressing the "Karlin-Altschul equation" as a function of the adjusted score, we obtain:

E = M N e-S'

Note how the search-dependent values for l and K have been factored out of the equation by using adjusted scores. We can therefore assess the relative significance of two alignments found under different conditions, merely by comparing their adjusted scores. To assess their absolute statistical significance, we need only know the size of the search space, M N.

### The "Edge Effect"

Biological sequences have finite length, as do the MSPs found between them. MSPs will tend not to appear near the outer (lower and right) edges of the search space. The chance of achieving the highest score of all are reduced in that region because the end of one or both sequences may be reached before a higher score is obtained than might be found elsewhere. Effectively, the search space is reduced by the expected length, E(L), for the MSP. We can therefore modify the Karlin-Altschul equation to use effective lengths for the sequences compared.

M' = M - E(L)

N' = N - E(L)

E' = K M' N' e-lS

Since M' is less than M and N' is less than N, the edge-corrected E' is less than E, indicating greater statistical significance.

### Statistical Inference in Database Searching

A raw MSP or alignment score is meaningless (uninformative) unless we know the associated value of the scaling parameter l, to a lesser extent K, and the size of the search space. Even if the name of the scoring matrix someone used has been told to us (e.g., BLOSUM62), it might be that the matrix they used was scaled differently from the version of the matrix we normally use. For instance, a given database search could be performed twice, with the second search using the same scoring matrix as the first's but with all scores multiplied by 100. While the first search's scores will be 100-fold lower, the alignments produced by the two searches will be the same and their significance should rightfully be expected to be the same. This is a consequence of the scaling parameter l being 100-fold lower for the second search, so that the product lS is identical for both searches. If we did not know two different scoring matrices had been used and hadn't learned the lessons of Karlin-Altschul statistics, we might have been tempted to say that the higher scores from the second search are more significant. (At the same time, we might have wondered: if the scores are so different, why are the alignments identical?!)

Karlin-Altschul statistics suggest we should be careful about drawing conclusions from raw alignment scores. When someone says they used the "BLOSUM62" matrix to obtain a high score, it is possible their matrix values were scaled differently than the BLOSUM62 matrix we have come to know. Generally in our work with different programs, and in communications with other people, we may find "BLOSUM62" refers to the exact same matrix, but this doesn't happen automatically; it only happens through efforts to standardize and avoid confusion. Miscommunication still occasionally happens, and people do sometimes experiment with matrix values and scaling factors, so consider yourself forewarned! Hopefully you see how the potential for this pitfall further motivates the use of adjusted or normalized scores, as well as P-values and E-values, instead of raw scores.

Statistical interpretations of results often involves weighing a null model against an alternative model. In the case of sequence comparisons and the use of Karlin-Altschul statistics, the two models being weighed are the frequencies with which letters are expected to be paired in unselected, random alignments from the background versus the target frequencies of their pairing in the MSP. Karlin and Altschul tell us when we search for the MSP we are implicitly selecting for alignments with a specific target distribution, Q, defined for seeing letter i paired with letter j, where

Qij = PX(i) PY(j) elSij
We can interpret a probability (or P-value) reported by BLASTP as being a function of the odds that the MSP was sampled from the background distribution versus being sampled from the target distribution. In each case, one considers the MSP to have been created by a random, i.i.d. process; the only question is which of the two distributions was most likely used to generate the MSP score: the background frequencies or the target frequences?

If the odds of being sampled from the background are low (i.e., much less than 1), then they are approximately equal to the P-value of the MSP having been created from the background distribution. (In the limit as an odds ratio A/B approaches 0, P(A) approaches A/B). Low P-values do not necessarily mean the score is biologically significant, only that the MSP was more likely to have been generated from the target distribution, which presumably was chosen on the basis of some interesting biological phenomena (such as multiple alignments of families of protein sequences). Further interpretations, such as the biological significance of an MSP score, are not formally covered by the theory and are often made with a rosy view of the situation.

Even if (biological) sequences are not random according to the conditions required for proper application of Karlin-Altschul statistics, we can still search for high-scoring alignments. It may just happen that the results still provide some biological insight, but their statistical significance can not be believed. Even so, if the scoring system and search algorithm are not carefully chosen, the results may be uninformative or irrelevant -- and the software provides no warning.