Arithmetic Coding - Arithmetic Coding As A Generalized Change of Radix

Arithmetic Coding As A Generalized Change of Radix

Recall that in the case where the symbols had equal probabilities, arithmetic coding could be implemented by a simple change of base, or radix. In general, arithmetic (and range) coding may be interpreted as a generalized change of radix. For example, we may look at any sequence of symbols:

as a number in a certain base presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer A = 0, B = 1, C = 2, D = 3, and so on. This results in the following frequencies and cumulative frequencies:

Symbol Frequency of occurrence Cumulative frequency
A 1 0
B 2 1
D 3 3

The cumulative frequency is the total of all frequencies below it in a frequency distribution (a running total of frequencies).

In a positional numeral system the radix, or base, is numerically equal to a number of different symbols used to express the number. For example, in the decimal system the number of symbols is 10, namely 0,1,2,3,4,5,6,7,8,9. The radix is used to express any finite integer in a presumed multiplier in polynomial form. For example, the number 457 is actually 4×102 + 5×101 + 7×100, where base 10 is presumed but not shown explicitly.

Initially, we will convert DABDDB into a base-6 numeral, because 6 is the length of the string. The string is first mapped into the digit string 301331, which then maps to an integer by the polynomial:

The result 23671 has a length of 15 bits, which is not very close to the theoretical limit (the entropy of the message), which is approximately 9 bits.

To encode a message with a length closer to the theoretical limit imposed by information theory we need to slightly generalize the classic formula for changing the radix. We will compute lower and upper bounds L and U and choose a number between them. For the computation of L we multiply each term in the above expression by the product of the frequencies of all previously occurred symbols:

The difference between this polynomial and the polynomial above is that each term is multiplied by the product of the frequencies of all previously occurring symbols. More generally, L may be computed as:

where are the cumulative frequencies and are the frequencies of occurrences. Indexes denote the position of the symbol in a message. In the special case where all frequencies are 1, this is the change-of-base formula.

The upper bound U will be L plus the product of all frequencies; in this case U = L + (3 × 1 × 2 × 3 × 3 × 2) = 25002 + 108 = 25110. In general, U is given by:

Now we can choose any number from the interval [L, U) to represent the message; one convenient choice is the value with the longest possible trail of zeroes, 25100, since it allows us to achieve compression by representing the result as 251×102. The zeroes can also be truncated, giving 251, if the length of the message is stored separately. Longer messages will tend to have longer trails of zeroes.

To decode the integer 25100, the polynomial computation can be reversed as shown in the table below. At each stage the current symbol is identified, then the corresponding term is subtracted from the result.

Remainder Identification Identified symbol Corrected remainder
25100 25100 / 65 = 3 D (25100 − 65 × 3) / 3 = 590
590 590 / 64 = 0 A (590 − 64 × 0) / 1 = 590
590 590 / 63 = 2 B (590 − 63 × 1) / 2 = 187
187 187 / 62 = 5 D (187 − 62 × 3) / 3 = 26
26 26 / 61 = 4 D (26 − 61 × 3) / 3 = 2
2 2 / 60 = 2 B

During decoding we take the floor after dividing by the corresponding power of 6. The result is then matched against the cumulative intervals and the appropriate symbol is selected from look up table. When the symbol is identified the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. The only difference compared to the classical change-of-base is that there may be a range of values associated with each symbol. In this example, A is always 0, B is either 1 or 2, and D is any of 3, 4, 5. This is in exact accordance with our intervals that are determined by the frequencies. When all intervals are equal to 1 we have a special case of the classic base change.

Read more about this topic:  Arithmetic Coding

Famous quotes containing the words arithmetic, generalized and/or change:

    ‘Tis no extravagant arithmetic to say, that for every ten jokes,—thou hast got an hundred enemies; and till thou hast gone on, and raised a swarm of wasps about thine ears, and art half stung to death by them, thou wilt never be convinced it is so.
    Laurence Sterne (1713–1768)

    One is conscious of no brave and noble earnestness in it, of no generalized passion for intellectual and spiritual adventure, of no organized determination to think things out. What is there is a highly self-conscious and insipid correctness, a bloodless respectability submergence of matter in manner—in brief, what is there is the feeble, uninspiring quality of German painting and English music.
    —H.L. (Henry Lewis)

    Revolutions demand enormous sacrifices and, at the same time, create a new need to change the world again.
    Friedrich Dürrenmatt (1921–1990)