PAQ - Algorithm

PAQ uses a context mixing algorithm. Context mixing is related to Prediction by Partial Matching (PPM) in that the compressor is divided into a predictor and an arithmetic coder, but differs in that the next-symbol prediction is computed using a weighted combination of probability estimates from a large number of models conditioned on different contexts. Unlike PPM, a context need not be contiguous. Most PAQ versions collect next-symbol statistics for the following contexts (a context-hashing sketch follows the list):

  • n-grams. The context is the last n bytes before the predicted symbol (as in PPM).
  • whole word n-grams, ignoring case and nonalphabetic characters (useful in text files).
  • "sparse" contexts, for example, the second and fourth bytes preceding the predicted symbol (useful in some binary formats).
  • "analog" contexts, consisting of the high order bits of previous 8 or 16 bit words (useful for multimedia files).
  • two-dimensional contexts (useful for images, tables, and spreadsheets). The row length is determined by finding the stride length of repeating byte patterns.
  • specialized models for particular file types, such as x86 executables or BMP, TIFF, and JPEG images. These models are active only when the corresponding file type is detected.
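
The following C++ sketch illustrates how a few of these contexts can be reduced to hash values that index each model's statistics table. The hash function, constants, and specific context choices are illustrative rather than those of any particular PAQ release.

    #include <cstdint>
    #include <cstdio>
    #include <initializer_list>

    // Each model reduces its chosen context bytes to a hash that indexes a
    // table of bit statistics. The mixing constants here are illustrative.
    static uint32_t hash_bytes(std::initializer_list<uint8_t> bytes) {
        uint32_t h = 2166136261u;                  // FNV-1a style mixing
        for (uint8_t b : bytes) h = (h ^ b) * 16777619u;
        return h;
    }

    int main() {
        // Pretend these are the last four input bytes; hist[3] is the most recent.
        uint8_t hist[4] = {'t', 'e', 'x', 't'};

        uint32_t order2 = hash_bytes({hist[2], hist[3]});        // contiguous 2-byte n-gram
        uint32_t sparse = hash_bytes({hist[0], hist[2]});        // 2nd and 4th bytes back ("sparse")
        uint32_t analog = hash_bytes({uint8_t(hist[2] >> 4),     // high-order bits only
                                      uint8_t(hist[3] >> 4)});   // ("analog" context)

        printf("order-2 %08x  sparse %08x  analog %08x\n",
               (unsigned)order2, (unsigned)sparse, (unsigned)analog);
    }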

All PAQ versions predict and compress one bit at a time, but differ in the details of the models and in how the predictions are combined and postprocessed. Once the next-bit probability is determined, it is encoded by arithmetic coding (a minimal coder sketch appears after the list below). There are three methods for combining predictions, depending on the version.

  • In PAQ1 through PAQ3, each prediction is represented as a pair of bit counts, the numbers of zeros and ones observed in that context. These counts are combined by weighted summation, with greater weights given to longer contexts.
  • In PAQ4 through PAQ6, the predictions are combined as before, but the weights assigned to each model are adjusted to favor the more accurate models.
  • In PAQ7 and later, each model outputs a probability rather than a pair of counts. The probabilities are combined using an artificial neural network, as sketched below.
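
A much-simplified version of the PAQ7-style combination step is sketched below: each model's probability is mapped into the logistic ("stretched") domain, mixed with a single layer of weights, and the weights are adjusted after each bit to reduce the prediction error. The floating-point arithmetic, learning rate, and zero-initialized weights are simplifying assumptions; actual implementations use scaled integers and select weight sets by context.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Simplified logistic mixer: combine model probabilities in the stretched
    // domain with one layer of weights, then update the weights toward the
    // observed bit. Constants are illustrative.
    struct Mixer {
        std::vector<double> w;                   // one weight per model
        std::vector<double> x;                   // stretched inputs of the last prediction
        explicit Mixer(size_t n) : w(n, 0.0), x(n, 0.0) {}

        static double stretch(double p) { return std::log(p / (1.0 - p)); }
        static double squash(double t)  { return 1.0 / (1.0 + std::exp(-t)); }

        // p[i] is model i's estimate of P(next bit = 1).
        double predict(const std::vector<double>& p) {
            double dot = 0.0;
            for (size_t i = 0; i < w.size(); ++i) {
                x[i] = stretch(p[i]);
                dot += w[i] * x[i];
            }
            return squash(dot);
        }

        // After the bit is coded, nudge the weights to reduce coding cost.
        void update(double predicted, int bit, double rate = 0.01) {
            double err = bit - predicted;
            for (size_t i = 0; i < w.size(); ++i) w[i] += rate * err * x[i];
        }
    };

    int main() {
        Mixer m(3);
        std::vector<double> p = {0.9, 0.6, 0.2}; // three models' estimates of P(1)
        double pr = m.predict(p);
        m.update(pr, /*actual bit=*/1);
        printf("mixed prediction: %.3f\n", pr);
    }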
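
Once the final bit probability is available, the arithmetic coding step mentioned above amounts to repeatedly narrowing an interval in proportion to that probability. The sketch below is a minimal binary arithmetic encoder in that spirit; the 12-bit probability scale, carry-free renormalization, and one-byte flush are illustrative details rather than an exact transcription of any PAQ coder.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Minimal binary arithmetic encoder: given P(next bit = 1) scaled to
    // 0..4095, narrow the interval [lo, hi] and emit leading bytes once they
    // can no longer change.
    struct Encoder {
        uint32_t lo = 0, hi = 0xFFFFFFFFu;
        std::vector<uint8_t> out;

        void encode(int bit, uint32_t p1) {
            // Split the interval in proportion to the predicted probability of a 1.
            uint32_t mid = lo + (uint32_t)(((uint64_t)(hi - lo) * p1) >> 12);
            if (bit) hi = mid; else lo = mid + 1;
            // While the top bytes of lo and hi agree, that byte is final: emit it.
            while ((lo ^ hi) < (1u << 24)) {
                out.push_back(uint8_t(lo >> 24));
                lo <<= 8;
                hi = (hi << 8) | 0xFF;
            }
        }
        void flush() { out.push_back(uint8_t(lo >> 24)); } // simplified flush
    };

    int main() {
        Encoder e;
        e.encode(1, 3000);   // a 1 bit predicted with P(1) about 3000/4096
        e.encode(0, 500);    // a 0 bit predicted with P(1) about  500/4096
        e.flush();
        printf("emitted %zu bytes\n", e.out.size());
    }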

PAQ1SSE and later versions postprocess the prediction using secondary symbol estimation (SSE). The combined prediction and a small context are used to look up a new prediction in a table. After the bit is encoded, the table entry is adjusted to reduce the prediction error. SSE stages can be pipelined with different contexts, or computed in parallel with the outputs averaged.
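
A simplified SSE stage can be pictured as a table indexed by the small context and a quantized version of the input probability, as in the sketch below. Actual PAQ versions quantize nonuniformly in the stretched domain and interpolate between adjacent entries; that refinement is omitted here.

    #include <cstdio>
    #include <vector>

    // Simplified SSE stage: a table indexed by (context, quantized probability)
    // holds a refined P(1), adjusted toward each observed bit after coding.
    struct SSE {
        int nbuckets;
        std::vector<double> t;                   // ctx * nbuckets + bucket -> refined P(1)
        SSE(int nctx, int nb) : nbuckets(nb), t((size_t)nctx * nb) {
            for (int c = 0; c < nctx; ++c)
                for (int b = 0; b < nbuckets; ++b)
                    t[(size_t)c * nbuckets + b] = (b + 0.5) / nbuckets;  // identity map at first
        }
        size_t index(int ctx, double p) const {
            int b = (int)(p * nbuckets);
            if (b >= nbuckets) b = nbuckets - 1;
            return (size_t)ctx * nbuckets + b;
        }
        double refine(int ctx, double p) const { return t[index(ctx, p)]; }
        void update(int ctx, double p, int bit, double rate = 0.02) {
            double& e = t[index(ctx, p)];
            e += rate * (bit - e);               // move the entry toward the observed bit
        }
    };

    int main() {
        SSE sse(/*contexts=*/256, /*buckets=*/33);
        double refined = sse.refine(/*ctx=*/42, /*mixed prediction=*/0.7);
        sse.update(42, 0.7, /*actual bit=*/1);
        printf("refined prediction: %.3f\n", refined);
    }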
