
Prediction and its Arbitrary Qualities

Overview of Prediction

What exactly are we doing when we're predicting? We may be looking for certain kinds of patterns or repetitions. But imagine a universe where data files, just because of the way they're made, hardly ever have repetitions, but instead have some other kind of predictable properties. If you lived in that universe, you certainly wouldn't use PKZip!

The whole point of this hypothetical universe is this: we predict things based on our previous experiences. In other words, the best way to predict stuff that comes from around here is to compare it with other stuff from around here. The term "around here" can be as localized as you like, depending on the application. For example, you can make a dandy image compressor that doesn't handle text very well, and vice versa. Or you can make more general algorithms that handle both, like Deflate, which is used in gzip and PNG. This stuff we're comparing against is usually called history.
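To make "comparing against history" concrete, here's a toy sketch (not from the article): an order-1 predictor that guesses the next character by checking what most often followed the current one earlier in the data we've already seen.

```python
# Toy order-1 predictor: guess the next character from local history.
# Everything here is illustrative; real compressors are far more refined.
from collections import Counter, defaultdict


def predict_next(history):
    """Return the character most likely to follow, based on what has
    followed the same one-character context earlier in the history."""
    if not history:
        return None
    context = history[-1]  # the most local part of the history
    followers = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        followers[prev][nxt] += 1
    if not followers[context]:
        return None  # we've never seen anything follow this context
    return followers[context].most_common(1)[0][0]
```

Given the history "abababa", the predictor notes that 'a' has always been followed by 'b', so it predicts 'b' next. The same idea, with longer contexts and probabilities instead of a single guess, underlies context-modeling compressors.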

Global vs. Local history

Usually, we're concerned with transmitting data in some way. So, it's convenient to divide history into two components: Global history, which is known before any signal is transmitted, and Local history, which is the part of the signal that has already been transmitted (and decoded). The most local history is often called the context.
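A hedged sketch of that split: the global history is a model both ends agree on before anything is sent (here, a built-in frequency table), while the local history is the part of the signal already transmitted. The blending below is just for illustration, not any standard scheme.

```python
# Global history: agreed on in advance, never transmitted.
# Local history: the symbols transmitted (and decoded) so far.
from collections import Counter

GLOBAL_COUNTS = Counter({"e": 12, "t": 9, "a": 8, "o": 8})  # known beforehand


def probability(symbol, transmitted_so_far):
    """Estimate how likely `symbol` is next, combining the fixed global
    model with counts gathered from the local history."""
    local = Counter(transmitted_so_far)   # local history, built as we decode
    counts = GLOBAL_COUNTS + local        # combine global and local evidence
    total = sum(counts.values())
    return counts[symbol] / total
```

Notice that a symbol absent from the global table still gains probability as it shows up in the local history, which is exactly why local history tends to dominate the prediction.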

It has been found that usually, in our world, the more local the history, the more useful it is for prediction. Specifically, comparisons with global history are usually only used to classify the information, and this is often performed by the user! This occurs any time you save a file using a different format, like GIF or JPEG or Zip or MP3. This classification is used to choose a more specific algorithm, and the algorithm uses the local history to complete the prediction. This kind of hierarchy of classification / prediction is often useful for speeding up the prediction process, and in our happy world it hardly ever reduces the potential accuracy of the prediction.

However, there is the problem of overly simple classification, and misclassification. Here's an example: JPEG is an image format, and so is PNG. However, the classification of "image" is not detailed enough for choosing the appropriate format. Some images can be stored efficiently using JPEG, others more efficiently with PNG, and some images aren't stored very well using either format.

Summary

So, summing up, a good, fast way of predicting stuff is to use a hierarchy of classifications until we have a satisfyingly narrow classification. Then we pick a prediction algorithm specific to that classification. Then we do the actual predictions using that algorithm on the local history.
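The classify-then-predict pipeline above can be sketched minimally like this; the classifier and the per-class predictors are stand-ins, not real format detectors or compressors.

```python
# Minimal classify-then-predict hierarchy. The rules below are toy
# stand-ins: classification narrows the data to a class, and the
# class picks a predictor that works on the local history.
def classify(data):
    """Crude classifier: PNG magic bytes mean 'image', else 'text'."""
    if data.startswith(b"\x89PNG"):
        return "image"
    return "text"


PREDICTORS = {
    # "image": guess the next byte equals the previous one (flat regions)
    "image": lambda history: history[-1:],
    # "text": guess a space, the most common byte in English text
    "text": lambda history: b" ",
}


def predict(data, history):
    kind = classify(data)             # step 1: narrow via classification
    return PREDICTORS[kind](history)  # step 2: predict from local history
```

The point of the hierarchy is exactly what the summary says: classification is a one-time, cheap decision, and all the per-symbol work then runs against local history with an algorithm suited to the class.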

This is all nice, but we should remember that the classes we define, and the algorithms we create, are only models of our happy little local world, and are in no way universal, perfect, or ideal. This is the arbitrariness of prediction.