
BLU Discuss list archive



endless compression



> On Thu, Aug 12, 2004 at 03:54:31PM -0400, markw at mohawksoft.com wrote:
>> (Let me preface this with some background, I've written a few format
>> specific compressors, and have implemented a few standard compressors.)
>
> I shudder to think about them.

Why? They work. They are fast and they do a good job.

>
>> Lets talk about *what* compression is. Compression is representing a
>> theoretically larger range of data with a smaller range. Compression is
>> removing "knowledge" about the data from the data, leaving only the
>> "interesting" part that is needed to reconstitute the data based on
>> knowledge.
>
> Bzzt, wrong.
>
> You've just described "lossy compression". Lossy compression
> throws out information. The file which is reconstructed is NOT
> the same as the original file; it is an approximation which in
> some domains (notably audio and video) may be considered "good
> enough".
>
> Lossless compression does not destroy information. Lossless
> compression is finding a more compact notation to express
> exactly the same information. It removes redundancies and
> replaces them with an explanation of how to replace them
> exactly.

No, it is not wrong. If you read the paragraph and think about what
compression actually is, you will understand. "Compression" is representing
data with as little information as possible.

"What" that information "is" is subject to knowledge of data. If you know
you will have a regularly recurring sequence of numbers 0,1,2,3,4,5,...
1,000,000 (really some arbitrary end point), you don't need a million
integers to represent it. A special case compressor could represent this
section of the stream as three integers, start, step, and end.

Note: This is not "lossy" compression, it is lossless compression.
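
To make that concrete, here is a minimal sketch of such a special-case
compressor. The format is hypothetical, but it shows how three values
regenerate the whole run exactly:

/* Hypothetical sketch: losslessly encode an arithmetic run of integers
 * as (start, step, count) instead of storing every value. */
#include <stdio.h>

struct arith_run { long start, step, count; };

/* "Compress": describe the run 0, 1, 2, ..., 999999 with three longs. */
static struct arith_run encode(void)
{
    struct arith_run r = { 0, 1, 1000000 };
    return r;
}

/* "Decompress": regenerate every original value exactly -- no loss. */
static void decode(struct arith_run r)
{
    long i, v = r.start;
    for (i = 0; i < r.count; i++, v += r.step)
        printf("%ld\n", v);
}

int main(void)
{
    decode(encode());   /* prints the full million-value sequence back out */
    return 0;
}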

>
>> The idea, then, is to take knowledge of a numerical series or knowledge
>> about data. The theoretical number range may be X but in reality there
>> is only Y variation, and if Y < X, the data should be compressible.
>
> You have now taken an example of lossless and an example of
> lossy and announced that lossy is always acceptable. It isn't.

You have assumed incorrectly.

>
>> When you look at a ZIP compressed file, it does not compress well using
>> standard stream compressors. This is because there are few if any
>> repeating patterns. The fact that it has few repeating patterns makes
>> this stream interesting. It means that it is less likely in a universe
>> of streams. If it is less likely, then it can be represented with fewer
>> bits.
>
> OK, you've defined "information" the way Shannon did. Big deal.

So? Your point, please?

>
>> Standard compressors work on the notion that repeated patterns can be
>> saved once and re-used when they show up again. LZW actually uses more
>> bits per byte for literal bytes, but many fewer bits for strings of
>> repeated bytes.
>
> Erm... kind of. LZW uses a dictionary-and-quoting method.
> Repeated byte patterns are stored in the dictionary, taking up
> slightly more space than a single copy would, and then a symbol
> for the pattern -- a reference to the dictionary -- is inserted.
> But LZW is smart enough to simply store quoted information --
> non-referenceable strings -- as themselves. Most of the
> improvements in this system have come from adjustments in
> dictionary hashing, look-ahead window size and sub-pattern
> matching.

And how is your description in any way in conflict with mine? I made a
very general description of the nature of an LZW stream, whereas you've
attempted to describe the inner workings of the algorithm. In fact, the
algorithm is fairly unimportant for this discussion because we are talking
about compressors in general, not any one in particular.

The statement that LZW generally stores 8-bit bytes using more than 8 bits
is not untrue. The statement that repeating strings are represented with
fewer bits is also not untrue. It describes, quite accurately, how any
conventional compressor saves space.
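
As a back-of-the-envelope illustration of that point (the fixed 9-bit code
width and the counts below are assumptions picked just to make the
arithmetic easy; real LZW grows its code width as the dictionary fills):

/* Cost model: with 9-bit LZW codes, a literal byte costs 9 bits (more
 * than its original 8), but one code can also stand for a multi-byte
 * dictionary string, which is where the savings come from. */
#include <stdio.h>

int main(void)
{
    const int code_bits = 9;    /* assumed fixed code width */
    long literals = 1000;       /* bytes emitted as single-byte codes */
    long matches  = 500;        /* codes that each replace a string... */
    long avg_match_len = 6;     /* ...of about 6 bytes (assumed) */

    long original_bits   = (literals + matches * avg_match_len) * 8;
    long compressed_bits = (literals + matches) * code_bits;

    printf("original:   %ld bits\n", original_bits);
    printf("compressed: %ld bits\n", compressed_bits);
    /* Literals alone would expand (9 > 8 bits each); the repeated
     * strings are what make the overall stream smaller. */
    return 0;
}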

>
>> A "re-compressor" is different, rather than "repeated" bytes, it would
>> use
>> information about special sequences of data that have little or no
>> repitition and few recognizable patterns.
>
> Where does this information come from?

OK, you have to use your imagination here. First, some facts:

We know the data stream has few repeating patterns; this is important.

Because of this, we also know that the "universe" of streams it is likely
to belong to is significantly smaller than the theoretical range allowed
by the size of the data.

The job is, therefore, to analyze the data and construct a special
compressor for it.

If a fractal or other equation can be defined with fewer bytes than the
data it generates, then you have compressibility. As with the sequence of
numbers described above, the data is not conventionally compressible
because it has no repeating patterns, but "knowledge" of the numbers may
allow us to represent them as an algorithm that recreates them without
data loss.

This is a HUGE project with a lot of numerical analysis, but, in theory,
with sufficient computational power, it may be possible.
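
As a toy sketch of that search-for-a-generator idea: the single candidate
model below (a linear byte sequence) is purely illustrative, and a real
attempt would search a vastly larger space of generators, but it shows the
shape of "store the model's parameters if one reproduces the data exactly,
otherwise store the data verbatim":

/* Toy sketch: instead of looking for repeated patterns, look for a
 * small model that reproduces the block exactly. */
#include <stdio.h>
#include <stddef.h>

/* One illustrative candidate model: byte[i] = a + b*i (mod 256). */
static int fits_linear(const unsigned char *buf, size_t n, int *a, int *b)
{
    if (n < 2) return 0;
    *a = buf[0];
    *b = (buf[1] - buf[0]) & 0xff;
    for (size_t i = 0; i < n; i++)
        if (buf[i] != (unsigned char)((*a + *b * (long)i) & 0xff))
            return 0;
    return 1;
}

int main(void)
{
    unsigned char data[256];
    for (size_t i = 0; i < sizeof data; i++)
        data[i] = (unsigned char)(3 + 7 * i);  /* a stream a model can explain */

    int a, b;
    if (fits_linear(data, sizeof data, &a, &b))
        printf("model found: a=%d b=%d (2 parameters instead of %zu bytes)\n",
               a, b, sizeof data);
    else
        printf("no model found: store %zu raw bytes\n", sizeof data);
    return 0;
}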

>
>> This may be hugely expensive to
>> compute and offer only a compressibility of a few percent, however, if
>> it is repeatable,
>
> Then it didn't work.

Then what didn't work?
>
>> then it can iterate and iterate until the overhead for the
>> compression is larger than the data.
>
>> I'm not saying it is practical, workable, or really possible, I'm just
>> saying that, in theory, it may be possible.
>
> Can I accurately rephrase that as "I'm not saying that it is
> possible, I'm saying that it may be possible."?

That is a very accurate rephrasing. I don't "know" that it is possible,
because I have not heard of it being done. A scientist does not believe
something to be true without proof, but by the same token, he does not
dismiss a theory without thought and/or study. I have read a couple of
papers that make an interesting case for it, and it sounds rational. It may
be impossible with a finite amount of computing power, but Moore's law may
make it possible.

Up until the age of computers, mathematicians didn't even know that there
were equations, like fractals, that were non-terminating and non-repeating.
Fractal compression may be possible with enough computing power.

>
> OK, you and Jules are cranks. Go read the sci.compression FAQ a
> few more times, and leave us alone.

You know, I have a real problem with personal attacks. I dislike insulting
or impugning people during a discussion.

I have not claimed it was possible, I just said that it may not be
impossible. The difference is that one requires proof, and the other
requires thought. I am used to dealing with people who can entertain
interesting or challenging ideas without it being personal; if you are not
such a person, please do not respond.





