
BLU Discuss list archive



Jules Gilbert explains



Note: He is having problems with his email, and asked me to forward this 
to the list.


---------------------------------------------------------
I have decided to provide this information to the hobbyists at the
BLU group.

I have gotten mail from a few of you, and I realize that most of you
are not representative of the people who attacked me.

In the Bible, God tells a story about a farmer who plants a field.  At
night, while the farmer and his field hands are asleep, the farmer's
enemy comes and sows weed seed and clearly intends to make it impossible
for the farmer to raise his crop.

In fact, in the morning the misdeed is discovered, and the fieldhands
ask the farmer whether they should burn up everything in the field, thus
burning up the good seed the farmer planted along with the foul seed
planted by the farmer's enemy.

But though the fieldhands were ready to go forward, the farmer (who
must have been very wise) tells them not to do this, and proclaims
that once his crop is grown, it will be easy to tell the weeds from
the crops he planted.

The story ends with the farmer declaring that at harvest time the crop
would be harvested and the weeds would be burned up...

To me, it's clear that we live in that world.



Now, about compression.  I am the very same Jules Gilbert that declared
to all the world (courtesy of comp.compression) that I had developed a
repeatable compressor.

Unfortunately, without so much as a "hello", I was jumped on and
declared to be a liar.

But I am not lying and the method I had developed (which was the basis
for my claim) is generally described below.

I say 'generally' because I've had some time to play with it and what I
describe below is slightly improved.  If you can program in C or Fortran
you can easily code it up in a day or so.

So get out your numerical methods texts and be off!



Let's say that you have a file produced from the process of compression.

Let's say that you segment this file into 256k byte blocks.

Now, in the following I am going to describe how to compress (supposedly
incompressible) data serially, one buffer at a time.

This is nice, but the really big gains (using this method, which is not
optimal at all; I have much better designs in terms of time and
space performance) are derived from processing ALL cell0's concurrently,
then all cell1's, and so on...

However, let's learn to crawl before we learn to walk...


First, let's say our example buffer begins with the values 1,29,3,7

Allocate a second buffer of 256k cells using unsigned longs, and serially
add the values from the first buffer, so that the second buffer contains
1,30,33,40,...
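
(For anyone who wants something concrete: here is a rough sketch of just
this step, in Python/NumPy rather than the C or Fortran I suggested.  The
function name is arbitrary; this is only to show the shape of it.)

import numpy as np

def running_sum(block: bytes) -> np.ndarray:
    # Serially add the byte values: 1,29,3,7 -> 1,30,33,40,...
    # The result is monotonic (non-decreasing) because byte values are >= 0.
    vals = np.frombuffer(block, dtype=np.uint8).astype(np.uint64)
    return np.cumsum(vals)

print(running_sum(bytes([1, 29, 3, 7])))    # [ 1 30 33 40]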

Notice, it's monotonic.  Now we can curve-fit the values.  Use a high-degree
polynomial curve fitter.  Say 30 terms.  (The curve fitter I like
gives me about 31 terms, maximum.  It's not the NRC one, which delivers
almost any desired degree of precision.)

My suggestion is to play.  Get it coded and run it, using different 
model parameters.  You'll discover some very interesting things about the
space gain performance measured just after the curve fit.  Clearly the
optimal way to do this is to do the fit ONCE on the entire file, then
segment into smaller sizes.


Continuing: do the curve fit, 30 terms.

And from the resulting coefficients, produce an approximation of the 
original input.

And subtract the original monotonic data values from the approximated
values.  Do this subtraction in-place.
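
As a sketch, with NumPy's least-squares fitter standing in for the NRC
routine (30 terms means a degree-29 polynomial; the names here are only
illustrative):

import numpy as np
from numpy.polynomial import Polynomial

def fit_and_subtract(mono: np.ndarray, terms: int = 30):
    # Fit a high-degree polynomial to the monotonic running sums.
    # Polynomial.fit rescales x to [-1, 1] internally, which keeps a
    # ~30-term least-squares fit numerically tolerable.
    x = np.arange(mono.size, dtype=np.float64)
    poly = Polynomial.fit(x, mono.astype(np.float64), deg=terms - 1)
    coeffs = poly.convert().coef                 # the 30 coefficients you must send
    approx = np.rint(poly(x)).astype(np.int64)   # rebuild the approximation
    residuals = approx - mono.astype(np.int64)   # subtract the original from it
    return coeffs, residuals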

Notice, the curve fitting operation requires 30 terms, each eight bytes.
(That's 240 bytes which is pure overhead.)

And look at the resulting unsigned values.  They are no longer random in
their left-most bit positions.

In fact if you extract the left-most three or four significant bits you
have yourself a very compressible signal.  Just remember that what does 
not compress, you have to send.  And sending costs -- that's the measure
by which all compressors are judged: how much do you have to send.

Notice that the left-most three significant bits don't begin in the same
position in each of the 256k cells.  However mapping the left-most bit
position and sending this is very low-cost, roughly 500 bytes for a 256k
buffer that's processed this way.  (This is from memory, but I used to
build these things, that's what I did.)
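
One way to read that step, as a sketch (the helper name is mine): find
where each residual's top bit sits, and pull off the three bits starting
there.  The per-cell bit positions are exactly what that ~500-byte map has
to cover.

import numpy as np

def top_fragments(residuals: np.ndarray, nbits: int = 3):
    # Work on magnitudes; the signs can be packed separately.
    msb, frag = [], []
    for v in np.abs(residuals).tolist():
        pos = max(v.bit_length() - 1, 0)    # position of the most significant set bit
        msb.append(pos)
        frag.append((v >> max(pos - (nbits - 1), 0)) & ((1 << nbits) - 1))
    return np.array(msb), np.array(frag)    # bit-position map, plus the 3-bit fragments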

The problem now is that after removing the significant left-most three
bits of non-random data from each cell, we are left with data that, as
you move from bit-column to bit-column, left to right, exhibits
increasing randomness.

The columns on the right remain completely random.

We have a lot of options when doing multiple 256k buffers concurrently.

But for now, we do this:

Take the right-most columns and simply repackage the data.  Reapply the
process. 

But the remaining left columns, while not of sufficient quality to
send to a conventional good-quality compressor, can be compressed.

Here is one way that works.  Make an RLE list from each bit-column, and
compress those individually.  They are not random, so they will compress.
Compress well?  No.

Please note:  If you try to compress these values before doing the RLE's
you will not obtain any compression.  Do the RLE's.
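
Concretely, a sketch of the bit-column RLE step (illustrative names only):

import numpy as np

def bit_column(cells: np.ndarray, bit: int) -> np.ndarray:
    # Pull one bit position out of every cell as a 0/1 vector
    # (cells is assumed to be an unsigned view of the remaining data).
    return ((cells.astype(np.uint64) >> np.uint64(bit)) & np.uint64(1)).astype(np.uint8)

def rle(bits: np.ndarray):
    # Run-length encode a 0/1 vector as (value, run length) pairs.
    if bits.size == 0:
        return []
    edges = np.flatnonzero(np.diff(bits)) + 1
    starts = np.concatenate(([0], edges))
    ends = np.concatenate((edges, [bits.size]))
    return [(int(bits[s]), int(e - s)) for s, e in zip(starts, ends)]

Feed each column's run-length list to whatever back-end compressor you
like.  The lists from the left-hand columns are the ones that compress;
the right-hand columns stay random and get repackaged as above.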


Now, let's talk about cost and compute the savings. (This is per pass:)

240 bytes for the coefficients.

500 bytes for tracking the three-bit compressible column fragment.

We save three bits times 256k cells.  That's almost 100k, but we have to
compress it, and the savings are actually only about 40k bytes.

But we are not done...  Now we have to account for those bit columns we
compressed by first making them into an RLE list.
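
To put numbers on it (same figures as above, just worked through):

CELLS = 256 * 1024                   # cells per buffer
removed_bits = 3 * CELLS             # 786,432 bits pulled out as fragments
removed_bytes = removed_bits // 8    # 98,304 bytes -- the "almost 100k"
overhead = 30 * 8 + 500              # 240 bytes of coefficients + ~500-byte position map
print(removed_bytes, overhead)       # 98304 740

The ~40k above and the 5k to 7k below are what is left after the fragments
and the RLE columns are actually compressed.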

I found that even slight changes in the type of curve fitter produced
widely varying numbers of columns.  But this is the gestalt of my
learning:

Expect about 5k, maybe 7k compression per logical pass.  Not a lot, 
I know.


A friend gave me a special fitter (from the NR book, I think...) which 
produced almost any desired degree of precision.  It took forever, but
it worked.  That just doubled the savings, say to 10% or very close.



Once I had this running I knew that what people had inferred from Shannon
wasn't true.

And while this method doesn't have any practical merit (because of the
huge CPU overhead necessary to do the curve fitting), other systems,
based neither on floating-point mechanics nor on integer DSP logic, work well.
Unfortunately, these are all bit-based and, running on an x86 box, do
require lots of machine time.

Of course, this is a special case, because this process involves a step
that requires lots of CPU time!  As I keep saying, it's not practical.
But does it work?  Absolutely.

What I've just described is pretty easy to build.  Anyone with a copy
of the NRC or NRF should have no trouble constructing something within
a day or so.

My first one was 64k, not in C, and it barely worked.  It took almost
a full day because the atomic instructions to read and write the buffer
values were within CLIPS, an expert system shell.

When I went to C I discovered that 8k was probably about as thin as I
could make things and still get compression.  Try 1-bit wide buffers
with 8k cells.

Thin!  Oh yes, that might even be practical, but I discovered several
bit-based architectures that deliver the goods without curve fitting.


--
DDDD
DK KD  I'm good at 'hello'.
DKK D  not great at 'how are you'.
DK KD  and I suck at 'goodbye.'
DDDD                 - Joss Whedon






