name:   CodePressor
author: Eli-Jean R. Leyssens and Tony Haines
size:   1180 bytes
needs:  RISC OS
descr.: Pretty good ARM code compressor

-----------------------------------------------------------------------------

  Syntax: CodePressr <inputfile> <outputfile>
  
  Note that CodePressr can only handle UNTYPED files, with a LOAD address of
&8100 OR HIGHER.

  When Tony Haines sent me his beta of his Flu entry he told me he had
implemented a new type of compression algorithm targeted specifically at ARM
code. I had a look at it and discovered that it did indeed perform very well.

  I played around with it for a bit and then asked him whether he'd mind if I
turned the algorithm into a 2K entry for CC#2. He said he'd be enjoyed if I
did, so I set out to create the program which we now call CodePressor, as
suggested by Tony.

  Just so you don't think I just turned it into a 2K program: I did add some
improvements, increasing the algorithms performance ;)

  
  So, how does it work? Well, the basic algorithm is to for each word look
back in a range of 256 words and find the word which has the most identical
nibbles. Identical to the nibbles in the current word of course.

  You can then store a byte, indicating which word had the most identical
nibbles, called distance, and another byte of which each bit indicates
whether the matching nibble is identical or not. After that you only store
the non-matching nibbles.

  So, if the best matching word has 6 identical nibbles, then that could be
stored as 8 + 8 + ( 8 - 2) * 4 bits = 16 + 8 = 24, instead of 32.
  If not enough nibbles match, then you store one byte indicating an
uncompressed word followed by the 8 nibbles making up that word.

  As you might have figured out by now the above only generates real
compression for words whose best matching word has at least 5 identical
nibbles. Still, this results in a good overall compression as a lot of "best"
matches have 4, 5, 6, 7 or even 8 identical nibbles.
 
  However, on closer inspection of all the data that was being generated by
this algorithm I discovered a few things:

  - a lot of best matching words have a distance of 16 (words) or less
  - in lots of cases the identical nibbles included the 4 top nibbles
  
  So, instead of always using 8 bits to store the distance I now encode both
the distance and nibbles-mask as follows:

  %0000 0000 - %0000 1111    ->   %0 0000 - %0 1111
  %xxxx 0000 - %xxxx 1111    ->   %1 xxxx 0000 - %1 xxxx 1111
  
  Where xxxx <> 0. So, if only the bottom 4 bits are set then a 0 followed by
the lower 4 bits are stored, otherwise a 1 bit, followed by all 8 bits.

  This gave the compression ratio an extra boost :)

  Furthermore, the algorithm really only works well on code. So, if there are
one or more data blocks in the program that you wish to compress then it's
very likely that most words in those blocks can not be compressed by the
algorithm and thus must be stored as "uncompressed words". However, as I
already said, and uncompressed word is stored by first storing a byte
indicating an uncompressed word and then the uncompressed word. If there are
lots of uncompressable words then this marker per word would seriously
increase the resulting file.

  So, I added code to detect runs, or sequences, of uncompressable words. It
then stored a marker, followed by a count. After that all the uncompressed
without any further markers follow.

  Well, that's it basically. If you want to find out the exact workings then
you're free to squander through the source although I doubt anybody will find
it at all readable ;0
 
 
  Files included in this archive are:
CodePressr        ; The program itself
ReadMe            ; This file
More.!SetPaths    ; Sets the CodePressorWork$Dir
More.!CPress      ; CodePresses the CodePress program, erhm... ;)
More.CPSRC        ; Source code
More.Build        ; Build instructions (Sources->Final)


  Thanks to Tony Haines for his brilliant algorithm, letting me improve it
and turn it into a 2K entry for CC#2 and being very supportive during the
CC#2 competition in general.


  If you hadn't guess by now: it's late, I've had a couple of beers and
definately not too much sleep during the last couple of days. I also
desperately need to release this so others can still use it for their CC#2
entries so I'm in a bit of a hurry as well. All in all not the best situation
for writing a clear and understandable ReadMe ;)
  
  
  If you want to contact me or want to know more about Topix or Topix
productions:

    Pervect :     pervect@topix.student.utwente.nl

    TopixWEB:     http://www.dse.nl/~topix

  Tony Haines can be reached at: a.s.haines@bham.ac.uk