A large commit.

[pdp8.git] / sw / kermit / k12 / k12enc.doc
diff --git a/sw/kermit/k12/k12enc.doc b/sw/kermit/k12/k12enc.doc

new file mode 100644 (file)

index 0000000..1e140a6
--- /dev/null
+++ b/sw/kermit/k12/k12enc.doc
@@ -0,0 +1,177 @@
+From: Charles Lasner <lasner@watsun.cc.columbia.edu>
+Subject: ENCODE/DECODE format description
+
+    The  latest KERMIT-12 binary files are encoded in  a  specialized  format.
+This document will describe the internal encoding and related subjects.
+
+OS/8 file considerations.
+
+    All OS/8 files are contiguous multiples of PDP-8 double-page sized logical
+records.  These are sometimes known as blocks, but are  more  accurately known
+as records.  (The term block is associated with various hardware descriptions,
+and means different things to different people, such as DECtape blocks or RK05
+blocks,  where  the first means a physical block which just happens to be  1/2
+the  size  of  the OS/8 logical record, and the second means a physical sector
+which  is  the same size as the OS/8 logical record, but only if the drive  is
+attached to an RK8E.  We will therefore avoid this term!)
+
+    Since  PDP-8  pages are 128 12-bit words each, the OS/8 record consists of
+256  12-bit  words,  which  can  also  be viewed as 384 8-bit bytes.  For  the
+benefit of various existing utilities, there is a standard method of unpacking
+the 8-bit bytes yielding  a  stream  of  coherent  8-bit  bytes.    The  PDP-8
+convention is to number bits from left to right starting with bit[0].  This is
+INCOMPATIBLE with the notation commonly used  in other architectures, which is
+usually given as what power of 2  the  bit  represents.  The PDP-8 notation is
+used to denote bit position in a manner  consistent  with  significence of the
+bit,  and  arbitrarily uses origin 0, which is the  usually  assembly-language
+orientation.  Using this notation, the first byte (byte #0)  to be unpacked is
+taken from word[0] bits[4-11].  The second byte (byte #1) to  be  unpacked  is
+taken  from  word[1] bits[4-11].  The third byte (byte #2) to be  unpacked  is
+taken from word[0]  bits[0-3]  concatenated  with word[1] bits[0-3].  All bits
+are taken left to  right  as stated.  This method is usually referred to as "3
+for 2," and repeated accordingly  will  yield  the correct stream of bytes for
+ASCII OS/8 files.  OS/8 absolute  binary  files are images of 8-bit paper-tape
+frames packed in the same format.   Although  the  high-order bit "matters" in
+absolute binary files, the high-bit is untrustworthy in  ASCII  files.    Both
+types of files end with a ^Z character which  will have the high-order bit set
+in  the case of the absolute binary files.  The  reason  it  succeeds  in  the
+binary  case  is  that  the  paper-tape format treats the high-bit present  as
+leader or trailer, not loadable data, etc., so the loader ignores all  leading
+high-bit set bytes, and finishes on the first trailing high-bit set byte.  The
+binary file  contains  several leader bytes of 0200 octal, and several trailer
+bytes of 0200  followed by 232, the code for ^Z.  There is no "fool-proof" way
+to derive these formats,  rather these are usually given by the extensions .BN
+for binary, and various extensions  (.LS,  .PA, .MA, .DI, .BI, .TX, .DC, etc.)
+for ASCII files.  If the  file  is  "known"  to  be  either  ASCII or absolute
+binary,  then  these conventions can be used  to  ignore  extraneous  trailing
+bytes.  If the file type is unknown,  then  it  must  be treated as an "image"
+file, where all data must be preserved.  The  most typical image file is a .SV
+file,  which is an image of files organized as pages  and  double-pages,  with
+some  trivial absolute loading instructions in a "header" block.  Each  record
+of  the  file  is  always  "paired"  off,  i.e.,  the  record  has an  implied
+double-page boundary of main memory it is meant to load into.  If  the loading
+instructions indicate  a single-page load, then the first page must be loaded;
+the second half  of the record is IGNORED.  Notice it is impossible to specify
+singular loading of the "odd" page.  OS/8 also supports various other formats,
+so it is difficult to  obtain  useful  knowledge  of the "inner" format of the
+file.
+
+Encoding considerations.
+
+      If the 8-bit bytes of  an  OS/8 file are unpacked and packed faithfully,
+the resultant file will be an accurate copy of the original data.  This is the
+basis for an alternate encoding format, perhaps  more  universal in scope, but
+it is NOT the method used currently.   The  method chosen here is to treat the
+entire file as a contiguous stream of 5-bit bytes spanning words as necessary.
+This means that bits are taken from left to right,  five  at  a time, and each
+encoded  into  a  "printable"  character  for inclusion into the encoded file.
+This  means  that  data  will  form  60-bit  groups  of  12  5-bit  characters
+representing five  12-bit words.  The 5-bit encoding is accomplished using the
+ASCII  coding  for    an   extension  to  hexadecimal,  which  can  be  called
+bi-hexadecimal, or base 32  notation.    In this base, the values are 0-9, and
+A-V (just the "natural" extension  of hexadecimal's 0-9, A-F).  The alphabetic
+characters  can  be  upper  or lower  case  as  necessary.    This  method  is
+theoretically 25% more efficient than hexadecimal ASCII  since  each character
+holds 5-bit data rather than 4-bit data.
+
+    Since the 5-bit data has no "good" boundary for  most  computers,  we will
+use the "best" case for PDP-8 image data, the 60-bit group as described above.
+Once  started,  a  60-bit  group must be completed, thus there are  boundaries
+throughout the file every 12 characters.
+
+    At any boundary, the file may have compressed data.  Compressed fields are
+indicated  by X (or x) followed by a four character  field.    The  format  is
+basically a truncation of the "normal" 60-bit group, but only contains 20 bits
+of data.  The first 12 bits are the value of the  data  to  be  repeated.  The
+last eight bits are the repeat count.  Values of the repeat count  range  from
+1-256, where 256 is implied if the value is zero.  Practical values range from
+3-255, since  one  or two values would take less file space uncompressed.  Due
+to the boundary  requirements,  compression fields are independent of the data
+preceeding them.  The  repeat count limitation to a maximum of 256 was felt to
+be a reasonable compromise between compressed field length and adequate repeat
+count.  Making the repeat count  even  only  double  the current maximum would
+require six character compression fields instead of  five  (including  the X).
+As  an implementation restriction, the encoding program only  reads  one  OS/8
+record at a time, thus the case of 256  maximum  repetitions  occurs only at a
+double-page boundary.  The added complexity required to achieve infrequent and
+minimal  improvement  was  considered  to  be  unimportant, but could be added
+later.  Thus adjacent repeated values split across boundaries, or split across
+logical records will not contribute to a (single) compression field.
+
+    Note that  compression is achieved by locating repeating strings of 12-bit
+values;  if  the  file  were  viewed as repeating strings of 8-bit bytes, then
+compression would be less  likely, except for cases like 0000 octal, which due
+to "symmetry" are compressible via  either  method.    Many  PDP-8 image files
+contain "filler" words, i.e., otherwise unloaded  areas which are "pre-filled"
+with constant data such as 0000 octal,  or  7402 octal, which is the PDP-8 HLT
+instruction.  Image files filled with "large" regions  of repeating strings of
+7402 will not compress using 8-bit byte orientation.
+
+Reliability considerations.
+
+    Even with the safeguards of a "robust" character  set,  file validity must
+be tested to maintain data integrity.  Towards this  end,  the encoding format
+has several additional features.  Unlike other "popular" formats, there  is an
+explicit  "command" structure to the encoded file format.  All lines  of  data
+start  with  <  and  end  with  >.   This prevents the characters  from  being
+"massaged" into unwanted  alternates.   Various systems are known to "destroy"
+files which have lines  starting  with  "from",  etc.    By enclosing the data
+lines, we prevent these problems.   Additionally, a class of explicit commands
+exist which start with ( and  end  with  ).    Instead of implied positioning,
+there is a command called (FILE filename.ext),  where  the filename.ext is the
+"suggested" name for the decoded file.  The encoding program uses the original
+un-encoded file's file name in this field.  After  the  data, there is another
+command (END filename.ext) which can be used to validate the  data,  since the
+same  file  name  should be in both commands (as implemented in  the  encoding
+program).   Several (REMARK {anything}) commands are inserted into the file at
+various  points   to  assist  in  reconstructing  the  original  file,  or  in
+documenting the fact  that  the  file is an encoded "version" of another file.
+Several "frill" REMARK commands  are implemented to indicate the original file
+date, and the date of  encoded file creation.  Today's date is always used for
+the encoded file creation date.  The original file date may be omitted, if the
+system  doesn't  support  Additional  Information  Words  (AIWs),  since  this
+optional  feature must be present in order for  the  files  to  have  creation
+dates.  The overall encoding format theoretically can be a concatenated series
+of  encoded files, each with its own (FILE ) and  (END  )  commands,  but  the
+decoding program  only  supports  single-file  decoding  as  an implementation
+restriction.
+
+    The file must always end with a good boundary  condition.    If  the  last
+field is an X (compression) field, then this is already  satisfied.    If  the
+last field is ordinary data, then 1-4 12-bit words of 0000 octal will be added
+at the end of the last field if necessary to ensure a  good boundary.  The end
+of  file  is  signified  by a single Z (or z) character.  At  this  point,  an
+extraneous  record may be in progress.  If it consists of four or less  12-bit
+words of 0000 octal, it is discarded.  Any other situation regarding a partial
+record indicates defective data in the received encoded file.
+
+    After the single Z (z) character, there are 12 more characters:  an entire
+60-bit  group.    This  is  the  file  checksum.    It  is  accomplished  with
+pentuple-precision arithmetic.   It  is the sum of all 12-bit data values with
+carry into a total of five 12-bit words.  Repeat compression data values are
+also added, but only once for each compression field.  The compression field's
+repeat count is also added, but  it  is  first  multiplied by 16.  (The repeat
+count was expressed originally as *16 so  it  would have its eight significant
+bits left-justified).  The entire 60-bit quantity is  expressed  in  two's
+complement notation by negating it and encoding the group as  any other 60-bit
+group.    Since  most  files  are  relatively  short, the high-order bits  are
+generally zero, so most two's complement checksums start with 7777,7777 octal.
+The five 12-bit  quantities  holding the checksum are encoded low-order first,
+so the right-most characters  in  the  checksum  field tend to be V (v).  This
+order  is used merely to  accomodate  multi-precision  arithmetic,  as  anyone
+attempting to observe "backwards bytes" on other machines is familiar with.
+
+Future considerations.
+
+    This format is by no means  "perfect,"  but  it  is more robust than most,
+with a minimum of efficiency loss, given  the  tradeoffs  involved.   The data
+bracketing characters can be changed if required.   The characters W (w) and Y
+(y)  are  available  for  this purpose.  Files could  incorporate  a  word  or
+character  count,  or  other  validation  technique,  etc.    Each line  could
+incorporate  a  local  count.   These and other considerations could create  a
+"compromise"  format  that  could  be  more  generic and "pallatable" to other
+systems.   The checksum could be limited to 48 bits, which is more amenable to
+8 and 16  bit  architectures.    Perhaps  opening  parameters could govern the
+contents of the rest  of  the  file, such as whether the checksum was present,
+etc.
+[end of file]