[pdp8.git] / sw / kermit / k12 / k12enc.doc

From: Charles Lasner <lasner@watsun.cc.columbia.edu>
Subject: ENCODE/DECODE format description

    The  latest KERMIT-12 binary files are encoded in  a  specialized  format.
This document will describe the internal encoding and related subjects.

OS/8 file considerations.

    All OS/8 files are contiguous multiples of PDP-8 double-page sized logical
records.  These are sometimes known as blocks, but are  more  accurately known
as records.  (The term block is associated with various hardware descriptions,
and means different things to different people, such as DECtape blocks or RK05
blocks,  where  the first means a physical block which just happens to be  1/2
the  size  of  the OS/8 logical record, and the second means a physical sector
which  is  the same size as the OS/8 logical record, but only if the drive  is
attached to an RK8E.  We will therefore avoid this term!)

    Since  PDP-8  pages are 128 12-bit words each, the OS/8 record consists of
256  12-bit  words,  which  can  also  be viewed as 384 8-bit bytes.  For  the
benefit of various existing utilities, there is a standard method of unpacking
the 8-bit bytes yielding  a  stream  of  coherent  8-bit  bytes.    The  PDP-8
convention is to number bits from left to right starting with bit[0].  This is
INCOMPATIBLE with the notation commonly used  in other architectures, which is
usually given as what power of 2  the  bit  represents.  The PDP-8 notation is
used to denote bit position in a manner  consistent  with  significence of the
bit,  and  arbitrarily uses origin 0, which is the  usually  assembly-language
orientation.  Using this notation, the first byte (byte #0)  to be unpacked is
taken from word[0] bits[4-11].  The second byte (byte #1) to  be  unpacked  is
taken  from  word[1] bits[4-11].  The third byte (byte #2) to be  unpacked  is
taken from word[0]  bits[0-3]  concatenated  with word[1] bits[0-3].  All bits
are taken left to  right  as stated.  This method is usually referred to as "3
for 2," and repeated accordingly  will  yield  the correct stream of bytes for
ASCII OS/8 files.  OS/8 absolute  binary  files are images of 8-bit paper-tape
frames packed in the same format.   Although  the  high-order bit "matters" in
absolute binary files, the high-bit is untrustworthy in  ASCII  files.    Both
types of files end with a ^Z character which  will have the high-order bit set
in  the case of the absolute binary files.  The  reason  it  succeeds  in  the
binary  case  is  that  the  paper-tape format treats the high-bit present  as
leader or trailer, not loadable data, etc., so the loader ignores all  leading
high-bit set bytes, and finishes on the first trailing high-bit set byte.  The
binary file  contains  several leader bytes of 0200 octal, and several trailer
bytes of 0200  followed by 232, the code for ^Z.  There is no "fool-proof" way
to derive these formats,  rather these are usually given by the extensions .BN
for binary, and various extensions  (.LS,  .PA, .MA, .DI, .BI, .TX, .DC, etc.)
for ASCII files.  If the  file  is  "known"  to  be  either  ASCII or absolute
binary,  then  these conventions can be used  to  ignore  extraneous  trailing
bytes.  If the file type is unknown,  then  it  must  be treated as an "image"
file, where all data must be preserved.  The  most typical image file is a .SV
file,  which is an image of files organized as pages  and  double-pages,  with
some  trivial absolute loading instructions in a "header" block.  Each  record
of  the  file  is  always  "paired"  off,  i.e.,  the  record  has an  implied
double-page boundary of main memory it is meant to load into.  If  the loading
instructions indicate  a single-page load, then the first page must be loaded;
the second half  of the record is IGNORED.  Notice it is impossible to specify
singular loading of the "odd" page.  OS/8 also supports various other formats,
so it is difficult to  obtain  useful  knowledge  of the "inner" format of the
file.

Encoding considerations.

      If the 8-bit bytes of  an  OS/8 file are unpacked and packed faithfully,
the resultant file will be an accurate copy of the original data.  This is the
basis for an alternate encoding format, perhaps  more  universal in scope, but
it is NOT the method used currently.   The  method chosen here is to treat the
entire file as a contiguous stream of 5-bit bytes spanning words as necessary.
This means that bits are taken from left to right,  five  at  a time, and each
encoded  into  a  "printable"  character  for inclusion into the encoded file.
This  means  that  data  will  form  60-bit  groups  of  12  5-bit  characters
representing five  12-bit words.  The 5-bit encoding is accomplished using the
ASCII  coding  for    an   extension  to  hexadecimal,  which  can  be  called
bi-hexadecimal, or base 32  notation.    In this base, the values are 0-9, and
A-V (just the "natural" extension  of hexadecimal's 0-9, A-F).  The alphabetic
characters  can  be  upper  or lower  case  as  necessary.    This  method  is
theoretically 25% more efficient than hexadecimal ASCII  since  each character
holds 5-bit data rather than 4-bit data.

    Since the 5-bit data has no "good" boundary for  most  computers,  we will
use the "best" case for PDP-8 image data, the 60-bit group as described above.
Once  started,  a  60-bit  group must be completed, thus there are  boundaries
throughout the file every 12 characters.

    At any boundary, the file may have compressed data.  Compressed fields are
indicated  by X (or x) followed by a four character  field.    The  format  is
basically a truncation of the "normal" 60-bit group, but only contains 20 bits
of data.  The first 12 bits are the value of the  data  to  be  repeated.  The
last eight bits are the repeat count.  Values of the repeat count  range  from
1-256, where 256 is implied if the value is zero.  Practical values range from
3-255, since  one  or two values would take less file space uncompressed.  Due
to the boundary  requirements,  compression fields are independent of the data
preceeding them.  The  repeat count limitation to a maximum of 256 was felt to
be a reasonable compromise between compressed field length and adequate repeat
count.  Making the repeat count  even  only  double  the current maximum would
require six character compression fields instead of  five  (including  the X).
As  an implementation restriction, the encoding program only  reads  one  OS/8
record at a time, thus the case of 256  maximum  repetitions  occurs only at a
double-page boundary.  The added complexity required to achieve infrequent and
minimal  improvement  was  considered  to  be  unimportant, but could be added
later.  Thus adjacent repeated values split across boundaries, or split across
logical records will not contribute to a (single) compression field.

    Note that  compression is achieved by locating repeating strings of 12-bit
values;  if  the  file  were  viewed as repeating strings of 8-bit bytes, then
compression would be less  likely, except for cases like 0000 octal, which due
to "symmetry" are compressible via  either  method.    Many  PDP-8 image files
contain "filler" words, i.e., otherwise unloaded  areas which are "pre-filled"
with constant data such as 0000 octal,  or  7402 octal, which is the PDP-8 HLT
instruction.  Image files filled with "large" regions  of repeating strings of
7402 will not compress using 8-bit byte orientation.

Reliability considerations.

    Even with the safeguards of a "robust" character  set,  file validity must
be tested to maintain data integrity.  Towards this  end,  the encoding format
has several additional features.  Unlike other "popular" formats, there  is an
explicit  "command" structure to the encoded file format.  All lines  of  data
start  with  <  and  end  with  >.   This prevents the characters  from  being
"massaged" into unwanted  alternates.   Various systems are known to "destroy"
files which have lines  starting  with  "from",  etc.    By enclosing the data
lines, we prevent these problems.   Additionally, a class of explicit commands
exist which start with ( and  end  with  ).    Instead of implied positioning,
there is a command called (FILE filename.ext),  where  the filename.ext is the
"suggested" name for the decoded file.  The encoding program uses the original
un-encoded file's file name in this field.  After  the  data, there is another
command (END filename.ext) which can be used to validate the  data,  since the
same  file  name  should be in both commands (as implemented in  the  encoding
program).   Several (REMARK {anything}) commands are inserted into the file at
various  points   to  assist  in  reconstructing  the  original  file,  or  in
documenting the fact  that  the  file is an encoded "version" of another file.
Several "frill" REMARK commands  are implemented to indicate the original file
date, and the date of  encoded file creation.  Today's date is always used for
the encoded file creation date.  The original file date may be omitted, if the
system  doesn't  support  Additional  Information  Words  (AIWs),  since  this
optional  feature must be present in order for  the  files  to  have  creation
dates.  The overall encoding format theoretically can be a concatenated series
of  encoded files, each with its own (FILE ) and  (END  )  commands,  but  the
decoding program  only  supports  single-file  decoding  as  an implementation
restriction.

    The file must always end with a good boundary  condition.    If  the  last
field is an X (compression) field, then this is already  satisfied.    If  the
last field is ordinary data, then 1-4 12-bit words of 0000 octal will be added
at the end of the last field if necessary to ensure a  good boundary.  The end
of  file  is  signified  by a single Z (or z) character.  At  this  point,  an
extraneous  record may be in progress.  If it consists of four or less  12-bit
words of 0000 octal, it is discarded.  Any other situation regarding a partial
record indicates defective data in the received encoded file.

    After the single Z (z) character, there are 12 more characters:  an entire
60-bit  group.    This  is  the  file  checksum.    It  is  accomplished  with
pentuple-precision arithmetic.   It  is the sum of all 12-bit data values with
carry into a total of five 12-bit words.  Repeat compression data values are
also added, but only once for each compression field.  The compression field's
repeat count is also added, but  it  is  first  multiplied by 16.  (The repeat
count was expressed originally as *16 so  it  would have its eight significant
bits left-justified).  The entire 60-bit quantity is  expressed  in  two's
complement notation by negating it and encoding the group as  any other 60-bit
group.    Since  most  files  are  relatively  short, the high-order bits  are
generally zero, so most two's complement checksums start with 7777,7777 octal.
The five 12-bit  quantities  holding the checksum are encoded low-order first,
so the right-most characters  in  the  checksum  field tend to be V (v).  This
order  is used merely to  accomodate  multi-precision  arithmetic,  as  anyone
attempting to observe "backwards bytes" on other machines is familiar with.

Future considerations.

    This format is by no means  "perfect,"  but  it  is more robust than most,
with a minimum of efficiency loss, given  the  tradeoffs  involved.   The data
bracketing characters can be changed if required.   The characters W (w) and Y
(y)  are  available  for  this purpose.  Files could  incorporate  a  word  or
character  count,  or  other  validation  technique,  etc.    Each line  could
incorporate  a  local  count.   These and other considerations could create  a
"compromise"  format  that  could  be  more  generic and "pallatable" to other
systems.   The checksum could be limited to 48 bits, which is more amenable to
8 and 16  bit  architectures.    Perhaps  opening  parameters could govern the
contents of the rest  of  the  file, such as whether the checksum was present,
etc.
[end of file]
Commit	Line	Data
81e70d48 PH	1	From: Charles Lasner <lasner@watsun.cc.columbia.edu>
	2	Subject: ENCODE/DECODE format description
	3
	4	The latest KERMIT-12 binary files are encoded in a specialized format.
	5	This document will describe the internal encoding and related subjects.
	6
	7	OS/8 file considerations.
	8
	9	All OS/8 files are contiguous multiples of PDP-8 double-page sized logical
	10	records. These are sometimes known as blocks, but are more accurately known
	11	as records. (The term block is associated with various hardware descriptions,
	12	and means different things to different people, such as DECtape blocks or RK05
	13	blocks, where the first means a physical block which just happens to be 1/2
	14	the size of the OS/8 logical record, and the second means a physical sector
	15	which is the same size as the OS/8 logical record, but only if the drive is
	16	attached to an RK8E. We will therefore avoid this term!)
	17
	18	Since PDP-8 pages are 128 12-bit words each, the OS/8 record consists of
	19	256 12-bit words, which can also be viewed as 384 8-bit bytes. For the
	20	benefit of various existing utilities, there is a standard method of unpacking
	21	the 8-bit bytes yielding a stream of coherent 8-bit bytes. The PDP-8
	22	convention is to number bits from left to right starting with bit[0]. This is
	23	INCOMPATIBLE with the notation commonly used in other architectures, which is
	24	usually given as what power of 2 the bit represents. The PDP-8 notation is
	25	used to denote bit position in a manner consistent with significence of the
	26	bit, and arbitrarily uses origin 0, which is the usually assembly-language
	27	orientation. Using this notation, the first byte (byte #0) to be unpacked is
	28	taken from word[0] bits[4-11]. The second byte (byte #1) to be unpacked is
	29	taken from word[1] bits[4-11]. The third byte (byte #2) to be unpacked is
	30	taken from word[0] bits[0-3] concatenated with word[1] bits[0-3]. All bits
	31	are taken left to right as stated. This method is usually referred to as "3
	32	for 2," and repeated accordingly will yield the correct stream of bytes for
	33	ASCII OS/8 files. OS/8 absolute binary files are images of 8-bit paper-tape
	34	frames packed in the same format. Although the high-order bit "matters" in
	35	absolute binary files, the high-bit is untrustworthy in ASCII files. Both
	36	types of files end with a ^Z character which will have the high-order bit set
	37	in the case of the absolute binary files. The reason it succeeds in the
	38	binary case is that the paper-tape format treats the high-bit present as
	39	leader or trailer, not loadable data, etc., so the loader ignores all leading
	40	high-bit set bytes, and finishes on the first trailing high-bit set byte. The
	41	binary file contains several leader bytes of 0200 octal, and several trailer
	42	bytes of 0200 followed by 232, the code for ^Z. There is no "fool-proof" way
	43	to derive these formats, rather these are usually given by the extensions .BN
	44	for binary, and various extensions (.LS, .PA, .MA, .DI, .BI, .TX, .DC, etc.)
	45	for ASCII files. If the file is "known" to be either ASCII or absolute
	46	binary, then these conventions can be used to ignore extraneous trailing
	47	bytes. If the file type is unknown, then it must be treated as an "image"
	48	file, where all data must be preserved. The most typical image file is a .SV
	49	file, which is an image of files organized as pages and double-pages, with
	50	some trivial absolute loading instructions in a "header" block. Each record
	51	of the file is always "paired" off, i.e., the record has an implied
	52	double-page boundary of main memory it is meant to load into. If the loading
	53	instructions indicate a single-page load, then the first page must be loaded;
	54	the second half of the record is IGNORED. Notice it is impossible to specify
	55	singular loading of the "odd" page. OS/8 also supports various other formats,
	56	so it is difficult to obtain useful knowledge of the "inner" format of the
	57	file.
	58
	59	Encoding considerations.
	60
	61	If the 8-bit bytes of an OS/8 file are unpacked and packed faithfully,
	62	the resultant file will be an accurate copy of the original data. This is the
	63	basis for an alternate encoding format, perhaps more universal in scope, but
	64	it is NOT the method used currently. The method chosen here is to treat the
65	entire file as a contiguous stream of 5-bit bytes spanning words as necessary.
66	This means that bits are taken from left to right, five at a time, and each
67	encoded into a "printable" character for inclusion into the encoded file.
68	This means that data will form 60-bit groups of 12 5-bit characters
69	representing five 12-bit words. The 5-bit encoding is accomplished using the
70	ASCII coding for an extension to hexadecimal, which can be called
71	bi-hexadecimal, or base 32 notation. In this base, the values are 0-9, and
72	A-V (just the "natural" extension of hexadecimal's 0-9, A-F). The alphabetic
73	characters can be upper or lower case as necessary. This method is
74	theoretically 25% more efficient than hexadecimal ASCII since each character
75	holds 5-bit data rather than 4-bit data.
76
77	Since the 5-bit data has no "good" boundary for most computers, we will
78	use the "best" case for PDP-8 image data, the 60-bit group as described above.
79	Once started, a 60-bit group must be completed, thus there are boundaries
80	throughout the file every 12 characters.
81
82	At any boundary, the file may have compressed data. Compressed fields are
83	indicated by X (or x) followed by a four character field. The format is
84	basically a truncation of the "normal" 60-bit group, but only contains 20 bits
85	of data. The first 12 bits are the value of the data to be repeated. The
86	last eight bits are the repeat count. Values of the repeat count range from
87	1-256, where 256 is implied if the value is zero. Practical values range from
88	3-255, since one or two values would take less file space uncompressed. Due
89	to the boundary requirements, compression fields are independent of the data
90	preceeding them. The repeat count limitation to a maximum of 256 was felt to
91	be a reasonable compromise between compressed field length and adequate repeat
92	count. Making the repeat count even only double the current maximum would
93	require six character compression fields instead of five (including the X).
94	As an implementation restriction, the encoding program only reads one OS/8
95	record at a time, thus the case of 256 maximum repetitions occurs only at a
96	double-page boundary. The added complexity required to achieve infrequent and
97	minimal improvement was considered to be unimportant, but could be added
98	later. Thus adjacent repeated values split across boundaries, or split across
99	logical records will not contribute to a (single) compression field.
100
101	Note that compression is achieved by locating repeating strings of 12-bit
102	values; if the file were viewed as repeating strings of 8-bit bytes, then
103	compression would be less likely, except for cases like 0000 octal, which due
104	to "symmetry" are compressible via either method. Many PDP-8 image files
105	contain "filler" words, i.e., otherwise unloaded areas which are "pre-filled"
106	with constant data such as 0000 octal, or 7402 octal, which is the PDP-8 HLT
107	instruction. Image files filled with "large" regions of repeating strings of
108	7402 will not compress using 8-bit byte orientation.
109
110	Reliability considerations.
111
112	Even with the safeguards of a "robust" character set, file validity must
113	be tested to maintain data integrity. Towards this end, the encoding format
114	has several additional features. Unlike other "popular" formats, there is an
115	explicit "command" structure to the encoded file format. All lines of data
116	start with < and end with >. This prevents the characters from being
117	"massaged" into unwanted alternates. Various systems are known to "destroy"
118	files which have lines starting with "from", etc. By enclosing the data
119	lines, we prevent these problems. Additionally, a class of explicit commands
120	exist which start with ( and end with ). Instead of implied positioning,
121	there is a command called (FILE filename.ext), where the filename.ext is the
122	"suggested" name for the decoded file. The encoding program uses the original
123	un-encoded file's file name in this field. After the data, there is another
124	command (END filename.ext) which can be used to validate the data, since the
125	same file name should be in both commands (as implemented in the encoding
126	program). Several (REMARK {anything}) commands are inserted into the file at
127	various points to assist in reconstructing the original file, or in
128	documenting the fact that the file is an encoded "version" of another file.
129	Several "frill" REMARK commands are implemented to indicate the original file
130	date, and the date of encoded file creation. Today's date is always used for
131	the encoded file creation date. The original file date may be omitted, if the
132	system doesn't support Additional Information Words (AIWs), since this
133	optional feature must be present in order for the files to have creation
134	dates. The overall encoding format theoretically can be a concatenated series
135	of encoded files, each with its own (FILE ) and (END ) commands, but the
136	decoding program only supports single-file decoding as an implementation
137	restriction.
138
139	The file must always end with a good boundary condition. If the last
140	field is an X (compression) field, then this is already satisfied. If the
141	last field is ordinary data, then 1-4 12-bit words of 0000 octal will be added
142	at the end of the last field if necessary to ensure a good boundary. The end
143	of file is signified by a single Z (or z) character. At this point, an
144	extraneous record may be in progress. If it consists of four or less 12-bit
145	words of 0000 octal, it is discarded. Any other situation regarding a partial
146	record indicates defective data in the received encoded file.
147
148	After the single Z (z) character, there are 12 more characters: an entire
149	60-bit group. This is the file checksum. It is accomplished with
150	pentuple-precision arithmetic. It is the sum of all 12-bit data values with
151	carry into a total of five 12-bit words. Repeat compression data values are
152	also added, but only once for each compression field. The compression field's
153	repeat count is also added, but it is first multiplied by 16. (The repeat
154	count was expressed originally as *16 so it would have its eight significant
155	bits left-justified). The entire 60-bit quantity is expressed in two's
156	complement notation by negating it and encoding the group as any other 60-bit
157	group. Since most files are relatively short, the high-order bits are
158	generally zero, so most two's complement checksums start with 7777,7777 octal.
159	The five 12-bit quantities holding the checksum are encoded low-order first,
160	so the right-most characters in the checksum field tend to be V (v). This
161	order is used merely to accomodate multi-precision arithmetic, as anyone
162	attempting to observe "backwards bytes" on other machines is familiar with.
163
164	Future considerations.
165
166	This format is by no means "perfect," but it is more robust than most,
167	with a minimum of efficiency loss, given the tradeoffs involved. The data
168	bracketing characters can be changed if required. The characters W (w) and Y
169	(y) are available for this purpose. Files could incorporate a word or
170	character count, or other validation technique, etc. Each line could
171	incorporate a local count. These and other considerations could create a
172	"compromise" format that could be more generic and "pallatable" to other
173	systems. The checksum could be limited to 48 bits, which is more amenable to
174	8 and 16 bit architectures. Perhaps opening parameters could govern the
175	contents of the rest of the file, such as whether the checksum was present,
176	etc.
177	[end of file]