A large commit.
[pdp8.git] / sw / kermit / k12 / k12enc.doc
1 From: Charles Lasner <lasner@watsun.cc.columbia.edu>
2 Subject: ENCODE/DECODE format description
3
4 The latest KERMIT-12 binary files are encoded in a specialized format.
5 This document will describe the internal encoding and related subjects.
6
7 OS/8 file considerations.
8
9 All OS/8 files are contiguous multiples of PDP-8 double-page sized logical
10 records. These are sometimes known as blocks, but are more accurately known
11 as records. (The term block is associated with various hardware descriptions,
12 and means different things to different people, such as DECtape blocks or RK05
13 blocks, where the first means a physical block which just happens to be 1/2
14 the size of the OS/8 logical record, and the second means a physical sector
15 which is the same size as the OS/8 logical record, but only if the drive is
16 attached to an RK8E. We will therefore avoid this term!)
17
18 Since PDP-8 pages are 128 12-bit words each, the OS/8 record consists of
19 256 12-bit words, which can also be viewed as 384 8-bit bytes. For the
20 benefit of various existing utilities, there is a standard method of unpacking
21 the 8-bit bytes yielding a stream of coherent 8-bit bytes. The PDP-8
22 convention is to number bits from left to right starting with bit[0]. This is
23 INCOMPATIBLE with the notation commonly used in other architectures, which is
24 usually given as what power of 2 the bit represents. The PDP-8 notation is
25 used to denote bit position in a manner consistent with significence of the
26 bit, and arbitrarily uses origin 0, which is the usually assembly-language
27 orientation. Using this notation, the first byte (byte #0) to be unpacked is
28 taken from word[0] bits[4-11]. The second byte (byte #1) to be unpacked is
29 taken from word[1] bits[4-11]. The third byte (byte #2) to be unpacked is
30 taken from word[0] bits[0-3] concatenated with word[1] bits[0-3]. All bits
31 are taken left to right as stated. This method is usually referred to as "3
32 for 2," and repeated accordingly will yield the correct stream of bytes for
33 ASCII OS/8 files. OS/8 absolute binary files are images of 8-bit paper-tape
34 frames packed in the same format. Although the high-order bit "matters" in
35 absolute binary files, the high-bit is untrustworthy in ASCII files. Both
36 types of files end with a ^Z character which will have the high-order bit set
37 in the case of the absolute binary files. The reason it succeeds in the
38 binary case is that the paper-tape format treats the high-bit present as
39 leader or trailer, not loadable data, etc., so the loader ignores all leading
40 high-bit set bytes, and finishes on the first trailing high-bit set byte. The
41 binary file contains several leader bytes of 0200 octal, and several trailer
42 bytes of 0200 followed by 232, the code for ^Z. There is no "fool-proof" way
43 to derive these formats, rather these are usually given by the extensions .BN
44 for binary, and various extensions (.LS, .PA, .MA, .DI, .BI, .TX, .DC, etc.)
45 for ASCII files. If the file is "known" to be either ASCII or absolute
46 binary, then these conventions can be used to ignore extraneous trailing
47 bytes. If the file type is unknown, then it must be treated as an "image"
48 file, where all data must be preserved. The most typical image file is a .SV
49 file, which is an image of files organized as pages and double-pages, with
50 some trivial absolute loading instructions in a "header" block. Each record
51 of the file is always "paired" off, i.e., the record has an implied
52 double-page boundary of main memory it is meant to load into. If the loading
53 instructions indicate a single-page load, then the first page must be loaded;
54 the second half of the record is IGNORED. Notice it is impossible to specify
55 singular loading of the "odd" page. OS/8 also supports various other formats,
56 so it is difficult to obtain useful knowledge of the "inner" format of the
57 file.
58
59 Encoding considerations.
60
61 If the 8-bit bytes of an OS/8 file are unpacked and packed faithfully,
62 the resultant file will be an accurate copy of the original data. This is the
63 basis for an alternate encoding format, perhaps more universal in scope, but
64 it is NOT the method used currently. The method chosen here is to treat the
65 entire file as a contiguous stream of 5-bit bytes spanning words as necessary.
66 This means that bits are taken from left to right, five at a time, and each
67 encoded into a "printable" character for inclusion into the encoded file.
68 This means that data will form 60-bit groups of 12 5-bit characters
69 representing five 12-bit words. The 5-bit encoding is accomplished using the
70 ASCII coding for an extension to hexadecimal, which can be called
71 bi-hexadecimal, or base 32 notation. In this base, the values are 0-9, and
72 A-V (just the "natural" extension of hexadecimal's 0-9, A-F). The alphabetic
73 characters can be upper or lower case as necessary. This method is
74 theoretically 25% more efficient than hexadecimal ASCII since each character
75 holds 5-bit data rather than 4-bit data.
76
77 Since the 5-bit data has no "good" boundary for most computers, we will
78 use the "best" case for PDP-8 image data, the 60-bit group as described above.
79 Once started, a 60-bit group must be completed, thus there are boundaries
80 throughout the file every 12 characters.
81
82 At any boundary, the file may have compressed data. Compressed fields are
83 indicated by X (or x) followed by a four character field. The format is
84 basically a truncation of the "normal" 60-bit group, but only contains 20 bits
85 of data. The first 12 bits are the value of the data to be repeated. The
86 last eight bits are the repeat count. Values of the repeat count range from
87 1-256, where 256 is implied if the value is zero. Practical values range from
88 3-255, since one or two values would take less file space uncompressed. Due
89 to the boundary requirements, compression fields are independent of the data
90 preceeding them. The repeat count limitation to a maximum of 256 was felt to
91 be a reasonable compromise between compressed field length and adequate repeat
92 count. Making the repeat count even only double the current maximum would
93 require six character compression fields instead of five (including the X).
94 As an implementation restriction, the encoding program only reads one OS/8
95 record at a time, thus the case of 256 maximum repetitions occurs only at a
96 double-page boundary. The added complexity required to achieve infrequent and
97 minimal improvement was considered to be unimportant, but could be added
98 later. Thus adjacent repeated values split across boundaries, or split across
99 logical records will not contribute to a (single) compression field.
100
101 Note that compression is achieved by locating repeating strings of 12-bit
102 values; if the file were viewed as repeating strings of 8-bit bytes, then
103 compression would be less likely, except for cases like 0000 octal, which due
104 to "symmetry" are compressible via either method. Many PDP-8 image files
105 contain "filler" words, i.e., otherwise unloaded areas which are "pre-filled"
106 with constant data such as 0000 octal, or 7402 octal, which is the PDP-8 HLT
107 instruction. Image files filled with "large" regions of repeating strings of
108 7402 will not compress using 8-bit byte orientation.
109
110 Reliability considerations.
111
112 Even with the safeguards of a "robust" character set, file validity must
113 be tested to maintain data integrity. Towards this end, the encoding format
114 has several additional features. Unlike other "popular" formats, there is an
115 explicit "command" structure to the encoded file format. All lines of data
116 start with < and end with >. This prevents the characters from being
117 "massaged" into unwanted alternates. Various systems are known to "destroy"
118 files which have lines starting with "from", etc. By enclosing the data
119 lines, we prevent these problems. Additionally, a class of explicit commands
120 exist which start with ( and end with ). Instead of implied positioning,
121 there is a command called (FILE filename.ext), where the filename.ext is the
122 "suggested" name for the decoded file. The encoding program uses the original
123 un-encoded file's file name in this field. After the data, there is another
124 command (END filename.ext) which can be used to validate the data, since the
125 same file name should be in both commands (as implemented in the encoding
126 program). Several (REMARK {anything}) commands are inserted into the file at
127 various points to assist in reconstructing the original file, or in
128 documenting the fact that the file is an encoded "version" of another file.
129 Several "frill" REMARK commands are implemented to indicate the original file
130 date, and the date of encoded file creation. Today's date is always used for
131 the encoded file creation date. The original file date may be omitted, if the
132 system doesn't support Additional Information Words (AIWs), since this
133 optional feature must be present in order for the files to have creation
134 dates. The overall encoding format theoretically can be a concatenated series
135 of encoded files, each with its own (FILE ) and (END ) commands, but the
136 decoding program only supports single-file decoding as an implementation
137 restriction.
138
139 The file must always end with a good boundary condition. If the last
140 field is an X (compression) field, then this is already satisfied. If the
141 last field is ordinary data, then 1-4 12-bit words of 0000 octal will be added
142 at the end of the last field if necessary to ensure a good boundary. The end
143 of file is signified by a single Z (or z) character. At this point, an
144 extraneous record may be in progress. If it consists of four or less 12-bit
145 words of 0000 octal, it is discarded. Any other situation regarding a partial
146 record indicates defective data in the received encoded file.
147
148 After the single Z (z) character, there are 12 more characters: an entire
149 60-bit group. This is the file checksum. It is accomplished with
150 pentuple-precision arithmetic. It is the sum of all 12-bit data values with
151 carry into a total of five 12-bit words. Repeat compression data values are
152 also added, but only once for each compression field. The compression field's
153 repeat count is also added, but it is first multiplied by 16. (The repeat
154 count was expressed originally as *16 so it would have its eight significant
155 bits left-justified). The entire 60-bit quantity is expressed in two's
156 complement notation by negating it and encoding the group as any other 60-bit
157 group. Since most files are relatively short, the high-order bits are
158 generally zero, so most two's complement checksums start with 7777,7777 octal.
159 The five 12-bit quantities holding the checksum are encoded low-order first,
160 so the right-most characters in the checksum field tend to be V (v). This
161 order is used merely to accomodate multi-precision arithmetic, as anyone
162 attempting to observe "backwards bytes" on other machines is familiar with.
163
164 Future considerations.
165
166 This format is by no means "perfect," but it is more robust than most,
167 with a minimum of efficiency loss, given the tradeoffs involved. The data
168 bracketing characters can be changed if required. The characters W (w) and Y
169 (y) are available for this purpose. Files could incorporate a word or
170 character count, or other validation technique, etc. Each line could
171 incorporate a local count. These and other considerations could create a
172 "compromise" format that could be more generic and "pallatable" to other
173 systems. The checksum could be limited to 48 bits, which is more amenable to
174 8 and 16 bit architectures. Perhaps opening parameters could govern the
175 contents of the rest of the file, such as whether the checksum was present,
176 etc.
177 [end of file]