Commit | Line | Data |
---|---|---|
81e70d48 PH |
1 | From: Charles Lasner <lasner@watsun.cc.columbia.edu> |
2 | Subject: ENCODE/DECODE format description | |
3 | ||
4 | The latest KERMIT-12 binary files are encoded in a specialized format. | |
5 | This document will describe the internal encoding and related subjects. | |
6 | ||
7 | OS/8 file considerations. | |
8 | ||
9 | All OS/8 files are contiguous multiples of PDP-8 double-page sized logical | |
10 | records. These are sometimes known as blocks, but are more accurately known | |
11 | as records. (The term block is associated with various hardware descriptions, | |
12 | and means different things to different people, such as DECtape blocks or RK05 | |
13 | blocks, where the first means a physical block which just happens to be 1/2 | |
14 | the size of the OS/8 logical record, and the second means a physical sector | |
15 | which is the same size as the OS/8 logical record, but only if the drive is | |
16 | attached to an RK8E. We will therefore avoid this term!) | |
17 | ||
18 | Since PDP-8 pages are 128 12-bit words each, the OS/8 record consists of | |
19 | 256 12-bit words, which can also be viewed as 384 8-bit bytes. For the | |
20 | benefit of various existing utilities, there is a standard method of unpacking | |
21 | the 8-bit bytes yielding a stream of coherent 8-bit bytes. The PDP-8 | |
22 | convention is to number bits from left to right starting with bit[0]. This is | |
23 | INCOMPATIBLE with the notation commonly used in other architectures, which is | |
24 | usually given as what power of 2 the bit represents. The PDP-8 notation is | |
25 | used to denote bit position in a manner consistent with significence of the | |
26 | bit, and arbitrarily uses origin 0, which is the usually assembly-language | |
27 | orientation. Using this notation, the first byte (byte #0) to be unpacked is | |
28 | taken from word[0] bits[4-11]. The second byte (byte #1) to be unpacked is | |
29 | taken from word[1] bits[4-11]. The third byte (byte #2) to be unpacked is | |
30 | taken from word[0] bits[0-3] concatenated with word[1] bits[0-3]. All bits | |
31 | are taken left to right as stated. This method is usually referred to as "3 | |
32 | for 2," and repeated accordingly will yield the correct stream of bytes for | |
33 | ASCII OS/8 files. OS/8 absolute binary files are images of 8-bit paper-tape | |
34 | frames packed in the same format. Although the high-order bit "matters" in | |
35 | absolute binary files, the high-bit is untrustworthy in ASCII files. Both | |
36 | types of files end with a ^Z character which will have the high-order bit set | |
37 | in the case of the absolute binary files. The reason it succeeds in the | |
38 | binary case is that the paper-tape format treats the high-bit present as | |
39 | leader or trailer, not loadable data, etc., so the loader ignores all leading | |
40 | high-bit set bytes, and finishes on the first trailing high-bit set byte. The | |
41 | binary file contains several leader bytes of 0200 octal, and several trailer | |
42 | bytes of 0200 followed by 232, the code for ^Z. There is no "fool-proof" way | |
43 | to derive these formats, rather these are usually given by the extensions .BN | |
44 | for binary, and various extensions (.LS, .PA, .MA, .DI, .BI, .TX, .DC, etc.) | |
45 | for ASCII files. If the file is "known" to be either ASCII or absolute | |
46 | binary, then these conventions can be used to ignore extraneous trailing | |
47 | bytes. If the file type is unknown, then it must be treated as an "image" | |
48 | file, where all data must be preserved. The most typical image file is a .SV | |
49 | file, which is an image of files organized as pages and double-pages, with | |
50 | some trivial absolute loading instructions in a "header" block. Each record | |
51 | of the file is always "paired" off, i.e., the record has an implied | |
52 | double-page boundary of main memory it is meant to load into. If the loading | |
53 | instructions indicate a single-page load, then the first page must be loaded; | |
54 | the second half of the record is IGNORED. Notice it is impossible to specify | |
55 | singular loading of the "odd" page. OS/8 also supports various other formats, | |
56 | so it is difficult to obtain useful knowledge of the "inner" format of the | |
57 | file. | |
58 | ||
59 | Encoding considerations. | |
60 | ||
61 | If the 8-bit bytes of an OS/8 file are unpacked and packed faithfully, | |
62 | the resultant file will be an accurate copy of the original data. This is the | |
63 | basis for an alternate encoding format, perhaps more universal in scope, but | |
64 | it is NOT the method used currently. The method chosen here is to treat the | |
65 | entire file as a contiguous stream of 5-bit bytes spanning words as necessary. | |
66 | This means that bits are taken from left to right, five at a time, and each | |
67 | encoded into a "printable" character for inclusion into the encoded file. | |
68 | This means that data will form 60-bit groups of 12 5-bit characters | |
69 | representing five 12-bit words. The 5-bit encoding is accomplished using the | |
70 | ASCII coding for an extension to hexadecimal, which can be called | |
71 | bi-hexadecimal, or base 32 notation. In this base, the values are 0-9, and | |
72 | A-V (just the "natural" extension of hexadecimal's 0-9, A-F). The alphabetic | |
73 | characters can be upper or lower case as necessary. This method is | |
74 | theoretically 25% more efficient than hexadecimal ASCII since each character | |
75 | holds 5-bit data rather than 4-bit data. | |
76 | ||
77 | Since the 5-bit data has no "good" boundary for most computers, we will | |
78 | use the "best" case for PDP-8 image data, the 60-bit group as described above. | |
79 | Once started, a 60-bit group must be completed, thus there are boundaries | |
80 | throughout the file every 12 characters. | |
81 | ||
82 | At any boundary, the file may have compressed data. Compressed fields are | |
83 | indicated by X (or x) followed by a four character field. The format is | |
84 | basically a truncation of the "normal" 60-bit group, but only contains 20 bits | |
85 | of data. The first 12 bits are the value of the data to be repeated. The | |
86 | last eight bits are the repeat count. Values of the repeat count range from | |
87 | 1-256, where 256 is implied if the value is zero. Practical values range from | |
88 | 3-255, since one or two values would take less file space uncompressed. Due | |
89 | to the boundary requirements, compression fields are independent of the data | |
90 | preceeding them. The repeat count limitation to a maximum of 256 was felt to | |
91 | be a reasonable compromise between compressed field length and adequate repeat | |
92 | count. Making the repeat count even only double the current maximum would | |
93 | require six character compression fields instead of five (including the X). | |
94 | As an implementation restriction, the encoding program only reads one OS/8 | |
95 | record at a time, thus the case of 256 maximum repetitions occurs only at a | |
96 | double-page boundary. The added complexity required to achieve infrequent and | |
97 | minimal improvement was considered to be unimportant, but could be added | |
98 | later. Thus adjacent repeated values split across boundaries, or split across | |
99 | logical records will not contribute to a (single) compression field. | |
100 | ||
101 | Note that compression is achieved by locating repeating strings of 12-bit | |
102 | values; if the file were viewed as repeating strings of 8-bit bytes, then | |
103 | compression would be less likely, except for cases like 0000 octal, which due | |
104 | to "symmetry" are compressible via either method. Many PDP-8 image files | |
105 | contain "filler" words, i.e., otherwise unloaded areas which are "pre-filled" | |
106 | with constant data such as 0000 octal, or 7402 octal, which is the PDP-8 HLT | |
107 | instruction. Image files filled with "large" regions of repeating strings of | |
108 | 7402 will not compress using 8-bit byte orientation. | |
109 | ||
110 | Reliability considerations. | |
111 | ||
112 | Even with the safeguards of a "robust" character set, file validity must | |
113 | be tested to maintain data integrity. Towards this end, the encoding format | |
114 | has several additional features. Unlike other "popular" formats, there is an | |
115 | explicit "command" structure to the encoded file format. All lines of data | |
116 | start with < and end with >. This prevents the characters from being | |
117 | "massaged" into unwanted alternates. Various systems are known to "destroy" | |
118 | files which have lines starting with "from", etc. By enclosing the data | |
119 | lines, we prevent these problems. Additionally, a class of explicit commands | |
120 | exist which start with ( and end with ). Instead of implied positioning, | |
121 | there is a command called (FILE filename.ext), where the filename.ext is the | |
122 | "suggested" name for the decoded file. The encoding program uses the original | |
123 | un-encoded file's file name in this field. After the data, there is another | |
124 | command (END filename.ext) which can be used to validate the data, since the | |
125 | same file name should be in both commands (as implemented in the encoding | |
126 | program). Several (REMARK {anything}) commands are inserted into the file at | |
127 | various points to assist in reconstructing the original file, or in | |
128 | documenting the fact that the file is an encoded "version" of another file. | |
129 | Several "frill" REMARK commands are implemented to indicate the original file | |
130 | date, and the date of encoded file creation. Today's date is always used for | |
131 | the encoded file creation date. The original file date may be omitted, if the | |
132 | system doesn't support Additional Information Words (AIWs), since this | |
133 | optional feature must be present in order for the files to have creation | |
134 | dates. The overall encoding format theoretically can be a concatenated series | |
135 | of encoded files, each with its own (FILE ) and (END ) commands, but the | |
136 | decoding program only supports single-file decoding as an implementation | |
137 | restriction. | |
138 | ||
139 | The file must always end with a good boundary condition. If the last | |
140 | field is an X (compression) field, then this is already satisfied. If the | |
141 | last field is ordinary data, then 1-4 12-bit words of 0000 octal will be added | |
142 | at the end of the last field if necessary to ensure a good boundary. The end | |
143 | of file is signified by a single Z (or z) character. At this point, an | |
144 | extraneous record may be in progress. If it consists of four or less 12-bit | |
145 | words of 0000 octal, it is discarded. Any other situation regarding a partial | |
146 | record indicates defective data in the received encoded file. | |
147 | ||
148 | After the single Z (z) character, there are 12 more characters: an entire | |
149 | 60-bit group. This is the file checksum. It is accomplished with | |
150 | pentuple-precision arithmetic. It is the sum of all 12-bit data values with | |
151 | carry into a total of five 12-bit words. Repeat compression data values are | |
152 | also added, but only once for each compression field. The compression field's | |
153 | repeat count is also added, but it is first multiplied by 16. (The repeat | |
154 | count was expressed originally as *16 so it would have its eight significant | |
155 | bits left-justified). The entire 60-bit quantity is expressed in two's | |
156 | complement notation by negating it and encoding the group as any other 60-bit | |
157 | group. Since most files are relatively short, the high-order bits are | |
158 | generally zero, so most two's complement checksums start with 7777,7777 octal. | |
159 | The five 12-bit quantities holding the checksum are encoded low-order first, | |
160 | so the right-most characters in the checksum field tend to be V (v). This | |
161 | order is used merely to accomodate multi-precision arithmetic, as anyone | |
162 | attempting to observe "backwards bytes" on other machines is familiar with. | |
163 | ||
164 | Future considerations. | |
165 | ||
166 | This format is by no means "perfect," but it is more robust than most, | |
167 | with a minimum of efficiency loss, given the tradeoffs involved. The data | |
168 | bracketing characters can be changed if required. The characters W (w) and Y | |
169 | (y) are available for this purpose. Files could incorporate a word or | |
170 | character count, or other validation technique, etc. Each line could | |
171 | incorporate a local count. These and other considerations could create a | |
172 | "compromise" format that could be more generic and "pallatable" to other | |
173 | systems. The checksum could be limited to 48 bits, which is more amenable to | |
174 | 8 and 16 bit architectures. Perhaps opening parameters could govern the | |
175 | contents of the rest of the file, such as whether the checksum was present, | |
176 | etc. | |
177 | [end of file] |