From: Charles Lasner
Subject: ENCODE/DECODE format description

     The latest KERMIT-12 binary files are encoded in a specialized format.
This document describes the internal encoding and related subjects.

OS/8 file considerations.

     All OS/8 files are contiguous multiples of PDP-8 double-page-sized
logical records.  These are sometimes known as blocks, but are more
accurately known as records.  (The term block is associated with various
hardware descriptions and means different things to different people, such
as DECtape blocks or RK05 blocks, where the first is a physical block which
just happens to be 1/2 the size of the OS/8 logical record, and the second
is a physical sector which is the same size as the OS/8 logical record, but
only if the drive is attached to an RK8E.  We will therefore avoid this
term!)

     Since PDP-8 pages are 128 12-bit words each, the OS/8 record consists
of 256 12-bit words, which can also be viewed as 384 8-bit bytes.  For the
benefit of various existing utilities, there is a standard method of
unpacking that yields a coherent stream of 8-bit bytes.  The PDP-8
convention is to number bits from left to right starting with bit[0].  This
is INCOMPATIBLE with the notation commonly used in other architectures,
which usually numbers a bit by the power of 2 it represents.  The PDP-8
notation denotes bit position in a manner consistent with the significance
of the bit, and arbitrarily uses origin 0, which is the usual
assembly-language orientation.  Using this notation, the first byte (byte
#0) to be unpacked is taken from word[0] bits[4-11].  The second byte (byte
#1) is taken from word[1] bits[4-11].  The third byte (byte #2) is taken
from word[0] bits[0-3] concatenated with word[1] bits[0-3].  All bits are
taken left to right as stated.  This method is usually referred to as "3
for 2," and repeated accordingly it yields the correct stream of bytes for
ASCII OS/8 files.  OS/8 absolute binary files are images of 8-bit
paper-tape frames packed in the same format.  Although the high-order bit
"matters" in absolute binary files, the high bit is untrustworthy in ASCII
files.  Both types of files end with a ^Z character, which will have the
high-order bit set in the case of absolute binary files.  The reason this
succeeds in the binary case is that the paper-tape format treats a byte
with the high bit set as leader or trailer, not loadable data, so the
loader ignores all leading high-bit-set bytes and finishes on the first
trailing high-bit-set byte.  The binary file contains several leader bytes
of 0200 octal, and several trailer bytes of 0200 followed by 232, the code
for ^Z.  There is no "fool-proof" way to deduce which format a file uses;
rather, the format is usually indicated by the extension .BN for binary,
and various extensions (.LS, .PA, .MA, .DI, .BI, .TX, .DC, etc.) for ASCII
files.  If the file is "known" to be either ASCII or absolute binary, then
these conventions can be used to ignore extraneous trailing bytes.
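
     To make the "3 for 2" packing concrete, here is a minimal sketch in
Python (not part of the KERMIT-12 sources; the function name and the use of
word lists are only conventions of this example):

    def unpack_3for2(words):
        """Unpack 12-bit OS/8 words into 8-bit bytes, "3 for 2" style.

        Per word pair: byte 0 is word[0] bits[4-11], byte 1 is word[1]
        bits[4-11], byte 2 is word[0] bits[0-3] followed by word[1]
        bits[0-3].  A 256-word OS/8 record thus yields 384 bytes.
        """
        out = []
        for i in range(0, len(words), 2):
            w0 = words[i] & 0o7777
            w1 = words[i + 1] & 0o7777
            out.append(w0 & 0o377)                    # byte 0: low 8 bits of word 0
            out.append(w1 & 0o377)                    # byte 1: low 8 bits of word 1
            out.append(((w0 >> 8) << 4) | (w1 >> 8))  # byte 2: the two high nibbles
        return out

Packing in the reverse direction simply reassembles each word from its low
byte and the appropriate nibble of the third byte.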
If the file type is unknown, then it must be treated as an "image" file,
where all data must be preserved.  The most typical image file is a .SV
file, which is an image organized as pages and double pages, with some
trivial absolute loading instructions in a "header" block.  Each record of
the file is always "paired" off, i.e., the record has an implied
double-page boundary of main memory it is meant to load into.  If the
loading instructions indicate a single-page load, then the first page must
be loaded; the second half of the record is IGNORED.  Notice that it is
impossible to specify singular loading of the "odd" page.  OS/8 also
supports various other formats, so it is difficult to obtain useful
knowledge of the "inner" format of a file.

Encoding considerations.

     If the 8-bit bytes of an OS/8 file are unpacked and repacked
faithfully, the resultant file will be an accurate copy of the original
data.  This is the basis for an alternate encoding format, perhaps more
universal in scope, but it is NOT the method used currently.  The method
chosen here is to treat the entire file as a contiguous stream of 5-bit
bytes spanning words as necessary.  Bits are taken from left to right, five
at a time, and each group of five is encoded into a "printable" character
for inclusion in the encoded file.  The data therefore forms 60-bit groups
of 12 5-bit characters, each group representing five 12-bit words.  The
5-bit encoding uses the ASCII coding for an extension of hexadecimal, which
can be called bi-hexadecimal, or base 32 notation.  In this base, the
values are 0-9 and A-V (just the "natural" extension of hexadecimal's 0-9,
A-F).  The alphabetic characters can be upper or lower case as necessary.
This method is theoretically 25% more efficient than hexadecimal ASCII,
since each character holds 5 bits of data rather than 4.

     Since the 5-bit data has no "good" boundary for most computers, we use
the "best" case for PDP-8 image data, the 60-bit group described above.
Once started, a 60-bit group must be completed; thus there are boundaries
throughout the file every 12 characters.

     At any boundary, the file may contain compressed data.  Compressed
fields are indicated by X (or x) followed by a four-character field.  The
format is basically a truncation of the "normal" 60-bit group, but contains
only 20 bits of data.  The first 12 bits are the value of the data to be
repeated.  The last eight bits are the repeat count.  Values of the repeat
count range from 1 to 256, where 256 is implied if the stored count is
zero.  Practical values range from 3 to 255, since one or two values would
take less file space uncompressed.  Due to the boundary requirements,
compression fields are independent of the data preceding them.  The repeat
count limitation to a maximum of 256 was felt to be a reasonable compromise
between compressed field length and adequate repeat count; extending the
repeat count to even double the current maximum would require
six-character compression fields instead of five (including the X).  As an
implementation restriction, the encoding program only reads one OS/8 record
at a time, thus the case of 256 repetitions occurs only at a double-page
boundary.  The added complexity required to achieve this infrequent and
minimal improvement was not considered worthwhile, but could be added
later.  Thus adjacent repeated values split across boundaries, or across
logical records, will not contribute to a single compression field.
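
     As an illustration of the grouping (a sketch only, not taken from the
KERMIT-12 sources), five 12-bit words might be emitted as one 12-character
base-32 group, and a compression field as "X" plus four characters, roughly
as follows.  The function names are inventions of this example, and the
assumption that the four compression-field characters carry their 20 bits
in the same left-to-right, five-bits-at-a-time order as a normal group is
not stated explicitly above.

    DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"   # base-32 ("bi-hexadecimal") digits

    def encode_group(words):
        """Encode five 12-bit words as one 60-bit group of 12 base-32 characters."""
        assert len(words) == 5
        value = 0
        for w in words:                  # bits taken left to right, word by word
            value = (value << 12) | (w & 0o7777)
        return "".join(DIGITS[(value >> shift) & 0o37]
                       for shift in range(55, -1, -5))   # twelve 5-bit slices

    def encode_repeat(word, count):
        """Encode a compression field: X, then a 12-bit value and an 8-bit count."""
        assert 1 <= count <= 256                  # a count of 256 is stored as zero
        field = ((word & 0o7777) << 8) | (count & 0o377)   # 20 bits total
        return "X" + "".join(DIGITS[(field >> shift) & 0o37]
                             for shift in range(15, -1, -5))

Under these assumptions, encode_repeat(0o7402, 256) yields "XU0G0", i.e. an
entire double page pre-filled with HLT instructions collapses to five
characters.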

     Note that compression is achieved by locating repeating strings of
12-bit values; if the file were viewed as repeating strings of 8-bit bytes,
then compression would be less likely, except for cases like 0000 octal,
which due to "symmetry" are compressible via either method.  Many PDP-8
image files contain "filler" words, i.e., otherwise unloaded areas which
are "pre-filled" with constant data such as 0000 octal, or 7402 octal,
which is the PDP-8 HLT instruction.  Image files filled with "large"
regions of repeating strings of 7402 will not compress using an 8-bit byte
orientation.

Reliability considerations.

     Even with the safeguards of a "robust" character set, file validity
must be tested to maintain data integrity.  Towards this end, the encoding
format has several additional features.  Unlike other "popular" formats,
there is an explicit "command" structure to the encoded file format.  All
lines of data start with < and end with >.  This prevents the characters
from being "massaged" into unwanted alternates.  Various systems are known
to "destroy" files which have lines starting with "from", etc.  By
enclosing the data lines, we prevent these problems.  Additionally, a class
of explicit commands exists which start with ( and end with ).  Instead of
implied positioning, there is a command called (FILE filename.ext), where
filename.ext is the "suggested" name for the decoded file.  The encoding
program uses the original un-encoded file's name in this field.  After the
data, there is another command, (END filename.ext), which can be used to
validate the data, since the same file name should be in both commands (as
implemented in the encoding program).  Several (REMARK {anything}) commands
are inserted into the file at various points to assist in reconstructing
the original file, or in documenting the fact that the file is an encoded
"version" of another file.  Several "frill" REMARK commands are implemented
to indicate the original file date and the date of encoded file creation.
Today's date is always used for the encoded file creation date.  The
original file date may be omitted if the system doesn't support Additional
Information Words (AIWs), since this optional feature must be present for
files to have creation dates.  The overall encoding format can
theoretically be a concatenated series of encoded files, each with its own
(FILE ) and (END ) commands, but the decoding program only supports
single-file decoding as an implementation restriction.

     The file must always end with a good boundary condition.  If the last
field is an X (compression) field, then this is already satisfied.  If the
last field is ordinary data, then 1-4 12-bit words of 0000 octal will be
added at the end of the last field if necessary to ensure a good boundary.
The end of file is signified by a single Z (or z) character.  At this
point, an extraneous record may be in progress.  If it consists of four or
fewer 12-bit words of 0000 octal, it is discarded.  Any other situation
regarding a partial record indicates defective data in the received encoded
file.
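
     As a rough illustration of the line structure described above (the
exact layout of an encoded file is not fully specified here, so the details
are assumptions of this example), a decoder might classify each incoming
line as follows:

    def classify_line(line):
        """Sort one line of an encoded file into command, data, or other.

        Commands are bracketed by ( and ), e.g. (FILE NAME.EX),
        (END NAME.EX), (REMARK anything); data lines are bracketed by
        < and >.  Handling of the trailing Z and checksum characters is
        left out of this sketch.
        """
        line = line.strip()
        if line.startswith("(") and line.endswith(")"):
            keyword, _, argument = line[1:-1].strip().partition(" ")
            return ("command", keyword.upper(), argument)
        if line.startswith("<") and line.endswith(">"):
            return ("data", line[1:-1], None)     # encoded characters between brackets
        return ("other", line, None)              # e.g. mail headers or blank lines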

     After the single Z (z) character, there are 12 more characters: an
entire 60-bit group.  This is the file checksum.  It is accomplished with
pentuple-precision arithmetic: it is the sum of all 12-bit data values,
with carries, into a total of five 12-bit words.  Repeated compression data
values are also added, but only once for each compression field.  The
compression field's repeat count is also added, but it is first multiplied
by 16.  (The repeat count was originally expressed multiplied by 16 so that
its eight significant bits would be left-justified.)  The entire 60-bit
quantity is expressed in two's complement notation by negating it and
encoding the group like any other 60-bit group.  Since most files are
relatively short, the high-order bits are generally zero, so most two's
complement checksums start with 7777,7777 octal.  The five 12-bit
quantities holding the checksum are encoded low-order first, so the
right-most characters in the checksum field tend to be V (v).  This order
is used merely to accommodate multi-precision arithmetic, as anyone who has
had to deal with "backwards bytes" on other machines will recognize.  (A
sketch of this computation appears at the end of this document.)

Future considerations.

     This format is by no means "perfect," but it is more robust than most,
with a minimum of efficiency loss, given the tradeoffs involved.  The data
bracketing characters can be changed if required; the characters W (w) and
Y (y) are available for this purpose.  Files could incorporate a word or
character count, or some other validation technique.  Each line could
incorporate a local count.  These and other considerations could create a
"compromise" format that could be more generic and "palatable" to other
systems.  The checksum could be limited to 48 bits, which is more amenable
to 8- and 16-bit architectures.  Perhaps opening parameters could govern
the contents of the rest of the file, such as whether the checksum is
present.
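
     To illustrate the checksum described above, here is a minimal sketch
(not from the KERMIT-12 sources; the function name, the argument layout,
and the treatment of a 256 repeat count as zero are assumptions of this
example):

    MASK60 = (1 << 60) - 1   # five 12-bit words treated as one 60-bit quantity

    def checksum_words(data_words, repeat_fields):
        """Sum all plain 12-bit data words plus, for each compression field,
        its repeated value once and its repeat count times 16, into a 60-bit
        total with carries; then negate (two's complement)."""
        total = 0
        for w in data_words:                      # every uncompressed 12-bit value
            total = (total + (w & 0o7777)) & MASK60
        for value, count in repeat_fields:        # one (value, count) per X field
            total = (total + (value & 0o7777)) & MASK60
            total = (total + (count & 0o377) * 16) & MASK60   # count left-justified
        total = (-total) & MASK60                 # negate the whole 60-bit quantity
        # Split into five 12-bit words, low-order word first, ready to be
        # emitted as a 60-bit group (cf. the encode_group sketch earlier).
        return [(total >> shift) & 0o7777 for shift in (0, 12, 24, 36, 48)]

Because the negated total of a short file has all ones in its high-order
words, the last words returned here are usually 7777 octal, which is why
the right-most checksum characters tend to be V.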