                Description of libisofs MD5 checksumming

              by Thomas Schmitt    - mailto:scdbackup@gmx.net
              Libburnia project    - mailto:libburn-hackers@pykix.org
                              16 Aug 2009


MD5 is a 128 bit message digest with a very low probability to be the same for
any pair of differing data files. It is described in RFC 1321 and can be
computed e.g. by program md5sum.

libisofs can equip its images with MD5 checksums for superblock, directory
tree, the whole session, and for each single data file.
See libisofs.h, iso_write_opts_set_record_md5().

The data file checksums get loaded together with the directory tree if this
is enabled by iso_read_opts_set_no_md5(). Loaded checksums can be inquired by
iso_image_get_session_md5() and iso_file_get_md5().

Stream recognizable checksum tags occupy exactly one block each. They can
be detected by submitting a read-in block to iso_util_decode_md5_tag().

libisofs has its own MD5 computation functions:
iso_md5_start(), iso_md5_compute(), iso_md5_clone(), iso_md5_end()


                     Representation in the Image

The checksums are stored as stream recognizable checksum tags and as a compact
array at the end of the session. The latter allows quick loading of many file
checksums from media with slow random access.


                          The Checksum Array

Location and layout of the checksum array are recorded as AAIP attribute
"isofs.ca" of the root node.
See doc/susp_aaip_2_0.txt for a general description of AAIP and
doc/susp_aaip_isofs_names.txt for the layout of "isofs.ca".

The single data files hold an index to their MD5 checksum in individual AAIP
attributes "isofs.cx". Index I means: array base address + 16 * I.

If there are N checksummed data files then the array consists of N + 2 entries
with 16 bytes each.

Entry number 0 holds a session checksum which covers the range from the session
start block up to (but not including) the start block of the checksum area.
This range is described by attribute "isofs.ca" of the root node.

Entries 1 to N hold the checksums of individual data files.

Entry number N + 1 holds the MD5 checksum of entries 0 to N.


                          The Checksum Tags

Because the inquiry of AAIP attributes demands loading of the image tree,
there are also checksum tags which can be detected on the fly when reading
and checksumming the session from the start point as learned from a media
table-of-content.

The superblock checksum tag is written after the ECMA-119 volume descriptors.
The tree checksum tag is written after the ECMA-119 directory entries.
The session checksum tag is written after all payload including the checksum
array. (Padding may follow.)

The tags are single lines of printable text, padded by 0 bytes. They have
the following format:

  Tag_id pos=# range_start=# range_size=# [next=#] md5=# self=#\n

Tag_id distinguishes the three tag types
  "libisofs_sb_checksum_tag_v1"     Superblock tag
  "libisofs_tree_checksum_tag_v1"   Directory tree tag
  "libisofs_checksum_tag_v1"        Session tag

Example (session starts at Logical Block Address 32):

  <... ECMA-119 System Area and Volume Descriptors ...>
  libisofs_sb_checksum_tag_v1 pos=50 range_start=32 range_size=18 md5=17471035f1360a69eedbd1d0c67a6aa2 self=52d602210883eeababfc9cd287e28682
  <... ECMA-119 Directory Entries ...>
  libisofs_tree_checksum_tag_v1 pos=334 range_start=32 range_size=302 md5=41acd50285339be5318decce39834a45 self=fe100c338c8f9a494a5432b5bfe6bf3c
  <... Data file payload and checksum array ...>
  libisofs_checksum_tag_v1 pos=81554 range_start=32 range_size=81522 md5=8adb404bdf7f5c0a078873bb129ee5b9 self=57c2c2192822b658240d62cbc88270cb

There are up to six tag parameters. pos=, range_start=, range_size=, and the
optional next= are decimal numbers. md5= and self= are strings of 32 hex
digits:

 pos=
 gives the block address where the tag supposes itself to be stored.
 If this does not match the block address where the tag is found then this
 either indicates that the tag is payload of the image or that the image has
 been relocated. (The latter makes the image unusable.)

 range_start=
 The block address where the session is supposed to start. If this does not
 match the session start on media then the volume descriptors of the
 image have been relocated.
 (This can happen with overwriteable media. If checksumming started at LBA 0
  and finds range_start=32, then one has to restart checksumming at LBA 32.
  See libburn/doc/cookbook.txt
  "ISO 9660 multi-session emulation on overwriteable media"
  for background information.)

 range_size=
 The number of blocks beginning at range_start which are covered by the
 checksum of the tag.

 Only with superblock tag and tree tag:
 next=
 The block address where the next tag is supposed to be found. This is
 to avoid the small possibility that a checksum tag with matching position
 is part of a directory entry or data file. The superblock tag is quite
 uniquely placed directly after the ECMA-119 Volume Descriptor Set Terminator
 where no such cleartext is supposed to reside by accident.

 md5=
 The checksum payload of the tag as lower case hex digits.

 self=
 The MD5 checksum of the tag itself up to and including the last hex digit of
 parameter "md5=".

The newline character at the end is mandatory. For now all bytes of the
block after that newline shall be zero. Future extensions may arise.


-------------------------------------------------------------------------------

                          Usage at Read Time

                  Checking Before Image Tree Loading

In order to check for a trustworthy loadable image tree, read the first 32
blocks from the session start and look in blocks 16 to 32 for the superblock
checksum tag by
  iso_util_decode_md5_tag(block, &tag_type, &pos,
                          &range_start, &range_size, &next_tag, md5, 2);

If it appears and has plausible parameters, then check whether its MD5 matches
the MD5 of the data blocks which were read before.
(Keep the original MD5 context of the data blocks and clone one for obtaining
 the MD5 bytes.
 Compute the tag block into the MD5 checksum only after you are done with
 interpreting it.)

If those MD5s match, then compute the checksum block into the kept MD5 context
and go on with reading and computing for the tree checksum tag. This will be
found at block address next_tag, verified and parsed by:
  iso_util_decode_md5_tag(block, &tag_type, &pos,
                          &range_start, &range_size, &next_tag, md5, 3);

Again, if the parameters match the reading state, the MD5 must match the
MD5 computed from the data blocks which were read before.
If so, then the tree is ok and safe to be loaded by iso_image_import().


                      Checking a Whole Session

In order to check the trustworthiness of a whole session, continue reading
and checksumming after the tree was verified.

Read and checksum the blocks. When reaching block address next_tag (from the
tree tag) submit this block to
  iso_util_decode_md5_tag(block, &tag_type, &pos,
                          &range_start, &range_size, &next_tag, md5, 1);

If this returns 1, then check whether the returned parameters pos, range_start,
and range_size match the state of block reading, and whether the returned
bytes in parameter md5 match the MD5 computed from the data blocks which were
read before the tag block.


               Checking Single Files in a Loaded Image

Once the image has been loaded, you can obtain MD5 sums from IsoNode objects
which fulfill
  iso_node_get_type(node) == LIBISO_FILE

The recorded checksum can be obtained by
  iso_file_get_md5(image, (IsoFile *) node, md5, 0);

For accessing the file data in the loaded image use
  iso_file_get_stream((IsoFile *) node);
to get the data stream of the object.
The checksums cover the data content as it was actually written into the ISO
image stream, not necessarily as it was on hard disk before or afterwards.
This implies that content filtered files bear the MD5 of the filtered data
and not of the original files on disk. When checkreading, one has to avoid
any reverse filtering.
Dig out the stream which directly reads image data
by calling iso_stream_get_input_stream() until it returns NULL and use
iso_stream_get_size() rather than iso_file_get_size().

Now you may call iso_stream_open(), iso_stream_read(), iso_stream_close()
for reading file content from the loaded image.


                   Session Check in a Loaded Image

iso_image_get_session_md5() gives the start LBA and the session payload size
as recorded in "isofs.ca", and the session checksum as recorded in the
checksum array.
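

                 Sketch: Inspecting a Tag Line by Hand

The tag format from "The Checksum Tags" and the parameter comparison from
"Checking a Whole Session" can be illustrated without libisofs. The following
is only a minimal sketch and not the code of iso_util_decode_md5_tag(). The
names struct tag_fields, parse_tag_line() and tag_matches_reading_state()
are hypothetical and do not exist in libisofs. The sketch splits a tag line
into its parameters and compares them with the reading state. It does not
verify md5= or self=, because that needs an MD5 implementation such as the
iso_md5_*() functions.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parameters of one checksum tag line. */
struct tag_fields {
    char tag_id[40];
    uint32_t pos, range_start, range_size, next;
    int has_next;            /* 1 if the optional next= was present */
    char md5_hex[33];        /* 32 hex digits plus terminating 0 */
    char self_hex[33];
};

/* Read the decimal number which follows key. key begins with a blank. */
static int get_decimal(const char *line, const char *key, uint32_t *out)
{
    const char *v = strstr(line, key);
    char *end;
    unsigned long n;

    if (v == NULL)
        return 0;
    v += strlen(key);
    if (*v < '0' || *v > '9')
        return 0;
    errno = 0;
    n = strtoul(v, &end, 10);
    if (errno == ERANGE || n > 0xffffffffUL || *end != ' ')
        return 0;
    *out = (uint32_t) n;
    return 1;
}

/* Copy the 32 lower case hex digits which follow key. */
static int get_hex32(const char *line, const char *key, char out[33])
{
    const char *v = strstr(line, key);
    int i;

    if (v == NULL)
        return 0;
    v += strlen(key);
    for (i = 0; i < 32; i++) {
        char c = v[i];
        if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
            return 0;        /* also catches a premature end of string */
        out[i] = c;
    }
    out[32] = 0;
    return 1;
}

/* Split a 0-terminated tag line into its parameters. Returns 1 on success. */
int parse_tag_line(const char *line, struct tag_fields *t)
{
    const char *sp = strchr(line, ' ');
    const char *self;
    size_t id_len;

    if (sp == NULL)
        return 0;
    id_len = (size_t) (sp - line);
    if (id_len == 0 || id_len >= sizeof(t->tag_id))
        return 0;
    memcpy(t->tag_id, line, id_len);
    t->tag_id[id_len] = 0;
    if (!get_decimal(line, " pos=", &t->pos) ||
        !get_decimal(line, " range_start=", &t->range_start) ||
        !get_decimal(line, " range_size=", &t->range_size))
        return 0;
    t->next = 0;
    t->has_next = get_decimal(line, " next=", &t->next);
    if (!get_hex32(line, " md5=", t->md5_hex) ||
        !get_hex32(line, " self=", t->self_hex))
        return 0;
    self = strstr(line, " self=");
    return self[6 + 32] == '\n';      /* the final newline is mandatory */
}

/* Compare pos, range_start and range_size with the state of block reading. */
int tag_matches_reading_state(const struct tag_fields *t, uint32_t found_at,
                              uint32_t session_start, uint32_t blocks_read)
{
    return t->pos == found_at && t->range_start == session_start &&
           t->range_size == blocks_read;
}
```

With the session tag of the example above, parse_tag_line() would yield
pos 81554, range_start 32 and range_size 81522, and the tag would match a
reading state where 81522 blocks were checksummed from LBA 32 on and the tag
was found at LBA 81554. Real applications should rather submit the block to
iso_util_decode_md5_tag(), which also checks the MD5 of the tag itself.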