Writing Software for Portability, Part 2

By Gilbert Detillieux, Info West Inc.

Last time, we looked at how to structure a program to improve portability. By breaking it up into small, independent modules, and isolating system-specific code in just a few low-level modules, you make it more portable. Even more important for portability is the way we use data structures and data types. We will now look at certain considerations in using data in memory and, more importantly, in data files.

Data Types and Structures in Memory

Abstract data types can be used in many languages to help clarify or restrict the processing we do on that data. They are also useful for data whose representation may change, because the change then has to be made in only one place. In C, system header files define many such data types (time_t, uid_t, etc.), so we don't need to worry about the internal representation, which may change from system to system. Likewise, we can define our own data types (e.g., typedef long int acctnum_t).

You should never make assumptions about the size of data types and objects in memory, as these are machine (and compiler) specific. It may not even be safe to assume that a character of text takes one byte, if your code is to support international character sets (a C char is one byte by definition, but a multibyte or wide character is not).

Also use named constants for array sizes and counts. If the system headers define some of these (buffer sizes, maximum number of files), use them; otherwise, define your own. In C, always use "sizeof" when you need the size of a variable or type. For example, use malloc(NCHRS*sizeof(char)) rather than malloc(80) -- hard-coded constants for sizes are always a sign of trouble.

For structured data types, there are some extra considerations. Element sizes, alignment of elements, and padding can all vary with machine or compiler. Don't assume that padding is or isn't done, or how it's done. With unions (variant records in Pascal), don't assume a particular alignment of the various members.
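As a rough illustration of these points, the sketch below lets the compiler report sizes and offsets instead of hard-coding them. It assumes an ANSI C compiler (for offsetof in <stddef.h>); struct account and the function names are hypothetical, invented only for this example. NCHRS and acctnum_t are the names used in the text.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdlib.h>   /* malloc */

#define NCHRS 80               /* named constant instead of a bare 80 */

typedef long int acctnum_t;    /* abstract type from the text */

/* Hypothetical record, used only for illustration. */
struct account {
    char      type;
    acctnum_t number;
    short     branch;
};

/* Allocate a buffer sized from the constant and the element type. */
char *alloc_name(void)
{
    return malloc(NCHRS * sizeof(char));
}

/* Bytes of padding the compiler added to struct account (may be zero). */
size_t account_padding(void)
{
    return sizeof(struct account)
         - (sizeof(char) + sizeof(acctnum_t) + sizeof(short));
}

/* Offset of a member, asked of the compiler rather than computed by hand. */
size_t account_number_offset(void)
{
    return offsetof(struct account, number);
}

/* Checks that hold on any conforming compiler: the members do not
 * overlap, and the structure is at least as big as their sum. */
void check_account_layout(void)
{
    assert(offsetof(struct account, number) >= sizeof(char));
    assert(sizeof(struct account)
           >= sizeof(char) + sizeof(acctnum_t) + sizeof(short));
}
```

The values returned by account_padding() and account_number_offset() will differ from one machine or compiler to another, which is exactly why code should ask for them rather than assume them.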
Of course, you can't easily determine the overall size of a data structure in a portable way (other than using "sizeof" in C). If you're concerned about keeping the size of structures to a minimum, the best general approach is to put the largest elements first, then work your way down. This will reduce the amount of padding that might be introduced, and shouldn't hinder portability in any way. (In C, it's probably safe to assume the following size order: doubles, floats, pointer types, longs, ints, shorts, then chars.)

There might be restrictions on how you can allocate data, and on the size that you can allocate. Avoid allocating data in a way that isn't supported by all compilers (for example, initialized automatic arrays are accepted by ANSI C compilers but not by older ones, and the types that can be placed in registers vary with compilers). Some architectures have restrictions on the size of particular objects -- automatic variables may be restricted by the size of a stack frame, and static or dynamic data may be restricted by the maximum size of a data segment.

Data File Formats

All this data has to come from somewhere, and may have to go somewhere else. This usually means data files. You may have to design your own data file formats, and these should be designed with portability in mind too. The design should be based on usage -- decide what data must be kept, for what purpose, and for how long, then define your format accordingly.

Consider using text files for permanent data storage, as they are not only far more portable, but also more versatile, flexible, and maintainable. There are lots of existing programs that you can use to edit, filter, search, and sort these files, and their format is less tied to program code changes. Use binary files only if you have to, due to space or time constraints (i.e. if you need a more compact representation, or if conversion to and from the internal representation would be too slow).
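To make the text-file option concrete, here is a minimal sketch of one record stored as a tab-separated line. The record layout, the field names, and the choice of tab as separator are all assumptions made for this example, and snprintf requires a C99 library. A name containing a tab or newline, or beginning with a space, would not survive the round trip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record, for illustration only. */
struct item {
    long id;
    char name[32];
    long quantity;
};

/* Format one record as a tab-separated text line.
 * Returns the number of characters written, or -1 if buf is too small. */
int item_to_text(const struct item *it, char *buf, size_t buflen)
{
    int n;

    assert(it != NULL && buf != NULL);
    n = snprintf(buf, buflen, "%ld\t%s\t%ld\n",
                 it->id, it->name, it->quantity);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}

/* Parse a line produced by item_to_text.
 * Returns 0 on success, -1 if the line does not have three fields. */
int item_from_text(const char *line, struct item *it)
{
    assert(line != NULL && it != NULL);
    /* %31[^\t\n] reads at most 31 characters, leaving room for the
     * terminating null in name[32]. */
    if (sscanf(line, "%ld\t%31[^\t\n]\t%ld",
               &it->id, it->name, &it->quantity) != 3)
        return -1;
    return 0;
}
```

Because the numbers are stored as decimal text, the file does not depend on the size or byte order of a long on the machine that wrote it, and the records can be inspected or sorted with ordinary text tools.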
Clearly define the data types you are using, their size, representation, and even bit and byte ordering. Unlike data types in memory, where it was better not to make assumptions about internal representation, here we must know exactly how everything is stored, so the data can be treated consistently, regardless of which program, machine, or operating system is involved.

For data structures in files, use explicit data types. (In C, use short int or long int, rather than just int, which is more machine specific.) Avoid automatic padding between elements by ordering them carefully. Alternatively, you might want to introduce explicit pad bytes to force a particular alignment; this also allows for growth if you have to add new elements later. Define clearly any assumptions you make about element sizes and offsets, and test them with the various compilers you will use.

A Little Bit of Magic

It's a good idea to start each file with a "magic number" or "magic string" which can be used to uniquely identify the type of a file. The value you pick is arbitrary, but should be long enough (four bytes or more), and unique. If you use an integer magic number, your code can compare the bit pattern it reads against the expected value to determine the bit and byte ordering in the file automatically (e.g. 1234 vs. 4321). A magic string is perhaps more portable, and can be used for both text files (e.g. "%!" in PostScript files) and binary files (e.g. "GIF87a" in GIF files).

Version numbers in files are useful as well, to identify variations in a file's format. A major version number change could indicate a significant format change (or you might want to change the magic number if the files are incompatible), while a minor version number change would indicate a small change that doesn't affect the overall layout (e.g. new data stored in former pad bytes). Version numbers also could indicate the presence or absence of sections, allowing changes and additions to the file's contents while still maintaining compatibility.
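The following sketch pulls these ideas together for a binary file header: a magic string, major and minor version bytes, explicit pad bytes, and a 32-bit count stored in a fixed byte order. Everything specific here is an assumption made for illustration: the magic string "XMPL", the 12-byte layout, the field names, and the choice of most-significant-byte-first ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAGIC      "XMPL"  /* hypothetical magic string, 4 bytes */
#define MAGIC_LEN  4
#define HEADER_LEN 12      /* 4 magic + 1 major + 1 minor + 2 pad + 4 count */

/* In-memory form; its size and padding never reach the file. */
struct file_header {
    unsigned      ver_major;  /* incompatible format changes */
    unsigned      ver_minor;  /* compatible additions */
    unsigned long count;      /* number of records that follow */
};

/* Store a 32-bit value most significant byte first, whatever the host order. */
static void put_u32_be(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)((v >> 24) & 0xFF);
    p[1] = (unsigned char)((v >> 16) & 0xFF);
    p[2] = (unsigned char)((v >> 8) & 0xFF);
    p[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild a 32-bit value from four bytes, most significant first. */
static unsigned long get_u32_be(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
         | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
}

/* Encode the header into exactly HEADER_LEN bytes at fixed offsets. */
void header_encode(const struct file_header *h, unsigned char *out)
{
    assert(h != NULL && out != NULL);
    memcpy(out, MAGIC, MAGIC_LEN);
    out[4] = (unsigned char)(h->ver_major & 0xFF);
    out[5] = (unsigned char)(h->ver_minor & 0xFF);
    out[6] = 0;                 /* explicit pad bytes, reserved for growth */
    out[7] = 0;
    put_u32_be(out + 8, h->count);
}

/* Decode a header; returns 0 on success, -1 if the magic string is wrong. */
int header_decode(const unsigned char *in, struct file_header *h)
{
    assert(in != NULL && h != NULL);
    if (memcmp(in, MAGIC, MAGIC_LEN) != 0)
        return -1;
    h->ver_major = in[4];
    h->ver_minor = in[5];
    h->count = get_u32_be(in + 8);
    return 0;
}
```

The encoded buffer can be written with fwrite(buf, 1, HEADER_LEN, fp) and read back the same way. Since every byte is placed explicitly, the file layout does not depend on how a given compiler pads struct file_header or on the byte order of the machine.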
Avoiding Obsolescence

As much as possible, your programs should handle all file versions, so old data files aren't made obsolete by program changes. If it's feasible, data files can be converted on the fly when your program loads an older file, or one from a different system. Failing that, conversion programs should be provided, to support file formats for all versions, systems, and machines. Remember that data files are likely to outlive the programs that created them, and it may be important to continue to support them.

In a way, file formats are like protocols -- you should stick to a standard, define it clearly, and plan for future growth. Use industry standards where possible (e.g. ASCII or ISO character sets, IEEE floating point format, Sun's XDR standards), or define your own if you have to, but do so carefully. And most importantly, document everything you do (data types, structures, file formats, versions supported, conversions, defaults for missing data, etc.), keep these documents consistent with your code, and make them available to everyone who needs to know.

There is no simple formula to ensure portability of software and data. Choosing languages and systems that are portable is an important step, but that isn't enough. You have to plan for future growth in machines, systems, and in your own software. Think about portability from the start. Define and stick to standards, but also be flexible -- use an open-ended design to allow for unforeseen changes. And document, document, document!