Writing Software for Portability, Part 2

By Gilbert Detillieux, Info West Inc.

Last time, we looked at how to structure a program to
improve portability. By breaking it up into small,
independent modules, and isolating system-specific code in
just a few low-level modules, you make it more portable.
Even more important for portability is the way we use
data structures and data types.  We will now look at certain
considerations in using data in memory and, more
importantly, in data files.

Data Types and Structures in Memory

Abstract data types can be used in many languages to
clarify or restrict the processing we do on data.  They
also help with data whose representation may change, since
such a change is then confined to the definition of the
type.  In C, system header files
define many such data types (time_t, uid_t, etc.), so we
don't need to worry about the internal representation, which
may change from system to system.  Likewise, we can
define our own data types (e.g., typedef long int
acctnum_t).
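
As a minimal sketch of the idea, the fragment below hides the
representation of an account number behind the typedef
mentioned above.  The function acctnum_format and the
eight-digit output format are assumptions made for
illustration only.

```c
#include <stdio.h>

/* The underlying type can be changed here, in one place, without
   touching the code that uses acctnum_t. */
typedef long int acctnum_t;

/* Hypothetical helper: format an account number into buf as eight
   zero-padded digits.  Returns what snprintf returns: the number of
   characters needed, excluding the terminating NUL. */
int acctnum_format(acctnum_t n, char *buf, size_t buflen)
{
    /* Cast to long so the format string stays correct even if the
       typedef later becomes a narrower integer type. */
    return snprintf(buf, buflen, "%08ld", (long)n);
}
```

Callers deal only in acctnum_t and never need to know that it
is currently a long int.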

You should never make assumptions about the size of data
types and objects in memory, as these are machine (and
compiler) specific. It may not even be safe to assume that a
character takes one byte, if your code is to support
international character sets: in C, sizeof(char) is 1 by
definition, but a character in a multibyte or wide encoding
can occupy several bytes.  Also use named constants for
array sizes and counts.  If the system headers define some
of these (buffer sizes, maximum number of files), use them;
otherwise, define your own. In C, always use "sizeof" to
specify sizes when you need the size of a type or variable.
For example, use malloc(NCHRS*sizeof(char)) rather than
malloc(80) -- hard-coded constants for sizes are a
sign of trouble.
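
The following sketch shows one way to apply this advice.  The
value of NCHRS and the function names alloc_line and
alloc_counts are assumptions chosen for the example.

```c
#include <stdlib.h>

#define NCHRS 80   /* hypothetical line length, defined in one place */

/* Allocate a buffer for NCHRS characters plus a terminating NUL. */
char *alloc_line(void)
{
    return malloc((NCHRS + 1) * sizeof(char));
}

/* Allocate a zero-filled array of `count` longs.  Writing sizeof *p
   keeps the element size correct even if the type of p changes. */
long *alloc_counts(size_t count)
{
    long *p = calloc(count, sizeof *p);
    return p;   /* NULL if the allocation failed */
}
```

Neither function contains a number that depends on the size
of a type, so the same source should behave the same way on
machines with different sizes for long.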

For structured data types, there are some extra
considerations.  Element sizes, alignment of elements and
padding can all vary with machine or compiler.  Don't assume
padding is done or not, or how it's done.  With unions
(variant records in Pascal), don't assume a particular
alignment of the various members.  Of course, you can't
easily determine the overall size of a data structure in a
portable way (other than using "sizeof" in C).  If you're
concerned about keeping the size of structures down to a
minimum, you can assume that the best way to do that is to
put all the largest elements first, then work your way down.
This will reduce the amount of padding that might be
introduced, and shouldn't hinder portability in any way.
(In C, it's probably safe to assume the following size
order: doubles, pointer types, longs, floats, ints, shorts,
then chars.  Pointers and longs are often larger than floats
on 64-bit machines, and no smaller on 32-bit ones.)
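
To see the effect of member order, consider this sketch.  The
two structures and the helper functions are hypothetical; the
actual amount of padding depends on the compiler and machine,
which is exactly why it is measured with "sizeof" rather than
assumed.

```c
#include <stddef.h>

/* Members in arbitrary order: padding is likely before `value`
   and after `count`. */
struct loose {
    char   tag;
    double value;
    char   flag;
    short  count;
};

/* The same members, largest first. */
struct tight {
    double value;
    short  count;
    char   tag;
    char   flag;
};

/* Bytes actually occupied by the members of either structure. */
#define MEMBER_BYTES (sizeof(double) + sizeof(short) + 2 * sizeof(char))

/* Padding bytes the compiler added to each layout. */
size_t loose_padding(void) { return sizeof(struct loose) - MEMBER_BYTES; }
size_t tight_padding(void) { return sizeof(struct tight) - MEMBER_BYTES; }
```

With 8-byte doubles aligned on 8-byte boundaries, the first
layout typically takes 24 bytes and the second 16, but only
the comparison, not the exact figures, should be relied on.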

There might be restrictions on how you can allocate data,
and the size that you can allocate.  Avoid allocating data
in a way that isn't supported by all compilers (for example,
initialized automatic arrays are part of ANSI C but are
rejected by older, pre-ANSI compilers, and the types that
can be held in registers vary with compilers).  Some
architectures have restrictions on
the size of particular objects -- automatic variables may be
restricted by the size of a stack frame, or static or
dynamic data may be restricted by the maximum size of a data
segment.

Data File Formats

All this data has to come from somewhere, and may have to go
somewhere else.  This usually means data files.  You may
have to design your own data file formats, and these should
be designed with portability in mind too.  The design should
be based on usage -- decide what data must be kept, for what
purpose, and for how long, then define your format
accordingly. Consider using text files for permanent data
storage, as this is not only far more portable, but more
versatile, flexible, and maintainable. There are lots of
existing programs that you can use to edit, filter, search,
and sort these files, and their format is less tied to
program code changes.
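
As a rough illustration of a text format, the sketch below
stores one record per line, with fields separated by colons.
The record layout, the field names, and both function names
are assumptions made for this example.

```c
#include <stdio.h>

/* Hypothetical record kept in a text file, one per line:
       acctnum:balance_cents:name                              */
struct account {
    long acctnum;
    long balance_cents;
    char name[32];
};

/* Write the record into buf as one text line.  Returns the value
   of snprintf (characters needed, excluding the NUL). */
int account_to_line(const struct account *a, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%ld:%ld:%s\n",
                    a->acctnum, a->balance_cents, a->name);
}

/* Parse one line back into a record.  Returns 1 on success, 0 if
   the line does not have all three fields. */
int account_from_line(const char *line, struct account *a)
{
    /* %31[^\n] reads at most 31 characters of the name, stopping at
       the newline, so name[] cannot overflow. */
    return sscanf(line, "%ld:%ld:%31[^\n]",
                  &a->acctnum, &a->balance_cents, a->name) == 3;
}
```

A file in this form can be inspected with an editor, sorted on
the first field, or searched by name, with no knowledge of how
the program lays the structure out in memory.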

Use binary files only if you have to, due to space or time
constraints (i.e. if you need a more compact representation,
or if conversion to/from internal representation would be
too slow).  Clearly define the data types you are using,
their size, representation, and even bit and byte ordering.
Unlike for data types in memory, where it was better not to
make assumptions about internal representation, here we must
know exactly how everything is stored, so the data can be
treated consistently, regardless of which program, machine,
or operating system is involved.
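
One common way to pin down byte ordering is to build each
value from individual bytes instead of writing the in-memory
object directly.  The sketch below assumes a format that
defines a field as 32 bits, most significant byte first; the
function names are hypothetical.

```c
/* Store the low 32 bits of v in buf[0..3], most significant byte
   first, whatever the byte order or size of long on this machine. */
void put_u32_be(unsigned char *buf, unsigned long v)
{
    buf[0] = (unsigned char)((v >> 24) & 0xFF);
    buf[1] = (unsigned char)((v >> 16) & 0xFF);
    buf[2] = (unsigned char)((v >> 8) & 0xFF);
    buf[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild the value from four bytes read from the file. */
unsigned long get_u32_be(const unsigned char *buf)
{
    return ((unsigned long)buf[0] << 24)
         | ((unsigned long)buf[1] << 16)
         | ((unsigned long)buf[2] << 8)
         |  (unsigned long)buf[3];
}
```

The bytes can then be written with fwrite and read with fread,
and the file contents no longer depend on the machine that
produced them.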

For data structures in files, use explicit data types.  (In
C, use short int, or long int, rather than just int, which
is more machine specific.) Avoid automatic padding between
elements by ordering them carefully, or introduce explicit
pad bytes to force a particular
alignment; this also allows for growth if you have to add new
elements later.  Define clearly any assumptions you make
about element sizes and offsets, and test them with the
various compilers you will use.
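
Such assumptions can be tested in code.  This sketch uses a
made-up 8-byte record with explicit pad bytes and checks the
documented offsets with "offsetof"; the record and the
function name are assumptions, and the check is expected to
hold only on compilers with 2-byte shorts.

```c
#include <stddef.h>

/* Hypothetical 8-byte file record, documented as:
     offset 0  id     2 bytes
     offset 2  type   1 byte
     offset 3  pad    1 byte   (reserved, written as zero)
     offset 4  count  2 bytes
     offset 6  spare  2 bytes  (reserved for future fields)  */
struct file_rec {
    short int id;
    char      type;
    char      pad;
    short int count;
    char      spare[2];
};

/* Returns 1 if this compiler lays the record out as documented,
   0 otherwise. */
int file_rec_layout_ok(void)
{
    return sizeof(short int) == 2
        && offsetof(struct file_rec, id)    == 0
        && offsetof(struct file_rec, type)  == 2
        && offsetof(struct file_rec, pad)   == 3
        && offsetof(struct file_rec, count) == 4
        && offsetof(struct file_rec, spare) == 6
        && sizeof(struct file_rec) == 8;
}
```

A program could call this once at startup and refuse to read
or write records directly if the check fails, falling back to
field-by-field conversion instead.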

A Little Bit of Magic

It's a good idea to start each file with a "magic number" or
"magic string" that uniquely identifies the
type of the file.  The value you pick is arbitrary, but should
be long enough (four bytes or more), and unique.  If you use
an integer magic number, your code can use its bit pattern
to determine the bit and byte ordering in the file
automatically (e.g. 1234 vs. 4321).  A magic string is
perhaps more portable, and can be used for both text files
(e.g. "%!" in PostScript files) and binary files (e.g.
"GIF87a" in GIF files).
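
Byte-order detection from an integer magic number might be
sketched as follows.  The magic value, the enumeration, and
the function name are all assumptions; any value whose bytes
do not read the same in both directions would do.

```c
#define FILE_MAGIC 0x01020304UL   /* hypothetical magic number */

enum byte_order { ORDER_UNKNOWN, ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN };

/* Examine the first four bytes of a file and report which byte
   order the writer used, or ORDER_UNKNOWN if the magic number does
   not match either way (probably not one of our files). */
enum byte_order detect_byte_order(const unsigned char hdr[4])
{
    unsigned long be = ((unsigned long)hdr[0] << 24)
                     | ((unsigned long)hdr[1] << 16)
                     | ((unsigned long)hdr[2] << 8)
                     |  (unsigned long)hdr[3];
    unsigned long le = ((unsigned long)hdr[3] << 24)
                     | ((unsigned long)hdr[2] << 16)
                     | ((unsigned long)hdr[1] << 8)
                     |  (unsigned long)hdr[0];

    if (be == FILE_MAGIC)
        return ORDER_BIG_ENDIAN;
    if (le == FILE_MAGIC)
        return ORDER_LITTLE_ENDIAN;
    return ORDER_UNKNOWN;
}
```

The result tells the reader how to assemble every later
multi-byte field in the same file.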

Version numbers in files are useful as well, to identify
variations in a file's format.  A major version number
change could indicate a significant format change (or you
might want to change the magic number if the files are
incompatible), while a minor version number change would
indicate a small change that doesn't affect the format (e.g.
new data stored in former pad bytes).  Version numbers also
could indicate the presence or absence of sections, allowing
changes and additions to the file's contents while still
maintaining compatibility.
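
One possible policy for version numbers is sketched below.
The current version (2.1), the status names, and the rule that
a currency code appears from minor version 1 onward are all
invented for the example.

```c
#define FMT_MAJOR 2   /* hypothetical current format version: 2.1 */
#define FMT_MINOR 1

enum fmt_status { FMT_OK, FMT_NEEDS_CONVERSION, FMT_TOO_NEW };

/* Decide how to treat a file from its version numbers.  Any minor
   version of the current major version is readable as-is. */
enum fmt_status check_version(int major, int minor)
{
    (void)minor;   /* minor changes never affect the layout */
    if (major > FMT_MAJOR)
        return FMT_TOO_NEW;            /* written by a newer program */
    if (major < FMT_MAJOR)
        return FMT_NEEDS_CONVERSION;   /* older, incompatible layout */
    return FMT_OK;
}

/* Assumed rule: from version 2.1 on, the former pad bytes hold a
   currency code; in 2.0 files they are just padding. */
int has_currency_field(int major, int minor)
{
    return major == FMT_MAJOR && minor >= 1;
}
```

The reader checks the major number before touching the rest of
the file, and uses the minor number only to decide which
optional fields are meaningful.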

Avoiding Obsolescence

As much as possible, your programs should handle all file
versions, so old data files aren't made obsolete by program
changes.  If it's feasible, data files can be converted on
the fly when your program loads an older file, or one from a
different system.  Failing that, conversion programs should
be provided, to support file formats for all versions,
systems, and machines.  Remember that data files are likely
to outlive the programs that created them, and it may be
important to continue to support them.
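
Converting on the fly can be as simple as the sketch below,
which upgrades an old record to the current layout and fills
in a documented default for the field the old format lacked.
Both record layouts, the default value, and the function name
are hypothetical.

```c
/* Hypothetical version 1 record, as read from an old file. */
struct rec_v1 {
    long  acctnum;
    long  balance;
};

/* Hypothetical version 2 record, with one added field. */
struct rec_v2 {
    long  acctnum;
    long  balance;
    short currency;
};

#define DEFAULT_CURRENCY 0   /* assumed default for missing data */

/* Build a current-format record from an old one. */
struct rec_v2 upgrade_v1(const struct rec_v1 *old)
{
    struct rec_v2 r;

    r.acctnum  = old->acctnum;
    r.balance  = old->balance;
    r.currency = DEFAULT_CURRENCY;   /* not present in version 1 */
    return r;
}
```

The rest of the program then works only with the current
layout, and the knowledge of older formats stays in one small
set of conversion routines.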

In a way, file formats are like protocols -- you should stick
to a standard, define it clearly, and plan for future
growth.  Use industry standards where possible (e.g. ASCII
or ISO character sets, IEEE floating point format, Sun's XDR
standards), or define your own if you have to, but do so
carefully.  And most importantly, document everything you do
(data types, structures, file formats, versions supported,
conversions, defaults for missing data, etc.), keep these
documents consistent with your code, and make them available
to everyone who needs to know.

There is no simple formula to ensure portability of software
and data. Choosing languages and systems that are portable
is an important step, but that isn't enough.  You have to
plan for future growth in machines, systems, and in your own
software.  Think about portability from the start.  Define
and stick to standards, but also be flexible -- use an open-ended
design to allow for unforeseen changes.  And document,
document, document!