For speech corpora larger than 5 GB it is probably best to put the signal files on DVDs and the annotations (which are usually much smaller) on a separate CD-ROM.
Burners for CD-R and DVD are reasonably priced and should be included in the budget of the speech corpus production. Choose quality drives to avoid unnecessary drop-outs. Standard computer networks are fast enough that you may set up a burner station on one host and store the master images on your file server (see footnote 12.1).
Alternatively you might outsource the CD-ROM or DVD production to a company which produces `real' (that is, pressed) CDs or DVDs. However, this is probably only economical if you plan to produce at least 100 copies (see also section ).
Another thing to consider here is the file system (FS) you will install on your media. For CD-ROM there is basically one widely accepted FS called ISO9660, but it comes with many different extensions. We recommend using the Rock Ridge extension (for UNIX systems) and the Joliet extension (for MS systems). Both allow you to use longer file names than basic ISO9660, which is restricted to the old DOS 8.3 convention. Both can be installed in parallel and do not interfere with each other. Please note that adding such extensions may increase the total size of your data significantly if your corpus contains a lot of files.
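To see whether a corpus depends on these extensions at all, you can check its file names against the 8.3 convention before mastering. The following is a minimal sketch, not a complete ISO9660 validator: the function names are hypothetical, the pattern covers only the file name itself (8 characters, an optional dot and up to 3 characters, drawn from letters, digits and underscore), and case is ignored on the assumption that the mastering tool maps names to upper case.

```python
import re

# Assumed 8.3 pattern: 1-8 characters, optionally a dot plus 1-3 characters,
# using only A-Z, 0-9 and underscore. Directory depth and path length limits
# are not checked here.
_NAME_8_3 = re.compile(r"[A-Z0-9_]{1,8}(\.[A-Z0-9_]{1,3})?")


def fits_8_3(filename: str) -> bool:
    """Return True if the name matches the assumed 8.3 pattern (case ignored)."""
    return _NAME_8_3.fullmatch(filename.upper()) is not None


def names_needing_extension(filenames):
    """Return the names that would not survive unchanged without long-name support."""
    return [name for name in filenames if not fits_8_3(name)]


if __name__ == "__main__":
    corpus_files = ["SPK001.WAV", "readme.txt", "session_0001_channel2.wav"]
    for name in names_needing_extension(corpus_files):
        print("needs Rock Ridge/Joliet:", name)
```

If the returned list is empty, the volume should remain readable even on systems that ignore both extensions.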
On DVD you may use ISO9660 as well or, preferably, the newer UDF file system, which overcomes most shortcomings of ISO9660: it allows long file names, a larger number of files, etc. Most computer platforms (tested on MS, Macintosh and Linux) detect ISO9660 and UDF automatically.
A very large number of small files (typically your annotation files) of less than 4 kB each will also increase the space your data occupies on the medium, because most FS allocate space in whole blocks and therefore reserve at least one block per file (typically 4 kB, 8 kB or 16 kB). Take this into consideration when dividing up your corpus over several media volumes. The best way to test the actual size of a volume is to produce an image file and check whether it fits on the medium.
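The effect of block rounding can be estimated before building an image. The sketch below assumes that every file occupies a whole number of blocks and at least one block; the function names and the default block size of 4096 bytes are illustrative choices, and directory entries and other file system metadata are ignored. It gives only a rough lower bound, so building the image remains the decisive test.

```python
def allocated_size(file_size: int, block_size: int = 4096) -> int:
    """Bytes occupied by one file when space is handed out in whole blocks."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    return max(1, blocks) * block_size    # assumption: at least one block per file


def total_allocated(file_sizes, block_size: int = 4096) -> int:
    """Sum of the block-rounded sizes of all files (metadata not included)."""
    return sum(allocated_size(size, block_size) for size in file_sizes)


def fits_on_medium(file_sizes, capacity: int, block_size: int = 4096) -> bool:
    """Rough check whether the block-rounded data fits into `capacity` bytes."""
    return total_allocated(file_sizes, block_size) <= capacity


if __name__ == "__main__":
    # Hypothetical corpus part: 10,000 annotation files of 500 bytes each.
    sizes = [500] * 10_000
    print("raw bytes:      ", sum(sizes))
    print("allocated bytes:", total_allocated(sizes))
```

In this hypothetical case 5,000,000 bytes of annotation data occupy 40,960,000 bytes with 4 kB blocks, which is why many small files deserve attention when planning volumes.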
If your corpus exceeds several hundred GB you will probably consider distributing it either on special tapes (which makes it harder for users to access the data) or on inexpensive IDE hard disks. In the latter case the FS on the hard disk should be VFAT, which can be mounted by all computer platforms (see footnote 12.2).
If you are using hard disks as distribution media, you might install a swappable IDE drive slot on one of your hosts; this makes it quite easy to change drives without opening the case.