Next: Larger Edition vs. Burn-on-Demand Up: Distribution Previous: Signal / Symbolic Data   Contents


Safety / Verify / Versions

A produced and validated speech corpus is a very valuable resource, so take every precaution against data loss. Never rely on a single storage medium, and never keep all storage media in the same location. For instance, the archive data at BAS are stored in three (four) different ways: on a file server with an independent backup system, on CD-R or DVD-R locked away in a safe place, and on a Tivoli storage system in another part of town. During the production phase, and possibly afterwards, your data will change. Set up reliable procedures that propagate every change to all of your data locations.

Always run a verify procedure to confirm that a data transfer succeeded, especially when you transfer over the network. On UNIX systems, diff -r quickly detects differences between the source and target data. Use the built-in verify procedures of your CD-R or DVD-R burner software. To be completely sure, mount the finished medium on a different host and run an additional verify against the source data.
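Where diff -r is not available, or where a report per file is wanted, the same check can be approximated with checksums. The following Python sketch is only an illustration, not a procedure prescribed here: the function names (file_digest, tree_digests, verify_copy) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(source, target):
    """Compare two directory trees file by file.

    Returns three sorted lists of relative paths:
    files missing in target, extra files in target, files whose content differs.
    Three empty lists mean the copy matches the source.
    """
    src, dst = tree_digests(source), tree_digests(target)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    changed = sorted(name for name in set(src) & set(dst) if src[name] != dst[name])
    return missing, extra, changed
```

Calling verify_copy with the source directory and the mount point of the finished medium returns three lists; any non-empty list indicates that the transfer should be repeated or investigated.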

We recommend using a version control system, or at least manually assigned version numbers that are increased after each update or change. Every speech corpus documentation must contain a change log in which all changes are documented together with the corresponding version number. We recommend a two-part version number X.Y, where X is increased only after major changes (for instance, changes that require software using the corpus to be adapted), while Y is increased for error corrections (updates) only.
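The X.Y scheme can be sketched in a few lines of Python. This is a hypothetical helper, not part of any existing tool: the names bump_version and changelog_entry, the change log line layout, and the reset of Y to 0 after a major change are assumptions made for this example.

```python
from datetime import date


def bump_version(version, major_change):
    """Return the next X.Y version string.

    A major change increases X (and, by assumption, resets Y to 0);
    an error correction increases Y only.
    """
    major, minor = (int(part) for part in version.split("."))
    if major_change:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"


def changelog_entry(version, description, when=None):
    """Format one change log line: version, ISO date, description."""
    when = when or date.today()
    return f"{version}  {when.isoformat()}  {description}"
```

For example, bump_version("1.3", major_change=False) yields "1.4" for an error correction, while bump_version("1.3", major_change=True) yields "2.0" for a change that forces dependent software to be adapted.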

If possible, set up a mailing list of all users of the speech corpus and inform them about version changes automatically.


BITS Projekt-Account 2004-06-01