Next: Safety / Verify / Up: Distribution Previous: Compression / Compatibility   Contents


Signal / Symbolic Data

As mentioned before, it is good practice to separate signal and symbolic data in the corpus structure and hence also in the distribution. In medium-sized or large speech corpora you might not be able to keep all the data online at all times. However, you will very likely need access to the annotation data of the whole corpus. It therefore makes sense either to store these data on a separate medium or to copy them onto every volume of the speech corpus.

For example, every single DVD-R volume of the SmartKom corpus contains the complete set of annotation files for the whole corpus. In the WebCommand corpus these data are stored on a separate CD-ROM.
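
The second strategy, copying the symbolic data onto every volume, can be sketched in a few lines of Python. This is only a minimal illustration and not the procedure used for SmartKom or WebCommand: the function name, the `annotation` subdirectory and the list of annotation file extensions are assumptions that have to be adapted to the corpus at hand.

```python
import shutil
from pathlib import Path

# Assumed extensions of symbolic (annotation) files; adjust per corpus.
SYMBOLIC_SUFFIXES = {".par", ".txt", ".xml"}


def replicate_annotations(corpus_root, volume_dirs, suffixes=SYMBOLIC_SUFFIXES):
    """Copy all symbolic files below corpus_root into an 'annotation'
    subdirectory of each volume staging directory, keeping the relative
    paths. Returns the number of symbolic files found."""
    corpus_root = Path(corpus_root)
    # Collect the file list first, so that files copied later are never
    # picked up again, even if a volume directory lies below corpus_root.
    symbolic = sorted(
        p for p in corpus_root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
    for volume in volume_dirs:
        for src in symbolic:
            dst = Path(volume) / "annotation" / src.relative_to(corpus_root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves time stamps
    return len(symbolic)
```

Called with the corpus directory and the staging directories of all volumes, the function leaves the signal files untouched and gives every volume the same complete annotation set.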

Another advantage of keeping the symbolic information separate is that these data are much more likely to be updated than the signal files. Since the symbolic data usually occupy less than 1% of the total corpus size, such updates can easily be distributed to users as a download from an FTP server.
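
As a rough sketch of how such an update package might be prepared, the following Python function bundles all symbolic files into a compressed tar archive and reports their share of the total corpus size. The function name and the file extensions are again assumptions made for illustration; the upload to the server is left out.

```python
import tarfile
from pathlib import Path


def build_annotation_update(corpus_root, archive_path,
                            suffixes=(".par", ".txt", ".xml")):
    """Pack all symbolic files below corpus_root into a gzip-compressed
    tar archive and return their share of the total corpus size (0..1).
    archive_path should lie outside corpus_root, otherwise the archive
    itself would be counted as part of the corpus."""
    corpus_root = Path(corpus_root)
    total_bytes = 0
    symbolic_bytes = 0
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(corpus_root.rglob("*")):
            if not path.is_file():
                continue
            size = path.stat().st_size
            total_bytes += size
            if path.suffix.lower() in suffixes:  # assumed annotation types
                symbolic_bytes += size
                # Store paths relative to the corpus root in the archive.
                archive.add(path, arcname=path.relative_to(corpus_root).as_posix())
    return symbolic_bytes / total_bytes if total_bytes else 0.0
```

The returned ratio gives a quick check of whether the symbolic data are indeed small enough, relative to the signal data, to be offered as a download.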


BITS Projekt-Account 2004-06-01