Next: Corpus Structure Up: File Formats Previous: Meta Data File Formats   Contents


Lexicon Format

Depending on the complexity of your speech corpus, you might add the specification of a lexicon covering the entire corpus or parts of it. Again, there are no widely accepted standards for the file format of lexica or dictionaries.

In most cases, speech corpora come with just a simple three-column list that gives, for each spoken word form, its orthographic representation, its word count and its most likely pronunciation. This seems quite straightforward, but it is clearly not sufficient for languages that have no standard orthography, whose orthography cannot be established unambiguously from speech, or whose orthography is not necessarily based on words. Here are some hints:


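The simple three-column list described above can be read with a few lines of code. The following is only a minimal sketch under stated assumptions: since there is no standard file format, the tab separator, the column order (orthography, count, pronunciation), the names LexiconEntry and parse_lexicon, and the two sample entries with their SAMPA-like transcriptions are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class LexiconEntry:
    """One row of the (assumed) three-column lexicon list."""
    orthography: str
    count: int
    pronunciation: str


def parse_lexicon(lines: Iterable[str], sep: str = "\t") -> Dict[str, LexiconEntry]:
    """Parse lines of the form: orthography <sep> count <sep> pronunciation.

    The separator and the column order are assumptions, not a standard.
    If a word form occurs more than once, the later row overwrites the earlier one.
    """
    lexicon: Dict[str, LexiconEntry] = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split(sep)
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(fields)}"
            )
        orth, count, pron = (field.strip() for field in fields)
        lexicon[orth] = LexiconEntry(orth, int(count), pron)
    return lexicon


# Hypothetical sample data, invented for this sketch.
sample = ["Haus\t12\thaUs\n", "gehen\t7\tge:@n\n"]
lexicon = parse_lexicon(sample)
```

Keying the table on the orthographic form keeps the sketch short, but it also shows the limitation noted above: such a table only works if every word form has exactly one unambiguous spelling.
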
BITS Projekt-Account 2004-06-01