Lexicon Format

Depending on the complexity of your speech corpus, you might add the specification of a lexicon covering the entire corpus or parts of it. Again, there is no widely accepted standard file format for lexica or dictionaries.

In most cases speech corpora come with just a simple three-column list giving, for each spoken word form, the orthographic representation, the word count and the most likely pronunciation. This seems quite straightforward, but it is clearly not sufficient for languages that have no standard orthography, whose orthography cannot be established unambiguously from speech, or whose orthography is not necessarily based on words. Here are some hints:

Orthography: whenever feasible use Unicode.
Pronunciation: whenever feasible use SAMPA or X-SAMPA^4.17.
Clearly specify what is meant by `most likely' or `canonical' pronunciation and how these pronunciations will be produced.^4.18
Specify whether the lexicon may contain more than one pronunciation for the same word form.
Use a simple plain text list or XML markup as the file format; both can easily be imported into most database systems.
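
To make the hints above concrete, here is a minimal sketch in Python of how such a lexicon might be read and converted. It assumes a tab-separated, UTF-8 encoded three-column list (orthography, word count, pronunciation) in which a repeated word form signals an additional pronunciation variant. The sample entries, the function names read_lexicon and to_xml, and the XML element names (lexicon, entry, orth, pron) are all hypothetical, chosen for illustration only; they are not a standard, and the transcriptions are merely illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: orthography <TAB> word count <TAB> pronunciation (SAMPA).
# The transcriptions are illustrative, not authoritative.
SAMPLE = (
    "Haus\t12\thaUs\n"
    "Häuser\t3\thOYz6\n"
    "haben\t40\tha:b@n\n"
    "haben\t40\tha:bm\n"
)


def read_lexicon(stream):
    """Parse a three-column list into a dict keyed by orthographic form.

    A word form that occurs on several lines collects all its
    pronunciations in the order they appear.
    """
    lexicon = {}
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        orth, count, pron = fields
        entry = lexicon.setdefault(orth, {"count": int(count), "prons": []})
        if pron not in entry["prons"]:
            entry["prons"].append(pron)
    return lexicon


def to_xml(lexicon):
    """Serialize the lexicon to a simple XML string (assumed element names)."""
    root = ET.Element("lexicon")
    for orth, entry in lexicon.items():
        node = ET.SubElement(root, "entry", count=str(entry["count"]))
        ET.SubElement(node, "orth").text = orth
        for pron in entry["prons"]:
            ET.SubElement(node, "pron", alphabet="SAMPA").text = pron
    return ET.tostring(root, encoding="unicode")


lexicon = read_lexicon(io.StringIO(SAMPLE))
xml_text = to_xml(lexicon)
```

For a real file, the stream could be obtained with open(path, encoding="utf-8") so that the orthography column is decoded as Unicode. Keeping the pronunciations in a list per word form makes it explicit whether a lexicon allows variants, and the order of the list can carry whatever definition of `most likely' the corpus documentation states.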

BITS Projekt-Account 2004-06-01