Lexicon Format
Depending on the complexity of your speech corpus, you may want to add
the specification of a lexicon covering the entire corpus or parts of it.
Again, there is no widely accepted standard file
format for lexica or dictionaries.
In most cases speech corpora come with just a simple
three-column list giving, for each spoken word form, the orthographic
representation, the word count and the most likely pronunciation of
this word form. This seems quite straightforward, but it is clearly not sufficient
for languages that have no standard orthography, whose
orthography cannot be established unambiguously from speech, or
whose orthography is not necessarily based on words.
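As an illustration of the simple three-column list described above, the following minimal Python sketch reads such a lexicon. The layout is an assumption and not prescribed here: tab-separated columns in the order orthography, word count, pronunciation, one entry per line, UTF-8 text, and `#` for comment lines. The names `LexiconEntry` and `read_lexicon` and the sample entries are invented for the example. Quote handling is switched off because SAMPA uses the character `"` as the primary stress mark.

```python
import csv
import io
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    orthography: str    # word form as written in the corpus
    count: int          # number of occurrences in the corpus
    pronunciation: str  # most likely pronunciation, e.g. in SAMPA


def read_lexicon(stream):
    """Read a tab-separated three-column lexicon from a text stream."""
    # QUOTE_NONE keeps the SAMPA stress mark '"' as an ordinary character.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        # Skip empty lines and comment lines (assumed convention).
        if not row or row[0].startswith("#"):
            continue
        orthography, count, pronunciation = row
        entries.append(LexiconEntry(orthography, int(count), pronunciation))
    return entries


# Invented sample data; a real lexicon would be opened with
# open(path, encoding="utf-8", newline="").
SAMPLE = '# orthography\tcount\tpronunciation\nHaus\t42\t"haUs\nund\t310\tUnt\n'
entries = read_lexicon(io.StringIO(SAMPLE))
```

A row that does not have exactly three columns raises a `ValueError` here, which is usually preferable to silently accepting a malformed lexicon.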
Here are some hints:
- Orthography: whenever feasible use Unicode.
- Pronunciation: whenever feasible use SAMPA or X-SAMPA (footnote 4.17).
- Clearly specify what is meant by 'most likely' or 'canonical'
pronunciation and how these pronunciations will be produced (footnote 4.18).
- Specify whether the lexicon may contain more than one
possible pronunciation for the same word form.
- Use a simple plain text list or XML markup as
the file format; both are widely accepted because they can easily
be imported into any kind of database system.
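The hints on XML markup and on multiple pronunciations per word form can be combined, as in the following minimal Python sketch. It is only one possible layout: the element and attribute names (`lexicon`, `entry`, `orth`, `pron`, `count`, `alphabet`), the function name `lexicon_to_xml` and the sample entry are assumptions made for this example, not an established lexicon schema.

```python
import xml.etree.ElementTree as ET


def lexicon_to_xml(entries):
    """Serialise (orthography, count, pronunciations) tuples as XML text.

    Each entry may carry several pronunciations, so the lexicon states
    explicitly which variants are allowed for a word form.
    """
    # Hypothetical root element; the attribute records the phonetic alphabet.
    root = ET.Element("lexicon", {"alphabet": "SAMPA"})
    for orthography, count, pronunciations in entries:
        entry = ET.SubElement(root, "entry", {"count": str(count)})
        ET.SubElement(entry, "orth").text = orthography
        for pronunciation in pronunciations:
            ET.SubElement(entry, "pron").text = pronunciation
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Invented sample entry with two pronunciation variants.
xml_text = lexicon_to_xml([("und", 310, ["Unt", "Un"])])
```

Because the pronunciations are plain element text, the same data can be read back with `ET.fromstring` and loaded into a database table with one row per word form and pronunciation variant.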
BITS Projekt-Account
2004-06-01