As mentioned before, it is a good idea to keep signal data files
and annotation data separate. The reason is that users will very
often need access only to the symbolic data of your speech
corpus. Furthermore, the annotation part is much more likely to be
updated than the signal data. Keeping the two apart therefore makes
the corpus easier to maintain.
Small corpora will typically have the following structure in the root
of the distribution medium:
- DATA : contains all signal files
- ANNOT : contains all annotation files
- META : contains all meta data files
- DOC : contains the documentation
- LEX : contains the lexica (if any)
- TOOLS : contains software to access signal, annotation and lexicon data
Larger corpora might distribute the DATA part on separate
media, but the basic structure remains the same.
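The top-level layout listed above can be sketched in a few lines of Python. This is only an illustration, not part of any corpus toolkit: the function name create_corpus_skeleton and the with_lexicon flag are hypothetical, and only the directory names come from the list above.

```python
from pathlib import Path

# Directory names taken from the list above; LEX is optional ("if any").
TOP_LEVEL = ["DATA", "ANNOT", "META", "DOC", "TOOLS"]


def create_corpus_skeleton(root, with_lexicon=False):
    """Create the top-level corpus directories under root.

    Hypothetical helper for illustration; returns the directory paths.
    """
    root = Path(root)
    names = TOP_LEVEL + (["LEX"] if with_lexicon else [])
    created = []
    for name in names:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing tree.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_corpus_skeleton("my_corpus", with_lexicon=True) would create the six directories under my_corpus. If the DATA part is shipped on separate media, the same function can simply be run once per medium.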
Within the DATA and ANNOT directories, organize the files so as to avoid very
large numbers of directory entries, and try to provide a natural
order for the prospective user. Depending on the aims of your speech corpus, the
subdirectories may be organized by:
- male / female
- recording sessions
- speakers
- different acoustical environments
- languages
- dialect classes
- speech types (read, non-prompted, ...)
- technical setups (telephone, on-site, ...)
- in ANNOT: different annotation types
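One way to check the advice about directory sizes is to walk the finished tree and report directories that hold too many entries. The following is a minimal sketch: the function name oversized_directories is hypothetical, and the limit max_entries has to be chosen by the corpus producer, since no particular value is assumed here.

```python
import os


def oversized_directories(root, max_entries):
    """Return sorted (path, entry_count) pairs for every directory under
    root that contains more than max_entries files and subdirectories.

    max_entries is a caller-supplied limit (an assumption of this sketch).
    """
    result = []
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count > max_entries:
            result.append((dirpath, count))
    return sorted(result)
```

Any directory this reports is a candidate for splitting along one of the criteria listed above, for example by speaker or by recording session.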
BITS Projekt-Account
2004-06-01