Documentation files may be plain text, PostScript or PDF files, or may even be maintained on the Web (footnote 10.1), but the distribution medium should always hold a copy of every Web file that belongs to the documentation. Place all documentation files in a common directory called DOC or in subdirectories of this directory.
For example, most of the BAS speech corpora contain annotation files in the BAS Partitur Format (BPF). The BPF is documented on a Web page (footnote 10.2) because the format changes frequently as new tiers are added. The documentation of the BAS speech corpora therefore contains the URL of this Web page as well as a copy of the Web documentation as it stood at the time of media production.
The `starting document' should be named README or REPORT and contain at least the Introduction, the Copyright, the Version Number and Edition Date, the List of Documentation Files and the Corpus Description:
The Introduction describes the main features and the intended usage of the corpus in one paragraph. This information may later be used in catalogues etc. For example:
This is the documentation for the WEBCOMMAND database created in Jun - Aug 2002 as a subcontract to Siemens Company. WEBCOMMAND contains recording sessions of 48 native speakers of France and Great Britain. All speakers read a list of 130 prompts from a screen. They are recorded with two microphones: a high quality headset and a high quality microphone fixed to a `webpad' held on the lap.
Clearly state the Copyright and Disclaimers immediately after the Introduction (see chapter ). Make absolutely clear who may use the corpus for which purposes and who is entitled to distribute the data. Then give the Version Number of the speech corpus in this distribution, the Date of Edition and the Date of the Last Update.
The List of Documentation Files is simply a commented directory listing of the documentation directory.
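A commented listing of this kind can be generated rather than typed by hand. The following Python sketch is only an illustration, not a procedure prescribed by this document: the function name `commented_listing` and the `comments` mapping are hypothetical, and the one-line-per-file layout is an assumption.

```python
from pathlib import Path


def commented_listing(doc_dir, comments):
    """Return one line per file under doc_dir: relative path plus a comment.

    `comments` maps relative paths (with forward slashes) to a short
    description; files without an entry get a placeholder so that gaps
    in the listing are easy to spot.
    """
    root = Path(doc_dir)
    # Collect every regular file below the documentation directory.
    files = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
    # Pad file names to a common width so the comments line up.
    width = max((len(name) for name in files), default=0)
    lines = []
    for name in files:
        note = comments.get(name, "(no description yet)")
        lines.append(f"{name.ljust(width)}  {note}")
    return lines
```

Calling `commented_listing("DOC", {"README": "starting document"})` returns the lines to paste into the starting document; the descriptions themselves still have to be written by the corpus producer.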
The Corpus Description should contain all information about the corpus in its present state: figures on size, speakers and recordings; distributions of relevant speaker characteristics and special recording conditions; the contents of the speech recordings; the structure of the corpus on the distribution media; the distribution media themselves and how to use them; the technical specifications of the signal files; file formats; terminology; and other parts of the corpus such as dictionaries and translation files.
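Before a distribution is produced, the minimum contents of the starting document named above can be checked mechanically. The sketch below assumes that each required part begins with a line that starts with its title; that layout, the list `REQUIRED_SECTIONS` and the function `missing_sections` are assumptions made for this illustration and are not part of any prescribed format.

```python
# Titles of the parts the starting document should contain at least.
REQUIRED_SECTIONS = [
    "Introduction",
    "Copyright",
    "Version Number",
    "Edition Date",
    "List of Documentation Files",
    "Corpus Description",
]


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return the required titles that never start a line of the text.

    The comparison ignores case and leading whitespace. It only detects
    that a title line is present, not that the part has useful content.
    """
    starts = [line.strip().lower() for line in readme_text.splitlines()]
    return [
        title
        for title in required
        if not any(line.startswith(title.lower()) for line in starts)
    ]
```

An empty result means that every required title was found; any title that is returned points to a part that is missing or named differently in the README or REPORT file.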
The speech corpus may be identified in a more formal way by using standardized entries in the head of the corpus description, as in the following example: