The file naming conventions (or nomenclature) define the allowed file and directory names within your speech corpus. A very common approach is to use content-based file names; an alternative approach is to use generated file names.
Content-based file names are constructed from features of the corpus, e.g. language code, speaker gender, type of speech, etc. Content-based names allow access to specific fragments of the corpus simply by filtering file names. Of course, the information encoded in the file name must be meaningful and easy to extract. One problem with content-based file names is the platform- or medium-dependent length restriction on file names, e.g. 8.3 for ISO 9660 CDs. Another problem is that there is often no natural hierarchical structure in a speech corpus: is it better to organize the recordings by recording location and then by gender, or the other way round?
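The length restriction can be checked automatically before a corpus is written to its distribution medium. The following is a minimal sketch, not a complete check against the standard: it assumes that "8.3" means a base name of at most eight characters plus an optional extension of at most three, and that only letters, digits and the underscore are allowed. The function names are hypothetical.

```python
import re

# Assumption: "8.3" = base name of 1-8 characters, optional extension of
# 1-3 characters, restricted here to letters, digits and the underscore.
EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9_]{1,8}(\.[A-Za-z0-9_]{1,3})?")


def fits_8_3(file_name: str) -> bool:
    """Return True if file_name fits the assumed 8.3 pattern."""
    return EIGHT_DOT_THREE.fullmatch(file_name) is not None


def too_long(file_names):
    """Return the names that violate the assumed 8.3 restriction."""
    return [name for name in file_names if not fits_8_3(name)]
```

Running too_long over the planned file list shows early whether a content-based scheme encodes more features than the target medium can hold.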
Generated file names are usually created automatically, e.g. as sequence numbers. Generated file names can easily be organized in hierarchies, e.g. BLOCKxx/SESxxyy, where xx and yy are numbers. To retrieve fragments of a corpus, a separate document listing the contents of each signal file is necessary.
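As an illustration, generated names and the accompanying contents list can be produced in one pass. This sketch assumes that xx and yy are zero-padded two-digit numbers and that each block holds 100 sessions; the feature keys (language, gender) and the CSV layout of the index are hypothetical choices, not part of any corpus specification.

```python
import csv
import io


def generated_name(block: int, session: int) -> str:
    """Build a path of the form BLOCKxx/SESxxyy from two numbers."""
    # Assumption: xx and yy are zero-padded to two digits.
    return f"BLOCK{block:02d}/SES{block:02d}{session:02d}"


def build_index(recordings):
    """Assign generated names; return the index rows and their CSV text.

    `recordings` is a list of dicts with identical keys, one per signal
    file. The keys are hypothetical corpus features.
    """
    rows = []
    for number, features in enumerate(recordings):
        block, session = divmod(number, 100)  # assumed: 100 sessions per block
        rows.append({"name": generated_name(block, session), **features})
    if not rows:
        return rows, ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return rows, buffer.getvalue()
```

The index is what makes the corpus searchable: the file names themselves carry no information, so every query goes through this document.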
Some operating systems and programming languages are case sensitive, some are not; some apply their own rules for capitalization, others do not. Sometimes case changes when data is copied to another medium, sometimes it does not. The lesson here is: do not define a nomenclature that is case sensitive.
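A simple way to enforce this is to check that no two names in the corpus differ only in capitalization. The sketch below is one possible check; the function name is hypothetical.

```python
from collections import defaultdict


def case_collisions(file_names):
    """Group names that become identical when case is ignored."""
    groups = defaultdict(list)
    for name in file_names:
        groups[name.casefold()].append(name)
    # Only groups with more than one member are a problem.
    return [names for names in groups.values() if len(names) > 1]
```

An empty result means the nomenclature survives a copy to a medium that ignores or changes case.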
Here is an example from the German Verbmobil II corpus:
Dialog names are coded as follows:

  1st character:          <lang> [g,e,j,m,n] recorded language:
                          g(erman), e(nglish), j(apanese), m(ultilingual), n(oise)
  2nd to 4th character:   dialog number, e.g. 001
  5th character:          scenario: a(main), b(information desk),
                          c(remote maintenance), d(VM1), n(noise)

Turn names consist of the dialog name (characters 1-5) and the following:

  6th character:          technical definition of recording:
                          c(lose), r(oom), t(elephone)
  7th character:          detailed description of recording means (microphone)
                            telephone: m(obile), p(hone, analog), w(ireless), d(ect)
                            close:     h(eadset), n(eckband microphone), c(lip microphone)
                            room:      r(oom)
  8th character:          channel coding [1..n]
  9th character:          '_'
  10th - 12th character:  turn number starting with '000'
  13th character:         '_'
  14th - 16th character:  <sp_id> speaker ID

The extensions code the contents of the file:

  .nis   NIST file
  .trl   transliteration
  .spr   speaker protocol file
  .rpr   recording session protocol
  .par   symbolic information in "partitur" format

Each recording consists of a set of files like the following:

  Type                         Name              Location
  signals                      <turn>.nis        data/<dialog>/
  recording session protocol   <dialog>.rpr      data/<dialog>/
  speaker protocol file        <lang>_<sp_id>    spr/
  transliteration              <dialog>.trl      trl/
  BAS Partitur Files (BPF)     <turn>.par        par/<dialog>/

In the above example the dialog name is used as a structural element to sort files into groups, while the turn name is the prefix to signal and annotation files.
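Because every position in a turn name has a fixed meaning, the features can be extracted with a single regular expression. The following sketch follows the coding described above, with a few assumptions: the channel is taken to be a single digit from 1 to 9, the speaker ID to be three letters or digits, and matching is case-insensitive. The function names and the sample file name used below are hypothetical, not taken from the corpus.

```python
import re

# One named group per coded position of the turn name.
TURN_NAME = re.compile(
    r"(?P<lang>[gejmn])"            # 1st character: recorded language
    r"(?P<dialog_number>\d{3})"     # 2nd to 4th character
    r"(?P<scenario>[abcdn])"        # 5th character
    r"(?P<technical>[crt])"         # 6th character: close, room, telephone
    r"(?P<microphone>[mpwdhncr])"   # 7th character
    r"(?P<channel>[1-9])"           # 8th character (assumed single digit)
    r"_(?P<turn_number>\d{3})"      # 9th to 12th character
    r"_(?P<sp_id>[A-Za-z0-9]{3})",  # 13th to 16th character (assumed alphanumeric)
    re.IGNORECASE,
)


def parse_turn_name(file_name: str):
    """Return the coded features as a dict, or None if the name does not fit."""
    stem, _, extension = file_name.partition(".")
    match = TURN_NAME.fullmatch(stem)
    if match is None:
        return None
    features = match.groupdict()
    features["dialog"] = stem[:5]   # dialog name = characters 1-5
    features["extension"] = extension
    return features


def select(file_names, **wanted):
    """Keep the names whose parsed features equal all given values."""
    selected = []
    for name in file_names:
        features = parse_turn_name(name)
        if features and all(features.get(k) == v for k, v in wanted.items()):
            selected.append(name)
    return selected
```

For example, select(names, lang="g", technical="t") would keep only German telephone recordings, which is the kind of filtering by file name that content-based nomenclatures are meant to allow.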