Probably the simplest way to specify the spoken content is by vocabulary, which follows more or less automatically from the intended use of the corpus.

For instance, if the corpus will be used to train a speech recognizer on the 11 German digits (footnote 4.1) and three command words, the content definition will most likely require that all 14 vocabulary items be equally distributed, with the same number of repetitions per speaker, e.g.

14 words spoken by 500 speakers with 10 repetitions equals 70000 tokens
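
The arithmetic above can be checked with a short sketch. It assumes a fully balanced design in which every speaker utters every vocabulary item the same number of times. The function names and the placeholder item labels are hypothetical and only serve the illustration; they are not the actual corpus vocabulary.

```python
def token_count(vocabulary_size: int, speakers: int, repetitions: int) -> int:
    """Total tokens when every speaker utters every item equally often."""
    return vocabulary_size * speakers * repetitions


def balanced_prompts(vocabulary, repetitions):
    """Prompt list for one speaker: each item appears `repetitions` times."""
    return [word for word in vocabulary for _ in range(repetitions)]


# Hypothetical placeholder labels standing in for 11 digits + 3 command words.
vocabulary = [f"item{i:02d}" for i in range(1, 15)]

total = token_count(len(vocabulary), speakers=500, repetitions=10)
per_speaker = balanced_prompts(vocabulary, repetitions=10)

print(total)             # 70000
print(len(per_speaker))  # 140
```

In practice the per-speaker prompt list would probably be shuffled before recording so that repetitions of the same item do not follow each other directly; that step is left out here because the text does not specify it.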

