Next: Validation Method Up: Manual Validation Previous: Manual Validation Contents   Contents


Selection of Validation Data

In most cases manual validation will not cover the whole speech corpus. Typically, a fixed proportion of the annotation and meta data is randomly selected for manual validation. The proportion is chosen so that the sample is "representative of the speech corpus", although nobody knows exactly what that means. In practice, the proportion is set to an amount that the validator can handle without undue cost: 5-20% for smaller corpora (10000-100000 recorded items), 1-2% for very large corpora (>100000 recorded items).

You may use a truly random process (e.g. shuffled cards or dice) to produce random numbers. Using a pseudo-random sequence, which most programming languages can generate, is easier.

Beware: We found that some programming languages generate the identical pseudo-random sequence every time the program or script is executed if the random number generator is not properly seeded. The gawk programming language, for instance, has a good random number generator.
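
The pitfall can be reproduced on purpose by giving the generator a fixed seed. The following is a minimal sketch, not part of the original procedure: it assumes a POSIX shell with an awk on the PATH, and the function name draw_sessions and the seed value 42 are chosen here for illustration only.

```sh
#!/bin/sh
# draw_sessions SEED: print five pseudo-random session numbers in 150..350.
# The function name and the seed values are illustrative assumptions.
draw_sessions() {
    awk -v seed="$1" 'BEGIN {
        srand(seed)                      # fixed seed instead of the system time
        for (i = 1; i <= 5; i++)
            printf("%03d ", int(rand() * 201) + 150)
        printf("\n")
    }'
}

first=$(draw_sessions 42)
second=$(draw_sessions 42)

# Same seed, same sequence: two separate runs cannot be told apart.
if [ "$first" = "$second" ]; then
    echo "identical: $first"
fi
```

Calling srand() without an argument seeds the generator from the time of day and avoids the repetition. A fixed seed is still useful when a sample has to be reproduced later.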

The following example gawk script selects a random sequence of 40 distinct session numbers from a corpus session range of 150 to 350 (both included). Since the random generator is seeded with the current system time, it generates a different sequence whenever it is started in a different second. It keeps track of the numbers already selected and never produces the same session number twice:

BEGIN {
        srand()                 # seed the generator with the current system time
        i = 1
        while (i <= 40)
        {
          # draw an integer from 150..350: 201 values, both ends included
          random = int(rand() * 201) + 150
          if (random in seen) continue    # already selected: draw again
          seen[random] = 1
          printf("%03d ", random)
          i++
        }
        printf("\n")
}

In most cases the selection process involves not only random sequences but also a number of other constraints, for instance an equal distribution between the sexes or certain proportions of special features within the corpus. There are several ways to impose such constraints on a random selection. The brute-force approach is to rerun the random selection until the resulting sample meets the required constraints.
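
The brute-force approach can be sketched as follows. This is only an illustration under stated assumptions: the input table with one "<session> <sex>" pair per line, the sex codes f and m, the function name select_balanced and its optional seed argument are all hypothetical and not taken from any corpus specification. Session numbers are assumed to be unique, and the only constraint checked is an equal number of female and male sessions.

```sh
#!/bin/sh
# select_balanced N [SEED] < table
# Prints N distinct session numbers, half of them f and half of them m.
# Table layout ("<session> <sex>" per line) is an assumption of this sketch.
select_balanced() {
    awk -v n="$1" -v seed="${2:-}" '
    # Remember every session and its sex; count the available f and m.
    { id[NR] = $1; sex[$1] = $2; avail[$2]++ }
    END {
        if (n % 2 || avail["f"] < n / 2 || avail["m"] < n / 2) {
            print "constraint cannot be met" > "/dev/stderr"
            exit 1
        }
        if (seed == "")
            srand()                      # no seed given: use the system time
        else
            srand(seed)                  # fixed seed: reproducible sample
        do {
            split("", picked)            # start every attempt from scratch
            split("", c)
            k = 0
            while (k < n) {
                s = id[int(rand() * NR) + 1]   # uniform draw from all sessions
                if (s in picked) continue      # never select a session twice
                picked[s] = 1
                sel[++k] = s
                c[sex[s]]++
            }
        } while (c["f"] != n / 2 || c["m"] != n / 2)   # redraw until balanced
        for (i = 1; i <= n; i++) printf("%s ", sel[i])
        printf("\n")
    }'
}

# Example call on a small made-up table, seeded from the system time.
select_balanced 4 <<EOF
151 f
152 m
153 f
154 m
155 f
156 m
EOF
```

Because every failed attempt is thrown away completely, the number of attempts may grow quickly as more constraints are added. For tight constraints it can be preferable to draw separately from each group instead.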

Document the resulting data sample and your method for creating it in the validation report.


Angela Baumann 2004-06-03