As already mentioned, transcriptions and other annotations should undergo a final correction pass, preferably performed by a single person, to ensure good and consistent annotation quality. All errors found during this pass should be logged to a file and used to improve the training of the labeler group. This log may also be used to measure improvements in the labeling procedure.
A further way to evaluate the quality of your annotation teams is to measure the inter-labeler agreement of the final results. In most cases this can be done automatically with an alignment method such as those used in automatic speech recognition to compare a recognizer's output against a reference transcript. To estimate inter-labeler agreement you need a representative sample of the data annotated independently by two or, preferably, more labelers. The results of these labelers or labeler groups are then matched against each other, and the average agreement between labelers or groups is calculated.
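As a minimal sketch of such an alignment-based comparison, the following uses Python's standard `difflib.SequenceMatcher` to align two label sequences and report the fraction of matched labels; the function name, the example labels, and the normalization by the mean sequence length are illustrative choices, not a prescribed measure:

```python
from difflib import SequenceMatcher

def label_agreement(labels_a, labels_b):
    """Align two label sequences and return the fraction of matched
    labels relative to the mean sequence length (one simple,
    illustrative agreement measure)."""
    matcher = SequenceMatcher(a=labels_a, b=labels_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    mean_len = (len(labels_a) + len(labels_b)) / 2
    return matched / mean_len if mean_len else 1.0

# Two labelers' phonetic transcriptions of the same utterance
labeler_1 = ["h", "e", "l", "ou"]
labeler_2 = ["h", "e", "l", "l", "ou"]
print(round(label_agreement(labeler_1, labeler_2), 2))  # 0.89
```

A production setup would typically use a full dynamic-programming alignment (as in ASR scoring tools) that distinguishes substitutions, insertions, and deletions, but the principle is the same: align first, then count agreements.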
You will find that inter-labeler (and intra-labeler) agreement decreases with segment length. That is, the segmentation of dialog acts will be much more consistent than the segmentation of phonetic units. If you plan to measure the inter- or intra-labeler agreement of segmentations, you will have to evaluate the labels and the time information independently. Note, however, that the two are not fully independent: if a label is missing, the time information of the adjacent segments will be distorted. There is no widely accepted measure of inter- or intra-labeler agreement; you may find some hints in the PhD thesis of A. Kipp ().
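For the time-information side of such an evaluation, one common approach is to count how many segment boundaries placed by one labeler fall within a tolerance of a boundary placed by another. The sketch below assumes boundaries given in seconds and a 20 ms tolerance; both the function and the tolerance value are hypothetical choices for illustration:

```python
def boundary_agreement(bounds_a, bounds_b, tolerance=0.02):
    """Fraction of boundaries in bounds_a that have a counterpart in
    bounds_b within `tolerance` seconds (an illustrative, asymmetric
    measure; a symmetric variant would average both directions)."""
    if not bounds_a:
        return 1.0
    hits = sum(1 for t in bounds_a
               if any(abs(t - u) <= tolerance for u in bounds_b))
    return hits / len(bounds_a)

# Segment boundaries (in seconds) placed by two labelers
labeler_1 = [0.12, 0.45, 0.80, 1.10]
labeler_2 = [0.13, 0.44, 0.95, 1.10]
print(boundary_agreement(labeler_1, labeler_2))  # 0.75: 0.80 is unmatched
```

This also illustrates the dependency mentioned above: a missing segment label removes a boundary, which then counts as a time disagreement as well.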
Typical values for inter-labeler agreements are