
Transcription

The SmartKom transcription format^15.1 is an overlay of very different information layers on the speech signal. From a technical point of view these layers would be better represented as separate layers coded in XML. However, we found that it is much more time-consuming to produce 7 different layers of information than one complex transcript. If the format is valid and parsable, the layers can be separated automatically later^15.2. Furthermore, the complex transcript is easier to read because the time relations are more obvious.

Assume for the following list of tags that a dialogue between a machine and a human being is transcribed turn by turn by listening to the signals.

Lexical units: Lexical units are written in a standardized spelling and character coding. Furthermore, a definition of lexical units is needed, e.g. words, interjections, reduced forms of words, etc.
In the SmartKom format the spelling is defined by the German Duden and a list of neologisms and foreign words (to keep the spelling of these consistent), the coding is LaTeX and the character set is 7-bit ASCII. The lexical unit comprises only normal words and interjections^15.3.
Spelling: The spelling label is used when the subject spells a name letter by letter, e.g. for referring to the orthography or in abbreviations like `USA'. The letters are always uppercase and separated by a comma or a dash, the latter mostly in abbreviations, e.g.
my name is Smith, $S , $M , $I , $T , $H .
Acronyms: Acronyms are official substitutes for particular words. The label only has to be placed once, at the beginning of the acronym. Acronyms must be pronounced like a word, e.g.
&OPEC
Proper names: Words that cannot be translated into another language are marked as proper names; this includes surnames and first names of people, names of streets, hotels and restaurants, company names, names of institutions, local places, national holidays etc. Words are not labeled as proper names if they exist in more than one language and can thus be translated, e.g. names of international holidays, names of countries and continents, the names of the seven seas etc. If the proper name consists of several words that are separated by spaces in regular orthography, the parts are linked by a `+' sign. For instance:
~Peter ~Marine+World ~Zur+Blauen+Traube
Numbers: Numbers are numerals, combinations of numbers and ordinal numbers. Two-digit numbers are labeled as one word. All numbers are written as words, e.g.
#three #twentytwo #first
Neologisms: `Neologism' is a term referring to a word that has been made up by the speaker and does not appear in a regular dictionary. It could be slang or a slip of the tongue, e.g.
*forrowed
Foreign Words: Foreign words are words that stem from a language other than the one used by the speaker in that dialogue. In these cases an international language code marker is attached to the beginning of the word, e.g.
<*IT>Milano
Off-Talk: Off-Talk occurs especially in Wizard-of-Oz recordings, i.e. when a person is speaking to himself or herself and not to the dialogue partner or the machine. A distinction is made between `Read Off-Talk' (ROT; the person is reading something aloud) and `Other Off-Talk' (OOT; any other speech which does not belong to the dialogue). For example:
now<OOT> what<OOT> do<OOT> we<OOT> have<OOT> here<OOT>
<hm> ~Arabic+Nights<ROT> can you give me
Command Words: Command words are words that speakers use to operate the system by means of meta language, e.g.
!KEYSmartakus
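The word-level labels introduced so far can be separated from the lexical item mechanically. The following is only a minimal sketch in Python, not part of the SmartKom tools: the function name classify and the category names are hypothetical, and it assumes that language codes consist of at least two uppercase letters and that each token carries at most one prefix label.

```python
import re

# Prefix labels taken from the examples above; the category names on the
# right are hypothetical and chosen only for this illustration.
PREFIX_LABELS = [
    ("!KEY", "command_word"),
    ("$", "spelling"),
    ("&", "acronym"),
    ("~", "proper_name"),
    ("#", "number"),
    ("*", "neologism"),
]

# Assumption: a language code is two or more uppercase letters, e.g. <*IT>.
FOREIGN = re.compile(r"^<\*([A-Z]{2,})>(.+)$")
# Off-talk is attached to the end of the word, e.g. what<OOT>.
OFF_TALK = re.compile(r"^(.+)<(ROT|OOT)>$")


def classify(token):
    """Return (category, bare_word, extra) for one whitespace-separated token."""
    extra = {}
    m = OFF_TALK.match(token)
    if m:
        token, extra["off_talk"] = m.group(1), m.group(2)
    m = FOREIGN.match(token)
    if m:
        extra["language"] = m.group(1)
        return "foreign_word", m.group(2), extra
    for prefix, category in PREFIX_LABELS:
        if token.startswith(prefix):
            return category, token[len(prefix):], extra
    # Everything else, including labels in angle brackets, is passed through.
    return "word", token, extra


print(classify("<*IT>Milano"))          # ('foreign_word', 'Milano', {'language': 'IT'})
print(classify("~Arabic+Nights<ROT>"))  # ('proper_name', 'Arabic+Nights', {'off_talk': 'ROT'})
```

Labels such as <hm> or <*T> are deliberately left unclassified here; they would need their own rules.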
Lengthening: Markup of sounds within or at the end of a lexical unit that are lengthened. It may also be used for pre-final lengthening, with plosives that have a particularly long closure phase and in the event of an aspiration phase that is stronger or longer than normal. The label is directly added to the letter representing the sound affected, e.g.
giv<Z>e so<Z>rry
Not or hardly identifiable words: This label can be used if it is impossible to understand a part of what has been said within the recording. Words that are not identifiable can either be completely incomprehensible or may be partially understood but not with certainty. The SmartKom format uses the label <%> in place of a non-understandable word, and a trailing % if we can understand a word partially but not well enough to identify it without any doubt, e.g.
enough% I have <%> enough
Truncated Words: Truncated words occur when the speaker has begun to articulate a word but doesn't finish it. In other words, the item is terminated at a point where some of the component sounds have already been produced, while the rest has been cut off before being articulated. The equal sign is used here as the label; it is also placed during a series of stutters where parts of a word are repeated but the word as a whole is still not completely pronounced, e.g.
the +/que=/+ question is could you hel= <*T>
Articulatory Interruptions: Lexical items can be interrupted by various phenomena such as pauses, breathing, hesitations, slips of the tongue, mispronunciations etc. Such events can be marked up by adding an underscore followed by a blank space at the point of interruption. Then we insert the interrupting element and finally conclude with the remaining part of the interrupted word, which is preceded by another blank space and underscore, e.g.
this e_ <A> _vening
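An interrupted word marked in this way can be reassembled automatically. The sketch below is an illustration in Python under the assumption that the two halves are written exactly as in the example (underscore, blank, interrupting element, blank, underscore); the function name rejoin is hypothetical.

```python
import re

# "<first part>_ <interrupting element(s)> _<remaining part>"
INTERRUPTED = re.compile(r"(\S+)_ (.+?) _(\S+)")


def rejoin(transcript):
    """Return (text with interrupted words rejoined, interrupting elements)."""
    interruptions = []

    def _join(match):
        # Remember what interrupted the word, then glue the two halves together.
        interruptions.append(match.group(2))
        return match.group(1) + match.group(3)

    return INTERRUPTED.sub(_join, transcript), interruptions


print(rejoin("this e_ <A> _vening"))  # ('this evening', ['<A>'])
```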
Technical Interruptions: Technical interruptions are caused by a temporarily broken or missing section of the audio signal, something that might happen due to technical or other errors. There are four distinguishable types of technical interruption:
1. <T_> is used when the beginning of an utterance is missing. In this case it is attached to the beginning of the first lexical item occurring, again without a blank space and regardless of whether this item seems to be complete or fragmentary.
2. <*T> is used when larger parts of an utterance are missing. It's a substitute for the missing speech.
3. <*T>t is used when the end of an utterance is missing.
4. <_T> is used when the last part of a word is cut off. In this case the label is attached to the end of the last word.
Comments on pronunciation: The pronunciation comment indicates that the subject uses an unusual pronunciation (like foreign accent or dialect, word contractions, assimilations or mispronunciations). Thus, pronunciation comments show the deviation between actual pronunciation and the most likely form. In the case of contractions the number of the contracted words is given after the exclamation mark of the label. The label follows the lexical item, separated by a blank space, e.g.
no <!1 nope> haben wir <!2 hamma>
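The pairing of a pronunciation comment with the words it refers to can be sketched as follows. This Python fragment is an assumption-laden illustration, not the SmartKom parser: it assumes that every comment carries a word count, as in the example, and that the words counted are the whitespace-separated tokens directly before the label. The function name is hypothetical.

```python
import re

# <!N pronunciation>: N is the number of words the comment covers.
PRON = re.compile(r"<!(\d+) ([^>]+)>")


def pronunciation_comments(transcript):
    """Return a list of (canonical words, actual pronunciation) pairs."""
    pairs = []
    for match in PRON.finditer(transcript):
        count = int(match.group(1))
        # Naive assumption: the last N tokens before the label are plain words.
        preceding = transcript[:match.start()].split()
        pairs.append((" ".join(preceding[-count:]), match.group(2)))
    return pairs


print(pronunciation_comments("no <!1 nope> haben wir <!2 hamma>"))
# [('no', 'nope'), ('haben wir', 'hamma')]
```

If other labels stand between the words and the comment, the simple token count used here would have to be replaced by a proper tokenizer.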
Repetition or Correction: There's a tendency in spontaneous speech to stutter and also to correct such disfluencies. The brackets +/.../+ are used when the speaker repeats a word or a phrase, or substitutes a new word for the one he started with but continues with the same word class, e.g.
I would like +/to/+ to see
False Start: A false start is characterized by the subject beginning an utterance, breaking it off before completion and continuing the utterance with an entirely new thought. The label is placed in the same way as the repetition/correction label, e.g.
-/this evening/- tomorrow I will
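Because repetitions, corrections and false starts are bracketed, a fluent version of a turn can be derived by dropping the bracketed material. The following Python sketch illustrates this; the function name fluent_text is hypothetical, and it assumes that the brackets are not nested.

```python
import re

REPAIR = re.compile(r"\+/(.*?)/\+")      # repetition or correction
FALSE_START = re.compile(r"-/(.*?)/-")   # false start


def fluent_text(transcript):
    """Return (cleaned text, repeated/corrected parts, false starts)."""
    repairs = REPAIR.findall(transcript)
    false_starts = FALSE_START.findall(transcript)
    cleaned = FALSE_START.sub("", REPAIR.sub("", transcript))
    # Collapse the blanks left behind by the removed brackets.
    return " ".join(cleaned.split()), repairs, false_starts


print(fluent_text("I would like +/to/+ to see"))
# ('I would like to see', ['to'], [])
print(fluent_text("-/this evening/- tomorrow I will"))
# ('tomorrow I will', [], ['this evening'])
```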
Breathing: Clearly audible breathing, inhalation or exhalation, often occurs at prosodic or syntactic boundaries. In the transcript only breathing that can be heard well has to be marked. If the punctuation mark and the breathing label collide, the punctuation mark is put first, e.g.
please show me <A> the way . <A>
Filled Pauses: In spontaneous speech filled pauses are defined as pauses that are filled with some vocalization (or nasalization). A filled pause may occur when a speaker thinks about something. The speaker actually interrupts his speech while continuing his articulation. This articulation is however neither a word nor part of a word and should thus not be treated as such. As a consequence a punctuation mark cannot follow a filled pause; it has to come first. Nevertheless a filled pause can form a turn of its own. In SmartKom transcripts the four labels <"ah> (vocalic), <"ahm> (vocalic/nasalized), <hm> (nasalized) and <h"as> (others) are used. English adaptations for these four markers, <uh> , <uhm> , <hm> and <hes> , are also allowed.
Empty Pause: Empty pauses can be defined as temporary, unfilled gaps in speech. They can be overlaid by cross talk, but cannot overlay actively. Just as with the filled pause labels, punctuation marks always come first. Empty pauses at the beginning or at the end of a turn are not transcribed, e.g.
could you please <P> tell me
Human Noises: Speakers also produce sounds that have no real meaning, such as laughing, coughing, swallowing etc. These are all labeled as <Noise> or <Ger"ausch> (German for noise). If one of these noises occurs for a long period of time, without being interrupted (a speaker laughing for example), a single label will be sufficient. As usual, punctuation marks come first.
Technical Noises: Noises that can't be attributed to the speaker are technical noises. These might be caused by the recording instruments, by dropping things or by people moving around while recording, e.g.
hello <#> !KEYSmartakus
Cross talk: Cross talk occurs when the subject and the system (or two subjects) speak at the same time or when noises occur while the subject speaks. From the point of view of the subject a cross talk may be either passive or active, depending on whether the speaker is the one who has been interrupted or the one who has interrupted. In either case the labeling indicates both the turn components passively affected by and the turn components actively affecting the interference. It's quite usual that within a dialogue there are several speaker interferences. This is why interferences are numbered consecutively, e.g.
A: I'd like1@ to1@
B: @1please @1give me
A: here you can2@ see2@
B: @2that's @2right
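Since the interferences are numbered, the overlapping words of both speakers can be collected per number. The sketch below is a hypothetical Python illustration; it assumes the turns are already split into (speaker, text) pairs and relies on the fact that numbers in the transcript are written as words, so digits next to an @ can only belong to a cross-talk label.

```python
import re
from collections import defaultdict

SUFFIX = re.compile(r"^(.+?)(\d+)@$")   # e.g. like1@
PREFIX = re.compile(r"^@(\d+)(.+)$")    # e.g. @1please


def overlaps(turns):
    """turns: list of (speaker, text); returns {number: {speaker: [words]}}."""
    result = defaultdict(lambda: defaultdict(list))
    for speaker, text in turns:
        for token in text.split():
            m = SUFFIX.match(token)
            if m:
                result[int(m.group(2))][speaker].append(m.group(1))
                continue
            m = PREFIX.match(token)
            if m:
                result[int(m.group(1))][speaker].append(m.group(2))
    # Convert the nested defaultdicts into plain dictionaries.
    return {number: dict(words) for number, words in result.items()}


dialogue = [
    ("A", "I'd like1@ to1@"),
    ("B", "@1please @1give me"),
    ("A", "here you can2@ see2@"),
    ("B", "@2that's @2right"),
]
print(overlaps(dialogue))
# {1: {'A': ['like', 'to'], 'B': ['please', 'give']},
#  2: {'A': ['can', 'see'], 'B': ["that's", 'right']}}
```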
Superposition of noise: Any part of an utterance may be superimposed by one or more noises that are either background noises or noises produced by a speaker. If a noise appears during a word, brackets are used to enclose both the noise and the word, e.g.
I <:<Ger"ausch> will:> take here <:<#> you:> are
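The superposition brackets can be resolved into pairs of noise label and affected words. The following Python sketch is an illustration only; it assumes that exactly one noise label opens each bracket, as in the example, and the function name superposed is hypothetical.

```python
import re

# <:<noise label> affected words:>
SUPERPOSED = re.compile(r"<:(<[^>]+>) (.*?):>")


def superposed(transcript):
    """Return a list of (noise label, words covered by the noise)."""
    return [(m.group(1), m.group(2)) for m in SUPERPOSED.finditer(transcript)]


print(superposed('I <:<Ger"ausch> will:> take here <:<#> you:> are'))
# [('<Ger"ausch>', 'will'), ('<#>', 'you')]
```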
Prosodic events: It is quite possible to mark up prosodic events in a transcript. In SmartKom, primary and secondary accent of the utterance as well as boundaries are marked in square brackets after the corresponding word item (see the following example transcript).


BITS Projekt-Account 2004-06-01