Next: Transcription Example
Previous: Corpus Specification
The transcription format is an overlay of very
different information layers on the speech signal. From a technical point
of view these layers would be better represented as separate layers
coded in XML. However, we found that it is much more time-consuming to
produce 7 different layers of information than one complex transcript. If
the format is valid and parsable, the layers can be separated later
automatically. Furthermore, the complex transcript is
easier to read because the time relations are more obvious.
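The automatic separation mentioned above could look roughly like the following sketch. It is not the SmartKom tooling. It assumes a simplified marker inventory read off the examples in the tag list below (`$` for spelling, `~` for proper names, `#` for numbers, stand-alone labels such as `<A>` or `<P>`), and the function name `split_layers` is hypothetical. Multi-token labels such as pronunciation comments are not handled.

```python
import re

# Assumed, simplified marker inventory based on the examples in this section.
PREFIXES = {"$": "spelling", "~": "proper_name", "#": "number"}
PUNCTUATION = {".", ",", "?", "!"}
# Stand-alone labels without inner whitespace, e.g. <A>, <P>, <#>, <hm>, <%>.
EVENT = re.compile(r"<[^<>\s]+>")


def split_layers(turn):
    """Split one transcribed turn into a word list and a list of
    (word_position, layer, value) annotations."""
    words, annotations = [], []
    for token in turn.split():
        position = len(words)  # index of the next word to be appended
        if EVENT.fullmatch(token):
            annotations.append((position, "event", token))
        elif token in PUNCTUATION:
            annotations.append((position, "punctuation", token))
        elif len(token) > 1 and token[0] in PREFIXES:
            annotations.append((position, PREFIXES[token[0]], token[1:]))
            # '+' links the parts of a multi-word proper name.
            words.append(token[1:].replace("+", " "))
        else:
            words.append(token)
    return words, annotations


words, annotations = split_layers("please show me <A> the way . <A>")
print(words)
print(annotations)
```

Each annotation keeps the position of the word it precedes, so the time relations of the combined transcript are preserved when the layers are stored separately.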
Assume for the following list of tags that a dialogue between a machine and a
human being is transcribed turn by turn by listening to the signals.
- Lexical units: Lexical units are written in a standardized spelling and character
coding. Furthermore, a definition of lexical units is needed,
e.g. words, interjections,
reduced forms of words, etc.
In the SmartKom format the spelling is defined by the German Duden and
a list of neologisms and foreign words (to keep the spelling of these consistent), the
coding is LaTeX and the character set is 7-bit ASCII. The lexical unit
comprises only normal words and interjections.
- Spelling: The spelling label is used when the subject spells a name
letter by letter, e.g. for referring to the orthography or in abbreviations
like `USA'. The letters are always uppercase and separated by a comma or a dash,
the latter mostly in abbreviations, e.g.
my name is Smith, $S , $M , $I , $T , $H .
- Acronyms: Acronyms are official substitutes for particular words.
The label only has to be placed once, at the
beginning of the acronym. Acronyms must be pronounced like a word.
- Proper names: All words that cannot be translated into
another language are marked as proper names; this includes surnames and first names of people,
names of streets, hotels and restaurants, company names, names of
institutions, local places,
national holidays etc. Words are not labeled as proper
names if they appear in more than one language and thus can be
translated, e.g. names of international holidays, names of
countries and continents, the names of the seven seas etc. If the proper
name consists of several words that are separated by
spaces in regular orthography, they are linked by a `+' sign between each part of the
name, e.g.
~Peter ~Marine+World ~Zur+Blauen+Traube
- Numbers: Numbers are numerals, combinations of numbers and ordinal
numbers. Two-digit numbers are
labeled as one word. All numbers are written as words, e.g.
#three #twentytwo #first
- Neologisms: `Neologism' is a term referring to a word that has been made
up by the speaker and does not appear in a regular dictionary. It could be
slang or a slip of the tongue. e.g.
- Foreign Words: Foreign words are words that stem from a language
other than the one used by the speaker in that dialogue. In these cases a
language code marker is attached to
the beginning of the word, e.g.
- Off-Talk: Off-Talk occurs especially in Wizard-of-Oz recordings,
i.e. when a person is speaking to himself or herself and not to the partner of
the dialogue or the machine. A distinction is made between
`Read Off-Talk' (ROT; the person is reading something aloud) and `Other
Off-Talk' (OOT; any other speech which does not belong to the dialogue), e.g.
now<OOT> what<OOT> do<OOT> we<OOT> have<OOT> here<OOT>
<hm> ~Arabic+Nights<ROT> can you give me
- Command Words: Command words are words that speakers use to operate the
system by means of meta language, e.g.
- Lengthening: Markup of sounds within or at the end of a lexical
unit that are lengthened.
It may also be used for pre-final lengthening, with plosives that have a
closure phase and in the event of an aspiration phase that is stronger
or longer than normal. The label is added directly to the letter
representing the sound affected, e.g.
- Not or hardly identifiable words: This label can be used if it is
impossible to understand a part of what has been said within the
recording. Words that are not identifiable can either be completely
incomprehensible or may be partially understood but not with certainty.
The SmartKom format uses the label <%> in place of a
non-understandable word, and % if we can understand a word partially but not
well enough to identify it without any doubt, e.g.
enough% I have <%> enough
- Truncated Words: Truncated words occur when the speaker has begun to
articulate a word but doesn't finish it. In other words, the item is terminated
at a point where some of the component sounds have already been produced, while
the rest has been cut off before being articulated. The equal sign is used here
as the label; it is also placed during a series of stutters
where parts of a word are repeated but the word as a whole is still not
completely pronounced, e.g.
the +/que=/+ question is could you hel= <*T>
- Articulatory Interruptions: Lexical items can be interrupted by various
phenomena such as pauses, breathing, hesitations, slips of the tongue,
mispronunciations etc. Such events can be marked up by adding
an underscore followed by a blank
space at the point of interruption. Then the
interrupting element is inserted, followed by the remaining part of the
interrupted word, which is preceded by another blank space and an underscore, e.g.
this e_ <A> _vening
- Technical Interruptions: Technical interruptions are caused by a
temporarily broken or missing section of the audio signal, something that might
happen due to technical or other errors. There are four distinguishable types
of technical interruption:
<T_> is used when the beginning of an utterance is missing. In this case it is attached to the beginning of the first
lexical item occurring, again without a blank space and regardless of whether
this item seems to be complete or fragmentary.
<*T> is used when larger parts of an utterance are missing. It is a substitute for the missing speech.
<*T>t is used when the end of an utterance is missing.
<_T> is used when the last part of a word is cut off. In this case the label is attached to the end of the last word.
- Comments on pronunciation: The pronunciation comment indicates that the
subject uses an unusual pronunciation (such as a
foreign accent or dialect, word contractions, assimilations
or mispronunciations). Thus, pronunciation comments show the deviation
between the actual pronunciation and the most likely form. In the case of
contractions the number of contracted words is given after the exclamation
mark of the label. The label follows the lexical item, separated by a blank
space, e.g.
no <!1 nope> haben wir <!2 hamma>
- Repetition or Correction: There is a tendency in spontaneous speech to
stutter and also to correct such disfluencies. The delimiters `+/' and `/+'
are used when
the speaker repeats a word or a phrase or when he substitutes a new word for
the one he started with, but continues with the same word class, e.g.
I would like +/to/+ to see
- False Start: A false start is characterized by the subject beginning an
utterance, breaking it off before completion and continuing the utterance with an
entirely new thought. The label is placed in the same way as the
repetition/correction label, e.g.
-/this evening/- tomorrow I will
- Breathing: Clearly audible breathing, inhalation or exhalation, often
occurs at prosodic or syntactic boundaries. In the transcript only breathing that
can be heard well has to be marked.
If the punctuation mark and the breathing label
collide, the punctuation mark is put first, e.g.
please show me <A> the way . <A>
- Filled Pauses: In spontaneous speech filled pauses are defined as pauses
that are filled with some vocalization (or nasalization).
A filled pause may occur when a speaker
thinks about something. The speaker actually interrupts his speech while continuing
his articulation. This articulation is however neither a word nor part of a word
and should thus not be treated as such. As a consequence a punctuation mark cannot
follow a filled pause; it has to come first. Nevertheless a filled pause can form
a turn of its own. In SmartKom transcripts the
<hm> (nasalized) and
(others) are used. English adaptations for these four markers
are also allowed.
- Empty Pause: Empty pauses can be defined as temporary, unfilled gaps in
speech. They can be overlaid by cross talk, but cannot
overlay actively. Just as with the filled pause labels, punctuation marks always
come first. Empty pauses at the beginning or at the end of a turn are not
marked, e.g.
could you please <P> tell me
- Human Noises: Speakers also produce sounds that have no real meaning, such
as laughing, coughing, swallowing etc. These are all labeled as
<Ger"ausch> (German for noise). If one of these noises occurs for a long period of time,
without being interrupted (a speaker laughing for example), a single label will
be sufficient. As usual, punctuation marks come first.
- Technical Noises: Noises that can't be attributed to the speaker are
technical noises. These might be caused by the recording instruments, by dropping
things or by people moving around while recording, e.g.
hello <#> !KEYSmartakus
- Cross talk: Cross talk occurs when the subject and the
system (or two subjects) speak at the same time or when noises occur
while the subject
speaks. From the point of view of the subject, cross talk may be
either passive or active, depending on whether the speaker is the one who has been
interrupted or the one who has interrupted. In either case
the labeling indicates
both the turn components passively affected by the interference and the turn components
actively causing it. It is quite usual for a dialogue to contain
several speaker interferences. This is why interferences are numbered, e.g.
A: I'd like1@ to1@
B: @1please @1give me
A: here you can2@ see2@
B: @2that's @2right
- Superposition of noise: Any part of an utterance may be
superimposed by one or more noises that
are either background noises or noises produced by a speaker. If a noise
appears during a word, brackets are used to enclose both the noise and the
word, e.g.
I <:<Ger"ausch> will:> take here <:<#> you:> are
- Prosodic events: It is quite possible to mark up prosodic events
in a transcript. In SmartKom, primary and secondary accent of the
utterance as well as boundaries are marked in square brackets after the
corresponding word item (see the following example).
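Several of the labels in the list above are regular enough to be processed with simple patterns. The following sketch is an illustration only, not part of the SmartKom software. The function names are hypothetical, and it assumes that the span enclosed in `+/ ... /+` or `-/ ... /-` is the part the speaker abandoned. Superposition brackets and pronunciation comments are not handled.

```python
import re

REPAIR = re.compile(r"\+/(.*?)/\+")     # repetition or correction, e.g. +/to/+
FALSE_START = re.compile(r"-/(.*?)/-")  # false start, e.g. -/this evening/-
EVENT = re.compile(r"<[^<>\s]+>")       # stand-alone labels, e.g. <A>, <P>, <#>
# Cross talk: "@1please" on one side, "like1@" on the other.
CROSSTALK = re.compile(r"@(\d+)(\S+)|(\S+?)(\d+)@")


def disfluencies(turn):
    """Collect the text inside repetition/correction and false start labels."""
    return {
        "repair": REPAIR.findall(turn),
        "false_start": FALSE_START.findall(turn),
    }


def fluent_text(turn):
    """Drop abandoned spans and stand-alone labels, keep the remaining words."""
    text = REPAIR.sub(" ", turn)
    text = FALSE_START.sub(" ", text)
    text = EVENT.sub(" ", text)
    return " ".join(text.split())


def crosstalk_words(turn):
    """Map each interference number to the words of this turn that carry it."""
    marked = {}
    for token in turn.split():
        match = CROSSTALK.fullmatch(token)
        if match is None:
            continue
        if match.group(1) is not None:   # form "@1please"
            number, word = match.group(1), match.group(2)
        else:                            # form "like1@"
            number, word = match.group(4), match.group(3)
        marked.setdefault(int(number), []).append(word)
    return marked


print(fluent_text("-/this evening/- tomorrow I will"))
print(disfluencies("I would like +/to/+ to see"))
print(crosstalk_words("I'd like1@ to1@"))
print(crosstalk_words("@1please @1give me"))
```

Matching the numbers returned by `crosstalk_words` for two turns shows which words of speaker A overlap with which words of speaker B.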