BAS
Bavarian Archive for Speech Signals
Pronunciation Lexicon PHONOLEX
Same page in German
This page was last updated 2013-09-19
Overview
PHONOLEX is the result of a close cooperation between the DFKI
Saarbrücken, Computational Linguistics Lab, the Universität Leipzig (UL)
and the Bavarian Archive for Speech Signals (BAS) in Munich.
It comprises a simple list of word forms (inflected words) with the
following entries:
- Orthography
Features:
- ASCII or UTF-8; German 'Umlauts' also in LaTeX format
- Nouns are capitalized
- Old and modern German spelling rules (depending on the source)
- Only single words - no phrases
- Orthographic conventions according to source corpus (this may
result in different coding for the same word!)
- Other information
Features:
TAB-separated list of linguistic markers; each marker consists of a
two-character key, a colon and a value (string); markers are optional and
may appear in any order
Word class: CL
- Noun - nom
- Verb - ver
- Adjective - adj
- Adverb - adv
- Preposition - prep
- Proper Name - prop
- Article - det
- Numerals - num
- Particle - par
- Noun baseform - baseform
Gender: GE
- masculine - m
- feminine - f
- neuter - n
Origin: OR
- Universität Saarbrücken - sb
- Universität Leipzig - lg
- German Verbmobil - vm
- Phondat 1 - pd1
- Phondat 2 - pd2
- SI100 - si100
- SI1000 - si1000
- RVG1 read speech - rvg1_read
- RVG1 monologue - rvg1_trl
- German SmartKom - sk_ger
- SpeechDat FIXED1DE - fixed1de
- SpeechDat VEHIC1DE - vehic1de
- SpeechDat MOBIL1DE - mobil1de
- SpeechDat VERIF1DE - verif1de
- SpeechDat ORIENTEL - orientel
- HEMPEL monologue - hempel
- RVG-J kids speech - rvg-j
- ZIPTEL numbers over telephone - ziptel
- German SmartWeb queries - sw_ger
- ALC - alcoholized speech - alc
Text-to-Phoneme Method: TP
- Pronunciation
Features:
- Citation form.
- Coded in extended SAM-PA (PhonDat-Verbmobil)
The citation form is generated by different methods and exception lists
(given by the 'TP' key above).
The program P-TRA was provided by the University of Bonn, Dr. Stock.
P-TRA was ported to UNIX at BAS and modified for use within the
PHONOLEX project.
- List of empirical pronunciations
Features:
- List may be empty if no empirical pronunciations have been found yet
- Coded in extended SAM-PA (PhonDat-Verbmobil)
- Manual or automatic segmentation (MAN vs. MAUS)
- One line consists of:
pronunciation TAB counter TAB corpus TAB type
with
pronunciation : pronunciation in SAM-PA
counter : number of times the pronunciation was detected in the corpus
corpus : source corpus
type : type of analysis (MAN: manual, MAUS: automatic)
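The two TAB-separated line types described above (the marker line and the empirical pronunciation line) could be read along the following lines. This is only a minimal sketch: the function and class names are hypothetical, and the sample strings in the comments are made up for illustration, not taken from the lexicon.

```python
from typing import Dict, NamedTuple


def parse_info(line: str) -> Dict[str, str]:
    """Split a TAB-separated marker line, e.g. 'CL:nom<TAB>GE:n<TAB>OR:vm'.

    Markers are optional and unordered, so the result is a plain dict.
    """
    markers: Dict[str, str] = {}
    for field in line.strip().split("\t"):
        if not field:
            continue  # empty marker line or doubled TAB
        key, sep, value = field.partition(":")
        if not sep:
            raise ValueError(f"marker without colon: {field!r}")
        markers[key] = value
    return markers


class Empirical(NamedTuple):
    pronunciation: str  # extended SAM-PA string
    counter: int        # integer counter from the second column
    corpus: str         # source corpus
    kind: str           # the 'type' column: 'MAN' or 'MAUS'


def parse_empirical(line: str) -> Empirical:
    """Parse one line 'pronunciation TAB counter TAB corpus TAB type'."""
    pron, counter, corpus, kind = line.rstrip("\n").split("\t")
    return Empirical(pron, int(counter), corpus, kind)
```

Only the first colon of a marker is treated as the separator, so a value that itself contains a colon would be kept intact.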
Structure
PHONOLEX is currently built as a simple ASCII file (file phonolex)
and as an XML version (file phonolex_xml).
The entries are sorted in ASCII order ('NL' is new line).
file -> item 'NL'
        [ item 'NL'
          ... ]
item -> orthography
        info
        canonic_pronunciation
        empirical_pronunciation_list
        '*'
orthography -> German orthography with LaTeX umlauts
info -> TAB-separated list of key:string markers
canonic_pronunciation -> word_form
empirical_pronunciation_list ->
        word_form TAB counter TAB corpus TAB type
        [ ... ]
word_form -> string of extended SAM-PA
counter -> Integer
corpus -> String
type -> String
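As an illustration of the grammar above, the following sketch splits the ASCII file into items. It rests on assumptions the grammar does not spell out: each component of an item sits on its own line, the info line is always present (possibly empty), and an item ends with a line holding only '*'. The function name read_items and the dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def _empirical(line: str) -> Tuple[str, int, str, str]:
    # word_form TAB counter TAB corpus TAB type
    pron, counter, corpus, kind = line.split("\t")
    return (pron, int(counter), corpus, kind)


def read_items(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per item; a line containing only '*' closes an item."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "*":
            if len(block) >= 3:
                yield {
                    "orthography": block[0],
                    # key:string markers, split on the first colon only
                    "info": dict(f.split(":", 1)
                                 for f in block[1].split("\t") if ":" in f),
                    "canonical": block[2],
                    # the empirical list may be empty
                    "empirical": [_empirical(l) for l in block[3:] if l],
                }
            block = []
        else:
            block.append(line)
```

A file object can be passed directly, for example read_items(open("phonolex", encoding="utf-8")), since the function only iterates over lines.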
Aside from the basic PHONOLEX ASCII list there exist an XML version
and two excerpts that may come in handy:
phonolex_xml : XML version of phonolex; see the DTD for details
phonolex_list : simplified version of phonolex with a three-column list
(one entry per line): orthography pronunciation origin
phonolex_core : same as phonolex_list, but only entries with manually
checked pronunciation are listed
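A three-column excerpt such as phonolex_list or phonolex_core might be loaded as sketched below. The column separator is not stated here, so the sketch assumes whitespace-separated columns and that neither the orthography nor the pronunciation contains blanks; load_list is a hypothetical name. Because an orthographic form can occur on more than one line, each key maps to a list.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Tuple


def load_list(lines: Iterable[str]) -> DefaultDict[str, List[Tuple[str, str]]]:
    """Map orthography -> [(pronunciation, origin), ...]."""
    lexicon: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        parts = line.split()  # assumption: columns separated by whitespace
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        orth, pron, origin = parts
        lexicon[orth].append((pron, origin))
    return lexicon
```

If the files turn out to be strictly TAB-separated, replacing line.split() with line.rstrip("\n").split("\t") would be the stricter choice.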
Known Bugs
(plenty and hopefully decreasing; see German Page)
Current Corpus Documentation
Source Table
History
- Dec 95 : Foundation of Working Group DFKI - BAS
- Aug 96 : Version 1.0 : First word list - 665,893 forms
- Aug 96 : Version 1.1 : Improved P-TRA, exception lists, 666,237 entries
- Dec 96 : Version 1.2 : Improved glottal stops, geminates removed,
update to users
- Jan 97 : Version 1.3 : Improved rule set, benchmark from 62 to 67 %
- Feb 97 : Version 1.4 : Bug removed: in some contexts a superfluous
/S/ was appended to words.
- Jun 98 : University of Leipzig joins Working group
- Sep 98 : Extended word list to 1,600,000
- Nov 98 : Version 2.0 : Changed format of info line to 'Key:Text',
inserted ORIGIN marker,
improved rule set for P-TRA (benchmark to 80%),
using morpheme boundaries
- Mar 99 : Version 2.1 : Bug caused some items of origin 'lg' not to be marked
with text-to-phoneme method 'ptra' ('TP:ptra'),
all items from origin 'lg' had an empty class tag,
improved canonical pronunciation for items with
morph boundaries (benchmark to 90%)
- May 99 : Version 2.2 : Improved rule sets for the pronunciation
(benchmark: with morph boundaries : 93%,
w/out morph boundaries : 83%)
- Jun 99 : Version 2.3 : Added new class of noun baseforms ('baseform') that are NOT compounds of German
- Jul 99 : Version 2.4 : Extended empirical pronunciations from VM corpus
- Aug 99 : Version 2.5 : 48 entries contained an 8-bit character in the
pronunciation denoting /O~/. Fixed.
- Jul 01 : Version 2.6 : Added new empirically detected pronunciations
based on the Verbmobil corpora
New sources PhonDat1, Phondat2, SI100, SI1000, RVG1
added.
- Jan 03 : Version 2.7 : New source SmartKom German (sk_ger) added.
- Jan 03 : Version 2.8 : New sources RVG1_read and RVG1_trl added.
- Jan 03 : Version 2.9 : New source SpeechDat (FIXED1DE, MOBIL1DE, VEHIC1DE,
VERIF1DE, ORIENTEL) added
- Feb 03 : Version 2.10 : New corrected version of Verbmobil (OR:vm)
source, SourceTable.pdf added with specific
description and features of sources
- Apr 03 : Version 3.0 : Added a rule set for proper transcription
in German SAM-PA.
The following resources were re-transcribed
to meet the requirements of the new standard:
Verbmobil I + II (or:VM)
SmartKom (or:SK)
- Jul 03 : Version 3.1 : Added filter that prevents /R/ (instead of /r/)
in the rule-based pronunciation output.
Rebuilt phonolex
- Aug 03 : Version 3.2 : Added 'TP=manu_veri' descriptor, which denotes
a manually verified canonical pronunciation
according to the 'Transcription Conventions
for Canonical German' as published on the BAS
Web site.
Re-calculated transcription of the German VM
corpus and updated the empirical word forms in
phonolex accordingly.
Rebuilt phonolex.
- Sep 03 : Version 3.3 : Extended the makefile for the generation of
phonolex_core, a list of all phonolex entries
that have been manually checked for accuracy
and tagged with "manu_veri"
- Jan 04 : Version 3.4 : Fixed a bug in the creation of phonolex_core.
The bug caused the first column to have multiple
identical entries with different pronunciations
- Jan 04 : Version 3.5 : Fixed some bugs in sk_ger.lex, RVG1_trl
R-substitution did not work because of a faulty
script for RVG1 lexica.
- Feb 04 : Version 3.6 : Updated documentation; mapped orthography of
SpeechDat lexica to LaTeX
Added hempel
- Feb 04 : Version 3.7 : Mapped glottal stops /?/ in sd1 lexica to /Q/
Added rvg-j; phonolex_core now at 22086 entries
- Mar 04 : Version 3.8 : Re-calculated OR:vm entries after bug fix in
volume 4.1 signals.
- Apr 04 : Version 3.9 : Bugfix in source RVG-J : This bug caused about 100
entries from RVG-J to be misaligned. Fixed.
- May 04 : Version 3.10: Approx. 120 typos fixed; mainly in source hempel
Changed /R/ to /r/ in all speechdat sources
- Jun 04 : Version 3.11: Re-calculated MAUS segmentations of VM corpora;
included new empirical wordforms (OR:vm)
- Oct 04 : Version 3.12: Fixed /R/ -> /r/ in HEMPEL source
Fixed errors in RVG1 source
- Dec 04 : Version 3.13: Added speechdat_m section
added third column to phonolex_core output derived from
key OR:...
added phonolex_list output with a simple three-column
list (as phonolex_core) with all phonolex entries,
where each orthographic entry comes only once (the
first one, if there are multiple of equal quality)
- Feb 05 : Version 3.14: Replaced original PD1 lexicon by BAS standard
list (PD1_bas.lex)
- Apr 05 : Version 3.15: Replaced FIXED1DE by a manually verified version
Replaced MOBIL1DE by a manually verified version
Replaced VEHIC1DE by a manually verified version
- May 05 : Version 3.16: Replaced VERIF1DE by a manually verified version
Replaced ORIENTEL by a manually verified version
Added ZIPTEL manually verified
- Jun 05 : Version 3.17: Replaced RVG1_TRL by a manually verified version
Replaced RVG1_READ by a manually verified version
- Sep 05 : Version 3.18: Replaced SI100 by a manually verified version
Replaced SI1000 by a manually verified version
Multiple pronunciation error fixes in the following
source lexica:
HEMPEL, ORIENTEL, PD1, RVG-J, SmartKom (sk_ger),
FIXED1DE, MOBIL1DE, VEHIC1DE, VERIF1DE,
Verbmobil (vm_ger), ZIPTEL
- Sep 05 : Version 3.19: Multiple pronunciation error fixes in the following
source lexica: PD2, vm_ger (German Verbmobil)
New calculation of pronunciation variants of vm_ger
- Oct 05 : Version 3.20: Added XML version phonolex_xml
- Jun 08 : Version 3.21: Added SmartWeb (SW) manually verified
- Jul 11 : Version 3.23: Changed phonolex_list so that it contains ALL entries
of phonolex, not just unique (and arbitrarily chosen) orthographic entries.
Added alc section (alc) manually verified.
- Aug 11 : Version 3.24: Multiple pronunciation error fixes in FIXED1DE,
ZIPTEL, SMARTKOM, ALC.
- Sep 13 : Version 4.0: Re-coding of the orthographic string. Until now the coding
of the orthographic string depended on the coding of the source.
This led to mixed-coding files. From version 4.x on, the coding
must be either LaTeX or UTF-8, resulting in true UTF-8
coded files:
ziptel, hempel, rvg-j are recoded on the fly from ISO8859 to UTF-8
(sources still ISO8859!)
Bug fix: corrupt entry at the beginning of phonolex_core/list: 'OR:si100...'
Content fix: several wrong pronunciations fixed in sources
Availability
A copy of the current PHONOLEX version may be obtained from BAS.
The purchase of a user licence is necessary.
The user licence entitles the holder to use the PHONOLEX list for commercial,
scientific and educational purposes (depending on the licence). Furthermore,
the owner of a user licence will receive upgrades to higher versions of
PHONOLEX for free.
The user licence does not entitle the user to re-distribute the list
to third parties in any form, whether in part, modified or extended.
Furthermore the user agrees to report any errors found in the list to
the BAS. This way we hope to achieve improvements in the future.
All rights stay with DFKI and BAS.
By purchasing the user licence the user accepts the above conditions.
Costs
PHONOLEX - Delivery via CDROM, Update-Service
Scientific Licence EUR 1030.25
Scientific Licence ELRA members EUR 631.45
Licence commercial EUR 6081.82
Licence commercial ELRA members EUR 3423.10
Please send orders or questions to the following address:
.
The signed user agreement has to be faxed or sent by mail
prior to or together with the order.
Copyright © 1996-2011 Bayerisches Archiv für Sprachsignale,
Universität München, Deutsches Forschungszentrum für künstliche
Intelligenz, Saarbrücken, Universität Leipzig.
This page and all other pages with the initial 'BAS' or 'Bas' in the
filename may be copied, printed and distributed to other parties,
under the condition that the pages are distributed as shown here. Parts
of pages or extended pages may not be distributed further without
permission of the BAS.
Florian Schiel