Overview of Non-native Databases
Table 1: Abbreviations for languages used in Table 2
| Language | Abbreviation | Language | Abbreviation |
|---|---|---|---|
| Arabic | A | Japanese | J |
| Chinese | C | Korean | K |
| Czech | Cze | Malaysian | M |
| Danish | D | Norwegian | N |
| Dutch | Dut | Portuguese | P |
| English | E | Russian | R |
| French | F | Spanish | S |
| German | G | Swedish | Swe |
| Greek | Gre | Thai | T |
| Indonesian | Ind | Vietnamese | V |
| Italian | I | | |
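The abbreviations in Table 1 can be read as a simple lookup from code to language name. The following is a minimal sketch of that lookup, useful for expanding cells such as `E F G I Cze Dut`. The names `LANGUAGE_CODES` and `expand_languages` are illustrative and do not come from the source, and the sketch assumes that the codes within a cell are separated by spaces.

```python
# Abbreviations copied from Table 1 (code -> language name).
LANGUAGE_CODES = {
    "A": "Arabic", "C": "Chinese", "Cze": "Czech", "D": "Danish",
    "Dut": "Dutch", "E": "English", "F": "French", "G": "German",
    "Gre": "Greek", "I": "Italian", "Ind": "Indonesian", "J": "Japanese",
    "K": "Korean", "M": "Malaysian", "N": "Norwegian", "P": "Portuguese",
    "R": "Russian", "S": "Spanish", "Swe": "Swedish", "T": "Thai",
    "V": "Vietnamese",
}


def expand_languages(cell: str) -> list[str]:
    """Expand a space-separated cell of Table 1 codes into language names.

    Tokens that are not Table 1 codes (for example free text such as
    "many" or "unknown") are returned unchanged, token by token.
    """
    return [LANGUAGE_CODES.get(token, token) for token in cell.split()]


print(expand_languages("E F G I Cze Dut"))
# ['English', 'French', 'German', 'Italian', 'Czech', 'Dutch']
```

Passing unknown tokens through unchanged is a deliberate choice in this sketch: some cells hold free text rather than codes, and silently dropping it would lose information.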
Table 2 lists the individual databases together with their main properties.
Table 2: Overview of non-native Databases
| Corpus | Author | Available at | Language(s) | #Speakers | Native language | #Utterances | Duration | Date | Special features | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| AMI | | EU | E | | Dut and other | | 100h | | meeting recordings | |
| ATR-Gruhn | Gruhn | ATR | E | 96 | C G F J Ind | 15000 | | 2004 | proficiency rating | |
| BAS Strange Corpus I+II | | ELRA | G | 139 | 50 countries | 7500 | | 1998 | | |
| Berkeley Restaurant | | ICSI | E | 55 | G I H C F S J | 2500 | | 1994 | | |
| Broadcast News | | LDC | E | | | | | 1997 | | |
| Cambridge-Witt | Witt | U. Cambridge | E | 10 | J I K S | 1200 | | 1999 | | |
| Cambridge-Ye | Ye | U. Cambridge | E | 20 | C | 1600 | | 2005 | | |
| Children News | Tomokiyo | CMU | E | 62 | J C | 7500 | | 2000 | partly spontaneous | |
| CLIPS-IMAG | Tan | CLIPS-IMAG | F | 15 | C V | | 6h | 2006 | | |
| CLSU | | LDC | E | | 22 countries | 5000 | | 2007 | telephone, spontaneous | |
| CMU | | CMU | E | 64 | G | 452 | 0.9h | | not available | |
| Cross Towns | Schaden | U. Bochum | E F G I Cze Dut | 161 | E F G I S | 72000 | 133h | 2006 | city names | |
| Duke-Arslan | Arslan | Duke University | E | 93 | 15 countries | 2200 | | 1995 | partly telephone speech | |
| ERJ | Minematsu | U. Tokyo | E | 200 | J | 68000 | | 2002 | proficiency rating | |
| Fischer | | LDC | E | | many | | 200h | | telephone speech | |
| Fitt | Fitt | U. Edinburgh | F I N Gre | 10 | E | 700 | | 1995 | city names | |
| Fraenki | | U. Erlangen | E | 19 | G | 2148 | | | | |
| Hispanic | Byrne | | E | 22 | S | | 20h | 1998 | partly spontaneous | |
| IBM-Fischer | | IBM | E | 40 | S F G I | 2000 | | 2002 | digits | |
| ISLE | Atwell | EU/ELDA | E | 46 | G I | 4000 | 18h | 2000 | | |
| Jupiter | Zue | MIT | E | unknown | unknown | 5146 | | 1999 | telephone speech | |
| K-SEC | Rhee | SiTEC | E | unknown | K | | | 2004 | | |
| LDC WSJ1 | | LDC | | 10 | | 800 | 1h | 1994 | | |
| LeaP | Gut | University of Münster | E G | 127 | 41 different ones | 73,941 words | 12h | 2003 | | |
| MIST | | ELRA | E F G | 75 | Dut | 2200 | | 1996 | | |
| NATO HIWIRE | | NATO | E | 81 | F Gre I S | 8100 | | 2007 | clean speech | |
| NATO M-ATC | Pigeon | NATO | E | 622 | F G I S | 9833 | 17h | 2007 | heavy background noise | |
| NATO N4 | | NATO | E | 115 | unknown | | 7.5h | 2006 | heavy background noise | |
| Onomastica | | | D Dut E F G Gre I N P S Swe | | | (121000) | | 1995 | only lexicon | |
| PF-STAR | | U. Erlangen | E | 57 | G | 4627 | 3.4h | 2005 | children's speech | |
| Sunstar | | EU | E | 100 | G S I P D | 40000 | | 1992 | parliament speech | |
| TC-STAR | Heuvel | ELDA | E S | unknown | EU countries | | 13h | 2006 | multiple data sets | |
| TED | Lamel | ELDA | E | 40(188) | many | | 10h(47h) | 1994 | Eurospeech 93 | |
| TLTS | | DARPA | A | | E | | 1h | 2004 | | |
| Tokyo-Kikuko | | U. Tokyo | J | 140 | 10 countries | 35000 | | 2004 | proficiency rating | |
| Verbmobil | | U. Munich | E | 44 | G | | 1.5h | 1994 | very spontaneous | |
| VODIS | | EU | F G | 178 | F G | 2500 | | 1998 | about car navigation | |
| WP Arabic | Rocca | LDC | A | 35 | E | 800 | 1h | 2002 | | |
| WP Russian | Rocca | LDC | R | 26 | E | 2500 | 2h | 2003 | | |
| WP Spanish | Morgan | LDC | S | | E | | | 2006 | | |
| WSJ Spoke | | | E | 10 | unknown | 800 | | 1993 | | |