JPS63237098A

JPS63237098A - Voice data base configuration system having multi-layer label

Info

Publication number: JPS63237098A
Application number: JP62072847A
Authority: JP
Inventors: 芳典匂坂; 一哉武田; 尚夫桑原; 滋片桐
Original assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK; A T R SHICHIYOUKAKU KIKO KENKYUSHO KK; ATR AUDITORY VISUAL PERCEPTION; ATR JIDO HONYAKU DENWA
Current assignee: A T R JIDO HONYAKU DENWA KENKYUSHO KK; A T R SHICHIYOUKAKU KIKO KENKYUSHO KK; ATR AUDITORY VISUAL PERCEPTION; ATR JIDO HONYAKU DENWA
Priority date: 1987-03-25
Filing date: 1987-03-25
Publication date: 1988-10-03
Anticipated expiration: 2013-02-04
Also published as: JP2709385B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

[Industrial Field of Application] The present invention relates to a speech database configuration method having multi-layer labels, and in particular to one in which a speech signal waveform is digitized, the speech waveform is segmented phoneme by phoneme on the basis of signal features, and a label is assigned to each phoneme.

[Prior Art and Problems to Be Solved by the Invention] To improve the technologies that perform speech processing, such as speech recognition algorithms, speech synthesis algorithms, and speaker recognition and adaptation algorithms, it is necessary to collect and organize the variations in phonological features that occur under various conditions. For this purpose, a speech database with phonological labels is indispensable.

Conventional speech databases fall broadly into two types: those for evaluating the performance of speech processing devices such as speech recognizers, and those for speech research and development. The former include, for example, speech databases consisting of city-name words and the like, but these are merely recordings of analog speech and are not labeled. The latter, research-oriented speech databases, even when labeled, carry only symbols for phonemes or comparable units, and therefore have the drawback that speech events cannot be selected efficiently.

Therefore, the main object of the present invention is to provide a speech database configuration method in which various speech feature labels are assigned hierarchically, so that the label information can be used to select and extract speech data more efficiently and the database can serve a variety of speech research purposes.

[Means for Solving the Problems] The present invention is a method of configuring a speech database in which a speech signal waveform is digitized, the speech waveform is segmented phoneme by phoneme on the basis of signal features, and a label is assigned to each phoneme. The portion to which phoneme labels are assigned is taken as one layer. Various speech features that reflect actual speech phenomena are described in a plurality of types, and a layer is provided for each feature. A descriptive label is assigned in each layer, within each phoneme or across phonemes, so that the digitized speech waveform is associated with multi-layer labels describing its physical features.

[Operation] In the speech database configuration method having multi-layer labels according to the present invention, various hierarchical speech feature labels are assigned to the digitized speech waveform. The label information can then be used to select and extract speech data more efficiently, and the database can be used for a variety of speech research purposes.

[Embodiment of the Invention] FIG. 1 is a diagram showing an example of the labels in each layer that are assigned to a speech signal in the present invention for each phoneme, within phonemes, and across phonemes. FIG. 2 is a diagram showing the label display method in each layer. FIG. 3 is a diagram showing the notation and symbols used in the event layer. FIG. 4 is a diagram showing the notation and symbols used in the allophonic layer. FIG. 5 is a diagram showing the data format of the label file inside the computer.

The present invention is described below with reference to FIGS. 1 to 5. FIG. 1 shows labeling carried out in correspondence with the waveform, the spectral change rate, and the power of a speech signal. As shown in FIG. 2, the labels consist of a phonetic symbol layer as the first layer, an event layer as the second layer, an allophonic layer as the third layer, a fusion layer as the fourth layer, a vowel center layer as the fifth layer, and a comment layer as the sixth layer.

In the phonetic symbol layer, the uttered speech is segmented phoneme by phoneme using spectral change as a cue, and the Hepburn romanization is divided into vowel parts and consonant parts, each of which is written in the corresponding speech interval. For example, when "atoshimatsu" is uttered, the vowel and consonant parts of each phoneme are separated and "a", "t", "o", "sh", "i", "m", "a", and "tsu" are written. Associating the vowel and consonant parts of the uttered speech with their speech intervals in this way makes it easy to express the linguistic environment. However, where a phoneme boundary cannot be determined because of allophony or fusion, a corresponding symbol is assigned in the second and lower layers.
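
As a minimal sketch of the six-layer structure described above, the following Python fragment models each label as a symbol with a start and an end time and groups the labels by layer. The class names, field names, the millisecond unit, and all time values are illustrative assumptions and are not taken from the specification; only the layer names and the symbol sequence for the example word come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    """The six label layers named in the description."""
    PHONETIC_SYMBOL = 1
    EVENT = 2
    ALLOPHONIC = 3
    FUSION = 4
    VOWEL_CENTER = 5
    COMMENT = 6


@dataclass(frozen=True)
class Label:
    symbol: str
    start_ms: float  # hypothetical unit; the text only speaks of start and end times
    end_ms: float


@dataclass
class MultiLayerLabels:
    # One (initially empty) list of labels per layer.
    layers: dict = field(default_factory=lambda: {layer: [] for layer in Layer})

    def add(self, layer, symbol, start_ms, end_ms):
        if end_ms < start_ms:
            raise ValueError("end must not precede start")
        self.layers[layer].append(Label(symbol, start_ms, end_ms))

    def symbols(self, layer):
        return [label.symbol for label in self.layers[layer]]


# Phonetic symbol layer for "atoshimatsu"; the boundaries are made-up values.
utterance = MultiLayerLabels()
bounds = [0, 80, 130, 210, 300, 370, 430, 520, 640]
for sym, (start, end) in zip(["a", "t", "o", "sh", "i", "m", "a", "tsu"],
                             zip(bounds, bounds[1:])):
    utterance.add(Layer.PHONETIC_SYMBOL, sym, start, end)
```

Keeping every layer as an independent list of timed labels mirrors the point that lower layers may place boundaries where the phonetic symbol layer has none.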

In the event layer, each phoneme interval delimited in the phonetic symbol layer is divided into several parts according to changes in the spectrum, and labels are assigned so as to reflect the actual utterance closely. The labels use the notation shown in FIG. 3. That is, "<" indicates the on-glide into a vowel: the transitional section accompanying a word-initial vowel (including semivowels), in which energy is present in the low band but the formant structure is not yet established. ">" indicates the off-glide from a vowel: the transitional section accompanying a vowel (including the moraic nasal) at the end of a word or before a relatively long silence, in which energy is present in the low band but the formant structure has broken down.

"ゞ〉" indicates the off-glide from a vowel into a voiced consonant, that is, the transitional section in which a vowel part (including the moraic nasal) passes into a voiced consonant part. "1llrlt" is a section in which the spectral pattern is disturbed for some reason.

"cl" and "cl*" denote the closure section accompanying a plosive (or affricate) and the pause section accompanying a geminate consonant; the asterisk indicates the voiced case. "p, t, k, b, d, g" denote the sections of a plosive other than the closure. "nv" is a nasal consonant section, and "pau" is a pause section at a word boundary. "s, h, sh, z, dj, f" are fricative sections, and "w, y" are semivowel sections. A further symbol, illegible in the source text, marks what appears to be the liquid section. "a, i, u, e, o" are vowel sections, "j" is a palatalized (yōon) section, "N" is the moraic nasal section, and "ts, ch" denote the sections of an affricate other than the closure.

To describe the event layer more concretely with reference to FIG. 1: the section up to the point where the first vowel "a" is pronounced is a transitional section showing the rise into the vowel, so "<" is assigned. The section following the vowel "a" is a transitional section accompanying the end of that vowel, so ">" is assigned. The next section is assigned "cl" as the closure accompanying a plosive, and the section after that is assigned "t" as the part of the plosive other than the closure. The next section is the vowel "o", after which "sh" is assigned as a fricative section. The next section is the vowel "i", followed by the nasal consonant section "m" and the vowel "a"; ">" is assigned to the transitional section following the vowel "a". After that, "cl" is assigned as the closure accompanying an affricate, and "ts" is assigned as the part of the affricate other than the closure.
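
The walk-through above can be sketched as an ordered list of event-layer symbols together with a lookup of the coarse category each symbol belongs to. This is only an illustration: the category names, the `EVENT_CATEGORIES` table, and the `categorize` helper are hypothetical, the table lists only symbols that are legible in the text, and treating "m" and "n" as nasal-section symbols is an assumption drawn from the FIG. 1 example.

```python
# Coarse categories for event-layer symbols, following the description of FIG. 3.
# Only symbols that are legible in the text are listed. Treating "m" and "n" as
# nasal-section symbols is an assumption based on the FIG. 1 walk-through.
EVENT_CATEGORIES = {
    "<": "on-glide into vowel",
    ">": "off-glide from vowel",
    "cl": "closure",
    "pau": "pause",
    "N": "moraic nasal",
}
EVENT_CATEGORIES.update(dict.fromkeys(["p", "t", "k", "b", "d", "g"],
                                      "plosive (non-closure part)"))
EVENT_CATEGORIES.update(dict.fromkeys(["m", "n"], "nasal"))
EVENT_CATEGORIES.update(dict.fromkeys(["s", "h", "sh", "z", "dj", "f"], "fricative"))
EVENT_CATEGORIES.update(dict.fromkeys(["w", "y"], "semivowel"))
EVENT_CATEGORIES.update(dict.fromkeys(["a", "i", "u", "e", "o"], "vowel"))
EVENT_CATEGORIES.update(dict.fromkeys(["ts", "ch"], "affricate (non-closure part)"))

# Event-layer labels for "atoshimatsu", in the order given in the walk-through.
EVENTS = ["<", "a", ">", "cl", "t", "o", "sh", "i", "m", "a", ">", "cl", "ts"]


def categorize(events):
    """Return (symbol, category) pairs; an unknown symbol raises KeyError."""
    return [(sym, EVENT_CATEGORIES[sym]) for sym in events]


pairs = categorize(EVENTS)
# Positions of the closure sections within the event sequence.
closures = [i for i, (_, cat) in enumerate(pairs) if cat == "closure"]
```

A lookup of this kind is one way the finer event labels could be used to pick out, for example, every closure section in a recording.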

Below the event layer is the allophonic layer, in which a segment is set up and a symbol assigned wherever allophony occurs, that is, wherever the sound differs from its romanized notation. Two kinds of allophony are described: devoicing and fricativization. Where allophony occurs, the segment runs from the point at which the allophony begins to the point at which it ends, regardless of the boundaries in the phonetic symbol layer. As shown in FIG. 4, the symbol "dv" is assigned to a devoiced section, and "fr" is assigned to a section in which a vowel is fricativized under the influence of the following fricative. In the example shown in FIG. 1, "fr" is assigned between the phonemes "o" and "sh", and "dv" is assigned because the final phoneme "u" is devoiced. Another allophonic phenomenon, which appears to be the nasalization of vowels or voiced plosives, is difficult to judge from the spectrum and is therefore not included in the allophonic layer.

The fourth layer is the fusion layer, which describes continuous portions where successive phonemes have fused and cannot be separated on the spectrogram. The segment boundaries are those of the phonetic symbol layer. In the example shown in FIG. 1, the spectrograms of the last two phonemes, "ts" and "u", run together and cannot be distinguished, so the symbol "tsu" is assigned.
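
Because allophonic segments are set independently of the phonetic symbol boundaries, relating such a segment back to the phonemes it affects amounts to an interval-overlap lookup across layers. The sketch below illustrates this under stated assumptions: the tuple layout `(symbol, start, end)`, the helper names, and every time value are hypothetical and chosen only so that "fr" straddles "o" and "sh" and "dv" falls at the end of the word, as in the FIG. 1 example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share any time."""
    return a_start < b_end and b_start < a_end


def phonemes_under(segment, phonetic_layer):
    """Phonetic symbols whose intervals overlap the given lower-layer segment."""
    _, seg_start, seg_end = segment
    return [sym for sym, start, end in phonetic_layer
            if overlaps(start, end, seg_start, seg_end)]


# (symbol, start, end) in milliseconds; all times are made-up values.
PHONETIC = [("a", 0, 80), ("t", 80, 130), ("o", 130, 210), ("sh", 210, 300),
            ("i", 300, 370), ("m", 370, 430), ("a", 430, 520), ("tsu", 520, 640)]
ALLOPHONIC = [("fr", 190, 230), ("dv", 580, 640)]

affected = {seg[0]: phonemes_under(seg, PHONETIC) for seg in ALLOPHONIC}
```

With these illustrative times, the fricativized segment maps to both "o" and "sh", while the devoiced segment maps only to the final "tsu".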

The fifth layer is the vowel center layer, which records pointers to the centers at which the phonemes segmented in the phonetic symbol layer retain clear vowel characteristics. In the example shown in FIG. 1, pointers are recorded to the centers of the vowels "a", "o", "i", and "a". The sixth layer is the comment layer, which holds comments on phenomena that cannot be described in the first to fifth layers.

Each label layer is associated with the speech waveform as shown in FIG. 5. That is, each label symbol is written between a start value indicating its start time and an end value indicating its end time.

The link to the actual speech waveform is made by specifying symbols that indicate the speaker, the word type, and so on, to retrieve the speech data file; the label is then associated with the speech waveform by means of its start time and end time.
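
A minimal sketch of this association follows: given a label's start and end times, the matching stretch of the digitized waveform is obtained by converting both times to sample indices. The record layout, the millisecond unit, and the function names are assumptions, since the text does not reproduce the FIG. 5 file format; the 20 kHz rate is the sampling rate given in the construction procedure of this description.

```python
SAMPLE_RATE_HZ = 20_000  # sampling rate given in the construction procedure


def ms_to_sample(t_ms, rate_hz=SAMPLE_RATE_HZ):
    """Convert a time in milliseconds to the nearest sample index."""
    return round(t_ms * rate_hz / 1000)


def extract_segment(samples, start_ms, end_ms, rate_hz=SAMPLE_RATE_HZ):
    """Return the samples lying between a label's start and end times."""
    if not 0 <= start_ms <= end_ms:
        raise ValueError("expected 0 <= start <= end")
    return samples[ms_to_sample(start_ms, rate_hz):ms_to_sample(end_ms, rate_hz)]


# Stand-in for one second of digitized speech; real data would be 16-bit samples.
waveform = list(range(20_000))
# Hypothetical label record: (symbol, start time, end time) in milliseconds.
record = ("o", 130.0, 210.0)
segment = extract_segment(waveform, record[1], record[2])
```

Here the 80 ms label maps to 1,600 samples beginning at sample index 2,600.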

FIG. 6 is a flow diagram, from A/D conversion to the input of label data, for configuring a speech database having multi-layer labels according to the present invention.

Next, the method of configuring a speech database having multi-layer labels according to the present invention is described. As for recording conditions, the speech is uttered clearly, word by word, in as quiet an environment as possible, such as a recording studio, and is first PCM-recorded on magnetic tape. Then, offline, the PCM-recorded speech signal on the tape is A/D-converted via a computer workstation at a 20 kHz sampling rate with 16 bits and stored on magnetic disk. The stored speech data is cut out word by word, a 512-point FFT (fast Fourier transform) is applied, spectral analysis is performed with a frame period of 2.5 msec, and the result is printed as a grayscale display on a laser printer.
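
The analysis parameters above imply a hop of 50 samples between frames (20 kHz × 2.5 msec) and a bin spacing of about 39 Hz (20 kHz / 512). The following sketch, using only the Python standard library, shows one way such a short-time spectrum could be computed. The Hann window, the radix-2 FFT routine, the function names, and the 2.5 kHz test tone are all assumptions for illustration; the text states only the sampling rate, the FFT size, and the frame period.

```python
import cmath
import math

SAMPLE_RATE_HZ = 20_000
FFT_SIZE = 512
FRAME_PERIOD_S = 0.0025
HOP = round(SAMPLE_RATE_HZ * FRAME_PERIOD_S)  # samples between frame starts


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def spectrogram(samples):
    """Magnitude spectra of Hann-windowed frames (window choice is assumed)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (FFT_SIZE - 1))
              for i in range(FFT_SIZE)]
    frames = []
    for start in range(0, len(samples) - FFT_SIZE + 1, HOP):
        frame = [s * w for s, w in zip(samples[start:start + FFT_SIZE], window)]
        spectrum = fft(frame)
        frames.append([abs(c) for c in spectrum[:FFT_SIZE // 2 + 1]])
    return frames


# Synthetic 2.5 kHz tone, which falls exactly on bin 64 (64 * 20000 / 512).
tone = [math.sin(2 * math.pi * 2500 * n / SAMPLE_RATE_HZ) for n in range(1012)]
spec = spectrogram(tone)
peak_bin = spec[0].index(max(spec[0]))
```

Each row of `spec` corresponds to one 2.5 msec step of the grayscale display, with 257 magnitude values covering 0 to 10 kHz.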

The result is presented as a sonagram, as shown in FIG. 1.

Phonemes are segmented and labeled by viewing this grayscale display, and the label data is entered from the keyboard. That is, following the spectrum of the speech waveform shown in FIG. 1, labels are assigned for the phonetic symbol layer, event layer, allophonic layer, fusion layer, vowel center layer, and comment layer in accordance with FIGS. 1 to 4 described above. The labeled data is then entered from the keyboard of the computer terminal, and for each label the start time and end time of the section it represents are recorded at the same time, as shown in FIG. 5, thereby associating the labels with the waveform data.

[Effects of the Invention] As described above, according to the present invention, the speech database has not merely superficial labels in the romanized notation of phonemes but a multi-layer label structure that describes actual utterance phenomena in detail, so it can be used for a variety of speech research purposes. For example, in speech recognition it can be applied to the development and evaluation of recognition algorithms and to error analysis; in speech synthesis, to the construction and evaluation of synthesis rules; and in perception, to relating the auditory impression of speech to physical quantities.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing an example of the labels assigned to each layer of a speech signal in the present invention. FIG. 2 is a diagram showing the label display method in each layer. FIG. 3 is a diagram showing the notation and symbols used in the event layer. FIG. 4 is a diagram showing the notation and symbols used in the allophonic layer. FIG. 5 is a diagram showing the data format of the label file inside the computer. FIG. 6 is a flow diagram, from A/D conversion to the input of label data, for configuring a speech database having multi-layer labels according to the present invention.

Patent Applicant: 株式会社エイ・ティ・アール自動翻訳電話研究所 (A T R JIDO HONYAKU DENWA KENKYUSHO KK); 株式会社エイ・ティ・アール

FIG. 3, Note 1: Long vowels are written with a single vowel letter.

Claims

[Claims]

(1) A speech database configuration method having multi-layer labels, for a speech database in which a speech signal waveform is digitized, the speech waveform is segmented phoneme by phoneme on the basis of signal features, and a label is assigned to each phoneme, characterized in that the portion to which phoneme labels are assigned is taken as one layer, various speech features that reflect actual speech phenomena are described in a plurality of types, a layer is provided for each feature, a descriptive label is assigned in each layer within each phoneme or across phonemes, and the digitized speech waveform is associated with the multi-layer labels describing its physical features.

(2) The speech database configuration method having multi-layer labels according to claim 1, wherein the digitized speech signal waveform is stored in a speech data file and the labels are stored in a label file, each label is given a value corresponding to the storage address of the respective phoneme in the speech data file, and the speech data file and the label file are linked by associating that value with the speech waveform.

(3) The speech database configuration method having multi-layer labels according to claim 1, wherein one of the multi-layer labels includes a phonetic symbol layer in which each phoneme unit is written in Roman letters.

(4) The speech database configuration method having multi-layer labels according to claim 3, wherein one of the multi-layer labels includes an event layer in which each section delimited in the phonetic symbol layer is divided into a plurality of parts according to changes in phonetic features, and labels are assigned so as to reflect the actual utterance closely.

(5) One of the multi-layer labels includes an allophonic layer that describes sections of devoicing and fricativization.

(6) The speech database configuration method having multi-layer labels according to claim 3, wherein one of the multi-layer labels includes a fusion layer that describes continuous portions in which successive phonemes have fused and cannot be separated.

(7) The speech database configuration method having multi-layer labels according to claim 1, wherein one of the multi-layer labels includes a vowel center layer that records a pointer indicating the center of a vowel.

(8) The speech database configuration method having multi-layer labels according to any one of Items 7 to 7, wherein one of the multi-layer labels includes a comment layer in which comments are written on phenomena that cannot be described in the phonetic symbol layer, the event layer, the allophonic layer, the fusion layer, or the vowel center layer.