JPH09319395A

JPH09319395A - Voice data learning device in discrete word voice recognition system

Info

Publication number: JPH09319395A
Application number: JP8315731A
Authority: JP
Inventors: Shigeru Kashiwagi; 繁柏木
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1996-03-26
Filing date: 1996-11-27
Publication date: 1997-12-12

Abstract

PROBLEM TO BE SOLVED: To allow a discrete word voice recognition system to operate without loss of word recognition accuracy even on voice data obtained from voice input devices with different characteristics. SOLUTION: Voice data are input from a first voice input device 1 to a phoneme recognition configuration unit 11. A second voice input device 2 has characteristics different from those of the first voice input device 1. Voice data from the second voice input device 2 are input to a data input unit 21 and then to a feature extraction unit 22. After frequency analysis in the feature extraction unit 22, the voice data are input to an automatic labeling unit 23. The automatic labeling unit 23 creates data in which phoneme labels are attached to the voice data, based on the input voice data and a phoneme composition table input together with the voice data. A learning data unit 24 for training the neural network of a phoneme recognition unit 11c is created from these data.

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a voice data learning device in a discrete word voice recognition system consisting of a phoneme recognition unit based on a neural network and a word recognition unit based on DTW.

【０００２】[0002]

[Prior Art] FIG. 5 shows an outline of a discrete word voice recognition system. In FIG. 5, reference numeral 11 denotes a phoneme recognition configuration unit and 12 a word recognition unit. The phoneme recognition configuration unit 11 is configured as follows. Reference numeral 11a denotes a data input unit that receives voice data from a voice input device such as a telephone. The voice data input to the data input unit 11a are supplied to a feature extraction unit 11b, where the useful data are extracted from the voice data and frequency-analyzed under the conditions shown in Table 1.

【０００３】[0003]

【表１】 [Table 1]

[0004] The spectrum sequence obtained from the frequency analysis is input to the phoneme recognition unit 11c and classified into 23 phonemes. As shown in detail in FIG. 6, the phoneme recognition unit 11c is a neural network with duplicated outputs. The network consists of an input layer, a hidden layer, and an output layer. At each time step, a spectrum of five frames is input to the input layer, and the values of the output layer units (output units) indicate which phoneme the center frame of those five corresponds to. Because the output units are duplicated, two units are assigned to each phoneme category. The two units with the largest output values are selected, and the phonemes they correspond to are taken as the first-ranked and second-ranked phoneme candidates. Consequently, the same phoneme may in some cases be output as both the first-ranked and the second-ranked candidate. The phoneme strings below are an example of a first-ranked phoneme candidate string P₁ and a second-ranked phoneme candidate string P₂.

[0005]
P₁: --s-zzzieegrrooaoo--
P₂: ss-sssseiiyrwwaoaapp
The five-frame spectrum input to the phoneme recognition unit 11c is shifted by one frame at each time step.
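The input windowing and candidate selection described in [0004] and [0005] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the layout of the duplicated output layer (units 2k and 2k+1 both belonging to phoneme k) is an assumption, and so is the choice to skip the first and last two frames, since the text does not say how the edges are handled.

```python
def five_frame_windows(frames):
    """Yield (center_index, window) for each frame that has two frames of
    context on both sides. The window advances one frame per time step."""
    for center in range(2, len(frames) - 2):
        yield center, frames[center - 2:center + 3]


def top_two_candidates(outputs, phonemes):
    """Return the phonemes of the two output units with the largest values.

    `outputs` holds 2 * len(phonemes) values. Units 2k and 2k+1 are assumed
    to both belong to phonemes[k] (hypothetical layout of the duplicated
    output layer).
    """
    ranked = sorted(range(len(outputs)), key=lambda u: outputs[u], reverse=True)
    first_unit, second_unit = ranked[0], ranked[1]
    return phonemes[first_unit // 2], phonemes[second_unit // 2]


# Both units of "s" fire strongly, so "s" is the first and the second candidate.
candidates = top_two_candidates([0.1, 0.2, 0.05, 0.0, 0.9, 0.8], ["a", "i", "s"])
```

Because each phoneme owns two units, the two largest outputs can belong to the same phoneme, which is why the same phoneme may appear in both candidate strings.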

[0006] The first-ranked and second-ranked phoneme candidate strings P₁ and P₂ output by the phoneme recognition unit 11c are supplied to the word recognition unit 12. There they are matched against the word templates in a dictionary 13 by the DTW method (time normalization method), and the most similar word is output as the result. The computation in the word recognition unit 12 proceeds as shown in FIG. 7. The template phoneme string P_t is laid out along the vertical axis, and the first-ranked and second-ranked phoneme candidate strings P₁ and P₂ obtained by the phoneme recognition unit 11c along the horizontal axis. At each grid point, a local score is determined according to equation (1) below: "0" if the template phoneme equals the first-ranked phoneme candidate, "1" if it equals the second-ranked phoneme candidate, and "2" if it equals neither.

【０００７】[0007]

【数１】 [Equation 1]
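The image of equation (1) is not reproduced in this text. Based only on the verbal description in [0006], the local score would take roughly the following form. The index convention is an assumption: here i is taken to index the input candidate strings (horizontal axis) and j the template phoneme string (vertical axis); the original equation may assign them the other way round.

```latex
d(i,j) =
\begin{cases}
0 & \text{if } P_t(j) = P_1(i) \\
1 & \text{if } P_t(j) \neq P_1(i) \text{ and } P_t(j) = P_2(i) \\
2 & \text{otherwise}
\end{cases}
```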

[0008] The scores are then accumulated step by step under the constraints of the DTW score formula shown in equation (2) below, and the accumulated score at the final point (the upper-right corner of FIG. 7) is taken as the similarity score.

【０００９】[0009]

【数２】 [Equation 2]

[0010] This similarity score is computed for every template in the dictionary, and the template with the smallest score is output as the recognition result. In equation (1), P_t denotes the template phoneme string; in equations (1) and (2), d(i,j) and g(i,j) denote the local score and the accumulated score at grid point (i,j), respectively.
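The word matching of [0006] to [0010] can be sketched as below. The local score follows the 0/1/2 rule stated in the text. Equation (2) is not reproduced here, so the accumulation step is an assumption: the sketch uses the basic three-neighbor DTW recurrence g(i,j) = d(i,j) + min(g(i-1,j), g(i-1,j-1), g(i,j-1)), and the patent's actual path constraints may differ. All function names are hypothetical, and the inputs are assumed to be non-empty.

```python
def local_score(template_phoneme, first, second):
    """0 if the template phoneme equals the first-ranked candidate,
    1 if it equals the second-ranked candidate, otherwise 2."""
    if template_phoneme == first:
        return 0
    if template_phoneme == second:
        return 1
    return 2


def dtw_score(template, p1, p2):
    """Accumulated score at the final grid point (end of input, end of template).

    Assumed recurrence (not taken from the patent):
    g(i,j) = d(i,j) + min(g(i-1,j), g(i-1,j-1), g(i,j-1)).
    """
    n, m = len(p1), len(template)
    inf = float("inf")
    g = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = local_score(template[j], p1[i], p2[i])
            if i == 0 and j == 0:
                g[i][j] = d
                continue
            best = min(
                g[i - 1][j] if i > 0 else inf,
                g[i - 1][j - 1] if i > 0 and j > 0 else inf,
                g[i][j - 1] if j > 0 else inf,
            )
            g[i][j] = d + best
    return g[n - 1][m - 1]


def recognize(dictionary, p1, p2):
    """Return the dictionary word whose template has the smallest score."""
    return min(dictionary, key=lambda word: dtw_score(dictionary[word], p1, p2))


# Toy dictionary mapping each word to its template phoneme string.
toy_dictionary = {"sai": "sai", "asa": "asa"}
result = recognize(toy_dictionary, "ssaai", "szaii")
```

A smaller score means a closer match, so the recognizer takes the minimum over all templates, as stated in [0010].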

【００１１】[0011]

[Problem to Be Solved by the Invention] In the prior art described above, phoneme recognition is performed by a phoneme recognition unit consisting of a neural network. This phoneme recognition unit is trained by the backpropagation method, using voice data input from a voice input device such as a telephone as training data. In practice, however, the training captures not only the frequency characteristics of each phoneme but also the transmission frequency characteristics specific to the voice input device and the frequency characteristics of additive noise. Ideally, therefore, the deployed system should use a device with the same characteristics as the voice input device used to record the training data. In reality, a voice input device with different characteristics sometimes has to be used at deployment, for reasons such as miniaturization or system configuration, and as a result the word recognition accuracy on the voice data drops.

[0012] The present invention has been made in view of the above circumstances. Its object is to provide a voice data learning device in a discrete word voice recognition system that can operate without loss of word recognition accuracy even on voice data obtained from a voice input device with different characteristics.

【００１３】[0013]

[Means for Solving the Problem] To achieve the above object, the first invention applies to a voice recognition system that frequency-analyzes word voice data input from a voice input device; inputs the result to an output-multiplexed neural network to perform phoneme recognition and obtain first-ranked and second-ranked phoneme candidates; computes the similarity between the recognized phoneme candidate strings and each template in a dictionary holding the phoneme patterns of the vocabulary to be recognized, by taking the similarity between each template phoneme and the first-ranked and second-ranked candidates in the recognized phoneme candidate strings as a local score and accumulating the local scores by the DTW method into an overall similarity score; and outputs, as the recognition result, the word with the smallest similarity score among all the vocabulary to be recognized. The first invention is characterized in that an automatic labeling unit is provided that obtains training data by attaching phoneme labels to voice data input from a voice input device whose characteristics differ from those of the above voice input device, and in that the neural network is trained with the training data obtained by the automatic labeling unit.

[0014] The second invention is characterized in that the automatic labeling unit comprises a trainable phoneme recognition unit based on a neural network, a DTW-based optimal phoneme boundary position detection unit, and a phoneme composition table indicating which phonemes make up the uttered voice data.

[0015] The third invention is characterized in that the training data contain every type of phoneme and as many phoneme chains as possible, so that the recognition rate does not drop even for an arbitrarily chosen vocabulary.

[0016] The fourth invention is characterized in that, when the neural network is trained, voice data input from the original voice input device are also used, so that the recognition rate does not drop for an arbitrarily chosen vocabulary, whether the input comes from the original voice input device or from a voice input device with different characteristics.

[0017] The fifth invention is characterized in that, in any of the first to fourth inventions, silence data input from the voice input device with different characteristics are learned by a pre-trained neural network that has already been trained on voice data, and the resulting learning data are learned by the automatic labeling unit.

[0018] The sixth invention is characterized in that, in the fifth invention, silence data from the voice input device with different characteristics are learned by a pre-trained neural network that has already been trained on voice data, the resulting learning data are input to the automatic labeling unit, and data recognized by the output-multiplexed neural network are also input to the automatic labeling unit for learning.

【００１９】[0019]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings; parts identical to those in FIG. 5 are given the same reference numerals. FIG. 1 is a system configuration diagram of a first embodiment of the invention. In FIG. 1, voice data are input from a first voice input device 1 to the phoneme recognition configuration unit 11. A second voice input device 2 is a device different from the first voice input device 1, which was used to record the training data for the system of FIG. 5. Voice data from the second voice input device 2 are input to a data input unit 21 and then to a feature extraction unit 22. After frequency analysis in the feature extraction unit 22, the voice data are input to an automatic labeling unit 23, described in detail later. From the input voice data and a phoneme composition table (described later) input together with them, the automatic labeling unit 23 creates data in which phoneme labels are attached to the voice data. From these data, a learning data unit 24 for training the neural network of the phoneme recognition unit 11c is created. Performing additional training a suitable number of times with the phoneme-labeled voice data obtained in this way yields a phoneme recognition unit 11c that also gives good recognition results for input from the desired second voice input device 2. The data input unit 21, the feature extraction unit 22, the automatic labeling unit 23, and the learning data unit 24 together constitute a recognition support system 25.

[0020] The automatic labeling unit 23 is described next with reference to FIG. 2. In FIG. 2, the automatic labeling unit 23 has a trainable phoneme recognition unit 30 based on a neural network and a DTW-based optimal phoneme boundary position detection unit 31. It further comprises a phoneme composition table 32, which indicates which phonemes make up the uttered voice data, and the neural network phoneme recognition unit 11c trained on voice data captured from the first voice input device 1.

[0021] For voice data input to the automatic labeling unit 23 configured in this way, only the phonemes present in the phoneme composition table for the utterance are selected, in descending order of rank, from the phoneme recognition results of the already trained phoneme recognition unit 11c. The selected results are converted into a phoneme string, and on that basis the optimal phoneme boundary position detection unit 31 sets the phoneme boundaries. The phoneme data delimited by those boundaries are then used to train a neural network in its initial (untrained) state several times. After this training, the optimal phoneme boundary position detection unit 31 sets the phoneme boundaries again, based on the phoneme recognition results produced by that neural network. The cycle from training the initial-state neural network to setting the phoneme boundaries is repeated until the boundary positions stop changing or a specified number of iterations is reached. This yields the optimal phoneme boundaries, from which phoneme labels can be attached to the data frame by frame.
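The boundary-setting step of [0021] can be illustrated with a small dynamic-programming alignment. This is only a sketch under stated assumptions: the patent says the boundary detection unit 31 is based on DTW but does not give its formula, so the cost used here (the 0/1/2 local score from the word recognizer) and the function name `align_labels` are hypothetical. The retraining loop around this step is not shown.

```python
def align_labels(phoneme_sequence, p1, p2):
    """Assign one phoneme label per frame, preserving the order of
    `phoneme_sequence` (taken from the phoneme composition table).

    Each phoneme covers at least one frame. A frame costs 0 if its label
    matches the first-ranked candidate, 1 if it matches the second-ranked
    candidate, and 2 otherwise (assumed cost, not from the patent).
    """
    n, m = len(p1), len(phoneme_sequence)
    if m == 0 or n < m:
        raise ValueError("need at least one frame per phoneme")
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]  # 0: stay on same phoneme, 1: advance

    def d(i, j):
        ph = phoneme_sequence[j]
        return 0 if ph == p1[i] else 1 if ph == p2[i] else 2

    cost[0][0] = d(0, 0)
    for i in range(1, n):
        for j in range(min(i + 1, m)):
            stay = cost[i - 1][j]
            advance = cost[i - 1][j - 1] if j > 0 else inf
            if advance < stay:
                cost[i][j], back[i][j] = d(i, j) + advance, 1
            else:
                cost[i][j], back[i][j] = d(i, j) + stay, 0

    # Trace back from the last frame and last phoneme to recover the labels.
    labels, j = [], m - 1
    for i in range(n - 1, -1, -1):
        labels.append(phoneme_sequence[j])
        j -= back[i][j]
    return labels[::-1]


frame_labels = align_labels(["s", "a", "i"], "szaai", "ssiia")
```

The positions where consecutive frame labels change are the phoneme boundaries. In the procedure described above, these boundaries would be recomputed after each round of training until they stop moving or the iteration limit is reached.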

[0022] Depending on the specifications of the voice recognition system to be realized, the additional training scheme above can be applied in the following three ways. The time needed for data recording and the required storage capacity increase in the order Example 1 < Example 2 < Example 3, so one of them should be chosen according to what the target recognition system requires.

[0023] Example 1: The vocabulary to be recognized is fixed. The words in the recognition vocabulary, uttered by several speakers, are input to the second voice input device 2, and the voice data output from the device 2 are supplied to the automatic labeling unit 23 via the data input unit 21 and the feature extraction unit 22. The automatic labeling unit 23 attaches phoneme labels to the input voice data. From the resulting data, the learning data unit 24 for training the neural network of the phoneme recognition unit 11c is created. The pre-trained neural network, already trained on voice data input from the first voice input device 1, is then trained on this learning data unit 24. The amount of training is about several passes over the complete set of training data. Applying the neural network obtained by this training to the phoneme recognition unit 11c gives a voice recognition system that also works well with the second voice input device 2.

[0024] Example 2: The vocabulary to be recognized is arbitrary. Voice data of words uttered by several speakers are collected through the second voice input device 2. The words uttered should contain every type of phoneme and, in addition, as many types of phoneme chains as possible. Training data for the neural network are created from the voice data thus obtained. The pre-trained neural network, already trained on voice data input from the first voice input device 1, is trained on these training data. The amount of training is about several passes over the complete set of training data. This training gives a voice recognition system that works well for an arbitrarily chosen recognition vocabulary even with input from the second voice input device 2.
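The requirement that the recorded words contain every phoneme type and many phoneme chains can be checked mechanically before recording. The sketch below is illustrative only: it assumes that a "phoneme chain" means a pair of adjacent phonemes within a word, and the function name and the toy word list are hypothetical.

```python
def coverage(words, phoneme_inventory):
    """Return (missing phonemes, set of distinct adjacent phoneme pairs).

    `words` maps each word to its phoneme list, in the way a phoneme
    composition table would.
    """
    seen = set()
    chains = set()
    for phonemes in words.values():
        seen.update(phonemes)
        chains.update(zip(phonemes, phonemes[1:]))
    missing = set(phoneme_inventory) - seen
    return missing, chains


# Toy word list; "k" never occurs, so the list is not yet sufficient.
toy_words = {"asa": ["a", "s", "a"], "isu": ["i", "s", "u"]}
missing, chains = coverage(toy_words, ["a", "i", "u", "s", "k"])
```

A word list would be extended until `missing` is empty, and candidate lists could then be compared by the number of distinct chains they contain.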

[0025] Example 3: The vocabulary to be recognized is arbitrary, and the system should also keep working with the original voice input device. Voice data of words uttered by several speakers are collected through the second voice input device 2. The words uttered should contain every type of phoneme and, in addition, as many types of phoneme chains as possible. Training data for the neural network are created from the voice data thus obtained together with the voice data that were obtained through the first voice input device 1, the original voice input device, and already used to train the neural network of the phoneme recognition unit. The pre-trained neural network, already trained on voice data input from the first voice input device 1, is trained on these training data. The amount of training is about several passes over the complete set of training data. This training gives a voice recognition system that works well for an arbitrarily chosen recognition vocabulary with voice input from either the first voice input device 1 or the second voice input device 2.

[0026] The results of a recognition experiment applying Example 1 are shown next. The baseline is a voice recognition system trained on data captured from the first voice input device 1, in which 16 male speakers each uttered 101 words twice. For this system, a recognition vocabulary of 78 words was chosen that contains many phonemically similar words and is therefore a difficult recognition task (with an easy task, the effect of the difference between voice input devices is small). The results are shown in Table 2.

【００２７】[0027]

【表２】 [Table 2]

[0028] With input from the first voice input device 1, the word recognition rate for the 78-word vocabulary was 87.50% when three male speakers uttered each word twice. When the same three speakers used the second voice input device 2, which differs from the first voice input device 1, the word recognition rate fell to 77.35%. Following Example 1, voice data of three male speakers (different from the three used to measure the recognition rates above) uttering the 78 words twice each were captured from the second voice input device 2, phoneme labels were attached by the automatic labeling system, and additional training was performed. As a result, the word recognition rate improved to 86.75%.
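The size of this improvement is easier to judge as a reduction in error rate. The short calculation below uses only the recognition rates quoted in [0028]; the helper names are illustrative.

```python
def error_rate(recognition_rate_percent):
    """Word error rate in percent, given a recognition rate in percent."""
    return 100.0 - recognition_rate_percent


def relative_error_reduction(before_percent, after_percent):
    """Fraction of the original errors removed by the additional training."""
    before = error_rate(before_percent)
    after = error_rate(after_percent)
    return (before - after) / before


# Second voice input device: 77.35% before, 86.75% after additional training.
reduction = relative_error_reduction(77.35, 86.75)
print(f"{reduction:.1%}")  # prints 41.5%
```

In other words, the error rate on the second device falls from 22.65% to 13.25%, leaving the result 0.75 percentage points below the 87.50% obtained with the first device.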

[0029] Next, a second embodiment of the invention is described with reference to FIG. 3. Parts identical to those in FIGS. 1 and 2 are given the same reference numerals, and their detailed description is omitted. FIG. 3 is a system configuration diagram of the second embodiment. In FIG. 3, voice data are input from the first voice input device 1 to the phoneme recognition configuration unit 11. The second voice input device 2 is a device different from the first voice input device 1, which was used to record the training data for the system of FIG. 5. Voice data from the second voice input device 2 are input to an input unit 21 (described in detail later). At the same time, silence data (data outside the speech intervals uttered by a person) from the second voice input device 2 are input to an input unit 26 and then to a feature extraction unit 27. After frequency analysis in the feature extraction unit 27, the silence data are learned by a pre-trained neural network 28 that was trained beforehand on voice data captured from the first voice input device 1. The amount of training is about several passes over the complete set of training data in the neural network 28. The output of the neural network 28 trained in this way is input to the neural-network-based trainable phoneme recognition unit 30 of the automatic labeling unit 23, automatic labeling is executed, and data in which phoneme labels are attached to the silence data are created.

[0030] Voice data from the second voice input device 2 are input to the input unit 21 and then to the feature extraction unit 22. After frequency analysis in the feature extraction unit 22, the voice data are input to the automatic labeling unit 23, which creates data in which phoneme labels are attached to the voice data. From the data obtained by attaching phoneme labels to the silence data and the voice data, the learning data unit 24 for training the neural network of the phoneme recognition unit 11c is created. Performing additional training a suitable number of times with the phoneme-labeled data obtained in this way yields a phoneme recognition unit 11c that also gives good recognition results for input from the desired second voice input device 2. The input unit 21, the input unit 26, the feature extraction unit 22, the feature extraction unit 27, the neural network 28, the automatic labeling unit 23, and the learning data unit 24 together constitute a recognition support system 29. The voice recognition configuration unit 11 is configured as in the first embodiment, and its detailed description is omitted.

[0031] In a voice recognition system configured as above, voice data of words uttered by several speakers are input through the second voice input device 2 to create data. The words uttered should contain every type of phoneme and as many types of phoneme chains as possible. Silence data are input through the second voice input device 2 at the same time as the voice data to create data, and the neural network 28 is trained on those data. Training data for the neural network of the phoneme recognition configuration unit 11 are created from the voice data obtained in this way, the neural network 28 trained on the silence data, and the voice data obtained through the first voice input device 1. The pre-trained neural network, trained on the voice data obtained through the first voice input device 1, is additionally trained on these training data. The amount of training is about several passes over the complete set of training data. With this training, the phoneme recognition unit 11c of the voice recognition configuration unit 11 works well on voice data from either the first voice input device 1 or the second voice input device 2, and the word recognition rate can be improved.

[0032] The results of a recognition experiment applying the second embodiment are shown next. The baseline is a voice recognition configuration unit trained on data captured from the first voice input device 1, in which 16 male speakers each uttered 101 words twice. Data captured from the second voice input device 2, in which three male speakers each uttered 101 words twice, were used as the data for additional training, and recognition experiments were carried out with the methods of the first and second embodiments. The speakers in the recognition experiment were divided into training speakers and test speakers. The results are shown in Table 3.

【００３３】[0033]

【表３】 [Table 3]

【００３４】[0034] For input from the first voice input device 1, the word recognition rate on the 101 words with the second-embodiment method was 98.51% for learning speakers and 97.85% for test speakers. With the first-embodiment method it was 98.68% for learning speakers and 97.52% for test speakers. For input from the second voice input device 2, the word recognition rate on the 101 words with the second-embodiment method was 98.18% for learning speakers and 96.53% for test speakers. With the first-embodiment method it was 73.43% for learning speakers and 50.00% for test speakers. These results show that the second-embodiment method achieves good voice recognition for input from either the first voice input device 1 or the second voice input device 2.
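To make the comparison in the paragraph above easier to read, the following short sketch computes the difference in percentage points between the second-embodiment and first-embodiment methods. The recognition rates are the ones quoted in the text; the dictionary layout and the helper name `gain` are assumptions made for this illustration.

```python
# Word recognition rates (%) quoted in the paragraph above.
rates = {
    ("device 1", "second embodiment"): {"learning": 98.51, "test": 97.85},
    ("device 1", "first embodiment"): {"learning": 98.68, "test": 97.52},
    ("device 2", "second embodiment"): {"learning": 98.18, "test": 96.53},
    ("device 2", "first embodiment"): {"learning": 73.43, "test": 50.00},
}


def gain(device, speakers):
    """Percentage-point difference: second embodiment minus first embodiment."""
    second = rates[(device, "second embodiment")][speakers]
    first = rates[(device, "first embodiment")][speakers]
    return round(second - first, 2)


for device in ("device 1", "device 2"):
    for speakers in ("learning", "test"):
        print(f"{device}, {speakers} speakers: {gain(device, speakers):+.2f} points")
```

The differences for the first voice input device are well under one point in either direction, while the differences for the second voice input device are tens of points, which is the contrast the paragraph draws.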

【００３５】[0035] Next, a third embodiment of the present invention will be described with reference to FIG. 4. Parts identical to those in FIGS. 2 and 3 are given the same reference numerals, and their detailed description is omitted. FIG. 4 shows a voice recognition system with substantially the same configuration as the one shown in FIG. 3. Here, the phoneme recognition unit 11c of the phoneme recognition configuration unit 11 has been trained on the voice data input from the second voice input device, the data obtained by having the already-trained neural network learn the silent data, and the voice data input from the first voice input device. The phoneme recognition output of the phoneme recognition unit 11c is supplied to the automatic labeling unit 23, the neural-network-based learning-type phoneme recognition unit 30 of the automatic labeling unit 23 performs automatic labeling, and the phoneme recognition unit 11c is trained again. This training consists of several passes over the entire learning data. As a result, the phoneme recognition unit 11c of the phoneme recognition configuration unit 11 operates well on voice data from both the first voice input device 1 and the second voice input device 2, and the word recognition rate can be improved.

【００３６】[0036] In operating the voice recognition system configured as described above, voice data of words uttered by a plurality of speakers is input through the second voice input device 2 to create data. The words uttered should contain all types of phonemes and, preferably, as many types of phoneme chains as possible. Learning data for training the neural network of the phoneme recognition unit 11c of the phoneme recognition configuration unit 11 is created from the voice data obtained in this way, the silent data, and the voice data obtained through the first voice input device 1. The already-trained neural network, which was trained on the voice data obtained through the first voice input device 1, is additionally trained on this learning data, with several passes over the entire learning data. The resulting data of the phoneme recognition unit 11c of the phoneme recognition configuration unit 11 is then applied again to the learning-type phoneme recognition unit 30 of the automatic labeling unit 23 of the recognition assistance system 29, automatic labeling is performed, and the learning-type phoneme recognition unit 30 is trained to obtain learning data. That learning data is then learned by the phoneme recognition unit 11c of the phoneme recognition configuration unit 11, again with several passes over the entire learning data. As a result, the phoneme recognition unit 11c of the phoneme recognition configuration unit 11 operates well on voice data from both the first voice input device 1 and the second voice input device 2, and the word recognition rate can be improved.
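The alternation between automatic labeling and retraining in the third embodiment can be sketched as a small loop. This is a hedged illustration only: the function `relabel_and_retrain` and the toy stand-ins `toy_label` and `toy_train` are hypothetical, the "recognizer" is reduced to a single energy threshold, and the labeling step here does not use the phoneme configuration table or DTW boundary detection that the automatic labeling unit is described as having.

```python
def relabel_and_retrain(recognizer, utterances, label_fn, train_fn, rounds=2):
    """Alternate automatic labeling and retraining for a fixed number of rounds.

    label_fn(recognizer, utterances) -> list of (frame, label) pairs
    train_fn(recognizer, labeled)    -> updated recognizer
    """
    history = []
    for _ in range(rounds):
        labeled = label_fn(recognizer, utterances)   # labeling with the current recognizer
        recognizer = train_fn(recognizer, labeled)   # retraining on the fresh labels
        history.append(labeled)
    return recognizer, history


def toy_label(threshold, frames):
    """Stand-in labeler: frames below the energy threshold count as silence."""
    return [(f, "sil" if f < threshold else "a") for f in frames]


def toy_train(threshold, labeled):
    """Stand-in trainer: move the threshold midway between the two class means."""
    sil = [f for f, label in labeled if label == "sil"]
    voiced = [f for f, label in labeled if label == "a"]
    if not sil or not voiced:
        return threshold  # nothing to separate, keep the old recognizer
    return (sum(sil) / len(sil) + sum(voiced) / len(voiced)) / 2


# Invented frame energies from the second input device; 0.3 is the assumed
# threshold learned from the first device.
frames = [0.1, 0.2, 0.8, 0.9, 0.45]
final_threshold, history = relabel_and_retrain(0.3, frames, toy_label, toy_train, rounds=2)
```

The default of two rounds mirrors the experiment reported for the third embodiment, in which the data of the phoneme recognition unit 11c was applied to the phoneme recognition unit 30 twice.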

【００３７】[0037] Next, the results of a recognition experiment applying the third embodiment are described. A voice recognition system was obtained by training on data, captured through the first voice input device 1, in which 16 male speakers each uttered 101 words twice. Data captured through the second voice input device 2, in which three male speakers each uttered 101 words twice, was then used as additional learning data, and recognition experiments were carried out with the methods of the first and third embodiments of the present invention. The speakers in the experiment were divided into learning speakers and test speakers. For the third-embodiment method, the data of the phoneme recognition unit 11c of the phoneme recognition configuration unit 11 was applied twice to the phoneme recognition unit 30 of the automatic labeling unit 23 of the recognition assistance system. The results are shown in Table 4.

【００３８】[0038]

【表４】 [Table 4]

【００３９】[0039] For input from the first voice input device 1, the word recognition rate on the 101 words with the third-embodiment method was 98.35% for learning speakers and 97.19% for test speakers. With the first-embodiment method it was 98.68% for learning speakers and 97.52% for test speakers. For input from the second voice input device 2, the word recognition rate on the 101 words with the third-embodiment method was 98.51% for learning speakers and 96.04% for test speakers. With the first-embodiment method it was 73.43% for learning speakers and 50.00% for test speakers. These results show that the third-embodiment method achieves good voice recognition for input from either the first voice input device 1 or the second voice input device 2.
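One way to read the figures for the second voice input device is as a relative reduction in word error rate, taking the error rate as 100 minus the recognition rate. The sketch below does this arithmetic for the rates quoted above; the function name `error_reduction` is an assumption, and the patent itself reports only recognition rates.

```python
def error_reduction(baseline_rate, improved_rate):
    """Relative reduction (%) in word error rate when moving from baseline to improved.

    Both arguments are word recognition rates in percent.
    """
    baseline_error = 100.0 - baseline_rate
    improved_error = 100.0 - improved_rate
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Second voice input device: first-embodiment rate versus third-embodiment rate.
learning_reduction = error_reduction(73.43, 98.51)
test_reduction = error_reduction(50.00, 96.04)

print(f"learning speakers: {learning_reduction:.2f}% fewer word errors")
print(f"test speakers: {test_reduction:.2f}% fewer word errors")
```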

【００４０】[0040]

[Effects of the Invention] As described above, according to the present invention, even when the voice input device is changed, the system operates well without any drop in word recognition accuracy.

[Brief description of drawings]

FIG. 1 is a system configuration diagram showing a first embodiment of the present invention.

FIG. 2 is a processing configuration diagram showing details of the automatic labeling unit.

FIG. 3 is a system configuration diagram showing a second embodiment of the present invention.

FIG. 4 is a system configuration diagram showing a third embodiment of the present invention.

FIG. 5 is a configuration diagram showing an outline of a discrete word voice recognition system.

FIG. 6 is an explanatory diagram showing the configuration of the phoneme recognition unit (neural network).

FIG. 7 is an explanatory diagram showing the configuration of the word recognition unit.

[Explanation of symbols]

11: phoneme recognition configuration unit
11a, 21, 26: data input unit
11b, 22, 27: feature extraction unit
11c: phoneme recognition unit
12: word recognition unit
13: dictionary
23: automatic labeling unit
25, 29: recognition assistance system
24: learning data
28: neural network

Claims

[Claims]

1. A voice data learning device in a discrete word voice recognition system, the voice recognition system being one in which word voice data input from a voice input device is frequency-analyzed; the result is input to an output-multiplexed neural network for phoneme recognition to obtain first-rank and second-rank recognized phoneme candidates; the similarity between the recognized phoneme candidate sequence and a template in a dictionary holding the phoneme patterns of the vocabulary to be recognized is obtained by taking, as a local score, the similarity score between a phoneme in the template and the first-rank and second-rank candidates in the recognized phoneme candidate sequence, and accumulating the local scores by the DTW method into an overall similarity score; and the word with the smallest overall similarity score is output as the recognition result, the voice data learning device being characterized in that an automatic labeling unit is provided which obtains learning data by attaching phoneme labels to voice data input from a voice input device having characteristics different from those of said voice input device, and in that the neural network is made to learn the learning data obtained by the automatic labeling unit.

2. The voice data learning device in a discrete word voice recognition system according to claim 1, characterized in that the automatic labeling unit comprises a learning-type phoneme recognition unit using a neural network, a DTW-based optimal phoneme boundary position detection unit, and a phoneme configuration table indicating which phonemes the uttered voice data is composed of.

3. The voice data learning device in a discrete word voice recognition system according to claim 1, characterized in that the learning data includes all types of phonemes and as many types of phoneme chains as possible, so that the recognition rate does not decrease even for an arbitrarily set vocabulary.

4. The voice data learning device in a discrete word voice recognition system according to claim 1, characterized in that, when the neural network is trained, the voice data input from the original voice input device is also learned, so that for an arbitrarily set vocabulary the recognition rate does not decrease whether the input comes from the original voice input device or from a voice input device having characteristics different from those of the original voice input device.

5. The voice data learning device in a discrete word voice recognition system according to any one of claims 1 to 4, characterized in that silent data input from a voice input device having characteristics different from those of said voice input device is learned by an already-trained neural network that has been trained in advance on voice data, and the learned data is learned by the automatic labeling unit.

6. The voice data learning device in a discrete word voice recognition system according to claim 5, characterized in that silent data from a voice input device having characteristics different from those of said voice input device is learned by an already-trained neural network that has been trained in advance on voice data, the learned data is input to the automatic labeling unit, and the data recognized by the output-multiplexed neural network is input to the automatic labeling unit for learning.
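The word matching recited in claim 1 (a local score based on the first-rank and second-rank phoneme candidates, accumulated by DTW, with the smallest total selected) can be illustrated with the following minimal sketch. The specific local score values (0, 1, 2), the permitted DTW steps, the function names, and the toy dictionary are all assumptions made for illustration; the claims do not specify them.

```python
def local_score(template_phoneme, candidates):
    """Assumed local score: 0 for a first-rank match, 1 for second-rank, 2 otherwise."""
    first, second = candidates
    if template_phoneme == first:
        return 0.0
    if template_phoneme == second:
        return 1.0
    return 2.0


def dtw_score(template, candidate_seq):
    """Accumulate local scores along the cheapest DTW path (smaller is more similar)."""
    inf = float("inf")
    n, m = len(template), len(candidate_seq)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = local_score(template[i - 1], candidate_seq[j - 1]) + best_prev
    return acc[n][m]


def recognize(dictionary, candidate_seq):
    """Return the dictionary word whose template has the smallest overall score."""
    return min(dictionary, key=lambda word: dtw_score(dictionary[word], candidate_seq))


# Toy dictionary of phoneme templates and a frame-wise (first, second) candidate sequence.
dictionary = {"aka": ["a", "k", "a"], "aki": ["a", "k", "i"]}
frames = [("a", "o"), ("a", "e"), ("k", "t"), ("i", "a"), ("a", "e")]
best_word = recognize(dictionary, frames)
```

Using the second-rank candidate in the local score lets a frame whose top candidate is wrong still contribute a small penalty rather than a full mismatch, which is how the fourth frame above remains compatible with the template "aka".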