JP2996925B2

JP2996925B2 - Phoneme boundary detection device and speech recognition device

Info

Publication number: JP2996925B2
Application number: JP9054594A
Authority: JP
Inventors: 芳典匂坂
Original assignee: 株式会社エイ・ティ・アール音声翻訳通信研究所
Priority date: 1997-03-10
Filing date: 1997-03-10
Publication date: 2000-01-11
Anticipated expiration: 2017-03-10
Also published as: JPH10254477A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a phoneme boundary detection device that detects phoneme boundaries in a speech signal waveform using a bi-directional recurrent neural network (hereinafter referred to as BRNN), and to a speech recognition device that performs speech recognition using this phoneme boundary detection device. In this specification, the boundary between one phoneme and the next is called a phoneme boundary.

[0002]

[Prior Art] Methods that recognize speech using speech segments of the speech signal as the acoustic model of a speech recognition device are disclosed in, for example, prior art document 1, T. Svendsen et al., "On the automatic segmentation of speech signals", Proceedings of ICASSP-87, pp. 77-80, 1987; prior art document 2, A. Ljolje et al., "Automatic segmentation and labelling of speech", Proceedings of ICASSP-91, pp. 473-476, 1991; and prior art document 3, J. Glass et al., "A probabilistic framework for feature-based speech recognition", Proceedings of ICSLP-96, pp. 2277-2280, 1996. Prior art documents 1 and 2 disclose automatic segmentation methods for building acoustic models for speech recognition and for speech synthesis, and prior art document 3 discloses preprocessing for speech recognition.

[0003]

[Problems to be Solved by the Invention] In prior art document 1, phoneme labels are assigned automatically using hidden Markov models (hereinafter referred to as HMMs) and transcribed text data of the uttered speech; prior art document 2 additionally uses a duration model. However, because an HMM is trained so as to maximize the likelihood for phoneme detection, its performance when detecting phonemes is relatively low and its processing time is relatively long. Furthermore, because prior art document 2 detects phonemes using a duration model, its processing time is also relatively long.

[0004] A first object of the present invention is to solve the above problems and to provide a phoneme boundary detection device that can detect phoneme boundaries with higher accuracy and at higher speed than the conventional examples.

[0005] A second object of the present invention is to solve the above problems and to provide a speech recognition device that, using the above phoneme boundary detection device, can recognize speech with a higher recognition rate and at higher speed than the conventional examples.

[0006]

[Means for Solving the Problems] The phoneme boundary detection device according to claim 1 of the present invention detects phoneme boundaries in a sequence of speech feature parameters using a bidirectional recurrent neural network comprising an input layer, at least one intermediate layer having a plurality of units, and an output layer that has one unit and outputs a phoneme boundary detection value representing a phoneme boundary detection probability. The input layer comprises a first input neuron group that receives a plurality of speech feature parameters and has a plurality of units, a forward module, and a backward module. The forward module has a temporally forward feedback connection and, based on the plurality of speech feature parameters, generates a plurality of parameters for a time delayed by a predetermined unit time relative to the plurality of parameters output from the first input neuron group, and outputs them to the intermediate layer. The backward module has a temporally backward feedback connection and, based on the plurality of speech feature parameters, generates a plurality of parameters for a time delayed in the reverse direction by the predetermined unit time relative to the plurality of parameters output from the first input neuron group, and outputs them to the intermediate layer.

[0007] The phoneme boundary detection device according to claim 2 is the phoneme boundary detection device according to claim 1, configured as follows. The forward module comprises: a second input neuron group that receives the plurality of speech feature parameters and has a plurality of units; a first intermediate neuron group that has a plurality of units and receives the plurality of parameters output from a second intermediate neuron group after a delay of the predetermined unit time; and the second intermediate neuron group, which has a plurality of units and is connected so that the plurality of parameters output from the second input neuron group and the plurality of parameters output from the first intermediate neuron group are each multiplied by respective weighting coefficients and input to it. The backward module comprises: a third input neuron group that receives the plurality of speech feature parameters and has a plurality of units; a third intermediate neuron group that has a plurality of units and receives the plurality of parameters output from a fourth intermediate neuron group after a delay of the predetermined unit time in the reverse direction; and the fourth intermediate neuron group, which has a plurality of units and is connected so that the plurality of parameters output from the third input neuron group and the plurality of parameters output from the third intermediate neuron group are each multiplied by respective weighting coefficients and input to it. The pluralities of parameters output from the second intermediate neuron group, from the first input neuron group, and from the fourth intermediate neuron group are each multiplied by respective weighting coefficients and input to the plurality of units of the intermediate layer, and the plurality of parameters output from the intermediate layer are each multiplied by respective weighting coefficients and input to the unit of the output layer.

[0008] Further, the phoneme boundary detection device according to claim 3 is the phoneme boundary detection device according to claim 1 or 2, further comprising first detecting means that detects a phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined threshold.

[0009] Further, the phoneme boundary detection device according to claim 4 is the phoneme boundary detection device according to claim 1 or 2, further comprising second detecting means that detects a phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined threshold and is a local maximum.

[0010] Further, the phoneme boundary detection device according to claim 5 is the phoneme boundary detection device according to claim 1 or 2, further comprising third detecting means that detects a first phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined first threshold, and detects a second phoneme boundary when the phoneme boundary detection value is equal to or greater than a second threshold that is smaller than the first threshold, is less than the first threshold, and is a local maximum.

[0011] The phoneme boundary detection device according to claim 6 is the phoneme boundary detection device according to claim 5, wherein the third detecting means selects, from those detected as first phoneme boundaries, one phoneme boundary out of every predetermined plural number and takes it as a first phoneme boundary.

[0012] Further, the phoneme boundary detection device according to claim 7 is the phoneme boundary detection device according to claim 5 or 6, wherein the third detecting means detects phoneme boundaries based on a lattice of paths formed between the detected or selected first phoneme boundaries and the second phoneme boundaries.

[0013] The speech recognition device according to claim 8 of the present invention comprises feature extraction means that extracts speech feature parameters from the speech signal of an input uttered sentence consisting of a character string, and speech recognition means that recognizes the speech signal of the input uttered sentence using a predetermined acoustic model together with the phoneme boundaries detected, from the speech feature parameters extracted by the feature extraction means, by the phoneme boundary detection device according to any one of claims 1 to 7.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device using a phoneme boundary detection neural network 10 according to one embodiment of the present invention. This embodiment includes a neural network learning unit 20 that obtains the phoneme boundary detection neural network 10 by training an initial model 33 of the phoneme boundary detection neural network with a predetermined learning algorithm, based on a feature parameter file 31 of training speech data and a phoneme boundary value file 32 of the training speech data. The word level matching unit 5 uses the resulting phoneme boundary detection neural network 10 to detect phoneme boundaries, detects phonemes, and performs word-level speech recognition. The word level matching unit 5 therefore includes the phoneme boundary detection device.

[0015] In this embodiment, the phoneme boundary detection neural network 10, which is the BRNN shown in FIG. 2, is used to detect phoneme boundaries. Whereas an ordinary recurrent neural network recursively uses only temporally past information, a BRNN can use both past and future input information.

[0016] In FIG. 2, speech feature parameters such as the cepstrum are given as the input, and information on whether the input frame is a phoneme boundary (for example, 1 if the frame is a phoneme boundary and 0 otherwise) is given at the output as the teacher signal during training. Accordingly, the input neuron group A(t) has 26 units, equal to the number of dimensions of the speech feature parameters, and the output has one unit. The speech feature parameters consist of 12-dimensional mel-cepstral coefficients (hereinafter referred to as MFCC), power, and the first-order regression coefficients of each. Below, the total number of frames in one file is denoted L.

[0017] In FIG. 2, the forward module B(t-1) has a temporally forward feedback connection and, based on the 26 speech feature parameters, generates 10 parameters for time t-1, which is delayed by a predetermined unit time relative to the 26 parameters output from the input neuron group A(t), and outputs them to the hidden neuron group D. It comprises:
(a) an input neuron group 51 that has 26 units a₁ to a₂₆ and receives the 26 speech feature parameters;
(b) an intermediate neuron group 52 that has 10 units b₁ to b₁₀ and receives the 10 parameters output from an intermediate neuron group 53 via a delay element 54;
(c) the intermediate neuron group 53, which has 10 units b₁ to b₁₀ and is connected so that the 26 parameters output from the input neuron group 51 and the 10 parameters output from the intermediate neuron group 52 are each multiplied by respective weighting coefficients and input to it;
(d) the delay element 54, which delays the 10 parameters output from the intermediate neuron group 53 by the predetermined unit time and outputs them to the intermediate neuron group 52; and
(e) an output latch 55 that serves as the state neuron group of the forward module B(t-1) with 10 units b₁ to b₁₀, temporarily stores (latches) the values output from the intermediate neuron group 53 at time t-1, when the forward module B(t-1) finishes operating, and outputs them to the hidden neuron group D of the neural network on the right-hand side, which performs the phoneme boundary detection.

[0018] In the forward module B(t-1) configured as above, a feedback loop runs from the intermediate neuron group 52 through the intermediate neuron group 53 and the delay element 54 back to the intermediate neuron group 52. After the repeated computation, the output parameter vector $B_m$ (m = 1, 2, ..., t-1) output from the intermediate neuron group 53 at time t-1, when the forward module B(t-1) finishes operating, is given by the following equation.

[0019]

[Equation 1] $B_m = W_{FA} \cdot A_m + W_{FB} \cdot B_{m-1}$

[0020] Here, the output value vector $B_m$ consists of 10 parameter values, and its initial value vector $B_0$ is given by the following equation.

[Equation 2] (not reproduced in this text)

[0021] The input parameter vector $A_m$ supplied to the input neuron group 51 is given by the following equation.

[Equation 3] $A_m = [O_m(1), O_m(2), \ldots, O_m(26)]^{\mathsf{T}}$

[0022] Here, $O_m(1)$ is the 1st-order MFCC value at time m, $O_m(2)$ is the 2nd-order MFCC value at time m, and so on up to $O_m(26)$, the 26th-order MFCC value at time m. The weighting coefficient matrices $W_{FA}$ and $W_{FB}$ in Equation 1 are a 10×26 matrix and a 10×10 matrix, respectively, and are given by the following equations.

[Equation 4] (not reproduced in this text)

[Equation 5] (not reproduced in this text)
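The recurrence of Equation 1 can be sketched in Python as follows. The sketch follows the equation literally, as a linear update, since no squashing function appears in Equation 1. The initial value vector of Equation 2 is not available in this text, so the zero initial state used here is an assumption, as are the helper names `matvec` and `forward_states`.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def forward_states(A, W_FA, W_FB, B0=None):
    """Return [B_1, ..., B_L] with B_m = W_FA * A_m + W_FB * B_{m-1}.

    A    : list of L input vectors A_m (26 values each in the embodiment)
    W_FA : 10 x 26 matrix in the embodiment
    W_FB : 10 x 10 matrix in the embodiment
    B0   : initial state; zeros if omitted (assumption, Equation 2 is unavailable)
    """
    n = len(W_FA)
    B_prev = list(B0) if B0 is not None else [0.0] * n
    states = []
    for A_m in A:
        B_m = [p + q for p, q in zip(matvec(W_FA, A_m), matvec(W_FB, B_prev))]
        states.append(B_m)
        B_prev = B_m
    return states
```

For frame t, the state handed to the hidden neuron group D is the entry for time t-1, so it summarizes only the frames before t.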

[0023] Also in FIG. 2, the backward module C(t+1) has a temporally backward feedback connection and, based on the 26 speech feature parameters, generates 10 parameters for time t+1, which is delayed in the reverse direction by the predetermined unit time relative to the 26 parameters output from the input neuron group A(t), and outputs them to the hidden neuron group D. It comprises:
(a) an input neuron group 61 that has 26 units a₁ to a₂₆ and receives the 26 speech feature parameters;
(b) an intermediate neuron group 62 that has 10 units c₁ to c₁₀ and receives the 10 parameters output from an intermediate neuron group 63 via a reverse delay element 64;
(c) the intermediate neuron group 63, which has 10 units c₁ to c₁₀ and is connected so that the 26 parameters output from the input neuron group 61 and the 10 parameters output from the intermediate neuron group 62 are each multiplied by respective weighting coefficients and input to it;
(d) the reverse delay element 64, which delays the 10 parameters output from the intermediate neuron group 63 by the predetermined unit time and outputs them to the intermediate neuron group 62; and
(e) an output latch 65 that serves as the state neuron group of the backward module C(t+1) with 10 units c₁ to c₁₀, temporarily stores (latches) the values output from the intermediate neuron group 63 at time t+1, when the backward module C(t+1) finishes operating, and outputs them to the hidden neuron group D of the neural network on the right-hand side, which performs the phoneme boundary detection.

[0024] In the backward module C(t+1) configured as above, a feedback loop runs from the intermediate neuron group 62 through the intermediate neuron group 63 and the reverse delay element 64 back to the intermediate neuron group 62. After the repeated computation, the output parameter vector $C_m$ (m = L, L-1, ..., t+1) output from the intermediate neuron group 63 at time t+1, when the backward module C(t+1) finishes operating, is given by the following equation.

[0025]

[Equation 6] $C_m = W_{BA} \cdot A_m + W_{BC} \cdot C_{m+1}$

[0026] Here, the output value vector $C_m$ consists of 10 parameter values, and its initial value vector $C_{L+1}$ is given by the following equation.

[Equation 7] (not reproduced in this text)

[0027] The input parameter vector $A_m$ supplied to the input neuron group 61 is the same as in Equation 3.

[0028] The weighting coefficient matrices $W_{BA}$ and $W_{BC}$ in Equation 6 are a 10×26 matrix and a 10×10 matrix, respectively, and are given by the following equations.

[Equation 8] (not reproduced in this text)

[Equation 9] (not reproduced in this text)
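The backward recurrence of Equation 6 runs over the same inputs in the opposite direction, from frame L down to frame 1. A minimal Python sketch is given below. As with the forward case, the zero value used for $C_{L+1}$ is an assumption (Equation 7 is not available in this text), and the names `matvec` and `backward_states` are chosen for this example only.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def backward_states(A, W_BA, W_BC, C_init=None):
    """Return [C_1, ..., C_L] with C_m = W_BA * A_m + W_BC * C_{m+1}.

    The list is filled from the last frame to the first.
    C_init stands in for C_{L+1}; zeros if omitted (assumption).
    """
    n = len(W_BA)
    C_next = list(C_init) if C_init is not None else [0.0] * n
    states = [None] * len(A)
    for m in range(len(A) - 1, -1, -1):
        C_m = [p + q for p, q in zip(matvec(W_BA, A[m]), matvec(W_BC, C_next))]
        states[m] = C_m
        C_next = C_m
    return states
```

Because each $C_m$ depends only on frames m through L, changing an earlier frame leaves the later states untouched, which is what lets the network use future context at frame t through C(t+1).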

[0029] As further shown in FIG. 2, the network includes a hidden neuron group D having 30 hidden units d₁ to d₃₀, and an output neuron group E that has one output unit e₁ and outputs the phoneme boundary detection value y(j) (j = 1, 2, ..., L) representing the phoneme boundary detection probability. The output parameters of units b₁ to b₁₀ of the state neuron group B(t-1) are each multiplied by respective weighting coefficients and input to units d₁ to d₃₀ of the hidden neuron group D. The output parameters of units c₁ to c₁₀ of the state neuron group C(t+1) are likewise each multiplied by respective weighting coefficients and input to units d₁ to d₃₀, as are the output parameters of the input neuron group A(t), which has 26 units a₁ to a₂₆. In addition, the output parameters of units d₁ to d₃₀ of the hidden neuron group D are each multiplied by respective weighting coefficients and input to the output unit e₁ of the output neuron group E.

[0030] The processing from the state neuron groups B(t-1) and C(t+1) and the input neuron group A(t), through the hidden neuron group D, to the output neuron group E is executed, as training or as computation, after the forward module B(t-1) and the backward module C(t+1) have finished their processing. In this neural network, the input layer 100 comprises the input neuron group A(t); the forward module B(t-1), which computes the output parameters at time t-1, delayed by the unit time from the output time t of the input neuron group A(t); and the backward module C(t+1), which computes the output parameters at time t+1, delayed in the reverse direction by the unit time from time t. The intermediate layer 200 comprises the hidden neuron group D, and the output layer 300 comprises the output neuron group E. As shown in FIG. 3, the phoneme boundary detection neural network 10 configured in this way is equivalent to a BRNN in which forward modules and backward modules are connected along the time axis and in which the input layer 100 consists of the input neuron group A(t), the forward module B(t-1), and the backward module C(t+1).
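The per-frame computation from the input layer 100 to the output unit e₁ can be sketched as below. The text specifies only weighted connections, so the logistic sigmoid applied in the hidden neuron group D and at the output is an assumption, chosen because the output is described as a detection probability. The weight names `W_DB`, `W_DC`, `W_DA` and `w_E` are hypothetical labels for the connections B(t-1)→D, C(t+1)→D, A(t)→D and D→E.

```python
import math


def sigmoid(x):
    """Logistic function (an assumed activation, not stated in the text)."""
    return 1.0 / (1.0 + math.exp(-x))


def boundary_value(a_t, b_prev, c_next, W_DB, W_DC, W_DA, w_E):
    """Compute y for one frame t.

    a_t    : 26 outputs of input neuron group A(t)
    b_prev : 10 outputs of state neuron group B(t-1)
    c_next : 10 outputs of state neuron group C(t+1)
    W_DB, W_DC, W_DA : one row per hidden unit d_1..d_30 (30x10, 30x10, 30x26)
    w_E    : 30 weights from the hidden units to the output unit e_1
    """
    d = []
    for row_b, row_c, row_a in zip(W_DB, W_DC, W_DA):
        s = (sum(w * v for w, v in zip(row_b, b_prev))
             + sum(w * v for w, v in zip(row_c, c_next))
             + sum(w * v for w, v in zip(row_a, a_t)))
        d.append(sigmoid(s))
    return sigmoid(sum(w * v for w, v in zip(w_E, d)))
```

At the first and last frames of a file, `b_prev` and `c_next` would be the initial value vectors of the two recurrences.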

[0031] FIG. 10 shows an example of the output when a time series of feature parameters is input to the phoneme boundary detection neural network 10 after training by the neural network learning process of FIG. 4, which is described in detail later. This example was obtained on open data using the neural network 10 trained under conditions described in detail later. The dotted line is the teacher signal (true value), and the solid line is the output value (detected value) of the neural network 10.

[0032] The following four methods were then devised as algorithms for detecting phoneme boundaries from output such as that shown in FIG. 10.

(a) Method 1: An output value at or above a threshold h is judged to be a phoneme boundary candidate. That is, an output value satisfying the following expression is judged to be a phoneme boundary candidate.

[Equation 10] $y(j) \ge h$

[0033] (b) Method 2: Of the output values at or above the threshold h, those that are local maxima are selected as phoneme boundary candidates. That is, an output value satisfying the following expression is judged to be a phoneme boundary candidate.

[Equation 11] $y(j) \ge h$ and $y(j) > y(j-1)$ and $y(j) > y(j+1)$

[0034] (c) Method 3: Two thresholds l and h (h > l) are used. All output values that are local maxima between the second threshold l and the first threshold h, and all output values at or above the first threshold h, are selected. That is,

[Equation 12] if $y(j) \ge h$, frame j is selected as a first phoneme boundary candidate;

[Equation 13] if $l \le y(j) < h$ and $y(j) > y(j-1)$ and $y(j) > y(j+1)$, frame j is selected as a second phoneme boundary candidate.

(d) Method 4: In Method 3, among consecutively detected first phoneme boundaries, only one out of every k is selected as a first phoneme boundary.
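The four methods can be sketched in Python as follows. Frame indices are 0-based here, whereas the text numbers frames from 1. Treating the first and last frames as never being local maxima is an assumption. For Method 4, the text does not say which of every k boundaries is kept; the sketch keeps the first of each group of k within a run of adjacent frames, which is only one possible reading.

```python
def is_peak(y, j):
    """Strict local maximum; end frames are never peaks (assumption)."""
    return 0 < j < len(y) - 1 and y[j] > y[j - 1] and y[j] > y[j + 1]


def method1(y, h):
    """Equation 10: every frame with y(j) >= h."""
    return [j for j, v in enumerate(y) if v >= h]


def method2(y, h):
    """Equation 11: frames with y(j) >= h that are also local maxima."""
    return [j for j, v in enumerate(y) if v >= h and is_peak(y, j)]


def method3(y, l, h):
    """Equations 12 and 13: (first candidates, second candidates)."""
    first = [j for j, v in enumerate(y) if v >= h]
    second = [j for j, v in enumerate(y) if l <= v < h and is_peak(y, j)]
    return first, second


def method4(y, l, h, k):
    """Method 3, then thin each run of adjacent first candidates to one in k.

    Keeping the first of each group of k is an assumed interpretation.
    """
    first, second = method3(y, l, h)
    thinned, run = [], []
    for j in first + [None]:  # the trailing None flushes the last run
        if run and (j is None or j != run[-1] + 1):
            thinned.extend(run[::k])
            run = []
        if j is not None:
            run.append(j)
    return thinned, second
```

Methods 1 and 2 return a single list of boundaries, while Methods 3 and 4 return two lists so that a later stage can choose among the second candidates.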

[0035] Methods 1 and 2 determine the phoneme boundaries uniquely using this processing alone. Methods 3 and 4 first retain as many plausible candidates as possible and then determine the phoneme candidates in separate processing. For example, if candidates detected at or above the first threshold h are taken as first phoneme boundary candidates and candidates detected between the second threshold l and the first threshold h are taken as second phoneme boundary candidates, a lattice such as that shown in FIG. 11 can be created over all the candidates lying between first phoneme boundaries. The optimal phoneme path can then be determined by re-evaluating the lattice with an acoustic model such as an HMM or a segment-model-based phoneme model, which fixes the final phoneme boundaries.
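One plausible reading of this lattice is that, between two neighboring first phoneme boundaries, every subset of the intervening second candidates defines one segmentation path. The Python sketch below enumerates those paths and picks the best one. FIG. 11 is not available in this text, so the lattice structure is an assumption, and `segment_score` is a hypothetical stand-in for the score an acoustic model (for example an HMM log likelihood) would assign to the segment between two frames.

```python
from itertools import combinations


def lattice_paths(start, end, candidates):
    """All paths from start to end through any subset of second candidates."""
    inner = sorted(c for c in candidates if start < c < end)
    paths = []
    for r in range(len(inner) + 1):
        for subset in combinations(inner, r):
            paths.append((start,) + subset + (end,))
    return paths


def best_path(start, end, candidates, segment_score):
    """Return the path whose summed segment scores is largest.

    segment_score(a, b) is a hypothetical callback scoring frames a..b.
    """
    def total(path):
        return sum(segment_score(a, b) for a, b in zip(path, path[1:]))
    return max(lattice_paths(start, end, candidates), key=total)
```

The number of paths doubles with each second candidate, so an exhaustive search like this is only practical when few candidates lie between two first boundaries; a dynamic-programming search would be the usual alternative for longer spans.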

[0036] In FIG. 1, the A/D converter 2, the feature extraction unit 3, the word level matching unit 5, the sentence level matching unit 6, and the neural network learning unit 20 are implemented by, for example, an arithmetic control device such as a digital computer. The buffer memory 4 is implemented by, for example, a hard disk memory. The feature parameter file 31 of the training speech data, the phoneme boundary value file 32 of the training speech data, the initial model 33 of the phoneme boundary detection neural network, the phoneme boundary detection neural network 10, the word models 7, the grammar rules 8, and the semantic rules 9 are stored in, for example, a hard disk memory.

[0037] FIG. 4 is a flowchart of the neural network learning process executed by the neural network learning unit 20 of FIG. 1. First, in step S1, the feature parameter file 31, the phoneme boundary value file 32 corresponding to that feature parameter file, and the initial model 33 of the phoneme boundary detection neural network are read. Next, in step S2, the number of feature parameter files 31, which corresponds to the total number of utterances in the phoneme boundary value file 32, is set in parameter N, and the number of training iterations is set in parameter I. In step S3 the parameter i is initialized to 1, and in step S4 the parameter n is initialized to 1. In step S5, the total number of frames in the n-th file is set in parameter Ln. Then, in step S6, using the feature parameters of the Ln frames, the output values (Ln groups each) of the state neuron group B(t-1) of the forward module, the state neuron group C(t+1) of the backward module, and the output neuron group E are computed, and the update parameters for the weighting coefficients of the neural network are calculated.

[0038] Then, after the parameter n is incremented by 1 in step S7, it is determined in step S8 whether n > N. If n ≤ N, the process returns to step S5 and the above processing is repeated. If n > N in step S8, the weight coefficients of the neural network are updated in step S9, the parameter i is incremented by 1 in step S10, and it is then determined in step S11 whether i > I. If i ≤ I, the pass over the N files is repeated. If i > I, it is determined that the predetermined number of iterations has been reached; in step S12 the resulting phoneme boundary detection neural network 10 is stored in memory, and the process ends.
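The control flow of FIG. 4 (steps S1 to S12) can be sketched as follows. This is only an illustrative sketch, not the implementation described in the patent: `compute_update` and `apply_update` are hypothetical callbacks standing in for the computation of step S6 and the weight update of step S9, and the loop assumes that processing resumes at step S4 while i ≤ I.

```python
def train(files, weights, num_iterations, compute_update, apply_update):
    """Batch training loop following the step numbering of FIG. 4.

    files          : list of per-utterance (features, targets) pairs   (S1)
    weights        : initial model parameters                           (S1)
    num_iterations : the repetition count I                             (S2)
    compute_update, apply_update : hypothetical callbacks (assumptions)
    """
    n_files = len(files)                      # S2: parameter N
    i = 1                                     # S3
    while i <= num_iterations:                # S11: stop once i > I
        pending = []                          # updates accumulated over one pass
        n = 1                                 # S4
        while n <= n_files:                   # S8: leave the loop once n > N
            features, targets = files[n - 1]
            num_frames = len(features)        # S5: parameter Ln
            # S6: network outputs and update parameters for this file
            pending.append(compute_update(weights, features, targets, num_frames))
            n += 1                            # S7
        weights = apply_update(weights, pending)  # S9: one update per pass
        i += 1                                # S10
    return weights                            # S12: trained network
```

In this reading the weights change once per pass over all N files, which matches the position of step S9 after the loop over n.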

[0039] FIG. 5 is a flowchart showing the word matching process executed by the word level matching unit 5 of FIG. 1. In FIG. 5, first, in step S21, the feature parameters stored in the buffer memory 4 and the phoneme boundary detection neural network 10 are read. Next, in step S22, the log likelihood Pw with respect to the word model 7 is calculated from the feature parameters. Further, in step S23, the output values y(j), j = 1, 2, ..., L, of the neural network 10 are calculated from the feature parameters for each of the L frames of the feature parameters. Then, in step S24, the logarithms of the output values y(j) are calculated to obtain the log likelihood

[Equation 14] (see original document)

Then, after the phoneme boundary detection process is executed in step S25, the weighted sum P_total of the calculated log likelihoods Pw and Ps is calculated using the following equation,

[Equation 15] P_total = k Pw + (1 − k) Ps

and the word level matching process is performed. That is, based on the calculated likelihood P_total, the candidate word with the maximum likelihood is output to the sentence level matching unit 6 as the recognition result, and the word level matching process ends.
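The weighted sum of Equation 15 and the selection of the best candidate can be illustrated with a short sketch. The tuple layout of `candidates`, the function names, and the assumption that the weight k lies between 0 and 1 are illustrative choices; the text does not state a value for k.

```python
def combine_scores(log_pw, log_ps, k):
    """Equation 15: P_total = k * Pw + (1 - k) * Ps (0 <= k <= 1 assumed)."""
    return k * log_pw + (1.0 - k) * log_ps


def best_candidate(candidates, k):
    """Return the word with the largest combined score.

    candidates : list of (word, log_pw, log_ps) tuples (hypothetical layout)
    """
    return max(candidates, key=lambda c: combine_scores(c[1], c[2], k))[0]
```

With k close to 1 the word model score Pw dominates the decision; with k close to 0 the boundary score Ps dominates.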

[0040] FIG. 6 is a flowchart of the phoneme boundary detection process (Method 1) (step S25), which is a subroutine of the word matching process of FIG. 5. In FIG. 6, for each frame j up to L, it is determined whether the output value y(j) of the phoneme boundary detection neural network 10 satisfies

[Equation 16] y(j) ≥ h

If YES, the frame is determined to be a phoneme boundary; if NO, it is determined to be within a phoneme.
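Method 1 amounts to a plain threshold test on each frame. A minimal sketch, assuming the network outputs are available as a Python list `y` with 0-based frame indices (the function name is hypothetical):

```python
def detect_method1(y, h):
    """Return the indices j of frames whose output satisfies y[j] >= h."""
    return [j for j, value in enumerate(y) if value >= h]
```

For example, `detect_method1([0.1, 0.5, 0.6, 0.2, 0.4], 0.4)` marks frames 1, 2, and 4, so a broad peak in y(j) yields several adjacent boundary frames.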

[0041] FIG. 7 is a flowchart of the phoneme boundary detection process (Method 2) (step S25), which is a subroutine of the word matching process of FIG. 5. In FIG. 7, for each frame j up to L, it is determined whether the output value y(j) of the phoneme boundary detection neural network 10 satisfies

[Equation 17] y(j) ≥ h and y(j) > y(j−1) and y(j) > y(j+1)

If YES, the frame is determined to be a phoneme boundary; if NO, it is determined to be within a phoneme.
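Method 2 adds a local-maximum condition to the threshold test. The sketch below assumes 0-based frame indices and skips the first and last frames, which lack one neighbour; how the original handles those edge frames is not stated, so that choice is an assumption.

```python
def detect_method2(y, h):
    """Frames where y[j] >= h and y[j] is a strict local maximum."""
    boundaries = []
    for j in range(1, len(y) - 1):   # first and last frame lack a neighbour
        if y[j] >= h and y[j] > y[j - 1] and y[j] > y[j + 1]:
            boundaries.append(j)
    return boundaries
```

Compared with a pure threshold, each peak of y(j) contributes at most one boundary frame.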

[0042] FIG. 8 is a flowchart of the phoneme boundary detection process (Method 3) (step S25), which is a subroutine of the word matching process of FIG. 5. In FIG. 8, for each frame j up to L, the output value y(j) of the phoneme boundary detection neural network 10 is examined. When

[Equation 18] y(j) ≥ h

the frame is determined to be a first phoneme boundary; when

[Equation 19] l ≤ y(j) < h and y(j) > y(j−1) and y(j) > y(j+1)

the frame is determined to be a second phoneme boundary; otherwise it is determined to be within a phoneme.
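The two-level decision of Method 3 can be sketched as below. The function name and the treatment of the first and last frames (never labelled as second boundaries because one neighbour is missing) are assumptions of this sketch.

```python
def detect_method3(y, h, l):
    """Return (first, second) lists of boundary frame indices.

    first  : frames with y[j] >= h
    second : frames with l <= y[j] < h that are strict local maxima
    """
    first, second = [], []
    last = len(y) - 1
    for j, value in enumerate(y):
        if value >= h:
            first.append(j)
        elif value >= l and 0 < j < last and value > y[j - 1] and value > y[j + 1]:
            second.append(j)
    return first, second
```

The second list collects weaker peaks that the upper threshold h alone would reject, which is how this method trades extra insertions for fewer deletions.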

[0043] FIG. 9 is a flowchart of the phoneme boundary detection process (Method 4) (step S25), which is a subroutine of the word matching process of FIG. 5. In FIG. 9, steps identical to those of FIG. 8 are given the same step numbers. The flowchart of FIG. 9 differs from that of FIG. 8 in that step S58 is inserted before step S57; in step S58, consecutive first phoneme boundaries are thinned out so that only one in every k is selected as a first phoneme boundary.
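One possible reading of the thinning in step S58 is sketched below: first phoneme boundaries on consecutive frames form a run, and within each run only one frame in every k is kept. The text does not say which member of each group of k survives; keeping the first is an assumption, as is the function name.

```python
def thin_first_boundaries(first, k):
    """Keep one out of every k frames within each run of consecutive indices.

    first : sorted frame indices labelled as first phoneme boundaries
    """
    kept = []
    run_position = 0
    previous = None
    for j in first:
        if previous is None or j != previous + 1:
            run_position = 0          # a new run of consecutive frames starts
        if run_position % k == 0:
            kept.append(j)
        run_position += 1
        previous = j
    return kept
```

With k = 2, a run of four consecutive boundary frames is reduced to two, while isolated boundary frames are always kept.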

[0044] Next, the configuration and operation of the free-utterance speech recognition apparatus shown in FIG. 1 will be described. In FIG. 1, a speaker's utterance, which is an uttered speech sentence consisting of a character string, is input to the microphone 1 and converted into a speech signal, which is then input to the A/D converter 2. The A/D converter 2 converts the input speech signal from analog to digital at a predetermined sampling frequency and outputs the resulting digital data to the feature extraction unit 3. The feature extraction unit 3 then performs, for example, LPC analysis on the digital data of the input speech signal and extracts 11-dimensional feature parameters comprising 10-dimensional MFCCs and power. The time series of extracted feature parameters is input to the word level matching unit 5 via the buffer memory 4.

[0045] In generating the word models, a maximum likelihood word model generation process is executed as follows on the basis of predetermined model parameters. That is, from the acoustic features of a plurality of N instances of the same word in the model parameters, the representative phoneme label sample of that word having the maximum likelihood is detected; the detected representative phoneme label sample is aligned in time with the phoneme label samples of the N instances by dynamic time warping, thereby normalizing them in time; and the time-normalized representative phoneme label sample and the N phoneme label samples are mixed for each word, so that a word model containing acoustic features is generated for each word and stored in the word model memory 7. In summary, based on the generated probabilistic phoneme models with mixture distributions, a word model containing the speech feature parameters of each word of the text is generated.

[0046] The word models in the word model memory 7, which is connected to the word level matching unit 5, are formed by connecting in cascade context-dependent phoneme models that reflect the preceding and following phoneme contexts, and comprise a plurality of states connected in cascade. Each state has the following information:
(a) state number,
(b) mean of the 11-dimensional acoustic features,
(c) variance of the 11-dimensional acoustic features,
(d) duration,
(e) weight of each cluster, and
(f) phoneme code corresponding to the phoneme label.

[0047] The word level matching unit 5 and the sentence level matching unit 6 constitute a speech recognition circuit unit. Connected to the sentence level matching unit 6 are the grammar rules stored in the grammar rule memory 8, which include the output probabilities of parts of speech and words and the transition probabilities between parts of speech and between words, and the semantic rules stored in the semantic rule memory 9, which include thesaurus output probabilities and dialogue management rules. The word level matching unit 5 performs word-level speech recognition by executing the word level matching process of FIG. 5. That is, based on the time series of input acoustic features, the word level matching unit 5 matches the input against the word models in the memory 7 to detect at least one candidate word, calculates the likelihood of each detected candidate word, executes the phoneme boundary detection process described above to detect phoneme boundaries, and outputs the candidate word with the maximum likelihood to the sentence level matching unit 6 as the recognized word. The sentence level matching unit 6 then performs sentence-level matching on the recognized words by referring to a language model comprising the grammar rules and the semantic rules, and outputs the final recognized sentence. If a word is not accepted by the language model, that information is fed back to the word level matching unit 5 and word-level matching is executed again. By successively concatenating words, each consisting of a plurality of phonemes, the word level matching unit 5 and the sentence level matching unit 6 recognize continuous free-utterance speech and output the speech recognition result data.

[0048]

[Example] Using a speech database owned by the present applicant, the inventor carried out two performance evaluations: (1) a comparison between Method 2 and an approach that takes the phoneme boundaries obtained by HMM-based phoneme recognition as the detection result, and (2) a comparison among Methods 2 to 4. The input to the neural network 10 was a 26-dimensional MFCC feature vector (12-dimensional MFCCs and power, plus the first-order regression coefficients of each) analyzed with a frame length of 25.6 msec and a frame period of 10 msec. The target output was derived from the phoneme label information in the database: 1 if the frame is a phoneme boundary, 0.5 if it is adjacent to a phoneme boundary, and 0 otherwise. The forward and backward modules of the neural network 10 each had 10 units, the hidden module D had 30 units, and the number of learning iterations I was 1,000. With this configuration the neural network 10 has 2,181 weight coefficients in total. The training data comprised 462 speakers (3,696 sentences) with about 140,000 phoneme boundaries (about 1.1 million frames); the evaluation data comprised 168 speakers (1,344 sentences) distinct from the training data, with 50,318 phoneme boundaries (about 410,000 frames). The mean squared error between the true values and the detected values of the neural network 10 was 0.0604 on the training data and 0.0621 on the evaluation data. The thresholds for Methods 2 to 4 were set experimentally to h = 0.4 and l = 0.1.
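The target values used for training (1 at a boundary frame, 0.5 at a frame adjacent to a boundary, 0 elsewhere) can be generated as in the following sketch. It assumes that "adjacent" means the immediately preceding and following frames and that boundary positions are given as 0-based frame indices; the function name is hypothetical.

```python
def make_targets(num_frames, boundary_frames):
    """Target 1.0 at a boundary frame, 0.5 at its neighbours, 0.0 elsewhere."""
    targets = [0.0] * num_frames
    for b in boundary_frames:
        for j in (b - 1, b + 1):
            if 0 <= j < num_frames:
                targets[j] = max(targets[j], 0.5)
    for b in boundary_frames:
        targets[b] = 1.0              # a boundary overrides a neighbour label
    return targets
```

For a 7-frame utterance with boundaries at frames 2 and 5, this yields 0, 0.5, 1, 0.5, 0.5, 1, 0.5.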

[0049] Next, the evaluation method is described. For each phoneme boundary labeled by visual inspection, if a detected phoneme boundary falls within a margin of ±M frames, the boundary is counted as correct (the number of correct detections is denoted H); if none does, it is counted as a deletion (the number of deletions is denoted D). Spurious detections are counted as insertions (the number of insertions is denoted I). When several detection candidates fall within the ±M frame margin, all but one are counted as insertions. The correct rate and the accuracy are then defined by the following equations.
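The counting rules above can be sketched as follows. The text does not say how a detection lying inside the margins of two reference boundaries is assigned; this sketch credits each detection to at most one reference boundary, in order, which is an assumption, as is the function name.

```python
def score_detections(reference, detected, margin):
    """Count correct detections (H), deletions (D) and insertions (I).

    reference, detected : sorted lists of boundary frame indices
    margin              : M, the tolerance in frames
    """
    used = [False] * len(detected)
    hits = 0
    for ref in reference:
        for idx, det in enumerate(detected):
            if not used[idx] and abs(det - ref) <= margin:
                used[idx] = True      # first unused detection inside the margin
                hits += 1
                break
    deletions = len(reference) - hits
    insertions = len(detected) - hits  # surplus and stray detections
    return hits, deletions, insertions
```

With reference boundaries at frames 10, 20, 30, detections at 9, 11, 21, 40, and M = 2, the sketch counts two correct detections, one deletion, and two insertions (the second detection near frame 10 and the stray one at frame 40).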

[0050]

[Equation 20] Correct rate = H / N × 100 (%)

[Equation 21] Accuracy = (N − D − I) / N × 100 (%)

[0051] In this example, the phoneme boundary detection performance was evaluated with these two measures. Here, N is the total number of phoneme boundaries labeled by visual inspection, and

[Equation 22] N = H + D
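Equations 20 to 22 translate directly into a few lines of code. The function name is hypothetical; the arithmetic follows the equations as written.

```python
def boundary_metrics(hits, deletions, insertions):
    """Correct rate and accuracy in percent, per Equations 20 to 22."""
    total = hits + deletions                       # Equation 22: N = H + D
    correct_rate = 100.0 * hits / total            # Equation 20
    accuracy = 100.0 * (total - deletions - insertions) / total  # Equation 21
    return correct_rate, accuracy
```

Because N − D equals H, the accuracy is simply the correct rate reduced by the insertion rate I / N, which is why it can become negative when insertions outnumber correct detections. As a check, H = 40,056, D = 10,262, I = 2,293 (the M = 2 column of Table 1) gives about 79.61% and 75.05%.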

[0052] First, the comparison between Method 2 and the HMM-based results is described. Table 1 shows the detection results of Method 2 for M = 0, 1, 2. To assess the performance of the phoneme boundary detection neural network 10 of this embodiment, HMM-based phoneme recognition using a phoneme bigram was performed, and the phoneme boundaries obtained as a result were taken as the detected boundaries for comparison. Only the phoneme boundary (time) information was considered; the recognition results themselves were not taken into account. Table 2 shows the results for context-independent models with 3 states and 5 mixtures per state built for the 61 phoneme labels, and Table 3 shows the results for a context-dependent model with 600 states in total and 3 mixtures per state (see, for example, prior art document 4: Junichi Takami et al., "Automatic Generation of Hidden Markov Networks by the Successive State Splitting Method", IEICE Transactions D-II, Vol. J76-D-II, No. 10, pp. 2155-2164, October 1993), in which the silence model is an HMM with 3 states and 10 mixtures per state. Comparing Table 1 with Tables 2 and 3, the method based on the neural network 10 achieves higher accuracy. This is presumably because the HMM model parameters were not trained to detect phoneme boundaries, so the evaluation uses boundary information obtained only as a by-product, whereas the neural network 10 is trained specifically to detect phoneme boundaries.

[0053]

[Table 1] BRNN-based phoneme boundary detection results (Method 2), threshold h = 0.4
────────────────────────────────────────────
M                        0         1         2
────────────────────────────────────────────
Correct             23,175    38,248    40,056
Insertions          18,983     4,066     2,293
Deletions           27,143    12,070    10,262
────────────────────────────────────────────
Correct rate (%)     46.06     76.01     79.61
Accuracy (%)          8.33     67.93     75.05
────────────────────────────────────────────

[0054]

[Table 2] HMM-based phoneme boundary detection results (a) context-independent model
────────────────────────────────────────────
M                        0         1         2
────────────────────────────────────────────
Correct              8,806    28,214    38,847
Insertions          35,372    16,253     5,915
Deletions           41,512    22,104    11,471
────────────────────────────────────────────
Correct rate (%)     17.50     56.07     77.20
Accuracy (%)        −52.80     23.77     65.45
────────────────────────────────────────────

[0055]

[Table 3] HMM-based phoneme boundary detection results (b) context-dependent model
────────────────────────────────────────────
M                        0         1         2
────────────────────────────────────────────
Correct             14,198    35,967    42,611
Insertions          32,970    11,521     5,110
Deletions           36,120    14,351     7,707
────────────────────────────────────────────
Correct rate (%)     28.22     71.47     84.68
Accuracy (%)        −37.31     48.58     74.53
────────────────────────────────────────────

[0056] Next, Table 4 compares the performance of Methods 2, 3, and 4.

[0057]

[Table 4]
────────────────────────────────────────────
Method                   2         3         4
────────────────────────────────────────────
Correct             40,056    48,856    48,856
Insertions           2,293    67,570    30,629
Deletions           10,262     1,462     1,461
────────────────────────────────────────────
Correct rate (%)     79.61     97.10     97.10
Accuracy (%)         75.05    −37.19     36.22
────────────────────────────────────────────

[0058] Here, the thinning interval of Method 4 was k = 2, and all evaluations were performed with M = 2. Method 2 has the highest accuracy but a large number of deletions. As noted above, when phoneme boundary candidates can be re-evaluated, a method with this many deletions is not very suitable. Method 3 greatly reduces the number of deletions relative to Method 2, but the number of insertions increases greatly. Method 4 cuts the number of insertions to less than half that of Method 3 without increasing the number of deletions. Moreover, when the detection results of Method 4 are represented as a lattice, as many as 97.10% of the correct boundaries are contained in the lattice.

[0059] As described above, according to this embodiment, the neural network 10, which is a BRNN, is trained using speech feature parameters, and the trained neural network 10 can detect phoneme boundary positions quickly and accurately from the speech feature parameters alone. Because phoneme boundary positions are obtained more accurately:
(a) speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced;
(b) when the phoneme boundary detection neural network 10 is used in creating the initial model of an HMM serving as the acoustic model, the accuracy of that model can be greatly improved; and
(c) the phoneme boundary detection neural network 10 can be used to segment speech waveform signals for speech synthesis, in which case waveform segmentation errors can be greatly reduced.

[0060]

[Effects of the Invention] As described in detail above, the phoneme boundary detection device according to claim 1 of the present invention detects phoneme boundaries in a sequence of speech feature parameters using a bidirectional recurrent neural network comprising an input layer, at least one intermediate layer having a plurality of units, and an output layer having one unit that outputs a phoneme boundary detection value representing the probability of a phoneme boundary. The input layer comprises a first input neuron group that has a plurality of units and receives a plurality of speech feature parameters, a forward module, and a backward module. The forward module has a feedback connection directed forward in time and, based on the plurality of speech feature parameters, generates a plurality of parameters for a time delayed by a predetermined unit time relative to the parameters output from the first input neuron group, and outputs them to the intermediate layer. The backward module has a feedback connection directed backward in time and, based on the plurality of speech feature parameters, generates a plurality of parameters for a time delayed in the reverse direction by the predetermined unit time relative to the parameters output from the first input neuron group, and outputs them to the intermediate layer. Accordingly, phoneme boundary positions can be detected quickly and accurately from the speech feature parameters alone. Moreover, because phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced.

[0061] In the phoneme boundary detection device according to claim 2, in the phoneme boundary detection device according to claim 1, the forward module comprises: a second input neuron group that has a plurality of units and receives a plurality of speech feature parameters; a first intermediate neuron group that has a plurality of units and receives the plurality of parameters output from a second intermediate neuron group with a delay of a predetermined unit time; and the second intermediate neuron group, which has a plurality of units and is connected so that the parameters output from the second input neuron group and the parameters output from the first intermediate neuron group are each multiplied by respective weight coefficients and input to it. The backward module comprises: a third input neuron group that has a plurality of units and receives a plurality of speech feature parameters; a third intermediate neuron group that has a plurality of units and receives the plurality of parameters output from a fourth intermediate neuron group with a delay of the predetermined unit time in the reverse direction; and the fourth intermediate neuron group, which has a plurality of units and is connected so that the parameters output from the third input neuron group and the parameters output from the third intermediate neuron group are each multiplied by respective weight coefficients and input to it. The parameters output from the second intermediate neuron group are each multiplied by respective weight coefficients and input to the plurality of units of the intermediate layer; the parameters output from the first input neuron group are each multiplied by respective weight coefficients and input to the plurality of units of the intermediate layer; the parameters output from the fourth intermediate neuron group are each multiplied by respective weight coefficients and input to the plurality of units of the intermediate layer; and the parameters output from the intermediate layer are each multiplied by respective weight coefficients and input to the unit of the output layer. Accordingly, phoneme boundary positions can be detected quickly and accurately from the speech feature parameters alone. Moreover, because phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced.
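The connection pattern described above can be illustrated with a small pure-Python sketch of one forward pass. Several details are assumptions not stated in the text: sigmoid activations, one bias weight per unit, zero initial states, and the reading that the intermediate layer at frame t receives the forward state of frame t−1, the input of frame t, and the backward state of frame t+1. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(weights, inputs):
    """Fully connected sigmoid layer; the last weight of each row is a bias."""
    extended = list(inputs) + [1.0]
    return [sigmoid(sum(w * x for w, x in zip(row, extended))) for row in weights]


def brnn_outputs(frames, w_fwd, w_bwd, w_hidden, w_out, state_size):
    """Return one boundary detection value y(t) per frame."""
    num_frames = len(frames)
    zero = [0.0] * state_size
    fwd, state = [], zero
    for x in frames:                          # forward module: past context
        state = layer(w_fwd, list(x) + state)
        fwd.append(state)
    bwd, state = [None] * num_frames, zero
    for t in range(num_frames - 1, -1, -1):   # backward module: future context
        state = layer(w_bwd, list(frames[t]) + state)
        bwd[t] = state
    outputs = []
    for t, x in enumerate(frames):
        past = fwd[t - 1] if t > 0 else zero                 # B(t-1)
        future = bwd[t + 1] if t + 1 < num_frames else zero  # C(t+1)
        hidden = layer(w_hidden, past + list(x) + future)
        outputs.append(layer([w_out], hidden)[0])            # single output unit
    return outputs


def count_weights(n_in, n_state, n_hidden):
    """Number of weights (including biases) in the layout sketched above."""
    state_module = n_state * (n_in + n_state + 1)
    hidden = n_hidden * (n_in + 2 * n_state + 1)
    return 2 * state_module + hidden + (n_hidden + 1)
```

Under these assumptions, `count_weights(26, 10, 30)` evaluates to 2,181 for the dimensions used in the example (26 inputs, 10 units per state module, 30 hidden units), which agrees with the total number of weight coefficients reported there.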

[0062] The phoneme boundary detection device according to claim 3, in the phoneme boundary detection device according to claim 1 or 2, further comprises first detecting means for detecting a phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined threshold. Accordingly, phoneme boundary positions can be detected quickly and accurately from the speech feature parameters alone. Moreover, because phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced.

[0063] The phoneme boundary detection device according to claim 4, in the phoneme boundary detection device according to claim 1 or 2, further comprises second detecting means for detecting a phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined threshold and is a local maximum. Accordingly, phoneme boundary positions can be detected quickly and accurately from the speech feature parameters alone. Moreover, because phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced.

[0064] The phoneme boundary detection device according to claim 5, in the phoneme boundary detection device according to claim 1 or 2, further comprises third detecting means for detecting a first phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined first threshold, and for detecting a second phoneme boundary when the phoneme boundary detection value is equal to or greater than a second threshold smaller than the first threshold, is less than the first threshold, and is a local maximum. Accordingly, phoneme boundary positions can be detected quickly and accurately from the speech feature parameters alone. Moreover, because phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced.

[0065] In the phoneme boundary detection device according to claim 6, in the phoneme boundary detection device according to claim 5, the third detecting means selects, from the boundaries detected as first phoneme boundaries, one phoneme boundary out of every predetermined plurality and takes it as a first phoneme boundary. Accordingly, phoneme boundary positions can be detected quickly and accurately from the speech feature parameters alone. Moreover, because phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the computational cost of speech recognition can be greatly reduced.

[0066] Further, the phoneme boundary detection device according to claim 7 is the phoneme boundary detection device according to claim 5 or 6, wherein the third detection means detects phoneme boundaries based on a lattice of paths formed between the detected or selected first phoneme boundaries and the second phoneme boundaries. Therefore, phoneme boundary positions can be detected quickly and accurately based only on the speech feature parameters. Moreover, because the phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the amount of computation required for speech recognition can be greatly reduced.
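
One possible reading of the lattice of paths is sketched below in Python. It assumes, for illustration only, that first phoneme boundaries are fixed anchors and that second phoneme boundaries lying between two adjacent anchors are optional, so that every path keeps the anchors and any subset of the optional boundaries. The function names are hypothetical and the construction is not taken from the specification.

```python
from itertools import combinations
from typing import List, Tuple


def lattice_edges(first: List[int], second: List[int]) -> List[Tuple[int, int]]:
    """List candidate segments (edges) between adjacent first boundaries.

    Within each pair of adjacent anchors, every ordered pair of nodes
    (anchors plus the optional second boundaries between them) is an edge.
    """
    edges = []
    anchors = sorted(first)
    for start, end in zip(anchors, anchors[1:]):
        nodes = [start] + sorted(t for t in second if start < t < end) + [end]
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                edges.append((a, b))
    return edges


def segment_paths(start: int, end: int,
                  optional: List[int]) -> List[Tuple[int, ...]]:
    """Enumerate every boundary path from start to end.

    Each path contains start, end, and one subset of the optional
    boundaries that lie strictly between them.
    """
    inner = sorted(t for t in optional if start < t < end)
    paths = []
    for k in range(len(inner) + 1):
        for subset in combinations(inner, k):
            paths.append((start, *subset, end))
    return paths


# Example: anchors at frames 3 and 10, optional boundaries at frames 5 and 8.
edges = lattice_edges([3, 10], [5, 8])
paths = segment_paths(3, 10, [1, 5, 8])
```

Under this reading, each edge is a candidate segment that a later matching stage could score, and each path is one candidate segmentation between two anchors.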

[0067] The speech recognition device according to claim 8 of the present invention comprises feature extraction means for extracting speech feature parameters from the speech signal of an input uttered sentence consisting of a character string, and speech recognition means for recognizing the speech signal of the input uttered sentence, based on the speech feature parameters extracted by the feature extraction means, using the phoneme boundaries detected by the phoneme boundary detection device according to any one of claims 1 to 7 and a predetermined acoustic model. Therefore, phoneme boundary positions can be detected quickly and accurately based only on the speech feature parameters. Moreover, because the phoneme boundary positions are obtained more accurately, speech recognition performance can be improved and the amount of computation required for speech recognition can be greatly reduced.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device using a phoneme boundary detection neural network, which is one embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the phoneme boundary detection neural network of FIG. 1.

FIG. 3 is a block diagram showing an equivalent structure of the phoneme boundary detection neural network of FIG. 2.

FIG. 4 is a flowchart showing the neural network learning process executed by the neural network learning unit of FIG. 1.

FIG. 5 is a flowchart showing the word matching process executed by the word matching unit of FIG. 1.

FIG. 6 is a flowchart of the phoneme boundary detection process (method 1), which is a subroutine of the word matching process of FIG. 5.

FIG. 7 is a flowchart of the phoneme boundary detection process (method 2), which is a subroutine of the word matching process of FIG. 5.

FIG. 8 is a flowchart of the phoneme boundary detection process (method 3), which is a subroutine of the word matching process of FIG. 5.

FIG. 9 is a flowchart of the phoneme boundary detection process (method 4), which is a subroutine of the word matching process of FIG. 5.

FIG. 10 is a graph showing an example of the result detected by the phoneme boundary detection process of FIG. 5.

FIG. 11 is a diagram showing a lattice representation of phoneme boundary candidates in the phoneme boundary detection process of FIG. 5.

[Explanation of symbols]

1: microphone
2: A/D converter
3: feature extraction unit
4: buffer memory
5: word-level matching unit
6: sentence-level matching unit
7: word model
8: grammar rules
9: semantic rules
10: phoneme boundary detection neural network
20: neural network learning unit
31: feature parameter file of the training speech data
32: phoneme boundary value file of the training speech data
33: initial model of the phoneme boundary detection neural network
100: input layer
200: intermediate layer
300: output layer
A(t), 51, 61: input neuron groups
B(t-1): forward module
C(t+1): backward module
52, 53, 62, 63: intermediate neuron groups
54: delay element
64: reverse delay element
D: hidden neuron group
E: output neuron group

Continuation of the front page

(56) References
JP-A-4-165400 (JP, A)
JP-A-4-324500 (JP, A)
JP-A-4-60600 (JP, A)
IEICE Technical Report [Speech], Vol. 96, No. 319, SP96-56, "Speech Recognition Based on Bidirectional Recurrent Neural Networks", pp. 7-12 (October 18, 1996)
Proceedings of the 1996 Autumn Meeting of the Acoustical Society of Japan, I, 2-3-15, "Bi-Directional Recurrent Neural Networks for Speech Recognition", pp. 77-78 (September 25, 1996)
Transactions of the IEICE, Vol. J73-D-II, No. 5, May, "A Learning Method for Phoneme Spotting Using Time-Delay Neural Networks (TDNN) and Its Effects" (May 25, 1990)
Proceedings of the 1997 Spring Meeting of the Acoustical Society of Japan, I, 3-6-7, "Acoustic Models Based on Non-Uniform Segments and Bidirectional Recurrent Neural Networks", pp. 101-102 (March 17, 1997)
Proceedings of the 1997 Spring Meeting of the Acoustical Society of Japan, I, 3-6-8, "Segment Boundary Estimation Using Recurrent Neural Networks", pp. 103-104 (March 17, 1997)
IEICE Technical Report [Speech], Vol. 97, No. 114, SP97-15, "Phoneme Boundary Estimation Using Recurrent Neural Networks and Its Application to Speech Recognition", pp. 41-48 (June 19, 1997)
Proceedings of the 1997 Autumn Meeting of the Acoustical Society of Japan, I, 2-Q-10, "Automatic Segmentation of Speech Using a Phoneme Boundary Estimation Network", pp. 135-136 (September 17, 1997)
IEEE Transactions on Signal Processing, Vol. 45, No. 11, November 1997, "Bidirectional Recurrent Neural Networks", pp. 2673-2681
International Joint Conference on Neural Networks, 1989, Vol. 2, "Parallelism, Hierarchy, Scaling in Time-Delay Neural Networks for Spotting Japanese Phonemes/CV-Syllables", pp. II-81 to II-88

(58) Fields searched (Int. Cl.⁷, DB name)
G10L 3/00 515
G10L 3/00 539
G10L 9/10 301
G06F 15/18 560
INSPEC (DIALOG)
JICST file (JOIS)
WPI (DIALOG)

Claims

(57) [Claims]

1. A phoneme boundary detection device that detects phoneme boundaries in a sequence of speech feature parameters using a bidirectional recurrent neural network comprising an input layer, an intermediate layer of at least one layer having a plurality of units, and an output layer that has one unit and outputs a phoneme boundary detection value representing a phoneme boundary detection probability, wherein the input layer comprises a first input neuron group that receives a plurality of speech feature parameters as inputs and has a plurality of units, a forward module, and a backward module; the forward module has a feedback connection in the temporally forward direction and, based on a plurality of speech feature parameters, generates a plurality of parameters for a time delayed by a predetermined unit time relative to the plurality of parameters output from the first input neuron group, and outputs them to the intermediate layer; and the backward module has a feedback connection in the temporally backward direction and, based on a plurality of speech feature parameters, generates a plurality of parameters for a time delayed by the predetermined unit time in the reverse direction relative to the plurality of parameters output from the first input neuron group, and outputs them to the intermediate layer.

2. The phoneme boundary detection device according to claim 1, wherein the forward module comprises: a second input neuron group that receives a plurality of speech feature parameters as inputs and has a plurality of units; a first intermediate neuron group that has a plurality of units and receives as inputs a plurality of parameters output from a second intermediate neuron group with a delay of a predetermined unit time; and the second intermediate neuron group, which has a plurality of units and is connected so that the plurality of parameters output from the second input neuron group and the plurality of parameters output from the first intermediate neuron group are each multiplied by respective weighting coefficients and input to it; the backward module comprises: a third input neuron group that receives a plurality of speech feature parameters as inputs and has a plurality of units; a third intermediate neuron group that has a plurality of units and receives as inputs a plurality of parameters output from a fourth intermediate neuron group with a delay of the predetermined unit time in the reverse direction; and the fourth intermediate neuron group, which has a plurality of units and is connected so that the plurality of parameters output from the third input neuron group and the plurality of parameters output from the third intermediate neuron group are each multiplied by respective weighting coefficients and input to it; the plurality of parameters output from the second intermediate neuron group are multiplied by respective weighting coefficients and input to the plurality of units of the intermediate layer; the plurality of parameters output from the first input neuron group are multiplied by respective weighting coefficients and input to the plurality of units of the intermediate layer; the plurality of parameters output from the fourth intermediate neuron group are multiplied by respective weighting coefficients and input to the plurality of units of the intermediate layer; and the plurality of parameters output from the intermediate layer are multiplied by respective weighting coefficients and input to the unit of the output layer.
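
For illustration only, the connections recited in claim 2 can be sketched as a forward pass in plain Python. Everything beyond the connection pattern is an assumption and is not stated in the claims: the tanh and sigmoid activation functions, the zero initial states, the absence of bias terms, and all function and weight names (for example `w_xf` for input-to-forward weights) are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply each input by its weighting coefficient and sum per unit."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(*vs: Vector) -> Vector:
    return [sum(items) for items in zip(*vs)]


def tanh_vec(v: Vector) -> Vector:
    return [math.tanh(x) for x in v]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def brnn_boundary_values(frames: List[Vector],
                         w_xf: Matrix, w_ff: Matrix,
                         w_xb: Matrix, w_bb: Matrix,
                         w_fh: Matrix, w_xh: Matrix, w_bh: Matrix,
                         w_hy: Vector) -> List[float]:
    """Return one phoneme boundary detection value per frame."""
    n = len(frames)
    # Forward module: the state at time t depends on the state at t - 1.
    fwd, state = [], [0.0] * len(w_ff)
    for x in frames:
        state = tanh_vec(add(matvec(w_xf, x), matvec(w_ff, state)))
        fwd.append(state)
    # Backward module: the state at time t depends on the state at t + 1.
    bwd, state = [None] * n, [0.0] * len(w_bb)
    for t in range(n - 1, -1, -1):
        state = tanh_vec(add(matvec(w_xb, frames[t]), matvec(w_bb, state)))
        bwd[t] = state
    # Intermediate layer combines both modules with the current frame;
    # the single output unit yields a value between 0 and 1.
    outputs = []
    for t in range(n):
        hidden = tanh_vec(add(matvec(w_fh, fwd[t]),
                              matvec(w_xh, frames[t]),
                              matvec(w_bh, bwd[t])))
        outputs.append(sigmoid(sum(w * h for w, h in zip(w_hy, hidden))))
    return outputs
```

Because the backward module runs from the last frame to the first, the value computed for an early frame can depend on later frames, which is the property that distinguishes a bidirectional network from a forward-only recurrent network.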

3. The phoneme boundary detection device according to claim 1, further comprising first detection means that detects a phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined threshold.

4. The phoneme boundary detection device according to claim 1 or 2, further comprising second detection means that detects a phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined threshold and reaches a maximum value.

5. The phoneme boundary detection device according to claim 1 or 2, further comprising third detection means that detects a first phoneme boundary when the phoneme boundary detection value output from the output layer is equal to or greater than a predetermined first threshold, and detects a second phoneme boundary when the phoneme boundary detection value is equal to or greater than a second threshold smaller than the first threshold, is less than the first threshold, and reaches a maximum value.

6. The phoneme boundary detection device according to claim 5, wherein the third detection means selects one phoneme boundary out of every predetermined plural number of the boundaries detected as first phoneme boundaries and takes the selected boundary as a first phoneme boundary.

7. The phoneme boundary detection device according to claim 5 or 6, wherein the third detection means detects phoneme boundaries based on a lattice of paths formed between the detected or selected first phoneme boundaries and the second phoneme boundaries.

8. A speech recognition device comprising: feature extraction means for extracting speech feature parameters from the speech signal of an input uttered sentence consisting of a character string; and speech recognition means for recognizing the speech signal of the input uttered sentence, based on the speech feature parameters extracted by the feature extraction means, using the phoneme boundaries detected by the phoneme boundary detection device according to any one of claims 1 to 7 and a predetermined acoustic model.