JPH06324695A

JPH06324695A - Voice recognizer

Info

Publication number: JPH06324695A
Application number: JP5111850A
Authority: JP
Inventors: Mitsuhiro Inazumi; 満広稲積
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1993-05-13
Filing date: 1993-05-13
Publication date: 1994-11-25

Abstract

(57) [Abstract]
[Purpose] To realize a speech recognition device that can handle unspecified speakers.
[Constitution] A speech recognition device comprising a feature extraction unit, a plurality of feature conversion units, matching determination units paired with the feature conversion units, an input generation unit, and a recognition unit.
[Effect] The speech of unspecified speakers can be recognized without speaker adaptation processing, which requires a very large amount of computation. As a result, the speech recognition device can be made very compact.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device, and in particular to a speech recognition device that recognizes the speech of unspecified speakers.

[0002]

[Prior Art] A speech recognition device consists broadly of two parts: a feature extraction unit and a recognition unit. Each has its own problems, and these problems are interrelated.

[0003] As for the recognition unit, the recognition methods in practical use in conventional speech recognition devices are the dynamic programming (DP) method, the hidden Markov model (HMM) method, and a method that combines back-propagation learning with a multilayer perceptron neural network (the MLP method). Details are given in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (IEICE), and Nakagawa, Shikano, and Tohkura, "Speech, Hearing, and Neural Network Models" (Ohmsha).

[0004] A problem common to the DP and HMM methods is that both the training data and the data to be recognized require a defined start point and end point. To make the processing appear independent of the start and end points, one can process every possible start and end point and find, by trial and error, the pair that gives the best result. However, consider detecting the portion of data that belongs to a certain category within a pattern of length N. There are on the order of N candidates for the start point and on the order of N candidates for the end point, so the number of start and end combinations is on the order of N squared. Recognition processing must then be performed for every one of this very large number of combinations, which makes the processing very time-consuming.
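The quadratic growth described above can be illustrated by counting candidate segments. This is a minimal sketch; the function names are hypothetical and not part of the original text.

```python
def count_segments(n):
    """Count every (start, end) pair with 0 <= start <= end < n by enumeration."""
    return sum(1 for start in range(n) for end in range(start, n))


def count_segments_closed_form(n):
    """Same count in closed form: n * (n + 1) / 2, i.e. on the order of n squared."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # An exhaustive search would run the recognizer once per candidate segment.
        print(n, count_segments_closed_form(n))
```

Doubling the pattern length roughly quadruples the number of segments that an exhaustive search would have to score.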

[0005] Beyond the quantitative problem of the number of combinations, the assumption that start and end points exist poses a more fundamental, qualitative problem. If the input data consists of exactly one item of a single category, the start and end points are self-evident. When data of more than one category occur in succession, however, the boundary between them is not self-evident. In time-series information such as speech, no such clear boundary exists; two consecutive categories of data are considered to change from one to the other through a transition region in which their information overlaps. Assuming start and end points for the data therefore raises a serious problem of accuracy.

[0006] The MLP method, the other conventional method, does not need to assume such start and end points. Instead, a new start and end problem arises: the range of the input data. The MLP method is essentially a method for recognizing static data. To recognize time-series data with it, the data within a certain time range must be presented as a single input so that time information is processed in an equivalent manner. Because of the structure of the MLP, this time range must be fixed.

[0007] The length of time-series data, on the other hand, often varies greatly between categories and even within a single category. Taking phonemes in speech processing as an example, the average lengths of long phonemes such as vowels and short phonemes such as plosives differ by a factor of 10 or more. Even within the same phoneme category, the length in actual speech varies by a factor of 2 or more. Suppose the data input range is set to the average length of all phonemes. When a short phoneme is to be recognized, the input data then contains a large amount of data other than the recognition target; conversely, when a long phoneme is to be recognized, the input data contains only part of the target data. Both cases reduce recognition performance. Setting a different input range for each phoneme does not solve the problem, because the length still varies within each phoneme category. Problems of this kind are common to time-series information in general.
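The mismatch between a fixed input window and variable phoneme lengths can be made concrete with a small calculation. This is an illustrative sketch only; the function name and the frame counts in the usage note are assumptions, not values from the original text.

```python
def window_coverage(window, phoneme):
    """For a best-aligned fixed window of `window` frames viewing a phoneme of
    `phoneme` frames, return a pair:
      (fraction of the window occupied by the target phoneme,
       fraction of the phoneme that falls inside the window)."""
    overlap = min(window, phoneme)
    return overlap / window, overlap / phoneme


if __name__ == "__main__":
    print(window_coverage(10, 2))   # short phoneme: window is mostly non-target data
    print(window_coverage(10, 20))  # long phoneme: only part of the target is seen
```

With a 10-frame window, a 2-frame phoneme fills only a fifth of the input, while a 20-frame phoneme is only half visible; both situations correspond to the two failure cases described above.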

[0008] In addition to the problems described above, an important issue in building a practical speech recognition device is how to accurately recognize the speech of many speakers with varied characteristics.

[0009] The main conventional approach to this issue is to prepare a recognition unit that corresponds to one standard set of features, and to convert the speaker-dependent features extracted by the feature extraction unit into the features of the standard speaker before use. This approach is called speaker adaptation. FIG. 3 shows it schematically: 301 denotes a feature extraction unit, 302 a feature conversion unit, 303 a recognition unit, and 304 an output unit. A concrete example of speaker adaptation processing, that is, of a method of constructing the feature conversion unit, is given in Furui, "Speaker Adaptation by Hierarchical Clustering of the Spectral Space" (IEICE Speech Study Group material SP88-21). To actually carry out such adaptation and obtain good results, however, a large amount of adaptation data and a large amount of computation are required.

[0010] Another approach to handling many speakers is to classify speakers with similar characteristics into several groups, prepare a recognition means for each group, and perform speech recognition with the recognition means of the group whose characteristics resemble those of the speaker to be recognized. FIG. 4 shows this schematically: 401 denotes a feature extraction unit, 402 a plurality of recognition units, 403 an output selection unit, and 404 an output. With this approach, at least the final recognition device needs neither the large amount of adaptation data required by speaker adaptation nor the adaptation processing itself.

[0011] However, because this approach has a plurality of recognition means, the system becomes large, and selecting the optimum recognition means is difficult.

[0012]

[Problems to Be Solved by the Invention] As described above, the conventional DP and HMM methods require a start point and an end point for the data they handle. Relaxing this restriction, at least in appearance, requires processing every combination of start and end points, which in turn requires a very large amount of processing.

[0013] The MLP method requires the start and end of the input range at training time. The problem common to all of these methods is that, for time-series information in general, the start and end of the data cannot in principle be made clear. Forcing such an assumption lowers recognition performance.

[0014] Furthermore, speaker adaptation for recognizing the data of unspecified speakers requires a large amount of adaptation data and processing. Speaker group selection makes the system large, and the selection means is difficult to realize.

[0015]

[Means for Solving the Problems] The present invention provides:

1) A speech recognition device comprising a feature extraction unit, a plurality of feature conversion units that receive the extracted features as input, matching determination units paired with the feature conversion units, an input generation unit that receives the outputs of the feature conversion units and the matching determination units as input, and a recognition unit that receives the generated input as input.

2) The speech recognition device according to 1), wherein the recognition unit comprises a neural network, and each neuron-like element constituting the neural network has internal state value storage means, internal state value updating means for updating the internal state value based on the internal state value stored in the internal state value storage means and the input values supplied to the neuron-like element, and output value generating means for converting the internal state value held by the internal state value storage means into an external output value.

3) The speech recognition device according to 1) or 2), wherein the matching determination unit has at least two feature input units, a feature mapping unit that maps and converts one of the input features, and a comparison unit that compares the other input feature with the mapped feature.

4) The speech recognition device according to any one of 1) to 3), wherein the feature mapping unit comprises a neural network, and each neuron-like element constituting the neural network has internal state value storage means, internal state value updating means for updating the internal state value based on the internal state value stored in the internal state value storage means and the input values supplied to the neuron-like element, and output value generating means for converting the internal state value held by the internal state value storage means into an external output value.

5) The speech recognition device according to any one of 1) to 4), wherein the internal state value updating means comprises weighted accumulation means that multiplies the internal state value and each of the input values by a weight and accumulates the products, the internal state value storage means comprises integrating means that integrates the value accumulated by the weighted accumulation means, and the output value generating means comprises output value limiting means that converts the value obtained by the integrating means into a value between a preset upper limit and a preset lower limit.

6) The speech recognition device according to any one of 1) to 5), wherein, when the internal state value of the i-th neuron-like element constituting the neural network is Xi, the n weighted external input values to that element are Zj (j is from 0 to n, where n is a non-negative integer), and the time constant is τi, the internal state value updating means updates the internal state value Xi to a value satisfying

[0016]

[Equation 2]

[0017]

7) The speech recognition device according to any one of 1) to 6), wherein the weighted external input values Zj include at least one of: the output of the i-th neuron-like element itself multiplied by a weight; the output of another neuron-like element constituting the neural network multiplied by a weight; an input value supplied from outside the neural network; and a fixed value multiplied by a weight.

8) The speech recognition device according to any one of 1) to 7), wherein the input generation unit selects one of the outputs of the plurality of feature conversion units that are input to it.

9) The speech recognition device according to any one of 1) to 7), wherein the input generation unit generates a linear combination of the outputs of the plurality of feature conversion units that are input to it.
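The data flow of items 1), 8), and 9) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the callables `converters`, `matchers`, and `recognizer` are hypothetical stand-ins, the matching units are assumed to return non-negative scores with a positive sum, and normalizing those scores into combination weights is an assumption (the text only says "linear combination").

```python
from typing import List, Sequence

Vector = List[float]


def generate_input(converted: Sequence[Vector], scores: Sequence[float],
                   mode: str = "select") -> Vector:
    """Input generation unit: pick the best-scoring converted feature (item 8)
    or blend all converted features linearly (item 9)."""
    if mode == "select":
        best = max(range(len(scores)), key=lambda k: scores[k])
        return list(converted[best])
    total = sum(scores)                      # assumed positive
    weights = [s / total for s in scores]    # assumed weighting scheme
    dim = len(converted[0])
    return [sum(w * v[d] for w, v in zip(weights, converted)) for d in range(dim)]


def recognize_frame(feature, converters, matchers, recognizer, mode="select"):
    """One frame through the pipeline: convert, judge the match, generate, recognize."""
    converted = [convert(feature) for convert in converters]
    # Assumption: each matching unit sees the raw feature and its paired conversion.
    scores = [match(feature, c) for match, c in zip(matchers, converted)]
    return recognizer(generate_input(converted, scores, mode))
```

Keeping the conversion units fixed and moving the per-speaker decision into `generate_input` reflects the stated aim of avoiding adaptation processing at recognition time.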

[0018]

[Examples]

(Embodiment 1) FIG. 5 shows the processing flow of the feature extraction unit. The input speech is digitized, partial data called frames are cut out of it at discrete intervals, and a feature vector is extracted from each frame by a method such as linear predictive analysis. In the following description, the sequence of feature vectors extracted in this way is treated as the output of the feature extraction unit.
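The framing and linear predictive analysis steps can be sketched as below. The text does not specify the frame length, hop size, analysis order, or windowing, so those are left as parameters; the autocorrelation method with the Levinson-Durbin recursion is one common way to perform linear predictive analysis and is assumed here for illustration.

```python
def split_frames(samples, frame_len, hop):
    """Cut discrete, possibly overlapping frames out of a digitized signal."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Levinson-Durbin recursion. Returns a[1..order] of the prediction-error
    filter 1 + a1*z^-1 + ..., so x[n] is predicted as -sum(a[j] * x[n-j])."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def extract_features(samples, frame_len, hop, order):
    """Sequence of feature vectors, one per frame."""
    return [lpc(frame, order) for frame in split_frames(samples, frame_len, hop)]
```

Each call to `extract_features` yields the sequence of feature vectors that the rest of the description treats as the output of the feature extraction unit.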

[0019] FIG. 1 is a schematic diagram of the configuration of the speech recognition device of the present invention. In the figure, 101 denotes the feature extraction unit described above, 102 the feature conversion units, 103 the matching determination units, 104 the input generation unit, and 105 the recognition unit.

[0020] The operation of the recognition unit of the present invention is described first. The recognition unit is composed of a neural network. This neural network differs from the conventional MLP method in that the neuron-like elements that constitute it (hereinafter simply called nodes) have the functions shown in FIG. 6. In the figure, 601 denotes the internal state value storage means, 602 the internal state value updating means, 603 the output value generating means, and 604 the node as a whole.

[0021] Let Xi be the internal state value of the i-th node, Zj the weighted input values supplied to it (j is an integer from 0 to n, where n is a non-negative integer), and τi its time constant. The operation of the internal state value updating means is then expressed, for example, by the following equation.

[0022]

[Equation 3]

[0023] A node whose operation is given by a differential equation in this way can, by itself, process time information. A more concrete example of the above equation is the following, where Yj is the output of the j-th node, Wij is the connection weight from the j-th node to the i-th node, θi is the bias value of the i-th node, and Di is the net external input value supplied to the i-th node.

[0024]

[Equation 4]
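Equation 4 is not reproduced in this text. As a rough sketch only, suppose it has the leaky-integrator form τi dXi/dt = -Xi + Σj Wij Yj + θi + Di. This form is an assumption chosen to be consistent with the symbols defined above and with the weighted-accumulation and integration structure of the node; it is not a reproduction of the original equation. The explicit Euler step, the step size `dt`, and the default output function are likewise illustrative choices.

```python
import math


def euler_step(x, w, theta, d, tau, dt, output=math.tanh):
    """One explicit-Euler step of the ASSUMED node dynamics
        tau[i] * dX[i]/dt = -X[i] + sum_j w[i][j] * Y[j] + theta[i] + d[i]
    with Y[j] = output(X[j]). `output` is a stand-in for the output value
    generating means."""
    y = [output(v) for v in x]
    new_x = []
    for i in range(len(x)):
        drive = sum(w[i][j] * y[j] for j in range(len(x))) + theta[i] + d[i]
        new_x.append(x[i] + dt / tau[i] * (-x[i] + drive))
    return new_x


if __name__ == "__main__":
    # A single unconnected node driven by a constant input relaxes toward it.
    state = [0.0]
    for _ in range(2000):
        state = euler_step(state, [[0.0]], [0.0], [1.0], [1.0], 0.01)
    print(state)
```

Because the state `x` carries over from frame to frame, a node of this kind retains a trace of earlier input, which is the sense in which a single node can process time information.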

[0025] For computational purposes, the bias value θi can be absorbed into Wij as a weighted connection from a node with a fixed output value. In a speech recognition device, the values supplied as Di are the components of the feature vector described above. These values are of course supplied only to those nodes of the neural network that are assigned as input nodes.

[0026] The relationship between Xi and Yi mentioned above can be given by a simple linear function, a threshold function, a nonlinear function, or the like. This embodiment uses the sign-symmetric sigmoid function (also called the logistic function) given by the following equation.

[0027]

[Equation 5]
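Equation 5 is not reproduced in this text. One function that matches the description (a sigmoid symmetric about zero with outputs between -1 and 1) is Y = 2 / (1 + exp(-X)) - 1, which is identical to tanh(X / 2). The sketch below assumes that form; the exact scaling used in the original may differ.

```python
import math


def symmetric_sigmoid(x):
    """ASSUMED form of the sign-symmetric sigmoid:
    Y = 2 / (1 + exp(-X)) - 1, computed as tanh(X / 2) to avoid overflow."""
    return math.tanh(x / 2.0)


if __name__ == "__main__":
    for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(value, symmetric_sigmoid(value))
```

The function acts as the output value limiting means: whatever the internal state value, the output stays between the lower limit -1 and the upper limit 1.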

[0028] For a neural network to have a desired function, learning processing is necessary. In the above example the parameters to be adjusted are the connection weights Wij. The learning processing for nodes of the kind described above, that is, the computation of the correction to Wij, can be expressed, for example, as follows.

[0029]

[Equation 6]

[0030] Here Ci is a learning evaluation value, and Ej is an error evaluation value obtained from the difference between the desired output (hereinafter called the teacher output) and the actual output. With the teacher output denoted Ti and the actual output Yi, this error evaluation value can be given, for example, by the Kullback-Leibler distance of the following equation.

[0031]

[Equation 7]

[0032] The above equation is for an output range of 0 to 1. When the output range is -1 to 1, as in this embodiment, it is expressed by the following equivalent equation.

[0033]

[Equation 8]
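Equations 7 and 8 are not reproduced in this text. The sketch below uses the standard two-class Kullback-Leibler form for outputs in the range 0 to 1, and the form obtained from it by the substitution T -> (1 + T) / 2, Y -> (1 + Y) / 2 for outputs in the range -1 to 1. Both forms are assumptions based on the usual definition, and the function names are hypothetical. Outputs are assumed to lie strictly inside their range.

```python
import math


def _term(p, q):
    """p * log(p / q), with the usual convention that the term is 0 when p is 0."""
    return 0.0 if p == 0.0 else p * math.log(p / q)


def kl_unit_range(teacher, output):
    """ASSUMED form for outputs in (0, 1):
    E = sum_i [ T log(T / Y) + (1 - T) log((1 - T) / (1 - Y)) ]."""
    return sum(_term(t, y) + _term(1.0 - t, 1.0 - y)
               for t, y in zip(teacher, output))


def kl_symmetric_range(teacher, output):
    """ASSUMED equivalent form for outputs in (-1, 1):
    E = 1/2 sum_i [ (1 + T) log((1 + T) / (1 + Y)) + (1 - T) log((1 - T) / (1 - Y)) ]."""
    return 0.5 * sum(_term(1.0 + t, 1.0 + y) + _term(1.0 - t, 1.0 - y)
                     for t, y in zip(teacher, output))
```

Rescaling the symmetric-range values into the unit range and applying `kl_unit_range` gives the same number as `kl_symmetric_range`, which is the sense in which the two expressions are equivalent.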

[0034] For a system coupled by the connection weights Wij as described above, the equation for the learning evaluation value can be written concretely as follows, using the Kullback-Leibler distance as the error evaluation function.

[0035]

[Equation 9]

[0036] Once the outputs Yj and the learning evaluation values Ci have been obtained by the processing described above, the correction to the connection weights, that is, the learning rule, is expressed by the following equation.

[0037]

[Equation 10]

[0038] With a learning rule of this kind, the neural network can be built with any connection topology, with layered connection and full interconnection as special cases. Furthermore, unlike the conventional MLP method, an input node and an output node may be one and the same node.

[0039] In the example described above, the functions shown in FIG. 6 can also be represented as in FIG. 7. In the figure, 701 denotes data input means, 702 weighted accumulation means, and 703 integrating means; together these constitute the internal state value updating means and the internal state value storage means. 704 denotes output value limiting means, such as a sigmoid function or a threshold function, serving as the output value generating means. These functions can also be realized in hardware using operational amplifiers and the like, as shown in FIG. 8. In the figure, 801 denotes the data input means and weighted accumulation means, 802 the integrating means, and 803 the output value limiting means. FIG. 8 is only an example, and the configuration is not limited to it; for instance, the integrating means and the output value limiting means may be interchanged.

[0040] FIG. 9 shows the configuration of a simple speech recognition device built from such nodes. As the recognition means, the figure illustrates a neural network of six nodes that are fully and asymmetrically interconnected and also have self-feedback connections.

[0041] As noted above, learning processing is necessary for a neural network to have a given function, and the basic algorithm for that processing has already been described. What data to train on, however, is a question of a different kind from the learning algorithm, and it matters more for giving the neural network the desired capability.

[0042] In this embodiment, each neural network of the kind shown in FIG. 9 is made to recognize data of one category. Data of the category assigned to a neural network is called positive data, and all other data is called negative data. The input data used for training consisted of two such data items chained together. Two outputs were provided: a positive output and a negative output.

[0043] The correspondence between the training outputs and the input data is as shown in FIG. 15. When positive data is input, the output drawn with a solid line (the positive output) takes a large value and the output drawn with a broken line (the negative output) takes a small value. When the input is negative data, the reverse holds. Where data items follow one another, the transition between them was defined, as shown in the figure, under the condition that the data value and its time derivative are continuous. One frame of input data corresponds to one frame of output data; the data of several frames are not input simultaneously as in the conventional MLP method.

[0044] There are two reasons for using inputs and outputs like those in FIG. 15. The first is that, because the operation of this neural network is expressed by a first-order differential equation, some initial value is required. In the example in the figure, operation always starts with the origin as the initial value, but for the later item in a chain, the state determined by the earlier item becomes the initial value. By using many different combinations of chained data, the neural network therefore learns the desired operation for a variety of initial values. The second reason concerns the transition: even when a given item of data is input, its category is not yet determined at the moment the input begins. For example, when the sound /i/ is input, it is not yet determined whether it is the /i/ of "ikioi" or the /i/ of "iyoiyo". For this reason, the output transition was provisionally chosen to be symmetric about the midpoint of the data.
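One way to build teacher outputs with these properties is sketched below. FIG. 15 is not available in this text, so the raised-cosine transition, the target levels `high` and `low`, and the function name are all assumptions; they are chosen only so that the targets start at the origin, change smoothly, have zero slope at the item boundaries, and are symmetric about the midpoint of each item.

```python
import math


def teacher_outputs(labels, frames_per_item, high=0.9, low=-0.9):
    """Per-frame (positive, negative) teacher outputs for chained data items.
    labels: sequence of booleans, True for positive data, False for negative data."""
    positive, negative = [], []
    prev_pos = prev_neg = 0.0  # operation starts from the origin
    for is_positive in labels:
        goal_pos, goal_neg = (high, low) if is_positive else (low, high)
        for f in range(1, frames_per_item + 1):
            # Raised cosine: rises from 0 to 1, symmetric about the item midpoint,
            # with zero slope at both ends so the derivative stays continuous.
            s = 0.5 * (1.0 - math.cos(math.pi * f / frames_per_item))
            positive.append(prev_pos + s * (goal_pos - prev_pos))
            negative.append(prev_neg + s * (goal_neg - prev_neg))
        prev_pos, prev_neg = goal_pos, goal_neg
    return positive, negative


if __name__ == "__main__":
    pos, neg = teacher_outputs([True, False], 20)
    print(pos[19], neg[19], pos[39], neg[39])
```

Training on many such two-item chains (positive then negative, negative then positive, and so on) exposes the network to a variety of initial states for the second item, as described above.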

[0045] Training of this kind makes possible recognition such as that shown in FIG. 19, for example. In the example of FIG. 19, positive data, the recognition target, is embedded among several items of negative data. Although the training data consisted only of chains of two items, positive data can be recognized within data of arbitrary length, as this example shows. The effects of this recognition method are summarized as follows.

[0046] 1) There is no need to give the start and end of the data strictly and explicitly for training, as in the conventional examples. Consequently, whereas recognition in the conventional examples required time proportional to the square of the data length, processing can now be performed in time proportional to the length.

[0047] 2) Only the input data at the current instant needs to be processed, so very little memory is required.

[0048] 3) The result needs no correction for input length.

[0049] 4) As a consequence of the above, continuous processing is easy.

[0050] 5) The start and end points of the categorized data have little effect on the training result.

[0051] 6) Simply by training on various combinations of data from two categories, the ability to recognize combinations of any number of categories is obtained.

[0052] To apply this to input from unspecified speakers as well, the present invention has the feature conversion units and matching determination units shown in FIG. 1.

[0053] The concept behind the operation of the feature conversion unit is explained with reference to FIG. 10. The ellipse 1001 represents the spread of the features of the input speaker, and the ellipse 1002 represents the spread of the features of the speaker embodied in the recognition unit, that is, the speaker used to train it (hereinafter called the standard speaker). If these spreads coincide, the speech of the input speaker can be recognized accurately by recognition means based on the standard speaker. If the spreads do not overlap, as in FIG. 10 a), recognition accuracy falls.

[0054] As shown schematically in FIG. 10 b) and c), speaker adaptation processing attempts to improve recognition accuracy by converting the features of the input speaker into the corresponding features of the standard speaker through some feature conversion operation. Various feature conversions are conceivable, but a common one is a linear transformation such as the following.

[0055]

[Equation 11]
【００５６】ここでＵｊは入力話者のある音声特徴ベク
トルＵのｊ番目の成分を示し、Ｖｉは入力特徴ベクトル
Ｕに対応する標準話者の特徴ベクトルＶのｉ番目の成分
を示し、Ｔｉｊはベクトルの回転、縮小、拡大等の操作
を表す変換行列のｉｊ成分を示し、Ｇｉはベクトルの移
動操作を表す移動ベクトルＧのｉ番目の成分を示す。話
者適応処理を行うと言う事は、最適な結果が得られるよ
うに上のＴｉｊやＧｉを決定する事である。しかしこれ
を正確に実行するためには大量の適応データとその処理
が必要となる。また適応データが小量である場合は、結
果はむしろ悪くなる場合も多い。Here, Uj denotes the j-th component of a speech feature vector U of the input speaker, Vi denotes the i-th component of the standard speaker's feature vector V corresponding to U, Tij denotes the ij component of the transformation matrix representing operations such as rotation, reduction, and enlargement of the vector, and Gi denotes the i-th component of the movement vector G representing a shift of the vector. Performing speaker adaptation means determining Tij and Gi so that the best result is obtained. Doing this accurately, however, requires a large amount of adaptation data and a large amount of processing. When only a small amount of adaptation data is available, the result often becomes worse rather than better.
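As an illustration of what "determining Tij and Gi" can involve, the sketch below fits an affine conversion by ordinary least squares from paired feature frames. The patent does not say how conventional adaptation estimates these parameters; the least-squares fit, the paired-frame assumption, and the names `convert` and `fit_affine` are all assumptions made for this example.

```python
import numpy as np

def convert(U, T, G):
    """Apply V_i = sum_j T_ij U_j + G_i to each row of U (frames x dims)."""
    return U @ T.T + G

def fit_affine(U, V):
    """Least-squares estimate of T and G from paired frames U[n] -> V[n].

    Assumes each input-speaker frame has a known standard-speaker counterpart.
    """
    ones = np.ones((U.shape[0], 1))
    A = np.hstack([U, ones])                   # constant column carries G
    W, *_ = np.linalg.lstsq(A, V, rcond=None)  # W has shape (dims + 1, dims)
    return W[:-1].T, W[-1]

# Synthetic, noise-free data purely for demonstration.
rng = np.random.default_rng(0)
T_true = np.array([[1.2, 0.1], [-0.3, 0.8]])
G_true = np.array([0.5, -1.0])
U = rng.normal(size=(200, 2))
V = convert(U, T_true, G_true)
T_est, G_est = fit_affine(U, V)
```

Even in this toy setting the fit needs a full set of paired frames, which hints at why the text describes adaptation as data-hungry.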

【００５７】一方本発明は、図１に示したように予め処
理された複数個の特徴変換操作を持つものである。この
特徴変換部は固定されているので、データ量、処理量の
大きい適応処理を都度行う必要はない。またこのような
処理を認識部と分離する事により、認識部の機能とは独
立した処理が可能となる。In contrast, as shown in FIG. 1, the present invention holds a plurality of feature conversion operations prepared in advance. Because these feature conversion units are fixed, there is no need to run adaptation with large data and processing requirements each time. Moreover, separating this processing from the recognition unit allows it to be carried out independently of the recognition unit's function.

【００５８】このような処理システムの性能の良否は、
実際に入力されているデータを、予め持っている複数個
の特徴変換部のどれにより変換すれば良いか、あるいは
その変換された特徴を組み合わせてどのような入力を生
成すれば良いか、と言う事を判断する適合判断部、入力
生成部の能力によって決まる。この適合判定部は種々の
ものが考えられるが、本実施例では図２に模式的に示す
ようなものを考える。図中の２０１、２０２は特徴入力
部を、２０３は入力された特徴の１つを変換、写像する
特徴写像部を、２０４は変換された特徴と、入力された
特徴との比較部を示す。The performance of such a system is determined by the capabilities of the matching determination units and the input generation unit, which decide which of the prepared feature conversion units should convert the data actually being input, or how the converted features should be combined to generate the input to the recognition unit. Various matching determination units are conceivable; this embodiment considers the one shown schematically in FIG. 2. In the figure, 201 and 202 are feature input units, 203 is a feature mapping unit that converts and maps one of the input features, and 204 is a comparison unit that compares the converted feature with the other input feature.

【００５９】また本実施例においては、図１１に示すよ
うに特徴写像部を認識手段と同様のニューラルネットワ
ークによるものとして例示する。図１１において番号１
１０１は入力特徴ベクトル列を、１１０２は特徴写像部
を構成するニューラルネットワークを、１１０３は比較
部を模式的に示す。更に図１１においては、図２の２０
１に示した特徴入力部１に対し、２０２の特徴入力部２
に与えられる入力は、特徴入力部１に入力したデータの
一つ前のデータである場合を例示してある。つまり図１
１の例における比較部の入力は、特徴写像部により変換
されたある時点での特徴ベクトルと、その１つ前の特徴
ベクトルとである。仮に、これらの差が小さくなるよう
に特徴写像部を構成するという事を考えると、これはあ
る時点での特徴ベクトルを基に、それよりも以前のデー
タを復元するという事である。同様に特徴入力部２に与
えられるデータが、特徴入力部１に与えられるデータよ
りも後のデータであるとすると、これはある時点でのデ
ータを基にそれよりも後のデータを予測すると言う事で
ある。またこの二つのデータ入力が同一である場合はデ
ータの単純写像を行うと言う事である。In this embodiment, as shown in FIG. 11, the feature mapping unit is, like the recognition means, implemented as a neural network. In FIG. 11, 1101 denotes the input feature vector sequence, 1102 the neural network forming the feature mapping unit, and 1103 the comparison unit. FIG. 11 further illustrates the case in which the input given to feature input unit 2 (202 in FIG. 2) is the data one step before the data given to feature input unit 1 (201 in FIG. 2). That is, in the example of FIG. 11 the inputs to the comparison unit are the feature vector at a given time after conversion by the feature mapping unit, and the feature vector one step earlier. If the feature mapping unit is built so that the difference between these becomes small, it restores earlier data from the feature vector at a given time. Likewise, if the data given to feature input unit 2 is later than the data given to feature input unit 1, the unit predicts later data from the data at a given time. If the two inputs are identical, the unit performs a simple mapping of the data.
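The restoration variant described above can be sketched as follows. The patent implements the feature mapping unit as a neural network; this example swaps in a plain linear map fitted by least squares so that it stays short and self-contained. The toy "speakers" (noisy rotating 2-D trajectories) and the names `fit_restorer`, `restoration_error`, and `simulate` are invented for illustration.

```python
import numpy as np

def _with_bias(frames):
    """Current frames (t = 1..N-1) with a constant column appended."""
    return np.hstack([frames[1:], np.ones((len(frames) - 1, 1))])

def fit_restorer(frames):
    """Fit a linear map that restores frame t-1 from frame t.

    Stand-in for the feature mapping unit (a neural network in the patent).
    """
    W, *_ = np.linalg.lstsq(_with_bias(frames), frames[:-1], rcond=None)
    return W

def restoration_error(frames, W):
    """Comparison unit: absolute difference between the mapped current
    frame and the actual previous frame, summed over dimensions."""
    return np.abs(_with_bias(frames) @ W - frames[:-1]).sum(axis=1)

def simulate(angle, n, rng):
    """Toy 'speaker': a noisy rotating 2-D feature trajectory."""
    c, s = np.cos(angle), np.sin(angle)
    A = 0.95 * np.array([[c, -s], [s, c]])
    x, out = np.zeros(2), []
    for _ in range(n):
        x = A @ x + 0.1 * rng.normal(size=2)
        out.append(x)
    return np.array(out)

rng = np.random.default_rng(0)
W_a = fit_restorer(simulate(0.3, 500, rng))    # unit trained on speaker A
W_b = fit_restorer(simulate(-0.5, 500, rng))   # unit trained on speaker B
test_a = simulate(0.3, 200, rng)               # fresh data from speaker A
err_own = restoration_error(test_a, W_a).mean()
err_other = restoration_error(test_a, W_b).mean()
```

In this toy setup the unit trained on the same dynamics restores the previous frame with a smaller error, which is the quantity the comparison unit would report as the degree of matching.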

【００６０】このような特徴ベクトルの予測、単純写
像、復元の能力は、その能力の学習に用いた話者の特徴
データに依存する。従って、このような処理により、入
力されている特徴ベクトル列が学習されたものに近いか
どうかと言う適合度を判定する事ができる。特に予測、
復元を用いた方法は、静的な特徴に加えて動的な特徴も
明示的に学習させるためにより有効である。また、音声
について言えば、ある時点での特徴はそれに後続する特
徴の影響を受ける割合が高いので、復元処理がより精度
の高い結果を与える場合もある。The ability to predict, simply map, or restore feature vectors in this way depends on the feature data of the speaker used to train it. Such processing can therefore determine a degree of matching, that is, how close the input feature vector sequence is to what was learned. Methods using prediction or restoration are particularly effective because they explicitly learn dynamic features in addition to static ones. Moreover, in speech a feature at a given time is strongly influenced by the features that follow it, so restoration may give more accurate results.

【００６１】図１２、図１３は２人の話者ＭＡＵとＭＸ
Ｍを学習話者としてデータの復元により話者適合判定部
を構成し、判定を行った例である。横軸は時間であり、
図中の実線はＭＡＵを学習話者とする適合判定部の出力
を、また破線はＭＸＭを学習話者とする適合判定部の出
力を示す。また縦軸は比較部による特徴ベクトル間の差
の絶対値を示したものである。図より明かであるよう
に、それぞれの話者の入力により、それぞれの話者で学
習された適合判定部が他方よりも小さなデータ復元誤差
を示す。つまりより良く適合していると言う事を示す。
しかもこの結果はデータの入力の開始と同時に得られて
おり、つまり非常に小量のデータで適合度の判定が可能
である事を示している。FIGS. 12 and 13 show an example in which speaker matching determination units were built by data restoration with two speakers, MAU and MXM, as training speakers, and a determination was made. The horizontal axis is time. The solid line shows the output of the matching determination unit trained on MAU, and the broken line shows the output of the matching determination unit trained on MXM. The vertical axis shows the absolute value of the difference between feature vectors as computed by the comparison unit. As the figures show, for input from each speaker, the matching determination unit trained on that speaker gives a smaller data restoration error than the other unit; that is, it matches better. Moreover, this result appears as soon as data input begins, which shows that the degree of matching can be determined from a very small amount of data.

【００６２】上の例と同様の処理を未知話者を含めた場
合に適用した例が下の表である。表中の数字は学習に用
いた単語列を入力した場合のデータの平均復元誤差を任
意単位で示したものである。The table below shows an example in which the same processing as the above example is applied when an unknown speaker is included. The numbers in the table show the average restoration error of the data when the word string used for learning is input, in arbitrary units.

【００６３】[0063]

【表１】 [Table 1]

【００６４】表２は表１と同様のものであるが、入力し
ている音声データは学習に用いたデータ以外、つまりシ
ステムにとっては未知であるデータを入力とした場合の
例である。Table 2 is the same as Table 1 except that the input voice data is data other than the data used for learning, that is, data unknown to the system is input.

【００６５】[0065]

【表２】 [Table 2]

【００６６】この例においても上で述べてきた例と同様
に合理的な結果が得られている。つまり、この適合判定
部は学習した入力データに特有の情報ではなく、話者の
特性に対しての応答をしていると考えられる。更に上の
２表の値については類似性が見られ、この値が話者間の
類似性の指標となる事が考えられる。In this example too, reasonable results are obtained, as in the examples above. That is, the matching determination unit appears to respond to the characteristics of the speaker rather than to information specific to the training data. Furthermore, the values in the two tables above are similar to each other, which suggests that this value can serve as an index of similarity between speakers.

【００６７】図１６、図１７は認識部を話者ＭＡＵのデ
ータにより学習させ、また特徴変換部としては、ＭＸＭ
の特徴をＭＡＵの特徴へ変換、またＭＡＵの特徴をＭＡ
Ｕの特徴へ変換する場合の２つを用意した例である。以
下これらの変換部をＭＸＭ−ＭＡＵ変換部のように書
く。この場合ＭＡＵ−ＭＡＵ変換部は恒等変換である。
またこの例の場合の入力話者はＭＭＹであり、発話中に
は認識すべきデータが１つだけ含まれるとする。図１６
の例は比較のために、入力されたデータにＭＸＭ−ＭＡ
Ｕ変換を施す事により入力を生成し認識部へ入力した例
である。図より明かであるように２つの認識出力が得ら
れ誤認識が見られる。FIGS. 16 and 17 show an example in which the recognition unit was trained on data from speaker MAU and two feature conversion units were prepared: one converting MXM's features to MAU's features, and one converting MAU's features to MAU's features. These are written below as, for example, the MXM-MAU conversion unit. The MAU-MAU conversion unit is the identity conversion. The input speaker in this example is MMY, and the utterance contains exactly one item to be recognized. For comparison, FIG. 16 shows the case in which the input to the recognition unit was generated by applying the MXM-MAU conversion to the input data. As the figure shows, two recognition outputs are produced, so a misrecognition occurs.

【００６８】一方本発明によれば、先の表よりも明かで
あるように、適合判定部により話者ＭＭＹはＭＸＭより
もＭＡＵに類似していると判断される。従って入力デー
タにＭＡＵ−ＭＡＵ変換（この実施例の場合は恒等変
換）を施して認識部へ入力される。図１７はその結果で
あり、期待される認識結果が得られている。On the other hand, according to the present invention, as is clear from the above table, the matching determination unit determines that the speaker MMY is more similar to MAU than MXM. Therefore, the input data is subjected to MAU-MAU conversion (identity conversion in this embodiment) and input to the recognition unit. FIG. 17 shows the result, and the expected recognition result is obtained.
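The selection step of Embodiment 1 amounts to picking the conversion unit whose paired matching determination unit reports the smallest error. A minimal sketch follows; the error values are made up for illustration (they are not the values of Tables 1 and 2, which are not reproduced here), and `select_conversion` and `generate_input` are hypothetical names.

```python
import numpy as np

def select_conversion(errors):
    """Index of the conversion unit whose matching unit reports the smallest error."""
    return int(np.argmin(errors))

def generate_input(frames, conversions, errors):
    """Apply only the best-matching conversion (T, G) to every frame."""
    T, G = conversions[select_conversion(errors)]
    return frames @ T.T + G

# Two hypothetical conversion units: an arbitrary affine map and the identity.
conversions = [
    (np.array([[0.9, 0.2], [0.0, 1.1]]), np.array([0.3, -0.2])),
    (np.eye(2), np.zeros(2)),
]
errors = [0.8, 0.3]                       # made-up mean restoration errors
frames = np.array([[1.0, 2.0], [3.0, 4.0]])
recognizer_input = generate_input(frames, conversions, errors)
```

Here the second unit has the smaller error, so the frames pass through unchanged, mirroring the MAU-MAU (identity) case described in the text.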

【００６９】（実施例２）上に示した実施例１の入力生
成部の動作は、最も適合度の高いデータ変換部をただ一
つ選択すると言うものであったが、他の動作を考える事
も可能である。(Embodiment 2) In Embodiment 1 above, the input generation unit simply selected the single conversion unit with the highest degree of matching, but other operations are also possible.

【００７０】例えば図１４に示したような場合を考え
る。図中の番号１４０１、１４０２、１４０３は予め用
意された特徴変換部の学習に用いた話者の特徴の広がり
を示すとする。また斜線を施した番号１４０４は入力さ
れる話者の特徴の広がりを示すとする。この例の場合、
実施例１の処理に従えば、１４０１に示されたデータに
対応する特徴変換部が選択される事になる。しかし、１
４０１と入力話者の類似度がそれほど近くない場合には
この処理では依然として期待される結果が得られない場
合がある。図１８はそのような場合の認識例であり、認
識すべきデータを全く検出していない。Consider, for example, the case shown in FIG. 14. In the figure, 1401, 1402, and 1403 denote the feature spreads of the speakers used to train the prepared feature conversion units, and the hatched region 1404 denotes the feature spread of the input speaker. In this example, the processing of Embodiment 1 would select the feature conversion unit corresponding to the data shown as 1401. However, if 1401 is not very similar to the input speaker, this processing may still fail to give the expected result. FIG. 18 shows a recognition example in such a case: the data to be recognized is not detected at all.

【００７１】個の場合入力生成部の処理として、図１４
の矢印で示したように、それぞれの特徴変換部と入力デ
ータとの類似度に従って、変換されたそれぞれのデータ
を組み合わせる事により、入力話者の近似の精度を向上
させる事ができる。この組み合わせは例えば次の式のよ
うな線形結合で実現できる。In this case, as indicated by the arrows in FIG. 14, the input generation unit can improve the approximation of the input speaker by combining the converted data from each feature conversion unit according to its similarity to the input data. The combination can be realized, for example, as a linear combination of the following form.

【００７２】[0072]

【数１２】 [Equation 12]
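The image for Equation 12 is likewise missing from this text. Given the definitions in paragraph [0073] (Tijk and Gik for the k-th conversion unit, ρk the weight of the k-th converted feature) and the form of Equation 11, a consistent reading of the linear combination would be the following. It is a reconstruction, not the published formula.

```latex
V_i = \sum_{k} \rho_k \left( \sum_{j} T_{ijk}\, U_j + G_{ik} \right)
```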

【００７３】ここでＴｉｊｋ、Ｇｉｋはｋ番目の変換部
を構成する変換行列、移動ベクトルをそれぞれ示す。ま
た、ρｋはｋ番目の変換の結果得られる特徴の重みを示
す。この重みは、例えば先の表に示したデータの復元誤
差から定数を引き、それの逆数を全体の和が１になるよ
うに正規化する事により得られる。図１９はその結果で
あり、期待される認識結果が得られている。Here, Tijk and Gik denote the transformation matrix and movement vector of the k-th conversion unit, and ρk denotes the weight given to the feature produced by the k-th conversion. The weights can be obtained, for example, by subtracting a constant from each data restoration error shown in the tables above, taking the reciprocal, and normalizing so that the weights sum to 1. FIG. 19 shows the result: the expected recognition result is obtained.
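The weighting rule just described (subtract a constant, take the reciprocal, normalize) and the resulting linear combination can be sketched as below. The patent does not state the value of the constant, so `offset` is a free parameter here; the function names and the sample errors are likewise assumptions for illustration.

```python
import numpy as np

def weights_from_errors(errors, offset=0.0):
    """rho_k proportional to 1 / (error_k - offset), normalized to sum to 1."""
    shifted = np.asarray(errors, dtype=float) - offset
    if np.any(shifted <= 0):
        raise ValueError("offset must be smaller than every error")
    inv = 1.0 / shifted
    return inv / inv.sum()

def combine(U, conversions, rho):
    """Weighted sum over k of (T_k U + G_k) for one feature vector U."""
    return sum(r * (T @ U + G) for r, (T, G) in zip(rho, conversions))

# Made-up restoration errors for three conversion units.
rho = weights_from_errors([2.0, 3.0, 6.0], offset=1.0)

# With identity conversions the combination must return U unchanged,
# because the weights sum to 1.
identity = [(np.eye(2), np.zeros(2))] * 3
U = np.array([0.4, -1.5])
V = combine(U, identity, rho)
```

Smaller restoration errors yield larger weights, so the best-matching conversion dominates while the others still contribute.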

【００７４】[0074]

【発明の効果】以上をまとめると本発明の効果として、１）、従来例のようにデータの始端終端を、厳密に、ま
た明示的に与えて学習させる必要がない。従って、従来
例においては認識処理時においてデータの長さのの自剰
に比例する時間が必要であったものが、長さに比例する
時間で処理を行う事が可能となる。To summarize, the effects of the present invention are as follows. 1) Unlike the conventional examples, there is no need to supply the start and end of the data strictly and explicitly for training. Consequently, whereas the conventional examples needed recognition time proportional to the square of the data length, processing can be done in time proportional to the length.

【００７５】２）、その瞬間の入力データを処理するだ
けで良いので、メモリーが非常に少なくても良い。2) Since only the input data at that moment needs to be processed, the memory may be very small.

【００７６】３）、結果の入力長さについての補正が必
要でない。3) The results need no correction for input length.

【００７７】４）、以上により、容易に連続処理が可能
である。4) As described above, continuous processing can be easily performed.

【００７８】５）、範疇分類されたデータの始端終端
が、学習結果に与える影響が小さい。5) The start end and end of the classified data have little influence on the learning result.

【００７９】６）、二つの範疇のデータの様々な組み合
わせを学習させるだけで、任意の個数の範疇の組み合わ
せに対する認識能力を得る事ができる。6) Only by learning various combinations of two categories of data, it is possible to obtain recognition ability for an arbitrary number of combinations of categories.

【００８０】７）、最終的な認識装置において処理量の
大きな話者適応処理が不要である。7) In the final recognition device, speaker adaptation processing with a large amount of processing is unnecessary.

【００８１】８）、話者の適合判定を非常に小量のデー
タで行う事ができる。8) It is possible to judge the suitability of the speaker with a very small amount of data.

【００８２】９）、以上の結果、不特定話者に対応する
システムを小さくする事ができる。9) As a result of the above, a system that handles unspecified speakers can be made small.

【００８３】等の効果がある。These and similar effects are obtained.

【００８４】これらの効果は音声認識に限らず、オンラ
イン文字認識、あるいはより一般的な時系列データ処理
へ応用が可能なものである。These effects can be applied not only to voice recognition but also to online character recognition or more general time series data processing.

[Brief description of drawings]

【図１】本発明の音声認識装置を構成する要素のブロ
ック図である。FIG. 1 is a block diagram of elements constituting a voice recognition device of the present invention.

【図２】本発明の音声認識装置の一部である適合判定
部を構成する要素のブロック図である。FIG. 2 is a block diagram of elements constituting a matching determination unit which is a part of the voice recognition device of the present invention.

【図３】従来の音声認識装置を構成する要素のブロッ
ク図である。FIG. 3 is a block diagram of elements constituting a conventional voice recognition device.

【図４】従来の音声認識装置を構成する要素のブロッ
ク図である。FIG. 4 is a block diagram of elements constituting a conventional voice recognition device.

【図５】音声認識処理の流れを示す模式図である。FIG. 5 is a schematic diagram showing a flow of voice recognition processing.

【図６】本発明におけるニューラルネットワークを構
成するノードのブロック図である。FIG. 6 is a block diagram of nodes constituting a neural network according to the present invention.

【図７】本発明におけるニューラルネットワークを構
成するノードのブロック図である。FIG. 7 is a block diagram of nodes constituting a neural network according to the present invention.

【図８】本発明におけるニューラルネットワークを構
成するノードのハードウェア構成の模式図である。FIG. 8 is a schematic diagram of a hardware configuration of a node configuring a neural network according to the present invention.

【図９】ニューラルネットワークを用いた音声認識装
置の模式図である。FIG. 9 is a schematic diagram of a voice recognition device using a neural network.

【図１０】話者適応処理を示す模式図である。FIG. 10 is a schematic diagram showing speaker adaptation processing.

【図１１】本発明の適合判定処理を示す模式図であ
る。FIG. 11 is a schematic diagram showing a conformity determination process of the present invention.

【図１２】本発明の適合判定処理による出力例を示す
図である。FIG. 12 is a diagram showing an output example by the conformity determination processing of the present invention.

【図１３】本発明の適合判定処理による出力例を示す
図である。FIG. 13 is a diagram showing an output example by the conformity determination processing of the present invention.

【図１４】話者適応処理を示す模式図である。FIG. 14 is a schematic diagram showing speaker adaptation processing.

【図１５】本発明のニューラルネットワークを用いた
認識部の学習用出力例を示す図である。FIG. 15 is a diagram showing a learning output example of a recognition unit using the neural network of the present invention.

【図１６】本発明の認識例を示す図である。FIG. 16 is a diagram showing a recognition example of the present invention.

【図１７】本発明の認識例を示す図である。FIG. 17 is a diagram showing a recognition example of the present invention.

【図１８】本発明の認識例を示す図である。FIG. 18 is a diagram showing a recognition example of the present invention.

【図１９】本発明の認識例を示す図である。FIG. 19 is a diagram showing a recognition example of the present invention.

[Explanation of symbols]

１０１：特徴抽出部１０２：特徴変換部１０３：適合判定部１０４：入力生成部１０５：認識部２０１：特徴入力部１２０２：特徴入力部２２０３：特徴写像部２０４：比較部３０１：特徴抽出部３０２：特徴変換部３０３：認識部３０４：出力部４０１：特徴抽出部４０２：認識部４０３：出力選択部４０４：出力部６０１：内部状態値記憶手段６０２：内部状態値更新手段６０３：出力値生成手段７０１：データ入力手段７０２：重み付き積算手段７０３：積分手段７０４：出力値制限手段８０１：データ入力、重み付き積算手段８０２：積分手段８０３：出力値制限手段９０１：音声特徴抽出部９０２：ニューラルネットワーク９０３：認識結果出力部１００１：入力特徴１００２：標準特徴１００３：変換特徴１１０１：特徴ベクトル時系列１１０２：写像ニューラルネットワーク１１０３：比較部１４０１：標準特徴１４０２：標準特徴１４０３：標準特徴１４０４：入力特徴
101: Feature extraction unit
102: Feature conversion unit
103: Matching determination unit
104: Input generation unit
105: Recognition unit
201: Feature input unit 1
202: Feature input unit 2
203: Feature mapping unit
204: Comparison unit
301: Feature extraction unit
302: Feature conversion unit
303: Recognition unit
304: Output unit
401: Feature extraction unit
402: Recognition unit
403: Output selection unit
404: Output unit
601: Internal state value storage means
602: Internal state value updating means
603: Output value generating means
701: Data input means
702: Weighted integrating means
703: Integrating means
704: Output value limiting means
801: Data input and weighted integrating means
802: Integrating means
803: Output value limiting means
901: Speech feature extraction unit
902: Neural network
903: Recognition result output unit
1001: Input feature
1002: Standard feature
1003: Converted feature
1101: Feature vector time series
1102: Mapping neural network
1103: Comparison unit
1401: Standard feature
1402: Standard feature
1403: Standard feature
1404: Input feature

Claims

[Claims]

1. A voice recognition device comprising: a feature extraction unit; a plurality of feature conversion units that receive the extracted features as input; matching determination units, each paired with one of the feature conversion units; an input generation unit that receives the outputs of the feature conversion units and the matching determination units as inputs; and a recognition unit that receives the generated input as input.

2. The voice recognition device according to claim 1, wherein the recognition unit is composed of a neural network, and each neuron-like element constituting the neural network includes internal state value storage means, internal state value updating means for updating the internal state value based on the internal state value stored in the internal state value storage means and the input values supplied to that neuron-like element, and output value generating means for converting the internal state value held by the internal state value storage means into an external output value.

3. The voice recognition device according to claim 1 or 2, wherein the matching determination unit comprises at least two feature input units, a feature mapping unit that maps and converts one of the input features, and a comparison unit that compares the mapped feature with the other input feature.

4. The voice recognition device according to any one of claims 1 to 3, wherein the feature mapping unit is composed of a neural network, and each neuron-like element constituting the neural network includes internal state value storage means, internal state value updating means for updating the internal state value based on the internal state value stored in the internal state value storage means and the input values supplied to that neuron-like element, and output value generating means for converting the internal state value held by the internal state value storage means into an external output value.

5. The voice recognition device according to claim 1, wherein the internal state value updating means comprises weighted integrating means that weights and accumulates the internal state value and the input values, the internal state value storage means comprises integrating means that integrates the value accumulated by the weighted integrating means, and the output value generating means comprises output value limiting means that converts the value obtained by the integrating means into a value between a preset upper limit and a preset lower limit.

6. The internal state value of the i-th neuron-like element forming the neural network is Xi, and n weighted external input values to the element are Zj (j is 0 to n, n is non-negative). And the time constant is τi, the internal state value updating means is The voice recognition device according to any one of claims 1 to 5, wherein the internal state value Xi is updated to a value that satisfies the above condition.

7. The voice recognition device according to claim 1, wherein the weighted external input values Zj include at least one of: the output of the i-th neuron-like element itself multiplied by a weight; the output of another neuron-like element constituting the neural network multiplied by a weight; an input value supplied from outside the neural network; and a fixed value multiplied by a weight.

8. The voice recognition device according to claim 1, wherein the input generation unit selects one of the outputs of the plurality of feature conversion units input to the input generation unit.

9. The voice recognition device according to claim 1, wherein the input generation unit generates a linear combination of the outputs of the plurality of feature conversion units input to the input generation unit.