JPH06149286A

JPH06149286A - Unspecified speaker speech recognizing device

Info

Publication number: JPH06149286A
Application number: JP32377292A
Authority: JP
Inventors: Hirofumi Yajima; 弘文矢島
Original assignee: Clarion Co Ltd
Current assignee: Faurecia Clarion Electronics Co Ltd
Priority date: 1992-11-10
Filing date: 1992-11-10
Publication date: 1994-05-27

Abstract

PURPOSE: To provide an unspecified-speaker speech recognition device that is highly robust to variations of a specific word along the time axis and the frequency axis and that achieves a high recognition rate. CONSTITUTION: The speech recognition device comprises speech analysis means (11-14), which analyzes input speech signals and generates monitor waveform data, and data recognition means 15. The data recognition means has a means that generates one identification system, identified from identification data corresponding to the specific word uttered by one of the unspecified speakers, and a plurality of evaluation systems, identified from evaluation data corresponding to the specific word uttered by the other speakers. It also has a means that finds the minimum sum of squares of the differences between the similarity estimates obtained by inputting each set of evaluation data to its corresponding evaluation system and the similarity estimates obtained by inputting the same evaluation data to the identification system.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to an unspecified-speaker speech recognition device that recognizes the speech of specific words uttered by unspecified speakers.

[0002]

[Prior Art] For conventional unspecified-speaker speech recognition devices, various methods have been proposed for absorbing the variations along the time axis and the frequency axis that differ from speaker to speaker, so that specific words uttered by unspecified speakers can be recognized.

[0003] Examples include the composite similarity method, DP matching along the time axis and the frequency axis, and speech recognition based on fuzzy theory.

[0004] FIG. 4 is a block diagram of an example system configuration using the conventional DP matching method. In FIG. 4, 1 is a microphone that receives speech and converts it into a speech signal; 2 is a microphone amplifier that amplifies the speech signal; 3 is a group of band-pass filters (hereinafter "filter bank") that divides the speech signal by frequency into a plurality of speech signals; 4 is an A/D converter that converts the analog speech signals into digital signals; 5 is a speech recognition unit that performs recognition by comparing pre-registered reference speech data with the speech data to be recognized; and 6 is a data registration unit that stores the reference speech data.

[0005] Next, the operation of the conventional example is described. Before recognition, speech from a large number of unspecified speakers (for example, 20 people) uttering the reference specific word is input through the microphone 1, and the resulting 20 sets of reference speech data are stored in the data registration unit 6. Thereafter, when an arbitrary speaker inputs speech to be recognized to the microphone 1, that speech data is pattern-matched against the 20 registered sets of reference speech data in turn, and the speech recognition unit 5 outputs a similarity.
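The sequential pattern matching described above can be sketched as follows. This is a minimal illustration only: it assumes that each utterance is a list of filter-bank feature frames, that DP matching is applied along the time axis alone, and that a smaller accumulated distance means a higher similarity. The names `dtw_distance` and `best_match` and the toy data are hypothetical and do not come from the patent.

```python
from math import inf


def dtw_distance(a, b):
    """Accumulated DP-matching (dynamic time warping) distance between two frame sequences."""
    n, m = len(a), len(b)
    # cost[i][j] is the best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance between the two frames.
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(utterance, references):
    """Match the utterance against every stored reference in turn; return (label, distance)."""
    scored = ((label, dtw_distance(utterance, ref)) for label, ref in references)
    return min(scored, key=lambda pair: pair[1])


# Toy reference store (assumption: two words, two filter-bank channels per frame).
references = [
    ("0", [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]),
    ("1", [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
]
utterance = [[0.0, 1.0], [1.0, 0.0]]  # a shorter, time-compressed version of "0"
label, distance = best_match(utterance, references)
```

Because the warping path may repeat frames, the two-frame utterance aligns with the three-frame reference for "0" at zero cost, which is the kind of time-axis variation DP matching is meant to absorb.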

[0006]

[Problems to be Solved by the Invention] However, because the conventional unspecified-speaker speech recognition device described above is not based on an ideal output, a high recognition rate for specific words cannot be expected. The present invention solves this problem. Its object is to provide an excellent unspecified-speaker speech recognition device that is highly robust to variations of a specific word along the time axis and the frequency axis and that achieves a high recognition rate.

[0007]

[Means for Solving the Problems] To achieve the above object, the present invention comprises speech analysis means, which analyzes a speech signal corresponding to one or more specific words uttered by unspecified speakers and generates monitor waveform data, and data recognition means. The data recognition means has (a) means that generates one identification system, identified from identification data consisting of the monitor waveform data from one of the unspecified speakers, and a plurality of evaluation systems, identified from a plurality of sets of evaluation data consisting of the monitor waveform data from a plurality of other speakers; and (b) means that finds, for the specific word, the minimum sum of squares of the differences between the similarity estimates obtained by inputting each set of evaluation data to its corresponding evaluation system and the similarity estimates obtained by inputting each set of evaluation data to the identification system.

[0008]

[Operation] Accordingly, the present invention defines, for each specific word, identification data from one particular speaker and evaluation data from a plurality of other speakers, and recognizes the speech data of the specific word by finding the minimum sum of squares of the differences between the similarity estimates based on the identification data and those based on the evaluation data. This absorbs the variations between speakers along the time axis and the frequency axis for each specific word, so that highly robust speech recognition with a high recognition rate can be performed.

[0009]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings.

[0010] FIG. 1 is a schematic block diagram of the unspecified-speaker speech recognition device according to an embodiment of the present invention. In FIG. 1, 11 is a microphone that receives speech and converts it into a speech signal; 12 is a microphone amplifier that amplifies the speech signal; 13 is a filter bank that divides the speech signal by frequency into a plurality (n) of speech signals; and 14 is an A/D converter that converts the speech signals into digital speech data. Together these constitute the speech analysis means that analyzes the speech from the speaker. Further, 15 is a fuzzy identification system that outputs similarities from the waveform data obtained after the filter bank 13 (hereinafter "monitor waveform data"); it constitutes the data recognition means.

[0011] Next, the construction of the fuzzy identification system is described. FIG. 1 shows a similarity difference identification system that estimates the similarity for each word to be recognized. First, as shown there, a system that outputs a similarity is created for each registered word. Here the words to be recognized are the ten digits "0" to "9", so there are also ten similarity difference identification systems. Each similarity difference identification system receives the input parameters x1 to xn and outputs an estimated output y′.

[0012] FIG. 3 shows the configuration of the antecedent part and the consequent part of each similarity difference identification system. The identification system in this configuration is written in if-then form: the antecedent part consists of fuzzy propositions whose fuzzy variables have trapezoidal membership functions, and the consequent part is an ordinary linear expression. If the number of specific words is m (here m = 10), there are m fuzzy rules (hereinafter "rules") Ri (i = 1, 2, 3, …, m). If the membership function of the antecedent fuzzy input xj (j = 1, 2, 3, …, n) is Aji, the fuzzy model is given by (Equation 1).

[0013]

[Equation 1]

Here, if the fuzzy inputs xj are taken to be crisp (non-fuzzy) inputs xj0 (x10, x20, …, xn0), the inference output y′ is given by the average weighted by the n degrees of fit, as expressed by (Equation 2) and (Equation 3).

[0014]

[Equation 2]

[0015]

[Equation 3]

As shown in (Equation 4), yi in (Equation 2) is obtained by substituting the inputs xj0 into the consequent expression of (Equation 1).

[0016]

[Equation 4]

Aji(xj0) is the membership value of the fuzzy variable Aji at xj0, and (Equation 3) is the product of these n values. ωi is thus the product of the degrees of fit of rule Ri to the inputs x10, x20, …, xn0, and in (Equation 2) it acts as the "weighting" coefficient applied to yi.
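The inference steps described in paragraphs [0012] to [0016] can be sketched as follows. Since the equation images are not reproduced in the text, this sketch assumes the usual form for this kind of model: each rule has a linear consequent c0 + c1·x1 + … + cn·xn, the rule weight ωi is the product of trapezoidal membership values, and y′ is the ω-weighted average of the rule outputs, summed over the rules. The corner points of the trapezoids, the coefficient values, and the names `trapezoid` and `infer` are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership value: 0 outside (a, d), 1 on [b, c], linear on the slopes.

    Assumes a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def infer(rules, x0):
    """Weighted-average inference: y' = sum(w_i * y_i) / sum(w_i) over all rules."""
    numerator = 0.0
    denominator = 0.0
    for antecedent, coeffs in rules:
        # w_i: product of the membership values A_ji(x_j0) over the n inputs.
        w = 1.0
        for xj, params in zip(x0, antecedent):
            w *= trapezoid(xj, *params)
        # y_i: linear consequent evaluated at the crisp inputs.
        y = coeffs[0] + sum(c * xj for c, xj in zip(coeffs[1:], x0))
        numerator += w * y
        denominator += w
    if denominator == 0.0:
        raise ValueError("no rule fires for this input")
    return numerator / denominator


# Two illustrative rules over two inputs: (trapezoid corners per input, [c0, c1, c2]).
rules = [
    ([(0.0, 1.0, 2.0, 3.0), (0.0, 1.0, 2.0, 3.0)], [10.0, 1.0, 0.0]),
    ([(2.0, 3.0, 4.0, 5.0), (0.0, 1.0, 2.0, 3.0)], [0.0, 0.0, 2.0]),
]
y_est = infer(rules, [2.5, 1.5])
```

With the input (2.5, 1.5), both rules fire with weight 0.5, so the estimate is the plain average of the two consequent values 12.5 and 3.0.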

[0017] Because the antecedent part is given by membership functions and the consequent part by a linear expression, the rules are written, for example, as shown in FIG. 3.

[0018] For (Equation 2), [ωi] is defined as shown in (Equation 5).

[0019]

[Equation 5]

Substituting this [ωi] and the yi of (Equation 4) into (Equation 2) yields (Equation 6).

[0020]

[Equation 6]

Further, setting z0i = [ωi], z1i = [ωi]x10, z2i = [ωi]x20, …, zni = [ωi]xn0, the output parameter y′ is expressed by (Equation 7).
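The substitution steps referred to as (Equation 5) to (Equation 7) are not reproduced in this text. A plausible reconstruction, consistent with the definitions of [ωi] and zji given above, is shown below. It assumes that (Equation 2) is the ω-weighted average of the rule outputs yi over the m rules and that the consequent of rule Ri is linear with coefficients written here as c_{ji}; the coefficient symbol is hypothetical and the original notation may differ.

```latex
% Assumed normalised weight (Equation 5)
[\omega_i] = \frac{\omega_i}{\sum_{k=1}^{m} \omega_k}

% Assumed result of substituting into the weighted average (Equation 6)
y' = \sum_{i=1}^{m} [\omega_i] \left( c_{0i} + c_{1i} x_{10} + \dots + c_{ni} x_{n0} \right)

% With z_{0i} = [\omega_i] and z_{ji} = [\omega_i] x_{j0} (Equation 7)
y' = \sum_{i=1}^{m} \left( c_{0i} z_{0i} + c_{1i} z_{1i} + \dots + c_{ni} z_{ni} \right)
```

In this form y′ is linear in the unknown coefficients c_{ji}, with the zji acting as known regressors once the antecedent part is fixed.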

[0021]

[Equation 7]

Because the consequent part is a linear expression, once the antecedent part is fixed the consequent part can be regarded as a single linear expression. The unknown constants are therefore estimated by the least squares method, following multiple regression analysis, which analyzes the relationship among several variables. That is, the residuals between the estimated and measured values are computed, and the least squares estimate is obtained by minimizing the sum of the squared residuals. The antecedent part can be obtained by the simplex method of nonlinear programming.
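A minimal sketch of the least squares step for the consequent part follows. It assumes that the antecedent part is already fixed, so that each training sample yields one row of regressors z (the values z0i, z1i, …, zni defined in paragraph [0020]) and one target output. The coefficients are found here by solving the normal equations directly; the patent does not state which numerical procedure was used, and the function names are hypothetical.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_consequents(z_rows, targets):
    """Coefficients minimising the sum of squared residuals sum((z . c - y)^2)."""
    k = len(z_rows[0])
    ztz = [[sum(z[p] * z[q] for z in z_rows) for q in range(k)] for p in range(k)]
    zty = [sum(z[p] * y for z, y in zip(z_rows, targets)) for p in range(k)]
    return solve(ztz, zty)


# Toy case: one rule and one input, so every row is (z0, z1) = (1, x10).
z_rows = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
targets = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2 * x
c0, c1 = fit_consequents(z_rows, targets)
```

The toy data lie exactly on a line, so the residual sum of squares is zero and the fit recovers the generating coefficients.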

[0022] The input parameters of the fuzzy identification system 15 are selected as follows. Another speech recognizer is taken as a reference, recognition experiments are repeated, and the monitor waveform data (x1 to xn) that appear to serve as feature quantities of each word (for example, the number of peaks in the waveform data) are chosen. The reference speech recognizer is assumed, for example, to take the monitor waveform data output from the filter bank 13 of FIG. 1 and to output a recognition result based on similarity.

[0023] (Table 1) is the data sheet used to create the similarity identification systems; the sample number ("no") corresponds to the code number of the digit. As shown in (Table 1), the output parameter is given a large value of 200 for the target word and a value of 0 (or a value close to 0, or a value chosen using random numbers) for the other words. The speech analysis system used here is the same as that of the device of the present invention.
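The way target outputs are assigned can be sketched as follows. Table 1 itself is not reproduced in the text, so only the rule stated above is illustrated: 200 for samples of the target word and 0 for samples of every other word. The function name and the sample labels are hypothetical, and the variants mentioned in the text (a value close to 0, or a random value) are not modelled.

```python
WORDS = [str(d) for d in range(10)]  # the ten digit words "0" to "9"
TARGET_HIGH = 200.0  # output assigned to samples of the target word
TARGET_LOW = 0.0     # output assigned to samples of all other words


def target_outputs(system_word, sample_labels):
    """Target column for the identification system of one word."""
    return [TARGET_HIGH if label == system_word else TARGET_LOW for label in sample_labels]


sample_labels = ["0", "3", "7", "3"]  # illustrative utterance labels
targets_for_3 = target_outputs("3", sample_labels)
```

Each of the ten per-word systems would receive its own target column built this way from the same set of utterances.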

[0024]

[Table 1]

In other words, this procedure builds a system with higher word selectivity. The fuzzy identification system can thus be regarded as an ideal speech recognizer whose performance is improved over that of the reference speech recognizer.

[0025] The sample data consist of the ten words for the digits "0" to "9" uttered by a plurality of speakers (for example, 20 people). The data from one of the speakers is used as the identification data, and the data from the remaining 19 speakers is used as the evaluation data. Fuzzy identification normally uses the unbiasedness criterion UC from the GMDH method, which models the input-output relationship of a nonlinear system as a polynomial in the input variables. In this embodiment, however, there are 19 sets of evaluation data, so the UC shown in (Equation 8) is used.

[0026]

[Equation 8]

In (Equation 8), A denotes the system identified from the identification data, and B to T denote the systems identified from the evaluation data. For example, yiAJ denotes the similarity estimate obtained when the data with sample number i in the identification data of A is input to identification system J.
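A sketch of the criterion, based on the wording of the claim rather than on (Equation 8) itself, which is not reproduced in the text: for every evaluation speaker, each evaluation sample is fed both to that speaker's own evaluation system and to the identification system A, and the squared differences of the two similarity estimates are summed. Whether the original criterion includes further terms or a normalisation is not known, so the exact form here is an assumption, as are the function name and the toy systems.

```python
def unbiasedness_criterion(eval_sets, eval_systems, ident_system):
    """Sum of squared differences between evaluation-system and identification-system estimates."""
    total = 0.0
    for speaker, samples in eval_sets.items():
        own_system = eval_systems[speaker]
        for x in samples:
            total += (own_system(x) - ident_system(x)) ** 2
    return total


# Toy systems: each maps an input vector to a similarity estimate.
ident_a = lambda x: 2.0 * x[0]                # system A (identification data)
eval_systems = {
    "B": lambda x: 2.0 * x[0] + 1.0,          # disagrees with A by 1 everywhere
    "C": lambda x: 2.0 * x[0],                # agrees with A
}
eval_sets = {"B": [[1.0], [2.0]], "C": [[3.0]]}
uc = unbiasedness_criterion(eval_sets, eval_systems, ident_a)
```

A small value indicates that the model structure gives similar estimates regardless of which speaker's data it was identified from, which is the property the criterion is meant to reward.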

[0027] Next, the plurality of fuzzy identification systems (here, the ten identification systems) are combined. The resulting composite identification system has a single antecedent part and a plurality of similarity estimates in its consequent part. The antecedent parts are combined by taking the intersection of the sets for each input parameter.

[0028] That is, the fuzzy identification system 15 in the unspecified-speaker speech recognition device shown in FIG. 1 is a composite identification system that, for a single antecedent part, outputs ten similarity estimates y′1 to y′10 from its consequent part. During recognition in FIG. 1, therefore, when a speaker utters a specific word (here one of the digits "0" to "9"), the monitor waveform data obtained through the speech analysis means (the microphone 11, the microphone amplifier 12, the filter bank 13, and the A/D converter 14) is input to the fuzzy identification system 15, which is the data recognition means, and similarity estimates for the ten words "0" to "9" are output.
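The text stops at the ten similarity estimates and does not state how the final word is chosen. One natural reading, given that the target word was trained toward a large output, is to select the word with the largest estimate. The sketch below illustrates that assumption only; the function name and the similarity values are made up.

```python
def recognize(similarities, words):
    """Return the word with the largest similarity estimate, together with that estimate."""
    best = max(range(len(words)), key=lambda k: similarities[k])
    return words[best], similarities[best]


words = [str(d) for d in range(10)]
# Made-up outputs y'1 to y'10 of the composite identification system for one utterance.
similarities = [3.0, 12.5, 180.2, 7.1, 0.0, 22.4, 1.9, 15.0, 4.4, 9.8]
word, score = recognize(similarities, words)
```

A practical system would probably also apply a rejection threshold so that an utterance with no clearly dominant estimate is not forced onto a digit; that is likewise outside what the text specifies.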

[0029]

[Effects of the Invention] As is clear from the embodiment described above, applying a fuzzy identification system to an unspecified-speaker speech recognition device that recognizes specific words provides the following effect.

[0030] For specific words, a highly robust recognition operation that absorbs the variations between speakers along the time axis and the frequency axis becomes possible, and speech recognition with a high recognition rate can be achieved.

[Brief description of drawings]

FIG. 1 is a schematic block diagram of the unspecified-speaker speech recognition device of the present invention.

FIG. 2 is a diagram showing the input/output relationship of the similarity difference identification system for each specific word.

FIG. 3 is a diagram showing how the fuzzy rules of the similarity difference identification system of FIG. 2 are written.

FIG. 4 is a schematic block diagram of an unspecified-speaker speech recognition device to which the conventional subtraction method is applied.

[Explanation of symbols]

11 Microphone
12 Amplifier
13 Filter bank
14 A/D converter
15 Fuzzy identification system

Claims

[Claims]

1. An unspecified-speaker speech recognition device comprising: speech analysis means that analyzes a speech signal corresponding to one or more specific words uttered by unspecified speakers and generates monitor waveform data; and data recognition means having: means that generates one identification system, identified from identification data consisting of the monitor waveform data from one of the unspecified speakers, and a plurality of evaluation systems, identified from a plurality of sets of evaluation data consisting of the monitor waveform data from a plurality of other speakers; and means that finds, for the specific word, the minimum sum of squares of the differences between the similarity estimates obtained by inputting each set of evaluation data to its corresponding evaluation system and the similarity estimates obtained by inputting each set of evaluation data to the identification system.