JPH04233599A - Method and device for speech recognition

Info

Publication number: JPH04233599A
Application number: JP2408935A
Authority: JP
Inventors: Junichi Tamura; 純一田村
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1990-12-28
Filing date: 1990-12-28
Publication date: 1992-08-21

Abstract

PURPOSE: To perform detailed speech recognition and increase the recognition rate by deriving a speech section and candidate words through word-by-word spotting in the first stage of speech recognition, and comparing them in the next stage with standard patterns of phonemes prepared according to the characteristics of the voice. CONSTITUTION: A speech signal is input from a microphone and its waveform is converted into digital values, which are transferred to a speech analysis part 101. The speech analysis part 101 analyzes and compresses the speech and converts it into a time series of feature vectors. A word distance calculation part 102 calculates the distance for each frame between the time series of feature vectors and a standard pattern using continuous Mahalanobis DP; candidate words are determined by a candidate word decision part 104 and stored in a storage part 105 together with their parameters. The input speech information is then matched with word information by a spotting method, and the most appropriate word is identified (108) from each matched phoneme sequence and output (109).

Description

[Detailed description of the invention]

[0001]

[Industrial Field of Application] The present invention relates to a speech recognition device, and more particularly to a speech recognition device that recognizes, with a high recognition rate, speech such as words uttered continuously by an arbitrary speaker.

[0002]

[Prior Art] Several recognition methods for speaker-independent recognition have been devised; the most common speaker-independent recognition system at present is described here.

[0003] Conventionally, a recognition system aimed at a speaker-independent large vocabulary is configured as shown in FIG. 13. For speech input from the speech input unit 1, the speech analysis unit 2 obtains feature parameters such as filter bank outputs including a speech power term, the LPC cepstrum, and so on. Parameter compression (for filter bank outputs, dimensional compression by the K-L transform or the like) is also performed here. (Since the analysis is performed frame by frame, the feature parameters after compression are hereinafter called feature vectors.)

[0004] Next, the phoneme boundary detection unit 3 performs processing to determine the phoneme boundaries in the continuous speech. The phoneme identification unit 4 determines the phonemes by a statistical method. Reference numeral 5 denotes a phoneme standard pattern storage unit that stores phoneme standard patterns created from a large number of phoneme samples. Reference numeral 6 denotes a word identification unit that outputs the final recognition result from the output of the phoneme identification unit 4, applying corrections by means of the word dictionary 7 or, from among the output candidate phonemes, by means of the correction rule unit 8. Reference numeral 9 denotes a recognition result display unit that displays the recognition result.

[0005] Normally, the phoneme boundary detection unit 3 uses a discriminant function or the like, and the phoneme identification unit 4 discriminates in the same way. Each of these components generally outputs the candidates that satisfy a certain threshold. Several further candidates are output for each candidate, and top-down information such as that of 7 and 8 is used to narrow them down to the final word.

[0006]

[Problem to Be Solved by the Invention] However, since the conventional recognition device described above is basically bottom-up in configuration, an error at any point in the recognition process tends to adversely affect the subsequent stages. (For example, if the phoneme boundary detection unit 3 misplaces a phoneme boundary, then depending on the nature of the error, the effect on the phoneme identification unit 4 and the word identification unit 6 is large.) In other words, the final speech recognition rate falls in proportion to the product of the error rates of the individual stages, so a high recognition rate could not be obtained.

[0007] In particular, when constructing a recognition device for unspecified speakers, setting the decision thresholds at each stage is very difficult. If the thresholds are set so that the intended item is at least present among the candidates, the number of candidates at each stage grows, and it becomes very difficult to narrow the candidate words down accurately to the intended word. Furthermore, when a recognition device is used in a real environment, non-stationary noise and the like are considerable, so the recognition rate is low even for a small-vocabulary recognition device, and such devices were in practice difficult to use.

[0008]

[Means for Solving the Problem] To solve the above problems, the invention provides a speech recognition method characterized by: inputting speech information; storing word information used as a reference when recognizing the speech information, together with phoneme information classified according to speech characteristics; determining the characteristics of the input speech information; matching the input speech information against the word information using a spotting method to derive candidate words and the speech section of each candidate word; and, for each derived speech section, retrieving from the storage means the phoneme information corresponding to the candidate word in accordance with the determined speech characteristics and matching it against the input speech.

[0009] To solve the above problems, the invention also provides a speech recognition device characterized by comprising: input means for inputting speech information; storage means for storing word information used as a reference when recognizing the speech information, together with phoneme information classified according to speech characteristics; determination means for determining the characteristics of the input speech information; derivation means for matching the input speech information against the word information using a spotting method to derive candidate words and the speech section of each candidate word; and phoneme recognition means for retrieving from the storage means, for each derived speech section, the phoneme information corresponding to the candidate word in accordance with the determined speech characteristics and matching it against the input speech.

[0010] To solve the above problems, the speech characteristics are preferably ones that differ depending on the speaker who utters the speech.

[0011]

[Embodiment] (Embodiment 1) FIG. 1 is a basic configuration diagram of a speech recognition system according to the present invention. Reference numeral 100 denotes a speech input unit. 101 denotes a speech analysis unit that analyzes and compresses the input speech and converts it into a time series of feature vectors. 103 denotes a word standard pattern storage unit that stores standard patterns, obtained from word data uttered by many speakers, in association with their phoneme notation. 102 denotes a word distance calculation unit using continuous Mahalanobis DP, which calculates, for each frame of the input data, the distance between the feature vector series from the speech analysis unit 101 and each standard pattern stored in the word standard pattern storage unit 103. 104 denotes a candidate word determination unit that selects candidate words from among the word standard patterns according to the per-frame distances to the word standard patterns obtained by continuous Mahalanobis DP. 105 denotes a parameter storage unit that stores the feature vector parameters of the one or more word sections that became candidates. 106 denotes a phoneme standard pattern storage unit that stores standard patterns created in phoneme units from speech uttered by many speakers. 107 denotes a phoneme distance calculation unit using continuous Mahalanobis DP, which calculates, in phoneme units, the distance between the input data and the phoneme standard patterns for the feature vector series of the candidate words. 108 denotes an identification unit based on phoneme-unit recognition results, which identifies and outputs the most appropriate word from the phoneme sequences matched for each of the one or more candidate words. 109 denotes a result output unit that outputs the speech recognition result, for example by voice response. In the figure, Part 1 extracts the speech section and at the same time narrows down the word candidates, and Part 2 is the phoneme-unit recognition section that operates within the candidate words.

[0012] Reference numeral 110 denotes a speaker category identification pattern storage unit. The plural phoneme standard pattern groups are classified into speaker categories according to the characteristics of each speaker, so as to correspond to plural standard patterns from plural speakers; this unit stores the patterns used to identify, from among those speaker categories, the category best suited to the speaker currently inputting speech.

[0013] Reference numeral 111 denotes a process selection unit that selects the standard pattern to be compared with the input speech by the optimal speaker phoneme standard pattern storage unit 102, described later, and that, for the phoneme recognition in Part 2 shown in FIG. 1, instructs that the optimal phoneme group be selected from the phoneme standard pattern storage unit 106 and stored in the optimal speaker phoneme standard pattern storage unit 112.

[0014] Reference numeral 112 denotes an optimal speaker phoneme standard pattern storage unit that stores the phoneme standard patterns of the optimal speaker category as instructed by the process selection unit 111.

[0015] Next, the flow of operation is described. First, the speech input unit 100 receives a speech signal from a microphone and transfers the input waveform to the speech analysis unit 101. While speech input is being accepted, the speech input unit 100 continuously captures speech or ambient noise signals and transfers the input waveform, converted into digital values, to the speech analysis unit 101. The speech analysis unit 101 analyzes the continuously arriving waveform with a window width of about 10 msec to 30 msec and obtains feature parameters for each frame of 2 msec to 10 msec length. Common types of feature parameter are the LPC cepstrum and LPC mel-cepstrum, which can be analyzed relatively quickly, and the FFT cepstrum and FFT mel-cepstrum when parameters are to be extracted with high accuracy; filter bank output values are another option.
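The windowing described above can be sketched as follows. This is only an illustrative sketch, not the device's actual implementation: the function name `frame_signal` and the 10 kHz sampling rate are assumptions, and the default 25.6 msec window and 10 msec frame period are the values this document gives for its embodiment.

```python
def frame_signal(samples, sample_rate_hz, window_ms=25.6, shift_ms=10.0):
    """Split a sampled waveform into overlapping analysis windows."""
    window_len = round(sample_rate_hz * window_ms / 1000.0)
    shift_len = round(sample_rate_hz * shift_ms / 1000.0)
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the captured waveform.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += shift_len
    return frames


# Hypothetical input: one second of silence at an assumed 10 kHz sampling rate.
frames = frame_signal([0.0] * 10000, sample_rate_hz=10000)
print(len(frames), len(frames[0]))  # 98 256
```

Each returned window would then be passed to whichever spectral analysis (LPC or FFT based) the system uses; that step is omitted here.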

[0016] In addition, normalized power information may be used, or each dimension of the parameters may be multiplied by a weighting coefficient, so that each frame is analyzed with the parameters best suited to the conditions under which the system is used. Next, the dimensionality of the analyzed feature parameters is compressed. For cepstrum parameters, only the required number of dimensions (for example, six) is normally extracted from the 1st-order through 12th-order coefficient terms, and these form the feature vector.

[0017] Parameterized spectral difference information, power information, and the like may also be combined with the parameters obtained from the spectral information to form the feature vector.

[0018] When filter bank outputs are used as the feature parameters, their dimensionality is compressed by an orthogonal transform such as the K-L transform or the Fourier transform, and the low-order terms are used. The compressed parameters for one frame are called a feature vector, and the time series of feature vectors after dimensional compression is called the feature vector series (or simply the parameters).

[0019] In this embodiment, analysis is performed with an analysis window length of 25.6 msec and a frame period of 10 msec. Mel-cepstral coefficients are obtained from an envelope spectrum that passes near the peaks of the FFT spectrum, and the 1st through 8th coefficients are used.

[0020] Further, first-order regression coefficients are obtained as difference information between adjacent mel-cepstra, and, as with the mel-cepstral coefficients obtained above, the 1st through 8th orders of the regression coefficients are used, giving a total of 16 features as the feature vector for one frame. Here the 0th-order term of the mel-cepstrum represents power. (This embodiment describes the case in which power information is not used.)

Next, the method of creating the standard patterns stored in the word standard pattern storage unit 103 is described. As an example, this system is described for the recognition of 17 words in total: the ten digits including pronunciation variants, "zero, san, ni, rei, nana, yon, go, maru, shi, roku, ku, hachi, shichi, kyuu, ichi", and "hai, iie". The standard patterns are created from word speech uttered by many speakers. In this embodiment, speech samples from 5000 speakers are used to create the standard pattern for one word. (The more speech samples, the better.) Although the example here creates and stores standard patterns for 17 words only, the method is not limited to 17 words: any speech can be recognized by creating any number of patterns in the same way.
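The 16-dimensional feature vector described above (eight mel-cepstral coefficients plus eight first-order regression coefficients) might be assembled as in the following sketch. The regression window of two frames on each side, the repetition of edge frames, and the names `regression_deltas` and `build_feature_vectors` are assumptions made for illustration; the document does not state how many neighboring frames enter the regression.

```python
def regression_deltas(cepstra, half_width=2):
    """First-order regression (slope) of each coefficient over neighboring frames."""
    n_frames = len(cepstra)
    dim = len(cepstra[0])
    denom = sum(k * k for k in range(-half_width, half_width + 1))
    deltas = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(-half_width, half_width + 1):
                idx = min(max(t + k, 0), n_frames - 1)  # repeat edge frames (assumption)
                num += k * cepstra[idx][d]
            row.append(num / denom)
        deltas.append(row)
    return deltas


def build_feature_vectors(cepstra):
    """Concatenate each frame's coefficients with its regression coefficients."""
    deltas = regression_deltas(cepstra)
    return [c + d for c, d in zip(cepstra, deltas)]


# Hypothetical data: 6 frames of 8 coefficients that rise by 1.0 per frame.
ramp = [[float(t)] * 8 for t in range(6)]
features = build_feature_vectors(ramp)
print(len(features[2]))  # 16
```

For the linear ramp used here, the regression coefficient of an interior frame equals the slope of the ramp, which is a quick sanity check on the formula.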

[0021] It is also possible to use, as word standard patterns, patterns for words, phrases, and the like built by taking the average of each phoneme stored in the phoneme standard patterns and concatenating these averages according to predetermined rules. There may also be plural such standard patterns, one set per speaker.

[0022] FIG. 2 shows a flowchart of the standard pattern creation procedure.

[0023] First, a core pattern (nucleus pattern) is selected to serve as the provisional reference for comparison when creating a standard pattern from the speech samples (S200). The word selected is the one whose utterance duration and utterance pattern are the most average among the 5000 word samples. Next, a sample word is input (S201), time-warping matching is performed between the input word and the core pattern, and a mean vector and a variance-covariance matrix are created for each frame along the matching path that minimizes the time-normalized distance (S202). DP matching is used as the time-warping matching method. The speaker number of the input word is then changed in turn (S204), and the mean of the feature vectors and the variance-covariance matrix are obtained for each frame over the words Si (i = 1 to 5000) of the 5000 speakers (S203, S205). Word standard patterns are created in this way for each of the 17 words and stored in the word standard pattern storage unit 103.
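The procedure of steps S200 to S205 can be sketched as below: each sample is aligned to the core pattern by DP matching, and the sample frames matched to each core frame are pooled to give a mean vector and a variance-covariance matrix. The squared Euclidean local distance, the three-way step pattern without slope constraints, the division by the sample count in the covariance, and all function names are assumptions; the document specifies only that DP matching with a time-normalized distance is used.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_path(core, sample):
    """Return the minimum-cost alignment as (core_frame, sample_frame) pairs."""
    n, m = len(core), len(sample)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = sq_dist(core[i], sample[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best, prev = inf, None
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, prev = cost[pi][pj], (pi, pj)
            cost[i][j] = best + d
            back[i][j] = prev
    path, node = [], (n - 1, m - 1)
    while node is not None:          # walk back from the end to (0, 0)
        path.append(node)
        node = back[node[0]][node[1]]
    return path[::-1]


def build_standard_pattern(core, samples):
    """Per core frame, return (mean vector, covariance matrix) of aligned frames."""
    dim = len(core[0])
    aligned = [[] for _ in core]     # sample vectors matched to each core frame
    for sample in samples:
        for i, j in dtw_path(core, sample):
            aligned[i].append(sample[j])
    pattern = []
    for vectors in aligned:
        count = len(vectors)
        mean = [sum(v[d] for v in vectors) / count for d in range(dim)]
        cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / count
                for b in range(dim)] for a in range(dim)]
        pattern.append((mean, cov))
    return pattern


# Hypothetical one-dimensional data: a 2-frame core pattern and two samples.
core = [[0.0], [1.0]]
samples = [[[0.0], [1.0]], [[0.0], [0.0], [3.0]]]
pattern = build_standard_pattern(core, samples)
print(pattern[1])  # ([2.0], [[1.0]])
```

In a real system the covariance matrices would need enough samples per frame to be invertible, which is one reason a large number of speakers helps.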

[0024] Reference numeral 110 denotes the speaker category identification pattern storage unit.

[0025] This recognition device recognizes words, sentences, and the like uttered by unspecified speakers. Before the target speech is actually recognized, the device learns in advance which category the speaker about to provide input falls into, and in Part 2 it recognizes using the phoneme standard patterns best suited to that speaker from among the plural phoneme standard pattern groups; this yields a recognition device with high recognition accuracy. The method of creating the speaker category identification patterns is described below following the flowchart shown in FIG. 3. First, the feature vector series obtained by analyzing speech in which 5000 speakers slowly uttered "aiueo" as one connected utterance are classified into an arbitrary number of categories, here n classes. Many clustering techniques exist for dividing the classes, and any of them may be used. In FIG. 3, steps S401 to S405 first select the most average speaker among all 5000 speakers; the speaker whose utterance has the feature vectors at the largest DP distance from this speaker's feature vectors is then selected and designated I2 (S406).

[0026] Next, the speaker I3 with the largest DP distance (normalized value) from speakers I1 and I2 is selected, and so on; this procedure is repeated until the DP distance falls to or below a predetermined reference value, for example 0.05. In this embodiment, nine speakers, I1 to I9, were obtained as the representative samples of the categories. FIG. 4 shows a conceptual diagram of these speaker categories, and FIG. 5 shows an example of the symbolic representation of the feature vectors.
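The selection of representative speakers can be sketched as a farthest-point selection over a precomputed table of normalized DP distances. How "the largest DP distance from I1 and I2" combines the two distances is not stated in the document; this sketch assumes the distance to the nearest already chosen speaker is used. The function name `select_core_speakers` and the sample distance table are hypothetical.

```python
def select_core_speakers(distance, first, threshold=0.05):
    """Pick representative speakers from a symmetric matrix of DP distances.

    distance[a][b] is the normalized DP distance between speakers a and b;
    `first` is the index of the most average speaker.
    """
    chosen = [first]
    while True:
        best_speaker, best_gap = None, -1.0
        for s in range(len(distance)):
            if s in chosen:
                continue
            # Assumption: a speaker's distance to the chosen set is the
            # distance to its nearest chosen speaker.
            gap = min(distance[s][c] for c in chosen)
            if gap > best_gap:
                best_speaker, best_gap = s, gap
        if best_speaker is None or best_gap <= threshold:
            return chosen
        chosen.append(best_speaker)


# Hypothetical distances among four speakers; speaker 0 is the most average.
table = [
    [0.00, 0.50, 0.02, 0.30],
    [0.50, 0.00, 0.50, 0.40],
    [0.02, 0.50, 0.00, 0.30],
    [0.30, 0.40, 0.30, 0.00],
]
print(select_core_speakers(table, first=0))  # [0, 1, 3]
```

Speaker 2 is not selected in this example because it lies within the 0.05 reference value of speaker 0, so it would fall into that speaker's category.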

[0027] Next, using the feature vector series of the speakers who form the core of these categories (hereinafter called core speakers) as core patterns, standard patterns for the continuously uttered word "aiueo" are created following the flowchart shown in FIG. 2. In S202, the variances and covariances of the corresponding frames are obtained along the DP path while DP matching is performed; here the DP window constraint, the DP slope constraint, and so on are tightened somewhat for the patterns used as standard patterns.

[0028] Restricting the speakers in this way makes it possible to generate, for each category centered on a core speaker, good standard patterns with relatively small variance.

[0029] The set of speakers used when creating the standard pattern for a speaker category, with the core speaker as the core pattern, is hereinafter called the category speaker group.

[0030] The word distance calculation unit 102 using continuous Mahalanobis DP matches the time series of feature vectors, which arrive one after another, against all the standard patterns of words or phoneme chains stored in the word standard pattern storage unit 103 or the speaker category identification pattern storage unit 110 by continuous Mahalanobis DP, and calculates the distances.

[0031] Here, in order to identify which speaker category the speaker currently providing input belongs to, the process selection unit 111 selects whether the input speech is matched against the speaker category identification pattern storage unit 110 or the word standard pattern storage unit 103.

[0032] FIG. 6 shows an internal configuration diagram for explaining the operation of the process selection unit 111.

[0033] FIG. 7 shows a flowchart of the processing performed by the process selection unit 111.

[0034] When the speech recognition process starts up (S301), the device is in speaker identification mode, so processing proceeds to S304. However, the speaker can set the mode flag personally, for cases where the input speaker changes partway through or where the speaker wants to return to speaker identification mode. The mode flag of the mode switching unit 121 is therefore read (S302). If the mode flag indicates word recognition mode, the mode switching unit switches to word recognition mode (S303), and, as described earlier, the input speech is treated as a target word and word recognition is performed (S310). If speaker identification mode is determined (S303), an instruction such as "Please say 'aiueo'" is given to the speaker through an instruction means such as a display or speech synthesis (S304). The optimal speaker category is then searched for (S305); here a restriction is imposed that the distance value be 0.1 or less (S306). If the input is rejected at S306, it is judged that the speaker's utterance length, intensity, or the like differs extremely from the standard values, retry information is added (S307), and input is requested again (S304). In this case the input speech instruction unit changes the instruction given to the speaker to something like "Please say 'aiueo' slowly, as one continuous utterance. Go ahead." After a category has been identified from among the speaker categories I1 to I9 in this way, the phoneme standard patterns created using as the core pattern the same speaker as the core speaker of that category are transferred from the phoneme standard pattern storage unit 106 to the optimal speaker phoneme standard pattern storage unit 112 and stored there (S308).

[0035] Once the speaker category has been identified, the mode flag is set to word recognition mode (S309), and word recognition processing begins (S310).
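The speaker identification loop of steps S304 to S307 can be sketched as follows. Everything here is a hypothetical illustration: `get_category_distances` stands in for the prompt-and-match step and is assumed to return the distance of the new utterance to each category pattern, and the bound on retries (`max_tries`) is an added assumption, since the flowchart as described does not limit the number of retries.

```python
def run_speaker_identification(get_category_distances, reject_above=0.1, max_tries=3):
    """Return (category or None, list of prompts shown to the speaker)."""
    prompts = []
    prompt = 'Please say "aiueo".'
    for _ in range(max_tries):
        prompts.append(prompt)                                # S304: instruct the speaker
        distances = get_category_distances()                  # S305: match against categories
        category = min(distances, key=distances.get)
        if distances[category] <= reject_above:               # S306: accept if close enough
            return category, prompts
        # S307: retry with a more explicit instruction.
        prompt = 'Please say "aiueo" slowly, as one continuous utterance. Go ahead.'
    return None, prompts


# Hypothetical run: the first utterance is rejected, the second matches I2.
attempts = iter([{"I1": 0.4, "I2": 0.3}, {"I1": 0.2, "I2": 0.04}])
category, prompts = run_speaker_identification(lambda: next(attempts))
print(category, len(prompts))  # I2 2
```

After a category is returned, the caller would copy that category's phoneme standard patterns into the working store and switch the mode flag to word recognition, corresponding to S308 and S309.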

[0036] Next, continuous Mahalanobis DP is described. The continuous DP technique is in common use; it is a method of finding target units such as words or syllables within sentences uttered continuously by a specific speaker. This is called word spotting, and it is a groundbreaking method in that recognition is performed at the same time as the target speech section is extracted. In this embodiment, speaker variability is absorbed by using the Mahalanobis distance as the distance within each frame of the continuous DP method.
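A frame-level Mahalanobis distance of the kind referred to above can be sketched as below, using the mean vector and variance-covariance matrix held for one frame of a standard pattern. The document does not say whether the squared form, its square root, or a variant with additional terms is used, so the squared form here is an assumption, as are the function names. The small linear solver avoids forming the inverse matrix explicitly and assumes the covariance matrix is non-singular.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^{-1} (x - mean)."""
    diff = [a - b for a, b in zip(x, mean)]
    weighted = solve(cov, diff)      # cov^{-1} (x - mean), without inverting cov
    return sum(d * w for d, w in zip(diff, weighted))


# Hypothetical frame statistics with unequal variances and no correlation.
print(mahalanobis_sq([2.0, 1.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))  # 2.0
```

Dimensions with large variance across speakers contribute less to the distance, which is how this measure tolerates speaker-to-speaker variation.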

[0037] FIG. 8 shows the result of matching, by continuous Mahalanobis DP, the standard pattern of the word "zero" against the input speech obtained when the word "zero" was uttered, analyzed into a time series of feature vectors including the silent sections. In the figure, darker areas are where the distance between the standard pattern and the input pattern is large, and lighter areas close to white are where the distance is small. Below the matching result, the change in cumulative distance over time is shown. This cumulative distance is the distance of the DP path that ends at that point in time; the DP path is determined and stored in memory. The stored DP path is used to find the start of the speech section. The figure shows, for example, the DP path at the point where the distance is minimal. When the standard pattern and the input pattern are similar, the cumulative distance falls below an arbitrarily set threshold, and the word of that standard pattern is accepted as a candidate word. To extract the speech section from the input pattern, the DP path is recalled from memory and backtracked from the point at which the cumulative distance is below the threshold and at its minimum, which gives the start of the speech section. The time series of feature vectors of the speech section obtained in this way is stored in the parameter storage unit 105.
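The spotting procedure described above can be sketched as follows: a path may begin at any input frame, the cumulative distance of the path ending at each frame is compared with a threshold, and the stored back-pointers are followed from the best end point to recover the start of the speech section. This is a simplified sketch under several assumptions: the step pattern has no slope constraints, the cumulative distance is normalized by the reference length, the whole input is scanned before the minimum is taken, and `local_dist` is any frame distance supplied by the caller (a squared Euclidean distance in the example, where the document uses the Mahalanobis distance).

```python
def spot_word(inputs, reference, threshold, local_dist):
    """Return (start_frame, end_frame, normalized_distance), or None if no match."""
    n_ref = len(reference)
    inf = float("inf")
    prev = [inf] * n_ref
    back = []                        # back[t][j] is the predecessor cell of (t, j)
    best = None
    for t, frame in enumerate(inputs):
        cur = [inf] * n_ref
        ptr = [None] * n_ref
        for j in range(n_ref):
            d = local_dist(frame, reference[j])
            if j == 0:
                cur[j] = d           # a path may start at any input frame
                continue
            options = [(prev[j - 1], (t - 1, j - 1)), (prev[j], (t - 1, j)),
                       (cur[j - 1], (t, j - 1))]
            cost, ptr[j] = min(options, key=lambda o: o[0])
            cur[j] = cost + d
        back.append(ptr)
        score = cur[-1] / n_ref      # cumulative distance of the path ending at t
        if score < threshold and (best is None or score < best[1]):
            best = (t, score)
        prev = cur
    if best is None:
        return None
    t, j = best[0], n_ref - 1
    while back[t][j] is not None:    # backtrack to the start of the speech section
        t, j = back[t][j]
    return (t, best[0], best[1])


def sq_euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical one-dimensional data: a 3-frame word embedded in other input.
reference = [[1.0], [2.0], [3.0]]
inputs = [[9.0], [9.0], [1.0], [2.0], [3.0], [9.0]]
print(spot_word(inputs, reference, threshold=0.5, local_dist=sq_euclid))  # (2, 4, 0.0)
```

The returned start and end frames delimit the speech section whose feature vectors would be kept for the phoneme-level check.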

[0038] The processing described so far first yields the candidate words, the feature vector series obtained by analyzing their speech sections, and the cumulative distances from continuous Mahalanobis DP. When several candidate words with overlapping speech sections are selected, such as "shichi" and "shi", "shichi" is chosen and "shi" is discarded. Likewise for "roku" and "ku": when most of the speech section of "ku" (here, 80% or more) is contained in that of "roku", "ku" is discarded and only "roku" is verified.
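The pruning of overlapping candidates can be sketched as below. The 80% figure comes from the text; the rule that only a longer speech section can absorb a shorter one is an assumption generalized from the two examples given ("shichi" over "shi", "roku" over "ku"), and the function name and frame numbers are hypothetical.

```python
def prune_overlapping(candidates, min_overlap=0.8):
    """Drop candidates whose section is mostly contained in a longer candidate.

    candidates: list of (word, start_frame, end_frame) with inclusive end frames.
    """
    kept = []
    for word, start, end in candidates:
        length = end - start + 1
        contained = False
        for _other, o_start, o_end in candidates:
            o_length = o_end - o_start + 1
            if o_length <= length:
                continue             # assumption: only a longer section can absorb this one
            shared = min(end, o_end) - max(start, o_start) + 1
            if shared / length >= min_overlap:
                contained = True
                break
        if not contained:
            kept.append((word, start, end))
    return kept


# Hypothetical spotting results: "ku" lies almost entirely inside "roku".
found = [("roku", 10, 49), ("ku", 30, 50), ("go", 60, 80)]
print(prune_overlapping(found))  # [('roku', 10, 49), ('go', 60, 80)]
```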

[0039] In this embodiment, phoneme standard patterns are created and held in the phoneme standard pattern storage unit 106 for the vowels (a, i, u, e, o) and the consonants (z, s, n, r, g, m, shi, k, h, ci).

[0040] Since this embodiment aims at recognizing the 17 words described above, the phonemes stored in the phoneme standard pattern storage unit 106 are the 15 types listed above. As stated earlier, however, when the recognition targets are extended and the number of standard patterns is increased, standard patterns are created in the same way for all the phonemes that make up those standard patterns and are stored in the phoneme standard pattern storage unit 106.

【００４１】ここでは、カテゴリ別標準パタン作成に用
いたカテゴリ話者群に分類し、その中の各話者が発声し
た単語の中から、各音素を切り出し、これらの同一の音
素集合について、クラスタリング等を行ない各クラスに
属する複数の音素標準パタンを作成して格納する。Here, speakers are classified into the category speaker groups used to create the category-specific standard patterns, and each phoneme is cut out from the words uttered by each speaker in a group. For each set of identical phonemes, clustering or similar processing is performed to create and store a plurality of phoneme standard patterns, one for each class.

【００４２】この様子を図９に示す。話者カテゴリに属
するカテゴリ話者群の中から、音素の部分を切り出す（
例えば音素／ａ／）更に、これをクラスタリング等の処
理を行ない、／ａ／の音素について、１以上の標準パタ
ンを作成する。図では話者カテゴリが１の場合、／ａ／
は、／ａ１／と／ａ２／、／ｕ／は／ｕ１／、／ｕ２／
、／ｕ３／の様に複数の音素クラスに対応する音素標準
パタン系列が格納されている。例えば／ａ１／は有声音
の“ア”／ａ２／は無声化した“ア”といったように、
同一の音素でも単語中における音素出現位置の相異によ
る周囲の音韻の違い（音韻環境）や、同一話者でも発声
の仕方等の相違により変形も激しい。FIG. 9 shows this arrangement. A phoneme segment (for example, the phoneme /a/) is cut out from the utterances of the category speaker group belonging to a speaker category, and clustering or similar processing is applied to create one or more standard patterns for the phoneme /a/. In the figure, for speaker category 1, phoneme standard pattern sequences corresponding to multiple phoneme classes are stored: /a1/ and /a2/ for /a/, and /u1/, /u2/, and /u3/ for /u/. For example, /a1/ is a voiced "a" and /a2/ is a devoiced "a". Even the same phoneme varies considerably, both because the surrounding phonemes (the phonetic environment) differ with the phoneme's position in the word and because the same speaker may utter it in different ways.
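The text says only "clustering or the like", so the following is one possible sketch rather than the method actually used. It assumes each cut-out phoneme token has already been summarized as a fixed-length feature vector and applies plain k-means with a deterministic initialization; the function name `kmeans` and both assumptions are illustrative.

```python
def kmeans(samples, k, iterations=20):
    """Cluster fixed-length feature vectors into k classes and return the
    class centroids, which stand in here for phoneme standard patterns."""
    # Deterministic initialization from the first k samples (an assumption).
    centroids = [list(s) for s in samples[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for s in samples:
            # Assign the sample to the nearest centroid (squared Euclidean).
            idx = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centroids[c])),
            )
            clusters[idx].append(s)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```

Calling `kmeans(tokens_of_a, 2)` on the tokens of /a/ from one speaker category would give two centroids, playing the roles of /a1/ and /a2/ in the figure.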

【００４３】本方法の様に、話者カテゴリ別に分類した
単語の中から音素を切り出し、この中で更にクラスタリ
ング等により複数の音素標準パタンを持つ事によって、
より確度の高い認識結果が得られる。As in this method, cutting phonemes out of words classified by speaker category, and then holding multiple standard patterns per phoneme obtained by clustering or similar processing, yields more accurate recognition results.

【００４４】また、最適話者音素標準パタン格納部１１
２には、前記話者カテゴリ識別パタン格納部１１０の中
から選択された最適な話者カテゴリに対応した音素標準
パタン群が音素標準パタン格納部１０６から処理選択部
１１１により転送され、格納される。In addition, the group of phoneme standard patterns corresponding to the optimal speaker category selected from the speaker category identification pattern storage unit 110 is transferred from the phoneme standard pattern storage unit 106 by the processing selection unit 111 and stored in the optimal speaker phoneme standard pattern storage unit 112.

【００４５】連続マハラノビスＤＰによる音素距離計算
部１０７ではパラメータ格納部１０５に格納されている
候補単語として切り出された音声区間について各音素と
のマッチングを行う。The continuous Mahalanobis DP phoneme distance calculation unit 107 matches each phoneme against the speech section that was cut out as a candidate word and stored in the parameter storage unit 105.

【００４６】連続マハラノビスＤＰによる単語距離計算
部１０２と同様に累積距離が最小となった位置からその
音素の区間を計算する。（候補単語判別部１０４と同様
、累積距離が最小となった時点をその音素の終端とし、
始端は連続ＤＰパスのバックトラックにより求める）本
実施例では例えば“ゼロ”→“ｚｅｒｏ”が候補単語の
場合その音声区間について“ｚ”、“ｅ”、“ｒ”、“
ｏ”の４種類の音素についてのみマッチングを行う。４
種の音素と上記“ｚｅｒｏ”と判別され、候補となった
音声区間のマッチングの結果、各音素の累積距離が最小
となる点についてその位置関係と、最小距離の平均値を
求めるこの様子を図１０に示す。As in the continuous Mahalanobis DP word distance calculation unit 102, the section of each phoneme is computed from the position where the cumulative distance is minimum. (As in the candidate word discrimination unit 104, the time at which the cumulative distance is minimum is taken as the end of the phoneme, and the start is found by backtracking along the continuous DP path.) In this embodiment, for example, when the candidate word is "zero" (phoneme string "zero"), matching over its speech section is performed only for the four phonemes "z", "e", "r", and "o". From the result of matching these four phonemes against the speech section that was judged to be "zero" and became a candidate, the positional relationship of the points where each phoneme's cumulative distance is minimum and the average of those minimum distances are obtained. FIG. 10 shows this.

【００４７】各々の音素についてマッチングの結果の距
離の最小値と、その位置をフレームで表わし音素単位の
認識結果による認識部１０８に送る。この例では、“ｚ
”について最小値は“ｊ”、フレーム位置は“ｚｆであ
る。音素単位の認識結果による認識部１０８では、連続
マハラノビスＤＰによる音素距離計算部１０７から送ら
れてきたデータを基に最終的な単語の識別を行う。まず、候補単語の音素列の順番（フレームの位置）がｚ
ｆ＜ｅｆ＜ｒｆ＜ｏｆであるか否かを調べる。もしこの
順番であれば認識単語は“ゼロ”（ｚｅｒｏ）”平均認
識距離For each phoneme, the minimum distance obtained by matching and its position, expressed as a frame number, are sent to the recognition unit 108, which recognizes words from the phoneme-level results. In this example, the minimum value for "z" is "j" and its frame position is "zf". The recognition unit 108 performs the final word identification based on the data sent from the continuous Mahalanobis DP phoneme distance calculation unit 107. First, it checks whether the order (frame positions) of the phonemes of the candidate word satisfies zf < ef < rf < of. If they are in this order, the recognized word is taken to be "zero" (zero), and the average recognition distance X is then computed by the following expression.

【００４８】[0048]

【外１】を求めＸの値が閾値Ｈよりも小さいならば、認識結果と
して“ゼロ”を出力する。The average recognition distance X is obtained from the above expression, and if X is smaller than the threshold H, "zero" is output as the recognition result.
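The order check and the threshold test can be sketched as below. Because the expression for X is not reproduced in the text, the sketch assumes X is the arithmetic mean of the per-phoneme minimum distances, in line with the "average of the minimum distances" mentioned earlier. The function name `verify_word` and the input layout are hypothetical.

```python
def verify_word(matches, threshold_h):
    """Check phoneme order and average distance for one candidate word.

    matches: list of (min_distance, frame_position), one entry per phoneme,
             listed in the order the phonemes appear in the candidate word.
    Returns (accepted, x), where x is the average minimum distance.
    """
    positions = [frame for _, frame in matches]
    # For "zero" this is the test zf < ef < rf < of.
    in_order = all(a < b for a, b in zip(positions, positions[1:]))
    # Assumed definition of X: mean of the per-phoneme minimum distances.
    x = sum(dist for dist, _ in matches) / len(matches)
    return in_order and x < threshold_h, x
```

For "zero", `matches` would hold the four results for "z", "e", "r", and "o" in that order.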

【００４９】図１１は単語候補の出力結果（候補単語判
別部１０４の出力結果）を示したものである。■は単語
“ハチ”、■は単語“シチ”、■は単語“シ”が候補と
して出力される。が、ここで前に述べたように■は■の
区間に８０％以上含まれており、かつ同一の“シ”が■
の中に存在するので音素レベルでの識別は■■について
行なう。FIG. 11 shows the word candidate output (the output of the candidate word discrimination unit 104). Three candidates are output: the word "hachi", the word "shichi", and the word "shi". As described earlier, however, the section of "shi" is contained 80% or more within the section of "shichi", and the same "shi" occurs within "shichi", so phoneme-level identification is performed only for "hachi" and "shichi".

【００５０】ケース■　　単語Ｓ１の音素列“／ｈ／ａ
／ｃ／ｉ／”と単語Ｓ２の音素列“／ｓｈ／ｉ／ｃ／ｉ
／”についてマッチングした結果、どちらも音素の順番
が、候補単語と等しい場合、かつ、個々の音素の距離が
Ｈ（閾値）より小さい場合→平均累積距離Ｘの小さい方
を出力する。Case 1: The phoneme string "/h/a/c/i/" of word S1 and the phoneme string "/sh/i/c/i/" of word S2 are both matched, and for both the phoneme order equals that of the candidate word and the distance of each phoneme is smaller than the threshold H → the word with the smaller average cumulative distance X is output.

【００５１】ケース■　　どちらも順番が異なる個々の
音素の距離が閾値（Ｈ）より小さい場合→単語と音素列
の文字列によるＤＰマッチングを行い、その距離の閾値
（Ｉ）により決定する。Case 2: For both words the phoneme order differs, but the distance of each phoneme is smaller than the threshold H → DP matching is performed between the character strings of the word and of the phoneme sequence, and the decision is made using a threshold I on that distance.

【００５２】ケース■　　順番が合っているか、個々の
音素の閾値が（Ｈ）をクリアしていない場合→リジェク
トケース■　　順番が異なり、音素の閾値もクリアして
いない場合→リジェクト音素単位の認識結果による単語
の識別方法は前記の方法に限らない。後に他の実施例で
も述べるが音素の単位をどの様な形で定義し、標準パタ
ンを作成しておくか、或は同一の音素でも複数用意する
事によって音素判別に用いる閾値Ｈの値、或は識別アル
ゴリズムは異なる。よって、平均累積距離と音素順位の
どちらを優先させるか等の識別アルゴリズムは一意に決
まらない。Case 3: The order matches, but the distances of the individual phonemes do not clear the threshold H → reject. Case 4: The order differs and the phoneme distances do not clear the threshold either → reject. The method of identifying a word from phoneme-level recognition results is not limited to the one above. As described later in other embodiments, the value of the threshold H used for phoneme discrimination, and the identification algorithm itself, differ depending on how the phoneme units are defined and their standard patterns created, and on whether multiple patterns are prepared for the same phoneme. Consequently, the identification algorithm, for example whether to prioritize the average cumulative distance or the phoneme order, is not uniquely determined.
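One way to read the four cases is sketched below. The text does not specify the string DP matching, so the sketch substitutes the Levenshtein edit distance over phoneme symbols, and it applies the cases to each candidate independently rather than to a pair; both choices are assumptions. The names `string_dp_distance` and `decide`, and the dictionary keys, are hypothetical.

```python
def string_dp_distance(expected, observed):
    """Levenshtein distance between two phoneme sequences (an assumed
    stand-in for the string DP matching mentioned in Case 2)."""
    prev = list(range(len(observed) + 1))
    for i, a in enumerate(expected, 1):
        cur = [i]
        for j, b in enumerate(observed, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # match or substitution
        prev = cur
    return prev[-1]

def decide(candidates, threshold_h, threshold_i):
    """Return the accepted word with the smallest average distance, or None.

    Each candidate is a dict with keys:
      "word", "expected" (phoneme list), "observed" (phonemes sorted by
      frame position), "distances" (minimum distance per phoneme).
    """
    accepted = []
    for c in candidates:
        dist_ok = all(d < threshold_h for d in c["distances"])
        in_order = c["observed"] == c["expected"]
        if not dist_ok:
            continue  # Cases 3 and 4: reject
        if not in_order and string_dp_distance(c["expected"], c["observed"]) >= threshold_i:
            continue  # Case 2: string distance does not clear threshold I
        x = sum(c["distances"]) / len(c["distances"])
        accepted.append((x, c["word"]))
    return min(accepted)[1] if accepted else None
```

As the text notes, the priority between average distance and phoneme order is a design choice; this sketch checks the distances first and uses the order only to decide whether the string comparison is needed.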

【００５３】音素単位の認識結果による認識部１０８で
最終結果として出力した例えば音声（単語）を結果出力
部１０９で出力する。電話等の音声情報のみで認識させ
る場合、認識結果を「“ゼロ”ですね？」と例えば音声
合成手段を用いて確認する。単語の識別の結果、距離が
十分小さければ認識結果を確認せずに、それに対応した
次の処理へと移行する。The result output unit 109 outputs the final result, for example a spoken word, produced by the recognition unit 108 from the phoneme-level recognition results. When recognition relies on voice information alone, as over a telephone, the result is confirmed with the user, for example by using speech synthesis to ask "That was 'zero', correct?" If the distance obtained in word identification is sufficiently small, the system proceeds to the corresponding next process without confirming the result.

【００５４】なお、本実施例ではパターンマッチングの
方法として統計的に不特定性を吸収する距離尺度として
マハラノビス距離を用いた連続マハラノビスによるマッ
チング方法を用いたが、これに限定することなく、第２
部での認識においてはマルコルモデルのような確率を用
いて不特定性を吸収する距離を用いたマッチング方法が
あれば、どれを用いてもよいことはいうまでもない。In this embodiment, the pattern matching method is continuous Mahalanobis DP matching, which uses the Mahalanobis distance as a distance measure that statistically absorbs speaker variability. The invention is not limited to this, however: it goes without saying that in the second-stage recognition any matching method may be used whose distance measure absorbs such variability probabilistically, such as a Markov model.

【００５５】なお、本実施例では話者群を識別するため
の音韻連鎖として「アイウエオ」と連続して発声した単
語を用いたが、話者群を識別する単語は、これに限らな
い。またこれは、複数であってもよい。例えば、単語Ａ
（母音を含む単語）で、話者の基本的特徴（ホルマント
ピークの長さ等）を分類し、更に、その中でも、話者ご
との特徴（１．濁音を発声する際、“バズ”を含みやす
いか、含みにくいか、２．“ｐ，ｔ，ｋ”等の子音の長
さ、３．平均的な発声速度等）等、話者を分類する上で
最も特徴が強く出る単語Ｂ、単語Ｃから、更に話者群を
分類するとよい。In this embodiment, the word "aiueo" uttered continuously was used as the phoneme chain for identifying the speaker group, but the word used for this purpose is not limited to it, and more than one word may be used. For example, word A (a word containing vowels) may be used to classify speakers by basic characteristics (such as the length of formant peaks), and the speaker groups may then be further classified using word B and word C, which most strongly bring out speaker-specific characteristics such as (1) whether a "buzz" tends to be present when voiced consonants are uttered, (2) the length of consonants such as "p", "t", and "k", and (3) the average speaking rate.

【００５６】（実施例２）前記実施例１では、音素標準
パタン格納部に格納する音素として、本認識装置で認識
を行うのに必要な認識対象単語に含まれる音素に限定し
ていたが、常時格納しておく音素標準パタンは、（日本
語を認識する場合）日本語の全音素について、話者カテ
ゴリ、音素クラスごとに作成した音素標準パタンを作成
しておいても良い。これにより１０６のメモリは増える
が認識対象単語を変えた場合、その対象単語に使用され
て音素（複数）について、話者カテゴリに対応する標準
パタンを１１２に格納すれば良い。(Embodiment 2) In Embodiment 1 above, the phonemes stored in the phoneme standard pattern storage unit were limited to those contained in the words this recognition device needs to recognize. However, the permanently stored phoneme standard patterns may instead cover all phonemes of the language (all Japanese phonemes, when recognizing Japanese), created for each speaker category and phoneme class. This increases the memory required for storage unit 106, but when the recognition target words are changed, it is sufficient to store in storage unit 112 the standard patterns, for the relevant speaker category, of the phonemes used in the new target words.
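The transfer from storage unit 106 to storage unit 112 might look like the sketch below. The nested-dictionary layout (speaker category → phoneme → list of class patterns) and the function name `load_optimal_patterns` are assumptions made for illustration; the text does not describe the storage format at this level.

```python
def load_optimal_patterns(store_106, category, target_words):
    """Select, for one speaker category, the patterns of every phoneme
    used by the current recognition target words.

    store_106:    {category: {phoneme: [pattern, ...]}}
    target_words: {word: [phoneme, ...]}
    Returns the contents to place in storage unit 112.
    """
    needed = {p for phonemes in target_words.values() for p in phonemes}
    by_phoneme = store_106[category]
    # Keep all class patterns (e.g. /a1/, /a2/) of each needed phoneme.
    return {p: by_phoneme[p] for p in sorted(needed) if p in by_phoneme}
```

Changing the vocabulary then only requires calling the function again with the new `target_words`; storage unit 106 itself is left untouched.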

【００５７】更に、音素として、日本語発声に必要な音
素だけでなく各国語（英語、仏語、独語、中国語…）等
に用いられる音素も全て格納しておき、この中から認識
対象語を選択しても良い。Furthermore, the stored phonemes need not be limited to those required for Japanese speech; all phonemes used in other languages (English, French, German, Chinese, and so on) may also be stored, and the recognition target words may be selected from among them.

【００５８】図１２の様に図２．４の音素種を増やし、
更にこれを国別に用意しておけば良い。As shown in FIG. 12, the phoneme types of Fig. 2.4 may be increased and prepared for each country.

【００５９】[0059]

【発明の効果】以上説明した様に、本発明によれば、音
声認識の第１段階において単語単位のスポッティングを
行なって音声区間と候補単語を導出し、第２段階で音声
の特性によって複数用意された音素の標準パタンと比較
することにより、第２段階においてより細かな音声認識
が行なわれ、認識率が高くなるという効果が得られる。As explained above, according to the present invention, word-level spotting in the first stage of speech recognition derives the speech sections and candidate words, and in the second stage these are compared with phoneme standard patterns prepared in multiple versions according to voice characteristics. The second stage thus performs finer-grained recognition, which raises the recognition rate.

[Brief explanation of the drawing]

【図１】本実施例の基本的なブロック図。FIG. 1 is a basic block diagram of this embodiment.

【図２】標準パターン作成フロー。FIG. 2 is a standard pattern creation flow.

【図３】話者カテゴリ作成フロー。FIG. 3 is a speaker category creation flow.

【図４】話者カテゴリの概念図。FIG. 4 is a conceptual diagram of speaker categories.

【図５】特徴ベクトルの表現例示図。FIG. 5 is a diagram illustrating a representation of a feature vector.

【図６】処理選択部の内部構成図。FIG. 6 is an internal configuration diagram of a processing selection section.

【図７】全体の流れを示すフローチャート。FIG. 7 is a flowchart showing the overall flow.

【図８】マハラノビス距離を用いたマッチングの例示図
。FIG. 8 is an exemplary diagram of matching using Mahalanobis distance.

【図９】話者カテゴリのデータフォーマット図。FIG. 9 is a data format diagram of speaker categories.

【図１０】音素マッチングの例示図。FIG. 10 is an exemplary diagram of phoneme matching.

【図１１】複数の候補単語と入力信号の例示図。FIG. 11 is an exemplary diagram of a plurality of candidate words and input signals.

【図１２】複数の言語についての話者カテゴリを有する
時のデータフォーマット図。FIG. 12 is a data format diagram for the case where speaker categories are held for multiple languages.

【図１３】従来の音声認識システムの構成図。FIG. 13 is a configuration diagram of a conventional speech recognition system.

Claims

[Claims]

1. A speech recognition method, characterized by: inputting voice information; storing, in storage means, word information used as a reference when recognizing the voice information and phoneme information classified according to voice characteristics; determining the characteristics of the input voice information; matching the input voice information against the word information by a spotting method to derive a candidate word and a speech section of the candidate word; and, for the derived speech section, retrieving the phoneme information corresponding to the candidate word from the storage means according to the determined voice characteristics and matching it against the input voice.

2. The speech recognition method according to claim 1, wherein the characteristics of the voice differ depending on the speaker who utters the voice.

3. A speech recognition device comprising: input means for inputting voice information; storage means for storing word information used as a reference when recognizing the voice information and phoneme information classified according to voice characteristics; means for determining the characteristics of the input voice information; derivation means for matching the input voice information against the word information by a spotting method and deriving a candidate word and a speech section of the candidate word; and phoneme recognition means for retrieving, for the derived speech section, the phoneme information corresponding to the candidate word from the storage means according to the determined voice characteristics and matching it against the input voice.

4. The speech recognition device according to claim 3, wherein the characteristics of the voice differ depending on the speaker who utters the voice.