JPH0876789A

JPH0876789A - System and device for voice recognition unspecified speaker word

Info

Publication number: JPH0876789A
Application number: JP6209794A
Authority: JP
Inventors: Dominiku Uon; ドミニクウォン; Yuji Okuda; 裕二奥田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1994-09-02
Filing date: 1994-09-02
Publication date: 1996-03-22

Abstract

PURPOSE: To recognize speech with high precision using a smaller amount of computation and a smaller memory size by using the pitch and line spectrum pair (LSP) parameters of the input speech for recognition. CONSTITUTION: A power/LSP/pitch detecting section 12 detects the power, the LSP parameters, and the pitch of the voice signal. A sampling section 14 samples, at a prescribed sampling period, the segment between the start and the end of the utterance detected by a voice segment detecting section 13. A first dictionary section 16 stores the sampled pitch, and a second dictionary section 18 stores the sampled LSP parameters. A first matching section 15 compares the output of the section 14 with the pitch stored in the section 16 and recognizes a word. A second matching section 17 compares the recognition result of the section 15 with the LSP parameters stored in the section 18 and recognizes the word.

Description

Detailed Description of the Invention

[0001]

[Industrial field of application] The present invention relates to an unspecified speaker word speech recognition system used, for example, for voice dialing on mobile phones and the like.

[0002]

[Prior art] Word speech recognition systems have been studied extensively. In general, such a system recognizes speech by comparing an input pattern with reference patterns stored in a dictionary, and the reference that matches best is taken to be the spoken word. However, even the same word varies in speed and length depending on how it is spoken, so improving recognition accuracy is difficult in several respects. The systems developed to date are either speaker-specific word speech recognition systems or speaker-unspecified word speech recognition systems. An example of the latter is described on pp. 185-187 of Sadaoki Furui, "Digital Speech Processing" (Digital Technology Series), Tokai University Press. FIG. 3 is a block diagram showing the unspecified speaker word speech recognition system described in that document. The system of FIG. 3 uses what is called a multi-template method, and it has been shown that this method needs 12 reference templates per word to obtain good recognition accuracy. The method therefore has the drawback that the amount of computation and the memory size grow as the number of reference templates per word increases. This is not a problem if the size and complexity of the device are unconstrained, but the amount of computation and the memory size become a serious problem when this kind of word recognition system is to be used in a small, low-power device such as a mobile phone.

[0003]

[Problem to be solved by the invention] As described above, the conventional unspecified speaker word speech recognition system requires a large amount of computation and a large memory, and is unsuited to devices such as mobile phones that must be small and consume little power. The present invention has been made in view of this problem, and its object is to provide an unspecified speaker word speech recognition system, suitable for mobile phones and the like, that can recognize speech with a small amount of computation and a small memory size while maintaining reasonable recognition accuracy.

[0004]

[Means for solving the problem] To solve the above problem, the present invention comprises: a power/LSP/pitch detection unit that detects the power, LSP parameters, and pitch of digitized input speech; a voice section detection unit that detects the start and end points of the utterance in the input speech; a sampling unit that samples, at a predetermined sampling period, the power, LSP parameters, and pitch detected within the section from the start to the end of the utterance detected by the voice section detection unit; a first dictionary unit that stores the pitch sampled by the sampling unit; a second dictionary unit that stores the LSP parameters sampled by the sampling unit; a first matching unit that recognizes a word by comparing the output of the sampling unit with the pitch stored in the first dictionary unit; and a second matching unit that recognizes the word by comparing the recognition result of the first matching unit with the LSP parameters stored in the second dictionary unit.

[0005] Further, the method of the present invention comprises the steps of: detecting the power, LSP parameters, and pitch of digitized input speech; detecting the start and end points of the utterance in the input speech; sampling, at a predetermined sampling period, the power, LSP parameters, and pitch detected within the section from the start to the end of the utterance detected by the voice section detection unit; storing the sampled pitch in a first dictionary unit; storing the sampled LSP parameters in a second dictionary unit; recognizing a word by comparing the sampling result with the pitch stored in the first dictionary unit; and recognizing the word by comparing that recognition result with the LSP parameters stored in the second dictionary unit.

[0006]

[Operation] In the above configuration, the pitch and the LSP parameters of the input speech are used for speech recognition. The initial pitch comparison preselects candidate words, which reduces the effect of utterance variation. This in turn reduces the number of matching operations in the subsequent LSP parameter comparison, so speech can be recognized more accurately and with less computation and a smaller memory size.

[0007]

[Embodiment] An embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an embodiment of the unspecified speaker word speech recognition system of the present invention. In FIG. 1, 10 is an analog/digital converter that converts analog speech into a digital signal; 11 is a preprocessing unit that emphasizes the high-frequency region of the input signal; 12 is a power/LSP/pitch detection unit that detects the power, LSP (Line Spectrum Pair) parameters, and pitch of the speech signal, described later; 13 is a voice section detection unit that detects the start and end times of the input speech; 14 is a sampling unit that samples, at a predetermined sampling period, the section from the start to the end of the utterance detected by the voice section detection unit 13; 16 is a first dictionary unit that stores the pitch sampled by the sampling unit 14; 18 is a second dictionary unit that stores the LSP parameters sampled by the sampling unit 14; 15 is a first matching unit that recognizes a word by comparing the output of the sampling unit 14 with the pitch stored in the first dictionary unit 16; 17 is a second matching unit that recognizes the word by comparing the recognition result of the first matching unit 15 with the LSP parameters stored in the second dictionary unit 18; and 19 is a mode switching unit that switches, in response to an external input, between training processing (shown as MODE = 1 in FIG. 1) and recognition processing (shown as MODE = 2 in FIG. 1).

[0008] Next, the operation of each unit will be described; first, however, the speech that forms the system input is explained. Speech input to the system is generated by vibration of the vocal cords. During normal breathing the vocal cords are wide apart, but when a person starts to speak they draw together, and air from the lungs is forced through the gap. The interaction between the airflow and the vocal cords makes the cords open and close periodically, which interrupts the airflow almost regularly. This change in flow can be approximated by an asymmetric triangular wave and is the sound source of audible speech. When the vocal cords are tense and the air pressure from the lungs is high, the opening/closing period of the vocal cords, that is, the vibration period, becomes shorter and the sound source becomes higher in pitch; in the opposite case it becomes lower. This corresponds to the pitch. A sound accompanied by vocal cord vibration is called a voiced sound, and a sound without it is called an unvoiced sound. A voiced sound has a pitch, whereas an unvoiced sound has no periodicity, has a pitch value of 0, and is treated by this system as broadband noise. Consequently, for any speaker, a pitch value of 0 indicates an unvoiced sound and a non-zero value indicates a voiced sound. For example, gusts of wind, sneezes, and coughs are treated as unvoiced sounds that have no pitch. When speech is recognized on the basis of pitch, these unvoiced sounds are excluded, so they do not cause misrecognition. The present invention exploits this property, which enables more accurate and robust recognition and makes it possible to extend recognition to speakers on whom the system has not been trained while maintaining recognition accuracy. In addition, the present invention uses LSP parameters in the frequency domain for recognition. The LSP parameters serve as an approximation of the formant frequencies that characterize speech. The formant frequencies depend strongly on the vocal tract, which characterizes a speaker's utterance; because the candidates have already been narrowed by the initial pitch-based preselection, recognition accuracy improves.

[0009] In FIG. 1, speech generated by the mechanism described above is input to the analog/digital converter 10 and converted into a digital signal. The preprocessing unit 11 emphasizes the high-frequency region of the digitized speech. The power/LSP/pitch detection unit 12 then obtains the power, LSP parameters, and pitch of the input speech. Each value is detected per frame of a predetermined length. The pitch and LSP parameters are stored in a buffer in the order in which they are input. Next, the voice section detection unit 13 detects the start and end points of the speech from the detected power values. The voice section detection unit 13 compares the power with a predetermined threshold and regards the points at which the power crosses the threshold as the start and end points. In the present invention, this voice section is treated as one frame.
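The front-end steps of this paragraph (high-frequency emphasis, per-frame power, and threshold-based detection of the voice section) can be sketched as follows. This is only an illustrative sketch under assumptions not stated in the specification: the first-order pre-emphasis filter and its coefficient 0.97, the use of non-overlapping frames, mean squared amplitude as the power measure, and all function names are hypothetical choices.

```python
def pre_emphasize(samples, coeff=0.97):
    """Emphasize high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def frame_powers(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame of frame_len samples."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def detect_endpoints(powers, threshold):
    """Return (first, last) indices of frames whose power exceeds threshold, or None."""
    active = [i for i, p in enumerate(powers) if p > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Synthetic input: silence, a burst of alternating samples, then silence again.
signal = [0.0] * 40 + [1.0, -1.0] * 20 + [0.0] * 40
powers = frame_powers(pre_emphasize(signal), frame_len=10)
print(detect_endpoints(powers, threshold=0.5))  # → (4, 7)
```

In this sketch the start and end points are reported as frame indices, so the pitch and LSP values buffered for frames 4 through 7 would be the ones passed on to the sampling unit.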

[0010] Next, the sampling unit 14 samples, at a predetermined sampling period, the pitch and LSP parameters within the section from the start to the end of the utterance detected by the voice section detection unit 13. This is done because the length and speed of an utterance differ from person to person (for example, even when the same "1 (ichi)" is pronounced, its length and speed vary with the speaker and the occasion), and the aim is to minimize the computational complexity that these individual differences would otherwise cause. Specifically, the number of framed parameters between the start and end points is fixed at a predetermined number, so that it equals the number of samples held in the dictionary units described later, which reduces the variation. (This processing is called resampling.) Next, when an external input has set the mode to training mode, the training switch 19 switches to the MODE = 1 side in FIG. 1, that is, the dictionary unit side, and the pitch and LSP parameters are input to the first dictionary unit 16 and the second dictionary unit 18, respectively, and stored as reference data. When the mode is recognition mode, the training switch 19 switches to the MODE = 2 side in FIG. 1, that is, the comparison (matching) unit side. The first matching unit 15 compares the input pitch with the reference data stored in the first dictionary unit 16 for similarity. This comparison can be made, for example, by taking the exclusive OR of the two data items, or alternatively by calculating the Hamming distance between them. In the latter case, the reference with the smallest Hamming distance is regarded as the most likely candidate.
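The resampling step described in this paragraph, which fixes the number of per-frame parameters regardless of utterance length, might look like the sketch below. The specification does not say how the fixed number of samples is chosen from the frames; picking evenly spaced frames by nearest index is an assumption made here, and the function name and the default of 16 samples (borrowed from the FIG. 2 example) are illustrative.

```python
def resample_fixed(values, count=16):
    """Pick `count` evenly spaced items from `values` by nearest-index selection."""
    if not values:
        raise ValueError("no frames to resample")
    if count == 1:
        return [values[0]]
    last = len(values) - 1
    # Index i maps linearly onto [0, last]; short inputs repeat frames,
    # long inputs skip frames, so the output length is always `count`.
    return [values[round(i * last / (count - 1))] for i in range(count)]


slow_utterance = list(range(31))   # 31 frames of some per-frame parameter
fast_utterance = list(range(4))    # only 4 frames
print(len(resample_fixed(slow_utterance)), len(resample_fixed(fast_utterance)))
```

Because both the input utterance and the stored references end up with the same number of samples, the later comparison can be done position by position without any time alignment.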

[0011] As described above, the pitch value can be expressed as a binary value according to whether the sound is voiced or unvoiced. FIG. 2 shows an example. In FIG. 2, 16 samples are taken from the start point to the end point of the utterance, and each is represented by 1 for a voiced sound and 0 for an unvoiced sound. Although a voiced sound is represented by 1 in this example, the value need not be 1, and the pitch may be given multiple levels. With few pitch levels, variation is tolerated, so more candidates remain to be compared. With many pitch levels, the preselection discriminates by level, so fewer candidates remain to be compared.
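With binary pitch values, the exclusive-OR comparison and the Hamming distance mentioned for the first matching unit reduce to a bit count on a single 16-bit word. The sketch below illustrates this under assumptions: the pitch values and the reference pattern are made-up illustrative numbers, and packing the 16 samples into one integer is a hypothetical storage choice, not something the specification prescribes.

```python
def binarize_pitch(pitch_values):
    """Map each sampled pitch value to 1 (voiced, non-zero) or 0 (unvoiced, zero)."""
    return [1 if p != 0 else 0 for p in pitch_values]


def pack_bits(bits):
    """Pack a list of 0/1 values into an integer, first sample in the highest bit."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


def hamming_distance(a, b):
    """Number of differing bits: population count of the exclusive OR."""
    return bin(a ^ b).count("1")


# Hypothetical 16 resampled pitch values (0 means unvoiced).
pitch = [0, 0, 120, 118, 115, 0, 0, 110, 108, 0, 0, 0, 95, 90, 0, 0]
pattern = pack_bits(binarize_pitch(pitch))
reference = 0b0011110110001000      # hypothetical stored reference pattern
print(hamming_distance(pattern, reference))  # → 2
```

Stored this way, each reference in the first dictionary unit would occupy 16 bits per word, which is in keeping with the small-memory aim stated for the invention.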

[0012] After the comparison in the first matching unit 15, the second matching unit 17 compares the LSP parameters. Taking recognition of single digits as an example, the first matching unit 15 compares the input with the 10 references (0 through 9) stored in the first dictionary unit 16, and the second matching unit 17 compares it with a narrowed set of, for example, about three references stored in the second dictionary unit 18, which yields the final recognition result.
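The two-stage matching described above can be sketched as a preselection by pitch followed by an LSP comparison restricted to the surviving candidates. The specification does not state which distance measure the second matching unit uses; the squared Euclidean distance below is an assumption, as are the function names, the dictionary layout (word to reference sequence), and the toy reference values.

```python
def hamming(a, b):
    """Count positions where two equal-length 0/1 sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def squared_distance(a, b):
    """Sum of squared differences between two equal-length LSP sequences (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(pitch_bits, lsp, pitch_dict, lsp_dict, keep=3):
    """Two-stage match: preselect `keep` words by pitch, then decide by LSP."""
    # Stage 1: rank every word by Hamming distance on the binary pitch pattern.
    ranked = sorted(pitch_dict, key=lambda w: hamming(pitch_bits, pitch_dict[w]))
    candidates = ranked[:keep]
    # Stage 2: compare LSP parameters only for the surviving candidates.
    return min(candidates, key=lambda w: squared_distance(lsp, lsp_dict[w]))


# Toy dictionaries with shortened patterns, for illustration only.
pitch_dict = {"zero": [1, 1, 0, 0], "one": [1, 0, 0, 1],
              "two": [0, 1, 1, 0], "three": [0, 0, 1, 1]}
lsp_dict = {"zero": [0.10, 0.40], "one": [0.20, 0.50],
            "two": [0.30, 0.60], "three": [0.15, 0.45]}
print(recognize([1, 1, 0, 1], [0.19, 0.52], pitch_dict, lsp_dict, keep=2))  # → one
```

In this sketch the more expensive LSP comparison runs `keep` times instead of once per dictionary word, which is where the reduction in matching operations would come from.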

[0013] As described above, the unspecified speaker word speech recognition system of the present invention uses the pitch and the LSP parameters of the input speech for speech recognition. The initial pitch comparison preselects candidate words, which reduces the effect of utterance variation. This in turn reduces the number of matching operations in the subsequent LSP parameter comparison, so speech can be recognized more accurately and with less computation and a smaller memory size.

[0014] The present invention is not limited to the embodiment described above; various modifications are possible according to the field of application, for example to the sampling period used in resampling or to the pitch values.

[0015]

[Effect of the invention] As described above, the present invention can recognize speech with high accuracy and with a small amount of computation and a small memory size.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the unspecified speaker word speech recognition system of the present invention.

FIG. 2 is a diagram showing pitch values in the embodiment of FIG. 1.

FIG. 3 is a block diagram showing an example of a conventional unspecified speaker word speech recognition system.

[Explanation of symbols]

10 ... analog/digital converter, 11 ... preprocessing unit, 12 ... power/LSP/pitch detection unit, 13 ... voice section detection unit, 14 ... resampling unit, 15 ... first matching unit, 16 ... first dictionary unit, 17 ... second matching unit, 18 ... second dictionary unit, 19 ... mode switching unit.

Claims

[Claims]

1. An unspecified speaker word speech recognition system comprising: a power/LSP/pitch detection unit that detects the power, LSP parameters, and pitch of digitized input speech; a voice section detection unit that detects the start and end points of the utterance in the input speech; a sampling unit that samples, at a predetermined sampling period, the power, LSP parameters, and pitch detected within the section from the start to the end of the utterance detected by the voice section detection unit; a first dictionary unit that stores the pitch sampled by the sampling unit; a second dictionary unit that stores the LSP parameters sampled by the sampling unit; a first matching unit that recognizes a word by comparing the output of the sampling unit with the pitch stored in the first dictionary unit; and a second matching unit that recognizes the word by comparing the recognition result of the first matching unit with the LSP parameters stored in the second dictionary unit.

2. The unspecified speaker word speech recognition system according to claim 1, wherein the pitch value is the binary value "1" when the input is voiced and the binary value "0" when the input is unvoiced.

3. The unspecified speaker word speech recognition system according to claim 1, wherein the first matching unit compares the output value of the sampling unit with the pitch value stored in the first dictionary unit on the basis of a Hamming distance calculation result.

4. An unspecified speaker word speech recognition method comprising the steps of: detecting the power, LSP parameters, and pitch of digitized input speech; detecting the start and end points of the utterance in the input speech; sampling, at a predetermined sampling period, the power, LSP parameters, and pitch detected within the section from the start to the end of the utterance detected by the voice section detection unit; storing the sampled pitch in a first dictionary unit; storing the sampled LSP parameters in a second dictionary unit; recognizing a word by comparing the sampling result with the pitch stored in the first dictionary unit; and recognizing the word by comparing that recognition result with the LSP parameters stored in the second dictionary unit.