JPS58189694A - Voice recognition system

Info

Publication number: JPS58189694A
Application number: JP57071225A
Authority: JP
Inventors: 市川　熹; 畑岡　信夫; 俊宏木村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1982-04-30
Filing date: 1982-04-30
Publication date: 1983-11-05
Also published as: JPH0421880B2

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a speech recognition method, and in particular to an improvement in a method for recognizing continuous word speech uttered by unspecified speakers.

Conventionally, several methods have been proposed for recognizing speech uttered by unspecified speakers: a learning method that examines the features of the input speech and transforms the standard patterns to fit those features; conversely, a normalization method that transforms the input speech to fit the standard patterns; a multi-standard method that predicts in advance the range of speech variation caused by differences between speakers and places a large number of standard patterns within that range; and a discriminant function method combined with suitable preprocessing. Of these, the methods that currently achieve a practical level of recognition are the multi-standard method and the discriminant function method. Furthermore, when extending the capability to continuous word recognition, the multi-standard method is practically the only realistic approach.

However, in continuous word recognition, once the possible combinations of word sequences are taken into account, the amount of processing required for recognition increases greatly even when techniques such as the two-stage DP method or the continuous DP method are used. Consequently, a system that performs speaker-independent continuous word recognition with the multi-standard method unchanged requires a very large pattern matching unit and related hardware, which makes it economically unrealistic.

The object of the present invention is to remedy these problems.

The invention relies on the observation that, within any one usage session, the person speaking to the speech recognition device cannot change during use, for example from a man to a woman or from an adult to a child.

That is, in the present invention, the states of the recognition device are divided into a state for recognizing discretely uttered speech, which requires relatively little recognition processing (state 1), and a state for recognizing continuously uttered speech, which requires a large amount of processing (state 2). The input speech is first recognized in state 1 to narrow down the characteristics of the speaker. Recognition in state 2 is then performed using only the set of standard patterns that shares those characteristics, which reduces the amount of processing in state 2.

The present invention is described below with reference to an embodiment.

FIG. 1 shows an example configuration of a telephone information service system to which the present invention is applied. It consists of a system control unit 1, a speech recognition unit 2 according to the present invention, a voice response unit 3, a hybrid coil 4, and a subscriber telephone 5. Only the parts of the telephone information service system needed to explain the present invention are shown.

FIG. 2 shows a configuration example of a speech recognition device used to explain the present invention. In FIG. 2, the control unit 21 exchanges commands and results 27 with the system control unit 1 of FIG. 1 and also controls the speech recognition unit 2. For the input speech analyzed by the analysis unit 22, the similarity calculation unit 23 calculates the similarity to the standard pattern data in the standard pattern memory 24, and the continuous pattern matching unit 25 calculates the optimum matching value against each standard pattern. The result is judged by the decision unit 26, and the decision result is sent to the control unit 21. Since the configuration of a recognition device that performs continuous pattern matching is already known (see Japanese Unexamined Patent Publication No. Sho 55-2205), its description is omitted. In this example device, the input pattern is always matched against the designated standard patterns. Therefore, if the input is known in advance to be a discrete utterance, the decision unit 26 need only judge the output of the matching unit as a discretely uttered word; for continuous word input, it judges the output as a sequence of continuous words. The operation of the continuous pattern matching unit 25 is the same in both cases. The following description applies equally to a device in which the operation of the matching unit 25 is switched between discrete-utterance and continuous-utterance modes (see, for example, Japanese Patent Application No. Sho 55-158296).
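The matching step described above can be illustrated with a minimal sketch. The source does not give the details of the similarity calculation or of the continuous pattern matching, so everything here is an assumption: frames are plain lists of numbers, the frame similarity is a Euclidean distance, and the alignment is ordinary isolated-word DP matching (dynamic time warping), not the continuous DP method the text refers to. The names `frame_distance`, `dp_match`, and `decide` are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature frames (an assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dp_match(input_frames, template_frames):
    """Cumulative cost of the best DP alignment between two frame sequences."""
    n, m = len(input_frames), len(template_frames)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_frames[i - 1], template_frames[j - 1])
            # Allow a step in the input, in the template, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def decide(input_frames, templates):
    """Return the label of the standard pattern with the lowest alignment cost."""
    return min(templates, key=lambda label: dp_match(input_frames, templates[label]))
```

The work grows with the number of standard patterns passed to `decide`, since each one needs its own DP alignment. That is the cost the invention tries to limit.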

Suppose the registered words are "yes", "no", and the digits 0 to 9. To account for differences among speakers, suppose also that five standard patterns are registered for each of the man, woman, and child groups, that is, 3 × 5 = 15 standard patterns for each word and digit. Taking a bank balance inquiry as an example and returning to FIG. 1: when a call from a user reaches the system, the voice response unit 3 first asks the user, "Is this a balance inquiry?" At the same time, following a command from the system control unit 1, the speech recognition unit 2 waits for input in the mode (state 1) in which the two words "yes" and "no" are recognized as discrete input. When the user answers "yes" or "no", the recognition unit 2 need only match the input against a total of 30 standard patterns, 15 for each of the two words.
If the best-matching standard pattern belongs to the man group (or the woman or child group), the control unit 21 issues a command so that, in the subsequent state 2 (the continuous word recognition state), only the digit standard patterns belonging to the man group (or the woman or child group) are used. In the next step, the voice response unit 3 says to the user, "Please give your PIN," and the recognition unit 2 enters state 2, in which it can recognize continuous digits.
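The group selection in this example can be sketched as follows. This is only an illustration under stated assumptions: a standard pattern is represented by a hypothetical key `(word, group, index)`, the matching scores are supplied as a dictionary of distances (lower is better), and the function names are invented for the sketch rather than taken from the source.

```python
GROUPS = ("man", "woman", "child")
PATTERNS_PER_GROUP = 5  # five standard patterns per word per group, as in the example

def build_bank(words):
    """List every standard-pattern key (word, group, index) for the given words."""
    return [(w, g, k) for w in words for g in GROUPS for k in range(PATTERNS_PER_GROUP)]

def select_group(scores):
    """State 1: pick the recognized word and speaker group from the best match.

    `scores` maps each (word, group, index) key to a distance; lower is better.
    """
    word, group, _ = min(scores, key=scores.get)
    return word, group

def restrict_bank(bank, group):
    """State 2: keep only the standard patterns that belong to the selected group."""
    return [key for key in bank if key[1] == group]

# Usage: state 1 matches "yes"/"no", then state 2 uses one group's digit patterns.
state1_bank = build_bank(["yes", "no"])
digit_bank = build_bank([str(d) for d in range(10)])
```

With the word list of the example, `state1_bank` holds 30 keys and `digit_bank` holds 150, of which `restrict_bank` keeps 50 for any one group.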

When the user then speaks a PIN such as "1234", the recognition unit 2 need only match the input against the 10 × 5 = 50 digit standard patterns belonging to the man group (or the woman or child group). The matching capacity of the recognition unit 2 therefore only has to handle at most 50 standard patterns. In contrast, if recognition were performed without first determining the group in state 1, matching against 10 × 15 = 150 standard patterns would be required.
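The pattern counts quoted in this example can be checked with a short calculation. The helper `pattern_count` is a hypothetical name introduced for this sketch; the figures (2 words, 10 digits, 3 groups, 5 patterns per group) come from the example above.

```python
def pattern_count(num_words, num_groups, patterns_per_group):
    """Number of standard patterns that one recognition pass must be matched against."""
    return num_words * num_groups * patterns_per_group

state1 = pattern_count(2, 3, 5)               # "yes"/"no" over all three groups
state2_restricted = pattern_count(10, 1, 5)   # digits, one selected group only
state2_unrestricted = pattern_count(10, 3, 5) # digits, all groups

# Ratio of matching work in state 2 without and with the group restriction.
reduction = state2_unrestricted // state2_restricted
```

Restricting state 2 to one group thus cuts the number of standard patterns to one third, and the cheap state-1 pass that makes this possible needs only 30 comparisons.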

As described above, the present invention makes it possible to realize, economically, a system that recognizes speech uttered continuously by unspecified speakers, and its effect is significant.

[Brief description of the drawings]

FIG. 1 shows an example configuration of a telephone information service system to which the present invention is applied, and FIG. 2 shows the block configuration of a speech recognition device according to the present invention.

Claims

[Claims]

A speech recognition method having a first recognition state for recognizing discretely uttered speech and a second recognition state for recognizing continuously uttered speech using a plurality of sets of standard patterns prepared, for speech patterns of the same meaning, for a plurality of types of speakers with different characteristics, the method being characterized in that the characteristics of the input speech are recognized in the first recognition state and the sets of standard patterns to be used in the second recognition state are limited based on the recognized characteristics.