JPS58121100A

JPS58121100A - Word voice recognition system

Info

Publication number: JPS58121100A
Application number: JP57004272A
Authority: JP
Inventors: 古井 貞煕 (Sadaoki Furui)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1982-01-14
Filing date: 1982-01-14
Publication date: 1983-07-19

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] <Background of the Invention> The present invention relates to a word speech recognition system that can simultaneously recognize many kinds of spoken words input from a plurality of input devices. Conventional word speech recognition systems store each word reference pattern as a time series of spectral parameters, so the storage capacity needed to hold the patterns becomes enormous as the number of words grows. As a result, when a plurality of recognition circuits are provided so that speech from a plurality of input devices can be recognized simultaneously, the number of words each individual recognition circuit can recognize has to be kept small.

<Overview of the Invention> To resolve these drawbacks, the present invention stores each word reference pattern as a combination of pseudo-phoneme reference patterns and a word dictionary. This reduces the required storage capacity, so that a wider variety of word sets can be held as reference patterns. By specifying, at the time the device is used, which word set is to be the recognition target, a single device can in effect recognize a large vocabulary.

<Embodiment> FIG. 1 shows an embodiment of the present invention. Speech from a plurality of speech input terminals 1-1 to 1-n is input to a recognition unit selection circuit 2. Recognition circuits 14-1 to 14-m, fewer in number than the input terminals 1-1 to 1-n, are connected to the recognition unit selection circuit 2. Each of the recognition circuits 14-1 to 14-m consists of a spectral parameter extraction unit 3, a pseudo-phoneme reference pattern storage unit 4, a spectral distance calculation unit 5, a word dictionary storage unit 6, an address specification unit 7, a time-normalized spectrum matching unit 8, and a word decision unit 9. The word recognized by each of the recognition circuits 14-1 to 14-m is output to a recognition result output terminal 11 through a recognition result output interface unit 10. The recognition unit selection circuit 2, the recognition circuits 14-1 to 14-m, and the output interface unit 10 are controlled by a control unit 12, which receives control signals from a control signal input terminal 13.

In this word speech recognition system, a spectral parameter set for each pseudo-phoneme is stored in advance, before use, in each pseudo-phoneme reference pattern storage unit 4.

These spectral parameters are correlation coefficients, cepstra, band-pass filter output powers, and the like. To create the pseudo-phoneme reference patterns, the method described in the specification of Japanese Patent Application No. Sho 55-139094, invented by Sugamura, Furui, and Hakoda, can be used, for example. In that method, from a large number of spectral parameter sets extracted from the speech of one or more speakers, a clustering technique selects representative sets, from several tens up to about 200 kinds, and these are stored in the pseudo-phoneme reference pattern storage unit 4. In addition, the word dictionary storage unit 6 stores each word to be recognized as a continuous sequence of symbols, each symbol denoting a pseudo-phoneme reference pattern.
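The two storage steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the source says "a clustering technique" without naming one, so plain k-means is swapped in here, and the function names (`kmeans`, `encode_word`), the squared Euclidean distance, and the two-dimensional toy frames are all hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two spectral parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, frame):
    """Index (symbol) of the pseudo-phoneme pattern closest to one frame."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], frame))

def kmeans(frames, k, iters=20, seed=0):
    """Pick k representative parameter sets (k-means is an assumption here)."""
    rng = random.Random(seed)
    codebook = rng.sample(frames, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(codebook, f)].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centre if a cluster went empty
                dim = len(bucket[0])
                codebook[j] = tuple(
                    sum(f[d] for f in bucket) / len(bucket) for d in range(dim)
                )
    return codebook

def encode_word(codebook, frames):
    """Dictionary entry: one pattern symbol per frame of a training utterance."""
    return [nearest(codebook, f) for f in frames]

# Toy data: four frames forming two obvious clusters.
training_frames = [(0.0, 0.1), (0.2, 0.0), (10.0, 10.1), (9.8, 10.0)]
codebook = kmeans(training_frames, k=2)
symbols = encode_word(codebook, training_frames)
```

In a real system the codebook would hold several tens to about 200 patterns, as stated above, and each dictionary entry would be the symbol sequence of one word.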

For this as well, the method described in the specification of the above-mentioned invention by Sugamura, Furui, and Hakoda can be used, for example.

In general, even when the number of words to be recognized is large, the words that need to be considered as candidates for the input speech at any given point in the recognition process are often only a subset of the full vocabulary. For example, even when a service recognizes dates, days of the week, place names, and so on, there is not necessarily any need to include days of the week among the candidates at the point where the speaker is expected to say a date. Therefore, in the word speech recognition system of the present invention, when the word dictionary is stored, it is divided into groups of words that are to be recognition targets at each point. For each group, a table is created that indicates from which entry to which entry in the word dictionary the group's sequences extend, and this table is stored in the address specification unit 7.
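As a minimal sketch of such a table, assuming the dictionary is laid out so that each word group occupies one contiguous run of entries: the words, symbol values, group names, and the `candidates` helper below are all hypothetical and only illustrate the "from which entry to which entry" lookup.

```python
# Hypothetical word dictionary: (word name, pseudo-phoneme symbol sequence).
# Entries of the same group are stored contiguously.
WORD_DICTIONARY = [
    ("Tokyo",   [3, 3, 17, 42, 8]),     # entry 0
    ("Osaka",   [5, 21, 21, 9]),        # entry 1
    ("Sapporo", [30, 2, 2, 14, 11]),    # entry 2
    ("first",   [7, 7, 19, 4]),         # entry 3
    ("second",  [12, 6, 6, 25]),        # entry 4
]

# Table held by the address specification unit:
# group name -> (first entry, last entry), both inclusive.
GROUP_TABLE = {
    "place": (0, 2),
    "day":   (3, 4),
}

def candidates(group):
    """Return only the dictionary entries that belong to the given group."""
    first, last = GROUP_TABLE[group]
    return WORD_DICTIONARY[first:last + 1]

place_names = [name for name, _ in candidates("place")]
```

Restricting the search to one address range means the matcher never has to score words outside the group that is relevant to the current question.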

<Recognition Operation> Once the pseudo-phoneme reference patterns, the word dictionary, and the table of word groups have been stored in this way, recognition of unknown spoken words can begin. Take as an example a service system that reserves airline seats by recognizing speech over the telephone. When the system detects an incoming call from a user, it announces by synthesized speech that this is the seat reservation service, then asks questions by synthesized speech in a predetermined order and recognizes the spoken word the user gives in response to each question. As noted above, when the system asks "Where are you departing from?", for example, only the word group of place names needs to be the recognition target (the word candidates), and when it asks "What date?", only the word group of dates needs to be the recognition target.

Accordingly, immediately after asking a question by synthesized speech, the system inputs a recognition start signal to the control signal input terminal 13. The control unit 12 determines which of the recognition circuits 14-1 to 14-m is idle, and the recognition unit selection circuit 2 connects whichever of the speech input terminals 1-1 to 1-n is receiving speech to the idle recognition circuit.

Next, the name (group number) of the word group to be recognized is input to the control signal input terminal 13, which specifies the candidate word set to be recognized. This value is passed to the address specification unit 7 of the recognition circuit selected as described above. Using the group name (group number) together with the table stored in the address specification unit 7, the address range of the sequences in the word dictionary that are to be recognition targets is specified.

After this, the waveform of the unknown spoken word is input from the speech input terminal. The acoustic waveform input to the speech input terminal may have passed through a telephone line or may have been picked up by a microphone. The speech waveform is sent to the spectral parameter extraction unit 3 in the selected recognition circuit and is spectrally analyzed at short intervals of, for example, about 10 ms. For each short interval, the spectral distance calculation unit 5 calculates the spectral distance between the analysis result and each pseudo-phoneme pattern read out from the pseudo-phoneme reference pattern storage unit 4. Using these calculation results and the pseudo-phoneme pattern sequences within the specified address range stored in the word dictionary storage unit 6, that is, the sequences of the specified candidate word set, the time-normalized spectrum matching unit 8 performs spectral matching that absorbs temporal stretching and compression of the speech, and the degree of similarity between the input speech and each sequence is input to the word decision unit 9. The word decision unit 9 selects the sequence with the highest degree of similarity and outputs its word name (word number) as the recognition result to the recognition result output terminal 11 via the recognition result output interface unit 10.
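The matching stage described above can be sketched as follows. This is a hedged illustration, not the patent's circuit: the source only says that the matching "absorbs temporal stretching and compression", so ordinary dynamic time warping (DTW) is assumed here, with distance (smaller is better) standing in for similarity. The function names, the squared Euclidean distance, the length normalization, and the one-dimensional toy data are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two spectral parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_table(frames, codebook):
    """One row per input frame, one column per pseudo-phoneme pattern.
    This is the only place where spectral distances are computed."""
    return [[sq_dist(f, c) for c in codebook] for f in frames]

def dtw_cost(table, symbols):
    """Time-normalized cost of aligning the input with one symbol sequence,
    using table lookups instead of fresh spectral distance calculations."""
    n_frames, n_syms = len(table), len(symbols)
    inf = float("inf")
    acc = [[inf] * n_syms for _ in range(n_frames)]
    for i in range(n_frames):
        for j in range(n_syms):
            local = table[i][symbols[j]]
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[i - 1][j] if i > 0 else inf,
                    acc[i][j - 1] if j > 0 else inf,
                    acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            acc[i][j] = local + best
    return acc[-1][-1] / (n_frames + n_syms)

def recognize(frames, codebook, candidates):
    """Return the name of the candidate word with the lowest DTW cost."""
    table = distance_table(frames, codebook)
    return min(candidates, key=lambda entry: dtw_cost(table, entry[1]))[0]

# Toy example: three patterns and two candidate words.
codebook = [(0.0,), (1.0,), (2.0,)]
candidates = [("up", [0, 1, 2]), ("down", [2, 1, 0])]
rising = [(0.1,), (0.0,), (0.9,), (1.1,), (2.0,), (1.9,)]
result = recognize(rising, codebook, candidates)
```

Note that `distance_table` is built once per utterance and shared by every candidate word, which mirrors the role of the spectral distance calculation unit 5 feeding the matching unit 8.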

When a plurality of representative sequences are stored for each word to be recognized, as in speaker-independent recognition, a highly reliable word decision can be made by taking out several of the sequences with the highest degrees of similarity and deciding the word name by majority vote among them.
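A minimal sketch of this majority decision, under assumptions: scores are expressed as distances (smaller means more similar), the number of sequences taken out (`top_k`) is a hypothetical parameter, and ties are broken in favour of the closest sequence, which the source does not specify.

```python
from collections import Counter

def decide_by_majority(scored, top_k=3):
    """scored: list of (word name, distance), one item per stored
    representative sequence. Vote among the top_k closest sequences."""
    best = sorted(scored, key=lambda pair: pair[1])[:top_k]
    votes = Counter(word for word, _ in best)
    top = max(votes.values())
    tied = {word for word, count in votes.items() if count == top}
    # Assumed tie-break: the tied word whose sequence is closest wins.
    return next(word for word, _ in best if word in tied)

# Two representative sequences per word, e.g. from different speakers.
scored = [("Osaka", 0.20), ("Tokyo", 0.25), ("Tokyo", 0.27), ("Osaka", 0.90)]
decision = decide_by_majority(scored, top_k=3)
```

Here the single closest sequence belongs to "Osaka", but two of the three closest belong to "Tokyo", so the vote overrides the nearest-neighbour choice.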

With this configuration, each spoken word is represented as a sequence of symbols whose units are pseudo-phoneme pattern names (symbols), so the storage it requires is far smaller than in the conventional system, which stores spectral parameters. Each of the plurality of recognition circuits can therefore hold the reference forms of a large number of words, and a large number of words can be made recognition targets. Because the set of words to be targeted can be chosen freely from this large vocabulary at each point in time, speech input from any of the plurality of input devices (input circuits) can be routed to whichever recognition circuit is idle at that moment, and recognition can be performed against the word set specified for it. Compared with the conventional system, in which the kinds of words each recognition circuit can recognize are fixed per circuit, the system of the present invention allows a single recognition device to be used very efficiently for speech from a plurality of input devices.
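The idle-circuit routing described above might be modelled as below. This is a software sketch under assumptions, not the patent's hardware: the class name `RecognizerPool`, its methods, and the first-idle selection rule are all hypothetical.

```python
class RecognizerPool:
    """Models m recognition circuits shared by a larger number of inputs."""

    def __init__(self, n_circuits):
        self.busy = [False] * n_circuits
        self.group = [None] * n_circuits  # word group assigned to each circuit

    def acquire(self, group):
        """Claim the first idle circuit and assign it a candidate word group.
        Returns the circuit index, or None if every circuit is busy."""
        for idx, in_use in enumerate(self.busy):
            if not in_use:
                self.busy[idx] = True
                self.group[idx] = group
                return idx
        return None

    def release(self, idx):
        """Mark a circuit idle again once its recognition result is output."""
        self.busy[idx] = False
        self.group[idx] = None

pool = RecognizerPool(n_circuits=2)
first = pool.acquire("place")   # a caller answering "Where from?"
second = pool.acquire("date")   # another caller answering "What date?"
third = pool.acquire("place")   # no circuit left, this caller must wait
```

Because any circuit can take any word group, a caller only waits when all circuits are busy, rather than when the one circuit holding the needed vocabulary is busy.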

<Effects of the Invention> As described above, in the word speech recognition system according to the present invention, the set of candidate words targeted by each of the plurality of recognition circuits is specified at each point in time, and speech from the plurality of input devices is recognized by whichever recognition circuit happens to be idle. Consequently, when speech entered by many users through telephones or microphones is to be recognized, a large vocabulary can be handled and the processing efficiency is greatly increased, which reduces the probability that a user is kept waiting. Furthermore, in the word recognition system according to the present invention, the only distance (similarity) calculation needed for matching is between the input speech and the pseudo-phoneme reference patterns. This has the advantage of greatly reducing the amount of computation compared with the conventional system, which stores a spectral parameter sequence as the reference form of each word.
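To make the size of this saving concrete, the following back-of-the-envelope sketch counts spectral distance calculations and stored values for both schemes. Apart from the figure of about 200 patterns, which the description gives as an upper range, every number is a hypothetical assumption, as are the function names. The conventional count assumes every input frame is compared with every reference frame of every word, ignoring any search window, and the proposed scheme is assumed to store one symbol per reference frame.

```python
def spectral_distances_per_frame(n_words, frames_per_word, n_patterns):
    """Spectral distance calculations needed for one input frame."""
    conventional = n_words * frames_per_word  # against every reference frame
    proposed = n_patterns                     # against each pattern once
    return conventional, proposed

def stored_values(n_words, frames_per_word, n_params, n_patterns):
    """Numbers (parameters or symbols) held as reference data."""
    conventional = n_words * frames_per_word * n_params
    proposed = n_patterns * n_params + n_words * frames_per_word
    return conventional, proposed

# Hypothetical sizes: 1000 words, 50 frames per word, 10 parameters per frame.
dist_counts = spectral_distances_per_frame(1000, 50, 200)
storage_counts = stored_values(1000, 50, 10, 200)
```

Under these assumed sizes the per-frame distance count drops from 50,000 to 200, and the stored values drop from 500,000 parameters to 2,000 parameters plus 50,000 symbols.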

[Brief explanation of the drawing]

The figure is a block diagram showing the basic configuration of the word speech recognition system according to the present invention. 1-1 to 1-n: speech input terminals; 2: recognition unit selection circuit; 3: spectral parameter extraction unit; 4: pseudo-phoneme reference pattern storage unit; 5: spectral distance calculation unit; 6: word dictionary storage unit; 7: address specification unit; 8: time-normalized spectrum matching unit; 9: word decision unit; 10: recognition result output interface unit; 11: recognition result output terminal; 12: control unit; 13: control signal input terminal; 14-1 to 14-m: recognition circuits. Patent Applicant: Nippon Telegraph and Telephone Public Corporation. Agent: Kuruma Kusano.

Claims

[Claims]

(1) A word speech recognition system for a device that recognizes speech input from a plurality of input devices, characterized in that: a plurality of recognition circuits are provided; each recognition circuit is provided with pseudo-phoneme reference patterns and with a dictionary containing a plurality of candidate word sets, each word being represented by a string of symbols denoting those patterns; a recognition unit selection circuit is provided that supplies the speech input from the plurality of input devices to one of the recognition circuits; a recognition circuit in an idle state is selected in response to detection of input to the device; the set of candidate words to be recognized is specified to that recognition circuit; and the word whose reference pattern is most similar to the input speech is selected from the specified candidate word set and output as the recognition result.