JPH067346B2 - Voice recognizer - Google Patents

Voice recognizer

Info

Publication number
JPH067346B2
JPH067346B2 JP59169569A JP16956984A
Authority
JP
Japan
Prior art keywords
syllable
candidate
section
reliability
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP59169569A
Other languages
Japanese (ja)
Other versions
JPS6147999A (en)
Inventor
文雄 外川 (Fumio Togawa)
伸 神谷 (Shin Kamiya)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sharp Corp
Original Assignee
Sharp Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Corp filed Critical Sharp Corp
Priority to JP59169569A priority Critical patent/JPH067346B2/en
Publication of JPS6147999A publication Critical patent/JPS6147999A/en
Publication of JPH067346B2 publication Critical patent/JPH067346B2/en

Description

【発明の詳細な説明】Detailed Description of the Invention

<Field of Industrial Application>
The present invention relates to an improvement of the speech recognition device in a voice input device used in voice-input document processing equipment and the like.

<Prior Art>
Consider decomposing input speech into finer syllable units (hereinafter, syllable segmentation), extracting a sequence of syllable intervals, and identifying several syllable candidates for each syllable to form a time series of syllable candidates (hereinafter, a syllable lattice). Candidate syllable strings are then obtained from the combinations of highest identification accuracy, matched against a dictionary, and the matching candidates are taken as the recognition result for the input speech. Conventionally, the syllable boundaries are determined uniquely at the syllable-segmentation stage. With such processing, however, as the input speech becomes more continuous and closer to conversational speech, the syllable boundaries become indistinct and hard to detect, and the recognition rate deteriorates.

Noting this point, the inventors have proposed improving boundary-detection accuracy by not fixing the syllable boundaries uniquely in the segmentation algorithm, but instead estimating the utterance speed and exploiting that estimate to raise the recognition rate. Even so, syllable segmentation errors still occur.

<Problems to Be Solved by the Invention>
The present invention aims to solve the problem that syllable segmentation errors occur even when utterance speed is exploited, and to provide a speech recognition device with a good recognition rate.

<Means for Solving the Problems>
To achieve the above object, the present invention is characterized in that the identification accuracy of a candidate syllable string is determined using both the identification distance of each syllable and a reliability based on the utterance speed.

<Operation>
The device of the present invention does not determine syllable boundaries uniquely. Using the two elements of syllable identification distance and utterance-speed-based reliability, it detects a plurality of candidate syllable boundaries and performs a syllable segmentation that allows several competing syllable intervals over the same time span. Then, using the syllable candidate lattice with competing entries obtained by identifying the syllable of each interval candidate, candidate syllable strings are created giving priority to the speech intervals whose syllable-interval detection is most reliable.

<Embodiment>
The present invention is described concretely below with reference to an embodiment shown in the drawings. Fig. 1 shows the configuration of a voice input device embodying the invention. Input speech is first acoustically analyzed in the speech analysis unit 1 and sent to the unvoiced-interval detection unit, where silent intervals are detected; the phrase-boundary detection unit 5 then decides whether each silent interval is a phrase boundary. Meanwhile, the voiced-interval detection unit 3 detects voiced parts, i.e. stretches of sound, and the syllable-boundary candidate detection unit 6 produces syllable-boundary candidates using power changes, spectral changes, phoneme changes, etc., together with the utterance-speed information stored in the utterance-speed storage unit 4. These results are merged with the unambiguous syllable-boundary information obtained in the speech analysis unit 1 by the syllable-interval candidate detection unit 7, which yields plausible syllable-interval candidates in decreasing order of reliability.

These syllable-interval candidates are identified in the syllable identification unit 8, for example by pattern matching against the syllable reference patterns stored in advance in the pattern memory 12, and, as described later, candidate syllables are output in increasing order of identification distance. The time series of syllable candidates so output, each carrying its identification distance, is the syllable candidate lattice (b). In parallel, the syllable-interval candidate reliability calculation unit 9 uses the utterance-speed information stored in the utterance-speed storage unit 4 to compute the reliability of each syllable-interval candidate according to Eq. (1) below, and the optimal syllable-interval sequence is estimated. Based on this optimal sequence, the candidate syllable string creation unit 10 determines the weighting factor K of each syllable interval of the other syllable-interval sequences and, following Eq. (2) below, creates candidate syllable strings in increasing order of the score S obtained from the identification distances and the interval-detection reliabilities. The language processing unit 11 applies language processing, including dictionary matching against the dictionary 13, to the candidate strings in turn, and the candidate strings that can be interpreted are output as the recognition result (c). Each syllable length of the syllable-interval sequence making up the candidate string confirmed by the speaker (or operator) is then stored in the utterance-speed storage unit 4, together with the previous average syllable length, as utterance-speed information for use in subsequent processing.

Next, the determination of syllable intervals is described.
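To make the front-end behavior concrete, here is a minimal, self-contained Python sketch of voiced-interval detection (unit 3) and syllable-boundary candidate detection (unit 6) over a toy frame-power contour. The threshold and local-minimum heuristics, and all names, are illustrative assumptions only; the patent's units also use spectral change, phoneme change, and the stored utterance speed.

```python
# Toy frame-power contour: two voiced stretches separated by silence.
power = [0, 0, 5, 8, 9, 3, 8, 9, 7, 0, 0, 0, 6, 9, 8, 0, 0]

def voiced_segments(power, thresh=1):
    """Unit 3 (sketch): contiguous runs of frames whose power exceeds a threshold."""
    segs, start = [], None
    for i, p in enumerate(power + [0]):        # trailing 0 flushes the final run
        if p > thresh and start is None:
            start = i
        elif p <= thresh and start is not None:
            segs.append((start, i))
            start = None
    return segs

def boundary_candidates(power, seg):
    """Unit 6 (sketch): local power minima inside a voiced segment become
    candidate syllable boundaries; several may compete over the same span."""
    s, e = seg
    return [i for i in range(s + 1, e - 1)
            if power[i] < power[i - 1] and power[i] <= power[i + 1]]

segs = voiced_segments(power)                  # [(2, 9), (12, 15)]
cands = boundary_candidates(power, segs[0])    # [5]: the power dip at frame 5
```

A dip yields only a candidate boundary, never a decision, matching the text's point that competing syllable intervals are kept and resolved later by the reliability of Eq. (1) and the score of Eq. (2).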

【1】 Computing the reliability of a syllable interval

The utterance-speed-based reliability of a syllable-interval candidate is obtained from the deviation of its syllable lengths from the average syllable length. To this end, when the voice input device is first used, the user speaks a standard text whose syllable count is known, and the average syllable length is estimated from it. If a text of n syllables consists of I voiced intervals and L(i) is the duration of the i-th voiced interval, the average syllable length is

    L̄ = (1/n) Σ_{i=1}^{I} L(i)

The reliabilities D1, D2, …, Dx, … of the syllable-interval sequences competing over the same time span are then given by

    Dx = (1/x) Σ_{i=1}^{x} d(X(i), L̄)    … (1)

where
    X(i): length of the i-th syllable interval of sequence X
    x: number of syllable intervals in sequence X
    L̄: average syllable length
    d(X(i), L̄): deviation of X(i) from L̄, normally d(X(i), L̄) = |X(i) − L̄| / L̄
    Dx: reliability of syllable-interval sequence X

The sequence with the smallest reliability value from Eq. (1) is taken as the optimal syllable-interval sequence.
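The quantities of Eq. (1) can be checked with a short sketch (an illustration under the notation above, not the patent's implementation), using the syllable lengths that appear in the worked example later in the text: Lb = 12, Lc = 27, Ld = 39 with average syllable length L̄ = 23.

```python
def deviation(length, avg):
    """d(X(i), L̄) = |X(i) - L̄| / L̄"""
    return abs(length - avg) / avg

def sequence_reliability(lengths, avg):
    """Eq. (1): Dx, the mean deviation of the x interval lengths of
    sequence X from the average syllable length; smaller is better."""
    return sum(deviation(l, avg) for l in lengths) / len(lengths)

L_avg = 23                                    # average syllable length
d_bc = sequence_reliability([12, 27], L_avg)  # B-C sequence: ~0.33
d_d = sequence_reliability([39], L_avg)       # D sequence:   ~0.70
# d_bc < d_d, so B-C is estimated to be the optimal syllable-interval
# sequence, as in the text (which rounds the per-interval deviations to
# 0.48, 0.17 and 0.70 before averaging, giving Dbc = 0.32).
```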

【2】 Creating candidate syllable strings

When several syllable-interval candidates are detected over the same time span, the competing entries of the syllable candidate lattice are evaluated with a score S computed from both the syllable identification distance and the interval reliability defined above, and candidate syllable strings are created in increasing order of this score. The identification distances are combined as follows, taking the optimal syllable-interval sequence as the reference.

Suppose the syllable-interval candidates shown in Fig. 2 have been detected and the B-C sequence competes with the D sequence. Let

    La, ai: syllable length of interval candidate A and identification distance of its i-th syllable candidate
    Lb, bj: syllable length of interval candidate B and identification distance of its j-th syllable candidate
    Lc, ck: syllable length of interval candidate C and identification distance of its k-th syllable candidate
    Ld, dl: syllable length of interval candidate D and identification distance of its l-th syllable candidate
    L̄: average syllable length

Then the reliabilities of the competing sequences are

    B-C sequence: Dbc = {d(Lb, L̄) + d(Lc, L̄)} / 2
    D sequence:   Dd = d(Ld, L̄)

and the scores of the competing part are

    Sbc = (bj + ck) × Kbc + Dbc
    Sd = dl × Kd + Dd    … (2)

where Kbc and Kd are weighting factors applied to the identification distances with the optimal syllable-interval sequence as reference: when B-C is the optimal sequence (i.e. Dbc < Dd), Kbc = 1 and Kd = 2; when D is the optimal sequence (Dd < Dbc), Kd = 1 and Kbc = 1/2. The overall scores are therefore

    Sabc = ai + Sbc
    Sad = ai + Sd

and the smaller of Sabc and Sad is taken as the score S; candidate syllable strings are created in increasing order of these combined values.

The procedure is now illustrated with the input utterance 「家を」 ("ie o", "the house" plus object particle). Fig. 3 shows the syllable-boundary situation when 「家」 is input: there are four syllable-interval candidates A, B, C and D, and in the competing part the syllable lengths of candidates B, C and D are Lb = 12, Lc = 27 and Ld = 39, with average syllable length L̄ = 23. The interval reliabilities are

    for B: d(Lb, L̄) = 0.48
    for C: d(Lc, L̄) = 0.17
    for D: d(Ld, L̄) = 0.70

so the sequence reliabilities become

    B-C sequence: Dbc = {d(Lb, L̄) + d(Lc, L̄)} / 2 = 0.32
    D sequence:   Dd = d(Ld, L̄) = 0.70

and the B-C sequence is estimated to be the optimal syllable-interval sequence. Normalizing the reliabilities to this B-C sequence gives D′bc = 0 and D′d = 0.38.

Fig. 4 is the syllable candidate lattice showing the identification distance of each candidate syllable. Using these identification distances and the normalized reliabilities, the score of each candidate syllable string follows from Eq. (2) as

    Sabc = ai + (bj + ck + D′bc)
    Sad = ai + (dl × 2 + D′d)

and candidate strings are created in the order of the combinations giving the smaller values. Language processing is applied to the candidate syllable strings so created, and the recognition results are output as shown in the attached table. According to these results the first candidate is 「いえお」 ("ie o") from the A-B-C sequence, showing that the correct result is obtained.

<Effects of the Invention>
As is clear from the above description, according to the present invention syllable segmentation is performed using both the identification distance of each syllable and the utterance speed, so segmentation errors are reduced and the probability that the subsequent language processing arrives at the correct result can be raised.
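The competition scoring of Eq. (2) can be sketched as follows. The sequence reliabilities Dbc = 0.32 and Dd = 0.70 are those of the example above; the identification distances ai, bj, ck, dl come from Fig. 4 and are not reproduced in the text, so the values used below are made-up placeholders, and only the weighting and normalization logic follows the description.

```python
def competition_scores(ai, bj, ck, dl, d_bc, d_d):
    """Eq. (2) for the competition of Fig. 2 (B-C sequence vs. D sequence)."""
    # Weighting factors relative to the optimal (smaller-D) sequence.
    if d_bc < d_d:
        k_bc, k_d = 1.0, 2.0
    else:
        k_bc, k_d = 0.5, 1.0
    # Normalize the reliabilities against the optimal sequence.
    base = min(d_bc, d_d)
    dn_bc, dn_d = d_bc - base, d_d - base    # D'bc, D'd
    s_abc = ai + (bj + ck) * k_bc + dn_bc    # score of the A-B-C string
    s_ad = ai + dl * k_d + dn_d              # score of the A-D string
    return s_abc, s_ad

# Placeholder distances (NOT from Fig. 4); reliabilities from the example.
s_abc, s_ad = competition_scores(ai=0.10, bj=0.20, ck=0.15, dl=0.25,
                                 d_bc=0.32, d_d=0.70)
best = "A-B-C" if s_abc < s_ad else "A-D"    # smaller score wins
```

With these placeholders, s_abc = 0.45 and s_ad = 0.98, so the A-B-C string would be generated first, mirroring the 「いえお」 outcome in the text.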

【図面の簡単な説明】[Brief description of drawings]

Fig. 1 is a block diagram showing the configuration of a voice input device embodying the present invention; Fig. 2 is a diagram showing the syllable candidate lattice; Fig. 3 is a diagram showing an example of the syllable boundaries of an input utterance; and Fig. 4 is a diagram showing the syllable candidate lattice of the concrete example.

1 … speech analysis unit, 4 … utterance-speed storage unit, 7 … syllable-interval candidate detection unit, 8 … syllable identification unit, 9 … syllable-interval candidate reliability calculation unit, 10 … candidate syllable string creation unit, 11 … language processing unit

Claims (1)

【特許請求の範囲】[Claims]

1. A speech recognition device for a voice input device that decomposes input speech into syllable units, identifies each syllable and, on the basis of the time series of a plurality of syllable candidates, sequentially creates candidate syllable strings corresponding to words or phrases in decreasing order of identification accuracy so as to recognize word syllable strings, phrase syllable strings and the like, the device comprising: a syllable identification unit 8 holding a syllable candidate lattice that, for each of the syllable-interval candidates into which the input speech has been divided in several ways, outputs candidate syllables in increasing order of the identification distance obtained by pattern matching against syllable reference patterns or the like; a syllable-interval candidate reliability calculation unit 9 that estimates the optimal syllable-interval sequence from the reliability of said syllable-interval candidates using utterance-speed information; and a candidate syllable string creation unit 10 that, on the basis of said optimal syllable-interval sequence, determines a weighting factor K for each syllable interval of the other syllable-interval sequences and sequentially creates candidate syllable strings in increasing order of a score S using said identification distances and the interval-detection reliabilities.
JP59169569A 1984-08-14 1984-08-14 Voice recognizer Expired - Lifetime JPH067346B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP59169569A JPH067346B2 (en) 1984-08-14 1984-08-14 Voice recognizer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP59169569A JPH067346B2 (en) 1984-08-14 1984-08-14 Voice recognizer

Publications (2)

Publication Number Publication Date
JPS6147999A JPS6147999A (en) 1986-03-08
JPH067346B2 true JPH067346B2 (en) 1994-01-26

Family

ID=15888899

Family Applications (1)

Application Number Title Priority Date Filing Date
JP59169569A Expired - Lifetime JPH067346B2 (en) 1984-08-14 1984-08-14 Voice recognizer

Country Status (1)

Country Link
JP (1) JPH067346B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63161499A (en) * 1986-12-24 1988-07-05 松下電器産業株式会社 Voice recognition equipment
DE19941227A1 (en) * 1999-08-30 2001-03-08 Philips Corp Intellectual Pty Method and arrangement for speech recognition
JP4563418B2 (en) * 2007-03-27 2010-10-13 株式会社コナミデジタルエンタテインメント Audio processing apparatus, audio processing method, and program
JP4809918B2 (en) * 2009-09-01 2011-11-09 日本電信電話株式会社 Phoneme division apparatus, method, and program

Also Published As

Publication number Publication date
JPS6147999A (en) 1986-03-08

Similar Documents

Publication Publication Date Title
JP5282737B2 (en) Speech recognition apparatus and speech recognition method
JPS62217295A (en) Voice recognition system
JPH09127972 (en) Vocalization discrimination and verification for recognition of linked numeral
US20070203700A1 (en) Speech Recognition Apparatus And Speech Recognition Method
JPH067346B2 (en) Voice recognizer
Abdo et al. Semi-automatic segmentation system for syllables extraction from continuous Arabic audio signal
JP2853418B2 (en) Voice recognition method
KR100673834B1 (en) Text-prompted speaker independent verification system and method
JPH0777998A (en) Successive word speech recognition device
JPH08314490A (en) Word spotting type method and device for recognizing voice
JP3277522B2 (en) Voice recognition method
JPS6325366B2 (en)
JP3291073B2 (en) Voice recognition method
JPH05303391A (en) Speech recognition device
JPS6155680B2 (en)
JP3115016B2 (en) Voice recognition method and apparatus
JPH054678B2 (en)
JPH0554678B2 (en)
JPS632100A (en) Voice recognition equipment
JPS59173884A (en) Pattern comparator
JPS6180298A (en) Voice recognition equipment
JPH0646357B2 (en) Continuous speech recognizer
JPS63217399A (en) Voice section detecting system
JPH0632006B2 (en) Voice recognizer
JPS6118758B2 (en)