JP2966002B2

JP2966002B2 - Voice recognition device

Info

Publication number: JP2966002B2
Application number: JP1222738A
Authority: JP
Inventors: 芳春阿部
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1989-08-29
Filing date: 1989-08-29
Publication date: 1999-10-25
Anticipated expiration: 2014-10-25
Also published as: JPH0384600A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a speech recognition apparatus that recognizes spoken words and spoken sentences, and more particularly to an improvement of a speech recognition apparatus whose recognition target is a large vocabulary.

[Conventional technology]

FIG. 3 is a block diagram of a large-vocabulary speech recognition apparatus built on the speech recognition technology disclosed in, for example, a patent application (Japanese Patent Application No. 63-246367) or the paper "Large Vocabulary Word Speech Recognition Using a Linear Context Model" in the Proceedings of the Acoustical Society of Japan (issued March 14, 1989).

In the figure, 1 is a speech analysis unit that analyzes the input speech; 2 is a model generation unit, serving as model generation means, that generates, from a sequence of symbols describing the utterance content of speech, a model of the speech represented by the symbol sequence, depending on the context of each symbol or the context of the portion of the input speech corresponding to the symbol; 3 is a word dictionary describing the words to be recognized; 4 is a likelihood calculation unit that calculates the likelihood, with respect to the input speech, of each speech model generated by the model generation unit 2; and 5 is a primary selection unit, serving as primary selection means, that selects the speech models with high likelihood and outputs the symbol sequences (phoneme sequences) from which they were generated as candidates for the input speech.

Next, the operation is described. The input speech is converted into a feature vector sequence by the speech analysis unit 1. For each phoneme sequence of the recognition target words described in the word dictionary 3, the model generation unit 2 extracts context information from this feature vector sequence and the phoneme sequence, and generates a context-dependent word model based on that context information. The likelihood calculation unit 4 calculates, for each word model generated in this way, its likelihood with respect to the feature vector sequence. The primary selection unit 5 selects the N word models with the highest likelihood and outputs their phoneme sequences as word candidates.
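The primary selection step described above can be sketched as follows. This is a minimal illustration, assuming the log-likelihood of every word model has already been computed; the function name `select_n_best` and the use of a dictionary keyed by phoneme sequence are assumptions made for the example and do not come from the patent.

```python
import heapq
from typing import Dict, List, Tuple

PhonemeSeq = Tuple[str, ...]


def select_n_best(scores: Dict[PhonemeSeq, float], n: int) -> List[PhonemeSeq]:
    """Return the n phoneme sequences whose models scored highest.

    `scores` maps each dictionary phoneme sequence to the (log-)likelihood
    of its word model for the current input (hypothetical data layout).
    The result is ordered from most to least likely.
    """
    return heapq.nlargest(n, scores, key=scores.get)
```

With N = 2 and three scored words, the call returns the two best phoneme sequences in rank order, which is the candidate list passed on to the later stages.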

[Problems to be solved by the invention]

The conventional speech recognition apparatus configured as described above generates context-dependent models and uses them for recognition, so its recognition rate is extremely high, yet recognition errors still occur. Analysis of experimental results shows that most of these errors are confusions with similar words whose phoneme sequences differ from the correct phoneme sequence in one to a few phonemes. Because such a context-dependent apparatus performs considerably better than older apparatuses that ignore context, the correct word can be assumed to be among the top word candidates even when recognition fails, and the first-ranked candidate is then likely to be a similar word of the correct word in the above sense. Therefore, if re-recognition among the similar words is performed with recognition means suited to discriminating similar words, and only when the top candidates include words similar to the first-ranked candidate, the recognition rate should improve further.

However, the conventional apparatus described above has the problem that, even when the top candidates include similar words, it applies no recognition means specifically suited to discriminating similar words.

The present invention has been made to solve the above problem, and its object is to provide a speech recognition apparatus that achieves high recognition accuracy even when the recognition target is a large vocabulary containing many similar words whose symbol sequences (phoneme sequences) differ in only some of their symbols (phonemes).

[Means for solving the problem]

The speech recognition apparatus according to the present invention is characterized by comprising: similar symbol sequence detection means (similar phoneme sequence detection unit 6) for detecting similar symbol sequences (phoneme sequences) among the input speech candidates selected by the primary selection means (primary selection unit 5); emphasis model generation means (emphasis model generation unit 7) for generating, when similar symbol sequences (phoneme sequences) are detected by the similar symbol sequence detection means, speech models in which the portions of the symbols that differ between these similar symbol sequences are emphasized; and reselection means (reselection unit 8) for causing the likelihood calculation means to calculate the likelihood of each emphasized speech model generated by the emphasis model generation means, selecting from those emphasized speech models an emphasized speech model having a high likelihood with respect to the input speech, and outputting the symbol sequence from which it was generated as a candidate for the input speech.

[Action]

The similar symbol sequence detection means (similar phoneme sequence detection unit 6) detects similar symbol sequences (phoneme sequences) among the input speech candidates selected by the primary selection means (primary selection unit 5). When similar symbol sequences (phoneme sequences) are detected by the similar symbol sequence detection means (similar phoneme sequence detection unit 6), the emphasis model generation means (emphasis model generation unit 7) generates speech models in which the portions of the symbols (phonemes) that differ between these similar symbol sequences are emphasized. The reselection means (reselection unit 8) selects, from the emphasized speech models generated by the emphasis model generation means (emphasis model generation unit 7), an emphasized speech model having a high likelihood with respect to the input speech, and outputs the symbol sequence (phoneme sequence) from which it was generated as a candidate for the input speech.

[Embodiment of the invention]

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. In FIG. 1, components corresponding to those shown in FIG. 3 are given the same reference numerals and their description is omitted. In FIG. 1, 6 is a similar phoneme sequence detection unit, serving as similar symbol sequence detection means, that detects similar phoneme sequences (symbol sequences) among the input speech candidates selected by the primary selection unit 5; 7 is an emphasis model generation unit, serving as emphasis model generation means, that generates, when similar phoneme sequences are detected by the similar phoneme sequence detection unit 6, speech models in which the portions of the phonemes that differ between these similar phoneme sequences are emphasized; and 8 is a reselection unit, serving as reselection means, that selects, from the emphasized speech models generated by the emphasis model generation unit 7, an emphasized speech model having a high likelihood with respect to the input speech and outputs the phoneme sequence from which it was generated as a candidate for the input speech.

Next, the operation is described.

The input speech is converted into a feature vector sequence by the speech analysis unit 1. The model generation unit 2 extracts context information from this feature vector sequence and the phoneme sequences in the word dictionary 3, and generates context-dependent models based on that context information. The likelihood calculation unit 4 calculates, for each generated model, its likelihood with respect to the input feature vector sequence. The primary selection unit 5 selects the N models with the highest likelihood and takes the phoneme sequences from which these models were generated as candidates for the recognition result.

The similar phoneme sequence detection unit 6 pairs the first-ranked phoneme sequence with each of the remaining (N-1) candidate phoneme sequences and detects the part in which the two phoneme sequences differ. FIG. 2 is a flowchart of the processing by which the similar phoneme sequence detection unit 6 detects the differing phoneme between phoneme sequences A and B. This processing consists of the following three steps.

[1] Deletion check (step S1): Remove any one phoneme (call it aj) from sequence A to form sequence A′ and check whether A′ matches sequence B. If it does, phoneme aj of sequence A is the differing phoneme.

[2] Insertion check (step S2): Remove any one phoneme (call it bj) from sequence B to form phoneme sequence B′ and check whether B′ matches sequence A. If it does, phoneme bj of sequence B is the differing phoneme.

[3] Substitution check (step S3): Replace any one phoneme (call it aj) of sequence A with the phoneme at the same position in sequence B (call it bj) to form phoneme sequence A″ and check whether A″ matches sequence B. If it does, phoneme aj of sequence A and phoneme bj of sequence B are the differing phonemes.

In this embodiment, the recognition rate is high even before re-recognition and errors of two or more phonemes are rare, so a word whose phoneme sequence differs by exactly one phoneme is regarded as a similar word.
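The three checks of FIG. 2 can be sketched as follows for the one-phoneme case used in this embodiment. The function name `find_differing_phoneme` and the returned `(kind, index_in_A, index_in_B)` tuple are assumptions made for illustration; the patent only specifies the three checks themselves. The explicit test that the two phonemes differ in the substitution check is also an addition, so that identical sequences are not reported as similar.

```python
from typing import Optional, Sequence, Tuple

# (kind, position in A or None, position in B or None); hypothetical layout
Diff = Tuple[str, Optional[int], Optional[int]]


def find_differing_phoneme(a: Sequence[str], b: Sequence[str]) -> Optional[Diff]:
    """Return the single differing phoneme between A and B, or None."""
    a, b = list(a), list(b)

    # Step S1 (deletion check): drop one phoneme aj from A and compare with B.
    if len(a) == len(b) + 1:
        for j in range(len(a)):
            if a[:j] + a[j + 1:] == b:
                return ("deletion", j, None)

    # Step S2 (insertion check): drop one phoneme bj from B and compare with A.
    if len(b) == len(a) + 1:
        for j in range(len(b)):
            if b[:j] + b[j + 1:] == a:
                return ("insertion", None, j)

    # Step S3 (substitution check): replace aj with bj and compare with B.
    if len(a) == len(b):
        for j in range(len(a)):
            if a[j] != b[j] and a[:j] + [b[j]] + a[j + 1:] == b:
                return ("substitution", j, j)

    return None  # the sequences are identical or differ by more than one phoneme
```

For example, comparing the sequences k-a-t-a and k-a-s-a reports a substitution at position 2 (counting from 0), while k-a-t-a and h-a-n-a differ in two phonemes and are not treated as similar.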

The emphasis model generation unit 7 generates an emphasized model from a phoneme sequence C containing a differing phoneme Q as follows. First, it generates the ordinary model for the phoneme sequence C. This ordinary model is formed by concatenating, in the order of the phoneme sequence and on the basis of the context information, four kinds of parameters: the mean a(s,j) of the phoneme pattern and its covariance matrix Σ(s,j), and the mean μ(s,j) of the duration and its variance σ²(s,j). Here j is the number indicating the position of the phoneme and s is the number assigned to an internal state of the phoneme. Next, letting j be the position of the differing phoneme Q, the covariance matrix Σ(s,j) (s = 1, 2) of the ordinary model generated in this way is replaced by the matrix obtained by multiplying each of its elements by 1/w, and the duration variance σ²(s,j) is likewise replaced by 1/w times its value, which yields the emphasized model. Here w must be chosen greater than 1; experimentally, w = 1.25 gave the best results.
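The parameter replacement described above can be sketched as follows. The data layout (`StateParams`, a model as a list of per-phoneme state lists) and the function name `emphasize` are assumptions made for illustration; only the scaling of Σ(s,j) and σ²(s,j) by 1/w at the differing position, and the value w = 1.25, come from the text.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class StateParams:
    """Parameters of one internal state s of the phoneme at position j."""
    mean: List[float]        # a(s, j): mean of the phoneme pattern
    cov: List[List[float]]   # Sigma(s, j): covariance matrix of the pattern
    dur_mean: float          # mu(s, j): mean duration
    dur_var: float           # sigma^2(s, j): duration variance


# A word model is a list over phoneme positions j, each holding its states s.
WordModel = List[List[StateParams]]


def emphasize(model: WordModel, j: int, w: float = 1.25) -> WordModel:
    """Return a copy of `model` with the phoneme at position j emphasized."""
    if w <= 1.0:
        raise ValueError("w must be greater than 1")
    emphasized = [list(states) for states in model]
    emphasized[j] = [
        replace(
            state,
            cov=[[c / w for c in row] for row in state.cov],  # Sigma / w
            dur_var=state.dur_var / w,                        # sigma^2 / w
        )
        for state in model[j]
    ]
    return emphasized
```

The means are left untouched, so the emphasized model still expects the same pattern and duration at position j but tolerates less deviation from them there.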

The reselection unit 8 has the likelihood calculation unit 4 calculate the likelihood of each emphasized model with respect to the input speech in the same way as for the ordinary model. Compared with the ordinary model, the emphasized model has the covariance of the phoneme pattern and the variance of the phoneme duration multiplied by 1/w in the differing phoneme portion, so its likelihood is conversely computed magnified w times in that portion, and the differing phoneme portion is evaluated with relatively greater weight. Reselection using the emphasized-model likelihood can therefore discriminate similar words accurately.
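The w-fold magnification can be checked numerically under one explicit assumption that the text does not state: that the likelihood of a state is a Gaussian log-density, taken here with a diagonal covariance for brevity. Under that assumption, dividing the variances by w multiplies the squared-distance term by exactly w, while the normalization term shifts by an amount that does not depend on the input. The function names below are hypothetical.

```python
import math
from typing import Sequence


def mahalanobis_sq(x: Sequence[float], mean: Sequence[float],
                   var: Sequence[float]) -> float:
    """Squared distance of x from mean, weighted by the inverse variances."""
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))


def gauss_loglik(x: Sequence[float], mean: Sequence[float],
                 var: Sequence[float]) -> float:
    """Log-density of a diagonal-covariance Gaussian (assumed likelihood form)."""
    norm = sum(-0.5 * math.log(2.0 * math.pi * v) for v in var)
    return norm - 0.5 * mahalanobis_sq(x, mean, var)


def emphasized_var(var: Sequence[float], w: float) -> list:
    """Variances of the emphasized model: each element multiplied by 1/w."""
    return [v / w for v in var]
```

For any input x, `mahalanobis_sq(x, mean, emphasized_var(var, w))` equals `w * mahalanobis_sq(x, mean, var)`, so a mismatch in the differing phoneme is penalized w times more strongly than elsewhere in the word.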

The final reselection result is obtained as follows.

[1] From the N candidates, extract the candidates whose phoneme sequences are similar to (in this embodiment, differ by one phoneme from) the phoneme sequence of the first-ranked candidate, and let M be their number.

[2] For every combination of two candidates that can be chosen from these M candidates with phoneme sequences similar to the first-ranked phoneme sequence, take the two as sequence A and sequence B and calculate the likelihood (La) of the emphasized model of sequence A and the likelihood (Lb) of the emphasized model of sequence B.

[3] Reorder the M similar candidates by the emphasized-model likelihoods calculated in [2], so that after reordering the emphasized-model likelihood of the k-th candidate is higher than that of every candidate ranked (k+1)-th or lower.

[4] Select, as the recognition result, the model that the reordering in [3] ranks highest in likelihood.
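Steps [1] to [4] can be sketched as follows. Several points are assumptions made for this illustration and are not fixed by the text: the first-ranked candidate is kept in the similar group; pairs within the group that are not themselves one phoneme apart are skipped; the pairwise likelihoods La and Lb are combined into a ranking by counting pairwise wins; and `emphasized_loglik` is a hypothetical callback standing in for the emphasis model generation unit 7 and the likelihood calculation unit 4.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

PhonemeSeq = Tuple[str, ...]
# Given sequences A and B, returns (La, Lb) for their emphasized models.
PairScorer = Callable[[PhonemeSeq, PhonemeSeq], Tuple[float, float]]


def one_phoneme_apart(a: PhonemeSeq, b: PhonemeSeq) -> bool:
    """True if a and b differ by one deletion, insertion or substitution."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    if len(long_) - len(short) != 1:
        return False
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def reselect(candidates: Sequence[PhonemeSeq],
             emphasized_loglik: PairScorer) -> List[PhonemeSeq]:
    """Re-rank the candidates similar to the first-ranked one."""
    top = candidates[0]
    # [1] Extract the candidates similar to the first-ranked candidate.
    group = [top] + [c for c in candidates[1:] if one_phoneme_apart(top, c)]
    # [2] Score every pair (A, B) in the group with emphasized models.
    wins = {c: 0 for c in group}
    for a, b in combinations(group, 2):
        if not one_phoneme_apart(a, b):
            continue  # assumption: no emphasized model is built for this pair
        la, lb = emphasized_loglik(a, b)
        wins[a if la >= lb else b] += 1
    # [3] Reorder the group; ties keep the primary-selection order.
    return sorted(group, key=lambda c: wins[c], reverse=True)
```

The recognition result of step [4] is then the first element of the returned list. When no candidate is similar to the first-ranked one, the group contains only that candidate and the primary result is returned unchanged.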

In the above description the symbols describing the utterance are phonemes, but other symbols such as phonological units or phoneme segments may be used instead. Likewise, words are the recognition target here, but the target is not limited to words; phrases, morphemes and the like can also be recognized. Furthermore, although words that differ by exactly one phoneme in the phoneme sequence are detected as similar words here, words that differ by two or more phonemes may also be detected as similar words.

[Effect of the invention]

As described above, the present invention comprises similar symbol sequence detection means for detecting similar symbol sequences among the input speech candidates selected by the primary selection means; emphasis model generation means for generating, when similar symbol sequences are detected by the similar symbol sequence detection means, speech models in which the portions of the symbols that differ between these similar symbol sequences are emphasized; and reselection means for causing the likelihood calculation means to calculate the likelihood of each emphasized speech model generated by the emphasis model generation means, selecting from those emphasized speech models an emphasized speech model having a high likelihood with respect to the input speech, and outputting the symbol sequence from which it was generated as a candidate for the input speech. Consequently, when similar words are detected among the top candidates of the recognition result, models are generated that emphasize the symbols in the parts where the symbol sequences of the detected similar words differ, the model with a high likelihood with respect to the input speech is selected from the emphasized models, and a re-recognition result is obtained. This has the effect that high recognition accuracy can be obtained even when the recognition target is a large vocabulary containing many similar words whose symbol sequences differ in only some of their symbols.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention, FIG. 2 is a flowchart showing the operation of the similar phoneme sequence detection unit in this embodiment, and FIG. 3 is a block diagram of a conventional speech recognition apparatus.

2: model generation unit (model generation means); 5: primary selection unit (primary selection means); 6: similar phoneme sequence detection unit (similar symbol sequence detection means); 7: emphasis model generation unit (emphasis model generation means); 8: reselection unit (reselection means).

Continued from the front page: (58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/20

Claims

(57) [Claims]

1. A speech recognition apparatus comprising: model generation means for generating, from a sequence of symbols describing the utterance content of speech, a model of the speech represented by the symbol sequence, depending on the context of each symbol or the context of the portion of the input speech corresponding to the symbol; likelihood calculation means for calculating the likelihood, with respect to the input speech, of each speech model generated by the model generation means; and primary selection means for selecting, from the speech models generated by the model generation means, speech models having a high likelihood with respect to the input speech and outputting the symbol sequences from which they were generated as candidates for the input speech, the apparatus being characterized by further comprising: similar symbol sequence detection means for detecting similar symbol sequences among the input speech candidates selected by the primary selection means; emphasis model generation means for generating, when similar symbol sequences are detected by the similar symbol sequence detection means, speech models in which the portions of the symbols that differ between these similar symbol sequences are emphasized; and reselection means for causing the likelihood calculation means to calculate the likelihood of each emphasized speech model generated by the emphasis model generation means, selecting from those emphasized speech models an emphasized speech model having a high likelihood with respect to the input speech, and outputting the symbol sequence from which it was generated as a candidate for the input speech.