JP2951332B2

JP2951332B2 - Clause candidate reduction method in speech recognition

Info

Publication number: JP2951332B2
Application number: JP63051252A
Authority: JP
Inventors: 均岩見田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1988-03-04
Filing date: 1988-03-04
Publication date: 1999-09-20
Anticipated expiration: 2014-09-20
Also published as: JPH01224799A

Description

DETAILED DESCRIPTION OF THE INVENTION
[Overview] This invention relates to a phrase (bunsetsu) candidate reduction method for phrase-level speech recognition, in which the valid candidates to be recognized are selected, and their number reduced, in advance when recognizing phrases drawn from a large vocabulary. Its object is to provide a phrase candidate reduction method that extracts features from the words in a word dictionary and thereby reduces the amount of processing required for the phrases formed by combining those words. The method is configured as follows, in a speech recognition device whose recognition unit is the phrase. For each word in a word dictionary containing all words to be recognized, a word feature phoneme extraction unit extracts the feature phonemes that are reliably expected to appear whenever a phrase containing that word is uttered. A phrase feature phoneme synthesis unit combines these word feature phonemes, of which there are fewer kinds than there are words, to synthesize the features of phrases. An analysis unit analyzes the input speech, and an input phoneme extraction unit extracts the feature phonemes of the input speech from the analysis output. Based on the phrase features synthesized by the phrase feature phoneme synthesis unit and the input-speech phonemes extracted by the input phoneme extraction unit, a phrase candidate selection unit selects the phrase candidates that are worth passing to the matching (collation) stage of speech recognition.

[Field of Industrial Application] The present invention relates to a phrase candidate reduction method for phrase speech recognition that selects, and thereby reduces, the valid candidates to be recognized in advance when recognizing phrases from a large vocabulary.

In recent years, speech recognition devices have been used for a variety of purposes and are useful, for example, for creating documents by voice input or for giving instructions to equipment by voice. In such applications it is desirable to recognize speech not merely word by word but in a longer unit, the phrase, and moreover to recognize phrases drawn from a large vocabulary. To do so, however, the number of phrases to be recognized becomes enormous and the recognition process takes a long time, so an improvement is needed.

[Prior Art] FIG. 3 shows the configuration of a conventional example.

The configuration shown in FIG. 3 applies, to phrase speech recognition, a word candidate reduction device for speech recognition developed by the same applicant as the present application (Japanese Patent Application No. Sho 62-53066).

In the figure, the word dictionary 30 stores every word that makes up the phrases to be recognized, together with each word's phoneme label network. A phoneme label network links phoneme labels, each denoting a consonant or vowel, so as to represent how the word is pronounced. FIG. 4 shows an example for the Japanese word (which is also a phrase) "しかし" (SHIKASHI, "but"). In the figure, "#" is a word boundary, "SH" is the consonant part of "shi", "I" is the vowel "i" when not devoiced, "i" is the devoiced vowel "i", "K" is the consonant part of "ka", and "A" is the vowel "a". In this way, the speech variation known as devoicing is expressed by a network representation.
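A phoneme label network of this kind can be sketched as a small directed graph whose paths are the admissible pronunciations. The sketch below is an assumption, not the layout of FIG. 4: the node names (a label plus an index) and the choice that both "I" positions of SHIKASHI can be devoiced are illustrative only.

```python
from typing import Dict, List

# Hypothetical encoding of a label network for "SHIKASHI": node -> successor nodes.
# A numeric suffix lets one label occur several times; "#" nodes are word boundaries.
NETWORK: Dict[str, List[str]] = {
    "#0": ["SH1"],
    "SH1": ["I1", "i1"],  # ASSUMPTION: voiced "I" or devoiced "i"
    "I1": ["K1"],
    "i1": ["K1"],
    "K1": ["A1"],
    "A1": ["SH2"],
    "SH2": ["I2", "i2"],  # ASSUMPTION: the final vowel may also be devoiced
    "I2": ["#1"],
    "i2": ["#1"],
    "#1": [],
}


def label(node: str) -> str:
    """Strip the numeric suffix to recover the phoneme label."""
    return node.rstrip("0123456789")


def pronunciations(net: Dict[str, List[str]], start: str = "#0") -> List[List[str]]:
    """Enumerate every label sequence from `start` to a node with no successors (DFS)."""
    paths: List[List[str]] = []

    def walk(node: str, acc: List[str]) -> None:
        acc = acc + [label(node)]
        if not net[node]:
            paths.append(acc)
            return
        for nxt in net[node]:
            walk(nxt, acc)

    walk(start, [])
    return paths
```

With two optional devoicings the network encodes four pronunciations in a single structure, which is why a network is more compact than listing every variant separately.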

The phrase synthesis unit 31 synthesizes every possible phrase from the words in the word dictionary 30.

The phrase feature phoneme extraction unit 32 extracts the feature phonemes of each synthesized phrase; that is, for each phrase it extracts the characteristic phonemes that are predicted in advance to appear in it reliably.

Meanwhile, the input speech undergoes short-time frequency analysis in the analysis unit 35, which produces short-time spectral time-series data (called a spectrum pattern). The input feature phoneme extraction unit 34 then detects feature phonemes from the output data of the analysis unit 35. Because phonemes pronounced with high power can be detected easily and reliably, these are extracted as the characteristic phonemes and supplied to the phrase candidate selection unit 33.
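The analysis step can be sketched as framing the signal, taking a short-time magnitude spectrum of each frame, and flagging frames whose spectral power is high. This is a minimal sketch under stated assumptions: the frame size, hop, Hann window, naive DFT, and the power threshold `ratio` are all illustrative choices, and mapping the flagged frames to phoneme labels is not shown.

```python
import cmath
import math
from typing import List


def split_frames(signal: List[float], size: int, hop: int) -> List[List[float]]:
    """Cut the signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Naive DFT of a Hann-windowed frame, returning |X[k]| for k = 0 .. N/2."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    xs = [x * w for x, w in zip(frame, window)]
    return [
        abs(sum(xs[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectrum_pattern(signal: List[float], size: int = 64, hop: int = 32) -> List[List[float]]:
    """Short-time spectral time series: one magnitude spectrum per frame."""
    return [magnitude_spectrum(f) for f in split_frames(signal, size, hop)]


def high_power_frames(pattern: List[List[float]], ratio: float = 0.1) -> List[int]:
    """Indices of frames whose spectral power is at least `ratio` times the peak power."""
    powers = [sum(m * m for m in spectrum) for spectrum in pattern]
    peak = max(powers, default=0.0)
    if peak == 0.0:
        return []
    return [i for i, p in enumerate(powers) if p >= ratio * peak]
```

For example, a signal of silence followed by a sine burst yields high-power flags only on the frames that overlap the burst.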

The phrase candidate selection unit 33 correlates each characteristic phoneme extracted by the input feature phoneme extraction unit 34 with the feature phonemes of each phrase obtained from the phrase feature phoneme extraction unit 32, and selects the phrases that are worth passing to the matching stage of speech recognition. Specifically, it selects, as candidates for phrase speech recognition matching, those phrases whose feature phonemes are similar to the feature phonemes extracted from the input speech.
The matching operation itself computes the similarity (or distance) between the spectrum pattern of the input speech and a spectrum pattern synthesized from each candidate's phoneme label network, and detects and outputs the closest phrase. A high-performance matching method for this step uses the DP (dynamic programming) matching method to correct the time-axis misalignment between the input pattern and each candidate phrase.
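DP matching of this kind can be illustrated with textbook dynamic time warping (DTW). The sketch below is an assumption, not the patent's matcher: it uses Euclidean frame distance, the symmetric step set (i-1, j), (i, j-1), (i-1, j-1), and no path-length normalization, and the templates stand in for spectrum patterns synthesized from candidate label networks.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def euclid(a: Vector, b: Vector) -> float:
    """Euclidean distance between two spectral frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(inp: List[Vector], ref: List[Vector]) -> float:
    """Cumulative DTW cost aligning `inp` to `ref`, absorbing time-axis stretching."""
    n, m = len(inp), len(ref)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = euclid(inp[i - 1], ref[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def best_match(inp: List[Vector], templates: Dict[str, List[Vector]]) -> str:
    """Return the name of the candidate template closest to the input under DTW."""
    return min(templates, key=lambda name: dtw_distance(inp, templates[name]))
```

An input that is a time-stretched copy of a template (for example, with repeated frames) aligns to it at zero cost, which is exactly the time-axis correction the DP method provides.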

[Problems to be Solved by the Invention] Because the conventional configuration described above synthesizes every possible phrase from the word dictionary, the number of phrases is enormous, and the processing load, and with it the processing time, becomes very large.

An object of the present invention is to provide a phrase candidate reduction method for speech recognition that extracts features from the words in a word dictionary and thereby reduces the amount of processing required for the phrases created by combining those words.

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the present invention.

In FIG. 1, 10 is a word dictionary; 11 is a word feature phoneme extraction unit that extracts the feature phonemes of each word; 12 is a phrase feature phoneme synthesis unit that combines words and, with them, their respective feature phonemes; 13 is a phrase candidate selection unit that selects the phrase candidates used for matching; 14 is an input phoneme extraction unit that extracts phonemes from the analysis output of the input speech; and 15 is an analysis unit.

The present invention extracts the characteristic phonemes of each word in the word dictionary, synthesizes the phonemes characterizing each phrase (a combination of words) by combining the feature phonemes of its words, and inputs them to the phrase candidate selection unit. There they are compared with the phonemes characterizing the input speech, and the phrases whose feature phonemes are all contained in the input are selected as candidates for matching.

[Operation] Each word in the word dictionary 10 of FIG. 1 is stored with its part of speech and inflected form, and the word feature phoneme extraction unit 11 extracts in advance the characteristic phonemes that are reliably expected to appear whenever a phrase containing the word is uttered.

Once feature phonemes have been extracted for every word, the phrase feature phoneme synthesis unit 12 combines words according to predetermined rules and derives the feature phonemes of each resulting phrase. This derivation is performed by combining the feature phonemes of the constituent words.

Meanwhile, the analysis unit 15 analyzes the input speech in the same way as in the conventional method, and phoneme candidates of the input speech are then extracted from the analysis output.

The phonemes (including alternative candidates) extracted by the input phoneme extraction unit 14 and the feature phonemes of the many phrases synthesized by the phrase feature phoneme synthesis unit 12 are supplied to the phrase candidate selection unit 13. A phrase whose feature phonemes are all contained in the input phoneme candidates is selected as a candidate for matching; a phrase having any feature phoneme that is not contained there is not adopted as a matching candidate.

[Embodiment] FIG. 2 shows the configuration of an embodiment of the present invention.

In FIG. 2, 20 is a word dictionary, 21 is a feature phoneme rule storage unit, 22 is a word feature phoneme extraction unit, 23 is a word feature phoneme pattern holding unit, 24 is a phrase model storage unit, 25 is a phrase feature phoneme synthesis unit, 26 is a phrase feature phoneme pattern holding unit, 27 is a selection unit, 28 is a phrase candidate generation unit, 29 is an input phoneme extraction unit, and 40 is an analysis unit.

In FIG. 2, the word dictionary 20 contains every word to be recognized. Each entry records the word's part of speech and inflected form, as well as its phoneme label network.

Following the feature phoneme rules, the word feature phoneme extraction unit 22 extracts, for each word in the word dictionary, the characteristic phonemes that are reliably expected to appear when a phrase containing that word is uttered. Consequently, characteristic phonemes are rarely taken from an inflectional ending, for example, because the ending does not always appear. For "ARUKU" (あるく, "to walk"), the "KU" changes with inflection (as in "arukanai", "does not walk"), so the feature phonemes are extracted from "ARU".

Similarly, for the word "SHIKASHI" (しかし, "but"), the vowel I may be devoiced and not actually voiced, so it is not a characteristic phoneme; the feature phoneme pattern of this word is "SH, A, SH".

The feature phoneme rule storage unit 21 stores the rules for extracting characteristic phonemes (for example, the rule that a phoneme that may be devoiced, and so not voiced, does not become a feature phoneme).
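The two rules above (ignore the inflectional ending; ignore phonemes that may be devoiced) can be sketched as predicates applied to a word's segments. This is an illustrative sketch: the `Segment`/`Word` structures and the `stem_len` field are hypothetical, and the `RELIABLE` label set is an explicit assumption added only because the patent's SHIKASHI example also omits K without stating why.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Segment:
    label: str
    devoiceable: bool = False  # True if this vowel may be devoiced in speech


@dataclass
class Word:
    spelling: str
    segments: List[Segment]
    stem_len: Optional[int] = None  # segments before the inflectional ending; None = uninflected


# ASSUMPTION: labels treated as strong, reliably detectable phonemes. Chosen only so
# that the SHIKASHI example reproduces "SH, A, SH"; the patent does not give this set.
RELIABLE = {"SH", "S", "A", "I", "U", "E", "O"}

Rule = Callable[[Segment], bool]
RULES: List[Rule] = [
    lambda seg: not seg.devoiceable,    # a possibly devoiced phoneme is not a feature phoneme
    lambda seg: seg.label in RELIABLE,  # ASSUMPTION: keep only strong labels
]


def feature_pattern(word: Word, rules: List[Rule] = RULES) -> Tuple[str, ...]:
    """Drop the inflectional ending, then keep the segments that pass every rule."""
    segs = word.segments if word.stem_len is None else word.segments[: word.stem_len]
    return tuple(s.label for s in segs if all(rule(s) for rule in rules))


SHIKASHI = Word("SHIKASHI", [Segment("SH"), Segment("I", True), Segment("K"),
                             Segment("A"), Segment("SH"), Segment("I", True)])
ARUKU = Word("ARUKU", [Segment("A"), Segment("R"), Segment("U"),
                       Segment("K"), Segment("U")], stem_len=3)
```

Keeping the rules as a separate list mirrors the split between the rule storage unit 21 and the extraction unit 22: rules can be added or changed without touching the extraction loop.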

The feature phoneme pattern of each word is thus extracted and stored in the word feature phoneme pattern holding unit 23.

Next, the phrase feature phoneme synthesis unit 25 synthesizes phrase feature phonemes according to the phrase models described in the phrase model storage unit 24: based on each word's part-of-speech information, it combines that word's feature phoneme pattern with the feature phoneme patterns of the other words that can be combined with it. The phrase models stored in the phrase model storage unit 24 are, for example, "noun + particle" and "verb + auxiliary verb + particle". The phrase feature phoneme synthesis unit 25 performs the synthesis by connecting in series the word feature phoneme patterns, held in the word feature phoneme pattern holding unit 23, that are to be combined.
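The series connection can be sketched as taking, for each phrase model, the Cartesian product of the word feature patterns available for each part of speech and concatenating them. This is a sketch under assumptions: the toy pattern table and part-of-speech names are hypothetical, and the returned table also records which word-pattern combination produced each phrase pattern, which later candidate generation needs.

```python
from itertools import product
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]

# Hypothetical toy contents of the word pattern holding unit:
# part of speech -> distinct word feature patterns for that class.
WORD_PATTERNS: Dict[str, List[Pattern]] = {
    "conjunction": [("SH", "A", "SH")],
    "noun": [("A",), ("SH", "A")],
    "particle": [("A",)],
}
# Phrase models as sequences of parts of speech.
PHRASE_MODELS: List[Tuple[str, ...]] = [
    ("conjunction",),
    ("noun", "particle"),
    ("verb", "auxiliary verb", "particle"),  # no verbs in the toy data, so yields nothing
]


def synthesize(word_patterns: Dict[str, List[Pattern]],
               models: List[Tuple[str, ...]]) -> Dict[Pattern, List[Tuple[Pattern, ...]]]:
    """Map each phrase feature pattern to the word-pattern combinations it was built from."""
    table: Dict[Pattern, List[Tuple[Pattern, ...]]] = {}
    for model in models:
        slots = [word_patterns.get(pos, []) for pos in model]
        for combo in product(*slots):
            phrase = tuple(ph for part in combo for ph in part)  # series connection
            table.setdefault(phrase, []).append(combo)
    return table
```

Because the product runs over distinct patterns rather than over words, the number of concatenations depends on how many patterns each part of speech has, not on how many words.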

The output of the phrase feature phoneme synthesis unit 25 is stored in the phrase feature phoneme pattern holding unit 26.

The analysis unit 40 performs short-time frequency analysis on the input speech, and the result is supplied to the input phoneme extraction unit 29, which extracts input phoneme candidates. Here, strong phonemes are detected from the analysis data and a phoneme pattern is built. For example, when "しかし" (shikashi) is uttered, the result is "(SH, S), I, (K, T, P), A, (SH, S)", where a parenthesized group means that any one of its members may be the phoneme at that position.

For each pattern in the phrase feature phoneme pattern holding unit 26, the selection unit 27 checks whether all of its phonemes are contained among the phoneme candidates of the input speech supplied by the input phoneme extraction unit 29, and selects only the phrase feature phoneme patterns that are fully contained. For example, if the input speech yields "(SH, S), I, (K, T, P), A, (SH, S)" as above, phrase feature phoneme patterns such as "SH, A, SH" and "S, A, SH" are selected.
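The containment test can be sketched by treating the input as a sequence of candidate sets (slots) and checking that a phrase pattern matches successive slots in order. The ordering requirement is an assumption on our part: the text only says every feature phoneme must be contained in the input candidates. Greedy leftmost matching is sufficient for this ordered check.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Pattern = Tuple[str, ...]
Slot = FrozenSet[str]


def contained(pattern: Pattern, slots: Sequence[Slot]) -> bool:
    """True if each phoneme of `pattern` matches, in order, a later input slot listing it."""
    remaining = iter(slots)
    # `any` consumes the shared iterator, so each phoneme only searches later slots.
    return all(any(ph in slot for slot in remaining) for ph in pattern)


def select(patterns: Iterable[Pattern], slots: Sequence[Slot]) -> List[Pattern]:
    """Keep only the phrase feature patterns fully contained in the input candidates."""
    return [p for p in patterns if contained(p, slots)]


# The input example "(SH, S), I, (K, T, P), A, (SH, S)" as candidate sets.
INPUT = [frozenset({"SH", "S"}), frozenset({"I"}), frozenset({"K", "T", "P"}),
         frozenset({"A"}), frozenset({"SH", "S"})]
```

With this input, "SH, A, SH" and "S, A, SH" pass, while a pattern such as "A, SH, A" fails because no slot after the last SH offers an A.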

The phrase candidate generation unit 28 identifies the original word feature phoneme patterns from which each phrase feature phoneme pattern selected by the selection unit 27 was synthesized, by referring to the phrase feature phoneme pattern holding unit 26, and then identifies the original words from which those word feature phoneme patterns were obtained, by referring to the word feature phoneme pattern holding unit 23. Phrase candidates are generated by combining the words so identified.
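The two look-ups can be sketched as back-reference tables: phrase pattern to word-pattern combinations, then word pattern to words, followed by a product over the word lists. The table contents are hypothetical toy data, and keying words by pattern alone (ignoring part of speech) is a simplifying assumption. Expanding words into inflected phrase forms would need the dictionary's inflection data and is not modelled.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple

Pattern = Tuple[str, ...]

# Hypothetical back-reference tables.
# Phrase pattern holding unit: phrase pattern -> word-pattern combinations it came from.
PHRASE_TABLE: Dict[Pattern, List[Tuple[Pattern, ...]]] = {
    ("SH", "A", "SH"): [(("SH", "A", "SH"),)],
}
# Word pattern holding unit, inverted: word pattern -> words that yield it.
WORD_TABLE: Dict[Pattern, List[str]] = {
    ("SH", "A", "SH"): ["SHIKASHI", "SHITASHII"],
}


def generate_candidates(selected: Iterable[Pattern],
                        phrase_table: Dict[Pattern, List[Tuple[Pattern, ...]]],
                        word_table: Dict[Pattern, List[str]]) -> List[Tuple[str, ...]]:
    """Trace selected phrase patterns back to words and combine them into word sequences."""
    candidates: List[Tuple[str, ...]] = []
    for phrase_pattern in selected:
        for combo in phrase_table.get(phrase_pattern, []):
            word_lists = [word_table.get(wp, []) for wp in combo]
            candidates.extend(product(*word_lists))
    return candidates
```

Only the words behind the selected patterns are ever expanded, so the full set of phrases is never built.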

For example, if the selected phrase feature phoneme pattern is "SH, A, SH", the only word feature pattern composing it is "SH, A, SH", and the words that yield this word feature phoneme pattern are "SHIKASHI" (しかし, "but") and "SHITASHII" (したしい, "close"). Phrases are then synthesized from these words, giving phrase candidates such as "SHIKASHI", "SHITASHII" (親しい, "close"), and "SHITASHIKU" (親しく, "closely").

Needless to say, the information for each phrase candidate is then matched against the analysis data of the input speech by the matching unit of the recognition device, using conventional techniques, to perform recognition.

[Effects of the Invention] According to the present invention, instead of first synthesizing every possible phrase and then selecting phrase candidates, as in the prior art, features are extracted at the word level and the number of distinct kinds is reduced before phrases are synthesized. The number of phrases subject to selection is therefore reduced, and candidate selection can be sped up.
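The source of the saving can be shown with simple counting: for a phrase model, the prior art forms one phrase per combination of words, whereas the invention forms one concatenation per combination of distinct word feature patterns. The toy dictionary below is hypothetical and the numbers it produces are illustrative, not results from the patent.

```python
from math import prod
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]

# Hypothetical toy dictionary: part of speech -> list of (word, feature pattern).
DICTIONARY: Dict[str, List[Tuple[str, Pattern]]] = {
    "noun": [("noun_1", ("A",)), ("noun_2", ("A",)),
             ("noun_3", ("SH", "A")), ("noun_4", ("SH", "A"))],
    "particle": [("particle_1", ("A",)), ("particle_2", ("A",)), ("particle_3", ("O",))],
}
MODELS: List[Tuple[str, ...]] = [("noun", "particle")]


def phrase_count(d: Dict[str, List[Tuple[str, Pattern]]], models: List[Tuple[str, ...]]) -> int:
    """Phrases the prior art must synthesize: product of word counts per model."""
    return sum(prod(len(d[pos]) for pos in m) for m in models)


def pattern_combination_count(d: Dict[str, List[Tuple[str, Pattern]]],
                              models: List[Tuple[str, ...]]) -> int:
    """Concatenations the invention forms: product of distinct-pattern counts per model."""
    return sum(prod(len({pat for _, pat in d[pos]}) for pos in m) for m in models)
```

In this toy case, 4 nouns and 3 particles give 12 phrases, but only 2 distinct noun patterns and 2 distinct particle patterns, so at most 4 phrase patterns need to be checked.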

[Brief description of the drawings]

FIG. 1 is a block diagram of the principle of the present invention, FIG. 2 is a block diagram of an embodiment of the present invention, FIG. 3 is a block diagram of a conventional example, and FIG. 4 is an explanatory diagram of a phoneme label network.
In FIG. 1: 10: word dictionary; 11: word feature phoneme extraction unit; 12: phrase feature phoneme synthesis unit; 13: phrase candidate selection unit; 14: input phoneme extraction unit; 15: analysis unit.

Continuation of the front page
(56) References cited: JP-A-62-32499 (JP, A); JP-A-63-85697 (JP, A); JP-A-61-238099 (JP, A); JP-A-60-33599 (JP, A); JP-A-63-220298 (JP, A); JP-A-58-87599 (JP, A); JP-A-61-256396 (JP, A); JP-A-61-62167 (JP, A); JP-B-5-67040 (JP, B2)
(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/20

Claims

(57) [Claims]

A phrase candidate reduction method for a speech recognition device whose recognition unit is the phrase, characterized in that: for each word in a word dictionary containing all words to be recognized, a word feature phoneme extraction unit extracts the feature phonemes that are reliably expected to appear when a phrase containing the word is uttered; a phrase feature phoneme synthesis unit combines the word feature phonemes extracted by the word feature phoneme extraction unit, which are of fewer kinds than there are words, to synthesize the features of phrases; an analysis unit analyzes the input speech, and an input phoneme extraction unit extracts the feature phonemes of the input speech from the analysis output; and a phrase candidate selection unit selects the phrase candidates that are effective for the matching process of speech recognition, based on the phrase features synthesized by the phrase feature phoneme synthesis unit and the phonemes of the input speech extracted by the input phoneme extraction unit.