JP2002258891A

JP2002258891A - Method, device and program for recognizing voice, and recording medium for the program

Info

Publication number: JP2002258891A
Application number: JP2001054784A
Authority: JP
Inventors: Tetsuo Amakasu; 哲郎甘粕; Hisashi Obara; 永小原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-02-28
Filing date: 2001-02-28
Publication date: 2002-09-11

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition method that determines the semantic content of an utterance from input speech with high accuracy by preventing the dispersion of probability caused by unnecessary calculation. SOLUTION: The method receives an input speech parameter sequence and the semantic expression of the preceding utterance, and takes as input occurrence probabilities extracted from a table that records the occurrence probability of each semantic expression in the next utterance. It then calculates, for the input speech parameter sequence, the likelihood of each semantic expression in a semantic expression list, which stores semantic expressions representing the semantic content of the utterance sentences the user is assumed to make. The calculation uses an acoustic model, which records syllable standard patterns corresponding to speech parameter sequences, and a sentence model, which represents the conditional occurrence probability between the syllable chain standard patterns of the assumed utterance sentences and the corresponding semantic expressions. The semantic expression with the maximum likelihood is output.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a speech recognition method, apparatus, program, and recording medium that take human speech as input, interpret the linguistic meaning of the input speech, and output a semantic expression. It is used for processing such as ordering products by voice.

[0002]

[Prior art] FIG. 9 shows the configuration of a conventional speech recognition apparatus. A speech analysis unit extracts a feature parameter sequence of the input speech (hereinafter, the "input speech parameter sequence") by, for example, FFT (Fast Fourier Transform) analysis or LPC (Linear Predictive Coding) analysis. In the speech recognition unit, the input speech parameter sequence is matched against a recognition acoustic model, which records the correspondence between speech parameter sequences and syllable (acoustic chain pattern) occurrence probabilities, and against a recognition language model, which defines linguistic constraints. The unit then generates and outputs a recognized word string as the recognition result.

[0003] The semantic analysis unit receives the recognized word string output from the speech recognition unit, analyzes it using a semantic analysis grammar described as rules, and outputs the result as a semantic expression (for example, a classification into categories).

[0004]

[Problem to be solved by the invention] In the conventional speech recognition apparatus, the set of recognized word strings that the speech recognition unit generates with the recognition language model differed from the set of word strings that the semantic analysis grammar used by the semantic analysis unit can analyze. As a result, the semantic analysis unit sometimes could not analyze an input recognized word string with the semantic analysis grammar. In that case, the recognized word string was rejected, and the semantic analysis unit produced no valid output. Moreover, when the speech recognition unit outputs recognized word strings that are rejected in later processing, it is performing excess processing. This excess processing lowered the efficiency of speech recognition, and the data it generated dispersed the probabilities and lowered the speech recognition rate.

[0005] A method has also been proposed that uses a semantic analysis grammar described as rules (for example, a context-free grammar) as the language model for speech recognition in a unified manner. However, to narrow down the phrase candidates sufficiently during speech recognition, an enormous number of rules must be written, which is costly and makes the method difficult to realize.

[0006]

[Means for solving the problem] To solve the above problems, the present invention does not generate sentences dynamically during speech recognition. Instead, it uses a set of sentences associated with semantic expressions as a unified language model and further predicts the content of the next utterance probabilistically, thereby providing a method that statistically performs processing equivalent to speech recognition and semantic analysis.

[0007]

[Embodiments of the invention] FIG. 1 shows the configuration of one embodiment of the speech recognition apparatus of the present invention. The apparatus extracts an input speech parameter sequence from the input speech and outputs the linguistic content of the utterance as a semantic expression. The apparatus consists of a speech analysis unit and a semantic expression analysis unit. The semantic expression analysis unit consists of a likelihood calculation unit, an acoustic model storage unit, a sentence model storage unit, a semantic expression list storage unit, and an utterance prediction unit that has a next-utterance prediction table.

[0008] The speech analysis unit extracts an input speech parameter sequence (feature parameter sequence) from the input speech by FFT analysis, LPC analysis, or the like, and outputs it. The likelihood calculation unit receives the input speech parameter sequence and the semantic expression corresponding to the immediately preceding utterance, and takes as input a list of the predicted occurrence probabilities of all semantic expressions in the next utterance (called the "next-utterance semantic expression occurrence probability set"). Using the information stored in the acoustic model and in the sentence model, it performs a likelihood calculation that determines and outputs the semantic expression corresponding to the input speech parameter sequence.
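The speech analysis step in paragraph [0008] can be illustrated with a minimal sketch. It is not the patent's implementation: the frame length, hop size, and the function names `frame_signal`, `magnitude_spectrum`, and `analyze` are assumptions made for illustration, and a naive DFT stands in for the FFT (an FFT yields the same magnitudes, only faster).

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def analyze(samples, frame_len=8, hop=4):
    """Return one feature vector per frame: a toy 'input speech parameter sequence'."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Hypothetical test signal: a cosine with two cycles per 8-sample frame.
tone = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
features = analyze(tone)
```

Each element of `features` is the feature vector of one frame. For this test tone the energy concentrates in bin 2 of every frame, which is the kind of frequency-domain feature the later matching steps consume.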

[0009] The acoustic model records syllable standard patterns that are represented statistically, for example by hidden Markov models, built from the same kind of speech parameters (feature parameters) as those used for the input speech parameter sequence. In other words, it records syllables (in CV units, where C is a consonant and V is a vowel) in correspondence with acoustic features in the frequency domain. (See Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," IEICE (1988).)

The semantic expression list records all the semantic expressions for the utterance sentences that the user is assumed to utter.

[0010] FIG. 2 shows a specific example of the semantic expression list. In this example, a semantic expression is a combination of three values: the type of command that the user's utterance means (list display, product order, order end) and the information that serves as its arguments (category name "vegetable", product name "cabbage", quantity "1"). (The "meaning" item only explains the content of each semantic expression and is not part of the list.)

The sentence model records the syllable chain patterns of the user's utterance sentences together with the conditional occurrence probability that each syllable chain pattern occurs when the user makes an utterance about a given semantic expression (that is, the probability that a given utterance sentence is generated given that a given semantic expression is generated). In other words, it records the correspondence between syllable chain patterns and the conditional occurrence probability for each semantic expression.
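The two data structures described in paragraph [0010] can be sketched as follows. The type and function names are hypothetical, and NULL is represented as `None`. The three semantic expressions and the conditional probabilities are the ones quoted in the worked example later in the description; syllable chain patterns are referred to by number only, since the text does not spell out all four.

```python
from typing import Dict, NamedTuple, Optional


class SemanticExpression(NamedTuple):
    command: str         # type of command, e.g. "list display"
    arg1: Optional[str]  # first argument, or None for NULL
    arg2: Optional[str]  # second argument, or None for NULL


# Semantic expression list, numbered as in the text's example.
SEMANTIC_EXPRESSIONS: Dict[int, SemanticExpression] = {
    1: SemanticExpression("list display", "vegetable", None),
    2: SemanticExpression("product order", "cabbage", "1"),
    3: SemanticExpression("order end", None, None),
}

# Sentence model: expression number -> {pattern number: P(pattern | expression)}.
# The row for expression 1 is not given in the text and is omitted here.
SENTENCE_MODEL: Dict[int, Dict[int, float]] = {
    2: {1: 0.5, 2: 0.5, 3: 0.0, 4: 0.0},
    3: {1: 0.0, 2: 0.0, 3: 0.5, 4: 0.5},
}


def conditional_probability(pattern: int, expression: int) -> float:
    """P(syllable chain pattern | semantic expression); 0.0 if not recorded."""
    return SENTENCE_MODEL.get(expression, {}).get(pattern, 0.0)
```

Keying the sentence model by semantic expression first matches how the likelihood calculation reads it: one expression at a time, summing over all patterns.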

[0011] FIG. 3 shows a specific example of the sentence model. The syllable chain patterns in the sentence model are written with the same syllable types as those whose standard patterns are stored in the acoustic model. Here, 26 syllable types are used.

When requested by the likelihood calculation unit, the utterance prediction unit performs utterance prediction. It takes the semantic expression of the previous utterance as input, uses the next-utterance prediction table, which records the correspondence between each previous semantic expression and the appearance probability of each next semantic expression, and outputs the appearance probabilities of the semantic expressions expected in the next utterance as the next-utterance semantic expression appearance probability set.

[0012] FIG. 4 shows a configuration example of the utterance prediction unit. Each row of the next-utterance prediction table corresponds to one semantic expression of the previous utterance (M1, ..., Mx, ..., Mn), and the leftmost column of the table stores that semantic expression. Each of the other columns corresponds to one semantic expression of the next utterance (M1, ..., Mx, ..., Mn). Each number in such a column is the conditional probability that the column's semantic expression appears in the next utterance given that the semantic expression in the leftmost column has appeared: (P(M1|M1), ..., P(Mn|M1)), ..., (P(M1|Mn), ..., P(Mn|Mn)).

[0013] On receiving the semantic expression of the previous utterance as input, the utterance prediction processing unit performs the following processing. It compares the input semantic expression with each semantic expression in the leftmost column of the next-utterance prediction table and outputs the values in the other columns of the matching row (for example, for row Mx, the values P(M1|Mx), ..., P(Mn|Mx)) as the next-utterance semantic expression appearance probability set. FIG. 5 shows a specific example of the next-utterance prediction table. In this table, the symbol "*" in a semantic expression indicates that the value at that position may be anything. This is a measure for creating the table efficiently.
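The table lookup with the "*" wildcard described in paragraphs [0012] and [0013] can be sketched as below. This is an illustration under assumptions: the names `matches` and `predict_next` are hypothetical, NULL is represented as `None`, and the table holds only the single row whose values the description quotes (previous expression (list display, *, NULL) giving (0, 0.6, 0.4)). The remaining rows of FIG. 5 are not reproduced in the text and are not invented here.

```python
from typing import List, Optional, Tuple

WILDCARD = "*"
Expression = Tuple[Optional[str], ...]

# Each row: (previous semantic expression, probabilities for next M1..Mn).
NEXT_UTTERANCE_TABLE: List[Tuple[Expression, Tuple[float, ...]]] = [
    (("list display", WILDCARD, None), (0.0, 0.6, 0.4)),
]


def matches(row_key: Expression, expression: Expression) -> bool:
    """True if every field is equal or the row's field is the wildcard."""
    return len(row_key) == len(expression) and all(
        k == WILDCARD or k == e for k, e in zip(row_key, expression)
    )


def predict_next(previous: Expression) -> Optional[Tuple[float, ...]]:
    """Return the probabilities of the first matching row, or None if none match."""
    for row_key, probabilities in NEXT_UTTERANCE_TABLE:
        if matches(row_key, previous):
            return probabilities
    return None


next_probs = predict_next(("list display", "vegetable", None))
```

The wildcard lets one row cover every category a user might ask to list, which is the efficiency measure the paragraph mentions. Returning `None` for an unmatched expression is a choice made for this sketch; the description does not say what happens in that case.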

[0014] The likelihood calculation procedure of the likelihood calculation unit is described with reference to the flows shown in FIGS. 6 and 7 and the example calculation results shown in FIG. 8 (an example of accepting orders for fresh food by voice from the user). When the user makes an utterance whose content is a certain semantic expression, the unit receives the input speech parameters obtained by speech analysis of the input speech (for example, a feature parameter sequence converted to the frequency domain) and proceeds as follows.
(1) Set maximum likelihood = 0 and result semantic expression = NULL (initialization).
(2) Save the input speech parameter sequence.
(3) Repeat the following processing for every semantic expression in the semantic expression list. (For every semantic expression M in the list (semantic expressions 1 to 3 in FIG. 2), likelihood 1 is obtained, and the semantic expression that yields the highest likelihood 1 is output as the semantic content expressed by the input speech parameter sequence.)
(3.1) Repeat (3.1.1) to (3.1.4) for the input speech parameter sequence and each syllable chain pattern in the sentence model (syllable chain pattern numbers 1 to 4 in FIG. 3).

[0015] In this example, assume that the user's immediately preceding utterance was "Please display the list of vegetables" and that the previous-utterance semantic expression was 1: (list display, *, NULL). Given this previous-utterance semantic expression 1, the utterance prediction unit refers to the next-utterance prediction table shown in FIG. 5 and outputs the next-utterance semantic expression appearance probability set (0, 0.6, 0.4) to the likelihood calculation unit. Consequently, for utterance semantic expression 1: (list display, "vegetable", NULL), probability value 5 is 0 and likelihood 1 is therefore clearly 0, so its calculation result is not shown in FIG. 8.
(3.1.1) Set probability value 4 = 0 (initialization).
(3.1.2) Calculate probability value 1 for syllable chain pattern S.
(3.1.3) Calculate probability value 2 for the input speech parameter sequence.

[0016] Probability value 1 and probability value 2 are calculated as follows. From the sentence model (see FIG. 3), the conditional occurrence probabilities of syllable chain patterns 1 to 4 given semantic expression 2 or 3, namely (0.5, 0.5, 0, 0) or (0, 0, 0.5, 0.5), are extracted as probability value 1. Next, from the syllable chain patterns of numbers 1 to 4 (1: kjabetsuohitotsukudasai, ..., 4: chjumonowari), an acoustic feature sequence in the frequency domain is generated for each syllable chain pattern using the acoustic model (a model of the acoustic features of syllables in the frequency domain). From these generated acoustic feature sequences and the input speech parameter sequence (itself an acoustic feature sequence in the frequency domain), probability value 2, the likelihood (probability) corresponding to syllable chain patterns 1 to 4, is calculated for utterance semantic expressions 2 and 3: (1.0×10^-1, 1.0×10^-1, 1.0×10^-4, 1.0×10^-4) in both cases. (Probability value 2 can be calculated with the speech recognition technique described above.)
(3.1.4) Calculate probability value 3 = probability value 1 × probability value 2.
(3.1.5) Add up probability value 3 for each utterance expression to obtain probability value 4, that is, calculate Σ((probability value 1) × (probability value 2)). (The value summed over all syllable chain patterns in the sentence model is probability value 4: 1.0×10^-1 or 1.0×10^-4.)
(3.2) Calculate probability value 5 for utterance semantic expressions 2 and 3 with respect to the previous utterance. (The likelihood calculation unit sends a request signal to the utterance prediction unit and, from the semantic expression occurrence probability set obtained as a result of the utterance prediction unit's operation (see FIG. 4), extracts the next-utterance semantic expression appearance probability (probability value 5) of utterance semantic expression 2 or 3: 0.6 or 0.4.)
(3.3) Calculate likelihood 1 = probability value 4 × probability value 5 for each utterance semantic expression: 6.0×10^-2 or 4.0×10^-5.
(3.4) Determine whether likelihood 1 > maximum likelihood (the previously stored likelihood). If YES, go to (3.5); if NO, go to (3).
(3.5) Set maximum likelihood = likelihood 1 and result semantic expression = M, then go to (3).
(4) When the above calculation has been performed for every semantic expression in the semantic expression list, select and output the semantic expression corresponding to the likelihood that was updated under the preset criterion (the largest likelihood).
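Steps (3.1.4) to (3.3) can be summarized in one expression. The notation is introduced here for illustration and does not appear in the patent: X is the input speech parameter sequence, S ranges over the syllable chain patterns, and M_prev is the previous-utterance semantic expression. Probability value 1 corresponds to P(S | M), probability value 2 to P(X | S), probability value 4 to the sum, and probability value 5 to P(M | M_prev). Substituting the values quoted in the text reproduces the two likelihoods given in step (3.3).

```latex
\[
L_1(M) \;=\; P(M \mid M_{\mathrm{prev}}) \sum_{S} P(S \mid M)\, P(X \mid S)
\]
\[
L_1(M_2) = 0.6 \times \left(0.5 \cdot 10^{-1} + 0.5 \cdot 10^{-1} + 0 \cdot 10^{-4} + 0 \cdot 10^{-4}\right)
         = 0.6 \times 1.0 \times 10^{-1} = 6.0 \times 10^{-2}
\]
\[
L_1(M_3) = 0.4 \times \left(0 \cdot 10^{-1} + 0 \cdot 10^{-1} + 0.5 \cdot 10^{-4} + 0.5 \cdot 10^{-4}\right)
         = 0.4 \times 1.0 \times 10^{-4} = 4.0 \times 10^{-5}
\]
```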

[0017] When the calculation results are as shown in FIG. 8, likelihood 1 (6.0×10^-2) of semantic expression number 2: (product order, "cabbage", 1) is compared with likelihood 1 (4.0×10^-5) of semantic expression number 3: (order end, NULL, NULL), and the semantic expression with the larger likelihood 1, namely (product order, "cabbage", "1"), becomes the output semantic expression.

The speech recognition apparatus of the present invention can also be configured from a computer having a CPU, memory, and the like, a user terminal used by the user who accesses it, and a recording medium.
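The procedure of steps (1) to (4) and the comparison in paragraph [0017] can be sketched in a few lines. This is a sketch under stated assumptions, not the patented implementation: the function name and dictionary layout are hypothetical, probability value 2 is supplied as given numbers rather than computed from an acoustic model, and probability value 4 is initialized once per semantic expression so that the sum over patterns accumulates. The sentence model row for expression 1 is not given in the text and is left empty, which does not affect the result because its probability value 5 is 0.

```python
from typing import Dict, Optional, Tuple


def select_semantic_expression(
    sentence_model: Dict[int, Dict[int, float]],
    acoustic_probs: Dict[int, float],
    next_probs: Dict[int, float],
) -> Tuple[Optional[int], float]:
    """Return (expression number, likelihood 1) for the largest likelihood 1."""
    max_likelihood = 0.0          # step (1): maximum likelihood = 0
    result: Optional[int] = None  # step (1): result semantic expression = NULL
    for m, pattern_probs in sentence_model.items():   # step (3)
        p4 = 0.0                                      # (3.1.1), once per M
        for s, p2 in acoustic_probs.items():          # (3.1), p2 = value 2
            p1 = pattern_probs.get(s, 0.0)            # (3.1.2) P(S | M)
            p4 += p1 * p2                             # (3.1.4) and (3.1.5)
        p5 = next_probs[m]                            # (3.2)
        likelihood1 = p4 * p5                         # (3.3)
        if likelihood1 > max_likelihood:              # (3.4)
            max_likelihood, result = likelihood1, m   # (3.5)
    return result, max_likelihood                     # step (4)


# Values quoted in the text; keys are semantic expression and pattern numbers.
SENTENCE_MODEL = {
    1: {},  # not given in the text; left empty (assumption, prior is 0 anyway)
    2: {1: 0.5, 2: 0.5, 3: 0.0, 4: 0.0},
    3: {1: 0.0, 2: 0.0, 3: 0.5, 4: 0.5},
}
ACOUSTIC_PROBS = {1: 1.0e-1, 2: 1.0e-1, 3: 1.0e-4, 4: 1.0e-4}  # probability value 2
NEXT_PROBS = {1: 0.0, 2: 0.6, 3: 0.4}                          # probability value 5

best, likelihood = select_semantic_expression(
    SENTENCE_MODEL, ACOUSTIC_PROBS, NEXT_PROBS
)
print(best, likelihood)
```

With these inputs the function selects semantic expression 2, the product order for one cabbage, in agreement with the comparison described for FIG. 8.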

[0018] The recording medium is a machine-readable recording medium such as a CD-ROM, a magnetic disk, or a semiconductor memory. The program recorded on it is read by the computer, controls the operation of the computer, and realizes on the computer the components such as the speech analysis unit, the likelihood calculation unit, and the utterance prediction unit.

[0019]

[Effects of the invention] According to the present invention, no sentences are created dynamically during speech recognition. Instead, a set of sentences associated with semantic expressions is used as a unified language model and the content of the next utterance is predicted probabilistically, so that speech recognition and semantic analysis are performed statistically. This prevents the dispersion of probability caused by unnecessary calculation and makes it possible to determine the semantic content of an utterance from the input speech with high accuracy.

[Brief description of the drawings]

FIG. 1 is a diagram showing a configuration example of the speech recognition apparatus of the present invention.

FIG. 2 is a diagram showing the contents of the semantic expression list in the embodiment of the present invention.

FIG. 3 is a diagram showing the sentence model in the embodiment of the present invention.

FIG. 4 is a diagram showing a configuration example of the utterance prediction unit of the present invention.

FIG. 5 is a diagram showing an example of the contents of the next-utterance prediction table used in the present invention.

FIG. 6 is a diagram showing the likelihood calculation procedure of the present invention.

FIG. 7 is a schematic diagram of the procedure by which the likelihood calculation unit of the present invention obtains likelihood 1 for one semantic expression.

FIG. 8 is a diagram showing an example of the calculation results in the likelihood calculation unit of the present invention.

FIG. 9 is a diagram showing a configuration example of a conventional speech recognition apparatus.

Claims

[Claims]

1. A speech recognition method that extracts a feature parameter sequence of input speech (hereinafter, the "input speech parameter sequence"), converts the input speech parameter sequence into a semantic expression, and outputs the semantic expression, the method characterized by: receiving the input speech parameter sequence and the semantic expression corresponding to the immediately preceding utterance of the user; taking as input the estimated occurrence probability of each semantic expression in the next utterance; calculating the likelihood of each semantic expression stored in a semantic expression list with respect to the input speech parameter sequence, using the information of an acoustic model that records syllable standard patterns corresponding to speech parameter sequences and of a sentence model that represents the conditional occurrence probability between the syllable chain standard patterns of utterance sentences the user is assumed to utter and the corresponding semantic expressions; and outputting the semantic expression with the highest likelihood.

2. A speech recognition apparatus comprising a speech analysis unit that extracts a feature parameter sequence of input speech (hereinafter, the "input speech parameter sequence") and a semantic expression analysis unit that converts the input speech parameter sequence into a semantic expression and outputs the semantic expression, wherein the semantic expression analysis unit comprises: a semantic expression list storage unit that stores semantic expressions representing the semantic content of utterance sentences the user is assumed to utter; an acoustic model that stores syllable standard patterns corresponding to speech parameter sequences; a sentence model storage unit that stores a sentence model representing the conditional occurrence probability between the syllable chain standard patterns of the utterance sentences the user is assumed to utter and the corresponding semantic expressions; an utterance prediction unit that receives the semantic expression corresponding to the user's utterance and estimates and outputs the occurrence probability of each semantic expression in the next utterance; and a likelihood calculation unit that takes as input the input speech parameter sequence and the semantic expression occurrence probabilities estimated by the utterance prediction unit, calculates, using the information of the acoustic model and the sentence model, the likelihood of each semantic expression stored in the semantic expression list with respect to the input speech parameter sequence, and outputs the semantic expression with the highest likelihood.

3. A speech recognition program that causes a computer to execute: a process of extracting a feature parameter sequence of input speech (hereinafter, the "input speech parameter sequence"); and a process of receiving the input speech parameter sequence and the semantic expression corresponding to the immediately preceding utterance of the user, taking as input the estimated occurrence probability of each semantic expression in the next utterance, calculating the likelihood of each semantic expression stored in a semantic expression list with respect to the input speech parameter sequence by using the information of an acoustic model that records syllable standard patterns corresponding to speech parameter sequences and of a sentence model that represents the conditional occurrence probability between the syllable chain standard patterns of utterance sentences the user is assumed to utter and the corresponding semantic expressions, and outputting the semantic expression with the highest likelihood.

4. A computer-readable recording medium on which is recorded a speech recognition program that causes a computer to execute: a process of extracting a feature parameter sequence of input speech (hereinafter, the "input speech parameter sequence"); and a process of receiving the input speech parameter sequence and the semantic expression corresponding to the immediately preceding utterance of the user, taking as input the estimated occurrence probability of each semantic expression in the next utterance, calculating the likelihood of each semantic expression stored in a semantic expression list with respect to the input speech parameter sequence by using the information of an acoustic model that records syllable standard patterns corresponding to speech parameter sequences and of a sentence model that represents the conditional occurrence probability between the syllable chain standard patterns of utterance sentences the user is assumed to utter and the corresponding semantic expressions, and outputting the semantic expression with the highest likelihood.