JP2000330586A

JP2000330586A - Method and device for recognizing speech

Info

Publication number: JP2000330586A
Application number: JP11140251A
Authority: JP
Inventors: Seiichi Miki; 清一三木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-05-20
Filing date: 1999-05-20
Publication date: 2000-11-30
Anticipated expiration: 2019-05-20
Also published as: JP3369121B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that allows a partial word string or partial character string in the recognition result obtained for the whole input speech to be arbitrarily designated and corrected. SOLUTION: This speech recognition device comprises a speech recognition part 10 that recognizes input speech in linguistic units and generates a graph in which word strings are expressed by arcs corresponding to the linguistic units, a section designation part 23 for designating an arbitrary time section, and a language processing part 20 that generates plural recognition results for the arbitrary time section, designated by the section designation part 23, in the graph generated by the speech recognition part 10.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a method and an apparatus for recognizing continuously uttered speech.

【０００２】[0002]

[Prior Art] As an example of a speech recognition apparatus, Japanese Unexamined Patent Application Publication No. 9-281989 discloses an apparatus that omits useless matching in the linguistic evaluation processing and can therefore perform recognition in a practical amount of time. FIG. 9 shows the schematic configuration of this apparatus. As shown in FIG. 9, it comprises a phoneme recognition unit 110, a phoneme model storage unit 111, a language processing unit 120, a candidate storage unit 121, a result storage unit 122, a dictionary storage unit 130, a syntax rule storage unit 140, and a linguistic information storage unit 150.

[0003] The phoneme recognition unit 110 divides the input speech into phoneme units (segmentation), recognizes each resulting segment with reference to the phoneme model storage unit 111, and outputs, as the phoneme recognition result, a graph representation in phoneme units (a phoneme graph that expresses the phonological structure of words as a network). The phoneme graph output by the phoneme recognition unit 110 is a network model connected by a plurality of nodes, with the start of the utterance as the start node and the end of the utterance as the end node. The state of each section between nodes is represented by a phoneme symbol or a pseudo-phoneme symbol, and arcs representing the transitions between states are attached. Each arc corresponds to a recognized phoneme unit and is given, as its recognition score, a phoneme matching score and a unigram score for the phoneme unit. This recognition score indicates a measure of the likelihood of the phoneme candidate recognized for the node section, or the transition probability of the arc. Each node is given the best score from that node to the end node (the end of the utterance).

[0004] Based on the phoneme graph output by the phoneme recognition unit 110, the language processing unit 120 performs the final recognition using the dictionary storage unit 130, which stores information on the recognizable words; the syntax rule storage unit 140, which describes the acceptable sentences in terms of parts of speech; and the linguistic information storage unit 150, which stores statistical linguistic information.

[0005] Next, the specific flow of phoneme recognition processing in the above speech recognition apparatus is described with reference to FIG. 10.

[0006] When speech is input, the phoneme recognition unit 110 performs phoneme recognition on it and creates a phoneme graph (step S101). The phoneme graph is supplied to the language processing unit 120, which executes the language processing of steps S102 to S109 below.

[0007] First, the candidate storage unit, which holds the candidates being processed, is initialized so that it contains exactly one initial candidate (step S102). Next, after confirming that the candidate storage unit is not empty, the candidate with the best score is taken out of it (steps S103 and S104). The first time through, the candidate taken out is the initial candidate prepared in step S102.

[0008] Once the best-scoring candidate has been taken out in step S104, it is determined whether matching for that candidate has reached the end of the phoneme graph (step S105). If matching has reached the end of the phoneme graph and the candidate forms a valid sentence, the candidate is moved to the result storage unit (step S106). If it has not reached the end, the process moves to step S109 and performs language matching for the candidate that was taken out.

[0009] After a candidate has been moved to the result storage unit in step S106, it is determined whether enough candidates have been moved there (step S107). If not, the process returns to step S103. If so, the candidates in the result storage unit are output as the recognition results (step S108). The recognition results are also output in step S108 when the candidate storage unit is found to be empty in step S103.

[0010] In the phoneme recognition processing described above, language matching advances from a node that has been processed to the nodes that follow it. Specifically, the dictionary storage unit 130 and the syntax rule storage unit 140 are used to select the nodes that can be accepted as new candidates, and each candidate is then evaluated linguistically and given a score; these steps are carried out in sequence. The evaluation score is obtained with reference to the linguistic information storage unit 150, and it includes a prediction score on the phoneme graph.

[0011] In the conventional speech recognition apparatus described above, provided that the prediction score never falls below the actual score, recognition results are obtained in descending order of evaluation score, best first. That is, for the phoneme graph obtained from a given input speech, recognition results spanning the graph from its start to its end can be obtained in score order. For example, if neither the number of candidates held in the candidate storage unit nor the number of recognition results held in the result storage unit is limited, every recognition result that connects the start of the phoneme graph to its end can be obtained in score order.

【００１２】[0012]

[Problems to be Solved by the Invention] If a partial word string or partial character string in the first-ranked recognition result obtained for the entire input speech contains an error, recognition could be performed efficiently if only that portion could be corrected. However, the conventional speech recognition apparatus described above obtains a plurality of recognition result candidates for the entire input speech and selects among them in descending order of evaluation score, so it cannot correct such a partial word string or partial character string in the first-ranked recognition result.

[0013] An object of the present invention is to provide a speech recognition method and a speech recognition apparatus that allow a partial word string or partial character string in the recognition result obtained for the entire input speech to be arbitrarily designated and corrected.

【００１４】[0014]

[Means for Solving the Problems] To achieve the above object, the speech recognition apparatus of the present invention comprises: speech recognition means for recognizing input speech in linguistic units and generating a graph in which word strings are represented by arcs corresponding to the linguistic units; section designation means for designating an arbitrary time section; and language processing means for generating a plurality of recognition results for the arbitrary time section, designated by the section designation means, in the graph generated by the speech recognition means.

[0015] The speech recognition method of the present invention comprises: a speech recognition step of recognizing input speech in linguistic units and generating a graph in which word strings are represented by arcs corresponding to the linguistic units; and a language processing step of generating a plurality of recognition results for an arbitrarily designated time section in the graph generated in the speech recognition step.

[0016] (Operation) In the present invention as described above, a plurality of recognition results are generated for an arbitrary time section of the graph obtained from the input speech. Therefore, when a partial word string or partial character string in the first-ranked recognition result obtained for the entire input speech contains an error, a plurality of recognition results can be obtained for that portion. The user can correct the error by selecting the correct result from among these recognition results.

【００１７】[0017]

[Embodiments of the Invention] Next, embodiments of the present invention are described with reference to the drawings.

[0018] FIG. 1 shows an embodiment of the speech recognition apparatus of the present invention. The speech recognition apparatus of this embodiment comprises a speech recognition unit 10, a standard pattern storage unit 11, a language processing unit 20, a candidate storage unit 21, a result storage unit 22, a section designation unit 23, a prediction score calculation unit 24, and a linguistic information storage unit 30.

[0019] For the parameter vector sequence obtained by analyzing the input speech, the speech recognition unit 10 generates a graph whose units are linguistic units (hereinafter, a word graph), using the standard patterns stored in the standard pattern storage unit 11 and the dictionary information and linguistic information stored in the linguistic information storage unit 30. The input speech is analyzed with, for example, a filter bank, a Fourier transform, or a linear-prediction-coefficient analyzer. Phonemes, syllables, words, and the like can be used as the linguistic units of the word graph.

[0020] In the word graph, each arc represents a linguistic unit and is given an acoustic score (a measure of acoustic likelihood) that expresses how close the corresponding portion of the analyzed input speech is to the standard pattern of that linguistic unit. The word graph has one start node and one end node, which correspond to the start and the end of the input speech, respectively. Each node of the word graph carries information corresponding to its time position in the input speech.
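The word graph described in this paragraph can be pictured as a small data structure. The following is a minimal sketch only, not the patent's implementation: the class and field names (`Arc`, `WordGraph`, `times`, `acoustic`), the "higher score is better" convention, and the check that an arc runs forward in time are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Arc:
    start: int       # id of the node where the arc begins
    end: int         # id of the node where the arc ends
    unit: str        # linguistic unit carried by the arc (phoneme, syllable, or word)
    acoustic: float  # acoustic score; assumed here to be "higher is better"


@dataclass
class WordGraph:
    times: Dict[int, float]  # node id -> time position in the input speech
    start_node: int          # node for the start of the input speech
    end_node: int            # node for the end of the input speech
    arcs: List[Arc] = field(default_factory=list)

    def add_arc(self, start: int, end: int, unit: str, acoustic: float) -> None:
        if start not in self.times or end not in self.times:
            raise KeyError("both nodes must exist before an arc is added")
        # Assumption for this sketch: arcs always run forward in time.
        if self.times[start] >= self.times[end]:
            raise ValueError("an arc must run forward in time")
        self.arcs.append(Arc(start, end, unit, acoustic))

    def outgoing(self, node: int) -> List[Arc]:
        return [arc for arc in self.arcs if arc.start == node]


# Hypothetical three-node graph with made-up times and scores.
graph = WordGraph(times={0: 0.0, 1: 0.4, 2: 0.9}, start_node=0, end_node=2)
graph.add_arc(0, 1, "yoru", -3.0)
graph.add_arc(1, 2, "wa", -1.5)
```

Keeping the time on the node rather than on the arc mirrors the text, where nodes carry the time position and arcs carry the linguistic unit and its acoustic score.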

[0021] The standard pattern storage unit 11 stores speech that has been analyzed in advance. For example, speech is stored in units of phonemes, context-dependent phonemes (phonemes that take the preceding and following environment into account), syllables, or words.

[0022] Based on the word graph supplied by the speech recognition unit 10, the language processing unit 20 calculates the evaluation score of each recognition result candidate using the linguistic information stored in the linguistic information storage unit 30, the acoustic scores attached to the word graph, and the prediction scores obtained for each node of the word graph by the prediction score calculation unit 24. It stores recognition result candidates that are still undergoing language processing in the candidate storage unit 21, and stores candidates whose language processing has finished in the result storage unit 22 as recognition results. In this way it obtains recognition results, in order of evaluation score, for the section of the word graph that corresponds to the section designated by the section designation unit 23.

[0023] The candidate storage unit 21 stores the recognition result candidates obtained while the language processing unit 20 processes the word graph, sorted in descending order of evaluation score. Each recognition result candidate holds, at a minimum, the candidate's evaluation score, its intermediate score, the number of the most recently processed node, and a processing end flag set to 0 or 1.

[0024] The result storage unit 22 stores the recognition results obtained as the language processing unit 20 processes the word graph, in the order in which they are obtained. The section designation unit 23 supplies the time section for which recognition results are wanted. Using the section designation unit 23, the user can directly designate a time within the input speech, or a partial word string or partial character string in the first-ranked recognition result (the most likely recognition result) for the entire input. The section designation unit 23 can be implemented with, for example, key input means.

[0025] When a partial word string in the first-ranked recognition result is designated through the section designation unit 23, the time recorded at the start node of the first arc of the corresponding word graph path is taken as the section start time, and the time recorded at the end node of the last arc is taken as the section end time. When a partial character string in the first-ranked recognition result is designated, an estimate of the section start time is obtained for the first arc of the corresponding word graph path from the correspondence between the linguistic unit attached to the arc and the designated character string, together with the times recorded at the arc's start node and end node. An estimate of the section end time is obtained in the same way for the last arc.
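The mapping from a designated string to a time section can be sketched as follows. The source does not say how the estimate for a partial character string is computed, so the linear interpolation over the characters of the arc's unit used here is purely an assumption, as are the function names and the `(start_node, end_node, unit)` tuple layout.

```python
def word_string_section(path, times, first, last):
    """Section for a partial word string: arcs path[first] .. path[last].

    path  : list of (start_node, end_node, unit) along the first-ranked result
    times : dict mapping node id -> time position
    """
    return times[path[first][0]], times[path[last][1]]


def char_boundary_time(arc, times, char_index):
    """Estimated time of the boundary before character `char_index` of the arc's unit.

    Assumption: characters are spread evenly over the arc's duration
    (linear interpolation). The source only says an estimate is derived
    from the unit/string correspondence and the two node times.
    """
    start_node, end_node, unit = arc
    fraction = char_index / len(unit)
    return times[start_node] + fraction * (times[end_node] - times[start_node])


# Hypothetical path: two arcs carrying the units "abcd" and "ef".
times = {0: 0.0, 1: 1.0, 2: 2.0}
path = [(0, 1, "abcd"), (1, 2, "ef")]

# Designating both words gives the node times directly.
whole_words = word_string_section(path, times, 0, 1)

# Designating the substring "cde": it begins at character 2 of the first arc
# and ends after character 1 of the last arc.
sub_start = char_boundary_time(path[0], times, 2)
sub_end = char_boundary_time(path[1], times, 1)
```

Because these times are only estimates, a tolerance around the start and end times is a natural companion to this step.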

[0026] For each node of the word graph being processed by the language processing unit 20, the prediction score calculation unit 24 uses the linguistic information stored in the linguistic information storage unit 30 and the acoustic scores attached to the word graph to provide a backward prediction score, calculated from the node toward the start node, and a forward prediction score, calculated from the node toward the end node. These prediction scores may be calculated only for the nodes that need them or for all nodes. Depending on the linguistic information, the prediction scores can be calculated quickly by using dynamic programming.

[0027] The linguistic information storage unit 30 stores the dictionary information and linguistic information used by the speech recognition unit 10 and the language processing unit 20. The dictionary information defines the linguistic units of the word graph. The linguistic information may be omitted, but using constraints on the linguistic units yields more accurate recognition result candidates. For words, for example, word n-grams, a table of permissible word-to-word connections, or a table of permissible connections between word parts of speech can be used as the linguistic information.

[0028] Next, the overall operation of the speech recognition apparatus of this embodiment is described in detail with reference to the flowchart of FIG. 2.

[0029] When speech is input, the speech recognition unit 10 performs phoneme recognition processing on it and creates a word graph (step S1). The word graph consists of nodes corresponding to time positions in the input speech and the arcs connecting them. For each arc, the start node, the end node, the corresponding linguistic unit, and the acoustic score are recorded. Such a word graph can be created with, for example, the method described in "Computer Speech and Language (1997) 11, pp. 43-72".

[0030] Once the word graph has been created, the prediction score calculation unit 24 calculates, for each node of the word graph, the best score of a path from the start node to that node (hereinafter, the backward prediction score) (step S2). The score of a path is the weighted sum of the acoustic scores attached to the arcs that make up the path and the language scores given by the linguistic information storage unit 30 for the chain of linguistic units along those arcs. When the linguistic information given by the linguistic information storage unit 30 is determined by two or fewer linguistic units, this calculation can be carried out with dynamic programming from the start node toward the end node, in time proportional to the number of nodes in the word graph. After the backward prediction scores have been calculated, the prediction score calculation unit 24 calculates, for each node of the word graph, the best score of a path from that node to the end node (hereinafter, the forward prediction score) (step S3). Here too, when the linguistic information given by the linguistic information storage unit 30 is determined by two or fewer linguistic units, the calculation can be carried out with dynamic programming from the end node toward the start node, in time proportional to the number of nodes in the word graph.
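Steps S2 and S3 can be illustrated with a short dynamic-programming sketch. It is a hedged example rather than the patent's procedure: the function name, the tuple layout of arcs, the weights `w_ac` and `w_lm`, and the callback `lang` are hypothetical, and the sketch assumes that every arc entering a given node carries the same unit so that a two-unit language score can be looked up per node.

```python
import math


def prediction_scores(nodes, arcs, lang, w_ac=1.0, w_lm=1.0):
    """Return (backward, forward) prediction scores for every node.

    nodes : node ids in time order; nodes[0] is the start node, nodes[-1] the end node
    arcs  : list of (start, end, unit, acoustic_score), each running forward in time
    lang  : lang(prev_unit, unit) -> language score; prev_unit is None at the start node
    """
    # Assumption: all arcs entering a node carry the same unit.
    unit_into = {end: unit for _, end, unit, _ in arcs}

    def arc_score(arc):
        start, _, unit, acoustic = arc
        return w_ac * acoustic + w_lm * lang(unit_into.get(start), unit)

    incoming = {n: [] for n in nodes}
    outgoing = {n: [] for n in nodes}
    for arc in arcs:
        outgoing[arc[0]].append(arc)
        incoming[arc[1]].append(arc)

    # Backward prediction score: best path from the start node to each node (S2).
    backward = {n: -math.inf for n in nodes}
    backward[nodes[0]] = 0.0
    for j in nodes[1:]:
        for arc in incoming[j]:
            backward[j] = max(backward[j], backward[arc[0]] + arc_score(arc))

    # Forward prediction score: best path from each node to the end node (S3).
    forward = {n: -math.inf for n in nodes}
    forward[nodes[-1]] = 0.0
    for k in reversed(nodes[:-1]):
        for arc in outgoing[k]:
            forward[k] = max(forward[k], arc_score(arc) + forward[arc[1]])

    return backward, forward


# Hypothetical graph with made-up acoustic scores and no language model.
nodes = [0, 1, 2, 3]
arcs = [(0, 1, "a", -1.0), (0, 2, "b", -5.0), (1, 2, "b", -2.0),
        (2, 3, "c", -1.0), (1, 3, "c", -4.0)]
backward, forward = prediction_scores(nodes, arcs, lang=lambda prev, unit: 0.0)
```

In this example the backward score of the end node and the forward score of the start node are both -4.0, the score of the best complete path, which is a quick sanity check on the two passes.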

[0031] Next, the language processing unit 20 determines the section start node group, the section end node group, and the intra-section node group in the word graph (step S4). It does so based on the start time and end time of the time section designated by the section designation unit 23 as the section for which speech recognition result candidates are wanted, and on a predetermined error that represents the tolerance of that time section. The section start node group consists of all nodes in the word graph located between "start time - error" and "start time + error". The section end node group consists of all nodes located between "end time - error" and "end time + error". The intra-section node group consists of all nodes located between "start time - error" and "end time + error". If either the section start node group or the section end node group is empty in step S4, the language processing unit 20 reports that no recognition result can be obtained and ends the processing.
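Step S4 amounts to three time-window filters over the nodes. The sketch below illustrates this under stated assumptions: the function name and the dictionary layout are hypothetical, the window boundaries are treated as inclusive (the source does not say), and returning `None` stands in for reporting that no recognition result can be obtained.

```python
def section_node_groups(times, start_time, end_time, error):
    """Return (start_group, end_group, inside_group), or None if a group is empty.

    times : dict mapping node id -> time position in the input speech
    """
    def within(low, high):
        # Assumption: window boundaries are inclusive.
        return {node for node, t in times.items() if low <= t <= high}

    start_group = within(start_time - error, start_time + error)
    end_group = within(end_time - error, end_time + error)
    inside_group = within(start_time - error, end_time + error)

    if not start_group or not end_group:
        return None  # no recognition result can be obtained for this section
    return start_group, end_group, inside_group


# Hypothetical node times.
times = {0: 0.0, 1: 0.5, 2: 1.0, 3: 1.5, 4: 2.0}
groups = section_node_groups(times, start_time=0.5, end_time=1.5, error=0.25)
no_match = section_node_groups(times, start_time=0.7, end_time=1.5, error=0.0)
```

With these values the start group is `{1}`, the end group is `{3}`, and the intra-section group is `{1, 2, 3}`; the second call returns `None` because no node lies exactly at time 0.7.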

[0032] Once the section start node group, the section end node group, and the intra-section node group have been determined, the language processing unit 20 stores in the candidate storage unit 21, for every node in the section start node group, the recognition result candidates obtained by connecting that node to each node of the intra-section node group that follows it (step S5). Each recognition result candidate recorded in the candidate storage unit 21 contains the sequence of word graph nodes already processed (restricted to the intra-section node group), an evaluation score, an intermediate score, and a processing end flag. For example, a stored candidate has the following contents. Its intermediate score is the weighted sum of the backward prediction score of the section start node, the acoustic score (recorded in the word graph) of the arc connecting the section start node to the following intra-section node, and the language score obtained from the linguistic information storage unit 30. Its evaluation score is the weighted sum of that intermediate score and the forward prediction score of the connected intra-section node. Its processed node sequence consists of the section start node and the intra-section node connected to it. Its processing end flag is 0. A penalty (a score in the negative direction) proportional to the difference between the time recorded at the section start node and the start time may also be added to the intermediate score.

[0033] After the recognition result candidates have been stored in the candidate storage unit 21, the language processing unit 20 checks whether the candidate storage unit 21 is empty (step S6). If it is not empty, the language processing unit 20 takes the recognition result candidate with the best evaluation score out of the candidate storage unit 21 and deletes that candidate's information from the candidate storage unit 21 (step S7). It then checks whether the candidate taken out has a processing end flag of 1 (step S8).

[0034] If the candidate does not have a processing end flag of 1 in step S8, the language processing unit 20 carries out the following recognition result candidate creation processing for the candidate taken out (step S9).

[0035] If the most advanced node in the node sequence stored in the candidate belongs to the section end node group, the candidate is stored in the candidate storage unit 21 as a new recognition result candidate with its processing end flag set to 1. In this case, the evaluation score may be left unchanged, or a penalty proportional to the difference between the time recorded at the section end node and the end time may be added.

[0036] Otherwise, if the most advanced node does not belong to the section end node group, the recognition result candidates obtained by connecting that node to each node of the intra-section node group that follows it are stored in the candidate storage unit 21. In this case, the intermediate score is the weighted sum of the intermediate score stored in the candidate, the acoustic score of the arc connecting the most advanced node of the stored node sequence to the following intra-section node, and the language score. The evaluation score is the weighted sum of the newly obtained intermediate score and the forward prediction score of the connected node. Each new candidate stored in the candidate storage unit 21 records the processed node sequence, namely the node sequence stored in the original candidate with the connected intra-section node appended, and a processing end flag of 0.

[0037] If the candidate taken out has a processing end flag of 1 in step S8, the language processing unit 20 stores the node sequence of that candidate in the result storage unit 22 as a recognition result (step S10). It then determines whether the number of recognition results stored in the result storage unit 22 has exceeded a predetermined number (step S11).

[0038] If the number of recognition results obtained is sufficient in step S11, or if the candidate storage unit 21 is empty in step S6, the language processing unit 20 outputs the recognition results stored in the result storage unit 22 and ends the processing (step S12). If a sufficient number of recognition results has not been obtained, the language processing unit 20 reports that fact.
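The loop of steps S5 to S12 behaves like a best-first search over the designated section. The following is a simplified sketch under several assumptions, not the patent's implementation: all weights are 1, the language score and the optional time penalties are omitted, the search stops once `n_results` results have been collected, and the function name and data layout are hypothetical. A heap from Python's `heapq` module plays the role of the candidate storage unit.

```python
import heapq
import itertools


def section_n_best(arcs, backward, forward, starts, ends, inside, n_results):
    """Return up to n_results (evaluation_score, units) pairs, best first.

    arcs              : list of (start, end, unit, acoustic_score)
    backward, forward : per-node prediction scores (dicts)
    starts, ends      : section start / end node groups (sets)
    inside            : intra-section node group (set)
    """
    # Only arcs that lead to a node inside the section may be followed.
    outgoing = {}
    for s, e, unit, acoustic in arcs:
        if e in inside:
            outgoing.setdefault(s, []).append((e, unit, acoustic))

    heap = []                # candidate storage; scores negated because heapq is a min-heap
    tie = itertools.count()  # tie-breaker so heap entries never compare lists

    def push(intermediate, nodes, units, done):
        evaluation = intermediate + forward[nodes[-1]]
        heapq.heappush(heap, (-evaluation, next(tie), intermediate, nodes, units, done))

    # S5: seed candidates from every section start node.
    for s in starts:
        for e, unit, acoustic in outgoing.get(s, []):
            push(backward[s] + acoustic, [s, e], [unit], False)

    results = []
    while heap and len(results) < n_results:  # S6 and S11
        neg_eval, _, intermediate, nodes, units, done = heapq.heappop(heap)  # S7
        if done:                              # S8 -> S10
            results.append((-neg_eval, units))
        elif nodes[-1] in ends:               # S9: reached the section end
            push(intermediate, nodes, units, True)
        else:                                 # S9: extend by one arc
            for e, unit, acoustic in outgoing.get(nodes[-1], []):
                push(intermediate + acoustic, nodes + [e], units + [unit], False)
    return results                            # S12


# Hypothetical graph: two competing units between nodes 1 and 2.
arcs = [(0, 1, "a", -1.0), (1, 2, "b", -2.0), (1, 2, "c", -3.0), (2, 3, "d", -1.0)]
backward = {0: 0.0, 1: -1.0, 2: -3.0, 3: -4.0}
forward = {0: -4.0, 1: -3.0, 2: -1.0, 3: 0.0}
results = section_n_best(arcs, backward, forward,
                         starts={1}, ends={2}, inside={1, 2}, n_results=2)
```

Because each candidate's evaluation score adds the forward prediction score of its last node to its intermediate score, the alternatives for the section are ranked by how well they fit into the best surrounding path, and the first result here carries the score of the best complete path.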

[0039] In the speech recognition processing above, the forward and backward prediction scores obtained for the nodes of the word graph in the processing up to step S3 do not depend on the section for which recognition results are sought. Therefore, when the section is changed and recognition results are obtained again, it suffices to start over from step S4.

[0040] Normally one word graph is obtained per utterance, but it is easy to concatenate a plurality of word graphs into a single word graph. In this embodiment, therefore, the speech recognition unit may be configured to create a graph for each of a plurality of successively input speech inputs and to concatenate these graphs into one graph. Specifically, a graph storage unit that stores one or more word graphs is provided, and the speech recognition unit is configured to concatenate the word graphs stored in the graph storage unit. In this case, a plurality of different speech recognition results can be obtained in score order for a time section that spans a plurality of speech inputs. Also, if linguistic information is used in this case, context that spans multiple utterances can be exploited.

【００４１】[0041]

Next, the speech recognition processing in the speech recognition device of this embodiment is described using a concrete example. FIG. 3 shows an example of a word graph created by the speech recognition unit of the speech recognition device of this embodiment. This word graph was obtained from speech in which the sentence 「こちらでは夜はかなり冷え込みます」 ("It gets quite cold here at night") was uttered. The start node is denoted "S", the end node is denoted "E", and the other nodes are numbered in time order. Each arc corresponds to a word. Although omitted from FIG. 3, each arc is also given the acoustic score of its word. The word graph is generated so that all arcs entering the same node carry the same word. The following describes concretely the flow of processing for obtaining a plurality of recognition results for the time section of the arc 「ユーモア」 ("humor"; the arc from node 6 to node 8) of this word graph.

【００４２】[0042]

First, as preprocessing, forward and backward prediction scores are computed for each node of the word graph by dynamic programming. Specifically, in the case of the backward prediction score, for example, the following is computed over the start nodes k of all arcs entering a given node j:

$$S_b(\text{node } j) = \max_{\text{node } k}\bigl(a(\text{arc } kj) + l(W_k, W_j) + S_b(\text{node } k)\bigr)$$

Here, $S_b(\text{node } j)$ is the backward prediction score of node j, $a(\text{arc } kj)$ is the acoustic score of the arc entering node j from node k, $W_k$ is the word of the arcs entering node k (and likewise $W_j$ for node j), and $l(W_k, W_j)$ is the language score for the word pair $W_k$, $W_j$.

【００４３】[0043]

By performing the above computation in node-number order (here, "S" is taken as first and "E" as last), Sb(node k) can be computed for the start node k of every arc entering node j. In this computation, the acoustic score and the language score may each be weighted by multiplying it by a suitable coefficient. The language score takes the form of a table over word pairs, as shown in FIG. 4. A fast search method (for example, binary search) may be used to look up the table quickly. The prediction scores obtained are stored as a per-node table for the graph, as shown in FIG. 5. FIG. 5 is partially abbreviated, but in practice the scores are computed and stored for all nodes of the word graph.
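The dynamic programming pass described above can be sketched as follows. This is an illustration only: the toy graph, every score, the `"<s>"` placeholder word for the start node, and the default language score are invented, and the forward recurrence is assumed to mirror the backward one given in the text.

```python
# Hypothetical toy word graph; every score below is invented.
NODES = ["S", 1, 2, 3, "E"]          # nodes in time order
ARCS = [                             # (start node, end node, word, acoustic score)
    ("S", 1, "a", -10.0),
    ("S", 2, "b", -12.0),
    (1, 3, "c", -8.0),
    (2, 3, "c", -5.0),
    (3, "E", "d", -4.0),
]
BIGRAM = {("b", "c"): -1.0}          # stand-in for the word-pair table of FIG. 4


def language_score(prev_word, word):
    """l(Wk, Wj); unseen pairs fall back to an assumed default of -2.0."""
    return BIGRAM.get((prev_word, word), -2.0)


def prediction_scores(nodes, arcs):
    # The word attached to a node is the word of the arcs entering it.
    word_at = {nodes[0]: "<s>"}      # assumed placeholder for the start node
    for _, end, word, _ in arcs:
        word_at[end] = word

    # Backward score: best score of a path from the graph start to the node.
    sb = {nodes[0]: 0.0}
    for j in nodes[1:]:
        sb[j] = max(a + language_score(word_at[k], w) + sb[k]
                    for k, end, w, a in arcs if end == j)

    # Forward score: best score of a path from the node to the graph end.
    sf = {nodes[-1]: 0.0}
    for k in reversed(nodes[:-1]):
        sf[k] = max(a + language_score(word_at[k], w) + sf[j]
                    for start, j, w, a in arcs if start == k)
    return sb, sf


sb, sf = prediction_scores(NODES, ARCS)
print(sb["E"], sf["S"])  # both -26.0
```

The backward score at the end node and the forward score at the start node agree because both are the score of the single best path through the whole graph. The sketch assumes every node except the first has an incoming arc and every node except the last has an outgoing arc.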

【００４４】[0044]

Next, the section start node group, the section end node group, and the intra-section node group are determined. For simplicity, the error that defines the allowable range of the time section is taken to be 0 here. From the word graph of FIG. 3, node "6", whose time equals the time recorded at the start node of the arc 「ユーモア」, is obtained as the section start node group; nodes "8, 9, 10", whose time equals the time recorded at its end node, are obtained as the section end node group; and nodes "6, 7, 8, 9, 10", whose times lie between the time recorded at the start node and the time recorded at the end node, are obtained as the intra-section node group.
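The selection of the three node groups can be sketched as below. The node times are invented (FIG. 3 is not reproduced here), the function name is hypothetical, and applying the tolerance to the intra-section group as well is an assumption of this sketch.

```python
def section_node_groups(node_time, start_time, end_time, tolerance=0):
    """Return (start group, end group, intra-section group) as sorted node lists."""
    starts = sorted(n for n, t in node_time.items()
                    if abs(t - start_time) <= tolerance)
    ends = sorted(n for n, t in node_time.items()
                  if abs(t - end_time) <= tolerance)
    inside = sorted(n for n, t in node_time.items()
                    if start_time - tolerance <= t <= end_time + tolerance)
    return starts, ends, inside


# Invented frame times, chosen so that nodes 8, 9 and 10 share one time.
NODE_TIME = {5: 40, 6: 52, 7: 60, 8: 75, 9: 75, 10: 75, 11: 90}
groups = section_node_groups(NODE_TIME, start_time=52, end_time=75)
print(groups)  # ([6], [8, 9, 10], [6, 7, 8, 9, 10])
```

With a tolerance of 0 this reproduces the grouping described for the arc from node 6 to node 8; a larger tolerance would admit nodes whose times are merely close to the designated boundaries.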

【００４５】[0045]

Next, for every node in the section start node group, recognition result candidates are created for the nodes of the intra-section node group that follow it, and are stored in the candidate storage unit 21. At this time, the evaluation score of each candidate is computed. Taking one node of the section start node group as node i, the following is computed for each following node j in the intra-section node group:

$$g(\text{candidate } i,j) = S_b(\text{node } i) + a(\text{arc } ij) + l(W_i, W_j)$$

$$S(\text{candidate } i,j) = g(\text{candidate } i,j) + S_f(\text{node } j)$$

Here, $g(\text{candidate } i,j)$ is the partial score of the recognition result candidate, $S(\text{candidate } i,j)$ is its evaluation score, candidate i,j is the recognition result candidate having the node sequence {node i, node j}, and $S_f(\text{node } j)$ is the forward prediction score of node j. From the tables of FIGS. 4 and 5, recognition result candidates such as those shown in FIG. 6 are stored in the candidate storage unit 21.

【００４６】[0046]

In the subsequent processing, since the candidate storage unit 21 is not empty, the recognition result candidate with the best evaluation score is taken out. If its processing end flag is 1, its node sequence is stored in the result storage unit 22 as a recognition result; otherwise, new recognition result candidates are created from the extracted candidate and stored in the candidate storage unit. Specifically, taking as node i the most advanced node in the processed node sequence of the extracted candidate, the following processing is performed.

【００４７】[0047]

If node i belongs to the section end node group, a recognition result candidate with its processing end flag set to 1 is stored in the candidate storage unit 21. If node i does not belong to the section end node group, the following is computed for each following node j in the intra-section node group:

$$g(\text{candidate}+j) = g(\text{candidate}) + a(\text{arc } ij) + l(W_i, W_j)$$

$$S(\text{candidate}+j) = g(\text{candidate}+j) + S_f(\text{node } j)$$

Here, "candidate" is the extracted recognition result candidate, and "candidate + j" is the recognition result candidate whose node sequence is the processed node sequence of the extracted candidate with node j appended.
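The candidate loop of the preceding paragraphs amounts to a best-first search over the nodes inside the section, which might be sketched with a priority queue as follows. The node numbers follow the example in the text, but every score is invented (none is taken from FIGS. 4 to 6), the acoustic and language scores of each arc are pre-summed for brevity, and the names `SB`, `SF`, `STEP`, and `n_best` are hypothetical.

```python
import heapq
from itertools import count

# Hypothetical inputs; none of these numbers come from FIGS. 4 to 6.
SB = {6: -5.0}                                # backward prediction scores
SF = {7: -4.0, 8: -2.0, 9: -2.0, 10: -2.0}    # forward prediction scores
# (i, j) -> a(arc ij) + l(Wi, Wj), already summed for brevity
STEP = {(6, 7): -4.0, (6, 8): -7.0, (6, 9): -9.0, (7, 10): -4.0}


def n_best(starts, ends, inside, max_results):
    heap, tick, results = [], count(), []

    def push(nodes, g, finished):
        score = g + SF[nodes[-1]]             # S = g + Sf(last node)
        # heapq is a min-heap, so the negated score pops the best candidate first;
        # the tick breaks ties so that node tuples are never compared.
        heapq.heappush(heap, (-score, next(tick), finished, nodes, g))

    def expand(nodes, g):
        i = nodes[-1]
        for (k, j), step in STEP.items():
            if k == i and j in inside:
                push(nodes + (j,), g + step, False)

    for i in starts:
        expand((i,), SB[i])                   # g(candidate i,j) starts from Sb(node i)

    while heap and len(results) < max_results:
        neg_score, _, finished, nodes, g = heapq.heappop(heap)
        if finished:                          # processing end flag is 1
            results.append((nodes, -neg_score))
        elif nodes[-1] in ends:               # re-store with the end flag set
            push(nodes, g, True)
        else:
            expand(nodes, g)
    return results


found = n_best({6}, {8, 9, 10}, {6, 7, 8, 9, 10}, max_results=3)
print(found)  # [((6, 8), -14.0), ((6, 7, 10), -15.0), ((6, 9), -16.0)]
```

With these invented numbers the partial candidate (6, 7) is popped first, yet (6, 8) is the first finished result, because the forward score of node 7 may come from a continuation that leaves the section rather than passing through node 10.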

【００４８】[0048]

The above processing is described concretely below, taking as an example the case where the recognition result candidates shown in FIG. 6 are stored.

【００４９】[0049]

First, the recognition result candidate with the highest evaluation score, "candidate 6, 7", is taken out. Since the processing end flag of "candidate 6, 7" is not 1, new recognition result candidates are created from it. However, node 7, the most advanced node in the node sequence recorded in "candidate 6, 7", is not a node of the section end node group, so no candidate with its processing end flag set to 1 is created. In the word graph of FIG. 3, the only node of the intra-section node group that follows node 7 is node 10, so only the new recognition result candidate "candidate 6, 7, 10" is created and stored. As a result, the recognition result candidates shown in FIG. 7 are stored in the candidate storage unit.

【００５０】[0050]

Next, the recognition result candidate with the highest evaluation score among those in FIG. 7, "candidate 6, 8", is taken out. Since the processing end flag of "candidate 6, 8" is not 1, a new recognition result candidate is created from it. In the word graph of FIG. 3, node 8, the most advanced node in the node sequence recorded in "candidate 6, 8", is one of the nodes of the section end node group, so "candidate 6, 8" with its processing end flag set to 1 is created and stored as a new recognition result candidate. In this case, no node of the intra-section node group follows node 8, so no further recognition result candidates are created. As a result, the recognition result candidates shown in FIG. 8 are stored in the candidate storage unit.

【００５１】[0051]

Next, the recognition result candidate with the highest evaluation score among those in FIG. 8, "candidate 6, 8", is taken out. Since the processing end flag of this candidate is 1, "nodes 6, 8" is recorded in the result storage unit as a recognition result.

【００５２】[0052]

Continuing the above processing yields, in order, the recognition results "nodes 6, 8" (score -102), "nodes 6, 7, 10" (score -103), and "nodes 6, 9" (score -105). Matched against the word graph of FIG. 3, these correspond to 「ユーモア」 ("humor"), 「夜は」 ("at night"), and 「融和」 ("harmony"), respectively. A good user interface can be built by, for example, presenting them in this order and having the user select the correct one. Key input, mouse input, or the like can be used as the means by which the user selects and designates the correct recognition result.

【００５３】[0053]

In the speech recognition device of this embodiment described above, the prediction scores of the nodes of the word graph obtained from the input speech are computed in advance. As a result, recognition results for a section that starts and ends at arbitrary nodes of the word graph can be obtained by processing only the nodes within that time section, while still being compared and ranked on the basis of the score of the entire path, from the beginning to the end of the word graph, that contains each partial path. The recognition results therefore reflect not only the score within the section but also the overall score and the context before and after the section. Consequently, different recognition results can be obtained accurately and with little processing for any section of the input speech.
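The statement that candidates are ranked by whole-path scores can be checked by unrolling the recurrences of paragraphs [0045] and [0047]. The index notation $i_0, i_1, \dots, i_n$ is introduced here only for illustration: it denotes the node sequence of a finished candidate, with $i_0$ in the section start node group and $i_n$ in the section end node group.

$$S(\text{candidate}) = S_b(\text{node } i_0) + \sum_{t=1}^{n}\bigl(a(\text{arc } i_{t-1} i_t) + l(W_{i_{t-1}}, W_{i_t})\bigr) + S_f(\text{node } i_n)$$

The first term is the best score from the start of the graph up to the section, the sum covers the arcs inside the section, and the last term is the best score from the section to the end of the graph, so only the middle term has to be recomputed when candidates within a section are compared.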

【００５４】[0054]

In addition, the speech recognition device of this embodiment can obtain different recognition results for an arbitrary section of a single word graph formed by concatenating the word graphs obtained from two or more input speeches, and can also make use of context that spans multiple utterances. In this case too, different recognition results can be obtained accurately and with little processing for an arbitrary section.

【００５５】[0055]

Although the speech recognition device of this embodiment described above stores recognition results in the result storage unit 22, it may instead be configured to output each recognition result as soon as it is obtained.

【００５６】[0056]

[Effects of the Invention] As described above, according to the present invention, when there is an error in a partial word string or partial character string in the recognition result obtained for the entire input speech, a plurality of recognition results can be generated for that part, and the user can make the correction by selecting whichever of these results is correct. A correct recognition result can therefore be obtained efficiently and with little processing.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the speech recognition device of the present invention.

FIG. 2 is a flowchart showing the overall flow of operation of the speech recognition device shown in FIG. 1.

FIG. 3 is a diagram showing an example of a word graph.

FIG. 4 is a diagram showing an example of language scores.

FIG. 5 is a diagram showing an example of prediction scores.

FIG. 6 is a diagram showing an example of recognition result candidates stored in the candidate storage unit.

FIG. 7 is a diagram showing an example of recognition result candidates stored in the candidate storage unit.

FIG. 8 is a diagram showing an example of recognition result candidates stored in the candidate storage unit.

FIG. 9 is a block diagram showing the schematic configuration of the speech recognition device disclosed in Japanese Patent Application Laid-Open No. 9-281989.

FIG. 10 is a flowchart showing the specific flow of phoneme recognition processing in the speech recognition device shown in FIG. 9.

[Explanation of symbols]

10 speech recognition unit
11 standard pattern storage unit
20 language processing unit
21 candidate storage unit
22 result storage unit
23 section designation unit
24 prediction score calculation unit
30 language information storage unit

Claims

[Claims]

1. A speech recognition device comprising: speech recognition means for recognizing input speech in linguistic units and generating a graph in which word strings are expressed by arcs corresponding to the linguistic units; section designation means for designating an arbitrary time section; and language processing means for generating a plurality of recognition results for the arbitrary time section designated by the section designation means in the graph generated by the speech recognition means.

2. The speech recognition device according to claim 1, wherein the speech recognition means generates a graph in which each arc is given at least an acoustic score indicating a measure of acoustic likelihood, and wherein the language processing means takes, as recognition result candidates, the word strings of all arcs present in the arbitrary time section designated by the section designation means in the graph generated by the speech recognition means, obtains for each of the recognition result candidates an evaluation score based at least on the acoustic score, and obtains recognition results in order starting from the recognition result candidate with the best evaluation score.

3. The speech recognition device according to claim 2, wherein the language processing means is configured to obtain the evaluation score for each recognition result candidate by using the acoustic score of the entire path, from the start to the end of the whole graph, that includes the recognition result candidate.

4. The speech recognition device according to claim 2, further comprising: language information storage means in which predetermined language information is stored in advance; and prediction score calculation means for obtaining prediction scores by determining, for each node located at a boundary between arcs in the graph, the optimal score of the path from the start of the graph to the node and the optimal score of the path from the node to the end of the graph, based on the language information stored in the language information storage means and the acoustic score given to each arc of the graph generated by the speech recognition means, wherein the language processing means calculates the evaluation score based on the acoustic score given to each arc, the language information stored in the language information storage means, and the prediction scores calculated by the prediction score calculation means.

5. The speech recognition device according to claim 1, wherein the language processing means sets a predetermined allowable range for each of the start time and the end time of the arbitrary time section designated by the section designation means, and obtains a plurality of recognition results for the section for which the allowable ranges are set.

6. The speech recognition device according to claim 1, wherein the section designation means is configured to designate a partial word string in the first-ranked recognition result obtained for the entire input speech.

7. The speech recognition device according to claim 1, wherein the section designation means is configured to designate a partial character string in the first-ranked recognition result obtained for the entire input speech.

8. The speech recognition device according to claim 1, wherein the speech recognition means is configured to create a graph for each of a plurality of input speeches that are input in succession and to connect these graphs to create a single graph.

9. A speech recognition method comprising: a speech recognition step of recognizing input speech in linguistic units and generating a graph in which word strings are expressed by arcs corresponding to the linguistic units; and a language processing step of generating a plurality of recognition results for an arbitrarily designated time section.

10. The speech recognition method according to claim 9, wherein the speech recognition step is a step of generating a graph in which each arc is given at least an acoustic score indicating a measure of acoustic likelihood, and wherein the language processing step takes, as recognition result candidates, the word strings of all arcs present in the arbitrarily designated time section in the graph generated in the speech recognition step, obtains for each of the recognition result candidates an evaluation score based at least on the acoustic score, and obtains recognition results in order starting from the recognition result candidate with the best evaluation score.

11. The speech recognition method according to claim 10, wherein the evaluation score in the language processing step is obtained, for each recognition result candidate, by using the acoustic score of the entire path, from the start to the end of the whole graph, that includes the recognition result candidate.

12. The speech recognition method according to claim 10, further comprising a prediction score calculation step of obtaining prediction scores by determining, for each node located at a boundary between arcs in the graph, the optimal score of the path from the start of the graph to the node and the optimal score of the path from the node to the end of the graph, based on the acoustic score given to each arc of the graph generated in the speech recognition step and on predetermined language information prepared in advance, wherein the evaluation score in the language processing step is calculated based on the acoustic score given to each arc, the language information, and the prediction scores calculated in the prediction score calculation step.

13. The speech recognition method according to claim 9, wherein the language processing step includes a process of setting a predetermined allowable range for each of the start time and the end time of the arbitrarily designated time section, and a process of obtaining a plurality of recognition results.

14. The speech recognition method according to claim 9, further comprising, as a step of designating the arbitrary time section, a step of designating a partial word string in the first-ranked recognition result obtained for the entire input speech.

15. The speech recognition method according to claim 9, further comprising, as a step of designating the arbitrary time section, a step of designating a partial character string in the first-ranked recognition result obtained for the entire input speech.

16. The speech recognition method according to claim 9, wherein, in the speech recognition step, a graph is created for each of a plurality of input speeches that are input in succession, and these graphs are connected to create a single graph.