JP2006171096A - Continuous input speech recognition device and continuous input speech recognizing method

Info

Publication number
JP2006171096A
Authority
JP
Japan
Prior art keywords
recognition
speech
input speech
model
symbol
Prior art date
Legal status
Pending
Application number
JP2004360013A
Other languages
Japanese (ja)
Inventor
Zhipeng Zhang (張志鵬)
Tomoyuki Oya (大矢智之)
Sadaoki Furui (古井貞煕)
Current Assignee
NTT Docomo Inc
Tokyo Institute of Technology NUC
Original Assignee
NTT Docomo Inc
Tokyo Institute of Technology NUC
Priority date: 2004-12-13
Filing date: 2004-12-13
Publication date: 2006-06-29
Application filed by NTT Docomo Inc and Tokyo Institute of Technology NUC
Priority to JP2004360013A
Publication of JP2006171096A
Legal status: Pending

Abstract

PROBLEM TO BE SOLVED: To eliminate the need for speech segmentation at the speech input stage for continuous input speech to be recognized, and to improve the recognition rate.

SOLUTION: The continuous input speech recognition device of the present invention recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes. The device comprises: the language model, in which a plurality of sentences are connected by pause symbols; the acoustic model, which includes phoneme features corresponding to the pause symbols; the dictionary, which includes elements corresponding to the pause symbols; a first recognition means that detects the positions of the pause symbols in the input speech; a determination means that determines an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and a second recognition means that performs speech recognition on the determined utterance segment.

Description

The present invention relates to a continuous input speech recognition device and a continuous input speech recognition method that do not require speech segmentation at the speech input stage for continuous input speech to be recognized.

In general, speech recognition uses an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes (words) and bridges the acoustic model and the language model, in order to recognize input speech.

Conventionally, a language model holds a large number of example sentences from the recognition target domain in parallel, and the input speech is compared against these parallel sentences. The continuous input speech to be recognized therefore has to be divided into sentence units, that is, speech segmentation has to be performed, manually or automatically, at the speech input stage. This speech segmentation determines sentence breaks by detecting silent intervals from the power of the input speech signal or the like, and speech recognition is then performed on the detected speech intervals.
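
For illustration, a power-based segmentation of the kind described above can be sketched as follows; the frame length, hop size, and threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def power_based_segmentation(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Conventional VAD sketch: frames whose normalized log power exceeds a
    fixed threshold are speech; runs of low-power frames are sentence breaks.
    All parameter values are illustrative assumptions."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    log_power = np.array([10.0 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])
    log_power -= log_power.max()          # 0 dB = loudest frame
    is_speech = log_power > threshold_db  # breaks down when noise raises the floor
    segments, start = [], None
    for i, speech in enumerate(is_speech):
        if speech and start is None:
            start = i                     # speech run begins
        elif not speech and start is not None:
            segments.append((start, i))   # speech run ends at a silent frame
            start = None
    if start is not None:
        segments.append((start, len(is_speech)))
    return segments                       # list of (start_frame, end_frame)
```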

On the other hand, for acoustic models, a technique has been disclosed that improves recognition performance by simultaneously considering the type of background noise and the SNR (Signal to Noise Ratio) (see, for example, Non-Patent Document 1).
Zhang, Sugimura, Furui: "Noise adaptation by tree-structured clustering considering both noise type and SNR", Acoustical Society of Japan Autumn Meeting, 2003

As described above, since sentence breaks have conventionally been determined by detecting silent intervals from the power of the input speech signal or the like, interval detection becomes difficult in noisy environments and recognition performance degrades significantly. In particular, the conventional technique cannot cope with environments in which the noise fluctuates, that is, where the noise type and the SNR vary.

The present invention has been proposed in view of the above conventional problems, and its object is to provide a continuous input speech recognition device and a continuous input speech recognition method that do not require speech segmentation at the speech input stage for continuous input speech to be recognized and that achieve a high recognition rate.

In order to solve the above problems, the present invention, as set forth in claim 1, provides a device that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the device comprising: the language model, in which a plurality of sentences are connected by pause symbols; the acoustic model, which includes phoneme features corresponding to the pause symbols; the dictionary, which includes elements corresponding to the pause symbols; first recognition means for detecting the positions of the pause symbols in the input speech; determination means for determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and second recognition means for performing speech recognition on the determined utterance segment.

As set forth in claim 2, when no pause symbol is detected, speech recognition may be performed by the second recognition means on the input speech of a predetermined length.

As set forth in claim 3, a tree-structured noise-superimposed speech model that simultaneously considers the type of background noise and the SNR may be used as the acoustic model, and speech recognition may be performed by selecting the optimal model from the tree-structured noise-superimposed speech model.

As set forth in claim 4, the parameters of the tree-structured noise-superimposed speech model may be linearly transformed so as to maximize the likelihood using the recognition result of the first recognition means, and speech recognition by the second recognition means may be performed using the tree-structured noise-superimposed speech model after the linear transformation.

As set forth in claim 5, the processing of the first recognition means, the determination means, and the second recognition means may be applied repeatedly to continuously arriving input speech.

As set forth in claim 6, the invention may also be configured as a continuous input speech recognition method that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the method comprising, in a prior learning stage: a step of creating the language model as a plurality of sentences connected by pause symbols; a step of creating the acoustic model so as to include phoneme features corresponding to the pause symbols; and a step of creating the dictionary so as to include elements corresponding to the pause symbols; and, in the recognition stage for the input speech: a first recognition step of detecting the positions of the pause symbols in the input speech; a determination step of determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and a second recognition step of performing recognition on the determined utterance segment.

In the continuous input speech recognition device and continuous input speech recognition method of the present invention, the language model consists of a plurality of sentences connected by pause symbols, the acoustic model and the dictionary also reflect the pause symbols, and a pause symbol is recognized as a sentence break when the input speech is recognized. Therefore, no speech segmentation is required at the speech input stage for the continuous input speech to be recognized, and the recognition rate can be greatly increased without being affected by noise.

Preferred embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a flowchart showing the processing of a continuous input speech recognition method according to an embodiment of the present invention.

In FIG. 1, in the prior learning stage, first, a language model holding utterance content is created (step S1). Specifically, as shown in FIG. 2, starting from a language model training text T1 in which a plurality of example sentences are delimited by a sentence-start symbol Start and a sentence-end symbol End and written in parallel, the Start and End symbols are rewritten to a pause symbol LP, producing a language model training text T2 in which the sentences are connected by pause symbols, and the language model is created from this text.
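
For illustration, the rewriting from T1 to T2 described above can be sketched as follows; the concrete spellings of the Start, End, and LP symbols are assumptions, since only the symbol names appear in the text.

```python
def connect_sentences_with_pause(t1_lines, start_sym="<Start>", end_sym="<End>",
                                 pause_sym="<LP>"):
    """Rewrite a parallel corpus T1 (one 'Start ... End' sentence per line)
    into a single pause-connected training text T2, as in step S1.
    The concrete token spellings are illustrative assumptions."""
    sentences = []
    for line in t1_lines:
        # Drop the sentence-start and sentence-end delimiters of T1.
        tokens = [t for t in line.split() if t not in (start_sym, end_sym)]
        sentences.append(" ".join(tokens))
    # T2: every sentence boundary becomes an explicit pause symbol, so the
    # language model learns LP in sentence-boundary contexts.
    return f" {pause_sym} ".join(sentences)

t1 = ["<Start> hello world <End>", "<Start> good morning <End>"]
print(connect_sentences_with_pause(t1))
# hello world <LP> good morning
```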

Next, returning to FIG. 1, a dictionary holding correspondences between phonemes and morphemes (words) is created (step S2). Here, the dictionary is created so as to include an element corresponding to the pause symbol described above.

Next, an acoustic model holding phoneme features is created (step S3). Here, the acoustic model is created so as to include phoneme features corresponding to the pause symbol described above. In this embodiment, a tree-structured noise-superimposed speech model that simultaneously considers the type of background noise and the SNR is used as the acoustic model. As shown in FIG. 3, the tree-structured noise-superimposed speech model consists of a plurality of nodes N1 to Nk branching in a tree structure under a root node N0. Each node holds phoneme features corresponding to a noise type and various SNRs, and individually functions as an acoustic model. An upper node has features obtained by averaging the features of the plural nodes below it. Therefore, by selecting an appropriate node, recognition that is well adapted to the noise type and various SNRs can be performed.
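
For illustration, maximum-likelihood selection of a node from such a tree can be sketched as follows; the greedy top-down descent and the log_likelihood scoring callback are assumptions, as the text does not fix a search procedure.

```python
from dataclasses import dataclass, field

@dataclass
class NoiseNode:
    """One node of the tree-structured noise-superimposed speech model.
    'model' stands in for the node's acoustic model (e.g., a set of HMMs);
    upper nodes average the characteristics of their children."""
    name: str
    model: object
    children: list = field(default_factory=list)

def select_best_node(root, log_likelihood):
    """Descend from the root, at each level moving to the child whose model
    scores the observed features with the highest likelihood; stop when no
    child improves on the current node. 'log_likelihood(model)' is an
    assumed callback scoring the input features against a node's model."""
    best = root
    best_score = log_likelihood(root.model)
    while best.children:
        child = max(best.children, key=lambda n: log_likelihood(n.model))
        child_score = log_likelihood(child.model)
        if child_score <= best_score:
            break  # the averaged upper node already fits better
        best, best_score = child, child_score
    return best
```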

Next, returning to FIG. 1, in the recognition stage, first, for input speech of a fixed length (an appropriate value set according to the environment), a process of selecting the optimal model (node) from the tree-structured noise-superimposed speech model so as to maximize the likelihood and a first recognition process of detecting the positions of pause symbols in the input speech are performed (step S4).

Next, an utterance segment is determined by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol (step S5). When no pause symbol is detected, the entire fixed-length input speech being processed is used as the utterance segment as it is.
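
For illustration, the segment determination of step S5 can be sketched as follows, representing detected pause symbols as frame indices (an assumption made for readability):

```python
def determine_utterance_segments(n_frames, pause_positions):
    """Step S5 sketch: split a fixed-length chunk of n_frames into utterance
    segments at the detected pause-symbol positions (frame indices). Each
    segment runs from the chunk start or the previous pause up to the next
    pause; with no detected pause, the whole chunk is one segment. Treating
    trailing speech after the last pause as a segment is an assumption."""
    if not pause_positions:
        return [(0, n_frames)]
    segments, prev = [], 0
    for pos in sorted(pause_positions):
        if pos > prev:                # skip empty segments at adjacent pauses
            segments.append((prev, pos))
        prev = pos
    if prev < n_frames:
        segments.append((prev, n_frames))
    return segments

print(determine_utterance_segments(1000, [320, 700]))
# [(0, 320), (320, 700), (700, 1000)]
```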

Next, using the recognition result of the first recognition process, a linear transformation is applied to the parameters of the tree-structured noise-superimposed speech model so as to maximize the likelihood (step S6). This linear transformation can be omitted, although the recognition rate then drops slightly.
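
For illustration, a strongly simplified form of such a likelihood-maximizing linear transformation can be sketched as follows; the per-dimension least-squares estimate and the single regression class are assumptions, whereas full MLLR-style adaptation estimates a block transform via the EM auxiliary function.

```python
import numpy as np

def estimate_diag_mllr(frames, aligned_means):
    """Per-dimension affine transform mu' = a*mu + b fitted by least squares
    between the Gaussian means aligned by the first pass and the observed
    frames. This diagonal, single-class form is a simplifying assumption."""
    frames = np.asarray(frames)          # (T, D) observed feature frames
    means = np.asarray(aligned_means)    # (T, D) mean of the state aligned to each frame
    a = np.empty(frames.shape[1])
    b = np.empty(frames.shape[1])
    for d in range(frames.shape[1]):
        # Simple linear regression of observation on model mean, per dimension.
        a[d], b[d] = np.polyfit(means[:, d], frames[:, d], deg=1)
    return a, b

def adapt_means(model_means, a, b):
    """Shift every Gaussian mean of the selected tree node toward the
    observed acoustic conditions: mu' = a*mu + b (elementwise)."""
    return np.asarray(model_means) * a + b
```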

Next, a second recognition process is performed on the determined utterance segment using the linearly transformed tree-structured noise-superimposed speech model (step S7).

These processes are then applied repeatedly to the input speech.
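
For illustration, the repeated application of steps S4 to S7 can be pictured as the following loop over fixed-length chunks, gluing together the hypothetical helpers sketched above; the stream and recognizer callbacks are likewise assumptions.

```python
def recognize_continuous(stream, chunk_frames, tree_root, first_pass, second_pass,
                         log_likelihood):
    """Steps S4-S7 applied repeatedly to continuously arriving input speech.
    'stream.chunks', 'first_pass', 'second_pass', and 'log_likelihood' are
    assumed callbacks; the helpers come from the sketches above."""
    results = []
    for chunk in stream.chunks(chunk_frames):   # fixed-length feature chunks (T, D)
        # Step S4: pick the best tree node and detect pause-symbol positions.
        node = select_best_node(tree_root, lambda m: log_likelihood(m, chunk))
        pauses, aligned_means = first_pass(node.model, chunk)
        # Step S5: cut the chunk into utterance segments at the pauses.
        segments = determine_utterance_segments(len(chunk), pauses)
        # Step S6: likelihood-maximizing linear transform of the node's means.
        a, b = estimate_diag_mllr(chunk, aligned_means)
        adapted = adapt_means(node.model, a, b)  # node.model stands in for its means
        # Step S7: second recognition pass per segment with the adapted model.
        for start, end in segments:
            results.append(second_pass(adapted, chunk[start:end]))
    return results
```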

In this way, since the language model consists of a plurality of sentences connected by pause symbols, the acoustic model and the dictionary also reflect the pause symbols, and a pause symbol is judged to be a sentence break when the input speech is recognized, no speech segmentation is required at the speech input stage for the continuous input speech to be recognized. In addition, since sentences can be properly cut out as utterance segments without being affected by noise, the recognition rate can be greatly increased.

Furthermore, since a tree-structured noise-superimposed speech model is used as the acoustic model, the optimal model is selected, and a linear transformation is applied using the first recognition result, the recognition rate can be improved further even for input speech containing various kinds of noise.

Next, FIG. 4 shows the speech recognition algorithms: (a) shows the continuous input speech recognition of the present invention, and (b) shows conventional speech recognition for comparison.

In FIG. 4(a), for input speech IN of a fixed length, the optimal model (node) is selected from the acoustic model (tree-structured noise-superimposed speech model) M1 (step S11), and a first speech recognition is performed using the selected acoustic model M1, the language model M2, and the dictionary D (step S12).

Next, utterance segments are determined from the positions of the pause symbols in the input speech IN detected by the first speech recognition (step S13), linear transformation adaptation of the acoustic model M1 is performed (step S14), and a second speech recognition is performed on the determined utterance segments using the linearly transformed acoustic model M1 together with the language model M2 and the dictionary D (step S15), to obtain the result OUT.

In FIG. 4(b), conventionally, interval detection (speech segmentation) is performed on input speech IN of a fixed length by signal processing based on the power of the input speech signal or the like (step S21), and speech recognition is performed on the detected interval to obtain the result OUT (step S22). When background noise contained in the input speech IN degrades the accuracy of the interval detection, the recognition rate drops directly; in the present invention, such a noise-induced drop in the recognition rate can be prevented without performing speech segmentation.

Next, FIG. 5 shows a configuration example of the continuous input speech recognition device of the present invention.

In FIG. 5, the continuous input speech recognition system comprises: a model/dictionary storage unit 1 that holds the language model, the dictionary, and the acoustic model (tree-structured noise-superimposed speech model); a feature extraction unit 2 that analyzes the input speech IN and converts it into feature vectors; a model selection unit 3 that selects the optimal model (node) from the acoustic models in the model/dictionary storage unit 1 so as to maximize the likelihood according to the feature vectors output by the feature extraction unit 2; a speech recognition unit 4 that performs the first speech recognition on the input speech IN converted into time-series feature-vector data; an utterance segment determination unit 5 that determines utterance segments based on the pause symbols detected by the speech recognition unit 4; a model linear transformation adaptation unit 6 that linearly transforms the acoustic model so as to maximize the likelihood based on the recognition result of the speech recognition unit 4; a speech recognition unit 7 that performs the second speech recognition on the determined utterance segments of the input speech IN using the linearly transformed acoustic model; and a recognition result storage unit 8 that stores the recognition results. The speech recognition unit 4 and the speech recognition unit 7 are shown separately because their processing phases differ, but they are the same program (engine).

Next, FIG. 6 shows an example of experimental recognition-rate results, examined using noisy speech as the recognition target in order to confirm the effect of the present invention. The acoustic model used in the experiment was a speaker-independent, context-dependent phoneme HMM (Hidden Markov Model) with state sharing by tree-based clustering. The features were 25 dimensions in total: 12-dimensional MFCCs (Mel Frequency Cepstrum Coefficients), their 12-dimensional first derivatives, and the first derivative of the log power. FIG. 6 shows the word accuracy when the correct utterance segments are specified (target value) and the word accuracy according to the embodiment of the present invention (present invention). The results show that the embodiment of the present invention achieves the same result as when the correct utterance segments are specified.
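
For illustration, this 25-dimensional feature layout can be sketched with librosa as follows; the sampling rate, frame length, and hop size are assumptions, as the text does not state them.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000, n_fft=400, hop=160):
    """25-dim features as in the experiment: 12 MFCCs + their deltas (12)
    + the delta of log power (1). Frame/hop sizes are illustrative."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=n_fft,
                                hop_length=hop)           # (12, T)
    d_mfcc = librosa.feature.delta(mfcc)                  # (12, T)
    # Frame log power and its first derivative (1, T).
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_power = np.log(np.mean(frames ** 2, axis=0) + 1e-12)[np.newaxis, :]
    d_log_power = librosa.feature.delta(log_power)
    T = min(mfcc.shape[1], d_log_power.shape[1])          # align frame counts
    return np.vstack([mfcc[:, :T], d_mfcc[:, :T], d_log_power[:, :T]]).T  # (T, 25)
```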

The present invention has been described above through its preferred embodiments. Although the present invention has been described with reference to specific examples, it is obvious that various modifications and changes can be made to these examples without departing from the broad spirit and scope of the present invention as defined in the claims. In other words, the present invention should not be construed as being limited by the details of the specific examples or by the accompanying drawings.

FIG. 1 is a flowchart showing the processing of a continuous input speech recognition method according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of language model creation.
FIG. 3 is a diagram showing a structural example of the tree-structured noise-superimposed speech model.
FIG. 4 is a diagram showing speech recognition algorithms.
FIG. 5 is a diagram showing a configuration example of the continuous input speech recognition device of the present invention.
FIG. 6 is a diagram showing an example of experimental recognition-rate results.

Explanation of symbols

T1, T2: Language model training texts
IN: Input speech
OUT: Result
M1: Acoustic model
M2: Language model
D: Dictionary
1: Model/dictionary storage unit
2: Feature extraction unit
3: Model selection unit
4: Speech recognition unit (first recognition)
5: Utterance segment determination unit
6: Model linear transformation adaptation unit
7: Speech recognition unit (second recognition)
8: Recognition result storage unit

Claims (6)

1. A continuous input speech recognition device that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the device comprising:
the language model, in which a plurality of sentences are connected by pause symbols;
the acoustic model, which includes phoneme features corresponding to the pause symbols;
the dictionary, which includes elements corresponding to the pause symbols;
first recognition means for detecting the positions of the pause symbols in the input speech;
determination means for determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and
second recognition means for performing speech recognition on the determined utterance segment.

2. The continuous input speech recognition device according to claim 1, wherein, when no pause symbol is detected, speech recognition is performed by the second recognition means on the input speech of a predetermined length.

3. The continuous input speech recognition device according to claim 1 or 2, wherein a tree-structured noise-superimposed speech model that simultaneously considers the type of background noise and the SNR is used as the acoustic model, and speech recognition is performed by selecting the optimal model from the tree-structured noise-superimposed speech model.

4. The continuous input speech recognition device according to claim 3, wherein the parameters of the tree-structured noise-superimposed speech model are linearly transformed so as to maximize the likelihood using the recognition result of the first recognition means, and speech recognition by the second recognition means is performed using the tree-structured noise-superimposed speech model after the linear transformation.

5. The continuous input speech recognition device according to any one of claims 1 to 4, wherein the processing of the first recognition means, the determination means, and the second recognition means is applied repeatedly to continuously arriving input speech.

6. A continuous input speech recognition method that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the method comprising:
in a prior learning stage,
a step of creating the language model as a plurality of sentences connected by pause symbols;
a step of creating the acoustic model so as to include phoneme features corresponding to the pause symbols; and
a step of creating the dictionary so as to include elements corresponding to the pause symbols; and
in a recognition stage for the input speech,
a first recognition step of detecting the positions of the pause symbols in the input speech;
a determination step of determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and
a second recognition step of performing recognition on the determined utterance segment.
JP2004360013A 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method Pending JP2006171096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2004360013A JP2006171096A (en) 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2004360013A JP2006171096A (en) 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method

Publications (1)

Publication Number Publication Date
JP2006171096A true JP2006171096A (en) 2006-06-29

Family

ID=36671965

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2004360013A Pending JP2006171096A (en) 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method

Country Status (1)

Country Link
JP (1) JP2006171096A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07261782A (en) * 1994-03-22 1995-10-13 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Sound recognition device
JPH09114484A (en) * 1995-10-24 1997-05-02 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice recognition device
JPH10254479A (en) * 1997-03-12 1998-09-25 Mitsubishi Electric Corp Speech recognition device
JPH117292A (en) * 1997-06-16 1999-01-12 Nec Corp Speech recognition device
JPH11237894A (en) * 1998-02-19 1999-08-31 Nippon Telegr & Teleph Corp <Ntt> Method and device for comprehending language
JP2000259176A (en) * 1999-03-08 2000-09-22 Nippon Hoso Kyokai <Nhk> Voice recognition device and its recording medium
JP2001092488A (en) * 1999-09-17 2001-04-06 Atr Interpreting Telecommunications Res Lab Statistical language model creating device and speech recognition device
JP2004117624A (en) * 2002-09-25 2004-04-15 Ntt Docomo Inc Noise adaptation system of voice model, noise adaptation method, and noise adaptation program of voice recognition
JP2004126790A (en) * 2002-09-30 2004-04-22 P To Pa:Kk Notification system of usage amount, notification control method and program for usage amount
JP2004152063A (en) * 2002-10-31 2004-05-27 Nec Corp Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
JP2004184951A (en) * 2002-12-06 2004-07-02 Nippon Telegr & Teleph Corp <Ntt> Method, device, and program for class identification model, and method, device, and program for class identification
JP2004184716A (en) * 2002-12-04 2004-07-02 Nissan Motor Co Ltd Speech recognition apparatus
JP2004198832A (en) * 2002-12-19 2004-07-15 Nissan Motor Co Ltd Speech recognition device
JP2004279466A (en) * 2003-03-12 2004-10-07 Ntt Docomo Inc System and method for noise adaptation for speech model, and speech recognition noise adaptation program



Legal Events

Date        Code  Title
2007-09-28  A621  Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2010-07-13  A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2010-09-10  A521  Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
2010-11-16  A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2011-01-13  A521  Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
2011-03-29  A02   Decision of refusal (JAPANESE INTERMEDIATE CODE: A02)