JP2006171096A - Continuous input speech recognition device and continuous input speech recognizing method

Info

Publication number
JP2006171096A
Authority
JP
Japan
Prior art keywords
recognition
speech
input speech
model
symbol
Prior art date
Legal status
Pending
Application number
JP2004360013A
Other languages
Japanese (ja)
Inventor
Zhipeng Zhang (張志鵬)
Tomoyuki Oya (大矢智之)
Sadaoki Furui (古井貞煕)
Current Assignee
NTT Docomo Inc
Tokyo Institute of Technology NUC
Original Assignee
NTT Docomo Inc
Tokyo Institute of Technology NUC
Priority date: 2004-12-13
Filing date: 2004-12-13
Publication date: 2006-06-29
Application filed by NTT Docomo Inc and Tokyo Institute of Technology NUC
Priority to JP2004360013A
Publication of JP2006171096A
Legal status: Pending

Abstract

PROBLEM TO BE SOLVED: To eliminate the need for speech segmentation at the speech input stage for continuous input speech to be recognized, and to improve the recognition rate.

SOLUTION: The continuous input speech recognition device of the present invention recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes. The device comprises: the language model, in which a plurality of sentences are connected by pause symbols; the acoustic model, which includes phoneme features corresponding to the pause symbols; the dictionary, which includes elements corresponding to the pause symbols; a first recognition means that detects the positions of the pause symbols in the input speech; a determination means that determines an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and a second recognition means that performs speech recognition on the determined utterance segment.

Description

The present invention relates to a continuous input speech recognition device and a continuous input speech recognition method that do not require speech segmentation at the speech input stage for continuous input speech to be recognized.

In general, speech recognition uses an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes (words) and bridges the acoustic model and the language model, in order to recognize input speech.

Conventionally, a language model holds a large number of example sentences from the recognition target domain in parallel, and the input speech is compared against these parallel sentences. The continuous input speech to be recognized therefore has to be divided into sentence units, that is, speech segmentation has to be performed, manually or automatically, at the speech input stage. This speech segmentation determines sentence breaks by detecting silent intervals from the power of the input speech signal or the like, and speech recognition is then performed on the detected speech intervals.
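
For illustration, a power-based segmentation of the kind described above can be sketched as follows; the frame length, hop size, and threshold are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def power_based_segmentation(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Conventional VAD sketch: frames whose normalized log power exceeds a
    fixed threshold are speech; runs of low-power frames are sentence breaks.
    All parameter values are illustrative assumptions."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    log_power = np.array([10.0 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])
    log_power -= log_power.max()          # 0 dB = loudest frame
    is_speech = log_power > threshold_db  # breaks down when noise raises the floor
    segments, start = [], None
    for i, speech in enumerate(is_speech):
        if speech and start is None:
            start = i                     # speech run begins
        elif not speech and start is not None:
            segments.append((start, i))   # speech run ends at a silent frame
            start = None
    if start is not None:
        segments.append((start, len(is_speech)))
    return segments                       # list of (start_frame, end_frame)
```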

On the other hand, for acoustic models, a technique has been disclosed that improves recognition performance by simultaneously considering the type of background noise and the SNR (Signal to Noise Ratio) (see, for example, Non-Patent Document 1).
Zhang, Sugimura, Furui: "Noise adaptation by tree-structured clustering considering both noise type and SNR", Acoustical Society of Japan Autumn Meeting, 2003

As described above, since sentence breaks have conventionally been determined by detecting silent intervals from the power of the input speech signal or the like, interval detection becomes difficult in noisy environments and recognition performance degrades significantly. In particular, the conventional technique cannot cope with environments in which the noise fluctuates, that is, where the noise type and the SNR vary.

The present invention has been proposed in view of the above conventional problems, and its object is to provide a continuous input speech recognition device and a continuous input speech recognition method that do not require speech segmentation at the speech input stage for continuous input speech to be recognized and that achieve a high recognition rate.

In order to solve the above problems, the present invention, as set forth in claim 1, provides a device that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the device comprising: the language model, in which a plurality of sentences are connected by pause symbols; the acoustic model, which includes phoneme features corresponding to the pause symbols; the dictionary, which includes elements corresponding to the pause symbols; first recognition means for detecting the positions of the pause symbols in the input speech; determination means for determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and second recognition means for performing speech recognition on the determined utterance segment.

As set forth in claim 2, when no pause symbol is detected, speech recognition may be performed by the second recognition means on the input speech of a predetermined length.

As set forth in claim 3, a tree-structured noise-superimposed speech model that simultaneously considers the type of background noise and the SNR may be used as the acoustic model, and speech recognition may be performed by selecting the optimal model from the tree-structured noise-superimposed speech model.

As set forth in claim 4, the parameters of the tree-structured noise-superimposed speech model may be linearly transformed so as to maximize the likelihood using the recognition result of the first recognition means, and speech recognition by the second recognition means may be performed using the tree-structured noise-superimposed speech model after the linear transformation.

As set forth in claim 5, the processing of the first recognition means, the determination means, and the second recognition means may be applied repeatedly to continuously arriving input speech.

As set forth in claim 6, the invention may also be configured as a continuous input speech recognition method that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the method comprising, in a prior learning stage: a step of creating the language model as a plurality of sentences connected by pause symbols; a step of creating the acoustic model so as to include phoneme features corresponding to the pause symbols; and a step of creating the dictionary so as to include elements corresponding to the pause symbols; and, in the recognition stage for the input speech: a first recognition step of detecting the positions of the pause symbols in the input speech; a determination step of determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and a second recognition step of performing recognition on the determined utterance segment.

In the continuous input speech recognition device and continuous input speech recognition method of the present invention, the language model consists of a plurality of sentences connected by pause symbols, the acoustic model and the dictionary also reflect the pause symbols, and a pause symbol is recognized as a sentence break when the input speech is recognized. Therefore, no speech segmentation is required at the speech input stage for the continuous input speech to be recognized, and the recognition rate can be greatly increased without being affected by noise.

Preferred embodiments of the present invention will be described below with reference to the drawings.

FIG. 1 is a flowchart showing the processing of a continuous input speech recognition method according to an embodiment of the present invention.

In FIG. 1, in the prior learning stage, first, a language model holding utterance content is created (step S1). Specifically, as shown in FIG. 2, starting from a language model training text T1 in which a plurality of example sentences are delimited by a sentence-start symbol Start and a sentence-end symbol End and written in parallel, the Start and End symbols are rewritten to a pause symbol LP, producing a language model training text T2 in which the sentences are connected by pause symbols, and the language model is created from this text.
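
For illustration, the rewriting from T1 to T2 described above can be sketched as follows; the concrete spellings of the Start, End, and LP symbols are assumptions, since only the symbol names appear in the text.

```python
def connect_sentences_with_pause(t1_lines, start_sym="<Start>", end_sym="<End>",
                                 pause_sym="<LP>"):
    """Rewrite a parallel corpus T1 (one 'Start ... End' sentence per line)
    into a single pause-connected training text T2, as in step S1.
    The concrete token spellings are illustrative assumptions."""
    sentences = []
    for line in t1_lines:
        # Drop the sentence-start and sentence-end delimiters of T1.
        tokens = [t for t in line.split() if t not in (start_sym, end_sym)]
        sentences.append(" ".join(tokens))
    # T2: every sentence boundary becomes an explicit pause symbol, so the
    # language model learns LP in sentence-boundary contexts.
    return f" {pause_sym} ".join(sentences)

t1 = ["<Start> hello world <End>", "<Start> good morning <End>"]
print(connect_sentences_with_pause(t1))
# hello world <LP> good morning
```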

Next, returning to FIG. 1, a dictionary holding correspondences between phonemes and morphemes (words) is created (step S2). Here, the dictionary is created so as to include an element corresponding to the pause symbol described above.

Next, an acoustic model holding phoneme features is created (step S3). Here, the acoustic model is created so as to include phoneme features corresponding to the pause symbol described above. In this embodiment, a tree-structured noise-superimposed speech model that simultaneously considers the type of background noise and the SNR is used as the acoustic model. As shown in FIG. 3, the tree-structured noise-superimposed speech model consists of a plurality of nodes N1 to Nk branching in a tree structure under a root node N0. Each node holds phoneme features corresponding to a noise type and various SNRs, and individually functions as an acoustic model. An upper node has features obtained by averaging the features of the plural nodes below it. Therefore, by selecting an appropriate node, recognition that is well adapted to the noise type and various SNRs can be performed.
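
For illustration, maximum-likelihood selection of a node from such a tree can be sketched as follows; the greedy top-down descent and the log_likelihood scoring callback are assumptions, as the text does not fix a search procedure.

```python
from dataclasses import dataclass, field

@dataclass
class NoiseNode:
    """One node of the tree-structured noise-superimposed speech model.
    'model' stands in for the node's acoustic model (e.g., a set of HMMs);
    upper nodes average the characteristics of their children."""
    name: str
    model: object
    children: list = field(default_factory=list)

def select_best_node(root, log_likelihood):
    """Descend from the root, at each level moving to the child whose model
    scores the observed features with the highest likelihood; stop when no
    child improves on the current node. 'log_likelihood(model)' is an
    assumed callback scoring the input features against a node's model."""
    best = root
    best_score = log_likelihood(root.model)
    while best.children:
        child = max(best.children, key=lambda n: log_likelihood(n.model))
        child_score = log_likelihood(child.model)
        if child_score <= best_score:
            break  # the averaged upper node already fits better
        best, best_score = child, child_score
    return best
```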

Next, returning to FIG. 1, in the recognition stage, first, for input speech of a fixed length (an appropriate value set according to the environment), a process of selecting the optimal model (node) from the tree-structured noise-superimposed speech model so as to maximize the likelihood and a first recognition process of detecting the positions of pause symbols in the input speech are performed (step S4).

Next, an utterance segment is determined by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol (step S5). When no pause symbol is detected, the entire fixed-length input speech being processed is used as the utterance segment as it is.
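
For illustration, the segment determination of step S5 can be sketched as follows, representing detected pause symbols as frame indices (an assumption made for readability):

```python
def determine_utterance_segments(n_frames, pause_positions):
    """Step S5 sketch: split a fixed-length chunk of n_frames into utterance
    segments at the detected pause-symbol positions (frame indices). Each
    segment runs from the chunk start or the previous pause up to the next
    pause; with no detected pause, the whole chunk is one segment. Treating
    trailing speech after the last pause as a segment is an assumption."""
    if not pause_positions:
        return [(0, n_frames)]
    segments, prev = [], 0
    for pos in sorted(pause_positions):
        if pos > prev:                # skip empty segments at adjacent pauses
            segments.append((prev, pos))
        prev = pos
    if prev < n_frames:
        segments.append((prev, n_frames))
    return segments

print(determine_utterance_segments(1000, [320, 700]))
# [(0, 320), (320, 700), (700, 1000)]
```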

Next, using the recognition result of the first recognition process, a linear transformation is applied to the parameters of the tree-structured noise-superimposed speech model so as to maximize the likelihood (step S6). This linear transformation can be omitted, although the recognition rate then drops slightly.
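
For illustration, a strongly simplified form of such a likelihood-maximizing linear transformation can be sketched as follows; the per-dimension least-squares estimate and the single regression class are assumptions, whereas full MLLR-style adaptation estimates a block transform via the EM auxiliary function.

```python
import numpy as np

def estimate_diag_mllr(frames, aligned_means):
    """Per-dimension affine transform mu' = a*mu + b fitted by least squares
    between the Gaussian means aligned by the first pass and the observed
    frames. This diagonal, single-class form is a simplifying assumption."""
    frames = np.asarray(frames)          # (T, D) observed feature frames
    means = np.asarray(aligned_means)    # (T, D) mean of the state aligned to each frame
    a = np.empty(frames.shape[1])
    b = np.empty(frames.shape[1])
    for d in range(frames.shape[1]):
        # Simple linear regression of observation on model mean, per dimension.
        a[d], b[d] = np.polyfit(means[:, d], frames[:, d], deg=1)
    return a, b

def adapt_means(model_means, a, b):
    """Shift every Gaussian mean of the selected tree node toward the
    observed acoustic conditions: mu' = a*mu + b (elementwise)."""
    return np.asarray(model_means) * a + b
```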

Next, a second recognition process is performed on the determined utterance segment using the linearly transformed tree-structured noise-superimposed speech model (step S7).

These processes are then applied repeatedly to the input speech.
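
For illustration, the repeated application of steps S4 to S7 can be pictured as the following loop over fixed-length chunks, gluing together the hypothetical helpers sketched above; the stream and recognizer callbacks are likewise assumptions.

```python
def recognize_continuous(stream, chunk_frames, tree_root, first_pass, second_pass,
                         log_likelihood):
    """Steps S4-S7 applied repeatedly to continuously arriving input speech.
    'stream.chunks', 'first_pass', 'second_pass', and 'log_likelihood' are
    assumed callbacks; the helpers come from the sketches above."""
    results = []
    for chunk in stream.chunks(chunk_frames):   # fixed-length feature chunks (T, D)
        # Step S4: pick the best tree node and detect pause-symbol positions.
        node = select_best_node(tree_root, lambda m: log_likelihood(m, chunk))
        pauses, aligned_means = first_pass(node.model, chunk)
        # Step S5: cut the chunk into utterance segments at the pauses.
        segments = determine_utterance_segments(len(chunk), pauses)
        # Step S6: likelihood-maximizing linear transform of the node's means.
        a, b = estimate_diag_mllr(chunk, aligned_means)
        adapted = adapt_means(node.model, a, b)  # node.model stands in for its means
        # Step S7: second recognition pass per segment with the adapted model.
        for start, end in segments:
            results.append(second_pass(adapted, chunk[start:end]))
    return results
```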

In this way, since the language model consists of a plurality of sentences connected by pause symbols, the acoustic model and the dictionary also reflect the pause symbols, and a pause symbol is judged to be a sentence break when the input speech is recognized, no speech segmentation is required at the speech input stage for the continuous input speech to be recognized. In addition, since sentences can be properly cut out as utterance segments without being affected by noise, the recognition rate can be greatly increased.

Furthermore, since a tree-structured noise-superimposed speech model is used as the acoustic model, the optimal model is selected, and a linear transformation is applied using the first recognition result, the recognition rate can be improved further even for input speech containing various kinds of noise.

Next, FIG. 4 shows the speech recognition algorithms: (a) shows the continuous input speech recognition of the present invention, and (b) shows conventional speech recognition for comparison.

In FIG. 4(a), for input speech IN of a fixed length, the optimal model (node) is selected from the acoustic model (tree-structured noise-superimposed speech model) M1 (step S11), and a first speech recognition is performed using the selected acoustic model M1, the language model M2, and the dictionary D (step S12).

Next, utterance segments are determined from the positions of the pause symbols in the input speech IN detected by the first speech recognition (step S13), linear transformation adaptation of the acoustic model M1 is performed (step S14), and a second speech recognition is performed on the determined utterance segments using the linearly transformed acoustic model M1 together with the language model M2 and the dictionary D (step S15), to obtain the result OUT.

In FIG. 4(b), conventionally, interval detection (speech segmentation) is performed on input speech IN of a fixed length by signal processing based on the power of the input speech signal or the like (step S21), and speech recognition is performed on the detected interval to obtain the result OUT (step S22). When background noise contained in the input speech IN degrades the accuracy of the interval detection, the recognition rate drops directly; in the present invention, such a noise-induced drop in the recognition rate can be prevented without performing speech segmentation.

Next, FIG. 5 shows a configuration example of the continuous input speech recognition device of the present invention.

In FIG. 5, the continuous input speech recognition system comprises: a model/dictionary storage unit 1 that holds the language model, the dictionary, and the acoustic model (tree-structured noise-superimposed speech model); a feature extraction unit 2 that analyzes the input speech IN and converts it into feature vectors; a model selection unit 3 that selects the optimal model (node) from the acoustic models in the model/dictionary storage unit 1 so as to maximize the likelihood according to the feature vectors output by the feature extraction unit 2; a speech recognition unit 4 that performs the first speech recognition on the input speech IN converted into time-series feature-vector data; an utterance segment determination unit 5 that determines utterance segments based on the pause symbols detected by the speech recognition unit 4; a model linear transformation adaptation unit 6 that linearly transforms the acoustic model so as to maximize the likelihood based on the recognition result of the speech recognition unit 4; a speech recognition unit 7 that performs the second speech recognition on the determined utterance segments of the input speech IN using the linearly transformed acoustic model; and a recognition result storage unit 8 that stores the recognition results. The speech recognition unit 4 and the speech recognition unit 7 are shown separately because their processing phases differ, but they are the same program (engine).

Next, FIG. 6 shows an example of experimental recognition-rate results, examined using noisy speech as the recognition target in order to confirm the effect of the present invention. The acoustic model used in the experiment was a speaker-independent, context-dependent phoneme HMM (Hidden Markov Model) with state sharing by tree-based clustering. The features were 25 dimensions in total: 12-dimensional MFCCs (Mel Frequency Cepstrum Coefficients), their 12-dimensional first derivatives, and the first derivative of the log power. FIG. 6 shows the word accuracy when the correct utterance segments are specified (target value) and the word accuracy according to the embodiment of the present invention (present invention). The results show that the embodiment of the present invention achieves the same result as when the correct utterance segments are specified.
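
For illustration, this 25-dimensional feature layout can be sketched with librosa as follows; the sampling rate, frame length, and hop size are assumptions, as the text does not state them.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000, n_fft=400, hop=160):
    """25-dim features as in the experiment: 12 MFCCs + their deltas (12)
    + the delta of log power (1). Frame/hop sizes are illustrative."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=n_fft,
                                hop_length=hop)           # (12, T)
    d_mfcc = librosa.feature.delta(mfcc)                  # (12, T)
    # Frame log power and its first derivative (1, T).
    frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
    log_power = np.log(np.mean(frames ** 2, axis=0) + 1e-12)[np.newaxis, :]
    d_log_power = librosa.feature.delta(log_power)
    T = min(mfcc.shape[1], d_log_power.shape[1])          # align frame counts
    return np.vstack([mfcc[:, :T], d_mfcc[:, :T], d_log_power[:, :T]]).T  # (T, 25)
```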

The present invention has been described above through its preferred embodiments. Although the present invention has been described with reference to specific examples, it is obvious that various modifications and changes can be made to these examples without departing from the broad spirit and scope of the present invention as defined in the claims. In other words, the present invention should not be construed as being limited by the details of the specific examples or by the accompanying drawings.

FIG. 1 is a flowchart showing the processing of a continuous input speech recognition method according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of language model creation.
FIG. 3 is a diagram showing a structural example of the tree-structured noise-superimposed speech model.
FIG. 4 is a diagram showing speech recognition algorithms.
FIG. 5 is a diagram showing a configuration example of the continuous input speech recognition device of the present invention.
FIG. 6 is a diagram showing an example of experimental recognition-rate results.

Explanation of symbols

T1, T2: Language model training texts
IN: Input speech
OUT: Result
M1: Acoustic model
M2: Language model
D: Dictionary
1: Model/dictionary storage unit
2: Feature extraction unit
3: Model selection unit
4: Speech recognition unit (first recognition)
5: Utterance segment determination unit
6: Model linear transformation adaptation unit
7: Speech recognition unit (second recognition)
8: Recognition result storage unit

Claims (6)

1. A continuous input speech recognition device that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the device comprising:
the language model, in which a plurality of sentences are connected by pause symbols;
the acoustic model, which includes phoneme features corresponding to the pause symbols;
the dictionary, which includes elements corresponding to the pause symbols;
first recognition means for detecting the positions of the pause symbols in the input speech;
determination means for determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and
second recognition means for performing speech recognition on the determined utterance segment.

2. The continuous input speech recognition device according to claim 1, wherein, when no pause symbol is detected, speech recognition is performed by the second recognition means on the input speech of a predetermined length.

3. The continuous input speech recognition device according to claim 1 or 2, wherein a tree-structured noise-superimposed speech model that simultaneously considers the type of background noise and the SNR is used as the acoustic model, and speech recognition is performed by selecting the optimal model from the tree-structured noise-superimposed speech model.

4. The continuous input speech recognition device according to claim 3, wherein the parameters of the tree-structured noise-superimposed speech model are linearly transformed so as to maximize the likelihood using the recognition result of the first recognition means, and speech recognition by the second recognition means is performed using the tree-structured noise-superimposed speech model after the linear transformation.

5. The continuous input speech recognition device according to any one of claims 1 to 4, wherein the processing of the first recognition means, the determination means, and the second recognition means is applied repeatedly to continuously arriving input speech.

6. A continuous input speech recognition method that recognizes input speech using an acoustic model that holds phoneme features, a language model that holds utterance content, and a dictionary that holds correspondences between phonemes and morphemes, the method comprising:
in a prior learning stage,
a step of creating the language model as a plurality of sentences connected by pause symbols;
a step of creating the acoustic model so as to include phoneme features corresponding to the pause symbols; and
a step of creating the dictionary so as to include elements corresponding to the pause symbols; and
in a recognition stage for the input speech,
a first recognition step of detecting the positions of the pause symbols in the input speech;
a determination step of determining an utterance segment by treating as one sentence the input speech that precedes a detected pause symbol and follows either the beginning of the input speech or the previous pause symbol; and
a second recognition step of performing recognition on the determined utterance segment.
JP2004360013A 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method Pending JP2006171096A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2004360013A JP2006171096A (en) 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2004360013A JP2006171096A (en) 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method

Publications (1)

Publication Number Publication Date
JP2006171096A true JP2006171096A (en) 2006-06-29

Family

ID=36671965

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2004360013A Pending JP2006171096A (en) 2004-12-13 2004-12-13 Continuous input speech recognition device and continuous input speech recognizing method

Country Status (1)

Country Link
JP (1) JP2006171096A (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07261782A (en) * 1994-03-22 1995-10-13 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Sound recognition device
JPH09114484A (en) * 1995-10-24 1997-05-02 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice recognition device
JPH10254479A (en) * 1997-03-12 1998-09-25 Mitsubishi Electric Corp Speech recognition device
JPH117292A (en) * 1997-06-16 1999-01-12 Nec Corp Speech recognition device
JPH11237894A (en) * 1998-02-19 1999-08-31 Nippon Telegr & Teleph Corp <Ntt> Method and device for comprehending language
JP2000259176A (en) * 1999-03-08 2000-09-22 Nippon Hoso Kyokai <Nhk> Voice recognition device and its recording medium
JP2001092488A (en) * 1999-09-17 2001-04-06 Atr Interpreting Telecommunications Res Lab Statistical language model creating device and speech recognition device
JP2004117624A (en) * 2002-09-25 2004-04-15 Ntt Docomo Inc Noise adaptation system of voice model, noise adaptation method, and noise adaptation program of voice recognition
JP2004126790A (en) * 2002-09-30 2004-04-22 P To Pa:Kk Notification system of usage amount, notification control method and program for usage amount
JP2004152063A (en) * 2002-10-31 2004-05-27 Nec Corp Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
JP2004184951A (en) * 2002-12-06 2004-07-02 Nippon Telegr & Teleph Corp <Ntt> Method, device, and program for class identification model, and method, device, and program for class identification
JP2004184716A (en) * 2002-12-04 2004-07-02 Nissan Motor Co Ltd Speech recognition apparatus
JP2004198832A (en) * 2002-12-19 2004-07-15 Nissan Motor Co Ltd Speech recognition device
JP2004279466A (en) * 2003-03-12 2004-10-07 Ntt Docomo Inc System and method for noise adaptation for speech model, and speech recognition noise adaptation program



Legal Events

Date        Code  Title
2007-09-28  A621  Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2010-07-13  A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2010-09-10  A521  Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
2010-11-16  A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2011-01-13  A521  Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
2011-03-29  A02   Decision of refusal (JAPANESE INTERMEDIATE CODE: A02)