JP2007240589A - Speech recognition reliability estimating device, and method and program therefor - Google Patents

Speech recognition reliability estimating device, and method and program therefor

Info

Publication number
JP2007240589A
Authority
JP
Japan
Prior art keywords
word
utterance
speech recognition
information
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2006059216A
Other languages
Japanese (ja)
Other versions
JP4769098B2 (en)
Inventor
Yuichi Nakazawa
裕一 中澤
Katsutoshi Ofu
克年 大附
Hirokazu Masataki
浩和 政瀧
Shinji Tamoto
真詞 田本
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2006059216A priority Critical patent/JP4769098B2/en
Publication of JP2007240589A publication Critical patent/JP2007240589A/en
Application granted granted Critical
Publication of JP4769098B2 publication Critical patent/JP4769098B2/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

PROBLEM TO BE SOLVED: To obtain the reliability of a recognition result over a long section such as an utterance or a sentence, rather than over a short section such as a word or a syllable.
SOLUTION: In this speech recognition reliability estimation device, a speech recognition unit 6 divides the input digital speech signal into word sequences 50 by utterance and attaches to each word its part-of-speech information 52, acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62. An information conversion unit 20 converts the per-word values 54, 55, 56, 58, 60, and 62 into an utterance feature vector whose elements are per-utterance statistics such as means and variances together with binary values from classification by the part-of-speech information 52, and a reliability assignment unit 22 obtains the reliability from the recognition rate estimated using this vector and an identification model obtained in advance by training.
COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech recognition reliability estimation device, and a method and program therefor, for estimating and outputting the reliability of a recognition result obtained by speech recognition processing of an input speech signal.

In speech recognition, a likelihood is generally computed between the sequence of acoustic feature vectors obtained by analyzing the input speech signal and an acoustic model that models speech, and the candidate with the highest likelihood is output as the recognition result, subject to linguistic constraints such as the vocabulary to be recognized and a language model representing rules and the ease of connection between words. However, ambiguity in the utterance, or the input of noise and acoustic signals other than speech, raises the chance of an erroneous recognition result. Furthermore, when the input speech is a word not registered in the dictionary, a correct recognition result cannot be output.

Against these problems, attaching a reliability to the speech recognition result makes it possible to accept or reject the result, or to ask for confirmation, according to how high the reliability is, so that problems caused by recognition errors can be avoided in the various devices that use speech recognition results. For example, misrecognition can be prevented from triggering behavior the user did not intend.
In Non-Patent Document 1 and Non-Patent Document 2, reliability is computed using word posterior probabilities. In this approach, the reliability of each word in a sentence is computed from the word's acoustic likelihood score, its language likelihood score, and the forward and backward probabilities.

Non-Patent Document 1 also computes reliability using N-best candidates. Recognition result candidates are generated up to rank N using the acoustic likelihood score, the language likelihood score, and so on, and reliability is computed from the generated candidates: a word that appears in many of the candidates is considered highly reliable.
In Patent Document 1, reliability is computed on the basis of linguistic validity. Discriminative training is applied to the validity of the ordering of the word sequence in the speech recognition result, and each word is judged correct or incorrect.

In Patent Document 2, reliability is determined using competing models: the model used for the speech recognition result and a competing model. A likelihood is obtained from each model, a likelihood ratio is computed from the two likelihoods, and it is attached as the reliability of the recognition result.
In Non-Patent Document 3, reliability is determined using multiple speech recognition models. Speech recognition is performed with two or more models, and the parts common to the outputs of all models are judged reliable.
Patent Document 1: JP 2005-275348 A
Patent Document 2: JP H11-85188 A
Non-Patent Document 1: Frank Wessel, Ralf Schlüter, Klaus Macherey, Hermann Ney: "Confidence Measures for Large Vocabulary Continuous Speech Recognition", IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 3, pp. 288-298, 2001
Non-Patent Document 2: Thomas Kemp, Thomas Schaaf: "Estimating confidence using word lattices", Proc. 5th Eurospeech, pp. 827-830, 1997
Non-Patent Document 3: Takehito Utsuro, Hiromitsu Nishizaki, Yasuhiro Kodama, Seiichi Nakagawa: "Estimation of highly reliable portions using the common parts of the outputs of multiple large-vocabulary continuous speech recognition models", IEICE Transactions D-II, Vol. J86-D-II, No. 7, pp. 974-987, 2003

In practical applications of speech recognition technology, what is wanted is often a judgment of whether each utterance or sentence was recognized with high accuracy, rather than the correctness of short sections such as words or syllables. Existing techniques, however, compute reliability only for short units such as words, and so have had difficulty meeting this practical demand.

According to the present invention, the input digital speech signal is divided into utterance units; acoustic feature parameters are extracted from the digital speech signal of each divided utterance unit; for these acoustic feature parameters, under given linguistic constraints, scores are computed based on the probabilities output by probability models each expressing the features of a category of a linguistic unit; at least the category expressed by the model with the highest score is recognized; a word sequence is generated for each utterance, with information based on this recognition attached to each word in the sequence; for each utterance unit, the per-word information based on the recognition contained in its word sequence is converted into utterance feature vector information of the utterance unit; a recognition rate is estimated using this utterance feature vector information and an identification model; and, based on the estimated recognition rate, the reliability of the utterance speech recognition result from which the information of the utterance unit was derived is obtained.

With this configuration, information from the word sequence of a comparatively long section, the utterance unit, is used when computing the reliability, so global information is available, and highly accurate reliability can be output at the level of the utterances and sentences in which the target speech is used in daily life.

Embodiment 1
FIG. 1 shows a first embodiment of the present invention. The speech recognition unit 6 comprises an acoustic analysis unit 8 and a recognition search unit 9.
When a digitally converted speech signal is input at the input terminal 2, it is first stored in the storage unit 4, and the stored digital speech signal is divided into utterance-unit speech signals by the utterance division unit 5. In this division, for example, a stretch of the input speech signal bounded by silent intervals lasting longer than a predetermined value is taken as one utterance. The start of the first utterance signal and the end of the last utterance signal may be known in advance from the target input digital speech signal; in such cases the first and last utterance signals are not bounded on both sides by silent intervals exceeding the predetermined value, but naturally they too are easily detected and segmented as single utterances. Examples of utterance units are shown below, followed by a sketch of this silence-based division.

(1) "I think there will be quite a profit in that area."
(2) "I see."
(3) "Shopping tours to Korea and that sort of thing are very popular right now, but..."
(4) "Hmm."
The speech signal of each utterance unit divided in this way is input to the speech recognition unit 6. There, the input digital signal is recognized using the acoustic model stored in the acoustic model storage unit 10 and the dictionary and language model stored in the dictionary/language model storage unit 12.

For each utterance unit, the speech recognition unit 6 outputs the word sequence 50 of the speech recognition result, with information based on the recognition attached to each of its words. This information comprises: the part-of-speech information 52 of each word in the utterance unit (for example, conjunction, noun, adverb); the word's acoustic likelihood score 54, obtained with an HMM (hidden Markov model); its language likelihood score 55, obtained with a word n-gram; the word likelihood score 56, the sum of the acoustic likelihood score 54 and the language likelihood score 55; the word duration 58, computed from the word- and phoneme-level start and end times given by the temporal alignment of the input speech with the recognition result; the word's phoneme count 60; and the phoneme duration 62, the average duration per phoneme. Concrete generation and computation methods are given below. Division into utterance units is performed on the word sequence of the recognition result based on the length of the silent interval between words; alternatively, the part-of-speech information 52 may be used to split at arbitrarily chosen parts of speech.
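For concreteness, the per-word information listed above can be pictured as a record such as the following (a minimal sketch; the field names are hypothetical and not taken from the patent):

    from dataclasses import dataclass

    @dataclass
    class WordInfo:
        surface: str             # recognized word from the word sequence 50
        pos: str                 # part-of-speech information 52
        acoustic_score: float    # acoustic likelihood score 54 (from the HMM)
        language_score: float    # language likelihood score 55 (from the word n-gram)
        word_score: float        # word likelihood score 56 = acoustic_score + language_score
        duration: float          # word duration 58, from alignment start/end times
        n_phonemes: int          # phoneme count 60
        phoneme_duration: float  # phoneme duration 62: average duration per phoneme

An utterance is then simply a list of such records, one per word of the word sequence 50.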

The information conversion unit 20 converts the information based on the speech recognition result attached to each word by the speech recognition unit 6 into utterance-unit information, and outputs it. The utterance-unit information may be, for example, an utterance feature vector, and the following description treats it as such. The information converted into this vector uses all or part of the information generated by the speech recognition unit 6 for each word of the word sequence 50: the part-of-speech information 52, acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62.

The reliability assignment unit 22 obtains the reliability using the utterance feature vector 64 output from the information conversion unit 20 and the identification model stored in the identification model storage unit 29; the details are described below.
The output unit 26 outputs the reliability of the utterance's recognition result. The reliability alone may be output, or it may be attached to the utterance speech recognition result and output with it.
FIG. 2 shows the details of the recognition search unit 9 of FIG. 1 and the parts related to it. The recognition search unit 9 comprises an acoustic likelihood score calculation unit 90, a language likelihood score calculation unit 92, a word likelihood score calculation unit 96, a phoneme counting unit 100, a word duration calculation unit 102, a phoneme duration calculation unit 104, a part-of-speech information assignment unit 105, and a word information assignment unit 106.

The utterance-unit digital speech signal input to the speech recognition unit 6 is first converted into acoustic feature parameters by the acoustic analysis unit 8. The acoustic feature parameters are LPC cepstrum, MFCC, or other parameters obtained by analyzing the input speech signal in units called frames of a few tens of milliseconds.
For these acoustic feature parameters, the acoustic likelihood score calculation unit 90 consults the acoustic model stored in the acoustic model storage unit 10 and searches for multiple phoneme sequence candidates. For these candidates, the language likelihood score calculation unit 92 and the word likelihood score calculation unit 96 consult the dictionary and language model stored in the dictionary/language model storage unit 12 and search for multiple word sequence candidates. That is, for the input acoustic feature parameters, under the given linguistic constraints, scores are computed based on the probabilities output by probability models each expressing the features of a category of a linguistic unit, and the category expressed by the model with the highest score is taken as the recognition result.
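As an illustration of the acoustic analysis step, MFCC parameters can be computed over frames of a few tens of milliseconds, here with the librosa library (an assumed toolkit chosen for illustration; the patent names LPC cepstrum, MFCC, and other parameters but specifies no software):

    import librosa

    # Load one utterance-unit signal and compute 13 MFCCs over 25 ms frames with
    # a 10 ms shift; each column of `mfcc` is one acoustic feature vector.
    y, sr = librosa.load("utterance.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))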

In this speech recognition, for each word the following are obtained: the acoustic likelihood score 54; the language likelihood score 55; the word likelihood score 56, which is their sum; the word duration 58 from the word duration calculation unit 102; the phoneme count 60 from the phoneme counting unit 100; the phoneme duration 62 from the phoneme duration calculation unit 104; and the part-of-speech information 52 from the part-of-speech information assignment unit 105.
For each utterance unit, the N-best candidates are selected, for example the top N by total word likelihood score 56. For each of the N word sequences 50 of the utterance, the word information assignment unit 106 attaches to each word its part-of-speech information 52, acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62, and the result is output from the speech recognition unit 6.

Let A1, A2, ..., Ax denote the words that make up one utterance unit and carry the information based on the recognition described above, and let Am (m = 1, ..., x) be an arbitrary word among them. If the values of word Am's acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, phoneme duration 62, and part-of-speech information 52 are written am, bm, cm, dm, em, fm, and gm respectively, then the information based on the recognition is stored for each word Am in the word sequence with information storage unit 31, for example as shown in FIG. 3.

The information conversion unit 20 computes, for the words A1, A2, ..., Ax in one utterance, statistics of the acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62, for example the mean, variance, maximum, and minimum of each. First, the acoustic likelihood scores a1, a2, ..., ax of the words A1, A2, ..., Ax are all input to the acoustic likelihood score mean unit 201, which computes their mean P; to the acoustic likelihood score variance unit 202, which computes their variance Q; to the acoustic likelihood score maximum unit 203, which finds their maximum R; and to the acoustic likelihood score minimum unit 204, which finds their minimum S. The statistics are not limited to these; only some of them may be used, or none. The case where a statistic is not used is described below.

The computed mean P, variance Q, maximum R, and minimum S are input to the acoustic likelihood score mean normalization unit 205, variance normalization unit 206, maximum normalization unit 207, and minimum normalization unit 208 respectively, which normalize them to values Pa, Qa, Ra, Sa in the range 0 to 1.
The remaining information, that is, the language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62, is processed in the same way to obtain the normalized mean, variance, maximum, and minimum of each: the statistics Pb, Qb, Rb, Sb of the language likelihood score 55; Pc, Qc, Rc, Sc of the word likelihood score 56; Pd, Qd, Rd, Sd of the word duration 58; Pe, Qe, Re, Se of the phoneme count 60; and Pf, Qf, Rf, Sf of the phoneme duration 62. With each normalized value as one element, the synthesis unit 260 composes an utterance feature vector, in this case of 24 elements. It is not necessary to use all 24 elements; any one or more of them may be used, and statistics that are not used need not be computed.
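A minimal sketch of this conversion, assuming the WordInfo records above (the per-element min-max bounds lo and hi estimated from training data are an assumption; the patent states only that each statistic is normalized to the range 0 to 1):

    import numpy as np

    FEATURES = ["acoustic_score", "language_score", "word_score",
                "duration", "n_phonemes", "phoneme_duration"]

    def statistics_vector(words, lo, hi):
        """6 features x (mean, variance, max, min) = 24 elements, each min-max
        normalized to [0, 1]; lo and hi are 24-element arrays of per-element
        bounds estimated on training data."""
        elems = []
        for name in FEATURES:
            v = np.array([getattr(w, name) for w in words], dtype=float)
            elems.extend([v.mean(), v.var(), v.max(), v.min()])
        e = np.array(elems)
        return np.clip((e - lo) / (hi - lo + 1e-12), 0.0, 1.0)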

Furthermore, classifying words into word classes, each representing a set of words by a single symbol, with the classification unit 21 of FIGS. 1 and 4 allows the subsequent reliability assignment unit 22 to obtain more accurate reliability. As an example of word classes each represented by one symbol (for instance, when a symbol is defined by whether a word begins with a given kana), words beginning with 「あ」 might form word class a, words beginning with 「い」 word class b, words beginning with 「わ」 word class w, and words beginning with anything else word class x. Taking the utterance 「私はあなたを愛しています。」 ("I love you.") as an example, if the word sequence is segmented as 「私」「は」「あなた」「を」「愛し」「て」「い」「ます」, then 「あなた」 and 「愛し」 belong to word class a, 「い」 belongs to word class b, 「私」 belongs to word class w, and 「は」「を」「て」「ます」 belong to word class x.

Using such word classes, for each word class it is checked whether at least one of the words in the utterance belongs to it: word class a is set to "1" if any word belongs to it and "0" otherwise, and likewise for the other classes. That is, if there are n word classes, the output is an n-element vector whose elements are each "0" or "1".
As an example of clustering words into classes each represented by one symbol, the part-of-speech information 52 allows efficient word classification. For example, four word classes are set in advance for four parts of speech: conjunction class a, noun class b, case particle class c, and continuative class d. If at least one word in the utterance has a given part of speech, "1" is output for it; otherwise "0". For example, if the utterance unit of the input speech is 「しかし今日私は走る」 ("But today I run"), the segmented word sequence is 「しかし」, 「今日」, 「私」, 「は」, 「走る」. 「しかし」 belongs to conjunction class a; 「今日」 and 「私」 belong to noun class b; 「は」 belongs to case particle class c; no word belongs to continuative class d; and 「走る」 belongs to no class. The word class vector for this input is therefore (1, 1, 1, 0).
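The binary class indicators can be sketched as follows (the class names are hypothetical and a small m = 4 inventory is used here; the patent's 37-class inventory is not enumerated):

    POS_CLASSES = ["conjunction", "noun", "case_particle", "continuative"]  # m = 4 here

    def pos_class_vector(words):
        """One 0/1 element per class: 1 if any word in the utterance has that POS."""
        present = {w.pos for w in words}
        return [1 if c in present else 0 for c in POS_CLASSES]

For the example utterance above this yields (1, 1, 1, 0).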

As described above, when classifying into word classes such as words beginning with 「あ」, the word class sequence information generation unit 108, indicated by the broken line in FIG. 2, generates and stores, for each of the top-N utterance candidates, information g'i indicating the word class to which each word in its word sequence belongs, and outputs this to the word information assignment unit 106. The class determination unit of FIG. 3 (the m-th part-of-speech class determination unit 250m in FIG. 3) then, just as in the part-of-speech class case, outputs "1" for each word class if at least one word in the utterance word sequence belongs to that class, and "0" if none does.

Returning to FIG. 3: m arbitrary parts of speech are set, giving m classes, the first part-of-speech class 2491, the second part-of-speech class 2492, ..., the m-th part-of-speech class 249m, where m is an integer of 1 or more. Using the part-of-speech values g1, g2, ..., gx of the words A1, A2, ..., Ax, the classification unit 21 determines which part-of-speech class each word belongs to and assigns the words A1, A2, ..., Ax to the corresponding classes 2491, 2492, ..., 249m. For the part-of-speech information 52, for each chosen part of speech, "1" is output if the utterance unit contains that part of speech and "0" if it does not: as a result of the classification, the j-th part-of-speech class determination unit 250j (j = 1, ..., m) outputs "1" if a word of the j-th class is present and "0" otherwise, and the synthesis unit 260 uses each of these outputs as one element of the utterance feature vector.

FIG. 4 shows a concrete example of an utterance feature vector that uses the part-of-speech information 52, with m arbitrarily chosen parts of speech as the word classes represented by single symbols, together with all of the following elements: the mean, variance, maximum, and minimum of each word's acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62 within one utterance unit.
Instead of this full utterance feature vector, a vector consisting of only some of its elements may also be used.

Setting the number of part-of-speech types m to 37 makes it possible to output highly accurate reliability. As shown in FIG. 4, when all of the within-utterance statistics of the acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62 (mean, variance, maximum, minimum) and the values output from the word classes described above are used, a 61-dimensional utterance feature vector (6 × 4 + 37 = 61) is synthesized and output by the synthesis unit 260 for each utterance unit.
When the speech recognition unit 6 produces N-best candidates, an utterance feature vector is computed for each of the N candidates.

Returning to FIG. 1: the utterance feature vector is input from the information conversion unit 20 to the reliability assignment unit 22, which evaluates it using the identification model stored in the identification model storage unit 29 and outputs a reliability. To this end, a large number of utterance feature vectors are created in advance from training speech as described above, and for each training vector it is learned whether the recognition rate of the speech recognition result from which the vector was obtained is at least n% (0 ≤ n ≤ 100); this yields an identification model for recognition rate n%, which is stored in the identification model storage unit 29. Such models are usually created over the range 0 ≤ n ≤ 100 at whatever density is required. For example, if reliability is needed at 10% intervals, eleven identification models, each able to judge whether the rate is at least n% for n = 0%, 10%, 20%, 30%, ..., 90%, 100%, are created in advance and stored in the identification model storage unit 29.

To obtain the reliability of an utterance feature vector under evaluation with these identification models, the vector is first evaluated with the n = 0% model to judge whether the recognition rate is at least 0%. If so, it is evaluated with the n = 10% model to judge whether the rate is at least 10%, and so on. If, repeating this process, the evaluation with the n = 80% model judges that the recognition rate is not at least 80%, then the recognition rate of the utterance speech recognition result underlying the evaluated vector is judged to be at least 70% and at most 80%. This judgment is taken as the reliability of the utterance speech recognition result on which the utterance feature vector is based.
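This cascade can be sketched as follows (a sketch under the assumption that models[n] exposes a hypothetical is_at_least(vector) method answering whether the recognition rate is at least n%, as in the eleven-model, 10%-interval example):

    def estimate_reliability(vector, models, step=10):
        """Walk the threshold models upward; return the bracketing interval."""
        passed = None
        for n in range(0, 101, step):
            if not models[n].is_at_least(vector):   # rate judged below n%
                break
            passed = n
        if passed is None:
            return (0, 0)                # failed even the 0% model (degenerate case)
        if passed == 100:
            return (100, 100)
        return (passed, passed + step)   # e.g. (70, 80): at least 70%, at most 80%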

If it is only necessary to judge whether the utterance recognition rate is at least 70%, a single identification model with n = 70% may be created and stored in the identification model storage unit 29.
In this way the utterance feature vector is used to estimate the reliability of the recognition rate of the utterance speech recognition result underlying it.
With an utterance feature vector of very high dimensionality as described above, a very large amount of training data is needed, and with only a little data the problem of overfitting often arises. A statistical identification model such as a support vector machine (SVM) can therefore be used. To build, for example, an identification model for a recognition rate of 70% with a support vector machine, z-dimensional utterance feature vectors with recognition rates of 70% or more and z-dimensional vectors with rates below 70% are used for training to obtain a set of support vectors, from which an identification function f(x) with the utterance feature vector x as its variable is derived; this function serves as the identification model.

To evaluate a z-dimensional utterance feature vector actually obtained from an input speech signal with this identification model, the vector is substituted for x in the identification function f(x); if the result is positive, the result is judged reliable with a recognition rate of 70% or more, and if negative, reliable with a recognition rate below 70%. Details of support vector machines are given in, for example, the Journal of the IEICE, Vol. 83, No. 6, June 2000, pp. 460-466. Because a support vector machine automatically selects only the few training samples near the decision boundary, under the criterion of margin maximization, to construct that boundary, it achieves comparatively good discrimination performance even from little training data; used in the present invention, it makes the creation of the identification model efficient.
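A minimal sketch of building and applying one such identification model with scikit-learn's SVC (an assumed library; the patent describes the identification function f(x) abstractly, and the training data here is a random placeholder):

    import numpy as np
    from sklearn.svm import SVC

    # Placeholder training data for illustration: 200 61-dimensional vectors with
    # a stand-in label for "actual recognition rate was 70% or more".
    rng = np.random.default_rng(0)
    X = rng.random((200, 61))
    y = (X.mean(axis=1) > 0.5).astype(int)

    model = SVC(kernel="rbf").fit(X, y)   # learns the identification function f(x)

    def reliable_at_70(vector):
        """Sign of f(x): positive -> rate judged >= 70%, negative -> below 70%."""
        return model.decision_function(np.atleast_2d(vector))[0] > 0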

When N word sequences of N-best candidates are recognized for one utterance, an utterance feature vector is created from each of the N recognition results, its reliability is estimated with the identification model, and the word sequence of the utterance speech recognition result with the highest reliability is output. Alternatively, the N word sequences may be output together with their reliabilities as a set.
As stated earlier, the utterance feature vector may consist of the various statistics alone, of any one or more of the mean, variance, maximum, and minimum, of only the values of the acoustic likelihood score 54, language likelihood score 55, and word likelihood score 56, or of only the word class sequence.
Experimental results
Experimental results showing the advantage of this invention are described below.

A 61-dimensional utterance feature vector was used, synthesized from the normalized means, variances, maxima, and minima of the acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62 attached to each word of the utterance-unit word sequence 50, together with word classes based on 37 types of part-of-speech information 52. Continuous word recognition of 14 broadcast news programs was evaluated using a trigram language model with a vocabulary of about 30,000 words and a gender-independent, state-sharing triphone acoustic model with about 5,000 states and 8 mixture components per state. FIG. 5A summarizes the data used: the 14 news programs contain 100,541 words in total, with a previously measured word accuracy of 83.59%.

For the broadcast news data, cross-validation was performed with one tenth of the data for evaluation and the remaining nine tenths for training. Using as the threshold 80%, close to the word accuracy of the data, utterances estimated to be at or above the threshold were extracted as utterances recognized with high accuracy. In the present invention, machine learning is used as one means of estimating the recognition rate. The recall of the extracted utterances was computed by Equation (1) and the precision by Equation (2):
Equation (1): Recall = H / C
Equation (2): Precision = H / N
where C is the actual number of utterances in the evaluation data with a recognition rate of 80% or more, N is the number of utterances estimated to have a recognition rate of 80% or more, and H is the number of utterances among those estimated at 80% or more whose actual rate was 80% or more.
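In code, with the quantities defined above (a trivial illustration):

    def recall_precision(actual_high, estimated_high):
        """actual_high / estimated_high: sets of utterance IDs whose actual /
        estimated recognition rate is 80% or more."""
        H = len(actual_high & estimated_high)
        return H / len(actual_high), H / len(estimated_high)  # (1) H/C, (2) H/N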

For comparison, the recognition rate was also estimated with the conventional N-best confidence measure, which assumes that words assigned a value above a certain threshold are correct and the others incorrect; recall and precision were computed with Equations (1) and (2) above. FIG. 5B shows the comparison. With the N-best confidence measure, recall was 91.76% and precision 75.62%; with the present invention, recall was 91.87% and precision 85.64%, an improvement in both. In continuous word recognition, then, selection using utterance-unit feature vectors leads to more accurate results.

With the invention of the present application, when producing text corresponding to input speech, parts with low reliability can be deleted or parts with high reliability emphasized, enabling more efficient use.
In the information recognized by a spoken dialogue system, in which a computer and a person communicate by voice, highly reliable parts can be weighted in use and parts with low reliability reconfirmed, enabling more efficient dialogue. When training the acoustic model used for speech recognition, data has conventionally been prepared by hand; by running speech recognition and training only on utterances recognized with high accuracy, unsupervised training becomes possible. As in these examples, using the invention of the present application enables the development of more efficient speech recognition devices.

FIG. 1 is a block diagram showing a configuration example of the system of this invention.
FIG. 2 is a block diagram showing a configuration example of the speech recognition unit 6.
FIG. 3 shows an example of the contents of the word sequence with information storage unit 31 and a configuration example of the information conversion unit 20.
FIG. 4 shows an utterance feature vector composed of 37 (m) types of part-of-speech information and the mean, variance, maximum, and minimum of each of the acoustic likelihood score 54, language likelihood score 55, word likelihood score 56, word duration 58, phoneme count 60, and phoneme duration 62.
FIG. 5 shows data from the experiment demonstrating the effect of this invention: A is the data used for training and evaluation, and B compares this invention with the N-best confidence measure.

Claims (13)

1. A speech recognition reliability estimation device comprising:
a speech recognition unit that divides an input digital speech signal into utterance units, extracts acoustic feature parameters from the digital speech signal of each utterance unit, computes, for those acoustic feature parameters and under given linguistic constraints, scores based on the probabilities output by probability models each expressing the features of a category of a linguistic unit, recognizes at least the category expressed by the model with the highest score, obtains the word sequence of each utterance, and generates the word sequence with information based on the recognition attached to each of its words;
an information conversion unit that, for each divided utterance unit, converts the information based on the recognition of each word contained in its word sequence into utterance feature vector information of the utterance unit;
a reliability assignment unit that estimates a recognition rate using the utterance feature vector information of the utterance unit and an identification model and, based on the estimated recognition rate, obtains the reliability of the utterance speech recognition result from which the information of the utterance unit was derived; and
an output unit that outputs the reliability.
2. The speech recognition reliability estimation device according to claim 1, wherein the information conversion unit comprises:
a word class sequence information generation unit that, for different word classes each representing a group of words by a single symbol, generates and stores a word class information sequence indicating which word class each word in one utterance belongs to; and
a class determination unit that, based on the word class information sequence, determines for each word class whether any word in the word sequence belongs to it, and uses the determination results as at least part of the utterance feature vector.
3. The speech recognition reliability estimation device according to claim 2, wherein the symbols use the part-of-speech information of each word in the information.
4. The speech recognition reliability estimation device according to any one of claims 1 to 3, wherein the information conversion unit uses one or more of the acoustic likelihood score, language likelihood score, word likelihood score, word duration, phoneme count, and phoneme duration obtained during the recognition as at least part of the utterance feature vector.
5. The speech recognition reliability estimation device according to any one of claims 1 to 4, wherein the identification model is one created based on a support vector machine (SVM).
6. The speech recognition reliability estimation device according to any one of claims 1 to 5, wherein the information conversion unit converts the values of the information based on the recognition attached to each word contained in the utterance unit into statistics, which form at least part of the utterance feature vector.
7. A speech recognition reliability estimation method comprising:
a speech recognition step of dividing an input digital speech signal into utterance units, extracting acoustic feature parameters from the digital speech signal of each divided utterance unit, computing, for those acoustic feature parameters and under given linguistic constraints, scores based on the probabilities output by probability models each expressing the features of a category of a linguistic unit, taking at least the category expressed by the model with the highest score as the recognition result, and obtaining the word sequence with information based on the recognition attached to each word contained in the word sequence of the recognition result; and
an information conversion step of converting the information based on the recognition of each word contained in the word sequence of the utterance unit into an utterance feature vector of the utterance unit.
8. The speech recognition reliability estimation method according to claim 7, wherein the information conversion step includes a step of converting, for word classes each representing a predetermined group of words by a single symbol, into a word class sequence indicating whether any word belonging to each word class is present.
9. The speech recognition reliability estimation method according to claim 8, wherein the word classes are parts of speech of words.
10. The speech recognition reliability estimation method according to any one of claims 7 to 9, wherein, in the dividing step, the information attached to each word includes one or more of the acoustic likelihood score, language likelihood score, word likelihood score, word duration, phoneme count, and phoneme duration in the information based on the recognition of that word.
11. The speech recognition reliability estimation method according to claim 10, wherein the information conversion step includes a step of converting the values attached to the words into statistics within the utterance.
12. The speech recognition reliability estimation method according to any one of claims 7 to 11, wherein the reliability assignment step uses an identification model created in advance by a support vector machine (SVM).
13. A program for causing a computer to execute each step of the speech recognition reliability estimation method according to any one of claims 7 to 12.
JP2006059216A 2006-03-06 2006-03-06 Speech recognition reliability estimation apparatus, method thereof, and program Expired - Fee Related JP4769098B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2006059216A JP4769098B2 (en) 2006-03-06 2006-03-06 Speech recognition reliability estimation apparatus, method thereof, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2006059216A JP4769098B2 (en) 2006-03-06 2006-03-06 Speech recognition reliability estimation apparatus, method thereof, and program

Publications (2)

Publication Number Publication Date
JP2007240589A (en) 2007-09-20
JP4769098B2 JP4769098B2 (en) 2011-09-07

Family

ID=38586239

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2006059216A Expired - Fee Related JP4769098B2 (en) 2006-03-06 2006-03-06 Speech recognition reliability estimation apparatus, method thereof, and program

Country Status (1)

Country Link
JP (1) JP4769098B2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100934218B1 (en) 2007-12-13 2009-12-29 한국전자통신연구원 Multilevel speech recognition device and multilevel speech recognition method in the device
JP2012022070A (en) * 2010-07-13 2012-02-02 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method, and device and program for the same
JP2012047875A (en) * 2010-08-25 2012-03-08 Nippon Telegr & Teleph Corp <Ntt> Business section extracting method and device, and program therefor
JP2013171244A (en) * 2012-02-22 2013-09-02 Nippon Telegr & Teleph Corp <Ntt> Discriminative speech recognition accuracy estimating device, discriminative speech recognition accuracy estimating method and program
JP2014120059A (en) * 2012-12-18 2014-06-30 Fuji Xerox Co Ltd Information processing apparatus and information processing program
JP2018045062A (en) * 2016-09-14 2018-03-22 Kddi株式会社 Program, device and method automatically grading from dictation voice of learner
CN111027793A (en) * 2019-03-27 2020-04-17 广东小天才科技有限公司 Method and system for determining word mastery degree and electronic equipment
CN115691472A (en) * 2022-12-28 2023-02-03 中国民用航空飞行学院 Evaluation method and device for management voice recognition system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002358097A (en) * 2001-06-01 2002-12-13 Mitsubishi Electric Corp Voice recognition device
JP2004013306A (en) * 2002-06-04 2004-01-15 Nec Corp Similarity computing device, index data generating device, video or audio database device, similarity computing method, index data generating method, content representation data storage device, and recording medium
JP2005275348A (en) * 2004-02-23 2005-10-06 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method, device, program and recording medium for executing the method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002358097A (en) * 2001-06-01 2002-12-13 Mitsubishi Electric Corp Voice recognition device
JP2004013306A (en) * 2002-06-04 2004-01-15 Nec Corp Similarity computing device, index data generating device, video or audio database device, similarity computing method, index data generating method, content representation data storage device, and recording medium
JP2005275348A (en) * 2004-02-23 2005-10-06 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method, device, program and recording medium for executing the method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yuichi Nakazawa et al.: "Correct/incorrect judgment based on the linguistic validity of word sequences in speech recognition results", Proceedings of the 2004 IEICE General Conference, vol. D-14-11, JPN6010052034, March 2004, p. 152, ISSN: 0001720023 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100934218B1 (en) 2007-12-13 2009-12-29 한국전자통신연구원 Multilevel speech recognition device and multilevel speech recognition method in the device
JP2012022070A (en) * 2010-07-13 2012-02-02 Nippon Telegr & Teleph Corp <Ntt> Speech recognition method, and device and program for the same
JP2012047875A (en) * 2010-08-25 2012-03-08 Nippon Telegr & Teleph Corp <Ntt> Business section extracting method and device, and program therefor
JP2013171244A (en) * 2012-02-22 2013-09-02 Nippon Telegr & Teleph Corp <Ntt> Discriminative speech recognition accuracy estimating device, discriminative speech recognition accuracy estimating method and program
JP2014120059A (en) * 2012-12-18 2014-06-30 Fuji Xerox Co Ltd Information processing apparatus and information processing program
JP2018045062A (en) * 2016-09-14 2018-03-22 Kddi株式会社 Program, device and method automatically grading from dictation voice of learner
CN111027793A (en) * 2019-03-27 2020-04-17 广东小天才科技有限公司 Method and system for determining word mastery degree and electronic equipment
CN115691472A (en) * 2022-12-28 2023-02-03 中国民用航空飞行学院 Evaluation method and device for management voice recognition system
CN115691472B (en) * 2022-12-28 2023-03-10 中国民用航空飞行学院 Evaluation method and device for management voice recognition system

Also Published As

Publication number Publication date
JP4769098B2 (en) 2011-09-07

Similar Documents

Publication Publication Date Title
US10121467B1 (en) Automatic speech recognition incorporating word usage information
US10943583B1 (en) Creation of language models for speech recognition
EP3433855B1 (en) Speaker verification method and system
Jiang Confidence measures for speech recognition: A survey
US6542866B1 (en) Speech recognition method and apparatus utilizing multiple feature streams
JP2965537B2 (en) Speaker clustering processing device and speech recognition device
US9646605B2 (en) False alarm reduction in speech recognition systems using contextual information
JP4195428B2 (en) Speech recognition using multiple speech features
US6535850B1 (en) Smart training and smart scoring in SD speech recognition system with user defined vocabulary
JP5218052B2 (en) Language model generation system, language model generation method, and language model generation program
JP4224250B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
JP4769098B2 (en) Speech recognition reliability estimation apparatus, method thereof, and program
JPH1063291A (en) Speech recognition method using continuous density hidden markov model and apparatus therefor
JP2006038895A (en) Device and method for speech processing, program, and recording medium
US10199037B1 (en) Adaptive beam pruning for automatic speech recognition
JP6031316B2 (en) Speech recognition apparatus, error correction model learning method, and program
Williams Knowing what you don't know: roles for confidence measures in automatic speech recognition
WO2021118793A1 (en) Speech processing
JP2004198597A (en) Computer program for operating computer as voice recognition device and sentence classification device, computer program for operating computer so as to realize method of generating hierarchized language model, and storage medium
Mary et al. Searching speech databases: features, techniques and evaluation measures
JP3660512B2 (en) Voice recognition method, apparatus and program recording medium
JP4659541B2 (en) Speech recognition apparatus and speech recognition program
Manjunath et al. Development of phonetic engine for Indian languages: Bengali and Oriya
Manjunath et al. Articulatory and excitation source features for speech recognition in read, extempore and conversation modes
JP4259100B2 (en) Unknown speech detection device for speech recognition and speech recognition device

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20080128

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20100830

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20100907

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20101105

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20110607

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20110617

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140624

Year of fee payment: 3

S531 Written request for registration of change of domicile

Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

LAPS Cancellation because of no payment of annual fees