JP2022010410A

JP2022010410A - Speech recognition device, speech recognition learning device, speech recognition method, speech recognition learning method, and program

Info

Publication number: JP2022010410A
Application number: JP2021188475A
Authority: JP
Inventors: 亮増村; Akira Masumura; 智大田中; Tomohiro Tanaka; 隆伸大庭; Takanobu Oba
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2019-02-07
Filing date: 2021-11-19
Publication date: 2022-01-14
Anticipated expiration: 2039-02-07
Also published as: JP7160170B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of achieving end-to-end speech recognition in consideration of context.

SOLUTION: A speech recognition device includes a model parameter learning unit and an utterance speech recognition unit. The model parameter learning unit treats a word sequence of interest as the observation, treats the word sequences preceding the word sequence of interest, the acoustic feature sequence corresponding to the word sequence of interest, and a model parameter θ as parameters, and learns the model parameter θ by maximum likelihood estimation of the likelihood function of the probability that the observation occurs under the parameters. The utterance speech recognition unit treats a word sequence to be recognized as the observation, treats the already recognized word sequences preceding the word sequence to be recognized, the acoustic feature sequence corresponding to the word sequence to be recognized, and the learned model parameter θ as parameters, and repeats, in chronological order, a process of recognizing the word sequence to be recognized by applying the maximum likelihood criterion to the likelihood function of the probability that the observation occurs under the parameters.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech recognition device, a speech recognition learning device, a speech recognition method, a speech recognition learning method, and a program.

With the progress of deep learning, a method of modeling speech recognition called end-to-end speech recognition, which takes speech as input and produces text as output, has emerged and continues to advance. Conventional, widely used speech recognition combines three models: an acoustic model of the relationship between speech and phoneme sequences, a pronunciation model of the relationship between phoneme sequences and words, and a language model of the relationships between words. A speech recognition algorithm (device) was built by training each model independently on different data. In contrast, end-to-end speech recognition builds a speech recognition algorithm (device) from a single model of the relationship between speech and text, and the only training data it requires are paired speech and text.

The configuration of the prior art is as follows. Let X = (x_1, …, x_T) be the acoustic feature sequence that can be automatically extracted from the speech input to end-to-end speech recognition, and let W = (w_1, …, w_N) be the output word sequence. The prior art models P(W | X, θ), where θ denotes the model parameter. The modeling of P(W | X, θ) is expressed by the following equation.
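Assuming the usual autoregressive factorization used by attention-based encoder-decoder models such as those of Non-Patent Documents 1 and 2, a form of this equation consistent with the symbols defined above would be the following. The factorization over word positions n is an assumption of this sketch.

```latex
P(W \mid X, \theta) = \prod_{n=1}^{N} P(w_n \mid w_1, \dots, w_{n-1}, X, \theta)
```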

In a speech recognition algorithm (device) based on this modeling, the word sequence W^ output as the recognition result for an input acoustic feature sequence X is determined by the following equation.
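Read as a maximum-likelihood decision over candidate word sequences, which is how the later sections of this description state the decision rule, the determination can be written as follows. This is a sketch of the standard form, with the hat written in LaTeX in place of the trailing "^".

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X, \theta)
```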

The model parameter θ is determined in advance by training on learning data D = (W_1, X_1), …, (W_|D|, X_|D|), a set of multiple (two or more) pairs of a word sequence and an acoustic feature sequence, where |D| is the number of elements in D. The parameter θ^ optimized on D follows the following equation.
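Assuming that the pairs in D are treated as independent and that θ is fitted by maximum likelihood, as in the learning unit described later, the optimization would take the following form. The index d over training pairs is notation introduced here for illustration.

```latex
\hat{\theta} = \operatorname*{arg\,max}_{\theta} \prod_{d=1}^{|D|} P(W_d \mid X_d, \theta)
```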

Various methods can be adopted for the detailed modeling. Neural-network-based methods are typical; for example, the methods of Non-Patent Document 1 and Non-Patent Document 2 can be used.

Non-Patent Document 1: Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “End-to-end continuous speech recognition using attention-based recurrent NN: first results,” in NIPS: Workshop Deep Learning and Representation Learning Workshop, 2014.
Non-Patent Document 2: Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 577-585.

The prior art described above models the problem of recognizing a single utterance. Even when recognizing a speech sequence made up of multiple consecutive utterances, it recognizes each utterance separately and cannot use the relationships between utterances at all. In other words, when recognizing the current utterance, it cannot take into account what word sequences were output for the speech of past utterances.

Consider a concrete example. Suppose a lecture of about 10 minutes is to be recognized, and that the lecture speech is segmented wherever 0.5 seconds of silence occurs, yielding 200 utterances in total. These 200 utterances form a continuous sequence, and consecutive utterances are likely to concern related information. With the prior art, however, each of the 200 utterances is recognized independently, so context cannot be used for recognition. For example, suppose the 100th utterance is "今期の業績は素晴らしいですね" ("This term's results are wonderful, aren't they") and the 101st utterance is spoken as "すばらしいせいかです" ("subarashii seika desu"). If the 100th utterance can be used as context, the 101st utterance is likely to be recognized correctly as "素晴らしい成果です" ("These are wonderful results"). If it cannot, the 101st utterance may be misrecognized as a homophone such as "素晴らしい製菓です" ("This is wonderful confectionery") or "素晴らしい聖火です" ("This is a wonderful sacred flame").

One might try to solve this problem by concatenating all the utterances (200 in the example above) and treating them as one long utterance. However, because an end-to-end speech recognition algorithm (device) works by converting the entire speech into vectors, it does not work well on long utterances. Handling all utterances together as a single utterance with an end-to-end speech recognition algorithm (device) is therefore unrealistic. Consequently, context-aware end-to-end speech recognition could not be achieved in the prior art.

Accordingly, an object of the present invention is to provide a speech recognition device that can achieve context-aware end-to-end speech recognition.

The speech recognition device of the present invention includes an utterance speech recognition unit.

Based on recognition data consisting of a set of acoustic feature sequences acquired in chronological order, the utterance speech recognition unit treats the word sequence to be recognized as the observation, and treats as parameters the already recognized word sequences that precede it, the acoustic feature sequence corresponding to the word sequence to be recognized, and a pre-trained model parameter θ. It recognizes the word sequence to be recognized by applying the maximum likelihood criterion to the likelihood function of the probability that the observation occurs under these parameters, and repeats this process in chronological order.

According to the speech recognition device of the present invention, context-aware end-to-end speech recognition can be achieved.

FIG. 1 is a block diagram showing the configuration of the speech recognition device of Example 1.
FIG. 2 is a flowchart showing the operation of the speech recognition device of Example 1.
FIG. 3 is a block diagram showing the configuration of the utterance speech recognition unit of the speech recognition device of Example 1.
FIG. 4 is a flowchart showing the operation of the utterance speech recognition unit of the speech recognition device of Example 1.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

The speech recognition device 1 of this example (minimum configuration; see the configuration inside the broken-line frame in FIG. 1) is described below. Here, the model parameter θ is assumed to have been trained in advance by a device separate from the speech recognition device 1.

In this specification, owing to limitations of the word-processing software, "^" is sometimes written after a character; this "^" is to be read as being displayed above that character. For example, W^L^ denotes W^L with a circumflex (hat) placed above the W.

<Overview of the inputs, output, and operation of the speech recognition device 1 (minimum configuration)>
Input 1: A sequence X^1, …, X^L of acoustic feature sequences for L consecutive utterances
Input 2: Model parameter θ (trained by a separate device and input to this device)
Output: A sequence W^1^, …, W^L^ of L consecutive word sequences

The speech recognition device 1 of this example takes as input the sequence X^1, …, X^L of acoustic feature sequences for L consecutive utterances and the model parameter θ, and outputs the sequence W^1^, …, W^L^ of L consecutive word sequences by probability calculation according to θ. Here X^1, …, X^L is the sequence of acoustic feature sequences that can be automatically extracted from the speech of the L consecutive utterances input to end-to-end speech recognition. X^l is the acoustic feature sequence of the l-th utterance and is written X^l = (x^l_1, …, x^l_{T_l}). The output sequence of word sequences is W^1^, …, W^L^, where W^l^ is the word sequence of the l-th utterance and is written W^l^ = (w^l_1^, …, w^l_{N_l}^), with N_l the number of words in the l-th utterance.

Any feature sequence that can be computed from speech can be used as the acoustic feature sequence; examples include mel filter bank cepstral coefficients and log mel filter bank features. Descriptions of mel filter bank cepstral coefficients and log mel filter bank features are omitted here.

A word sequence may be, for example, a space-delimited representation in the case of English. In the case of Japanese it may be, for example, a representation automatically segmented by morphological analysis or a representation split into individual characters.
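As a minimal illustration of two of these choices of unit, the following Python sketch splits a transcript either on spaces or into characters. The function name `split_words` and its `unit` argument are hypothetical and are not part of the specification. Segmentation by morphological analysis is not shown because it requires an external analyzer.

```python
from typing import List


def split_words(text: str, unit: str = "space") -> List[str]:
    """Split a transcript into the units that make up a word sequence.

    `unit` is a hypothetical switch used only for this illustration.
    """
    if unit == "space":
        # e.g. English: whitespace-delimited words
        return text.split()
    if unit == "char":
        # e.g. Japanese: one unit per character, ignoring whitespace
        return [ch for ch in text if not ch.isspace()]
    raise ValueError(f"unknown unit: {unit}")


english = split_words("the results are wonderful")
japanese = split_words("素晴らしい成果です", unit="char")
```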

Next, the general configuration of the speech recognition device of Example 1 is described with reference to FIG. 1. Here, the model parameter θ is assumed to be trained inside the speech recognition device 1. As shown in the figure, the speech recognition device 1 of this example includes a model parameter learning unit 11, a model parameter storage unit 11a, an utterance speech recognition unit 12, and a word sequence storage unit 12a. As noted above, however, the model parameter learning unit 11 and the model parameter storage unit 11a may instead be components of a separate device. The operation of each component is described below with reference to FIG. 2.

<Model parameter learning unit 11>
Input: Learning data D = (A_1, B_1), …, (A_|D|, B_|D|), a set of multiple (two or more) pairs of a sequence of word sequences and a sequence of acoustic feature sequences
Output: Model parameter θ

Based on learning data D = (A_1, B_1), …, (A_|D|, B_|D|), a set of multiple (two or more) pairs of word sequences acquired in chronological order and their corresponding acoustic feature sequences, the model parameter learning unit 11 treats the word sequence of interest (W^l in the following equation) as the observation, and treats as parameters the word sequences preceding it (W^1, …, W^{l-1}), the acoustic feature sequence corresponding to the word sequence of interest (X^l), and the model parameter θ. It learns θ by maximum likelihood estimation of the likelihood function of the probability that the observation (W^l) occurs under the parameters (W^1, …, W^{l-1}, X^l, θ) (S11). Here (A_m, B_m) = {(W^1, X^1), …, (W^{L_m}, X^{L_m})}. The parameter θ^ optimized on D follows the following equation.
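Taking the description above at face value, with the likelihood of each word sequence W^l conditioned on W^1, …, W^{l-1}, X^l, and θ, and assuming the likelihood is multiplied over the L_m utterances of each training sequence and over all |D| sequences, the maximum likelihood estimate can be sketched as follows. For brevity the utterances of the m-th pair are written W^l and X^l without an additional index m.

```latex
\hat{\theta} = \operatorname*{arg\,max}_{\theta}
  \prod_{m=1}^{|D|} \prod_{l=1}^{L_m}
  P\left(W^{l} \mid W^{1}, \dots, W^{l-1}, X^{l}, \theta\right)
```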

The θ^ learned here is used as θ in the utterance speech recognition unit 12.

<Model parameter storage unit 11a>
The model parameter storage unit 11a stores the learned θ^.

<Utterance speech recognition unit 12>
Input 1: Acoustic feature sequence X^l of the l-th utterance
Input 2: Word sequences W^1^, …, W^{l-1}^ of the 1st through (l-1)-th utterances, already obtained as speech recognition results
Input 3: Model parameter θ
Output: Word sequence W^l^ of the l-th utterance

Based on recognition data consisting of a set of acoustic feature sequences (X^1, …, X^L) acquired in chronological order, the utterance speech recognition unit 12 treats the word sequence to be recognized (W^l in the following equation) as the observation, and treats as parameters the already recognized word sequences that precede it (W^1^, …, W^{l-1}^), the acoustic feature sequence corresponding to the word sequence to be recognized (X^l), and the trained model parameter θ. It recognizes the word sequence to be recognized (W^l^) by applying the maximum likelihood criterion to the likelihood function of the probability that the observation (W^l) occurs under the parameters (W^1^, …, W^{l-1}^, X^l, θ), and repeats this process in chronological order (S12).

That is, when the acoustic feature sequence X^l of the l-th utterance and the already recognized word sequences W^1^, …, W^{l-1}^ of the 1st through (l-1)-th utterances are input, the utterance speech recognition unit 12 obtains the posterior probability distribution P(W^l | W^1^, …, W^{l-1}^, X^l, θ) for the l-th utterance by probability calculation according to the model parameter θ, and determines the word sequence W^l^ that is the recognition result for the l-th utterance by the maximum likelihood criterion. The determination by the maximum likelihood criterion follows the following equation.
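Written out with the inputs listed for the utterance speech recognition unit 12, the maximum likelihood decision for the l-th utterance would read as follows. The hats correspond to the trailing "^" used in the running text.

```latex
\hat{W}^{l} = \operatorname*{arg\,max}_{W^{l}}
  P\left(W^{l} \mid \hat{W}^{1}, \dots, \hat{W}^{l-1}, X^{l}, \theta\right)
```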

As described above, the utterance speech recognition unit 12 executes step S12 recursively in chronological order. For example, by treating the word sequence W^l^ recognized for the l-th utterance as a known recognition result, the posterior probability distribution P(W^{l+1} | W^1^, …, W^l^, X^{l+1}, θ) for the (l+1)-th utterance can be obtained, and the word sequence W^{l+1}^ that is the recognition result for the (l+1)-th utterance is determined in the same way, as follows.
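Shifting the index of the l-th decision rule by one, under the same assumptions, gives the following sketch for the (l+1)-th utterance.

```latex
\hat{W}^{l+1} = \operatorname*{arg\,max}_{W^{l+1}}
  P\left(W^{l+1} \mid \hat{W}^{1}, \dots, \hat{W}^{l}, X^{l+1}, \theta\right)
```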

The detailed formulation of P(W^l | W^1^, …, W^{l-1}^, X^l, θ) and the method of computing it are described later.

<Word sequence storage unit 12a>
The word sequence storage unit 12a stores the word sequences that the utterance speech recognition unit 12 uses recursively. For example, when the word sequence W^1^ is recognized in step S12, the word sequence storage unit 12a stores W^1^; when W^l^ is recognized, it stores W^l^; and when W^L^ is recognized, it stores W^L^.
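The interplay between the recursive execution of step S12 and the word sequence storage unit 12a can be sketched in Python as a loop over utterances that appends each result to a history. This is only an illustration under stated assumptions: `recognize_sequence` and `decode_one` are hypothetical names, and `decode_one` stands in for the maximum likelihood decision for one utterance, which is not implemented here.

```python
from typing import Callable, List, Sequence

Features = Sequence[Sequence[float]]  # one utterance: frames of acoustic features
Words = List[str]


def recognize_sequence(
    utterances: Sequence[Features],
    decode_one: Callable[[Features, List[Words]], Words],
) -> List[Words]:
    """Recognize utterances in chronological order, reusing past results."""
    history: List[Words] = []  # plays the role of word sequence storage unit 12a
    for feats in utterances:  # l = 1, ..., L
        # Step S12 for utterance l: condition on all earlier recognition results.
        hypothesis = decode_one(feats, list(history))
        history.append(hypothesis)  # store W^l^ for use with later utterances
    return history


def toy_decode(feats: Features, past: List[Words]) -> Words:
    """Hypothetical stand-in decoder: reports what it was given."""
    return [f"utt{len(past) + 1}", f"frames{len(feats)}"]


result = recognize_sequence([[[0.1]], [[0.2], [0.3]]], toy_decode)
```

The toy decoder only shows that the l-th call receives the l-1 earlier results; a real decoder would search over candidate word sequences.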

<Detailed configuration of the utterance speech recognition unit 12>
As shown in FIG. 3, the utterance speech recognition unit 12 includes an utterance vector calculation unit 121, an utterance sequence embedding vector calculation unit 122, a context vector calculation unit 123, and a posterior probability calculation unit 124.

As described above, the utterance speech recognition unit 12 computes P(W^l | W^1^, …, W^{l-1}^, X^l, θ). Its detailed formulation is expressed by the following equation.
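Since the units described next compute a probability for the n-th word of the l-th utterance given the preceding words of that utterance, a word-by-word factorization consistent with that description can be sketched as follows. N_l, the number of words in the l-th utterance, is notation assumed for this sketch.

```latex
P\left(W^{l} \mid \hat{W}^{1}, \dots, \hat{W}^{l-1}, X^{l}, \theta\right)
  = \prod_{n=1}^{N_l}
    P\left(w^{l}_{n} \mid w^{l}_{1}, \dots, w^{l}_{n-1},
           \hat{W}^{1}, \dots, \hat{W}^{l-1}, X^{l}, \theta\right)
```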

The computation of P(w^l_n | w^l_1, …, w^l_{n-1}, W^1^, …, W^{l-1}^, X^l, θ) is realized by the utterance vector calculation unit 121, the utterance sequence embedding vector calculation unit 122, the context vector calculation unit 123, and the posterior probability calculation unit 124 in the utterance speech recognition unit 12. The detailed processing for computing this probability for the n-th word of the l-th utterance is described below with reference to FIG. 4.

<Utterance vector calculation unit 121>
Input 1: Word sequence W^{l-1}^ of the (l-1)-th utterance
Input 2: Model parameter θ
Output: Utterance vector u^{l-1} of the (l-1)-th utterance

The utterance vector calculation unit 121 transforms the already recognized word sequence W^{l-1}^ of the (l-1)-th utterance, which precedes the word sequence W^l of the l-th utterance to be recognized, into the utterance vector u^{l-1} of the (l-1)-th utterance using a transformation function based on the model parameter θ (S121). The word sequence W^{l-1}^ contains one or more words. The utterance vector is a vector that embeds the information contained in the word sequence, namely the semantic information of the utterance needed to recognize the next utterance. The larger the dimensionality of the vector, the more information can be embedded; the dimensionality is chosen manually, for example 512. Any function that transforms a variable-length symbol sequence into a single vector can be used as the transformation function. For example, a function that builds a frequency vector of the words in the utterance can be used, as can a recurrent neural network or a bidirectional recurrent neural network.

When l = 1, no input word sequence W^0 exists, so the output u^0 may be a vector whose elements are all 0.0.

Step S121 is performed for each of W^1^, …, W^{l-1}^. The utterance vector calculation unit 121 therefore outputs u^1, …, u^{l-1}.
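The word-frequency option mentioned for step S121 can be sketched in Python as follows. This is a minimal illustration, not the trained transformation: the function names and the fixed vocabulary `vocab` are assumptions introduced here, and the frequency vector has one dimension per vocabulary entry rather than a freely chosen size such as 512.

```python
from collections import Counter
from typing import List, Sequence


def utterance_vector(words: Sequence[str], vocab: Sequence[str]) -> List[float]:
    """Map a variable-length word sequence to a fixed-length frequency vector."""
    counts = Counter(words)
    # One element per vocabulary entry; words outside `vocab` are ignored.
    return [float(counts[w]) for w in vocab]


def utterance_vectors(
    past: Sequence[Sequence[str]], vocab: Sequence[str]
) -> List[List[float]]:
    """Apply step S121 to every already recognized utterance."""
    if not past:
        # l = 1: no past word sequence exists, so return the all-zero u^0.
        return [[0.0] * len(vocab)]
    return [utterance_vector(words, vocab) for words in past]


vocab = ["this", "term", "results", "wonderful"]
u_prev = utterance_vector(["wonderful", "results", "wonderful"], vocab)
u_first = utterance_vectors([], vocab)
```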

<Utterance sequence embedding vector calculation unit 122>
Input 1: Sequence u^1, …, u^{l-1} of utterance vectors for the past utterances
Input 2: Model parameter θ
Output: The (l-1)-th utterance sequence embedding vector v^{l-1}

The utterance sequence embedding vector calculation unit 122 transforms the sequence u^1, …, u^{l-1} of utterance vectors for the past utterances into the (l-1)-th utterance sequence embedding vector v^{l-1} using a transformation function based on the model parameter θ (S122). The utterance sequence embedding vector is a single vector that embeds the semantic information needed to recognize the next utterance. The larger the dimensionality of the vector, the more information can be embedded; the dimensionality is chosen manually, for example 512. Any function that transforms a variable-length sequence of vectors into a single vector can be used as the transformation function; examples include a recurrent neural network and a function that averages the vectors of the utterance vector sequence. When averaging is used, the dimensionality of the utterance sequence embedding vector is determined by the dimensionality of the utterance vectors.

When l = 1, no utterance vector sequence for past utterances exists as input, so the output v^0 may be a vector whose elements are all 0.0.
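The averaging option for step S122, including the all-zero output for l = 1, can be sketched in Python as follows. The function name `sequence_embedding` and the explicit `dim` argument are assumptions made for this illustration.

```python
from typing import List, Sequence


def sequence_embedding(utt_vecs: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Average the utterance vectors u^1, ..., u^{l-1} into a single vector."""
    if not utt_vecs:
        # l = 1: no past utterance vectors, so return the all-zero v^0.
        return [0.0] * dim
    count = len(utt_vecs)
    # Element-wise mean; the output has the same dimensionality as each u.
    return [sum(column) / count for column in zip(*utt_vecs)]


v_prev = sequence_embedding([[1.0, 0.0], [3.0, 2.0]], dim=2)
v_zero = sequence_embedding([], dim=3)
```

Averaging discards the order of the utterances, which a recurrent neural network would retain.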

<Context vector calculation unit 123>
Input 1: Word string w^l_1, …, w^l_{n-1} preceding the n-th word w^l_n in the word sequence W^l of the l-th utterance
Input 2: Acoustic feature sequence X^l of the l-th utterance
Input 3: Model parameter θ
Output: Context vector s^l_n for the n-th word of the l-th utterance

The context vector calculation unit 123 transforms the word string w^l_1, …, w^l_{n-1} preceding the n-th word w^l_n in the word sequence W^l of the l-th utterance to be recognized (called a word string here to distinguish it from a word sequence), together with the l-th acoustic feature sequence X^l corresponding to W^l, into the context vector s^l_n for the n-th word w^l_n of W^l, using a transformation function based on the model parameter θ (S123). This context vector embeds integrated semantic and phonological information needed to recognize the next word. Any function that transforms two kinds of variable-length vector sequences into a single vector can be used as the transformation function. For example, as in Non-Patent Document 2, a function can be used that provides a recurrent neural network for each of the acoustic feature sequence and the word sequence and adds an attention mechanism to express them as a single context vector. In the simplest case, a function can be used that concatenates the frequency vector of the words preceding the n-th word of the l-th utterance with the average of the acoustic feature sequence of the l-th utterance.
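The simplest option named for step S123, concatenating a word-frequency vector with the frame-averaged acoustic features, can be sketched in Python as follows. The function name `context_vector` and the vocabulary argument are hypothetical, and the sketch assumes the utterance has at least one frame. It does not implement the attention-based variant.

```python
from collections import Counter
from typing import List, Sequence


def context_vector(
    prev_words: Sequence[str],
    feats: Sequence[Sequence[float]],
    vocab: Sequence[str],
) -> List[float]:
    """Concatenate word-history counts with the mean acoustic feature vector."""
    counts = Counter(prev_words)
    # Frequency vector of the words w^l_1, ..., w^l_{n-1} already emitted.
    word_part = [float(counts[w]) for w in vocab]
    n_frames = len(feats)  # assumed to be at least 1
    # Mean over the frames of X^l, one value per feature dimension.
    acoustic_part = [sum(column) / n_frames for column in zip(*feats)]
    return word_part + acoustic_part


s = context_vector(["a"], [[1.0, 2.0], [3.0, 4.0]], vocab=["a", "b"])
```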

<Posterior probability calculation unit 124>
Input 1: The (l-1)-th utterance sequence embedding vector v^{l-1}
Input 2: Context vector s^l_n for the n-th word of the l-th utterance
Input 3: Model parameter θ
Output: Posterior probability for the n-th word of the l-th utterance

The posterior probability calculation unit 124 computes the posterior probability P(w^l_n | w^l_1, …, w^l_{n-1}, W^1^, …, W^{l-1}^, X^l, θ) for the n-th word of the l-th word sequence W^l from the (l-1)-th utterance sequence embedding vector v^{l-1}, obtained by transforming the utterance vector sequence u^1, …, u^{l-1} up to the utterance immediately preceding the word sequence W^l to be recognized, and from the context vector s^l_n for the n-th word of W^l, using a transformation function based on the model parameter θ (S124). The posterior probabilities can be represented as a vector with one element per word, so the posterior probability distribution can be expressed through a vector transformation. Any function that transforms two vectors into a posterior probability distribution can be used as the transformation function; for example, it can be realized by a function that applies a softmax-based transformation to the concatenation of the two vectors. More generally, any function whose output vector, corresponding to the posterior probability distribution, has elements summing to 1.0 is applicable.
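One way to realize the softmax-based option for step S124 is sketched below in Python. The linear layer applied before the softmax (`weights`, `bias`) is an assumption introduced here to map the concatenated vector to one score per vocabulary word; the specification only states that a softmax-based transformation of the concatenated vector is used. The function name `word_posterior` is hypothetical.

```python
import math
from typing import List, Sequence


def word_posterior(
    v_prev: Sequence[float],
    s_n: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Return a probability for each vocabulary word from v^{l-1} and s^l_n."""
    x = list(v_prev) + list(s_n)  # concatenation of the two input vectors
    # Assumed linear layer: one row of `weights` per vocabulary word.
    logits = [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]  # elements sum to 1.0


probs = word_posterior(
    v_prev=[1.0, 0.0],
    s_n=[0.0, 3.0],
    weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
```

With these toy weights the second word receives the larger score, so it also receives the larger probability.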

According to the speech recognition device 1 of this example, end-to-end speech recognition is modeled over utterance sequences rather than over single utterances as in the prior art, so context-aware end-to-end speech recognition can be achieved when the speech input is given as a sequence of utterances. That is, when recognizing a given utterance in an utterance sequence, information from the first utterance of the sequence up to the utterance immediately before the target utterance can be used as context. For example, as before, suppose a lecture of about 10 minutes is to be recognized and that segmenting it wherever 0.5 seconds of silence occurs yields 200 utterances. In this case, the speech recognition device 1 of this example can use all the relevant context preceding any given utterance among the 200 consecutive utterances for the current recognition. For example, when recognizing the 100th utterance, the speech recognition device 1 can use the recognition results of the 1st through 99th utterances as context.

The speech recognition device 1 of this embodiment can improve recognition performance for speech such as lectures, telephone calls, and meetings.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) capable of reading from and writing to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above, the data required for the processing of this program, and the like (storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data required for the processing of each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and the like).

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the present invention. The processes described in the above embodiment need not be executed in chronological order in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as required.

As described above, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on a computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A speech recognition device including an utterance speech recognition unit that, based on recognition data consisting of a set of acoustic feature sequences acquired in chronological order, uses a word sequence to be recognized as an observed value, uses an already recognized word sequence preceding the word sequence to be recognized, the acoustic feature sequence corresponding to the word sequence to be recognized, and a model parameter θ learned in advance as parameters, and repeats, in chronological order, a process of recognizing the word sequence to be recognized by a maximum likelihood criterion on the likelihood function of the probability that the observed value occurs under the parameters.

The speech recognition device according to claim 1, wherein the utterance speech recognition unit includes:
an utterance vector calculation unit that converts an already recognized word sequence preceding the word sequence to be recognized into an utterance vector, which is a vector containing semantic information necessary for speech recognition of the next utterance, by a conversion function based on the model parameter θ;
an utterance sequence embedding vector calculation unit that converts a sequence of the utterance vectors into an utterance sequence embedding vector containing semantic information necessary for speech recognition of the next utterance, by a conversion function based on the model parameter θ;
a context vector calculation unit that converts the word sequence preceding the word of interest within the word sequence to be recognized, together with the acoustic feature sequence corresponding to the word sequence to be recognized, into a context vector that integrates the semantic information and phonological information necessary for speech recognition of a word in the word sequence to be recognized, by a conversion function based on the model parameter θ; and
a posterior probability calculation unit that calculates the posterior probability of a word in the word sequence to be recognized from the utterance sequence embedding vector, obtained by converting the sequence of utterance vectors preceding the word sequence to be recognized, and the context vector for the word in the word sequence to be recognized, by a conversion function based on the model parameter θ.

A speech recognition learning device including a model parameter learning unit that, based on learning data consisting of a set of word sequences acquired in chronological order paired with their corresponding acoustic feature sequences, uses a word sequence of interest as an observed value, uses a word sequence preceding the word sequence of interest, the acoustic feature sequence corresponding to the word sequence of interest, and a model parameter θ as parameters, and learns the model parameter θ by performing maximum likelihood estimation on the likelihood function of the probability that the observed value occurs under the parameters.

A speech recognition method including an utterance speech recognition step that, based on recognition data consisting of a set of acoustic feature sequences acquired in chronological order, uses a word sequence to be recognized as an observed value, uses an already recognized word sequence preceding the word sequence to be recognized, the acoustic feature sequence corresponding to the word sequence to be recognized, and a model parameter θ learned in advance as parameters, and repeats, in chronological order, a process of recognizing the word sequence to be recognized by a maximum likelihood criterion on the likelihood function of the probability that the observed value occurs under the parameters.

A speech recognition learning method including a model parameter learning step that, based on learning data consisting of a set of word sequences acquired in chronological order paired with their corresponding acoustic feature sequences, uses a word sequence of interest as an observed value, uses a word sequence preceding the word sequence of interest, the acoustic feature sequence corresponding to the word sequence of interest, and a model parameter θ as parameters, and learns the model parameter θ by performing maximum likelihood estimation on the likelihood function of the probability that the observed value occurs under the parameters.

A program that causes a computer to function as the speech recognition device according to claim 1 or 2.

A program that causes a computer to function as the speech recognition learning device according to claim 3.