JP2015087556A - Speech recognition WFST generation device, speech recognition device, speech recognition WFST generation method, speech recognition method, program, and recording medium - Google Patents


Info

Publication number
JP2015087556A
JP2015087556A (application number JP2013226121A)
Authority
JP
Japan
Prior art keywords
wfst
unigram
common
stage
speech recognition
Prior art date
Legal status
Granted
Application number
JP2013226121A
Other languages
Japanese (ja)
Other versions
JP6193726B2 (en)
Inventor
Yoshikazu Yamaguchi
Shoko Yamahata
Current Assignee
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2013226121A (granted as JP6193726B2)
Publication of JP2015087556A
Application granted
Publication of JP6193726B2
Legal status: Active
Anticipated expiration

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition WFST generation device capable of suppressing the size of a speech recognition WFST.

SOLUTION: A common unigram creation unit 11 reads word appearance frequencies from each of a plurality of language models 10_1 to 10_N, calculates a common unigram value, and creates a common unigram WFST from that value. A first-stage WFST creation unit 16 composes each of M (M ≥ 1) acoustic model WFSTs 13_1 to 13_M with a triphone WFST 14, a dictionary WFST 15, and the common unigram WFST, and joins the per-acoustic-model WFSTs to form a first-stage WFST. A second-stage WFST creation unit 17 takes the common unigram value as input and, referring to the language models 10_1 to 10_N, calculates for each language model the n-gram probability with the common unigram value removed, creates a WFST of that n-gram probability, and joins the n-gram probability WFSTs of all the language models to form a second-stage WFST.

Description

The present invention relates to a speech recognition WFST creation device that creates a weighted finite-state transducer (hereinafter, WFST) using a plurality of types of acoustic models and language models, to a speech recognition device, and to their methods, programs, and recording medium.

Speech recognition using a WFST converts the information needed for recognition, such as acoustic models, a dictionary, and language models, into a single integrated WFST, decodes the input speech by treating that WFST as the search space, and produces a recognition result character string. Speech recognition using WFSTs is disclosed in, for example, Patent Documents 1 and 2.

Japanese Patent No. 5175325
Japanese Patent No. 4478088

To realize high-accuracy WFST-based speech recognition in a service that targets a variety of speakers, environments, and topics, multiple acoustic models and multiple language models must be used simultaneously. As the number of acoustic and language models used simultaneously grows, the WFST size grows with it, and the memory required for recognition becomes enormous.

The present invention was made in view of this problem, and its object is to provide a speech recognition WFST creation device that can create WFSTs with only a small increase in memory size even when supporting a plurality of speech recognition services, a speech recognition device, and their methods, programs, and recording medium.

The speech recognition WFST creation device of the present invention comprises a common unigram WFST creation unit, a common unigram WFST storage unit, a first-stage WFST creation unit, and a second-stage WFST creation unit. The common unigram WFST creation unit reads the appearance frequency of each word from a plurality of language models, calculates a common unigram value, creates a common unigram WFST from that value, and outputs both the common unigram value and the common unigram WFST. The common unigram WFST storage unit stores the common unigram value and the common unigram WFST. The first-stage WFST creation unit creates, for each of N (N ≥ 1) acoustic models, a WFST obtained by composing the acoustic model WFST with a triphone WFST, a dictionary WFST, and the common unigram WFST, and joins the per-acoustic-model WFSTs to form a first-stage WFST. The second-stage WFST creation unit takes the common unigram value as input and, referring to each of the plurality of language models, calculates for each language model the n-gram probability with the common unigram value removed, creates a WFST of that n-gram probability, and joins the n-gram probability WFSTs of all the language models to form a second-stage WFST.

The speech recognition device of the present invention comprises a first-stage WFST storage unit that stores the first-stage WFST created by the above speech recognition WFST creation device, a second-stage WFST storage unit that stores the second-stage WFST, and a speech recognition unit. The speech recognition unit performs speech recognition by multi-stage on-the-fly composition with reference to the first-stage and second-stage WFST storage units.

According to the speech recognition WFST creation device of the present invention, a common unigram WFST is created by reading the appearance frequency of each word from a plurality of language models and calculating a common unigram value. Because the first-stage WFST shares this common unigram across the per-acoustic-model WFSTs, its memory size can be kept small. Furthermore, since the speech recognition device of the present invention performs recognition using the first-stage and second-stage WFSTs created by that device, it enables high-accuracy speech recognition with a small amount of memory even when supporting a plurality of speech recognition services.

FIG. 1 shows an example functional configuration of the speech recognition WFST creation device 100 of the present invention. FIG. 2 shows the operation flow of the device 100. FIG. 3 shows the common unigram WFST for a word w. FIG. 4 shows an example of the first-stage WFST. FIG. 5 shows an example functional configuration of the speech recognition device 200 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical elements across drawings bear the same reference numerals, and their description is not repeated.

FIG. 1 shows an example functional configuration of the speech recognition WFST creation device 100 of the present invention; its operation flow is shown in FIG. 2. The device 100 comprises a common unigram WFST creation unit 11, a common unigram WFST storage unit 12, a first-stage WFST creation unit 16, and a second-stage WFST creation unit 17. It is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The common unigram WFST creation unit 11 reads the appearance frequency of each word from a plurality of language models 10_1, 10_2, …, 10_N, calculates a common unigram value, creates a common unigram WFST from that value, and outputs both the common unigram value and the common unigram WFST. Language model 10_1 is, for example, a language model for the sports domain; language model 10_2, for the entertainment domain; language model 10_N, for the politics domain. A language model is thus prepared for each of a plurality of domains. Each language model is assumed to also contain word appearance frequency information (unigrams alone suffice).

The common unigram value (appearance probability) Pc(w) of a word w over the language models 10_* (1 ≤ * ≤ N) is computed by the following equation.

Pc(w) = Σ_{*=1}^{N} C_*(w) / Σ_{w'=1}^{W} Σ_{*=1}^{N} C_*(w')   (1)

Here C_*(w) is the appearance frequency of word w in language model 10_*, and W is the number of words contained in all the language models. The common unigram WFST creation unit 11 evaluates Equation (1) for every word w to form a common unigram model, that is, the collection (for example, a file) of the common unigram values of all words w.
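The frequency-pooling computation above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the dict-based model representation, and the toy word counts are all assumptions made for the example.

```python
# Hypothetical sketch of Equation (1): the common unigram value Pc(w) pools
# word frequencies C_*(w) across all N language models and normalizes by the
# grand total. Model contents below are illustrative.
from collections import Counter

def common_unigram(models):
    """models: list of dicts mapping word -> frequency C_*(w)."""
    pooled = Counter()
    for counts in models:
        pooled.update(counts)          # sum C_*(w) over all models
    total = sum(pooled.values())       # normalizer: total count over all words
    return {w: c / total for w, c in pooled.items()}

sports   = {"goal": 8, "vote": 1, "stage": 1}   # toy "sports" model counts
politics = {"goal": 1, "vote": 7, "stage": 2}   # toy "politics" model counts
pc = common_unigram([sports, politics])
print(pc["goal"])   # (8 + 1) / 20 = 0.45
```

Because the pooled counts are normalized once over the whole shared vocabulary, the resulting values form a single probability distribution common to every model.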

FIG. 3 shows the common unigram WFST. It consists of three nodes, each labeled with a state number. The initial state is state 0; state 2, drawn with a double line, is the final state. Arcs represent state transitions. The label <s> on the arc from state 0 to state 1 denotes the beginning of a sentence. A self-loop arc on state 1 represents that, on input of a word w, the word w is output with the appearance probability given by its common unigram value; one such arc is created per word. The label </s> on the arc from state 1 to state 2 denotes the end of a sentence. The common unigram WFST is stored in the common unigram WFST storage unit 12.

The first-stage WFST creation unit 16 composes each of the M (M ≥ 1) acoustic model WFSTs 13_1, 13_2, …, 13_M with the triphone WFST 14, the dictionary WFST 15, and the common unigram WFST created by the common unigram WFST creation unit 11 to obtain a WFST per acoustic model, and joins all of these per-acoustic-model WFSTs into the first-stage WFST (step S16).

The composed WFST for each acoustic model 13_* (1 ≤ * ≤ M) is given by the following equation.

HCLGc_* = opt(H_* ○ C ○ L ○ Gc)   (2)

Here opt denotes the WFST optimization operation and ○ the WFST composition operation. H_* is the WFST of each acoustic model. C is the triphone WFST, which converts the triphones output by the acoustic model WFST into phonemes. L is the dictionary WFST, which converts phonemes into words. Gc is the common unigram WFST. Methods for composing and optimizing WFSTs are well known; see, for example, Reference 1 (Takaaki Hori and Hajime Tsukada, "Speech recognition using weighted finite-state transducers," IPSJ Magazine, Vol. 45, No. 10, October 15, 2004).

The first-stage WFST creation unit 16 joins all of the composed and optimized per-acoustic-model WFSTs (H_*CLGc) into the first-stage WFST. FIG. 4 shows an example of the first-stage WFST; the operation of the first-stage WFST creation unit 16 is described with reference to it.

The first-stage WFST creation unit 16 creates a state s1 (the initial state) and a state s2. It then creates a transition whose input and output symbols are both ε (empty) from state s1 to the initial state of H_1CLGc, the WFST containing acoustic model 1, and likewise creates transitions from s1 for H_2CLGc, …, H_MCLGc, corresponding to all the acoustic models. Finally, it creates transitions with ε input and output symbols from each final state of H_1CLGc through H_MCLGc to state s2, forming a single first-stage WFST.

The first-stage WFST is thus the parallel union of the per-acoustic-model WFSTs (H_*CLGc). It is output from the first-stage WFST creation unit 16 to the outside, or may instead be stored in the first-stage WFST storage unit 19.
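The parallel-union construction of FIG. 4 can be sketched as below. The component WFSTs are given in the arc-tuple encoding assumed earlier, with locally numbered states that are renumbered by an offset; the function name and the toy components are illustrative assumptions, not the patent's actual data structures.

```python
# Sketch of the first-stage union: a new initial state s1 and final state s2
# are joined to every per-acoustic-model WFST H_*CLGc by epsilon transitions.
# Components are (arcs, initial_state, final_states) with local state numbers.
EPS = "<eps>"

def union_wfsts(components):
    S1, S2 = 0, 1                       # shared initial and final states
    arcs, offset = [], 2
    for comp_arcs, comp_init, comp_finals in components:
        n_states = 1 + max(max(s, d) for s, d, *_ in comp_arcs)
        arcs.append((S1, offset + comp_init, EPS, EPS, 0.0))   # s1 -> component
        for s, d, i, o, w in comp_arcs:                        # copy, renumbered
            arcs.append((offset + s, offset + d, i, o, w))
        for f in comp_finals:
            arcs.append((offset + f, S2, EPS, EPS, 0.0))       # component -> s2
        offset += n_states
    return arcs, S1, {S2}

# Two toy one-arc components standing in for H_1CLGc and H_2CLGc.
h1 = ([(0, 1, "a", "word", 0.5)], 0, {1})
h2 = ([(0, 1, "b", "word", 0.7)], 0, {1})
arcs, init, finals = union_wfsts([h1, h2])
print(len(arcs))  # 2 components x (eps in + 1 arc + eps out) = 6
```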

The second-stage WFST creation unit 17 takes the common unigram value Pc(w) as input and, referring to the plurality of language models 10_1, 10_2, …, 10_N, calculates for each language model the n-gram probability with the common unigram value removed, creates a WFST of that n-gram probability, and joins the n-gram probability WFSTs of all the language models in parallel into the second-stage WFST (step S17). The n-gram probability Pc_*(w|uv) (where u, v, w are words) of each language model with the common unigram value Pc(w) removed can be computed by the following equation.

Pc_*(w|uv) = P_*(w|uv) / Pc(w)   (3)

For every language model, the second-stage WFST creation unit 17 creates a per-language-model WFST from the n-gram probabilities P_*(w|uv) with the common unigram value Pc(w) removed, and joins these per-language-model WFSTs in parallel into the second-stage WFST. As Equation (3) makes clear, the second-stage WFST is based on the n-gram probability obtained by removing the common unigram value Pc(w) from the trigram probability P_*(w|uv).
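The division in Equation (3) can be checked numerically in the negative-log (tropical) domain that WFST weights use: applying the second-stage weight after the first stage restores the full trigram score, since Pc(w) cancels. The function name and the probability values are illustrative assumptions.

```python
# Sketch of Equation (3) as a WFST weight: dividing the model-specific
# trigram P_*(w|uv) by the common unigram Pc(w) becomes, in -log space,
# -log P_*(w|uv) + log Pc(w). Stacking the two stages cancels Pc(w).
import math

def second_stage_weight(p_trigram, pc_w):
    # -log( P_*(w|uv) / Pc(w) )
    return -math.log(p_trigram) + math.log(pc_w)

pc_w  = 0.45    # first-stage (common unigram) probability of word w
p_tri = 0.02    # model-specific trigram probability P_*(w|uv)

total = -math.log(pc_w) + second_stage_weight(p_tri, pc_w)
print(abs(total - (-math.log(p_tri))) < 1e-12)  # True: combined score = trigram
```

This cancellation is why the shared first-stage WFST loses no modeling accuracy: the second stage exactly converts the common unigram score into each model's trigram score.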

The method of joining the per-language-model WFSTs in parallel into the second-stage WFST is the same as for the first-stage WFST described with reference to FIG. 4. The created second-stage WFST is output to the outside, or may instead be stored in the second-stage WFST storage unit 20.

The processing of the common unigram WFST creation unit 11, the first-stage WFST creation unit 16, and the second-stage WFST creation unit 17 described above is repeated until all acoustic models and language models have been processed (No in step S18). A control unit 18 controls the sequencing and termination of steps S11, S16, and S17; this control function is a generic one rather than a special technical feature of this embodiment.

By using the common unigram WFST (Gc), the speech recognition WFST creation device 100 described above can greatly reduce the size of the first stage; specifically, the number of first-stage WFSTs can be reduced by N − 1.

The example above had the common unigram WFST creation unit 11 compute the common unigram value by counting the appearance frequency of each word w in every language model, but the common unigram value may also be obtained without using appearance frequencies. In that case, the common unigram value Pc(w) is computed from the unigram probabilities P_*(w) that each language model holds in the first place, by the following equation.

Pc(w) = (1/N) Σ_{*=1}^{N} P_*(w)   (4)

The method of creating the first-stage WFST after obtaining the common unigram value Pc(w) is the same as above. Since the unigram probabilities P_*(w) are values already known within each language model, there is no need to prepare word appearance frequencies separately.
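The frequency-free variant can be sketched as follows, assuming the per-model unigram probabilities are simply averaged, which is one plausible reading of Equation (4); the function name and toy values are illustrative.

```python
# Hypothetical sketch of Equation (4): when per-model word frequencies are
# unavailable, the common unigram value is formed from the unigram
# probabilities P_*(w) each model already stores (here, by averaging).
def common_unigram_from_probs(unigram_models):
    """unigram_models: list of dicts mapping word -> P_*(w)."""
    vocab = set().union(*unigram_models)   # union of all model vocabularies
    n = len(unigram_models)
    return {w: sum(m.get(w, 0.0) for m in unigram_models) / n for w in vocab}

pc = common_unigram_from_probs([{"goal": 0.8, "vote": 0.2},
                                {"goal": 0.1, "vote": 0.9}])
print(pc["goal"])  # (0.8 + 0.1) / 2 = 0.45
```

Using `m.get(w, 0.0)` also covers the case, noted below, where the models' vocabularies differ.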

The description so far has assumed that all the language models 10_* share the same vocabulary, but the words contained in each language model 10_* may differ. In that case, all the words contained in any of the language models 10_* are registered in the dictionary WFST 15, and the common unigram value Pc(w) computed by Equation (1) or Equation (4) is obtained for all of those words. The subsequent method of creating the first-stage WFST is the same as described above.

[Speech recognition device]
FIG. 5 shows an example functional configuration of the speech recognition device 200 of the present invention. The speech recognition device 200 comprises a first-stage WFST storage unit 19 that stores the first-stage WFST created by the speech recognition WFST creation device 100 described above, a second-stage WFST storage unit 20 that stores the second-stage WFST, and a speech recognition unit 210.

The speech recognition unit 210 performs speech recognition by multi-stage on-the-fly composition using the first-stage and second-stage WFSTs stored in the first-stage WFST storage unit 19 and the second-stage WFST storage unit 20. Because it searches the first-stage WFST, whose size is reduced by the common unigram WFST (Gc), together with the second-stage WFST, which converts the common unigram into trigrams, it can perform accurate recognition even with a small memory size. Multi-stage on-the-fly composition is well known; see Reference 2 (Takaaki Hori and Atsushi Nakamura, "Generalized fast on-the-fly composition algorithm for WFST-based speech recognition," Proc. of INTERSPEECH 2005).

Because the speech recognition WFST creation device 100 of the present invention forms the first-stage WFST by sharing the common unigram across the WFSTs corresponding to the individual acoustic models, the size of the first-stage WFSTs created per acoustic model can be suppressed. And because the speech recognition device 200 of the present invention performs multi-stage on-the-fly recognition using the first-stage and second-stage WFSTs created by the device 100, it can achieve high-accuracy recognition even with a small memory size.

When the processing means of the above devices are realized by a computer, the processing content of each device's functions is described by a program, and executing that program on the computer realizes the processing means of each device.

The program may be distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which it is recorded. It may also be distributed by storing it in the recording device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of its processing may be realized in hardware.

Claims (8)

1. A speech recognition WFST creation device comprising:
a common unigram WFST creation unit that reads the appearance frequency of each word from a plurality of language models, calculates a common unigram value, creates a common unigram WFST from the common unigram value, and outputs the common unigram value and the common unigram WFST;
a common unigram WFST storage unit that stores the common unigram value and the common unigram WFST;
a first-stage WFST creation unit that creates, for each of N (N ≥ 1) acoustic models, a WFST obtained by composing the WFST of that acoustic model with a triphone WFST, a dictionary WFST, and the common unigram WFST, and joins the WFSTs of all the acoustic models to form a first-stage WFST; and
a second-stage WFST creation unit that takes the common unigram value as input, refers to each of the plurality of language models, calculates for each language model the n-gram probability with the common unigram value removed, creates a WFST of that n-gram probability, and joins the n-gram probability WFSTs of all the language models to form a second-stage WFST.

2. The speech recognition WFST creation device according to claim 1, wherein the common unigram WFST creation unit calculates the common unigram value without using the appearance frequencies of the words of the plurality of language models.

3. A speech recognition device comprising:
a first-stage WFST storage unit that stores the first-stage WFST created by the speech recognition WFST creation device according to claim 1 or 2, and a second-stage WFST storage unit that stores the second-stage WFST; and
a speech recognition unit that performs speech recognition by multi-stage on-the-fly composition with reference to the first-stage WFST storage unit and the second-stage WFST storage unit.

4. A speech recognition WFST creation method comprising:
a common unigram WFST creation step of reading the appearance frequency of each word from a plurality of language models, calculating a common unigram value, creating a common unigram WFST from the common unigram value, and outputting the common unigram value and the common unigram WFST;
a first-stage WFST creation step of creating, for each of N (N ≥ 1) acoustic models, a WFST obtained by composing the WFST of that acoustic model with a triphone WFST, a dictionary WFST, and the common unigram WFST, and joining the WFSTs of all the acoustic models to form a first-stage WFST; and
a second-stage WFST creation step of taking the common unigram value as input, referring to each of the plurality of language models, calculating for each language model the n-gram probability with the common unigram value removed, creating a WFST of that n-gram probability, and joining the n-gram probability WFSTs of all the language models to form a second-stage WFST.

5. The speech recognition WFST creation method according to claim 4, wherein the common unigram WFST creation step calculates the common unigram value without using the appearance frequencies of the words of the plurality of language models.

6. A speech recognition method including a speech recognition step of performing speech recognition by multi-stage on-the-fly composition using the first-stage WFST and the second-stage WFST created by the speech recognition WFST creation method according to claim 4 or 5.

7. A program for causing a computer to execute the functions of the units of either the speech recognition WFST creation device according to claim 1 or 2, or the speech recognition device according to claim 3.

8. A computer-readable recording medium on which the program according to claim 7 is recorded.
JP2013226121A 2013-10-31 2013-10-31 WFST creation device for speech recognition, speech recognition device, method and program thereof, and recording medium Active JP6193726B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013226121A JP6193726B2 (en) 2013-10-31 2013-10-31 WFST creation device for speech recognition, speech recognition device, method and program thereof, and recording medium


Publications (2)

Publication Number Publication Date
JP2015087556A true JP2015087556A (en) 2015-05-07
JP6193726B2 JP6193726B2 (en) 2017-09-06

Family

ID=53050410

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2013226121A Active JP6193726B2 (en) 2013-10-31 2013-10-31 WFST creation device for speech recognition, speech recognition device, method and program thereof, and recording medium

Country Status (1)

Country Link
JP (1) JP6193726B2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009058989A (en) * 2007-08-29 2009-03-19 Toshiba Corp Determination method for automaton, determination method for finite state transducer, automaton determination device and determination program
JP2013142959A (en) * 2012-01-10 2013-07-22 National Institute Of Information & Communication Technology Language model combining device, language processing device, and program
JP2015014774A (en) * 2013-06-03 2015-01-22 日本電信電話株式会社 Speech recognition wfst creation device, speech recognition device, speech recognition wfst creation method, speech recognition method, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Takaaki Hori, Yasuhiro Minami: "Performance improvements of a finite-state-transducer-based decoder", Proceedings of the 2004 Spring Meeting of the Acoustical Society of Japan -I-, JPN6017005701, 17 March 2004 (2004-03-17), JP, pages 131-132 *
Tsubasa Onishi, Paul Dixon, Koji Iwano, Sadaoki Furui: "A study on memory reduction for a WFST speech recognition decoder", Proceedings of the 2008 Spring Meeting of the Acoustical Society of Japan (CD-ROM), JPN6017005703, 19 March 2008 (2008-03-19), JP, pages 7-10 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683677A (en) * 2015-11-06 2017-05-17 阿里巴巴集团控股有限公司 Method and device for recognizing voice
US10741170B2 (en) 2015-11-06 2020-08-11 Alibaba Group Holding Limited Speech recognition method and apparatus
US11664020B2 (en) 2015-11-06 2023-05-30 Alibaba Group Holding Limited Speech recognition method and apparatus
CN106356054A (en) * 2016-11-23 2017-01-25 广西大学 Method and system for collecting information of agricultural products based on voice recognition
CN113011198A (en) * 2021-03-05 2021-06-22 北京嘀嘀无限科技发展有限公司 Information interaction method and device and electronic equipment
CN113011198B (en) * 2021-03-05 2022-07-22 北京嘀嘀无限科技发展有限公司 Information interaction method and device and electronic equipment

Similar Documents

Publication Publication Date Title
JP5331801B2 (en) Method and apparatus for calculating language model look-ahead probability
JP5377889B2 (en) Language processing apparatus and program
JP6614639B2 (en) Speech recognition apparatus and computer program
WO2017213055A1 (en) Speech recognition device and computer program
CN107705787A (en) A kind of audio recognition method and device
JP5554304B2 (en) Automaton determinizing method, automaton determinizing apparatus and automaton determinizing program
KR102375115B1 (en) Phoneme-Based Contextualization for Cross-Language Speech Recognition in End-to-End Models
JPH0772840B2 (en) Speech model configuration method, speech recognition method, speech recognition device, and speech model training method
JP4930379B2 (en) Similar sentence search method, similar sentence search system, and similar sentence search program
JP6095588B2 (en) Speech recognition WFST creation device, speech recognition device, speech recognition WFST creation method, speech recognition method, and program
JP2006243728A (en) Method for converting phoneme to text, and its computer system and computer program
JP5249967B2 (en) Speech recognition device, weight vector learning device, speech recognition method, weight vector learning method, program
CN112669845B (en) Speech recognition result correction method and device, electronic equipment and storage medium
JP2017009842A (en) Speech recognition result output device, speech recognition result output method and speech recognition result output program
KR20170134115A (en) Voice recognition apparatus using WFST optimization and method thereof
JP6193726B2 (en) WFST creation device for speech recognition, speech recognition device, method and program thereof, and recording medium
KR20120052591A (en) Apparatus and method for error correction in a continuous speech recognition system
JP6301794B2 (en) Automaton deformation device, automaton deformation method and program
JP5875569B2 (en) Voice recognition apparatus, method, program, and recording medium
JP6558856B2 (en) Morphological analyzer, model learning device, and program
JP6078435B2 (en) Symbol string conversion method, speech recognition method, apparatus and program thereof
JP5881157B2 (en) Information processing apparatus and program
JP4478088B2 (en) Symbol string conversion method, speech recognition method, symbol string converter and program, and recording medium
JP2003271188A (en) Device and method for processing language
JP2006343405A (en) Speech-understanding device, speech-understanding method, method for preparing word/semantic expression merge database, its program and storage medium

Legal Events

Date Code Title Description
2016-01-13 A621 Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2017-01-26 A977 Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007)
2017-02-28 A131 Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2017-03-30 A521 Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
TRDD Decision of grant or rejection written
2017-08-08 A01 Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01)
2017-08-10 A61 First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61)
R150 Certificate of patent or registration of utility model; Ref document number: 6193726; Country of ref document: JP (JAPANESE INTERMEDIATE CODE: R150)