JP2019219598A

JP2019219598A - Voice recognition apparatus, voice recognition method, and program

Info

Publication number: JP2019219598A
Application number: JP2018118508A
Authority: JP
Inventors: 寛基富田; Hiroki Tomita
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2018-06-22
Filing date: 2018-06-22
Publication date: 2019-12-26

Abstract

To provide a voice recognition apparatus, a voice recognition method, and a program capable of improving the accuracy of voice recognition. SOLUTION: In a voice recognition apparatus 10, a score-calculating part 152 calculates, for each of a plurality of candidates for the correspondence between a plurality of frames and a plurality of phonemes, a score indicating the likelihood of that correspondence. A specifying part 153 specifies one of the candidates as the correspondence on the basis of the score calculated for each candidate by the score-calculating part 152. A determining part 160 determines, on the basis of the correspondence specified by the specifying part 153, whether a target word is uttered in a voice signal. For each candidate, the score-calculating part 152 calculates the score by normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to that phoneme, and then accumulating the normalized values over the plurality of frames. SELECTED DRAWING: Figure 2

Description

本発明は、音声認識装置、音声認識方法及びプログラムに関する。 The present invention relates to a speech recognition device, a speech recognition method, and a program.

Speech recognition techniques are known. For example, Patent Document 1 discloses, as a technique related to speech recognition, a speech search device that searches a speech signal for the portion in which speech corresponding to a search word (query) is uttered. Specifically, the speech search device disclosed in Patent Document 1 uses dynamic programming (DP matching) to search for the correspondence between the frames of the speech signal to be searched and the phonemes corresponding to the search word and, on the basis of the search result, identifies the section of the speech signal in which speech corresponding to the search word is estimated to be uttered. To suppress false detection of the search word in the speech signal, the device calculates a normalized likelihood by normalizing a value based on the output probability of the phoneme associated with each frame by the number of frames associated with that phoneme.

Patent Document 1: JP 2015-169698 A

In the above method of recognizing speech by searching for the correspondence between frames and phonemes, it is desirable to improve the accuracy of speech recognition by improving the accuracy of that search.

The present invention has been made to solve the above problem, and an object thereof is to provide a speech recognition device, a speech recognition method, and a program capable of improving the accuracy of speech recognition.

In order to achieve the above object, a speech recognition device according to the present invention comprises:
output probability acquisition means for acquiring, for each of a plurality of frames in an audio signal, an output probability that a feature amount of the audio signal is output from each of a plurality of phonemes corresponding to a target word, the target word being the subject of a determination as to whether it is uttered in the audio signal;
score calculation means for calculating, based on the output probabilities acquired by the output probability acquisition means, a score indicating the likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
specifying means for specifying one of the plurality of candidates as the correspondence, based on the score calculated for each of the plurality of candidates by the score calculation means; and
determination means for determining whether the target word is uttered in the audio signal, based on the correspondence specified by the specifying means,
wherein the score calculation means calculates the score by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the normalized value over the plurality of frames.
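The per-phoneme normalization described above can be illustrated with a minimal sketch. It assumes that the "value based on the output probability" is a per-frame distance (for example, a negative log probability), and that frames are grouped by the position of their phoneme in the phoneme string. The function names and the sample numbers are hypothetical and are not taken from the specification; a score normalized by the total number of frames is included only for comparison.

```python
from collections import Counter


def per_phoneme_normalized_score(distances, phoneme_index):
    """Accumulate, over all frames, each frame's value divided by the
    number of frames aligned to the same phoneme (assumed reading)."""
    if len(distances) != len(phoneme_index):
        raise ValueError("one phoneme index is required per frame")
    frames_per_phoneme = Counter(phoneme_index)
    return sum(d / frames_per_phoneme[p]
               for d, p in zip(distances, phoneme_index))


def total_frame_normalized_score(distances):
    """Comparison only: normalize the accumulated value by the total
    number of frames."""
    return sum(distances) / len(distances)


# Hypothetical alignment: three frames on phoneme 0, one frame on phoneme 1.
distances = [2.0, 4.0, 6.0, 1.0]
phoneme_index = [0, 0, 0, 1]
per_phoneme = per_phoneme_normalized_score(distances, phoneme_index)
per_frame = total_frame_normalized_score(distances)
```

With per-phoneme normalization, each phoneme contributes its mean value regardless of how many frames it spans, so a long phoneme does not dominate the score. Indexing phonemes by position rather than by label keeps repeated phonemes in one word (such as the two "k" phonemes in "cake") separate.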

本発明によれば、音声認識の精度を向上させることができる。 According to the present invention, the accuracy of speech recognition can be improved.

FIG. 1 is a diagram showing the hardware configuration of a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional configuration of the speech recognition device according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of frames set in a speech signal to be recognized in the embodiment of the present invention.
FIG. 4 is a diagram showing an example of a distance table in the embodiment of the present invention.
FIG. 5 is a diagram showing an example in which the correspondence between frames and phonemes is specified over the entire section of the speech signal to be recognized, in the distance table shown in FIG. 4.
FIG. 6 is a diagram showing an example in which one of the plurality of frames and one of the plurality of phoneme states are designated in the distance table shown in FIG. 4.
FIG. 7 is a diagram showing an example of two candidates for the correspondence between the frames from the first frame to the designated frame and the phoneme states from the first phoneme state to the designated phoneme state, in the distance table shown in FIG. 6.
FIG. 8 is a diagram showing an example of calculating a state stay score in the embodiment of the present invention.
FIG. 9 is a diagram showing an example of calculating a state transition score in the embodiment of the present invention.
FIG. 10 (a) and (b) are diagrams showing an example for comparing the case where the score is calculated by normalizing by the number of frames for each phoneme with the case where the score is calculated by normalizing by the total number of frames.
FIG. 11 is a flowchart showing the flow of speech recognition processing executed by the speech recognition device according to the embodiment of the present invention.
FIG. 12 is a flowchart showing the flow of search processing in the speech recognition processing shown in FIG. 11.

以下、本発明の実施形態について、図面を参照して説明する。なお、図中同一又は相当する部分には同一符号を付す。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding portions are denoted by the same reference numerals.

図１に、本実施形態に係る音声認識装置１０のハードウェア構成を示す。図１に示すように、音声認識装置１０は、制御部１１と、記憶部１２と、入力部１３と、出力部１４と、通信部１５と、を備える。 FIG. 1 shows a hardware configuration of a speech recognition device 10 according to the present embodiment. As shown in FIG. 1, the speech recognition device 10 includes a control unit 11, a storage unit 12, an input unit 13, an output unit 14, and a communication unit 15.

制御部１１は、ＣＰＵ（Central Processing Unit）、ＲＯＭ（Read Only Memory）及びＲＡＭ（Random Access Memory）を備える。ＣＰＵは、例えばマイクロプロセッサ等であって、様々な処理や演算を実行する中央演算処理部である。制御部１１において、ＣＰＵが、ＲＯＭに記憶されている制御プログラムを読み出して、ＲＡＭをワークメモリとして用いながら、音声認識装置１０全体の動作を制御する。制御部１１は、制御手段として機能する。 The control unit 11 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory). The CPU is, for example, a microprocessor or the like, and is a central processing unit that executes various processes and calculations. In the control unit 11, the CPU reads the control program stored in the ROM and controls the operation of the entire speech recognition device 10 while using the RAM as a work memory. The control unit 11 functions as control means.

記憶部１２は、フラッシュメモリ、ハードディスク等の不揮発性メモリである。記憶部１２は、ＯＳ（Operating System）及びアプリケーションプログラムを含む、制御部１１が各種処理を行うために使用するプログラム及びデータを記憶する。また、記憶部１２は、制御部１１が各種処理を行うことにより生成又は取得するデータを記憶する。 The storage unit 12 is a nonvolatile memory such as a flash memory and a hard disk. The storage unit 12 stores programs and data used by the control unit 11 to perform various processes, including an OS (Operating System) and application programs. The storage unit 12 stores data generated or obtained by the control unit 11 performing various processes.

入力部１３は、入力キー、ボタン、スイッチ、タッチパッド、タッチパネル等の入力デバイスを備える。入力部１３は、ユーザから入力された操作指示を受け付け、受け付けた操作指示を制御部１１に送信する。また、入力部１３は、マイクロフォン等の音声入力部を備えており、音声認識装置１０の外部で発せられた音声信号の入力を受け付ける。音声入力部により受け付けられた音声信号は、図示しないアナログデジタル変換器により規定のサンプリング周波数でデジタル信号に変換され、制御部１１に送信される。 The input unit 13 includes input devices such as input keys, buttons, switches, touch pads, and touch panels. The input unit 13 receives an operation instruction input from a user, and transmits the received operation instruction to the control unit 11. The input unit 13 includes a voice input unit such as a microphone, and receives an input of a voice signal generated outside the voice recognition device 10. The audio signal received by the audio input unit is converted into a digital signal at a specified sampling frequency by an analog-to-digital converter (not shown), and transmitted to the control unit 11.

出力部１４は、液晶ディスプレイ、有機ＥＬ（Electro Luminescence）ディスプレイ等の表示部と、スピーカ等の音声出力部と、を備える。表示部は、図示しない表示駆動回路によって駆動されて、状況に応じた様々な画像を表示する。なお、表示部は、入力部１３と互いに重畳して配置され、表示部と入力部１３とでいわゆるタッチパネル（タッチスクリーン）を構成しても良い。音声出力部は、スピーカと音声出力インタフェースとを備え、制御部１１によって生成された音声データを音声に変換して外部に出力する。 The output unit 14 includes a display unit such as a liquid crystal display and an organic EL (Electro Luminescence) display, and an audio output unit such as a speaker. The display unit is driven by a display drive circuit (not shown) and displays various images according to the situation. The display unit may be arranged so as to overlap with the input unit 13, and the display unit and the input unit 13 may constitute a so-called touch panel (touch screen). The audio output unit includes a speaker and an audio output interface, converts audio data generated by the control unit 11 into audio, and outputs the audio to the outside.

通信部１５は、音声認識装置１０が外部の機器と通信するための通信モジュールである。通信部１５は、例えばＷｉ−Ｆｉ（Wireless Fidelity）、ＵＳＢ（Universal Serial Bus）、Ｂｌｕｅｔｏｏｔｈ（登録商標）、ＮＦＣ（Near Field Communication）等の通信規格に従って、外部の機器と通信する。通信部１５は、制御部１１の制御の下、このような有線又は無線による通信を介して、外部の機器と各種のデータ及び情報を送受信する。 The communication unit 15 is a communication module for the voice recognition device 10 to communicate with an external device. The communication unit 15 communicates with external devices according to communication standards such as Wi-Fi (Wireless Fidelity), USB (Universal Serial Bus), Bluetooth (registered trademark), and NFC (Near Field Communication). The communication unit 15 transmits and receives various data and information to and from an external device through such wired or wireless communication under the control of the control unit 11.

Next, the functional configuration of the speech recognition device 10 will be described with reference to FIG. 2. As shown in FIG. 2, the speech recognition device 10 functionally includes a speech signal acquisition unit 110, a feature amount calculation unit 120, an output probability acquisition unit 130, a conversion unit 140, a search unit 150, and a determination unit 160. In the control unit 11, the CPU reads the program stored in the ROM into the RAM and executes it, thereby functioning as each of these units.

また、音声認識装置１０は、音響モデル記憶部２１と、候補語記憶部２２と、を備える。音響モデル記憶部２１及び候補語記憶部２２は、記憶部１２の適宜の記憶領域に構築されており、それぞれ音響モデル記憶手段及び候補語記憶手段として機能する。 Further, the speech recognition device 10 includes an acoustic model storage unit 21 and a candidate word storage unit 22. The acoustic model storage unit 21 and the candidate word storage unit 22 are constructed in appropriate storage areas of the storage unit 12, and function as an acoustic model storage unit and a candidate word storage unit, respectively.

The speech signal acquisition unit 110 acquires a speech signal to be recognized. The speech signal to be recognized is a digital signal representing the speech that the speech recognition device 10 is to recognize. The speech signal acquisition unit 110 acquires, via the microphone of the input unit 13, a signal representing, for example, speech uttered by a user or speech uttered in a conference, on television, or in a movie. Alternatively, the speech signal acquisition unit 110 may acquire the speech signal to be recognized from the outside via the communication unit 15, or by reading a speech signal stored in advance in the storage unit 12. The speech signal acquisition unit 110 is realized by the control unit 11 cooperating with the storage unit 12, the input unit 13, or the communication unit 15. The speech signal acquisition unit 110 functions as speech signal acquisition means.

The feature amount calculation unit 120 calculates, for each frame, a feature amount (acoustic feature amount) of the speech signal acquired by the speech signal acquisition unit 110. The feature amount is a parameter that serves as an index of the characteristics of the speech signal. Specifically, the feature amount is obtained by combining frequency-axis feature parameters, called the cepstrum or mel cepstrum, which are obtained by transforming the speech signal onto the frequency axis, with power feature parameters, which are obtained by calculating the energy sum of squares of the speech signal or its logarithm.

For example, the feature amount is expressed as a 38-dimensional vector with 38 components in total: 12 frequency-axis feature parameters (12 dimensions) and 1 power feature parameter (1 dimension); 12 Δ frequency-axis feature parameters (12 dimensions) and 1 Δ power feature parameter (1 dimension), each being the difference from the corresponding component of the immediately preceding frame; and 12 ΔΔ frequency-axis feature parameters (12 dimensions), each being the difference of those differences. Alternatively, a 40-dimensional mel filter bank output, applied to the input layer of a neural network, may be used as the feature amount.
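The 38-component layout (12 + 1 static, 12 + 1 Δ, 12 ΔΔ) can be sketched as follows. This is an illustrative sketch only: the function name, the ordering of components within the vector, and the choice of zero Δ and ΔΔ values for frames that have no predecessor are assumptions, not details given in the text.

```python
N_CEPSTRAL = 12  # frequency-axis feature parameters per frame
N_STATIC = 13    # 12 frequency-axis parameters + 1 power parameter


def add_deltas(static_frames):
    """Append delta and delta-delta components to per-frame static features.

    static_frames: list of 13-element lists (12 cepstral values, then power).
    Returns a list of 38-element lists: 13 static + 13 delta + 12 delta-delta.
    """
    features = []
    prev_static = None
    prev_delta = None
    for static in static_frames:
        if len(static) != N_STATIC:
            raise ValueError("each frame needs 13 static components")
        # Delta: difference from the immediately preceding frame.
        # Assumption: zero when there is no preceding frame.
        if prev_static is None:
            delta = [0.0] * N_STATIC
        else:
            delta = [c - p for c, p in zip(static, prev_static)]
        # Delta-delta: difference of the deltas, cepstral components only.
        if prev_delta is None:
            delta2 = [0.0] * N_CEPSTRAL
        else:
            delta2 = [c - p for c, p in
                      zip(delta[:N_CEPSTRAL], prev_delta[:N_CEPSTRAL])]
        features.append(list(static) + delta + delta2)
        prev_static, prev_delta = static, delta
    return features


# Hypothetical static features: every component equals 0, 1, then 3.
frames = [[float(v)] * N_STATIC for v in (0, 1, 3)]
feats = add_deltas(frames)
```

For the third frame in this toy input, each Δ component is 3 − 1 = 2 and each ΔΔ component is 2 − 1 = 1.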

A frame is a time window of a predetermined length in the speech signal. Specifically, the upper part of FIG. 3 shows an example of the waveform of a speech signal to be recognized, whose time length from beginning to end is L. In this waveform diagram, the vertical axis represents the amplitude (energy) of the waveform, and the horizontal axis represents time. The lower part of FIG. 3 shows the frames set in the speech signal shown in the upper part. From the first frame to the last frame of the speech signal, T frames, each of frame length F, are set, each shifted by a prescribed shift length S. The frame length F and the shift length S are set to match the time lengths determined when the acoustic model was created, for example F = 25 msec and S = 10 msec. Because the frame length F is longer than the shift length S, each frame overlaps its adjacent frame by a time length of (F − S).
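As a rough illustration of this framing, the sketch below computes the sample range of each frame for F = 25 msec and S = 10 msec. The 16 kHz sampling frequency, the function name, and the decision to drop a trailing partial frame are assumptions; the text does not specify them.

```python
def frame_bounds(num_samples, sample_rate_hz=16000, frame_ms=25, shift_ms=10):
    """Return (start, end) sample indices of each frame, end exclusive.

    Assumption: a trailing portion shorter than one frame is discarded.
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # F in samples
    shift_len = sample_rate_hz * shift_ms // 1000   # S in samples
    if num_samples < frame_len:
        return []
    count = 1 + (num_samples - frame_len) // shift_len  # T
    return [(k * shift_len, k * shift_len + frame_len) for k in range(count)]


# One second of audio at the assumed 16 kHz sampling frequency.
bounds = frame_bounds(16000)
overlap_samples = bounds[0][1] - bounds[1][0]  # corresponds to F - S
```

Under these assumptions each frame holds 400 samples, consecutive frames start 160 samples apart, and adjacent frames share 240 samples, which corresponds to F − S = 15 msec.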

特徴量算出部１２０は、このように設定されたフレーム単位で、認識対象の音声信号をフーリエ変換することで周波数スペクトルに変換する。そして、特徴量算出部１２０は、得られた周波数スペクトルにメルフィルタバンクを適用することにより、音声信号の特徴量をフレーム毎に抽出する。特徴量算出部１２０は、制御部１１によって実現される。特徴量算出部１２０は、特徴量算出手段として機能する。 The feature amount calculation unit 120 converts the speech signal to be recognized into a frequency spectrum by performing a Fourier transform on the frame unit thus set. Then, the characteristic amount calculation unit 120 extracts a characteristic amount of the audio signal for each frame by applying a mel filter bank to the obtained frequency spectrum. The feature amount calculation unit 120 is realized by the control unit 11. The feature amount calculation unit 120 functions as a feature amount calculation unit.

図２に戻って、出力確率取得部１３０は、特徴量算出部１２０によって算出された特徴量が、音響モデル記憶部２１に記憶された音響モデルの各音素から出力される出力確率を取得する。音響モデルとは、音声認識装置１０によって認識可能な言葉を構成する複数の音素のそれぞれについて、その周波数特性をモデル化したものである。音素とは、母音、子音等、音声認識の対象となる言語において音声の区切りとなる基本単位である。 Returning to FIG. 2, the output probability acquisition unit 130 acquires an output probability that the feature amount calculated by the feature amount calculation unit 120 is output from each phoneme of the acoustic model stored in the acoustic model storage unit 21. The acoustic model is obtained by modeling the frequency characteristics of each of a plurality of phonemes constituting words recognizable by the speech recognition device 10. A phoneme is a basic unit that serves as a speech delimiter in a language to be subjected to speech recognition, such as a vowel or a consonant.

The acoustic model storage unit 21 stores, for example, an acoustic model based on monophones (one phoneme) (monophone model), an acoustic model based on biphones (two phonemes) (biphone model), and an acoustic model based on triphones (three phonemes) (triphone model). The monophone model is an acoustic model generated for each single phoneme; it does not depend on adjacent phonemes, that is, the state transitions to and from the preceding and following phoneme states are fixed. The biphone model and the triphone model are acoustic models generated for every two phonemes and every three phonemes, respectively, and depend on adjacent phonemes. The biphone model takes into account the state transition with either the preceding or the following phoneme state. The triphone model takes into account the state transitions with both the preceding and the following phoneme states.

以下では、理解を容易にするため、音響モデルとしてモノフォンモデルを用いる場合を例にとって説明するが、バイフォンモデル又はトライフォンモデルを用いる場合であっても同様に説明可能である。音声認識装置１０は、音響モデルを一般的な方法で学習して、音響モデル記憶部２１に予め記憶しておく。 In the following, for ease of understanding, a case where a monophone model is used as an acoustic model will be described as an example. However, a case where a biphone model or a triphone model is used can be similarly described. The speech recognition device 10 learns the acoustic model by a general method, and stores the acoustic model in the acoustic model storage unit 21 in advance.

As the acoustic model, for example, an HMM (Hidden Markov Model), which is an acoustic model used in general speech recognition, can be used. The HMM is a model for probabilistically estimating, by a statistical method, the words from which a speech signal originated. The HMM uses a standard pattern whose parameters are the transition probabilities, which represent temporal fluctuation of the states, and the probabilities that each state outputs an input feature amount (output probabilities).

The output probability is represented by a Gaussian mixture distribution obtained by adding a plurality of Gaussian (normal) distributions, each weighted by a predetermined weighting factor. The output probability acquisition unit 130 compares each phoneme of the acoustic model stored in the acoustic model storage unit 21 with the feature amount of each frame calculated by the feature amount calculation unit 120, and calculates the output probability that the feature amount of each frame is output from each phoneme. The output probability acquisition unit 130 is realized by the control unit 11 cooperating with the storage unit 12. The output probability acquisition unit 130 functions as output probability acquisition means.
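A weighted sum of Gaussian distributions of this kind can be sketched as below. The use of diagonal covariances, the function names, and the parameter values are assumptions made for illustration; the text only states that weighted Gaussian distributions are added. Strictly speaking, the value returned is a likelihood density rather than a probability.

```python
import math


def gaussian_density_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (assumed form)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def gmm_output_probability(x, weights, means, variances):
    """Weighted sum of Gaussian densities for one phoneme state."""
    return sum(w * gaussian_density_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical one-dimensional mixture with two components.
weights = [0.5, 0.5]
means = [[0.0], [2.0]]
variances = [[1.0], [1.0]]
p_near = gmm_output_probability([0.0], weights, means, variances)
p_far = gmm_output_probability([10.0], weights, means, variances)
```

A feature vector close to one of the component means yields a larger value than one far from all components, which is what lets the output probability discriminate between phoneme states.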

Alternatively, the output probability acquisition unit 130 may acquire the output probabilities using a neural network. In that case, the output probability acquisition unit 130 inputs the feature amount of the speech signal, obtained as a 40-dimensional mel filter bank output, to the input layer of the neural network. The output probability acquisition unit 130 then acquires the values output from the output layer of the neural network as the output probabilities of this feature amount for each phoneme.

より詳細に説明すると、出力確率取得部１３０は、出力確率を、音素の状態（音素状態）毎に取得する。音素の状態とは、音素を時間方向に細分化した単位であり、音響モデルの最小単位に相当する。音響モデルの各音素には、予め状態数が定められている。 More specifically, the output probability obtaining unit 130 obtains the output probability for each phoneme state (phoneme state). The phoneme state is a unit obtained by subdividing the phoneme in the time direction, and corresponds to a minimum unit of the acoustic model. Each phoneme of the acoustic model has a predetermined number of states.

In the following, the case where the number of states defined for each phoneme is "3" will be described as an example. For example, the phoneme "a" is divided into three states: a first state "a1" that includes the start of the utterance of the phoneme, a second state "a2" that is the intermediate state, and a third state "a3" that includes the end of the utterance. If the total number of phonemes used in the acoustic model is K, there are (3 × K) phoneme states. The output probability acquisition unit 130 calculates (3 × K × T) output probabilities, one for each of the (3 × K) phoneme states in each of the T frames from the beginning to the end of the speech signal to be recognized.

The candidate word storage unit 22 stores a plurality of candidate words. A candidate word is a candidate for a word uttered in the speech signal to be recognized, and is a target word that is the subject of a determination as to whether it is uttered in the speech signal to be recognized. The candidate word storage unit 22 stores, as candidate words, a plurality of typical expressions such as words, phrases, and clauses in the language to be recognized. For example, when the language to be recognized is Japanese, the candidate word storage unit 22 stores a large number of Japanese expressions as candidate words. When the language to be recognized is English, the candidate word storage unit 22 stores a large number of English expressions as candidate words. The speech recognition device 10 stores, in advance in the candidate word storage unit 22, a plurality of candidate words expected to be recognized.

変換部１４０は、候補語記憶部２２に記憶された複数の候補語のそれぞれを、音素列に変換する。音素列とは、文字列を構成する少なくとも１つの文字に対応する音素を、文字列における文字と同じ順序で並べたものである。変換部１４０は、複数の候補語のそれぞれについて、音響モデル記憶部２１に記憶された音響モデルの複数の音素のうちの、候補語を構成する少なくとも１つの文字に対応する音素を並べることにより、候補語を音素列に変換する。 The conversion unit 140 converts each of the plurality of candidate words stored in the candidate word storage unit 22 into a phoneme string. The phoneme sequence is a sequence in which phonemes corresponding to at least one character constituting the character string are arranged in the same order as the characters in the character string. The conversion unit 140 arranges, for each of the plurality of candidate words, phonemes corresponding to at least one character forming the candidate word, among a plurality of phonemes of the acoustic model stored in the acoustic model storage unit 21. Convert candidate words to phoneme strings.

例えば、候補語が日本語「ラーメン」である場合、「ラーメン」は「ｒ」と「ａ：」と「ｍ」と「ｅ」と「Ｎ」との５つの音素（モノフォン）を含むため、変換部１４０は、候補語「ラーメン」を音素列「ｒ，ａ：，ｍ，ｅ，Ｎ」に変換する。或いは、候補語が英語「ｃａｋｅ」である場合、「ｃａｋｅ」は「ｋ」と「ｅ」と「ｉ」と「ｋ」との４つの音素（モノフォン）を含むため、変換部１４０は、候補語「ｃａｋｅ」を音素列「ｋ，ｅ，ｉ，ｋ」に変換する。変換部１４０は、制御部１１が記憶部１２と協働することによって実現される。変換部１４０は、変換手段として機能する。 For example, if the candidate word is Japanese "Ramen", "Ramen" includes five phonemes (monophone) of "r", "a:", "m", "e", and "N", The conversion unit 140 converts the candidate word “ramen” into a phoneme string “r, a :, m, e, N”. Alternatively, when the candidate word is English “cake”, since “cake” includes four phonemes (monophone) of “k”, “e”, “i”, and “k”, the conversion unit 140 The word “cake” is converted into a phoneme string “k, e, i, k”. The conversion unit 140 is realized by the control unit 11 cooperating with the storage unit 12. The conversion unit 140 functions as a conversion unit.
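The conversion from a candidate word to a phoneme string, and then to a sequence of phoneme states, can be sketched as follows. The dictionary-based lexicon and the function names are hypothetical; only the two example entries and the state count of three per phoneme come from the text.

```python
# Hypothetical pronunciation lexicon; only these two entries appear in the text.
LEXICON = {
    "ラーメン": ["r", "a:", "m", "e", "N"],
    "cake": ["k", "e", "i", "k"],
}

STATES_PER_PHONEME = 3  # number of states per phoneme assumed in the example


def to_phoneme_string(word):
    """Look up the phoneme string (monophones) for a candidate word."""
    return list(LEXICON[word])


def to_phoneme_states(word):
    """Expand each phoneme into its states, e.g. "r" -> "r1", "r2", "r3"."""
    return [f"{phoneme}{state}"
            for phoneme in to_phoneme_string(word)
            for state in range(1, STATES_PER_PHONEME + 1)]


ramen_states = to_phoneme_states("ラーメン")
cake_states = to_phoneme_states("cake")
```

Five phonemes with three states each give the 15 phoneme states that form the columns of the distance table for "ラーメン"; the repeated "k" in "cake" produces two separate groups of states because the expansion is positional.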

Based on the output probabilities acquired by the output probability acquisition unit 130, the search unit 150 searches for the correspondence between each frame of the speech signal to be recognized and each phoneme of the phoneme string produced by the conversion unit 140. The correspondence between frames and phonemes is information that defines, on the assumption that the candidate word is uttered in the speech signal to be recognized, which phoneme of the phoneme string corresponding to the candidate word each frame of the speech signal corresponds to.

For each of the plurality of candidate words stored in the candidate word storage unit 22, the search unit 150 searches for the correspondence between frames and phonemes by solving the HMM using dynamic programming (DP matching). The search unit 150 is realized by the control unit 11. The search unit 150 functions as search means.

フレームと音素との対応関係を探索するため、探索部１５０は、距離テーブルを生成する。ここで、距離とは、各音素の音響モデルと各フレームにおける音声信号の特徴量との違いの度合を示す指標である。距離は、出力確率の対数をとった値の符号を逆にすることにより得られる。ある音素のあるフレームにおける距離が小さいほど、その音素からそのフレームにおける音声信号の特徴量が出力される確率が大きい。すなわち、ある音素のあるフレームにおける距離が小さいほど、その音素の音響モデルとそのフレームにおける音声信号の特徴量とが近いことを表す。探索部１５０は、候補語から変換された音素列の各音素について出力確率取得部１３０により取得された出力確率から距離を算出し、例えば図４に示す距離テーブル３０を生成する。 To search for the correspondence between the frame and the phoneme, the search unit 150 generates a distance table. Here, the distance is an index indicating the degree of difference between the acoustic model of each phoneme and the feature amount of the audio signal in each frame. The distance is obtained by reversing the sign of the logarithmic value of the output probability. The smaller the distance of a certain phoneme in a certain frame, the greater the probability that the feature of the audio signal in the frame is output from the phoneme. In other words, the smaller the distance of a certain phoneme in a certain frame, the closer the acoustic model of the phoneme and the feature amount of the audio signal in the frame are. The search unit 150 calculates a distance from the output probabilities obtained by the output probability obtaining unit 130 for each phoneme of the phoneme string converted from the candidate word, and generates, for example, the distance table 30 shown in FIG.

The distance table 30 is a data table in which the distances obtained for each phoneme of the phoneme string converted from the candidate word are arranged frame by frame. As an example, FIG. 4 shows the distance table 30 generated for the phoneme string "r, a:, m, e, N" converted from the candidate word "ラーメン" (ramen). The search unit 150 prepares a matrix whose rows are the T frames from the first frame to the last frame of the speech signal to be recognized and whose columns are the 15 phoneme states in total of the five phonemes of the phoneme string "r, a:, m, e, N". The search unit 150 then places, in each element of the matrix, the distance calculated from the corresponding output probability.

Specifically, let S(m, i) denote the output probability of the i-th phoneme state from the beginning (the i-th phoneme state; i is a natural number satisfying 1 ≤ i ≤ 15) in the m-th frame from the beginning (the m-th frame; m is a natural number satisfying 1 ≤ m ≤ T). The search unit 150 places the distance D(m, i) = −log S(m, i) as the element at position (m, i), that is, row m and column i of the matrix. The distance table 30 shown in FIG. 4 is thereby obtained. In the distance table 30, a smaller distance D(m, i) means a higher probability that the frame and the phoneme state correspond to each other. Note that the distance values written at the positions in the distance table 30 in FIG. 4 are examples for ease of understanding and are not necessarily accurate values.
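The construction of the distance table can be sketched as follows. This is a minimal illustration rather than the implementation of the search unit 150: the function name `build_distance_table`, the nested-list layout, the choice of the natural logarithm, and the sample probabilities are all assumptions, and every output probability is assumed to be strictly positive.

```python
import math


def build_distance_table(output_probs):
    """Return D with D[m][i] = -log S[m][i].

    output_probs[m][i] holds the output probability of phoneme state i
    in frame m (0-based indices here, 1-based in the description).
    Probabilities are assumed to be strictly positive.
    """
    return [[-math.log(s) for s in row] for row in output_probs]


# Hypothetical output probabilities for 2 frames and 3 phoneme states.
probs = [
    [0.5, 0.25, 0.125],
    [0.1, 0.8, 0.1],
]
table = build_distance_table(probs)
```

In this toy table, the state with the highest output probability in a frame receives the smallest distance, which matches the interpretation given for D(m, i).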

After generating the distance table 30, the search unit 150 searches for the correspondence between frames and phonemes in the distance table 30 by dynamic programming. FIG. 5 shows a hatched path (sequence) as an example of the correspondence between frames and phonemes specified in the distance table 30 shown in FIG. 4. In FIG. 5, the first frame corresponds to the first state of the phoneme "r", the second and third frames correspond to the second state of the phoneme "r", the fourth frame corresponds to the third state of the phoneme "r", the fifth frame corresponds to the first state of the phoneme "a:", and so on, so that one phoneme state is associated with each frame from the first to the last. In this way, the search unit 150 searches for a path (sequence) connecting the position (1, 1), which corresponds to the first phoneme state in the first frame, to the position (T, 15), which corresponds to the last phoneme state in the last frame.

The search performed by the search unit 150 to specify such a correspondence will now be described in more detail. As shown in FIG. 2, the search unit 150 has the functions of a designation unit 151, a score calculation unit 152, a specifying unit 153, and a repetition unit 154. These units are realized by the control unit 11 and function as designation means, score calculation means, specifying means, and repetition means, respectively.

The designation unit 151 designates one of the plurality of frames and one of the plurality of phoneme states in the generated distance table 30. Here, when each phoneme consists of a plurality of states, designating a phoneme state means designating one of the plurality of phonemes and further designating one of its states. When each phoneme consists of only one state, designating a phoneme state is equivalent to designating a phoneme.

The search unit 150 searches, by dynamic programming, for the correspondence between the frames from the first frame to the frame designated by the designation unit 151 and the phoneme states from the first phoneme state to the phoneme state designated by the designation unit 151. For example, when the designation unit 151 designates the m-th frame and the i-th phoneme state, the search unit 150 searches for the correspondence R(m, i) between the m frames from the first frame to the m-th frame and the i phoneme states from the first phoneme state to the i-th phoneme state. The correspondence R(m, i) is the correspondence between the first m frames and the first i phoneme states under the assumption that the m-th frame corresponds to the i-th phoneme state.

Specifically, FIG. 6 shows an example in which the designation unit 151 designates the 15th frame (m = 15) and the eighth phoneme state (i = 8) in the distance table 30, that is, the hatched element in FIG. 6. The eighth phoneme state is the second state of the third phoneme "m". In this case, the search unit 150 searches for the correspondence R(15, 8) between the first 15 frames and the first eight phoneme states on the basis of the 120 (= 15 × 8) values of the distance D(m, i) between position (1, 1) and position (15, 8), that is, the values in the region enclosed by the thick line in FIG. 6.

More specifically, the search unit 150 searches for the correspondence in the designated frame by using the correspondence specified for the frame immediately preceding it. When the designation unit 151 designates the m-th frame and the i-th phoneme state, the search unit 150 specifies the correspondence R(m, i) from two candidates: the case in which the phoneme state transitions between the (m−1)-th frame and the m-th frame, and the case in which it does not.

Here, the case in which the phoneme state transitions between the (m−1)-th frame and the m-th frame is the case in which, in the correspondence R(m, i), the (m−1)-th frame corresponds to the (i−1)-th phoneme state and the m-th frame corresponds to the i-th phoneme state. In contrast, the case in which the phoneme state does not transition is the case in which both the (m−1)-th frame and the m-th frame correspond to the i-th phoneme state.

FIG. 7 shows an example of the two candidates for the correspondence R(15, 8) when the 15th frame and the eighth phoneme state are designated as in FIG. 6. In FIG. 7, the path (sequence) drawn with a solid line represents candidate C1, in which the phoneme state does not transition between the 14th and 15th frames. The path drawn with a broken line represents candidate C2, in which the phoneme state transitions between the 14th and 15th frames.

In the distance table 30 shown in FIG. 7, each of the two candidates C1 and C2 is represented by a path connecting the first position (1, 1) to the designated position (15, 8). The path of the first candidate C1 passes through the position (14, 8), enclosed by a thick solid line, in the 14th frame immediately preceding the designated 15th frame. In other words, the first candidate C1 is the correspondence R(15, 8) under the assumption that both the 14th and 15th frames correspond to the eighth phoneme state. In contrast, the path of the second candidate C2 passes through the position (14, 7), enclosed by a thick broken line, in the 14th frame. In other words, the second candidate C2 is the correspondence R(15, 8) under the assumption that the 14th frame corresponds to the seventh phoneme state and the 15th frame corresponds to the eighth phoneme state.

For the portion of the first candidate C1 from position (1, 1) to position (14, 8), the search unit 150 uses the path of the correspondence R(14, 8) that was specified when the designation unit 151 designated the 14th frame and the eighth phoneme state. Likewise, for the portion of the second candidate C2 from position (1, 1) to position (14, 7), it uses the path of the correspondence R(14, 7) that was specified when the 14th frame and the seventh phoneme state were designated. The search unit 150 specifies the correspondence R(15, 8) from these two candidates C1 and C2 on the basis of the scores calculated by the score calculation unit 152.

The score calculation unit 152 calculates a score used to search for the correspondence R(m, i), on the basis of the output probabilities acquired by the output probability acquisition unit 130. The score is a measure of the likelihood of a correspondence between frames and phonemes, and is also called the likelihood. For each of the two candidates C1 and C2, the score calculation unit 152 calculates the score by normalizing the values based on the output probabilities of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the results over the plurality of frames. Here, the value based on the output probability is, specifically, the distance obtained by taking the logarithm of the output probability and inverting its sign.

Specifically, the score calculation unit 152 calculates P(m, n) given by equation (1) as the score of the path connecting the first position (1, 1) to the designated position (m, i) in the distance table 30. P1(k) and P2(m, n) in equation (1) are expressed by equations (2) and (3), respectively, using the distance D(t) or the output probability S(t) in the t-th frame on the path whose score is being calculated.
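Equations (1) to (3) are not reproduced in this text. Based on the definitions given in the surrounding description (the per-phoneme frame count T(k), the start frame a(k), and the distance D(t) = −log S(t)), they can plausibly be reconstructed as follows. This is a reconstruction from the prose, not the published typesetting.

```latex
\begin{align}
P(m,n) &= \sum_{k=1}^{n-1} P_1(k) + P_2(m,n) \tag{1}\\
P_1(k) &= \frac{1}{T(k)} \sum_{t=a(k)}^{a(k)+T(k)-1} D(t)
        = -\frac{1}{T(k)} \sum_{t=a(k)}^{a(k)+T(k)-1} \log S(t) \tag{2}\\
P_2(m,n) &= \frac{1}{m-a(n)+1} \sum_{t=a(n)}^{m} D(t)
          = -\frac{1}{m-a(n)+1} \sum_{t=a(n)}^{m} \log S(t) \tag{3}
\end{align}
```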

In equation (1), n is the position, counted from the first phoneme, of the phoneme that contains the i-th phoneme state designated by the designation unit 151. For example, the eighth phoneme state designated in FIG. 7 is the second state of the third phoneme "m", so i = 8 corresponds to n = 3. In general, when each phoneme has three states, the i-th phoneme state and the n-th phoneme containing it satisfy the relationship 3 × (n−1) < i ≤ 3 × n.
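The relationship between the state index i and the phoneme index n can be illustrated with a small helper. The function name `phoneme_index` and its parameter are hypothetical; the sketch assumes, as in the example, that every phoneme has the same number of states.

```python
def phoneme_index(i, states_per_phoneme=3):
    """Return the 1-based phoneme index n that contains the 1-based state i.

    Assumes every phoneme has exactly `states_per_phoneme` states, so that
    states_per_phoneme * (n - 1) < i <= states_per_phoneme * n.
    """
    return (i - 1) // states_per_phoneme + 1


n_for_state_8 = phoneme_index(8)  # eighth state belongs to the third phoneme
```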

The first term on the right side of equation (1) represents the scores of the phonemes from the first phoneme to the (n−1)-th phoneme in the distance table 30. The score calculation unit 152 calculates the score P1(k) of the k-th phoneme from the beginning according to equation (2). In equation (2), T(k) is the number of frames corresponding to the k-th phoneme in the candidate for the correspondence R(m, i) whose score is being calculated. Accordingly, the sum "T(1) + T(2) + …" of T(k) over all phonemes from the first to the last in the distance table 30 equals the total number of frames T.

Also in equation (2), a(k) is the position, counted from the first frame of the distance table 30, of the frame at which the T(k) frames corresponding to the k-th phoneme begin. In other words, in the candidate for the correspondence R(m, i), the k-th phoneme is associated with the T(k) frames starting from the a(k)-th frame. Here, a(k) and T(k) satisfy the relationship a(k) = T(1) + T(2) + … + T(k−1) + 1.
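As a brief illustration of the relationship between a(k) and T(k), the start frames can be derived from the per-phoneme frame counts. The helper name `start_frames` is hypothetical, and the counts used below are those of the first candidate C1 in FIG. 8 (4, 7, and 4 frames for the first three phonemes).

```python
def start_frames(frame_counts):
    """Return [a(1), a(2), ...] for frame counts [T(1), T(2), ...].

    Uses a(k) = T(1) + ... + T(k-1) + 1 with 1-based frame numbers.
    """
    starts = []
    next_start = 1
    for count in frame_counts:
        starts.append(next_start)
        next_start += count
    return starts


starts_c1 = start_frames([4, 7, 4])  # phonemes "r", "a:", "m" in candidate C1
```

For candidate C1 this yields start frames 1, 5, and 12, which agrees with the frame ranges used in the worked example (first to fourth, fifth to 11th, and 12th to 15th frames).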

According to equation (2), the score calculation unit 152 accumulates the distance D(t), or the logarithm log S(t) of the output probability, which is a value based on the output probability of the k-th phoneme, over the T(k) frames starting from the a(k)-th frame that correspond to the k-th phoneme. The score calculation unit 152 then normalizes the accumulated value by the number of frames T(k) corresponding to the k-th phoneme, that is, divides it by T(k), to obtain the score P1(k) of the k-th phoneme.

The second term on the right side of equation (1) is the score of the n-th phoneme in the distance table 30, that is, the phoneme containing the i-th phoneme state designated by the designation unit 151. The score calculation unit 152 calculates the score P2(m, n) of the n-th phoneme according to equation (3). It accumulates the distance D(t), or the logarithm log S(t) of the output probability, over the m−a(n)+1 frames from the a(n)-th frame, at which the n-th phoneme begins, to the m-th frame, and divides the accumulated value by the number of frames m−a(n)+1 to obtain the score P2(m, n).

In this way, the score calculation unit 152 calculates the score P2(m, n) of the n-th phoneme using an equation different from that used for the scores P1(k) of the first to (n−1)-th phonemes. The reason is that, at the time the search unit 150 searches for the correspondence R(m, i), the number of frames T(n) corresponding to the n-th phoneme has not yet been determined. In other words, it is not yet determined whether the phoneme corresponding to the (m+1)-th and subsequent frames is still the n-th phoneme or is the (n+1)-th or a later phoneme. For example, in the distance table 30 shown in FIG. 7, it is not yet determined whether the phoneme corresponding to the 16th and subsequent frames is still "m".

For this reason, the score calculation unit 152 calculates the score P2(m, n) of the n-th phoneme by treating the frames that remain after excluding, from the m frames, the frames corresponding to the first to (n−1)-th phonemes as the at least one frame corresponding to the n-th phoneme. That is, when calculating P2(m, n), the score calculation unit 152 normalizes the accumulated distance D(t) by the number of remaining frames m−a(n)+1 instead of by the value T(n) that is eventually determined as the number of frames corresponding to the n-th phoneme.

The score calculation unit 152 calculates the score P(m, n) for each of the plurality of candidates for the correspondence R(m, i). Specifically, it calculates P(m, n) for each of the first candidate C1, which assumes that the (m−1)-th frame corresponds to the i-th phoneme state, and the second candidate C2, which assumes that the (m−1)-th frame corresponds to the (i−1)-th phoneme state.

The score P(m, n) calculated for the first candidate C1 is the score for the case in which the phoneme state does not transition (stays) between the (m−1)-th frame and the m-th frame, and is therefore called the "state stay score". The score P(m, n) calculated for the second candidate C2 is the score for the case in which the phoneme state transitions between the (m−1)-th frame and the m-th frame, and is therefore called the "state transition score".

Examples of calculating the state stay score and the state transition score will now be described with reference to FIGS. 8 and 9. FIG. 8 shows an example of calculating the state stay score for the first candidate C1 of the correspondence R(15, 8) shown in FIG. 7, and FIG. 9 shows an example of calculating the state transition score for the second candidate C2. For ease of understanding, FIGS. 8 and 9 show only part of the distance table 30 shown in FIG. 7 and omit the values of the distance D(t) in the portions not involved.

(A1) In the first candidate C1 shown in FIG. 8, the three states of the first phoneme "r" are associated with four frames, the first to fourth frames. The score calculation unit 152 therefore divides, according to equation (2), the sum of the distances along the first candidate C1 in the first to fourth frames, D(1) + D(2) + D(3) + D(4) = 3 + 4 + 3 + 2 = 12, by the number of corresponding frames T(1) = 4. The score for the first phoneme "r" is thus P1(1) = 12/4 = 3.

(A2) Next, the three states of the second phoneme "a:" are associated with seven frames, the fifth to 11th frames. The score calculation unit 152 therefore divides, according to equation (2), the sum of the distances along the first candidate C1 in the fifth to 11th frames, D(5) + D(6) + … + D(11) = 4 + 5 + 5 + 6 + 5 + 2 + 1 = 28, by the number of corresponding frames T(2) = 7. The score for the second phoneme "a:" is thus P1(2) = 28/7 = 4.

(A3) The third phoneme "m" is the phoneme containing the eighth phoneme state designated by the designation unit 151. The score calculation unit 152 therefore divides, according to equation (3), the sum of the distances along the first candidate C1 from the 12th frame, which is the first frame corresponding to the third phoneme "m", to the designated 15th frame, D(12) + D(13) + D(14) + D(15) = 5 + 3 + 2 + 4 = 14, by the number of frames from the 12th to the 15th frame, 15 − 12 + 1 = 4. The score for the third phoneme "m" is thus P2(15, 3) = 14/4 = 3.5.

Having calculated the scores P1(1), P1(2), and P2(15, 3) from the first frame and phoneme state to the designated frame and phoneme state, the score calculation unit 152 obtains, from equation (1), the state stay score of the correspondence R(15, 8): P(15, 3) = P1(1) + P1(2) + P2(15, 3) = 3 + 4 + 3.5 = 10.5.

(B1) In the second candidate C2 shown in FIG. 9, the three states of the first phoneme "r" are associated with nine frames, the first to ninth frames. The score calculation unit 152 therefore divides, according to equation (2), the sum of the distances along the second candidate C2 in the first to ninth frames, D(1) + D(2) + … + D(9) = 3 + 4 + 3 + 4 + 6 + 4 + 7 + 3 + 2 = 36, by the number of corresponding frames T(1) = 9. The score for the first phoneme "r" is thus P1(1) = 36/9 = 4.

(B2) Next, the three states of the second phoneme "a:" are associated with four frames, the 10th to 13th frames. The score calculation unit 152 therefore divides, according to equation (2), the sum of the distances along the second candidate C2 in the 10th to 13th frames, D(10) + D(11) + D(12) + D(13) = 4 + 5 + 5 + 1 = 15, by the number of corresponding frames T(2) = 4. The score for the second phoneme "a:" is thus P1(2) = 15/4 = 3.75.

(B3) The third phoneme "m" is the phoneme containing the eighth phoneme state designated by the designation unit 151. The score calculation unit 152 therefore divides, according to equation (3), the sum of the distances along the second candidate C2 from the 14th frame, which is the first frame corresponding to the third phoneme "m", to the designated 15th frame, D(14) + D(15) = 6 + 4 = 10, by the number of frames from the 14th to the 15th frame, 15 − 14 + 1 = 2. The score for the third phoneme "m" is thus P2(15, 3) = 10/2 = 5.

Having calculated the scores P1(1), P1(2), and P2(15, 3) from the first frame and phoneme state to the designated frame and phoneme state, the score calculation unit 152 obtains, from equation (1), the state transition score of the correspondence R(15, 8): P(15, 3) = P1(1) + P1(2) + P2(15, 3) = 4 + 3.75 + 5 = 12.75.
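The arithmetic of steps (A1) to (A3) and (B1) to (B3) can be checked with a short sketch. The function name `path_score` and the list-of-lists layout are assumptions made for illustration; the distance values are those quoted for FIGS. 8 and 9. Because the score of the last (still open) phoneme is also a mean over the frames assigned to it so far, a single per-phoneme mean covers both the P1 and P2 terms here.

```python
def path_score(distances_per_phoneme):
    """Sum of per-phoneme mean distances along one candidate path.

    distances_per_phoneme[k] lists the distances D(t) of the frames
    assigned to phoneme k on the path (last phoneme: frames so far).
    """
    return sum(sum(d) / len(d) for d in distances_per_phoneme)


# Candidate C1 (FIG. 8): "r" frames 1-4, "a:" frames 5-11, "m" frames 12-15.
c1 = [[3, 4, 3, 2], [4, 5, 5, 6, 5, 2, 1], [5, 3, 2, 4]]
# Candidate C2 (FIG. 9): "r" frames 1-9, "a:" frames 10-13, "m" frames 14-15.
c2 = [[3, 4, 3, 4, 6, 4, 7, 3, 2], [4, 5, 5, 1], [6, 4]]

state_stay_score = path_score(c1)
state_transition_score = path_score(c2)
```

With these inputs the two scores evaluate to 10.5 and 12.75, matching the values derived in the text.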

In this way, the score calculation unit 152 calculates the score of the k-th phoneme by normalizing the accumulated distance D(t) by the number of frames for each phoneme. It then accumulates the scores P1(1), P1(2), …, P1(n−1), P2(m, n) of the n phonemes, calculated for k from 1 to n, to obtain the score P(m, n) of each of the two candidates C1 and C2 for the correspondence R(m, i).

The reason the score calculation unit 152 normalizes the accumulated distance D(t) by the number of frames separately for each phoneme is to prevent the speech signal from being misrecognized under the influence of only some of the phonemes in the phoneme string. To see this, consider the case in which the score calculation unit 152 calculates the score of the correspondence R(m, i) by normalizing the accumulated distance D(t) not phoneme by phoneme but by the total number of frames from the first frame to the m-th frame designated by the designation unit 151.

When the accumulated distance D(t) is normalized by the total number of frames, the score calculation unit 152 calculates, as the score of the correspondence R(m, i), for example P'(m) given by equation (4). That is, it divides the accumulated value of the distance D(t), or of the logarithm log S(t) of the output probability, by the total number of frames m from the first frame to the designated m-th frame.
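Equation (4) is likewise not reproduced in this text. From the description of dividing the accumulated distance by the total number of frames m, it can plausibly be reconstructed as follows; this is a reconstruction from the prose, not the published typesetting.

```latex
\begin{equation}
P'(m) = \frac{1}{m} \sum_{t=1}^{m} D(t) = -\frac{1}{m} \sum_{t=1}^{m} \log S(t) \tag{4}
\end{equation}
```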

The difference between normalizing by the number of frames for each phoneme according to equation (1) and normalizing by the total number of frames according to equation (4) will be described with reference to FIGS. 10(a) and 10(b). The distance tables 31 and 32 shown in FIGS. 10(a) and 10(b) are examples different from the distance table 30 shown in FIGS. 7 to 9, and each shows a correspondence R(11, 5) spanning 11 frames and five phonemes. For ease of understanding, FIGS. 10(a) and 10(b) show the case in which each phoneme has a single state.

When the score is calculated by normalizing by the number of frames for each phoneme according to equation (1), the score calculation unit 152 calculates the score of the correspondence R(11, 5) shown in FIG. 10(a) as P(11, 5) = 6/1 + (2 + 2 + 2 + 3 + 3 + 1 + 2)/7 + 7/1 + 4/1 + 6/1 = 25.1. Similarly, it calculates the score of the correspondence R(11, 5) shown in FIG. 10(b) as P(11, 5) = (4 + 4)/2 + (3 + 3 + 3 + 3)/4 + 4/1 + (3 + 3)/2 + (4 + 5)/2 = 18.5. Thus FIG. 10(b) yields a smaller score than FIG. 10(a).

In contrast, when the score is calculated by normalizing by the total number of frames according to equation (4), the score calculation unit 152 calculates the score of the correspondence R(11, 5) shown in FIG. 10(a) as P'(11) = (6 + 2 + 2 + 2 + 3 + 3 + 1 + 2 + 7 + 4 + 6)/11 = 3.45. Similarly, it calculates the score of the correspondence R(11, 5) shown in FIG. 10(b) as P'(11) = (4 + 4 + 3 + 3 + 3 + 3 + 4 + 3 + 3 + 4 + 5)/11 = 3.54. Thus, unlike the case of normalizing by the number of frames for each phoneme, FIG. 10(a) yields a smaller score than FIG. 10(b).
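The reversal of the ordering between the two normalizations can be reproduced with the distances quoted for FIGS. 10(a) and 10(b). The function names and the grouping of distances into per-phoneme lists are assumptions made for this sketch.

```python
def per_phoneme_score(segments):
    """Per-phoneme normalization: sum of the mean distance of each phoneme."""
    return sum(sum(seg) / len(seg) for seg in segments)


def whole_path_score(segments):
    """Whole-path normalization: mean distance over all frames on the path."""
    flat = [d for seg in segments for d in seg]
    return sum(flat) / len(flat)


# Distances along the paths, grouped by phoneme (one state per phoneme).
fig10a = [[6], [2, 2, 2, 3, 3, 1, 2], [7], [4], [6]]
fig10b = [[4, 4], [3, 3, 3, 3], [4], [3, 3], [4, 5]]

per_phoneme = (per_phoneme_score(fig10a), per_phoneme_score(fig10b))
whole_path = (whole_path_score(fig10a), whole_path_score(fig10b))
```

Under per-phoneme normalization the path of FIG. 10(b) has the smaller (better) score, whereas under whole-path normalization the path of FIG. 10(a) does, because the seven low-distance frames of the phoneme "a:" dominate the average.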

The ordering of the scores differs depending on whether normalization is performed for each phoneme because the path of only some phonemes of the phoneme string occupies a large part of the whole path. For example, in FIG. 10(a), the phoneme "a:" of the phoneme string "r, a:, m, e, N" corresponds to seven frames, the second to eighth frames, out of 11 frames in total. When only some phonemes correspond to many of the frames in this way, the distances D(t) calculated for those phonemes tend to have a large influence on the score of the whole path. As a result, even when only those phonemes resemble the speech signal to be recognized, the whole phoneme string is likely to be erroneously judged to have a high degree of similarity.

To avoid such erroneous determination, the score calculation unit 152 calculates the score by normalizing by the number of frames for each phoneme according to equation (1) above. As a result, the weights of the phonemes are made uniform, so that a few phonemes are less likely to dominate the score of the entire path, and the accuracy of DP matching can be improved.
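
Per-phoneme normalization can be sketched as follows. Equation (1) is not reproduced in this passage, so the sketch assumes it amounts to summing, over the phonemes, the average distance of the frames assigned to each phoneme. The segmentation of FIG. 10(a) is also an assumption: the phoneme "a:" is stated to cover frames 2 to 8, and the remaining four phonemes are assumed to take one frame each. The function name `per_phoneme_score` is hypothetical.

```python
def per_phoneme_score(segments):
    """Sum of per-phoneme average distances (assumed form of equation (1))."""
    return sum(sum(seg) / len(seg) for seg in segments)


# Assumed segmentation of FIG. 10(a): r | a: (7 frames) | m | e | N
fig_10a_segments = [[6], [2, 2, 2, 3, 3, 1, 2], [7], [4], [6]]

score = per_phoneme_score(fig_10a_segments)

# The long phoneme "a:" contributes only its average distance (15 / 7),
# not the sum of its seven frames, so it no longer dominates the path score.
long_phoneme_share = (15 / 7) / score
print(score, long_phoneme_share)
```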

Returning to the description of the functions of the speech recognition device 10 shown in FIG. 2, the specifying unit 153 specifies, based on the score calculated by the score calculation unit 152, the correspondence between each frame from the first frame to the frame designated by the designation unit 151 and each phoneme state from the first phoneme state to the phoneme state designated by the designation unit 151. For example, when the designation unit 151 designates the m-th frame and the i-th phoneme state, the specifying unit 153 compares the state stationary score calculated by the score calculation unit 152 for the first candidate C1 of the correspondence R(m, i) with the state transition score calculated for the second candidate C2. The specifying unit 153 then specifies the path with the better score of the two candidates C1 and C2 as the maximum likelihood sequence of the correspondence R(m, i).

Here, a good score means that the value of the score P(m, n) expressed by equation (1) above is small. In the example of FIG. 7, the state stationary score "10.5" calculated for the first candidate C1 is smaller than the state transition score "12.75" calculated for the second candidate C2. The specifying unit 153 therefore selects the first candidate C1 as the maximum likelihood sequence of the correspondence R(m, i) and excludes the second candidate C2 from it. In this way, the specifying unit 153 specifies the correspondence between frames and phonemes by selecting the candidate with the better score from the two candidates C1 and C2, which follow different paths.

The repetition unit 154 repeats the processing of the score calculation unit 152 and the specifying unit 153 while changing at least one of the frame and the phoneme state designated by the designation unit 151. In other words, the repetition unit 154 repeatedly executes the processing of the score calculation unit 152 and the specifying unit 153 while increasing at least one of the value of m, for the first m frames, and the value of n or i, for the first n phonemes or the first i phoneme states.

In the repetition processing by the repetition unit 154, when the designation unit 151 designates the m-th frame and the i-th phoneme state, the score calculation unit 152 calculates the score (state stationary score) of the first candidate C1 of the correspondence R(m, i) using the correspondence R(m-1, i) that was specified when the (m-1)-th frame and the i-th phoneme state were designated. The score calculation unit 152 further calculates the score (state transition score) of the second candidate C2 of the correspondence R(m, i) using the correspondence R(m-1, i-1) that was specified when the (m-1)-th frame and the (i-1)-th phoneme state were designated. The specifying unit 153 then specifies, as the correspondence R(m, i), the path with the better of the state stationary score and the state transition score calculated by the score calculation unit 152.

In this way, the search unit 150 gradually builds the correspondence R(T, 15) between the T frames and the 15 phoneme states in the entire distance table 30 while shifting the frame and the phoneme state designated by the designation unit 151 in order from the beginning to the end.
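
The search described above can be sketched as a small dynamic-programming loop. This is an illustration under stated assumptions, not the patented implementation: it uses one state per phoneme (the text uses three states per phoneme), it assumes the score is the sum of per-phoneme average distances, and the names `Path`, `stay`, `advance`, and `align` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Path:
    done: float              # sum of averages of phonemes already completed
    cur_sum: float           # accumulated distance within the current phoneme
    cur_len: int             # frames assigned to the current phoneme so far
    states: Tuple[int, ...]  # phoneme index chosen for each frame

    @property
    def score(self) -> float:
        # Completed phonemes plus the running average of the current one.
        return self.done + self.cur_sum / self.cur_len


def stay(p: Path, d: float, i: int) -> Path:
    """Candidate C1: the new frame stays in the same phoneme i."""
    return Path(p.done, p.cur_sum + d, p.cur_len + 1, p.states + (i,))


def advance(p: Path, d: float, i: int) -> Path:
    """Candidate C2: the new frame moves on to the next phoneme i."""
    return Path(p.score, d, 1, p.states + (i,))


def align(dist: List[List[float]]) -> Path:
    """dist[t][i] is the distance between frame t and phoneme i."""
    n = len(dist[0])
    # The first frame must belong to the first phoneme.
    col: List[Optional[Path]] = [Path(0.0, dist[0][0], 1, (0,))] + [None] * (n - 1)
    for t in range(1, len(dist)):
        new: List[Optional[Path]] = [None] * n
        for i in range(n):
            cands = []
            if col[i] is not None:
                cands.append(stay(col[i], dist[t][i], i))
            if i > 0 and col[i - 1] is not None:
                cands.append(advance(col[i - 1], dist[t][i], i))
            if cands:
                # Keep only the candidate with the smaller (better) score.
                new[i] = min(cands, key=lambda p: p.score)
        col = new
    final = col[n - 1]
    if final is None:
        raise ValueError("fewer frames than phonemes")
    return final


# Four frames, two phonemes: the first two frames match phoneme 0.
best = align([[1, 9], [1, 9], [9, 1], [9, 1]])
print(best.states, best.score)
```

Keeping only the better of the two candidates in each cell mirrors the selection made by the specifying unit 153; with a per-phoneme normalized score this local choice is a heuristic and is not guaranteed to find the globally smallest score.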

Further, the repetition unit 154 repeats the processing of the designation unit 151, the score calculation unit 152, and the specifying unit 153 for each of the plurality of candidate words stored in the candidate word storage unit 22. The score calculation unit 152 thereby calculates a score for each of the plurality of candidate words by the method described above. The specifying unit 153 then specifies, for each of the plurality of candidate words, the correspondence between frames and phoneme states based on the score calculated by the score calculation unit 152.

Based on the correspondences specified by the specifying unit 153, the determination unit 160 determines, for each of the plurality of candidate words stored in the candidate word storage unit 22, whether it is the word uttered in the speech signal to be recognized. More specifically, the determination unit 160 compares the scores of the correspondences specified by the specifying unit 153 for the respective candidate words, and thereby determines one of the candidate words to be the word uttered in the speech signal to be recognized.

The correspondence specified for each of the plurality of candidate words is the correspondence finally specified, in the distance table 30 of that candidate word, between the frames from the first to the last and the phoneme states from the first to the last. The determination unit 160 compares the magnitudes of the scores calculated by the score calculation unit 152 for the correspondences specified for the respective candidate words. The determination unit 160 then determines the candidate word with the best score, that is, the smallest score, to be the word uttered in the speech signal to be recognized. The determination unit 160 is realized by the control unit 11 and functions as determination means.
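
The final determination reduces to picking the candidate word with the smallest score. A minimal sketch follows; the function name `recognize` and the score values are hypothetical and chosen only for illustration.

```python
from typing import Dict


def recognize(final_scores: Dict[str, float]) -> str:
    """Return the candidate word whose final alignment score is smallest."""
    if not final_scores:
        raise ValueError("no candidate words")
    return min(final_scores, key=final_scores.get)


# Illustrative scores only; these values do not come from the source.
example_scores = {"ramen": 7.2, "cake": 9.8}
result = recognize(example_scores)
print(result)
```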

The flow of the speech recognition processing executed by the speech recognition device 10 configured as described above will be described with reference to the flowcharts shown in FIGS. 11 and 12.

The speech recognition processing shown in FIG. 11 starts when an instruction to start speech recognition is received from the user via the input unit 13 or the like. When the speech recognition processing starts, the control unit 11 acquires the speech signal to be recognized (step S1). For example, the control unit 11 acquires, via the input unit 13 or the communication unit 15, a signal representing speech uttered by the user or speech uttered in a conference, on television, in a movie, or the like. If the speech signal to be recognized has already been acquired when the instruction to start the speech recognition processing is received, step S1 is omitted.

After acquiring the speech signal, the control unit 11 designates the first frame of the acquired speech signal (step S2). After designating the first frame, the control unit 11 functions as the feature amount calculation unit 120 and calculates the feature amount of the speech signal in the designated frame (step S3). More specifically, the control unit 11 extracts the feature amount from the speech signal by applying a Fourier transform to the speech data in the first frame to obtain a frequency spectrum and then applying a mel filter bank.

After calculating the feature amount, the control unit 11 functions as the output probability acquisition unit 130 and acquires output probabilities for all phonemes of the acoustic model stored in the acoustic model storage unit 21 (step S4). For example, the control unit 11 uses a continuous Gaussian mixture distribution or a neural network to calculate, for all phonemes of the monophone model, the output probability that the calculated feature amount is output. The control unit 11 thereby calculates an index of which phoneme in the monophone model the speech signal in the designated frame is most likely to correspond to.

After acquiring the output probabilities, the control unit 11 determines whether the designated frame has reached the last frame of the speech signal (step S5). If the designated frame has not reached the last frame (step S5; NO), the control unit 11 designates the next frame (step S6). For example, when the first frame is currently designated, the control unit 11 designates the second frame as the next frame. When the t-th frame from the beginning is currently designated, the control unit 11 designates the (t+1)-th frame as the next frame.

After designating the next frame, the control unit 11 returns the processing to step S3. The control unit 11 then executes the feature amount calculation in step S3 and the output probability acquisition in step S4 for the newly designated frame. In this way, the control unit 11 calculates the feature amount for each frame from the beginning to the end of the speech signal to be recognized and acquires the output probabilities of the calculated feature amounts for all phonemes of the acoustic model.

Finally, when the designated frame reaches the last frame (step S5; YES), the control unit 11 functions as the conversion unit 140, reads the plurality of candidate words stored in the candidate word storage unit 22, and converts each of them into a phoneme string (step S7). For example, when the read candidate word is "ラーメン" (ramen), the control unit 11 converts it into the phoneme string "r, a:, m, e, N". Alternatively, when the read candidate word is the English word "cake", the control unit 11 converts it into the phoneme string "k, e, i, k".

After converting each of the plurality of candidate words into a phoneme string, the control unit 11 functions as the search unit 150 and searches for the correspondence between frames and phoneme states for each of the plurality of candidate words (step S8). The details of the search processing in step S8 will be described with reference to the flowchart shown in FIG. 12.

When the search processing shown in FIG. 12 starts, the control unit 11 functions as the designation unit 151 and designates one of the plurality of candidate words (step S801). After designating a candidate word, the control unit 11 generates the distance table 30 for the designated candidate word (step S802). For example, when the designated candidate word is "ラーメン" (ramen), the control unit 11 prepares a matrix whose columns are the 15 phoneme states constituting the phoneme string "r, a:, m, e, N" obtained in step S7 and whose rows are the T frames of the speech signal to be recognized. The control unit 11 then calculates distances from the output probabilities acquired in step S4 and places the calculated distances in the elements of the matrix, thereby generating the distance table 30 shown in FIG. 4.
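
The layout of the distance table can be sketched as below. This passage states only that the distance is calculated from the output probability, so the negative logarithm used here is an assumption, as are the per-state probability lookup keyed by `(phoneme, state)` and the names `build_distance_table` and `STATES_PER_PHONEME`.

```python
import math
from typing import Dict, List, Tuple

# The example in the text has 5 phonemes and 15 phoneme states.
STATES_PER_PHONEME = 3

StateKey = Tuple[str, int]  # (phoneme label, state index within the phoneme)


def build_distance_table(
    frame_probs: List[Dict[StateKey, float]],
    phonemes: List[str],
) -> List[List[float]]:
    """Return a table with one row per frame and one column per phoneme state."""
    columns = [(p, s) for p in phonemes for s in range(STATES_PER_PHONEME)]
    # Assumed distance: negative log of the output probability.
    return [[-math.log(probs[c]) for c in columns] for probs in frame_probs]


phonemes = ["r", "a:", "m", "e", "N"]
uniform = {(p, s): 0.5 for p in phonemes for s in range(STATES_PER_PHONEME)}
table = build_distance_table([uniform, uniform], phonemes)
print(len(table), len(table[0]))
```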

After generating the distance table 30, the control unit 11 functions as the designation unit 151 and designates one frame in the generated distance table 30 (step S803). The control unit 11 further designates one phoneme state in the generated distance table 30 (step S804). The control unit 11 then functions as the score calculation unit 152 and calculates, for the designated frame and phoneme state, the state stationary score and the state transition score from the immediately preceding frame (step S805).

More specifically, when the m-th frame from the beginning and the i-th phoneme state from the beginning are designated, the control unit 11 calculates the score P(m, n) for the two candidates of the correspondence R(m, i) according to equation (1) above. For example, when the 15th frame and the 8th phoneme state (the second state of the third phoneme) are designated as shown in FIG. 7, the control unit 11 calculates the score P(15, 3) of the first candidate C1 of the correspondence R(15, 8) as the state stationary score and the score P(15, 3) of the second candidate C2 as the state transition score.

After calculating the state stationary score and the state transition score, the control unit 11 functions as the specifying unit 153, compares the two scores, and keeps the path with the better score (step S806). More specifically, the control unit 11 determines the path corresponding to the smaller of the state stationary score and the state transition score to be the maximum likelihood sequence from the first frame and phoneme state to the designated frame and phoneme state. At this time, the control unit 11 removes the path with the larger score from the candidates.

After selecting the path in this way, the control unit 11 determines whether the designated phoneme state has reached the last phoneme state (step S807). If the designated phoneme state has not reached the last phoneme state (step S807; NO), the control unit 11 designates the next phoneme state (step S808). For example, when the i-th phoneme state from the beginning is currently designated, the control unit 11 designates the (i+1)-th phoneme state as the next phoneme state.

After designating the next phoneme state, the control unit 11 returns the processing to step S805. The control unit 11 then functions as the repetition unit 154 and executes, for the newly designated phoneme state, the calculation of the state transition score and the state stationary score in step S805 and the path selection in step S806. In this way, for the frame designated in step S803, the control unit 11 designates the phoneme states in the distance table 30 one at a time in order from the beginning, and specifies the correspondence between each frame from the first frame to the designated frame and each phoneme state from the first phoneme state to the designated phoneme state.

When the first phoneme state is designated in step S804, there is no preceding phoneme state, so the control unit 11 does not calculate a state transition score in step S805. In this case, in step S806, the control unit 11 specifies the path corresponding to the state stationary score, that is, the correspondence in which all frames from the first frame to the designated frame correspond to the first phoneme state.

When the designated phoneme state reaches the last phoneme state (step S807; YES), the control unit 11 determines whether the designated frame has reached the last frame in the recognition section of the speech signal (step S809). If the designated frame has not reached the last frame (step S809; NO), the control unit 11 designates the next frame (step S810). For example, when the m-th frame from the beginning is currently designated, the control unit 11 designates the (m+1)-th frame as the next frame.

After designating the next frame, the control unit 11 returns the processing to step S804. The control unit 11 then functions as the repetition unit 154 and executes steps S804 to S808 for the newly designated frame. When the first frame is designated in step S803, there is no preceding frame, so the control unit 11 skips steps S804 to S808. In this way, the control unit 11 specifies the correspondence between each frame of the speech signal to be recognized and each phoneme state of the designated candidate word while designating frames one at a time in order from the second frame of the speech signal to be recognized.

When the designated frame reaches the last frame (step S809; YES), the control unit 11 determines whether all candidate words have been designated (step S811). If not all candidate words have been designated (step S811; NO), the control unit 11 designates, as the next candidate word, a candidate word that has not yet been designated among the plurality of candidate words stored in the candidate word storage unit 22 (step S812).

After designating the next candidate word, the control unit 11 returns the processing to step S802. The control unit 11 then functions as the repetition unit 154 and executes steps S802 to S811 for the newly designated candidate word. In this way, the control unit 11 searches, for each of the plurality of candidate words stored in the candidate word storage unit 22, for the correspondence between each frame of the speech signal to be recognized and each phoneme state.

Finally, when all candidate words have been designated (step S811; YES), the control unit 11 ends the search processing shown in FIG. 12.

The description now returns to the speech recognition processing shown in FIG. 11. After searching for the correspondence between frames and phoneme states for each of the plurality of candidate words, the control unit 11 functions as the determination unit 160 and determines the candidate word with the best score among the plurality of candidate words to be the recognition result (step S9). More specifically, for the case in which the designated frame has reached the last frame and the designated phoneme state has reached the last phoneme state, the control unit 11 compares across the candidate words the score of the path kept in step S806, that is, the better of the state stationary score and the state transition score calculated in step S805. The control unit 11 then determines the candidate word with the best score to be the word uttered in the speech signal.

After determining the word uttered in the speech signal to be recognized, the control unit 11 outputs the recognition result (step S10). For example, the control unit 11 displays the word determined to be uttered in the speech signal on the display unit of the output unit 14. Alternatively, the control unit 11 outputs the word determined to be uttered in the speech signal as sound from a speaker. The user can thereby confirm the recognition result of the speech signal. The speech recognition processing shown in FIG. 11 then ends.

As described above, the speech recognition device 10 according to the present embodiment is a device that searches for the correspondence between frames and phonemes by dynamic programming and determines the word uttered in the speech signal to be recognized. In the process of searching for the correspondence between frames and phonemes, the device specifies the correspondence based on a score calculated by normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the result over a plurality of frames.

Using a score normalized by the number of frames corresponding to each phoneme reduces the bias in the weights of the phonemes and makes the contribution of each phoneme to the overall score uniform. Consequently, when some phonemes in the phoneme string correspond to many frames, the loss of score accuracy caused by the influence of those phonemes alone can be suppressed. As a result, the accuracy of the search for the correspondence between frames and phonemes can be improved, which leads to improved speech recognition accuracy.

In addition, according to the speech recognition device 10 of the present embodiment, calculating an accurate score improves the accuracy of candidate pruning in the process of searching for the correspondence between frames and phonemes, so the number of candidates that must be kept can be reduced. The calculation cost of the search for the correspondence can therefore be suppressed, and memory can be saved. In particular, when the word uttered in the speech signal to be recognized is determined from among many candidate words, efficient speech recognition is possible at a small calculation cost.

(Modification)
Although an embodiment of the present invention has been described above, the embodiment is an example, and the scope of application of the present invention is not limited to it. That is, the embodiment of the present invention can be applied in various ways, and all such embodiments are included in the scope of the present invention.

For example, in the above embodiment, the speech recognition device 10 acquires a speech signal to be recognized and determines one of the plurality of candidate words stored in the candidate word storage unit 22 to be the word uttered in the acquired speech signal. However, in the present invention, the speech recognition device 10 may acquire a search word as a target word and search the speech signal to be searched for a section in which the acquired search word is uttered. That is, the speech recognition device 10 may function as a so-called speech search device.

When the speech recognition device 10 functions as a speech search device, the search unit 150 executes, for each of a plurality of different sections of the speech signal, the search for the correspondence between frames and phonemes that was executed for each of the plurality of candidate words in the above embodiment. More specifically, based on the output probabilities acquired by the output probability acquisition unit 130, the score calculation unit 152 calculates, for each of the plurality of different sections of the speech signal, the scores of the plurality of candidates for the correspondence between frames and phonemes according to equation (1) above. The specifying unit 153 specifies, for each of the plurality of sections, the correspondence between frames and phonemes based on the scores calculated by the score calculation unit 152. The determination unit 160 then compares the scores of the correspondences specified by the specifying unit 153 for the respective sections and thereby determines, from among the plurality of sections, the section in which the target word is uttered.

In the above embodiment, the output probability acquisition unit 130 acquires output probabilities for all phonemes of the acoustic model stored in the acoustic model storage unit 21. However, in the present invention, the output probability acquisition unit 130 only needs to acquire output probabilities for at least each of the phonemes corresponding to the plurality of candidate words stored in the candidate word storage unit 22. In other words, the output probability acquisition unit 130 only needs to acquire output probabilities for at least those phonemes of the acoustic model that are used in speech recognition, and need not acquire output probabilities for phonemes that are not used in speech recognition.

In the present invention, the speech recognition device 10 need not include all the components shown in FIG. 2. For example, the acoustic model storage unit 21 or the candidate word storage unit 22 may be provided in a device external to the speech recognition device 10. In that case, the speech recognition device 10 communicates with the external device as necessary to acquire the information on the acoustic model stored in the acoustic model storage unit 21 or the information on the plurality of candidate words stored in the candidate word storage unit 22.

Further, in the present invention, the speech recognition device 10 need not have the functions of the feature amount calculation unit 120, the output probability acquisition unit 130, or the conversion unit 140. For example, an external device may have the function of converting the plurality of candidate words stored in the candidate word storage unit 22 into the corresponding phoneme strings, and the speech recognition device 10 may acquire the information on the phoneme string corresponding to each candidate word from the external device. Alternatively, an external device may execute the processing of the feature amount calculation unit 120 for calculating the feature amount of the speech signal, or the processing of the output probability acquisition unit 130 for acquiring the output probabilities, and the speech recognition device 10 may acquire information indicating the result from the external device.

In the above embodiment, the CPU of the control unit 11 executes the program stored in the ROM and thereby functions as each unit shown in FIG. 2. However, in the present invention, the control unit 11 may include, instead of the CPU, dedicated hardware such as an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or various control circuits, and the dedicated hardware may function as each unit shown in FIG. 2. In this case, the functions of the units may each be realized by separate hardware, or may be realized collectively by a single piece of hardware. Alternatively, some of the functions of the units may be realized by dedicated hardware and the rest by software or firmware.

The present invention can be provided as a speech recognition device equipped in advance with the configuration for realizing the functions according to the present invention. In addition, an existing information processing device or the like can be made to function as the speech recognition device according to the present invention by applying a program. That is, a program for realizing the functional components of the speech recognition device 10 exemplified in the above embodiment can be applied so that it is executable by a CPU or the like that controls an existing information processing device, which thereby functions as the speech recognition device according to the present invention.

Such a program may be applied by any method. For example, the program can be stored on and applied from a computer-readable storage medium such as a flexible disk, a CD-ROM (Compact Disc ROM), a DVD-ROM (Digital Versatile Disc ROM), or a memory card. The program can also be superimposed on a carrier wave and applied via a communication medium such as the Internet. For example, the program may be posted on and distributed from a bulletin board system (BBS) on a communication network. The above processing may then be made executable by starting the program and running it under the control of an OS (Operating System) in the same manner as other application programs.

The preferred embodiments of the present invention have been described above, but the present invention is not limited to these specific embodiments; the present invention includes the inventions described in the claims and their equivalents. The inventions described in the claims as originally filed with the present application are appended below.
(Appendix 1)
A speech recognition device comprising:
output probability acquisition means for acquiring, for each of a plurality of frames in a speech signal, an output probability with which a feature amount of the speech signal is output from each of a plurality of phonemes corresponding to a target word, the target word being subject to determination of whether it is uttered in the speech signal;
score calculation means for calculating, based on the output probabilities acquired by the output probability acquisition means, a score indicating a likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
specifying means for specifying one of the plurality of candidates as the correspondence, based on the scores calculated for the respective candidates by the score calculation means; and
determination means for determining whether the target word is uttered in the speech signal, based on the correspondence specified by the specifying means,
wherein the score calculation means calculates the score by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the value over the plurality of frames.
(Appendix 2)
The speech recognition device according to Appendix 1, wherein
the score calculation means calculates the score for each of a plurality of candidates for a correspondence between the first m frames of the plurality of frames and the first n phonemes of the plurality of phonemes,
the specifying means specifies one of the plurality of candidates as the correspondence between the m frames and the n phonemes, based on the scores calculated for the respective candidates by the score calculation means, and
the speech recognition device further comprises repetition means for repeating the processing of the score calculation means and the specifying means while changing at least one of the value of m and the value of n.
(Appendix 3)
The speech recognition device according to Appendix 2, wherein
the score calculation means calculates the score for each of, as the plurality of candidates for the correspondence between the m frames and the n phonemes, a first candidate in which the state stays the same from the (m-1)-th frame to the m-th frame of the plurality of frames, and a second candidate in which the state transitions from the (m-1)-th frame to the m-th frame, and
the specifying means specifies either the first candidate or the second candidate as the correspondence between the m frames and the n phonemes, based on the scores calculated for the first candidate and the second candidate by the score calculation means.
(Appendix 4)
The speech recognition device according to Appendix 2 or 3, wherein the score calculation means
calculates a score of the k-th phoneme of the n phonemes by accumulating a value based on the output probability of the k-th phoneme over the at least one frame corresponding to the k-th phoneme and normalizing the accumulated value by the number of the at least one frame, and
calculates the score of each of the plurality of candidates for the correspondence between the m frames and the n phonemes by accumulating the scores of the n phonemes calculated for each value of k from 1 to n.
(Appendix 5)
The speech recognition device according to Appendix 4, wherein the score calculation means calculates the score of the n-th phoneme of the n phonemes by using, as the at least one frame corresponding to the n-th phoneme, the frames that remain after excluding from the m frames the frames corresponding to the first through (n-1)-th phonemes of the n phonemes.
(Appendix 6)
The speech recognition device according to any one of Appendices 1 to 5, further comprising
feature amount calculation means for calculating the feature amount of the speech signal for each frame,
wherein the output probability acquisition means acquires the output probability based on the feature amount calculated by the feature amount calculation means.
(Appendix 7)
The speech recognition device according to any one of Appendices 1 to 6, further comprising:
speech signal acquisition means for acquiring the speech signal to be recognized; and
candidate word storage means storing a plurality of candidate words that are candidates for a word uttered in the speech signal acquired by the speech signal acquisition means,
wherein the output probability acquisition means acquires the output probability with each of the plurality of candidate words stored in the candidate word storage means as the target word,
the score calculation means calculates the score for each of the plurality of candidate words, based on the output probability acquired by the output probability acquisition means,
the specifying means specifies the correspondence for each of the plurality of candidate words, based on the score calculated by the score calculation means, and
the determination means determines one of the plurality of candidate words to be the word uttered in the speech signal, by comparing the scores of the correspondences specified for the respective candidate words by the specifying means.
(Appendix 8)
The speech recognition device according to any one of Appendices 1 to 6, wherein
the score calculation means calculates the score in each of a plurality of different sections of the speech signal to be searched, based on the output probability acquired by the output probability acquisition means,
the specifying means specifies the correspondence for each of the plurality of sections, based on the score calculated by the score calculation means, and
the determination means determines, from among the plurality of sections, the section in which the target word is uttered, by comparing the scores of the correspondences specified for the respective sections by the specifying means.
(Appendix 9)
A speech recognition method comprising:
an output probability acquisition step of acquiring, for each of a plurality of frames in a speech signal, an output probability with which a feature amount of the speech signal is output from each of a plurality of phonemes corresponding to a target word, the target word being subject to determination of whether it is uttered in the speech signal;
a score calculation step of calculating, based on the output probabilities acquired in the output probability acquisition step, a score indicating a likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
a specifying step of specifying one of the plurality of candidates as the correspondence, based on the scores calculated for the respective candidates in the score calculation step; and
a determination step of determining whether the target word is uttered in the speech signal, based on the correspondence specified in the specifying step,
wherein, in the score calculation step, the score is calculated by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the value over the plurality of frames.
(Appendix 10)
A program causing a computer to function as:
output probability acquisition means for acquiring, for each of a plurality of frames in a speech signal, an output probability with which a feature amount of the speech signal is output from each of a plurality of phonemes corresponding to a target word, the target word being subject to determination of whether it is uttered in the speech signal;
score calculation means for calculating, based on the output probabilities acquired by the output probability acquisition means, a score indicating a likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
specifying means for specifying one of the plurality of candidates as the correspondence, based on the scores calculated for the respective candidates by the score calculation means; and
determination means for determining whether the target word is uttered in the speech signal, based on the correspondence specified by the specifying means,
wherein the score calculation means calculates the score by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the value over the plurality of frames.
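The scoring and search described in Appendices 2 to 4 can be sketched as follows. This is only an illustration under stated assumptions: the "value based on the output probability" is taken to be a log probability (so a higher score is better), whereas the embodiment may use a distance for which lower is better; the names `alignment_score` and `search_alignment` and the sample numbers are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# A cell holds: (sum of finished phoneme scores, running sum for the current
# phoneme, number of frames on the current phoneme, frame-to-phoneme path).
Cell = Tuple[float, float, int, List[int]]


def alignment_score(values: Sequence[Sequence[float]], alignment: Sequence[int]) -> float:
    """Per-phoneme normalized score of a fixed alignment (cf. Appendix 4).

    values[m][n] is the assumed log-probability-based value of phoneme n at
    frame m; alignment[m] is the phoneme index assigned to frame m.
    """
    sums: dict = {}
    counts: dict = {}
    for m, n in enumerate(alignment):
        sums[n] = sums.get(n, 0.0) + values[m][n]
        counts[n] = counts.get(n, 0) + 1
    # Normalize each phoneme by its frame count, then accumulate over phonemes.
    return sum(sums[n] / counts[n] for n in sums)


def cell_score(cell: Cell) -> float:
    done, cur_sum, cur_count, _ = cell
    return done + cur_sum / cur_count


def search_alignment(values: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """For each (m, n), keep the better of the stay and transition candidates."""
    num_phonemes = len(values[0])
    prev: List[Optional[Cell]] = [None] * num_phonemes
    prev[0] = (0.0, values[0][0], 1, [0])
    for m in range(1, len(values)):
        cur: List[Optional[Cell]] = [None] * num_phonemes
        for n in range(num_phonemes):
            options: List[Cell] = []
            if prev[n] is not None:  # first candidate: the state stays
                done, s, c, path = prev[n]
                options.append((done, s + values[m][n], c + 1, path + [n]))
            if n > 0 and prev[n - 1] is not None:  # second candidate: transition
                done, s, c, path = prev[n - 1]
                options.append((done + s / c, values[m][n], 1, path + [n]))
            if options:
                cur[n] = max(options, key=cell_score)
        prev = cur
    final = prev[-1]
    if final is None:
        raise ValueError("fewer frames than phonemes")
    return cell_score(final), final[3]


# Hypothetical values: 4 frames x 2 phonemes.
values = [[-1.0, -5.0], [-1.0, -5.0], [-4.0, -1.0], [-4.0, -2.0]]
score, alignment = search_alignment(values)
```

Because each phoneme contributes its average value rather than its sum, a phoneme that spans many frames does not dominate the score. Running the same search once per candidate word and comparing the resulting scores would correspond to the determination described in Appendix 7.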

DESCRIPTION OF SYMBOLS: 10 ... speech recognition device, 11 ... control unit, 12 ... storage unit, 13 ... input unit, 14 ... output unit, 15 ... communication unit, 21 ... acoustic model storage unit, 22 ... candidate word storage unit, 30, 31, 32 ... distance table, 110 ... speech signal acquisition unit, 120 ... feature amount calculation unit, 130 ... output probability acquisition unit, 140 ... conversion unit, 150 ... search unit, 151 ... designation unit, 152 ... score calculation unit, 153 ... specifying unit, 154 ... repetition unit, 160 ... determination unit

Claims

A speech recognition device comprising:
output probability acquisition means for acquiring, for each of a plurality of frames in a speech signal, an output probability with which a feature amount of the speech signal is output from each of a plurality of phonemes corresponding to a target word, the target word being subject to determination of whether it is uttered in the speech signal;
score calculation means for calculating, based on the output probabilities acquired by the output probability acquisition means, a score indicating a likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
specifying means for specifying one of the plurality of candidates as the correspondence, based on the scores calculated for the respective candidates by the score calculation means; and
determination means for determining whether the target word is uttered in the speech signal, based on the correspondence specified by the specifying means,
wherein the score calculation means calculates the score by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the value over the plurality of frames.

The speech recognition device according to claim 1, wherein
the score calculation means calculates the score for each of a plurality of candidates for a correspondence between the first m frames of the plurality of frames and the first n phonemes of the plurality of phonemes,
the specifying means specifies one of the plurality of candidates as the correspondence between the m frames and the n phonemes, based on the scores calculated for the respective candidates by the score calculation means, and
the speech recognition device further comprises repetition means for repeating the processing of the score calculation means and the specifying means while changing at least one of the value of m and the value of n.

The speech recognition device according to claim 2, wherein
the score calculation means calculates the score for each of, as the plurality of candidates for the correspondence between the m frames and the n phonemes, a first candidate in which the state stays the same from the (m-1)-th frame to the m-th frame of the plurality of frames, and a second candidate in which the state transitions from the (m-1)-th frame to the m-th frame, and
the specifying means specifies either the first candidate or the second candidate as the correspondence between the m frames and the n phonemes, based on the scores calculated for the first candidate and the second candidate by the score calculation means.

The speech recognition device according to claim 2 or 3, wherein the score calculation means
calculates a score of the k-th phoneme of the n phonemes by accumulating a value based on the output probability of the k-th phoneme over the at least one frame corresponding to the k-th phoneme and normalizing the accumulated value by the number of the at least one frame, and
calculates the score of each of the plurality of candidates for the correspondence between the m frames and the n phonemes by accumulating the scores of the n phonemes calculated for each value of k from 1 to n.

The speech recognition device according to claim 4, wherein the score calculation means calculates the score of the n-th phoneme of the n phonemes by using, as the at least one frame corresponding to the n-th phoneme, the frames that remain after excluding from the m frames the frames corresponding to the first through (n-1)-th phonemes of the n phonemes.

The speech recognition device according to any one of claims 1 to 5, further comprising
feature amount calculation means for calculating the feature amount of the speech signal for each frame,
wherein the output probability acquisition means acquires the output probability based on the feature amount calculated by the feature amount calculation means.

The speech recognition device according to any one of claims 1 to 6, further comprising:
speech signal acquisition means for acquiring the speech signal to be recognized; and
candidate word storage means storing a plurality of candidate words that are candidates for a word uttered in the speech signal acquired by the speech signal acquisition means,
wherein the output probability acquisition means acquires the output probability with each of the plurality of candidate words stored in the candidate word storage means as the target word,
the score calculation means calculates the score for each of the plurality of candidate words, based on the output probability acquired by the output probability acquisition means,
the specifying means specifies the correspondence for each of the plurality of candidate words, based on the score calculated by the score calculation means, and
the determination means determines one of the plurality of candidate words to be the word uttered in the speech signal, by comparing the scores of the correspondences specified for the respective candidate words by the specifying means.

The speech recognition device according to any one of claims 1 to 6, wherein
the score calculation means calculates the score in each of a plurality of different sections of the speech signal to be searched, based on the output probability acquired by the output probability acquisition means,
the specifying means specifies the correspondence for each of the plurality of sections, based on the score calculated by the score calculation means, and
the determination means determines, from among the plurality of sections, the section in which the target word is uttered, by comparing the scores of the correspondences specified for the respective sections by the specifying means.

A speech recognition method comprising:
an output probability acquisition step of acquiring, for each of a plurality of frames in a speech signal, an output probability with which a feature amount of the speech signal is output from each of a plurality of phonemes corresponding to a target word, the target word being subject to determination of whether it is uttered in the speech signal;
a score calculation step of calculating, based on the output probabilities acquired in the output probability acquisition step, a score indicating a likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
a specifying step of specifying one of the plurality of candidates as the correspondence, based on the scores calculated for the respective candidates in the score calculation step; and
a determination step of determining whether the target word is uttered in the speech signal, based on the correspondence specified in the specifying step,
wherein, in the score calculation step, the score is calculated by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the value over the plurality of frames.

A program causing a computer to function as:
output probability acquisition means for acquiring, for each of a plurality of frames in a speech signal, an output probability with which a feature amount of the speech signal is output from each of a plurality of phonemes corresponding to a target word, the target word being subject to determination of whether it is uttered in the speech signal;
score calculation means for calculating, based on the output probabilities acquired by the output probability acquisition means, a score indicating a likelihood of a correspondence between the plurality of frames and the plurality of phonemes, for each of a plurality of candidates for the correspondence;
specifying means for specifying one of the plurality of candidates as the correspondence, based on the scores calculated for the respective candidates by the score calculation means; and
determination means for determining whether the target word is uttered in the speech signal, based on the correspondence specified by the specifying means,
wherein the score calculation means calculates the score by, in each of the plurality of candidates, normalizing a value based on the output probability of the phoneme corresponding to each frame by the number of frames corresponding to each phoneme, and accumulating the value over the plurality of frames.