JP4928514B2

JP4928514B2 - Speech recognition apparatus and speech recognition program

Info

Publication number: JP4928514B2
Application number: JP2008218059A
Authority: JP
Inventors: 亨今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2008-08-27
Filing date: 2008-08-27
Publication date: 2012-05-09
Anticipated expiration: 2028-08-27
Also published as: JP2010054685A

Description

The present invention relates to a speech recognition apparatus and a speech recognition program, and more particularly to a speech recognition apparatus and a speech recognition program that achieve highly accurate speech recognition by using the latest, continuously updated models.

Conventionally, the manuscript that an announcer reads aloud in a news program or the like is an electronic manuscript submitted by a reporter, printed by the director, and then revised by hand immediately before or during the broadcast according to the available broadcast time and the flow of the story.

In speech recognition used to produce captions for news programs, it is known that, in order to cope with new proper nouns and topics, it is important to reduce recognition errors by using the electronic manuscript underlying the read-aloud manuscript as adaptation training data for the language model (see, for example, Patent Document 1).

In a conventional caption production system using speech recognition, training the language model took about 8 minutes, for example, so the adaptation training data consisted of electronic manuscripts submitted no later than 10 minutes before the start of the broadcast (see, for example, Non-Patent Document 1).

A technique is also known for improving recognition accuracy by updating the acoustic model and the language model without placing a heavy burden on the user: an acoustic model management server acquires updated acoustic data, builds an acoustic model from it, and transmits the model over a network to a speech recognition apparatus, which then replaces the acoustic model it refers to during recognition with the transmitted one (see, for example, Patent Document 2). Patent Document 2 thus also notes the importance of up-to-date models in speech recognition.

Furthermore, if the only change is the addition of new words to the dictionary, the new words can be added to the word pronunciation dictionary network without stopping the running speech recognition decoder, reusing the N-gram probability assigned to unknown words (see, for example, Patent Document 3 and Non-Patent Document 2).

Patent Document 1: Japanese Patent No. 3836607
Non-Patent Document 1: Akio Ando et al., "Broadcast news caption production system using speech recognition," IEICE Trans., Vol. J84-D-II, No. 6, pp. 877-887, June 2001.
Patent Document 2: JP 2002-91477 A
Patent Document 3: JP 2002-207495 A
Non-Patent Document 2: Ryuichi Nishimura et al., "A study of speech recognition methods in the speech-input Web system w3voice," Proc. Acoustical Society of Japan Meeting, 1-10-17, pp. 51-52, March 2008.

However, in the conventional techniques described above, when a reporter's manuscript for a news item or the like was submitted later than the time at which speech recognition is ended, that electronic manuscript was not reflected in the language model, which was one cause of recognition errors. In addition, because a conventional speech recognition system generally has only one speech recognition decoder, even when the language model had been updated, the decoder had to be stopped once and then restarted manually to load the latest language model.

Consequently, in speech recognition for caption production, for example, stopping the running recognizer after the captioned broadcast has begun interrupts the captions, which is undesirable operationally. Restarting the speech recognition decoder after a language model update also took time and effort.

Moreover, in caption production for news programs, the entries of the vocabulary (60,000 words) are replaced as needed, and it was difficult to update the language model and the pronunciation dictionary dynamically while a single speech recognition decoder kept running.

In short, although the importance of up-to-date models in speech recognition has been noted in the prior art, as described above, that art also presupposes stopping and restarting speech recognition and does not allow operation to continue without interruption.

The present invention has been made in view of the above problems, and its object is to provide a speech recognition apparatus and a speech recognition program that achieve highly accurate speech recognition by using the latest, continuously updated models.

To solve the above problems, the present invention adopts means having the following features.

The invention according to claim 1 is a speech recognition apparatus that recognizes input speech and converts it into text, comprising: model learning means for learning, as needed, at least one of a language model, a pronunciation dictionary, an acoustic model, and speech recognition parameters; model update notification means for giving notice that the model learning means has updated a model to the latest model; acoustic analysis means for extracting acoustic features from the input speech; a plurality of speech recognition decoders that each read the acoustic features obtained by the acoustic analysis means, together with the language model, pronunciation dictionary, acoustic model, and speech recognition parameters stored in advance or updated by the model learning means, and perform speech recognition on the acoustic features; and decoder control means for selecting, from among the plurality of speech recognition decoders, the speech recognition decoder that performs the speech recognition on the basis of update information notified by the model update notification means, wherein the decoder control means starts a speech recognition decoder with the latest model concurrently with the speech recognition decoder already running with an old model, and switches to the speech recognition decoder with the latest model at the timing at which the start of an utterance is detected in the acoustic features obtained from the acoustic analysis means.

According to the invention of claim 1, highly accurate speech recognition can be achieved by using the latest, continuously updated models. Moreover, highly accurate speech recognition can be carried out continuously, always with the latest model.

The invention according to claim 2 is characterized in that the decoder control means causes all of the plurality of speech recognition decoders to load the latest model and restart, one after another, without interruption.

According to the invention of claim 2, the models can be updated quickly to the latest ones without interrupting speech recognition, and speech recognition can then be performed with the latest models.

The invention according to claim 3 is characterized in that, after the restart, the decoder control means has the speech recognition decoders take charge of recognizing the input speech in turn, handing over at the timing at which the start of an utterance is detected in the acoustic features obtained from the acoustic analysis means.

According to the invention of claim 3, the total processing time of complex, time-consuming speech recognition can be reduced while the latest model is used.

The invention according to claim 4 is characterized by comprising character correction means for correcting the speech recognition result and outputting the correction history to the model learning means for use as learning data.

According to the invention of claim 4, the likelihood that the same recognition error recurs can be reduced.

The invention according to claim 5 is a speech recognition program for causing a computer to execute speech recognition processing that recognizes input speech and converts it into text, the program causing the computer to function as: model learning means for learning, as needed, at least one of a language model, a pronunciation dictionary, an acoustic model, and speech recognition parameters; model update notification means for giving notice that the model learning means has updated a model to the latest model; acoustic analysis means for extracting acoustic features from the input speech; a plurality of speech recognition decoders that each read the acoustic features obtained by the acoustic analysis means, together with the language model, pronunciation dictionary, acoustic model, and speech recognition parameters stored in advance or updated by the model learning means, and perform speech recognition on the acoustic features; and decoder control means that selects, from among the plurality of speech recognition decoders, the speech recognition decoder that performs the speech recognition on the basis of update information notified by the model update notification means, starts a speech recognition decoder with the latest model concurrently with the speech recognition decoder already running with an old model, and switches to the speech recognition decoder with the latest model at the timing at which the start of an utterance is detected in the acoustic features obtained from the acoustic analysis means.

According to the invention of claim 5, highly accurate speech recognition can be achieved by using the latest, continuously updated models. In addition, the processing can be realized easily by installing the program.

According to the present invention, highly accurate speech recognition can be achieved by using the latest, continuously updated models.

<Outline of the present invention>
In addition to a speech recognition decoder that is already running with an old model loaded, the present invention concurrently starts another speech recognition decoder that loads the updated latest model, without interrupting speech recognition processing. By switching the decoder that performs recognition to the one with the latest model, highly accurate speech recognition is carried out continuously, always with the latest model.

Preferred embodiments of the speech recognition apparatus and speech recognition program according to the present invention are described below with reference to the drawings.

<Speech recognition apparatus: example functional configuration>
FIG. 1 shows an example of the functional configuration of the speech recognition apparatus according to this embodiment. The speech recognition apparatus 10 shown in FIG. 1 comprises acoustic analysis means 11, decoder control means 12, speech recognition decoders 13-1 and 13-2, character correction means 14, model learning means 15, model update notification means 16, storage means 17, and learning data 18.

The acoustic analysis means 11 extracts acoustic features from the input speech, for example frequency characteristics, sound power, and gender attributes. Features used in general speech recognition techniques can be employed: various features can be obtained by analyzing, for example, mel-frequency cepstral coefficients (MFCC) of about 12 dimensions that represent voice characteristics (see, for example, Shikano et al., "Speech Recognition System," Ohmsha, 2001), features that quantify the shape of the vocal tract such as linear prediction coefficients, prosodic features (pitch, intonation, etc.), and statistical information such as the mean and variance of those features. The acoustic analysis means 11 outputs the acoustic features obtained by the analysis to the decoder control means 12.
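
As a rough illustration of the simplest features named above (sound power and its mean and variance), the following Python sketch computes a log power value per frame. The 16 kHz sampling rate, the 400-sample frame, the 160-sample hop, and the function names are assumptions chosen for illustration and are not taken from the patent; MFCC extraction itself is not shown.

```python
import math

def frame_log_power(samples, frame_len=400, hop=160):
    """Split samples into overlapping frames and return the log power of each frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        powers.append(math.log(energy + 1e-10))  # small floor avoids log(0) on silence
    return powers

def mean_and_variance(values):
    """Return the mean and (population) variance of a list of feature values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

# Synthetic 16 kHz signal: 0.1 s of silence followed by 0.1 s of a 440 Hz tone.
signal = [0.0] * 1600 + [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
powers = frame_log_power(signal)
mean, var = mean_and_variance(powers)
```

The jump in log power between the silent frames and the tone frames is the kind of cue a controller could use to locate non-speech intervals and utterance starts.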

The decoder control means 12 performs the control needed to obtain recognition results, such as starting the plurality of speech recognition decoders 13 (speech recognition decoders 13-1 and 13-2 in the example of FIG. 1) and selecting which speech recognition decoder 13 performs decoding (converting speech to text).

Specifically, in accordance with update information notified by the model update notification means 16 described later, which indicates that a model has been updated, the decoder control means 12 selects from the plurality of speech recognition decoders 13 any decoder that is not currently performing speech recognition, has it load the latest model, and restarts it. The decoder control means 12 also sends the acoustic features to the speech recognition decoder 13 in charge of recognition and outputs the resulting recognition results, such as text, to the character correction means 14.

Each speech recognition decoder 13 has been started in advance by loading all of the latest language model, pronunciation dictionary, acoustic model, and speech recognition parameters that are, at that point, stored in the storage means 17 or updated by the model learning means 15 described later, and is ready to perform speech recognition.

On receiving acoustic features from the decoder control means 12, the speech recognition decoder 13 performs speech recognition sequentially and outputs recognition results such as text to the decoder control means 12. Although two speech recognition decoders are provided in the example of FIG. 1, the present invention is not limited to this, and three or more may be provided. For the sequential speech recognition, a conventional technique can be used, such as the early-decision method disclosed in Japanese Patent No. 3834169.

The character correction means 14 corrects the speech recognition result obtained by the decoder control means 12: through a check by a user or the like, automatic proofreading, or similar processing, correct characters are entered for errors such as a misrecognized personal name, and the corresponding part of the text is replaced. The character correction means 14 can also add or delete characters in response to input instructions.
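
Claim 4 states that the correction history is passed to the model learning means as learning data. One way such a history could be derived is to align the recognized word sequence with the corrected one and keep the spans that differ. The Python sketch below does this with the standard `difflib` module; the function name `correction_history`, the word-level alignment, and the sample words are illustrative assumptions, not the method described in the patent.

```python
import difflib

def correction_history(recognized, corrected):
    """Return (recognized_span, corrected_span) pairs for every edited span."""
    matcher = difflib.SequenceMatcher(a=recognized, b=corrected, autojunk=False)
    history = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete" or "insert"
            history.append((recognized[i1:i2], corrected[j1:j2]))
    return history

# Hypothetical recognition result in which a surname was split into two words.
recognized = ["reporter", "ta", "naka", "visited", "kyoto"]
corrected = ["reporter", "tanaka", "visited", "kyoto"]
history = correction_history(recognized, corrected)
```

Each pair records what the recognizer produced and what the operator replaced it with, which is the information a learning step would need in order to add the corrected word or reweight it.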

When the learning data 18 is updated with the latest text, speech, or the like, the model learning means 15 learns at least one of the language model, the pronunciation dictionary, the acoustic model, and the parameter file used for speech recognition (the speech recognition parameters) and updates the stored data to the latest version. Learning is triggered automatically or manually, for example at fixed intervals, when the data is updated, or at a changeover such as a program change, and runs completely asynchronously with the processing of the speech recognition decoders 13.

The speech recognition parameters consist of a list of variables that adjust the trade-off between recognition accuracy and processing speed, such as the maximum number of words to be retained during recognition and a weighting coefficient that balances the scores from the language model and the acoustic model.
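
To make the role of these two parameters concrete, the following Python sketch combines an acoustic log score and a language model log score with a weight and then keeps only a fixed number of best hypotheses. The names `lm_weight` and `max_active`, the numeric values, and the candidate strings are assumptions for illustration; the patent does not specify the scoring formula.

```python
import heapq

def combined_score(acoustic_logp, language_logp, lm_weight):
    """Weighted sum of log scores; a larger lm_weight trusts the language model more."""
    return acoustic_logp + lm_weight * language_logp

def prune(hypotheses, max_active):
    """Keep only the max_active best-scoring (text, score) hypotheses."""
    return heapq.nlargest(max_active, hypotheses, key=lambda h: h[1])

# Hypothetical parameter file contents.
params = {"lm_weight": 10.0, "max_active": 2}

# (text, acoustic log score, language model log score), all values made up.
candidates = [
    ("chikyu ondanka", -120.0, -2.0),
    ("chikyu on dan ka", -118.0, -6.0),
    ("shikyu ondanka", -119.0, -5.0),
]
scored = [(text, combined_score(am, lm, params["lm_weight"])) for text, am, lm in candidates]
kept = prune(scored, params["max_active"])
```

Raising `max_active` keeps more hypotheses alive, which tends to improve accuracy at the cost of speed, while `lm_weight` shifts the balance between the two models.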

As a result, the models stored in the storage means 17 are updated to the latest language model, pronunciation dictionary, acoustic model, and parameter file. The model learning means 15 also outputs to the model update notification means 16 a signal indicating that a model has been learned.

The model update notification means 16 outputs the update signal it receives from the model learning means 15 to the decoder control means 12. The update signal contains data identification information indicating which of the language model, pronunciation dictionary, acoustic model, and parameter file was updated, together with the update date, the update version, and the like.
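
The update signal described above can be pictured as a small record. The Python sketch below is one possible representation; the class names `ModelKind` and `UpdateSignal`, the field names, the integer version, and the `needs_reload` helper are hypothetical and only mirror the three pieces of information the text lists (which data was updated, the update date, and the update version).

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class ModelKind(Enum):
    LANGUAGE_MODEL = "language_model"
    PRONUNCIATION_DICTIONARY = "pronunciation_dictionary"
    ACOUSTIC_MODEL = "acoustic_model"
    PARAMETER_FILE = "parameter_file"

@dataclass(frozen=True)
class UpdateSignal:
    updated: frozenset      # which of the four data kinds were updated
    updated_at: datetime    # update date
    version: int            # update version (integer numbering is an assumption)

def needs_reload(signal, loaded_version):
    """A decoder needs a restart if the notified version is newer than the one it loaded."""
    return signal.version > loaded_version

signal = UpdateSignal(
    updated=frozenset({ModelKind.LANGUAGE_MODEL, ModelKind.PRONUNCIATION_DICTIONARY}),
    updated_at=datetime(2008, 8, 27, 18, 50),
    version=2,
)
```

A controller holding the version each decoder loaded can compare it with `signal.version` to decide which decoders still run an old model.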

Thus, in the embodiment shown in FIG. 1, for example, the decoder control means 12 first starts the speech recognition decoder 13-1 with the latest model learned from the learning data 18. When the next latest model is generated, the decoder control means 12 receives the update information from the model update notification means 16, starts the speech recognition decoder 13-2 with that latest model, and confirms that it is ready to recognize. It then switches the decoder that handles recognition from 13-1 to 13-2 at a predetermined timing based on, for example, the acoustic features obtained from the acoustic analysis means 11.

When the speech recognition decoders 13-1 and 13-2 both use the latest model, the decoder control means 12 can also use both, having them perform speech recognition alternately, one sentence at a time.
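
Alternating decoders sentence by sentence amounts to a round-robin assignment. A minimal Python sketch follows, assuming the input has already been segmented into sentences; the function name `assign_sentences` and the placeholder sentence labels are illustrative only.

```python
from itertools import cycle

def assign_sentences(sentences, decoder_names):
    """Pair each sentence with a decoder, cycling through the decoders in order."""
    return list(zip(sentences, cycle(decoder_names)))

assignment = assign_sentences(["s1", "s2", "s3"], ["13-1", "13-2"])
```

With two decoders, each one has the duration of the other's sentence to finish its own, which is where a reduction in total processing time could come from.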

The storage means 17 stores the data needed for the speech recognition processing of this embodiment; data is read from it as the speech recognition processing requires and written to it as the model learning means 15 requires. Specifically, the stored data are the models needed for speech recognition in the speech recognition decoders 13, stored in advance or updated automatically or manually, namely the language model, the pronunciation dictionary, the acoustic model, and the speech recognition parameters.

For the language model, a general N-gram model, which expresses as a probability how readily one word follows another, can be used. For example, the probability that the word "warming" (温暖化) follows the word "earth" (地球) can be expressed numerically, for instance as 0.8.
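
For the simplest case, N = 2, such a probability can be estimated as a ratio of counts. The Python sketch below uses an unsmoothed maximum-likelihood estimate on a five-sentence toy corpus constructed so that the result matches the 0.8 in the example; the romanized words, the corpus, and the function names are assumptions, and a practical language model would add smoothing for unseen word pairs.

```python
from collections import Counter

def train_bigram(sentences):
    """Count how often each word occurs as a history and each word pair occurs."""
    histories, pairs = Counter(), Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            histories[prev] += 1
            pairs[(prev, nxt)] += 1
    return histories, pairs

def bigram_prob(histories, pairs, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    if histories[prev] == 0:
        return 0.0
    return pairs[(prev, nxt)] / histories[prev]

corpus = [
    ["chikyu", "ondanka", "ga", "susumu"],
    ["chikyu", "ondanka", "taisaku"],
    ["chikyu", "ondanka", "kaigi"],
    ["chikyu", "ondanka", "mondai"],
    ["chikyu", "kankyo", "mondai"],
]
histories, pairs = train_bigram(corpus)
p = bigram_prob(histories, pairs, "chikyu", "ondanka")
```

Here "chikyu" is followed by "ondanka" in four of its five occurrences, so `p` is 4/5.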

The pronunciation dictionary is a file that describes the pronunciation of each word as a combination of vowels and consonants; for example, the pronunciation of the word "earth" (地球) is written as "/chikyu:/".

The acoustic model represents properties such as the frequency characteristics of the voice for each vowel and consonant, and can be expressed as a general hidden Markov model (HMM).

The speech recognition parameters are a list of variables that adjust the trade-off between recognition accuracy and processing speed, such as the maximum number of words to be retained during recognition and a weighting coefficient that balances the scores from the language model and the acoustic model. The component that learns the latest models may be included in the speech recognition system, or it may run externally and independently and transmit the updated models to the speech recognition system by some communication means.

The learning data 18 holds various data, such as text and speech, relating to a given domain. When that data is updated and its content relates to what is currently being recognized, or when the models are to be updated, the learning data 18 outputs notice of the update and the data itself to the model learning means 15.

<Method by which the decoder control means 12 updates and controls the speech recognition decoders 13>
The method by which the decoder control means 12 described above updates and controls the speech recognition decoders 13 is described next.

<Decoder update: Example 1>
The decoder control means 12 runs speech recognition decoders 13 concurrently and switches to the decoder with the latest model without interruption. For example, when the decoder control means 12 receives from the model update notification means 16 update information indicating that the language model and the pronunciation dictionary have been updated automatically (or manually) from the most recently submitted electronic manuscript, and the speech recognition decoder 13-1 is performing speech recognition, it starts the speech recognition decoder 13-2 separately with the latest model.

After confirming that the speech recognition decoder 13-2 is ready to recognize, the decoder control means 12 switches the recognition target from the speech recognition decoder 13-1 to the speech recognition decoder 13-2 at a predetermined timing, for example during a non-speech interval in the input speech detected by the acoustic analysis means 11.
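
The switching behavior of Example 1 can be sketched in Python as follows. This is a simplified model under stated assumptions, not the patented implementation: `Decoder` is a stand-in that only tags each utterance with its model version, model loading is instantaneous, and each call to `recognize` is treated as one utterance so that a switch can only happen at an utterance boundary. All class and method names are hypothetical.

```python
class Decoder:
    """Stand-in for a speech recognition decoder; it does not recognize real speech."""

    def __init__(self, name):
        self.name = name
        self.model_version = None
        self.ready = False

    def start(self, model_version):
        # A real decoder would load the language model, dictionary and so on here.
        self.model_version = model_version
        self.ready = True

    def recognize(self, utterance):
        return f"{utterance}@v{self.model_version}"


class DecoderController:
    def __init__(self, decoders, model_version):
        self.decoders = decoders
        self.active = decoders[0]
        self.active.start(model_version)
        self.pending = None

    def on_model_update(self, model_version):
        # Start a decoder that is not currently recognizing with the new model.
        standby = next(d for d in self.decoders if d is not self.active)
        standby.ready = False
        standby.start(model_version)
        self.pending = standby

    def on_utterance_start(self):
        # Switch only at an utterance boundary, and only once the standby is ready.
        if self.pending is not None and self.pending.ready:
            self.active, self.pending = self.pending, None

    def recognize(self, utterance):
        self.on_utterance_start()
        return self.active.recognize(utterance)


controller = DecoderController([Decoder("13-1"), Decoder("13-2")], model_version=1)
results = [controller.recognize("u1")]
controller.on_model_update(2)   # notification from the model update notifier
results.append(controller.recognize("u2"))
```

Because the old decoder keeps serving until the new one reports ready, there is no point at which no decoder is available, which is the property the text emphasizes.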

By repeating the model update and the starting, selection, and switching of decoders in the same way thereafter, the decoder control means 12 can continue speech recognition with the latest model.

<Decoder update: Example 2>
The decoder control means 12 starts all of the plurality of speech recognition decoders 13 in advance with the latest model at that time and performs speech recognition while switching freely among them at a predetermined timing based on the acoustic features obtained from the input speech (for example, every sentence, every news item, every program, or every fixed period).

Next, when the speech recognition models are updated to the latest ones, one of the running speech recognition decoders is stopped and then started again after its models have been updated to the latest state. The speech recognition decoders not yet updated are likewise updated and restarted one after another.

A speech recognition decoder that is being updated performs no speech recognition during the update and resumes recognition with the latest model after restarting. Running a plurality of speech recognition decoders in parallel in this way reduces the load on each decoder even when speech that takes a long time to recognize is input, and improves overall recognition throughput.
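
The one-at-a-time restart of Example 2 resembles what is often called a rolling update. The Python sketch below records, for each step, which decoders remain in service while one is being updated; the `Decoder` record, the `rolling_update` function, and the integer versions are illustrative assumptions, and loading a model is reduced to assigning a version number.

```python
from dataclasses import dataclass

@dataclass
class Decoder:
    name: str
    model_version: int
    ready: bool = True

def rolling_update(decoders, new_version):
    """Restart decoders one at a time and report who stayed in service at each step."""
    steps = []
    for target in decoders:
        target.ready = False  # stopped: takes no recognition work during the update
        in_service = [d.name for d in decoders if d.ready]
        if not in_service:
            raise RuntimeError("no decoder left to keep recognizing")
        target.model_version = new_version  # stand-in for loading the latest model
        target.ready = True                 # restarted with the latest model
        steps.append((target.name, in_service))
    return steps

decoders = [Decoder("13-1", 1), Decoder("13-2", 1)]
steps = rolling_update(decoders, new_version=2)
```

The check on `in_service` makes explicit why at least two decoders are needed: with a single decoder the update would leave nothing running.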

According to the embodiment described above, a speech recognition apparatus can be provided that achieves highly accurate speech recognition by using the latest, continuously updated models. By switching the decoder that performs recognition to one with the latest model without interrupting speech recognition, highly accurate speech recognition can be carried out continuously, always with the latest model.

Thus, for example, when speech recognition is used to add captions in real time to a live television program, the latest models can always be loaded and highly accurate speech recognition carried out continuously, without stopping recognition even for a moment, even while recognition is already in operation during the broadcast.

<Hardware configuration>
The speech recognition apparatus 10 can perform the speech recognition processing of the present invention with a dedicated apparatus configuration as described above. Alternatively, the processing can be realized by creating an execution program that causes a computer to execute the speech recognition processing of each component and installing that program on, for example, a general-purpose personal computer or workstation.

An example hardware configuration of a computer that can carry out the present invention is described with reference to the drawings. FIG. 2 shows an example of a hardware configuration that can realize the speech recognition processing of the present invention.

The computer body in FIG. 2 comprises an input device 21, an output device 22, a drive device 23, an auxiliary storage device 24, a memory device 25, a CPU (Central Processing Unit) 26 that performs various kinds of control, and a network connection device 27, which are interconnected by a system bus B.

The input device 21 has a keyboard and a pointing device such as a mouse operated by the user, and inputs various operation signals from the user, such as an instruction to execute a program. The output device 22 has a display that shows the various windows, data, and the like needed to operate the computer main body that performs the speech recognition processing of the present invention, and can display the progress and results of program execution under the control program of the CPU 26.

In the present invention, the execution program installed on the computer main body is provided by, for example, a portable recording medium 28 such as a USB (Universal Serial Bus) memory or a CD-ROM. The recording medium 28 on which the program is recorded can be set in the drive device 23, and the execution program contained in the recording medium 28 is installed from the recording medium 28 into the auxiliary storage device 24 via the drive device 23.

The auxiliary storage device 24 is a storage means such as a hard disk. It stores the execution program of the present invention, the control programs provided in the computer, and the like, and can input and output them as needed.

The memory device 25 stores the execution program and the like read from the auxiliary storage device 24 by the CPU 26. The memory device 25 consists of ROM (Read Only Memory), RAM (Random Access Memory), and the like.

The CPU 26 controls the processing of the entire computer, such as various computations and data input/output with each hardware component, based on a control program such as an OS (Operating System) and the execution program read into and stored in the memory device 25, and thereby realizes each process of the speech recognition described above. The various kinds of information needed during program execution can be obtained from, and stored in, the auxiliary storage device 24.

By connecting to a communication network or the like, the network connection device 27 can obtain the execution program from another terminal connected to the communication network, and can provide the execution results obtained by running the program, or the execution program of the present invention itself, to other terminals.

With the hardware configuration described above, highly accurate speech recognition processing can be realized at low cost without requiring a special device configuration. In addition, the speech recognition processing can be realized easily by installing the execution program (the speech recognition program or the like).

<Speech recognition processing procedure>
Next, an example of the speech recognition processing procedure in this embodiment is described with reference to flowcharts. In the following description, it is assumed that up to D speech recognition decoders can be started in the overall processing of the speech recognition apparatus that supports automatic model updating. Two groups of processing run asynchronously and in parallel: the learning and updating of the models in response to updates of the learning data, and the additional starting of speech recognition decoders together with the selection and switching of the decoder that performs recognition. They are therefore described separately.

<Model learning and update processing>
First, the model learning and update procedure in this embodiment is described with reference to a flowchart. FIG. 3 is a flowchart showing an example of the model learning and update procedure in this embodiment.

In FIG. 3, when the model learning means receives, from outside or elsewhere, an update event indicating that learning data such as text or speech has been updated (S01), it learns the latest model for the models stored in the storage means (language model, pronunciation dictionary, acoustic model, and various parameter files) and updates the models (S02).

Next, the model update notification means notifies the decoder control means of a model update event (model update information) indicating that the model has been updated (S03). Based on the update information obtained in S03, the decoder control means selects an updatable speech recognition decoder from among the plurality of speech recognition decoders, has the selected decoder read the latest model, and starts it (S04). The decoder updated at this point is either a speech recognition decoder that has not yet been started as a spare, or a decoder selected in turn, in a predetermined order, from among the decoders running in parallel; the selected decoder is stopped, updated, and restarted.
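Steps S01 to S04 can be sketched as a small event chain. The following Python sketch is illustrative only: the class names `Decoder`, `DecoderControl`, and `ModelLearner`, the integer model version, and the rule "prefer an unstarted spare, otherwise take the next decoder in numeric order" are assumptions made for this example, not details given in the description. Actual model training is replaced by a version counter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decoder:
    """Hypothetical stand-in for one speech recognition decoder."""
    number: int
    model_version: Optional[int] = None  # None means "not started yet"

    def start(self, model_version: int) -> None:
        # Stop (if running), read the given model, and (re)start.
        self.model_version = model_version


@dataclass
class DecoderControl:
    """Hypothetical stand-in for the decoder control means."""
    decoders: List[Decoder]
    active: int = 0  # number of the decoder currently performing recognition

    def on_model_updated(self, model_version: int) -> int:
        # S04: choose an updatable decoder, never the one that is recognizing.
        n = len(self.decoders)
        if n < 2:
            raise ValueError("at least two decoders are needed")
        order = [(self.active + k) % n for k in range(1, n)]
        spares = [i for i in order if self.decoders[i].model_version is None]
        target = spares[0] if spares else order[0]
        self.decoders[target].start(model_version)
        return target


@dataclass
class ModelLearner:
    """Hypothetical stand-in for the model learning and notification means."""
    control: DecoderControl
    version: int = 0

    def on_learning_data_updated(self) -> int:
        # S01: update event received. S02: learn and store the latest model
        # (represented here only by incrementing a version number).
        self.version += 1
        # S03: notify the decoder control means of the model update.
        return self.control.on_model_updated(self.version)


decoders = [Decoder(0, model_version=0), Decoder(1), Decoder(2)]
control = DecoderControl(decoders)
learner = ModelLearner(control)
first = learner.on_learning_data_updated()   # loads version 1 into a spare
second = learner.on_learning_data_updated()  # loads version 2 into the other spare
third = learner.on_learning_data_updated()   # no spare left: reuses a running one
```

In this sketch the decoder that is currently recognizing is never touched by an update; switching the recognition target is a separate step.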

To make the processing described below concrete, the updated latest model is denoted d', where d' = (d + 1) % D. That is, d' is the remainder obtained by adding 1 to the number d that identifies a speech recognition decoder and dividing the result by D (the maximum number of speech recognition decoders that can be started).
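The remainder operation makes the decoder numbers wrap around. A minimal Python illustration follows; the function name `next_decoder_number` and the choice D = 3 are assumptions made for this example.

```python
def next_decoder_number(d: int, max_decoders: int) -> int:
    """Return d' = (d + 1) % D for decoder number d and maximum decoder count D."""
    if max_decoders < 1:
        raise ValueError("max_decoders must be at least 1")
    return (d + 1) % max_decoders


# With D = 3 the numbers cycle 0 -> 1 -> 2 -> 0 -> 1.
sequence = [0]
for _ in range(4):
    sequence.append(next_decoder_number(sequence[-1], 3))
```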

<Additional starting of a speech recognition decoder and recognition target switching processing>
Next, the additional starting of a speech recognition decoder and the recognition target switching processing are described with reference to a flowchart.

FIG. 4 is a flowchart showing an example of the additional starting of a speech recognition decoder and the recognition target switching processing. In the processing shown in FIG. 4, when the operation of the entire speech recognition apparatus begins, initial values are set first (S11). Specifically, the speech recognition decoder number d is set to 0, the latest model at that time (language model, pronunciation dictionary, acoustic model, parameter files, and so on) is read, and speech recognition decoder d starts (S12).

When the speech to be recognized starts to be input (S13), extraction of acoustic features by acoustic analysis begins (S14), and the start of an utterance, for example of a human voice, is detected from them (S15).

If the model learning and update processing, which runs asynchronously and in parallel with speech recognition, is taking place, then as shown in FIG. 3 described above, the models (language model, pronunciation dictionary, acoustic model, and parameter files) are learned and updated, a model update event is notified to the decoder control means 12, and speech recognition decoder d' is started with the latest model, where d' = (d + 1) % D is the remainder obtained by adding 1 to the decoder number d and dividing by D.

In this state, the decoder control means 12 determines whether speech recognition decoder d' has already been started (S16). If it has (YES in S16), the decoder control means 12 stops speech recognition decoder d, replaces speech recognition decoder d with speech recognition decoder d', and switches the number d of the decoder in charge of speech recognition to d' (S17). If speech recognition decoder d' has not been started (NO in S16), the decoder number d remains unchanged.

The decoder control means 12 then receives the acoustic features of the input speech from the acoustic analysis means and sends them to speech recognition decoder d (S18). Speech recognition decoder d searches for the correct words (S19) and sends the resulting character string to the decoder control means 12 (S20). The decoder control means 12 outputs the recognized character string to the outside (S21), where it is used by applications such as caption production for live broadcast programs.

Next, it is determined whether the input signal has reached the end of the utterance (S22). If it has not (NO in S22), processing returns to S18, where the decoder control means 12 receives acoustic features and sends them to speech recognition decoder d, and speech recognition is repeated until the end of the utterance without changing the speech recognition decoder.

If the input speech has reached the end of the utterance (YES in S22), it is then determined whether to end the entire speech recognition process (S23). If speech recognition is not to end (NO in S23), processing returns to the detection of the utterance start in S15 so that speech recognition continues, and the subsequent steps are repeated until speech recognition ends. If speech recognition is to end (YES in S23), the entire speech recognition process ends.
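The control flow of S11 to S23 can be simulated in a few lines. The following Python sketch is illustrative only: the function `recognize_stream`, the representation of frames as strings, and the `updates` schedule (which stands in for the asynchronous model update process) are assumptions made for this example. It shows the key property of the flowchart: a newly started decoder takes over only at the start of the next utterance, never in the middle of one.

```python
from typing import Dict, List, Sequence, Tuple


def recognize_stream(
    utterances: Sequence[Sequence[str]],
    updates: Dict[Tuple[int, int], int],
    max_decoders: int = 2,
) -> List[Tuple[int, int, str]]:
    """Simulate which decoder and model version handles each frame.

    utterances: each utterance is a sequence of feature frames (strings here).
    updates: maps (utterance index, frame index) to the model version that
        becomes ready in standby decoder d' just before that frame is handled.
    Returns one (decoder number, model version, frame) tuple per frame.
    """
    if max_decoders < 2:
        raise ValueError("switching needs at least two decoders")
    d = 0                # S11: initial decoder number
    running = {0: 0}     # S12: decoder 0 starts with model version 0
    output: List[Tuple[int, int, str]] = []
    for u, frames in enumerate(utterances):      # S15: utterance start detected
        d_next = (d + 1) % max_decoders
        if d_next in running:                    # S16: is decoder d' started?
            del running[d]                       # S17: stop d and switch to d'
            d = d_next
        for f, frame in enumerate(frames):       # S18 to S22: until utterance end
            if (u, f) in updates:
                # Asynchronous update: decoder d' is started with the new model,
                # but decoder d keeps recognizing the current utterance.
                running[(d + 1) % max_decoders] = updates[(u, f)]
            output.append((d, running[d], frame))  # S19 to S21, recognition stubbed
    return output                                # S23: input exhausted


result = recognize_stream(
    utterances=[["a1", "a2"], ["b1"], ["c1"]],
    updates={(0, 1): 1},  # model version 1 becomes ready during utterance 0
)
```

Here utterance 0 is recognized entirely by decoder 0 with the old model, even though the new model becomes ready before its second frame; decoder 1 takes over from utterance 1 onward.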

The processing procedure described above makes it possible to provide a speech recognition program that realizes highly accurate speech recognition using the continuously updated latest model. Specifically, by switching the speech recognition decoder that performs recognition to one with the latest model without interrupting speech recognition, highly accurate speech recognition is realized continuously, always with the latest model. For example, when speech recognition is used to add captions in real time to a live television program, the latest model can always be loaded and highly accurate speech recognition can continue without stopping even for a moment, even while speech recognition is already in operation during the broadcast.

<Detection of utterance start and end>
In the processing described above, the detection of the start and end of an utterance performed during acoustic analysis can use, for example, online utterance segment detection with little time delay based on endless phoneme recognition (see, for example, Japanese Unexamined Patent Application Publication No. 2007-233148). An outline is given below.

<Utterance segment detection by phoneme recognition>
In utterance segment detection for real-time speech recognition, what matters more than fine frame-by-frame speech/non-speech decisions is to keep the loss of speech segments to a minimum, even if some non-speech segments are mistaken for speech, and to cut the speech into segments of suitable length so as to help improve the recognition rate. For caption display, the delay from speech input to detection of the start and end of speech must also be as short as possible.

For example, speech recognition in a caption production system takes into account not only the power of the sound but also its frequency characteristics: phoneme recognition with gender-dependent acoustic models, male and female in parallel, is run endlessly, and utterance segments are detected from the likelihoods obtained. Because phoneme recognition can be applied regardless of the task, it is simpler than methods that use a task-dependent language model, and the computational cost is hardly a problem even when the male and female acoustic models are run in parallel.

In this embodiment, therefore, parallel male/female phoneme recognition, which allows transitions between the male and female models and shares pruning between them, is run at all times, and the start and end of an utterance are detected early using the ratio of cumulative phoneme likelihoods. In a speech segment detection experiment on news programs, the FRR (False Rejection Rate: the proportion of speech segments erroneously judged to be non-speech) was 4.6% for the conventional method based on short-time power, whereas it was very small, 0.53%, for the method described above. The delay until detection of the start and end of an utterance was also confirmed to be sufficiently short.
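The FRR figure can be illustrated with a short calculation. The following Python sketch assumes a frame-level evaluation, in which each frame carries a reference speech/non-speech label and a detector decision; the description does not state the unit in which the experiment measured FRR, so this granularity, the function name, and the toy data are assumptions.

```python
from typing import Sequence


def false_rejection_rate(reference: Sequence[bool], detected: Sequence[bool]) -> float:
    """Fraction of reference speech frames that the detector labeled non-speech."""
    if len(reference) != len(detected):
        raise ValueError("label sequences must have the same length")
    speech_frames = sum(1 for r in reference if r)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")
    missed = sum(1 for r, h in zip(reference, detected) if r and not h)
    return missed / speech_frames


# Toy data: 4 reference speech frames, of which the detector misses the first.
reference = [False, True, True, True, True, False]
detected = [False, False, True, True, True, True]
frr = false_rejection_rate(reference, detected)
```

The last frame, non-speech detected as speech, does not affect FRR; this matches the stated priority of avoiding lost speech over avoiding false alarms.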

The utterance start and end detection described above can operate with any known utterance segment detection method, and the correct-word search in speech recognition decoder d can likewise operate with any known speech recognition method.

<Specific example of the speech recognition processing>
Next, a specific example of the speech recognition processing described above is explained with reference to the drawings. FIG. 5 is a diagram showing a specific example to which the speech recognition method of this embodiment is applied. FIG. 5 shows an example of a caption production system 30 that uses the speech recognition apparatus 10. Specifically, the caption production system 30 switches between inputs such as program audio for the direct method (for example, an announcer reading a script or a reporter's on-site report) and re-spoken audio for the re-speak method (for example, interviews), and feeds the input speech obtained by A/D conversion and the like into the speech recognition apparatus 10 described above.

Using various model data, such as the language model and pronunciation dictionary learned as needed from the electronic news manuscripts that serve as learning data and a speaker-independent acoustic model, the speech recognition apparatus 10 performs speech recognition with speech recognition decoders A and B using a male HMM and a female HMM. The captions are then checked and corrected, and the character string is displayed on the caption screen.

By applying the speech recognition method of the present invention in this way, the caption production system 30 can switch the speech recognition decoder that performs recognition to one with the latest model without interrupting speech recognition at all, so that highly accurate speech recognition is realized continuously, always with the latest model. In other words, users who run speech recognition (program producers and the like) need not be concerned with whether the model is the latest one; speech recognition is guaranteed to be running automatically with the latest model at all times.

<Comparison of this embodiment with the conventional method>
To examine the effect of the speech recognition apparatus that supports automatic model updating, the following describes the results of a caption production experiment using speech recognition (with real-time manual correction of recognition errors). The experiment compared the case in which the electronic manuscripts corresponding to each news item in the broadcast program were not used for adaptive learning (the model from one hour before the broadcast) with the case in which they were (the model learned immediately before the broadcast).

FIG. 6 is a diagram showing an example of the effect of updating. As an example, FIG. 6 shows the effect of updating the language model and the pronunciation dictionary. In experiments on two news programs, as shown in FIG. 6, adaptive learning of the language model (vocabulary size 60,000 words) on the related manuscripts greatly reduced the test-set perplexity, which is an index of the complexity of the language model, and the unknown word rate (the proportion of words not registered in the pronunciation dictionary), raised the trigram hit rate (the coverage of the language model), and cut speech recognition errors to about one third. Even with the model from one hour before the broadcast, almost no errors remained in the captions, but proper nouns such as personal names became unknown words and manual correction took considerable effort. The decoder control method of the present invention, with which errors are unlikely to remain in the captions and the caption display delay is small, can therefore be said to be preferable in operation.
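The three measures named above can be illustrated with a short Python sketch. The function names, the toy word sequence, and the reading of "trigram hit rate" as the fraction of test trigrams present in the model are assumptions made for this example; perplexity is computed in the standard way as the exponential of the negative mean log probability per word. None of the numbers below come from FIG. 6.

```python
import math
from typing import Sequence, Set, Tuple


def unknown_word_rate(words: Sequence[str], vocabulary: Set[str]) -> float:
    """Fraction of test words that are not registered in the vocabulary."""
    return sum(1 for w in words if w not in vocabulary) / len(words)


def trigram_hit_rate(words: Sequence[str],
                     known_trigrams: Set[Tuple[str, str, str]]) -> float:
    """Fraction of test trigrams that the language model has an entry for."""
    trigrams = list(zip(words, words[1:], words[2:]))
    return sum(1 for t in trigrams if t in known_trigrams) / len(trigrams)


def perplexity(word_probabilities: Sequence[float]) -> float:
    """exp of the negative mean log probability the model gives each test word."""
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


# Toy test sentence in which the proper noun "tokyo" is an unknown word.
words = ["the", "mayor", "of", "tokyo", "spoke"]
oov = unknown_word_rate(words, {"the", "mayor", "of", "spoke"})
hits = trigram_hit_rate(words, {("the", "mayor", "of")})
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Adding a news manuscript that contains the missing proper noun to the learning data would lower the unknown word rate and raise the trigram hit rate in this toy setting, which is the direction of change reported for the adapted model.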

As described above, the present invention makes it possible to realize highly accurate speech recognition using the continuously updated latest model. Furthermore, by automatically switching the speech recognition decoder in charge of speech recognition to one with the latest model without interrupting speech recognition at all, the present invention realizes highly accurate speech recognition continuously, always with the latest model.

Specifically, by switching the speech recognition decoder that performs recognition to one with the latest model without interrupting speech recognition, the present invention realizes highly accurate speech recognition continuously, always with the latest model. For example, when speech recognition is used to add captions in real time to a live television program, the latest model can always be loaded and highly accurate speech recognition can continue without stopping even for a moment, even while speech recognition is already in operation during the broadcast.

Preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention set forth in the claims.

FIG. 1 is a diagram showing an example of the functional configuration of the speech recognition apparatus in this embodiment.
FIG. 2 is a diagram showing an example of a hardware configuration capable of realizing the speech recognition processing of the present invention.
FIG. 3 is a flowchart showing an example of the model learning and update procedure in this embodiment.
FIG. 4 is a flowchart showing an example of the additional starting of a speech recognition decoder and the recognition target switching processing.
FIG. 5 is a diagram showing a specific example to which the speech recognition method of this embodiment is applied.
FIG. 6 is a diagram showing an example of the effect of updating.

Explanation of symbols

10 Speech recognition apparatus
11 Acoustic analysis means
12 Decoder control means
13 Speech recognition decoder
14 Character correction means
15 Model learning means
16 Model update notification means
17 Storage means
18 Learning data
21 Input device
22 Output device
23 Drive device
24 Auxiliary storage device
25 Memory device
26 CPU
27 Network connection device
28 Recording medium
30 Caption production system

Claims

A speech recognition apparatus that recognizes input speech and converts it into text, comprising:
model learning means for learning, as needed, at least one of a language model, a pronunciation dictionary, an acoustic model, and speech recognition parameters;
model update notification means for notifying that the model has been updated to the latest model by the model learning means;
acoustic analysis means for extracting acoustic features of the input speech;
a plurality of speech recognition decoders that read the acoustic features obtained by the acoustic analysis means and the language model, pronunciation dictionary, acoustic model, and speech recognition parameters that are stored in advance or updated by the model learning means, and perform speech recognition on the acoustic features; and
decoder control means for selecting, from among the plurality of speech recognition decoders, a speech recognition decoder to perform speech recognition based on the update information notified by the model update notification means,
wherein the decoder control means starts a speech recognition decoder with the latest model concurrently, in addition to the speech recognition decoder started with the old model, and switches to the speech recognition decoder with the latest model at the timing when an utterance start point contained in the acoustic features obtained from the acoustic analysis means is detected.

The speech recognition apparatus according to claim 1, wherein the decoder control means causes all of the plurality of speech recognition decoders to read the latest model and restart one after another, without interruption.

The speech recognition apparatus according to claim 2, wherein, after the restart, the decoder control means causes the respective speech recognition decoders to recognize the input speech in turn, at the timing when an utterance start point contained in the acoustic features obtained from the acoustic analysis means is detected.

The speech recognition apparatus according to any one of claims 1 to 3, further comprising character correction means for correcting the speech recognition result and outputting the correction history information to the model learning means for use as learning data.

A speech recognition program for causing a computer to execute speech recognition processing that recognizes input speech and converts it into text, the program causing the computer to function as:
model learning means for learning, as needed, at least one of a language model, a pronunciation dictionary, an acoustic model, and speech recognition parameters;
model update notification means for notifying that the model has been updated to the latest model by the model learning means;
acoustic analysis means for extracting acoustic features of the input speech;
a plurality of speech recognition decoders that read the acoustic features obtained by the acoustic analysis means and the language model, pronunciation dictionary, acoustic model, and speech recognition parameters that are stored in advance or updated by the model learning means, and perform speech recognition on the acoustic features; and
decoder control means that selects, from among the plurality of speech recognition decoders, a speech recognition decoder to perform speech recognition based on the update information notified by the model update notification means, starts a speech recognition decoder with the latest model concurrently, in addition to the speech recognition decoder started with the old model, and switches to the speech recognition decoder with the latest model at the timing when an utterance start point contained in the acoustic features obtained from the acoustic analysis means is detected.