JP5243886B2

JP5243886B2 - Subtitle output device, subtitle output method and program

Info

Publication number: JP5243886B2
Application number: JP2008207407A
Authority: JP
Inventors: 憲石原; 誠庄境
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2008-08-11
Filing date: 2008-08-11
Publication date: 2013-07-24
Anticipated expiration: 2028-08-11
Also published as: JP2010044171A

Description

The present invention relates to a subtitle output device, a subtitle output method, and a program that output subtitles in time with the announcer's speech when a real-time broadcast uses a single manuscript for both the announcement and the subtitles.

In recent years, partly because the broadcasting industry recommends displaying subtitles on digital broadcast programs, the number of subtitled programs has been increasing. When a subtitled program is prerecorded, broadcast program data with subtitles already added can be created and stored on a recording medium in advance. For a real-time (live) broadcast such as a news program, however, the subtitles must be sent out in real time, in step with the announcer's utterances.
FIG. 12 shows the subtitle transmission mechanism conventionally used for real-time broadcast programs such as news. As shown in the figure, while the announcer reads the news manuscript into a microphone, a subtitle operator listens to the announcer's voice to judge the display timing of each subtitle and, when that timing arrives, performs an operation such as pressing a button on a subtitle switching device. The video captured by the camera, the audio collected by the microphone, and the subtitles prepared in advance in the subtitle switching device are then multiplexed by a multiplexer and sent to receivers over a communication line.

With this mechanism, the subtitle display timing lags the audio by at least about 3 to 5 seconds, depending on the skill of the subtitle operator. Viewers therefore see each subtitle well after hearing the corresponding speech, which feels unnatural. In addition, because the subtitles are displayed manually, an operating mistake may cause the wrong subtitle to be displayed.
For a prerecorded program such as a drama, by contrast, data in which the audio and subtitles are synchronized can be prepared in advance, so delayed or erroneous subtitle display during broadcasting can be prevented (see, for example, Patent Document 1). The automatic subtitled program production system described in Patent Document 1 generates presentation unit subtitle sentences from a text, performs speech recognition against the announcement audio for each presentation unit subtitle sentence, detects start-point and end-point timing information as synchronization points, and attaches the detected timing information to each presentation unit subtitle sentence. At broadcast time, the audio and subtitles can then be synchronized based on the attached timing information.
Patent Document 1: JP 2000-270263 A

If the technique for prerecorded programs described in Patent Document 1 were applied to a real-time broadcast program, a presentation unit subtitle sentence would be sent out only after speech recognition had been performed between the entire presentation unit subtitle sentence and the announcement audio and the start-point and end-point timing information had been detected as synchronization points. In other words, a presentation unit subtitle sentence would be displayed only after the corresponding spoken announcement had finished, so that, in principle, a large delay on the order of a whole presentation unit subtitle sentence would occur.
Moreover, the technique of Patent Document 1 does not anticipate cases in which silent intervals (pauses) such as the announcer's breaths do not occur as expected, the announcer misreads or skips part of the manuscript, or noise is mixed in. In such cases the presentation unit subtitle sentence corresponding to the audio may not be recognized correctly. In a prerecorded broadcast the error can be corrected before airing, but in a real-time broadcast the wrong presentation unit subtitle sentence is displayed before there is any time to correct it.

The present invention has been made in view of the conventional problems described above, and provides a subtitle output device, a subtitle output method, and a program capable of outputting subtitles with little delay relative to the audio in a real-time broadcast.
It also provides a subtitle output device, a subtitle output method, and a program capable of outputting the subtitles corresponding to the audio accurately and without error.

To achieve the above object, the invention according to claim 1 provides a subtitle output device that outputs subtitles in time with speech, comprising: subtitle unit sentence generation means for generating subtitle unit sentences by dividing an input text into subtitle output units; speech recognition unit sentence generation means for generating speech recognition unit sentences by dividing the text into speech recognition processing units; speech recognition network generation means for generating a speech recognition network by concatenating recognition candidate units, each of which is a set of recognition candidates for recognizing a phrase of a speech recognition unit sentence generated by the speech recognition unit sentence generation means, in order starting from the unit corresponding to the first phrase of the speech recognition unit sentence; speech recognition means for performing speech recognition processing by sequentially matching, from the beginning, the speech in which the text is uttered against the recognition candidate units constituting the speech recognition network generated by the speech recognition network generation means; and subtitle unit sentence output means for outputting the subtitle unit sentences at a predetermined timing. The speech recognition processing means performs the speech recognition processing in parallel using two or more speech recognition networks generated by the speech recognition network generation means. The speech recognition network generation means includes subtitle head detection network generation means for generating a subtitle head detection network that contains at least the recognition candidate unit corresponding to the first phrase of a subtitle unit sentence. The subtitle head detection network generation means concatenates, to the recognition candidate unit corresponding to the first phrase of the subtitle unit sentence, the recognition candidate units corresponding to the phrases that follow the first phrase, one after another, until the inter-network distance between the subtitle head detection network and the speech recognition network that is to be processed in parallel with it becomes equal to or greater than a predetermined threshold, and takes as the subtitle head detection network the one or more recognition candidate units for which the number of concatenated units is smallest. The subtitle unit sentence output means outputs the subtitle unit sentence at the point when matching against all recognition candidate units constituting the subtitle head detection network has been completed.
According to the present invention, the subtitle output device outputs the subtitle unit sentence at the point when matching against all recognition candidate units constituting the subtitle head detection network has been completed, so subtitles can be output with little delay relative to the audio in a real-time broadcast.

Furthermore, because the speech recognition processing is performed in parallel using two or more speech recognition networks generated by the speech recognition network generation means, misrecognition caused by, for example, the speaker skipping part of the text is prevented, and the subtitles corresponding to the audio can be output accurately and with little delay on the basis of highly accurate recognition results.

In addition, by generating a subtitle head detection network for accurately detecting that the first phrase of a subtitle has been uttered and performing speech recognition with it, the output timing of the subtitle can be determined accurately and easily.
The invention according to claim 2 is the subtitle output device according to claim 1, wherein the speech recognition processing means performs the speech recognition processing in parallel using a first speech recognition network, which corresponds to a first speech recognition unit sentence containing the first subtitle unit sentence to which the subtitle head detection network generated by the subtitle head detection network generation means corresponds, and a second speech recognition network, which corresponds to a second speech recognition unit sentence following the first speech recognition unit sentence.
The invention according to claim 3 is the subtitle output device according to claim 2, wherein the speech recognition processing means further includes event occurrence determination means for detecting whether the event represented by each of the first speech recognition network, the second speech recognition network, and the subtitle head detection network has occurred.
The invention according to claim 4 is the subtitle output device according to claim 2 or claim 3, wherein the subtitle head detection network generation means sets, as a provisional subtitle head detection network, a speech recognition network containing at least the recognition candidate unit corresponding to the first phrase of the first subtitle unit sentence, calculates the inter-network distance between the provisional subtitle head detection network and the second speech recognition network, and, if the inter-network distance is equal to or greater than the predetermined threshold, adopts the provisional subtitle head detection network as the subtitle head detection network.
The invention according to claim 5 is the subtitle output device according to claim 4, wherein, if the inter-network distance is less than the predetermined threshold, the subtitle head detection network generation means updates the provisional subtitle head detection network by concatenating to it the recognition candidate unit corresponding to the phrase following the first phrase of the first subtitle unit sentence, repeats this update until the inter-network distance between the updated provisional subtitle head detection network and the second speech recognition network becomes equal to or greater than the predetermined threshold, and adopts the provisional subtitle head detection network at the end of the update processing as the subtitle head detection network.
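The iterative construction described in claims 4 and 5 can be sketched roughly as follows. This is only an illustrative sketch: the function names are hypothetical, each recognition candidate unit is reduced to a single string, and `toy_distance` is a made-up stand-in, since this section does not define how the inter-network distance is computed. The behavior when the phrases run out before the threshold is reached is likewise an assumption.

```python
from typing import Callable, List, Sequence


def toy_distance(network_a: Sequence[str], network_b: Sequence[str]) -> float:
    """Hypothetical stand-in for the inter-network distance.

    Counts the positions at which the two unit sequences differ; units of
    network_a that extend past the end of network_b also count as differing.
    """
    differing = sum(1 for a, b in zip(network_a, network_b) if a != b)
    return differing + max(0, len(network_a) - len(network_b))


def build_head_detection_network(
    subtitle_phrase_units: Sequence[str],
    parallel_network: Sequence[str],
    distance: Callable[[Sequence[str], Sequence[str]], float],
    threshold: float,
) -> List[str]:
    """Return the shortest prefix of units whose distance reaches the threshold."""
    if not subtitle_phrase_units:
        raise ValueError("the subtitle unit sentence has no phrases")
    # Start from the unit for the first phrase (the provisional network).
    provisional = [subtitle_phrase_units[0]]
    next_index = 1
    # Append following units while the provisional network is still too close
    # to the network processed in parallel. Stopping when the phrases run out
    # is an assumption; the text does not say what happens in that case.
    while (
        distance(provisional, parallel_network) < threshold
        and next_index < len(subtitle_phrase_units)
    ):
        provisional.append(subtitle_phrase_units[next_index])
        next_index += 1
    return provisional


head = build_head_detection_network(
    ["kyou", "no", "tenki"], ["kyou", "wa", "hare"], toy_distance, threshold=1
)
print(head)
```

Because units are appended one at a time and the loop stops at the first prefix that reaches the threshold, the result is the prefix with the smallest number of concatenated units, which is what keeps the detection delay short.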

The invention according to claim 6 is the subtitle output device according to any one of claims 1 to 5, wherein the speech recognition network generation means generates the speech recognition network after inserting, between the concatenated recognition candidate units, special recognition candidates for preventing misrecognition.
According to the present invention, inserting special recognition candidates for preventing misrecognition between the recognition candidate phrases allows speech recognition to be performed accurately, unaffected by differences in the speaker's breathing, misreadings, restatements, throat clearing, noise, and the like.

The invention according to claim 7 is the subtitle output device according to any one of claims 1 to 6, wherein the speech recognition network generation means generates the speech recognition network after including, in the recognition candidate units, special recognition candidates for preventing misrecognition.
According to the present invention, including special recognition candidates in the recognition candidate units prevents misrecognition in speech recognition, unaffected by the speaker's misreadings, noise, and the like.
The invention according to claim 8 is the subtitle output device according to claim 6 or 7, wherein the special recognition candidates include at least one of NULL, which indicates that there is no pause; SIL, which indicates that there is a silent pause; and Garbage, which represents an arbitrary sound.
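As a rough illustration of claims 6 to 8, the sketch below builds a network as a list of recognition candidate units and places the special candidates in it. The list-of-lists representation, the function name, and the choice to put all three special candidates between units and only Garbage inside units are assumptions made for the example; the claims only require that at least one of NULL, SIL, and Garbage be used.

```python
from typing import List, Sequence

# Special recognition candidates named in claim 8.
NULL, SIL, GARBAGE = "NULL", "SIL", "Garbage"


def build_recognition_network(
    phrase_units: Sequence[Sequence[str]],
    garbage_in_units: bool = False,
) -> List[List[str]]:
    """Concatenate recognition candidate units with special candidates.

    phrase_units holds, for each phrase, its set of recognition candidates
    (for example, alternative readings).
    """
    network: List[List[str]] = []
    for index, unit in enumerate(phrase_units):
        if index > 0:
            # Claim 6: special candidates between consecutive units
            # (no pause, silent pause, or arbitrary sound).
            network.append([NULL, SIL, GARBAGE])
        candidates = list(unit)
        if garbage_in_units:
            # Claim 7: a special candidate inside the unit itself, so a
            # misread phrase can still be absorbed (an illustrative choice).
            candidates.append(GARBAGE)
        network.append(candidates)
    return network


print(build_recognition_network([["サン", "ミ", "スリー"], ["トー"]]))
```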

The invention according to claim 9 provides a subtitle output method executed by a subtitle output device that outputs subtitles in time with speech, comprising: a subtitle unit sentence generation step of generating subtitle unit sentences by dividing an input text into subtitle output units; a speech recognition unit sentence generation step of generating speech recognition unit sentences by dividing the text into speech recognition processing units; a speech recognition network generation step of generating a speech recognition network by concatenating recognition candidate units, each of which is a set of recognition candidates for recognizing a phrase of a speech recognition unit sentence generated in the speech recognition unit sentence generation step, in order starting from the unit corresponding to the first phrase of the speech recognition unit sentence; a speech recognition step of performing speech recognition processing by sequentially matching, from the beginning, the speech in which the text is uttered against the recognition candidate units constituting the speech recognition network generated in the speech recognition network generation step; and a subtitle unit sentence output step of outputting the subtitle unit sentences at a predetermined timing. The speech recognition step performs the speech recognition processing in parallel using two or more speech recognition networks generated by the speech recognition network generation means. The speech recognition network generation step includes a subtitle head detection network generation step of generating a subtitle head detection network that contains at least the recognition candidate unit corresponding to the first phrase of a subtitle unit sentence. The subtitle head detection network generation step concatenates, to the recognition candidate unit corresponding to the first phrase of the subtitle unit sentence, the recognition candidate units corresponding to the phrases that follow the first phrase, one after another, until the inter-network distance between the subtitle head detection network and the speech recognition network that is to be processed in parallel with it becomes equal to or greater than a predetermined threshold, and takes as the subtitle head detection network the one or more recognition candidate units for which the number of concatenated units is smallest. The subtitle unit sentence output step outputs the subtitle unit sentence at the point when matching against all recognition candidate units constituting the subtitle head detection network has been completed.

The invention according to claim 10 is the subtitle output method according to claim 9, wherein the speech recognition processing step performs the speech recognition processing in parallel using a first speech recognition network, which corresponds to a first speech recognition unit sentence containing the first subtitle unit sentence to which the subtitle head detection network generated by the subtitle head detection network generation means corresponds, and a second speech recognition network, which corresponds to a second speech recognition unit sentence following the first speech recognition unit sentence.
The invention according to claim 11 is the subtitle output method according to claim 10, wherein the speech recognition processing step further includes an event occurrence determination step of detecting whether the event represented by each of the first speech recognition network, the second speech recognition network, and the subtitle head detection network has occurred.
The invention according to claim 12 is the subtitle output method according to claim 10 or claim 11, wherein the subtitle head detection network generation step sets, as a provisional subtitle head detection network, a speech recognition network containing at least the recognition candidate unit corresponding to the first phrase of the first subtitle unit sentence, calculates the inter-network distance between the provisional subtitle head detection network and the second speech recognition network, and, if the inter-network distance is equal to or greater than the predetermined threshold, adopts the provisional subtitle head detection network as the subtitle head detection network.
The invention according to claim 13 is the subtitle output method according to claim 12, wherein, if the inter-network distance is less than the predetermined threshold, the subtitle head detection network generation step updates the provisional subtitle head detection network by concatenating to it the recognition candidate unit corresponding to the phrase following the first phrase of the first subtitle unit sentence, repeats this update until the inter-network distance between the updated provisional subtitle head detection network and the second speech recognition network becomes equal to or greater than the predetermined threshold, and adopts the provisional subtitle head detection network at the end of the update processing as the subtitle head detection network.

The invention according to claim 14 provides a program for causing a computer to execute: a subtitle unit sentence generation step of generating subtitle unit sentences by dividing an input text into subtitle output units; a speech recognition unit sentence generation step of generating speech recognition unit sentences by dividing the text into speech recognition processing units; a speech recognition network generation step of generating a speech recognition network by concatenating recognition candidate units, each of which is a set of recognition candidates for recognizing a phrase of a speech recognition unit sentence generated in the speech recognition unit sentence generation step, in order starting from the unit corresponding to the first phrase of the speech recognition unit sentence; and a subtitle unit sentence output step of outputting the subtitle unit sentences at a predetermined timing by sequentially matching, from the beginning, the speech in which the text is uttered against the recognition candidate units constituting the speech recognition network generated in the speech recognition network generation step. The speech recognition step performs the speech recognition processing in parallel using two or more speech recognition networks generated by the speech recognition network generation means. The speech recognition network generation step includes a subtitle head detection network generation step of generating a subtitle head detection network that contains at least the recognition candidate unit corresponding to the first phrase of a subtitle unit sentence. The subtitle head detection network generation step concatenates, to the recognition candidate unit corresponding to the first phrase of the subtitle unit sentence, the recognition candidate units corresponding to the phrases that follow the first phrase, one after another, until the inter-network distance between the subtitle head detection network and the speech recognition network that is to be processed in parallel with it becomes equal to or greater than a predetermined threshold, and takes as the subtitle head detection network the one or more recognition candidate units for which the number of concatenated units is smallest. The subtitle unit sentence output step outputs the subtitle unit sentence at the point when matching against all recognition candidate units constituting the subtitle head detection network has been completed.

According to the present invention, the subtitle output device outputs a subtitle unit sentence at the point when matching against the recognition candidate unit corresponding to at least the first phrase of that subtitle unit sentence has been completed, so subtitles can be output with little delay relative to the audio in a real-time broadcast.
In addition, because the subtitle output device performs speech recognition processing of the speech in which the text is uttered in parallel using two or more speech recognition networks, misrecognition of the speech caused by, for example, the speaker skipping part of the text can be prevented, and the subtitles corresponding to the audio can be output accurately.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of a subtitle output device 10 according to an embodiment of the present invention. In this embodiment, a continuous text, which is the digitized manuscript of a real-time broadcast program such as news, and the speech of an announcer reading that manuscript aloud are input to the subtitle output device 10. The subtitle output device 10 then outputs subtitle unit sentences, which are multiplexed with the audio and video by the conventional method shown in FIG. 12 and sent to receivers for display.
As shown in FIG. 1, the subtitle output device 10 according to this embodiment includes a morphological analysis unit 11, a phrase estimation unit 12, a speech recognition unit sentence generation unit 13, a subtitle unit sentence generation unit 14, a Viterbi network generation unit 15, a speech recognition unit 16, and a subtitle unit sentence output unit 17. These functions are realized when a CPU (Central Processing Unit, not shown) of the subtitle output device 10 reads and executes software such as programs and data stored in a storage device such as a hard disk or ROM (Read Only Memory).

(Morphological analysis unit)
The morphological analysis unit 11 divides the continuous text, input to the subtitle output device 10 via a recording medium such as an optical disc or via a communication line, into morphemes (the smallest meaningful units of a language, such as parts of speech and words), using a dictionary database of grammar rules, parts of speech, readings, and the like stored in advance in the storage device, and determines the part of speech, reading, and so on of each morpheme.
FIG. 2 shows a specific example of a morphological analysis result. For the input continuous text 「民主党、社民党、国民新党の野党３党が提出した福田総理大臣に対する問責決議が参議院本会議で初めて可決されました。」 ("The censure resolution against Prime Minister Fukuda submitted by the three opposition parties, the Democratic Party, the Social Democratic Party, and the People's New Party, was passed for the first time in a plenary session of the House of Councillors."), the figure lists the items output by the morphological analysis: surface form (each morpheme obtained by dividing the continuous text), base form (the dictionary form of an inflected word), reading (kana as written), pronunciation (kana as pronounced), part-of-speech name, and conjugated form.
In FIG. 2, one reading is shown for each surface form, but multiple readings can be obtained for a surface form that has more than one. For example, FIG. 2 shows only 「サン」 (san) as the reading of 「３」, but the readings 「ミ」 (mi) and 「スリー」 (surii, "three") can also be obtained.

(Phrase estimation unit)
The phrase estimation unit 12 estimates phrase units (boundary positions) by checking the punctuation marks in the continuous text sentence, together with the word and part-of-speech information from the analysis by the morphological analysis unit 11, against phrase estimation rules stored in advance in the storage device. The phrase estimation rules are known logic that estimates phrase units from part-of-speech types, such as particles and auxiliary verbs, and from conditions on the arrangement of punctuation marks. A phrase is a unit of pronunciation in which clitics are attached to an independent word such as a noun or a verb. For example, the text sentence 「あの人は私の甥です。」 ("That person is my nephew.") consists of four phrases: 「あの」, 「人は」, 「私の」, and 「甥です。」.

(Caption unit sentence generation unit)
The caption unit sentence generation unit 14 divides the input continuous text sentence at phrase boundaries so as to satisfy a desired caption unit sentence generation condition (for example, that a caption displayed on the screen contain no more than 30 characters). This yields caption unit sentences that break at natural positions.
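The division at phrase boundaries could be sketched as follows. This is a minimal illustration, not the method of the embodiment: the greedy packing strategy and the names `make_caption_units` and `max_chars` are assumptions, since the text only states that the division happens at phrase boundaries under a condition such as a 30-character limit.

```python
def make_caption_units(phrases, max_chars=30):
    """Greedily pack consecutive phrases into caption unit sentences.

    Assumption: a caption is closed as soon as adding the next phrase
    would exceed max_chars. A single phrase longer than max_chars is
    emitted as its own (over-long) caption rather than being split.
    """
    captions = []
    current = ""
    for phrase in phrases:
        if current and len(current) + len(phrase) > max_chars:
            captions.append(current)   # close the caption at a phrase boundary
            current = phrase
        else:
            current += phrase
    if current:
        captions.append(current)
    return captions


# Phrases taken from the example sentence in the phrase estimation section.
phrases = ["あの", "人は", "私の", "甥です。"]
captions = make_caption_units(phrases, max_chars=6)
print(captions)
```

Because captions are only ever closed between phrases, concatenating the output always reproduces the input text, whatever limit is chosen.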

(Speech recognition unit sentence generation unit)
The speech recognition unit sentence generation unit 13 estimates the silent sections caused by breath pauses by checking the punctuation marks of the continuous text sentence, together with the word and part-of-speech information from the morphological analysis unit 11, against known breath-pause estimation rules stored in advance in the storage device. It then divides the continuous text sentence at these silent sections to generate speech recognition unit sentences, which are processing units suited to speech recognition.
FIG. 3 shows specific examples of the phrases estimated by the phrase estimation unit 12, the caption unit sentences generated by the caption unit sentence generation unit 14, and the speech recognition unit sentences generated by the speech recognition unit sentence generation unit 13, all based on the analysis result of the morphological analysis unit 11.
The continuous text sentence of the manuscript shown in FIG. 3, "The censure resolution against Prime Minister Fukuda submitted by the three opposition parties, namely the Democratic Party, the Social Democratic Party, and the People's New Party, was passed for the first time at a plenary session of the House of Councillors.", is morphologically analyzed by the morphological analysis unit 11. Based on the resulting punctuation marks, words, and parts of speech, the phrase estimation unit 12 estimates the phrases shown in FIG. 3, the caption unit sentence generation unit 14 generates the caption unit sentences shown in FIG. 3, and the speech recognition unit sentence generation unit 13 generates the speech recognition unit sentences shown in FIG. 3.

(Viterbi network generation unit)
The Viterbi network generation unit 15 generates Viterbi networks for recognizing the speech produced when an announcer reads the continuous text sentence of the manuscript aloud. A Viterbi network is formed by concatenating recognition candidate units in order, starting from the unit that corresponds to the first phrase of a speech recognition unit sentence generated by the speech recognition unit sentence generation unit 13. Each recognition candidate unit is a set of recognition candidates for recognizing one phrase of that speech recognition unit sentence. Here, a "recognition candidate" is obtained by converting the phonetic symbol string of one of the one or more readings that the morphological analysis unit 11 obtained for a phrase into, for example, a phoneme HMM (Hidden Markov Model), so that speech uttering the phrase can be recognized. A "recognition candidate unit" is the set of recognition candidates for one phrase. Phrases and recognition candidate units therefore correspond one-to-one. When multiple readings are obtained for one phrase, the phrase and its recognition candidates, and likewise the recognition candidate unit and its recognition candidates, are in a one-to-many relationship. When only one reading is obtained for a phrase, the recognition candidate and the recognition candidate unit coincide. The Viterbi network generation unit 15 generates as many Viterbi networks as there are speech recognition unit sentences generated by the speech recognition unit sentence generation unit 13.

The Viterbi network generation unit 15 also inserts special recognition candidates between the concatenated recognition candidate units to prevent misrecognition. The special recognition candidates include "SIL", "NULL", and "Garbage". "NULL" means that there is no pause, and represents the case where neither a silent section nor an unnecessary word occurs. "SIL" means a silent pause (silent section), and prevents the recognition likelihood of the Viterbi network from dropping when the announcer pauses freely between utterances. "Garbage" means a word that is not expected in speech recognition, and serves to absorb unnecessary words. Unnecessary words are inserted, for example, when the announcer coughs, as in 「福田そーり、ゲホ、総理大臣に対する・・・」 ("Fukuda sōri, (cough), sōri-daijin ni taisuru ..."), or restarts a word, as in 「もん、問責決議が」 ("mon, monseki ketsugi ga"). Inserting special recognition candidates such as NULL, SIL, and Garbage between recognition candidate units in this way absorbs misreadings and differences in pausing, and enables highly accurate speech recognition.

Special recognition candidates can also be included among the recognition candidates that make up each recognition candidate unit. For example, for a phrase that morphological analysis finds to have no reading candidates, or whose reading is unknown or unclear because it consists of alphabetic characters, symbols, or the like, Garbage can be included in the recognition candidate unit as a parallel recognition candidate. Garbage can also be included in a recognition candidate unit to avoid speech recognition errors caused by noise and similar factors. Furthermore, to avoid misrecognition when the announcer skips part of the text, NULL can be included in the recognition candidate unit as a parallel recognition candidate. Garbage is configured as parallel branches of all phoneme HMMs.

FIG. 4 shows an example of three Viterbi networks generated from three speech recognition unit sentences and their phrases. In this example, morphological analysis of the continuous text sentence yields three reading candidates for 「３」 ("san", "mi", and "surii") and three for 「福田」 ("fukuta", "fukuda", and "fuguda"). As shown in the figure, the recognition candidate unit for the phrase 「３党が」 therefore consists of the recognition candidates "san", "mi", and "surii", and the recognition candidate unit for the phrase 「福田」 consists of the recognition candidates "fukuta", "fukuda", and "fuguda". Also in this example, the recognition candidate unit for the phrase 「民主党」 consists of the recognition candidates "minshutō", "NULL", and "Garbage". As shown in FIG. 5, each arrow connecting recognition candidate units in the Viterbi networks of FIG. 4 represents a Viterbi state transition via NULL, SIL, or Garbage.
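The structure described above (recognition candidate units concatenated in phrase order, with special recognition candidates between them) might be represented roughly as follows. This is only a sketch under stated assumptions: plain reading strings stand in for phoneme HMMs, the names `CandidateUnit`, `build_network`, and `count_paths` are hypothetical, and the phrase 「ABC」 with no reading is an invented example of the Garbage fallback.

```python
from dataclasses import dataclass

# Special recognition candidates inserted between recognition candidate units.
INTER_UNIT_SPECIALS = ("NULL", "SIL", "Garbage")


@dataclass
class CandidateUnit:
    phrase: str        # surface text of the phrase ("" for an inter-unit slot)
    candidates: list   # reading strings and/or special labels (stand-ins for HMMs)


def build_network(phrases_with_readings):
    """Concatenate one candidate unit per phrase, with special slots in between."""
    network = []
    for index, (phrase, readings) in enumerate(phrases_with_readings):
        if index > 0:
            network.append(CandidateUnit("", list(INTER_UNIT_SPECIALS)))
        # Assumption: a phrase with no known reading falls back to Garbage only.
        candidates = list(readings) if readings else ["Garbage"]
        network.append(CandidateUnit(phrase, candidates))
    return network


def count_paths(network):
    """Number of distinct paths through the network (product of unit sizes)."""
    total = 1
    for unit in network:
        total *= len(unit.candidates)
    return total


network = build_network([
    ("野党", ["yatō"]),
    ("３党が", ["san", "mi", "surii"]),   # three readings, as in FIG. 4
    ("ABC", []),                          # hypothetical phrase with no reading
])
print(len(network), count_paths(network))
```

Keeping the inter-unit slots as ordinary units makes the one-to-many relationship between a unit and its candidates uniform across the whole network, which is one possible design and not necessarily the one used in the embodiment.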

In addition, as shown in FIG. 1, the Viterbi network generation unit 15 has a caption head detection network generation function 151. This function generates Viterbi networks for caption head detection (hereinafter, "caption head detection networks"), each of which detects that the first phrase of a caption unit sentence has been uttered. It generates one such network for each caption unit sentence generated by the caption unit sentence generation unit 14. The head of a caption head detection network is the head of a given Viterbi network, and its tail consists of the recognition candidate units corresponding to the head of a given caption unit sentence. The method of generating the caption head detection networks is described in detail later.

(Speech recognition unit)
The speech recognition unit 16 uses the Viterbi networks generated by the Viterbi network generation unit 15 to recognize the speech of the announcer reading the continuous text sentence of the manuscript.
FIG. 6 is a block diagram showing the detailed functional configuration of the speech recognition unit 16. As shown in the figure, the speech recognition unit 16 includes a speech feature extraction unit 161, a Viterbi network comparison and evaluation unit 162, and an event occurrence determination unit 163.
The speech feature extraction unit 161 obtains speech features from the input speech.
The Viterbi network comparison and evaluation unit 162 successively compares the speech features obtained by the speech feature extraction unit 161 with the speech features of each recognition candidate in each recognition candidate unit of a Viterbi network, and with those of the special recognition candidates inserted between the recognition candidate units. It thereby successively calculates the likelihood (probability) that the time series of speech feature changes represented by the Viterbi network has occurred.

The Viterbi network comparison and evaluation unit 162 can perform parallel recognition processing, in which it evaluates multiple Viterbi networks in parallel and calculates their likelihoods concurrently. The Viterbi networks evaluated in parallel may be the one or two Viterbi networks that follow the Viterbi network that would be recognized without parallel evaluation (that is, the Viterbi network corresponding to the speech recognition unit sentence containing the phrase the announcer is currently uttering), or they may be the Viterbi networks adjacent to it before and after. The caption head detection network evaluated in parallel can be a network that includes the head of that Viterbi network. The rules for choosing which Viterbi networks to evaluate in parallel can be defined in advance in a program or database.
Based on the likelihoods calculated by the Viterbi network comparison and evaluation unit 162, the event occurrence determination unit 163 determines, at any given time, which of the events represented by the Viterbi networks has occurred, or that none has occurred, and outputs an event detection result.

(Caption unit sentence output unit)
The caption unit sentence output unit 17 outputs a given caption unit sentence when it detects the output timing of that caption unit sentence based on the event detection result obtained from the speech recognition unit 16. In this embodiment, the caption unit sentence output unit 17 outputs the caption unit sentence corresponding to a caption head detection network generated by the caption head detection network generation function 151 when it detects that the event represented by that network has occurred.
Even after detecting that the event represented by a caption head detection network has occurred, the speech recognition unit 16 continues recognizing, through to its end, the Viterbi network whose head contains the recognition candidate units of that caption head detection network. This prevents the next caption from being output at an unwanted time.

(Caption output processing)
Next, the caption output processing executed by the caption output device 10 according to this embodiment is described with reference to the flowchart in FIG. 7.
First, the caption unit sentence generation unit 14 generates multiple caption unit sentences by dividing the input continuous text sentence of the manuscript into caption output units, based on the processing results of the morphological analysis unit 11 and the phrase estimation unit 12 (step S101).
Next, the speech recognition unit sentence generation unit 13 generates multiple speech recognition unit sentences by dividing the input continuous text sentence of the manuscript into speech recognition processing units, based on the processing result of the morphological analysis unit 11 (step S102).

Next, for each of the speech recognition unit sentences generated by the speech recognition unit sentence generation unit 13, the Viterbi network generation unit 15 generates a Viterbi network by concatenating the recognition candidate units corresponding to its phrases. The Viterbi network generation unit 15 also generates the caption head detection networks using the caption head detection network generation function 151 (step S103).
Next, when the announcer reads the continuous text sentence of the manuscript aloud during a live broadcast and the real-time speech is input to the caption output device 10, the speech recognition unit 16 performs parallel recognition processing. It matches the input speech, successively and in parallel from the head, against the recognition candidate units of each of the Viterbi networks generated by the Viterbi network generation unit 15, including the caption head detection networks (step S104).
When the caption unit sentence output unit 17 detects that the event represented by a caption head detection network has occurred, it outputs the caption unit sentence corresponding to that caption head detection network (step S105).

(Caption head detection network generation processing)
Next, the caption head detection network generation processing executed by the caption head detection network generation function 151 of the Viterbi network generation unit 15 is described with reference to the flowchart in FIG. 8.
As a premise, a method of calculating the "inter-network distance" is defined. The inter-network distance is an index of the similarity between Viterbi networks: the smaller the distance, the more similar the phonemes forming the two Viterbi networks, and the higher the probability of misrecognition. For example, the inter-network distance can be defined as the sum of the distances between the phonemes forming the recognition candidate units in each Viterbi network. When a Viterbi network has multiple paths (that is, when some of its recognition candidate units contain multiple recognition candidates), the inter-network distance can be defined, for example, as the distance between the closest parts of the Viterbi networks being compared.

First, one caption unit sentence for which a caption head detection network is to be generated is selected from the caption unit sentences generated by the caption unit sentence generation unit 14. A provisional caption head detection network is then set for the Viterbi network containing the recognition candidate unit that corresponds to the first phrase of that caption unit sentence (hereinafter, the "target Viterbi network"). Specifically, the span from the first recognition candidate unit of the target Viterbi network through the recognition candidate unit corresponding to the first phrase of the caption unit sentence is taken as the provisional caption head detection network (step S201).

Next, the inter-network distance is calculated between the provisional caption head detection network and each Viterbi network that is recognized in parallel with the target Viterbi network and does not contain the recognition candidate unit corresponding to the first phrase of the caption unit sentence. If any calculated inter-network distance is below a predetermined threshold (step S202: No), the recognition candidate unit corresponding to the next phrase of the caption unit sentence is added to the provisional caption head detection network (step S203). Once every inter-network distance is at or above the threshold, so that a sufficient distance from the other Viterbi networks is secured (step S202: Yes), the provisional network is fixed as the caption head detection network (step S205). If adding a recognition candidate unit brings the provisional caption head detection network to the end of the target Viterbi network, that is, if the two become identical (step S204: Yes), the entire target Viterbi network is adopted as the caption head detection network. This generation processing is performed once for each caption unit sentence generated by the caption unit sentence generation unit 14.

This procedure generates a caption head detection network that contains the first few phrases of a caption unit sentence. Outputting the corresponding caption unit sentence when the event represented by that network is detected allows the caption unit sentence to be output as soon as its first few phrases have been uttered, that is, with the minimum necessary delay. In addition, by securing a sufficient inter-network distance from the other Viterbi networks recognized in parallel, recognition errors can be eliminated.

(Specific example of caption head detection network determination processing)
Next, a specific example is described in which the caption head detection network generation function 151 determines, based on the speech recognition unit sentences shown in FIG. 9, the caption head detection networks for recognizing the heads of the caption unit sentences shown in the same figure.
In this example, speech recognition is performed using two Viterbi networks in parallel: the one corresponding to the speech recognition unit sentence that contains the phrase currently being uttered, and the one that follows it. In practice, a caption head detection network is generated from the recognition candidate units of the Viterbi network corresponding to a speech recognition unit sentence. For simplicity, however, the explanation here refers to the corresponding "speech recognition unit sentence" and "phrase" instead of the "Viterbi network" and "recognition candidate unit".

First, to determine the caption head detection network for caption unit sentence 1), the first phrase 「別府へ」 ("Beppu e", meaning "to Beppu") of speech recognition unit sentence (A) is set as the provisional caption head detection network (corresponding to step S201 in FIG. 8). When the inter-network distance between this provisional network 「別府へ」 and the first phrase 「切符を」 ("kippu o", meaning "a ticket") of speech recognition unit sentence (B) is calculated, the distance between the readings "beppuhe" and "kippuo" turns out to be quite small (step S202: No). The next phrase of speech recognition unit sentence (A), 「行く」 ("iku", meaning "go"), is therefore added to the provisional network (step S203). The provisional network, now the first two phrases 「別府へ」 + 「行く」 of speech recognition unit sentence (A), is sufficiently distant from the first two phrases 「切符を」 + 「買う」 ("kau", meaning "buy") of speech recognition unit sentence (B) (step S202: Yes). 「別府へ」 + 「行く」 is therefore adopted as the caption head detection network for caption unit sentence 1) (step S205), so that the utterance of the first two phrases is enough to determine with high accuracy that speech recognition unit sentence (A) is being uttered.

By the same procedure, the caption head detection network for the next caption unit sentence 2) is 「切符を」 + 「買う」 ("kippu o" + "kau").
For the next caption unit sentence 3), the portion of speech recognition unit sentence (B) up to 「チップを」 + 「渡した」 ("chippu o" + "watashita") is not sufficiently distant from 「チップを渡す」 ("chippu o watasu") in speech recognition unit sentence (C), so the network is extended through 「ものか」 ("mono ka"). The caption head detection network for caption unit sentence 3) thus runs from the first phrase of speech recognition unit sentence (B) through 「チップを」 + 「渡した」 + 「ものか」.
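The loop of steps S201 to S205, applied to caption unit sentence 1) of this example, could be sketched as below. Several things here are assumptions made for illustration only: the Levenshtein distance over romanized readings stands in for a phoneme distance, the per-unit minimum over readings stands in for the "closest part" rule, networks are plain lists of reading lists, and the threshold values and the names `edit_distance`, `network_distance`, and `build_head_network` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (toy stand-in for a phoneme distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def network_distance(net_a, net_b):
    """Sum, over the units of net_a, of the closest reading-to-reading distance.

    Each network is a list of units; each unit is a list of reading strings.
    Units missing from net_b are treated as an empty reading.
    """
    total = 0
    for index, unit_a in enumerate(net_a):
        unit_b = net_b[index] if index < len(net_b) else [""]
        total += min(edit_distance(a, b) for a in unit_a for b in unit_b)
    return total


def build_head_network(target, head_index, others, threshold):
    """Grow a provisional head network until it is far enough from all others."""
    end = head_index + 1                        # S201: up to the caption's first phrase
    while True:
        provisional = target[:end]
        if all(network_distance(provisional, other) >= threshold for other in others):
            return provisional                  # S202: Yes, then S205
        if end == len(target):
            return target                       # S204: Yes, adopt the whole network
        end += 1                                # S203: add the next unit


sentence_a = [["beppuhe"], ["iku"]]   # 「別府へ」 + 「行く」
sentence_b = [["kippuo"], ["kau"]]    # 「切符を」 + 「買う」
head_network = build_head_network(sentence_a, 0, [sentence_b], threshold=5)
print(head_network)
```

With this toy distance and threshold, the first phrase alone is judged too close to 「切符を」, so the second phrase is added, which mirrors the outcome described for caption unit sentence 1). A real system would presumably use an acoustic or phoneme-level distance rather than string edits.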

(Specific example of parallel recognition processing)
Next, a specific example of parallel recognition processing is described with reference to FIGS. 10 and 11.
In FIG. 10, (a) shows the Viterbi networks generated from the continuous text sentence of the manuscript, "The censure resolution against Prime Minister Fukuda submitted by the three opposition parties, namely the Democratic Party, the Social Democratic Party, and the People's New Party, was passed for the first time at a plenary session of the House of Councillors. As a countermeasure, the Liberal Democratic Party and Komeito ...". (b) shows the caption unit sentences generated from that continuous text sentence. (c) shows the caption head detection networks used to output caption unit sentences 1) and 2) of (b) at the moment the underlined portion of each has been recognized.

FIG. 11 shows a specific example of the speech recognition processing performed by the speech recognition unit 16 based on the Viterbi networks in FIG. 10(a) and the caption head detection networks in FIG. 10(c), and of the caption unit sentence output processing performed by the caption unit sentence output unit 17 based on the event detection results of that speech recognition processing.
First, from the Viterbi networks shown in FIGS. 10(a) and 10(c) that the Viterbi network generation unit 15 has generated, the speech recognition unit 16 takes as detection targets the first Viterbi network 1A, the next Viterbi network 2A to be recognized in parallel, and the caption head detection network 1B (step S301).
When the announcer utters speech 1, 「みんしゅとうしゃみんとう」 ("minshutō shamintō"), the speech recognition unit 16 detects the event of the caption head detection network 1B (step S302). The speech recognition unit 16 then removes the detected caption head detection network 1B from the detection targets and adds the next caption head detection network 2B (step S303).

The caption unit sentence output unit 17 outputs caption unit sentence 1) based on the event detection result from the speech recognition unit 16 (step S304).
Next, when speech 2, 「こくみんしんとうのやとうさんとうがていしゅつした」 ("kokumin shintō no yatō santō ga teishutsu shita"), is uttered, the speech recognition unit 16 detects the event of Viterbi network 1A (step S305). It removes Viterbi network 1A from the detection targets and adds the next Viterbi network to be recognized in parallel, 3A (step S306).
Next, when speech 3, 「ふくだそうりだいじんにたいするもんせきけつぎが」 ("Fukuda sōri-daijin ni taisuru monseki ketsugi ga"), is uttered, the speech recognition unit 16 detects the events of Viterbi network 2A and caption head detection network 2B (step S307). It removes both from the detection targets and adds the next Viterbi network to be recognized in parallel, 4A (step S308).

The caption unit sentence output unit 17 outputs caption unit sentence 2) based on the detection, by the speech recognition unit 16, of the event of the caption head detection network 2B (step S309).
Note that because the Viterbi network 2A and the caption head detection network 2B are identical, a single network can serve as both.
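As an illustration only, the bookkeeping of detection targets in steps S301 to S309 can be replayed in a few lines of Python. The network labels (1A, 1B, and so on) and the order of events are taken from the walkthrough above. The function `replay_detection`, its arguments, and the idea of feeding it already-detected events are assumptions made for this sketch; no actual speech recognition is performed.

```python
# Replays the detection-target updates of steps S301 to S309.
# All names below are hypothetical; only the network labels and the
# order of events come from the walkthrough.

def replay_detection(initial_targets, steps, caption_of):
    """Return (history of active detection targets, captions output in order)."""
    active = set(initial_targets)
    history = [sorted(active)]
    outputs = []
    for detected, added in steps:
        missing = set(detected) - active
        if missing:
            raise ValueError(f"not a current detection target: {sorted(missing)}")
        active -= set(detected)   # remove the networks whose events were detected
        active |= set(added)      # add the next networks to watch
        for network in detected:
            if network in caption_of:   # only caption head detection networks trigger output
                outputs.append(caption_of[network])
        history.append(sorted(active))
    return history, outputs


INITIAL_TARGETS = ["1A", "2A", "1B"]          # S301
STEPS = [
    (["1B"], ["2B"]),                         # S302, S303
    (["1A"], ["3A"]),                         # S305, S306
    (["2A", "2B"], ["4A"]),                   # S307, S308
]
CAPTION_OF = {
    "1B": "caption unit sentence 1)",         # S304
    "2B": "caption unit sentence 2)",         # S309
}

history, outputs = replay_detection(INITIAL_TARGETS, STEPS, CAPTION_OF)
for targets in history:
    print(targets)
print(outputs)
```

Keeping the caption lookup keyed by caption head detection network mirrors the description: events of the ordinary Viterbi networks only advance the set of detection targets, while events of the caption head detection networks also trigger caption output.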
As described above, the caption output device 10 outputs a caption unit sentence as soon as the speech has been collated against the recognition candidate units corresponding to at least the head phrase of that caption unit sentence, so captions can be output with little delay in real-time broadcasting. In addition, including special recognition candidates such as NULL, SIL, and Garbage as components of the Viterbi network absorbs the announcer's misreadings and variations in pausing, enabling highly accurate speech recognition.
Furthermore, because the caption output device 10 performs speech recognition in parallel using two or more Viterbi networks, it can prevent misrecognition caused by, for example, the announcer skipping part of the text, and can recover from shifts in utterance timing, so that captions corresponding to the speech are output accurately.
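The role of the special recognition candidates NULL, SIL, and Garbage can be pictured with a deliberately simplified, token-level sketch. It assumes the input has already been reduced to a list of tokens (phrase readings, "SIL" for silence, or anything else for stray sound) and allows at most one filler between consecutive phrases. Both simplifications, and the function name `accepts`, are assumptions of this sketch rather than details from the description, where the matching is done acoustically inside the Viterbi network.

```python
# Token-level toy model of a phrase sequence with one optional filler slot
# between consecutive phrases. Hypothetical names; not the actual network.

def accepts(phrases, observed):
    """Return True if `observed` is `phrases` with at most one filler token
    between consecutive phrases.

    NULL    : no filler consumed (the phrases follow each other directly)
    SIL     : a silent pause token is consumed
    Garbage : any other single token is consumed
    """
    pos = 0
    for index, phrase in enumerate(phrases):
        # Filler slot before every phrase except the first.
        if index > 0 and pos < len(observed) and observed[pos] != phrase:
            pos += 1  # SIL or Garbage; skipping this branch corresponds to NULL
        if pos >= len(observed) or observed[pos] != phrase:
            return False
        pos += 1
    return pos == len(observed)


script = ["kyou", "no", "tenki"]
print(accepts(script, ["kyou", "no", "tenki"]))
print(accepts(script, ["kyou", "SIL", "no", "eh", "tenki"]))
print(accepts(script, ["kyou", "tenki"]))
```

In this toy model a skipped phrase is still rejected, which is consistent with the description's use of parallel networks, rather than the fillers, to cope with skipped text.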

Needless to say, the present invention is not limited to the embodiment described above and can be practiced with appropriate modifications to that embodiment within the technical scope set forth in the claims.
For example, in the embodiment described above, the caption unit sentence output unit 17 determines the output timing of a caption unit sentence using the caption head detection network. The invention is not limited to this: the output timing may instead be determined, without using a caption head detection network, from the time elapsed since recognition of the speech corresponding to the caption unit sentence began. Alternatively, the caption unit sentence may be output once the recognition candidate units corresponding to the first several phrases of the caption unit sentence have been collated with the input speech. The "several phrases" may be a predetermined number, or the number reached when the likelihood difference from the other Viterbi networks recognized in parallel becomes large enough for the occurrence of the event to be detected. Syllables or character counts may also be used in place of phrases.
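The two ways of choosing the "several phrases" can be sketched as follows. This is a minimal illustration, assuming that a cumulative log-likelihood is available for the target network and for a rival network after each phrase. The function `phrases_needed`, its parameters, and the scores are all hypothetical and do not come from the embodiment.

```python
# Decide after how many leading phrases a caption could be output.
# Hypothetical helper and made-up scores, for illustration only.

def phrases_needed(target_scores, rival_scores, margin, fixed_count=None):
    """Return the number of leading phrases to wait for, or None if undecided.

    fixed_count : use a predetermined number of phrases when given
    margin      : otherwise, wait until the target network's cumulative
                  log-likelihood exceeds the rival's by at least this much
    """
    if fixed_count is not None:
        return min(fixed_count, len(target_scores))
    for count, (target, rival) in enumerate(zip(target_scores, rival_scores), start=1):
        if target - rival >= margin:
            return count
    return None


target = [-5.0, -9.0, -12.0]    # made-up cumulative log-likelihoods per phrase
rival = [-5.5, -14.0, -30.0]
print(phrases_needed(target, rival, margin=3.0))
print(phrases_needed(target, rival, margin=3.0, fixed_count=1))
```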

The method of determining the caption head detection network is likewise not limited to the embodiment described above. At a minimum, the network should be determined so that it includes at least the recognition candidate unit corresponding to the head phrase of the caption unit sentence, so that utterance of that head phrase can be detected.
In the embodiment described above, the caption unit sentences and the speech recognition unit sentences are generated separately, and the speech recognition unit sentences are not made to coincide with the caption unit sentences, in order to improve speech recognition performance. The speech recognition unit sentences may, however, be made to coincide with the caption unit sentences.
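One way to picture the threshold-based determination of a caption head detection network recited in the claims is the loop below: start from the head phrase and keep appending the following phrases until the network is far enough from the network recognized in parallel. The inter-network distance is not defined in this section, so the sketch substitutes a stand-in: the edit distance between the concatenated phrase readings and an equally long leading part of the rival network. That stand-in, the fallback when the threshold is never reached, all function names, and the toy phrases are assumptions.

```python
# Grow a caption head detection network phrase by phrase (illustrative sketch).
# The distance measure is a stand-in, not the one used in the embodiment.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def inter_network_distance(candidate_phrases, rival_phrases) -> int:
    """Stand-in distance: compare against an equally long prefix of the rival."""
    candidate = "".join(candidate_phrases)
    rival = "".join(rival_phrases)[:len(candidate)]
    return edit_distance(candidate, rival)


def build_head_network(caption_phrases, rival_phrases, threshold):
    """Return the shortest leading run of phrases whose distance meets the threshold."""
    if not caption_phrases:
        raise ValueError("caption unit sentence has no phrases")
    for count in range(1, len(caption_phrases) + 1):
        candidate = list(caption_phrases[:count])
        if inter_network_distance(candidate, rival_phrases) >= threshold:
            return candidate
    # Assumption: if the threshold is never reached, use every phrase.
    return list(caption_phrases)


caption = ["kyou", "no", "tenki"]     # toy readings, not from the patent
rival = ["kyou", "wa", "ame"]
print(build_head_network(caption, rival, threshold=2))
```

Returning the shortest qualifying prefix corresponds to minimizing the number of connected recognition candidate units, which keeps the delay before the caption is output as small as the threshold allows.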

Caption unit sentences and speech recognition unit sentences may also be generated using analysis rules, division rules, or the like other than morphological analysis. Speech recognition may also be performed using a speech recognition network other than a Viterbi network.
The embodiment described above deals with outputting captions in step with an announcer's real-time speech in a live news program. However, the invention applies to any real-time broadcast in which a common manuscript is used for both the announcement and the captions, such as a live sports broadcast or a live lecture.

FIG. 1 is a block diagram showing the functional configuration of a caption output device according to an embodiment of the present invention.
FIG. 2 is a diagram showing a specific example of a morphological analysis result.
FIG. 3 is a diagram showing a specific example of the phrases estimated, and the caption unit sentences and speech recognition unit sentences generated, from the morphological analysis result.
FIG. 4 is a diagram showing a specific example of three Viterbi networks generated from three speech recognition unit sentences and their phrases.
FIG. 5 is a diagram explaining the meaning of the arrows that connect the recognition candidate units making up the Viterbi networks shown in FIG. 4.
FIG. 6 is a block diagram showing the detailed functional configuration of the speech recognition unit according to the embodiment.
FIG. 7 is a flowchart showing the flow of the caption output processing executed by the caption output device according to the embodiment.
FIG. 8 is a flowchart showing the flow of the caption head detection network generation processing executed by the caption head detection network generation function according to the embodiment.
FIG. 9 is a diagram showing speech recognition unit sentences and caption unit sentences, used to explain a specific example of the processing that determines the caption head detection network according to the embodiment.
FIG. 10 is a diagram explaining a specific example of the parallel recognition processing according to the embodiment.
FIG. 11 is a diagram explaining a specific example of the parallel recognition processing according to the embodiment.
FIG. 12 is a diagram showing the caption transmission mechanism commonly used in conventional real-time broadcast programs.

Explanation of symbols

10 Caption output device
11 Morphological analysis unit
12 Phrase estimation unit
13 Speech recognition unit sentence generation unit
14 Caption unit sentence generation unit
15 Viterbi network generation unit
151 Caption head detection network generation function
16 Speech recognition unit
161 Speech feature extraction unit
162 Viterbi network comparison and evaluation unit
163 Event occurrence determination unit
17 Caption unit sentence output unit

Claims

1. A caption output device that outputs captions in accordance with speech, comprising:
caption unit sentence generation means for generating a caption unit sentence by dividing an input text sentence into caption output units;
speech recognition unit sentence generation means for generating a speech recognition unit sentence by dividing the text sentence into processing units for speech recognition;
speech recognition network generation means for generating a speech recognition network by connecting recognition candidate units, each being a set of recognition candidates for recognizing a phrase of the speech recognition unit sentence generated by the speech recognition unit sentence generation means, in order starting from the unit corresponding to the head phrase of the speech recognition unit sentence;
speech recognition processing means for performing speech recognition processing by sequentially collating, from the beginning, the speech in which the text sentence is uttered against the recognition candidate units constituting the speech recognition network generated by the speech recognition network generation means; and
caption unit sentence output means for outputting the caption unit sentence at a predetermined timing,
wherein the speech recognition processing means performs the speech recognition processing in parallel using two or more speech recognition networks generated by the speech recognition network generation means,
the speech recognition network generation means includes caption head detection network generation means for generating a caption head detection network including at least the recognition candidate unit corresponding to the head phrase of the caption unit sentence,
the caption head detection network generation means sequentially connects, to the recognition candidate unit corresponding to the head phrase of the caption unit sentence, the recognition candidate units corresponding to the phrases following the head phrase until the inter-network distance between the caption head detection network and the speech recognition network subjected to speech recognition processing in parallel with the caption head detection network becomes equal to or greater than a predetermined threshold, and sets as the caption head detection network the one or more recognition candidate units for which the number of connected recognition candidate units is smallest, and
the caption unit sentence output means outputs the caption unit sentence when collation with all recognition candidate units constituting the caption head detection network is completed.

2. The caption output device according to claim 1, wherein the speech recognition processing means performs the speech recognition processing in parallel using a first speech recognition network corresponding to a first speech recognition unit sentence that includes a first caption unit sentence corresponding to the caption head detection network generated by the caption head detection network generation means, and a second speech recognition network corresponding to a second speech recognition unit sentence following the first speech recognition unit sentence.

3. The caption output device according to claim 2, wherein the speech recognition processing means includes event occurrence determination means for detecting whether the event represented by each of the first speech recognition network, the second speech recognition network, and the caption head detection network has occurred.

4. The caption output device according to claim 2 or claim 3, wherein the caption head detection network generation means sets, as a temporary caption head detection network, a speech recognition network including at least the recognition candidate unit corresponding to the head phrase of the first caption unit sentence, calculates the inter-network distance between the temporary caption head detection network and the second speech recognition network, and, when the inter-network distance is equal to or greater than the predetermined threshold, sets the temporary caption head detection network as the caption head detection network.

5. The caption output device according to claim 4, wherein, when the inter-network distance is less than the predetermined threshold, the caption head detection network generation means updates the temporary caption head detection network by connecting to it the recognition candidate unit corresponding to the phrase following the head phrase of the first caption unit sentence, repeats the update until the inter-network distance between the updated temporary caption head detection network and the second speech recognition network becomes equal to or greater than the predetermined threshold, and sets the temporary caption head detection network at the time the update processing ends as the caption head detection network.

6. The caption output device according to any one of claims 1 to 5, wherein the speech recognition network generation means generates the speech recognition network after inserting, between the connected recognition candidate units, a special recognition candidate for preventing misrecognition.

7. The caption output device according to any one of claims 1 to 6, wherein the speech recognition network generation means generates the speech recognition network after including, in the recognition candidate units, a special recognition candidate for preventing misrecognition.

8. The caption output device according to claim 6 or claim 7, wherein the special recognition candidate includes at least one of NULL, indicating that there is no pause, SIL, indicating that there is a silent pause, and Garbage, representing an arbitrary sound.

9. A caption output method executed by a caption output device that outputs captions in accordance with speech, the method comprising:
a caption unit sentence generation step of generating a caption unit sentence by dividing an input text sentence into caption output units;
a speech recognition unit sentence generation step of generating a speech recognition unit sentence by dividing the text sentence into processing units for speech recognition;
a speech recognition network generation step of generating a speech recognition network by connecting recognition candidate units, each being a set of recognition candidates for recognizing a phrase of the speech recognition unit sentence generated in the speech recognition unit sentence generation step, in order starting from the unit corresponding to the head phrase of the speech recognition unit sentence;
a speech recognition step of performing speech recognition processing by sequentially collating, from the beginning, the speech in which the text sentence is uttered against the recognition candidate units constituting the speech recognition network generated in the speech recognition network generation step; and
a caption unit sentence output step of outputting the caption unit sentence at a predetermined timing,
wherein the speech recognition step performs the speech recognition processing in parallel using two or more speech recognition networks generated by the speech recognition network generation means,
the speech recognition network generation step includes a caption head detection network generation step of generating a caption head detection network including at least the recognition candidate unit corresponding to the head phrase of the caption unit sentence,
the caption head detection network generation step sequentially connects, to the recognition candidate unit corresponding to the head phrase of the caption unit sentence, the recognition candidate units corresponding to the phrases following the head phrase until the inter-network distance between the caption head detection network and the speech recognition network subjected to speech recognition processing in parallel with the caption head detection network becomes equal to or greater than a predetermined threshold, and sets as the caption head detection network the one or more recognition candidate units for which the number of connected recognition candidate units is smallest, and
the caption unit sentence output step outputs the caption unit sentence when collation with all recognition candidate units constituting the caption head detection network is completed.

10. The caption output method according to claim 9, wherein the speech recognition step performs the speech recognition processing in parallel using a first speech recognition network corresponding to a first speech recognition unit sentence that includes a first caption unit sentence corresponding to the caption head detection network generated by the caption head detection network generation means, and a second speech recognition network corresponding to a second speech recognition unit sentence following the first speech recognition unit sentence.

11. The caption output method according to claim 10, wherein the speech recognition step includes an event occurrence determination step of detecting whether the event represented by each of the first speech recognition network, the second speech recognition network, and the caption head detection network has occurred.

12. The caption output method according to claim 10 or claim 11, wherein the caption head detection network generation step sets, as a temporary caption head detection network, a speech recognition network including at least the recognition candidate unit corresponding to the head phrase of the first caption unit sentence, calculates the inter-network distance between the temporary caption head detection network and the second speech recognition network, and, when the inter-network distance is equal to or greater than the predetermined threshold, sets the temporary caption head detection network as the caption head detection network.

13. The caption output method according to claim 12, wherein, when the inter-network distance is less than the predetermined threshold, the caption head detection network generation step updates the temporary caption head detection network by connecting to it the recognition candidate unit corresponding to the phrase following the head phrase of the first caption unit sentence, repeats the update until the inter-network distance between the updated temporary caption head detection network and the second speech recognition network becomes equal to or greater than the predetermined threshold, and sets the temporary caption head detection network at the time the update processing ends as the caption head detection network.

14. A program for causing a computer to execute:
a caption unit sentence generation step of generating a caption unit sentence by dividing an input text sentence into caption output units;
a speech recognition unit sentence generation step of generating a speech recognition unit sentence by dividing the text sentence into processing units for speech recognition;
a speech recognition network generation step of generating a speech recognition network by connecting recognition candidate units, each being a set of recognition candidates for recognizing a phrase of the speech recognition unit sentence generated in the speech recognition unit sentence generation step, in order starting from the unit corresponding to the head phrase of the speech recognition unit sentence;
a speech recognition step of performing speech recognition processing by sequentially collating, from the beginning, the speech in which the text sentence is uttered against the recognition candidate units constituting the speech recognition network generated in the speech recognition network generation step; and
a caption unit sentence output step of outputting the caption unit sentence at a predetermined timing,
wherein the speech recognition step performs the speech recognition processing in parallel using two or more speech recognition networks generated by the speech recognition network generation means,
the speech recognition network generation step includes a caption head detection network generation step of generating a caption head detection network including at least the recognition candidate unit corresponding to the head phrase of the caption unit sentence,
the caption head detection network generation step sequentially connects, to the recognition candidate unit corresponding to the head phrase of the caption unit sentence, the recognition candidate units corresponding to the phrases following the head phrase until the inter-network distance between the caption head detection network and the speech recognition network subjected to speech recognition processing in parallel with the caption head detection network becomes equal to or greater than a predetermined threshold, and sets as the caption head detection network the one or more recognition candidate units for which the number of connected recognition candidate units is smallest, and
the caption unit sentence output step outputs the caption unit sentence when collation with all recognition candidate units constituting the caption head detection network is completed.