JP2010233019A

JP2010233019A - Caption shift correction device, reproduction device, and broadcast device

Info

Publication number: JP2010233019A
Application number: JP2009079244A
Authority: JP
Inventors: Masaki Naito; 正樹内藤; Kazunori Matsumoto; 一則松本; Hisashi Kawai; 恒河井
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2009-03-27
Filing date: 2009-03-27
Publication date: 2010-10-14
Anticipated expiration: 2029-03-27
Also published as: JP5246948B2

Abstract

PROBLEM TO BE SOLVED: To sequentially create, while receiving broadcast content, broadcast content in which the time shift between video or audio and captions has been corrected with high accuracy.

SOLUTION: A speech recognition unit 21 recognizes the audio of a broadcast program and generates a recognition-result phoneme sequence corresponding to that audio. A caption-converted phoneme sequence generation unit 22 generates a phoneme sequence for each caption in the video of the broadcast program and concatenates these sequences into a caption-converted phoneme sequence. A phoneme sequence collation unit 23 associates captions with audio at fixed time intervals using the audio and caption information in the received broadcast content, predicts from the result the start and end times of each caption after time-shift correction, and estimates the range of the time shift between audio and captions.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a caption shift correction device, a playback device, and a broadcast device. In particular, it relates to devices that can correct the time shift between captions and the video or audio in broadcast content with high accuracy, can search for and play back a specific video portion based on captions, and can transmit broadcast content containing video, audio, and captions whose time shift has been corrected with high accuracy.

In recent years, adding captions to the video of TV broadcast programs has been encouraged, and with the start of digital terrestrial broadcasting it has become easy to watch captioned TV programs. As a result, the number of captioned TV broadcast programs is increasing.

To caption a TV broadcast program, the usual procedure is to listen to the performers' speech, transcribe it, output the transcription as captions, and add the captions to the video. Because transcription takes time, each caption is output later than the corresponding speech, which produces a time shift between the video or audio and the caption.

FIG. 13 shows this situation. For example, when a performer says "Hello. Here is the morning news. ..." ("こんにちは。朝のニュースです。・・・"), the caption is output with a delay equal to the time needed to transcribe the speech. This creates a time shift between the video or audio and the caption. A caption may also be output partway through an utterance.

Patent Document 1 describes a device that, in order to eliminate the mismatch in display timing between video and captions in a broadcast program, estimates within the broadcast station the time shift between the audio and the broadcast script and determines the caption output timing from that shift.

In recent years, work has also progressed on using captions as meta-information for video search. Patent Document 2 describes a method in which the receiving side of a broadcast estimates the time shift between audio and captions, corrects the caption time codes, and generates meta-information for search, together with a device that searches video using that meta-information.

The present inventors have also proposed, in Patent Documents 3 and 4 (prior applications), caption shift estimation devices that estimate the time shift between audio and captions using a caption-to-audio collation technique that does not depend on the content of the speech. The device of Patent Document 3 predicts the time shift from the length of the caption, restricts the range over which caption and audio are collated, and weights the collation results, thereby reducing the computation needed for collation and improving the accuracy of the estimated shift. The device of Patent Document 4 collates, over the entire broadcast content, the phoneme sequence obtained from speech recognition against the phoneme sequence obtained by converting the captions, and finds the optimal correspondence between captions and audio, thereby improving the accuracy of the estimated time shift between audio and captions.

Patent Document 1: JP-A-10-136260
Patent Document 2: JP 2005-229413 A
Patent Document 3: Japanese Patent Application No. 2007-236550 (prior application)
Patent Document 4: Japanese Patent Application No. 2008-093029 (prior application)

As described above, methods have been proposed for correcting the time shift between video or audio and captions, and for improving the accuracy of search meta-information by correcting that shift. Such correction, however, requires estimating the time shift accurately and with little computation.

The device of Patent Document 1 collates the audio against an acoustic model representing the phonetic symbol string for the opening portion of the broadcast script used during recording, detects the audio segment with the highest collation score against that opening portion, and determines the caption output time from the time information of the detected segment. When this technique is applied to associating video or audio with captions, however, the acoustic model must be collated against audio over a wide range, which requires a large amount of computation. In addition, when audio resembling the beginning of a caption occurs in several places, it is difficult to identify the audio segment that corresponds to the caption.

Patent Document 2 does not describe a specific technique for associating video or audio with captions. It states that, when the time shift between audio and captions is estimated, computation is reduced by collating within a predetermined shift range. When the shifts are expected to be distributed over a wide range, however, the computation required for collation grows and collation accuracy falls.

The device of Patent Document 3 likewise reduces the computation needed to estimate the time shift by collating within a predetermined shift range, and therefore has the same problem as Patent Document 2. Moreover, because it estimates the shift between the audio and each caption independently, several captions may be associated with the same audio portion, or captions may be associated in reversed order. If the time shift between audio and captions is corrected using shifts estimated in this way, successive captions may overlap in time or appear in reversed order.

The device of Patent Document 4 collates, over the entire broadcast content, the phoneme sequence obtained from speech recognition against the phoneme sequence obtained by converting the captions, and finds the optimal correspondence between captions and audio, so it avoids the loss of correction accuracy seen in Patent Document 3. However, it must acquire all captions and audio in the target broadcast content before the time shift can be corrected. It therefore cannot correct the shift, or use the corrected result, in real time; for example, it cannot correct the shift between video or audio and captions while the broadcast content is being received and then play back the result.

An object of the present invention is to solve the above problems and to provide a caption shift estimation device, a correction device, a playback device, and a broadcast device that can, while receiving broadcast content, sequentially create broadcast content in which the time shift between video or audio and captions has been corrected with high accuracy.

To solve the above problems, a caption shift correction device according to the present invention comprises: a speech recognition unit that, while broadcast content is being received, recognizes the audio in the received broadcast content and generates a recognition-result phoneme string corresponding to that audio; a caption-converted phoneme string generation unit that generates a phoneme string for each caption in the video of the broadcast content and concatenates those phoneme strings into a caption-converted phoneme string; a phoneme string collation unit that associates captions with audio based on the edit distance between the recognition-result phoneme string generated by the speech recognition unit and the caption-converted phoneme string generated by the caption-converted phoneme string generation unit, and determines the start and end times of each caption; and a shift correction unit that corrects the time shift between audio and captions based on the caption start and end times determined by the phoneme string collation unit. The first feature is that, when a caption is received, the phoneme string collation unit associates the caption with the audio received up to that point and predicts from the result the start and end times of the caption after time-shift correction, and thereafter, at fixed time intervals until the predicted start and end times are reached, repeats the processing of associating captions with audio using the audio and caption information in newly received broadcast content and predicting from the result the start and end times of the caption after time-shift correction.

A second feature of the caption shift correction device according to the present invention is that the phoneme string collation unit predicts the end time of a caption after time-shift correction based on the result of associating the caption with the audio received before the caption, the predicted start time of the caption, and the display duration of the caption acquired when the broadcast content is received.

A third feature of the caption shift correction device according to the present invention is that the phoneme string collation unit predicts the end time of a caption after time-shift correction based on the result of associating the caption with the audio received before the caption, the predicted start time of the caption, and a predicted length of the speech corresponding to the caption, estimated from the caption character string.
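The second and third features differ only in where the caption duration comes from. The following is a minimal sketch of that difference, not the patented procedure: it assumes the end time is simply the predicted start time plus a duration, and the function names and the `sec_per_phoneme` constant are hypothetical.

```python
def end_from_display_time(pred_start: float, display_duration: float) -> float:
    """Second-feature style: reuse the display duration carried with the caption."""
    return pred_start + display_duration


def end_from_speech_length(pred_start: float, phoneme_count: int,
                           sec_per_phoneme: float = 0.08) -> float:
    """Third-feature style: estimate the spoken length from the caption text.

    Assumption: spoken length is proportional to the number of phonemes.
    The source does not say how the length is estimated.
    """
    return pred_start + phoneme_count * sec_per_phoneme
```

For a caption predicted to start at 10.0 s with a 3.5 s display duration, `end_from_display_time(10.0, 3.5)` gives 13.5 s.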

A fourth feature of the caption shift correction device according to the present invention is that the recognition-result phoneme string used when the phoneme string collation unit predicted the caption start and end times after time-shift correction is saved. When the prediction processing is repeated, the phoneme string collation unit appends the recognition-result phoneme string generated by the speech recognition unit between the previous processing time and the current time to the saved string to form a new recognition-result phoneme string, and predicts the end time of the caption after time-shift correction based on the edit distance between this string and the caption-converted phoneme string.

A fifth feature of the caption shift correction device according to the present invention is that the recognition-result phoneme string used when the phoneme string collation unit predicted the caption start and end times after time-shift correction, the intermediate recognition results of the speech recognition unit, and the intermediate collation results of the phoneme string collation unit are all saved. When the prediction processing is repeated, the speech recognition unit takes over the intermediate recognition results saved in the previous round, recognizes the audio from the previous processing time to the current time, and generates a recognition-result phoneme string. The phoneme string collation unit appends this string to the saved recognition-result phoneme string to form a new recognition-result phoneme string, and predicts the end time of the caption after time-shift correction using this string and the intermediate collation results saved in the previous round.
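Saving intermediate collation results so that each round only processes newly recognized phonemes can be sketched as follows. This is an illustrative simplification: the class name is hypothetical, the caption-converted phoneme string is held fixed, and the saved "intermediate result" is taken to be the last column of a unit-cost edit-distance table.

```python
class IncrementalAligner:
    """Keeps the last DP column so new phonemes extend the table in place."""

    def __init__(self, caption_phonemes):
        self.caption = list(caption_phonemes)
        self.recognized = []  # saved recognition-result phoneme string
        # Saved intermediate collation result: distances from an empty
        # recognized string to each caption prefix.
        self.column = list(range(len(self.caption) + 1))

    def extend(self, new_phonemes):
        """Append newly recognized phonemes and return the updated distance."""
        for p in new_phonemes:
            prev = self.column
            curr = [prev[0] + 1]
            for j, c in enumerate(self.caption, start=1):
                curr.append(min(prev[j] + 1,              # p left unmatched
                                curr[j - 1] + 1,          # c left unmatched
                                prev[j - 1] + (p != c)))  # match or substitute
            self.column = curr
            self.recognized.append(p)
        return self.column[-1]
```

Calling `extend` twice with two halves of a string gives the same distance as one pass over the whole string, without revisiting the phonemes processed in the earlier round.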

A sixth feature of the caption shift correction device according to the present invention is that the recognition-result phoneme string used when the phoneme string collation unit predicted the caption start and end times after time-shift correction, the intermediate recognition results of the speech recognition unit, and the intermediate collation results of the phoneme string collation unit are all saved. When the prediction processing is repeated, the speech recognition unit takes over the intermediate recognition results saved in the previous round, recognizes the audio from the previous processing time to the current time, and generates a recognition-result phoneme string. The phoneme string collation unit compares this string with the recognition-result phoneme string obtained by recognizing the audio from the time recognition first started to the current time, goes back to the phoneme at which the two differ, calculates the edit distance between the recognition-result phoneme string and the caption-converted phoneme string using the saved intermediate collation results, and estimates the time shift between audio and captions based on that edit distance.

A seventh feature of the caption shift correction device according to the present invention is that, after the time shift between audio and captions has been estimated, the device discards the captions corresponding to audio segments that precede the time of the estimation by a predetermined time or more, together with the recognition-result phoneme strings, caption-converted phoneme strings, and intermediate collation results corresponding to those captions.

An eighth feature of the caption shift correction device according to the present invention is that, after the time shift between audio and captions has been estimated, the device discards the recognition-result phonemes, caption-converted phoneme strings, and intermediate collation results corresponding to audio segments that precede the time of the estimation by a predetermined time or more.

A ninth feature of the caption shift estimation device according to the present invention is that the edit distance is defined using, as its measure, a cost that depends on the number of operations needed to convert one phoneme string into the other by inserting, deleting, and substituting phonemes.
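The edit distance in the ninth feature corresponds to the standard Levenshtein distance when every insertion, deletion, and substitution costs 1. A minimal sketch over phoneme lists follows; the unit costs and the function name are assumptions for illustration.

```python
def edit_distance(src, dst):
    """Unit-cost edit distance between two phoneme sequences."""
    prev = list(range(len(dst) + 1))  # distances from an empty src prefix
    for i, s in enumerate(src, start=1):
        curr = [i]
        for j, d in enumerate(dst, start=1):
            curr.append(min(prev[j] + 1,              # delete s
                            curr[j - 1] + 1,          # insert d
                            prev[j - 1] + (s != d)))  # substitute or match
        prev = curr
    return prev[-1]
```

For example, a recognition result with one spurious phoneme, `"k o N n i ch i w a".split()`, is at distance 1 from the caption-derived string `"k o n i ch i w a".split()`.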

A tenth feature of the caption shift correction device according to the present invention is that the edit distance is defined using, as its measure, the cost of substituting one phoneme for another, the cost of inserting a phoneme, and the cost of deleting a phoneme, each determined from the speech recognition performance for the individual phonemes.
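Per-phoneme costs can be plugged into the same dynamic program. The sketch below assumes the costs are supplied as lookup tables with a default of 1.0; the function name, the table layout, and the numeric values in the example are hypothetical, and how the costs are derived from recognition performance is not shown.

```python
def weighted_edit_distance(rec, cap, sub_cost, ins_cost, del_cost, default=1.0):
    """Cost of converting recognized phonemes `rec` into caption phonemes `cap`.

    sub_cost maps (rec_phoneme, cap_phoneme) to a cost; ins_cost and del_cost
    map a single phoneme to a cost. Missing entries fall back to `default`.
    """
    n, m = len(rec), len(cap)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost.get(rec[i - 1], default)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost.get(cap[j - 1], default)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r, c = rec[i - 1], cap[j - 1]
            sub = 0.0 if r == c else sub_cost.get((r, c), default)
            d[i][j] = min(d[i - 1][j] + del_cost.get(r, default),
                          d[i][j - 1] + ins_cost.get(c, default),
                          d[i - 1][j - 1] + sub)
    return d[n][m]
```

If the recognizer often confuses "b" with "p", giving that pair a low substitution cost such as 0.3 keeps such a mismatch from dominating the alignment.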

An eleventh feature of the caption shift correction device according to the present invention is that the caption-converted phoneme string generation unit generates a caption-converted phoneme string in which a pseudo-phoneme representing a sentence break is added at each boundary between captions, and the phoneme string collation unit calculates the edit distance while assigning the pseudo-phoneme a lower cost than other phonemes.
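Concatenating per-caption phoneme strings with a boundary pseudo-phoneme might look like the sketch below. The marker string `"<b>"`, the function names, and the cost values are assumptions; the source only states that the pseudo-phoneme receives a lower cost than ordinary phonemes.

```python
BOUNDARY = "<b>"  # hypothetical marker for a caption boundary


def build_caption_sequence(captions):
    """Join per-caption phoneme lists, appending a boundary marker to each.

    Returns the joined sequence and, for every position, the index of the
    caption it came from, so an alignment can be mapped back to captions.
    """
    seq, owner = [], []
    for idx, phonemes in enumerate(captions):
        for p in list(phonemes) + [BOUNDARY]:
            seq.append(p)
            owner.append(idx)
    return seq, owner


def edit_cost(phoneme, normal=1.0, boundary=0.2):
    """Insertion or deletion cost: cheaper for the boundary pseudo-phoneme."""
    return boundary if phoneme == BOUNDARY else normal
```

The `owner` list is a convenience for this sketch: once recognized phonemes with timestamps are aligned to `seq`, it identifies which caption each aligned position belongs to.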

A twelfth feature of the caption shift correction device according to the present invention is that the caption-converted phoneme string generation unit generates a caption-converted phoneme string in which a pseudo-phoneme representing a sentence break is added at each sentence break found by analyzing the captions, and the phoneme string collation unit calculates the edit distance while assigning the pseudo-phoneme a lower cost than other phonemes.

A thirteenth feature of the caption shift correction device according to the present invention is that the caption-converted phoneme string generation unit generates a caption-converted phoneme string in which a pseudo-phoneme representing a sentence break is added both at each boundary between captions and at each sentence break found by analyzing the captions, and the phoneme string collation unit calculates the edit distance while assigning the pseudo-phoneme a lower cost than other phonemes.

A fourteenth feature of the caption shift correction device according to the present invention is that the caption-converted phoneme string generation unit generates a caption-converted phoneme string in which a pseudo-phoneme representing a sentence break is added at each sentence break found by analyzing the captions; the speech recognition unit generates a recognition-result phoneme string in which a pseudo-phoneme representing silence is added wherever silence continues for a fixed time; and the phoneme string collation unit calculates the edit distance with the cost between the silence pseudo-phoneme and the sentence-break pseudo-phoneme set to 0 or a small value, and the cost between the silence pseudo-phoneme and any other phoneme set to a value larger than the other costs.
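The silence and sentence-break pseudo-phonemes can be illustrated with a marker-insertion step and a substitution cost function. This is a sketch under assumptions: the marker strings, the 0.5 s gap threshold, and the cost values 1.0 and 5.0 are hypothetical, and recognized phonemes are assumed to arrive as `(phoneme, start, end)` tuples.

```python
SIL = "<sil>"  # hypothetical silence pseudo-phoneme (recognition side)
BRK = "<brk>"  # hypothetical sentence-break pseudo-phoneme (caption side)


def add_silence_markers(timed_phonemes, min_gap=0.5):
    """Insert SIL wherever the gap between consecutive phonemes is long enough."""
    out, prev_end = [], None
    for phoneme, start, end in timed_phonemes:
        if prev_end is not None and start - prev_end >= min_gap:
            out.append(SIL)
        out.append(phoneme)
        prev_end = end
    return out


def substitution_cost(rec_phoneme, cap_phoneme, base=1.0, large=5.0):
    """Silence aligns freely with a sentence break and poorly with anything else."""
    if rec_phoneme == cap_phoneme:
        return 0.0
    if rec_phoneme == SIL and cap_phoneme == BRK:
        return 0.0
    if rec_phoneme == SIL:
        return large
    return base
```

With these costs, an alignment that places a pause in the audio against a sentence break in the captions is preferred over one that places the pause in the middle of a sentence.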

A fifteenth feature of the caption shift correction device according to the present invention is that the phoneme string collation unit calculates the edit distances between the first r phonemes of the recognition-result phoneme string and the first n phonemes of the caption-converted phoneme string (n = 1 to C, where C is the total number of phonemes in the caption-converted phoneme string) and selects the N smallest of those edit distances. When calculating the edit distances between the first r+1 phonemes of the recognition-result phoneme string and the caption-converted phoneme string, it calculates only those, among the first n phonemes (n = 1 to C) of the caption-converted phoneme string, that can be computed from the N previously selected edit distances, and skips the rest. This processing is repeated to calculate the edit distance between the recognition-result phoneme string and the caption-converted phoneme string.
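The top-N pruning can be sketched as a column-by-column edit-distance computation that keeps only the best `beam` cells of each column. The details below are one possible reading rather than the patented procedure: unit costs are used, a cell counts as computable when the previous column kept cell n or n-1, and the function name and `beam` parameter are hypothetical.

```python
def pruned_edit_distance(rec, cap, beam=3):
    """Return the last DP column as {n: distance}, computed with top-N pruning.

    Key n is the number of leading caption phonemes; pruned cells are absent.
    """
    column = {n: n for n in range(len(cap) + 1)}  # r = 0 recognized phonemes
    for r_ph in rec:
        # Keep the `beam` smallest distances of the previous column.
        kept = dict(sorted(column.items(), key=lambda kv: kv[1])[:beam])
        new = {}
        for n in range(len(cap) + 1):
            if n not in kept and (n - 1) not in kept:
                continue  # not reachable from any kept cell, so skip it
            cands = []
            if n in kept:
                cands.append(kept[n] + 1)                        # r_ph unmatched
            if n - 1 in kept:
                cands.append(kept[n - 1] + (r_ph != cap[n - 1]))  # match/substitute
            if n - 1 in new:
                cands.append(new[n - 1] + 1)                     # caption phoneme unmatched
            new[n] = min(cands)
        column = new
    return column
```

Each column then holds at most 2 × `beam` cells instead of C + 1, which is where the saving in computation and memory comes from. When `beam` is at least C + 1, the result matches the unpruned edit distance.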

A sixteenth feature of the caption shift correction device according to the present invention is that the phoneme string collation unit calculates the edit distances between the first r phonemes of the recognition-result phoneme string and the first n phonemes of the caption-converted phoneme string (n = 1 to C, where C is the total number of phonemes in the caption-converted phoneme string) and selects the smallest edit distance together with those whose difference from the smallest falls within a predetermined threshold. When calculating the edit distances between the first r+1 phonemes of the recognition-result phoneme string and the caption-converted phoneme string, it calculates only those, among the first n phonemes (n = 1 to C) of the caption-converted phoneme string, that can be computed from the previously selected edit distances, and skips the rest. This processing is repeated to calculate the edit distance between the recognition-result phoneme string and the caption-converted phoneme string.

A seventeenth feature of the caption shift correction device according to the present invention is that the phoneme string collation unit calculates the edit distances between the first r phonemes of the recognition-result phoneme string and the first n phonemes of the caption-converted phoneme string (n = 1 to C, where C is the total number of phonemes in the caption-converted phoneme string) and selects the first m phonemes of the caption-converted phoneme string for which the edit distance is smallest. When calculating the edit distances between the first r+1 phonemes of the recognition-result phoneme string and the caption-converted phoneme string, it calculates only those, among the edit distances to the first m-N through the first m+N phonemes of the caption-converted phoneme string (N being a fixed value), that can be computed from the already calculated edit distances for the first r phonemes, and skips the rest. This processing is repeated to calculate the edit distance between the recognition-result phoneme string and the caption-converted phoneme string.
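The threshold-based and window-based variants change only how cells of the previous column are selected. The sketch below shows the two selection rules in isolation, assuming a column is represented as a dictionary from caption prefix length n to edit distance; the function names and parameters are hypothetical.

```python
def select_within_threshold(column, threshold):
    """Keep the minimum-distance cell and every cell within `threshold` of it."""
    best = min(column.values())
    return {n: d for n, d in column.items() if d - best <= threshold}


def window_around_best(column, width, caption_length):
    """Prefix lengths m-width .. m+width around the best cell m, clipped to 0..C."""
    m = min(column, key=column.get)
    return range(max(0, m - width), min(caption_length, m + width) + 1)
```

The threshold rule lets the number of kept cells grow when several alignments are nearly tied, while the window rule bounds the work per column at 2 × `width` + 1 cells regardless of the distance values.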

A first feature of the playback device according to the present invention is that it comprises a caption shift correction device having any of the first to seventeenth features, and playback means for playing back the captions whose time shift has been corrected by the caption shift correction device together with the audio and video in the received broadcast content.

A second feature of the playback device according to the present invention is that it comprises a caption shift correction device having any of the first to seventeenth features, caption storage means for storing the captions whose time shift relative to the audio and video has been corrected by the caption shift correction device, and search means for searching, based on the character information in the captions stored in the caption storage means, for the portion of the video that matches an input keyword, and that it plays back the portion of the video found by the search means.
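Keyword search over stored, time-corrected captions reduces to a text match that returns time ranges for playback. A minimal sketch, assuming captions are stored as `(text, start, end)` tuples and matching is a plain substring test; both are illustrative choices, and the function name is hypothetical.

```python
def search_captions(captions, keyword):
    """Return (start, end) times of every stored caption containing `keyword`."""
    return [(start, end) for text, start, end in captions if keyword in text]
```

Because the stored start and end times are the corrected ones, the returned ranges point at the video where the words are actually spoken rather than where the delayed caption was displayed.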

The broadcast device according to the present invention is characterized in that it comprises the caption shift correction device and transmission means for transmitting, as a broadcast program, the audio, video, and captions whose time shift has been corrected by the caption shift correction device.

In the present invention, while broadcast content is being received, the recognition-result phoneme sequence obtained by recognizing the audio in the broadcast content is collated sequentially against the caption-converted phoneme sequence obtained by converting the captions; captions are associated with audio to determine the caption start and end times; and the time shift between video or audio and captions is corrected accordingly. When a caption is received, it is associated with the audio received up to that point, and the start and end times of the caption after time-shift correction are predicted from the result. Thereafter, at fixed time intervals until the predicted start and end times are reached, captions are associated with audio using the audio and caption information in newly received broadcast content, and the prediction of the start and end times after time-shift correction is repeated. By determining the start and end times of the captions in the broadcast content sequentially in this way, the time shift between video or audio and captions can be corrected with high accuracy while the broadcast content is being received, and broadcast content in which the shift has been corrected with high accuracy can be created sequentially.

According to the eleventh to fourteenth features, audio and captions are associated while taking into account boundaries between captions, sentence breaks within captions, and pauses in the speech in the broadcast content, so the time shift between video or audio and captions can be corrected with high accuracy.

According to the seventh, eighth, and fifteenth to seventeenth features, the amount of memory and the amount of computation required to correct the time shift between the video or audio and the captions can be reduced.

FIG. 1 is a block diagram showing the basic configuration of the caption shift correction device according to the present invention.
FIG. 2 is a block diagram showing an embodiment of the caption start/end time determination unit.
FIG. 3 is a block diagram showing a configuration example of the speech recognition unit.
FIG. 4 is a diagram showing examples of language models.
FIG. 5 is a schematic diagram showing the collation processing in the speech collation unit.
FIG. 6 is a schematic diagram showing the generation of a caption-converted phoneme string in the caption-converted phoneme string generation unit.
FIG. 7 is a diagram explaining the first embodiment (first stage) of the phoneme string collation processing.
FIG. 8 is a diagram explaining the first embodiment (second stage) of the phoneme string collation processing.
FIG. 9 is a diagram explaining the first embodiment (third stage) of the phoneme string collation processing.
FIG. 10 is a diagram explaining the first embodiment (fourth stage) of the phoneme string collation processing.
FIG. 11 is an explanatory diagram showing an example of an edit distance calculation process for reducing the computational load of shift width estimation.
FIG. 12 is an explanatory diagram showing another example of an edit distance calculation process for reducing the computational load of shift width estimation.
FIG. 13 is an explanatory diagram showing the time relationship between the utterances of a performer in a broadcast program and the output of captions.

The present invention will now be described with reference to the drawings. FIG. 1 is a block diagram showing the basic configuration of the caption shift correction device according to the present invention. The device includes an information separation unit 11, a caption start/end time determination unit 12, and a shift correction unit 13. What is corrected is the time shift between the audio and the captions in terms of their information content. In the following, the terms "shift" and "shift width" refer to this time shift and to its magnitude (time shift width), respectively.

Broadcast content is input to the information separation unit 11. The broadcast content includes video, audio, and captions, each stored in its own track. In terms of information content, there is no time shift between the video and the audio, whereas the captions lag behind them. The information separation unit 11 separates the audio and the captions from the broadcast content.

The caption start/end time determination unit 12 determines the output start time and end time of each caption from the correspondence between the audio and the captions separated by the information separation unit 11. This unit is described in detail later.

Based on the caption start and end times determined by the caption start/end time determination unit 12, the shift correction unit 13 sequentially outputs broadcast content in which, in terms of information content, there is no shift among the video, audio, and captions. The shift is eliminated by temporarily storing the information of each of the video, audio, and caption tracks in a buffer and controlling the timing at which it is read out. That is, the video and audio are read out and output continuously, while each caption is read out and output in accordance with the start and end times determined by the caption start/end time determination unit 12.
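The read-timing control described above can be illustrated with a minimal Python sketch. It is not the device's implementation: the Caption record, the function captions_visible_at, and the sample times are hypothetical, and the corrected start and end times are assumed to have been supplied already by the caption start/end time determination unit 12.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    text: str
    start: float  # corrected start time in seconds (hypothetical value)
    end: float    # corrected end time in seconds (hypothetical value)


def captions_visible_at(captions, t):
    """Return the texts of buffered captions whose corrected interval contains t."""
    return [c.text for c in captions if c.start <= t < c.end]


# Captions wait in the buffer; video and audio are read out continuously and
# each caption is emitted only while the playback time is inside its interval.
buffered = [Caption("konnichiwa", 1.0, 2.5), Caption("asa no", 2.5, 4.0)]
timeline = {t: captions_visible_at(buffered, t) for t in (0.5, 1.0, 2.5, 4.0)}
```

The half-open interval (start inclusive, end exclusive) is a design choice of the sketch so that back-to-back captions never overlap.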

FIG. 2 is a block diagram showing an embodiment of the caption start/end time determination unit 12. The unit of this embodiment includes a speech recognition unit 21, a recognition result phoneme string storage unit 22, a caption-converted phoneme string generation unit 23, a caption-converted phoneme string storage unit 24, a phoneme string collation unit 25, and a phoneme string collation result storage unit 26.

In TV broadcast content, the audio uttered by a performer is captured together with the video of the performer, so the video and the audio can be regarded as having been captured without any time shift. The captured audio is input to the speech recognition unit 21, which performs recognition processing on it and outputs the corresponding recognition result phoneme string.

FIG. 3 is a block diagram showing a configuration example of the speech recognition unit 21. The speech recognition unit 21 includes a speech detection unit 31, an acoustic analysis unit 32, an acoustic model storage unit 33, a language model storage unit 34, and a speech collation unit 35. This configuration is common in speech recognition.

The speech detection unit 31 cuts out, from the input audio, the segments that contain human voice and sends them to the acoustic analysis unit 32. The segments can be cut out using, for example, a speech detection method based on the magnitude of the input power. In this method, the input power is computed sequentially; the point at which the power has continuously exceeded a predetermined threshold for a fixed time is judged to be the start of speech, and conversely the point at which the power has continuously fallen below the predetermined threshold for a fixed time is judged to be the end of speech. The audio cut out by the speech detection unit 31 is sent sequentially to the acoustic analysis unit 32 from the speech start point to the speech end point.
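The power-based detection rule above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name detect_speech, the frame-wise power list, and the parameters threshold and min_frames are hypothetical, and the sketch places each boundary at the first frame of the qualifying run, which the text does not specify.

```python
def detect_speech(powers, threshold, min_frames):
    """Return (start, end) frame indices of speech segments; end is exclusive.

    A segment starts at the first frame of a run of min_frames consecutive
    frames above the threshold and ends at the first frame of a run of
    min_frames consecutive frames below it (assumption of this sketch).
    """
    segments = []
    in_speech = False
    run = 0        # length of the current run of qualifying frames
    start = None
    for i, p in enumerate(powers):
        if not in_speech:
            run = run + 1 if p > threshold else 0
            if run >= min_frames:
                in_speech, start, run = True, i - min_frames + 1, 0
        else:
            run = run + 1 if p < threshold else 0
            if run >= min_frames:
                segments.append((start, i - min_frames + 1))
                in_speech, run = False, 0
    if in_speech:
        # Speech still in progress when the input ends.
        segments.append((start, len(powers)))
    return segments


# A one-frame dip (index 6) is too short to end the segment.
segments = detect_speech([0, 0, 5, 5, 5, 5, 0, 5, 0, 0, 0, 5], threshold=1, min_frames=2)
```

Requiring a run of frames on both sides keeps short dips or bursts from splitting or creating segments.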

The acoustic analysis unit 32 performs acoustic analysis of the audio cut out by the speech detection unit 31 and sends a sequence of acoustic features representing the characteristics of the speech, such as MFCCs, to the speech collation unit 35.

The acoustic model storage unit 33 stores standard patterns, such as HMMs, prepared for each phoneme, the unit of which Japanese speech is composed. By concatenating these standard patterns according to the phoneme string of a Japanese word or sentence, a standard pattern corresponding to any Japanese word or sentence can be created.

The language model storage unit 34 stores a language model that defines connection relationships, for example between Japanese words or between phonemes. Such language models include (1) a continuous syllable recognition grammar that defines the connections between syllables, (2) grammar rules that define the connections between words, (3) a statistical language model that defines the probability that a set of N phonemes occurs in succession, and (4) a statistical language model that defines the probability that a set of N words occurs in succession.

FIG. 4 shows examples of language models. FIG. 4(a) is a continuous syllable recognition grammar that defines the connections between syllables; it specifies how the consonants /b/, /d/, ... connect with the vowels /a/, /i/, .... FIG. 4(b) shows grammar rules that define the connections between words; they specify how /word 1/, /word 2/, ... connect. Language models are described in, for example, Shikano et al., "IT Text Speech Recognition System" (Ohmsha).

FIG. 5 is a schematic diagram showing the collation processing in the speech collation unit 35 (FIG. 3). It shows that the acoustic feature sequence sent from the acoustic analysis unit 32 is collated with the standard pattern in the speech collation unit 35, yielding the collation result /sh/ /i/ ... /u/ together with the start and end times of the speech segment corresponding to each phoneme.

Returning to FIG. 3, the speech collation unit 35 generates a standard pattern by connecting acoustic models according to the connection rules described in the language model, and collates the acoustic feature sequence sent from the acoustic analysis unit 32 with the standard pattern using the Viterbi algorithm. This collation yields the correspondence between speech segments and the standard pattern that maximizes the collation score. As the recognition result of the speech recognition unit 21, the recognition result phoneme string and the start and end times of the speech segment corresponding to each phoneme constituting the standard pattern are obtained. The recognition result phoneme string obtained in this way is stored in the recognition result phoneme string storage unit 22 (FIG. 2). Speech collation is described in Seiichi Nakagawa et al., "Speech Recognition by Stochastic Models" (The Institute of Electronics, Information and Communication Engineers).

The speech recognition unit 21 acquires a recognition result at the time instructed by the phoneme string collation unit 25. The speech collation unit 35 is assumed to retain the intermediate collation result (partial collation result) obtained in the course of collation. Then, in accordance with the instruction from the phoneme string collation unit 25, it either continues collation by carrying over the intermediate collation result from the previous acquisition of a recognition result, or discards the intermediate collation result used in the previous collation and restarts collation from the initial state.

Meanwhile, the caption-converted phoneme string generation unit 23 generates a caption-converted phoneme string corresponding to the input captions. The string generated here is not a phoneme string for an individual caption but a concatenation of the phoneme strings corresponding to the individual captions in the broadcast content. The caption-converted phoneme string is stored in the caption-converted phoneme string storage unit 24.

FIG. 6 is a schematic diagram showing how the caption-converted phoneme string generation unit 23 (FIG. 2) generates a caption-converted phoneme string. The unit performs morphological analysis on a caption written in mixed kanji and kana, divides it into parts of speech, and converts them into kana strings representing their readings. It then converts the kana strings into a phoneme string by referring to a conversion table that lists the rules for converting kana characters into phonetic symbols, thereby generating the caption-converted phoneme string.

For example, when the caption text written in mixed kanji and kana is 「７時のニュースです」 ("It is the seven o'clock news"), the caption-converted phoneme string generation unit 23 first divides it by morphological analysis into the parts of speech 「７」, 「時」, 「の」, 「ニュース」, and 「です」. It then converts these into the kana strings 「しち」, 「じ」, 「の」, 「にゅーす」, and 「です」 representing their readings. Finally, by referring to the conversion table that lists the rules for converting kana strings into phonetic symbols, it converts the kana strings into the phoneme string /sh/ /i/ /ch/ /i/ /j/ /i/ /n/ /o/ /ny/ /uu/ /s/ /u/ /d/ /e/ /s/ /u/.
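The table-driven conversion from kana readings to phonemes can be sketched in Python as follows. This is only an illustration: the table KANA_TO_PHONEMES contains just the entries needed for the example above and is hypothetical, as is the function kana_to_phonemes. Morphological analysis is outside the scope of the sketch, so the readings are supplied by hand.

```python
# Hypothetical excerpt of a kana-to-phoneme conversion table.
KANA_TO_PHONEMES = {
    "にゅー": ["ny", "uu"],
    "し": ["sh", "i"],
    "ち": ["ch", "i"],
    "じ": ["j", "i"],
    "の": ["n", "o"],
    "す": ["s", "u"],
    "で": ["d", "e"],
}


def kana_to_phonemes(kana, table=KANA_TO_PHONEMES):
    """Convert a kana reading to phonemes by greedy longest-match lookup."""
    phonemes = []
    max_len = max(len(key) for key in table)
    i = 0
    while i < len(kana):
        # Try the longest candidate first so that "にゅー" wins over shorter keys.
        for n in range(min(max_len, len(kana) - i), 0, -1):
            chunk = kana[i:i + n]
            if chunk in table:
                phonemes.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no conversion rule for {kana[i]!r}")
    return phonemes


# Readings assumed to come from morphological analysis of 「７時のニュースです」.
readings = ["しち", "じ", "の", "にゅーす", "です"]
caption_phonemes = [p for reading in readings for p in kana_to_phonemes(reading)]
```

Longest-match lookup is one simple way to handle multi-character units such as 「にゅー」; a real conversion table would be far larger.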

The phoneme string collation unit 25 collates the recognition result phoneme string stored in the recognition result phoneme string storage unit 22 with the caption-converted phoneme string stored in the caption-converted phoneme string storage unit 24, and determines the output start and end times of each caption from the correspondence between the audio and the captions.

The phoneme string collation result storage unit 26 holds the intermediate collation results (partial edit distances) obtained by the phoneme string collation unit 25 in the course of collation. Holding the partial edit distances makes collation efficient when a new recognition result phoneme string is appended to the recognition result phoneme string or a new caption-converted phoneme string is appended to the caption-converted phoneme string: in either case, the edit distances involving the newly added phonemes can be computed from the partial edit distances already obtained. The edit distance is described below.
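How retained partial edit distances allow either string to be extended without recomputation can be sketched in Python as follows. The class IncrementalEditDistance and its method names are hypothetical, and the sketch assumes unit costs (each insertion, deletion, or mismatching substitution costs 1) with the full table kept in memory.

```python
class IncrementalEditDistance:
    """Edit-distance table that grows as phonemes are appended to either string."""

    def __init__(self):
        self.rph = []     # recognition result phonemes received so far
        self.cph = []     # caption-converted phonemes received so far
        self.d = [[0]]    # d[r][c]: distance between rph[:r] and cph[:c]

    def _cell(self, r, c):
        if r == 0:
            return c
        if c == 0:
            return r
        sub = 0 if self.rph[r - 1] == self.cph[c - 1] else 1
        return min(self.d[r - 1][c] + 1,
                   self.d[r][c - 1] + 1,
                   self.d[r - 1][c - 1] + sub)

    def add_recognized(self, phonemes):
        """Append recognized phonemes; only the new rows are computed."""
        for p in phonemes:
            self.rph.append(p)
            r = len(self.rph)
            self.d.append([])
            for c in range(len(self.cph) + 1):
                self.d[r].append(self._cell(r, c))

    def add_caption(self, phonemes):
        """Append caption phonemes; only the new columns are computed."""
        for p in phonemes:
            self.cph.append(p)
            c = len(self.cph)
            for r in range(len(self.rph) + 1):
                self.d[r].append(self._cell(r, c))

    def distance(self):
        return self.d[len(self.rph)][len(self.cph)]


ed = IncrementalEditDistance()
ed.add_caption(["t", "o"])
ed.add_recognized(["sh", "i", "N"])
ed.add_caption(["k", "a", "i"])
ed.add_recognized(["k", "a", "i"])
total = ed.distance()
```

Each call touches only the cells that involve the newly added phonemes, which is the saving the paragraph above describes.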

Next, the collation processing in the phoneme string collation unit 25 is described. The unit compares the recognition result phoneme string stored in the recognition result phoneme string storage unit 22 with the caption-converted phoneme string stored in the caption-converted phoneme string storage unit 24, and calculates an edit distance representing the degree of difference between the two. The edit distance is defined in terms of an editing cost that depends on the number of operations (insertion, deletion, or substitution of a phoneme) needed to convert one phoneme string into the other.

For example, the edit distance may be the minimum number of operations needed to convert one phoneme string into another. Converting "/sh/ /i/ /N/ /k/ /a/ /i/" into "/t/ /o/ /k/ /a/ /i/" requires at least three operations, as shown below, so the edit distance is 3.
1. /sh/ /i/ /N/ /k/ /a/ /i/
2. /t/ /i/ /N/ /k/ /a/ /i/ (substitute "/t/" for "/sh/")
3. /t/ /o/ /N/ /k/ /a/ /i/ (substitute "/o/" for "/i/")
4. /t/ /o/ /k/ /a/ /i/ (delete "/N/"; done)

The edit distance between the recognition result phoneme string and the caption-converted phoneme string need not be defined simply by the number of operations. For example, based on the speech recognition performance for each phoneme (such as how easily phonemes are confused with one another), the cost of substituting phoneme B for phoneme A, the cost of inserting phoneme A, and the cost of deleting phoneme A can be set individually, and the edit distance can be defined from these costs. For example, the phonemes /b/ and /p/ are easily confused, so the cost of substituting one for the other is set low.

The edit distance can be computed quickly with the following algorithm based on dynamic programming.

Recognition result phoneme string: rph[1], rph[2], ..., rph[R]
Caption-converted phoneme string: cph[1], cph[2], ..., cph[C]
For each r from the first phoneme (rph[1]) to the last phoneme (rph[R]) of the recognition result phoneme string:
  For each c from the first phoneme (cph[1]) to the last phoneme (cph[C]) of the caption-converted phoneme string:
    // Edit distance between the first r phonemes rph[1..r] of the recognition result phoneme string
    // and the first c phonemes cph[1..c] of the caption-converted phoneme string
    d[r,c] = minimum(
      d[r-1,c] + ins_cost(cph[c]),            // phoneme insertion
      d[r,c-1] + del_cost(rph[r]),            // phoneme deletion
      d[r-1,c-1] + sub_cost(cph[c], rph[r])   // phoneme substitution
    )

Here, ins_cost(cph[c]), del_cost(rph[r]), and sub_cost(cph[c], rph[r]) denote, respectively, the cost of inserting a phoneme into the recognition result phoneme string, the cost of deleting a phoneme from the recognition result phoneme string, and the cost of substituting another phoneme for a phoneme of the recognition result phoneme string.

This algorithm computes the edit distance efficiently by repeating the following step. It takes the edit distance d[r-1,c] between the first r-1 phonemes rph[1..r-1] of the recognition result phoneme string and the first c phonemes cph[1..c] of the caption-converted phoneme string, the edit distance d[r,c-1] between rph[1..r] and cph[1..c-1], and the edit distance d[r-1,c-1] between rph[1..r-1] and cph[1..c-1], and from these calculates the edit distance d[r,c] between the first r phonemes rph[1..r] and the first c phonemes cph[1..c].

When the edit distance between the recognition result phoneme string and the caption-converted phoneme string is defined as the number of conversion operations, the costs are as follows.

Insertion cost: ins_cost(cph[c]) = 1 (always)
Deletion cost: del_cost(rph[r]) = 1 (always)
Substitution cost: sub_cost(cph[c], rph[r]) = 0 (if cph[c] = rph[r])
                                            = 1 (if cph[c] ≠ rph[r])
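The dynamic programming recurrence with these costs can be sketched in Python as follows. Several points are assumptions of the sketch rather than statements of the specification. The function name edit_distance and the example cost function confusable_sub_cost (including the value 0.2) are hypothetical. The boundary values d[r,0] and d[0,c] are not given in the text and are filled in the usual way. The sketch also uses the usual textbook pairing, in which the d[r-1,c] term deletes rph[r] and the d[r,c-1] term inserts cph[c], whereas the listing above attaches ins_cost to d[r-1,c] and del_cost to d[r,c-1]. With unit costs both pairings give the same distance.

```python
def edit_distance(rph, cph,
                  ins_cost=lambda p: 1,
                  del_cost=lambda p: 1,
                  sub_cost=lambda c, r: 0 if c == r else 1):
    """Edit distance for converting rph into cph; unit costs by default."""
    R, C = len(rph), len(cph)
    d = [[0] * (C + 1) for _ in range(R + 1)]
    # Boundary rows and columns: delete every rph phoneme, or insert every cph phoneme.
    for r in range(1, R + 1):
        d[r][0] = d[r - 1][0] + del_cost(rph[r - 1])
    for c in range(1, C + 1):
        d[0][c] = d[0][c - 1] + ins_cost(cph[c - 1])
    for r in range(1, R + 1):
        for c in range(1, C + 1):
            d[r][c] = min(
                d[r - 1][c] + del_cost(rph[r - 1]),                    # delete rph[r]
                d[r][c - 1] + ins_cost(cph[c - 1]),                    # insert cph[c]
                d[r - 1][c - 1] + sub_cost(cph[c - 1], rph[r - 1]),    # substitute or match
            )
    return d[R][C]


def confusable_sub_cost(c, r):
    """Hypothetical weighted cost: /b/ and /p/ are cheap to confuse."""
    if c == r:
        return 0
    return 0.2 if {c, r} == {"b", "p"} else 1


unit = edit_distance(["sh", "i", "N", "k", "a", "i"], ["t", "o", "k", "a", "i"])
weighted = edit_distance(["b", "a"], ["p", "a"], sub_cost=confusable_sub_cost)
```

With the default unit costs, the first call reproduces the distance of 3 from the worked example earlier in this section.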

While the edit distance is being computed, a record is kept of which edit operation (insertion, deletion, or substitution of a phoneme) produced each selected minimum. After the edit distance has been computed through to the end of both strings, the recorded selections are read out in reverse order, starting from the selection made when computing the edit distance up to the last phoneme rph[R] of the recognition result phoneme string and the last phoneme cph[C] of the caption-converted phoneme string. This gives the sequence of edit operations (the combination of insertions, deletions, and substitutions) that minimizes the edit distance. From it, the information on which phoneme of the caption-converted phoneme string each phoneme of the recognition result phoneme string was associated with can be obtained.
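Recording the selected operation in each cell and reading the record back in reverse can be sketched in Python as follows. The function name align, the operation labels, and the tie-breaking order (substitution first) are assumptions of the sketch; unit costs are used, and the returned pairs are 0-based indices (recognized phoneme, caption phoneme).

```python
def align(rph, cph):
    """Return (edit distance, list of (rph index, cph index) associations)."""
    R, C = len(rph), len(cph)
    d = [[0] * (C + 1) for _ in range(R + 1)]
    op = [[None] * (C + 1) for _ in range(R + 1)]   # operation chosen per cell
    for r in range(1, R + 1):
        d[r][0], op[r][0] = r, "del"
    for c in range(1, C + 1):
        d[0][c], op[0][c] = c, "ins"
    for r in range(1, R + 1):
        for c in range(1, C + 1):
            sub = 0 if rph[r - 1] == cph[c - 1] else 1
            candidates = [
                (d[r - 1][c - 1] + sub, "sub"),   # substitution or exact match
                (d[r - 1][c] + 1, "del"),         # rph[r] has no caption partner
                (d[r][c - 1] + 1, "ins"),         # cph[c] has no recognized partner
            ]
            d[r][c], op[r][c] = min(candidates, key=lambda t: t[0])
    # Read the recorded operations back from the last cell to the first.
    pairs = []
    r, c = R, C
    while r > 0 or c > 0:
        if op[r][c] == "sub":
            pairs.append((r - 1, c - 1))
            r, c = r - 1, c - 1
        elif op[r][c] == "del":
            r -= 1
        else:
            c -= 1
    pairs.reverse()
    return d[R][C], pairs


distance, pairs = align(["sh", "i", "N", "k", "a", "i"], ["t", "o", "k", "a", "i"])
```

Because the recognition result carries a start and end time for each phoneme, such index pairs are what allow a time to be assigned to each caption phoneme.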

When broadcast content is received, captions are received one after another as the content progresses. In the present invention, as described below, each time a new caption is received the recognition result phoneme string and the caption-converted phoneme string are sequentially collated to predict the start and end times of the caption. Based on the result, broadcast content whose caption shift has been corrected can be generated sequentially, while the content is being received, with a predetermined delay time T (sec) from the reception time. The delay time T is determined in advance with reference to the distribution of shift widths between video and captions in various broadcast content. For example, T can be set to the maximum shift width between video and captions over all broadcast content.

The present invention is characterized in particular by the processing that collates the recognition result phoneme string with the caption-converted phoneme string (phoneme string collation processing), so embodiments of this processing are described below. It is assumed here that the audio 「こんにちわ朝の・・・」 ("konnichiwa asa no ...") is received and that the corresponding captions are obtained with a delay.
(First embodiment)

FIGS. 7 to 10 are diagrams explaining the first embodiment of the phoneme string collation processing.
(A) First, when the first caption 「こんにちわ」 ("konnichiwa") is obtained, the following collation processing is performed. FIG. 7 shows the collation processing in this case.

(1) The caption-converted phoneme string corresponding to the captions received up to the current time, "/GB/ /k/ /o/ /N/ /n/ /i/ /ch/ /i/ /w/ /a/ /GB/", is acquired from the caption-converted phoneme string storage unit 24. GB is a pseudo-phoneme (described later) inserted at breaks between caption sentences.

(2) The recognition result phoneme string corresponding to the audio from T (sec) before the current time up to the current time, "/t/ /o/ /N/ /n/ /i/", is acquired from the speech recognition unit 21 via the recognition result phoneme string storage unit 22. The buffer size shown in the figure corresponds to T (sec). It is assumed here that the corresponding speech has not yet ended when the caption 「こんにちわ」 is acquired, that is, the caption is obtained partway through the utterance of 「こんにちわ」. The speech recognition unit 21 is also instructed to discard its intermediate collation result and then collate the audio from the current time onward.

(3) The caption-converted phoneme string acquired in (1) and the recognition result phoneme string acquired in (2) are collated according to the above algorithm (collation range 1). The phoneme string collation result storage unit 26 holds the collation result for collation range 1 as an intermediate collation result.

(4) From the collation results up to the current time, the phoneme in the caption-converted phoneme string that has the smallest edit distance to the recognition result phoneme string is found.

(5) The collation history is traced back from the phoneme found in (4), and the provisional start time of the caption is predicted.

(6) The provisional end time of the caption is predicted according to the following criteria.

(6-1) If the phoneme found in (4) is the final phoneme of the caption-converted phoneme string, the end time of the caption is predicted by tracing back the collation history and is used as the provisional end time.
(6-2) Otherwise, the provisional end time of the caption is predicted to be the provisional start time of the caption plus the display duration of the original caption.
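The two-way rule in (6-1) and (6-2) can be written as a small Python function. This is a sketch only: the function and parameter names are hypothetical, and the traced end time is assumed to have been obtained already by tracing back the collation history.

```python
def provisional_end_time(best_is_final_phoneme, traced_end_time,
                         provisional_start, original_duration):
    """Predict a caption's provisional end time (all times in seconds).

    best_is_final_phoneme: True if the best-matching caption phoneme found in
        step (4) is the final phoneme of the caption-converted phoneme string.
    traced_end_time: end time obtained by tracing back the collation history;
        used only in case (6-1).
    """
    if best_is_final_phoneme:
        # (6-1): the whole caption has been matched, so use the traced end time.
        return traced_end_time
    # (6-2): fall back to the original display duration of the caption.
    return provisional_start + original_duration


matched = provisional_end_time(True, 12.4, 10.0, 3.0)
partial = provisional_end_time(False, None, 10.0, 3.0)
```

The fallback in (6-2) keeps each caption on screen for as long as the broadcaster originally displayed it until the match is complete.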

(B) Thereafter, the following collation processing is performed at fixed intervals.

(a) Until the next caption is acquired, that is, when none of the conditions (b) to (d) below applies. FIG. 8 shows the collation processing in this case.
(1) The recognition result phoneme string corresponding to the audio from the time of the previous collation processing up to the current time, "/ch/ /i/ /h/ /a/ /a/ /s/ /a/", is acquired from the speech recognition unit 21 via the recognition result phoneme string storage unit 22. The speech recognition unit 21 is also instructed to discard its intermediate collation result and then collate the audio from the current time onward.
(2) The recognition result phoneme string acquired in (1) is appended to the recognition result phoneme string used in the previous collation.
(3) The caption-converted phoneme string used in the previous collation and the recognition result phoneme string of (2) are collated according to the above algorithm (collation range 1 + collation range 2). In this case, the collation for collation range 2 can be performed by carrying over the intermediate collation results held in the phoneme string collation result storage unit 26. The phoneme string collation result storage unit 26 additionally holds the intermediate collation result for collation range 2.
(4) From the collation results up to the current time, the phoneme in the caption-converted phoneme string that has the smallest edit distance to the recognition result phoneme string is found.
(5) The collation history is traced back from the phoneme found in (4), and the provisional start time of the caption is predicted (updated).
(6) The provisional end time of the caption is predicted (updated) according to the following criteria.
(6-1) If the phoneme found in (4) is the final phoneme of the caption-converted phoneme string, the end time of the caption is predicted by tracing back the collation history and is used as the provisional end time.
(6-2) Otherwise, the provisional end time of the caption is predicted to be the provisional start time of the caption plus the display duration of the original caption.

(b) When the next caption is acquired. FIGS. 9 and 10 show the collation processing in this case.
(1) The caption-converted phoneme string corresponding to the newly acquired caption, "/a/ /s/ /a/ /n/ /o/ /GB/", is appended to the caption-converted phoneme string used in the previous collation (FIG. 9).
(2) The caption-converted phoneme string of (1) and the recognition result phoneme string used in the previous collation are collated according to the above algorithm. Until a recognition result phoneme string is added, the collation range is collation range 1 + collation range 2 + collation range 3. In collation range 3, collation is performed on the recognition result phoneme string from T (sec) before the current time onward. In this case, the collation for collation range 3 can be performed by carrying over the intermediate collation results held in the phoneme string collation result storage unit 26. The phoneme string collation result storage unit 26 additionally holds the intermediate collation result for collation range 3.

When a further recognition result phoneme string is added (FIG. 10), the collation range becomes collation range 1 + collation range 2 + collation range 3 + collation range 4. Collation range 4 can be processed by carrying over the collation intermediate result held in the phoneme string collation result storage unit 26, which then also holds the collation intermediate result for collation range 4.
(3) From the collation results up to the current time, find the phoneme in the subtitle converted phoneme string that has the smallest edit distance to the recognition result phoneme string.
(4) Trace the collation history back from the phoneme found in (3) and predict (update) the provisional start time of the subtitle.
(5) Predict (update) the provisional end time of the subtitle according to the following criteria.
(5-1) If the phoneme found in (3) is the final phoneme of the subtitle converted phoneme string, trace back through the collation history to predict the end time of the subtitle and use it as the provisional end time.
(5-2) In cases other than (5-1), predict the provisional end time of the subtitle as the provisional start time plus the display duration of the original subtitle.
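The incremental collation in cases (a) and (b) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes a unit-cost edit distance table d[r][c] whose rows are recognition result phonemes and whose columns are subtitle converted phonemes, and the class and method names are hypothetical. The stored table plays the role of the collation intermediate result, so only new rows (case (a)) or new columns (case (b)) are computed.

```python
class IncrementalCollation:
    """Edit-distance table between recognized phonemes (rows) and
    subtitle phonemes (columns), extended in place as data arrives."""

    def __init__(self):
        self.rph = []    # recognition result phonemes
        self.cph = []    # subtitle converted phonemes
        self.d = [[0]]   # d[0][0] = 0: both prefixes are empty

    def _cell(self, r, c):
        # Unit-cost recurrence (an assumed cost model).
        sub = 0 if self.rph[r - 1] == self.cph[c - 1] else 1
        return min(self.d[r - 1][c] + 1,        # delete a recognized phoneme
                   self.d[r][c - 1] + 1,        # insert a subtitle phoneme
                   self.d[r - 1][c - 1] + sub)  # match or substitute

    def add_recognized(self, phonemes):
        """Case (a): add rows; all existing rows are reused unchanged."""
        for p in phonemes:
            self.rph.append(p)
            r = len(self.rph)
            row = [self.d[r - 1][0] + 1]
            self.d.append(row)
            for c in range(1, len(self.cph) + 1):
                row.append(self._cell(r, c))

    def add_subtitle(self, phonemes):
        """Case (b): add columns; all existing columns are reused unchanged."""
        for p in phonemes:
            self.cph.append(p)
            c = len(self.cph)
            self.d[0].append(self.d[0][c - 1] + 1)
            for r in range(1, len(self.rph) + 1):
                self.d[r].append(self._cell(r, c))

    def best_subtitle_phoneme(self):
        """1-based column with the smallest distance in the last row.
        Requires at least one subtitle phoneme."""
        last = self.d[-1]
        return min(range(1, len(last)), key=last.__getitem__)
```

Because each cell depends only on the two prefixes it compares, the table is the same whichever order the rows and columns arrive in. The time window T and the trace-back that yields the provisional times are omitted from this sketch.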

(c) When the provisional start time of the subtitle becomes T (sec) before the current time.
(1) From the collation results up to the current time, find the phoneme in the subtitle converted phoneme string that has the smallest edit distance to the recognition result phoneme string.
(2) Trace the collation history back from the phoneme found in (1) and predict (update) the provisional start time of the subtitle.
(3) If the newly predicted provisional start time is T (sec) before the current time or earlier, the original provisional start time, that is, T (sec) before the current time, is taken as the final value of the subtitle start time.

(d) When the provisional end time of the subtitle becomes T (sec) before the current time.
(1) From the collation results up to the current time, find the phoneme in the subtitle converted phoneme string that has the smallest edit distance to the recognition result phoneme string.
(2) Trace the collation history back from the phoneme found in (1) and predict (update) the provisional end time of the subtitle.
(3) If the newly predicted provisional end time is T (sec) before the current time or earlier, the original provisional end time, that is, T (sec) before the current time, is taken as the final value of the subtitle end time.
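The finalization rule shared by (c)(3) and (d)(3) can be written as a small function. This is only a sketch under the assumption that times are expressed in seconds on a common clock; the function name is hypothetical.

```python
def finalize_if_due(predicted_time, now, delay_t):
    """Return the finalized time, or None while the time is still provisional.

    A time is finalized once the re-predicted value is T (sec) before the
    current time or earlier; the finalized value is then exactly now - T.
    """
    deadline = now - delay_t
    if predicted_time <= deadline:
        return deadline
    return None
```

If the re-predicted time is later than the deadline, the time remains provisional and is updated at the next collation.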

With the above collation processing, the start time and end time of a subtitle are determined at (c)(3) and (d)(3), respectively. Based on this result, the shift correction unit 13 (FIG. 1) corrects the shift between the video and audio and the subtitles in the broadcast content and transmits them sequentially. By repeating the above collation processing until the broadcast content ends, broadcast content containing temporally aligned video, audio, and subtitles is transmitted, delayed by the fixed time T (sec) from the input time.
(Second embodiment)

In the first embodiment, the provisional end time of a subtitle is predicted as the provisional start time plus the display duration of the original subtitle (step (6-2)). However, some broadcast content includes the output start time of each subtitle but not its output end time, and in other content the subtitles are sent at regular intervals, so that the subtitle display duration correlates poorly with the actual utterance length. For such broadcast content, it is not appropriate to predict the provisional end time as the provisional start time plus the display duration of the original subtitle.

Therefore, in the second embodiment, the provisional end time of a subtitle is predicted as the provisional start time plus the utterance length estimated from the subtitle text. The utterance length can be estimated from, for example, the subtitle converted phoneme string and information on the average duration of each phoneme.
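One way to estimate the utterance length from average phoneme durations is sketched below. The duration values, the default duration, and the function names are hypothetical placeholders chosen for illustration; actual values would have to come from measured speech data.

```python
# Hypothetical average durations in seconds (illustrative values only).
AVERAGE_DURATION_SEC = {"a": 0.09, "s": 0.11, "n": 0.06, "o": 0.09}
DEFAULT_DURATION_SEC = 0.08   # assumed fallback for phonemes not in the table
GB = "GB"                     # pseudo-phoneme for a break; not spoken


def estimate_utterance_length(phonemes):
    """Sum the average durations of the spoken phonemes of one subtitle."""
    return sum(AVERAGE_DURATION_SEC.get(p, DEFAULT_DURATION_SEC)
               for p in phonemes if p != GB)


def predict_provisional_end(provisional_start, phonemes):
    """Provisional end = provisional start + estimated utterance length."""
    return provisional_start + estimate_utterance_length(phonemes)
```

For the subtitle "/a/ /s/ /a/ /n/ /o/ /GB/", the estimate under these placeholder values is the sum of the five spoken phoneme durations; the pseudo-phoneme contributes nothing.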
(Third embodiment)

In the first and second embodiments, after the recognition result is acquired ((A-2), (B-a-1)), the speech recognition unit 21 is instructed to discard the collation intermediate result and then recognize the speech after the current time from the initial state. However, if the recognition result is acquired in the middle of a phoneme, for example, the speech recognition unit 21 will start the next recognition from the middle of that phoneme. Because the language model for recognition does not describe patterns that begin in the middle of a phoneme, recognition accuracy may degrade around the point where the recognition result is acquired.

To avoid this problem, in the third embodiment, after the recognition result is acquired ((A-2), (B-a-1)), the speech recognition unit 21 is instructed to carry over the previous collation intermediate result and continue recognizing the subsequent speech.
(Fourth embodiment)

In the third embodiment, speech recognition continues from the previous collation intermediate result even after a recognition result has been acquired. Consequently, when the next recognition result is acquired, the recognition result for a section that was already acquired may have changed.

Therefore, in the fourth embodiment, when this happens, the phoneme string collation is performed again from the point at which the recognition result phoneme string changed. Specifically, the processing of (B) above is changed as follows.

(e) Until the next subtitle is acquired, that is, when none of the conditions (b) to (d) applies.
(1) From the speech recognition unit 21, via the recognition result phoneme string storage unit 22, acquire the recognition result phoneme string corresponding to the speech from the beginning of the broadcast content to the current time. Also instruct the speech recognition unit 21 to carry over the collation intermediate result and continue recognizing the subsequent speech.
(2) Compare the recognition result phoneme string used in the previous collation with the recognition result phoneme string of (1) from the beginning, and detect the first phoneme at which they differ.
(3) Collate the subtitle converted phoneme string used in the previous collation with the recognition result phoneme string of (1) according to the above algorithm. Here, the collation intermediate result held in the phoneme string collation result storage unit 26 is carried over, and collation starts from the phoneme detected in (2). The phoneme string collation result storage unit 26 holds the resulting collation intermediate result.
(4) From the collation results up to the current time, find the phoneme in the subtitle converted phoneme string that has the smallest edit distance to the recognition result phoneme string.
(5) Trace the collation history back from the phoneme found in (4) and predict (update) the provisional start time of the subtitle.
(6) Predict (update) the provisional end time of the subtitle according to the following criteria.
(6-1) If the phoneme found in (4) is the final phoneme of the subtitle converted phoneme string, trace back through the collation history to predict the end time of the subtitle and use it as the provisional end time.
(6-2) In cases other than (6-1), predict the provisional end time of the subtitle as the provisional start time plus the display duration of the original subtitle.
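Step (2) of (e), detecting where the new recognition result first departs from the one used previously, can be sketched as below. The function name is hypothetical, and phoneme strings are assumed to be plain lists of phoneme labels.

```python
def first_difference(previous, current):
    """Index of the first phoneme at which the two strings differ.

    If one string is a prefix of the other, the length of the shorter
    string is returned, i.e. the index of the first newly added phoneme.
    """
    for i, (p, q) in enumerate(zip(previous, current)):
        if p != q:
            return i
    return min(len(previous), len(current))
```

Collation results for phonemes before the returned index remain valid and can be carried over; collation is redone only from that index onward.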
(Fifth embodiment)

In the first to fourth embodiments, the recognition result phoneme string from the start of the broadcast content to the current time, the subtitle converted phoneme string, and the collation intermediate results between them are all retained for phoneme string collation. However, collation results from the distant past have little influence on collation at the current time. Therefore, in the fifth embodiment, such information is discarded and not used in the phoneme string collation processing, which reduces the amount of computation and memory usage.

Specifically, any subtitle whose end time determined (finalized) in (B-d-3) is T × α (sec) or more before the current time (α is a predetermined constant) is discarded, together with the corresponding recognition result phoneme string, subtitle converted phoneme string, and collation intermediate results, and is excluded from the phoneme string collation calculation.
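The subtitle-level discard rule can be sketched as follows, assuming each finalized subtitle is kept as a small record with its finalized end time in seconds. The record type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FinalizedSubtitle:
    text: str
    end_time: float   # finalized end time in seconds


def prune_old_subtitles(subtitles, now, delay_t, alpha):
    """Keep only subtitles whose finalized end time is less than
    T x alpha (sec) before the current time."""
    cutoff = now - delay_t * alpha
    return [s for s in subtitles if s.end_time > cutoff]
```

In a full implementation the phoneme strings and collation intermediate results associated with each discarded subtitle would be released at the same time; that bookkeeping is omitted here.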
(Sixth embodiment)

In the fifth embodiment, unnecessary phoneme strings and related data are discarded subtitle by subtitle. In the sixth embodiment, they are discarded phoneme by phoneme.

Specifically, after the end time of a subtitle is finalized in (B-d-3), any recognition result phoneme whose end time is T × α (sec) or more before the current time (α is a predetermined constant) is discarded, together with the subtitle converted phoneme corresponding to it and the collation intermediate results, and is excluded from the phoneme string collation calculation.
(Seventh embodiment)

In the seventh embodiment, the subtitle converted phoneme string generation unit 23 is given a function of inserting a pseudo-phoneme (GB), which represents a sentence break, at each break between subtitles when it generates the subtitle converted phoneme string. When the phoneme string collation unit 25 calculates the edit distance, the edit costs for GB (insertion cost, deletion cost, substitution cost) are set to values smaller than the edit costs for other phonemes, as shown below.

When cph[c] = GB:
Insertion cost: ins_cost(cph[c]) = 0
Deletion cost: del_cost(rph[r]) = α1 (0 < α1 < 1)
Substitution cost: sub_cost(cph[c], rph[r]) = α2 (0 < α2 < 1)

In speech recognition, noise in silent sections such as sentence breaks is often misrecognized and output as a phoneme string. With this function in the subtitle converted phoneme string generation unit 23, a phoneme string produced by misrecognized noise is more likely to be associated with the pseudo-phoneme inserted at the subtitle break than with the subtitle converted phoneme string itself, so the accuracy of the association between speech and subtitles can be improved.
(Eighth embodiment)

In the eighth embodiment, the subtitle converted phoneme string generation unit 23 is given a function of inserting the pseudo-phoneme (GB) not at breaks between subtitles but at positions judged to be sentence breaks within the subtitle text. When the phoneme string collation unit 25 calculates the edit distance, the edit costs for GB are set to values smaller than the edit costs for other phonemes.

A sentence break can be determined by, for example, detecting the full stop "。" and treating it as a sentence boundary, or by analyzing the text to detect sentence boundaries. A method of detecting sentence boundaries by text analysis is described in, for example, Maruyama et al., "Development and Evaluation of the Japanese Clause Boundary Detection Program CBAP", Journal of Natural Language Processing (The Association for Natural Language Processing), July 2004.

Because the number of subtitle characters that can be displayed on one screen is limited, subtitles are not necessarily sent sentence by sentence. With this function in the subtitle converted phoneme string generation unit 23, a phoneme string produced by misrecognized noise is more likely to be associated with the pseudo-phoneme inserted at a sentence break in the subtitle text than with the subtitle converted phoneme string itself, so the accuracy of the association between speech and subtitles can be improved even when subtitles are not sent sentence by sentence.
(Ninth embodiment)

In the ninth embodiment, the subtitle converted phoneme string generation unit 23 is given a function of inserting the pseudo-phoneme (GB) both at breaks between subtitles and at positions judged to be sentence breaks within the subtitle text. When the phoneme string collation unit 25 calculates the edit distance, the edit costs for GB are set to values smaller than the edit costs for other phonemes. With this function, even if a sentence break in the subtitle text is misjudged, the pseudo-phoneme (GB) is still inserted at least at the breaks between subtitles, so the accuracy of the association between speech and subtitles can be improved.
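The three insertion policies of the seventh to ninth embodiments can be compared with a short sketch. It assumes, purely for illustration, that an upstream text-to-phoneme conversion leaves the full stop "。" in each subtitle's token list as a sentence-end marker; the function name and flags are hypothetical.

```python
GB = "GB"          # pseudo-phoneme marking a break
FULL_STOP = "。"   # assumed to survive phoneme conversion as a marker token


def build_subtitle_phoneme_string(subtitles,
                                  at_subtitle_breaks=True,
                                  at_sentence_breaks=True):
    """Concatenate per-subtitle phoneme lists, inserting GB at breaks.

    A GB is never inserted twice in a row, so a sentence that ends exactly
    at a subtitle break yields a single GB.
    """
    out = []
    for tokens in subtitles:
        for tok in tokens:
            if tok == FULL_STOP:
                if at_sentence_breaks and out and out[-1] != GB:
                    out.append(GB)
            else:
                out.append(tok)
        if at_subtitle_breaks and out and out[-1] != GB:
            out.append(GB)
    return out
```

Setting only at_subtitle_breaks corresponds to the seventh embodiment, only at_sentence_breaks to the eighth, and both to the ninth. Sentence boundary detection by text analysis is not covered by this sketch.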
(Tenth embodiment)

In the tenth embodiment, the pseudo-phoneme (GB) is inserted according to any of the seventh to ninth embodiments. In addition, the speech recognition unit 21 is given a function of inserting a pseudo-phoneme (Q) representing silence at the corresponding position in the recognition result phoneme string when, during generation of that string, a section regarded as silence by the speech detection unit 31 and the speech collation unit 35 continues for a predetermined time or longer. When the phoneme string collation unit 25 calculates the edit distance, the edit cost of substituting the silence pseudo-phoneme (Q) with the pseudo-phoneme (GB) is set to 0 or a small value, and the edit cost of substituting the pseudo-phoneme (Q) with any other phoneme is set to a value larger than the other edit costs. For example, the edit costs are set as shown below, where β1 >> 1, β2 >> 1, 0 < α1 < 1, and 0 < α2 < 1.

When cph[c] ≠ GB:
Insertion cost: ins_cost(cph[c]) = 1 (always)
Deletion cost: del_cost(rph[r]) = 1 (if rph[r] ≠ Q)
                               = β1 (if rph[r] = Q)
Substitution cost: sub_cost(cph[c], rph[r]) = 0 (if cph[c] = rph[r])
                                            = 1 (if cph[c] ≠ rph[r] and rph[r] ≠ Q)
                                            = β2 (if cph[c] ≠ rph[r] and rph[r] = Q)

When cph[c] = GB:
Insertion cost: ins_cost(cph[c]) = 0
Deletion cost: del_cost(rph[r]) = α1
Substitution cost: sub_cost(cph[c], rph[r]) = 0 (if rph[r] = Q)
                                            = α2 (if rph[r] ≠ Q)
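The cost settings above can be plugged into a weighted edit distance as sketched below. The numeric values of α1, α2, β1, and β2 are arbitrary choices within the stated ranges. The recurrence and the choice of which subtitle phoneme governs a deletion (here, the phoneme of the current column, with the non-GB branch used before the first column) are assumptions made for this sketch.

```python
GB, Q = "GB", "Q"
ALPHA1, ALPHA2 = 0.5, 0.5    # assumed values, 0 < alpha < 1
BETA1, BETA2 = 10.0, 10.0    # assumed values, beta >> 1


def ins_cost(c):
    return 0.0 if c == GB else 1.0


def del_cost(c, r):
    if c == GB:
        return ALPHA1
    return BETA1 if r == Q else 1.0


def sub_cost(c, r):
    if c == GB:
        return 0.0 if r == Q else ALPHA2
    if c == r:
        return 0.0
    return BETA2 if r == Q else 1.0


def weighted_edit_distance(rph, cph):
    """Distance between recognized phonemes rph and subtitle phonemes cph."""
    rows, cols = len(rph), len(cph)
    d = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for c in range(1, cols + 1):
        d[0][c] = d[0][c - 1] + ins_cost(cph[c - 1])
    for r in range(1, rows + 1):
        d[r][0] = d[r - 1][0] + del_cost(None, rph[r - 1])
        for c in range(1, cols + 1):
            d[r][c] = min(
                d[r - 1][c] + del_cost(cph[c - 1], rph[r - 1]),
                d[r][c - 1] + ins_cost(cph[c - 1]),
                d[r - 1][c - 1] + sub_cost(cph[c - 1], rph[r - 1]),
            )
    return d[rows][cols]
```

With these costs, a recognized silence (Q) aligns with GB at no cost, a noise phoneme at a break is absorbed cheaply by GB, and aligning Q with any ordinary phoneme is heavily penalized.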

When the phoneme string collation unit 25 associates the recognition result phoneme string with the subtitle converted phoneme string, it may wrongly associate, for example, the leading phonemes of one subtitle converted phoneme string with the trailing phonemes of the preceding utterance, which causes a large error in the estimated shift width. In the tenth embodiment, however, silent portions of the speech are more likely to be associated with the positions judged to be breaks between subtitles or sentence breaks within the subtitle text, so the accuracy of the association between speech and subtitles can be improved.
(Eleventh embodiment)

The eleventh embodiment modifies the edit distance calculation in the phoneme string collation unit 25 of the first to tenth embodiments to reduce the amount of computation needed for shift width estimation. Because the subtitle converted phoneme string, in which the phoneme strings of the individual subtitles are concatenated, contains many phonemes, reducing the computation needed to collate it with the recognition result phoneme string is effective.

When the phoneme string collation unit 25 calculates the edit distance, it first collates the first r phonemes of the recognition result phoneme string with the first n phonemes of the subtitle converted phoneme string (n = 1 to C, where C is the total number of phonemes in the subtitle converted phoneme string) and calculates the edit distances between them. It then selects the N edit distances with the smallest values.

Then, when calculating the edit distances between the first r+1 phonemes of the recognition result phoneme string and the subtitle converted phoneme string, it calculates, among the first n phonemes (n = 1 to C) of the subtitle converted phoneme string, only those edit distances that can be computed from the N edit distances selected earlier, and does not calculate the others. This processing is repeated to obtain the edit distance between the recognition result phoneme string and the subtitle converted phoneme string.

FIG. 11 is an explanatory diagram showing the edit distance calculation process in the eleventh embodiment. As shown in the figure, the edit distances between the first r phonemes of the recognition result phoneme string and the first n phonemes (n = 1 to C) of the subtitle converted phoneme string are calculated first, giving the edit distances d[r,1], d[r,2], ..., d[r,C]. Next, the N smallest of these edit distances, d[r,c-1], d[r,c-2], ..., d[r,c-N], are selected. Then, when the edit distances between the first r+1 phonemes of the recognition result phoneme string and the subtitle converted phoneme string are calculated, only those that can be computed from d[r,c-1], d[r,c-2], ..., d[r,c-N] are calculated. The computable edit distances are d[r+1,c-1], d[r+1,c-2], ..., d[r+1,c-N], so the calculation of d[r+1,1], d[r+1,2], ..., d[r+1,c-0] can be omitted. This processing is repeated while increasing the number of leading phonemes of the recognition result phoneme string one at a time, until the edit distance d[R,C] has been calculated.
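The pruning idea can be sketched as a beam-limited edit distance. This is an illustrative reading, not the patented procedure itself: it assumes unit costs, treats a cell in the next row as "computable" when it lies in a retained column or the column immediately after one, and uses hypothetical function names.

```python
import math


def keep_best(row, n_best):
    """Keep the n_best columns with the smallest edit distances."""
    return dict(sorted(row.items(), key=lambda kv: kv[1])[:n_best])


def topn_pruned_edit_distance(rph, cph, n_best):
    """Edit distance computed only through the retained cells of each row.

    Returns math.inf if the final column was pruned away.
    """
    total = len(cph)
    # Row 0: empty recognition prefix against each subtitle prefix.
    row = keep_best({c: c for c in range(total + 1)}, n_best)
    for r_ph in rph:
        candidates = sorted({c for k in row for c in (k, k + 1) if c <= total})
        new = {}
        for c in candidates:
            best = math.inf
            if c in row:          # delete the recognized phoneme
                best = row[c] + 1
            if c - 1 in row:      # match or substitute
                best = min(best, row[c - 1] + (r_ph != cph[c - 1]))
            if c - 1 in new:      # insert the subtitle phoneme
                best = min(best, new[c - 1] + 1)
            new[c] = best
        row = keep_best(new, n_best)
    return row.get(total, math.inf)
```

When n_best is at least C + 1 nothing is pruned and the result equals the ordinary edit distance. With a smaller n_best, at most 2 × n_best cells are evaluated per row, and the result can only be equal to or larger than the exact value.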

According to the eleventh embodiment, some of the edit distances for the combinations of the recognition result phoneme string and the subtitle converted phoneme string no longer need to be calculated, so the shift width can be estimated with less computation.
(Twelfth embodiment)

The twelfth embodiment also modifies the edit distance calculation in the phoneme string collation unit 25 of the first to tenth embodiments to reduce the amount of computation needed for shift width estimation.

When the phoneme string collation unit 25 calculates the edit distance, it first collates the first r phonemes of the recognition result phoneme string with the first n phonemes (n = 1 to C) of the subtitle converted phoneme string and calculates the edit distances between them. It then selects the edit distance with the minimum value, together with every edit distance whose difference from that minimum falls within a predetermined threshold. This selection method is what differs from the eleventh embodiment.

Then, when calculating the edit distances between the first r+1 phonemes of the recognition result phoneme string and the subtitle converted phoneme string, it calculates, among the first n phonemes (n = 1 to C) of the subtitle converted phoneme string, only those edit distances that can be computed from the edit distances selected earlier, and does not calculate the others. This processing is repeated to obtain the edit distance between the recognition result phoneme string and the subtitle converted phoneme string.
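Only the selection step differs from the eleventh embodiment, and it can be sketched as follows. The function name is hypothetical, and a row is assumed to be a mapping from column index to edit distance.

```python
def select_within_threshold(row, threshold):
    """Keep the minimum edit distance and every distance within
    `threshold` of that minimum."""
    best = min(row.values())
    return {c: dist for c, dist in row.items() if dist - best <= threshold}
```

Unlike a fixed top-N selection, the number of retained cells varies from row to row: few when one alignment clearly dominates, more when several are nearly tied.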

According to the twelfth embodiment, some of the edit distances for the combinations of the recognition result phoneme string and the subtitle converted phoneme string no longer need to be calculated, so the shift width can be estimated with less computation.
(Thirteenth embodiment)

The thirteenth embodiment also modifies the edit distance calculation in the phoneme string collation unit 25 of the first to tenth embodiments to reduce the amount of computation needed for shift width estimation.

When the phoneme string collation unit 25 calculates the edit distance, it first collates the first r phonemes of the recognition result phoneme string with the first n phonemes (n = 1 to C) of the subtitle converted phoneme string and calculates the edit distances between them. It then selects the first m phonemes for which the edit distance is the minimum.

Then, when calculating the edit distances between the first r+1 phonemes of the recognition result phoneme string and the subtitle converted phoneme string, it considers the edit distances between the first r+1 phonemes of the recognition result phoneme string and the first m-N through first m+N phonemes of the subtitle converted phoneme string (N is a fixed value). Of these, it calculates only those that can be computed from the already calculated edit distances for the first r phonemes, and does not calculate the others. This processing is repeated to obtain the edit distance between the recognition result phoneme string and the subtitle converted phoneme string.

FIG. 12 is an explanatory diagram showing the edit distance calculation process according to the thirteenth embodiment. The figure shows the case where, when the first r phonemes of the recognition result phoneme string were collated with the first n phonemes (n = 1 to C) of the subtitle converted phoneme string, the edit distances between the first r phonemes of the recognition result phoneme string and the first C-1 through C-5 phonemes of the subtitle converted phoneme string have been calculated. Suppose that the edit distance between the first r phonemes of the recognition result phoneme string and the first C-2 phonemes of the subtitle converted phoneme string is the minimum, and that N = 2 (two phonemes on either side). Then, when the edit distances between the first r+1 phonemes of the recognition result phoneme string and the subtitle converted phoneme string are calculated, the candidates are the first C-0 through C-4 phonemes of the subtitle converted phoneme string. However, because the edit distance between the first r phonemes of the recognition result phoneme string and the first C-0 phonemes of the subtitle converted phoneme string has not been calculated, the edit distances actually calculated are those between the first r+1 phonemes of the recognition result phoneme string and the first C-1 through C-4 phonemes of the subtitle converted phoneme string.
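The column selection in this example can be reproduced with a short sketch. The function name and the representation of the previously computed columns as an inclusive range are assumptions made for illustration.

```python
def band_columns(m, n, computed_lo, computed_hi):
    """Columns to evaluate for the next row.

    m: column with the minimum edit distance in the current row.
    n: half-width of the band (N in the text).
    computed_lo, computed_hi: inclusive range of columns for which an
    edit distance exists in the current row.
    """
    lo = max(m - n, computed_lo)
    hi = min(m + n, computed_hi)
    return list(range(lo, hi + 1))
```

Taking C = 10 as an arbitrary example, the current row covers columns C-5 to C-1 (5 to 9) and the minimum is at C-2 (8). With N = 2 the band is C-4 to C-0 (6 to 10), and intersecting it with the computed columns leaves C-4 to C-1, as in the description of FIG. 12.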

According to the thirteenth embodiment, some of the edit distances for the combinations of the recognition result phoneme string and the subtitle converted phoneme string no longer need to be calculated, so the shift width can be estimated with less computation.

Embodiments have been described above, but the present invention is not limited to them. The present invention can also be realized as a playback device that plays back subtitles whose time shift has been corrected together with the audio and video in the received broadcast content, or as a playback device that searches for and plays back a specific video or audio portion of the broadcast content. For example, subtitles whose time shift has been corrected by the subtitle shift correction device of any of the above embodiments can be stored, and the text of the stored subtitles can be used to search for and play back the video or audio portion that matches an input keyword. Because the time shift between the video and audio and the subtitles has been corrected with respect to the information content, the desired video or audio portion can be correctly found from the subtitle text and played back.
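The keyword search over stored, corrected subtitles can be sketched as below. The record type, field names, and sample data are hypothetical; times are assumed to be the corrected start and end times in seconds.

```python
from dataclasses import dataclass


@dataclass
class CorrectedSubtitle:
    text: str
    start: float   # corrected start time in seconds
    end: float     # corrected end time in seconds


def find_segments(subtitles, keyword):
    """Return (start, end) of every stored subtitle containing the keyword."""
    return [(s.start, s.end) for s in subtitles if keyword in s.text]
```

A playback device could seek to each returned start time; because the times are the corrected ones, the segment played should contain the speech that the subtitle transcribes.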

11: information separation unit, 12: caption start/end time determination unit, 13: shift correction unit, 21: speech recognition unit, 22: recognition result phoneme sequence storage unit, 23: caption conversion phoneme sequence generation unit, 24: caption conversion phoneme sequence storage unit, 25: phoneme sequence collation unit, 26: phoneme sequence collation result storage unit, 31: speech detection unit, 32: acoustic analysis unit, 33: acoustic model storage unit, 34: language model storage unit, 35: speech collation unit

Claims

A caption shift correction device comprising:
a speech recognition unit that, while broadcast content is being received, recognizes speech in the received broadcast content and generates a recognition result phoneme sequence corresponding to the speech;
a caption conversion phoneme sequence generation unit that generates a phoneme sequence corresponding to each caption in the video of the broadcast content and generates a caption conversion phoneme sequence by concatenating those phoneme sequences;
a phoneme sequence collation unit that determines caption start and end times by associating captions with speech based on the edit distance between the recognition result phoneme sequence generated by the speech recognition unit and the caption conversion phoneme sequence generated by the caption conversion phoneme sequence generation unit; and
a shift correction unit that corrects the time shift between the speech and the captions based on the caption start and end times determined by the phoneme sequence collation unit,
wherein the phoneme sequence collation unit, upon receiving a caption, associates the caption with the speech preceding reception of the caption and predicts from the result the start and end times of the caption after time shift correction, and thereafter, until the predicted start and end times of the caption, repeats at regular time intervals a process of associating captions with speech using the speech and caption information in newly received broadcast content and predicting the start and end times of the caption after time shift correction.

The caption shift correction device according to claim 1, wherein the phoneme sequence collation unit predicts the end time of the caption after time shift correction based on the result of associating the caption with the speech preceding reception of the caption, the predicted start time of the caption, and the caption display time acquired when the broadcast content is received.

The caption shift correction device according to claim 1, wherein the phoneme sequence collation unit predicts the end time of the caption after time shift correction based on the result of associating the caption with the speech preceding reception of the caption, the predicted start time of the caption, and a predicted length of the speech corresponding to the caption, estimated from the caption character string.

The caption shift correction device according to any one of claims 1 to 3, wherein the phoneme sequence collation unit stores the recognition result phoneme sequence used when predicting the start and end times of the caption after time shift correction, and, when repeating the process of predicting the start and end times of the caption after time shift correction, generates a recognition result phoneme sequence by combining the recognition result phoneme sequence generated by the speech recognition unit from the time of the previous process to the current time with the stored recognition result phoneme sequence, and predicts the end time of the caption after time shift correction based on the edit distance between the resulting recognition result phoneme sequence and the caption conversion phoneme sequence.

The recognition result phoneme sequence used when the phoneme sequence collation unit predicts the start and end times of the subtitles after the time lag correction, the recognition intermediate result recognized by the speech recognition unit, and the collation intermediate collated by the phoneme sequence collation unit The speech recognition unit saves the result, and the speech recognition unit takes over the recognition intermediate result saved during the previous process and repeats the previous process when the process of predicting the start and end times of the caption after the time lag correction is repeated. The recognition result phoneme string is generated by recognizing the speech from the performed time to the current time and generating a recognition result phoneme string, and the recognition result phoneme string is combined with the stored recognition result phoneme string. 4. The subtitle end time after time offset correction is predicted using the generated recognition result phoneme string and the collation intermediate result saved during the previous processing. Kanji characters Deviation correction device.

The recognition result phoneme sequence used when the phoneme sequence collation unit predicts the start and end times of the subtitles after the time lag correction, the recognition intermediate result recognized by the speech recognition unit, and the collation intermediate collated by the phoneme sequence collation unit The speech recognition unit saves the result, and the speech recognition unit takes over the recognition intermediate result saved during the previous process and repeats the previous process when the process of predicting the start and end times of the caption after the time lag correction is repeated. Recognizing the speech from the performed time to the current time to generate a recognition result phoneme sequence, the phoneme sequence collation unit has recognized the recognition result phoneme sequence and the speech from the time when the recognition was first started to the current time The recognition result phoneme sequence is compared, and the edit distance between the recognition result phoneme sequence and the subtitle conversion phoneme sequence is calculated using different stored phonemes based on the edited distance. And subtitle time gap Subtitles displacement correction device according to any one of claims 1 and estimates 3.

The caption shift correction device according to claim 5, wherein, after the shift correction unit corrects the time shift width between the speech and a caption, the caption corresponding to the speech section before the corrected time, the recognition result phoneme sequence and caption conversion phoneme sequence corresponding to that caption, and the collation intermediate results are discarded.

The caption shift correction device according to claim 5 or 6, wherein, after the shift correction unit corrects the time shift width between the speech and a caption, the recognition result phoneme sequence, the caption conversion phoneme sequence, and the collation intermediate results corresponding to the speech section a predetermined time or more before the corrected time are discarded.

The edit distance is defined by using as an index a cost corresponding to the number of steps required to convert from one phoneme string to another by inserting, deleting, or replacing a phoneme. 9. The caption shift correction device according to any one of claims 8 to 10.

The caption shift correction device according to claim 1, wherein the edit distance is defined using, as an index, the cost required to substitute one phoneme for another, the cost required to insert a phoneme, and the cost required to delete a phoneme, each determined based on the speech recognition performance for each phoneme.

The caption shift correction device according to claim 9, wherein the caption conversion phoneme sequence generation unit generates a caption conversion phoneme sequence in which a pseudo-phoneme representing a sentence break is added at each caption break, and the phoneme sequence collation unit calculates the edit distance by giving the pseudo-phoneme a smaller cost than other phonemes.

The caption shift correction device according to claim 9 or 10, wherein the caption conversion phoneme sequence generation unit generates a caption conversion phoneme sequence in which a pseudo-phoneme representing a sentence break is added at each sentence break obtained by analyzing the captions, and the phoneme sequence collation unit calculates the edit distance by giving the pseudo-phoneme a smaller cost than other phonemes.

The caption shift correction device according to claim 9, wherein the caption conversion phoneme sequence generation unit generates a caption conversion phoneme sequence in which a pseudo-phoneme representing a sentence break is added at each caption break and at each sentence break obtained by analyzing the captions, and the phoneme sequence collation unit calculates the edit distance by giving the pseudo-phoneme a smaller cost than other phonemes.

The caption shift correction device according to claim 9, wherein the caption conversion phoneme sequence generation unit generates a caption conversion phoneme sequence in which a pseudo-phoneme representing a sentence break is added at each sentence break obtained by analyzing the captions, the speech recognition unit generates a recognition result phoneme sequence in which a pseudo-phoneme representing silence is added wherever silence continues for a certain period of time, and the phoneme sequence collation unit calculates the edit distance by setting the cost between the pseudo-phoneme representing silence and the pseudo-phoneme representing a sentence break to 0 or a small value, and setting the cost between these pseudo-phonemes and other phonemes to a value larger than the other costs.

The caption shift correction device according to claim 1, wherein the phoneme sequence collation unit calculates the edit distances between the first r phonemes of the recognition result phoneme sequence and the first n phonemes of the caption conversion phoneme sequence (n = 1 to C, where C is the total number of phonemes in the caption conversion phoneme sequence), selects the N smallest of those edit distances, and, when calculating the edit distances between the first r+1 phonemes of the recognition result phoneme sequence and the caption conversion phoneme sequence, calculates the edit distance only for those of the first n phonemes of the caption conversion phoneme sequence (n = 1 to C) for which it can be calculated using the previously selected N edit distances and does not calculate the others, and repeats this process to calculate the edit distance between the recognition result phoneme sequence and the caption conversion phoneme sequence.

The caption shift correction device according to claim 1, wherein the phoneme sequence collation unit calculates the edit distances between the first r phonemes of the recognition result phoneme sequence and the first n phonemes of the caption conversion phoneme sequence (n = 1 to C, where C is the total number of phonemes in the caption conversion phoneme sequence), selects the minimum of those edit distances together with those whose difference from the minimum is within a predetermined threshold, and, when calculating the edit distances between the first r+1 phonemes of the recognition result phoneme sequence and the caption conversion phoneme sequence, calculates the edit distance only for those of the first n phonemes of the caption conversion phoneme sequence (n = 1 to C) for which it can be calculated using the previously selected edit distances and does not calculate the others, and repeats this process to calculate the edit distance between the recognition result phoneme sequence and the caption conversion phoneme sequence.

The caption shift correction device according to claim 1, wherein the phoneme sequence collation unit calculates the edit distances between the first r phonemes of the recognition result phoneme sequence and the first n phonemes of the caption conversion phoneme sequence (n = 1 to C, where C is the total number of phonemes in the caption conversion phoneme sequence), selects the first m phonemes of the caption conversion phoneme sequence for which the edit distance is smallest, and, when calculating the edit distances between the first r+1 phonemes of the recognition result phoneme sequence and the caption conversion phoneme sequence, calculates the edit distance only for those combinations of the first r+1 phonemes of the recognition result phoneme sequence with the first m-N through m+N phonemes of the caption conversion phoneme sequence (N being a constant) that can be calculated using the already calculated edit distances with the first r phonemes and does not calculate the others, and repeats this process to calculate the edit distance between the recognition result phoneme sequence and the caption conversion phoneme sequence.

A playback device comprising:
the caption shift correction device according to any one of claims 1 to 17; and
playback means for playing back captions whose time shift has been corrected by the caption shift correction device, together with the audio and video in the received broadcast content.

A playback device comprising:
the caption shift correction device according to any one of claims 1 to 17;
caption storage means for storing captions whose time shift relative to the audio and video has been corrected by the caption shift correction device; and
search means for searching, based on the text information in the captions stored in the caption storage means, for the portion of video that matches an input keyword,
wherein the playback device plays back the portion of video found by the search means.

A broadcast device comprising:
the caption shift correction device according to any one of claims 1 to 17; and
transmission means for transmitting, as a broadcast program, the audio, the video, and the captions whose time shift has been corrected by the caption shift correction device.
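The cost scheme recited in the edit-distance claims above (separate substitution, insertion, and deletion costs, with cheap pseudo-phonemes for sentence breaks and silence) can be illustrated with the sketch below. All names and numeric costs are assumptions chosen for illustration. The claims only state that pseudo-phonemes receive a smaller cost, that silence against a sentence break costs 0 or a small value, and that a pseudo-phoneme against an ordinary phoneme costs more. Costs derived from per-phoneme recognition performance are not modelled here.

```python
BREAK = "<brk>"    # hypothetical pseudo-phoneme for a sentence break (caption side)
SILENCE = "<sil>"  # hypothetical pseudo-phoneme for silence (recognition side)
PSEUDO = {BREAK, SILENCE}

def sub_cost(a, b):
    """Substitution cost; the numeric values are illustrative assumptions."""
    if a == b or {a, b} == PSEUDO:
        return 0.0  # identical phonemes, or silence aligned with a sentence break
    if a in PSEUDO or b in PSEUDO:
        return 5.0  # pseudo-phoneme against an ordinary phoneme: large cost
    return 1.0

def gap_cost(p):
    """Insertion/deletion cost; pseudo-phonemes are cheap to skip."""
    return 0.1 if p in PSEUDO else 1.0

def weighted_edit_distance(recognized, caption):
    """Full dynamic-programming edit distance with the costs above."""
    rows, cols = len(recognized), len(caption)
    d = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        d[i][0] = d[i - 1][0] + gap_cost(recognized[i - 1])
    for j in range(1, cols + 1):
        d[0][j] = d[0][j - 1] + gap_cost(caption[j - 1])
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d[i][j] = min(
                d[i - 1][j] + gap_cost(recognized[i - 1]),  # drop recognized phoneme
                d[i][j - 1] + gap_cost(caption[j - 1]),     # skip caption phoneme
                d[i - 1][j - 1] + sub_cost(recognized[i - 1], caption[j - 1]),
            )
    return d[rows][cols]

# Hypothetical phoneme sequences: a pause in the speech lines up with a sentence break.
distance = weighted_edit_distance(["a", "b", SILENCE, "c"], ["a", "b", BREAK, "c"])
```

Making the break and silence symbols cheap to skip, and free to align with each other, lets pauses in the audio anchor the alignment at sentence boundaries without penalising captions whose breaks have no audible pause.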