JP2009088990A

JP2009088990A - Reception apparatus, television broadcast playback method, and television broadcast playback program

Info

Publication number: JP2009088990A
Application number: JP2007255866A
Authority: JP
Inventors: Shuji Ogasawara; 修司小笠原
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2007-09-28
Filing date: 2007-09-28
Publication date: 2009-04-23

Abstract

PROBLEM TO BE SOLVED: To display captions in synchronization with the video and audio even in a case where information for displaying captions is not contained in television broadcast.

SOLUTION: A television receiver includes: a playback unit 41 which receives a broadcast wave of television broadcast containing video signals, audio signals and caption information and delays by a predetermined time and plays back the video signals and the audio signals contained in the broadcast wave of the television broadcast; a character recognizing unit 23 which extracts character information from the video signals; an audio recognizing unit 25 which extracts character information from the audio signals; and a determination unit 27 which determines timing to display the caption information according to a correlation of the caption information and the extracted character information.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a receiving apparatus, a television broadcast reproduction method, and a television broadcast reproduction program, and more particularly to a receiving apparatus that receives a television broadcast including subtitle information, and to a television broadcast reproduction method and a television broadcast reproduction program for reproducing that television broadcast.

In recent years, live programs in television broadcasting, such as news and sports programs, are sometimes provided with subtitle information that represents the audio as text. Because the subtitle information is created on the broadcasting station side, for example by speech recognition of the program audio, it is added to the broadcast signal several seconds to several tens of seconds later than the corresponding video.

To address this problem, techniques are conventionally known in which, in order to synchronize subtitles with video, the broadcasting station adds to the broadcast signal the delay time of the subtitle character string relative to the video, and the receiving side reproduces the video with a predetermined delay while displaying the subtitle character string at a timing determined from that delay time (for example, Patent Documents 1 to 3).

These conventional techniques have the problem that the receiving side cannot synchronize the subtitles with the video unless the broadcasting station transmits a broadcast signal that includes the time difference between the video and the subtitles.

JP 2004-207821 A
JP 2006-211636 A
JP 2006-324779 A

The present invention has been made to solve the problem described above, and one object of the invention is to provide a receiving apparatus capable of displaying subtitles in synchronization with video and audio even when the television broadcast does not include information for displaying the subtitles.

Another object of the present invention is to provide a television broadcast reproduction method and a television broadcast reproduction program capable of displaying subtitles in synchronization with video and audio even when the television broadcast does not include information for displaying the subtitles.

To achieve the object described above, according to one aspect of the present invention, a receiving apparatus is a receiving apparatus that receives a broadcast wave of a television broadcast including a video signal, an audio signal, and subtitle information, and includes: reproduction means for reproducing the video signal and the audio signal included in the broadcast wave of the television broadcast with a predetermined delay; character information extracting means for extracting character information from at least one of the video signal and the audio signal; and determining means for determining the timing for displaying the subtitle information based on the correlation between the subtitle information and the extracted character information.

According to this aspect, character information is extracted from at least one of the video signal and the audio signal, and the timing for displaying the subtitle information is determined based on the correlation between the subtitle information and the character information. The subtitle information can therefore be synchronized with the video and audio of the television broadcast, which are reproduced with a predetermined delay. As a result, it is possible to provide a receiving apparatus capable of displaying subtitles in synchronization with video and audio even when the television broadcast does not include information for displaying the subtitles.

Preferably, the character information extracting means includes speech recognition means for outputting, as character information, a character string recognized by speech recognition of the audio signal.

Preferably, the receiving apparatus further includes display means for displaying the character string when the determining means cannot determine the timing for displaying the subtitle information based on the correlation between the subtitle information and the character string output by the speech recognition means.

According to this aspect, a character string that is included in the audio signal but not in the subtitle information can be displayed.

Preferably, the character information extracting means further includes character recognition means for outputting, as character information, a character string recognized by character recognition of the video signal, and when the determining means cannot determine the timing for displaying the subtitle information based on the correlation between the subtitle information and the character string output by the speech recognition means, it determines that timing based on the correlation between the subtitle information and the character string output by the character recognition means.

According to this aspect, a character string included in the subtitle information can be displayed in synchronization with the video.

Preferably, the character information extracting means includes character recognition means for outputting, as character information, a character string recognized by character recognition of the video signal.

According to this aspect, a character string that is included in the video signal but not in the subtitle information can be displayed.

Preferably, the character information extracting means includes vowel recognition means for outputting a sequence of vowels as character information by recognizing the movement and shape of lips included in the video signal.

According to another aspect of the present invention, a television broadcast reproduction method includes the steps of: receiving a broadcast wave of a television broadcast including a video signal, an audio signal, and subtitle information; reproducing the video signal and the audio signal included in the broadcast wave of the television broadcast with a predetermined delay; extracting character information from at least one of the video signal and the audio signal; and determining the timing for displaying the subtitle information based on the correlation between the subtitle information and the extracted character information.

According to this aspect, it is possible to provide a television broadcast reproduction method capable of displaying subtitles in synchronization with video and audio even when the television broadcast does not include information for displaying the subtitles.

According to still another aspect of the present invention, a television broadcast reproduction program causes a computer to execute the steps of: receiving a broadcast wave of a television broadcast including a video signal, an audio signal, and subtitle information; reproducing the video signal and the audio signal included in the broadcast wave of the television broadcast with a predetermined delay; extracting character information from at least one of the video signal and the audio signal; and determining the timing for displaying the subtitle information based on the correlation between the subtitle information and the extracted character information.

According to this aspect, it is possible to provide a television broadcast reproduction program capable of displaying subtitles in synchronization with video and audio even when the television broadcast does not include information for displaying the subtitles.

Embodiments of the present invention are described below with reference to the drawings. In the following description, identical parts are denoted by identical reference numerals, and their names and functions are also the same. Detailed descriptions of them are therefore not repeated.

FIG. 1 is a functional block diagram outlining the functions of a television receiver in one embodiment of the present invention. Referring to FIG. 1, a television receiver 1 serving as the receiving apparatus includes a tuner 10 that receives a broadcast wave of a television broadcast and outputs a broadcast signal, a control unit 20 for processing the broadcast signal, a first buffer memory 31 that temporarily stores subtitles, a second buffer memory 33 that temporarily stores the video signal and the audio signal, and a reproduction unit 41 that reproduces the broadcast signal.

The tuner 10 selects and receives a broadcast wave and outputs a broadcast signal including a video signal, an audio signal, and a subtitle signal to the control unit 20. The tuner 10 may be a tuner for analog broadcasting or a tuner for digital broadcasting; a tuner suited to the type of broadcast wave is used.

The control unit 20 includes a separation unit 21 that separates the video signal, the audio signal, and the subtitle signal; a character recognition unit 23 to which the video signal is input; a speech recognition unit 25 to which the audio signal is input; a determination unit 27 to which the subtitle signal is input and which determines the timing for displaying subtitles; and a reproduction control unit 29 for controlling the reproduction unit 41.

The broadcast signal includes an audio signal, a video signal, and a subtitle signal. The separation unit 21 separates the broadcast signal into the audio signal, the video signal, and the subtitle signal, outputs the video signal to the character recognition unit 23, outputs the audio signal to the speech recognition unit 25, outputs the subtitle signal to the determination unit 27, and stores the audio signal and the video signal in the second buffer memory 33. The subtitle signal includes character strings.

Although the separation unit 21 is described here as storing the audio signal and the video signal in the second buffer memory 33, the broadcast signal received by the tuner may instead be stored in the second buffer memory 33 as it is, and in the case of digital broadcasting a partial TS may be stored.

The character recognition unit 23 performs character recognition on the video included in the video signal input from the separation unit 21. Character recognition may be performed on every frame included in the video signal, but this is not required. For example, frames may be extracted at predetermined time intervals, or frames in which the video changes sharply may be extracted, and character recognition may be performed on each extracted frame image. Character recognition from an image (video) identifies a character region in the image and determines the characters by pattern matching on the identified region. A widely known technique such as OCR may be used for the character recognition. The character recognition unit 23 numbers the extracted character strings in the order in which they are extracted from the video signal. Here, the i-th character string extracted by the character recognition unit 23 is referred to as video character string VS(i), where i is a positive integer.

When the character recognition unit 23 extracts a video character string VS(i) by character recognition of a frame, it outputs to the determination unit 27 a pair consisting of the video character string VS(i) and the video time VT(i) at which the frame from which VS(i) was extracted is reproduced. For the sake of explanation, times are expressed here as relative times. A relative time takes the time at which reproduction of the broadcast signal starts as 0 and indicates the elapsed time from that point. The video time VT(i) includes at least the time (start time) at which the first frame containing the video character string VS(i) is reproduced.

The speech recognition unit 25 performs speech recognition on the audio included in the audio signal input from the separation unit 21. The speech recognition unit 25 recognizes the audio in each period bounded by silent periods and extracts a character string. As the speech recognition method, for example, a method that uses an acoustic (phoneme) model having acoustic features and a language model having linguistic features is used. Hidden Markov models (HMMs) are widely used as acoustic models, and HTK is well known as a tool for creating HMMs. Julius is known as an open-source large-vocabulary continuous speech recognition engine that uses HTK. The speech recognition method is not limited to these, and any conventionally known method may be used. The speech recognition unit 25 numbers the extracted character strings in the order in which they are extracted from the audio signal. Here, the j-th character string extracted by the speech recognition unit 25 is referred to as audio character string AS(j), where j is a positive integer.

When the speech recognition unit 25 extracts an audio character string AS(j), it outputs to the determination unit 27 a pair consisting of the audio character string AS(j) and the audio time AT(j) at which the audio from which AS(j) was extracted is reproduced. The audio time AT(j) includes the start time at which reproduction of that audio begins and the end time at which its reproduction ends.
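The segmentation into periods bounded by silence, each carrying a start and end time as AT(j) does, can be sketched as follows. This is a minimal illustration and not the method of the specification: the amplitude threshold `silence_level`, the run length `min_silence`, and the function name are all hypothetical, and the audio is assumed to be a plain list of sample amplitudes.

```python
from typing import List, Tuple


def split_on_silence(samples: List[float], sample_rate: int,
                     silence_level: float = 0.01,
                     min_silence: int = 3) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of stretches bounded by silence.

    A stretch ends once `min_silence` consecutive samples stay at or
    below `silence_level` (both values are illustrative assumptions).
    """
    segments = []
    start = None       # index of the first loud sample of the current stretch
    last_loud = None   # index of the most recent loud sample
    quiet_run = 0      # number of consecutive quiet samples seen so far
    for idx, value in enumerate(samples):
        if abs(value) > silence_level:
            if start is None:
                start = idx
            last_loud = idx
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence:
                segments.append((start / sample_rate,
                                 (last_loud + 1) / sample_rate))
                start = None
                quiet_run = 0
    if start is not None:  # audio ended while a stretch was still open
        segments.append((start / sample_rate, (last_loud + 1) / sample_rate))
    return segments
```

Each returned pair would be handed to a recognizer together with the samples it covers; the recognizer itself is outside the scope of this sketch.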

The determination unit 27 receives pairs of video character string VS(i) and video time VT(i) from the character recognition unit 23, pairs of audio character string AS(j) and audio time AT(j) from the speech recognition unit 25, and the subtitle signal from the separation unit 21. The determination unit 27 numbers the character strings included in the subtitle signal in the order in which they are input. Here, the k-th input character string is referred to as subtitle character string SS(k), where k is a positive integer. The time at which the subtitle character string SS(k) is input is provisionally set as the subtitle time ST(k).

The determination unit 27 compares each subtitle character string SS(k), in the order of input, with at least one audio character string AS(j) or video character string VS(i), and determines the subtitle time ST(k) at which SS(k) is to be reproduced. The determination unit 27 stores each pair of subtitle character string SS(k) and the subtitle time ST(k) determined for it in the first buffer memory 31.

More specifically, the determination unit 27 compares each subtitle character string SS(k), in the order of input, with at least one audio character string AS(j) and determines one that has a predetermined correlation with it. For example, if an audio character string AS(J) having the predetermined correlation with subtitle character string SS(K) is found among the audio character strings AS(j), the subtitle time ST(K) at which SS(K) is reproduced is set to the same value as the audio time AT(J). The subtitle character string SS(K) can then be displayed at audio time AT(J), so that it is synchronized with the audio.

Further, if no audio character string AS(J) having the predetermined correlation with subtitle character string SS(K) is found among the audio character strings AS(j), the determination unit 27 compares SS(K) with at least one video character string VS(i) and determines one that has the predetermined correlation. For example, if a video character string VS(I) having the predetermined correlation with SS(K) is found among the video character strings VS(i), the subtitle time ST(K) at which SS(K) is reproduced is set to the same value as the video time VT(I). The subtitle character string SS(K) can then be displayed at video time VT(I), so that it is synchronized with the video.

If no video character string VS(I) having the predetermined correlation with subtitle character string SS(K) is found among the video character strings VS(i), the determination unit 27 sets the subtitle time ST(K) at which SS(K) is reproduced to a value between the subtitle times ST(K-1) and ST(K+1) at which the preceding subtitle character string SS(K-1) and the following subtitle character string SS(K+1) are respectively reproduced. Thus, even when the timing for displaying SS(K) cannot be determined from the audio signal or the video signal, SS(K) can be displayed at an appropriate timing.
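The three-stage fallback described above (audio match, then video match, then a time between the neighbouring subtitles) can be sketched as follows. This is a minimal sketch under stated assumptions: the `correlated` predicate stands in for the "predetermined correlation" test, the function names are hypothetical, and the midpoint is only one possible choice, since the text says no more than "between" ST(K-1) and ST(K+1).

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (recognized character string, playback time in seconds)


def find_match(caption: str, candidates: List[Candidate],
               correlated: Callable[[str, str], bool]) -> Optional[float]:
    """Return the time of the first candidate correlated with the caption."""
    for text, time in candidates:
        if correlated(caption, text):
            return time
    return None


def decide_caption_time(caption: str,
                        audio: List[Candidate],
                        video: List[Candidate],
                        prev_time: float,
                        next_time: float,
                        correlated: Callable[[str, str], bool]) -> float:
    """Decide ST(K): audio strings first, then video strings, then neighbours."""
    time = find_match(caption, audio, correlated)       # AS(j) / AT(j)
    if time is None:
        time = find_match(caption, video, correlated)   # VS(i) / VT(i)
    if time is None:
        # Assumption: use the midpoint of ST(K-1) and ST(K+1).
        time = (prev_time + next_time) / 2
    return time
```

Passing the correlation test as a function keeps the fallback order separate from how correlation is computed.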

Further, the determination unit 27 sets, as a new subtitle character string SS, any audio character string AS(J) among the audio character strings AS(j) that is found to have no predetermined correlation with any of the subtitle character strings SS(k). The new subtitle character string SS is inserted between the two subtitle character strings SS whose subtitle times ST fall immediately before and after the audio time AT(J) of AS(J). The subtitle time ST of the new subtitle character string SS is set to the same value as the audio time AT(J). A character string that is not included in the subtitle information but is included in the audio signal can thereby be displayed in synchronization with the audio.

Similarly, the determination unit 27 sets, as a new subtitle character string SS, any video character string VS(I) among the video character strings VS(i) that is found to have no predetermined correlation with any of the subtitle character strings SS(k). The new subtitle character string SS is inserted between the two subtitle character strings SS whose subtitle times ST fall immediately before and after the video time VT(I) of VS(I). The subtitle time ST of the new subtitle character string SS is set to the same value as the video time VT(I). A character string that is not included in the subtitle information but is included in the video signal can thereby be displayed in synchronization with the video.
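Inserting unmatched audio or video character strings as new subtitles amounts to merging them into the time-ordered subtitle list. The sketch below assumes, purely for illustration, that subtitles are held as `(time, text)` tuples sorted by time; the function name is hypothetical.

```python
import bisect
from typing import List, Tuple

Timed = Tuple[float, str]  # (ST, SS) for captions; (AT, AS) or (VT, VS) for unmatched strings


def insert_unmatched(captions: List[Timed], unmatched: List[Timed]) -> List[Timed]:
    """Return a new time-ordered list with each unmatched string inserted
    between the captions whose times fall just before and just after it."""
    merged = list(captions)          # leave the caller's list untouched
    for item in sorted(unmatched):
        bisect.insort(merged, item)  # keeps the list ordered by time
    return merged
```

Because the new entry keeps the recognized string's own time, its subtitle time equals AT(J) or VT(I) as the text requires.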

The reproduction control unit 29 controls the reproduction unit 41. In accordance with instructions input from the reproduction control unit 29, the reproduction unit 41 reads the subtitle character string SS(k) and the subtitle time ST(k) from the first buffer memory 31, and reads the video signal and the audio signal from the second buffer memory 33. The reproduction unit 41 reproduces the video signal and the audio signal from the second buffer memory 33 after a predetermined time has elapsed since they were stored. The video signal and the audio signal are thereby reproduced with a predetermined delay after the broadcast signal is received.

The reproduction unit 41 displays the subtitle character string SS(k) read from the first buffer memory 31 when the reproduction time of the video signal and the audio signal reaches the subtitle time ST(k). The subtitle character string can thereby be displayed in synchronization with the display of the video and the output of the audio. The subtitle character string SS(k) is displayed superimposed on the video signal. An image of the subtitle character string SS(k) may be composited into the video signal, or it may be displayed using an on-screen function of the display such as OSD.
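The delayed reproduction and subtitle display described above can be sketched as a small scheduler. This is an illustrative model only: the class and method names are hypothetical, frames are opaque values, and the frame timestamps and ST(k) are assumed to share one media timeline, so that both are released once the playback position (current time minus the delay) reaches them.

```python
from collections import deque
from typing import Any, List, Tuple


class DelayedPlayer:
    """Release buffered frames and captions a fixed delay after arrival."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.av = deque()    # (media_time, frame); stands in for second buffer memory 33
        self.captions = []   # (ST, SS); stands in for first buffer memory 31

    def store_av(self, media_time: float, frame: Any) -> None:
        self.av.append((media_time, frame))

    def store_caption(self, caption_time: float, text: str) -> None:
        self.captions.append((caption_time, text))

    def tick(self, now: float) -> Tuple[List[Any], List[str]]:
        """Return the frames and captions that become due at time `now`."""
        position = now - self.delay  # current playback position on the media timeline
        frames = []
        while self.av and self.av[0][0] <= position:
            frames.append(self.av.popleft()[1])
        due = [text for t, text in self.captions if t <= position]
        self.captions = [(t, text) for t, text in self.captions if t > position]
        return frames, due
```

The delay gives the determination step time to assign ST(k) before the matching audio and video are actually played.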

The first buffer memory 31 and the second buffer memory 33 are DRAM (Dynamic RAM) or SDRAM (Synchronous Dynamic RAM). They may also be semiconductor memory such as flash memory.

The control unit 20 loads and executes a program stored in a flash memory 51. The program executed by the control unit 20 need not be recorded in the flash memory 51 and may be recorded on another recording medium.

The term "program" here covers not only a program directly executable by the control unit 20 but also a source program, a compressed program, an encrypted program, and the like.

Although an example has been described in which the character recognition unit 23 extracts the video character string VS(i) from the video signal, a vowel recognition unit that outputs a sequence of vowels as character information by recognizing the movement and shape of lips included in the video signal may be provided in place of the character recognition unit 23. The vowel sequence output by the vowel recognition unit can be processed in the same way as the video character string VS(i) output by the character recognition unit 23. Alternatively, both the vowel recognition unit and the character recognition unit 23 may be provided, in which case either one of the character strings they output may be used as the video character string VS(i), or both may be used.

FIG. 2 is a functional block diagram showing the detailed functions of the determination unit. Referring to FIG. 2, the determination unit 27 includes a correlation calculation unit 61 and a selection unit 63. The correlation calculation unit 61 receives from the selection unit 63 either a pair of video character string VS(i) and subtitle character string SS(k) or a pair of audio character string AS(j) and subtitle character string SS(k). The correlation calculation unit 61 obtains the correlation between the subtitle character string SS(k) and the video character string VS(i) and outputs a correlation value XC(k, i) to the selection unit 63. It likewise obtains the correlation between the subtitle character string SS(k) and the audio character string AS(j) and outputs a correlation value XC(k, j) to the selection unit 63.

There are various methods of calculating the correlation value, and any of them may be used; here, an example that compares sequences of speech phonemes is shown. The method of calculating the correlation value is not limited to comparing phoneme sequences. FIG. 3 is a diagram for explaining the comparison of two phoneme sequences. Here, the phoneme sequence of the audio character string AS(j) is a_i (konbanha), and the phoneme sequence of the subtitle character string SS(k) is b_i (konbanha). In this case, the cross-correlation function C_τ is given in terms of the products a_i b_{i+τ}, where a_i b_{i+τ} is 1 when a_i and b_{i+τ} match and 0 when they differ. The τ that maximizes C_τ indicates the strongest correlation.
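The expression for the cross-correlation function does not survive in this text. One form consistent with the description above, offered only as an assumption (in particular, whether the original sum is normalized, for example by the sequence length, cannot be confirmed from this text), is:

```latex
C_\tau = \sum_{i} a_i\, b_{i+\tau},
\qquad
a_i\, b_{i+\tau} =
\begin{cases}
1 & \text{if } a_i = b_{i+\tau},\\
0 & \text{otherwise.}
\end{cases}
```

Under this reading, C_τ counts the positions at which the two phoneme sequences agree when b is shifted by τ.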

In the case of the phoneme sequences a_i and b_i shown in FIG. 3, the correlation value C_τ is calculated in order starting from τ = 0, and once C_τ exceeds a threshold T, the τ at that point is taken to indicate the strongest correlation. In FIG. 3, the threshold T is exceeded at τ = 0, so there is no need to calculate C_τ for τ = 1 or greater. Accordingly, the time difference T5 - T0 between the phoneme sequence a_i and the phoneme sequence b_i is the offset between the audio character string AS(j) and the subtitle character string SS(k).

The correlation calculation unit 61 calculates the correlation value XC = C_τ in order starting from τ = 0 and, when XC exceeds the threshold value T, outputs it to the selection unit 63 as the correlation value XC. If no XC exceeds the threshold value, nothing is output to the selection unit 63. Alternatively, a value indicating that no XC exceeded the threshold value may be output. Here, 0 is output as that value. If a strong correlation value XC cannot be obtained, the phoneme sequence of the subtitle character string SS(k) may be reduced to its vowels only, omitting the consonants, before the correlation is calculated. The sequences used to obtain XC need not be phonemes: syllables, words, or other units may be used instead. Furthermore, XC may be calculated while shifting either the audio character string AS(j) or the subtitle character string SS(k), and the XC that exceeds the threshold value is then used. If it is clear that AS(j) and SS(k) are not offset from each other, the XC obtained without shifting either string may be used. In addition, when obtaining the correlation it is not necessary to compare AS(j) and SS(k) in their entirety. The comparison may be limited to a predetermined number of characters, such as the first 10 characters, which shortens the time needed to calculate XC.
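The thresholded search over τ described above can be sketched as follows. This is only an illustration under stated assumptions: `match_ratio` is a hypothetical stand-in for C_τ (the share of units in one sequence that equal the unit τ positions later in the other), not the correlation formula defined in this description, and the names `first_lag_over_threshold`, `max_lag`, and `limit` are introduced here for the example.

```python
from typing import Optional, Sequence, Tuple


def match_ratio(a: Sequence[str], b: Sequence[str], tau: int) -> float:
    """Hypothetical stand-in for C_tau: share of a[i] equal to b[i + tau]."""
    if not a:
        return 0.0
    hits = sum(
        1
        for i, unit in enumerate(a)
        if 0 <= i + tau < len(b) and unit == b[i + tau]
    )
    return hits / len(a)


def first_lag_over_threshold(
    a: Sequence[str],
    b: Sequence[str],
    threshold: float,
    max_lag: int,
    limit: int = 10,
) -> Tuple[float, Optional[int]]:
    """Try tau = 0, 1, ... and stop at the first value above the threshold.

    Only the first `limit` units of `a` are compared, mirroring the option
    of restricting the comparison to, for example, the first 10 characters.
    Returns (0.0, None) when no tau exceeds the threshold, matching the
    convention of outputting 0 in that case.
    """
    a = a[:limit]
    b = b[: limit + max_lag]
    for tau in range(max_lag + 1):
        xc = match_ratio(a, b, tau)
        if xc > threshold:
            return xc, tau  # later lags are not evaluated
    return 0.0, None
```

Because the loop returns at the first lag above the threshold, a match at τ = 0 costs a single evaluation, which is the saving the text describes for FIG. 3.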

The combinations for which the correlation calculation unit 61 calculates a correlation value are determined by the selection unit 63. Returning to FIG. 2, the selection unit 63 receives the video character string VS(i) and the video time VT(i) from the character recognition unit 23, the audio character string AS(j) and the audio time AT(j) from the speech recognition unit 25, and the subtitle character string SS(k) and the subtitle time ST(k) contained in the subtitle signal from the separation unit 21.

The selection unit 63 sets the subtitle character strings SS(k) as the processing target in the order in which they are input. The case where the subtitle character string SS(K) is the processing target is described here as an example. The selection unit 63 first sets as the comparison target the audio character string AS(J) whose audio time AT(J) precedes, and is closest to, the subtitle time ST(K) of SS(K). It then causes the correlation calculation unit 61 to calculate the correlation between SS(K) and AS(J). If the correlation value XC calculated by the correlation calculation unit 61 exceeds the threshold value T, the correlation calculation unit 61 outputs XC, and the selection unit 63 selects AS(J) and determines it to be the corresponding audio character string for SS(K). The subtitle time ST(K) at which SS(K) is reproduced is then changed to the same value as the audio time AT(J) of the audio character string AS(J) determined to be the corresponding audio character string.

On the other hand, if the correlation value XC calculated by the correlation calculation unit 61 does not exceed the threshold value T, the correlation calculation unit 61 outputs 0 as the value indicating that no XC exceeded T, and the selection unit 63 sets the audio character string AS(J-1) immediately preceding AS(J) as the next target and causes the correlation calculation unit 61 to obtain the correlation value. If the audio character string AS(J-m) (m is a positive integer) that would become the target has already been determined to be a corresponding audio character string, it is judged that no audio character string AS(j) correlated with the subtitle character string SS(K) exists.
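The backward search performed by the selection unit 63 can be sketched as below. This is a minimal illustration, not the implementation of this description. It assumes the recognized strings are held in lists sorted by time, uses 0-based indices, and takes the correlation function as a parameter. The names `find_corresponding`, `first_unassigned`, and `correlate` are hypothetical.

```python
from typing import Callable, List, Optional


def find_corresponding(
    caption_text: str,
    caption_time: float,
    texts: List[str],
    times: List[float],
    first_unassigned: int,
    correlate: Callable[[str, str], float],
    threshold: float,
) -> Optional[int]:
    """Return the index of the recognized string matched to the caption.

    The search starts at the string whose time is before, and closest to,
    the caption time, then steps backward one string at a time. It stops
    with None once it would reach a string already assigned to an earlier
    caption (index below `first_unassigned`).
    """
    # Nearest string that starts before the caption (times are ascending).
    j = None
    for idx in range(len(times) - 1, -1, -1):
        if times[idx] < caption_time:
            j = idx
            break

    while j is not None and j >= first_unassigned:
        if correlate(caption_text, texts[j]) > threshold:
            return j  # caption time would then be set to times[j]
        j -= 1
    return None
```

The same routine applies unchanged to the video character strings VS(i) and video times VT(i), since only the comparison target differs.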

Next, the selection unit 63 sets as the comparison target the video character string VS(I) whose video time VT(I) precedes, and is closest to, the subtitle time ST(K) of the subtitle character string SS(K). It then causes the correlation calculation unit 61 to calculate the correlation between SS(K) and VS(I). If the correlation value XC calculated by the correlation calculation unit 61 exceeds the threshold value T, the correlation calculation unit 61 outputs XC, and the selection unit 63 determines VS(I) to be the corresponding video character string for SS(K). The subtitle time ST(K) at which SS(K) is reproduced is then changed to the same value as the video time VT(I) of the video character string VS(I) determined to be the corresponding video character string.

When the subtitle time ST(K) includes a start time and an end time but the video time VT(I) includes only a start time, the start time of ST(K) is set to the start time of VT(I), and the end time may be set to the time a predetermined period after that start time. Alternatively, the start time of the next subtitle time ST(K+1) may be used as the end time of ST(K).

On the other hand, if the correlation value XC calculated by the correlation calculation unit 61 does not exceed the threshold value T, the correlation calculation unit 61 outputs 0 as the value indicating that no XC exceeded T, and the selection unit 63 sets the video character string VS(I-1) immediately preceding VS(I) as the next target and causes the correlation calculation unit 61 to obtain the correlation value. If the video character string VS(I-n) (n is a positive integer) that would become the target has already been determined to be a corresponding video character string, it is judged that no video character string VS(i) correlated with the subtitle character string SS(K) exists.

If no video character string VS(I) having the predetermined correlation with the subtitle character string SS(K) is found among the video character strings VS(i), the selection unit 63 sets the subtitle time ST(K) at which SS(K) is reproduced to a time between ST(K-1) and ST(K+1), the subtitle times at which the preceding and following subtitle character strings SS(K-1) and SS(K+1) are respectively reproduced.
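As a small illustration of this fallback, the sketch below places the unmatched subtitle at the midpoint of its neighbors. The text only requires a time between ST(K-1) and ST(K+1), so the midpoint is an assumption made for the example, and the function name is hypothetical.

```python
def interpolate_caption_time(prev_time: float, next_time: float) -> float:
    """Pick a display time between the neighboring caption times.

    The midpoint is an illustrative choice. Any value in the closed
    interval [prev_time, next_time] satisfies the rule in the text.
    """
    if next_time < prev_time:
        raise ValueError("next_time must not precede prev_time")
    return prev_time + (next_time - prev_time) / 2.0
```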

FIG. 4 is a flowchart showing an example of the flow of the synchronization process. The synchronization process is executed by the control unit 20 when it runs a program stored in the flash memory 53. FIGS. 7 to 11 are time charts showing the temporal arrangement of the audio character strings AS, the video character strings VS, and the subtitle character strings SS. T0 is the time at which reproduction of the video or audio starts. For example, referring to FIG. 7, the audio time AT(1) corresponding to the audio character string AS(1) starts at time T0 and ends at time T1. The video time VT(1) corresponding to the video character string VS(1) likewise starts at time T0 and ends at time T1. The subtitle time ST(1) corresponding to the subtitle character string SS(1) starts at time T1 and ends at time T2.

The flow of the synchronization process shown in FIG. 4 is described below with reference to FIGS. 7 to 11. Referring to FIG. 4, the control unit 20 sets each of the variables k, j, i, amin, and vmin to the initial value 1 (step S01). The variable k identifies the subtitle character string SS(k) and subtitle time ST(k) being processed. The variable j identifies the audio character string AS(j) and audio time AT(j) being processed. The variable i identifies the video character string VS(i) and video time VT(i) being processed. The variable amin identifies the last audio character string AS to be compared with a subtitle character string SS. The variable vmin identifies the last video character string VS to be compared with a subtitle character string SS. Because step S01 sets the variable k to 1, the first subtitle character string SS(1) and subtitle time ST(1) become the processing target.

In the next step S02, the audio time AT(J) that is earlier than, and closest to, the subtitle time ST(k) is selected from among the audio times AT(j). The audio character string AS(J) is the first of the audio character strings to be compared with the subtitle character string SS(k). It is compared first because the search proceeds backward in time, starting from the string nearest ST(k). For example, referring to FIG. 7, the audio character string that precedes the start time T1 of the subtitle time ST(1) and is closest to T1 is AS(1), so the audio time AT(1) is selected. Next, the variable j is set to the value J selected in step S02 (step S03), and the process proceeds to step S04.

In step S04, the correlation value XC between the subtitle character string SS(k) and the audio character string AS(j) is calculated. The correlation value XC is then compared with the threshold value T (step S05). If XC exceeds T, the process proceeds to step S06. Otherwise, the process proceeds to step S11.

In step S06, the subtitle time ST(k) is set to the audio time AT(j). If the correlation value XC between the subtitle character string SS(k) and the audio character string AS(j) exceeds the threshold value T, SS(k) is similar to AS(j). The subtitle character string SS(k) is therefore displayed at the audio time AT(j) at which AS(j) is reproduced, so that the two are synchronized. FIG. 8 shows the time chart for the case where the correlation value XC between the subtitle character string SS(1) and the audio character string AS(1) shown in FIG. 7 exceeds the threshold value T. In this case, the subtitle time ST(1) is set to the audio time AT(1).

If the τ of the correlation value C_τ is not 0 when the correlation value XC is obtained, the subtitle time ST(k) is set to the audio time AT(j) increased or decreased by τ.

In the next step S06A, the unassigned audio character string insertion process is executed. FIG. 5 is a flowchart showing an example of the flow of the unassigned audio character string insertion process. Referring to FIG. 5, in step S31 the number M = (j - 1) - amin + 1 of audio character strings AS that precede the audio character string AS(j) and have no subtitle character string SS assigned is calculated. In step S32, whether any such audio character string exists is judged by checking whether M is positive. If M is positive, it is judged that an audio character string AS with no subtitle character string SS assigned exists before AS(j), and the process proceeds to step S33. If M is not positive, it is judged that no such audio character string exists, and the process returns to the synchronization process.

In steps S33 to S40, the audio character strings AS that have no subtitle character string SS assigned are inserted between the subtitle character string SS(k) and the immediately preceding subtitle character string SS(k-1). When SS is an array, the subtitle character strings from SS(k) onward must be moved down by the number M of unassigned audio character strings. Steps S33 to S36 therefore copy all subtitle character strings from SS(k) onward, and the subsequent steps S37 to S40 insert the unassigned audio character strings AS. If SS is instead organized as a doubly linked list, the insertion requires only rewriting the pointer from SS(k-1) to its next element and the pointer from SS(k) to its previous element.

In the next step S41, M is added to each of N and k. Step S41, step S09 in FIG. 4, and similar steps are unnecessary when a doubly linked list is used, in which case conditional expressions such as that of step S10 must be changed. A detailed example for the doubly linked list configuration is obvious to those skilled in the art and is not given here.
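Steps S31 to S41 can be sketched as follows for the array case. This is an illustrative sketch only: it uses 0-based indices (so M reduces to `j - a_min`), represents each string as a hypothetical `(text, time)` tuple, and relies on Python list slice assignment to perform the shifting that steps S33 to S36 do by copying.

```python
from typing import List, Tuple

Entry = Tuple[str, float]  # (text, display time), a hypothetical layout


def insert_unassigned(
    captions: List[Entry],
    audio: List[Entry],
    k: int,
    j: int,
    a_min: int,
) -> int:
    """Insert audio strings a_min..j-1 before captions[k]; return the new k.

    With 0-based indices the count of unassigned strings is M = j - a_min.
    The list grows by M entries, which corresponds to adding M to N, and
    the caption being processed moves from index k to index k + M.
    """
    m = j - a_min
    if m <= 0:
        return k  # nothing unassigned before audio[j]
    captions[k:k] = audio[a_min:j]  # shifts captions[k:] down by m
    return k + m
```

For example, with two captions, `k = 1`, `j = 2`, and `a_min = 1`, the single unassigned entry `audio[1]` is placed between the two captions and the function returns 2.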

Returning to FIG. 4, in step S07 the variable amin is set to the value of the variable j plus 1. Because the audio time AT(j) was assigned to the subtitle time ST(k) in step S06, this sets the last audio character string AS(amin) to be compared to AS(j+1), the audio character string at the audio time AT(j+1) that follows AT(j). This limits the number of audio character strings compared with any one subtitle character string SS(k) and therefore speeds up processing.

On the other hand, when the process proceeds to step S11, the correlation value XC between the subtitle character string SS(k) and the audio character string AS(j) has not exceeded the threshold value T. In this case, the variable j is decremented by 1 (step S11) so that SS(k) is next compared with AS(j-1), the audio character string immediately preceding AS(j). Then, to judge whether the search has gone past the last audio character string AS(amin) to be compared, it is determined whether the variable j has become smaller than the variable amin (step S12). If j is smaller than amin, the process proceeds to step S13. Otherwise, the process returns to step S04.

The process proceeds to step S13 when it has been judged that no audio character string AS(j) remains to be compared with the subtitle character string SS(k) being processed. In this case, SS(k) is compared with the video character strings VS(i). Steps S13 to S18 are the same as steps S02 to S07 except for the comparison target, and steps S19 and S20 are the same as steps S11 and S12, so their description is not repeated here. FIG. 6 shows the unassigned video character string insertion process executed in step S17A. It differs from the unassigned audio character string insertion process shown in FIG. 5 only in the character strings that are inserted, so its description is likewise not repeated.

When i < vmin in step S20, it has been judged that no video character string VS(i) remains to be compared with the subtitle character string SS(k) being processed. In this case, the subtitle time ST(k) of SS(k) is determined by one of the following methods (step S21).
(1) The subtitle time ST(k) is set to the audio time AT of the audio character string AS, or the video time VT of the video character string VS, that gave the largest of all the correlation values XC calculated for the subtitle character string SS(k) being processed.
(2) The subtitle time ST(k) is increased or decreased by the average of the amounts (shift times) by which the subtitle times ST(1) to ST(k-1) of the preceding subtitle character strings SS(1) to SS(k-1) were changed.
(3) The subtitle time ST(k) is increased or decreased by a predetermined time.
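The three fallback methods of step S21 can be sketched as below. This is a hedged illustration: the function names are hypothetical, shifts are represented as signed seconds (negative meaning earlier), and method (1) assumes the raw correlation values were kept for each compared string even when they fell below the threshold.

```python
from typing import List, Tuple


def fallback_best_match(scored: List[Tuple[float, float]]) -> float:
    """Method (1): time of the compared string with the largest XC.

    `scored` holds (xc, time) pairs for every audio or video string that
    was compared with the caption. It must not be empty.
    """
    if not scored:
        raise ValueError("no correlation values were recorded")
    return max(scored, key=lambda pair: pair[0])[1]


def fallback_average_shift(
    caption_time: float, previous_shifts: List[float]
) -> float:
    """Method (2): apply the mean shift of the earlier captions."""
    if not previous_shifts:
        return caption_time  # nothing to average yet, leave unchanged
    return caption_time + sum(previous_shifts) / len(previous_shifts)


def fallback_fixed_shift(caption_time: float, shift: float) -> float:
    """Method (3): apply a predetermined shift."""
    return caption_time + shift
```

Method (2) adapts to the delay actually observed in the broadcast so far, while method (3) needs no history and so can be used from the first caption.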

When the subtitle time ST(k) of the subtitle character string SS(k) has been determined in step S21, the next step S22 updates the variables amin and vmin so that the audio character strings AS whose audio times AT are later than the determined ST(k), and the video character strings VS whose video times VT are later than the determined ST(k), become the comparison targets. The process then proceeds to step S08.

In the time chart shown in FIG. 7, if the correlation value XC between the subtitle character string SS(2) and the audio character string AS(2) exceeds the threshold value T, the subtitle time ST(2) is set to the audio time AT(2). FIG. 9 shows the time chart for this case. The subtitle character string SS(2) is displayed at the time the audio character string AS(2) is reproduced, so the two are synchronized.

Continuing from the state shown in FIG. 9, for the subtitle character string SS(3) the audio character string AS(4), which is the closest one preceding the subtitle time ST(3), is compared first. If the correlation value XC does not exceed the threshold value T, the audio character string AS(3) is compared next. If the correlation value XC between SS(3) and AS(3) exceeds the threshold value T, the subtitle time ST(3) is set to the audio time AT(3). FIG. 10 shows the time chart for this case. The subtitle character string SS(3) is displayed at the time the audio character string AS(3) is reproduced, so the two are synchronized. Because SS(3) is compared not only with AS(4) but also with the earlier audio character string AS(3), SS(3) can be displayed at the appropriate position and synchronized accurately with the audio.

Returning to FIG. 4, in step S08 the subtitle character string SS(k) is divided as necessary. When the length of the audio character string AS(j) differs from the length of the subtitle character string SS(k), the longer of the two is divided. When AS(j) is divided, its second half becomes the (j+1)-th audio character string AS(j+1). The audio time AT(j+1) of the second half may be the same as the audio time AT(j) of the first half. However, if times are known for each phoneme or character, the time of the first phoneme or character of the divided AS(j+1) may be used as AT(j+1). When the character strings are held in an array, this division requires copying all subsequent character strings, but when they are organized as a doubly linked list, only the neighboring pointers need to be rewritten.

When the subtitle character string SS(k) is divided, the number of subtitle character strings SS increases, so the constant N indicating their total number is incremented by 1 (N = N + 1). If times are not known for each phoneme or character, the subtitle time ST(k+1) of the second half is either set equal to the next subtitle time ST(k+2) or delayed by a fixed time. For example, suppose that in FIG. 8 the correlation value XC between the audio character string AS(2) and the subtitle character string SS(2) is higher than the threshold value T and that AS(2) is shorter than SS(2). SS(2) is then divided, giving the arrangement shown in FIG. 11. In this case, the subtitle character string SS(3) in FIG. 11 corresponds to the second half of SS(2) in FIG. 8, and the subtitle character string SS(4) in FIG. 11 corresponds to SS(3) in FIG. 8.
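The division of a subtitle character string can be sketched as follows. Several points are assumptions made only for this illustration: the split position is taken to be the character length of the matched audio string, entries are hypothetical `(text, time)` tuples with 0-based indices, and the fixed delay is measured from the first half's time. The text does not specify these details.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, float]  # (text, display time), a hypothetical layout


def split_caption(
    captions: List[Entry],
    k: int,
    audio_len: int,
    delay: Optional[float] = None,
) -> bool:
    """Split captions[k] if it is longer than the matched audio string.

    The second half is inserted at index k + 1, so the list grows by one
    (N = N + 1). Its time is the first half's time plus `delay` when a
    delay is given, otherwise the time of the following caption, and the
    first half's time when there is no following caption.
    Returns True when a split was made.
    """
    text, time = captions[k]
    if audio_len <= 0 or len(text) <= audio_len:
        return False
    head, tail = text[:audio_len], text[audio_len:]
    if delay is not None:
        tail_time = time + delay
    elif k + 1 < len(captions):
        tail_time = captions[k + 1][1]  # same as the next caption time
    else:
        tail_time = time
    captions[k] = (head, time)
    captions.insert(k + 1, (tail, tail_time))
    return True
```

After a split, the entry that was at index k + 1 sits at index k + 2, which matches the renumbering between FIG. 8 and FIG. 11 described above.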

In the next step S09, the variable k is incremented by 1 so that the next subtitle character string SS(k+1) becomes the processing target, and the process proceeds to step S10. In step S10, it is checked whether the variable k exceeds the constant N, which indicates the total number of subtitle character strings SS. If k does not exceed N, the process returns to step S02 and the same processing as described above is repeated with the next subtitle character string SS(k) as the processing target. If k exceeds N, it is judged that no further subtitle character strings SS exist, and the process ends.

The embodiment disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

<Appendix>
(1) The receiving apparatus according to claim 1, wherein the determination means includes corresponding character information selection means for selecting, from among at least one piece of the character information output by the character information extraction means, one piece of corresponding character information that corresponds to the caption information, and
determines, as the timing for displaying the caption information, the time at which the portion of the audio signal or the video signal from which the character information extraction means extracted the corresponding character information is reproduced by the reproduction means.
(2) The receiving apparatus according to claim 2, wherein the determination means includes corresponding character string selection means for selecting, from among at least one of the character strings output by the voice recognition means, a corresponding character string whose correlation with the caption information is equal to or greater than a predetermined value, and
determines, as the timing for displaying the caption information, the time at which the portion of the audio signal from which the character information extraction means extracted the corresponding character string is reproduced by the reproduction means.
(3) The receiving apparatus according to claim 4 or 5, wherein the determination means includes corresponding character string selection means for selecting, from among at least one of the character strings output by the character recognition means, a corresponding character string whose correlation with the caption information is equal to or greater than a predetermined value, and
determines, as the timing for displaying the caption information, the time at which the portion of the video signal from which the character information extraction means extracted the corresponding character string is reproduced by the reproduction means.
(4) The receiving apparatus according to claim 6, wherein the determination means includes corresponding vowel array selection means for selecting, from among at least one of the vowel arrays output by the vowel recognition means, a corresponding vowel array whose correlation with the caption information is equal to or greater than a predetermined value, and
determines, as the timing for displaying the caption information, the time at which the portion of the video signal from which the character information extraction means extracted the corresponding vowel array is reproduced by the reproduction means.

FIG. 1 is a functional block diagram showing an outline of the functions of a television receiver in one embodiment of the present invention.
FIG. 2 is a functional block diagram showing the detailed functions of the determination unit.
FIG. 3 is a diagram for explaining the comparison of two phoneme sequences.
FIG. 4 is a flowchart showing an example of the flow of the synchronization process.
FIG. 5 is a flowchart showing an example of the flow of the unassigned audio character string insertion process.
FIG. 6 is a flowchart showing an example of the flow of the unassigned video character string insertion process.
FIG. 7 is a first time chart showing the temporal arrangement of the audio character strings AS, the video character strings VS, and the subtitle character strings SS.
FIG. 8 is a second time chart showing the temporal arrangement of the audio character strings AS, the video character strings VS, and the subtitle character strings SS.
FIG. 9 is a third time chart showing the temporal arrangement of the audio character strings AS, the video character strings VS, and the subtitle character strings SS.
FIG. 10 is a fourth time chart showing the temporal arrangement of the audio character strings AS, the video character strings VS, and the subtitle character strings SS.
FIG. 11 is a fifth time chart showing the temporal arrangement of the audio character strings AS, the video character strings VS, and the subtitle character strings SS.

Explanation of symbols

1 television receiver, 10 tuner, 20 control unit, 21 separation unit, 23 character recognition unit, 25 speech recognition unit, 27 determination unit, 29 playback control unit, 31 first buffer memory, 33 second buffer memory, 41 playback unit, 53 card, 61 correlation calculation unit, 63 selection unit.

Claims

A receiving device for receiving a broadcast wave of a television broadcast including a video signal, an audio signal, and caption information, the receiving device comprising:
reproduction means for reproducing, with a predetermined time delay, the video signal and the audio signal included in the broadcast wave of the television broadcast;
character information extraction means for extracting character information from at least one of the video signal and the audio signal; and
determination means for determining a timing for displaying the caption information based on a correlation between the caption information and the extracted character information.
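
As an illustration only, the timing determination recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Recognized` type, the `caption_display_time` function, the 0.6 threshold, and the use of `difflib.SequenceMatcher` as the correlation measure are all hypothetical choices made for the example; the claim itself does not specify how the correlation is computed.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Recognized:
    """A character string extracted from the delayed audio or video (hypothetical type)."""
    start: float  # presentation time of the string, in seconds
    text: str


def similarity(a: str, b: str) -> float:
    """Stand-in correlation measure (assumption): difflib's ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def caption_display_time(caption: str,
                         candidates: list[Recognized],
                         threshold: float = 0.6) -> Optional[float]:
    """Return the start time of the extracted string that best matches the caption.

    Returns None when no candidate reaches the threshold, i.e. when the
    timing cannot be determined from the correlation.
    """
    best: Optional[Recognized] = None
    best_score = 0.0
    for cand in candidates:
        score = similarity(caption, cand.text)
        if score > best_score:
            best, best_score = cand, score
    if best is None or best_score < threshold:
        return None
    return best.start


if __name__ == "__main__":
    extracted = [
        Recognized(10.0, "good evening this is the news"),
        Recognized(14.5, "heavy rain is expected tomorrow"),
    ]
    print(caption_display_time("heavy rain expected tomorrow", extracted))
```

Because reproduction is delayed by a predetermined time, the extracted strings are available before the corresponding video and audio are output, so the caption can be scheduled at the returned time.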

The receiving device according to claim 1, wherein the character information extraction means includes voice recognition means for outputting, as the character information, a character string recognized by voice recognition of the audio signal.

The receiving device according to claim 2, further comprising display means for displaying the character string output by the voice recognition means when the determination means cannot determine the timing for displaying the caption information based on the correlation between the caption information and that character string.

The receiving device according to claim 2, wherein the character information extraction means further includes character recognition means for outputting, as the character information, a character string recognized by character recognition of the video signal, and
wherein the determination means, when it cannot determine the timing for displaying the caption information based on the correlation between the caption information and the character string output by the voice recognition means, determines the timing for displaying the caption information based on a correlation between the caption information and the character string output by the character recognition means.
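
The fallback order described in this claim (voice recognition first, character recognition of the video second) might be sketched as below. The names `best_match` and `determine_timing`, the `(time, text)` tuple layout, the threshold, and the `difflib`-based similarity are assumptions introduced for illustration and do not come from the specification.

```python
from difflib import SequenceMatcher
from typing import Optional

# (presentation time in seconds, recognized text); hypothetical layout
Timed = tuple[float, str]


def best_match(caption: str, strings: list[Timed],
               threshold: float) -> Optional[float]:
    """Time of the string most similar to the caption, or None if below threshold."""
    best_time: Optional[float] = None
    best_score = 0.0
    for time, text in strings:
        score = SequenceMatcher(None, caption, text).ratio()
        if score > best_score:
            best_time, best_score = time, score
    return best_time if best_score >= threshold else None


def determine_timing(caption: str,
                     audio_strings: list[Timed],
                     video_strings: list[Timed],
                     threshold: float = 0.6) -> Optional[tuple[float, str]]:
    """Try the voice-recognized strings first, then fall back to the
    character-recognized (on-screen text) strings."""
    time = best_match(caption, audio_strings, threshold)
    if time is not None:
        return time, "audio"
    time = best_match(caption, video_strings, threshold)
    if time is not None:
        return time, "video"
    return None  # timing could not be determined from either source


if __name__ == "__main__":
    audio = [(3.0, "xxxx yyyy")]  # stands in for a failed recognition
    video = [(5.0, "Typhoon No. 9 approaches Kyushu")]
    print(determine_timing("Typhoon No. 9 approaches Kyushu", audio, video))
```

Returning the source label alongside the time is a design choice of this sketch; it simply makes it visible which recognition path supplied the timing.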

The receiving device according to claim 1, wherein the character information extraction means includes character recognition means for outputting, as the character information, a character string recognized by character recognition of the video signal.

The receiving device according to claim 1, wherein the character information extraction means includes vowel recognition means for outputting, as the character information, an array of vowels obtained by recognizing the movement and shape of lips included in the video signal.

A television broadcast reproduction method comprising the steps of:
receiving a broadcast wave of a television broadcast including a video signal, an audio signal, and caption information;
reproducing, with a predetermined time delay, the video signal and the audio signal included in the broadcast wave of the television broadcast;
extracting character information from at least one of the video signal and the audio signal; and
determining a timing for displaying the caption information based on a correlation between the caption information and the extracted character information.

A television broadcast reproduction program that causes a computer to execute the steps of:
receiving a broadcast wave of a television broadcast including a video signal, an audio signal, and caption information;
reproducing, with a predetermined time delay, the video signal and the audio signal included in the broadcast wave of the television broadcast;
extracting character information from at least one of the video signal and the audio signal; and
determining a timing for displaying the caption information based on a correlation between the caption information and the extracted character information.