JP2021090172A

JP2021090172A - Caption data generation device, content distribution system, video reproduction device, program, and caption data generation method

Info

Publication number: JP2021090172A
Application number: JP2019220769A
Authority: JP
Inventors: 紀英谷知; Norihide Yachi; 慎一郎松田; Shinichiro Matsuda; 良隆中島; Yoshitaka Nakajima
Original assignee: Yomiuri Telecasting Corp
Current assignee: Yomiuri Telecasting Corp
Priority date: 2019-12-05
Filing date: 2019-12-05
Publication date: 2021-06-10

Abstract

To eliminate the delay between video/audio and captions. SOLUTION: A caption data generation device comprises: a first acquisition part which acquires a caption text produced by an operator who converts the audio of a broadcast program into text while listening to it; a second acquisition part which acquires a voice text, obtained by applying speech recognition to the audio of the broadcast program, together with the voice time of that audio; a comparison part which compares the caption text with the voice text and detects a coincident word or phrase; and a caption data generation part which generates caption data associating the caption text with the voice time of the voice text that includes a word or phrase coincident with a word or phrase of the caption text. SELECTED DRAWING: Figure 2

Description

The present invention relates to a subtitle data generation device, a content distribution system, a video playback device, a program, and a subtitle data generation method, and in particular to ones that eliminate the difference in output timing between the video and audio (the moment the words are actually spoken) and the subtitles.

Television programs with subtitles have long been broadcast. Subtitled broadcasting is particularly valuable for enabling hearing-impaired viewers to enjoy television programs. In recent years, subtitling has also been attempted for live broadcasts and the like. Broadly speaking, there are two methods for subtitling live broadcasts.

One is called the real-time subtitle method, in which an operator listens to the audio of a broadcast program and types it on a keyboard to convert it into text (see, for example, Patent Document 1).

The other is the voice recognition method, which converts the audio of a broadcast program into text by using voice recognition technology.

Patent Document 1: Japanese Unexamined Patent Publication No. 2001-188649

However, in the real-time subtitle method, an operator listens to the audio of the broadcast program and types the text on a keyboard, so although the text is accurate, it lags the moment the words were actually spoken by about 5 to 20 seconds. One conceivable remedy is to delay the video and audio by a fixed time so that they line up with the subtitles, but this is not practical because the subtitle delay is not fixed; it varies.

The voice recognition method, on the other hand, runs automatically without human intervention, so the conversion takes little time and the text is obtained almost in real time. Its drawback is that the text is less accurate than that of the real-time subtitle method.

The present invention has been made in view of the above problems, and its object is to provide a subtitle data generation device, a content distribution system, a video playback device, a program, and a subtitle data generation method that can display subtitles in which the content of the audio has been accurately converted into text, in synchronization with the video and audio from which the subtitles were produced.

One aspect of the present invention is a subtitle data generation device comprising: a first acquisition unit that acquires a subtitle text produced by an operator who converts the audio of a broadcast program into text while listening to it; a second acquisition unit that acquires a voice text, obtained by applying voice recognition to the audio of the broadcast program, and the voice time of the audio of the broadcast program; a comparison unit that compares the subtitle text with the voice text and detects a matching word or phrase; and a subtitle data generation unit that generates subtitle data in which the subtitle text is associated with the voice time of the voice text that includes a word or phrase matching a word or phrase of the subtitle text.

One aspect of the present invention is a content distribution system comprising: a separation unit that separates broadcast program data into video data, audio data, and a subtitle text produced by an operator who converts the audio of the broadcast program into text while listening to it; a voice recognition unit that applies voice recognition to the audio data to generate a voice text; a first acquisition unit that acquires the subtitle text; a second acquisition unit that acquires the voice text and the voice time of the audio data; a comparison unit that compares the subtitle text with the voice text and detects a matching word or phrase; a subtitle data generation unit that generates subtitle data in which the subtitle text is associated with the voice time of the voice text that includes a word or phrase matching a word or phrase of the subtitle text; and a distribution content data generation unit that delays the video data and the audio data by a predetermined time and generates distribution content data including at least the video data, the audio data, and the subtitle data.

One aspect of the present invention is a video playback device comprising: a storage unit that stores a broadcast program; a separation unit that reads the broadcast program from the storage unit and separates it into a video file, an audio file, and subtitle data; a first acquisition unit that acquires a subtitle text from the subtitle data; a second acquisition unit that acquires a voice text, obtained by applying voice recognition to the audio of the audio file, and the voice time of the audio of the broadcast program; a comparison unit that compares the subtitle text with the voice text and detects a matching word or phrase; a subtitle data generation unit that generates subtitle data in which the subtitle text is associated with the voice time of the voice text that includes a word or phrase matching a word or phrase of the subtitle text; and a display control unit that superimposes the subtitle text of the subtitle data on the video of the video file corresponding to the voice time of the subtitle data.

One aspect of the present invention is a program that causes a computer to execute: a first acquisition process of acquiring a subtitle text produced by an operator who converts the audio of a broadcast program into text while listening to it; a second acquisition process of acquiring a voice text, obtained by applying voice recognition to the audio of the broadcast program, and the voice time of the audio of the broadcast program; a comparison process of comparing the subtitle text with the voice text and detecting a matching word or phrase; and a subtitle data generation process of generating subtitle data in which the subtitle text is associated with the voice time of the voice text that includes a word or phrase matching a word or phrase of the subtitle text.

One aspect of the present invention is a subtitle data generation method comprising: acquiring a subtitle text produced by an operator who converts the audio of a broadcast program into text while listening to it; acquiring a voice text, obtained by applying voice recognition to the audio of the broadcast program, and the voice time of the audio of the broadcast program; comparing the subtitle text with the voice text and detecting a matching word or phrase; and generating subtitle data in which the subtitle text is associated with the voice time of the voice text that includes a word or phrase matching a word or phrase of the subtitle text.

The present invention eliminates the difference in output timing between the video and audio (the moment the words are actually spoken) and the subtitles, and can display subtitles in which the content of the audio has been accurately converted into text, in synchronization with the video and audio from which the subtitles were produced.

FIG. 1 is a diagram showing an example of the overall configuration of the content distribution system according to the present embodiment.
FIG. 2 is a block diagram showing an example of the functional configuration of the content distribution device 1.
FIG. 3 is a diagram for explaining the operation of the content distribution device 1.
FIG. 4 shows an example of subtitles displayed on the viewing terminal 3 after the display timing has been corrected in the embodiment described above.
FIG. 5 shows a display example of concatenated subtitles.
FIG. 6 is a block diagram of the content distribution device 1 implemented as a computer system.
FIG. 7 is a block diagram of the recorder according to the second embodiment.

Embodiments of the present invention will now be described.

<First Embodiment>
FIG. 1 is a diagram showing an example of the overall configuration of the content distribution system according to the present embodiment. As shown in FIG. 1, the content distribution system comprises a content distribution device 1, a voice recognition device 2, viewers' viewing terminals 3, and a communication line 4.

The content distribution device 1 is a server system made up of one or more server devices, storage devices, and the like. The content distribution device 1 receives a broadcast program and distributes the subtitled video (broadcast program) to the viewing terminals 3. The present embodiment assumes a simultaneous distribution (simulcast) system in which content is distributed at the same time as a broadcast such as terrestrial digital broadcasting.

The voice recognition device 2 is a Web API server that converts received audio data into text by using existing voice recognition technology. The existing voice recognition technology used by the voice recognition device 2 has an accuracy of about 80 to 90 percent; although it is somewhat inaccurate, it can convert audio data into text almost in real time. Audio data that has been converted into text is hereinafter referred to as voice text. The present embodiment assumes that the voice recognition device 2 is a server connected to the content distribution device 1 via the communication line 4, but the content distribution device 1 may instead incorporate the voice recognition device 2.

The viewing terminal 3 is a terminal that supports a method such as HTTP Live Streaming (HLS) and can play subtitled video. It connects to the communication line 4 via a wireless communication base station or the like and can exchange data with the content distribution device 1. The viewing terminal 3 is, for example, a smartphone, a mobile phone, a portable game device, a stationary home game device, an arcade game device, a personal computer, a tablet computer, or a controller of a stationary home game device. There are normally a plurality of viewing terminals 3, each operated by a viewer.

The communication line 4 means a communication path over which data communication is possible. That is, the communication line 4 includes a dedicated line (dedicated cable) for direct connection and a LAN using Ethernet (registered trademark) or the like, as well as communication networks such as a telephone network, a cable network, and the Internet; the communication method may be wired or wireless. The content distribution device 1 and the voice recognition device 2 can connect to the communication line 4 and communicate with each other, as can the content distribution device 1 and the viewing terminal 3.

Next, the configuration of the content distribution device 1 will be described. FIG. 2 is a block diagram showing an example of the functional configuration of the content distribution device 1.

The content distribution device 1 comprises a separation unit 11, a real-time subtitle data acquisition unit 12, a voice text acquisition unit 13, a text comparison unit 14, a subtitle data generation unit 15, and a distribution content data generation unit 16.

The separation unit 11 receives a content file of terrestrial digital broadcasting. The content file can include a video file in which subtitles are not superimposed on the video (hereinafter sometimes called the "original video file"), an audio file, and ancillary data.

The video file and the audio file also contain PTS (presentation time stamp) information, which indicates the presentation timing.

The ancillary data includes subtitle data. The subtitle data includes the text (character strings) to be superimposed on the video, the display timing of each character string (PTS: display start and end timing), the display position of each character string (position within the video), and so on. The subtitle data may give the character strings, display timing, display positions, and so on separately for each of a plurality of video formats that differ in at least one of resolution and aspect ratio. Examples of such video formats include HD (high definition video), SD (standard definition television), and One Seg (mobile).

The present embodiment assumes that the subtitle data has been produced by the real-time subtitle method, that is, by an operator listening to the audio of the broadcast program. Note, therefore, that the display timing of the subtitle text superimposed on the video lags the video and audio from which it was produced.

The separation unit 11 receives the content file and separates it into the video file, the audio file, and the ancillary data. The separation unit 11 outputs the audio file to the voice recognition device 2 as audio data (including the PTS obtained from the video or audio). The separation unit 11 also outputs the ancillary data to the real-time subtitle data acquisition unit 12, and outputs the video file and the audio file to the distribution content data generation unit 16.

The real-time subtitle data acquisition unit 12 extracts the subtitle data (subtitle text and PTS) from the ancillary data and outputs the subtitle text and PTS to the text comparison unit 14.

The voice text acquisition unit 13 acquires the voice text from the voice recognition device 2, associates the acquired voice text with its PTS, and outputs both to the text comparison unit 14. If the PTS of the audio cannot be obtained from the video or audio data, the time at which the voice text was received from the voice recognition device 2 can be used in place of the PTS of the voice text. This is because the voice recognition device 2 converts audio data into text almost in real time, so the time at which the voice text is received differs little from the PTS of the actual audio.

The text comparison unit 14 compares the subtitle text acquired by the real-time subtitle data acquisition unit 12 with the voice text acquired by the voice text acquisition unit 13, and detects words or phrases that match between the two. To do so, it performs morphological analysis on the subtitle text and the voice text to divide them into words or phrases, and then detects matching text word by word or phrase by phrase. The detection may be run over all subtitle texts and voice texts, but when searching for voice texts that contain a word or phrase of a given subtitle text, the search may be limited to voice texts whose PTS lies within 20 seconds before or after the PTS of that subtitle text. This narrows the search range and speeds up the detection process.
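The ±20-second restriction described above can be sketched as a simple candidate filter. This is a minimal illustration, not the patent's implementation: the `TimedText` type and the `candidates_in_window` function are hypothetical names, PTS values are simplified to seconds, and comparing start times (rather than end times or whole intervals) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TimedText:
    """A piece of text with its presentation interval (hypothetical type)."""
    start: float  # PTS start, in seconds
    end: float    # PTS end, in seconds
    text: str


def candidates_in_window(caption, voice_texts, window=20.0):
    """Keep only voice texts whose start PTS lies within +/- `window`
    seconds of the caption's start PTS (assumption: start times are compared)."""
    return [v for v in voice_texts if abs(v.start - caption.start) <= window]


caption = TimedText(100.0, 102.0, "caption")
voices = [
    TimedText(70.0, 72.0, "too early"),   # 30 s before the caption
    TimedText(92.0, 94.0, "near"),        # 8 s before the caption
    TimedText(125.0, 127.0, "too late"),  # 25 s after the caption
]
print([v.text for v in candidates_in_window(caption, voices)])  # ['near']
```

Only the surviving candidates then need to go through the word-level comparison, which is where the speed-up comes from.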

The text comparison unit 14 further determines the correspondence between subtitle texts and voice texts. A voice text and a subtitle text are judged to correspond if at least a predetermined number of words or phrases of the voice text match words or phrases of the subtitle text. Given the accuracy of current voice recognition, the predetermined number is set to about 70 to 80 percent of the total number of words or phrases in the voice text. This is only an example, and the value may be changed as appropriate according to the accuracy of the voice recognition used.
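The correspondence test can be sketched as follows, assuming both texts have already been split into tokens by a morphological analyzer. The function names are hypothetical, and counting matches as a multiset overlap that ignores word order is a simplifying assumption; the source only states that a predetermined share of the voice text's words or phrases must match.

```python
from collections import Counter


def match_ratio(caption_tokens, voice_tokens):
    """Share of the voice text's tokens that also occur in the caption
    (multiset overlap; word order is ignored in this sketch)."""
    if not voice_tokens:
        return 0.0
    overlap = Counter(caption_tokens) & Counter(voice_tokens)
    return sum(overlap.values()) / len(voice_tokens)


def corresponds(caption_tokens, voice_tokens, threshold=0.7):
    """Judge correspondence using the roughly 70-80 % rule; 0.7 is the lower bound."""
    return match_ratio(caption_tokens, voice_tokens) >= threshold


caption = ["the", "weather", "forecast", "is", "next"]
misheard = ["the", "whether", "forecast", "is", "next"]  # one recognition error
unrelated = ["a", "river", "forecast"]

print(match_ratio(caption, misheard), corresponds(caption, misheard))    # 0.8 True
print(corresponds(caption, unrelated))                                   # False
```

A stricter variant could require the matched tokens to appear in the same order, at the cost of being less tolerant of recognition errors.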

Three kinds of correspondence between subtitle texts and voice texts are possible: one voice text corresponds to one subtitle text, a plurality of voice texts correspond to one subtitle text, or one voice text corresponds to a plurality of subtitle texts.

Based on the comparison result of the text comparison unit 14 (the correspondence between subtitle texts and voice texts), the subtitle data generation unit 15 associates each subtitle text with the PTS of the voice text that includes a word or phrase matching a word or phrase of that subtitle text, and generates new subtitle data.

How a subtitle text is associated with the PTS of a voice text depends on the correspondence between them. The following cases are possible.

(1) When one voice text corresponds to one subtitle text
As a rule, the subtitle text is associated with the PTS of the voice text that corresponds to it.

(2) When a plurality of voice texts correspond to one subtitle text
When one subtitle text contains matching words or phrases from a plurality of voice texts, the earliest of the PTS times of the corresponding voice texts is selected as the start time and the latest as the end time. These times are adopted as the PTS, and this PTS is associated with the subtitle text.

(3) When one voice text corresponds to a plurality of subtitle texts
When one voice text corresponds to a plurality of subtitle texts, the PTS interval of the voice text is divided by the number of corresponding subtitle texts. For example, when one voice text corresponds to two subtitle texts and the PTS of the voice text is 15:05:10-15:05:15, the interval is divided into 15:05:10-15:05:12 and 15:05:13-15:05:15; 15:05:10-15:05:12 is associated with the subtitle text with the earlier PTS, and 15:05:13-15:05:15 with the subtitle text with the later PTS. This is only an example; the PTS may instead be divided according to, for example, the number of characters in each subtitle text.
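The PTS handling for cases (2) and (3) can be sketched as below. All function names are hypothetical and PTS intervals are simplified to `(start, end)` pairs in seconds. Note one deliberate difference from the worked example in the text: the text rounds to whole seconds (10-12 and 13-15), whereas this sketch produces contiguous fractional intervals.

```python
def merge_pts(voice_spans):
    """Case (2): several voice texts, one subtitle text.
    Use the earliest start and the latest end of the voice texts."""
    return (min(s for s, _ in voice_spans), max(e for _, e in voice_spans))


def split_pts(span, n):
    """Case (3): one voice text, n subtitle texts. Split the interval evenly."""
    start, end = span
    step = (end - start) / n
    return [(start + i * step, start + (i + 1) * step) for i in range(n)]


def split_pts_by_length(span, captions):
    """Case (3), alternative mentioned in the text: split in proportion to
    the number of characters in each subtitle text."""
    start, end = span
    total = sum(len(c) for c in captions)
    if total == 0:
        return split_pts(span, len(captions))
    spans, cursor = [], start
    for c in captions:
        width = (end - start) * len(c) / total
        spans.append((cursor, cursor + width))
        cursor += width
    return spans


print(merge_pts([(12.0, 14.0), (14.0, 16.0)]))             # (12.0, 16.0)
print(split_pts((10.0, 15.0), 2))                          # [(10.0, 12.5), (12.5, 15.0)]
print(split_pts_by_length((0.0, 10.0), ["ab", "abcdefgh"]))  # [(0.0, 2.0), (2.0, 10.0)]
```

The spans returned by the split functions are in chronological order, so they are assigned to the subtitle texts in order of their original PTS.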

In any of cases (1), (2), and (3), however, there may be no voice text corresponding to the first or last word or phrase of a subtitle text, for example because of a recognition error. If, for example, the first word of the subtitle text has no match, the number of words that precede the first matched word in the subtitle text is determined. The voice text is then traced back by that number of words from the matched word, and the PTS of the voice text containing the word reached is adopted. This prevents a shift between the subtitle text and the video and audio from which it was produced.
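The backtracking rule above can be sketched as follows. This is one possible reading under stated assumptions: `start_pts_by_backtracking` is a hypothetical name, voice texts are given as chronologically ordered `(start_pts, tokens)` pairs, and the first occurrence of a matched token in the voice stream is taken as the match position, which a real implementation would need to disambiguate.

```python
def start_pts_by_backtracking(caption_tokens, voice_segments):
    """Return the start PTS for a caption whose leading words may be unmatched.

    voice_segments: list of (start_pts, tokens) in chronological order.
    Returns None if no caption token occurs in the voice stream.
    """
    # Flatten the voice texts into one token stream, remembering each token's PTS.
    flat = [(tok, start) for start, tokens in voice_segments for tok in tokens]
    voice_tokens = [tok for tok, _ in flat]

    for i, tok in enumerate(caption_tokens):
        if tok in voice_tokens:
            j = voice_tokens.index(tok)  # position of the first matched word
            back = max(0, j - i)         # step back by the i unmatched leading words
            return flat[back][1]         # PTS of the voice text containing that word
    return None


voice_segments = [
    (8.0, ["hello"]),
    (12.0, ["Doctor", "A", "and"]),  # "Actor" misrecognized as "Doctor"
    (14.0, ["actress", "B"]),
]
caption_tokens = ["Actor", "A", "and", "actress", "B"]
print(start_pts_by_backtracking(caption_tokens, voice_segments))  # 12.0
```

In the example, the first matched word is "A"; one word precedes it in the caption, so the sketch steps back one token in the voice stream to "Doctor" and adopts the PTS of the voice text that contains it.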

The subtitle data generation unit 15 then holds a table of subtitle data in which PTSs and subtitle texts are associated as described above.

Creating the subtitle data table takes some processing time. However, simultaneous distribution is generally sent out about 30 seconds behind the on-air terrestrial digital broadcast. By creating a subtitle table synchronized with the video and audio during those 30 seconds, the video, audio, and subtitles can be sent out in synchronization.

The distribution content data generation unit 16 receives the video file and the audio file from the separation unit 11 and converts them into a format such as HTTP Live Streaming (HLS). The distribution content data generation unit 16 also reads the subtitle data from the subtitle data generation unit 15 and converts it into a subtitle format such as Web Video Text Tracks (WebVTT).
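As a rough illustration of the WebVTT conversion, the sketch below serializes `(start, end, text)` cues into a WebVTT document. The function names are hypothetical, and it assumes cue times have already been mapped to seconds from the start of the stream; how broadcast PTS values are mapped onto that timeline, and any additional packaging that HLS requires, are outside this sketch.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues):
    """Serialize (start_seconds, end_seconds, text) cues as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


print(to_webvtt([(12.0, 16.0, "Hello"), (3661.5, 3663.0, "World")]))
```

Each cue consists of a timing line followed by the text, separated from the next cue by a blank line, after the mandatory `WEBVTT` header.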

The real-time subtitle data acquisition unit 12, the voice text acquisition unit 13, the text comparison unit 14, and the subtitle data generation unit 15 together constitute a subtitle data generation device 5.

Next, the operation of the content distribution device 1 will be described. FIG. 3 is a diagram for explaining the operation of the content distribution device 1.

The content distribution device 1 receives a content file of terrestrial digital broadcasting. The content file can include a video file in which subtitles are not superimposed on the video, an audio file (both including PTS), and ancillary data. The ancillary data includes subtitle data. The subtitle data includes at least the character strings to be superimposed on the video and the display timing of each character string (PTS: display start timing). It may also include the display position of each character string (position within the video) and so on.

The separation unit 11 separates the subtitle data from the ancillary data and outputs it to the real-time subtitle data acquisition unit 12. The separation unit 11 also outputs the audio file to the voice recognition device 2 as audio data (including PTS).

The real-time subtitle data acquisition unit 12 extracts the subtitle data (subtitle text and PTS) from the ancillary data and outputs the subtitle text and PTS to the text comparison unit 14. In the example of FIG. 3, the following subtitle texts and PTSs are acquired: "15:05:20-15:05:22 俳優の〇〇さんと女優の□□さんが" ("Actor 〇〇-san and actress □□-san"; 8 seconds behind the video), "15:05:27-15:05:29 今日結婚の報告を行いました。" ("announced their marriage today."; 11 seconds behind the video), "15:05:38-15:05:40 俳優仲間など大勢に祝福されていました。" ("They were congratulated by many people, including fellow actors."; 18 seconds behind the video), and "15:05:47-15:05:49 では話題が変わって天気予報です。" ("Now, changing the subject, here is the weather forecast."; 15 seconds behind the video). The delays relative to the actual video are noted here only for the explanation that follows and are not part of the actual data.

Meanwhile, the voice text acquisition unit 13 acquires the voice text from the voice recognition device 2, associates the acquired voice text with the time stamp of its audio, and outputs both to the text comparison unit 14. In the example of FIG. 3, the following voice texts and time stamps are acquired: "15:05:12-15:05:14 俳優の〇〇山と", "15:05:14-15:05:16 女優の□□山が", "15:05:16-15:05:18 今日血痕の報告を行いました。", "15:05:20-15:05:22 俳優仲間など", "15:05:23-15:05:25 大勢に宿泊されていました。", and "15:05:32-15:05:34 では話題が川で天気予報です。". These contain recognition errors: さん ("-san") is recognized as 山 ("mountain"), 結婚 ("marriage") as its homophone 血痕 ("bloodstain"), 祝福 ("congratulated") as 宿泊 ("lodging"), and 変わって ("changing") as 川で ("at the river").

The text comparison unit 14 performs morphological analysis on the subtitle text acquired by the real-time subtitle data acquisition unit 12 and on the voice text acquired by the voice text acquisition unit 13, and detects matching words word by word.

図３の例のリアルタイム字幕に対して形態素解析の結果は、以下の通りである。尚、各単語は括弧書きしている。
・15:05:20-15:05:22の字幕テキスト：（俳優）（の）（〇〇）（さん）（と）（女優）（の）（□□）（さん）（が）
・15:05:27-15:05:29の字幕テキスト：（今日）（結婚）（の）（報告）（を）（行い）（ました）（。）
・15:05:38-15:05:40の字幕テキスト：（俳優）（仲間）（など）（大勢）（に）（祝福）（されて）（いました）（。）
・15:05:47-15:05:49の字幕テキスト：（では）（話題）（が）（変わって）（天気）（予報）（です）（。） The results of morphological analysis for the real-time subtitles in the example of FIG. 3 are as follows. Each word is written in parentheses.
・ Subtitle text at 15:05:20-15:05:22: (actor) (of) (〇〇) (san) (and) (actress) (of) (□□) (san) (ga)
・ Subtitle text at 15:05:27-15:05:29: (today) (marriage) (of) (report) (wo) (made) (mashita) (.)
・ Subtitle text at 15:05:38-15:05:40: (actor) (fellow) (etc.) (many) (ni) (congratulate) (sarete) (imashita) (.)
・ Subtitle text at 15:05:47-15:05:49: (dewa) (subject) (ga) (changing) (weather) (forecast) (desu) (.)

図３の例の音声テキストに対して形態素解析の結果は、以下の通りである。尚、各単語は括弧書きしている。
・15:05:12-15:05:13の音声テキスト：（俳優）（の）（〇〇）（山）（と）
・15:05:14-15:05:15の音声テキスト：（女優）（の）（□□）（山）（が）
・15:05:16-15:05:18の音声テキスト：（今日）（血痕）（の）（報告）（を）（行い）（ました）（。）
・15:05:20-15:05:22の音声テキスト：（俳優）（仲間）（など）
・15:05:23-15:05:25の音声テキスト：（大勢）（に）（宿泊）（されて）（いました）（。）
・15:05:32-15:05:34の音声テキスト：（では）（話題）（が）（川）（で）（天気）（予報）（です）（。） The results of the morphological analysis of the voice text in the example of FIG. 3 are as follows. Each word is written in parentheses.
・ Voice text at 15:05:12-15:05:13: (actor) (of) (〇〇) (mountain) (and)
・ Voice text at 15:05:14-15:05:15: (actress) (of) (□□) (mountain) (ga)
・ Voice text at 15:05:16-15:05:18: (today) (bloodstain) (of) (report) (wo) (made) (mashita) (.)
・ Voice text at 15:05:20-15:05:22: (actor) (fellow) (etc.)
・ Voice text at 15:05:23-15:05:25: (many) (ni) (lodging) (sarete) (imashita) (.)
・ Voice text at 15:05:32-15:05:34: (dewa) (subject) (ga) (river) (de) (weather) (forecast) (desu) (.)

テキスト比較部１４は、ひとつの字幕テキストを基準とし、そのＰＴＳから前後２０秒間のＰＴＳを持つ音声テキストに対して一致又は不一致の検出を行う。図３の例では、まず、「15:05:20-15:05:22の字幕テキスト：（俳優）（の）（〇〇）（さん）（と）（女優）（の）（□□）（さん）（が）」と、単語が一致する音声テキストの単語を検出していく。すると、「15:05:12-15:05:13の音声テキスト：（俳優）（の）（〇〇）（山）（と）」は、字幕テキストの“（さん）”と音声テキストの“（山）”とが相違するが、他の部分は一致している。同様に、「15:05:14-15:05:15の音声テキスト：（女優）（の）（□□）（山）（が）」は、字幕テキストの“（さん）”と音声テキストの“（山）”とが相違するが、他の部分は一致している。 Taking one subtitle text as the reference, the text comparison unit 14 checks for matches and mismatches against the voice texts whose PTS falls within 20 seconds before or after the PTS of that subtitle text. In the example of FIG. 3, it first looks for words in the voice texts that match the words of "the subtitle text at 15:05:20-15:05:22: (actor) (of) (〇〇) (san) (and) (actress) (of) (□□) (san) (ga)". "The voice text at 15:05:12-15:05:13: (actor) (of) (〇〇) (mountain) (and)" differs only in that the subtitle text has "(san)" where the voice text has "(mountain)"; the rest matches. Likewise, "the voice text at 15:05:14-15:05:15: (actress) (of) (□□) (mountain) (ga)" differs only in "(san)" versus "(mountain)"; the rest matches.

従って、音声テキスト「俳優の〇〇山と」は、音声認識装置２によって、15:05:12-15:05:13の音声「俳優の〇〇さんと女優の□□さんが」の一部がテキスト化されたものであると推測できる。同様に、音声テキスト「女優の□□山が」は、音声認識装置２によって、15:05:12-15:05:13の音声「俳優の〇〇さんと女優の□□さんが」の一部がテキスト化されたものであると推測できる。 It can therefore be inferred that the voice text "Actor 〇〇 Mountain and" is the voice recognition device 2's transcription of part of the 15:05:12-15:05:13 audio "Actor 〇〇 and actress □□". Similarly, it can be inferred that the voice text "actress □□ Mountain" is the voice recognition device 2's transcription of part of the 15:05:12-15:05:13 audio "Actor 〇〇 and actress □□".

同様な検出処理を行うと、以下のような音声テキストと字幕テキストとの対応関係が得られる。
・15:05:12-15:05:13の音声テキスト「俳優の〇〇山と」及び15:05:14-15:05:15の音声テキスト「女優の□□山が」と、15:05:20-15:05:22の字幕テキスト「俳優の〇〇さんと女優の□□さんが」とが対応する。
・15:05:16-15:05:18の音声テキスト「今日血痕の報告を行いました。」と、15:05:27-15:05:29の字幕テキスト「今日結婚の報告を行いました。」とが対応する。
・15:05:20-15:05:22の音声テキスト「俳優仲間など」及び15:05:23-15:05:25の音声テキスト「大勢に宿泊されていました。」と、15:05:38-15:05:40の字幕テキスト「俳優仲間など大勢に祝福されていました。」とが対応する。
・15:05:32-15:05:34の音声テキスト「では話題が川で天気予報です。」と、15:05:47-15:05:49の字幕テキスト「では話題が変わって天気予報です。」とが対応する。 Performing the same detection process yields the following correspondence between voice text and subtitle text.
・ The voice text "Actor 〇〇 Mountain and" at 15:05:12-15:05:13 and the voice text "actress □□ Mountain" at 15:05:14-15:05:15 correspond to the subtitle text "Actor 〇〇 and actress □□" at 15:05:20-15:05:22.
・ The voice text "reported a bloodstain today." at 15:05:16-15:05:18 corresponds to the subtitle text "reported their marriage today." at 15:05:27-15:05:29.
・ The voice text "fellow actors and others" at 15:05:20-15:05:22 and the voice text "were lodged by many people." at 15:05:23-15:05:25 correspond to the subtitle text "They were congratulated by many fellow actors and others." at 15:05:38-15:05:40.
・ The voice text "Now, the subject is a river and here is the weather forecast." at 15:05:32-15:05:34 corresponds to the subtitle text "Now, changing the subject, here is the weather forecast." at 15:05:47-15:05:49.
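
The window search and word-level comparison described above can be sketched as follows. This is a minimal Python illustration, not the implementation of the text comparison unit 14: the morphemes are entered by hand rather than produced by a morphological analyzer, the names `Segment`, `lcs_len`, and `correspondences` are hypothetical, and both the longest-common-subsequence score and the 0.6 threshold are assumptions, because the description does not state how many matching words are required.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int    # start time, seconds since midnight
    end: int      # end time, seconds since midnight
    tokens: list  # morphemes, assumed to come from a morphological analyzer


def sec(hms):
    """Convert "HH:MM:SS" to seconds since midnight."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h * 3600 + m * 60 + s


def seg(start, end, tokens):
    return Segment(sec(start), sec(end), tokens.split())


def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def correspondences(captions, voices, window=20, threshold=0.6):
    """For each caption, return the indices of voice segments that start
    within +/- window seconds of it and share enough words with it.
    The scoring rule and threshold are assumptions for illustration."""
    result = []
    for cap in captions:
        hits = []
        for i, v in enumerate(voices):
            if abs(v.start - cap.start) > window:
                continue  # outside the 20-second search window
            if lcs_len(cap.tokens, v.tokens) / len(v.tokens) >= threshold:
                hits.append(i)
        result.append(hits)
    return result


captions = [
    seg("15:05:20", "15:05:22", "俳優 の 〇〇 さん と 女優 の □□ さん が"),
    seg("15:05:27", "15:05:29", "今日 結婚 の 報告 を 行い ました 。"),
    seg("15:05:38", "15:05:40", "俳優 仲間 など 大勢 に 祝福 されて いました 。"),
    seg("15:05:47", "15:05:49", "では 話題 が 変わって 天気 予報 です 。"),
]
voices = [
    seg("15:05:12", "15:05:13", "俳優 の 〇〇 山 と"),
    seg("15:05:14", "15:05:15", "女優 の □□ 山 が"),
    seg("15:05:16", "15:05:18", "今日 血痕 の 報告 を 行い ました 。"),
    seg("15:05:20", "15:05:22", "俳優 仲間 など"),
    seg("15:05:23", "15:05:25", "大勢 に 宿泊 されて いました 。"),
    seg("15:05:32", "15:05:34", "では 話題 が 川 で 天気 予報 です 。"),
]

print(correspondences(captions, voices))  # [[0, 1], [2], [3, 4], [5]]
```

With the FIG. 3 data, each caption is paired with the same voice texts as in the list above. A real system would tune the scoring rule, since short voice texts made of common particles could otherwise match the wrong caption.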

テキスト比較部１４は、このような結果を、字幕データ生成部１５に出力する。 The text comparison unit 14 outputs such a result to the subtitle data generation unit 15.

字幕データ生成部１５は、テキスト比較部１４の比較結果を受け、字幕テキストと、この字幕テキストと一致する音声テキストのＰＴＳとを関連付けて字幕データを生成する。 The subtitle data generation unit 15 receives the comparison result of the text comparison unit 14, and generates the subtitle data by associating the subtitle text with the PTS of the audio text that matches the subtitle text.

例えば、字幕データ生成部１５は、15:05:12-15:05:13の音声テキスト「俳優の〇〇山と」及び15:05:14-15:05:15の音声テキスト「女優の□□山が」と、15:05:20-15:05:22の字幕テキスト「俳優の〇〇さんと女優の□□さんが」との対応関係から、音声テキストの最も早い時刻“15:05:12”を開始時刻とし、音声テキストの最も遅い時刻である“15:05:15”を終了時刻とするＰＴＳ（15:05:12-15:05:15）を作成し、これと字幕テキスト「俳優の〇〇さんと女優の□□さんが」とを関連付けて字幕データを生成する。このような関連付けにより、本来、映像・音声と８秒遅延している字幕テキストを、本来の映像・音声の時刻に変更することができる。 For example, from the correspondence between the voice text "Actor 〇〇 Mountain and" at 15:05:12-15:05:13 and the voice text "actress □□ Mountain" at 15:05:14-15:05:15 on the one hand, and the subtitle text "Actor 〇〇 and actress □□" at 15:05:20-15:05:22 on the other, the subtitle data generation unit 15 creates a PTS (15:05:12-15:05:15) whose start time is the earliest voice-text time, "15:05:12", and whose end time is the latest voice-text time, "15:05:15", and generates subtitle data by associating this PTS with the subtitle text "Actor 〇〇 and actress □□". Through this association, subtitle text that originally lagged the video and audio by 8 seconds is moved to the time of the original video and audio.
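
The PTS calculation in this example amounts to taking the earliest start and the latest end of the matched voice texts. The following Python sketch shows that arithmetic under the assumption that times are plain "HH:MM:SS" strings; the helper names `sec`, `hms`, and `merged_pts` are hypothetical.

```python
def sec(hms_text):
    """Convert "HH:MM:SS" to seconds since midnight."""
    h, m, s = (int(p) for p in hms_text.split(":"))
    return h * 3600 + m * 60 + s


def hms(total):
    """Convert seconds since midnight back to "HH:MM:SS"."""
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def merged_pts(voice_ranges):
    """Earliest start and latest end over all matched voice texts."""
    starts = [sec(start) for start, _ in voice_ranges]
    ends = [sec(end) for _, end in voice_ranges]
    return hms(min(starts)), hms(max(ends))


# Voice texts matched to the caption originally shown at 15:05:20-15:05:22.
matched = [("15:05:12", "15:05:13"), ("15:05:14", "15:05:15")]
start, end = merged_pts(matched)
delay_removed = sec("15:05:20") - sec(start)

print(start, end, delay_removed)  # 15:05:12 15:05:15 8
```

The difference of 8 seconds between the original caption start and the new start agrees with the delay stated for the first caption.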

このような関連付けを、図３に示したように、全ての字幕データに対して行う。結果として、以下のような字幕データのテーブルを作成する。
15:05:12-15:05:15：俳優の〇〇さんと女優の□□さんが
15:05:16-15:05:18：今日結婚の報告を行いました。
15:05:20-15:05:25：俳優仲間など大勢に祝福されていました。
15:05:32-15:05:34：では話題が変わって天気予報です。 As shown in FIG. 3, such an association is performed for all subtitle data. As a result, the following table of subtitle data is created.
15:05:12-15:05:15: Actor 〇〇 and actress □□
15:05:16-15:05:18: reported their marriage today.
15:05:20-15:05:25: They were congratulated by many fellow actors and others.
15:05:32-15:05:34: Now, changing the subject, here is the weather forecast.

尚、上述したように、字幕データのテーブルの作成には多少処理時間を要するが、同時配信は地上波デジタル放送のオンエアから３０秒程度遅れて送出されるのが一般的であるので、その時間内に字幕データのテーブルの作成は十分可能である。 As noted above, creating the subtitle data table takes some processing time. However, simultaneous distribution is generally sent out about 30 seconds after the terrestrial digital broadcast goes on air, so the table can comfortably be created within that time.

次に、配信コンテンツデータ生成部１６は、分離部１１から受信した映像ファイル及び音声ファイルを、字幕データを生成するために要する時間分蓄積し、例えば、HTTP Live Streaming (HLS)などの方式に変換する。 Next, the distribution content data generation unit 16 buffers the video file and the audio file received from the separation unit 11 for the time needed to generate the subtitle data, and converts them into a format such as HTTP Live Streaming (HLS).

また、配信コンテンツデータ生成部１６は、字幕データのテーブルを参照し、字幕テキストを、Web Video Text Tracks (WebVTT)などの字幕フォーマットに変換する。上述した字幕データのテーブルに示した字幕データを、WebVTTの字幕フォーマットに変換した一例を示す。
15:05:12 --> 15:05:15
俳優の〇〇さんと女優の□□さんが
15:05:16 --> 15:05:18
今日結婚の報告を行いました。
15:05:20 --> 15:05:25
俳優仲間など大勢に祝福されていました。
15:05:32 --> 15:05:34
では話題が変わって天気予報です。 The distribution content data generation unit 16 also refers to the subtitle data table and converts the subtitle text into a subtitle format such as Web Video Text Tracks (WebVTT). The following is an example of the subtitle data in the above table converted into the WebVTT subtitle format.
15:05:12 --> 15:05:15
Actor 〇〇 and actress □□
15:05:16 --> 15:05:18
reported their marriage today.
15:05:20 --> 15:05:25
They were congratulated by many fellow actors and others.
15:05:32 --> 15:05:34
Now, changing the subject, here is the weather forecast.
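
As a rough illustration of this conversion, the Python sketch below turns the subtitle data table into WebVTT text. It is an assumption-laden sketch rather than the method of the distribution content data generation unit 16: the function name `to_webvtt` is hypothetical, and, unlike the abbreviated example above, it emits the `WEBVTT` header line and millisecond fields that a standalone WebVTT file needs. How these clock times would be mapped onto the HLS media timeline is not covered here.

```python
def to_webvtt(table):
    """Render (start, end, text) rows, with "HH:MM:SS" times, as WebVTT."""
    lines = ["WEBVTT", ""]  # file header followed by a blank line
    for start, end, text in table:
        # WebVTT cue timings carry milliseconds: HH:MM:SS.mmm
        lines.append(f"{start}.000 --> {end}.000")
        lines.append(text)
        lines.append("")    # a blank line terminates each cue
    return "\n".join(lines)


table = [
    ("15:05:12", "15:05:15", "俳優の〇〇さんと女優の□□さんが"),
    ("15:05:16", "15:05:18", "今日結婚の報告を行いました。"),
    ("15:05:20", "15:05:25", "俳優仲間など大勢に祝福されていました。"),
    ("15:05:32", "15:05:34", "では話題が変わって天気予報です。"),
]

vtt = to_webvtt(table)
print(vtt)
```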

配信コンテンツデータ生成部１６は、配信用のコンテンツデータを生成し、視聴端末３に送信する。 The distribution content data generation unit 16 generates content data for distribution and transmits it to the viewing terminal 3.

本実施の形態は、元となる映像又は音声に対してテキスト内容は正確であるが、元となる映像又は音声の出力タイミングに対して出力タイミングが遅延しているリアルタイム字幕データと、元となる映像又は音声に対してテキスト内容は余り正確ではないが、元となる映像又は音声の出力タイミングと同様な出力タイミングを持つ音声認識字幕データとを用いている。そして、リアルタイム字幕データの字幕テキストと音声認識字幕データの音声テキストとを比較することにより、字幕テキストと音声テキストとの対応関係を調べ、その対応関係に基づいて、内容が正確な字幕テキストと、元となる映像又は音声の出力タイミングと同様な出力タイミングである音声認識字幕データの出力タイミングとを関連付けて、新たな字幕データを生成している。 The present embodiment uses two sources: real-time subtitle data, whose text is accurate with respect to the source video or audio but whose output timing lags behind that of the source video or audio; and voice-recognition subtitle data, whose text is not very accurate with respect to the source video or audio but whose output timing is essentially the same as that of the source video or audio. By comparing the subtitle text of the real-time subtitle data with the voice text of the voice-recognition subtitle data, the embodiment determines the correspondence between the two. Based on that correspondence, it generates new subtitle data by associating the accurate subtitle text with the output timing of the voice-recognition subtitle data, which is essentially the same as the output timing of the source video or audio.

このような構成により、本実施の形態は、映像又は音声の内容が正確にテキスト化された字幕を、映像、音声と同期して表示することができる。 With such a configuration, in the present embodiment, subtitles in which the content of the video or audio is accurately converted into text can be displayed in synchronization with the video and audio.

尚、本実施の形態では、リアルタイム字幕データから、字幕テキスト及び字幕テキストの表示時刻であるＰＴＳを用いる例を説明したが、字幕テキストのＰＴＳは必須なものでなく、字幕テキストのＰＴＳを用いなくても本発明は実施可能である。 The present embodiment has described an example that uses, from the real-time subtitle data, the subtitle text and its PTS (the display time of the subtitle text). However, the PTS of the subtitle text is not essential, and the present invention can be carried out without it.

＜第１の実施の形態の変形例１＞
次に、本実施の形態の変形例１を説明する。 <Modification 1 of the first embodiment>
Next, a modification 1 of the present embodiment will be described.

オペレータの入力によるリアルタイム字幕は、スピード優先で確定した単語又は文節から次々と送出していくため、２文字から１０文字程度の非常に短い字幕が送出されるケースが頻繁にある。 Since real-time subtitles input by the operator are transmitted one after another from words or phrases determined with priority on speed, there are many cases where very short subtitles of about 2 to 10 characters are transmitted.

また、表示した字幕は、通常、２秒程度は表示する必要があるため、短い字幕が短時間で切り替わって表示される。図４は上述した実施の形態において表示タイミング修正後において、視聴端末３に表示される字幕の例を示したものである。図４に示されるように、短い字幕が短時間で切り替わって表示されている。このように、短い字幕が短時間で切り替わって表示されるのは、字幕が見難いという課題があった。一方、テレビ放送の字幕は、１ページあたり４２文字（２１文字×２行）を表示することが可能である。 Further, since the displayed subtitles usually need to be displayed for about 2 seconds, short subtitles are switched and displayed in a short time. FIG. 4 shows an example of subtitles displayed on the viewing terminal 3 after the display timing is corrected in the above-described embodiment. As shown in FIG. 4, short subtitles are switched and displayed in a short time. In this way, the short subtitles being switched and displayed in a short time have a problem that the subtitles are difficult to see. On the other hand, the subtitles of television broadcasting can display 42 characters (21 characters x 2 lines) per page.

そこで、実施の形態の変形例１では、字幕データ生成部１５は、短い字幕（単語や文節）を連結し、字幕を見やすくする例を説明する。 Therefore, in the first modification of the embodiment, the subtitle data generation unit 15 will explain an example in which short subtitles (words and phrases) are connected to make the subtitles easier to see.

実施の形態の変形例１の字幕データ生成部１５は、短い字幕（単語や文節）を、４２文字以内まで連結し、その字幕の表示タイミング（ＰＴＳ）を変更する。但し、文の意味のない所で字幕が切れてしまうと、かえって読み難いものとなるため、４２文字以内、かつ、所定の区切り記号で字幕が終了するように、字幕を連結する。ここで、所定の区切り記号は句点（「。」又は「．」）が代表的であるが、これに限られない。例えば、コロン（：）や、セミコロン（；）、感嘆符（！）、疑問符（？）等でも良い。 The subtitle data generation unit 15 of Modification 1 concatenates short subtitles (words and phrases) up to a maximum of 42 characters and changes the display timing (PTS) of the resulting subtitle. However, a subtitle that breaks off at a point with no semantic boundary becomes harder to read, so subtitles are concatenated such that the result is within 42 characters and ends with a predetermined delimiter. The typical delimiter is the period ("。" or "．"), but it is not limited to this; a colon (：), semicolon (；), exclamation mark (！), or question mark (？), for example, may also be used.

具体的な動作を説明すると、字幕データ生成部１５は、音声テキストとタイムスタンプとの関連付けの終了後、４２文字以内、かつ、所定の区切り記号で字幕が終了するように、字幕を連結する。ここでは、所定の区切り記号を句点（「。」又は「．」）として説明する。 In concrete terms, once the association between voice text and time stamps is complete, the subtitle data generation unit 15 concatenates subtitles so that the result is within 42 characters and ends with a predetermined delimiter. Here, the predetermined delimiter is taken to be the period ("。" or "．").

字幕データ生成部１５は、音声テキストとタイムスタンプとの関連付けの終了時点で以下のようなテーブルを保持している。
15:05:12-15:05:15：俳優の〇〇さんと女優の□□さんが
15:05:16-15:05:18：今日結婚の報告を行いました。
15:05:20-15:05:25：俳優仲間など大勢に祝福されていました。
15:05:32-15:05:34：では話題が変わって天気予報です。 The subtitle data generation unit 15 holds the following table at the end of the association between the voice text and the time stamp.
15:05:12-15:05:15: Actor 〇〇 and actress □□
15:05:16-15:05:18: reported their marriage today.
15:05:20-15:05:25: They were congratulated by many fellow actors and others.
15:05:32-15:05:34: Now, changing the subject, here is the weather forecast.

字幕データ生成部１５は、最初の字幕である「俳優の〇〇さんと女優の□□さんが」と連結できる字幕を検索する。条件は、４２文字以内、かつ、句点（「。」又は「．」）で字幕が終了することである。すると、次の字幕「今日結婚の報告を行いました。」は連結可能であるが、その次の字幕「俳優仲間など大勢に祝福されていました。」まで連結すると、４２文字を超えてしまい、連結することはできない。 The subtitle data generation unit 15 searches for subtitles that can be concatenated with the first subtitle, "Actor 〇〇 and actress □□". The conditions are that the result is within 42 characters and ends with a period ("。" or "．"). The next subtitle, "reported their marriage today.", can be concatenated, but concatenating as far as the subtitle after that, "They were congratulated by many fellow actors and others.", would exceed 42 characters, so that subtitle cannot be concatenated. (The character counts refer to the Japanese text: the first two subtitles total 16 + 14 = 30 characters, and adding the 19-character third subtitle would give 49.)

そこで、最初の字幕「俳優の〇〇さんと女優の□□さんが」と次の字幕「今日結婚の報告を行いました。」とを連結し、ひとつの字幕として「俳優の〇〇さんと女優の□□さんが今日結婚の報告を行いました。」を生成する。そして、この字幕のＰＴＳとして、最初の字幕のＰＴＳと次の字幕のＰＴＳとを連結したものを使用する。すなわち、（15:05:12-15:05:18）である。 The first subtitle, "Actor 〇〇 and actress □□", and the next subtitle, "reported their marriage today.", are therefore concatenated into a single subtitle, "Actor 〇〇 and actress □□ reported their marriage today." The PTS of this subtitle is formed by joining the PTS of the first subtitle and the PTS of the next subtitle, that is, (15:05:12-15:05:18).

同様に、字幕「俳優仲間など大勢に祝福されていました。」と連結できる字幕を検索する。条件は、４２文字以内、かつ、句点（「。」又は「．」）で字幕が終了することである。すると、次の字幕「では話題が変わって天気予報です。」は連結可能である。そこで、最初の字幕「俳優仲間など大勢に祝福されていました。」と次の字幕「では話題が変わって天気予報です。」とを連結し、ひとつの字幕として「俳優仲間など大勢に祝福されていました。では話題が変わって天気予報です。」を生成する。そして、この字幕のＰＴＳとして、最初の字幕のＰＴＳと次の字幕のＰＴＳとを連結したものを使用する。すなわち、（15:05:20-15:05:34）である。 Similarly, the unit searches for subtitles that can be concatenated with the subtitle "They were congratulated by many fellow actors and others." The conditions are again that the result is within 42 characters and ends with a period ("。" or "．"). The next subtitle, "Now, changing the subject, here is the weather forecast.", can be concatenated. The two are therefore concatenated into a single subtitle, "They were congratulated by many fellow actors and others. Now, changing the subject, here is the weather forecast." The PTS of this subtitle is formed by joining the PTS of the first subtitle and the PTS of the next subtitle, that is, (15:05:20-15:05:34).
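
The concatenation rule walked through above can be sketched as a greedy pass over the subtitle table. This Python sketch is one possible reading, not the definitive procedure of the subtitle data generation unit 15: the function name `concatenate` is hypothetical, the delimiter set is limited to the two period characters, and a subtitle that cannot be extended to a delimiter within the limit is simply emitted unchanged, which the description does not specify.

```python
MAX_CHARS = 42      # 21 characters x 2 lines
DELIMITERS = "。．"  # period characters used as the predetermined delimiter


def concatenate(table):
    """Join consecutive (start, end, text) rows while the joined text stays
    within MAX_CHARS, keeping the longest join that ends in a delimiter."""
    out = []
    i = 0
    while i < len(table):
        start, end, text = table[i]
        best_index, best_end, best_text = i, end, text  # fallback: row as is
        joined = text
        j = i
        while j + 1 < len(table) and len(joined) + len(table[j + 1][2]) <= MAX_CHARS:
            j += 1
            joined += table[j][2]
            if joined[-1] in DELIMITERS:
                best_index, best_end, best_text = j, table[j][1], joined
        out.append((start, best_end, best_text))
        i = best_index + 1
    return out


table = [
    ("15:05:12", "15:05:15", "俳優の〇〇さんと女優の□□さんが"),
    ("15:05:16", "15:05:18", "今日結婚の報告を行いました。"),
    ("15:05:20", "15:05:25", "俳優仲間など大勢に祝福されていました。"),
    ("15:05:32", "15:05:34", "では話題が変わって天気予報です。"),
]

for row in concatenate(table):
    print(row)
```

With the example table this yields two rows, spanning 15:05:12-15:05:18 and 15:05:20-15:05:34, because adding the third subtitle to the first pair would bring the total to 49 characters.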

結果として、字幕データ生成部１５は、以下のテーブルを保有することになる。尚、“／”は改行を示しているが、２１文字目と２２文字目とがひとつの単語である場合、その前の単語の切れ目で改行を行っても良い。
15:05:12-15:05:18：俳優の〇〇さんと女優の□□さんが今日結婚の／報告を行いました。
15:05:20-15:05:34：俳優仲間など大勢に祝福されていました。では／話題が変わって天気予報です。 As a result, the subtitle data generation unit 15 has the following table. Although "/" indicates a line break, if the 21st character and the 22nd character are one word, a line break may be performed at the break of the word before the word.
15:05:12-15:05:18: Actor 〇〇 and actress □□ today / reported their marriage.
15:05:20-15:05:34: They were congratulated by many fellow actors and others. Now, / changing the subject, here is the weather forecast.
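
The line break at a word boundary can be sketched as follows. This Python sketch assumes the concatenated subtitle is available as a list of morphemes (entered by hand here) and uses a hypothetical function name `wrap`; it moves a word that would cross the 21-character limit to the next line. It does not enforce the two-line limit or any rule against a line starting with punctuation, so it is only an illustration of the idea.

```python
def wrap(tokens, width=21):
    """Pack morphemes into lines of at most `width` characters,
    breaking only between morphemes."""
    lines, current = [], ""
    for token in tokens:
        if current and len(current) + len(token) > width:
            lines.append(current)  # the token would cross the limit
            current = ""
        current += token
    if current:
        lines.append(current)
    return lines


first = "俳優 の 〇〇 さん と 女優 の □□ さん が 今日 結婚 の 報告 を 行い ました 。".split()
second = "俳優 仲間 など 大勢 に 祝福 されて いました 。 では 話題 が 変わって 天気 予報 です 。".split()

print("／".join(wrap(first)))   # 俳優の〇〇さんと女優の□□さんが今日結婚の／報告を行いました。
print("／".join(wrap(second)))  # 俳優仲間など大勢に祝福されていました。では／話題が変わって天気予報です。
```

In both examples the first line happens to fill exactly 21 characters, so the breaks coincide with the "／" positions shown in the table.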

図５は連結した字幕の表示例を示したものである。このように、ある程度まとまった字幕が長く表示されるので、視聴者は字幕の意味を理解しながら映像を視聴することができる。 FIG. 5 shows a display example of connected subtitles. In this way, since the subtitles that are organized to some extent are displayed for a long time, the viewer can watch the video while understanding the meaning of the subtitles.

尚、第１の実施の形態の変形例１は、上述した字幕テキストと音声テキストとの対応関係において、複数の字幕テキストに対してひとつの音声テキストが対応する場合に特に有効である。複数の字幕テキストをひとつの字幕テキストにまとめることにより、音声テキストのＰＴＳを分割することなく、使用することができるからである。 The modification 1 of the first embodiment is particularly effective when one voice text corresponds to a plurality of subtitle texts in the above-mentioned correspondence relationship between the subtitle text and the voice text. This is because by combining a plurality of subtitle texts into one subtitle text, the PTS of the audio text can be used without being divided.

＜第１の実施の形態の変形例２＞
第１の実施の形態の変形例２を説明する。 <Modification 2 of the first embodiment>
A modified example 2 of the first embodiment will be described.

上述したコンテンツ配信装置１は、具体的には、各種の演算処理等を行うプロセッサを有するコンピュータシステムによって実現することができる。図６はコンピュータシステムによって構成されたコンテンツ配信装置１のブロック図である。 Specifically, the content distribution device 1 described above can be realized by a computer system having a processor that performs various arithmetic processes and the like. FIG. 6 is a block diagram of the content distribution device 1 configured by the computer system.

コンテンツ配信装置１は、図６に示す如く、プロセッサ１０１、メモリ（ＲＯＭやＲＡＭ）１０２、入力装置（キーボード、マウス、タッチパネルなど）１０３、通信装置１０４、記憶装置（ハードディスク、半導体ディスクなど）１０５を有するコンピュータ１００により構成することができる。尚、記憶装置１０５は、コンピュータ１００と物理的に外部に設けられ、ＬＡＮ等のネットワークを介してコンピュータ１００と接続されていても良い。 As shown in FIG. 6, the content distribution device 1 can be configured by a computer 100 having a processor 101, a memory (ROM or RAM) 102, an input device (keyboard, mouse, touch panel, etc.) 103, a communication device 104, and a storage device (hard disk, solid-state disk, etc.) 105. The storage device 105 may be physically external to the computer 100 and connected to it via a network such as a LAN.

コンテンツ配信装置１は、記憶装置１０５に格納されたプログラムがメモリ１０２にロードされ、プロセッサ１０１により実行されることにより、分離処理１１１と、リアルタイム字幕データ取得処理１１２と、音声テキスト取得処理１１３と、テキスト比較処理１１４と、字幕データ生成処理１１５と、配信コンテンツデータ生成処理１１６とが実現されるものである。 In the content distribution device 1, a program stored in the storage device 105 is loaded into the memory 102 and executed by the processor 101, thereby realizing a separation process 111, a real-time subtitle data acquisition process 112, a voice text acquisition process 113, a text comparison process 114, a subtitle data generation process 115, and a distribution content data generation process 116.

ここで、分離処理１１１が分離部１１に対応し、リアルタイム字幕データ取得処理１１２がリアルタイム字幕データ取得部１２に対応し、音声テキスト取得処理１１３が音声テキスト取得部１３に対応し、テキスト比較処理１１４がテキスト比較部１４に対応し、字幕データ生成処理１１５が字幕データ生成部１５に対応し、配信コンテンツデータ生成処理１１６が配信コンテンツデータ生成部１６に対応する。 Here, the separation process 111 corresponds to the separation unit 11, the real-time subtitle data acquisition process 112 corresponds to the real-time subtitle data acquisition unit 12, the voice text acquisition process 113 corresponds to the voice text acquisition unit 13, the text comparison process 114 corresponds to the text comparison unit 14, the subtitle data generation process 115 corresponds to the subtitle data generation unit 15, and the distribution content data generation process 116 corresponds to the distribution content data generation unit 16.

＜第２の実施の形態＞
上述した実施の形態で述べた内容が正確なリアルタイム字幕データのテキストと、表示タイミングは遅延していない音声認識字幕データの表示タイミングとを関連付けて新たな字幕データを生成するという技術的な思想は、コンテンツを配信する配信側だけではなく、テレビ放送を録画して視聴するレコーダなどにも応用することができる。そこで、第２の実施の形態は、上述した実施の形態における字幕データ生成装置５を、レコーダに適用した例を説明する。 <Second Embodiment>
The technical idea described in the above embodiment, namely generating new subtitle data by associating the text of the real-time subtitle data, whose content is accurate, with the display timing of the voice-recognition subtitle data, whose display timing is not delayed, can be applied not only on the distribution side that distributes content but also to devices such as recorders that record television broadcasts for viewing. The second embodiment therefore describes an example in which the subtitle data generation device 5 of the above embodiment is applied to a recorder.

図７は第２の実施の形態におけるレコーダのブロック図である。 FIG. 7 is a block diagram of the recorder according to the second embodiment.

レコーダ５０は、分離部１１と、リアルタイム字幕データ取得部１２、音声テキスト取得部１３、テキスト比較部１４及び字幕データ生成部１５を備える字幕データ生成装置５とに加え、記憶部５１と、表示制御部５２とを備える。 The recorder 50 includes a separation unit 11 and a subtitle data generation device 5 comprising a real-time subtitle data acquisition unit 12, a voice text acquisition unit 13, a text comparison unit 14, and a subtitle data generation unit 15, and additionally includes a storage unit 51 and a display control unit 52.

記憶部５１は、例えば、地上デジタル放送を録画したデータが格納される。 The storage unit 51 stores, for example, data recorded from terrestrial digital broadcasting.

分離部１１は、記憶部５１に格納されているデータを読み出し、上述した実施の形態と同様に、アンシラリーデータから字幕データを分離し、字幕データをリアルタイム字幕データ取得部１２に出力し、音声ファイルを音声データ（ＰＴＳを含む）として、音声認識装置２に出力する。 The separation unit 11 reads the data stored in the storage unit 51 and, as in the embodiment described above, separates the subtitle data from the ancillary data, outputs the subtitle data to the real-time subtitle data acquisition unit 12, and outputs the audio file to the voice recognition device 2 as audio data (including PTS).

音声テキスト取得部１３は、音声認識装置２から、音声テキストを取得し、取得した音声テキストと、その音声のタイムスタンプとを対応付けて、テキスト比較部１４に出力する。 The voice text acquisition unit 13 acquires the voice text from the voice recognition device 2, associates the acquired voice text with the time stamp of the voice, and outputs the voice text to the text comparison unit 14.

テキスト比較部１４は、リアルタイム字幕データ取得部１２が取得した字幕テキストと、音声テキスト取得部１３が取得した音声テキストとに対して、形態素解析を行い、単語単位で一致する単語を検出する。そして、テキスト比較部１４は、字幕テキストと音声テキストとの対応関係を判断する。 The text comparison unit 14 performs morphological analysis on the subtitle text acquired by the real-time subtitle data acquisition unit 12 and the voice text acquired by the voice text acquisition unit 13, and detects words that match on a word-by-word basis. Then, the text comparison unit 14 determines the correspondence between the subtitle text and the voice text.

字幕データ生成部１５は、テキスト比較部１４の比較結果に基づいて、字幕テキストと、この字幕テキストの単語又は文節と一致する単語又は文節を含む音声テキストのＰＴＳとを関連付け、新たな字幕データを生成する。 Based on the comparison result of the text comparison unit 14, the subtitle data generation unit 15 associates the subtitle text with the PTS of the voice text that contains a word or phrase matching a word or phrase of the subtitle text, and generates new subtitle data.

表示制御部５２は、分離部１１から映像ファイル及び音声ファイルを受信して一時的に蓄積し、字幕データ生成部１５の字幕データテーブルから字幕データを読み出し、字幕データのＰＴＳと同一の時刻の映像に字幕テキストを重畳する。そして、字幕テキストが重畳された映像及び音声を、ディスプレイ等の表示装置に出力する。 The display control unit 52 receives the video file and the audio file from the separation unit 11 and buffers them temporarily, reads the subtitle data from the subtitle data table of the subtitle data generation unit 15, and superimposes the subtitle text on the video at the same time as the PTS of the subtitle data. It then outputs the video with the superimposed subtitle text, together with the audio, to a display device such as a display.
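
Selecting which subtitle to superimpose at a given playback time can be sketched as a lookup in the subtitle data table. The Python sketch below is illustrative only: it assumes the table is sorted by start time, that entries do not overlap, and that the end time is inclusive; the names `TABLE`, `STARTS`, and `caption_at` are hypothetical, and the actual superimposition onto video frames is outside its scope.

```python
import bisect


def sec(hms):
    """Convert "HH:MM:SS" to seconds since midnight."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h * 3600 + m * 60 + s


# Subtitle data table after timing correction, sorted by start time.
TABLE = [
    (sec("15:05:12"), sec("15:05:15"), "俳優の〇〇さんと女優の□□さんが"),
    (sec("15:05:16"), sec("15:05:18"), "今日結婚の報告を行いました。"),
    (sec("15:05:20"), sec("15:05:25"), "俳優仲間など大勢に祝福されていました。"),
    (sec("15:05:32"), sec("15:05:34"), "では話題が変わって天気予報です。"),
]
STARTS = [row[0] for row in TABLE]


def caption_at(t):
    """Return the subtitle text whose PTS range contains time t, or None."""
    i = bisect.bisect_right(STARTS, t) - 1  # last row starting at or before t
    if i >= 0 and t <= TABLE[i][1]:
        return TABLE[i][2]
    return None


print(caption_at(sec("15:05:13")))  # 俳優の〇〇さんと女優の□□さんが
print(caption_at(sec("15:05:19")))  # None
```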

このように構成することにより、テレビ放送を録画したレコーダであっても、音声の内容が正確にテキスト化された字幕を、映像及び音声と同期して遅延がなく表示することができる。尚、本実施の形態のように、放送を録画して後から見る場合や、いわゆる追っかけ再生の場合、新たな字幕データの生成する処理は地上波デジタル放送などのオンエア後となるので、十分許容範囲内の時間で処理が可能であり、映像・音声・字幕が同期した再生が可能である。 With this configuration, even a recorder that has recorded a television broadcast can display subtitles in which the audio content is accurately transcribed, in sync with the video and audio and without delay. When a broadcast is recorded and watched later, as in the present embodiment, or in so-called chase playback, the new subtitle data is generated after the terrestrial digital broadcast or the like has aired, so the processing can be completed within a well-tolerated time and the video, audio, and subtitles can be played back in sync.

以上、好ましい実施の形態をあげて本発明を説明したが、本発明は必ずしも上記実施の形態に限定されるものではなく、その技術的思想の範囲内において様々に変形し実施することが出来る。 Although the present invention has been described above with reference to preferred embodiments, the present invention is not necessarily limited to the above embodiments, and can be variously modified and implemented within the scope of the technical idea.

１コンテンツ配信装置
２音声認識装置
３視聴端末
４通信回線
５字幕データ生成装置
１１分離部
１２リアルタイム字幕データ取得部
１３音声テキスト取得部
１４テキスト比較部
１５字幕データ生成部
１６配信コンテンツデータ生成部
５０レコーダ
５１記憶部
５２表示制御部
１００コンピュータ
１０１プロセッサ
１０２メモリ
１０３入力装置
１０４通信装置
１０５記憶装置
1 Content distribution device
2 Voice recognition device
3 Viewing terminal
4 Communication line
5 Subtitle data generation device
11 Separation unit
12 Real-time subtitle data acquisition unit
13 Voice text acquisition unit
14 Text comparison unit
15 Subtitle data generation unit
16 Distribution content data generation unit
50 Recorder
51 Storage unit
52 Display control unit
100 Computer
101 Processor
102 Memory
103 Input device
104 Communication device
105 Storage device

Claims

A first acquisition unit that acquires subtitle text that is a text version of the audio of the broadcast program while the operator listens to the audio of the broadcast program.
A second acquisition unit that acquires the voice text obtained by voice-recognizing the voice of the broadcast program and converting it into a text, and the voice time of the voice of the broadcast program.
A comparison unit that compares the subtitle text with the voice text and detects a matching word or phrase.
A subtitle data generation device including a subtitle data generation unit that generates subtitle data in which the subtitle text is associated with the audio time of the audio text including the word or phrase matching the word or phrase of the subtitle text.

The subtitle data generation device according to claim 1, wherein the audio time is a PTS (presentation time stamp) of a broadcast program.

The subtitle data generation device according to claim 1, wherein the voice time is an acquisition time when the voice text is acquired.

The subtitle data generation device according to any one of claims 1 to 3, wherein, when the subtitle text includes words or phrases of a plurality of the voice texts, the subtitle data generation unit associates the earliest and latest voice times of the plurality of voice texts with the subtitle text.

The subtitle data generation device according to any one of claims 1 to 4, wherein the subtitle data generation unit connects two or more of the subtitle texts so as to be within a predetermined number of characters.

A separation unit that separates the broadcast program data into video data, audio data, and subtitle text in which the audio of the broadcast program is converted into text while the operator listens to the audio of the broadcast program.
A voice recognition unit that recognizes the voice data and converts it into text to generate voice text,
The first acquisition unit for acquiring the subtitle text and
A second acquisition unit that acquires the voice text and the voice time of the voice data,
A comparison unit that compares the subtitle text with the voice text and detects a matching word or phrase.
A subtitle data generation unit that generates subtitle data in which the subtitle text and the audio time of the audio text including the word or phrase matching the word or phrase of the subtitle text are associated with each other.
A content distribution system including a distribution content data generation unit that delays the video data and the audio data for a predetermined time and generates distribution content data including at least the video data, the audio data, and the subtitle data.

The content distribution system according to claim 6, wherein the audio time is a PTS (presentation time stamp) of a broadcast program.

The content distribution system according to claim 6, wherein the voice time is an acquisition time when the voice text is acquired.

A storage unit that stores broadcast programs,
A separation unit that reads the broadcast program from the storage unit and separates it into a video file, an audio file, and subtitle data.
The first acquisition unit that acquires the subtitle text from the subtitle data, and
A second acquisition unit that acquires the voice text obtained by voice-recognizing the voice of the voice file and converting it into a text, and the voice time of the voice of the broadcast program.
A comparison unit that compares the subtitle text with the voice text and detects a matching word or phrase.
A subtitle data generation unit that generates subtitle data in which the subtitle text and the audio time of the audio text including the word or phrase matching the word or phrase of the subtitle text are associated with each other.
A video playback device including a display control unit that superimposes the subtitle text of the subtitle data on the video of the video file corresponding to the audio time of the subtitle data.

The first acquisition process of acquiring the subtitle text in which the audio of the broadcast program is converted into text while the operator listens to the audio of the broadcast program,
A second acquisition process for acquiring the voice text obtained by voice-recognizing the voice of the broadcast program and converting it into a text, and the voice time of the voice of the broadcast program.
A comparison process for detecting a matching word or phrase by comparing the subtitle text with the voice text,
A program that causes a computer to execute a subtitle data generation process for generating subtitle data in which the subtitle text is associated with the audio time of the audio text including the word or phrase matching the word or phrase of the subtitle text.

While the operator listens to the audio of the broadcast program, the subtitle text obtained by converting the audio of the broadcast program into text is acquired.
The voice text obtained by voice-recognizing the voice of the broadcast program and converting it into a text and the voice time of the voice of the broadcast program are acquired.
The subtitle text is compared with the voice text to detect a matching word or phrase.
A method for generating subtitle data that generates subtitle data in which the subtitle text is associated with the audio time of an audio text including a word or phrase that matches the word or phrase of the subtitle text.