JP2010157816A

JP2010157816A - Subtitle information generating device, subtitle information generating method, and program

Info

Publication number: JP2010157816A
Application number: JP2008333773A
Authority: JP
Inventors: Osahiro Ogawa; 修太小川; Hisashi Aoki; 恒青木; Yoshihiro Omori; 善啓大盛; Kazuhiko Abe; 一彦阿部; Koji Yamamoto; 晃司山本; Kazunori Imoto; 和範井本; Makoto Hirohata; 誠広畑; Toshisuke Takayama; 俊輔高山; Shigeru Motoi; 滋本井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-12-26
Filing date: 2008-12-26
Publication date: 2010-07-15

Abstract

PROBLEM TO BE SOLVED: To provide a subtitle information generating device that generates subtitle groups whose timing information deviates little from the actual speech timing.

SOLUTION: When an isochronous-segment-undecided character string, that is, a timing unit character string whose number of isochronous segments cannot be uniquely determined, lies between a pair of timing unit character strings that have original timing information, the subtitle information generating device (1) obtains the difference time between the timing information of an already determined timing unit character string and its original timing information, (2) obtains timing information for each of the pair by subtracting the difference time from its original timing information, and (3) uses the timing information of the pair to determine the timing information of the undecided character string lying between them.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a caption information creating apparatus, a caption information creating method, and a program for creating caption information.

In broadcasting and the like, closed captions (hereinafter "CC"), in which the audio content is rendered as character strings, are distributed together with the audio. To meet viewers' requests regarding the display format, such as the number of characters or lines of CC shown at one time, a method has been disclosed in which the receiving side builds, from the CC, units that follow a given format (hereinafter "caption groups") and assigns timing information to each caption group (see Patent Document 1).

In live broadcasts, moreover, the display timing of the CC deviates from the utterance timing of the corresponding audio. In addition, the display differs from the CC of recorded broadcasts, which are edited for readability with ample working time; for example, a single word may stay on screen alone for several seconds.

Solving this also requires creating caption groups and assigning timing information on the receiving side. Which substrings may fall at the front end of a caption group (hereinafter "timing unit character strings") depends on the CC content and the viewer's requests. Timing unit character strings therefore need to be set at as fine a granularity as possible, and timing information must be computable for every one of them.

In the invention of Patent Document 1, the timing information at the front end of a caption group is determined by analogical calculation, based on the timing information of a position whose timing is already determined (hereinafter "base position") and on the number and types of characters, or the phonemes, lying between the base position and the front end of the caption group.
Patent Document 1: JP 2000-350117 A

When the CC display deviates from the utterance timing of the audio, the span from the base position to the front end of a caption group may contain a character string whose mora count cannot be uniquely determined (hereinafter "mora-undecided character string").

In that case, the invention of Patent Document 1 cannot calculate the timing information of any timing unit character string located between the mora-undecided character string and the front end of the caption group. Furthermore, because it does not take into account how the size of the error varies with the way each piece of timing information was determined, the timing information assigned to a caption group may deviate greatly from the actual utterance timing.

The present invention has been made to solve the above problems, and provides a caption information creating apparatus and method that create caption groups whose timing information deviates little from the actual utterance timing.

The present invention is a caption information creating apparatus that creates caption information for audio, comprising: an extraction unit that receives original caption information, which includes a group of caption character strings created in advance and original timing information (time information that is assigned to every arbitrary number of characters in the caption character strings and indicates a reference correspondence with the audio), and that extracts from the group of caption character strings the timing unit character strings to which timing information (time information for displaying the caption character strings in synchronization with the audio) is to be assigned; a first timing information determination unit that associates the audio with the timing unit character strings by speech recognition and, for each timing unit character string that could be associated, determines the timing information based on the time information of the audio; a second timing information determination unit that, taking as a reference the timing information of the determined timing unit character strings, determines the timing information of the undetermined timing unit character strings, whose timing information could not be determined, using the number of isochronous segments (segments of sound having isochrony) contained in each timing unit character string; a third timing information determination unit that, when an isochronous-segment-undecided character string (a timing unit character string whose number of isochronous segments cannot be uniquely determined) lies between a pair of timing unit character strings having the original timing information, (1) obtains the difference time between the timing information of a determined timing unit character string and the original timing information, (2) obtains the timing information of each of the pair of timing unit character strings by subtracting the difference time from its original timing information, and (3) determines the timing information of the isochronous-segment-undecided character string lying between the pair, using the timing information of the pair; and a creation unit that creates the caption information, which includes caption groups (caption character strings obtained by dividing the group of caption character strings according to an arbitrary format) and the timing information of the determined timing unit character strings, the undetermined timing unit character strings, or the isochronous-segment-undecided character strings, used to display each caption group in synchronization with the audio.

According to the present invention, caption groups can be created whose timing information deviates little from the actual utterance timing.

A caption information creating apparatus according to an embodiment of the present invention is described below with reference to the drawings.

In the embodiments below, when the caption information creating apparatus is applied to Japanese, the mora is used as the isochronous segmental unit of sound; when it is applied to English, the foot, which is a fixed grouping of syllables, is used. In other words, "isochronous segmental unit of sound" refers to the mora, the foot, and the like. The mora and the foot are explained later.

A caption information creating apparatus 10 according to a first embodiment of the present invention is described below with reference to FIGS. 1 to 6 and FIG. 10.

(1) Configuration of the caption information creating apparatus 10
FIG. 1 is a block diagram showing the caption information creating apparatus 10 according to this embodiment.

The caption information creating apparatus 10 includes a timing unit character string extraction unit 100, a speech-based timing information determination unit 101, a mora count determination unit 102, a mora-based timing information determination unit 103, a difference-time-based timing information determination unit 104, and a caption group creation unit 105.

The caption information creating apparatus 10 can also be realized by using, for example, a general-purpose computer device as its basic hardware. That is, the timing unit character string extraction unit 100, the speech-based timing information determination unit 101, the mora count determination unit 102, the mora-based timing information determination unit 103, the difference-time-based timing information determination unit 104, and the caption group creation unit 105 can be realized by causing a processor mounted in the computer device to execute a program. In this case, the caption information creating apparatus 10 may be realized by installing the program in the computer device in advance, or by storing the program on a storage medium such as a CD-ROM or distributing it over a network and installing it in the computer device as appropriate.

The timing unit character string extraction unit 100 extracts timing unit character strings from the group of caption character strings in the original caption information.

The speech-based timing information determination unit 101 associates the timing unit character strings with the audio and, based on the result, determines the timing information of each timing unit character string.

The mora count determination unit 102 determines the mora count of each timing unit character string.

The mora-based timing information determination unit 103 determines the timing information of undetermined timing unit character strings, based on the mora count of each timing unit character string and the timing information of the determined timing unit character strings.

When a mora-undecided character string lies between a determined timing unit character string and an undetermined timing unit character string, the difference-time-based timing information determination unit 104 determines the timing information of the undetermined one, based on the timing information of the determined timing unit character string and the timing information of the original caption information.

The caption group creation unit 105 creates caption groups.

(2) Operation of the caption information creating apparatus 10
The operation of the caption information creating apparatus 10 according to the first embodiment is described below with reference to FIG. 2 and other drawings. FIG. 2 is a flowchart of the caption information creating apparatus 10 according to the first embodiment.

As an example, the description assumes a television with a CC display function, as shown in FIG. 10. The audio and the original caption information are received as broadcast radio waves, and the audio and original caption information extracted from them are input to the caption information creating apparatus 10 of the present invention.

(3) Timing unit character string extraction
First, in step S0 of FIG. 2, the timing unit character string extraction unit 100 extracts timing unit character strings from the group of caption character strings included in the original caption information.

Each caption character string in the group is a unit displayed on the screen at one time in synchronization with the audio, and each is accompanied by display start and display end timing information, as with c1 to c5 in FIG. 3. In FIG. 3, the display start time, display end time, and caption character string are written separated by ",". The display start and end times are in seconds, with the start of audio playback taken as 0 seconds.

The timing unit character strings to be extracted are fixed character patterns. In the caption information creating apparatus 10 they serve two roles: first, as the caption-side elements when the audio is associated with the caption character strings; second, as candidates for the front end of a caption group. A character pattern defined as a timing unit character string should therefore consist of as few characters as possible, for the sake of association with the audio, and must be a unit that can form the front end of a caption group.

Here the "word" is taken as the character pattern unit for timing unit character strings, and the condition "can form the front end of a caption group" is explained. For example, the word "オリンピック" ("Olympics") can be the front end of a caption group, as shown below.

Caption group 1: 「さあ、いよいよ来週から」 ("Well, starting next week at last,")
Caption group 2: 「オリンピックが始まります。」 ("the Olympics begin.")

Suppose instead that "オリンピック" is split more finely into "オリ", "ン", and "ピック". A caption group beginning with "ン", as in the following example, is hardly appropriate from the standpoint of readability when displayed.

Caption group 1: 「さあ、いよいよ来週からオリ」
Caption group 2: 「ンピックが始まります。」
Therefore the "ン" of "オリンピック" cannot be the front end of a caption group. It is more reasonable to treat the word "オリンピック" as a single unit.

Timing unit character strings need not be limited to Japanese. For example, specific patterns of English letters such as "IN" or "THE" may be set as timing unit character strings. The caption character strings handled by the present invention are thus language independent. English is described in detail in a later embodiment.

To simplify the description, Japanese words are used as the timing unit character strings from here on, and a word obtained from a caption character string is called a caption word. Caption words are extracted using known techniques such as simple substring search or morphological analysis.

(4) Speech-based timing information determination
Next, in step S1 of FIG. 2, the speech-based timing information determination unit 101 performs speech recognition on the received audio and extracts words (hereinafter "speech words"). It aligns them with the caption word sequence using dynamic programming. Each caption word for which a corresponding speech word is found (hereinafter "matched word") is assigned the timing information of that speech word.

For example, as shown in FIG. 4, "オリンピック" in the speech word sequence is associated with "オリンピック" in the caption word sequence, and the speech-word timing information, 60.88 (display start time) and 61.63 (display end time), is assigned as the timing information of the caption word.

After the assignment, step S2 determines whether the timing information of all caption words has been determined. If it has, caption groups are created in step S6 and the process ends.

Note that the speech word sequence in FIG. 4 simply lists the speech recognition results and contains recognition errors. In the example, "朝", "の", "本", "だ", and "外" are misrecognized speech words.
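The alignment in step S1 can be sketched as follows. The description names dynamic programming but not a specific recurrence, so this sketch assumes a longest-common-subsequence alignment on exact word equality. The function name `align_words`, the tuple layout, and every time value other than those of "オリンピック" are illustrative assumptions, not taken from the figures.

```python
def align_words(caption_words, speech_words):
    """Align caption words with recognized speech words (LCS-style DP).

    caption_words: list of str
    speech_words:  list of (word, start, end) tuples from a recognizer
    Returns {caption_index: (start, end)} for the matched caption words.
    """
    n, m = len(caption_words), len(speech_words)
    # score[i][j] = LCS length of caption_words[i:] and speech_words[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if caption_words[i] == speech_words[j][0]:
                score[i][j] = score[i + 1][j + 1] + 1
            else:
                score[i][j] = max(score[i + 1][j], score[i][j + 1])

    # Trace one optimal alignment and copy the speech timing to each match.
    matched = {}
    i = j = 0
    while i < n and j < m:
        if caption_words[i] == speech_words[j][0]:
            matched[i] = (speech_words[j][1], speech_words[j][2])
            i += 1
            j += 1
        elif score[i + 1][j] >= score[i][j + 1]:
            i += 1  # caption word has no counterpart: unmatched word
        else:
            j += 1  # speech word is a recognition error or extra word
    return matched


captions = ["さあ", "本日", "オリンピック", "が"]
speech = [
    ("朝", 59.90, 60.20),            # hypothetical misrecognition
    ("本", 60.30, 60.50),            # hypothetical misrecognition
    ("オリンピック", 60.88, 61.63),  # times given in the description
    ("が", 61.63, 61.80),            # hypothetical
]
matched = align_words(captions, speech)
```

Caption words missing from `matched` correspond to the words whose timing must be estimated by the later steps.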

(5) Mora-based timing information determination
If the determination in step S2 of FIG. 2 finds caption words whose timing information has not been determined (hereinafter "unmatched words"), the mora-based timing information determination unit 103 performs mora-based timing information determination to determine the timing information of the unmatched words.

At this time, the mora count determination unit 102 determines the mora count of each word and supplies it to the mora-based timing information determination unit 103.

A "mora" is a segmental unit of sound with a fixed duration. The duration differs between languages and, within a single language, between dialects. All languages have syllables, but some languages have no morae. In Japanese, as a rule, each kana is pronounced with the same length (one beat), and one beat corresponds to one mora. A contracted sound (yoon, such as "キャ"), however, counts as a single mora. The moraic nasal "ン", the long-vowel mark "ー", and the geminate marker "ッ" also count as one mora each. For English, the syllable rather than the mora is used as the segmental unit.
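The counting rule above can be sketched as follows, assuming the word is already available as a kana reading (a word written in kanji would first need a reading from a dictionary, which is outside this sketch). The name `count_mora` and the small-kana set are illustrative assumptions; the set is a simplification that treats every small vowel or small ya/yu/yo as part of the preceding kana.

```python
# Small kana that merge with the preceding kana into one mora (assumed set).
# The small tsu (ッ/っ), ン and ー are deliberately absent: each is one mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")


def count_mora(kana):
    """Count morae in a kana string: one per kana, none for small kana."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)
```

For example, `count_mora("オリンピック")` counts オ, リ, ン, ピ, ッ, ク and returns 6, which is the mora count the description uses for that word.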

Mora-based timing information determination is described with reference to FIG. 5.

The mora-based timing information determination unit 103 calculates the time per mora from the timing information of the matched words and the mora count of each word, and from it estimates the timing information of caption words by analogical calculation. For example, to determine the timing information of the unmatched words "する" (suru) and "日本" (Nihon), which lie between the matched words "出場" and "選手", the time per mora is calculated by the following formula.

m = (T2 - T1) / mt

Here m is the time per mora, T1 is the display end time of "出場", T2 is the display start time of "選手", and mt is the total number of morae between "出場" and "選手".

Since "する" has 2 morae and "日本" has 3, mt is 5, and m = (63.38 - 62.60) / 5 = 0.156.

From m, uttering "する" takes m × 2 and uttering "日本" takes m × 3. The timing information of "する" and "日本" is therefore obtained as follows.

"する" display start time = T1
"する" display end time = T1 + m × 2
"日本" display start time = T1 + m × 2
"日本" display end time = T2
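The interpolation above can be sketched as follows, using the values given in the description (T1 = 62.60, T2 = 63.38, 2 and 3 morae). The function name `interpolate` and its return layout are illustrative assumptions.

```python
def interpolate(t1, t2, mora_counts):
    """Spread unmatched words between two matched words by mora count.

    t1: display end time of the preceding matched word
    t2: display start time of the following matched word
    mora_counts: mora count of each unmatched word, in order
    Returns (time_per_mora, [(start, end), ...]).
    """
    m = (t2 - t1) / sum(mora_counts)
    timings = []
    start = t1
    for k, count in enumerate(mora_counts):
        # Pin the last word to t2 so rounding cannot leave a gap.
        end = t2 if k == len(mora_counts) - 1 else start + m * count
        timings.append((start, end))
        start = end
    return m, timings


# "する" (2 morae) and "日本" (3 morae) between "出場" and "選手"
m, timings = interpolate(62.60, 63.38, [2, 3])
```

Here `m` evaluates to about 0.156 seconds, "する" ends at about 62.912, and "日本" runs from there to 63.38.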

As described above, the mora-based timing information determination unit 103 assumes that caption words follow one another in time without intervals; that is, that no gap of a certain length or more exists between the utterance timings of adjacent words.

Therefore, as shown in FIG. 3, before the calculation, caption words that are continuous in time are gathered into units (hereinafter "time-series continuous word groups") based on the original timing information obtained from the original caption information, and the timing information is then calculated group by group. Groups are formed by placing a boundary between two caption character strings in the original caption information whenever the time difference between the display end time of one and the display start time of the next is at or above a certain value. For example, if the maximum time difference regarded as continuous is set to 2.00 seconds, the display end time of c2 and the display start time of c3 are 33.37 - 30.40 = 2.97 seconds apart, so the range is divided there.
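The grouping step can be sketched as follows. The function name `split_into_groups` and the tuple layout are illustrative assumptions. Only the start and end of c2 (27.48, 30.40) and the start of c3 (33.37) come from the description; the other times are hypothetical. This sketch treats a gap of exactly the threshold as still continuous, which is one possible reading of "maximum time difference regarded as continuous".

```python
def split_into_groups(captions, max_gap=2.00):
    """Split captions into time-series continuous groups.

    captions: list of (start, end, label) tuples in display order
    max_gap:  largest gap in seconds still regarded as continuous
    """
    groups = []
    for caption in captions:
        start = caption[0]
        # Compare with the end time of the last caption in the open group.
        if groups and start - groups[-1][-1][1] <= max_gap:
            groups[-1].append(caption)
        else:
            groups.append([caption])
    return groups


captions = [
    (24.50, 27.48, "c1"),  # hypothetical times
    (27.48, 30.40, "c2"),  # times given in the description
    (33.37, 36.00, "c3"),  # start given in the description, end hypothetical
]
groups = split_into_groups(captions)
labels = [[c[2] for c in group] for group in groups]
```

With these inputs the 2.97-second gap between c2 and c3 exceeds the threshold, so c3 starts a new group.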

The timing information of an unmatched word sandwiched between two matched words is determined as described above.

Next, determination of the timing information of unmatched words that are not sandwiched between matched words is described.

When the caption words at the front end and the rear end of a time-series continuous word group are not matched words, the unmatched words from the front end of the group up to just before the first matched word in the group are such "unmatched words not sandwiched between matched words". The situation at the rear end is the same with front and rear reversed, so only the front end is described as an example.

In FIG. 5, suppose the word "さあ" is at the front end of a time-series continuous word group. To determine the timing information of "さあ" and "本日", the time per mora is calculated from "オリンピック", the first matched word in the group counted from the front end. From the display start and end times of the 6-mora "オリンピック", the time per mora is (61.63 - 60.88) / 6 = 0.125. Since "さあ" has 2 morae and "本日" (honjitsu) has 4, the timing information of each unmatched word is determined as follows.

"さあ" display start time = T3 - m × 4 - m × 2
"さあ" display end time = T3 - m × 4
"本日" display start time = T3 - m × 4
"本日" display end time = T3

Here T3 is the display start time of "オリンピック" and m is the time per mora.

The time per mora may instead be one already calculated elsewhere in the group, that is, one calculated from two matched words and the unmatched words sandwiched between them. Alternatively, the time per mora may be calculated for each matched word in the group and the average of these values used.
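The front-end case can be sketched as follows, walking backwards from the first matched word. The function name `extrapolate_front` is an illustrative assumption; the times and mora counts are those given in the description.

```python
def extrapolate_front(t3, m, mora_counts):
    """Estimate timings of unmatched words that precede the first matched word.

    t3: display start time of the first matched word in the group
    m:  time per mora
    mora_counts: mora count of each preceding unmatched word, in order
    Returns [(start, end), ...] in the same order as mora_counts.
    """
    timings = []
    end = t3
    for count in reversed(mora_counts):
        start = end - m * count
        timings.append((start, end))
        end = start
    timings.reverse()
    return timings


# Time per mora from the 6-mora matched word "オリンピック"
m = (61.63 - 60.88) / 6
# "さあ" (2 morae) and "本日" (4 morae) precede it
saa, honjitsu = extrapolate_front(60.88, m, [2, 4])
```

With these values "本日" runs from about 60.38 to 60.88 and "さあ" from about 60.13 to 60.38.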

(6) Difference-time-based timing information determination
There are words whose mora count the mora count determination unit 102 cannot determine uniquely (hereinafter "mora-undecided words").

These are, first, numeric strings for which it is unknown whether they are read digit by digit or by place value; second, symbols such as "、", "。", and "・・・", whose corresponding mora count is not constant; and third, character patterns that the mora count determination unit 102 does not support. An "unsupported character pattern" is, for example, a word not registered in the dictionary when the mora count determination unit 102 uses a dictionary.

In the example of FIG. 6, "3324" is a mora-undecided word.

Read by place value, "3324" is "サンゼンサンビャクニジューヨン", 13 morae. Read digit by digit, it is "サンサンニーヨン", 8 morae. In a stock market context, for example, it becomes "サンゼンサンビャクフタジューヨン", 14 morae. How the same CC notation is actually uttered thus depends on the context, and determining it automatically requires additional knowledge and computation. If one reading is forcibly chosen for the calculation and it differs from the reading actually uttered, a large error may result. The error caused by the difference between the digit-by-digit and place-value readings is calculated below.

Time per mora based on the matched word "参加" = (64.93 - 64.48) / 3 = 0.15

Error due to the mora difference between the place-value and digit-by-digit readings of "3324" = 0.15 × (13 - 8) = 0.75

Because the likelihood of such errors increases with every mora-undecided word that is mixed in, the mora-based timing information determination unit 103 alone may be unable to calculate timing information with a sufficiently small error.
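The error estimate above can be reproduced with a short calculation. The function name `reading_error` is an illustrative assumption; it simply takes the spread between the largest and smallest candidate mora counts and scales it by the time per mora.

```python
def reading_error(time_per_mora, candidate_mora_counts):
    """Worst-case timing error when the reading of a word is ambiguous."""
    spread = max(candidate_mora_counts) - min(candidate_mora_counts)
    return time_per_mora * spread


# Time per mora from the 3-mora matched word "参加"
per_mora = (64.93 - 64.48) / 3
# Place-value reading (13 morae) versus digit-by-digit reading (8 morae)
error = reading_error(per_mora, [13, 8])
```

Here `error` evaluates to about 0.75 seconds, matching the figure in the description.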

そこで、ステップＳ３において、モーラベースタイミング情報決定を行う際、時系列連続単語グループ内に、一致単語と不一致単語の間にモーラ未決単語が存在する場合があるときには、ステップＳ４において、差分時間ベースタイミング情報決定部１０４が、不一致単語のタイミング情報を決定する。 Therefore, when determining the mora base timing information in step S3, if there is a case where a mora undecided word exists between the matched word and the unmatched word in the time series continuous word group, the difference time base timing is determined in step S4. The information determination unit 104 determines the timing information of the mismatch word.

差分時間ベースタイミング情報決定部１０４は、オリジナル字幕情報のタイミング情報（以下、「オリジナルタイミング情報」という）を用いる。このオリジナルタイミング情報は、オリジナル字幕情報で１画面表示単位の字幕文字列毎に付与されているものであるが、生放送でリアルタイムに付与された字幕情報である場合、音声に完全に同期しているわけでない。これらの表示開始時間を、各字幕文字列の前端にあたる単語の表示開始時間とみなす。例えば、図３のｃ２の字幕文字列の表示開始時間は２７．４８であり、これをｃ２の字幕文字列の前端単語「さあ」の表示開始時間と見なす（ｅ２）。同様に、ｅ１、ｅ３の単語もオリジナルタイミング情報に基づいた表示開始時間（以下、オリジナル表示開始時間）を持つ単語となる。 The difference time base timing information determination unit 104 uses timing information of original caption information (hereinafter referred to as “original timing information”). This original timing information is given for each subtitle character string in the unit of one-screen display in the original subtitle information, but when it is subtitle information given in real time in live broadcasting, it is completely synchronized with the audio. That's not true. These display start times are regarded as display start times of words corresponding to the front end of each subtitle character string. For example, the display start time of the subtitle character string of c2 in FIG. 3 is 27.48, and this is regarded as the display start time of the front end word “Sa” of the subtitle character string of c2 (e2). Similarly, the words e1 and e3 are words having a display start time (hereinafter referred to as original display start time) based on the original timing information.

The flow of the difference time base timing information determination is described with reference to FIG. 6.

First, as indicated by circled number 1 in FIG. 6, the mora base timing information determination calculates, with the matched word "参加" ("participation") as the base point, the timing information of the word "前回" ("last time"), which has an original display start time.

Next, as indicated by circled number 2 in FIG. 6, the difference time of 7.5 between the original display start time of "前回", 71.38, and the display start time calculated by the mora base timing information determination, 63.88, is calculated.

Next, as indicated by circled number 3 in FIG. 6, this difference time is subtracted from the original display start time of "今年" ("this year"), 68.38, to obtain a display start time of 60.88.

After that, the mora base timing information determination unit 103 calculates the timing information of the words from "今年" through "は".

The timing information of "と" is likewise determined by the mora base timing information determination unit 103, with "参加" as the calculation base point.

Through the above processing, the timing information of the words adjacent on both the front and rear sides becomes known, and the timing information of the mora pending word "3324" is thereby determined.
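
The arithmetic of circled numbers 2 and 3 in FIG. 6 can be sketched as follows. This is only an illustration using the values quoted above; the function names `difference_time` and `shift_original_start` are hypothetical and do not appear in the specification.

```python
def difference_time(original_start, computed_start):
    """Offset between an original display start time and the start time
    computed for the same word by the mora base determination."""
    return original_start - computed_start


def shift_original_start(original_start, offset):
    """Apply the same offset to the original start time of another word."""
    return original_start - offset


# Values quoted in the text for FIG. 6.
offset = difference_time(71.38, 63.88)              # word "前回"
start_kotoshi = shift_original_start(68.38, offset)  # word "今年"

print(round(offset, 2), round(start_kotoshi, 2))
```

The sketch assumes, as the text does, that the delay of the original timing information is roughly the same for the two nearby subtitle character strings.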

(7) Subtitle group creation
As shown in FIG. 2, once timing information has been determined for all words by the audio base timing information determination (step S1), the mora base timing information determination (step S3), and the difference time base timing information determination (step S4), the subtitle group creation unit 105 creates subtitle groups in step S6.

The creation follows, first, the timing information determined in (S1) to (S4); second, the user's requests regarding the subtitle display form; and third, formatting rules that take readability into account.

Regarding the first item, if the display end time of a word is not identical to the display start time of the next word in the determined timing information, the two words are not placed in the same subtitle group.

Regarding the second item, the user's requests regarding the subtitle display form may include, for example, a specified number of lines or characters.

Regarding the third item, formatting rules that take readability into account include the following.

・Decide the number of characters displayed at one time and their display duration in light of statistically obtained human reading speed.

・Do not let a punctuation mark fall at the beginning of a line.

・Do not break immediately after a prefix or immediately before a suffix.

The subtitle groups may also be created by a method such as that set forth in claims 1 and 2 of the invention of Patent Document 1.

The display start time of the front-end word of each subtitle group is then used as the display start time of that subtitle group, and the display end time of its rear-end word as the display end time of the subtitle group.
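
A minimal sketch of the grouping rules above, under simplifying assumptions: it covers only the timing-contiguity rule and a character limit standing in for a user request, and ignores the readability rules. The `Word` type, the `make_groups` function, and the sample words are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # display start time in seconds
    end: float    # display end time in seconds


def make_groups(words, max_chars):
    """Split words into subtitle groups.

    A new group starts when the previous word's end time differs from the
    next word's start time, or when the character limit would be exceeded.
    Each group is returned as (text, group_start, group_end).
    """
    groups, current = [], []
    for w in words:
        if current:
            contiguous = current[-1].end == w.start
            fits = sum(len(x.text) for x in current) + len(w.text) <= max_chars
            if not (contiguous and fits):
                groups.append(current)
                current = []
        current.append(w)
    if current:
        groups.append(current)
    # Group start = start of front-end word, group end = end of rear-end word.
    return [("".join(x.text for x in g), g[0].start, g[-1].end) for g in groups]


# Hypothetical words; the gap between 2.0 and 2.4 forces a group boundary.
words = [
    Word("aa", 1.0, 1.5), Word("bb", 1.5, 2.0),
    Word("cc", 2.4, 3.0), Word("dddd", 3.0, 3.5), Word("eee", 3.5, 4.0),
]
groups = make_groups(words, max_chars=6)
print(groups)
```

Exact equality of the times is used here because the text requires the end and start times to be identical; an implementation working with measured times might prefer a small tolerance.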

When determining the display start and end times of a subtitle group, a time offset may also be introduced deliberately on the basis of user requests or readability.

For example, as shown below, making the time span assigned to a subtitle longer than the actual utterance time accommodates cases in which the assumed user reads subtitles more slowly than the speech proceeds.

・If there is a time difference Td of at least a certain length between the display end time of subtitle group N and the display start time of the subtitle group displayed next after N, a value no greater than Td is added to the display end time of subtitle group N so that the subtitle remains displayed for a certain time after the utterance corresponding to subtitle group N has ended.

・If there is a time difference Td of at least a certain length between the display start time of subtitle group N and the display end time of the subtitle group displayed before N, a value no greater than Td is subtracted from the display start time of subtitle group N so that subtitle group N is displayed from a certain time before the corresponding utterance begins.
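
The first of the two adjustments above (extending the display end time) might look like the following sketch. The threshold `min_gap`, the cap `max_extend`, and the sample times are assumptions for illustration; the specification only requires that the added value not exceed Td.

```python
def extend_display_ends(groups, min_gap, max_extend):
    """Extend the end time of each subtitle group into the following gap.

    groups: list of (start, end) tuples sorted by time.
    If the gap Td to the next group is at least min_gap, the end time is
    extended by min(Td, max_extend), which never exceeds Td.
    """
    result = [list(g) for g in groups]
    for i in range(len(groups) - 1):
        td = groups[i + 1][0] - groups[i][1]
        if td >= min_gap:
            result[i][1] = groups[i][1] + min(td, max_extend)
    return [tuple(g) for g in result]


# Hypothetical display times in seconds.
groups = [(0.0, 2.0), (5.0, 6.0), (6.2, 7.0)]
extended = extend_display_ends(groups, min_gap=1.0, max_extend=1.5)
print(extended)
```

Moving the display start time earlier would be the mirror image of this function. Applying both adjustments to the same gap would need an extra check so that neighbouring groups do not overlap.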

(8) Effect
According to this embodiment, timing information can be calculated both for a timing unit character string that is separated from the base point position on the CC by a mora pending character string and for each mora pending character string within a mora pending range. Subtitle groups having these character strings at their front end can therefore be created, which makes it possible to meet requests regarding the CC display form.

Specifically, the difference time between the timing information of the base point position and the original timing information of the CC (the timing information attached to each subtitle character string corresponding to one screen of CC display) is calculated, and this difference time is used to interpolate the timing information of timing unit character strings located after a mora pending character string.

The subtitle information creation apparatus 10 according to Embodiment 2 of the present invention is described below with reference to FIGS. 7 and 8.

This embodiment concerns the mora number estimation unit 106. For convenience, a pending range made up of words is called a pending word range, and the description uses it as the example. A pending word range is a range whose front and rear ends are mora pending words and which contains no matched word and no word having an original display start time. When several such ranges are adjacent or overlap, only the largest range that combines them is recognized.

(1) Configuration of the subtitle information creation apparatus 10
FIG. 7 is a block diagram showing the subtitle information creation apparatus 10 according to this embodiment. Only the points that differ from Embodiment 1 are described.

The mora number estimation unit 106 determines a pending word range and estimates the number of morae of each word in that range.

The mora base timing information determination unit 103 determines the timing information of each word in the pending word range on the basis of the estimated numbers of morae.

(2) Mora number estimation
Mora number estimation is described with reference to FIG. 8.

The three words "200", "から" ("kara"), and "300" in the subtitle word string form one pending word range. First, the total time elapsed in the pending word range is calculated from the display end time of the word "今" ("now"), which adjoins the range at the front, and the display start time of the word "の", which adjoins it at the rear. In the example, 74.01 − 72.66 = 1.35 is obtained.

Next, all combinations of the mora number candidates of the words in the pending word range are generated, and the time per mora is calculated for each combination from the total time of the range.

"200" is read either with place values as "nihyaku" (3 morae) or digit by digit as "nii zero zero" (6 morae); "から" has 2 morae; and "300" is read either with place values as "sanbyaku" (4 morae) or digit by digit as "san zero zero" (6 morae). There are therefore 2 × 1 × 2 = 4 combinations, and the time per mora is calculated for each of them.

One of the calculated times is then selected on the basis of the time per mora in the vicinity of the pending word range. In FIG. 8, the time per mora used to calculate the timing of the word "の", which adjoins the pending word range at the rear, is 0.15. The combination "nihyaku" (3 morae) + "から" (2 morae) + "sanbyaku" (4 morae), which yields a time per mora of 0.15, is therefore selected as the estimation result.

As an alternative selection method, the times per mora used to determine the timing information of the words before and after the pending word range may be averaged, and the candidate closest to that average may be selected.
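
The estimation for the FIG. 8 example can be sketched as below. The candidate mora counts and times are those quoted in the text; the function name `estimate_mora_counts` and the nearest-value selection rule are an illustrative reading of the description, not the only possible one.

```python
from itertools import product


def estimate_mora_counts(candidates, total_time, reference_time_per_mora):
    """Pick the combination of mora counts whose time per mora is closest
    to a reference value taken from the neighbouring words.

    candidates: one tuple of possible mora counts per word in the range.
    Returns (chosen_combination, time_per_mora).
    """
    best = None
    for combo in product(*candidates):
        time_per_mora = total_time / sum(combo)
        distance = abs(time_per_mora - reference_time_per_mora)
        if best is None or distance < best[0]:
            best = (distance, combo, time_per_mora)
    return best[1], best[2]


# "200": 3 or 6 morae, "から": 2 morae, "300": 4 or 6 morae.
candidates = [(3, 6), (2,), (4, 6)]
total_time = 74.01 - 72.66  # 1.35 seconds for the whole pending word range
combo, time_per_mora = estimate_mora_counts(candidates, total_time, 0.15)
print(combo, round(time_per_mora, 3))
```

The four combinations give 9, 11, 12, and 14 morae in total, so only the 9-mora reading reproduces the neighbouring rate of 0.15 seconds per mora.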

(3) Effect
According to this embodiment, for a range containing one or more mora pending character strings (a range whose front and rear ends are mora pending character strings and which contains no base point position; when pending word ranges are adjacent or overlap, only the single range that combines them is recognized), the number of morae actually uttered is estimated from the timing information before and after the pending word range and the mora number candidates within it, so that the timing information of each mora pending character string in the range can be determined.

The subtitle information creation apparatus 10 according to Embodiment 3 of the present invention is described below with reference to FIG. 9.

This embodiment concerns subtitle group creation that makes use of the process by which the timing information was determined.

(1) Timing-accompanying information
Information describing the process by which a piece of timing information was determined is called timing-accompanying information. Timing information can be determined through a variety of processes. For example, a user may enter timing information while listening to the received audio.

The timing information determination units described above may also be used, as listed below.

M1: audio base timing information determination unit 101
M2: mora base timing information determination unit 103
M3: difference time base timing information determination unit 104
M4: mora number estimation unit 106
Mora number estimation is included in the list as "M4: mora number estimation unit 106". It is treated as one element of the variations because the error in the resulting timing information is considered to differ depending on whether or not the mora number estimation unit 106 is used.

(2) Process variations
The variations of the process by which timing information is determined are listed below.

1. M1
2. M1 + M2
3. M1 + M3
4. M1 + M2 + M3
5. M1 + M2 + M4
6. M1 + M3 + M4
7. M1 + M2 + M3 + M4
The temporal error is smallest in the first row, where the timing information is determined by M1 alone, that is, by the audio base timing information determination unit 101 alone. The further down the list, the greater the likelihood that the temporal error in the resulting timing information is large. Here, the temporal error is the discrepancy between the time of the audio and the display time of the subtitle.

(3) First situation in which an error occurs
A situation in which a temporal error occurs is described using "M1" and "M1 + M2" as examples.

Compare the matched word "選手" ("player") with the mismatched word "する" ("do") in FIG. 5. The timing information of "する" rests on the assumption that "出場" ("appearance"), "する", "日本" ("Japan"), and "選手" are all uttered at the same speaking rate. If "する" and "日本" are in fact uttered at different speaking rates, an error results. A matched word, by contrast, is given timing information based on its correspondence with the audio, so it is less likely than "する" to contain an error.

Furthermore, for "M2: mora base timing information determination unit 103" among the variations, the error in the determined timing information differs from case to case even though the same means is used. The factors that cause errors within the mora base timing information determination unit 103 are listed below. In this list, the mismatched word whose timing information is being determined is called "the word in question".

The first factor is the number of morae between the word in question and the matched word used as the calculation origin. The smaller this number, the less likely an error is, because any error in the time per mora is multiplied by the number of morae and so grows as that number increases.

The second factor is whether or not the word in question lies between two matched words. An error is less likely when it does than when it does not: when the word is enclosed, the time per mora can be calculated from the timing information of both matched words, which is not possible otherwise.

The third factor is the number of matched words in the time-series continuous word group that contains the word in question. The more matched words there are, the less likely an error is, because a time per mora can be calculated from each of them, which makes an error in the time per mora less likely.

(4) Second situation in which an error occurs
In the mora base timing information determination unit 103, there are also cases in which a time-series continuous word group G contains no matched word that could serve as the calculation base point. In such a case, interpolation by difference time, in the same manner as in the difference time base timing information determination unit, is conceivable, using a word having an original display start time in the time-series continuous word groups before or after G and a word having an original display start time in G. The timing information of the words in G that is determined with a word timed by this method as the base point is more likely to contain an error than the timing information of words in other time-series continuous word groups for which this method was not used.

(5) Third situation in which an error occurs
In "M4: mora number estimation unit 106" as well, the likelihood of an error varies with the following factors.

One factor is the difference between the time per mora calculated from the mora number combination chosen as the estimation result and the time per mora of the candidate combinations that were not chosen. The larger this difference, the less likely the mora number estimation result is wrong, and hence the less likely the assigned timing information contains an error.

In addition, when the times per mora obtained from the combinations do not differ and the candidates cannot be narrowed down to a single combination, the mora number estimation result is likely to be wrong.

(6) Summary
As described above, there are many variations of the process by which timing information is determined.

The determination process affects the size of the error in the resulting timing information. The subtitle information creation apparatus 10 according to this embodiment records the determination process as timing-accompanying information and uses it when creating subtitle groups.

(7) Configuration of the subtitle information creation apparatus 10
FIG. 9 is a block diagram showing the subtitle information creation apparatus 10 according to this embodiment.

The subtitle information creation apparatus 10 includes a timing unit character string extraction unit 100, a timing information determination unit 107, a subtitle group creation unit 105, and a timing-accompanying information storage unit 108.

The timing information determination unit 107 determines timing information for each timing unit character string extracted by the timing unit character string extraction unit 100.

When the timing information determination unit 107 determines timing information, the timing-accompanying information storage unit 108 stores the determination process as timing-accompanying information in association with each piece of timing information.

The subtitle group creation unit 105 creates subtitle groups using the timing-accompanying information stored in the timing-accompanying information storage unit 108.

(8) Creation of subtitle groups using timing-accompanying information
The subtitle group creation unit 105 operates in the same way as the subtitle group creation unit 105 described above.

However, it refers to the timing-accompanying information stored in the timing-accompanying information storage unit 108 and creates subtitle groups so that words whose timing information was obtained by a determination process less likely to contain a temporal error are preferentially placed at the front end of a subtitle group.

Conversely, it creates subtitle groups so as to avoid placing at the front end of a subtitle group words whose timing information was obtained by a determination process more likely to contain a temporal error.
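
One way to picture this preference is to rank the seven process variations and choose, among the candidate front-end words, the one with the lowest rank. The sketch below is an assumption-laden illustration: the `PROCESS_RANK` table simply mirrors the order of the list of variations, and the candidate words and the processes attributed to them are hypothetical.

```python
# Rank of each determination process; a lower rank means the resulting
# timing information is considered less likely to contain a large error.
PROCESS_RANK = {
    ("M1",): 1,
    ("M1", "M2"): 2,
    ("M1", "M3"): 3,
    ("M1", "M2", "M3"): 4,
    ("M1", "M2", "M4"): 5,
    ("M1", "M3", "M4"): 6,
    ("M1", "M2", "M3", "M4"): 7,
}


def pick_front_end(candidates):
    """Choose the front-end word from (word, process) pairs.

    The word whose timing-accompanying information has the lowest rank is
    preferred; ties are resolved in favour of the earliest candidate.
    """
    return min(candidates, key=lambda c: PROCESS_RANK[c[1]])[0]


# Hypothetical candidates for the front end of one subtitle group.
candidates = [
    ("w1", ("M1", "M2", "M3")),
    ("w2", ("M1",)),
    ("w3", ("M1", "M2")),
]
front = pick_front_end(candidates)
print(front)
```

A fuller implementation would also weigh the finer factors described for M2 and M4, such as the number of morae to the calculation origin, rather than the coarse rank alone.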

(9) Effect
According to this embodiment, subtitle groups are created with consideration of how likely each piece of determined timing information is to contain an error as a result of the process by which it was determined. This makes it possible to calculate the timing information of each timing unit character string located up to the front end or rear end of a subtitle group, and to create subtitle groups whose timing information deviates little from the actual utterance timing.

The subtitle information creation apparatus 10 according to Embodiment 4 of the present invention is described below with reference to FIGS. 11 to 16. In this embodiment, the subtitle information creation apparatus 10 is applied to English.

(1) Configuration of the subtitle information creation apparatus 10
FIG. 11 is a block diagram showing the subtitle information creation apparatus 10 according to this embodiment.

The subtitle information creation apparatus 10 includes a timing unit character string extraction unit 100, an audio base timing information determination unit 101, an isochronous segment base timing information determination unit 108, a difference time base timing information determination unit 109, and a subtitle group creation unit 105.

The isochronous segment base timing information determination unit 108 determines the timing information of undetermined timing unit character strings on the basis of the number of isochronous segments and the timing information of timing unit character strings whose timing has already been determined.

When an isochronous segment pending character string lies between a determined timing unit character string and an undetermined timing unit character string, the difference time base timing information determination unit 109 determines the timing information of the undetermined timing unit character string on the basis of the timing information of the determined timing unit character string and the timing information of the original subtitle information.

(2) Operation of the subtitle information creation apparatus 10
FIG. 12 is a flowchart of the subtitle information creation apparatus 10 according to Embodiment 4.

The operation of the subtitle information creation apparatus 10 according to this embodiment is described below, following the flowchart of FIG. 12 and referring to the other figures. As in Embodiment 1, the description assumes, as an example, a television with a function for displaying CC. The audio and the original subtitle information are received as broadcast radio waves, extracted from them, and input to the subtitle information creation apparatus 10 of the present invention.

(3) Timing unit character string extraction
First, in step S0 of FIG. 12, the timing unit character string extraction unit 100 extracts timing unit character strings from the group of subtitle character strings contained in the original subtitle information.

Each subtitle character string in the group is a unit that is displayed on the screen at one time in synchronization with the audio, and each is accompanied by display start and end timing information, as shown by c1 to c5 in FIG. 13. The display start time, the display end time, and the subtitle character string are written separated by ",". The display start and end times are given in seconds, with the start of audio playback taken as 0 seconds.

The timing unit character string to be extracted is a fixed character pattern that plays two roles in the subtitle information creation apparatus 10: first, it is the element on the subtitle side that is matched when the audio and the subtitle character strings are associated with each other; second, it is a candidate for the front end of a subtitle group. The character pattern defined as the timing unit character string should therefore consist of as few characters as possible, for the sake of association with the audio, and must be a unit that can serve as the front end of a subtitle group.

In Embodiment 1, the subtitle information creation apparatus 10 was applied to Japanese and the word was used as the timing unit character string. Because this embodiment takes English as its example, the foot is used as the timing unit character string, and the timing unit character string extraction unit 100 delimits timing unit character strings foot by foot.

A "foot" is a unit of utterance in spoken English that is delimited at each stress. A stress is a syllable uttered strongly and slowly. The words in a sentence are either words that carry a stress (hereinafter, "stressed words") or words that carry no stress and are uttered weakly and quickly (hereinafter, "unstressed words"). A stressed word together with the run of unstressed words delimited by stressed words is called a foot. Because a foot is a stretch of speech uttered in a single breath, it is suitable as the unit for association with the audio. Because it also forms a meaningful unit of speech, it is suitable as the front end of a subtitle group as well.

Like the mora in Japanese, the foot is a segmental unit of sound that exhibits isochrony in spoken English. In the following examples, capital letters mark the position of the stress and "｜" marks a foot boundary.

1) the｜HOUSE is｜BEAUtiful
2) the｜HOUSE is very｜BEAUtiful
3) the｜HOUSE is not very｜BEAUtiful

In 1) to 3), the time interval between stresses does not change greatly across "house is", "house is very", and "house is not very", regardless of the number of syllables in the foot (see Takubo, Maekawa, Kubozono, Honda, Shirai, and Nakagawa, "Gengo no Kagaku 2: Onsei" [The Science of Language 2: Speech], Iwanami Shoten, 1988). In the following description, the foot is used as the timing unit character string, and a foot-unit character string obtained from a subtitle character string is called a subtitle element.

(4) Audio base timing information determination
Next, the audio base timing information determination unit 101 extracts character strings for comparison (hereinafter, "audio elements") from the received audio and associates them with the subtitle element string. The audio elements are extracted by a known technique such as speech recognition, using the foot as the extraction unit as for the subtitle elements. The audio elements and the subtitle element string are associated with each other by a known technique such as dynamic programming.
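
The specification does not say which dynamic programming method is used. As one possible illustration, the sketch below aligns the two element strings by a longest common subsequence and copies the audio timing to the matched subtitle elements. The function name `align`, the element sequences, and the times of the misrecognized elements are assumptions; only the timings of "house is not" and "beautiful as you" are taken from the text.

```python
def align(audio, subtitles):
    """Align audio elements with subtitle elements by a longest common
    subsequence computed with dynamic programming.

    audio: list of (text, start, end); subtitles: list of text.
    Returns {subtitle_index: audio_index} for the matched elements.
    """
    n, m = len(audio), len(subtitles)
    # lcs[i][j] = LCS length of audio[i:] and subtitles[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if audio[i][0] == subtitles[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Trace back through the table to recover the matched pairs.
    matches, i, j = {}, 0, 0
    while i < n and j < m:
        if audio[i][0] == subtitles[j]:
            matches[j] = i
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return matches


# Illustrative sequences; "does" and "boots" stand for recognition errors.
audio = [
    ("house is not", 60.88, 61.08),
    ("does", 61.08, 61.28),
    ("beautiful as you", 61.28, 61.48),
    ("boots", 61.48, 61.68),
]
subtitles = ["house is not", "very", "beautiful as you", "say."]

matches = align(audio, subtitles)
# Matched subtitle elements inherit the timing of their audio elements.
timed = {subtitles[j]: audio[i][1:] for j, i in matches.items()}
print(timed)
```

Subtitle elements left out of `timed` correspond to the elements whose timing must be determined by the later steps.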

In step S1 of FIG. 12, each subtitle element for which a corresponding audio element has been found (hereinafter, "matched element") is given the timing information of that audio element as its own timing information. For example, as shown in FIG. 14, "house is not" in the audio element string is associated with "house is not" in the subtitle element string, and the timing information on the audio element side, 60.88 (display start time) and 61.08 (display end time), is given as the timing information on the subtitle element side.

In step S2, it is determined whether the timing information of all subtitle elements has been determined by this assignment.

If it has, subtitle groups are created and the processing ends.

The audio element string in FIG. 14 is a list of speech recognition results and contains recognition errors. In the example, "does" and "boots" are misrecognized audio elements.

（５）等時分節ベースタイミング情報決定
ステップＳ２の判定で、タイミング情報が決定していない字幕要素（以下、「不一致要素」という）が存在する場合、ステップＳ３において、等時分節ベースタイミング情報決定部１０９が、不一致要素のタイミング情報を決定する。 (5) Isochronous segment base timing information determination If there is a caption element for which timing information has not been determined (hereinafter referred to as “mismatch element”) in the determination in step S2, isochronous segment base timing information determination is performed in step S3. The unit 109 determines the timing information of the mismatched elements.

図１５を用いて、等時分節ベースタイミング情報決定について説明する。 The determination of isochronous segment base timing information will be described with reference to FIG. 15.

等時分節ベースタイミング情報決定部１０９は、一致要素のタイミング情報と、フット数に基づいて字幕要素のタイミング情報を類推演算で算出する。 The isochronous segment base timing information determination unit 109 calculates the timing information of subtitle elements by analogy calculation based on the timing information of matching elements and the number of feet.

例えば一致要素「beautiful as you」の後方にある不一致要素「say.」のタイミング情報を決定する場合、１フット当たりの時間（以下、「フット時間長」という）ｆｔに基づいて以下のように算出する。 For example, when determining the timing information of the mismatch element “say.” that follows the matching element “beautiful as you”, the calculation is based on the time per foot (hereinafter referred to as “foot time length”) ft, as follows.

ｆｔ＝「beautiful as you」の発話終了時間−「beautiful as you」の発話開始時間＝６１．４８−６１．２８＝０．２
「say」の発話開始時間＝「beautiful as you」の発話終了時間＝６１．４８
「say」の発話終了時間＝「say」の発話開始時間＋ｆｔ＝６１．６８

１フットの等時性はピリオドで区切られる１文区間内で保たれるものとする。一文内の複数の一致要素から異なる時間長（ｆｔ）が得られる場合は、それらの平均値を用いる、又は、タイミング情報を決定する不一致要素に最も近い位置にある一致要素のフット時間長を用いるなどしてもよい。
ft = (utterance end time of “beautiful as you”) − (utterance start time of “beautiful as you”) = 61.48 − 61.28 = 0.2
Utterance start time of “say” = utterance end time of “beautiful as you” = 61.48
Utterance end time of “say” = utterance start time of “say” + ft = 61.68

The isochrony of one foot is assumed to hold within a single sentence delimited by periods. When multiple matching elements in one sentence yield different foot time lengths (ft), their average may be used, or the foot time length of the matching element closest to the mismatch element whose timing information is being determined may be used.
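The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function names `foot_time_length` and `extrapolate_after` are hypothetical, and the results are rounded to two decimals only to hide binary floating-point noise.

```python
def foot_time_length(start: float, end: float, feet: int = 1) -> float:
    """Time per foot of a matching element spanning `feet` feet."""
    return (end - start) / feet


def extrapolate_after(prev_end: float, ft: float, feet: int = 1):
    """Timing of a mismatch element that directly follows a matching element."""
    start = prev_end              # starts where the previous element ends
    return start, start + ft * feet


# Values from the text: "beautiful as you" spans 61.28 to 61.48 (one foot).
ft = foot_time_length(61.28, 61.48)
say_start, say_end = extrapolate_after(61.48, ft)

print(round(ft, 2), say_start, round(say_end, 2))
```

When several matching elements in the same sentence give different values, `ft` could instead be their mean or the value from the nearest matching element, as the text allows.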

（６）差分時間ベースタイミング情報決定
等時分節数が一意に決定できない文字列（以下、「等時分節未決要素」という）がある。それは、第１に棒読みか桁読みかがわからない数字文字列、第２に等時分節決定処理で対応していない文字列、などである。なお、対応していない文字列とは（等時分節決定に辞書を用いている場合）、辞書未登録の文字列が該当する。 (6) Determination of differential time base timing information There are character strings for which the number of isochronous segments cannot be uniquely determined (hereinafter referred to as “isochronous segment undecided elements”). These are, first, numeric strings for which it is unknown whether they are read digit by digit or as a number with place values and, second, character strings that the isochronous segment determination process does not handle. When a dictionary is used for isochronous segment determination, the latter are character strings not registered in the dictionary.

図１６の例では「１０９６３」が等時分節未決要素である。 In the example of FIG. 16, “10963” is an isochronous segment pending element.

「１０９６３」は「ten／nine／sixty／three」と発話する場合で４フット（「／」がフット境界）であり、「one／o／nine／six／three」と発話する場合で５フットとなる。このように、ＣＣの表記に対して実際の発話がどのようになるかは文脈に依存し、自動判別するためには更なる知識と計算量を要する。強いていずれかの読み方に決めて計算を行い、その読み方が実際の発話の読み方と異なっている場合、大きな誤差が生じる可能性がある。 “10963” is 4 feet when uttered as “ten / nine / sixty / three” (“/” marks a foot boundary) and 5 feet when uttered as “one / o / nine / six / three”. Thus, how a CC notation is actually uttered depends on context, and determining it automatically requires additional knowledge and computation. If one reading is forcibly chosen for the calculation and it differs from the reading actually uttered, a large error may result.

等時分節未決要素が混在するたびに誤差が生じる可能性が増すため、等時分節ベースタイミング情報決定部１０９のみでは、誤差の小さいタイミング情報を算出することができない場合がある。 Since there is an increased possibility of an error every time isochronous segment pending elements are mixed, there are cases where only the isochronous segment base timing information determination unit 109 cannot calculate timing information with a small error.

そこで、ステップＳ３において、等時分節ベースタイミング情報決定を行う際、字幕要素列上の一致要素と不一致要素の間に等時分節未決要素が存在する場合には、ステップＳ４において、差分時間ベースタイミング決定部１１０が、不一致要素のタイミング情報を決定する。 Therefore, when determining the isochronous segment base timing information in step S3, if there is an isochronous segment pending element between the matching element and the mismatching element on the caption element sequence, the differential time base timing is determined in step S4. The determination unit 110 determines the timing information of the mismatched elements.

差分時間ベースタイミング情報決定部１１０は、オリジナル字幕情報のタイミング情報を用いる。オリジナル字幕情報のタイミング情報は、オリジナル字幕情報で１画面表示単位の字幕文字列毎に付与されている。これらの表示開始時間を、各字幕文字列の前端にあたる単語の表示開始時間とみなす。例えば、図１３のｃ２の字幕文字列の表示開始時間は２７．４８であり、これをｃ２の字幕文字列の前端単語「decided」の表示開始時間と見なす（ｅ２）。同様に、ｅ１、ｅ３、ｅ４の単語もオリジナルタイミング情報に基づいた表示開始時間（以下、オリジナル表示開始時間）を持つ単語となる。 The difference time base timing information determination unit 110 uses the timing information of the original caption information. The timing information of the original subtitle information is given for each subtitle character string in one screen display unit in the original subtitle information. These display start times are regarded as display start times of words corresponding to the front end of each subtitle character string. For example, the display start time of the subtitle character string of c2 in FIG. 13 is 27.48, which is regarded as the display start time of the front end word “decided” of the subtitle character string of c2 (e2). Similarly, the words e1, e3, and e4 are words having a display start time (hereinafter referred to as original display start time) based on the original timing information.

図１６を用いて、差分時間ベースタイミング情報決定の流れを説明する。 The flow of differential time base timing information determination will be described with reference to FIG. 16.

まず、図１６の丸数字の１が示すように、等時分節ベースタイミング情報決定により、一致要素「pound」を基点としてオリジナルタイミング情報に基づいた表示開始時間を持つ要素「after」のタイミング情報を算出する。「after」は本来１フットではないが、前方に等時分節未決要素が存在するため、便宜上１要素となる。 First, as indicated by circled number 1 in FIG. 16, isochronous segment base timing information determination is used, with the matching element “pound” as the base point, to calculate the timing information of the element “after”, which has a display start time based on the original timing information. Although “after” is not by itself one foot, it is treated as one element for convenience because an isochronous segment undecided element exists ahead of it.

次に、図１６の丸数字の２に示すように、「after」のオリジナル表示開始時間７１．０３と、等時分節ベースタイミング情報決定手段で算出したタイミング情報の表示開始時間６３．５３の差分時間７．５を算出する。 Next, as indicated by the circled number 2 in FIG. 16, the difference between the original display start time 71.03 of “after” and the display start time 63.53 of the timing information calculated by the isochronous segment base timing information determining means. Calculate time 7.5.

次に、図１６の丸数字の３に示すように、この差分時間を「Just dial」のオリジナル表示開始時間６９．５６から引くことで、６２．０６という表示開始時間を得る。 Next, as shown by a circled number 3 in FIG. 16, this difference time is subtracted from the original display start time 69.56 of “Just dial” to obtain a display start time of 62.06.

以上の処理で前後端両方向に隣接する要素のタイミング情報が判明し、等時分節未決要素「１０９６３」のタイミング情報が決定する。 With the above processing, the timing information of elements adjacent in both the front and rear ends is found, and the timing information of the isochronous segment undetermined element “10963” is determined.
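Steps 2 and 3 of FIG. 16 reduce to two subtractions, sketched below with the values quoted in the text. The function names `differential_time` and `shift_original` are hypothetical, and rounding is applied only to hide floating-point noise.

```python
def differential_time(original_start: float, computed_start: float) -> float:
    """Offset between an original display start time and the start time
    computed by isochronous-segment-based determination (step 2)."""
    return original_start - computed_start


def shift_original(original_start: float, diff: float) -> float:
    """Map another original display start time onto the speech time axis
    by subtracting the same offset (step 3)."""
    return original_start - diff


# "after": original start 71.03, computed start 63.53
diff = differential_time(71.03, 63.53)
# "Just dial": original start 69.56
just_dial_start = shift_original(69.56, diff)

print(round(diff, 2), round(just_dial_start, 2))
```

The sketch relies on the same assumption as the text: the original timing information is offset from the actual speech by a roughly constant amount across neighbouring caption screens.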

（７）字幕グループ作成
図１２で示すように、ステップＳ１において音声ベースタイミング情報決定、ステップＳ３において等時分節ベースタイミング情報決定、及び、ステップＳ４において差分時間ベースタイミング情報決定することによって全要素にタイミング情報が決定されたら、次に、ステップＳ６において字幕グループ作成部１０５が字幕グループを作成する。 (7) Subtitle group creation As shown in FIG. 12, the audio base timing information is determined in step S1, the isochronous segment base timing information is determined in step S3, and the differential time base timing information is determined in step S4. Once the timing information is determined, the subtitle group creation unit 105 next creates a subtitle group in step S6.

決定されたタイミング情報で、ある要素の表示終了時間とその次の要素の表示開始時間が同一でなければ、それら２要素を同じ字幕グループとはしない。 If the display end time of a certain element is not the same as the display start time of the next element in the determined timing information, these two elements are not set as the same caption group.
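The rule above gives a necessary condition for two elements to share a subtitle group. A minimal sketch of applying it is shown below; the function name `split_on_gaps`, the tolerance `tol`, and the sample elements are assumptions made for illustration. The text requires the times to be identical, and the tolerance only guards against floating-point noise. Further splitting by the format rules is not modelled here.

```python
from typing import List, Tuple

Element = Tuple[str, float, float]  # (text, display start, display end)


def split_on_gaps(elements: List[Element], tol: float = 1e-6) -> List[List[Element]]:
    """Start a new group whenever an element's start time differs from the
    previous element's end time."""
    groups: List[List[Element]] = []
    for el in elements:
        if groups and abs(groups[-1][-1][2] - el[1]) <= tol:
            groups[-1].append(el)   # contiguous with the previous element
        else:
            groups.append([el])     # gap found: cannot share a group
    return groups


# Hypothetical element sequence with one gap before the last element.
elements = [
    ("house is not", 60.88, 61.08),
    ("<element>", 61.08, 61.28),
    ("beautiful as you", 61.28, 61.48),
    ("<later element>", 62.10, 62.30),
]
groups = split_on_gaps(elements)
```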

ユーザの字幕表示形態に対する要望は、行数や文字数の指定等が考えられる。 The user's request for the subtitle display form may be to specify the number of lines or the number of characters.

また、可読性を考慮した書式とは、 Formats that take readability into account may include the following:

・統計的に得た人間の読解速度を考慮して、一回に表示する文字数と表示時間を決める。 ・ Determine the number of characters displayed at one time and the display time in consideration of statistically obtained human reading speed.

・終止符（「．」）、コンマ（「，」）、疑問符（「？」）、感嘆符（「！」）などが表示文字列前端に来ないようにする。 ・ Ensure that a period (“.”), comma (“,”), question mark (“?”), exclamation mark (“!”), or the like does not come at the front end of a displayed character string.

・「Ｍｒ．」「Ｍｓ．」「Ｍｔ．」などの直後で区切らない。 ・ Do not break immediately after “Mr.”, “Ms.”, “Mt.”, and the like.

などが考えられる。
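The two punctuation-related rules in the list above can be expressed as a simple check on candidate break positions. This is a hedged sketch only: the tokenization (punctuation as separate tokens), the function name `can_break_before`, and the sample sentence are assumptions, and the reading-speed rule is not modelled.

```python
LEADING_PUNCT = (".", ",", "?", "!")     # must not start a displayed string
NO_BREAK_AFTER = {"Mr.", "Ms.", "Mt."}   # must not end a displayed string


def can_break_before(tokens, i):
    """Return True if a subtitle group may start at tokens[i]."""
    if i <= 0 or i >= len(tokens):
        return False
    if tokens[i].startswith(LEADING_PUNCT):
        return False                     # punctuation would come first
    if tokens[i - 1] in NO_BREAK_AFTER:
        return False                     # break directly after "Mr." etc.
    return True


# Hypothetical token sequence.
tokens = ["We", "met", "Mr.", "Smith", ",", "then", "left", "."]
allowed = [i for i in range(len(tokens)) if can_break_before(tokens, i)]
```

Here a break is rejected before "Smith" (it follows "Mr."), before the comma, and before the final period.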

最後に、各字幕グループ前端フットの表示開始時間を字幕グループの表示開始時間とし、字幕グループ後端フットの表示終了時間を字幕グループの表示終了時間とする。字幕グループの表示開始、終了時間の決め方についてはユーザ要望や可読性に基づいて意図的に時間差を与えることも考えられる。例えば以下に示すように、実際の発話時間よりも字幕の発話時間を長くすることで、想定するユーザの字幕読解速度が実際の発話よりも遅い場合に対応する。 Finally, the display start time of each subtitle group front end foot is the subtitle group display start time, and the subtitle group rear end foot display end time is the subtitle group display end time. Regarding how to determine the display start and end times of the subtitle group, it may be possible to intentionally give a time difference based on user requests and readability. For example, as shown below, the subtitle utterance time is set longer than the actual utterance time to cope with a case where the assumed subtitle reading speed of the user is slower than the actual utterance.

・字幕グループｇの表示終了時間と、ｇの次に表示される字幕グループの表示開始時間との間に一定以上の時間差Ｔｄがあれば、字幕グループｇの表示終了時間にＴｄ以下の加算を行って字幕グループｇが対応する発話終了後も一定時間字幕が表示されるようにする。・ If there is a time difference Td of at least a certain length between the display end time of subtitle group g and the display start time of the subtitle group displayed next after g, a value of Td or less is added to the display end time of g so that the subtitle remains displayed for a certain time after the utterance corresponding to g has ended.

・字幕グループｇの表示開始時間と、ｇの前に表示される字幕グループの表示終了時間との間に一定以上の時間差Ｔｄがあれば、字幕グループｇの表示開始時間からＴｄ以下の減算を行って字幕グループｇが対応する発話が開始する一定時間前から字幕グループｇが表示されるようにする。・ If there is a certain time difference Td between the display start time of the subtitle group g and the display end time of the subtitle group displayed before g, subtract Td or less from the display start time of the subtitle group g. Thus, the subtitle group g is displayed from a predetermined time before the utterance corresponding to the subtitle group g starts.
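The two adjustments above can be sketched as one pass over the group times. The parameters `min_gap`, `tail` and `lead`, the function name `pad_groups`, and the sample times are assumptions. The text only requires the gap Td to be at least a certain length and each adjustment to be Td or less. Capping the lead time by what remains of the gap is an added design choice so that neighbouring groups never overlap.

```python
from typing import List, Tuple


def pad_groups(times: List[Tuple[float, float]],
               min_gap: float = 0.5,
               tail: float = 0.3,
               lead: float = 0.2) -> List[Tuple[float, float]]:
    """Lengthen display intervals where the gap to the neighbour is large.

    times: (display start, display end) per subtitle group, in display order.
    """
    padded = [list(t) for t in times]
    for k in range(len(times) - 1):
        gap = times[k + 1][0] - times[k][1]      # Td between g and the next group
        if gap >= min_gap:
            extend = min(tail, gap)              # addition of Td or less
            advance = min(lead, gap - extend)    # subtraction of Td or less
            padded[k][1] += extend               # keep g on screen a little longer
            padded[k + 1][0] -= advance          # show the next group a little earlier
    return [tuple(p) for p in padded]


# Hypothetical group times: a 1.0 s gap, then a 0.1 s gap.
times = [(60.0, 61.0), (62.0, 63.0), (63.1, 64.0)]
padded = pad_groups(times)
```

With these values only the first gap is wide enough, so the first group ends later and the second starts earlier, while the third is unchanged.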

（８）効果
本実施例によれば、英語においても、モーラを用いる日本語と同様に、タイミング情報を算出でき、これらの文字列を前端とする字幕グループを作ることができ、ＣＣ表示状態に対する要望に対応することができる。 (8) Effects According to this embodiment, timing information can be calculated for English as well, in the same way as for Japanese, which uses the mora; subtitle groups having these character strings at their front end can be created; and requests regarding the CC display state can be met.

以下、本発明の実施例５の字幕情報作成装置１０について図１７〜図１９に基づいて説明する。 Hereinafter, a caption information creating apparatus 10 according to a fifth embodiment of the present invention will be described with reference to FIGS. 17 to 19.

本実施例は、等時分節未決定範囲内の等時分節数を推定するものであり、等時分節推定部１１１を有している。未決範囲は、前後端が等時分節未決要素であり、範囲内に一致要素、又は、オリジナル表示開始時間を持つ字幕要素が存在しない範囲とする。但し、複数の範囲が隣接又は重複する場合は、それらをまとめた最大の範囲のみを認める。 The present embodiment estimates the number of isochronous segments within the isochronous segment undetermined range, and includes an isochronous segment estimation unit 111. The undecided range is a range in which the front and rear ends are isochronous segment undecided elements, and there is no matching element or caption element having the original display start time in the range. However, if multiple ranges are adjacent or overlap, only the maximum range that combines them is allowed.

（１）字幕情報作成装置１０の構成
図１７は、本実施例に係わる字幕情報作成装置１０を示すブロック図である。実施例４の図１１との異なる箇所についてのみ説明する。 (1) Configuration of Subtitle Information Creation Device 10 FIG. 17 is a block diagram showing the caption information creation device 10 according to the present embodiment. Only a different part from FIG. 11 of Example 4 is demonstrated.

等時分節推定部１１１は、未決範囲を対象に、範囲内のフット数を推定する。 The isochronous segment estimation unit 111 estimates the number of feet in the range for the undecided range.

等時分節ベースタイミング情報決定部１０９は、推定されたフット数に基づいて、範囲内の各字幕要素のタイミング情報を決定する。 The isochronous segment base timing information determination unit 109 determines the timing information of each subtitle element within the range based on the estimated number of feet.

（２）等時分節数推定
図１８を用いて、等時分節推定について説明する。 (2) Isochronous Segment Estimation With reference to FIG. 18, isochronous segment estimation will be described.

字幕要素列上の「CHI」、「or」、「ISWC」の３要素が１つの未決範囲を構成している。この場合、範囲後方に隣接する「next」（前方に未決範囲があるため、便宜上１要素としているが「フット」ではない）は未決範囲の後端フットの一部になると予想できるので、「send it to」の表示終了時間と「year」の表示開始時間の差分を対象範囲の総経過時間とする。 The three elements “CHI”, “or”, and “ISWC” on the caption element string constitute one undecided range. In this case, it is expected that “next” adjacent to the rear of the range (which is a single element for convenience because it has a pending range ahead but not a “foot”) will be part of the rear foot of the pending range. The difference between the display end time of “it to” and the display start time of “year” is defined as the total elapsed time of the target range.

例では７２．７９−７２．１９＝０．６が得られる。次に範囲内の各要素のフット候補の全組み合わせを作り、範囲の総時間に基づいて組み合わせ毎に１フット当たりの時間を算出する。 In the example 72.79-72.19 = 0.6 is obtained. Next, all combinations of foot candidates for each element within the range are created, and the time per foot is calculated for each combination based on the total time of the range.

「CHI」は、「Computer Human Interaction」の略称であり、ユーザインタフェース関連で世界最大規模の学会である。これは一般に「カイ」（１フット）あるいは１文字ずつ「シー、エイチ、アイ」（３フット）で発話される。 “CHI” is an abbreviation of “Computer Human Interaction”, the world's largest academic conference related to user interfaces. It is generally uttered as “Kai” (1 foot) or letter by letter as “C, H, I” (3 feet).

「ISWC」は、「International Symposium on Wearable Computers」の略称であり、「イーズウィック」（１フット）あるいは「アイ、エス、ダブリュー、シー」（４フット）で発話される。したがって組み合わせの数は２×２＝４通りである。 “ISWC” is an abbreviation of “International Symposium on Wearable Computers” and is uttered as “Easwick” (1 foot) or letter by letter as “I, S, W, C” (4 feet). The number of combinations is therefore 2 × 2 = 4.

算出した時間から、未決範囲前後の１フット当たりの時間に基づいて１つを選択する。例では範囲前方のフット「send it to」から１フット当たり０．３秒を得る。そこで、１フット当たりの時間として０．３が算出されるフット数の組み合わせである「カイ」（１フット）＋「イーズウィック」（１フット）を選択し、これを推定結果とする。選択方法は、例えば未決範囲の前後の１フット当たりの時間を平均し、その値に最も近い時間を選択するなどしてもよい。 From the calculated time, one is selected based on the time per foot before and after the pending range. In the example, 0.3 seconds per foot is obtained from the foot “send it to” in front of the range. Therefore, “Kai” (1 foot) + “Easwick” (1 foot), which is a combination of the number of feet for which 0.3 is calculated as the time per foot, is selected and used as the estimation result. As the selection method, for example, the time per foot before and after the pending range may be averaged, and the time closest to the value may be selected.
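The selection just described can be sketched by enumerating the foot-count combinations and comparing each resulting time per foot with the reference value. The function name `estimate_feet` is hypothetical. Following the arithmetic in the text (0.6 s over 2 feet gives 0.3 s), the sketch assumes that “or” does not contribute a foot of its own.

```python
from itertools import product


def estimate_feet(total_time, candidates, reference_ft):
    """Pick the foot-count combination whose time per foot is closest to
    the reference foot time length taken from outside the undecided range."""
    best = None
    for combo in product(*candidates.values()):
        ft = total_time / sum(combo)          # time per foot for this combination
        err = abs(ft - reference_ft)
        if best is None or err < best[0]:
            best = (err, combo, ft)
    return dict(zip(candidates, best[1])), best[2]


total = 72.79 - 72.19                          # total elapsed time of the range
candidates = {"CHI": (1, 3), "ISWC": (1, 4)}   # foot-count candidates per element
chosen, ft = estimate_feet(total, candidates, reference_ft=0.3)

print(chosen, round(ft, 2))
```

The four combinations give roughly 0.30, 0.12, 0.15 and 0.09 seconds per foot, so the combination of one foot each is the one closest to the reference of 0.3 seconds.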

得られた推定結果に基づき、図１８の字幕要素列のフット構成とそれぞれのタイミング情報は、図１９のように決定できる。 Based on the obtained estimation result, the foot configuration of the caption element sequence in FIG. 18 and the respective timing information can be determined as shown in FIG.

Modifications

なお、本発明は上記実施例そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施例に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施例に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施例にわたる構成要素を適宜組み合わせてもよい。 Note that the present invention is not limited to the above-described embodiments as they are, and can be embodied by modifying the components without departing from the scope of the invention in the implementation stage. Various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiments. Furthermore, constituent elements over different embodiments may be appropriately combined.

例えば、上記各実施例では、言語として、日本語と英語に適用したが、本発明は、これ以外のドイツ語、フランス語、イタリア語、スペイン語、ロシア語、中国語、ハングル語などにも適用できる。 For example, in each of the above embodiments, the language is applied to Japanese and English, but the present invention is also applicable to German, French, Italian, Spanish, Russian, Chinese, Korean, etc. it can.

FIG. 1: 本発明の実施例１の字幕情報作成装置１０のブロック図である。 Block diagram of the caption information creating apparatus 10 according to Embodiment 1 of the present invention.
FIG. 2: 実施例１の動作を示すフローチャートである。 Flowchart showing the operation of Embodiment 1.
FIG. 3: オリジナル字幕情報とタイミング単位文字列の例である。 Example of original subtitle information and timing unit character strings.
FIG. 4: 音声ベースタイミング情報決定の説明図である。 Explanatory diagram of audio base timing information determination.
FIG. 5: モーラベースタイミング情報決定の説明図である。 Explanatory diagram of mora base timing information determination.
FIG. 6: 差分時間ベースタイミング情報決定の説明図である。 Explanatory diagram of differential time base timing information determination.
FIG. 7: 実施例２の字幕情報作成装置１０のブロック図である。 Block diagram of the caption information creating apparatus 10 of Embodiment 2.
FIG. 8: モーラ数推定の説明図である。 Explanatory diagram of mora number estimation.
FIG. 9: 実施例３の字幕情報作成装置１０のブロック図である。 Block diagram of the caption information creating apparatus 10 of Embodiment 3.
FIG. 10: テレビに字幕を表示した状態の図である。 Diagram of subtitles displayed on a television.
FIG. 11: 実施例４の字幕情報作成装置１０のブロック図である。 Block diagram of the caption information creating apparatus 10 of Embodiment 4.
FIG. 12: 実施例４の動作を示すフローチャートである。 Flowchart showing the operation of Embodiment 4.
FIG. 13: オリジナル字幕情報とタイミング単位文字列の例である。 Example of original subtitle information and timing unit character strings.
FIG. 14: 音声ベースタイミング情報決定の説明図である。 Explanatory diagram of audio base timing information determination.
FIG. 15: 等時分節ベースタイミング情報決定の説明図である。 Explanatory diagram of isochronous segment base timing information determination.
FIG. 16: 差分時間ベースタイミング情報決定の説明図である。 Explanatory diagram of differential time base timing information determination.
FIG. 17: 実施例５の字幕情報作成装置１０のブロック図である。 Block diagram of the caption information creating apparatus 10 of Embodiment 5.
FIG. 18: 等時分節推定の説明図である。 Explanatory diagram of isochronous segment estimation.
FIG. 19: 等時分節推定の結果に基づく図１７のフット構成とタイミング情報の表である。 Table of the foot configuration and timing information of FIG. 17 based on the result of isochronous segment estimation.

Explanation of symbols

10: 字幕情報作成装置 (subtitle information creation apparatus)
100: タイミング単位文字列抽出部 (timing unit character string extraction unit)
101: 音声ベースタイミング情報決定部 (audio base timing information determination unit)
102: モーラ数決定部 (mora number determination unit)
103: モーラベースタイミング情報決定部 (mora base timing information determination unit)
104: 差分時間ベースタイミング情報決定部 (differential time base timing information determination unit)
105: 字幕グループ作成部 (subtitle group creation unit)
106: モーラ数推定部 (mora number estimation unit)
107: タイミング情報決定部 (timing information determination unit)
108: タイミング情報記憶部 (timing information storage unit)
109: 等時分節ベースタイミング情報決定部 (isochronous segment base timing information determination unit)
110: 差分時間ベースタイミング情報決定部 (differential time base timing information determination unit)
111: 等時分節推定部 (isochronous segment estimation unit)

Claims

In a caption information creation device that creates caption information for audio,
Receiving original subtitle information including original timing information, which is time information indicating a reference correspondence with the voice provided for each arbitrary number of characters in the subtitle character string group and the subtitle character string created in advance, An extraction unit that extracts a timing unit character string that gives timing information that is time information for displaying the subtitle character string in synchronization with the voice from the subtitle character string group;
First timing information determination is performed by associating the voice with the timing unit character string by voice recognition, and determining the timing information based on time information of the voice for the timing unit character string with which the correspondence has been made. And
Based on the timing information of the determined timing unit character string for which the timing information is determined, using the number of isochronous segments that are isochronous segments included in each timing unit character string, A second timing information determination unit that determines the timing information of an undecided timing unit character string for which timing information could not be determined;
When an isochronous segment undecided character string that is the timing unit character string in which the number of isochronous segments cannot be uniquely determined exists between the pair of timing unit character strings having the original timing information, (1) The difference time between the timing information of the determined timing unit character string and the original timing information is obtained, and (2) the timing information is obtained by subtracting the difference time from the original timing information of the pair of timing unit character strings. (3) a third timing information determination unit that determines the timing information of the isochronous segment undecided character string between them using the timing information of the pair of timing unit character strings;
A subtitle group that is a subtitle character string obtained by dividing the subtitle character string group according to an arbitrary format, the determined timing unit character string for displaying each subtitle group in synchronization with the audio, and the undecided timing unit A creation unit for creating the subtitle information including a character string or the timing information of each of the isochronous segment undecided character strings;
A subtitle information creating apparatus characterized by comprising:

Finding an undecided range in which the front and rear ends of the subtitle character string group are the isochronous segment undecided character string and the character string group not including the determined timing unit character string,
Based on a combination of the timing information of the determined timing unit character strings adjacent before and after the undecided range and isochronous segment candidates of the timing unit character strings in the undecided range, 1 isochronous for each combination Calculate the display time per segment,
Choose the combination of isochronous segment candidates that gives the closest time compared to the display time per isochronous segment before and after the pending range;
An estimation unit that estimates the number of selected isochronous segment candidates as the number of isochronous segments of the timing unit character string within the pending range;
The closed caption information creating apparatus according to claim 1.

Timing associated information indicating that the timing information is determined by the first timing information determination unit, the second timing information determination unit, or the third timing information determination unit is associated with each timing information. A timing-accompanying information storage unit for storing
The creation unit sets the timing unit character string having the timing information with a small time error based on the timing-accompanying information at a front end of the caption group.
The closed caption information creating apparatus according to claim 1.

The second timing information determining unit includes an isochronous segment number determining unit that determines the number of isochronous segments included in the timing unit character string.
The closed caption information creating apparatus according to claim 1.

The extraction unit divides the timing unit character string with the isochronous segment as a minimum constituent unit.
The closed caption information creating apparatus according to claim 1.

The isochronous segment is a mora or a foot.
The closed caption information creating apparatus according to claim 1.

In the caption information creation method for creating caption information for audio,
Receiving original subtitle information including original timing information, which is time information indicating a reference correspondence with the voice provided for each arbitrary number of characters in the subtitle character string group and the subtitle character string created in advance, An extraction step of extracting a timing unit character string that gives timing information, which is time information for displaying the subtitle character string in synchronization with the voice, from the subtitle character string group;
First timing information determination is performed by associating the voice with the timing unit character string by voice recognition, and determining the timing information based on time information of the voice for the timing unit character string with which the correspondence has been made. Steps,
Based on the timing information of the determined timing unit character string for which the timing information is determined, using the number of isochronous segments that are isochronous segments included in each timing unit character string, A second timing information determination step for determining the timing information of an undecided timing unit character string for which timing information could not be determined;
When an isochronous segment undecided character string that is the timing unit character string in which the number of isochronous segments cannot be uniquely determined exists between the pair of timing unit character strings having the original timing information, (1) The difference time between the timing information of the determined timing unit character string and the original timing information is obtained, and (2) the timing information is obtained by subtracting the difference time from the original timing information of the pair of timing unit character strings. (3) a third timing information determining step of determining the timing information of the isochronous segment undecided character string between them using the timing information of the pair of timing unit character strings;
A subtitle group that is a subtitle character string obtained by dividing the subtitle character string group according to an arbitrary format, the determined timing unit character string for displaying each subtitle group in synchronization with the audio, and the undecided timing unit Creating the subtitle information including a character string or the timing information of each of the isochronous segment undecided character strings;
A method for creating subtitle information, comprising:

On the computer,
Receiving original subtitle information including original timing information which is time information indicating a reference correspondence with a subtitle character string group created in advance and audio given for each arbitrary number of characters in the subtitle character string, An extraction function for extracting from the subtitle character string group a timing unit character string that gives timing information that is time information for displaying the subtitle character string in synchronization with voice;
First timing information determination for associating the speech with the timing unit character string by speech recognition and determining the timing information based on the time information of the speech for the timing unit character string that has been associated Function and
Based on the timing information of the determined timing unit character string for which the timing information has been determined, using the number of isochronous segments that are isochronous sound segments included in each timing unit character string, A second timing information determination function for determining the timing information of an undecided timing unit character string for which timing information could not be determined;
When an isochronous segment undecided character string that is the timing unit character string in which the number of isochronous segments cannot be uniquely determined exists between the pair of timing unit character strings having the original timing information, (1) The difference time between the timing information of the determined timing unit character string and the original timing information is obtained, and (2) the timing information is obtained by subtracting the difference time from the original timing information of the pair of timing unit character strings. (3) a third timing information determining function for determining the timing information of the isochronous segment undecided character string between them using the timing information of the pair of timing unit character strings;
A subtitle group that is a subtitle character string obtained by dividing the subtitle character string group according to an arbitrary format, the determined timing unit character string for displaying each subtitle group in synchronization with the audio, and the undecided timing unit A creation function for creating the subtitle information including a character string or the timing information of each of the isochronous segment undecided character strings;
Subtitle information creation program for realizing