JP2018091998A

JP2018091998A - Information processing system and information processing method

Info

Publication number: JP2018091998A
Application number: JP2016235267A
Authority: JP
Inventors: 優樹瀬戸; Yuki Seto; 陽前澤; Akira Maezawa; 貴裕岩田; Takahiro Iwata
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2016-12-02
Filing date: 2016-12-02
Publication date: 2018-06-14
Anticipated expiration: 2036-12-02
Also published as: JP6809177B2

Abstract

PROBLEM TO BE SOLVED: To reduce the work load for providing information concerning plural character strings pronounced in time series. SOLUTION: An information processing system 20 comprises: a time analysis unit 62 which generates time information Tn showing the time point at which a character string is pronounced on the time axis, for each of plural character strings, by analyzing character string information B which represents the plural character strings and an acoustic signal X which represents the sound pronounced for the plural character strings sequentially; and an information correspondence unit 64 which associates, for each of the plural character strings, delivery information which shows related information concerning the character string, with the time information Tn which the time analysis unit 62 generates for the character string. SELECTED DRAWING: Figure 9

Description

The present invention relates to a technique for providing information to a terminal device.

Techniques have been proposed for providing information about live entertainment events, such as plays and musical performances, to a user's terminal device in parallel with the progress of the event. For example, Patent Document 1 discloses a configuration in which time codes are sequentially transmitted to a user's portable device in parallel with the progress of the event, and explanatory information such as subtitles is displayed on the portable device at the time point specified by each time code.

Patent Document 1: JP 2009-213180 A

However, to display explanatory information such as subtitles on the portable device with the technique of Patent Document 1, the correspondence between the explanatory information and the time codes must be determined in advance, and the workload of determining this correspondence is heavy. In view of these circumstances, an object of the present invention is to reduce the workload of providing a user with information on a plurality of character strings that are pronounced in time series.

To solve the above problem, an information processing system according to a preferred aspect of the present invention includes: a time analysis unit that analyzes character string information representing a plurality of character strings and an acoustic signal representing a sound in which the plurality of character strings are sequentially pronounced, and thereby generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which that character string is pronounced; and an information correspondence unit that associates, for each of the plurality of character strings, distribution information indicating related information on that character string with the time information that the time analysis unit generated for that character string.
In an information processing method according to a preferred aspect of the present invention, a computer system analyzes character string information representing a plurality of character strings and an acoustic signal representing a sound in which the plurality of character strings are sequentially pronounced, thereby generating, for each of the plurality of character strings, time information indicating the time point on the time axis at which that character string is pronounced, and associates, for each of the plurality of character strings, distribution information indicating related information on that character string with the time information generated for that character string.

FIG. 1 is a configuration diagram of an information providing system according to a first embodiment of the present invention.
FIG. 2 is a configuration diagram of an information distribution system.
FIG. 3 is an explanatory diagram of an acoustic signal and a modulation signal.
FIG. 4 is a schematic diagram of a reference table.
FIG. 5 is a flowchart of the operation of the information distribution system.
FIG. 6 is a configuration diagram of a terminal device.
FIG. 7 is a schematic diagram of a reference table.
FIG. 8 is a flowchart of the operation of the terminal device.
FIG. 9 is a configuration diagram of an information processing system.
FIG. 10 is a schematic diagram of character string information.
FIG. 11 is a flowchart of the operation of the information processing system.
FIG. 12 is a configuration diagram of an information processing system according to a second embodiment.

<First Embodiment>
FIG. 1 is a configuration diagram of an information providing system 100 according to the first embodiment of the present invention. The information providing system 100 is a computer system for providing a user with information about various entertainment events (for example, plays, musical performances, or movies) in parallel with their progress. As illustrated in FIG. 1, the information providing system 100 of the first embodiment includes an information distribution system 10 and an information processing system 20. The information distribution system 10 is installed in a facility, such as a theater or a hall, where the events are held. A user who visits the facility carries a terminal device 30. The terminal device 30 is a portable information terminal such as a mobile phone or a smartphone. In practice a plurality of terminal devices 30 are present in the facility, but the following description focuses on one arbitrary terminal device 30 for convenience.

The first embodiment assumes that a show (for example, a character show) in which a plurality of predetermined character strings (hereinafter referred to as “pronunciation character strings”) are pronounced in time series in a predetermined order is held in the facility. That is, the pronunciation character strings of the first embodiment are, for example, lines of dialogue, lyrics, or narration pronounced as the show progresses. The user watches the show while carrying the terminal device 30. The terminal device 30 sequentially displays, in parallel with the progress of the show, information related to each of the pronunciation character strings pronounced in time series (hereinafter referred to as “related information”). The first embodiment illustrates a case in which a translation of each pronunciation character string is displayed on the terminal device 30 as the related information. For example, each pronunciation character string is pronounced in Japanese, while its English translation is displayed as the related information. With this configuration, a viewer can check the related information sequentially displayed on the terminal device 30 while watching the show, so that, for example, a foreign visitor who has difficulty understanding the language of the pronunciation character strings can follow the content of the show.

<Information distribution system 10>
FIG. 2 is a configuration diagram of the information distribution system 10. As illustrated in FIG. 2, the information distribution system 10 is realized by a computer system including a control device 12, a storage device 14, and a sound emitting device 16. The control device 12 is a processing circuit including, for example, a CPU (Central Processing Unit), and comprehensively controls the entire information distribution system 10. The storage device 14 is configured by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or by a combination of a plurality of types of recording media, and stores a program executed by the control device 12 and various data used by the control device 12. The storage device 14 of the first embodiment stores an acoustic signal X and a reference table Qa.

The acoustic signal X is a time-domain signal representing speech in which a plurality of pronunciation character strings are sequentially pronounced. FIG. 3 is an explanatory diagram of the acoustic signal X. As illustrated in FIG. 3, the acoustic signal X of the first embodiment contains a pronunciation period An (A1, A2, A3, ...) for each of a plurality of different pronunciation character strings Ln (L1, L2, L3, ...), where n is a natural number. The pronunciation period An corresponding to any one pronunciation character string Ln is the period on the time axis in which that pronunciation character string Ln is pronounced. The length of each pronunciation period An is variable and depends on the length of the pronunciation character string Ln. The acoustic signal X is generated in advance either by recording speech uttered by a speaker such as a voice actor or by a known speech synthesis process. Reproducing the speech represented by the acoustic signal X in parallel with a show in which performers such as animated characters act creates the effect that each performer is actually speaking.

FIG. 4 is an explanatory diagram of the reference table Qa. As illustrated in FIG. 4, the reference table Qa is a data table that associates, for each of the plurality of pronunciation character strings Ln, distribution information Dn (D1, D2, D3, ...) indicating the related information Cn of that pronunciation character string Ln with time information Tn (T1, T2, T3, ...) indicating the time point on the time axis at which that pronunciation character string Ln is pronounced in the acoustic signal X. In the first embodiment, the distribution information Dn corresponding to any one piece of related information Cn is identification information for identifying that related information Cn. Because the pronunciation character string Ln and the related information Cn correspond to each other, the distribution information Dn of the first embodiment can also be regarded as information for identifying the pronunciation character string Ln. The time information Tn corresponding to any one pronunciation character string Ln designates the time point at which that pronunciation character string Ln is pronounced in the acoustic signal X (for example, the start point of the pronunciation period An). For example, the time information Tn designates the time elapsed from a predetermined time point (hereinafter referred to as the “reference time point”) until the pronunciation of the pronunciation character string Ln starts. Assuming that reproduction of the speech represented by the acoustic signal X starts at the same time as the show, the reference time point is the start point of the acoustic signal X (that is, the start of the show). A configuration in which the time information Tn represents the interval between successive time points at which the pronunciation character strings Ln are pronounced may also be adopted.
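The structure of the reference table Qa can be sketched roughly as follows. This is a minimal illustration rather than the format used in the patent: the names `QaEntry`, `build_qa`, and `intervals_to_elapsed`, the use of seconds for Tn, and the use of integers for Dn are all assumptions. The helper converts the alternative interval-style time information into elapsed times from the reference time point.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QaEntry:
    """One row of reference table Qa (field names are illustrative)."""
    time_s: float          # Tn: seconds elapsed from the reference time point
    distribution_id: int   # Dn: identifier of related information Cn


def intervals_to_elapsed(intervals: List[float]) -> List[float]:
    """Convert interval-style time information into elapsed times."""
    elapsed, total = [], 0.0
    for gap in intervals:
        total += gap
        elapsed.append(total)
    return elapsed


def build_qa(times_s: List[float], ids: List[int]) -> List[QaEntry]:
    """Pair each Tn with its Dn and order the rows by time."""
    if len(times_s) != len(ids):
        raise ValueError("each Tn needs exactly one Dn")
    return sorted((QaEntry(t, d) for t, d in zip(times_s, ids)),
                  key=lambda entry: entry.time_s)
```

Keeping the rows sorted by Tn means a consumer of the table can simply walk it from the top as the show progresses.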

By executing a program stored in the storage device 14, the control device 12 in FIG. 2 realizes a plurality of functions (an information management unit 42 and a signal processing unit 44) for causing the terminal device 30 to sequentially display the plurality of pieces of related information Cn. A configuration in which the functions of the control device 12 are distributed over a plurality of devices, or a configuration in which a dedicated electronic circuit realizes some of the functions of the control device 12, may also be adopted.

The information management unit 42 sequentially selects, in parallel with the progress of the show, each of the plurality of pieces of distribution information Dn corresponding to the different pronunciation character strings Ln. The reference table Qa is used to select each piece of distribution information Dn. Specifically, when the time point designated by any one piece of time information Tn in the reference table Qa arrives, the information management unit 42 selects the distribution information Dn that corresponds to that time information Tn in the reference table Qa. In the example of FIG. 4, the distribution information D1 is selected when the time point indicated by the time information T1 arrives, and the distribution information D2 is selected when the time point indicated by the time information T2 arrives. That is, each time a time point designated by time information Tn arrives (for example, the start point of the pronunciation period An in the acoustic signal X), the distribution information Dn corresponding to that time information Tn is selected.

The signal processing unit 44 generates an acoustic signal Z that includes, as an acoustic component, the distribution information Dn selected by the information management unit 42. Specifically, the signal processing unit 44 of the first embodiment includes a modulation processing unit 441 and a signal synthesis unit 442. The modulation processing unit 441 generates a modulation signal Y representing an acoustic component that indicates the distribution information Dn selected by the information management unit 42. For example, the modulation processing unit 441 generates the modulation signal Y by a modulation process such as frequency modulation, in which a carrier wave such as a sine wave of a predetermined frequency is modulated with the distribution information Dn, or spread modulation of the distribution information Dn using a spreading code. The distribution information Dn is contained in the modulation signal Y as an acoustic component in a predetermined frequency band. Specifically, the frequency band of the acoustic component of the distribution information Dn lies in a range above the frequency band of sounds, such as voices and musical tones, that the user hears in an ordinary environment (for example, 18 kHz or more and 20 kHz or less).
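As a rough illustration of the frequency-modulation option, the sketch below encodes an integer identifier as binary frequency-shift keying with two tones inside the 18 kHz to 20 kHz band. The sample rate, the two tone frequencies, the 10 ms bit length, the 8-bit identifier width, and the function name `modulate` are all assumptions chosen for the example; the text does not specify them, and a practical system would also need framing and error detection.

```python
import math
from typing import List

SAMPLE_RATE = 48_000          # Hz (assumption)
F0, F1 = 18_500.0, 19_500.0   # tones for bit 0 / bit 1 (assumptions)
BIT_SAMPLES = 480             # 10 ms per bit (assumption)
ID_BITS = 8                   # width of Dn in bits (assumption)


def modulate(distribution_id: int, amplitude: float = 0.1) -> List[float]:
    """Return samples of modulation signal Y carrying Dn as binary FSK."""
    if not 0 <= distribution_id < 2 ** ID_BITS:
        raise ValueError("distribution_id does not fit in ID_BITS")
    samples: List[float] = []
    phase = 0.0  # carried across bits so the waveform stays continuous
    for i in reversed(range(ID_BITS)):  # most significant bit first
        bit = (distribution_id >> i) & 1
        step = 2.0 * math.pi * (F1 if bit else F0) / SAMPLE_RATE
        for _ in range(BIT_SAMPLES):
            samples.append(amplitude * math.sin(phase))
            phase += step
    return samples
```

Keeping the phase continuous across bit boundaries avoids abrupt jumps in the waveform, which would otherwise spread energy into the audible band.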

As illustrated in FIG. 3, the acoustic component of any one piece of distribution information Dn is contained in a unit period Un of the modulation signal Y that corresponds to the time point designated by the time information Tn associated with that distribution information Dn. FIG. 3 illustrates, as the unit period Un, a period of predetermined length that starts at the time point designated by the time information Tn. As described above, the time information Tn designates the time point at which the pronunciation character string Ln is pronounced in the acoustic signal X. Therefore, the pronunciation period An of the acoustic signal X in which any one pronunciation character string Ln is pronounced and the unit period Un of the modulation signal Y that contains the acoustic component of the distribution information Dn corresponding to that pronunciation character string Ln overlap each other on the time axis. The acoustic component of the distribution information Dn may also be included a plurality of times within one unit period Un. It is also possible to start generating the acoustic component of the distribution information Dn from a time point before the time point indicated by the time information Tn (the start point of the pronunciation period An).

The signal synthesis unit 442 in FIG. 2 generates the acoustic signal Z by synthesizing the acoustic signal X stored in the storage device 14 with the modulation signal Y generated by the modulation processing unit 441. Specifically, the signal synthesis unit 442 generates the acoustic signal Z by adding the acoustic signal X and the modulation signal Y in the time domain. The acoustic signal Z generated by the signal synthesis unit 442 is supplied to the sound emitting device 16. For convenience, a D/A converter that converts the acoustic signal Z from digital to analog and an amplifier that amplifies the acoustic signal Z are not illustrated.
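The time-domain addition can be sketched as follows, with the modulation signal Y placed so that its unit period Un starts at the time point given by Tn. The function name `synthesize`, the representation of signals as lists of floats, and the zero-padding behaviour when Y runs past the end of X are assumptions made for illustration.

```python
from typing import List, Sequence


def synthesize(x: Sequence[float], y: Sequence[float],
               time_s: float, sample_rate: int) -> List[float]:
    """Add modulation signal y into acoustic signal x starting at time Tn."""
    if time_s < 0:
        raise ValueError("time_s must not be negative")
    offset = round(time_s * sample_rate)   # first sample of unit period Un
    z = list(x)
    shortfall = offset + len(y) - len(z)
    if shortfall > 0:                      # pad with silence if y overruns x
        z.extend([0.0] * shortfall)
    for i, sample in enumerate(y):
        z[offset + i] += sample            # sample-wise addition: Z = X + Y
    return z
```

For example, `synthesize([0.5, 0.5, 0.5, 0.5], [0.25, -0.25], 0.25, 4)` adds the two samples of Y at indices 1 and 2 and leaves the rest of X unchanged.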

The sound emitting device 16 in FIG. 2 is, for example, a speaker device installed in the facility, and reproduces within the facility the sound represented by the acoustic signal Z generated by the signal processing unit 44 (signal synthesis unit 442). Accordingly, the uttered sound of the pronunciation character string Ln is reproduced in each pronunciation period An of the acoustic signal X, and the acoustic component of the distribution information Dn is reproduced in each unit period Un of the modulation signal Y. That is, the acoustic component of the distribution information Dn is reproduced together with the uttered sound of the pronunciation character string Ln represented by the acoustic signal X. As understood from the above description, the sound emitting device 16 of the first embodiment functions not only as an acoustic device that reproduces the uttered sound of the pronunciation character string Ln but also as a transmitter that transmits the distribution information Dn to its surroundings by acoustic communication, which uses sound waves (air vibration) as the transmission medium. That is, the show is made up by sequentially reproducing the uttered sound of each of the pronunciation character strings Ln from the sound emitting device 16 in parallel with the acting of the performers, and each time the speech of a pronunciation character string Ln is reproduced, the distribution information Dn corresponding to that pronunciation character string Ln is transmitted to the terminal device 30 by acoustic communication.

FIG. 5 is a flowchart illustrating the operation of the information distribution system 10. The process of FIG. 5 starts together with the show, triggered, for example, by an instruction from the show operator. The process of FIG. 5 is executed in parallel with the operation of sequentially acquiring the acoustic signal X stored in the storage device 14 from its beginning and supplying it to the signal processing unit 44.

The information management unit 42 waits until the time point designated by the earliest not-yet-selected time information Tn among the plurality of pieces of time information Tn in the reference table Qa arrives (Sa1: NO). When the time point designated by the time information Tn arrives (Sa1: YES), the information management unit 42 selects the distribution information Dn corresponding to that time information Tn in the reference table Qa (Sa2).

The modulation processing unit 441 generates a modulation signal Y representing the acoustic component of the distribution information Dn selected by the information management unit 42 (Sa3). The signal synthesis unit 442 generates the acoustic signal Z by synthesizing the modulation signal Y generated by the modulation processing unit 441 with the acoustic signal X, and supplies the acoustic signal Z to the sound emitting device 16 (Sa4). That is, the distribution information Dn is transmitted from the sound emitting device 16 by acoustic communication together with the reproduction of the uttered sound of the pronunciation character string Ln. Until transmission of all the distribution information Dn registered in the reference table Qa is completed (Sa5: NO), the selection of distribution information Dn by the information management unit 42 (Sa1, Sa2) and the reproduction of the pronunciation character string Ln together with the transmission of that distribution information Dn (Sa3, Sa4) are repeated sequentially for each pronunciation character string Ln. When transmission of all the distribution information Dn is completed (Sa5: YES), the process of FIG. 5 ends.
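The loop of steps Sa1 to Sa5 can be sketched as below. This is a simplified illustration under several assumptions: the reference table Qa is passed as a list of `(time_s, distribution_id)` pairs, steps Sa3 and Sa4 are collapsed into a caller-supplied `transmit` callback, and waiting is done by polling a clock. The function name `run_distribution` and its parameters are hypothetical.

```python
import time
from typing import Callable, Iterable, Tuple


def run_distribution(table: Iterable[Tuple[float, int]],
                     transmit: Callable[[int], None],
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep,
                     poll_s: float = 0.001) -> None:
    """Select and transmit each Dn when the time point given by its Tn arrives."""
    start = clock()                          # reference time point (start of the show)
    for time_s, distribution_id in sorted(table):
        while clock() - start < time_s:      # Sa1: NO, keep waiting
            sleep(poll_s)
        transmit(distribution_id)            # Sa1: YES, then Sa2 to Sa4
    # Sa5: YES, every registered Dn has been transmitted
```

Injecting `clock` and `sleep` lets the loop be exercised with a simulated clock instead of real waiting.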

<Terminal device 30>
FIG. 6 is a configuration diagram of the terminal device 30. As illustrated in FIG. 6, the terminal device 30 includes a control device 32, a storage device 34, a sound collection device 36, and a display device 38. The control device 32 is a processing circuit including, for example, a CPU, and comprehensively controls the entire terminal device 30. The storage device 34 is configured by a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or by a combination of a plurality of types of recording media, and stores a program executed by the control device 32 and various data used by the control device 32.

The storage device 34 of the first embodiment stores a reference table Qb illustrated in FIG. 7. For example, the storage device 34 stores the reference table Qb that the terminal device 30 has received from an information distribution server (not shown) via a communication network such as a mobile communication network or the Internet. As illustrated in FIG. 7, the reference table Qb is a data table that associates distribution information Dn (D1, D2, D3, ...) with related information Cn (C1, C2, C3, ...) for each of the plurality of pronunciation character strings Ln pronounced in the show. As described above, the related information Cn in the first embodiment is text data representing a translation of the pronunciation character string Ln.

The sound collection device 36 (microphone) in FIG. 6 is an acoustic device that collects ambient sound and generates an acoustic signal V. Specifically, the sound collection device 36 generates an acoustic signal V representing the sound reproduced by the sound emitting device 16 of the information distribution system 10 (that is, a mixture of the uttered sound of the pronunciation character string Ln and the acoustic component of the distribution information Dn). For convenience, an A/D converter that converts the acoustic signal V generated by the sound collection device 36 from analog to digital and an amplifier (microphone amplifier) that amplifies the acoustic signal V are not illustrated. The display device 38 (for example, a liquid crystal display panel) displays the related information Cn under the control of the control device 32. One or both of the sound collection device 36 and the display device 38 may be configured separately from the terminal device 30 and connected to it.

By executing a program stored in the storage device 34, the control device 32 realizes, as illustrated in FIG. 6, a plurality of functions (an information extraction unit 52 and a reproduction control unit 54) for providing the related information Cn to the user. A configuration in which the functions of the control device 32 are distributed over a plurality of devices, or a configuration in which a dedicated electronic circuit realizes some of the functions of the control device 32, may also be adopted.

The information extraction unit 52 sequentially extracts the distribution information Dn from the acoustic signal V generated by the sound collection device 36. Specifically, the information extraction unit 52 emphasizes the frequency band of the acoustic signal V that contains the acoustic component of the distribution information Dn, for example with a band-pass filter, and extracts the distribution information Dn by applying to the emphasized acoustic signal V a demodulation process corresponding to the modulation process used to generate the modulation signal Y. Each time the sound emitting device 16 reproduces the uttered sound of a pronunciation character string Ln, the distribution information Dn corresponding to that pronunciation character string Ln is extracted. As understood from the above description, the sound collection device 36 of the first embodiment is used not only for voice calls between terminal devices 30 and for recording sound during video shooting but also for receiving the distribution information Dn by acoustic communication. That is, as illustrated in FIG. 6, the sound collection device 36 and the information extraction unit 52 function as an acoustic communication unit 56 that receives, by acoustic communication, the distribution information Dn sequentially transmitted in parallel with the reproduction of the uttered sound of each pronunciation character string Ln.
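One way to picture the extraction is the sketch below, which assumes that the distribution information was sent as binary frequency-shift keying with two tones near 19 kHz. Note the substitution: instead of a separate band-pass filter followed by a demodulator, it uses the Goertzel algorithm, which measures signal power at a single frequency and so acts as a narrow filter and detector in one step. The constants, the function names, and the assumption that the signal is already aligned to the first bit are all illustrative and do not come from the text.

```python
import math
from typing import Sequence

SAMPLE_RATE = 48_000          # Hz (assumption)
F0, F1 = 18_500.0, 19_500.0   # tones for bit 0 / bit 1 (assumptions)
BIT_SAMPLES = 480             # 10 ms per bit (assumption)
ID_BITS = 8                   # width of Dn in bits (assumption)


def goertzel_power(samples: Sequence[float], freq: float) -> float:
    """Power of `samples` at `freq`, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def demodulate(v: Sequence[float]) -> int:
    """Recover Dn from acoustic signal V, assuming V starts at the first bit."""
    value = 0
    for i in range(ID_BITS):
        window = v[i * BIT_SAMPLES:(i + 1) * BIT_SAMPLES]
        bit = goertzel_power(window, F1) > goertzel_power(window, F0)
        value = (value << 1) | int(bit)   # most significant bit first
    return value
```

Because the decision compares power only at the two high-frequency tones, speech and music in the audible band have little influence on the recovered bits. Locating the start of a transmission in a live microphone signal is a separate problem that this sketch does not address.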

The reproduction control unit 54 causes the display device 38 to sequentially display the related information Cn of each of the plurality of pronunciation character strings Ln (that is, the translation of the pronunciation character string Ln). Specifically, each time the information extraction unit 52 extracts distribution information Dn, the reproduction control unit 54 selects, from the plurality of pieces of related information Cn registered in the reference table Qb, the related information Cn corresponding to that distribution information Dn and causes the display device 38 to display it. Accordingly, each time the sound emitting device 16 reproduces the uttered sound of one of the pronunciation character strings Ln, the related information Cn corresponding to that pronunciation character string Ln is displayed on the display device 38. That is, the related information Cn of each pronunciation character string Ln is sequentially displayed in parallel with the reproduction of that pronunciation character string Ln, which is synchronized with the acting of the performers.

FIG. 8 is a flowchart illustrating the operation of the terminal device 30. The process of FIG. 8 is started, for example, in response to an instruction from the user. When the process of FIG. 8 starts, the information extraction unit 52 determines whether distribution information Dn has been extracted from the acoustic signal V generated by the sound collection device 36 (Sb1). When the information extraction unit 52 has extracted distribution information Dn (Sb1: YES), the reproduction control unit 54 retrieves the related information Cn corresponding to the distribution information Dn from the reference table Qb (Sb2) and displays that related information Cn on the display device 38 (Sb3). Until the user instructs termination (Sb4: NO), the retrieval (Sb2) and display (Sb3) of the related information Cn are executed each time the information extraction unit 52 extracts distribution information Dn. When the user instructs termination (Sb4: YES), the process of FIG. 8 ends.
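The loop of FIG. 8 can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the publication: the function name `run_terminal`, the representation of the reference table Qb as a dictionary, and the use of `None` for "nothing extracted" are all assumptions made for the example. The end of the input stream stands in for the user's termination instruction (Sb4).

```python
from typing import Iterable, Optional


def run_terminal(extracted: Iterable[Optional[str]],
                 table_qb: dict[str, str]) -> list[str]:
    """Hypothetical sketch of steps Sb1 to Sb3 over a finite stream.

    Each item of `extracted` is a decoded piece of distribution
    information Dn, or None when nothing could be extracted.
    """
    shown: list[str] = []
    for dn in extracted:
        if dn is None:              # Sb1: NO, nothing extracted this time
            continue
        cn = table_qb.get(dn)       # Sb2: look up related information Cn
        if cn is not None:
            shown.append(cn)        # Sb3: stand-in for showing Cn on the display
    return shown                    # stream exhausted: stand-in for Sb4: YES
```

For instance, `run_terminal(["D1", None, "D2"], {"D1": "Hello", "D2": "Goodbye"})` collects the two translations in the order in which their identifiers were received.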

<Information processing system 20>
The information processing system 20 in FIG. 1 is a computer system that generates the aforementioned reference table Qa (FIG. 4), which the information distribution system 10 refers to when transmitting the distribution information Dn. FIG. 9 is a configuration diagram of the information processing system 20. As illustrated in FIG. 9, the information processing system 20 of the first embodiment is realized by a computer system including a control device 22 and a storage device 24. The control device 22 is a processing circuit including, for example, a CPU, and centrally controls the entire information processing system 20. The storage device 24 is configured from a known recording medium such as a magnetic recording medium or a semiconductor recording medium, or from a combination of multiple types of recording media, and stores a program executed by the control device 22 and various data used by the control device 22. The storage device 24 of the first embodiment stores the acoustic signal X and the character string information B. As described above, the acoustic signal X is a time-domain signal representing speech in which the plurality of pronunciation character strings Ln are pronounced sequentially.

The character string information B is data representing the plurality of pronunciation character strings Ln pronounced in the show. For example, as illustrated in FIG. 10, the character string information B specifies a time series of pronunciation character strings Ln that one or more performers pronounce in sequence. For example, text data representing a script in which each performer's lines or the commentary (narration) of the show are arranged in time series (script data) is suitably used as the character string information B.

The information processing system 20 generates the reference table Qa using the acoustic signal X and the character string information B stored in the storage device 24. The reference table Qa generated by the information processing system 20 is transferred to the storage device 14 of the information distribution system 10 and, as described above, is used for distributing the distribution information Dn to the terminal device 30. The reference table Qa is transferred to the information distribution system 10 via a communication network such as the Internet. The reference table Qa can also be transferred from the information processing system 20 to the information distribution system 10 using a portable recording medium such as a semiconductor recording medium.

By executing the program stored in the storage device 24, the control device 22 realizes a plurality of functions (a time analysis unit 62 and an information correspondence unit 64) for generating the reference table Qa from the acoustic signal X and the character string information B. A configuration in which the functions of the control device 22 are distributed over a plurality of devices, or a configuration in which a dedicated electronic circuit realizes some of the functions of the control device 22, may also be adopted.

The time analysis unit 62 analyzes the character string information B and the acoustic signal X stored in the storage device 24 and thereby generates, for each of the plurality of pronunciation character strings Ln, time information Tn indicating the point on the time axis at which that pronunciation character string Ln is pronounced. As illustrated in FIG. 9, the time analysis unit 62 of the first embodiment includes a character string specifying unit 621 and a matching processing unit 622.

The character string specifying unit 621 acquires a character string R (hereinafter referred to as a "recognized character string") representing the spoken content of the speech represented by the acoustic signal X. The character string specifying unit 621 of the first embodiment specifies the recognized character string R by speech recognition of the acoustic signal X. Any known technique can be adopted for the speech recognition of the acoustic signal X, for example recognition processing that uses an acoustic model such as a hidden Markov model (HMM) together with a language model expressing linguistic constraints. The speech recognition of the acoustic signal X can also be performed by a speech recognition server with which the information processing system 20 can communicate. For example, the character string specifying unit 621 transmits the acoustic signal X to the speech recognition server and acquires from it the recognized character string R specified by the server's speech recognition. That is, the character string specifying unit 621 encompasses not only an element that itself generates the recognized character string R but also an element that acquires a recognized character string R generated by another device such as a speech recognition server.

The matching processing unit 622 collates the plurality of pronunciation character strings Ln represented by the character string information B stored in the storage device 24 with the recognized character string R specified by the character string specifying unit 621, and thereby generates, for each of the plurality of pronunciation character strings Ln, time information Tn indicating the point on the time axis at which that pronunciation character string Ln is pronounced. Because the acoustic signal X represents speech in which the plurality of pronunciation character strings Ln are pronounced sequentially, the recognized character string R specified from the acoustic signal X contains portions that are similar (ideally identical) to each pronunciation character string Ln represented by the character string information B. The matching processing unit 622 searches the recognized character string R for a character string similar to each pronunciation character string Ln, and generates time information Tn indicating the start time of the period on the time axis of the acoustic signal X that corresponds to that character string (that is, the period in which that character string was pronounced). As described above, the time information Tn specifies, for example, the elapsed time measured from the start point of the acoustic signal X as the reference time.

For the collation between the recognized character string R and each pronunciation character string Ln, the edit distance (Levenshtein distance), which evaluates the similarity between two character strings, is suitably used. That is, the matching processing unit 622 calculates the edit distance between the two for each of a plurality of cases in which the position of the pronunciation character string Ln relative to the whole recognized character string R is changed. The position of the pronunciation character string Ln relative to the recognized character string R is changed sequentially by a predetermined number of characters (for example, one or more characters) at a time. The portion where the recognized character string R and the pronunciation character string Ln overlap when the edit distance is smallest is taken as the character string in the recognized character string R that is similar to that pronunciation character string Ln. As understood from the above description, for each of the plurality of pronunciation character strings Ln indicated by the character string information B, time information Tn indicating the point at which that pronunciation character string Ln is pronounced in the acoustic signal X is generated.
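The sliding edit-distance search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `levenshtein` and `best_offset` are hypothetical, the window compared at each position is assumed to have the same length as the pronunciation character string, and the sketch returns a character offset only. Converting that offset into the time information Tn would additionally require per-character or per-word timestamps from the recognizer, which are outside this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def best_offset(recognized: str, line: str, step: int = 1) -> tuple[int, int]:
    """Slide `line` over `recognized` and return (offset, distance) of the
    window with the smallest edit distance. `step` is the shift in characters."""
    width = len(line)
    last = max(len(recognized) - width, 0)
    best = (0, levenshtein(recognized[:width], line))
    for pos in range(0, last + 1, step):
        d = levenshtein(recognized[pos:pos + width], line)
        if d < best[1]:          # ties keep the earliest position
            best = (pos, d)
    return best
```

With the recognized text `"hello world good morning everyone"` and the script line `"good morming"` (one character off), the search settles on the window that begins at the word "good".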

The information correspondence unit 64 in FIG. 9 generates a reference table Qa that associates, for each of the plurality of pronunciation character strings Ln, the distribution information Dn indicating the related information Cn for that pronunciation character string Ln with the time information Tn generated for that pronunciation character string Ln by the time analysis unit 62 (matching processing unit 622). The distribution information Dn is set so that no two entries in the reference table Qa share the same value. The reference table Qa generated by the information correspondence unit 64 is stored in the storage device 24 and then transferred to the storage device 14 of the information distribution system 10.

FIG. 11 is a flowchart illustrating the operation of the information processing system 20. The process of FIG. 11 is started, for example, in response to an instruction from the administrator of the information processing system 20. When the process of FIG. 11 starts, the time analysis unit 62 generates the time information Tn of each pronunciation character string Ln by analyzing the character string information B and the acoustic signal X, as described below (Sc1 to Sc3). First, the character string specifying unit 621 specifies the recognized character string R by speech recognition of the acoustic signal X (Sc1). The matching processing unit 622 selects one pronunciation character string Ln (for example, the earliest one not yet selected) from the time series of pronunciation character strings Ln indicated by the character string information B (Sc2). The matching processing unit 622 then collates the recognized character string R of the acoustic signal X with the selected pronunciation character string Ln and thereby generates time information Tn indicating the point at which that pronunciation character string Ln is pronounced in the acoustic signal X (Sc3).

The selection of a pronunciation character string Ln (Sc2) and the generation of its time information Tn (Sc3) are repeated for every pronunciation character string Ln included in the character string information B (Sc4: NO). When time information Tn has been generated for all pronunciation character strings Ln (Sc4: YES), the information correspondence unit 64 generates the reference table Qa, which associates the distribution information Dn with the time information Tn for each of the plurality of pronunciation character strings Ln, and stores it in the storage device 24 (Sc5).
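Steps Sc2 to Sc5 amount to a loop that pairs each script line with a unique identifier and a time. The sketch below is only an illustration: `build_reference_table`, the `locate` callback (standing in for the collation of step Sc3), the dictionary layout, and the `D0001`-style identifier format are all assumptions and are not taken from the publication.

```python
from typing import Callable


def build_reference_table(lines: list[str],
                          locate: Callable[[str], float]) -> list[dict]:
    """Hypothetical sketch of Sc2 to Sc5.

    `lines` is the time series of pronunciation character strings Ln and
    `locate` returns the time Tn (in seconds) at which a line is spoken.
    """
    table = []
    for n, line in enumerate(lines, start=1):         # Sc2, repeated until Sc4: YES
        tn = locate(line)                             # Sc3: time information Tn
        table.append({"Dn": f"D{n:04d}", "Tn": tn})   # Dn is unique within the table
    return table                                      # Sc5: reference table Qa
```

Deriving the identifier from the line index is one simple way to satisfy the requirement that distribution information does not repeat within the table.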

As described above, in the first embodiment the time information Tn of each pronunciation character string Ln is generated by analyzing the character string information B and the acoustic signal X, and a reference table Qa that associates the distribution information Dn with the time information Tn for each pronunciation character string Ln is generated. It is therefore possible to reduce the workload of providing users of the terminal device 30 with the related information Cn of pronunciation character strings Ln that are pronounced in time series.

<Second Embodiment>
A second embodiment of the present invention will be described. For elements in the forms illustrated below whose operation or function is the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused, and detailed description of each is omitted as appropriate.

FIG. 12 is a configuration diagram of the information processing system 20 in the second embodiment. In the second embodiment, the configuration of the time analysis unit 62 of the information processing system 20 differs from that of the first embodiment. The configurations of the information distribution system 10 and the terminal device 30, and the configurations of the elements of the information processing system 20 other than the time analysis unit 62, are the same as in the first embodiment.

As in the first embodiment, the time analysis unit 62 analyzes the character string information B and the acoustic signal X stored in the storage device 24 and thereby generates, for each of the plurality of pronunciation character strings Ln, time information Tn indicating the point on the time axis at which that pronunciation character string Ln is pronounced. As illustrated in FIG. 12, the time analysis unit 62 of the second embodiment includes an estimation processing unit 624, a feature extraction unit 625, and a matching processing unit 626.

The estimation processing unit 624 estimates the time series of the acoustic feature quantity Fa of the speech that would be observed if each of the plurality of pronunciation character strings Ln indicated by the character string information B were pronounced. A time series of the feature quantity Fa is estimated for each of the plurality of pronunciation character strings Ln. The feature quantity Fa is an acoustic characteristic value that tends to reflect differences in phonemes (that is, in spoken content) clearly. For example, MFCC (Mel-Frequency Cepstrum Coefficients) are a suitable example of the feature quantity Fa. To estimate the time series of the feature quantity Fa, the estimation processing unit 624 uses a statistical mathematical model (hereinafter referred to as a "feature quantity generation model") that expresses the statistical tendency of the relationship between a pronunciation character string Ln and the time series of the feature quantity Fa. The feature quantity generation model is expressed, for example, by a hidden Markov model (HMM), and is generated in advance by machine learning using a large amount of training data, each item of which contains an arbitrary character string and the feature quantities of speech pronouncing that character string. Therefore, when a pronunciation character string Ln is given to the feature quantity generation model, the time series of the feature quantity Fa is estimated so that, under the tendency of feature quantities for character strings in the training data, the likelihood for the pronunciation character string Ln (specifically, the posterior probability under the condition that the pronunciation character string Ln is observed) is maximized.

It is also possible to prepare a feature quantity generation model in advance for each performer of the show and to estimate the time series of the feature quantity Fa by selectively using one of the feature quantity generation models for each performer. For example, assuming a configuration in which the character string information B specifies the performer who pronounces each pronunciation character string Ln, the estimation processing unit 624 generates the time series of the feature quantity Fa for any one pronunciation character string Ln using the feature quantity generation model of the performer specified by the character string information B as the speaker of that pronunciation character string Ln. It is also possible to selectively use a plurality of feature quantity generation models prepared for each performer attribute (for example, adult/child or male/female).

The feature extraction unit 625 extracts the time series of the feature quantity Fb of the acoustic signal X. The feature quantity Fb is an acoustic characteristic value of the same kind as the feature quantity Fa (for example, MFCC). Any known analysis technique can be adopted for extracting the feature quantity Fb of the acoustic signal X. The types of the feature quantity Fa and the feature quantity Fb are arbitrary. For example, a chroma vector (PCP: Pitch Class Profile) containing a plurality of elements corresponding to different scale notes (for example, each of the 12 semitones of equal temperament) can be used together with MFCC as the feature quantity Fa and the feature quantity Fb. The element of the chroma vector corresponding to any one scale note is set to a value obtained by adding or averaging, over a plurality of octaves, the intensities of the band components corresponding to that scale note.
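The folding of band intensities into a 12-element chroma vector can be illustrated with the short sketch below. It assumes, purely for the example, that the band intensities have already been measured per semitone and are given as a flat list in which index `k` belongs to scale note `k % 12`; the function name `chroma_vector` and its `average` flag are hypothetical.

```python
def chroma_vector(semitone_energy: list[float],
                  average: bool = False) -> list[float]:
    """Fold per-semitone band intensities into 12 pitch classes.

    semitone_energy[k] is the intensity of the band for semitone k, so
    entries k, k + 12, k + 24, ... belong to the same scale note in
    successive octaves. The result adds (or averages) them per note.
    """
    sums = [0.0] * 12
    counts = [0] * 12
    for k, energy in enumerate(semitone_energy):
        sums[k % 12] += energy
        counts[k % 12] += 1
    if average:
        return [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sums
```

Given two octaves with intensity 1.0 in every band of the first and 2.0 in every band of the second, each element is 3.0 when adding and 1.5 when averaging.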

The matching processing unit 626 collates the time series of the feature quantity Fa estimated by the estimation processing unit 624 for each pronunciation character string Ln with the time series of the feature quantity Fb extracted from the acoustic signal X by the feature extraction unit 625, and thereby generates, for each of the plurality of pronunciation character strings Ln specified by the character string information B, time information Tn indicating the point on the time axis at which that pronunciation character string Ln is pronounced. As described above, the acoustic signal X represents speech in which the plurality of pronunciation character strings Ln are pronounced sequentially. The time series of the feature quantity Fa estimated for any one pronunciation character string Ln therefore tends to be similar (ideally identical) to the time series of the feature quantity Fb of the portion of the acoustic signal X in which that pronunciation character string Ln is pronounced. Taking this tendency into account, the matching processing unit 626 of the second embodiment searches the time series of the feature quantity Fb of the acoustic signal X for a portion similar to the time series of the feature quantity Fa of the pronunciation character string Ln, and generates, for that pronunciation character string Ln, time information Tn indicating the start time of that portion of the acoustic signal X.
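The passage above does not fix a particular similarity measure, so the following is only one possible sketch: it slides the estimated series Fa over the extracted series Fb frame by frame and picks the position with the smallest summed squared distance. The names `find_segment` and `start_time`, and the frame hop `hop_seconds`, are assumptions. A fixed-length comparison like this ignores differences in speaking rate; an elastic alignment such as dynamic time warping would be a natural alternative.

```python
Frame = list[float]


def frame_distance(u: Frame, v: Frame) -> float:
    """Squared Euclidean distance between two feature frames."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_segment(fb: list[Frame], fa: list[Frame]) -> int:
    """Return the frame index in fb where the window most similar to fa starts.

    Returns 0 when fb is shorter than fa (no full window exists).
    """
    n = len(fa)
    best_start, best_cost = 0, float("inf")
    for start in range(len(fb) - n + 1):
        cost = sum(frame_distance(fb[start + i], fa[i]) for i in range(n))
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start


def start_time(fb: list[Frame], fa: list[Frame], hop_seconds: float) -> float:
    """Convert the best frame index into a time Tn from the start of X."""
    return find_segment(fb, fa) * hop_seconds
```

The returned value plays the role of the time information Tn, measured from the start of the acoustic signal X.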

<Modifications>
Each of the aspects illustrated above can be modified in various ways. Specific modifications are illustrated below. Two or more aspects arbitrarily selected from the following examples can be combined as appropriate to the extent that they do not contradict each other.

(1) As described above, script data representing the script of a show, for example, can be suitably used as the character string information B. However, script data can contain not only the time series of the pronunciation character strings Ln but also information that is not actually pronounced in the show (hereinafter referred to as "non-pronounced information"). Examples of non-pronounced information include descriptions of scenes or characters in the show and notes on the performance. Because non-pronounced information is not needed when the time analysis unit 62 collates the character string information B with the acoustic signal X, a configuration is preferable in which the matching processing unit (622 or 626) compares with the acoustic signal X, as the character string information B, data obtained by excluding the non-pronounced information from the script data.

(2) In each of the above embodiments, the translation of the pronunciation character string Ln was given as an example of the related information Cn, but the content of the related information Cn is not limited to that example. For example, the pronunciation character string Ln itself can be displayed on the display device 38 of the terminal device 30 as the related information Cn. This configuration has the advantage that, for example, a hearing-impaired person who has difficulty hearing the pronunciation character strings Ln pronounced in the show can follow the content of the show.

In addition to information that directly represents the content of the pronunciation character string Ln (lines or lyrics), information that explains the show (for example, the stage directions of the script or supplementary commentary on the show) can also be displayed on the display device 38 of the terminal device 30 as the related information Cn. For example, related information Cn may be displayed that contains information indicating the pronunciation character string Ln itself or its translation together with information identifying the character who pronounces that pronunciation character string Ln (for example, the character's name or icon). When the pronunciation character string Ln is the lyrics of a piece of music (that is, when a piece whose lyrics are the pronunciation character string is sung in the show), a configuration that presents music information about that piece (for example, a score or chords) to the user as the related information Cn is also suitable. The related information Cn is not limited to text data representing a character string. For example, audio or images representing various kinds of information can also be used as the related information Cn.

For supplementary related information Cn such as music information or commentary on the show, it is difficult to determine the time information Tn by collating the pronunciation character string Ln with the recognized character string R (first embodiment) or by collating the time series of the feature quantity Fa with the time series of the feature quantity Fb (second embodiment). In the reference table Qa, therefore, time information Tn is generated for such supplementary related information Cn so that it indicates a point between the time indicated by the time information Tn-1 corresponding to the immediately preceding related information Cn-1 and the time indicated by the time information Tn+1 corresponding to the immediately following related information Cn+1. For example, time information Tn indicating the midpoint between the time indicated by the time information Tn-1 and the time indicated by the time information Tn+1 is associated with the distribution information Dn of the related information Cn. A configuration in which the time information Tn indicates a time, between the time indicated by the time information Tn-1 and the time indicated by the time information Tn+1, that is specified by the administrator of the information processing system 20 is also suitable.
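The midpoint rule can be written out in a few lines. In this sketch, which is an illustration only, the list `times` holds the time information for consecutive entries of the reference table Qa, with `None` marking a supplementary entry whose time could not be determined by collation; the function name `fill_supplementary_times` and the `None` convention are assumptions.

```python
from typing import Optional


def fill_supplementary_times(
        times: list[Optional[float]]) -> list[Optional[float]]:
    """Assign to each missing Tn the midpoint of Tn-1 and Tn+1.

    Entries without a known neighbour on both sides are left as None.
    """
    out = list(times)
    for n in range(1, len(out) - 1):
        before, after = out[n - 1], out[n + 1]
        if out[n] is None and before is not None and after is not None:
            out[n] = (before + after) / 2
    return out
```

For example, an entry between lines timed at 1.0 s and 4.0 s receives 2.5 s.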

(3) As described above, the pronunciation character strings Ln can include lyrics as well as spoken lines. That is, the acoustic signal X may represent a singing voice produced in parallel with the accompaniment of a piece of music. When the acoustic signal X represents a singing voice, a configuration in which the acoustic signal X is separated into the accompaniment and the singing voice is preferable when the character string specifying unit 621 of the first embodiment specifies the recognized character string R. The recognized character string R is specified by speech recognition of the separated singing voice. For separating the accompaniment from the singing voice, known sound source separation that exploits, for example, differences in the positions at which the sound sources are localized is suitably used. With this configuration, speech recognition is performed on the singing voice obtained by removing the accompaniment from the acoustic signal X, which has the advantage that the recognized character string R can be specified with high accuracy.

(4) When the related information Cn is displayed on the display device 38 of the terminal device 30, the related information Cn can be adjusted so that it fits within one screen of the display device 38. For example, when the related information Cn contains a large number of characters, a configuration that displays the related information Cn within one screen of the display device 38 by reducing the size of the characters is suitable. Also, when the pronunciation character string Ln is lyrics, for example, the related information Cn corresponding to that pronunciation character string Ln can be displayed for each phrase of the piece (for example, each section into which the piece is divided according to its musical meaning).

(5) In each of the above embodiments, the related information Cn is displayed on the display device 38 of the terminal device 30, but the method of presenting the related information Cn to the user is not limited to display. For example, the related information Cn can also be presented to the user by emitting audio representing the related information Cn from the sound emitting device 16, such as a speaker or an earphone. Also, although each of the above embodiments gave a portable information terminal such as a mobile phone or a smartphone as an example of the terminal device 30, the specific form of the terminal device 30 is not limited to those examples. For example, display equipment for guidance, such as an electronic bulletin board installed in a railway operator's facility (for example, inside a station) or an electronic signboard installed in a commercial facility (for example, digital signage), can also be suitably used as the terminal device 30.

(6) The situations in which the information providing system 100 (in particular the information distribution system 10) is used are not limited to the shows exemplified in the above embodiments. For example, the information providing system 100 exemplified in the above embodiments can also be used in situations where voice guidance for various facilities, such as transportation (for example, trains and buses), exhibition facilities (for example, museums and art museums), or tourist facilities, is pronounced in time series.

(7) In each of the above embodiments, a configuration is exemplified in which the distribution information Dn is included in real time in parallel with the reproduction of the acoustic signal X (that is, the modulation signal Y is synthesized), but an acoustic signal Z containing the distribution information Dn may instead be generated in advance and held in the storage device 14. The reference table Qa generated by the information processing system 20 is used to generate the acoustic signal Z, as exemplified in the above embodiments. Specifically, the acoustic signal Z is generated by synthesizing, at the time point in the acoustic signal X indicated by each piece of time information Tn in the reference table Qa, the modulation signal Y of the distribution information Dn corresponding to that time information Tn. In the configuration in which the acoustic signal Z is generated in advance, the information management unit 42 and the signal processing unit 44 are omitted from the information distribution system 10, and the control device 12 supplies the acoustic signal Z stored in the storage device 14 to the sound emitting device 16.
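The pre-generation of the acoustic signal Z described in (7) can be illustrated with a minimal sketch. It assumes that the acoustic signal X is a list of samples, that the reference table Qa is a list of (Tn in seconds, Dn) pairs, and that the modulation signal Y is simply added to X. The names `mix_modulation` and `toy_modulate` are hypothetical, and the toy modulator is only a placeholder for the modulation processing, which is not specified here.

```python
def mix_modulation(x, sample_rate, table_qa, modulate):
    """Return Z: a copy of X with the modulation signal Y of each Dn added at time Tn."""
    z = list(x)
    for t_n, d_n in table_qa:
        y = modulate(d_n)                      # modulation signal Y for this Dn
        start = int(round(t_n * sample_rate))  # sample index of the time point Tn
        for i, sample in enumerate(y):
            if start + i >= len(z):
                break  # drop the part of Y that would run past the end of X
            z[start + i] += sample
    return z


def toy_modulate(d_n):
    # Hypothetical placeholder: a real system would encode Dn into an audio-band signal.
    return [0.1 * d_n, -0.1 * d_n]
```

With Z computed once in this way, playback only needs to supply the stored samples to the sound emitting device, which is why the information management unit 42 and the signal processing unit 44 can be omitted in this variant.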

(8) In each of the above embodiments, identification information for identifying the related information Cn is exemplified as the distribution information Dn, but the related information Cn itself may be transmitted from the information distribution system 10 to the terminal device 30 as the distribution information Dn. In a configuration in which the related information Cn is distributed as the distribution information Dn, the terminal device 30 does not need to hold the reference table Qb. As understood from the above examples, the distribution information Dn is comprehensively expressed as information indicating the related information Cn, and encompasses the related information Cn itself as well as identification information for identifying the related information Cn.

(9) In each of the above embodiments, the distribution information Dn is transmitted to the terminal device 30 by acoustic communication using sound waves as the transmission medium, but the communication method for transmitting the distribution information Dn to the terminal device 30 is not limited to acoustic communication. For example, the distribution information Dn may be transmitted to the terminal device 30 by wireless communication using electromagnetic waves such as radio waves or infrared rays as the transmission medium, in synchronization with the emission of the pronunciation character string Ln by the sound emitting device 16. Short-range wireless communication that does not involve a communication network such as a mobile communication network is suitable for transmitting the distribution information Dn; acoustic communication using sound waves as the transmission medium and wireless communication using electromagnetic waves as the transmission medium are examples of such short-range wireless communication. The acoustic communication exemplified in the above embodiments has the advantage that the sound emitting device 16 used to emit the voice of the pronunciation character string Ln can also be used to transmit the distribution information Dn, and the advantage that the communication range can be easily controlled, for example by installing sound insulation walls.

(10) In each of the above embodiments, the case where the reference table Qb containing the related information Cn is stored in the storage device 34 of the terminal device 30 is exemplified, but the reference table Qb may instead be held in an information distribution server that can communicate with the terminal device 30 via a communication network such as a mobile communication network or the Internet. Specifically, the reproduction control unit 54 of the terminal device 30 transmits to the information distribution server an information request specifying the distribution information Dn extracted by the information extraction unit 52. The information distribution server searches the reference table Qb for the related information Cn corresponding to the distribution information Dn specified in the information request and transmits it to the requesting terminal device 30. The reproduction control unit 54 of the terminal device 30 causes the display device 38 to display the related information Cn received from the information distribution server. However, the configuration in which the reference table Qb is held in the storage device 34 of the terminal device 30, as in the above embodiments, has the advantage that the related information Cn can be displayed without requiring communication via a communication network.
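The two placements of the reference table Qb described in (10) can be contrasted with a minimal sketch. The class names `InfoDistributionServer` and `Terminal`, their methods, and the representation of Qb as a dictionary from Dn to Cn are all hypothetical; a direct method call stands in for the information request that would actually travel over a communication network.

```python
class InfoDistributionServer:
    """Holds the reference table Qb (Dn -> Cn) on the server side."""

    def __init__(self, table_qb):
        self.table_qb = dict(table_qb)

    def handle_request(self, d_n):
        # Search Qb for the related information Cn of the requested Dn.
        return self.table_qb.get(d_n)


class Terminal:
    """Resolves an extracted Dn to Cn from a local Qb if present, else via the server."""

    def __init__(self, server=None, local_qb=None):
        self.server = server
        self.local_qb = local_qb

    def on_distribution_info(self, d_n):
        if self.local_qb is not None:
            return self.local_qb.get(d_n)          # no network communication needed
        if self.server is not None:
            return self.server.handle_request(d_n)  # stands in for a network request
        return None
```

The local branch reflects the advantage noted in the text: when Qb is stored in the terminal, Cn can be shown without any communication over a network.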

(11) The first embodiment exemplifies a configuration in which the collation processing unit 622 collates each pronunciation character string Ln with the recognized character string R, and in the second embodiment the collation processing unit 626 collates the time series of the feature amount Fa with the time series of the feature amount Fb. Various refinements, such as those exemplified below, can be applied to the collation processing by the collation processing unit (622 or 626).

For example, the collation processing may be executed separately for each scene of the show (for example, each of a plurality of scenes such as the first act, the second act, a speech scene, and a singing scene). A configuration may also be adopted in which the time information Tn for any one pronunciation character string Ln is generated based on the results of the collation processing for the preceding pronunciation character strings Ln. The time information Tn of each pronunciation character string Ln may also be generated by referring, in the collation processing, to a moving image obtained by shooting the show. In addition, when the pronunciation character strings Ln are the lyrics of a piece of music, a configuration that generates the time information Tn of each pronunciation character string Ln by referring to the pitch and sounding period (or rhythm) of each note specified by the score information of the piece is also preferable.

The collation processing may also be executed after constraining the conditions on the time information Tn. For example, once provisional time information Tn has been generated for each pronunciation character string Ln by the collation processing of the collation processing unit, the administrator of the information processing system 20 designates one or more pieces of time information Tn that can be judged to be appropriate from information that the administrator knows about the show (for example, information that can be ascertained from the script). For example, a flag set to an initial value meaning that the appropriateness of the time information Tn is unknown is attached to each of the plurality of pieces of time information Tn, and the flag of each piece of time information Tn designated by the administrator is changed to a value meaning that it is appropriate. Specifically, when the time at which the first pronunciation character string Ln of each scene in the show is pronounced is known in advance, the administrator can designate the appropriate time information Tn based on that knowledge.

For example, when the user designates that time information Tn1 and time information Tn2 (n1 ≠ n2) are appropriate, the collation processing unit executes the collation processing for each piece of time information Tm within the range from the value n1 to the value n2 (n1 < m < n2) under the constraint that the time point indicated by the time information Tm lies between the time point of the time information Tn1 and the time point of the time information Tn2. With this configuration, the accuracy of the analysis of the time information Tn can be improved by reflecting the administrator's manual instructions (prior knowledge about the show).
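The constraint described above can be illustrated with a minimal sketch. It assumes that the provisional collation yields, for each index m, a list of candidate (time, score) pairs, and that the time information designated as appropriate is given as a dictionary of anchors. The function name `constrain_between_anchors` and both data layouts are hypothetical. The sketch only enforces that each Tm lies strictly between the two enclosing anchors; it does not additionally enforce ordering among the Tm themselves.

```python
def constrain_between_anchors(candidates, anchors):
    """Pick, for each m between two anchors, the best candidate time inside their interval.

    candidates: {m: [(time, score), ...]} from a provisional collation (assumed layout)
    anchors:    {n: time} for time information designated as appropriate
    """
    result = dict(anchors)
    anchor_indices = sorted(anchors)
    for lo, hi in zip(anchor_indices, anchor_indices[1:]):
        t_lo, t_hi = anchors[lo], anchors[hi]
        for m in range(lo + 1, hi):
            # Keep only candidates whose time lies between T_n1 and T_n2.
            allowed = [(t, s) for t, s in candidates.get(m, []) if t_lo < t < t_hi]
            if allowed:
                result[m] = max(allowed, key=lambda c: c[1])[0]
    return result
```

In this sketch a high-scoring candidate outside the anchored interval is rejected in favor of a lower-scoring one inside it, which is the sense in which prior knowledge about the show corrects the provisional result.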

A constraint on the speed of utterance or singing of the pronunciation character string Ln (for example, the singing tempo or the speaking rate) may also be added to the feature amount generation model used to estimate the time series of the feature amount Fa from the pronunciation character string Ln. For example, by using as the feature amount generation model a hidden semi-Markov model (semi-HMM) that reflects the singing speed (tempo) specified for the piece of music or the speaking rate specified in the script (for example, a stage direction such as "pronounce somewhat quickly"), the time series of the feature amount Fa can be estimated under a constraint on the speed of singing or utterance. Although a hidden semi-Markov model is exemplified above, the same effect can also be achieved with a hidden Markov model if the transition probabilities between states are set appropriately.

(12) As described above, a configuration in which the collation processing is executed for each scene of the show can be assumed. A scene change can be detected, for example, by analyzing a moving image generated by shooting the stage. Specifically, the presence or absence of a blackout is determined by referring to changes in brightness calculated from the moving image (for example, the cumulative brightness within one screen). For example, when the amount of decrease in brightness exceeds a predetermined threshold, it is determined that a stage blackout has occurred. The collation processing can then be executed using the time point at which the stage blackout occurred as the time point of a scene change. However, cases can also be assumed in which the stage goes dark without a scene change (for example, when the stage goes dark within a single scene), or in which a scene change occurs without the stage going dark (for example, when the scene changes while roughly the same brightness is maintained). In view of these circumstances, a preferable configuration is, for example, one in which a probability model expressing the probability that a blackout occurs at the time of a scene change and a probability model expressing the probability that a blackout occurs at a time other than a scene change are both generated by known machine learning, and both probability models are used to determine whether a scene change has occurred.
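The blackout detection and the combined use of the two probability models described in (12) can be illustrated with a minimal sketch. It assumes that brightness is given as one cumulative value per frame, and it replaces the machine-learned models with two fixed probabilities combined by Bayes' rule. The function names, the prior, and all numeric values are hypothetical.

```python
def detect_blackouts(brightness, threshold):
    """Return frame indices where brightness falls by more than `threshold` from the previous frame."""
    return [
        i for i in range(1, len(brightness))
        if brightness[i - 1] - brightness[i] > threshold
    ]


def scene_change_posterior(blackout, p_dark_given_change, p_dark_given_no_change,
                           prior_change=0.5):
    """Posterior probability of a scene change given whether a blackout was observed.

    The two p_dark_* arguments stand in for the two probability models in the text;
    here they are plain numbers rather than learned models (an assumption).
    """
    like_change = p_dark_given_change if blackout else 1.0 - p_dark_given_change
    like_none = p_dark_given_no_change if blackout else 1.0 - p_dark_given_no_change
    numerator = like_change * prior_change
    return numerator / (numerator + like_none * (1.0 - prior_change))
```

Using both models rather than the threshold alone allows a blackout within a scene, or a scene change without a blackout, to be weighed instead of being decided by brightness only.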

(13) As exemplified in the above embodiments, the information processing system 20 is realized by cooperation between the control device 22 and a program. A program according to a preferred aspect of the present invention causes a computer to function as: a time analysis unit 62 that, by analyzing character string information B representing a plurality of pronunciation character strings Ln and an acoustic signal X representing sound in which the plurality of pronunciation character strings Ln are sequentially pronounced, generates, for each of the plurality of pronunciation character strings Ln, time information Tn indicating the time point on the time axis at which the pronunciation character string Ln is pronounced; and an information correspondence unit 64 that associates, for each of the plurality of pronunciation character strings Ln, distribution information Dn indicating related information Cn related to the pronunciation character string Ln with the time information Tn generated by the time analysis unit 62 for the pronunciation character string Ln.

The program exemplified above can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, a good example of which is an optical recording medium (optical disc) such as a CD-ROM, but it can encompass any known type of recording medium such as a semiconductor recording medium or a magnetic recording medium. Note that a non-transitory recording medium includes any recording medium other than a transitory, propagating signal, and does not exclude volatile recording media. The program can also be provided to a computer in the form of distribution via a communication network.
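The two functions that the program of (13) realizes can be sketched as follows. The sketch assumes a toy collation in which the acoustic signal X has already been converted to a list of (time, recognized text) pairs, so it illustrates only the data flow from the character strings to a reference table of (Tn, Dn) pairs, not the actual analysis. The function names and the exact-match rule are hypothetical.

```python
def time_analysis(strings, recognized):
    """Generate Tn for each character string by in-order matching against recognized text.

    strings:    the character strings Ln, in the order they are pronounced
    recognized: [(time, text), ...] assumed to come from speech recognition of X
    """
    times = []
    pos = 0  # strings are pronounced sequentially, so the search never moves backwards
    for text in strings:
        t_n = None
        for k in range(pos, len(recognized)):
            t, rec = recognized[k]
            if rec.strip().lower() == text.strip().lower():
                t_n, pos = t, k + 1
                break
        times.append(t_n)
    return times


def information_correspondence(times, distribution_info):
    """Associate each Dn with its Tn, skipping strings for which no time was found."""
    return [(t, d) for t, d in zip(times, distribution_info) if t is not None]
```

For example, `information_correspondence(time_analysis(lines, recognized), ids)` yields a table in the shape assumed for the reference table Qa in this sketch.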

(14) From the embodiments exemplified above, for example, the following aspects are understood.
<Aspect 1>
An information processing system according to a preferred aspect (aspect 1) of the present invention comprises: a time analysis unit that, by analyzing character string information representing a plurality of character strings and an acoustic signal representing sound in which the plurality of character strings are sequentially pronounced, generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which the character string is pronounced; and an information correspondence unit that associates, for each of the plurality of character strings, distribution information indicating related information related to the character string with the time information generated by the time analysis unit for the character string. In this aspect, the time information of each character string is generated by analyzing the character string information and the acoustic signal, and the distribution information and the time information are associated with each other for each character string. Therefore, the work load for providing the related information of each of a plurality of character strings pronounced in time series can be reduced.
<Aspect 2>
In a preferred example of aspect 1 (aspect 2), the time analysis unit includes: a character string specifying unit that acquires a recognized character string representing the pronunciation content of the sound represented by the acoustic signal; and a collation processing unit that, by collating the plurality of character strings represented by the character string information with the recognized character string acquired by the character string specifying unit, generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which the character string is pronounced.
<Aspect 3>
In a preferred example of aspect 1 (aspect 3), the time analysis unit includes: an estimation processing unit that estimates a feature amount of sound in which the plurality of character strings are pronounced; a feature extraction unit that extracts a feature amount of the acoustic signal; and a collation processing unit that, by collating the feature amount estimated by the estimation processing unit with the feature amount extracted by the feature extraction unit, generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which the character string is pronounced.
<Aspect 4>
In an information processing method according to a preferred aspect (aspect 4) of the present invention, a computer system generates, by analyzing character string information representing a plurality of character strings and an acoustic signal representing sound in which the plurality of character strings are sequentially pronounced, time information indicating, for each of the plurality of character strings, the time point on the time axis at which the character string is pronounced, and associates, for each of the plurality of character strings, distribution information indicating related information related to the character string with the time information generated for the character string.

DESCRIPTION OF SYMBOLS
100 ... information providing system, 10 ... information distribution system, 12, 22, 32 ... control device, 14, 24, 34 ... storage device, 16 ... sound emitting device, 20 ... information processing system, 30 ... terminal device, 36 ... sound collecting device, 38 ... display device, 42 ... information management unit, 44 ... signal processing unit, 441 ... modulation processing unit, 442 ... signal synthesis unit, 52 ... information extraction unit, 54 ... reproduction control unit, 56 ... acoustic communication unit, 62 ... time analysis unit, 621 ... character string specifying unit, 622 ... collation processing unit, 624 ... estimation processing unit, 625 ... feature extraction unit, 626 ... collation processing unit, 64 ... information correspondence unit.

Claims

An information processing system comprising:
a time analysis unit that, by analyzing character string information representing a plurality of character strings and an acoustic signal representing sound in which the plurality of character strings are sequentially pronounced, generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which the character string is pronounced; and
an information correspondence unit that associates, for each of the plurality of character strings, distribution information indicating related information related to the character string with the time information generated by the time analysis unit for the character string.

The information processing system according to claim 1, wherein the time analysis unit includes:
a character string specifying unit that acquires a recognized character string representing the pronunciation content of the sound represented by the acoustic signal; and
a collation processing unit that, by collating the plurality of character strings represented by the character string information with the recognized character string acquired by the character string specifying unit, generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which the character string is pronounced.

The information processing system according to claim 1, wherein the time analysis unit includes:
an estimation processing unit that estimates a feature amount of sound in which the plurality of character strings are pronounced;
a feature extraction unit that extracts a feature amount of the acoustic signal; and
a collation processing unit that, by collating the feature amount estimated by the estimation processing unit with the feature amount extracted by the feature extraction unit, generates, for each of the plurality of character strings, time information indicating the time point on the time axis at which the character string is pronounced.

An information processing method in which a computer system:
generates, by analyzing character string information representing a plurality of character strings and an acoustic signal representing sound in which the plurality of character strings are sequentially pronounced, time information indicating, for each of the plurality of character strings, the time point on the time axis at which the character string is pronounced; and
associates, for each of the plurality of character strings, distribution information indicating related information related to the character string with the time information generated for the character string.