JP6471923B2 - Signal processing apparatus and method, and program

Info

Publication number: JP6471923B2
Application number: JP2018109373A
Authority: JP
Inventors: 高橋　秀介; 井上　晃; 西口　正之
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2013-10-21
Filing date: 2018-06-07
Publication date: 2019-02-20
Anticipated expiration: 2034-06-04
Also published as: JP2018157593A

Description

The present technology relates to a signal processing apparatus, a signal processing method, and a program, and more particularly to a signal processing apparatus, method, and program capable of synchronizing a plurality of contents acquired through different paths.

In recent years, devices that presuppose a network connection and can play back various media contents, such as multifunctional mobile phones and tablet terminals, have been increasing. There is also demand for using multiple devices in cooperation through network functions, including conventional devices such as television receivers.

For example, in the cooperation of multiple devices, an application program is envisaged in which a plurality of media contents having a time synchronization relationship, such as (A1) to (A4) below, are received by a plurality of devices through broadcasting, the Internet, or the like, and are played back in synchronization.

(A1) Foreign-language audio content, commentary audio content, closed captions, and text information for the main video/audio content
(A2) A plurality of video/audio contents in which a piece of music is performed and shot for each instrument
(A3) Video/audio contents in which one scene is shot from a plurality of angles
(A4) Main video/audio content and a high-resolution version of that video/audio content

Such a plurality of contents must be played back with synchronization maintained. As a technique for synchronizing a plurality of contents, a technique has been disclosed in which feature amounts are extracted from contents shot at the same time by a plurality of different imaging devices, and the contents are synchronized by calculating the similarity of those feature amounts (see, for example, Patent Document 1).

Patent Document 1: JP 2013-174765 A

In practice, however, when a plurality of devices attempt to receive media contents such as those described above through different paths, it is difficult to play back the contents in synchronization because of factors such as transmission delays, processing delays in transmission and reception, and differences in the operating clocks of the receiving devices. In addition, with the technique described in Patent Document 1, contents to be played back in synchronization cannot be synchronized when they do not have similar features.

The present technology has been made in view of such a situation, and makes it possible to synchronize a plurality of contents acquired through different paths.

A signal processing apparatus according to one aspect of the present technology includes: a band division unit that divides an acoustic signal included in first content into bands; a periodicity detection unit that detects, for each band, periodicity information of the acoustic signal band-divided by the band division unit; a periodicity information integration unit that integrates, over all bands, the per-band periodicity information detected by the periodicity detection unit; a peak detection unit that detects peak positions in the periodicity information integrated by the periodicity information integration unit and generates peak information; a downsampling unit that converts the peak information of a plurality of time intervals generated by the peak detection unit into information for one time interval; and an output unit that outputs the information downsampled by the downsampling unit as a synchronization feature amount for use in synchronizing the first content with second content that is to be synchronized with it.

A signal processing method or program according to one aspect of the present technology includes: band division processing that divides an acoustic signal included in first content into bands; periodicity detection processing that detects, for each band, periodicity information of the acoustic signal band-divided by the band division processing; periodicity information integration processing that integrates, over all bands, the per-band periodicity information detected by the periodicity detection processing; peak detection processing that detects peak positions in the periodicity information integrated by the periodicity information integration processing and generates peak information; downsampling processing that converts the peak information of a plurality of time intervals generated by the peak detection processing into information for one time interval; and output processing that outputs the information downsampled by the downsampling processing as a synchronization feature amount for use in synchronizing the first content with second content that is to be synchronized with it.

In one aspect of the present technology, an acoustic signal included in first content is divided into bands, periodicity information of the band-divided acoustic signal is detected for each band, the detected per-band periodicity information is integrated over all bands, peak positions in the integrated periodicity information are detected and peak information is generated, the generated peak information of a plurality of time intervals is converted into information for one time interval, and the downsampled information is output as a synchronization feature amount for use in synchronizing the first content with second content that is to be synchronized with it.

According to one aspect of the present technology, a plurality of contents acquired through different paths can be synchronized.

Note that the effects described here are not necessarily limiting, and the effect may be any of the effects described in the present disclosure.

A diagram showing a configuration example of the providing apparatus.
A diagram showing a configuration example of the audio synchronization feature amount calculation unit.
A diagram explaining downsampling of the audio synchronization feature amount.
A diagram showing a configuration example of the content playback system.
A diagram showing a configuration example of the audio synchronization feature amount calculation unit.
A diagram showing a configuration example of the synchronization calculation unit.
A diagram explaining the synchronization calculation of the audio synchronization feature amount.
A diagram explaining the synchronization calculation of the audio synchronization feature amount.
A diagram explaining the synchronization calculation of the audio synchronization feature amount.
A diagram explaining the blocks used for the similarity calculation.
A diagram explaining the similarity calculation.
A flowchart explaining the transmission process.
A diagram explaining multiplexing of the subchannel signal and the audio synchronization feature amount.
A flowchart explaining the audio synchronization feature amount calculation process.
A flowchart explaining the main content playback process.
A flowchart explaining the sub content playback process.
A flowchart explaining the audio synchronization feature amount calculation process.
A flowchart explaining the synchronization correction information generation process.
A diagram showing an application example of the present technology.
A diagram showing an application example of the present technology.
A diagram showing an application example of the present technology.
A diagram showing a configuration example of the providing apparatus.
A diagram showing a configuration example of the content playback system.
A flowchart explaining the transmission process.
A flowchart explaining the main content playback process.
A flowchart explaining the sub content playback process.
A flowchart explaining the synchronization correction information generation process.
A diagram explaining the blocks used for the similarity calculation.
A diagram showing a configuration example of a computer.

Hereinafter, embodiments to which the present technology is applied will be described with reference to the drawings.

〈First Embodiment〉
〈Features of the Present Technology〉
First, the features of the present technology will be described.

The present technology has, in particular, the following features B1 to B6.

(Feature B1)
According to the present technology, it is possible to realize a method and an apparatus having the following configuration, in which a plurality of media contents with different content are transmitted via different transmission paths and, when received by a plurality of different devices, are automatically synchronized using audio.
(1) Each media content is a data stream in which video, audio, images, text information, and the like are multiplexed.
(2) The plurality of media contents to be transmitted have a time synchronization relationship, as in the examples (A1) to (A4) above.
(3) At least one of the plurality of media contents to be transmitted is defined as the main channel signal, an audio synchronization feature amount is calculated from its audio signal, and a main transmission signal is generated from the main channel signal in the transmission format specified by the system.
(4) So that the time synchronization relationship between each of the remaining media contents (subchannel signals) and the main channel signal is preserved, the audio synchronization feature amount of the main channel signal and the subchannel signal are multiplexed in the transmission format specified by the system to generate a sub transmission signal.
(5) The main receiving device, which receives the main transmission signal, outputs the audio signal through a speaker or the like when playing back the main channel signal.
(6) The sub receiving device, which receives the sub transmission signal containing the audio synchronization feature amount of the main channel signal, uses a microphone or the like to pick up the audio of the main channel signal output from the speaker of the main receiving device, calculates an audio synchronization feature amount from it, performs an automatic synchronization calculation against the received audio synchronization feature amount of the main channel signal, and calculates synchronization correction information (time difference information) based on the audio feature amounts.
(7) Based on this synchronization correction information, the sub receiving device applies synchronization correction with respect to the main channel signal to the received subchannel signal and plays it back.

Note that the data stream transmission in (1) above assumes transmission of media content over a broadcast wave or a network such as the Internet, and the logical transmission path occupied by a multiplexed data stream is referred to as a transmission path.

The "calculation of the audio synchronization feature amount" and the "automatic synchronization calculation" above are realized by, for example, the technique described in JP 2013-174765 A. The audio synchronization feature amount may also be downsampled before transmission, and its frame rate may be converted as necessary during the automatic synchronization calculation that uses it.

With such a technique, the automatic synchronization calculation can be performed robustly even in adverse, noisy environments when the sub receiving device picks up the audio of the main channel signal. Note that use of this technique is not mandatory.

(Feature B2)
In (Feature B1) above, the transmitting-side system unilaterally transmits the main transmission signal and the sub transmission signal to the main receiving device and the sub receiving device, respectively.

In this case, the sub transmission signal must be transmitted ahead of the main transmission signal.

(Feature B3)
In (Feature B1) above, the transmitting-side system unilaterally transmits the main transmission signal to the main receiving device, while the sub receiving device acquires the sub transmission signal at its own timing, for example via a network, performs the automatic synchronization calculation, and plays back the subchannel signal in synchronization.

An advantage of this configuration is that the sub receiving device can control acquisition of the sub transmission signal at its own convenience, taking into account network transmission delays and the like.

(Feature B4)
In (Feature B1) above, the main receiving device acquires the main transmission signal at its own timing, for example via a network, and plays back the main channel signal, and the sub receiving device likewise acquires the sub transmission signal at its own timing, for example via a network, performs the automatic synchronization calculation, and plays back the subchannel signal in synchronization.

(Feature B5)
In (Feature B1) above, the main channel signal has a plurality of audio signal streams.

For example, the plurality of streams of the main channel signal are the main audio and sub audio of a bilingual broadcast. Audio synchronization feature amounts are calculated for the audio signals of all streams, multiplexed with the subchannel signal, and transmitted. When the sub receiving device performs the synchronization calculation between the picked-up audio and all of the received audio synchronization feature amounts, it determines which audio of the main channel signal is being played back. Switching of the audio signal output by the main receiving device is also detected by this synchronization calculation.

(Feature B6)
In (Feature B1) above, a "synchronization shift" is detected in the automatic synchronization calculation in the sub receiving device, and real-time correction processing is performed on the sub receiving device side.

Because the main receiving device and the sub receiving device operate independently, their audio clocks differ and a synchronization shift arises. Detecting and correcting this shift makes it possible to play back a plurality of contents while maintaining synchronization.

〈Configuration Example of the Providing Apparatus〉
Next, specific embodiments to which the present technology is applied will be described.

First, a configuration example of a providing apparatus that provides contents having a time synchronization relationship, as in the examples (A1) to (A4) above, will be described.

FIG. 1 is a diagram illustrating a configuration example of the providing apparatus. The providing apparatus 11 is supplied with a main channel signal, which is a signal for playing back the principal content (hereinafter referred to as main content), and a subchannel signal, which is a signal for playing back content that is related in substance to the main content (hereinafter referred to as sub content).

Here, the main content and the sub content each consist of at least one of video and audio, and have a time synchronization relationship with each other. That is, it is desirable that the main content and the sub content be played back in a synchronized state.

In the following, the description continues on the assumption that the main content and the sub content each consist of an image signal for playing back video and an audio signal accompanying that image signal. In this example, therefore, the main channel signal and the subchannel signal are each composed of an image signal and an audio signal.

The providing apparatus 11 includes a conversion unit 21, an output unit 22, an audio synchronization feature amount calculation unit 23, a multiplexing processing unit 24, and an output unit 25.

The conversion unit 21 converts the supplied main channel signal into a format defined by a predetermined broadcast standard or the like, and supplies the resulting main transmission signal to the output unit 22. The output unit 22 transmits the main transmission signal supplied from the conversion unit 21, for example by broadcast wave or via a communication network such as the Internet.

The audio synchronization feature amount calculation unit 23 extracts an audio synchronization feature amount from the audio signal constituting the supplied main channel signal and supplies it to the multiplexing processing unit 24. Here, the audio synchronization feature amount is a feature amount used to play back the sub content in synchronization with the main content when the main content and the sub content are played back.

The multiplexing processing unit 24 uses the supplied main channel signal to adjust the time synchronization relationship between the audio synchronization feature amount from the audio synchronization feature amount calculation unit 23 and the supplied subchannel signal. That is, since the main channel signal and the subchannel signal are already synchronized in the providing apparatus 11, the multiplexing processing unit 24 uses the main channel signal to associate the audio synchronization feature amount with the subchannel signal such that the two are synchronized in time. For example, in MPEG-4 Systems, audio signals, video signals, and the like are each handled as one media object (an Elementary Stream (ES)) and multiplexed. Because a time attribute is defined for each minimum unit, called an Access Unit (AU), obtained by dividing an ES, treating the audio synchronization feature amount as one media object with time attribute information makes it easy to multiplex it with the media object that is the subchannel signal.
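The idea of multiplexing by time attribute can be illustrated with a minimal sketch. It is not an MPEG-4 Systems implementation: the function name `multiplex`, the `(timestamp_ms, payload)` tuple layout, and the `"sub"` / `"feature"` tags are all hypothetical, and each input stream is assumed to be already sorted by timestamp.

```python
import heapq

def multiplex(sub_channel_units, feature_units):
    """Interleave two timestamp-sorted streams of (timestamp_ms, payload) units.

    Hypothetical illustration only: each unit stands in for an access unit
    that carries a time attribute, as described in the text.
    """
    tagged_sub = ((t, "sub", payload) for t, payload in sub_channel_units)
    tagged_feat = ((t, "feature", payload) for t, payload in feature_units)
    # heapq.merge keeps the output ordered by timestamp; on ties the unit
    # from the first iterable (the subchannel) comes first.
    return list(heapq.merge(tagged_sub, tagged_feat, key=lambda u: u[0]))
```

Because both streams carry timestamps derived from the same main channel timeline, a receiver can later recover which feature amount belongs to which part of the subchannel signal.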

After multiplexing the audio synchronization feature amount and the subchannel signal in a time-synchronized state, the multiplexing processing unit 24 performs format conversion as necessary and supplies the resulting sub transmission signal to the output unit 25.

The output unit 25 transmits the sub transmission signal supplied from the multiplexing processing unit 24, for example by broadcast wave or via a communication network such as the Internet. Here, the main transmission signal and the sub transmission signal are transmitted to the system on the content playback side via mutually different transmission paths.

In the example of FIG. 1, the providing apparatus 11 is configured as a single apparatus, but it may be configured from a plurality of apparatuses, or each process may be executed by cloud computing.

〈Configuration Example of the Audio Synchronization Feature Amount Calculation Unit〉
In more detail, the audio synchronization feature amount calculation unit 23 shown in FIG. 1 is configured, for example, as shown in FIG. 2.

The audio synchronization feature amount calculation unit 23 includes a frequency band dividing unit 51, periodicity detection units 52-1 to 52-4, periodicity strength detection units 53-1 to 53-4, a periodicity information integration unit 54, a peak detection unit 55, and a downsampling unit 56.

The frequency band dividing unit 51 divides the audio signal constituting the supplied main channel signal into time intervals of about several tens of msec to 100 msec using a window function.

Here, the processing performed from the frequency band dividing unit 51 through the peak detection unit 55 is performed on one time interval. By shifting the time position at which the window function is applied later by about several msec to 100 msec, a plurality of time intervals (time frames) that are continuous in the time direction can be obtained. In the downsampling unit 56, by contrast, the results of a plurality of consecutive time intervals are integrated into one, and a feature amount is calculated for the new, integrated time interval.
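The framing step can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the text does not name the window function, so a Hann window is used, and the 32 msec frame length and 8 msec shift are example values chosen from within the ranges stated above. The function names are hypothetical.

```python
import math

def hann(n):
    """Hann window of length n (an assumed choice; the source names no window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_into_frames(signal, sample_rate, frame_ms=32.0, hop_ms=8.0):
    """Cut `signal` into overlapping, windowed time intervals (time frames)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hann(frame_len)
    frames = []
    # Shift the window position by hop_len samples for each new frame.
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames
```

Each returned frame corresponds to one time interval on which the subsequent band division, periodicity detection, integration, and peak detection would operate.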

The frequency band dividing unit 51 divides the audio signal of each time interval into four frequency bands using a plurality of bandpass filters, and supplies the audio signal of each frequency band to the corresponding one of the periodicity detection units 52-1 to 52-4.

As the bandpass filters, it is effective to use filters whose passband becomes wider at higher frequencies, such as octave band filters.
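A minimal sketch of such a four-band split is given below. The filter design is an assumption: a second-order (biquad) bandpass with Q = sqrt(2), which gives roughly a one-octave passband, is used for each band, and the center frequencies 250, 500, 1000, and 2000 Hz are example values that do not come from the source.

```python
import math

def bandpass_biquad(signal, sample_rate, center_hz, q=math.sqrt(2.0)):
    """Second-order bandpass filter with unity gain at center_hz (assumed design)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha                    # the b1 coefficient is zero
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = (b0 * x0 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def split_into_bands(signal, sample_rate, centers_hz=(250.0, 500.0, 1000.0, 2000.0)):
    """Return one filtered signal per band; centers are spaced one octave apart."""
    return [bandpass_biquad(signal, sample_rate, fc) for fc in centers_hz]
```

With a fixed Q, the absolute bandwidth grows in proportion to the center frequency, which is the property the text describes for octave band filters.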

The periodicity detection units 52-1 to 52-4 each extract periodicity information, which represents the periodicity of each time interval, by calculating the autocorrelation function of the audio signal of the given frequency band supplied from the frequency band dividing unit 51 for each time interval.

Here, the autocorrelation function x(b, τ) itself, of the audio signal in the frequency band with index b at the time lag with index τ, is used as the periodicity information, but the value obtained by dividing x(b, τ) by x(b, 0) can also be used. As a method of calculating the autocorrelation function x(b, τ), for example, a method using the peaks of the spectrum obtained by applying a discrete Fourier transform to the audio signal of the given frequency band can be used.
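As an illustration, the periodicity information for one band and one time interval can be computed directly in the time domain. This sketch uses the plain sum-of-products definition rather than the DFT-based method mentioned above; the function names and the `max_lag` parameter are hypothetical, and `max_lag` is assumed to be smaller than the frame length.

```python
def autocorrelation(frame, max_lag):
    """x(b, tau) for tau = 0..max_lag, computed for one band-limited frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def normalized_autocorrelation(frame, max_lag):
    """x(b, tau) / x(b, 0), the alternative form mentioned in the text."""
    x = autocorrelation(frame, max_lag)
    if x[0] == 0.0:
        return [0.0] * len(x)  # assumption: a silent frame yields all zeros
    return [v / x[0] for v in x]
```

For a signal that repeats every T samples, the result has a local maximum near τ = T, which is what the later peak detection step looks for.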

The periodicity detection units 52-1 to 52-4 supply the extracted periodicity information for each time interval to the periodicity strength detection units 53-1 to 53-4 and to the periodicity information integration unit 54. In the following, the periodicity detection units 52-1 to 52-4 are simply referred to as the periodicity detection unit 52 when there is no particular need to distinguish between them.

The periodicity strength detection units 53-1 to 53-4 calculate the strength of periodicity for each time interval based on the periodicity information for each time interval supplied from the periodicity detection units 52-1 to 52-4. Specifically, the maximum value of the autocorrelation function x(b, τ), which is the periodicity information, over values of τ other than those near τ = 0 is calculated as the strength of periodicity. The greater this strength, the stronger the periodicity of the audio signal being processed; the smaller the strength, the more noise-like the audio signal being processed.

The periodicity strength detection units 53-1 to 53-4 binarize the strength of periodicity for each time interval according to whether it exceeds a threshold, and use the result as the periodicity strength information for that time interval. That is, for each time interval, the periodicity strength information is set to 1 when the strength of periodicity exceeds a predetermined threshold, and to 0 when it is less than or equal to the threshold. The periodicity strength detection units 53-1 to 53-4 supply the periodicity strength information for each time interval to the periodicity information integration unit 54.
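The strength calculation and binarization can be sketched as below. The parameters `min_lag` (how many lags around τ = 0 to skip) and `threshold` are assumptions, since the text gives no concrete values for either; the function names are hypothetical.

```python
def periodicity_strength(x_b, min_lag):
    """Maximum of x(b, tau) over tau >= min_lag, i.e. away from tau = 0."""
    return max(x_b[min_lag:])

def periodicity_strength_info(x_b, min_lag, threshold):
    """Binarized strength p(b): 1 if the strength exceeds threshold, else 0."""
    return 1 if periodicity_strength(x_b, min_lag) > threshold else 0
```

Excluding the lags near zero matters because every signal, including noise, correlates perfectly with itself at τ = 0, so that value says nothing about periodicity.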

Hereinafter, the periodicity intensity detection units 53-1 to 53-4 are also referred to simply as the periodicity intensity detection unit 53 when there is no particular need to distinguish between them.

The periodicity information integration unit 54 performs periodicity integration processing, which integrates the periodicity information for each time interval, based on the periodicity information for each time interval supplied from the periodicity detection unit 52 and the periodicity intensity information for each time interval supplied from the periodicity intensity detection unit 53. Specifically, the periodicity information integration unit 54 uses the following Equation (1) to obtain, for each time interval, the sum of the autocorrelation functions x(b, τ) that serve as the periodicity information.

In Equation (1), N_b represents the total number of frequency bands, and p(b) represents the periodicity intensity information. N_p represents the number of frequency bands for which the periodicity intensity information p(b) is 1.
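Equation (1) itself is not reproduced in this text. One form that is consistent with the symbol definitions above is shown below; it is offered as an assumption, not as the patent's exact expression. In particular, the normalization by N_p and the use of b as a band index running from 1 to N_b are inferred from the definitions.

```latex
S(\tau) = \frac{1}{N_p} \sum_{b=1}^{N_b} x(b, \tau)\, p(b)
```

Under this reading, only the bands whose periodicity intensity information p(b) is 1 contribute to S(τ), and the result is averaged over those bands.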

The periodicity information integration unit 54 supplies the sum S(τ) of the periodicity information for each time interval, obtained as a result of the periodicity integration processing, to the peak detection unit 55.

For each time interval, the peak detection unit 55 performs peak detection on the sum S(τ) of the periodicity information supplied from the periodicity information integration unit 54, and generates peak information P(τ) whose value is 1 at each peak position τ_p and 0 elsewhere. One peak detection method, for example, detects as the peak position τ_p the index τ at which the derivative of the sum S(τ) of the periodicity information changes from positive to negative.

When the sum S(τ_p) of the periodicity information at a peak position τ_p is smaller than a predetermined threshold, the peak detection unit 55 may set the peak information P(τ_p) at that peak position to 0. This reduces noise in the peak information P(τ_p). The peak information may also be the sum S(τ) of the periodicity information itself.
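The peak detection method mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the derivative is approximated by the first difference between neighboring indices, and the function name, the optional `min_value` parameter, and the sample values are hypothetical.

```python
def detect_peaks(s, min_value=None):
    """Return binary peak information P(tau) for one time interval.

    s[tau] holds the integrated periodicity information S(tau). An index is
    a peak when the first difference changes from positive to negative.
    If min_value is given, peaks with s[tau] below it are set to 0.
    """
    peaks = [0] * len(s)
    for tau in range(1, len(s) - 1):
        rising = s[tau] - s[tau - 1] > 0
        falling = s[tau + 1] - s[tau] < 0
        if rising and falling:
            if min_value is None or s[tau] >= min_value:
                peaks[tau] = 1
    return peaks


s = [0.1, 0.4, 0.2, 0.3, 0.9, 0.5, 0.5]
all_peaks = detect_peaks(s)                    # peaks at tau = 1 and tau = 4
strong_peaks = detect_peaks(s, min_value=0.5)  # only the peak at tau = 4
```

The second call shows the optional thresholding: the small peak at τ = 1 is suppressed because S(τ_p) is below the threshold.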

The peak detection unit 55 supplies the peak information P(τ) for each time interval to the downsampling unit 56 as time-series data of the audio synchronization feature amount for each time interval.

The downsampling unit 56 integrates the audio synchronization feature amounts of a plurality of time intervals supplied from the peak detection unit 55, that is, the peak information P(τ) of a plurality of time intervals, into information for one new time interval, and generates peak information P'_i(τ) as the final audio synchronization feature amount. In other words, the downsampling unit 56 generates the peak information P'_i(τ) by downsampling the peak information P(τ).

In P'_i(τ), τ is an index indicating the time delay, and i is an index indicating the time interval. The downsampling unit 56 supplies the peak information P'_i(τ) for each time interval obtained in this way to the multiplexing processing unit 24 as time-series data of the audio synchronization feature amount for each time interval.

The generation of the peak information P'_i(τ) will now be described with reference to FIG. 3. In FIG. 3, the vertical axis represents the index τ indicating the time delay, and the horizontal axis represents time, that is, the index i indicating the time interval.

In this example, the upper part of the figure shows a series of peak information P(τ), and the lower part shows a series of peak information P'_i(τ). In FIG. 3, the peak information P(τ) with time delay τ in the time interval specified by the index i is written as P_i(τ). Each square represents the peak information of one time interval: a white square indicates that the peak information is 0, and a black square indicates that it is 1.

In the upper part of the figure, the time interval of the peak information P_i(τ) is 8 msec long; that is, the peak information P_i(τ) is calculated at 8 msec intervals. Here, four pieces of peak information P_i(τ) that have the same time delay τ and are adjacent in the time direction (time interval direction) are integrated into one piece of peak information P'_i(τ). The time interval of one piece of peak information P'_i(τ) is therefore 32 msec.

For example, the downsampling unit 56 integrates (downsamples) the peak information P_i(τ) by calculating the following Equation (2) to obtain the peak information P'_i(τ).

In the calculation of Equation (2), if at least one of the four consecutive pieces of peak information P_i(τ) being integrated has the value "1", the peak information P'_i(τ) obtained by the integration is set to "1". Conversely, if all four consecutive pieces of peak information P_i(τ) being integrated have the value "0", the peak information P'_i(τ) obtained by the integration is set to "0".
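Equation (2) is not reproduced in this text. A form consistent with the behavior just described is a logical OR over four consecutive time intervals at the same time delay. The indexing below (the new interval i covering the original intervals 4i through 4i+3) is an assumption made for illustration.

```latex
P'_i(\tau) = P_{4i}(\tau) \lor P_{4i+1}(\tau) \lor P_{4i+2}(\tau) \lor P_{4i+3}(\tau)
```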

Downsampling by taking the logical OR of the peak information P_i(τ) arranged consecutively in the time interval direction in this way prevents the peak position information contained in the time series of peak information from being removed by the downsampling. As a result, even after downsampling, it is possible to retain how the peak position information transitions in the time delay direction.

For example, if the peak information P_i(τ) were downsampled by simply adopting the value of one piece of peak information P_i(τ), taken from among the plurality of time intervals, as the value of the downsampled peak information P'_i(τ), information would be lost and the accuracy of the synchronization calculation would decrease. That is, performance would deteriorate.

Specifically, when the peak position transitions over four time intervals, for example, simply thinning out the peak information P_i(τ) to obtain the peak information P'_i(τ) means that only the peak information P_i(τ) of one time interval in the middle of the transition is adopted as the final feature amount, and the information that the peak position transitioned is lost.

By contrast, the method described above outputs an appropriate value as the peak information P'_i(τ) based on the peak information P_i(τ) of a plurality of time intervals at the time of downsampling. With this method, the information that a transition occurred can be retained within the single time interval that results from the downsampling. As a result, detection performance can be maintained even when downsampling is performed.
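The difference between OR-based downsampling and simple thinning can be seen in a small sketch. The function names and the sample data are hypothetical, and the thinning variant here keeps the first interval of each group, which is only one of several possible choices. Any trailing intervals that do not fill a complete group are dropped in both functions.

```python
def downsample_or(frames, factor=4):
    """Merge each group of `factor` consecutive time intervals with logical OR.

    frames[i][tau] is the binary peak information P_i(tau).
    """
    merged = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        merged.append([1 if any(column) else 0 for column in zip(*group)])
    return merged


def downsample_decimate(frames, factor=4):
    """Keep only the first interval of each group (simple thinning)."""
    return [list(frames[i]) for i in range(0, len(frames) - factor + 1, factor)]


# A peak that drifts from delay index 0 to 3 over four 8 msec intervals.
frames = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
merged = downsample_or(frames)         # the whole transition is kept
thinned = downsample_decimate(frames)  # only one point of the transition is kept
```

In this example `merged` marks all four delay indices that the peak passed through, whereas `thinned` records only the first one.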

Moreover, performing such downsampling reduces the amount of data to be transmitted when the audio synchronization feature amount is transmitted. It also reduces the capacity required to hold a calculated audio synchronization feature amount in memory or storage.

Furthermore, the amount of computation required for the synchronization processing between two audio synchronization feature amounts can be reduced. Because the computation amount of the synchronization processing increases by a factor of n^2 when the length of the input feature amount increases by a factor of n, the effect of downsampling is large. On the other hand, simple thinning degrades the synchronization detection performance, so downsampling that retains the necessary information, like the downsampling method of the downsampling unit 56, is required.
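As a worked illustration of the quadratic relationship, assume the cost C of the synchronization processing is proportional to the square of the feature length n, and that downsampling to 1/4 (as in the 8 msec to 32 msec example) shortens the feature to n/4. Both the proportionality and the factor are assumptions taken from the surrounding description.

```latex
\frac{C(n/4)}{C(n)} = \frac{(n/4)^2}{n^2} = \frac{1}{16}
```

Under these assumptions, a fourfold reduction in length corresponds to roughly a sixteenfold reduction in computation.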

Although FIG. 3 illustrates an example in which the peak information serving as the audio synchronization feature amount is downsampled to 1/4, conversion (downsampling) at any other rate, such as 1/2 or 1/8, is also possible.

Methods other than the calculation of Equation (2) described above can also be used to downsample the peak information.

For example, the value of the downsampled peak information P'_i(τ) may be set to "1" when the value of the peak information P_i(τ) is "1" in two or more of the four time intervals. Alternatively, the value of the downsampled peak information P'_i(τ) may be set to "1" when the value of the peak information P_i(τ) is "1" in three or more of the time intervals, or when it is "1" in all four time intervals.

Furthermore, the value of the downsampled peak information P'_i(τ) may be set to "1" when the value of the peak information P_i(τ) is "1" in two or more consecutive time intervals among the four time intervals before downsampling, or when it is "1" in three or more consecutive time intervals.
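The alternative rules above differ only in how the four values for one time delay are combined. A minimal sketch, with hypothetical function names, is:

```python
def longest_run_of_ones(bits):
    """Length of the longest run of consecutive 1s in `bits`."""
    best = run = 0
    for bit in bits:
        run = run + 1 if bit else 0
        best = max(best, run)
    return best


def merge_by_count(bits, min_count):
    """1 if at least `min_count` of the intervals are 1 (min_count=1 is OR)."""
    return 1 if sum(bits) >= min_count else 0


def merge_by_run(bits, min_run):
    """1 if at least `min_run` consecutive intervals are 1."""
    return 1 if longest_run_of_ones(bits) >= min_run else 0


bits = [1, 0, 1, 0]  # P_i(tau) for four consecutive intervals at one delay
two_or_more = merge_by_count(bits, 2)       # two intervals are 1
two_consecutive = merge_by_run(bits, 2)     # but they are not adjacent
```

Stricter rules suppress isolated peaks at the cost of discarding some transitions, so the choice is a trade-off between noise robustness and retained information.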

The above describes a method of downsampling the peak information P_i(τ) in the time axis direction (time interval direction), but the peak information P_i(τ) may instead be downsampled in the time delay τ direction.

In that case, the downsampling unit 56 downsamples the peak information P_i(τ) by calculating, for example, the following Equation (3) to obtain the peak information P'_i(τ).

In the calculation of Equation (3), four pieces of peak information P_i(τ) of the same time interval that are arranged consecutively in the time delay τ direction are integrated into one piece of peak information P'_i(τ).

Here, if at least one of the four consecutive pieces of peak information P_i(τ) being integrated has the value "1", the peak information P'_i(τ) obtained by the integration is set to "1". Conversely, if all four consecutive pieces of peak information P_i(τ) being integrated have the value "0", the peak information P'_i(τ) obtained by the integration is set to "0".
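Downsampling in the time delay direction can be sketched in the same way. Equation (3) is not reproduced in this text, so the grouping below (the new delay index covering the original indices 4τ through 4τ+3) is an assumption, as are the function name and sample data.

```python
def downsample_delay_or(frame, factor=4):
    """OR together each group of `factor` consecutive delay indices.

    frame[tau] is the binary peak information P_i(tau) of one time interval.
    Trailing indices that do not fill a complete group are dropped.
    """
    reduced = []
    for start in range(0, len(frame) - factor + 1, factor):
        reduced.append(1 if any(frame[start:start + factor]) else 0)
    return reduced


frame = [0, 0, 1, 0, 0, 0, 0, 0]  # eight delay indices, one peak at tau = 2
reduced = downsample_delay_or(frame)
```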

Furthermore, the peak information P_i(τ) may be downsampled in both the time interval i direction and the time delay τ direction.

In that case, the downsampling unit 56 downsamples the peak information P_i(τ) by calculating, for example, the following Equation (4) to obtain the peak information P'_i(τ).

In the calculation of Equation (4), a total of four pieces of peak information P_i(τ) are integrated into one piece of peak information P'_i(τ): two pieces with the same time delay τ that are arranged consecutively in the time interval i direction, and the two pieces adjacent to them in the time delay τ direction.

Here, if at least one of the four pieces of peak information P_i(τ) being integrated has the value "1", the peak information P'_i(τ) obtained by the integration is set to "1". Conversely, if all four pieces of peak information P_i(τ) being integrated have the value "0", the peak information P'_i(τ) obtained by the integration is set to "0".
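The two-direction case can be sketched as an OR over non-overlapping 2 x 2 blocks. Equation (4) is not reproduced in this text, so the block alignment (starting at even indices in both directions), the function name, and the sample data are assumptions.

```python
def downsample_2x2_or(frames):
    """OR together 2 x 2 blocks spanning two intervals and two delay indices.

    frames[i][tau] is the binary peak information P_i(tau). A trailing odd
    interval or delay index is dropped.
    """
    reduced = []
    for i in range(0, len(frames) - 1, 2):
        row = []
        for tau in range(0, len(frames[i]) - 1, 2):
            block = (
                frames[i][tau], frames[i][tau + 1],
                frames[i + 1][tau], frames[i + 1][tau + 1],
            )
            row.append(1 if any(block) else 0)
        reduced.append(row)
    return reduced


frames = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
reduced_2d = downsample_2x2_or(frames)
```

As with Equation (2), four values are merged into one, so the data reduction is the same, but the resolution loss is split between the two axes.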

When the downsampling unit 56 has downsampled the peak information P(τ) to obtain the peak information P'_i(τ) as described above, it supplies the resulting peak information P'_i(τ) for each new time interval to the multiplexing processing unit 24 as time-series data of the audio synchronization feature amount for each time interval.

<Configuration example of content reproduction system>
Next, the configuration of a content reproduction system that receives the main transmission signal and the sub transmission signal transmitted from the providing apparatus 11 as a main reception signal and a sub reception signal, respectively, and reproduces the main content and the sub content will be described. Such a content reproduction system is configured, for example, as shown in FIG. 4.

The content reproduction system shown in FIG. 4 includes a main receiving device 81, a display unit 82, a speaker 83, a microphone 84, a sub receiving device 85, a display unit 86, and a speaker 87. Although the content reproduction system is shown here as being composed of a plurality of devices, it may instead be composed of a single device.

The main receiving device 81 receives the main reception signal transmitted from the providing apparatus 11 and controls reproduction of the main content obtained from the main reception signal.

The main receiving device 81 includes an input unit 111 and a reproduction processing unit 112.

The input unit 111 receives the main transmission signal transmitted from the providing apparatus 11 as a main reception signal and supplies it to the reproduction processing unit 112. The reproduction processing unit 112 extracts the image signal and the audio signal of the main content contained in the main reception signal supplied from the input unit 111, supplies the image signal to the display unit 82 for reproduction, and supplies the audio signal to the speaker 83 for reproduction. That is, the reproduction processing unit 112 controls reproduction of the main content.

The display unit 82 is, for example, a liquid crystal display device, and displays the image (video) of the main content based on the image signal supplied from the reproduction processing unit 112. The speaker 83 is an audio reproduction device and outputs the audio of the main content based on the audio signal supplied from the reproduction processing unit 112.

The microphone 84 picks up the audio of the main content output from the speaker 83 and supplies the resulting audio signal to the sub receiving device 85.

The sub receiving device 85 receives the sub transmission signal transmitted from the providing apparatus 11 as a sub reception signal and controls reproduction of the sub content obtained from the sub reception signal.

The sub receiving device 85 includes an audio synchronization feature amount calculation unit 121, a buffer 122, an input unit 123, a separation processing unit 124, a buffer 125, a synchronization calculation unit 126, and a reproduction processing unit 127.

The audio synchronization feature amount calculation unit 121 calculates the audio synchronization feature amount from the audio signal supplied from the microphone 84 and supplies it to the buffer 122. The buffer 122 temporarily records the audio synchronization feature amount supplied from the audio synchronization feature amount calculation unit 121.

The input unit 123 receives the sub reception signal transmitted from the providing apparatus 11 and supplies it to the separation processing unit 124. The separation processing unit 124 separates the sub reception signal supplied from the input unit 123 into the audio synchronization feature amount and the sub channel signal, and supplies them to the buffer 125. The buffer 125 temporarily records the audio synchronization feature amount and the sub channel signal supplied from the separation processing unit 124.

Based on the audio synchronization feature amount recorded in the buffer 122 and the audio synchronization feature amount recorded in the buffer 125, the synchronization calculation unit 126 generates synchronization correction information based on the audio feature amount, which is used to synchronize the main content and the sub content, and supplies it to the reproduction processing unit 127. That is, the synchronization calculation unit 126 detects the deviation in reproduction time between the main content and the sub content by matching the audio synchronization feature amount extracted from the audio signal obtained by sound pickup against the audio synchronization feature amount contained in the sub reception signal, and generates synchronization correction information based on the audio feature amount that indicates that deviation.

The reproduction processing unit 127 corrects the reproduction timing (time) of the sub channel signal recorded in the buffer 125 based on the synchronization correction information supplied from the synchronization calculation unit 126, and supplies the image signal and the audio signal that make up the sub channel signal to the display unit 86 and the speaker 87, respectively. That is, the reproduction processing unit 127 controls reproduction of the sub content. For example, suppose MPEG-4 System is used, and the audio synchronization feature amount is handled as one media object that is synchronized and multiplexed with the media object of the sub channel signal. In that case, a time attribute is defined for each Access Unit (AU), which is the minimum unit of each media object, so an appropriate reproduction timing (time) for the media object of the sub channel signal can be calculated from the synchronization correction information.

The display unit 86 is, for example, a liquid crystal display device, and displays the image (video) of the sub content based on the image signal supplied from the reproduction processing unit 127. The speaker 87 is an audio reproduction device and outputs the audio of the sub content based on the audio signal supplied from the reproduction processing unit 127.

<Configuration example of audio synchronization feature amount calculation unit>
The audio synchronization feature amount calculation unit 121 shown in FIG. 4 is configured in more detail as shown in FIG. 5, for example.

The audio synchronization feature amount calculation unit 121 includes a frequency band division unit 151, periodicity detection units 152-1 to 152-4, periodicity intensity detection units 153-1 to 153-4, a periodicity information integration unit 154, and a peak detection unit 155.

The frequency band division unit 151 through the peak detection unit 155 are the same as the frequency band division unit 51 through the peak detection unit 55 shown in FIG. 2, and their description is therefore omitted. However, the shift time of the window function may be set to different values in the frequency band division unit 151 and the frequency band division unit 51. For example, when the sub receiving device 85 has abundant computing resources, using a shorter shift time in the frequency band division unit 151 makes it possible to extract the audio synchronization feature amount at a finer granularity.

Hereinafter, the periodicity detection units 152-1 to 152-4 are also referred to simply as the periodicity detection unit 152 when there is no particular need to distinguish between them, and the periodicity intensity detection units 153-1 to 153-4 are also referred to simply as the periodicity intensity detection unit 153 when there is no particular need to distinguish between them.

<Configuration example of synchronization calculation unit>
The synchronization calculation unit 126 shown in FIG. 4 is configured in more detail as shown in FIG. 6, for example.

The synchronization calculation unit 126 in FIG. 6 includes a frame rate conversion unit 181, a frame rate conversion unit 182, a block integration unit 183, a block integration unit 184, a similarity calculation unit 185, and an optimal path search unit 186.

The frame rate conversion unit 181 reads the time-series data of the audio synchronization feature amount for each time interval of the main content from the buffer 122, converts the frame rate of the audio synchronization feature amount, and supplies the result to the block integration unit 183. The frame rate here refers to the number of time intervals per unit time in the time-series data of the audio synchronization feature amount, and is therefore determined by the length of the time interval.

The frame rate conversion unit 182 reads the time-series data of the audio synchronization feature amount for each time interval of the main content from the buffer 125, converts the frame rate of the audio synchronization feature amount, and supplies the result to the block integration unit 184.

The audio synchronization feature amounts held in the buffer 122 and the buffer 125 may differ in frame rate, that is, in the length of the time interval.

For example, the audio synchronization feature amount contained in the sub transmission signal may be set to a low rate in order to reduce the transfer bit rate of the sub content (sub transmission signal) provided from the providing apparatus 11, while the audio synchronization feature amount calculated from the audio picked up by the microphone 84 is set to a high rate because it does not need to be transmitted.

In such a case, as shown in FIG. 7 for example, the audio synchronization feature amount calculated from the audio picked up by the microphone 84 can be downsampled using the same method as the downsampling unit 56. In FIG. 7, for the peak information serving as the audio synchronization feature amount indicated by each of the arrows Q11 to Q14, the vertical axis represents the time delay τ and the horizontal axis represents the time interval i. Each square represents the peak information in one time interval.

In this example, on the providing apparatus 11 side, the peak information serving as the audio synchronization feature amount is obtained as indicated by the arrow Q11. That peak information is then downsampled into the peak information with a longer time interval indicated by the arrow Q12, and is transmitted to the sub receiving device 85. Here, the peak information with a time interval of 8 msec is frame-rate converted (downsampled) into peak information with a time interval of 32 msec.

Meanwhile, the audio synchronization feature amount calculation unit 121 of the sub receiving device 85 calculates the audio synchronization feature amount from the audio signal obtained by picking up the audio of the main content reproduced by the main receiving device 81. As a result, the peak information indicated by the arrow Q13 is obtained as the audio synchronization feature amount. Here, the peak information indicated by the arrow Q13 is calculated for each 8 msec time interval.

The audio synchronization feature amount obtained by the audio synchronization feature amount calculation unit 121 in this way and the audio synchronization feature amount received from the providing apparatus 11 differ in the length of the time interval, that is, in frame rate. The frame rate conversion unit 181 therefore downsamples the audio synchronization feature amount obtained by the audio synchronization feature amount calculation unit 121, as frame rate conversion, so that the frame rates of the two audio synchronization feature amounts match, and obtains the peak information serving as the audio synchronization feature amount indicated by the arrow Q14. The audio synchronization feature amount indicated by the arrow Q14 is peak information with a time interval of 32 msec.

After the frame rates (time interval lengths) have been aligned in this way, the synchronization calculation is performed using the audio synchronization feature amounts. Downsampling the audio synchronization feature amount on the sub receiving device 85 side in this way makes it possible to handle an arbitrary frame rate (bit rate).

There are also cases where the audio synchronization feature amount transmitted to the sub receiving device 85 has a high rate while the audio synchronization feature amount calculated from the sound collected by the microphone 84 has a low rate. One example is a case where the computing resources of the sub receiving device 85 are limited and the frame shift amount is increased to reduce the amount of computation needed to calculate the audio synchronization feature amount.

In such a case, the frame rate conversion unit 182 downsamples the audio synchronization feature amount included in the sub transmission signal, indicated for example by arrow Q21 in FIG. 8, using the same method as the downsampling unit 56, and the audio synchronization feature amount indicated by arrow Q22 is obtained. In FIG. 8, the vertical axis of the peak information shown as the audio synchronization feature amount at each of arrows Q21 to Q23 represents the time delay τ, and the horizontal axis represents the time section i. Each square represents the peak information in one time section.

In this example, peak information for 8 msec time sections is frame rate converted (downsampled) into peak information for 32 msec time sections.

In addition, the audio synchronization feature amount calculation unit 121 of the sub receiving device 85 calculates an audio synchronization feature amount from the audio signal obtained by collecting the sound of the main content reproduced by the main receiving device 81, and as a result the peak information indicated by arrow Q23 is obtained as the audio synchronization feature amount. Here, the peak information indicated by arrow Q23 is calculated for each 32 msec time section.

In this way, the audio synchronization feature amount included in the sub transmission signal may be downsampled so that its frame rate matches the frame rate of the audio synchronization feature amount calculated by the sub receiving device 85.
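The 8 msec to 32 msec downsampling described above can be sketched as follows. This is only an illustration: it assumes binary peak information stored as one list per time section (indexed by τ), and it assumes that merged sections are combined by logical OR. The text only says that several time sections are merged into one, so the OR rule and the name `downsample_peaks` are assumptions.

```python
def downsample_peaks(peaks, factor=4):
    """Merge every `factor` consecutive time sections into one.

    peaks: list of time sections, each a list of 0/1 values indexed by tau.
    Assumption: a merged section has a peak at tau if any source section does.
    Trailing sections that do not fill a whole group are dropped.
    """
    merged = []
    for start in range(0, len(peaks) - factor + 1, factor):
        group = peaks[start:start + factor]
        merged.append([int(any(column)) for column in zip(*group)])
    return merged


# Eight 8 msec sections (tau = 0..2) become two 32 msec sections.
fine = [
    [1, 0, 0], [0, 0, 0], [0, 0, 0], [0, 1, 0],
    [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0],
]
coarse = downsample_peaks(fine)
```

With `factor=4`, four 8 msec sections map to one 32 msec section, which matches the ratio in the FIG. 8 example.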

In the description above, the frame rates were matched by downsampling the audio synchronization feature amount with the higher frame rate, but they may instead be matched by upsampling the audio synchronization feature amount with the lower frame rate.

In such a case, the frame rate conversion unit 182 upsamples the audio synchronization feature amount included in the sub transmission signal, for example as shown in FIG. 9. In FIG. 9, the vertical axis of the peak information shown as the audio synchronization feature amount at each of arrows Q31 to Q34 represents the time delay τ, and the horizontal axis represents the time section i. Each square represents the peak information in one time section.

In this example, on the providing apparatus 11 side, peak information is first obtained as the audio synchronization feature amount as indicated by arrow Q31. It is then downsampled into the peak information with longer time sections indicated by arrow Q32 and transmitted to the sub receiving device 85. Here, peak information for 8 msec time sections is frame rate converted (downsampled) into peak information for 32 msec time sections.

Meanwhile, the audio synchronization feature amount calculation unit 121 of the sub receiving device 85 calculates an audio synchronization feature amount from the audio signal obtained by collecting the sound of the main content reproduced by the main receiving device 81, and as a result the peak information indicated by arrow Q33 is obtained as the audio synchronization feature amount. Here, the peak information indicated by arrow Q33 is calculated for each 8 msec time section.

In this example, the frame rate of the audio synchronization feature amount calculated by the audio synchronization feature amount calculation unit 121 does not match that of the audio synchronization feature amount received from the providing apparatus 11.

The frame rate conversion unit 182 therefore upsamples the peak information received from the providing apparatus 11 as the audio synchronization feature amount and calculates the peak information for 8 msec time sections indicated by arrow Q34, thereby aligning the time granularity of the audio synchronization feature amounts used in the synchronization calculation. For example, the frame rate conversion unit 182 upsamples the peak information by calculating equation (5).

In the calculation of equation (5), the value of one piece of pre-upsampling peak information P'_i(τ) is used unchanged as the value of each of the four pieces of post-upsampling peak information P_i(τ) that have the same time delay τ, are adjacent in the time direction (time section direction), and occupy the same position as P'_i(τ).
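Equation (5) itself is not reproduced in this text. A form consistent with the description, assuming an upsampling factor of 4 (32 msec to 8 msec) and zero-based time section indices, would be:

```latex
P_{4i}(\tau) = P_{4i+1}(\tau) = P_{4i+2}(\tau) = P_{4i+3}(\tau) = P'_{i}(\tau)
```

The index convention is an assumption; only the copy-to-four-neighbors behavior is stated in the text.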

By upsampling the audio synchronization feature amount used in the synchronization calculation to match the higher frame rate as appropriate, synchronization accuracy with a pseudo-high resolution can be achieved.
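As a rough illustration of this upsampling, the sketch below repeats each coarse time section a fixed number of times. The function name `upsample_peaks` and the list-of-lists layout are assumptions made for the example.

```python
def upsample_peaks(peaks, factor=4):
    """Repeat each time section `factor` times (sample-and-hold).

    peaks: list of time sections, each a list of 0/1 values indexed by tau.
    """
    return [list(section) for section in peaks for _ in range(factor)]


# Two 32 msec sections become eight 8 msec sections.
coarse = [[1, 0, 1], [0, 1, 0]]
fine = upsample_peaks(coarse)
```

Repetition adds no new information. It only brings both feature streams to the same time granularity, which is why the resulting resolution is described as pseudo-high.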

Furthermore, to reduce the computing resources used in the sub receiving device 85, both the audio synchronization feature amount included in the sub transmission signal and the audio synchronization feature amount calculated by the audio synchronization feature amount calculation unit 121 may be downsampled.

Providing the frame rate conversion unit 181 and the frame rate conversion unit 182 as described above makes it possible to synchronize audio synchronization feature amounts that have different frame rates. It also allows various frame rates to be specified according to computing resources, transmission bandwidth, and the like, which increases the flexibility of the system.

Returning to the description of FIG. 6, the block integration unit 183 receives, from the frame rate conversion unit 181, the time-series data of the audio synchronization feature amount for each time section of the main content, and integrates the data in units of blocks, with a plurality of consecutive time sections (for example, 64) forming one block. The block integration unit 183 supplies the block-unit time-series data of the audio synchronization feature amount to the similarity calculation unit 185.

The block integration unit 184 receives, from the frame rate conversion unit 182, the time-series data of the audio synchronization feature amount for each time section of the main content, and integrates the data in units of blocks, with a plurality of consecutive time sections (for example, 64) forming one block. The block integration unit 184 supplies the block-unit time-series data of the audio synchronization feature amount to the similarity calculation unit 185.

The time sections that make up a block do not have to be consecutive. For example, a plurality of even-numbered time sections may form one block, or a plurality of odd-numbered time sections may form one block. In this case, the time-series data of the audio synchronization feature amount for each time section can be thinned out, so the amount of computation can be reduced.
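Block integration and the thinning variant can be sketched as follows. The function `integrate_blocks` and its `step` parameter are hypothetical. The sketch assumes non-overlapping blocks and treats thinning as keeping every `step`-th time section before grouping, which is one possible reading of the text.

```python
def integrate_blocks(sections, block_len=64, step=1):
    """Group time sections into blocks of `block_len` sections.

    step=1 uses consecutive sections. step=2 keeps only every second
    section (the even-numbered ones), so a block covers twice the time
    span with the same number of sections.
    """
    thinned = sections[::step]
    return [
        thinned[k:k + block_len]
        for k in range(0, len(thinned) - block_len + 1, block_len)
    ]


# 256 dummy time sections; each holds its own parity as a one-element list.
sections = [[t % 2] for t in range(256)]
blocks = integrate_blocks(sections)               # consecutive sections
even_blocks = integrate_blocks(sections, step=2)  # even-numbered sections only
```

Starting the slice at index 1 instead of 0 would select the odd-numbered time sections.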

The similarity calculation unit 185 calculates the similarity between the block-unit time-series data of the audio synchronization feature amounts supplied from the block integration unit 183 and the block integration unit 184, and generates a similarity matrix representing the similarity between the blocks. The similarity calculation unit 185 supplies the similarity matrix to the optimal path search unit 186.

The optimal path search unit 186 searches the similarity matrix supplied from the similarity calculation unit 185 for the path with the optimal similarity, and generates, as synchronization correction information based on the audio feature amount, information representing the time difference between the two blocks that correspond to each similarity on that path. The optimal path search unit 186 then supplies the synchronization correction information based on the audio feature amount to the reproduction processing unit 127.

As described above, the synchronization calculation unit 126 generates the synchronization correction information based on the audio feature amount from the pitch information of the audio signals. The synchronization correction information can therefore be generated robustly even when, for example, each audio signal contains different noise.

That is, when a person hears a plurality of sounds having frequency characteristics, the person perceives sounds with the same fundamental frequency, that is, sounds with the same pitch, as a common component, and can therefore easily perceive the common component even when noise is included. With this in mind, the present technology generates the synchronization correction information based on pitch information, so that the information is robust against noise.

<Calculation of similarity and search for the optimal similarity path>
The calculation of the similarity and the search for the path with the optimal similarity are described next.

FIG. 10 is a diagram illustrating the blocks for which the similarity is calculated.

In FIG. 10, i is the index of a block of the audio synchronization feature amount obtained by the audio synchronization feature amount calculation unit 121, and j is the index of a block of the audio synchronization feature amount included in the sub received signal. More precisely, these audio synchronization feature amounts are frame rate converted as appropriate by the frame rate conversion unit 181 or the frame rate conversion unit 182, but to keep the explanation simple, the similarity calculation is described here on the assumption that no frame rate conversion is performed.

X(i) represents the time-series data of the audio synchronization feature amount of the block with index i among the audio synchronization feature amounts obtained by the audio synchronization feature amount calculation unit 121, and Y(j) represents the time-series data of the audio synchronization feature amount of the block with index j among the audio synchronization feature amounts included in the sub received signal.

As shown in FIG. 10, the similarity is calculated for the n × m combinations of each of the n blocks X(i) with each of the m blocks Y(j).

FIG. 11 is a diagram illustrating the method for calculating the similarity.

In the matrices of FIG. 11, the horizontal axis represents the in-block time section number, which counts the time sections from the beginning of the block, and the vertical axis represents the index τ. A white square indicates that the time-series data P(τ) of the audio synchronization feature amount at index τ in the time section with the corresponding in-block time section number is 0, and a black square indicates that P(τ) is 1. In the example of FIG. 11, a block consists of four time sections and τ ranges from 0 to 3.

As shown in FIG. 11, to calculate the similarity between X(i) and Y(j), the logical product X(i)∩Y(j) of X(i) and Y(j) is calculated first, and then the logical sum X(i)∪Y(j) is calculated. For example, when the similarity is calculated between X(i) and Y(j) that each consist of nine 0s and seven 1s, as in FIG. 11, the logical product X(i)∩Y(j) consisting of twelve 0s and four 1s is calculated first, and then the logical sum X(i)∪Y(j) consisting of six 0s and ten 1s is calculated.

Then, using equation (6), the similarity A(i,j) between X(i) and Y(j) is calculated from Number(X(i)∩Y(j)), the number of 1s in the logical product X(i)∩Y(j), and Number(X(i)∪Y(j)), the number of 1s in the logical sum X(i)∪Y(j).
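Equation (6) is not reproduced in this text. Given the two counts it is defined on and the FIG. 11 example (4 and 10 giving 0.4), it presumably has the following form:

```latex
A(i,j) = \frac{\mathrm{Number}\bigl(X(i) \cap Y(j)\bigr)}{\mathrm{Number}\bigl(X(i) \cup Y(j)\bigr)}
```

This ratio is the Jaccard index of the two sets of peak positions.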

In the example of FIG. 11, Number(X(i)∩Y(j)) is 4 and Number(X(i)∪Y(j)) is 10, so the similarity A(i,j) is 0.4.
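The calculation can be checked with a short sketch. The bit patterns below are made up so that the counts match the FIG. 11 example (seven 1s per block, four shared); they are not the actual patterns in the figure. The function names and the choice to return 0.0 for two all-zero blocks are assumptions.

```python
def similarity(x, y):
    """Ratio of 1s in the logical product to 1s in the logical sum.

    x, y: flattened 0/1 blocks of equal length.
    Assumption: two all-zero blocks get similarity 0.0.
    """
    product_ones = sum(a & b for a, b in zip(x, y))
    sum_ones = sum(a | b for a, b in zip(x, y))
    return product_ones / sum_ones if sum_ones else 0.0


def similarity_matrix(xs, ys):
    """A[i][j] for every pair of blocks: rows follow i, columns follow j."""
    return [[similarity(x, y) for y in ys] for x in xs]


# 4 time sections x 4 values of tau, flattened to 16 bits per block.
x_block = [1] * 7 + [0] * 9            # 1s at positions 0..6
y_block = [0] * 3 + [1] * 7 + [0] * 6  # 1s at positions 3..9
a = similarity(x_block, y_block)       # 4 shared 1s out of 10 in the union
```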

When the sum S(τ) of periodicity information is used as the time-series data of the audio synchronization feature amount, a method such as calculating the similarity using the cosine distance can be adopted.
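For real-valued S(τ) data, a cosine-based similarity could look like the sketch below. The function name and the handling of zero vectors are assumptions, and the sketch returns the cosine similarity (1 for parallel vectors) rather than a distance.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two flattened real-valued blocks."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Assumption: an all-zero block is treated as dissimilar to everything.
    return dot / norm if norm else 0.0


same_shape = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the measure ignores overall scale, blocks whose S(τ) differ only in level, for example due to microphone gain, still score as similar.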

The similarity matrix is information indicating the similarity A(i,j) at each point corresponding to an index i and an index j, with, for example, the horizontal axis representing the index j and the vertical axis representing the index i.

The optimal path search unit 186 uses dynamic programming to search for the path on the similarity matrix whose accumulated similarity is the largest, and takes it as the path with the optimal similarity. The optimal path search unit 186 generates the index difference i − j corresponding to each similarity on the optimal path as the synchronization correction information based on the audio feature amount.
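One way to picture the dynamic programming search is the sketch below. The text does not specify the allowed moves or the boundary conditions, so the step set, the free start and end points, the diagonal-first tie-breaking, and the reduction of the path to a single most frequent offset are all assumptions made for illustration.

```python
from collections import Counter


def find_optimal_path(A):
    """Return the monotone path through A with the largest summed similarity.

    Assumptions (not from the source): a path may start anywhere on the
    first row or first column, end anywhere on the last row or last column,
    and advance by (+1, +1), (+1, 0) or (0, +1) per step.
    """
    n, m = len(A), len(A[0])
    D = [[0.0] * m for _ in range(n)]      # best accumulated similarity
    back = [[None] * m for _ in range(n)]  # predecessor of each cell
    for i in range(n):
        for j in range(m):
            D[i][j] = A[i][j]
            if i > 0 and j > 0:
                # max() keeps the first maximum, so the diagonal wins ties.
                prev = max(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                           key=lambda c: D[c[0]][c[1]])
                D[i][j] += D[prev[0]][prev[1]]
                back[i][j] = prev
    ends = [(n - 1, j) for j in range(m)] + [(i, m - 1) for i in range(n)]
    cell = max(ends, key=lambda c: D[c[0]][c[1]])
    path = []
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]


# Synthetic matrix: block i of one stream matches block j = i - 2 of the other.
A = [[1.0 if i - j == 2 else 0.0 for j in range(6)] for i in range(6)]
path = find_optimal_path(A)
offsets = [i - j for i, j in path]
block_offset = Counter(offsets).most_common(1)[0][0]
```

On this synthetic matrix the path follows the diagonal where i − j is 2, so every point on the path yields the same index difference.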

<Description of transmission processing>
Next, the operation of the providing apparatus 11 is described.

When a main channel signal and a sub channel signal that are time-synchronized with each other are supplied, the providing apparatus 11 performs transmission processing and transmits a main transmission signal and a sub transmission signal. The transmission processing by the providing apparatus 11 is described below with reference to the flowchart of FIG. 12.

In step S11, the audio synchronization feature amount calculation unit 23 performs audio synchronization feature amount calculation processing, calculates the audio synchronization feature amount from the audio signal that constitutes the supplied main channel signal, and supplies it to the multiplexing processing unit 24.

The details of the audio synchronization feature amount calculation processing are described later.

In step S12, the conversion unit 21 generates the main transmission signal by converting the supplied main channel signal into a signal in a predetermined transmission format defined by the system, and supplies the resulting main transmission signal to the output unit 22.

In step S13, the output unit 22 transmits the main transmission signal supplied from the conversion unit 21.

In step S14, the multiplexing processing unit 24 multiplexes the audio synchronization feature amount and the sub channel signal, and supplies the resulting sub transmission signal to the output unit 25.

For example, the multiplexing processing unit 24 uses the supplied main channel signal to multiplex the audio synchronization feature amount and the sub channel signal in the transmission format defined by the system, such that the time synchronization relationship between the audio synchronization feature amount from the audio synchronization feature amount calculation unit 23 and the supplied sub channel signal is maintained.

As a result, the sub transmission signal shown in FIG. 13, for example, is obtained.

In the example of FIG. 13, section T11 and section T12 of the bit stream serving as the sub transmission signal each contain one frame of the image signal, the audio signal, and the audio synchronization feature amount.

For example, the image signal and the audio signal contained in section T11 are one frame of the sub channel signal, and the audio synchronization feature amount contained in section T11 is the audio synchronization feature amount extracted from the frame of the main channel signal that corresponds in time to that sub channel signal. In this way, the sub channel signal and the audio synchronization feature amount of the same frame are associated with each other and multiplexed in the sub transmission signal, so that the receiving side of the sub transmission signal can identify the audio synchronization feature amount associated with the sub channel signal of each frame.
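The per-frame association can be pictured with a small data structure. The actual transmission format is defined by the system and is not given in the text, so the `SubFrame` class, its fields, and the `multiplex` function below are purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class SubFrame:
    """One section (such as T11) of a hypothetical sub transmission signal."""
    image: bytes        # one frame of the sub channel image signal
    audio: bytes        # one frame of the sub channel audio signal
    sync_feature: list  # feature from the main channel frame at the same time


def multiplex(sub_frames, main_features):
    """Pair the k-th sub channel frame with the k-th main channel feature."""
    if len(sub_frames) != len(main_features):
        raise ValueError("frame counts must match")
    return [SubFrame(image, audio, feature)
            for (image, audio), feature in zip(sub_frames, main_features)]


stream = multiplex(
    [(b"img0", b"aud0"), (b"img1", b"aud1")],
    [[1, 0], [0, 1]],
)
```

Because each record carries both the sub channel frame and the matching feature, a receiver can look up the feature for any frame without a separate timing channel.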

Returning to the description of the flowchart of FIG. 12, in step S15 the output unit 25 transmits the sub transmission signal supplied from the multiplexing processing unit 24, and the transmission processing ends.

In this way, the providing apparatus 11 generates the sub transmission signal by associating and multiplexing the audio synchronization feature amount obtained from the main channel signal with the sub channel signal, and transmits the sub transmission signal and the main transmission signal.

Because the audio synchronization feature amount is transmitted in association with the sub channel signal, the receiving side can use the audio synchronization feature amount to reproduce the main content and the sub content in synchronization, even when the main channel signal and the sub channel signal are received by a plurality of different devices via different transmission paths.

<Description of audio synchronization feature amount calculation processing>
Next, the audio synchronization feature amount calculation processing corresponding to step S11 of FIG. 12 is described with reference to the flowchart of FIG. 14.

In step S41, the frequency band dividing unit 51 divides the supplied audio signal into time sections of about several tens of msec to 100 msec using a window function.

In step S42, the frequency band dividing unit 51 divides the audio signal of each time section into four frequency bands using a plurality of bandpass filters. The frequency band dividing unit 51 supplies the audio signals of the respective frequency bands to the periodicity detection units 52-1 to 52-4.

In step S43, each periodicity detection unit 52 extracts periodicity information for each time section by calculating the autocorrelation function x(b,τ) of the audio signal of each time section in the predetermined frequency band supplied from the frequency band dividing unit 51, and supplies the periodicity information to the periodicity intensity detection unit 53 and the periodicity information integration unit 54. The processing of step S43 is performed by each periodicity detection unit 52.
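The autocorrelation in step S43 can be sketched for a single band and a single time section as follows. The normalization by the zero-lag energy and the function name are assumptions; the text only states that an autocorrelation function x(b,τ) is computed.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation of one time section for tau = 0..max_lag.

    Assumption: values are normalized by the zero-lag energy so that the
    result at tau = 0 is 1.0 for any non-silent section.
    """
    energy = sum(s * s for s in frame)
    if energy == 0:
        return [0.0] * (max_lag + 1)
    return [
        sum(frame[t] * frame[t + tau] for t in range(len(frame) - tau)) / energy
        for tau in range(max_lag + 1)
    ]


# A signal with period 4 correlates strongly with itself at lag 4.
frame = [1.0, 0.0, -1.0, 0.0] * 8
x = autocorrelation(frame, 6)
```

A strong value at a nonzero lag indicates that the section repeats with that period, which is the pitch-related cue the later steps build on.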

In step S44, each periodicity intensity detection unit 53 calculates the intensity of the periodicity for each time section based on the periodicity information for each time section supplied from the periodicity detection unit 52. The periodicity intensity detection unit 53 then generates periodicity intensity information for each time section by binarizing the intensity of the periodicity according to whether it exceeds a threshold, and supplies the information to the periodicity information integration unit 54. The processing of step S44 is performed by each periodicity intensity detection unit 53.

In step S45, the periodicity information integration unit 54 performs periodicity integration processing using equation (1) described above, based on the periodicity information for each time section supplied from the periodicity detection units 52 and the periodicity intensity information for each time section supplied from the periodicity intensity detection units 53. The periodicity information integration unit 54 supplies the sum S(τ) of the periodicity information for each time section, obtained as a result of the periodicity integration processing, to the peak detection unit 55.

In step S46, the peak detection unit 55 performs peak detection on the sum S(τ) of the periodicity information supplied from the periodicity information integration unit 54 for each time section, generates peak information P(τ), and supplies it to the downsampling unit 56.
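A possible shape for the peak detection in step S46 is sketched below. This passage does not give the exact peak criterion, so the local-maximum rule, the optional threshold, and the function name are assumptions.

```python
def detect_peaks(s, threshold=0.0):
    """Return P(tau): 1 where s has a local maximum above `threshold`, else 0.

    The first and last values of tau are never marked because they have
    only one neighbor.
    """
    p = [0] * len(s)
    for tau in range(1, len(s) - 1):
        if s[tau] > threshold and s[tau - 1] < s[tau] >= s[tau + 1]:
            p[tau] = 1
    return p


# Summed periodicity information for one time section, tau = 0..6.
s = [0.0, 0.2, 0.9, 0.3, 0.1, 0.6, 0.5]
p = detect_peaks(s, threshold=0.5)
```

The result is binary, which matches the 0/1 values of P(τ) used in the FIG. 11 similarity calculation.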

In step S47, the downsampling unit 56 downsamples the peak information by integrating the peak information P(τ) of a plurality of time sections supplied from the peak detection unit 55 into one time section.

The downsampling unit 56 supplies the peak information for each time section obtained in this way to the multiplexing processing unit 24 as the time-series data of the audio synchronization feature amount for each time section, and the audio synchronization feature amount calculation processing ends. When the audio synchronization feature amount calculation processing ends, the processing proceeds to step S12 of FIG. 12.

Because the audio synchronization feature amount calculation unit 23 calculates the audio synchronization feature amount based on periodicity information as described above, the audio synchronization feature amount can be generated robustly.

<Description of main content reproduction processing>
When the main transmission signal is transmitted from the providing apparatus 11, the content reproduction system acquires it as a main received signal and reproduces the main content. The main content reproduction processing by the content reproduction system is described below with reference to the flowchart of FIG. 15.

In step S71, the input unit 111 acquires the main received signal and supplies it to the reproduction processing unit 112. For example, the input unit 111 acquires the main received signal by receiving the main transmission signal transmitted from the providing apparatus 11 as the main received signal.

In step S72, the reproduction processing unit 112 reproduces the main content based on the main received signal supplied from the input unit 111, and the main content reproduction processing ends.

For example, the reproduction processing unit 112 extracts the image signal and the audio signal of the main content from the main received signal, supplies the image signal to the display unit 82 for reproduction, and supplies the audio signal to the speaker 83 for reproduction. The main content is thereby reproduced.

In this way, the content reproduction system acquires the main received signal and reproduces the main content.

<Description of sub content reproduction processing>
In synchronization with the reproduction of the main content, the content reproduction system acquires the sub received signal and reproduces the sub content. The sub content reproduction processing by the content reproduction system is described below with reference to the flowchart of FIG. 16.

In step S101, the input unit 123 acquires the sub received signal and supplies it to the separation processing unit 124. For example, the input unit 123 acquires the sub received signal by receiving the sub transmission signal transmitted from the providing apparatus 11 as the sub received signal.

In step S102, the separation processing unit 124 separates the sub received signal supplied from the input unit 123 into the sub channel signal and the audio synchronization feature amount, and supplies the separated sub channel signal and audio synchronization feature amount to the buffer 125 for recording.

In step S103, the microphone 84 collects the sound of the main content output from the speaker 83 and supplies the resulting audio signal to the audio synchronization feature amount calculation unit 121. In step S103, for example, the sound of the main content reproduced in step S72 of FIG. 15 is collected.

In step S104, the audio synchronization feature amount calculation unit 121 performs audio synchronization feature amount calculation processing, calculates the audio synchronization feature amount from the audio signal supplied from the microphone 84, and supplies it to the buffer 122 for recording.

As the audio synchronization feature amount calculation processing, steps S131 to S136 shown in the flowchart of FIG. 17 are performed. These steps are the same as steps S41 to S46 of FIG. 14, so their description is omitted. In the audio synchronization feature amount calculation processing shown in FIG. 17, however, the audio synchronization feature amount is calculated from the audio signal supplied from the microphone 84 and accumulated in the buffer 122. In addition, in the audio synchronization feature amount calculation unit 121, the peak information obtained by the peak detection unit 155 is used as the audio synchronization feature amount.

図１６のフローチャートの説明に戻り、ステップＳ１０５において、同期計算部１２６は、同期補正情報生成処理を行って、音声特徴量に基づく同期補正情報を生成し、再生処理部１２７に供給する。なお、同期補正情報生成処理の詳細は後述するが、この処理では、バッファ１２２に記録されている音声同期用特徴量と、バッファ１２５に記録されている音声同期用特徴量とを比較することで、メインコンテンツとサブコンテンツとを同期させるための音声特徴量に基づく同期補正情報が生成される。 Returning to the description of the flowchart of FIG. 16, in step S 105, the synchronization calculation unit 126 performs synchronization correction information generation processing, generates synchronization correction information based on the audio feature amount, and supplies the synchronization correction information to the reproduction processing unit 127. Although details of the synchronization correction information generation process will be described later, in this process, the audio synchronization feature quantity recorded in the buffer 122 is compared with the audio synchronization feature quantity recorded in the buffer 125. The synchronization correction information based on the audio feature amount for synchronizing the main content and the sub-content is generated.

ステップＳ１０６において、再生処理部１２７は、同期計算部１２６から供給された音声特徴量に基づく同期補正情報に基づいて、バッファ１２５に記録されているサブチャンネル信号の再生タイミングを補正し、補正後のサブチャンネル信号に基づいてサブコンテンツを再生させる。 In step S106, the reproduction processing unit 127 corrects the reproduction timing of the subchannel signal recorded in the buffer 125 in accordance with the synchronization correction information based on the audio feature amount supplied from the synchronization calculation unit 126, and reproduces the sub content based on the corrected subchannel signal.

すなわち、再生処理部１２７は、サブチャンネル信号を構成する画像信号と音声信号を、音声特徴量に基づく同期補正情報により示される時間だけ遅くまたは早く表示部８６とスピーカ８７に供給し、再生させる。換言すれば、音声特徴量に基づく同期補正情報から特定される、現在時刻において再生されているメインコンテンツの部分と対応する再生時刻のサブコンテンツの部分が再生される。 That is, the reproduction processing unit 127 supplies the image signal and the audio signal constituting the subchannel signal to the display unit 86 and the speaker 87 later or earlier by the time indicated by the synchronization correction information based on the audio feature amount, and has them reproduced. In other words, the part of the sub content whose reproduction time corresponds to the part of the main content being reproduced at the current time, as specified from the synchronization correction information based on the audio feature amount, is reproduced.

例えば、サブコンテンツをメインコンテンツと同期させるための再生位置の調整（補正）は、サブコンテンツやメインコンテンツの無音区間で行われる。 For example, the adjustment (correction) of the reproduction position for synchronizing the sub content with the main content is performed in a silent section of the sub content or the main content.
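The correction described above can be sketched as follows. This is a minimal illustration, not the reproduction processing unit 127 itself: the function names, the sign convention of the correction value, and the RMS-based silence test with its threshold are all assumptions introduced here.

```python
def corrected_sub_position(sub_position_s: float, correction_s: float) -> float:
    """Return the sub-content playback position after applying the correction.

    Assumed sign convention: a positive correction_s moves the sub content
    forward (it was lagging); a negative value moves it back.
    """
    return max(0.0, sub_position_s + correction_s)


def maybe_correct(sub_position_s: float, correction_s: float,
                  frame_rms: float, silence_threshold: float = 0.01) -> float:
    """Apply the correction only while the current audio frame is silent.

    frame_rms and silence_threshold are hypothetical stand-ins for whatever
    silent-section detection the player actually uses.
    """
    if frame_rms < silence_threshold:
        return corrected_sub_position(sub_position_s, correction_s)
    return sub_position_s
```

Deferring the jump to a silent section means the listener does not hear a discontinuity in the sub-content audio.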

表示部８６は、再生処理部１２７から供給された画像信号に基づいて、サブコンテンツの画像を表示し、スピーカ８７は、再生処理部１２７から供給された音声信号に基づいて、サブコンテンツの音声を出力する。 The display unit 86 displays the image of the sub content based on the image signal supplied from the reproduction processing unit 127, and the speaker 87 outputs the audio of the sub content based on the audio signal supplied from the reproduction processing unit 127.

このようにして、メインコンテンツと同期してサブコンテンツが再生されると、サブコンテンツ再生処理は終了する。 In this way, when the sub-content is played back in synchronization with the main content, the sub-content playback processing ends.

以上のようにして、コンテンツ再生システムは、再生されているメインコンテンツの音声を収音して得られた音声信号から音声同期用特徴量を計算し、得られた音声同期用特徴量と、サブ受信信号に含まれている音声同期用特徴量とを用いて音声特徴量に基く同期補正情報を計算する。また、コンテンツ再生システムは、得られた同期補正情報を用いてサブコンテンツを、メインコンテンツと同期させて再生する。 As described above, the content reproduction system calculates the audio synchronization feature amount from the audio signal obtained by collecting the audio of the main content being reproduced, and calculates the synchronization correction information based on the audio feature amount using the obtained audio synchronization feature amount and the audio synchronization feature amount included in the sub reception signal. Further, the content reproduction system reproduces the sub content in synchronization with the main content using the obtained synchronization correction information.

このように、収音して得られた音声信号から抽出された音声同期用特徴量と、サブ受信信号に含まれている音声同期用特徴量とを用いて音声特徴量に基づく同期補正情報を計算することで、メインコンテンツとサブコンテンツとの伝送経路が異なる場合であっても、それらのコンテンツを同期して再生することができる。 In this way, by calculating the synchronization correction information based on the audio feature amount using the audio synchronization feature amount extracted from the audio signal obtained by sound collection and the audio synchronization feature amount included in the sub reception signal, the main content and the sub content can be reproduced in synchronization even when their transmission paths differ.

なお、この例では、音声同期用特徴量の同期計算、つまりマッチング処理は、毎フレーム行われるが、音声同期用特徴量の同期計算は、必ずしも時間的に連続して行われる必要はなく、間欠的に行われるようにしてもよい。但し、同期計算を連続的に行った方がサブコンテンツの再生時刻（再生位置）の補正時に、違和感なく補正を行うことができる。 In this example, the synchronization calculation of the audio synchronization feature amounts, that is, the matching process, is performed every frame; however, the synchronization calculation does not necessarily have to be performed continuously in time and may be performed intermittently. That said, performing the synchronization calculation continuously allows the reproduction time (reproduction position) of the sub content to be corrected without the correction seeming unnatural.

〈同期補正情報生成処理の説明〉 <Description of synchronization correction information generation processing>
さらに、図１８のフローチャートを参照して、図１６のステップＳ１０５の処理に対応する同期補正情報生成処理について説明する。 Further, the synchronization correction information generation processing corresponding to the processing of step S105 of FIG. 16 will be described with reference to the flowchart of FIG. 18.

ステップＳ１６１において、フレームレート変換部１８１およびフレームレート変換部１８２は、必要に応じてフレームレート変換処理を行う。 In step S161, the frame rate conversion unit 181 and the frame rate conversion unit 182 perform frame rate conversion processing as necessary.

すなわち、フレームレート変換部１８１は、バッファ１２２からメインコンテンツの時間区間ごとの音声同期用特徴量の時系列データを読み出して、必要に応じて音声同期用特徴量をフレームレート変換、つまりダウンサンプルし、ブロック統合部１８３に供給する。また、フレームレート変換部１８２は、バッファ１２５からメインコンテンツの時間区間ごとの音声同期用特徴量の時系列データを読み出して、必要に応じて音声同期用特徴量をフレームレート変換、つまりダウンサンプルまたはアップサンプルし、ブロック統合部１８４に供給する。 That is, the frame rate conversion unit 181 reads the time-series data of the audio synchronization feature amount for each time section of the main content from the buffer 122, performs frame rate conversion, that is, downsampling, on the audio synchronization feature amount as necessary, and supplies the result to the block integration unit 183. Also, the frame rate conversion unit 182 reads the time-series data of the audio synchronization feature amount for each time section of the main content from the buffer 125, performs frame rate conversion, that is, downsampling or upsampling, on the audio synchronization feature amount as necessary, and supplies the result to the block integration unit 184.
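As a rough illustration of the frame rate conversion in step S161, the sketch below resamples a time series of binary peak-information frames. The rule used for downsampling (a bin is 1 if any of the merged frames has a peak) and for upsampling (frame repetition) are assumptions made for this example; this passage does not specify the conversion rule, and the function names are hypothetical.

```python
from typing import List

Frame = List[int]  # one time section: binary peak information per bin


def downsample(frames: List[Frame], factor: int) -> List[Frame]:
    """Merge every `factor` consecutive frames into one (logical OR per bin).

    Trailing frames that do not fill a complete group are dropped.
    """
    out = []
    for start in range(0, len(frames) - factor + 1, factor):
        group = frames[start:start + factor]
        out.append([max(column) for column in zip(*group)])
    return out


def upsample(frames: List[Frame], factor: int) -> List[Frame]:
    """Repeat every frame `factor` times."""
    return [list(frame) for frame in frames for _ in range(factor)]
```

Either function brings the two feature sequences to a common frame rate so that they can be compared block by block.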

ステップＳ１６２において、ブロック統合部１８３およびブロック統合部１８４は、音声同期用特徴量の時系列データを統合する。 In step S162, the block integration unit 183 and the block integration unit 184 integrate the time series data of the audio synchronization feature amount.

具体的には、ブロック統合部１８３は、フレームレート変換部１８１からメインコンテンツの時間区間ごとの音声同期用特徴量の時系列データの供給を受ける。そして、ブロック統合部１８３は、連続した複数（例えば64個）の時間区間を１ブロックとして、ブロック単位で、供給された時間区間ごとの音声同期用特徴量の時系列データを統合し、類似度計算部１８５に供給する。 Specifically, the block integration unit 183 receives the time-series data of the audio synchronization feature amount for each time section of the main content from the frame rate conversion unit 181. The block integration unit 183 then treats a plurality of consecutive time sections (for example, 64) as one block, integrates the supplied time-series data of the audio synchronization feature amount for each time section in units of blocks, and supplies the result to the similarity calculation unit 185.

また、ブロック統合部１８４は、フレームレート変換部１８２からメインコンテンツの時間区間ごとの音声同期用特徴量の時系列データの供給を受ける。そして、ブロック統合部１８４は、連続した複数（例えば64個）の時間区間を１ブロックとして、ブロック単位で、供給された時間区間ごとの音声同期用特徴量の時系列データを統合し、類似度計算部１８５に供給する。 Similarly, the block integration unit 184 receives the time-series data of the audio synchronization feature amount for each time section of the main content from the frame rate conversion unit 182. The block integration unit 184 then treats a plurality of consecutive time sections (for example, 64) as one block, integrates the supplied time-series data of the audio synchronization feature amount for each time section in units of blocks, and supplies the result to the similarity calculation unit 185.
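The block integration of step S162 can be pictured as grouping consecutive time sections into fixed-length blocks. The sketch below is only an illustration: the `hop` parameter (how far the block start advances each time) is an assumption, since this passage only states that consecutive time sections, for example 64, form one block.

```python
from typing import List

Frame = List[int]


def integrate_blocks(frames: List[Frame], block_len: int = 64,
                     hop: int = 1) -> List[List[Frame]]:
    """Group consecutive time sections into blocks of `block_len` frames.

    `hop` (hypothetical) controls the spacing between block starts;
    hop == block_len gives non-overlapping blocks.
    """
    last_start = len(frames) - block_len
    return [frames[s:s + block_len] for s in range(0, last_start + 1, hop)]
```

For example, ten frames with `block_len=4` and `hop=2` yield four blocks starting at frames 0, 2, 4, and 6.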

ステップＳ１６３において、類似度計算部１８５は、ブロック統合部１８３とブロック統合部１８４のそれぞれから供給されたブロック単位の音声同期用特徴量の時系列データ同士の類似度を計算し、各ブロック間の類似度を表す類似度マトリックスを生成する。類似度計算部１８５は、類似度マトリックスを最適パス検索部１８６に供給する。 In step S163, the similarity calculation unit 185 calculates the similarity between the block-unit time-series data of the audio synchronization feature amounts supplied from the block integration unit 183 and the block integration unit 184, and generates a similarity matrix representing the similarity between each pair of blocks. The similarity calculation unit 185 supplies the similarity matrix to the optimum path search unit 186.
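A similarity matrix of the kind generated in step S163 might be built as in the following sketch. The similarity measure used here (peaks present in both blocks divided by peaks present in either block) is an assumption chosen for illustration; this passage does not define the measure, and the function names are hypothetical.

```python
from typing import List

Frame = List[int]
Block = List[Frame]


def block_similarity(a: Block, b: Block) -> float:
    """Assumed measure: shared peaks divided by peaks present in either block."""
    both = either = 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            both += 1 if (x and y) else 0
            either += 1 if (x or y) else 0
    return both / either if either else 0.0


def similarity_matrix(blocks_a: List[Block],
                      blocks_b: List[Block]) -> List[List[float]]:
    """Entry [i][j] is the similarity between blocks_a[i] and blocks_b[j]."""
    return [[block_similarity(a, b) for b in blocks_b] for a in blocks_a]
```

Row index i then corresponds to a block of the captured audio and column index j to a block of the feature amounts carried in the sub reception signal.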

ステップＳ１６４において、最適パス検索部１８６は、類似度計算部１８５から供給された類似度マトリックスから最適な類似度のパスを検索し、音声特徴量に基づく同期補正情報を生成する。そして、最適パス検索部１８６は、音声特徴量に基づく同期補正情報を再生処理部１２７に供給して、同期補正情報生成処理は終了する。 In step S164, the optimal path search unit 186 searches for a path with the optimal similarity from the similarity matrix supplied from the similarity calculation unit 185, and generates synchronization correction information based on the audio feature amount. Then, the optimum path search unit 186 supplies synchronization correction information based on the audio feature amount to the reproduction processing unit 127, and the synchronization correction information generation process ends.
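The path search of step S164 is not detailed in this passage. As a deliberately simplified stand-in, the sketch below assumes that the two sequences differ only by a constant offset and picks the diagonal of the similarity matrix with the highest mean similarity; a dynamic-programming search would be needed to follow a path that drifts over time. The function name and the `min_overlap` parameter are hypothetical.

```python
from typing import List, Tuple


def best_diagonal_offset(matrix: List[List[float]],
                         min_overlap: int = 1) -> Tuple[int, float]:
    """Return (offset, mean similarity) of the best diagonal.

    offset = column index - row index, i.e. how many blocks the second
    sequence is shifted relative to the first. Diagonals shorter than
    min_overlap are ignored so that one lucky cell cannot win.
    """
    rows, cols = len(matrix), len(matrix[0])
    best = None
    for offset in range(-(rows - 1), cols):
        values = [matrix[i][i + offset]
                  for i in range(rows) if 0 <= i + offset < cols]
        if len(values) < min_overlap:
            continue
        score = sum(values) / len(values)
        if best is None or score > best[1]:
            best = (offset, score)
    if best is None:
        raise ValueError("no diagonal satisfies min_overlap")
    return best
```

Multiplying the returned block offset by the block hop duration would give a time offset that could serve as synchronization correction information under these assumptions.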

以上のようにして、コンテンツ再生システムは、周期性情報に基づいて音声特徴量に基づく同期補正情報を生成するので、同期補正情報をロバストに生成することができる。 As described above, the content reproduction system generates the synchronization correction information based on the audio feature amount based on the periodicity information, so that the synchronization correction information can be generated robustly.

なお、以上においては、メインコンテンツが１つである場合について説明したが、メインコンテンツが複数あってもよい。 In addition, although the case where there is one main content has been described above, there may be a plurality of main contents.

そのような場合、提供装置１１の音声同期用特徴量計算部２３は、複数のメインコンテンツごとに音声同期用特徴量を計算し、多重化処理部２４は、１つのサブコンテンツのサブチャンネル信号と、複数のメインコンテンツの音声同期用特徴量とを多重化し、サブ送出信号とする。また、出力部２２は、複数のメインコンテンツのメインチャンネル信号から得られたメイン送出信号を送信する。 In such a case, the audio synchronization feature amount calculation unit 23 of the providing apparatus 11 calculates an audio synchronization feature amount for each of the plurality of main contents, and the multiplexing processing unit 24 multiplexes the subchannel signal of one sub content with the audio synchronization feature amounts of the plurality of main contents to obtain a sub transmission signal. The output unit 22 transmits the main transmission signals obtained from the main channel signals of the plurality of main contents.

さらに、この場合、図４に示したコンテンツ再生システムでは、再生処理部１１２は、複数のメインコンテンツのうちの１つを選択して再生する。また、入力部１２３は、１つのサブチャンネル信号に対して、複数のメインコンテンツの音声同期用特徴量が対応付けられているサブ受信信号を受信する。 Furthermore, in this case, in the content reproduction system shown in FIG. 4, the reproduction processing unit 112 selects and reproduces one of the plurality of main contents. Further, the input unit 123 receives a sub reception signal in which a plurality of main content audio synchronization feature quantities are associated with one sub channel signal.

そして、同期計算部１２６は、入力部１２３で取得された各メインコンテンツの音声同期用特徴量と、音声同期用特徴量計算部１２１で得られた音声同期用特徴量とを比較して類似度を計算し、スピーカ８３で再生されているメインコンテンツを特定する。例えば、音声同期用特徴量とのマッチングの結果、最も類似度の高い音声同期用特徴量のメインコンテンツが、再生されているメインコンテンツであるとされる。 Then, the synchronization calculation unit 126 compares the audio synchronization feature amount of each main content acquired by the input unit 123 with the audio synchronization feature amount obtained by the audio synchronization feature amount calculation unit 121, calculates the similarity, and identifies the main content being reproduced by the speaker 83. For example, as a result of the matching, the main content whose audio synchronization feature amount has the highest similarity is taken to be the main content being reproduced.

再生されているメインコンテンツが特定されると、特定されたメインコンテンツの音声同期用特徴量について得られた同期補正情報に基づいて、サブコンテンツの再生位置が補正される。すなわち、同期計算部１２６は、特定されたメインコンテンツと、サブコンテンツとを同期させるための音声特徴量に基づく同期補正情報を生成する。 When the main content being reproduced is specified, the reproduction position of the sub-content is corrected based on the synchronization correction information obtained for the audio synchronization feature amount of the specified main content. That is, the synchronization calculation unit 126 generates synchronization correction information based on the audio feature amount for synchronizing the identified main content and sub-content.
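The selection among several main contents can be sketched as a highest-similarity lookup. In this illustration the feature sequences are flat lists of binary peak flags that are assumed to be already time-aligned, and the overlap measure and all names are assumptions; in practice the comparison would be made at the best alignment found by the synchronization calculation.

```python
from typing import Dict, List, Tuple


def peak_overlap(a: List[int], b: List[int]) -> float:
    """Assumed measure: shared peaks divided by peaks present in either input."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def identify_main_content(captured: List[int],
                          candidates: Dict[str, List[int]]) -> Tuple[str, float]:
    """Return the candidate whose features best match the captured audio."""
    scores = {name: peak_overlap(captured, feat)
              for name, feat in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The synchronization correction information computed for the winning candidate is then the one applied to the sub content.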

〈本技術の適用例１〉 <Application example 1 of the present technology>
また、以上において説明した本技術は、様々な形態のシステムに適用することができる。 In addition, the present technology described above can be applied to various types of systems.

例えば、本技術は図１９に示すシステムに適用可能である。 For example, the present technology can be applied to the system shown in FIG. 19.

図１９に示すシステムでは、例えば放送局などの提供装置２１１が、図１の提供装置１１に対応する。提供装置２１１は、メインコンテンツとサブコンテンツとを提供する。 In the system shown in FIG. 19, a providing apparatus 211 such as a broadcasting station corresponds to the providing apparatus 11 of FIG. 1. The providing apparatus 211 provides main content and sub content.

この例では、提供装置２１１は、メインコンテンツのメイン送出信号を、例えば放送波により放送することで、メイン受信機器２１２にメイン送出信号を送信する。そして、メイン受信機器２１２は、放送波により送信されたメイン送出信号を、メイン受信信号として受信してメインコンテンツを再生する。このとき、メイン受信機器２１２は、メインコンテンツの音声を、メイン受信機器２１２に備えられたスピーカ２１３から出力する。 In this example, the providing apparatus 211 transmits the main transmission signal to the main receiving device 212 by broadcasting the main transmission signal of the main content by, for example, a broadcast wave. Then, the main reception device 212 receives the main transmission signal transmitted by the broadcast wave as the main reception signal and reproduces the main content. At this time, the main receiving device 212 outputs the sound of the main content from the speaker 213 provided in the main receiving device 212.

したがって、この例ではメイン受信機器２１２は、図４に示したメイン受信機器８１、表示部８２、およびスピーカ８３から構成されることになる。この場合、入力部１１１が、放送波により放送されたメイン受信信号を受信する。また、スピーカ２１３が、図４のスピーカ８３に対応することになる。 Therefore, in this example, the main receiving device 212 is composed of the main receiving device 81, the display unit 82, and the speaker 83 shown in FIG. 4. In this case, the input unit 111 receives the main reception signal broadcast by the broadcast wave. The speaker 213 corresponds to the speaker 83 of FIG. 4.

例えば、メイン受信機器２１２は、テレビジョン受像機などとされ、ユーザはメイン受信機器２１２で再生されるメインコンテンツを視聴する。 For example, the main receiving device 212 is a television receiver or the like, and the user views main content reproduced by the main receiving device 212.

一方、提供装置２１１からは、サブ送出信号も送信される。この例では提供装置２１１はサブ送出信号を、例えばインターネットなどの通信網２１４を介して、ストリーミング配信等によりサブ受信機器２１５に送信する。ここでは、サブ送信信号は、いわゆるプッシュ型の通信により送信される。 On the other hand, the providing apparatus 211 also transmits a sub transmission signal. In this example, the providing device 211 transmits the sub transmission signal to the sub reception device 215 by streaming distribution or the like via the communication network 214 such as the Internet. Here, the sub transmission signal is transmitted by so-called push type communication.

また、サブ受信機器２１５は、例えばタブレット型の端末装置などからなり、通信網２１４を介して送信されてきたサブ送信信号を、サブ受信信号として受信して、サブコンテンツを再生する。すなわち、サブ受信機器２１５は、内蔵する表示部にサブコンテンツの画像を表示させるとともに、内蔵するスピーカからサブコンテンツの音声を出力させる。 The sub reception device 215 includes, for example, a tablet-type terminal device and the like, receives a sub transmission signal transmitted via the communication network 214 as a sub reception signal, and reproduces the sub content. That is, the sub reception device 215 displays the sub content image on the built-in display unit and outputs the sub content sound from the built-in speaker.

このとき、サブ受信機器２１５は、スピーカ２１３から出力されたメインコンテンツの音声を収音して音声同期用特徴量を計算し、得られた音声同期用特徴量と、サブ受信信号に含まれている音声同期用特徴量とを用いて音声特徴量に基づく同期補正情報を生成する。そして、サブ受信機器２１５は、音声特徴量に基づく同期補正情報を用いてサブコンテンツを、メインコンテンツと同期させて再生させる。 At this time, the sub reception device 215 collects the audio of the main content output from the speaker 213 and calculates the audio synchronization feature quantity, and is included in the obtained audio synchronization feature quantity and the sub reception signal. Synchronization correction information based on the voice feature is generated using the voice synchronization feature. Then, the sub receiving device 215 uses the synchronization correction information based on the audio feature amount to reproduce the sub content in synchronization with the main content.

これにより、メイン受信機器２１２で再生されるメインコンテンツと、サブ受信機器２１５で再生されるサブコンテンツとが同期した状態で再生されることになり、ユーザは、適宜、サブコンテンツを見聞きしながら、メインコンテンツを視聴することができる。つまり、サブコンテンツを、例えばメインコンテンツの補助情報として活用しながら、メインコンテンツを楽しむことができる。 As a result, the main content reproduced by the main receiving device 212 and the sub content reproduced by the sub receiving device 215 are reproduced in a synchronized state, and the user can view the main content while watching and listening to the sub content as appropriate. That is, the user can enjoy the main content while using the sub content, for example, as auxiliary information for the main content.

この例では、サブコンテンツのサブチャンネル信号は、例えばメインコンテンツの映像とは別アングルの映像の画像信号や、メインコンテンツに対するコメンタリー音声の音声信号、メインコンテンツに関連する文字情報などとされる。 In this example, the sub-channel signal of the sub-content is, for example, an image signal of a video at a different angle from the video of the main content, an audio signal of commentary audio for the main content, character information related to the main content, or the like.

この場合、サブ受信機器２１５は、例えば図４に示したマイクロホン８４、サブ受信機器８５、表示部８６、およびスピーカ８７から構成されることになる。したがって、入力部１２３は、通信網２１４を介して送信されてきたサブ送信信号を、サブ受信信号として受信することになる。 In this case, the sub reception device 215 includes, for example, the microphone 84, the sub reception device 85, the display unit 86, and the speaker 87 illustrated in FIG. Therefore, the input unit 123 receives a sub transmission signal transmitted via the communication network 214 as a sub reception signal.

以上のように、図１９の例では、プッシュ型の通信で、互いに異なる伝送経路で送信されたメインコンテンツとサブコンテンツを、受信側において簡単かつ高精度に同期させて再生することができる。なお、この例では、サブ送出信号は、メイン送出信号に先んじて送出される必要がある。すなわち、メイン送出信号のメイン受信機器２１２への到着時刻と、サブ送出信号のサブ受信機器２１５への到着時刻の差（到着時間差）を考慮した時間差で、メイン送出信号とサブ送出信号の送信が行われる必要がある。 As described above, in the example of FIG. 19, main content and sub content transmitted by push-type communication over mutually different transmission paths can be reproduced on the receiving side in synchronization, simply and with high accuracy. Note that in this example the sub transmission signal needs to be sent ahead of the main transmission signal. That is, the main transmission signal and the sub transmission signal need to be transmitted with a time difference that takes into account the difference (arrival time difference) between the arrival time of the main transmission signal at the main receiving device 212 and the arrival time of the sub transmission signal at the sub receiving device 215.
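The timing requirement above amounts to simple arithmetic on the two path delays. The following sketch is an assumption-laden illustration: the function name, the idea of a fixed safety margin, and the delay values in the usage note are hypothetical and do not come from the specification.

```python
def sub_send_lead_time(main_delay_s: float, sub_delay_s: float,
                       margin_s: float = 0.0) -> float:
    """How much earlier than the main signal the sub signal should be sent.

    main_delay_s: expected transit time of the main transmission signal.
    sub_delay_s:  expected transit time of the sub transmission signal.
    margin_s:     hypothetical extra allowance for delay fluctuation.
    """
    return max(0.0, sub_delay_s - main_delay_s + margin_s)
```

For instance, with an assumed broadcast delay of 0.25 s, an assumed network delay of 1.5 s, and a 0.5 s margin, the sub transmission signal would be sent 1.75 s ahead of the main transmission signal.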

〈本技術の適用例２〉 <Application example 2 of the present technology>
また、本技術は、例えば図２０に示すシステムにも適用可能である。なお、図２０において、図１９における場合と対応する部分には同一の符号を付してあり、その説明は省略する。 The present technology is also applicable to, for example, the system shown in FIG. 20. In FIG. 20, portions corresponding to those in FIG. 19 are denoted by the same reference numerals, and their description is omitted.

図２０の例では、図１９の例と同様に、提供装置２１１から放送波により、つまりプッシュ型の通信によりメイン送出信号がメイン受信機器２１２に送信される。 In the example of FIG. 20, as in the example of FIG. 19, the main transmission signal is transmitted from the providing device 211 to the main reception device 212 by broadcast waves, that is, by push-type communication.

これに対して、サブ送出信号は、サーバ２４１により通信網２１４を介してサブ受信機器２１５に送信される。なお、サーバ２４１は、何らかの方法により、予めサブ送出信号を提供装置２１１等から取得して記録している。 On the other hand, the sub transmission signal is transmitted from the server 241 to the sub reception device 215 via the communication network 214. Note that the server 241 acquires and records the sub-transmission signal from the providing device 211 or the like in advance by some method.

この例では、サブ送出信号は、いわゆるプル型の通信により送信される。したがって、サーバ２４１は、サブ受信機器２１５からサブ送出信号の送信要求があったとき、通信網２１４を介して、サブ送出信号をサブ受信機器２１５に送信する。 In this example, the sub transmission signal is transmitted by so-called pull-type communication. Therefore, the server 241 transmits the sub transmission signal to the sub reception device 215 via the communication network 214 when the sub reception device 215 requests transmission of the sub transmission signal.

すなわち、サブ受信機器２１５に対応する図４のサブ受信機器８５の入力部１２３は、サーバ２４１にサブ送出信号の送信要求を送信するとともに、その送信要求に応じてサーバ２４１から送信されてきたサブ送出信号を、サブ受信信号として受信する。 That is, the input unit 123 of the sub receiving device 85 of FIG. 4, which corresponds to the sub receiving device 215, sends a transmission request for the sub transmission signal to the server 241 and receives, as the sub reception signal, the sub transmission signal sent from the server 241 in response to that request.

この場合、サブ受信機器２１５は、メインコンテンツの放送前に予めサブ送出信号を受信して記録しておくことができる。したがって、予めサブ送出信号を受信して記録しておけば、メインコンテンツの放送時に、通信網２１４の状態等によってサブコンテンツをメインコンテンツと同期して再生させることができないなどの事態を防止することができる。 In this case, the sub reception device 215 can receive and record the sub transmission signal in advance before broadcasting the main content. Therefore, if the sub transmission signal is received and recorded in advance, it is possible to prevent a situation in which the sub content cannot be reproduced in synchronization with the main content due to the state of the communication network 214 or the like when the main content is broadcast. .

サブ受信機器２１５は、メイン受信機器２１２でのメインコンテンツの再生が開始されると、スピーカ２１３から出力されたメインコンテンツの音声を収音して音声同期用特徴量を計算する。そして、サブ受信機器２１５は、得られた音声同期用特徴量と、サブ受信信号に含まれている音声同期用特徴量とを用いて音声特徴量に基づく同期補正情報を生成し、同期補正情報を用いてサブコンテンツを、メインコンテンツと同期させて再生させる。 When the main reception device 212 starts to reproduce the main content, the sub reception device 215 collects the audio of the main content output from the speaker 213 and calculates the audio synchronization feature amount. Then, the sub reception device 215 generates synchronization correction information based on the audio feature amount using the obtained audio synchronization feature amount and the audio synchronization feature amount included in the sub reception signal, and the synchronization correction information Is used to reproduce the sub-content in synchronization with the main content.

このように図２０の例では、サブ受信機器２１５が自身に都合のよいタイミングでサブ受信信号を取得することができる。 As described above, in the example of FIG. 20, the sub reception device 215 can acquire the sub reception signal at a timing convenient for itself.

〈本技術の適用例３〉 <Application example 3 of the present technology>
また、本技術は、例えば図２１に示すシステムにも適用可能である。なお、図２１において、図２０における場合と対応する部分には同一の符号を付してあり、その説明は省略する。 The present technology is also applicable to, for example, the system shown in FIG. 21. In FIG. 21, portions corresponding to those in FIG. 20 are denoted by the same reference numerals, and their description is omitted.

図２１の例では、メインコンテンツ、つまりメイン送出信号は、サーバ２４１とは異なるサーバ２７１により提供される。すなわち、サーバ２７１はメイン受信機器２１２から要求があったとき、通信網２７２を介して、記録しているメイン送出信号をメイン受信機器２１２に送信する。つまり、この例ではメイン送出信号はプル型の通信により送信される。 In the example of FIG. 21, the main content, that is, the main transmission signal is provided by a server 271 different from the server 241. That is, the server 271 transmits the recorded main transmission signal to the main receiving device 212 via the communication network 272 when requested by the main receiving device 212. That is, in this example, the main transmission signal is transmitted by pull-type communication.

具体的には、メイン受信機器２１２に対応する図４のメイン受信機器８１の入力部１１１は、サーバ２７１にメイン送出信号の送信要求を送信するとともに、その送信要求に応じてサーバ２７１から送信されてきたメイン送出信号を、メイン受信信号として受信する。 Specifically, the input unit 111 of the main receiving device 81 of FIG. 4, which corresponds to the main receiving device 212, sends a transmission request for the main transmission signal to the server 271 and receives, as the main reception signal, the main transmission signal sent from the server 271 in response to that request.

この場合、メイン受信機器２１２は、予めメイン送出信号を受信して記録しておくことができる。したがって、予めメイン送出信号を受信して記録しておけば、メインコンテンツの再生時に通信網２７２の状態等によってメインコンテンツの再生が途中で途切れたり、停止したりするなどの事態を防止することができる。 In this case, the main receiving device 212 can receive and record the main transmission signal in advance. Therefore, if the main transmission signal is received and recorded in advance, it is possible to prevent a situation in which the reproduction of the main content is interrupted or stopped during the reproduction of the main content due to the state of the communication network 272 or the like.

また、サブ送出信号は図２０の例と同様に、サーバ２４１によってプル型の通信により送信される。 Further, as in the example of FIG. 20, the sub transmission signal is transmitted by the server 241 by pull-type communication.

このように図２１の例では、メイン受信機器２１２とサブ受信機器２１５が、それぞれ自身に都合のよいタイミングでメイン受信信号とサブ受信信号を取得することができる。 In this way, in the example of FIG. 21, the main reception device 212 and the sub reception device 215 can acquire the main reception signal and the sub reception signal at a timing convenient for themselves.

なお、仮に通信網２７２が通信網２１４と同一の通信網であったとしても、メイン送信信号とサブ送信信号の送信タイミングや受信する機器等が異なれば、通常、これらのメイン送信信号とサブ送信信号の伝送経路は異なる経路となる。 Note that even if the communication network 272 is the same communication network as the communication network 214, the transmission paths of the main transmission signal and the sub transmission signal will usually differ if their transmission timing, the receiving devices, or the like differ.

〈第２の実施の形態〉 <Second Embodiment>
〈本技術の特徴〉 <Features of the present technology>
ところで、上述した（Ａ１）乃至（Ａ４）に示した例のように時間同期関係を有する複数のメディアコンテンツを、放送やIP（Internet Protocol）網などを通じて複数機器で受信し、受信したメディアコンテンツを同期して再生するというアプリケーションプログラムが想定される。 Incidentally, an application program is assumed in which a plurality of media contents having a time synchronization relationship, as in the examples (A1) to (A4) above, are received by a plurality of devices through broadcasting, an IP (Internet Protocol) network, or the like, and the received media contents are reproduced in synchronization.

このような機能性の実現のために、Hybridcastのように放送でコンテンツを配信すると同時に、IP網により個別に追加コンテンツを配信し、コンテンツ受信機では、放送により配信されたコンテンツと、IP網により配信された追加コンテンツを時間的に同期させて同時に出力するという放送通信連携サービスに向けたシステムの研究開発がされている。 To realize such functionality, systems for broadcast-communication cooperation services, such as Hybridcast, are being researched and developed. In such systems, content is distributed by broadcasting while additional content is distributed individually over an IP network, and the content receiver synchronizes in time the content distributed by broadcasting with the additional content distributed over the IP network and outputs them simultaneously.

例えば、Hybridcastについては「松村欣司、鹿喰善明、Michael J Evans，「インターネット配信情報との連動による放送番組パーソナライズシステムの検討」、映像情報メディア学会年次大会講演予稿集、２００９年８月２６日、ｐ．３−８」（以下、非特許文献１とも称する）に記載されている。 For example, Hybridcast is described in "Kinji Matsumura, Yoshiaki Shishikui, Michael J Evans, 'Study of a broadcast program personalization system linked with Internet distribution information', Proceedings of the Annual Convention of the Institute of Image Information and Television Engineers, August 26, 2009, p. 3-8" (hereinafter also referred to as Non-Patent Document 1).

また、「日本放送協会，「HybridcastTMの概要と技術」，NHK技研R&D，no.124, p.10-17, 2010年11月，日本放送出版協会,http://www.nhk.or.jp/strl/publica/rd/rd124/PDF/P10-17.pdf」（以下、非特許文献２とも称する）や、「日本放送協会，「HybridcastTMを支える技術」，NHK技研R&D，no.133, p.20-27, 2012年5月，日本放送出版協会,http://www.nhk.or.jp/strl/publica/rd/rd133/PDF/P20-27.pdf」（以下、非特許文献３とも称する）などにもHybridcastについて記載されている。 Hybridcast is also described in "Japan Broadcasting Corporation, 'Overview and Technology of HybridcastTM', NHK STRL R&D, no. 124, p. 10-17, November 2010, Japan Broadcast Publishing Association, http://www.nhk.or.jp/strl/publica/rd/rd124/PDF/P10-17.pdf" (hereinafter also referred to as Non-Patent Document 2) and in "Japan Broadcasting Corporation, 'Technology Supporting HybridcastTM', NHK STRL R&D, no. 133, p. 20-27, May 2012, Japan Broadcast Publishing Association, http://www.nhk.or.jp/strl/publica/rd/rd133/PDF/P20-27.pdf" (hereinafter also referred to as Non-Patent Document 3), among others.

Hybridcastでは、放送ストリームの基準クロック（PCR（Program Clock Reference））に基づく提示時間情報（PTS（Presentation Time Stamp））を付加した追加コンテンツを放送コンテンツの送出と同時、あるいは少し先んじてストリーミング配信し、受信機で、通信コンテンツの遅延と変動を吸収するために十分な量のバッファを持ち、放送コンテンツを遅らせ、両者のタイムスタンプを比較することで同期をとることを基本原理としている。 With Hybridcast, additional content with presentation time information (PTS (Presentation Time Stamp)) based on the reference clock (PCR (Program Clock Reference)) of the broadcast stream is streamed at the same time as the broadcast content is sent, or a little earlier. The basic principle is that the receiver has a sufficient amount of buffer to absorb delays and fluctuations in the communication content, delays the broadcast content, and synchronizes both by comparing the time stamps.

例えば、非特許文献２によれば、両受信機が同一機器内にある試作環境において１映像フレーム内（33ms）程度の精度で同期がとれることが確認できている。 For example, according to Non-Patent Document 2, it has been confirmed that synchronization can be achieved with an accuracy of about one video frame (33 ms) in a prototype environment in which both receivers are in the same device.

追加コンテンツを受信する機器は、IP網に無線接続されるスマートホンやタブレット型のパーソナルコンピュータといった放送コンテンツの受信機と独立な機器でもよい。そのような場合には、放送コンテンツ受信機は、追加コンテンツを受信する機器に対して、提示時刻情報（タイムスタンプ）を提供する必要がある。これは通常IP網を介して連携される。 The device that receives the additional content may be a device that is independent of the broadcast content receiver, such as a smart phone or a tablet personal computer that is wirelessly connected to the IP network. In such a case, the broadcast content receiver needs to provide presentation time information (time stamp) to a device that receives the additional content. This is usually linked via an IP network.

また、放送でなくともIP網などのネットワーク経由のみで複数コンテンツを配信して、協定世界時（UTC（Coordinated Universal Time））を基準クロックとしてタイムスタンプを付加し、受信機側で同期を行い、出力するシステムの実現も容易に想像できる。 In addition, even if it is not broadcast, multiple contents are distributed only via a network such as the IP network, a time stamp is added using Coordinated Universal Time (UTC) as a reference clock, and synchronization is performed on the receiver side. Realization of the output system can be easily imagined.

実際、上記のような放送通信連携サービスを独立した受信機で利用する場合、タイムスタンプの比較による方法では、以下の２つの要因により厳密な同期をとることが困難である。 In fact, when the above-mentioned broadcasting / communication cooperation service is used by an independent receiver, it is difficult for the method based on the time stamp comparison to achieve strict synchronization due to the following two factors.

まず、第１に、放送コンテンツ受信機と追加コンテンツ受信機は独立した電子機器である以上、システムクロックに差異があり、時間の経過とともに同期ずれが発生する。 First, since the broadcast content receiver and the additional content receiver are independent electronic devices, there is a difference in the system clock, and a synchronization shift occurs as time passes.

また、第２に、ユーザはテレビジョン受像機などの放送コンテンツ受信機からある程度距離をおき、スマートホンやタブレット型パーソナルコンピュータなどの追加コンテンツ受信機を手元に持ち、IP網経由で配信される追加コンテンツを楽しむという使用形態が想定される。この使用形態で放送コンテンツ、および追加コンテンツに音声信号が含まれる場合、ユーザの視聴位置で厳密な同期を取ることが困難になる。 Second, the user is at a certain distance from a broadcast content receiver such as a television receiver, has an additional content receiver such as a smartphone or tablet personal computer, and is distributed via the IP network. A usage pattern of enjoying content is assumed. When the audio content is included in the broadcast content and the additional content in this usage pattern, it becomes difficult to achieve exact synchronization at the viewing position of the user.

例えば、ユーザが放送コンテンツ受信機から10m離れている場合、放送コンテンツ受信機から出力された音声信号がユーザ位置に到達するには10(m)/340(m/s)＝約30(ms)の時間を要することになる。ここで、音速は約340(m/s)である。 For example, when the user is 10 m away from the broadcast content receiver, the audio signal output from the broadcast content receiver takes 10 (m) / 340 (m/s) = about 30 (ms) to reach the user's position. Here, the speed of sound is taken to be about 340 (m/s).
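The propagation delay in this example can be checked with a one-line calculation. The function name is hypothetical; the 340 m/s figure is the speed of sound stated above.

```python
SPEED_OF_SOUND_M_PER_S = 340.0  # value used in the text


def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel distance_m, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0
```

At 10 m this gives a little under 30 ms, which is comparable to the one-video-frame (33 ms) accuracy cited for Hybridcast, so the listening distance alone can consume most of that margin.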

また、第１の実施の形態では、放送コンテンツ受信機が出力する音声を追加コンテンツ受信機が収音し、音声同期用特徴量を計算して、IP網で配信されてくる放送コンテンツの音声同期用特徴量と同期計算を行う手法となっている。しかし、IP網の伝送遅延や、ゆらぎなどが大きい場合には、広範囲にわたり同期位置のサーチを行う必要があり、処理量が多くなってしまう。 In the first embodiment, the additional content receiver picks up the audio output from the broadcast content receiver, calculates the audio synchronization feature, and synchronizes the audio of the broadcast content distributed over the IP network. This is a technique for performing synchronous calculation with feature quantities. However, when the transmission delay or fluctuation of the IP network is large, it is necessary to search the synchronization position over a wide range, and the processing amount increases.

そこで、上述した提供装置とコンテンツ再生システムが、以下の特徴Ｂ１１乃至特徴Ｂ２０を有するようにすることで、異なる経路で取得した複数のコンテンツを、さらに少ない処理量で同期させることができるようになる。 Therefore, by providing the above-described providing apparatus and content reproduction system with the following features B11 to B20, a plurality of contents acquired through different paths can be synchronized with a smaller amount of processing. .

（特徴Ｂ１１） (Feature B11)
メディアコンテンツは映像、音声、画像、文字情報などを多重化したデータストリームとされている。 The media content is a data stream in which video, audio, images, character information, and the like are multiplexed.

なお、この（特徴Ｂ１１）のデータストリームの伝送としては、放送波、インターネットなどのネットワークにおけるメディアコンテンツの伝送を想定し、多重化データストリームが占有する論理伝送路を伝送路と呼ぶこととする。 As for the transmission of the data stream of (feature B11), the transmission of media content in a network such as a broadcast wave or the Internet is assumed, and the logical transmission path occupied by the multiplexed data stream is referred to as a transmission path.

（特徴Ｂ１２） (Feature B12)
伝送対象とする複数メディアコンテンツは時間同期関係を有する。 The plurality of media contents to be transmitted have a time synchronization relationship.

（特徴Ｂ１３） (Feature B13)
送出対象とする複数のメディアコンテンツのうち少なくとも１つをメインチャンネル信号と定め、残りの各メディアコンテンツをサブチャンネル信号とする。 At least one of the plurality of media contents to be transmitted is defined as a main channel signal, and each of the remaining media contents is defined as a subchannel signal.

（特徴Ｂ１４） (Feature B14)
基準時刻信号からメインチャンネル信号、およびサブチャンネル信号のそれぞれについて提示時刻情報（PTC）を生成する。 Presentation time information (PTC) is generated from the reference time signal for each of the main channel signal and the subchannel signal.

ここで、基準時刻信号は放送ストリームの基準クロック（PCR）または協定世界時（UTC）などが用いられる。 Here, the reference clock (PCR) of the broadcast stream, Coordinated Universal Time (UTC), or the like is used as the reference time signal.

（特徴Ｂ１５） (Feature B15)
メインチャンネル信号の提示時刻情報をメインチャンネル信号と多重化し、メイン送出信号を生成して伝送する。一方、メインチャンネル信号の音声信号から音声同期用特徴量も算出しておく。 The presentation time information of the main channel signal is multiplexed with the main channel signal, and a main transmission signal is generated and transmitted. In addition, an audio synchronization feature amount is calculated from the audio signal of the main channel signal.

（特徴Ｂ１６） (Feature B16)
メインチャンネル信号とサブチャンネル信号の時間同期関係が符合するようにし、システムが規定する伝送フォーマットにより、サブチャンネル信号の提示時刻情報とメインチャンネル信号の音声同期用特徴量とサブチャンネル信号の多重化処理を行い、サブ送出信号を生成する。 So that the time synchronization relationship between the main channel signal and the subchannel signal is preserved, the presentation time information of the subchannel signal, the audio synchronization feature amount of the main channel signal, and the subchannel signal are multiplexed in the transmission format specified by the system to generate a sub transmission signal.

（特徴Ｂ１７） (Feature B17)
メイン受信機器はメイン受信信号を取得して分離し、メインチャンネル信号の再生時において、その音声信号に基づく音声をスピーカなどにより出力する。同時にメイン受信機器は、受信したメインチャンネル信号の提示時刻情報を外部より参照したり、取得したりできるよう提示する。 The main reception device acquires and separates the main reception signal and, when reproducing the main channel signal, outputs sound based on its audio signal through a speaker or the like. At the same time, the main reception device presents the presentation time information of the received main channel signal so that it can be referenced or acquired from outside.

例えばメインチャンネル信号の提示時刻情報はソフトウェアのAPI（Application Programing Interface）によりその取得手段が提供され、無線通信によるIP網接続経由などで外部から参照できるようにしておく。 For example, the presentation time information of the main channel signal is provided by a software API (Application Programming Interface) so that it can be referred from the outside via an IP network connection by wireless communication.

（特徴Ｂ１８） (Feature B18)
サブ受信機器は、サブ受信信号を取得して分離し、受信したサブチャンネル信号の提示時刻情報とメイン受信機器から取得したメインチャンネル信号の提示時刻情報を比較し、提示時刻情報に基づく同期補正情報を生成する。 The sub reception device acquires and separates the sub reception signal, compares the presentation time information of the received subchannel signal with the presentation time information of the main channel signal acquired from the main reception device, and generates synchronization correction information based on the presentation time information.

（特徴Ｂ１９） (Feature B19)
サブ受信機器は、メイン受信機器がスピーカから出力したメインチャンネル信号の音声をマイクロホンなどにより収音して、音声同期用特徴量を計算し、（特徴Ｂ１８）で生成された提示時刻情報に基づく同期補正情報を考慮して、受信したメインチャンネル信号の音声同期用特徴量との自動同期計算を行い、音声特徴量に基づく同期補正情報（時間差情報）を算出する。 The sub reception device picks up, with a microphone or the like, the sound of the main channel signal output from the speaker by the main reception device, and calculates an audio synchronization feature amount. Taking into account the synchronization correction information based on the presentation time information generated in (Feature B18), it then performs automatic synchronization calculation against the received audio synchronization feature amount of the main channel signal, and calculates synchronization correction information (time difference information) based on the audio feature amount.

提示時刻情報の比較で得られる提示時刻情報に基づく同期補正情報から、おおまかな同期位置が分かるので、後段の音声同期用特徴量による自動同期計算処理に要する処理量も少なくて済む。 Since the approximate synchronization position can be found from the synchronization correction information based on the presentation time information obtained by comparing the presentation time information, the amount of processing required for the automatic synchronization calculation processing using the subsequent audio synchronization feature amount can be reduced.

（特徴Ｂ２０） (Feature B20)
上記音声特徴量に基づく同期補正情報に基づき、サブ受信機器は受信したサブチャンネル信号に対してメインチャンネル信号との同期補正処理を行い再生する。 Using the synchronization correction information based on the audio feature amount, the sub reception device applies synchronization correction relative to the main channel signal to the received subchannel signal and reproduces it.

〈提供装置の構成例〉 <Configuration example of the providing apparatus>
次に、以上において説明した特徴Ｂ１１乃至特徴Ｂ２０を有する提供装置とコンテンツ再生システムの具体的な実施の形態について説明する。 Next, specific embodiments of the providing apparatus and the content reproduction system having the features B11 to B20 described above will be described.

図２２は、上述した（Ａ１）乃至（Ａ４）に示した例のように時間同期関係を有するコンテンツを提供する提供装置の構成例を示す図である。なお、図２２において、図１における場合と対応する部分には同一の符号を付してあり、その説明は適宜省略する。 FIG. 22 is a diagram illustrating a configuration example of a providing apparatus that provides content having a time synchronization relationship as in the examples illustrated in (A1) to (A4) described above. In FIG. 22, parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

提供装置３０１は、基準時刻信号生成部３１１、多重化処理部３１２、出力部２２、音声同期用特徴量計算部２３、多重化処理部２４、および出力部２５を有している。 The providing apparatus 301 includes a reference time signal generation unit 311, a multiplexing processing unit 312, an output unit 22, a voice synchronization feature value calculation unit 23, a multiplexing processing unit 24, and an output unit 25.

提供装置３０１の構成は、提供装置１１の変換部２１が多重化処理部３１２に置き換えられ、さらに新たに基準時刻信号生成部３１１が設けられている点で、提供装置１１の構成と異なっている。 The configuration of the providing apparatus 301 differs from that of the providing apparatus 11 in that the conversion unit 21 of the providing apparatus 11 is replaced with the multiplexing processing unit 312 and the reference time signal generation unit 311 is newly provided.

基準時刻信号生成部３１１は、PCRやUTCに基づいて、メインチャンネル信号とサブチャンネル信号のコンテンツ提示のタイミングを示す提示時刻情報を生成し、多重化処理部３１２および多重化処理部２４に供給する。例えば、提示時刻情報はPTSなどとされ、この提示時刻情報は再生側において、メインチャンネル信号とサブチャンネル信号の同期をとるために利用される。 The reference time signal generation unit 311 generates, on the basis of PCR, UTC, or the like, presentation time information indicating the content presentation timing of the main channel signal and the subchannel signal, and supplies it to the multiplexing processing unit 312 and the multiplexing processing unit 24. For example, the presentation time information is a PTS or the like, and it is used on the reproduction side to synchronize the main channel signal and the subchannel signal.
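As a minimal sketch of how presentation time information might be derived from a reference time, the following assumes an MPEG-2 Systems style PTS: a 33-bit counter running at 90 kHz. The description only says "PTS or the like", so the tick rate, the wraparound width, and the function names `pts_from_seconds` and `pts_diff` are assumptions made for illustration.

```python
PTS_CLOCK_HZ = 90_000   # assumption: 90 kHz PTS clock as in MPEG-2 Systems
PTS_WRAP = 1 << 33      # assumption: 33-bit PTS counter


def pts_from_seconds(elapsed_s: float) -> int:
    """Convert elapsed reference time (seconds) to PTS ticks, with wraparound."""
    return round(elapsed_s * PTS_CLOCK_HZ) % PTS_WRAP


def pts_diff(a: int, b: int) -> int:
    """Signed difference a - b in ticks, choosing the shorter way around the wrap."""
    d = (a - b) % PTS_WRAP
    return d - PTS_WRAP if d >= PTS_WRAP // 2 else d


# The same reference time yields the same PTS for the main and sub channels,
# which is what lets the reproduction side compare them later.
main_pts = pts_from_seconds(12.5)
sub_pts = pts_from_seconds(12.5)
print(pts_diff(main_pts, sub_pts))  # 0
```

The signed, wrap-aware difference matters because a receiver comparing two PTS values near the counter rollover would otherwise see a spurious offset of almost the full counter range.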

多重化処理部３１２は、供給されたメインチャンネル信号を、所定の放送規格などで定められたフォーマットに変換する。また、多重化処理部３１２は、フォーマット変換されたメインチャンネル信号と、基準時刻信号生成部３１１から供給された提示時刻情報とを多重化することでメイン送出信号を生成し、出力部２２に供給する。メイン送出信号に含まれている提示時刻情報は、メインチャンネル信号の提示時刻情報である。 The multiplexing processing unit 312 converts the supplied main channel signal into a format defined by a predetermined broadcast standard or the like. The multiplexing processing unit 312 also generates a main transmission signal by multiplexing the format-converted main channel signal with the presentation time information supplied from the reference time signal generation unit 311, and supplies it to the output unit 22. The presentation time information included in the main transmission signal is the presentation time information of the main channel signal.

また、多重化処理部２４は、時間的に同期がとれた状態で、音声同期用特徴量計算部２３から供給された音声同期用特徴量、供給されたサブチャンネル信号、および基準時刻信号生成部３１１から供給された提示時刻情報を多重化した後、必要に応じてフォーマット変換を行ってサブ送出信号を生成する。多重化処理部２４は、得られたサブ送出信号を出力部２５に供給する。サブ送出信号に含まれている提示時刻情報は、サブチャンネル信号の提示時刻情報である。 The multiplexing processing unit 24 multiplexes, in a time-synchronized state, the audio synchronization feature amount supplied from the audio synchronization feature amount calculation unit 23, the supplied subchannel signal, and the presentation time information supplied from the reference time signal generation unit 311, and then performs format conversion as necessary to generate a sub transmission signal. The multiplexing processing unit 24 supplies the obtained sub transmission signal to the output unit 25. The presentation time information included in the sub transmission signal is the presentation time information of the subchannel signal.

なお、提供装置１１における場合と同様に、多重化処理部２４がメインチャンネル信号を用いて、音声同期用特徴量、サブチャンネル信号、および提示時刻情報の時間同期関係を調整してもよい。 Similarly to the case of the providing apparatus 11, the multiplexing processing unit 24 may adjust the time synchronization relationship between the audio synchronization feature quantity, the subchannel signal, and the presentation time information using the main channel signal.

〈コンテンツ再生システムの構成例〉 <Configuration example of the content reproduction system>
また、提供装置３０１から送信されるメイン送出信号とサブ送出信号を、それぞれメイン受信信号およびサブ受信信号として受信してメインコンテンツとサブコンテンツを再生するコンテンツ再生システムは、例えば図２３に示すように構成される。なお、図２３において、図４における場合と対応する部分には同一の符号を付してあり、その説明は適宜省略する。 A content reproduction system that receives the main transmission signal and the sub transmission signal transmitted from the providing apparatus 301 as a main reception signal and a sub reception signal, respectively, and reproduces the main content and the sub content is configured, for example, as shown in FIG. 23. In FIG. 23, parts corresponding to those in FIG. 4 are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

図２３に示すコンテンツ再生システムは、メイン受信機器３４１、表示部８２、スピーカ８３、マイクロホン８４、サブ受信機器３４２、表示部８６、およびスピーカ８７を有している。 The content reproduction system shown in FIG. 23 includes a main reception device 341, a display unit 82, a speaker 83, a microphone 84, a sub reception device 342, a display unit 86, and a speaker 87.

メイン受信機器３４１は、提供装置３０１から送信されたメイン受信信号を受信し、メイン受信信号から得られるメインコンテンツの再生を制御する。 The main reception device 341 receives the main reception signal transmitted from the providing apparatus 301 and controls the reproduction of the main content obtained from the main reception signal.

メイン受信機器３４１は、入力部１１１、分離処理部３５１、提示部３５２、および再生処理部１１２を備えている。このメイン受信機器３４１の構成は、新たに分離処理部３５１と提示部３５２が設けられている点で、メイン受信機器８１の構成と異なる。 The main receiving device 341 includes an input unit 111, a separation processing unit 351, a presentation unit 352, and a reproduction processing unit 112. The configuration of the main reception device 341 is different from the configuration of the main reception device 81 in that a separation processing unit 351 and a presentation unit 352 are newly provided.

分離処理部３５１は、入力部１１１から供給されたメイン受信信号を、メインチャンネル信号と、そのメインチャンネル信号の提示時刻情報とに分離し、メインチャンネル信号を再生処理部１１２に供給するとともに、提示時刻情報を提示部３５２に供給する。 The separation processing unit 351 separates the main reception signal supplied from the input unit 111 into a main channel signal and the presentation time information of that main channel signal, supplies the main channel signal to the reproduction processing unit 112, and supplies the presentation time information to the presentation unit 352.

提示部３５２は、分離処理部３５１から供給された提示時刻情報を、インターネットなどの有線の通信網や、無線通信網を介してサブ受信機器３４２に提示する。すなわち、通信相手からの要求に応じて提示時刻情報が送信される。 The presentation unit 352 presents the presentation time information supplied from the separation processing unit 351 to the sub reception device 342 via a wired communication network such as the Internet or a wireless communication network. That is, the presentation time information is transmitted in response to a request from the communication partner.

また、サブ受信機器３４２は、提供装置３０１から送信されたサブ送出信号を、サブ受信信号として受信し、サブ受信信号から得られるサブコンテンツの再生を制御する。 Also, the sub reception device 342 receives the sub transmission signal transmitted from the providing apparatus 301 as a sub reception signal, and controls the reproduction of the sub content obtained from the sub reception signal.

サブ受信機器３４２は、取得部３６１、提示時刻情報比較部３６２、音声同期用特徴量計算部１２１、バッファ１２２、入力部１２３、分離処理部１２４、バッファ１２５、同期計算部１２６、および再生処理部１２７を備えている。 The sub reception device 342 includes an acquisition unit 361, a presentation time information comparison unit 362, an audio synchronization feature amount calculation unit 121, a buffer 122, an input unit 123, a separation processing unit 124, a buffer 125, a synchronization calculation unit 126, and a reproduction processing unit 127.

サブ受信機器３４２の構成は、新たに取得部３６１、および提示時刻情報比較部３６２が設けられている点で、サブ受信機器８５の構成と異なる。 The configuration of the sub reception device 342 is different from the configuration of the sub reception device 85 in that an acquisition unit 361 and a presentation time information comparison unit 362 are newly provided.

取得部３６１は、APIなどを利用して、提示部３５２により提示された提示時刻情報を、有線または無線の通信網を介して取得し、提示時刻情報比較部３６２に供給する。すなわち、取得部３６１は、提示部３５２により送信された提示時刻情報を受信する。 The acquisition unit 361 acquires the presentation time information presented by the presentation unit 352 through a wired or wireless communication network using an API or the like, and supplies the presentation time information to the presentation time information comparison unit 362. That is, the acquisition unit 361 receives the presentation time information transmitted from the presentation unit 352.

分離処理部１２４は、入力部１２３から供給されたサブ受信信号を、音声同期用特徴量、サブチャンネル信号、および提示時刻情報に分離させ、提示時刻情報を提示時刻情報比較部３６２に供給するとともに、音声同期用特徴量およびサブチャンネル信号をバッファ１２５に供給する。 The separation processing unit 124 separates the sub reception signal supplied from the input unit 123 into the audio synchronization feature amount, the subchannel signal, and the presentation time information, and supplies the presentation time information to the presentation time information comparison unit 362. The audio synchronization feature amount and the subchannel signal are supplied to the buffer 125.

提示時刻情報比較部３６２は、分離処理部１２４から供給された提示時刻情報と、取得部３６１から供給された提示時刻情報とを比較して、メインチャンネル信号とサブチャンネル信号とを同期させるための提示時刻情報に基づく同期補正情報を生成し、同期計算部１２６に供給する。 The presentation time information comparison unit 362 compares the presentation time information supplied from the separation processing unit 124 with the presentation time information supplied from the acquisition unit 361, and synchronizes the main channel signal and the sub-channel signal. Synchronization correction information based on the presentation time information is generated and supplied to the synchronization calculator 126.

この提示時刻情報に基づく同期補正情報は、それ自体でメインチャンネル信号とサブチャンネル信号とのずれを補正し、同期させることができるものである。しかし、この例では、より高精度にそれらの信号を同期させるため、提示時刻情報に基づく同期補正情報は、同期計算部１２６において、バッファ１２５から読み出す音声同期用特徴量の範囲を定めるために用いられる。換言すれば、バッファ１２５に記録されている音声同期用特徴量と、バッファ１２２に記録されている音声同期用特徴量との大まかな同期をとるために利用される。このように、提示時刻情報に基づく同期補正情報を用いることで、より少ない処理量で音声同期用特徴量のマッチング処理を行うことができるようになる。 The synchronization correction information based on the presentation time information can, by itself, correct the deviation between the main channel signal and the subchannel signal and synchronize them. In this example, however, in order to synchronize the signals with higher accuracy, the synchronization correction information based on the presentation time information is used by the synchronization calculation unit 126 to determine the range of audio synchronization feature amounts to be read from the buffer 125. In other words, it is used to roughly synchronize the audio synchronization feature amounts recorded in the buffer 125 with those recorded in the buffer 122. By using the synchronization correction information based on the presentation time information in this way, the matching processing of the audio synchronization feature amounts can be performed with a smaller amount of processing.

〈送信処理の説明〉 <Description of transmission processing>
続いて、以上において説明した提供装置３０１とコンテンツ再生システムの具体的な動作について説明する。 Next, specific operations of the providing apparatus 301 and the content reproduction system described above will be described.

まず、図２４のフローチャートを参照して、提供装置３０１により行われる送信処理について説明する。 First, the transmission processing performed by the providing apparatus 301 will be described with reference to the flowchart of FIG. 24.

ステップＳ１９１において、基準時刻信号生成部３１１は、メインチャンネル信号とサブチャンネル信号の提示時刻情報を生成し、多重化処理部３１２および多重化処理部２４に供給する。 In step S 191, the reference time signal generation unit 311 generates presentation time information of the main channel signal and the subchannel signal, and supplies the presentation time information to the multiplexing processing unit 312 and the multiplexing processing unit 24.

ステップＳ１９２において、音声同期用特徴量計算部２３は、音声同期用特徴量算出処理を行って、供給されたメインチャンネル信号を構成する音声信号から、音声同期用特徴量を計算し、多重化処理部２４に供給する。なお、ステップＳ１９２において行われる音声同期用特徴量算出処理は、図１４を参照して説明した音声同期用特徴量算出処理と同様であるので、その説明は省略する。 In step S192, the audio synchronization feature amount calculation unit 23 performs audio synchronization feature amount calculation processing, calculates an audio synchronization feature amount from the audio signal constituting the supplied main channel signal, and supplies it to the multiplexing processing unit 24. The audio synchronization feature amount calculation processing performed in step S192 is the same as that described with reference to FIG. 14, and its description is therefore omitted.

ステップＳ１９３において、多重化処理部３１２は、供給されたメインチャンネル信号と、基準時刻信号生成部３１１から供給された提示時刻情報とを多重化することでメイン送出信号を生成し、出力部２２に供給する。また、このとき多重化処理部３１２は、必要に応じて、メインチャンネル信号のフォーマット変換を行う。 In step S193, the multiplexing processing unit 312 generates a main transmission signal by multiplexing the supplied main channel signal with the presentation time information supplied from the reference time signal generation unit 311, and supplies it to the output unit 22. At this time, the multiplexing processing unit 312 also performs format conversion of the main channel signal as necessary.

ステップＳ１９４において、出力部２２は、多重化処理部３１２から供給されたメイン送出信号を送信する。 In step S194, the output unit 22 transmits the main transmission signal supplied from the multiplexing processing unit 312.

ステップＳ１９５において、多重化処理部２４は、音声同期用特徴量、サブチャンネル信号、および提示時刻情報を多重化してサブ送出信号を生成し、出力部２５に供給する。 In step S 195, the multiplexing processing unit 24 multiplexes the audio synchronization feature quantity, the subchannel signal, and the presentation time information to generate a sub transmission signal, and supplies the sub transmission signal to the output unit 25.

すなわち、多重化処理部２４は、音声同期用特徴量計算部２３からの音声同期用特徴量、供給されたサブチャンネル信号、および基準時刻信号生成部３１１から供給された提示時刻情報を多重化してサブ送出信号とする。 That is, the multiplexing processing unit 24 multiplexes the audio synchronization feature amount from the audio synchronization feature amount calculation unit 23, the supplied subchannel signal, and the presentation time information supplied from the reference time signal generation unit 311 to form the sub transmission signal.

ステップＳ１９６において、出力部２５は、多重化処理部２４から供給されたサブ送出信号を送信し、送信処理は終了する。 In step S196, the output unit 25 transmits the sub transmission signal supplied from the multiplexing processing unit 24, and the transmission process ends.

以上のようにして、提供装置３０１は、メインチャンネル信号とサブチャンネル信号とで共通して用いられる提示時刻情報を生成し、提示時刻情報が含まれるメイン送出信号とサブ送出信号を生成する。 As described above, the providing apparatus 301 generates presentation time information that is used in common by the main channel signal and the sub channel signal, and generates a main transmission signal and a sub transmission signal that include the presentation time information.

これにより、コンテンツの再生側において、提示時刻情報を利用して、より少ない処理量で、メインコンテンツとサブコンテンツを同期させることができるようになる。 As a result, on the content reproduction side, the main content and the sub-content can be synchronized with a smaller amount of processing using the presentation time information.
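The transmission processing of steps S191 to S196 can be sketched as follows. This is a simplified illustration, not the transmission format of the description: the record types `MainUnit` and `SubUnit`, the function `build_transmission_signals`, the per-frame tick step `FRAME_TICKS`, and the use of a single integer as the audio synchronization feature amount are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

FRAME_TICKS = 3000  # hypothetical: 90 kHz clock / 30 frames per second


@dataclass(frozen=True)
class MainUnit:
    """One multiplexed unit of the main transmission signal."""
    pts: int            # presentation time information of the main channel
    main_frame: bytes   # main channel payload


@dataclass(frozen=True)
class SubUnit:
    """One multiplexed unit of the sub transmission signal."""
    pts: int            # presentation time information of the subchannel
    sync_feature: int   # feature computed from the MAIN channel audio
    sub_frame: bytes    # subchannel payload


def build_transmission_signals(
    main_frames: Sequence[bytes],
    sub_frames: Sequence[bytes],
    sync_features: Sequence[int],
    start_pts: int = 0,
) -> Tuple[List[MainUnit], List[SubUnit]]:
    """Stamp time-aligned main/sub frames with a common presentation time."""
    if not (len(main_frames) == len(sub_frames) == len(sync_features)):
        raise ValueError("inputs must be time-aligned and equally long")
    main_signal: List[MainUnit] = []
    sub_signal: List[SubUnit] = []
    for i, (m, s, f) in enumerate(zip(main_frames, sub_frames, sync_features)):
        pts = start_pts + i * FRAME_TICKS   # shared reference time
        main_signal.append(MainUnit(pts, m))
        sub_signal.append(SubUnit(pts, f, s))
    return main_signal, sub_signal
```

The point of the sketch is that the sub transmission signal carries the feature amount of the main channel audio alongside the subchannel payload, and that both signals are stamped from one reference time, so frames meant to be presented together carry equal presentation times.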

〈メインコンテンツ再生処理の説明〉 <Description of main content reproduction processing>
また、提供装置３０１からメイン送出信号が送信されると、コンテンツ再生システムは、そのメイン送出信号を、メイン受信信号として取得して、メインコンテンツを再生する。以下、図２５のフローチャートを参照して、コンテンツ再生システムによるメインコンテンツ再生処理について説明する。 When the main transmission signal is transmitted from the providing apparatus 301, the content reproduction system acquires that main transmission signal as a main reception signal and reproduces the main content. The main content reproduction processing by the content reproduction system will be described below with reference to the flowchart of FIG. 25.

ステップＳ２２１において、入力部１１１は、メイン受信信号を取得して分離処理部３５１に供給する。例えば入力部１１１は、提供装置３０１から送信されたメイン受信信号を受信することで、メイン受信信号を取得する。 In step S221, the input unit 111 acquires the main reception signal and supplies it to the separation processing unit 351. For example, the input unit 111 acquires the main reception signal by receiving the main reception signal transmitted from the providing apparatus 301.

ステップＳ２２２において、分離処理部３５１は、入力部１１１から供給されたメイン受信信号を、メインチャンネル信号と提示時刻情報とに分離する。分離処理部３５１は、分離されたメインチャンネル信号を再生処理部１１２に供給するとともに、提示時刻情報を提示部３５２に供給する。 In step S222, the separation processing unit 351 separates the main reception signal supplied from the input unit 111 into a main channel signal and presentation time information. The separation processing unit 351 supplies the separated main channel signal to the reproduction processing unit 112 and supplies the presentation time information to the presentation unit 352.

ステップＳ２２３において、再生処理部１１２は、分離処理部３５１から供給されたメインチャンネル信号に基づいてメインコンテンツを再生させる。なお、ステップＳ２２３では、図１５のステップＳ７２の処理と同様の処理が行われる。 In step S223, the reproduction processing unit 112 reproduces the main content based on the main channel signal supplied from the separation processing unit 351. In step S223, the same processing as that in step S72 in FIG. 15 is performed.

ステップＳ２２４において、提示部３５２は、分離処理部３５１から供給された提示時刻情報を提示して、メインコンテンツ再生処理は終了する。例えば、提示時刻情報は、メインコンテンツの再生と同期した状態で、無線等によりサブ受信機器３４２に送信される。 In step S224, the presentation unit 352 presents the presentation time information supplied from the separation processing unit 351, and the main content reproduction process ends. For example, the presentation time information is transmitted to the sub reception device 342 by wireless or the like in a state synchronized with the reproduction of the main content.

以上のようにして、コンテンツ再生システムは、メイン受信信号を取得してメインコンテンツを再生するとともに、メインコンテンツ、すなわちメインチャンネル信号の提示時刻情報の提示を行う。 As described above, the content reproduction system obtains the main reception signal, reproduces the main content, and presents the presentation time information of the main content, that is, the main channel signal.

このようにメインコンテンツの再生とともに、そのメインコンテンツの提示時刻情報を提示することで、その提示時刻情報を取得するサブ受信機器３４２は、より少ない処理量で、音声同期用特徴量を用いた同期計算を行うことができるようになる。 By presenting the presentation time information of the main content together with the reproduction of the main content in this way, the sub reception device 342 that acquires the presentation time information can perform the synchronization calculation using the audio synchronization feature amount with a smaller amount of processing.

〈サブコンテンツ再生処理の説明〉 <Description of sub content reproduction processing>
また、メインコンテンツの再生と同期して、コンテンツ再生システムは、サブ受信信号を取得して、サブコンテンツを再生する。以下、図２６のフローチャートを参照して、コンテンツ再生システムによるサブコンテンツ再生処理について説明する。 In synchronization with the reproduction of the main content, the content reproduction system acquires the sub reception signal and reproduces the sub content. The sub content reproduction processing by the content reproduction system will be described below with reference to the flowchart of FIG. 26.

なお、ステップＳ２５１の処理は、図１６のステップＳ１０１の処理と同様であるので、その説明は省略する。 Note that the processing in step S251 is the same as the processing in step S101 in FIG. 16, and its description is therefore omitted.

ステップＳ２５２において、分離処理部１２４は、入力部１２３から供給されたサブ受信信号を、サブチャンネル信号、音声同期用特徴量、および提示時刻情報に分離させる。そして分離処理部１２４は、サブチャンネル信号と音声同期用特徴量をバッファ１２５に供給して記録させるとともに、サブチャンネル信号の提示時刻情報を提示時刻情報比較部３６２に供給する。 In step S252, the separation processing unit 124 separates the sub reception signal supplied from the input unit 123 into a subchannel signal, a voice synchronization feature amount, and presentation time information. The separation processing unit 124 supplies the subchannel signal and the audio synchronization feature amount to the buffer 125 for recording, and supplies the presentation time information of the subchannel signal to the presentation time information comparison unit 362.

ステップＳ２５３において、取得部３６１は、提示部３５２により送信された提示時刻情報を受信することで、メインチャンネル信号の提示時刻情報を取得し、提示時刻情報比較部３６２に供給する。 In step S253, the acquisition unit 361 acquires the presentation time information of the main channel signal by receiving the presentation time information transmitted from the presentation unit 352, and supplies it to the presentation time information comparison unit 362.

ステップＳ２５４において、提示時刻情報比較部３６２は、分離処理部１２４から供給された提示時刻情報と、取得部３６１から供給された提示時刻情報とを比較して提示時刻情報に基づく同期補正情報を生成し、同期計算部１２６に供給する。 In step S254, the presentation time information comparison unit 362 compares the presentation time information supplied from the separation processing unit 124 with the presentation time information supplied from the acquisition unit 361, generates synchronization correction information based on the presentation time information, and supplies it to the synchronization calculation unit 126.

例えば提示時刻情報に基づく同期補正情報は、バッファ１２５に時系列に並べられて記録されている各時刻の音声同期用特徴量の系列のうち、同期計算部１２６での同期計算の対象とされる範囲（以下、探索範囲とも称する）を示す情報とされる。 For example, the synchronization correction information based on the presentation time information is information indicating which range of the sequence of audio synchronization feature amounts at each time, recorded in time series in the buffer 125, is to be the target of the synchronization calculation in the synchronization calculation unit 126 (hereinafter also referred to as the search range).

この探索範囲は、現時点において再生されているメインコンテンツ、つまり取得部３６１により取得された最新の提示時刻情報と同じ時刻を示しているサブチャンネル信号の提示時刻情報に対応付けられている音声同期用特徴量を含む、所定長の音声同期用特徴量系列とされる。 This search range is an audio synchronization feature amount sequence of a predetermined length that includes the audio synchronization feature amount associated with the subchannel presentation time information indicating the same time as the main content currently being reproduced, that is, the same time as the latest presentation time information acquired by the acquisition unit 361.

提示時刻が同じであるメインチャンネル信号とサブチャンネル信号の位置は、互いに同期する信号位置、つまり同時に再生すべき再生位置（フレーム位置）である。したがって、提示時刻情報を比較して、メインチャンネル信号と同じ提示時刻情報を有するサブチャンネル信号の位置を検出することで、再生中のメインコンテンツと大まかに同期がとれたサブコンテンツの再生位置を特定することができる。 The positions of the main channel signal and the subchannel signal having the same presentation time are signal positions that are synchronized with each other, that is, reproduction positions (frame positions) to be reproduced simultaneously. Therefore, by comparing the presentation time information and detecting the position of the subchannel signal having the same presentation time information as the main channel signal, the reproduction position of the sub content that is roughly synchronized with the main content being reproduced can be specified.
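The derivation of the search range from the presentation time information can be sketched as below. The sketch assumes that the buffered subchannel presentation times are integer ticks sorted in ascending order and that the "predetermined length" is expressed as a number of entries on each side of the matching position; the function name `search_range` and the parameter `half_width` are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence, Tuple


def search_range(sub_pts: Sequence[int], main_pts: int,
                 half_width: int) -> Tuple[int, int]:
    """Return the half-open index range [start, end) of buffered features
    to be examined by the synchronization calculation.

    sub_pts    -- presentation times of the buffered feature amounts,
                  sorted in ascending order
    main_pts   -- latest presentation time obtained from the main receiver
    half_width -- hypothetical: number of entries kept on each side
    """
    if not sub_pts:
        raise ValueError("no buffered feature amounts")
    # First buffered entry whose presentation time is >= main_pts,
    # clamped to the last entry when main_pts is beyond the buffer.
    center = min(bisect_left(sub_pts, main_pts), len(sub_pts) - 1)
    start = max(0, center - half_width)
    end = min(len(sub_pts), center + half_width + 1)
    return start, end


# Ten buffered entries spaced 3000 ticks apart; the main receiver reports 15000.
buffered = [i * 3000 for i in range(10)]
print(search_range(buffered, 15000, 2))  # (3, 8)
```

Because the buffer is ordered by time, a binary search locates the coarse position in logarithmic time, and only a short window around it is passed on to the feature matching.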

提示時刻情報が比較されて提示時刻情報に基づく同期補正情報が生成されると、その後、ステップＳ２５５およびステップＳ２５６の処理が行われるが、これらの処理は図１６のステップＳ１０３およびステップＳ１０４の処理と同様であるので、その説明は省略する。なお、これらの処理では、メインコンテンツの音声が収音され、その音声から音声同期用特徴量が算出される。 When the presentation time information is compared and the synchronization correction information based on the presentation time information is generated, the processes of step S255 and step S256 are performed. These processes are the same as the processes of step S103 and step S104 of FIG. Since it is the same, the description is omitted. In these processes, the audio of the main content is collected, and the audio synchronization feature amount is calculated from the audio.

ステップＳ２５７において、同期計算部１２６は、同期補正情報生成処理を行って、音声特徴量に基づく同期補正情報を生成し、再生処理部１２７に供給する。なお、同期補正情報生成処理の詳細は後述するが、この処理では、提示時刻情報に基づく同期補正情報が用いられて、バッファ１２２に記録されている音声同期用特徴量と、バッファ１２５に記録されている音声同期用特徴量とが比較され、音声特徴量に基づく同期補正情報が生成される。 In step S257, the synchronization calculation unit 126 performs synchronization correction information generation processing, generates synchronization correction information based on the audio feature amount, and supplies it to the reproduction processing unit 127. The details of the synchronization correction information generation processing will be described later; in this processing, the synchronization correction information based on the presentation time information is used, the audio synchronization feature amounts recorded in the buffer 122 are compared with those recorded in the buffer 125, and synchronization correction information based on the audio feature amount is generated.

ステップＳ２５８において、再生処理部１２７は、同期計算部１２６から供給された音声特徴量に基づく同期補正情報に基づいて、バッファ１２５に記録されているサブチャンネル信号の再生タイミングを補正し、補正後のサブチャンネル信号に基づいてサブコンテンツを再生させる。ステップＳ２５８では、図１６のステップＳ１０６と同様の処理が行われる。 In step S258, the reproduction processing unit 127 corrects the reproduction timing of the subchannel signal recorded in the buffer 125 in accordance with the synchronization correction information based on the audio feature amount supplied from the synchronization calculation unit 126, and reproduces the sub content on the basis of the corrected subchannel signal. In step S258, processing similar to that in step S106 in FIG. 16 is performed.

以上のようにして、コンテンツ再生システムは、メインコンテンツの提示時刻情報を取得して、サブ受信信号に含まれているサブコンテンツの提示時刻情報と比較することで、提示時刻情報に基づく同期補正情報を生成する。そして、コンテンツ再生システムは、提示時刻情報に基づく同期補正情報により示される探索範囲に含まれる音声同期用特徴量を対象としてマッチング処理を行い、音声特徴量に基づく同期補正情報を算出する。 As described above, the content reproduction system acquires the presentation time information of the main content and compares it with the presentation time information of the sub content included in the sub reception signal, thereby generating synchronization correction information based on the presentation time information. The content reproduction system then performs matching processing on the audio synchronization feature amounts included in the search range indicated by the synchronization correction information based on the presentation time information, and calculates synchronization correction information based on the audio feature amount.

これにより、メインコンテンツとサブコンテンツとの伝送経路が異なる場合であっても、より少ない処理量で同期補正情報を算出し、それらのコンテンツを同期して再生させることができる。 Thereby, even when the transmission paths of the main content and the sub-content are different, the synchronization correction information can be calculated with a smaller processing amount, and the contents can be reproduced in synchronization.
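The two-stage approach summarized above, a coarse range from presentation time information followed by feature matching inside that range, can be illustrated with a deliberately simplified matcher. The similarity measure used here (counting common set bits between per-frame bitmask features) is a stand-in chosen for brevity and is not the similarity calculation of the description; the names `similarity` and `best_match` are hypothetical.

```python
from typing import Sequence, Tuple


def similarity(a: Sequence[int], b: Sequence[int]) -> int:
    """Count bits set in both sequences, frame by frame (stand-in measure)."""
    return sum(bin(x & y).count("1") for x, y in zip(a, b))


def best_match(mic: Sequence[int], sub: Sequence[int],
               start: int, end: int) -> Tuple[int, int]:
    """Slide the microphone-derived features over sub[start:end] only.

    Returns (position, score) of the best alignment, where position is an
    index into `sub`. Restricting the loop to [start, end) is what keeps
    the processing amount small.
    """
    n = len(mic)
    last = min(end, len(sub) - n + 1)   # the window must fit inside `sub`
    best_pos, best_score = -1, -1
    for pos in range(start, last):
        score = similarity(mic, sub[pos:pos + n])
        if score > best_score:
            best_pos, best_score = pos, score
    if best_pos < 0:
        raise ValueError("search range too short for the given features")
    return best_pos, best_score


# Features received in the sub reception signal (one bitmask per frame).
received = [0b0001, 0b0010, 0b0100, 0b1000, 0b1100, 0b0011, 0b0101, 0b1010]
# Features computed from the sound picked up by the microphone.
picked_up = [0b1100, 0b0011, 0b0101]
print(best_match(picked_up, received, 2, 8))  # (4, 6)
```

With a search range of six frames the loop evaluates four candidate alignments instead of every alignment in the buffer; the difference between the matched position and the position implied by the presentation time information corresponds to the fine time difference to be corrected.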

すなわち、コンテンツ再生システムでは、サブ受信機器３４２は、まず提示時刻情報によりメインチャンネル信号とサブチャンネル信号の大まかな同期をとり、さらにメインコンテンツの音声を収音して得られた音声信号から音声同期用特徴量を計算する。 That is, in the content reproduction system, the sub-receiving device 342 first synchronizes the main channel signal and the sub-channel signal roughly according to the presentation time information, and further uses the audio signal obtained by collecting the audio of the main content for audio synchronization. Calculate features.

そして、サブ受信機器３４２は、受信したメインチャンネル信号の音声同期用特徴量との自動同期計算を行うことで、サブ受信機器３４２により再生されるサブコンテンツを視聴するユーザの視聴位置での高精度なコンテンツ同期が可能となる。実際に、提示時刻情報が用いられておおよその同期位置の範囲が絞られているので、音声同期用特徴量による自動同期計算処理に要する処理量も少なくて済む。 Then, by performing automatic synchronization calculation against the received audio synchronization feature amount of the main channel signal, the sub reception device 342 can achieve highly accurate content synchronization at the viewing position of the user who views the sub content reproduced by the sub reception device 342. Moreover, since the presentation time information has already narrowed down the range of the approximate synchronization position, the processing amount required for the automatic synchronization calculation using the audio synchronization feature amount is also small.

例えば、コンテンツ再生システムにおいて表示部８２およびスピーカ８３と、表示部８６およびスピーカ８７とが離れた位置に配置されており、ユーザが表示部８６およびスピーカ８７の近傍でコンテンツを視聴しているとする。そのような場合、スピーカ８３から出力された音声がユーザの視聴位置に到達するまでには、ある程度の時間を要する。 For example, in the content reproduction system, it is assumed that the display unit 82 and the speaker 83, the display unit 86 and the speaker 87 are arranged at positions away from each other, and the user is viewing the content near the display unit 86 and the speaker 87. . In such a case, it takes some time for the sound output from the speaker 83 to reach the viewing position of the user.

したがって、そのような場合には、提示時刻情報を比較するだけでは、ユーザの視聴位置において、メインコンテンツとサブコンテンツの再生を高精度に同期させることは困難である。すなわち、例えばほぼ同じ時刻でスピーカ８３とスピーカ８７とで、メインコンテンツの音声と、サブコンテンツの音声とがそれぞれ再生されることになるので、メインコンテンツの音声がユーザに到達するまでに時間がかかってしまうと、ユーザには、メインコンテンツの音声とサブコンテンツの音声とがずれて聞こえてしまうことになる。 Therefore, in such a case, it is difficult to synchronize the reproduction of the main content and the sub content with high accuracy at the user's viewing position merely by comparing the presentation time information. That is, the sound of the main content and the sound of the sub content are reproduced by the speaker 83 and the speaker 87, respectively, at approximately the same time, so if the sound of the main content takes time to reach the user, the user hears the sound of the main content and the sound of the sub content out of sync.

In contrast, in the content reproduction system to which the present technology is applied, the microphone 84, which is connected to the sub receiving device 342 and placed near it, collects the sound of the main content, and the synchronization calculation is performed on that sound. The content reproduction system can therefore reproduce the main content and the sub content in a state of synchronization at the user's viewing position. Moreover, the content reproduction system compares the presentation time information to generate synchronization correction information based on the presentation time information and uses it to limit the search range of the matching processing, so that the content can be synchronized with a smaller amount of processing.

<Description of synchronization correction information generation processing>
Next, the synchronization correction information generation processing corresponding to the processing of step S257 of FIG. 26 will be described with reference to the flowchart of FIG. 27.

In step S281, the frame rate conversion unit 181 and the frame rate conversion unit 182 perform frame rate conversion processing as necessary.

That is, the frame rate conversion unit 181 reads, from the buffer 122, the time-series data of the audio synchronization feature amount for each time interval of the main content, performs frame rate conversion (that is, downsampling) on the audio synchronization feature amount as necessary, and supplies the result to the block integration unit 183.

The frame rate conversion unit 182 reads, from the time-series data of the audio synchronization feature amount recorded in the buffer 125, only the time-series data included in the search range indicated by the synchronization correction information based on the presentation time information supplied from the presentation time information comparison unit 362.

The frame rate conversion unit 182 then performs frame rate conversion (that is, downsampling or upsampling) on the read audio synchronization feature amount as necessary, and supplies the result to the block integration unit 184.
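The read-then-convert behavior of the frame rate conversion unit 182 can be sketched as follows. This is a minimal Python illustration rather than the implementation of this specification: the function names, the representation of each feature frame as a list of 0/1 peak flags, the element-wise OR used for downsampling, and the frame repetition used for upsampling are all assumptions made for the example.

```python
from typing import List, Sequence

Frame = List[int]  # assumed: one list of 0/1 peak flags per time interval


def read_search_range(buffer: Sequence[Frame], start: int, length: int) -> List[Frame]:
    """Return only the frames inside the search range, clamped to the buffer."""
    lo = max(0, start)
    hi = max(lo, min(len(buffer), start + length))
    return [list(frame) for frame in buffer[lo:hi]]


def downsample(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Merge each run of `factor` frames into one by element-wise OR (assumed rule)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    merged = []
    for begin in range(0, len(frames) - factor + 1, factor):
        group = frames[begin:begin + factor]
        merged.append([int(any(column)) for column in zip(*group)])
    return merged


def upsample(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Repeat each frame `factor` times (assumed rule)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [list(frame) for frame in frames for _ in range(factor)]


# Hypothetical contents of buffer 125: six frames with three lag bins each.
buffer_125 = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

# Read frames 2..5 only, then halve the frame rate.
in_range = read_search_range(buffer_125, start=2, length=4)
halved = downsample(in_range, factor=2)
```

Reading only the search range before converting means the conversion cost scales with the size of the range rather than with the size of the buffer.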

In step S282, the block integration unit 183 and the block integration unit 184 integrate the time-series data of the audio synchronization feature amount.

For example, as indicated by arrow A11 in FIG. 28, the processing of step S162 in FIG. 18 operated on each of the n blocks X(i) recorded in the buffer 122 and each of the m blocks Y(j) recorded in the buffer 125. In other words, there were n × m combinations of audio synchronization feature amount blocks to be searched. Strictly speaking, frame rate conversion is performed on the audio synchronization feature amount as appropriate, but to keep the description of FIG. 28 simple, the following assumes that no frame rate conversion is performed.

Here, the blocks Y(j) subjected to the matching processing are all the blocks recorded in the buffer 125, or the blocks in a sufficiently wide range.

In FIG. 28, i is the index of a block of the audio synchronization feature amount obtained by the audio synchronization feature amount calculation unit 121, and j is the index of a block of the audio synchronization feature amount included in the sub received signal.

In step S282, by contrast, as indicated by arrow A12, of the m blocks recorded in the buffer 125, only the m' blocks included in the search range indicated by the synchronization correction information based on the presentation time information are subjected to the matching processing. That is, only those blocks are subjected to the similarity calculation in the similarity calculation unit 185.

In this example, PTS_i represents the presentation time information, and the position indicated by this presentation time information is the position of the main content currently being reproduced. The search range is a range of a predetermined length, that is, a range of m' blocks, that includes the position corresponding to the presentation time information of the sub content at the same time as the presentation time information of the main content. Accordingly, there are n × m' combinations of audio synchronization feature amount blocks to be searched.

In this way, using the synchronization correction information based on the presentation time information, which is obtained by comparing the presentation time information, limits the range of the audio synchronization feature amount subjected to the matching processing to the necessary minimum, so the processing time required for the similarity calculation search can be greatly reduced.
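The reduction from n × m to n × m' comparisons can be illustrated with a short sketch. The symbols X(i), Y(j), n, m and m' follow the description above; everything else is an assumption for illustration only, including the overlap-based similarity measure, the parametrization of the search range by a center index and a half width, and all function names.

```python
from typing import List, Sequence, Tuple

Block = Sequence[int]  # assumed: flattened 0/1 peak pattern of one block


def similarity(a: Block, b: Block) -> float:
    """Shared set bits divided by the union of set bits (hypothetical measure)."""
    both = sum(1 for p, q in zip(a, b) if p and q)
    either = sum(1 for p, q in zip(a, b) if p or q)
    return both / either if either else 0.0


def search_range(m: int, center: int, half_width: int) -> range:
    """Indices j of Y(j) inside the window around `center`, clamped to [0, m)."""
    return range(max(0, center - half_width), min(m, center + half_width + 1))


def best_match(
    x_blocks: Sequence[Block],
    y_blocks: Sequence[Block],
    center: int,
    half_width: int,
) -> Tuple[float, int, int, int]:
    """Return (score, i, j, comparisons) for the most similar pair X(i), Y(j)."""
    best_score, best_i, best_j = -1.0, -1, -1
    comparisons = 0
    for j in search_range(len(y_blocks), center, half_width):
        for i, x in enumerate(x_blocks):
            comparisons += 1
            score = similarity(x, y_blocks[j])
            if score > best_score:
                best_score, best_i, best_j = score, i, j
    return best_score, best_i, best_j, comparisons


# Hypothetical data: n = 2 blocks from the microphone, m = 10 blocks from the sub stream.
x_blocks: List[Block] = [[1, 0, 1, 0], [0, 1, 1, 0]]
y_blocks: List[Block] = [[0, 0, 0, 1] for _ in range(10)]
y_blocks[6] = [0, 1, 1, 0]  # the block that truly corresponds to X(1)

# The presentation time information points near j = 5; search two blocks either side (m' = 5).
score, i, j, comparisons = best_match(x_blocks, y_blocks, center=5, half_width=2)
```

With these assumed values the search performs n × m' = 10 similarity calculations instead of n × m = 20, and still finds the matching pair because the true position lies inside the window.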

Returning to the flowchart of FIG. 27, once the time-series data of the audio synchronization feature amount has been integrated, the processing proceeds to step S283. The processing of steps S283 and S284 is then performed and the synchronization correction information generation processing ends; because this processing is the same as that of steps S163 and S164 in FIG. 18, its description is omitted. When the synchronization correction information generation processing ends, the processing proceeds to step S258 in FIG. 26.

As described above, the content reproduction system generates the synchronization correction information based on the audio feature amount by using the audio synchronization feature amount within the search range indicated by the synchronization correction information based on the presentation time information. The synchronization correction information can thereby be generated robustly with a smaller amount of processing.

The providing apparatus 301 shown in FIG. 22 and the content reproduction system shown in FIG. 23 are also applicable to each of the systems shown in FIGS. 19 to 21.

Incidentally, the series of processing described above can be executed by hardware or by software. When the series of processing is executed by software, a program constituting the software is installed on a computer. Here, the computer includes a computer built into dedicated hardware and, for example, a general-purpose computer that can execute various functions when various programs are installed on it.

FIG. 29 is a block diagram showing a configuration example of the hardware of a computer that executes the series of processing described above by means of a program.

In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another by a bus 504.

An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.

The input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a nonvolatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

In the computer configured as described above, the CPU 501 loads, for example, the program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the series of processing described above is performed.

The program executed by the computer (CPU 501) can be provided by being recorded on the removable medium 511 as, for example, packaged media. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. Alternatively, the program can be installed in advance in the ROM 502 or the recording unit 508.

The program executed by the computer may be a program in which the processing is performed in time series in the order described in this specification, or a program in which the processing is performed in parallel or at a necessary timing, such as when a call is made.

Embodiments of the present technology are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present technology.

For example, the present technology can take a cloud computing configuration in which a single function is shared and processed jointly by a plurality of apparatuses via a network.

Each step described in the above flowcharts can be executed by a single apparatus or shared among a plurality of apparatuses.

Furthermore, when a single step includes a plurality of processes, those processes can be executed by a single apparatus or shared among a plurality of apparatuses.

The effects described in this specification are merely examples and are not limiting; other effects may also be obtained.

Furthermore, the present technology can also be configured as follows.

(1)
An information processing apparatus including:
a feature amount calculation unit that extracts a feature amount from an audio signal of first content; and
a synchronization calculation unit that generates synchronization correction information based on an audio feature amount, for reproducing second content having a time synchronization relationship with the first content in synchronization with the first content, by comparing the feature amount acquired in a state of synchronization with the second content with the feature amount extracted by the feature amount calculation unit.
(2)
The information processing apparatus according to (1), wherein the feature amount calculation unit extracts the feature amount from the audio signal obtained by collecting the reproduced audio of the first content.
(3)
The information processing apparatus according to (1) or (2), further including a first input unit that acquires the second content and the feature amount associated with the second content in a state of synchronization with the second content.
(4)
The information processing apparatus according to (3), wherein the second content and the feature amount are transmitted to the information processing apparatus at a timing that takes into account a difference in arrival time with respect to the first content.
(5)
The information processing apparatus according to (3), wherein the first input unit requests transmission of the second content and the feature amount, and receives the second content and the feature amount transmitted in response to the request.
(6)
The information processing apparatus according to (5), further comprising: a second input unit that requests transmission of the first content and receives the first content transmitted in response to the request.
(7)
The information processing apparatus according to any one of (2) to (6), wherein
the feature amount calculation unit extracts the feature amount from the audio signal for one piece of the first content that has been reproduced, and
the synchronization calculation unit identifies the reproduced first content by comparing each of the feature amounts of a plurality of pieces of the first content associated with the second content with the feature amount extracted by the feature amount calculation unit, and generates the synchronization correction information based on the audio feature amount for reproducing the identified first content and the second content in synchronization.
(8)
The information processing apparatus according to any one of (2) to (7), further including a reproduction processing unit that controls reproduction of the second content.
(9)
The information processing apparatus according to (8), wherein the reproduction processing unit corrects the reproduction position of the second content based on synchronization correction information based on the audio feature amount.
(10)
The information processing apparatus according to any one of (1) to (9), further including:
an acquisition unit that acquires presentation time information of the first content; and
a comparison unit that compares the presentation time information of the first content with the presentation time information of the second content and generates synchronization correction information based on the presentation time information,
wherein the synchronization calculation unit generates the synchronization correction information based on the audio feature amount by comparing the feature amount included in a range indicated by the synchronization correction information based on the presentation time information, among the acquired series of feature amounts, with the feature amount extracted by the feature amount calculation unit.
(11)
The information processing apparatus according to any one of (1) to (10), wherein the synchronization calculation unit performs frame rate conversion on at least one of the acquired feature amount and the feature amount extracted by the feature amount calculation unit so that their frame rates match, and then compares the feature amounts.
(12)
An information processing method including:
a feature amount calculation step of extracting a feature amount from an audio signal of first content; and
a synchronization calculation step of generating synchronization correction information based on an audio feature amount, for reproducing second content having a time synchronization relationship with the first content in synchronization with the first content, by comparing the feature amount acquired in a state of synchronization with the second content with the feature amount extracted by the processing of the feature amount calculation step.
(13)
A program for causing a computer to execute processing including:
a feature amount calculation step of extracting a feature amount from an audio signal of first content; and
a synchronization calculation step of generating synchronization correction information based on an audio feature amount, for reproducing second content having a time synchronization relationship with the first content in synchronization with the first content, by comparing the feature amount acquired in a state of synchronization with the second content with the feature amount extracted by the processing of the feature amount calculation step.
(14)
An information processing apparatus including:
a feature amount calculation unit that extracts a feature amount from an audio signal of first content; and
a first output unit that outputs second content having a time synchronization relationship with the first content, and the feature amount associated with the second content in a state of synchronization with the second content.
(15)
The information processing apparatus according to (14), further including a second output unit that outputs the first content.
(16)
The information processing apparatus according to (15), wherein the first output unit outputs the second content and the feature amount at a timing in which a difference in arrival time from the first content is considered.
(17)
The information processing apparatus according to (15), wherein, when transmission of the second content and the feature amount is requested, the first output unit outputs the second content and the feature amount in response to the request.
(18)
The information processing apparatus according to (17), wherein when the transmission of the first content is requested, the second output unit outputs the first content in response to the request.
(19)
The information processing apparatus according to any one of (14) to (18), wherein
the feature amount calculation unit extracts the feature amount from the audio signal for each of a plurality of pieces of the first content, and
the first output unit outputs the feature amounts of the plurality of pieces of the first content in association with the second content.
(20)
The information processing apparatus according to any one of (14) to (19), wherein
the feature amount calculation unit downsamples the feature amount, and
the first output unit outputs the second content and the downsampled feature amount.
(21)
An information processing method including:
a feature amount calculation step of extracting a feature amount from an audio signal of first content; and
an output step of outputting second content having a time synchronization relationship with the first content, and the feature amount associated with the second content in a state of synchronization with the second content.
(22)
A program for causing a computer to execute processing including:
a feature amount calculation step of extracting a feature amount from an audio signal of first content; and
an output step of outputting second content having a time synchronization relationship with the first content, and the feature amount associated with the second content in a state of synchronization with the second content.

Description of reference numerals: 11 providing apparatus, 22 output unit, 23 audio synchronization feature amount calculation unit, 24 multiplexing processing unit, 25 output unit, 81 main receiving device, 85 sub receiving device, 111 input unit, 112 reproduction processing unit, 121 audio synchronization feature amount calculation unit, 123 input unit, 126 synchronization calculation unit, 127 reproduction processing unit, 311 reference time signal generation unit, 352 presentation unit, 361 acquisition unit, 362 presentation time information comparison unit

Claims

A signal processing apparatus comprising:
a band dividing unit that divides an acoustic signal included in first content into bands;
a periodicity detection unit that detects, for each band, periodicity information of the acoustic signal band-divided by the band dividing unit;
a periodicity information integration unit that integrates, over all bands, the periodicity information detected for each band by the periodicity detection unit;
a peak detection unit that detects peak positions of the periodicity information integrated by the periodicity information integration unit and generates peak information;
a downsampling unit that converts the peak information of a plurality of time intervals generated by the peak detection unit into information of one time interval; and
an output unit that outputs the information downsampled by the downsampling unit as a synchronization feature amount used when synchronizing the first content with second content to be synchronized with it.

The signal processing apparatus according to claim 1, wherein the acoustic signal is an acoustic signal obtained by collecting sound of the reproduced first content.

The signal processing apparatus according to claim 1, further comprising a reproduction processing unit that controls reproduction of the second content,
wherein the reproduction processing unit corrects the reproduction position of the second content based on synchronization correction information obtained by comparing the feature amount of the first content associated with the second content with the synchronization feature amount.

A signal processing method in which a signal processing apparatus executes:
band division processing that divides an acoustic signal included in first content into bands;
periodicity detection processing that detects, for each band, periodicity information of the acoustic signal band-divided by the band division processing;
periodicity information integration processing that integrates, over all bands, the periodicity information detected for each band by the periodicity detection processing;
peak detection processing that detects peak positions of the periodicity information integrated by the periodicity information integration processing and generates peak information;
downsampling processing that converts the peak information of a plurality of time intervals generated by the peak detection processing into information of one time interval; and
output processing that outputs the information downsampled by the downsampling processing as a synchronization feature amount used when synchronizing the first content with second content to be synchronized with it.

A program that causes a computer to execute processing including:
band division processing that divides an acoustic signal included in first content into bands;
periodicity detection processing that detects, for each band, periodicity information of the acoustic signal band-divided by the band division processing;
periodicity information integration processing that integrates, over all bands, the periodicity information detected for each band by the periodicity detection processing;
peak detection processing that detects peak positions of the periodicity information integrated by the periodicity information integration processing and generates peak information;
downsampling processing that converts the peak information of a plurality of time intervals generated by the peak detection processing into information of one time interval; and
output processing that outputs the information downsampled by the downsampling processing as a synchronization feature amount used when synchronizing the first content with second content to be synchronized with it.
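
For orientation only, the middle of the processing chain recited above (periodicity detection for each band, integration over all bands, and peak detection) can be sketched in Python. This sketch is not a statement of the claimed implementation: it assumes that band division has already been performed upstream, that normalized autocorrelation serves as the periodicity information, that integration is a lag-by-lag sum, and that a peak is a local maximum above a threshold. All function names and the test signals are hypothetical.

```python
from typing import List, Sequence


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Normalized autocorrelation of x for lags 0..max_lag (assumed periodicity measure)."""
    energy = sum(v * v for v in x)
    if energy == 0:
        return [0.0] * (max_lag + 1)
    n = len(x)
    return [
        sum(x[t] * x[t + lag] for t in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


def integrate_bands(per_band: Sequence[Sequence[float]]) -> List[float]:
    """Sum the periodicity information of all bands lag by lag (assumed integration rule)."""
    return [sum(values) for values in zip(*per_band)]


def peak_info(periodicity: Sequence[float], threshold: float) -> List[int]:
    """Return 1 at each lag that is a local maximum above threshold, else 0."""
    peaks = [0] * len(periodicity)
    for k in range(1, len(periodicity) - 1):
        is_local_max = periodicity[k] > periodicity[k - 1] and periodicity[k] >= periodicity[k + 1]
        if periodicity[k] > threshold and is_local_max:
            peaks[k] = 1
    return peaks


# Hypothetical band-divided signals for one time interval, both with a period of 4 samples.
band_1 = [1.0, 0.0, 0.0, 0.0] * 8
band_2 = [0.5, 0.0, 0.0, 0.0] * 8

per_band = [autocorrelation(band, max_lag=9) for band in (band_1, band_2)]
integrated = integrate_bands(per_band)
peaks = peak_info(integrated, threshold=1.0)
```

In this example the peak information marks lags 4 and 8, the period of the test signals and its first multiple. Lag 0 is never marked because the scan starts at lag 1.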