JP6411274B2

JP6411274B2 - Timing correction system, method and program thereof

Info

Publication number: JP6411274B2
Application number: JP2015080590A
Authority: JP
Inventors: 優鎌本; 善史白木; 尚佐藤; パブロナバガブリエル; 健弘守谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-04-10
Filing date: 2015-04-10
Publication date: 2018-10-24
Anticipated expiration: 2035-04-10
Also published as: JP2016200711A

Description

The present invention relates to a technique for displaying text information entered by a viewer of a video superimposed on that video.

Non-Patent Document 1 is known as prior art for displaying text information entered by a viewer of a video superimposed on that video. In Non-Patent Document 1, a viewer can post comments while watching a video.

Non-Patent Document 1: "Watching videos and posting comments", [online], NIWANGO.INC, [retrieved February 2, 2015], Internet <URL: http://info.nicovideo.jp/help/player/howto/>

In the prior art, however, a viewer who wants to comment on a video must first type the comment and then click the post button or press the Enter key, so the comment may be displayed later than the moment at which the viewer wanted to comment. Conversely, a viewer who already knows the content of the video can type the comment in advance and time the click or key press, but even then the comment may appear earlier or later than intended. For example, when the text information "8", which signifies applause, is posted in time with the tempo of the music in a music video or live footage, it drifts off the beat more often, and by a larger margin, than actual clapping would.

An object of the present invention is to provide a timing correction system, method, and program that correct the timing at which a comment is superimposed on a video so that it matches the video.

To solve the above problem, according to one aspect of the present invention, a timing correction system includes: a key input detection unit that detects visual information that is entered by a viewer of a target video signal, is displayed superimposed on the target video signal, and signifies a predetermined action, together with metadata including the input time of that visual information; a timing detection unit that detects, from at least one of (i) the target video signal, (ii) a target audio signal corresponding to the target video signal, and (iii) other visual information already entered and displayed superimposed on the target video signal and its metadata, a reference timing, which is the timing that serves as the reference for the action signified by the visual information detected by the key input detection unit; and a correction unit that corrects, based on the reference timing, the timing at which the visual information detected by the key input detection unit is displayed on a display unit.

To solve the above problem, according to another aspect of the present invention, a timing correction method includes: a key input detection step in which a key input detection unit detects visual information that is entered by a viewer of a target video signal, is displayed superimposed on the target video signal, and signifies a predetermined action, together with metadata including the input time of that visual information; a timing detection step in which a timing detection unit detects, from at least one of (i) the target video signal, (ii) a target audio signal corresponding to the target video signal, and (iii) other visual information already entered and displayed superimposed on the target video signal and its metadata, a reference timing, which is the timing that serves as the reference for the action signified by the visual information detected in the key input detection step; and a correction step in which a correction unit corrects, based on the reference timing, the timing at which the visual information detected in the key input detection step is displayed on a display unit.

According to the present invention, the timing at which a comment is superimposed on a video can be corrected to match the video.

FIG. 1 is a functional block diagram of the timing correction system according to the first embodiment.
FIG. 2 shows an example of the processing flow of the timing correction system according to the first embodiment.
FIG. 3 shows an example of the data stored in the storage unit of the key input detection unit.
FIG. 4 shows an example of correction by the correction unit.
FIG. 5 shows an example in which text information txt_in(r) appears at the right edge of the display unit at input time t_in(r), moves toward the left edge, and disappears off the left edge.
FIG. 6 shows an example in which text information txt_in(q) appears at the right edge of the display unit at timing t_out(q), moves toward the left edge, and disappears off the left edge.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted.

<Video distribution system 1 according to the first embodiment>
FIG. 1 is a functional block diagram of the timing correction system 100 according to the first embodiment, and FIG. 2 shows its processing flow.

The video distribution system 1 includes one or more viewer terminals 91, the timing correction system 100, and a video distribution server 92 that distributes the target video signal to the viewer terminals 91. The viewer terminals 91, the timing correction system 100, and the video distribution server 92 can communicate with one another over a communication line. The clocks of the viewer terminals 91, the timing correction system 100, and the video distribution server 92 are preferably synchronized, for example by NTP (Network Time Protocol).

<Viewer terminal 91>
The viewer terminal 91 is operated by a person viewing the target video signal (for example, a viewer of a video). It includes an input unit (keyboard, mouse, touch panel, etc.) and a display unit (display, touch panel, etc.), and is, for example, a personal computer, smartphone, or tablet. Through the input unit, the viewer can request the video distribution server 92 to play back the target video. Through the display unit, the viewer can watch the target video signal. The viewer can also use the input unit to enter text information (for example, a comment) to be displayed superimposed on the target video signal. When text information is entered, the viewer terminal 91 obtains the input time, for example from the system time, and transmits it as metadata together with the text information.

<Video distribution server 92>
The video distribution server 92 receives videos from a video database and a video camera and, in response to a request from a viewer terminal 91, distributes a video stored in the video database or distributes in real time a video captured by the video camera. Besides camera footage, it may also distribute in real time CG that is composited and edited in real time or CG generated from motion capture or the like. In this embodiment, "video" means a video signal provided together with an audio signal synchronized on the time axis. The video database stores each video together with the text information attached to it, and metadata is attached to the text information. The metadata includes the input time of the text information, its size, its color, how it appears, its movement speed, its movement position, and so on. For example, the size, color, appearance method, movement speed, and movement position may be selectable by the person entering the text; the viewer terminal 91 transmits them as metadata together with the text information, and they are stored in the video database together with the video.

<Timing correction system 100>
The timing correction system 100 receives text information txt_in(p), metadata indicating the input time t_in(p) of that text information, and the target video signal with text information attached, and outputs the text information txt_in(p) together with metadata including the timing, t_in(p) or t_out(p), at which that text information is to be displayed on a display unit (for example, a display). Here p is an index assigned to every piece of input text information. The text information txt_in(p) is entered by a viewer of the target video signal and is displayed superimposed on that signal.

The timing correction system 100 includes a key input detection unit 110, a timing detection unit 120, and a correction unit 130.

<Key input detection unit 110>
The key input detection unit 110 receives the text information txt_in(p) and its input time t_in(p), detects text information txt_in(q) that signifies a predetermined action together with metadata including its input time t_in(q) (S110), and outputs them. Here q is an index assigned to every piece of text information that signifies a predetermined action. If P is the total number of pieces of input text information and Q is the number of those that signify a predetermined action, then P≧Q, p=1,2,…,P, and q=1,2,…,Q.

For example, the key input detection unit 110 has a storage unit (not shown) in which each predetermined action is stored in association with the text information that signifies it (see FIG. 3). The key input detection unit 110 determines whether the received text information txt_in(p) matches text information stored in the storage unit. If it matches, the unit outputs the text information txt_in(q), the predetermined action (or an index indicating it), and the input time t_in(q) to the correction unit 130. If it does not match, the unit outputs the text information txt_in(r) and the input time t_in(r) to the video distribution server 92. Here r is an index assigned to every piece of text information other than those signifying a predetermined action. If R is the number of such pieces among the input text information, then P=Q+R and r=1,2,…,R.
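The routing performed by the key input detection unit can be sketched as follows. This is a minimal illustration, not the patented implementation: the table contents, the `Comment` type, and the function name `detect_key_input` are hypothetical, and only the pairing of "8" with applause comes from the description.

```python
from dataclasses import dataclass

# Hypothetical table in the spirit of FIG. 3: text -> predetermined action.
# Only "8" (applause) appears in the description; "w" is an assumed entry.
ACTION_TABLE = {"8": "applause", "w": "laughter"}


@dataclass
class Comment:
    text: str    # txt_in(p)
    t_in: float  # input time t_in(p), in seconds


def detect_key_input(comments, table=ACTION_TABLE):
    """Split comments into those sent to the correction unit (index q)
    and those sent straight to the distribution server (index r)."""
    to_correction = []  # (comment, action) pairs
    to_server = []      # comments that match no stored text
    for c in comments:
        action = table.get(c.text)  # exact match against the stored text
        if action is not None:
            to_correction.append((c, action))
        else:
            to_server.append(c)
    return to_correction, to_server
```

Because every comment goes to exactly one of the two lists, the counts satisfy P=Q+R by construction.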

<Timing detection unit 120>
The timing detection unit 120 receives from the video distribution server 92 at least one of (i) the target video signal, (ii) the target audio signal corresponding to the target video signal, and (iii) other text information already entered and displayed superimposed on the target video signal, together with its metadata. For example, a video with text information attached contains all of (i) to (iii).

From at least one of (i) to (iii), the timing detection unit 120 detects the reference timing t_out(s), which is the timing that serves as the reference for the action signified by the text information detected by the key input detection unit 110 (S120), and outputs it. Here s is an index giving the occurrence number of the predetermined action contained in at least one of (i) to (iii), s=1,2,…,S, and S is the number of times the predetermined action occurs in at least one of (i) to (iii).

Various methods of detecting the reference timing t_out(s) are conceivable, depending on the action signified by the text information.

For example, methods of detecting the reference timing when the action signified by the text information is clapping are illustrated below according to the data used. Clapping can be further divided into "hand clapping (clap)", which is performed in time with a constant tempo, and "applause", which is performed without a constant tempo. In the following, "applause" by itself means clapping without a constant tempo. Because hand clapping and applause differ in the time intervals between claps and in loudness (Reference 1), which of the two actions the text information signifies can be determined from, for example, (ii) the target audio signal corresponding to the target video signal.
(Reference 1) Yutaka Kamamoto, Kazuhiko Kawahara, Akira Omoto, Takehiro Moriya, "A basic study toward low-delay transmission of applause and hand-clapping sounds elicited during music listening", Acoustical Society of Japan 2014 Autumn Meeting, 1-Q-17, 2014.
This embodiment describes the method of detecting the reference timing when the action signified by the text information is hand clapping.

(i) When the action is hand clapping and the data is the target video signal
If the target video signal has a constant tempo, it is assumed that viewers clap in time with that tempo, and the tempo is detected as the reference timing (see Reference 2).
(Reference 2) Dan Mikami, Ayumi Matsumoto, Koji Kadota, Harumi Kawamura, Akira Kojima, "Delayed synchronous video feedback system for motion learning", IPSJ Transactions on Consumer Devices & Systems, Information Processing Society of Japan, 2014, vol. 4, no. 1, pp. 22-31.
(ii) When the action is hand clapping and the data is the target audio signal
If the target audio signal has a constant tempo, it is assumed that viewers clap in time with that tempo, and the tempo is detected as the reference timing (see References 2 to 5).
(Reference 3) Emiru Tsunoo, Kenichi Miyamoto, Nobutaka Ono, Shigeki Sagayama, "Extraction of rhythm features from music audio signals using harmonic/percussive sound separation", Proceedings of the Acoustical Society of Japan Spring Meeting, Mar. 2008, pp. 905-906.
(Reference 4) Emiru Tsunoo, Nobutaka Ono, Shigeki Sagayama, "Rhythm map: extraction of unit rhythm patterns from music audio signals and analysis of musical structure", IPSJ SIG Technical Report, Aug. 2008, vol. 2008-MUS-76, no. 25, pp. 149-154.
(Reference 5) Emiru Tsunoo, Nobutaka Ono, Shigeki Sagayama, "Bar boundary estimation of music audio signals based on extraction of unit rhythm patterns considering harmonic boundaries", Proceedings of the Acoustical Society of Japan Autumn Meeting, Sep. 2009, no. 3-5-10, pp. 897-898.
(iii) When the action is hand clapping and the data is other text information already entered and displayed superimposed on the target video signal, and its metadata
For example, the timing detection unit 120 extracts, from the other text information already entered, the text information that signifies the same action as the text information detected by the key input detection unit 110. For example, referring to a storage unit (not shown; see FIG. 3), the timing detection unit 120 determines whether each piece of received, already-entered text information matches the text information stored in the storage unit for the action signified by the text information detected by the key input detection unit 110, and if it matches, extracts its display time.

The timing detection unit 120 obtains the reference timing from statistics of the extracted display times. For example, using the extracted display times, it obtains for each predetermined time interval a representative value (some value that represents the display times in that interval, such as the mean, mode, minimum, or maximum) and detects it as the reference timing. If, for example, the video is a music video and the tempo of the song is 148 BPM (beats per minute), one beat lasts about 405 ms, so the predetermined time interval is set to 405 ms. As one example, a histogram of the extracted display times is created and the reference timing is decided by majority vote; that is, the mode is taken as the reference timing.
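The per-interval majority vote can be sketched as follows. This is an illustrative sketch under stated assumptions: the histogram bin width `bin_ms`, the tie-breaking rule (earliest bin wins), and the choice to report the mean of the winning bin are not specified in the description, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def beat_interval_ms(bpm):
    """Length of one beat in milliseconds (148 BPM -> about 405 ms)."""
    return 60000.0 / bpm


def reference_timings(display_times_ms, interval_ms=405.0, bin_ms=20.0):
    """Return one reference timing (ms) per time interval that has data.

    Display times are grouped into consecutive intervals of interval_ms.
    Inside each interval a histogram with bin_ms-wide bins is built and
    the most populated bin wins (majority vote); the mean of the display
    times in that bin is reported as the reference timing.
    """
    groups = defaultdict(list)
    for t in display_times_ms:
        groups[int(t // interval_ms)].append(t)

    result = []
    for k in sorted(groups):
        bins = Counter(int(t // bin_ms) for t in groups[k])
        # Highest count first; on a tie, prefer the earlier bin (assumption).
        best_bin, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
        members = [t for t in groups[k] if int(t // bin_ms) == best_bin]
        result.append(sum(members) / len(members))
    return result
```

For example, display times of 100, 105, 110 and 300 ms all fall in the first 405 ms interval; three of them share one 20 ms bin, so the reference timing for that interval is their mean, 105 ms.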

(Method of adding fluctuation to the timing)
Fluctuation may be added to the clapping timing (see Reference 1).

For example, in case (iii), the mean and variance of the extracted display times may be obtained for each predetermined time interval, and a random number drawn from a Gaussian distribution with that mean and variance may be used as the reference timing. This adds fluctuation to the clapping timing, so that the text information corresponding to the predetermined action can be displayed at a more natural timing.

The variance obtained in the same way may also be combined with methods (i) and (ii). For example, the reference timing is obtained by method (i) or (ii), and a random number drawn from a Gaussian distribution centered on that reference timing and having the obtained variance is used as the new (finally used) reference timing. This also allows the text information corresponding to the predetermined action to be displayed at a more natural timing.

Alternatively, the variance that typically arises in hand clapping may be determined in advance and the reference timing obtained from it. For example, the reference timing is obtained by method (i), (ii), or (iii), and a random number drawn from a Gaussian distribution centered on that reference timing and having the typical hand-clapping variance is used as the new (finally used) reference timing.
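The fluctuation step can be sketched with Python's standard library. The function names are hypothetical, and the use of the population variance is an assumption; the description only says that a mean and a variance are obtained.

```python
import random
from statistics import mean, pvariance


def jitter_from_samples(display_times, rng):
    """Case (iii): draw a reference timing from a Gaussian whose mean and
    variance are those of the display times in one time interval."""
    mu = mean(display_times)
    sigma = pvariance(display_times) ** 0.5  # population variance (assumption)
    return rng.gauss(mu, sigma)


def jitter_around(reference, variance, rng):
    """Draw a new reference timing from a Gaussian centered on an existing
    reference timing, with a variance measured elsewhere (for example from
    other comments, or a typical hand-clapping variance known in advance)."""
    return rng.gauss(reference, variance ** 0.5)
```

Passing an explicit `random.Random` instance as `rng` keeps the draws reproducible, which is convenient when the corrected timings are stored and replayed.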

(When the action is other than hand clapping)
When the data is (iii) other text information already entered and displayed superimposed on the target video signal, and its metadata, the method is easily applied to actions other than hand clapping. For example, when the action is applause or laughter, the predetermined time interval is set to the longest time for which one continuous run of applause or laughter can last. For example, if applause in response to some event is expected to last about 30 seconds at most, a representative value is obtained from the other text information signifying applause that is displayed within one minute after the first such text information is displayed, and is detected as the reference timing. Fluctuation may be added to the timing of applause or laughter by the method described above.
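One way to group non-rhythmic comments into runs is sketched below. The function name, the rule that a run ends once a comment falls outside the window measured from the run's first comment, and the use of the median as the representative value are all assumptions made for illustration; the description allows any representative value.

```python
from statistics import median


def burst_reference_timings(times, window=60.0, representative=median):
    """Group display times (seconds) into runs and return one reference
    timing per run.

    A run starts at the first comment and collects every comment displayed
    within `window` seconds of that first comment; the next comment after
    the window starts a new run.
    """
    refs = []
    burst = []
    for t in sorted(times):
        if burst and t - burst[0] > window:
            refs.append(representative(burst))
            burst = []
        burst.append(t)
    if burst:
        refs.append(representative(burst))
    return refs
```

With display times of 10, 12, 15, 200 and 203 seconds and a 60-second window, this yields two runs with reference timings 12 and 201.5.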

However, when the target video signal is distributed for the first time (no text information has yet been attached to it), or when a video captured by a video camera is distributed in real time, there is no previously entered text information for the target video signal, so (iii) cannot be used.

The reference timing of an action other than hand clapping can be detected from (i) the target video signal or (ii) the target audio signal as follows, for example. Feature values are extracted in advance from a video or audio signal that captures the predetermined action (for example, applause or laughter) and are stored in a storage unit (not shown). Feature values are then extracted from (i) the target video signal or (ii) the target audio signal received from the video distribution server 92, and their similarity to the stored feature values is computed. When the similarity is at or above a threshold, the predetermined action is judged to be taking place, a representative value is obtained, and it is detected as the reference timing. For example, existing face recognition technology may be applied to detect smiles in (i) the target video signal, and the reference timing of laughter may be obtained from the detection times. Motion compensation vectors, which are feature values used for video compression, may also be used (see Reference 6).
(Reference 6) Tokumichi Murakami, Kotaro Asai, Shunichi Sekiguchi, "High Efficiency Video Coding Technology: HEVC/H.265 and Its Applications", Ohmsha, 2013, pp. 20-28, 125-132.
A video signal that captures applause is compression-coded in advance to obtain a time series of motion compensation vectors, which is stored in a storage unit (not shown). Then (i) the target video signal received from the video distribution server 92 is compression-coded to obtain a time series of motion compensation vectors (if the target video signal is already compression-coded, it contains motion compensation vectors, which can be used as they are), and its similarity to the stored time series is computed.
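The threshold test on feature similarity can be sketched as a sliding-window comparison. This is only an illustration: reducing each frame to a single scalar feature (for example a motion vector magnitude), using cosine similarity, and the 0.9 threshold are assumptions, since the description does not name a similarity measure.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect_action(target, template, frame_period, threshold=0.9):
    """Return the start times (seconds) at which the stored template
    matches the target feature series with similarity >= threshold.

    target, template: per-frame scalar features (hypothetical reduction).
    frame_period: seconds per frame.
    """
    n = len(template)
    hits = []
    for i in range(len(target) - n + 1):
        if cosine(target[i:i + n], template) >= threshold:
            hits.append(i * frame_period)
    return hits
```

The detected start times would then be reduced to a representative value per interval, as described for the other methods.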

<Correction unit 130>
The correction unit 130 receives the text information txt_in(q) signifying a predetermined action, metadata including its input time t_in(q), and the reference timings t_out(s). Based on the reference timings t_out(s), it corrects the timing at which the text information detected by the key input detection unit 110 is displayed on the display unit (S130), and outputs the text information txt_in(q) together with metadata including the display timing t_out(q) to the video distribution server 92. For example, of the S reference timings t_out(s), the one closest to t_in(q) is taken as the timing t_out(q) for displaying the text information txt_in(q) (t_out(q)←t_out(s)). FIG. 4 shows an example in which the text information txt_in(q) is "8".
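The nearest-reference rule of the correction unit can be sketched as follows. The function name is hypothetical, and two behaviors are assumptions not stated in the description: an empty list of reference timings leaves the input time unchanged, and a tie between two equally distant reference timings resolves to the earlier one.

```python
import bisect


def correct_timing(t_in, reference_timings):
    """Return t_out(q): the reference timing closest to the input time t_in(q).

    reference_timings must be sorted in ascending order.
    """
    if not reference_timings:
        return t_in  # nothing to snap to (assumption)
    i = bisect.bisect_left(reference_timings, t_in)
    # Only the neighbors on either side of t_in can be the closest.
    candidates = reference_timings[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda t: abs(t - t_in))
```

With beats at 0.0, 0.405 and 0.81 seconds, a comment entered at 0.5 seconds is moved to 0.405 and one entered at 0.7 seconds is moved to 0.81.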

The video distribution server 92 superimposes the text information on the target video signal and outputs it. It distributes the text information txt_in(q) or txt_in(r) together with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at timing t_out(q) or input time t_in(r), respectively. For example, when text information txt_in(p) appears at the right edge of the display unit at timing t_out(q) or input time t_in(r), moves toward the left edge, and disappears off the left edge, FIG. 5 shows the uncorrected case (displayed at input time t_in(r)) and FIG. 6 shows the corrected case (displayed at timing t_out(q)).

<When the timing t_out(q) is applied>
There are two possible points at which the video distribution server 92 starts distributing the text information txt_in(q) with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at timing t_out(q).

(1) During the playback in which the viewer enters the text information txt_in(q), the text is distributed with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at the input time t_in(q). The video distribution server 92 then stores in the video database, together with the target video signal, the text information txt_in(q) and the timing t_out(q) (rather than the input time t_in(q)), and at the next playback distributes the text with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at timing t_out(q).

(2) When the viewer inputs the text information txt_in(q), during that same playback the text information txt_in(q) is distributed together with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at the timing t_out(q). In this case, the distribution method differs depending on whether the timing t_out(q) is later or earlier than the input time t_in(q).

(2-A) When the timing t_out(q) is later than the input time t_in(q), it suffices to delay the display and distribute the text information txt_in(q) together with the target video signal so that it is displayed on the display unit at the timing t_out(q).

(2-B) When the timing t_out(q) is earlier than the input time t_in(q), distribution is performed by one of the following methods.

(2-B-1) For example, when the way the text information txt_in(q) is displayed itself expresses the passage of time, as in FIGS. 5 and 6, that passage of time is exploited so that the text information txt_in(q) is distributed together with the target video signal as if it had been displayed at the timing t_out(q). In the case of FIG. 6, for instance, the timing t_out(q) is earlier than the input time t_in(q), so at t = t_out(q) (the top diagram of FIG. 6) the video distribution server 92 has not yet received the text information txt_in(q) and cannot distribute it for display. At the moment the text information txt_in(q) is received (input time t = t_in(q)), the server displays txt_in(q) at the position it would occupy at t = t_in(q) if it had been displayed on the display unit of the viewer terminal 91 since the timing t_out(q). In other words, the movement (transition) of the text information txt_in(q) from the top diagram to the middle diagram of FIG. 6 is not displayed; at t = t_in(q) the text information txt_in(q) suddenly appears as in the middle diagram of FIG. 6 and is then moved on to the position in the bottom diagram of FIG. 6. In this case, the video distribution server 92 stores the text information txt_in(q) and the timing t_out(q) in the video database together with the target video signal, and at the next playback distributes the text information txt_in(q) together with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at the timing t_out(q).
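The position calculation in (2-B-1) can be illustrated with a minimal sketch. It assumes the text moves linearly from the right edge to the left edge, as in FIGS. 5 and 6; the names `ScrollConfig`, `x_position`, and `late_entry_position`, and the screen width, text width, and crossing duration, are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrollConfig:
    """Hypothetical display parameters (assumed values, not from the specification)."""
    screen_width: float = 640.0  # width of the display area in pixels
    text_width: float = 40.0     # rendered width of the text information in pixels
    duration: float = 4.0        # seconds needed to cross and leave the screen


def x_position(t: float, t_start: float, cfg: ScrollConfig) -> Optional[float]:
    """Left edge of text that entered at the right edge at t_start.

    Returns None when the text is not on screen at time t.
    """
    elapsed = t - t_start
    if elapsed < 0 or elapsed > cfg.duration:
        return None
    # Linear motion from x = screen_width down to x = -text_width.
    total_distance = cfg.screen_width + cfg.text_width
    return cfg.screen_width - total_distance * (elapsed / cfg.duration)


def late_entry_position(t_in: float, t_out: float, cfg: ScrollConfig) -> Optional[float]:
    """Where to make the text appear at t_in as if it had been shown since t_out."""
    return x_position(t_in, t_out, cfg)
```

With these assumed values, text whose corrected timing is t_out(q) = 11.0 s but which arrives at t_in(q) = 12.0 s would appear a quarter of the way along its path instead of at the right edge. If the text would already have left the screen by t_in(q), the function returns None, a case the specification does not address.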

(2-B-2) When the predetermined action indicated by the text information is a periodic action such as hand clapping, the acquired text information txt_in(q) is displayed in the next period. That is, the text information txt_in(q) is distributed together with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at the timing t_out(q+1). In this case, the video distribution server 92 may store in the video database, together with the target video signal, either the text information txt_in(q) and the timing t_out(q), or the text information txt_in(q) and the timing t_out(q+1). At the next playback, the text information txt_in(q) is distributed together with the target video signal so that it is displayed on the display unit of the viewer terminal 91 at the timing t_out(q) or the timing t_out(q+1).
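Cases (2-A) and (2-B-2) can be summarized in a short sketch of how a display time might be chosen during the playback in which the text is typed. The function name `live_display_time` and the sorted list `reference_timings` are hypothetical. The sketch also interprets "the next period" as the first reference timing at or after the input time, which coincides with t_out(q+1) only when the input is less than one period late.

```python
from bisect import bisect_left
from typing import Sequence


def live_display_time(t_in: float, t_out: float,
                      reference_timings: Sequence[float]) -> float:
    """Choose when to show text during the playback in which it was entered.

    reference_timings is assumed to be a sorted sequence of corrected
    timings t_out(1), t_out(2), ... for a periodic action.
    """
    if t_out >= t_in:
        # Case (2-A): the corrected timing is still ahead, so delay the display.
        return t_out
    # Case (2-B-2): the corrected timing has passed, so use the next period.
    i = bisect_left(reference_timings, t_in)
    if i < len(reference_timings):
        return reference_timings[i]
    # Assumed fallback: no later period is known, so show at the input time.
    return t_in
```

For later playbacks nothing needs to be decided at run time, since the server simply replays the stored timing (t_out(q) or t_out(q+1)).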

<Effect>
With this configuration, the timing at which text information is displayed superimposed on a video can be corrected to match the video. For example, when the text is meant to follow the tempo of a piece of music, aligning the display timing of the text information can improve the viewers' sense of unity.

<Modification>
In the present embodiment, the timing correction system 100 is configured as a device separate from the viewer terminal 91 and the video distribution server 92, but it may instead be incorporated in the viewer terminal 91 or the video distribution server 92. The timing correction system 100, which includes the key input detection unit 110, the timing detection unit 120, and the correction unit 130, has been described as an independent device, but the individual units may be incorporated in the viewer terminal 91 or the video distribution server 92. For example, the key input detection unit 110 may be incorporated in the viewer terminal 91, and the timing detection unit 120 and the correction unit 130 may be incorporated in the video distribution server 92.

The video distribution server 92 may also be a server that distributes only the video, without adding text information. In that case, the timing correction system 100 may be provided with a database for text information and may add the text information itself.

In the present embodiment, text information has been used as the example of information that is input by a viewer and displayed superimposed on the target video signal, but other visual information may be used. Here, "visual information" is information that can be recognized visually via the display unit, for example characters, figures, or symbols, a combination of these, or a combination of these with colors. It is not limited to still images and may be a moving image. Examples include: (1) text information that means a predetermined action such as "laughter" or "applause" (for example, "w" or "8"), as in the present embodiment; (2) bit information on a computer, other than text information, that means and identifies a predetermined action such as "laughter" or "applause"; (3) items that are not ordinary text information, such as emoticons and pictograms, for example the pictograms shared between mobile phones of different carriers (see Reference 7); (4) ASCII art and the like, which as a whole forms a picture using text information and layout information for that text information (see Reference 8); and (5) net slang corresponding to (1) to (4) above. For example, for the text information "wwwww…", which means "laughter", there is net slang such as "草生えた" (literally, "grass has grown").
(Reference 7) "docomo/au common pictograms", NTT DOCOMO, Inc., [online], [retrieved February 9, 2015], Internet <URL: https://www.nttdocomo.co.jp/service/developer/smart_phone/make_contents/pictograph/>
(Reference 8) "ASCII art", [online], February 2, 2015, Wikipedia, [retrieved February 9, 2015], Internet <URL: http://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%B9%E3%82%AD%E3%83%BC%E3%82%A2%E3%83%BC%E3%83%88>

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise required. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiment and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program, and executing this program on a computer realizes the various processing functions of each device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Alternatively, each time a program is transferred from the server computer to the computer, the computer may execute processing according to the received program in sequence. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and the acquisition of results. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device has been described as being configured by executing a predetermined program on a computer, at least part of the processing content may be realized in hardware.

Claims

A timing correction system comprising:
a storage unit that stores a predetermined action and visual information that means the predetermined action in association with each other;
a key input detection unit that determines whether visual information, which is input by a person viewing a target video signal and is displayed superimposed on the target video signal, matches the visual information stored in the storage unit and, if it matches, detects the visual information displayed superimposed on the target video signal and metadata including the predetermined action corresponding to that visual information and the input time of that visual information;
a timing detection unit that detects a reference timing, which is the timing serving as a reference for the action meant by the visual information detected by the key input detection unit, from at least one of (i) the target video signal, (ii) a target audio signal corresponding to the target video signal, and (iii) other already input visual information displayed superimposed on the target video signal and the metadata of that other already input visual information; and
a correction unit that corrects, based on the reference timing, the timing at which the visual information detected by the key input detection unit is displayed on a display unit,
wherein the visual information that means a predetermined action stored in the storage unit is superimposed on the target video signal at the reference timing corrected by the correction unit.

  The timing correction system of claim 1, wherein
  the timing detection unit detects, as the reference timing, at least one of:
  (1) the tempo of (i) the target video signal, when the target video signal has a constant tempo;
  (2) the tempo of (ii) the target audio signal, when the target audio signal has a constant tempo;
  (3) a representative value for each predetermined time interval, taken when the similarity between a feature amount extracted from (i) the target video signal and a feature amount extracted from a video signal obtained by capturing a predetermined action in advance is equal to or greater than a threshold value; and
  (4) a representative value for each predetermined time interval, taken when the similarity between a feature amount extracted from (ii) the target audio signal and a feature amount extracted from an audio signal corresponding to a video signal obtained by capturing a predetermined action in advance is equal to or greater than a threshold value.

The timing correction system of claim 1, wherein
the timing detection unit obtains the reference timing based on at least statistics of (iii) a plurality of pieces of other already input visual information displayed superimposed on the target video signal and the metadata of that other already input visual information, and
the timing detection unit extracts, from the other already input visual information, the display time of each piece of other already input visual information that means an action matching the predetermined action included in the metadata detected by the key input detection unit, and detects the mode of the extracted display times in each predetermined time interval as the reference timing for that time interval.

The timing correction system of claim 1, wherein
the timing detection unit obtains the reference timing based on at least statistics of (iii) a plurality of pieces of other already input visual information displayed superimposed on the target video signal and the metadata of that other already input visual information, and
the timing detection unit extracts, from the other already input visual information, the display time of each piece of other already input visual information that means an action matching the predetermined action included in the metadata detected by the key input detection unit, obtains the mean and variance of the extracted display times for each predetermined time interval, and uses, as the reference timing, a random number that follows a Gaussian distribution having the obtained mean and variance.

  The timing correction system according to any one of claims 1 to 3, wherein
  the timing detection unit obtains in advance the value of the variance that arises for each type of predetermined action, and uses, as a new reference timing, a random number that follows a Gaussian distribution centered on the reference timing and having, among the obtained variances, the variance corresponding to the predetermined action included in the metadata detected by the key input detection unit.

A timing correction method in which a predetermined action and visual information that means the predetermined action are stored in a storage unit in association with each other, the method comprising:
a key input detection step in which a key input detection unit determines whether visual information, which is input by a person viewing a target video signal and is displayed superimposed on the target video signal, matches the visual information stored in the storage unit and, if it matches, detects the visual information displayed superimposed on the target video signal and metadata including the predetermined action corresponding to that visual information and the input time of that visual information;
a timing detection step in which a timing detection unit detects a reference timing, which is the timing serving as a reference for the action meant by the visual information detected in the key input detection step, from at least one of (i) the target video signal, (ii) a target audio signal corresponding to the target video signal, and (iii) other already input visual information displayed superimposed on the target video signal and the metadata of that other already input visual information; and
a correction step in which a correction unit corrects, based on the reference timing, the timing at which the visual information detected in the key input detection step is displayed on a display unit,
wherein the visual information that means a predetermined action stored in the storage unit is superimposed on the target video signal at the reference timing corrected in the correction step.

  The timing correction method according to claim 6, wherein
  in the timing detection step, at least one of the following is detected as the reference timing:
  (1) the tempo of (i) the target video signal, when the target video signal has a constant tempo;
  (2) the tempo of (ii) the target audio signal, when the target audio signal has a constant tempo;
  (3) a representative value for each predetermined time interval, taken when the similarity between a feature amount extracted from (i) the target video signal and a feature amount extracted from a video signal obtained by capturing a predetermined action in advance is equal to or greater than a threshold value; and
  (4) a representative value for each predetermined time interval, taken when the similarity between a feature amount extracted from (ii) the target audio signal and a feature amount extracted from an audio signal corresponding to a video signal obtained by capturing a predetermined action in advance is equal to or greater than a threshold value.

The timing correction method according to claim 6, wherein
in the timing detection step, the reference timing is obtained based on at least statistics of (iii) a plurality of pieces of other already input visual information displayed superimposed on the target video signal and the metadata of that other already input visual information, and
the display time of each piece of other already input visual information that means an action matching the predetermined action included in the metadata detected in the key input detection step is extracted from the other already input visual information, and the mode of the extracted display times in each predetermined time interval is detected as the reference timing for that time interval.

The timing correction method according to claim 6, wherein
in the timing detection step, the reference timing is obtained based on at least statistics of (iii) a plurality of pieces of other already input visual information displayed superimposed on the target video signal and the metadata of that other already input visual information, and
the display time of each piece of other already input visual information that means an action matching the predetermined action included in the metadata detected by the key input detection unit is extracted from the other already input visual information, the mean and variance of the extracted display times are obtained for each predetermined time interval, and a random number that follows a Gaussian distribution having the obtained mean and variance is used as the reference timing.

  The timing correction method according to any one of claims 6 to 8, wherein
  in the timing detection step, the value of the variance that arises for each type of predetermined action is obtained in advance, and a random number that follows a Gaussian distribution centered on the reference timing and having, among the obtained variances, the variance corresponding to the predetermined action included in the metadata detected in the key input detection step is used as a new reference timing.

A program for causing a computer to function as the timing correction system according to any one of claims 1 to 5.