JP7344096B2

JP7344096B2 - Haptic metadata generation device, video-tactile interlocking system, and program

Info

Publication number: JP7344096B2
Application number: JP2019209200A
Authority: JP
Inventors: 正樹高橋; 真希子東; 拓也半田; 雅規佐野; 俊宏清水
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2023-09-13
Anticipated expiration: 2039-11-19
Also published as: JP2021082954A

Description

The present invention relates to a haptic metadata generation device that extracts person objects from video and generates haptic metadata corresponding to the dynamic person objects, to a video-tactile interlocking system that drives and controls a tactile presentation device based on the generated haptic metadata, and to a program.

Viewing video content such as ordinary camera footage appeals to two senses, sight and hearing. Stimulating the sense of touch in synchronization with the video enables viewing with a greater sense of presence and immersion. For example, when watching baseball footage, stimulating the viewer through a tactile presentation device at the moment the ball hits the bat lets the viewer vicariously experience the batter's hit. Providing tactile stimuli to people with visual impairments is also expected to help them follow the state of a sports match. Touch is thus expected to serve as a third sense in video viewing.

In sports in particular, real-time viewing is valued, so tactile stimuli for the video must be presented automatically and in real time. Presenting tactile stimuli synchronized with the players' movements is often effective for viewing video content that also uses the sense of touch.

Therefore, to realize viewing of video content combined with the sense of touch, it is necessary to extract the movement of person objects from the video content and to generate tactile information corresponding to the extracted movement as haptic metadata.

However, with conventional methods of generating haptic metadata, realizing video viewing combined with touch required manually editing, in synchronization with the video, the haptic metadata that specifies at what timing and with what kind of stimulus the tactile presentation device stimulates the user.

For recorded programs, haptic metadata can be edited manually over time. For live broadcast video, however, the tactile information cannot be edited in advance, so linking the stimulus presentation of a tactile presentation device to the video requires analyzing the video content and generating haptic metadata in real time.

In recent years, sports video analysis technology has advanced remarkably. The Hawk-Eye system for tennis, which is also used at Wimbledon, uses footage from multiple fixed cameras as sensors to track the tennis ball in three dimensions and make the IN/OUT decisions involved in officiating. At the 2014 FIFA World Cup, goal-line technology analyzed footage from several fixed cameras to automate goal decisions. Real-time video analysis in sports continues to become more sophisticated, as in the TRACAB system, which installs many stereo cameras in a soccer stadium and tracks all players on the field in real time.

On the other hand, measuring the posture of a player as a dynamic person object has conventionally relied on marker-based motion capture. This method requires attaching many markers to the player's body and cannot be applied to actual matches. In recent years, markerless measurement of human posture has become possible by using a depth sensor that reads an infrared pattern projected onto the player's body and obtains depth information from the distortion of that pattern. Various techniques that apply optical motion capture rather than the marker-based method have also been disclosed (see, for example, Patent Documents 1 and 2).

For example, Patent Document 1 discloses a device that measures the three-dimensional position of a subject's skeleton by optical motion capture, for use in a stereoscopic virtual reality system that teaches a user a motion by displaying video of another person's model motion. Patent Document 2 discloses a training evaluation device that measures a player's motion by optical motion capture and evaluates the player's form based on the measured data and data on a model form. However, because these techniques rely on motion capture, they cannot be applied to actual matches, and it is difficult to measure a person's playing motion from general-purpose camera footage.

A device has also been disclosed that, without motion capture, simulates a person's movement solely from camera footage of a badminton match or badminton practice played by one player or by a pair of two players (see, for example, Patent Document 3). The technique of Patent Document 3 detects actions such as shots from the captured footage, but it presupposes footage from a specially installed dedicated camera, and it is difficult to measure a person's playing motion from general-purpose broadcast camera footage.

Meanwhile, recent advances in deep learning have made it possible to estimate a person's skeleton positions from an ordinary still image that contains no depth information, without a depth sensor, which was previously difficult. With this technology, still images can be extracted from ordinary camera footage and the posture of the players in those still images can be measured automatically.

Patent Document 1: Japanese Patent Application Publication No. 2002-8063
Patent Document 2: Japanese Patent Application Publication No. 2002-253718
Patent Document 3: Japanese Patent Application Publication No. 2018-187383

As described above, to realize viewing of video content combined with the sense of touch, it is necessary to extract the movement of person objects from the video content and to generate tactile information corresponding to the extracted movement as haptic metadata.

However, with conventional techniques it is difficult to generate haptic metadata in real time from video analysis of the video content alone. That is, generating haptic metadata from video alone requires analyzing the movement of person objects in real time from camera footage. In live sports, interfering with the competition is undesirable, so it is preferable to generate haptic metadata solely from general-purpose broadcast camera footage with no restrictions on shooting conditions, without marker-based motion capture or depth sensors whose shooting distance is limited.

In other words, a technique is desired that generates haptic metadata on the movement of person objects (players and the like) automatically and in real time, solely from ordinary camera footage of sports.

Although recent advances in deep learning have made it possible to estimate a person's skeleton positions from an ordinary still image without depth information or a depth sensor, skeleton detection algorithms of this kind basically detect skeleton positions one still image at a time. Further ingenuity is therefore needed to generate haptic metadata on the movement of person objects (players and the like) automatically and in real time, solely from ordinary camera footage of sports.

In view of the above problems, an object of the present invention is to provide a haptic metadata generation device that automatically extracts person objects from video and automatically generates, in synchronization, haptic metadata corresponding to the dynamic person objects; a video-tactile interlocking system that drives and controls a tactile presentation device based on the generated haptic metadata; and a program.

The haptic metadata generation device of the present invention extracts person objects from video and generates haptic metadata corresponding to the dynamic person objects, and comprises: multiple frame extraction means for extracting, from input video, frame images for a plurality of frames consisting of the current frame image and past frame images; human skeleton extraction means for generating, for each of the frame images for the plurality of frames including the current frame image, a first skeleton coordinate set of each person object based on a skeleton detection algorithm; person identification means for identifying the person objects, for each of the frame images for the plurality of frames including the current frame image, by extracting the position and size of the skeleton of each person object and the surrounding image information based on the first skeleton coordinate set, and for generating a second skeleton coordinate set to which person IDs are assigned; trajectory feature generation means for concatenating in time series, with the current frame image as the reference, the second skeleton coordinate sets in the frame images for the plurality of frames, and for generating a skeleton trajectory set as a set of trajectory features indicating the trajectory of the skeleton of each person object; human motion recognition means for detecting, by machine learning based on the trajectory features of the skeleton trajectory set, information for activating the tactile presentation device; and metadata generation means for generating haptic metadata for activating the tactile presentation device in correspondence with the current frame image and outputting it externally frame by frame.

The haptic metadata generation device of the present invention may further comprise moving object detection means that detects moving objects from difference images between adjacent frames, using each of the frame images for the plurality of frames including the current frame image; selects a specific moving object from among the moving objects detected in each difference image, using the skeleton trajectory sets of all person objects; and generates moving object information that concatenates, as elements, the coordinate position, size, and movement direction of the specific moving object obtained from each difference image. In this case, the human motion recognition means selects, based on the moving object information, the skeleton trajectory set for activating the tactile presentation device from among the skeleton trajectory sets of all person objects, and detects, by machine learning based on the trajectory features of the selected skeleton trajectory set, information for activating the tactile presentation device.

Further, in the haptic metadata generation device of the present invention, the human motion recognition means detects, for generation of the haptic metadata and in correspondence with the current frame image, the identification, position coordinates, and classification of each person object in the current frame image, together with information, obtained by the machine learning, that indicates the timing and speed at which the tactile presentation device is to be activated.

Furthermore, the video-tactile interlocking system of the present invention comprises: the haptic metadata generation device of the present invention; a tactile presentation device that presents tactile stimuli; and a control unit that, based on the haptic metadata obtained from the haptic metadata generation device, refers to predetermined drive reference data and controls the driving of the tactile presentation device.

Furthermore, the program of the present invention is configured as a program for causing a computer to function as the haptic metadata generation device of the present invention.

According to the present invention, person objects can be automatically extracted from video, and haptic metadata corresponding to the dynamic person objects can be automatically generated in synchronization. This makes it possible to present tactile stimuli while sports video is viewed in real time. By appealing to the sense of touch in addition to sight and hearing, the invention can provide a sense of presence and immersion that conventional video viewing cannot convey. It also makes it possible to convey the state of a sports event clearly to people with visual impairments.

In particular, when sports video is viewed, generating haptic metadata that includes the identification, position coordinates, and classification (team classification) of each player, together with information indicating the timing and speed at which the tactile presentation device is to be activated, allows the tactile presentation device to present the user with tactile stimuli reflecting the type, timing, and intensity of play. This leads to better services such as public viewing, entertainment, and future tactile broadcasting that use tactile information. Beyond sports, the invention can also be applied to a variety of uses, such as tactile alarms in factories and security systems based on analysis of surveillance camera video.

FIG. 1 is a block diagram showing the schematic configuration of a video-tactile interlocking system including a haptic metadata generation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a processing example of the haptic metadata generation device according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram of the human skeleton extraction processing in the haptic metadata generation device according to an embodiment of the present invention.
FIG. 4(a) is a diagram illustrating one frame image, and FIG. 4(b) is a diagram showing an example of human skeleton extraction from one frame image in the haptic metadata generation device according to an embodiment of the present invention.
FIG. 5 is an explanatory diagram of the trajectory features in the haptic metadata generation device according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of a difference image generated for moving object detection in the haptic metadata generation device according to an embodiment of the present invention.
FIG. 7 is a block diagram showing the schematic configuration of the control unit in the video-tactile interlocking system according to an embodiment of the present invention.

(System configuration)
Hereinafter, a video-tactile interlocking system 1 including a haptic metadata generation device 12 according to an embodiment of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of the video-tactile interlocking system 1 including the haptic metadata generation device 12 according to an embodiment of the present invention.

The video-tactile interlocking system 1 shown in FIG. 1 comprises: a haptic metadata generation device 12 that receives video from a video output device 10 such as a camera or a recording device, automatically extracts person objects from the input video, and automatically generates, in synchronization, haptic metadata corresponding to the dynamic person objects; two tactile presentation devices 14L and 14R in this example; and a control unit 13 that individually drives and controls the tactile presentation devices 14L and 14R based on the generated haptic metadata.

In this example, the video output by the video output device 10 is a doubles badminton match between teams A and B captured in real time; it is shown on a display 11 and viewed by a user U.

Badminton is a sport in which the players, divided into their own side and the opposing side across a net, hit a shuttlecock back and forth with rackets. Presenting a tactile stimulus to the user U through the tactile presentation devices 14L and 14R at the moment a racket hits the shuttlecock heightens the sense of presence and can also convey the state of the match to visually impaired viewers.

The user U therefore holds the tactile presentation device 14L in the left hand HL and the tactile presentation device 14R in the right hand HR, and in this example is presented with vibration stimuli synchronized with the video analysis. The control unit 13 may instead drive and control only one tactile presentation device, or may individually drive and control three or more tactile presentation devices. Although the invention is not limited to this arrangement, the control unit 13 in this example sorts the stimuli so that vibration stimuli corresponding to the movement of the person objects of team A are presented by the tactile presentation device 14L, and vibration stimuli corresponding to the movement of the person objects of team B are presented by the tactile presentation device 14R.
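The team-based sorting described above can be pictured with a minimal sketch. It assumes a hypothetical `VibrationCommand` record and a normalized strength scale of 0.0 to 1.0, neither of which appears in the source; only the pairing of team A with device 14L and team B with device 14R is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class VibrationCommand:
    """Hypothetical command passed to one tactile presentation device."""
    device: str      # "14L" or "14R"
    strength: float  # assumed normalized scale, 0.0 to 1.0


# Pairing taken from the example in the text: team A -> 14L, team B -> 14R.
TEAM_TO_DEVICE = {"A": "14L", "B": "14R"}


def route_stimulus(team: str, strength: float) -> VibrationCommand:
    """Pick the device for a team and clamp the strength to the assumed range."""
    device = TEAM_TO_DEVICE[team]
    clamped = max(0.0, min(1.0, strength))
    return VibrationCommand(device=device, strength=clamped)
```

A configuration with one device, or with three or more, would only change the lookup table in this sketch.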

Each of the tactile presentation devices 14L and 14R houses, in a spherical case 141, a vibration actuator 142 capable of presenting vibration stimuli under the control of the control unit 13. The tactile presentation devices 14L and 14R may present electromagnetic pulse stimuli instead of vibration stimuli. In this example, the control unit 13 is connected by wire to each of the tactile presentation devices 14L and 14R, and the haptic metadata generation device 12 is connected by wire to the control unit 13, but each of these connections may instead be a wireless connection using short-range wireless communication.

The haptic metadata generation device 12 includes a multiple frame extraction unit 121, a human skeleton extraction unit 122, a person identification unit 123, a trajectory feature generation unit 124, a moving object detection unit 125, a human motion recognition unit 126, and a metadata generation unit 127.

The multiple frame extraction unit 121 extracts, from the input video, T frame images (T is an integer of 2 or more) consisting of the current frame image and past frame images, and outputs them to the human skeleton extraction unit 122 and the moving object detection unit 125.
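One way to hold the current frame together with the preceding frames is a fixed-length sliding window. The sketch below is an illustration under that assumption; the class name `FrameWindow` and its methods are hypothetical and are not taken from the source.

```python
from collections import deque


class FrameWindow:
    """Keeps the most recent T frames, oldest first and current frame last."""

    def __init__(self, t: int):
        if t < 2:
            raise ValueError("T must be an integer of 2 or more")
        self._frames = deque(maxlen=t)

    def push(self, frame) -> bool:
        """Add the newest frame; return True once T frames are available."""
        self._frames.append(frame)
        return self.is_full()

    def is_full(self) -> bool:
        return len(self._frames) == self._frames.maxlen

    def frames(self) -> list:
        """Return the window contents as a list (a copy, not the live buffer)."""
        return list(self._frames)
```

Because the deque drops the oldest entry automatically, each new frame costs constant time regardless of T.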

For each of the T frame images including the current frame image, the human skeleton extraction unit 122 generates, based on a skeleton detection algorithm, a skeleton coordinate set P^n_b (n: number of detected persons, b: skeleton ID) for each person object (hereinafter also simply called a "person"), and outputs it to the person identification unit 123 together with the T frame images including the current frame image.

For each of the T frame images including the current frame image, the person identification unit 123 identifies each person by extracting, based on the skeleton coordinate set P^n_b, the position and size of the person's skeleton and the surrounding image information, generates a skeleton coordinate set P^i_b (i: person ID, b: skeleton ID) to which person IDs are assigned, and outputs it to the trajectory feature generation unit 124.
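As a rough illustration of the "position and size of the skeleton" used for identification, the sketch below derives a bounding box from one person's skeleton coordinates. Using the box center as the position and the box extent as the size is an assumption made here for illustration; the function name `skeleton_box` is hypothetical, and the use of surrounding image information is not covered.

```python
def skeleton_box(joints: dict):
    """Summarize one person's skeleton as a box center and extent.

    joints maps skeleton ID b -> (x, y), or None for an undetected point.
    Returns None when no point was detected.
    """
    points = [p for p in joints.values() if p is not None]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    return {
        "position": ((min_x + max_x) / 2, (min_y + max_y) / 2),  # box center
        "size": (max_x - min_x, max_y - min_y),                  # width, height
    }
```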

With the current frame image as the reference, the trajectory feature generation unit 124 concatenates in time series the skeleton coordinate sets P^i_b of the T frame images, generates a skeleton trajectory set T^i_b (i: person ID, b: skeleton ID) as a set of trajectory features indicating the skeleton trajectory of each person, and outputs it to the moving object detection unit 125 and the human motion recognition unit 126.
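The time-series concatenation can be sketched as follows. The nested-dictionary layout (person ID, then skeleton ID, then an `(x, y)` pair) and the choice to keep only persons present in the current frame, filling earlier gaps with `None`, are assumptions made for this illustration; the source does not specify a data layout.

```python
def build_trajectories(frames: list) -> dict:
    """Concatenate per-frame skeleton coordinates into per-person trajectories.

    frames: T entries, oldest first and current frame last. Each entry maps
            person ID i -> {skeleton ID b -> (x, y)}.
    Returns {i: {b: [coord at oldest frame, ..., coord at current frame]}},
    using None where a person or point was absent from a frame.
    """
    current = frames[-1]  # the current frame decides which persons are tracked
    trajectories = {}
    for person_id, joints in current.items():
        trajectories[person_id] = {
            joint_id: [f.get(person_id, {}).get(joint_id) for f in frames]
            for joint_id in joints
        }
    return trajectories
```

Each resulting list has length T, so a downstream classifier receives fixed-size input per skeleton point.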

The moving object detection unit 125 detects moving objects from difference images between adjacent frames, using each of the T frame images including the current frame image; selects a specific moving object from among the moving objects detected in each difference image, using the skeleton trajectory sets T^i_b of all persons obtained from the trajectory feature generation unit 124; generates moving object information that concatenates, as elements, the coordinate position, size, and movement direction of the specific moving object obtained from each difference image; and outputs it to the human motion recognition unit 126.
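The frame-differencing part of this step can be illustrated with a small sketch on grayscale frames held as nested lists. The threshold value, the treatment of all changed pixels as a single object, and the function names are assumptions made for illustration; selecting the specific moving object with the skeleton trajectory sets is not covered here.

```python
def diff_mask(prev: list, curr: list, threshold: int = 30) -> list:
    """Mark pixels whose brightness changed by more than the threshold."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def blob_summary(mask: list):
    """Summarize all changed pixels as one object: centroid and extent."""
    points = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {
        "center": (sum(xs) / len(xs), sum(ys) / len(ys)),
        "size": (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1),
    }


def direction(center_a: tuple, center_b: tuple) -> tuple:
    """Displacement of the object center between two difference images."""
    return (center_b[0] - center_a[0], center_b[1] - center_a[1])
```

Concatenating the center, size, and direction obtained from successive difference images gives a record of the kind the text calls moving object information.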

Based on the moving object information, the human motion recognition unit 126 selects, from among the skeleton trajectory sets T^i_b of all persons, the skeleton trajectory set T^i_b for activating the tactile presentation device; detects, by machine learning (a support vector machine, a neural network, or the like) based on the trajectory features of the selected skeleton trajectory set T^i_b, information indicating the timing and speed at which the tactile presentation device is to be activated; and outputs it to the metadata generation unit 127.

In correspondence with the current frame image, the metadata generation unit 127 generates haptic metadata that includes the identification, position coordinates, and classification (team classification) of each person in the current frame image, together with information indicating the timing and speed at which the tactile presentation device is to be activated, and outputs it to the control unit 13 frame by frame.
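A per-frame metadata record carrying the items listed above might look like the following sketch. The field names, the boolean `trigger` flag used to express timing, and the JSON serialization are all assumptions made for illustration; the source does not define a concrete format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PersonEntry:
    person_id: int   # identification of the person
    position: tuple  # position coordinates (x, y) in the frame
    team: str        # classification, e.g. "A" or "B"


@dataclass
class HapticMetadata:
    frame_index: int
    persons: list    # list of PersonEntry
    trigger: bool    # assumed timing flag: activate the device at this frame
    speed: float     # speed information for the activation


def to_json(meta: HapticMetadata) -> str:
    """Serialize one frame's metadata; tuples become JSON arrays."""
    return json.dumps(asdict(meta))
```

Emitting one such record per frame matches the frame-by-frame output described in the text, and a control unit could ignore records whose `trigger` flag is false.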

The haptic metadata generation processing in the haptic metadata generation device 12 is described below in more detail, following FIG. 2 and with reference to FIGS. 3 to 6.

(Haptic metadata generation processing)
FIG. 2 is a flowchart showing a processing example of the haptic metadata generation device 12 according to an embodiment of the present invention. FIG. 3 is an explanatory diagram of the human skeleton extraction processing in the haptic metadata generation device 12. FIG. 4(a) is a diagram illustrating one frame image, and FIG. 4(b) is a diagram showing an example of human skeleton extraction from one frame image in the haptic metadata generation device 12. FIG. 5 is an explanatory diagram of the trajectory features in the haptic metadata generation device 12. FIG. 6 is a diagram showing an example of a difference image generated for moving object detection in the haptic metadata generation device 12.

As shown in FIG. 2, the haptic metadata generation device 12 first uses the multiple frame extraction unit 121 to extract, from the input video, T frame images (T is an integer of 2 or more) consisting of the current frame image and past frame images (step S1).

Next, the haptic metadata generation device 12 uses the human skeleton extraction unit 122 to generate, for each of the T frame images including the current frame image, a skeleton coordinate set P^n_b (n: number of detected persons, b: skeleton ID) of each person based on a skeleton detection algorithm (step S2).

Recent advances in deep learning have made it possible to estimate a person's skeleton positions from ordinary images. As typified by OpenPose and VisionPose (NextSystem), some skeleton detection algorithms have been released as open source. The human skeleton extraction unit 122 in this example uses VisionPose to detect 30 skeleton points of each person in every frame image, as shown in FIG. 3, and generates a skeleton coordinate set P^n_b indicating their position coordinates.

With VisionPose, the coordinate positions of the following points in FIG. 3 can be obtained, and the points can be drawn connected by lines as illustrated:
P^n_1: head
P^n_2: nose
P^n_3: left eye
P^n_4: right eye
P^n_5: left ear
P^n_6: right ear
P^n_7: neck
P^n_8: spine (shoulder)
P^n_9: left shoulder
P^n_10: right shoulder
P^n_11: left elbow
P^n_12: right elbow
P^n_13: left wrist
P^n_14: right wrist
P^n_15: left hand
P^n_16: right hand
P^n_17: left thumb
P^n_18: right thumb
P^n_19: left fingertip
P^n_20: right fingertip
P^n_21: spine (middle)
P^n_22: spine (base)
P^n_23: left hip
P^n_24: right hip
P^n_25: left knee
P^n_26: right knee
P^n_27: left ankle
P^n_28: right ankle
P^n_29: left foot
P^n_30: right foot

FIG. 4(b) shows a frame image Fa obtained by applying the VisionPose skeleton detection algorithm to one frame image F, shown in FIG. 4(a), of a doubles badminton match between teams A and B. The frame image F in FIG. 4(a) contains person objects Op1 and Op2 of team A, person objects Op3 and Op4 of team B, and non-person moving objects Om1, Om2, Om3, Om4, and Om5 (rackets and the shuttle). When the VisionPose skeleton detection algorithm is applied, skeleton coordinate sets P^1_b, P^2_b, P^3_b, and P^4_b corresponding to the person objects Op1, Op2, Op3, and Op4, respectively, can be estimated and generated, as shown in FIG. 4(b). As FIG. 4(b) shows, the skeleton of each person is estimated relatively accurately even in badminton, which involves vigorous movement. Because the skeleton detection algorithm only performs estimation on individual still images, the tactile metadata generation device 12 subsequently identifies each person, quantifies the transition of each person's skeleton positions as trajectory features, and performs highly accurate motion recognition that takes the time axis into account.

Next, in the tactile metadata generation device 12, the person identification unit 123 identifies each person in each of the T frame images including the current frame image by extracting, based on the skeleton coordinate set P^n_b, the position and size of each person's skeleton and the surrounding image information, and generates a skeleton coordinate set P^i_b (i: person ID, b: skeleton ID) to which a person ID is assigned (step S3).

The human skeleton extraction unit 122 described above makes it possible to detect the skeletons of one or more persons, as the skeleton coordinate set P^n_b, in each of the T frame images including the current frame image. However, the skeleton coordinate set P^n_b of each frame image carries no information about who each skeleton belongs to, so the skeleton of each person must be identified. This identification uses image information near the coordinates of each skeleton coordinate set P^n_b in each frame image. That is, the person identification unit 123 identifies each person by extracting, based on the skeleton coordinate set P^n_b, the position and size of each person's skeleton and the surrounding image information (color information, and texture information near the face or back), and generates the skeleton coordinate set P^i_b (i: person ID, b: skeleton ID) to which a person ID is assigned.

For example, when a badminton match is shot with the court running vertically in the frame, a skeleton whose coordinate set P^n_b lies in the upper part of the frame image F can be identified as a player on the far side, and one in the lower part as a player on the near side. In judo, matches are fought in white and blue uniforms, so players can be identified by referring to the color information of the frame image F near the skeleton positions of each skeleton coordinate set P^n_b.

The human skeleton extraction unit 122 described above often also detects the skeletons of people other than the players, such as referees and spectators, who are not targets of tactile stimulus presentation. Referees usually wear clothing different from that of the players and can therefore be distinguished by color information. Spectators are usually farther from the camera than the players and can therefore be distinguished by skeleton size. In this way, the players who are the targets of tactile stimulus presentation can be identified by setting surrounding image information suitable for person identification (color information, and texture information near the face or back) in view of the rules of each sport and the shooting conditions.
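The position and size cues described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the "far"/"near" labels, and the `min_height` threshold are hypothetical, the court is assumed to run vertically in the frame, and the color and texture cues are omitted.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Skeleton = Dict[int, Point]  # skeleton ID b -> (x, y) in pixels


def skeleton_height(skeleton: Skeleton) -> float:
    """Vertical extent of the detected skeleton, used as its size."""
    ys = [y for _, y in skeleton.values()]
    return max(ys) - min(ys)


def skeleton_center_y(skeleton: Skeleton) -> float:
    """Mean y coordinate of the skeleton points."""
    ys = [y for _, y in skeleton.values()]
    return sum(ys) / len(ys)


def identify_people(skeletons: List[Skeleton], frame_height: int,
                    min_height: float) -> Dict[str, Skeleton]:
    """Drop small skeletons (assumed spectators) and label the rest by court side."""
    identified: Dict[str, Skeleton] = {}
    for skeleton in skeletons:
        if skeleton_height(skeleton) < min_height:
            continue  # too small: treated as a distant spectator
        side = "far" if skeleton_center_y(skeleton) < frame_height / 2 else "near"
        # Number the players on the same side in detection order.
        index = sum(1 for key in identified if key.startswith(side)) + 1
        identified[f"{side}-{index}"] = skeleton
    return identified
```

A real system would combine these cues with the color information mentioned above, for example to separate referees from players.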

Next, in the tactile metadata generation device 12, the trajectory feature generation unit 124 connects, in time series and with the current frame image as the reference, the skeleton coordinate sets P^i_b in the T frame images, and generates a skeleton trajectory set T^i_b (i: person ID, b: skeleton ID) as a set of trajectory features indicating the trajectory of each person's skeleton (step S4).

Here, in generating the skeleton trajectory set T^i_b, the skeleton coordinate set P^i_b in an arbitrary frame image is first written as P^i_b(t). With the current frame image at t = 0, the skeleton coordinate set in the current frame image is written P^i_b(0), and the skeleton coordinate set in the frame image T frames in the past is written P^i_b(T). In other words, with the frame number of the current frame image set to t = 0 and the frame number T frames in the past set to t = T, the trajectory feature generation unit 124 can use the frame images F at t = 0, 1, ..., T, with the current frame image as the reference, to generate the skeleton trajectory set T^i_b as a set of trajectory features indicating the trajectory of each person's skeleton.

The skeleton coordinates used to generate the skeleton trajectory set T^i_b need not be all 30 points shown in FIG. 3; the configuration may use only predetermined specific skeleton trajectories in order to increase processing speed. The skeleton trajectory set T^i_b may simply concatenate the coordinate representation of the skeleton coordinate sets P^i_b. However, since it only needs to indicate the trajectory of each person's skeleton, it may instead be converted into information suited to representing the trajectory characteristics (such as the amount of movement or movement acceleration), in view of the rules of each sport and the shooting conditions.
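As a rough sketch of this step, the per-frame coordinate sets of one person can be concatenated into per-joint tracks while keeping only a chosen subset of skeleton IDs. The function name and the choice of joints are hypothetical, and the sketch assumes every selected joint was detected in every frame.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Skeleton = Dict[int, Point]  # skeleton ID b -> (x, y)


def build_trajectories(frames: Sequence[Skeleton],
                       joint_ids: Sequence[int]) -> Dict[int, List[Point]]:
    """Concatenate P^i_b(t) over t = 0, 1, ..., T for the selected joints.

    frames[t] is the skeleton coordinate set of one person at frame t,
    where t = 0 is the current frame and larger t is further in the past.
    """
    trajectories: Dict[int, List[Point]] = {b: [] for b in joint_ids}
    for skeleton in frames:
        for b in joint_ids:
            trajectories[b].append(skeleton[b])
    return trajectories
```

For badminton, for instance, one might pass only the wrist and elbow IDs (11 to 14 in the list for FIG. 3) to reduce the amount of data processed per frame.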

For example, in badminton the movement acceleration of the arms and body increases when a racket is swung, so it is preferable to take the second derivative of the movement of each skeleton point and convert it into a trajectory feature corresponding to acceleration. Representing the trajectory of the skeleton coordinate set P^i_b as a skeleton trajectory set T^i_b corresponding to acceleration improves the accuracy of motion recognition in the human motion recognition unit 126 at the subsequent stage.

First, as shown in equation (1), the difference in position coordinates (the Euclidean distance) between the corresponding skeleton coordinate sets P^i_b(t) and P^i_b(t+1) in adjacent image frames is taken to obtain the movement amount D^i_b(t).
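Equation (1) itself is not reproduced in this text. Based on the description (a Euclidean distance between corresponding coordinates in adjacent frames), it presumably has the following form; this is a reconstruction under that assumption, not the published formula, with the subscripts x and y standing for the x and y coordinates of a point:

```latex
D^{i}_{b}(t) = \sqrt{\left(P^{i}_{b}(t+1)_{x} - P^{i}_{b}(t)_{x}\right)^{2}
                   + \left(P^{i}_{b}(t+1)_{y} - P^{i}_{b}(t)_{y}\right)^{2}}
\tag{1}
```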

Here, P^i_b(t),x denotes the x coordinate of P^i_b(t), and P^i_b(t),y denotes the y coordinate of P^i_b(t).

D^i_b(t) is a feature corresponding to the velocity of each coordinate point. By further taking the absolute value of the difference between successive values of D^i_b(t), as shown in equation (2), a feature A^i_b(t) corresponding to acceleration is obtained. Here, abs() is a function that returns the absolute value.
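Equation (2) is likewise not reproduced in this text. From the description it is presumably the absolute difference of the movement amounts at adjacent time steps; the following is a reconstruction under that assumption (the order of the two terms does not matter because of the absolute value):

```latex
A^{i}_{b}(t) = \mathrm{abs}\!\left(D^{i}_{b}(t+1) - D^{i}_{b}(t)\right)
\tag{2}
```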

Using this feature A^i_b(t) corresponding to acceleration, the skeleton trajectory set T^i_b indicating the tracked motion of each person can be generated. FIG. 5 shows a frame image Fb in which the trajectory features T^1_b and T^3_b of the skeleton coordinate sets corresponding to the person objects Op1 and Op3 in a certain frame image are drawn for ease of understanding.
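The two-step computation described for equations (1) and (2) can be sketched for a single skeleton point as follows. The function names are hypothetical, and the input is assumed to be the list of (x, y) positions of one joint of one person over consecutive frames.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def movement_amounts(track: Sequence[Point]) -> List[float]:
    """Velocity-like feature D(t): Euclidean distance between adjacent frames."""
    return [math.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


def acceleration_features(track: Sequence[Point]) -> List[float]:
    """Acceleration-like feature A(t): absolute change of D between adjacent steps."""
    d = movement_amounts(track)
    return [abs(d1 - d0) for d0, d1 in zip(d, d[1:])]
```

A track of T + 1 positions yields T movement amounts and T - 1 acceleration features, so the window length has to be chosen with this shrinkage in mind.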

Next, in the tactile metadata generation device 12, the moving object detection unit 125 detects moving objects based on difference images between adjacent frames, using each of the T frame images including the current frame image. From the moving objects detected in each difference image, it selects a specific moving object using the skeleton trajectory sets T^i_b of all persons obtained from the trajectory feature generation unit 124, and generates moving object information in which the coordinate position, size, and movement direction of the specific moving object obtained from each difference image are concatenated as elements (step S5).

The human motion recognition unit 126 at the subsequent stage can recognize a person's motion using the skeleton trajectory set T^i_b, but the motions of persons (players) are highly varied, and false detections and missed detections are not uncommon. The moving object detection unit 125 therefore uses each of the T frame images including the current frame image to extract, in addition to the movement of persons, information on the position and movement of moving objects such as rackets and the shuttle. Using this information, the human motion recognition unit 126 at the subsequent stage can further improve the accuracy of motion recognition.

In badminton, for example, taking the position and movement direction of the shuttle into account makes it possible to concentrate the analysis on the players of the team that will hit the shuttle next, which suppresses false detections and missed detections. To track the movement of the shuttle, the moving object detection unit 125 first creates a difference image Fc between adjacent frames, as shown in FIG. 6, and thereby extracts only the regions of moving objects from the frame image.

As the difference image Fc in FIG. 6 shows, person objects Op1' and Op3' and non-person moving objects (rackets or the shuttle) Om1', Om3', and Om5' are detected. Moving objects other than the shuttle are thus also detected in the difference image Fc, but the shuttle can be detected by masking the person (player) regions on the difference image Fc using the skeleton trajectory set T^i_b obtained from the trajectory feature generation unit 124. That is, the skeleton trajectory set T^i_b obtained from the trajectory feature generation unit 124 makes it possible to create a difference image from which the noise of the person (player) regions has been removed, so that moving objects related to the persons' movements (the shuttle in particular) can be detected stably.

More specifically, the moving object detection unit 125 applies labeling processing to the created difference image Fc to assign IDs to the non-person moving objects and, using the skeleton trajectory set T^i_b obtained from the trajectory feature generation unit 124, selects from each difference image the object most likely to be the shuttle based on the color information, size, and motion information of each moving object. The moving object detection unit 125 then generates moving object information in which the coordinate position, size, and movement direction of the specific moving object (the shuttle) obtained from the difference images Fc between adjacent frames are concatenated as elements over N frames.
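The frame differencing, person masking, labeling, and selection steps can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: frames are small grayscale arrays, person regions are given as bounding boxes (assumed here to be derived from the skeleton coordinates and to lie inside the image), and "most shuttle-like" is reduced to "closest to an expected pixel area". All names and thresholds are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]           # grayscale frame as rows of pixel values
Mask = List[List[bool]]
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), inclusive person region
Blob = List[Tuple[int, int]]      # pixels (x, y) of one labeled object


def difference_mask(prev: Image, curr: Image, threshold: int) -> Mask:
    """Mark pixels whose value changed by more than threshold between frames."""
    return [[abs(c - p) > threshold for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def mask_people(mask: Mask, boxes: List[Box]) -> None:
    """Clear the moving pixels inside person regions (in place)."""
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                mask[y][x] = False


def label_blobs(mask: Mask) -> List[Blob]:
    """4-connected labeling by flood fill; returns one pixel list per object."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    blobs: List[Blob] = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            stack, blob = [(x, y)], []
            seen[y][x] = True
            while stack:
                cx, cy = stack.pop()
                blob.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            blobs.append(blob)
    return blobs


def pick_shuttle(blobs: List[Blob],
                 expected_area: int) -> Optional[Tuple[float, float, int]]:
    """Return (center x, center y, area) of the blob closest to the expected area."""
    if not blobs:
        return None
    best = min(blobs, key=lambda blob: abs(len(blob) - expected_area))
    cx = sum(x for x, _ in best) / len(best)
    cy = sum(y for _, y in best) / len(best)
    return cx, cy, len(best)
```

The movement direction mentioned in the description could then be estimated from the change of the selected center between consecutive difference images.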

Next, in the tactile metadata generation device 12, the human motion recognition unit 126 selects, based on the moving object information, the skeleton trajectory set T^i_b for operating the tactile presentation devices from among the skeleton trajectory sets T^i_b of all persons. Based on the trajectory features of the selected skeleton trajectory set T^i_b, it then detects, by machine learning (a support vector machine, a neural network, or the like), information indicating the timing and speed at which the tactile presentation devices 14R and 14L are to be operated (step S6).

For the machine learning (a support vector machine, a neural network, or the like), trajectory features for training are created and learned in advance. For example, when a support vector machine is used, the trajectory features at the moment the shuttle is hit are learned as positive examples and all other trajectory features as negative examples. This allows the human motion recognition unit 126 to detect, as motion recognition, information indicating the timing and speed at which the tactile presentation devices 14R and 14L are to be operated. Furthermore, by taking the position coordinates of the shuttle into account together with the selected skeleton trajectory set T^i_b, the human motion recognition unit 126 can improve the accuracy of motion recognition and can also detect the identification, position coordinates, and classification (team classification) of each person in the current frame image, such as which player hit the shuttle.
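To make the positive/negative training scheme concrete, the following sketch trains a linear support vector machine by sub-gradient descent on the regularized hinge loss. The description does not specify the kernel, solver, or feature layout, so all of this is an assumption chosen for brevity; the hyperparameters and function names are hypothetical, and only hit detection (not the speed estimate) is covered.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(samples: Sequence[Sequence[float]], labels: Sequence[int],
                     epochs: int = 200, lr: float = 0.1,
                     reg: float = 0.01) -> Tuple[List[float], float]:
    """Fit weights w and bias b; labels are +1 (hit moment) or -1 (other)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Sample violates the margin: hinge-loss sub-gradient step.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularization term contributes.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if the feature vector is classified as a hit moment, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

In use, each sample would be a vector of trajectory features (for example, acceleration features of selected joints over the window), and a frame classified as +1 would mark a candidate timing for driving the tactile presentation devices.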

Finally, in the tactile metadata generation device 12, the metadata generation unit 127 generates, in correspondence with the current frame image, tactile metadata including the identification, position coordinates, and classification (team classification) of each person in the current frame image and information indicating the timing and speed at which the tactile presentation devices are to be operated, and outputs it to the control unit 13 frame by frame (step S7).
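One possible shape for such per-frame metadata is sketched below. The description lists the kinds of information carried but not a concrete format, so the field names, the use of JSON, and the convention that a missing hit is encoded as `None` are all assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PersonInfo:
    person_id: int   # identification of the person
    team: str        # classification (team classification)
    x: float         # position coordinates in the frame
    y: float


@dataclass
class TactileMetadata:
    frame_number: int
    people: List[PersonInfo] = field(default_factory=list)
    hit_person_id: Optional[int] = None  # None when no hit is detected in this frame
    hit_speed: Optional[float] = None    # speed information for the detected motion


def to_json(metadata: TactileMetadata) -> str:
    """Serialize one frame's metadata for transmission to the control unit."""
    return json.dumps(asdict(metadata), sort_keys=True)
```

Emitting one such record per frame keeps the metadata aligned with the video, with the timing implied by the frame number of a record that contains a hit.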

The tactile metadata generation device 12 repeats the processing of steps S1 to S7 each time a frame image of the video is input from the video output device 10.

(Control unit)
FIG. 7 is a block diagram showing a schematic configuration of the control unit 13 in the video-tactile interlocking system 1 according to one embodiment of the present invention. The control unit 13 includes a metadata receiving unit 131, an analysis unit 132, a storage unit 133, and drive units 134-1 and 134-2.

The metadata receiving unit 131 is a functional unit that receives the tactile metadata from the tactile metadata generation device 12 and outputs it to the analysis unit 132. The tactile metadata includes the identification, position coordinates, and classification (team classification) of each person in the current frame image, and information indicating the timing and speed at which the tactile presentation devices are to be operated.

The analysis unit 132 is a functional unit that, based on the tactile metadata obtained from the tactile metadata generation device 12, refers to predetermined drive reference data and controls, via the drive units 134-1 and 134-2, the driving of the vibration actuator 142 of each of the corresponding tactile presentation devices 14L and 14R. For example, when either player of team A hits the shuttle, the analysis unit 132 refers to the predetermined drive reference data using the person identification, position coordinates, and classification (team classification) and the timing and speed for operating the tactile presentation device contained in the tactile metadata, determines the operation timing, strength, and operation time of the vibration actuator 142 of the tactile presentation device 14L, and controls its driving.

The storage unit 133 stores the predetermined drive reference data for controlling the driving of the drive units 134-1 and 134-2 based on the tactile metadata. The drive reference data expresses, as a predetermined table or function, the operation timing, strength, and operation time of the vibration actuator 142 as the tactile stimulus associated with the tactile metadata. The storage unit 133 also stores a program for realizing the functions of the control unit 13. That is, the functions of the control unit 13 are realized when the computer constituting the control unit 13 reads and executes the program.
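A table-based form of the drive reference data might look like the following sketch. Only the association of team A with the tactile presentation device 14L comes from the description; the mapping of team B to 14R, the speed bands, and the strength and duration values are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical drive reference table:
# (upper speed bound, vibration strength 0..1, operation time in milliseconds)
DRIVE_TABLE: List[Tuple[float, float, int]] = [
    (0.3, 0.2, 50),
    (0.7, 0.6, 100),
    (float("inf"), 1.0, 150),
]

# Team A drives device 14L as in the description; team B -> 14R is assumed.
TEAM_TO_DEVICE = {"A": "14L", "B": "14R"}


def drive_command(team: str, speed: float) -> Tuple[str, float, int]:
    """Look up which device to drive, and with what strength and duration."""
    for upper, strength, duration_ms in DRIVE_TABLE:
        if speed < upper:
            return TEAM_TO_DEVICE[team], strength, duration_ms
    raise ValueError(f"speed {speed!r} is not covered by the drive table")
```

Replacing the table with a continuous function of speed would correspond to the "function" alternative mentioned above.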

The drive units 134-1 and 134-2 are drivers that drive the vibration actuators 142 of the tactile presentation devices 14L and 14R, respectively.

As described above, the video-tactile interlocking system 1 including the tactile metadata generation device 12 of this embodiment can automatically extract person objects from a video and automatically generate, in synchronization with the video, tactile metadata corresponding to the dynamic person objects, so that the tactile presentation devices can be linked with the video. In particular, for sports video viewing, generating tactile metadata that includes the identification, position coordinates, and classification (team classification) of each player and information indicating the timing and speed at which the tactile presentation devices are to be operated makes it possible to present the user U with tactile stimuli relating to the type, timing, and intensity of play through one or more tactile presentation devices.

(Experimental verification)
Table 1 shows the results of an experiment on detecting shot timing extracted from the tactile metadata generated by the tactile metadata generation device 12 of this embodiment. The evaluation used 20 shots' worth of footage from the Rio de Janeiro Olympics, and the shot timing of each player was measured with a tolerance of ±5 frames. Precision indicates detection accuracy, and recall indicates detection sensitivity. The F-measure is the harmonic mean of precision and recall.
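The evaluation metrics with a ±5-frame tolerance can be computed as in the following sketch. The greedy one-to-one matching of detections to ground-truth shot frames is an assumption; the description does not state how matches were counted, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def evaluate_shots(detected: Sequence[int], truth: Sequence[int],
                   tolerance: int = 5) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for detected shot frame numbers."""
    unmatched = sorted(truth)
    true_positives = 0
    for frame in sorted(detected):
        # Match each detection to at most one ground-truth shot within tolerance.
        match = next((t for t in unmatched if abs(t - frame) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

The F-measure here is the harmonic mean 2PR / (P + R) of precision P and recall R, matching the definition given above.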

The results in Table 1 confirm a certain level of effectiveness, exceeding 70%, toward real-time extraction of tactile metadata.

The tactile metadata generation device 12 of the embodiment described above can be implemented as a computer, and a program for causing the computer to realize each component according to the present invention is stored in a memory provided inside or outside the computer. Under the control of a central processing unit (CPU) or the like provided in the computer, the program describing the processing for realizing the function of each component is read from the memory as appropriate, so that the computer realizes the function of each component of the tactile metadata generation device 12 of this embodiment. The function of each component may also be realized partly in hardware.

The present invention has been described above with reference to a specific embodiment, but the present invention is not limited to that embodiment and can be modified in various ways without departing from its technical idea. For example, the embodiment above was described mainly using video analysis of badminton, but the invention is widely applicable to judo, table tennis, and various other sports, as well as to video other than sports. For example, it can improve services such as public viewing, entertainment, and future tactile broadcasting that use tactile information. Outside sports, it can also be applied to various uses such as tactile alarms in factories and security systems based on surveillance camera video analysis. Accordingly, the present invention is not limited to the embodiment described above and is limited only by the scope of the claims.

According to the present invention, person objects can be automatically extracted from a video and tactile metadata corresponding to the dynamic person objects can be automatically generated in synchronization with the video, so the invention is useful for applications that link tactile presentation devices with video.

1 Video-tactile interlocking system
10 Video output device
11 Display
12 Tactile metadata generation device
13 Control unit
14L, 14R Tactile presentation device
121 Multiple-frame extraction unit
122 Human skeleton extraction unit
123 Person identification unit
124 Trajectory feature generation unit
125 Moving object detection unit
126 Human motion recognition unit
127 Metadata generation unit
131 Metadata receiving unit
132 Analysis unit
133 Storage unit
134-1, 134-2 Drive unit
141 Case
142 Vibration actuator

Claims

A tactile metadata generation device that extracts person objects from a video and generates tactile metadata corresponding to dynamic person objects, the device comprising:
multiple-frame extraction means for extracting, from the input video, a plurality of frame images consisting of a current frame image and past frame images;
human skeleton extraction means for generating, for each of the plurality of frame images including the current frame image, a first skeleton coordinate set of each person object based on a skeleton detection algorithm;
person identification means for identifying each person object, for each of the plurality of frame images including the current frame image, by extracting the position and size of the skeleton of each person object and surrounding image information based on the first skeleton coordinate set, and for generating a second skeleton coordinate set to which a person ID is assigned;
trajectory feature generation means for connecting, in time series and with the current frame image as a reference, the second skeleton coordinate sets in the plurality of frame images to generate a skeleton trajectory set as a set of trajectory features indicating the trajectory of the skeleton of each person object;
human motion recognition means for detecting, by machine learning based on the trajectory features of the skeleton trajectory set, information for operating a tactile presentation device; and
metadata generation means for generating, in correspondence with the current frame image, tactile metadata for operating the tactile presentation device and outputting the tactile metadata to the outside frame by frame.

The tactile metadata generation device according to claim 1, further comprising moving object detection means for detecting moving objects based on difference images between adjacent frames using each of the plurality of frame images including the current frame image, selecting a specific moving object from among the moving objects detected from each difference image using the skeleton trajectory sets of all person objects, and generating moving object information in which the coordinate position, size, and movement direction of the specific moving object obtained from each difference image are concatenated as elements,
wherein the human motion recognition means selects, based on the moving object information, the skeleton trajectory set for operating the tactile presentation device from among the skeleton trajectory sets of all person objects, and detects the information for operating the tactile presentation device by machine learning based on the trajectory features of the selected skeleton trajectory set.

The tactile metadata generation device according to claim 1 or 2, wherein the human motion recognition means detects, by the machine learning and in correspondence with the current frame image, the identification, position coordinates, and classification of each person object in the current frame image and information indicating the timing and speed at which the tactile presentation device is to be operated, for use in generating the tactile metadata.

A video-tactile interlocking system comprising:
the tactile metadata generation device according to any one of claims 1 to 3;
a tactile presentation device that presents a tactile stimulus; and
a control unit that, based on the tactile metadata obtained from the tactile metadata generation device, refers to predetermined drive reference data and controls the tactile presentation device so that it is driven.

A program for causing a computer to function as the tactile metadata generation device according to claim 1.