JP2022062313A

JP2022062313A - Tactile sensation meta data generation apparatus, video tactile sensation linkage system and program

Info

Publication number: JP2022062313A
Application number: JP2020170229A
Authority: JP
Inventors: 正樹高橋; Masaki Takahashi; 真希子東; Makiko Azuma; 拓也半田; Takuya Handa; 雅規佐野; Masami Sano; 結子山内; Yuiko Yamauchi
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-10-08
Filing date: 2020-10-08
Publication date: 2022-04-20

Abstract

To provide a tactile sensation meta data generation apparatus which automatically extracts a dynamic person object from a video and synchronously and automatically generates corresponding tactile sensation meta data, a video tactile sensation linkage system which controls drive of a tactile sensation presentation device based on the generated tactile sensation meta data, and a program.

SOLUTION: A tactile sensation meta data generation apparatus 12 according to the present invention comprises: a person skeleton extraction unit 122 and a person identification unit 123 which generate a skeleton coordinate aggregate of each person from a video and identify a person within a variable search range; a skeleton trajectory feature image generation unit 124 which generates a skeleton trajectory feature image (STI) indicating only the direction of the movement for each skeleton from which the person is identified; and a person motion recognition unit 126 and a meta data generation unit 127 which recognize a person motion with deep learning (CNN) with the STI as the input, detect information for operating tactile sensation presentation devices 14L, 14R, and generate the tactile sensation meta data. A video tactile sensation linkage system 1 comprises: the tactile sensation meta data generation apparatus 12; and a control unit 13 for tactile sensation presentation device control.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a tactile metadata generation apparatus that extracts person objects from video and generates tactile metadata corresponding to dynamic person objects, to a video-tactile linkage system that drives and controls a tactile presentation device based on the generated tactile metadata, and to a program.

Video content from ordinary cameras, such as broadcast video, is a medium that conveys information through two senses, sight and hearing. For people with visual or hearing impairments, however, audiovisual information alone is insufficient and cannot accurately convey what is happening in a program. As a result, many people with such impairments either do not own a television or do not watch one even if they do. It is therefore desirable to build a system that lets people with visual or hearing impairments follow television broadcasts by presenting, alongside the video content, information that can be perceived through touch rather than through sight or hearing.

Presenting tactile stimuli can also be expected to heighten the sense of presence and immersion for viewers without visual or hearing impairments when they watch broadcast programs. In sports content in particular, the movements of people are important information, and presenting them as tactile stimuli increases the sense of presence during viewing.

For example, when watching baseball video, stimulating the viewer through a tactile presentation device at the moment the ball hits the bat lets the viewer vicariously experience the batter's sensation of hitting. Providing tactile stimuli to people with visual impairments is also expected to help them understand the state of a sports match. Touch is thus anticipated as a third sense in video viewing.

Because real-time viewing is especially important in sports, tactile stimuli for the video must be presented automatically and in real time. Tactile stimuli synchronized with the players' movements, reflecting the type, timing, and situation of play, are often effective for viewing video content that also uses touch. They also make it possible to convey the state of a sporting event to people with visual or hearing impairments.

To realize viewing of video content that also uses touch, it is therefore necessary to extract the movement of person objects from the video content and to generate tactile information corresponding to that movement as tactile metadata.

With conventional methods of generating tactile metadata, however, even where viewing with touch is realized, the tactile metadata that indicates at what timing and with what kind of stimulus the tactile presentation device stimulates the user has had to be edited by hand in synchronization with the video.

For recorded programs, tactile metadata can be edited by hand over time. To link stimulus presentation by a tactile presentation device to live broadcast video, however, the tactile information cannot be edited in advance, so the video content must be analyzed in real time to generate the tactile metadata.

In recent years, sports video analysis technology has advanced remarkably. The Hawk-Eye system for tennis, which is also used at Wimbledon, treats video from multiple fixed cameras as sensor input to track the tennis ball in three dimensions and make the in/out calls that bear on officiating. At the 2014 FIFA World Cup, goal-line technology analyzed video from several fixed cameras to automate goal decisions. Real-time video analysis in sports continues to grow more sophisticated, as shown by systems such as TRACAB, which installs many stereo cameras in a soccer stadium and tracks every player on the field in real time.

Meanwhile, the posture of a player as a dynamic person object has conventionally been measured with marker-based motion capture. This approach requires attaching many markers to the player's body and cannot be applied to actual matches. In recent years, markerless posture measurement has become possible by using a depth sensor that reads an infrared pattern projected onto the player's body and derives depth information from the distortion of that pattern. Various techniques that apply optical motion capture rather than the marker-based approach have also been disclosed (see, for example, Patent Documents 1, 2, and 3).

For example, Patent Document 1 discloses a device that measures the three-dimensional position of a subject's skeleton by optical motion capture, for use when teaching a motion to a user by displaying another person's model motion video in a stereoscopic virtual reality system. Patent Document 2 discloses a technique that performs motion recognition using information obtained from video of events such as gymnastics together with motion capture data; its distinguishing feature is the use of a hidden Markov model to remove constraints on the duration of a motion. Patent Document 3 discloses a training evaluation device that measures a player's motion by optical motion capture and evaluates the player's form based on the measured data and data on a model form. However, because these techniques rely on motion capture, they cannot be applied to actual matches, and it is difficult to measure a person's playing motion from general-purpose camera video.

A device has also been disclosed that, without relying on motion capture, simulates a person's movement solely from camera video of a badminton match or practice session played by one person or by a pair (see, for example, Patent Document 4). The technique of Patent Document 4 detects motions such as shots from the captured video, but it presupposes video from a specially installed camera, so it is difficult to measure a person's playing motion from general-purpose broadcast camera video.

Recent advances in deep learning have made it possible to estimate a person's skeleton position from an ordinary still image that contains no depth information, without a depth sensor, which was previously difficult. With this technology, a still image can be extracted from ordinary camera video and the posture of the players in it can be measured automatically. In other words, by measuring player posture from ordinary camera video, information relevant to tactile stimuli can be acquired without affecting the competition.

Acquiring skeleton information makes it possible to measure a person's posture, but recognition processing is needed to assign meaning to that posture. For example, when judo video is input, whether the action in a given frame is grappling, a throwing technique, or a ground technique must be determined from image features and skeleton features. The convolutional neural network (CNN) is widely used for recognition in image processing. A CNN is a neural network with many layers and performs particularly well in image recognition. It is built by stacking layers with characteristic functions, such as convolution layers and pooling layers, and is now used in a wide range of fields.

In a typical neural network, neurons are arranged in layers and every neuron in one layer is usually connected to every neuron in the adjacent layers. A convolutional neural network instead restricts these connections and uses a technique called weight sharing, so that an operation equivalent to image convolution is expressed within the neural network framework. This layer is called a convolution layer and is the defining feature of a CNN. The other major feature of a CNN is the pooling layer. If the convolution layer performs feature extraction, such as extracting edges from an image, the pooling layer provides robustness so that the extracted features are not affected by changes such as translation.
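The roles of the two layer types described above can be illustrated with a minimal sketch in plain Python. It is not taken from the patent: the function names `conv2d_valid` and `max_pool`, the 1x2 edge kernel, and the tiny test images are all assumptions chosen for illustration. One shared kernel is slid over the whole image (weight sharing), and max pooling then makes the response insensitive to a one-pixel shift of the edge.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def max_pool(feature, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature) // size
    out_w = len(feature[0]) // size
    return [
        [
            max(feature[y * size + i][x * size + j]
                for i in range(size) for j in range(size))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Hypothetical inputs: a vertical edge, and the same edge shifted left by one pixel.
edge_kernel = [[-1, 1]]                      # responds to a dark-to-bright step
image_a = [[0, 0, 1, 1, 1] for _ in range(4)]
image_b = [[0, 1, 1, 1, 1] for _ in range(4)]

pooled_a = max_pool(conv2d_valid(image_a, edge_kernel))
pooled_b = max_pool(conv2d_valid(image_b, edge_kernel))
```

In this toy case the convolution outputs differ between the two images, but the pooled outputs are identical, which is the kind of robustness to translation that the text attributes to the pooling layer.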

Apart from skeleton information, an image called a Motion History Image (MHI) has conventionally been used to recognize motion from images (see, for example, Non-Patent Document 1 and Patent Document 5). An MHI is drawn by filling the regions where a luminance difference arises between frames with high luminance and gradually lowering that luminance in subsequent frames. The result is a single image that carries information on the direction of movement of a moving object.
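The MHI update rule described above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not code from the cited documents; the function name `update_mhi` and the parameters `diff_threshold`, `max_value`, and `decay` are assumptions.

```python
def update_mhi(mhi, prev_frame, cur_frame,
               diff_threshold=30, max_value=255, decay=32):
    """Return the next MHI given the previous MHI and two grayscale frames.

    Pixels whose luminance changed by more than diff_threshold are set to
    max_value; all other pixels fade by `decay`, never going below zero.
    """
    return [
        [
            max_value if abs(c - p) > diff_threshold else max(m - decay, 0)
            for m, p, c in zip(mhi_row, prev_row, cur_row)
        ]
        for mhi_row, prev_row, cur_row in zip(mhi, prev_frame, cur_frame)
    ]


# Hypothetical 1x4 frames: a bright pixel moving one step to the right per frame.
frames = [
    [[200, 0, 0, 0]],
    [[0, 200, 0, 0]],
    [[0, 0, 200, 0]],
]

mhi = [[0, 0, 0, 0]]
for prev_frame, cur_frame in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev_frame, cur_frame)
```

After the two updates the left-most pixel has already faded while the most recently changed pixels are at full luminance, so the luminance gradient in the single image points in the direction of movement.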

Patent Document 5 discloses a technique that uses image recognition to detect a pitching motion in baseball video: a Motion History Image (MHI) is created for the baseball video and the pitching motion is detected from it. However, the MHI in the technique of Patent Document 5 involves no skeleton detection, so recognizing detailed motions is difficult.

A technique called Skeleton Motion History Image (SklMHI) has therefore also been disclosed (see, for example, Non-Patent Document 2). It generates an MHI from an image (a bone image) that shows the person skeleton obtained by skeleton detection together with the lines connecting the skeleton points, and uses deep learning to measure a person's posture from camera video.

Japanese Unexamined Patent Application Publication No. 2002-8063
Japanese Unexamined Patent Application Publication No. 2002-253718
Japanese Unexamined Patent Application Publication No. 2020-38440
Japanese Unexamined Patent Application Publication No. 2018-187383
Japanese Unexamined Patent Application Publication No. 2008-22142

"Motion History Image", [online], [retrieved September 15, 2020], Internet <https://web.cse.ohio-state.edu/~davis.1719/CVL/Research/MHI/mhi.html>
C. N. Ohyo, T. T. Zin, P. Tin., "Skeleton motion history based human action recognition using deep learning", [online], [retrieved September 15, 2020], Internet <https://ieeexplore.ieee.org/document/8229448>

As described above, adding tactile information to video content has conventionally required editing the type and timing of each stimulus by hand, which made it impossible to present tactile information in live broadcast programs. If tactile information extraction can be automated through real-time video analysis, tactile information can be provided in live broadcasts as well. To realize viewing of video content that also uses touch, it is necessary to extract the movement of person objects from the video content and to generate tactile information corresponding to that movement as tactile metadata.

Live sports coverage in particular is content for which real-time delivery matters. Tactile information about the competition must therefore also be added in real time and presented together with the video. Tactile stimuli synchronized with the players' movements are often effective, so extracting tactile metadata from video requires analyzing the players' movements from camera video in real time. To avoid affecting the competition, it is desirable to extract tactile metadata from ordinary broadcast camera video, without marker-based motion capture or depth sensors, whose shooting distance is limited.

In short, a technique is desired that generates tactile metadata on the movement of person objects (players and the like) automatically and in real time, using only ordinary camera video of the sport.

To detect the movement of person objects with high accuracy, one could also refer to moving objects other than people (for example, the shuttlecock and rackets in badminton). However, a technique is desired that detects the movement of person objects with high accuracy even in competitions where no such non-person moving object exists (for example, judo and wrestling).

As noted, recent advances in deep learning have made it possible to estimate a person's skeleton position from an ordinary still image that contains no depth information, without a depth sensor. Skeleton detection algorithms of this kind, however, basically detect skeleton positions one still image at a time. Further measures are therefore needed to generate tactile metadata on the movement of person objects (players and the like) automatically and in real time from ordinary camera video of the sport alone.

As for machine learning for motion recognition, older supervised learning methods such as the SVM allow fast recognition, but deep learning, which has developed rapidly in recent years, can be expected to improve accuracy further. CNNs are often used for motion recognition based on video analysis. A CNN, however, is a classification algorithm that operates on still images and does not take the time axis into account. Understanding the action in a video scene requires features that describe how a person moves, but a still image contains no time-axis information, so a CNN applied to still images cannot be expected to classify actions with high accuracy.

One way to recognize motion from images with a CNN is therefore to use an image called a Motion History Image (MHI). Analyzing the MHI makes it possible to recognize basic movements of a person, such as spreading the arms, crouching, or raising a hand. However, because an MHI does not measure the individual joints of a person, it is limited to recognizing large movements that involve the whole body. For example, to create an MHI for baseball video and detect a pitching motion as disclosed in Patent Document 5, the pitcher's region must be detected with high accuracy in order to suppress the influence of background noise. Moreover, because no skeleton detection is performed, recognizing detailed motions is difficult.

The technique called Skeleton Motion History Image (SklMHI), disclosed in Non-Patent Document 2, addresses this by generating an MHI from an image (a bone image) that shows the person skeleton obtained by skeleton detection together with the lines connecting the skeleton points, and by using deep learning to measure a person's posture from camera video. It improves the accuracy of motion recognition, but still higher accuracy is called for.

In view of the problems described above, an object of the present invention is to provide a tactile metadata generation apparatus that automatically extracts person objects from video and automatically generates, in synchronization, tactile metadata corresponding to dynamic person objects; a video-tactile linkage system that drives and controls a tactile presentation device based on the generated tactile metadata; and a program.

The tactile metadata generation apparatus of the present invention extracts person objects from video and generates tactile metadata corresponding to dynamic person objects, and comprises: multi-frame extraction means for extracting, from the input video, a plurality of frame images including the current frame image and a predetermined number of past frame images; person skeleton extraction means for generating, for each of the plurality of frame images, a first skeleton coordinate set for each person object based on a skeleton detection algorithm; person identification means for, for each of the plurality of frame images, variably setting a search range based on the first skeleton coordinate set, identifying each person object by extracting the position and size of its skeleton and the surrounding image information, and generating a second skeleton coordinate set to which a person ID is assigned; skeleton trajectory feature image generation means for generating, with the current frame image as the reference and based on the second skeleton coordinate set in each of the plurality of frame images, a single skeleton trajectory feature image that shows only the direction of movement of each identified person skeleton; person motion recognition means for recognizing a specific motion of a person with a convolutional neural network that takes the skeleton trajectory feature image as input, and detecting impact presentation information for operating a predetermined tactile presentation device; and metadata generation means for generating tactile metadata that includes the impact presentation information in correspondence with the current frame image, and outputting it externally frame by frame.

In the tactile metadata generation apparatus of the present invention, the skeleton trajectory feature image generation means generates, as the skeleton trajectory feature image, a single image in which a connected trajectory is drawn for each skeleton coordinate of each person across the plurality of frame images, with the luminance either lowered or raised progressively toward the past.
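The trajectory drawing described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the apparatus's implementation: the input layout (a list of per-frame dictionaries mapping a joint name to an `(x, y)` coordinate for one identified person), the function names `draw_line` and `skeleton_trajectory_image`, and the linear luminance schedule are all hypothetical. Only the per-joint trajectories are drawn, with no connecting bones, and older segments are drawn darker.

```python
def draw_line(canvas, p0, p1, value):
    """Draw a straight segment from p0 to p1, keeping the brighter value on overlap."""
    (x0, y0), (x1, y1) = p0, p1
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for s in range(steps + 1):
        x = round(x0 + (x1 - x0) * s / steps)
        y = round(y0 + (y1 - y0) * s / steps)
        if 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):
            canvas[y][x] = max(canvas[y][x], value)


def skeleton_trajectory_image(frames, width, height, max_value=255):
    """frames: oldest-first list of {joint_name: (x, y)} for one person.

    Each joint's positions in consecutive frames are connected by a segment;
    segments closer to the current frame are drawn with higher luminance.
    """
    canvas = [[0] * width for _ in range(height)]
    n_segments = len(frames) - 1
    for t in range(n_segments):
        value = max_value * (t + 1) // n_segments  # darker toward the past
        for joint, p0 in frames[t].items():
            p1 = frames[t + 1].get(joint)
            if p1 is not None:  # skip joints missing in the next frame
                draw_line(canvas, p0, p1, value)
    return canvas


# Hypothetical example: one wrist joint moving right over three frames.
wrist_frames = [{"wrist": (0, 0)}, {"wrist": (2, 0)}, {"wrist": (4, 0)}]
sti = skeleton_trajectory_image(wrist_frames, width=5, height=1)
```

Reversing the schedule (brighter toward the past) would correspond to the other option the text allows; the choice only has to be consistent between training and inference of whatever classifier consumes the image.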

In the tactile metadata generation apparatus of the present invention, the skeleton trajectory feature image generation means generates, as the skeleton trajectory feature image, a single image in which the skeleton coordinates of each person in the plurality of frame images are color-coded per skeleton coordinate, using a color scheme that is either common to all persons or distinct for each person, and in which the movement of each skeleton coordinate is drawn with a gradation that progresses in time series frame by frame.

In the tactile metadata generation apparatus of the present invention, the person identification means includes means for variably setting the search range by narrowing it between a maximum, which is a person search range enclosing the entire person skeleton, and a minimum, which is an attention search range defined as a predetermined region of the person skeleton; for determining the search range so that it includes at least the attention search range, based on an estimate of the state transition of the person's skeleton obtained with a state estimation algorithm; and for identifying the person object.
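One possible reading of this variable search range is sketched below. It is an assumption-laden illustration, not the claimed means: boxes are `(x0, y0, x1, y1)` tuples, `predicted_box` stands in for whatever region the state estimation algorithm predicts, the attention range is assumed to lie inside the person range, and the function names are hypothetical. The search range always contains the attention range and never exceeds the person range.

```python
def union(a, b):
    """Smallest box containing both boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def clip(box, limit):
    """Restrict box to lie within limit."""
    return (max(box[0], limit[0]), max(box[1], limit[1]),
            min(box[2], limit[2]), min(box[3], limit[3]))


def variable_search_range(person_box, attention_box, predicted_box):
    """Pick a search range between the attention range (minimum) and the
    whole-skeleton person range (maximum).

    Assumes attention_box lies inside person_box, so clipping never
    removes any part of the attention range.
    """
    candidate = union(attention_box, predicted_box)
    return clip(candidate, person_box)


# Hypothetical boxes in pixel coordinates.
person = (0, 0, 100, 200)      # encloses the whole skeleton
attention = (40, 20, 60, 60)   # e.g. a predetermined region of the skeleton
predicted = (50, 50, 120, 90)  # region suggested by a state estimator

search = variable_search_range(person, attention, predicted)
```

When the prediction falls entirely inside the attention range, the result collapses to the attention range itself, which is the minimum the text describes.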

The tactile metadata generation apparatus of the present invention further comprises moving object detection means for detecting moving objects from difference images between adjacent frames of the plurality of frame images; selecting, from among the moving objects detected in each difference image, the moving objects other than persons by comparison with the skeleton trajectory feature image that shows only the direction of movement of each identified person skeleton; and generating, for those non-person moving objects, a moving object trajectory image that connects the coordinate position, size, and movement direction obtained from each difference image. The person motion recognition means then recognizes the specific motion of a person with a convolutional neural network whose input is the skeleton trajectory feature image with the moving object trajectory image added and composited onto it.
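The frame-differencing step can be sketched in plain Python as follows. This is a simplified illustration rather than the described means: it assumes grayscale frames with a single moving object, omits the comparison against the skeleton trajectory feature image that separates person and non-person objects, and uses hypothetical names (`moving_region`, `object_trajectory`, `threshold`).

```python
def moving_region(prev_frame, cur_frame, threshold=30):
    """Bounding region of pixels that changed between two adjacent frames.

    Returns None when nothing changed; otherwise the center and size of the
    bounding box of all changed pixels.
    """
    xs, ys = [], []
    for y, (prev_row, cur_row) in enumerate(zip(prev_frame, cur_frame)):
        for x, (p, c) in enumerate(zip(prev_row, cur_row)):
            if abs(c - p) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return {"center": ((x0 + x1) / 2, (y0 + y1) / 2),
            "size": (x1 - x0 + 1, y1 - y0 + 1)}


def object_trajectory(frames, threshold=30):
    """Position, size, and movement direction per difference image."""
    regions = [moving_region(a, b, threshold) for a, b in zip(frames, frames[1:])]
    regions = [r for r in regions if r is not None]
    for prev, cur in zip(regions, regions[1:]):
        cur["direction"] = (cur["center"][0] - prev["center"][0],
                            cur["center"][1] - prev["center"][1])
    return regions


# Hypothetical 1x5 frames: one bright pixel moving right by one pixel per frame.
ball_frames = [
    [[200, 0, 0, 0, 0]],
    [[0, 200, 0, 0, 0]],
    [[0, 0, 200, 0, 0]],
]
trajectory = object_trajectory(ball_frames)
```

Note that a plain difference image marks both the old and the new position of the object, so the detected region spans two pixels here; the direction is taken from the displacement of the region centers between consecutive difference images.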

The video-tactile linkage system of the present invention comprises: the tactile metadata generation apparatus of the present invention; a tactile presentation device that presents tactile stimuli; and a control unit that, based on the tactile metadata obtained from the tactile metadata generation apparatus, refers to predetermined drive reference data and controls the tactile presentation device so as to drive it.
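How a control unit might turn per-frame tactile metadata into drive commands by looking up drive reference data can be sketched as below. Everything concrete here is hypothetical and is not specified in the text: the metadata layout (`frame`, `events`, `person_id`, `action`), the action labels, the vibration parameters, and the assignment of person IDs to the devices 14L and 14R.

```python
# Hypothetical drive reference data: recognized action -> vibration parameters.
DRIVE_REFERENCE = {
    "throw": {"amplitude": 1.0, "duration_ms": 300},
    "grapple": {"amplitude": 0.3, "duration_ms": 100},
}

# Hypothetical assignment of identified persons to tactile presentation devices.
DEVICE_FOR_PERSON = {0: "14L", 1: "14R"}


def drive_commands(metadata):
    """Translate one frame of tactile metadata into per-device drive commands.

    Events whose action has no drive reference entry, or whose person has no
    assigned device, are ignored.
    """
    commands = []
    for event in metadata["events"]:
        params = DRIVE_REFERENCE.get(event["action"])
        device = DEVICE_FOR_PERSON.get(event["person_id"])
        if params is None or device is None:
            continue
        commands.append({"device": device, "frame": metadata["frame"], **params})
    return commands


# Hypothetical metadata for one frame: person 0 throws, person 1 does
# something that has no drive reference entry.
example_metadata = {
    "frame": 120,
    "events": [
        {"person_id": 0, "action": "throw"},
        {"person_id": 1, "action": "bow"},
    ],
}
commands = drive_commands(example_metadata)
```

Keeping the drive parameters in a lookup table separate from the recognition output mirrors the text's split between metadata generation and device control: the stimulus design can be changed without touching the video analysis.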

Furthermore, the program of the present invention is configured as a program for causing a computer to function as the tactile metadata generation apparatus of the present invention.

According to the present invention, person objects can be extracted automatically from video, and tactile metadata corresponding to dynamic person objects can be generated automatically and in synchronization. In particular, tactile stimuli can be presented during real-time viewing of sports video. By presenting information to the sense of touch as well as to sight and hearing, the state of a sporting event can be conveyed clearly to people with visual or hearing impairments. Sighted viewers in general can also be given a sense of presence and immersion that conventional video viewing cannot convey.

A block diagram showing the schematic configuration of a video-tactile linkage system that includes a tactile metadata generation apparatus according to one embodiment of the present invention.
A flowchart showing an example of the processing of the tactile metadata generation apparatus according to one embodiment of the present invention.
An explanatory diagram of the person skeleton extraction process in the tactile metadata generation apparatus according to one embodiment of the present invention.
(a) is a diagram illustrating a one-frame image, and (b) is a diagram showing an example of person skeleton extraction from a one-frame image in the tactile metadata generation apparatus according to one embodiment of the present invention.
(a) and (b) are diagrams each showing an example of processing the search range for person objects in the person skeleton extraction process of the tactile metadata generation apparatus according to one embodiment of the present invention.
(a) is a diagram showing an example of a skeleton trajectory feature image (STI: Skeleton Trajectory Image) according to the present invention, and (b) is an explanatory diagram of that skeleton trajectory feature image (STI).
(a) is a diagram schematically showing an example of a one-frame image, (b) is an example of a prior-art bone image, (c) is an example of a prior-art SklMHI (Skeleton Motion History Image), and (d) is an example of a skeleton trajectory feature image (STI) according to the present invention.
A diagram showing a comparative evaluation of the accuracy of detecting person movement with the prior-art bone image, the prior-art SklMHI, and the skeleton trajectory feature image (STI) according to the present invention.
A block diagram showing the schematic configuration of the control unit in the video-tactile linkage system according to one embodiment of the present invention.

(System configuration)
Hereinafter, the video tactile interlocking system 1 including the tactile metadata generation device 12 according to one embodiment of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of the video tactile interlocking system 1 including the tactile metadata generation device 12 according to one embodiment of the present invention.

The video tactile interlocking system 1 shown in FIG. 1 includes: a tactile metadata generation device 12 that receives video from a video output device 10 such as a camera or a recording device, automatically extracts person objects from the input video, and automatically generates, in synchronization, tactile metadata corresponding to the dynamic person objects (two types: first tactile metadata and second tactile metadata); tactile presentation devices 14L and 14R (two devices in this example); and a control unit 13 that individually drives and controls the tactile presentation devices 14L and 14R based on the generated tactile metadata.

First, as an example, the video output by the video output device 10 is assumed to be footage of a judo match shot in real time, which is displayed on the display 11 and viewed by the user U.

Judo is a sport in which two athletes grapple with each other and compete using techniques such as hold-downs and throws. By presenting the moment at which an impact occurs on each person, and changes in the state of each person's movement, to the user U as tactile stimuli through the tactile presentation devices 14L and 14R, the sense of presence can be heightened and the state of the match can also be conveyed to viewers with visual or hearing impairments.

In judo in particular, the athletes frequently overlap and occlude one another in the video. Therefore, in addition to the timing and speed corresponding to the type of impact on each athlete, the state of each athlete's actions, such as pushing and pulling while grappling and throwing, is presented continuously as tactile stimuli. This makes it possible to convey the tension of the match to people with visual or hearing impairments and to heighten the sense of presence.

Accordingly, the user U holds the tactile presentation device 14L in the left hand HL and the tactile presentation device 14R in the right hand HR, and in this example is presented with vibration stimuli synchronized with the video analysis. The control unit 13 individually controls the tactile presentation of the two tactile presentation devices 14L and 14R associated with the person objects Op1 and Op2, based on the tactile metadata obtained from the tactile metadata generation device 12, which includes impact presentation information indicating the timing and speed corresponding to the type of impact occurring on each person object Op1, Op2. The control unit 13 may instead drive and control only one tactile presentation device, or may individually drive and control three or more tactile presentation devices. Although not limiting, the control unit 13 of this example sorts the stimuli so that vibration stimuli corresponding to the movement of the person object Op1 (an athlete) in the video are presented by the tactile presentation device 14L, and vibration stimuli corresponding to the movement of the person object Op2 (the other athlete) are presented by the tactile presentation device 14R.
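The assignment of person objects to devices described above can be pictured as a lookup table. The following is a minimal sketch, not the control unit's actual logic; the names `DEVICE_FOR_PERSON` and `route`, and the representation of an event as a `(person_id, payload)` pair, are assumptions made for illustration.

```python
# Hypothetical routing table: person ID -> tactile presentation device label.
DEVICE_FOR_PERSON = {1: "14L", 2: "14R"}


def route(events, table=DEVICE_FOR_PERSON):
    """Group (person_id, payload) events by the device assigned to each person.

    Events for persons without an assigned device (for example a referee)
    are dropped.
    """
    per_device = {}
    for person_id, payload in events:
        device = table.get(person_id)
        if device is None:
            continue
        per_device.setdefault(device, []).append(payload)
    return per_device
```

Supporting one device, or three or more, would only require changing the table.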

Each of the tactile presentation devices 14L and 14R houses, in a spherical case 141, a vibration actuator 142 that can present vibration stimuli under the control of the control unit 13. The tactile presentation devices 14L and 14R may present electromagnetic pulse stimuli in addition to vibration stimuli. This example describes a configuration in which the control unit 13 is connected by wire to each of the tactile presentation devices 14L and 14R, and the tactile metadata generation device 12 is also connected by wire to the control unit 13, but each of these connections may instead be a wireless connection using short-range wireless communication.

The tactile metadata generation device 12 includes a multi-frame extraction unit 121, a person skeleton extraction unit 122, a person identification unit 123, a skeleton trajectory feature image generation unit 124, a moving object detection unit 125, a person motion recognition unit 126, and a metadata generation unit 127.

The multi-frame extraction unit 121 extracts, from the video input from the video output device 10, a plurality of frame images consisting of the current frame image and the past frame images for T frames (T is an integer of 1 or more), and outputs them to the person skeleton extraction unit 122 and the moving object detection unit 125.
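The sliding window of the current frame plus T past frames can be sketched with a fixed-length queue. This is an illustrative sketch only; the class name `MultiFrameBuffer` and its `push` method are hypothetical, and frames are treated as opaque objects.

```python
from collections import deque


class MultiFrameBuffer:
    """Keeps the current frame plus the T most recent past frames."""

    def __init__(self, T):
        if T < 1:
            raise ValueError("T must be an integer >= 1")
        # T past frames + 1 current frame; older frames fall off automatically.
        self.frames = deque(maxlen=T + 1)

    def push(self, frame):
        """Add the newest frame.

        Returns the frames ordered t = 0 (current) .. T (oldest),
        or None until T + 1 frames have been received.
        """
        self.frames.append(frame)
        if len(self.frames) < self.frames.maxlen:
            return None
        return list(reversed(self.frames))
```

Returning the newest frame first matches the convention used later in the description, where the current frame is t = 0.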

For each of the plurality of frame images, the person skeleton extraction unit 122 generates, based on a skeleton detection algorithm, a skeleton coordinate set $P^{n}_{b}$ (n: number of detected persons, b: skeleton ID) for each of the person objects Op1 and Op2 (hereinafter also simply called "persons"), and outputs it to the person identification unit 123 together with the plurality of frame images including the current frame image.

For each of the plurality of frame images, the person identification unit 123 variably sets a search range (described in detail later) based on the skeleton coordinate set $P^{n}_{b}$, identifies each person by extracting the position and size of that person's skeleton and the surrounding image information, generates a skeleton coordinate set $P^{i}_{b}$ (i: person ID, b: skeleton ID) to which a person ID has been assigned, and outputs it to the skeleton trajectory feature image generation unit 124.

Taking the current frame image as the reference, the skeleton trajectory feature image generation unit 124 generates, based on the skeleton coordinate sets $P^{i}_{b}$ in the plurality of frame images, a single skeleton trajectory feature image showing only the direction of movement of each identified person skeleton, and outputs it to the person motion recognition unit 126. The skeleton trajectory feature image is described in detail later; in this specification it is named STI (Skeleton Trajectory Image).

The moving object detection unit 125 is not strictly necessary for recognizing movement in judo as in this example. It detects moving objects from the difference images between adjacent frames of the plurality of frame images, and compares the moving objects detected in each difference image with the skeleton trajectory feature image (STI) obtained from the skeleton trajectory feature image generation unit 124 to select the moving objects other than persons. For those non-person moving objects, it generates a moving object trajectory image that connects the coordinate position, size, and movement direction obtained from each difference image, and outputs it to the skeleton trajectory feature image generation unit 124. In this case, the skeleton trajectory feature image generation unit 124 additionally draws (composites) the moving object trajectory image onto the skeleton trajectory feature image (STI) and outputs the result to the person motion recognition unit 126.

The person motion recognition unit 126 recognizes specific motions of a person using a CNN (convolutional neural network) that takes the skeleton trajectory feature image (STI) as input, and detects the predetermined impact presentation information for actuating the tactile presentation devices 14L and 14R. This information consists of the identification and position coordinates of each person in the current frame image (and, for a team sport, the team classification, which does not apply in this example because the sport is judo), together with impact presentation information indicating the timing and speed at which the tactile presentation devices are to be actuated. The unit outputs this information to the metadata generation unit 127.

For the current frame image, the metadata generation unit 127 generates tactile metadata (for impact presentation) containing the predetermined impact presentation information obtained from the person motion recognition unit 126, namely the identification and position coordinates of each person in the current frame image (and, for a team sport, the team classification, which does not apply in this example because the sport is judo), together with impact presentation information indicating the timing and speed at which the tactile presentation devices are to be actuated, and outputs it to the control unit 13 frame by frame.
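To make the contents of the per-frame tactile metadata concrete, the record below is one possible shape for it. The specification does not define a data format, so every field name here (`frame_index`, `trigger`, and so on) and the JSON serialization are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class ImpactMetadata:
    """Hypothetical per-frame, per-person tactile metadata record."""

    frame_index: int               # frame the record belongs to
    person_id: int                 # identified person (i)
    position: Tuple[float, float]  # position coordinates in the frame image
    trigger: bool                  # timing: actuate the device on this frame?
    speed: float                   # speed associated with the impact
    team: Optional[str] = None     # team classification; unused for judo


def to_json(record):
    """Serialize one record, e.g. for sending to a control unit frame by frame."""
    return json.dumps(asdict(record), sort_keys=True)
```

A record such as `ImpactMetadata(10, 1, (320.0, 240.0), True, 0.8)` would then be emitted once per person per frame.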

The tactile metadata generation process in the tactile metadata generation device 12 is described more specifically below, based on FIG. 2 and with reference to FIGS. 3 to 6.

(Tactile metadata generation process)
FIG. 2 is a flowchart showing a processing example of the tactile metadata generation device 12 according to one embodiment of the present invention. FIG. 3 is an explanatory diagram of the person skeleton extraction process in the tactile metadata generation device 12. FIG. 4(a) is a diagram illustrating one frame image, and FIG. 4(b) is a diagram showing an example of person skeleton extraction from one frame image in the tactile metadata generation device 12. FIGS. 5(a) and 5(b) are diagrams each showing a processing example of the search range of a person object in the person skeleton extraction process of the tactile metadata generation device 12. FIG. 6(a) is a diagram showing an example of the skeleton trajectory feature image (STI) according to the present invention, and FIG. 6(b) is an explanatory diagram of that skeleton trajectory feature image (STI).

As shown in FIG. 2, the tactile metadata generation device 12 first uses the multi-frame extraction unit 121 to extract, from the video input from the video output device 10, a plurality of frame images consisting of the current frame image and the past frame images for T frames (T is an integer of 1 or more) (step S1).

Next, the tactile metadata generation device 12 uses the person skeleton extraction unit 122 to generate, for each of the plurality of frame images and based on a skeleton detection algorithm, the skeleton coordinate set $P^{n}_{b}$ (n: number of detected persons, b: skeleton ID) of each of the person objects Op1 and Op2 (step S2).

Recent advances in deep learning have made it possible to estimate the positions of a person's skeleton from ordinary images. Some skeleton detection algorithms, typified by OpenPose and VisionPose (NextSystem), have been released as open source. The person skeleton extraction unit 122 of this example therefore uses VisionPose to detect 30 skeleton points of a person in each frame image, as shown in FIG. 3, and generates the skeleton coordinate set $P^{n}_{b}$ indicating their position coordinates.

With VisionPose, the coordinate positions of the following 30 points in FIG. 3 can be obtained, and the points can be drawn connected by lines as illustrated:
$P^{n}_{1}$: head
$P^{n}_{2}$: nose
$P^{n}_{3}$: left eye
$P^{n}_{4}$: right eye
$P^{n}_{5}$: left ear
$P^{n}_{6}$: right ear
$P^{n}_{7}$: neck
$P^{n}_{8}$: spine (shoulder)
$P^{n}_{9}$: left shoulder
$P^{n}_{10}$: right shoulder
$P^{n}_{11}$: left elbow
$P^{n}_{12}$: right elbow
$P^{n}_{13}$: left wrist
$P^{n}_{14}$: right wrist
$P^{n}_{15}$: left hand
$P^{n}_{16}$: right hand
$P^{n}_{17}$: left thumb
$P^{n}_{18}$: right thumb
$P^{n}_{19}$: left fingertip
$P^{n}_{20}$: right fingertip
$P^{n}_{21}$: spine (center)
$P^{n}_{22}$: spine (base)
$P^{n}_{23}$: left hip
$P^{n}_{24}$: right hip
$P^{n}_{25}$: left knee
$P^{n}_{26}$: right knee
$P^{n}_{27}$: left ankle
$P^{n}_{28}$: right ankle
$P^{n}_{29}$: left foot
$P^{n}_{30}$: right foot

FIG. 4(b) shows a frame image Fa obtained by applying this VisionPose skeleton detection algorithm to the judo frame image F shown in FIG. 4(a). The frame image F in FIG. 4(a) shows only the person objects Op1 and Op2 (the athletes), but other content may also appear in a frame: a referee as a further moving person object, moving objects other than persons in other sports (for example a racket or shuttlecock in badminton), or objects such as spectators (effectively static objects). Applying the VisionPose skeleton detection algorithm, however, extracts skeletons only for person objects such as the athletes and the referee. In this example, as shown in FIG. 4(b), the skeleton coordinate sets $P^{1}_{b}$ and $P^{2}_{b}$ corresponding to the person objects Op1 and Op2 can be estimated and generated. As FIG. 4(b) shows, the skeleton of each person can be estimated with relatively good accuracy even in judo. Because the skeleton detection algorithm only performs estimation on individual still images, the tactile metadata generation device 12 subsequently identifies the persons, draws the transition of each person's skeleton positions into a single skeleton trajectory feature image (STI), and performs highly accurate motion recognition that takes the time axis into account using a CNN.

Next, the tactile metadata generation device 12 uses the person identification unit 123 to variably set, for each of the plurality of frame images, a search range based on the skeleton coordinate set $P^{n}_{b}$, identify each person by extracting the position and size of that person's skeleton and the surrounding image information, and generate the skeleton coordinate set $P^{i}_{b}$ (i: person ID, b: skeleton ID) to which a person ID has been assigned (step S3).

The person skeleton extraction unit 122 described above can detect the skeletons of one or more persons as skeleton coordinate sets $P^{n}_{b}$ in each of the plurality of frame images. However, the skeleton coordinate set $P^{n}_{b}$ of each frame image carries no information about who each skeleton belongs to, so the skeletons must be identified per person. This identification uses the image information near the coordinates of each skeleton coordinate set $P^{n}_{b}$ in each frame image. That is, based on the skeleton coordinate set $P^{n}_{b}$, the person identification unit 123 extracts the position and size of each person's skeleton and the surrounding image information (color information, and texture information near the face or back) to identify the person, and generates the skeleton coordinate set $P^{i}_{b}$ (i: person ID, b: skeleton ID) to which a person ID has been assigned.

In judo, for example, matches are fought in white and blue judogi, so the athletes can be identified by referring to the color information in the frame image F near the skeleton positions of each skeleton coordinate set $P^{n}_{b}$. In badminton, when the footage is shot at an angle of view that shows the court lengthwise, a skeleton coordinate set $P^{n}_{b}$ whose skeleton positions lie in the upper part of the frame image F can be identified as the far-side player, and one in the lower part as the near-side player.
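The color-based identification for judo can be sketched as a nearest-reference-color test on pixels sampled near a skeleton point. This is a simplified illustration, not the method of the specification: the reference RGB values, the squared-distance criterion, and the names `REFERENCE_COLORS`, `mean_color`, and `nearest_label` are all assumptions.

```python
# Assumed reference colors (RGB) for the two judogi; real footage would need
# values calibrated to the lighting and camera.
REFERENCE_COLORS = {"white": (255, 255, 255), "blue": (0, 0, 255)}


def mean_color(pixels):
    """Average a non-empty list of (r, g, b) pixels channel by channel."""
    count = len(pixels)
    return tuple(sum(p[c] for p in pixels) / count for c in range(3))


def nearest_label(pixels, references=REFERENCE_COLORS):
    """Return the reference label whose color is closest to the patch average."""
    avg = mean_color(pixels)

    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(avg, references[name]))

    return min(references, key=squared_distance)
```

Here `pixels` would be the patch sampled around, for instance, the waist points of one detected skeleton.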

Thus, although the skeleton detection algorithm in the person skeleton extraction unit 122 only performs estimation on individual still images, persons can be recognized as moving objects based on the skeleton coordinate sets $P^{n}_{b}$.

The person skeleton extraction unit 122 described above often also detects the skeletons of persons other than the athletes, such as referees and spectators, who are not targets of tactile stimulus presentation. Referees usually wear clothing different from that of the athletes, so they can be distinguished by color information. Spectators are usually farther from the camera than the athletes, so they can be distinguished by skeleton size. In this way, by taking the rules and shooting conditions of each sport into account and setting the surrounding image information (color information, and texture information near the face or back) appropriately for person identification, the athletes who are targets of tactile stimulus presentation can be identified.

To cope with persons overlapping and occluding one another, the person identification unit 123 of this embodiment variably sets the search range (a person search range $R^{i}$ and an attention search range $Rb^{i}$) for each frame image. For example, once the person skeleton extraction unit 122 has extracted the skeleton coordinate sets $P^{n}_{b}$ (not shown) for the person objects Op1 and Op2 (athletes) and the person object Op3 (referee) shown in FIG. 5(a), the person identification unit 123 can variably set the person search range $R^{i}$ and the attention search range $Rb^{i}$ for each frame image. In FIG. 5(a), the person search range $R^{i}$ is set for each person ID (i) and is represented as a circumscribed rectangle having the position coordinates of the person in the image coordinates of the frame image and the size (width and height) of the person. The region enclosing each person's waist region ($P^{n}_{22}$, $P^{n}_{23}$, $P^{n}_{24}$) is represented as the attention search range $Rb^{i}$.
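The two rectangles can be computed directly from a skeleton coordinate set. The sketch below assumes a skeleton is given as a mapping from skeleton ID to integer or float (x, y) coordinates and uses the waist IDs 22, 23, and 24 named in the text; the function names and the (x, y, width, height) rectangle format are illustrative choices.

```python
WAIST_IDS = (22, 23, 24)  # spine (base), left hip, right hip


def bounding_rect(points):
    """Circumscribed rectangle (x, y, width, height) of a non-empty point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


def search_ranges(skeleton):
    """Return (person_range, attention_range) for one person.

    skeleton: dict mapping skeleton ID -> (x, y).
    attention_range is None when no waist point was detected.
    """
    person_range = bounding_rect(list(skeleton.values()))
    waist = [skeleton[b] for b in WAIST_IDS if b in skeleton]
    attention_range = bounding_rect(waist) if waist else None
    return person_range, attention_range
```

In practice some margin would probably be added around both rectangles so that nearby clothing pixels are included.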

More specifically, the person identification unit 123 of this embodiment limits the search range for a person in each frame image to, at most, the person search range $R^{i}$ enclosing the whole person skeleton, and narrows it to, at least, the attention search range $Rb^{i}$ defined as a predetermined region of the person skeleton (in this example, the region enclosing the waist region ($P^{n}_{22}$, $P^{n}_{23}$, $P^{n}_{24}$)). Based on the state transition estimate of the person's skeleton obtained by a state estimation algorithm, it determines the search range so that it includes at least the attention search range $Rb^{i}$, and then identifies the person object. As a result, even when each person's motion changes or the size of a person relative to the frame image changes, as shown for example in FIG. 5(b), misrecognition of other persons is prevented and processing speed is also improved. Using the search range is particularly effective for identifying athletes accurately in video such as judo, where the persons to be identified overlap heavily and the background is complex.

In other words, for the skeleton coordinate sets $P^{n}_{b}$ of the person objects Op1, Op2, and Op3 (the athletes and the referee), the person identification unit 123 of this embodiment predefines as the attention search range $Rb^{i}$ a range that allows identification by color (in this example, the color (blue, white, or brown) of the waist region ($P^{n}_{22}$, $P^{n}_{23}$, $P^{n}_{24}$)). When the skeleton coordinate sets $P^{n}_{b}$ of several detected persons overlap, the search is narrowed to the attention search range $Rb^{i}$, so that persons can be extracted and tracked accurately in each frame image. Because skeletons other than those to be analyzed may be detected in the background, a person ID (i) is assigned to the skeleton of each person to be analyzed, which allows the skeleton coordinates $P^{i}_{b}$ of the persons being tracked to be identified.

The extent and shape of the search range (person search range $R^{i}$ and attention search range $Rb^{i}$) are determined, based on the state transition estimate of the person's skeleton obtained by a state estimation algorithm such as a Kalman filter or a particle filter, so that the range includes at least the attention search range $Rb^{i}$ (in this example, each person's waist region).

The search range (person search range $R^{i}$ and attention search range $Rb^{i}$) can be narrowed while detection is stable and widened while detection is unstable. The search range can be set variably in either of the following ways. In the first, a search range determined from the state transition estimate of the person's skeleton is set for each person ID (i), and detection is judged stable if the state transition estimate is within a predetermined value of that of the immediately preceding frame, and unstable otherwise. In the second, the proportion of successful detections within a time window of T frames is calculated based on the state transition estimate of the person's skeleton obtained by the state estimation algorithm, and detection is judged stable if that proportion is at or above a predetermined value and unstable if it falls below it.
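The second criterion above, the detection success rate over a T-frame window, can be sketched as follows. This is only one way to realize the idea: the class name `SearchRangeController`, the `scale` value that interpolates between the attention range (0.0) and the full person range (1.0), and the step size are assumptions not found in the specification.

```python
from collections import deque


class SearchRangeController:
    """Narrow the search range while detection is stable, widen it otherwise."""

    def __init__(self, T, min_ratio=0.75, step=0.25):
        self.history = deque(maxlen=T)  # detection success flags, last T frames
        self.min_ratio = min_ratio      # "predetermined value" for stability
        self.step = step                # how fast the range shrinks or grows
        self.scale = 1.0                # 0.0 = attention range, 1.0 = person range

    def update(self, detected):
        """Record one frame's detection result and return the new scale."""
        self.history.append(bool(detected))
        ratio = sum(self.history) / len(self.history)
        if ratio >= self.min_ratio:
            self.scale = max(0.0, self.scale - self.step)  # stable: narrow
        else:
            self.scale = min(1.0, self.scale + self.step)  # unstable: widen
        return self.scale
```

Because `scale` never drops below 0.0, the resulting range always contains at least the attention search range, as the text requires.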

Next, the tactile metadata generation device 12 uses the skeleton trajectory feature image generation unit 124 to generate, taking the current frame image as the reference and based on the skeleton coordinate sets $P^{i}_{b}$ in the plurality of frame images, a single skeleton trajectory feature image (STI) showing only the direction of movement of each identified person skeleton (step S4).

In drawing the skeleton trajectory feature image (STI), the skeleton coordinate set $P^{i}_{b}$ in an arbitrary frame image is first written as $P^{i}_{b}(t)$. With the current frame image at $t = 0$, the skeleton coordinate set in the current frame image is written $P^{i}_{b}(0)$, and the skeleton coordinate set in the frame image T frames in the past is written $P^{i}_{b}(T)$. That is, with the frame number of the current frame image set to $t = 0$ and the frame number of the frame T frames in the past set to $t = T$, the skeleton trajectory feature image generation unit 124 can use the frame images F at $t = 0, 1, \ldots, T$, with the current frame image as the reference, to generate a single skeleton trajectory feature image (STI) showing only the direction of movement of each identified person skeleton. The skeleton trajectory feature image (STI) can be regarded as a concatenation of past optical flows relative to the current frame image, carrying time-axis information in a single image.

The value T, which is the duration of the trajectory features in the skeleton trajectory feature image (STI), can be set arbitrarily. The skeleton coordinates used to generate one skeleton trajectory feature image (STI) need not be all 30 points shown in FIG. 3; a configuration that uses only predetermined specific skeleton trajectories to improve processing speed is also possible.

The skeleton trajectory feature image (STI) is a single image generated by using the skeleton coordinates of each person in the frame images from the current frame image back through the past T frames, drawing a connected trajectory for each skeleton coordinate of each person, and lowering or raising the luminance of the drawing the further it goes into the past. Preferably, the skeleton trajectory feature image (STI) is color-coded by each skeleton coordinate of each person across the current frame and the past T frames, and the movement (transition) of each skeleton coordinate of each person is drawn with a gradation that varies frame by frame in time order.

For example, when drawing the trajectory connected for each skeleton coordinate of each person from the current frame back to the frame T frames in the past, the luminance b is defined as

$$b = 255 \times \frac{T - t}{T}$$

Alternatively, the trajectory may be drawn so that the luminance increases further into the past, in which case
b = 255 × t / T
can be used.
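
The two luminance rules can be sketched in a few lines of Python. The function name `trajectory_luminance`, the `brighter_in_past` flag, and the use of integer (floor) division to obtain an 8-bit value are assumptions made for illustration; the description gives only the formulas.

```python
def trajectory_luminance(t: int, T: int, brighter_in_past: bool = False) -> int:
    """Luminance b (0..255) for a trajectory point t frames in the past.

    t = 0 is the current frame and t = T is the oldest frame used.
    Integer division is an assumption; the text does not specify rounding.
    """
    if T <= 0:
        raise ValueError("T must be a positive number of past frames")
    if not 0 <= t <= T:
        raise ValueError("t must satisfy 0 <= t <= T")
    if brighter_in_past:
        return 255 * t // T          # b = 255 * t / T
    return 255 * (T - t) // T        # b = 255 * (T - t) / T
```

With T = 10 and the first rule, the current frame (t = 0) is drawn at 255 and the oldest frame (t = 10) at 0; the second rule reverses the two.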

Here, when t = 0 is the current frame image and the past T frames are processed (t = 0 to T), b preferably ranges from 0 to 255, and its value is expressed in a separate color for each skeleton coordinate of each person. For example, FIG. 6(a) shows an example of a skeleton trajectory feature image (STI) according to the present invention. FIG. 6(a) is shown in grayscale as used for recognition processing, but preferably, as in the explanatory diagram of the STI in FIG. 6(b), the background is drawn at the lowest luminance, "black" (or alternatively at the highest luminance, "white"), and for both person objects Op1 and Op2 the skeleton coordinates are drawn in colors chosen in advance to be distinguishable, for example "blue" for the "head" (P^1_1), (P^2_1) and "red" for the "left fingertip" (P^1_19), (P^2_19). In this embodiment, as FIG. 6(b) shows, no color coding distinguishes person objects Op1 and Op2, but the person objects may also be color-coded; for example, color-coding up to 30 skeleton coordinates for each of two persons requires defining 60 colors. The STI according to the present invention keeps the color of each skeleton coordinate of each person fixed and varies only the luminance, drawing darker (or brighter) the further back into the past from the original frame image.

Accordingly, the skeleton trajectory feature image generation unit 124 draws, as the skeleton trajectory feature image (STI), the skeleton coordinates of each person in the plural frame images color-coded per skeleton coordinate, with colors either shared across persons or distinct per person, and renders the movement (transition) of each skeleton coordinate as a frame-by-frame gradation in time order.
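
The drawing procedure described above can be sketched with NumPy as follows. This is a minimal illustration, not the implementation of unit 124: the function name `draw_sti`, the input layout of `tracks` and `joint_colors`, the simple line rasterization, and the choice to give each segment the luminance of its newer end point are all assumptions. Colors are shared across persons here, one of the two options the text allows.

```python
import numpy as np

def draw_sti(tracks, joint_colors, height, width):
    """Render one STI from per-joint position histories.

    tracks[(person_id, joint_id)] is a list of (x, y) pixel positions,
    index 0 being the current frame (t = 0) and the last index t = T.
    joint_colors[joint_id] is an (R, G, B) base color shared by all persons.
    """
    sti = np.zeros((height, width, 3), dtype=np.uint8)  # black background
    for (_person_id, joint_id), points in tracks.items():
        T = len(points) - 1
        base = np.asarray(joint_colors[joint_id], dtype=np.float64)
        # Draw the oldest segment first so newer, brighter ones overwrite it.
        for t in range(T, 0, -1):
            # Assumption: a segment takes the luminance of its newer end point.
            b = 255 * (T - (t - 1)) // T
            color = (base * b / 255).astype(np.uint8)
            (x0, y0), (x1, y1) = points[t - 1], points[t]
            n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            xs = np.rint(np.linspace(x0, x1, n)).astype(int)
            ys = np.rint(np.linspace(y0, y1, n)).astype(int)
            inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
            sti[ys[inside], xs[inside]] = color
    return sti
```

Because only the luminance changes along a trajectory while the hue stays fixed per joint, the resulting image encodes which joint moved by color and when it was there by brightness.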

Through the function of the moving object detection unit 125, the skeleton trajectory feature image generation unit 124 can also draw trajectories other than person skeletons, such as the ball in a ball game, on the skeleton trajectory feature image (STI). In that case, cues such as the ball's direction of travel are added to the features, which improves the accuracy of motion recognition.

That is, in the tactile metadata generation device 12, the moving object detection unit 125 uses each of the plural frame images to detect moving objects from difference images between adjacent frames. From the moving objects detected in each difference image, it selects those other than persons by comparison with the skeleton trajectory feature image (STI) obtained from the skeleton trajectory feature image generation unit 124. For these non-person moving objects, it generates a moving object trajectory image that connects the coordinate position, size, and movement direction obtained from each difference image, and outputs it to the skeleton trajectory feature image generation unit 124. In this case, the skeleton trajectory feature image generation unit 124 adds (composites) the moving object trajectory image onto the STI and outputs the result to the person motion recognition unit 126 (step S5).
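
A rough sketch of the frame-differencing step, assuming grayscale frames held as NumPy arrays. The function names, the `threshold` value, the boolean `person_mask` used to stand in for the comparison with the STI, and the simplification of treating all remaining moving pixels as a single object are illustrative assumptions, not details given in the description.

```python
import numpy as np

def detect_non_person_motion(prev_gray, curr_gray, person_mask, threshold=30):
    """Return center and size of moving pixels that are not on a person.

    prev_gray, curr_gray: 2-D uint8 grayscale frames of equal shape.
    person_mask: boolean array, True where the STI already marks a person.
    threshold: hypothetical difference threshold, not taken from the text.
    """
    diff = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
    moving = (diff > threshold) & ~person_mask
    ys, xs = np.nonzero(moving)
    if xs.size == 0:
        return None  # nothing moved outside the person regions
    # Simplification: treat all remaining moving pixels as one object.
    return {
        "center": (float(xs.mean()), float(ys.mean())),
        "size": (int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)),
    }

def movement_direction(prev_center, curr_center):
    """Direction of travel between two detections, as a (dx, dy) vector."""
    return (curr_center[0] - prev_center[0], curr_center[1] - prev_center[1])
```

Connecting the centers returned for successive difference images gives the trajectory that would be composited onto the STI.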

The moving object detection unit 125 is not needed in cases such as judo, where no moving objects other than the persons involved in the competition exist (although providing it as a process does no harm). When such moving objects do exist (for example, the shuttlecock and rackets in badminton, or the ball and rackets in table tennis and tennis), it detects the trajectory of each non-person moving object, generates a moving object trajectory image, and has the skeleton trajectory feature image generation unit 124 add (composite) that image onto the skeleton trajectory feature image (STI). As a result, when moving objects other than the persons involved in the competition exist, more information related to the movement of those persons becomes available, so the recognition accuracy of the downstream person motion recognition unit 126 improves. Providing the moving object detection unit 125 therefore lets the same processing handle any sport, so a versatile tactile metadata generation device 12 can be configured.

Next, in the tactile metadata generation device 12, the person motion recognition unit 126 recognizes a specific motion of a person with a CNN (convolutional neural network) that takes the skeleton trajectory feature image (STI) as input, and detects predetermined impact presentation information for operating the tactile presentation devices 14L and 14R (step S6). The impact presentation information includes the identification and position coordinates of each person in the current frame image (and, for team sports, the team classification, which does not apply in this example because the sport is judo), as well as information indicating the timing and speed at which to operate the tactile presentation device.

For machine learning with the CNN, skeleton trajectory feature images (STI) for training are created in advance and used to train the network. The recognition processing in the person motion recognition unit 126 thus uses a CNN (convolutional neural network), one form of deep learning. A CNN is a neural network with many deep layers and performs particularly well in image recognition. It is built by stacking layers with characteristic functions, such as "convolution layers" and "pooling layers", and is now used in a wide range of fields. The convolution layers provide high accuracy, and the pooling layers provide generality that does not depend on the shooting angle of view.
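
To make the two layer types concrete, here is a minimal NumPy sketch of a single-channel convolution and a max-pooling step. It illustrates the generic operations only; the description does not specify the network architecture used by unit 126, and the function names are assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2-D cross-correlation without padding, as CNN layers compute."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=np.float64)
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(feature, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = feature.shape[0] // size, feature.shape[1] // size
    blocks = feature[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))
```

Max pooling returns the same output when a response shifts by a pixel inside one pooling window, which is one source of the tolerance to position and framing that the text attributes to pooling layers.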

Analyzing the skeleton trajectory feature image (STI) with this CNN makes it possible to identify motion events such as "grappling", "throw", and "groundwork" with high accuracy, independent of the size and position at which the athletes are shot. Generating tactile metadata for controlling the tactile presentation devices 14L and 14R from this information makes it possible to present tactile stimuli even during real-time viewing of sports video.

Finally, in the tactile metadata generation device 12, the metadata generation unit 127 generates, for the current frame image, tactile metadata (for impact presentation) and outputs it to the control unit 13 frame by frame (step S7). This metadata contains the predetermined impact presentation information obtained from the person motion recognition unit 126: the identification and position coordinates of each person in the current frame image (and, for team sports, the team classification, which does not apply in this example because the sport is judo), together with the information indicating the timing and speed at which to operate the tactile presentation device.
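
One possible shape for a per-frame tactile metadata record is sketched below. The description lists what the metadata contains but not how it is encoded, so every class name, field name, the boolean `trigger` used to represent timing, and the JSON serialization are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class PersonImpact:
    person_id: int            # identification of the person
    position: tuple           # (x, y) position coordinates in the frame
    team: Optional[str]       # team classification; None for judo
    trigger: bool             # operate the device on this frame (timing, assumed encoding)
    speed: float              # speed information for the stimulus

@dataclass
class TactileMetadata:
    frame_index: int
    persons: List[PersonImpact]

    def to_json(self) -> str:
        """Serialize one frame's metadata for the control unit (format assumed)."""
        return json.dumps(asdict(self))
```

For example, `TactileMetadata(0, [PersonImpact(1, (120, 80), None, True, 0.7)]).to_json()` yields one JSON object per frame, matching the frame-by-frame output described in step S7.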

(Experimental verification)
An evaluation experiment was conducted to demonstrate the effectiveness of the tactile metadata generation device 12 according to the present invention.
FIG. 7(a) schematically shows an example of one frame image, FIG. 7(b) shows an example bone image of the prior art, FIG. 7(c) shows an example SklMHI (Skeleton Motion History Image) of the prior art, and FIG. 7(d) shows an example skeleton trajectory feature image (STI) according to the present invention. FIG. 8 shows a comparative evaluation of the person-motion detection accuracy of the prior-art bone image, the prior-art SklMHI, and the STI according to the present invention.

First, before the comparative evaluation, about 2,000 images each of positive and negative examples were created from judo match video (see FIG. 7(a)) for the prior-art bone image (see FIG. 7(b)), the prior-art SklMHI (see FIG. 7(c)), and the skeleton trajectory feature image (STI) according to the present invention (see FIG. 7(d)), and a CNN was trained in advance on each.

FIG. 8 shows the results of classification on a different match video. It compares the classification results for four match situations (scene classes), namely "standing", "throw", "groundwork", and "matte" (stop), and the detection results for the "throw" motion, giving the precision, the recall, and their combined index, the F-measure. For both the classification of the four match situations and the detection accuracy of "throw", training with the skeleton trajectory feature image (STI) according to the present invention gave the best results. This confirmed that the tactile metadata generation device 12 using the STI is more effective than motion recognition using the prior-art bone image or the prior-art SklMHI. SklMHI was also evaluated with color coding per skeleton coordinate, yet the STI still gave better motion recognition accuracy. The likely reason is that in SklMHI (and likewise in the bone image) the connecting lines that join the skeleton points adversely affect motion recognition.
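
For reference, the three reported indices can be computed from detection counts as sketched below, assuming the standard balanced F-measure (the harmonic mean of precision and recall). The function name and the counts in the usage note are illustrative and are not values from FIG. 8.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Precision, recall, and their harmonic mean (F-measure) from detection counts.

    tp: correct detections, fp: false detections, fn: missed events.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

With hypothetical counts of 8 correct "throw" detections, 2 false detections, and 2 missed throws, precision, recall, and F-measure all come out at 0.8.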

(Control unit)
FIG. 9 is a block diagram showing the schematic configuration of the control unit 13 in the video-tactile interlocking system 1 of an embodiment of the present invention. The control unit 13 includes a metadata receiving unit 131, an analysis unit 132, a storage unit 133, and drive units 134-1 and 134-2.

The metadata receiving unit 131 is a functional unit that receives the tactile metadata (for impact presentation) from the tactile metadata generation device 12 and outputs it to the analysis unit 132. The tactile metadata includes the identification and position coordinates of each person in the current frame image (and, for team sports, the team classification), as well as information indicating the timing and speed at which to operate the tactile presentation device.

The analysis unit 132 is a functional unit that, based on the tactile metadata obtained from the tactile metadata generation device 12, refers to predetermined drive reference data and controls, via the drive units 134-1 and 134-2, the driving of the vibration actuator 142 of each corresponding tactile presentation device 14L, 14R. For example, from the person identification, the position coordinates (and team classification), and the timing and speed for operating the tactile presentation device given in the tactile metadata, the analysis unit 132 refers to the predetermined drive reference data to determine the operation timing, strength, and operating time of the vibration actuator 142 of the tactile presentation device 14L, and controls its drive accordingly.

The storage unit 133 stores the predetermined drive reference data for controlling the driving of the drive units 134-1 and 134-2 based on the tactile metadata. The drive reference data expresses, as a predetermined table or function, the operation timing, strength, and operating time of the vibration actuator 142 as the tactile stimulus associated with the tactile metadata. The storage unit 133 also stores a program for realizing the functions of the control unit 13; the computer constituting the control unit 13 reads and executes this program to realize those functions.
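
A table-plus-function form of the drive reference data might look like the sketch below. The description does not disclose the table contents, so the event keys, the strength and duration values, the normalized `speed` input, and the name `lookup_drive` are all hypothetical; in particular, keying the table on a motion-event label assumes the metadata carries such a label.

```python
# Hypothetical drive reference table: motion event -> (strength 0..1, duration in ms).
DRIVE_REFERENCE = {
    "throw": (1.0, 300),
    "grappling": (0.4, 100),
    "groundwork": (0.6, 200),
}

def lookup_drive(event: str, speed: float):
    """Return (strength, duration_ms) for an event, scaling strength by speed.

    speed is assumed to be normalized to 0..1; unknown events produce no drive.
    """
    if event not in DRIVE_REFERENCE:
        return None
    strength, duration_ms = DRIVE_REFERENCE[event]
    speed = min(max(speed, 0.0), 1.0)  # clamp to the assumed range
    return strength * speed, duration_ms
```

The table supplies the base strength and operating time, and the function part scales the strength by the speed carried in the metadata, mirroring the "table or function" wording of the text.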

The drive units 134-1 and 134-2 are drivers that drive the vibration actuator 142 of each tactile presentation device 14L, 14R.

Thus, the video-tactile interlocking system 1 including the tactile metadata generation device 12 of this embodiment can automatically extract person objects from video and automatically generate, in synchronization, tactile metadata corresponding to the dynamic person objects, so the tactile presentation devices can be linked with the video.

In particular, the tactile metadata generation device 12 of this embodiment takes camera video as input, analyzes it, and outputs the tactile metadata described above. To do so, it first analyzes the video to detect the person skeletons in it, uses the detected skeleton information to identify the persons to be analyzed, and then draws the skeleton trajectory feature image (STI) from the history of the target persons' skeleton positions. Trajectory images of non-person moving objects such as a ball may be composited onto this STI.

The tactile metadata generation device 12 of this embodiment then uses a CNN that takes this skeleton trajectory feature image (STI) as input to recognize the athletes' motion events ("throw", "grappling", and so on) and the state of each motion, and generates the corresponding tactile metadata (for impact presentation) for the person motion. Based on the tactile metadata (for impact presentation) obtained from the tactile metadata generation device 12, the control unit 13 sorts the stimuli so that vibration corresponding to the movement of person object Op1 (an athlete) in the video is presented by the tactile presentation device 14L and vibration corresponding to the movement of person object Op2 (an athlete) is presented by the tactile presentation device 14R.
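
The per-person routing can be sketched as a simple mapping. Only the assignment of Op1 to 14L and Op2 to 14R comes from the description; the function name, the input format, and the decision to ignore persons without an assigned device are assumptions.

```python
# Person object -> tactile presentation device, as described: Op1 -> 14L, Op2 -> 14R.
DEVICE_FOR_PERSON = {"Op1": "14L", "Op2": "14R"}

def route_stimuli(person_events):
    """Group per-person stimuli by the device that should present them.

    person_events: list of (person_label, stimulus) pairs for one frame.
    Persons without an assigned device are ignored (an assumption).
    """
    commands = {"14L": [], "14R": []}
    for person_label, stimulus in person_events:
        device = DEVICE_FOR_PERSON.get(person_label)
        if device is not None:
            commands[device].append(stimulus)
    return commands
```

A viewer holding one device in each hand would then feel each athlete's movements separately, one per hand.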

Accordingly, the video-tactile interlocking system 1 of this embodiment can convey continuously not only motion events such as "throw" but also situations such as the athletes pushing and pulling. It can convey the tension of the match to people with disabilities as well, and can heighten the sense of presence.

Incidentally, analyzing a conventional image called a Motion History Image (MHI) makes it possible to judge basic movements of a person such as "spreading the arms", "crouching", and "raising a hand", but because the individual joints of the person are not measured, recognition is limited to large whole-body motions. In contrast, the skeleton trajectory feature image (STI) according to the present invention is an image of image features that can be regarded as an improved version of the MHI. Because it draws the trajectory of each person's skeleton, optionally together with trajectory information of tracked non-person moving objects, it enables highly accurate recognition with reduced influence from background noise. And because the image is created from the transitions of each person's skeleton coordinates, not only whole-body movement but also partial motions of the hands and feet can be recognized with high accuracy.

In particular, skeleton detection algorithms go no further than pose estimation on individual still images, whereas the skeleton trajectory feature image (STI) according to the present invention treats the transition of each skeleton position as a trajectory feature and expresses this trajectory feature in a single image, which makes motion classification by a CNN possible. In other words, using the STI as input supplies the time-axis features that were difficult for a CNN to handle, enabling highly accurate recognition of person motion.

The tactile metadata generation device 12 of the embodiment described above can be implemented as a computer, and the program for causing the computer to realize each component according to the present invention is stored in a memory provided inside or outside the computer. Under the control of a central processing unit (CPU) or the like provided in the computer, the program describing the processing for realizing the function of each component is read from the memory as appropriate, so that the computer realizes the functions of the components of the tactile metadata generation device 12 of this embodiment. The function of each component may also be realized by part of the hardware.

The present invention has been described above with reference to a specific embodiment, but it is not limited to that embodiment and can be modified in various ways without departing from its technical idea. For example, the embodiment above was described mainly using video analysis of judo, but the invention is widely applicable to badminton, table tennis, and various other sports, as well as to non-sports video. It can, for example, improve services such as public viewing with tactile information, entertainment, and future tactile broadcasting. Outside sports, it can also be applied to uses such as tactile alarms in factories and security systems based on surveillance camera video analysis. The present invention is therefore not limited to the embodiment described above and is limited only by the claims.

According to the present invention, person objects can be automatically extracted from video and tactile metadata corresponding to dynamic person objects can be automatically generated in synchronization, so the invention is useful for applications that link tactile presentation devices with video.

1 Video-tactile interlocking system
10 Video output device
11 Display
12 Tactile metadata generation device
13 Control unit
14L, 14R Tactile presentation device
121 Multi-frame extraction unit
122 Person skeleton extraction unit
123 Person identification unit
124 Skeleton trajectory feature image generation unit
125 Moving object detection unit
126 Person motion recognition unit
127 Metadata generation unit
131 Metadata receiving unit
132 Analysis unit
133 Storage unit
134-1, 134-2 Drive unit
141 Case
142 Vibration actuator

Claims

A tactile metadata generation device that extracts person objects from video and generates tactile metadata corresponding to dynamic person objects, comprising:
multi-frame extraction means for extracting, from the input video, plural frame images including a current frame image and a predetermined number of past frame images;
person skeleton extraction means for generating, for each of the plural frame images, a first skeleton coordinate set of each person object based on a skeleton detection algorithm;
person identification means for, for each of the plural frame images, variably setting a search range based on the first skeleton coordinate set, identifying the person objects by extracting the position and size of the skeleton of each person object and its surrounding image information, and generating a second skeleton coordinate set with person IDs assigned;
skeleton trajectory feature image generation means for generating, with the current frame image as the reference and based on the second skeleton coordinate set in each of the plural frame images, a single skeleton trajectory feature image showing only the direction of movement of each identified person skeleton;
person motion recognition means for recognizing a specific motion of a person with a convolutional neural network that takes the skeleton trajectory feature image as input, and detecting predetermined impact presentation information for operating a tactile presentation device; and
metadata generation means for generating, in correspondence with the current frame image, tactile metadata including the impact presentation information, and outputting it externally frame by frame.

The tactile metadata generation device according to claim 1, wherein the skeleton trajectory feature image generation means generates, as the skeleton trajectory feature image, a single image by drawing a connected trajectory for each skeleton coordinate of each person in the plural frame images while lowering or raising the luminance toward the past.

The tactile metadata generation device according to claim 1 or 2, wherein the skeleton trajectory feature image generation means generates, as the skeleton trajectory feature image, a single image in which the skeleton coordinates of each person in the plural frame images are color-coded per skeleton coordinate, with colors either shared across persons or distinct per person, and the movement of each skeleton coordinate of each person is drawn with a frame-by-frame gradation in time order.

The tactile metadata generation device according to any one of claims 1 to 3, wherein the person identification means has means for variably setting the search range, limiting it at the maximum to a person search range surrounding the entire person skeleton and narrowing it at the minimum to an attention search range covering a predetermined region of the person skeleton, determines the search range so as to include at least the attention search range based on a state transition estimate of the person's skeleton obtained by a state estimation algorithm, and performs the process of identifying the person objects.

The tactile metadata generation device according to any one of claims 1 to 4, further comprising moving object detection means for detecting moving objects from difference images between adjacent frames using each of the plural frame images, selecting, from among the moving objects detected in each difference image, the moving objects other than persons by comparison with the skeleton trajectory feature image showing only the direction of movement of each identified person skeleton, and generating, for the moving objects other than persons, a moving object trajectory image that connects the coordinate position, size, and movement direction obtained from each difference image,
wherein the person motion recognition means recognizes the specific motion of the person with a convolutional neural network that takes as input the skeleton trajectory feature image, showing only the direction of movement of each identified person skeleton, onto which the moving object trajectory image has been added and composited.

A video-tactile interlocking system comprising:
the tactile metadata generation device according to any one of claims 1 to 5;
a tactile presentation device that presents tactile stimuli; and
a control unit that, based on the tactile metadata obtained from the tactile metadata generation device, refers to predetermined drive reference data and controls driving of the tactile presentation device.

A program for causing a computer to function as the tactile metadata generation device according to any one of claims 1 to 5.