JP2021092891A

JP2021092891A - Information processor, information processing method, image processing system, and program

Info

Publication number: JP2021092891A
Application number: JP2019222034A
Authority: JP
Inventors: 和文小沼; Kazufumi Konuma
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2021-06-17

Abstract

To provide an information processor that can accurately estimate the positions of human body parts for multiple subjects, even in games played on large fields such as soccer and rugby. SOLUTION: A server 270 serving as an information processor comprises: an estimation image generation part 273 that acquires shape data indicating a three-dimensional shape of a person object, generated from a plurality of captured images obtained by imaging the person object from a plurality of directions, and that generates, based on the acquired shape data, a distance image indicating the distance to the person object under a predetermined viewpoint condition; and a human body part estimation part 274 that estimates the positions of human body parts in the person object using the distance image and a trained model, obtained through prior training, for estimating the positions of human body parts. SELECTED DRAWING: Figure 3

Description

The present disclosure relates to a technique for estimating the positions of parts of the human body from a distance image.

In recent years, the body movements of athletes have been captured and analyzed for purposes such as improving sports skills. In team sports such as soccer and rugby, there is a need to analyze the overall formation in addition to the movement of each individual player. As a technique for analyzing player movement, a method has been proposed in which a distance image of a subject (player) is acquired with a so-called depth camera, and the joint positions and the like of the player are estimated using a trained model, obtained through prior training, for estimating joint positions and the like (Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Publication No. 2015-228188
Patent Document 2: WO2017/187641
Patent Document 3: Japanese Unexamined Patent Publication No. 2017-211828

Non-Patent Document 1: Shotton, Jamie. "Real-time human pose recognition in parts from single depth images", IEEE Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1297-1304.
Non-Patent Document 2: Alireza Shafaei, James J. Little. "Real-Time Human Motion Capture with Multiple Depth Cameras", IEEE 2016 13th Conference on Computer and Robot Vision (CRV), pp. 24-31.

With the method of Patent Document 1, the accuracy of estimating joint positions and the like is strongly affected by the performance, usable imaging distance, and resolution of the image sensor mounted on the depth camera. Moreover, the effective imaging range of a typical depth camera is from several centimeters to a little over ten meters. Estimating the joint positions and the like of each player in field sports such as soccer and rugby, which cover a far larger area, would therefore require depth cameras to be installed inside the field as well. In practice, however, installing a large number of depth cameras within a vast field is difficult to achieve.

An object of the present invention is to enable accurate estimation of the parts of a human body that is the subject, with a simple configuration.

An information processing apparatus according to the present disclosure comprises: acquisition means for acquiring shape data indicating a three-dimensional shape of a person object, the shape data being generated using a plurality of captured images obtained by imaging the person object from a plurality of directions; generation means for generating, based on the shape data acquired by the acquisition means, a distance image indicating the distance to the person object under a predetermined viewpoint condition; and part estimation means for estimating the positions of human body parts in the person object based on the distance image generated by the generation means, using a trained model, obtained through prior training, for estimating the positions of human body parts.

According to the present invention, the parts of a human body that is the subject can be estimated accurately with a simple configuration.

FIG. 1 is a diagram showing an example of the configuration of an image processing system.
FIG. 2 is a diagram showing an example arrangement of the imaging modules.
FIG. 3 is a block diagram showing the functional configuration of the server according to Embodiment 1.
FIG. 4A is a diagram showing an example of shape data, and FIG. 4B is a diagram showing an example of individual shape data.
FIG. 5 is a diagram explaining the separation process applied to the shape data.
FIG. 6A is a diagram showing an example of an estimation distance image, and FIG. 6B is a diagram showing an example of human body part information.
FIGS. 7A and 7B are diagrams showing examples of color-coded images used as teacher data.
FIG. 8A is an explanatory diagram of input and output in the learning phase, and FIG. 8B is an explanatory diagram of input and output in the estimation phase.
FIG. 9A is a diagram explaining how a distance image is acquired with a conventional depth camera, and FIG. 9B is a diagram outlining how a learning distance image according to the present disclosure is created.
FIG. 10 is a block diagram showing the functional configuration of the server according to Embodiment 2.
FIG. 11 is a diagram explaining a method of identifying the direction in which a person object is facing.
FIG. 12 is a block diagram showing the functional configuration of the server according to a modification of Embodiment 2.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The following embodiments do not limit the present invention, and not all combinations of the features described in the embodiments are essential to the solution provided by the present invention. Identical components are denoted by the same reference numerals.

[Embodiment 1]
(Basic system configuration and operation)
FIG. 1 is a diagram showing an example of the configuration of an image processing system that estimates the joint positions and the like of a subject. The image processing system 100 includes imaging modules 110a to 110p, a database (DB) 250, a server 270, a control device 300, a switching hub 180, and an end user terminal 190. In other words, the image processing system 100 has three functional domains: a video collection domain, a data storage domain, and a video generation domain. The video collection domain includes the imaging modules 110a to 110p, the data storage domain includes the DB 250 and the server 270, and the video generation domain includes the control device 300 and the end user terminal 190.

The control device 300 manages the operating state of, and controls parameter settings for, each block constituting the image processing system 100 over the network.

First, the operation of transmitting the captured images of the 16 sets of imaging modules 110a to 110p from the imaging module 110p to the server 270 will be described. The imaging modules 110a to 110p each have one camera, 112a to 112p respectively. In the following, the 16 sets of imaging modules 110a to 110p may be referred to simply as "imaging module 110" without distinction. Likewise, the devices within each imaging module 110 may be referred to as "camera 112" and "camera adapter 120". The cameras 112a to 112p capture images in highly accurate synchronization with one another based on a synchronization signal from the control device 300. The imaging modules 110a to 110p are installed so as to surround a ground or the like, for example as shown in FIG. 2. Although 16 sets of imaging modules 110 are used here, this is merely an example and the number is not limited to 16.

The imaging modules 110a to 110p are connected in a daisy chain. The connection topology is arbitrary; for example, a star network configuration may be used in which each of the imaging modules 110a to 110p is connected to the switching hub 180 and data is exchanged between the imaging modules 110 via the switching hub 180.

In the present embodiment, the camera 112 and the camera adapter 120 are separate, but they may be integrated in a single housing. The captured image obtained by the camera 112a in the imaging module 110a undergoes predetermined image processing such as foreground/background separation in the camera adapter 120a and is then transmitted to the camera adapter 120b of the imaging module 110b. Similarly, the imaging module 110b transmits the captured image obtained by the camera 112b, together with the captured image acquired from the imaging module 110a, to the imaging module 110c. By continuing this operation, the captured images (including foreground images) for all 16 sets are passed from the imaging module 110p to the switching hub 180 and then transmitted to the server 270. In the following description, the subject is referred to as a "person object".

(Functional configuration of server)
FIG. 3 is a block diagram showing the functional configuration of the server 270 as an information processing apparatus that estimates the joint positions and the like of a person object according to the present embodiment. The server 270 has a three-dimensional shape derivation unit 271, an object separation unit 272, an estimation image generation unit 273, a human body part estimation unit 274, and a learning unit 275. The server 270 also has the hardware found in a general-purpose computer, namely a CPU, RAM, ROM, an external I/F, a mass storage area, and the like. The CPU realizes the functions of the units shown in FIG. 3 by executing a predetermined program stored in the RAM or ROM. In the present embodiment, the functions of these units are realized by a single server 270, but they may instead be distributed across a plurality of servers. For example, the functions may be divided between a server responsible for the two functions of the three-dimensional shape derivation unit 271 and the object separation unit 272, and a server responsible for the three functions of the estimation image generation unit 273, the human body part estimation unit 274, and the learning unit 275.

The data of the plurality of captured images input to the server 270 (captured images corresponding to a plurality of viewpoints obtained by synchronized capture) are first input to the three-dimensional shape derivation unit 271. The three-dimensional shape derivation unit 271 extracts the silhouette of each person object from each of the captured images, which have different viewpoints, and uses the resulting silhouette images to generate shape data representing the three-dimensional shape of the person object by a technique such as the visual hull (volume intersection) method. Such shape data is also generally called a three-dimensional model. The present embodiment assumes, as the shooting scene, a soccer or rugby match in which a plurality of players move around a vast field, and each captured image is assumed to contain a plurality of players. That is, the shape data generated by the three-dimensional shape derivation unit 271 includes portions representing the three-dimensional shape of each of a plurality of person objects. FIG. 4A shows an example of the shape data generated by the three-dimensional shape derivation unit 271. As shown in FIG. 4A, in the shape data, the three-dimensional shape of each person object is represented, for example, as a set (cluster) of unit cubes called voxels. In other words, the shape data generated by the three-dimensional shape derivation unit 271 includes a plurality of voxel groups corresponding to a plurality of players. Instead of the voxel format, another format such as a point cloud format or a polygon format may be used to represent the three-dimensional shape.
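As an illustration only, the silhouette-based shape derivation described above can be sketched in a few lines of Python. The function name `carve_visual_hull`, the two orthographic "cameras", and the tiny 2x2x2 grid are all assumptions made for this sketch; an actual system would project voxels with calibrated camera parameters, which the source does not detail.

```python
def carve_visual_hull(candidates, views):
    """Keep the candidate voxels whose projection is foreground in every view.

    candidates: iterable of (x, y, z) voxel coordinates to test.
    views: list of (project, silhouette) pairs, where project maps a voxel
           to a (row, col) pixel and silhouette is the set of foreground pixels.
    """
    hull = []
    for voxel in candidates:
        if all(project(voxel) in silhouette for project, silhouette in views):
            hull.append(voxel)
    return hull


# Hypothetical orthographic cameras: one looks along y, the other along x.
def front_view(voxel):
    x, y, z = voxel
    return (z, x)


def side_view(voxel):
    x, y, z = voxel
    return (z, y)


grid = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
front_silhouette = {(0, 0), (1, 0)}                  # only the x == 0 column is foreground
side_silhouette = {(0, 0), (0, 1), (1, 0), (1, 1)}   # every pixel is foreground
shape_data = carve_visual_hull(
    grid, [(front_view, front_silhouette), (side_view, side_silhouette)]
)
```

A voxel survives only if no view sees background at its projected pixel, so adding more views can only shrink the resulting voxel set.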

The object separation unit 272 performs a separation process that cuts out the region corresponding to each individual player from the shape data generated by the three-dimensional shape derivation unit 271, and generates separate shape data representing the three-dimensional shape of each person object (hereinafter referred to as "individual shape data"). FIG. 5 is a diagram explaining the separation process applied to the shape data. In FIG. 5, the dash-dot rectangle 501 indicates the shape data generated by the three-dimensional shape derivation unit 271 before the separation process. Here, the shape data 501 contains voxel groups 502 and 503 corresponding to two players. At this point no association has been made between voxels, and there is no distinction as to which voxel belongs to which person object. Therefore, for each voxel, the distance to the other voxels is obtained, and a determination process is performed that identifies voxels that are in contact with each other (that is, at distance zero) or within a predetermined distance of each other as belonging to the same person object. For example, in the example of FIG. 5, when the voxel of interest is voxel 504, each voxel constituting the voxel group 503 is determined to belong to the same person object. On the other hand, each voxel constituting the voxel group 502, which is at least a certain distance away from every voxel constituting the voxel group 503, is determined to belong to a different person object. Performing this separation process generates individual shape data corresponding to each person object. The separation method is not limited to the above example. For example, the shape data may be projected onto a two-dimensional image from a virtual viewpoint, the distance between the regions corresponding to the voxel groups may be obtained on that image, and the groups may be separated when they are at least a certain distance apart. FIG. 4B shows an example of individual shape data generated by separating out the portion corresponding to the voxel group 401 in the shape data shown in FIG. 4A. The individual shape data obtained in this way for each person object is sent to the estimation image generation unit 273.
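The distance-based grouping described above amounts to finding connected clusters of voxels. The following is a minimal sketch under stated assumptions: the function name `separate_objects` and the parameter `max_gap` are hypothetical, and "within a predetermined distance" is interpreted here as differing by at most `max_gap` grid cells on every axis, which the source does not specify.

```python
from collections import deque


def separate_objects(voxels, max_gap=1):
    """Group voxels into clusters, one per person object.

    voxels: iterable of (x, y, z) integer grid coordinates.
    max_gap: voxels whose coordinates differ by at most this many cells on
             every axis are treated as the same object (assumed metric).
    """
    remaining = set(voxels)
    offsets = [(dx, dy, dz)
               for dx in range(-max_gap, max_gap + 1)
               for dy in range(-max_gap, max_gap + 1)
               for dz in range(-max_gap, max_gap + 1)
               if (dx, dy, dz) != (0, 0, 0)]
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster = [seed]
        queue = deque([seed])
        # Breadth-first search: absorb every voxel reachable from the seed.
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in offsets:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    cluster.append(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(cluster))
    return sorted(clusters)


# Two small blobs separated by a wide gap, standing in for two players.
scene = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (10, 0, 0), (10, 0, 1)]
individual_shapes = separate_objects(scene)
```

Raising `max_gap` makes the grouping more tolerant of holes in a reconstructed body, at the cost of merging players who stand close together.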

Based on the individual shape data received from the object separation unit 272, the estimation image generation unit 273 generates a distance image for estimating the positions of human body parts (an estimation distance image), which indicates the distance to the person object under a predetermined viewpoint condition. FIG. 6A shows an example of an estimation distance image generated from the individual shape data of FIG. 4B. In the distance image of FIG. 6A, pixels closer to white indicate a shorter distance (nearer the viewer) and pixels closer to black indicate a longer distance (farther away). The more estimation distance images are generated (that is, the more viewpoints are used), the higher the estimation accuracy in the subsequent human body part estimation phase. The predetermined "viewpoint condition" is described later. The data of the one or more estimation distance images generated for each person object is sent to the human body part estimation unit 274.
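One way to picture the generation of a distance image from shape data is a simple pinhole projection with a depth buffer. The sketch below is an assumption-laden illustration, not the method of the source: the function `render_distance_image`, its parameters, the level camera, and the use of depth along the optical axis as the "distance" are all choices made here for brevity.

```python
import math


def render_distance_image(points, cam_pos, yaw_deg, fov_deg=45.0, size=32):
    """Render a size x size distance image of 3D points from a virtual camera.

    The camera sits at cam_pos, is level with the ground (z is up), and looks
    along the horizontal direction yaw_deg (0 means the +x axis).
    Each pixel stores the smallest depth along the optical axis; pixels that
    see no geometry stay at math.inf.
    """
    yaw = math.radians(yaw_deg)
    forward = (math.cos(yaw), math.sin(yaw), 0.0)
    right = (math.sin(yaw), -math.cos(yaw), 0.0)   # forward x up, with up = +z
    focal = (size / 2) / math.tan(math.radians(fov_deg) / 2)
    image = [[math.inf] * size for _ in range(size)]
    for p in points:
        rel = (p[0] - cam_pos[0], p[1] - cam_pos[1], p[2] - cam_pos[2])
        depth = sum(r * f for r, f in zip(rel, forward))
        if depth <= 0:
            continue  # point is behind the camera
        lateral = sum(r * s for r, s in zip(rel, right))
        col = math.floor(size / 2 + focal * lateral / depth)
        row = math.floor(size / 2 - focal * rel[2] / depth)
        if 0 <= row < size and 0 <= col < size:
            # Depth buffer: keep only the nearest surface at each pixel.
            image[row][col] = min(image[row][col], depth)
    return image


# Two voxel centers on the same line of sight, plus one behind the camera.
voxel_centers = [(0.0, 0.0, 1.5), (1.0, 0.0, 1.5), (-9.0, 0.0, 1.5)]
image = render_distance_image(voxel_centers, cam_pos=(-5.0, 0.0, 1.5), yaw_deg=0.0)
```

Mapping small depths to white and large depths to black would give a grayscale picture in the convention of FIG. 6A.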

The human body part estimation unit 274 takes as input the one or more estimation distance images received from the estimation image generation unit 273 and estimates the joint positions and the like of each person object using the trained model, provided by the learning unit 275, for estimating the position of each part of the human body. The estimation result of the human body part estimation unit 274 is obtained as information in which the position of each part of the human body, such as the shoulders, elbows, knees, waist, wrists, and ankles, is represented by a point in three-dimensional coordinates (hereinafter referred to as "human body part information"). FIG. 6B shows an example of the human body part information output as the estimation result based on the estimation distance image of FIG. 6A. Processing may be sped up by providing a plurality of human body part estimation units 274 and processing in parallel the plurality of individual shape data cut out for each person object.

The learning unit 275 performs machine learning using, as input data, learning distance images that represent the distance to each person object and are created in advance by CG or the like, and using, as teacher data, images color-coded by human body part (color-coded images) such as those shown in FIGS. 7A and 7B. The algorithm used for machine learning is not particularly limited; for example, the random forest described in Non-Patent Document 1 or the CNN described in Non-Patent Document 2 can be applied. This machine learning generates the trained model used to estimate the joint positions and the like of a person object. The generated trained model is provided to the human body part estimation unit 274.

In the present embodiment, the learning unit 275 is provided in the server 270, but the machine learning may instead be performed by an external device outside the system, with the resulting trained model held by the human body part estimation unit 274.

(Learning and estimation)
FIG. 8A is an explanatory diagram of input and output in the learning phase of the present embodiment. The input data X_t is the data of the learning distance image generated by the learning unit 275. In the learning model, each part of the human body is predicted by focusing on information that can be extracted from the learning distance image, such as the distribution of pixel values and the shape of the person object. Specifically, as described in Patent Document 2, learning is first performed to discriminate the parts of the human body in the person object appearing in the input learning distance image, using the color-coded images of human body parts. Next, learning is performed to predict joint positions and the like, such as the elbows and shoulders, by obtaining the three-dimensional centroid of the pixels for each discriminated part. Learning is then repeated so that the deviation L between the predicted joint positions and the like and the parts of the human body specified by the color-coded image serving as the teacher data T is minimized.
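The centroid step and the deviation measure described above can be sketched as follows. This is a hedged illustration: the function names `part_centroids` and `mean_deviation` are hypothetical, the part labels are assumed to have been predicted already, and the source does not define the exact form of the deviation L, so a mean Euclidean distance is assumed here.

```python
import math
from collections import defaultdict


def part_centroids(labeled_points):
    """Return the 3D centroid of the points assigned to each body part.

    labeled_points: iterable of (part_label, (x, y, z)) pairs, for example
    the back-projected pixels of a distance image after part classification.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for label, (x, y, z) in labeled_points:
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
    return {label: (sx / n, sy / n, sz / n)
            for label, (sx, sy, sz, n) in sums.items()}


def mean_deviation(predicted, reference):
    """Assumed form of the deviation L: mean Euclidean distance over shared parts."""
    shared = predicted.keys() & reference.keys()
    if not shared:
        raise ValueError("no body parts in common")
    return sum(math.dist(predicted[k], reference[k]) for k in shared) / len(shared)


pixels = [("elbow", (0.0, 0.0, 0.0)), ("elbow", (2.0, 0.0, 0.0)),
          ("shoulder", (1.0, 1.0, 1.0))]
joints = part_centroids(pixels)
reference = {"elbow": (1.0, 0.0, 0.0), "shoulder": (1.0, 1.0, 2.0)}
deviation = mean_deviation(joints, reference)
```

In training, a value such as `deviation` would be the quantity driven toward its minimum over repeated iterations.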

FIG. 8B is an explanatory diagram of input and output in the estimation phase of the present embodiment. The input data X_i is the data of the estimation distance image generated by the estimation image generation unit 273. The output data Y_i, obtained as the result of estimation using the trained model provided by the learning unit 275, is the human body part information described above.

(About the "viewpoint condition" when generating an estimation distance image)
The predetermined "viewpoint condition" applied when the estimation image generation unit 273 generates an estimation distance image is set either by user input via an operation unit (not shown) or by reading viewpoint information prepared in advance. It is desirable that this viewpoint condition match, as closely as possible, the viewpoint adopted when creating the learning distance images used in the learning phase. The reason is explained below.

Before going into the details, consider the case where a distance image is acquired with a conventional depth camera and joint positions and the like are estimated from that distance image. FIG. 9A is a diagram explaining how distance images are obtained by capture with conventional depth cameras. FIG. 9A shows a player 901 being captured by two depth cameras 902 and 903 installed at different positions. In the distance image 904 obtained by the depth camera 902, which is installed close to the player 901, only the upper body of the player 901 appears. In contrast, in the distance image 905 obtained by the depth camera 903, which is installed at a high position away from the player 901, the whole body of the player 901 appears small. Thus, with conventional depth cameras, the content of the resulting distance image varies greatly depending on factors such as the distance to the person object, the installation height of the camera, the direction in which the person object is facing, and the optical characteristics of the camera. In other words, the size and appearance of the person object in the distance image depend heavily on the performance of the depth camera, its installation environment, and so on. Obtaining a highly reliable trained model by learning to estimate joint positions and the like from distance images of such varied content, captured under differing installation environments, would require an enormous amount of learning data.

In view of the above problem, the present embodiment places certain restrictions on the viewpoint condition used when the distance images serving as input data in the learning phase are created by CG or the like. FIG. 9B is a diagram outlining how a learning distance image according to the present embodiment is created. FIG. 9B illustrates a case in which a virtual camera 911 is assumed to be located 5 m away from a person object 910, pointing horizontally from a height of 1.5 m above the ground, and a distance image as seen from there with an angle of view of 45 degrees is created by CG. Optical characteristics such as resolution are also determined in advance. Learning distance images are then created from a plurality of directions obtained by dividing 360 degrees equally around the person object 910 (for example, 12 directions in 30-degree steps in the case of 12 equal divisions). With the virtual viewpoint used for CG creation fixed in this way, the learning distance images are created, for example, on an external device (a PC or the like) and provided to the learning unit 275 as input data. Learning with distance images created under a viewpoint condition fixed in this way yields an accurate trained model.
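The fixed viewpoint condition of FIG. 9B can be written down as a small set of camera poses. The sketch below uses the figures given in the text (5 m distance, 1.5 m height, 45-degree angle of view, 12 equal divisions); the function name `ring_viewpoints`, the dictionary layout, and the yaw convention (0 degrees along +x, counterclockwise) are assumptions introduced for illustration.

```python
import math


def ring_viewpoints(center, distance=5.0, height=1.5, divisions=12, fov_deg=45.0):
    """Return virtual camera poses evenly spaced on a circle around a person.

    center: (x, y) ground position of the person object.
    Each pose holds the camera position and the horizontal direction (yaw)
    in which the camera must look to face the person.
    """
    cx, cy = center
    views = []
    for k in range(divisions):
        angle = 2.0 * math.pi * k / divisions
        position = (cx + distance * math.cos(angle),
                    cy + distance * math.sin(angle),
                    height)
        # The camera looks back toward the center, i.e. opposite to `angle`.
        yaw_deg = (math.degrees(angle) + 180.0) % 360.0
        views.append({"position": position, "yaw_deg": yaw_deg, "fov_deg": fov_deg})
    return views


views = ring_viewpoints(center=(0.0, 0.0))
```

Reusing the same pose set when generating estimation distance images is one way to keep the estimation-time viewpoint condition aligned with the one used for learning.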

In the present embodiment, fixing the viewpoint condition used when creating the learning distance images eliminates the need, in the learning phase, to create variations of the trained model for every possible viewpoint at different distances and heights. That is, learning only needs to be performed with learning distance images in which the person object is viewed under a fixed viewpoint condition such as that shown in FIG. 9B. This reduces learning costs such as the preparation of learning data and the time required for learning. In addition, because the viewpoint condition is fixed, the trained model that is finally obtained is a highly accurate one with little noise, which in turn improves accuracy in the estimation phase that uses the trained model.

<Modification>
In the present embodiment, a separation process is performed to cut out the individual shape data corresponding to each person object from shape data containing the voxel groups of a plurality of person objects, but performing the separation process as an independent step is not essential. The three-dimensional shape derivation unit 271 may be configured to carry out, in a single operation, all processing up to separating and generating the individual shape data for each person object from the shape data containing the three-dimensional shapes of a plurality of person objects.

Furthermore, the shape data does not necessarily have to be separated into individual shape data for each person object. For example, in the learning phase, machine learning may be performed using learning distance images that contain a plurality of person objects. In this case, the positions of the human body parts are estimated simultaneously for each of the plurality of person objects, using an estimation distance image containing a plurality of person objects and a trained model that supports a plurality of person objects.

In the present embodiment, a highly accurate distance image is generated based on shape data representing the three-dimensional shape of a person object, the shape data being generated from a plurality of captured images acquired using ordinary imaging means. The joint positions of the person object shown in the captured images are then estimated using a trained model for estimating the positions of human body parts, obtained by machine learning performed in advance. As a result, the positions of the body parts of players and the like can be estimated with high accuracy even in a shooting scene in which a plurality of players move freely over a wide area and it is difficult to install a dedicated depth camera appropriately.
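The generation of a distance image from shape data under a viewpoint condition can be pictured with the following minimal sketch. It projects the points of one person object through a pinhole model onto a small image and keeps the nearest depth per pixel. The function name `render_distance_image`, the parameters, the horizontal-looking camera, and the use of depth along the optical axis (rather than Euclidean range) are all assumptions for illustration, not the embodiment's actual rendering procedure.

```python
import math


def render_distance_image(points, view_angle_deg, distance, height,
                          fov_deg=60.0, size=64):
    """Render a depth map of `points` ((x, y, z), z up) from a pseudo camera.

    The camera looks horizontally along `view_angle_deg` (measured in the
    x-y plane from +x), is placed `distance` behind the horizontal centroid
    of the points, and sits `height` above the ground (z = 0).
    Pixels that see no point hold math.inf.
    """
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    a = math.radians(view_angle_deg)
    fwd = (math.cos(a), math.sin(a))      # viewing direction in the x-y plane
    right = (math.sin(a), -math.cos(a))   # camera's right-hand direction
    cam = (cx - distance * fwd[0], cy - distance * fwd[1], height)
    # Focal length in pixels derived from the angle of view.
    focal = (size / 2) / math.tan(math.radians(fov_deg) / 2)
    image = [[math.inf] * size for _ in range(size)]
    for x, y, z in points:
        rx, ry, rz = x - cam[0], y - cam[1], z - cam[2]
        depth = rx * fwd[0] + ry * fwd[1]  # distance along the optical axis
        if depth <= 0:
            continue  # point is behind the camera
        u = math.floor(size / 2 + focal * (rx * right[0] + ry * right[1]) / depth)
        v = math.floor(size / 2 - focal * rz / depth)
        if 0 <= u < size and 0 <= v < size:
            image[v][u] = min(image[v][u], depth)  # keep the nearest surface
    return image
```

Here `distance`, `height`, and `fov_deg` stand in for the elements of the viewpoint condition (distance to the person object, height from the ground, and angle of view). Keeping them identical between learning and estimation is the point of fixing the viewpoint condition.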

[Embodiment 2]
Next, a mode in which the direction the person object is facing is identified first, and an estimation distance image corresponding to the identified direction is then generated in order to improve the accuracy of estimating the positions of human body parts, will be described as the second embodiment. Descriptions of content common to the first embodiment, such as the basic configuration of the image processing system, are omitted, and the following describes the functional configuration of the server 270, which is the point of difference.

(Functional configuration of server)
FIG. 10 is a block diagram showing the functional configuration of a server 270' as an information processing apparatus that estimates joint positions and the like of a person object according to the present embodiment. The major differences from the first embodiment are, first, that it has an object orientation estimation unit 1001 that estimates the orientation of a person object, and a second learning unit 1002 that performs machine learning for that purpose and generates a trained model. It also differs from the first embodiment in that it has a second estimation image generation unit 1003. Blocks having the same names as those of the server 270 of the first embodiment perform the same functions and operations, so their description is omitted. Further, the functions and operations of the first estimation image generation unit 273 and the first learning unit 275 in FIG. 10 are the same as those of the estimation image generation unit 273 and the learning unit 275 of the first embodiment, so the same reference numerals are given.

The object orientation estimation unit 1001 estimates the direction in which the body of each person object is facing, using the estimation distance image generated by the first estimation image generation unit 273 and the trained model provided by the second learning unit 1002. FIG. 11 is a diagram illustrating a method of identifying the direction in which a player is facing when the imaging space is a soccer field. In the example of FIG. 11, the orientation of the player's body is identified by a rotation angle about a central axis (the z-axis) located at the center of the field, which is shown as an x-y two-dimensional plane. Taking the x direction as the reference angle (0 degrees), a player facing the goal on the right side is "facing 0 degrees", and a player facing the goal on the left side is "facing 180 degrees".

Although an example of identifying the orientation on a two-dimensional plane has been described here, the orientation in a three-dimensional space including the z-axis may be identified instead. Such rotation angle information is output to the second estimation image generation unit 1003 as information indicating the direction in which the body of the person object is facing (hereinafter referred to as "body direction information"). The method of estimating the orientation is not limited to one based on the estimation distance image generated from the individual shape data. For example, the moving direction of the person object may be obtained from the difference between the current individual shape data and the individual shape data at a certain past time, and that moving direction may be estimated to be the direction in which the person object is facing. Further, instead of using the estimation distance image generated from the individual shape data, the orientation may be estimated using a trained model that estimates the direction in which the person object is facing from the captured image data.
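The alternative based on the moving direction can be sketched as follows. This minimal example assumes that the "difference" between the current and past individual shape data is taken as the displacement of the horizontal centroid of the voxels, and that angles are measured counterclockwise from the +x direction. Both are assumptions for illustration, as are the function names and the `min_shift` threshold.

```python
import math


def centroid_xy(voxels):
    """Return the horizontal (x, y) centroid of a list of (x, y, z) voxels."""
    n = len(voxels)
    return (sum(v[0] for v in voxels) / n, sum(v[1] for v in voxels) / n)


def moving_direction_deg(past_voxels, current_voxels, min_shift=1e-6):
    """Estimate the heading in degrees in [0, 360), with 0 along +x.

    Returns None when the centroid has moved less than `min_shift`
    (an assumed threshold), since no direction can be derived then.
    """
    px, py = centroid_xy(past_voxels)
    cx, cy = centroid_xy(current_voxels)
    dx, dy = cx - px, cy - py
    if math.hypot(dx, dy) < min_shift:
        return None
    return math.degrees(math.atan2(dy, dx)) % 360.0
```

A player running toward the right-hand goal in the convention of FIG. 11 would yield roughly 0 degrees, and one running toward the left-hand goal roughly 180 degrees. A player moving backward or standing still is a case where this heuristic would not reflect the true body orientation.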

The second learning unit 1002 performs machine learning in advance, using distance images of person objects in different postures and facing different directions as input data and information on the direction in which the body of each person object is facing as teacher data, and generates a trained model for estimating the orientation of the human body. As with the first learning unit 275 of the first embodiment, it is desirable that the distance images used as input data for this machine learning be limited to distance images that follow a predetermined viewpoint condition under which the direction the human body is facing is easy to estimate.

Based on the body direction information received from the object orientation estimation unit 1001, the second estimation image generation unit 1003 uses the individual shape data input from the object separation unit 272 to generate an estimation distance image as viewed from a specific direction (hereinafter referred to as a "specific distance image"). Specifically, a specific distance image as viewed from a predetermined specific direction, such as the person object being processed viewed from the front or from the right side, is generated with reference to the body orientation of the person object indicated by the body direction information. For example, when the body direction information indicates that the person object is facing 0 degrees and a specific distance image capturing the person object from the front is to be generated, the line-of-sight direction of the pseudo camera is 180 degrees. The human body part estimation unit 274 then estimates the positions of the human body parts of each person object based on the specific distance image generated with reference to the body orientation of the person object and the above-mentioned trained model.
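The relation between the body direction and the pseudo camera's line of sight can be written out as a small calculation. The sketch below reproduces the example in the text (body facing 0 degrees, front view, camera looking toward 180 degrees). The handling of the other sides assumes that angles increase counterclockwise when the field is viewed from above, so that the person's left is at +90 degrees; that convention and the function name are assumptions for illustration.

```python
def camera_view_direction_deg(body_direction_deg, side="front"):
    """Return the line-of-sight angle of a pseudo camera in [0, 360).

    `side` names the side of the person on which the camera is placed.
    The camera always looks back toward the person, i.e. opposite to the
    direction from the person to the camera.
    """
    # Angle of the camera position relative to the body direction
    # (assumed counterclockwise-positive convention).
    placement_offset = {"front": 0.0, "left": 90.0, "back": 180.0, "right": 270.0}
    camera_side = body_direction_deg + placement_offset[side]
    return (camera_side + 180.0) % 360.0
```

For instance, `camera_view_direction_deg(0.0, "front")` gives 180.0, matching the example above, while a right-side view of the same person gives 90.0 under the assumed convention.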

In the present embodiment, the only input data needed for learning by the first learning unit 275 are distance images captured from the desired direction described above. In the first embodiment, although the viewpoint condition is fixed, obtaining a highly accurate trained model requires learning distance images that capture the person object from various directions over 360 degrees. With the method of the present embodiment, by contrast, it is sufficient to prepare learning distance images captured only from the predetermined specific direction, and the cost of the learning phase for estimating the positions of human body parts can be reduced. Although the learning cost of creating a trained model for estimating the direction in which the person object is facing is added, no learning is performed for directions other than the specific direction, so an improvement in the position estimation accuracy for human body parts in the estimation phase can be expected.

<Modification example>
In the above example there is one specific direction, but there may be a plurality of specific directions. FIG. 12 is a block diagram showing the functional configuration of a server 270" in a case where two specific directions are provided. There are two second estimation image generation units 1002a and 1002b, each corresponding to a different direction, and two human body part estimation units 274a and 274b. Under such a configuration, the specific directions desired by the user are associated with the respective functional units: for example, the second estimation image generation unit 1002a and the human body part estimation unit 274a are used for the front, and the second estimation image generation unit 1002b and the human body part estimation unit 274b are used for the right side. In this case, the first learning unit 275' performs two kinds of machine learning, one for the front and one for the right side, and provides the trained model for the front to the human body part estimation unit 274a and the trained model for the right side to the human body part estimation unit 274b.

In the above configuration, the second estimation image generation unit 1002a generates a specific distance image of the person object as viewed from the front, based on the body direction information received from the object orientation estimation unit 1001. Similarly, the second estimation image generation unit 1002b generates a specific distance image of the person object as viewed from the right side, based on the input body direction information.

The human body part estimation units 274a and 274b then estimate the positions of the human body parts of the person object, using the estimation distance images corresponding to the respective specific directions obtained as described above and the trained models respectively provided by the first learning unit 275'. The estimation results for the respective specific directions are output to the complement processing unit 1004.

The complement processing unit 1004 uses the plurality of received estimation results to perform complement processing in which parts that cannot be seen from one specific direction are filled in from another, and outputs human body part information as the final estimation result. This yields human body part information with improved reliability. Although the complement processing is performed additionally, the learning cost is lower than that of learning from various directions over 360 degrees. Further, since each trained model is a model trained from a single direction, a similar improvement in estimation accuracy can be expected.
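One simple way to picture the complement processing is the following sketch, which merges per-direction estimates joint by joint. It assumes that each estimation result maps a joint name to a position already expressed in a common world coordinate system, with `None` marking a joint that was not visible from that direction, and that positions visible from several directions are simply averaged. The data layout, the averaging rule, and the function name are assumptions for illustration; the text does not specify how the complement is computed.

```python
def complement_estimates(*estimates):
    """Merge joint estimates obtained from different specific directions.

    Each estimate is a dict: joint name -> (x, y, z), or None when the
    joint could not be seen from that direction. A joint seen from at
    least one direction receives the average of its visible positions.
    """
    merged = {}
    joints = set().union(*(e.keys() for e in estimates))
    for joint in sorted(joints):
        seen = [e[joint] for e in estimates if e.get(joint) is not None]
        if seen:
            # Average each coordinate over the directions that saw the joint.
            merged[joint] = tuple(sum(c) / len(seen) for c in zip(*seen))
        else:
            merged[joint] = None
    return merged
```

With a front estimate that misses the right wrist and a right-side estimate that misses the left wrist, the merged result would contain both wrists. A weighted combination using per-joint confidence scores would be a natural refinement if the trained models provide them.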

[Application to virtual viewpoint video]
The content described in each embodiment can also be applied to a so-called virtual viewpoint video generation system (see Patent Document 3). For example, three-dimensional shape data for each person object is created in advance using polygons or the like, and the movement of a player can be reproduced by deforming the shape of the person object based on the human body part information obtained by the above-described embodiments. In this case, a virtual viewpoint video can be generated with a significantly smaller amount of data than with conventional generation methods. With this application, fine details such as a player's facial expression are difficult to reproduce, but sufficient information is obtained for checking the formation of the players and the movement of each player, so it is well suited to viewing and checking virtual viewpoint video on a portable terminal such as a tablet PC.

(Other embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

270 Server
271 Three-dimensional shape derivation unit
272 Object separation unit
273 Estimation image generation unit
274 Human body part estimation unit
275 Learning unit

Claims

An information processing apparatus comprising:
acquisition means for acquiring shape data indicating a three-dimensional shape of a person object, the shape data being generated using a plurality of captured images obtained by imaging the person object from a plurality of directions;
generation means for generating, based on the shape data acquired by the acquisition means, a distance image indicating a distance to the person object under a predetermined viewpoint condition; and
part estimation means for estimating a position of a human body part of the person object based on the distance image generated by the generation means, using a trained model for estimating positions of human body parts obtained by learning performed in advance.

The information processing apparatus according to claim 1, further comprising separation means for separating the shape data into individual shape data for each person object when the shape data acquired by the acquisition means includes three-dimensional shapes of a plurality of person objects,
wherein the generation means generates the distance image based on the individual shape data, and
the part estimation means performs the estimation using a trained model obtained by learning performed on a per-person-object basis.

The information processing apparatus according to claim 2, wherein the shape data is data in which the three-dimensional shape of the person object is represented by voxels, and
the separation means obtains the distance between each voxel and other voxels and separates the shape data into the individual shape data by performing discrimination processing that identifies voxels in contact with each other or within a predetermined distance of each other as voxels belonging to the same person object.

The information processing apparatus according to any one of claims 1 to 3, wherein the predetermined viewpoint condition is a viewpoint condition when a learning distance image used as input data in the learning is created.

The information processing apparatus according to claim 4, wherein the predetermined viewpoint condition includes at least elements of a distance to the person object, a height from the ground, and an angle of view.

The information processing apparatus according to any one of claims 1 to 5, further comprising direction estimation means for estimating the direction in which the person object is facing,
wherein the generation means generates a specific distance image that captures the person object from a specific direction based on the direction estimated by the direction estimation means, and
the part estimation means performs the estimation using the specific distance image.

The information processing apparatus according to claim 6, wherein the direction estimation means estimates the direction in which the person object is facing by using a trained model for estimating the direction in which a human body is facing, obtained by learning performed in advance.

The information processing apparatus according to claim 7, wherein the direction estimation means estimates the direction in which the person object is facing by using the distance image as input data.

The information processing apparatus according to claim 7, wherein the direction estimation means estimates the direction in which the person object is facing by using the plurality of captured images as input data.

The information processing apparatus according to claim 6, wherein the direction estimation means obtains a moving direction from the difference between current individual shape data and past individual shape data, and estimates the moving direction to be the direction in which the person object is facing.

The information processing apparatus according to any one of claims 6 to 10, comprising a plurality of the part estimation means corresponding to different specific directions, and
further comprising a complement processing unit that uses the estimation results of the plurality of part estimation means to generate an estimation result in which parts that cannot be seen from one specific direction are complemented.

An information processing method comprising:
a step of acquiring shape data indicating a three-dimensional shape of a person object, the shape data being generated using a plurality of captured images obtained by imaging the person object from a plurality of directions;
a step of generating, based on the acquired shape data, a distance image indicating a distance to the person object under a predetermined viewpoint condition; and
a step of estimating a position of a human body part of the person object based on the generated distance image, using a trained model for estimating positions of human body parts obtained by learning performed in advance.

A program for causing a computer to function as the information processing device according to any one of claims 1 to 11.