JP7432275B1 - Video display device, video display method, and program

Info

Publication number: JP7432275B1
Application number: JP2023123183A
Authority: JP
Inventors: 直広早石
Original assignee: KEISUUGIKEN CORPORATION
Current assignee: KEISUUGIKEN CORPORATION
Priority date: 2023-07-28
Filing date: 2023-07-28
Publication date: 2024-02-16
Anticipated expiration: 2043-07-28

Abstract

[Problem] To provide a video display device that can display a self-video comparable with a reference video without preparing a shooting environment similar to that of the reference video. [Solution] A video display device 1 includes: a storage unit 11 that stores a reference video, which is a video of the motion of an imitation target whose motion a user imitates; a video acquisition unit 12 that acquires a first self-video, which is a video of the user's motion; a skeleton recognition unit 13 that performs skeleton recognition on the user in the first self-video; a generation unit 14 that uses the skeleton recognition result to generate a second self-video, which is a video of a three-dimensional object that moves according to the user's motion, such that the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object is the same as the relative positional relationship between the reference video camera that shoots the reference video and the imitation target; and a display unit 15 that displays the reference video and the second self-video. [Selected drawing] FIG. 1

Description

The present invention relates to a video display device and the like that displays a video of the motion of an imitation target, whose motion a user imitates, together with a video corresponding to the user's motion.

Learning support devices are known that, for learning movements such as those in surgery, combine and display a reference video, which is a video of the motion of an imitation target that the learner imitates, and a self-video, which is a video of the learner's own motion (see, for example, Patent Document 1). By referring to such a display, the learner can train to perform the same motion as the imitation target.

Patent Document 1: Japanese Patent Application Publication No. 2014-071443

In conventional learning support devices, however, making the reference video and the self-video comparable required that the relative positional relationship between the shooting camera and the subject be the same for both videos. For example, if the reference video is a first-person video shot with a head-mounted camera worn by the teacher who performs the motion to be imitated, the self-video also had to be a first-person video shot with a head-mounted camera worn by the learner. The learner therefore also had to prepare a shooting environment similar to that of the reference video, which cost money and time.

The present invention has been made to solve the above problem, and its object is to provide a video display device and the like that can display a self-video comparable with a reference video without preparing a shooting environment similar to that of the reference video.

To achieve the above object, a video display device according to one aspect of the present invention includes: a storage unit that stores a reference video, which is a video of the motion of an imitation target whose motion a user imitates; a video acquisition unit that acquires a first self-video, which is a video of the user's motion; a skeleton recognition unit that performs skeleton recognition on the user included in the first self-video; a generation unit that uses the result of the skeleton recognition by the skeleton recognition unit to generate a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves according to the user's motion, such that the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object is the same as the relative positional relationship between the reference video camera that shoots the reference video and the imitation target; and a display unit that displays the reference video and the second self-video, wherein the relative positional relationship between the reference video camera and the imitation target differs from the relative positional relationship between the self-video camera that shoots the first self-video and the part of the user corresponding to the imitation target.

With this configuration, a second self-video comparable with the reference video can be displayed without preparing a shooting environment similar to that of the reference video. For example, a second self-video from the user's first-person viewpoint can be generated and displayed from a first self-video shot by a camera that faces the user, such as the camera of a laptop computer, a tablet terminal, or a smartphone.

In the video display device according to one aspect of the present invention, the generation unit may generate a second self-video in which the line-of-sight direction of the skeleton recognition result obtained by the skeleton recognition unit is changed by a set angle.

With this configuration, for example, a second self-video from the user's first-person viewpoint can be generated from a first self-video shot by a camera facing the user.
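Changing the line-of-sight direction by a set angle can be pictured as a plain rotation. The following is only a minimal sketch: the function names, the choice of the vertical y axis, and the 180-degree default are illustrative assumptions, not values stated in the source.

```python
import math

def rotate_about_y(vector, angle_deg):
    """Rotate a 3D vector (x, y, z) about the vertical y axis by angle_deg."""
    a = math.radians(angle_deg)
    x, y, z = vector
    return (x * math.cos(a) + z * math.sin(a),
            y,
            -x * math.sin(a) + z * math.cos(a))

def change_view_direction(direction, set_angle_deg=180.0):
    """Turn a viewing direction by a preset angle (180 degrees is an assumed default)."""
    return rotate_about_y(direction, set_angle_deg)

# Assumption: the self-video camera looks along +z toward the user.
# After a half turn the virtual view looks along -z, the way the user is facing.
view = change_view_direction((0.0, 0.0, 1.0))
```

A fixed angle like this only makes sense when the first self-video is shot in a predetermined arrangement, for example with the camera directly facing the user.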

In the video display device according to one aspect of the present invention, the reference video may be a video from the viewpoint of the imitated person who moves the imitation target, and the generation unit may generate a second self-video that is a video from the user's viewpoint.

With this configuration, a second self-video from a first-person viewpoint can be generated regardless of the relative positional relationship between the user and the self-video camera that shoots the first self-video.

In the video display device according to one aspect of the present invention, the imitation target may include an operation target whose shape changes, the first self-video may include the user's hand, and the generation unit may generate a second self-video that includes a three-dimensional object of the operation target whose shape changes according to the gesture of the user's hand included in the first self-video.

With this configuration, the three-dimensional object of the operation target can be operated by gestures, without using a controller or the like.
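As one way to picture a gesture that changes the shape of the operation target, the sketch below maps the distance between the thumb tip and the index fingertip to an opening angle for forceps jaws. The mapping, the normalization by hand width, and the 60-degree maximum are hypothetical choices for illustration; the source does not specify how gestures are mapped to shapes.

```python
import math

def pinch_to_opening_deg(thumb_tip, index_tip, hand_width, max_opening_deg=60.0):
    """Map the thumb-to-index distance, normalized by hand width, to a jaw angle."""
    ratio = math.dist(thumb_tip, index_tip) / hand_width
    ratio = max(0.0, min(1.0, ratio))  # clamp so the jaws stay within their range
    return ratio * max_opening_deg

# Fingertips half a hand width apart open the jaws halfway.
half_open = pinch_to_opening_deg((0.0, 0.0), (0.5, 0.0), hand_width=1.0)
```

Normalizing by hand width keeps the mapping roughly independent of how far the hand is from the camera.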

In the video display device according to one aspect of the present invention, the imitation target may include an operation target whose shape changes, the device may further include a reception unit that receives instructions from a controller operated by the user, and the generation unit may generate a second self-video that includes a three-dimensional object of the operation target whose shape changes according to the instructions received by the reception unit.

With this configuration, the three-dimensional object of the operation target can be operated with a controller. For example, when the reference video is a video of a surgical robot, the user can operate the three-dimensional object of the operation target with a controller similar to the one used to operate that surgical robot, and can thus operate the three-dimensional object in an environment similar to that of the actual surgical robot.

In the video display device according to one aspect of the present invention, the display unit may combine the reference video and the second self-video and display the result.

With this configuration, the reference video and the second self-video can be compared easily, and the user can learn to make his or her own motion match the motion of the imitation target in the reference video.
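One simple way to combine two videos for comparison is per-pixel alpha blending. The sketch below is an assumption for illustration (the source does not say how the videos are combined), and it uses small grayscale frames stored as nested lists to stay self-contained.

```python
def blend_frames(reference, second_self, alpha=0.5):
    """Alpha-blend two equally sized grayscale frames given as lists of rows."""
    if len(reference) != len(second_self):
        raise ValueError("frames must have the same number of rows")
    return [
        [round((1 - alpha) * r + alpha * s) for r, s in zip(ref_row, self_row)]
        for ref_row, self_row in zip(reference, second_self)
    ]

# With alpha = 0.5 each output pixel is the average of the two inputs.
combined = blend_frames([[0, 100]], [[200, 100]])
```

An alpha closer to 0 emphasizes the reference video, and an alpha closer to 1 emphasizes the second self-video.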

In the video display device according to one aspect of the present invention, the self-video camera and the display device on which the reference video and the second self-video are displayed may be arranged such that the direction from the self-video camera toward the subject along the optical axis is opposite to the line-of-sight direction of a viewer looking straight at the display device.

With this configuration, for example, a second self-video can be generated from a first self-video shot by a camera that faces the user, such as the camera of a laptop computer, a tablet terminal, or a smartphone, and the second self-video and the reference video can be displayed to the user.

A video display method according to one aspect of the present invention is a method carried out using a storage unit that stores a reference video, which is a video of the motion of an imitation target whose motion a user imitates, a video acquisition unit, a skeleton recognition unit, a generation unit, and a display unit. The method includes: a step in which the video acquisition unit acquires a first self-video, which is a video of the user's motion; a step in which the skeleton recognition unit performs skeleton recognition on the user included in the first self-video; a step in which the generation unit uses the skeleton recognition result to generate a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves according to the user's motion, such that the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object is the same as the relative positional relationship between the reference video camera that shoots the reference video and the imitation target; and a step in which the display unit displays the reference video and the second self-video, wherein the relative positional relationship between the reference video camera and the imitation target differs from the relative positional relationship between the self-video camera that shoots the first self-video and the part of the user corresponding to the imitation target.

According to the video display device and the like according to one aspect of the present invention, a self-video comparable with a reference video can be displayed without preparing a shooting environment similar to that of the reference video.

Block diagram showing the configuration of a video display device according to an embodiment of the present invention
Diagram showing an example of use of a computer that implements the video display device according to the embodiment
Diagram showing an example of a reference video in the embodiment
Diagram showing an example of a first self-video and a skeleton recognition result in the embodiment
Diagram showing an example of a hand skeleton recognition result in the embodiment
Diagram showing an example of a second self-video in the embodiment
Diagram showing an example of the display of first and second self-videos in the embodiment
Diagram showing an example of operating an operation target by gesture in the embodiment
Diagram showing an example of operating an operation target by gesture in the embodiment
Flowchart showing the operation of the video display device according to the embodiment
Block diagram showing another configuration of the video display device according to the embodiment
Perspective view showing an example of a controller in the embodiment
Diagram showing an example of the configuration of a computer in the embodiment

A video display device and a video display method according to the present invention are described below using an embodiment. In the following embodiment, components and steps denoted by the same reference numerals are the same or equivalent, and repeated explanation may be omitted. The video display device according to this embodiment uses a first self-video shot in a shooting environment different from that of the reference video, which is a video of the motion of the imitation target whose motion the user imitates, to generate a second self-video of a three-dimensional object such that the relative positional relationship between the viewpoint and the three-dimensional object matches the shooting environment of the reference video, and displays the reference video and the second self-video.

FIG. 1 is a block diagram showing the configuration of a video display device 1 according to this embodiment. The video display device 1 includes a storage unit 11, a video acquisition unit 12, a skeleton recognition unit 13, a generation unit 14, and a display unit 15. As an example, the video display device 1 may be implemented by a computer 900, as shown in FIG. 2 and elsewhere. This embodiment mainly describes that case.
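The data flow between these units can be sketched as a short loop. This is only an outline under assumptions: `run_pipeline`, `recognize`, and `generate` are hypothetical names standing in for the skeleton recognition unit and the generation unit, and frames are treated as opaque values.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple

def run_pipeline(
    reference_frames: Iterable[Any],
    first_self_frames: Iterable[Any],
    recognize: Callable[[Any], Any],
    generate: Callable[[Any], Any],
) -> Iterator[Tuple[Any, Any]]:
    """Yield (reference frame, second self-video frame) pairs for display."""
    for reference, first_self in zip(reference_frames, first_self_frames):
        skeleton = recognize(first_self)   # role of the skeleton recognition unit
        second_self = generate(skeleton)   # role of the generation unit
        yield reference, second_self       # handed to the display unit

# Stand-in callables so the flow can be exercised without a camera or renderer.
pairs = list(run_pipeline(["r0", "r1"], ["s0", "s1"],
                          recognize=lambda frame: frame.upper(),
                          generate=lambda skeleton: "obj:" + skeleton))
```

Passing the recognizer and the generator in as callables keeps the sketch independent of any particular pose-estimation or rendering library.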

The storage unit 11 stores a reference video, which is a video of the motion of an imitation target whose motion the user imitates. The user is a learner who learns the motion while referring to the reference video. The motion the user learns may be, for example, a surgical motion, a work motion in a factory, a task in nursing care or hotel work, a cooking motion, a motion for creating a work such as a craft, a sports motion, a calligraphy motion, a rope-tying motion, or some other motion. The imitation target may be, for example, a part of the imitated person's body or an object moved by the imitated person. The imitated person is, for example, the person acting as teacher for the learning user, and may be someone skilled in the motion that the learner is learning. The part of the imitated person's body may include, for example, the imitated person's hand. The object moved by the imitated person may be, for example, the end effector or forceps of a surgical robot, or a tool held by the imitated person, such as forceps, a scalpel, tweezers, scissors, or a brush. The reference video is usually a video shot by a camera, but may be a CG (Computer Graphics) video equivalent to a video shot by a camera. As an example, the reference video may be a video from the viewpoint of the imitated person who moves the imitation target, that is, a first-person video of the imitated person. In that case, the reference video may be, for example, a video shot with a head-mounted camera worn by the imitated person. This embodiment mainly describes the case in which, as shown in FIG. 3, a reference video including an imitation target 21, which is the forceps of a surgical robot, is stored in the storage unit 11.

The storage unit 11 may store, for example, the entire reference video or only a part of it. As an example, when the video display device 1 displays the reference video while receiving it from outside, the most recently received portion of the reference video may be stored in the storage unit 11, read out and displayed, and then overwritten in turn. The storage unit 11 may also store information other than the reference video. For example, it may store information on the three-dimensional object, the second self-video generated by the generation unit 14, or the first self-video acquired by the video acquisition unit 12.
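Keeping only the latest received portion and overwriting older data behaves like a fixed-size ring buffer. A minimal sketch follows; the class name `ReferenceBuffer` and its capacity are illustrative assumptions.

```python
from collections import deque

class ReferenceBuffer:
    """Holds only the most recently received reference frames."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest item when full.
        self._frames = deque(maxlen=capacity)

    def store(self, frame):
        self._frames.append(frame)

    def latest(self):
        """Return the newest stored frame, or None if nothing is stored yet."""
        return self._frames[-1] if self._frames else None

    def __len__(self):
        return len(self._frames)

buffer = ReferenceBuffer(capacity=2)
for frame in ["frame_a", "frame_b", "frame_c"]:
    buffer.store(frame)  # "frame_a" is overwritten once the buffer is full
```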

How information comes to be stored in the storage unit 11 does not matter. For example, information may be stored in the storage unit 11 via a recording medium, information transmitted over a communication line or the like may be stored there, or information input through a device such as a camera may be stored there. The storage unit 11 is preferably implemented with a nonvolatile recording medium, but may be implemented with a volatile recording medium. The recording medium may be, for example, a semiconductor memory or a magnetic disk.

The video acquisition unit 12 acquires a first self-video, which is a video of the user's motion. The video acquisition unit 12 may be, for example, an optical device such as a camera that shoots the video, or may acquire a video shot by such an optical device. This embodiment mainly describes the case in which the video acquisition unit 12 receives a video shot by a camera 901, which is the self-video camera. As an example, the first self-video may be a video of the motion of the user's hand. It may be, for example, a video of the user's palm, a video that also includes the forearm from wrist to elbow, a video that also includes the upper arm from elbow to shoulder, or a video that further includes the user's shoulders and torso. When the second self-video is generated by changing the line-of-sight direction by a set angle, the first self-video is preferably shot in a predetermined arrangement. For example, the first self-video may be shot with the self-video camera and the user facing each other. When a second self-video from the user's viewpoint is generated, it is preferable to acquire a first self-video from which the user's line-of-sight direction can be determined, for example one that also includes the user's head.

Note that the relative positional relationship between the reference video camera that shoots the reference video and the imitation target differs from the relative positional relationship between the self-video camera that shoots the first self-video and the part of the user corresponding to the imitation target. The part of the user corresponding to the imitation target is not particularly limited, and may be, for example, the user's palm or the part of the user's arm from the elbow to the hand. As an example, the reference video may be a first-person video of the imitated person, and the first self-video may be a video shot by the camera 901 facing the user 30, as shown in FIG. 2. This embodiment mainly describes that case. This embodiment also mainly describes, as an example, the case shown in FIG. 2 in which the camera 901 and a display device 902, on which the reference video and the second self-video are displayed, are arranged such that the direction from the camera 901 toward the user 30, who is the subject, along the optical axis is opposite to the line-of-sight direction of the user 30 looking straight at the display device 902. In other words, the first self-video is shot and the reference video and second self-video are displayed on a laptop computer, tablet terminal, or smartphone equipped with a camera. The camera 901 may be, for example, a built-in camera of the computer 900, such as the user-facing camera of a laptop computer or the front camera of a tablet terminal or smartphone.

The skeleton recognition unit 13 performs skeleton recognition on the user included in the first self-video. For example, the skeleton recognition unit 13 may detect a person or a part of a person in a given frame of the first self-video and perform skeleton recognition on the detected person or part. The skeleton recognition unit 13 may perform this processing on each of a plurality of frames of the first self-video, so that the skeleton recognition is repeated. Skeleton recognition may be performed on every frame of the first self-video or only on frames at intervals. The part of a person may be, for example, the upper body, the arm including the palm, or the palm alone. When the generation unit 14 generates a second self-video from the user's viewpoint, skeleton recognition that includes the user's head is preferable. Methods for skeleton recognition are already known, so a detailed description is omitted. For example, as shown in FIG. 4, the skeleton recognition unit 13 may identify the user 30 included in the first self-video and recognize a skeleton 31 of the user 30. As an example, as shown in FIG. 4, the skeleton 31 may include node figures, such as circles, that correspond to joints and to body extremities such as the fingertips and head, and straight link figures that connect the nodes and correspond to body parts such as the arms. FIG. 5 shows the skeleton 31 recognized for a hand 32 of the user 30. This embodiment mainly describes the case in which the three-dimensional object is operated using the skeleton 31 of the hand 32 recognized by the skeleton recognition unit 13.
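A skeleton made of nodes and links maps naturally onto a small data structure. The sketch below is illustrative only: the `Skeleton` class, the node names, and the `frames_to_process` helper for recognizing frames at intervals are assumptions and do not represent the output format of any particular pose-estimation library.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Skeleton:
    """2D skeleton: named nodes (joints, extremities) joined by links."""
    nodes: Dict[str, Tuple[float, float]]
    links: List[Tuple[str, str]]

    def link_length(self, start: str, end: str) -> float:
        """Length in the image of the segment between two nodes."""
        return math.dist(self.nodes[start], self.nodes[end])

def frames_to_process(frame_count: int, stride: int = 1) -> List[int]:
    """Indices of the frames to recognize; stride > 1 skips frames."""
    return list(range(0, frame_count, stride))

hand = Skeleton(
    nodes={"wrist": (0.0, 0.0), "middle_base": (3.0, 4.0)},
    links=[("wrist", "middle_base")],
)
selected = frames_to_process(frame_count=7, stride=3)
```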

The generation unit 14 uses the result of the skeleton recognition by the skeleton recognition unit 13 to generate a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves according to the user's motion. The three-dimensional object corresponding to the imitation target is preferably shaped like the imitation target, but may, for example, have a simplified version of its shape. The position and orientation of the three-dimensional object in the second self-video may, for example, follow the recognition result for the skeleton of the part of the user corresponding to the imitation target, and the object may move as that recognition result changes over time. The generation unit 14 generates the second self-video such that the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object is the same as the relative positional relationship between the reference video camera that shoots the reference video and the imitation target. The viewpoint of the video of the three-dimensional object may be, for example, the viewpoint used when rendering the three-dimensional object placed in a three-dimensional virtual space into a two-dimensional video. The second self-video may also be generated such that the relative relationship between the line-of-sight direction of the video of the three-dimensional object and the three-dimensional object is the same as the relative relationship between the optical axis direction of the reference video camera and the imitation target. In this way, the first self-video can be converted into a second self-video that appears to have been shot in the same shooting environment as the reference video. For example, when the reference video is a first-person video, a second self-video that is also a first-person video can be generated from a first self-video shot by the camera 901 facing the user 30 as in FIG. 2. Here, the viewpoints or line-of-sight directions being the same may mean that they are strictly identical or that they are identical within a predetermined margin of error.
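Reproducing the relative camera placement in the virtual space can be sketched as follows. This is a simplified illustration under stated assumptions: the reference camera's offset from the imitation target is taken as already known and expressed in the same axes as the virtual space, the object's own orientation is ignored, and the virtual line of sight is assumed to point straight at the object. The name `place_viewpoint` is hypothetical.

```python
import math

def place_viewpoint(object_position, camera_offset):
    """Return (viewpoint, view_direction) that reproduce a relative camera offset.

    camera_offset is the assumed position of the reference video camera
    relative to the imitation target.
    """
    viewpoint = tuple(o + c for o, c in zip(object_position, camera_offset))
    to_object = tuple(o - v for o, v in zip(object_position, viewpoint))
    norm = math.sqrt(sum(c * c for c in to_object))
    if norm == 0:
        raise ValueError("viewpoint must not coincide with the object")
    direction = tuple(c / norm for c in to_object)  # unit vector toward the object
    return viewpoint, direction

# Reference camera assumed to sit 2 units behind the target along z.
viewpoint, direction = place_viewpoint((1.0, 2.0, 3.0), (0.0, 0.0, -2.0))
```

The resulting viewpoint and direction would then be handed to whatever renderer draws the three-dimensional object into two-dimensional frames.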

The skeleton recognition result describes a skeleton recognized in a two-dimensional image. In the human body, however, the length of each part is largely fixed, for example the length from shoulder to elbow, from elbow to wrist, and from wrist to the base of each finger, the length of each finger, and the width of the hand, and the range of motion of each joint is also fixed. By taking these into account, the generation unit 14 can estimate the three-dimensional position and orientation of the skeleton of the user 30 in the first self-video based on the result of the skeleton recognition by the skeleton recognition unit 13. For example, based on the skeleton recognition result, the generation unit 14 may determine, in three-dimensional space, the position and optical axis direction of the self-video camera that shot the first self-video as well as the user's skeleton 31.
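When a body segment of known length appears shortened in the image, the difference gives the magnitude of the depth offset between its two ends. The sketch below assumes an orthographic (weak-perspective) projection and that both lengths are expressed in the same units; the function name is hypothetical, and the sign of the offset is not determined by this calculation.

```python
import math

def depth_offset(true_length: float, projected_length: float) -> float:
    """Magnitude of the depth difference between the two ends of a segment."""
    # Measurement noise can make the projection slightly longer than the bone.
    projected_length = min(projected_length, true_length)
    return math.sqrt(true_length ** 2 - projected_length ** 2)

# A segment of length 5 that projects to length 3 spans a depth of 4.
offset = depth_offset(5.0, 3.0)
```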

For example, in the skeleton 31 of the hand 32 shown in FIG. 5, the length of the double-headed arrow A1 does not change even if the hand 32 rotates in the direction of the arrow B1, and the length of the double-headed arrow A2 does not change even if the hand 32 rotates in the direction of the arrow B2. Here, the double-headed arrow A1 has as its end points the joint at the base of the index finger and the joint at the base of the little finger, and the double-headed arrow A2 has as its end points the wrist joint and the joint at the base of the middle finger. Therefore, from the lengths of the double-headed arrows A1 and A2 of the hand 32 in the skeleton recognition result, and from changes in those lengths, the generation unit 14 can determine the angle of the hand in the directions of the arrows B1 and B2 and changes in that angle.
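The relation between a foreshortened segment and the tilt of the hand can be sketched as follows. This is a minimal illustration rather than the method of the embodiment: it assumes a weak-perspective camera, so that a rigid segment whose un-tilted image length is L appears with length L·cos θ when tilted by θ out of the image plane. The function name `tilt_angle_deg` and the idea of calibrating the un-tilted length from a frame in which the palm faces the camera are assumptions introduced for illustration.

```python
import math


def tilt_angle_deg(observed_len: float, untilted_len: float) -> float:
    """Unsigned out-of-plane tilt, in degrees, of a rigid segment.

    Assumes weak perspective: observed_len = untilted_len * cos(tilt).
    """
    if untilted_len <= 0:
        raise ValueError("untilted_len must be positive")
    # Clamp so that measurement noise cannot push acos() out of its domain.
    ratio = max(0.0, min(1.0, observed_len / untilted_len))
    return math.degrees(math.acos(ratio))


# Hypothetical measurements: the segment spans 100 px when the palm faces the
# camera and 50 px in the current frame.
angle = tilt_angle_deg(50.0, 100.0)
```

The ratio alone gives only the magnitude of the tilt. Which end of the segment has moved toward the camera has to come from another cue, such as the changes in inter-joint distances described next.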

Which way the hand has rotated in the direction of the arrow B1 may be determined, for example, from changes in the distances between the joints of each finger. For example, when the hand rotates in the direction of the arrow B1, if the distance between the joints closer to the wrist increases while the distance between the joints closer to the fingertips decreases, it may be determined that the hand has rotated so that the wrist side approaches the camera. Conversely, if the distance between the joints closer to the wrist decreases while the distance between the joints closer to the fingertips increases, it may be determined that the hand has rotated so that the fingertip side approaches the camera. Which way the hand has rotated in the direction of the arrow B2 may be determined in the same manner. For example, when the hand rotates in the direction of the arrow B2, if the distance between the joints of the little finger increases while the distance between the joints of the thumb decreases, it may be determined that the hand has rotated so that the little-finger side approaches the camera. Conversely, if the distance between the joints of the little finger decreases while the distance between the joints of the thumb increases, it may be determined that the hand has rotated so that the thumb side approaches the camera.

In the skeleton recognition result shown in FIG. 4 and elsewhere, the positions of the shoulder, elbow, and wrist in the plane perpendicular to the optical axis of the camera can be specified, for example, from the positions of the shoulder, elbow, and wrist in the first self-video. As for the position of the elbow or wrist relative to the shoulder along the optical axis of the camera, for example, if the length from the shoulder to the elbow in the first self-video is shorter than the actual length from the shoulder to the elbow, it can be estimated that the elbow is correspondingly closer to the camera. The same applies to the segment from the elbow to the wrist. In this way, the generation unit 14 may estimate the three-dimensional position and posture of the skeleton of the arm and palm of the user 30.
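The depth estimate described above can be sketched numerically. The following is an illustrative sketch under stated assumptions, not the embodiment's algorithm: it assumes an orthographic-like projection, bone lengths already expressed in image units, and a caller-supplied choice of which joints lie toward the camera. The names `depth_offset` and `lift_arm`, and the convention that negative z means closer to the camera, are hypothetical.

```python
import math


def depth_offset(p_a, p_b, true_len):
    """Magnitude of the offset between two joints along the optical axis.

    p_a, p_b: (x, y) image positions of two adjacent joints.
    true_len: actual bone length, expressed in the same image units.
    Uses true_len**2 = projected_len**2 + dz**2.
    """
    projected = math.dist(p_a, p_b)
    # A projection longer than the bone is treated as noise: no depth offset.
    return math.sqrt(max(0.0, true_len ** 2 - projected ** 2))


def lift_arm(shoulder, elbow, wrist, upper_len, fore_len,
             toward_camera=(True, True)):
    """Lift shoulder, elbow and wrist to 3D with the shoulder at z = 0.

    toward_camera says, per bone, whether the far joint is nearer the camera;
    this sign cannot be recovered from the bone length alone.
    """
    dz_elbow = depth_offset(shoulder, elbow, upper_len)
    dz_wrist = depth_offset(elbow, wrist, fore_len)
    z_elbow = -dz_elbow if toward_camera[0] else dz_elbow
    z_wrist = z_elbow + (-dz_wrist if toward_camera[1] else dz_wrist)
    return [(*shoulder, 0.0), (*elbow, z_elbow), (*wrist, z_wrist)]


# Hypothetical input: both bones are 5 units long but project to 3 units.
arm_3d = lift_arm((0.0, 0.0), (3.0, 0.0), (6.0, 0.0), 5.0, 5.0)
```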

The generation unit 14 may generate the second self-video using the three-dimensional position and posture of the skeleton 31 of the user 30 estimated in this way. For example, the generation unit 14 may generate a second self-video in which the line-of-sight direction of the skeleton recognition result obtained by the skeleton recognition unit 13 is changed by a set angle. Specifically, based on the skeleton recognition result, the generation unit 14 may specify, in the three-dimensional virtual space, the viewpoint and line-of-sight direction corresponding to the position and optical-axis direction of the self-video camera, together with the user's skeleton 31, and may then change the viewpoint and line-of-sight direction in a predetermined manner. In this way, the generation unit 14 may change the viewpoint as well as the line-of-sight direction. More specifically, the generation unit 14 may generate a second self-video in which the line-of-sight direction from the hand of the user 30 toward the shoulder, as in the first self-video shown in FIG. 4, is changed to a line-of-sight direction from the shoulder of the user 30 toward the hand, that is, a second self-video from a first-person viewpoint. Since the positional relationship between the viewpoint and line-of-sight direction and the skeleton 31 is relative, the generation unit 14 may, for example, change the skeleton of the user 30 in the three-dimensional virtual space instead of changing the line-of-sight direction. The change to the skeleton of the user 30 may be, for example, a rotation of the skeleton, and may further include a translation. Thus, for example, changing the angle of the skeleton in the three-dimensional virtual space may also be regarded as changing the line-of-sight direction.
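The equivalence between moving the viewpoint and moving the skeleton can be illustrated with a small sketch. It is only an illustration under assumptions not found in the source: the viewpoint sits at the origin looking along +z, the rotation is about a vertical axis, and the joint names, coordinates, and function names are hypothetical.

```python
import math


def rotate_y(p, angle_rad):
    """Rotate point p = (x, y, z) about the vertical (y) axis."""
    x, y, z = p
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)


def rotate_skeleton(joints, angle_rad, pivot):
    """Rotate every joint about a vertical axis through `pivot`.

    With the viewpoint fixed, rotating the skeleton by +angle yields the same
    relative arrangement as orbiting the viewpoint (and its line of sight)
    around `pivot` by -angle.
    """
    px, py, pz = pivot
    rotated = {}
    for name, (x, y, z) in joints.items():
        rx, ry, rz = rotate_y((x - px, y - py, z - pz), angle_rad)
        rotated[name] = (rx + px, ry + py, rz + pz)
    return rotated


# Viewpoint at the origin looking along +z. The hand starts nearer to the
# viewpoint than the shoulder, as in a video shot by a camera facing the user.
joints = {"shoulder": (0.0, 0.0, 2.0), "hand": (0.0, 0.0, 1.0)}
turned = rotate_skeleton(joints, math.pi, pivot=(0.0, 0.0, 1.5))
# After a half turn the shoulder is the nearer joint, so the unchanged line of
# sight now runs from the shoulder toward the hand.
```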

The generation unit 14 may also, for example, place a three-dimensional object in the three-dimensional virtual space based on the position and posture of the skeleton of the user 30. For example, using the skeleton from the elbow to the hand of the user 30, the generation unit 14 may place a three-dimensional object of forceps in the three-dimensional virtual space so that its longitudinal direction lies along the direction from the elbow to the wrist of the user 30 and its tip is at the position of the palm of the user 30. Alternatively, using the skeleton of the palm of the user 30, the generation unit 14 may place the three-dimensional object of the forceps so that its longitudinal direction lies along the direction of the arrow A2 in FIG. 5, its angle about the longitudinal direction changes according to the direction of the arrow A1 in FIG. 5, and its tip is at the position of the tip of the middle finger of the user 30. The three-dimensional object may, for example, be stored in the storage unit 11 and read out for use. The three-dimensional object corresponds, for example, to the imitation target; it may be a tool such as forceps or a scalpel, and may include a hand. When the three-dimensional object includes a hand, the generation unit 14 may place, in the three-dimensional virtual space, a three-dimensional object of a hand whose shape corresponds to the hand skeleton indicated by the skeleton recognition result.
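The first placement rule above (long axis along the elbow-to-wrist direction, tip at the palm) can be sketched as follows. This is a simplified illustration: the pose is reduced to a tip position and a unit axis vector, the roll about the axis is ignored, and `place_forceps` and its dictionary layout are hypothetical names chosen for this example.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(ai - bi for ai, bi in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / norm for c in v)


def place_forceps(elbow, wrist, palm):
    """Pose for a forceps model derived from three estimated 3D joints.

    The long axis follows the elbow-to-wrist direction and the tip is put at
    the palm position. Roll about the axis is not handled in this sketch.
    """
    return {"tip": tuple(palm), "axis": normalize(sub(wrist, elbow))}


# Hypothetical joint positions in the virtual space.
pose = place_forceps(elbow=(0.0, 0.0, 0.0), wrist=(0.0, 3.0, 4.0),
                     palm=(0.0, 4.0, 5.0))
```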

The case has been described here in which the three-dimensional object is placed according to the skeleton of the user 30 after the relative positional relationship between the skeleton of the user 30 and the viewpoint and line-of-sight direction is changed, but the order may be reversed. For example, the three-dimensional object may first be placed according to the skeleton of the user 30, and the relative positional relationship between that three-dimensional object and the viewpoint and line-of-sight direction may then be changed.

The generation unit 14 may also generate, for example, a second self-video that is a video from the viewpoint of the user 30. In this case, the generation unit 14 may, for example, specify the user's skeleton 31, including the position of the head of the user 30, in the three-dimensional virtual space based on the skeleton recognition result, take the position of the head of the user 30 as the viewpoint, and specify the direction from that viewpoint toward the hand of the user 30 as the line-of-sight direction. Then, as described above, the generation unit 14 may place the three-dimensional object in the three-dimensional virtual space based on the position and posture of the skeleton of the user 30.
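A viewpoint at the head with a line of sight toward the hand corresponds to what graphics code usually calls a look-at view. The sketch below is one possible construction, assuming a right-handed coordinate system with +y as the world up direction; the function names and the sample coordinates are hypothetical.

```python
import math


def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / norm for c in v)


def first_person_view(head, hand, world_up=(0.0, 1.0, 0.0)):
    """View basis with the eye at the head and the gaze toward the hand.

    Raises ValueError if the gaze is parallel to world_up, because the
    sideways axis is then undefined.
    """
    forward = normalize(sub(hand, head))
    right = normalize(cross(forward, world_up))
    up = cross(right, forward)
    return head, right, up, forward


def to_view(point, view):
    """Coordinates of a point as (sideways, upward, depth) from the eye."""
    eye, right, up, forward = view
    d = sub(point, eye)
    return (dot(d, right), dot(d, up), dot(d, forward))


# Hypothetical positions: head at 1.6 m, hand lower and 0.8 m in front.
view = first_person_view(head=(0.0, 1.6, 0.0), hand=(0.0, 1.0, 0.8))
hand_in_view = to_view((0.0, 1.0, 0.8), view)
```

In this example the hand ends up straight ahead of the viewpoint, at a depth equal to the head-to-hand distance.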

In the three-dimensional virtual space, for example, a three-dimensional object corresponding to the right hand of the user 30 and a three-dimensional object corresponding to the left hand of the user 30 may be placed. In this case, the three-dimensional object corresponding to the right hand may be placed based on the skeleton recognition result for the right hand of the user 30, and the three-dimensional object corresponding to the left hand may be placed based on the skeleton recognition result for the left hand of the user. FIG. 6 shows an example of a second self-video, generated in this way, that includes three-dimensional objects 33a and 33b. The three-dimensional objects 33a and 33b may correspond to the right hand and the left hand of the user 30, respectively. When the three-dimensional objects 33a and 33b need not be distinguished, they may be referred to simply as the three-dimensional object 33. In the second self-video, the area other than the three-dimensional objects 33a and 33b may be, for example, transparent.

The generation unit 14 may generate a two-dimensional image by rendering the three-dimensional object 33 in the three-dimensional virtual space based on the viewpoint and line-of-sight direction. This two-dimensional image shows the three-dimensional object 33 as seen from the viewpoint along the line-of-sight direction in the three-dimensional virtual space. For example, the generation unit 14 may change the position and posture of the three-dimensional object 33 in the three-dimensional virtual space according to the results of skeleton recognition performed repeatedly by the skeleton recognition unit 13, and may repeatedly generate a two-dimensional image by rendering the three-dimensional object 33 after each change. The second self-video may be composed, for example, of the plurality of two-dimensional images generated in this way.

The size of the three-dimensional object 33 in the second self-video is determined by the size of the three-dimensional object in the three-dimensional virtual space, the distance from the viewpoint to the three-dimensional object, the angle of view used for rendering, and so on. As one example, the size of the three-dimensional object in the three-dimensional virtual space and the angle of view used for rendering may be predetermined values, and the distance from the viewpoint to the three-dimensional object may be determined according to the distance from the self-video camera to the part of the user corresponding to the imitation target. The distance from the self-video camera to the part of the user corresponding to the imitation target may be determined, for example, according to the size of that part of the user in the first self-video. In this case, the user can adjust the size of the three-dimensional object 33 in the second self-video by, for example, changing the distance between the self-video camera and the user. The user may also be able to adjust the predetermined values. As another example, the generation unit 14 may automatically adjust the size of the three-dimensional object in the three-dimensional virtual space, the angle of view used for rendering, and so on, so that the size of the three-dimensional object 33 in the second self-video is the same as the size of the imitation target in the reference video. This automatic adjustment may be performed only once, for example at the start of generation of the second self-video, or may be performed repeatedly while the second self-video is being generated. The size of the three-dimensional object 33 being the same as the size of the imitation target may mean, for example, that they are strictly the same, or that they are the same within a predetermined error range. The region of the imitation target in the reference video may be specified, for example, by pattern matching or segmentation.
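How object size, viewing distance, and angle of view combine into an on-screen size can be sketched with the standard pinhole relation. This is an illustration only: it assumes a perspective render with a known vertical field of view and an object facing the viewpoint, and the function names and sample numbers are hypothetical.

```python
import math


def apparent_height_px(object_height, distance, fov_y_deg, image_height_px):
    """Rendered height in pixels of an object facing the viewpoint.

    At `distance`, a perspective view with vertical field of view
    `fov_y_deg` covers a height of 2 * distance * tan(fov / 2).
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    visible_height = 2.0 * distance * math.tan(math.radians(fov_y_deg) / 2.0)
    return image_height_px * object_height / visible_height


def scale_to_match(current_px, target_px):
    """Factor by which to scale the 3D object so that its rendered size
    matches the size measured for the imitation target."""
    if current_px <= 0:
        raise ValueError("current_px must be positive")
    return target_px / current_px


# Hypothetical values: a 0.2-unit object, 0.5 units away, 90-degree field of
# view, 480-pixel-high image.
h_far = apparent_height_px(0.2, 0.5, 90.0, 480)
h_near = apparent_height_px(0.2, 0.25, 90.0, 480)
factor = scale_to_match(current_px=h_far, target_px=120.0)
```

Halving the distance doubles the rendered height, which matches the observation in the text that the user can resize the object by moving relative to the camera.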

The display unit 15 displays the reference video stored in the storage unit 11 and the second self-video generated by the generation unit 14. The display unit 15 preferably displays the two videos so that they can be compared. For example, the display unit 15 may composite the reference video and the second self-video and display the result. In this case, for example, as shown in FIG. 7, the imitation target 21 and the three-dimensional objects 33a and 33b may be displayed together. In FIG. 7, the three-dimensional objects 33a and 33b are drawn with broken lines so that they can be distinguished from the imitation target 21. Also, for example, the second self-video may be composited in front of, that is, on top of, the reference video. As described above, in the second self-video composited onto the reference video, the area other than the three-dimensional object 33 may be transparent. The opacity of the area of the three-dimensional object 33 in the second self-video composited onto the reference video may be, for example, 100% or less than 100%. When the opacity of the second self-video is greater than 0% and less than 100%, that is, when the second self-video is semi-transparent, the user can see both the three-dimensional object 33 and the imitation target 21 even if they overlap.
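Compositing a semi-transparent second self-video over the reference video amounts to ordinary alpha blending. The sketch below works on plain nested lists of RGB tuples to stay self-contained; the function names and the boolean `mask` marking where the three-dimensional object was rendered are assumptions made for illustration.

```python
def composite_pixel(ref_rgb, self_rgb, opacity):
    """Blend one second-self-video pixel over one reference-video pixel.

    opacity is in [0, 1]: 0 shows only the reference, 1 only the overlay.
    """
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must be between 0 and 1")
    return tuple(round(opacity * s + (1.0 - opacity) * r)
                 for r, s in zip(ref_rgb, self_rgb))


def composite_frame(ref, overlay, mask, opacity):
    """Composite a whole frame.

    ref, overlay: rows of RGB tuples with identical dimensions.
    mask[y][x] is True where the 3D object was rendered; everywhere else the
    overlay is treated as fully transparent.
    """
    return [
        [composite_pixel(r, o, opacity if m else 0.0)
         for r, o, m in zip(ref_row, overlay_row, mask_row)]
        for ref_row, overlay_row, mask_row in zip(ref, overlay, mask)
    ]


# Hypothetical 1 x 2 frame: the object covers only the left pixel.
ref = [[(100, 100, 100), (100, 100, 100)]]
overlay = [[(200, 200, 200), (200, 200, 200)]]
mask = [[True, False]]
blended = composite_frame(ref, overlay, mask, opacity=0.5)
```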

The display unit 15 may also, for example, display the two videos while switching between them in a time-division manner. More specifically, the display unit 15 may repeatedly display the reference video for a first time period and then the second self-video for a subsequent second time period. In this case, the display unit 15 may display the reference video, divided into segments of the first time period, in order, with the second self-video displayed for the second time period between segments. The first and second time periods are not particularly limited, but each may be, for example, in the range of 0.1 seconds to 1 second. In this case as well, the user can see both videos. Displaying the reference video and the second self-video while switching between them can also be regarded as repeatedly displaying the second self-video composited in front of the reference video with its opacity set to 0% for the first time period and then to 100% for the second time period. In this case, the area of the second self-video other than the three-dimensional object 33 may be opaque (for example, a single color such as white). Also in this case, instead of switching the opacity between 0% and 100%, the opacity may be varied continuously from 0% to 100%. For example, the opacity may be varied continuously within the range of 0% to 100% in the manner of a sine wave, a sawtooth wave, or a triangular wave.
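The switching and continuous-variation schemes described above can all be expressed as a time-dependent opacity for the second self-video. The following sketch is one way to write that; the function name, the `duty` parameter (the fraction of each period given to the first time period), and the particular phase conventions are assumptions.

```python
import math


def overlay_opacity(t, period, waveform="square", duty=0.5):
    """Opacity (0.0 to 1.0) of the second self-video at time t in seconds.

    period: length of one full cycle in seconds.
    duty:   for "square", the fraction of the cycle during which only the
            reference video is shown (opacity 0).
    """
    if period <= 0:
        raise ValueError("period must be positive")
    phase = (t % period) / period  # 0 <= phase < 1
    if waveform == "square":
        # Time-division display: reference video first, then self-video.
        return 0.0 if phase < duty else 1.0
    if waveform == "sine":
        return 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))
    if waveform == "sawtooth":
        return phase
    if waveform == "triangle":
        return 1.0 - abs(2.0 * phase - 1.0)
    raise ValueError(f"unknown waveform: {waveform}")


# With a 1-second period, sample each waveform a quarter of the way in.
samples = {w: overlay_opacity(0.25, 1.0, w)
           for w in ("square", "sine", "sawtooth", "triangle")}
```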

The display unit 15 may or may not include a display device (for example, a liquid crystal display or an organic EL display) that performs the display. The display may also be performed on another device, in which case the display unit 15 may transmit the video to be displayed to the outside of the device. The display unit 15 may be realized by hardware, or by software such as a driver that drives a display device.

When the imitation target includes an operation object whose shape changes, the generation unit 14 may, for example, generate a second self-video that includes a three-dimensional object of that operation object. An operation object whose shape changes is one whose shape changes in response to an operation by an operator. The operation object may be, for example, one that has an opening/closing part whose shape changes between an open state and a closed state, such as scissors, forceps, tweezers, or tongs. In this case, the first self-video may include the hand 32 of the user 30, and the generation unit 14 may generate a second self-video that includes a three-dimensional object of the operation object whose shape changes according to a gesture of the hand 32 of the user 30 in the first self-video. The gesture may be, for example, the shape of the hand 32 of the user 30, or a movement of the shape of the hand 32. The former is a static gesture and the latter is a dynamic gesture. The generation unit 14 may, for example, identify the gesture of the hand 32 of the user 30 from the first self-video itself, or may identify it using the result of skeleton recognition on the first self-video. A dynamic gesture may be identified using, for example, a change in the shape of the hand 32 or a change in the shape of the skeleton 31 in the first self-video.

When the operation object is forceps having an opening/closing part and a static gesture is identified, the generation unit 14 may, for example, generate a second self-video that includes the three-dimensional object 33 of forceps with the opening/closing part open when the hand 32 is open, in particular when the tip of the index finger and the tip of the thumb of the hand 32 are apart, as shown in FIG. 8A, and may generate a second self-video that includes the three-dimensional object 33 of forceps with the opening/closing part closed when the tip of the index finger and the tip of the thumb of the hand 32 are in contact, as shown in FIG. 8B. Depending on the gesture, the opening/closing part of the three-dimensional object 33 of the forceps may change between two states, namely an open state and a closed state, or may also change in its degree of opening. In the latter case, the generation unit 14 may, for example, use the result of skeleton recognition on the first self-video to obtain the degree to which the tip of the index finger and the tip of the thumb of the hand 32 are close to the open state (FIG. 8A) or to the closed state (FIG. 8B), and may change the degree of opening of the opening/closing part of the three-dimensional object 33 accordingly. In this case, for example, the degree of opening may be changed so that the closer the tip of the index finger and the tip of the thumb of the hand 32 are to the closed state, the closer the opening/closing part of the three-dimensional object 33 is to the closed state. When the operation object is forceps having an opening/closing part and a dynamic gesture is identified, the generation unit 14 may, for example, generate a second self-video that includes the three-dimensional object 33 of forceps whose opening/closing part changes from the open state to the closed state when the hand 32 changes from an open state to a state in which the tip of the index finger and the tip of the thumb are in contact, and may generate a second self-video that includes the three-dimensional object 33 of forceps whose opening/closing part changes from the closed state to the open state when the hand 32 changes from the state in which the tip of the index finger and the tip of the thumb are in contact to a state in which the two tips are apart. In this way, the shape of the three-dimensional object 33 in the second self-video can be changed using the gesture of the hand 32 of the user 30, without using a controller or the like. The position and posture of the three-dimensional object 33 may be changed according to the result of skeleton recognition by the skeleton recognition unit 13, for example, according to the position and posture of the hand 32.
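The mapping from finger positions to the opening of the forceps can be sketched as a normalized distance. This is a hedged illustration, not the embodiment's rule: dividing by a hand-scale length to cancel the effect of camera distance, the threshold ratios, and all function names are assumptions introduced here.

```python
import math


def jaw_opening(thumb_tip, index_tip, hand_scale,
                closed_ratio=0.1, open_ratio=0.6):
    """Degree of opening in [0, 1]: 0 is closed, 1 is fully open.

    thumb_tip, index_tip: fingertip positions from skeleton recognition.
    hand_scale: a reference length in the same units (for example the
        wrist-to-middle-finger-base distance), used so that the result does
        not depend on how far the hand is from the camera.
    closed_ratio, open_ratio: hypothetical thresholds on gap / hand_scale.
    """
    gap = math.dist(thumb_tip, index_tip) / hand_scale
    t = (gap - closed_ratio) / (open_ratio - closed_ratio)
    return max(0.0, min(1.0, t))


def detect_transition(previous_opening, current_opening, threshold=0.5):
    """Dynamic gesture: report a change between open and closed states."""
    was_open = previous_opening >= threshold
    is_open = current_opening >= threshold
    if was_open and not is_open:
        return "close"
    if not was_open and is_open:
        return "open"
    return None


# Hypothetical fingertip positions in pixels, with a hand scale of 100 px.
wide = jaw_opening((0.0, 0.0), (60.0, 0.0), 100.0)
half = jaw_opening((0.0, 0.0), (35.0, 0.0), 100.0)
pinched = jaw_opening((0.0, 0.0), (0.0, 0.0), 100.0)
```

The continuous value can drive the degree of opening of the three-dimensional object, while `detect_transition` covers the two-state case.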

Next, the operation of the video display device 1 will be described with reference to the flowchart of FIG. 9.

(Step S101) The display unit 15 starts displaying the reference video stored in the storage unit 11. The display of the reference video by the display unit 15 is assumed to continue thereafter.

(Step S102) The video acquisition unit 12 determines whether to acquire the first self-video. If the first self-video is to be acquired, the process proceeds to step S103; otherwise, step S102 is repeated until it is determined that the first self-video is to be acquired. The video acquisition unit 12 may, for example, determine periodically that the first self-video is to be acquired.

(Step S103) The video acquisition unit 12 acquires the first self-video. This acquisition may be, for example, the acquisition of one frame of the first self-video. As one example, the video acquisition unit 12 may receive the first self-video from the camera 901.

(Step S104) The skeleton recognition unit 13 performs skeleton recognition of the user 30 included in the first self-video. This skeleton recognition may be performed, for example, on one frame of the first self-video.

(Step S105) The generation unit 14 identifies the gesture of the hand 32 of the user 30 using the first self-video or the result of skeleton recognition. A static gesture may be identified using, for example, one frame, or the result of skeleton recognition performed on one frame. A dynamic gesture may be identified using, for example, a plurality of consecutive frames, or the results of skeleton recognition performed on a plurality of consecutive frames.

(Step S106) The generation unit 14 generates a second self-video including the three-dimensional object 33 using the result of skeleton recognition and the result of identifying the gesture of the hand 32 of the user 30. This generation may be, for example, the generation of one frame of the second self-video. The shape of the three-dimensional object 33 in the second self-video may correspond to the identified gesture of the hand 32 of the user 30. The second self-video may be generated so that the relative positional relationship between the reference video camera and the imitation target is the same as the relative positional relationship between the viewpoint of the second self-video and the three-dimensional object 33.

(Step S107) The display unit 15 displays the generated second self-video together with the reference video. For example, the second self-video and the reference video may be composited and displayed. In this way, the user 30 can see both the reference video and the second self-video. The process then returns to step S102.

The order of processing in the flowchart of FIG. 9 is an example, and the order of the steps may be changed as long as the same result is obtained. In the flowchart of FIG. 9, the processing ends when the power is turned off or when an interrupt to end the processing occurs.
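The per-frame loop of steps S102 to S107 can be summarized in a short sketch. It is only a structural outline: every callable passed in is a hypothetical stand-in for the corresponding unit, the bounded history used for dynamic gestures is an assumption, and the loop ends when the frame source is exhausted instead of on a power-off or termination interrupt.

```python
from collections import deque


def run_display_loop(frames, recognize_skeleton, identify_gesture,
                     generate_frame, display, history_len=5):
    """Run steps S102 to S107 once per acquired frame.

    frames: iterable yielding frames of the first self-video (S102, S103).
    recognize_skeleton(frame) -> skeleton                        (S104)
    identify_gesture(list_of_recent_skeletons) -> gesture        (S105)
    generate_frame(skeleton, gesture) -> second self-video frame (S106)
    display(second_frame): show it together with the reference   (S107)
    Returns the number of frames processed.
    """
    history = deque(maxlen=history_len)  # recent skeletons for dynamic gestures
    processed = 0
    for frame in frames:
        skeleton = recognize_skeleton(frame)
        history.append(skeleton)
        gesture = identify_gesture(list(history))
        display(generate_frame(skeleton, gesture))
        processed += 1
    return processed


# Usage with trivial stand-ins, just to show the data flow.
shown = []
count = run_display_loop(
    frames=[1, 2, 3],
    recognize_skeleton=lambda frame: frame * 10,
    identify_gesture=lambda recent: len(recent),
    generate_frame=lambda skeleton, gesture: (skeleton, gesture),
    display=shown.append,
)
```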

Next, the operation of the video display device 1 according to the present embodiment will be described using a specific example. In this example, the reference video is assumed to be a video of a situation in which the imitation target 21, which is forceps, is being operated by a surgical robot, as shown in FIG. 3. Also, as shown in FIG. 2, the video display device 1 is assumed to be realized by a computer 900 that is a notebook computer, and the first self-video including the hand 32 of the user 30 is assumed to be shot using a camera 901 built into the computer 900.

First, assume that the user 30 operates the computer 900 and inputs an instruction to display the reference video and the second self-video. In response to the instruction, the display unit 15 starts displaying the reference video stored in the storage unit 11 on the display device 902 (step S101). The video acquisition unit 12 acquires the first self-video of the user 30 captured by the camera 901 and passes it to the skeleton recognition unit 13 (steps S102 and S103). As shown in FIG. 4, the skeleton recognition unit 13 recognizes the skeleton 31 of the user 30 in the first self-video captured by the camera 901 facing the user 30, and passes the recognized skeleton 31 to the generation unit 14 (step S104). Upon receiving the recognized skeleton 31, the generation unit 14 identifies the hand gesture of the user 30 using the shape of the hand portion of the skeleton 31 (step S105). Using the result of the skeleton recognition and the result of the gesture identification, the generation unit 14 then arranges, in a three-dimensional virtual space, the three-dimensional objects 33a and 33b having a shape corresponding to the identified gesture so that their position and posture correspond to the skeleton 31 of the user 30, generates the second self-video by rendering the three-dimensional objects 33a and 33b as a video from the first-person viewpoint of the user 30, and passes it to the display unit 15 (step S106). For example, a second self-video such as that shown in FIG. 6 is generated. Upon receiving the second self-video, the display unit 15 combines the second self-video with the reference video and displays the result (step S107). As a result, the user 30 can view the reference video and the second self-video displayed on the display device 902 of the computer 900, as shown in FIG. 2. By repeating the acquisition of the first self-video, the recognition of the skeleton, the identification of the gesture, the generation of the second self-video based on them, and the display of the reference video and the second self-video, the user 30 can move the three-dimensional objects 33a and 33b of the forceps in accordance with the movement of his or her own hand so as to follow the movement of the forceps included in the reference video. The user 30 can thus train to move the three-dimensional objects 33a and 33b in the same way as the imitation target included in the reference video.
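One iteration of steps S105 to S107 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation described in this specification: the `HandSkeleton` fields, the thumb-to-index pinch threshold of 0.03, and all class and function names are hypothetical, and rendering and compositing are reduced to returning plain data records.

```python
from dataclasses import dataclass
from math import dist
from typing import Tuple

Point = Tuple[float, float, float]


@dataclass
class HandSkeleton:
    """Hypothetical output of skeleton recognition (step S104) for one hand."""
    wrist: Point
    thumb_tip: Point
    index_tip: Point


@dataclass
class ForcepsObject:
    """Hypothetical 3D object placed in the virtual space (step S106)."""
    position: Point
    jaws: str  # "open" or "closed"


def identify_gesture(hand: HandSkeleton, pinch_threshold: float = 0.03) -> str:
    """Step S105 (assumed rule): a small thumb-index gap means a closing gesture."""
    gap = dist(hand.thumb_tip, hand.index_tip)
    return "closed" if gap < pinch_threshold else "open"


def generate_second_self_frame(hand: HandSkeleton) -> ForcepsObject:
    """Step S106 (sketch): place the object at the hand and shape it by gesture."""
    return ForcepsObject(position=hand.wrist, jaws=identify_gesture(hand))


def compose(reference_frame: str, overlay: ForcepsObject) -> dict:
    """Step S107 (sketch): pair the reference frame with the generated overlay."""
    return {"reference": reference_frame, "overlay": overlay}


def process_frame(reference_frame: str, hand: HandSkeleton) -> dict:
    """Run steps S105 to S107 for one recognized skeleton."""
    return compose(reference_frame, generate_second_self_frame(hand))
```

Calling `process_frame` once per captured frame corresponds to the repetition described above; a real system would replace the string frame and the data record with actual image buffers and a renderer.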

As described above, the video display device 1 according to the present embodiment can generate and display a second self-video that can be compared with the reference video without preparing a shooting environment similar to that used when the reference video was shot. The user 30 can therefore train to perform the same motion as the imitation target while comparing the reference video with the second self-video. For example, even when the reference video is a first-person video, the user 30 can shoot his or her own video using the facing camera of a notebook computer, tablet terminal, smartphone, or the like, and need not prepare a head-mounted camera or the like, which has the advantage of reducing the associated cost and time. Furthermore, when the shape of the three-dimensional object 33 of the operation target is changed in accordance with the hand gesture of the user 30, no controller or the like is needed to change the shape of the three-dimensional object 33, so training can be performed with a simple configuration.
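The conversion from a facing camera to a first-person viewpoint can be illustrated with a small coordinate transform. The sketch below assumes a camera frame with x to the camera's right, y up, and z pointing from the camera toward the user, and assumes that the user squarely faces the camera; under those assumptions, a point expressed relative to the user's head is obtained by a 180-degree rotation about the vertical axis. The function name and the frame conventions are hypothetical and are not taken from this specification.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def camera_to_first_person(point_cam: Vec3, head_cam: Vec3) -> Vec3:
    """Express a camera-frame point in an assumed head-centred user frame.

    Assumed user frame: x to the user's right, y up, z forward from the user.
    Because the user is assumed to face the camera squarely, the offset from
    the head is rotated 180 degrees about the vertical (y) axis.
    """
    dx = point_cam[0] - head_cam[0]
    dy = point_cam[1] - head_cam[1]
    dz = point_cam[2] - head_cam[2]
    return (-dx, dy, -dz)
```

For instance, with the head 1.0 in front of the camera and the hand 0.5 in front of the camera, the hand lies 0.5 in front of the user in the first-person frame, which is where a three-dimensional object following the hand would be placed relative to the rendering viewpoint.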

Note that, in the present embodiment, the case where the shape of the three-dimensional object 33 is changed using the gesture of the hand 32 of the user 30 has been described, but this need not be the case. The user 30 may change the shape of the three-dimensional object 33 by operating a controller. In this case, the video display device 1 may further include a reception unit 16 that receives instructions from the controller 4 operated by the user 30, as shown in FIG. 10. The reception unit 16 may, for example, receive instructions from the controller 4 by wire or wirelessly. Note that the reception unit 16 may or may not include a device for performing the reception (for example, a communication device). Further, the reception unit 16 may be realized by hardware, or by software such as a driver that drives a predetermined device.

The controller 4 operated by the user 30 may be, for example, the one shown in FIG. 11. The controller 4 shown in FIG. 11 has a button 4a, and while the user 30 is pressing the button 4a, an instruction to close the opening/closing part of the three-dimensional object 33 may, for example, be transmitted to the video display device 1. In this case, the generation unit 14 may, for example, generate a second self-video including a three-dimensional object of the operation target whose shape changes according to the instruction received by the reception unit 16. More specifically, while the user 30 is pressing the button 4a, the generation unit 14 may generate a second self-video including the three-dimensional object 33 with its opening/closing part closed, in response to the instruction to close the opening/closing part received by the reception unit 16. When the user 30 is not pressing the button 4a, a second self-video including the three-dimensional object 33 with its opening/closing part open may be generated. In this way, the three-dimensional object 33 of the operation target can be operated using the controller 4. Therefore, for example, when the reference video is a video of a surgical robot, the user 30 can operate the three-dimensional object 33 of the operation target using a controller 4 similar to the controller used to operate that surgical robot, so that the user 30 can operate the three-dimensional object 33 in an environment similar to that of the surgical robot in a real environment.
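The button-driven shape change can be sketched as follows. This is a hedged illustration only: the `ReceptionUnit` class stands in for the reception unit 16, and its field, its method, and the `"open"`/`"closed"` labels are hypothetical names introduced here, not an interface defined in this specification.

```python
from dataclasses import dataclass


@dataclass
class ReceptionUnit:
    """Hypothetical stand-in for reception unit 16: keeps the latest button state."""
    button_pressed: bool = False

    def receive(self, pressed: bool) -> None:
        """Record the most recent instruction sent by the controller."""
        self.button_pressed = pressed


def object_shape(reception: ReceptionUnit) -> str:
    """Closed while the button is held, open otherwise (as described above)."""
    return "closed" if reception.button_pressed else "open"
```

The generation step would query `object_shape` each frame, so that releasing the button immediately reopens the opening/closing part in the next generated frame.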

Further, in the present embodiment, the case where the video display device 1 is realized by a notebook computer with a camera, a tablet terminal with a camera, or the like has mainly been described, but this need not be the case. For example, the video display device 1 according to the present embodiment may be realized by a desktop personal computer or the like to which an external camera is connected. In recent years, a camera is sometimes placed around the display of a desktop personal computer so as to face the user viewing the display, for web meetings and the like; that camera may be used as the self-video camera, and that display may be used as the display for displaying the reference video and the second self-video.

Further, in the present embodiment, the three-dimensional object may be changeable. For example, the imitation target may change in the reference video. Specifically, the imitation target may change from forceps to a scalpel. In such a case, the three-dimensional object in the second self-video may also be changed in accordance with the change of the imitation target in the reference video. This change may, for example, be made manually. When the three-dimensional object is changed manually, the user may, for example, be able to change the three-dimensional object by a hand gesture. In this case, the three-dimensional object may be changed, for example, when the user performs a hand-sweeping gesture. The three-dimensional object may also be changed automatically. In this case, for example, the generation unit 14 may identify the type of the imitation target included in the reference video and arrange a three-dimensional object corresponding to the identified type in the three-dimensional virtual space. The type of the imitation target may be identified by, for example, pattern matching or object recognition.
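Both ways of changing the three-dimensional object can be sketched with a small selector. This is an illustrative sketch under assumptions: the `ObjectSelector` class, its method names, and the string labels are hypothetical, the sweeping gesture is assumed to have been detected elsewhere, and the detected type is assumed to come from whatever pattern matching or object recognition is applied to the reference video.

```python
from typing import List, Optional


class ObjectSelector:
    """Tracks which 3D object is currently placed in the virtual space."""

    def __init__(self, objects: List[str]) -> None:
        if not objects:
            raise ValueError("at least one object type is required")
        self._objects = list(objects)
        self._index = 0

    @property
    def current(self) -> str:
        return self._objects[self._index]

    def on_swipe(self) -> str:
        """Manual change: advance to the next object, wrapping around."""
        self._index = (self._index + 1) % len(self._objects)
        return self.current

    def on_reference_type(self, detected: Optional[str]) -> str:
        """Automatic change: switch only when the detected type is known."""
        if detected in self._objects:
            self._index = self._objects.index(detected)
        return self.current
```

Ignoring unknown or missing detections keeps the current object in place, which avoids flicker when recognition on the reference video momentarily fails; that behaviour is a design choice of this sketch rather than something stated in the specification.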

Further, in the above embodiment, the case where the video display device 1 is stand-alone has been described, but the video display device 1 may be a stand-alone device or a server device in a server-client system. In the latter case, the video acquisition unit and the display unit may acquire video or display video via a communication line.

Further, in the above embodiment, each process or each function may be realized by centralized processing by a single device or a single system, or may be realized by distributed processing by a plurality of devices or a plurality of systems.

Further, in the above embodiment, when the two components that exchange information are physically different, the exchange of information between the components may be performed by one component outputting the information and the other component receiving it; when the two components that exchange the information are physically the same, the exchange may be performed by moving from the processing phase corresponding to one component to the processing phase corresponding to the other component.

Further, in the above embodiment, information related to the processing executed by each component, for example, information that each component has accepted, acquired, selected, generated, transmitted, or received, and information such as threshold values, formulas, and addresses used by each component in its processing, may be held temporarily or for a long period in a recording medium (not shown), even if this is not specified in the above description. The information may be stored in the recording medium (not shown) by each component or by a storage unit (not shown). The information may be read from the recording medium (not shown) by each component or by a reading unit (not shown).

Further, in the above embodiment, when information used by each component or the like, for example, information such as threshold values, addresses, and various setting values used by each component in its processing, may be changed by the user, the user may or may not be allowed to change that information as appropriate, even if this is not specified in the above description. When the user can change the information, the change may be realized by, for example, a reception unit (not shown) that receives a change instruction from the user and a change unit (not shown) that changes the information in accordance with the change instruction. The reception of the change instruction by the reception unit (not shown) may be, for example, reception from an input device, reception of information transmitted via a communication line, or reception of information read from a predetermined recording medium.

Further, in the above embodiment, when two or more components included in the video display device 1 have a communication device, an input device, or the like, the two or more components may physically share a single device or may have separate devices.

Further, in the above embodiment, each component may be configured by dedicated hardware, or a component that can be realized by software may be realized by executing a program. For example, each component can be realized by a program execution unit such as a CPU reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. During execution, the program execution unit may execute the program while accessing a storage unit or a recording medium. The software that realizes the video display device 1 in the above embodiment is the following program. That is, this program causes a computer that can access a storage unit storing a reference video, which is a video of the motion of an imitation target whose motion is to be imitated by a user, to function as: a video acquisition unit that acquires a first self-video, which is a video of the user's motion; a skeleton recognition unit that performs skeleton recognition of the user included in the first self-video; a generation unit that generates, using the result of the skeleton recognition by the skeleton recognition unit, a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves in accordance with the user's motion, such that the relative positional relationship between a reference-video camera that shoots the reference video and the imitation target is the same as the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object; and a display unit that displays the reference video and the second self-video, wherein the relative positional relationship between the reference-video camera and the imitation target differs from the relative positional relationship between a self-video camera that captures the first self-video and the part of the user corresponding to the imitation target.

Note that the functions realized by the above program do not include functions that can be realized only by hardware. For example, functions that can be realized only by hardware such as an interface card in an acquisition unit that acquires information or a display unit that displays information are at least not included in the functions realized by the above program.

Further, this program may be executed by being downloaded from a server or the like, or may be executed by reading a program recorded on a predetermined recording medium (for example, an optical disc such as a CD-ROM, a magnetic disk, or a semiconductor memory). This program may also be used as a program constituting a program product.

Further, this program may be executed by a single computer or by a plurality of computers. That is, centralized processing or distributed processing may be performed.

FIG. 12 is a diagram showing an example of the configuration of a computer 900 that executes the above program to realize the video display device 1 according to the above embodiment. In FIG. 12, the computer 900 includes a camera 901, a display device 902, a keyboard 903, a pointing device 904 such as a touch pad or a mouse, an MPU (Micro Processing Unit) 911, a ROM 912 for storing programs such as a boot-up program, a RAM 913 that is connected to the MPU 911, temporarily stores instructions of application programs, and provides a temporary storage space, a storage unit 914 that stores application programs, system programs, and data, and a communication module 915 that provides connection to a LAN, a WAN, or the like. The MPU 911, the ROM 912, and the like may be connected to each other via a bus. The storage unit 914 may be, for example, a hard disk or an SSD (Solid State Drive). The camera 901, the display device 902, the keyboard 903, the pointing device 904, and the like may be, for example, devices built into the computer 900 or external devices.

The program that causes the computer 900 to execute the functions of the video display device 1 according to the above embodiment may be loaded into the RAM 913 at the time of execution. The program may be loaded, for example, from the storage unit 914 or directly from a network.

The program does not necessarily have to include an operating system (OS), a third-party program, or the like that causes the computer 900 to execute the functions of the video display device 1 according to the above embodiment. The program may include only those instructions that call appropriate functions or modules in a controlled manner so that the desired results are obtained. How the computer 900 operates is well known, and a detailed description is omitted.

Further, the above embodiment is an example for concretely carrying out the present invention and does not limit the technical scope of the present invention. The technical scope of the present invention is indicated by the claims rather than by the description of the embodiment, and is intended to include modifications within the literal scope of the claims and within the scope of their equivalents.

1 Video display device
11 Storage unit
12 Video acquisition unit
13 Skeleton recognition unit
14 Generation unit
15 Display unit
16 Reception unit

Claims

A video display device comprising:
a storage unit that stores a reference video, which is a video of the motion of an imitation target whose motion is to be imitated by a user;
a video acquisition unit that acquires a first self-video, which is a video of the user's motion;
a skeleton recognition unit that performs skeleton recognition of the user included in the first self-video;
a generation unit that generates, using the result of the skeleton recognition by the skeleton recognition unit, a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves in accordance with the user's motion, such that the relative positional relationship between a reference-video camera that shoots the reference video and the imitation target is the same as the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object; and
a display unit that displays the reference video and the second self-video,
wherein the first self-video is acquired while the reference video is displayed,
the reference video is a video shot by a camera or a CG video corresponding to a video shot by a camera, and
the relative positional relationship between the reference-video camera and the imitation target differs from the relative positional relationship between a self-video camera that captures the first self-video and the part of the user corresponding to the imitation target.

The video display device according to claim 1, wherein the generation unit generates a second self-video in which the line-of-sight direction based on the result of the skeleton recognition by the skeleton recognition unit is changed by a set angle.

The video display device according to claim 1, wherein the reference video is a video from the viewpoint of the person being imitated, who moves the imitation target, and
the generation unit generates a second self-video that is a video from the user's viewpoint.

The video display device, wherein the imitation target includes an operation target whose shape changes,
the first self-video includes the user's hand, and
the generation unit generates a second self-video including a three-dimensional object of the operation target whose shape changes according to a hand gesture of the user included in the first self-video.

The video display device according to claim 1, wherein the imitation target includes an operation target whose shape changes,
the video display device further comprises a reception unit that receives instructions from a controller operated by the user, and
the generation unit generates a second self-video including a three-dimensional object of the operation target whose shape changes according to an instruction received by the reception unit.

6. The video display device according to claim 1, wherein the display unit combines and displays the reference video and the second self-video.

The video display device according to any one of claims 1 to 5, wherein the self-video camera and the display device on which the reference video and the second self-video are displayed are arranged such that the optical-axis direction of the self-video camera toward the subject is opposite to the line-of-sight direction of a viewer directly facing the display device.

A video display method performed using a storage unit that stores a reference video, which is a video of the motion of an imitation target whose motion is to be imitated by a user, a video acquisition unit, a skeleton recognition unit, a generation unit, and a display unit, the method comprising:
a step in which the video acquisition unit acquires a first self-video, which is a video of the user's motion;
a step in which the skeleton recognition unit performs skeleton recognition of the user included in the first self-video;
a step in which the generation unit generates, using the result of the skeleton recognition, a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves in accordance with the user's motion, such that the relative positional relationship between a reference-video camera that shoots the reference video and the imitation target is the same as the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object; and
a step in which the display unit displays the reference video and the second self-video,
wherein the first self-video is acquired while the reference video is displayed,
the reference video is a video shot by a camera or a CG video corresponding to a video shot by a camera, and
the relative positional relationship between the reference-video camera and the imitation target differs from the relative positional relationship between a self-video camera that captures the first self-video and the part of the user corresponding to the imitation target.

A program for causing a computer that can access a storage unit storing a reference video, which is a video of the motion of an imitation target whose motion is to be imitated by a user, to function as:
a video acquisition unit that acquires a first self-video, which is a video of the user's motion;
a skeleton recognition unit that performs skeleton recognition of the user included in the first self-video;
a generation unit that generates, using the result of the skeleton recognition by the skeleton recognition unit, a second self-video, which is a video of a three-dimensional object that corresponds to the imitation target and moves in accordance with the user's motion, such that the relative positional relationship between a reference-video camera that shoots the reference video and the imitation target is the same as the relative positional relationship between the viewpoint of the video of the three-dimensional object and the three-dimensional object; and
a display unit that displays the reference video and the second self-video,
wherein the first self-video is acquired while the reference video is displayed,
the reference video is a video shot by a camera or a CG video corresponding to a video shot by a camera, and
the relative positional relationship between the reference-video camera and the imitation target differs from the relative positional relationship between a self-video camera that captures the first self-video and the part of the user corresponding to the imitation target.