JP2023125517A

JP2023125517A - Information processing apparatus and program

Info

Publication number: JP2023125517A
Application number: JP2022029644A
Authority: JP
Inventors: 雄一郎山本; Yuichiro Yamamoto; 清貴田中; Seiki Tanaka
Original assignee: Tribawl; Tribawl Co Ltd
Current assignee: Tribawl; Tribawl Co Ltd
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2023-09-07

Abstract

PROBLEM TO BE SOLVED: To increase the variety of motion, based on a person's motion, of at least one of an avatar and a robot.

SOLUTION: A skeleton information generation system H1 generates skeleton information from a detection target image KTG represented by a target video ST1, and outputs the generated skeleton information as a data string ST5, and the detection target image KTG as an image ST6 of the detection target, to a posture evaluation system H2. In evaluating the posture of the detection target image KTG using model data TD, the posture evaluation system H2 generates text information to be conveyed to the detection target KT by difference-from-model AI analysis SH2, which uses the difference between the skeleton information included in the model data TD and the input skeleton information. The text information is output as speech from an audio output system TO. The detection target image KTG and a model image OG included in the model data TD are displayed by an image display system GH.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing device and a program.

A technique for estimating skeleton information of a person in an image captured by a camera is known (see, for example, Patent Document 1). A technique for remotely controlling a robot to perform a specific operation based on a person's movement is also known (see, for example, Patent Document 2).

Patent Document 1: JP2021-189946A
Patent Document 2: Japanese Unexamined Patent Publication No. H07-160310

A robot of the kind described above is controlled so as to move in accordance with a person's physical movements or operations. Under such motion control, the robot's motion is not only what the person making the physical movement (including operation) has assumed in advance, but also depends solely on that person. Depending on the robot's use, it is considered desirable to increase the variety of the robot's motion in view of convenience for people and the impression the robot makes on them. The same applies when an image such as an avatar is moved by a person's motion.

An object of the present invention is to provide an information processing device and a program capable of increasing the variety of motion, based on a person's motion, of at least one of an avatar and a robot.

An information processing device according to one aspect of the present disclosure includes: image data acquisition means for acquiring image data obtained by imaging a detection target; skeleton information generation means for generating, based on a detection target image represented by the image data, skeleton information including position information of joints in the detection target image and relationship information indicating relationships between the joints; analysis means for analyzing motion content in the detection target image based on the skeleton information; text information generation means for generating text information based on the analysis result of the motion content; and motion control means for causing, based on the text information, at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, to move as a motion target.

A program according to one aspect of the present disclosure causes an information processing device to execute processing of: generating, based on a detection target image represented by image data obtained by imaging intended to capture a detection target, skeleton information including position information of joints in the detection target image and relationship information indicating relationships between the joints; analyzing motion content in the detection target image based on the skeleton information; generating text information based on the analysis result of the motion content; and causing, based on the text information, at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, to move as a motion target.

According to the present invention, the variety of motion, based on a person's motion, of at least one of an avatar and a robot can be increased.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating an example of a service providing system constructed by applying the present invention and the content of the service provided by that system.
FIG. 2 is a diagram illustrating an example of a movement, including a posture, for causing the motion target to perform another motion.
FIG. 3 is a diagram illustrating an example of a network environment to which an AP server according to an embodiment of the information processing device of the present invention is connected.
FIG. 4 is a diagram illustrating an example of a venue for games and the like.
FIG. 5 is a block diagram showing an example of the hardware configuration of the AP server according to an embodiment of the information processing device of the present invention.
FIG. 6 is a functional block diagram showing an example of the functional configuration implemented on the AP server according to an embodiment of the information processing device of the present invention.
FIG. 7 is a flowchart showing an example of a video display process executed by a CPU installed in the AP server, which is the information processing device according to the present embodiment.

Hereinafter, embodiments for carrying out the present invention will be described with reference to the drawings. The embodiment described is merely an example, and the technical scope of the present invention is not limited to it. The technical scope of the present invention also includes various modifications.

FIG. 1 is a diagram illustrating an example of a service providing system constructed by applying the present invention and the content of the service provided by that system. Here, the service provided by this service providing system SY is referred to as "the present service".

In the example shown in FIG. 1, the subject that camera C is assumed to image (film as video) is an animal, principally a person. In this example, a person is the detection target KT, that is, the target of skeleton detection, and it is assumed that information to be conveyed is delivered to the detection target KT in a timely manner by audio output, according to the motion content of the detection target KT.

To check the motion content of the detection target KT, skeleton information of the detection target KT is generated. This skeleton information includes position information of the person's joints and relationship information indicating relationships between joints. The relationship information is, for example, information representing the distance and direction between joints that are adjacent on the skeleton. The skeleton information is generated by analyzing the target video ST1 obtained by imaging with camera C; that is, the skeleton information of the detection target KT is generated from the target video ST1. The skeleton image SG shown in FIG. 1 represents an example of the joint positions and of the joint pairs for which relationship information is generated. The generation of skeleton information itself is performed by a well-known technique.
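As an illustration of the skeleton information described above (joint positions plus, for each pair of adjacent joints, a distance and a direction), the following is a minimal sketch. The joint names, the `BONES` adjacency list, the 2D coordinates, and the dictionary layout are all assumptions made for the example; the publication does not specify a data format.

```python
import math
from dataclasses import dataclass

# Hypothetical adjacency list: pairs of joints that are neighbours on the skeleton.
BONES = [("shoulder_r", "elbow_r"), ("elbow_r", "wrist_r")]


@dataclass
class Relation:
    parent: str
    child: str
    distance: float   # distance between the two adjacent joints
    direction: float  # angle in radians, measured from the +x axis


def build_skeleton_info(joints, bones=BONES):
    """Combine joint positions with distance/direction relations per joint pair."""
    relations = []
    for a, b in bones:
        (ax, ay), (bx, by) = joints[a], joints[b]
        dx, dy = bx - ax, by - ay
        relations.append(Relation(a, b, math.hypot(dx, dy), math.atan2(dy, dx)))
    return {"joints": dict(joints), "relations": relations}


# Example input: 2D joint positions as they might come from a pose estimator.
joints = {"shoulder_r": (0.0, 0.0), "elbow_r": (3.0, 4.0), "wrist_r": (3.0, 9.0)}
info = build_skeleton_info(joints)
```

Storing the relation alongside the raw positions makes later steps, such as comparing against a model posture or rescaling limbs, operate on bone lengths and angles rather than on absolute pixel coordinates.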

More than one camera C may be used for imaging. When multiple cameras C are installed at different positions, the target videos ST1 obtained from the cameras make it possible to see parts of the detection target KT that fall in the blind spot of one or more of the cameras. Skeleton information can therefore be generated with higher accuracy. This is particularly useful, for example, when there are two or more detection targets KT. For convenience, the description here assumes a single camera C.

As shown in FIG. 1, the service providing system SY includes a skeleton information generation system H1, a posture evaluation system H2, and a posture management system H3.

Although FIG. 1 shows the systems H1 to H3 separately, these systems (their functions) may be implemented on the same information processing device, or two of them may be implemented on the same device. Here, as shown in FIG. 1, it is assumed that the systems H1 to H3 are implemented on different information processing devices, that is, that the service providing system SY is built using three or more information processing devices.

The skeleton information generation system H1 is implemented on an information processing device running an OS (operating system) H11. The OS H11 includes a camera device video acquisition/management switch (SW) H111. The switch H111 is a function that switches the destination to which various information received by the skeleton information generation system H1, including the target video ST1, is passed. The target video ST1 received by the skeleton information generation system H1 is thus passed by the switch H111 to the processing engine H112 as target video ST2. The processing engine H112 and the skeleton detection unit H113 are both functions implemented, for example, by application programs (hereinafter "applications") running on the OS H11.

The processing engine H112 is a function that performs overall control for providing the present service. By a skeleton detection process call ST3, the processing engine H112 causes the skeleton detection unit H113 to execute the skeleton detection process and generate skeleton information using the target video ST2. To do so, the processing engine H112 passes, for example, the target video ST2 to the skeleton detection unit H113 with the call ST3. What the processing engine H112 passes to the skeleton detection unit H113 may be, instead of the target video ST2, video information of only the portion containing the detection target image. Here it is assumed that the target video ST2 is passed from the processing engine H112 to the skeleton detection unit H113.

The skeleton detection unit H113 executes the skeleton detection process for generating skeleton information and returns the processing result ST4 to the processing engine H112. This processing result ST4 is the skeleton information, generated in the form of a data string of a skeleton representation.

The processing engine H112 can output the skeleton information returned from the skeleton detection unit H113 as the processing result ST4 to each of the posture evaluation system H2 and the posture management system H3 as a skeleton representation data string ST5. The detection target image KTG, which is an image of the detection target KT, is also output to the posture evaluation system H2 as the image ST6 of the detection target.

The posture evaluation system H2 can output to the image display system GH, as part of the image ST8 to be displayed, either the image ST6 of the detection target input from the skeleton information generation system H1 or a character image generated using the skeleton representation data string ST5. The image ST8 is, for example, the image for one screen, and the character image is placed and displayed within that screen. The character image is intended as an alter ego of the detection target KT.

Even when another character image is displayed, the characters may include an alter ego of another detection target KT or an alter ego of a virtual personality. The virtual personality is, specifically, one devised for providing the present service. Hereinafter, a character positioned as such an alter ego is referred to as an "avatar". Unless otherwise noted, "character" refers to a character other than an avatar.

The image display system GH displays a screen on the display D using the image ST8 input from the posture evaluation system H2. The posture evaluation system H2 can therefore freely change the content of the screen displayed on the display D through the image ST8 it outputs to the image display system GH. FIG. 1 shows the detection target image KTG represented by the image ST8 displayed on the display D. An avatar image generated from the skeleton information may be displayed instead of the detection target image KTG.

The display D is, for example, a large display device or a screen. When the display D is a screen, the image display system GH also includes a projector capable of projecting onto the screen.

Meanwhile, the posture evaluation system H2 generates the text information to be output as audio by analysis using the skeleton representation data string ST5. For this purpose, the posture evaluation system H2 performs difference-from-model AI (artificial intelligence) analysis SH2.

As described in detail later, this text information also functions as control information for controlling various motions of a motion target, which is an avatar image or a robot. The various motions include not only motions that visually change the motion target but also speech motions, made by display or by sound emission, that involve no visual change.

The difference-from-model AI analysis SH2 is an analysis performed on the premise that the posture (pose) the detection target KT should take at a given timing is known in advance. For this analysis, model data TD representing a model posture is prepared. Movements for which the posture to be taken is known in advance include dance, games that require imitating a model movement or taking one or more specified postures, and exercise (for example, sports) for which an ideal movement can be conceived. Hereinafter, a series of motions arising from such requirements or exercise is collectively referred to as a "posture-required motion".

In addition to a model image OG, which is an image of the model posture, the model data TD includes skeleton information. In the difference-from-model AI analysis SH2, the difference between the skeleton representation data string ST5 and the skeleton information included in the model data TD is calculated, and AI analysis using that difference is performed. Through this analysis, the posture represented by the detection target image KTG is evaluated, and text information corresponding to the evaluation result is generated. In the example shown in FIG. 1, the difference correction instruction ST9 output from the posture evaluation system H2 to the audio output system TO instructs that the text information be emitted as speech. Specifically, the difference correction instruction ST9 is an audio signal generated for sound emission, or a command or the like containing the information needed to generate that audio signal.

For the analysis using the difference, the AI is trained by deep learning using training data that represents the relationship between differences and the text information to be generated. As a result of this deep learning, the difference-from-model AI analysis SH2 can generate appropriate text information once the difference is generated (calculated). Because the ratio between the size of the detection target image KTG and the size of the model image OG represented by the model data TD is not necessarily within an appropriate range, the difference is generated, for example, after one of the two pieces of skeleton information has been scaled up or down.
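The scale-then-subtract step can be sketched as follows. This is only one plausible way to do it: both skeletons are translated to a common root joint and scaled to unit height before the per-joint difference is taken. The joint names, the choice of `hip` as root, normalization by vertical extent, and the convention that y increases upward are assumptions for the example, not details from the publication.

```python
def skeleton_height(joints):
    """Vertical extent of the skeleton, used here as the size reference."""
    ys = [y for _, y in joints.values()]
    return max(ys) - min(ys)


def normalize(joints, root="hip"):
    """Move the root joint to the origin and scale the skeleton to height 1."""
    rx, ry = joints[root]
    height = skeleton_height(joints)
    if height == 0:
        raise ValueError("skeleton has zero height; cannot normalize")
    return {name: ((x - rx) / height, (y - ry) / height)
            for name, (x, y) in joints.items()}


def joint_differences(detected, model, root="hip"):
    """Per-joint offset (model minus detected) after size normalization."""
    d = normalize(detected, root)
    m = normalize(model, root)
    return {name: (m[name][0] - d[name][0], m[name][1] - d[name][1])
            for name in d if name in m}


# The detected person is twice as large and elsewhere in the frame,
# and holds the right hand lower than the model does.
model = {"hip": (0.0, 0.0), "head": (0.0, 2.0), "hand_r": (1.0, 1.0)}
detected = {"hip": (10.0, 10.0), "head": (10.0, 14.0), "hand_r": (12.0, 11.0)}
diffs = joint_differences(detected, model)
```

With this input only `hand_r` has a non-zero difference, because the size and position mismatch between the two skeletons has been removed before subtraction.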

FIG. 1 shows the model image OG displayed on the display D in addition to the detection target image KTG.

The detection target image KTG differs most from the model image OG in the right hand. The left hand also clearly differs, but to a lesser degree than the right hand. In this case, therefore, the difference-from-model AI analysis SH2 generates text information representing a character string such as "right hand a little higher". As a result, this character string is output, that is, emitted, as speech from the audio output system TO. This audio output is performed treating the detection target image KTG or the model image OG as the avatar image.
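The publication generates this text with a deep-learning model. As a stand-in that only illustrates the input/output relationship, the sketch below uses a simple hand-written rule: pick the joint with the largest offset from the model and turn its offset into a short instruction. The thresholds, the joint labels, the wording, and the meaning of the offset signs (positive y means "higher", positive x means "to the right") are all assumptions for the example.

```python
import math

# Hypothetical display names for joints.
LABELS = {"hand_r": "right hand", "hand_l": "left hand"}


def correction_text(diffs, ok_threshold=0.05, small=0.3):
    """Return an instruction for the joint farthest from the model, or None.

    diffs maps joint name -> (dx, dy), the offset from the detected joint
    to the model joint in normalized units.
    """
    if not diffs:
        return None
    name, (dx, dy) = max(diffs.items(), key=lambda kv: math.hypot(*kv[1]))
    size = math.hypot(dx, dy)
    if size < ok_threshold:
        return None  # posture judged close enough; no spoken correction
    if abs(dy) >= abs(dx):
        direction = "higher" if dy > 0 else "lower"
    else:
        direction = "to the right" if dx > 0 else "to the left"
    amount = "a little " if size < small else ""
    return f"{LABELS.get(name, name)} {amount}{direction}"


# Right hand is farther from the model than the left hand.
message = correction_text({"hand_r": (0.0, 0.25), "hand_l": (0.1, 0.0)})
no_message = correction_text({"hand_r": (0.0, 0.01)})
```

Returning `None` for a good posture leaves room for the caller to emit something else instead, such as a sound effect.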

Audio output by the audio output system TO is directed at the detection target KT. The detection target KT thus obtains, in a timely manner and by audio output, information useful for taking the appropriate posture, as if from the detection target image KTG or the model image OG displayed as an avatar image on the display D. In effect, the displayed detection target image KTG or model image OG serves as the motion target, and generating text information from the detection target image KTG causes that motion target to virtually perform a speech motion.

The motion of the motion target need not be a speech motion, that is, audio output. For example, the motion target may be made to perform the movement the detection target KT needs in order to approach the posture of the model image OG. The character string may also be displayed as the avatar image's utterance. Audio output (output of a message or the like) may be performed only when the posture of the detection target image KTG is evaluated as inappropriate; when the posture is evaluated as appropriate, a sound effect or staging sound may be emitted so that the detection target KT can recognize this. The text information may therefore represent the emission of a sound effect or staging sound. The speech motion may be expressed by either sound emission or display.

As described above, the posture evaluation system H2 uses the skeleton representation data string ST5 input from the skeleton information generation system H1 to evaluate the posture of the detection target KT, generates text information, and realizes motion of the avatar image according to that text information. In this way, the motion target is made to perform another motion that is not in the detection target KT's own motion, and information useful to the detection target KT is provided through that other motion. This not only increases the variety of motion of the motion target but also improves convenience for the detection target KT, who can more easily perform a more appropriate posture-required motion.

The detection target KT must attend to the audio output in order to move his or her body appropriately, which demands greater concentration. Because the motion must be performed in that state, the detection target KT can more easily gain a deeper sense of immersion. As a result, the detection target KT can be given a better impression or greater satisfaction.

The posture management system H3, on the other hand, uses the skeleton representation data string ST5 input from the skeleton information generation system H1 for motion control of a motion target that is at least one of a robot, which is a physical machine, and an avatar image displayed on a screen. The motion target thereby moves in accordance with the movement of the detection target image KTG, so the detection target KT can control the motion of the motion target through his or her own movement. The detection target KT is thus provided with an environment in which moving the motion target can be enjoyed as entertainment, or in which his or her own posture or movement can be checked. A physical machine here is one equipped with one or more power sources such as motors. A robot that is a physical machine can change its overall form using power transmitted from the power source.

Motion control of the robot is performed, for example, by having the posture management system H3 transmit, as a motion/display instruction ST7, control information for making the robot take the posture it should take next. This control information is mainly information for operating the power source. Motion control of the avatar image is performed, for example, by having the posture management system H3 transmit, as a motion/display instruction ST7, image information representing the screen to be displayed to an image/audio generation system capable of displaying a screen containing the avatar image as a moving image.

For both a robot and an avatar image serving as the motion target, the proportion of each body part to the whole usually differs from that in the detection target image KTG. For example, if the detection target image KTG is eight heads tall, the motion target is often less than eight heads tall, and the proportions normally differ. The posture management system H3 therefore performs a scale conversion process SH3 that adjusts the input skeleton information to the proportions of the motion target. The adjustment is made with reference to parameters P prepared according to the difference in proportions. By generating and transmitting the motion/display instruction ST7 using the adjusted skeleton information, the motion target moves in accordance with the motion represented by the detection target image KTG without giving an unnatural impression.
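One way to picture a scale conversion of this kind is bone-length retargeting: keep the direction of every bone from the detected skeleton, but replace its length with the corresponding length of the motion target. In the sketch below the parameters P are modelled as a table of target bone lengths; that interpretation, the joint hierarchy `PARENT`, and the joint names are assumptions for illustration, not the publication's definition of P.

```python
import math

# Hypothetical hierarchy (child -> parent). Parents are listed before their
# children so that a single pass in insertion order is enough.
PARENT = {"neck": "hip", "head": "neck", "hand_r": "neck"}


def retarget(joints, target_lengths, root="hip"):
    """Rebuild joint positions using source bone directions and target lengths."""
    out = {root: joints[root]}
    for child, parent in PARENT.items():
        dx = joints[child][0] - joints[parent][0]
        dy = joints[child][1] - joints[parent][1]
        length = math.hypot(dx, dy)
        # A zero-length source bone has no direction; collapse it onto the parent.
        scale = target_lengths[child] / length if length else 0.0
        px, py = out[parent]
        out[child] = (px + dx * scale, py + dy * scale)
    return out


# Source: long torso, small head. Target: short torso, large head, short arm.
source = {"hip": (0.0, 0.0), "neck": (0.0, 4.0), "head": (0.0, 5.0), "hand_r": (3.0, 4.0)}
P = {"neck": 2.0, "head": 2.0, "hand_r": 1.5}  # target bone lengths, keyed by child joint
target_pose = retarget(source, P)
```

Because only lengths change, the pose (which way each limb points) is preserved while the figure takes on the motion target's proportions.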

The posture management system H3 may also be provided with model data TD and may generate and output text information. For example, information on how good or bad the posture is at key points, or on the posture to be taken next, may thereby be conveyed to the detection target KT in a timely manner. When information on the posture to be taken next is conveyed, the model data TD may be used to determine whether the detection target image KTG has reached the posture for which text information should be generated.

The other motion of the motion target described above is triggered by comparison with a single posture represented by a single piece of model data TD. However, the other motion may instead be triggered by comparison with a movement that includes a posture.

FIG. 2 is a diagram illustrating an example of a movement, including a posture, for causing the motion target to perform another motion. Each circle drawn in FIG. 2 represents a different joint.

The example shown in FIG. 2 assumes a ninja throwing a shuriken, with the shuriken thrown by the left hand. The left-hand diagram in FIG. 2 shows an example of the posture taken when throwing a shuriken: the right elbow is bent, the right hand holding shuriken is positioned at waist height near the waist, and the left hand grasps a shuriken. Hereinafter, this posture is referred to for convenience as the "initial assumed posture".

The right-hand diagram in FIG. 2 shows an example of the posture after the shuriken has been thrown. It is an example of the posture reached from the initial assumed posture shown in the left-hand diagram by moving the left hand to the right as seen in the figure and throwing the shuriken. To throw the shuriken, the body has rotated counterclockwise as seen from above.

To throw a shuriken held in the left hand, the left hand must be moved relatively far, as shown in FIG. 2. By taking the initial assumed posture as the starting point and checking how the hand assumed to hold the shuriken moves from that starting point, it can therefore be determined whether the detection target KT has performed a shuriken-throwing motion. Because such a determination is possible, when it is determined that a throwing motion has been performed, text information can be generated so that, for example, the whoosh of the thrown shuriken or a staging sound is emitted. Emitting such sounds makes it possible to provide an environment with a greater sense of realism, particularly in games played by moving the body. The emission of such a sound can itself be regarded as a motion separate from the shuriken-throwing motion that was part of the detection target image KTG. Beyond determining whether the throwing motion was performed, the position at which the shuriken was released and the direction in which it was thrown can also be determined.

図２に例を示すような動きを検知可能にする場合、お手本データＴＤには、例えば次のようなデータを含めても良い。つまり、初期想定姿勢と見なすべき第１条件、その初期想定姿勢から定められた動作が行われたと見なす第２条件、及び第１条件が満たされなくなってから第２の条件が満たされるまでの時間範囲、等をデータとして含むお手本データＴＤを用意しても良い。 To make a movement such as the example in FIG. 2 detectable, the model data TD may include, for example, the following data. That is, model data TD may be prepared that includes, as data, a first condition under which a posture is regarded as the initial assumed posture, a second condition under which the prescribed motion is regarded as having been performed from that initial assumed posture, and a time range from when the first condition ceases to be satisfied until the second condition is satisfied.
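The three items of model data TD described above (first condition, second condition, and time range) can be read as a small state machine. The following Python sketch illustrates one possible reading; the names `MotionTemplate` and `MotionDetector`, the joint key `left_wrist`, and all thresholds are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical skeleton layout: joint name -> normalized (x, y) position.
Skeleton = Dict[str, Tuple[float, float]]


@dataclass
class MotionTemplate:
    """Hypothetical container for the three items of model data TD."""
    is_initial_pose: Callable[[Skeleton], bool]  # first condition
    is_motion_done: Callable[[Skeleton], bool]   # second condition
    min_seconds: float                           # lower bound of the time range
    max_seconds: float                           # upper bound of the time range


class MotionDetector:
    """Reports a motion when the second condition is met within the time
    range measured from the moment the first condition stopped holding."""

    def __init__(self, template: MotionTemplate) -> None:
        self.template = template
        self.in_initial_pose = False
        self.left_at: Optional[float] = None  # time the initial pose was left

    def update(self, skeleton: Skeleton, t: float) -> bool:
        tpl = self.template
        if tpl.is_initial_pose(skeleton):
            # First condition holds: (re)arm the detector.
            self.in_initial_pose = True
            self.left_at = None
            return False
        if self.in_initial_pose:
            # First condition has just stopped holding: start timing.
            self.in_initial_pose = False
            self.left_at = t
        if self.left_at is None:
            return False
        elapsed = t - self.left_at
        if elapsed > tpl.max_seconds:
            self.left_at = None  # too slow: give up until re-armed
            return False
        if elapsed >= tpl.min_seconds and tpl.is_motion_done(skeleton):
            self.left_at = None
            return True
        return False


# Usage: the left wrist starts on one side and ends on the other (assumed thresholds).
template = MotionTemplate(
    is_initial_pose=lambda s: s["left_wrist"][0] < 0.3,
    is_motion_done=lambda s: s["left_wrist"][0] > 0.7,
    min_seconds=0.05,
    max_seconds=1.0,
)
detector = MotionDetector(template)
frames = [(0.0, 0.2), (0.1, 0.2), (0.2, 0.5), (0.3, 0.8)]  # (time, wrist x)
results = [detector.update({"left_wrist": (x, 0.5)}, t) for t, x in frames]
```

A detector of this shape would be fed one skeleton per sampled frame; how the actual conditions are expressed in the model data TD is not stated in the text.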

ゲーム等において、このような手裏剣を投げる動作を検知可能にする場合、動作対象は、対戦相手等とするアバター画像とすることが考えられる。そのような動作対象では、手裏剣を投げる動作の確認により、手裏剣が投げられたことに対する発言を行わせるとともに、手裏剣に対応するための動作を行わせるようにしても良い。つまり、一つのテキスト情報の生成により、更に一つ以上のテキスト情報を生成するようにしても良い。 When such a shuriken-throwing motion is made detectable in a game or the like, the action target may be an avatar image serving as an opponent or the like. When the throwing motion is confirmed, such an action target may be made to utter a remark about the shuriken having been thrown and to perform a motion in response to the shuriken. In other words, the generation of one piece of text information may trigger the generation of one or more further pieces of text information.

このようなこともあり、テキスト情報を生成する検知対象画像ＫＴＧの動き、その動きにより生成するテキスト情報の内容、及び生成したテキスト情報を出力させた後の動作制御等は、様々なものが考えられる。例えば手裏剣以外の武器の使用を想定した動作に着目し、仕様を考えても良い。しかし、どのような仕様であっても、動作対象の動作による表現の幅がより広がるだけでなく、検知対象ＫＴに提供される情報量もより大きくなる。その結果、検知対象ＫＴの満足度もより高くさせることができるようになる。 For these reasons, many variations are conceivable for the movement of the detection target image KTG that triggers text information, the content of the text information generated from that movement, and the operation control performed after the generated text information is output. For example, a specification may focus on motions that assume the use of a weapon other than a shuriken. Whatever the specification, however, not only does the range of expression through the motion of the action target widen, but the amount of information provided to the detection target KT also increases. As a result, the satisfaction of the detection target KT can also be increased.

以降は、図３～図７を参照しつつ、図１に例示する本サービスを提供する具体的な実現方法について詳細に説明する。 Hereinafter, a specific implementation method for providing the service illustrated in FIG. 1 will be described in detail with reference to FIGS. 3 to 7.
図３は、本発明の情報処理装置の一実施形態に係るＡＰ（APplication）サーバが接続されたネットワーク環境の一例を説明する図である。 FIG. 3 is a diagram illustrating an example of a network environment to which an AP (application) server according to an embodiment of the information processing apparatus of the present invention is connected.

ＡＰサーバ１は、本サービスを提供するサービス提供会社ＳＫが設置の情報処理装置である。サービス提供会社ＳＫは、例えばゲーム等の娯楽を提供する企業等の組織と契約し、契約した組織への本サービスの提供を通して、その組織を利用する訪問者の満足度がより高くなるように支援する。 The AP server 1 is an information processing device installed by a service provider SK that provides this service. The service provider SK contracts with organizations such as companies that provide entertainment such as games and, by providing this service to the contracted organizations, helps them increase the satisfaction of the visitors who use them.
図３では、ＡＰサーバ１は、サービス提供会社ＳＫ内に設置されたものとして表しているが、別の場所に設置されていても良い。例えばクラウドサービスにより提供されるものであっても良い。 In FIG. 3, the AP server 1 is shown as being installed within the service provider SK, but it may be installed elsewhere. For example, it may be provided by a cloud service.

ＡＰサーバ１は、実際には、例えばプロキシ等の他の情報処理装置を介してネットワークＮと接続されている。しかし、ここでは、説明上、便宜的に、ＡＰサーバ１とネットワークＮとの間に介在する情報処理装置は無視することとする。つまり、ＡＰサーバ１は直接的にネットワークＮと接続されているものとする。ネットワークＮは、例えばインターネットを含むものである。 The AP server 1 is actually connected to the network N via another information processing device such as a proxy. However, here, for the sake of explanation and convenience, the information processing device interposed between the AP server 1 and the network N will be ignored. In other words, it is assumed that the AP server 1 is directly connected to the network N. The network N includes, for example, the Internet.

顧客企業ＫＫは、サービス提供会社ＳＫと契約した組織である。この顧客企業ＫＫは、利用者が身体を使ったゲーム等を行える場を提供する。本サービスは、その場を利用する利用者に対し、より満足できる環境の提供を可能にする。以降、顧客企業ＫＫは、契約した組織の総称として用いる。顧客企業ＫＫは、複数、存在する。 Customer company KK is an organization that has a contract with service provider SK. This customer company KK provides a place where users can play games using their bodies. This service makes it possible to provide a more satisfying environment to users who use the site. Hereinafter, customer company KK will be used as a general term for the contracted organizations. There are multiple customer companies KK.

顧客企業ＫＫには、上記カメラＣの他に、サーバ２、プロジェクター３、サウンドシステム４、及びマイクロホン（図３では「マイク」と略記。以降、この略記を用いる）５が設置されている。サーバ２を除く全ては、ゲーム等が行える一つの場に設置されている。そのため、サーバ２を除く全ては、複数、存在するのが普通である。 In addition to the camera C, the customer company KK is equipped with a server 2, a projector 3, a sound system 4, and a microphone 5 (abbreviated as "mic" in FIG. 3; the Japanese text uses this abbreviation hereinafter). All of these except the server 2 are installed in a single place where games and the like can be played. Consequently, every device other than the server 2 usually exists in multiples.

図４は、ゲーム等のための場の例を説明する図である。 FIG. 4 is a diagram illustrating an example of a place for games and the like.
ゲーム等のための場（以降「ゲーム場」と表記）は、図４に示すように、スクリーンＳＣが設置されているか、或いは設置可能な空間である。その空間には、スクリーンＳＣへの投影が可能なプロジェクター３、各種音の放音が可能なサウンドシステム４、カメラＣ、及びマイク５が設置されている。カメラＣは、スクリーンＳＣ側から利用者を撮像可能なように設置されている。 A place for games and the like (hereinafter referred to as a "game place") is a space in which a screen SC is installed or can be installed, as shown in FIG. 4. In that space, a projector 3 capable of projecting onto the screen SC, a sound system 4 capable of emitting various sounds, a camera C, and a microphone 5 are installed. The camera C is installed so that it can image the users from the screen SC side.

サーバ２は、カメラＣ、及びマイク５と接続されている。プロジェクター３、及びサウンドシステム４は、ネットワークＮを介してＡＰサーバ１と接続されている。 The server 2 is connected to the camera C and the microphone 5. The projector 3 and the sound system 4 are connected to the AP server 1 via the network N.
サウンドシステム４は、放音装置であるスピーカを含むシステムであり、ネットワークＮを介した通信が可能な端末として機能する情報処理装置も含まれる。それにより、サウンドシステム４は、ＡＰサーバ１から送信される音声信号により、スピーカから音を放音させる。 The sound system 4 is a system that includes a speaker, which is a sound emitting device, and also includes an information processing device that functions as a terminal capable of communicating via the network N. The sound system 4 thereby causes the speaker to emit sound based on the audio signal transmitted from the AP server 1.

プロジェクター３は、ネットワークＮを介した通信機能を備えた情報処理装置である、ＡＰサーバ１から送信される映像信号により、投影すべき画面を生成し、生成した画面をスクリーンＳＣ上に投影させることができる。 The projector 3 is an information processing device with a communication function via the network N. It generates a screen to be projected from the video signal transmitted from the AP server 1 and can project the generated screen onto the screen SC.
カメラＣは、ゲーム場の利用者の動画撮影に用いられる。この撮影結果は、画像データ（動画情報）として、サーバ２からＡＰサーバ１に送信される。この利用者の全て、或いは少なくとも一人は、検知対象ＫＴである。カメラＣがスクリーンＳＣ側に設置されているのは、検知対象ＫＴである利用者はスクリーンＳＣのほうを向いて、身体を動かすゲームを行うものと想定しているからである。 The camera C is used to capture video of the users of the game place. The captured result is transmitted from the server 2 to the AP server 1 as image data (video information). All of these users, or at least one of them, are detection targets KT. The camera C is installed on the screen SC side because the user who is the detection target KT is assumed to face the screen SC while playing a game that involves moving the body.

マイク５は、ゲーム場の利用者が発する音声を拾うために設置されている。マイク５から出力された音声信号は、サーバ２により音声情報に変換され、ＡＰサーバ１に送信される。 The microphone 5 is installed to pick up sounds emitted by users of the game hall. The audio signal output from the microphone 5 is converted into audio information by the server 2 and transmitted to the AP server 1.

図５は、本発明の情報処理装置の一実施形態に係るＡＰサーバ１のハードウェア構成の一例を示すブロック図である。次に図５を参照し、ＡＰサーバ１のハードウェア構成例について具体的に説明する。なお、この構成例は一例であり、ＡＰサーバ１のハードウェア構成はこれに限定されない。 FIG. 5 is a block diagram showing an example of the hardware configuration of the AP server 1 according to an embodiment of the information processing apparatus of the present invention. Next, with reference to FIG. 5, an example of the hardware configuration of the AP server 1 will be specifically described. Note that this configuration example is just an example, and the hardware configuration of the AP server 1 is not limited to this.

ＡＰサーバ１は、図５に示すように、ＣＰＵ（Central Processing Unit）１１と、ＲＯＭ（Read Only Memory）１２と、ＲＡＭ（Random Access Memory）１３と、バス１４と、入出力インターフェース１５と、出力部１６、入力部１７と、記憶部１８と、通信部１９と、及びドライブ２０と、を備えている。 As shown in FIG. 5, the AP server 1 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a bus 14, an input/output interface 15, an output unit 16, an input unit 17, a storage unit 18, a communication unit 19, and a drive 20.

ＣＰＵ１１は、ＲＯＭ１２に記録されているプログラム、及び記憶部１８からＲＡＭ１３にロードされたプログラムに従って各種の処理を実行する。記憶部１８からＲＡＭ１３にロードされるプログラムには、例えばＯＳ、及びそのＯＳ上で動作する各種アプリケーション・プログラムが含まれる。各種アプリケーション・プログラムには、本サービスの提供用に開発されたものが１つ以上、含まれる。 The CPU 11 executes various processes according to programs recorded in the ROM 12 and programs loaded into the RAM 13 from the storage unit 18 . The programs loaded from the storage unit 18 into the RAM 13 include, for example, an OS and various application programs that run on the OS. The various application programs include one or more programs developed for providing this service.

ＲＡＭ１３には、ＣＰＵ１１が各種の処理を実行する上において必要なデータ等も適宜記憶される。そのデータには、ＣＰＵ１１が実行する各種プログラムも含まれる。 The RAM 13 also stores, as appropriate, data and the like that the CPU 11 needs to execute various processes. This data includes the various programs executed by the CPU 11.
ＣＰＵ１１、ＲＯＭ１２及びＲＡＭ１３は、バス１４を介して相互に接続されている。このバス１４にはまた、入出力インターフェース１５も接続されている。入出力インターフェース１５には、出力部１６、入力部１７、記憶部１８、通信部１９、及びドライブ２０が接続されている。 The CPU 11, the ROM 12, and the RAM 13 are interconnected via the bus 14. The input/output interface 15 is also connected to this bus 14. The output unit 16, the input unit 17, the storage unit 18, the communication unit 19, and the drive 20 are connected to the input/output interface 15.

出力部１６は、例えば液晶等のディスプレイを含む構成である。出力部１６は、ＣＰＵ１１の制御により、各種画像、或いは各種画面を表示する。出力部１６は、ＡＰサーバ１に搭載されたものであっても良いが、必要に応じて接続されるものであっても良い。つまり、出力部１６は、必須の構成要素ではない。 The output unit 16 has a configuration including, for example, a display such as a liquid crystal display. The output unit 16 displays various images or various screens under the control of the CPU 11. The output unit 16 may be installed in the AP server 1, or may be connected as necessary. In other words, the output unit 16 is not an essential component.

入力部１７は、例えばキーボード等の各種ハードウェア釦等を含む構成のものである。その構成には、マウス等のポインティングデバイスが１つ以上、含まれていても良い。操作者は、入力部１７を介して各種情報を入力することができる。この入力部１７も、ＡＰサーバ１に搭載されたものであっても良いが、必要に応じて接続されるものであっても良い。つまり、入力部１７も、必須の構成要素ではない。 The input unit 17 includes various hardware buttons such as a keyboard. The configuration may include one or more pointing devices, such as a mouse. The operator can input various information via the input section 17. This input unit 17 may also be installed in the AP server 1, or may be connected as necessary. In other words, the input section 17 is also not an essential component.

記憶部１８は、例えばハードディスク装置、或いはＳＳＤ（Solid State Drive）等の補助記憶装置である。データ量の大きいデータは、この記憶部１８に記憶される。 The storage unit 18 is an auxiliary storage device such as a hard disk drive or an SSD (Solid State Drive). Large volumes of data are stored in this storage unit 18.
通信部１９は、ネットワークＮを介した他の情報処理装置との間の通信を可能にする。図３に示すサーバ２、プロジェクター３、及びサウンドシステムを構成する情報処理装置は全て、他の情報処理装置に相当する。 The communication unit 19 enables communication with other information processing devices via the network N. The server 2, the projector 3, and the information processing device constituting the sound system shown in FIG. 3 all correspond to such other information processing devices.

ドライブ２０は、磁気ディスク、光ディスク、光磁気ディスク、或いは半導体メモリカード等のリムーバブルメディア２５が着脱可能な装置である。ドライブ２０は、例えば装着されたリムーバブルメディア２５からの情報の読み取り、及びリムーバブルメディア２５への情報の書き込みが可能である。それにより、リムーバブルメディア２５に記録されたプログラムは、ドライブ２０を介して、記憶部１８に記憶させることができる。また、ドライブ２０に装着されたリムーバブルメディア２５は、記憶部１８に記憶されている各種データのコピー先、或いは移動先として用いることができる。 The drive 20 is a device into which a removable medium 25 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory card can be attached and detached. The drive 20 is capable of reading information from and writing information to the removable medium 25 mounted thereon, for example. Thereby, the program recorded on the removable medium 25 can be stored in the storage unit 18 via the drive 20. Furthermore, the removable medium 25 attached to the drive 20 can be used as a copy destination or a movement destination of various data stored in the storage unit 18.

本サービス用に開発されたアプリケーション・プログラムは、リムーバブルメディア２５に記録させて配布しても良い。ネットワークＮ等を介して配布可能にしても良い。このことから、アプリケーション・プログラムを記録した記録媒体としては、ネットワークＮに直接的、或いは間接的に接続された情報処理装置に搭載、若しくは装着されたもの、或いは外部のアクセス可能な装置に搭載、若しくは装着されたものであっても良い。 Application programs developed for this service may be recorded on the removable media 25 and distributed. It may also be possible to distribute it via network N or the like. For this reason, the recording medium that records the application program may be one that is installed or installed in an information processing device that is directly or indirectly connected to the network N, or one that is installed in an externally accessible device. Alternatively, it may be attached.

ＡＰサーバ１が備えるハードウェア資源は、アプリケーション・プログラムを含む各種プログラムによって制御される。その結果、ＡＰサーバ１は、顧客企業ＫＫに対し、本サービスを提供することができる。 The hardware resources included in the AP server 1 are controlled by various programs including application programs. As a result, the AP server 1 can provide this service to the customer company KK.

図６は、本発明の情報処理装置の一実施形態に係るＡＰサーバ１上に実現される機能的構成の一例を示す機能ブロック図である。次に図６を参照しつつ、ＡＰサーバ１上に実現される機能的構成の例について詳細に説明する。 FIG. 6 is a functional block diagram showing an example of a functional configuration implemented on the AP server 1 according to an embodiment of the information processing apparatus of the present invention. Next, an example of the functional configuration implemented on the AP server 1 will be described in detail with reference to FIG. 6.
ここでは、混乱を避けるために、ゲーム場は一つのみと想定する。また、ゲーム場で行われるのはゲームであると想定する。動作対象であるアバター画像、及びロボットの何れも、検知対象の画像（以降、検知対象画像ＫＴＧとも表記する。）から生成される骨格情報を用いて動作させるものと想定する。 Here, to avoid confusion, it is assumed that there is only one game place and that what is played there is a game. It is also assumed that both the avatar image and the robot, which are the action targets, are operated using skeleton information generated from the image of the detection target (hereinafter also referred to as the detection target image KTG).

ＡＰサーバ１のＣＰＵ１１上には、機能的構成として、図６に示すように、設定部１１１、画像抽出部１１２、骨格検知部１１３、姿勢評価部１１４、テキスト変換部１１５、動画生成部１１６、姿勢制御部１１７、音声認識部１１８、会話処理部１１９、画像認識部１２０、及び顔認識部１２１が実現される。 On the CPU 11 of the AP server 1, as shown in FIG. 6, the functional configuration includes a setting section 111, an image extraction section 112, a skeleton detection section 113, a posture evaluation section 114, a text conversion section 115, a video generation section 116, A posture control section 117, a voice recognition section 118, a conversation processing section 119, an image recognition section 120, and a face recognition section 121 are realized.

これらは、本サービスの提供用に開発されたアプリケーション・プログラムを含む各種プログラムをＣＰＵ１１が実行することにより実現される。その結果として、或いは本サービスの提供のために、記憶部１８には、基準画像格納部１８１、キャラクタ画像格納部１８２、背景画像格納部１８３、辞書格納部１８４、及び制御情報格納部１８５が確保される。 These are realized by the CPU 11 executing various programs, including the application programs developed for providing this service. As a result, or in order to provide this service, a reference image storage unit 181, a character image storage unit 182, a background image storage unit 183, a dictionary storage unit 184, and a control information storage unit 185 are secured in the storage unit 18.

設定部１１１は、ゲーム場で行われるゲームの設定、設定されたゲームの開始、或いは終了、設定されたゲームを行ううえでの各種設定、等のための機能である。これら設定のための各種要求等は、例えばサーバ２から送信される。 The setting unit 111 is a function for setting a game to be played at the game venue, starting or ending the set game, various settings for playing the set game, and the like. Various requests for these settings are transmitted from the server 2, for example.

図３、及び図４では、このような設定を可能にする装置は省略している。この装置は、サーバ２と接続されており、サーバ２は、この装置からの要求を処理して、ＡＰサーバ１に送信すべき要求を生成し送信する。 In FIGS. 3 and 4, the device that enables such settings is omitted. This device is connected to the server 2, and the server 2 processes requests from this device and generates and transmits the requests to be sent to the AP server 1.
サーバ２は、例えばゲームが開始される前からそのゲームが終了するまでの間、カメラＣによる動画撮影を行わせ、その動画撮影により得られる画像データ（動画情報）をＡＰサーバ１に送信し続ける。その間、マイク５からの音声信号を変換して得られる音声情報もサーバ２からＡＰサーバ１に送信される。 For example, from before a game starts until the game ends, the server 2 causes the camera C to capture video and continuously transmits the resulting image data (video information) to the AP server 1. During this period, audio information obtained by converting the audio signal from the microphone 5 is also transmitted from the server 2 to the AP server 1.

画像抽出部１１２は、サーバ２から受信した画像データが表す画像中の検知対象画像ＫＴＧの範囲を特定し、特定した範囲を抽出する。この範囲の抽出は、例えば画像データから、定めた時間間隔毎に静止画像データを生成することにより、生成した静止画像データを対象に行われる。このようにして抽出された画像データは、以降「検知対象画像データ」と表記する。 The image extraction unit 112 specifies the range of the detection target image KTG in the image represented by the image data received from the server 2, and extracts the specified range. Extraction of this range is performed, for example, by generating still image data from image data at predetermined time intervals, and using the generated still image data. The image data extracted in this manner will hereinafter be referred to as "detection target image data."
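The sampling and cropping performed by the image extraction unit 112 might look like the following Python sketch. It assumes, purely for illustration, that a frame is a nested list of pixel rows and that the range of the detection target image KTG is already known as a bounding box; the function names `sample_frame_indices` and `crop` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), hypothetical layout


def sample_frame_indices(frame_count: int, fps: float, interval_seconds: float) -> List[int]:
    """Indices of the frames to turn into still images, one per time interval."""
    if fps <= 0 or interval_seconds <= 0:
        raise ValueError("fps and interval_seconds must be positive")
    step = max(1, round(fps * interval_seconds))
    return list(range(0, frame_count, step))


def crop(frame: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Extract the specified range (the detection target image data) from a still image."""
    top, left, bottom, right = box
    return [list(row[left:right]) for row in frame[top:bottom]]


# Usage: 10 frames at 30 fps sampled every 0.1 s, then a 2x2 crop of a 4x4 frame.
indices = sample_frame_indices(10, 30.0, 0.1)
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
patch = crop(frame, (1, 1, 3, 3))
```

How the range itself is located (for example, by person detection) is outside this sketch.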

骨格検知部１１３は、例えば検知対象画像データ毎に、それが表す検知対象画像ＫＴＧから骨格情報を生成する（例えば、特許文献１参照）。 The skeleton detection unit 113 generates, for example for each item of detection target image data, skeleton information from the detection target image KTG that it represents (see, for example, Patent Document 1).
姿勢評価部１１４は、骨格検知部１１３が生成する骨格情報を用いて、検知対象画像ＫＴＧの姿勢、或いは姿勢を含む動きを評価する。この評価は、例えば図１におけるお手本との差分ＡＩ分析ＳＨ２によるものと基本的に同じものか、或いは図２に示すような初期想定姿勢、及びその後の動きによるものである。この評価を行うために、お手本データＴＤに相当する基準画像情報が基準画像格納部１８１に格納されている。それにより、評価には、対応する基準画像情報も用いられる。 The posture evaluation unit 114 uses the skeleton information generated by the skeleton detection unit 113 to evaluate the posture of the detection target image KTG, or a movement that includes the posture. This evaluation is, for example, basically the same as that performed by the AI analysis SH2 of the difference from the model in FIG. 1, or is based on the initial assumed posture and the subsequent movement as shown in FIG. 2. To perform this evaluation, reference image information corresponding to the model data TD is stored in the reference image storage unit 181. The corresponding reference image information is therefore also used in the evaluation.
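As a rough illustration of evaluating a posture against reference information, the following Python sketch computes per-joint distances between a reference skeleton and an observed one. This is a plain Euclidean comparison substituted for illustration, not the difference AI analysis SH2 itself, whose internals the text does not describe; the joint names and the tolerance are assumptions.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical skeleton layout: joint name -> (x, y) position.
Skeleton = Dict[str, Tuple[float, float]]


def joint_differences(reference: Skeleton, observed: Skeleton) -> Dict[str, float]:
    """Distance between each reference joint and the matching observed joint."""
    diffs = {}
    for joint, (rx, ry) in reference.items():
        if joint in observed:
            ox, oy = observed[joint]
            diffs[joint] = math.hypot(ox - rx, oy - ry)
    return diffs


def worst_joint(diffs: Dict[str, float], tolerance: float) -> Optional[str]:
    """Joint that deviates most from the reference, or None if all are within tolerance."""
    if not diffs:
        return None
    joint = max(diffs, key=diffs.get)
    return joint if diffs[joint] > tolerance else None


# Usage with made-up coordinates.
reference = {"left_wrist": (0.0, 0.0), "right_wrist": (1.0, 0.0)}
observed = {"left_wrist": (3.0, 4.0), "right_wrist": (1.0, 0.0)}
diffs = joint_differences(reference, observed)
```

The joint reported by `worst_joint` could then serve as an input to text generation.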

テキスト変換部１１５は、姿勢評価部１１４による評価結果を変換する形でテキスト情報を生成する。生成されるテキスト情報は、図１、及び図２を参照しての説明の通り、評価に実際に用いられた基準画像情報、及びその基準画像情報が表す動作内容と検知対象画像ＫＴＧが表す動作内容との間の相違に依存する。辞書格納部１８４には、想定するゲームに合わせ、テキスト情報等の生成のための辞書が格納されている。それにより、例えば辞書に登録された文章、或いは文節等の文字列か、或いは２つ以上の文字列の組み合わせが、テキスト情報として生成される。 The text conversion unit 115 generates text information by converting the evaluation result of the posture evaluation unit 114. As explained with reference to FIGS. 1 and 2, the generated text information depends on the reference image information actually used for the evaluation and on the difference between the motion represented by that reference image information and the motion represented by the detection target image KTG. The dictionary storage unit 184 stores dictionaries for generating text information and the like, tailored to the assumed game. Accordingly, for example, a character string registered in a dictionary, such as a sentence or a phrase, or a combination of two or more such character strings, is generated as text information.
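The dictionary-based generation described above can be sketched as a simple lookup that joins registered phrases. The finding labels, the phrases, and the name `to_text` below are all hypothetical; the actual dictionary format held in the dictionary storage unit 184 is not specified.

```python
from typing import Dict, List, Tuple

Finding = Tuple[str, str]  # (joint name, kind of deviation), hypothetical labels

# Hypothetical dictionary for one game: evaluation findings -> registered phrases.
DICTIONARY: Dict[Finding, str] = {
    ("left_wrist", "too_low"): "Raise your left hand higher.",
    ("right_knee", "too_straight"): "Bend your right knee more.",
}


def to_text(findings: List[Finding], dictionary: Dict[Finding, str]) -> str:
    """Combine the phrases registered for each finding into one piece of text information."""
    phrases = [dictionary[f] for f in findings if f in dictionary]
    return " ".join(phrases)


# Usage: two findings produce a combination of two registered character strings.
message = to_text([("left_wrist", "too_low"), ("right_knee", "too_straight")], DICTIONARY)
```

Findings without a registered phrase are skipped in this sketch, so an unknown finding yields an empty string.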

動画生成部１１６は、プロジェクター３に投影させる画像（画面）を生成し、生成した画像を画像データとして、通信部１９に送信させる。 The video generation unit 116 generates the image (screen) to be projected by the projector 3 and causes the communication unit 19 to transmit the generated image as image data.
動画生成部１１６は、検知対象ＫＴに聞かせることを想定する音楽、効果音、演出音、及び音声の放音のための情報も必要に応じて生成し、生成した情報を通信部１９に送信させる。それにより、各種音が放音されるなかで、検知対象ＫＴがゲーム等を行える環境を提供する。以降、各種音を放音させる情報は「音声情報」と総称する。 The video generation unit 116 also generates, as necessary, information for emitting music, sound effects, staging sounds, and voices intended to be heard by the detection target KT, and causes the communication unit 19 to transmit the generated information. This provides an environment in which the detection target KT can play games and the like while various sounds are being emitted. Hereinafter, information for emitting these various sounds is collectively referred to as "audio information."

キャラクタ画像格納部１８２には、アバター画像を含む各種キャラクタ画像の生成のための各種画像情報が格納されている。その画像情報は、例えばアバター画像を構成するパーツ毎の３次元表現の画像情報である。背景画像格納部１８３には、各種キャラクタ画像以外の画像の生成のための画像情報が背景画像情報として格納されている。それにより、動画生成部１１６は、背景画像格納部１８３に格納されている各種画像情報から背景画像を生成するとともに、骨格情報、及びキャラクタ画像格納部１８２に格納されている各種画像情報を用いて各アバター画像を生成する。動画生成部１１６は、これらを生成した後、背景画像上に各アバター画像を配置する形で、１画面分の画像を生成する。 The character image storage unit 182 stores various image information for generating various character images including avatar images. The image information is, for example, image information of a three-dimensional representation of each part constituting the avatar image. The background image storage unit 183 stores image information for generating images other than various character images as background image information. Thereby, the video generation unit 116 generates a background image from various image information stored in the background image storage unit 183, and also generates a background image using the skeleton information and various image information stored in the character image storage unit 182. Generate each avatar image. After generating these, the moving image generation unit 116 generates one screen worth of images by arranging each avatar image on the background image.

姿勢制御部１１７は、骨格情報を用いてロボットを動作させるための姿勢制御情報を生成する。この姿勢制御部１１７により、アバター画像の代わりに、或いはアバター画像とともに、ロボットを検知対象画像ＫＴＧの動きに合わせて動作させることが可能である。姿勢制御情報の生成は、制御情報格納部１８５に格納された別の制御情報を参照して行われる。この制御情報は、姿勢制御情報の生成のためのものであり、図１に示すパラメータＰは制御情報の一つとして制御情報格納部１８５に格納されている。 The posture control unit 117 uses the skeleton information to generate posture control information for operating the robot. The posture control unit 117 allows the robot to move in accordance with the movement of the detection target image KTG instead of or together with the avatar image. The attitude control information is generated by referring to other control information stored in the control information storage section 185. This control information is for generating attitude control information, and the parameter P shown in FIG. 1 is stored in the control information storage section 185 as one of the control information.

音声認識部１１８は、例えばサーバ２から受信した音声情報を用いた自然言語処理により音声認識を行う。この音声認識により、音声認識部１１８は、検知対象ＫＴの発言内容を表すテキスト情報である文字列を生成する。文字列の生成は、パターンマッチングにより、音声情報が表す各文字を特定することで行われる。この文字列が本実施形態における第１の文字列に相当する。 The voice recognition unit 118 performs voice recognition by natural language processing using, for example, the audio information received from the server 2. Through this voice recognition, the voice recognition unit 118 generates a character string, which is text information representing what the detection target KT said. The character string is generated by identifying, through pattern matching, each character represented by the audio information. This character string corresponds to the first character string in this embodiment.
会話処理部１１９は、生成された文字列を検知対象ＫＴの発言内容と見なし、その発言内容に対して返すべき文字列を応答文として生成する。そのために、会話処理部１１９は、生成された文字列に対し、形態素解析、構文解析、意味解析、文脈解析、及び照応解析を順次、行い、その文字列が表す内容を特定する。会話処理部１１９は、この特定結果を用いて、例えばその特定結果に対応付けられた文字列を応答文として生成する。応答文として生成される文字列は本実施形態における第２の文字列に相当する。 The conversation processing unit 119 regards the generated character string as the utterance of the detection target KT and generates, as a response sentence, a character string to be returned in reply to that utterance. To this end, the conversation processing unit 119 sequentially performs morphological analysis, syntactic analysis, semantic analysis, context analysis, and anaphora analysis on the generated character string to identify what the character string expresses. Using this identification result, the conversation processing unit 119 generates, for example, a character string associated with the identification result as the response sentence. The character string generated as the response sentence corresponds to the second character string in this embodiment.

会話処理部１１９によって生成される応答文、及びテキスト変換部１１５によって生成されるテキスト情報を放音させるための音声情報は、例えば、動画生成部１１６により生成され、通信部１９から送信される。それにより、検知対象ＫＴは、プロジェクター３によりスクリーンＳＣに投影された画像（動画）を見るだけでなく、サウンドシステム４により放音される会話等のための音声を聞くことができる。 The audio information for voicing the response sentence generated by the conversation processing unit 119 and the text information generated by the text conversion unit 115 is, for example, generated by the video generation unit 116 and transmitted from the communication unit 19. The detection target KT can thereby not only see the image (video) projected on the screen SC by the projector 3 but also hear the voices for conversation and the like emitted by the sound system 4.
放音される音には、音声の他に、音楽、効果音、及び演出音等も含まれる。ここでは、特に断らない限り、音声情報は、それらの音、及び音声を含む各種音を放音させる情報の総称として用いる。 The emitted sounds include, in addition to voices, music, sound effects, staging sounds, and the like. Here, unless otherwise specified, "audio information" is used as a general term for information for emitting these sounds and the various sounds that include voices.
動作対象がロボットであった場合、音声情報の生成は、姿勢制御部１１７によって行われる。それにより、ロボットを動作させる場合であっても、検知対象ＫＴは、会話等の音声を聞くことができる。 If the action target is a robot, the audio information is generated by the posture control unit 117. Thus, even when a robot is operated, the detection target KT can hear voices such as conversation.

音声認識部１１８、及び会話処理部１１９は、検知対象ＫＴと、スクリーンＳＣ上に画像が投影されたキャラクタとの間の会話を可能とさせる。それにより、検知対象ＫＴは、キャラクタとの会話をしながら、ゲームを進めることができる。その会話により、検知対象ＫＴは、例えば考えていなかった動作、或いはゲームの進め方等の選択も可能になる。このことから、検知対象ＫＴにとっては、ゲームへの没入感をより高くさせられるようにするだけでなく、楽しみ方の幅も広がって、より高い満足度も得られるようになる。検知対象ＫＴが考えていなかった動作とは、自身の発言による他のアバター画像、或いはキャラクタ画像の出現（及び出現後の動作）、自身のアバター画像の発言内容に応じた動作、或いは変化、背景画像の変化、等を挙げることができる。 The voice recognition unit 118 and the conversation processing unit 119 enable a conversation between the detection target KT and a character whose image is projected on the screen SC. The detection target KT can thereby proceed with the game while conversing with the character. Through the conversation, the detection target KT can also choose, for example, an action or a way of proceeding with the game that had not occurred to them. Consequently, for the detection target KT, not only is the sense of immersion in the game heightened, but the range of ways to enjoy the game also widens, yielding greater satisfaction. Examples of actions that the detection target KT had not thought of include the appearance of another avatar image or character image triggered by the detection target's own utterance (and its behavior after appearing), an action or change of the detection target's own avatar image in response to the utterance, and a change in the background image.
なお、音声認識部１１８、及び会話処理部１１９にはともに、周知の技術が採用されている。 Note that both the voice recognition unit 118 and the conversation processing unit 119 employ well-known techniques.

画像認識部１２０は、検知対象画像ＫＴＧ内に含まれる、或いは近傍に画像として存在する物体を認識する。この物体は、検知対象ＫＴとは別と見なすべきものである。その物体は、主に、検知対象ＫＴが身につけている物か、或いは手にしている物である。この物体画像を認識した後、物体画像の位置、姿勢、或いは大きさ等の変化に着目することにより、画像認識部１２０は、物体画像の動作内容を特定する。動作内容の特定は、周知の技術を画像認識部１２０に採用しても行うことができる。 The image recognition unit 120 recognizes an object that is included in the detection target image KTG or that exists as an image in its vicinity. This object is to be regarded as separate from the detection target KT. It is mainly something that the detection target KT is wearing or holding. After recognizing the object image, the image recognition unit 120 identifies the motion of the object image by focusing on changes in its position, orientation, size, and the like. The motion can also be identified by employing a well-known technique in the image recognition unit 120.
その動作内容の特定により、例えば検知対象ＫＴの物体の扱いに応じて、物体画像を変化させることができる。具体的には、物体が剣と見なす剣状物体であった場合、例えば検知対象ＫＴとの位置関係に応じて、物体画像の表現を変更する、剣状物体を振る早さに応じて、物体画像に演出を加える、等のことを行っても良い。 Identifying the motion makes it possible to change the object image according to, for example, how the detection target KT handles the object. Specifically, if the object is a sword-shaped object regarded as a sword, the rendering of the object image may be changed according to its positional relationship with the detection target KT, or an effect may be added to the object image according to how fast the sword-shaped object is swung.

顔認識部１２１は、検知対象画像ＫＴＧから顔の部分を抽出し、検知対象ＫＴの性別、及び年齢等の認識を行う。この顔認識部１２１にも周知の技術を採用することができる。 The face recognition unit 121 extracts the face portion from the detection target image KTG and recognizes the gender, age, and the like of the detection target KT. A well-known technique can also be employed for this face recognition unit 121.
なお、物体としては、加速度センサ、及び通信装置等を搭載させた専用のものであっても良い。そのようなセンサの検知結果を画像認識部１２０による画像認識に用いる場合、物体画像の動作制御をより高精度に行えるようになる。通信装置から物体の種別等を表す情報を例えばサーバ２を介して送信させることにより、ＡＰサーバ１側は検知対象ＫＴが持っている、或いは身につけている物体の種類を認識することができる。 Note that the object may be a dedicated one equipped with an acceleration sensor, a communication device, and the like. When the detection results of such a sensor are used for image recognition by the image recognition unit 120, the motion of the object image can be controlled with higher precision. By having the communication device transmit information indicating the type of the object and the like, for example via the server 2, the AP server 1 can recognize the type of object that the detection target KT is holding or wearing.

画像認識部１２０による物体の認識結果は、物体を持つアバター画像の生成に用いることができる。それにより、例えば検知対象ＫＴが刀形状の物体を手に持っていた場合、刀を持ったアバター画像を表示させるようにしても良い。或いは検知対象ＫＴが投げた物体があった場合には、投げられた物体が動く様子を動画で表現するようにしても良い。物体を表す画像は、例えばキャラクタ画像格納部１８２に格納し、物体の認識結果、或いは骨格情報により特定する検知対象画像ＫＴＧの動作内容等から選択させるようにすれば良い。この選択は、例えば画像認識部１２０に行わせることが考えられる。検知対象ＫＴが物体を持っているか否かの判定は、例えば検知対象画像ＫＴＧと物体画像とが重なっていると見なす箇所が存在し、且つそれらの大きさが変化する度合いに設定以上の変化が生じているか否かにより行うことができる。 The object recognition result from the image recognition unit 120 can be used to generate an avatar image holding the object. For example, if the detection target KT is holding a sword-shaped object, an avatar image holding a sword may be displayed. Alternatively, if the detection target KT has thrown an object, the movement of the thrown object may be rendered as video. The image representing the object may be stored, for example, in the character image storage unit 182 and selected based on the object recognition result or on the motion of the detection target image KTG identified from the skeleton information. This selection may be performed, for example, by the image recognition unit 120. Whether the detection target KT is holding an object can be determined, for example, by whether there is a region where the detection target image KTG and the object image are regarded as overlapping and whether the degree to which their sizes change has changed by at least a set amount.
例えば図２に示すように手裏剣を投げる場合、左手を動かす方向から、手裏剣を投げる方向、更には的の位置等も推定することができる。このような推定結果を手裏剣の動きに反映させても良い。 For example, when a shuriken is thrown as shown in FIG. 2, the direction in which the shuriken is thrown, and even the position of the target, can be estimated from the direction in which the left hand moves. Such estimation results may be reflected in the movement of the shuriken.
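Estimating the throwing direction from the movement of the hand can be sketched as normalizing the displacement between the first and last tracked hand positions. The Python function below is a minimal illustration under that assumption; the name `throw_direction` and the 2D coordinates are hypothetical, and the text does not state how the estimate is actually computed.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def throw_direction(hand_positions: Sequence[Point]) -> Optional[Point]:
    """Unit vector from the first to the last tracked hand position, or None if the hand did not move."""
    if len(hand_positions) < 2:
        return None
    (x0, y0), (x1, y1) = hand_positions[0], hand_positions[-1]
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0:
        return None
    return (dx / length, dy / length)


# Usage: left-hand positions taken from successive skeletons (made-up values).
direction = throw_direction([(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)])
```

A thrown object could then be animated along `direction`, and a target position could be guessed by extending that vector, although both steps are left open here.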

このような画面制御は、会話処理部１１９が生成した応答文に応じた処理を動画生成部１１６に行わせることで実現できる。ロボットの姿勢制御は、その応答文に応じた処理を姿勢制御部１１７に行わせることで実現できる。応答文に応じて動作させるものは、基本的に、応答文を発言させる想定のアバター画像、或いはロボットとなる。 Such screen control can be realized by causing the video generation unit 116 to perform processing according to the response sentence generated by the conversation processing unit 119. The posture control of the robot can be realized by causing the posture control unit 117 to perform processing according to the response sentence. What is operated in response to the response text is basically an avatar image or a robot that is supposed to make the response statement.

The face recognition result from the face recognition unit 121 may be used to determine the type of avatar to be displayed, including its gender, as well as its facial expression and the like. This makes it possible to select an avatar that is more desirable for the detection target KT and to change the avatar's facial expression according to the facial expression of the detection target KT.
All of the controls described above serve to deepen the detection target KT's sense of immersion in the game and to give the detection target KT greater satisfaction.

In the functional configuration described above, the skeleton detection unit 113 corresponds to the skeleton information generation means in this embodiment. Similarly, the posture evaluation unit 114 corresponds to the analysis means, the text conversion unit 115 to the text information generation means, the video generation unit 116 and the posture control unit 117 to the motion control means, the voice recognition unit 118 to the voice recognition means, and the conversation processing unit 119 to the conversation processing means. The CPU 11 itself corresponds to the image data acquisition means and the audio information acquisition means.

FIG. 7 is a flowchart illustrating an example of the video display process executed by the CPU mounted in the AP server 1, which is the information processing device according to this embodiment. The video display process is executed, for example, when the server 2 or the like requests the start of a game, in order to let the detection target KT play that game. The process is realized by the CPU 11 executing an application program developed for providing this service. Finally, the video display process is described in detail with reference to FIG. 7.

To avoid confusion, the video display process shown in the flowchart of FIG. 7 assumes a single game field such as the one shown in FIG. 4, and the description proceeds on that assumption. Accordingly, the description assumes that video information (image data) captured by the camera C and audio information from the microphone 5 are transmitted from the server 2 to the AP server 1. The avatar image is assumed to move along with the movement of the detection target KT. The CPU 11 is assumed to be the entity that executes the processing.

First, in step S1, the CPU 11 determines whether video information has been received from the server 2. If video information has been received, the determination in step S1 is YES and the process proceeds to step S2. If not, the determination in step S1 is NO and the process proceeds to step S6.
In step S2, the CPU 11 generates, for example, a still image from the received video information and extracts the detection target image KTG from that still image. In the following step S3, the CPU 11 executes skeleton detection processing to generate skeleton information from the extracted detection target image KTG. In step S4, the CPU 11 then evaluates the posture of the detection target image KTG from the generated skeleton information. Previously generated skeleton information is also used for this posture evaluation as necessary.

In step S5, which follows step S4, the CPU 11 converts the posture evaluation result into text by generating text information according to that result. After this conversion, the process proceeds to step S6.
In step S6, the CPU 11 determines whether audio information has been received from the server 2. If audio information has been received, the determination in step S6 is YES and the process proceeds to step S7. If not, the determination in step S6 is NO and the process proceeds to step S10.

In step S7, the CPU 11 executes voice recognition processing using the received audio information together with audio information received in the past. This voice recognition processing includes not only generating a character string from the audio information but also identifying the content of that character string. In the following step S8, the CPU 11 determines whether the content of one complete utterance by the detection target KT has been identified. If the content of one utterance, that is, the content from the start of the utterance to its end, has been identified, the determination in step S8 is YES and the process proceeds to step S9. If not, the determination in step S8 is NO and the process proceeds to step S10.
The end of an utterance is determined, for example, when a state in which no voice from the detection target KT can be detected continues for a set time. Accordingly, when the determination in step S8 is YES, the content most recently identified in step S7 is taken as the content uttered by the detection target KT.
In step S9, the CPU 11 (conversation processing unit 119) generates, from the identified utterance content, a character string to be returned to the detection target KT as a response sentence. After the response sentence is generated, the process proceeds to step S10.
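The end-of-utterance determination used in steps S7 and S8 can be sketched in Python as follows. The class name `UtteranceTracker`, its methods, and the use of timestamps in seconds are hypothetical and chosen only for illustration; the source states only that the end is determined when no voice is detected for a set time.

```python
class UtteranceTracker:
    """Tracks recognized speech and reports an utterance once silence lasts long enough."""

    def __init__(self, silence_timeout):
        self.silence_timeout = silence_timeout  # the "set time", in seconds (assumed unit)
        self.latest_text = None
        self.last_voice_time = None

    def on_recognized(self, text, now):
        """Step S7: keep the most recently identified content and when it was heard."""
        self.latest_text = text
        self.last_voice_time = now

    def poll(self, now):
        """Step S8: return the finished utterance, or None if none is complete yet."""
        if self.latest_text is None:
            return None
        if now - self.last_voice_time < self.silence_timeout:
            return None  # the speaker may still be talking
        finished, self.latest_text = self.latest_text, None
        return finished
```

When `poll` returns a string, that string would be handed to the response generation of step S9; when it returns `None`, the flow would continue to step S10.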

In step S10, the CPU 11 determines whether the game has ended. If an end event has occurred, such as an instruction to end the game from the detection target KT or the like, or an in-game event that should end the game, the determination in step S10 is YES and the process proceeds to step S11. If no end event has occurred, the determination in step S10 is NO and the process proceeds to step S12.

In step S11, the CPU 11 transmits to the server 2 an end screen for notifying the detection target KT that the game has ended. After the end screen is transmitted, the video display process ends.
In step S12, the CPU 11 executes other processing necessary for the progress of the game.
Some games present the user with options, such as items, partway through the game and have the user select one of the presented options. The presentation of such options and the selection from among them are realized by the other processing executed in step S12.

In the following step S13, the CPU 11 generates video information and audio information and has them transmitted to the projector 3 and the sound system 4, respectively. The process then returns to step S1.
If an avatar image is present on the screen represented by the video information, that avatar image has been generated using the skeleton information produced by the skeleton detection processing in step S3. The image for one screen is generated by producing each character image and a background image in addition to each avatar image, and placing the avatar images and character images on the background image. The text information generated in step S5 or the response sentence generated in step S9 is reflected, as necessary, in the generation of any of the avatar images or character images. The motion of an avatar image can thus be changed by at least one of the generated text information and the response sentence. The video information generated in step S13, where this screen generation takes place, is, for example, information obtained by compressing the generated screens.
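The branching of steps S1 to S13 described above can be traced with a short Python sketch. The function `trace_pass` and its boolean parameters are hypothetical names introduced only for illustration; the sketch records which steps one pass of the flow visits and performs no actual video, audio, or game processing.

```python
def trace_pass(video_received, audio_received, utterance_complete, end_event):
    """Return the step labels visited in one pass of the FIG. 7 flow."""
    steps = ["S1"]
    if video_received:                 # S1: YES
        steps += ["S2", "S3", "S4", "S5"]
    steps.append("S6")                 # reached from S5, or directly from S1 on NO
    if audio_received:                 # S6: YES
        steps += ["S7", "S8"]
        if utterance_complete:         # S8: YES
            steps.append("S9")
    steps.append("S10")
    if end_event:                      # S10: YES
        steps.append("S11")            # end screen is sent and the process ends
    else:
        steps += ["S12", "S13"]        # afterwards the flow returns to S1
    return steps
```

A pass with neither video nor audio input visits only S1, S6, S10, S12, and S13, which shows that screen generation in step S13 still runs even when no new skeleton information or response sentence has been produced.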

If text information is generated in step S5 or a response sentence is generated in step S9, the audio information described above is generated in step S13.
The audio information to be transmitted is, for example, a digital audio signal. When a digital audio signal is transmitted as the audio information, that signal is generated, for example, by an application that performs text-to-speech conversion (hereinafter "speech conversion software"). In that case, step S13 involves calling the speech conversion software with the text information or response sentence to be converted and acquiring the resulting audio signal. The acquired audio signal continues to be sent as audio information until no audio signal remains to be transmitted. If the sound system 4 emits sound via a server 2 equipped with the speech conversion software, the text information and the response sentence themselves can be transmitted as the audio information. For this reason, the form of the audio information is not particularly limited.

If the motion target is a robot, a posture control process is executed instead of the video display process. The flow of the posture control process is basically the same as that of the video display process described above; the difference is that posture control information is generated instead of video information. A detailed description of the posture control process is therefore omitted.

This embodiment has been described with a focus on games played by moving the body. However, as noted above, the present invention is widely applicable to any activity in which the body performs a posture-requiring motion, that is, a motion for which one or more postures to be taken are known in advance.
For example, sign language, which is performed by moving the hands and fingers, expresses the content to be conveyed through defined postures, defined movements from those postures, and the like. The skeleton information may therefore be used to convert the content expressed in sign language into text information and to output that text information. Specifically, for example, the sign language movement may be reproduced by the motion target while the content of that sign language is output by at least one of display and audio. Considering use by both people with hearing impairments and people without, it is desirable to output the text information by both display and audio.

In sports, various kinds of running (sprints, long-distance running, hurdles, and so on), golf, and the like correspond to posture-requiring motions. In such sports, for example, the motion target may be made to perform the motion represented by the skeleton information while text information that tells the detection target KT what to improve, or what is good, is generated by comparison with the model image OG and output by at least one of display and audio. Because it is difficult to convey text information in a timely manner to a detection target KT who is running, it is desirable to save the video information representing the motion of the motion target together with the text information so that both can be played back later.
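One way to picture the comparison with the model and the resulting text information is the following Python sketch. It assumes, purely for illustration, that posture is compared by the angle at a single joint computed from three 2D keypoints and that a fixed tolerance decides between praise and advice; the function names, the tolerance, and the message wording are hypothetical and are not taken from the publication.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b formed by the segments b-a and b-c."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang) % 360
    return 360 - ang if ang > 180 else ang

def feedback_text(joint_name, measured, model, tolerance=10.0):
    """Turn the difference between measured and model angles into a short message."""
    diff = measured - model
    if abs(diff) <= tolerance:
        return f"Good: your {joint_name} angle matches the model."
    # A larger angle means the joint is straighter than in the model.
    action = "bend" if diff > 0 else "straighten"
    return f"Try to {action} your {joint_name} about {abs(diff):.0f} degrees more."
```

For instance, a knee measured at 170 degrees against a model angle of 140 degrees would yield advice to bend the knee further, which could then be displayed, spoken, or stored with the recorded video.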

Using the skeleton information also makes it possible to identify the state of the detection target KT in a timely manner. This may be used to estimate the future movement of the detection target KT and to generate and output text information according to the estimation result. For example, a warning may be issued to a detection target KT who is about to touch an exhibit that should not be touched, or who is moving toward a place that should not be entered. It is also conceivable to treat customers visiting a store as detection targets KT in order to prevent criminal acts such as shoplifting or to identify a detection target KT who has committed such an act. In any of these cases, more information is provided and convenience for the user increases.

In motion estimation, the moving speed, moving direction, and the like of the detection target image KTG may be estimated from the skeleton information, and the estimation result may be reflected in the motion of the motion target. Such estimation (prediction) can be expected to improve responsiveness. When applied to detecting dangerous behavior by the detection target KT, it allows that detection to be made earlier.
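As a rough illustration of estimating speed and direction from skeleton information, the Python sketch below uses the position of one keypoint in two successive frames and extrapolates linearly. The function names, the 2D coordinates, and the constant-velocity assumption are all hypothetical; the publication does not state which estimation method is used.

```python
import math

def estimate_motion(p_prev, p_curr, dt):
    """Estimate velocity, speed, and direction of a keypoint between two frames.

    p_prev and p_curr are (x, y) positions; dt is the frame interval and must be > 0.
    """
    vx = (p_curr[0] - p_prev[0]) / dt
    vy = (p_curr[1] - p_prev[1]) / dt
    speed = math.hypot(vx, vy)
    direction_deg = math.degrees(math.atan2(vy, vx))
    return (vx, vy), speed, direction_deg

def predict_position(p_curr, velocity, lookahead):
    """Extrapolate the keypoint position `lookahead` time units ahead at constant velocity."""
    return (p_curr[0] + velocity[0] * lookahead,
            p_curr[1] + velocity[1] * lookahead)
```

Driving the motion target from the predicted position rather than the last observed one is one way the response delay mentioned above could be reduced.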

This motion estimation may also be used to control the motion of avatars such as an opponent in a fighting game, or a character that faces the player in a game such as an RPG (role-playing game). An opponent can be made to perform defensive movements according to the motion represented by the skeleton information alone; with motion estimation, however, those defensive movements can be made more natural. A facing character that acts according to the movement of the detection target KT gives the detection target KT a stronger impression of being faced. If the motion estimation result is used to control that character's movement, the movement becomes more natural and the detection target KT gets a stronger impression that an opponent is actually facing them. In either case, it is desirable to generate text information through posture evaluation using the skeleton information, to have the character act according to the generated text information, and to have it conduct natural conversation.

The present invention can also be applied to the motion control of an avatar or a robot at an unmanned store or an unmanned reception desk. Both the avatar and the robot are made to perform a motion different from the motion of the detection target KT, in response to that motion, and are used to provide the service that the detection target KT wants.
In such an application, the invention may, for example, be combined with image recognition technology to recognize the presence or absence of products, luggage, or the like and to respond differently according to the recognition result. More specifically, for example, when the detection target KT makes a motion such as showing a product, the type of that product may be identified and the identification result reflected in the subsequent conversation. This is because a detection target KT who makes a motion such as showing a product is likely to want to ask about the same product in another color, another product in the same category, stock availability, or the like.

As described above, motion targets are broadly divided into two types: those made to move along with the motion of the detection target KT (hereinafter "first motion targets") and those made to perform a motion different from that of the detection target KT (hereinafter "second motion targets"). Accordingly, the generated text information is also broadly divided into two types: text intended for a first motion target and text intended for a second motion target. The same applies to the model data TD (reference image information).

Both the first motion target and the second motion target may be subjected to motion control. When the two are controlled at the same time, the text information may be generated on the assumption that it will be used for controlling both. For example, when the detection target KT takes the transformation pose represented by the model data TD, text information may be generated so that the first motion target outputs the word "transformation" as audio while the second motion target stops moving or moves into a set posture. When voice recognition technology and natural conversation (language) technology are also used, the utterance of "transformation" by the detection target KT may be made the condition for evaluating whether the detection target KT has taken the transformation pose and for generating the text information. In this case, the text information may be used to emit an effect sound for the first motion target and to perform the same motion control as above on the second motion target. Thus, many variations are possible in the conditions for generating text information, the content of the generated text information, and the motion control of motion targets based on the text information.
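The variant in which the utterance gates the pose evaluation can be sketched in Python as follows. The function `plan_transformation`, the callable `evaluate_pose`, and the returned dictionary layout are hypothetical names and structures used only for illustration; the choice of "stop" for the second motion target is one of the two options the text mentions.

```python
def plan_transformation(utterance, evaluate_pose, trigger_word="transformation"):
    """Return control instructions for both motion targets, or None if nothing fires.

    evaluate_pose is a hypothetical callable that returns True when the
    skeleton information matches the transformation pose in the model data.
    It is called only after the trigger word has been recognized.
    """
    if utterance != trigger_word:
        return None  # the pose is not evaluated without the utterance
    if not evaluate_pose():
        return None  # utterance heard, but the pose does not match the model
    text_information = trigger_word
    return {
        "text": text_information,
        "first_target": ("speak", text_information),  # moves along with the detection target
        "second_target": ("stop", None),              # alternatively, move to a set posture
    }
```

Evaluating the pose only after the trigger word is recognized keeps the pose comparison from running on every frame, which is one possible reason to order the conditions this way.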

1 AP server, 2 server, 3 projector, 4 sound system, 5 microphone, 11 CPU, 18 storage unit, 19 communication unit, 111 setting unit, 112 image extraction unit, 113 skeleton detection unit, 114 posture evaluation unit, 115 text conversion unit, 116 video generation unit, 117 posture control unit, 118 voice recognition unit, 119 conversation processing unit, 120 image recognition unit, 121 face recognition unit, 181 reference image storage section, 182 character image storage section, 183 background image storage section, 184 dictionary storage section, 185 control information storage section, C camera, H1 skeleton information generation system, H2 posture evaluation system, H3 posture management system, KK customer company, KT detection target, KTG detection target image, OG model image, SK service provider, SY service providing system

Claims

An information processing device comprising:
image data acquisition means for acquiring image data obtained by imaging a detection target;
skeleton information generation means for generating, based on a detection target image represented by the image data, skeleton information including position information of joints in the detection target image and relationship information indicating relationships between the joints;
analysis means for analyzing motion content in the detection target image based on the skeleton information;
text information generation means for generating text information based on an analysis result of the motion content; and
motion control means for causing at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, to operate as a motion target based on the text information.

The information processing device according to claim 1,
wherein the motion of the motion target caused by the motion control means based on the text information includes a speaking motion performed by at least one of display and sound emission.

The information processing device according to claim 1 or 2,
wherein the motion control means is capable of operating the motion target based on the skeleton information.

The information processing device according to any one of claims 1 to 3, further comprising:
audio information acquisition means for acquiring audio information representing a voice uttered by the detection target;
voice recognition means for performing voice recognition using the audio information and generating a first character string from the audio information; and
conversation processing means for generating a second character string serving as a response to the first character string,
wherein the motion control means outputs the second character string as the spoken content in the speaking motion of the motion target.

The information processing device according to claim 1, 2, or 4,
wherein the motion control means causes the motion target to perform, based on the skeleton information, a motion different from the motion represented by the skeleton information.

The information processing device according to any one of claims 1 to 4,
wherein the text information generation means generates the text information representing the content of sign language when, as a result of the analysis of the motion content, the hand and finger movement in the detection target image is identified as a sign language movement, and
the motion control means uses the skeleton information to cause the motion target to perform hand and finger motions that follow the hand and finger movement and to output the text information.

A program causing an information processing device to execute processing comprising:
generating, based on a detection target image represented by image data obtained by imaging directed at a detection target, skeleton information including position information of joints in the detection target image and relationship information indicating relationships between the joints;
analyzing motion content in the detection target image based on the skeleton information;
generating text information based on an analysis result of the motion content; and
causing at least one of a predetermined avatar to be displayed and a robot, which is a physical machine, to operate as a motion target based on the text information.