WO2023277043A1 - Information processing device - Google Patents

Information processing device

Info

Publication number
WO2023277043A1
WO2023277043A1 (PCT application PCT/JP2022/025855)
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
model
information processing
data
subject
Prior art date
Application number
PCT/JP2022/025855
Other languages
French (fr)
Japanese (ja)
Inventor
達也 高村
Original Assignee
Preferred Networks, Inc. (株式会社Preferred Networks)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Preferred Networks, Inc.
Publication of WO2023277043A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 — Animation
    • G06T13/20 — 3D [Three Dimensional] animation
    • G06T13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis
    • G06T7/70 — Determining position or orientation of objects or cameras

Definitions

  • The present disclosure relates to an information processing device.
  • When the motion of a subject in a two-dimensional image is reproduced with a three-dimensional CG model, the subject's movement cannot be reproduced accurately simply by applying the joint angles estimated from the subject in the two-dimensional image to the skeleton of the three-dimensional CG model, because of differences between the skeleton of the subject and the skeleton of the three-dimensional CG model.
  • The present disclosure provides a technique for detecting, in the subject of a two-dimensional image, an event to be reproduced by a three-dimensional CG model, and for generating a three-dimensional CG model that reproduces the event.
  • According to one embodiment, an information processing device includes one or more memories and one or more processors.
  • The one or more processors estimate three-dimensional posture data of the subject from a two-dimensional image including the subject, detect an event related to the movement of the subject from the two-dimensional image, and control the posture of the skeleton of a three-dimensional model based on the three-dimensional posture data and the event.
  • FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment.
  • FIG. 2 is a diagram showing an example of a region of a predetermined part according to one embodiment.
  • FIG. 3 is a diagram showing an example of joints according to one embodiment.
  • FIG. 4 is a flowchart showing an example of processing of an information processing device according to one embodiment.
  • FIG. 5 is a diagram showing a rendering result according to one embodiment.
  • FIG. 6 is a diagram showing a rendering result according to a comparative example.
  • FIG. 7 is a diagram showing an example of an event.
  • FIG. 8 is a diagram showing a rendering result according to one embodiment.
  • FIG. 9 is a diagram showing a rendering result according to a comparative example.
  • FIG. 10 is a block diagram showing an example of a hardware implementation according to one embodiment.
  • FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment.
  • The information processing device 1 includes an input unit 100, a storage unit 102, a part detection unit 104, a two-dimensional coordinate estimation unit 106, a three-dimensional coordinate estimation unit 108, a smoothing unit 110, an event detection unit 112, a skeleton control unit 114, a rendering unit 116, and an output unit 118.
  • When a two-dimensional image including a subject is input, the information processing device 1 causes a three-dimensional model, which is a model of the subject, to reproduce the movement shown in the two-dimensional image.
  • The three-dimensional model may be a three-dimensional CG (Computer Graphics) model, and the information processing device 1 can output images and videos of the three-dimensional model viewed from any angle.
  • The information processing device 1 estimates three-dimensional posture data of the subject from a two-dimensional image including the subject, and, separately from this estimation, detects an event related to the movement of the subject from the two-dimensional image, and controls the posture of the skeleton (bones) of the three-dimensional model based on the estimation result and the detection result. In the present disclosure, the terms "posture", "pose", and "three-dimensional coordinates" may be read interchangeably depending on the context.
  • The information processing device 1 may further apply bone animation (skeletal animation, skinned mesh animation) techniques to this skeleton control result to correct and render the three-dimensional model, and may output the result as a two-dimensional or three-dimensional still image or video.
  • The subject is not limited to a human being as described later, and may be a moving body such as an animal other than a human or a machine such as a robot, or a stationary object that interacts with a moving body.
  • An event related to the movement of the subject may include at least either an event caused by the movement of the subject itself or an event caused by a moving body acting on the subject even though the subject itself does not move.
  • The input unit 100 has an input interface that accepts input of data from the outside.
  • The information processing device 1 receives, via the input unit 100, data required for its operation and data to be processed.
  • The storage unit 102 stores data necessary for the operation of the information processing device 1 or data to be processed. Various data input from the input unit 100 may be stored in the storage unit 102 temporarily or non-temporarily.
  • The part detection unit 104 acquires a predetermined part, or a region to which the predetermined part belongs, from the two-dimensional image to be processed. The part detection unit 104 may perform this process using the first model NN1, which is a trained model.
  • The first model NN1 is a model that detects two-dimensional data related to the predetermined part from a two-dimensional image that includes the subject.
  • This model is, for example, a neural network model.
  • More specifically, the first model NN1 may be, for example, a CNN (Convolutional Neural Network) having at least one convolutional layer, or another neural network model such as an MLP (Multi-Layer Perceptron).
  • The first model NN1 is trained, for example, by an arbitrary machine learning method so that, when a two-dimensional image is input, it outputs information regarding the predetermined part.
  • The part detection unit 104 acquires information on the region to which the predetermined part belongs by inputting the two-dimensional image to the first model NN1 and forward-propagating it.
  • The information about the predetermined part may be coordinate data of the predetermined part of the subject, or the range to which the predetermined part of the subject belongs, for example, data about a bounding box.
  • The predetermined part is, for example, a part whose posture is to be corrected in the three-dimensional model.
  • FIG. 2 is a diagram schematically showing an example of detection by the part detection unit 104.
  • For example, when the subject is a human and the predetermined part is a hand, and a two-dimensional image such as that in FIG. 2 is input, the part detection unit 104 detects the positions of the hands using the first model NN1.
  • The part detection unit 104 may extract the predetermined parts by setting the region to which the right hand belongs as a bounding box B1 and the region to which the left hand belongs as a bounding box B2.
  • In this case, the first model NN1 is a model trained to output a bounding box of the predetermined part when a two-dimensional image including the subject is input.
  • Desirably, as shown in FIG. 2, the first model NN1 is a model trained so that one bounding box is output for each predetermined part, that is, so that predetermined parts and bounding boxes are extracted on a one-to-one basis.
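  • The following Python sketch illustrates one possible way the part detection unit 104 could wrap the first model NN1. It is a minimal, non-authoritative example: the class name PartDetector, the method detect_hands, the use of PyTorch, and the assumed output shape of NN1 are all illustrative assumptions, not part of the disclosure.

```python
import torch

class PartDetector:
    """Hypothetical wrapper around the trained first model NN1.

    NN1 is assumed to take a normalized RGB image tensor of shape (1, 3, H, W)
    and to return one bounding box per predetermined part (here the right hand
    and the left hand), i.e. the one-to-one extraction described above.
    """

    def __init__(self, nn1: torch.nn.Module):
        self.nn1 = nn1.eval()  # trained model, inference only

    @torch.no_grad()
    def detect_hands(self, image: torch.Tensor) -> dict:
        # image: (3, H, W). Forward-propagate the 2D image through NN1.
        # Assumed output: (1, 2, 4) holding (x_min, y_min, x_max, y_max)
        # for the right hand (index 0) and the left hand (index 1).
        boxes = self.nn1(image.unsqueeze(0))[0]
        return {"right_hand": boxes[0], "left_hand": boxes[1]}  # boxes B1, B2
```

  • Cropping the input image with the returned boxes B1 and B2 would then yield the inputs for the second model NN2.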
  • Returning to FIG. 1, the two-dimensional coordinate estimation unit 106 estimates the coordinates of predetermined locations within the predetermined part detected by the part detection unit 104.
  • The predetermined locations are, for example, the positions of the subject's joints.
  • The two-dimensional coordinate estimation unit 106 may perform this process using the second model NN2, which is a trained model.
  • The second model NN2 is a model that acquires the two-dimensional coordinates of the joints in the predetermined part from two-dimensional data of the predetermined part (for example, the two-dimensional image within the bounding box).
  • This model is, for example, a neural network model. More specifically, the second model NN2 may be, for example, a CNN having at least one convolutional layer, or another neural network model such as an MLP.
  • The second model NN2 is trained, for example, by an arbitrary machine learning method so that, when two-dimensional data relating to the predetermined part is input, it outputs the two-dimensional coordinates of the joints in the predetermined part.
  • The two-dimensional coordinate estimation unit 106 acquires the two-dimensional coordinates of the joints by inputting the region of the predetermined part to the second model NN2 and forward-propagating it.
  • The two-dimensional coordinates may be represented in a coordinate system centered on the origin of the bounding box, or in a coordinate system centered on the origin of the two-dimensional image.
  • When the subject is a human, the two-dimensional coordinates may be relative coordinates with a part of the human, such as the nose, as the origin.
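  • As a small illustration of the coordinate conventions just mentioned, the following sketch converts joint coordinates from the bounding-box frame to the image frame and then expresses them relative to a reference point such as the nose. The function names are hypothetical and only assume NumPy arrays as input.

```python
import numpy as np

def box_to_image_coords(keypoints_box: np.ndarray, box_origin: np.ndarray) -> np.ndarray:
    """Convert (J, 2) joint coordinates expressed in the bounding-box frame
    (origin at the box's top-left corner) into image-frame coordinates."""
    return keypoints_box + box_origin          # broadcast (J, 2) + (2,)

def to_relative_coords(keypoints_img: np.ndarray, origin_point: np.ndarray) -> np.ndarray:
    """Express image-frame coordinates relative to a reference point on the
    subject, e.g. the nose, as one of the options described above."""
    return keypoints_img - origin_point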
  • FIG. 3 is a diagram showing an example of the positions of the hand joints within the bounding boxes of FIG. 2. The points indicated by dots represent the joints of the predetermined part within the bounding box.
  • The two-dimensional coordinate estimation unit 106 acquires the joint positions shown in this figure from the image within the bounding box of FIG. 2 by using the second model NN2.
  • The second model NN2 is trained by an appropriate machine learning method so that, when an image such as the bounding box image in FIG. 2 is input, it outputs the joint positions shown in FIG. 3. The training data may be created by the user. Neural network models that have already been optimized and published may be used as the first model NN1 and the second model NN2.
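  • The following is a minimal sketch of what the second model NN2 might look like as a small CNN that regresses joint coordinates from a cropped hand image. The layer sizes and the assumption of 21 hand joints are illustrative choices, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class HandKeypointNet(nn.Module):
    """Illustrative second model NN2: regresses 2D joint coordinates from a
    cropped bounding-box image. 21 joints is an assumed convention."""

    def __init__(self, num_joints: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_joints * 2)

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: (N, 3, H, W) image inside a bounding box, normalized to [0, 1]
        x = self.features(crop).flatten(1)            # (N, 64)
        return self.head(x).view(x.shape[0], -1, 2)   # (N, num_joints, 2)
```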
  • Note that the processes of the part detection unit 104 and the two-dimensional coordinate estimation unit 106 need not be executed separately. As indicated by the dotted line in FIG. 1, a fourth model NN4, which is a trained model that acquires the two-dimensional coordinates of the joints in the predetermined part when a two-dimensional image including the subject is input, may be used.
  • In this case, the part detection unit 104 may be omitted.
  • For example, the two-dimensional coordinate estimation unit 106 may acquire the two-dimensional coordinates of the joints in the predetermined part from the two-dimensional image by inputting the two-dimensional image including the subject, received via the input unit 100, to the fourth model NN4.
  • Returning to FIG. 1, the three-dimensional coordinate estimation unit 108 acquires the three-dimensional coordinates of the subject's joints based on the two-dimensional coordinates of the joints acquired by the two-dimensional coordinate estimation unit 106.
  • The three-dimensional coordinate estimation unit 108 may perform this process using the third model NN3, which is a trained model.
  • The third model NN3 is a model that acquires three-dimensional coordinates from the two-dimensional coordinates of the joints.
  • This model is, for example, a neural network model. More specifically, the third model NN3 may be, for example, a CNN or another neural network model such as an MLP. More simply, the third model NN3 may be a simple model having fully connected layers that take the two-dimensional coordinates of the joints as the input layer.
  • The third model NN3 is trained by an arbitrary machine learning method as a model that acquires three-dimensional coordinates from the two-dimensional coordinates of the joints.
  • The three-dimensional coordinate estimation unit 108 acquires the three-dimensional coordinates of the joints by inputting the two-dimensional coordinates of the joints to the third model NN3 and forward-propagating them.
  • For example, the three-dimensional coordinate estimation unit 108 converts the two-dimensional coordinates of each point in FIG. 3 into three-dimensional coordinates in a three-dimensional space based on the third model NN3.
  • As above, the coordinates in the three-dimensional space may be, for example, coordinates expressed in a predetermined coordinate system centered on a predetermined origin, or relative coordinates centered on a predetermined position of the three-dimensional model, for example, the position of the nose.
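  • A minimal sketch of the third model NN3 as a fully connected network that lifts 2D joint coordinates to 3D is shown below. The joint count and layer widths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LiftingMLP(nn.Module):
    """Illustrative third model NN3: lifts 2D joint coordinates to 3D."""

    def __init__(self, num_joints: int = 21, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (N, num_joints, 2) relative 2D coordinates of the joints
        n = joints_2d.shape[0]
        return self.net(joints_2d.reshape(n, -1)).view(n, -1, 3)  # (N, num_joints, 3)
```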
  • The part detection unit 104, the two-dimensional coordinate estimation unit 106, and the three-dimensional coordinate estimation unit 108 are described above as performing their respective processes using trained neural network models, but they are not limited to this.
  • For example, the output may be obtained using a function or the like that outputs appropriately derived data based on statistics of the respective inputs.
  • The smoothing unit 110 smooths the acquired three-dimensional coordinates using information from multiple frames.
  • For example, the smoothing unit 110 smooths the three-dimensional coordinates of the joints obtained from the two-dimensional image of the frame of interest using the three-dimensional coordinates of the joints obtained from the two-dimensional images of the preceding and succeeding frames.
  • The smoothing unit 110 can suppress jitter in the time series of the three-dimensional coordinates of the joints and suppress unnatural motion in which joint positions jump abruptly to other positions when the three-dimensional model is animated.
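  • The disclosure does not prescribe a particular filter; the following sketch shows one simple possibility, a moving average of the joint coordinates over a sliding window of recent frames. The window size is an arbitrary illustrative choice.

```python
import numpy as np
from collections import deque

class JointSmoother:
    """Smooths per-frame 3D joint coordinates over a sliding window of recent
    frames. A plain moving average is used here purely for illustration."""

    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)

    def smooth(self, joints_3d: np.ndarray) -> np.ndarray:
        # joints_3d: (J, 3) coordinates estimated for the current frame
        self.history.append(joints_3d)
        return np.mean(np.stack(self.history), axis=0)  # (J, 3)
```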
  • The event detection unit 112 detects the occurrence of events in the two-dimensional image.
  • The events may be predetermined. When the subject is a human and the predetermined part is a hand, the events may include, for example, contact between the hands, or occlusion of a hand in the two-dimensional image.
  • The event detection unit 112 may detect an event based on rules, or may detect an event using a fifth model NN5 (not shown).
  • The fifth model NN5 is, for example, a neural network model, and may be a model such as a CNN or an MLP.
  • The fifth model NN5 is a model appropriately trained to output the three-dimensional coordinates of the joints when the two-dimensional coordinates of the joints are input.
  • The fifth model NN5 may be a model that can also take a two-dimensional image as input.
  • The training dataset may be generated by comparing images taken using a depth camera that can acquire three-dimensional coordinates in space with two-dimensional images.
  • As another example, the dataset may be generated by reconstructing a three-dimensional image from multiple cameras that capture the subject at different angles from different positions and comparing it with a two-dimensional image captured by one of those cameras or by another camera.
  • These shots may be realized by photographing a subject with markers attached to the joints whose coordinates are to be acquired.
  • The data for obtaining the three-dimensional coordinates may also be acquired not with a camera but by using sound waves (including ultrasonic waves), electromagnetic waves (including visible light and light in other bands), electrodes, or the like emitted from or attached to the joints of the subject, or by obtaining information such as their reflections from the joints.
  • A dataset used for training may also be generated by comparing three-dimensional data acquired and reconstructed by other techniques, such as motion capture, with data obtained by photographing the subject with a camera.
  • In any of these cases, training can be executed by associating the positions of the subject's joints in the captured two-dimensional image (two-dimensional coordinates) with the positions of the subject's joints in space (three-dimensional coordinates).
  • The positions (three-dimensional coordinates) of the joints of the subject may be expressed, for example, in the camera coordinate system used when the two-dimensional image was captured, or in another coordinate system.
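  • As a hedged sketch of how such paired data could be used, the following training loop fits a model that maps 2D joint coordinates observed in captured images to the corresponding 3D joint coordinates (for example, obtained from a depth camera or motion capture). The optimizer, loss, and data-loader interface are illustrative assumptions, not taken from the disclosure.

```python
import torch
import torch.nn as nn

def train_lifting_model(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Illustrative training loop over a dataset of paired joint coordinates.
    Each batch pairs 2D joint coordinates (N, J, 2) from captured images with
    3D joint coordinates (N, J, 3) acquired as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for joints_2d, joints_3d in loader:
            optimizer.zero_grad()
            loss = criterion(model(joints_2d), joints_3d)
            loss.backward()
            optimizer.step()
    return model
```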
  • The skeleton control unit 114 uses the three-dimensional coordinate data of the joints smoothed by the smoothing unit 110 to generate the posture of the skeleton of the three-dimensional model, and corrects that posture. The skeleton control unit 114 may, for example, set key frames appropriately and control the pose of the skeleton for each frame.
  • The skeleton control unit 114 also calculates the joint angles of the subject based on the three-dimensional coordinate data of the joints output by the smoothing unit 110. These joint angles are applied to the skeleton of the three-dimensional model.
  • The three-dimensional posture data of the subject is data specifying the posture of the subject in three-dimensional space, such as this three-dimensional coordinate data and the joint angles.
  • The skeleton control unit 114 controls the posture of the skeleton by executing a forward kinematics calculation on the skeleton of the three-dimensional model based on the calculated joint angles.
  • The skeleton control unit 114 further corrects the posture of the skeleton controlled by the forward kinematics calculation by an inverse kinematics calculation based on the event, so that the three-dimensional model reproduces the event detected by the event detection unit 112.
  • For example, the skeleton control unit 114 sets the portions of the skeleton controlled by the forward kinematics calculation that are essential for reproducing the event (for example, the middle fingers of the right hand and the left hand) so that they are placed at the event occurrence position, and performs an inverse kinematics calculation on the skeleton of the three-dimensional model based on this event occurrence position to correct the pose of the skeleton of the three-dimensional model.
  • By executing the above processing, the skeleton control unit 114 generates the posture of the skeleton of the three-dimensional model by a forward kinematics calculation based on the three-dimensional coordinates of the joints, and corrects that posture by an inverse kinematics calculation based on the event.
  • Alternatively, the skeleton control unit 114 may adjust the three-dimensional positions of the joints according to the event occurrence position and, based on the adjusted result, generate the posture of the skeleton by an inverse kinematics calculation based on the event occurrence position.
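  • The disclosure states only that the posture is corrected by an inverse kinematics calculation based on the event. The following sketch uses the FABRIK algorithm, operating directly on the 3D joint positions of one chain (for example, from the wrist to a fingertip), as one concrete, non-authoritative way such a correction could be realized; the choice of algorithm is an assumption.

```python
import numpy as np

def fabrik_correct(chain: np.ndarray, target: np.ndarray,
                   iterations: int = 10, tol: float = 1e-4) -> np.ndarray:
    """Illustrative inverse-kinematics correction (FABRIK) of one joint chain.

    chain: (J, 3) joint positions of the chain produced by the preceding pose
    estimation / forward-kinematics step, ordered from the chain root (e.g.
    the wrist) to the end effector (e.g. a fingertip).
    target: event occurrence position at which the end effector must be placed.
    Bone lengths are preserved.
    """
    chain = np.asarray(chain, dtype=float).copy()
    lengths = np.linalg.norm(np.diff(chain, axis=0), axis=1)   # bone lengths
    root = chain[0].copy()
    if np.sum(lengths) < np.linalg.norm(target - root):
        # Target unreachable: stretch the chain straight toward the target.
        direction = (target - root) / np.linalg.norm(target - root)
        for i in range(1, len(chain)):
            chain[i] = chain[i - 1] + direction * lengths[i - 1]
        return chain
    for _ in range(iterations):
        # Backward pass: place the end effector on the target.
        chain[-1] = target
        for i in range(len(chain) - 2, -1, -1):
            d = chain[i] - chain[i + 1]
            chain[i] = chain[i + 1] + d / np.linalg.norm(d) * lengths[i]
        # Forward pass: re-anchor the root.
        chain[0] = root
        for i in range(1, len(chain)):
            d = chain[i] - chain[i - 1]
            chain[i] = chain[i - 1] + d / np.linalg.norm(d) * lengths[i - 1]
        if np.linalg.norm(chain[-1] - target) < tol:
            break
    return chain
```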
  • The rendering unit 116 executes bone animation and rendering of the three-dimensional model whose skeleton has been controlled by the skeleton control unit 114.
  • The rendering unit 116 executes rendering using, for example, a technique such as ray tracing, and converts the three-dimensional model into a two-dimensional image or a two-dimensional video.
  • The output unit 118 outputs the image and video information generated by the rendering unit 116 using a two-dimensional output device such as a display.
  • The output unit 118 may also output the image and video information generated by the rendering unit 116 based on the user's viewpoint and field of view on a device capable of appropriately outputting three-dimensional information.
  • The output unit 118 may, for example, output by streaming, or may store the data in the storage unit 102 or the like.
  • FIG. 4 is a flowchart showing processing of the information processing device 1 according to one embodiment.
  • The information processing device 1 first sets a three-dimensional model for tracing the movement of the subject (S100).
  • The subsequent processing is processing for appropriately operating this three-dimensional model.
  • This three-dimensional model may be a preset or user-specified avatar.
  • Next, the information processing device 1 receives input of a two-dimensional image via the input unit 100 (S102).
  • The two-dimensional image may be a frame-by-frame image of a two-dimensional video.
  • The information processing device 1 may acquire in real time a two-dimensional image including the subject captured by a camera, or may process a video captured in advance frame by frame.
  • Next, the part detection unit 104 detects the predetermined part from the two-dimensional image including the subject (S104).
  • For example, the subject may be a human and the hands may be the predetermined parts.
  • Next, the two-dimensional coordinate estimation unit 106 estimates the positions of the joints in the predetermined part detected by the part detection unit 104, and outputs the two-dimensional coordinates of the joints in the two-dimensional image (S106).
  • The two-dimensional coordinate estimation unit 106 may estimate, for example, the joint positions of each finger, the joint position between the forearm and the hand, or the joint positions of the arm.
  • Next, the three-dimensional coordinate estimation unit 108 estimates three-dimensional coordinates in a three-dimensional space based on the two-dimensional coordinates of the joints estimated by the two-dimensional coordinate estimation unit 106 (S108).
  • Next, the smoothing unit 110 performs smoothing processing in the time-series direction on the three-dimensional coordinates of the joints estimated by the three-dimensional coordinate estimation unit 108 (S110).
  • This smoothing process may be any suitable process.
  • For example, the smoothing unit 110 obtains smoothed three-dimensional coordinates of the joints in the current frame based on the three-dimensional coordinates of the joints in a predetermined number of past frames and the three-dimensional coordinates of the joints in the current frame (that is, the three-dimensional coordinates of the joints in a plurality of frames including the current frame).
  • Next, the event detection unit 112 detects whether an event has occurred in the frame image being processed and, if an event has occurred, acquires the position where the event occurred (S112).
  • The event may be, for example, contact between the fingers of the right hand and the left hand, occlusion of one hand by the other hand, or the like, but is not limited to these. For example, it may be contact between a hand of the subject and the face, occlusion of the face by a hand of the subject, or the like.
  • The occurrence position of an event is an example of data related to the event, and is, for example, a coordinate in the two-dimensional image or a three-dimensional coordinate corresponding to that coordinate.
  • The event occurrence position may be, for example, a two-dimensional coordinate or a three-dimensional coordinate preset for each event.
  • The event occurrence position may also be calculated, for example, by a calculation predetermined for each event (for example, calculating the midpoint between the coordinates of a finger of the right hand and a finger of the left hand).
  • The processing of the event detection unit 112 may be executed in parallel with at least one of the processes from S104 to S110, or between any two of those processes, as appropriate.
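  • A minimal sketch of the rule-based detection and the midpoint-based occurrence position described above is shown below. The fingertip index and the distance threshold are assumptions for illustration only.

```python
import numpy as np

# Assumed index of the fingertip of interest within each hand's joint array;
# the actual index depends on the keypoint convention used.
FINGERTIP = 8

def detect_fingertip_contact(right_hand: np.ndarray, left_hand: np.ndarray,
                             threshold: float = 0.02):
    """Illustrative rule-based event detection (S112): fingertip contact is
    reported when the distance between the two fingertips falls below a
    predetermined threshold, and the event occurrence position is taken as
    the midpoint of the two fingertips, as described above.

    right_hand, left_hand: (J, 3) smoothed 3D joint coordinates.
    Returns (event_occurred, occurrence_position or None).
    """
    p_r, p_l = right_hand[FINGERTIP], left_hand[FINGERTIP]
    if np.linalg.norm(p_r - p_l) <= threshold:
        return True, (p_r + p_l) / 2.0   # midpoint as the occurrence position
    return False, None
```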
  • Next, the skeleton control unit 114 controls the posture of the skeleton of the three-dimensional model based on the three-dimensional coordinates of the joints smoothed by the smoothing unit 110 and the event detected by the event detection unit 112 (S114).
  • For example, the skeleton control unit 114 controls the posture of the skeleton of the three-dimensional model by performing a forward kinematics calculation based on the three-dimensional coordinates of the joints and correcting the result with an inverse kinematics calculation based on the event occurrence position.
  • Alternatively, the posture of the skeleton may be controlled by inverse kinematics calculations based on the three-dimensional coordinates of the joints and the event occurrence position.
  • Next, the rendering unit 116 executes bone animation and rendering based on the posture of the skeleton controlled by the skeleton control unit 114, converts the result into an appropriate format for output by an appropriate output means, and outputs it (S116).
  • The information processing device 1 repeats the processing from S102 to S116 an appropriate number of times, for example, for the number of frames, and can thereby output corrected image data for each frame, or video data in which the frames are appropriately combined in the time-series direction.
  • In the above, the smoothing unit 110 performs smoothing using a predetermined number of past frame images, but the present invention is not limited to this.
  • For example, future frames may also be used for smoothing.
  • In this case, the processes from S102 to S108 may first be executed for a predetermined number of frames, and the smoothing process may then be executed as appropriate.
  • The frame-by-frame processing described above may be executed in parallel or serially, as appropriate.
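  • The following sketch strings the steps S102 to S116 together as a per-frame loop. Each argument is a hypothetical callable standing in for the corresponding unit described above; this decomposition is an assumption for illustration rather than the disclosed implementation.

```python
def process_video(frames, estimate_pose_3d, detect_event, control_skeleton, render):
    """Illustrative per-frame loop corresponding to S102-S116 in FIG. 4."""
    outputs = []
    for frame in frames:                                  # S102: next 2D frame
        joints_3d = estimate_pose_3d(frame)               # S104-S110: detection,
                                                          # 2D/3D estimation, smoothing
        event = detect_event(frame, joints_3d)            # S112: event and its position
        skeleton_pose = control_skeleton(joints_3d, event)  # S114: FK + event-based IK
        outputs.append(render(skeleton_pose))             # S116: bone animation, rendering
    return outputs
```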
  • FIG. 5 is a diagram showing the result of rendering the hand shape of FIG. 2. In this figure, an event has occurred in which the fingertips of both hands are brought into contact in front of the mouth.
  • As shown in FIG. 5, just like the shape shown in FIG. 2, the three-dimensional model is appropriately controlled so that the fingertips of both hands are in contact. This can be achieved because the contact is detected as an event and the inverse kinematics calculation is performed based on the occurrence position of the event.
  • FIG. 6 shows the result of controlling the three-dimensional model at the same timing according to a comparative example.
  • In the comparative example, the three-dimensional model is generated by a forward kinematics calculation without detecting the event. Therefore, due to differences in physique and the like between the subject and the three-dimensional model, the fingertips overlap and the event cannot be expressed appropriately.
  • FIGS. 7 to 9 are diagrams showing another example of event reproduction.
  • FIG. 7 shows a subject with a lightly clenched hand placed on the forehead.
  • FIG. 8 is a diagram in which the information processing device 1 renders a three-dimensional model that traces the state of FIG. 7. As shown in this figure, the position of the forehead and the position of the clenched hand are properly represented.
  • In this case, the three-dimensional model is controlled without detecting the relationship between the eyes and the hand as an event, that is, without considering occlusion of the eyes by the forearm.
  • Inverse kinematics calculations can be performed so as to achieve appropriate rendering of the three-dimensional model. If necessary, the forward kinematics calculation for the joints and the inverse kinematics calculation from the event occurrence position may be repeated a predetermined number of times. For example, when implementing the aspect described in this paragraph, an inverse kinematics calculation from the forehead and the hand and an inverse kinematics calculation from the forearm and the eyes may each be performed.
  • FIG. 9 is an example showing control of a three-dimensional model without event detection. If only the forward kinematics calculation from the three-dimensional coordinates of the joints is performed, the positional relationship between the hand and the forehead cannot be represented appropriately.
  • The examples in these figures can be used, for example, to output sign-language expressions using a three-dimensional model.
  • In this case, the event may be an event that is critical to the expression in sign language.
  • For example, the information processing device 1 detects, as an event, the touching of both hands, the covering of a part of the face with a hand, or the touching of a part of the face with a hand, and can reproduce the event using the three-dimensional model.
  • When the event detection unit 112 detects an event in which fingertips touch each other, as shown in FIG. 2, the detection may be performed based on the distance between the fingertips. For example, the event detection unit 112 may determine that the fingertips are in contact when the distance between the fingertips is equal to or less than a predetermined distance.
  • Similarly, the event detection unit 112 may perform detection based on the distance between the nose and the wrist. For example, the event detection unit 112 may determine that the event has occurred when the distance between the nose and the wrist is equal to or less than a predetermined distance.
  • In this way, the event detection unit 112 may detect events by various kinds of rule-based processing.
  • Instead of such rule-based detection, the fifth model NN5 may be trained by machine learning.
  • For this training, the fifth model NN5 may use, for example, a dataset of contact between fingertips, contact between a hand and a part of the face, and occlusion of an arbitrary part by another arbitrary part.
  • When occlusion is detected as an event, the positions of the hidden hand joints can be appropriately inferred by the forward kinematics calculation and the inverse kinematics calculation based on the event, and bone animation can be generated accordingly.
  • For example, for a two-dimensional image in which one hand is occluded by the other, the information processing device 1 can reconstruct the three-dimensional model while appropriately maintaining the continuity of the shape of the hand in the hidden region.
  • By executing the above-described processing, the information processing device 1 can, for example, make a preset three-dimensional model (avatar) perform motions that match the movement of the subject in the two-dimensional image.
  • Since the information processing device 1 performs the inverse kinematics calculation based on the event that has occurred, images that appropriately reflect the event can be acquired even if there is a difference between the skeleton of the subject in the two-dimensional image and the skeleton of the three-dimensional model.
  • Outputting a three-dimensional model without showing the face of the sign-language interpreter also protects the privacy of the sign-language interpreter.
  • Toon rendering can also be used to render the three-dimensional model. In this case, any character or the like can easily be used as the three-dimensional model.
  • In the above, sign-language images are given as an example, but the application is not limited to this.
  • For example, the technique can be applied to three-dimensional rendering of two-dimensional images and two-dimensional videos, sports, VR, anonymization, surveillance cameras, marketing, and the like.
  • For example, by reproducing a two-dimensional image of a ball game with a three-dimensional model, it is possible to reproduce the play from the angle the user wants to watch.
  • Events can be set appropriately depending on the type of video to be reproduced by the three-dimensional model. For example, in the above ball game, by detecting events such as the distance between the ball and the players, their positional relationship, and the actions of the players toward the ball, the three-dimensional models of the players and the three-dimensional model of the ball can be expressed appropriately.
  • In VR, it also becomes possible to view artists and other performers from arbitrary viewpoints.
  • All of the above trained models may be concepts that include, for example, models that have been trained as described and further distilled by a general method.
  • Each device (the information processing device 1) in the above-described embodiment may be configured with hardware, or may be configured by information processing of software (a program) executed by a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like.
  • When the devices are configured by information processing of software, the software that realizes at least some of the functions of each device in the above-described embodiment may be stored on a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), a USB (Universal Serial Bus) memory, or another non-transitory storage medium (non-transitory computer-readable medium), and read into a computer to execute the information processing of the software.
  • The software may also be downloaded via a communication network.
  • Furthermore, the information processing may be performed by hardware by implementing the software in a circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • The type of storage medium that stores the software is not limited.
  • The storage medium is not limited to a removable one such as a magnetic disk or an optical disk, and may be a fixed storage medium such as a hard disk or a memory. The storage medium may be provided inside the computer or outside the computer.
  • FIG. 10 is a block diagram showing an example of the hardware configuration of each device (the information processing device 1) in the above embodiment.
  • Each device may be implemented as a computer 7 that includes, for example, a processor 71, a main storage device 72 (memory), an auxiliary storage device 73 (memory), a network interface 74, and a device interface 75, which are connected via a bus 76.
  • Although the computer 7 in FIG. 10 has one of each component, it may have a plurality of the same component. Also, although one computer 7 is shown in FIG. 10, the software may be installed on a plurality of computers. In this case, the processing may take the form of distributed computing in which the computers communicate via the network interface 74 or the like to execute the processing.
  • Each device (the information processing device 1) in the above-described embodiment may be configured as a system in which the functions are realized by one or more computers executing instructions stored in one or more storage devices.
  • Alternatively, information transmitted from a terminal may be processed by one or more computers provided on a cloud, and the processing results may be transmitted to the terminal.
  • The various operations of each device (the information processing device 1) in the above-described embodiment may be executed in parallel using one or more processors or using multiple computers connected via a network. The various operations may also be distributed to a plurality of arithmetic cores in the processor and executed in parallel. Part or all of the processing, means, and the like of the present disclosure may be executed by at least one of a processor and a storage device provided on a cloud capable of communicating with the computer 7 via a network. Thus, each device in the above-described embodiment may take the form of parallel computing by one or more computers.
  • The processor 71 may be an electronic circuit (a processing circuit, processing circuitry, a CPU, a GPU, an FPGA, an ASIC, or the like) including a control device and an arithmetic device of the computer. The processor 71 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 71 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 71 may also include arithmetic functions based on quantum computing.
  • The processor 71 can perform arithmetic processing based on data and software (programs) input from each device or the like of the internal configuration of the computer 7, and can output arithmetic results and control signals to each device or the like.
  • The processor 71 may control each component of the computer 7 by executing the OS (Operating System) of the computer 7, applications, and the like.
  • Each device (the information processing device 1) in the above-described embodiment may be realized by one or more processors 71.
  • Here, the processor 71 may refer to one or more electronic circuits arranged on one chip, or to one or more electronic circuits arranged on two or more chips or two or more devices. When a plurality of electronic circuits is used, the electronic circuits may communicate by wire or wirelessly.
  • The main storage device 72 is a storage device that stores instructions executed by the processor 71 and various data, and the information stored in the main storage device 72 is read by the processor 71.
  • The auxiliary storage device 73 is a storage device other than the main storage device 72. These storage devices may be any electronic components capable of storing electronic information, and may be semiconductor memories. A semiconductor memory may be either a volatile memory or a non-volatile memory.
  • The storage device for storing various data in each device (the information processing device 1) in the above-described embodiment may be realized by the main storage device 72 or the auxiliary storage device 73, or may be realized by built-in memory incorporated in the processor 71.
  • For example, the storage unit 102 in the above-described embodiment may be realized by the main storage device 72 or the auxiliary storage device 73.
  • A plurality of processors may be connected (coupled) to one storage device (memory), or a single processor may be connected.
  • A plurality of storage devices (memories) may be connected (coupled) to one processor.
  • When each device (the information processing device 1) in the above-described embodiment is configured with at least one storage device (memory) and a plurality of processors connected (coupled) to this at least one storage device (memory), a configuration in which at least one of the plurality of processors is connected (coupled) to the at least one storage device (memory) may be included.
  • This configuration may also be realized by storage devices (memories) and processors included in a plurality of computers.
  • Furthermore, a configuration in which a storage device (memory) is integrated with a processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.
  • The network interface 74 is an interface for connecting to the communication network 8 wirelessly or by wire. As the network interface 74, an appropriate interface, such as one conforming to existing communication standards, may be used. The network interface 74 may exchange information with the external device 9A connected via the communication network 8.
  • The communication network 8 may be any one of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or the like, or a combination thereof, as long as information can be exchanged between the computer 7 and the external device 9A. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet (registered trademark), and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).
  • The device interface 75 is an interface, such as USB, that directly connects to the external device 9B.
  • The external device 9A is a device connected to the computer 7 via the network.
  • The external device 9B is a device directly connected to the computer 7.
  • The external device 9A or the external device 9B may be, for example, an input device.
  • The input device is, for example, a device such as a camera, a microphone, a motion capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides the computer 7 with acquired information.
  • Alternatively, a device including an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone, may be used.
  • The external device 9A or the external device 9B may also be, for example, an output device.
  • The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel.
  • A speaker or the like that outputs sound may also be used.
  • Alternatively, a device including an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone, may be used.
  • The external device 9A or the external device 9B may also be a storage device (memory).
  • For example, the external device 9A may be a network storage or the like, and the external device 9B may be a storage such as an HDD.
  • The external device 9A or the external device 9B may also be a device having the functions of some of the components of each device (the information processing device 1) in the above-described embodiment. That is, the computer 7 may transmit or receive part or all of the processing results to or from the external device 9A or the external device 9B.
  • the expression "at least one (one) of a, b and c" or “at least one (one) of a, b or c" includes any of a, b, c, a-b, ac, b-c, or a-b-c. Also, multiple instances of any element may be included, such as a-a, a-b-b, a-a-b-b-c-c, and so on. It also includes the addition of other elements than the listed elements (a, b and c), such as having d such as a-b-c-d.
  • connection and “coupled” when used, they refer to direct connection/coupling, indirect connection/coupling , electrically connected/coupled, communicatively connected/coupled, operatively connected/coupled, physically connected/coupled, etc. intended as a term.
  • the term should be interpreted appropriately according to the context in which the term is used, but any form of connection/bonding that is not intentionally or naturally excluded is not included in the term. should be interpreted restrictively.
  • the physical structure of element A is such that it is capable of performing operation B has a configuration, including that a permanent or temporary setting/configuration of element A is configured/set to actually perform action B good.
  • element A is a general-purpose processor
  • the processor has a hardware configuration that can execute operation B, and operation B can be performed by setting a permanent or temporary program (instruction). It just needs to be configured to actually run.
  • the element A is a dedicated processor or a dedicated arithmetic circuit, etc., regardless of whether or not control instructions and data are actually attached, the circuit structure of the processor actually executes the operation B. It just needs to be implemented.
  • In the present disclosure (including the claims), when a plurality of pieces of hardware performs predetermined processing, the pieces of hardware may cooperate to perform the predetermined processing, or a part of the hardware may perform all of the predetermined processing. Also, some hardware may perform a part of the predetermined processing, and other hardware may perform the rest of the predetermined processing.
  • In the present disclosure (including the claims), the hardware that performs the first processing and the hardware that performs the second processing may be the same or different. In other words, the hardware that performs the first processing and the hardware that performs the second processing may each be included in the one or more pieces of hardware.
  • Note that hardware may include an electronic circuit or a device including an electronic circuit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

[Problem] To generate an appropriate three-dimensional CG model from a two-dimensional image. [Solution] An information processing device according to the present invention is provided with one or more memories and one or more processors. The one or more processors estimate three-dimensional posture data of an imaging subject from a two-dimensional image including the imaging subject, detect an event related to a motion of the imaging subject from the two-dimensional image, and control the posture of a skeleton of the three-dimensional model on the basis of the three-dimensional posture data and the event.

Description

Information processing device
The present disclosure relates to an information processing device.
When the motion of a subject in a two-dimensional image is reproduced with a three-dimensional CG model, the subject's movement cannot be reproduced accurately simply by applying the joint angles estimated from the subject in the two-dimensional image to the skeleton of the three-dimensional CG model, because of differences between the skeleton of the subject and the skeleton of the three-dimensional CG model.
JP 2019-197278 A
The present disclosure provides a technique for detecting, in the subject of a two-dimensional image, an event to be reproduced by a three-dimensional CG model, and for generating a three-dimensional CG model that reproduces the event.
According to one embodiment, an information processing device includes one or more memories and one or more processors. The one or more processors estimate three-dimensional posture data of the subject from a two-dimensional image including the subject, detect an event related to the movement of the subject from the two-dimensional image, and control the posture of the skeleton of a three-dimensional model based on the three-dimensional posture data and the event.
FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment. FIG. 2 is a diagram showing an example of a region of a predetermined part according to one embodiment. FIG. 3 is a diagram showing an example of joints according to one embodiment. FIG. 4 is a flowchart showing an example of processing of an information processing device according to one embodiment. FIG. 5 is a diagram showing a rendering result according to one embodiment. FIG. 6 is a diagram showing a rendering result according to a comparative example. FIG. 7 is a diagram showing an example of an event. FIG. 8 is a diagram showing a rendering result according to one embodiment. FIG. 9 is a diagram showing a rendering result according to a comparative example. FIG. 10 is a block diagram showing an example of a hardware implementation according to one embodiment.
Embodiments of the present invention will be described below with reference to the drawings. The drawings and the description of the embodiments are given by way of example and do not limit the invention. In the description, a portion described as an image can be read as a frame of a video as appropriate, unless otherwise specified.
FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment. The information processing device 1 includes an input unit 100, a storage unit 102, a part detection unit 104, a two-dimensional coordinate estimation unit 106, a three-dimensional coordinate estimation unit 108, a smoothing unit 110, an event detection unit 112, a skeleton control unit 114, a rendering unit 116, and an output unit 118. When a two-dimensional image including a subject is input, the information processing device 1 causes a three-dimensional model, which is a model of the subject, to reproduce the movement shown in the two-dimensional image. The three-dimensional model may be a three-dimensional CG (Computer Graphics) model, and the information processing device 1 can output images and videos of the three-dimensional model viewed from any angle.
The information processing device 1 estimates three-dimensional posture data of the subject from a two-dimensional image including the subject and, separately from this estimation, detects an event related to the movement of the subject from the two-dimensional image, and controls the posture of the skeleton (bones) of the three-dimensional model based on the estimation result and the detection result. In the present disclosure, the terms "posture", "pose", and "three-dimensional coordinates" may be read interchangeably depending on the context.
The information processing device 1 may further apply bone animation (skeletal animation, skinned mesh animation) techniques to this skeleton control result to correct and render the three-dimensional model, and may output the result as a two-dimensional or three-dimensional still image or video. The subject is not limited to a human being as described later, and may be a moving body such as an animal other than a human or a machine such as a robot, or a stationary object that interacts with a moving body. An event related to the movement of the subject may include at least either an event caused by the movement of the subject itself or an event caused by a moving body acting on the subject even though the subject itself does not move.
The input unit 100 has an input interface that accepts input of data from the outside. The information processing device 1 receives, via the input unit 100, data required for its operation and data to be processed.
The storage unit 102 stores data necessary for the operation of the information processing device 1 or data to be processed. Various data input from the input unit 100 may be stored in the storage unit 102 temporarily or non-temporarily.
The part detection unit 104 acquires a predetermined part, or a region to which the predetermined part belongs, from the two-dimensional image to be processed. The part detection unit 104 may perform this process using the first model NN1, which is a trained model.
The first model NN1 is a model that detects two-dimensional data related to the predetermined part from a two-dimensional image that includes the subject. This model is, for example, a neural network model. More specifically, the first model NN1 may be, for example, a CNN (Convolutional Neural Network) having at least one convolutional layer, or another neural network model such as an MLP (Multi-Layer Perceptron). The first model NN1 is trained, for example, by an arbitrary machine learning method so that, when a two-dimensional image is input, it outputs information regarding the predetermined part. The part detection unit 104 acquires information on the region to which the predetermined part belongs by inputting the two-dimensional image to the first model NN1 and forward-propagating it.
The information about the predetermined part may be coordinate data of the predetermined part of the subject, or the range to which the predetermined part of the subject belongs, for example, data about a bounding box. The predetermined part is, for example, a part whose posture is to be corrected in the three-dimensional model.
FIG. 2 is a diagram schematically showing an example of detection by the part detection unit 104.
For example, when the subject is a human and the predetermined part is a hand, and a two-dimensional image such as that in FIG. 2 is input, the part detection unit 104 detects the positions of the hands using the first model NN1. The part detection unit 104 may extract the predetermined parts by setting the region to which the right hand belongs as a bounding box B1 and the region to which the left hand belongs as a bounding box B2.
In this case, the first model NN1 is a model trained to output a bounding box of the predetermined part when a two-dimensional image including the subject is input. Desirably, as shown in FIG. 2, the first model NN1 is a model trained so that one bounding box is output for each predetermined part, that is, so that predetermined parts and bounding boxes are extracted on a one-to-one basis.
Returning to FIG. 1, the two-dimensional coordinate estimation unit 106 estimates the coordinates of predetermined locations within the predetermined part detected by the part detection unit 104. The predetermined locations are, for example, the positions of the subject's joints. The two-dimensional coordinate estimation unit 106 may perform this process using the second model NN2, which is a trained model.
The second model NN2 is a model that acquires the two-dimensional coordinates of the joints in the predetermined part from two-dimensional data of the predetermined part (for example, the two-dimensional image within the bounding box). This model is, for example, a neural network model. More specifically, the second model NN2 may be, for example, a CNN having at least one convolutional layer, or another neural network model such as an MLP. The second model NN2 is trained, for example, by an arbitrary machine learning method so that, when two-dimensional data relating to the predetermined part is input, it outputs the two-dimensional coordinates of the joints in the predetermined part. The two-dimensional coordinate estimation unit 106 acquires the two-dimensional coordinates of the joints by inputting the region of the predetermined part to the second model NN2 and forward-propagating it.
The two-dimensional coordinates may be represented in a coordinate system centered on the origin of the bounding box, or in a coordinate system centered on the origin of the two-dimensional image. When the subject is a human, the two-dimensional coordinates may be relative coordinates with a part of the human, such as the nose, as the origin.
FIG. 3 is a diagram showing an example of the positions of the hand joints within the bounding boxes of FIG. 2. The points indicated by dots represent the joints of the predetermined part within the bounding box. The two-dimensional coordinate estimation unit 106 acquires the joint positions shown in this figure from the image within the bounding box of FIG. 2 by using the second model NN2.
The second model NN2 is trained by an appropriate machine learning method so that, when an image such as the bounding box image in FIG. 2 is input, it outputs the joint positions shown in FIG. 3. The training data may be created by the user. Neural network models that have already been optimized and published may be used as the first model NN1 and the second model NN2.
Note that the processes of the part detection unit 104 and the two-dimensional coordinate estimation unit 106 need not be executed separately. As indicated by the dotted line in FIG. 1, a fourth model NN4, which is a trained model that acquires the two-dimensional coordinates of the joints in the predetermined part when a two-dimensional image including the subject is input, may be used. In this case, the part detection unit 104 may be omitted. For example, the two-dimensional coordinate estimation unit 106 may acquire the two-dimensional coordinates of the joints in the predetermined part from the two-dimensional image by inputting the two-dimensional image including the subject, received via the input unit 100, to the fourth model NN4.
 図1に戻り、3次元座標推定部108は、2次元座標推定部106が取得した関節の2次元座標に基づいて、被写体の関節の3次元座標を取得する。3次元座標推定部108は、訓練済みモデルである第3モデルNN3を用いてこの処理を実行してもよい。 Returning to FIG. 1, the 3D coordinate estimation unit 108 acquires the 3D coordinates of the joints of the subject based on the 2D coordinates of the joints acquired by the 2D coordinate estimation unit 106 . The three-dimensional coordinate estimation unit 108 may perform this process using the third model NN3, which is a trained model.
 The third model NN3 is a model that obtains three-dimensional coordinates from the two-dimensional coordinates of the joints. This model is, for example, a neural network model. More specifically, the third model NN3 may be, for example, a CNN, or another neural network model such as an MLP. More simply, the third model NN3 may be a simple model having fully connected layers that take the two-dimensional coordinates of the joints as the input layer. The third model NN3 is trained by an arbitrary machine learning method as a model that obtains three-dimensional coordinates from the two-dimensional coordinates of the joints. The three-dimensional coordinate estimation unit 108 obtains the three-dimensional coordinates of the joints by inputting their two-dimensional coordinates to the third model NN3 and forward-propagating them.
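 As a non-authoritative illustration of such a lifting model, the sketch below maps a flattened vector of two-dimensional joint coordinates to three-dimensional coordinates with fully connected layers; the hidden width and the joint count are assumptions made for the example.

import torch
import torch.nn as nn

class LiftingMLP(nn.Module):
    """Minimal sketch of a third-model-style lifter: 2D joint
    coordinates in, 3D joint coordinates out (illustrative only)."""

    def __init__(self, num_joints: int = 21, hidden: int = 256):  # assumed sizes
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (batch, num_joints, 2) -> (batch, num_joints, 3)
        out = self.net(joints_2d.flatten(1))
        return out.view(-1, self.num_joints, 3)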
 For example, the three-dimensional coordinate estimation unit 108 converts the two-dimensional coordinates of each point in FIG. 3 into three-dimensional coordinates in a three-dimensional space using the third model NN3. As above, the coordinates in the three-dimensional space may be, for example, coordinates expressed in a predetermined coordinate system centered on a predetermined origin, or relative coordinates centered on a predetermined position of the three-dimensional model, for example the position of the nose.
 In the above description, the part detection unit 104, the two-dimensional coordinate estimation unit 106, and the three-dimensional coordinate estimation unit 108 each perform their processing using trained neural network models, but the present disclosure is not limited to this. For example, the output may be obtained using a function or the like that produces appropriately derived data based on statistics of the respective inputs.
 The smoothing unit 110 smooths the obtained three-dimensional coordinates using information from other frames. For example, the smoothing unit 110 smooths the three-dimensional joint coordinates obtained from the two-dimensional image of the frame of interest using the three-dimensional joint coordinates obtained from the two-dimensional images of the preceding and following frames. This smoothing suppresses jitter in the time series of the three-dimensional joint coordinates and, when the three-dimensional model is animated, suppresses unnatural motion in which a joint position jumps abruptly to another position.
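 A minimal sketch of one possible smoothing scheme is given below: a centered moving average over a window of neighboring frames. The window size and the use of a plain average, rather than, for example, a Gaussian or one-euro filter, are assumptions made only for illustration.

import numpy as np

def smooth_joint_trajectories(coords: np.ndarray, window: int = 2) -> np.ndarray:
    """coords: (num_frames, num_joints, 3) per-frame 3D joint coordinates.
    Returns the trajectories smoothed with a centered moving average."""
    num_frames = coords.shape[0]
    smoothed = np.empty_like(coords, dtype=float)
    for t in range(num_frames):
        lo = max(0, t - window)
        hi = min(num_frames, t + window + 1)
        smoothed[t] = coords[lo:hi].mean(axis=0)  # average over neighboring frames
    return smoothed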
 The event detection unit 112 detects the occurrence of an event in the two-dimensional image. The events may be determined in advance. When the subject is a human and the predetermined part is a hand, the events may include, for example, a collision between the hands or occlusion of a hand in the two-dimensional image.
 The event detection unit 112 may detect events based on rules, or may detect events using a fifth model NN5 (not shown). The fifth model NN5 is, for example, a neural network model, and may be a model such as a CNN or an MLP. The fifth model NN5 is a model appropriately trained to output the three-dimensional coordinates of the joints when the two-dimensional coordinates of the joints are input. The fifth model NN5 may also be a model that can additionally receive the two-dimensional image as input.
 For example, the data set for this training may be generated by comparing two-dimensional images with images captured by a depth camera capable of acquiring three-dimensional coordinates of the space. As another example, it may be generated by reconstructing a three-dimensional image from a plurality of cameras that capture the subject from different positions and at different angles, and comparing it with a two-dimensional image captured by one of those cameras or by a separate camera.
 These captures may be realized by photographing a subject with markers attached to the joints of interest. Alternatively, the data for obtaining the three-dimensional coordinates may be acquired not with a camera but by emitting sound waves (including ultrasonic waves), electromagnetic waves (including visible light and light in other bands), electrode signals, or the like from the joints of the subject, or by acquiring information such as their reflections from the joints. Of course, the data set used for training may also be generated by comparing three-dimensional data acquired and reconstructed by other techniques such as motion capture with data obtained by photographing the subject with a camera.
 By using a data set generated in this way, training can be executed with the positions of the subject's joints in the captured two-dimensional image (two-dimensional coordinates) associated with the positions of the subject's joints in space (three-dimensional coordinates). The positions (three-dimensional coordinates) of the subject's joints may be expressed, for example, in the camera coordinate system at the time the two-dimensional image was captured, or in another coordinate system.
 The skeleton control unit 114 generates the posture of the skeleton of the three-dimensional model using the three-dimensional joint coordinate data smoothed by the smoothing unit 110, and further corrects the posture of the skeleton based on the event detected by the event detection unit 112. The skeleton control unit 114 may, for example, set key frames appropriately and control the posture of the skeleton frame by frame.
 The skeleton control unit 114 also calculates the joint angles of the subject based on the three-dimensional joint coordinate data output by the smoothing unit 110. These joint angles are applied to the skeleton of the three-dimensional model. The three-dimensional posture data of the subject is data that specifies the posture of the subject in three-dimensional space, such as the three-dimensional coordinate data and the joint angles. Based on the calculated joint angles, the skeleton control unit 114 controls the posture of the skeleton by executing a forward kinematics calculation on the skeleton of the three-dimensional model.
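 Purely to illustrate the forward-kinematics step, the sketch below accumulates per-joint local rotations along a single serial chain to obtain joint positions. Treating the skeleton as one chain with fixed bone offsets is a simplifying assumption made for the example, not a structure required by the disclosure.

import numpy as np

def forward_kinematics(local_rotations, bone_offsets, root_position):
    """local_rotations: list of (3, 3) rotation matrices, one per joint,
    derived from the estimated joint angles.
    bone_offsets: list of (3,) bone vectors in each parent's rest frame.
    Returns the (num_joints + 1, 3) positions along the chain."""
    positions = [np.asarray(root_position, dtype=float)]
    accumulated = np.eye(3)
    for rotation, offset in zip(local_rotations, bone_offsets):
        accumulated = accumulated @ rotation  # compose rotations from the root down
        positions.append(positions[-1] + accumulated @ np.asarray(offset, dtype=float))
    return np.stack(positions)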
 The skeleton control unit 114 further corrects the posture of the skeleton obtained by the forward kinematics calculation, using an inverse kinematics calculation based on the event, so that the three-dimensional model reproduces the event detected by the event detection unit 112. For example, the skeleton control unit 114 sets the parts of the skeleton that are essential to reproducing the event (for example, the middle fingers of the right and left hands) so that they are placed at the position where the event occurred, and executes an inverse kinematics calculation on the skeleton of the three-dimensional model based on this event occurrence position, thereby correcting the posture of the skeleton of the three-dimensional model.
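 One possible way to realize such an event-driven correction is sketched below using cyclic coordinate descent (CCD), which iteratively rotates the chain so that its end effector (for example, a fingertip) reaches the event occurrence position. CCD is chosen here only for brevity; the disclosure does not specify a particular inverse-kinematics solver, and the near-antiparallel case of the alignment rotation is deliberately left unhandled in this sketch.

import numpy as np

def _align_rotation(src, dst):
    """Rotation matrix taking direction src onto direction dst (Rodrigues formula).
    Assumes the two directions are not nearly antiparallel."""
    src = src / np.linalg.norm(src)
    dst = dst / np.linalg.norm(dst)
    v = np.cross(src, dst)
    c = float(np.dot(src, dst))
    if np.linalg.norm(v) < 1e-8:
        return np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def ccd_correction(chain, target, iterations=10, tolerance=1e-4):
    """chain: (N, 3) joint positions from root to end effector (e.g. a fingertip).
    target: (3,) event occurrence position the end effector should reach.
    Returns a corrected copy of the chain."""
    chain = np.array(chain, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(iterations):
        for i in range(len(chain) - 2, -1, -1):
            rotation = _align_rotation(chain[-1] - chain[i], target - chain[i])
            # rotate every joint downstream of joint i about joint i
            chain[i + 1:] = (chain[i + 1:] - chain[i]) @ rotation.T + chain[i]
        if np.linalg.norm(chain[-1] - target) < tolerance:
            break
    return chain

 Other solvers, for example Jacobian-based or FABRIK methods, could be substituted equally well.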
 By executing the above processing, the skeleton control unit 114 generates the posture of the skeleton of the three-dimensional model by a forward kinematics calculation based on the three-dimensional coordinates of the joints, and corrects the posture of the skeleton by an inverse kinematics calculation based on the event.
 As another example, the skeleton control unit 114 may adjust the three-dimensional positions of the joints according to the event occurrence position and, from the adjusted result, generate the posture of the skeleton by an inverse kinematics calculation based on the event occurrence position.
 The rendering unit 116 executes bone animation and rendering of the three-dimensional model whose skeleton is controlled by the skeleton control unit 114. The rendering unit 116 executes rendering using, for example, a technique such as ray tracing, and converts the three-dimensional model into a two-dimensional image or two-dimensional video.
 The output unit 118 outputs the image and video information generated by the rendering unit 116 on, for example, a two-dimensional output device such as a display. On a device capable of appropriately outputting three-dimensional information, the output unit 118 may also output image and video information generated by the rendering unit 116 based on the user's viewpoint and field of view.
 The output unit 118 may, for example, output by streaming, or may store the data in the storage unit 102 or the like.
 FIG. 4 is a flowchart showing the processing of the information processing device 1 according to one embodiment.
 The information processing device 1 first sets the three-dimensional model that is to trace the movement of the subject (S100). The subsequent processing is processing for making this three-dimensional model move appropriately. This three-dimensional model may be an avatar that is set in advance or specified by the user.
 The information processing device 1 receives the input of a two-dimensional image via the input unit 100 (S102). The two-dimensional image may be a per-frame image of a two-dimensional video. The information processing device 1 may acquire, in real time, a two-dimensional video including the subject being captured by a camera, or may process a previously captured video one frame at a time.
 The part detection unit 104 detects a predetermined part from the two-dimensional image including the subject (S104). For example, the subject may be a human and the predetermined part may be a hand.
 The two-dimensional coordinate estimation unit 106 estimates the positions of the joints in the predetermined part detected by the part detection unit 104, and outputs the two-dimensional coordinates of these joints in the two-dimensional image (S106). When the part is a hand, the two-dimensional coordinate estimation unit 106 may estimate, for example, the positions of the joints of each finger, the position of the joint between the forearm and the hand, or the positions of the joints of the arm.
 The three-dimensional coordinate estimation unit 108 estimates three-dimensional coordinates in three-dimensional space based on the two-dimensional joint coordinates estimated by the two-dimensional coordinate estimation unit 106 (S108).
 The smoothing unit 110 performs smoothing in the time-series direction on the three-dimensional joint coordinates estimated by the three-dimensional coordinate estimation unit 108 (S110). Any appropriate smoothing process may be used. For example, the smoothing unit 110 obtains smoothed three-dimensional joint coordinates for the current frame based on the three-dimensional joint coordinates of a predetermined number of past frames and those of the current frame (that is, the three-dimensional joint coordinates of a plurality of frames including the current frame).
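 For the real-time case, where future frames are not available, a causal filter such as the exponential moving average sketched below could be used; the smoothing factor is an assumed value chosen only for illustration.

import numpy as np

class CausalJointSmoother:
    """Exponential moving average over past frames only (illustrative sketch)."""

    def __init__(self, alpha: float = 0.5):  # assumed smoothing factor
        self.alpha = alpha
        self.state = None

    def update(self, joints_3d: np.ndarray) -> np.ndarray:
        # joints_3d: (num_joints, 3) coordinates estimated for the current frame
        if self.state is None:
            self.state = joints_3d.astype(float)
        else:
            self.state = self.alpha * joints_3d + (1.0 - self.alpha) * self.state
        return self.state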
 The event detection unit 112 detects whether an event has occurred in the frame image being processed and, if an event has occurred, acquires the position where the event occurred (S112). The event may be, for example, contact between the fingers of the right hand and the left hand, or occlusion of one hand by the other, but is not limited to these. It may also be, for example, contact between the subject's hand and face, or occlusion of the face by the subject's hand. The event occurrence position is an example of data relating to the event; in the example of a contact event between the fingers of the right and left hands, it is the coordinates in the frame image at which the fingers of the right and left hands touched when the event occurred, or the three-dimensional coordinates corresponding to those coordinates. Alternatively, the event occurrence position may be, for example, two-dimensional or three-dimensional coordinates set in advance for each event. Alternatively, the event occurrence position may be calculated by a computation predetermined for each event (for example, computing the midpoint between the coordinates of the right-hand finger and the left-hand finger).
 Note that the event need not be detected at this particular point. The event detection unit 112 may appropriately execute detection in parallel with at least one of the processes from S104 to S110, or between any two of these processes.
 The skeleton control unit 114 controls the posture of the skeleton of the three-dimensional model based on the three-dimensional joint coordinates smoothed by the smoothing unit 110 and the event detected by the event detection unit 112 (S114). The skeleton control unit 114 controls the posture of the skeleton by correcting the posture of the skeleton of the three-dimensional model through a forward kinematics calculation based on the three-dimensional coordinates of the joints and an inverse kinematics calculation based on the event occurrence position. As described above, the posture of the skeleton may also be controlled by an inverse kinematics calculation based on the three-dimensional coordinates of the joints and the event occurrence position.
 The rendering unit 116 executes bone animation and rendering based on the posture of the skeleton controlled by the skeleton control unit 114, converts the result into a format suitable for output by an appropriate output means, and outputs it (S116).
 By repeating the processing from S102 to S116 an appropriate number of times, for example once per frame, the information processing device 1 can output corrected image data for each frame, or video data in which the frames are appropriately combined in the time-series direction.
 In the above description, the smoothing unit 110 performs smoothing using the frame images of a predetermined number of past frames, but the present disclosure is not limited to this. For example, when the output is a file storing the video rather than real-time processing, future frames may also be used for smoothing. In this case, the processing from S102 to S108 may be executed for a predetermined number of frames, after which the smoothing processing may be executed appropriately. In this way, the per-frame processing described above may be executed in parallel or sequentially, as appropriate.
 A specific example will now be given to explain what kind of image is output under what circumstances.
 FIG. 5 is a diagram showing the result of applying the hand shape of FIG. 2 to the three-dimensional model and rendering it with the information processing device 1 described above. In this figure, an event has occurred in which the fingertips of both hands are brought into contact in front of the mouth.
 In FIG. 5, as in the shape shown in FIG. 2, the three-dimensional model is controlled appropriately and the contact between the fingertips of both hands is expressed appropriately. This is achieved because the contact is detected as an event and an inverse kinematics calculation is executed based on the position where the event occurred.
 FIG. 6 shows the result of controlling a three-dimensional model according to a comparative example at the same point in time. As shown in FIG. 6, in the comparative example the three-dimensional model is generated by forward kinematics calculation without detecting the event. As a result, due to differences in build and the like between the subject and the three-dimensional model, the fingertips overlap and the event cannot be expressed appropriately.
 FIGS. 7 to 9 are diagrams showing another example of event reproduction.
 FIG. 7 shows a subject holding a lightly clenched hand against the forehead. FIG. 8 is a diagram in which the information processing device 1 has the three-dimensional model trace the state of FIG. 7 and renders it. As shown in this figure, the position of the forehead and the position of the clenched hand are expressed appropriately.
 In this example, the three-dimensional model is controlled without detecting the relationship between the eye and the hand as an event, that is, without considering occlusion of the eye by the forearm.
 On the other hand, if the three-dimensional model is to be controlled so that the right eye is not occluded, information on an event that the eye is not occluded by the forearm may be added at the point where the event of the forehead being covered by the hand is detected. Then, by performing an inverse kinematics calculation based on the hand positioned at the forehead and on the forearm that does not occlude the right eye, appropriate rendering of the three-dimensional model can be realized. If necessary, the forward kinematics calculation for the joints and the inverse kinematics calculation from the event occurrence positions may be repeated a predetermined number of times. For example, to realize the situation described in this paragraph, an inverse kinematics calculation from the forehead and the hand, and an inverse kinematics calculation from the forearm and the eye, may each be executed.
 FIG. 9 is an example showing control of the three-dimensional model without event detection. When only the forward kinematics calculation from the three-dimensional coordinates of the joints is performed, the positional relationship between the hand and the forehead cannot be expressed appropriately, as shown here.
 The examples in these figures can be used to output sign-language expressions with a three-dimensional model. In this case, the events may be, for example, events that are critical to sign-language expression. As described above, the information processing device 1 can use the three-dimensional model to reproduce events such as bringing both hands into contact, covering part of the face with a hand, or touching part of the face with a hand.
 When detecting an event in which the fingertips of the hands touch, as in FIG. 2, the event detection unit 112 may, for example, perform the detection based on the distance between the fingertips. For example, the event detection unit 112 may determine that the fingertips are in contact when the distance between them is equal to or less than a predetermined distance.
 When detecting an event in which part of the face is covered by a hand, as in FIG. 5, the event detection unit 112 may, for example, perform the detection based on the distance between the nose and the wrist. For example, the event detection unit 112 may determine that this event has occurred when the distance between the nose and the wrist is equal to or less than a predetermined distance.
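 A minimal sketch of such rule-based checks is given below. The joint names, the threshold values, and the choice of the fingertip midpoint as the event occurrence position are illustrative assumptions, not values taken from the disclosure.

import numpy as np

# Illustrative thresholds (assumed values, in the same units as the coordinates).
CONTACT_THRESHOLD = 0.02
OCCLUSION_THRESHOLD = 0.10

def detect_events(joints_3d: dict) -> list:
    """joints_3d maps assumed joint names such as 'right_middle_tip',
    'left_middle_tip', 'nose', and 'right_wrist' to (3,) coordinates."""
    events = []

    # Fingertip contact: distance between the two middle fingertips.
    right_tip = joints_3d["right_middle_tip"]
    left_tip = joints_3d["left_middle_tip"]
    if np.linalg.norm(right_tip - left_tip) <= CONTACT_THRESHOLD:
        # Use the midpoint as the event occurrence position.
        events.append(("fingertip_contact", (right_tip + left_tip) / 2.0))

    # Part of the face covered by a hand: distance between nose and wrist.
    if np.linalg.norm(joints_3d["nose"] - joints_3d["right_wrist"]) <= OCCLUSION_THRESHOLD:
        events.append(("face_covered_by_hand", joints_3d["right_wrist"]))

    return events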
 In addition to these, the event detection unit 112 may detect events by various other rule-based processes.
 As another example, a large number of two-dimensional images of the required events may be prepared together with a training data set in which the event in each two-dimensional image is labeled, and the fifth model NN5 may be trained by machine learning so that it outputs the event when such a two-dimensional image is input. The fifth model NN5 may use, for example, data sets of contact between fingertips, contact between a hand and part of the face, occlusion of an arbitrary part by another arbitrary part, and the like.
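 Purely as an illustration of such a learned event detector, the sketch below shows a small image classifier and one supervised training step over event-labelled images. The event classes, the network size, and the training procedure are assumptions made for the example and are not specified by the disclosure.

import torch
import torch.nn as nn

# Assumed event classes, for illustration only.
EVENT_CLASSES = ["none", "fingertip_contact", "hand_occludes_hand", "hand_touches_face"]

class EventClassifier(nn.Module):
    """Minimal sketch of a fifth-model-style detector: a 2D image in, an event label out."""

    def __init__(self, num_classes: int = len(EVENT_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(image))

def train_step(model, optimizer, images, labels):
    """One supervised step over a batch of event-labelled images."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()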
 For example, when an occlusion state is properly detected as an event, the positions of the hidden hand joints and the like can be inferred appropriately by the forward kinematics calculation and the event-based inverse kinematics calculation, and an appropriate bone animation can be generated. Thereby, for a two-dimensional image in which one hand is occluded by the other, the information processing device 1 can, for example, reconstruct the three-dimensional model while appropriately preserving the continuity of the hand shape in the hidden region.
 As described above, according to the present embodiment, by executing the above processing the information processing device 1 can, for example, make a preset three-dimensional model (avatar) move in accordance with the movement of the subject in the two-dimensional image. By performing the inverse kinematics calculation based on the detected event, the information processing device 1 can obtain images and video in which the event is appropriately reflected even when the skeleton of the subject in the two-dimensional image differs from the skeleton of the three-dimensional model.
 For example, in the case of sign language, using the three-dimensional model without showing the sign-language interpreter's face helps protect the interpreter's privacy. Using the three-dimensional model also makes it possible to convey the content of the sign language to the user without being affected by the interpreter's looks or appearance. Toon rendering can also be used to render the three-dimensional model, which makes it easy to use an arbitrary character or the like as the three-dimensional model.
 Sign-language images have been given above as an example, but the present disclosure is not limited to this. For example, it can also be applied to converting two-dimensional images and two-dimensional video into three dimensions, and to sports, VR, anonymization, surveillance cameras, marketing, and the like. For example, by representing two-dimensional video of a ball game with a three-dimensional model, the play can be reproduced from the angle the user wants to watch.
 The events can be set appropriately according to the type of video or other content to be reproduced by the three-dimensional model. For example, for the ball game above, by detecting as events the distance and positional relationship between the ball and the players, the players' actions with respect to the ball, and so on, the three-dimensional models of the players and the three-dimensional model of the ball can be expressed appropriately.
 When teaching an operation remotely to other users, for example, the users receiving the instruction can view the operation from any angle. With VR, it also becomes possible to view an artist or the like from an arbitrary viewpoint.
 By applying the technique to surveillance cameras, it also becomes possible to appropriately reproduce what kind of motion was performed while occluded, which can help prevent and deter crime.
 All of the trained models described above may be a concept that includes, for example, models obtained by training as described and then further distilling them by a common technique.
 Part or all of each device (the information processing device 1) in the embodiments described above may be configured by hardware, or may be configured by information processing of software (a program) executed by a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like. When configured by information processing of software, the software that realizes at least some of the functions of each device in the embodiments described above may be stored in a non-transitory storage medium (non-transitory computer-readable medium) such as a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), or a USB (Universal Serial Bus) memory, and loaded into a computer so that the information processing of the software is executed. The software may also be downloaded via a communication network. Furthermore, the information processing may be executed by hardware by implementing the software in a circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
 The type of storage medium that stores the software is not limited. The storage medium is not limited to a removable medium such as a magnetic disk or an optical disc, and may be a fixed storage medium such as a hard disk or a memory. The storage medium may be provided inside the computer or outside the computer.
 FIG. 10 is a block diagram showing an example of the hardware configuration of each device (the information processing device 1) in the embodiments described above. As one example, each device may be realized as a computer 7 including a processor 71, a main storage device 72 (memory), an auxiliary storage device 73 (memory), a network interface 74, and a device interface 75, which are connected via a bus 76.
 The computer 7 in FIG. 10 includes one of each component, but may include a plurality of the same component. Although one computer 7 is shown in FIG. 10, the software may be installed on a plurality of computers, and each of the plurality of computers may execute the same portion or a different portion of the processing of the software. In this case, a form of distributed computing may be used in which each of the computers communicates via the network interface 74 or the like to execute the processing. In other words, each device (the information processing device 1) in the embodiments described above may be configured as a system in which one or more computers execute instructions stored in one or more storage devices to realize the functions. It may also be configured so that information transmitted from a terminal is processed by one or more computers provided on a cloud and the processing results are transmitted to the terminal.
 The various operations of each device (the information processing device 1) in the embodiments described above may be executed in parallel using one or more processors or using a plurality of computers connected via a network. The various operations may also be distributed to a plurality of arithmetic cores in a processor and executed in parallel. Part or all of the processing, means, and the like of the present disclosure may be executed by at least one of a processor and a storage device provided on a cloud capable of communicating with the computer 7 via a network. In this way, each device in the embodiments described above may take the form of parallel computing by one or more computers.
 The processor 71 may be an electronic circuit (a processing circuit, processing circuitry, a CPU, a GPU, an FPGA, an ASIC, or the like) including a control device and an arithmetic device of a computer. The processor 71 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 71 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 71 may also include an arithmetic function based on quantum computing.
 The processor 71 can perform arithmetic processing based on data and software (programs) input from the devices of the internal configuration of the computer 7, and can output arithmetic results and control signals to those devices. The processor 71 may control the components constituting the computer 7 by executing the OS (Operating System) of the computer 7, applications, and the like.
 Each device (the information processing device 1) in the embodiments described above may be realized by one or more processors 71. Here, the processor 71 may refer to one or more electronic circuits arranged on one chip, or to one or more electronic circuits arranged on two or more chips or two or more devices. When a plurality of electronic circuits are used, the electronic circuits may communicate by wire or wirelessly.
 The main storage device 72 is a storage device that stores instructions executed by the processor 71, various data, and the like, and the information stored in the main storage device 72 is read by the processor 71. The auxiliary storage device 73 is a storage device other than the main storage device 72. These storage devices mean arbitrary electronic components capable of storing electronic information, and may be semiconductor memories. The semiconductor memory may be either a volatile memory or a nonvolatile memory. The storage device for storing various data in each device (the information processing device 1) in the embodiments described above may be realized by the main storage device 72 or the auxiliary storage device 73, or by a built-in memory incorporated in the processor 71. For example, the storage unit 102 in the embodiments described above may be realized by the main storage device 72 or the auxiliary storage device 73.
 A plurality of processors may be connected (coupled) to one storage device (memory), or a single processor may be connected. A plurality of storage devices (memories) may be connected (coupled) to one processor. When each device (the information processing device 1) in the embodiments described above is configured with at least one storage device (memory) and a plurality of processors connected (coupled) to this at least one storage device (memory), a configuration in which at least one of the plurality of processors is connected (coupled) to the at least one storage device (memory) may be included. This configuration may also be realized by storage devices (memories) and processors included in a plurality of computers. Furthermore, a configuration in which the storage device (memory) is integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.
 The network interface 74 is an interface for connecting to the communication network 8 wirelessly or by wire. Any appropriate interface, such as one conforming to an existing communication standard, may be used as the network interface 74. Information may be exchanged via the network interface 74 with an external device 9A connected via the communication network 8. The communication network 8 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), and the like, or a combination of these, as long as information can be exchanged between the computer 7 and the external device 9A. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet (registered trademark), and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).
 The device interface 75 is an interface, such as USB, that connects directly to the external device 9B.
 The external device 9A is a device connected to the computer 7 via a network. The external device 9B is a device connected directly to the computer 7.
 As one example, the external device 9A or the external device 9B may be an input device. The input device is, for example, a device such as a camera, a microphone, a motion-capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides acquired information to the computer 7. It may also be a device including an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.
 As one example, the external device 9A or the external device 9B may be an output device. The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or a speaker or the like that outputs sound. It may also be a device including an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.
 The external device 9A or the external device 9B may also be a storage device (memory). For example, the external device 9A may be a network storage or the like, and the external device 9B may be a storage such as an HDD.
 The external device 9A or the external device 9B may also be a device having some of the functions of the components of each device (the information processing device 1) in the embodiments described above. In other words, the computer 7 may transmit or receive part or all of the processing results of the external device 9A or the external device 9B.
 In this specification (including the claims), when the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used, it includes any of a, b, c, a-b, a-c, b-c, and a-b-c. It may also include a plurality of instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes adding an element other than the listed elements (a, b, and c), for example having d as in a-b-c-d.
 In this specification (including the claims), when expressions such as "with data as input", "based on data", "in accordance with data", or "depending on data" (including similar expressions) are used, unless otherwise noted, they include the case where the data itself is used as input and the case where data obtained by performing some processing on the data (for example, data with noise added, normalized data, or an intermediate representation of the data) is used as input. When it is stated that some result is obtained "based on", "in accordance with", or "depending on" data, this includes the case where the result is obtained based only on that data, and may also include the case where the result is obtained under the influence of other data, factors, conditions, and/or states in addition to that data. When it is stated that "data is output", unless otherwise noted, this includes the case where the data itself is used as the output and the case where data obtained by performing some processing on the data (for example, data with noise added, normalized data, or an intermediate representation of the data) is used as the output.
 In this specification (including the claims), when the terms "connected" and "coupled" are used, they are intended as non-limiting terms that include any of direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted appropriately according to the context in which they are used, but forms of connection/coupling that are not intentionally or naturally excluded should be interpreted non-restrictively as being included in the terms.
 In this specification (including the claims), when the expression "A configured to B" is used, it may include that the physical structure of element A has a configuration capable of executing operation B, and that a permanent or temporary setting/configuration of element A is configured/set to actually execute operation B. For example, when element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing operation B and is configured to actually execute operation B by a permanent or temporary setting of programs (instructions). When element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute operation B, regardless of whether control instructions and data are actually attached.
 In this specification (including the claims), when terms meaning inclusion or possession (for example, "comprising/including" and "having") are used, they are intended as open-ended terms, including the case of including or possessing something other than the object indicated by the object of the term. When the object of these terms meaning inclusion or possession is an expression that does not specify a quantity or that suggests the singular (an expression with the article a or an), the expression should be interpreted as not being limited to a specific number.
 In this specification (including the claims), even if an expression such as "one or more" or "at least one" is used in one place and an expression that does not specify a quantity or that suggests the singular (an expression with the article a or an) is used in another place, the latter expression is not intended to mean "one". In general, expressions that do not specify a quantity or that suggest the singular (expressions with the article a or an) should be interpreted as not necessarily being limited to a specific number.
 In this specification, when it is described that a particular advantage/result is obtained for a particular configuration of an embodiment, unless there is a specific reason to the contrary, it should be understood that the advantage/result may also be obtained for one or more other embodiments having that configuration. However, it should be understood that whether the advantage/result is obtained generally depends on various factors, conditions, and/or states, and that the advantage/result is not always obtained by the configuration. The advantage/result is merely obtained by the configuration described in the embodiments when various factors, conditions, and/or states are satisfied, and is not necessarily obtained in the claimed invention that defines that configuration or a similar configuration.
 In this specification (including the claims), when a plurality of hardware components perform predetermined processing, the hardware components may cooperate to perform the predetermined processing, or some of the hardware components may perform all of the predetermined processing. Some of the hardware components may perform part of the predetermined processing while other hardware components perform the rest of the predetermined processing. In this specification (including the claims), when an expression such as "one or more hardware components perform first processing and the one or more hardware components perform second processing" is used, the hardware that performs the first processing and the hardware that performs the second processing may be the same or different; it suffices that the hardware that performs the first processing and the hardware that performs the second processing are included in the one or more hardware components. The hardware may include an electronic circuit, a device including an electronic circuit, or the like.
 Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the individual embodiments described above. Various additions, changes, replacements, partial deletions, and the like are possible without departing from the conceptual idea and spirit of the present disclosure derived from the content defined in the claims and equivalents thereof. For example, in all of the embodiments described above, numerical values and formulas used in the explanation are presented as examples and are not limiting. The order of the operations in the embodiments is also presented as an example and is not limiting.

Claims (15)

1. An information processing device comprising:
 one or more memories; and
 one or more processors,
 wherein the one or more processors:
  acquire three-dimensional posture data of a subject included in a two-dimensional image;
  detect, from the two-dimensional image, an event relating to a movement of the subject; and
  control a posture of a skeleton of a three-dimensional model based on the three-dimensional posture data of the subject and the event.
2. The information processing device according to claim 1, wherein the acquired three-dimensional posture data is three-dimensional posture data corresponding to the posture of the subject in the two-dimensional image.
3. The information processing device according to claim 1, wherein
 the two-dimensional image is one frame of a moving image including the subject, and
 the one or more processors:
  acquire the three-dimensional posture data of the subject for each of two or more frames of the moving image; and
  control, for the two or more frames, the posture of the skeleton of the three-dimensional model based on the three-dimensional posture data of the subject acquired for each of the two or more frames and the event detected from at least one of the two or more frames.
4. The information processing device according to claim 1, wherein the skeleton of the three-dimensional model to be controlled is a skeleton of a three-dimensional model of an object different from the subject.
5. The information processing device according to claim 1, wherein the one or more processors correct, based on the detected event, the posture of the skeleton of the three-dimensional model calculated based on the three-dimensional posture data.
6. The information processing device according to claim 1, wherein the one or more processors acquire the three-dimensional posture data by estimating it from the two-dimensional image.
7. The information processing device according to claim 6, wherein the one or more processors estimate the three-dimensional posture data by:
  estimating, from the two-dimensional image, two-dimensional positions of joints of a predetermined part of the subject; and
  estimating three-dimensional coordinates of the joints from the two-dimensional positions of the joints.
8. The information processing device according to claim 7, wherein the one or more processors estimate the three-dimensional posture data by:
  inputting the two-dimensional image to a first model that detects two-dimensional data relating to the predetermined part from the two-dimensional image;
  inputting the two-dimensional data relating to the predetermined part to a second model that acquires two-dimensional coordinates of the joints of the predetermined part from the two-dimensional data relating to the predetermined part; and
  inputting the two-dimensional coordinates of the joints of the predetermined part to a third model that acquires three-dimensional coordinates of the predetermined part from the two-dimensional coordinates of the joints of the predetermined part.
9. The information processing device according to claim 7, wherein the one or more processors estimate the three-dimensional posture data by:
  inputting the two-dimensional image to a fourth model that acquires two-dimensional coordinates of the joints of the predetermined part from the two-dimensional image; and
  inputting the two-dimensional data of the predetermined part to a third model that acquires three-dimensional coordinates of the predetermined part from the two-dimensional data of the joints of the predetermined part.
10. The information processing device according to any one of claims 1 to 9, wherein the three-dimensional posture data is three-dimensional posture data smoothed in a time-series direction.
11. The information processing device according to any one of claims 1 to 9, wherein the one or more processors:
  calculate joint angles to be applied to the skeleton of the three-dimensional model based on the three-dimensional posture data; and
  calculate three-dimensional coordinates of the skeleton of the three-dimensional model by performing a forward kinematics calculation on the skeleton of the three-dimensional model based on the joint angles.
12. The information processing device according to claim 11, wherein the one or more processors further correct the posture of the skeleton of the three-dimensional model, calculated by performing the forward kinematics calculation, using an inverse kinematics calculation based on the event.
13. The information processing device according to any one of claims 1 to 9, wherein the one or more processors detect the event by a rule base or by a fifth model.
The information processing device according to any one of claims 1 to 9, wherein the event is at least one of contact or occlusion of the subject.
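One way the rule-based detection of contact and occlusion events mentioned in the two preceding claims might look (the actual rules are not disclosed in this publication) is sketched below; the ankle index, ground line, and confidence threshold are hypothetical parameters.

import numpy as np

def detect_events(joints_2d, confidences, ankle_idx=15, ground_y=440.0, conf_thresh=0.3):
    """Rule-based detection of contact and occlusion events.

    joints_2d:   (T, n_joints, 2) image coordinates per frame (y grows downward).
    confidences: (T, n_joints) per-joint detection confidence per frame.
    Returns a list of (frame_index, event_name) tuples.
    """
    events = []
    for t in range(len(joints_2d)):
        # Contact rule: the ankle lies at or below the assumed ground line.
        if joints_2d[t, ankle_idx, 1] >= ground_y:
            events.append((t, "contact"))
        # Occlusion rule: any joint confidence falls below the threshold.
        if np.any(confidences[t] < conf_thresh):
            events.append((t, "occlusion"))
    return events

rng = np.random.default_rng(0)
joints = rng.uniform(0, 480, size=(5, 17, 2))
conf = rng.uniform(0.2, 1.0, size=(5, 17))
print(detect_events(joints, conf))

A learned fifth model could replace these hand-written thresholds while keeping the same output format.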
The information processing device according to any one of claims 1 to 9, wherein the one or more processors perform bone animation of the three-dimensional model based on the posture of the skeleton of the three-dimensional model, the posture being controlled based on the three-dimensional pose data and the event.
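Once the skeleton posture has been controlled, bone animation of the three-dimensional model typically amounts to applying per-bone transforms to a skinned mesh each frame; a minimal linear-blend-skinning step, with made-up vertices, bone transforms, and weights, is sketched below (the publication does not specify a particular skinning method).

import numpy as np

def linear_blend_skinning(rest_vertices, bone_transforms, weights):
    """Deform mesh vertices for one frame of a bone animation.

    rest_vertices:   (V, 3) mesh vertices in the rest pose.
    bone_transforms: (B, 4, 4) per-bone transforms derived from the controlled skeleton posture.
    weights:         (V, B) skinning weights, each row summing to 1.
    """
    homo = np.concatenate([rest_vertices, np.ones((len(rest_vertices), 1))], axis=1)  # (V, 4)
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homo)  # every vertex under every bone
    blended = np.einsum("vb,bvi->vi", weights, per_bone)        # blend with the weights
    return blended[:, :3]

verts = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, :3, 3] = [0.1, 0.0, 0.0]       # second bone translated slightly
w = np.array([[1.0, 0.0], [0.2, 0.8]])
print(linear_blend_skinning(verts, bones, w))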
PCT/JP2022/025855 2021-06-29 2022-06-28 Information processing device WO2023277043A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021107996 2021-06-29
JP2021-107996 2021-06-29

Publications (1)

Publication Number Publication Date
WO2023277043A1 true WO2023277043A1 (en) 2023-01-05

Family

ID=84690235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/025855 WO2023277043A1 (en) 2021-06-29 2022-06-28 Information processing device

Country Status (1)

Country Link
WO (1) WO2023277043A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011215922A (en) * 2010-03-31 2011-10-27 Namco Bandai Games Inc Program, information storage medium, and image generation system
JP2017138915A (en) * 2016-02-05 2017-08-10 株式会社バンダイナムコエンターテインメント Image generation system and program

Similar Documents

Publication Publication Date Title
US9654734B1 (en) Virtual conference room
US11315287B2 (en) Generating pose information for a person in a physical environment
TWI659335B (en) Graphic processing method and device, virtual reality system, computer storage medium
US10324522B2 (en) Methods and systems of a motion-capture body suit with wearable body-position sensors
US10726625B2 (en) Method and system for improving the transmission and processing of data regarding a multi-user virtual environment
KR20210011425A (en) Image processing method and device, image device, and storage medium
WO2023071964A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
US11436790B2 (en) Passthrough visualization
JP7490072B2 (en) Vision-based rehabilitation training system based on 3D human pose estimation using multi-view images
EP4359892A1 (en) Body pose estimation using self-tracked controllers
US11748913B2 (en) Modeling objects from monocular camera outputs
JP2016105279A (en) Device and method for processing visual data, and related computer program product
US20200043211A1 (en) Preventing transition shocks during transitions between realities
JP2022537817A (en) Fast hand meshing for dynamic occlusion
CN116848556A (en) Enhancement of three-dimensional models using multi-view refinement
WO2023240999A1 (en) Virtual reality scene determination method and apparatus, and system
WO2023277043A1 (en) Information processing device
US10621788B1 (en) Reconstructing three-dimensional (3D) human body model based on depth points-to-3D human body model surface distance
Becher et al. VIRTOOAIR: virtual reality toolbox for avatar intelligent reconstruction
Roth et al. Avatar Embodiment, Behavior Replication, and Kinematics in Virtual Reality.
US20230290101A1 (en) Data processing method and apparatus, electronic device, and computer-readable storage medium
KR20220083166A (en) Method and apparatus for estimating human body
CN116156141A (en) Volume video playing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 22833189
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 22833189
    Country of ref document: EP
    Kind code of ref document: A1