WO2023277043A1 - Information processing device - Google Patents

Information processing device

Info

Publication number
WO2023277043A1
WO2023277043A1 (PCT application PCT/JP2022/025855)
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
model
information processing
data
subject
Prior art date
Application number
PCT/JP2022/025855
Other languages
French (fr)
Japanese (ja)
Inventor
達也 高村
Original Assignee
Preferred Networks, Inc. (株式会社Preferred Networks)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Preferred Networks, Inc.
Publication of WO2023277043A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 — Animation
    • G06T13/20 — 3D [Three Dimensional] animation
    • G06T13/40 — 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis
    • G06T7/70 — Determining position or orientation of objects or cameras

Definitions

  • The present disclosure relates to an information processing device.
  • When the motion of a subject in a two-dimensional image is reproduced with a three-dimensional CG model, the subject's movement cannot be reproduced accurately simply by applying the joint angles estimated from the subject in the two-dimensional image to the skeleton of the three-dimensional CG model, because of differences between the skeleton of the subject and the skeleton of the three-dimensional CG model.
  • The present disclosure provides a technique for detecting, in the subject of a two-dimensional image, an event to be reproduced by a three-dimensional CG model, and for generating a three-dimensional CG model that reproduces the event.
  • According to one embodiment, an information processing device includes one or more memories and one or more processors.
  • The one or more processors estimate three-dimensional posture data of the subject from a two-dimensional image including the subject, detect an event related to the movement of the subject from the two-dimensional image, and control the posture of the skeleton of a three-dimensional model based on the three-dimensional posture data and the event.
  • FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment.
  • FIG. 2 is a diagram showing an example of a region of a predetermined part according to one embodiment.
  • FIG. 3 is a diagram showing an example of joints according to one embodiment.
  • FIG. 4 is a flowchart showing an example of processing of an information processing device according to one embodiment.
  • FIG. 5 is a diagram showing a rendering result according to one embodiment.
  • FIG. 6 is a diagram showing a rendering result according to a comparative example.
  • FIG. 7 is a diagram showing an example of an event.
  • FIG. 8 is a diagram showing a rendering result according to one embodiment.
  • FIG. 9 is a diagram showing a rendering result according to a comparative example.
  • FIG. 10 is a block diagram showing an example of a hardware implementation according to one embodiment.
  • FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment.
  • The information processing device 1 includes an input unit 100, a storage unit 102, a part detection unit 104, a two-dimensional coordinate estimation unit 106, a three-dimensional coordinate estimation unit 108, a smoothing unit 110, an event detection unit 112, a skeleton control unit 114, a rendering unit 116, and an output unit 118.
  • When a two-dimensional image including a subject is input, the information processing device 1 causes a three-dimensional model, which is a model of the subject, to reproduce the movement shown in the two-dimensional image.
  • The three-dimensional model may be a three-dimensional CG (Computer Graphics) model, and the information processing device 1 can output images and videos of the three-dimensional model viewed from any angle.
  • The information processing device 1 estimates three-dimensional posture data of the subject from a two-dimensional image including the subject, and, separately from this estimation, detects an event related to the movement of the subject from the two-dimensional image, and controls the posture of the skeleton (bones) of the three-dimensional model based on the estimation result and the detection result. In the present disclosure, the terms "posture", "pose", and "three-dimensional coordinates" may be read interchangeably depending on the context.
  • The information processing device 1 may further apply bone animation (skeletal animation, skinned mesh animation) techniques to this skeleton control result to correct and render the three-dimensional model, and may output the result as a two-dimensional or three-dimensional still image or video.
  • The subject is not limited to a human being as described later, and may be a moving body such as an animal other than a human or a machine such as a robot, or a stationary object that interacts with a moving body.
  • An event related to the movement of the subject may include at least either an event caused by the movement of the subject itself or an event caused by a moving body acting on the subject even though the subject itself does not move.
  • The input unit 100 has an input interface that accepts input of data from the outside.
  • The information processing device 1 receives, via the input unit 100, data required for its operation and data to be processed.
  • The storage unit 102 stores data necessary for the operation of the information processing device 1 or data to be processed. Various data input from the input unit 100 may be stored in the storage unit 102 temporarily or non-temporarily.
  • The part detection unit 104 acquires a predetermined part, or a region to which the predetermined part belongs, from the two-dimensional image to be processed. The part detection unit 104 may perform this process using the first model NN1, which is a trained model.
  • The first model NN1 is a model that detects two-dimensional data related to the predetermined part from a two-dimensional image that includes the subject.
  • This model is, for example, a neural network model.
  • More specifically, the first model NN1 may be, for example, a CNN (Convolutional Neural Network) having at least one convolutional layer, or another neural network model such as an MLP (Multi-Layer Perceptron).
  • The first model NN1 is trained, for example, by an arbitrary machine learning method so that, when a two-dimensional image is input, it outputs information regarding the predetermined part.
  • The part detection unit 104 acquires information on the region to which the predetermined part belongs by inputting the two-dimensional image to the first model NN1 and forward-propagating it.
  • The information about the predetermined part may be coordinate data of the predetermined part of the subject, or the range to which the predetermined part of the subject belongs, for example, data about a bounding box.
  • The predetermined part is, for example, a part whose posture is to be corrected in the three-dimensional model.
  • FIG. 2 is a diagram schematically showing an example of detection by the part detection unit 104.
  • For example, when the subject is a human and the predetermined part is a hand, and a two-dimensional image such as that in FIG. 2 is input, the part detection unit 104 detects the positions of the hands using the first model NN1.
  • The part detection unit 104 may extract the predetermined parts by setting the region to which the right hand belongs as a bounding box B1 and the region to which the left hand belongs as a bounding box B2.
  • In this case, the first model NN1 is a model trained to output a bounding box of the predetermined part when a two-dimensional image including the subject is input.
  • Desirably, as shown in FIG. 2, the first model NN1 is a model trained so that one bounding box is output for each predetermined part, that is, so that predetermined parts and bounding boxes are extracted on a one-to-one basis.
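  • The following Python sketch illustrates one possible way the part detection unit 104 could wrap the first model NN1. It is a minimal, non-authoritative example: the class name PartDetector, the method detect_hands, the use of PyTorch, and the assumed output shape of NN1 are all illustrative assumptions, not part of the disclosure.

```python
import torch

class PartDetector:
    """Hypothetical wrapper around the trained first model NN1.

    NN1 is assumed to take a normalized RGB image tensor of shape (1, 3, H, W)
    and to return one bounding box per predetermined part (here the right hand
    and the left hand), i.e. the one-to-one extraction described above.
    """

    def __init__(self, nn1: torch.nn.Module):
        self.nn1 = nn1.eval()  # trained model, inference only

    @torch.no_grad()
    def detect_hands(self, image: torch.Tensor) -> dict:
        # image: (3, H, W). Forward-propagate the 2D image through NN1.
        # Assumed output: (1, 2, 4) holding (x_min, y_min, x_max, y_max)
        # for the right hand (index 0) and the left hand (index 1).
        boxes = self.nn1(image.unsqueeze(0))[0]
        return {"right_hand": boxes[0], "left_hand": boxes[1]}  # boxes B1, B2
```

  • Cropping the input image with the returned boxes B1 and B2 would then yield the inputs for the second model NN2.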
  • Returning to FIG. 1, the two-dimensional coordinate estimation unit 106 estimates the coordinates of predetermined locations within the predetermined part detected by the part detection unit 104.
  • The predetermined locations are, for example, the positions of the subject's joints.
  • The two-dimensional coordinate estimation unit 106 may perform this process using the second model NN2, which is a trained model.
  • The second model NN2 is a model that acquires the two-dimensional coordinates of the joints in the predetermined part from two-dimensional data of the predetermined part (for example, the two-dimensional image within the bounding box).
  • This model is, for example, a neural network model. More specifically, the second model NN2 may be, for example, a CNN having at least one convolutional layer, or another neural network model such as an MLP.
  • The second model NN2 is trained, for example, by an arbitrary machine learning method so that, when two-dimensional data relating to the predetermined part is input, it outputs the two-dimensional coordinates of the joints in the predetermined part.
  • The two-dimensional coordinate estimation unit 106 acquires the two-dimensional coordinates of the joints by inputting the region of the predetermined part to the second model NN2 and forward-propagating it.
  • The two-dimensional coordinates may be represented in a coordinate system centered on the origin of the bounding box, or in a coordinate system centered on the origin of the two-dimensional image.
  • When the subject is a human, the two-dimensional coordinates may be relative coordinates with a part of the human, such as the nose, as the origin.
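  • As a small illustration of the coordinate conventions just mentioned, the following sketch converts joint coordinates from the bounding-box frame to the image frame and then expresses them relative to a reference point such as the nose. The function names are hypothetical and only assume NumPy arrays as input.

```python
import numpy as np

def box_to_image_coords(keypoints_box: np.ndarray, box_origin: np.ndarray) -> np.ndarray:
    """Convert (J, 2) joint coordinates expressed in the bounding-box frame
    (origin at the box's top-left corner) into image-frame coordinates."""
    return keypoints_box + box_origin          # broadcast (J, 2) + (2,)

def to_relative_coords(keypoints_img: np.ndarray, origin_point: np.ndarray) -> np.ndarray:
    """Express image-frame coordinates relative to a reference point on the
    subject, e.g. the nose, as one of the options described above."""
    return keypoints_img - origin_point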
  • FIG. 3 is a diagram showing an example of the positions of the hand joints within the bounding boxes of FIG. 2. The points indicated by dots represent the joints of the predetermined part within the bounding box.
  • The two-dimensional coordinate estimation unit 106 acquires the joint positions shown in this figure from the image within the bounding box of FIG. 2 by using the second model NN2.
  • The second model NN2 is trained by an appropriate machine learning method so that, when an image such as the bounding box image in FIG. 2 is input, it outputs the joint positions shown in FIG. 3. The training data may be created by the user. Neural network models that have already been optimized and published may be used as the first model NN1 and the second model NN2.
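  • The following is a minimal sketch of what the second model NN2 might look like as a small CNN that regresses joint coordinates from a cropped hand image. The layer sizes and the assumption of 21 hand joints are illustrative choices, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class HandKeypointNet(nn.Module):
    """Illustrative second model NN2: regresses 2D joint coordinates from a
    cropped bounding-box image. 21 joints is an assumed convention."""

    def __init__(self, num_joints: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_joints * 2)

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: (N, 3, H, W) image inside a bounding box, normalized to [0, 1]
        x = self.features(crop).flatten(1)            # (N, 64)
        return self.head(x).view(x.shape[0], -1, 2)   # (N, num_joints, 2)
```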
  • Note that the processes of the part detection unit 104 and the two-dimensional coordinate estimation unit 106 need not be executed separately. As indicated by the dotted line in FIG. 1, a fourth model NN4, which is a trained model that acquires the two-dimensional coordinates of the joints in the predetermined part when a two-dimensional image including the subject is input, may be used.
  • In this case, the part detection unit 104 may be omitted.
  • For example, the two-dimensional coordinate estimation unit 106 may acquire the two-dimensional coordinates of the joints in the predetermined part from the two-dimensional image by inputting the two-dimensional image including the subject, received via the input unit 100, to the fourth model NN4.
  • Returning to FIG. 1, the three-dimensional coordinate estimation unit 108 acquires the three-dimensional coordinates of the subject's joints based on the two-dimensional coordinates of the joints acquired by the two-dimensional coordinate estimation unit 106.
  • The three-dimensional coordinate estimation unit 108 may perform this process using the third model NN3, which is a trained model.
  • The third model NN3 is a model that acquires three-dimensional coordinates from the two-dimensional coordinates of the joints.
  • This model is, for example, a neural network model. More specifically, the third model NN3 may be, for example, a CNN or another neural network model such as an MLP. More simply, the third model NN3 may be a simple model having fully connected layers that take the two-dimensional coordinates of the joints as the input layer.
  • The third model NN3 is trained by an arbitrary machine learning method as a model that acquires three-dimensional coordinates from the two-dimensional coordinates of the joints.
  • The three-dimensional coordinate estimation unit 108 acquires the three-dimensional coordinates of the joints by inputting the two-dimensional coordinates of the joints to the third model NN3 and forward-propagating them.
  • For example, the three-dimensional coordinate estimation unit 108 converts the two-dimensional coordinates of each point in FIG. 3 into three-dimensional coordinates in a three-dimensional space based on the third model NN3.
  • As above, the coordinates in the three-dimensional space may be, for example, coordinates expressed in a predetermined coordinate system centered on a predetermined origin, or relative coordinates centered on a predetermined position of the three-dimensional model, for example, the position of the nose.
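  • A minimal sketch of the third model NN3 as a fully connected network that lifts 2D joint coordinates to 3D is shown below. The joint count and layer widths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LiftingMLP(nn.Module):
    """Illustrative third model NN3: lifts 2D joint coordinates to 3D."""

    def __init__(self, num_joints: int = 21, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (N, num_joints, 2) relative 2D coordinates of the joints
        n = joints_2d.shape[0]
        return self.net(joints_2d.reshape(n, -1)).view(n, -1, 3)  # (N, num_joints, 3)
```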
  • The part detection unit 104, the two-dimensional coordinate estimation unit 106, and the three-dimensional coordinate estimation unit 108 are described above as performing their respective processes using trained neural network models, but they are not limited to this.
  • For example, the output may be obtained using a function or the like that outputs appropriately derived data based on statistics of the respective inputs.
  • The smoothing unit 110 smooths the acquired three-dimensional coordinates using information from multiple frames.
  • For example, the smoothing unit 110 smooths the three-dimensional coordinates of the joints obtained from the two-dimensional image of the frame of interest using the three-dimensional coordinates of the joints obtained from the two-dimensional images of the preceding and succeeding frames.
  • The smoothing unit 110 can suppress jitter in the time series of the three-dimensional coordinates of the joints and suppress unnatural motion in which joint positions jump abruptly to other positions when the three-dimensional model is animated.
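  • The disclosure does not prescribe a particular filter; the following sketch shows one simple possibility, a moving average of the joint coordinates over a sliding window of recent frames. The window size is an arbitrary illustrative choice.

```python
import numpy as np
from collections import deque

class JointSmoother:
    """Smooths per-frame 3D joint coordinates over a sliding window of recent
    frames. A plain moving average is used here purely for illustration."""

    def __init__(self, window: int = 5):
        self.history = deque(maxlen=window)

    def smooth(self, joints_3d: np.ndarray) -> np.ndarray:
        # joints_3d: (J, 3) coordinates estimated for the current frame
        self.history.append(joints_3d)
        return np.mean(np.stack(self.history), axis=0)  # (J, 3)
```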
  • The event detection unit 112 detects the occurrence of events in the two-dimensional image.
  • The events may be predetermined. When the subject is a human and the predetermined part is a hand, the events may include, for example, contact between the hands, or occlusion of a hand in the two-dimensional image.
  • The event detection unit 112 may detect an event based on rules, or may detect an event using a fifth model NN5 (not shown).
  • The fifth model NN5 is, for example, a neural network model, and may be a model such as a CNN or an MLP.
  • The fifth model NN5 is a model appropriately trained to output the three-dimensional coordinates of the joints when the two-dimensional coordinates of the joints are input.
  • The fifth model NN5 may be a model that can also take a two-dimensional image as input.
  • The training dataset may be generated by comparing images taken using a depth camera that can acquire three-dimensional coordinates in space with two-dimensional images.
  • As another example, the dataset may be generated by reconstructing a three-dimensional image from multiple cameras that capture the subject at different angles from different positions and comparing it with a two-dimensional image captured by one of those cameras or by another camera.
  • These shots may be realized by photographing a subject with markers attached to the joints whose coordinates are to be acquired.
  • The data for obtaining the three-dimensional coordinates may also be acquired not with a camera but by using sound waves (including ultrasonic waves), electromagnetic waves (including visible light and light in other bands), electrodes, or the like emitted from or attached to the joints of the subject, or by obtaining information such as their reflections from the joints.
  • A dataset used for training may also be generated by comparing three-dimensional data acquired and reconstructed by other techniques, such as motion capture, with data obtained by photographing the subject with a camera.
  • In any of these cases, training can be executed by associating the positions of the subject's joints in the captured two-dimensional image (two-dimensional coordinates) with the positions of the subject's joints in space (three-dimensional coordinates).
  • The positions (three-dimensional coordinates) of the joints of the subject may be expressed, for example, in the camera coordinate system used when the two-dimensional image was captured, or in another coordinate system.
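  • As a hedged sketch of how such paired data could be used, the following training loop fits a model that maps 2D joint coordinates observed in captured images to the corresponding 3D joint coordinates (for example, obtained from a depth camera or motion capture). The optimizer, loss, and data-loader interface are illustrative assumptions, not taken from the disclosure.

```python
import torch
import torch.nn as nn

def train_lifting_model(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3):
    """Illustrative training loop over a dataset of paired joint coordinates.
    Each batch pairs 2D joint coordinates (N, J, 2) from captured images with
    3D joint coordinates (N, J, 3) acquired as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for joints_2d, joints_3d in loader:
            optimizer.zero_grad()
            loss = criterion(model(joints_2d), joints_3d)
            loss.backward()
            optimizer.step()
    return model
```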
  • The skeleton control unit 114 uses the three-dimensional coordinate data of the joints smoothed by the smoothing unit 110 to generate the posture of the skeleton of the three-dimensional model, and corrects that posture. The skeleton control unit 114 may, for example, set key frames appropriately and control the pose of the skeleton for each frame.
  • The skeleton control unit 114 also calculates the joint angles of the subject based on the three-dimensional coordinate data of the joints output by the smoothing unit 110. These joint angles are applied to the skeleton of the three-dimensional model.
  • The three-dimensional posture data of the subject is data specifying the posture of the subject in three-dimensional space, such as this three-dimensional coordinate data and the joint angles.
  • The skeleton control unit 114 controls the posture of the skeleton by executing a forward kinematics calculation on the skeleton of the three-dimensional model based on the calculated joint angles.
  • The skeleton control unit 114 further corrects the posture of the skeleton controlled by the forward kinematics calculation by an inverse kinematics calculation based on the event, so that the three-dimensional model reproduces the event detected by the event detection unit 112.
  • For example, the skeleton control unit 114 sets the portions of the skeleton controlled by the forward kinematics calculation that are essential for reproducing the event (for example, the middle fingers of the right hand and the left hand) so that they are placed at the event occurrence position, and performs an inverse kinematics calculation on the skeleton of the three-dimensional model based on this event occurrence position to correct the pose of the skeleton of the three-dimensional model.
  • By executing the above processing, the skeleton control unit 114 generates the posture of the skeleton of the three-dimensional model by a forward kinematics calculation based on the three-dimensional coordinates of the joints, and corrects that posture by an inverse kinematics calculation based on the event.
  • Alternatively, the skeleton control unit 114 may adjust the three-dimensional positions of the joints according to the event occurrence position and, based on the adjusted result, generate the posture of the skeleton by an inverse kinematics calculation based on the event occurrence position.
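  • The disclosure states only that the posture is corrected by an inverse kinematics calculation based on the event. The following sketch uses the FABRIK algorithm, operating directly on the 3D joint positions of one chain (for example, from the wrist to a fingertip), as one concrete, non-authoritative way such a correction could be realized; the choice of algorithm is an assumption.

```python
import numpy as np

def fabrik_correct(chain: np.ndarray, target: np.ndarray,
                   iterations: int = 10, tol: float = 1e-4) -> np.ndarray:
    """Illustrative inverse-kinematics correction (FABRIK) of one joint chain.

    chain: (J, 3) joint positions of the chain produced by the preceding pose
    estimation / forward-kinematics step, ordered from the chain root (e.g.
    the wrist) to the end effector (e.g. a fingertip).
    target: event occurrence position at which the end effector must be placed.
    Bone lengths are preserved.
    """
    chain = np.asarray(chain, dtype=float).copy()
    lengths = np.linalg.norm(np.diff(chain, axis=0), axis=1)   # bone lengths
    root = chain[0].copy()
    if np.sum(lengths) < np.linalg.norm(target - root):
        # Target unreachable: stretch the chain straight toward the target.
        direction = (target - root) / np.linalg.norm(target - root)
        for i in range(1, len(chain)):
            chain[i] = chain[i - 1] + direction * lengths[i - 1]
        return chain
    for _ in range(iterations):
        # Backward pass: place the end effector on the target.
        chain[-1] = target
        for i in range(len(chain) - 2, -1, -1):
            d = chain[i] - chain[i + 1]
            chain[i] = chain[i + 1] + d / np.linalg.norm(d) * lengths[i]
        # Forward pass: re-anchor the root.
        chain[0] = root
        for i in range(1, len(chain)):
            d = chain[i] - chain[i - 1]
            chain[i] = chain[i - 1] + d / np.linalg.norm(d) * lengths[i - 1]
        if np.linalg.norm(chain[-1] - target) < tol:
            break
    return chain
```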
  • The rendering unit 116 executes bone animation and rendering of the three-dimensional model whose skeleton has been controlled by the skeleton control unit 114.
  • The rendering unit 116 executes rendering using, for example, a technique such as ray tracing, and converts the three-dimensional model into a two-dimensional image or a two-dimensional video.
  • The output unit 118 outputs the image and video information generated by the rendering unit 116 using a two-dimensional output device such as a display.
  • The output unit 118 may also output the image and video information generated by the rendering unit 116 based on the user's viewpoint and field of view on a device capable of appropriately outputting three-dimensional information.
  • The output unit 118 may, for example, output by streaming, or may store the data in the storage unit 102 or the like.
  • FIG. 4 is a flowchart showing processing of the information processing device 1 according to one embodiment.
  • The information processing device 1 first sets a three-dimensional model for tracing the movement of the subject (S100).
  • The subsequent processing is processing for appropriately operating this three-dimensional model.
  • This three-dimensional model may be a preset or user-specified avatar.
  • Next, the information processing device 1 receives input of a two-dimensional image via the input unit 100 (S102).
  • The two-dimensional image may be a frame-by-frame image of a two-dimensional video.
  • The information processing device 1 may acquire in real time a two-dimensional image including the subject captured by a camera, or may process a video captured in advance frame by frame.
  • Next, the part detection unit 104 detects the predetermined part from the two-dimensional image including the subject (S104).
  • For example, the subject may be a human and the hands may be the predetermined parts.
  • Next, the two-dimensional coordinate estimation unit 106 estimates the positions of the joints in the predetermined part detected by the part detection unit 104, and outputs the two-dimensional coordinates of the joints in the two-dimensional image (S106).
  • The two-dimensional coordinate estimation unit 106 may estimate, for example, the joint positions of each finger, the joint position between the forearm and the hand, or the joint positions of the arm.
  • Next, the three-dimensional coordinate estimation unit 108 estimates three-dimensional coordinates in a three-dimensional space based on the two-dimensional coordinates of the joints estimated by the two-dimensional coordinate estimation unit 106 (S108).
  • Next, the smoothing unit 110 performs smoothing processing in the time-series direction on the three-dimensional coordinates of the joints estimated by the three-dimensional coordinate estimation unit 108 (S110).
  • This smoothing process may be any suitable process.
  • For example, the smoothing unit 110 obtains smoothed three-dimensional coordinates of the joints in the current frame based on the three-dimensional coordinates of the joints in a predetermined number of past frames and the three-dimensional coordinates of the joints in the current frame (that is, the three-dimensional coordinates of the joints in a plurality of frames including the current frame).
  • Next, the event detection unit 112 detects whether an event has occurred in the frame image being processed and, if an event has occurred, acquires the position where the event occurred (S112).
  • The event may be, for example, contact between the fingers of the right hand and the left hand, occlusion of one hand by the other hand, or the like, but is not limited to these. For example, it may be contact between a hand of the subject and the face, occlusion of the face by a hand of the subject, or the like.
  • The occurrence position of an event is an example of data related to the event, and is, for example, a coordinate in the two-dimensional image or a three-dimensional coordinate corresponding to that coordinate.
  • The event occurrence position may be, for example, a two-dimensional coordinate or a three-dimensional coordinate preset for each event.
  • The event occurrence position may also be calculated, for example, by a calculation predetermined for each event (for example, calculating the midpoint between the coordinates of a finger of the right hand and a finger of the left hand).
  • The processing of the event detection unit 112 may be executed in parallel with at least one of the processes from S104 to S110, or between any two of those processes, as appropriate.
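  • A minimal sketch of the rule-based detection and the midpoint-based occurrence position described above is shown below. The fingertip index and the distance threshold are assumptions for illustration only.

```python
import numpy as np

# Assumed index of the fingertip of interest within each hand's joint array;
# the actual index depends on the keypoint convention used.
FINGERTIP = 8

def detect_fingertip_contact(right_hand: np.ndarray, left_hand: np.ndarray,
                             threshold: float = 0.02):
    """Illustrative rule-based event detection (S112): fingertip contact is
    reported when the distance between the two fingertips falls below a
    predetermined threshold, and the event occurrence position is taken as
    the midpoint of the two fingertips, as described above.

    right_hand, left_hand: (J, 3) smoothed 3D joint coordinates.
    Returns (event_occurred, occurrence_position or None).
    """
    p_r, p_l = right_hand[FINGERTIP], left_hand[FINGERTIP]
    if np.linalg.norm(p_r - p_l) <= threshold:
        return True, (p_r + p_l) / 2.0   # midpoint as the occurrence position
    return False, None
```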
  • Next, the skeleton control unit 114 controls the posture of the skeleton of the three-dimensional model based on the three-dimensional coordinates of the joints smoothed by the smoothing unit 110 and the event detected by the event detection unit 112 (S114).
  • For example, the skeleton control unit 114 controls the posture of the skeleton of the three-dimensional model by performing a forward kinematics calculation based on the three-dimensional coordinates of the joints and correcting the result with an inverse kinematics calculation based on the event occurrence position.
  • Alternatively, the posture of the skeleton may be controlled by inverse kinematics calculations based on the three-dimensional coordinates of the joints and the event occurrence position.
  • Next, the rendering unit 116 executes bone animation and rendering based on the posture of the skeleton controlled by the skeleton control unit 114, converts the result into an appropriate format for output by an appropriate output means, and outputs it (S116).
  • The information processing device 1 repeats the processing from S102 to S116 an appropriate number of times, for example, for the number of frames, and can thereby output corrected image data for each frame, or video data in which the frames are appropriately combined in the time-series direction.
  • In the above, the smoothing unit 110 performs smoothing using a predetermined number of past frame images, but the present invention is not limited to this.
  • For example, future frames may also be used for smoothing.
  • In this case, the processes from S102 to S108 may first be executed for a predetermined number of frames, and the smoothing process may then be executed as appropriate.
  • The frame-by-frame processing described above may be executed in parallel or serially, as appropriate.
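  • The following sketch strings the steps S102 to S116 together as a per-frame loop. Each argument is a hypothetical callable standing in for the corresponding unit described above; this decomposition is an assumption for illustration rather than the disclosed implementation.

```python
def process_video(frames, estimate_pose_3d, detect_event, control_skeleton, render):
    """Illustrative per-frame loop corresponding to S102-S116 in FIG. 4."""
    outputs = []
    for frame in frames:                                  # S102: next 2D frame
        joints_3d = estimate_pose_3d(frame)               # S104-S110: detection,
                                                          # 2D/3D estimation, smoothing
        event = detect_event(frame, joints_3d)            # S112: event and its position
        skeleton_pose = control_skeleton(joints_3d, event)  # S114: FK + event-based IK
        outputs.append(render(skeleton_pose))             # S116: bone animation, rendering
    return outputs
```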
  • FIG. 5 is a diagram showing the result of rendering the hand shape of FIG. 2. In this figure, an event has occurred in which the fingertips of both hands are brought into contact in front of the mouth.
  • As shown in FIG. 5, just like the shape shown in FIG. 2, the three-dimensional model is appropriately controlled so that the fingertips of both hands are in contact. This can be achieved because the contact is detected as an event and the inverse kinematics calculation is performed based on the occurrence position of the event.
  • FIG. 6 shows the result of controlling the three-dimensional model at the same timing according to a comparative example.
  • In the comparative example, the three-dimensional model is generated by a forward kinematics calculation without detecting the event. Therefore, due to differences in physique and the like between the subject and the three-dimensional model, the fingertips overlap and the event cannot be expressed appropriately.
  • FIGS. 7 to 9 are diagrams showing another example of event reproduction.
  • FIG. 7 shows a subject with a lightly clenched hand placed on the forehead.
  • FIG. 8 is a diagram in which the information processing device 1 renders a three-dimensional model that traces the state of FIG. 7. As shown in this figure, the position of the forehead and the position of the clenched hand are properly represented.
  • In this case, the three-dimensional model is controlled without detecting the relationship between the eyes and the hand as an event, that is, without considering occlusion of the eyes by the forearm.
  • Inverse kinematics calculations can be performed so as to achieve appropriate rendering of the three-dimensional model. If necessary, the forward kinematics calculation for the joints and the inverse kinematics calculation from the event occurrence position may be repeated a predetermined number of times. For example, when implementing the aspect described in this paragraph, an inverse kinematics calculation from the forehead and the hand and an inverse kinematics calculation from the forearm and the eyes may each be performed.
  • FIG. 9 is an example showing control of a three-dimensional model without event detection. If only the forward kinematics calculation from the three-dimensional coordinates of the joints is performed, the positional relationship between the hand and the forehead cannot be represented appropriately.
  • The examples in these figures can be used, for example, to output sign-language expressions using a three-dimensional model.
  • In this case, the event may be an event that is critical to the expression in sign language.
  • For example, the information processing device 1 detects, as an event, the touching of both hands, the covering of a part of the face with a hand, or the touching of a part of the face with a hand, and can reproduce the event using the three-dimensional model.
  • When the event detection unit 112 detects an event in which fingertips touch each other, as shown in FIG. 2, the detection may be performed based on the distance between the fingertips. For example, the event detection unit 112 may determine that the fingertips are in contact when the distance between the fingertips is equal to or less than a predetermined distance.
  • Similarly, the event detection unit 112 may perform detection based on the distance between the nose and the wrist. For example, the event detection unit 112 may determine that the event has occurred when the distance between the nose and the wrist is equal to or less than a predetermined distance.
  • In this way, the event detection unit 112 may detect events by various kinds of rule-based processing.
  • Instead of such rule-based detection, the fifth model NN5 may be trained by machine learning.
  • For this training, the fifth model NN5 may use, for example, a dataset of contact between fingertips, contact between a hand and a part of the face, and occlusion of an arbitrary part by another arbitrary part.
  • When occlusion is detected as an event, the positions of the hidden hand joints can be appropriately inferred by the forward kinematics calculation and the inverse kinematics calculation based on the event, and bone animation can be generated accordingly.
  • For example, for a two-dimensional image in which one hand is occluded by the other, the information processing device 1 can reconstruct the three-dimensional model while appropriately maintaining the continuity of the shape of the hand in the hidden region.
  • By executing the above-described processing, the information processing device 1 can, for example, make a preset three-dimensional model (avatar) perform motions that match the movement of the subject in the two-dimensional image.
  • Since the information processing device 1 performs the inverse kinematics calculation based on the event that has occurred, images that appropriately reflect the event can be acquired even if there is a difference between the skeleton of the subject in the two-dimensional image and the skeleton of the three-dimensional model.
  • Outputting a three-dimensional model without showing the face of the sign-language interpreter also protects the privacy of the sign-language interpreter.
  • Toon rendering can also be used to render the three-dimensional model. In this case, any character or the like can easily be used as the three-dimensional model.
  • In the above, sign-language images are given as an example, but the application is not limited to this.
  • For example, the technique can be applied to three-dimensional rendering of two-dimensional images and two-dimensional videos, sports, VR, anonymization, surveillance cameras, marketing, and the like.
  • For example, by reproducing a two-dimensional image of a ball game with a three-dimensional model, it is possible to reproduce the play from the angle the user wants to watch.
  • Events can be set appropriately depending on the type of video to be reproduced by the three-dimensional model. For example, in the above ball game, by detecting events such as the distance between the ball and the players, their positional relationship, and the actions of the players toward the ball, the three-dimensional models of the players and the three-dimensional model of the ball can be expressed appropriately.
  • In VR, it also becomes possible to view artists and other performers from arbitrary viewpoints.
  • All of the above trained models may be concepts that include, for example, models that have been trained as described and further distilled by a general method.
  • Each device (the information processing device 1) in the above-described embodiment may be configured with hardware, or may be configured by information processing of software (a program) executed by a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like.
  • When the devices are configured by information processing of software, the software that realizes at least some of the functions of each device in the above-described embodiment may be stored on a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), a USB (Universal Serial Bus) memory, or another non-transitory storage medium (non-transitory computer-readable medium), and read into a computer to execute the information processing of the software.
  • The software may also be downloaded via a communication network.
  • Furthermore, the information processing may be performed by hardware by implementing the software in a circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
  • The type of storage medium that stores the software is not limited.
  • The storage medium is not limited to a removable one such as a magnetic disk or an optical disk, and may be a fixed storage medium such as a hard disk or a memory. The storage medium may be provided inside the computer or outside the computer.
  • FIG. 10 is a block diagram showing an example of the hardware configuration of each device (the information processing device 1) in the above embodiment.
  • Each device may be implemented as a computer 7 that includes, for example, a processor 71, a main storage device 72 (memory), an auxiliary storage device 73 (memory), a network interface 74, and a device interface 75, which are connected via a bus 76.
  • Although the computer 7 in FIG. 10 has one of each component, it may have a plurality of the same component. Also, although one computer 7 is shown in FIG. 10, the software may be installed on a plurality of computers. In this case, the processing may take the form of distributed computing in which the computers communicate via the network interface 74 or the like to execute the processing.
  • Each device (the information processing device 1) in the above-described embodiment may be configured as a system in which the functions are realized by one or more computers executing instructions stored in one or more storage devices.
  • Alternatively, information transmitted from a terminal may be processed by one or more computers provided on a cloud, and the processing results may be transmitted to the terminal.
  • The various operations of each device (the information processing device 1) in the above-described embodiment may be executed in parallel using one or more processors or using multiple computers connected via a network. The various operations may also be distributed to a plurality of arithmetic cores in the processor and executed in parallel. Part or all of the processing, means, and the like of the present disclosure may be executed by at least one of a processor and a storage device provided on a cloud capable of communicating with the computer 7 via a network. Thus, each device in the above-described embodiment may take the form of parallel computing by one or more computers.
  • The processor 71 may be an electronic circuit (a processing circuit, processing circuitry, a CPU, a GPU, an FPGA, an ASIC, or the like) including a control device and an arithmetic device of the computer. The processor 71 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 71 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 71 may also include arithmetic functions based on quantum computing.
  • The processor 71 can perform arithmetic processing based on data and software (programs) input from each device or the like of the internal configuration of the computer 7, and can output arithmetic results and control signals to each device or the like.
  • The processor 71 may control each component of the computer 7 by executing the OS (Operating System) of the computer 7, applications, and the like.
  • Each device (the information processing device 1) in the above-described embodiment may be realized by one or more processors 71.
  • Here, the processor 71 may refer to one or more electronic circuits arranged on one chip, or to one or more electronic circuits arranged on two or more chips or two or more devices. When a plurality of electronic circuits is used, the electronic circuits may communicate by wire or wirelessly.
  • The main storage device 72 is a storage device that stores instructions executed by the processor 71 and various data, and the information stored in the main storage device 72 is read by the processor 71.
  • The auxiliary storage device 73 is a storage device other than the main storage device 72. These storage devices may be any electronic components capable of storing electronic information, and may be semiconductor memories. A semiconductor memory may be either a volatile memory or a non-volatile memory.
  • The storage device for storing various data in each device (the information processing device 1) in the above-described embodiment may be realized by the main storage device 72 or the auxiliary storage device 73, or may be realized by built-in memory incorporated in the processor 71.
  • For example, the storage unit 102 in the above-described embodiment may be realized by the main storage device 72 or the auxiliary storage device 73.
  • A plurality of processors may be connected (coupled) to one storage device (memory), or a single processor may be connected.
  • A plurality of storage devices (memories) may be connected (coupled) to one processor.
  • When each device (the information processing device 1) in the above-described embodiment is configured with at least one storage device (memory) and a plurality of processors connected (coupled) to this at least one storage device (memory), a configuration in which at least one of the plurality of processors is connected (coupled) to the at least one storage device (memory) may be included.
  • This configuration may also be realized by storage devices (memories) and processors included in a plurality of computers.
  • Furthermore, a configuration in which a storage device (memory) is integrated with a processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.
  • The network interface 74 is an interface for connecting to the communication network 8 wirelessly or by wire. As the network interface 74, an appropriate interface, such as one conforming to existing communication standards, may be used. The network interface 74 may exchange information with the external device 9A connected via the communication network 8.
  • The communication network 8 may be any one of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or the like, or a combination thereof, as long as information can be exchanged between the computer 7 and the external device 9A. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet (registered trademark), and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).
  • The device interface 75 is an interface, such as USB, that directly connects to the external device 9B.
  • The external device 9A is a device connected to the computer 7 via the network.
  • The external device 9B is a device directly connected to the computer 7.
  • The external device 9A or the external device 9B may be, for example, an input device.
  • The input device is, for example, a device such as a camera, a microphone, a motion capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides the computer 7 with acquired information.
  • Alternatively, a device including an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone, may be used.
  • The external device 9A or the external device 9B may also be, for example, an output device.
  • The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel.
  • A speaker or the like that outputs sound may also be used.
  • Alternatively, a device including an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone, may be used.
  • The external device 9A or the external device 9B may also be a storage device (memory).
  • For example, the external device 9A may be a network storage or the like, and the external device 9B may be a storage such as an HDD.
  • The external device 9A or the external device 9B may also be a device having the functions of some of the components of each device (the information processing device 1) in the above-described embodiment. That is, the computer 7 may transmit or receive part or all of the processing results to or from the external device 9A or the external device 9B.
  • the expression "at least one (one) of a, b and c" or “at least one (one) of a, b or c" includes any of a, b, c, a-b, ac, b-c, or a-b-c. Also, multiple instances of any element may be included, such as a-a, a-b-b, a-a-b-b-c-c, and so on. It also includes the addition of other elements than the listed elements (a, b and c), such as having d such as a-b-c-d.
  • connection and “coupled” when used, they refer to direct connection/coupling, indirect connection/coupling , electrically connected/coupled, communicatively connected/coupled, operatively connected/coupled, physically connected/coupled, etc. intended as a term.
  • the term should be interpreted appropriately according to the context in which the term is used, but any form of connection/bonding that is not intentionally or naturally excluded is not included in the term. should be interpreted restrictively.
  • the physical structure of element A is such that it is capable of performing operation B has a configuration, including that a permanent or temporary setting/configuration of element A is configured/set to actually perform action B good.
  • element A is a general-purpose processor
  • the processor has a hardware configuration that can execute operation B, and operation B can be performed by setting a permanent or temporary program (instruction). It just needs to be configured to actually run.
  • the element A is a dedicated processor or a dedicated arithmetic circuit, etc., regardless of whether or not control instructions and data are actually attached, the circuit structure of the processor actually executes the operation B. It just needs to be implemented.
  • In the present disclosure (including the claims), when a plurality of pieces of hardware performs predetermined processing, the pieces of hardware may cooperate to perform the predetermined processing, or a part of the hardware may perform all of the predetermined processing. Also, some hardware may perform a part of the predetermined processing, and other hardware may perform the rest of the predetermined processing.
  • In the present disclosure (including the claims), the hardware that performs the first processing and the hardware that performs the second processing may be the same or different. In other words, the hardware that performs the first processing and the hardware that performs the second processing may each be included in the one or more pieces of hardware.
  • Note that hardware may include an electronic circuit or a device including an electronic circuit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

[Problem] To generate an appropriate three-dimensional CG model from a two-dimensional image. [Solution] An information processing device according to the present invention is provided with one or more memories and one or more processors. The one or more processors estimate three-dimensional posture data of an imaging subject from a two-dimensional image including the imaging subject, detect an event related to a motion of the imaging subject from the two-dimensional image, and control the posture of a skeleton of the three-dimensional model on the basis of the three-dimensional posture data and the event.

Description

Information processing device
The present disclosure relates to an information processing device.
When the motion of a subject in a two-dimensional image is reproduced with a three-dimensional CG model, the subject's movement cannot be reproduced accurately simply by applying the joint angles estimated from the subject in the two-dimensional image to the skeleton of the three-dimensional CG model, because of differences between the skeleton of the subject and the skeleton of the three-dimensional CG model.
JP 2019-197278 A
The present disclosure provides a technique for detecting, in the subject of a two-dimensional image, an event to be reproduced by a three-dimensional CG model, and for generating a three-dimensional CG model that reproduces the event.
According to one embodiment, an information processing device includes one or more memories and one or more processors. The one or more processors estimate three-dimensional posture data of the subject from a two-dimensional image including the subject, detect an event related to the movement of the subject from the two-dimensional image, and control the posture of the skeleton of a three-dimensional model based on the three-dimensional posture data and the event.
FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment. FIG. 2 is a diagram showing an example of a region of a predetermined part according to one embodiment. FIG. 3 is a diagram showing an example of joints according to one embodiment. FIG. 4 is a flowchart showing an example of processing of an information processing device according to one embodiment. FIG. 5 is a diagram showing a rendering result according to one embodiment. FIG. 6 is a diagram showing a rendering result according to a comparative example. FIG. 7 is a diagram showing an example of an event. FIG. 8 is a diagram showing a rendering result according to one embodiment. FIG. 9 is a diagram showing a rendering result according to a comparative example. FIG. 10 is a block diagram showing an example of a hardware implementation according to one embodiment.
Embodiments of the present invention will be described below with reference to the drawings. The drawings and the description of the embodiments are given by way of example and do not limit the invention. In the description, a portion described as an image can be read as a frame of a video as appropriate, unless otherwise specified.
FIG. 1 is a block diagram schematically showing an information processing device according to one embodiment. The information processing device 1 includes an input unit 100, a storage unit 102, a part detection unit 104, a two-dimensional coordinate estimation unit 106, a three-dimensional coordinate estimation unit 108, a smoothing unit 110, an event detection unit 112, a skeleton control unit 114, a rendering unit 116, and an output unit 118. When a two-dimensional image including a subject is input, the information processing device 1 causes a three-dimensional model, which is a model of the subject, to reproduce the movement shown in the two-dimensional image. The three-dimensional model may be a three-dimensional CG (Computer Graphics) model, and the information processing device 1 can output images and videos of the three-dimensional model viewed from any angle.
The information processing device 1 estimates three-dimensional posture data of the subject from a two-dimensional image including the subject and, separately from this estimation, detects an event related to the movement of the subject from the two-dimensional image, and controls the posture of the skeleton (bones) of the three-dimensional model based on the estimation result and the detection result. In the present disclosure, the terms "posture", "pose", and "three-dimensional coordinates" may be read interchangeably depending on the context.
The information processing device 1 may further apply bone animation (skeletal animation, skinned mesh animation) techniques to this skeleton control result to correct and render the three-dimensional model, and may output the result as a two-dimensional or three-dimensional still image or video. The subject is not limited to a human being as described later, and may be a moving body such as an animal other than a human or a machine such as a robot, or a stationary object that interacts with a moving body. An event related to the movement of the subject may include at least either an event caused by the movement of the subject itself or an event caused by a moving body acting on the subject even though the subject itself does not move.
The input unit 100 has an input interface that accepts input of data from the outside. The information processing device 1 receives, via the input unit 100, data required for its operation and data to be processed.
The storage unit 102 stores data necessary for the operation of the information processing device 1 or data to be processed. Various data input from the input unit 100 may be stored in the storage unit 102 temporarily or non-temporarily.
The part detection unit 104 acquires a predetermined part, or a region to which the predetermined part belongs, from the two-dimensional image to be processed. The part detection unit 104 may perform this process using the first model NN1, which is a trained model.
The first model NN1 is a model that detects two-dimensional data related to the predetermined part from a two-dimensional image that includes the subject. This model is, for example, a neural network model. More specifically, the first model NN1 may be, for example, a CNN (Convolutional Neural Network) having at least one convolutional layer, or another neural network model such as an MLP (Multi-Layer Perceptron). The first model NN1 is trained, for example, by an arbitrary machine learning method so that, when a two-dimensional image is input, it outputs information regarding the predetermined part. The part detection unit 104 acquires information on the region to which the predetermined part belongs by inputting the two-dimensional image to the first model NN1 and forward-propagating it.
The information about the predetermined part may be coordinate data of the predetermined part of the subject, or the range to which the predetermined part of the subject belongs, for example, data about a bounding box. The predetermined part is, for example, a part whose posture is to be corrected in the three-dimensional model.
FIG. 2 is a diagram schematically showing an example of detection by the part detection unit 104.
For example, when the subject is a human and the predetermined part is a hand, and a two-dimensional image such as that in FIG. 2 is input, the part detection unit 104 detects the positions of the hands using the first model NN1. The part detection unit 104 may extract the predetermined parts by setting the region to which the right hand belongs as a bounding box B1 and the region to which the left hand belongs as a bounding box B2.
In this case, the first model NN1 is a model trained to output a bounding box of the predetermined part when a two-dimensional image including the subject is input. Desirably, as shown in FIG. 2, the first model NN1 is a model trained so that one bounding box is output for each predetermined part, that is, so that predetermined parts and bounding boxes are extracted on a one-to-one basis.
Returning to FIG. 1, the two-dimensional coordinate estimation unit 106 estimates the coordinates of predetermined locations within the predetermined part detected by the part detection unit 104. The predetermined locations are, for example, the positions of the subject's joints. The two-dimensional coordinate estimation unit 106 may perform this process using the second model NN2, which is a trained model.
The second model NN2 is a model that acquires the two-dimensional coordinates of the joints in the predetermined part from two-dimensional data of the predetermined part (for example, the two-dimensional image within the bounding box). This model is, for example, a neural network model. More specifically, the second model NN2 may be, for example, a CNN having at least one convolutional layer, or another neural network model such as an MLP. The second model NN2 is trained, for example, by an arbitrary machine learning method so that, when two-dimensional data relating to the predetermined part is input, it outputs the two-dimensional coordinates of the joints in the predetermined part. The two-dimensional coordinate estimation unit 106 acquires the two-dimensional coordinates of the joints by inputting the region of the predetermined part to the second model NN2 and forward-propagating it.
The two-dimensional coordinates may be represented in a coordinate system centered on the origin of the bounding box, or in a coordinate system centered on the origin of the two-dimensional image. When the subject is a human, the two-dimensional coordinates may be relative coordinates with a part of the human, such as the nose, as the origin.
FIG. 3 is a diagram showing an example of the positions of the hand joints within the bounding boxes of FIG. 2. The points indicated by dots represent the joints of the predetermined part within the bounding box. The two-dimensional coordinate estimation unit 106 acquires the joint positions shown in this figure from the image within the bounding box of FIG. 2 by using the second model NN2.
The second model NN2 is trained by an appropriate machine learning method so that, when an image such as the bounding box image in FIG. 2 is input, it outputs the joint positions shown in FIG. 3. The training data may be created by the user. Neural network models that have already been optimized and published may be used as the first model NN1 and the second model NN2.
Note that the processes of the part detection unit 104 and the two-dimensional coordinate estimation unit 106 need not be executed separately. As indicated by the dotted line in FIG. 1, a fourth model NN4, which is a trained model that acquires the two-dimensional coordinates of the joints in the predetermined part when a two-dimensional image including the subject is input, may be used. In this case, the part detection unit 104 may be omitted. For example, the two-dimensional coordinate estimation unit 106 may acquire the two-dimensional coordinates of the joints in the predetermined part from the two-dimensional image by inputting the two-dimensional image including the subject, received via the input unit 100, to the fourth model NN4.
 図1に戻り、3次元座標推定部108は、2次元座標推定部106が取得した関節の2次元座標に基づいて、被写体の関節の3次元座標を取得する。3次元座標推定部108は、訓練済みモデルである第3モデルNN3を用いてこの処理を実行してもよい。 Returning to FIG. 1, the 3D coordinate estimation unit 108 acquires the 3D coordinates of the joints of the subject based on the 2D coordinates of the joints acquired by the 2D coordinate estimation unit 106 . The three-dimensional coordinate estimation unit 108 may perform this process using the third model NN3, which is a trained model.
 The third model NN3 is a model that obtains three-dimensional coordinates from the two-dimensional coordinates of the joints. This model is, for example, a neural network model. More specifically, the third model NN3 may be, for example, a CNN, or another neural network model such as an MLP. More simply, the third model NN3 may be a simple model having fully connected layers that take the two-dimensional coordinates of the joints as the input layer. The third model NN3 is trained by an arbitrary machine learning method as a model that obtains three-dimensional coordinates from the two-dimensional coordinates of the joints. The three-dimensional coordinate estimation unit 108 obtains the three-dimensional coordinates of the joints by inputting their two-dimensional coordinates to the third model NN3 and forward-propagating them.
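 As a non-authoritative illustration of such a lifting model, the sketch below maps a flattened vector of two-dimensional joint coordinates to three-dimensional coordinates with fully connected layers; the hidden width and the joint count are assumptions made for the example.

import torch
import torch.nn as nn

class LiftingMLP(nn.Module):
    """Minimal sketch of a third-model-style lifter: 2D joint
    coordinates in, 3D joint coordinates out (illustrative only)."""

    def __init__(self, num_joints: int = 21, hidden: int = 256):  # assumed sizes
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (batch, num_joints, 2) -> (batch, num_joints, 3)
        out = self.net(joints_2d.flatten(1))
        return out.view(-1, self.num_joints, 3)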
 For example, the three-dimensional coordinate estimation unit 108 converts the two-dimensional coordinates of each point in FIG. 3 into three-dimensional coordinates in a three-dimensional space using the third model NN3. As above, the coordinates in the three-dimensional space may be, for example, coordinates expressed in a predetermined coordinate system centered on a predetermined origin, or relative coordinates centered on a predetermined position of the three-dimensional model, for example the position of the nose.
 In the above description, the part detection unit 104, the two-dimensional coordinate estimation unit 106, and the three-dimensional coordinate estimation unit 108 each perform their processing using trained neural network models, but the present disclosure is not limited to this. For example, the output may be obtained using a function or the like that produces appropriately derived data based on statistics of the respective inputs.
 The smoothing unit 110 smooths the obtained three-dimensional coordinates using information from other frames. For example, the smoothing unit 110 smooths the three-dimensional joint coordinates obtained from the two-dimensional image of the frame of interest using the three-dimensional joint coordinates obtained from the two-dimensional images of the preceding and following frames. This smoothing suppresses jitter in the time series of the three-dimensional joint coordinates and, when the three-dimensional model is animated, suppresses unnatural motion in which a joint position jumps abruptly to another position.
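 A minimal sketch of one possible smoothing scheme is given below: a centered moving average over a window of neighboring frames. The window size and the use of a plain average, rather than, for example, a Gaussian or one-euro filter, are assumptions made only for illustration.

import numpy as np

def smooth_joint_trajectories(coords: np.ndarray, window: int = 2) -> np.ndarray:
    """coords: (num_frames, num_joints, 3) per-frame 3D joint coordinates.
    Returns the trajectories smoothed with a centered moving average."""
    num_frames = coords.shape[0]
    smoothed = np.empty_like(coords, dtype=float)
    for t in range(num_frames):
        lo = max(0, t - window)
        hi = min(num_frames, t + window + 1)
        smoothed[t] = coords[lo:hi].mean(axis=0)  # average over neighboring frames
    return smoothed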
 The event detection unit 112 detects the occurrence of an event in the two-dimensional image. The events may be determined in advance. When the subject is a human and the predetermined part is a hand, the events may include, for example, a collision between the hands or occlusion of a hand in the two-dimensional image.
 The event detection unit 112 may detect events based on rules, or may detect events using a fifth model NN5 (not shown). The fifth model NN5 is, for example, a neural network model, and may be a model such as a CNN or an MLP. The fifth model NN5 is a model appropriately trained to output the three-dimensional coordinates of the joints when the two-dimensional coordinates of the joints are input. The fifth model NN5 may also be a model that can additionally receive the two-dimensional image as input.
 For example, the data set for this training may be generated by comparing two-dimensional images with images captured by a depth camera capable of acquiring three-dimensional coordinates of the space. As another example, it may be generated by reconstructing a three-dimensional image from a plurality of cameras that capture the subject from different positions and at different angles, and comparing it with a two-dimensional image captured by one of those cameras or by a separate camera.
 These captures may be realized by photographing a subject with markers attached to the joints of interest. Alternatively, the data for obtaining the three-dimensional coordinates may be acquired not with a camera but by emitting sound waves (including ultrasonic waves), electromagnetic waves (including visible light and light in other bands), electrode signals, or the like from the joints of the subject, or by acquiring information such as their reflections from the joints. Of course, the data set used for training may also be generated by comparing three-dimensional data acquired and reconstructed by other techniques such as motion capture with data obtained by photographing the subject with a camera.
 By using a data set generated in this way, training can be executed with the positions of the subject's joints in the captured two-dimensional image (two-dimensional coordinates) associated with the positions of the subject's joints in space (three-dimensional coordinates). The positions (three-dimensional coordinates) of the subject's joints may be expressed, for example, in the camera coordinate system at the time the two-dimensional image was captured, or in another coordinate system.
 The skeleton control unit 114 generates the posture of the skeleton of the three-dimensional model using the three-dimensional joint coordinate data smoothed by the smoothing unit 110, and further corrects the posture of the skeleton based on the event detected by the event detection unit 112. The skeleton control unit 114 may, for example, set key frames appropriately and control the posture of the skeleton frame by frame.
 The skeleton control unit 114 also calculates the joint angles of the subject based on the three-dimensional joint coordinate data output by the smoothing unit 110. These joint angles are applied to the skeleton of the three-dimensional model. The three-dimensional posture data of the subject is data that specifies the posture of the subject in three-dimensional space, such as the three-dimensional coordinate data and the joint angles. Based on the calculated joint angles, the skeleton control unit 114 controls the posture of the skeleton by executing a forward kinematics calculation on the skeleton of the three-dimensional model.
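 Purely to illustrate the forward-kinematics step, the sketch below accumulates per-joint local rotations along a single serial chain to obtain joint positions. Treating the skeleton as one chain with fixed bone offsets is a simplifying assumption made for the example, not a structure required by the disclosure.

import numpy as np

def forward_kinematics(local_rotations, bone_offsets, root_position):
    """local_rotations: list of (3, 3) rotation matrices, one per joint,
    derived from the estimated joint angles.
    bone_offsets: list of (3,) bone vectors in each parent's rest frame.
    Returns the (num_joints + 1, 3) positions along the chain."""
    positions = [np.asarray(root_position, dtype=float)]
    accumulated = np.eye(3)
    for rotation, offset in zip(local_rotations, bone_offsets):
        accumulated = accumulated @ rotation  # compose rotations from the root down
        positions.append(positions[-1] + accumulated @ np.asarray(offset, dtype=float))
    return np.stack(positions)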
 The skeleton control unit 114 further corrects the posture of the skeleton obtained by the forward kinematics calculation, using an inverse kinematics calculation based on the event, so that the three-dimensional model reproduces the event detected by the event detection unit 112. For example, the skeleton control unit 114 sets the parts of the skeleton that are essential to reproducing the event (for example, the middle fingers of the right and left hands) so that they are placed at the position where the event occurred, and executes an inverse kinematics calculation on the skeleton of the three-dimensional model based on this event occurrence position, thereby correcting the posture of the skeleton of the three-dimensional model.
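 One possible way to realize such an event-driven correction is sketched below using cyclic coordinate descent (CCD), which iteratively rotates the chain so that its end effector (for example, a fingertip) reaches the event occurrence position. CCD is chosen here only for brevity; the disclosure does not specify a particular inverse-kinematics solver, and the near-antiparallel case of the alignment rotation is deliberately left unhandled in this sketch.

import numpy as np

def _align_rotation(src, dst):
    """Rotation matrix taking direction src onto direction dst (Rodrigues formula).
    Assumes the two directions are not nearly antiparallel."""
    src = src / np.linalg.norm(src)
    dst = dst / np.linalg.norm(dst)
    v = np.cross(src, dst)
    c = float(np.dot(src, dst))
    if np.linalg.norm(v) < 1e-8:
        return np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def ccd_correction(chain, target, iterations=10, tolerance=1e-4):
    """chain: (N, 3) joint positions from root to end effector (e.g. a fingertip).
    target: (3,) event occurrence position the end effector should reach.
    Returns a corrected copy of the chain."""
    chain = np.array(chain, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(iterations):
        for i in range(len(chain) - 2, -1, -1):
            rotation = _align_rotation(chain[-1] - chain[i], target - chain[i])
            # rotate every joint downstream of joint i about joint i
            chain[i + 1:] = (chain[i + 1:] - chain[i]) @ rotation.T + chain[i]
        if np.linalg.norm(chain[-1] - target) < tolerance:
            break
    return chain

 Other solvers, for example Jacobian-based or FABRIK methods, could be substituted equally well.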
 By executing the above processing, the skeleton control unit 114 generates the posture of the skeleton of the three-dimensional model by a forward kinematics calculation based on the three-dimensional coordinates of the joints, and corrects the posture of the skeleton by an inverse kinematics calculation based on the event.
 As another example, the skeleton control unit 114 may adjust the three-dimensional positions of the joints according to the event occurrence position and, from the adjusted result, generate the posture of the skeleton by an inverse kinematics calculation based on the event occurrence position.
 The rendering unit 116 executes bone animation and rendering of the three-dimensional model whose skeleton is controlled by the skeleton control unit 114. The rendering unit 116 executes rendering using, for example, a technique such as ray tracing, and converts the three-dimensional model into a two-dimensional image or two-dimensional video.
 The output unit 118 outputs the image and video information generated by the rendering unit 116 on, for example, a two-dimensional output device such as a display. On a device capable of appropriately outputting three-dimensional information, the output unit 118 may also output image and video information generated by the rendering unit 116 based on the user's viewpoint and field of view.
 The output unit 118 may, for example, output by streaming, or may store the data in the storage unit 102 or the like.
 FIG. 4 is a flowchart showing the processing of the information processing device 1 according to one embodiment.
 The information processing device 1 first sets the three-dimensional model that is to trace the movement of the subject (S100). The subsequent processing is processing for making this three-dimensional model move appropriately. This three-dimensional model may be an avatar that is set in advance or specified by the user.
 The information processing device 1 receives the input of a two-dimensional image via the input unit 100 (S102). The two-dimensional image may be a per-frame image of a two-dimensional video. The information processing device 1 may acquire, in real time, a two-dimensional video including the subject being captured by a camera, or may process a previously captured video one frame at a time.
 The part detection unit 104 detects a predetermined part from the two-dimensional image including the subject (S104). For example, the subject may be a human and the predetermined part may be a hand.
 The two-dimensional coordinate estimation unit 106 estimates the positions of the joints in the predetermined part detected by the part detection unit 104, and outputs the two-dimensional coordinates of these joints in the two-dimensional image (S106). When the part is a hand, the two-dimensional coordinate estimation unit 106 may estimate, for example, the positions of the joints of each finger, the position of the joint between the forearm and the hand, or the positions of the joints of the arm.
 The three-dimensional coordinate estimation unit 108 estimates three-dimensional coordinates in three-dimensional space based on the two-dimensional joint coordinates estimated by the two-dimensional coordinate estimation unit 106 (S108).
 The smoothing unit 110 performs smoothing in the time-series direction on the three-dimensional joint coordinates estimated by the three-dimensional coordinate estimation unit 108 (S110). Any appropriate smoothing process may be used. For example, the smoothing unit 110 obtains smoothed three-dimensional joint coordinates for the current frame based on the three-dimensional joint coordinates of a predetermined number of past frames and those of the current frame (that is, the three-dimensional joint coordinates of a plurality of frames including the current frame).
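 For the real-time case, where future frames are not available, a causal filter such as the exponential moving average sketched below could be used; the smoothing factor is an assumed value chosen only for illustration.

import numpy as np

class CausalJointSmoother:
    """Exponential moving average over past frames only (illustrative sketch)."""

    def __init__(self, alpha: float = 0.5):  # assumed smoothing factor
        self.alpha = alpha
        self.state = None

    def update(self, joints_3d: np.ndarray) -> np.ndarray:
        # joints_3d: (num_joints, 3) coordinates estimated for the current frame
        if self.state is None:
            self.state = joints_3d.astype(float)
        else:
            self.state = self.alpha * joints_3d + (1.0 - self.alpha) * self.state
        return self.state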
 The event detection unit 112 detects whether an event has occurred in the frame image being processed and, if an event has occurred, acquires the position where the event occurred (S112). The event may be, for example, contact between the fingers of the right hand and the left hand, or occlusion of one hand by the other, but is not limited to these. It may also be, for example, contact between the subject's hand and face, or occlusion of the face by the subject's hand. The event occurrence position is an example of data relating to the event; in the example of a contact event between the fingers of the right and left hands, it is the coordinates in the frame image at which the fingers of the right and left hands touched when the event occurred, or the three-dimensional coordinates corresponding to those coordinates. Alternatively, the event occurrence position may be, for example, two-dimensional or three-dimensional coordinates set in advance for each event. Alternatively, the event occurrence position may be calculated by a computation predetermined for each event (for example, computing the midpoint between the coordinates of the right-hand finger and the left-hand finger).
 Note that the event need not be detected at this particular point. The event detection unit 112 may appropriately execute detection in parallel with at least one of the processes from S104 to S110, or between any two of these processes.
 The skeleton control unit 114 controls the posture of the skeleton of the three-dimensional model based on the three-dimensional joint coordinates smoothed by the smoothing unit 110 and the event detected by the event detection unit 112 (S114). The skeleton control unit 114 controls the posture of the skeleton by correcting the posture of the skeleton of the three-dimensional model through a forward kinematics calculation based on the three-dimensional coordinates of the joints and an inverse kinematics calculation based on the event occurrence position. As described above, the posture of the skeleton may also be controlled by an inverse kinematics calculation based on the three-dimensional coordinates of the joints and the event occurrence position.
 The rendering unit 116 executes bone animation and rendering based on the posture of the skeleton controlled by the skeleton control unit 114, converts the result into a format suitable for output by an appropriate output means, and outputs it (S116).
 By repeating the processing from S102 to S116 an appropriate number of times, for example once per frame, the information processing device 1 can output corrected image data for each frame, or video data in which the frames are appropriately combined in the time-series direction.
 In the above description, the smoothing unit 110 performs smoothing using the frame images of a predetermined number of past frames, but the present disclosure is not limited to this. For example, when the output is a file storing the video rather than real-time processing, future frames may also be used for smoothing. In this case, the processing from S102 to S108 may be executed for a predetermined number of frames, after which the smoothing processing may be executed appropriately. In this way, the per-frame processing described above may be executed in parallel or sequentially, as appropriate.
 A specific example will now be given to explain what kind of image is output under what circumstances.
 FIG. 5 is a diagram showing the result of applying the hand shape of FIG. 2 to the three-dimensional model and rendering it with the information processing device 1 described above. In this figure, an event has occurred in which the fingertips of both hands are brought into contact in front of the mouth.
 In FIG. 5, as in the shape shown in FIG. 2, the three-dimensional model is controlled appropriately and the contact between the fingertips of both hands is expressed appropriately. This is achieved because the contact is detected as an event and an inverse kinematics calculation is executed based on the position where the event occurred.
 FIG. 6 shows the result of controlling a three-dimensional model according to a comparative example at the same point in time. As shown in FIG. 6, in the comparative example the three-dimensional model is generated by forward kinematics calculation without detecting the event. As a result, due to differences in build and the like between the subject and the three-dimensional model, the fingertips overlap and the event cannot be expressed appropriately.
 FIGS. 7 to 9 are diagrams showing another example of event reproduction.
 FIG. 7 shows a subject holding a lightly clenched hand against the forehead. FIG. 8 is a diagram in which the information processing device 1 has the three-dimensional model trace the state of FIG. 7 and renders it. As shown in this figure, the position of the forehead and the position of the clenched hand are expressed appropriately.
 In this example, the three-dimensional model is controlled without detecting the relationship between the eye and the hand as an event, that is, without considering occlusion of the eye by the forearm.
 On the other hand, if the three-dimensional model is to be controlled so that the right eye is not occluded, information on an event that the eye is not occluded by the forearm may be added at the point where the event of the forehead being covered by the hand is detected. Then, by performing an inverse kinematics calculation based on the hand positioned at the forehead and on the forearm that does not occlude the right eye, appropriate rendering of the three-dimensional model can be realized. If necessary, the forward kinematics calculation for the joints and the inverse kinematics calculation from the event occurrence positions may be repeated a predetermined number of times. For example, to realize the situation described in this paragraph, an inverse kinematics calculation from the forehead and the hand, and an inverse kinematics calculation from the forearm and the eye, may each be executed.
 FIG. 9 is an example showing control of the three-dimensional model without event detection. When only the forward kinematics calculation from the three-dimensional coordinates of the joints is performed, the positional relationship between the hand and the forehead cannot be expressed appropriately, as shown here.
 The examples in these figures can be used to output sign-language expressions with a three-dimensional model. In this case, the events may be, for example, events that are critical to sign-language expression. As described above, the information processing device 1 can use the three-dimensional model to reproduce events such as bringing both hands into contact, covering part of the face with a hand, or touching part of the face with a hand.
 When detecting an event in which the fingertips of the hands touch, as in FIG. 2, the event detection unit 112 may, for example, perform the detection based on the distance between the fingertips. For example, the event detection unit 112 may determine that the fingertips are in contact when the distance between them is equal to or less than a predetermined distance.
 When detecting an event in which part of the face is covered by a hand, as in FIG. 5, the event detection unit 112 may, for example, perform the detection based on the distance between the nose and the wrist. For example, the event detection unit 112 may determine that this event has occurred when the distance between the nose and the wrist is equal to or less than a predetermined distance.
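 A minimal sketch of such rule-based checks is given below. The joint names, the threshold values, and the choice of the fingertip midpoint as the event occurrence position are illustrative assumptions, not values taken from the disclosure.

import numpy as np

# Illustrative thresholds (assumed values, in the same units as the coordinates).
CONTACT_THRESHOLD = 0.02
OCCLUSION_THRESHOLD = 0.10

def detect_events(joints_3d: dict) -> list:
    """joints_3d maps assumed joint names such as 'right_middle_tip',
    'left_middle_tip', 'nose', and 'right_wrist' to (3,) coordinates."""
    events = []

    # Fingertip contact: distance between the two middle fingertips.
    right_tip = joints_3d["right_middle_tip"]
    left_tip = joints_3d["left_middle_tip"]
    if np.linalg.norm(right_tip - left_tip) <= CONTACT_THRESHOLD:
        # Use the midpoint as the event occurrence position.
        events.append(("fingertip_contact", (right_tip + left_tip) / 2.0))

    # Part of the face covered by a hand: distance between nose and wrist.
    if np.linalg.norm(joints_3d["nose"] - joints_3d["right_wrist"]) <= OCCLUSION_THRESHOLD:
        events.append(("face_covered_by_hand", joints_3d["right_wrist"]))

    return events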
 In addition to these, the event detection unit 112 may detect events by various other rule-based processes.
 As another example, a large number of two-dimensional images of the required events may be prepared together with a training data set in which the event in each two-dimensional image is labeled, and the fifth model NN5 may be trained by machine learning so that it outputs the event when such a two-dimensional image is input. The fifth model NN5 may use, for example, data sets of contact between fingertips, contact between a hand and part of the face, occlusion of an arbitrary part by another arbitrary part, and the like.
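 Purely as an illustration of such a learned event detector, the sketch below shows a small image classifier and one supervised training step over event-labelled images. The event classes, the network size, and the training procedure are assumptions made for the example and are not specified by the disclosure.

import torch
import torch.nn as nn

# Assumed event classes, for illustration only.
EVENT_CLASSES = ["none", "fingertip_contact", "hand_occludes_hand", "hand_touches_face"]

class EventClassifier(nn.Module):
    """Minimal sketch of a fifth-model-style detector: a 2D image in, an event label out."""

    def __init__(self, num_classes: int = len(EVENT_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(image))

def train_step(model, optimizer, images, labels):
    """One supervised step over a batch of event-labelled images."""
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()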
 For example, when an occlusion state is properly detected as an event, the positions of the hidden hand joints and the like can be inferred appropriately by the forward kinematics calculation and the event-based inverse kinematics calculation, and an appropriate bone animation can be generated. Thereby, for a two-dimensional image in which one hand is occluded by the other, the information processing device 1 can, for example, reconstruct the three-dimensional model while appropriately preserving the continuity of the hand shape in the hidden region.
 As described above, according to the present embodiment, by executing the above processing the information processing device 1 can, for example, make a preset three-dimensional model (avatar) move in accordance with the movement of the subject in the two-dimensional image. By performing the inverse kinematics calculation based on the detected event, the information processing device 1 can obtain images and video in which the event is appropriately reflected even when the skeleton of the subject in the two-dimensional image differs from the skeleton of the three-dimensional model.
 For example, in the case of sign language, using the three-dimensional model without showing the sign-language interpreter's face helps protect the interpreter's privacy. Using the three-dimensional model also makes it possible to convey the content of the sign language to the user without being affected by the interpreter's looks or appearance. Toon rendering can also be used to render the three-dimensional model, which makes it easy to use an arbitrary character or the like as the three-dimensional model.
 Sign-language images have been given above as an example, but the present disclosure is not limited to this. For example, it can also be applied to converting two-dimensional images and two-dimensional video into three dimensions, and to sports, VR, anonymization, surveillance cameras, marketing, and the like. For example, by representing two-dimensional video of a ball game with a three-dimensional model, the play can be reproduced from the angle the user wants to watch.
 The events can be set appropriately according to the type of video or other content to be reproduced by the three-dimensional model. For example, for the ball game above, by detecting as events the distance and positional relationship between the ball and the players, the players' actions with respect to the ball, and so on, the three-dimensional models of the players and the three-dimensional model of the ball can be expressed appropriately.
 When teaching an operation remotely to other users, for example, the users receiving the instruction can view the operation from any angle. With VR, it also becomes possible to view an artist or the like from an arbitrary viewpoint.
 By applying the technique to surveillance cameras, it also becomes possible to appropriately reproduce what kind of motion was performed while occluded, which can help prevent and deter crime.
 All of the trained models described above may be a concept that includes, for example, models obtained by training as described and then further distilling them by a common technique.
 Part or all of each device (the information processing device 1) in the embodiments described above may be configured by hardware, or may be configured by information processing of software (a program) executed by a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or the like. When configured by information processing of software, the software that realizes at least some of the functions of each device in the embodiments described above may be stored in a non-transitory storage medium (non-transitory computer-readable medium) such as a flexible disk, a CD-ROM (Compact Disc-Read Only Memory), or a USB (Universal Serial Bus) memory, and loaded into a computer so that the information processing of the software is executed. The software may also be downloaded via a communication network. Furthermore, the information processing may be executed by hardware by implementing the software in a circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).
 The type of storage medium that stores the software is not limited. The storage medium is not limited to a removable medium such as a magnetic disk or an optical disc, and may be a fixed storage medium such as a hard disk or a memory. The storage medium may be provided inside the computer or outside the computer.
 FIG. 10 is a block diagram showing an example of the hardware configuration of each device (the information processing device 1) in the embodiments described above. As one example, each device may be realized as a computer 7 including a processor 71, a main storage device 72 (memory), an auxiliary storage device 73 (memory), a network interface 74, and a device interface 75, which are connected via a bus 76.
 The computer 7 in FIG. 10 includes one of each component, but may include a plurality of the same component. Although one computer 7 is shown in FIG. 10, the software may be installed on a plurality of computers, and each of the plurality of computers may execute the same portion or a different portion of the processing of the software. In this case, a form of distributed computing may be used in which each of the computers communicates via the network interface 74 or the like to execute the processing. In other words, each device (the information processing device 1) in the embodiments described above may be configured as a system in which one or more computers execute instructions stored in one or more storage devices to realize the functions. It may also be configured so that information transmitted from a terminal is processed by one or more computers provided on a cloud and the processing results are transmitted to the terminal.
 The various operations of each device (the information processing device 1) in the embodiments described above may be executed in parallel using one or more processors or using a plurality of computers connected via a network. The various operations may also be distributed to a plurality of arithmetic cores in a processor and executed in parallel. Part or all of the processing, means, and the like of the present disclosure may be executed by at least one of a processor and a storage device provided on a cloud capable of communicating with the computer 7 via a network. In this way, each device in the embodiments described above may take the form of parallel computing by one or more computers.
 The processor 71 may be an electronic circuit (a processing circuit, processing circuitry, a CPU, a GPU, an FPGA, an ASIC, or the like) including a control device and an arithmetic device of a computer. The processor 71 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 71 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 71 may also include an arithmetic function based on quantum computing.
 The processor 71 can perform arithmetic processing based on data and software (programs) input from the devices of the internal configuration of the computer 7, and can output arithmetic results and control signals to those devices. The processor 71 may control the components constituting the computer 7 by executing the OS (Operating System) of the computer 7, applications, and the like.
 Each device (the information processing device 1) in the embodiments described above may be realized by one or more processors 71. Here, the processor 71 may refer to one or more electronic circuits arranged on one chip, or to one or more electronic circuits arranged on two or more chips or two or more devices. When a plurality of electronic circuits are used, the electronic circuits may communicate by wire or wirelessly.
 The main storage device 72 is a storage device that stores instructions executed by the processor 71, various data, and the like, and the information stored in the main storage device 72 is read by the processor 71. The auxiliary storage device 73 is a storage device other than the main storage device 72. These storage devices mean arbitrary electronic components capable of storing electronic information, and may be semiconductor memories. The semiconductor memory may be either a volatile memory or a nonvolatile memory. The storage device for storing various data in each device (the information processing device 1) in the embodiments described above may be realized by the main storage device 72 or the auxiliary storage device 73, or by a built-in memory incorporated in the processor 71. For example, the storage unit 102 in the embodiments described above may be realized by the main storage device 72 or the auxiliary storage device 73.
 A plurality of processors may be connected (coupled) to one storage device (memory), or a single processor may be connected. A plurality of storage devices (memories) may be connected (coupled) to one processor. When each device (the information processing device 1) in the embodiments described above is configured with at least one storage device (memory) and a plurality of processors connected (coupled) to this at least one storage device (memory), a configuration in which at least one of the plurality of processors is connected (coupled) to the at least one storage device (memory) may be included. This configuration may also be realized by storage devices (memories) and processors included in a plurality of computers. Furthermore, a configuration in which the storage device (memory) is integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.
 The network interface 74 is an interface for connecting to the communication network 8 wirelessly or by wire. Any appropriate interface, such as one conforming to an existing communication standard, may be used as the network interface 74. Information may be exchanged via the network interface 74 with an external device 9A connected via the communication network 8. The communication network 8 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), and the like, or a combination of these, as long as information can be exchanged between the computer 7 and the external device 9A. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet (registered trademark), and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).
 The device interface 75 is an interface, such as USB, that connects directly to the external device 9B.
 The external device 9A is a device connected to the computer 7 via a network. The external device 9B is a device connected directly to the computer 7.
 As one example, the external device 9A or the external device 9B may be an input device. The input device is, for example, a device such as a camera, a microphone, a motion-capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides acquired information to the computer 7. It may also be a device including an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.
 As one example, the external device 9A or the external device 9B may be an output device. The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or a speaker or the like that outputs sound. It may also be a device including an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.
 The external device 9A or the external device 9B may also be a storage device (memory). For example, the external device 9A may be a network storage or the like, and the external device 9B may be a storage such as an HDD.
 The external device 9A or the external device 9B may also be a device having some of the functions of the components of each device (the information processing device 1) in the embodiments described above. In other words, the computer 7 may transmit or receive part or all of the processing results of the external device 9A or the external device 9B.
 In this specification (including the claims), when the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used, it includes any of a, b, c, a-b, a-c, b-c, and a-b-c. It may also include a plurality of instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes adding an element other than the listed elements (a, b, and c), for example having d as in a-b-c-d.
 In this specification (including the claims), when expressions such as "with data as input", "based on data", "in accordance with data", or "depending on data" (including similar expressions) are used, unless otherwise noted, they include the case where the data itself is used as input and the case where data obtained by performing some processing on the data (for example, data with noise added, normalized data, or an intermediate representation of the data) is used as input. When it is stated that some result is obtained "based on", "in accordance with", or "depending on" data, this includes the case where the result is obtained based only on that data, and may also include the case where the result is obtained under the influence of other data, factors, conditions, and/or states in addition to that data. When it is stated that "data is output", unless otherwise noted, this includes the case where the data itself is used as the output and the case where data obtained by performing some processing on the data (for example, data with noise added, normalized data, or an intermediate representation of the data) is used as the output.
 In this specification (including the claims), when the terms "connected" and "coupled" are used, they are intended as non-limiting terms that include any of direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted appropriately according to the context in which they are used, but forms of connection/coupling that are not intentionally or naturally excluded should be interpreted non-restrictively as being included in the terms.
 In this specification (including the claims), when the expression "A configured to B" is used, it may include that the physical structure of element A has a configuration capable of executing operation B, and that a permanent or temporary setting/configuration of element A is configured/set to actually execute operation B. For example, when element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing operation B and is configured to actually execute operation B by a permanent or temporary setting of programs (instructions). When element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute operation B, regardless of whether control instructions and data are actually attached.
 In this specification (including the claims), when terms meaning inclusion or possession (for example, "comprising/including" and "having") are used, they are intended as open-ended terms, including the case of including or possessing something other than the object indicated by the object of the term. When the object of these terms meaning inclusion or possession is an expression that does not specify a quantity or that suggests the singular (an expression with the article a or an), the expression should be interpreted as not being limited to a specific number.
 In this specification (including the claims), even if an expression such as "one or more" or "at least one" is used in one place and an expression that does not specify a quantity or that suggests the singular (an expression with the article a or an) is used in another place, the latter expression is not intended to mean "one". In general, expressions that do not specify a quantity or that suggest the singular (expressions with the article a or an) should be interpreted as not necessarily being limited to a specific number.
 In this specification, when it is described that a particular advantage/result is obtained for a particular configuration of an embodiment, unless there is a specific reason to the contrary, it should be understood that the advantage/result may also be obtained for one or more other embodiments having that configuration. However, it should be understood that whether the advantage/result is obtained generally depends on various factors, conditions, and/or states, and that the advantage/result is not always obtained by the configuration. The advantage/result is merely obtained by the configuration described in the embodiments when various factors, conditions, and/or states are satisfied, and is not necessarily obtained in the claimed invention that defines that configuration or a similar configuration.
 In this specification (including the claims), when a plurality of hardware components perform predetermined processing, the hardware components may cooperate to perform the predetermined processing, or some of the hardware components may perform all of the predetermined processing. Some of the hardware components may perform part of the predetermined processing while other hardware components perform the rest of the predetermined processing. In this specification (including the claims), when an expression such as "one or more hardware components perform first processing and the one or more hardware components perform second processing" is used, the hardware that performs the first processing and the hardware that performs the second processing may be the same or different; it suffices that the hardware that performs the first processing and the hardware that performs the second processing are included in the one or more hardware components. The hardware may include an electronic circuit, a device including an electronic circuit, or the like.
 Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the individual embodiments described above. Various additions, changes, replacements, partial deletions, and the like are possible without departing from the conceptual idea and spirit of the present disclosure derived from the content defined in the claims and equivalents thereof. For example, in all of the embodiments described above, numerical values and formulas used in the explanation are presented as examples and are not limiting. The order of the operations in the embodiments is also presented as an example and is not limiting.

Claims (15)

1. An information processing device comprising:
 one or more memories; and
 one or more processors,
 wherein the one or more processors:
  acquire three-dimensional posture data of a subject included in a two-dimensional image;
  detect, from the two-dimensional image, an event relating to a movement of the subject; and
  control a posture of a skeleton of a three-dimensional model based on the three-dimensional posture data of the subject and the event.
2. The information processing device according to claim 1, wherein the acquired three-dimensional posture data is three-dimensional posture data corresponding to the posture of the subject in the two-dimensional image.
3. The information processing device according to claim 1, wherein
 the two-dimensional image is one frame of a moving image including the subject, and
 the one or more processors:
  acquire the three-dimensional posture data of the subject for each of two or more frames of the moving image; and
  control, for the two or more frames, the posture of the skeleton of the three-dimensional model based on the three-dimensional posture data of the subject acquired for each of the two or more frames and the event detected from at least one of the two or more frames.
4. The information processing device according to claim 1, wherein the skeleton of the three-dimensional model to be controlled is a skeleton of a three-dimensional model of an object different from the subject.
5. The information processing device according to claim 1, wherein the one or more processors correct, based on the detected event, the posture of the skeleton of the three-dimensional model calculated based on the three-dimensional posture data.
6. The information processing device according to claim 1, wherein the one or more processors acquire the three-dimensional posture data by estimating it from the two-dimensional image.
7. The information processing device according to claim 6, wherein the one or more processors estimate the three-dimensional posture data by:
  estimating, from the two-dimensional image, two-dimensional positions of joints of a predetermined part of the subject; and
  estimating three-dimensional coordinates of the joints from the two-dimensional positions of the joints.
8. The information processing device according to claim 7, wherein the one or more processors estimate the three-dimensional posture data by:
  inputting the two-dimensional image to a first model that detects two-dimensional data relating to the predetermined part from the two-dimensional image;
  inputting the two-dimensional data relating to the predetermined part to a second model that acquires two-dimensional coordinates of the joints of the predetermined part from the two-dimensional data relating to the predetermined part; and
  inputting the two-dimensional coordinates of the joints of the predetermined part to a third model that acquires three-dimensional coordinates of the predetermined part from the two-dimensional coordinates of the joints of the predetermined part.
9. The information processing device according to claim 7, wherein the one or more processors estimate the three-dimensional posture data by:
  inputting the two-dimensional image to a fourth model that acquires two-dimensional coordinates of the joints of the predetermined part from the two-dimensional image; and
  inputting the two-dimensional data of the predetermined part to a third model that acquires three-dimensional coordinates of the predetermined part from the two-dimensional data of the joints of the predetermined part.
10. The information processing device according to any one of claims 1 to 9, wherein the three-dimensional posture data is three-dimensional posture data smoothed in a time-series direction.
11. The information processing device according to any one of claims 1 to 9, wherein the one or more processors:
  calculate joint angles to be applied to the skeleton of the three-dimensional model based on the three-dimensional posture data; and
  calculate three-dimensional coordinates of the skeleton of the three-dimensional model by performing a forward kinematics calculation on the skeleton of the three-dimensional model based on the joint angles.
12. The information processing device according to claim 11, wherein the one or more processors further correct the posture of the skeleton of the three-dimensional model, calculated by performing the forward kinematics calculation, using an inverse kinematics calculation based on the event.
13. The information processing device according to any one of claims 1 to 9, wherein the one or more processors detect the event by a rule base or by a fifth model.
The information processing device according to any one of claims 1 to 9, wherein the event is at least one of contact or occlusion of the subject.
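One way the rule-based detection of contact and occlusion events mentioned in the two preceding claims might look (the actual rules are not disclosed in this publication) is sketched below; the ankle index, ground line, and confidence threshold are hypothetical parameters.

import numpy as np

def detect_events(joints_2d, confidences, ankle_idx=15, ground_y=440.0, conf_thresh=0.3):
    """Rule-based detection of contact and occlusion events.

    joints_2d:   (T, n_joints, 2) image coordinates per frame (y grows downward).
    confidences: (T, n_joints) per-joint detection confidence per frame.
    Returns a list of (frame_index, event_name) tuples.
    """
    events = []
    for t in range(len(joints_2d)):
        # Contact rule: the ankle lies at or below the assumed ground line.
        if joints_2d[t, ankle_idx, 1] >= ground_y:
            events.append((t, "contact"))
        # Occlusion rule: any joint confidence falls below the threshold.
        if np.any(confidences[t] < conf_thresh):
            events.append((t, "occlusion"))
    return events

rng = np.random.default_rng(0)
joints = rng.uniform(0, 480, size=(5, 17, 2))
conf = rng.uniform(0.2, 1.0, size=(5, 17))
print(detect_events(joints, conf))

A learned fifth model could replace these hand-written thresholds while keeping the same output format.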
The information processing device according to any one of claims 1 to 9, wherein the one or more processors perform bone animation of the three-dimensional model based on the posture of the skeleton of the three-dimensional model, the posture being controlled based on the three-dimensional pose data and the event.
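Once the skeleton posture has been controlled, bone animation of the three-dimensional model typically amounts to applying per-bone transforms to a skinned mesh each frame; a minimal linear-blend-skinning step, with made-up vertices, bone transforms, and weights, is sketched below (the publication does not specify a particular skinning method).

import numpy as np

def linear_blend_skinning(rest_vertices, bone_transforms, weights):
    """Deform mesh vertices for one frame of a bone animation.

    rest_vertices:   (V, 3) mesh vertices in the rest pose.
    bone_transforms: (B, 4, 4) per-bone transforms derived from the controlled skeleton posture.
    weights:         (V, B) skinning weights, each row summing to 1.
    """
    homo = np.concatenate([rest_vertices, np.ones((len(rest_vertices), 1))], axis=1)  # (V, 4)
    per_bone = np.einsum("bij,vj->bvi", bone_transforms, homo)  # every vertex under every bone
    blended = np.einsum("vb,bvi->vi", weights, per_bone)        # blend with the weights
    return blended[:, :3]

verts = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, :3, 3] = [0.1, 0.0, 0.0]       # second bone translated slightly
w = np.array([[1.0, 0.0], [0.2, 0.8]])
print(linear_blend_skinning(verts, bones, w))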
PCT/JP2022/025855 2021-06-29 2022-06-28 Information processing device WO2023277043A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021107996 2021-06-29
JP2021-107996 2021-06-29

Publications (1)

Publication Number Publication Date
WO2023277043A1 true WO2023277043A1 (en) 2023-01-05

Family

ID=84690235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/025855 WO2023277043A1 (en) 2021-06-29 2022-06-28 Information processing device

Country Status (1)

Country Link
WO (1) WO2023277043A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011215922A (en) * 2010-03-31 2011-10-27 Namco Bandai Games Inc Program, information storage medium, and image generation system
JP2017138915A (en) * 2016-02-05 2017-08-10 株式会社バンダイナムコエンターテインメント Image generation system and program

Similar Documents

Publication Publication Date Title
US9654734B1 (en) Virtual conference room
US11315287B2 (en) Generating pose information for a person in a physical environment
TWI659335B (en) Graphic processing method and device, virtual reality system, computer storage medium
US10324522B2 (en) Methods and systems of a motion-capture body suit with wearable body-position sensors
US10726625B2 (en) Method and system for improving the transmission and processing of data regarding a multi-user virtual environment
KR20210011425A (en) Image processing method and device, image device, and storage medium
WO2023071964A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
US11436790B2 (en) Passthrough visualization
JP7490072B2 (en) Vision-based rehabilitation training system based on 3D human pose estimation using multi-view images
EP4359892A1 (en) Body pose estimation using self-tracked controllers
US11748913B2 (en) Modeling objects from monocular camera outputs
JP2016105279A (en) Device and method for processing visual data, and related computer program product
US20200043211A1 (en) Preventing transition shocks during transitions between realities
JP2022537817A (en) Fast hand meshing for dynamic occlusion
CN116848556A (en) Enhancement of three-dimensional models using multi-view refinement
WO2023240999A1 (en) Virtual reality scene determination method and apparatus, and system
WO2023277043A1 (en) Information processing device
US10621788B1 (en) Reconstructing three-dimensional (3D) human body model based on depth points-to-3D human body model surface distance
Becher et al. VIRTOOAIR: virtual reality toolbox for avatar intelligent reconstruction
Roth et al. Avatar Embodiment, Behavior Replication, and Kinematics in Virtual Reality.
US20230290101A1 (en) Data processing method and apparatus, electronic device, and computer-readable storage medium
KR20220083166A (en) Method and apparatus for estimating human body
CN116156141A (en) Volume video playing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 22833189
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 22833189
    Country of ref document: EP
    Kind code of ref document: A1