CN109145788B - Video-based attitude data capturing method and system - Google Patents

Video-based attitude data capturing method and system

Info

Publication number
CN109145788B
Authority
CN
China
Prior art keywords
picture
data
captured
neural network
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810895934.4A
Other languages
Chinese (zh)
Other versions
CN109145788A (en)
Inventor
陈敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yungoal Tech Co ltd
Original Assignee
Yungoal Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yungoal Tech Co ltd filed Critical Yungoal Tech Co ltd
Priority to CN201810895934.4A priority Critical patent/CN109145788B/en
Publication of CN109145788A publication Critical patent/CN109145788A/en
Application granted granted Critical
Publication of CN109145788B publication Critical patent/CN109145788B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/08 Indexing scheme for image data processing or generation, in general involving all processing steps from image acquisition to 3D model generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Abstract

The application discloses a video-based attitude data capturing method and a video-based attitude data capturing system, wherein the method comprises the following steps: acquiring video data, and decomposing the video data into at least one picture; extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured, which is contained in each picture obtained by decomposing the video data, based on the first neural network model; based on the second neural network model, determining corresponding three-dimensional coordinate data of each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured; and determining the corresponding three-dimensional coordinate data of each marking point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marking point in the local coordinate system based on the position data of each marking point on the object to be captured on at least one picture. The method and the device achieve the purpose of extracting the gesture data from the video containing the motion gesture actions of the object to be captured.

Description

Video-based attitude data capturing method and system
Technical Field
The application relates to the field of machine vision, in particular to a video-based attitude data capturing method and system.
Background
Motion capture refers to recording the motion of an object in three-dimensional space and reproducing its motion trajectory in a digital model. For example, an animation sequence can be generated by detecting and recording the motion trajectories of a performer's limbs in three-dimensional space, capturing the performer's pose motions, converting the captured motions into a digital abstraction, and driving a virtual model in a software application to make the same motions as the performer. In recent years, motion capture technology has been widely used in fields such as virtual reality, three-dimensional games, and ergonomics.
The conventional motion capture techniques mainly include the following three types:
First, optical motion capture techniques. This technique requires a dedicated environment free of significant interference, and professional performers must wear optical motion capture devices while their motion is captured. Although optical motion capture yields highly accurate results, it demands considerable space, equipment, and personnel, and is expensive to use.
Second, inertial motion capture techniques. This technique requires professional performers to wear a variety of motion capture devices; because the devices are strapped to the human joints, they can sample the velocity and acceleration of the motion to infer the positions and movements of the joints. However, limited by device accuracy, this technique captures motion less well than optical motion capture and cannot solve the problem of keeping the virtual character's heels planted on the ground. Like optical capture, inertial motion capture also requires equipment and personnel.
Third, end-to-end 3D pose data generation. This approach requires acquiring 3D human body data in real environments, which is difficult and requires additional equipment, and directly regressing a 3D position from a picture used as the network input is hard. To compensate for insufficient data, this scheme augments training by replacing backgrounds and clothing, and the resulting capture effect is still not ideal.
Analysis shows that, in the prior art, video-based gesture data capture requires the support of sites, equipment and personnel, which affects working efficiency, incurs high costs, and yields poor capture results.
Disclosure of Invention
In order to solve the above problem, the present application provides a video-based gesture data capturing method, which includes the following steps: acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured; decomposing video data into at least one picture; wherein each picture corresponds to a frame of image of the video data; extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured, which is contained in each picture obtained by decomposing the video data, based on the first neural network model; determining three-dimensional coordinate data corresponding to each mark point in a local coordinate system according to two-dimensional coordinate data of each mark point on the object to be captured, which is contained in each picture obtained by decomposing video data, based on a second neural network model, wherein the local coordinate system is a coordinate system determined by the centroid of the object to be captured; and determining the corresponding three-dimensional coordinate data of each marking point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marking point in the local coordinate system based on the position data of each marking point on the object to be captured on at least one picture.
In one example, based on a first neural network model, extracting two-dimensional coordinate data of at least one mark point on a subject to be captured contained in each picture obtained by decomposing the video data, comprises: inputting each picture obtained by decomposing video data into a first neural network model, and outputting at least one confidence map corresponding to each picture, wherein the coordinate of a pixel point with the maximum brightness in each confidence map corresponds to the coordinate of a mark point on an object to be captured; and determining two-dimensional coordinate data of at least one marking point on the object to be captured in each picture according to at least one confidence map corresponding to each picture.
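As an illustration of how the brightest pixel of each confidence map yields a marker coordinate, the following sketch locates that pixel with NumPy; the array shapes, the helper name extract_2d_keypoints, and the assumption that the maps may be smaller than the input picture are illustrative choices, not details specified by the application.

```python
import numpy as np

def extract_2d_keypoints(confidence_maps, input_size, map_size):
    """Pick the brightest pixel of each confidence map as the 2D marker coordinate.

    confidence_maps: array of shape (num_markers, map_h, map_w), one map per marker.
    input_size / map_size: used to rescale map coordinates back to picture coordinates,
    since the maps may be smaller than the input picture (an assumption of this sketch).
    """
    scale = input_size / float(map_size)
    keypoints = []
    for cmap in confidence_maps:
        y, x = np.unravel_index(np.argmax(cmap), cmap.shape)  # brightest pixel
        keypoints.append((x * scale, y * scale))              # back to picture scale
    return np.array(keypoints)  # shape (num_markers, 2)
```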
In one example, before each picture obtained by decomposing the video data is input into the first neural network model and at least one confidence map corresponding to each picture is output, the method further includes: acquiring a picture sample library; the picture sample library comprises a plurality of sample pictures; marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library; and obtaining a first neural network model through machine learning training by using a plurality of sample pictures in the picture sample library and at least one marked point marked on the sample pictures.
Alternatively, the mark point may be an articulation point on the subject to be captured, the articulation point including at least one of: a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint, and a right ankle joint.
In one example, determining two-dimensional coordinate data of at least one mark point on a subject to be captured in each picture according to at least one confidence map corresponding to each picture comprises: determining the two-dimensional coordinates of the marking point corresponding to each confidence map by the following Gaussian response map formula:
G(x, y) = exp(−((x − x₀)² + (y − y₀)²) / (2σ²))
wherein G(x, y) represents the Gaussian distribution value at each pixel point on the confidence map; (x₀, y₀) represents the coordinates of the marker point corresponding to the confidence map; σ represents the standard deviation of the Gaussian distribution; and (x, y) represents the coordinates of each pixel point on each confidence map.
In one example, the obtaining of the first neural network model through machine learning training using a plurality of sample pictures in a picture sample library and at least one marked point marked on the sample pictures includes: training a first neural network model based on an objective function:
E = Σ ||H'j(x, y) − Hj(x, y)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y) denotes the predicted coordinates of each marker point on the sample picture, Hj(x, y) denotes the annotated coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
In one example, before determining the three-dimensional coordinate data of each marker point in the preset three-dimensional space according to the two-dimensional coordinate data of each marker point on the to-be-captured object contained in each picture obtained by decomposing the video data based on the second neural network model, the method further comprises: acquiring a picture sample library; the picture sample library comprises a plurality of sample pictures; marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library; acquiring three-dimensional coordinate data of each marking point on the object to be captured in a local coordinate system based on the two-dimensional coordinates of each marking point marked on each sample picture, wherein the local coordinate system is a coordinate system determined by the mass center of the object to be captured; and obtaining a second neural network model through machine learning training by using the two-dimensional coordinate data of at least one marking point marked on each sample picture in the picture sample library and the corresponding three-dimensional coordinate data in the local coordinate system.
In one example, determining, based on the position of each marker point on the object to be captured on the at least one picture, the three-dimensional coordinate data corresponding to each marker point on the object to be captured in the preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marker point in the local coordinate system includes: determining the three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system based on the position data of each mark point on at least one picture on the object to be captured by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the marker point coordinate information output by the second neural network model; P̄ is the average value of the coordinate information of all the marker points output by the second neural network model; Kᵢ is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model.
In one example, the two-dimensional coordinates and corresponding three-dimensional coordinates of at least one marked point marked on each sample picture in the picture sample library are used for obtaining a second neural network model through machine learning training, and the method comprises the following steps: training a second neural network model based on the following objective function:
E = Σ ||H'j(x, y, z) − Hj(x, y, z)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y, z) denotes the predicted three-dimensional coordinates of each marker point on the sample picture, Hj(x, y, z) denotes the annotated three-dimensional coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
In one example, after generating three-dimensional posture data of the motion of the object to be captured according to three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space, which is contained in each picture obtained by decomposing the video data, the method further comprises the following steps: and generating a file in a first preset format according to the three-dimensional coordinate data of each mark point on the body of the object to be captured in a preset three-dimensional space, wherein the three-dimensional coordinate data of each mark point on each picture is obtained by decomposing the video data, and the three-dimensional attitude data of the object to be captured, which is used for making the corresponding attitude motion in the video data, is stored in the file in the first preset format.
In one example, after generating a file in a first predetermined format according to three-dimensional coordinate data of each mark point on the subject to be captured in a preset three-dimensional space included in each picture obtained by decomposing the video data, the method further includes: and converting the file in the first preset format into a file in a second preset format, wherein the file in the second preset format is used for producing the three-dimensional animation.
In one example, the first neural network model is a convolutional neural network model built based on a residual network, and the second neural network model is a deep neural network model.
In another aspect, the present application further provides a video-based gesture data capturing system, which includes: the camera device is used for acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured; the image processing equipment is communicated with the camera device and used for acquiring video data, decomposing the video data into at least one picture, extracting two-dimensional coordinate data of at least one mark point on the object to be captured contained in each picture obtained by decomposing the video data based on a first neural network model, determining corresponding three-dimensional coordinate data of each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured contained in each picture obtained by decomposing the video data based on a second neural network model, and determining corresponding three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system based on the position data of each mark point on the at least one picture on the object to be captured; each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
In another aspect, the present application further provides a video-based gesture data capturing system, which includes: the client device is used for acquiring and uploading video data, wherein the video data comprises gesture motion data of the motion of an object to be captured; the server is communicated with the client device and used for receiving the video data uploaded by the client device, decomposing the video data into at least one picture, extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured contained in each picture obtained by decomposing the video data based on the first neural network model, and based on the second neural network model, determining the corresponding three-dimensional coordinate data of each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured contained in each picture obtained by decomposing the video data, and based on the position data of each mark point on at least one picture on the object to be captured, determining the three-dimensional coordinate data of each mark point on the body of the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system; each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
According to the video-based gesture data capturing approach of the present application, after at least one picture is obtained from the acquired video data, the two-dimensional coordinate data of at least one marker point on the object to be captured contained in each picture is extracted by the first neural network model, and the three-dimensional coordinates of each marker point in the preset three-dimensional space are then determined by the second neural network model from that two-dimensional coordinate data. This achieves the goal of extracting three-dimensional pose data from video. In addition, because the two-dimensional and three-dimensional coordinate data of the human joints are obtained by two separate neural networks, the problem of poor 3D training results caused by insufficient 3D human pose data is effectively alleviated.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a video-based pose data capture system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative video-based pose data capture system provided by an embodiment of the present application;
fig. 3(a) is a schematic diagram of obtaining a picture from a video screenshot according to an embodiment of the present application;
FIG. 3(b) is a schematic diagram of extracting joint points of a human body from a picture according to an embodiment of the present application;
FIG. 4(a) is a schematic diagram of a two-dimensional human joint according to an embodiment of the present application;
FIG. 4(b) is a schematic diagram of a three-dimensional human joint according to an embodiment of the present application;
fig. 5 is a flowchart of a video-based gesture data capturing method according to an embodiment of the present application.
Detailed Description
In order to more clearly explain the overall concept of the present application, the following detailed description is given by way of example in conjunction with the accompanying drawings.
In order to solve the problems that the operation is complicated and the cost is high due to the fact that an existing gesture data capturing system needs to acquire gesture data of a moving target moving in a three-dimensional space by means of various sensors, the application provides a video-based gesture data capturing scheme which can dynamically capture the gesture data of the moving target moving in a video according to the video of the moving target.
It should be noted that the video-based gesture data capture scheme provided in the embodiment of the present application may be applied to capture gesture data of any moving target, and may be gesture data of a human body, gesture data of a moving object, or gesture data of a robot. Preferably, the various embodiments of the present application are described by taking the example of capturing posture data of a human body. It is easy to note that the attitude data at the time of the motion of the object to be captured can be determined by marking the change of at least one point on the object to be captured, and in the case where the object to be captured is a human body, the marked point may be an articulated point of the human body.
As a first alternative embodiment, an embodiment of the present application provides a video-based pose data capture system, as shown in fig. 1, comprising: an image pickup device 1 and an image processing apparatus 2. The image processing apparatus 2 collects video data of a motion of an object to be captured (for example, a human body) by the camera 1 connected thereto, and the collected video data includes data of a plurality of posture motions of the human body during the motion. After the image processing device 2 obtains the video data of the human motion, the video data is subjected to frame decomposition processing to obtain a plurality of pictures, and each picture corresponds to one frame of image of the video data. Because the posture and the motion of the human body are changed continuously during the movement of the human body, the joint points (including but not limited to joint positions of a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint, a right ankle joint and the like) of the human body in each picture have different positions on the picture.
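For illustration, one possible way to perform this frame-by-frame decomposition is sketched below using OpenCV; the library choice and the function name decompose_video are assumptions of this example rather than part of the application.

```python
import cv2

def decompose_video(video_path):
    """Split a video file into one picture per frame (a minimal sketch using OpenCV)."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:            # no more frames to read
            break
        frames.append(frame)  # each element corresponds to one frame of the video data
    capture.release()
    return frames
```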
The image processing device 2 and the imaging apparatus 1 may be two components on the same device (for example, in a room, gesture data of a user is acquired by a smart mobile device such as a mobile phone, and image processing is performed), or may be two independent devices. In the case where the image processing apparatus 2 and the imaging device 1 are two independent apparatuses, the communication means of the two may be wired communication (for example, attitude data of a performer is acquired indoors and image processing is performed by an image processing apparatus such as a computer), or wireless communication. In the case of wireless communication, the communication system may be a communication system based on a local area network (for example, the posture data of the user is collected indoors by a mobile phone or the like and transmitted to a computer for image processing), or may be a communication system based on the internet (for example, the posture data of the user is collected indoors by a mobile phone and then transmitted to an application server for image processing based on a client application).
For example, fig. 2 is a schematic view of an optional video-based gesture data capturing system provided in an embodiment of the present application, and as shown in fig. 2, the video data acquired by the server 5 may be gesture motion data of a motion of an object to be captured, which is acquired in real time by using a camera device such as a camera of a mobile phone, or may be video data of a gesture motion, which includes a motion of an object to be captured, uploaded by a user through a device such as a mobile phone 3 or a computer 4.
In any way, after the image processing device 2 acquires the video data of the human body motion acquired by the camera device 1 and decomposes the video data into a plurality of pictures, the image processing device 2 may extract the two-dimensional coordinate data of each joint point (for example, the left wrist joint point of the human body shown in fig. 3) on the subject to be captured included in each picture obtained by decomposing the video data based on the first neural network model, determine the three-dimensional coordinate data of each joint point in the preset three-dimensional space according to the two-dimensional coordinate data of each joint point on the subject to be captured included in each picture obtained by decomposing the video data based on the second neural network model, and finally generate a file in a first predetermined format according to the three-dimensional coordinate data of each joint point on the subject to be captured included in each picture obtained by decomposing the video data, the file in the first preset format stores three-dimensional posture data of the object to be captured for making corresponding posture motions in the video data.
It is easy to note that the first neural network model and the second neural network model are trained in advance using machine learning. The first neural network model is used to estimate the positions of the human joints in a picture and obtain 2D coordinate data; the second neural network model is used to convert the 2D joint coordinate data into corresponding 3D coordinate data. In this way, the human motion in a video is converted into 3D motion data by exploiting the strong learning and inference capabilities of neural networks. A neural network trained on a large number of real videos and pictures can effectively recognize human poses in a variety of environments. The embodiment of the invention handles occlusion of the human body, including self-occlusion, well, and places no restriction on the shooting environment or camera position, so it can process a wide range of video material.
Alternatively, the first neural network model may be a convolutional neural network model established based on a residual network, and the second neural network model may be a deep neural network model.
It is easy to note that the first neural network model needs to be trained before it is used to extract the two-dimensional coordinate data of at least one joint point on the subject to be captured contained in each picture. Specifically, a picture sample library containing a plurality of sample pictures is collected, at least one joint point (for example, a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint and a right ankle joint) on a subject to be captured contained in each sample picture in the picture sample library is labeled, and a first neural network model is obtained through machine learning training by using the plurality of sample pictures in the picture sample library and the at least one joint point labeled on the sample pictures.
After the first neural network model is obtained through training, the two-dimensional coordinate data of at least one joint point on the object to be captured, which is contained in each picture, can be extracted by using the first neural network model, and each picture obtained through decomposing the video data is input into the first neural network model by the image processing device 2, so as to output at least one confidence map corresponding to each picture. And determining two-dimensional coordinate data of at least one joint point on the body of the object to be captured in each picture according to at least one confidence map corresponding to each picture, wherein the coordinate of the pixel point with the maximum brightness in each confidence map corresponds to the coordinate of one joint point on the body of the object to be captured.
Optionally, the first neural network model may be trained based on the following objective function:
E = Σ ||H'j(x, y) − Hj(x, y)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y) denotes the predicted coordinates of each joint point on the sample picture, Hj(x, y) denotes the annotated coordinates of the joint points on the sample picture, N represents the number of training samples, and j is a natural number.
When the two-dimensional coordinate data of at least one joint point on the object to be captured in each picture is determined according to at least one confidence map corresponding to each picture, the two-dimensional coordinates of the joint point corresponding to each confidence map can be determined through the following Gaussian response map formula:
G(x, y) = exp(−((x − x₀)² + (y − y₀)²) / (2σ²))
wherein G(x, y) represents the Gaussian distribution value at each pixel point on the confidence map; (x₀, y₀) represents the coordinates of the joint point corresponding to the confidence map; σ represents the standard deviation of the Gaussian distribution; and (x, y) represents the coordinates of each pixel point on each confidence map.
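A minimal sketch of rendering a Gaussian response map of this form around an annotated joint point is shown below; the map size handling and the default σ value are illustrative assumptions.

```python
import numpy as np

def gaussian_confidence_map(center_x, center_y, map_w, map_h, sigma=2.0):
    """Build one confidence map whose brightest pixel lies at the joint point."""
    xs, ys = np.meshgrid(np.arange(map_w), np.arange(map_h))
    d2 = (xs - center_x) ** 2 + (ys - center_y) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))  # G(x, y) as in the formula above
```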
In addition, it should be noted that, before determining the three-dimensional coordinate data of each joint point in the preset three-dimensional space according to the two-dimensional coordinate data of each joint point on the subject to be captured, which is included in each picture obtained by decomposing the video data, using the second neural network model, the second neural network model also needs to be trained. The training process is as follows: collecting a picture sample library comprising a plurality of sample pictures, and labeling at least one joint point (such as a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint and a right ankle joint) on a body of a to-be-captured object contained in each sample picture in the picture sample library; acquiring three-dimensional coordinate data of each joint point on the body of the object to be captured in a local coordinate system based on the two-dimensional coordinates of each joint point marked on each sample picture, wherein the local coordinate system is a coordinate system determined by the mass center of the object to be captured; and obtaining a second neural network model through machine learning training by using the two-dimensional coordinate data of at least one joint point marked on each sample picture in the picture sample library and the corresponding three-dimensional coordinate data in the local coordinate system.
Optionally, the second neural network model may be trained based on the following objective function:
E = Σ ||H'j(x, y, z) − Hj(x, y, z)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y, z) denotes the predicted three-dimensional coordinates of each joint point on the sample picture, Hj(x, y, z) denotes the annotated three-dimensional coordinates of the joint points on the sample picture, N represents the number of training samples, and j is a natural number.
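Both objective functions above are sums of squared differences between predicted and annotated joint data; the sketch below expresses such an L2 objective in TensorFlow, with the tensor layout assumed for illustration.

```python
import tensorflow as tf

def l2_objective(predicted, annotated):
    """E = sum of squared errors between prediction and annotation over all samples.

    predicted / annotated: tensors of shape (N, num_joints, D), with D = 2 for the
    first network's joint coordinates and D = 3 for the second network's
    (an assumed layout for this sketch).
    """
    return tf.reduce_sum(tf.square(predicted - annotated))
```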
The video-based gesture data capture scheme provided by the present application is described in detail below with reference to fig. 3(a) and 3(b). Fig. 3(a) shows a single picture decomposed from a video of a high jump athlete. A Convolutional Neural Network (CNN) can infer from the picture shown in fig. 3(a) the positions of the human joint points in the picture, such as the head joint, neck joint, left shoulder joint, right shoulder joint, left elbow joint, right elbow joint, left pelvis joint, right pelvis joint, left knee joint, right knee joint, left ankle joint, and right ankle joint shown in fig. 3(b).
Fig. 4(a) shows the 2D human joint data corresponding to fig. 3(b). A Deep Neural Network (DNN) then reasons over this 2D joint data to obtain the 3D coordinate data of the human joints in a preset three-dimensional space, as shown in fig. 4(b).
In order to train the convolutional neural network, pictures containing the human body in various environments can be collected as samples, and the human joint points in the pictures are labeled. The annotated information covers 14 joint points, including the head joint, neck joint, left shoulder joint, right shoulder joint, left elbow joint, right elbow joint, left pelvis joint, right pelvis joint, left knee joint, right knee joint, left ankle joint, and right ankle joint, and the network that produces the human joint confidence maps is built with reference to a standard residual network. The confidence maps comprise 14 maps, each of which contains a Gaussian response map at the position of the corresponding joint in the sample picture. It is easy to note that a confidence map is a 2D picture, either as large as the input picture or at a smaller scale; the value of each pixel indicates the probability that the corresponding position in the input picture contains the human joint.
As an alternative, 100,000 pictures can be used as samples and the network trained for 200 rounds. The coordinates of the brightest point in each confidence map are then extracted to obtain the 2D positions of the human joint points, which serve as the training input for the deep neural network.
In order to train the deep neural network, traditional optical motion capture equipment can be used in a studio with professional actors to capture human poses and obtain the position data of the human joints in 3D space. The data is continuous motion data, and the number and definition of the joints are consistent with the 2D case. As an alternative embodiment, 10 cameras may be arranged around the actor with the lenses pointing toward the actor in the middle.
The input of the deep neural network is the 2D joint data (X, Y), and the output is the corresponding three-dimensional joint position (X, Y, Z) of the human body. Because the deep neural network has strong feature extraction capability, the input 2D data is expanded in dimension through hidden layers and mapped to 3D coordinates. The L2 loss function is again used for training, and ReLU is used as the activation function. Because the normal human body is bilaterally symmetric, a bone-length constraint is added so that the lengths of the left and right bones remain as consistent as possible. The deep network yields the 3D pose of the human body, but by itself it provides no displacement data of the human body in 3D space, the motion lacks continuity, and playback exhibits jitter.
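A minimal Keras sketch of a fully connected lifting network of this kind is given below, assuming 14 joints; the layer widths, optimizer, and use of mean squared error as the L2-style loss are illustrative assumptions rather than values specified by the application.

```python
import tensorflow as tf

NUM_JOINTS = 14  # assumed joint count, matching the 14 annotated joints mentioned above

def build_lifting_network():
    """Fully connected network mapping flattened 2D joint coordinates to 3D coordinates."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu",
                              input_shape=(NUM_JOINTS * 2,)),   # dimension expansion of (X, Y)
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(NUM_JOINTS * 3),                  # flattened (X, Y, Z) per joint
    ])
    # L2-style regression loss; a bone-length symmetry penalty could be added as an
    # extra loss term (not shown in this sketch).
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

# Hypothetical usage: joints_2d has shape (N, 28), joints_3d_local has shape (N, 42).
# model = build_lifting_network()
# model.fit(joints_2d, joints_3d_local, epochs=200, batch_size=64)
```

In this sketch the network regresses joints in the local (centroid-centered) coordinate system only; recovering the subject's displacement in the preset three-dimensional space is the subject of the depth formula discussed next.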
Therefore, as an alternative embodiment, the three-dimensional coordinate data of each joint point on the object to be captured in the preset three-dimensional space can be determined, based on the position data of each joint point on the at least one picture, from the three-dimensional coordinate data of each joint point in the local coordinate system by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the joint information output by the second neural network model; P̄ is the average of all joint information output by the second neural network model; Kᵢ is the joint information output by the first neural network model; and K̄ is the average of all joint information output by the first neural network model.
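The sketch below computes the approximate depth z as the ratio of the spreads of the 3D and 2D joints about their means, which is one plausible reading of the variable definitions above; the exact expression and the helper name approximate_depth are assumptions of this example.

```python
import numpy as np

def approximate_depth(joints_3d_local, joints_2d):
    """Estimate the depth z of the subject from the spread ratio of 3D and 2D joints.

    joints_3d_local: (num_joints, 3) output of the second network, local coordinates.
    joints_2d: (num_joints, 2) output of the first network, picture coordinates.
    """
    p_centered = joints_3d_local - joints_3d_local.mean(axis=0)  # Pi - P_mean
    k_centered = joints_2d - joints_2d.mean(axis=0)              # Ki - K_mean
    return np.sqrt(np.sum(p_centered ** 2) / np.sum(k_centered ** 2))
```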
Optionally, in a post-processing stage, the 3D body is processed by an iterative method to make the motion smoother and to eliminate jitter at the ankles, so that the feet stay planted on the ground.
Further, the generated human joint 3D data can be converted into an FBX file (i.e., a file in a FilmBox software format) commonly used by game art personnel. Since video is a continuous human motion, the information stored in FBX includes motion information on a frame-by-frame basis.
Preferably, the FBX file can also be converted into a BIP human body motion file (BIP, short for Biped, is a format specific to 3ds Max Character Studio and is used for producing animation and 3D files).
FBX is a cross-platform, freely usable three-dimensional creation and exchange format produced by Autodesk; through FBX files, users can access the three-dimensional files of most 3D vendors. The FBX file format supports all major three-dimensional data elements as well as two-dimensional, audio and video media elements.
The BIP file is a commonly used motion file for the Biped (footstep) controller and is widely used for animation and 3D production. BIP is a format specific to 3ds Max Character Studio; it can be opened with NaturalMotion endorphin (motion simulation software) or with software such as MotionBuilder (3D character animation software). The BIP file is a character animation file commonly used by game artists.
Based on the standard skeleton in 3ds Max, the rotation direction and length of the human joints are calculated. Importing these data into 3ds Max yields the correct BIP skeleton and animation effects.
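As a small illustration of that calculation, the sketch below derives a bone's length and unit direction vector from two joint positions; the function name and calling convention are assumptions of this example.

```python
import numpy as np

def bone_length_and_direction(parent_joint, child_joint):
    """Return the length of a bone and its unit direction vector from parent to child."""
    vector = np.asarray(child_joint, dtype=float) - np.asarray(parent_joint, dtype=float)
    length = np.linalg.norm(vector)
    direction = vector / length if length > 0 else vector
    return length, direction

# Hypothetical usage with 3D joint positions in the preset three-dimensional space:
# length, direction = bone_length_and_direction(shoulder_xyz, elbow_xyz)
```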
As an optional implementation, the development of the application was completed using TensorFlow, and the running environment is a PC running a Linux operating system.
An embodiment of the present application further provides a method for capturing gesture data based on video, as shown in fig. 5, including the following steps:
step S501, video data is obtained, wherein the video data comprises gesture motion data of the motion of the object to be captured.
Specifically, the video data may be a video recorded in real time and containing the gesture motion data of the motion of the object to be captured, or may also be a video recorded or created in advance and containing the gesture motion data of the motion of the object to be captured. The object to be captured may be a person or an object moving in a video, and the embodiment of the present application is described by taking a person as an example.
Step S502, decomposing the video data into at least one picture; wherein each picture corresponds to a frame of image of the video data.
Specifically, after video data including gesture motion data of the motion of the object to be captured is acquired, the video data may be decomposed into a plurality of pictures by frames. Wherein, each picture comprises a gesture motion of the object to be captured.
Step S503, based on the first neural network model, extracting the two-dimensional coordinate data of at least one mark point on the body of the object to be captured contained in each picture obtained by decomposing the video data.
Specifically, each picture obtained by decomposing video data is input into a first neural network model, at least one confidence map corresponding to each picture is output, and the coordinates of the pixel point with the maximum brightness in each confidence map correspond to the coordinates of one mark point on the object to be captured; and determining two-dimensional coordinate data of at least one marking point on the object to be captured in each picture according to at least one confidence map corresponding to each picture.
Before each image obtained by decomposing the video data is input into the first neural network model and at least one confidence map corresponding to each image is output, the first neural network model also needs to be trained, and the specific training process is discussed as above and is not repeated here.
Optionally, the first neural network model is a convolutional neural network model.
Step S504, based on the second neural network model, determining three-dimensional coordinate data corresponding to each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured, which is contained in each picture obtained by decomposing the video data, wherein the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
Optionally, the second neural network model is a deep neural network model. The input is 2D joint data (X, Y) and the output is three-dimensional data in a local coordinate system. The process of training the second neural network model is discussed above, and is not described here again.
Step S505, based on the position data of each mark point on the object to be captured on at least one picture, according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system, determining the corresponding three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space.
Specifically, in order to obtain displacement data of the object to be captured in the 3D space, the position of the object to be captured in the 3D space can be derived from the position of the joint on the 2D picture by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the marker point coordinate information output by the second neural network model; P̄ is the average value of the coordinate information of all the marker points output by the second neural network model; Kᵢ is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model.
The joint position (X, Y, Z) of each joint in a preset three-dimensional space (e.g., a three-dimensional space where a shooting camera is located) can be calculated according to the three-dimensional coordinate data of each joint in the local coordinate system by the above formula.
Step S506, generating a file with a first preset format according to the three-dimensional coordinate data of each mark point on the body of the object to be captured in a preset three-dimensional space, wherein the three-dimensional coordinate data of each mark point on each picture is obtained by decomposing the video data, and the three-dimensional posture data of the object to be captured, which makes the corresponding posture action in the video data, is stored in the file with the first preset format.
Specifically, the file in the first predetermined format may be an FBX file, and the generated human body joint 3D data is converted into the FBX file, which may be used by game artists.
Step S507, converting the file in the first predetermined format into a file in a second predetermined format, where the file in the second predetermined format is used for producing a three-dimensional animation.
Specifically, the file in the second predetermined format may be a BIP file, and the FBX file is converted into a BIP file, which may be used for animation and 3D production.
Through the scheme disclosed in the above steps S501 to S507, the strong learning and derivation capabilities of the neural network are utilized to convert the human body motion in a segment of video into 3D motion data through the first neural network model and the second neural network model, and finally, the 3D motion data is converted into an FBX file, optionally, the FBX file can be further converted into a BIP file required by a 3D motion maker, so that the 3D motion maker can use the captured human body gesture data in scenes such as 3D animation, game making and the like.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A method for video-based pose data capture, the method comprising the steps of:
acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured;
decomposing the video data into at least one picture; each picture corresponds to one frame of image of the video data;
extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured, which is contained in each picture obtained by decomposing the video data, based on a first neural network model;
determining three-dimensional coordinate data corresponding to each mark point in a local coordinate system according to two-dimensional coordinate data of each mark point on the object to be captured, wherein the two-dimensional coordinate data are contained in each picture obtained by decomposing the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured;
determining the three-dimensional coordinate data corresponding to each marking point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marking point in the local coordinate system based on the position data of each marking point on the object to be captured on the at least one picture by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the marker point coordinate information output by the second neural network model; P̄ is the average value of the coordinate information of all the marker points output by the second neural network model; Kᵢ is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model.
2. The method according to claim 1, wherein extracting two-dimensional coordinate data of at least one marker point on the subject to be captured included in each picture decomposed from the video data based on a first neural network model comprises:
inputting each picture obtained by decomposing the video data into a first neural network model, and outputting at least one confidence map corresponding to each picture, wherein the coordinate of a pixel point with the maximum brightness in each confidence map corresponds to the coordinate of a mark point on the object to be captured;
and determining two-dimensional coordinate data of at least one marking point on the object to be captured in each picture according to at least one confidence map corresponding to each picture.
3. The method of claim 2, wherein before inputting each picture obtained by decomposing the video data into the first neural network model and outputting at least one confidence map corresponding to each picture, the method further comprises:
acquiring a picture sample library; wherein the picture sample library comprises a plurality of sample pictures;
marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library;
and obtaining the first neural network model through machine learning training by using a plurality of sample pictures in the picture sample library and at least one mark point marked on the sample pictures.
4. The method for capturing pose data based on video according to claim 2, wherein determining the two-dimensional coordinate data of at least one mark point on the object to be captured contained in each picture according to at least one confidence map corresponding to each picture comprises:
determining the two-dimensional coordinates of the marking point corresponding to each confidence map by the following Gaussian response map formula:
G(x, y) = exp(−((x − x₀)² + (y − y₀)²) / (2σ²))
wherein G(x, y) represents the Gaussian distribution value at each pixel point on the confidence map; (x₀, y₀) represents the coordinates of the marker point corresponding to the confidence map; σ represents the standard deviation of the Gaussian distribution; and (x, y) represents the coordinates of each pixel point on each confidence map.
5. The method of claim 3, wherein the obtaining the first neural network model through machine learning training using a plurality of sample pictures in the picture sample library and at least one marked point marked on the sample pictures comprises:
training the first neural network model based on an objective function:
E = Σ ||H'j(x, y) − Hj(x, y)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y) denotes the predicted coordinates of each marker point on the sample picture, Hj(x, y) denotes the annotated coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
6. The method of claim 1, wherein before determining the three-dimensional coordinate data of each marker point in the local coordinate system based on the two-dimensional coordinate data of each marker point on the subject to be captured contained in each picture decomposed from the video data based on the second neural network model, the method further comprises:
acquiring a picture sample library; wherein the picture sample library comprises a plurality of sample pictures;
marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library;
acquiring three-dimensional coordinate data of each marking point on the body of the object to be captured in a local coordinate system based on the two-dimensional coordinates of each marking point marked on each sample picture;
and obtaining the second neural network model through machine learning training by using the two-dimensional coordinate data of at least one marking point marked on each sample picture in the picture sample library and the corresponding three-dimensional coordinate data in the local coordinate system.
7. The method of claim 6, wherein the obtaining the second neural network model through machine learning training using two-dimensional coordinates and corresponding three-dimensional coordinates of at least one marker point marked on each sample picture in the picture sample library comprises:
training the second neural network model based on an objective function:
E = Σ ||H'j(x, y, z) − Hj(x, y, z)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y, z) denotes the predicted three-dimensional coordinates of each marker point on the sample picture, Hj(x, y, z) denotes the annotated three-dimensional coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
8. The method of claim 1, wherein after generating the three-dimensional pose data of the motion of the object to be captured according to the three-dimensional coordinate data of each marker point on the object to be captured in the preset three-dimensional space, which is included in each picture obtained by decomposing the video data, the method further comprises:
and generating a file in a first predetermined format according to the three-dimensional coordinate data of each mark point on the object to be captured in the preset three-dimensional space contained in each picture obtained by decomposing the video data, wherein the file in the first predetermined format stores three-dimensional posture data of the object to be captured making the corresponding posture motions in the video data.
9. The method of claim 8, wherein, after the file in the first preset format is generated based on the three-dimensional coordinate data, in the preset three-dimensional space, of each marking point on the object to be captured contained in each picture obtained by decomposing the video data, the method further comprises:
converting the file in the first preset format into a file in a second preset format, wherein the file in the second preset format is used for making a three-dimensional animation; the file in the first preset format is an FBX file, and/or the file in the second preset format is a BIP file.
10. The method of any of claims 1-9, wherein the first neural network model is a convolutional neural network model based on a residual network, and the second neural network model is a deep neural network model.
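Claim 10 characterises the first model as a residual-network-based convolutional model producing per-marker confidence maps. A toy PyTorch sketch of that shape (two small residual blocks ending in one confidence map per marker point; every layer size here is an illustrative assumption, not the patented architecture) might look like:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Minimal residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(x + y)

class HeatmapNet(nn.Module):
    """Toy first model: RGB picture in, one confidence map per marker point out."""
    def __init__(self, num_markers=17, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.head = nn.Conv2d(channels, num_markers, 1)

    def forward(self, x):
        return self.head(self.blocks(torch.relu(self.stem(x))))

net = HeatmapNet()
maps = net(torch.randn(1, 3, 64, 64))
print(maps.shape)  # torch.Size([1, 17, 64, 64])
```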
11. A video-based pose data capture system, the system comprising:
the camera device is used for acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured;
an image processing device, in communication with the camera device, configured to acquire the video data, decompose the video data into at least one picture, extract, based on a first neural network model, two-dimensional coordinate data of at least one marker point on the body of the object to be captured contained in each picture obtained by decomposing the video data, determine, based on a second neural network model, the corresponding three-dimensional coordinate data of each marker point in a local coordinate system according to the two-dimensional coordinate data of each marker point on the object to be captured contained in each picture obtained by decomposing the video data, and determine, through the following formula and based on the position data, on the at least one picture, of each marker point on the object to be captured, the three-dimensional coordinate data of each marker point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marker point in the local coordinate system:
z = (Σ_i ‖X_i − X̄‖) / (Σ_i ‖K_i − K̄‖)

wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; X_i is the marker point coordinate information output by the second neural network model; X̄ is the average value of the coordinate information of all the marker points output by the second neural network model; K_i is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model;
wherein each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
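As a numerical sketch of the placement step in claim 11, under the assumption (the equation image is not reproduced in this text) that z is taken as the ratio between the spread of the second model's local 3D marker coordinates about their mean and the spread of the first model's 2D marker coordinates about their mean; the function names and sample values are illustrative:

```python
import numpy as np

def approximate_depth(markers_3d_local, markers_2d):
    """Assumed ratio-of-spreads estimate of the object's depth z:
    sum_i ||X_i - mean(X)|| divided by sum_i ||K_i - mean(K)||, where X_i are the
    second model's local 3D marker coordinates and K_i the first model's 2D ones."""
    x_spread = np.linalg.norm(markers_3d_local - markers_3d_local.mean(axis=0), axis=1).sum()
    k_spread = np.linalg.norm(markers_2d - markers_2d.mean(axis=0), axis=1).sum()
    return x_spread / k_spread

def place_in_preset_space(markers_3d_local, markers_2d):
    """Shift the centroid-relative 3D markers along the depth axis by the
    estimated z (a simplified illustration of the placement in the preset space)."""
    z = approximate_depth(markers_3d_local, markers_2d)
    return markers_3d_local + np.array([0.0, 0.0, z])

local = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.0], [-0.1, -0.2, 0.1]])
pixels = np.array([[320.0, 240.0], [340.0, 280.0], [300.0, 200.0]])
print(approximate_depth(local, pixels))
```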
12. A video-based pose data capture system, the system comprising:
the client device is used for acquiring and uploading video data, wherein the video data comprises gesture motion data of the motion of an object to be captured;
a server, in communication with the client device, configured to receive the video data uploaded by the client device, decompose the video data into at least one picture, extract, based on a first neural network model, two-dimensional coordinate data of at least one marker point on the body of the object to be captured contained in each picture obtained by decomposing the video data, determine, based on a second neural network model, the corresponding three-dimensional coordinate data of each marker point in a local coordinate system according to the two-dimensional coordinate data of each marker point on the object to be captured contained in each picture obtained by decomposing the video data, and determine, through the following formula and based on the position data, on the at least one picture, of each marker point on the object to be captured, the three-dimensional coordinate data of each marker point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marker point in the local coordinate system:
z = (Σ_i ‖X_i − X̄‖) / (Σ_i ‖K_i − K̄‖)

wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; X_i is the marker point coordinate information output by the second neural network model; X̄ is the average value of the coordinate information of all the marker points output by the second neural network model; K_i is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model;
wherein each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
CN201810895934.4A 2018-08-08 2018-08-08 Video-based attitude data capturing method and system Active CN109145788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810895934.4A CN109145788B (en) 2018-08-08 2018-08-08 Video-based attitude data capturing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810895934.4A CN109145788B (en) 2018-08-08 2018-08-08 Video-based attitude data capturing method and system

Publications (2)

Publication Number Publication Date
CN109145788A CN109145788A (en) 2019-01-04
CN109145788B true CN109145788B (en) 2020-07-07

Family

ID=64792037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810895934.4A Active CN109145788B (en) 2018-08-08 2018-08-08 Video-based attitude data capturing method and system

Country Status (1)

Country Link
CN (1) CN109145788B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109529350A (en) * 2018-12-27 2019-03-29 北京云舶在线科技有限公司 A kind of action data processing method and its device applied in game
CN109758756B (en) * 2019-02-28 2021-03-23 国家体育总局体育科学研究所 Gymnastics video analysis method and system based on 3D camera
CN110163112B (en) * 2019-04-25 2021-03-19 沈阳图为科技有限公司 Examinee posture segmentation and smoothing method
CN110334574A (en) * 2019-04-26 2019-10-15 武汉理工大学 A method of automatically extracting traffic accident key frame in traffic video
CN110796077A (en) * 2019-10-29 2020-02-14 湖北民族大学 Attitude motion real-time detection and correction method
CN111208783B (en) * 2019-12-30 2021-09-17 深圳市优必选科技股份有限公司 Action simulation method, device, terminal and computer storage medium
CN111476291B (en) * 2020-04-03 2023-07-25 南京星火技术有限公司 Data processing method, device and storage medium
CN111638791B (en) * 2020-06-03 2021-11-09 北京火山引擎科技有限公司 Virtual character generation method and device, electronic equipment and storage medium
CN111680758B (en) * 2020-06-15 2024-03-05 杭州海康威视数字技术股份有限公司 Image training sample generation method and device
CN111798547B (en) * 2020-06-22 2021-05-28 完美世界(北京)软件科技发展有限公司 Animation mixed space subdivision method, device, equipment and readable medium
CN112818898B (en) * 2021-02-20 2024-02-20 北京字跳网络技术有限公司 Model training method and device and electronic equipment
CN113146634A (en) * 2021-04-25 2021-07-23 达闼机器人有限公司 Robot attitude control method, robot and storage medium
CN113420719B (en) * 2021-07-20 2022-07-22 北京百度网讯科技有限公司 Method and device for generating motion capture data, electronic equipment and storage medium
CN113989928B (en) * 2021-10-27 2023-09-05 南京硅基智能科技有限公司 Motion capturing and redirecting method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377812A (en) * 2008-07-11 2009-03-04 北京航空航天大学 Method for recognizing position and attitude of space plane object
CN105631861A (en) * 2015-12-21 2016-06-01 浙江大学 Method of restoring three-dimensional human body posture from unmarked monocular image in combination with height map
CN106204625A (en) * 2016-07-27 2016-12-07 大连理工大学 A kind of variable focal length flexibility pose vision measuring method
CN107392097A (en) * 2017-06-15 2017-11-24 中山大学 A kind of 3 D human body intra-articular irrigation method of monocular color video
CN107492121A (en) * 2017-07-03 2017-12-19 广州新节奏智能科技股份有限公司 A kind of two-dimension human body bone independent positioning method of monocular depth video
CN108377368A (en) * 2018-05-08 2018-08-07 扬州大学 A kind of one master and multiple slaves formula intelligent video monitoring apparatus and its control method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102591552B1 (en) * 2015-08-21 2023-10-18 매직 립, 인코포레이티드 Eyelid shape estimation using eye pose measurement


Also Published As

Publication number Publication date
CN109145788A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109145788B (en) Video-based attitude data capturing method and system
KR101295471B1 (en) A system and method for 3D space-dimension based image processing
Menache Understanding motion capture for computer animation and video games
US8786680B2 (en) Motion capture from body mounted cameras
US20230008567A1 (en) Real-time system for generating 4d spatio-temporal model of a real world environment
CN104915978A (en) Realistic animation generation method based on Kinect
CN109035415B (en) Virtual model processing method, device, equipment and computer readable storage medium
CN112037310A (en) Game character action recognition generation method based on neural network
Sanna et al. A kinect-based interface to animate virtual characters
CN113822970A (en) Live broadcast control method and device, storage medium and electronic equipment
CN109529350A (en) A kind of action data processing method and its device applied in game
CN113989928B (en) Motion capturing and redirecting method
Hao et al. Cromosim: A deep learning-based cross-modality inertial measurement simulator
Eom et al. Data‐Driven Reconstruction of Human Locomotion Using a Single Smartphone
Lin et al. Temporal IK: Data-Driven Pose Estimation for Virtual Reality
Zeng et al. Motion capture and reconstruction based on depth information using Kinect
Borodulina Application of 3D human pose estimation for motion capture and character animation
Kelly et al. Motion synthesis for sports using unobtrusive lightweight body‐worn and environment sensing
Verma et al. Motion capture using computer vision
US20220028144A1 (en) Methods and systems for generating an animation control rig
US11450054B2 (en) Method for operating a character rig in an image-generation system using constraints on reference nodes
Sutopo et al. Synchronization of dance motion data acquisition using motion capture
Akinjala et al. Animating human movement & gestures on an agent using Microsoft kinect
US11170553B1 (en) Methods and systems for generating an animation control rig
Törmänen Comparison of entry level motion capture suits aimed at indie game production

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant