CN109145788B - Video-based attitude data capturing method and system - Google Patents

Video-based attitude data capturing method and system

Info

Publication number
CN109145788B
Authority
CN
China
Prior art keywords
picture
data
captured
neural network
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810895934.4A
Other languages
Chinese (zh)
Other versions
CN109145788A (en)
Inventor
陈敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yungoal Tech Co ltd
Original Assignee
Yungoal Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yungoal Tech Co ltd filed Critical Yungoal Tech Co ltd
Priority to CN201810895934.4A priority Critical patent/CN109145788B/en
Publication of CN109145788A publication Critical patent/CN109145788A/en
Application granted granted Critical
Publication of CN109145788B publication Critical patent/CN109145788B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/08 Indexing scheme for image data processing or generation, in general involving all processing steps from image acquisition to 3D model generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Abstract

The application discloses a video-based attitude data capturing method and a video-based attitude data capturing system, wherein the method comprises the following steps: acquiring video data, and decomposing the video data into at least one picture; extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured, which is contained in each picture obtained by decomposing the video data, based on the first neural network model; based on the second neural network model, determining corresponding three-dimensional coordinate data of each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured; and determining the corresponding three-dimensional coordinate data of each marking point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marking point in the local coordinate system based on the position data of each marking point on the object to be captured on at least one picture. The method and the device achieve the purpose of extracting the gesture data from the video containing the motion gesture actions of the object to be captured.

Description

Video-based attitude data capturing method and system
Technical Field
The application relates to the field of machine vision, in particular to a video-based attitude data capturing method and system.
Background
Motion capture refers to recording the motion of an object in three-dimensional space and reproducing its motion trajectory in a digital model. For example, an animation sequence can be generated by detecting and recording the motion trajectories of a performer's limbs in three-dimensional space, capturing the performer's pose motions, converting the captured motions into a digital abstraction, and driving a virtual model in a software application to make the same motions as the performer. In recent years, motion capture technology has been widely used in fields such as virtual reality, three-dimensional games, and ergonomics.
The conventional motion capture techniques mainly include the following three types:
First, optical motion capture techniques. This technique requires a dedicated environment free of significant interference, and professional performers must wear optical motion capture devices while their motion is captured. Although optical motion capture yields highly accurate results, it demands considerable space, equipment, and personnel, and is expensive to use.
Second, inertial motion capture techniques. This technique requires professional performers to wear a variety of motion capture devices; because the devices are strapped to the human joints, they can sample the velocity and acceleration of the motion to infer the positions and movements of the joints. However, limited by device accuracy, this technique captures motion less well than optical motion capture and cannot solve the problem of keeping the virtual character's heels planted on the ground. Like optical capture, inertial motion capture also requires equipment and personnel.
Third, end-to-end 3D pose data generation. This approach requires acquiring 3D human body data in real environments, which is difficult and requires additional equipment, and directly regressing a 3D position from a picture used as the network input is hard. To compensate for insufficient data, this scheme augments training by replacing backgrounds and clothing, and the resulting capture effect is still not ideal.
Analysis shows that, in the prior art, video-based gesture data capture requires the support of sites, equipment and personnel, which affects working efficiency, incurs high costs, and yields poor capture results.
Disclosure of Invention
In order to solve the above problem, the present application provides a video-based gesture data capturing method, which includes the following steps: acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured; decomposing video data into at least one picture; wherein each picture corresponds to a frame of image of the video data; extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured, which is contained in each picture obtained by decomposing the video data, based on the first neural network model; determining three-dimensional coordinate data corresponding to each mark point in a local coordinate system according to two-dimensional coordinate data of each mark point on the object to be captured, which is contained in each picture obtained by decomposing video data, based on a second neural network model, wherein the local coordinate system is a coordinate system determined by the centroid of the object to be captured; and determining the corresponding three-dimensional coordinate data of each marking point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marking point in the local coordinate system based on the position data of each marking point on the object to be captured on at least one picture.
In one example, based on a first neural network model, extracting two-dimensional coordinate data of at least one mark point on a subject to be captured contained in each picture obtained by decomposing the video data, comprises: inputting each picture obtained by decomposing video data into a first neural network model, and outputting at least one confidence map corresponding to each picture, wherein the coordinate of a pixel point with the maximum brightness in each confidence map corresponds to the coordinate of a mark point on an object to be captured; and determining two-dimensional coordinate data of at least one marking point on the object to be captured in each picture according to at least one confidence map corresponding to each picture.
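As an illustration of how the brightest pixel of each confidence map yields a marker coordinate, the following sketch locates that pixel with NumPy; the array shapes, the helper name extract_2d_keypoints, and the assumption that the maps may be smaller than the input picture are illustrative choices, not details specified by the application.

```python
import numpy as np

def extract_2d_keypoints(confidence_maps, input_size, map_size):
    """Pick the brightest pixel of each confidence map as the 2D marker coordinate.

    confidence_maps: array of shape (num_markers, map_h, map_w), one map per marker.
    input_size / map_size: used to rescale map coordinates back to picture coordinates,
    since the maps may be smaller than the input picture (an assumption of this sketch).
    """
    scale = input_size / float(map_size)
    keypoints = []
    for cmap in confidence_maps:
        y, x = np.unravel_index(np.argmax(cmap), cmap.shape)  # brightest pixel
        keypoints.append((x * scale, y * scale))              # back to picture scale
    return np.array(keypoints)  # shape (num_markers, 2)
```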
In one example, before each picture obtained by decomposing the video data is input into the first neural network model and at least one confidence map corresponding to each picture is output, the method further includes: acquiring a picture sample library; the picture sample library comprises a plurality of sample pictures; marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library; and obtaining a first neural network model through machine learning training by using a plurality of sample pictures in the picture sample library and at least one marked point marked on the sample pictures.
Alternatively, the mark point may be an articulation point on the subject to be captured, the articulation point including at least one of: a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint, and a right ankle joint.
In one example, determining two-dimensional coordinate data of at least one mark point on a subject to be captured in each picture according to at least one confidence map corresponding to each picture comprises: determining the two-dimensional coordinates of the marking point corresponding to each confidence map by the following Gaussian response map formula:
G(x, y) = exp(−((x − x₀)² + (y − y₀)²) / (2σ²))
wherein G(x, y) represents the Gaussian distribution value at each pixel point on the confidence map; (x₀, y₀) represents the coordinates of the marker point corresponding to the confidence map; σ represents the standard deviation of the Gaussian distribution; and (x, y) represents the coordinates of each pixel point on each confidence map.
In one example, the obtaining of the first neural network model through machine learning training using a plurality of sample pictures in a picture sample library and at least one marked point marked on the sample pictures includes: training a first neural network model based on an objective function:
E = Σ ||H'j(x, y) − Hj(x, y)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y) denotes the predicted coordinates of each marker point on the sample picture, Hj(x, y) denotes the annotated coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
In one example, before determining the three-dimensional coordinate data of each marker point in the preset three-dimensional space according to the two-dimensional coordinate data of each marker point on the to-be-captured object contained in each picture obtained by decomposing the video data based on the second neural network model, the method further comprises: acquiring a picture sample library; the picture sample library comprises a plurality of sample pictures; marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library; acquiring three-dimensional coordinate data of each marking point on the object to be captured in a local coordinate system based on the two-dimensional coordinates of each marking point marked on each sample picture, wherein the local coordinate system is a coordinate system determined by the mass center of the object to be captured; and obtaining a second neural network model through machine learning training by using the two-dimensional coordinate data of at least one marking point marked on each sample picture in the picture sample library and the corresponding three-dimensional coordinate data in the local coordinate system.
In one example, determining, based on the position of each marker point on the object to be captured on the at least one picture, the three-dimensional coordinate data corresponding to each marker point on the object to be captured in the preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marker point in the local coordinate system includes: determining the three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system based on the position data of each mark point on at least one picture on the object to be captured by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the marker point coordinate information output by the second neural network model; P̄ is the average value of the coordinate information of all the marker points output by the second neural network model; Kᵢ is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model.
In one example, the two-dimensional coordinates and corresponding three-dimensional coordinates of at least one marked point marked on each sample picture in the picture sample library are used for obtaining a second neural network model through machine learning training, and the method comprises the following steps: training a second neural network model based on the following objective function:
E = Σ ||H'j(x, y, z) − Hj(x, y, z)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y, z) denotes the predicted three-dimensional coordinates of each marker point on the sample picture, Hj(x, y, z) denotes the annotated three-dimensional coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
In one example, after generating three-dimensional posture data of the motion of the object to be captured according to three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space, which is contained in each picture obtained by decomposing the video data, the method further comprises the following steps: and generating a file in a first preset format according to the three-dimensional coordinate data of each mark point on the body of the object to be captured in a preset three-dimensional space, wherein the three-dimensional coordinate data of each mark point on each picture is obtained by decomposing the video data, and the three-dimensional attitude data of the object to be captured, which is used for making the corresponding attitude motion in the video data, is stored in the file in the first preset format.
In one example, after generating a file in a first predetermined format according to three-dimensional coordinate data of each mark point on the subject to be captured in a preset three-dimensional space included in each picture obtained by decomposing the video data, the method further includes: and converting the file in the first preset format into a file in a second preset format, wherein the file in the second preset format is used for producing the three-dimensional animation.
In one example, the first neural network model is a convolutional neural network model built based on a residual network, and the second neural network model is a deep neural network model.
In another aspect, the present application further provides a video-based gesture data capturing system, which includes: the camera device is used for acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured; the image processing equipment is communicated with the camera device and used for acquiring video data, decomposing the video data into at least one picture, extracting two-dimensional coordinate data of at least one mark point on the object to be captured contained in each picture obtained by decomposing the video data based on a first neural network model, determining corresponding three-dimensional coordinate data of each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured contained in each picture obtained by decomposing the video data based on a second neural network model, and determining corresponding three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system based on the position data of each mark point on the at least one picture on the object to be captured; each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
In another aspect, the present application further provides a video-based gesture data capturing system, which includes: the client device is used for acquiring and uploading video data, wherein the video data comprises gesture motion data of the motion of an object to be captured; the server is communicated with the client device and used for receiving the video data uploaded by the client device, decomposing the video data into at least one picture, extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured contained in each picture obtained by decomposing the video data based on the first neural network model, and based on the second neural network model, determining the corresponding three-dimensional coordinate data of each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured contained in each picture obtained by decomposing the video data, and based on the position data of each mark point on at least one picture on the object to be captured, determining the three-dimensional coordinate data of each mark point on the body of the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system; each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
According to the video-based gesture data capturing approach of the present application, after at least one picture is obtained from the acquired video data, the two-dimensional coordinate data of at least one marker point on the object to be captured contained in each picture is extracted by the first neural network model, and the three-dimensional coordinates of each marker point in the preset three-dimensional space are then determined by the second neural network model from that two-dimensional coordinate data. This achieves the goal of extracting three-dimensional pose data from video. In addition, because the two-dimensional and three-dimensional coordinate data of the human joints are obtained by two separate neural networks, the problem of poor 3D training results caused by insufficient 3D human pose data is effectively alleviated.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a video-based pose data capture system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative video-based pose data capture system provided by an embodiment of the present application;
fig. 3(a) is a schematic diagram of obtaining a picture from a video screenshot according to an embodiment of the present application;
FIG. 3(b) is a schematic diagram of extracting joint points of a human body from a picture according to an embodiment of the present application;
FIG. 4(a) is a schematic diagram of a two-dimensional human joint according to an embodiment of the present application;
FIG. 4(b) is a schematic diagram of a three-dimensional human joint according to an embodiment of the present application;
fig. 5 is a flowchart of a video-based gesture data capturing method according to an embodiment of the present application.
Detailed Description
In order to more clearly explain the overall concept of the present application, the following detailed description is given by way of example in conjunction with the accompanying drawings.
In order to solve the problems that the operation is complicated and the cost is high due to the fact that an existing gesture data capturing system needs to acquire gesture data of a moving target moving in a three-dimensional space by means of various sensors, the application provides a video-based gesture data capturing scheme which can dynamically capture the gesture data of the moving target moving in a video according to the video of the moving target.
It should be noted that the video-based gesture data capture scheme provided in the embodiment of the present application may be applied to capture gesture data of any moving target, and may be gesture data of a human body, gesture data of a moving object, or gesture data of a robot. Preferably, the various embodiments of the present application are described by taking the example of capturing posture data of a human body. It is easy to note that the attitude data at the time of the motion of the object to be captured can be determined by marking the change of at least one point on the object to be captured, and in the case where the object to be captured is a human body, the marked point may be an articulated point of the human body.
As a first alternative embodiment, an embodiment of the present application provides a video-based pose data capture system, as shown in fig. 1, comprising: an image pickup device 1 and an image processing apparatus 2. The image processing apparatus 2 collects video data of a motion of an object to be captured (for example, a human body) by the camera 1 connected thereto, and the collected video data includes data of a plurality of posture motions of the human body during the motion. After the image processing device 2 obtains the video data of the human motion, the video data is subjected to frame decomposition processing to obtain a plurality of pictures, and each picture corresponds to one frame of image of the video data. Because the posture and the motion of the human body are changed continuously during the movement of the human body, the joint points (including but not limited to joint positions of a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint, a right ankle joint and the like) of the human body in each picture have different positions on the picture.
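For illustration, one possible way to perform this frame-by-frame decomposition is sketched below using OpenCV; the library choice and the function name decompose_video are assumptions of this example rather than part of the application.

```python
import cv2

def decompose_video(video_path):
    """Split a video file into one picture per frame (a minimal sketch using OpenCV)."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:            # no more frames to read
            break
        frames.append(frame)  # each element corresponds to one frame of the video data
    capture.release()
    return frames
```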
The image processing device 2 and the imaging apparatus 1 may be two components on the same device (for example, in a room, gesture data of a user is acquired by a smart mobile device such as a mobile phone, and image processing is performed), or may be two independent devices. In the case where the image processing apparatus 2 and the imaging device 1 are two independent apparatuses, the communication means of the two may be wired communication (for example, attitude data of a performer is acquired indoors and image processing is performed by an image processing apparatus such as a computer), or wireless communication. In the case of wireless communication, the communication system may be a communication system based on a local area network (for example, the posture data of the user is collected indoors by a mobile phone or the like and transmitted to a computer for image processing), or may be a communication system based on the internet (for example, the posture data of the user is collected indoors by a mobile phone and then transmitted to an application server for image processing based on a client application).
For example, fig. 2 is a schematic view of an optional video-based gesture data capturing system provided in an embodiment of the present application, and as shown in fig. 2, the video data acquired by the server 5 may be gesture motion data of a motion of an object to be captured, which is acquired in real time by using a camera device such as a camera of a mobile phone, or may be video data of a gesture motion, which includes a motion of an object to be captured, uploaded by a user through a device such as a mobile phone 3 or a computer 4.
In any way, after the image processing device 2 acquires the video data of the human body motion acquired by the camera device 1 and decomposes the video data into a plurality of pictures, the image processing device 2 may extract the two-dimensional coordinate data of each joint point (for example, the left wrist joint point of the human body shown in fig. 3) on the subject to be captured included in each picture obtained by decomposing the video data based on the first neural network model, determine the three-dimensional coordinate data of each joint point in the preset three-dimensional space according to the two-dimensional coordinate data of each joint point on the subject to be captured included in each picture obtained by decomposing the video data based on the second neural network model, and finally generate a file in a first predetermined format according to the three-dimensional coordinate data of each joint point on the subject to be captured included in each picture obtained by decomposing the video data, the file in the first preset format stores three-dimensional posture data of the object to be captured for making corresponding posture motions in the video data.
It is easy to note that the first neural network model and the second neural network model are trained in advance using machine learning. The first neural network model is used to estimate the positions of the human joints in a picture and obtain 2D coordinate data; the second neural network model is used to convert the 2D joint coordinate data into corresponding 3D coordinate data. In this way, the human motion in a video is converted into 3D motion data by exploiting the strong learning and inference capabilities of neural networks. A neural network trained on a large number of real videos and pictures can effectively recognize human poses in a variety of environments. The embodiment of the invention handles occlusion of the human body, including self-occlusion, well, and places no restriction on the shooting environment or camera position, so it can process a wide range of video material.
Alternatively, the first neural network model may be a convolutional neural network model established based on a residual network, and the second neural network model may be a deep neural network model.
It is easy to note that the first neural network model needs to be trained before it is used to extract the two-dimensional coordinate data of at least one joint point on the subject to be captured contained in each picture. Specifically, a picture sample library containing a plurality of sample pictures is collected, at least one joint point (for example, a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint and a right ankle joint) on a subject to be captured contained in each sample picture in the picture sample library is labeled, and a first neural network model is obtained through machine learning training by using the plurality of sample pictures in the picture sample library and the at least one joint point labeled on the sample pictures.
After the first neural network model is obtained through training, the two-dimensional coordinate data of at least one joint point on the object to be captured, which is contained in each picture, can be extracted by using the first neural network model, and each picture obtained through decomposing the video data is input into the first neural network model by the image processing device 2, so as to output at least one confidence map corresponding to each picture. And determining two-dimensional coordinate data of at least one joint point on the body of the object to be captured in each picture according to at least one confidence map corresponding to each picture, wherein the coordinate of the pixel point with the maximum brightness in each confidence map corresponds to the coordinate of one joint point on the body of the object to be captured.
Optionally, the first neural network model may be trained based on the following objective function:
E = Σ ||H'j(x, y) − Hj(x, y)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y) denotes the predicted coordinates of each joint point on the sample picture, Hj(x, y) denotes the annotated coordinates of the joint points on the sample picture, N represents the number of training samples, and j is a natural number.
When the two-dimensional coordinate data of at least one joint point on the object to be captured in each picture is determined according to at least one confidence map corresponding to each picture, the two-dimensional coordinates of the joint point corresponding to each confidence map can be determined through the following Gaussian response map formula:
G(x, y) = exp(−((x − x₀)² + (y − y₀)²) / (2σ²))
wherein G(x, y) represents the Gaussian distribution value at each pixel point on the confidence map; (x₀, y₀) represents the coordinates of the joint point corresponding to the confidence map; σ represents the standard deviation of the Gaussian distribution; and (x, y) represents the coordinates of each pixel point on each confidence map.
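A minimal sketch of rendering a Gaussian response map of this form around an annotated joint point is shown below; the map size handling and the default σ value are illustrative assumptions.

```python
import numpy as np

def gaussian_confidence_map(center_x, center_y, map_w, map_h, sigma=2.0):
    """Build one confidence map whose brightest pixel lies at the joint point."""
    xs, ys = np.meshgrid(np.arange(map_w), np.arange(map_h))
    d2 = (xs - center_x) ** 2 + (ys - center_y) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))  # G(x, y) as in the formula above
```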
In addition, it should be noted that, before determining the three-dimensional coordinate data of each joint point in the preset three-dimensional space according to the two-dimensional coordinate data of each joint point on the subject to be captured, which is included in each picture obtained by decomposing the video data, using the second neural network model, the second neural network model also needs to be trained. The training process is as follows: collecting a picture sample library comprising a plurality of sample pictures, and labeling at least one joint point (such as a head joint, a neck joint, a left shoulder joint, a right shoulder joint, a left elbow joint, a right elbow joint, a left pelvis joint, a right pelvis joint, a left knee joint, a right knee joint, a left ankle joint and a right ankle joint) on a body of a to-be-captured object contained in each sample picture in the picture sample library; acquiring three-dimensional coordinate data of each joint point on the body of the object to be captured in a local coordinate system based on the two-dimensional coordinates of each joint point marked on each sample picture, wherein the local coordinate system is a coordinate system determined by the mass center of the object to be captured; and obtaining a second neural network model through machine learning training by using the two-dimensional coordinate data of at least one joint point marked on each sample picture in the picture sample library and the corresponding three-dimensional coordinate data in the local coordinate system.
Optionally, the second neural network model may be trained based on the following objective function:
E = Σ ||H'j(x, y, z) − Hj(x, y, z)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y, z) denotes the predicted three-dimensional coordinates of each joint point on the sample picture, Hj(x, y, z) denotes the annotated three-dimensional coordinates of the joint points on the sample picture, N represents the number of training samples, and j is a natural number.
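Both objective functions above are sums of squared differences between predicted and annotated joint data; the sketch below expresses such an L2 objective in TensorFlow, with the tensor layout assumed for illustration.

```python
import tensorflow as tf

def l2_objective(predicted, annotated):
    """E = sum of squared errors between prediction and annotation over all samples.

    predicted / annotated: tensors of shape (N, num_joints, D), with D = 2 for the
    first network's joint coordinates and D = 3 for the second network's
    (an assumed layout for this sketch).
    """
    return tf.reduce_sum(tf.square(predicted - annotated))
```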
The video-based gesture data capture scheme provided by the present application is described in detail below with reference to fig. 3(a) and 3(b). Fig. 3(a) shows a single picture decomposed from a video of a high jump athlete. A Convolutional Neural Network (CNN) can infer from the picture shown in fig. 3(a) the positions of the human joint points in the picture, such as the head joint, neck joint, left shoulder joint, right shoulder joint, left elbow joint, right elbow joint, left pelvis joint, right pelvis joint, left knee joint, right knee joint, left ankle joint, and right ankle joint shown in fig. 3(b).
Fig. 4(a) shows the 2D human joint data corresponding to fig. 3(b). A Deep Neural Network (DNN) then reasons over this 2D joint data to obtain the 3D coordinate data of the human joints in a preset three-dimensional space, as shown in fig. 4(b).
In order to train the convolutional neural network, pictures containing the human body in various environments can be collected as samples, and the human joint points in the pictures are labeled. The annotated information covers 14 joint points, including the head joint, neck joint, left shoulder joint, right shoulder joint, left elbow joint, right elbow joint, left pelvis joint, right pelvis joint, left knee joint, right knee joint, left ankle joint, and right ankle joint, and the network that produces the human joint confidence maps is built with reference to a standard residual network. The confidence maps comprise 14 maps, each of which contains a Gaussian response map at the position of the corresponding joint in the sample picture. It is easy to note that a confidence map is a 2D picture, either as large as the input picture or at a smaller scale; the value of each pixel indicates the probability that the corresponding position in the input picture contains the human joint.
As an alternative, 100,000 pictures can be used as samples and the network trained for 200 rounds. The coordinates of the brightest point in each confidence map are then extracted to obtain the 2D positions of the human joint points, which serve as the training input for the deep neural network.
In order to train the deep neural network, traditional optical motion capture equipment can be used in a studio with professional actors to capture human poses and obtain the position data of the human joints in 3D space. The data is continuous motion data, and the number and definition of the joints are consistent with the 2D case. As an alternative embodiment, 10 cameras may be arranged around the actor with the lenses pointing toward the actor in the middle.
The input of the deep neural network is the 2D joint data (X, Y), and the output is the corresponding three-dimensional joint position (X, Y, Z) of the human body. Because the deep neural network has strong feature extraction capability, the input 2D data is expanded in dimension through hidden layers and mapped to 3D coordinates. The L2 loss function is again used for training, and ReLU is used as the activation function. Because the normal human body is bilaterally symmetric, a bone-length constraint is added so that the lengths of the left and right bones remain as consistent as possible. The deep network yields the 3D pose of the human body, but by itself it provides no displacement data of the human body in 3D space, the motion lacks continuity, and playback exhibits jitter.
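A minimal Keras sketch of a fully connected lifting network of this kind is given below, assuming 14 joints; the layer widths, optimizer, and use of mean squared error as the L2-style loss are illustrative assumptions rather than values specified by the application.

```python
import tensorflow as tf

NUM_JOINTS = 14  # assumed joint count, matching the 14 annotated joints mentioned above

def build_lifting_network():
    """Fully connected network mapping flattened 2D joint coordinates to 3D coordinates."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1024, activation="relu",
                              input_shape=(NUM_JOINTS * 2,)),   # dimension expansion of (X, Y)
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(NUM_JOINTS * 3),                  # flattened (X, Y, Z) per joint
    ])
    # L2-style regression loss; a bone-length symmetry penalty could be added as an
    # extra loss term (not shown in this sketch).
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

# Hypothetical usage: joints_2d has shape (N, 28), joints_3d_local has shape (N, 42).
# model = build_lifting_network()
# model.fit(joints_2d, joints_3d_local, epochs=200, batch_size=64)
```

In this sketch the network regresses joints in the local (centroid-centered) coordinate system only; recovering the subject's displacement in the preset three-dimensional space is the subject of the depth formula discussed next.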
Therefore, as an alternative embodiment, the three-dimensional coordinate data of each joint point on the object to be captured in the preset three-dimensional space can be determined, based on the position data of each joint point on the at least one picture, from the three-dimensional coordinate data of each joint point in the local coordinate system by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the joint information output by the second neural network model; P̄ is the average of all joint information output by the second neural network model; Kᵢ is the joint information output by the first neural network model; and K̄ is the average of all joint information output by the first neural network model.
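The sketch below computes the approximate depth z as the ratio of the spreads of the 3D and 2D joints about their means, which is one plausible reading of the variable definitions above; the exact expression and the helper name approximate_depth are assumptions of this example.

```python
import numpy as np

def approximate_depth(joints_3d_local, joints_2d):
    """Estimate the depth z of the subject from the spread ratio of 3D and 2D joints.

    joints_3d_local: (num_joints, 3) output of the second network, local coordinates.
    joints_2d: (num_joints, 2) output of the first network, picture coordinates.
    """
    p_centered = joints_3d_local - joints_3d_local.mean(axis=0)  # Pi - P_mean
    k_centered = joints_2d - joints_2d.mean(axis=0)              # Ki - K_mean
    return np.sqrt(np.sum(p_centered ** 2) / np.sum(k_centered ** 2))
```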
Optionally, in a post-processing stage, the 3D body is processed by an iterative method to make the motion smoother and to eliminate jitter at the ankles, so that the feet stay planted on the ground.
Further, the generated human joint 3D data can be converted into an FBX file (i.e., a file in a FilmBox software format) commonly used by game art personnel. Since video is a continuous human motion, the information stored in FBX includes motion information on a frame-by-frame basis.
Preferably, the FBX file can also be converted into a BIP human body motion file (BIP, short for Biped, is a format specific to 3ds Max Character Studio and is used for producing animation and 3D files).
FBX is a cross-platform, freely usable three-dimensional creation and exchange format produced by Autodesk; through FBX files, users can access the three-dimensional files of most 3D vendors. The FBX file format supports all major three-dimensional data elements as well as two-dimensional, audio and video media elements.
The BIP file is a commonly used motion file for the Biped (footstep) controller and is widely used for animation and 3D production. BIP is a format specific to 3ds Max Character Studio; it can be opened with NaturalMotion endorphin (motion simulation software) or with software such as MotionBuilder (3D character animation software). The BIP file is a character animation file commonly used by game artists.
Based on the standard skeleton in 3ds Max, the rotation direction and length of the human joints are calculated. Importing these data into 3ds Max yields the correct BIP skeleton and animation effects.
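As a small illustration of that calculation, the sketch below derives a bone's length and unit direction vector from two joint positions; the function name and calling convention are assumptions of this example.

```python
import numpy as np

def bone_length_and_direction(parent_joint, child_joint):
    """Return the length of a bone and its unit direction vector from parent to child."""
    vector = np.asarray(child_joint, dtype=float) - np.asarray(parent_joint, dtype=float)
    length = np.linalg.norm(vector)
    direction = vector / length if length > 0 else vector
    return length, direction

# Hypothetical usage with 3D joint positions in the preset three-dimensional space:
# length, direction = bone_length_and_direction(shoulder_xyz, elbow_xyz)
```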
As an optional implementation, the development of the application was completed using TensorFlow, and the running environment is a PC running a Linux operating system.
An embodiment of the present application further provides a method for capturing gesture data based on video, as shown in fig. 5, including the following steps:
step S501, video data is obtained, wherein the video data comprises gesture motion data of the motion of the object to be captured.
Specifically, the video data may be a video recorded in real time and containing the gesture motion data of the motion of the object to be captured, or may also be a video recorded or created in advance and containing the gesture motion data of the motion of the object to be captured. The object to be captured may be a person or an object moving in a video, and the embodiment of the present application is described by taking a person as an example.
Step S502, decomposing the video data into at least one picture; wherein each picture corresponds to a frame of image of the video data.
Specifically, after video data including gesture motion data of the motion of the object to be captured is acquired, the video data may be decomposed into a plurality of pictures by frames. Wherein, each picture comprises a gesture motion of the object to be captured.
Step S503, based on the first neural network model, extracting the two-dimensional coordinate data of at least one mark point on the body of the object to be captured contained in each picture obtained by decomposing the video data.
Specifically, each picture obtained by decomposing video data is input into a first neural network model, at least one confidence map corresponding to each picture is output, and the coordinates of the pixel point with the maximum brightness in each confidence map correspond to the coordinates of one mark point on the object to be captured; and determining two-dimensional coordinate data of at least one marking point on the object to be captured in each picture according to at least one confidence map corresponding to each picture.
Before each image obtained by decomposing the video data is input into the first neural network model and at least one confidence map corresponding to each image is output, the first neural network model also needs to be trained, and the specific training process is discussed as above and is not repeated here.
Optionally, the first neural network model is a convolutional neural network model.
Step S504, based on the second neural network model, determining three-dimensional coordinate data corresponding to each mark point in a local coordinate system according to the two-dimensional coordinate data of each mark point on the object to be captured, which is contained in each picture obtained by decomposing the video data, wherein the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
Optionally, the second neural network model is a deep neural network model. The input is 2D joint data (X, Y) and the output is three-dimensional data in a local coordinate system. The process of training the second neural network model is discussed above, and is not described here again.
Step S505, based on the position data of each mark point on the object to be captured on at least one picture, according to the corresponding three-dimensional coordinate data of each mark point in the local coordinate system, determining the corresponding three-dimensional coordinate data of each mark point on the object to be captured in a preset three-dimensional space.
Specifically, in order to obtain displacement data of the object to be captured in the 3D space, the position of the object to be captured in the 3D space can be derived from the position of the joint on the 2D picture by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the marker point coordinate information output by the second neural network model; P̄ is the average value of the coordinate information of all the marker points output by the second neural network model; Kᵢ is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model.
The joint position (X, Y, Z) of each joint in a preset three-dimensional space (e.g., a three-dimensional space where a shooting camera is located) can be calculated according to the three-dimensional coordinate data of each joint in the local coordinate system by the above formula.
Step S506, generating a file with a first preset format according to the three-dimensional coordinate data of each mark point on the body of the object to be captured in a preset three-dimensional space, wherein the three-dimensional coordinate data of each mark point on each picture is obtained by decomposing the video data, and the three-dimensional posture data of the object to be captured, which makes the corresponding posture action in the video data, is stored in the file with the first preset format.
Specifically, the file in the first predetermined format may be an FBX file, and the generated human body joint 3D data is converted into the FBX file, which may be used by game artists.
Step S507, converting the file in the first predetermined format into a file in a second predetermined format, where the file in the second predetermined format is used for producing a three-dimensional animation.
Specifically, the file in the second predetermined format may be a BIP file, and the FBX file is converted into a BIP file, which may be used for animation and 3D production.
Through the scheme disclosed in the above steps S501 to S507, the strong learning and derivation capabilities of the neural network are utilized to convert the human body motion in a segment of video into 3D motion data through the first neural network model and the second neural network model, and finally, the 3D motion data is converted into an FBX file, optionally, the FBX file can be further converted into a BIP file required by a 3D motion maker, so that the 3D motion maker can use the captured human body gesture data in scenes such as 3D animation, game making and the like.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A method for video-based pose data capture, the method comprising the steps of:
acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured;
decomposing the video data into at least one picture; each picture corresponds to one frame of image of the video data;
extracting two-dimensional coordinate data of at least one mark point on the body of the object to be captured, which is contained in each picture obtained by decomposing the video data, based on a first neural network model;
determining three-dimensional coordinate data corresponding to each mark point in a local coordinate system according to two-dimensional coordinate data of each mark point on the object to be captured, wherein the two-dimensional coordinate data are contained in each picture obtained by decomposing the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured;
determining the three-dimensional coordinate data corresponding to each marking point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marking point in the local coordinate system based on the position data of each marking point on the object to be captured on the at least one picture by the following formula:
z = √( Σᵢ ||Pᵢ − P̄||² / Σᵢ ||Kᵢ − K̄||² )
wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; Pᵢ is the marker point coordinate information output by the second neural network model; P̄ is the average value of the coordinate information of all the marker points output by the second neural network model; Kᵢ is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model.
2. The method according to claim 1, wherein extracting two-dimensional coordinate data of at least one marker point on the subject to be captured included in each picture decomposed from the video data based on a first neural network model comprises:
inputting each picture obtained by decomposing the video data into a first neural network model, and outputting at least one confidence map corresponding to each picture, wherein the coordinate of a pixel point with the maximum brightness in each confidence map corresponds to the coordinate of a mark point on the object to be captured;
and determining two-dimensional coordinate data of at least one marking point on the object to be captured in each picture according to at least one confidence map corresponding to each picture.
3. The method of claim 2, wherein before inputting each picture obtained by decomposing the video data into the first neural network model and outputting at least one confidence map corresponding to each picture, the method further comprises:
acquiring a picture sample library; wherein the picture sample library comprises a plurality of sample pictures;
marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library;
and obtaining the first neural network model through machine learning training by using a plurality of sample pictures in the picture sample library and at least one mark point marked on the sample pictures.
4. The method for capturing pose data based on video according to claim 2, wherein determining the two-dimensional coordinate data of at least one mark point on the object to be captured contained in each picture according to at least one confidence map corresponding to each picture comprises:
determining the two-dimensional coordinates of the marking point corresponding to each confidence map by the following Gaussian response map formula:
G(x, y) = exp(−((x − x₀)² + (y − y₀)²) / (2σ²))
wherein G(x, y) represents the Gaussian distribution value at each pixel point on the confidence map; (x₀, y₀) represents the coordinates of the marker point corresponding to the confidence map; σ represents the standard deviation of the Gaussian distribution; and (x, y) represents the coordinates of each pixel point on each confidence map.
5. The method of claim 3, wherein the obtaining the first neural network model through machine learning training using a plurality of sample pictures in the picture sample library and at least one marked point marked on the sample pictures comprises:
training the first neural network model based on an objective function:
E = Σ ||H'j(x, y) − Hj(x, y)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y) denotes the predicted coordinates of each marker point on the sample picture, Hj(x, y) denotes the annotated coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
6. The method of claim 1, wherein before determining the three-dimensional coordinate data of each marker point in the local coordinate system based on the two-dimensional coordinate data of each marker point on the subject to be captured contained in each picture decomposed from the video data based on the second neural network model, the method further comprises:
acquiring a picture sample library; wherein the picture sample library comprises a plurality of sample pictures;
marking at least one mark point on a to-be-captured object contained in each sample picture in the picture sample library;
acquiring three-dimensional coordinate data of each marking point on the body of the object to be captured in a local coordinate system based on the two-dimensional coordinates of each marking point marked on each sample picture;
and obtaining the second neural network model through machine learning training by using the two-dimensional coordinate data of at least one marking point marked on each sample picture in the picture sample library and the corresponding three-dimensional coordinate data in the local coordinate system.
7. The method of claim 6, wherein the obtaining the second neural network model through machine learning training using two-dimensional coordinates and corresponding three-dimensional coordinates of at least one marker point marked on each sample picture in the picture sample library comprises:
training the second neural network model based on an objective function:
E = Σ ||H'j(x, y, z) − Hj(x, y, z)||², where the sum runs over the training samples j = 1, …, N
wherein E represents the objective function, H'j(x, y, z) denotes the predicted three-dimensional coordinates of each marker point on the sample picture, Hj(x, y, z) denotes the annotated three-dimensional coordinates of the marker points on the sample picture, N represents the number of training samples, and j is a natural number.
8. The method of claim 1, wherein after generating the three-dimensional pose data of the motion of the object to be captured according to the three-dimensional coordinate data of each marker point on the object to be captured in the preset three-dimensional space, which is included in each picture obtained by decomposing the video data, the method further comprises:
and generating a file in a first predetermined format according to the three-dimensional coordinate data of each mark point on the object to be captured in the preset three-dimensional space contained in each picture obtained by decomposing the video data, wherein the file in the first predetermined format stores three-dimensional posture data of the object to be captured making the corresponding posture motions in the video data.
9. The method of claim 8, wherein, after the file in the first preset format is generated based on the three-dimensional coordinate data, in the preset three-dimensional space, of each marking point on the object to be captured contained in each picture obtained by decomposing the video data, the method further comprises:
converting the file in the first preset format into a file in a second preset format, wherein the file in the second preset format is used for making a three-dimensional animation; the file in the first preset format is an FBX file, and/or the file in the second preset format is a BIP file.
10. The method of any of claims 1-9, wherein the first neural network model is a convolutional neural network model based on a residual network, and the second neural network model is a deep neural network model.
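Claim 10 characterises the first model as a residual-network-based convolutional model producing per-marker confidence maps. A toy PyTorch sketch of that shape (two small residual blocks ending in one confidence map per marker point; every layer size here is an illustrative assumption, not the patented architecture) might look like:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Minimal residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(x + y)

class HeatmapNet(nn.Module):
    """Toy first model: RGB picture in, one confidence map per marker point out."""
    def __init__(self, num_markers=17, channels=32):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.head = nn.Conv2d(channels, num_markers, 1)

    def forward(self, x):
        return self.head(self.blocks(torch.relu(self.stem(x))))

net = HeatmapNet()
maps = net(torch.randn(1, 3, 64, 64))
print(maps.shape)  # torch.Size([1, 17, 64, 64])
```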
11. A video-based pose data capture system, the system comprising:
the camera device is used for acquiring video data, wherein the video data comprises gesture motion data of the motion of an object to be captured;
an image processing device, in communication with the camera device, configured to acquire the video data, decompose the video data into at least one picture, extract, based on a first neural network model, two-dimensional coordinate data of at least one marker point on the body of the object to be captured contained in each picture obtained by decomposing the video data, determine, based on a second neural network model, the corresponding three-dimensional coordinate data of each marker point in a local coordinate system according to the two-dimensional coordinate data of each marker point on the object to be captured contained in each picture obtained by decomposing the video data, and determine, through the following formula and based on the position data, on the at least one picture, of each marker point on the object to be captured, the three-dimensional coordinate data of each marker point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marker point in the local coordinate system:
z = (Σ_i ‖X_i − X̄‖) / (Σ_i ‖K_i − K̄‖)

wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; X_i is the marker point coordinate information output by the second neural network model; X̄ is the average value of the coordinate information of all the marker points output by the second neural network model; K_i is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model;
wherein each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
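As a numerical sketch of the placement step in claim 11, under the assumption (the equation image is not reproduced in this text) that z is taken as the ratio between the spread of the second model's local 3D marker coordinates about their mean and the spread of the first model's 2D marker coordinates about their mean; the function names and sample values are illustrative:

```python
import numpy as np

def approximate_depth(markers_3d_local, markers_2d):
    """Assumed ratio-of-spreads estimate of the object's depth z:
    sum_i ||X_i - mean(X)|| divided by sum_i ||K_i - mean(K)||, where X_i are the
    second model's local 3D marker coordinates and K_i the first model's 2D ones."""
    x_spread = np.linalg.norm(markers_3d_local - markers_3d_local.mean(axis=0), axis=1).sum()
    k_spread = np.linalg.norm(markers_2d - markers_2d.mean(axis=0), axis=1).sum()
    return x_spread / k_spread

def place_in_preset_space(markers_3d_local, markers_2d):
    """Shift the centroid-relative 3D markers along the depth axis by the
    estimated z (a simplified illustration of the placement in the preset space)."""
    z = approximate_depth(markers_3d_local, markers_2d)
    return markers_3d_local + np.array([0.0, 0.0, z])

local = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.0], [-0.1, -0.2, 0.1]])
pixels = np.array([[320.0, 240.0], [340.0, 280.0], [300.0, 200.0]])
print(approximate_depth(local, pixels))
```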
12. A video-based pose data capture system, the system comprising:
the client device is used for acquiring and uploading video data, wherein the video data comprises gesture motion data of the motion of an object to be captured;
a server, in communication with the client device, configured to receive the video data uploaded by the client device, decompose the video data into at least one picture, extract, based on a first neural network model, two-dimensional coordinate data of at least one marker point on the body of the object to be captured contained in each picture obtained by decomposing the video data, determine, based on a second neural network model, the corresponding three-dimensional coordinate data of each marker point in a local coordinate system according to the two-dimensional coordinate data of each marker point on the object to be captured contained in each picture obtained by decomposing the video data, and determine, through the following formula and based on the position data, on the at least one picture, of each marker point on the object to be captured, the three-dimensional coordinate data of each marker point on the object to be captured in a preset three-dimensional space according to the corresponding three-dimensional coordinate data of each marker point in the local coordinate system:
z = (Σ_i ‖X_i − X̄‖) / (Σ_i ‖K_i − K̄‖)

wherein z is the approximate depth of the object to be captured in the preset three-dimensional space; X_i is the marker point coordinate information output by the second neural network model; X̄ is the average value of the coordinate information of all the marker points output by the second neural network model; K_i is the marker point coordinate information output by the first neural network model; and K̄ is the average value of the coordinate information of all the marker points output by the first neural network model;
wherein each picture corresponds to one frame of image of the video data, and the local coordinate system is a coordinate system determined by the centroid of the object to be captured.
CN201810895934.4A 2018-08-08 2018-08-08 Video-based attitude data capturing method and system Active CN109145788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810895934.4A CN109145788B (en) 2018-08-08 2018-08-08 Video-based attitude data capturing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810895934.4A CN109145788B (en) 2018-08-08 2018-08-08 Video-based attitude data capturing method and system

Publications (2)

Publication Number Publication Date
CN109145788A CN109145788A (en) 2019-01-04
CN109145788B true CN109145788B (en) 2020-07-07

Family

ID=64792037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810895934.4A Active CN109145788B (en) 2018-08-08 2018-08-08 Video-based attitude data capturing method and system

Country Status (1)

Country Link
CN (1) CN109145788B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109529350A (en) * 2018-12-27 2019-03-29 北京云舶在线科技有限公司 A kind of action data processing method and its device applied in game
CN109758756B (en) * 2019-02-28 2021-03-23 国家体育总局体育科学研究所 Gymnastics video analysis method and system based on 3D camera
CN110163112B (en) * 2019-04-25 2021-03-19 沈阳图为科技有限公司 Examinee posture segmentation and smoothing method
CN110334574A (en) * 2019-04-26 2019-10-15 武汉理工大学 A method of automatically extracting traffic accident key frame in traffic video
CN110796077A (en) * 2019-10-29 2020-02-14 湖北民族大学 Attitude motion real-time detection and correction method
CN111208783B (en) * 2019-12-30 2021-09-17 深圳市优必选科技股份有限公司 Action simulation method, device, terminal and computer storage medium
CN111476291B (en) * 2020-04-03 2023-07-25 南京星火技术有限公司 Data processing method, device and storage medium
CN111638791B (en) * 2020-06-03 2021-11-09 北京火山引擎科技有限公司 Virtual character generation method and device, electronic equipment and storage medium
CN111680758B (en) * 2020-06-15 2024-03-05 杭州海康威视数字技术股份有限公司 Image training sample generation method and device
CN111798547B (en) * 2020-06-22 2021-05-28 完美世界(北京)软件科技发展有限公司 Animation mixed space subdivision method, device, equipment and readable medium
CN112818898B (en) * 2021-02-20 2024-02-20 北京字跳网络技术有限公司 Model training method and device and electronic equipment
CN113146634A (en) * 2021-04-25 2021-07-23 达闼机器人有限公司 Robot attitude control method, robot and storage medium
CN113420719B (en) * 2021-07-20 2022-07-22 北京百度网讯科技有限公司 Method and device for generating motion capture data, electronic equipment and storage medium
CN113989928B (en) * 2021-10-27 2023-09-05 南京硅基智能科技有限公司 Motion capturing and redirecting method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101377812A (en) * 2008-07-11 2009-03-04 北京航空航天大学 Method for recognizing position and attitude of space plane object
CN105631861A (en) * 2015-12-21 2016-06-01 浙江大学 Method of restoring three-dimensional human body posture from unmarked monocular image in combination with height map
CN106204625A (en) * 2016-07-27 2016-12-07 大连理工大学 A kind of variable focal length flexibility pose vision measuring method
CN107392097A (en) * 2017-06-15 2017-11-24 中山大学 A kind of 3 D human body intra-articular irrigation method of monocular color video
CN107492121A (en) * 2017-07-03 2017-12-19 广州新节奏智能科技股份有限公司 A kind of two-dimension human body bone independent positioning method of monocular depth video
CN108377368A (en) * 2018-05-08 2018-08-07 扬州大学 A kind of one master and multiple slaves formula intelligent video monitoring apparatus and its control method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102591552B1 (en) * 2015-08-21 2023-10-18 매직 립, 인코포레이티드 Eyelid shape estimation using eye pose measurement


Also Published As

Publication number Publication date
CN109145788A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109145788B (en) Video-based attitude data capturing method and system
KR101295471B1 (en) A system and method for 3D space-dimension based image processing
Menache Understanding motion capture for computer animation and video games
US8786680B2 (en) Motion capture from body mounted cameras
US20230008567A1 (en) Real-time system for generating 4d spatio-temporal model of a real world environment
CN104915978A (en) Realistic animation generation method based on Kinect
CN109035415B (en) Virtual model processing method, device, equipment and computer readable storage medium
CN112037310A (en) Game character action recognition generation method based on neural network
Sanna et al. A kinect-based interface to animate virtual characters
CN113822970A (en) Live broadcast control method and device, storage medium and electronic equipment
CN109529350A (en) A kind of action data processing method and its device applied in game
CN113989928B (en) Motion capturing and redirecting method
Hao et al. Cromosim: A deep learning-based cross-modality inertial measurement simulator
Eom et al. Data‐Driven Reconstruction of Human Locomotion Using a Single Smartphone
Lin et al. Temporal IK: Data-Driven Pose Estimation for Virtual Reality
Zeng et al. Motion capture and reconstruction based on depth information using Kinect
Borodulina Application of 3D human pose estimation for motion capture and character animation
Kelly et al. Motion synthesis for sports using unobtrusive lightweight body‐worn and environment sensing
Verma et al. Motion capture using computer vision
US20220028144A1 (en) Methods and systems for generating an animation control rig
US11450054B2 (en) Method for operating a character rig in an image-generation system using constraints on reference nodes
Sutopo et al. Synchronization of dance motion data acquisition using motion capture
Akinjala et al. Animating human movement & gestures on an agent using Microsoft kinect
US11170553B1 (en) Methods and systems for generating an animation control rig
Törmänen Comparison of entry level motion capture suits aimed at indie game production

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant