CN110135329B - Method, device, equipment and storage medium for extracting gestures from video

Info

Publication number
CN110135329B
CN110135329B (application CN201910394221.4A)
Authority
CN
China
Prior art keywords
dimensional coordinate
video frame
sample
video frames
sample video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910394221.4A
Other languages
Chinese (zh)
Other versions
CN110135329A (en)
Inventor
卢云帆
易阳
赵世杰
李峰
左小祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910394221.4A
Publication of CN110135329A
Application granted
Publication of CN110135329B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for extracting gestures from a video, belonging to the technical field of computers. The method comprises the following steps: acquiring a target video frame sequence; invoking a three-dimensional coordinate extraction model; inputting the target video frame sequence into the model to obtain three-dimensional coordinate sets of at least two video frames; determining, according to a preset database, the preset rule satisfied by the three-dimensional coordinate set of a target video frame; and determining the gesture corresponding to that preset rule as the gesture of the target video frame. Because the three-dimensional coordinate extraction model can learn both the relation between a video frame sequence and its three-dimensional coordinate sets and the continuity of gestures across adjacent video frames, it conforms to the objective laws of gesture change and improves accuracy. Extraction based on the model reduces the amount of computation and saves computing time, and since no multi-camera rig is needed to shoot the video, the method is not limited by shooting equipment and has a wider range of application.

Description

Method, device, equipment and storage medium for extracting gestures from video
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a method, a device, equipment and a storage medium for extracting gestures from videos.
Background
Gesture recognition technology analyzes and processes images to understand the gestures they contain. With the development of computer technology and the wide adoption of intelligent devices, it is increasingly applied in fields such as somatosensory (motion-sensing) games. Three-dimensional coordinate extraction is a key step of gesture recognition: to accurately recognize the gesture contained in an image, the three-dimensional coordinates of a plurality of feature points in the image must first be acquired.
In the related art, the target object is usually photographed with multiple cameras to obtain a target image containing the object. Because a multi-camera setup can directly obtain the three-dimensional coordinates of each pixel at capture time, the target image already carries the three-dimensional coordinates of every pixel, and only the plurality of feature points need to be extracted from it.
This scheme requires shooting with multiple cameras; it is limited by the shooting equipment, cannot be widely popularized, and therefore has a narrow range of application.
Disclosure of Invention
The embodiments of the invention provide a method, a device, equipment and a storage medium for extracting gestures from video, which can solve the above problems of the related art. The technical scheme is as follows:
In one aspect, a method of extracting a gesture from a video is provided, the method comprising:
acquiring a target video frame sequence, wherein the target video frame sequence comprises at least two video frames;
invoking a three-dimensional coordinate extraction model, wherein the three-dimensional coordinate extraction model is used for obtaining a three-dimensional coordinate set of any video frame according to any video frame in any video frame sequence and adjacent video frames of any video frame, and the three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of feature points in any video frame;
inputting the target video frame sequence into the three-dimensional coordinate extraction model to obtain a three-dimensional coordinate set of the at least two video frames;
for any target video frame in the target video frame sequence, determining, according to a preset database, the preset rule satisfied by the three-dimensional coordinate set of the target video frame, and determining the gesture corresponding to that preset rule as the gesture of the target video frame, wherein the preset database comprises at least one correspondence between a preset rule and a gesture.
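As an illustration only (the embodiments do not fix concrete preset rules), the rule-matching step can be sketched as follows in Python; the key-point names, the wrist_above_head rule and the PRESET_DATABASE structure are hypothetical:

```python
from typing import Callable, Dict, List, Optional, Tuple

Coord3D = Tuple[float, float, float]

def wrist_above_head(coords: Dict[str, Coord3D]) -> bool:
    # Hypothetical rule: with the y-axis pointing up, a raised hand
    # places the wrist above the head.
    return coords["wrist"][1] > coords["head"][1]

# Preset database: correspondences between preset rules and gestures.
PRESET_DATABASE: List[Tuple[Callable[[Dict[str, Coord3D]], bool], str]] = [
    (wrist_above_head, "hand raised"),
]

def gesture_of_frame(coords: Dict[str, Coord3D]) -> Optional[str]:
    """Return the gesture whose preset rule the frame's 3D coordinate set satisfies."""
    for rule, gesture in PRESET_DATABASE:
        if rule(coords):
            return gesture
    return None

print(gesture_of_frame({"wrist": (0.0, 1.8, 0.2), "head": (0.0, 1.6, 0.0)}))
```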
Optionally, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and a neighboring network node of the target network node; the inputting the target video frame sequence into the three-dimensional coordinate extraction model to obtain a three-dimensional coordinate set of the at least two video frames includes:
Respectively inputting the at least two video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, wherein the arrangement sequence of the at least two network nodes is the same as the arrangement sequence of the input video frames in the target video frame sequence;
and processing the at least two video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes to obtain a three-dimensional coordinate set of the at least two video frames.
Optionally, the plurality of network layers include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are sequentially connected;
the inputting the target video frame sequence into the three-dimensional coordinate extraction model to obtain a three-dimensional coordinate set of the at least two video frames includes:
inputting the target video frame to the feature extraction layer for any target video frame in the target video frame sequence;
processing the target video frame based on the feature extraction layer to obtain image features of the target video frame;
processing the image features and the image features of the adjacent video frames of the target video frame based on the two-dimensional coordinate extraction layer to obtain a two-dimensional coordinate set of the target video frame, wherein the two-dimensional coordinate set comprises two-dimensional coordinates of a plurality of feature points in the target video frame;
And processing the two-dimensional coordinate set and the two-dimensional coordinate set of the adjacent video frame of the target video frame based on the three-dimensional coordinate extraction layer to obtain the three-dimensional coordinate set of the target video frame.
Optionally, before the invoking the three-dimensional coordinate extraction model, the method further includes:
acquiring a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence;
respectively inputting the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input sample video frames in the sample video frame sequence;
processing the at least two sample video frames based on each network node and the connection relation between any two network nodes in the three-dimensional coordinate extraction model to obtain a test three-dimensional coordinate set of the at least two sample video frames;
and training each network node in the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
Optionally, the acquiring the sample video frame sequence and the sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence includes:
acquiring a three-dimensional coordinate set of a plurality of video frames in an original sample video;
extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining a three-dimensional coordinate set of the at least two sample video frames as a sample three-dimensional coordinate set.
In another aspect, a three-dimensional coordinate extraction model training method is provided, the method comprising:
acquiring a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence, wherein the sample three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of feature points in the sample video frames;
inputting the sample video frame sequence into a three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and the adjacent video frames of each sample video frame;
And training the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
Optionally, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and a neighboring network node of the target network node; inputting the sample video frame sequence into a three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and the adjacent video frame of each sample video frame, wherein the test three-dimensional coordinate set comprises the following steps:
respectively inputting the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input sample video frames in the sample video frame sequence;
and processing the at least two sample video frames based on each network node and the connection relation between any two network nodes in the three-dimensional coordinate extraction model to obtain a test three-dimensional coordinate set of the at least two sample video frames.
Optionally, the plurality of network layers include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are sequentially connected;
inputting the sample video frame sequence into a three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and the adjacent video frame of each sample video frame, wherein the test three-dimensional coordinate set comprises the following steps:
inputting the sample video frames to the feature extraction layer for any sample video frame in the sequence of sample video frames;
processing the sample video frame based on the feature extraction layer to obtain the test image features of the sample video frame;
processing the test image features and the test image features of adjacent video frames of the sample video frame based on the two-dimensional coordinate extraction layer to obtain a test two-dimensional coordinate set of the sample video frame, wherein the test two-dimensional coordinate set comprises two-dimensional coordinates of a plurality of feature points in the sample video frame;
and processing the test two-dimensional coordinate set and the test two-dimensional coordinate set of the adjacent video frame of the sample video frame based on the three-dimensional coordinate extraction layer to obtain the test three-dimensional coordinate set of the sample video frame.
Optionally, the acquiring the sample video frame sequence and the sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence includes:
acquiring a three-dimensional coordinate set of a plurality of video frames in an original sample video;
extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining a three-dimensional coordinate set of the at least two sample video frames as a sample three-dimensional coordinate set.
In another aspect, there is provided an apparatus for extracting a gesture from a video, the apparatus comprising:
a target sequence acquisition module, configured to acquire a target video frame sequence, wherein the target video frame sequence comprises at least two video frames;
the model calling module is used for calling a three-dimensional coordinate extraction model, wherein the three-dimensional coordinate extraction model is used for obtaining a three-dimensional coordinate set of any video frame according to any video frame in any video frame sequence and adjacent video frames of the any video frame, and the three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of feature points in the any video frame;
The three-dimensional coordinate acquisition module is used for inputting the target video frame sequence into the three-dimensional coordinate extraction model to acquire a three-dimensional coordinate set of the at least two video frames;
the gesture determining module is configured to determine, for any video frame in the target video frame sequence, a preset rule that is satisfied by the three-dimensional coordinate set of the target video frame according to a preset database, and determine a gesture corresponding to the preset rule as a gesture of the target video frame, where the preset database includes a correspondence between at least one preset rule and the gesture.
Optionally, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and a neighboring network node of the target network node; the three-dimensional coordinate acquisition module comprises:
the input unit is used for respectively inputting the at least two video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and the arrangement sequence of the at least two network nodes is the same as the arrangement sequence of the input video frames in the target video frame sequence;
And the processing unit is used for processing the at least two video frames based on each network node and the connection relation between any two network nodes in the three-dimensional coordinate extraction model to obtain a three-dimensional coordinate set of the at least two video frames.
Optionally, the plurality of network layers include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are sequentially connected;
the input unit is further configured to input, for any target video frame in the target video frame sequence, the target video frame to the feature extraction layer;
the processing unit is used for processing the target video frame based on the feature extraction layer to obtain the image feature of the target video frame;
the processing unit is further used for processing the image characteristics and the image characteristics of the adjacent video frames of the target video frame based on a two-dimensional coordinate extraction layer to obtain a two-dimensional coordinate set of the target video frame, wherein the two-dimensional coordinate set comprises two-dimensional coordinates of a plurality of feature points in the target video frame;
the processing unit is further configured to process the two-dimensional coordinate set and the two-dimensional coordinate set of the adjacent video frame of the target video frame based on the three-dimensional coordinate extraction layer, so as to obtain a three-dimensional coordinate set of the target video frame.
Optionally, the apparatus further comprises:
a sample sequence acquisition module, configured to acquire a sample video frame sequence and the sample three-dimensional coordinate sets of at least two sample video frames in the sample video frame sequence;
the input module is used for respectively inputting the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and the arrangement sequence of the at least two network nodes is the same as the arrangement sequence of the input sample video frames in the sample video frame sequence;
the processing module is used for processing the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes to obtain a test three-dimensional coordinate set of the at least two sample video frames;
and the training module is used for training each network node in the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set of each sample video frame and the sample three-dimensional coordinate set.
Optionally, the sample sequence acquisition module includes:
the acquisition unit is used for acquiring an original sample video and a three-dimensional coordinate set of a plurality of video frames in the original sample video;
The extraction unit is used for extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining three-dimensional coordinate sets of the at least two sample video frames as sample three-dimensional coordinate sets.
In another aspect, a three-dimensional coordinate extraction model training apparatus is provided, the apparatus comprising:
a sample sequence obtaining module, configured to obtain a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence, where the sample three-dimensional coordinate set includes three-dimensional coordinates of a plurality of feature points in the sample video frames;
the processing module is used for inputting the sample video frame sequence into a three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and the adjacent video frames of each sample video frame;
and the training module is used for training the three-dimensional coordinate extraction model according to the errors between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
Optionally, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and a neighboring network node of the target network node; the processing module comprises:
the input unit is used for respectively inputting the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and the arrangement sequence of the at least two network nodes is the same as the arrangement sequence of the input sample video frames in the sample video frame sequence;
the processing unit is used for processing the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes, to obtain a test three-dimensional coordinate set of the at least two sample video frames.
Optionally, the plurality of network layers include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are sequentially connected;
the input unit is further configured to input, for any sample video frame in the sample video frame sequence, the sample video frame to the feature extraction layer;
The processing unit is further used for processing the sample video frame based on the feature extraction layer to obtain the test image feature of the sample video frame;
the processing unit is further configured to process the test image feature and the test image feature of the adjacent video frame of the sample video frame based on the two-dimensional coordinate extraction layer, to obtain a test two-dimensional coordinate set of the sample video frame, where the test two-dimensional coordinate set includes two-dimensional coordinates of a plurality of feature points in the sample video frame;
the processing unit is further configured to process the test two-dimensional coordinate set and the test two-dimensional coordinate set of the adjacent video frame of the sample video frame based on the three-dimensional coordinate extraction layer, so as to obtain a test three-dimensional coordinate set of the sample video frame.
Optionally, the sample sequence acquisition module includes:
the acquisition unit is used for acquiring an original sample video and a three-dimensional coordinate set of a plurality of video frames in the original sample video;
the extraction unit is used for extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining three-dimensional coordinate sets of the at least two sample video frames as sample three-dimensional coordinate sets.
In another aspect, a processing device is provided, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the operations performed in the method of extracting gestures from video, or to implement the operations performed in the three-dimensional coordinate extraction model training method.
In yet another aspect, a computer-readable storage medium is provided, storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the method of extracting gestures from video, or to implement the operations performed in the three-dimensional coordinate extraction model training method.
With the method, device, equipment and storage medium provided by the embodiments of the invention, a target video frame sequence is acquired, a trained three-dimensional coordinate extraction model is invoked, the at least two video frames are input into the model, and their three-dimensional coordinate sets are obtained from it. Because the model derives the three-dimensional coordinate set of any video frame from that frame together with its adjacent video frames in the sequence, it can learn both the relation between a video frame sequence and its three-dimensional coordinate sets and the continuity of gestures across adjacent frames; this conforms to the objective laws of gesture change and improves the accuracy of the model. Based on the model, the three-dimensional coordinate set of each video frame in the target video frame sequence can be extracted directly, which reduces the amount of computation and saves computing time while preserving the accuracy of the coordinate sets. Moreover, since the video need not be shot with multiple cameras, the method is not limited by shooting equipment, is easy to popularize widely, has a broader range of application, and effectively reduces equipment cost.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a three-dimensional coordinate extraction model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a three-dimensional coordinate extraction model training method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a numerical sequence provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a method for extracting gestures from video according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an operational flow provided by an embodiment of the present invention;
FIG. 7 is a flow chart of gesture recognition provided by an embodiment of the present invention;
FIG. 8 is a schematic illustration of a human posture provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an apparatus for extracting gestures from video according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of another apparatus for extracting gestures from video according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a training device for three-dimensional coordinate extraction model according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of another three-dimensional coordinate extraction model training apparatus according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the following detailed description of the embodiments of the present invention will be given with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention, the implementation environment including: terminal 101 and server 102, terminal 101 and server 102 are connected via a network.
The terminal 101 may be a mobile phone, a computer, a tablet computer, or other devices, and the server 102 may be a server, or a server cluster formed by a plurality of servers, or a cloud computing service center.
The embodiment of the invention provides a three-dimensional coordinate extraction model training method, which can train based on a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence to obtain a trained three-dimensional coordinate extraction model.
The method is applied to the server 102, and a trained three-dimensional coordinate extraction model is obtained by the server through training.
The embodiment of the invention also provides a method for extracting gestures from video, which, based on the trained three-dimensional coordinate extraction model, extracts the coordinates of any video frame in the video frame sequence corresponding to the video, obtains the three-dimensional coordinate set of that video frame, and recognizes the gesture from the obtained set, thereby extracting the gestures in the video.
In one possible implementation, the method is applied in the terminal 101.
After the server 102 trains to obtain the three-dimensional coordinate extraction model, the three-dimensional coordinate extraction model is issued to the terminal 101, the terminal obtains three-dimensional coordinate sets of at least two video frames in the target video frame sequence based on the three-dimensional coordinate extraction model, and according to the three-dimensional coordinate sets, the gesture in the video is extracted.
In another possible implementation, the method is applied in the server 102.
The server 102 trains to obtain the three-dimensional coordinate extraction model, then stores the three-dimensional coordinate extraction model, the terminal 101 uploads the target video frame sequence to the server 102, the server 102 obtains a three-dimensional coordinate set of at least two video frames in the target video frame sequence based on the three-dimensional coordinate extraction model, and the gesture in the video is extracted according to the three-dimensional coordinate set.
Fig. 2 is a schematic network structure diagram of a three-dimensional coordinate extraction model according to an embodiment of the present invention, where the three-dimensional coordinate extraction model is used for obtaining a three-dimensional coordinate set of any video frame according to any video frame in any video frame sequence and a neighboring video frame of the any video frame, and the three-dimensional coordinate set includes three-dimensional coordinates of a plurality of feature points in the any video frame.
Referring to fig. 2, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, and each network node is connected to its target network node in the next network layer and to the neighboring network nodes of that target network node.
Each network layer contains the same number of network nodes, arranged in sequence; the target network node of any network node is the node occupying the same position in the next network layer after the layer to which that node belongs.
Taking a first network node in one network layer and a second network node in the next network layer as an example, a connection between them means the following: when processing is performed based on the three-dimensional coordinate extraction model, the processing result produced by the first network node is output to the second network node as its input, and the second network node continues processing from that result.
In fig. 2, the three-dimensional coordinate extraction model includes three network layers, each containing three network nodes, and the target video frame sequence contains three video frames: video frame 1, video frame 2 and video frame 3. The process by which the model extracts the three-dimensional coordinate set of video frame 1 is described below as an example.
Referring to fig. 2, the three-dimensional coordinate extraction model includes a first network layer, a second network layer, and a third network layer connected in sequence, each network layer including three network nodes. Video frame 1 is input to network node 101, video frame 2 is input to network node 102, and video frame 3 is input to network node 103, respectively, in the order of the three video frames. The process flow of each network layer is as follows:
(1) First network layer:
video frame 1 is input to network node 101, network node 101 processes video frame 1 to obtain a processing result, and outputs the processing result to connected network node 201 and network node 202.
In addition, the video frame 2 is input to the network node 102, the network node 102 processes the video frame 2 to obtain a processing result, and outputs the processing result to the connected network node 201, network node 202 and network node 203.
(2) Second network layer:
the network node 201 receives the processing results input by the network node 101 and the network node 102, continues to process the received two processing results, obtains the processing results, and outputs the processing results to the connected network node 301 and the network node 302.
In addition, the network node 202 receives the processing results input by the network node 101, the network node 102 and the network node 103, continues to process the received three processing results, and outputs the processing results to the connected network node 301, network node 302 and network node 303.
(3) Third network layer:
the network node 301 receives the processing results input by the network node 201 and the network node 202, and continues to process the received two processing results to obtain a processing result, namely, the three-dimensional coordinate set 1 of the video frame 1.
The process of extracting the three-dimensional coordinate set 2 of the video frame 2 and the three-dimensional coordinate set 3 of the video frame 3 by the three-dimensional coordinate extraction model is similar to that, and will not be described in detail here.
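For concreteness, a minimal Python sketch of this crossed routing follows; the per-node computation (node_fn, and the element-wise averaging used to merge inputs) is a placeholder chosen only to illustrate how node i of one layer feeds nodes i-1, i and i+1 of the next, not the model's actual node operation:

```python
from typing import Callable, List, Sequence

Vector = List[float]

def cross_layer(inputs: Sequence[Vector],
                node_fn: Callable[[Vector], Vector]) -> List[Vector]:
    """One crossed layer: node i receives the results of nodes i-1, i
    and i+1 of the previous layer, where those nodes exist."""
    outputs = []
    for i in range(len(inputs)):
        received = [inputs[j] for j in (i - 1, i, i + 1) if 0 <= j < len(inputs)]
        # Placeholder merge: element-wise average of the received results.
        merged = [sum(vals) / len(vals) for vals in zip(*received)]
        outputs.append(node_fn(merged))
    return outputs

# Three "video frames", as in fig. 2; the first layer processes each frame alone.
frames: List[Vector] = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
layer1 = [list(f) for f in frames]
layer2 = cross_layer(layer1, node_fn=lambda v: v)
layer3 = cross_layer(layer2, node_fn=lambda v: v)  # one result per frame
```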
In the three-dimensional coordinate extraction model provided by the embodiment of the invention, each network node is connected both to its target network node and to the neighboring nodes of that target node, forming a crossed network architecture. When this architecture is used to extract the three-dimensional coordinate set of each video frame in a video frame sequence, the continuity between a frame and its adjacent frames is taken into account, so the extracted three-dimensional coordinate sets are more accurate.
It should be noted that the embodiments are described with a model of three network layers merely as an example; in other embodiments the three-dimensional coordinate extraction model may include a different number of network layers, for example 4 network layers or 18 network layers.
Fig. 3 is a flowchart of a three-dimensional coordinate extraction model training method according to an embodiment of the present invention. The execution body of the embodiment of the present invention is a training device, and the training device may be a terminal or a server shown in fig. 1, referring to fig. 3, and the method includes:
301. The training device acquires a sample video frame sequence and the sample three-dimensional coordinate sets of at least two sample video frames in the sample video frame sequence.
The embodiment of the invention provides a three-dimensional coordinate extraction model, which is used to obtain the three-dimensional coordinate set of any video frame from that video frame and its adjacent video frames in any video frame sequence, the three-dimensional coordinate set comprising the three-dimensional coordinates of a plurality of feature points in the video frame.
The training process is described here using a single sample video frame sequence as an example. A sample video frame sequence comprising at least two sample video frames is acquired, together with the sample three-dimensional coordinate set of each sample video frame; the sample three-dimensional coordinate set is the actual three-dimensional coordinate set of the sample video frame and comprises the three-dimensional coordinates of a plurality of feature points in it. The sample video frame sequence is input into the three-dimensional coordinate extraction model, a test three-dimensional coordinate set of each sample video frame is obtained from that frame and its adjacent video frames, and the model is trained according to the error between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
In one possible implementation, the training device acquires an original sample video comprising a plurality of video frames together with the three-dimensional coordinate set of each video frame, and obtains the sample video frame sequence from the original sample video; the sequence may include all video frames of the original sample video or only some of them.
The original sample video may be any video. In terms of content, it may be a dance video, an entertainment news video, a sports video, etc. In terms of source, it may be a video shot by the training device through a camera, a video downloaded from the Internet, a video acquired by an operator and then input into the training device, a video sent by another device, or a video acquired by the training device in some other way.
In one possible implementation manner, after the training device acquires the original sample video, at least two video frames are extracted from the original sample video according to a preset extraction mode, the at least two video frames are determined to be sample video frames, the three-dimensional coordinate set of the at least two sample video frames is determined to be a sample three-dimensional coordinate set, and the extracted at least two sample video frames form a sample video frame sequence.
The preset extraction mode may be extracting frames sequentially in their order in the original sample video, extracting one frame every preset number of frames, extracting one frame every preset time interval, or downsampling the original sample video. The preset extraction mode may be set by default in the training device or determined according to the number of video frames contained in the original sample video.
For example, if the original sample video includes 30 video frames, then starting from the first video frame and extracting one frame every two frames in order of arrangement yields 10 video frames; these 10 frames are determined as sample video frames and form the sample video frame sequence, as in the sketch below.
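A minimal sketch of this interval-based extraction, assuming only that one frame is kept and `gap` frames are skipped (the function and parameter names are illustrative):

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

def extract_sample_sequence(video: Sequence[Frame], gap: int = 2) -> List[Frame]:
    """Keep one frame, skip `gap` frames, starting from the first frame."""
    return [frame for i, frame in enumerate(video) if i % (gap + 1) == 0]

# 30 video frames with a gap of 2 yield the 10 sample frames of the example.
assert len(extract_sample_sequence(list(range(30)), gap=2)) == 10
```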
In one possible implementation, when the original sample video is shot in real time, the training device extracts frames as shooting proceeds: each time a preset number of video frames has been captured, it extracts from them to obtain a sample video frame sequence, and when a further preset number of new frames has been captured, it extracts again to obtain another sample video frame sequence. In another possible implementation, when the original sample video is a locally stored video, the training device extracts a plurality of video frames from it as sample video frames according to the preset extraction mode and forms them into a sample video frame sequence.
In one possible implementation, to improve the generalization of the three-dimensional coordinate extraction model, data enhancement may be applied to any sample video frame after the sample video frame sequence is obtained. Simulating real conditions such as deformation or flipping makes the sample video frames closer to real video frames, which improves both the authenticity and the stability of the samples and in turn the robustness of the model.
The data enhancement may include at least one of the following eight modes (a sketch of several of them follows the list):
1. random horizontal flip: randomly flipping the sample video frames left and right, e.g., with a probability of 0.5;
2. randomly rotating: randomly rotating the sample video frames, for example by 90 ° or 180 ° or 270 ° with a probability of 0.1;
3. randomly cutting: cropping the sample video frames randomly, e.g., cropping the center of the sample video frames with a probability of 0.5;
4. random HSV (hue-saturation-value) change: the HSV values of the pixel points in the sample video frames are randomly changed, for example, the sample video frames are converted into an HSV format from an RGB (Red-Green-Blue) format, and after the HSV values of the pixel points in the sample video frames are changed with the probability of 0.3, the changed sample video frames are converted into an RGB format from the HSV format;
5. random histogram equalization: randomly performing contrast-limited adaptive histogram equalization (CLAHE) on the sample video frames, for example with a probability of 0.3;
6. random Gaussian blur: randomly applying Gaussian blur to the sample video frames, for example with a probability of 0.3, where the size of the Gaussian kernel may be randomly selected from 3, 5, 7, 9 and 11;
7. random motion blur: randomly applying motion blur to the sample video frames, for example with a probability of 0.3, where the size of the convolution kernel may be randomly selected from 5, 7, 9, 11 and 13, and the direction randomly from (1°, 360°);
8. random Gaussian noise: randomly adding Gaussian noise to the sample video frames, for example with a probability of 0.3, with a mean of 0 and a variance between 50 and 200.
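The following is a hedged NumPy sketch of three of the eight modes (random horizontal flip, random rotation, random Gaussian noise) using the probabilities and parameter ranges quoted above; the remaining modes (cropping, HSV change, CLAHE and the blurs) would typically rely on an image library such as OpenCV and are omitted:

```python
import numpy as np

rng = np.random.default_rng()

def augment(frame: np.ndarray) -> np.ndarray:
    """frame: H x W x 3 uint8 image."""
    if rng.random() < 0.5:                   # 1. random horizontal flip, p = 0.5
        frame = frame[:, ::-1]
    if rng.random() < 0.1:                   # 2. random 90/180/270 deg rotation, p = 0.1
        frame = np.rot90(frame, k=int(rng.integers(1, 4)))
    if rng.random() < 0.3:                   # 8. random Gaussian noise, p = 0.3
        variance = rng.uniform(50.0, 200.0)  # mean 0, variance in [50, 200]
        noise = rng.normal(0.0, np.sqrt(variance), size=frame.shape)
        frame = np.clip(frame.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    return frame

sample = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = augment(sample)
```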
302. The training device inputs the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, respectively.
Assume a person in motion is photographed, and the resulting video includes a number of video frames S1, S2, S3, ..., Sn, with a small interval between any two consecutive frames. Taking video frames S1, S2 and S3 as an example: if the situations of S1 and S3 are known, then by the objective laws of human motion it is easy to see that S2 is a transition between S1 and S3, and since the transition time is very short, S1 and S3 can exert a sufficiently strong supervision on S2. That is, any video frame is continuous with its adjacent frames, and this association between adjacent frames carries over to their three-dimensional coordinate sets. A more accurate three-dimensional coordinate extraction model can therefore be trained on this continuity, and when such a model extracts the three-dimensional coordinate set of any video frame, it can do so from the frame together with its adjacent frames so as to improve accuracy.
To this end, an embodiment of the present invention provides a cross three-dimensional coordinate extraction model, where the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, and each network node is connected to a target network node in a next network layer of the network layer to which the network node belongs and to a neighboring network node of the target network node. After the training device acquires the sample video frame sequence, at least two sample video frames contained in the sample video frame sequence are respectively input into at least two network nodes in a first network layer of the three-dimensional coordinate extraction model. Wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input sample video frames in the sample video frame sequence.
For example, if the sample video frame sequence includes 3 sample video frames and the first network layer of the three-dimensional coordinate extraction model includes three network nodes, the first sample video frame is input to the first network node, the second sample video frame to the second network node, and the third sample video frame to the third network node.
303. The training device processes the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes to obtain a test three-dimensional coordinate set of the at least two sample video frames.
After at least two sample video frames are respectively input into at least two network nodes of a first network layer, the training device processes the at least two sample video frames based on the connection relation between each network node and any two network nodes in the three-dimensional coordinate extraction model to obtain a test three-dimensional coordinate set of the at least two sample video frames.
After the training device inputs the at least two sample video frames to the at least two network nodes in the first network layer, each of those nodes processes its input sample video frame to obtain a processing result, and the results of the at least two nodes in the first network layer are then input to the connected nodes in the next network layer, namely each node's target network node in the second network layer and the neighboring nodes of that target network node.
For any network node in the second network layer, its input therefore includes the processing result of the corresponding node in the first layer as well as the results of that node's neighbors. The node continues to process the results input from the previous network layer to obtain its own processing result. This result can be regarded as combining the sample video frame corresponding to the node with the adjacent video frames of that frame, and so reflects the continuity and association between them.
And by analogy, after processing of a plurality of network layers, a test three-dimensional coordinate set of each sample video frame can be obtained, wherein the test three-dimensional coordinate set is a three-dimensional coordinate set predicted by a three-dimensional coordinate extraction model.
In one possible implementation, the plurality of network layers of the three-dimensional coordinate extraction model include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are sequentially connected, and for any sample video frame in a sequence of sample video frames, the sample video frame is input to the feature extraction layer, and the sample video frame is processed based on the feature extraction layer to obtain a test image feature of the sample video frame. And processing the test image features and the test image features of the adjacent video frames of the sample video frame based on the two-dimensional coordinate extraction layer to obtain a test two-dimensional coordinate set of the sample video frame. And processing the test two-dimensional coordinate set and the test two-dimensional coordinate set of the adjacent video frame of the sample video frame based on the three-dimensional coordinate extraction layer to obtain the test three-dimensional coordinate set of the sample video frame.
Wherein the test two-dimensional coordinate set of the sample video frame includes two-dimensional coordinates of a plurality of feature points in the sample video frame.
In one possible implementation, the feature extraction layer includes a plurality of feature extraction nodes, the two-dimensional coordinate extraction layer includes a plurality of two-dimensional coordinate extraction nodes, and the three-dimensional coordinate extraction layer includes a plurality of three-dimensional coordinate extraction nodes. Each feature extraction node is connected with its target two-dimensional coordinate extraction node in the two-dimensional coordinate extraction layer and with the neighboring two-dimensional coordinate extraction nodes of that target node. Each two-dimensional coordinate extraction node is connected with its target three-dimensional coordinate extraction node in the three-dimensional coordinate extraction layer and with the neighboring three-dimensional coordinate extraction nodes of that target node.
For any sample video frame in the sequence of sample video frames, the process of processing the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relationship between any two network nodes to obtain the test three-dimensional coordinate set of the at least two sample video frames may include:
(1) The sample video frames are input to the corresponding feature extraction nodes, and the order of the feature extraction nodes in the feature extraction layer is the same as the order of the sample video frames in the sample video frame sequence.
(2) And processing the sample video frame based on the feature extraction node input into the sample video frame to obtain the test image feature of the sample video frame. And respectively inputting the test image characteristics of the sample video frame into a target two-dimensional coordinate extraction node and adjacent two-dimensional coordinate extraction nodes of the target two-dimensional coordinate extraction node.
The test image features of the sample video frame are used for describing the sample video frame, and may include feature values of all pixels of the sample video frame or only feature values of part of pixels of the sample video frame.
For example, when a feature extraction node processes the sample video frame, it extracts a plurality of feature points from the frame to obtain their feature values, and these feature values form the test image feature. The plurality of feature points belong to an object included in the sample video frame; the object may be a human body, an animal body, or another kind of body, and the feature points may include an eye key point, a nose key point, a shoulder key point, a knee key point, an elbow key point, etc.
(3) And processing the test image features of the sample video frame and the test image features of the adjacent sample video frames of the sample video frame based on the target two-dimensional coordinate extraction node of the feature extraction node to obtain a test two-dimensional coordinate set of the sample video frame, wherein the test two-dimensional coordinate set comprises the two-dimensional coordinates of a plurality of feature points in the sample video frame. And respectively inputting the test two-dimensional coordinate set of the sample video frame into a target three-dimensional coordinate extraction node and the adjacent three-dimensional coordinate extraction nodes of the target three-dimensional coordinate extraction node.
Each two-dimensional coordinate extraction node is used for acquiring two-dimensional coordinates of a plurality of feature points in the test image features according to the input test image features, so that a test two-dimensional coordinate set is obtained, and the test two-dimensional coordinate set can describe objects included in a sample video frame from a two-dimensional angle.
(4) And processing the test two-dimensional coordinate set of the sample video frame and the test two-dimensional coordinate set of the adjacent sample video frame of the sample video frame based on the target three-dimensional coordinate extraction node of the target two-dimensional coordinate extraction node to obtain the test three-dimensional coordinate set of the sample video frame.
Each three-dimensional coordinate extraction node is used for processing an input test two-dimensional coordinate set to obtain a test three-dimensional coordinate set corresponding to the test two-dimensional coordinate set, wherein the test three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of characteristic points, and objects included in a sample video frame can be described from a three-dimensional angle.
Referring to fig. 4, for example, the video frame sequence includes 243 video frames. The numerical sequence obtained after processing by the feature extraction nodes is (243, 17), indicating 243 video frames with 17 feature points extracted from each frame. The numerical sequence obtained after processing by the two-dimensional coordinate extraction nodes is (243, 34): the two-dimensional coordinates of each of the 17 feature points comprise two coordinate values, giving 34 coordinate values per frame. After processing by the three-dimensional coordinate extraction nodes, the numerical sequence is (243, 51): the three-dimensional coordinates of each of the 17 feature points comprise three coordinate values, giving 51 coordinate values per frame.
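As an illustration only, the shape progression described above can be reproduced with a few lines of NumPy; the frame and feature-point counts follow the fig. 4 example, while the array contents are placeholders:

```python
import numpy as np

# 243 video frames, 17 feature points per frame (values here are placeholders).
num_frames, num_points = 243, 17

features  = np.zeros((num_frames, num_points))       # (243, 17): one feature value per point
coords_2d = np.zeros((num_frames, num_points * 2))   # (243, 34): two coordinate values per point
coords_3d = np.zeros((num_frames, num_points * 3))   # (243, 51): three coordinate values per point

print(features.shape, coords_2d.shape, coords_3d.shape)
# (243, 17) (243, 34) (243, 51)
```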
In another possible implementation, the three-dimensional coordinate extraction model may include a residual network. The residual network includes residual blocks, each residual block includes a plurality of network layers, and each network layer may perform a convolution operation according to a convolution kernel. The number of residual blocks and the size of the convolution kernel can be set arbitrarily.
For example, the residual network may include four network layers, each adopting a one-dimensional convolution residual structure; the convolution kernel of each convolution layer has a width of 3 and a channel number of 1024, and a random discard (dropout) rate of 0.25 is applied to the processing result.
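A minimal sketch of such a residual block, assuming a PyTorch implementation, is shown below; only the kernel width (3), channel number (1024), and discard rate (0.25) come from the text, while the exact layer ordering and the use of batch normalization are assumptions:

```python
import torch
import torch.nn as nn

class TemporalResidualBlock(nn.Module):
    """One-dimensional convolution residual block: width-3 kernels,
    1024 channels, dropout 0.25 on the processing result."""
    def __init__(self, channels: int = 1024, dropout: float = 0.25):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        out = self.drop(self.relu(self.bn1(self.conv1(x))))
        out = self.drop(self.relu(self.bn2(self.conv2(out))))
        return out + residual  # skip connection closes the residual block

# x: (batch, channels, frames)
x = torch.randn(1, 1024, 243)
print(TemporalResidualBlock()(x).shape)  # torch.Size([1, 1024, 243])
```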
In another possible implementation, the three-dimensional coordinate extraction model includes a VGG (Visual Geometry Group) network, based on which effects similar to those achieved with the residual network can be obtained.
For any of the above network structures, the convolution performed by a convolution layer may be of several kinds. For example, hole (dilated) convolution may be used: introducing hole convolution enlarges the receptive field of the convolution kernel without increasing the number of parameters, and can also improve computation efficiency.
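The receptive-field effect of hole (dilated) convolution can be read directly off the layer parameters; the sketch below, with assumed channel sizes, compares an ordinary width-3 kernel with a dilated one:

```python
import torch.nn as nn

# An ordinary width-3 kernel reads 3 consecutive frames (receptive field 3).
dense = nn.Conv1d(in_channels=1024, out_channels=1024, kernel_size=3, dilation=1)

# With dilation 3, the same width-3 kernel spans 7 frames (receptive field
# dilation * (kernel_size - 1) + 1 = 7) at the same parameter count.
holed = nn.Conv1d(in_channels=1024, out_channels=1024, kernel_size=3, dilation=3)
```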
The two-dimensional coordinate extraction layer may adopt an FPN (Feature Pyramid Network) structure. By adopting the FPN structure, the input data can be processed at a finer granularity, which improves the processing effect and the accuracy.
304. The training device trains each network node in the three-dimensional coordinate extraction model according to errors between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
When the three-dimensional coordinate extraction model extracts a test three-dimensional coordinate set of a sample video frame, an error may exist between the extracted test three-dimensional coordinate set and the sample three-dimensional coordinate set of that frame: the smaller the error, the better the extraction effect of the three-dimensional coordinate extraction model; the larger the error, the worse the extraction effect.
Therefore, the error between the sample three-dimensional coordinate set and the test three-dimensional coordinate set of the sample video frame can reflect the current extraction effect of the three-dimensional coordinate extraction model. Training the three-dimensional coordinate extraction model according to errors between the sample three-dimensional coordinate set and the test three-dimensional coordinate set of the sample video frame, so that the errors of the three-dimensional coordinate set extracted from any video frame in any video frame sequence based on the trained three-dimensional coordinate extraction model are reduced, and the effect of optimizing the three-dimensional coordinate extraction model is achieved.
The training process for the three-dimensional coordinate extraction model is essentially a process of adjusting model parameters of the three-dimensional coordinate extraction model.
The three-dimensional coordinate extraction model includes a plurality of network layers, each network layer including a plurality of network nodes, each network node having model parameters, and processing based on the model parameters. When the three-dimensional coordinate extraction model is trained according to the error between the sample three-dimensional coordinate set and the test three-dimensional coordinate set of the sample video frame, the model parameters of each network node can be adjusted according to the error, so that the error of the three-dimensional coordinate set extracted by the adjusted three-dimensional coordinate extraction model is reduced, and the effect of optimizing the three-dimensional coordinate extraction model is achieved.
In one possible implementation, the model parameters are adjusted according to a learning rate, which may be a step learning rate, an exponential learning rate, or another type of learning rate.
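Both schedule types exist as ready-made PyTorch schedulers; the sketch below is illustrative only, and the step size, decay factors, and base learning rate are arbitrary assumptions:

```python
import torch
from torch.optim.lr_scheduler import StepLR, ExponentialLR

param = torch.nn.Parameter(torch.zeros(1))  # stand-in for real model parameters

# Step learning rate: multiply the learning rate by 0.1 every 30 epochs.
opt_a = torch.optim.SGD([param], lr=0.01)
step_schedule = StepLR(opt_a, step_size=30, gamma=0.1)

# Exponential learning rate: multiply the learning rate by 0.95 every epoch.
opt_b = torch.optim.SGD([param], lr=0.01)
exp_schedule = ExponentialLR(opt_b, gamma=0.95)

for epoch in range(3):
    opt_a.step(); step_schedule.step()
    opt_b.step(); exp_schedule.step()
```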
In one possible implementation, a loss function may be set for the three-dimensional coordinate extraction model, and the model parameters may be adjusted according to the output value of the loss function. The loss function may be a binary cross-entropy loss function, a Dice loss function, or another loss function. For example, the loss function may be as follows:

$$E = \frac{1}{N} \sum_{i=1}^{N} \alpha_i \left\| f(x_i) - y_i \right\|$$

where $E$ represents the output value of the loss function, $N$ represents the number of feature points, $x_i$ represents the pixel value of the $i$-th feature point, $f(x_i)$ represents the test three-dimensional coordinates of the $i$-th feature point, $y_i$ represents the sample three-dimensional coordinates of the $i$-th feature point, $\left\| f(x_i) - y_i \right\|$ is the distance between the test three-dimensional coordinates and the sample three-dimensional coordinates of the $i$-th feature point, and $\alpha_i$ represents the weighting coefficient of the $i$-th feature point.
When the three-dimensional coordinate extraction model is trained based on the loss function, the output value of the loss function is obtained according to the error, and the model parameters of each network node are adjusted according to that output value, so that the error of the three-dimensional coordinate set extracted by the adjusted model is reduced, achieving the effect of optimizing the three-dimensional coordinate extraction model.
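Under the reconstruction of the loss function given above, one training step might look like the following sketch; the averaging over N inside the formula and all function and variable names here are assumptions:

```python
import torch

def weighted_coordinate_loss(pred, target, alpha):
    """E = (1/N) * sum_i alpha_i * ||f(x_i) - y_i|| over N feature points.
    pred, target: (N, 3) test / sample three-dimensional coordinates.
    alpha: (N,) per-feature-point weighting coefficients."""
    distances = torch.linalg.norm(pred - target, dim=-1)  # ||f(x_i) - y_i||
    return (alpha * distances).mean()

pred = torch.randn(17, 3, requires_grad=True)   # test three-dimensional coordinates
target = torch.randn(17, 3)                     # sample three-dimensional coordinates
alpha = torch.ones(17)                          # equal weighting coefficients

loss = weighted_coordinate_loss(pred, target, alpha)
loss.backward()  # gradients of the loss drive the parameter adjustment
```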
The training device can train the three-dimensional coordinate extraction model with a preset training algorithm, which may be a convolutional neural network algorithm, a recurrent neural network algorithm, a deep learning algorithm, an SVM (Support Vector Machine) algorithm, or a depthwise separable convolution algorithm. Different preset algorithms yield different models, such as a convolutional neural network model, a recurrent neural network model, a deep learning model, an SVM model, or a depthwise separable convolutional network model.
When the training device optimizes the three-dimensional coordinate extraction model, SGD (Stochastic Gradient Descent) can be adopted: an optimization algorithm iteratively adjusts the model parameters of each network node, gradually completing the training of the model. Alternatively, the Adam (adaptive moment estimation) algorithm or the AMSGrad algorithm (a variant of Adam) may be used for optimization.
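The optimizer choices named above map directly onto standard PyTorch optimizers; the sketch below uses a stand-in model and arbitrary hyperparameters:

```python
import torch

model = torch.nn.Linear(34, 51)  # stand-in for the three-dimensional coordinate extraction model

sgd = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)       # stochastic gradient descent
adam = torch.optim.Adam(model.parameters(), lr=1e-3)                   # Adam
amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)  # AMSGrad variant

# One iterative adjustment of the model parameters:
optimizer = amsgrad
optimizer.zero_grad()
loss = model(torch.randn(8, 34)).pow(2).mean()
loss.backward()
optimizer.step()
```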
The embodiment of the invention takes a single sample video frame sequence as an example to describe the process of training the three-dimensional coordinate extraction model. To improve the accuracy of the model, a plurality of sample video frame sequences may be acquired, each including at least two sample video frames. The three-dimensional coordinate extraction model is then trained multiple times according to the plurality of sample video frame sequences, so that the error of the three-dimensional coordinate set extracted from any video frame in any video frame sequence is reduced, until the model converges and a trained three-dimensional coordinate extraction model is obtained. At this point, the accuracy of the model meets the training requirement, and the extracted three-dimensional coordinate set can be guaranteed to describe the three-dimensional coordinate information contained in any video frame as accurately as possible.
According to the method provided by the embodiment of the invention, a sample video frame sequence comprising at least two sample video frames and the sample three-dimensional coordinate set of each sample video frame are obtained; the sample video frames are processed based on the structure of the three-dimensional coordinate extraction model and the connection relations within that structure to obtain test three-dimensional coordinate sets, and the model is trained according to the errors between the test three-dimensional coordinate sets and the sample three-dimensional coordinate sets. Because the model is trained on the continuity between adjacent video frames in the sequence, mutual supervision between adjacent video frames is enhanced: the model can learn both the relation between the video frame sequence and the three-dimensional coordinate set and the continuity of the posture between adjacent video frames, which conforms to the objective rules of posture change and improves the accuracy of the three-dimensional coordinate extraction model.
It should be noted that, in the embodiment of the present invention, only the network structure shown in fig. 2 is taken as an example for illustration, and in another embodiment, the three-dimensional coordinate extraction model may also adopt other network structures, and only the three-dimensional coordinate extraction model needs to be ensured to be capable of considering the continuity between adjacent video frames, and the three-dimensional coordinate set is obtained according to any video frame and its adjacent video frames in any video frame sequence.
After the training of the three-dimensional coordinate extraction model is completed, the processing device can acquire a three-dimensional coordinate set of the video frame based on the three-dimensional coordinate extraction model. The processing device may be the same as the training device, or the processing device may be different from the training device, that is, the training device trains the three-dimensional coordinate extraction model and then provides the three-dimensional coordinate extraction model to the processing device, and the processing device applies the three-dimensional coordinate extraction model to perform three-dimensional coordinate extraction.
FIG. 5 is a flow chart of a method for extracting gestures from video according to an embodiment of the present invention. The execution body of the embodiment of the present invention is a processing device, and the processing device may be a terminal or a server shown in fig. 1, referring to fig. 5, where the method includes:
501. the processing device obtains a sequence of target video frames.
The embodiment of the invention only takes the target video frame sequence as an example to describe the process of extracting the three-dimensional coordinate set of the video frames in the target video frame sequence, wherein the three-dimensional coordinate set comprises the three-dimensional coordinates of a plurality of feature points in the corresponding video frames.
First, a processing device obtains a target video frame sequence comprising at least two video frames.
In one possible implementation, the processing device acquires an original video, where the original video includes a plurality of video frames, and acquires a target video frame sequence according to the original video, where the target video frame sequence may include all video frames in the original video or may include only some video frames in the original video.
The original video may be any video acquired by the processing device. In terms of content, the original video may be a dance video, an entertainment news video, a sports video, and so on. In terms of source, the original video may be a video shot by the processing device through a camera, a video downloaded from the Internet, a video acquired by an operator and then input into the processing device, a video sent by another device, or a video acquired by the processing device in some other way.
In one possible implementation, after the original video is acquired, at least two video frames are extracted from the original video according to a preset extraction mode, and the extracted at least two video frames form a target video frame sequence.
The preset extraction mode may be extracting sequentially according to the arrangement order of the video frames in the original video, extracting one frame at intervals of a preset number of video frames, extracting one frame at intervals of a preset duration, or downsampling the original video. The preset extraction mode may be determined by default by the processing device, or determined according to the number of video frames contained in the original video.
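The preset extraction modes reduce to plain frame-index selection; the function below is hypothetical and only illustrates sequential and interval (downsampling) extraction:

```python
def extract_frames(frames, mode="all", step=1):
    """Hypothetical sketch of two preset extraction modes."""
    frames = list(frames)
    if mode == "all":        # keep every frame in its original order
        return frames
    if mode == "interval":   # keep one frame per `step` frames (downsampling)
        return frames[::step]
    raise ValueError(f"unknown extraction mode: {mode}")

original_video = list(range(100))  # stand-in for 100 decoded video frames
target_sequence = extract_frames(original_video, mode="interval", step=4)
print(len(target_sequence))  # 25 frames form the target video frame sequence
```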
502. The processing means invokes the three-dimensional coordinate extraction model.
The processing device stores a trained three-dimensional coordinate extraction model, and the three-dimensional coordinate extraction model can be obtained by training the processing device or provided to the processing device after being obtained by training by the training device. When the processing device acquires the target video frame sequence and needs to extract the three-dimensional coordinate set, the three-dimensional coordinate extraction model can be called.
503. The processing means inputs the at least two video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model.
The processing device calls the trained three-dimensional coordinate extraction model, inputs the target video frame sequence into the three-dimensional coordinate extraction model, and acquires the three-dimensional coordinate set of the at least two target video frames.
In the embodiment of the invention, the three-dimensional coordinate extraction model comprises a plurality of network layers, each network layer comprises a plurality of network nodes, and each network node is connected with a target network node in the next network layer of the network layer and a neighboring network node of the target network node. At least two video frames are respectively input to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model. And processing at least two video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes to obtain a three-dimensional coordinate set of at least two video frames. Wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input video frames in the target video frame sequence.
504. And processing at least two video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes to obtain a three-dimensional coordinate set of at least two video frames.
Inputting a target video frame sequence into a three-dimensional coordinate extraction model, and acquiring a three-dimensional coordinate set of the at least two video frames, wherein the process comprises the following steps:
(1) For any target video frame in the target video frame sequence, inputting the target video frame into at least two network nodes in a first network layer, and respectively processing the input video frame by the at least two network nodes in the first network layer to obtain a corresponding first processing result.
(2) The first processing result obtained by each network node in the first network layer is input to the connected network nodes in the next network layer, namely the target network node of that node in the second network layer and the adjacent network nodes of the target network node. The network nodes in the next network layer each process the input first processing results to obtain corresponding second processing results. A second processing result may be regarded as the result of combining the video frame corresponding to the network node with the adjacent video frames of that frame, so as to reflect the continuity and relevance between them.
The first processing results input to any network node in the second network layer comprise the first processing result of its corresponding network node in the first network layer and the first processing results of the adjacent network nodes of that corresponding node.
(3) And so on, after processing of a plurality of network layers, a three-dimensional coordinate set of each video frame can be obtained.
In one possible implementation, the plurality of network layers of the three-dimensional coordinate extraction model include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are connected in sequence. For any target video frame in the target video frame sequence, the process of processing the target video frame based on the three-dimensional coordinate extraction model to obtain its three-dimensional coordinate set is as follows. The target video frame is input to the feature extraction layer, which processes it to obtain the image features of the target video frame. The two-dimensional coordinate extraction layer then processes the image features of the target video frame and the image features of the adjacent video frames to obtain a two-dimensional coordinate set of the target video frame, which includes the two-dimensional coordinates of a plurality of feature points in the target video frame. Finally, the three-dimensional coordinate extraction layer processes the two-dimensional coordinate set of the target video frame and the two-dimensional coordinate sets of the adjacent video frames to obtain the three-dimensional coordinate set of the target video frame.
In one possible implementation, the feature extraction layer includes a plurality of feature extraction nodes, the two-dimensional coordinate extraction layer includes a plurality of two-dimensional coordinate extraction nodes, and the three-dimensional coordinate extraction layer includes a plurality of three-dimensional coordinate extraction nodes. Each feature extraction node is connected with a target two-dimensional coordinate extraction node in the two-dimensional coordinate extraction layer and with the adjacent two-dimensional coordinate extraction nodes of that target node; each two-dimensional coordinate extraction node is connected with a target three-dimensional coordinate extraction node in the three-dimensional coordinate extraction layer and with the adjacent three-dimensional coordinate extraction nodes of that target node. For any target video frame in the target video frame sequence, the process of processing the target video frame based on the three-dimensional coordinate extraction model to obtain its three-dimensional coordinate set includes the following steps (a minimal code sketch of this pipeline follows step (4) below):
(1) The target video frames are input to the corresponding feature extraction nodes, and the order of the feature extraction nodes in the feature extraction layer is the same as the order of the target video frames in the target video frame sequence.
(2) And processing the target video frame based on the feature extraction node input into the target video frame to obtain the image features of the target video frame. And respectively inputting the image features of the target video frame into a target two-dimensional coordinate extraction node and adjacent two-dimensional coordinate extraction nodes of the target two-dimensional coordinate extraction node.
(3) And processing the image features of the target video frame and the image features of the adjacent video frames of the target video frame based on the target two-dimensional coordinate extraction node of the feature extraction node to obtain a two-dimensional coordinate set of the target video frame, wherein the two-dimensional coordinate set comprises the two-dimensional coordinates of a plurality of feature points in the target video frame.
(4) And processing the two-dimensional coordinate set of the target video frame and the two-dimensional coordinate set of the adjacent video frame of the target video frame based on the target three-dimensional coordinate extraction node of the target two-dimensional coordinate extraction node to obtain the three-dimensional coordinate set of the target video frame.
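A minimal sketch of the three-layer pipeline follows. Width-3 one-dimensional convolutions make each output node read its target input node plus the adjacent nodes, which mirrors the connection pattern in steps (1) to (4); the feature dimensions and the use of plain convolutions are assumptions, not the patented structure itself:

```python
import torch
import torch.nn as nn

class CoordinateExtractionSketch(nn.Module):
    """Sketch: image features -> 2D coordinate set -> 3D coordinate set,
    with every node combining itself and its two temporal neighbours."""
    def __init__(self, feat_dim=17, num_points=17):
        super().__init__()
        self.to_2d = nn.Conv1d(feat_dim, num_points * 2, kernel_size=3, padding=1)
        self.to_3d = nn.Conv1d(num_points * 2, num_points * 3, kernel_size=3, padding=1)

    def forward(self, image_features):
        # image_features: (batch, frames, feat_dim), e.g. (1, 243, 17)
        x = image_features.transpose(1, 2)   # Conv1d expects (batch, dim, frames)
        coords_2d = self.to_2d(x)            # (batch, 34, frames)
        coords_3d = self.to_3d(coords_2d)    # (batch, 51, frames)
        return coords_2d.transpose(1, 2), coords_3d.transpose(1, 2)

model = CoordinateExtractionSketch()
feats = torch.randn(1, 243, 17)
c2d, c3d = model(feats)
print(c2d.shape, c3d.shape)  # torch.Size([1, 243, 34]) torch.Size([1, 243, 51])
```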
505. The processing device invokes a preset database.
The processing device acquires a preset database. The preset database includes correspondences between at least one preset rule and gestures, where a preset rule defines the conditions that the three-dimensional coordinate set of a video frame with the corresponding gesture satisfies. For example, a rule may constrain the displacement between feature points; or, when every two feature points are connected to form straight lines, a rule may constrain the angle between any two of those lines.
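For illustration, an angle rule of the kind described above could be evaluated as follows; the joint choice and the 160-degree threshold are made up for this example:

```python
import numpy as np

def angle_between(p_center, p_a, p_b):
    """Angle (degrees) at p_center between the lines p_center-p_a and
    p_center-p_b, one way an angle-based preset rule could be checked."""
    v1 = np.asarray(p_a) - np.asarray(p_center)
    v2 = np.asarray(p_b) - np.asarray(p_center)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

shoulder, elbow, wrist = [0, 0, 0], [0.3, 0, 0], [0.6, 0, 0]
arm_straight = angle_between(elbow, shoulder, wrist) > 160  # hypothetical rule
print(arm_straight)  # True: the arm is held straight
```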
After the processing device acquires the three-dimensional coordinate set of any video frame in the target video frame sequence, the acquired preset database is called so as to recognize the gesture of the video frame.
506. For any target video frame in the target video frame sequence, the processing device determines a preset rule met by the three-dimensional coordinate set of the target video frame according to a preset database, and determines a gesture corresponding to the preset rule as the gesture of the target video frame.
In one possible implementation, the rule satisfied by the three-dimensional coordinate set of any target video frame in the target video frame sequence is matched against at least one preset rule in the preset database, and a matching degree between the rule satisfied by the three-dimensional coordinate set and each preset rule is determined; the matching degree represents how well the three-dimensional coordinate set satisfies the preset rule. If the matching degree of a preset rule is greater than a preset threshold, that preset rule is determined to be the preset rule satisfied by the three-dimensional coordinate set of the target video frame, and the gesture corresponding to that preset rule is determined as the gesture of the target video frame. When multiple preset rules have a matching degree greater than the preset threshold, the preset rule with the largest matching degree is selected as the preset rule satisfied by the three-dimensional coordinate set of the target video frame, and the corresponding gesture is determined as the gesture of the target video frame.
The matching degree can be expressed in different ways, such as Euclidean distance, cosine similarity, Manhattan distance, or Mahalanobis distance, and each measure relates differently to the degree of matching it expresses. For example, Euclidean distance is negatively correlated with the matching degree: the larger the Euclidean distance, the less the rule satisfied by the three-dimensional coordinate set matches the preset rule. Cosine similarity is positively correlated with the matching degree: the larger the cosine similarity, the more the rule satisfied by the three-dimensional coordinate set matches the preset rule.
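Both measures are straightforward to compute over a three-dimensional coordinate set; in this hypothetical sketch a preset rule is represented by a template coordinate set, which is an assumption about how rules might be stored:

```python
import numpy as np

def euclidean_match(a, b):
    # Negative correlation: larger distance means a worse match.
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def cosine_match(a, b):
    # Positive correlation: larger similarity means a better match.
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

observed = np.random.rand(17, 3)   # extracted three-dimensional coordinate set
template = np.random.rand(17, 3)   # coordinate set implied by a preset rule
print(euclidean_match(observed, template), cosine_match(observed, template))
```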
It should be noted that, in the embodiment of the present invention, step 505 is automatically performed after step 504, and in another embodiment, gesture recognition may be performed in other manners.
For example, after the processing device acquires the three-dimensional coordinate set of at least two video frames of the original video in step 504, the processing device stores the acquired three-dimensional coordinate set. When a gesture recognition instruction for the original video is received, steps 505 to 506 are executed again, and gesture recognition is performed according to the three-dimensional coordinate set.
Alternatively, when the processing device receives a gesture recognition instruction, a video database storing one or more videos is invoked according to the instruction. The original video may be selected by the user from the video database, or carried in the gesture recognition instruction as the video to be recognized. A target video frame sequence is then obtained from the original video, the three-dimensional coordinate set of the target video frame sequence is obtained by the method provided by the embodiment of the invention, and gesture recognition is performed on the original video according to the obtained three-dimensional coordinate set.
Or, when the processing device receives a gesture recognition instruction, the camera is turned on, video is shot through the camera, and the currently shot video is taken as the original video. As shooting proceeds, the processing device can perform gesture recognition on the currently captured video clip, and when a new video clip is subsequently captured, gesture recognition is performed on the newly captured clip.
It should also be noted that the method provided by the embodiment of the present invention may be packaged in the form of an interface and provided to the processing device; the processing device triggers and executes the method by calling the interface, so as to determine the three-dimensional coordinate set of the video frames in the target video frame sequence.
According to the method provided by the embodiment of the invention, the target video frame sequence is acquired, the trained three-dimensional coordinate extraction model is called, the at least two video frames are input into the three-dimensional coordinate extraction model, and the three-dimensional coordinate set of the at least two video frames is acquired based on the three-dimensional coordinate extraction model. The three-dimensional coordinate extraction model provided by the embodiment of the invention is used for acquiring the three-dimensional coordinate set according to any video frame and adjacent video frames in any video frame sequence, so that the relation between the video frame sequence and the three-dimensional coordinate set can be learned, the continuity of the gesture between the adjacent video frames can be learned, the objective rule of gesture change is met, and the accuracy of the three-dimensional coordinate extraction model is improved. Based on the three-dimensional coordinate extraction model, the three-dimensional coordinate set of each video frame in the target video frame sequence can be directly extracted, so that the calculated amount is reduced, the calculation time is saved, and the accuracy of the three-dimensional coordinate set can be ensured. And the video is shot without using a multi-camera, so that the limitation of shooting equipment is avoided, the wide popularization is facilitated, the application range is enlarged, and the equipment cost is effectively reduced.
In addition, the embodiment of the invention can effectively utilize the acquired three-dimensional coordinate set, apply the acquired three-dimensional coordinate set to the scene of gesture recognition, recognize the gesture of the video frame, have better alternatives to the scheme of gesture recognition by utilizing the multi-camera, provide possibility for gesture recognition of equipment with the monocular camera such as a mobile phone, a tablet computer and the like, and provide a foundation for a plurality of application scenes such as subsequent man-machine interaction, somatosensory games and the like.
With the development of intelligent devices, more and more scenes such as somatosensory games and identity verification are involved in recognizing human body gestures. By the method for extracting the gesture from the video, which is provided by the embodiment of the invention, the three-dimensional coordinate set in the video frame containing the human body can be extracted, and the gesture can be identified according to the obtained three-dimensional coordinate set.
In summary, the training process and the application process of the three-dimensional coordinate extraction model may be as shown in fig. 6. The operation flow of the embodiment of the present invention includes two stages:
first, training phase:
1. Pose data, i.e., a sample video frame sequence and the sample three-dimensional coordinate set of each sample video frame, is collected.
2. Data enhancement is performed on the sample video frames to improve the generalization of the samples.
3. Coordinate estimation is performed based on the three-dimensional coordinate extraction model to obtain the test three-dimensional coordinate set of each sample video frame.
4. Training is performed using the sample three-dimensional coordinate set and the test three-dimensional coordinate set of each sample video frame, obtaining the trained three-dimensional coordinate extraction model.
Second, gesture recognition phase:
1. An action video of the user is captured, and a plurality of video frames are extracted from the action video.
2. Coordinate estimation is performed based on the three-dimensional coordinate extraction model to obtain the three-dimensional coordinate set of each video frame.
3. Gesture recognition is performed using the three-dimensional coordinate set of each video frame.
Referring to fig. 7 and 8, in one possible implementation manner, a video including a human body is photographed by using a camera, a three-dimensional coordinate set of each video frame in the video is extracted based on a three-dimensional coordinate extraction model, gesture recognition is performed according to the obtained three-dimensional coordinate set, a gesture of the human body in the video is determined, and man-machine interaction and feedback are performed according to the gesture.
For example, in a motion sensing game scene, a motion sensing game machine is provided with a monocular RGB camera and stores a preset gesture library in advance; the library includes correspondences between a plurality of preset rules and gestures. When a user makes a gesture according to a preset gesture displayed on the display interface of the motion sensing game machine, the machine can shoot a video of the user with the monocular RGB camera, extract three-dimensional coordinates from the video by the method provided by the embodiment of the invention, and query the preset gesture library according to the rule satisfied by the extracted three-dimensional coordinate set, thereby determining whether the gesture made by the user is the same as the preset gesture. When the motion sensing game machine determines that the gesture made by the user is the same as the preset gesture, the current level is passed and the next level can be played.
For example, in an authentication scene, the authentication apparatus stores a preset gesture library in advance, the preset gesture library including correspondence between a plurality of preset rules and gestures, and the authentication apparatus is further configured with a monocular camera. When verification is carried out, a user makes a gesture facing a camera of the verification device, the verification device can shoot a video of the user, the method provided by the embodiment of the invention is adopted to extract three-dimensional coordinates of the video of the user, and a preset gesture library is queried according to rules met by the extracted three-dimensional coordinate set, so that whether the gesture made by the user is identical with the preset gesture can be determined. When the authentication device determines that the gesture of the user is the same as the preset gesture, the authentication is passed.
Fig. 9 is a schematic structural diagram of an apparatus for extracting a gesture from a video according to an embodiment of the present invention.
Referring to fig. 9, the apparatus includes:
a target sequence acquisition module 901, configured to acquire a target video frame sequence, where the target video frame sequence includes at least two video frames;
the model invoking module 902 is configured to invoke a three-dimensional coordinate extraction model, where the three-dimensional coordinate extraction model is configured to obtain a three-dimensional coordinate set of any video frame according to any video frame in any video frame sequence and an adjacent video frame of any video frame, and the three-dimensional coordinate set includes three-dimensional coordinates of a plurality of feature points in any video frame;
The three-dimensional coordinate acquisition module 903 is configured to input a target video frame sequence into the three-dimensional coordinate extraction model, and acquire a three-dimensional coordinate set of at least two video frames;
the gesture determining module 904 is configured to determine, for any video frame in the target video frame sequence, a preset rule that is satisfied by the three-dimensional coordinate set of the target video frame according to a preset database, and determine a gesture corresponding to the preset rule as a gesture of the target video frame, where the preset database includes a correspondence between at least one preset rule and the gesture.
The device provided by the embodiment of the invention acquires a target video frame sequence, invokes the trained three-dimensional coordinate extraction model, inputs the at least two video frames into the three-dimensional coordinate extraction model, and acquires a three-dimensional coordinate set of the at least two video frames based on the three-dimensional coordinate extraction model. The three-dimensional coordinate extraction model provided by the embodiment of the invention is used for acquiring the three-dimensional coordinate set according to any video frame and adjacent video frames in any video frame sequence, so that the relation between the video frame sequence and the three-dimensional coordinate set can be learned, the continuity of the gesture between the adjacent video frames can be learned, the objective rule of gesture change is met, and the accuracy of the three-dimensional coordinate extraction model is improved. Based on the three-dimensional coordinate extraction model, the three-dimensional coordinate set of each video frame in the target video frame sequence can be directly extracted, so that the calculated amount is reduced, the calculation time is saved, and the accuracy of the three-dimensional coordinate set can be ensured. And the video is shot without using a multi-camera, so that the limitation of shooting equipment is avoided, the wide popularization is facilitated, the application range is enlarged, and the equipment cost is effectively reduced.
Optionally, referring to fig. 10, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, each network node is connected with a target network node in a next network layer of the network layer to which it belongs and a neighboring network node of the target network node; the three-dimensional coordinate acquisition module 903 includes:
an input unit 9031, configured to input at least two video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, respectively, where an arrangement order of the at least two network nodes is the same as an arrangement order of the input video frames in the target video frame sequence;
the processing unit 9032 is configured to process at least two video frames based on each network node and a connection relationship between any two network nodes in the three-dimensional coordinate extraction model, so as to obtain a three-dimensional coordinate set of the at least two video frames.
Optionally, the plurality of network layers include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer that are sequentially connected;
an input unit 9031, configured to input, for any target video frame in the target video frame sequence, the target video frame to the feature extraction layer;
the processing unit 9032 is further configured to process the target video frame based on the feature extraction layer, so as to obtain an image feature of the target video frame; processing the image characteristics and the image characteristics of the adjacent video frames of the target video frame based on the two-dimensional coordinate extraction layer to obtain a two-dimensional coordinate set of the target video frame; and processing the two-dimensional coordinate set and the two-dimensional coordinate set of the adjacent video frame of the target video frame based on the three-dimensional coordinate extraction layer to obtain the three-dimensional coordinate set of the target video frame.
Optionally, referring to fig. 10, the apparatus further includes:
a sample sequence obtaining module 905, configured to obtain a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence;
an input module 906, configured to input at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, where an arrangement order of the at least two network nodes is the same as an arrangement order of the input sample video frames in a sample video frame sequence;
the processing module 907 is configured to process at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and a connection relationship between any two network nodes, so as to obtain a test three-dimensional coordinate set of the at least two sample video frames;
the training module 908 is configured to train each network node in the three-dimensional coordinate extraction model according to an error between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
Optionally, referring to fig. 10, the sample sequence acquisition module 905 includes:
an obtaining unit 9051, configured to obtain an original sample video and a sample three-dimensional coordinate set of a plurality of video frames in the original sample video;
The extracting unit 9052 is configured to extract at least two video frames from an original sample video, determine the at least two video frames as sample video frames, construct a sample video frame sequence from the extracted at least two sample video frames, and determine a three-dimensional coordinate set of the at least two sample video frames as a sample three-dimensional coordinate set.
It should be noted that: the device for extracting gestures from video provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the processing device is divided into different functional modules to perform all or part of the functions described above. In addition, the device for extracting the gesture from the video provided in the above embodiment belongs to the same concept as the method embodiment for extracting the gesture from the video, and the specific implementation process of the device is detailed in the method embodiment, which is not described herein again.
Fig. 11 is a schematic structural diagram of a three-dimensional coordinate extraction model training device according to an embodiment of the present invention. Referring to fig. 11, the apparatus includes:
a sample sequence obtaining module 1101, configured to obtain a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence;
The processing module 1102 is configured to input a sample video frame sequence into a three-dimensional coordinate extraction model, and obtain a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and an adjacent video frame of each sample video frame;
the training module 1103 is configured to train the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
Optionally, referring to fig. 12, the three-dimensional coordinate extraction model includes a plurality of network layers, each network layer includes a plurality of network nodes, each network node is connected with a target network node in a next network layer of the network layer to which it belongs and a neighboring network node of the target network node; a processing module 1102 comprising:
an input unit 11021, configured to input at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, where an arrangement order of the at least two network nodes is the same as an arrangement order of the input sample video frames in a sample video frame sequence;
the processing unit 11022 is configured to process at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and a connection relationship between any two network nodes, so as to obtain a test three-dimensional coordinate set of the at least two sample video frames.
Alternatively, referring to fig. 12, the plurality of network layers include a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer, which are sequentially connected;
an input unit 11021 for inputting sample video frames to the feature extraction layer for any sample video frame in the sequence of sample video frames;
the processing unit 11022 is further configured to process the sample video frame based on the feature extraction layer to obtain a test image feature of the sample video frame; based on the two-dimensional coordinate extraction layer, processing the test image features and the test image features of adjacent video frames of the sample video frame to obtain a test two-dimensional coordinate set of the sample video frame, wherein the test two-dimensional coordinate set comprises two-dimensional coordinates of a plurality of feature points in the sample video frame; and processing the test two-dimensional coordinate set and the test two-dimensional coordinate set of the adjacent video frame of the sample video frame based on the three-dimensional coordinate extraction layer to obtain the test three-dimensional coordinate set of the sample video frame.
Optionally, referring to fig. 12, the sample sequence acquisition module 1101 includes:
an acquisition unit 11011 for acquiring an original sample video and a three-dimensional coordinate set of a plurality of video frames in the original sample video;
an extracting unit 11012, configured to extract at least two video frames from an original sample video, determine the at least two video frames as sample video frames, construct a sample video frame sequence from the extracted at least two sample video frames, and determine a three-dimensional coordinate set of the at least two sample video frames as a sample three-dimensional coordinate set.
It should be noted that: in the three-dimensional coordinate extraction model training device provided in the above embodiment, only the division of the above functional modules is used for illustration when the three-dimensional coordinate extraction model is trained, and in practical application, the above functional allocation may be completed by different functional modules according to needs, that is, the internal structure of the training device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the three-dimensional coordinate extraction model training device and the three-dimensional coordinate extraction model training method provided in the above embodiments belong to the same concept, and detailed implementation processes thereof are shown in the method embodiments, and are not repeated here.
Fig. 13 shows a block diagram of a terminal 1300 according to an exemplary embodiment of the present invention, where the terminal 1300 is configured to perform the steps performed by the processing device, which may be the training device or the processing device in the above-described embodiments. The terminal 1300 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, a desktop computer, a head-mounted device, or any other intelligent terminal. The terminal 1300 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 1300 includes: a processor 1301, and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor or an 8-core processor. Processor 1301 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). Processor 1301 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, processor 1301 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 1301 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. Memory 1302 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1302 is used to store at least one instruction, the at least one instruction to be executed by processor 1301 to implement the methods provided by the method embodiments herein.
In some embodiments, the terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. The processor 1301, the memory 1302, and the peripheral interface 1303 may be connected by a bus or signal lines. The respective peripheral devices may be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, a touch display screen 1305, a camera 1306, audio circuitry 1307, and a power supply 1309.
A peripheral interface 1303 may be used to connect I/O (Input/Output) related at least one peripheral to the processor 1301 and the memory 1302. In some embodiments, processor 1301, memory 1302, and peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 1301, the memory 1302, and the peripheral interface 1303 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The radio frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1304 communicates with a communication network and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1304 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1305 is a touch display, the display 1305 also has the ability to capture touch signals at or above the surface of the display 1305. The touch signal may be input to the processor 1301 as a control signal for processing. At this point, the display 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 1305 may be one, providing the front panel of the terminal 1300; in other embodiments, the display 1305 may be at least two, disposed on different surfaces of the terminal 1300 or in a folded configuration; in still other embodiments, the display 1305 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1300. Even more, the display screen 1305 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display screen 1305 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 1306 is used to capture images or video. Optionally, camera assembly 1306 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, camera assembly 1306 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1307 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal 1300, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from the processor 1301 or the radio frequency circuit 1304 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1307 may also comprise a headphone jack.
A power supply 1309 is used to power the various components in the terminal 1300. The power supply 1309 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1309 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1300 also includes one or more sensors 1310. The one or more sensors 1310 include, but are not limited to: acceleration sensor 1311, gyroscope sensor 1312, pressure sensor 1313, optical sensor 1315, and proximity sensor 1316.
The acceleration sensor 1311 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 1300. For example, the acceleration sensor 1311 may be used to detect components of gravitational acceleration in three coordinate axes. Processor 1301 may control touch display screen 1305 to display a user interface in either a landscape view or a portrait view based on gravitational acceleration signals acquired by acceleration sensor 1311. The acceleration sensor 1311 may also be used for the acquisition of motion data of a game or user.
The gyro sensor 1312 may detect the body direction and rotation angle of the terminal 1300, and may cooperate with the acceleration sensor 1311 to collect the user's 3D actions on the terminal 1300. Based on the data collected by the gyro sensor 1312, the processor 1301 can implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1313 may be disposed on a side frame of the terminal 1300 and/or on an underlying layer of the touch display screen 1305. When the pressure sensor 1313 is disposed on a side frame of the terminal 1300, it can detect the user's grip signal on the terminal 1300, and the processor 1301 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1313. When the pressure sensor 1313 is disposed on the underlying layer of the touch display screen 1305, the processor 1301 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 1305. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The optical sensor 1315 is used to collect the ambient light intensity. In one embodiment, the processor 1301 may control the display brightness of the touch display screen 1305 based on the ambient light intensity collected by the optical sensor 1315. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1305 is turned up; when the ambient light intensity is low, the display brightness is turned down. In another embodiment, the processor 1301 may also dynamically adjust the shooting parameters of the camera assembly 1306 based on the ambient light intensity collected by the optical sensor 1315.
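A minimal sketch of the brightness rule just described; the working range in lux is an assumption, since the embodiment fixes no concrete mapping:

    def brightness_from_ambient(lux: float, lo: float = 10.0, hi: float = 1000.0) -> float:
        # Clamp the ambient reading and map it linearly to a level in [0, 1]:
        # stronger ambient light -> higher display brightness, and vice versa.
        lux = min(max(lux, lo), hi)
        return (lux - lo) / (hi - lo)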
The proximity sensor 1316, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1300. The proximity sensor 1316 is used to collect the distance between the user and the front of the terminal 1300. In one embodiment, when the proximity sensor 1316 detects that the distance between the user and the front of the terminal 1300 gradually decreases, the processor 1301 controls the touch display screen 1305 to switch from the on-screen state to the off-screen state; when the proximity sensor 1316 detects that the distance gradually increases, the processor 1301 controls the touch display screen 1305 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 13 does not constitute a limitation on the terminal 1300, and the terminal may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
Fig. 14 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 1400 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 1401 and one or more memories 1402, where at least one instruction is stored in the memories 1402 and is loaded and executed by the processors 1401 to implement the methods provided in the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for implementing the functions of the processing device, which are not described in detail here.
The server 1400 may be used to perform the steps performed by the processing device in the above method for extracting gestures from video, or to perform the steps performed by the training device in the above three-dimensional coordinate extraction model training method.
An embodiment of the present invention also provides a processing device. The processing device includes a processor and a memory, and at least one instruction, at least one program, a code set, or an instruction set is stored in the memory. The instruction, program, code set, or instruction set is loaded and executed by the processor to implement the operations performed in the method for extracting gestures from video of the above embodiments, or to implement the operations performed in the three-dimensional coordinate extraction model training method of the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the operations performed in the method for extracting gestures from video of the above embodiments, or to implement the operations performed in the three-dimensional coordinate extraction model training method of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the embodiments of the present invention. Any modifications, equivalent replacements, or improvements made within the spirit and principles of the embodiments of the present invention shall fall within the protection scope of the embodiments of the present invention.

Claims (20)

1. A method of extracting gestures from video, the method comprising:
acquiring a target video frame sequence, wherein the target video frame sequence comprises at least two video frames;
invoking a three-dimensional coordinate extraction model, wherein the three-dimensional coordinate extraction model is used for obtaining a three-dimensional coordinate set of any video frame according to any video frame in any video frame sequence and adjacent video frames of the any video frame, the three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of characteristic points in the any video frame, the three-dimensional coordinate extraction model comprises a plurality of network layers, each network layer comprises a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and an adjacent network node of the target network node;
respectively inputting at least two video frames in the target video frame sequence to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and acquiring a three-dimensional coordinate set of the at least two video frames;
for any target video frame in the target video frame sequence, determining a preset rule met by the three-dimensional coordinate set of the target video frame according to a preset database, and determining the gesture corresponding to the preset rule as the gesture of the target video frame, wherein the preset database comprises a correspondence between at least one preset rule and a gesture.
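By way of illustration only, the flow of claim 1 can be sketched in a few lines of Python. The PyTorch-style model call, the keypoint indices, and the fist rule below are assumptions made for the sketch, not features of the claims:

    import torch

    def extract_gestures(frames, model, rule_db):
        # frames: tensor [T, C, H, W] holding the T >= 2 video frames of the
        # target video frame sequence; model maps the whole sequence to one
        # 3D coordinate set per frame; rule_db is the preset database given
        # as (predicate, gesture) pairs.
        with torch.no_grad():
            coords = model(frames.unsqueeze(0))[0]        # [T, K, 3]
        gestures = []
        for frame_coords in coords:                       # one 3D set per frame
            match = next((g for rule, g in rule_db if rule(frame_coords)), None)
            gestures.append(match)
        return gestures

    # Hypothetical preset rule: a fist if all fingertips lie near the palm point.
    def is_fist(c, tips=(4, 8, 12, 16, 20), palm=0, thresh=0.05):
        return bool((c[list(tips)] - c[palm]).norm(dim=-1).max() < thresh)

    rule_db = [(is_fist, "fist")]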
2. The method according to claim 1, wherein said inputting at least two video frames of the target video frame sequence to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, respectively, obtaining a three-dimensional coordinate set of the at least two video frames, comprises:
inputting the at least two video frames to the at least two network nodes respectively, wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input video frames in the target video frame sequence; and processing the at least two video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes, to obtain a three-dimensional coordinate set of the at least two video frames.
3. The method of claim 2, wherein the plurality of network layers comprises a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer connected in sequence;
the step of respectively inputting at least two video frames in the target video frame sequence to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model to obtain a three-dimensional coordinate set of the at least two video frames includes:
inputting the target video frame to the feature extraction layer for any target video frame in the target video frame sequence;
processing the target video frame based on the feature extraction layer to obtain image features of the target video frame;
processing the image features and the image features of the adjacent video frames of the target video frame based on the two-dimensional coordinate extraction layer to obtain a two-dimensional coordinate set of the target video frame, wherein the two-dimensional coordinate set comprises two-dimensional coordinates of a plurality of feature points in the target video frame;
and processing the two-dimensional coordinate set and the two-dimensional coordinate set of the adjacent video frame of the target video frame based on the three-dimensional coordinate extraction layer to obtain the three-dimensional coordinate set of the target video frame.
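A rough PyTorch sketch of the three connected layers of claim 3; the layer widths and the kernel-3 temporal convolutions (which give each frame access to its adjacent frames' features and two-dimensional sets) are assumptions, since the claim fixes only the layer order and the use of adjacent frames:

    import torch
    import torch.nn as nn

    class CoordExtractor(nn.Module):
        def __init__(self, keypoints=21, feat_dim=128):
            super().__init__()
            self.k = keypoints
            # Feature extraction layer: per-frame image features.
            self.features = nn.Sequential(
                nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # Two-dimensional coordinate extraction layer: mixes each frame's
            # features with those of its adjacent frames (kernel size 3).
            self.to2d = nn.Conv1d(feat_dim, keypoints * 2, 3, padding=1)
            # Three-dimensional coordinate extraction layer: mixes each frame's
            # 2D set with the 2D sets of its adjacent frames.
            self.to3d = nn.Conv1d(keypoints * 2, keypoints * 3, 3, padding=1)

        def forward(self, frames):                   # frames: [B, T, 3, H, W]
            b, t = frames.shape[:2]
            f = self.features(frames.flatten(0, 1))  # [B*T, feat_dim]
            f = f.view(b, t, -1).transpose(1, 2)     # [B, feat_dim, T]
            xy = self.to2d(f)                        # [B, 2K, T]
            xyz = self.to3d(xy)                      # [B, 3K, T]
            return xyz.transpose(1, 2).view(b, t, self.k, 3)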
4. The method of claim 2, wherein before the invoking of the three-dimensional coordinate extraction model, the method further comprises:
acquiring a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence;
respectively inputting the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input sample video frames in the sample video frame sequence;
processing the at least two sample video frames based on each network node and the connection relation between any two network nodes in the three-dimensional coordinate extraction model to obtain a test three-dimensional coordinate set of the at least two sample video frames;
and training each network node in the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set of each sample video frame and the sample three-dimensional coordinate set.
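The training of claim 4 amounts to comparing the test three-dimensional coordinate sets with the labelled sample sets and propagating the error to every network node. A minimal sketch follows, assuming mean-squared error as the error measure and an optimizer such as torch.optim.Adam(model.parameters()); the claim does not fix either choice:

    import torch.nn.functional as F

    def train_step(model, optimizer, sample_frames, sample_coords):
        # sample_frames: [B, T, C, H, W]; sample_coords: [B, T, K, 3].
        test_coords = model(sample_frames)            # test 3D coordinate sets
        loss = F.mse_loss(test_coords, sample_coords) # error between the sets
        optimizer.zero_grad()
        loss.backward()                               # error reaches every node
        optimizer.step()
        return loss.item()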
5. The method of claim 4, wherein the acquiring a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence comprises:
acquiring a three-dimensional coordinate set of a plurality of video frames in an original sample video;
extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining a three-dimensional coordinate set of the at least two sample video frames as a sample three-dimensional coordinate set.
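The sampling step of claim 5 can be sketched directly; the uniform stride below is an assumption, as the claim only requires extracting at least two frames together with their three-dimensional coordinate sets:

    def build_sample_sequence(video_frames, coords, step=3):
        # video_frames and coords are parallel lists for the original sample
        # video; every step-th frame becomes a sample video frame, and its
        # labelled 3D coordinate set becomes the sample 3D coordinate set.
        idx = list(range(0, len(video_frames), step))
        if len(idx) < 2:                    # the sequence needs >= 2 frames
            idx = [0, len(video_frames) - 1]
        return [video_frames[i] for i in idx], [coords[i] for i in idx]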
6. A three-dimensional coordinate extraction model training method, the method comprising:
acquiring a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence, wherein the sample three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of feature points in the sample video frames;
respectively inputting at least two sample video frames in the sample video frame sequence to at least two network nodes in a first network layer in a three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and an adjacent video frame of each sample video frame, wherein the three-dimensional coordinate extraction model comprises a plurality of network layers, each network layer comprises a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and an adjacent network node of the target network node;
and training the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
7. The method of claim 6, wherein the respectively inputting at least two sample video frames in the sample video frame sequence to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and adjacent video frames of each sample video frame, comprises:
inputting the at least two sample video frames to the at least two network nodes respectively, wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input sample video frames in the sample video frame sequence; and processing the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes, to obtain a test three-dimensional coordinate set of the at least two sample video frames.
8. The method of claim 7, wherein the plurality of network layers comprises a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer connected in sequence;
the step of respectively inputting at least two sample video frames in the sample video frame sequence to at least two network nodes in a first network layer in the three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and adjacent video frames of each sample video frame, includes:
inputting the sample video frame to the feature extraction layer for any sample video frame in the sample video frame sequence;
processing the sample video frame based on the feature extraction layer to obtain the test image features of the sample video frame;
processing the test image features and the test image features of adjacent video frames of the sample video frame based on the two-dimensional coordinate extraction layer to obtain a test two-dimensional coordinate set of the sample video frame, wherein the test two-dimensional coordinate set comprises two-dimensional coordinates of a plurality of feature points in the sample video frame;
and processing the test two-dimensional coordinate set and the test two-dimensional coordinate set of the adjacent video frame of the sample video frame based on the three-dimensional coordinate extraction layer to obtain the test three-dimensional coordinate set of the sample video frame.
9. The method of claim 6, wherein the acquiring a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence comprises:
acquiring a three-dimensional coordinate set of a plurality of video frames in an original sample video;
extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining a three-dimensional coordinate set of the at least two sample video frames as a sample three-dimensional coordinate set.
10. An apparatus for extracting gestures from video, the apparatus comprising:
the target sequence acquisition module is used for acquiring a target video frame sequence, wherein the target video frame sequence comprises at least two video frames;
the model calling module is used for calling a three-dimensional coordinate extraction model, wherein the three-dimensional coordinate extraction model is used for obtaining a three-dimensional coordinate set of any video frame according to any video frame in any video frame sequence and adjacent video frames of any video frame, the three-dimensional coordinate set comprises three-dimensional coordinates of a plurality of feature points in any video frame, the three-dimensional coordinate extraction model comprises a plurality of network layers, each network layer comprises a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and an adjacent network node of the target network node;
the three-dimensional coordinate acquisition module is used for respectively inputting at least two video frames in the target video frame sequence to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model to acquire three-dimensional coordinate sets of the at least two video frames;
the gesture determining module is configured to determine, for any video frame in the target video frame sequence, a preset rule that is satisfied by the three-dimensional coordinate set of the target video frame according to a preset database, and determine a gesture corresponding to the preset rule as a gesture of the target video frame, where the preset database includes a correspondence between at least one preset rule and the gesture.
11. The apparatus of claim 10, wherein the three-dimensional coordinate acquisition module comprises:
the input unit is used for respectively inputting the at least two video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and the arrangement sequence of the at least two network nodes is the same as the arrangement sequence of the input video frames in the target video frame sequence;
and the processing unit is used for processing the at least two video frames based on each network node and the connection relation between any two network nodes in the three-dimensional coordinate extraction model to obtain a three-dimensional coordinate set of the at least two video frames.
12. The apparatus of claim 11, wherein the plurality of network layers comprises a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer connected in sequence;
the input unit is further configured to input, for any target video frame in the target video frame sequence, the target video frame to the feature extraction layer;
the processing unit is used for processing the target video frame based on the feature extraction layer to obtain the image feature of the target video frame;
the processing unit is further configured to process the image feature and the image feature of an adjacent video frame of the target video frame based on the two-dimensional coordinate extraction layer, so as to obtain a two-dimensional coordinate set of the target video frame, where the two-dimensional coordinate set includes two-dimensional coordinates of a plurality of feature points in the target video frame;
the processing unit is further configured to process the two-dimensional coordinate set and the two-dimensional coordinate set of the adjacent video frame of the target video frame based on the three-dimensional coordinate extraction layer, so as to obtain a three-dimensional coordinate set of the target video frame.
13. The apparatus of claim 12, wherein the apparatus further comprises:
the sample sequence acquisition module is used for acquiring a sample video frame sequence and sample three-dimensional coordinate sets of at least two sample video frames in the sample video frame sequence;
the input module is used for respectively inputting the at least two sample video frames to at least two network nodes in a first network layer of the three-dimensional coordinate extraction model, and the arrangement sequence of the at least two network nodes is the same as the arrangement sequence of the input sample video frames in the sample video frame sequence;
the processing module is used for processing the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes to obtain a test three-dimensional coordinate set of the at least two sample video frames;
and the training module is used for training each network node in the three-dimensional coordinate extraction model according to the error between the test three-dimensional coordinate set of each sample video frame and the sample three-dimensional coordinate set.
14. The apparatus of claim 13, wherein the sample sequence acquisition module comprises:
the acquisition unit is used for acquiring an original sample video and a three-dimensional coordinate set of a plurality of video frames in the original sample video;
the extraction unit is used for extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining three-dimensional coordinate sets of the at least two sample video frames as sample three-dimensional coordinate sets.
15. A three-dimensional coordinate extraction model training apparatus, the apparatus comprising:
a sample sequence obtaining module, configured to obtain a sample video frame sequence and a sample three-dimensional coordinate set of at least two sample video frames in the sample video frame sequence, where the sample three-dimensional coordinate set includes three-dimensional coordinates of a plurality of feature points in the sample video frames;
the processing module is used for respectively inputting at least two sample video frames in the sample video frame sequence to at least two network nodes in a first network layer in the three-dimensional coordinate extraction model, and acquiring a test three-dimensional coordinate set of each sample video frame according to each sample video frame in the sample video frame sequence and adjacent video frames of each sample video frame, wherein the three-dimensional coordinate extraction model comprises a plurality of network layers, each network layer comprises a plurality of network nodes, and each network node is connected with a target network node in a next network layer of the network layer and adjacent network nodes of the target network node;
and the training module is used for training the three-dimensional coordinate extraction model according to the errors between the test three-dimensional coordinate set and the sample three-dimensional coordinate set of each sample video frame.
16. The apparatus of claim 15, wherein the processing module comprises:
an input unit, used for respectively inputting the at least two sample video frames to the at least two network nodes, wherein the arrangement order of the at least two network nodes is the same as the arrangement order of the input sample video frames in the sample video frame sequence; and a processing unit, used for processing the at least two sample video frames based on each network node in the three-dimensional coordinate extraction model and the connection relation between any two network nodes, to obtain a test three-dimensional coordinate set of the at least two sample video frames.
17. The apparatus of claim 16, wherein the plurality of network layers comprises a feature extraction layer, a two-dimensional coordinate extraction layer, and a three-dimensional coordinate extraction layer connected in sequence;
the input unit is further configured to input, for any sample video frame in the sample video frame sequence, the sample video frame to the feature extraction layer;
the processing unit is further used for processing the sample video frame based on the feature extraction layer to obtain the test image feature of the sample video frame;
the processing unit is further configured to process the test image feature and the test image feature of the adjacent video frame of the sample video frame based on the two-dimensional coordinate extraction layer, to obtain a test two-dimensional coordinate set of the sample video frame, where the test two-dimensional coordinate set includes two-dimensional coordinates of a plurality of feature points in the sample video frame;
the processing unit is further configured to process the test two-dimensional coordinate set and the test two-dimensional coordinate set of the adjacent video frame of the sample video frame based on the three-dimensional coordinate extraction layer, so as to obtain a test three-dimensional coordinate set of the sample video frame.
18. The apparatus of claim 15, wherein the sample sequence acquisition module comprises:
the acquisition unit is used for acquiring an original sample video and a three-dimensional coordinate set of a plurality of video frames in the original sample video;
the extraction unit is used for extracting at least two video frames from the original sample video, determining the at least two video frames as sample video frames, forming the sample video frame sequence by the extracted at least two sample video frames, and determining three-dimensional coordinate sets of the at least two sample video frames as sample three-dimensional coordinate sets.
19. A processing device, comprising a processor and a memory, wherein at least one instruction, at least one program, a code set, or an instruction set is stored in the memory, and the instruction, the program, the code set, or the instruction set is loaded and executed by the processor to implement the operations performed in the method of extracting gestures from video according to any one of claims 1 to 5, or to implement the operations performed in the three-dimensional coordinate extraction model training method according to any one of claims 6 to 9.
20. A computer-readable storage medium, wherein at least one instruction, at least one program, a code set, or an instruction set is stored in the storage medium, and the instruction, the program, the code set, or the instruction set is loaded and executed by a processor to implement the operations performed in the method of extracting gestures from video according to any one of claims 1 to 5, or to implement the operations performed in the three-dimensional coordinate extraction model training method according to any one of claims 6 to 9.
CN201910394221.4A 2019-05-13 2019-05-13 Method, device, equipment and storage medium for extracting gestures from video Active CN110135329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910394221.4A CN110135329B (en) 2019-05-13 2019-05-13 Method, device, equipment and storage medium for extracting gestures from video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910394221.4A CN110135329B (en) 2019-05-13 2019-05-13 Method, device, equipment and storage medium for extracting gestures from video

Publications (2)

Publication Number Publication Date
CN110135329A CN110135329A (en) 2019-08-16
CN110135329B (en) 2023-08-04

Family

ID=67573598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910394221.4A Active CN110135329B (en) 2019-05-13 2019-05-13 Method, device, equipment and storage medium for extracting gestures from video

Country Status (1)

Country Link
CN (1) CN110135329B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131965B (en) * 2020-08-31 2023-10-13 深圳云天励飞技术股份有限公司 Human body posture estimation method and device, electronic equipment and storage medium
CN112215829B (en) * 2020-10-21 2021-12-14 深圳度影医疗科技有限公司 Positioning method of hip joint standard tangent plane and computer equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745218A (en) * 2014-01-26 2014-04-23 清华大学 Gesture identification method and device in depth image
CN106503682A (en) * 2016-10-31 2017-03-15 北京小米移动软件有限公司 Crucial independent positioning method and device in video data
CN107103733A (en) * 2017-07-06 2017-08-29 司马大大(北京)智能系统有限公司 One kind falls down alarm method, device and equipment
CN107832708A (en) * 2017-11-09 2018-03-23 云丁网络技术(北京)有限公司 A kind of human motion recognition method and device
CN108038825A (en) * 2017-12-12 2018-05-15 维沃移动通信有限公司 A kind of image processing method and mobile terminal
JP2018081008A (en) * 2016-11-16 2018-05-24 株式会社岩根研究所 Self position posture locating device using reference video map
CN108170282A (en) * 2018-01-19 2018-06-15 百度在线网络技术(北京)有限公司 For controlling the method and apparatus of three-dimensional scenic
CN109640068A (en) * 2018-10-31 2019-04-16 百度在线网络技术(北京)有限公司 Information forecasting method, device, equipment and the storage medium of video frame

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10372970B2 (en) * 2016-09-15 2019-08-06 Qualcomm Incorporated Automatic scene calibration method for video analytics

Also Published As

Publication number Publication date
CN110135329A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
US11244170B2 (en) Scene segmentation method and device, and storage medium
CN110059661B (en) Action recognition method, man-machine interaction method, device and storage medium
CN109299315B (en) Multimedia resource classification method and device, computer equipment and storage medium
CN110147805B (en) Image processing method, device, terminal and storage medium
CN109086709B (en) Feature extraction model training method and device and storage medium
CN110807361B (en) Human body identification method, device, computer equipment and storage medium
CN108594997B (en) Gesture skeleton construction method, device, equipment and storage medium
CN110222551B (en) Method and device for identifying action type, electronic equipment and storage medium
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN110222789B (en) Image recognition method and storage medium
CN110544272B (en) Face tracking method, device, computer equipment and storage medium
CN110059652B (en) Face image processing method, device and storage medium
CN111242090B (en) Human face recognition method, device, equipment and medium based on artificial intelligence
CN112907725B (en) Image generation, training of image processing model and image processing method and device
CN110570460B (en) Target tracking method, device, computer equipment and computer readable storage medium
CN111243668B (en) Method and device for detecting molecule binding site, electronic device and storage medium
CN110675412B (en) Image segmentation method, training method, device and equipment of image segmentation model
CN110147533B (en) Encoding method, apparatus, device and storage medium
CN108776822B (en) Target area detection method, device, terminal and storage medium
CN111127509B (en) Target tracking method, apparatus and computer readable storage medium
CN110796248A (en) Data enhancement method, device, equipment and storage medium
CN110147532B (en) Encoding method, apparatus, device and storage medium
CN112581358B (en) Training method of image processing model, image processing method and device
CN110991457B (en) Two-dimensional code processing method and device, electronic equipment and storage medium
CN111753498A (en) Text processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant