CN111753795A - Action recognition method and device, electronic equipment and storage medium - Google Patents

Action recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111753795A
Authority
CN
China
Prior art keywords
target
images
image group
action
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010623952.4A
Other languages
Chinese (zh)
Inventor
刘思阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing IQIYI Science and Technology Co Ltd
Original Assignee
Beijing IQIYI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing IQIYI Science and Technology Co Ltd filed Critical Beijing IQIYI Science and Technology Co Ltd
Priority to CN202010623952.4A
Publication of CN111753795A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention provide an action recognition method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring multiple frames of continuous infrared images, each infrared image being an image captured by a single infrared camera that contains designated parts of a target object, the target object being fitted with a plurality of light-capturing balls, each of which corresponds to one designated part of the target object; determining, from the multiple frames of continuous infrared images, a target image group containing multiple frames of target images; and inputting the multiple frames of target images in the target image group into a pre-trained motion recognition model to obtain the action type of the target object corresponding to the target image group. The method simplifies the processing required for action recognition of a target object and lowers the requirements on the usage scenario, while still achieving high action recognition accuracy in such less demanding scenarios.

Description

Action recognition method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer vision technologies, and in particular, to a method and an apparatus for motion recognition, an electronic device, and a storage medium.
Background
At present, there are many techniques for recognizing human body actions. For example, the action of a target object in a video frame can be recognized purely through image processing, or through optical motion capture.
At present, the method of recognizing the motion of a target object by using a light capture technology is mainly applied to movie and television production. The identification process needs to be completed in a professional studio, as shown in fig. 1:
a plurality of infrared cameras 103 are installed at various positions in the studio, and the actor wears a special light-capturing suit 101 on which a plurality of highly reflective light-capturing balls 102 are arranged. During shooting, the infrared cameras 103 emit infrared light and receive the infrared light reflected by the light-capturing balls 102, capturing infrared video images from different directions. After the infrared video images from different directions are obtained, the spatial positions of the light-capturing balls 102 are calculated through image processing techniques such as image fusion, and the actor's movements are then identified. However, this method is not only costly; it also requires processing multiple video streams, the algorithm is complex, and it places high demands on the usage scenario.
Recognizing the action of a target object by image processing alone mainly operates on visible-light images, for example RGB (red, green, blue) images. However, visible-light images are strongly affected by the environment, so their quality is unstable, which easily degrades action recognition accuracy. For example, when action recognition is performed on a target object contained in a strongly exposed visible-light image, the low image quality lowers the accuracy of the recognition result.
Disclosure of Invention
An object of embodiments of the present invention is to provide a motion recognition method, a motion recognition apparatus, an electronic device, and a storage medium, so as to improve the accuracy of motion recognition while simplifying the process of motion recognition.
In order to achieve the above object, an embodiment of the present invention provides a motion recognition method, including:
acquiring a plurality of continuous infrared images; the infrared image is an image which is shot by an infrared camera and contains a designated part of a target object, and the target object is provided with a plurality of light capturing balls, wherein each light capturing ball corresponds to one designated part of the target object;
determining a target image group containing multiple frames of target images from multiple frames of continuous infrared images;
inputting multi-frame target images in the target image group into a pre-trained motion recognition model to obtain the motion type of the target object corresponding to the target image group; wherein the motion recognition model is obtained by training based on a training sample set, and the training sample set comprises: a plurality of sample image groups and, for each sample image group, the motion type of the sample object corresponding to that sample image group; each sample image group comprises multiple frames of sample images, and the sample images in a sample image group are images containing designated parts of a sample object.
Further, the determining a target image group including a plurality of frames of target images from a plurality of frames of continuous infrared images includes:
selecting a frame of infrared image from a plurality of continuous infrared images at preset frame number intervals as a target image to obtain a target image group consisting of a plurality of target images.
Further, the pre-trained motion recognition model includes: the system comprises a feature extraction network layer, a difference feature calculation layer, a feature splicing layer, an action classification network layer and an output layer;
the step of inputting the multi-frame target images in the target image group into a pre-trained motion recognition model to obtain the motion types of the target objects corresponding to the target image group includes:
inputting multi-frame target images in the target image group into a feature extraction network layer of a pre-trained action recognition model;
the feature extraction network layer is used for respectively extracting the light capture features of the multi-frame target image to obtain a plurality of light capture feature information;
the difference characteristic calculation layer calculates the difference value of the light capture characteristics of two adjacent target images in the target image group according to the light capture characteristic information to obtain a plurality of difference characteristic information;
the characteristic splicing layer splices the light capture characteristic information and the difference characteristic information to obtain splicing characteristic information;
the action classification network layer determines the probability that the action corresponding to the splicing characteristic information belongs to each preset action type;
and the output layer outputs the action type with the maximum probability as the action type of the target object corresponding to the target image group.
Further, the feature extraction network layer is:
a visual geometry group network VGG, or a residual neural network ResNet, or a lightweight deep neural network MobileNet.
Further, the action classification network layer in the action recognition model comprises: a preset number of fully-connected layers; wherein the input feature dimension of the first fully-connected layer of the action classification network layer is s × (2N-1), and the output feature dimension of the last fully-connected layer of the action classification network layer is 1 × n; N represents the number of target images, n represents the number of action types, and s represents the dimension of the light-capture feature information;
the output layer in the motion recognition model comprises: softmax layer.
Further, the motion recognition model is obtained by training based on a training sample set by adopting the following steps:
collecting the training samples, inputting multi-frame sample images of the sample image group into a neural network model to be trained, and obtaining the action types of sample objects corresponding to the sample image group as output results;
adjusting parameters of the current neural network model to be trained based on the output result to obtain a new neural network model to be trained, completing one iteration, returning to the step of collecting the training samples, and inputting multi-frame sample images of the sample image group into the neural network model to be trained;
and when the iteration times reach the preset iteration times or the loss function value of the current neural network model to be trained is smaller than the preset loss function threshold value, ending the training, and determining the current neural network model to be trained as the motion recognition model.
Further, the feature extraction network layer of the neural network model to be trained is a predetermined image feature extraction network layer;
the adjusting the parameters of the current neural network model to be trained based on the output result comprises:
and adjusting the parameters of the action classification network layer of the current neural network model to be trained based on the output result.
Further, the action type of the target object includes: kicking, lifting hands, running, walking, pushing, pulling, jumping, and nonsense movements.
In order to achieve the above object, an embodiment of the present invention further provides a motion recognition apparatus, including:
the infrared image acquisition module is used for acquiring a plurality of continuous infrared images; the infrared image is an image which is shot by an infrared camera and contains a designated part of a target object, and the target object is provided with a plurality of light capturing balls, wherein each light capturing ball corresponds to one designated part of the target object;
the image group determining module is used for determining a target image group containing a plurality of frames of target images from a plurality of frames of continuous infrared images;
the action recognition module is used for inputting the multi-frame target images in the target image group into a pre-trained action recognition model to obtain the action type of the target object corresponding to the target image group; wherein the action recognition model is obtained by training based on a training sample set, and the training sample set comprises: a plurality of sample image groups and, for each sample image group, the action type of the sample object corresponding to that sample image group; each sample image group comprises multiple frames of sample images, and the sample images in a sample image group are images containing designated parts of a sample object.
Further, the image group determining module is specifically configured to select one infrared image frame at every preset frame number from multiple continuous infrared images as a target image, and obtain a target image group composed of multiple target images.
Further, the pre-trained motion recognition model includes: the system comprises a feature extraction network layer, a difference feature calculation layer, a feature splicing layer, an action classification network layer and an output layer;
the action recognition module is specifically used for inputting the multi-frame target images in the target image group into a feature extraction network layer of a pre-trained action recognition model; the feature extraction network layer is used for respectively extracting the light capture features of the multi-frame target image to obtain a plurality of light capture feature information; the difference characteristic calculation layer calculates the difference value of the light capture characteristics of two adjacent target images in the target image group according to the light capture characteristic information to obtain a plurality of difference characteristic information; the characteristic splicing layer splices the light capture characteristic information and the difference characteristic information to obtain splicing characteristic information; the action classification network layer determines the probability that the action corresponding to the splicing characteristic information belongs to each preset action type; and the output layer outputs the action type with the maximum probability as the action type of the target object corresponding to the target image group.
Further, the feature extraction network layer is:
a visual geometry group network VGG, or a residual neural network ResNet, or a lightweight deep neural network MobileNet.
Further, the action classification network layer in the action recognition model comprises: a preset number of fully-connected layers; wherein the input feature dimension of the first fully-connected layer of the action classification network layer is s × (2N-1), and the output feature dimension of the last fully-connected layer of the action classification network layer is 1 × n; N represents the number of target images, n represents the number of action types, and s represents the dimension of the light-capture feature information;
the output layer in the motion recognition model comprises: softmax layer.
Further, the apparatus further includes: a model training module;
the model training module is used for training based on a training sample set to obtain the action recognition model by adopting the following steps:
collecting the training samples, inputting multi-frame sample images of the sample image group into a neural network model to be trained, and obtaining the action types of sample objects corresponding to the sample image group as output results;
adjusting parameters of the current neural network model to be trained based on the output result to obtain a new neural network model to be trained, completing one iteration, returning to the step of collecting the training samples, and inputting multi-frame sample images of the sample image group into the neural network model to be trained;
and when the iteration times reach the preset iteration times or the loss function value of the current neural network model to be trained is smaller than the preset loss function threshold value, ending the training, and determining the current neural network model to be trained as the motion recognition model.
Further, the feature extraction network layer of the neural network model to be trained is a predetermined image feature extraction network layer;
and the model training module adjusts the parameters of the action classification network layer of the current neural network model to be trained based on the output result.
Further, the action type of the target object includes: kicking, lifting hands, running, walking, pushing, pulling, jumping, and nonsense movements.
In order to achieve the above object, an embodiment of the present invention provides an electronic device, which includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing any one of the steps of the action recognition method when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the above steps of the motion recognition method.
In order to achieve the above object, an embodiment of the present invention further provides a computer program product containing instructions, which when run on a computer, causes the computer to perform any of the steps of the motion recognition method described above.
The embodiment of the invention has the following beneficial effects:
by adopting the method provided by the embodiment of the invention, a plurality of light-capturing balls are deployed on the target object, only one infrared camera is needed to capture multiple frames of continuous infrared images of the target object, a target image group containing multiple frames of target images is obtained, and the action of the target object in these target images is then recognized by a pre-trained action recognition model, thereby determining the action type of the target object. Compared with existing action recognition methods, the method provided by the embodiment of the invention therefore simplifies the processing required for action recognition and lowers the requirements on the usage scenario: the target object does not need to wear a special light-capturing suit, and no demanding environment is needed; it suffices to attach a plurality of light-capturing balls to the target object, acquire images of the target object with a single infrared camera, and then process the acquired images to recognize the action of the target object. At the same time, because the method uses light-capture technology, it can still achieve high action recognition accuracy in such less demanding scenarios.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a diagram illustrating a professional optical capture data acquisition in the prior art;
FIG. 2 is a flowchart of a method for recognizing actions according to an embodiment of the present invention;
FIG. 3 is another flow chart of a method for recognizing actions according to an embodiment of the present invention;
fig. 4a is a schematic diagram of a target object with an optical capture ball deployed in the motion recognition method according to the embodiment of the present invention;
fig. 4b is a schematic diagram of a target object with light-trapping balls deployed and an infrared image collected for the target object with light-trapping balls deployed according to an embodiment of the present invention;
FIG. 5a is a schematic structural diagram of a motion recognition model according to an embodiment of the present invention;
FIG. 5b is a diagram of a target image processed by the motion recognition model according to the embodiment of the present invention;
fig. 5c is a schematic structural diagram of a motion classification network layer in the motion recognition model according to the embodiment of the present invention;
FIG. 6 is a flowchart of training a motion recognition model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an action recognition device according to an embodiment of the present invention;
fig. 8 is another schematic structural diagram of the motion recognition device according to the embodiment of the present invention;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Because the existing action recognition method has complex algorithm and is difficult to be applied to other fields, in order to simplify the action recognition processing and expand the application scenario of action recognition, the embodiment of the invention provides an action recognition method, as shown in fig. 2, comprising:
step 201, acquiring multiple continuous infrared images; the infrared image is an image including a designated portion of a target object photographed by one infrared camera, and the target object is disposed with a plurality of light-capturing balls, wherein each light-capturing ball corresponds to one designated portion of the target object. Wherein, the target object may be: humans and animals, and the like. The light-trapping spheres deployed for the target object may be light-trapping spheres.
Step 202, determining a target image group containing multi-frame target images from multi-frame continuous infrared images.
Step 203, inputting multi-frame target images in the target image group into a pre-trained motion recognition model to obtain the motion type of the target object corresponding to the target image group; wherein the motion recognition model is obtained by training based on a training sample set, and the training sample set contains: a plurality of sample image groups and, for each sample image group, the motion type of the sample object corresponding to that sample image group; each sample image group comprises multiple frames of sample images, and the sample images in a sample image group are images containing designated parts of a sample object.
By adopting the method provided by the embodiment of the invention, a plurality of light-capturing balls are deployed on the target object, only one infrared camera is needed to capture multiple frames of continuous infrared images of the target object, a target image group containing multiple frames of target images is obtained, and the action of the target object in these target images is then recognized by a pre-trained action recognition model, thereby determining the action type of the target object. Compared with existing action recognition methods, the method provided by the embodiment of the invention therefore simplifies the processing required for action recognition and lowers the requirements on the usage scenario: the target object does not need to wear a special light-capturing suit, and no demanding environment is needed; it suffices to attach a plurality of light-capturing balls to the target object, acquire images of the target object with a single infrared camera, and then process the acquired images to recognize the action of the target object. At the same time, because the method uses light-capture technology, it can still achieve high action recognition accuracy in such less demanding scenarios.
The following describes in detail the motion recognition method and apparatus provided in the embodiments of the present invention with specific embodiments.
In an embodiment of the present application, as shown in fig. 3, another flow of the motion recognition method includes the following steps:
step 301, acquiring multiple continuous infrared images.
In this step, a plurality of light-capturing balls may be disposed at a designated portion of the target object, each light-capturing ball corresponding to a designated portion of the target object. The light-catching ball may be a light-reflecting light-catching ball, and the target object may be a person, an animal, or the like. If the target object is a person, the designated portion of the target object may include: wrist, elbow, ankle, knee, foot, shoulder, etc. A plurality of light-catching balls may be attached to a plurality of designated portions of the target object. Referring specifically to fig. 4a, designated parts of the target object 401, such as the wrist, elbow, ankle, knee, foot, and shoulder, are all deployed with a light-trapping ball 402. Specifically, referring to fig. 4a, the light-catching ball 402 may be sequentially attached to each designated portion of the target object 401 in the order of the arrow with respect to the target object 401. For example, one light-trapping ball 402 may be pasted on the pelvis portion of the target object 401, and one light-trapping ball 402 may be pasted on each of the "spine 1", "spine 2", "spine 3", "neck", "head", "left clavicle", "left shoulder", "left elbow", "left wrist", "left hand", "right clavicle", "right shoulder", "right elbow", "right wrist", "right hand", and the like in the upward direction of the arrow in fig. 4a, with the pelvis portion as the starting portion; with the pelvis region as the starting region, a light-catching ball 402 is stuck to each of the regions "left hip", "left knee", "left ankle", "left foot", "right hip", "right knee", "right ankle", and "right foot" in the downward direction in the arrow direction in fig. 4 a. Finally, 24 light-trapping balls 402 may be stuck at the above-mentioned 24 body parts of the target object 401.
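The 24 designated parts listed above can be represented as a simple lookup table. The sketch below (Python) is illustrative only; the part names and their pasting order are taken from the description of fig. 4a, and the constant name is hypothetical.

```python
# Illustrative only: the 24 designated body parts listed above, in the
# pasting order described for fig. 4a (pelvis first, then up the spine
# and arms, then down the legs). One light-capturing ball per part.
MARKER_PARTS = [
    "pelvis",
    "spine1", "spine2", "spine3", "neck", "head",
    "left_clavicle", "left_shoulder", "left_elbow", "left_wrist", "left_hand",
    "right_clavicle", "right_shoulder", "right_elbow", "right_wrist", "right_hand",
    "left_hip", "left_knee", "left_ankle", "left_foot",
    "right_hip", "right_knee", "right_ankle", "right_foot",
]
assert len(MARKER_PARTS) == 24
```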
An infrared camera, such as a Kinect DK camera, may be used to photograph a target object on which a plurality of light-capturing balls are deployed, obtaining multiple consecutive frames of images containing the designated parts of the target object. As shown in fig. 4b, the target object 403 has light-capturing balls 402 disposed on the left wrist, the right wrist, the left elbow, and the right elbow, and an infrared image 410 can be acquired for the target object 403 with the light-capturing balls deployed by using an infrared camera.
Step 302, selecting a frame of infrared image from the multiple continuous frames of infrared images at intervals of a preset number of frames as a target image, and obtaining a target image group consisting of multiple frames of target images.
The preset frame number can be specifically set according to practical application, for example, the preset frame number can be set to 2, 3 or 4, and the like.
For example, if 10 consecutive infrared images are acquired and the preset number of frames is 2, then starting from the 1st infrared image, one infrared image is selected as a target image every 2 frames; the 1st, 3rd, 5th, 7th and 9th frames of the 10 consecutive infrared images are thus determined as target images, and the 5 determined frames of target images can be taken as a target image group.
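As an illustration of this sampling rule, a minimal Python sketch reproducing the example above (the function name and frame labels are hypothetical):

```python
# Select one frame every `preset_interval` frames, starting from the first frame.
from typing import List, Sequence

def sample_target_group(frames: Sequence, preset_interval: int = 2) -> List:
    """Return the target image group sampled from consecutive infrared frames."""
    return list(frames[::preset_interval])

frames = [f"ir_frame_{i}" for i in range(1, 11)]   # 10 consecutive infrared frames
print(sample_target_group(frames, 2))
# ['ir_frame_1', 'ir_frame_3', 'ir_frame_5', 'ir_frame_7', 'ir_frame_9']
```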
In the embodiment of the invention, all the multiple continuous infrared images acquired by the infrared camera can be used as target images.
And step 303, inputting a plurality of frames of target images in the target image group into a feature extraction network layer of a pre-trained motion recognition model.
In the embodiment of the present invention, referring to fig. 5a, the pre-trained motion recognition model includes: the system comprises a feature extraction network layer, a difference feature calculation layer, a feature splicing layer, an action classification network layer and an output layer.
Wherein, the feature extraction network layer may be: VGG (visual geometry group network), or ResNet (residual neural network), or MobileNet (lightweight deep neural network).
And step 304, the feature extraction network layer respectively extracts the light capture features of the multiple frames of target images to obtain multiple pieces of light capture feature information.
In the embodiment of the invention, after a plurality of frames of target images in the target image group are input into the feature extraction network layer of the pre-trained motion recognition model, the feature extraction network layer can be used for extracting the features of each frame of target image to be used as the light capture feature information. And, the extracted light capture characteristic information is a multi-dimensional vector.
For example, if the target image group includes 5 target images, which are respectively the (t-2)-th, (t-1)-th, t-th, (t+1)-th and (t+2)-th frames of the multiple continuous infrared images collected by the infrared camera, the w × h × 1 pixel matrix of each target image can be input into the feature extraction network layer, and the light-capture features of the 5 target images are respectively I_{t-2}, I_{t-1}, I_t, I_{t+1} and I_{t+2}. The extracted light-capture features are all vectors of a preset dimension s, with s ≠ 0. Here w is the number of horizontal pixels of the target image's pixel matrix, and h is the number of vertical pixels.
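A minimal sketch of this per-frame feature extraction step, using a small convolutional network as a stand-in for the VGG / ResNet / MobileNet backbone named above (the layer sizes and s = 256 are assumptions, not values fixed by the embodiment):

```python
# Each w x h x 1 infrared target image is mapped to an s-dimensional
# light-capture feature vector I_t.
import torch
import torch.nn as nn

class LightCaptureBackbone(nn.Module):
    def __init__(self, s: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (batch, 64, 1, 1)
        )
        self.fc = nn.Linear(64, s)            # -> s-dimensional light-capture feature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, h, w) single-channel infrared frames
        return self.fc(self.conv(x).flatten(1))

backbone = LightCaptureBackbone(s=256)
frame = torch.randn(1, 1, 224, 224)           # one w x h x 1 target image
feature = backbone(frame)                     # I_t, shape (1, 256)
```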
And 305, the difference characteristic calculation layer calculates the difference value of the light capture characteristics of two adjacent target images in the target image group according to the light capture characteristic information to obtain a plurality of difference characteristic information.
In this step, if the number of the extracted light capture features is N, N-1 difference feature information can be calculated.
For example, referring to fig. 5b, if the target image group includes 5 target images, which are respectively the (t-2)-th, (t-1)-th, t-th, (t+1)-th and (t+2)-th frames of the multiple continuous infrared images collected by the infrared camera, and the light-capture feature information of the 5 frames of target images extracted by the feature extraction network layer is I_{t-2}, I_{t-1}, I_t, I_{t+1} and I_{t+2}, the difference feature calculation layer may calculate the difference feature information from the extracted light-capture feature information as:

M_t = I_t - I_{t-1}
M_{t-1} = I_{t-1} - I_{t-2}
M_{t+1} = I_{t+1} - I_t
M_{t+2} = I_{t+2} - I_{t+1}

According to the 5 pieces of light-capture feature information, 4 pieces of difference feature information can thus be calculated: M_t, M_{t-1}, M_{t+1} and M_{t+2}. Each piece of difference feature information is also an s-dimensional vector.
And step 306, the characteristic splicing layer splices the plurality of light capture characteristic information and the plurality of difference characteristic information to obtain splicing characteristic information.
In this step, the plurality of pieces of light capture characteristic information and the plurality of pieces of difference characteristic information may be spliced into one piece of splicing characteristic information. If there are N pieces of light capture characteristic information and N-1 pieces of difference characteristic information, and the light capture characteristic information and the difference characteristic information are both vectors of s dimension, then s × (2N-1) dimension splicing characteristic information can be obtained.
For example, referring to fig. 5b, if there are 5 pieces of light-capture feature information, I_{t-2}, I_{t-1}, I_t, I_{t+1} and I_{t+2}, and 4 pieces of difference feature information, M_t, M_{t-1}, M_{t+1} and M_{t+2}, the 5 pieces of light-capture feature information and the 4 pieces of difference feature information can be concatenated to obtain splicing feature information of dimension s × 9.
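The difference-feature and splicing steps can be sketched as follows (PyTorch; it is assumed the per-frame features I_{t-2}…I_{t+2} are s-dimensional tensors, with N = 5 frames giving 4 difference features and an s × 9 spliced feature):

```python
import torch

def splice_features(frame_feats: list) -> torch.Tensor:
    # frame_feats: list of N tensors of shape (s,), in temporal order
    diffs = [frame_feats[i] - frame_feats[i - 1] for i in range(1, len(frame_feats))]
    return torch.cat(frame_feats + diffs)      # shape (s * (2N - 1),)

s = 256
feats = [torch.randn(s) for _ in range(5)]     # I_{t-2}, I_{t-1}, I_t, I_{t+1}, I_{t+2}
spliced = splice_features(feats)
print(spliced.shape)                           # torch.Size([2304])  == 256 * 9
```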
And 307, the action classification network layer determines the probability that the action corresponding to the splicing characteristic information belongs to each preset action type.
In the embodiment of the present invention, the action classification network layer in the action recognition model may include a preset number of full connection layers. On the premise of ensuring that the output dimension of the last full connection layer of the action classification network layer is 1 × n, the preset number may be specifically determined according to the actual application situation, for example, the action classification network layer may set 6 full connection layers or 10 full connection layers, and the like. n represents the number of action types of the target object.
For example, referring to fig. 5c, if the action classification network layer in the action recognition model includes 6 fully connected layers:
first fully-connected layer: the input feature dimension is s × (2N-1), the number of neurons is (2N-1)s, and the output feature dimension is 1 × (2N-1)s; N represents the number of target images, which also equals the number of pieces of light-capture feature information;
second fully-connected layer: the input feature dimension is 1 × (2N-1)s, the number of neurons is 2(2N-1)s, and the output feature dimension is 1 × 2(2N-1)s;
third fully-connected layer: the input feature dimension is 1 × 2(2N-1)s, the number of neurons is 2(2N-1)s, and the output feature dimension is 1 × 2n, where n represents the number of action types;
fourth fully-connected layer: the input feature dimension is 1 × 2n, the number of neurons is 2n, and the output feature dimension is 1 × 2n;
fifth fully-connected layer: the input feature dimension is 1 × 2n, the number of neurons is 2n, and the output feature dimension is 1 × n;
sixth fully-connected layer: the input feature dimension is 1 × n, the number of neurons is n, and the output feature dimension is 1 × n.
For example, referring to fig. 5b, the feature output by the sixth fully-connected layer has dimension 1 × n and represents the probability value corresponding to each of the n preset action types of the target object: [p_1, p_2, …, p_{n-1}, p_n], where p_1, p_2, …, p_{n-1} and p_n are respectively the probability values corresponding to the n preset action types, and p_1 + p_2 + … + p_{n-1} + p_n = 1.
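A sketch of the six-layer action classification network described above, with the input/output dimensions taken from the list; the ReLU activations between layers and the helper name are assumptions, since the embodiment only fixes the feature dimensions:

```python
import torch.nn as nn

def make_action_classifier(s: int, N: int, n: int) -> nn.Sequential:
    d = (2 * N - 1) * s                     # spliced feature dimension, e.g. 9s for N = 5
    return nn.Sequential(
        nn.Linear(d, d), nn.ReLU(),         # 1st: s(2N-1) -> (2N-1)s
        nn.Linear(d, 2 * d), nn.ReLU(),     # 2nd: -> 2(2N-1)s
        nn.Linear(2 * d, 2 * n), nn.ReLU(), # 3rd: -> 2n
        nn.Linear(2 * n, 2 * n), nn.ReLU(), # 4th: -> 2n
        nn.Linear(2 * n, n), nn.ReLU(),     # 5th: -> n
        nn.Linear(n, n),                    # 6th: -> n (one score per action type)
    )

classifier = make_action_classifier(s=256, N=5, n=8)
```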
And step 308, outputting the action type with the maximum probability as the action type of the target object corresponding to the target image group by the output layer.
The output layer in the action recognition model comprises a softmax layer whose input feature dimension is 1 × n and whose output feature dimension is 1 × n. The input of the softmax layer is the output of the last fully-connected layer of the action classification network layer in the action recognition model, that is, the probability values corresponding to the n preset action types of the target object; for example, referring to fig. 5b, the output feature of the sixth fully-connected layer is [p_1, p_2, …, p_{n-1}, p_n].
Based on the probability value corresponding to each preset action type in the n preset action types of the input target object, the softmax layer may generate a classification vector that retains the maximum probability value and sets other probability values to 0.
For example, if the feature input to the softmax layer is [p_1, p_2, …, p_{n-1}, p_n], where p_1, p_2, …, p_{n-1} and p_n are respectively the probability values corresponding to the n preset action types of the target object, and p_1 is the largest among p_1, p_2, …, p_{n-1} and p_n, the softmax layer may generate a classification vector [1, 0, …, 0] in which the retained maximum probability value is set to 1 and the other probability values are set to 0.
Based on the classification vector output by the softmax layer, the preset action type corresponding to the classification vector can be used as the action type of the target object corresponding to the target image group.
The preset action type corresponding to the classification vector is the preset action type corresponding to the maximum probability value. For example, the action type corresponding to the classification vector [1, 0, …, 0] is the preset action type corresponding to the maximum probability value p_1.
Referring to fig. 5b, after the probability values corresponding to the n preset action types of the target object, [p_1, p_2, …, p_{n-1}, p_n], are obtained, the output layer may output the preset action type with the highest probability value, that is, the preset action type corresponding to max{p_1, p_2, …, p_n}, as the action type of the identified target object.
In the embodiment of the present invention, the preset action type of the target object includes: kicking, lifting hands, running, walking, pushing, pulling, jumping, and nonsense movements, among others.
For example, if the action types of the target object include 8 types: kicking, lifting hands, running, walking, pushing, pulling, jumping and meaningless movements, then n is 8. If the output feature of the softmax layer is [0.1, 0.1, 0.01, 0.03, 0.04, 0.05, 0.6, 0.07], the probability corresponding to the kicking action is 0.1, the probability corresponding to the hand-lifting action is 0.1, the probability corresponding to the running action is 0.01, the probability corresponding to the walking action is 0.03, the probability corresponding to the pushing action is 0.04, the probability corresponding to the pulling action is 0.05, the probability corresponding to the jumping action is 0.6, and the probability corresponding to the meaningless movement is 0.07. The output layer can then output the action type with the highest probability, the jumping action, as the action type of the target object corresponding to the target image group.
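A minimal sketch of this output step (the tensor values simply restate the example above; the argmax over the softmax output selects the reported action type, and the list/variable names are illustrative):

```python
import torch

ACTIONS = ["kicking", "lifting hands", "running", "walking",
           "pushing", "pulling", "jumping", "meaningless movement"]

# Probability values output for the n = 8 preset action types in the example above
probs = torch.tensor([0.10, 0.10, 0.01, 0.03, 0.04, 0.05, 0.60, 0.07])
predicted = ACTIONS[int(probs.argmax())]
print(predicted)   # "jumping" -- the action type with the maximum probability
```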
By adopting the method provided by the embodiment of the invention, a plurality of optical capture balls are deployed on the target object, only one infrared camera is needed to shoot a plurality of continuous infrared images aiming at the target object, a target image group containing a plurality of target images is obtained, then the optical capture characteristic information is respectively extracted from the plurality of target images in the target image group through a pre-trained action recognition model, the difference characteristic information is obtained through calculation according to the extracted optical capture characteristic information, and then the optical capture characteristic information and the difference characteristic information are spliced to obtain the spliced characteristic information. And then determining the probability that the action corresponding to the splicing characteristic information belongs to each preset action type, and determining the action type with the maximum probability as the action type of the target object. Compared with the existing action recognition method, the method provided by the embodiment of the invention has the advantages that the action is recognized by extracting the light capture characteristic information of continuous multi-frame target images and combining the difference characteristic information, so that on one hand, the action recognition processing of the target object is simplified, the requirement on the use scene of the action recognition is reduced, and on the other hand, the action recognition accuracy of the target object is improved. In addition, a professional studio is not needed, a target object does not need to wear specific light-catching clothes, only a plurality of light-catching balls are needed to be pasted on the target object, the action recognition of the target object can be realized through one infrared camera, and the use scene of the action recognition is expanded.
In the embodiment of the present invention, referring to fig. 6, a process for training a motion recognition model includes:
step 601, collecting training samples, inputting multi-frame sample images of a sample image group into a neural network model to be trained, and obtaining action types of sample objects corresponding to the sample image group as output results.
The training sample set comprises a plurality of sample image groups. The multi-frame sample images of each sample image group are infrared images collected by one infrared camera, and every two adjacent sample images in a sample image group are separated by a preset number of frames.
The neural network model to be trained comprises: the system comprises a feature extraction network layer, a difference feature calculation layer, a feature splicing layer, an action classification network layer and an output layer. Wherein the feature extraction network layer may use a predetermined image feature extraction network layer, such as a VGG network, or a ResNet network, or a MobileNet network.
In the step, after the sample images are input into the neural network model to be trained, the feature extraction network layer can extract the light capture feature information of each sample image; the difference characteristic calculation layer calculates the difference value of the light capture characteristic information of two adjacent frames of sample images in the same sample image group according to the light capture characteristic information of the extracted sample images to obtain difference characteristic information; the characteristic splicing layer splices the extracted light capture characteristic information and the difference characteristic information of the sample image to obtain splicing characteristic information of the sample image group; the action classification network layer is used for determining the probability that the action corresponding to the splicing characteristic information of the sample image group belongs to each preset action type; and the output layer outputs the action type with the highest probability as the action type of the sample object corresponding to the sample image group as an output result.
And step 602, adjusting parameters of the current neural network model to be trained based on the output result to obtain a new neural network model to be trained, completing one iteration, and returning to the step of inputting the multi-frame sample images of the sample image group in the training sample set into the neural network model to be trained.
In the embodiment of the invention, the parameters of the action classification network layer of the current neural network model to be trained are adjusted based on the output result.
Step 603, when the iteration number reaches a preset iteration number, or if the loss function value of the current neural network model to be trained is smaller than a preset loss function threshold value, ending the training, and determining the current neural network model to be trained as the motion recognition model.
The preset iteration times and the preset loss function threshold value can be set according to the actual training condition, wherein the preset iteration times meet the following setting requirements: after the iteration of the preset times, the current neural network model to be trained is converged; the preset loss function threshold is set to satisfy the following conditions: and if the loss function value of the current neural network model to be trained is smaller than the preset loss function threshold value, the current neural network model to be trained is converged.
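A hedged sketch of this training procedure (PyTorch): the predetermined feature extraction layer is kept fixed and only the action classification layers are updated, with training stopped at a preset iteration count or loss threshold. The `backbone` / `classifier` attribute names, the optimizer, learning rate, and loss function are assumptions not fixed by the embodiment:

```python
import itertools
import torch
import torch.nn as nn

def train_action_model(model, data_loader, max_iters=10000, loss_threshold=1e-3):
    # model.backbone: predetermined feature extraction network layer (kept fixed)
    # model.classifier: action classification network layer (parameters adjusted)
    for p in model.backbone.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # itertools.cycle keeps yielding sample image groups until a stop condition is met
    for iters, (sample_groups, action_labels) in enumerate(itertools.cycle(data_loader), start=1):
        logits = model(sample_groups)            # forward pass: sample image group -> action scores
        loss = criterion(logits, action_labels)  # compare with the labelled action types
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # adjust the classification-layer parameters
        if iters >= max_iters or loss.item() < loss_threshold:
            break                                # preset iteration count or loss threshold reached
    return model
```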
Based on the same inventive concept, according to the motion recognition method provided in the above embodiment of the present invention, correspondingly, another embodiment of the present invention further provides a motion recognition apparatus, a schematic structural diagram of which is shown in fig. 7, specifically including:
an infrared image acquisition module 701, configured to acquire multiple frames of continuous infrared images; the infrared image is an image including a designated portion of a target object photographed by one infrared camera, and the target object is disposed with a plurality of light-capturing balls, wherein each light-capturing ball corresponds to one designated portion of the target object.
An image group determining module 702, configured to determine a target image group including multiple frames of target images from multiple frames of consecutive infrared images.
The action recognition module 703 is configured to input multiple frames of target images in the target image group into a pre-trained action recognition model to obtain the action type of the target object corresponding to the target image group; wherein the action recognition model is obtained by training based on a training sample set, and the training sample set contains: a plurality of sample image groups and, for each sample image group, the action type of the sample object corresponding to that sample image group; each sample image group comprises multiple frames of sample images, and the sample images in a sample image group are images containing designated parts of a sample object.
Therefore, by adopting the motion recognition device provided by the embodiment of the invention, a plurality of light-capturing balls are deployed on the target object, only one infrared camera is required to capture multiple frames of continuous infrared images of the target object, a target image group containing multiple frames of target images is obtained, and the action of the target object in these target images is then recognized by a pre-trained motion recognition model, thereby determining the action type of the target object. Compared with existing action recognition methods, this simplifies the processing required for action recognition and lowers the requirements on the usage scenario: the target object does not need to wear a special light-capturing suit, and no demanding environment is needed; it suffices to attach a plurality of light-capturing balls to the target object, acquire images of the target object with a single infrared camera, and then process the acquired images to recognize the action of the target object. At the same time, because the device uses light-capture technology, it can still achieve high action recognition accuracy in such less demanding scenarios.
Further, the image group determining module 702 is specifically configured to select one frame of infrared image from multiple frames of continuous infrared images at preset frame intervals, and use the selected frame of infrared image as a target image to obtain a target image group consisting of multiple frames of target images.
Further, the pre-trained motion recognition model comprises: the system comprises a feature extraction network layer, a difference feature calculation layer, a feature splicing layer, an action classification network layer and an output layer;
the action recognition module 703 is specifically configured to input a plurality of frames of target images in the target image group into a feature extraction network layer of a pre-trained action recognition model; respectively extracting the light capture characteristics of multiple frames of target images to obtain a plurality of light capture characteristic information, wherein the light capture characteristic information is an s-dimensional vector; the difference characteristic calculation layer is used for calculating the difference value of the light capture characteristics of two adjacent target images in the target image group according to the light capture characteristic information to obtain a plurality of difference characteristic information; the characteristic splicing layer splices the multiple pieces of light capture characteristic information and the multiple pieces of difference characteristic information to obtain spliced characteristic information; the action classification network layer determines the probability that the action corresponding to the splicing characteristic information belongs to each preset action type; and the output layer outputs the action type with the maximum probability as the action type of the target object corresponding to the target image group.
Further, the feature extraction network layer is as follows: VGG, or ResNet, or MobileNet.
Further, the action classification network layer in the action recognition model comprises: a preset number of fully-connected layers; wherein the input feature dimension of the first fully-connected layer of the action classification network layer is s × (2N-1), and the output feature dimension of the last fully-connected layer of the action classification network layer is 1 × n; N represents the number of target images;
the output layer in the action recognition model comprises: softmax layer.
Further, the action classification network layer in the action recognition model comprises: first to sixth fully-connected layers;
first fully-connected layer: the input feature dimension is s × (2N-1), the number of neurons is (2N-1)s, and the output feature dimension is 1 × (2N-1)s;
second fully-connected layer: the input feature dimension is 1 × (2N-1)s, the number of neurons is 2(2N-1)s, and the output feature dimension is 1 × 2(2N-1)s;
third fully-connected layer: the input feature dimension is 1 × 2(2N-1)s, the number of neurons is 2(2N-1)s, and the output feature dimension is 1 × 2n, where n represents the number of action types;
fourth fully-connected layer: the input feature dimension is 1 × 2n, the number of neurons is 2n, and the output feature dimension is 1 × 2n;
fifth fully-connected layer: the input feature dimension is 1 × 2n, the number of neurons is 2n, and the output feature dimension is 1 × n;
sixth fully-connected layer: the input feature dimension is 1 × n, the number of neurons is n, and the output feature dimension is 1 × n;
the output layer in the motion recognition model comprises: a softmax layer;
in the softmax layer, the input feature dimension is 1 × n, and the output feature dimension is 1 × n.
Further, referring to fig. 8, the motion recognition apparatus further includes: a model training module 801;
a model training module 801, configured to obtain an action recognition model based on training sample set training by using the following steps:
collecting training samples, inputting multi-frame sample images of a sample image group into a neural network model to be trained, and obtaining action types of sample objects corresponding to the sample image group as output results;
adjusting parameters of the current neural network model to be trained based on the output result to obtain a new neural network model to be trained, completing one iteration, and returning to the step of collecting the training samples and inputting multi-frame sample images of the sample image group into the neural network model to be trained;
and when the iteration times reach the preset iteration times or the loss function value of the current neural network model to be trained is smaller than the preset loss function threshold value, ending the training, and determining the current neural network model to be trained as the motion recognition model.
Further, a feature extraction network layer of the neural network model to be trained is a predetermined image feature extraction network layer;
the model training module 801 adjusts parameters of the action classification network layer of the current neural network model to be trained based on the output result.
Further, the action type of the target object includes: kicking, lifting hands, running, walking, pushing, pulling, jumping, and nonsense movements.
Therefore, by adopting the device provided by the embodiment of the invention, a plurality of optical capture balls are deployed on the target object, only one infrared camera is required to shoot a plurality of continuous infrared images aiming at the target object, a target image group containing a plurality of target images is obtained, then the optical capture characteristic information is respectively extracted from the plurality of target images in the target image group through a pre-trained action recognition model, the difference characteristic information is obtained through calculation according to the extracted optical capture characteristic information, and then the optical capture characteristic information and the difference characteristic information are spliced to obtain the spliced characteristic information. And then determining the probability that the action corresponding to the splicing characteristic information belongs to each preset action type, and determining the action type with the maximum probability as the action type of the target object. Compared with the existing action recognition method, the method provided by the embodiment of the invention has the advantages that the action is recognized by extracting the light capture characteristic information of continuous multi-frame target images and combining the difference characteristic information, so that on one hand, the action recognition processing of the target object is simplified, the requirement on the use scene of the action recognition is reduced, and on the other hand, the action recognition accuracy of the target object is improved. In addition, a professional studio is not needed, a target object does not need to wear specific light-catching clothes, only a plurality of light-catching balls are needed to be pasted on the target object, the action recognition of the target object can be realized through one infrared camera, and the use scene of the action recognition is expanded.
An embodiment of the present invention further provides an electronic device, as shown in fig. 9, including a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 communicate with one another through the communication bus 904,
a memory 903 for storing computer programs;
the processor 901 is configured to implement the following steps when executing the program stored in the memory 903:
acquiring a plurality of continuous infrared images; the infrared image is an image which is shot by an infrared camera and contains a designated part of a target object, and the target object is provided with a plurality of light capturing balls, wherein each light capturing ball corresponds to one designated part of the target object;
determining a target image group containing multiple frames of target images from multiple frames of continuous infrared images;
inputting multi-frame target images in the target image group into a pre-trained motion recognition model to obtain the motion type of the target object corresponding to the target image group; wherein the motion recognition model is obtained by training based on a training sample set, and the training sample set comprises: a plurality of sample image groups and the motion type of the sample object corresponding to each sample image group, wherein each sample image group comprises multiple frames of sample images, and the sample images in each sample image group are images containing the designated parts of the sample object.
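As a small illustration of the image-group selection step above, following the frame-interval sampling described elsewhere in this application, one possible sketch is given below; the interval value is an assumption, not a prescribed parameter.

```python
def select_target_image_group(infrared_images, frame_interval=4):
    """Select one frame every `frame_interval` frames from the consecutive
    infrared images to form the target image group (interval is illustrative)."""
    return infrared_images[::frame_interval]

# e.g. 32 consecutive infrared images with an interval of 4 yield 8 target images
target_image_group = select_target_image_group(list(range(32)), frame_interval=4)
assert len(target_image_group) == 8
```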
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one magnetic disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program realizes the steps of any one of the above-mentioned action recognition methods when being executed by a processor.
In yet another embodiment, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to perform any of the action recognition methods described above.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, the electronic device and the storage medium, since they are substantially similar to the method embodiments, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A motion recognition method, comprising:
acquiring a plurality of continuous infrared images; the infrared image is an image which is shot by an infrared camera and contains a designated part of a target object, and the target object is provided with a plurality of light capturing balls, wherein each light capturing ball corresponds to one designated part of the target object;
determining a target image group containing multiple frames of target images from multiple frames of continuous infrared images;
inputting multi-frame target images in the target image group into a pre-trained motion recognition model to obtain the motion type of the target object corresponding to the target image group; wherein the motion recognition model is obtained by training based on a training sample set, and the training sample set comprises: a plurality of sample image groups and the motion type of the sample object corresponding to each sample image group, wherein each sample image group comprises multiple frames of sample images, and the sample images in each sample image group are images containing the designated parts of the sample object.
2. The method according to claim 1, wherein the determining a target image group containing a plurality of frames of target images from a plurality of frames of continuous infrared images comprises:
selecting a frame of infrared image from a plurality of continuous infrared images at preset frame number intervals as a target image to obtain a target image group consisting of a plurality of target images.
3. The method of claim 1, wherein the pre-trained motion recognition model comprises: the system comprises a feature extraction network layer, a difference feature calculation layer, a feature splicing layer, an action classification network layer and an output layer;
the step of inputting the multi-frame target images in the target image group into a pre-trained motion recognition model to obtain the motion types of the target objects corresponding to the target image group includes:
inputting multi-frame target images in the target image group into a feature extraction network layer of a pre-trained action recognition model;
the feature extraction network layer is used for respectively extracting the light capture features of the multi-frame target image to obtain a plurality of light capture feature information;
the difference feature calculation layer calculates, according to the light capture feature information, the difference between the light capture features of every two adjacent target images in the target image group to obtain a plurality of pieces of difference feature information;
the feature splicing layer splices the light capture feature information and the difference feature information to obtain spliced feature information;
the action classification network layer determines the probability that the action corresponding to the spliced feature information belongs to each preset action type;
and the output layer outputs the action type with the maximum probability as the action type of the target object corresponding to the target image group.
4. The method of claim 3, wherein the action classification network layer in the action recognition model comprises: a preset number of fully-connected layers; wherein the input feature dimension of the first fully-connected layer of the action classification network layer is s × (2N-1); the output feature dimension of the last fully-connected layer of the action classification network layer is 1 × n; N denotes the number of target images, n denotes the number of preset action types, and s denotes the dimension of the light capture feature information;
the output layer in the motion recognition model comprises: softmax layer.
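Purely as a numerical illustration of the dimensions recited in this claim (the values below are hypothetical and not part of the claim):

```python
# Hypothetical sizes: N target images, n preset action types, s-dimensional light capture features
N, n, s = 5, 8, 256
first_fc_input_dim = s * (2 * N - 1)   # 256 * 9 = 2304
last_fc_output_dim = 1 * n             # 8, converted to per-type probabilities by the softmax layer
```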
5. The method of claim 3, wherein the motion recognition model is trained based on a training sample set using the steps of:
collecting the training samples, inputting multi-frame sample images of the sample image group into a neural network model to be trained, and obtaining the action types of sample objects corresponding to the sample image group as output results;
adjusting parameters of the current neural network model to be trained based on the output result to obtain a new neural network model to be trained, completing one iteration, returning to the step of collecting the training samples, and inputting multi-frame sample images of the sample image group into the neural network model to be trained;
and when the iteration times reach the preset iteration times or the loss function value of the current neural network model to be trained is smaller than the preset loss function threshold value, ending the training, and determining the current neural network model to be trained as the motion recognition model.
6. The method of claim 5, wherein the feature extraction network layer of the neural network model to be trained is a predetermined image feature extraction network layer;
the adjusting the parameters of the current neural network model to be trained based on the output result comprises:
and adjusting the parameters of the action classification network layer of the current neural network model to be trained based on the output result.
7. An action recognition device, comprising:
the infrared image acquisition module is used for acquiring a plurality of continuous infrared images; the infrared image is an image which is shot by an infrared camera and contains a designated part of a target object, and the target object is provided with a plurality of light capturing balls, wherein each light capturing ball corresponds to one designated part of the target object;
the image group determining module is used for determining a target image group containing a plurality of frames of target images from a plurality of frames of continuous infrared images;
the action recognition module is used for inputting the multi-frame target images in the target image group into a pre-trained action recognition model to obtain the action type of the target object corresponding to the target image group; wherein the action recognition model is obtained by training based on a training sample set, and the training sample set comprises: a plurality of sample image groups and the action type of the sample object corresponding to each sample image group, wherein each sample image group comprises multiple frames of sample images, and the sample images in each sample image group are images containing the designated parts of the sample object.
8. The apparatus of claim 7, wherein the image group determining module is specifically configured to select one infrared image from multiple consecutive infrared images every preset number of frames as the target image, so as to obtain a target image group consisting of multiple target images.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-6 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 6.
CN202010623952.4A 2020-06-30 2020-06-30 Action recognition method and device, electronic equipment and storage medium Pending CN111753795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010623952.4A CN111753795A (en) 2020-06-30 2020-06-30 Action recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010623952.4A CN111753795A (en) 2020-06-30 2020-06-30 Action recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111753795A true CN111753795A (en) 2020-10-09

Family

ID=72680354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010623952.4A Pending CN111753795A (en) 2020-06-30 2020-06-30 Action recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111753795A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926427A (en) * 2021-02-18 2021-06-08 浙江智慧视频安防创新中心有限公司 Target user dressing attribute identification method and device
WO2024002238A1 (en) * 2022-06-30 2024-01-04 影石创新科技股份有限公司 Jump recognition method and apparatus, and electronic device and storage medium
CN116912947A (en) * 2023-08-25 2023-10-20 东莞市触美电子科技有限公司 Intelligent screen, screen control method, device, equipment and storage medium thereof
CN116912947B (en) * 2023-08-25 2024-03-12 东莞市触美电子科技有限公司 Intelligent screen, screen control method, device, equipment and storage medium thereof

Similar Documents

Publication Publication Date Title
CN109145784B (en) Method and apparatus for processing video
CN111753795A (en) Action recognition method and device, electronic equipment and storage medium
CN109583340B (en) Video target detection method based on deep learning
Tran et al. Two-stream flow-guided convolutional attention networks for action recognition
Yang et al. Single image haze removal via region detection network
Xu et al. Two-stream region convolutional 3D network for temporal activity detection
CN111767866B (en) Human body model creation method and device, electronic equipment and storage medium
CN113850248B (en) Motion attitude evaluation method and device, edge calculation server and storage medium
CN109960962B (en) Image recognition method and device, electronic equipment and readable storage medium
CN110427900B (en) Method, device and equipment for intelligently guiding fitness
CN110942006A (en) Motion gesture recognition method, motion gesture recognition apparatus, terminal device, and medium
US20190311186A1 (en) Face recognition method
CN113065645A (en) Twin attention network, image processing method and device
Tsagkatakis et al. Goal!! event detection in sports video
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN112084952B (en) Video point location tracking method based on self-supervision training
CN111738202A (en) Key point identification method and device, electronic equipment and storage medium
CA3061908C (en) Ball trajectory tracking
CN112070181A (en) Image stream-based cooperative detection method and device and storage medium
CN111753796A (en) Method and device for identifying key points in image, electronic equipment and storage medium
KR102203109B1 (en) Method and apparatus of processing image based on artificial neural network
CN112560618A (en) Behavior classification method based on skeleton and video feature fusion
US20220273984A1 (en) Method and device for recommending golf-related contents, and non-transitory computer-readable recording medium
Bibi et al. Human interaction anticipation by combining deep features and transformed optical flow components
JP7253967B2 (en) Object matching device, object matching system, object matching method, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination