CN114582013A - Behavior gesture recognition method for shielded part of human body - Google Patents

Behavior gesture recognition method for shielded part of human body

Info

Publication number
CN114582013A
Authority
CN
China
Prior art keywords
frame
feature
video frame
current video
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210033000.6A
Other languages
Chinese (zh)
Inventor
于征
黄剑男
谷雨
林志勇
刘鹏
卢志勇
钟冲
许涵
李源鑫
潘帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Road & Bridge Information Co ltd
Original Assignee
Xiamen Road & Bridge Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Road & Bridge Information Co ltd filed Critical Xiamen Road & Bridge Information Co ltd
Priority to CN202210033000.6A priority Critical patent/CN114582013A/en
Publication of CN114582013A publication Critical patent/CN114582013A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior and gesture recognition method, medium, and device for an occluded part of the human body, wherein the method comprises the following steps: acquiring a video to be detected and extracting its video frames; obtaining the current video frame according to the frame order and judging whether it is the first frame of the video to be detected; if not, inputting the current video frame into the feature extraction main network to extract the first frame feature corresponding to the current video frame; acquiring the second frame feature of the previous video frame and fusing it with the first frame feature of the current video frame; and inputting the feature fusion result into the feature extraction refinement network to extract the corresponding refined feature, which is taken as the second frame feature of the current video frame. The method prevents the loss of human key points caused by large motion amplitude and occlusion of the human body, and improves the accuracy of human key point recognition.

Description

Behavior gesture recognition method for shielded part of human body
Technical Field
The invention relates to the technical field of machine recognition, and in particular to a behavior and gesture recognition method for an occluded part of the human body, a computer-readable storage medium, and a computer device.
Background
In the related art, most methods for extracting human key points simply input an image into a detection network and extract the key points of the human body shown in that image. However, when a person is in motion, the motion amplitude is large and body parts are easily occluded. In a motion state, this approach therefore often loses human key points, making the final recognition result inaccurate.
Disclosure of Invention
The present invention aims to solve, at least to some extent, one of the technical problems described above. Therefore, an object of the present invention is to provide a behavior and gesture recognition method for an occluded part of the human body that prevents the loss of human key points caused by large motion amplitude and occlusion of the human body, and improves the accuracy of human key point recognition.
A second object of the invention is to propose a computer-readable storage medium.
A third object of the invention is to propose a computer device.
To achieve the above object, an embodiment of the first aspect of the present invention provides a behavior and gesture recognition method for an occluded part of the human body, comprising the following steps: acquiring a video to be detected and extracting its video frames; obtaining the current video frame according to the frame order and judging whether it is the first frame of the video to be detected; if not, inputting the current video frame into a feature extraction main network to extract the first frame feature corresponding to the current video frame; acquiring the second frame feature of the previous video frame and fusing it with the first frame feature of the current video frame; and inputting the feature fusion result into a feature extraction refinement network to extract the corresponding refined feature, which is taken as the second frame feature of the current video frame.
According to the behavior and gesture recognition method for an occluded part of the human body of the embodiment of the present invention, a video to be detected is first acquired and its video frames are extracted; the current video frame is then obtained according to the frame order, and it is judged whether it is the first frame of the video to be detected; if not, the current video frame is input into the feature extraction main network to extract the corresponding first frame feature; the second frame feature of the previous video frame is acquired and fused with the first frame feature of the current video frame; and the feature fusion result is input into the feature extraction refinement network to extract the corresponding refined feature, which is taken as the second frame feature of the current video frame. In this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
In addition, the behavior gesture recognition method for the shielded part of the human body according to the embodiment of the invention may further have the following additional technical features:
Optionally, if the current video frame is the first frame of the video to be detected, the current video frame is input into the feature extraction main network to output the corresponding first frame feature; the first frame feature is then input into the feature extraction refinement network, and the refined feature output by the refinement network is taken as the second frame feature of the current video frame.
Optionally, performing feature fusion on the first frame feature of the current video frame and the second frame feature of the previous video frame comprises: acquiring an influence coefficient of the previous video frame and scaling the second frame feature of the previous video frame by this coefficient to obtain the feature to be fused; resizing the feature to be fused with a first convolution so that its size equals the size of the first frame feature of the current video frame; expanding the number of feature map channels of the feature to be fused with a second convolution so that it equals the number of feature map channels of the first frame feature of the current video frame; and adding the expanded feature to be fused and the first frame feature of the current video frame along the same dimensions to complete the feature fusion.
Optionally, the training process of the feature extraction main network and the feature extraction refinement network comprises: acquiring a first data set and training on it to obtain a pre-training model; and acquiring a second data set and continuing to train the pre-training model on it to obtain the final detection model.
Optionally, the first data set is the COCO data set, and training on the first data set to obtain the pre-training model comprises: performing jitter augmentation on the COCO data set to obtain a pre-training data set; during pre-training, inputting the original image from the COCO data set as the current video frame and the corresponding jittered image from the pre-training data set as the previous video frame; wherein jittering the COCO data set comprises: randomly translating randomly chosen key points of an image; slightly rotating all key points of an image; and scaling the size of the key points of an image.
Optionally, the second data set is a PoseTrack data set.
To achieve the above object, an embodiment of the second aspect of the present invention provides a computer-readable storage medium storing a behavior and gesture recognition program for an occluded part of the human body, which, when executed by a processor, implements the above behavior and gesture recognition method for an occluded part of the human body.
According to the computer-readable storage medium of the embodiment of the present invention, the stored program, when executed by the processor, carries out the above behavior and gesture recognition method; in this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
To achieve the above object, an embodiment of the third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above behavior and gesture recognition method for an occluded part of the human body when executing the program.
According to the computer device of the embodiment of the present invention, the memory stores the behavior and gesture recognition program for an occluded part of the human body, and the processor executes it to implement the above method; in this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for recognizing behavioral gestures of an occluded part of a human body according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a COCO data set amplification process according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In the related art, extracting human key points from a subject in a motion state often loses key points, making the final recognition result inaccurate. The behavior and gesture recognition method for an occluded part of the human body according to the embodiment of the present invention addresses this as follows: a video to be detected is acquired and its video frames are extracted; the current video frame is obtained according to the frame order, and it is judged whether it is the first frame of the video to be detected; if not, the current video frame is input into the feature extraction main network to extract the corresponding first frame feature; the second frame feature of the previous video frame is acquired and fused with the first frame feature; and the fusion result is input into the feature extraction refinement network to extract the corresponding refined feature, which is taken as the second frame feature of the current video frame. In this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Fig. 1 is a schematic flow chart of a method for recognizing a behavior gesture of an occluded part of a human body according to an embodiment of the present invention, and as shown in fig. 1, the method for recognizing a behavior gesture of an occluded part of a human body includes the following steps:
s101, acquiring a video to be detected, and extracting a video frame of the video to be detected.
S102, obtaining the current video frame according to the sequence of the video frames, and judging whether the current video frame is the first frame of the video to be detected.
S103, if not, inputting the current video frame into the feature extraction main network so as to extract the first frame feature corresponding to the current video frame through the feature extraction main network.
In some embodiments, if the current video frame is the first frame of the video to be detected, the current video frame is input to the feature extraction main network, so that the first frame feature corresponding to the current video frame is output through the feature extraction main network, the first frame feature is input to the feature extraction refinement network, and the refinement feature output by the feature extraction refinement network is used as the second frame feature of the current video frame.
That is, the video frames are extracted from the video to be detected and sorted along the time axis. When detecting human key points, the current video frame is first acquired; if it is the first frame, it is input into the feature extraction main network to extract the corresponding first frame feature (i.e., the whole-body feature); the first frame feature is then input into the feature extraction refinement network to extract the corresponding refined feature, which is taken as the second frame feature (i.e., the local feature) of the current video frame.
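The per-frame logic above can be sketched as follows, assuming three callables (`backbone`, `refine`, `fuse`) whose names are illustrative rather than taken from the patent; the toy stand-ins at the bottom only demonstrate the control flow:

```python
import numpy as np

def process_video(frames, backbone, refine, fuse):
    """Per-frame pipeline: the first frame is refined directly; every later
    frame is fused with the previous frame's refined feature before refinement."""
    prev_refined = None
    outputs = []
    for i, frame in enumerate(frames):
        feat = backbone(frame)                          # whole-image feature
        if i == 0:
            refined = refine(feat)                      # first frame: no fusion partner
        else:
            refined = refine(fuse(feat, prev_refined))  # fuse with previous frame
        prev_refined = refined                          # carried forward
        outputs.append(refined)
    return outputs

# Toy stand-ins for the learned networks (control flow only):
frames = [np.full((4, 4), v) for v in (1.0, 2.0, 3.0)]
backbone = lambda x: x * 2.0
refine = lambda x: x + 1.0
fuse = lambda cur, prev: cur + 0.1 * prev               # influence coefficient 0.1

out = process_video(frames, backbone, refine, fuse)
```

With these stand-ins the recursion is easy to trace: each frame's output feeds the next frame's fusion, which is exactly what lets key point evidence from the previous frame survive an occlusion in the current one.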
S104, acquiring second frame characteristics of the previous frame of video frame, and performing characteristic fusion on the first frame characteristics corresponding to the current video frame and the second frame characteristics of the previous frame of video frame.
In some embodiments, fusing the first frame feature of the current video frame with the second frame feature of the previous video frame comprises: acquiring an influence coefficient of the previous video frame and scaling the second frame feature of the previous video frame by this coefficient to obtain the feature to be fused; resizing the feature to be fused with a first convolution so that its size equals the size of the first frame feature of the current video frame; expanding the number of feature map channels of the feature to be fused with a second convolution so that it equals the number of feature map channels of the first frame feature of the current video frame; and adding the expanded feature to be fused and the first frame feature of the current video frame along the same dimensions to complete the feature fusion.
As an example, when acquiring the human key points of the current video frame, key points may be lost because the human body is in motion. The key point detection capability for the current frame is therefore enhanced by fusing the first frame feature of the current video frame with the second frame feature of the previous video frame, supplementing the current frame with key point information from the previous frame. Specifically, an influence coefficient a of the previous video frame is first obtained (for example, a = 0.1); this coefficient determines how strongly the second frame feature of the previous frame influences the fused feature. Next, the second frame feature output for the previous video frame is combined into a feature map by superposition and multiplied by the influence coefficient to obtain the feature to be fused. The feature to be fused is then resized with a 3 × 3 convolution to match the size of the first frame feature of the current video frame, and its number of channels is expanded with a 1 × 1 convolution to match the number of feature map channels of the first frame feature. Finally, the feature maps are added along the same dimensions to complete the feature fusion.
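A minimal NumPy sketch of this fusion step: nearest-neighbour resampling and a random 1 × 1 weight matrix stand in for the learned 3 × 3 and 1 × 1 convolutions, and only the influence coefficient a = 0.1 comes from the text; the feature shapes and all other values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(prev_feat, cur_feat, a=0.1, w1x1=None):
    """Fuse the previous frame's refined feature into the current frame's
    backbone feature (both shaped C x H x W)."""
    scaled = a * prev_feat                        # apply influence coefficient
    c_prev = scaled.shape[0]
    c_cur, h_cur, w_cur = cur_feat.shape
    # Spatial resize to the current feature's size (stand-in for the 3x3 conv).
    ys = np.arange(h_cur) * scaled.shape[1] // h_cur
    xs = np.arange(w_cur) * scaled.shape[2] // w_cur
    resized = scaled[:, ys][:, :, xs]
    # 1x1 conv = per-pixel channel mixing with a (c_cur, c_prev) weight matrix.
    if w1x1 is None:
        w1x1 = rng.standard_normal((c_cur, c_prev))
    expanded = np.einsum('oc,chw->ohw', w1x1, resized)
    return cur_feat + expanded                    # same-dimension addition

prev = rng.standard_normal((32, 23, 23))  # previous frame's refined feature
cur = rng.standard_normal((64, 46, 46))   # current frame's backbone feature
fused = fuse_features(prev, cur)
```

Setting a = 0 (or a zero weight matrix) recovers the current feature unchanged, which makes the role of the influence coefficient easy to verify.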
And S105, inputting the feature fusion result into a feature extraction refinement network to extract corresponding refinement features, and taking the refinement features as second frame features of the current video frame.
The feature extraction main network and the feature extraction refinement network can be trained in various ways.
In some embodiments, the training process of the feature extraction main network and the feature extraction refinement network comprises: acquiring a first data set and training on it to obtain a pre-training model; and acquiring a second data set and continuing to train the pre-training model on it to obtain the final detection model.
In some embodiments, the first data set is the COCO data set, and training on it to obtain the pre-training model comprises: performing jitter augmentation on the COCO data set to obtain a pre-training data set; during pre-training, inputting the original COCO image as the current video frame and the corresponding jittered image as the previous video frame; wherein jittering the COCO data set comprises: randomly translating randomly chosen key points of an image; slightly rotating all key points of an image; and scaling the size of the key points of an image.
As an example, as shown in fig. 2, the COCO data set is first acquired. For a single picture in the COCO data set, a number k of key points to be changed is randomly chosen, and those k key points are randomly translated by a small amount (for example, within 5 pixels on a 46 × 46 feature map); all key points of the whole picture are then randomly rotated by a small angle; and all key points are slightly enlarged. During training, the adjusted feature map is used as the input for the frame preceding the current video frame, and the pre-training model is obtained through training.
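The augmentation steps above can be sketched as follows on a (N, 2) array of keypoint coordinates; only the 5-pixel translation bound and the 46 × 46 feature-map size come from the text, while k, the rotation angle, and the scale factor are assumed values:

```python
import numpy as np

rng = np.random.default_rng(42)

def jitter_keypoints(kpts, k=3, max_shift=5, max_angle_deg=5.0,
                     scale=1.05, map_size=46):
    """Jitter a (N, 2) keypoint array to simulate a 'previous frame'."""
    out = kpts.astype(float).copy()
    # 1) randomly translate k randomly chosen key points within max_shift px
    idx = rng.choice(len(out), size=min(k, len(out)), replace=False)
    out[idx] += rng.uniform(-max_shift, max_shift, size=(len(idx), 2))
    # 2) slightly rotate all key points about the feature-map centre
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = np.cos(theta), np.sin(theta)
    centre = np.array([map_size / 2, map_size / 2])
    out = (out - centre) @ np.array([[c, -s], [s, c]]) + centre
    # 3) slightly enlarge all key points about the same centre
    out = (out - centre) * scale + centre
    return np.clip(out, 0, map_size - 1)          # stay on the feature map

kpts = rng.uniform(0, 46, size=(17, 2))           # 17 COCO-style key points
aug = jitter_keypoints(kpts)
```

The jittered coordinates mimic the small frame-to-frame motion the model will see at inference time, which is why each jittered copy is fed to the network as the "previous frame" of its original.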
In some embodiments, the second data set is the PoseTrack data set.
As an example, to increase sample diversity, the pre-training model is first trained on the augmented COCO data set; training then continues on the PoseTrack data set using this pre-training model to obtain the final detection model.
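A minimal sketch of this two-stage schedule, where `train_step` is an assumed placeholder for one pass of training and the epoch counts are illustrative:

```python
def two_stage_training(pretrain_data, finetune_data, train_step,
                       epochs=(10, 5)):
    """Stage 1: pre-train on the jittered COCO data; stage 2: continue
    training the same model on PoseTrack to obtain the final detector."""
    model = None
    for _ in range(epochs[0]):
        model = train_step(model, pretrain_data)   # stage 1: augmented COCO
    for _ in range(epochs[1]):
        model = train_step(model, finetune_data)   # stage 2: PoseTrack
    return model

# Toy train_step (counts updates per data set) to show the schedule only:
train_step = lambda m, d: (m or 0) + (1 if d == 'coco_jittered' else 100)
final = two_stage_training('coco_jittered', 'posetrack', train_step)
```

The key design point is that stage 2 resumes from the stage-1 weights rather than training from scratch, so the PoseTrack fine-tuning only has to adapt an already general pose model to video data.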
In summary, according to the behavior and gesture recognition method for an occluded part of the human body of the embodiment of the present invention, a video to be detected is first acquired and its video frames are extracted; the current video frame is then obtained according to the frame order, and it is judged whether it is the first frame of the video to be detected; if not, the current video frame is input into the feature extraction main network to extract the corresponding first frame feature; the second frame feature of the previous video frame is acquired and fused with the first frame feature of the current video frame; and the fusion result is input into the feature extraction refinement network to extract the corresponding refined feature, which is taken as the second frame feature of the current video frame. In this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
To implement the above embodiments, a second aspect of the present invention provides a computer-readable storage medium storing a behavior and gesture recognition program for an occluded part of the human body, which, when executed by a processor, implements the behavior and gesture recognition method for an occluded part of the human body as described above.
According to the computer-readable storage medium of the embodiment of the present invention, the stored program, when executed by the processor, carries out the above behavior and gesture recognition method; in this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
To implement the above embodiments, a third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above behavior and gesture recognition method for an occluded part of the human body when executing the program.
According to the computer device of the embodiment of the present invention, the memory stores the behavior and gesture recognition program for an occluded part of the human body, and the processor executes it to implement the above method; in this way, the loss of human key points caused by large motion amplitude and occlusion of the human body is prevented, and the accuracy of human key point recognition is improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
In the description of the present invention, it is to be understood that the terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Also, a first feature "on," "over," or "above" a second feature may be directly or obliquely above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature "under," "below," or "beneath" a second feature may be directly or obliquely below the second feature, or may simply indicate that the first feature is at a lower level than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above should not be understood to necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A behavior gesture recognition method for an occluded part of a human body, characterized by comprising the following steps:
acquiring a video to be detected, and extracting a video frame of the video to be detected;
acquiring a current video frame according to the sequence of the video frames, and judging whether the current video frame is the first frame of the video to be detected;
if not, inputting the current video frame into a feature extraction main network so as to extract a first frame feature corresponding to the current video frame through the feature extraction main network;
acquiring a second frame feature of the previous video frame, and performing feature fusion on the first frame feature corresponding to the current video frame and the second frame feature of the previous video frame;
and inputting the feature fusion result into a feature extraction refinement network to extract corresponding refined features, and taking the refined features as second frame features of the current video frame.
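The per-frame recursion of claim 1 can be sketched as follows. This is a minimal, hypothetical illustration: the function names and the toy feature computations are assumptions standing in for the patented feature extraction main network, feature extraction refinement network, and fusion step, none of which are disclosed at code level.

```python
import numpy as np

def extract_main(frame):
    # Stand-in for the feature extraction main network:
    # yields the "first frame feature" (toy: channel-mean map).
    return frame.mean(axis=-1, keepdims=True)

def refine(feature):
    # Stand-in for the feature extraction refinement network:
    # yields the refined "second frame feature" (toy: max-normalization).
    return feature / (np.abs(feature).max() + 1e-8)

def fuse(curr_feature, prev_refined):
    # Stand-in for the claim-3 feature fusion (toy: elementwise addition).
    return curr_feature + prev_refined

def process_video(frames):
    """Run the claim-1 recursion over the extracted video frames."""
    prev_second = None
    second_features = []
    for i, frame in enumerate(frames):
        first = extract_main(frame)           # first frame feature
        if i == 0:
            fused = first                     # first frame: no fusion (claim 2)
        else:
            fused = fuse(first, prev_second)  # fuse with previous second frame feature
        prev_second = refine(fused)           # second frame feature of current frame
        second_features.append(prev_second)
    return second_features
```

Under this sketch, each frame's refined feature feeds forward into the next frame's fusion, which is what lets features from earlier, unoccluded frames support recognition when the current frame is occluded.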
2. The behavior gesture recognition method for an occluded part of a human body according to claim 1, wherein, if the current video frame is the first frame of the video to be detected, the current video frame is input into the feature extraction main network so that the first frame feature corresponding to the current video frame is output by the feature extraction main network, the first frame feature corresponding to the current video frame is input into the feature extraction refinement network, and the refined feature output by the feature extraction refinement network is taken as the second frame feature of the current video frame.
3. The method for recognizing the behavioral gesture of an occluded part of a human body according to claim 1, wherein feature fusion is performed on a first frame feature corresponding to the current video frame and a second frame feature of the previous video frame, and the method comprises the following steps:
acquiring an influence coefficient of the previous video frame, and adjusting the second frame feature of the previous video frame according to the influence coefficient to obtain a feature to be fused;
adjusting the size of the feature to be fused according to a first convolution so that the size of the feature to be fused is equal to the size of the first frame feature of the current video frame;
expanding the number of feature map channels of the feature to be fused according to a second convolution so that the number of feature map channels of the feature to be fused is equal to the number of feature map channels of the first frame feature of the current video frame;
and adding the feature maps of the expanded feature to be fused and the first frame feature of the current video frame in the same dimensions, so as to complete the feature fusion of the first frame feature corresponding to the current video frame and the second frame feature of the previous video frame.
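A hypothetical numeric sketch of the claim-3 fusion steps follows. The influence coefficient `alpha`, and the use of `np.kron` and `np.tile` as crude stand-ins for the learned first and second convolutions, are assumptions for illustration only.

```python
import numpy as np

def fuse_features(curr_feat, prev_feat, alpha=0.5):
    # Step 1: weight the previous frame's second frame feature by the
    # influence coefficient to obtain the feature to be fused.
    to_fuse = alpha * prev_feat
    # Step 2: match spatial size (stand-in for the "first convolution");
    # here the smaller map is upsampled by integer repetition.
    sh = curr_feat.shape[0] // to_fuse.shape[0]
    sw = curr_feat.shape[1] // to_fuse.shape[1]
    to_fuse = np.kron(to_fuse, np.ones((sh, sw, 1)))
    # Step 3: match the channel count (stand-in for the "second convolution").
    reps = curr_feat.shape[2] // to_fuse.shape[2]
    to_fuse = np.tile(to_fuse, (1, 1, reps))
    # Step 4: elementwise addition in the same dimensions completes the fusion.
    return curr_feat + to_fuse
```

In the patent, the size and channel adjustments are learned convolutions rather than repetition, but the shape bookkeeping is the same: the previous frame's feature must be brought to the current feature's height, width, and channel count before the elementwise addition.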
4. The method for recognizing the behavioral gesture of the occluded part of the human body according to claim 1, wherein the training processes of the feature extraction main network and the feature extraction refinement network each comprise:
acquiring a first data set, and training according to the first data set to obtain a pre-training model;
and acquiring a second data set, and continuously training the pre-training model according to the second data set to obtain a final detection model.
5. The method for recognizing the behavioral gesture of an occluded part of a human body according to claim 4, wherein the first data set is a COCO data set, and wherein training according to the first data set to obtain a pre-training model comprises:
performing dither amplification on the COCO data set to obtain a pre-training data set;
in the training process of the pre-training model, inputting an original image in the COCO data set as the current video frame, and inputting the corresponding image in the pre-training data set as the previous video frame;
wherein dither amplifying the COCO dataset comprises:
randomly translating a randomly selected key point of any image in the COCO data set;
randomly rotating all key points of any image in the COCO data set;
and randomly scaling the key points of any image in the COCO data set.
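The three dither-amplification operations of claim 5 can be sketched as below. The jitter magnitudes (`max_shift`, `max_angle`, `scale_range`) and the choice of rotating and scaling about the key-point centroid are illustrative assumptions, not values from the patent.

```python
import numpy as np

def jitter_keypoints(kps, rng, max_shift=5.0, max_angle=np.pi / 18,
                     scale_range=(0.9, 1.1)):
    """kps: (N, 2) array of (x, y) key points for one COCO image."""
    out = np.asarray(kps, dtype=float).copy()
    # (a) randomly translate one randomly chosen key point
    i = rng.integers(len(out))
    out[i] += rng.uniform(-max_shift, max_shift, size=2)
    # (b) randomly rotate all key points about their centroid
    theta = rng.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    centroid = out.mean(axis=0)
    out = (out - centroid) @ rot.T + centroid
    # (c) randomly scale all key points about their centroid
    out = (out - centroid) * rng.uniform(*scale_range) + centroid
    return out
```

During pre-training, the original COCO image (with its original key points) plays the role of the current video frame, while the jittered copy plays the role of the previous video frame, so that a single-image data set yields pseudo frame pairs.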
6. The method for recognizing the behavioral gesture of an occluded part of a human body according to claim 4, wherein the second data set is a Posetrack data set.
7. A computer-readable storage medium on which a behavior gesture recognition program for an occluded part of a human body is stored, wherein the behavior gesture recognition program, when executed by a processor, implements the behavior gesture recognition method for an occluded part of a human body according to any one of claims 1 to 6.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for recognizing the behavioral gesture of an occluded part of a human body according to any one of claims 1 to 6.
CN202210033000.6A 2022-01-12 2022-01-12 Behavior gesture recognition method for shielded part of human body Pending CN114582013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210033000.6A CN114582013A (en) 2022-01-12 2022-01-12 Behavior gesture recognition method for shielded part of human body

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210033000.6A CN114582013A (en) 2022-01-12 2022-01-12 Behavior gesture recognition method for shielded part of human body

Publications (1)

Publication Number Publication Date
CN114582013A true CN114582013A (en) 2022-06-03

Family

ID=81769814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210033000.6A Pending CN114582013A (en) 2022-01-12 2022-01-12 Behavior gesture recognition method for shielded part of human body

Country Status (1)

Country Link
CN (1) CN114582013A (en)

Similar Documents

Publication Publication Date Title
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN109376681B (en) Multi-person posture estimation method and system
CN108229526B (en) Network training method, network training device, image processing method, image processing device, storage medium and electronic equipment
TWI754375B (en) Image processing method, electronic device and computer-readable storage medium
US20230085605A1 (en) Face image processing method, apparatus, device, and storage medium
US8396303B2 (en) Method, apparatus and computer program product for providing pattern detection with unknown noise levels
CN108961180B (en) Infrared image enhancement method and system
CN111881707B (en) Image reproduction detection method, identity verification method, model training method and device
CN111914843B (en) Character detection method, system, equipment and storage medium
JP7212554B2 (en) Information processing method, information processing device, and program
CN113112511B (en) Method and device for correcting test paper, storage medium and electronic equipment
US20230216598A1 (en) Detection device
CN111753775B (en) Fish growth assessment method, device, equipment and storage medium
CN109671055A (en) Pulmonary nodule detection method and device
CN113191235A (en) Sundry detection method, device, equipment and storage medium
CN114511702A (en) Remote sensing image segmentation method and system based on multi-scale weighted attention
CN111368698A (en) Subject recognition method, subject recognition device, electronic device, and medium
JP6132996B1 (en) Image processing apparatus, image processing method, and image processing program
CN105913427B (en) Machine learning-based noise image saliency detecting method
JP2007199865A (en) Image processing algorithm evaluation apparatus, image processing algorithm creation apparatus, image inspection apparatus, method for evaluating image processing algorithm, method for creating image processing algorithm, and method for inspecting image
CN111079624A (en) Method, device, electronic equipment and medium for collecting sample information
CN114582013A (en) Behavior gesture recognition method for shielded part of human body
CN116452418A (en) Method, device and system for identifying low-resolution image target
CN116048763A (en) Task processing method and device based on BEV multitasking model framework
CN111353330A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination