CN116129523A - Action recognition method, device, terminal and computer readable storage medium - Google Patents

Action recognition method, device, terminal and computer readable storage medium

Info

Publication number
CN116129523A
Authority
CN
China
Prior art keywords
action
limb
sequence
target object
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211737618.7A
Other languages
Chinese (zh)
Inventor
刘艳禹
魏乃科
潘华东
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202211737618.7A priority Critical patent/CN116129523A/en
Publication of CN116129523A publication Critical patent/CN116129523A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 - Matching configurations of points or features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides an action recognition method, a device, a terminal and a computer readable storage medium. The action recognition method comprises the following steps: performing key point detection on each video frame in an acquired video stream to be identified to obtain a limb feature sequence corresponding to the video stream to be identified, wherein the limb feature sequence comprises the limb features corresponding to each of a plurality of video frames; determining action features corresponding to a target object based on the limb feature sequence; and comparing the action features of the target object with each preset action feature and determining the action category of the target object based on the comparison result. Because the method detects the action features of the target object in the video stream to be identified and then compares them with the preset action features, the set of preset action features can be rapidly expanded according to whatever action categories need to be detected, so the method is suitable for many types of action recognition and improves the generalization performance of the action recognition method.

Description

Action recognition method, device, terminal and computer readable storage medium
Technical Field
The present invention relates to the field of image-based action recognition technologies, and in particular, to an action recognition method, device, terminal, and computer readable storage medium.
Background
Action recognition technology is an important part of making every industry more intelligent, but because requirements differ from industry to industry, the actions that need to be recognized also differ, and these requirements are highly fragmented. Traditional action recognition methods also suffer from difficult data acquisition, long annotation time, long model training and deployment cycles, and poor timeliness.
Disclosure of Invention
The invention mainly provides an action recognition method, device, terminal and computer readable storage medium, so as to solve the problem of low generalization performance of action recognition methods in the prior art.
In order to solve the above technical problem, the first technical solution adopted by the invention is as follows: an action recognition method is provided, which includes:
performing key point detection on each video frame in the acquired video stream to be identified to obtain a limb characteristic sequence corresponding to the video stream to be identified; the video stream to be identified comprises a plurality of video frames containing target objects, and the limb characteristic sequence comprises limb characteristics corresponding to each video frame in the plurality of video frames;
determining action features corresponding to the target object based on the limb feature sequence;
comparing the action characteristics of the target object with preset action characteristics, and determining the action category of the target object based on the comparison result; the preset action features are obtained based on the action features of the corresponding preset action categories.
Wherein the limb feature is determined based on at least one key point of a target object contained in the corresponding video frame; or based on the limb posture of the target object contained in the corresponding video frame.
The method for determining the action characteristics corresponding to the target object based on the limb characteristic sequence comprises the following steps:
extracting features of the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object;
and carrying out feature fusion based on the limb feature sequence and the loading feature sequence to obtain action features corresponding to the target object.
The loading feature sequence comprises a length information sequence and an angle information sequence;
feature extraction is carried out on the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object, and the method comprises the following steps:
and determining a length information sequence and an angle information sequence according to the position relation between key points in the limb feature sequence by the feature extraction model.
Wherein the loading feature sequence further comprises a speed information sequence;
feature extraction is carried out on the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object, and the method comprises the following steps:
extracting the time characteristics of the limb characteristic sequence through the characteristic extraction model to obtain a time sequence corresponding to the limb characteristic sequence;
a velocity information sequence is determined based on the time sequence and the limb feature sequence.
The training method of the feature extraction model comprises the following steps:
acquiring a training video stream, wherein the training video stream comprises a plurality of frames of images containing targets; the training video stream is associated with limb sequence data and annotation action categories corresponding to targets;
feature extraction is carried out based on limb sequence data through a feature extraction network in the network model, so as to obtain detection feature information;
obtaining a predicted action category of the target through a classification network in the network model based on the detection characteristic information of the training video stream;
and iteratively training a network model based on an error value between a predicted action category and a labeling action category corresponding to the target, and determining the feature extraction network after training as a feature extraction model.
Wherein, the method further includes:
acquiring at least one pre-stored video stream corresponding to a preset action category, wherein each pre-stored video stream comprises multiple frames of images containing targets;
performing key point detection on each frame of image in the pre-stored video stream, and determining a pre-stored limb feature sequence corresponding to the pre-stored video stream;
determining a preset action feature corresponding to the pre-stored video stream based on the pre-stored limb feature sequence by adopting a feature extraction model;
and constructing a database based on the preset action characteristics and the corresponding preset action categories.
Comparing the action characteristics of the target object with each preset action characteristic, and determining the action category of the target object based on the comparison result, wherein the method comprises the following steps:
calculating the similarity between the action characteristics of the target object and each preset action characteristic;
and in response to the similarity exceeding the similarity threshold, determining the preset action category corresponding to the similarity as the action category of the target object.
In order to solve the above technical problem, a second technical solution adopted by the invention is as follows: an action recognition device is provided, which comprises:
the detection module is used for detecting key points of all video frames in the acquired video stream to be identified to obtain a limb characteristic sequence corresponding to the video stream to be identified; the video stream to be identified comprises a plurality of video frames containing target objects, and the limb characteristic sequence comprises limb characteristics corresponding to each video frame in the plurality of video frames;
the analysis module is used for determining action characteristics corresponding to the target object based on the limb characteristic sequence;
the determining module is used for comparing the action characteristics of the target object with preset action characteristics and determining the action category of the target object based on the comparison result; the preset action features are obtained based on the action features of the corresponding preset action categories.
In order to solve the above technical problem, a third technical solution adopted by the invention is as follows: a terminal is provided, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor is adapted to execute the computer program to carry out the steps of the above action recognition method.
In order to solve the above technical problem, a fourth technical solution adopted by the invention is as follows: a computer readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of the above action recognition method.
The beneficial effects of the invention are as follows. Different from the prior art, the action recognition method, device, terminal and computer readable storage medium provided by the invention perform key point detection on each video frame in the acquired video stream to be identified to obtain a limb feature sequence corresponding to the video stream to be identified, wherein the video stream to be identified comprises a plurality of video frames containing a target object and the limb feature sequence comprises the limb features corresponding to each of the plurality of video frames; determine action features corresponding to the target object based on the limb feature sequence; and compare the action features of the target object with preset action features, determining the action category of the target object based on the comparison result, where the preset action features are obtained based on the action features of the corresponding preset action categories. Because the method detects the action features of the target object in the video stream to be identified and then compares them with the preset action features, the set of preset action features can be rapidly expanded according to whatever action categories need to be detected, so the method is suitable for many types of action recognition and improves the generalization performance of the action recognition method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for identifying actions provided by the invention;
FIG. 2 is a flowchart illustrating an embodiment of a method for motion recognition according to the present invention;
FIG. 3 is a flowchart illustrating an embodiment of step S2 of the action recognition method in FIG. 1;
FIG. 4 is a schematic diagram of a motion recognition device according to an embodiment of the present invention;
FIG. 5 is a schematic framework diagram of an embodiment of a terminal provided by the present invention;
fig. 6 is a schematic diagram of a computer readable storage medium according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two.
In order to enable those skilled in the art to better understand the technical scheme of the present invention, a method for identifying actions provided by the present invention is described in further detail below with reference to the accompanying drawings and the detailed description.
Referring to fig. 1 and fig. 2, fig. 1 is a flow chart of a motion recognition method provided by the present invention; fig. 2 is a flowchart of an embodiment of an action recognition method according to the present invention.
In this embodiment, an action recognition method is provided, and the action recognition method includes the following steps.
S1: performing key point detection on each video frame in the acquired video stream to be identified to obtain a limb characteristic sequence corresponding to the video stream to be identified; the video stream to be identified comprises a plurality of video frames containing target objects, and the limb characteristic sequence comprises limb characteristics corresponding to each video frame in the plurality of video frames.
S2: and determining the action characteristics corresponding to the target object based on the limb characteristic sequence.
S3: comparing the action characteristics of the target object with preset action characteristics, and determining the action category of the target object based on the comparison result; the preset action features are obtained based on the action features of the corresponding preset action categories.
In an embodiment, the specific implementation manner of obtaining the limb feature sequence corresponding to the video stream to be identified in step S1 includes the following steps.
Specifically, a video stream to be identified containing the same target object is acquired, and key point detection is performed on each video frame in the video stream to be identified based on a key point detection model to obtain the key point information of the target object in each video frame, wherein the key point information comprises at least one key point. The limb features corresponding to the target object in a video frame are determined based on the at least one key point corresponding to the target object in that video frame; a key point may be a joint point. In another embodiment, the skeleton of the target object in a video frame is extracted through a neural network to obtain the limb features of the target object.
And based on the time sequence of each video frame, the limb characteristics of the target object in each video frame are formed into a limb characteristic sequence corresponding to the video stream to be identified. Wherein the target object can be a human, an animal, a plant, etc.
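As a concrete illustration of step S1, the following Python sketch assembles per-frame key points into a limb feature sequence. It is a minimal sketch assuming a pose estimator is available; the name `detect_keypoints` and the (T, K, 2) layout are illustrative assumptions, not details fixed by this disclosure.

```python
import numpy as np

def build_limb_feature_sequence(frames, detect_keypoints):
    """Run key point detection on every video frame and stack the results
    into a limb feature sequence of shape (T, K, 2): T frames, K key points
    (joint points), each given as an (x, y) position."""
    sequence = []
    for frame in frames:
        keypoints = detect_keypoints(frame)  # assumed pose estimator, returns (K, 2)
        sequence.append(keypoints)
    return np.stack(sequence, axis=0)        # ordered by the time sequence of frames
```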
In an embodiment, the specific implementation manner of obtaining the action characteristic of the target object in the video stream to be identified in step S2 includes the following steps.
Referring to fig. 3, fig. 3 is a flowchart illustrating an embodiment of step S2 in the action recognition method provided in fig. 1.
S21: and carrying out feature extraction on the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object.
Specifically, the loading feature sequence includes a length information sequence and an angle information sequence.
And determining a length information sequence and an angle information sequence according to the position relation between key points in the limb feature sequence by the feature extraction model.
Taking a human body as an example of the target object, a topological structure sequence of human body joint points is constructed according to the limb feature sequence. The key point information in the limb feature sequence comprises the number of key points, the type of each key point, the key point features and the key point positions, wherein each key point represents a human body joint point.
And determining the distance between the key points according to the position relation between the key points in each video frame in the limb characteristic sequence.
The skeleton length between the keypoints in the video frame may be determined based on the distance between the keypoints. For example, the length of the skeleton between the knee and ankle nodes. And determining a length information sequence corresponding to the limb characteristic sequence according to the skeleton length between the corresponding key points in each video frame.
And determining the angle information of the connecting line between the two key points according to the association relation between the key points. The angle information is an included angle between a connecting line between two key points and a preset direction. For example, the angle between the line between the knee and ankle joints and the horizontal. And determining an angle information sequence corresponding to the limb characteristic sequence according to the angle information of the connecting line between the corresponding key points in each video frame.
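The length and angle computations above can be illustrated with a minimal sketch; the joint indices and the (T, K, 2) sequence layout are assumptions carried over from the previous sketch, not details fixed by this disclosure.

```python
import numpy as np

def length_and_angle(limb_seq, joint_a, joint_b):
    """From a (T, K, 2) limb feature sequence, compute the per-frame skeleton
    length between two key points (e.g. knee and ankle) and the angle of the
    line connecting them relative to the horizontal (the preset direction)."""
    delta = limb_seq[:, joint_b] - limb_seq[:, joint_a]  # (T, 2) offset per frame
    lengths = np.linalg.norm(delta, axis=-1)             # length information sequence
    angles = np.arctan2(delta[:, 1], delta[:, 0])        # angle sequence, in radians
    return lengths, angles
```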
In an embodiment, the loading signature sequence further comprises a speed information sequence.
The time features of the limb feature sequence are extracted through the feature extraction model to obtain a time sequence corresponding to the limb feature sequence; a velocity information sequence is then determined based on the time sequence and the limb feature sequence.
Specifically, spatial features of the joint points are learned by utilizing graph convolution, and time convolution operation is carried out on data after graph convolution operation according to the dimension of a time sequence, so that a time sequence corresponding to the limb feature sequence is obtained. And determining the speed of the corresponding key point between the two video frames according to the position information of the same key point in the limb feature sequence in different video frames and the time interval between the video frames. And further determining a speed information sequence corresponding to the limb characteristic sequence according to the speed information corresponding to each video frame.
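A minimal sketch of the speed computation follows; the graph convolution and time convolution stages are omitted here, and per-frame timestamps are an assumed input rather than something this disclosure specifies.

```python
import numpy as np

def velocity_sequence(limb_seq, timestamps):
    """Approximate the speed of each key point between consecutive frames from
    a (T, K, 2) limb feature sequence and per-frame timestamps in seconds."""
    dt = np.diff(timestamps)[:, None, None]   # (T-1, 1, 1) frame time intervals
    displacement = np.diff(limb_seq, axis=0)  # (T-1, K, 2) position changes
    return displacement / dt                  # (T-1, K, 2) velocity information
```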
S22: and carrying out feature fusion based on the limb feature sequence and the loading feature sequence to obtain action features corresponding to the target object.
Specifically, the limb feature sequence, length information sequence, angle information sequence and speed information sequence corresponding to the video stream to be identified are fused in the channel dimension to obtain the action features corresponding to the target object. The resulting action features are sequence features carrying richer information, making them suitable for many types of action recognition.
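The channel-dimension fusion can be sketched as a simple concatenation; flattening each sequence to a (T_i, C_i) matrix and trimming all inputs to a common length are assumptions made here for illustration.

```python
import numpy as np

def fuse_features(limb_seq, lengths, angles, velocities):
    """Fuse the limb, length, angle and speed sequences in the channel
    dimension to form the action feature. Each input is flattened to
    (T_i, C_i); all are trimmed to a common length, since the velocity
    sequence has one fewer entry than the others."""
    parts = [np.asarray(p).reshape(len(p), -1)
             for p in (limb_seq, lengths, angles, velocities)]
    t = min(len(p) for p in parts)                           # align sequence lengths
    return np.concatenate([p[:t] for p in parts], axis=-1)   # (t, sum of C_i)
```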
In one embodiment, the specific implementation of obtaining the feature extraction model in step S21 includes the following steps.
Acquiring a training video stream, wherein the training video stream comprises a plurality of frames of images containing targets; the training video stream is associated with limb sequence data and annotation action categories corresponding to the targets. The labeling action category may be running, jogging, jumping, etc. The image containing the object includes various images of the whole body, the upper body, the lower body, the hands, the limbs, the feet, and the like.
Feature extraction is carried out based on limb sequence data through a feature extraction network in the network model, so as to obtain detection feature information; obtaining a predicted action category of the target through a classification network in the network model based on the detection characteristic information of the training video stream; and iteratively training a network model based on an error value between a predicted action category and a labeling action category corresponding to the target, and determining the feature extraction network after training as a feature extraction model.
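A minimal PyTorch sketch of this training procedure follows. `feature_net` and `classifier` stand in for the feature extraction network and the classification network; their architectures, the optimizer, and the learning rate are assumptions, not details given by this disclosure.

```python
import torch
import torch.nn as nn

def train_feature_extractor(feature_net, classifier, loader, epochs=10):
    """Iteratively train the network model on (limb sequence data, labeled
    action category) pairs; after training, feature_net alone serves as the
    feature extraction model."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        list(feature_net.parameters()) + list(classifier.parameters()), lr=1e-3)
    for _ in range(epochs):
        for limb_seq, label in loader:
            logits = classifier(feature_net(limb_seq))  # predicted action category
            loss = criterion(logits, label)             # error vs. labeled category
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return feature_net
```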
In an embodiment, the construction of the database based on the preset action features and the corresponding preset action categories includes the following steps.
At least one pre-stored video stream corresponding to a preset action category is obtained, and each pre-stored video stream comprises multiple frames of images containing targets. The preset action category may be running, jogging, jumping, etc., and preset action categories are set according to actual screening requirements. A video stream sharing the same scene as the video stream to be identified may also be used as a pre-stored video stream.
And detecting key points of each frame of image in the pre-stored video stream, and determining a pre-stored limb characteristic sequence corresponding to the pre-stored video stream. Determining preset action features corresponding to each pre-stored video stream based on each pre-stored limb feature sequence by adopting a feature extraction model; and constructing a database based on the preset action characteristics and the corresponding preset action categories.
For example, when the preset action category is running, a pre-stored video stream corresponding to a plurality of running gestures may be obtained, and a plurality of preset action features corresponding to running may be obtained, so that feature information in the video stream to be identified may be compared with a plurality of preset action features corresponding to running. As long as the feature information in the video stream to be identified is similar to one of a plurality of preset action features, the action identification can be realized.
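A minimal sketch of the database construction described above follows; `build_sequence` and `extract_feature` are assumed helpers standing in for the key point detection step and the feature extraction model.

```python
def register_preset_actions(prestored_streams, build_sequence, extract_feature):
    """Build the preset-action database from (preset action category, frames)
    pairs: detect key points, extract the preset action feature, and register
    it under its category. Several features may share one category, e.g.
    multiple running gaits."""
    database = {}
    for category, frames in prestored_streams:
        limb_seq = build_sequence(frames)       # pre-stored limb feature sequence
        feature = extract_feature(limb_seq)     # preset action feature
        database.setdefault(category, []).append(feature)
    return database
```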
In an embodiment, the specific implementation of determining the action recognition result of the video stream to be recognized in step S3 includes the following steps.
The similarity between the action features of the target object and each preset action feature in the database is calculated. In response to a similarity exceeding the similarity threshold, the preset action category corresponding to that similarity is determined to be the action category of the target object; in response to a similarity not exceeding the similarity threshold, the preset action category corresponding to that similarity is determined not to be the action category of the target object.
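A minimal sketch of this comparison step follows; cosine similarity, the threshold value, and the assumption that the action features have been flattened to 1-D vectors are illustrative choices, since the disclosure does not fix a particular similarity measure.

```python
import numpy as np

def match_action(action_feature, database, threshold=0.8):
    """Compare the action feature of the target object (a flattened 1-D
    vector) against every preset action feature; return the preset action
    category whose similarity exceeds the threshold, or None when no
    preset action matches."""
    for category, presets in database.items():
        for preset in presets:
            sim = float(np.dot(action_feature, preset) /
                        (np.linalg.norm(action_feature) *
                         np.linalg.norm(preset) + 1e-8))
            if sim > threshold:
                return category
    return None
```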
When actions in other scenes need to be identified, or other types of preset action categories need to be screened, feature extraction can be performed on a small number of samples corresponding to those preset action categories, and the resulting preset action features can be registered in the database, so that the target actions can be identified rapidly and effectively.
According to the action recognition method provided by this embodiment, key point detection is performed on each video frame in the acquired video stream to be identified to obtain a limb feature sequence corresponding to the video stream to be identified, wherein the video stream to be identified comprises a plurality of video frames containing a target object and the limb feature sequence comprises the limb features corresponding to each of the plurality of video frames; action features corresponding to the target object are determined based on the limb feature sequence; and the action features of the target object are compared with preset action features, the action category of the target object being determined based on the comparison result, where the preset action features are obtained based on the action features of the corresponding preset action categories. Because the method detects the action features of the target object in the video stream to be identified and then compares them with the preset action features, the set of preset action features can be rapidly expanded according to whatever action categories need to be detected, so the method is suitable for many types of action recognition and improves the generalization performance of the action recognition method.
Referring to fig. 4, fig. 4 is a schematic framework diagram of an embodiment of an action recognition device according to the present invention. The present embodiment provides an action recognition device 60, which includes a detection module 61, an analysis module 62, and a determination module 63.
The detection module 61 is configured to perform key point detection on each video frame in the acquired video stream to be identified, so as to obtain a limb feature sequence corresponding to the video stream to be identified; the video stream to be identified comprises a plurality of video frames containing target objects, and the limb characteristic sequence comprises limb characteristics corresponding to each video frame in the plurality of video frames.
The analysis module 62 is configured to determine an action feature corresponding to the target object based on the limb feature sequence.
The determining module 63 is configured to compare the motion characteristics of the target object with preset motion characteristics, and determine a motion class of the target object based on the comparison result; the preset action features are obtained based on the action features of the corresponding preset action categories.
According to the action recognition device provided by this embodiment, the action features of the target object in the video stream to be recognized are detected and then compared with the preset action features, so that the preset action features can be rapidly expanded according to whatever action categories need to be detected; the device is therefore suitable for many types of action recognition and improves the generalization performance of the action recognition method.
Referring to fig. 5, fig. 5 is a schematic framework diagram of an embodiment of a terminal according to the present invention. The terminal 80 comprises a memory 81 and a processor 82 coupled to each other, the processor 82 being adapted to execute program instructions stored in the memory 81 to implement the steps of any of the above action recognition method embodiments. In one particular implementation scenario, the terminal 80 may include, but is not limited to, a microcomputer or a server; the terminal 80 may also include mobile devices such as a notebook computer or a tablet computer, which is not limited herein.
In particular, the processor 82 is operative to control itself and the memory 81 to implement the steps of any of the action recognition method embodiments described above. The processor 82 may also be referred to as a CPU (Central Processing Unit). The processor 82 may be an integrated circuit chip having signal processing capabilities. The processor 82 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 82 may be implemented jointly by a plurality of integrated circuit chips.
Referring to fig. 6, fig. 6 is a schematic framework diagram of an embodiment of a computer readable storage medium according to the present invention. The computer readable storage medium 90 stores program instructions 901 executable by a processor, the program instructions 901 being used to implement the steps of any of the above action recognition method embodiments.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments focuses on highlighting the differences between them; for parts that are the same or similar, reference may be made between the embodiments, and they are not repeated herein for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of modules or units is merely a logical functional division, and there may be other divisions in actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical, or other forms.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The foregoing is only the embodiments of the present invention, and therefore, the patent protection scope of the present invention is not limited thereto, and all equivalent structures or equivalent flow changes made by the content of the present specification and the accompanying drawings, or direct or indirect application in other related technical fields, are included in the patent protection scope of the present invention.

Claims (11)

1. An action recognition method, characterized in that the action recognition method comprises:
performing key point detection on each video frame in the acquired video stream to be identified to obtain a limb characteristic sequence corresponding to the video stream to be identified; the video stream to be identified comprises a plurality of video frames containing target objects, and the limb characteristic sequence comprises limb characteristics corresponding to each video frame in the plurality of video frames;
determining action features corresponding to the target object based on the limb feature sequence;
comparing the action characteristics of the target object with preset action characteristics, and determining the action category of the target object based on the comparison result; the preset action features are obtained based on the action features of the corresponding preset action categories.
2. The motion recognition method of claim 1, wherein the limb characteristics are determined based on at least one keypoint of the target object contained in the corresponding video frame; or based on limb posture of the target object contained in the corresponding video frame.
3. The method of claim 1, wherein,
the determining the action feature corresponding to the target object based on the limb feature sequence comprises the following steps:
extracting features of the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object;
and carrying out feature fusion based on the limb feature sequence and the loading feature sequence to obtain action features corresponding to the target object.
4. The method of claim 3, wherein the loading feature sequence comprises a length information sequence and an angle information sequence;
the feature extraction is carried out on the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object, and the feature extraction method comprises the following steps:
and determining the length information sequence and the angle information sequence according to the position relation between the key points in the limb feature sequence by the feature extraction model.
5. The method of claim 4, wherein the loading signature sequence further comprises a velocity information sequence;
the feature extraction is carried out on the limb feature sequence through a feature extraction model to obtain a loading feature sequence corresponding to the target object, and the feature extraction method comprises the following steps:
performing time feature extraction on the limb feature sequence through the feature extraction model to obtain a time sequence corresponding to the limb feature sequence;
the speed information sequence is determined based on the time sequence, the limb feature sequence.
6. The method for recognizing an action according to any one of claims 3 to 5, wherein,
the training method of the feature extraction model comprises the following steps:
acquiring a training video stream, wherein the training video stream comprises a plurality of frames of images containing targets; the training video stream is associated with limb sequence data and a labeling action category corresponding to the target;
feature extraction is carried out based on the limb sequence data through a feature extraction network in a network model, so as to obtain detection feature information;
obtaining a predicted action category of the target based on the detection characteristic information of the training video stream through a classification network in the network model;
and iteratively training the network model based on the error value between the predicted action category and the labeling action category corresponding to the target, and determining the feature extraction network after training is completed as the feature extraction model.
7. The method of action recognition according to claim 6, further comprising:
acquiring at least one pre-stored video stream corresponding to the preset action category, wherein each pre-stored video stream comprises multiple frames of images containing targets;
performing key point detection on each frame of image in the pre-stored video stream, and determining a pre-stored limb feature sequence corresponding to the pre-stored video stream;
determining preset action features corresponding to the pre-stored video stream based on the pre-stored limb feature sequence by adopting a feature extraction model;
and constructing the database based on the preset action characteristics and the corresponding preset action categories thereof.
8. The method of claim 1, wherein,
comparing the action characteristics of the target object with preset action characteristics, and determining the action category of the target object based on the comparison result, wherein the method comprises the following steps:
calculating the similarity between the action characteristics of the target object and each preset action characteristic;
and responding to the similarity exceeding a similarity threshold, and determining the preset action category corresponding to the similarity as the action category of the target object.
9. An action recognition device, characterized in that the action recognition device comprises:
the detection module is used for detecting key points of all video frames in the acquired video stream to be identified to obtain a limb characteristic sequence corresponding to the video stream to be identified; the video stream to be identified comprises a plurality of video frames containing target objects, and the limb characteristic sequence comprises limb characteristics corresponding to each video frame in the plurality of video frames;
the analysis module is used for determining action characteristics corresponding to the target object based on the limb characteristic sequence;
the determining module is used for comparing the action characteristics of the target object with preset action characteristics and determining the action category of the target object based on the comparison result; the preset action features are obtained based on the action features of the corresponding preset action categories.
10. A terminal comprising a memory, a processor and a computer program stored in the memory and running on the processor, the processor being adapted to execute program data to carry out the steps of the action recognition method according to any one of claims 1 to 8.
11. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the action recognition method according to any of claims 1-8.
CN202211737618.7A 2022-12-30 2022-12-30 Action recognition method, device, terminal and computer readable storage medium Pending CN116129523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211737618.7A CN116129523A (en) 2022-12-30 2022-12-30 Action recognition method, device, terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211737618.7A CN116129523A (en) 2022-12-30 2022-12-30 Action recognition method, device, terminal and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116129523A (en) 2023-05-16

Family

ID=86307468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211737618.7A Pending CN116129523A (en) 2022-12-30 2022-12-30 Action recognition method, device, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116129523A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912947A (en) * 2023-08-25 2023-10-20 东莞市触美电子科技有限公司 Intelligent screen, screen control method, device, equipment and storage medium thereof
CN116912947B (en) * 2023-08-25 2024-03-12 东莞市触美电子科技有限公司 Intelligent screen, screen control method, device, equipment and storage medium thereof

Similar Documents

Publication Publication Date Title
Gall et al. Hough forests for object detection, tracking, and action recognition
CN109145766B (en) Model training method and device, recognition method, electronic device and storage medium
US11527000B2 (en) System and method for re-identifying target object based on location information of CCTV and movement information of object
CN113326835B (en) Action detection method and device, terminal equipment and storage medium
CN109727275B (en) Object detection method, device, system and computer readable storage medium
CN108009466B (en) Pedestrian detection method and device
Motiian et al. Online human interaction detection and recognition with multiple cameras
JP2017062778A (en) Method and device for classifying object of image, and corresponding computer program product and computer-readable medium
US11620335B2 (en) Method for generating video synopsis through scene understanding and system therefor
Chen et al. TriViews: A general framework to use 3D depth data effectively for action recognition
CN113822254B (en) Model training method and related device
Topham et al. Human body pose estimation for gait identification: A comprehensive survey of datasets and models
CN110414360A (en) A kind of detection method and detection device of abnormal behaviour
CN112417970A (en) Target object identification method, device and electronic system
CN112597943A (en) Feature extraction method and device for pedestrian re-identification, electronic equipment and storage medium
Wang et al. Action recognition using edge trajectories and motion acceleration descriptor
CN116129523A (en) Action recognition method, device, terminal and computer readable storage medium
CN111291695A (en) Personnel violation behavior recognition model training method, recognition method and computer equipment
Oikonomopoulos et al. Sparse B-spline polynomial descriptors for human activity recognition
CN108875501B (en) Human body attribute identification method, device, system and storage medium
CN111476070A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113627334A (en) Object behavior identification method and device
Cho et al. A space-time graph optimization approach based on maximum cliques for action detection
Thome et al. Learning articulated appearance models for tracking humans: A spectral graph matching approach
CN115713806A (en) Falling behavior identification method based on video classification and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination