CN114093024A

CN114093024A - Human body action recognition method, device, equipment and storage medium

Info

Publication number: CN114093024A
Application number: CN202111104329.9A
Authority: CN
Inventors: 张哲为
Original assignee: Individual
Current assignee: Individual
Priority date: 2021-09-24
Filing date: 2021-09-24
Publication date: 2022-02-25

Abstract

The invention provides a method, a device, equipment and a storage medium for recognizing human body actions. The method comprises: acquiring three-dimensional bone key points corresponding to image data to be recognized; recognizing a corresponding action gesture based on the three-dimensional bone key points; and determining, based on the recognized action gesture, the target behavior type corresponding to the image data to be recognized by means of a finite state machine model. Because the finite state machine model is used to determine the target behavior type, recognition results corresponding to erroneous states can be accurately eliminated, which improves recognition precision and solves problems such as the low recognition precision of the prior art.

Description

Human body action recognition method, device, equipment and storage medium

Technical Field

The present invention relates to the field of motion recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recognizing human body motions.

Background

Human body action recognition has long been a popular research direction in computer vision, artificial intelligence, pattern recognition and related fields, and is widely applied in human-computer interaction, virtual reality, video retrieval, security monitoring and other fields. Existing methods fall mainly into two categories: methods based on wearable inertial sensors and methods based on computer vision. Wearable inertial sensors burden the human body, are complex to operate and are difficult to popularize in practice, so computer-vision-based methods have become the mainstream research direction. In these methods, a computer processes and analyzes image data acquired by a camera in order to learn and understand human actions and behaviors.

Currently, with the development of depth cameras and human skeleton extraction algorithms, human skeleton key point information can be acquired conveniently, and human body actions can then be recognized based on this information. In the prior art, human body actions are usually recognized with traditional time-series signal processing: the collected key point data undergo signal preprocessing and time-series analysis, and sliding-window matching is performed between the resulting time-series waveform and the waveforms in an action database to obtain the analysis result. However, the recognition accuracy of such conventional methods is low.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a storage medium for recognizing human body actions, which aim to solve the problems of low action recognition precision and the like in the prior art.

In a first aspect, an embodiment of the present invention provides a method for recognizing a human body action, including:

acquiring three-dimensional bone key points corresponding to image data to be identified;

identifying a corresponding action gesture based on the three-dimensional bone key points;

and determining the target behavior type corresponding to the image data to be recognized by adopting a finite state machine model based on the recognized action gesture.

In a second aspect, an embodiment of the present invention provides an apparatus for recognizing a human body motion, including:

the acquisition module is used for acquiring three-dimensional bone key points corresponding to the image data to be identified;

the first processing module is used for identifying corresponding action gestures based on the three-dimensional skeleton key points;

and the second processing module is used for determining the target behavior type corresponding to the image data to be recognized by adopting a finite state machine model based on the recognized action gesture.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a transceiver, and at least one processor;

the processor, the memory and the transceiver are interconnected through a circuit;

the memory stores computer-executable instructions; the transceiver is used for receiving image data sent by the image acquisition equipment;

the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the method as set forth in the first aspect above and in various possible designs of the first aspect.

In a fourth aspect, the present invention provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the method according to the first aspect and various possible designs of the first aspect is implemented.

With the human body action recognition method, device, equipment and storage medium provided by the invention, the corresponding action gesture is recognized based on the three-dimensional bone key points corresponding to the image data to be recognized, and the target behavior type corresponding to that image data is determined by a finite state machine model. Recognition results of erroneous states are thereby accurately eliminated, recognition precision is improved, and problems such as the low recognition precision of the prior art are solved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a schematic flow chart of a human body action recognition method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a human body action recognition method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a training framework of a lightweight skeletal neural network model according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of an exemplary network structure of the LSTM neural network model according to an embodiment of the present invention;

Fig. 5 is an exemplary flowchart of a human body action recognition method according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a human body action recognition device according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

The above figures show certain embodiments of the invention, which are described in more detail below. The drawings and the description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate it for those skilled in the art with reference to specific embodiments.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.

The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.

An embodiment of the invention provides a human body action recognition method for action recognition in fields such as human-computer interaction, virtual reality and security monitoring. The execution subject of this embodiment is a human body action recognition device, which may be disposed in an electronic device. The electronic device may be a mobile terminal, a server or any other implementable computer equipment; the mobile terminal may be a mobile phone, a tablet or any other implementable mobile terminal, and the server may be a single server or a server cluster.

Fig. 1 is a schematic flow chart of the human body action recognition method provided in this embodiment. The method includes:

Step 101: acquiring three-dimensional bone key points corresponding to the image data to be recognized.

Specifically, the image data to be recognized may be original image data acquired by any image pickup device, or image data obtained by preprocessing the original image data, as set according to actual requirements. The image data to be recognized may include at least one frame, for example multiple frames obtained by splitting a segment of video data into frames, or multiple frames extracted from the video data according to a certain rule. From each frame, three-dimensional (3D) skeleton key points can be obtained for one or more human body objects, each human body object corresponding to one group of 3D skeleton key points, for example a group covering main joints such as the head, the neck and the four limbs. The image pickup device may be, for example, a monocular camera, a binocular camera or a TOF camera. The three-dimensional skeleton key points corresponding to the image data to be recognized may be obtained in any implementable manner, which is not limited in this embodiment.
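The rule for extracting frames from video data is left open above. As a minimal sketch, assuming a fixed-stride rule, every `stride`-th decoded frame is kept; the function name `sample_frames` and its parameters are illustrative and not part of the source.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_frames(frames: Sequence[T], stride: int, max_frames: Optional[int] = None) -> List[T]:
    """Keep every `stride`-th frame of a decoded video, optionally capped at `max_frames`."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    picked = list(frames[::stride])  # frames 0, stride, 2*stride, ...
    return picked if max_frames is None else picked[:max_frames]


# Usage with placeholder frame indices instead of real images:
print(sample_frames(list(range(10)), stride=3))  # [0, 3, 6, 9]
```

Other rules (for example keeping only frames with significant motion) would fit the same interface.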

For multi-person recognition, the invention can combine part affinity field (PAF) vector fields with key point heatmaps (HeatMap) to recognize multiple persons in real time.
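To illustrate how a PAF vector field can link joints belonging to the same person, the sketch below approximates the line-integral association score commonly used in PAF-based bottom-up methods: a candidate limb between two heatmap peaks scores highly when the field sampled along the segment points in the limb's direction. The grid layout `paf_x[y][x]` / `paf_y[y][x]`, the sample count and the function name are assumptions; the subsequent matching of candidates into skeletons is not shown.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]  # one PAF component, indexed as grid[y][x]


def paf_score(paf_x: Grid, paf_y: Grid, p1: Tuple[float, float], p2: Tuple[float, float],
              num_samples: int = 10) -> float:
    """Average dot product between the PAF and the unit vector from joint p1 to joint p2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm  # direction of the candidate limb
    total = 0.0
    for s in range(num_samples):
        t = s / (num_samples - 1) if num_samples > 1 else 0.0
        x = int(round(p1[0] + t * dx))  # nearest grid cell on the segment
        y = int(round(p1[1] + t * dy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples
```

A field pointing along +x everywhere gives a score of 1.0 for a left-to-right limb and -1.0 for the reversed pair, so candidate pairs with the highest scores are the most plausible limbs.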

For example, when the image data to be recognized are acquired by a monocular camera, two-dimensional (2D) human body key point detection may first be performed on the image data to obtain two-dimensional bone key points, which are then converted into three-dimensional space in a suitable manner to obtain the corresponding three-dimensional bone key points.

For another example, when the image data to be recognized are acquired by a binocular camera or another camera capable of simultaneously acquiring depth information, the image data are three-dimensional, and three-dimensional human body key point detection can be performed directly on them to obtain the corresponding three-dimensional bone key points.

Step 102: recognizing the corresponding action posture based on the three-dimensional skeleton key points.

Specifically, after the three-dimensional bone key points corresponding to the image data to be recognized are obtained, action recognition can be performed on them to determine the corresponding action posture. For instance, a large number of human body actions can be collected and classified in advance to build an action library. At recognition time, the three-dimensional skeleton key points are classified by a trained action classification network model, each class corresponding to one action posture, so that the action posture corresponding to the key points is determined.

The action classification network model may be any implementable network model, set according to actual requirements; this embodiment does not limit it.

Illustratively, a convolutional neural network followed by a Sigmoid layer can be used to classify actions and obtain the action postures corresponding to the three-dimensional bone key points.
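The convolutional backbone is not specified, so the sketch below covers only the Sigmoid output stage. It assumes the network emits one logit per action posture in the action library; the confidence threshold and returning `None` for "no confident posture" are illustrative assumptions.

```python
import math
from typing import List, Optional


def sigmoid(z: float) -> float:
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def pose_from_logits(logits: List[float], labels: List[str], threshold: float = 0.5) -> Optional[str]:
    """Map per-posture logits to the most probable posture label, or None if none is confident."""
    probs = [sigmoid(z) for z in logits]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None


print(pose_from_logits([-2.0, 3.0, 0.1], ["stand", "squat", "lunge"]))  # squat
```

Because each Sigmoid output is independent, a threshold gives a natural way to reject frames that match no posture in the library.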

Step 103: determining, based on the recognized action posture, the target behavior type corresponding to the image data to be recognized by means of a finite state machine model.

Specifically, behavior types are distinguished by specific combinations of actions, such as different fitness exercises or different dances, each behavior type corresponding to a group of specific actions. After the action gesture is obtained, a finite state machine model may be used to determine, based on the recognized action gesture, the behavior type corresponding to the image data to be recognized; for distinction, this is referred to as the target behavior type. Alternatively, for a specific behavior type, the finite state machine model may be used to determine, based on the recognized action posture, whether the image data to be recognized belong to that target behavior type, for example to guide or correct the movements of an exerciser; this may be set according to actual requirements. That is, step 103 may either analyze which behavior type the recognized action gestures belong to, or determine whether they belong to a specific behavior type; this is not specifically limited.

A finite state machine (FSM), also known as a finite state automaton or simply a state machine, is a mathematical model that represents a finite number of states together with the transitions and actions between these states.

Determining the target behavior type with the finite state machine model specifically includes establishing state transition diagrams for the different behavior types in advance and determining the target behavior type based on the recognized action postures and these diagrams. With the finite state machine model, only the postures of key states need to be recognized for a specific behavior type: once a certain action state has been recognized, only the state to which it transitions next is checked, rather than recognizing the action in every frame. On the one hand, this effectively reduces the amount of computation and increases processing speed; on the other hand, the finite state machine accurately eliminates recognition results of erroneous states, further improving recognition precision.
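As a minimal sketch of choosing among several behavior types, assume each type's state transition diagram is a simple chain of key poses (the pose and type names below are hypothetical). For every type, only the next expected key pose is compared with the incoming posture, and other postures are simply ignored; this tolerant handling of non-key frames is one possible design, not necessarily the one used in the invention.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical key-pose chains, one per behavior type
DIAGRAMS: Dict[str, List[str]] = {
    "squat": ["stand", "squat_down", "stand"],
    "jumping_jack": ["stand", "arms_up_legs_apart", "stand"],
}


def detect_behavior(poses: Iterable[str], diagrams: Dict[str, List[str]]) -> Optional[str]:
    """Return the first behavior type whose key-pose chain is completed by the posture stream."""
    progress = {name: 0 for name in diagrams}  # index of the next expected key pose per type
    for pose in poses:
        for name, chain in diagrams.items():
            # Only the next transition state is checked; any other posture is ignored
            if pose == chain[progress[name]]:
                progress[name] += 1
                if progress[name] == len(chain):
                    return name
    return None


print(detect_behavior(["stand", "stand", "squat_down", "lunge", "stand"], DIAGRAMS))  # squat
```

Each incoming posture costs one comparison per behavior type, which reflects the reduced computation described above.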

In the human body action recognition method provided by this embodiment, the action gesture is recognized from the three-dimensional bone key points corresponding to the image data to be recognized, and the target behavior type is determined by a finite state machine model. Recognition results of erroneous states are accurately eliminated, recognition precision is improved, and problems such as the low recognition precision of the prior art are solved. Moreover, only the gestures of key states are recognized for a specific behavior type: once a certain action state is recognized, only its next transition state is checked, so the action in each frame need not be recognized, which effectively reduces computation and increases processing speed.

In order to make the technical solution of the present invention clearer, the method of the above embodiment is further described in another embodiment of the present invention.

Fig. 2 is an exemplary schematic flowchart of the human body action recognition method provided in this embodiment.

As one implementable manner, to reduce the image acquisition cost, on the basis of the foregoing embodiment, acquiring the three-dimensional bone key points corresponding to the image data to be recognized optionally includes:

Step 2011: acquiring the image data to be recognized.

Step 2012: determining the corresponding two-dimensional bone key points based on the image data to be recognized.

Step 2013: determining the corresponding three-dimensional bone key points based on the two-dimensional bone key points.

Specifically, to reduce the image acquisition cost, the acquired image data to be recognized need not include depth information. The required three-dimensional bone key points can be obtained by first extracting two-dimensional bone key points and then reconstructing the corresponding three-dimensional bone key points from them. In this way, only a monocular camera is needed to acquire images; the acquired image data undergo certain preprocessing (such as scaling, denoising or brightness enhancement, set according to actual requirements) and are then used as the image data to be recognized, which may be, for example, RGB data. The corresponding two-dimensional bone key points are first determined from the image data to be recognized. Any implementable detection approach may be used, for example algorithms or models following the top-down approach (e.g., Mask R-CNN, DensePose, or CPN (Cascaded Pyramid Network)) or the bottom-up approach (e.g., the CPM (Convolutional Pose Machines) algorithm, OpenPose, depthsenor 325, or adaptive Embedding). After the two-dimensional bone key points are obtained, the corresponding three-dimensional bone key points are determined from them. The 2D-to-3D processing can be set according to actual requirements, for example by 3D reconstruction that converts the 2D bone key point information into 3D space to obtain the 3D bone key points.

Illustratively, the 2D skeletal key point information can be converted into 3D space with a DensePose-RCNN model. This model uses an R-CNN structure with Feature Pyramid Network (FPN) features and obtains dense part labels and coordinates within each selected region through RoIAlign pooling for region feature aggregation. The surface of the person in the 2D image is projected onto a 3D body surface: the 3D body surface model is divided into 24 parts, a UV coordinate system is constructed for each part, and each point of a body part in the 2D image is mapped to the corresponding 3D surface part.

Exemplarily, the transformation from 2D skeleton key point coordinates to 3D space may also be implemented by coordinate parameter transformation, for example between camera coordinates and world coordinates: an intrinsic matrix is obtained by camera calibration or parameter estimation, an extrinsic matrix is then calculated, and the 2D-to-3D mapping is performed with the intrinsic and extrinsic matrices. The specific transformation process is prior art and is not described here again.
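The coordinate transformation can be sketched with the standard pinhole camera model. Note the assumption: a single pixel only fixes a ray, so a depth value for each key point must come from somewhere (for example an estimate); here it is simply passed in. The parameter names (`fx`, `fy`, `cx`, `cy` for the intrinsics, `R`, `t` for the extrinsics with X_c = R X_w + t) follow common convention and are not taken from the source.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float,
                R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Map pixel (u, v) with an assumed camera-frame depth to world coordinates."""
    # Pixel -> camera coordinates using the intrinsic parameters
    xc = (u - cx) * depth / fx
    yc = (v - cy) * depth / fy
    zc = depth
    # Camera -> world: X_c = R X_w + t, so X_w = R^T (X_c - t)
    d = (xc - t[0], yc - t[1], zc - t[2])
    return (
        sum(R[r][0] * d[r] for r in range(3)),
        sum(R[r][1] * d[r] for r in range(3)),
        sum(R[r][2] * d[r] for r in range(3)),
    )


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(backproject(820, 240, 2.0, 500, 500, 320, 240, identity, (0, 0, 0)))  # (2.0, 0.0, 2.0)
```

Applying the function to every 2D skeleton key point of a frame yields one group of 3D key points.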

In this embodiment, images are collected with a monocular camera and three-dimensional skeleton key points can still be detected, so expensive image acquisition equipment such as binocular or TOF cameras is not required. This reduces cost on the one hand and, on the other, helps the human body action recognition method run on mobile terminals.

Further, to improve the real-time performance of action recognition, determining the corresponding two-dimensional bone key points based on the image data to be recognized includes: determining them based on the image data to be recognized and a trained lightweight bone neural network model, where the lightweight skeletal neural network model is trained by knowledge distillation based on the Hourglass architecture.

Specifically, the lightweight skeletal neural network model is obtained by pre-training on first training data and corresponding first label data, where the first training data are training image data and the first label data are the corresponding two-dimensional skeletal key point annotations. This embodiment trains the lightweight network by teacher-student training with knowledge distillation, and the training network uses a lightweight Hourglass framework. In the training phase, soft targets produced by a complex teacher network are introduced as part of the total loss to guide the training of a student network, thereby transferring knowledge. Compared with the teacher network, the student network has a simplified structure and low complexity; the lightweight skeletal neural network model used for two-dimensional skeletal key point detection in this embodiment is the trained student network. The specific training process and principle are not described in detail here.
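The document does not give the loss formula. A minimal sketch in the style of heatmap distillation for pose estimation, assuming the total loss is a weighted sum of the student's error against the labels and against the teacher's soft-target heatmaps; the weight `alpha` and the use of mean squared error are assumptions.

```python
from typing import List

Heatmap = List[List[float]]  # one key point heatmap, indexed [row][col]


def mse(a: Heatmap, b: Heatmap) -> float:
    # Mean squared error between two heatmaps of the same shape
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def distill_loss(student: List[Heatmap], teacher: List[Heatmap], target: List[Heatmap],
                 alpha: float = 0.5) -> float:
    """alpha * (student vs. labels) + (1 - alpha) * (student vs. teacher soft targets)."""
    k = len(student)  # number of key points
    hard = sum(mse(s, g) for s, g in zip(student, target)) / k
    soft = sum(mse(s, t) for s, t in zip(student, teacher)) / k
    return alpha * hard + (1 - alpha) * soft
```

Setting `alpha = 1` reduces this to ordinary supervised training; smaller values let the teacher's predictions shape the student more strongly.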

Exemplarily, Fig. 3 is a schematic diagram of the training framework of the lightweight bone neural network model provided in this embodiment. The overall training network adopts the Hourglass architecture, the lightweight bone neural network is part of the student network, and the model is compressed by knowledge distillation to achieve fast inference. [h, w, c] is the image data format, denoting the height, width and number of channels of the image, respectively, and Distill loss denotes the distillation loss. After knowledge distillation compression, the channel number c of the image in the teacher network is compressed into c_1, c_2, …, c_k, respectively, where k can be set according to actual requirements. The principle of the specific training process is prior art and is not described in detail here.
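Fig. 3 gives no concrete channel values. To show why compressing the channel number c lightens the network, the arithmetic below counts the parameters of a square convolution layer; the widths 256 and 128 are hypothetical, and `kernel` is the convolution kernel size (not the k above).

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3, bias: bool = True) -> int:
    """Number of learnable parameters in a kernel x kernel convolution layer."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


print(conv_params(256, 256))  # 590080
print(conv_params(128, 128))  # 147584
```

Halving both input and output channels cuts the weight count by roughly a factor of four, which is where most of the speed-up of a channel-compressed student comes from.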

On the neural network side, knowledge distillation and a simplified lightweight bone neural network model are used for two-dimensional bone key point detection. This effectively increases recognition speed while preserving recognition precision, which improves the real-time performance of action recognition and further helps the method run on mobile terminals.

Further, to improve real-time performance even more, determining the corresponding three-dimensional bone key points based on the two-dimensional bone key points includes: determining them from the two-dimensional bone key points with a trained lightweight mapping neural network model.
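The structure of the lightweight mapping network is not disclosed. The sketch below assumes a small fully connected network with one residual block that maps J two-dimensional key points to J three-dimensional ones; the class name, layer sizes and weights are hypothetical, and the weights are untrained placeholders (training on paired 2D/3D data is not shown).

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def linear(w: List[Vector], b: Vector, x: Vector) -> Vector:
    # Dense layer: W x + b
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Tuple[List[Vector], Vector]:
    # Small random weights and zero biases (untrained placeholder values)
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out


class Lifter2Dto3D:
    """Hypothetical mapping network: J x (u, v) -> J x (x, y, z)."""

    def __init__(self, num_joints: int, hidden: int = 64, seed: int = 0):
        rng = random.Random(seed)
        self.num_joints = num_joints
        self.inp = init_layer(2 * num_joints, hidden, rng)
        self.mid = init_layer(hidden, hidden, rng)
        self.out = init_layer(hidden, 3 * num_joints, rng)

    def __call__(self, kps2d: Sequence[Tuple[float, float]]) -> List[Tuple[float, float, float]]:
        x = [coord for (u, v) in kps2d for coord in (u, v)]  # flatten to length 2J
        h = relu(linear(*self.inp, x))
        h = [a + r for a, r in zip(h, relu(linear(*self.mid, h)))]  # residual block
        y = linear(*self.out, h)  # length 3J
        return [(y[3 * j], y[3 * j + 1], y[3 * j + 2]) for j in range(self.num_joints)]
```

A fully connected design keeps the per-frame cost to a few small matrix products, which suits the real-time goal stated above.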

As another implementable manner, to further improve the real-time performance of action recognition, on the basis of the above embodiment, recognizing the corresponding action posture based on the three-dimensional bone key points optionally includes: recognizing it from the three-dimensional bone key points with a trained LSTM neural network model.

Specifically, the LSTM neural network model is a lightweight time-series analysis network that performs time-series analysis on the obtained three-dimensional bone key points to recognize the corresponding action postures.

Exemplarily, Fig. 4 shows an exemplary network structure of the LSTM neural network model provided in this embodiment. I = [I_1, I_2, …, I_n] is the 3D skeletal key point data, where I_1 to I_n are the coordinates of the three-dimensional bone key points arranged as a time-ordered sequence. Time-series analysis is performed by LSTM network units, each comprising at least one hidden layer and a fully connected (FC) layer. The LSTM neural network has a bidirectional structure: a concat layer joins the forward and backward LSTM outputs, a prediction result is obtained through a Sigmoid layer, and the action posture recognition result at the current moment is determined from the prediction result. In [H_i, C_i] (i = 1, 2, …, n), C_i denotes the cell state and H_i denotes the hidden state that the forget gate reads: the forget gate reads H_i and I_{i+1} and outputs a value between 0 and 1 for each element of the cell state C_i, where 1 means "completely retain" and 0 means "completely discard". The operating principle of the LSTM unit is prior art and is not described here again; the value of n may be set according to actual requirements and is not limited in this embodiment.
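A dependency-free sketch of the bidirectional structure described above: standard LSTM cells (forget, input, candidate and output gates reading [H_i, I_{i+1}]), one pass forward and one pass over the reversed sequence, concatenation of the two final hidden states, then an FC layer with Sigmoid outputs per action posture. Using only the final hidden state of each direction is one simple choice; the sizes and random weights are illustrative and untrained.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def affine(w: List[Vector], b: Vector, x: Vector) -> Vector:
    # Dense layer: W x + b
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class LSTMCell:
    """One LSTM cell; every gate reads the concatenation [H_i, I_{i+1}]."""

    def __init__(self, n_in: int, n_hidden: int, rng: random.Random):
        self.n_hidden = n_hidden
        n_cat = n_hidden + n_in
        # Separate weights per gate: f (forget), i (input), g (candidate), o (output)
        self.w = {k: [[rng.gauss(0.0, 0.1) for _ in range(n_cat)] for _ in range(n_hidden)] for k in "figo"}
        self.b = {k: [0.0] * n_hidden for k in "figo"}

    def step(self, x: Vector, h: Vector, c: Vector) -> Tuple[Vector, Vector]:
        z = h + x  # list concatenation [H_i, I_{i+1}]
        f = [sigmoid(v) for v in affine(self.w["f"], self.b["f"], z)]  # 1 keep, 0 discard
        i = [sigmoid(v) for v in affine(self.w["i"], self.b["i"], z)]
        g = [math.tanh(v) for v in affine(self.w["g"], self.b["g"], z)]
        o = [sigmoid(v) for v in affine(self.w["o"], self.b["o"], z)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def final_hidden(cell: LSTMCell, seq: Sequence[Vector]) -> Vector:
    h, c = [0.0] * cell.n_hidden, [0.0] * cell.n_hidden
    for x in seq:
        h, c = cell.step(x, h, c)
    return h


def bilstm_predict(seq: Sequence[Vector], fwd: LSTMCell, bwd: LSTMCell,
                   w_fc: List[Vector], b_fc: Vector) -> Vector:
    """Concat forward/backward final states, then FC + Sigmoid per action posture."""
    feat = final_hidden(fwd, seq) + final_hidden(bwd, list(reversed(seq)))
    return [sigmoid(v) for v in affine(w_fc, b_fc, feat)]


# Usage: 5 frames of 4 joints, each frame flattened to (x, y, z) * 4
rng = random.Random(0)
num_joints, hidden, num_poses = 4, 8, 3
seq = [[0.01 * (t + j) for j in range(3 * num_joints)] for t in range(5)]
fwd, bwd = LSTMCell(3 * num_joints, hidden, rng), LSTMCell(3 * num_joints, hidden, rng)
w_fc = [[rng.gauss(0.0, 0.1) for _ in range(2 * hidden)] for _ in range(num_poses)]
probs = bilstm_predict(seq, fwd, bwd, w_fc, [0.0] * num_poses)
```

In a framework such as PyTorch the same structure would normally be expressed with a built-in bidirectional LSTM layer; the hand-written cell here only makes the gate arithmetic explicit.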

The LSTM neural network model is obtained by pre-training on second training data and corresponding second label data, where the second training data are three-dimensional skeleton key point data for training and the second label data are the corresponding action posture annotations. An action library containing a large number of action postures can be preset, and training data are prepared for each action posture. The resulting LSTM neural network model analyzes the historical time series in real time and accurately outputs action posture recognition results in real time.
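Real-time analysis of a historical time series is commonly organized around a fixed-length window of recent frames. The sketch below assumes a window of n frames, a `classify` callback standing in for the trained LSTM model, and a labeling convention in which each training window takes the posture of its last frame; all of these names and conventions are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence, Tuple

Frame = List[float]  # flattened 3D key points of one frame


class PoseWindow:
    """Keeps the most recent n frames and classifies once the window is full."""

    def __init__(self, n: int, classify: Callable[[List[Frame]], str]):
        self.buf: Deque[Frame] = deque(maxlen=n)
        self.classify = classify

    def push(self, frame: Frame) -> Optional[str]:
        self.buf.append(frame)  # the oldest frame is dropped automatically
        if len(self.buf) < self.buf.maxlen:
            return None  # not enough history yet
        return self.classify(list(self.buf))


def make_training_windows(frames: Sequence[Frame], labels: Sequence[str],
                          n: int) -> List[Tuple[List[Frame], str]]:
    """Cut a labelled sequence into (window, label) pairs; label = posture at the window's last frame."""
    return [(list(frames[i - n + 1 : i + 1]), labels[i]) for i in range(n - 1, len(frames))]
```

Using the same window length for training and inference keeps the model's input shape consistent between the two phases.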

Using the lightweight LSTM neural network model to recognize action postures in this embodiment effectively reduces the computation required for recognition and increases recognition speed, further improving real-time performance; the model is suitable for deployment on mobile terminals, which improves portability.

As another practical way, determining the target behavior type corresponding to the image data to be recognized by using a finite-state machine model based on the recognized action gesture includes: and determining the target behavior type corresponding to the image data to be recognized based on the recognized action posture and the state transition diagram corresponding to each behavior type.

Specifically, at any moment the finite state machine is in one state of its finite state set. When it receives an input character, it either transitions from the current state to another state or remains in the current state. Any FSM can be described by a state transition diagram, in which each node represents a state of the FSM and each directed weighted edge represents the state change caused by an input character. If the diagram has no directed edge corresponding to the current state and the input character, the FSM enters a dead state and remains in it thereafter. The state transition diagram has two special states, the start state and the end state: the start state is the initial state of the FSM, and the end state indicates that the input character sequence has been recognized successfully, which in the present invention means that the input action sequence has been recognized successfully. The state transition diagram of each behavior type can be set up based on a Markov state chain. Once a certain action posture is recognized, only the next transition state needs to be recognized; the action in every frame need not be recognized, only the key action postures. Recognition results of erroneous states are eliminated based on the state transition diagram, which improves recognition precision without complex computation. Because the state transition diagram is set from prior knowledge for the specific behavior type, the behavior type of an action can conveniently be determined from its key states.
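The acceptor semantics described above (start state, end state, and a dead state entered when no edge matches) can be sketched as follows. The squat diagram, its state names and its pose labels are hypothetical examples; self-loops let the machine stay in a state while the same posture is held over consecutive inputs.

```python
from typing import Dict, Iterable, Tuple

DEAD = "<dead>"


class PoseFSM:
    """Deterministic acceptor over recognized postures, with an implicit dead state."""

    def __init__(self, start: str, end: str, transitions: Dict[Tuple[str, str], str]):
        self.start, self.end, self.transitions = start, end, transitions

    def accepts(self, poses: Iterable[str]) -> bool:
        state = self.start
        for pose in poses:
            state = self.transitions.get((state, pose), DEAD)  # missing edge -> dead state
            if state == DEAD:
                return False  # the FSM would stay dead for the rest of the input
        return state == self.end


# Hypothetical diagram for one squat repetition
SQUAT = PoseFSM("start", "end", {
    ("start", "stand"): "standing",
    ("standing", "stand"): "standing",
    ("standing", "half_squat"): "descending",
    ("descending", "half_squat"): "descending",
    ("descending", "full_squat"): "bottom",
    ("bottom", "full_squat"): "bottom",
    ("bottom", "half_squat"): "ascending",
    ("ascending", "half_squat"): "ascending",
    ("ascending", "stand"): "end",
})

print(SQUAT.accepts(["stand", "half_squat", "full_squat", "full_squat", "half_squat", "stand"]))  # True
print(SQUAT.accepts(["stand", "full_squat"]))  # False
```

The second sequence skips the intermediate posture, so no edge matches and the erroneous state is rejected rather than counted.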

Exemplarily, Fig. 5 is an exemplary flow chart of the human body action recognition method provided in this embodiment.

This embodiment combines the above embodiments. A lightweight skeletal neural network model performs two-dimensional skeletal key point detection, a mapping neural network model maps the two-dimensional skeletal key points into three-dimensional space to obtain the corresponding three-dimensional skeletal key points, a lightweight LSTM neural network model performs historical time-series analysis on the three-dimensional skeletal key points to recognize action gestures, and the corresponding behavior type is then determined from the state transition diagrams of the behavior types. The recognition method is therefore lightweight, requires no complex computation and can be deployed on mobile terminals, enabling real-time human action recognition on a mobile terminal. Three-dimensional coordinate information of the human skeleton can be extracted in real time with a monocular camera, without complex or expensive sensors, so a mobile terminal can use its front camera to perform action gesture recognition and behavior type recognition, which improves real-time performance, practicability and portability. In terms of precision, a skeleton recognition accuracy of mAP 45 is obtained on the COCO dataset; the processing speed exceeds 20 fps on a low-end chip such as the Snapdragon 660 and reaches 75 fps on a high-end Snapdragon 845 chip.

It should be noted that the respective implementable modes in the embodiment may be implemented individually, or may be implemented in combination in any combination without conflict, and the present invention is not limited thereto.

In the human body action recognition method provided by this embodiment, two-dimensional skeleton key points are determined first and three-dimensional skeleton key points are then determined from them. Complex or expensive sensors such as binocular or TOF cameras are therefore unnecessary for image acquisition; a monocular camera suffices, which effectively reduces image acquisition cost, improves portability and facilitates application on mobile terminals. Two-dimensional skeleton key point detection is performed by the lightweight skeleton neural network model, which effectively reduces complexity and increases recognition speed while maintaining precision, further improving real-time performance and portability. The 2D-to-3D conversion is performed by a lightweight mapping network, and a lightweight LSTM neural network model then analyzes the time series of three-dimensional bone key points, recognizing action postures accurately and in real time by combining the historical time series, which further improves real-time performance and portability. Action postures are analyzed through the state transition diagram to finally determine the behavior type, which effectively improves recognition precision without introducing complex computation, improving practicability and effectiveness. Taken together, these aspects allow the method to be deployed conveniently on mobile terminals, to extract human skeleton key points accurately in real time through the mobile terminal's camera, and to recognize behavior types accurately.

The method can be applied to recognize various behavior types in scenes where specific behaviors of a person need to be monitored or guided. In fitness, exercise, and dance training scenes, for example, it can determine whether a person's movements are correct and issue reminders or guidance accordingly; in scenes that require certain specific operations, such as disinfection, it can monitor whether the relevant person performs the required operations. The specific application scenario can be set according to actual requirements, and the embodiments of the invention are not limited in this respect.

Still another embodiment of the present invention provides an apparatus for recognizing human body actions, which is used for executing the method of the above embodiment.

Fig. 6 is a schematic structural diagram of the human body action recognition device provided in this embodiment. The device 30 comprises an acquisition module 31, a first processing module 32, and a second processing module 33.

The acquisition module 31 is configured to acquire three-dimensional bone key points corresponding to image data to be recognized; the first processing module 32 is configured to identify a corresponding action posture based on the three-dimensional bone key points; and the second processing module 33 is configured to determine, by means of a finite state machine model, the target behavior type corresponding to the image data to be recognized based on the recognized action posture.

Specifically, the acquisition module may obtain the image data to be recognized from an image acquisition device such as a camera, derive the corresponding three-dimensional bone key points from it, and send them to the first processing module. The first processing module identifies the corresponding action posture from the three-dimensional bone key points and sends it to the second processing module, which then uses the finite state machine model to determine, from the recognized action posture, the target behavior type corresponding to the image data to be recognized.
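The module wiring described above can be pictured as three pluggable stages called in sequence. The following Python sketch is illustrative only: the class `RecognitionDevice`, the helper `make_toy_device`, and the toy hip-height rule are hypothetical stand-ins rather than the networks of this embodiment, and the stateful `decide` closure merely mimics the role of the finite state machine in the second processing module.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class RecognitionDevice:
    """Hypothetical sketch of device 30 with modules 31-33 as pluggable callables."""

    # Module 31: image data -> three-dimensional bone key points
    acquisition_module: Callable[[object], List[Point3D]]
    # Module 32: three-dimensional bone key points -> action posture label
    first_processing_module: Callable[[List[Point3D]], str]
    # Module 33: action posture -> target behavior type, or None while undecided
    second_processing_module: Callable[[str], Optional[str]]

    def process_frame(self, image: object) -> Optional[str]:
        # Data flows 31 -> 32 -> 33, as in the description above.
        keypoints_3d = self.acquisition_module(image)
        posture = self.first_processing_module(keypoints_3d)
        return self.second_processing_module(posture)


def make_toy_device() -> RecognitionDevice:
    """Wire the device with toy stand-ins so the data flow can be exercised."""
    history: List[str] = []

    def acquire(image: object) -> List[Point3D]:
        # Toy assumption: the "image" is already a hip height in metres.
        return [(0.0, float(image), 0.0)]

    def classify(points: List[Point3D]) -> str:
        # Toy posture rule: a low hip means "down", otherwise "up".
        return "down" if points[0][1] < 0.5 else "up"

    def decide(posture: str) -> Optional[str]:
        # Stateful stand-in for the finite state machine of module 33.
        history.append(posture)
        return "squat" if history[-3:] == ["up", "down", "up"] else None

    return RecognitionDevice(acquire, classify, decide)


if __name__ == "__main__":
    device = make_toy_device()
    print([device.process_frame(h) for h in (0.9, 0.3, 0.9)])
```

Feeding the three toy frames prints `[None, None, 'squat']`: the behavior type is reported only once the posture history is complete, which is why the second stage is modelled as stateful while the first two are pure functions of their input.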

The specific manner in which each module performs its operations has been described in detail in the method embodiment, achieves the same technical effect, and is not repeated here.

To describe the device of the present invention more clearly, a further embodiment of the present invention elaborates on the device provided by the above embodiment.

As one implementable manner, on the basis of the above embodiment and in order to reduce image acquisition cost, the acquisition module optionally includes an obtaining submodule, a processing submodule, and a determining submodule.

The obtaining submodule is configured to obtain the image data to be recognized; the processing submodule is configured to determine corresponding two-dimensional bone key points based on the image data to be recognized; and the determining submodule is configured to determine the corresponding three-dimensional bone key points based on the two-dimensional bone key points.

Specifically, the obtaining submodule may take the image data to be recognized directly from the image acquisition device, or may preprocess the image data collected by that device to produce it. The obtaining submodule sends the image data to be recognized to the processing submodule, which determines the corresponding two-dimensional bone key points and sends them to the determining submodule; the determining submodule then determines the corresponding three-dimensional bone key points and sends them to the first processing module.
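The patent does not specify the form in which the two-dimensional bone key points are handed from the processing submodule to the determining submodule. A common convention in two-dimensional to three-dimensional lifting, assumed here and not taken from this embodiment, is to centre the points on a root joint, divide by a reference bone length, and flatten them into one vector. The names `Keypoint2D` and `normalize_for_lifting`, and the choice of joints 0 and 1 as root and reference, are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Sequence


@dataclass
class Keypoint2D:
    x: float  # image column in pixels
    y: float  # image row in pixels


def normalize_for_lifting(points: Sequence[Keypoint2D], root: int = 0, ref: int = 1) -> List[float]:
    """Centre on joint `root`, scale by the root-to-`ref` distance, flatten to [x0, y0, x1, y1, ...]."""
    rx, ry = points[root].x, points[root].y
    scale = hypot(points[ref].x - rx, points[ref].y - ry)
    if scale == 0.0:
        raise ValueError("root and reference joints coincide; cannot normalize")
    flat: List[float] = []
    for p in points:
        # Translation removes the person's position in the frame; scaling removes camera distance.
        flat.extend(((p.x - rx) / scale, (p.y - ry) / scale))
    return flat


if __name__ == "__main__":
    pts = [Keypoint2D(100, 200), Keypoint2D(100, 100), Keypoint2D(150, 200)]
    print(normalize_for_lifting(pts))  # [0.0, 0.0, 0.0, -1.0, 0.5, 0.0]
```

Normalizing this way makes the input to the mapping network independent of where the person stands and how far they are from a monocular camera, which is one reason such preprocessing is popular for lightweight lifting models.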

Further, to improve the real-time performance of action recognition, the processing submodule is specifically configured to determine the corresponding two-dimensional bone key points based on the image data to be recognized and a trained lightweight skeletal neural network model, the lightweight skeletal neural network model being a student network obtained by knowledge distillation training based on an Hourglass architecture.
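Claim 3 states that the lightweight model is a student network trained by knowledge distillation from an Hourglass-based model. The patent does not give the loss; one widely used form for heatmap-based pose estimators, assumed here, blends a ground-truth term with a term that pulls the student's heatmaps toward the teacher's: L = α·MSE(S, G) + (1 − α)·MSE(S, T). The sketch below computes that loss on flattened heatmaps; `alpha`, `mse`, and `distillation_loss` are hypothetical names.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two flattened heatmaps of equal size."""
    if len(a) != len(b) or not a:
        raise ValueError("heatmaps must be non-empty and of equal size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student: Sequence[float], teacher: Sequence[float],
                      ground_truth: Sequence[float], alpha: float = 0.5) -> float:
    """alpha weights fitting the labels; (1 - alpha) weights imitating the Hourglass teacher."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * mse(student, ground_truth) + (1.0 - alpha) * mse(student, teacher)


if __name__ == "__main__":
    # Student matches the teacher exactly but misses one ground-truth peak.
    print(distillation_loss([0.0, 1.0], [0.0, 1.0], [1.0, 1.0]))  # 0.25
```

During training the teacher's heatmaps would be computed with its weights frozen, and only the student is updated; at inference time only the small student network runs, which is where the real-time gain comes from.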

Further, to improve the real-time performance of recognition still more, the determining submodule is specifically configured to determine the corresponding three-dimensional bone key points from the two-dimensional bone key points using a trained mapping neural network model.
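The structure of the mapping neural network model is not disclosed. As an assumption, the sketch below uses a small fully connected network with one ReLU hidden layer that maps a flattened vector of J two-dimensional points (2J values) to J three-dimensional points (3J values). The weights are random and untrained, so the output only demonstrates shapes and data flow; `LiftingMLP`, `hidden`, and `seed` are hypothetical.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def dense(x: List[float], w: Matrix, b: List[float]) -> List[float]:
    """Fully connected layer: y[j] = sum_i x[i] * w[i][j] + b[j]."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, t) for t in v]


class LiftingMLP:
    """Hypothetical 2D -> 3D mapping network with one ReLU hidden layer (untrained)."""

    def __init__(self, num_joints: int, hidden: int = 32, seed: int = 0):
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> Matrix:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.num_joints = num_joints
        self.w1, self.b1 = init(2 * num_joints, hidden), [0.0] * hidden
        self.w2, self.b2 = init(hidden, 3 * num_joints), [0.0] * (3 * num_joints)

    def __call__(self, kp2d_flat: List[float]) -> List[Tuple[float, float, float]]:
        if len(kp2d_flat) != 2 * self.num_joints:
            raise ValueError("expected 2 values (x, y) per joint")
        h = relu(dense(kp2d_flat, self.w1, self.b1))
        out = dense(h, self.w2, self.b2)
        # Regroup the 3J outputs into one (x, y, z) triple per joint.
        return [(out[3 * k], out[3 * k + 1], out[3 * k + 2]) for k in range(self.num_joints)]
```

A model with only two small matrices like this costs a few thousand multiply-adds per frame for 17 joints, which illustrates why a lightweight mapping network suits a mobile terminal better than estimating depth from the image directly.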

As another implementable manner, on the basis of the foregoing embodiment and to further improve the real-time performance of action recognition, the first processing module is optionally configured to identify the corresponding action posture from the three-dimensional bone key points using a trained LSTM neural network model.
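To show how an LSTM carries the historical time sequence forward, here is a pure-Python, single-layer LSTM followed by a linear classification head. It is a sketch under stated assumptions: each frame is a flat list of three-dimensional coordinates, the posture labels are placeholders, and the weights are random rather than trained. `TinyLSTMPostureClassifier` and its parameters are hypothetical; the gate equations are the standard LSTM update.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMPostureClassifier:
    """Hypothetical single-layer LSTM plus linear head over per-frame 3D key-point vectors."""

    GATES = "ifgo"  # input gate, forget gate, cell candidate, output gate

    def __init__(self, input_size: int, hidden_size: int, labels: Sequence[str], seed: int = 0):
        rng = random.Random(seed)
        self.input_size, self.hidden_size = input_size, hidden_size
        self.labels = list(labels)
        width = input_size + hidden_size  # each gate sees the concatenation [x_t, h_{t-1}]
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(hidden_size)]
                  for g in self.GATES}
        # A forget-gate bias of 1.0 is a common initialisation that favours remembering history.
        self.b = {g: [1.0 if g == "f" else 0.0] * hidden_size for g in self.GATES}
        self.head = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in self.labels]

    def _affine(self, gate: str, z: List[float]) -> List[float]:
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x: Sequence[float], h: List[float], c: List[float]) -> Tuple[List[float], List[float]]:
        """One time step: c_t = f * c_{t-1} + i * g, h_t = o * tanh(c_t)."""
        if len(x) != self.input_size:
            raise ValueError("frame has the wrong number of coordinates")
        z = list(x) + h
        i = [sigmoid(t) for t in self._affine("i", z)]
        f = [sigmoid(t) for t in self._affine("f", z)]
        g = [math.tanh(t) for t in self._affine("g", z)]
        o = [sigmoid(t) for t in self._affine("o", z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def classify(self, frames: Sequence[Sequence[float]]) -> str:
        """Run the frame sequence through the LSTM and return the highest-scoring posture label."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in frames:
            h, c = self.step(x, h, c)
        scores = [sum(w * v for w, v in zip(row, h)) for row in self.head]
        return self.labels[max(range(len(scores)), key=scores.__getitem__)]
```

Because the hidden state `h` and cell state `c` are threaded from frame to frame, the label for the current frame depends on the preceding frames as well, which is the property the description relies on when it says the posture is identified "by combining the historical time sequence".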

As another implementable manner, the second processing module is specifically configured to determine the target behavior type corresponding to the image data to be recognized based on the recognized action posture and the state transition diagram corresponding to each behavior type.
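A state transition diagram per behavior type can be encoded as a table of (state, posture) → next-state entries. In the hypothetical sketch below, a squat is accepted only after the posture sequence standing, squatting, standing; any posture with no outgoing edge from the current state is treated as an erroneous sequence and sends the machine back to its start state, which is one way to discard error-state results as the abstract describes. The state names, posture labels, and restart rule are assumptions, not the patent's actual diagrams.

```python
from typing import Dict, Iterable, List, Tuple

Transitions = Dict[Tuple[str, str], str]

# Hypothetical state transition diagram for a "squat": standing -> squatting -> standing.
SQUAT_TRANSITIONS: Transitions = {
    ("idle", "standing"): "ready",
    ("ready", "standing"): "ready",    # holding the start posture
    ("ready", "squatting"): "down",
    ("down", "squatting"): "down",     # holding the bottom posture
    ("down", "standing"): "done",      # accepting transition
}


class BehaviorFSM:
    """Finite state machine for one behavior type, driven by recognized action postures."""

    def __init__(self, behavior: str, transitions: Transitions,
                 start: str = "idle", accept: str = "done"):
        self.behavior, self.transitions = behavior, transitions
        self.start, self.accept = start, accept
        self.state = start

    def _restart_with(self, posture: str) -> str:
        # Return to the start state, letting `posture` open a fresh attempt if the diagram allows it.
        return self.transitions.get((self.start, posture), self.start)

    def feed(self, posture: str) -> bool:
        """Advance on one posture; return True when the behavior has just been completed."""
        nxt = self.transitions.get((self.state, posture))
        if nxt is None:
            # No edge for this posture: discard the erroneous partial sequence.
            self.state = self._restart_with(posture)
            return False
        if nxt == self.accept:
            self.state = self._restart_with(posture)
            return True
        self.state = nxt
        return False


def recognize(postures: Iterable[str], machines: List[BehaviorFSM]) -> List[str]:
    """Feed every posture to every behavior's machine and collect completed behavior types."""
    detected: List[str] = []
    for posture in postures:
        for machine in machines:
            if machine.feed(posture):
                detected.append(machine.behavior)
    return detected
```

Several behavior types can be monitored in parallel by passing one machine per state transition diagram to `recognize`. The table-driven form keeps the per-frame cost to a dictionary lookup per machine, consistent with the description's point that the state transition diagram adds no heavy computation.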

Still another embodiment of the present invention provides an electronic device configured to perform the method provided by the foregoing embodiments. The electronic device may be a mobile terminal, a server, or another computer-implemented device.

Fig. 7 is a schematic structural diagram of the electronic device provided in this embodiment. The electronic device 50 includes a memory 51, a transceiver 52, and at least one processor 53.

The processor, the memory, and the transceiver are interconnected through a circuit; the memory stores computer-executable instructions; the transceiver is configured to receive the image data sent by the image acquisition device; and the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the method provided by any of the above embodiments.

Specifically, the electronic device is connected to an image acquisition device. For example, when the electronic device is a mobile terminal and the image acquisition device is the terminal's camera, the transceiver receives the image data sent by the camera and passes it to the processor. The processor stores the image data as the image data to be recognized, or preprocesses it to obtain the image data to be recognized, and then reads and executes the computer-executable instructions stored in the memory to implement the method provided in any of the above embodiments.

The electronic device provided by this embodiment can be applied to recognize various behavior types in scenes where specific behaviors of a person need to be monitored or guided. In fitness, exercise, and dance training scenes, for example, it can determine whether a person's movements are correct and issue reminders or guidance accordingly; in scenes that require certain specific operations, such as disinfection, it can monitor whether the relevant person performs the required operations. The specific application scenario can be set according to actual requirements, and the embodiments of the invention are not limited in this respect.

It should be noted that the electronic device of this embodiment can implement the method provided in any of the above embodiments, and can achieve the same technical effect, which is not described herein again.

Yet another embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the method provided in any one of the above embodiments.

It should be noted that the computer-readable storage medium of this embodiment can implement the method provided in any of the above embodiments, and can achieve the same technical effects, which are not described herein again.

In the embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

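1. A human body action recognition method, characterized by comprising the following steps:

acquiring three-dimensional bone key points corresponding to image data to be identified;

identifying a corresponding action gesture based on the three-dimensional bone key points;

determining the target behavior type corresponding to the image data to be identified by using a finite state machine model based on the recognized action gesture.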

2. The method according to claim 1, wherein the obtaining three-dimensional bone key points corresponding to the image data to be identified comprises:

acquiring image data to be identified;

determining corresponding two-dimensional bone key points based on the image data to be identified;

determining corresponding three-dimensional bone key points based on the two-dimensional bone key points.

3. The method of claim 2, wherein determining corresponding two-dimensional bone keypoints based on the image data to be identified comprises:

determining the corresponding two-dimensional bone key points based on the image data to be recognized and a trained lightweight skeletal neural network model, wherein the lightweight skeletal neural network model is a student network obtained by knowledge distillation training based on an Hourglass architecture.

4. The method of claim 2, wherein said determining, based on said two-dimensional bone keypoints, corresponding three-dimensional bone keypoints comprises:

determining the corresponding three-dimensional bone key points based on the two-dimensional bone key points by using a trained mapping neural network model.

5. The method of claim 1, wherein said identifying a corresponding action pose based on said three-dimensional skeletal keypoints comprises:

identifying the corresponding action pose based on the three-dimensional skeletal keypoints by using a trained LSTM neural network model.

6. The method according to any one of claims 1 to 5, wherein the determining the target behavior type corresponding to the image data to be recognized by using a finite state machine model based on the recognized action gesture comprises:

determining the target behavior type corresponding to the image data to be recognized based on the recognized action gesture and the state transition diagram corresponding to each behavior type.

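7. A human body motion recognition device, comprising:

an acquisition module, configured to acquire three-dimensional bone key points corresponding to image data to be identified;

a first processing module, configured to identify a corresponding action gesture based on the three-dimensional bone key points;

a second processing module, configured to determine the target behavior type corresponding to the image data to be identified by using a finite state machine model based on the recognized action gesture.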

8. The apparatus of claim 7, wherein the first processing module is specifically configured to:

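9. An electronic device, comprising: a memory, a transceiver, and at least one processor;

the processor, the memory, and the transceiver are interconnected through a circuit;

the memory stores computer-executable instructions;

the transceiver is configured to receive image data sent by an image acquisition device;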

the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method of any one of claims 1-6.

10. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-6.