CN117315791B - Bone action recognition method, device and storage medium

Info

Publication number
CN117315791B
Authority
CN
China
Prior art keywords
information
bone
target object
skeleton
result
Legal status
Active
Application number
CN202311599865.XA
Other languages
Chinese (zh)
Other versions
CN117315791A (en)
Inventor
李德财
张海涛
马子昂
Current Assignee
Hangzhou Huacheng Software Technology Co Ltd
Original Assignee
Hangzhou Huacheng Software Technology Co Ltd
Application filed by Hangzhou Huacheng Software Technology Co Ltd
Priority to CN202311599865.XA
Publication of CN117315791A
Application granted
Publication of CN117315791B
Legal status: Active

Classifications

    • G06V 40/23 Recognition of whole body movements, e.g. for sport training
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Activation functions
    • G06N 3/08 Learning methods
    • G06V 10/34 Smoothing or thinning of the pattern; morphological operations; skeletonisation
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/806 Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks

Abstract

The application discloses a bone action recognition method, device and storage medium, wherein the bone action recognition method comprises the following steps: acquiring an image to be identified, wherein the image to be identified contains a target object; performing key point detection on the target object in the image to be identified, and obtaining skeleton posture information of the target object based on the key point detection result; mapping the skeleton posture information into a two-dimensional grid expression of a preset size to obtain a grid mapping result; and extracting action features of the target object based on the grid mapping result, and performing action recognition using the extracted action features to obtain an action recognition result. By converting irregular skeleton posture information into a regular and compact grid mapping result, feature learning is facilitated, and both the efficiency of skeleton feature interaction and the accuracy of action recognition are improved.

Description

Bone action recognition method, device and storage medium
Technical Field
The present invention relates to the technical field of behavior analysis, and in particular to a bone action recognition method, a device, and a storage medium.
Background
Action recognition refers to analyzing given action sequence data and identifying and judging the action categories it contains; it is widely applied in fields such as video surveillance, human-computer interaction, augmented reality, and autonomous driving. With the rapid development of 3D motion capture systems and advanced real-time 2D/3D pose estimation algorithms, skeleton-based action recognition is receiving increasing attention from industry and academia.
Existing bone action recognition methods still suffer from low recognition accuracy, so improving the accuracy and robustness of bone action recognition has become an urgent technical problem for those skilled in the art.
Disclosure of Invention
The application provides at least a bone action recognition method, device and storage medium.
The first aspect of the application provides a bone action recognition method, which comprises the following steps: acquiring an image to be identified, wherein the image to be identified contains a target object; performing key point detection on a target object in an image to be identified, and obtaining skeleton posture information of the target object based on a key point detection result; mapping the skeleton posture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result; and extracting action features of the target object based on the grid mapping result, and carrying out action recognition by using the extracted action features to obtain an action recognition result.
In one embodiment, the image to be identified is composed of multiple video frames, the key point detection result contains a plurality of key points, and the bone posture information comprises one or more of bone point information, bone information, bone point motion information, or bone motion information. Obtaining the bone posture information of the target object based on the key point detection result comprises: extracting bone points from the plurality of key points and combining the bone points to obtain the bone point information of the target object; and/or connecting adjacent bone points to obtain a plurality of bone connecting lines and combining the bone connecting lines to obtain the bone information of the target object; and/or calculating the difference between the bone point information of adjacent video frames to obtain the bone point motion information; and/or calculating the difference between the bone information of adjacent video frames to obtain the bone motion information.
In an embodiment, mapping the bone pose information to a two-dimensional grid expression with a preset size to obtain a grid mapping result includes: performing information expansion on the skeleton posture information to obtain expanded skeleton posture information; and mapping the expanded skeleton posture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result.
In one embodiment, the information expansion of the skeletal posture information to obtain expanded skeletal posture information includes: calculating interpolation data corresponding to the bone posture information by using an adjacent matrix, wherein the adjacent matrix contains the topological structure information of the bones of the target object; and combining the bone posture information and interpolation data corresponding to the bone posture information to obtain expanded bone posture information.
In one embodiment, the expanded skeleton gesture information is composed of a plurality of elements to be mapped, and the number of the elements to be mapped is equal to the number of grids in the two-dimensional grid expression; mapping the expanded skeleton gesture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result, wherein the method comprises the following steps of: and mapping the elements to be mapped in the expanded skeleton gesture information into grids of the two-dimensional grid expression one by one to obtain a grid mapping result.
In one embodiment, the image to be identified consists of multiple video frames; extracting action features of the target object based on the grid mapping result, and performing action recognition by using the extracted action features to obtain an action recognition result, comprises: performing spatial feature extraction on the grid mapping result corresponding to each video frame to obtain the spatial features corresponding to each video frame; performing temporal feature extraction on the spatial features to obtain the spatio-temporal features corresponding to the target object; and performing action recognition on the spatio-temporal features to obtain the action recognition result.
In one embodiment, the target object corresponds to multiple types of skeletal pose information; extracting action features of the target object based on the grid mapping result, and performing action recognition by using the extracted action features to obtain an action recognition result, wherein the method comprises the following steps: performing action recognition on the grid mapping result of each type of skeleton gesture information to obtain initial recognition results respectively corresponding to each type of grid mapping result; and fusing each initial recognition result to obtain an action recognition result corresponding to the image to be recognized.
In an embodiment, fusing each initial recognition result to obtain an action recognition result corresponding to the image to be recognized includes: acquiring weight parameters corresponding to each type of bone posture information; and carrying out weighted summation on each initial recognition result by using the weight parameters to obtain a motion recognition result corresponding to the image to be recognized.
A second aspect of the present application provides a bone motion recognition device, the device comprising: the image acquisition module is used for acquiring an image to be identified, wherein the image to be identified contains a target object; the key point detection module is used for carrying out key point detection on the target object in the image to be identified and obtaining skeleton posture information of the target object based on a key point detection result; the grid mapping module is used for mapping the skeleton gesture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result; and the action recognition module is used for extracting action characteristics of the target object based on the grid mapping result, and performing action recognition by using the extracted action characteristics to obtain an action recognition result.
A third aspect of the present application provides an electronic device, including a memory and a processor, where the processor is configured to execute program instructions stored in the memory to implement the bone action recognition method described above.
A fourth aspect of the present application provides a computer readable storage medium having program instructions stored thereon, which when executed by a processor, implement the bone action recognition method described above.
According to the above scheme, an image to be identified containing a target object is acquired; key point detection is performed on the target object in the image to be identified, and skeleton posture information of the target object is obtained based on the key point detection result; the skeleton posture information is mapped into a two-dimensional grid expression of a preset size to obtain a grid mapping result, so that the irregular skeleton posture information is converted into a regular and compact grid expression; and action features of the target object are extracted based on the grid mapping result and used for action recognition to obtain an action recognition result, thereby improving the efficiency of skeleton feature interaction and the accuracy of action recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the technical aspects of the application.
FIG. 1 is a schematic diagram of one implementation environment involved in a bone action recognition method as illustrated in an exemplary embodiment of the present application;
FIG. 2 is a flow chart illustrating a bone action recognition method according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram illustrating keypoint detection in accordance with an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the acquisition of bone point information, bone point motion information, and bone motion information according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating interpolation of skeletal point information in accordance with an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of an action recognition network model shown in an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a spatial reconstruction unit shown in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a channel reconstruction unit shown in an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of an MB-TCN shown in an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of action recognition shown in an exemplary embodiment of the present application;
FIG. 11 is a block diagram of a bone action recognition device, as shown in an exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of an electronic device shown in an exemplary embodiment of the present application;
fig. 13 is a schematic structural view of a computer-readable storage medium shown in an exemplary embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is herein merely an association information describing an associated object, meaning that three relationships may exist, e.g., a and/or B may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Current deep-learning-based bone action recognition methods are mainly classified into four types: bone action recognition based on convolutional neural networks (Convolutional Neural Network, CNN), bone action recognition based on recurrent neural networks (Recurrent Neural Network, RNN), bone action recognition based on graph convolutional networks (Graph Convolution Network, GCN), and bone action recognition based on Transformers.
Among these, the skeleton of a human body or an animal can naturally be expressed as a graph, so GCN-based methods are the mainstream approach for bone action recognition. However, a GCN relies on an irregular skeleton topology and is forced to learn convolution kernels of different shapes independently, node by node; it also lacks the feature aggregation property of CNNs in its expression. Furthermore, the receptive field of a GCN typically covers a set of neighboring nodes within a predefined distance of the target node, so it cannot model interactions that are spatially close but topologically distant, such as handshakes and hugs, which reduces the efficiency of feature interaction and the accuracy and applicability of bone action recognition.
In order to solve the above problems, the present application provides a bone motion recognition method, an electronic device, and a computer-readable storage medium.
The bone motion recognition method provided in the embodiment of the present application is described below.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The scenario implementation environment may include a terminal 110 and a server 120, with the terminal 110 and the server 120 being communicatively coupled to each other.
The number of terminals 110 may be one or more. The terminal 110 may be, but is not limited to, a video camera, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc.
The server 120 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
In one example, the server 120 may perform an action recognition process for the image to be recognized containing the target object obtained from the terminal 110, to obtain an action recognition result of the target object, and then the server 120 may store the action recognition result locally, transmit back to the terminal 110, or transmit to other terminals 110.
In one example, a client running a target application, such as an application providing an action recognition function, is installed in the terminal 110. The server 120 may be a background server of the target application program for providing background services to clients of the target application program.
In the bone action recognition method provided in the embodiments of the present application, the execution subject of each step may be the terminal 110, for example, the client of the target application installed and running in the terminal 110; it may be the server 120; or the terminal 110 and the server 120 may cooperate interactively, i.e., some steps of the method are executed by the terminal 110 and the others by the server 120. The present application does not limit this.
It can be appreciated that the specific embodiments of the present application involve related data such as images to be identified and images of human bodies. When the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
Referring to fig. 2, fig. 2 is a flowchart illustrating a bone motion recognition method according to an exemplary embodiment of the present application. The bone action recognition method can be applied to the implementation environment shown in fig. 1 and is specifically performed by a server in the implementation environment. It should be understood that the method may be adapted to other exemplary implementation environments and be specifically executed by devices in other implementation environments, and the implementation environments to which the method is adapted are not limited by the present embodiment.
As shown in fig. 2, the bone motion recognition method at least includes steps S210 to S240, and is described in detail as follows:
step S210: and acquiring an image to be identified, wherein the image to be identified contains the target object.
The image to be identified contains a target object, which may be a pedestrian, an animal, a robot, etc., which is not limited in this application.
Illustratively, the image to be identified may be stored in a database in advance and obtained by querying the database; alternatively, the terminal may collect a video stream in real time and send it to the server, and the server takes the image frames in the received video stream as the images to be identified. The method of acquiring the image to be identified is not limited in this application.
Illustratively, a monitoring camera performs image acquisition on a preset area in real time and sends the collected video stream to the server. The server performs target detection on the image frames in the received video stream to detect whether a target object exists in the preset area; if a target object exists, the image frames containing the target object are used as images to be identified, so that action recognition is performed on them.
Illustratively, the image to be identified is obtained by sampling target video data. Specifically, the action category to be identified is obtained, and the sampling interval corresponding to the action category is determined: for an action category with a fast movement speed and short duration, a shorter sampling interval is set, for example 2 seconds; for an action category with gentle movement and long duration, a longer sampling interval is set, for example 5 seconds. The target video data is then sampled at the sampling interval corresponding to the action category to obtain the images to be identified.
Step S220: and detecting key points of the target object in the image to be identified, and obtaining skeleton posture information of the target object based on the key point detection result.
The key points refer to points capable of representing the gesture information of the target object, such as joints, head positions and the like of a person.
And performing key point detection on the target object in the image to be identified to obtain a key point detection result of the target object in the image to be identified, wherein the key point detection result consists of a plurality of key points of the target object.
For example, referring to fig. 3, fig. 3 is a schematic diagram illustrating key point detection according to an exemplary embodiment of the present application. As shown in fig. 3, the image to be identified contains a pedestrian; the image to be identified is input to a key point detection module, which performs key point detection on the pedestrian to obtain N key points, and the N key points $\{k_i = (x_i, y_i, c_i)\}_{i=1}^{N}$ are used as the key point detection result of the pedestrian, where $i$ is a positive integer from 1 to N, preferably N = 17, $(x_i, y_i)$ are the coordinates of the $i$-th key point, and $c_i$ indicates the confidence of the $i$-th key point.
The key point detection module may perform key point detection by using a Harris corner detection algorithm, a scale invariant feature transform (Scale Invariant Feature Transform, SIFT) algorithm, an acceleration robust feature (Speeded Up Robust Features, SURF) algorithm, and the like, which is not limited in this application.
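As an illustration only, the key point detection result can be packed into a simple array. The following is a minimal sketch assuming the $(x_i, y_i, c_i)$ layout described with fig. 3; the helper name and the fixed N = 17 are assumptions for illustration.

```python
import numpy as np

N_KEYPOINTS = 17  # preferred number of key points in this embodiment

def keypoint_result(xy: np.ndarray, conf: np.ndarray) -> np.ndarray:
    """Pack coordinates (N, 2) and confidences (N,) into the (N, 3)
    key point detection result {k_i = (x_i, y_i, c_i)}."""
    assert xy.shape == (N_KEYPOINTS, 2) and conf.shape == (N_KEYPOINTS,)
    return np.concatenate([xy, conf[:, None]], axis=1)
```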
Further, according to the key point detection result, skeleton posture information of the target object is obtained, wherein the skeleton posture information is used for reflecting the posture of the target object in the image to be recognized. Wherein the bone pose information includes, but is not limited to, one or more of bone point information, bone point motion information, or bone motion information of the target object within the single image frame.
Step S230: and mapping the skeleton posture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result.
As can be seen from the key point detection result shown in fig. 3, the detected key points are irregular, and the corresponding skeleton posture information is irregular as well. If the graph convolution method of the prior art were adopted, actions that are spatially close but topologically distant could not interact, reducing the efficiency of skeleton feature interaction.
Therefore, the bone posture information is mapped into a two-dimensional grid expression with a preset size, the two-dimensional grid expression is used for integrating the irregular bone posture information so as to convert the irregular bone posture information into a regular and compact grid expression, and a grid mapping result is obtained and consists of a plurality of grid elements which are orderly arranged, and the mapped bone posture information is stored in each grid element.
The size of the two-dimensional grid expression can be determined according to the actual usage scenario. For example, when high performance is required, the grid size can be increased to improve action recognition accuracy; when high efficiency is required, the grid size closest to the amount of information in the skeleton posture information can be selected to improve inference speed.
Step S240: and extracting action features of the target object based on the grid mapping result, and carrying out action recognition by using the extracted action features to obtain an action recognition result.
And extracting action features of the target object according to the grid mapping result to obtain action features of the target object in the image to be identified, and further carrying out action identification according to the action features to obtain an action identification result.
The action feature extraction and the action recognition can be performed according to a neural network model which is trained in advance.
For example, the pre-trained neural network model comprises a convolutional network layer and a fully connected layer. The grid mapping result is input into the convolutional network layer for grid convolution to obtain the action features output by the convolutional network layer; the action features are then input into the fully connected layer for action recognition, the fully connected layer outputs the probability of each action, and the action with the highest probability is taken as the action recognition result of the target object.
Some embodiments of the present application are described in further detail below.
In some embodiments, the image to be identified is composed of multiple video frames, the key point detection result contains a plurality of key points, and the skeletal pose information comprises one or more of skeletal point information, skeletal information, skeletal point motion information, or skeletal motion information; in step S220, obtaining skeletal posture information of the target object based on the key point detection result includes:
extracting skeleton points from the key points, and combining the skeleton points to obtain the skeleton point information of the target object; and/or,
connecting adjacent skeleton points to obtain a plurality of bone connecting lines, and combining the bone connecting lines to obtain the bone information of the target object; and/or,
calculating the difference between the skeleton point information of adjacent video frames to obtain the skeleton point motion information; and/or,
calculating the difference between the bone information of adjacent video frames to obtain the bone motion information.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating acquiring bone point information, bone point motion information and bone motion information according to an exemplary embodiment of the present application, and as shown in fig. 4, preprocessing a key point in an acquired key point detection result to obtain bone point information, bone point motion information and bone motion information.
The target object is exemplified.
Acquiring bone point information: based on the human body structure, extracting a preset number of bone points from a plurality of key points, and combining each bone point to obtain bone point information of a target object.
Acquiring bone information: based on the human body structure, connecting adjacent bone points to obtain a preset number of bone connecting lines, and combining each bone connecting line to obtain the bone information of the target object.
Acquiring skeletal point motion information: and subtracting the skeleton points of adjacent video frames according to the time sequence to obtain skeleton point motion information.
Acquiring bone movement information: and subtracting the skeleton connecting lines of adjacent video frames according to the time sequence to obtain skeleton motion information.
One or more of the bone point information, bone information, bone point motion information, and bone motion information are taken as the bone posture information, which is then mapped into the two-dimensional grid expression of the preset size. If the bone posture information contains multiple types, such as bone point information, bone point motion information, and bone motion information, each type is mapped separately to obtain a grid mapping result for the bone point information, a grid mapping result for the bone point motion information, and a grid mapping result for the bone motion information.
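To make the four kinds of bone posture information concrete, the sketch below derives them from a sequence of key point detection results. The bone pairing is a hypothetical COCO-style layout; the patent only requires that adjacent bone points be connected according to the body structure shown in fig. 4.

```python
import numpy as np

# Hypothetical bone connections (parent, child) for a 17-point skeleton;
# the concrete pairing is an assumption, not taken from the patent.
BONES = [(0, 1), (0, 2), (1, 3), (2, 4), (0, 5), (0, 6), (5, 7), (7, 9),
         (6, 8), (8, 10), (5, 11), (6, 12), (11, 13), (13, 15), (12, 14),
         (14, 16)]

def bone_pose_modalities(joints: np.ndarray):
    """joints: (T, N, C) bone point information over T video frames,
    C = (x, y, confidence). Returns the four modalities of fig. 4."""
    # Bone information: vector from each bone point to its adjacent point.
    bones = np.stack([joints[:, j] - joints[:, i] for i, j in BONES], axis=1)
    # Motion information: differences between adjacent video frames.
    joint_motion = joints[1:] - joints[:-1]   # (T-1, N, C)
    bone_motion = bones[1:] - bones[:-1]      # (T-1, len(BONES), C)
    return joints, bones, joint_motion, bone_motion
```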
In some embodiments, mapping the bone pose information to a two-dimensional grid expression of a preset size in step S230, to obtain a grid mapping result includes:
step S231: and carrying out information expansion on the skeleton posture information to obtain expanded skeleton posture information.
As can be seen from fig. 4, the bone posture information is composed of a plurality of bone posture elements, for example, the bone point information is composed of a plurality of bone points, and the bone information is composed of a plurality of bone connecting lines.
However, the number of meshes contained in the two-dimensional mesh expression and the number of bone posture elements may not be equal, and in order to avoid loss of information during conversion from bone posture information to the two-dimensional mesh expression, the number of meshes in the two-dimensional mesh expression should be set to be not less than the number of bone posture elements.
And in order to improve the information expression capability of the grid mapping result, if the number of grids in the two-dimensional grid expression is greater than the number of skeleton gesture elements, carrying out information expansion on the skeleton gesture information so that the number of skeleton gesture elements in the final skeleton gesture information is equal to the number of grids in the two-dimensional grid expression.
Illustratively, the information expansion is performed on the bone posture information to obtain expanded bone posture information, including: calculating interpolation data corresponding to the bone posture information by using an adjacent matrix, wherein the adjacent matrix contains the topological structure information of the bones of the target object; and combining the bone posture information and interpolation data corresponding to the bone posture information to obtain expanded bone posture information.
Because the adjacency matrix contains the topological structure information of the bones of the target object, disordered interpolation data can be effectively avoided: the information matrix is constrained to insert new bone posture elements based on adjacent existing bone posture elements, which helps associate the bone-posture-to-grid mapping result with the topology prior, thereby improving action recognition accuracy.
Specifically, taking bone posture information as bone point information as an example for illustration:
referring to fig. 5, fig. 5 is a schematic diagram illustrating interpolation of bone point information according to an exemplary embodiment of the present application, where the bone point information includes 17 bone points, the size of the two-dimensional grid expression should be greater than 17, and since the conventional convolution feature map is square, the size of the two-dimensional grid expression is 5*5, but the number of grids of the 5*5 two-dimensional grid expression is greater than the number of bone points, so that additional grid units are interpolated.
For example, the adjacency matrix $A$ and the information matrix $S$ are multiplied to obtain an expansion matrix, and interpolation processing is performed with the expansion matrix to obtain the interpolation data. The related formula may be the following formula (1):

$$X' = S A X \tag{1}$$

where $X' \in \mathbb{R}^{HW \times C}$ is the skeleton point information after information expansion; $X \in \mathbb{R}^{N \times C}$ is the original skeleton point information; $S \in \mathbb{R}^{HW \times N}$ is the information matrix, representing the intermediate bone information before and after expansion; $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix, representing the topological structure information of the human skeleton; $H$ and $W$ are the width and height of the two-dimensional grid expression, so that $X'$ has $HW$ rows; $N$ is the number of skeleton points, 17 in total; and $C$ is the number of channels, representing the position coordinates and confidence of the skeleton points.
As shown in fig. 5, the expanded skeleton point information is obtained by combining the original skeleton point information with its corresponding interpolation data.
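A minimal sketch of the information expansion of formula (1), under the shape assumptions given above; how $A$ (topology prior) and $S$ (information matrix) are obtained in practice, e.g. fixed or learned, is left open here.

```python
import numpy as np

def expand_pose(X: np.ndarray, A: np.ndarray, S: np.ndarray) -> np.ndarray:
    """Information expansion per formula (1): X' = S A X.
    X: (N, C)  original skeleton point information, N = 17 points.
    A: (N, N)  adjacency matrix encoding the skeleton topology.
    S: (HW, N) information matrix interpolating the N points up to HW
               grid elements (HW = 25 for a 5x5 grid).
    Returns X' of shape (HW, C)."""
    return S @ (A @ X)
```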
Step S232: and mapping the expanded skeleton posture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result.
After the information expansion, the number of skeleton posture elements in the skeleton posture information equals the number of grids in the two-dimensional grid expression, and the expanded skeleton posture information is mapped into the two-dimensional grid expression, i.e., the $HW$ skeleton posture elements are arranged into an $H \times W$ two-dimensional grid.
Illustratively, each skeleton gesture element in the extended skeleton gesture information is taken as an element to be mapped, and the number of the element to be mapped is equal to the number of grids in the two-dimensional grid expression; mapping the expanded skeleton gesture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result, wherein the method comprises the following steps of: and mapping the elements to be mapped in the expanded skeleton gesture information into grids of the two-dimensional grid expression one by one to obtain a grid mapping result.
Specifically, the expanded bone posture information is mapped into the grid of the two-dimensional grid expression using the following formula (2):

$$G = \mathrm{reshape}(M X') \tag{2}$$

where $G \in \mathbb{R}^{H \times W \times C}$ is the two-dimensional grid expression; $M$ is a binary mapping matrix of size $HW \times HW$ whose elements take the value 0 or 1, each row being a one-hot code with exactly one element equal to 1, and the index of that element selects a bone posture element from the bone posture information; $X' \in \mathbb{R}^{HW \times C}$ is the information-expanded bone posture information; $H$ and $W$ are the width and height of the two-dimensional grid expression; and $C$ is the number of channels, representing the position coordinates and confidence of the bone posture elements.
Taking skeleton posture information as skeleton point information as an example, the mapping process is shown in fig. 5, and each skeleton point in the expanded skeleton point information is mapped into a grid of a two-dimensional grid expression one by one to obtain a grid mapping result.
The skeleton posture information is mapped into a regular grid representation through the information mapping, so that the skeleton information can be subjected to feature learning through conventional convolution operation, efficient feature modeling is realized, and the skeleton feature extraction capability is improved.
By mapping the bone posture information into a regular two-dimensional grid expression, on the premise of maintaining the topological structure of the bone information, the irregular bone posture information is mapped into the regular grid expression, so that the subsequent feature learning is facilitated, the connection of the bone posture information is enlarged, and the interaction of the bone posture information is enhanced.
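The grid mapping of formula (2) can be sketched as follows, assuming $M$ is given as a dense one-hot matrix; an index-based permutation would be an equivalent implementation.

```python
import numpy as np

def grid_map(x_exp: np.ndarray, M: np.ndarray, H: int, W: int) -> np.ndarray:
    """Grid mapping per formula (2): G = reshape(M X').
    x_exp: (HW, C) expanded bone posture information, HW = H * W.
    M:     (HW, HW) binary mapping matrix; every row is one-hot, so each
           grid cell selects exactly one bone posture element."""
    assert (M.sum(axis=1) == 1).all(), "each row of M must be one-hot"
    return (M @ x_exp).reshape(H, W, -1)

# Example: the identity mapping places element i into grid cell i.
# G = grid_map(x_exp, np.eye(25), 5, 5)
```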
After the grid mapping result is obtained, extracting action features of the target object according to the grid mapping result, and carrying out action recognition by using the extracted action features to obtain an action recognition result.
In some embodiments, the image to be identified consists of multiple video frames; in step S240, extracting action features of the target object based on the grid mapping result and performing action recognition using the extracted action features to obtain an action recognition result includes:
step S241: and carrying out spatial feature extraction on the grid mapping result corresponding to each video frame to obtain spatial features corresponding to each video frame respectively.
Spatial features refer to skeletal pose features contained within the same video frame.
Illustratively, convolution processing is performed on the grid mapping result to fuse the skeleton gesture information in the same video frame, so as to obtain the spatial characteristics of the video frame.
Step S242: and extracting the time characteristics of each space characteristic to obtain the space-time characteristics corresponding to the target object.
Temporal features refer to variations between skeletal pose features contained in multiple video frames.
Illustratively, dimensional change, alignment and the like are performed on the spatial features of each video frame, and then time sequence modeling is performed on the spatial features of each video frame, so as to extract the space-time features corresponding to the target object from the plurality of video frames.
Step S243: and performing action recognition on the time-space characteristics to obtain an action recognition result.
And performing action recognition according to the space-time characteristics to obtain an action recognition result.
Optionally, the above-mentioned action recognition is implemented based on a pre-trained action recognition network model, and the action recognition network model for performing the action recognition is exemplified.
Referring to fig. 6, fig. 6 is a schematic diagram of an action recognition network model according to an exemplary embodiment of the present application. As shown in fig. 6, the action recognition network model includes a batch normalization (Batch Normalization, BN) layer, a spatial and channel reconstruction convolution (Spatial and Channel reconstruction Convolution, SCConv) layer, a multi-branch temporal convolution (Multiple branches Temporal Convolutional Network, MB-TCN) layer, and a fully connected (Full Connection, FC) layer. The grid mapping result input to the model is first normalized by the BN layer; the normalized result is then input into the SCConv layer to fuse the skeleton posture information within the same video frame, yielding the spatial features of each video frame. After dimensional transformation and alignment, the spatial features are input into the MB-TCN layer for temporal modeling to extract the spatio-temporal features of the skeleton posture information across multiple video frames; adding residual connections can further improve the feature extraction capability. Finally, the action recognition result is obtained through the fully connected layer.
Optionally, the action recognition network model comprises ten grid convolution layers, each composed of an SCConv layer and an MB-TCN layer. As the network deepens, the number of feature channels keeps increasing while the temporal dimension is progressively down-sampled: layers L1-L4 have 64 channels and leave the sequence length unchanged, layers L5-L7 have 128 channels with the sequence length reduced to 1/2, and layers L8-L10 have 256 channels with the sequence length reduced to 1/4.
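The ten-layer backbone can be sketched as below. The SCConv and MB-TCN internals are replaced here by stand-in convolutions (their own sketches follow figs. 7 to 9), the residual connections are omitted for brevity, and all function names are assumptions.

```python
import torch
import torch.nn as nn

def sc_conv(c_in, c_out):
    # Stand-in for the SCConv layer: a plain per-frame 3x3 spatial convolution.
    return nn.Conv3d(c_in, c_out, kernel_size=(1, 3, 3), padding=(0, 1, 1))

def mb_tcn(c, t_stride):
    # Stand-in for the MB-TCN layer: a plain kernel-3 temporal convolution.
    return nn.Conv3d(c, c, kernel_size=(3, 1, 1), stride=(t_stride, 1, 1),
                     padding=(1, 0, 0))

# (out_channels, temporal_stride) per grid-convolution block: L1-L4 keep the
# sequence length, L5 halves it (1/2 overall), L8 halves it again (1/4).
BLOCK_CFG = [(64, 1)] * 4 + [(128, 2), (128, 1), (128, 1),
                             (256, 2), (256, 1), (256, 1)]

def build_backbone(c_in=3):
    layers, c_prev = [nn.BatchNorm3d(c_in)], c_in
    for c_out, stride in BLOCK_CFG:
        layers += [sc_conv(c_prev, c_out), mb_tcn(c_out, stride), nn.ReLU()]
        c_prev = c_out
    return nn.Sequential(*layers)

x = torch.randn(2, 3, 64, 5, 5)   # batch, C, T frames, 5x5 grid
feat = build_backbone()(x)        # -> (2, 256, 16, 5, 5)
```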
The SCConv layer is described in detail:
The SCConv layer contains a spatial reconstruction unit (Spatial Reconstruction Unit, SRU) and a channel reconstruction unit (Channel Reconstruction Unit, CRU). In computer vision tasks, deep neural networks have considerable redundancy: the spatial and channel dimensions of feature maps are highly similar and highly redundant. To reduce the spatial and channel redundancy of the feature maps, the spatial reconstruction unit and the channel reconstruction unit are used to alleviate feature redundancy, reduce the number of model parameters and the amount of computation, and enhance the feature representation capability.
Referring to fig. 7, fig. 7 is a schematic diagram of a spatial reconstruction unit according to an exemplary embodiment of the present application. The SRU consists of two parts, separation and reconstruction; the separation part separates feature maps with a large amount of information from feature maps with little information, corresponding to their spatial content.
Specifically, for the input grid mapping result, i.e., the input feature $X$, the scaling factors $\gamma$ of group normalization (GN) are first used to evaluate the information content of the different feature maps, and the scaling factors are normalized to obtain the correlation weights $w_{\gamma}$. A sigmoid function then maps the weight values into the range (0, 1), and threshold gating sets the weights above the threshold to 1, yielding the informative weight $W_1$, and the weights below the threshold to 0, yielding the non-informative weight $W_2$. The process of obtaining $W$ is shown in the following formula (3):

$$W = \mathrm{Gate}\big(\mathrm{Sigmoid}\big(w_{\gamma}(\mathrm{GN}(X))\big)\big) \tag{3}$$

Finally, the input feature $X$ is multiplied by $W_1$ and $W_2$ to obtain two weighted features $X_1^{w}$ and $X_2^{w}$, where $X_1^{w}$ carries the informative and expressive spatial content, while $X_2^{w}$ carries little information and is redundant.
To reduce the spatial redundancy, the feature with more information and the feature with less information are added together, producing a feature that is more informative while saving space.
Specifically, a cross-reconstruction operation is adopted to fully combine the two weighted features and enhance the information flow between them; the resulting cross-reconstructed features $X^{w1}$ and $X^{w2}$ are then concatenated to obtain the spatially refined feature $X^{s}$.
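A compact PyTorch sketch of the SRU's separate-and-reconstruct steps. The 0.5 gating threshold, the group count, and the channel-halving cross-reconstruction are assumptions where the description leaves details open.

```python
import torch
import torch.nn as nn

class SRU(nn.Module):
    """Spatial Reconstruction Unit (fig. 7), sketched."""
    def __init__(self, channels: int, groups: int = 4, threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)  # channels % groups == 0
        self.threshold = threshold

    def forward(self, x):                         # x: (B, C, H, W), C even
        gn_x = self.gn(x)
        # Normalised GN scaling factors measure per-channel information.
        w = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(w * gn_x)            # weights in (0, 1)
        w1 = (gate >= self.threshold).float()     # informative weight W1
        w2 = 1.0 - w1                             # non-informative weight W2
        x1, x2 = w1 * x, w2 * x
        # Cross-reconstruction: swap channel halves between the two
        # weighted features, add, then concatenate.
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x12 + x21], dim=1)
```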
Referring to fig. 8, fig. 8 is a schematic diagram of a channel reconstruction unit according to an exemplary embodiment of the present application. As shown in fig. 8, the CRU comprises three parts: a split operation, a transform operation, and a fusion operation.
The split operation divides the input spatially refined feature $X^{s}$ into two parts, one with $\alpha C$ channels and the other with $(1-\alpha)C$ channels. The channels of the two groups of features are then compressed by $1 \times 1$ convolutions to obtain $X_{up}$ and $X_{low}$.
The transform operation extracts features from the input $X_{up}$ by performing group-wise convolution and point-wise convolution respectively and adding the results to obtain the output $Y_1$. The input $X_{low}$ serves as a supplement to the feature extraction: it undergoes point-by-point convolution, and the result is concatenated with the original input to obtain $Y_2$.
The fusion operation adaptively merges $Y_1$ and $Y_2$. Specifically, global average pooling is first used to combine the global spatial information with the channel information, yielding $S_1$ and $S_2$; a Softmax activation is then applied to obtain the feature weight vectors $\beta_1$ and $\beta_2$, the weight vectors are multiplied with the corresponding outputs $Y_1$ and $Y_2$, and the products are added to obtain the channel-refined feature $Y$.
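A sketch of the CRU's split/transform/fuse steps; the split ratio alpha, the squeeze ratio, and the group count of the group-wise convolution are assumed hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    """Channel Reconstruction Unit (fig. 8), sketched."""
    def __init__(self, channels, alpha=0.5, squeeze=2, groups=2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        up_sq, low_sq = self.c_up // squeeze, self.c_low // squeeze
        # Split: 1x1 convolutions compress both channel groups.
        self.squeeze_up = nn.Conv2d(self.c_up, up_sq, 1)
        self.squeeze_low = nn.Conv2d(self.c_low, low_sq, 1)
        # Transform: group-wise + point-wise convolution on the upper part,
        # point-wise convolution as a cheap supplement on the lower part.
        self.gwc = nn.Conv2d(up_sq, channels, 3, padding=1, groups=groups)
        self.pwc_up = nn.Conv2d(up_sq, channels, 1)
        self.pwc_low = nn.Conv2d(low_sq, channels - low_sq, 1)

    def forward(self, x):                         # x: (B, C, H, W)
        xu, xl = torch.split(x, [self.c_up, self.c_low], dim=1)
        xu, xl = self.squeeze_up(xu), self.squeeze_low(xl)
        y1 = self.gwc(xu) + self.pwc_up(xu)
        y2 = torch.cat([self.pwc_low(xl), xl], dim=1)
        # Fuse: global average pooling + softmax produce adaptive weights.
        s = torch.stack([y1.mean(dim=(2, 3)), y2.mean(dim=(2, 3))])  # (2, B, C)
        beta = F.softmax(s, dim=0)
        return (beta[0].unsqueeze(-1).unsqueeze(-1) * y1
                + beta[1].unsqueeze(-1).unsqueeze(-1) * y2)

# y = CRU(64)(torch.randn(2, 64, 5, 5))  # -> (2, 64, 5, 5)
```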
The feature interaction capability of the skeleton attitude information is improved through the feature aggregation of the convolution, the performance of the convolution is improved through reducing the space and channel redundancy widely existing in the standard convolution, and the calculation cost is reduced.
The MB-TCN layer is described in detail:
A traditional TCN is a large-kernel single-branch convolution with a kernel size of 9, which greatly increases the computation, parameter count, and hardware requirements of the network. Adopting a multi-branch structure can strengthen the temporal modeling capability of the network while saving computation and parameters.
Referring to fig. 9, fig. 9 is a schematic diagram of the MB-TCN according to an exemplary embodiment of the present application. As shown in fig. 9, the feature $Y$ obtained from the SCConv layer is first passed through a $1 \times 1$ convolution that divides the channels equally into six parts, constructing six temporal convolution branches. The first four branches use convolution kernels of size 3 but different dilation rates, which increase gradually from 1 to 4 so that the receptive field grows progressively, improving the temporal feature extraction capability of the network; the remaining two branches are an identity mapping and a max pooling. Finally, the outputs of the branches are concatenated and passed through a $1 \times 1$ convolution to obtain the final spatio-temporal features.
The multi-branch time domain convolution is used for replacing the large-core single-branch time domain convolution, so that the time sequence modeling capability of a network is enhanced, the calculated amount and the parameter amount are saved, and the accuracy and the efficiency of motion recognition are improved.
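A sketch of the six-branch MB-TCN, assuming features laid out as (batch, channels, time, pose elements). The four dilated kernel-3 branches, the identity branch, and the max-pooling branch follow the description above; per-branch widths of channels // 6 are used so the split is even.

```python
import torch
import torch.nn as nn

class MBTCN(nn.Module):
    """Multi-branch temporal convolution (fig. 9), sketched."""
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 6                        # per-branch width
        self.split = nn.Conv2d(channels, 6 * c, 1)   # 1x1 channel split
        self.branches = nn.ModuleList(
            # Four kernel-3 temporal convolutions, dilation 1..4.
            [nn.Conv2d(c, c, (3, 1), padding=(d, 0), dilation=(d, 1))
             for d in (1, 2, 3, 4)]
            # Identity mapping and max pooling complete the six branches.
            + [nn.Identity(),
               nn.MaxPool2d((3, 1), stride=1, padding=(1, 0))])
        self.merge = nn.Conv2d(6 * c, channels, 1)   # final 1x1 convolution

    def forward(self, x):                        # x: (B, C, T, V)
        chunks = torch.chunk(self.split(x), 6, dim=1)
        out = torch.cat([b(ch) for b, ch in zip(self.branches, chunks)], dim=1)
        return self.merge(out)

# y = MBTCN(64)(torch.randn(2, 64, 32, 25))  # -> (2, 64, 32, 25)
```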
In some implementations, the target object corresponds to multiple types of skeletal pose information; extracting action features of the target object based on the grid mapping result, and performing action recognition by using the extracted action features to obtain an action recognition result, wherein the method comprises the following steps: performing action recognition on the grid mapping result of each type of skeleton gesture information to obtain initial recognition results respectively corresponding to each type of grid mapping result; and fusing each initial recognition result to obtain an action recognition result corresponding to the image to be recognized.
For example, referring to fig. 10, fig. 10 is a schematic diagram of motion recognition according to an exemplary embodiment of the present application, and as shown in fig. 10, the bone posture information includes bone point information, bone point motion information and bone motion information, and motion recognition is performed on each type of bone posture information to obtain an initial recognition result corresponding to the bone point information, an initial recognition result corresponding to the bone point motion information and an initial recognition result corresponding to the bone motion information, respectively.
And then fusing each initial recognition result to obtain an action recognition result corresponding to the image to be recognized.
Illustratively, fusing each initial recognition result to obtain an action recognition result corresponding to the image to be recognized, including: acquiring weight parameters corresponding to each type of bone posture information; and carrying out weighted summation on each initial recognition result by using the weight parameters to obtain a motion recognition result corresponding to the image to be recognized.
The importance of different types of bone posture information on the result of motion recognition is different, weight parameters corresponding to each type of bone posture information are obtained, and each initial recognition result is weighted and summed by utilizing the weight parameters to obtain the motion recognition result corresponding to the image to be recognized.
For example, the weight parameters corresponding to the different types of bone posture information may be preset; for instance, the ratio of the weight parameters corresponding to the bone point information, the bone information, the bone point motion information and the bone motion information may be set to 2:2:1:1.
As another example, the weight parameters corresponding to the different types of bone posture information can be selected flexibly according to the actual application conditions.
For example, the image content quality of the target object in the image to be identified is obtained, e.g., according to the image sharpness and whether occlusions are present, and the weight parameter corresponding to each type of bone posture information is determined according to the image content quality. When the image content quality is high, higher weights can be assigned to the recognition results corresponding to the bone point information and the bone information; when the image content quality is low, higher weights can be assigned to the recognition results corresponding to the bone point motion information and the bone motion information.
For example, the weight parameter corresponding to each type of bone posture information may be determined according to the information such as the object type to which the target object belongs, the height of the target object, and the type of the motion to be recognized.
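A sketch of the weighted score fusion; the stream names are illustrative, and the 2:2:1:1 weights follow the preset example above.

```python
import numpy as np

def fuse_scores(scores: dict, weights: dict) -> int:
    """Weighted sum of per-stream class-probability vectors; returns the
    index of the recognised action category."""
    fused = sum(w * np.asarray(scores[name]) for name, w in weights.items())
    return int(np.argmax(fused))

# Preset weights in the 2:2:1:1 ratio for the four bone posture streams.
weights = {"joint": 2, "bone": 2, "joint_motion": 1, "bone_motion": 1}
```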
By fusing the initial recognition results of the bone posture information of various types in the embodiment, the bone information can be effectively utilized, and the accuracy of the recognition result can be improved.
According to the bone action recognition method described above, an image to be identified containing a target object is acquired; key point detection is performed on the target object in the image, and the bone posture information of the target object is obtained based on the key point detection result; the bone posture information is mapped into a two-dimensional grid expression of a preset size to obtain a grid mapping result, thereby converting the irregular bone posture information into a regular and compact grid expression; and action features of the target object are extracted based on the grid mapping result and used for action recognition to obtain the action recognition result. This improves the efficiency of bone feature interaction and the accuracy of action recognition, and solves the problem that traditional graph convolution cannot effectively express actions that are spatially close but topologically distant.
Fig. 11 is a block diagram of a bone action recognition device, as shown in an exemplary embodiment of the present application. As shown in fig. 11, the exemplary bone action recognition apparatus 1100 includes: an image acquisition module 1110, a keypoint detection module 1120, a grid mapping module 1130, and an action recognition module 1140. Specifically:
An image acquisition module 1110, configured to acquire an image to be identified, where the image to be identified contains a target object;
the key point detection module 1120 is configured to perform key point detection on a target object in an image to be identified, and obtain skeletal posture information of the target object based on a key point detection result;
the grid mapping module 1130 is configured to map the skeleton gesture information to a two-dimensional grid expression with a preset size, so as to obtain a grid mapping result;
the action recognition module 1140 is configured to extract action features of the target object based on the mesh mapping result, and perform action recognition by using the extracted action features to obtain an action recognition result.
It should be noted that the bone action recognition apparatus provided in the foregoing embodiment and the bone action recognition method provided in the foregoing embodiments belong to the same concept; the specific manner in which each module and unit performs its operations has been described in detail in the method embodiments and is not repeated here. In practical applications, the apparatus provided in the foregoing embodiment may distribute the above functions to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above, which is not limited herein.
Referring to Fig. 12, Fig. 12 is a schematic structural diagram of an embodiment of an electronic device of the present application. The electronic device 1200 comprises a memory 1201 and a processor 1202, and the processor 1202 is configured to execute program instructions stored in the memory 1201 to implement the steps of any of the bone action recognition method embodiments described above. In one specific implementation scenario, the electronic device 1200 may include, but is not limited to, a microcomputer and a server; the electronic device 1200 may also be a mobile device such as a notebook computer or a tablet computer, which is not limited herein.
Specifically, the processor 1202 is configured to control itself and the memory 1201 to implement the steps of any of the bone action recognition method embodiments described above. The processor 1202 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 1202 may be an integrated circuit chip with signal processing capabilities. The processor 1202 may also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or any conventional processor. In addition, the processor 1202 may be jointly implemented by a plurality of integrated circuit chips.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of a computer readable storage medium of the present application. The computer readable storage medium 1300 stores program instructions 1310 that can be executed by a processor, where the program instructions 1310 are configured to implement steps in any of the above-described bone action recognition method embodiments.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments focuses on the differences between them; for parts that are the same or similar, the embodiments may be referred to one another, and the details are not repeated herein for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for identifying skeletal actions, comprising:
acquiring an image to be identified, wherein the image to be identified contains a target object and consists of a plurality of video frames;
performing key point detection on a target object in the image to be identified, and obtaining skeleton posture information of the target object based on a key point detection result;
mapping the skeleton posture information into a two-dimensional grid expression of a preset size to obtain a grid mapping result; the two-dimensional grid expression is used for converting the irregular skeleton posture information into a regular and compact grid mapping result, the grid mapping result consists of a plurality of orderly arranged grid elements, and each grid element stores the mapped skeleton posture information;
extracting spatial features from grid mapping results corresponding to each video frame in the image to be identified to obtain spatial features corresponding to each video frame respectively;
performing space-time feature extraction on each spatial feature to obtain space-time features corresponding to the target object;
and performing action recognition on the space-time features to obtain an action recognition result.
2. The method of claim 1, wherein the image to be identified is composed of a plurality of video frames, the key point detection result contains a plurality of key points, and the skeleton posture information comprises one or more of bone point information, bone information, bone point motion information, or bone motion information; the obtaining skeleton posture information of the target object based on the key point detection result comprises:
extracting bone points from the plurality of key points, and combining the bone points to obtain the bone point information of the target object; and/or,
connecting adjacent bone points to obtain a plurality of bone connecting lines, and combining the bone connecting lines to obtain the bone information of the target object; and/or,
calculating the difference between the bone point information of adjacent video frames to obtain the bone point motion information; and/or,
and calculating the difference between the bone information of the adjacent video frames to obtain bone motion information.
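Under the assumption that the key points of all video frames are stored as a NumPy array of shape (T, J, 2) and that the skeleton edges are given by a hypothetical parent-joint list, the four information types of claim 2 could be computed along the following lines (representing each bone connecting line as the vector from a joint to its adjacent joint is one common convention, not necessarily the one used here):

```python
import numpy as np

def bone_posture_information(keypoints, parents):
    """keypoints: (T, J, 2) key point coordinates over T video frames.
    parents: length-J list where parents[j] is the joint adjacent to joint j."""
    bone_points = keypoints                           # bone point information
    bones = keypoints - keypoints[:, parents, :]      # bone (connecting line) information
    bone_point_motion = np.diff(bone_points, axis=0)  # difference between adjacent frames
    bone_motion = np.diff(bones, axis=0)              # difference of bones between frames
    return bone_points, bones, bone_point_motion, bone_motion
```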
3. The method of claim 1, wherein the mapping the skeleton posture information into a two-dimensional grid expression of a preset size to obtain a grid mapping result comprises:
performing information expansion on the skeleton posture information to obtain expanded skeleton posture information;
and mapping the expanded skeleton posture information into a two-dimensional grid expression with a preset size to obtain a grid mapping result.
4. The method according to claim 3, wherein the performing information expansion on the skeleton posture information to obtain the expanded skeleton posture information comprises:
calculating interpolation data corresponding to the skeleton posture information by using an adjacency matrix, wherein the adjacency matrix contains topological structure information of the skeleton of the target object;
and combining the skeleton posture information and the interpolation data corresponding to the skeleton posture information to obtain the expanded skeleton posture information.
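The claims do not fix the interpolation formula, but one plausible reading of claims 3 and 4 is to average each element's topological neighbors via a row-normalized adjacency matrix and append the result as interpolation data; everything in this sketch is an assumption:

```python
import numpy as np

def expand_posture_information(pose, adjacency):
    """pose: (J, C) skeleton posture elements; adjacency: (J, J) skeleton topology,
    assumed to give every joint at least one neighbor (no all-zero rows)."""
    norm = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalize the topology
    interpolation = norm @ pose                              # neighbor-averaged interpolation data
    return np.concatenate([pose, interpolation], axis=0)     # expanded information, (2J, C)
```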
5. The method of claim 3, wherein the expanded skeleton posture information is composed of a plurality of elements to be mapped, and the number of elements to be mapped is equal to the number of grids in the two-dimensional grid expression; the mapping the expanded skeleton posture information into a two-dimensional grid expression of a preset size to obtain a grid mapping result comprises:
and mapping the elements to be mapped in the expanded skeleton posture information into the grids of the two-dimensional grid expression one by one to obtain the grid mapping result.
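If the number of expanded elements already equals the number of grid cells, the one-by-one mapping of claim 5 reduces to a reshape; a minimal sketch, assuming an (N, C) NumPy array and an H x W grid with N = H * W:

```python
def map_elements_to_grid(elements, grid_h, grid_w):
    # elements: (N, C) NumPy array; each of the N = grid_h * grid_w grid cells
    # stores exactly one mapped element.
    n, c = elements.shape
    assert n == grid_h * grid_w, "element count must equal the number of grid cells"
    return elements.reshape(grid_h, grid_w, c)
```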
6. The method of claim 1, wherein an action recognition network model is pre-trained, the action recognition network model comprising a batch normalization layer, a spatial channel reconstruction convolution layer, and a time domain convolution layer; the extracting spatial features from the grid mapping result corresponding to each video frame in the image to be identified to obtain the spatial features corresponding to each video frame respectively comprises:
inputting the grid mapping result to the batch normalization layer for normalization processing to obtain a normalization processing result;
inputting the normalization processing result into the spatial channel reconstruction convolution layer to fuse the skeleton posture information within the same video frame, so as to obtain the spatial feature of each video frame;
and the performing space-time feature extraction on each spatial feature to obtain the space-time features corresponding to the target object comprises:
performing dimension transformation and alignment on the spatial features of each video frame, inputting the transformed features into the time domain convolution layer for time-series modeling, and extracting the space-time features.
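A minimal PyTorch sketch of the network structure in claim 6; an ordinary 2D convolution stands in for the spatial channel reconstruction convolution layer (whose internal design the claim does not specify), a 1D convolution serves as the time domain convolution layer, and all shapes and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GridActionNet(nn.Module):
    def __init__(self, in_channels=2, feat_channels=32, num_classes=10):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)  # batch normalization layer
        # Stand-in for the spatial channel reconstruction convolution layer:
        self.spatial = nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(feat_channels, feat_channels, kernel_size=3, padding=1)
        self.head = nn.Linear(feat_channels, num_classes)

    def forward(self, grids):
        # grids: (T, C, H, W) -- one grid mapping result per video frame.
        x = self.bn(grids)
        x = torch.relu(self.spatial(x))   # fuse posture info within each frame
        x = x.mean(dim=(2, 3))            # (T, feat) per-frame spatial features
        x = x.t().unsqueeze(0)            # dimension transform/align: (1, feat, T)
        x = torch.relu(self.temporal(x))  # time-series modeling over frames
        x = x.mean(dim=2)                 # pooled space-time feature
        return self.head(x)               # action recognition scores

model = GridActionNet()
scores = model(torch.randn(16, 2, 5, 5))  # 16 frames of 5x5 grids with (x, y) channels
```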
7. The method according to any one of claims 1 to 6, wherein the target object corresponds to a plurality of types of skeleton posture information; the method further comprises:
performing action recognition on the grid mapping result of each type of skeleton posture information to obtain initial recognition results respectively corresponding to each type of grid mapping result;
and fusing each initial recognition result to obtain an action recognition result corresponding to the image to be recognized.
8. The method of claim 7, wherein the fusing each initial recognition result to obtain the action recognition result corresponding to the image to be recognized includes:
acquiring weight parameters corresponding to each type of skeleton posture information;
and performing weighted summation on each initial recognition result by using the weight parameters to obtain the action recognition result corresponding to the image to be recognized.
9. An electronic device comprising a memory and a processor for executing program instructions stored in the memory to implement the steps of the method according to any of claims 1-8.
10. A computer readable storage medium storing program instructions executable by a processor to perform the steps of the method according to any one of claims 1-8.
CN202311599865.XA 2023-11-28 2023-11-28 Bone action recognition method, device and storage medium Active CN117315791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311599865.XA CN117315791B (en) 2023-11-28 2023-11-28 Bone action recognition method, device and storage medium


Publications (2)

Publication Number Publication Date
CN117315791A CN117315791A (en) 2023-12-29
CN117315791B (en) 2024-02-20

Family

ID=89250244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311599865.XA Active CN117315791B (en) 2023-11-28 2023-11-28 Bone action recognition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117315791B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021057027A1 (en) * 2019-09-27 2021-04-01 北京市商汤科技开发有限公司 Human body detection method and apparatus, computer device, and storage medium
WO2021169839A1 (en) * 2020-02-29 2021-09-02 华为技术有限公司 Action restoration method and device based on skeleton key points
CN113361352A (en) * 2021-05-27 2021-09-07 天津大学 Student classroom behavior analysis monitoring method and system based on behavior recognition
CN114202804A (en) * 2022-02-15 2022-03-18 深圳艾灵网络有限公司 Behavior action recognition method and device, processing equipment and storage medium
CN114613013A (en) * 2022-03-18 2022-06-10 长沙理工大学 End-to-end human behavior recognition method and model based on skeleton nodes
CN114724241A (en) * 2022-03-29 2022-07-08 平安科技(深圳)有限公司 Motion recognition method, device, equipment and storage medium based on skeleton point distance

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102577472B1 (en) * 2018-03-20 2023-09-12 한국전자통신연구원 Apparatus and method for generating synthetic learning data for motion recognition


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Human posture recognition based on human skeleton point detection and multilayer perceptron; Duan Junchen; Liang Meixiang; Wang Rui; Electronic Measurement Technology (12); full text *
Human action recognition method based on projection features of skeleton joint points; Huang Xiaoyi; Modern Computer (Issue 36); full text *
3D mesh editing based on skeleton lines; Ma Zhanguo; Liu Bo; Zhang Hongbin; Journal of Beijing University of Technology (01); full text *

Also Published As

Publication number Publication date
CN117315791A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
CN110135319B (en) Abnormal behavior detection method and system
WO2021043168A1 (en) Person re-identification network training method and person re-identification method and apparatus
CN110334779B (en) Multi-focus image fusion method based on PSPNet detail extraction
EP4156017A1 (en) Action recognition method and apparatus, and device and storage medium
CN111291809B (en) Processing device, method and storage medium
CN113095370B (en) Image recognition method, device, electronic equipment and storage medium
CN111797983A (en) Neural network construction method and device
CN110222718B (en) Image processing method and device
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
Chen et al. Local patch network with global attention for infrared small target detection
KR20180004898A (en) Image processing technology and method based on deep learning
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN112085088A (en) Image processing method, device, equipment and storage medium
CN113435520A (en) Neural network training method, device, equipment and computer readable storage medium
CN112037228A (en) Laser radar point cloud target segmentation method based on double attention
CN116152611B (en) Multistage multi-scale point cloud completion method, system, equipment and storage medium
CN111985414B (en) Joint position determining method and device
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN112560865A (en) Semantic segmentation method for point cloud under outdoor large scene
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN117315791B (en) Bone action recognition method, device and storage medium
CN110826469B (en) Person detection method and device and computer readable storage medium
CN111401317A (en) Video classification method, device, equipment and storage medium
CN115861605A (en) Image data processing method, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant