CN115205975A - Behavior recognition method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Behavior recognition method and apparatus, electronic device, and computer-readable storage medium

Info

Publication number
CN115205975A
CN115205975A
Authority
CN
China
Prior art keywords
information
bone
target
preset
global point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210836365.2A
Other languages
Chinese (zh)
Inventor
邹昆 (Zou Kun)
张伟熙 (Zhang Weixi)
董帅 (Dong Shuai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China Zhongshan Institute
Original Assignee
University of Electronic Science and Technology of China Zhongshan Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China Zhongshan Institute filed Critical University of Electronic Science and Technology of China Zhongshan Institute
Priority to CN202210836365.2A priority Critical patent/CN115205975A/en
Publication of CN115205975A publication Critical patent/CN115205975A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/901 - Indexing; Data structures therefor; Storage structures
    • G06F16/9024 - Graphs; Linked lists
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The method combines a graph convolution layer with the encoder of the Transformer algorithm to recognize behavior from skeleton data, so that the behavior can be identified by a small model built from the graph convolution layer and the Transformer encoder, while the graph convolution layer fully analyzes the skeleton data. Although the Transformer encoder adopted by the method has only two layers, it can still recognize the temporal correlations in the skeleton data well, and the small two-layer structure yields faster recognition and inference, so the designed recognition model is compact yet highly accurate. Additionally, in the method, the graph convolution network is optimized with a global point, so that the features it extracts represent global features of the skeleton, further improving the behavior recognition accuracy of the scheme.

Description

Behavior recognition method and apparatus, electronic device, and computer-readable storage medium
Technical Field
The present application relates to the field of intelligent behavior recognition, and in particular, to a behavior recognition method, apparatus, electronic device, and computer-readable storage medium.
Background
In skeleton-based human motion recognition, human behavior is analyzed through the temporal and spatial changes of the human skeleton. Because the skeleton is unaffected by clothing changes, illumination conditions, or complex backgrounds, such recognition methods are highly reliable and widely applicable. Typically, during recognition the human skeleton is divided into different body parts, and the recognition system locates these parts to detect postures. This can improve security precautions at law-enforcement facilities, schools, airports, banks, commercial spaces, or office buildings on the one hand, and detect suspicious or abnormal behavior on the other, so it has great significance and application value in fields such as security and behavior monitoring.
At present, intelligent behavior analysis of skeleton data is a mature technology, and skeleton-based behavior recognition methods generally rely on deep neural networks for automatic recognition. However, the state-of-the-art network models currently applied to skeleton-based intelligent behavior recognition are extremely large, difficult to deploy on devices with limited computing power, and have a complex and slow recognition process.
Disclosure of Invention
An object of the embodiments of the present application is to provide a behavior recognition method, a behavior recognition apparatus, an electronic device, and a computer-readable storage medium, so as to solve the problems of current skeleton-based intelligent behavior recognition methods: large models, deployment difficulty, and a complex, slow recognition process.
In a first aspect, the present invention provides a behavior recognition method, including: constructing bone input information from the bone information to be identified and target global point information, where the target global point represents a point that receives an equal proportion of information from each skeleton point in the bone information to be identified; performing bone feature extraction on the bone input information through a target graph convolution layer to obtain first feature information, where the first feature information represents information obtained after the bone information to be identified and the target global point information have been mixed; and identifying the behavior corresponding to the bone information to be identified according to a target Transformer encoder and the first feature information.
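As a non-authoritative illustration of the construction step, the NumPy sketch below appends a global point to every frame of a skeleton sequence. The joint count, the channel layout, and the zero initialization of the global point are assumptions for the example only; the patent specifies merely that a global point is added that receives an equal proportion of information from every skeleton point.

```python
import numpy as np

def build_bone_input(skeleton, global_point):
    """Append a global point to every frame of a skeleton sequence.

    skeleton: (T, V, C) array - T frames, V skeleton points, C channels (e.g. x, y, z).
    global_point: (C,) vector shared across frames; in the graph it would be
    connected to every skeleton point with equal weight.
    Returns the (T, V + 1, C) bone input information.
    """
    T, V, C = skeleton.shape
    gp = np.broadcast_to(global_point, (T, 1, C))
    return np.concatenate([skeleton, gp], axis=1)

seq = np.random.rand(16, 25, 3)   # e.g. 16 frames, 25 joints, 3-D coordinates (assumed sizes)
gp = np.zeros(3)                  # in training this would be a learnable parameter
bone_input = build_bone_input(seq, gp)
print(bone_input.shape)           # (16, 26, 3)
```

The global point occupies one extra node per frame, so every later layer sees it alongside the real joints.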
With the behavior recognition method designed above, the scheme combines a graph convolution layer with the encoder of the Transformer algorithm to recognize behavior from skeleton data, so that a small model built from the graph convolution layer and the Transformer encoder can identify the behavior in the skeleton data, and the graph convolution layer can fully analyze that data. Although the adopted Transformer encoder has only two layers, it can still recognize the temporal correlations in the skeleton data well, and the small two-layer structure allows faster recognition and inference, so the designed model is compact yet highly accurate. In addition, the graph convolution network is optimized with a global point so that the features it extracts represent global features of the skeleton, further improving the accuracy of the scheme.
In an optional implementation of the first aspect, identifying the behavior corresponding to the bone information to be identified according to the target Transformer encoder and the first feature information includes: extracting the target global point feature information from the first feature information, and generating encoded input data from the target global point feature information and a target time vector; performing temporal feature extraction on the encoded input data through the target Transformer encoder to obtain second feature information, where the second feature information represents information obtained after the target global point feature information and the target time vector have been mixed; and identifying the behavior corresponding to the bone information to be identified according to the second feature information.
In the above embodiment, a target time vector is introduced into the scheme, so that after it is mixed with the time-dimension features, those features can be represented by the target time vector itself, which simplifies the algorithm and improves accuracy.
In an optional implementation of the first aspect, identifying the behavior corresponding to the bone information to be identified according to the second feature information includes: extracting target time vector feature information from the second feature information; inputting the target time vector feature information into a multilayer perceptron to obtain a probability for each behavior type output by the perceptron; and determining the behavior corresponding to the bone information to be identified from the probabilities of the behavior types.
In this embodiment, behavior type recognition based on the second feature information is performed by the multilayer perceptron, which guarantees the accuracy of the behavior recognition.
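A minimal sketch of such a classification head follows, assuming (since the patent does not specify them) a two-layer perceptron with ReLU, a softmax output, random illustrative weights, and 60 behavior classes:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_head(token, W1, b1, W2, b2):
    """Map the target time vector feature to one probability per behavior type."""
    h = np.maximum(0.0, token @ W1 + b1)   # hidden layer with ReLU
    return softmax(h @ W2 + b2)            # probability for each behavior type

rng = np.random.default_rng(0)
d, hidden, n_classes = 64, 128, 60         # assumed sizes for illustration
token = rng.normal(size=d)                 # time-vector feature from the encoder
probs = mlp_head(token,
                 rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden),
                 rng.normal(size=(hidden, n_classes)) * 0.1, np.zeros(n_classes))
print(int(np.argmax(probs)))               # index of the recognized behavior type
```

The recognized behavior is simply the class with the highest output probability.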
In an optional implementation of the first aspect, generating the encoded input data from the target global point feature information and the target time vector includes: splicing the target global point feature information and the target time vector to obtain first spliced data; and adding position codes to the first spliced data to obtain the encoded input data.
In the above embodiment, position coding is added on top of the introduced target time vector, so that even a single Transformer encoder can recognize the temporal order in the bone information; the identified features therefore carry temporal relevance, which improves the recognition accuracy of the Transformer encoder.
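The splicing and position-coding steps above can be sketched as follows. The sinusoidal position code is one common choice, assumed here for illustration; the patent does not state which position coding it uses, and the feature sizes are likewise assumptions:

```python
import numpy as np

def positional_encoding(n, d):
    """Standard sinusoidal position codes (an assumed, common choice)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def build_encoder_input(global_feats, time_vector):
    """Prepend the target time vector to the per-frame global-point features
    (the splicing step), then add position codes so the encoder sees frame order."""
    tokens = np.concatenate([time_vector[None, :], global_feats], axis=0)
    return tokens + positional_encoding(tokens.shape[0], tokens.shape[1])

T, d = 16, 64
global_feats = np.random.rand(T, d)   # global-point feature per frame (assumed shape)
time_vector = np.zeros(d)             # learnable class-token-style time vector
enc_in = build_encoder_input(global_feats, time_vector)
print(enc_in.shape)                   # (17, 64): time vector token + 16 frame tokens
```

Without the position codes, the encoder's self-attention would be order-invariant and could not exploit the frame sequence.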
In an optional implementation of the first aspect, performing feature extraction on the bone input information through the target graph convolution layer to obtain the first feature information includes: raising the feature dimension of the bone input information through the target graph convolution layer; and performing information mixing on the dimension-raised bone input information to obtain the first feature information.
In an optional implementation of the first aspect, performing information mixing on the dimension-raised bone input information to obtain the first feature information includes: generating the first feature information from the dimension-raised bone input information and a target graph adjacency matrix, where the target graph adjacency matrix is obtained by training a preset graph adjacency matrix in advance.
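A generic graph-convolution step matching this description is sketched below: a feature up-projection followed by mixing along the adjacency matrix. This is a standard GCN formulation, not the patent's exact layer; the uniform adjacency initialization (giving the global point an equal share from every joint) is an assumption consistent with the global point's stated role:

```python
import numpy as np

def graph_conv(x, A, W):
    """One graph-convolution step.

    x: (V, C_in) joint features for one frame; A: (V, V) graph adjacency matrix;
    W: (C_in, C_out) projection. First raise the feature dimension with W,
    then mix information across connected skeleton points with A.
    """
    h = x @ W        # feature dimension raising: C_in -> C_out
    return A @ h     # information mixing along the adjacency matrix

V, C_in, C_out = 26, 3, 64        # 25 joints + 1 global point (assumed sizes)
rng = np.random.default_rng(1)
A = np.full((V, V), 1.0 / V)      # uniform start; in the patent, A is trained
x = rng.normal(size=(V, C_in))
W = rng.normal(size=(C_in, C_out)) * 0.1
out = graph_conv(x, A, W)
print(out.shape)                  # (26, 64)
```

Because the adjacency matrix itself is a trained parameter, the layer can learn which joint-to-joint (and joint-to-global-point) connections matter for recognition.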
In an optional implementation of the first aspect, before constructing the bone input information from the bone information to be identified and the target global point information, the method further includes: acquiring a bone sample data set, where the set includes a plurality of bone sample data, each comprising a bone point sample and a corresponding behavior label; normalizing the plurality of bone sample data; constructing a plurality of bone sample input data from a preset global point and each normalized bone sample data; and training a preset global point, a preset graph convolution layer, a preset time vector, and a preset Transformer encoder on the plurality of bone sample input data to obtain the target global point, target graph convolution layer, target time vector, and target Transformer encoder.
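The patent only states that normalization is performed; one plausible form, shown purely as an assumed sketch, centers each frame on a root joint and scales by the sequence's overall spread:

```python
import numpy as np

def normalize_skeleton(seq, root=0):
    """Center each frame on a root joint and scale to unit spread.

    seq: (T, V, C) skeleton sample. The choice of root joint and the scaling
    rule are assumptions for illustration, not the patent's stated procedure.
    """
    centered = seq - seq[:, root:root + 1, :]              # root joint becomes the origin
    scale = np.linalg.norm(centered, axis=-1).max()        # largest joint distance
    return centered / (scale + 1e-8)

seq = np.random.rand(16, 25, 3) * 100 + 50   # raw coordinates in arbitrary units
norm = normalize_skeleton(seq)
print(norm.shape)                            # (16, 25, 3), coordinates in [-1, 1]
```

Normalization of this kind removes camera-position and body-size variation before the samples are combined with the preset global point.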
In an optional implementation of the first aspect, training the preset global point, preset graph convolution layer, preset time vector, and preset Transformer encoder on the plurality of bone sample input data includes: selecting a set of bone sample input data from the plurality as the current bone sample input set; inputting the current bone sample input set into the preset graph convolution layer to obtain first feature information for each current bone sample input data; extracting the preset global point feature information from the first feature information of each current bone sample input data, and generating encoded sample input data from the preset global point feature information and the preset time vector; inputting the encoded sample input data into the preset Transformer encoder to obtain second feature information for each encoded sample input data, and obtaining a corresponding classification result based on the second feature information and a preset multilayer perceptron; calculating the training loss of the current bone sample input set through a loss function by comparing the first feature information of each bone sample input data and the second feature information of each encoded sample input data with the corresponding behavior labels; updating the parameters of the preset global point, preset graph convolution layer, preset time vector, and preset Transformer encoder through a back-propagation algorithm and an optimization algorithm according to the training loss; judging whether the accumulated number of iterations exceeds a preset iteration count, or whether the parameters of the preset global point, preset graph convolution layer, preset time vector, and preset Transformer encoder have converged; if the accumulated number of iterations exceeds the preset count, or the parameters have converged, obtaining the target global point, target graph convolution layer, target time vector, and target Transformer encoder; and if the accumulated number of iterations does not exceed the preset count and the parameters have not converged, returning to the step of selecting a set of bone sample input data from the plurality as the current bone sample input set.
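The control flow of this training procedure (iterate until an iteration budget is exhausted or the training stops changing) can be sketched schematically. The loss, the update step, and the convergence test on consecutive losses are all stand-ins: the patent's actual loss function, back-propagation, and optimizer are not reproduced here.

```python
import numpy as np

def train(batches, params, step, loss_fn, max_iters=1000, tol=1e-5):
    """Schematic training loop: repeat until the iteration budget is exhausted
    or consecutive losses stop changing. `step` stands in for back-propagation
    plus an optimizer update (both assumptions in this sketch)."""
    it, prev = 0, None
    while it < max_iters:
        batch = batches[it % len(batches)]   # "select a set of bone sample input data"
        loss = loss_fn(params, batch)        # forward pass + training loss
        params = step(params, batch)         # parameter update from the loss
        if prev is not None and abs(prev - loss) < tol:
            break                            # treated here as convergence
        prev, it = loss, it + 1
    return params, it

# Toy usage: "training" a scalar toward the batch mean by gradient descent.
batches = [np.array([1.0, 3.0]), np.array([2.0, 2.0])]
loss_fn = lambda p, b: float(np.mean((b - p) ** 2))
step = lambda p, b: p - 0.1 * float(np.mean(2 * (p - b)))
params, iters = train(batches, 0.0, step, loss_fn)
print(round(params, 2))
```

In the patent's setting, `params` would bundle the preset global point, graph convolution layer, time vector, and Transformer encoder, and the loop's exit yields their target versions.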
In an optional implementation of the first aspect, constructing the bone input information from the bone information to be identified and the target global point information includes: acquiring the bone information to be identified and the target global point information; normalizing the bone information to be identified; and constructing the bone input information from the target global point information and the normalized bone information to be identified.
In a second aspect, the present invention provides a behavior recognition apparatus, which includes a construction module, a graph convolution module, and an identification module. The construction module is configured to construct bone input information from the bone information to be identified and target global point information, where the target global point represents a point that receives an equal proportion of information from each skeleton point in the bone information to be identified. The graph convolution module is configured to extract bone features from the bone input information through a target graph convolution layer to obtain first feature information, where the first feature information represents information obtained after the bone information to be identified and the target global point information have been mixed. The identification module is configured to identify the behavior corresponding to the bone information to be identified according to a target Transformer encoder and the first feature information.
With the behavior recognition apparatus designed above, the scheme combines a graph convolution layer with the encoder of the Transformer algorithm to recognize behavior from skeleton data, so that a small model built from the graph convolution layer and the Transformer encoder can identify the behavior in the skeleton data, and the graph convolution layer can fully analyze that data. Although the adopted Transformer encoder has only two layers, it can still recognize the temporal correlations in the skeleton data well, and the small two-layer structure allows faster recognition and inference, so the designed model is compact yet highly accurate, further improving the accuracy of the scheme.
In an optional implementation of the second aspect, the identification module is specifically configured to extract the target global point feature information from the first feature information and generate encoded input data from the target global point feature information and a target time vector; perform temporal feature extraction on the encoded input data through the target Transformer encoder to obtain second feature information, where the second feature information represents information obtained after the target global point feature information and the target time vector have been mixed; and identify the behavior corresponding to the bone information to be identified according to the second feature information.
In an optional implementation of the second aspect, the identification module is further specifically configured to extract target time vector feature information from the second feature information, input it into the multilayer perceptron to obtain a probability for each behavior type output by the perceptron, and determine the behavior corresponding to the bone information to be identified from the probabilities of the behavior types.
In an optional implementation of the second aspect, the identification module is further specifically configured to splice the target global point feature information and the target time vector to obtain first spliced data, and to add position codes to the first spliced data to obtain the encoded input data.
In an optional implementation of the second aspect, the graph convolution module is specifically configured to raise the feature dimension of the bone input information through the target graph convolution layer, and to perform information mixing on the dimension-raised bone input information to obtain the first feature information.
In an optional implementation of the second aspect, the graph convolution module is further specifically configured to generate the first feature information from the dimension-raised bone input information and a target graph adjacency matrix, where the target graph adjacency matrix is obtained by pre-training.
In an optional embodiment of the second aspect, the apparatus further comprises: an acquisition module, configured to acquire a bone sample data set, where the set includes a plurality of bone sample data, each comprising a bone point sample and a corresponding behavior label; and a normalization module, configured to normalize the plurality of bone sample data. The construction module is further configured to construct a plurality of bone sample input data from a preset global point and each normalized bone sample data; and a training module is configured to train a preset global point, a preset graph convolution layer, a preset time vector, and a preset Transformer encoder on the plurality of bone sample input data to obtain the target global point, target graph convolution layer, target time vector, and target Transformer encoder.
In an alternative embodiment of the second aspect, the training module is specifically configured to select a set of bone sample input data from the plurality as the current bone sample input set; input the current bone sample input set into the preset graph convolution layer to obtain first feature information for each current bone sample input data; extract the preset global point feature information from the first feature information of each current bone sample input data, and generate encoded sample input data from the preset global point feature information and the preset time vector; input the encoded sample input data into the preset Transformer encoder to obtain second feature information for each encoded sample input data; calculate the training loss of the current bone sample input set through a loss function by comparing the first feature information of each bone sample input data and the second feature information of each encoded sample input data with the corresponding behavior labels; update the parameters of the preset global point, preset graph convolution layer, preset time vector, and preset Transformer encoder through a back-propagation algorithm and an optimization algorithm according to the training loss; judge whether the accumulated number of iterations exceeds a preset iteration count, or whether the parameters of the preset global point, preset graph convolution layer, preset time vector, and preset Transformer encoder have converged; if the accumulated number of iterations exceeds the preset count, or the parameters have converged, obtain the target global point, target graph convolution layer, target time vector, and target Transformer encoder; and if the accumulated number of iterations does not exceed the preset count and the parameters have not converged, return to the step of selecting a set of bone sample input data from the plurality as the current bone sample input set.
In an optional implementation of the second aspect, the construction module is specifically configured to acquire the bone information to be identified and the target global point information, normalize the bone information to be identified, and construct the bone input information from the target global point information and the normalized bone information to be identified.
In a third aspect, the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to perform the method in any one of the first aspect and the optional implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method in the first aspect or any optional implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product, which when run on a computer causes the computer to execute the method in any one of the first aspect and the optional implementation manner of the first aspect.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a first flowchart of a behavior recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of global points and skeletal points provided in the embodiments of the present application;
fig. 3 is a second flowchart of a behavior recognition method provided in the embodiment of the present application;
fig. 4 is a third flowchart of a behavior recognition method according to an embodiment of the present application;
fig. 5 is a fourth flowchart of a behavior recognition method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a model structure provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 700 - construction module; 710 - graph convolution module; 720 - identification module; 730 - acquisition module; 740 - normalization module; 750 - training module; 8 - electronic device; 801 - processor; 802 - memory; 803 - communication bus.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are merely used to more clearly illustrate the technical solutions of the present application, and therefore are only examples, and the protection scope of the present application is not limited thereby.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof in the description and claims of this application and the description of the figures above, are intended to cover non-exclusive inclusions.
In the description of the embodiments of the present application, the technical terms "first", "second", and the like are used only for distinguishing different objects, and are not to be construed as indicating or implying relative importance or to implicitly indicate the number, specific order, or primary-secondary relationship of the technical features indicated. In the description of the embodiments of the present application, "a plurality" means two or more unless specifically defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
In the description of the embodiments of the present application, the term "and/or" is only one kind of association relationship describing the association object, and means that three relationships may exist, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
In the description of the embodiments of the present application, the term "plurality" refers to two or more (including two), and similarly, "plural sets" refers to two or more (including two), and "plural pieces" refers to two or more (including two).
In the description of the embodiments of the present application, the terms "center", "longitudinal", "transverse", "length", "width", "thickness", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", "circumferential", and the like, indicate orientations and positional relationships that are based on the orientations and positional relationships shown in the drawings, and are used for convenience in describing the embodiments of the present application and for simplification of the description, but do not indicate or imply that the device or element referred to must have a specific orientation, be configured and operated in a specific orientation, and thus, should not be construed as limiting the embodiments of the present application.
In the description of the embodiments of the present application, unless otherwise explicitly stated or limited, the terms "mounted," "connected," "fixed," and the like are used in a broad sense, and for example, may be fixedly connected, detachably connected, or integrated; mechanical connection or electrical connection is also possible; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the embodiments of the present application can be understood by those of ordinary skill in the art according to specific situations.
At present, intelligent identification methods that analyze target behaviors based on bone data are gradually becoming mainstream, and behavior identification plays a key role in analyzing and judging the purposes and patterns of a target's behavior.
The inventor finds that the commonly used intelligent behavior identification methods based on bone data generally adopt deep learning models, but current deep learning models are large and difficult to deploy on conventional computing equipment, and their computing processes are complex and slow.
On this basis, the applicant provides a behavior recognition method and apparatus, an electronic device, and a computer-readable storage medium. The scheme combines a graph convolution layer with an encoder from the Transformer algorithm to perform behavior recognition on skeleton data, so that a small model based on the graph convolution layer and the Transformer encoder can recognize the behavior reflected in the skeleton data. The graph convolution layer can fully analyze the skeleton data; the Transformer encoder, although only two layers deep, can well recognize the relevance of the skeleton data in the time dimension, and its small two-layer structure yields a high recognition inference speed. The designed recognition model is therefore small yet highly accurate. In addition, the graph convolution network is optimized with a global point so that the features it extracts represent the global features of the skeleton, and a target time vector (class-token) is introduced to improve feature recognition in the time dimension, further improving the accuracy of the scheme.
The embodiment of the present application provides a behavior recognition method. The behavior recognition method may be applied to a computing device, where the computing device includes but is not limited to a computer, a server, and the like. As shown in fig. 1, the behavior recognition method may be implemented as follows:
step S100: and constructing according to the bone information to be identified and the target global point information to obtain bone input information.
Step S110: and carrying out bone feature extraction on the bone input information through the target graph convolution layer to obtain first feature information.
Step S120: and identifying the behavior corresponding to the bone information to be identified according to the target Transformer encoder and the first characteristic information.
In the above embodiment, the bone information to be identified may include one or a batch (a plurality) of bone data to be identified. Specifically, the bone data to be identified may include five dimensions: a frame number T, bone points V, a bone group number M, a channel number C, and a sample number N, forming a five-dimensional matrix (N, C, T, V, M). The frame number T represents the length of the behavior action; the bone points V represent the nose, chin, neck, thumb, and so on; the bone group number M represents the number of recognized people; and the channel number C varies with the dimensionality of the bone data. For example, if the bone information to be identified is spatial bone data, the channel number C is 3, representing the x, y, and z axes; if the bone information to be identified is planar bone data, the channel number C is 2, representing the x and y axes.
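As an illustrative sketch of the five-dimensional layout described above (all sizes below are invented for the example; the scheme itself does not fix them):

```python
import numpy as np

# Illustrative sizes (assumptions for the example only):
# N samples, C channels (x/y/z), T frames, V bone points, M people
N, C, T, V, M = 4, 3, 64, 25, 2

# A batch of spatial skeleton data: five dimensions (N, C, T, V, M)
skeleton = np.random.randn(N, C, T, V, M)
assert skeleton.shape == (4, 3, 64, 25, 2)

# Planar skeleton data would use C = 2 (x- and y-axis only)
planar = np.random.randn(N, 2, T, V, M)
assert planar.shape == (4, 2, 64, 25, 2)
```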
The target global point represents a point that obtains a consistent proportion of information from each bone point in the bone information to be identified. For the global point shown in fig. 2, the distances from the target global point to each bone point are equal, that is, the proportion of information obtained from each bone point is consistent. On this basis, the scheme can obtain the information of each bone point through the target global point, and each bone point can likewise obtain the information of the target global point. The target global point can be obtained by training a preset global point.
As for step S100, the present scheme constructs the bone input information based on the bone information to be recognized and the target global point information. As a possible implementation, since the bone information to be recognized includes a batch of bone data, in order to increase the convergence rate of the model, the scheme may first perform normalization processing on the bone information to be recognized, and then construct the bone input information according to the target global point information and the normalized bone information to be recognized. The normalization processing may specifically be batch normalization.
In addition, since the scheme identifies the behavior of each person, besides normalizing the bone information to be identified, the sample number N in the bone information to be identified can be merged with the bone group number M before the bone input information is constructed. For example, assuming that the bone information to be identified is (N, C, T, V, M), the bone information after batch normalization and merging of the sample number N with the bone group number M may be (N × M, C, T, V), and the bone input information constructed by combining this information with the target global point may then be (N × M, C, T, V + 1).
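A minimal sketch of this construction step, under some assumptions: batch normalization is approximated by a plain per-channel standardization, and the appended global point is zero-initialized (in practice it would be a trainable preset global point):

```python
import numpy as np

N, C, T, V, M = 4, 3, 64, 25, 2
x = np.random.randn(N, C, T, V, M)

# Simplified stand-in for batch normalization over the channel dimension
mean = x.mean(axis=(0, 2, 3, 4), keepdims=True)
std = x.std(axis=(0, 2, 3, 4), keepdims=True) + 1e-5
x = (x - mean) / std

# Merge the sample dimension N with the bone-group dimension M: (N*M, C, T, V)
x = x.transpose(0, 4, 1, 2, 3).reshape(N * M, C, T, V)

# Append the global point along the bone-point dimension: (N*M, C, T, V+1)
# (zero-initialized here; the scheme trains it as a preset global point)
g = np.zeros((N * M, C, T, 1))
x = np.concatenate([x, g], axis=-1)
assert x.shape == (N * M, C, T, V + 1)
```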
After the bone input information is obtained in the above manner, the scheme can extract the bone features of the bone input information through the target graph convolution layer to obtain the first feature information. The target graph convolution layer is obtained through pre-training, and the first feature information represents the information obtained after the information of the bone information to be recognized and the target global point information has been mixed.
As a possible implementation, the target graph convolution layer may first perform feature dimension raising on the bone input information, and then perform information mixing on the dimension-raised bone input information, thereby obtaining the first feature information.
Specifically, the target graph convolution layer may expand the channel-number dimension of the bone input information so as to increase the features of the bone data. For example, from the original channel number C (e.g., 3), the target graph convolution layer may expand the channel number to C1 (e.g., 32, 64, or 128), obtaining the dimension-raised bone input information (N × M, T, C1, V + 1).
In addition, the target graph convolution layer may specifically generate the first feature information from the dimension-raised bone input information and a target adjacency matrix; the matrix (N × M, T, C1, V + 1) of the dimension-raised bone input information may be multiplied by the target adjacency matrix to obtain the first feature information. The target adjacency matrix is obtained by pre-training a preset graph adjacency matrix; because the bone-point dimension V is increased by 1, the graph adjacency matrix M should be changed from a (V, V) matrix to a (V + 1, V + 1) matrix. The row and column of M corresponding to the added global point may be initialized to 1/(V + 1), so that the global point exchanges an equal share of information with every bone point.
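A minimal numerical sketch of this graph-convolution step, assuming the shapes quoted above, a 1×1-convolution-style channel expansion (a (C1, C) weight matrix), and a uniform 1/(V+1) fill for the untrained adjacency matrix (the initialization value and weight scale are assumptions):

```python
import numpy as np

NM, C, T, V1 = 8, 3, 64, 26   # V1 = V + 1 after adding the global point
C1 = 32                        # raised channel count (assumed)

x = np.random.randn(NM, T, C, V1)   # bone input arranged as (N*M, T, C, V+1)

# Feature dimension raising: a 1x1 convolution over channels is a (C1, C) matrix
W = np.random.randn(C1, C) * 0.01
x = np.einsum('ntcv,dc->ntdv', x, W)        # -> (N*M, T, C1, V+1)

# Trainable (V+1, V+1) adjacency matrix; the extra row/column couples the
# global point to every bone point (uniform 1/(V+1) fill is an assumption)
A = np.full((V1, V1), 1.0 / V1)
out = x @ A                                  # -> (N*M, T, C1, V+1)
assert out.shape == (NM, T, C1, V1)
```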
After the first characteristic information is obtained in the above mode, the behavior corresponding to the bone information to be recognized is recognized according to the target Transformer encoder and the first characteristic information.
As a possible implementation manner, as shown in fig. 3, the present solution may specifically implement step S120 in the following manner, including:
step S300: and extracting target global point feature information in the first feature information.
Step S310: and generating coding input data according to the target global point characteristic information and the target time vector.
Step S320: and performing time characteristic extraction on the coded input data through a target Transformer coder to obtain second characteristic information.
Step S330: and identifying the behavior corresponding to the bone information to be identified according to the second characteristic information.
In the above embodiment, since the first feature information output by the target graph convolution layer is obtained by mixing the information of the bone information to be recognized with the target global point information, the target global point feature information within the first feature information can represent the feature information of every bone point. Therefore, the scheme can extract the target global point feature information from the first feature information; since only one global point is extracted, the matrix of the target global point feature information can be simplified to (N × M, C1, T).
On the basis of the above, the present solution may generate the encoded input data based on the target global point feature information and a target time vector, where the target time vector is also referred to as class-token, and may be obtained by training a preset time vector, and the preset time vector may be a random matrix.
Specifically, the target time vector and the target global point feature information may first be spliced to obtain the spliced data. For example, assuming that the target time vector is (N × M, C1, 1), it may be spliced with the target global point feature information (N × M, C1, T) to obtain the spliced data (N × M, C1, T + 1).
Further, since in the general case the second dimension of the target time vector is the time dimension and the third dimension is the channel dimension, i.e., the target time vector is (N × M, 1, C1), the present solution may first transpose the target global point feature information from (N × M, C1, T) to (N × M, T, C1) and then concatenate it with the target time vector (N × M, 1, C1) to obtain the spliced data (N × M, T + 1, C1).
Because the scheme needs to identify multiple frames of bone data with a temporal order, and a Transformer encoder alone cannot distinguish temporal order, position codes can be added to the obtained spliced data to produce the encoded input data. The Transformer encoder can then recognize the temporal order of the encoded input data, identifying the time sequence of the multiple frames of bone data and the relevance between successive actions, which improves the accuracy of behavior identification.
Specifically, the spliced data may be combined with a trainable matrix to obtain the encoded input data; for example, the trainable matrix P (1, T + 1, C1) may be added to the spliced data (N × M, T + 1, C1), broadcast over the batch dimension, and the result output as the encoded input data. In addition, it should be noted here that, besides generating the encoded input data in the above manner, the scheme may also directly add position encoding to the first feature information without introducing a target time vector, so as to obtain the encoded input data.
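The dimension-alignment steps above can be sketched as follows. Assumptions: the global point sits at the last bone index, the class-token is zero-initialized (in practice it is trainable), and the position encoding is combined by element-wise addition broadcast over the batch:

```python
import numpy as np

NM, T, C1, V1 = 8, 64, 32, 26
feat = np.random.randn(NM, T, C1, V1)   # first feature information from the GCN

# Keep only the global point's features (last bone index assumed): (N*M, T, C1)
g = feat[..., -1]

# Prepend a class token along the time dimension: (N*M, T+1, C1)
class_token = np.zeros((1, 1, C1))       # zero-initialized here; trainable in practice
seq = np.concatenate([np.repeat(class_token, NM, axis=0), g], axis=1)

# Add a trainable position encoding P of shape (1, T+1, C1), broadcast over batch
P = np.random.randn(1, T + 1, C1) * 0.01
encoded = seq + P
assert encoded.shape == (NM, T + 1, C1)
```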
The scheme inputs the obtained encoded input data into the Transformer encoder, which extracts the time features in the encoded input data to obtain the second feature information. That is, the Transformer encoder extracts features only along the time dimension (the frame number T) of the encoded input data, and the resulting second feature information represents the information obtained after the target global point feature information and the target time vector have been mixed.
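To illustrate how attention mixes information across the T + 1 time steps while preserving the (N × M, T + 1, C1) shape, here is a heavily simplified two-layer sketch; the real encoder's trained projections, multi-head attention, feed-forward sublayers, and LayerNorm are all omitted:

```python
import numpy as np

def self_attention(x):
    # Minimal single-head self-attention over the time dimension.
    # Untrained q/k/v projections are omitted for brevity (an assumption).
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)        # (batch, T+1, T+1)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)    # softmax over time
    return attn @ x

NM, T1, C1 = 8, 65, 32
x = np.random.randn(NM, T1, C1)   # encoded input data (N*M, T+1, C1)
for _ in range(2):                # the scheme uses only two encoder layers
    x = x + self_attention(x)     # residual connection
assert x.shape == (NM, T1, C1)    # second feature information keeps the shape
```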
Finally, the behavior corresponding to the bone information to be recognized can be recognized based on the second feature information, as a possible implementation manner, as shown in fig. 4, the scheme can be specifically implemented in the following manner:
step S400: and extracting target time vector characteristic information in the second characteristic information.
Step S410: and inputting the target time vector characteristic information into the multilayer perceptron to obtain the probability corresponding to each behavior type output by the multilayer perceptron.
Step S420: and determining the behavior type corresponding to the bone information to be identified according to the probability corresponding to each behavior type.
In the above-described embodiment, when the target time vector is introduced into the encoded input data, the target time vector is mixed with the time dimension of the target global point feature information; therefore, extracting only the feature information corresponding to the target time vector yields feature information that represents the time dimension of all the bone points. For example, the Transformer encoder outputs (N × M, T + 1, C1); on this basis, the scheme takes only the first index of the second dimension, obtaining the target time vector feature information, i.e., (N × M, C1).
In the above embodiment, the multi-layer perceptron may identify the probability that each piece of bone data to be identified belongs to each behavior type based on the target time vector feature information, so that the behavior type with the highest probability is determined as the type corresponding to the bone information to be identified. Specifically, for example, the multi-layer perceptron may convert the aforementioned (N × M, C1) matrix into an (N, C1, M) matrix, average over the third dimension to reduce it to (N, C1), and then pass the result through two fully-connected layers: the first fully-connected layer expands (N, C1) to (N, 512), and the second reduces (N, 512) to (N, S), where S is the number of categories, thereby determining the identification type corresponding to the bone information to be identified.
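A sketch of this classification head, under some assumptions: the (N × M, C1) rows are grouped sample-major so a reshape is equivalent to the transpose-then-average described above, a ReLU sits between the two fully-connected layers, and S and the weight scales are invented:

```python
import numpy as np

N, M, C1, S = 4, 2, 32, 10           # S behavior classes (assumed)
tok = np.random.randn(N * M, C1)     # class-token features from the encoder

# Regroup per sample and average over the M recognized people: (N, C1)
x = tok.reshape(N, M, C1).mean(axis=1)

# Two fully-connected layers: expand to 512, then project to S classes
W1 = np.random.randn(C1, 512) * 0.01
W2 = np.random.randn(512, S) * 0.01
logits = np.maximum(x @ W1, 0) @ W2  # ReLU between layers is an assumption

# Softmax probabilities; the highest-probability class is the predicted behavior
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)
pred = probs.argmax(axis=1)
assert probs.shape == (N, S) and pred.shape == (N,)
```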
It should be noted here that, when the target time vector is not introduced into the encoded input data, the second feature information output by the Transformer encoder may be input into the multi-layer perceptron in its entirety, so that the multi-layer perceptron performs the behavior recognition. For example, the entire (N × M, T + 1, C1) matrix output by the Transformer encoder is input into the multi-layer perceptron.
In the behavior identification method designed above, the scheme combines a graph convolution layer with an encoder from the Transformer algorithm to perform behavior identification on the bone data, so that a small model based on the graph convolution layer and the Transformer encoder can recognize the behavior of the bone data. The graph convolution layer can fully analyze the bone data; the Transformer encoder, although only two layers deep, can well identify the relevance of the bone data in the time dimension, and its small two-layer structure yields a high recognition inference speed, so the designed model is small yet highly accurate. In addition, the graph convolution network is optimized with a global point so that the extracted features represent the global features of the skeleton, and the target time vector (class-token) is introduced to improve feature identification in the time dimension, further improving the accuracy of the scheme.
In an optional implementation manner of this embodiment, the target global point information, the target graph convolution layer, the target time vector, the target adjacency matrix, and the target Transformer encoder may all be obtained through pre-training. Specifically, as shown in fig. 5, the training process may be implemented as follows:
step S500: the method comprises the steps of obtaining a bone sample data set, wherein the bone sample data set comprises a plurality of bone sample data, and each bone sample data comprises a bone point sample and a corresponding behavior label.
Step S510: and carrying out normalization processing on the plurality of bone sample data to obtain the plurality of bone sample data after normalization processing.
Step S520: and constructing a plurality of bone sample input data according to the preset global points and each bone sample data after normalization processing.
Step S530: and training a preset global point, a preset graph convolution layer, a preset time vector and a preset Transformer encoder according to the input data of the plurality of bone samples to obtain a target global point, a target graph convolution layer, a target time vector and a target Transformer encoder.
In the above embodiment, the scheme first obtains a bone sample data set comprising a batch of bone sample data; then, to accelerate the training convergence rate of the model, batch normalization is performed on the plurality of bone sample data to obtain the normalized bone sample data; next, a plurality of bone sample input data are constructed from a preset global point (a preset random matrix) and each piece of normalized bone sample data; and finally training is performed to obtain the target global point, target graph convolution layer, target time vector, and target Transformer encoder.
Specifically, for step S530, a group of bone sample input data may be selected from the plurality of bone sample input data as a current bone sample input set; the current bone sample input set is input into the preset graph convolution layer to obtain the first feature information corresponding to each current bone sample input data; the preset global point feature information in the first feature information corresponding to each current bone sample input data is then extracted, and encoded sample input data is generated from the preset global point feature information and the preset time vector; the encoded sample input data is input into the preset Transformer encoder to obtain the second feature information corresponding to each piece of encoded sample input data, and a classification result for the corresponding encoded sample input data is obtained based on the second feature information. The manner of constructing the global point and introducing the time vector is the same as described above and is not repeated here.
On this basis, the scheme calculates the training loss corresponding to the current bone sample input set through a comparison loss function, according to the first feature information corresponding to each bone sample input data, the second feature information corresponding to each piece of encoded sample input data, and the corresponding behavior labels. The parameters corresponding to the preset global point, the preset graph convolution layer, the preset time vector, and the preset Transformer encoder are then updated and iterated according to the training loss using a back-propagation algorithm and an optimization algorithm. The scheme judges whether the current accumulated number of iterations exceeds a preset iteration count, or whether all the parameters corresponding to the preset global point, the preset graph convolution layer, the preset time vector, and the preset Transformer encoder have converged. If the accumulated number of iterations exceeds the preset iteration count, or all these parameters have converged, the target global point, target graph convolution layer, target time vector, and target Transformer encoder are obtained; otherwise, the process returns to the step of selecting a group of bone sample input data from the plurality of bone sample input data as the current bone sample input set.
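The iteration-and-stopping logic of step S530 can be sketched with a toy scalar standing in for all trainable parameters (the model, loss, learning rate, iteration cap, and tolerance are all assumptions for illustration — the real scheme trains the global point, graph convolution layer, time vector, and Transformer encoder jointly):

```python
def train(max_iters=200, lr=0.1, tol=1e-6):
    # Toy "parameters": a single scalar minimizing the loss (w - 1)^2.
    w = 5.0
    for it in range(max_iters):
        grad = 2 * (w - 1.0)          # back-propagated gradient of the loss
        new_w = w - lr * grad          # optimizer update step
        if abs(new_w - w) < tol:       # convergence check on the parameters
            return new_w, it + 1
        w = new_w
    return w, max_iters                # preset iteration count exceeded

w, iters = train()
# Stops either on convergence or when the iteration cap is reached
assert abs(w - 1.0) < 1e-3 and iters <= 200
```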
In addition, it should be noted that, if the foregoing multi-layer perceptron is used in performing behavior recognition, the foregoing multi-layer perceptron may also be trained in advance through the second feature information corresponding to each piece of coded sample input data and the corresponding classification label.
In an optional implementation manner of this embodiment, as a specific example, a behavior recognition manner designed in this embodiment may be as shown in fig. 6, where BN represents normalization processing; GCN represents a graph convolution layer; dimension alignment represents the introduction of a target time vector and position encoding; MLP stands for multi-layer perceptron.
On this basis, the input bone data to be recognized first enters the BN layer for normalization; the global point is then introduced into the normalized bone data between the BN layer and the GCN layer; the bone data with the global point introduced is input into the GCN layer, which performs feature dimension raising and information mixing on the bone data, and only the data at the first index of the bone-point dimension is extracted to obtain the target global point feature information. The class-token and position encoding are then introduced in the dimension alignment layer to generate the encoded input data, which is input into the Transformer encoder for time feature extraction; the data at the first index of the time dimension is extracted from the encoder output to obtain the target time vector feature information, and the multi-layer perceptron MLP then performs behavior recognition based on this feature information to obtain the behavior type corresponding to the bone data to be recognized.
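The pipeline of fig. 6 can be summarized as a pure shape trace (all sizes are assumptions; the two-layer Transformer encoder is omitted since it preserves the (N × M, T + 1, C1) shape, and all values are placeholders — only the shapes are traced):

```python
import numpy as np

# BN -> add global point -> GCN -> dimension alignment -> Transformer -> MLP
N, C, T, V, M, C1, S = 4, 3, 64, 25, 2, 32, 10
x = np.random.randn(N * M, C, T, V)                        # after BN and N/M merge
x = np.concatenate([x, np.zeros((N * M, C, T, 1))], -1)    # global point: V+1
x = np.einsum('nctv,dc->ntdv', x, np.random.randn(C1, C))  # GCN channel raise
g = x[..., -1]                                             # global point features
seq = np.concatenate([np.zeros((N * M, 1, C1)), g], 1)     # class token: T+1
seq = seq + np.zeros((1, T + 1, C1))                       # position encoding
# (two-layer Transformer encoder would act here, keeping (N*M, T+1, C1))
tok = seq[:, 0, :]                                         # class-token output
logits = tok.reshape(N, M, C1).mean(1) @ np.random.randn(C1, S)
assert logits.shape == (N, S)                              # one score per class
```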
Fig. 7 shows a schematic block diagram of a behavior recognition device provided by the present application. It should be understood that the device corresponds to the method embodiments of figs. 1 to 6 and can execute the steps of the foregoing method; for the specific functions of the device, reference may be made to the description above, and detailed description is omitted here as appropriate to avoid redundancy. The device includes at least one software functional module that can be stored in memory in the form of software or firmware or solidified in the operating system (OS) of the device. Specifically, the device includes: a construction module 700, a graph convolution module 710, and an identification module 720. The construction module 700 is configured to construct bone input information from the bone information to be identified and the target global point information, where the target global point represents a point that obtains a consistent proportion of information from each bone point in the bone information to be identified. The graph convolution module 710 is configured to perform bone feature extraction on the bone input information through the target graph convolution layer to obtain first feature information, where the first feature information represents the information obtained after the information of the bone information to be identified and the target global point information has been mixed. The identification module 720 is configured to identify, according to the target Transformer encoder and the first feature information, the behavior corresponding to the bone information to be identified.
With the behavior recognition device designed above, the scheme combines a graph convolution layer with a Transformer encoder to perform behavior recognition on skeletal data, so that a small model based on the graph convolution layer and the Transformer encoder can recognize the behavior of the skeletal data. The graph convolution layer can fully analyze the skeletal data; the Transformer encoder, although only two layers deep, can well recognize the relevance of the skeletal data in the time dimension, and its small two-layer structure yields a fast recognition inference speed, so the designed recognition model is small yet highly accurate, further improving the accuracy of the scheme.
In an optional implementation manner of this embodiment, the identifying module 720 is specifically configured to extract target global point feature information in the first feature information, and generate encoded input data according to the target global point feature information and the target time vector; performing time feature extraction on the coded input data through a target Transformer coder to obtain second feature information, wherein the second feature information represents information obtained after the target global point feature information and the target time vector are mixed; and identifying the behavior corresponding to the bone information to be identified according to the second characteristic information.
In an optional implementation manner of this embodiment, the identifying module 720 is further specifically configured to extract target time vector feature information in the second feature information, input the target time vector feature information into the multi-layer perceptron, and obtain a probability corresponding to each behavior type output by the multi-layer perceptron; and determining the behavior corresponding to the bone information to be identified according to the probability corresponding to each behavior type.
In an optional implementation manner of this embodiment, the identifying module 720 is further specifically configured to splice the target global point feature information and the target time vector to obtain first spliced data; and adding position codes to the first splicing data to obtain coded input data.
In an optional implementation manner of this embodiment, the graph convolution module 710 is specifically configured to perform feature dimension raising on the bone input information by using the target graph convolution layer, and to perform information mixing on the dimension-raised bone input information to obtain the first feature information.
In an optional implementation manner of this embodiment, the graph convolution module 710 is further specifically configured to generate the first feature information according to the dimension-raised bone input information and a target adjacency matrix, where the target adjacency matrix is obtained through pre-training.
In an optional implementation manner of this embodiment, the apparatus further includes an obtaining module 730, configured to obtain a bone sample data set, where the bone sample data set includes a plurality of bone sample data, and each bone sample data includes a bone point sample and a corresponding behavior label; a normalization module 740, configured to perform normalization processing on the plurality of bone sample data to obtain the normalized bone sample data; the construction module 700, further configured to construct a plurality of bone sample input data according to a preset global point and each piece of normalized bone sample data; and a training module 750, configured to train a preset global point, a preset graph convolution layer, a preset time vector, and a preset Transformer encoder according to the plurality of bone sample input data, so as to obtain the target global point, target graph convolution layer, target time vector, and target Transformer encoder.
In an optional implementation manner of this embodiment, the training module 750 is specifically configured to: select a group of bone sample input data from the plurality of bone sample input data as a current bone sample input set; input the current bone sample input set into the preset graph convolution layer to obtain the first feature information corresponding to each current bone sample input data; extract the preset global point feature information in the first feature information corresponding to each current bone sample input data, and generate encoded sample input data according to the preset global point feature information and the preset time vector; input the encoded sample input data into the preset Transformer encoder to obtain the second feature information corresponding to each piece of encoded sample input data; calculate the training loss corresponding to the current bone sample input set through a comparison loss function according to the first feature information corresponding to each bone sample input data, the second feature information corresponding to each piece of encoded sample input data, and the corresponding behavior labels; update and iterate the parameters corresponding to the preset global point, the preset graph convolution layer, the preset time vector, and the preset Transformer encoder according to the training loss, a back-propagation algorithm, and an optimization algorithm; judge whether the current accumulated number of iterations exceeds a preset iteration count, or whether all the parameters corresponding to the preset global point, the preset graph convolution layer, the preset time vector, and the preset Transformer encoder have converged; if the accumulated number of iterations exceeds the preset iteration count or all these parameters have converged, obtain the target global point, target graph convolution layer, target time vector, and target Transformer encoder; and if the accumulated number of iterations does not exceed the preset iteration count and the parameters have not converged, return to the step of selecting a group of bone sample input data from the plurality of bone sample input data as the current bone sample input set.
In an optional implementation manner of this embodiment, the building module 700 is specifically configured to obtain information of a bone to be identified and information of a target global point; carrying out normalization processing on the bone information to be identified; and constructing skeleton input information according to the target global point information and the skeleton information to be identified which is subjected to normalization processing.
According to some embodiments of the present application, as shown in fig. 8, the present application provides an electronic device 8 comprising: a processor 801 and a memory 802, the processor 801 and the memory 802 being interconnected and communicating with each other via a communication bus 803 and/or another form of connection mechanism (not shown), the memory 802 storing a computer program executable by the processor 801, and the processor 801 executing the computer program when the computing device runs so as to perform the method in any alternative implementation, such as steps S100 to S120: constructing according to the bone information to be identified and the target global point information to obtain bone input information; carrying out bone feature extraction on the bone input information through the target graph convolution layer to obtain first feature information; and identifying the behavior corresponding to the bone information to be identified according to the target Transformer encoder and the first feature information.
The present application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the method of any one of the previous alternative implementations.
The storage medium may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
The present application provides a computer program product which, when run on a computer, causes the computer to perform the method of any of the alternative implementations.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications and substitutions do not depart from the scope of the claims and the specification. In particular, the technical features mentioned in the embodiments can be combined in any way as long as there is no structural conflict. The present application is not limited to the particular embodiments disclosed herein but covers all embodiments falling within the scope of the appended claims.

Claims (10)

1. A method of behavior recognition, comprising:
constructing bone input information according to bone information to be identified and target global point information; wherein the target global point represents a point that obtains an equal proportion of information from each skeleton point in the bone information to be identified;
performing bone feature extraction on the bone input information through a target graph convolution layer to obtain first feature information; wherein the first feature information represents information obtained after information mixing of the bone information to be identified and the target global point information is completed;
and identifying the behavior corresponding to the bone information to be identified according to a target Transformer encoder and the first feature information.
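One way to realize the global point of claim 1 is to append a virtual node to the skeleton graph and connect it to every joint with equal weight, so that each skeleton point contributes the same proportion of information to it. The uniform-weight scheme below is an assumption for illustration, not the patent's stated construction:

```python
import numpy as np

def add_global_point(adjacency):
    # adjacency: (V, V) skeleton graph; returns (V+1, V+1) with a global node
    V = adjacency.shape[0]
    A = np.zeros((V + 1, V + 1))
    A[:V, :V] = adjacency
    A[V, :V] = 1.0 / V     # global point aggregates all joints in equal proportion
    A[:V, V] = 1.0         # and broadcasts information back to every joint
    A[V, V] = 1.0          # self-loop on the global point
    return A

bones = np.eye(3)          # toy 3-joint skeleton adjacency
A = add_global_point(bones)
print(A.shape)             # (4, 4)
```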
2. The method according to claim 1, wherein the identifying the behavior corresponding to the bone information to be identified according to the target Transformer encoder and the first feature information comprises:
extracting target global point feature information from the first feature information, and generating encoded input data according to the target global point feature information and a target time vector;
performing time feature extraction on the encoded input data through the target Transformer encoder to obtain second feature information; wherein the second feature information represents information obtained after information mixing of the target global point feature information and the target time vector is completed;
and identifying the behavior corresponding to the bone information to be identified according to the second feature information.
3. The method according to claim 2, wherein the identifying the behavior corresponding to the bone information to be identified according to the second feature information comprises:
extracting target time vector feature information from the second feature information;
inputting the target time vector feature information into a multilayer perceptron to obtain a probability corresponding to each behavior type output by the multilayer perceptron;
and determining the behavior corresponding to the bone information to be identified according to the probability corresponding to each behavior type.
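The classification step of claim 3 can be sketched with a single-layer perceptron followed by a softmax; the behavior with the highest probability is returned. Weights, feature values, and label names here are illustrative placeholders:

```python
import numpy as np

def classify(feature, W, b, labels):
    # feature: (D,) target time-vector feature; W: (D, K); b: (K,)
    logits = feature @ W + b
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs = probs / probs.sum()
    return labels[int(np.argmax(probs))], probs

labels = ["walk", "run", "fall"]
W = np.eye(3)                               # toy weights: identity picks the largest input
feature = np.array([0.1, 2.0, 0.3])
behavior, probs = classify(feature, W, np.zeros(3), labels)
print(behavior)                             # run
```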
4. The method according to claim 2, wherein the generating encoded input data according to the target global point feature information and the target time vector comprises:
splicing the target global point feature information and the target time vector to obtain first spliced data;
and adding a position code to the first spliced data to obtain the encoded input data.
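The splicing of claim 4 resembles prepending a learnable summary token (as in ViT's class token) to the per-frame global-point features, then adding a positional encoding. Sinusoidal encoding is an assumption; the claim only says a position code is added:

```python
import numpy as np

def sinusoidal_positions(n, d):
    # standard fixed sinusoidal position code for a length-n, dim-d sequence
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def make_encoder_input(global_feats, time_vector):
    # global_feats: (T, D) global-point feature per frame; time_vector: (D,)
    seq = np.concatenate([time_vector[None, :], global_feats], axis=0)  # (T+1, D)
    return seq + sinusoidal_positions(*seq.shape)

x = make_encoder_input(np.zeros((4, 8)), np.ones(8))
print(x.shape)                              # (5, 8)
```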
5. The method according to claim 1, wherein the performing bone feature extraction on the bone input information through the target graph convolution layer to obtain first feature information comprises:
performing feature dimension raising on the bone input information through the target graph convolution layer;
and performing information mixing on the bone input information after feature dimension raising to obtain the first feature information.
6. The method according to claim 5, wherein the performing information mixing on the bone input information after feature dimension raising to obtain the first feature information comprises:
generating the first feature information according to the bone input information after feature dimension raising and a target adjacency matrix; wherein the target adjacency matrix is obtained by pre-training.
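Claims 5 and 6 describe a graph convolution that raises the channel dimension and then mixes information across joints through a pre-trained adjacency matrix. A minimal sketch, with random values standing in for the learned A and W:

```python
import numpy as np

def graph_conv(x, A, W):
    # x: (T, V, C_in) bone input; A: (V, V) pre-trained adjacency; W: (C_in, C_out)
    # W raises the feature dimension; A mixes information between joints
    return np.einsum('uv,tvc,cd->tud', A, x, W)

rng = np.random.default_rng(1)
out = graph_conv(rng.normal(size=(4, 18, 3)),    # 4 frames, 17 joints + global point
                 rng.normal(size=(18, 18)),
                 rng.normal(size=(3, 64)))
print(out.shape)                                 # (4, 18, 64)
```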
7. The method according to any one of claims 1-6, wherein before the constructing bone input information according to the bone information to be identified and the target global point information, the method further comprises:
acquiring a bone sample data set; wherein the bone sample data set comprises a plurality of bone sample data, each bone sample data comprising a bone point sample and a corresponding behavior tag;
normalizing the plurality of bone sample data to obtain a plurality of normalized bone sample data;
constructing a plurality of bone sample input data according to a preset global point and each normalized bone sample data;
and training the preset global point, a preset graph convolution layer, a preset time vector and a preset Transformer encoder according to the plurality of bone sample input data to obtain the target global point, the target graph convolution layer, the target time vector and the target Transformer encoder.
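The claim does not specify the normalization; one common choice for skeleton data, sketched below under that assumption, centers each sample on its root joint and scales it to unit spread so skeletons from different subjects and camera distances become comparable:

```python
import numpy as np

def normalize_sample(joints):
    # joints: (T, V, C); joint 0 is taken as the root (an assumption)
    centered = joints - joints[:, :1, :]     # root joint moved to the origin
    scale = centered.std()
    return centered / scale if scale > 0 else centered

sample = np.arange(24, dtype=float).reshape(2, 4, 3)
norm = normalize_sample(sample)
print(np.allclose(norm[:, 0, :], 0))         # True: root joint sits at the origin
```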
8. The method according to claim 7, wherein the training the preset global point, the preset graph convolution layer, the preset time vector and the preset Transformer encoder according to the plurality of bone sample input data comprises:
selecting a set of bone sample input data from the plurality of bone sample input data as a current bone sample input set;
inputting the current bone sample input set into the preset graph convolution layer to obtain first feature information corresponding to each current bone sample input data;
extracting preset global point feature information from the first feature information corresponding to each current bone sample input data, and generating encoded sample input data according to the preset global point feature information and the preset time vector;
inputting the encoded sample input data into the preset Transformer encoder to obtain second feature information corresponding to each encoded sample input data;
calculating, through a contrastive loss function, a training loss corresponding to the current bone sample input set according to the first feature information corresponding to each bone sample input data, the second feature information corresponding to each encoded sample input data, and the corresponding behavior labels;
updating and iterating the parameters respectively corresponding to the preset global point, the preset graph convolution layer, the preset time vector and the preset Transformer encoder according to the training loss, a back propagation algorithm and an optimization algorithm;
judging whether the accumulated number of iterations exceeds a preset number of iterations, or whether the parameters respectively corresponding to the preset global point, the preset graph convolution layer, the preset time vector and the preset Transformer encoder have converged;
if the accumulated number of iterations exceeds the preset number of iterations, or the parameters respectively corresponding to the preset global point, the preset graph convolution layer, the preset time vector and the preset Transformer encoder have converged, obtaining the target global point, the target graph convolution layer, the target time vector and the target Transformer encoder;
and if the accumulated number of iterations does not exceed the preset number of iterations and the parameters respectively corresponding to the preset global point, the preset graph convolution layer, the preset time vector and the preset Transformer encoder have not converged, returning to the step of selecting a group of bone sample input data from the plurality of bone sample input data as the current bone sample input set.
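The stopping rule of claim 8 — continue until either the iteration budget is exhausted or the parameters converge — can be sketched compactly. Here `step` is a stand-in for one forward/backward/optimizer pass and returns the magnitude of the parameter change:

```python
def train(next_batch, step, max_iters, tol):
    # loop over bone-sample batches until the iteration budget is reached
    # or the parameter update falls below the convergence tolerance
    iters = 0
    while True:
        delta = step(next_batch())        # one update; returns parameter change
        iters += 1
        if iters >= max_iters or delta < tol:
            return iters

# toy run: updates shrink geometrically, so the convergence test triggers first
deltas = iter([1.0, 0.5, 0.25, 0.005, 0.001])
n = train(lambda: None, lambda _: next(deltas), max_iters=100, tol=0.01)
print(n)                                  # 4
```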
9. A behavior recognition apparatus, comprising a construction module, a graph convolution module, and a recognition module;
the construction module is configured to construct bone input information according to bone information to be identified and target global point information; wherein the target global point represents a point that obtains an equal proportion of information from each skeleton point in the bone information to be identified;
the graph convolution module is configured to perform bone feature extraction on the bone input information through a target graph convolution layer to obtain first feature information; wherein the first feature information represents information obtained after information mixing of the bone information to be identified and the target global point information is completed;
and the identification module is configured to identify the behavior corresponding to the bone information to be identified according to a target Transformer encoder and the first feature information.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202210836365.2A 2022-07-15 2022-07-15 Behavior recognition method and apparatus, electronic device, and computer-readable storage medium Pending CN115205975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836365.2A CN115205975A (en) 2022-07-15 2022-07-15 Behavior recognition method and apparatus, electronic device, and computer-readable storage medium


Publications (1)

Publication Number Publication Date
CN115205975A true CN115205975A (en) 2022-10-18

Family

ID=83582335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836365.2A Pending CN115205975A (en) 2022-07-15 2022-07-15 Behavior recognition method and apparatus, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN115205975A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071077A (en) * 2023-03-06 2023-05-05 深圳市迪博企业风险管理技术有限公司 Risk assessment and identification method and device for illegal account


Similar Documents

Publication Publication Date Title
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN112434721A (en) Image classification method, system, storage medium and terminal based on small sample learning
CN114942984B (en) Pre-training and image-text retrieval method and device for visual scene text fusion model
CN112069319B (en) Text extraction method, text extraction device, computer equipment and readable storage medium
CN110390340B (en) Feature coding model, training method and detection method of visual relation detection model
CN111259851B (en) Multi-mode event detection method and device
CN113593661B (en) Clinical term standardization method, device, electronic equipment and storage medium
CN111666766A (en) Data processing method, device and equipment
CN113901909A (en) Video-based target detection method and device, electronic equipment and storage medium
CN113177449A (en) Face recognition method and device, computer equipment and storage medium
CN113239702A (en) Intention recognition method and device and electronic equipment
CN111652181A (en) Target tracking method and device and electronic equipment
CN115205975A (en) Behavior recognition method and apparatus, electronic device, and computer-readable storage medium
CN116152833A (en) Training method of form restoration model based on image and form restoration method
CN115809438B (en) Multi-mode emotion analysis method, system, equipment and storage medium
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment
CN114417891B (en) Reply statement determination method and device based on rough semantics and electronic equipment
CN113095086B (en) Method and system for predicting source meaning
CN115563976A (en) Text prediction method, model building method and device for text prediction
CN115223157A (en) Power grid equipment nameplate optical character recognition method based on recurrent neural network
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN115100419B (en) Target detection method and device, electronic equipment and storage medium
CN117173530B (en) Target abnormality detection method and device
CN117057929B (en) Abnormal user behavior detection method, device, equipment and storage medium
Zhao et al. Lightweight Smoke Recognition Based on Deep Convolution and Self‐Attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination