CN113468980B - Human behavior recognition method and related device - Google Patents


Info

Publication number
CN113468980B
Authority
CN
China
Prior art keywords
joint point
sequence
dimension
data
graph convolution
Prior art date
Legal status
Active
Application number
CN202110654055.4A
Other languages
Chinese (zh)
Other versions
CN113468980A (en)
Inventor
张兴明
白云超
魏乃科
潘华东
殷俊
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202110654055.4A
Publication of CN113468980A
Application granted
Publication of CN113468980B
Legal status: Active
Anticipated expiration


Abstract

The application discloses a human behavior recognition method and a related device, wherein the human behavior recognition method comprises the following steps: acquiring joint point sequence data and a graph convolution kernel; performing time convolution processing on the joint point sequence data to obtain a joint point time sequence; performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence of a set dimension; performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector; performing second dimension transformation processing on the graph convolution kernel to convert it into a graph convolution feature vector of the set dimension; performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; and performing human behavior recognition based on the space-time sequence feature vector. With this scheme, directly performing graph convolution can be avoided in human behavior recognition, so that the method can be effectively deployed on an intelligent chip, with high processing speed and small precision loss.

Description

Human behavior recognition method and related device
Technical Field
The application relates to the technical field of computers, in particular to a human behavior recognition method and a related device.
Background
In the field of video monitoring, intelligent device-side processing is gradually becoming the mainstream direction of development, in which human behavior recognition can be completed directly by deploying an algorithm on the embedded chip of an intelligent device. Among human motion recognition techniques, recognizing human actions from joint point sequences is attracting increasing attention, and among joint point sequence recognition methods, space-time graph convolution has been the most widely used and most effective method in recent years. Deploying space-time graph convolution directly on the chip of an intelligent terminal device is therefore a mainstream direction for human action recognition in the future of the video monitoring field.
However, although space-time graph convolution has advantages such as high speed and high recognition accuracy when recognizing joint point sequences, it generally processes unstructured data and uses a graph convolution operation, which cannot be executed directly on the chip of a lightweight intelligent device, and the associated operations of space-time graph convolution are complex. The deployment of space-time graph convolution on lightweight intelligent devices is therefore greatly hindered.
Disclosure of Invention
The application mainly provides a human behavior recognition method and a related device, so as to effectively solve the problem of deploying graph convolution on an intelligent chip.
In order to solve the above problem, a first aspect of the present application provides a human behavior recognition method, wherein the human behavior recognition method comprises: acquiring joint point sequence data and a graph convolution kernel; performing time convolution processing on the joint point sequence data to obtain a joint point time sequence; performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence of a set dimension; performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector; performing second dimension transformation processing on the graph convolution kernel to convert it into a graph convolution feature vector of the set dimension; performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; and performing human behavior recognition based on the space-time sequence feature vector.
The step of acquiring the joint point sequence data and the graph convolution kernel comprises: acquiring a video image, wherein the video image at least comprises a human body to be detected; performing human behavior detection on the video image to obtain the joint point sequence data; acquiring a preset weight matrix and an initial graph convolution kernel; and adding the preset weight matrix to the initial graph convolution kernel to obtain the graph convolution kernel.
The step of performing time convolution processing on the joint point sequence data to obtain the joint point time sequence comprises: performing iterative update processing on the joint point sequence data in the T dimension to obtain an iteratively updated joint point time sequence: [N, C, Tx, V, M]; the formula of the iterative update processing is: F(x_t) = Σ_{k=1}^{K} F_k · x_{t+k}, where x is the time series, F_k is the k-th time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
The step of performing data dimension reduction processing on the joint point time sequence to obtain the low-dimensional joint point sequence of the set dimension comprises: fusing the joint point feature C in the joint point time sequence with the number of identified persons M, and retaining the sequence information of the other data, to obtain the low-dimensional joint point sequence of the set dimension: [N, C·M, Tx, V].
The step of performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into the joint point sequence feature vector comprises: extracting the number of convolution kernels k in the graph convolution kernel, and performing permutation and reshaping processing in a first set order through k and the low-dimensional joint point sequence, to convert the low-dimensional joint point sequence into the joint point sequence feature vector: [1, vk, N, C·M·Tx].
The graph convolution kernel comprises the number of convolution kernels k and the joint point numbers v and w, and the step of performing second dimension transformation processing on the graph convolution kernel to convert it into the graph convolution feature vector of the set dimension comprises: after performing permutation and reshaping processing in a second set order on the graph convolution kernel, performing dimension adjustment processing to convert it into the graph convolution feature vector of the set dimension: [w, vk, 1, 1].
The step of performing human behavior recognition based on the space-time sequence feature vector comprises: converting the space-time sequence feature vector into a network model format supportable by a chip, so as to perform human behavior recognition.
In order to solve the above-described problems, a second aspect of the present application provides a human behavior recognition apparatus, wherein the human behavior recognition apparatus includes: the data acquisition module is used for acquiring the node sequence data; the space-time diagram convolution module is used for carrying out time convolution processing on the joint point sequence data so as to obtain a joint point time sequence; the trainable parameter importing module is used for acquiring a graph convolution kernel; the data feature dimension reduction module is used for carrying out data dimension reduction processing on the joint point time sequence so as to obtain a low-dimension joint point sequence with a set dimension; the Einstein convention conversion module is used for carrying out first dimension conversion processing on the low-dimension joint point sequence so as to convert the low-dimension joint point sequence into joint point sequence feature vectors, and carrying out second dimension conversion processing on the graph convolution kernel so as to convert the low-dimension joint point sequence into graph convolution feature vectors with set dimensions; the space-time diagram convolution module is also used for carrying out functional convolution processing on the joint point sequence feature vector and the diagram convolution feature vector so as to obtain a space-time sequence feature vector; and the model quantization module is used for identifying human behaviors according to the space-time sequence feature vectors.
In order to solve the above problems, a third aspect of the present application provides an intelligent terminal, wherein the intelligent terminal includes a memory and a processor coupled to each other; the memory stores program data; the processor is configured to execute program data to implement the human behavior recognition method according to any one of the above.
In order to solve the above-described problems, a fourth aspect of the present application provides a computer-readable storage medium storing program data executable to implement the human behavior recognition method as set forth in any one of the above.
The beneficial effects of the application are as follows: in contrast to the prior art, the human behavior recognition method of the application performs time convolution processing on the acquired joint point sequence data to obtain a joint point time sequence, performs data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence of a set dimension, and then performs first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector, i.e., low-dimensional data that an intelligent chip can process directly. Second dimension transformation processing is then performed on the acquired graph convolution kernel to convert it into a graph convolution feature vector of the set dimension, so that functional convolution processing can be performed on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; obtaining the space-time sequence feature vector directly by graph convolution is thereby avoided, and human behavior recognition is performed on the basis of the space-time sequence feature vector. In this way, through data dimension reduction processing and the corresponding dimension transformation processing of the high-dimensional joint point time sequence, the processing that originally required graph convolution can be replaced by functional convolution processing, so that the human behavior recognition method can be effectively deployed on an intelligent chip, especially the chip of a lightweight intelligent terminal device, the corresponding processing speed is increased, and the precision loss is reduced.
Drawings
FIG. 1 is a flow chart of an embodiment of a human behavior recognition method according to the present application;
FIG. 2 is a flow chart of an embodiment of S11 in FIG. 1;
FIG. 3 is a schematic diagram of an embodiment of acquiring joint point sequence data according to the present application;
FIG. 4 is a schematic diagram of an embodiment of human behavior recognition using space-time diagram convolution in the prior art;
FIG. 5 is a schematic diagram of an embodiment of an Einstein summing convention conversion process;
FIG. 6 is a schematic diagram of an application scenario equivalent to the human behavior recognition method in FIG. 4;
FIG. 7 is a schematic diagram illustrating one embodiment of a first dimension transformation process on a low-dimensional joint point sequence according to the present application;
FIG. 8 is a schematic diagram of one embodiment of a second dimension transformation process on the graph convolution kernel;
FIG. 9 is a schematic diagram of a human behavior recognition device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of a smart terminal according to the present application;
FIG. 11 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a human behavior recognition method according to the present application. Specifically, the human behavior recognition method in the present embodiment may include the steps of:
s11: and acquiring the node sequence data and the graph convolution kernel.
In the human behavior recognition method of the present application, a video image including a human body is collected by a monitoring device or any reasonable photographing apparatus, so that human body detection, human body tracking, and joint point extraction can be performed based on the video image. It will be appreciated that the specific behavior characteristics of a human body are usually represented by the relative positions of its joint points, so the acquired joint point sequence data can be processed to recognize the behavior of the corresponding human body.
Specifically, the intelligent terminal collects a video image comprising a human body through monitoring equipment or any reasonable shooting device or a camera of the intelligent terminal in communication connection with the intelligent terminal so as to extract joint point information of the human body in the video image and further obtain corresponding joint point sequence data; or, receiving the joint point sequence data sent by other intelligent terminals.
Further, receiving graph convolution kernels which are sent by other intelligent terminals and are obtained after training the deep learning network model of the intelligent terminals; or, the preset weight matrix and the initial graph convolution kernel which are sent by other intelligent terminals and obtained through training are added to obtain the graph convolution kernel.
It will be appreciated that in human behavior recognition, different joint points typically have different influence weights on the recognition result. To make behavior recognition of the human body more accurate, the importance of different torso parts can be adjusted by introducing additional trainable parameters that do not belong to the fixed parameters of the space-time graph convolution model; through changes in these trainable parameters, the weights of the connections between the joint points are obtained, so that the different degrees of torso importance are determined.
The graph convolution kernel is obtained by acquiring and processing trainable parameters, so that the accuracy of human behavior recognition can be ensured in the later processing. To reduce the computation load of the intelligent terminal in this embodiment, the graph convolution kernel obtained after training on another intelligent terminal, or by another network model on this intelligent terminal, can be received directly; or a preset weight matrix and an initial graph convolution kernel can be received, so that the graph convolution kernel is obtained by addition, thereby improving the processing efficiency of the corresponding human behavior recognition.
Optionally, the smart terminal may specifically be a mobile phone, a drone, a smart watch, a tablet computer, a smart camera, or any other reasonable smart terminal including an embedded chip, which is not limited in the present application.
S12: and performing time convolution processing on the joint point sequence data to obtain a joint point time sequence.
Further, the obtained joint point sequence data is subjected to a time convolution process, for example, the joint point sequence data is subjected to an iterative update operation on a certain one-dimensional data thereof, so as to obtain an updated joint point time sequence, and therefore local characteristics of joint point changes caused by time changes of the joint point sequence data can be obtained.
In an embodiment, the joint point sequence data specifically includes a joint point sequence length T, so that a corresponding joint point time sequence can be obtained by performing iterative update operation on the joint point sequence data in the dimension T, while other data in the joint point sequence data remains unchanged.
S13: and performing data dimension reduction processing on the joint point time sequence to obtain a low-dimension joint point sequence with a set dimension.
It can be appreciated that the chips of some intelligent terminals, especially lightweight intelligent terminal devices, generally do not support operations on high-dimensional data, so such data needs to undergo data dimension reduction processing, for example converting 5-dimensional data into 4-dimensional data, to suit certain specific application scenarios.
Since the joint point time sequence is usually high-dimensional data, data dimension reduction processing can be performed on it to widen its application scenarios and improve the corresponding processing efficiency; for example, two dimensions of the joint point time sequence are fused and the sequence information of the other data is retained, so as to obtain the low-dimensional joint point sequence of the set dimension.
Optionally, the set dimension is 4 or any other reasonable dimension, so as to suit a wider range of computing scenarios on intelligent terminals.
S14: and performing first dimension transformation processing on the low-dimension joint point sequence to convert the low-dimension joint point sequence into joint point sequence feature vectors.
In processing joint point sequence data for human behavior recognition, involving the graph convolution operation is generally unavoidable, and the Einstein summation convention operation is the core operation mode of graph convolution; consequently, a human behavior recognition method involving the Einstein summation convention cannot be implemented on an embedded hardware chip that supports neither 5-dimensional data operations nor graph convolution processing. An alternative method is therefore needed to address the problem that embedded chips cannot directly perform the Einstein summation convention operation.
It can be understood that, in order to replace the Einstein summation convention conversion processing of high-dimensional data involved later, adaptive adjustment is needed after the low-dimensional joint point sequence is obtained; for example, the low-dimensional joint point sequence undergoes first dimension transformation processing, i.e., several permutation and reshaping processes in a specific order, so as to change the order of each dimension of the low-dimensional joint point sequence and fuse features, and thereby convert the sequence into a joint point sequence feature vector suitable for the subsequent convolution processing.
S15: the convolution kernel is subjected to a second dimension transformation process to convert it into a convolution feature vector of the set dimension.
Similarly, to replace the subsequent Einstein summation convention conversion processing involving high-dimensional data, second dimension transformation processing must also be performed on the graph convolution kernel, i.e., several permutation and reshaping processes in another specific order, so as to change the order of the dimensions of the graph convolution kernel and fuse features into a graph convolution feature vector of the set dimension, with the same dimensionality as the joint point sequence feature vector. The graph convolution feature vector and the joint point sequence feature vector can then directly undergo functional convolution processing in the subsequent steps, avoiding the Einstein summation convention conversion processing.
S16: and carrying out functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain the space-time sequence feature vector.
Further, since the joint point sequence feature vector and the graph convolution feature vector have already been converted into data of the same dimensionality with matched sequence order, they can directly undergo functional convolution processing, for example functional convolution processing equivalent to matrix multiplication, to obtain the corresponding space-time sequence feature vector.
S17: human behavior recognition is performed based on the space-time sequence feature vectors.
Further, the behavior actions of the human body to be detected are recognized and classified by means of the space-time sequence feature vector, so that the behavior characteristics of the human body to be detected are obtained.
According to the above scheme, the obtained joint point sequence data undergoes time convolution processing to obtain a joint point time sequence, the joint point time sequence undergoes data dimension reduction processing to obtain a low-dimensional joint point sequence of a set dimension, and the low-dimensional joint point sequence then undergoes first dimension transformation processing to be converted into a joint point sequence feature vector, i.e., low-dimensional data that an intelligent chip can process directly. Second dimension transformation processing is then performed on the obtained graph convolution kernel to convert it into a graph convolution feature vector of the set dimension, so that functional convolution processing can be performed on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; obtaining the space-time sequence feature vector directly by graph convolution is thereby effectively avoided, and human behavior recognition can be performed on the basis of the space-time sequence feature vector.
In this way, through data dimension reduction processing and the corresponding dimension transformation processing of the high-dimensional joint point time sequence, the processing that originally required graph convolution can be replaced by functional convolution processing, so that the human behavior recognition method can be effectively deployed on an intelligent chip, especially the chip of a lightweight intelligent terminal device, and the application range of the method is widened; moreover, because the complex graph convolution processing is avoided, the corresponding processing speed is increased, the efficiency is improved, and the precision loss is reduced.
With continued reference to fig. 2, fig. 2 is a flowchart of an embodiment of S11 in fig. 1.
In an embodiment, the step S11 may specifically further include:
S111: a video image is acquired.
Specifically, the intelligent terminal collects video images including human bodies through monitoring equipment in communication connection with the intelligent terminal or any reasonable shooting device or a camera of the intelligent terminal; or receiving video images sent by other intelligent terminals.
The video image at least comprises a human body to be detected, so that behavior recognition can be performed on the human body to be detected based on the video image.
S112: and detecting human body behaviors of the video images to obtain joint point sequence data.
Further, please refer to fig. 3, which is a schematic diagram of an embodiment of acquiring joint point sequence data according to the present application. After a video image including at least a human body to be detected is obtained, human body detection, human body tracking, and joint point extraction are sequentially performed on the human body to be detected, so that the joint point sequence data of the human body to be detected can be obtained through processing by the space-time graph convolution module integrated on the corresponding intelligent terminal.
In one embodiment, the joint point sequence data specifically includes 5 parameters, namely, a batch size (number of video images) N, a joint point feature C, a joint point sequence length T, a joint point number V, and an identification number M. It is understood that the sequence of joint points data specifically corresponds to the motion characteristics of the human body in the skeleton state of the human body.
S113: and acquiring a preset weight matrix and an initial graph convolution kernel.
Specifically, the trained preset weight matrix and the initial graph convolution kernel given by its corresponding network model are received, where they are sent by another intelligent terminal, or by another network model in the intelligent terminal that performs human behavior recognition in this embodiment.
It can be understood that, since different joint points have different influence weights on the human body recognition result, some trainable parameters may be introduced through an added attention mechanism to ensure the accuracy of the final human behavior recognition; for example, a preset weight matrix obtained by training may be introduced. Because these parameters are additionally added and cannot be trained directly on the chip of the intelligent terminal, the corresponding training process is carried out with the help of another intelligent terminal, and the trained preset weight matrix and the initial graph convolution kernel given by the corresponding network model are imported and sent to the intelligent terminal that performs human behavior recognition in this embodiment.
S114: and adding the preset weight matrix with the initial graph convolution kernel to obtain the graph convolution kernel.
Further, the preset weight matrix is added with the initial graph convolution kernel, so that the sum of the preset weight matrix and the initial graph convolution kernel is determined to be the graph convolution kernel to be subjected to human behavior recognition processing.
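As a minimal sketch of S114 (all shapes and values here are assumptions for illustration; the real kernels come from the human-body topology and from training), the addition amounts to an element-wise sum of two [k, v, w] tensors:

```python
import torch

k, v, w = 3, 18, 18                      # assumed: 3 motion sub-states, 18 joint points
A_init = torch.randn(k, v, w)            # stand-in for the topology-derived initial graph convolution kernel
B_attn = torch.randn(k, v, w)            # stand-in for the trained preset weight matrix (attention)
A = A_init + B_attn                      # graph convolution kernel [k, v, w] used in the following steps
```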
In one embodiment, the graph convolution kernel specifically includes three parameters, namely, the number k of convolution kernels corresponding to the motion sub-states, and the number v and w of joint points. The motion sub-state specifically refers to three states of centripetal motion, centrifugal motion and stillness which are included in human body motion, and the graph convolution kernel is a matrix of fixed parameters obtained according to a human body topological graph.
The joint point numbers v and w in the graph convolution kernel and the joint point number V in the joint point sequence data all correspond to the same value; different symbols are introduced only to facilitate the subsequent specific operations.
Further, in an embodiment, the joint point sequence data specifically includes the number of video images N, the joint point feature C, the joint point sequence length T, the number of joint points V, and the number of identified persons M, and step S12 of the human behavior recognition method of the present application may specifically further include: performing iterative update processing on the joint point sequence data in the T dimension to obtain an iteratively updated joint point time sequence: [N, C, Tx, V, M].

The formula of the iterative update processing is: F(x_t) = Σ_{k=1}^{K} F_k · x_{t+k}, where x is the time series, F_k is the k-th time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
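A hedged PyTorch sketch of this update (the shapes, the kernel length K, and the folding of the M axis into the batch are assumptions; the patent only states that the iteration runs along T while the other axes are kept):

```python
import torch
import torch.nn as nn

N, C, T, V, M = 2, 3, 64, 18, 1                  # hypothetical joint point sequence shape
x = torch.randn(N, C, T, V, M)

K = 9                                            # assumed temporal kernel length
tconv = nn.Conv2d(C, C, kernel_size=(K, 1), padding=(K // 2, 0))

# Fold the person axis M into the batch so a 2D convolution can slide along T.
x4d = x.permute(0, 4, 1, 2, 3).reshape(N * M, C, T, V)
y4d = tconv(x4d)                                 # iteratively updates every joint point along T
Tx = y4d.shape[2]                                # updated sequence length "Tx"
y = y4d.reshape(N, M, C, Tx, V).permute(0, 2, 3, 4, 1)   # back to [N, C, Tx, V, M]
```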
Further, in an embodiment, step S13 of the human behavior recognition method of the present application may specifically further include: fusing the joint point feature C in the joint point time sequence with the number of identified persons M, and retaining the sequence information of the other data, to obtain the low-dimensional joint point sequence of the set dimension: [N, C·M, Tx, V].
It can be appreciated that since an embedded chip (e.g., the chip family integrated on a lightweight intelligent terminal) does not support operations on 5-dimensional data, the 5-dimensional data corresponding to the joint point time sequence must all be converted into 4-dimensional data. The M dimension and the C dimension are fused; because this is an operation performed when the data is read, the data can be converted and fused directly, and the sequence information of the T dimension and the V dimension is still retained after fusion, so that the characteristics of the input data are redefined. After all the T-dimension and V-dimension information is aligned, the low-dimensional joint point sequence [N, C·M, Tx, V] is obtained, so that the unstructured data of the graph convolution process can be converted into pseudo-structured data supportable by the intelligent chip.

Fusing the M dimension with the C dimension effectively maintains the alignment of the joint point sequence in the time and space dimensions, and therefore affects neither the time convolution operation nor the graph convolution operation.
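Under the same assumed shapes as above, a minimal sketch of this fusion is a single permute plus reshape, with the T and V information untouched:

```python
import torch

N, C, Tx, V, M = 2, 3, 64, 18, 1
x = torch.randn(N, C, Tx, V, M)                  # 5-dimensional joint point time sequence

x4d = x.permute(0, 1, 4, 2, 3).reshape(N, C * M, Tx, V)   # low-dimensional sequence [N, C·M, Tx, V]
```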
Further, in an embodiment, step S14 of the human behavior recognition method of the present application may specifically further include: extracting the number of convolution kernels k in the graph convolution kernel, and performing permutation and reshaping processing in a first set order through k and the low-dimensional joint point sequence, to convert the low-dimensional joint point sequence into the joint point sequence feature vector: [1, vk, N, C·M·Tx].
Specifically, as shown in fig. 4, fig. 4 is a schematic diagram of an embodiment of human behavior recognition using space-time graph convolution in the prior art. In existing behavior recognition of the human body, after the joint point sequence data is acquired, time convolution is usually applied to it, a graph convolution kernel formed by the sum of the trainable parameter corresponding to the attention mechanism and the initial graph convolution kernel is introduced, and the corresponding graph convolution processing is performed to complete the feature extraction process, so that the subsequent human behavior recognition can be completed according to the output of the feature extraction.
The joint point sequence data comprises 5 parameters, namely, a BatchSize (n), a joint point characteristic (c), a joint point sequence length (t), a joint point number (v) and an identification number (m). It can be understood that the joint point sequence data specifically corresponds to the motion characteristics of the human body in the human body skeleton state; the convolution kernel contains three parameters, namely a motion sub-state (number of convolution kernels) k, and joint number v and w. The motion sub-state specifically refers to three states of centripetal motion, centrifugal motion and stillness which are included in human body motion, and the graph convolution kernel is a matrix of fixed parameters obtained according to a human body topological graph.
The attention mechanism is to add a trainable parameter of the same size as the convolution kernel to the initial convolution kernel so that the weights of the connections between each joint point can be learned to determine different torso importance. And different trunk importance can be adjusted through the change of the trainable parameters, the trainable parameters do not belong to fixed parameters of the space-time diagram convolution processing, are additionally added, and are introduced for enabling the accuracy of the space-time diagram convolution processing to be higher.
The time convolution is used to learn the local characteristics of joint point changes over time, iteratively updating each node in the t dimension. The formula of the iterative update processing is: F(x_t) = Σ_{k=1}^{K} F_k · x_{t+k}, where x is the time series, F_k is the k-th time convolution kernel, and K is the number of convolution kernels.
The graph convolution is specifically used to learn the characteristics of joint point changes in space; for example, the joint points in the C dimension and the V dimension of each sequence are updated, and the corresponding update formula is: Σ_{k,v} (XW)_{nkctv} · A_{kvw} = X′_{nctw}, wherein A is the graph convolution kernel, X is the joint point sequence after the iterative update processing, and k is the number of graph convolution kernels.
Referring to fig. 5 in combination, fig. 5 is a schematic diagram of an embodiment of the Einstein summation convention conversion process. The update formula described above may also be represented by the Einstein summation convention: nkctv, kvw → nctw, where n represents the product of the BatchSize and M, c represents the joint point feature, t represents the joint point sequence length, and v and w represent the number of joint points.
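A runnable reference sketch of this contracted form (small assumed shapes), which serves below as the ground truth that the dimension transforms must reproduce:

```python
import torch

n, k, c, t, v = 2, 3, 4, 5, 6                    # assumed small shapes; w equals v
w = v
X = torch.randn(n, k, c, t, v)                   # joint point sequence expanded to k motion sub-states
A = torch.randn(k, v, w)                         # graph convolution kernel

X_out = torch.einsum('nkctv,kvw->nctw', X, A)    # the graph convolution update, shape [n, c, t, w]
```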
The Einstein summation convention operation is the core operation of graph convolution and is also the most difficult problem to solve on an embedded hardware chip. As can be seen from the above formula conversion process and fig. 5, the formula involves an operation between the 5-dimensional data corresponding to the joint point sequence and the 3-dimensional data corresponding to the graph convolution kernel; since 5-dimensional data operations are not supported on the embedded hardware chip, the joint point sequence data and the graph convolution kernel need to undergo dimension conversion.
It follows that human behavior recognition by the graph convolution method inevitably involves arithmetic on 5-dimensional data and therefore cannot be supported by some lightweight embedded chips. To ensure that the corresponding human behavior recognition can run on such chips, the graph convolution processing described above must be replaced equivalently, and in particular its Einstein summation convention representation must be replaced equivalently.
It will be appreciated that, as shown in fig. 7, fig. 7 is a schematic diagram of an embodiment of performing the first dimension transformation processing on the low-dimensional joint point sequence according to the present application. Because the Einstein summation operation operates with the graph convolution kernel in the k and V dimensions, the data k in the graph convolution kernel is first extracted, and then multiple permutation and reshaping processes in a first set order are performed through k and the low-dimensional joint point sequence [N, C·M, Tx, V]; for example, reshape (reshaping), permute (permutation), and then reshape and permute are sequentially performed on [N, k·C·M, Tx, V] in the order shown in fig. 7, so that the k and V dimensions become adjacent and are merged, and finally the vk dimension is placed at the position of a specific channel dimension, thereby converting the low-dimensional joint point sequence into the joint point sequence feature vector: [1, vk, N, C·M·Tx], which can participate as channel features in the functional convolution operation with the converted 1x1 convolution kernel.
Therefore, this operation can effectively avoid the occurrence of 5-dimensional data on the chip, and combines the v-dimension data with the k-dimension data so that they can participate in the 1x1 convolution operation.
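One plausible permute/reshape chain for this first set order (a sketch only; the exact op order is the one in fig. 7, and the 5-dimensional intermediate view below is used purely for off-chip illustration of which axes merge):

```python
import torch

N, k, CM, Tx, V = 2, 3, 4, 5, 6                  # CM denotes the fused C·M axis
x = torch.randn(N, k * CM, Tx, V)                # sequence after expansion to k sub-states

a = x.reshape(N, k, CM, Tx, V)                   # reshape: split out the k axis
b = a.permute(0, 4, 1, 2, 3)                     # permute: [N, V, k, CM, Tx], V and k now adjacent
c = b.reshape(N, V * k, CM * Tx)                 # reshape: merge V with k and CM with Tx
feat = c.permute(1, 0, 2).reshape(1, V * k, N, CM * Tx)  # permute/reshape: vk onto the channel dimension
```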
Still further, in an embodiment, the graph convolution kernel specifically includes the number of convolution kernels k and the joint point numbers v and w, and step S15 of the human behavior recognition method of the present application may specifically further include: after performing permutation and reshaping processing in a second set order on the graph convolution kernel, performing dimension adjustment processing to convert it into the graph convolution feature vector of the set dimension: [w, vk, 1, 1].
As shown in fig. 8, fig. 8 is a schematic diagram of an embodiment of performing the second dimension transformation processing on the graph convolution kernel. In order to transform the graph convolution kernel into a convolution kernel for the 1x1 convolution through operations of multi-dimension transposition and dimension stitching, the k dimension and the v dimension of the original graph convolution kernel can be exchanged first so that vk can be merged. Because vk and kv differ in the arrangement of the converted dimensions, one permute and one reshape operation are needed; vk is then placed on the specific channel dimension of the convolution kernel by performing permute once more, and finally two dimensions are appended through reshape to obtain a 1x1 convolution kernel, so that the graph convolution feature vector of the set dimension is obtained: [w, vk, 1, 1].
In this way, the dimensions of the graph convolution kernel used in the Einstein summation convention are transformed into the output-channel and input-channel dimensions of the 1x1 convolution, so that the kernel can participate in the subsequent functional convolution operation.
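A matching sketch of this second set order under the same assumptions (the chain mirrors fig. 8: swap k and v, merge, move vk to the channel axis, append two singleton dimensions):

```python
import torch

k, v = 3, 6
w = v
A = torch.randn(k, v, w)                         # graph convolution kernel [k, v, w]

kern = A.permute(1, 0, 2)                        # permute: [v, k, w], exchange k and v
kern = kern.reshape(v * k, w)                    # reshape: merge into the vk axis
kern = kern.permute(1, 0)                        # permute: [w, vk], vk onto the channel dimension
kern = kern.reshape(w, v * k, 1, 1)              # reshape: append two dimensions -> 1x1 convolution kernel
```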
The functional convolution operation specifically means that the newly constructed 1x1 convolution kernel is convolved with the newly constructed data; because a functional convolution is used, no trainable parameters are involved and no graph convolution operation participates, so all kinds of embedded intelligent chips can be supported. Moreover, the 1x1 convolution does not change the feature dimension of the data and operates only in the channel dimension, so using 1x1 convolution greatly reduces the time consumed.
Specifically, after the joint point sequence feature vector [1, vk, N, C·M·Tx] and the graph convolution feature vector [w, vk, 1, 1] are obtained through the corresponding dimension transformations, the two can directly undergo the functional convolution operation, for example a functional convolution operation equivalent to matrix multiplication, to obtain the corresponding space-time sequence feature vector.
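Putting the two transforms together, the following self-contained check (assumed small shapes; the permute/reshape chains are the plausible ones sketched above, not necessarily the patent's exact order) confirms that the 1x1 functional convolution reproduces the Einstein summation result:

```python
import torch
import torch.nn.functional as F

n, K, c, t, v = 2, 3, 4, 5, 6
w = v
X = torch.randn(n, K, c, t, v)
A = torch.randn(K, v, w)

ref = torch.einsum('nkctv,kvw->nctw', X, A)      # ground-truth graph convolution update

# First dimension transform: joint point sequence feature vector [1, vk, n, ct].
feat = X.permute(0, 4, 1, 2, 3).reshape(n, v * K, c * t)
feat = feat.permute(1, 0, 2).reshape(1, v * K, n, c * t)

# Second dimension transform: graph convolution feature vector [w, vk, 1, 1].
kern = A.permute(1, 0, 2).reshape(v * K, w).permute(1, 0).reshape(w, v * K, 1, 1)

out = F.conv2d(feat, kern)                       # functional 1x1 convolution, no trainable parameters
out = out.reshape(w, n, c, t).permute(1, 2, 3, 0)  # back to [n, c, t, w]

assert torch.allclose(out, ref, atol=1e-4)       # matches the einsum result
```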
Further, in an embodiment, step S17 of the human behavior recognition method of the present application may specifically further include: converting the space-time sequence feature vector into a network model format supportable by a chip for human behavior recognition.
Specifically, the above operation process for obtaining the space-time sequence feature vector can be understood as being performed by a space-time graph convolutional network model with the corresponding network architecture. To quantize the space-time graph convolutional network model into a type that the chip can support and deploy it on the chip of the intelligent terminal, the model needs to be converted into any reasonable network model format, such as a caffe (Convolutional Architecture for Fast Feature Embedding) model or an onnx (Open Neural Network Exchange) model, which is not limited in this application.
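A hedged export sketch (the wrapper module name and shapes are hypothetical; only the torch.onnx.export call and the fixed, parameter-free 1x1 kernel reflect the text above):

```python
import torch

class FunctionalGraphConv(torch.nn.Module):      # hypothetical wrapper for the functional convolution
    def __init__(self, kern):
        super().__init__()
        self.register_buffer('kern', kern)       # fixed [w, vk, 1, 1] kernel, no trainable parameters

    def forward(self, feat):                     # feat: joint point sequence feature vector [1, vk, N, C·M·Tx]
        return torch.nn.functional.conv2d(feat, self.kern)

model = FunctionalGraphConv(torch.randn(6, 18, 1, 1)).eval()
dummy = torch.randn(1, 18, 2, 20)
torch.onnx.export(model, dummy, 'functional_graph_conv.onnx', opset_version=11)
```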
In a specific embodiment, as shown in fig. 6, fig. 6 is a schematic diagram of an application scenario equivalent to the human behavior recognition method in fig. 4. To avoid the graph convolution process in human behavior recognition, the graph convolution processing described above can be replaced by functional convolution by adding the data feature dimension reduction, Einstein convention conversion, and trainable parameter import process flows.
Specifically, the human behavior recognition processing is implemented by a preset network model, which specifically comprises a space-time graph convolution module, a data feature dimension reduction module, an Einstein convention conversion module, a trainable parameter import module, and a model quantization module.
It can be seen that, after the preset network model obtains the joint point sequence data [N, C, T, V, M], the data feature dimension reduction module can perform data dimension reduction processing on it to obtain new joint point sequence data: [N, C·M, T, V].
Further, the space-time graph convolution module performs time convolution on the new joint point sequence data [N, C·M, T, V] to obtain the joint point time sequence [N, C·M, Tx, V]. The trainable parameter import module imports the graph convolution kernel: [k, v, w]; or imports a preset weight matrix and an initial graph convolution kernel and adds them to obtain the graph convolution kernel: [k, v, w].
Still further, the Einstein convention conversion module performs the corresponding dimension transformation processing on the joint point time sequence [N, C·M, Tx, V] and the graph convolution kernel [k, v, w] respectively, to obtain the joint point sequence feature vector [1, vk, N, C·M·Tx] and the graph convolution feature vector [w, vk, 1, 1], so that the new "graph convolution" corresponding to the functional convolution operation can replace the existing graph convolution to process the two and obtain the corresponding space-time sequence feature vector; human behavior recognition can then be performed on the basis of the space-time sequence feature vector through the model quantization module.
It can be understood that the above process of performing the operation processing by each module is the same as the corresponding specific formula operation manner mentioned in each embodiment, and specific reference is made to the corresponding text content and the related drawings of the above embodiment, which are not repeated herein.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a human behavior recognition device according to an embodiment of the application. In the present embodiment, the human behavior recognition apparatus 21 includes: a data acquisition module 211, a space-time diagram convolution module 212, a trainable parameter import module 213, a data feature dimension reduction module 214, an einstein convention conversion module 215, and a model quantization module 216.
The data acquisition module 211 is configured to acquire joint point sequence data; the trainable parameter import module 213 is configured to acquire a graph convolution kernel; the space-time graph convolution module 212 is configured to perform time convolution processing on the joint point sequence data to obtain a joint point time sequence; the data feature dimension reduction module 214 is configured to perform data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence of a set dimension; the Einstein convention conversion module 215 is configured to perform first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector, and to perform second dimension transformation processing on the graph convolution kernel to convert it into a graph convolution feature vector of the set dimension; the space-time graph convolution module 212 is further configured to perform functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; and the model quantization module 216 is configured to perform human behavior recognition according to the space-time sequence feature vector.
In some embodiments, the step of the data acquisition module 211 acquiring the joint point sequence data includes: acquiring a video image, wherein the video image at least comprises a human body to be detected; and performing human behavior detection on the video image to obtain the joint point sequence data.
In some embodiments, the step of trainable parameter import module 213 performing the step of obtaining the graph convolution kernel includes: acquiring a preset weight matrix and an initial graph convolution kernel; and adding the preset weight matrix with the initial graph convolution kernel to obtain the graph convolution kernel.
In some embodiments, the joint point sequence data includes the number of video images N, the joint point feature C, the joint point sequence length T, the number of joint points V, and the number of identified persons M, and the step of the space-time graph convolution module 212 performing time convolution processing on the joint point sequence data to obtain the joint point time sequence includes: performing iterative update processing on the joint point sequence data in the T dimension to obtain an iteratively updated joint point time sequence: [N, C, Tx, V, M]; the formula of the iterative update processing is: F(x_t) = Σ_{k=1}^{K} F_k · x_{t+k}, where x is the time series, F_k is the k-th time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
In some embodiments, the step of the data feature dimension reduction module 214 performing data dimension reduction processing on the joint point time sequence to obtain the low-dimensional joint point sequence of the set dimension includes: fusing the joint point feature C in the joint point time sequence with the number of identified persons M, and retaining the sequence information of the other data, to obtain the low-dimensional joint point sequence of the set dimension: [N, C·M, Tx, V].
In some embodiments, the step of the Einstein convention conversion module 215 performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into the joint point sequence feature vector, and performing second dimension transformation processing on the graph convolution kernel to convert it into the graph convolution feature vector of the set dimension, includes: extracting the number of convolution kernels k in the graph convolution kernel, and performing permutation and reshaping processing in a first set order through k and the low-dimensional joint point sequence, to convert the low-dimensional joint point sequence into the joint point sequence feature vector: [1, vk, N, C·M·Tx]; and after performing permutation and reshaping processing in a second set order on the graph convolution kernel, performing dimension adjustment processing to convert it into the graph convolution feature vector of the set dimension: [w, vk, 1, 1].
In some embodiments, the step of the model quantization module 216 performing human behavior recognition according to the space-time sequence feature vector includes: converting the space-time sequence feature vector into a network model format supportable by a chip for human behavior recognition.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an intelligent terminal according to an embodiment of the application. In this embodiment, the intelligent terminal 31 includes a memory 311 and a processor 312 coupled to each other, where the memory 311 stores program data, and the processor 312 is configured to execute the program data stored in the memory 311 to implement the steps of any of the above-described human behavior recognition method embodiments.
In one particular implementation scenario, the intelligent terminal 31 may include, but is not limited to: microcomputer, server, cell-phone, panel computer, intelligent wrist-watch.
Specifically, the processor 312 is configured to control itself and the memory 311 to implement the steps of any of the above-described human behavior recognition method embodiments. The processor 312 may also be referred to as a CPU (Central Processing Unit). The processor 312 may be an integrated circuit chip having signal processing capabilities. The processor 312 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 312 may also be implemented jointly by integrated circuit chips.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an embodiment of a computer readable storage medium according to the present application. In the present embodiment, the computer-readable storage medium 41 stores program data 411, and the program data 411 can be executed to implement the steps of any of the above-described human behavior recognition method embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Claims (8)

1. A human behavior recognition method, characterized in that the human behavior recognition method comprises:
acquiring joint point sequence data and a graph convolution kernel; wherein the joint point sequence data comprises the number of video images N, the joint point feature C, the joint point sequence length T, the number of joint points V, and the number of identified persons M, and the graph convolution kernel comprises the number of convolution kernels k and the joint point numbers v and w;
performing time convolution processing on the joint point sequence data to obtain a joint point time sequence;
performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence of a set dimension;
extracting the number of convolution kernels k in the graph convolution kernel, and performing permutation and reshaping processing in a first set order through k and the low-dimensional joint point sequence, to convert the low-dimensional joint point sequence into a joint point sequence feature vector: [1, vk, N, C·M·Tx]; wherein x is the time series;
and after performing permutation and reshaping processing in a second set order on the graph convolution kernel, performing dimension adjustment processing to convert it into a graph convolution feature vector of the set dimension: [w, vk, 1, 1];
performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector;
and carrying out human behavior recognition based on the space-time sequence feature vector.
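
For context, the conversion recited in claim 1 can be checked numerically against the Einstein-summation form commonly used in space-time graph convolution code. The following sketch is an illustration under assumed tensor sizes and a PyTorch runtime, not the patented implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: N videos, k graph kernels, C*M fused features,
# Tx frames after temporal convolution, V joints (here v == w == V).
N, k, CM, Tx, V = 2, 3, 16, 30, 18

x = torch.randn(N, k, CM, Tx, V)   # low-dimension joint point sequence
A = torch.randn(k, V, V)           # graph convolution kernel [k, v, w]

# Reference result: the einsum form used by common ST-GCN code,
# which many embedded NPUs cannot execute directly.
ref = torch.einsum('nkctv,kvw->nctw', x, A)

# Claim-1 style rewrite: permute/reshape the sequence into
# [1, v*k, N, C*M*Tx] and the kernel into [w, v*k, 1, 1], then
# run an ordinary 1x1 conv2d.
feat = x.permute(4, 1, 0, 2, 3).reshape(1, V * k, N, CM * Tx)
weight = A.permute(2, 1, 0).reshape(V, V * k, 1, 1)
out = F.conv2d(feat, weight).reshape(V, N, CM, Tx).permute(1, 2, 3, 0)

print(torch.allclose(ref, out, atol=1e-4))  # expected: True
```

After the rewrite, the contraction over v and k becomes an ordinary channel summation, so the whole operator chain reduces to permute, reshape and a 1×1 conv2d, which mainstream embedded inference chips support and quantize well.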
2. The human behavior recognition method according to claim 1, wherein the step of acquiring the node sequence data and the graph convolution kernel comprises:
acquiring a video image; wherein the video image at least comprises a human body to be detected;
detecting human behavior in the video image to obtain the joint point sequence data;
acquiring a preset weight matrix and an initial graph convolution kernel;
and adding the preset weight matrix to the initial graph convolution kernel to obtain the graph convolution kernel.
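
A minimal sketch of the kernel construction in claim 2, assuming the initial graph convolution kernel is a stack of k normalized skeleton adjacency matrices and the preset weight matrix is a trained edge-importance tensor of the same shape (both concrete shapes are assumptions):

```python
import torch

k, V = 3, 18                          # hypothetical: 3 partitions, 18 joints
initial_kernel = torch.rand(k, V, V)  # e.g., normalized adjacency per partition
preset_weight = torch.rand(k, V, V)   # trained weights, imported at deployment

# Claim 2: element-wise addition yields the graph convolution kernel.
graph_kernel = initial_kernel + preset_weight
```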
3. The human behavior recognition method according to claim 1, wherein the step of performing a time convolution process on the joint point sequence data to obtain a joint point time sequence comprises:
performing iterative update processing on the joint point sequence data in the T dimension to obtain the iteratively updated joint point time sequence: [N, C, Tx, V, M];
the formula of the iterative update processing is as follows:
x′(t) = Σ_{k=1}^{K} w_k · x(t + k − 1),  t = 1, …, T′
wherein w_k is the time convolution kernel, K is the number of convolution kernels, and T′ is equivalent to Tx.
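
One chip-friendly realization of the claimed T-dimension convolution is a 2-D convolution whose kernel spans only the time axis. The sketch below, with assumed sizes, stride 1 and no padding (so Tx = T − K + 1 is an assumption), maps [N, C, T, V, M] to [N, C, Tx, V, M]:

```python
import torch
import torch.nn.functional as F

N, C, T, V, M = 2, 16, 32, 18, 1   # hypothetical sizes
K = 3                              # temporal kernel size
x = torch.randn(N, C, T, V, M)

# Fold the person axis M into the batch so a plain conv2d can slide
# a (K, 1) kernel along T only.
xb = x.permute(0, 4, 1, 2, 3).reshape(N * M, C, T, V)
w = torch.randn(C, C, K, 1)        # time convolution kernel
y = F.conv2d(xb, w)                # output time length Tx = T - K + 1

Tx = T - K + 1
y = y.reshape(N, M, C, Tx, V).permute(0, 2, 3, 4, 1)  # [N, C, Tx, V, M]
print(y.shape)  # torch.Size([2, 16, 30, 18, 1])
```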
4. The method of claim 3, wherein the step of performing data dimension reduction processing on the joint point time sequence to obtain a low-dimension joint point sequence with a set dimension comprises:
fusing the joint point characteristics C in the joint point time sequence with the number M of identified persons, and retaining the sequence information of the other data, so as to obtain the low-dimension joint point sequence with the set dimension: [N, C·M, Tx, V].
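
The reduction in claim 4 amounts to a permute-and-reshape that merges the feature axis C with the person axis M while keeping N, Tx and V intact; a minimal sketch under the same assumed shapes:

```python
import torch

N, C, Tx, V, M = 2, 16, 30, 18, 1
seq = torch.randn(N, C, Tx, V, M)  # joint point time sequence

# Fuse C with M; the order of N, Tx, V is preserved.
low_dim = seq.permute(0, 1, 4, 2, 3).reshape(N, C * M, Tx, V)
print(low_dim.shape)  # torch.Size([2, 16, 30, 18])
```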
5. The human behavior recognition method according to claim 1, wherein the step of performing human behavior recognition based on the spatiotemporal sequence feature vector comprises:
converting the space-time sequence feature vector into a network model format supported by a chip, so as to perform human behavior recognition.
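
Claim 5 does not name a concrete format; in practice, exporting to ONNX and then running the chip vendor's quantization toolchain is one common route. A hedged sketch (the stand-in module and file name are hypothetical):

```python
import torch

# `model` stands for any network built only from the chip-friendly
# operators above (conv2d, permute, reshape); a 1x1 Conv2d is used
# here purely as a placeholder.
model = torch.nn.Conv2d(16, 32, kernel_size=1)
dummy = torch.randn(1, 16, 30, 18)

torch.onnx.export(model, dummy, "st_gcn_conv.onnx", opset_version=11)
# A vendor toolchain would then quantize st_gcn_conv.onnx for the NPU.
```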
6. A human behavior recognition apparatus, characterized in that the human behavior recognition apparatus comprises:
the data acquisition module is used for acquiring joint point sequence data; wherein the joint point sequence data comprises the number N of video images, joint point characteristics C, joint point sequence length T, the number V of joint points and the number M of identified persons;
the space-time diagram convolution module is used for carrying out time convolution processing on the joint point sequence data so as to obtain a joint point time sequence;
the trainable parameter importing module is used for acquiring a graph convolution kernel; wherein the graph convolution kernel comprises the number k of convolution kernels and the joint point numbers v and w;
the data feature dimension reduction module is used for carrying out data dimension reduction processing on the joint point time sequence so as to obtain a low-dimension joint point sequence with set dimension;
the Einstein convention conversion module is used for extracting the number k of convolution kernels from the graph convolution kernel, performing permutation and reshaping processing in a first set order using k and the low-dimension joint point sequence, so as to convert the low-dimension joint point sequence into a joint point sequence feature vector: [1, v·k, N, C·M·Tx], wherein Tx is the time sequence length after the time convolution, and performing permutation and reshaping processing in a second set order on the graph convolution kernel followed by dimension adjustment processing, so as to convert it into a graph convolution feature vector of the set dimension: [w, v·k, 1]; the space-time diagram convolution module is further used for performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector;
and the model quantization module is used for performing human behavior recognition according to the space-time sequence feature vector.
7. An intelligent terminal, characterized in that the intelligent terminal comprises a memory and a processor coupled to each other;
the memory stores program data;
the processor is configured to execute the program data to implement the human behavior recognition method of any one of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores program data executable to implement the human behavior recognition method according to any one of claims 1-5.
CN202110654055.4A 2021-06-11 Human behavior recognition method and related device Active CN113468980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654055.4A CN113468980B (en) 2021-06-11 Human behavior recognition method and related device


Publications (2)

Publication Number Publication Date
CN113468980A (en) 2021-10-01
CN113468980B (en) 2024-05-31


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video
CN111368941A (en) * 2020-04-10 2020-07-03 浙江大华技术股份有限公司 Image processing method and device and computer storage medium
CN111582095A (en) * 2020-04-27 2020-08-25 西安交通大学 Light-weight rapid detection method for abnormal behaviors of pedestrians
WO2020248581A1 (en) * 2019-06-11 2020-12-17 中国科学院自动化研究所 Graph data identification method and apparatus, computer device, and storage medium
CN112380955A (en) * 2020-11-10 2021-02-19 浙江大华技术股份有限公司 Action recognition method and device
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
GB202101800D0 (en) * 2020-02-28 2021-03-24 Fujitsu Ltd Behavior recognition method, behavior recognition device, and computer-readable recording medium
CN112800903A (en) * 2021-01-19 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gesture Recognition Based on the YOLO Algorithm; Wang Fenhua; Huang Chao; Zhao Bo; Zhang Qiang; Transactions of Beijing Institute of Technology; 2020-08-15 (No. 08); full text *
Human Behavior Recognition Method Based on Multi-Duration Feature Fusion; Xuan Qi; Li Fuxian; Journal of Zhejiang University of Technology; 2020-07-21 (No. 04); full text *

Similar Documents

Publication Publication Date Title
CN108205655B (en) Key point prediction method and device, electronic equipment and storage medium
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN111444878A (en) Video classification method and device and computer readable storage medium
CN110321761B (en) Behavior identification method, terminal equipment and computer readable storage medium
CN112132847A (en) Model training method, image segmentation method, device, electronic device and medium
CN111985281B (en) Image generation model generation method and device and image generation method and device
CN115699082A (en) Defect detection method and device, storage medium and electronic equipment
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
US20220067888A1 (en) Image processing method and apparatus, storage medium, and electronic device
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN115424088A (en) Image processing model training method and device
Oh et al. Deep visual discomfort predictor for stereoscopic 3d images
CN110097004B (en) Facial expression recognition method and device
CN117252928B (en) Visual image positioning system for modular intelligent assembly of electronic products
CN107729885B (en) Face enhancement method based on multiple residual error learning
CN111310516A (en) Behavior identification method and device
CN113052091A (en) Action recognition method based on convolutional neural network
CN113468980B (en) Human behavior recognition method and related device
CN112489144A (en) Image processing method, image processing apparatus, terminal device, and storage medium
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
CN110210523B (en) Method and device for generating image of clothes worn by model based on shape graph constraint
CN116012875A (en) Human body posture estimation method and related device
CN112926517B (en) Artificial intelligence monitoring method
CN114399648A (en) Behavior recognition method and apparatus, storage medium, and electronic device
CN114359961A (en) Pedestrian attribute identification method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant