CN113468980A - Human behavior recognition method and related device - Google Patents


Info

Publication number
CN113468980A
CN113468980A (application CN202110654055.4A)
Authority
CN
China
Prior art keywords
joint point
dimension
sequence
graph convolution
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110654055.4A
Other languages
Chinese (zh)
Other versions
CN113468980B (en)
Inventor
张兴明
白云超
魏乃科
潘华东
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202110654055.4A
Priority claimed from CN202110654055.4A
Publication of CN113468980A
Application granted
Publication of CN113468980B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Pattern recognition; classification techniques
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods


Abstract

The application discloses a human behavior recognition method and a related device, wherein the human behavior recognition method comprises the following steps: acquiring joint point sequence data and a graph convolution kernel; performing temporal convolution processing on the joint point sequence data to obtain a joint point time sequence; performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with a set dimension; performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector; performing second dimension transformation processing on the graph convolution kernel to convert it into a graph convolution feature vector with a set dimension; performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a spatio-temporal sequence feature vector; and performing human behavior recognition based on the spatio-temporal sequence feature vector. With this scheme, performing graph convolution directly for human behavior recognition can be avoided, so the method can be effectively deployed on an intelligent chip, the processing speed is high, and the loss of accuracy is small.

Description

Human behavior recognition method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a human behavior recognition method and a related device.
Background
In the field of video monitoring, the intelligent device end is gradually becoming the mainstream development direction, in which recognition of human behaviors and actions can be completed directly by deploying an algorithm on an embedded chip at the intelligent device end. In human action recognition technology, recognizing human actions by using joint point sequences has attracted growing interest, and among joint point sequence recognition methods, spatio-temporal graph convolution has been the most widely used and most effective method in recent years. Therefore, deploying spatio-temporal graph convolution directly on the chip of an intelligent terminal device is the main future development direction for recognizing human actions in the field of video monitoring.
However, although spatio-temporal graph convolution has advantages such as high speed and high recognition accuracy when recognizing joint point sequences, it generally processes unstructured data and relies on graph convolution operations, which at present cannot be implemented directly on the chip of a lightweight intelligent device; moreover, the operations involved in spatio-temporal graph convolution are relatively complex. Deployment of spatio-temporal graph convolution on the lightweight intelligent device side is therefore greatly hindered.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a human behavior recognition method and a related device, so as to effectively solve the problem of deploying graph convolution on an intelligent chip.
In order to solve the above problem, a first aspect of the present application provides a human behavior recognition method, wherein the human behavior recognition method comprises: acquiring joint point sequence data and a graph convolution kernel; performing temporal convolution processing on the joint point sequence data to obtain a joint point time sequence; performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with a set dimension; performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector; performing second dimension transformation processing on the graph convolution kernel to convert it into a graph convolution feature vector with a set dimension; performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a spatio-temporal sequence feature vector; and performing human behavior recognition based on the spatio-temporal sequence feature vector.
The step of acquiring the joint point sequence data and the graph convolution kernel comprises: acquiring a video image, wherein the video image at least comprises a human body to be detected; performing human behavior detection on the video image to obtain the joint point sequence data; acquiring a preset weight matrix and an initial graph convolution kernel; and adding the preset weight matrix and the initial graph convolution kernel to obtain the graph convolution kernel.
The joint point sequence data comprises a number N of video images, a joint point feature C, a joint point sequence length T, a number V of joint points and a number M of recognized people, and the step of performing temporal convolution processing on the joint point sequence data to obtain the joint point time sequence comprises: performing iterative update processing on the joint point sequence data along the T dimension to obtain an iteratively updated joint point time sequence: [N, C, T_x, V, M]. The formula of the iterative update processing is:

F(x_t) = Σ_{k=1}^{K} f_k * x_t

wherein x is the time series, f_k is a temporal convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to T_x.
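As a rough sketch of how such a temporal update could look in practice (assuming PyTorch, toy sizes, and a hypothetical temporal kernel size; the patent gives no implementation), the convolution along T can be a plain Conv2d with a (K, 1) kernel once the person axis M is folded into the batch axis:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: N videos, C joint features, T frames, V joints, M people
N, C, T, V, M = 2, 3, 16, 18, 2
K = 9                                   # temporal kernel size (assumption)
x = torch.randn(N, C, T, V, M)

# Fold M into the batch axis so a standard Conv2d can slide along T only,
# leaving the joint axis V untouched.
x_m = x.permute(0, 4, 1, 2, 3).reshape(N * M, C, T, V)
tconv = nn.Conv2d(C, C, kernel_size=(K, 1), padding=(K // 2, 0))
y = tconv(x_m)                          # [N*M, C, T_x, V]; T_x == T here
y = y.reshape(N, M, C, y.shape[2], V).permute(0, 2, 3, 4, 1)
print(tuple(y.shape))                   # (2, 3, 16, 18, 2)
```

With odd K and symmetric padding the sequence length is preserved, so T_x equals T in this toy setting.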
The step of performing data dimension reduction processing on the joint point time sequence to obtain the low-dimensional joint point sequence with the set dimension comprises: fusing the joint point feature C in the joint point time sequence with the number M of recognized people, and retaining the sequence information of the other data, to obtain a low-dimensional joint point sequence with the set dimension: [N, C_M, T_x, V].
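A minimal sketch of this fusion step, assuming PyTorch and made-up sizes (the patent specifies only the resulting layout [N, C_M, T_x, V]):

```python
import torch

# Hypothetical sizes; C and M are fused into one channel axis C_M = C*M,
# reducing the 5-D sequence to the 4-D layout most embedded chips support.
N, C, T, V, M = 2, 3, 16, 18, 2
x = torch.randn(N, C, T, V, M)

x4d = x.permute(0, 1, 4, 2, 3).reshape(N, C * M, T, V)   # [N, C_M, T_x, V]
print(tuple(x4d.shape))   # (2, 6, 16, 18)
```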
The step of performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into the joint point sequence feature vector comprises: extracting the number k of convolution kernels from the graph convolution kernel, and performing permutation and reshaping processing of a first set order on the low-dimensional joint point sequence by means of k, so as to convert the low-dimensional joint point sequence into the joint point sequence feature vector: [1, V·k, N, C_M·T_x].
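One plausible reading of this permutation-and-reshaping step, sketched in PyTorch with toy sizes; the assumption that the k copies of the C_M channels come from a preceding pointwise convolution is ours, not the patent's:

```python
import torch

# Hypothetical sizes; k copies of the C_M channels are assumed to be
# produced upstream, so the input here is [N, k*C_M, T, V].
N, CM, T, V, k = 2, 6, 16, 18, 3
x = torch.randn(N, k * CM, T, V)

feat = (x.reshape(N, k, CM, T, V)
         .permute(1, 4, 0, 2, 3)          # [k, V, N, C_M, T]
         .reshape(1, k * V, N, CM * T))   # [1, V*k, N, C_M*T_x]
print(tuple(feat.shape))   # (1, 54, 2, 96)
```

The joint and kernel axes end up on the channel dimension of an ordinary 4-D feature map, which is what a chip-supported convolution expects.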
The graph convolution kernel comprises a number k of convolution kernels and joint point numbers v and w, and the step of performing second dimension transformation processing on the graph convolution kernel to convert it into the graph convolution feature vector with the set dimension comprises: performing permutation and reshaping processing of a second set order on the graph convolution kernel, and then performing dimension adjustment processing, so as to convert the graph convolution kernel into the graph convolution feature vector with the set dimension: [w, v·k, 1, 1].
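A hedged PyTorch sketch of the kernel-side transformation, using toy sizes; the exact permutation order is our assumption, chosen so the result has the stated layout [w, v·k, 1, 1]:

```python
import torch

# Hypothetical sizes; v == w == number of joints, k motion sub-states.
k, v, w = 3, 18, 18
A = torch.randn(k, v, w)                            # graph convolution kernel

kern = A.permute(2, 0, 1).reshape(w, k * v, 1, 1)   # [w, v*k, 1, 1]
print(tuple(kern.shape))   # (18, 54, 1, 1)
```

The two trailing singleton dimensions make the tensor look like the weight of a 1x1 convolution with k·v input channels and w output channels.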
The step of performing human behavior recognition based on the spatio-temporal sequence feature vector comprises: converting the spatio-temporal sequence feature vector into a network model format supported by the chip to perform human behavior recognition.
In order to solve the above problem, a second aspect of the present application provides a human behavior recognition apparatus, wherein the human behavior recognition apparatus includes: the data acquisition module is used for acquiring joint point sequence data; the time-space graph convolution module is used for performing time convolution processing on the joint point sequence data to obtain a joint point time sequence; a trainable parameter importing module for obtaining a graph convolution kernel; the data characteristic dimension reduction module is used for carrying out data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with set dimensions; the Einstein convention conversion module is used for performing first dimension conversion processing on the low-dimension joint point sequence to convert the low-dimension joint point sequence into a joint point sequence feature vector, and performing second dimension conversion processing on the graph convolution kernel to convert the graph convolution kernel into a graph convolution feature vector with a set dimension; the space-time graph convolution module is also used for carrying out functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; and the model quantization module is used for identifying human body behaviors according to the space-time sequence feature vectors.
In order to solve the above problem, a third aspect of the present application provides an intelligent terminal, wherein the intelligent terminal comprises a memory and a processor coupled to each other; the memory stores program data; the processor is used for executing program data to realize the human behavior recognition method.
In order to solve the above problem, a fourth aspect of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores program data, and the program data can be executed to implement the human behavior recognition method as described in any one of the above.
The invention has the beneficial effects that: in contrast to the prior art, the human behavior recognition method in the present application performs temporal convolution processing on the acquired joint point sequence data to obtain a joint point time sequence, performs data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with a set dimension, and then performs first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector, low-dimensional data that an intelligent chip can process directly; the acquired graph convolution kernel is then subjected to second dimension transformation processing to convert it into a graph convolution feature vector with a set dimension, so that functional convolution processing can be performed on the joint point sequence feature vector and the graph convolution feature vector to obtain a spatio-temporal sequence feature vector, avoiding obtaining the spatio-temporal sequence feature vector directly through graph convolution; human behavior recognition is then performed based on the spatio-temporal sequence feature vector. Thus, by performing data dimension reduction processing and corresponding dimension transformation processing on the high-dimensional joint point time sequence, the processing that originally required graph convolution can be replaced by functional convolution processing, so that the human behavior recognition method can be effectively deployed on an intelligent chip, in particular the chip of a lightweight intelligent terminal device, the corresponding processing speed is increased, and the loss of accuracy is reduced.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a human behavior recognition method according to the present application;
FIG. 2 is a schematic flow chart of one embodiment of S11 of FIG. 1;
FIG. 3 is a schematic diagram of one embodiment of the present application for obtaining joint sequence data;
FIG. 4 is a diagram illustrating an embodiment of prior art human behavior recognition using space-time graph convolution;
FIG. 5 is a schematic diagram of one embodiment of an Einstein summation convention conversion process;
FIG. 6 is a diagram of an application scenario equivalent to the human behavior recognition method of FIG. 4;
FIG. 7 is a diagram illustrating an embodiment of a first dimension transformation process performed on a low-dimensional sequence of joint points according to the present application;
FIG. 8 is a diagram of one embodiment of a second dimension transformation process on a graph convolution kernel;
FIG. 9 is a schematic structural diagram of an embodiment of a human behavior recognition apparatus according to the present application;
FIG. 10 is a schematic structural diagram of an embodiment of an intelligent terminal according to the present application;
FIG. 11 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a human behavior recognition method according to the present application. Specifically, the human behavior recognition method in this embodiment may include the following steps:
s11: and acquiring joint point sequence data and a graph volume kernel.
The human behavior recognition method specifically collects video images including a human body through monitoring equipment or any reasonable shooting device, and performs human body detection, human body tracking and joint point extraction based on the video images. Understandably, the specific behavior characteristics of a human body are usually expressed by the relative positions of its joint points, so the corresponding human behavior can be further recognized by processing the acquired joint point sequence data.
Specifically, the intelligent terminal collects video images including human bodies through monitoring equipment in communication connection with the intelligent terminal, any reasonable shooting device, or a camera carried by the intelligent terminal itself, so as to extract joint point information of the human bodies in the video images and obtain the corresponding joint point sequence data; alternatively, it receives joint point sequence data sent by another intelligent terminal.
Further, a graph convolution kernel, sent by another intelligent terminal and obtained after training that terminal's deep learning network model, is received; alternatively, a trained preset weight matrix sent by another intelligent terminal and an initial graph convolution kernel are received and added to obtain the graph convolution kernel.
Understandably, in human behavior recognition, different joint points have different influence weights on the recognition result. In order to improve the accuracy of behavior recognition, additional parameters that do not belong to the fixed parameters of the spatio-temporal graph convolution model are introduced; that is, trainable parameters are used to adjust the importance of different body parts, so that the weight of the connection between each pair of joint points is obtained and the different degrees of body-part importance are determined.
The graph convolution kernel is obtained by acquiring and processing trainable parameters, so as to ensure the accuracy of human behavior recognition in subsequent processing. In order to reduce the computation load of the intelligent terminal in this embodiment, a graph convolution kernel obtained by training another network model, on another intelligent terminal or on this one, may be received directly; alternatively, a preset weight matrix and an initial graph convolution kernel are received and the graph convolution kernel is then obtained by addition, which can improve the processing efficiency of the corresponding human behavior recognition.
Optionally, the intelligent terminal may be one of a mobile phone, an unmanned aerial vehicle, a smart watch, a tablet computer, a smart camera, or any other reasonable intelligent terminal including an embedded chip, which is not limited in this application.
S12: and carrying out time convolution processing on the joint point sequence data to obtain a joint point time sequence.
Further, temporal convolution processing is performed on the acquired joint point sequence data; for example, an iterative update operation is performed on one dimension of the joint point sequence data to obtain an updated joint point time sequence, so that local features of the joint point changes over time can be obtained.
In an embodiment, the joint point sequence data specifically includes the joint point sequence length T, so that the corresponding joint point time sequence can be obtained by performing an iterative update operation on the joint point sequence data in the T dimension, while the other data in the joint point sequence data remain unchanged.
S13: and performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with set dimensions.
It can be understood that the chips of some intelligent terminals, especially those of lightweight intelligent terminal devices, usually do not support operations on high-dimensional data; therefore, high-dimensional data needs to undergo data dimension reduction processing, for example converting 5-dimensional data into 4-dimensional data, to suit certain specific application scenarios.
Therefore, since the joint point time sequence is usually high-dimensional data, in order to broaden the applicable application scenarios and improve the corresponding processing efficiency, data dimension reduction processing can be performed on the joint point time sequence, for example by fusing two dimensions of data in the joint point time sequence and retaining the sequence information of the other data, so as to obtain a low-dimensional joint point sequence with the set dimension.
Optionally, the set dimension is 4 dimensions or any other reasonable dimension, so that the method can be adapted to a wider range of processing scenarios on intelligent terminals.
S14: and carrying out first-dimension transformation processing on the low-dimension joint point sequence so as to convert the low-dimension joint point sequence into a joint point sequence feature vector.
In the process of performing operations on joint point sequence data for human behavior recognition, graph convolution operations are usually unavoidable, and the Einstein summation convention operation is the core operation mode of graph convolution; therefore, a human behavior recognition method involving Einstein summation convention operations cannot be realized on an embedded hardware chip that supports neither 5-dimensional data operations nor graph convolution processing. An alternative method is thus needed to solve the problem that the embedded chip cannot directly perform the Einstein summation convention operation.
It can be understood that, in order to replace the subsequent Einstein summation convention conversion processing of high-dimensional data, the low-dimensional joint point sequence, once obtained, further needs adaptive adjustment: the first dimension transformation processing, that is, several permutation and reshaping operations in a specific order, is performed on the low-dimensional joint point sequence to change the order and fusion characteristics of the data in each dimension, thereby converting it into a joint point sequence feature vector suitable for the subsequent convolution processing.
S15: and performing second-dimension transformation processing on the graph convolution kernel to convert the graph convolution kernel into a graph convolution feature vector with a set dimension.
Similarly, in order to replace the subsequent Einstein summation convention conversion processing involving high-dimensional data, the second dimension transformation processing, that is, several permutation and reshaping operations in another specific order, is performed on the graph convolution kernel at the same time, changing the order and fusion characteristics of the data in each dimension of the graph convolution kernel and converting it into a graph convolution feature vector with a set dimension of the same dimensionality as the joint point sequence feature vector; functional convolution processing can thus be performed directly on the graph convolution feature vector and the joint point sequence feature vector in the subsequent process, avoiding the Einstein summation convention conversion processing.
S16: and performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector.
Further, since the joint point sequence feature vector and the graph convolution feature vector have been converted into data of the same dimensionality with matched ordering, functional convolution processing, such as functional convolution equivalent to matrix multiplication, can be performed on them directly to obtain the corresponding spatio-temporal sequence feature vector.
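A hedged end-to-end check, in PyTorch with toy sizes, that such a reshaped 1x1 functional convolution can reproduce an Einstein-summation graph convolution; the einsum pattern 'nkctv,kvw->nctw' is an assumption modeled on common spatio-temporal graph convolution implementations, not taken from the patent:

```python
import torch
import torch.nn.functional as F

N, K, C, T, V = 2, 3, 4, 5, 6
W = V
x = torch.randn(N, K, C, T, V)
A = torch.randn(K, V, W)          # graph convolution kernel

# Reference: the Einstein-summation form of graph convolution
ref = torch.einsum('nkctv,kvw->nctw', x, A)

# Replacement: fold (K, V) into channels, treat N x (C*T) as the spatial
# map, and apply the kernel as a 1x1 functional convolution.
feat = x.permute(1, 4, 0, 2, 3).reshape(1, K * V, N, C * T)   # [1, V*k, N, C*T]
kern = A.permute(2, 0, 1).reshape(W, K * V, 1, 1)             # [w, v*k, 1, 1]
out = F.conv2d(feat, kern)                                    # [1, w, N, C*T]
out = out.reshape(W, N, C, T).permute(1, 2, 3, 0)             # [N, C, T, w]

assert torch.allclose(ref, out, atol=1e-5)
```

Because the feature map and the kernel use the same (k, v) channel ordering, the convolution output matches the einsum result; on-chip, only the reshaped tensors and a standard 1x1 convolution are needed.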
S17: and carrying out human behavior identification based on the space-time sequence feature vector.
And further, identifying and classifying the behavior of the human body to be detected by the time-space sequence feature vector to obtain the behavior feature of the human body to be detected.
According to the above scheme, temporal convolution processing is performed on the acquired joint point sequence data to obtain a joint point time sequence, data dimension reduction processing is performed on the joint point time sequence to obtain a low-dimensional joint point sequence with a set dimension, and first dimension transformation processing is then performed on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector, low-dimensional data that the intelligent chip can process directly; the acquired graph convolution kernel is then subjected to second dimension transformation processing to convert it into a graph convolution feature vector with a set dimension, so that functional convolution processing can be performed on the joint point sequence feature vector and the graph convolution feature vector to obtain a spatio-temporal sequence feature vector, effectively avoiding obtaining the spatio-temporal sequence feature vector directly through graph convolution processing; human behavior recognition can then be performed based on the spatio-temporal sequence feature vector.
Therefore, by performing data dimension reduction processing and corresponding dimension transformation processing on the high-dimensional joint point time sequence, the processing that originally required graph convolution can be replaced by functional convolution processing, and the Einstein summation processing involved in high-dimensional data operations can be effectively deployed on an intelligent chip, in particular the chip of a lightweight intelligent terminal device. This broadens the application range of the corresponding human behavior recognition method; and because the more complex graph convolution processing is avoided, the corresponding processing speed is increased, the efficiency is improved, and the loss of accuracy is reduced.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating an embodiment of S11 in fig. 1.
In an embodiment, the step S11 may specifically include:
s111: a video image is acquired.
Specifically, the intelligent terminal collects video images including a human body through monitoring equipment in communication connection with the intelligent terminal, any reasonable shooting device or a camera carried by the intelligent terminal; or receiving video images sent by other intelligent terminals.
The video image at least comprises a human body to be detected, so that behavior recognition can be carried out on the human body to be detected based on the video image.
S112: and carrying out human behavior detection on the video image to obtain joint point sequence data.
Further, referring to FIG. 3, FIG. 3 is a schematic diagram of an embodiment of the present application for obtaining joint point sequence data. After the video image including at least the human body to be detected is obtained, human body detection, human body tracking and joint point extraction are performed on it in sequence; the joint point sequence data of the human body to be detected can then be obtained through processing by the spatio-temporal graph convolution module integrated on the corresponding intelligent terminal.
In one embodiment, the joint point sequence data specifically includes 5 parameters: the batch size (number of video images) N, the joint point feature C, the joint point sequence length T, the number of joint points V, and the number of recognized people M. Understandably, the joint point sequence data specifically corresponds to the motion characteristics of the human body in terms of its skeleton state.
S113: and acquiring a preset weight matrix and an initial graph convolution kernel.
Specifically, the trained preset weight matrix, sent by another intelligent terminal or by another network model in the intelligent terminal performing human behavior recognition in this embodiment, and the initial graph convolution kernel given by the corresponding network model, are received.
It can be understood that, given the different influence weights of different joint points on the recognition result, some trainable parameters can be introduced through an attention mechanism to ensure the accuracy of the final human behavior recognition, for example by introducing a preset weight matrix obtained through training. Because these parameters are additionally introduced, they cannot be trained directly on the chip of the intelligent terminal; therefore, the corresponding training process must be performed with the help of other intelligent terminals, and the preset weight matrix obtained after training, together with the initial graph convolution kernel given by the corresponding network model, is imported into the intelligent terminal performing human behavior recognition in this embodiment.
S114: and adding the preset weight matrix and the initial graph convolution kernel to obtain a graph convolution kernel.
Further, the preset weight matrix is added to the initial graph convolution kernel, and the sum of the two is determined as the graph convolution kernel to be used in the human behavior recognition processing.
In one embodiment, the graph convolution kernel specifically includes three parameters, which respectively correspond to the number k of convolution kernels for the motion sub-states and the joint point numbers v and w. The motion sub-states specifically refer to the three states of centripetal motion, centrifugal motion and stillness during human movement, and the graph convolution kernel is a matrix of fixed parameters obtained from the human body topological graph.
The joint point numbers v and w in the graph convolution kernel and the joint point number V in the joint point sequence data correspond to the same numerical value; different symbols are introduced only to facilitate the subsequent specific operation process.
Further, in an embodiment, the joint point sequence data specifically includes the number N of video images, the joint point feature C, the length T of the joint point sequence, the number V of joint points, and the number M of recognized people, and the above-mentioned S12 of the human behavior recognition method of the present application may further specifically include: performing iterative update processing on the joint point sequence data in the T dimension, to obtain the iteratively updated joint point time series: [N, C, Tx, V, M];
wherein, the formula of the iterative update processing is:

F(x_t) = Σ_{k=1}^{K} f_k · x_{t+k}

wherein x is the time series, f_K is the time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
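The iterative update can be sketched as a sliding weighted sum along the T dimension. The helper below is only an illustrative reading of the formula (a single filter shared across channels, no padding, stride 1), not the exact implementation of the application:

```python
import numpy as np

def temporal_conv(x, f):
    """Iterative update along T: F(x_t) = sum_k f_k * x_{t+k}.

    x: joint point sequence data [N, C, T, V, M]
    f: the K taps of a time convolution kernel (shared across channels)
    Returns [N, C, Tx, V, M] with Tx = T - K + 1 (no padding, stride 1).
    """
    K = len(f)
    N, C, T, V, M = x.shape
    Tx = T - K + 1
    out = np.zeros((N, C, Tx, V, M))
    for t in range(Tx):
        # weighted sum over a K-frame temporal window starting at t
        out[:, :, t] = np.einsum('k,nckvm->ncvm', f, x[:, :, t:t + K])
    return out
```

Note that only the T axis changes (T becomes Tx); the other four axes pass through untouched, which is what later allows the spatial graph convolution to be applied independently.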
Further, in an embodiment, S13 of the human behavior recognition method of the present application may further include: fusing the joint point feature C in the joint point time series with the number M of recognized people, and retaining the sequence information of the other data, to obtain a low-dimensional joint point sequence of the set dimension: [N, CM, Tx, V].
Understandably, since the embedded chip (e.g. a chip series integrated on a lightweight intelligent terminal) does not support operations on 5-dimensional data, all 5-dimensional data corresponding to the joint point time series must be converted into 4-dimensional data. Specifically, the M dimension and the C dimension are fused; since this operation is performed during data reading, the transformation and fusion of the arrays can be carried out directly, and the sequence information of the T dimension and the V dimension is still retained after fusion, so as to redefine the characteristics of the input data. After all the information of the T dimension and the V dimension is aligned, the low-dimensional joint point sequence can be obtained: [N, CM, Tx, V]. In this way, the unstructured data processed by graph convolution can be converted into pseudo-structured data that an intelligent chip can support.
Fusing the M dimension and the C dimension effectively preserves the alignment of the joint point sequence in the time and space dimensions, so neither the time convolution operation nor the graph convolution operation is affected.
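A minimal sketch of this fusion, assuming the person axis M is simply folded into the channel axis while the T and V axes keep their order (the exact array layout is an assumption):

```python
import numpy as np

def fuse_person_into_channel(x):
    """Fuse M into C: [N, C, Tx, V, M] -> [N, CM, Tx, V].

    T and V keep their sequence order, so the later time convolution
    and graph convolution are unaffected; only C and M are merged.
    """
    N, C, Tx, V, M = x.shape
    x = x.transpose(0, 4, 1, 2, 3)     # [N, M, C, Tx, V]: bring M next to C
    return x.reshape(N, M * C, Tx, V)  # [N, CM, Tx, V]: 5-D becomes 4-D
```

After this step every array in the pipeline is at most 4-dimensional, which is the shape class the embedded chip can execute.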
Further, in an embodiment, S14 of the human behavior recognition method of the present application may further include: extracting the number k of convolution kernels from the graph convolution kernel, and performing permute and reshape processing in a first set order on k and the low-dimensional joint point sequence, so as to convert the low-dimensional joint point sequence into a joint point sequence feature vector: [1, Vk, N, CMTx].
Specifically, as shown in fig. 4, fig. 4 is a schematic diagram of an embodiment of performing human behavior recognition by space-time graph convolution in the prior art. As can be seen, in existing human behavior recognition, after the joint point sequence data is obtained, time convolution is usually performed on it first; a graph convolution kernel, formed as the sum of the corresponding trainable parameters in the attention mechanism and the initial graph convolution kernel, is then introduced, and the corresponding graph convolution processing is performed to complete the feature extraction process, so that the subsequent human behavior recognition can be completed according to the output of the feature extraction.
The joint point sequence data comprises 5 parameters, namely BatchSize (n), the characteristics (c) of the joint points, the length (t) of the joint point sequence, the number (v) of the joint points and the number (m) of the identification persons. Understandably, the joint point sequence data specifically corresponds to the motion characteristics of the human body in the state of the human body skeleton; the graph convolution kernel contains three parameters, namely a motion substate (number of convolution kernels) k, and joint numbers v and w. The motion sub-states specifically refer to three states including centripetal motion, centrifugal motion and static motion when a human body moves, and the graph convolution kernel is a matrix of fixed parameters obtained according to a human body topological graph.
The attention mechanism adds, to the initial graph convolution kernel, a trainable parameter of the same size as the graph convolution kernel, so that the weight of the connection between each pair of joint points can be learned and the importance of different torso parts can be determined. The importance of different torso parts can be adjusted through changes in the trainable parameters; these trainable parameters do not belong to the fixed parameters of the space-time graph convolution processing, but are additionally added and introduced to make the space-time graph convolution processing more accurate.
The time convolution is used to learn local features of how the joint points change over time, so as to iteratively update each node in the t dimension. The formula of the iterative update processing is:

F(x_t) = Σ_{k=1}^{K} f_k · x_{t+k}

wherein x is the time series, f_K is the time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
The graph convolution is specifically used to learn features of how the joint points change in space; for example, the joint points in the C dimension and the V dimension of each piece of sequence data are updated, and the corresponding update formula is: Σ_{k,v} (XW)_{nkctv} · A_{kvw} = X′_{nctw}; wherein A is the graph convolution kernel, X is the joint point sequence after the iterative update processing, and k is the number of graph convolution kernels.
Referring to fig. 5, fig. 5 is a schematic diagram of the Einstein summation convention conversion process according to an embodiment. The update formula above can also be expressed in the Einstein summation convention as: nkctv, kvw → nctw; where n represents the product of BatchSize and M, c represents the joint point feature, t represents the length of the joint point sequence, and v and w represent the number of joint points.
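The Einstein summation form can be written directly with numpy's einsum; the shapes below are illustrative only (n here already folds BatchSize and M together, and w ranges over the same joint points as v):

```python
import numpy as np

n, k, c, t, v = 2, 3, 4, 6, 5
w = v                                  # w indexes the same joint points as v
XW = np.random.rand(n, k, c, t, v)     # weighted joint point sequence (XW)
A = np.random.rand(k, v, w)            # fixed graph convolution kernel

# the core graph convolution step in Einstein summation: nkctv, kvw -> nctw
X_new = np.einsum('nkctv,kvw->nctw', XW, A)
print(X_new.shape)  # (2, 4, 6, 5)
```

The left operand is 5-dimensional, which is precisely the operation an embedded chip cannot run; the transformations below eliminate that 5-dimensional operand.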
The Einstein summation convention operation is the core operation of the graph convolution, and is also the most difficult problem to solve on an embedded hardware chip. As can be seen from the above formula conversion process and fig. 5, the formula involves an operation between the 5-dimensional data corresponding to the joint point sequence and the 3-dimensional data corresponding to the graph convolution kernel; because the embedded hardware chip does not support operations on 5-dimensional data, the joint point sequence data and the graph convolution kernel need to undergo dimension transformation.
It can thus be seen that human behavior recognition using the graph convolution method inevitably involves arithmetic operations on 5-dimensional data, and therefore cannot be supported by some lightweight embedded chips. To ensure that human behavior recognition can still be performed on such chips, the above graph convolution processing procedure needs to be equivalently replaced; in particular, the Einstein summation convention expression in the graph convolution processing procedure needs to be equivalently replaced.
It can be understood that, as shown in fig. 6, fig. 6 is a schematic diagram of an embodiment of the present application of performing the first dimension transformation processing on the low-dimensional joint point sequence. Since the Einstein summation operation with the graph convolution kernel is performed in the k dimension and the v dimension, the data k in the graph convolution kernel needs to be extracted first, and then k and the low-dimensional joint point sequence [N, CM, Tx, V] undergo multiple permute and reshape operations in a first set order; for example, in the sequence shown in fig. 7, reshape, permute, and then reshape and permute are performed in turn on [N, kCM, Tx, V], so that the k dimension becomes adjacent to and is combined with the v dimension, and the vk dimension is finally placed at the position of the specific channel dimension. The low-dimensional joint point sequence is thereby converted into the joint point sequence feature vector [1, Vk, N, CMTx], which can participate as the channel feature in the functional convolution operation corresponding to the converted 1x1 convolution kernel.
Therefore, the operation can effectively avoid the occurrence of 5-dimensional data, and simultaneously can combine the v-dimensional data and the k-dimensional data to participate in the 1x1 convolution operation.
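The first dimension transformation can be sketched as follows; the exact permute/reshape order is an assumption, and what matters is that the data is tiled once per graph convolution kernel and that v and k end up merged, v-major, on the channel axis:

```python
import numpy as np

def to_sequence_feature_vector(x, k):
    """[N, CM, Tx, V] -> [1, Vk, N, CMTx].

    The data is tiled k times (one copy per graph convolution kernel),
    then v and k are made adjacent and merged, v-major, so that Vk can
    serve as the channel dimension of the later 1x1 convolution.
    """
    N, CM, Tx, V = x.shape
    x = np.tile(x[:, None], (1, k, 1, 1, 1))  # [N, k, CM, Tx, V]
    x = x.transpose(4, 1, 0, 2, 3)            # [V, k, N, CM, Tx]
    return x.reshape(1, V * k, N, CM * Tx)    # [1, Vk, N, CMTx]
```

Every intermediate here is built and consumed during data preparation; the array handed to the chip is 4-dimensional.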
Still further, in an embodiment, the graph convolution kernel specifically includes the number k of convolution kernels and the joint point numbers v and w, and S15 of the human behavior recognition method of the present application may further include: performing permute and reshape processing in a second set order on the graph convolution kernel, followed by dimension adjustment processing, to convert it into a graph convolution feature vector of the set dimension: [w, vk, 1, 1].
As shown in fig. 7, fig. 7 is a schematic diagram of an embodiment of performing the second dimension transformation processing on the graph convolution kernel. In order to transform the graph convolution kernel into the convolution kernel of the 1x1 convolution through multiple operations of dimension transposition and dimension splicing, the k dimension and the v dimension in the original graph convolution kernel may first be exchanged so that they can be combined into vk. Since merging the dimensions as vk or as kv gives different results, it is necessary to perform one permute and reshape operation, then place vk on the specific channel dimension of the convolution kernel, perform one more permute operation, and finally add two dimensions through reshape so that it becomes a 1x1 convolution kernel, thereby obtaining the graph convolution feature vector of the set dimension: [w, vk, 1, 1].
It can thus be seen that the graph convolution kernel in the Einstein summation convention is converted into the BatchSize dimension and the channel dimension of the 1x1 convolution, so that it can participate in the subsequent functional convolution operation.
The functional convolution operation specifically refers to performing a convolution operation between the newly constructed 1x1 convolution kernel and the newly constructed data. Because it carries no trainable parameters, unlike the graph convolution operation, it can be supported on various embedded intelligent chips. Moreover, since a 1x1 convolution does not change the feature dimensions of the data and operates only in the channel dimension, using the 1x1 convolution greatly reduces the time consumed.
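Putting the two transformations together, the equivalence with the original Einstein summation can be checked numerically. In the sketch below the 1x1 functional convolution is stood in for by a plain matrix multiplication over the vk channel axis, which is exactly what a 1x1 convolution computes; all shapes are illustrative:

```python
import numpy as np

k_, v_, w_ = 3, 5, 5
N, C, T = 2, 4, 6
rng = np.random.default_rng(1)
A = rng.random((k_, v_, w_))            # graph convolution kernel [k, v, w]
X = rng.random((N, k_, C, T, v_))       # weighted joint features

# reference result: the original graph convolution, nkctv, kvw -> nctw
ref = np.einsum('nkctv,kvw->nctw', X, A)

# kernel transform: [k, v, w] -> [w, vk, 1, 1] (v-major merge of v and k)
W = A.transpose(1, 0, 2).reshape(v_ * k_, w_).T.reshape(w_, v_ * k_, 1, 1)

# data transform: [N, k, C, T, v] -> channels vk, "pixels" N*C*T
feat = X.transpose(4, 1, 0, 2, 3).reshape(v_ * k_, N * C * T)

# 1x1 functional convolution == matrix multiply over the vk channels
out = (W.reshape(w_, v_ * k_) @ feat).reshape(w_, N, C, T)
out = out.transpose(1, 2, 3, 0)         # back to [N, C, T, w]

print(np.allclose(ref, out))  # True
```

Both data and kernel must use the same v-major merge order; merging as kv on one side and vk on the other would misalign the channels, which is why the text stresses that vk and kv give different results.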
Specifically, after the joint point sequence feature vector [1, Vk, N, CMTx] and the graph convolution feature vector [w, vk, 1, 1] are obtained in turn through the corresponding dimension transformations, a functional convolution operation, for example one equivalent to a matrix multiplication, is performed directly on the two to obtain the corresponding space-time sequence feature vector.
Further, in an embodiment, the S17 of the human behavior recognition method of the present application may further include: and converting the space-time sequence feature vector into a chip supportable network model format to perform human behavior identification.
Specifically, the above operation process for obtaining the space-time sequence feature vector may be performed by a space-time graph convolutional network model having a corresponding network architecture. In order to quantize the space-time graph convolutional network model into a type that the chip can support and deploy it on the chip of an intelligent terminal, the space-time graph convolutional network model needs to be converted into a chip-supportable network model, for example, any reasonable network model such as a Caffe (Convolutional Architecture for Fast Feature Embedding) model or an ONNX (Open Neural Network Exchange) model, which is not limited in this application.
In a specific embodiment, as shown in fig. 8, fig. 8 is a schematic diagram of an application scenario equivalent to the human behavior recognition method in fig. 4. In order to avoid the graph convolution processing in human behavior recognition, the graph convolution processing can be replaced by a functional convolution by additionally introducing a data feature dimension reduction process, an Einstein convention conversion process and a trainable parameter import process.
Specifically, the human behavior recognition processing is completed by a preset network model, which specifically includes a space-time graph convolution module, a data feature dimension reduction module, an Einstein convention conversion module, a trainable parameter import module, and a model quantization module.
It can then be seen that, for the preset network model, when the joint point sequence data [N, C, T, V, M] is obtained, the data feature dimension reduction module may first perform data dimension reduction processing on it to obtain new joint point sequence data: [N, CM, T, V].
Further, the space-time graph convolution module performs time convolution on the new joint point sequence data [N, CM, T, V] to obtain the joint point time series [N, CM, Tx, V]. A graph convolution kernel [k, v, w] is further imported through the trainable parameter import module; or, a preset weight matrix and an initial graph convolution kernel are imported and added together to obtain the graph convolution kernel: [k, v, w].
Still further, the Einstein convention conversion module performs the corresponding dimension transformation processing on the joint point time series [N, CM, Tx, V] and the graph convolution kernel [k, v, w] respectively, so as to obtain the joint point sequence feature vector [1, Vk, N, CMTx] and the graph convolution feature vector [w, vk, 1, 1]. The new graph convolution corresponding to the functional convolution operation can therefore replace the existing graph convolution in processing the two, so as to obtain the corresponding space-time sequence feature vector, and human behavior recognition can then be carried out on the basis of the space-time sequence feature vector through the model quantization module.
It can be understood that the above process of performing operations through each module is the same as the specific formula operations mentioned in the corresponding embodiments above; for details, refer to the corresponding text and related drawings of the above embodiments, which are not repeated here.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of a human behavior recognition device according to the present application. In the present embodiment, the human behavior recognition device 21 includes: a data acquisition module 211, a space-time graph convolution module 212, a trainable parameter import module 213, a data feature dimension reduction module 214, an Einstein convention conversion module 215, and a model quantization module 216.
The data acquiring module 211 is configured to acquire joint point sequence data; the trainable parameter import module 213 is configured to obtain a graph convolution kernel; the space-time diagram convolution module 212 is configured to perform time convolution processing on the joint point sequence data to obtain a joint point time sequence; the data feature dimension reduction module 214 is configured to perform data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with a set dimension; the einstein convention conversion module 215 is configured to perform first-dimension conversion processing on the low-dimension joint point sequence to convert the low-dimension joint point sequence into a joint point sequence feature vector, and perform second-dimension conversion processing on the graph convolution kernel to convert the graph convolution kernel into a graph convolution feature vector with a set dimension; the space-time graph convolution module 212 is further configured to perform functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector; the model quantization module 216 is configured to perform human behavior recognition according to the spatio-temporal sequence feature vectors.
In some embodiments, the data acquisition module 211 performs the step of acquiring the joint sequence data including: acquiring a video image; wherein the video image at least comprises a human body to be detected; and carrying out human behavior detection on the video image to obtain joint point sequence data.
In some embodiments, the trainable parameter import module 213 performs the step of obtaining a graph convolution kernel including: acquiring a preset weight matrix and an initial graph convolution kernel; and adding the preset weight matrix and the initial graph convolution kernel to obtain a graph convolution kernel.
In some embodiments, the joint point sequence data includes the number N of video images, the joint point feature C, the length T of the joint point sequence, the number V of joint points, and the number M of recognized people, and the space-time graph convolution module 212 performing time convolution processing on the joint point sequence data to obtain a joint point time series includes: performing iterative update processing on the joint point sequence data in the T dimension to obtain the iteratively updated joint point time series: [N, C, Tx, V, M]; wherein the formula of the iterative update processing is:

F(x_t) = Σ_{k=1}^{K} f_k · x_{t+k}

wherein x is the time series, f_K is the time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
In some embodiments, the data feature dimension reduction module 214 performing data dimension reduction processing on the joint point time series to obtain a low-dimensional joint point sequence of the set dimension includes: fusing the joint point feature C in the joint point time series with the number M of recognized people, and retaining the sequence information of the other data, to obtain the low-dimensional joint point sequence of the set dimension: [N, CM, Tx, V].
In some embodiments, the Einstein convention conversion module 215 performing the first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector, and performing the second dimension transformation processing on the graph convolution kernel to convert it into a graph convolution feature vector of the set dimension, includes: extracting the number k of convolution kernels from the graph convolution kernel, and performing permute and reshape processing in a first set order on k and the low-dimensional joint point sequence, to convert the low-dimensional joint point sequence into the joint point sequence feature vector: [1, Vk, N, CMTx]; and performing permute and reshape processing in a second set order on the graph convolution kernel, followed by dimension adjustment processing, to convert it into the graph convolution feature vector of the set dimension: [w, vk, 1, 1].
In some embodiments, the step of performing human behavior recognition based on the spatio-temporal sequence feature vectors by the model quantization module 216 comprises: and converting the space-time sequence feature vector into a chip supportable network model format to perform human behavior identification.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an embodiment of an intelligent terminal according to the present application. In this embodiment, the intelligent terminal 31 includes a memory 311 and a processor 312 coupled to each other, where the memory 311 stores program data, and the processor 312 is configured to execute the program data stored in the memory 311 to implement the steps of any one of the embodiments of the human behavior identification method described above.
In a specific implementation scenario, the intelligent terminal 31 may include but is not limited to: microcomputer, server, cell-phone, panel computer, intelligent wrist-watch.
In particular, the processor 312 is configured to control itself and the memory 311 to implement the steps of any one of the above embodiments of the human behavior recognition method. The processor 312 may also be referred to as a CPU (Central Processing Unit). The processor 312 may be an integrated circuit chip having signal processing capabilities. The processor 312 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. Additionally, the processor 312 may also be jointly implemented by a plurality of integrated circuit chips.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application. In the present embodiment, the computer readable storage medium 41 stores program data 411, and the program data 411 can be executed to implement the steps of any one of the embodiments of the human behavior recognition method described above.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (10)

1. A human behavior recognition method is characterized by comprising the following steps:
acquiring joint point sequence data and a graph convolution kernel;
performing time convolution processing on the joint point sequence data to obtain a joint point time sequence;
performing data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with set dimensions;
performing first-dimension transformation processing on the low-dimension joint point sequence to convert the low-dimension joint point sequence into joint point sequence feature vectors;
performing second dimension transformation processing on the graph convolution kernel to convert the graph convolution kernel into a graph convolution feature vector with the set dimension;
carrying out functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector;
and carrying out human behavior identification based on the space-time sequence feature vector.
2. The human behavior recognition method according to claim 1, wherein the step of acquiring the joint point sequence data and the graph convolution kernel includes:
acquiring a video image; the video image at least comprises a human body to be detected;
carrying out human behavior detection on the video image to obtain the joint point sequence data;
acquiring a preset weight matrix and an initial graph convolution kernel;
and adding the preset weight matrix and the initial graph convolution kernel to obtain the graph convolution kernel.
3. The human behavior recognition method according to claim 1, wherein the joint point sequence data includes the number N of video images, the joint point feature C, the length T of the joint point sequence, the number V of joint points, and the number M of recognized people, and the step of performing time convolution processing on the joint point sequence data to obtain a joint point time series includes:
performing iterative update processing on the joint sequence data on a T dimension to obtain the joint time sequence after iterative update: [ N, C, Tx, V, M ];
wherein, the formula of the iterative update processing is:

F(x_t) = Σ_{k=1}^{K} f_k · x_{t+k}

wherein x is the time series, f_K is the time convolution kernel, K is the number of convolution kernels, and F(x_t) is equivalent to Tx.
4. The human behavior recognition method according to claim 3, wherein the step of performing data dimension reduction processing on the joint point time series to obtain a low-dimensional joint point series with a set dimension comprises:
fusing the joint point feature C in the joint point time series with the number M of recognized people, and retaining the sequence information of the other data, to obtain the low-dimensional joint point sequence of the set dimension: [N, CM, Tx, V].
5. The human behavior recognition method according to claim 4, wherein the step of performing a first-dimension transformation process on the low-dimension joint point sequence to convert it into a joint point sequence feature vector comprises:
extracting the number k of convolution kernels from the graph convolution kernel, and performing permute and reshape processing in a first set order on k and the low-dimensional joint point sequence, to convert the low-dimensional joint point sequence into the joint point sequence feature vector: [1, Vk, N, CMTx].
6. The human behavior recognition method according to claim 5, wherein the graph convolution kernel includes a number k of convolution kernels and a number v and w of joint points, and the step of performing the second dimension transformation processing on the graph convolution kernel to convert the graph convolution kernel into the graph convolution feature vector of the set dimension includes:
performing permute and reshape processing in a second set order on the graph convolution kernel, and then performing dimension adjustment processing, to convert the graph convolution kernel into the graph convolution feature vector of the set dimension: [w, vk, 1, 1].
7. The human behavior recognition method according to claim 1, wherein the step of performing human behavior recognition based on the spatiotemporal sequence feature vectors comprises:
and converting the space-time sequence feature vector into a chip supportable network model format to perform human behavior identification.
8. A human behavior recognition apparatus characterized by comprising:
the data acquisition module is used for acquiring joint point sequence data;
the time-space diagram convolution module is used for carrying out time convolution processing on the joint point sequence data to obtain a joint point time sequence;
a trainable parameter importing module for obtaining a graph convolution kernel;
the data characteristic dimension reduction module is used for carrying out data dimension reduction processing on the joint point time sequence to obtain a low-dimensional joint point sequence with set dimensions;
the Einstein convention conversion module is used for performing first dimension transformation processing on the low-dimensional joint point sequence to convert it into a joint point sequence feature vector, and performing second dimension transformation processing on the graph convolution kernel to convert the graph convolution kernel into a graph convolution feature vector of the set dimension; the space-time graph convolution module is further used for performing functional convolution processing on the joint point sequence feature vector and the graph convolution feature vector to obtain a space-time sequence feature vector;
and the model quantization module is used for identifying human body behaviors according to the space-time sequence feature vector.
9. An intelligent terminal, characterized in that the intelligent terminal comprises a memory and a processor coupled to each other;
the memory stores program data;
the processor is used for executing the program data to implement the human behavior recognition method according to any one of claims 1 to 7.
10. A computer-readable storage medium characterized in that the computer-readable storage medium stores program data executable to implement the human behavior recognition method according to any one of claims 1 to 7.
CN202110654055.4A 2021-06-11 Human behavior recognition method and related device Active CN113468980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110654055.4A CN113468980B (en) 2021-06-11 Human behavior recognition method and related device

Publications (2)

Publication Number Publication Date
CN113468980A true CN113468980A (en) 2021-10-01
CN113468980B CN113468980B (en) 2024-05-31

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video
CN111368941A (en) * 2020-04-10 2020-07-03 浙江大华技术股份有限公司 Image processing method and device and computer storage medium
CN111582095A (en) * 2020-04-27 2020-08-25 西安交通大学 Light-weight rapid detection method for abnormal behaviors of pedestrians
WO2020248581A1 (en) * 2019-06-11 2020-12-17 中国科学院自动化研究所 Graph data identification method and apparatus, computer device, and storage medium
CN112380955A (en) * 2020-11-10 2021-02-19 浙江大华技术股份有限公司 Action recognition method and device
CN112395945A (en) * 2020-10-19 2021-02-23 北京理工大学 Graph volume behavior identification method and device based on skeletal joint points
GB202101800D0 (en) * 2020-02-28 2021-03-24 Fujitsu Ltd Behavior recognition method, behavior recognition device, and computer-readable recording medium
CN112800903A (en) * 2021-01-19 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN112906604A (en) * 2021-03-03 2021-06-04 安徽省科亿信息科技有限公司 Behavior identification method, device and system based on skeleton and RGB frame fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUAN Qi; LI Fuxian: "Human behavior recognition method based on multi-duration feature fusion", Journal of Zhejiang University of Technology, no. 04, 21 July 2020 (2020-07-21) *
WANG Fenhua; HUANG Chao; ZHAO Bo; ZHANG Qiang: "Gesture recognition based on the YOLO algorithm", Transactions of Beijing Institute of Technology, no. 08, 15 August 2020 (2020-08-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant