CN115457660A - Behavior identification method based on time-space relationship and electronic equipment

Behavior identification method based on time-space relationship and electronic equipment

Info

Publication number
CN115457660A
CN115457660A (application CN202211155011.8A)
Authority
CN
China
Prior art keywords
data
feature
network
video data
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211155011.8A
Other languages
Chinese (zh)
Inventor
苏航
周凡
刘海亮
陈小燕
汤武惊
张怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Shenzhen Research Institute of Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Shenzhen Research Institute of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Shenzhen Research Institute of Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202211155011.8A priority Critical patent/CN115457660A/en
Publication of CN115457660A publication Critical patent/CN115457660A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of equipment management, and provides a behavior identification method based on a space-time relationship and an electronic device. The method comprises the following steps: receiving target video data to be identified; importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data; importing the inter-frame action feature data into a feature extraction network, and outputting sparse feature data corresponding to the target video data, wherein the feature extraction network is generated by performing sparsity constraint processing on each convolution kernel in a pooling fusion network through selection weights; importing the target video data into a contextual attention network, and determining gait behavior data of a target object in the target video data; and obtaining the behavior category of the target object according to the gait behavior data and the sparse feature data. By adopting the method, the computation cost of processing the video data for behavior recognition can be greatly reduced, and the operation efficiency is further improved.

Description

Behavior identification method based on time-space relationship and electronic equipment
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a behavior identification method based on a spatiotemporal relationship and an electronic device.
Background
With the continuous development of artificial intelligence technology, a computer can assist a user in executing various types of recognition operations so as to improve processing efficiency. For example, when a user analyzes video data, the behavior type of a target person in the video data can be determined through an artificial intelligence algorithm, which facilitates the user's analysis of the target person. When the user performs behavior tracking on the target person or monitors dangerous actions in a key area, artificial intelligence behavior recognition can greatly reduce the user's workload and improve analysis efficiency.
The existing behavior recognition technology often uses optical flow information to determine the temporal and spatial information of a target object in a video so as to determine the behavior type of the target object. However, extracting optical flow frame by frame to determine the optical flow information of the whole video data requires building an extraction network with a large structure; the device therefore needs high computing capability and must store a large neural network, which greatly increases the computing cost of the computing device and reduces computing efficiency.
Disclosure of Invention
The embodiment of the application provides a behavior recognition method, a behavior recognition apparatus, an electronic device and a storage medium based on a spatiotemporal relationship, which can solve the following problems in the existing behavior recognition technology: optical flow information is often used to determine the temporal and spatial information of a target object in a video so as to determine the behavior type of the target object, but extracting optical flow frame by frame to determine the optical flow information of the whole video data requires building an extraction network with a large structure, and the device needs high computing capability and must store a large neural network, which greatly increases the computing cost of the computing device and reduces computing efficiency.
In a first aspect, an embodiment of the present application provides a behavior identification method based on a spatiotemporal relationship, including:
receiving target video data to be identified;
importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
importing the inter-frame action feature data into a feature extraction network, and outputting sparse feature data corresponding to the target video data; the feature extraction network is generated by performing sparsity constraint processing on each convolution kernel in the pooling fusion network through selecting weights;
importing the target video data into a context attention network, and determining gait behavior data of a target object in the target video data; the contextual attention network is used for extracting a mutual position relation between the target object and an environmental object in the target video data;
and obtaining the behavior category of the target object according to the gait behavior data and the sparse characteristic data.
In a possible implementation manner of the first aspect, before the importing the inter-frame motion feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data, the method further includes:
configuring the selection weight with the numerical value of 0 for at least one convolution kernel to be identified in the pooling fusion network to obtain a network to be corrected;
inputting a plurality of preset training characteristic data to the network to be corrected to generate a first training result, and inputting a plurality of training characteristic data to the pooling fusion network to generate a second training result;
determining a loss value of the network to be corrected according to the first training result and the second training result;
if the loss value is less than or equal to a preset loss threshold, identifying the convolution kernel to be identified configured with the selection weight as a redundant convolution kernel;
if the loss value is larger than a preset loss threshold value, identifying the convolution kernel to be identified configured with the selection weight as a necessary convolution kernel;
returning to the operation of configuring the selection weight with the value of 0 for at least one convolution kernel to be identified in the pooling fusion network to obtain the network to be corrected, until all the convolution kernels to be identified in the pooling fusion network are classified;
generating the feature extraction network based on all of the necessary convolution kernels.
In one possible implementation manner of the first aspect, the training feature data is associated with a reference action tag; the feature extraction network generated based on the training feature data is associated with the reference action tag;
before the importing the inter-frame action feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data, the method further includes:
determining a plurality of candidate action tags based on the inter-frame action feature data;
respectively calculating the matching degree of each candidate extraction network according to the candidate action tags and the reference action tags corresponding to each candidate extraction network;
and selecting the candidate extraction network with the highest matching degree as the feature extraction network.
In a possible implementation manner of the first aspect, the importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data includes:
determining an image tensor of any two consecutive video image frames in the target video data;
determining a plurality of feature point coordinates according to the key positions of the target object in the video image frame; the characteristic point coordinates are determined according to the gait behaviors of the target object;
determining tensor expressions of the coordinates of all feature points in the image tensor, and generating feature vectors of the target object in the video image frame based on the tensor expressions of all the feature points;
constructing a displacement correlation matrix according to the characteristic vectors of any two continuous video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and each coordinate point in the other video image frame;
determining the maximum displacement distance of each characteristic point coordinate between the two continuous video image frames according to the displacement correlation matrix, and determining the displacement matrix of the target object based on all the maximum displacement distances;
importing the displacement matrix into a preset feature transformation model to generate action feature subdata of any two continuous video image frames;
and obtaining the inter-frame action characteristic data based on the action characteristic subdata of all the video image frames.
In a possible implementation manner of the first aspect, before the receiving target video data to be identified, the method further includes:
acquiring sample video data for training a behavior recognition module; the behavior recognition module comprises the interframe action extraction network, the feature extraction network and the contextual attention network;
generating positive sample data and negative sample data according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by carrying out interference processing on a frame sequence of a sample video frame in the sample video data;
generating first spatial information and first optical flow information from the positive sample data, and generating second spatial information and second optical flow information from the negative sample data;
obtaining space enhancement information according to the first space information and the second space information;
obtaining optical flow enhancement information according to the second optical flow information and the first optical flow information;
importing the spatial enhancement information and the optical flow enhancement information into the behavior recognition module to obtain a training recognition result of the sample video data;
and pre-training the position learning parameters in the initial identification module based on the training results of all the sample video data to obtain the behavior identification module.
In a possible implementation manner of the first aspect, the generating positive sample data and negative sample data according to the sample video data includes:
dividing the sample video data into a plurality of video segments according to a preset action time duration; the duration of each video segment is not greater than the action time duration;
respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset disorder processing algorithm;
and packaging each sample video frame based on the updated frame serial number to obtain the negative sample data.
In a possible implementation manner of the first aspect, the importing the target video data into a contextual attention network, and determining gait behavior data of a target object in the target video data further includes:
determining a target object and at least one environmental object within respective video image frames of the target video data;
determining a first context feature based on first position coordinates of each key feature point of the target object in all the video image frames; the key feature points are human key points related to the gait of the target object;
determining a second context feature based on a relative positional relationship between the target object and the environmental object in each of the video frames;
and importing the first context feature and the second context feature into the context attention network to generate the gait behavior data.
In a second aspect, an embodiment of the present application provides a behavior recognition apparatus based on a spatiotemporal relationship, including:
the target video data receiving unit is used for receiving target video data to be identified;
the inter-frame action characteristic data extraction unit is used for importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
the sparse feature data unit is used for importing the inter-frame action feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data; the feature extraction network is generated after sparsity constraint processing is carried out on each convolution kernel in the pooling fusion network through weight selection;
the gait behavior data identification unit is used for importing the target video data into a context attention network and determining the gait behavior data of a target object in the target video data; the contextual attention network is used for extracting a mutual position relation between the target object and an environmental object in the target video data;
and the behavior identification unit is used for obtaining the behavior type of the target object according to the gait behavior data and the sparse feature data.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method according to any one of the above first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method according to any one of the above first aspects.
In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a server, causes the server to perform the method of any one of the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages: after target video data requiring behavior recognition is received, the target video data is imported into an inter-frame action extraction network, action feature information between each pair of video image frames is extracted, and action feature data is generated based on the action feature information between all the video image frames. The action feature data is then imported into a feature extraction network for feature extraction to obtain the corresponding sparse feature data. The feature extraction network is obtained by performing sparsity constraint processing on the convolution kernels in a pooling fusion network through selection weights, so that the feature extraction network as a whole can discard unnecessary convolution kernels, which reduces the network size and the amount of computation and lowers the resource occupancy of the network. Meanwhile, in order to take into account the relationship between action behaviors in the global dimension, a contextual attention network is introduced to determine the gait behavior data of the target object over the whole target video data. Finally, the behavior category of the target object in the target video data is determined from the two types of extracted data, achieving the purpose of automatically recognizing the behavior category. Compared with the existing behavior recognition technology, the application does not need to calculate the optical flow information of the whole video data; instead, the action feature information between video frames is determined through a plug-and-play inter-frame action extraction network, which greatly reduces the computation cost and the amount of computation of the computing device. In addition, the sparsity constraint processing on the pooling fusion network reduces the network size and resource occupation, thereby further improving recognition efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating an implementation of a spatiotemporal relationship-based behavior recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an inter-frame action extraction network according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a pooled fusion network according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a contextual attention network provided in an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an implementation of a spatiotemporal relationship-based behavior recognition method provided by a second embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an implementation of a spatiotemporal relationship-based behavior recognition method provided by a third embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an implementation of S102 of a spatiotemporal relationship-based behavior recognition method provided by an embodiment of the present application;
FIG. 8 is a diagram illustrating an implementation manner of a spatiotemporal relationship-based behavior recognition method according to yet another embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an implementation manner of a spatiotemporal relationship-based behavior recognition method S104 according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a spatiotemporal relationship-based behavior recognition apparatus provided by an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The behavior identification method based on the spatiotemporal relationship can be applied to electronic equipment capable of identifying behaviors of video data, such as smart phones, servers, tablet computers, notebook computers, ultra-mobile personal computers (UMPCs), netbooks and the like. The embodiment of the present application does not set any limit to the specific type of the electronic device.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an implementation of a spatiotemporal relationship-based behavior recognition method according to an embodiment of the present application, where the method includes the following steps:
in S101, target video data to be recognized is received.
In this embodiment, the electronic device may be configured with a video database containing a plurality of video data. When behavior identification needs to be carried out on certain video data in the video database, the electronic device can identify the video data as target video data and carry out subsequent processing. Video data whose behavior category has already been recognized carries a behavior flag containing that category, whereas the behavior flag of video data on which behavior category recognition has not been performed is empty. In this case, the electronic device may check whether the behavior flag is empty and recognize the video data whose behavior flag is empty as the target video data.
In one possible implementation, the electronic device may be a video server. When a user needs to identify the behavior in a certain video, a corresponding client program can be installed on a local user terminal, the target video data to be identified is imported into the client program, and an identification request is initiated. After receiving the identification request, the user terminal can establish a communication connection with the video server through the client program and send the target video data to the video server, and the video server carries out the behavior identification.
In a possible implementation manner, in order to improve the efficiency of behavior recognition, the electronic device may be provided with a corresponding video duration threshold, if the video duration of the original video data is greater than the video duration threshold, the original video data may be divided into more than two video segments, the video duration of each video segment is not greater than the video duration threshold, the divided video segments are recognized as target video data, and a subsequent behavior recognition operation is performed.
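For illustration only, the segment splitting described above could be sketched as follows; the threshold value, function name and time units are assumptions rather than details given in the application:

```python
# Minimal sketch (not from the patent text): splitting original video data into
# segments whose duration does not exceed an assumed video duration threshold.
from typing import List, Tuple

def split_into_segments(video_duration_s: float,
                        duration_threshold_s: float = 10.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of segments, each no longer than the threshold."""
    segments = []
    start = 0.0
    while start < video_duration_s:
        end = min(start + duration_threshold_s, video_duration_s)
        segments.append((start, end))
        start = end
    return segments

# Example: a 25 s video with a 10 s threshold -> [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```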
In S102, importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame motion characteristic data is used for determining motion characteristic information between adjacent video image frames in the target video data.
In this embodiment, in order to reduce the computational pressure of behavior recognition, an inter-frame action extraction network is configured in the action behavior recognition module of the electronic device. The inter-frame action extraction network is specifically configured to determine the action feature information between any two adjacent video image frames; that is, the recognition focus of the inter-frame action extraction network is not the global behavior of the user but the action change between every two frames. The action changes between all frames are then combed through, so that the complete behavior action of the whole video can be obtained, which facilitates subsequent behavior recognition. Compared with extracting global optical flow information, the inter-frame action extraction network provided by the embodiment of the application has a plug-and-play characteristic: the data input to the inter-frame action extraction network each time is only the data of two video image frames, and the whole target video data does not need to be imported into the recognition network to extract optical flow information, which reduces the occupancy of the cache space and lowers the requirement on the computing capability of the computer.
In a possible implementation manner, the action feature information between video image frames may be determined as follows: an object region of the target object is identified through the inter-frame action extraction network; the region deviation between the two object regions is then identified, and the action feature information of the target object is determined according to the direction, position and size of the deviation region; the sequence number of each piece of action feature information is then determined according to the frame number of each video image frame, and all the action feature information is packaged according to these sequence numbers to generate the action feature data.
Exemplarily, fig. 2 shows a schematic structural diagram of an inter-frame action extraction network provided in an embodiment of the present application. Referring to fig. 2, the input data of the inter-frame action extraction network is two video image frames, namely an image t and an image t +1, the two video image frames are two video image frames with adjacent frame numbers, the electronic device can perform vector conversion on the two video image frames through a vector conversion module, then perform dimension reduction processing through a pooling layer, determine displacement information between vector identifiers corresponding to the two video image frames through an activation layer and a displacement calculation module, and then determine action information between the two video image frames through an action identification unit. Specifically, the motion recognition unit may be configured by a plurality of convolution layers, and as shown in the drawing, the motion recognition unit may include a first convolution layer configured by a convolution kernel of 1 × 7, a second convolution layer configured by a convolution kernel of 1 × 3, a third convolution layer configured by a convolution kernel of 1 × 3, and a fourth convolution layer configured by a convolution kernel of 1 × 3.
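For reference, a minimal sketch of the action recognition unit shown in fig. 2 is given below; the kernel sizes (one 1 × 7 layer followed by three 1 × 3 layers) follow the text, while the channel count, the activations between layers and the input shape are assumptions:

```python
# Hedged sketch of the action recognition unit from Fig. 2. Channel counts and the
# displacement-feature input are assumptions, not given in the patent text.
import torch
import torch.nn as nn

class ActionRecognitionUnit(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),  # first 1x7 layer
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),  # second 1x3 layer
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),  # third 1x3 layer
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),  # fourth 1x3 layer
        )

    def forward(self, displacement_features: torch.Tensor) -> torch.Tensor:
        # displacement_features: (B, C, H, W) tensor derived from frames t and t+1
        return self.layers(displacement_features)

# Usage: unit = ActionRecognitionUnit(); out = unit(torch.randn(1, 64, 56, 56))
```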
In S103, importing the inter-frame motion feature data into a feature extraction network, and outputting sparse feature data corresponding to the target video data; the feature extraction network is generated after sparsity constraint processing is carried out on each convolution kernel in the pooling fusion network through weight selection.
In this embodiment, before performing the sparsity constraint processing, a corresponding pooling fusion network may be configured in the electronic device. Because each piece of action feature information produced by the inter-frame action extraction module is discrete, feature extraction is further required to determine continuous actions for subsequent action recognition. On this basis, the electronic device can import the inter-frame action feature data into the pooling fusion network, perform pooling dimension reduction and feature fusion, and output the corresponding fusion feature data. The fusion feature data can be expressed as:
maxpool = MaxPool([action_1, action_2, ..., action_N]^T)
where maxpool is the fusion feature data; action_i is the inter-frame action information corresponding to the i-th video image frame; N is the total number of frames in the target video data; and T denotes the feature transpose.
Further, as another embodiment of the present application, the pooling fusion network is specifically a homologous bilinear pooling network. The homologous bilinear pooling network computes the outer product of features at different spatial positions to generate a symmetric matrix, and then performs average pooling on the matrix to obtain bilinear features, which provide a stronger feature representation than a linear model and can be optimized in an end-to-end manner. Traditional Global Average Pooling (GAP) only captures first-order statistical information and misses finer-grained features that are useful for behavior recognition. To address this problem, the bilinear pooling method used in fine-grained classification is borrowed and fused with the GAP method, so that more detailed features can be extracted for behaviors with high similarity and a better recognition result is obtained.
Illustratively, fig. 3 shows a schematic structural diagram of a pooling fusion network provided in an embodiment of the present application. Referring to fig. 3, the pooling fusion network includes bilinear pooling and first-order pooling. A bilinear pooling module is inserted on the features extracted from the convolutional layer, before global average pooling, to capture the second-order statistics of the spatial feature map and obtain a second-order classification output, which is added to the first-order feature vector obtained by global average pooling to obtain the classification output vector. By combining the first-order and second-order vectors, large context clues and fine-grained information of behaviors can be captured, enriching the classification layer of the existing behavior recognition network. Meanwhile, the original GAP branch is crucial for back propagation during end-to-end training and can reduce the training difficulty of the bilinear pooling module.
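A minimal sketch of such a pooling fusion head, combining a first-order GAP branch with a homologous bilinear (second-order) branch, is given below; the projection sizes and the way the two classification outputs are added are assumptions for illustration:

```python
# Hedged sketch: GAP (first-order) plus homologous bilinear pooling (second-order).
import torch
import torch.nn as nn

class PoolingFusionHead(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.first_order_fc = nn.Linear(channels, num_classes)              # GAP branch
        self.second_order_fc = nn.Linear(channels * channels, num_classes)  # bilinear branch

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) spatial feature map from the convolutional layers
        b, c, h, w = feat.shape
        x = feat.view(b, c, h * w)                              # (B, C, HW)
        # Homologous bilinear pooling: average outer product over spatial positions
        bilinear = torch.bmm(x, x.transpose(1, 2)) / (h * w)    # (B, C, C), symmetric
        second_order = self.second_order_fc(bilinear.view(b, -1))
        first_order = self.first_order_fc(x.mean(dim=2))        # global average pooling
        return first_order + second_order                       # fused classification output

# Usage: head = PoolingFusionHead(channels=256, num_classes=10)
#        logits = head(torch.randn(2, 256, 7, 7))
```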
In order to process the inter-frame action feature data, each convolutional layer and each pooling layer can be configured with one or more convolution kernels to extract features of different action dimensions. Some of these convolution kernels may be redundant for the feature extraction of certain action dimensions, that is, they contribute nothing to the result, yet they greatly increase the data volume of the whole network.
In one possible implementation manner, the sparsity constraint processing may be as follows: the electronic device stores a plurality of redundant network templates, each containing at least one selection weight with a weight value of 0; a redundant network template is superimposed on the pooling fusion network, and the loss value between the verification data of the superimposed pooling fusion network and the reference data of the pooling fusion network before superposition is identified; if the loss value is smaller than a preset loss threshold, the pooling fusion network superimposed with the redundant network template is identified as the feature extraction network.
In S104, importing the target video data into a context attention network, and determining gait behavior data of a target object in the target video data; the contextual attention network is used to extract a mutual positional relationship between the target object and an environmental object in the target video data.
In this embodiment, since the inter-frame action extraction network mainly focuses on local action changes, the electronic device introduces a contextual attention network that can recognize global action changes, so as to ensure recognition accuracy. The contextual attention network specifically determines the change of the mutual position relationship between the target object and the environmental object, and thereby determines the global action change. Therefore, in the contextual attention network, the target object and the environmental object are labeled in each video image frame of the target video data, the position change vector between the target object and the environmental object in each video image frame is identified, and feature extraction and contextual attention recognition are performed according to the position change vectors between the video image frames, so as to obtain the gait behavior data.
Illustratively, fig. 4 shows a schematic structural diagram of a contextual attention network provided by an embodiment of the present application. As shown in fig. 4, the context attention network can perform feature extraction on target video data, and perform object detection, key node detection and human body detection, wherein the object detection is specifically used for determining an environmental object, the human body detection is specifically used for identifying a target object, the key node detection is specifically used for determining gait changes of a human body, and finally, context attention is performed through a graph neural network convolution layer, so that corresponding gait behavior data is output.
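The following is a hedged sketch of the contextual attention idea around fig. 4: nodes standing for human key points and environmental objects are assumed to be given as coordinates, and a single position-based attention layer stands in for the full graph-convolution network:

```python
# Hedged sketch of a contextual attention layer. The node inputs (keypoint and object
# coordinates), hidden size and distance-based attention weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttentionLayer(nn.Module):
    def __init__(self, in_dim: int = 2, hidden_dim: int = 32):
        super().__init__()
        self.node_proj = nn.Linear(in_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, node_coords: torch.Tensor) -> torch.Tensor:
        # node_coords: (N, 2) positions of human key points and environmental objects
        h = self.node_proj(node_coords)                  # (N, D) node features
        # Attention from pairwise distances: closer nodes attend to each other more
        dist = torch.cdist(node_coords, node_coords)     # (N, N) pairwise distances
        attn = F.softmax(-dist, dim=-1)                  # row-normalized attention weights
        context = attn @ h                               # aggregate neighbor features
        return self.out_proj(context)                    # per-node gait context feature
```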
In S105, a behavior category of the target object is obtained according to the gait behavior data and the sparse feature data.
In this embodiment, after obtaining the gait behavior data and the sparse feature data, the electronic device may import them into a fully connected layer, determine the confidence between the fused features and each candidate behavior category, and select the candidate behavior category with the highest confidence as the behavior category of the target object, so as to achieve the purpose of identifying the behavior of the target object.
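A minimal sketch of this classification step is shown below; the feature dimensions and the use of a single fully connected layer with softmax confidences are assumptions consistent with the description:

```python
# Hedged sketch: fuse gait behavior data with sparse feature data, score candidate
# behavior categories through a fully connected layer, and pick the most confident one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviorClassifier(nn.Module):
    def __init__(self, gait_dim: int, sparse_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(gait_dim + sparse_dim, num_classes)

    def forward(self, gait: torch.Tensor, sparse: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([gait, sparse], dim=-1)          # concatenate the two feature types
        confidences = F.softmax(self.fc(fused), dim=-1)    # confidence per candidate category
        return confidences.argmax(dim=-1)                  # index of the behavior category
```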
In a possible implementation manner, when the video length of the target video data is long, the target object may exhibit multiple types of behavior actions over the whole video. In this case, the electronic device may output a behavior sequence according to the order in which each behavior occurs, where the behavior sequence contains multiple elements and each element corresponds to one behavior category.
As can be seen from the above, in the behavior identification method based on the spatiotemporal relationship provided in the embodiment of the present application, after target video data that needs to be subjected to behavior identification is received, the target video data is imported into an inter-frame action extraction network, action feature information between each pair of video image frames is extracted, and action feature data is generated based on the action feature information between all the video image frames; the action feature data is then imported into a feature extraction network for feature extraction to obtain the corresponding sparse feature data. Compared with the existing behavior recognition technology, the application does not need to calculate the optical flow information of the whole video data; instead, the action feature information between video frames is determined through a plug-and-play inter-frame action extraction network, which greatly reduces the computation cost and the amount of computation of the computing device. In addition, the sparsity constraint processing on the pooling fusion network reduces the network size and resource occupation, thereby further improving recognition efficiency.
Fig. 5 is a flowchart illustrating a specific implementation of a spatiotemporal relationship-based behavior recognition method according to a second embodiment of the present application. Referring to fig. 5, with respect to the embodiment shown in fig. 1, before the importing the inter-frame action feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data, the behavior identification method based on the spatiotemporal relationship further includes steps S501 to S507, which are detailed as follows:
in S501, the selection weight with a value of 0 is configured for at least one convolution kernel to be identified in the pooled fusion network, so as to obtain a network to be corrected.
In this embodiment, the electronic device may store a pooled fusion network, where the pooled fusion network may be obtained by the cloud server, or may be obtained by training a native network based on preset training data, and as stated in S103, the pooled fusion network is specifically configured to perform feature extraction to determine a continuous action, so as to perform action recognition subsequently.
In this embodiment, in order to identify the redundant convolution kernels included in the pooled fusion network, the electronic device may adjust the contribution of each convolution kernel to the output by the selection weight, and therefore, a selection weight with a weight value of 0 may be set and configured to any convolution kernel to be identified, so as to shield the influence of the convolution kernel to be identified on the subsequent data output.
In this embodiment, the number of convolution kernels to be identified, each of which is configured and selected with a weight of 0, may be 1, or may be multiple, and may be specifically set according to an actual situation. The mode of selecting the convolution kernel to be identified may be random selection, or may be selected in sequence based on a preset rule, which is not limited herein.
In S502, a plurality of preset training feature data are input to the to-be-corrected network to generate a first training result, and a plurality of training feature data are input to the pooling fusion network to generate a second training result.
In this embodiment, the electronic device may store a plurality of training feature data, and the process of generating the training feature data is the same as the process of generating the inter-frame motion feature data, and is obtained by importing the training video into the inter-frame motion extraction network and outputting the training video. The electronic device may import training feature data into the adjusted to-be-corrected network to obtain a first training result, and at the same time, in order to determine an influence of the selection weight of 0 on the output, the electronic device may import the same training feature data into the pooled fusion network to obtain a second training result.
In S503, determining a loss value of the network to be corrected according to the first training result and the second training result.
In this embodiment, since the first training result and the second training result are generated according to the same training feature data, if the influence of the selection weight of the convolution kernel to be identified with 0 on the result is small, the similarity between the two training results is high (i.e., the loss value is small); on the contrary, if the convolution kernel to be identified with the weight of 0 is selected to have a larger influence on the result, the similarity between the two training results is lower (i.e. the loss value is larger). Therefore, the electronic device may determine whether the convolution kernel to be identified is a redundant convolution kernel according to a loss value between training results corresponding to a plurality of training data.
In S504, if the loss value is less than or equal to the loss threshold, the convolution kernel to be identified configured with the selection weight is identified as a redundant convolution kernel.
In S505, if the loss value is greater than a preset loss threshold, the convolution kernel to be identified configured with the selection weight is identified as a necessary convolution kernel.
In this embodiment, if the loss values corresponding to all the training feature data are less than or equal to the loss threshold, the convolution kernel to be identified configured with the selection weight having the value of 0 may be identified as the redundant convolution kernel; on the contrary, if the loss value corresponding to any training feature data is greater than the loss threshold, the convolution kernel to be identified configured with the selection weight with the value of 0 may be used as the necessary convolution kernel, and the type of the convolution kernel to be identified is determined according to the magnitude of the loss value.
In S506, the operation of configuring the selection weight with the value of 0 for at least one convolution kernel to be identified in the pooled fusion network to obtain the network to be corrected is executed again, until all the convolution kernels to be identified in the pooled fusion network are classified.
In this embodiment, after determining the type of the convolution kernel to be identified, the electronic device may return to perform the operation of S501 until all the convolution kernels to be identified are classified, that is, the convolution kernels to be identified are identified as redundant convolution kernels or necessary convolution kernels.
In S507, the feature extraction network is generated based on all the necessary convolution kernels.
In this embodiment, all redundant convolution kernels in the pooled fusion network are removed, and all remaining necessary convolution kernels are used to generate a corresponding feature extraction network, so that the network volume of the feature extraction network can be reduced.
In the embodiment of the application, the convolution kernel with the selection weight of 0 is configured to shield the influence of partial convolution kernels on the output result, so that the convolution kernel with larger influence on the output can be determined, the volume of the whole convolution kernel is simplified, and the operation amount of behavior identification is reduced.
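A hedged sketch of the redundant-kernel identification loop in S501 to S507 is given below; the way a single convolution kernel is masked (zeroing one output channel of a given layer), the MSE loss and the per-kernel iteration order are illustrative assumptions, not fixed by the application:

```python
# Hedged sketch of S501-S507: zero one convolution kernel (selection weight 0), compare
# the outputs of the masked network with the unmodified pooling fusion network, and
# classify the kernel as redundant or necessary by a loss threshold.
import torch
import torch.nn as nn

def classify_kernels(pool_fusion_net: nn.Module, layer: nn.Conv2d,
                     train_batches, loss_threshold: float = 1e-3):
    """Split the kernels of `layer` into redundant and necessary ones."""
    redundant, necessary = [], []
    with torch.no_grad():
        # second training results: outputs of the unmodified pooling fusion network
        reference = [pool_fusion_net(x) for x in train_batches]
        for k in range(layer.out_channels):              # each convolution kernel to identify
            saved = layer.weight[k].clone()
            layer.weight[k].zero_()                      # selection weight 0 -> network to be corrected
            # first training results and the worst-case loss against the reference
            max_loss = max(nn.functional.mse_loss(pool_fusion_net(x), ref).item()
                           for x, ref in zip(train_batches, reference))
            layer.weight[k].copy_(saved)                 # restore before testing the next kernel
            (redundant if max_loss <= loss_threshold else necessary).append(k)
    return redundant, necessary
```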
Fig. 6 is a flowchart illustrating a specific implementation of a spatiotemporal relationship-based behavior recognition method according to a third embodiment of the present application. Referring to fig. 6, with respect to the embodiment shown in fig. 1, before the importing the inter-frame motion feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data, the behavior identification method based on spatio-temporal relationship according to this embodiment further includes: s601 to S603 are specifically detailed as follows:
further, the characteristic training data is associated with a reference action label; the feature extraction network generated based on the feature training data is associated with the reference action tag;
before the step of importing the inter-frame action feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data, the method further comprises the following steps:
in S601, a plurality of candidate action tags are determined based on the inter-frame action data.
In S602, the matching degree between each candidate extraction network is calculated based on the plurality of candidate action tags and the reference action tag corresponding to each candidate extraction network.
In S603, the candidate extraction network with the highest matching degree is selected as the feature extraction network.
In this embodiment, after sparsity constraint is performed through selection weights, the network volume of the feature extraction network is reduced, but a certain computation loss is inevitably introduced. In order to improve the accuracy of subsequent behavior identification, each piece of training feature data can be associated with one reference action label during training, and a feature extraction network generated based on the training feature data corresponding to the same reference action label can be used for identifying the action categories of that specific action label; that is, different reference action labels can correspond to different feature extraction networks, so that specialized training is realized and the identification accuracy is improved. Therefore, the database of the electronic device may store action extraction networks (which may be identified as candidate extraction networks before being selected) associated with different reference action tags.
In this embodiment, after the electronic device generates the inter-frame motion feature data, it may determine a plurality of candidate motion tags, determine a matching degree between the inter-frame motion feature data and a candidate extraction network according to tag association degrees between the plurality of candidate motion tags and a reference motion tag, and select one candidate extraction network most relevant to the inter-frame motion feature data according to the matching degree as a feature extraction network for subsequently outputting sparse feature data.
In the embodiment of the application, different feature extraction networks are configured for different action tags, so that the situation that the identification accuracy rate is reduced due to the reduction of convolution kernels can be reduced, the identification efficiency can be ensured, and the identification accuracy rate can be improved.
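For illustration, the selection of a candidate extraction network by tag matching degree could be sketched as follows; the data structures and the Jaccard-style matching degree are assumptions, since the application does not fix a specific matching measure:

```python
# Minimal sketch (assumed data structures): choose, among candidate extraction networks
# keyed by their reference action tags, the one whose tags best match the candidate
# action tags derived from the inter-frame action feature data.
from typing import Dict, List, Set

def select_feature_extraction_network(candidate_tags: List[str],
                                      networks_by_ref_tags: Dict[str, Set[str]]) -> str:
    """Return the name of the candidate network with the highest matching degree."""
    candidates = set(candidate_tags)

    def matching_degree(ref_tags: Set[str]) -> float:
        union = candidates | ref_tags
        return len(candidates & ref_tags) / len(union) if union else 0.0

    return max(networks_by_ref_tags, key=lambda name: matching_degree(networks_by_ref_tags[name]))

# Example: select_feature_extraction_network(["walk", "run"],
#     {"gait_net": {"walk", "run", "jump"}, "hand_net": {"wave", "clap"}})  -> "gait_net"
```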
Fig. 7 is a flowchart illustrating a specific implementation of S102 of the spatiotemporal relationship-based behavior recognition method according to an embodiment of the present application. Referring to fig. 7, with respect to the embodiment described in fig. 1, in the method for identifying a behavior based on a spatiotemporal relationship provided in this embodiment, S102 includes steps S1021 to S1027, which are specifically described as follows:
further, the importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data includes:
in S1021, the image tensors of any two consecutive video image frames within the target video data are determined.
In this embodiment, before extracting the action feature information between two video image frames, the electronic device needs to pre-process the video image frames and convert the video image frames expressed as images into tensors expressed as vectors. The image tensor corresponding to each video image frame is determined according to the image size of the video image frame; for example, the image tensor may be a tensor of size H × W × C, where H is determined according to the image length of the video image frame and W is determined according to the image width of the video image frame, that is, H × W represents the spatial resolution of the video image frame, and C is used to identify the spatial position where the target object is located. For example, two consecutive video image frames may be identified as F(t) and F(t+1), that is, the image tensors corresponding to the t-th and (t+1)-th video image frames.
In S1022, determining a plurality of feature point coordinates according to the key positions of the target object in the video image frame; the feature point coordinates are determined according to the gait behavior of the target object.
In this embodiment, the electronic device may mark the position where the target object is located, i.e., the above-mentioned key position, in each video image frame. In this case, the electronic device may perform sliding framing in the video image frame through the human body template, and calculate a matching degree between the human body template and the framing region, so as to identify and obtain a region where a human body is located, that is, a region where the target object is located.
In this embodiment, after determining the key position, the electronic device may identify a plurality of key points in the target object based on the key position, where each key point corresponds to one feature point coordinate. Illustratively, after each key point related to the gait behavior is marked, the coordinates of the key point in the video image frame, namely the feature point coordinates, can be determined.
In S1023, tensor expressions of the coordinates of the respective feature points are determined in the image tensor, and feature vectors of the target object in the video image frame are generated based on coordinate expressions of all the feature points.
In this embodiment, after determining the coordinates of the plurality of feature points, the electronic device may locate the element in which each feature point is located in the image tensor, so as to obtain an expression of each feature point through the tensor, that is, the tensor expression, and finally encapsulate the tensor expressions of all the feature points to obtain the feature vector of the target object related to the gait.
In S1024, constructing a displacement correlation matrix according to the feature vectors of the two consecutive video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and the coordinates of each coordinate point in another video image frame.
In this embodiment, after determining the tensor expressions corresponding to the feature point coordinates of the key points and obtaining the feature vectors formed based on the tensor expressions of all the key points, the electronic device may calculate the vector deviation between the two video image frames, so as to determine, according to the vector deviation, the displacement corresponding to each key point of the target object between the two video image frames, and thus determine to obtain the displacement correlation matrix.
In this embodiment, since a given position is unlikely to undergo a large displacement between two adjacent frames of the video, the displacement can be limited to a specific candidate area. Assuming that this area has X as its center point and contains P² feature points, a correlation score matrix between the position X and all features in the candidate area can be obtained by performing a dot product between the feature at position X and the features in the corresponding candidate area of the adjacent video image frame; the dimension of the matrix is H × W × P², that is, the displacement correlation matrix reflects the relationship between positions in adjacent frames.
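A hedged sketch of this local correlation computation is given below; the window size P, the zero padding at the borders and the use of unfold to gather the candidate windows are assumptions for illustration:

```python
# Hedged sketch: for every position in frame t, dot products with features inside a
# PxP candidate window of frame t+1 give an H x W x P^2 correlation score volume.
import torch
import torch.nn.functional as F

def displacement_correlation(feat_t: torch.Tensor, feat_t1: torch.Tensor, p: int = 7):
    # feat_t, feat_t1: (C, H, W) feature tensors of two consecutive video image frames
    c, h, w = feat_t.shape
    pad = p // 2
    # Gather the PxP neighborhood of every spatial position in frame t+1: (C*P*P, H*W)
    neigh = F.unfold(feat_t1.unsqueeze(0), kernel_size=p, padding=pad)[0]
    neigh = neigh.view(c, p * p, h * w)
    center = feat_t.view(c, 1, h * w)
    scores = (center * neigh).sum(dim=0)                 # (P*P, H*W) dot-product correlations
    return scores.view(p * p, h, w).permute(1, 2, 0)     # (H, W, P^2) correlation matrix
```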
In S1025, the maximum displacement distance of each of the feature point coordinates between the two consecutive video image frames is determined according to the displacement correlation matrix, and the displacement matrix of the target object is determined based on all the maximum displacement distances.
In this embodiment, after determining the correlation scores between the coordinates of each feature point in the key area relative to the coordinates of another video image frame, the electronic device may select a value with the largest correlation score to determine the maximum displacement distance corresponding to the coordinates of the feature point, that is, locate the coordinate point associated with the coordinates of the feature point in another video image frame.
Further, as another embodiment of the present application, the step S1025 specifically includes the following steps:
step 1: determining a displacement correlation array corresponding to each characteristic point coordinate in the displacement correlation matrix;
step 2: determining a parameter value with the maximum correlation coefficient from the displacement correlation array as the maximum displacement distance of the characteristic coordinate point;
and step 3: constructing a displacement field of the target object on a two-dimensional space according to the maximum displacement distances of all the feature coordinate points;
and 4, step 4: performing pooling dimensionality reduction on the displacement field through an activation function softmax to obtain a one-dimensional confidence tensor;
and 5: and fusing the displacement field and the one-dimensional confidence tensor to construct a displacement matrix for expressing a three-dimensional space.
In this embodiment, according to the correlation score matrix, as long as the maximum score of each feature point in a video image frame is found and mapped to the corresponding point in the other video image frame, the displacement field of the motion information can be estimated. Because the correlation score is used for determining the correlation between two coordinate points, the displacement correlation array corresponding to each feature point coordinate can be separated from the displacement correlation matrix, the parameter value with the maximum correlation coefficient is determined so as to locate the corresponding coordinate point of that feature point in the other video image frame, and the distance between the two points is taken as the maximum displacement distance, so that the displacement field of the target object in the two-dimensional space is constructed. Specifically, feature extraction, i.e., maximum pooling, may be performed on the two-dimensional displacement field by adding a softmax layer, so as to obtain a confidence map of the target object, and finally the two-dimensional displacement field and the one-dimensional confidence map are combined to form a displacement matrix with three-dimensional features.
In the embodiment of the application, the motion of the target object is determined by constructing a two-dimensional displacement field, and the confidence of each point in the displacement field is determined by pooling dimension reduction, so that the displacement can be evaluated conveniently and effectively, subsequent action identification is facilitated, and the accuracy of action identification is improved.
In S1026, the displacement matrix is imported into a preset feature transformation model, and the motion feature sub-data of any two consecutive video image frames is generated.
In this embodiment, in order to match the features of the downstream layer, the displacement tensor needs to be converted into a motion feature matrix that matches the dimension of the downstream layer. The displacement tensor is fed to four depthwise-separable layers, one 1 x 7 layer and three 1 x 3 layers, and converted into motion features with the same channel number C as the original input F(t), so that it can be input to the next layer of the network.
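The feature transformation described here might be sketched as follows; reading the layers as depthwise-separable convolutions and the choice of intermediate channel sizes, normalization and activation are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class MotionTransform(nn.Module):
    """Map a 3-channel displacement matrix to C-channel motion features.

    A sketch of the described stack: one 1x7 depthwise-separable layer followed by
    three 1x3 layers, ending at the same channel number C as the original input F(t).
    """
    def __init__(self, out_channels: int):
        super().__init__()

        def dw_sep(cin, cout, k):
            return nn.Sequential(
                nn.Conv2d(cin, cin, kernel_size=(1, k), padding=(0, k // 2), groups=cin),
                nn.Conv2d(cin, cout, kernel_size=1),   # pointwise projection
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.layers = nn.Sequential(
            dw_sep(3, out_channels, 7),
            dw_sep(out_channels, out_channels, 3),
            dw_sep(out_channels, out_channels, 3),
            dw_sep(out_channels, out_channels, 3),
        )

    def forward(self, disp: torch.Tensor) -> torch.Tensor:
        # disp: (N, 3, H, W) displacement matrix -> (N, C, H, W) motion features
        return self.layers(disp)
```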
In S1027, the inter-frame motion feature data is obtained based on the motion feature sub-data of all the video image frames.
In this embodiment, after determining the motion feature sub-data of each video image frame relative to the next video image frame, the electronic device may perform encapsulation according to the frame number of each video image frame, so as to obtain the inter-frame motion feature data of the entire target video data.
In the embodiment of the application, a plurality of key point coordinates related to gait are marked on the target object, a corresponding displacement matrix is constructed from the displacements of these key point coordinates, and the motion feature sub-data of the target object is determined from the key point displacements, so that the number of points to be processed is reduced, the amount of computation is reduced, and the computational efficiency is improved.
Fig. 8 is a flowchart illustrating a specific implementation of a spatiotemporal relationship-based behavior recognition method according to a third embodiment of the present invention. Referring to fig. 8, with respect to the embodiment shown in any one of fig. 1 to 7, before the receiving of the target video data to be identified, the spatio-temporal relationship-based behavior recognition method provided by this embodiment further includes S801 to S807, which are detailed as follows:
further, before the receiving the target video data to be identified, the method further includes:
in S801, sample video data for training a behavior recognition module is acquired; the behavior recognition module includes the inter-frame action extraction network, the feature extraction network, and the contextual attention network.
In this embodiment, before performing behavior recognition on target video data, the electronic device may train and learn the behavior recognition module locally, so that the accuracy of subsequent behavior recognition can be improved. The behavior recognition module specifically comprises three networks: an inter-frame action extraction network, specifically used for extracting inter-frame action data; a pooling fusion network, specifically used for performing feature extraction and feature fusion on the inter-frame action data; and a contextual attention network, specifically used for determining the relative position between the target object and the environmental objects, so that the behavior category of the target object can be determined in the global dimension. Based on this, the electronic device may acquire sample video data from a video library. It should be noted that the sample video data is specifically video data that is not labeled with a behavior type, or weakly labeled video data. Training and learning may be performed by means of contrastive learning, which reduces the time the user spends on labeling and improves both training efficiency and training accuracy.
This embodiment introduces a depth bidirectional transformer so as to better utilize position embedding and the multi-head attention mechanism to automatically select key information in the video, designs a sequence self-supervised learning method oriented to video understanding, and makes full use of massive internet big data and existing public data sets to continuously optimize and train the behavior pre-training model, so as to obtain a robust behavior pre-training model with domain universality and task-sharing capability.
In S802, positive sample data and negative sample data are generated according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by performing interference processing on a frame sequence of a sample video frame in the sample video data.
In this embodiment, after obtaining any sample video data, the electronic device may convert the sample video data into two different types of sample data: positive sample data, obtained by perturbing the background information, that is, by interfering with the spatial dimension; and negative sample data, obtained by perturbing the frame order, that is, by interfering with the temporal dimension. In this way the action is decoupled from the spatial scene, which further enhances the sensitivity of the network to the action. This way of constructing positive and negative samples forces the network to focus on global statistics in order to distinguish positive from negative samples.
The process of generating the positive sample may specifically include the following steps:
Step 1.1: mark the sample object in each sample video frame of the sample video data, and identify the regions other than the sample object as the background region.
Step 1.2: perform interpolation processing on the background region through a preset thin-plate spline to obtain spatial interference image frames.
Step 1.3: encapsulate the spatial interference image frames according to their frame numbers in the sample video data to obtain the positive sample data.
In this embodiment, the electronic device may locate the sample object in the sample video data through an object recognition algorithm (such as a face recognition algorithm or a human key point recognition algorithm), where the sample object may be a physical person. After the sample object in the sample video data is marked, the regions other than the region where the sample object is located can be identified as the background region. Because the spatial dimension needs to be perturbed, the electronic device may perform interpolation processing in the background region by means of a thin-plate spline, so as to occlude part of the background region and eliminate the spatial correlation between sample video frames, and then repackage the spatial interference image frames obtained after adding the thin-plate spline according to their frame numbers, so as to obtain the positive sample data.
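A rough per-frame sketch of steps 1.1 to 1.3, using scipy's thin-plate-spline radial basis interpolator as a stand-in for the preset thin-plate spline; the function name, the control-point count and the jitter amplitude are assumptions introduced for illustration:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def perturb_background(frame, person_mask, n_ctrl=16, jitter=8.0, rng=None):
    """Perturb the background of one frame with a thin-plate-spline warp.

    frame: (H, W, 3) image; person_mask: (H, W) bool, True on the sample object.
    The sample object region is kept; background pixels are resampled through a
    thin-plate-spline mapping fitted on randomly jittered control points.
    """
    rng = rng or np.random.default_rng()
    H, W = person_mask.shape
    # Random control points and jittered targets define the spatial interference.
    ctrl = rng.uniform([0, 0], [H - 1, W - 1], size=(n_ctrl, 2))
    target = ctrl + rng.normal(0.0, jitter, size=ctrl.shape)
    warp = RBFInterpolator(ctrl, target, kernel='thin_plate_spline')
    # Map every pixel coordinate through the warp and resample (nearest neighbour).
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    coords = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
    mapped = np.clip(np.rint(warp(coords)), [0, 0], [H - 1, W - 1]).astype(int)
    warped = frame[mapped[:, 0], mapped[:, 1]].reshape(H, W, 3)
    # Keep the sample object, replace only the background region.
    return np.where(person_mask[..., None], frame, warped)
```

Encapsulating the perturbed frames back in their original frame-number order would then yield the positive sample data.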
In the embodiment of the application, the background region is subjected to interpolation processing through the thin-plate spline so that local scene information is disrupted and a positive sample is constructed; this improves the sensitivity of subsequent recognition to the user's actions and improves training accuracy.
The process of generating the negative examples may specifically include the following steps:
Step 2.1: divide the sample video data into a plurality of video segments according to a preset action time duration, where the paragraph duration of each video segment is not greater than the action time duration.
Step 2.2: update the frame numbers of the sample video frames in each video segment according to a preset out-of-order processing algorithm.
Step 2.3: encapsulate each sample video frame based on the updated frame numbers to obtain the negative sample data.
In this embodiment, to implement interference in the temporal dimension, the electronic device may divide the sample video data into a plurality of video segments and shuffle the video image frames within each segment. Because one action has a certain duration, dividing the video into segments separates different actions, which improves the sensitivity of subsequent recognition to each action. The action time duration is determined from the average duration of an action obtained by big data analysis. The electronic device reassigns the frame number of each sample video frame in a video segment through a random algorithm and encapsulates the sample video frames with the updated frame numbers, thereby obtaining the negative sample data.
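A minimal illustrative sketch of steps 2.1 to 2.3, under the assumption that the action time duration is given in seconds and that a simple random shuffle plays the role of the out-of-order processing algorithm:

```python
import random

def make_negative_sample(frames, fps, action_seconds=2.0, seed=0):
    """Shuffle frame order inside fixed-length segments to break the optical flow.

    frames: list of video frames ordered by frame number.
    action_seconds: assumed average duration of a single action, so that each
    video segment is no longer than one action.
    """
    rng = random.Random(seed)
    seg_len = max(1, int(fps * action_seconds))
    shuffled = []
    for start in range(0, len(frames), seg_len):
        segment = frames[start:start + seg_len]
        rng.shuffle(segment)           # perturb the temporal order within the segment
        shuffled.extend(segment)
    # Re-encapsulate: the new list index plays the role of the updated frame number.
    return shuffled
```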
Usually, the negative samples adopted in contrastive learning are taken directly from other videos. However, if other videos are used as negative samples, then besides different action information they introduce many additional features that make the samples easier for the network to distinguish, so this way of selecting negative samples cannot guarantee that the network focuses on motion. For this reason, the present scheme constructs negative samples by using local temporal interference to destroy the optical flow information.
In S803, first spatial information and first optical flow information are generated from the positive sample data, and second spatial information and second optical flow information are generated from the negative sample data.
In this embodiment, the electronic device may perform data conversion on the positive sample data through an encoding algorithm to obtain the encoded data of each image frame in the positive sample data, yielding a plurality of feature maps. The learned position codes are added to the extracted feature maps, and after the position codes are fused, the depth bidirectional transformer models the temporal information of the positive sample data to obtain the first optical flow information, and models the spatial information to obtain the first spatial information. Correspondingly, the negative sample data is processed in the same way to obtain the second spatial information and the second optical flow information.
In S804, spatial enhancement information is obtained according to the first spatial information and the second spatial information.
In this embodiment, the background region of the positive sample data has been perturbed, so the first spatial information lacks spatial correlation, while the background region of the negative sample data is unperturbed, so the second spatial information retains it. Since both types of sample data are derived from the same sample video data, fusing the two pieces of spatial information improves the sensitivity of spatial information capture and yields the spatial enhancement information.
In S805, optical-flow enhancement information is obtained from the second optical-flow information and the first optical-flow information.
In this embodiment, the frame order of the positive sample data is unperturbed, so the first optical flow information retains correlation in the temporal dimension, while the frame order of the negative sample data has been perturbed in the second optical flow information. Since both types of sample data come from the same sample video data, fusing the two pieces of optical flow information improves the sensitivity of temporal information capture and yields the optical flow enhancement information.
In S806, the spatial enhancement information and the optical flow enhancement information are imported into the behavior recognition module, so as to obtain a training recognition result of the sample video data.
In S807, the position learning parameters in the initial recognition module are pre-trained based on the training results of all the sample video data, so as to obtain the behavior recognition module.
In this embodiment, behavior recognition involves two key pieces of information: spatial information and temporal information. Spatial information is static information in the scene, such as objects and context information, which is easy to capture in a single frame of the video; temporal information mainly captures the dynamic characteristics of motion and is obtained by integrating spatial information across frames. For behavior recognition, how to better capture motion information is crucial to model performance, and the global average pooling layer used at the end of existing 3D convolutional neural networks limits the richness of the temporal information. To address this problem, a depth bidirectional Transformer is used instead of global average pooling. K frames sampled from the input video are encoded by a 3D convolutional encoder. The obtained feature map is not globally average-pooled at the end of the network; instead, the feature vectors are divided into token sequences of fixed length. Then, in order to preserve position information, learned position codes are added to the extracted features. After the position codes are fused, the transformer blocks in the depth bidirectional transformer model the temporal information, the feature vectors obtained through the multi-head attention mechanism are fused with the temporal information, the vectors are then concatenated and passed through a multilayer perceptron for feature dimension transformation, and end-to-end training is completed by computing a contrastive loss. A pre-training model with good generalization performance is thereby obtained.
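The pre-training head described above might look roughly like the following sketch; the embedding dimension, depth, number of heads and the InfoNCE-style contrastive loss are assumptions, and num_tokens is assumed to equal T*H*W of the encoder output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalTransformerHead(nn.Module):
    """Replace global average pooling with a bidirectional transformer over tokens.

    The 3D-conv encoder output (N, C, T, H, W) is flattened into a token sequence,
    learned position embeddings are added, and transformer blocks model the temporal
    relations before an MLP projection used for the contrastive loss.
    """
    def __init__(self, channels, num_tokens, dim=256, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(channels, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))  # learned position codes
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feat):
        n, c, t, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (N, T*H*W, C) token sequence
        x = self.encoder(self.proj(tokens) + self.pos)  # multi-head attention fuses time info
        return self.mlp(x.mean(dim=1))             # one embedding per clip

def info_nce(anchor, positive, negatives, tau=0.07):
    """Contrastive loss: pull the positive view close, push negative views away."""
    a = F.normalize(anchor, dim=-1)
    pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True) / tau
    neg = a @ F.normalize(negatives, dim=-1).t() / tau
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(a), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, labels)
```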
In the embodiment of the application, the sensitivity of recognizing actions and spatio-temporal information can be improved by determining positive sample data and negative sample data, so that training of behavior categories can be completed without labeling, which improves the pre-training effect.
Fig. 9 is a flowchart illustrating a specific implementation of the spatio-temporal relationship-based behavior recognition method S104 according to the fourth embodiment of the present invention. Referring to fig. 9, with respect to the embodiment described in any one of fig. 1 to 7, step S104 of the spatiotemporal relationship-based behavior recognition method provided in this embodiment includes S1041 to S1044, which are detailed as follows:
in S1041, a target object and at least one environmental object within a respective video image frame of the target video data are determined.
In S1042, determining a first context feature based on first position coordinates of each key feature point of the target object in all the video image frames; the key feature points are human key points related to the gait of the target object.
In S1043, a second context feature is determined based on a relative positional relationship between the target object and the environment object in each of the video frames.
In S1044, importing the first context feature and the second context feature into the context attention network, and generating the gait behavior data.
In this embodiment, a deep convolutional neural network can extract texture and appearance features from RGB images and can directly or indirectly reuse a deep learning model pre-trained on large-scale data for other visual tasks, so that image feature expression knowledge is effectively transferred; however, such features are easily interfered with by scenes and objects. Behavior recognition based on high-level semantic human key points or other relational modeling is relatively lightweight and free from interference by scenes and objects, but it lacks texture and appearance information, cannot effectively utilize the scene and object information on which behaviors depend, and can only recognize human-centered actions. Therefore, it is necessary to integrate feature expression based on RGB images with high-level context modeling, so as to better mine the temporal relationships between spatio-temporal features and the interaction patterns between human-human and human-object, and to fully exploit both the abstraction capability of the convolutional neural network for low-level visual features and the high-level semantic relational reasoning capability of the spatio-temporal graph neural network. Specifically, an attention 3D convolutional neural network is used to extract video features of the human body region, which on one hand are used for behavior recognition based on RGB images and on the other hand serve as the input of the human key node prediction sub-network. This sub-network outputs multiple frames of human key nodes, and the key node sequences are fed into a graph convolution context neural network model for behavior recognition based on human key nodes. In addition, a target detection model detects people and objects in the picture in real time, and the feature expressions of other human bodies and targets around the target human body of interest are fed into the graph convolution context neural network model for joint optimization training. The detected target feature expressions, the surrounding related human body features, and the human key nodes are taken as context information of the behavior of the object of interest and fused into the model through the graph neural network, which alleviates the gap in mapping from low-level visual features to high-level semantic information, enhances the model's ability to model and express the relationships between human-human and human-object, and improves the learning and modeling capability of behavior recognition for diverse, complex, and common key information.
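One hedged way to fuse the three kinds of context described above is sketched below; the feature dimensions, the single-token-per-stream simplification and the class count are assumptions, and plain multi-head attention is used here as a stand-in for the graph convolution context model of the original description:

```python
import torch
import torch.nn as nn

class ContextFusionHead(nn.Module):
    """Fuse RGB clip features with human-keypoint and surrounding-object context.

    rgb_feat: features from an attention 3D CNN over the human region,
    kpt_feat: features from a model over human key-node sequences,
    obj_feat: pooled features of detected surrounding person/object boxes.
    """
    def __init__(self, rgb_dim, kpt_dim, obj_dim, hidden=256, num_classes=60):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, hidden)
        self.kpt_proj = nn.Linear(kpt_dim, hidden)
        self.obj_proj = nn.Linear(obj_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, rgb_feat, kpt_feat, obj_feat):
        # Stack the three context tokens and let attention weigh their interactions.
        tokens = torch.stack([self.rgb_proj(rgb_feat),
                              self.kpt_proj(kpt_feat),
                              self.obj_proj(obj_feat)], dim=1)  # (N, 3, hidden)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.cls(fused.mean(dim=1))  # behavior-class logits
```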
In the embodiment of the application, the identification accuracy of the action type can be improved by identifying the environment object and determining the mutual relation between the environment object and the target object.
Fig. 10 is a block diagram illustrating the structure of a spatiotemporal relationship-based behavior recognition apparatus according to an embodiment of the present invention; the apparatus includes units for performing the steps of the embodiment corresponding to fig. 1. For details, please refer to fig. 1 and the related description of the corresponding embodiment. For convenience of explanation, only the portions related to the present embodiment are shown.
Referring to fig. 10, the spatiotemporal relationship-based behavior recognition apparatus includes:
a target video data receiving unit 11 for receiving target video data to be identified;
the inter-frame action feature data extraction unit 12 is configured to import the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
a sparse feature data unit 13, configured to import the inter-frame motion feature data into a feature extraction network, and output sparse feature data corresponding to the target video data; the feature extraction network is generated by performing sparsity constraint processing on each convolution kernel in the pooling fusion network through selecting weights;
a gait behavior data identification unit 14, configured to import the target video data into a contextual attention network, and determine gait behavior data of a target object in the target video data; the contextual attention network is used for extracting the mutual position relation between the target object and the environmental object in the target video data;
and the behavior identification unit 15 is configured to obtain a behavior category of the target object according to the gait behavior data and the sparse feature data.
Optionally, the behavior recognition device further includes:
a network to be corrected generating unit, configured to configure the selection weight with a numerical value of 0 for at least one convolution kernel to be identified in the pooled fusion network, so as to obtain a network to be corrected;
the training result determining unit is used for inputting a plurality of preset training characteristic data into the network to be corrected to generate a first training result and inputting a plurality of training characteristic data into the pooling fusion network to generate a second training result;
a loss value determining unit, configured to determine a loss value of the network to be corrected according to the first training result and the second training result;
a first class identification unit, configured to identify the convolution kernel to be identified configured with the selection weight as a redundant convolution kernel if the loss value is less than or equal to the loss threshold;
the second type identification unit is used for identifying the convolution kernel to be identified configured with the selection weight as a necessary convolution kernel if the loss value is greater than a preset loss threshold value;
a loop execution unit, configured to return to and perform the operation of configuring the selection weight with a value of 0 for at least one convolution kernel to be identified in the pooling fusion network to obtain the network to be corrected, until all the convolution kernels to be identified in the pooling fusion network have been classified;
a network generation unit for generating the feature extraction network based on all the necessary convolution kernels.
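A simplified sketch of how the units above might cooperate, under assumptions: the selection weight is emulated by zeroing one convolution layer's weights at a time, and the loss between the first and second training results is taken as a mean squared error (the actual loss function and weighting scheme are not specified here):

```python
import copy
import torch
import torch.nn as nn

def prune_pooling_fusion(net: nn.Module, conv_names, train_feats, loss_threshold: float):
    """Classify convolution kernels as redundant or necessary by zeroing selection weights.

    net: the pooling fusion network; conv_names: qualified names of Conv layers to test;
    train_feats: iterable of training feature tensors. Returns the necessary layers,
    from which the feature extraction network would be generated.
    """
    necessary = []
    with torch.no_grad():
        reference = [net(x) for x in train_feats]          # second training result
        for name in conv_names:
            candidate = copy.deepcopy(net)                  # network to be corrected
            dict(candidate.named_modules())[name].weight.zero_()
            loss = sum(nn.functional.mse_loss(candidate(x), ref)
                       for x, ref in zip(train_feats, reference)) / len(reference)
            if loss > loss_threshold:
                necessary.append(name)                      # necessary convolution kernel
            # else: redundant convolution kernel, omitted from the feature extraction network
    return necessary
```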
Optionally, the feature training data is associated with a reference action tag; the feature extraction network generated based on the feature training data is associated with the reference action tag; the behavior recognizing apparatus further includes:
a candidate tag generation unit configured to determine a plurality of candidate action tags based on the inter-frame action data;
a label matching unit, configured to calculate the matching degree of each candidate extraction network according to the candidate action labels and the reference action labels corresponding to the candidate extraction networks;
and the network selecting unit is used for selecting the candidate extracting network with the highest matching degree as the feature extracting network.
Optionally, the inter-frame motion feature data extraction unit 12 includes:
the image tensor conversion unit is used for determining the image tensors of any two continuous video image frames in the target video data;
the characteristic point coordinate determination unit is used for determining a plurality of characteristic point coordinates according to the key positions of the target object in the video image frame; the characteristic point coordinates are determined according to the gait behaviors of the target object;
an eigenvector generating unit, configured to determine tensor expressions of coordinates of feature points in the image tensor, and generate eigenvectors of the target object in the video image frame based on coordinate expressions of all the feature points;
the displacement correlation matrix constructing unit is used for constructing a displacement correlation matrix according to the characteristic vectors of any two continuous video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and each coordinate point in the other video image frame;
a displacement matrix construction unit, configured to determine, according to the displacement correlation matrix, a maximum displacement distance between the two consecutive video image frames of each of the feature point coordinates, and determine a displacement matrix of the target object based on all the maximum displacement distances;
the action characteristic subdata determining unit is used for leading the displacement matrix into a preset characteristic transformation model and generating action characteristic subdata of any two continuous video image frames;
and the action characteristic subdata packaging unit is used for obtaining the interframe action characteristic data based on the action characteristic subdata of all the video image frames.
Optionally, the displacement matrix constructing unit includes:
a displacement correlation array determining unit, configured to determine a displacement correlation array corresponding to each feature point coordinate in the displacement correlation matrix;
a maximum displacement distance determination unit configured to determine a parameter value having a maximum correlation coefficient from the displacement correlation array as the maximum displacement distance of the feature coordinate point;
a displacement field determining unit, configured to construct a displacement field of the target object in a two-dimensional space according to the maximum displacement distances of all the feature coordinate points;
the displacement field pooling unit is used for performing pooling dimension reduction on the displacement field through an activation function softmax to obtain a one-dimensional confidence tensor;
and the displacement field fusion unit is used for fusing the displacement field and the one-dimensional confidence tensor to construct a displacement matrix for expressing a three-dimensional space.
Optionally, the behavior recognition device further includes:
the system comprises a sample video data acquisition unit, a behavior recognition module and a control unit, wherein the sample video data acquisition unit is used for acquiring sample video data used for training the behavior recognition module; the behavior recognition module comprises the interframe action extraction network, the feature extraction network and the contextual attention network;
the sample data conversion unit is used for generating positive sample data and negative sample data according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by carrying out interference processing on a frame sequence of a sample video frame in the sample video data;
an information extraction unit configured to generate first spatial information and first optical flow information from the positive sample data, and generate second spatial information and second optical flow information from the negative sample data;
a spatial enhancement information generating unit, configured to obtain spatial enhancement information according to the first spatial information and the second spatial information;
an optical flow enhancement information extraction unit configured to obtain optical flow enhancement information from the second optical flow information and the first optical flow information;
the training recognition result output unit is used for importing the spatial enhancement information and the optical flow enhancement information into the behavior recognition module to obtain a training recognition result of the sample video data;
and the module training unit is used for pre-training the position learning parameters in the initial identification module based on the training results of all the sample video data to obtain the behavior identification module.
Optionally, the sample data conversion unit includes:
a background region identification unit, configured to mark a sample object in each sample video frame of the sample video data, and identify a region other than the sample object as a background region;
the background area processing unit is used for carrying out interpolation processing on the background area through a preset thin plate spline to obtain a spatial interference image frame;
and the positive sample generation unit is used for packaging the frame serial numbers of the space interference image frames in the sample video data to obtain the positive sample data.
Optionally, the sample data conversion unit includes:
the video dividing unit is used for dividing the sample video data into a plurality of video segments according to preset action time duration; the paragraph duration of each video segment is not greater than the action time duration;
the disorder processing unit is used for respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset disorder processing algorithm;
and the negative sample generating unit is used for packaging each sample video frame based on the updated frame serial number to obtain the negative sample data.
Optionally, the gait behavior data identification unit 14 includes:
an environmental object identification unit for determining a target object and at least one environmental object within each video image frame of the target video data;
a first context feature generation unit, configured to determine a first context feature based on first position coordinates of each key feature point of the target object in all the video image frames; the key feature points are human key points related to the gait of the target object;
a second context feature generation unit, configured to determine a second context feature based on a relative positional relationship between the target object and the environment object in each of the video frames;
a gait behavior data determining unit, configured to import the first context feature and the second context feature into the context attention network, and generate the gait behavior data.
Therefore, the behavior recognition device based on the spatiotemporal relationship provided by the embodiment of the present invention may also, after receiving target video data that needs behavior recognition, import the target video data into the inter-frame action extraction network, extract the action feature information between the video image frames, generate action feature data based on the action feature information of all the video image frames, and then import the action feature data into the feature extraction network for feature extraction, so as to obtain the corresponding sparse feature data. Compared with the existing behavior recognition technology, the present application does not need to calculate the optical flow information of the entire video data; instead, the action feature information between the video frames is determined by a plug-and-play inter-frame action extraction network, which greatly reduces the computing cost of the computing device and reduces the amount of computation. In addition, sparsity constraint processing is performed on the pooling fusion network, which reduces the network size and resource occupation and further improves the recognition efficiency.
It should be understood that, in the structural block diagram of the behavior identification apparatus based on spatio-temporal relationship shown in fig. 10, each module is used to execute each step in the embodiments corresponding to fig. 1 to fig. 9. Each step in the embodiments corresponding to fig. 1 to fig. 9 has been explained in detail above; for specifics, refer to the relevant description in the embodiments corresponding to fig. 1 to fig. 9, which is not repeated herein.
Fig. 11 is a block diagram of an electronic device according to another embodiment of the present application. As shown in fig. 11, the electronic apparatus 1100 of this embodiment includes: a processor 1110, a memory 1120, and a computer program 1130, such as a program for the spatiotemporal relationship-based behavior recognition method, stored in the memory 1120 and executable on the processor 1110. The processor 1110, when executing the computer program 1130, implements the steps in each embodiment of the spatiotemporal relationship-based behavior recognition method described above, such as S101 to S105 shown in fig. 1. Alternatively, when the processor 1110 executes the computer program 1130, the functions of the modules in the embodiment corresponding to fig. 10, for example, the functions of the units 11 to 15 shown in fig. 10, are implemented; refer to the related description in the embodiment corresponding to fig. 10.
Illustratively, the computer program 1130 may be divided into one or more modules, which are stored in the memory 1120 and executed by the processor 1110 to accomplish the present application. One or more of the modules may be a series of computer program instruction segments that can perform particular functions, and that describe the execution of the computer program 1130 in the electronic device 1100. For example, the computer program 1130 may be divided into unit modules, and the specific functions of the modules are as described above.
The electronic device 1100 may include, but is not limited to, a processor 1110, a memory 1120. Those skilled in the art will appreciate that fig. 11 is merely an example of the electronic device 1100 and does not constitute a limitation of the electronic device 1100 and may include more or fewer components than illustrated, or combine certain components, or different components, e.g., the electronic device may also include input-output devices, network access devices, buses, etc.
The processor 1110 may be a central processing unit, but may also be other general purpose processors, digital signal processors, application specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 1120 may be an internal storage unit of the electronic device 1100, such as a hard disk or a memory of the electronic device 1100. The memory 1120 can also be an external storage device of the electronic device 1100, such as a plug-in hard disk, a smart card, a flash memory card, etc. provided on the electronic device 1100. Further, the memory 1120 may also include both internal and external storage units of the electronic device 1100.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A behavior identification method based on a spatiotemporal relationship is characterized by comprising the following steps:
receiving target video data to be identified;
importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
importing the inter-frame action feature data into a feature extraction network, and outputting sparse feature data corresponding to the target video data; the feature extraction network is generated by performing sparsity constraint processing on each convolution kernel in the pooling fusion network through selecting weights;
importing the target video data into a context attention network, and determining gait behavior data of a target object in the target video data; the contextual attention network is used for extracting the mutual position relation between the target object and the environmental object in the target video data;
and obtaining the behavior category of the target object according to the gait behavior data and the sparse characteristic data.
2. The behavior recognition method according to claim 1, further comprising, before the importing the inter-frame motion feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data:
configuring the selection weight with the numerical value of 0 for at least one convolution kernel to be identified in the pooling fusion network to obtain a network to be corrected;
inputting a plurality of preset training characteristic data to the network to be corrected to generate a first training result, and inputting a plurality of training characteristic data to the pooling fusion network to generate a second training result;
determining a loss value of the network to be corrected according to the first training result and the second training result;
if the loss value is less than or equal to the loss threshold, identifying the convolution kernel to be identified configured with the selection weight as a redundant convolution kernel;
if the loss value is larger than a preset loss threshold value, identifying the convolution kernel to be identified configured with the selection weight as a necessary convolution kernel;
returning to execute the selection weight with the configuration value of 0 for at least one convolution kernel to be identified in the pooling fusion network to obtain the operation of the network to be corrected until all the convolution kernels to be identified in the pooling fusion network are classified;
generating the feature extraction network based on all of the necessary convolution kernels.
3. The behavior recognition method according to claim 2, wherein the feature training data is associated with a reference action tag; the feature extraction network generated based on the feature training data is associated with the reference action tag;
before the importing the inter-frame action feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data, the method further includes:
determining a plurality of candidate action tags based on the inter-frame action data;
respectively calculating the matching degree of each candidate extraction network according to the candidate action tags and the reference action tags corresponding to each candidate extraction network;
and selecting the candidate extraction network with the highest matching degree as the feature extraction network.
4. The behavior recognition method according to claim 1, wherein the importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action feature data comprises:
determining an image tensor of any two consecutive video image frames in the target video data;
determining a plurality of feature point coordinates according to the key positions of the target object in the video image frame; the characteristic point coordinates are determined according to the gait behaviors of the target object;
determining tensor expressions of coordinates of all characteristic points in the image tensor, and generating characteristic vectors of the target object in the video image frame based on the coordinate expressions of all the characteristic points;
constructing a displacement correlation matrix according to the characteristic vectors of any two continuous video image frames; the displacement correlation matrix is used for determining displacement correlation scores between the coordinates of each characteristic point in one video image frame and each coordinate point in the other video image frame;
determining the maximum displacement distance of each characteristic point coordinate between the two continuous video image frames according to the displacement correlation matrix, and determining the displacement matrix of the target object based on all the maximum displacement distances;
importing the displacement matrix into a preset feature transformation model to generate action feature subdata of any two continuous video image frames;
and obtaining the inter-frame action characteristic data based on the action characteristic subdata of all the video image frames.
5. The behavior recognition method according to any one of claims 1 to 4, further comprising, before the receiving target video data to be recognized:
acquiring sample video data for training a behavior recognition module; the behavior recognition module comprises the interframe action extraction network, the feature extraction network and the contextual attention network;
generating positive sample data and negative sample data according to the sample video data; the positive sample data is obtained after interference processing is carried out on background information in the sample video data; the negative sample data is obtained by carrying out interference processing on a frame sequence of a sample video frame in the sample video data;
generating first spatial information and first optical flow information by the positive sample data, and generating second spatial information and second optical flow information by the negative sample data;
obtaining space enhancement information according to the first space information and the second space information;
obtaining optical flow enhancement information according to the second optical flow information and the first optical flow information;
importing the spatial enhancement information and the optical flow enhancement information into the behavior recognition module to obtain a training recognition result of the sample video data;
and pre-training the position learning parameters in the initial identification module based on the training results of all the sample video data to obtain the behavior identification module.
6. The behavior recognition method according to claim 5, wherein the generating positive sample data and negative sample data from the sample video data comprises:
dividing the sample video data into a plurality of video segments according to a preset action time duration; the paragraph duration of each video segment is not greater than the action time duration;
respectively updating the frame sequence numbers of the sample video frames in the video segments according to a preset disorder processing algorithm;
and packaging each sample video frame based on the updated frame sequence number to obtain the negative sample data.
7. The behavior recognition method according to any one of claims 1-4, wherein the importing the target video data into a contextual attention network, determining gait behavior data of a target object in the target video data, further comprises:
determining a target object and at least one environmental object within respective video image frames of the target video data;
determining a first context feature based on first position coordinates of each key feature point of the target object in all the video image frames; the key feature points are human key points related to the gait of the target object;
determining a second context feature based on a relative positional relationship between the target object and the environmental object in each of the video frames;
and importing the first context feature and the second context feature into the context attention network to generate the gait behavior data.
8. A spatiotemporal relationship-based behavior recognition apparatus, comprising:
the target video data receiving unit is used for receiving target video data to be identified;
the inter-frame action characteristic data extraction unit is used for importing the target video data into a preset inter-frame action extraction network to obtain inter-frame action characteristic data; the inter-frame action characteristic data is used for determining action characteristic information between adjacent video image frames in the target video data;
the sparse feature data unit is used for importing the inter-frame action feature data into a feature extraction network and outputting sparse feature data corresponding to the target video data; the feature extraction network is generated by performing sparsity constraint processing on each convolution kernel in the pooling fusion network through selecting weights;
the gait behavior data identification unit is used for importing the target video data into a context attention network and determining the gait behavior data of a target object in the target video data; the contextual attention network is used for extracting the mutual position relation between the target object and the environmental object in the target video data;
and the behavior identification unit is used for obtaining the behavior category of the target object according to the gait behavior data and the sparse characteristic data.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202211155011.8A 2022-09-21 2022-09-21 Behavior identification method based on time-space relationship and electronic equipment Pending CN115457660A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211155011.8A CN115457660A (en) 2022-09-21 2022-09-21 Behavior identification method based on time-space relationship and electronic equipment


Publications (1)

Publication Number Publication Date
CN115457660A true CN115457660A (en) 2022-12-09

Family

ID=84306599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211155011.8A Pending CN115457660A (en) 2022-09-21 2022-09-21 Behavior identification method based on time-space relationship and electronic equipment

Country Status (1)

Country Link
CN (1) CN115457660A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination