CN112836609A - Human behavior identification method and system based on relation guide video space-time characteristics - Google Patents


Info

Publication number
CN112836609A
CN112836609A (application number CN202110098237.8A)
Authority
CN
China
Prior art keywords
time
frame
relation
vector
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110098237.8A
Other languages
Chinese (zh)
Inventor
吕晨
吴琼
庄云亮
王潇
吕蕾
刘弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202110098237.8A
Publication of CN112836609A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human behavior identification method and system based on relation-guided video space-time characteristics. The method comprises: acquiring a video to be identified; dividing the video to be identified into a plurality of video segments according to a set frame number; extracting a characteristic diagram of each frame image of each video clip; dividing the characteristic diagram of each frame image into a plurality of different regions and extracting the spatial relationships among the regions to obtain a relation-guided spatial characteristic vector of each frame image; extracting the time relationships among the relation-guided spatial characteristic vectors of all the frame images to obtain a relation-guided time characteristic vector of each frame image; averaging the relation-guided time characteristic vectors of the frame images to obtain the space-time characteristic vector of the video clip; and obtaining a human behavior recognition result with the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.

Description

Human behavior identification method and system based on relation guide video space-time characteristics
Technical Field
The application relates to the technical field of human behavior recognition, in particular to a human behavior recognition method and system based on relation guide video space-time characteristics.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Video-based behavior recognition has received great attention in recent years because of its important applications in video surveillance. For video, information in the temporal dimension is crucial; video also carries richer background information, which makes temporal information all the more important for locating informative regions and fusing features across frames.
At present, deep learning has been widely applied to video behavior recognition: video frames are input into a defined behavior recognition model, which outputs the behavior classes contained in the video. Most of the better-performing methods build on multi-fiber networks and use 2D convolution in the shallow layers and 3D convolution in the deep layers, which reduces the complex spatio-temporal fusion during training and the huge memory consumption caused by 3D convolution without lowering the accuracy of video behavior recognition.
However, the inventor finds that most existing methods do not consider the influence of the video background on behavior recognition and therefore produce a lot of unnecessary noise information; in addition, feature fusion in the temporal dimension still leaves room for improvement.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a human behavior identification method and a human behavior identification system based on relation-guided video space-time characteristics, with which video behaviors can be identified efficiently and accurately.
In a first aspect, the application provides a human behavior identification method based on relation guide video spatiotemporal features;
the human body behavior identification method based on the relation guide video space-time characteristics comprises the following steps:
acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
In a second aspect, the application provides a human behavior recognition system based on relation-guided video spatiotemporal features;
human behavior recognition system based on relation guide video space-time characteristics includes:
a partitioning module configured to: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
a feature map extraction module configured to: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
a spatial feature vector extraction module configured to: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
a temporal feature vector extraction module configured to: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
an averaging operation module configured to: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
a human behavior recognition module configured to: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of this application are:
(1) The relation-guided video spatio-temporal behavior recognition introduces the concept of relation into the model: it connects local features, computes the relations between different parts, and assigns them different weights, achieving better feature aggregation.
(2) The relation concept is also introduced in the time dimension: through processing similar to that in the spatial dimension, the features of key frames are aggregated and extracted, key temporal information can complement each other, and the time-dimension features are refined and aggregated to enhance their discriminative power.
(3) The RM relation extraction module is adopted for relation extraction; compared with a simple dot product or inner product, the relations extracted by the RM module achieve a better effect.
(4) The invention provides video behavior recognition based on relation-guided temporal and spatial fusion. In the spatial dimension, the feature of each position and its relation vector are fused, as the attention of that spatial position, into the feature of the video frame; this feature captures local and global information better, filters background information in the video well, and yields more discriminative feature vectors for different types of behaviors. In the time dimension, frames that are close in time are generally more similar and provide less usable information, whereas comparing the features of temporally distant frames is more valuable. The invention therefore captures the relations between video frames with a method similar to the spatial one, and uses the relation extraction module to fuse the features of multiple video frames into one feature vector.
(5) Compared with other traditional video behavior identification techniques, the relation-guided video spatio-temporal behavior recognition makes full use of the time-dimension information in the video, better reflects the dependency between local and global information in the spatial dimension, brings a better recognition effect, and improves the robustness of the model.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a general framework diagram of the model according to the first embodiment of the present application;
fig. 2 is a diagram of an RM module according to a first embodiment of the present application;
FIG. 3 is a diagram illustrating a relationship extraction module according to a first embodiment of the present application;
fig. 4 is a spatial feature extraction module guided by relationships according to a first embodiment of the present application.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a human behavior identification method based on relation guide video space-time characteristics;
as shown in fig. 1, the method for recognizing human body behaviors based on the spatio-temporal features of the relationship-guided video includes:
s101: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
s102: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
s103: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
s104: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
s105: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
s106: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
As one or more embodiments, the S101: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number; the method specifically comprises the following steps:
the number of frames is set, for example, 10 frames, 20 frames, 30 frames, etc., and is not limited herein.
As one or more embodiments, the S102: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image; the method specifically comprises the following steps:
and respectively extracting the feature map of each frame image of each video clip based on the 3D convolutional neural network to obtain the feature map of each frame image.
As one or more embodiments, the S103: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance; the method specifically comprises the following steps:
s1031: carrying out reshape operation on the feature map of each frame of image, and dividing the feature map into N local areas, wherein N is H multiplied by W; h represents the number of rows and W represents the number of columns;
s1032: performing difference calculation on the spatial feature representation of the ith local area and the spatial feature representation of the jth local area of the feature map to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the ith local area and the jth local area;
s1033: letting j = j + 1 and repeating S1032 to obtain the relation vectors of the ith local area of the current feature map and all local areas of the current feature map;
s1034: combining the ith local area of the current feature map with the relationship vectors of all local areas of the current feature map to obtain the relationship vector between the ith local area of the current feature map and the whole situation of the current feature map;
s1035: combining a relation vector between the ith local area of the current feature map and the global situation of the current feature map with the spatial feature representation of the ith local area of the current feature map, sequentially performing full-connection operation and normalization operation on a combined result, and finally processing the result after the normalization operation by using a sigmoid activation function to generate a spatial attention score of the ith local area of the current feature map;
s1036: and carrying out weighted summation on the spatial attention score of each local area and the original spatial feature representation of each local area to obtain a spatial feature vector based on relationship guidance.
Illustratively, the spatial feature representation refers to the spatial feature representation in the feature map; a schematic implementation of steps S1031 to S1036 is sketched below.
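The following PyTorch sketch illustrates steps S1031 to S1036 under stated assumptions: the class names RM and RelationGuidedSpatial, the relation dimension, and the exact layer layout of the attention head are illustrative choices, not the patented implementation.

```python
import torch
import torch.nn as nn


class RM(nn.Module):
    """Relation module (sketch): maps two local features to a relation vector."""

    def __init__(self, dim: int, rel_dim: int = 16):
        super().__init__()
        self.embed = nn.Linear(dim, dim)      # shallow transform of each input feature
        self.fc = nn.Linear(dim, rel_dim)     # full connection applied to the difference
        self.norm = nn.LayerNorm(rel_dim)     # normalization of the relation vector

    def forward(self, fi: torch.Tensor, fj: torch.Tensor) -> torch.Tensor:
        diff = self.embed(fi) - self.embed(fj)   # f_di: difference of transformed features
        return self.norm(self.fc(diff))          # r_ij: relation vector of the two regions


class RelationGuidedSpatial(nn.Module):
    """Sketch of S1031-S1036: relation-guided spatial feature of a single frame."""

    def __init__(self, dim: int, num_regions: int, rel_dim: int = 16):
        super().__init__()
        self.rm = RM(dim, rel_dim)
        self.attention = nn.Sequential(          # FC + Norm + Sigmoid applied to [r_i ; f_i]
            nn.Linear(num_regions * rel_dim + dim, dim),
            nn.LayerNorm(dim),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (C, H, W) feature map of one frame, reshaped into N = H*W local regions
        c, h, w = fmap.shape
        regions = fmap.reshape(c, h * w).t()      # (N, C) local spatial features
        weighted = []
        for i in range(regions.shape[0]):
            # relation vectors between region i and every region, concatenated (global relation)
            rel = torch.cat([self.rm(regions[i], regions[j]) for j in range(regions.shape[0])])
            a_i = self.attention(torch.cat([rel, regions[i]]))   # spatial attention score
            weighted.append(a_i * regions[i])                    # weight the region feature
        return torch.stack(weighted).sum(dim=0)  # relation-guided spatial feature, shape (C,)


if __name__ == "__main__":
    frame_map = torch.randn(64, 4, 4)             # toy 4x4 feature map, N = 16 regions
    feat = RelationGuidedSpatial(dim=64, num_regions=16)(frame_map)
    print(feat.shape)                             # torch.Size([64])
```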
As one or more embodiments, the S104: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide; the method specifically comprises the following steps:
s1041: performing time feature extraction on the spatial relationship vector of the p frame image based on relationship guidance to obtain the time feature of the p frame image;
performing time feature extraction on the q frame image based on the spatial relationship vector guided by the relationship to obtain the time feature of the q frame image;
s1042: performing difference calculation on the time characteristics of the p frame image and the time characteristics of the q frame image to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the time characteristic of the p frame image and the time characteristic of the q frame image;
s1043: letting q = q + 1 and repeating S1041 to S1042 to obtain a relation vector of the time characteristic of the p frame image and the time characteristics of all other frame images in the current video clip;
s1044: combining the time characteristics of the p-th frame image with the relationship vectors of the time characteristics of all other frame images in the current video clip to obtain the relationship vectors between the time characteristics of the p-th frame image and the global characteristics of the time dimension of the current video clip;
s1045: splicing the relation vector between the time characteristic of the p frame image and the global characteristic of the time dimension of the current video clip with the time characteristic of the p frame image, and sequentially carrying out full-connection operation and normalization operation on the spliced result to obtain a time characteristic vector of the p frame image based on relation guidance;
and similarly, obtaining the time characteristic vector of each frame of image based on the relation guidance; a schematic implementation of these steps is sketched below.
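A corresponding sketch for steps S1041 to S1045, again with assumed names, dimensions and layer choices; the final averaging over frames corresponds to S105:

```python
import torch
import torch.nn as nn


class TemporalRM(nn.Module):
    """Relation module between the temporal features of two frames (sketch)."""

    def __init__(self, dim: int, rel_dim: int = 16):
        super().__init__()
        self.embed = nn.Linear(dim, dim)
        self.fc = nn.Linear(dim, rel_dim)
        self.norm = nn.LayerNorm(rel_dim)

    def forward(self, fp: torch.Tensor, fq: torch.Tensor) -> torch.Tensor:
        return self.norm(self.fc(self.embed(fp) - self.embed(fq)))


class RelationGuidedTemporal(nn.Module):
    """Sketch of S1041-S1045: relation-guided temporal features of one clip."""

    def __init__(self, dim: int, num_frames: int, rel_dim: int = 16):
        super().__init__()
        self.time_feat = nn.Linear(dim, dim)             # per-frame temporal feature extraction
        self.rm = TemporalRM(dim, rel_dim)
        self.fuse = nn.Sequential(                       # FC + Norm applied to [r_p ; f_p]
            nn.Linear(num_frames * rel_dim + dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, spatial_feats: torch.Tensor) -> torch.Tensor:
        # spatial_feats: (T, C) relation-guided spatial vectors of the T frames
        t_feats = self.time_feat(spatial_feats)          # (T, C) temporal features
        out = []
        for p in range(t_feats.shape[0]):
            # relation vector between frame p and every frame of the clip
            rel = torch.cat([self.rm(t_feats[p], t_feats[q]) for q in range(t_feats.shape[0])])
            out.append(self.fuse(torch.cat([rel, t_feats[p]])))
        frame_feats = torch.stack(out)                   # (T, C) relation-guided temporal vectors
        return frame_feats.mean(dim=0)                   # clip-level spatio-temporal vector (S105)


if __name__ == "__main__":
    clip_feats = torch.randn(10, 64)                     # 10 frames, 64-dim spatial vectors
    v = RelationGuidedTemporal(dim=64, num_frames=10)(clip_feats)
    print(v.shape)                                       # torch.Size([64])
```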
As one or more embodiments, the S106: obtaining a human body behavior recognition result by adopting a trained behavior recognition model based on the space-time characteristic vectors of all frame images of each video clip; wherein the training step of the trained behavior recognition model comprises the following steps:
constructing a behavior recognition model, wherein the behavior recognition model is a fully-connected neural network;
constructing a training set, wherein the training set is the space-time characteristic vectors of all frame images of a video clip with a known human behavior recognition result;
and inputting the training set into the behavior recognition model, training the behavior recognition model, and stopping training when the set number of iterations is reached or the loss function reaches its minimum value, so as to obtain the trained behavior recognition model; a schematic training loop is sketched below.
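The following is a schematic training loop only; the helper name train_classifier, the two-layer classifier, Adam, and cross-entropy are assumptions, since the disclosure specifies only a fully connected network trained until an iteration count or loss minimum is reached:

```python
import torch
import torch.nn as nn


def train_classifier(features: torch.Tensor, labels: torch.Tensor,
                     num_classes: int, epochs: int = 20, lr: float = 1e-3) -> nn.Module:
    """Train a fully connected behavior classifier on clip-level spatio-temporal vectors.

    features: (num_clips, feat_dim) spatio-temporal feature vectors of labelled clips.
    labels:   (num_clips,) behavior class indices.
    """
    model = nn.Sequential(                               # the fully connected recognition model
        nn.Linear(features.shape[1], 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, num_classes),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                              # stop after a set number of iterations
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
    return model


if __name__ == "__main__":
    feats = torch.randn(32, 64)                          # stand-in training features
    labs = torch.randint(0, 5, (32,))                    # 5 hypothetical behavior classes
    clf = train_classifier(feats, labs, num_classes=5)
    print(clf(feats[:2]).argmax(dim=1))                  # predicted behavior classes
```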
Firstly, a section of video is cut into several parts of adjacent frames, each part containing the same number of video frames. The feature map of the video is then extracted with a 3D convolution operation and reshaped, per frame, from $F \in \mathbb{R}^{C\times H\times W}$ into the set of local features $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$, with $N = H \times W$. An RM module is used to compute the relation between two feature vectors in space (as shown in FIG. 2); the relations between one feature vector and all other feature vectors are concatenated into a relation vector, from which an attention score for that feature vector is generated, and the scores are fused to obtain the spatial attention. Frame-level features of different frames are then extracted with a similar relation-guided method, the frame-level features are fused and trained with the loss functions, and the category of the behavior contained in the video is obtained through multiple iterations.
The invention relates to video spatio-temporal behavior identification based on relationship guidance, which comprises three main parts: the video feature map extraction part, the relation-guided spatial feature part and the relation-guided temporal feature part.
First, the video feature map extraction part uses a 3D convolution network model, which extracts time-dimension information better while still ensuring feature extraction in the spatial dimension; compared with methods that compute features such as optical flow, it is simpler, and the number of parameters and the resources consumed during computation are markedly reduced.
The relation-guided spatial feature part performs a reshape operation on the extracted feature map, transforming it from $F \in \mathbb{R}^{C\times H\times W}$ into $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$, where $f_i$ represents the feature of the i-th position. A GRV module is used to generate, for each position, a relation vector $\mathbf{r}_i$ and to integrate it with the position feature, as shown in FIG. 4; the result then passes through fully connected layers, and the feature vectors are weighted and summed with the spatial attention scores to obtain the intra-frame feature.
As shown in fig. 2, the RM module is part of the GRV module. The RM module takes two feature vectors as input, calculates their difference, and applies a series of nonlinear and normalization operations to obtain a value that represents the degree of correlation between the two vectors.
As shown in fig. 3, the feature vectors input to the GRV module are divided into a target feature vector and the other feature vectors. The module pairs the target vector with each of the other vectors, extracts the degree-of-relationship values through the RM module, and then concatenates these values into a vector. GRV is short for Global Relation Vector; the module computes a relation vector between one feature and the global features.
Finally, the relation-guided temporal feature part uses an RM module (a module that computes the relation between two vectors) to generate feature vectors and fuses them with the original vectors to produce inter-frame features, in a manner similar to the spatial part. Unlike the spatial part, it does not use weighted summation; instead, the features of each part are summed directly in the time dimension, which strengthens the feature representation capability in the time dimension and further enhances the temporal feature representation.
When the relation between two features is calculated, an inner product or a dot product is usually adopted in the past, and the two relation modes have some problems. For example, the method using inner products usually indicates how similar the features are, and it cannot be inferred which parts are similar and which parts are different; the method using dot products involves a huge amount of computation and extracts some redundant information when representing the connections between features. To avoid the problems of the two methods, the RM module is adopted to generate a relation vector between the two features, and the vector contains abundant and compact information compared with the vector obtained by the two methods.
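The contrast can be made concrete with a small sketch (dimensions and layer sizes are arbitrary assumptions): the inner product collapses two features into one similarity number, while an RM-style difference-plus-projection keeps a compact vector that still encodes where the two features differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

f_i, f_j = torch.randn(64), torch.randn(64)

# Inner product: a single similarity number -- it cannot say *which* parts differ.
similarity = torch.dot(f_i, f_j)

# RM-style relation (sketch): embed both features, take the difference, then compress
# with a small fully connected layer -- a compact vector that keeps where they differ.
embed, fc = nn.Linear(64, 64), nn.Linear(64, 16)
relation = fc(embed(f_i) - embed(f_j))

print(similarity.shape, relation.shape)   # torch.Size([]) vs torch.Size([16])
```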
First, the video frames are input into the 3D convolution network model and a feature map carrying temporal information is extracted. The feature map is divided into multi-region feature maps, a GRV module extracts the relations among them in the spatial dimension to generate relation vectors, the relation vectors are used to generate the attention weight of each region, and the weight of each region is multiplied by the region feature and summed to obtain the feature vector of a single video frame. The GRV module is then used to compute the relations among the feature vectors of multiple video frames, the original feature vectors are combined as the predicted feature vector, and the weights are updated iteratively until the optimal weights are obtained, strengthening the behavior recognition performance.
The data set used by the technique is the Kinetics data set. Its videos come from YouTube; there are 600 categories, each category contains at least 600 videos, and each video lasts about 10 seconds. The categories fall into three main groups: human-object interactions, such as playing musical instruments; human-human interactions, such as handshaking and hugging; and sports and the like, namely person and person-object categories.
The technique is directed to behavior recognition in video and is not applicable to other information carriers.
And the video characteristic diagram extraction part extracts a characteristic diagram with space-time characteristics by using a 3D convolution model and is used for extracting a later-stage relation vector. The method comprises the following specific steps:
1) the Kinetics data set is searched and downloaded using a search engine.
2) After the data set is decompressed, the whole video is cut into T frames $V = \{v_1, v_2, \dots, v_T\}$, and the T-frame clip is input into a 3D ConvNet model; a spatio-temporal feature map carrying both temporal and spatial information is generated through multiple layers of convolution and pooling. The 3D convolution operation has the standard form

$$v^{l}_{x,y,z} = \sigma\Big(b + \sum_{c}\sum_{i}\sum_{j}\sum_{k} w^{c}_{i,j,k}\, v^{l-1,c}_{x+i,\,y+j,\,z+k}\Big),$$

and the feature map generated after the 3D convolution operations is $F \in \mathbb{R}^{T\times C\times H\times W}$.
3) A reshape operation is performed on the feature map: each frame is viewed spatially as N (N = H × W) different local regions, and the dimension of each local-region feature is kept unchanged. After the reshape the feature map becomes $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$, where the feature of the i-th local region is denoted $f_i$.
the method has the advantages that the features of the input video are extracted, and the extracted features are processed in advance, so that the follow-up work is facilitated.
The following is a spatial feature part guided by the relationship, which is a main module of the invention and is used for processing and integrating the relationship between local spatial features to improve the effect of video behavior recognition. The method comprises the following specific steps:
1) The video feature map extraction part converts an input video segment into a feature map of the form $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$. The relation between the feature of each part and the features of the other parts now needs to be calculated to obtain a relation vector. Assuming the relation vector of the i-th local region is currently being calculated, the feature of the i-th local region is written as $f_i$.
2) An RM module is used to compute the relation between two different region features.
Compared with other ways of generating relations between features, the relation vector produced by the RM module is markedly more effective: it is richer in information and fuses the features better.
Taking the relation between $f_i$ and $f_j$, the features of the i-th and j-th local regions, as an example, the relation vector $r_{i,j}$ of the two local regions is generated by the RM module:

$$r_{i,j} = \mathrm{RM}(f_i,\, f_j).$$

The RM module is implemented as follows: two groups of new features are obtained from the two input features through a shallow neural network $g(\cdot)$, and their difference is computed:

$$f_{di} = g(f_i) - g(f_j).$$

On the basis of this difference, the relation vector between the two regions is obtained through a full-connection and a normalization operation:

$$r_{i,j} = \mathrm{Norm}\big(\mathrm{FC}(f_{di})\big),$$

where $f_{di}$ is the feature difference between the two local regions.
3) The above operation is repeated to calculate the relation vectors between the i-th region and all other regions, and the N obtained relation vectors are combined to generate the relation vector between the i-th region and the global context:

$$\mathbf{r}_i = [\,r_{i,1},\, r_{i,2},\, \dots,\, r_{i,N}\,].$$

The generated relation vector $\mathbf{r}_i$ and the feature vector $f_i$ are combined, and after a full-connection and normalization processing a Sigmoid activation function is used to generate the spatial attention score $a_i$:

$$a_i = \mathrm{Sigmoid}\big(\mathrm{Norm}(W_A\,[\,\mathbf{r}_i;\, f_i\,])\big),$$

where $W_A$ is the weight of the shallow neural network. Finally, the attention scores and the spatial features of all parts are weighted and summed to generate the feature vector $f$ of the frame:

$$f = \sum_{i=1}^{N} a_i\, f_i.$$
Through the above operations, the relation-guided spatial feature vector is obtained, and a spatial feature fused with the relation vector is generated for each frame.
Next, in a similar manner, relation vectors between frames are generated; low-quality frame features are down-weighted or filtered out, and the refined and aggregated relations between frames enhance the behavior recognition capability. The method comprises the following specific steps:
1) After passing through the 3D convolutional neural network and the relation-guided spatial feature module, a video segment is converted into frame-level feature vectors $f_t$. The RM module is used to extract the relations between the temporal frame features, and each time frame generates a relation vector with respect to the global context:

$$\mathbf{r}_t = [\,\mathrm{RM}(f_t, f_1),\, \mathrm{RM}(f_t, f_2),\, \dots,\, \mathrm{RM}(f_t, f_T)\,].$$

2) The generated relation vector $\mathbf{r}_t$ and the feature vector $f_t$ are combined, and the combined vector is passed through a full-connection and a normalization operation to generate the relation-guided temporal feature vector $\tilde{f}_t$:

$$\tilde{f}_t = \mathrm{Norm}\big(\mathrm{FC}([\,\mathbf{r}_t;\, f_t\,])\big).$$

The generated feature vectors are then averaged to produce the video-level feature vector used for the final behavior recognition.
Example two
The embodiment provides a human behavior recognition system based on the relation guide video space-time characteristics;
human behavior recognition system based on relation guide video space-time characteristics includes:
a partitioning module configured to: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
a feature map extraction module configured to: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
a spatial feature vector extraction module configured to: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
a temporal feature vector extraction module configured to: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
an averaging operation module configured to: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
a human behavior recognition module configured to: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
It should be noted here that the dividing module, the feature map extracting module, the spatial feature vector extracting module, the temporal feature vector extracting module, the averaging module and the human behavior recognizing module correspond to steps S101 to S106 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, among other storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. The human body behavior identification method based on the relation guide video space-time characteristics is characterized by comprising the following steps:
acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
2. The method for recognizing human body behaviors based on the relation-oriented video spatio-temporal features as claimed in claim 1, wherein feature map extraction is performed on each frame image of each video clip to obtain a feature map of each frame image; the method specifically comprises the following steps:
and respectively extracting the feature map of each frame image of each video clip based on the 3D convolutional neural network to obtain the feature map of each frame image.
3. The method for recognizing human body behaviors based on relationship-guided video spatio-temporal features as claimed in claim 1, wherein a plurality of different regions are divided into the feature map of each frame image, and spatial relationship extraction is performed between the regions to obtain the spatial feature vector of each frame image based on relationship guidance; the method specifically comprises the following steps:
carrying out reshape operation on the feature map of each frame of image, and dividing the feature map into N local areas, wherein N is H multiplied by W; h represents the number of rows and W represents the number of columns;
performing difference calculation on the spatial feature representation of the ith local area and the spatial feature representation of the jth local area of the feature map to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the ith local area and the jth local area;
repeating the previous step to obtain the relation vectors of the ith local area of the current feature map and all local areas of the current feature map;
combining the ith local area of the current feature map with the relationship vectors of all local areas of the current feature map to obtain the relationship vector between the ith local area of the current feature map and the whole situation of the current feature map;
combining a relation vector between the ith local area of the current feature map and the global situation of the current feature map with the spatial feature representation of the ith local area of the current feature map, sequentially performing full-connection operation and normalization operation on a combined result, and finally processing the result after the normalization operation by using a sigmoid activation function to generate a spatial attention score of the ith local area of the current feature map;
and carrying out weighted summation on the spatial attention score of each local area and the original spatial feature representation of each local area to obtain a spatial feature vector based on relationship guidance.
4. The method for recognizing human body behaviors based on relationship-guided video space-time features as claimed in claim 1, wherein the relationship-guided spatial feature vectors of all the frame images are subjected to time relationship extraction to obtain the relationship-guided temporal feature vectors of each frame image; the method specifically comprises the following steps:
performing time feature extraction on the spatial relationship vector of the p frame image based on relationship guidance to obtain the time feature of the p frame image; performing time feature extraction on the q frame image based on the spatial relationship vector guided by the relationship to obtain the time feature of the q frame image;
performing difference calculation on the time characteristics of the p frame image and the time characteristics of the q frame image to obtain a difference result; sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the time characteristic of the p frame image and the time characteristic of the q frame image;
repeating the two steps to obtain a relation vector of the time characteristic of the p-th frame image and the time characteristics of all other frame images in the current video clip;
combining the time characteristics of the p-th frame image with the relationship vectors of the time characteristics of all other frame images in the current video clip to obtain the relationship vectors between the time characteristics of the p-th frame image and the global characteristics of the time dimension of the current video clip;
splicing the relation vector between the time characteristic of the p frame image and the global characteristic of the time dimension of the current video clip with the time characteristic of the p frame image, and sequentially carrying out full-connection operation and normalization operation on the spliced result to obtain a time characteristic vector of the p frame image based on relation guidance;
and similarly, obtaining the time characteristic vector of each frame of image based on the relation guidance.
5. The human behavior recognition method based on the relation-guided video spatio-temporal features as claimed in claim 1, wherein the human behavior recognition result is obtained by adopting a trained behavior recognition model based on spatio-temporal feature vectors of all frame images of each video segment; wherein the training step of the trained behavior recognition model comprises the following steps:
constructing a behavior recognition model, wherein the behavior recognition model is a fully-connected neural network;
constructing a training set, wherein the training set is the space-time characteristic vectors of all frame images of a video clip with a known human behavior recognition result;
and inputting the training set into the behavior recognition model, training the behavior recognition model, and stopping training when the iteration times are reached or the loss function reaches the minimum value to obtain the trained behavior recognition model.
6. The human behavior recognition system based on the relation guide video space-time characteristics is characterized by comprising:
a partitioning module configured to: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
a feature map extraction module configured to: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
a spatial feature vector extraction module configured to: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
a temporal feature vector extraction module configured to: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
an averaging operation module configured to: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
a human behavior recognition module configured to: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
7. The system according to claim 6, wherein the feature map of each frame of image is divided into a plurality of different regions, and the spatial relationship between the regions is extracted to obtain the spatial feature vector of each frame of image based on the relationship guide; the method specifically comprises the following steps:
carrying out reshape operation on the feature map of each frame of image, and dividing the feature map into N local areas, wherein N is H multiplied by W; h represents the number of rows and W represents the number of columns;
performing difference calculation on the spatial feature representation of the ith local area and the spatial feature representation of the jth local area of the feature map to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the ith local area and the jth local area;
repeating the previous step to obtain the relation vectors of the ith local area of the current feature map and all local areas of the current feature map;
combining the ith local area of the current feature map with the relationship vectors of all local areas of the current feature map to obtain the relationship vector between the ith local area of the current feature map and the whole situation of the current feature map;
combining a relation vector between the ith local area of the current feature map and the global situation of the current feature map with the spatial feature representation of the ith local area of the current feature map, sequentially performing full-connection operation and normalization operation on a combined result, and finally processing the result after the normalization operation by using a sigmoid activation function to generate a spatial attention score of the ith local area of the current feature map;
and carrying out weighted summation on the spatial attention score of each local area and the original spatial feature representation of each local area to obtain a spatial feature vector based on relationship guidance.
8. The system for recognizing human body behaviors based on relationship-guided video space-time characteristics according to claim 6, wherein the relationship-guided spatial characteristic vectors of all the frame images are subjected to time relationship extraction to obtain the relationship-guided temporal characteristic vector of each frame image; the method specifically comprises the following steps:
performing time feature extraction on the spatial relationship vector of the p frame image based on relationship guidance to obtain the time feature of the p frame image; performing time feature extraction on the q frame image based on the spatial relationship vector guided by the relationship to obtain the time feature of the q frame image;
performing difference calculation on the time characteristics of the p frame image and the time characteristics of the q frame image to obtain a difference result; sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the time characteristic of the p frame image and the time characteristic of the q frame image;
repeating the two steps to obtain a relation vector of the time characteristic of the p-th frame image and the time characteristics of all other frame images in the current video clip;
combining the time characteristics of the p-th frame image with the relationship vectors of the time characteristics of all other frame images in the current video clip to obtain the relationship vectors between the time characteristics of the p-th frame image and the global characteristics of the time dimension of the current video clip;
splicing the relation vector between the time characteristic of the p frame image and the global characteristic of the time dimension of the current video clip with the time characteristic of the p frame image, and sequentially carrying out full-connection operation and normalization operation on the spliced result to obtain a time characteristic vector of the p frame image based on relation guidance;
and similarly, obtaining the time characteristic vector of each frame of image based on the relation guidance.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-5.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 5.
CN202110098237.8A 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics Pending CN112836609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098237.8A CN112836609A (en) 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110098237.8A CN112836609A (en) 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics

Publications (1)

Publication Number Publication Date
CN112836609A true CN112836609A (en) 2021-05-25

Family

ID=75931412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098237.8A Pending CN112836609A (en) 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics

Country Status (1)

Country Link
CN (1) CN112836609A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408448A (en) * 2021-06-25 2021-09-17 之江实验室 Method and device for extracting local features of three-dimensional space-time object and identifying object
CN113435578A (en) * 2021-06-25 2021-09-24 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN114926770A (en) * 2022-05-31 2022-08-19 上海人工智能创新中心 Video motion recognition method, device, equipment and computer readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160295A (en) * 2019-12-31 2020-05-15 广州视声智能科技有限公司 Video pedestrian re-identification method based on region guidance and space-time attention

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160295A (en) * 2019-12-31 2020-05-15 广州视声智能科技有限公司 Video pedestrian re-identification method based on region guidance and space-time attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINGZE LI,ET AL.: "Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification", 《PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408448A (en) * 2021-06-25 2021-09-17 之江实验室 Method and device for extracting local features of three-dimensional space-time object and identifying object
CN113435578A (en) * 2021-06-25 2021-09-24 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN113435578B (en) * 2021-06-25 2022-04-05 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN114926770A (en) * 2022-05-31 2022-08-19 上海人工智能创新中心 Video motion recognition method, device, equipment and computer readable storage medium
CN114926770B (en) * 2022-05-31 2024-06-07 上海人工智能创新中心 Video motion recognition method, apparatus, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN112836609A (en) Human behavior identification method and system based on relation guide video space-time characteristics
CN110765246B (en) Question and answer method and device based on intelligent robot, storage medium and intelligent device
WO2021057056A1 (en) Neural architecture search method, image processing method and device, and storage medium
CN112541904B (en) Unsupervised remote sensing image change detection method, storage medium and computing device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112446379B (en) Self-adaptive intelligent processing method for dynamic large scene
CN111259940A (en) Target detection method based on space attention map
CN111046821A (en) Video behavior identification method and system and electronic equipment
CN110516734B (en) Image matching method, device, equipment and storage medium
CN114612832A (en) Real-time gesture detection method and device
CN111915650A (en) Target tracking method and system based on improved twin network
CN114998601B (en) On-line update target tracking method and system based on Transformer
CN113221680B (en) Text pedestrian retrieval method based on text dynamic guiding visual feature extraction
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
He et al. Transvcl: Attention-enhanced video copy localization network with flexible supervision
US20210056353A1 (en) Joint representation learning from images and text
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN113723352B (en) Text detection method, system, storage medium and electronic equipment
CN108197660A (en) Multi-model Feature fusion/system, computer readable storage medium and equipment
CN114677611B (en) Data identification method, storage medium and device
CN111914809B (en) Target object positioning method, image processing method, device and computer equipment
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
Zhang et al. CAM R-CNN: End-to-end object detection with class activation maps
CN113761282A (en) Video duplicate checking method and device, electronic equipment and storage medium
Liu et al. Joint learning of image aesthetic quality assessment and semantic recognition based on feature enhancement

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20210525)