CN112836609A - Human behavior identification method and system based on relation guide video space-time characteristics - Google Patents


Info

Publication number
CN112836609A
CN112836609A (application number CN202110098237.8A)
Authority
CN
China
Prior art keywords
time
frame
relation
vector
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110098237.8A
Other languages
Chinese (zh)
Inventor
吕晨
吴琼
庄云亮
王潇
吕蕾
刘弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202110098237.8A
Publication of CN112836609A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human behavior identification method and system based on relation-guided video space-time characteristics. The method comprises: acquiring a video to be identified; dividing the video to be identified into a plurality of video segments according to a set frame number; extracting a characteristic diagram of each frame image of each video clip; dividing the characteristic diagram of each frame image into a plurality of different regions and extracting the spatial relationships among the regions to obtain a relation-guided spatial characteristic vector of each frame image; extracting the time relationships among the relation-guided spatial characteristic vectors of all the frame images to obtain a relation-guided time characteristic vector of each frame image; averaging the relation-guided time characteristic vectors of the frame images to obtain the space-time characteristic vector of the video clip; and obtaining a human behavior recognition result with the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.

Description

Human behavior identification method and system based on relation guide video space-time characteristics
Technical Field
The application relates to the technical field of human behavior recognition, in particular to a human behavior recognition method and system based on relation guide video space-time characteristics.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Video-based behavior recognition has received great attention in recent years because of its important applications in video surveillance. For video, information in the temporal dimension is crucial; video also carries richer background information, which makes temporal information all the more important for locating informative regions and fusing features across frames.
At present, deep learning has been widely applied to video behavior recognition: video frames are input into a defined behavior recognition model, which outputs the behavior classes contained in the video. Most of the better-performing methods build on multi-fiber networks and use 2D convolution in the shallow layers and 3D convolution in the deep layers, which reduces the complex spatio-temporal fusion during training and the huge memory consumption caused by 3D convolution without lowering the accuracy of video behavior recognition.
However, the inventor finds that most existing methods do not consider the influence of the video background on behavior recognition and therefore produce a lot of unnecessary noise information; in addition, feature fusion in the temporal dimension still leaves room for improvement.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides a human behavior identification method and a human behavior identification system based on relation-guided video space-time characteristics, with which video behaviors can be identified efficiently and accurately.
In a first aspect, the application provides a human behavior identification method based on relation guide video spatiotemporal features;
the human body behavior identification method based on the relation guide video space-time characteristics comprises the following steps:
acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
In a second aspect, the application provides a human behavior recognition system based on relation-guided video spatiotemporal features;
human behavior recognition system based on relation guide video space-time characteristics includes:
a partitioning module configured to: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
a feature map extraction module configured to: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
a spatial feature vector extraction module configured to: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
a temporal feature vector extraction module configured to: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
an averaging operation module configured to: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
a human behavior recognition module configured to: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of this application are:
(1) The relation-guided video spatio-temporal behavior recognition introduces the concept of relation into the model: it connects local features, computes the relations between different parts, and assigns them different weights, achieving better feature aggregation.
(2) The relation concept is also introduced in the time dimension: through processing similar to that in the spatial dimension, the features of key frames are aggregated and extracted, key temporal information can complement each other, and the time-dimension features are refined and aggregated to enhance their discriminative power.
(3) The RM relation extraction module is adopted for relation extraction; compared with a simple dot product or inner product, the relations extracted by the RM module achieve a better effect.
(4) The invention provides video behavior recognition based on relation-guided temporal and spatial fusion. In the spatial dimension, the feature of each position and its relation vector are fused, as the attention of that spatial position, into the feature of the video frame; this feature captures local and global information better, filters background information in the video well, and yields more discriminative feature vectors for different types of behaviors. In the time dimension, frames that are close in time are generally more similar and provide less usable information, whereas comparing the features of temporally distant frames is more valuable. The invention therefore captures the relations between video frames with a method similar to the spatial one, and uses the relation extraction module to fuse the features of multiple video frames into one feature vector.
(5) Compared with other traditional video behavior identification techniques, the relation-guided video spatio-temporal behavior recognition makes full use of the time-dimension information in the video, better reflects the dependency between local and global information in the spatial dimension, brings a better recognition effect, and improves the robustness of the model.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a general framework diagram of the model according to the first embodiment of the present application;
fig. 2 is a diagram of an RM module according to a first embodiment of the present application;
FIG. 3 is a diagram illustrating a relationship extraction module according to a first embodiment of the present application;
fig. 4 is a spatial feature extraction module guided by relationships according to a first embodiment of the present application.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment provides a human behavior identification method based on relation guide video space-time characteristics;
as shown in fig. 1, the method for recognizing human body behaviors based on the spatio-temporal features of the relationship-guided video includes:
s101: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
s102: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
s103: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
s104: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
s105: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
s106: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
As one or more embodiments, the S101: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number; the method specifically comprises the following steps:
the number of frames is set, for example, 10 frames, 20 frames, 30 frames, etc., and is not limited herein.
As one or more embodiments, the S102: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image; the method specifically comprises the following steps:
and respectively extracting the feature map of each frame image of each video clip based on the 3D convolutional neural network to obtain the feature map of each frame image.
As one or more embodiments, the S103: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance; the method specifically comprises the following steps:
s1031: carrying out reshape operation on the feature map of each frame of image, and dividing the feature map into N local areas, wherein N is H multiplied by W; h represents the number of rows and W represents the number of columns;
s1032: performing difference calculation on the spatial feature representation of the ith local area and the spatial feature representation of the jth local area of the feature map to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the ith local area and the jth local area;
s1033: letting j = j + 1 and repeating S1032 to obtain the relation vectors of the ith local area of the current feature map and all local areas of the current feature map;
s1034: combining the ith local area of the current feature map with the relationship vectors of all local areas of the current feature map to obtain the relationship vector between the ith local area of the current feature map and the whole situation of the current feature map;
s1035: combining a relation vector between the ith local area of the current feature map and the global situation of the current feature map with the spatial feature representation of the ith local area of the current feature map, sequentially performing full-connection operation and normalization operation on a combined result, and finally processing the result after the normalization operation by using a sigmoid activation function to generate a spatial attention score of the ith local area of the current feature map;
s1036: and carrying out weighted summation on the spatial attention score of each local area and the original spatial feature representation of each local area to obtain a spatial feature vector based on relationship guidance.
Illustratively, the spatial feature representation refers to the spatial feature representation in the feature map; a schematic implementation of steps S1031 to S1036 is sketched below.
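The following PyTorch sketch illustrates steps S1031 to S1036 under stated assumptions: the class names RM and RelationGuidedSpatial, the relation dimension, and the exact layer layout of the attention head are illustrative choices, not the patented implementation.

```python
import torch
import torch.nn as nn


class RM(nn.Module):
    """Relation module (sketch): maps two local features to a relation vector."""

    def __init__(self, dim: int, rel_dim: int = 16):
        super().__init__()
        self.embed = nn.Linear(dim, dim)      # shallow transform of each input feature
        self.fc = nn.Linear(dim, rel_dim)     # full connection applied to the difference
        self.norm = nn.LayerNorm(rel_dim)     # normalization of the relation vector

    def forward(self, fi: torch.Tensor, fj: torch.Tensor) -> torch.Tensor:
        diff = self.embed(fi) - self.embed(fj)   # f_di: difference of transformed features
        return self.norm(self.fc(diff))          # r_ij: relation vector of the two regions


class RelationGuidedSpatial(nn.Module):
    """Sketch of S1031-S1036: relation-guided spatial feature of a single frame."""

    def __init__(self, dim: int, num_regions: int, rel_dim: int = 16):
        super().__init__()
        self.rm = RM(dim, rel_dim)
        self.attention = nn.Sequential(          # FC + Norm + Sigmoid applied to [r_i ; f_i]
            nn.Linear(num_regions * rel_dim + dim, dim),
            nn.LayerNorm(dim),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (C, H, W) feature map of one frame, reshaped into N = H*W local regions
        c, h, w = fmap.shape
        regions = fmap.reshape(c, h * w).t()      # (N, C) local spatial features
        weighted = []
        for i in range(regions.shape[0]):
            # relation vectors between region i and every region, concatenated (global relation)
            rel = torch.cat([self.rm(regions[i], regions[j]) for j in range(regions.shape[0])])
            a_i = self.attention(torch.cat([rel, regions[i]]))   # spatial attention score
            weighted.append(a_i * regions[i])                    # weight the region feature
        return torch.stack(weighted).sum(dim=0)  # relation-guided spatial feature, shape (C,)


if __name__ == "__main__":
    frame_map = torch.randn(64, 4, 4)             # toy 4x4 feature map, N = 16 regions
    feat = RelationGuidedSpatial(dim=64, num_regions=16)(frame_map)
    print(feat.shape)                             # torch.Size([64])
```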
As one or more embodiments, the S104: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide; the method specifically comprises the following steps:
s1041: performing time feature extraction on the spatial relationship vector of the p frame image based on relationship guidance to obtain the time feature of the p frame image;
performing time feature extraction on the q frame image based on the spatial relationship vector guided by the relationship to obtain the time feature of the q frame image;
s1042: performing difference calculation on the time characteristics of the p frame image and the time characteristics of the q frame image to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the time characteristic of the p frame image and the time characteristic of the q frame image;
s1043: letting q = q + 1 and repeating S1041 to S1042 to obtain a relation vector of the time characteristic of the p frame image and the time characteristics of all other frame images in the current video clip;
s1044: combining the time characteristics of the p-th frame image with the relationship vectors of the time characteristics of all other frame images in the current video clip to obtain the relationship vectors between the time characteristics of the p-th frame image and the global characteristics of the time dimension of the current video clip;
s1045: splicing the relation vector between the time characteristic of the p frame image and the global characteristic of the time dimension of the current video clip with the time characteristic of the p frame image, and sequentially carrying out full-connection operation and normalization operation on the spliced result to obtain a time characteristic vector of the p frame image based on relation guidance;
and similarly, obtaining the time characteristic vector of each frame of image based on the relation guidance; a schematic implementation of these steps is sketched below.
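A corresponding sketch for steps S1041 to S1045, again with assumed names, dimensions and layer choices; the final averaging over frames corresponds to S105:

```python
import torch
import torch.nn as nn


class TemporalRM(nn.Module):
    """Relation module between the temporal features of two frames (sketch)."""

    def __init__(self, dim: int, rel_dim: int = 16):
        super().__init__()
        self.embed = nn.Linear(dim, dim)
        self.fc = nn.Linear(dim, rel_dim)
        self.norm = nn.LayerNorm(rel_dim)

    def forward(self, fp: torch.Tensor, fq: torch.Tensor) -> torch.Tensor:
        return self.norm(self.fc(self.embed(fp) - self.embed(fq)))


class RelationGuidedTemporal(nn.Module):
    """Sketch of S1041-S1045: relation-guided temporal features of one clip."""

    def __init__(self, dim: int, num_frames: int, rel_dim: int = 16):
        super().__init__()
        self.time_feat = nn.Linear(dim, dim)             # per-frame temporal feature extraction
        self.rm = TemporalRM(dim, rel_dim)
        self.fuse = nn.Sequential(                       # FC + Norm applied to [r_p ; f_p]
            nn.Linear(num_frames * rel_dim + dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, spatial_feats: torch.Tensor) -> torch.Tensor:
        # spatial_feats: (T, C) relation-guided spatial vectors of the T frames
        t_feats = self.time_feat(spatial_feats)          # (T, C) temporal features
        out = []
        for p in range(t_feats.shape[0]):
            # relation vector between frame p and every frame of the clip
            rel = torch.cat([self.rm(t_feats[p], t_feats[q]) for q in range(t_feats.shape[0])])
            out.append(self.fuse(torch.cat([rel, t_feats[p]])))
        frame_feats = torch.stack(out)                   # (T, C) relation-guided temporal vectors
        return frame_feats.mean(dim=0)                   # clip-level spatio-temporal vector (S105)


if __name__ == "__main__":
    clip_feats = torch.randn(10, 64)                     # 10 frames, 64-dim spatial vectors
    v = RelationGuidedTemporal(dim=64, num_frames=10)(clip_feats)
    print(v.shape)                                       # torch.Size([64])
```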
As one or more embodiments, the S106: obtaining a human body behavior recognition result by adopting a trained behavior recognition model based on the space-time characteristic vectors of all frame images of each video clip; wherein the training step of the trained behavior recognition model comprises the following steps:
constructing a behavior recognition model, wherein the behavior recognition model is a fully-connected neural network;
constructing a training set, wherein the training set is the space-time characteristic vectors of all frame images of a video clip with a known human behavior recognition result;
and inputting the training set into the behavior recognition model, training the behavior recognition model, and stopping training when the set number of iterations is reached or the loss function reaches its minimum value, so as to obtain the trained behavior recognition model; a schematic training loop is sketched below.
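The following is a schematic training loop only; the helper name train_classifier, the two-layer classifier, Adam, and cross-entropy are assumptions, since the disclosure specifies only a fully connected network trained until an iteration count or loss minimum is reached:

```python
import torch
import torch.nn as nn


def train_classifier(features: torch.Tensor, labels: torch.Tensor,
                     num_classes: int, epochs: int = 20, lr: float = 1e-3) -> nn.Module:
    """Train a fully connected behavior classifier on clip-level spatio-temporal vectors.

    features: (num_clips, feat_dim) spatio-temporal feature vectors of labelled clips.
    labels:   (num_clips,) behavior class indices.
    """
    model = nn.Sequential(                               # the fully connected recognition model
        nn.Linear(features.shape[1], 128),
        nn.ReLU(inplace=True),
        nn.Linear(128, num_classes),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                              # stop after a set number of iterations
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
    return model


if __name__ == "__main__":
    feats = torch.randn(32, 64)                          # stand-in training features
    labs = torch.randint(0, 5, (32,))                    # 5 hypothetical behavior classes
    clf = train_classifier(feats, labs, num_classes=5)
    print(clf(feats[:2]).argmax(dim=1))                  # predicted behavior classes
```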
Firstly, a section of video is cut into several parts of adjacent frames, each part containing the same number of video frames. The feature map of the video is then extracted with a 3D convolution operation and reshaped, per frame, from $F \in \mathbb{R}^{C\times H\times W}$ into the set of local features $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$, with $N = H \times W$. An RM module is used to compute the relation between two feature vectors in space (as shown in FIG. 2); the relations between one feature vector and all other feature vectors are concatenated into a relation vector, from which an attention score for that feature vector is generated, and the scores are fused to obtain the spatial attention. Frame-level features of different frames are then extracted with a similar relation-guided method, the frame-level features are fused and trained with the loss functions, and the category of the behavior contained in the video is obtained through multiple iterations.
The invention relates to video spatio-temporal behavior identification based on relationship guidance, which comprises three main parts: the video feature map extraction part, the relation-guided spatial feature part and the relation-guided temporal feature part.
First, the video feature map extraction part uses a 3D convolution network model, which extracts time-dimension information better while still ensuring feature extraction in the spatial dimension; compared with methods that compute features such as optical flow, it is simpler, and the number of parameters and the resources consumed during computation are markedly reduced.
The relation-guided spatial feature part performs a reshape operation on the extracted feature map, transforming it from $F \in \mathbb{R}^{C\times H\times W}$ into $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$, where $f_i$ represents the feature of the i-th position. A GRV module is used to generate, for each position, a relation vector $\mathbf{r}_i$ and to integrate it with the position feature, as shown in FIG. 4; the result then passes through fully connected layers, and the feature vectors are weighted and summed with the spatial attention scores to obtain the intra-frame feature.
As shown in fig. 2, the RM module is part of the GRV module. The RM module takes two feature vectors as input, calculates their difference, and applies a series of nonlinear and normalization operations to obtain a value that represents the degree of correlation between the two vectors.
As shown in fig. 3, the feature vectors input to the GRV module are divided into a target feature vector and the other feature vectors. The module pairs the target vector with each of the other vectors, extracts the degree-of-relationship values through the RM module, and then concatenates these values into a vector. GRV is short for Global Relation Vector; the module computes a relation vector between one feature and the global features.
Finally, the relation-guided temporal feature part uses an RM module (a module that computes the relation between two vectors) to generate feature vectors and fuses them with the original vectors to produce inter-frame features, in a manner similar to the spatial part. Unlike the spatial part, it does not use weighted summation; instead, the features of each part are summed directly in the time dimension, which strengthens the feature representation capability in the time dimension and further enhances the temporal feature representation.
When the relation between two features is calculated, an inner product or a dot product is usually adopted in the past, and the two relation modes have some problems. For example, the method using inner products usually indicates how similar the features are, and it cannot be inferred which parts are similar and which parts are different; the method using dot products involves a huge amount of computation and extracts some redundant information when representing the connections between features. To avoid the problems of the two methods, the RM module is adopted to generate a relation vector between the two features, and the vector contains abundant and compact information compared with the vector obtained by the two methods.
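The contrast can be made concrete with a small sketch (dimensions and layer sizes are arbitrary assumptions): the inner product collapses two features into one similarity number, while an RM-style difference-plus-projection keeps a compact vector that still encodes where the two features differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

f_i, f_j = torch.randn(64), torch.randn(64)

# Inner product: a single similarity number -- it cannot say *which* parts differ.
similarity = torch.dot(f_i, f_j)

# RM-style relation (sketch): embed both features, take the difference, then compress
# with a small fully connected layer -- a compact vector that keeps where they differ.
embed, fc = nn.Linear(64, 64), nn.Linear(64, 16)
relation = fc(embed(f_i) - embed(f_j))

print(similarity.shape, relation.shape)   # torch.Size([]) vs torch.Size([16])
```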
First, the video frames are input into the 3D convolution network model and a feature map carrying temporal information is extracted. The feature map is divided into multi-region feature maps, a GRV module extracts the relations among them in the spatial dimension to generate relation vectors, the relation vectors are used to generate the attention weight of each region, and the weight of each region is multiplied by the region feature and summed to obtain the feature vector of a single video frame. The GRV module is then used to compute the relations among the feature vectors of multiple video frames, the original feature vectors are combined as the predicted feature vector, and the weights are updated iteratively until the optimal weights are obtained, strengthening the behavior recognition performance.
The data set used by the technique is the Kinetics data set. Its videos come from YouTube; there are 600 categories, each category contains at least 600 videos, and each video lasts about 10 seconds. The categories fall into three main groups: human-object interactions, such as playing musical instruments; human-human interactions, such as handshaking and hugging; and sports and the like, namely person and person-object categories.
The technique is directed to behavior recognition in video and is not applicable to other information carriers.
And the video characteristic diagram extraction part extracts a characteristic diagram with space-time characteristics by using a 3D convolution model and is used for extracting a later-stage relation vector. The method comprises the following specific steps:
1) the Kinetics data set is searched and downloaded using a search engine.
2) After the data set is decompressed, the whole video is cut into T frames $V = \{v_1, v_2, \dots, v_T\}$, and the T-frame clip is input into a 3D ConvNet model; a spatio-temporal feature map carrying both temporal and spatial information is generated through multiple layers of convolution and pooling. The 3D convolution operation has the standard form

$$v^{l}_{x,y,z} = \sigma\Big(b + \sum_{c}\sum_{i}\sum_{j}\sum_{k} w^{c}_{i,j,k}\, v^{l-1,c}_{x+i,\,y+j,\,z+k}\Big),$$

and the feature map generated after the 3D convolution operations is $F \in \mathbb{R}^{T\times C\times H\times W}$.
3) A reshape operation is performed on the feature map: each frame is viewed spatially as N (N = H × W) different local regions, and the dimension of each local-region feature is kept unchanged. After the reshape the feature map becomes $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$, where the feature of the i-th local region is denoted $f_i$.
the method has the advantages that the features of the input video are extracted, and the extracted features are processed in advance, so that the follow-up work is facilitated.
The following is a spatial feature part guided by the relationship, which is a main module of the invention and is used for processing and integrating the relationship between local spatial features to improve the effect of video behavior recognition. The method comprises the following specific steps:
1) The video feature map extraction part converts an input video segment into a feature map of the form $\{f_i \in \mathbb{R}^{C}\}_{i=1}^{N}$. The relation between the feature of each part and the features of the other parts now needs to be calculated to obtain a relation vector. Assuming the relation vector of the i-th local region is currently being calculated, the feature of the i-th local region is written as $f_i$.
2) An RM module is used to compute the relation between two different region features.
Compared with other ways of generating relations between features, the relation vector produced by the RM module is markedly more effective: it is richer in information and fuses the features better.
Taking the relation between $f_i$ and $f_j$, the features of the i-th and j-th local regions, as an example, the relation vector $r_{i,j}$ of the two local regions is generated by the RM module:

$$r_{i,j} = \mathrm{RM}(f_i,\, f_j).$$

The RM module is implemented as follows: two groups of new features are obtained from the two input features through a shallow neural network $g(\cdot)$, and their difference is computed:

$$f_{di} = g(f_i) - g(f_j).$$

On the basis of this difference, the relation vector between the two regions is obtained through a full-connection and a normalization operation:

$$r_{i,j} = \mathrm{Norm}\big(\mathrm{FC}(f_{di})\big),$$

where $f_{di}$ is the feature difference between the two local regions.
3) The above operation is repeated to calculate the relation vectors between the i-th region and all other regions, and the N obtained relation vectors are combined to generate the relation vector between the i-th region and the global context:

$$\mathbf{r}_i = [\,r_{i,1},\, r_{i,2},\, \dots,\, r_{i,N}\,].$$

The generated relation vector $\mathbf{r}_i$ and the feature vector $f_i$ are combined, and after a full-connection and normalization processing a Sigmoid activation function is used to generate the spatial attention score $a_i$:

$$a_i = \mathrm{Sigmoid}\big(\mathrm{Norm}(W_A\,[\,\mathbf{r}_i;\, f_i\,])\big),$$

where $W_A$ is the weight of the shallow neural network. Finally, the attention scores and the spatial features of all parts are weighted and summed to generate the feature vector $f$ of the frame:

$$f = \sum_{i=1}^{N} a_i\, f_i.$$
Through the above operations, the relation-guided spatial feature vector is obtained, and a spatial feature fused with the relation vector is generated for each frame.
Next, in a similar manner, relation vectors between frames are generated; low-quality frame features are down-weighted or filtered out, and the refined and aggregated relations between frames enhance the behavior recognition capability. The method comprises the following specific steps:
1) After passing through the 3D convolutional neural network and the relation-guided spatial feature module, a video segment is converted into frame-level feature vectors $f_t$. The RM module is used to extract the relations between the temporal frame features, and each time frame generates a relation vector with respect to the global context:

$$\mathbf{r}_t = [\,\mathrm{RM}(f_t, f_1),\, \mathrm{RM}(f_t, f_2),\, \dots,\, \mathrm{RM}(f_t, f_T)\,].$$

2) The generated relation vector $\mathbf{r}_t$ and the feature vector $f_t$ are combined, and the combined vector is passed through a full-connection and a normalization operation to generate the relation-guided temporal feature vector $\tilde{f}_t$:

$$\tilde{f}_t = \mathrm{Norm}\big(\mathrm{FC}([\,\mathbf{r}_t;\, f_t\,])\big).$$

The generated feature vectors are then averaged to produce the video-level feature vector used for the final behavior recognition.
Example two
The embodiment provides a human behavior recognition system based on the relation guide video space-time characteristics;
human behavior recognition system based on relation guide video space-time characteristics includes:
a partitioning module configured to: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
a feature map extraction module configured to: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
a spatial feature vector extraction module configured to: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
a temporal feature vector extraction module configured to: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
an averaging operation module configured to: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
a human behavior recognition module configured to: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
It should be noted here that the dividing module, the feature map extracting module, the spatial feature vector extracting module, the temporal feature vector extracting module, the averaging module and the human behavior recognizing module correspond to steps S101 to S106 in the first embodiment, and the modules are the same as the corresponding steps in the implementation example and the application scenario, but are not limited to the disclosure in the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, among other storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. The human body behavior identification method based on the relation guide video space-time characteristics is characterized by comprising the following steps:
acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
2. The method for recognizing human body behaviors based on the relation-oriented video spatio-temporal features as claimed in claim 1, wherein feature map extraction is performed on each frame image of each video clip to obtain a feature map of each frame image; the method specifically comprises the following steps:
and respectively extracting the feature map of each frame image of each video clip based on the 3D convolutional neural network to obtain the feature map of each frame image.
3. The method for recognizing human body behaviors based on relationship-guided video spatio-temporal features as claimed in claim 1, wherein a plurality of different regions are divided into the feature map of each frame image, and spatial relationship extraction is performed between the regions to obtain the spatial feature vector of each frame image based on relationship guidance; the method specifically comprises the following steps:
carrying out reshape operation on the feature map of each frame of image, and dividing the feature map into N local areas, wherein N is H multiplied by W; h represents the number of rows and W represents the number of columns;
performing difference calculation on the spatial feature representation of the ith local area and the spatial feature representation of the jth local area of the feature map to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the ith local area and the jth local area;
repeating the previous step to obtain the relation vectors of the ith local area of the current feature map and all local areas of the current feature map;
combining the ith local area of the current feature map with the relationship vectors of all local areas of the current feature map to obtain the relationship vector between the ith local area of the current feature map and the whole situation of the current feature map;
combining a relation vector between the ith local area of the current feature map and the global situation of the current feature map with the spatial feature representation of the ith local area of the current feature map, sequentially performing full-connection operation and normalization operation on a combined result, and finally processing the result after the normalization operation by using a sigmoid activation function to generate a spatial attention score of the ith local area of the current feature map;
and carrying out weighted summation on the spatial attention score of each local area and the original spatial feature representation of each local area to obtain a spatial feature vector based on relationship guidance.
4. The method for recognizing human body behaviors based on relationship-guided video space-time features as claimed in claim 1, wherein the relationship-guided spatial feature vectors of all the frame images are subjected to time relationship extraction to obtain the relationship-guided temporal feature vectors of each frame image; the method specifically comprises the following steps:
performing time feature extraction on the spatial relationship vector of the p frame image based on relationship guidance to obtain the time feature of the p frame image; performing time feature extraction on the q frame image based on the spatial relationship vector guided by the relationship to obtain the time feature of the q frame image;
performing difference calculation on the time characteristics of the p frame image and the time characteristics of the q frame image to obtain a difference result; sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the time characteristic of the p frame image and the time characteristic of the q frame image;
repeating the two steps to obtain a relation vector of the time characteristic of the p-th frame image and the time characteristics of all other frame images in the current video clip;
combining the time characteristics of the p-th frame image with the relationship vectors of the time characteristics of all other frame images in the current video clip to obtain the relationship vectors between the time characteristics of the p-th frame image and the global characteristics of the time dimension of the current video clip;
splicing the relation vector between the time characteristic of the p frame image and the global characteristic of the time dimension of the current video clip with the time characteristic of the p frame image, and sequentially carrying out full-connection operation and normalization operation on the spliced result to obtain a time characteristic vector of the p frame image based on relation guidance;
and similarly, obtaining the time characteristic vector of each frame of image based on the relation guidance.
5. The human behavior recognition method based on the relation-guided video spatio-temporal features as claimed in claim 1, wherein the human behavior recognition result is obtained by adopting a trained behavior recognition model based on spatio-temporal feature vectors of all frame images of each video segment; wherein the training step of the trained behavior recognition model comprises the following steps:
constructing a behavior recognition model, wherein the behavior recognition model is a fully-connected neural network;
constructing a training set, wherein the training set is the space-time characteristic vectors of all frame images of a video clip with a known human behavior recognition result;
and inputting the training set into the behavior recognition model, training the behavior recognition model, and stopping training when the iteration times are reached or the loss function reaches the minimum value to obtain the trained behavior recognition model.
6. The human behavior recognition system based on the relation guide video space-time characteristics is characterized by comprising:
a partitioning module configured to: acquiring a video to be identified; dividing a video to be identified into a plurality of video segments according to a set frame number;
a feature map extraction module configured to: respectively extracting a characteristic diagram of each frame image of each video clip to obtain the characteristic diagram of each frame image;
a spatial feature vector extraction module configured to: dividing the characteristic diagram of each frame of image into a plurality of different regions, and extracting the spatial relationship among the regions to obtain a spatial characteristic vector of each frame of image based on relationship guidance;
a temporal feature vector extraction module configured to: extracting the time relation of the space feature vectors of all the frame images based on the relation guide to obtain the time feature vector of each frame image based on the relation guide;
an averaging operation module configured to: averaging the time characteristic vectors of each frame of image based on the relation guidance to obtain the space-time characteristic vector of the video clip;
a human behavior recognition module configured to: and obtaining a human body behavior recognition result by adopting the trained behavior recognition model based on the space-time characteristic vectors of all the frame images of each video clip.
7. The system according to claim 6, wherein the feature map of each frame of image is divided into a plurality of different regions, and the spatial relationship between the regions is extracted to obtain the spatial feature vector of each frame of image based on the relationship guide; the method specifically comprises the following steps:
carrying out reshape operation on the feature map of each frame of image, and dividing the feature map into N local areas, wherein N is H multiplied by W; h represents the number of rows and W represents the number of columns;
performing difference calculation on the spatial feature representation of the ith local area and the spatial feature representation of the jth local area of the feature map to obtain a difference result;
sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the ith local area and the jth local area;
repeating the previous step to obtain the relation vectors of the ith local area of the current feature map and all local areas of the current feature map;
combining the ith local area of the current feature map with the relationship vectors of all local areas of the current feature map to obtain the relationship vector between the ith local area of the current feature map and the whole situation of the current feature map;
combining a relation vector between the ith local area of the current feature map and the global situation of the current feature map with the spatial feature representation of the ith local area of the current feature map, sequentially performing full-connection operation and normalization operation on a combined result, and finally processing the result after the normalization operation by using a sigmoid activation function to generate a spatial attention score of the ith local area of the current feature map;
and carrying out weighted summation on the spatial attention score of each local area and the original spatial feature representation of each local area to obtain a spatial feature vector based on relationship guidance.
8. The system for recognizing human body behaviors based on relationship-guided video space-time characteristics according to claim 6, wherein the relationship-guided spatial characteristic vectors of all the frame images are subjected to time relationship extraction to obtain the relationship-guided temporal characteristic vector of each frame image; the method specifically comprises the following steps:
performing time feature extraction on the spatial relationship vector of the p frame image based on relationship guidance to obtain the time feature of the p frame image; performing time feature extraction on the q frame image based on the spatial relationship vector guided by the relationship to obtain the time feature of the q frame image;
performing difference calculation on the time characteristics of the p frame image and the time characteristics of the q frame image to obtain a difference result; sequentially performing full-connection operation and normalization operation on the difference result to obtain a relation vector of the time characteristic of the p frame image and the time characteristic of the q frame image;
repeating the two steps to obtain a relation vector of the time characteristic of the p-th frame image and the time characteristics of all other frame images in the current video clip;
combining the time characteristics of the p-th frame image with the relationship vectors of the time characteristics of all other frame images in the current video clip to obtain the relationship vectors between the time characteristics of the p-th frame image and the global characteristics of the time dimension of the current video clip;
splicing the relation vector between the time characteristic of the p frame image and the global characteristic of the time dimension of the current video clip with the time characteristic of the p frame image, and sequentially carrying out full-connection operation and normalization operation on the spliced result to obtain a time characteristic vector of the p frame image based on relation guidance;
and similarly, obtaining the time characteristic vector of each frame of image based on the relation guidance.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-5.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 5.
CN202110098237.8A 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics Pending CN112836609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098237.8A CN112836609A (en) 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110098237.8A CN112836609A (en) 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics

Publications (1)

Publication Number Publication Date
CN112836609A true CN112836609A (en) 2021-05-25

Family

ID=75931412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098237.8A Pending CN112836609A (en) 2021-01-25 2021-01-25 Human behavior identification method and system based on relation guide video space-time characteristics

Country Status (1)

Country Link
CN (1) CN112836609A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408448A (en) * 2021-06-25 2021-09-17 之江实验室 Method and device for extracting local features of three-dimensional space-time object and identifying object
CN113435578A (en) * 2021-06-25 2021-09-24 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN114926770A (en) * 2022-05-31 2022-08-19 上海人工智能创新中心 Video motion recognition method, device, equipment and computer readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160295A (en) * 2019-12-31 2020-05-15 广州视声智能科技有限公司 Video pedestrian re-identification method based on region guidance and space-time attention

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160295A (en) * 2019-12-31 2020-05-15 广州视声智能科技有限公司 Video pedestrian re-identification method based on region guidance and space-time attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XINGZE LI,ET AL.: "Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification", 《PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408448A (en) * 2021-06-25 2021-09-17 之江实验室 Method and device for extracting local features of three-dimensional space-time object and identifying object
CN113435578A (en) * 2021-06-25 2021-09-24 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN113435578B (en) * 2021-06-25 2022-04-05 重庆邮电大学 Feature map coding method and device based on mutual attention and electronic equipment
CN114926770A (en) * 2022-05-31 2022-08-19 上海人工智能创新中心 Video motion recognition method, device, equipment and computer readable storage medium
CN114926770B (en) * 2022-05-31 2024-06-07 上海人工智能创新中心 Video motion recognition method, apparatus, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN112836609A (en) Human behavior identification method and system based on relation guide video space-time characteristics
CN110765246B (en) Question and answer method and device based on intelligent robot, storage medium and intelligent device
WO2021057056A1 (en) Neural architecture search method, image processing method and device, and storage medium
CN112541904B (en) Unsupervised remote sensing image change detection method, storage medium and computing device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112446379B (en) Self-adaptive intelligent processing method for dynamic large scene
CN111259940A (en) Target detection method based on space attention map
CN111046821A (en) Video behavior identification method and system and electronic equipment
CN110516734B (en) Image matching method, device, equipment and storage medium
CN114612832A (en) Real-time gesture detection method and device
CN111915650A (en) Target tracking method and system based on improved twin network
CN114998601B (en) On-line update target tracking method and system based on Transformer
CN113221680B (en) Text pedestrian retrieval method based on text dynamic guiding visual feature extraction
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
He et al. Transvcl: Attention-enhanced video copy localization network with flexible supervision
US20210056353A1 (en) Joint representation learning from images and text
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN113723352B (en) Text detection method, system, storage medium and electronic equipment
CN108197660A (en) Multi-model Feature fusion/system, computer readable storage medium and equipment
CN114677611B (en) Data identification method, storage medium and device
CN111914809B (en) Target object positioning method, image processing method, device and computer equipment
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
Zhang et al. CAM R-CNN: End-to-end object detection with class activation maps
CN113761282A (en) Video duplicate checking method and device, electronic equipment and storage medium
Liu et al. Joint learning of image aesthetic quality assessment and semantic recognition based on feature enhancement

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20210525)