CN112836602B - Behavior recognition method, device, equipment and medium based on space-time feature fusion - Google Patents

Behavior recognition method, device, equipment and medium based on space-time feature fusion

Info

Publication number
CN112836602B
Authority
CN
China
Prior art keywords
features
feature
processed
space
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110079906.7A
Other languages
Chinese (zh)
Other versions
CN112836602A (en)
Inventor
梁添才
蔡德利
赵清利
徐天适
王乃洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xinyi Technology Co Ltd
Original Assignee
Shenzhen Xinyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xinyi Technology Co Ltd filed Critical Shenzhen Xinyi Technology Co Ltd
Priority to CN202110079906.7A priority Critical patent/CN112836602B/en
Publication of CN112836602A publication Critical patent/CN112836602A/en
Application granted granted Critical
Publication of CN112836602B publication Critical patent/CN112836602B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/40Analysis of texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a behavior recognition method, device, equipment and medium based on space-time feature fusion. The method comprises: obtaining video frames to be processed and unifying their sizes; extracting shallow features from the video frames to be processed; extracting deep features from the video frames to be processed according to the shallow features; extracting a space-time 2D feature layer from the video frames to be processed; and identifying the behavior category of the target object in the video frames to be processed according to the deep features and the space-time 2D feature layer. By replacing 3D convolution with 2D convolution, the invention effectively reduces the computational cost of the network while maintaining recognition performance, ensuring both recognition accuracy and real-time operation, and can be widely applied in the technical field of computer vision.

Description

Behavior recognition method, device, equipment and medium based on space-time feature fusion
Technical Field
The invention relates to the technical field of computer vision, in particular to a behavior recognition method, device, equipment and medium based on space-time feature fusion.
Background
Behavior recognition is an important area of computer vision. Its main task is to automatically analyze the ongoing actions of a target from video, and it plays an important role in video surveillance, robot interaction and other applications.
With the continuous development of deep learning, the performance of video understanding and behavior analysis has improved greatly, and behavior recognition technology has advanced remarkably. The current mainstream behavior recognition methods fall into two-stream methods, human-skeleton-based recognition methods, 3D-convolutional-network-based methods and the like. A two-stream method uses two kinds of information, the RGB images of the video frames and the optical flow: two deep convolutional networks are designed to extract information from the video frames and from the optical flow of the video respectively, and the results of the two networks are then fused to obtain the final behavior recognition result. The prior art mentions the use of two-stream methods for behavior recognition. Two-stream methods can achieve high accuracy, but they must extract the optical flow of the video, which is computationally inefficient and prevents real-time behavior recognition. Human-skeleton-based recognition methods perform behavior recognition from the positions of human key points (head, hands, feet and the like) in the video frames, which effectively reduces the parameters of the behavior recognition model. In the related prior art, behavior recognition is performed on the human key points of video frames; such skeleton-based methods depend on the accuracy of the detected key points, key-point extraction is time-consuming, and low key-point accuracy leads to low final recognition accuracy. Methods based on 3D convolutional networks feed the video directly into a 3D convolutional network, extract features along the temporal and spatial dimensions of the video, and ultimately achieve high behavior recognition accuracy. The related art also mentions the ECO algorithm, which combines 2D convolution and 3D convolution to obtain higher accuracy in behavior recognition. However, because 3D convolution is computationally expensive, inference speed suffers, and online, real-time behavior recognition is difficult to achieve in practical applications.
Disclosure of Invention
In view of the above, embodiments of the invention provide a behavior recognition method, device, equipment and medium based on space-time feature fusion, which require little computation while offering high accuracy and good real-time performance.
One aspect of the invention provides a behavior recognition method based on space-time feature fusion, comprising the following steps:
obtaining video frames to be processed, and unifying the sizes of the video frames to be processed;
extracting shallow features from the video frames to be processed;
extracting deep features from the video frames to be processed according to the shallow features;
extracting a space-time 2D feature layer from the video frames to be processed;
and identifying the behavior category of the target object in the video frames to be processed according to the deep features and the space-time 2D feature layer.
Preferably, the obtaining of the video frames to be processed and the unifying of their sizes comprise:
acquiring continuously input video content;
extracting N frames of images from the video content;
and setting the size of each of the N frames of images to 224x224.
Preferably, the shallow features include texture features and detail features of the image;
the shallow layer characteristics in the video frame to be processed are extracted specifically as follows:
and extracting the image through an InceptionV2 convolution network to obtain InceptionV2-3c characteristics.
Preferably, the deep features include contour features, shape features, and most prominent features of the image;
and extracting deep features in the video frame to be processed according to the shallow features, wherein the extracting comprises the following steps:
extracting the InceptionV2-3c features through an InceptionV2 convolution network to obtain the Pooling features.
Preferably, the extracting of the space-time 2D feature layer from the video frames to be processed comprises:
extracting the InceptionV2-3c features through a space-time 2D convolution module to obtain temporal features and spatial features;
and carrying out average pooling on the temporal features and the spatial features to obtain the space-time 2D features.
Preferably, the extracting of the InceptionV2-3c features through the space-time 2D convolution module to obtain the temporal features and the spatial features comprises:
performing a dimension reduction operation on the InceptionV2-3c features to obtain the temporal features;
sequentially performing a normalization operation, a ReLU activation operation and a 3x3 convolution operation on the InceptionV2-3c features to obtain a first feature;
sequentially performing a normalization operation, a ReLU activation operation, a first 3x3 convolution operation and a second 3x3 convolution operation on the InceptionV2-3c features to obtain a second feature;
and adding the first feature and the second feature to obtain a third feature, which serves as the spatial features.
Preferably, the method further comprises:
training a behavior recognition model;
and,
and testing the behavior recognition model.
Another aspect of the embodiment of the present invention further provides a behavior recognition device based on space-time feature fusion, including:
the acquisition module is used for acquiring the video frames to be processed and unifying the sizes of the video frames to be processed;
the first extraction module is used for extracting shallow layer characteristics in the video frame to be processed;
the second extraction module is used for extracting deep features in the video frame to be processed according to the shallow features;
the third extraction module is used for extracting the space-time 2D characteristic layers in the video frames to be processed;
and the identification module is used for identifying the behavior category of the target object in the video frame to be processed according to the deep layer characteristics and the space-time 2D characteristic layer.
Another aspect of the embodiment of the invention also provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a program that is executed by a processor to implement a method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
The method comprises the steps of firstly obtaining a video frame to be processed, and unifying the sizes of the video frame to be processed; extracting shallow layer characteristics in the video frame to be processed; extracting deep features in the video frame to be processed according to the shallow features; extracting space-time 2D characteristic layers in the video frames to be processed; and identifying the behavior category of the target object in the video frame to be processed according to the deep features and the space-time 2D feature layer. According to the embodiment of the invention, a 2D convolution mode is used for replacing 3D convolution, so that the calculated amount of a network can be effectively reduced, meanwhile, the identification performance can be maintained in behavior identification, and the identification accuracy and instantaneity are ensured.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps provided in an embodiment of the present invention;
FIG. 2 is a general block diagram of a network structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of five-dimensional features provided in an embodiment of the present invention;
FIG. 4 is a schematic diagram of one dimension of a five-dimensional feature provided by an embodiment of the present invention;
FIG. 5 is a schematic drawing of one dimension of a five-dimensional feature provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a feature effect after a dimension reduction operation according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a network structure of a sample 2D module according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a Res block structure according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a block1 according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of block2 according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a model training process provided in an embodiment of the present invention;
fig. 12 is a schematic diagram of a model test flow provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In view of the problems in the prior art, an embodiment of the invention provides a behavior recognition method based on space-time feature fusion which, as shown in fig. 1, comprises the following steps:
obtaining video frames to be processed, and unifying the sizes of the video frames to be processed;
extracting shallow features from the video frames to be processed;
extracting deep features from the video frames to be processed according to the shallow features;
extracting a space-time 2D feature layer from the video frames to be processed;
and identifying the behavior category of the target object in the video frames to be processed according to the deep features and the space-time 2D feature layer.
Preferably, the obtaining of the video frames to be processed and the unifying of their sizes comprise:
acquiring continuously input video content;
extracting N frames of images from the video content;
and setting the size of each of the N frames of images to 224x224.
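By way of illustration only, the following sketch shows one possible way to realize this preprocessing with OpenCV. It is not the patented implementation (which is built in Caffe): uniform frame sampling and the function name preprocess_video are assumptions, and the BGR channel means (107, 117, 123) are the values given later in the training and test flows.

```python
# Minimal sketch of the video preprocessing step, assuming OpenCV is available.
# Uniform sampling of the N frames is an assumption; the patent only states
# that N frames are taken from the video and resized to 224x224.
import cv2
import numpy as np

BGR_MEAN = np.array([107.0, 117.0, 123.0])   # per-channel mean used later in training/testing

def preprocess_video(path: str, n_frames: int = 8) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))            # unify the frame size
        frames.append(frame.astype(np.float32) - BGR_MEAN)
    cap.release()
    if not frames:
        raise ValueError("no frames could be read from " + path)
    return np.stack(frames)                               # (N, 224, 224, 3), BGR order
```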
Preferably, the shallow features include texture features and detail features of the image;
the shallow layer characteristics in the video frame to be processed are extracted specifically as follows:
and extracting the image through an InceptionV2 convolution network to obtain InceptionV2-3c characteristics.
Preferably, the deep features include contour features, shape features, and most prominent features of the image;
and extracting deep features in the video frame to be processed according to the shallow features, wherein the extracting comprises the following steps:
extracting the InceptionV2-3c features through an InceptionV2 convolution network to obtain the Pooling features.
Preferably, the extracting of the space-time 2D feature layer from the video frames to be processed comprises:
extracting the InceptionV2-3c features through a space-time 2D convolution module to obtain temporal features and spatial features;
and carrying out average pooling on the temporal features and the spatial features to obtain the space-time 2D features.
Preferably, the extracting of the InceptionV2-3c features through the space-time 2D convolution module to obtain the temporal features and the spatial features comprises:
performing a dimension reduction operation on the InceptionV2-3c features to obtain the temporal features;
sequentially performing a normalization operation, a ReLU activation operation and a 3x3 convolution operation on the InceptionV2-3c features to obtain a first feature;
sequentially performing a normalization operation, a ReLU activation operation, a first 3x3 convolution operation and a second 3x3 convolution operation on the InceptionV2-3c features to obtain a second feature;
and adding the first feature and the second feature to obtain a third feature, which serves as the spatial features.
Preferably, the method further comprises:
training a behavior recognition model;
and,
and testing the behavior recognition model.
The behavior recognition method of the present invention is described in detail below with reference to the accompanying drawings. Fig. 2 is a general block diagram of the network structure provided by an embodiment of the present invention. In this embodiment, a network structure for behavior recognition based on space-time feature fusion is provided, including:
(1) Video preprocessing module: used for extracting video frames and unifying their sizes;
(2) Shallow feature extraction module: used for extracting shallow features of the video frames, including features such as image texture;
(3) Deep feature extraction module: used for extracting deep features of the video frames, including features such as target contour and shape;
(4) Space-time 2D convolution module: used for extracting and fusing the space-time features of the video frames;
(5) Classification and recognition module: used for calculating the classification loss and obtaining the behavior category of the target.
The overall block diagram of the network architecture is shown in fig. 2. The network structure aims to improve the ECO network model: a space-time feature fusion strategy is proposed in which the shallow features of the network are further processed and fused with temporal and spatial features, replacing the 3D convolution of the ECO model, reducing the model parameters of the network and increasing the speed of target behavior recognition.
It should be noted that, in the embodiment of the present invention, the shallow network extracts texture and detail features, while the deep network extracts contour, shape and the most salient features. The shallow network retains more features and also has the capability of extracting key features. The deeper the layers, the more representative the extracted features and the lower the resolution of the feature maps. The extraction processes of the shallow network and the deep network are essentially the same, with deep feature extraction carried out on the basis of the shallow features. As shown in fig. 2, both networks are implemented using the InceptionV2 convolutional network.
Specifically, the shallow feature extraction module is composed of layers 3a to 3c of the InceptionV2 convolutional network and mainly extracts shallow features of the images, finally yielding the InceptionV2-3c features. The deep feature extraction module is composed of layers 4a to 5b of the InceptionV2 convolutional network and a Pooling layer, and mainly extracts deep features of the images, finally yielding the Pooling features.
In the overall block diagram of the network architecture shown in fig. 2, the video preprocessing module receives continuously input video, takes N frames of images from the video, and sets the size of each frame to 224x224. The N frames of images are input into the shallow feature extraction module, which is composed of layers 3a to 3c of the InceptionV2 convolutional network and mainly extracts shallow features of the images, finally yielding the InceptionV2-3c features.
The deep feature extraction module is composed of layers 4a to 5b of the InceptionV2 convolutional network and a Pooling layer, and is mainly used for extracting deep features of the images, finally yielding the Pooling features.
The InceptionV2-3c features are input into the space-time 2D convolution module, which mainly extracts the temporal features and spatial features of the N frames of images and fuses the two to obtain the space-time 2D features.
The classification and recognition module is composed of a fully connected layer and a Softmax function, and is mainly used for obtaining the final behavior classification result.
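The data flow described above can be summarized in a minimal PyTorch-style sketch. This is an assumption-laden illustration, not the patented Caffe implementation: shallow_net, deep_net and st2d_module are placeholders for the InceptionV2 3a-3c layers, the InceptionV2 4a-5b layers with pooling, and the space-time 2D convolution module, and the temporal averaging of the per-frame deep features as well as the feature dimensions are assumed.

```python
# Minimal PyTorch-style sketch of the overall data flow in fig. 2.
# All sub-modules passed to the constructor are hypothetical placeholders.
import torch
import torch.nn as nn

class BehaviorRecognitionNet(nn.Module):
    def __init__(self, shallow_net, deep_net, st2d_module, num_classes=400,
                 deep_dim=1024, st2d_dim=512):
        super().__init__()
        self.shallow_net = shallow_net    # InceptionV2 layers 3a-3c
        self.deep_net = deep_net          # InceptionV2 layers 4a-5b + pooling
        self.st2d_module = st2d_module    # space-time 2D convolution module
        self.classifier = nn.Linear(deep_dim + st2d_dim, num_classes)

    def forward(self, frames):
        # frames: (B, T, 3, 224, 224) -- N preprocessed frames per video
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                          # (B*T, 3, 224, 224)
        shallow = self.shallow_net(x)                     # InceptionV2-3c features per frame
        deep = self.deep_net(shallow).view(b, t, -1).mean(dim=1)   # pooled deep features (aggregation assumed)
        # regroup per-frame shallow features into (B, C, T, H', W') for the space-time module
        st_in = shallow.view(b, t, *shallow.shape[1:]).permute(0, 2, 1, 3, 4)
        st2d = self.st2d_module(st_in)                    # space-time 2D features, (B, st2d_dim)
        fused = torch.cat([deep, st2d], dim=1)            # concatenate deep + space-time 2D features
        return self.classifier(fused)                     # logits; Softmax/loss applied outside
```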
The space-time 2D convolution module comprises a dimension reduction module and a sample 2D module.
The dimension reduction module is mainly used for converting five-dimensional features into four-dimensional features. The specific operation is as follows: the five-dimensional features are five-dimensional data of size (B, C, T, H, W), where the dimensions are the number (batch) dimension B of the features, the channel dimension C, the time dimension T, the height dimension H and the width dimension W. Taking B=C=1 as an example, the feature is illustrated in fig. 3. Fig. 4 shows the two-dimensional (H, W) tensor of fig. 3 at one time step. After stretching, the tensor of fig. 4 is converted into a one-dimensional tensor of length H*W, as shown in fig. 5. The dimension reduction operation is thereby realized, with the effect shown in fig. 6. The same operation is performed for all other time steps, finally yielding the four-dimensional features.
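Read together with figs. 3-6, the dimension reduction amounts to stretching the (H, W) plane of every time step into a vector of length H*W, turning a (B, C, T, H, W) tensor into a (B, C, T, H*W) tensor. The following PyTorch sketch illustrates this reading; the interpretation and the example shapes are assumptions, not a statement of the patented implementation.

```python
import torch

def reduce_dimension(x: torch.Tensor) -> torch.Tensor:
    """Convert a five-dimensional feature (B, C, T, H, W) into a
    four-dimensional feature (B, C, T, H*W) by stretching each (H, W)
    slice into a one-dimensional vector, as illustrated in figs. 3-6."""
    b, c, t, h, w = x.shape
    return x.reshape(b, c, t, h * w)

# Example (shapes are purely illustrative, not taken from the patent):
x = torch.randn(1, 96, 8, 28, 28)       # B=1, C=96 channels, T=8 frames, 28x28 spatial
y = reduce_dimension(x)
print(y.shape)                          # torch.Size([1, 96, 8, 784])
```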
The sample 2D module includes three Res blocks and one Pooling layer, as shown in fig. 7. As shown in fig. 8, each Res block is divided into two parts, block1 and block2. As shown in fig. 9, in block1 the input features sequentially undergo a BatchNorm batch normalization operation, a ReLU activation and a 3x3 convolution to obtain new features, which are finally combined with the input features by an Eltwise-Sum operation.
As shown in fig. 10, in block2 the input features repeatedly undergo the sequence of BatchNorm batch normalization, ReLU activation and 3x3 convolution, where the first 3x3 convolution uses a stride of 2, finally giving a new feature block2-1; in parallel, the input features pass through a 3x3 convolution with a stride of 2 to give the feature block2-2, and the two features block2-1 and block2-2 are then combined by an Eltwise operation (feature addition) to obtain the output features of block2.
The InceptionV2-3c features pass sequentially through the three Res blocks and the Pooling layer, where they are average-pooled, finally giving the space-time 2D features.
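For illustration, a PyTorch-style sketch of one Res block as read from figs. 8-10 is given below, applied to the dimension-reduced (B, C, T, H*W) tensor treated as a 2D feature map. The channel counts, the ordering of block1 before block2, and the placement of BatchNorm/ReLU before the second convolution of block2 are assumptions; only the BatchNorm -> ReLU -> 3x3 convolution pattern, the stride-2 first convolution, and the Eltwise additions follow the text.

```python
import torch
import torch.nn as nn

class Block1(nn.Module):
    """Fig. 9: BatchNorm -> ReLU -> 3x3 conv, then Eltwise-Sum with the input."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv(torch.relu(self.bn(x)))     # Eltwise-Sum (feature addition)

class Block2(nn.Module):
    """Fig. 10: main branch BatchNorm -> ReLU -> 3x3 conv (stride 2) -> 3x3 conv,
    shortcut branch 3x3 conv (stride 2); the two branches are added (Eltwise)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1)
        self.shortcut = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, x):
        main = self.conv2(self.conv1(torch.relu(self.bn(x))))   # block2-1
        return main + self.shortcut(x)                          # block2-1 + block2-2

class ResBlock(nn.Module):
    """Fig. 8: a Res block consisting of block1 followed by block2 (order assumed)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block1 = Block1(in_channels)
        self.block2 = Block2(in_channels, out_channels)

    def forward(self, x):
        return self.block2(self.block1(x))
```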
In addition, the embodiment of the invention provides a network structure and a recognition method for behavior recognition based on space-time feature fusion, wherein the network structure comprises a video preprocessing module, a shallow feature extraction module, a deep feature extraction module, a space-time 2D convolution module and a classification recognition module. The recognition method is based on the network structure and comprises a training part and a testing part of a behavior recognition model, wherein the network structures of the two parts are the same.
The practice of the present invention is described below in terms of training and testing on the Kinetics behavior recognition dataset. The videos of the Kinetics dataset come from YouTube; each video contains only one behavior category, there are 400 categories in total, and each video frame has a resolution of 320x240. The Kinetics training set contains 236180 videos and the test set contains 19905 videos. Training and testing are implemented with the Caffe framework, and the graphics card used in the experiments is a Tesla V100.
The model training flow chart of the invention is shown in fig. 11, and comprises the following specific steps:
(1) Extracting all video frames of each video of the Kinetics training data set, wherein the format of the video frames is JPG;
(2) The total number of training iterations is set to 120000 and the initial learning rate to 0.001: the learning rate is 0.001 for iterations 0-69999, 0.0001 for iterations 70000-95999 and 0.00001 for iterations 96000-120000, and the optimizer is the SGD stochastic gradient descent method. The training batch size is set to 10, the initial iteration count of the model is set to 0, the model snapshot interval is set to 2000, and an ECO model trained for 30000 iterations on the Kinetics dataset is adopted as the pre-training model.
(3) The iteration count of the model is incremented by 1 and the training process of the network continues.
(4) A batch of batch-size videos is taken at random from the training set, the preprocessing module randomly takes N video frames from each video, the size of each video frame is set to 224x224, and the video frames are processed with operations such as flipping, scaling and subtraction of the per-channel mean (107, 117, 123) of the BGR image.
(5) The shallow feature extraction module adopts the InceptionV2 3a-3c network structure to extract shallow features from the N preprocessed video frames, obtaining the InceptionV2-3c shallow feature layer.
(6) The deep feature extraction module adopts the InceptionV2 4a-5b network structure to extract deep features of the video frames from the InceptionV2-3c feature layer, obtaining the InceptionV2-5b deep feature layer.
(7) The space-time 2D convolution module extracts and fuses the temporal and spatial features of the video frames from the InceptionV2-3c shallow feature layer, finally obtaining the space-time 2D feature layer.
(8) The classification and recognition module performs behavior classification on the feature layer formed by concatenating the InceptionV2-5b deep feature layer and the space-time 2D feature layer, and the behavior classification loss is computed with a fully connected layer and a Softmax function.
(9) It is judged whether the iteration count is divisible by the model snapshot interval; if so, the model parameters are saved.
(10) The behavior classification loss is back-propagated using stochastic gradient descent, and the parameters of the shallow feature extraction module, the deep feature extraction module, the space-time 2D convolution module and the classification and recognition module are updated respectively.
(11) It is judged whether the iteration count is greater than or equal to the total number of training iterations; if so, training of the model ends, otherwise the process returns to step (3).
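The hyper-parameters in step (2) and the snapshot rule in step (9) can be summarized as follows. Since the patent trains with a Caffe solver, the Python sketch below is only an illustration of the schedule; all function and constant names are hypothetical.

```python
def learning_rate(iteration: int) -> float:
    """Piecewise-constant learning rate from step (2):
    0.001 for iterations 0-69999, 0.0001 for 70000-95999, 0.00001 for 96000-120000."""
    if iteration < 70000:
        return 0.001
    if iteration < 96000:
        return 0.0001
    return 0.00001

MAX_ITER = 120000          # total training iterations
BATCH_SIZE = 10            # videos per training batch
SNAPSHOT_INTERVAL = 2000   # step (9): save the model every 2000 iterations

def should_snapshot(iteration: int) -> bool:
    return iteration % SNAPSHOT_INTERVAL == 0

# Example: the schedule at a few representative iterations
for it in (0, 69999, 70000, 95999, 96000, 120000):
    print(it, learning_rate(it), should_snapshot(it))
```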
The flow chart of the model test of the invention is shown in fig. 12, and the specific steps are as follows:
(1) All video frames of each video of the Kinetics test dataset are extracted, with the video frames in JPG format.
(2) The test model is set: the model trained for 120000 iterations on the Kinetics dataset is adopted as the test model, the batch size during testing is set to 1, and the number of test iterations test_iter is set to 4000.
(3) The video preprocessing module randomly takes N video frames from each of the batch-size videos, the size of each video frame is set to 224x224, and the per-channel mean (107, 117, 123) of the BGR image is subtracted from each video frame.
(4) Consistent with step (5) of the training part, the N preprocessed video frames pass through the shallow feature extraction module to obtain the shallow features of the video frames.
(5) Consistent with step (6) of the training part, the shallow feature layer passes through the deep feature extraction module to obtain the deep features of the video frames.
(6) Consistent with step (7) of the training part, the shallow feature layer passes through the space-time 2D convolution module to obtain the space-time 2D feature layer of the video frames.
(7) The classification and recognition module applies a fully connected layer and a Softmax function to the deep feature layer and the space-time 2D feature layer to obtain the behavior category of the video.
(8) It is judged whether all test videos have been input into the network for testing; if so, the test ends and the test results are output; otherwise, the process returns to step (3) to continue testing.
The behavior recognition network provided by the invention is compared and evaluated in detail against the classical ECO behavior recognition algorithm. With the number of video frames N set to 8, verification tests are carried out on the mainstream behavior recognition datasets Kinetics, UCF101 and SomethingV2. Table 1 below compares the test results of the behavior recognition method of the present invention and the ECO algorithm.
TABLE 1
As can be seen from Table 1, the accuracy of the present invention is almost identical to that of ECO on the Kinetics and UCF101 datasets, and on the SomethingV2 dataset the invention achieves the same accuracy as ECO, showing that the invention performs behavior recognition effectively. On the other hand, because the space-time 2D convolution module replaces the original 3D convolution, the number of parameters is greatly reduced, the model size is far smaller than that of the ECO model, and the storage space occupied by the model is greatly reduced. In terms of inference time, the model takes only 20 ms, a reduction of 23.1% compared with the ECO model, so the invention can achieve online, real-time behavior recognition.
The embodiment of the invention also provides a behavior recognition device based on space-time feature fusion, which comprises:
the acquisition module is used for acquiring the video frames to be processed and unifying the sizes of the video frames to be processed;
the first extraction module is used for extracting shallow layer characteristics in the video frame to be processed;
the second extraction module is used for extracting deep features in the video frame to be processed according to the shallow features;
the third extraction module is used for extracting the space-time 2D characteristic layers in the video frames to be processed;
and the identification module is used for identifying the behavior category of the target object in the video frame to be processed according to the deep layer characteristics and the space-time 2D characteristic layer.
The embodiment of the invention also provides electronic equipment, which comprises a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
The embodiment of the invention also provides a computer readable storage medium storing a program, which is executed by a processor to implement the method as described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the foregoing method.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A behavior recognition method based on space-time feature fusion is characterized by comprising the following steps:
obtaining a video frame to be processed through a video preprocessing module, and unifying the sizes of the video frame to be processed;
extracting shallow features in the video frame to be processed through a shallow feature extraction module;
according to the shallow layer characteristics, deep layer characteristics in the video frame to be processed are extracted through a deep layer characteristic extraction module;
extracting a space-time 2D characteristic layer in the video frame to be processed through a space-time 2D convolution module;
according to the deep features and the space-time 2D feature layers, identifying behavior categories of target objects in the video frames to be processed through a classification and identification module;
the network of the shallow feature extraction module and the network of the deep feature extraction module are realized by adopting an InceptionV2 convolution network;
the shallow feature extraction module is composed of layers 3a to 3c of the InceptionV2 convolutional network and is used for extracting shallow features of images and finally obtaining InceptionV2-3c features; the deep feature extraction module consists of layers 4a to 5b of the InceptionV2 convolutional network and a Pool layer, and is used for extracting deep features of an image and finally obtaining the Pool features;
the space-time 2D convolution module comprises a dimension reduction module and a sample 2D module;
the dimension reduction module is used for converting the five-dimensional features into four-dimensional features, wherein the five-dimensional features in the dimension reduction module are five-dimensional data, the sizes of the five-dimensional data are (B, C, T, H and W), and each dimension is respectively represented as a number dimension B of the features, a channel dimension C of the features, a time dimension T of the features, a height dimension H of the features and a width dimension W of the features;
the sample 2D module comprises three Res blocks and a Pooling layer, wherein the Res blocks are divided into two parts, namely block1 and block2;
in block1, input features sequentially repeat the operations of BatchNorm batch normalization, ReLU mode activation and 3x3 convolution, so as to obtain new features, and finally, a feature addition operation is carried out on the new features and the input features;
in block2, sequentially repeating the batch normalization operation of the BatchNorm, the activation of the features in a ReLU mode and the operation of 3x3 convolution of the input features, wherein the first 3x3 convolution sets the step length to be 2, and finally obtaining a new feature block2-1; meanwhile, the input features are subjected to 3x3 convolution, the step length is set to be 2, the features block2-2 are obtained, and then the two features block2-1 and block2-2 are operated by using an Eltwise to obtain output features of block2;
and carrying out average pooling on the deep features sequentially through three Res block and Pool layers, and finally obtaining space-time 2D features.
2. The behavior recognition method based on temporal-spatial feature fusion according to claim 1, wherein the obtaining the video frames to be processed and unifying the sizes of the video frames to be processed comprises:
acquiring continuously input video content;
extracting N frames of images from the video content;
the size of each of the N frames of images is set to 224x224.
3. The behavior recognition method based on temporal-spatial feature fusion according to claim 2, wherein the shallow features include texture features and detail features of the image;
the shallow layer characteristics in the video frame to be processed are extracted specifically as follows:
and extracting the image through an InceptionV2 convolution network to obtain InceptionV2-3c characteristics.
4. A method of behavioral recognition based on spatiotemporal feature fusion according to claim 3, wherein the deep features include contour features, shape features, and most salient features of the image;
and extracting deep features in the video frame to be processed according to the shallow features, wherein the extracting comprises the following steps:
extracting the InceptionV2-3c features through an InceptionV2 convolution network to obtain Pool features.
5. The behavior recognition method based on temporal-spatial feature fusion according to claim 3, wherein the extracting a temporal-spatial 2D feature layer in the video frame to be processed comprises:
extracting the InceptionV2-3c features by a space-time 2D convolution module to obtain time features and space features;
and carrying out average pooling treatment on the time features and the space features to obtain space-time 2D features.
6. The behavior recognition method based on space-time feature fusion according to claim 5, wherein the extracting the InceptionV2-3c features by the space-time 2D convolution module to obtain the temporal features and the spatial features comprises:
performing dimension reduction operation on the InceptionV2-3c feature to obtain a time feature;
sequentially executing a normalization operation, a ReLU activation operation and a 3x3 convolution operation on the InceptionV2-3c feature to obtain a first feature;
sequentially executing a normalization operation, a ReLU activation operation, a first 3x3 convolution operation and a second 3x3 convolution operation on the InceptionV2-3c feature to obtain a second feature;
and adding the first feature and the second feature to obtain a third feature serving as a spatial feature.
7. The method for identifying behaviors based on spatiotemporal feature fusion of any of claims 1-6, further comprising:
training a behavior recognition model;
the method comprises the steps of,
and testing the behavior recognition model.
8. An apparatus for applying the spatiotemporal feature fusion-based behavior recognition method of any of claims 1-7, comprising:
the acquisition module is used for acquiring the video frames to be processed and unifying the sizes of the video frames to be processed;
the first extraction module is used for extracting shallow layer characteristics in the video frame to be processed;
the second extraction module is used for extracting deep features in the video frame to be processed according to the shallow features;
the third extraction module is used for extracting the space-time 2D characteristic layers in the video frames to be processed;
and the identification module is used for identifying the behavior category of the target object in the video frame to be processed according to the deep layer characteristics and the space-time 2D characteristic layer.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program to implement the method of any one of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method of any one of claims 1-7.
CN202110079906.7A 2021-01-21 2021-01-21 Behavior recognition method, device, equipment and medium based on space-time feature fusion Active CN112836602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110079906.7A CN112836602B (en) 2021-01-21 2021-01-21 Behavior recognition method, device, equipment and medium based on space-time feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110079906.7A CN112836602B (en) 2021-01-21 2021-01-21 Behavior recognition method, device, equipment and medium based on space-time feature fusion

Publications (2)

Publication Number Publication Date
CN112836602A CN112836602A (en) 2021-05-25
CN112836602B true CN112836602B (en) 2024-04-05

Family

ID=75929651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110079906.7A Active CN112836602B (en) 2021-01-21 2021-01-21 Behavior recognition method, device, equipment and medium based on space-time feature fusion

Country Status (1)

Country Link
CN (1) CN112836602B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887419B (en) * 2021-09-30 2023-05-12 四川大学 Human behavior recognition method and system based on extracted video space-time information
CN114529889A (en) * 2022-01-28 2022-05-24 燕山大学 Method and device for identifying distracted driving behaviors and storage medium
CN116824641B (en) * 2023-08-29 2024-01-09 卡奥斯工业智能研究院(青岛)有限公司 Gesture classification method, device, equipment and computer storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
WO2020258498A1 (en) * 2019-06-26 2020-12-30 平安科技(深圳)有限公司 Football match behavior recognition method and apparatus based on deep learning, and terminal device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
WO2020258498A1 (en) * 2019-06-26 2020-12-30 平安科技(深圳)有限公司 Football match behavior recognition method and apparatus based on deep learning, and terminal device

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Multi-Branch Spatial-Temporal Network for Action Recognition; Yingying Wang et al.; IEEE Signal Processing Letters; 20190911; Vol. 26 (No. 10); 1556-1560 *
SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition; Fei Wang et al.; IEEE Access; 20191212; Vol. 7; 164876-164886 *
Spatio-temporal Collaborative Convolution for Video Action Recognition; Xu Li et al.; 2020 IEEE International Conference on Artificial Intelligence and Computer Applications; 20200901; 554-558 *
A behavior recognition algorithm based on 2D spatio-temporal information extraction; 刘董经典 et al.; CAAI Transactions on Intelligent Systems; 20200828; Vol. 15 (No. 5); 900-909 *
Video behavior recognition based on spatio-temporal features; 常颖; China Master's Theses Full-text Database (Information Science and Technology); 20200615 (No. 6); I138-841 *
Human behavior recognition algorithm based on deep learning; 韩雪平; 吴甜甜; Mathematics in Practice and Theory; 20191223 (No. 24); 135-141 *

Also Published As

Publication number Publication date
CN112836602A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
Shen et al. End-to-end deep image reconstruction from human brain activity
Zhao et al. Supervised segmentation of un-annotated retinal fundus images by synthesis
CN112836602B (en) Behavior recognition method, device, equipment and medium based on space-time feature fusion
Sabir et al. Recurrent convolutional strategies for face manipulation detection in videos
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Kim et al. Fully deep blind image quality predictor
Pathak et al. Context encoders: Feature learning by inpainting
CN109886881B (en) Face makeup removal method
Halit et al. Multiscale motion saliency for keyframe extraction from motion capture sequences
Talavera-Martinez et al. Hair segmentation and removal in dermoscopic images using deep learning
Din et al. Effective removal of user-selected foreground object from facial images using a novel GAN-based network
CN114219719A (en) CNN medical CT image denoising method based on dual attention and multi-scale features
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
Ahmed et al. Improve of contrast-distorted image quality assessment based on convolutional neural networks.
CN112580458A (en) Facial expression recognition method, device, equipment and storage medium
He et al. What catches the eye? Visualizing and understanding deep saliency models
Xu et al. AutoSegNet: An automated neural network for image segmentation
Li et al. Speckle noise removal based on structural convolutional neural networks with feature fusion for medical image
CN114399480A (en) Method and device for detecting severity of vegetable leaf disease
CN117275063A (en) Face depth counterfeiting detection method and system based on three-dimensional information time sequence consistency
Tan et al. Local context attention for salient object segmentation
Gupta et al. A robust and efficient image de-fencing approach using conditional generative adversarial networks
Liu et al. A3GAN: An attribute-aware attentive generative adversarial network for face aging
Astono et al. [Regular Paper] Adjacent Network for Semantic Segmentation of Liver CT Scans
Zhang et al. Learning to explore intrinsic saliency for stereoscopic video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant