CN113313030A - Human behavior identification method based on motion trend characteristics - Google Patents

Human behavior identification method based on motion trend characteristics

Info

Publication number
CN113313030A
Authority
CN
China
Prior art keywords
video
data set
features
motion
module
Prior art date
Legal status
Granted
Application number
CN202110597647.7A
Other languages
Chinese (zh)
Other versions
CN113313030B (en)
Inventor
董敏
曹瑞东
毕盛
方政霖
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202110597647.7A
Publication of CN113313030A
Application granted
Publication of CN113313030B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 - Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/40 - Extraction of image or video features
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human behavior recognition method based on motion trend features, comprising the following steps: 1) acquiring a video data set for human behavior recognition; 2) extracting video frames from the videos in the data set to build a frame data set; 3) constructing a motion trend feature extraction model and using it to perform feature extraction and recognition on the data set of step 2), thereby training the model; 4) performing transfer learning on the model trained in step 3) according to the actual scene and applying the transferred model to a human behavior recognition task in that scene. The method facilitates the recognition of complex behaviors that require long-range temporal reasoning in video scenes and has practical application value.

Description

Human behavior identification method based on motion trend characteristics
Technical Field
The invention relates to the technical field of human behavior recognition and analysis in video scenes, and in particular to a human behavior recognition method based on motion trend features.
Background
Visual information is among the most readily available information in daily life, and video is its main carrier; a large number of research topics and applications have emerged around it. Among these, human behavior recognition in video receives particular attention, for example in smart nursing, sports-event judging and sign language recognition; such tasks focus on the actions of the human body itself rather than on the interaction between the human body and objects. A video consists of multiple image frames with a temporal relationship, so the key to human behavior recognition is capturing both the spatial semantic features of the individual frames and the temporal motion features across frames.
Behavior recognition in video differs from image classification in that classification in video requires temporal modeling, so behavior recognition networks combine temporal and spatial modeling in various ways. The Two-Stream CNN proposed in 2014 uses a two-stream network: one stream expresses spatial features and the other expresses temporal features, and the spatial and temporal features extracted by the two streams are then fused and classified. Many temporal modeling models followed. The TSN model proposed in 2016 divides a video into several segments, samples them sparsely, makes a prediction per segment and then fuses the segment results into a video-level prediction; this mechanism gives the model the ability to capture long-range temporal information, but it still does not couple the temporal and spatial dimensions of the behavior features and lacks spatio-temporal fusion capability. Starting with C3D, proposed in 2015, 3D convolutional neural networks (3D CNNs) have been used for spatio-temporal feature extraction, including the R3D, I3D, NL-I3D, SlowFast, NL-SlowFast and X3D models. 3D convolution extends spatial semantic features to spatio-temporal features by adding a temporal dimension to the convolution kernel. Although 3D CNN models fuse spatio-temporal features well, they learn features only over a small sliding window rather than over the entire video, so video-level predictions are difficult to obtain; in addition, 3D CNN models carry a very high computational overhead, place high demands on the computing platform, are difficult to train and are time-consuming at inference.
Human behaviors can be divided into two types. One type can be judged from a single static frame of the video; such behaviors are called inference-free behaviors. The other type can only be judged by recognizing the features of multiple frames and the motion relationships between them; such behaviors are called inference-required behaviors. Recognizing inference-required behaviors places higher demands on the temporal modeling of a human behavior recognition model. A human behavior recognition method for video scenes must therefore not only consider the model's spatio-temporal feature fusion capability, but also provide a strategy for capturing long-range temporal motion features so as to obtain video-level predictions of behaviors, while balancing recognition accuracy against computational overhead as far as possible.
Disclosure of Invention
The invention aims to overcome the shortcomings of current human behavior recognition methods for video scenes in fusing spatio-temporal features and capturing long-range temporal motion information. It provides a human behavior recognition method based on motion trend features that strengthens the recognition of both inference-required and inference-free behaviors, improves the accuracy of the model on complex, long-duration behaviors in video scenes, and allows the model to be better applied in practical systems.
To achieve the above purpose, the technical solution provided by the invention is as follows: a human behavior recognition method based on motion trend features, comprising the following steps:
1) acquiring a video data set for human behavior recognition;
2) extracting video frames from the videos in the video data set of step 1) to build a frame data set;
3) constructing a motion trend feature extraction model and using it to perform feature extraction and recognition on the data set built in step 2), thereby training the model; the motion trend feature extraction model improves on the structure of the ECO (Efficient Convolutional Network for Online Video Understanding) model by adding a motion feature calculation, changing the ECO model's spatio-temporal feature extraction into motion trend feature extraction, and adding a feature fusion module; the aim of these improvements is to strengthen the recognition of inference-required and inference-free behaviors simultaneously;
4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to a human behavior recognition task in that scene to complete behavior recognition.
Further, in step 1), public open-source video data sets are obtained by downloading the HMDB51, UCF101 and Jester video data sets, and the obtained data sets are organized according to a custom file structure standard: for the HMDB51 and UCF101 data sets the first-level folder names are the human behavior categories, and each folder contains the .avi videos belonging to that category; the Jester data set is a set of .webm videos named by video sequence number.
Further, step 2) comprises the following steps:
2.1) traversing all video files under each data set's folders, extracting the frames of each video with OpenCV to obtain a video frame data set, and recording the number of frames of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split files provided on each data set's official website, and storing the training-set and validation-set information in a file in which each line is a tuple containing the video frame folder path, the number of frames and the category of the video.
Further, in step 3), the constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer; the details are as follows:
the video frame preprocessing module performs the following operations: all video frames from the same video are sparsely sampled by dividing them evenly into 16 segments by frame count and randomly selecting 1 frame from each segment, giving 16 sampled frames in total; the 16 frames are then center-cropped to obtain 16 images of 224 × 224 pixels;
the spatial semantic feature extraction module is a 2D convolutional network that extracts the spatial semantic features of the 16 frames with a weight-sharing mechanism; its backbone adopts the same structure as the ECO model, consisting first of two convolutional layers, each followed by a pooling layer, with kernel sizes of 7 × 7 and 3 × 3 and both pooling layers being 3 × 3 max pooling, and then three inception layers whose output channel numbers are 256, 320 and 96 respectively; passing the 16 images through this backbone yields 16 features of size [96, 28, 28], which are stacked into the spatial semantic features;
the motion feature calculation module performs the following operation: from the obtained spatial semantic features, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the value at row x, column y of F_{nk}, the feature map of the k-th channel of the spatial semantic features at time index n; the resulting feature differences D_{(n-1)k} are stacked into motion features of size [96, 15, 28, 28];
the motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features; its backbone is part of a 3D-ResNet-18 network and comprises six 3D convolutions, all with 3 × 3 × 3 kernels, each followed by one BN3D layer and one ReLU layer; this module extracts motion trend features of size [512, 4, 7, 7];
the spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features; it consists of two 3D convolutional layers, each with 512 output channels, 3 × 3 × 3 kernels, padding of 1 and a stride of 2, each followed by one BN3D layer and one ReLU layer; this module extracts spatio-temporal features of size [512, 4, 7, 7];
the feature fusion module concatenates the motion trend features and the spatio-temporal features to obtain video-level features with a final size of [1024, 4, 7, 7];
the video-level features of size [1024, 4, 7, 7] are passed through a global pooling layer with a kernel size of 1 × 7 × 7, the global pooling being average pooling with a stride of 1, to obtain features of size [1024, 1, 1, 1];
the features of size [1024, 1, 1, 1] undergo a Flatten operation, then pass through a Dropout(0.3) layer, and are finally classified by a fully connected layer.
Further, step 4) comprises the following steps:
4.1) collecting video data of the actual scene, extracting video frames from the collected videos, and building a data set;
4.2) fine-tuning the trained motion trend feature extraction model by freezing the parameters of all feature-extraction layers and training on the data set of step 4.1), and applying the fine-tuned model to human behavior recognition in the actual scene to obtain accurate recognition results.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention sparsely samples the video to be recognized, which reduces data redundancy, reduces the input size of the model and reduces the amount of computation during inference.
2. The model computes feature differences over the multi-frame spatial semantic features, which strengthens the expression of the motion trend of a behavior and further reduces the number of feature maps, and therefore the amount of computation; at the same time, motion trend features amplify the differences between behaviors of different categories and give better recognition results on smaller data sets. Because the model fuses motion trend features with spatio-temporal features, it retains complete static image semantic information while also capturing dynamic motion features, so it can strengthen the recognition of inference-required and inference-free behaviors simultaneously.
3. The model has a modular structure: the backbone networks of the spatial semantic feature extraction module, the spatio-temporal feature extraction module and the motion trend feature extraction module can be flexibly replaced by other networks, so a lightweight or a high-accuracy network can be chosen according to the available computing resources.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The human behavior recognition method based on motion trend features provided by this embodiment comprises the following steps:
1) acquiring a video data set for human behavior recognition, specifically as follows:
1.1) obtaining public open-source video data sets by downloading the HMDB51, UCF101 and Jester video data sets;
1.2) organizing the obtained data sets according to a custom file structure standard: for the HMDB51 and UCF101 data sets the first-level folder names are the human behavior categories, and each folder contains the .avi videos belonging to that category; the Jester data set is a set of .webm videos named by video sequence number.
2) Extracting video frames from the videos in the video data set of step 1) to build a frame data set, specifically as follows:
2.1) traversing all video files under each data set's folders, extracting the frames of each video with OpenCV to obtain a video frame data set, and recording the number of frames of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split files provided on each data set's official website, and storing the training-set and validation-set information in a file in which each line is a tuple containing the video frame folder path, the number of frames and the category of the video.
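By way of illustration only, a minimal Python sketch of steps 2.1) and 2.2) is given below. The frame file naming, the annotation file layout and the split_lookup helper are assumptions introduced for the example and are not specified by the invention; only the use of OpenCV for frame extraction and the (frame-folder path, frame count, category) tuple per line follow the text.

    import os
    import cv2  # OpenCV, used to decode each video into individual frames

    def extract_frames(video_path, frame_dir):
        """Decode one video into JPEG frames and return the counted frame number."""
        os.makedirs(frame_dir, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        count = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            cv2.imwrite(os.path.join(frame_dir, "img_%05d.jpg" % count), frame)
            count += 1
        cap.release()
        return count

    def build_annotations(video_root, frame_root, split_lookup, train_file, val_file):
        """Write one line per video: frame-folder path, frame count, category.
        split_lookup(video_name) -> "train" or "val" is a hypothetical helper that
        reads the official split files."""
        with open(train_file, "w") as ftrain, open(val_file, "w") as fval:
            for category in sorted(os.listdir(video_root)):
                cat_dir = os.path.join(video_root, category)
                for video in sorted(os.listdir(cat_dir)):
                    frame_dir = os.path.join(frame_root, category, os.path.splitext(video)[0])
                    n_frames = extract_frames(os.path.join(cat_dir, video), frame_dir)
                    target = ftrain if split_lookup(video) == "train" else fval
                    target.write("%s %d %s\n" % (frame_dir, n_frames, category))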
3) Constructing the motion trend feature extraction model shown in fig. 1 and using it to perform feature extraction and recognition on the data set of step 2), thereby training the model; the motion trend feature extraction model improves on the structure of the ECO (Efficient Convolutional Network for Online Video Understanding) model by adding a motion feature calculation, changing the ECO model's spatio-temporal feature extraction into motion trend feature extraction, and adding a feature fusion module; the aim of these improvements is to strengthen the recognition of inference-required and inference-free behaviors simultaneously.
The constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer; the details are as follows:
the video frame pre-processing module performs the following operations: performing sparse sampling on all video frames from the same video, averagely dividing the video frames into 16 segments according to the number of the frames, namely N in the figure 1 is equal to 16, randomly selecting 1 frame from each segment, and sampling 16 frames of images in total; the 16 frame image is center-cropped to obtain 16 frames of 224 × 224 pixels.
The spatial semantic feature extraction module is a 2D convolutional network that extracts the spatial semantic features of the 16 frames with a weight-sharing mechanism. Its backbone adopts the same structure as the ECO model, consisting first of two convolutional layers, each followed by a pooling layer, with kernel sizes of 7 × 7 and 3 × 3 and both pooling layers being 3 × 3 max pooling, and then three inception layers whose output channel numbers are 256, 320 and 96 respectively. Passing the 16 images through this backbone yields 16 features of size [96, 28, 28], which are stacked into the spatial semantic features.
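To illustrate only the weight-sharing mechanism, the PyTorch-style sketch below folds the 16 frames into the batch dimension so that one set of 2D convolution weights processes every frame. The first two convolution/pooling stages follow the text, but the three inception layers are replaced by a single 1 × 1 convolution stand-in that merely brings the channel count to 96, and the intermediate channel widths (64, 192) are assumptions; this is not the actual ECO backbone.

    import torch
    import torch.nn as nn

    class Shared2DFeatures(nn.Module):
        """Weight-shared 2D spatial semantic feature extractor (simplified stand-in)."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # 224 -> 112
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 112 -> 56
                nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),  # 56 -> 56
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 56 -> 28
                nn.Conv2d(192, 96, kernel_size=1),  # stand-in for the three inception layers
            )

        def forward(self, clip):                    # clip: [B, 16, 3, 224, 224]
            b, t = clip.shape[:2]
            x = self.features(clip.flatten(0, 1))   # shared weights over all frames: [B*16, 96, 28, 28]
            x = x.view(b, t, 96, 28, 28)
            return x.permute(0, 2, 1, 3, 4)          # stacked spatial semantic features: [B, 96, 16, 28, 28]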
The motion feature calculation module performs the following operation: from the above spatial semantic features, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the value at row x, column y of F_{nk}, the feature map of the k-th channel of the spatial semantic features at time index n; the resulting feature differences D_{(n-1)k} are stacked into motion features of size [96, 15, 28, 28].
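The feature-difference calculation reduces to one tensor subtraction along the time dimension; a sketch assuming the stacked spatial semantic features are held in a [B, 96, 16, 28, 28] tensor:

    def motion_features(spatial_feats):
        """D[n-1] = F[n] - F[n-1] along the time axis.
        spatial_feats: [B, 96, 16, 28, 28] -> returns [B, 96, 15, 28, 28]."""
        return spatial_feats[:, :, 1:] - spatial_feats[:, :, :-1]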
The motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features. Its backbone is part of a 3D-ResNet-18 network and comprises six 3D convolutions, all with 3 × 3 × 3 kernels, each followed by one BN3D layer and one ReLU layer; this module extracts motion trend features of size [512, 4, 7, 7].
The spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features. It consists of two 3D convolutional layers, each with 512 output channels, 3 × 3 × 3 kernels, padding of 1 and a stride of 2, each followed by one BN3D layer and one ReLU layer; this module extracts spatio-temporal features of size [512, 4, 7, 7].
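A combined sketch of the two 3D branches described in this and the preceding paragraph. The spatio-temporal branch follows the stated configuration (two 3 × 3 × 3 convolutions, 512 channels, padding 1, stride 2, each with BN3D and ReLU); for the motion trend branch, the channel widths and the placement of the two stride-2 layers are assumptions chosen only so that a [96, 15, 28, 28] input yields the stated [512, 4, 7, 7] output, and the residual connections of 3D-ResNet-18 are omitted.

    import torch.nn as nn

    def conv3d_bn_relu(cin, cout, stride):
        return nn.Sequential(
            nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm3d(cout),
            nn.ReLU(inplace=True),
        )

    # Motion trend branch: six 3x3x3 convolutions, each followed by BN3D and ReLU
    # (simplified; channel/stride schedule assumed, residual links omitted).
    motion_trend_branch = nn.Sequential(
        conv3d_bn_relu(96, 128, stride=2),    # [96,15,28,28] -> [128,8,14,14]
        conv3d_bn_relu(128, 128, stride=1),
        conv3d_bn_relu(128, 256, stride=2),   # -> [256,4,7,7]
        conv3d_bn_relu(256, 256, stride=1),
        conv3d_bn_relu(256, 512, stride=1),
        conv3d_bn_relu(512, 512, stride=1),   # -> [512,4,7,7]
    )

    # Spatio-temporal branch: two 3x3x3 convolutions, 512 channels, padding 1, stride 2.
    spatiotemporal_branch = nn.Sequential(
        conv3d_bn_relu(96, 512, stride=2),    # [96,16,28,28] -> [512,8,14,14]
        conv3d_bn_relu(512, 512, stride=2),   # -> [512,4,7,7]
    )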
The feature fusion module concatenates the motion trend features and the spatio-temporal features to obtain video-level features with a final size of [1024, 4, 7, 7].
The video-level features of size [1024, 4, 7, 7] are passed through a global pooling layer with a kernel size of 1 × 7 × 7, the global pooling being average pooling with a stride of 1, to obtain features of size [1024, 1, 1, 1].
The features of size [1024, 1, 1, 1] undergo a Flatten operation, then pass through a Dropout(0.3) layer, and are finally classified by a fully connected layer.
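A sketch of the fusion and classification head. Note that a 1 × 7 × 7 average pooling with stride 1 still leaves a temporal dimension of 4, so the sketch additionally averages over that dimension (an assumption made here to reach the stated [1024, 1, 1, 1] size); num_classes depends on the data set.

    import torch
    import torch.nn as nn

    class FusionHead(nn.Module):
        """Concatenate the two branch outputs, pool globally, then Dropout(0.3) + FC."""
        def __init__(self, num_classes):
            super().__init__()
            self.pool = nn.AvgPool3d(kernel_size=(1, 7, 7), stride=1)
            self.dropout = nn.Dropout(0.3)
            self.fc = nn.Linear(1024, num_classes)

        def forward(self, motion_trend, spatiotemporal):      # each [B, 512, 4, 7, 7]
            x = torch.cat([motion_trend, spatiotemporal], 1)  # video-level features [B, 1024, 4, 7, 7]
            x = self.pool(x)                                  # [B, 1024, 4, 1, 1]
            x = x.mean(dim=2).flatten(1)                      # average remaining temporal dim (assumption) -> [B, 1024]
            return self.fc(self.dropout(x))                   # class scores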
4) Performing transfer learning on the model trained in step 3) according to the actual scene and applying the transferred model to a human behavior recognition task in that scene, specifically as follows:
4.1) collecting video data of the actual scene, extracting video frames from the collected videos, and building a data set;
4.2) fine-tuning the trained motion trend feature extraction model by freezing the parameters of all feature-extraction layers and training on the data set of step 4.1), and applying the fine-tuned model to human behavior recognition in the actual scene to obtain accurate recognition results.
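A minimal sketch of step 4.2), assuming the trained model exposes its final fully connected layer as model.fc (an assumption); every other parameter is frozen and only the replaced classifier is trained on the scene-specific data set.

    import torch
    import torch.nn as nn

    def prepare_for_transfer(model, num_scene_classes):
        """Freeze all feature-extraction parameters and replace the classifier head."""
        for p in model.parameters():
            p.requires_grad = False                 # freeze feature-extraction layers
        model.fc = nn.Linear(model.fc.in_features, num_scene_classes)  # new head, trainable by default
        return model

    # Only parameters that still require gradients are passed to the optimizer:
    # optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3)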
In conclusion, the invention provides a human behavior recognition method based on motion trend features that effectively reduces data redundancy, captures long-range temporal information and strengthens the expression of the motion trend of behaviors; it retains the complete static image semantic information of the sampled video frames while capturing the dynamic motion features of behaviors, which facilitates human behavior recognition, and some of its modules can be flexibly replaced according to the scene. The method therefore has broad research and practical application value and is worth popularizing.
The above embodiments are merely preferred embodiments of the present invention, and the protection scope of the present invention is not limited to them; changes made according to the shape and principle of the present invention should be covered within its protection scope.

Claims (5)

1. A human behavior recognition method based on motion trend features, characterized by comprising the following steps:
1) acquiring a video data set for human behavior recognition;
2) extracting video frames from the videos in the video data set of step 1) to build a frame data set;
3) constructing a motion trend feature extraction model and using it to perform feature extraction and recognition on the data set built in step 2), thereby training the model; the motion trend feature extraction model improves on the ECO model structure by adding a motion feature calculation, changing the ECO model's spatio-temporal feature extraction into motion trend feature extraction, and adding a feature fusion module; the aim of these improvements is to strengthen the recognition of inference-required and inference-free behaviors simultaneously;
4) performing transfer learning on the model trained in step 3) according to the actual scene, and applying the transferred model to a human behavior recognition task in that scene to complete behavior recognition.
2. The human behavior recognition method based on motion trend features according to claim 1, wherein in step 1) public open-source video data sets are obtained by downloading the HMDB51, UCF101 and Jester video data sets, and the obtained data sets are organized according to a custom file structure standard: for the HMDB51 and UCF101 data sets the first-level folder names are the human behavior categories, and each folder contains the .avi videos belonging to that category; the Jester data set is a set of .webm videos named by video sequence number.
3. The human behavior recognition method based on motion trend features according to claim 1, wherein step 2) comprises the following steps:
2.1) traversing all video files under each data set's folders, extracting the frames of each video with OpenCV to obtain a video frame data set, and recording the number of frames of each video;
2.2) dividing each video frame data set of step 2.1) into a training set and a validation set according to the split files provided on each data set's official website, and storing the training-set and validation-set information in a file in which each line is a tuple containing the video frame folder path, the number of frames and the category of the video.
4. The human behavior recognition method based on motion trend features according to claim 1, wherein in step 3) the constructed motion trend feature extraction model comprises a video frame preprocessing module, a spatial semantic feature extraction module, a motion feature calculation module, a motion trend feature extraction module, a spatio-temporal feature extraction module, a feature fusion module, a global pooling layer and a fully connected layer; the details are as follows:
the video frame preprocessing module performs the following operations: all video frames from the same video are sparsely sampled by dividing them evenly into 16 segments by frame count and randomly selecting 1 frame from each segment, giving 16 sampled frames in total; the 16 frames are then center-cropped to obtain 16 images of 224 × 224 pixels;
the spatial semantic feature extraction module is a 2D convolutional network that extracts the spatial semantic features of the 16 frames with a weight-sharing mechanism; its backbone adopts the same structure as the ECO model, consisting first of two convolutional layers, each followed by a pooling layer, with kernel sizes of 7 × 7 and 3 × 3 and both pooling layers being 3 × 3 max pooling, and then three inception layers whose output channel numbers are 256, 320 and 96 respectively; passing the 16 images through this backbone yields 16 features of size [96, 28, 28], which are stacked into the spatial semantic features;
the motion feature calculation module performs the following operation: from the obtained spatial semantic features, the motion features are calculated according to the formula D_{(n-1)k}(x, y) = F_{nk}(x, y) - F_{(n-1)k}(x, y), where 1 < n ≤ 16, 1 ≤ k ≤ 96, 1 ≤ x ≤ 28, 1 ≤ y ≤ 28, and F_{nk}(x, y) denotes the value at row x, column y of F_{nk}, the feature map of the k-th channel of the spatial semantic features at time index n; the resulting feature differences D_{(n-1)k} are stacked into motion features of size [96, 15, 28, 28];
the motion trend feature extraction module is a 3D convolutional network that extracts motion trend features from the motion features; its backbone is part of a 3D-ResNet-18 network and comprises six 3D convolutions, all with 3 × 3 × 3 kernels, each followed by one BN3D layer and one ReLU layer; this module extracts motion trend features of size [512, 4, 7, 7];
the spatio-temporal feature extraction module is a 3D convolutional network that extracts spatio-temporal features from the spatial semantic features; it consists of two 3D convolutional layers, each with 512 output channels, 3 × 3 × 3 kernels, padding of 1 and a stride of 2, each followed by one BN3D layer and one ReLU layer; this module extracts spatio-temporal features of size [512, 4, 7, 7];
the feature fusion module concatenates the motion trend features and the spatio-temporal features to obtain video-level features with a final size of [1024, 4, 7, 7];
the video-level features of size [1024, 4, 7, 7] are passed through a global pooling layer with a kernel size of 1 × 7 × 7, the global pooling being average pooling with a stride of 1, to obtain features of size [1024, 1, 1, 1];
the features of size [1024, 1, 1, 1] undergo a Flatten operation, then pass through a Dropout(0.3) layer, and are finally classified by a fully connected layer.
5. The human behavior recognition method based on motion trend features according to claim 1, wherein step 4) comprises the following steps:
4.1) collecting video data of the actual scene, extracting video frames from the collected videos, and building a data set;
4.2) fine-tuning the trained motion trend feature extraction model by freezing the parameters of all feature-extraction layers and training on the data set of step 4.1), and applying the fine-tuned model to human behavior recognition in the actual scene to obtain accurate recognition results.
CN202110597647.7A 2021-05-31 2021-05-31 Human behavior identification method based on motion trend characteristics Active CN113313030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110597647.7A CN113313030B (en) 2021-05-31 2021-05-31 Human behavior identification method based on motion trend characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110597647.7A CN113313030B (en) 2021-05-31 2021-05-31 Human behavior identification method based on motion trend characteristics

Publications (2)

Publication Number Publication Date
CN113313030A 2021-08-27
CN113313030B 2023-02-14

Family

ID=77376213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110597647.7A Active CN113313030B (en) 2021-05-31 2021-05-31 Human behavior identification method based on motion trend characteristics

Country Status (1)

Country Link
CN (1) CN113313030B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143074A (en) * 2013-05-07 2014-11-12 李东舸 Method and equipment for generating motion feature codes on the basis of motion feature information
EP3284013A1 (en) * 2015-04-16 2018-02-21 University of Essex Enterprises Limited Event detection and summarisation
US20170220854A1 (en) * 2016-01-29 2017-08-03 Conduent Business Services, Llc Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action
CN107609460A (en) * 2017-05-24 2018-01-19 南京邮电大学 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN108108699A (en) * 2017-12-25 2018-06-01 重庆邮电大学 Merge deep neural network model and the human motion recognition method of binary system Hash
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN111680618A (en) * 2020-06-04 2020-09-18 西安邮电大学 Dynamic gesture recognition method based on video data characteristics, storage medium and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BANGLI LIU ET AL: "Human-Human Interaction Recognition Based on Spatial and Motion Trend Feature", IEEE International Conference on Image Processing *
张爱辉 et al.: "Improvement of PCRM and its application in human behavior recognition", Computer Engineering and Design *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114973107A (en) * 2022-06-24 2022-08-30 山东省人工智能研究院 Unsupervised cross-domain video action identification method based on multi-discriminator cooperation and strong and weak sharing mechanism

Also Published As

Publication number Publication date
CN113313030B (en) 2023-02-14

Similar Documents

Publication Publication Date Title
CN111310676A (en) Video motion recognition method based on CNN-LSTM and attention
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN112541503A (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN111382677B (en) Human behavior recognition method and system based on 3D attention residual error model
CN113255443B (en) Graph annotation meaning network time sequence action positioning method based on pyramid structure
CN112749608A (en) Video auditing method and device, computer equipment and storage medium
CN111523546A (en) Image semantic segmentation method, system and computer storage medium
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN112818849B (en) Crowd density detection algorithm based on context attention convolutional neural network for countermeasure learning
CN114973049B (en) Lightweight video classification method with unified convolution and self-attention
US20220237917A1 (en) Video comparison method and apparatus, computer device, and storage medium
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114494981A (en) Action video classification method and system based on multi-level motion modeling
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN113313030B (en) Human behavior identification method based on motion trend characteristics
CN108875555B (en) Video interest area and salient object extracting and positioning system based on neural network
CN110991219A (en) Behavior identification method based on two-way 3D convolutional network
CN116189292A (en) Video action recognition method based on double-flow network
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation
CN111709316A (en) Behavior identification method combining space-time discrimination filter bank
CN117689731B (en) Lightweight new energy heavy-duty battery pack identification method based on improved YOLOv model
CN117274869B (en) Cell deformation dynamic classification method and system based on deformation field extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant