CN111488805A - Video behavior identification method based on saliency feature extraction - Google Patents

Video behavior identification method based on saliency feature extraction

Info

Publication number
CN111488805A
Authority
CN
China
Prior art keywords
feature
image
edge
video
salient
Prior art date
Legal status
Granted
Application number
CN202010210957.4A
Other languages
Chinese (zh)
Other versions
CN111488805B (en)
Inventor
胡晓
向俊将
杨佳信
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202010210957.4A priority Critical patent/CN111488805B/en
Publication of CN111488805A publication Critical patent/CN111488805A/en
Application granted
Publication of CN111488805B publication Critical patent/CN111488805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video behavior identification method based on salient feature extraction, which comprises the steps of: S1, acquiring a video to be identified and converting it into images; S2, extracting the salient features of the images and removing the background information in the images; S3, performing behavior recognition on the images; S4, outputting the recognized abnormal behavior. The method combines salient object detection with behavior recognition for the first time: on the one hand, the region of interest in the video is extracted, preserving the main features while masking the background features; on the other hand, the amount of computation is reduced. Both aspects facilitate the detection and recognition of abnormal behaviors.

Description

Video behavior identification method based on saliency feature extraction
Technical Field
The invention relates to the technical field of intelligent video monitoring, in particular to a video behavior identification method based on saliency feature extraction.
Background
With economic development and the improvement of legal systems, people pay increasing attention to preventing criminal behavior that endangers life and property. Video surveillance systems are being applied in people's lives, for example for theft prevention in daily life and for counter-terrorism in crowded places.
Current behavior recognition methods mainly fall into two categories: methods based on traditional feature extraction and methods based on deep learning. Traditional feature-extraction methods classify behaviors by extracting features such as the Histogram of Optical Flow (HOF), the Histogram of Oriented Gradients (HOG) and the Motion Boundary Histogram (MBH) from the video. However, their recognition capability is easily affected by illumination intensity and background information, and the extracted features have certain limitations, so the recognition capability is not good.
With the development of the times, researchers proposed deep learning; given that deep learning can effectively complete tasks in research fields such as vision and hearing, deep-learning-based methods have begun to be applied in real life, and video surveillance is one such application. Based on deep learning theory, researchers build a network model and train it on a labeled video data set to obtain a model with recognition capability. Such a model has good generalization ability and can classify untrained video data. In current methods, however, the video is fed directly into the neural network without processing the video data, or only exposure and distortion processing is applied; the feature information carried by the video is not highlighted, which is unfavorable for the detection of abnormal behaviors. In a realistic, complex background this may make abnormal behaviors impossible to identify. Moreover, training existing deep-learning network models requires large data sets and high-performance servers, which greatly limits practical video identification work.
In summary, there is a need in the industry to develop a method or system that can highlight the characteristic information carried by the video, reduce the amount of computation, and facilitate the detection and identification of abnormal behavior.
Disclosure of Invention
Aiming at the defect of the prior art that the feature information carried by the video is not highlighted, the invention provides a video behavior identification method based on salient feature extraction.
The specific scheme of the application is as follows:
a video behavior identification method based on salient feature extraction comprises the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
Preferably, step S2 includes:
S21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone network, the image features are divided into saliency features and edge features, and the Res2Net backbone network finally outputs a saliency feature S0 and an edge feature E0;
S22, using the saliency feature S0 and the edge feature E0 as supervisory signals, alternately training them in the CRU unit to generate a saliency feature S and an edge feature E;
S23, in the asymmetric cross module, computing a loss between the salient feature S and the saliency label Label-S, computing a loss between the edge feature E and the edge label Label-E, and at the same time extracting an edge feature from the salient feature S and computing a loss between this edge feature and the edge of the saliency label.
Preferably, step S21 includes: the shallow image features first pass through a 1 × 1 convolution kernel, then through a set of convolution kernels that divide the feature map with n channels into 4 groups; the output features of the previous group are sent, together with another group of input feature maps, to the next group of 3 × 3 convolution kernels, and this process is repeated twice until all input feature maps are processed.
Preferably, the CRU unit consists of 4 CRU structural units stacked end to end, and the superposition operation of the CRU units in step S22 is defined by formulas (1)-(4), where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the saliency feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, and ⊙ denotes the dot product.
Preferably, step S23 is implemented by the following formulas:
G_x(x,y) = f(x,y) * g_x(x,y)    (5)
G_y(x,y) = f(x,y) * g_y(x,y)    (6)
G = F(G_x(x,y), G_y(x,y))    (7)
the function F in formula (7) takes into account the edge features in both the vertical and the horizontal direction, and fuses the two parts according to formula (8) or (9):
F(G_x(x,y), G_y(x,y)) = sqrt(G_x(x,y)^2 + G_y(x,y)^2)    (8)
F(G_x(x,y), G_y(x,y)) = G_x(x,y) + G_y(x,y)    (9)
after the binary image is processed by formula (7), an edge feature map is obtained; the loss function uses a binary cross-entropy loss:
L = -Σ_i [ p(x_i)·log q(x_i) + (1 - p(x_i))·log(1 - q(x_i)) ]    (10)
wherein p(x_i) is the true value and q(x_i) is the estimated value.
Preferably, step S3 includes performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure, wherein the network framework of the C3D network structure comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, batch normalization (BatchNorm) and Dropout techniques can be added to the network structure to optimize it;
the model is trained with stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully connected layers provide 4096 outputs, and classification is finally realized by a softmax classification layer.
Compared with the prior art, the invention has the following beneficial effects:
the method combines the salient object detection method and the behavior recognition method for the first time, on one hand, the interesting region in the video is extracted, the main characteristics are reserved, the background characteristics are shielded, on the other hand, the calculation amount is reduced, and the method can be beneficial to the detection and recognition of abnormal behaviors.
Drawings
FIG. 1 is a schematic flow chart diagram of a video behavior recognition method based on salient feature extraction according to an embodiment;
FIG. 2 is a network architecture diagram of significance detection, according to an embodiment.
Fig. 3(a) is a unit structure diagram of a Res2Net backbone network according to an embodiment.
Fig. 3(b) is a unit structure diagram of a CRU unit according to an embodiment.
FIG. 4 is a diagram of a 3D convolution of an embodiment.
FIG. 5 is a decomposition diagram of the R(2+1)D algorithm according to an embodiment.
Fig. 6 is a network framework diagram of a C3D network architecture of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The visual attention mechanism (VA) refers to the way a human, when facing a scene, automatically processes regions of interest, called salient regions, and selectively ignores regions of no interest. Extracting the salient region of a specific target is called salient target detection. Faced with the complex background information in a video or image, extracting salient features is necessary: it preserves the main features and masks the background features. In abnormal behavior detection, the saliency detection method can be used to detect the abnormal part of an image or video, and the abnormal behavior in the image or video can then be identified. The video behavior identification method based on salient feature extraction in this scheme is built on a network framework composed of a Res2Net backbone network, CRU units and an asymmetric cross module, and specifically comprises the following steps:
referring to fig. 1, a video behavior recognition method based on salient feature extraction includes:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
In the present embodiment, referring to fig. 2, step S2 includes:
S21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone network, the image features are divided into saliency features and edge features, and the Res2Net backbone network finally outputs a saliency feature S0 and an edge feature E0. The Res2Net backbone network comprises 4 layers: layer1, layer2, layer3 and layer4. From a practical point of view, different objects may appear in the picture at different sizes; for example, a table and a computer at different positions have different sizes. Secondly, the object to be detected may carry more information than the area it itself occupies. Introducing the Res2Net backbone network means that each feature layer has multiple receptive fields at different scales, which simulates the brain's perception of salient objects at multiple scales and in different directions in real life, and avoids the problem of the algorithm failing to detect multiple abnormal behaviors present in a video image.
Further, as shown in FIG. 3(a), step S21 includes: the shallow image features first pass through a 1 × 1 convolution kernel, then through a set of convolution kernels that divide the feature map with n channels into 4 groups; the output features of the previous group are sent, together with another group of input feature maps, to the next group of 3 × 3 convolution kernels, and this process is repeated twice until all input feature maps have been processed; finally, the feature maps of the 4 groups are concatenated, and the information is fused together through a 1 × 1 convolution kernel to obtain the multi-scale feature.
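To make this multi-scale structure concrete, a minimal PyTorch sketch of such a Res2Net-style bottleneck is given below. It follows the description above (1 × 1 convolution, 4 channel groups, hierarchical 3 × 3 convolutions, 1 × 1 fusion); the channel width of 64, the ReLU placement and the residual connection are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class Res2NetStyleBlock(nn.Module):
    """Minimal Res2Net-style bottleneck: 1x1 conv, hierarchical 3x3 convs over
    4 channel groups, then a 1x1 conv that fuses the multi-scale features.
    Channel count, activations and the residual path are assumptions."""

    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.scale = scale
        width = channels // scale
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # one 3x3 conv per group except the first group, which is passed through
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
             for _ in range(scale - 1)]
        )
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        feat = self.relu(self.conv_in(x))
        groups = torch.chunk(feat, self.scale, dim=1)  # split n channels into 4 groups
        outputs = [groups[0]]
        prev = None
        for i, conv in enumerate(self.convs):
            # output of the previous group joins the next group's input, then goes
            # through a 3x3 conv, giving progressively larger receptive fields
            inp = groups[i + 1] if prev is None else groups[i + 1] + prev
            prev = self.relu(conv(inp))
            outputs.append(prev)
        fused = self.conv_out(torch.cat(outputs, dim=1))  # concatenate and fuse (1x1)
        return self.relu(fused + x)                       # residual connection (assumed)

feat = torch.randn(1, 64, 56, 56)
print(Res2NetStyleBlock(64)(feat).shape)  # torch.Size([1, 64, 56, 56])
```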
S22, using the saliency feature S0 and the edge feature E0 as supervisory signals, alternately training them in the CRU unit to generate the saliency feature S and the edge feature E. Considering the logical relationship between salient target detection and edge detection, a CRU structure that fuses saliency features and edge features can be employed; in this scheme the CRU unit is combined with the Res2Net network to create more discriminative features. When an input image undergoes multi-stage feature extraction by a CNN (here a general convolutional neural network, which also covers the CRU structure), the deeper the CNN, the more the dispersion of the image features is suppressed. Given that low-level features contain much background and their attention is dispersed over spatial detail, while high-level features are more concentrated on the salient target region, 4 CRU structural units, CRU1, CRU2, CRU3 and CRU4, are stacked end to end. As shown in fig. 3(b), the superposition operation of the CRU units in step S22 is defined by formulas (1)-(4),
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the saliency feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, and ⊙ denotes the dot product.
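Only to illustrate the stacking of the four units end to end, a rough PyTorch sketch follows. The internal operations of each unit shown here (a 3 × 3 convolution, a sigmoid gate and an element-wise dot product between the two branches) are assumptions made purely for illustration; the actual operations of the patent's CRU are those of formulas (1)-(4).

```python
import torch
import torch.nn as nn

class CRUSketch(nn.Module):
    """Illustrative cross-refinement unit: each branch is gated by the other
    via a 3x3 convolution, a sigmoid and an element-wise (dot) product.
    These internal operations are assumptions, not the patent's formulas."""

    def __init__(self, channels):
        super().__init__()
        self.conv_from_sal = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_from_edge = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, s, e):
        s_next = s * torch.sigmoid(self.conv_from_edge(e))  # saliency refined by edge cue
        e_next = e * torch.sigmoid(self.conv_from_sal(s))   # edge refined by saliency cue
        return s_next, e_next

class StackedCRU(nn.Module):
    """Four CRU structural units stacked end to end (CRU1..CRU4), as in step S22."""

    def __init__(self, channels, n_units=4):
        super().__init__()
        self.units = nn.ModuleList([CRUSketch(channels) for _ in range(n_units)])

    def forward(self, s0, e0):
        s, e = s0, e0
        for unit in self.units:
            s, e = unit(s, e)  # after n units: S_i^n and E_i^n for backbone layer i
        return s, e

s0 = torch.randn(1, 64, 56, 56)  # saliency feature S0 from the backbone
e0 = torch.randn(1, 64, 56, 56)  # edge feature E0 from the backbone
s, e = StackedCRU(64)(s0, e0)
```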
S23, in the asymmetric cross module, a loss is computed between the salient feature S and the saliency label Label-S, and a loss is computed between the edge feature E and the edge label Label-E; at the same time, an edge feature is extracted from the salient feature S and a loss is computed between this edge feature and the edge of the saliency label.
For edge extraction from the binary saliency image, a traditional edge detection operator such as Sobel or LoG (Laplacian of Gaussian) is adopted, which is obtained by convolving the original image f(x,y) with two convolution kernels g_x(x,y) and g_y(x,y).
The operator can be decomposed into templates in the vertical and horizontal directions: the former, G_x(x,y), detects the horizontal edges in the image, while the latter, G_y(x,y), detects the vertically oriented edges. In practical application, a convolution operation is performed on every pixel of the image with these two convolution kernels; step S23 is specifically implemented by the following formulas:
G_x(x,y) = f(x,y) * g_x(x,y)    (5)
G_y(x,y) = f(x,y) * g_y(x,y)    (6)
G = F(G_x(x,y), G_y(x,y))    (7)
The function F in formula (7) takes into account the edge features in both the vertical and the horizontal direction, and fuses the two parts according to formula (8) or (9):
F(G_x(x,y), G_y(x,y)) = sqrt(G_x(x,y)^2 + G_y(x,y)^2)    (8)
F(G_x(x,y), G_y(x,y)) = G_x(x,y) + G_y(x,y)    (9)
After the binary image is processed by formula (7), an edge feature map is obtained; the loss function uses a binary cross-entropy loss:
L = -Σ_i [ p(x_i)·log q(x_i) + (1 - p(x_i))·log(1 - q(x_i)) ]    (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
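A small PyTorch sketch of formulas (5)-(10) is given below. The Sobel kernels are the standard ones and the square-root fusion corresponds to formula (8); the tensor shapes, the clamping constants and the choice of fusion are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# Sobel kernels g_x and g_y; formulas (5) and (6) convolve them with f(x, y)
g_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
g_y = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]).view(1, 1, 3, 3)

def edge_map(binary_map, fuse="sqrt"):
    """Edge feature map G of a binary saliency map, formulas (5)-(9)."""
    Gx = F.conv2d(binary_map, g_x, padding=1)        # formula (5)
    Gy = F.conv2d(binary_map, g_y, padding=1)        # formula (6)
    if fuse == "sqrt":                               # formula (8)
        G = torch.sqrt(Gx ** 2 + Gy ** 2 + 1e-8)
    else:                                            # formula (9)
        G = Gx + Gy
    return G.clamp(0.0, 1.0)                         # keep a binary-like range

def bce_loss(estimate, truth):
    """Binary cross-entropy of formula (10): truth is p(x_i), estimate is q(x_i)."""
    return F.binary_cross_entropy(estimate.clamp(1e-7, 1.0 - 1e-7), truth)

# step S23: the edge extracted from the predicted saliency map S is compared
# with the edge of the saliency label Label-S (shapes are assumed)
S = torch.rand(1, 1, 224, 224)
label_S = (torch.rand(1, 1, 224, 224) > 0.5).float()
loss = bce_loss(edge_map(S), edge_map(label_S))
```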
During training of the salient feature extraction, the experiment platform can adopt the Ubuntu 14.04.3 LTS operating system and a 1080Ti graphics card, configured with Python 3.6 and PyTorch (0.4.0). The SGD algorithm is adopted as the optimization function, the number of iterations is 30 and the learning rate is 0.002; a learning rate schedule is used after 20 iterations, multiplying the learning rate by 0.1 to smooth the convergence of the model, and the batch size is set to 8. The DUTS-TR data set is used as the training set, and the test video is the output of real-time video monitoring equipment. After salient feature extraction is performed on the surveillance video, the background information is removed and only the behavior information of the salient region is retained.
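Translated into PyTorch, the training configuration of this paragraph looks roughly as follows. The network and the data are stand-ins (a dummy model and random tensors replace the saliency network and the DUTS-TR loader), and the SGD momentum value and the StepLR schedule are assumptions chosen to realize the stated "×0.1 after 20 iterations" rule.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

class DummySaliencyNet(nn.Module):
    """Stand-in for the saliency network (Res2Net backbone + CRU units +
    asymmetric cross module); it only returns a saliency map and an edge map."""
    def __init__(self):
        super().__init__()
        self.head_s = nn.Conv2d(3, 1, 3, padding=1)
        self.head_e = nn.Conv2d(3, 1, 3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.head_s(x)), torch.sigmoid(self.head_e(x))

model = DummySaliencyNet()
bce = nn.BCELoss()
optimizer = SGD(model.parameters(), lr=0.002, momentum=0.9)  # momentum is assumed
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)       # x0.1 after 20 iterations

for iteration in range(30):                                  # 30 iterations in total
    # in the patent the batches (size 8) come from the DUTS-TR set;
    # random tensors stand in for one batch here
    images = torch.rand(8, 3, 64, 64)
    sal_labels = (torch.rand(8, 1, 64, 64) > 0.5).float()
    edge_labels = (torch.rand(8, 1, 64, 64) > 0.5).float()
    optimizer.zero_grad()
    sal_pred, edge_pred = model(images)
    loss = bce(sal_pred, sal_labels) + bce(edge_pred, edge_labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```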
In this embodiment, as shown in fig. 6, step S3 includes performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure, wherein the network framework of the C3D network structure comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, batch normalization (BatchNorm) and Dropout techniques can be added to the network structure to optimize it;
the model is trained with stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully connected layers provide 4096 outputs, and classification is finally realized by a softmax classification layer.
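A compact sketch of such a recognition network is given below. It follows the stated layout (8 convolutional layers built from R(2+1)D modules, 5 max-pooling layers, 2 fully connected layers with 4096 outputs and a softmax classification layer, with ReLU, BatchNorm and Dropout); the channel widths, the pooling positions and the input clip size of 16 frames at 112 × 112 are assumptions.

```python
import torch
import torch.nn as nn

class R2Plus1DConv(nn.Module):
    """Factorized 3D convolution: a 1x3x3 spatial conv followed by a 3x1x1
    temporal conv, each with BatchNorm and ReLU (see fig. 5)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, (1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
            nn.Conv3d(c_out, c_out, (3, 1, 1), padding=(1, 0, 0), bias=False),
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class BehaviorNetSketch(nn.Module):
    """8 R(2+1)D conv layers, 5 max-pooling layers, 2 FC layers, softmax output.
    Channel widths and pooling positions are illustrative assumptions."""
    def __init__(self, num_classes=2):
        super().__init__()
        chans = [3, 64, 64, 128, 128, 256, 256, 512, 512]
        pools = {1: (1, 2, 2), 2: (2, 2, 2), 4: (2, 2, 2), 6: (2, 2, 2), 8: (2, 2, 2)}
        layers = []
        for i in range(8):
            layers.append(R2Plus1DConv(chans[i], chans[i + 1]))
            if i + 1 in pools:                       # 5 max-pooling layers in total
                layers.append(nn.MaxPool3d(kernel_size=pools[i + 1], stride=pools[i + 1]))
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),            # softmax is applied afterwards
        )

    def forward(self, x):                            # x: (batch, 3, frames, H, W)
        return self.classifier(self.features(x))

clip = torch.randn(1, 3, 16, 112, 112)               # 16 frames of 112x112 (assumed)
probs = torch.softmax(BehaviorNetSketch()(clip), dim=1)   # softmax classification layer
```

nn.LazyLinear is used only so that the sketch does not hard-code the flattened feature size; during training, nn.CrossEntropyLoss would subsume the softmax step.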
At present there are many human behavior databases; video data sets such as Kinetics-400, Kinetics-600 and Sport-1M have a large data volume. As the abnormal behavior data set, the ViF (Violent Flows) video database of crowd violence may be employed. These data sets can be used as training sets for training the network parameters, so that the behavior recognition network can accurately recognize behaviors. The videos in the training set need to be sufficient in number and contain various kinds of behavior information.
With the development of deep learning, researchers have proposed a number of deep-learning-based behavior recognition algorithms for extracting spatio-temporal features from video: those based on a dual-stream network structure and those based on a C3D network structure.
In behavior recognition based on the dual-stream network structure, the dual-stream network is a milestone of deep learning for video human behavior analysis; it adopts 5 convolutional layers and two fully connected layers and extends from static images to video data sets such as UCF101 and HMDB. In this network, the spatial stream carries the behavior information and takes the RGB image as input, while the temporal stream carries the timing information and takes the optical-flow image as input. With the proposal of the VGG network structure, the VGG16 network was taken as the feature extraction network and the fusion of the dual-stream network in time and space was considered. However, when the dual-stream structure is used, the optical-flow images must be generated first, which takes a long time and gives poor real-time performance, so a behavior recognition algorithm based on the C3D network structure can be adopted instead.
3D convolution is formed by extending 2D convolution along the time dimension; its frame diagram is shown in fig. 4: the convolution kernel is correspondingly expanded to 3 dimensions, the frame sequence passes through the kernel in order, three consecutive frames are processed by a kernel of depth 3, and the resulting values are finally mapped onto the feature map. After the feature maps are connected with the frame sequence, the motion features of the person in the video can be obtained, as expressed by formulas (11) and (12), which give the value at pixel point (x, y, z) of the j-th feature map of the i-th layer after 3D convolution; b_ij is the bias, m indexes the feature maps of the previous layer, R, P and Q correspond respectively to the depth (time), length and width of the 3D convolution kernel, and w is the weight of the convolution kernel connecting the feature maps.
The introduction of 3D convolution makes it possible to exploit the temporal relationship between successive frames, i.e. temporal features can be extracted at the same time as spatial features, so there is no need to carry the temporal stream with optical flow; however, it brings problems of computational cost and model storage. A network structure based on 3D convolution therefore needs to be designed so that the network can effectively extract spatio-temporal features while keeping the computational cost as low as possible and the model as small as possible. The R(2+1)D algorithm replaces a 3 × 3 × 3 convolution kernel with a 1 × 3 × 3 spatial convolution kernel and a 3 × 1 × 1 temporal convolution kernel; the decomposition is illustrated in fig. 5. As can be understood from the way the convolution kernels operate, the 1 × 3 × 3 kernel operates on the two-dimensional image at a single time instant, while the 3 × 1 × 1 kernel operates only in the time dimension.
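The saving from this decomposition can be checked directly with nn.Conv3d; the channel count of 64 is an arbitrary illustrative assumption.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

full_3d  = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))   # 3x3x3 kernel
spatial  = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))   # per-frame 2D conv
temporal = nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))   # time dimension only

print(n_params(full_3d))                       # 110656 parameters
print(n_params(spatial) + n_params(temporal))  # 49280 parameters
```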
In summary, the real scene is captured by the camera and transmitted to the monitoring system; after transmission by the monitoring system, the salient features of the video are extracted and the background information in the video is removed, the behavior is recognized by the behavior recognition network, and finally the abnormal behavior is output.
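Putting the pieces together, the deployment described in this paragraph can be sketched as the loop below. The frame capture uses OpenCV; saliency_net, behavior_net, the 0.5 masking threshold, the clip length of 16 frames and the convention that class 0 is "normal" are all placeholders or assumptions standing in for the modules described above.

```python
import cv2
import torch

def recognize_stream(video_source, saliency_net, behavior_net, clip_len=16):
    """S1 read frames -> S2 salient-feature masking -> S3 behavior recognition
    -> S4 output recognized abnormal behavior."""
    cap = cv2.VideoCapture(video_source)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (112, 112))
        x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            sal, _ = saliency_net(x)                     # S2: saliency map of the frame
        frames.append(x * (sal > 0.5).float())           # remove background information
        if len(frames) == clip_len:
            clip = torch.cat(frames, dim=0).permute(1, 0, 2, 3).unsqueeze(0)
            with torch.no_grad():
                probs = torch.softmax(behavior_net(clip), dim=1)   # S3: recognition
            label = int(probs.argmax(dim=1))
            if label != 0:                               # S4: class 0 assumed "normal"
                print("abnormal behavior detected, class", label)
            frames.clear()
    cap.release()
```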
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (6)

1. A video behavior identification method based on salient feature extraction is characterized by comprising the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
2. The video behavior recognition method based on salient feature extraction according to claim 1, wherein the step S2 comprises:
S21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone network, the image features are divided into saliency features and edge features, and the Res2Net backbone network finally outputs a saliency feature S0 and an edge feature E0;
S22, using the saliency feature S0 and the edge feature E0 as supervisory signals, alternately training them in the CRU unit to generate a saliency feature S and an edge feature E;
S23, in the asymmetric cross module, computing a loss between the salient feature S and the saliency label Label-S, computing a loss between the edge feature E and the edge label Label-E, and at the same time extracting an edge feature from the salient feature S and computing a loss between this edge feature and the edge of the saliency label.
3. The video behavior recognition method based on salient feature extraction according to claim 2, wherein step S21 comprises:
the shallow image features first pass through a convolution kernel of 1 × 1, then pass through a set of convolution kernels that divide the feature map with n channels into 4 groups, the output features of the previous group are sent to the next group of convolution kernels of 3 × 3 along with another group of input feature maps, this process is repeated twice until all input feature maps are processed, finally, the feature maps from the 4 groups are concatenated, and the information is fused together through the convolution kernel of 1 × 1, resulting in a multi-scale feature.
4. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the CRU unit consists of 4 CRU structural units stacked end to end, and the superposition operation of the CRU units in step S22 is defined by formulas (1)-(4), where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the saliency feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, and ⊙ denotes the dot product.
5. The video behavior recognition method based on salient feature extraction according to claim 2, wherein step S23 is implemented by the following formulas:
G_x(x,y) = f(x,y) * g_x(x,y)    (5)
G_y(x,y) = f(x,y) * g_y(x,y)    (6)
G = F(G_x(x,y), G_y(x,y))    (7)
the function F in formula (7) takes into account the edge features in both the vertical and the horizontal direction, and fuses the two parts according to formula (8) or (9):
F(G_x(x,y), G_y(x,y)) = sqrt(G_x(x,y)^2 + G_y(x,y)^2)    (8)
F(G_x(x,y), G_y(x,y)) = G_x(x,y) + G_y(x,y)    (9)
after the binary image is processed by formula (7), an edge feature map is obtained; the loss function uses a binary cross-entropy loss:
L = -Σ_i [ p(x_i)·log q(x_i) + (1 - p(x_i))·log(1 - q(x_i)) ]    (10)
wherein p(x_i) is the true value and q(x_i) is the estimated value.
6. The video behavior recognition method based on salient feature extraction according to claim 1, wherein step S3 comprises performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure, wherein the network framework of the C3D network structure comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, batch normalization (BatchNorm) and Dropout techniques can be added to the network structure to optimize it;
the model is trained with stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully connected layers provide 4096 outputs, and classification is finally realized by a softmax classification layer.
CN202010210957.4A 2020-03-24 2020-03-24 Video behavior recognition method based on salient feature extraction Active CN111488805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010210957.4A CN111488805B (en) 2020-03-24 2020-03-24 Video behavior recognition method based on salient feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010210957.4A CN111488805B (en) 2020-03-24 2020-03-24 Video behavior recognition method based on salient feature extraction

Publications (2)

Publication Number Publication Date
CN111488805A true CN111488805A (en) 2020-08-04
CN111488805B CN111488805B (en) 2023-04-25

Family

ID=71794420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010210957.4A Active CN111488805B (en) 2020-03-24 2020-03-24 Video behavior recognition method based on salient feature extraction

Country Status (1)

Country Link
CN (1) CN111488805B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256562A (en) * 2018-01-09 2018-07-06 深圳大学 Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110852295A (en) * 2019-10-15 2020-02-28 深圳龙岗智能视听研究院 Video behavior identification method based on multitask supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王晓芳; 齐春: "A behavior recognition method using saliency detection" (一种运用显著性检测的行为识别方法) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931793A (en) * 2020-08-17 2020-11-13 湖南城市学院 Saliency target extraction method and system
CN111931793B (en) * 2020-08-17 2024-04-12 湖南城市学院 Method and system for extracting saliency target
CN113343760A (en) * 2021-04-29 2021-09-03 暖屋信息科技(苏州)有限公司 Human behavior recognition method based on multi-scale characteristic neural network
CN113205051A (en) * 2021-05-10 2021-08-03 中国科学院空天信息创新研究院 Oil storage tank extraction method based on high spatial resolution remote sensing image
CN113379643A (en) * 2021-06-29 2021-09-10 西安理工大学 Image denoising method based on NSST domain and Res2Net network
CN113379643B (en) * 2021-06-29 2024-05-28 西安理工大学 Image denoising method based on NSST domain and Res2Net network
CN113537375A (en) * 2021-07-26 2021-10-22 深圳大学 Diabetic retinopathy grading method based on multi-scale cascade

Also Published As

Publication number Publication date
CN111488805B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN111488805B (en) Video behavior recognition method based on salient feature extraction
CN113158723B (en) End-to-end video motion detection positioning system
CN111444881A (en) Fake face video detection method and device
US20200012923A1 (en) Computer device for training a deep neural network
CN108805002B (en) Monitoring video abnormal event detection method based on deep learning and dynamic clustering
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
CN107590432A (en) A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
Gunawan et al. Sign language recognition using modified convolutional neural network model
CN106682628B (en) Face attribute classification method based on multilayer depth feature information
Chenarlogh et al. A multi-view human action recognition system in limited data case using multi-stream CNN
KR102309111B1 (en) Ststem and method for detecting abnomalous behavior based deep learning
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN110633624A (en) Machine vision human body abnormal behavior identification method based on multi-feature fusion
CN111160356A (en) Image segmentation and classification method and device
CN112183240A (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
WO2022183805A1 (en) Video classification method, apparatus, and device
CN113936175A (en) Method and system for identifying events in video
Sabater et al. Event Transformer+. A multi-purpose solution for efficient event data processing
CN113255464A (en) Airplane action recognition method and system
Anees et al. Deep learning framework for density estimation of crowd videos
Abdullah et al. Context aware crowd tracking and anomaly detection via deep learning and social force model
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
WO2023164370A1 (en) Method and system for crowd counting
Ragesh et al. Fast R-CNN based Masked Face Recognition for Access Control System

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant