CN111488805A - Video behavior identification method based on saliency feature extraction - Google Patents
- Publication number
- CN111488805A (application CN202010210957.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- image
- edge
- video
- salient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a video behavior identification method based on salient feature extraction, which comprises the steps of: S1, acquiring a video to be identified and converting it into images; S2, extracting the salient features of the images and removing the background information; S3, performing behavior recognition on the images; S4, outputting the recognized abnormal behavior. The method combines the salient object detection method with the behavior recognition method for the first time: on one hand, it extracts the region of interest in the video, retaining the main features while masking the background features; on the other hand, it reduces the amount of computation. Both aspects benefit the detection and recognition of abnormal behaviors.
Description
Technical Field
The invention relates to the technical field of intelligent video monitoring, in particular to a video behavior identification method based on saliency feature extraction.
Background
With the development of the economy and the improvement of legal systems, people pay more attention to preventing criminal behavior that threatens the safety of lives and property. Video monitoring systems have begun to be applied in people's lives, for example for theft prevention in daily life and terrorism prevention in crowded places.
Current behavior recognition methods mainly fall into two classes: methods based on traditional feature extraction and methods based on deep learning. Traditional methods classify behaviors by extracting features such as the Histogram of Optical Flow (HOF), Histogram of Oriented Gradients (HOG) and Motion Boundary Histogram (MBH) from the video. However, their recognition capability is easily affected by illumination intensity and background information; the extracted features have certain limitations and do not provide good recognition capability.
With the development of the times, researchers proposed deep learning, and since tasks in research fields such as vision and hearing can be completed effectively with it, the market has begun to apply deep-learning-based methods in real life; video surveillance is one such application. On the basis of deep learning theory, researchers build a network model and train it on a labeled video data set to obtain a model with recognition capability. Such a model has good generalization and can classify untrained video data. In current methods, however, the video is input into the neural network directly: the video data is not processed, or only exposure and distortion processing is performed, and the feature information carried by the video is not highlighted, which is unfavorable for the detection of abnormal behavior. In a real, complex background this may make abnormal behavior unrecognizable. Moreover, training existing deep-learning network models requires large data sets and high-performance servers, which greatly limits practical video identification work.
In summary, there is a need in the industry to develop a method or system that can highlight the characteristic information carried by the video, reduce the amount of computation, and facilitate the detection and identification of abnormal behavior.
Disclosure of Invention
Aiming at the defect in the prior art that the feature information carried by the video is not highlighted, the invention designs a video behavior identification method based on salient feature extraction.
The specific scheme of the application is as follows:
a video behavior identification method based on salient feature extraction comprises the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
Preferably, step S2 includes:
s21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone, the image features are divided into saliency features and edge features, and the backbone finally outputs saliency feature S0 and edge feature E0;
s22, alternately training the saliency feature S0 and the edge feature E0 in the CRU unit as supervisory signals to generate the saliency feature S and the edge feature E;
s23, in the asymmetric crossing module, a loss is computed between the salient feature S and the salient label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the salient label.
Preferably, step 21 includes: the shallow image features first pass through a 1 × 1 convolution kernel, then through a set of convolution kernels that divide the feature map with n channels into 4 groups; the output features of the previous group are sent, together with another set of input feature maps, to the next group of 3 × 3 convolution kernels, and this process is repeated twice until all input feature maps are processed.
Preferably, the CRU units are 4 CRU structural units stacked end to end, and the formula of the superposition operation of the CRU units in step S22 is defined as:
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the corresponding saliency feature, and ⊙ denotes the dot (element-wise) product.
Preferably, step S23 is implemented as the following formula:
G_x(x, y) = f(x, y) * g_x(x, y) (5)
G_y(x, y) = f(x, y) * g_y(x, y) (6)
G = F(G_x(x, y), G_y(x, y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, fusing the two according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2) (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y) (9)
after the binary image passes through formula (7), an edge feature map is obtained. The loss function uses a binary cross-entropy loss:
Loss = −(1/N) Σ_i [ p(x_i) log q(x_i) + (1 − p(x_i)) log(1 − q(x_i)) ] (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
Preferably, the step S3 includes performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure, whose network framework comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully-connected layers; ReLU, Batch Normalization and Dropout techniques can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully-connected layers provide 4096 outputs, and classification is finally realized by the softmax classification layer.
Compared with the prior art, the invention has the following beneficial effects:
the method combines the salient object detection method and the behavior recognition method for the first time, on one hand, the interesting region in the video is extracted, the main characteristics are reserved, the background characteristics are shielded, on the other hand, the calculation amount is reduced, and the method can be beneficial to the detection and recognition of abnormal behaviors.
Drawings
FIG. 1 is a schematic flow chart diagram of a video behavior recognition method based on salient feature extraction according to an embodiment;
FIG. 2 is a network architecture diagram of significance detection, according to an embodiment.
Fig. 3(a) is a unit structure diagram of a Res2Net backbone network according to an embodiment.
Fig. 3(b) is a unit structure diagram of a CRU unit according to an embodiment.
FIG. 4 is a diagram of a 3D convolution of an embodiment.
FIG. 5 is an exploded view of the R (2+1) D algorithm according to an embodiment.
Fig. 6 is a network framework diagram of a C3D network architecture of an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The Visual Attention Mechanism (VA) refers to the way a human, when facing a scene, automatically processes regions of interest and selectively ignores regions of no interest; the regions of interest are called salient regions, and extracting the salient region of a specific target is called salient object detection. In the face of complex background information in a video or image, extracting the salient features is necessary: the main features can be retained and the background features masked. In abnormal behavior detection, an abnormal part of an image or video can first be detected with a saliency detection method, and the abnormal behavior then identified. The video behavior identification method based on saliency feature extraction in this scheme is built on a network framework composed of a Res2Net backbone network, CRU units and an asymmetric cross module, and specifically comprises the following steps:
referring to fig. 1, a video behavior recognition method based on salient feature extraction includes:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
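To make step S1 concrete, the following is a minimal sketch of how a video might be turned into a fixed-length image sequence (for example the 16-frame clips typically consumed by C3D-style networks). The helper and its name are hypothetical illustrations, not part of the patent, which only states that the video is converted into images:

```python
def sample_frame_indices(n_frames_total, n_frames_clip=16):
    """Uniformly spread `n_frames_clip` frame indices over a video of
    `n_frames_total` frames (hypothetical helper for step S1)."""
    if n_frames_total <= 0:
        raise ValueError("video has no frames")
    step = n_frames_total / n_frames_clip
    # clamp so short videos simply repeat frames instead of overrunning
    return [min(int(i * step), n_frames_total - 1) for i in range(n_frames_clip)]
```

The selected indices would then be decoded into images and passed to the saliency-extraction stage of step S2.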
In the present embodiment, referring to fig. 2, step S2 includes:
s21, inputting the image into the Res2Net backbone network; as the image passes through each layer of the backbone, the image features are divided into saliency features and edge features, and the backbone finally outputs saliency feature S0 and edge feature E0. The Res2Net backbone comprises 4 layers: layer1, layer2, layer3 and layer4. From a practical point of view, different objects may appear in the picture at different sizes; for example, a table or a computer appears at a different size depending on its position. Moreover, the object to be detected may carry more information than the region it occupies. Introducing the Res2Net backbone gives each feature layer multiple receptive fields at different scales, simulating the human brain's perception of salient objects at multiple scales and orientations in real life, and avoiding the failure of the algorithm to detect multiple abnormal behaviors present in a video image.
Further, as shown in FIG. 3(a), step 21 includes the shallow image feature first passing through a convolution kernel of 1 × 1, then passing through a set of convolution kernels that divide the feature map with n channels into 4 groups, the output features of the previous group being sent to the next group of convolution kernels of 3 × 3 along with another set of input feature maps, this process being repeated twice until all input feature maps have been processed, finally, the feature maps from the 4 groups are concatenated, and the information is fused together through a convolution kernel of 1 × 1 to obtain the multi-scale feature.
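The grouped multi-scale processing described above can be sketched as a PyTorch module. This is an illustrative reconstruction of a Res2Net-style unit under the 4-group assumption, not the patent's exact network; the channel widths and the reducing/fusing 1 × 1 kernels are assumptions:

```python
import torch
import torch.nn as nn

class Res2Block(nn.Module):
    """Illustrative Res2Net-style unit (a sketch, not the patent's network):
    a 1x1 reduction, a hierarchy of 3x3 convs over 4 channel groups, and a
    1x1 fusion that merges the multi-scale features."""

    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.width = channels // groups
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)
        # one 3x3 kernel set per group except the first, which passes through
        self.convs = nn.ModuleList(
            nn.Conv2d(self.width, self.width, kernel_size=3, padding=1)
            for _ in range(groups - 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        xs = torch.split(x, self.width, dim=1)
        out, y = [xs[0]], None
        for i, conv in enumerate(self.convs):
            # the previous group's output joins the next group's input maps
            y = conv(xs[i + 1] if y is None else xs[i + 1] + y)
            out.append(y)
        return self.fuse(torch.cat(out, dim=1))
```

Because each later group sees the output of the previous one, the effective receptive field grows across groups, which is the multi-scale property the text attributes to Res2Net.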
S22, alternately training the saliency feature S0 and the edge feature E0 in the CRU unit as supervisory signals to generate the saliency feature S and the edge feature E. Considering the logical relationship between salient object detection and edge detection, a CRU structure that fuses saliency features and edge features may be employed; in this scheme the CRU unit is combined with the Res2Net network to create more discriminative features. When an input image undergoes multi-stage feature extraction by a CNN (here meaning a general convolutional neural network, including the CRU structure), the deeper the network, the more the dispersion of image features is suppressed. Since low-level features contain much background and disperse attention over spatial detail, while high-level features concentrate on salient target regions, 4 CRU structural units (CRU1, CRU2, CRU3 and CRU4) are stacked end to end, as shown in fig. 3(b), and the formula of the superposition operation of the CRU units in step S22 is defined as:
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the corresponding saliency feature, and ⊙ denotes the dot (element-wise) product.
S23, in the asymmetric crossing module, a loss is computed between the salient feature S and the salient label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the salient label.
For the edge extraction of the binary salient image, a traditional edge detection operator such as Sobel or LoG is adopted; the operator consists of two convolution kernels, g_x(x, y) and g_y(x, y), which are convolved with the original image f(x, y).
The operators can be divided into templates in the vertical and horizontal directions: the former, G_x(x, y), detects horizontal edges in the image, while the latter, G_y(x, y), detects vertical edges. In practical application, each pixel in the image is convolved with these two kernels, and step S23 is specifically implemented as the following formulas:
G_x(x, y) = f(x, y) * g_x(x, y) (5)
G_y(x, y) = f(x, y) * g_y(x, y) (6)
G = F(G_x(x, y), G_y(x, y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, fusing the two according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2) (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y) (9)
after the binary image passes through formula (7), an edge feature map is obtained. The loss function uses a binary cross-entropy loss:
Loss = −(1/N) Σ_i [ p(x_i) log q(x_i) + (1 − p(x_i)) log(1 − q(x_i)) ] (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
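Formulas (5)-(7) and the binary cross-entropy loss can be illustrated with a small NumPy sketch. The Sobel kernels and the absolute-value fusion below are illustrative choices (the raw sum of formula (9) lets opposite-sign gradients cancel, so the sketch takes absolute values); none of the names are taken from the patent:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 'valid' 2-D sliding-window correlation (NumPy only),
    standing in for the convolution of formulas (5) and (6)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels standing in for g_x and g_y
g_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
g_y = g_x.T

def edge_map(img, fuse="sum"):
    """Edge feature map G = F(G_x, G_y), formulas (7)-(9)."""
    Gx = conv2d_valid(img, g_x)
    Gy = conv2d_valid(img, g_y)
    if fuse == "sum":                   # additive fusion, cf. formula (9)
        return np.abs(Gx) + np.abs(Gy)
    return np.sqrt(Gx ** 2 + Gy ** 2)   # gradient magnitude, cf. formula (8)

def bce_loss(p, q, eps=1e-7):
    """Binary cross-entropy between label map p and prediction q."""
    q = np.clip(q, eps, 1.0 - eps)
    return float(-np.mean(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)))
```

A flat binary image yields an all-zero edge map, while a step image yields non-zero responses at the step, matching the intent of the edge supervision described above.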
In the training process of the saliency feature extraction, the experimental platform can adopt the Ubuntu 14.04.3 LTS operating system with a 1080Ti graphics card, configured with Python 3.6 and PyTorch (0.4.0). The SGD algorithm is adopted as the optimization function, with 30 iterations and a learning rate of 0.002; a learning-rate schedule is applied after 20 iterations, multiplying the learning rate by 0.1 to smooth the convergence of the model. The batch size is set to 8, and the DUTS-TR data set is used as the training set. The test video is the output of real-time video monitoring equipment; after saliency feature extraction is performed on the monitoring video, the background information is removed and only the behavior information of the salient region is retained.
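The step-decay schedule described here (and again for the recognition network below, where the base learning rate is 0.003) amounts to multiplying the learning rate by 0.1 after every 20 iterations. A minimal sketch, with the function name assumed:

```python
def step_decay_lr(iteration, base_lr, step=20, gamma=0.1):
    """Step-decay schedule from the text: the learning rate is multiplied
    by `gamma` (0.1) after every `step` (20) iterations."""
    return base_lr * gamma ** (iteration // step)
```

For example, with base_lr = 0.002 the rate stays at 0.002 through iteration 19 and drops to 0.0002 at iteration 20, which is the behavior the text describes.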
In the embodiment, as shown in fig. 6, step S3 includes performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure, whose network framework comprises 8 convolutional layers composed of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully-connected layers; ReLU, Batch Normalization and Dropout techniques can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations to make the model converge; the two fully-connected layers provide 4096 outputs, and classification is finally realized by the softmax classification layer.
At present, many human behavior databases are available; video data sets such as Kinetics-400, Kinetics-600 and Sports-1M have large data volumes. For abnormal behavior, the ViF (Violent Flows) video database of crowd violence may be employed. These data sets can be used as training sets for the network parameters, so that the behavior recognition network can accurately recognize behaviors; the videos in the training set need to be sufficient in number and contain various behavior information.
With the development of deep learning, researchers have proposed a number of deep-learning-based behavior recognition algorithms for extracting spatio-temporal features from video: those based on a two-stream network structure, and those based on a C3D network structure.
In behavior recognition based on the two-stream network structure, the two-stream network, a milestone of deep learning for video human-behavior analysis, adopts 5 convolutional layers and two fully-connected layers, extending static-image recognition to video data sets such as UCF101 and HMDB. In this network, the spatial stream carries behavior information with the RGB image as input, and the temporal stream carries timing information with the optical-flow image as input. With the proposal of the VGG network structure, the VGG16 network was taken as the feature extraction network, and the fusion of the two-stream network in time and space was considered. However, with the two-stream structure the optical-flow images must be generated first, which consumes much time and harms real-time performance, so a behavior recognition algorithm based on the C3D network structure can be adopted instead.
The 3D convolution is formed by extending the 2D convolution along the time dimension, and its frame diagram is shown in fig. 4: the convolution kernel is correspondingly expanded to 3 dimensions, the frame sequence passes through the kernel in order, three consecutive frames of images pass through a convolution kernel of depth 3, and the resulting feature values are finally mapped onto the feature map. After the feature map is connected with the frame sequence, the motion features of the person in the video can be obtained, as in formulas (11) and (12); the core 3D-convolution formula can be written as
v_ij^(x,y,z) = b_ij + Σ_m Σ_{r=0}^{R-1} Σ_{p=0}^{P-1} Σ_{q=0}^{Q-1} w_ijm^(r,p,q) v_(i-1)m^((x+r),(y+p),(z+q)) (11)
where v_ij^(x,y,z) represents the feature value at pixel point (x, y, z) of the j-th feature map in the i-th layer after 3D convolution, b_ij is the bias, m indexes the feature maps of the previous layer, R, P and Q correspond respectively to the depth (time), length and width of the 3D convolution kernel, and w is the weight of the convolution kernel connected to the feature map.
The introduction of 3D convolution makes it possible to use the temporal relation between single-frame images; that is, temporal features can be extracted while spatial features are extracted, so there is no need to carry the time stream with optical flow. However, it brings problems of computation cost and model storage. A network structure based on 3D convolution must therefore be designed so that it effectively extracts spatio-temporal features while keeping the computation cost and model storage as low as possible. The R(2+1)D algorithm replaces a 3 × 3 × 3 convolution kernel with a 1 × 3 × 3 spatial convolution kernel and a 3 × 1 × 1 temporal convolution kernel; the decomposition is shown schematically in fig. 5. As can be understood from the way a convolution kernel computes, the 1 × 3 × 3 kernel operates on a two-dimensional image at a single time, while the 3 × 1 × 1 kernel operates only in the time dimension.
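The R(2+1)D factorisation described above can be sketched in PyTorch as follows. This is an illustrative module under the stated kernel shapes; the class name and the intermediate channel width `c_mid` are assumptions rather than details from the patent:

```python
import torch
import torch.nn as nn

class R2Plus1dConv(nn.Module):
    """Factorised 3-D convolution: a 1x3x3 spatial conv followed by a
    3x1x1 temporal conv, replacing a full 3x3x3 kernel (sketch)."""

    def __init__(self, c_in, c_out, c_mid=None):
        super().__init__()
        c_mid = c_mid or c_out
        self.spatial = nn.Conv3d(c_in, c_mid, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))   # acts within each frame
        self.temporal = nn.Conv3d(c_mid, c_out, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))  # acts along time only
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.relu(self.spatial(x)))
```

The nonlinearity between the two convolutions is one of the stated advantages of the decomposition: it doubles the number of nonlinear layers for roughly the same parameter budget as a single 3 × 3 × 3 kernel.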
In summary, the real scene is captured by the camera and transmitted to the monitoring system; the salient features of the video are then extracted, the background information in the video is removed, behaviors are identified by the behavior recognition network, and finally the abnormal behaviors are output.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (6)
1. A video behavior identification method based on salient feature extraction is characterized by comprising the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting the salient features of the image, and removing the background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the recognized abnormal behavior.
2. The video behavior recognition method based on salient feature extraction according to claim 1, wherein the step S2 comprises:
s21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone, the image features are divided into saliency features and edge features, and the backbone finally outputs saliency feature S0 and edge feature E0;
s22, alternately training the saliency feature S0 and the edge feature E0 in the CRU unit as supervisory signals to generate the saliency feature S and the edge feature E;
s23, in the asymmetric crossing module, a loss is computed between the salient feature S and the salient label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the salient label.
3. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the step 21 comprises:
the shallow image features first pass through a convolution kernel of 1 × 1, then pass through a set of convolution kernels that divide the feature map with n channels into 4 groups, the output features of the previous group are sent to the next group of convolution kernels of 3 × 3 along with another group of input feature maps, this process is repeated twice until all input feature maps are processed, finally, the feature maps from the 4 groups are concatenated, and the information is fused together through the convolution kernel of 1 × 1, resulting in a multi-scale feature.
4. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the CRU units are 4 CRU structural units stacked end to end, and the formula of the superposition operation of the CRU units in step S22 is defined as:
where E_i^n represents the edge feature generated after the feature of the i-th layer of Res2Net has passed through n CRU structural units, S_i^n represents the corresponding saliency feature, and ⊙ denotes the dot (element-wise) product.
5. The video behavior recognition method based on salient feature extraction according to claim 2, wherein the step S23 is implemented as the following formula:
G_x(x, y) = f(x, y) * g_x(x, y) (5)
G_y(x, y) = f(x, y) * g_y(x, y) (6)
G = F(G_x(x, y), G_y(x, y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, fusing the two according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2) (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y) (9)
after the binary image passes through formula (7), an edge feature map is obtained; the loss function uses a binary cross-entropy loss:
L=-Σi[p(xi)log q(xi)+(1-p(xi))log(1-q(xi))]
wherein p(xi) is the true value and q(xi) is the estimated value.
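A short NumPy sketch of formulas (5)-(9) and the binary cross-entropy loss follows. The Sobel kernels used for gx and gy are an assumption for illustration, since the claim does not specify the kernels:

```python
import numpy as np

def conv2d(img, k):
    """'Same' 2D convolution with zero padding (the * in formulas (5)-(6))."""
    h, w = img.shape
    p = np.pad(img, 1)
    kf = k[::-1, ::-1]  # flip the kernel: true convolution, not correlation
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i+3, j:j+3] * kf)
    return out

# Assumed Sobel kernels for the vertical and horizontal derivatives
gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
gy = gx.T

f = np.zeros((8, 8)); f[:, 4:] = 1.0      # binary image with a vertical edge
Gx, Gy = conv2d(f, gx), conv2d(f, gy)     # formulas (5) and (6)

G_mag = np.hypot(Gx, Gy)                  # formula (8): gradient magnitude
G_sum = Gx + Gy                           # formula (9): additive fusion

def bce(p, q, eps=1e-7):
    """Binary cross-entropy between label map p and prediction q."""
    q = np.clip(q, eps, 1 - eps)
    return -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))

print(bce(f, f))  # near-zero loss when the prediction matches the label
```

On this test image only Gx responds (the edge is vertical), while Gy stays zero away from the borders; either fusion variant then recovers the edge map.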
6. The video behavior recognition method based on salient feature extraction according to claim 1, wherein the step S3 comprises performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure, the network framework of which comprises 8 convolutional layers consisting of R(2+1)D convolution modules, 5 max-pooling layers and 2 fully-connected layers; ReLU, BatchNorm and Dropout techniques can be added to optimize the network structure;
the model is trained with stochastic gradient descent at an initial learning rate of 0.003; the total number of iterations is 100, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; each of the two fully-connected layers provides 4096 outputs, and classification is finally performed by the softmax classification layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010210957.4A CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111488805A true CN111488805A (en) | 2020-08-04 |
CN111488805B CN111488805B (en) | 2023-04-25 |
Family
ID=71794420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010210957.4A Active CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488805B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN110852295A (en) * | 2019-10-15 | 2020-02-28 | 深圳龙岗智能视听研究院 | Video behavior identification method based on multitask supervised learning |
Non-Patent Citations (1)
Title |
---|
王晓芳;齐春;: "一种运用显著性检测的行为识别方法" * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931793A (en) * | 2020-08-17 | 2020-11-13 | 湖南城市学院 | Saliency target extraction method and system |
CN111931793B (en) * | 2020-08-17 | 2024-04-12 | 湖南城市学院 | Method and system for extracting saliency target |
CN113343760A (en) * | 2021-04-29 | 2021-09-03 | 暖屋信息科技(苏州)有限公司 | Human behavior recognition method based on multi-scale characteristic neural network |
CN113205051A (en) * | 2021-05-10 | 2021-08-03 | 中国科学院空天信息创新研究院 | Oil storage tank extraction method based on high spatial resolution remote sensing image |
CN113379643A (en) * | 2021-06-29 | 2021-09-10 | 西安理工大学 | Image denoising method based on NSST domain and Res2Net network |
CN113379643B (en) * | 2021-06-29 | 2024-05-28 | 西安理工大学 | Image denoising method based on NSST domain and Res2Net network |
CN113537375A (en) * | 2021-07-26 | 2021-10-22 | 深圳大学 | Diabetic retinopathy grading method based on multi-scale cascade |
Also Published As
Publication number | Publication date |
---|---|
CN111488805B (en) | 2023-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111488805B (en) | Video behavior recognition method based on salient feature extraction | |
CN113158723B (en) | End-to-end video motion detection positioning system | |
CN111444881A (en) | Fake face video detection method and device | |
US20200012923A1 (en) | Computer device for training a deep neural network | |
CN108805002B (en) | Monitoring video abnormal event detection method based on deep learning and dynamic clustering | |
CN110929622A (en) | Video classification method, model training method, device, equipment and storage medium | |
CN107590432A (en) | A kind of gesture identification method based on circulating three-dimensional convolutional neural networks | |
CN112507990A (en) | Video time-space feature learning and extracting method, device, equipment and storage medium | |
Gunawan et al. | Sign language recognition using modified convolutional neural network model | |
CN106682628B (en) | Face attribute classification method based on multilayer depth feature information | |
Chenarlogh et al. | A multi-view human action recognition system in limited data case using multi-stream CNN | |
KR102309111B1 (en) | Ststem and method for detecting abnomalous behavior based deep learning | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN110633624A (en) | Machine vision human body abnormal behavior identification method based on multi-feature fusion | |
CN111160356A (en) | Image segmentation and classification method and device | |
CN112183240A (en) | Double-current convolution behavior identification method based on 3D time stream and parallel space stream | |
WO2022183805A1 (en) | Video classification method, apparatus, and device | |
CN113936175A (en) | Method and system for identifying events in video | |
Sabater et al. | Event Transformer+. A multi-purpose solution for efficient event data processing | |
CN113255464A (en) | Airplane action recognition method and system | |
Anees et al. | Deep learning framework for density estimation of crowd videos | |
Abdullah et al. | Context aware crowd tracking and anomaly detection via deep learning and social force model | |
CN114120076B (en) | Cross-view video gait recognition method based on gait motion estimation | |
WO2023164370A1 (en) | Method and system for crowd counting | |
Ragesh et al. | Fast R-CNN based Masked Face Recognition for Access Control System |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||