CN111488805B - Video behavior recognition method based on salient feature extraction - Google Patents
- Publication number
- CN111488805B (application CN202010210957.4A)
- Authority
- CN
- China
- Prior art keywords
- image
- edge
- feature
- features
- salient
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06F18/213 — Pattern recognition; feature extraction, e.g. by transforming the feature space; summarisation; mappings, e.g. subspace methods
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/045 — Neural networks; combinations of networks
- Y02T10/40 — Climate change mitigation technologies related to transportation; engine management systems
Abstract
The invention relates to a video behavior recognition method based on salient feature extraction, which comprises the following steps: S1, obtaining the video to be recognized and converting it into images; S2, extracting the salient features of the images and removing the background information in the images; S3, performing behavior recognition on the images; S4, outputting the recognized abnormal behavior. The method combines, for the first time, a salient object detection method with a behavior recognition method: on the one hand, the region of interest in the video is extracted, the main features are retained and the background features are masked; on the other hand, the amount of computation is reduced.
Description
Technical Field
The invention relates to the technical field of intelligent video surveillance, and in particular to a video behavior recognition method based on salient feature extraction.
Background
With the development of the economy and the improvement of legal systems, people pay more attention to preventing criminal behavior that threatens life and property. Video surveillance systems are now being applied in people's daily lives, for example for theft protection and for preventing terrorist incidents in crowded places.
At present, behavior recognition methods fall into two main categories: methods based on traditional feature extraction and methods based on deep learning. Traditional methods classify behaviors by extracting hand-crafted features from the video, such as Histograms of Optical Flow (HOF), Histograms of Oriented Gradients (HOG) and Motion Boundary Histograms (MBH). However, such features are easily affected by illumination intensity and background information; they have inherent limitations and do not provide good recognition capability.
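As a concrete illustration of the traditional route, the sketch below computes a toy HOG-style descriptor: a histogram of gradient orientations, weighted by gradient magnitude, over a grayscale patch. It is a minimal stand-in for the HOG/HOF/MBH descriptors cited above, not the exact features from the literature (no cell/block structure; 8 unsigned orientation bins are assumed here).

```python
import numpy as np

def gradient_histogram(patch, n_bins=8):
    """Toy HOG-style descriptor: histogram of gradient orientations,
    weighted by gradient magnitude, over a single grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))      # derivatives along rows, cols
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), as in classic HOG.
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bin_idx.ravel(), magnitude.ravel())
    # L2-normalise so the descriptor is less sensitive to intensity scaling.
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

patch = np.tile(np.arange(8.0), (8, 1))   # horizontal intensity ramp
h = gradient_histogram(patch)
```

On a patch whose intensity rises only along x, all gradient energy falls into the first orientation bin; the descriptor sees only local gradients, which is exactly why such features are sensitive to illumination and background.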
With the advance of deep learning, which has proved effective in research fields such as vision and hearing, deep-learning-based methods have been applied in real life, video surveillance being one of them. Researchers build a network model on deep-learning theory and train it on labeled video data to obtain a model with recognition capability; such a model generalizes well and can classify untrained video data. Current practice, however, feeds the video directly into the neural network without processing the video data, or only applies exposure and distortion transformations; it does not highlight the feature information carried by the video, which hinders the detection of abnormal behaviors. In a truly complex background this can make abnormal behavior unrecognizable. Moreover, training deep network models requires large datasets and high-performance servers, which greatly limits practical video recognition work.
In summary, there is an urgent need in the industry to develop a method or system that can highlight feature information carried by video, reduce the amount of computation, and facilitate detection and identification of abnormal behaviors.
Disclosure of Invention
Aiming at the defect in the prior art that the feature information carried by the video is not highlighted, the invention provides a video behavior recognition method based on salient feature extraction.
The specific scheme of the application is as follows:
a video behavior recognition method based on salient feature extraction comprises the following steps:
S1, acquiring a video to be identified, and converting the video to be identified into an image;
S2, extracting salient features of the image and removing background information in the image;
S3, performing behavior recognition on the image;
S4, outputting the identified abnormal behavior.
Preferably, step S2 includes:
S21, inputting the image into a Res2Net backbone network; at each layer of the Res2Net backbone network the image features are divided into salient features and edge features, and the backbone network finally outputs the salient features S0 and the edge features E0;
S22, alternately training the salient features S0 and the edge features E0 in a CRU unit as supervision signals for each other, generating the salient features S and the edge features E;
S23, in the asymmetric cross module, computing a loss between the salient features S and the saliency label Label-S and a loss between the edge features E and the edge label Label-E; meanwhile, extracting an edge feature from the salient features S and computing a loss between it and the edge of the saliency label.
Preferably, step S21 comprises: the shallow image features first pass through a 1×1 convolution kernel; the feature map with n channels is then divided into 4 groups, each with its own group of 3×3 convolution kernels. The output features of one group are sent, together with the input feature maps of the next group, to that group's 3×3 convolution kernels, and this process is repeated twice until all input feature maps are processed. Finally, the feature maps of the 4 groups are concatenated and their information is fused by a 1×1 convolution kernel to obtain the multi-scale features.
Preferably, the CRU unit consists of 4 CRU structural units stacked end to end. In the stacking operation of step S22, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU structural units, S_i^n denotes the saliency feature generated after the features of the i-th layer pass through n CRU structural units, and ⊙ denotes the dot (element-wise) product.
Preferably, the implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
the function F in formula (7) takes both the vertical-direction edge features and the horizontal-direction edge features into account, and fuses the features of the horizontal and vertical directions according to formula (8) or (9);
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
after the binary image passes through formula (7), an edge feature map can be obtained; the loss function uses a binary cross entropy loss:
L = -Σ_i [p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i))]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
Preferably, step S3 includes: performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure; the network frame of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; over 100 iterations in total, the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the salient target detection method and the behavior recognition method are combined for the first time, on one hand, the interested region in the video is extracted, the main characteristics are reserved, the background characteristics are shielded, and on the other hand, the operation amount is reduced.
Drawings
FIG. 1 is a schematic flow chart of a video behavior recognition method based on salient feature extraction according to an embodiment;
FIG. 2 is a network structure diagram of saliency detection according to an embodiment;
FIG. 3(a) is a unit structure diagram of the Res2Net backbone network according to an embodiment, and FIG. 3(b) is a unit structure diagram of the CRU unit according to an embodiment;
FIG. 4 is a 3D convolution diagram according to an embodiment;
FIG. 5 is a decomposition diagram of the R(2+1)D algorithm according to an embodiment;
FIG. 6 is a network frame diagram of the C3D network structure according to an embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Visual saliency (visual attention mechanism, VA) refers to the way humans, when facing a scene, automatically process regions of interest while selectively ignoring uninteresting regions; these regions of interest are called salient regions. Extracting the salient region of a specific target is called salient object detection. In the face of the complex background information in a video or image, extracting salient features is necessary: the main features can be retained and the background features masked. In abnormal behavior detection, the abnormal portion of an image or video can first be located by saliency detection, and the abnormal behavior then identified. The video behavior recognition method based on salient feature extraction in this scheme is built on a network framework consisting of a Res2Net backbone network, CRU units and an asymmetric cross module, and specifically comprises the following steps:
referring to fig. 1, a video behavior recognition method based on salient feature extraction includes:
S1, acquiring a video to be identified, and converting the video to be identified into an image;
S2, extracting salient features of the image and removing background information in the image;
S3, performing behavior recognition on the image;
S4, outputting the identified abnormal behavior.
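The four steps above can be sketched as one pipeline. Everything below is a hypothetical skeleton: the real S2 is the Res2Net/CRU saliency network and the real S3 is the C3D recognition network described later, replaced here by trivial stubs so only the control flow of S1-S4 is shown.

```python
# Hypothetical skeleton of steps S1-S4; every helper is a stand-in stub,
# not the networks the patent actually specifies.

def video_to_frames(video):
    """S1: treat the video as a sequence of frames (stub)."""
    return list(video)

def extract_salient_region(frame):
    """S2: keep only pixels flagged salient, discarding the background.
    Here a 'frame' is a dict of pixel -> (value, is_salient)."""
    return {p: v for p, (v, salient) in frame.items() if salient}

def recognize_behavior(salient_frames):
    """S3: classify the behavior from the salient regions (stub rule)."""
    activity = sum(len(f) for f in salient_frames)
    return "abnormal" if activity > 3 else "normal"

def recognize_video(video):
    frames = video_to_frames(video)                        # S1
    salient = [extract_salient_region(f) for f in frames]  # S2
    label = recognize_behavior(salient)                    # S3
    return label if label == "abnormal" else None          # S4: output abnormal only

video = [
    {(0, 0): (255, True), (0, 1): (12, False), (1, 0): (200, True)},
    {(0, 0): (250, True), (0, 1): (10, False), (1, 0): (190, True)},
]
result = recognize_video(video)
```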
In this embodiment, referring to fig. 2, step S2 includes:
S21, inputting the image into a Res2Net backbone network; at each layer of the Res2Net backbone network the image features are divided into salient features and edge features, and the backbone network finally outputs the salient features S0 and the edge features E0. The Res2Net backbone comprises 4 layers: layer1, layer2, layer3 and layer4. From a practical point of view, objects can appear in a picture at different sizes; for example, dining tables at different positions and a computer differ in size. Moreover, the object to be detected may carry more information than the area it occupies. Introducing the Res2Net backbone network gives each feature layer several receptive fields of different scales, imitating the brain's perception of salient targets of different scales and orientations in real life, and thereby avoids the problem that the algorithm fails to detect several abnormal behaviors in the video images.
Still further, as shown in part (a) of fig. 3, step S21 comprises: the shallow image features first pass through a 1×1 convolution kernel; the feature map with n channels is then divided into 4 groups, each with its own group of 3×3 convolution kernels. The output features of one group are sent, together with the input feature maps of the next group, to that group's 3×3 convolution kernels, and this process is repeated twice until all input feature maps are processed. Finally, the feature maps of the 4 groups are concatenated and their information is fused by a 1×1 convolution kernel to obtain the multi-scale features.
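A minimal sketch of the split-and-fuse idea just described, under simplifying assumptions: the learned 3×3 convolutions are replaced by a fixed 3×3 box filter, the leading and trailing 1×1 convolutions are omitted, and a single image of shape (channels, height, width) is processed.

```python
import numpy as np

def conv3x3(x):
    """Stand-in for a learned 3x3 convolution: a 3x3 box filter with
    zero padding, applied per channel (shape-preserving)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def res2net_split_block(x):
    """Res2Net-style multi-scale block (sketch): split the channels into
    4 groups; each group after the first also sees the previous group's
    output, so later groups gain progressively larger receptive fields."""
    x1, x2, x3, x4 = np.split(x, 4, axis=0)
    y1 = x1                   # group 1: identity
    y2 = conv3x3(x2)          # group 2: one 3x3
    y3 = conv3x3(x3 + y2)     # group 3: effective 5x5 receptive field
    y4 = conv3x3(x4 + y3)     # group 4: effective 7x7 receptive field
    return np.concatenate([y1, y2, y3, y4], axis=0)  # fuse (1x1 conv omitted)

x = np.random.rand(8, 6, 6)   # 8 channels -> 4 groups of 2
y = res2net_split_block(x)
```

The shape is preserved while the four groups carry features at four different scales, which is what gives each layer several receptive fields at once.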
S22, the salient features S0 and the edge features E0 are alternately trained in a CRU unit as supervision signals for each other, generating the salient features S and the edge features E. Considering the logical relationship between salient object detection and edge detection, a CRU structure that fuses saliency features and edge features can be adopted; in this scheme the CRU units are combined with the Res2Net network to produce more discriminative features. As the input image undergoes multi-stage feature extraction in the CNN (here meaning both a common convolutional neural network and the CRU structure), the deeper the network, the more the dispersion of the image features is suppressed: the low-level features contain many background spatial details and more dispersed attention, while the high-level features focus on the salient target region. Therefore 4 CRU structural units, CRU1, CRU2, CRU3 and CRU4, are stacked end to end. As shown in part (b) of fig. 3, in the stacking operation of step S22, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU structural units, S_i^n denotes the saliency feature generated after the features of the i-th layer pass through n CRU structural units, and ⊙ denotes the dot (element-wise) product.
S23, in the asymmetric cross module, a loss is computed between the saliency feature S and the saliency label Label-S, and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the saliency feature S and a loss is computed between it and the edge of the saliency label. Although within the CRU unit the saliency feature S already incorporates the edge feature E to compensate for the edge information lost by the salient object, the effect of this combination on S is limited, and the contribution of E to S cannot be assessed directly. Seen from the output layer of the network, salient object detection evaluates only the saliency feature S, with the edge feature E serving merely as an auxiliary signal; this is unfavorable to the extraction of the edge features of S. Therefore, in addition to the CRU structure, this scheme also extracts the edge information of the salient target during training and computes a loss against the edge label information, thereby realizing a double cross fusion of the salient object detection network and the edge detection network.
For the edge extraction of binary saliency images, traditional edge detection operators such as Sobel and LoG are adopted; they consist of two convolution kernels, g_x(x, y) and g_y(x, y), which are convolved with the original image f(x, y).
The operators can be divided into a vertical-direction template and a horizontal-direction template: the former, G_x(x, y), detects the edges in the horizontal direction of the image, and the latter, G_y(x, y), detects the edges in the vertical direction. In practical application, every pixel point in the image is convolved with these two convolution kernels; the specific implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
the function F in formula (7) takes both the vertical-direction edge features and the horizontal-direction edge features into account, and fuses the features of the horizontal and vertical directions according to formula (8) or (9);
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
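Formulas (5)-(7), with the additive fusion of formula (9), can be sketched as follows. The convolution is written in the cross-correlation form used by most image libraries (kernels applied without flipping), with zero padding so the output keeps the input size; the test image with a vertical step edge is an assumed toy input.

```python
import numpy as np

def convolve2d(f, g):
    """'Same'-size 2-D filtering with zero padding (formulas (5)-(6)),
    in cross-correlation form as used by most image libraries."""
    kh, kw = g.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(f, ((ph, ph), (pw, pw)))
    h, w = f.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * g)
    return out

g_x = np.array([[-1, 0, 1],
                [-2, 0, 2],
                [-1, 0, 1]], dtype=float)   # Sobel template, derivative along x
g_y = g_x.T                                 # Sobel template, derivative along y

f = np.zeros((6, 6))
f[:, 3:] = 1.0                              # binary image with a step edge

G_x = convolve2d(f, g_x)
G_y = convolve2d(f, g_y)
G = G_x + G_y                               # fusion rule of formula (9)
```

Only G_x responds at the step (the image is constant along y away from the borders), so the fused map G localises the edge exactly where the step sits.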
after the binary image passes through formula (7), an edge feature map can be obtained; the loss function uses a binary cross entropy loss:
L = -Σ_i [p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i))]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
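A plain-Python sketch of the binary cross-entropy loss between the true values p(x_i) and the estimates q(x_i), averaged over pixels; the clamping constant eps is an implementation detail assumed here to avoid log(0).

```python
import math

def binary_cross_entropy(p, q, eps=1e-12):
    """Binary cross-entropy between true values p(x_i) and estimates
    q(x_i): -(1/N) * sum(p*log(q) + (1-p)*log(1-q))."""
    total = 0.0
    for pi, qi in zip(p, q):
        qi = min(max(qi, eps), 1 - eps)   # clamp to avoid log(0)
        total += pi * math.log(qi) + (1 - pi) * math.log(1 - qi)
    return -total / len(p)

# A confident, correct prediction gives a small loss; a confident,
# wrong prediction gives a large one.
good = binary_cross_entropy([1.0, 0.0], [0.9, 0.1])
bad = binary_cross_entropy([1.0, 0.0], [0.1, 0.9])
```

The asymmetry between the two cases (about 0.105 versus 2.303) is what drives the predicted saliency and edge maps toward their labels during training.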
In the training process of the salient feature extraction, the experimental platform can use the Ubuntu 14.04.3 LTS operating system with a 1080Ti graphics card, configured with Python 3.6 and PyTorch 0.4.0. The SGD algorithm is adopted as the optimizer, with 30 iterations and a learning rate of 0.002; a learning-rate schedule multiplies the learning rate by 0.1 after 20 iterations to help the optimized model converge, and the batch size is set to 8. The DUTS-TR dataset is used as the training set. The test video is the real output of a video monitoring device. After the salient features of the monitoring video are extracted, the background information is removed and only the behavior information of the salient region is retained.
In this embodiment, as shown in fig. 6, step S3 includes: performing behavior recognition on the image with a behavior recognition algorithm based on a C3D network structure; the network frame of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; over 100 iterations in total, the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
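The step schedule described above (initial learning rate 0.003, multiplied by 0.1 every 20 of the 100 iterations) can be written out directly; the function name and its closed form are ours, the constants are from the text.

```python
def learning_rate(iteration, base_lr=0.003, drop_every=20, factor=0.1):
    """Step schedule: start at base_lr and multiply by `factor`
    every `drop_every` iterations."""
    return base_lr * factor ** (iteration // drop_every)

# The full 100-iteration schedule used for the recognition network.
schedule = [learning_rate(i) for i in range(100)]
```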
At present there are many human behavior databases; large-scale video datasets include Kinetics-400, Kinetics-600 and Sports-1M. For the abnormal behavior dataset, the VIF (Violent Flows) video database of crowd violence can be employed. These datasets can be used as training sets for the network parameters so that the behavior recognition network can recognize behaviors accurately; the videos in the training set need to be numerous enough to contain various kinds of behavior information.
With the development of deep learning, researchers have proposed numerous deep-learning-based behavior recognition algorithms for extracting spatiotemporal features from video, mainly of two kinds: those based on a two-stream (dual-flow) network structure and those based on a C3D network structure.
In behavior recognition based on a two-stream network structure, the two-stream network, a milestone of deep learning for analyzing the behavior of people in video, adopts 5 convolution layers and 2 fully connected layers and extends static-image recognition to video data such as UCF101 and HMDB. In this network, the spatial stream takes the RGB image as input, while the temporal stream takes the optical flow image, which carries the timing information, as input. With the proposal of the VGG network structure, the VGG16 network was taken as the feature extraction network, and the fusion of the two streams in time and space was considered. However, with a two-stream structure the optical flow images must be generated first, which consumes a relatively long time and gives relatively poor real-time performance; a behavior recognition algorithm based on the C3D network structure can therefore be adopted instead.
The 3D convolution is formed by extending the 2D convolution with a time dimension, with a frame diagram as shown in fig. 4: the convolution kernel is correspondingly expanded to 3 dimensions, the frame sequence passes through the convolution kernel in order, three continuous images pass through a convolution kernel of depth 3, and the feature values are finally mapped onto the feature maps. After the feature maps are connected to the frame sequence, the motion features of the person in the video can be obtained, as in formulas (11) and (12):
where v_ij^{xyz} denotes the feature value at pixel point (x, y, z) of the j-th feature map of the i-th layer after the 3D convolution; b_ij is the bias; m indexes the feature maps of the previous layer; R, P and Q correspond to the depth (time), length and width of the 3D convolution kernel; and w is the convolution-kernel weight of the feature map connection.
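A single-channel sketch of the 3D convolution just defined (one input feature map, no bias or activation, 'valid' borders); the temporal-difference kernel below is an assumed toy example showing how a depth-3 kernel reacts to motion across three consecutive frames.

```python
import numpy as np

def conv3d_valid(frames, kernel):
    """Single-channel 3D convolution sketch: slide an R x P x Q kernel
    over a frame stack along time, height and width ('valid' borders)."""
    R, P, Q = kernel.shape
    T, H, W = frames.shape
    out = np.zeros((T - R + 1, H - P + 1, W - Q + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(frames[t:t + R, y:y + P, x:x + Q] * kernel)
    return out

kernel = np.zeros((3, 3, 3))
kernel[0, 1, 1], kernel[2, 1, 1] = -1.0, 1.0   # temporal-difference detector

static = np.ones((5, 4, 4))                    # 5 identical frames: no motion
motion = conv3d_valid(static, kernel)

moving = np.stack([np.full((4, 4), float(t)) for t in range(5)])  # brightening
motion2 = conv3d_valid(moving, kernel)
```

A static scene yields a zero response while a changing one does not, which is exactly why the 3D kernel can extract temporal features without a separate optical-flow stream.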
Because of the introduction of the 3D convolution, the timing relation between single-frame images can be exploited: temporal features can be extracted at the same time as spatial features, so there is no need to use optical flow to carry the time stream. This, however, brings problems of computational cost and model storage. Therefore, when designing a network structure based on 3D convolution, the network must both extract spatiotemporal features effectively and keep the computational cost as low and the model storage as small as possible. The R(2+1)D algorithm replaces the 3×3×3 convolution kernel with a 1×3×3 spatial convolution kernel and a 3×1×1 temporal convolution kernel; its decomposition is shown schematically in fig. 5. As can be understood from the way the convolution is computed, the 1×3×3 convolution kernel operates on the two-dimensional image at a single time step, while the 3×1×1 convolution kernel operates only in the time dimension.
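The parameter bookkeeping behind this decomposition can be checked directly. With an intermediate channel width M chosen as suggested in the R(2+1)D literature (an assumption here, as the text does not fix it), the factorised 1×3×3 plus 3×1×1 pair matches the parameter count of a full 3×3×3 kernel while allowing an extra nonlinearity between the two steps.

```python
import math

def params_3d(c_in, c_out, t=3, d=3):
    """Parameters of a full t x d x d 3D convolution."""
    return c_in * c_out * t * d * d

def params_2plus1d(c_in, c_out, m, t=3, d=3):
    """Parameters of the (2+1)D factorisation: a 1 x d x d spatial
    convolution into m channels followed by a t x 1 x 1 temporal one."""
    return c_in * m * d * d + m * c_out * t

def matched_m(c_in, c_out, t=3, d=3):
    """Intermediate width that keeps the parameter count comparable
    (the choice suggested in the R(2+1)D literature)."""
    return math.floor(t * d * d * c_in * c_out / (d * d * c_in + t * c_out))

c_in = c_out = 64
m = matched_m(c_in, c_out)
full = params_3d(c_in, c_out)
factored = params_2plus1d(c_in, c_out, m)
```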
In conclusion, the real scene is captured by the camera and transmitted to the monitoring system; after transmission, the salient features of the video are extracted and the background information in the video is removed, the behaviors are recognized by the behavior recognition network, and finally the abnormal behaviors are output.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.
Claims (4)
1. The video behavior recognition method based on the salient feature extraction is characterized by comprising the following steps of:
S1, acquiring a video to be identified, and converting the video to be identified into an image;
S2, extracting salient features of the image and removing background information in the image;
the step S2 includes:
S21, inputting the image into a Res2Net backbone network; at each layer of the Res2Net backbone network the image features are divided into salient features and edge features, and the backbone network finally outputs the salient features S0 and the edge features E0;
S22, alternately training the salient features S0 and the edge features E0 in a CRU unit as supervision signals for each other, generating the salient features S and the edge features E;
S23, in the asymmetric cross module, computing a loss between the salient features S and the saliency label Label-S and a loss between the edge features E and the edge label Label-E; meanwhile, extracting an edge feature from the salient features S and computing a loss between it and the edge of the saliency label;
the CRU unit consists of 4 CRU structural units stacked end to end; in the stacking operation of step S22, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU structural units, S_i^n denotes the saliency feature generated after the features of the i-th layer pass through n CRU structural units, and ⊙ denotes the dot (element-wise) product;
S3, performing behavior recognition on the image;
S4, outputting the identified abnormal behavior.
2. The video behavior recognition method based on salient feature extraction of claim 1, wherein step S21 comprises:
the shallow image features firstly pass through a convolution kernel of 1 multiplied by 1, then pass through a convolution kernel group which divides a feature map with n channels into 4 groups, the output features of the former group and the input feature map of the other group are transmitted to the convolution kernel of the next group of 3 multiplied by 3, and the process is repeated twice until all the input feature maps are processed; finally, the feature maps from the 4 groups are connected, and the information is fused together through a convolution kernel of 1×1 to obtain the multi-scale feature.
3. The video behavior recognition method based on salient feature extraction of claim 1, wherein the implementation of step S23 is as follows:
G_x(x,y) = f(x,y) * g_x(x,y) (5)
G_y(x,y) = f(x,y) * g_y(x,y) (6)
G = F(G_x(x,y), G_y(x,y)) (7)
the function F in formula (7) takes into account both the vertical-direction and horizontal-direction edge features, and fuses the features of the two directions according to formula (8) or formula (9):
F(G_x(x,y), G_y(x,y)) = sqrt(G_x(x,y)^2 + G_y(x,y)^2) (8)
F(G_x(x,y), G_y(x,y)) = G_x(x,y) + G_y(x,y) (9)
an edge feature map is obtained after the binary image passes through formula (7); the loss function uses binary cross entropy:
Loss = -Σ_i [ p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i)) ]
wherein p(x_i) is the true value and q(x_i) is the estimated value.
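A small sketch of the edge extraction and loss of step S23. The claim does not name the gradient operator, so the standard Sobel kernels are assumed for g_x and g_y; the fusion follows the additive form of formula (9), with absolute values taken here (an assumption) so gradients of either sign register:

```python
import numpy as np

# Sobel kernels for the horizontal and vertical gradients (assumption).
GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
GY = GX.T

def conv2d(f, g):
    # 'Valid' 2-D cross-correlation implementing f(x,y) * g(x,y) in (5)-(6).
    h, w = f.shape[0] - 2, f.shape[1] - 2
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(f[i:i+3, j:j+3] * g)
    return out

def edge_map(f):
    gx = conv2d(f, GX)              # formula (5)
    gy = conv2d(f, GY)              # formula (6)
    return np.abs(gx) + np.abs(gy)  # formula (7), fused per formula (9)

def bce_loss(p, q, eps=1e-7):
    # Binary cross entropy between true map p and estimated map q.
    q = np.clip(q, eps, 1 - eps)
    return -np.mean(p * np.log(q) + (1 - p) * np.log(1 - q))
```

On a binary image with a single vertical step edge, the horizontal Sobel response is constant along the edge and the vertical response vanishes, as expected.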
4. The video behavior recognition method based on salient feature extraction of claim 1, wherein step S3 comprises: performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure; the network framework of the C3D network structure is composed of 8 convolution layers built from R(2+1)D convolution modules, 5 maximum pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout techniques are added to optimize the network structure;
the model is trained with stochastic gradient descent at an initial learning rate of 0.003; training runs for 100 iterations in total, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers each have 4096 outputs, and classification is finally achieved by a softmax classification layer.
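The stated schedule (initial rate 0.003, multiplied by 0.1 every 20 of the 100 iterations) is an ordinary step decay and can be written as a small function; the name `learning_rate` is illustrative:

```python
def learning_rate(iteration, base_lr=0.003, step=20, gamma=0.1):
    # Step-decay schedule described in the claim: the base rate is
    # scaled by gamma once per completed block of `step` iterations.
    return base_lr * (gamma ** (iteration // step))
```

Over the 100 iterations this yields rates 0.003, 3e-4, 3e-5, 3e-6 and 3e-7 for the five 20-iteration blocks.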
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010210957.4A CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111488805A CN111488805A (en) | 2020-08-04 |
CN111488805B true CN111488805B (en) | 2023-04-25 |
Family
ID=71794420
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010210957.4A Active CN111488805B (en) | 2020-03-24 | 2020-03-24 | Video behavior recognition method based on salient feature extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488805B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931793B (en) * | 2020-08-17 | 2024-04-12 | 湖南城市学院 | Method and system for extracting saliency target |
CN113343760A (en) * | 2021-04-29 | 2021-09-03 | 暖屋信息科技(苏州)有限公司 | Human behavior recognition method based on multi-scale characteristic neural network |
CN113205051B (en) * | 2021-05-10 | 2022-01-25 | 中国科学院空天信息创新研究院 | Oil storage tank extraction method based on high spatial resolution remote sensing image |
CN113379643B (en) * | 2021-06-29 | 2024-05-28 | 西安理工大学 | Image denoising method based on NSST domain and Res2Net network |
CN113537375B (en) * | 2021-07-26 | 2022-04-05 | 深圳大学 | Diabetic retinopathy grading method based on multi-scale cascade |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
CN110852295A (en) * | 2019-10-15 | 2020-02-28 | 深圳龙岗智能视听研究院 | Video behavior identification method based on multitask supervised learning |
Non-Patent Citations (1)
Title |
---|
Wang Xiaofang; Qi Chun. A behavior recognition method using saliency detection. Journal of Xi'an Jiaotong University, 2017, (02), pp. 29-34. *
Also Published As
Publication number | Publication date |
---|---|
CN111488805A (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111488805B (en) | Video behavior recognition method based on salient feature extraction | |
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN111126379B (en) | Target detection method and device | |
WO2022134655A1 (en) | End-to-end video action detection and positioning system | |
US8494259B2 (en) | Biologically-inspired metadata extraction (BIME) of visual data using a multi-level universal scene descriptor (USD) | |
CN111291809B (en) | Processing device, method and storage medium | |
CN110929622A (en) | Video classification method, model training method, device, equipment and storage medium | |
CN107977661B (en) | Region-of-interest detection method based on FCN and low-rank sparse decomposition | |
CN104504395A (en) | Method and system for achieving classification of pedestrians and vehicles based on neural network | |
CN110232361B (en) | Human behavior intention identification method and system based on three-dimensional residual dense network | |
CN110222718B (en) | Image processing method and device | |
CN110852222A (en) | Campus corridor scene intelligent monitoring method based on target detection | |
KR102309111B1 (en) | Ststem and method for detecting abnomalous behavior based deep learning | |
CN110532959B (en) | Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network | |
CN113920581A (en) | Method for recognizing motion in video by using space-time convolution attention network | |
WO2023159898A1 (en) | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium | |
Li et al. | Transmission line detection in aerial images: An instance segmentation approach based on multitask neural networks | |
CN114580541A (en) | Fire disaster video smoke identification method based on time-space domain double channels | |
Hu et al. | Parallel spatial-temporal convolutional neural networks for anomaly detection and location in crowded scenes | |
Vu et al. | A multi-task convolutional neural network with spatial transform for parking space detection | |
CN111104924B (en) | Processing algorithm for identifying low-resolution commodity image | |
CN112396036A (en) | Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction | |
Le Louedec et al. | Segmentation and detection from organised 3D point clouds: A case study in broccoli head detection | |
CN108764287B (en) | Target detection method and system based on deep learning and packet convolution | |
CN113822134A (en) | Instance tracking method, device, equipment and storage medium based on video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||