CN111488805B - Video behavior recognition method based on salient feature extraction - Google Patents

Video behavior recognition method based on salient feature extraction

Info

Publication number
CN111488805B
Authority
CN
China
Prior art keywords
image
edge
feature
features
salient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010210957.4A
Other languages
Chinese (zh)
Other versions
CN111488805A (en)
Inventor
胡晓
向俊将
杨佳信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202010210957.4A
Publication of CN111488805A
Application granted
Publication of CN111488805B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video behavior recognition method based on salient feature extraction, which comprises the following steps: S1, obtaining a video to be recognized and converting it into an image; S2, extracting salient features of the image and removing the background information in the image; S3, performing behavior recognition on the image; S4, outputting the recognized abnormal behavior. The method combines a salient object detection method with a behavior recognition method for the first time: on one hand, the region of interest in the video is extracted, the main features are retained and the background features are shielded; on the other hand, the amount of computation is reduced.

Description

Video behavior recognition method based on salient feature extraction
Technical Field
The invention relates to the technical field of intelligent video surveillance, and in particular to a video behavior recognition method based on salient feature extraction.
Background
With economic development and the improvement of legal systems, people pay more attention to preventing criminal behavior in order to protect life and property. Video surveillance systems have begun to be applied in people's daily lives, for example for theft prevention and for preventing terrorist incidents in crowded places.
At present, behavior recognition methods fall into two main categories: methods based on traditional feature extraction and methods based on deep learning. Methods based on traditional feature extraction classify behaviors by extracting features such as the Histogram of Optical Flow (HOF), Histogram of Oriented Gradients (HOG) and Motion Boundary Histogram (MBH) from the video. However, their recognition capability is easily affected by illumination intensity and background information, and the extracted features have certain limitations, so they do not provide good recognition capability.
With the development of deep learning, and given that it can effectively complete tasks in research fields such as vision and hearing, deep-learning-based methods have been applied in real life, video surveillance being one of them. Researchers build a network model based on deep-learning theory and train it on a labeled video dataset to obtain a model with recognition capability; such a model has good generalization ability and can classify untrained video data. However, current practice feeds the video directly into the neural network without processing the video data, or only applies exposure and distortion transformations, and does not highlight the feature information carried by the video, which is unfavorable for detecting abnormal behavior; in a truly complex background, this may make abnormal behavior impossible to identify. In addition, existing training of deep-learning network models requires large datasets and high-performance servers, which greatly limits practical video recognition work.
In summary, there is an urgent need in the industry for a method or system that can highlight the feature information carried by video, reduce the amount of computation, and facilitate the detection and recognition of abnormal behaviors.
Disclosure of Invention
To address the defect of the prior art that the feature information carried by the video is not highlighted, the invention provides a video behavior recognition method based on salient feature extraction.
The specific scheme of the application is as follows:
a video behavior recognition method based on salient feature extraction comprises the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting salient features of the image and removing background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the identified abnormal behavior.
Preferably, step S2 includes:
S21, inputting the image into a Res2Net backbone network, dividing the image features into salient features and edge features as the image passes through each layer of the Res2Net backbone network, and finally outputting, by the Res2Net backbone network, the salient feature S0 and the edge feature E0;
S22, alternately training the salient feature S0 and the edge feature E0 as supervision signals in a CRU unit to generate the salient feature S and the edge feature E;
S23, in the asymmetric cross module, computing a loss between the salient feature S and the saliency label Label-S and a loss between the edge feature E and the edge label Label-E, and meanwhile extracting an edge feature from the salient feature S and computing a loss between it and the edge of the saliency label.
Preferably, step S21 comprises: the shallow image features first pass through a 1×1 convolution kernel and then through a group of convolution kernels that divides the feature map with n channels into 4 groups; the output features of one group are sent, together with the input feature maps of the next group, to the next group's 3×3 convolution kernel, and this process is repeated twice until all input feature maps have been processed. Finally, the feature maps of the 4 groups are concatenated, and the information is fused by a 1×1 convolution kernel to obtain multi-scale features.
Preferably, the CRU units are 4 CRU structural units stacked end to end, and the stacking operation of the CRU units in step S22 is defined by formulas (1)-(4), which are reproduced as images in the original publication; in these formulas, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU building blocks, S_i^n denotes the salient feature generated after the features of the i-th layer of Res2Net pass through n CRU building blocks, and ⊙ denotes the dot product.
Preferably, the implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
the function F in formula (7) takes into account both the edge features in the vertical direction and those in the horizontal direction, and fuses the features of the two directions according to formula (8) or (9);
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2)   (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
after the binary image passes through formula (7), an edge feature map can be obtained; the loss function uses a binary cross-entropy loss:
L = -Σ_i [ p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i)) ]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
Preferably, step S3 includes: performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure; the network framework of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout can be added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; training runs for 100 iterations in total, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the salient target detection method and the behavior recognition method are combined for the first time, on one hand, the interested region in the video is extracted, the main characteristics are reserved, the background characteristics are shielded, and on the other hand, the operation amount is reduced.
Drawings
FIG. 1 is a schematic flow chart of a video behavior recognition method based on salient feature extraction according to one embodiment;
fig. 2 is a network configuration diagram of saliency detection according to an embodiment.
Fig. 3 (a) is a part of a unit structure diagram of a Res2Net backbone network according to an embodiment, and (b) is a part of a unit structure diagram of a CRU unit according to an embodiment.
Fig. 4 is a 3D convolution diagram of an embodiment.
FIG. 5 is an exploded view of an R (2+1) D algorithm according to an embodiment.
Fig. 6 is a network frame diagram of a C3D network structure according to an embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Visual saliency (visual attention mechanism, VA) refers to the way humans, when facing a scene, automatically process regions of interest and selectively ignore regions of no interest; these regions of interest are called salient regions. Extracting the salient region of a specific target is called salient object detection. In the face of complex background information in a video or image, extracting salient features is necessary, since it retains the main features and shields the background features. In abnormal-behavior detection, the abnormal part of an image or video can be detected by saliency detection, and the abnormal behavior in the image or video can then be identified. The video behavior recognition method based on salient feature extraction in this solution is based on a network framework consisting of a Res2Net backbone network, CRU units and an asymmetric cross module, and specifically includes the following steps:
referring to fig. 1, a video behavior recognition method based on salient feature extraction includes:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting salient features of the image and removing background information in the image;
s3, performing behavior recognition on the image;
s4, outputting the identified abnormal behavior.
In this embodiment, referring to fig. 2, step S2 includes:
S21, inputting the image into a Res2Net backbone network; as the image passes through each layer of the Res2Net backbone network, the image features are divided into salient features and edge features, and the Res2Net backbone network finally outputs the salient feature S0 and the edge feature E0. The Res2Net backbone network comprises 4 layers: layer1, layer2, layer3 and layer4. From a practical point of view, objects may appear in a picture at different sizes; for example, dining tables at different positions and computers differ in size. Moreover, the object to be detected may carry more information than the area it occupies. Introducing the Res2Net backbone network means that each feature layer has multiple receptive fields of different scales, simulating the brain's perception of salient targets of different scales and orientations in real life, and avoiding the problem that the algorithm cannot detect multiple abnormal behaviors in a video or image.
Still further, as shown in part (a) of fig. 3, step S21 includes: the shallow image features first pass through a 1×1 convolution kernel and then through a group of convolution kernels that divides the feature map with n channels into 4 groups; the output features of one group are sent, together with the input feature maps of the next group, to the next group's 3×3 convolution kernel, and this process is repeated twice until all input feature maps have been processed. Finally, the feature maps of the 4 groups are concatenated, and their information is fused by a 1×1 convolution kernel to obtain multi-scale features.
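For illustration only, the multi-scale unit described above can be sketched in PyTorch as follows; the channel width, the group count of 4 and the layer names are assumptions made for the sketch rather than the exact configuration of this embodiment.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Illustrative Res2Net-style unit: a 1x1 convolution, a split of the
    feature map into 4 groups with cascaded 3x3 convolutions, and a final
    1x1 convolution that fuses the concatenated groups."""
    def __init__(self, in_channels=64, width=16, scales=4):
        super().__init__()
        self.scales = scales
        self.reduce = nn.Conv2d(in_channels, width * scales, kernel_size=1)
        # One 3x3 convolution per group except the first, which is passed through.
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1)
             for _ in range(scales - 1)]
        )
        self.fuse = nn.Conv2d(width * scales, in_channels, kernel_size=1)

    def forward(self, x):
        groups = torch.chunk(self.reduce(x), self.scales, dim=1)
        outputs = [groups[0]]              # first group bypasses the 3x3 convolutions
        previous = None
        for i, conv in enumerate(self.convs):
            inp = groups[i + 1] if previous is None else groups[i + 1] + previous
            previous = conv(inp)           # output of the former group joins the next group
            outputs.append(previous)
        # Concatenate the 4 groups and fuse their information with a 1x1 convolution.
        return self.fuse(torch.cat(outputs, dim=1))

block = MultiScaleBlock(in_channels=64)
y = block(torch.randn(1, 64, 56, 56))      # -> torch.Size([1, 64, 56, 56])
```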
S22, alternately training the salient feature S0 and the edge feature E0 as supervision signals in the CRU units to generate the salient feature S and the edge feature E. Considering the logical relationship between salient object detection and edge detection, a CRU structure that fuses salient features and edge features can be adopted; in this solution the CRU units are combined with the Res2Net network to produce more discriminative features. When the input image undergoes multi-stage feature extraction through a CNN, the deeper the CNN, the more the dispersion of the image features is suppressed (here CNN refers to an ordinary convolutional neural network and also includes the CRU structure). Since low-level features contain many background spatial details and a more dispersed focus of attention, while high-level features concentrate on the salient target region, 4 CRU structural units, CRU1, CRU2, CRU3 and CRU4, are stacked end to end. As shown in part (b) of fig. 3, the stacking operation of the CRU units in step S22 is defined by formulas (1)-(4):
[Formulas (1)-(4) are reproduced as images in the original publication.]
In these formulas, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU building blocks, S_i^n denotes the salient feature generated after the features of the i-th layer of Res2Net pass through n CRU building blocks, and ⊙ denotes the dot product.
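Because formulas (1)-(4) are only available as images, the following PyTorch sketch shows one hypothetical way such a cross refinement unit and the end-to-end stacking of CRU1 to CRU4 could look; the element-wise-product update rule, channel count and convolution sizes are assumptions for illustration, not the embodiment's actual equations.

```python
import torch
import torch.nn as nn

class CRU(nn.Module):
    """Hypothetical cross refinement unit: the salient feature and the edge
    feature refine each other through an element-wise product followed by a
    3x3 convolution. The exact update rules of formulas (1)-(4) are published
    as images, so this block is only an illustrative stand-in."""
    def __init__(self, channels=64):
        super().__init__()
        self.sal_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.edge_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, s, e):
        cross = s * e                       # dot (element-wise) product of the two branches
        s_new = self.sal_conv(s + cross)    # edge cue sharpens the salient feature
        e_new = self.edge_conv(e + cross)   # salient cue cleans up the edge feature
        return s_new, e_new

# Four units CRU1..CRU4 stacked end to end, as described above.
crus = nn.ModuleList([CRU(64) for _ in range(4)])
s = torch.randn(1, 64, 56, 56)              # salient feature S0 from the backbone
e = torch.randn(1, 64, 56, 56)              # edge feature E0 from the backbone
for cru in crus:
    s, e = cru(s, e)                        # after CRU4: salient feature S and edge feature E
```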
S23, in the asymmetric cross module, a loss is computed between the salient feature S and the saliency label Label-S and between the edge feature E and the edge label Label-E; meanwhile, an edge feature is extracted from the salient feature S, and a loss is computed between it and the edge of the saliency label. Although in the CRU unit structure the salient feature S incorporates the edge feature E in order to compensate for the edge information lost by the salient object, the effect of this combination on the salient feature S is limited, and the promoting effect of the edge feature E on the salient feature S cannot be evaluated directly. Considered from the output layer of the network structure, salient object detection only evaluates the salient feature S, and the edge feature E only plays an auxiliary role, which is unfavorable for extracting the edge features of the salient feature S. Therefore, in addition to the CRU structure, this solution also extracts the edge information of the salient object during training and computes a loss against the edge label information, thereby realizing a double cross fusion of the salient object detection network and the edge detection network.
For edge extraction of the binary saliency image, traditional edge detection operators such as Sobel and LoG are adopted; such an operator consists of two convolution kernels, g_x(x, y) and g_y(x, y), which perform a convolution operation on the original image f(x, y).
The operator can be divided into a vertical-direction template and a horizontal-direction template: the former, G_x(x, y), detects edges in the horizontal direction of the image, while the latter, G_y(x, y), detects edges in the vertical direction. In practical application, every pixel of the image is convolved with these two kernels. The specific implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
The function F in formula (7) takes into account both the edge features in the vertical direction and those in the horizontal direction, and fuses the features of the two directions according to formula (8) or (9):
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2)   (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
After the binary image passes through formula (7), an edge feature map is obtained. The loss function uses a binary cross-entropy loss:
L = -Σ_i [ p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i)) ]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
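For illustration, the edge extraction of formulas (5)-(8) and the binary cross-entropy loss of formula (10) can be sketched in PyTorch as follows; the Sobel kernel values, the dummy tensors and the clamping of the edge map are assumed choices made for the sketch.

```python
import torch
import torch.nn.functional as F

def edge_map(binary_saliency):
    """Convolve the binary saliency map f(x, y) with g_x and g_y (formulas (5)
    and (6)) and fuse the two directional responses by their magnitude
    (formulas (7)/(8)). Sobel kernels are used here as an assumed choice."""
    g_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
    g_y = g_x.transpose(2, 3)
    G_x = F.conv2d(binary_saliency, g_x, padding=1)
    G_y = F.conv2d(binary_saliency, g_y, padding=1)
    return torch.sqrt(G_x ** 2 + G_y ** 2 + 1e-12)

# Binary cross-entropy between the extracted edge map and the edge label,
# as in formula (10); values are clamped into (0, 1) before the loss.
saliency = (torch.rand(1, 1, 64, 64) > 0.5).float()   # stand-in binary saliency map
edge_label = torch.rand(1, 1, 64, 64)                 # stand-in edge label
pred = edge_map(saliency).clamp(1e-6, 1 - 1e-6)
loss = F.binary_cross_entropy(pred, edge_label)
```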
During training of the salient feature extraction, the experimental platform may use the Ubuntu 14.04.3 LTS operating system and a 1080Ti graphics card, configured with Python 3.6 and PyTorch 0.4.0. The SGD algorithm is used as the optimizer, the number of iterations is 30, and the learning rate is 0.002; a learning-rate schedule multiplies the learning rate by 0.1 after 20 iterations so that the optimization converges, and the batch size is set to 8. The DUTS-TR dataset is used as the training set. The test video is the output of a real video monitoring device. After the salient features of the monitoring video are extracted, the background information is removed and only the behavior information of the salient region is retained.
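A minimal PyTorch sketch of this optimization schedule is given below; the saliency network itself is replaced by a placeholder module, and the training-loop body over DUTS-TR is omitted.

```python
import torch
import torch.nn as nn

# Stand-in for the saliency network described above; the real network
# (Res2Net backbone + CRU units + asymmetric cross module) is omitted.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.002)
# Multiply the learning rate by 0.1 after 20 of the 30 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(30):
    # ... one pass over the DUTS-TR training set (batch size 8) would go here ...
    scheduler.step()
```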
In this embodiment, as shown in fig. 6, step S3 includes: performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure. The network framework of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout can be added to the network structure to optimize it.
The model uses stochastic gradient descent with an initial learning rate of 0.003. Training runs for 100 iterations in total, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges. The two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
At present, a large number of human behavior databases exist; large-scale video datasets include Kinetics-400, Kinetics-600 and Sports-1M. As the abnormal-behavior dataset, the VIF (Violent Flows) video database of crowd violence can be used. These datasets can serve as training sets for training the network parameters, so that the behavior recognition network can recognize behaviors accurately. The videos in the training set need to be sufficient to contain various kinds of behavior information.
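Putting the architecture and training settings described above together, the following PyTorch sketch shows the classifier head and optimization setup; the backbone placeholder, clip size and number of behavior classes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Illustrative classifier head and optimization settings for the recognition
# network: two fully connected layers with 4096 outputs, a softmax
# classification layer (realized through cross entropy), SGD with an initial
# learning rate of 0.003 decayed by 0.1 every 20 of the 100 iterations.
# `backbone` stands in for the 8 R(2+1)D convolution layers and 5 max-pooling
# layers of the C3D-style network.
num_classes = 2
backbone = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
)
head = nn.Sequential(
    nn.Linear(64, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, num_classes),          # softmax is applied inside the loss
)
model = nn.Sequential(backbone, head)

optimizer = torch.optim.SGD(model.parameters(), lr=0.003)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
criterion = nn.CrossEntropyLoss()

clips = torch.randn(2, 3, 16, 112, 112)    # a batch of two 16-frame clips
labels = torch.randint(0, num_classes, (2,))
loss = criterion(model(clips), labels)
```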
With the development of deep learning, researchers have proposed numerous deep-learning-based behavior recognition algorithms for extracting spatiotemporal features from video; the two main families are those based on a dual-stream network structure and those based on a C3D network structure.
In behavior recognition based on a dual-stream network structure, the dual-stream network is a milestone of deep learning for analyzing the behavior of people in video; it uses 5 convolution layers and 2 fully connected layers and extends static-image recognition to video datasets such as UCF101 and HMDB. In this network, the spatial stream takes RGB images as input, while the temporal stream takes optical-flow images, which carry the timing information, as input. With the proposal of the VGG network structure, the VGG16 network was adopted as the feature extraction network, and the fusion of the dual-stream network in time and space was considered. However, the dual-stream structure must first generate optical-flow images, which takes a relatively long time and gives relatively poor real-time performance, so a behavior recognition algorithm based on the C3D network structure can be adopted instead.
A 3D convolution is formed by extending a 2D convolution along the time dimension, with the framework shown in fig. 4: the convolution kernel is correspondingly expanded to 3 dimensions, the frame sequence passes through the kernel in order, three consecutive images pass through a kernel of depth 3, and the feature values are finally mapped to feature maps. After the feature maps are connected to the frame sequence, the motion features of the people in the video can be obtained, as shown in formulas (11) and (12):
[Formulas (11) and (12) are reproduced as images in the original publication.]
In these formulas, the left-hand term denotes the feature value at pixel (x, y, z) of the j-th feature map of the i-th layer after 3D convolution; b_ij is the bias; m runs over the feature maps of the previous layer; R, P and Q correspond to the depth (time), length and width of the 3D convolution kernel; and w is the convolution-kernel weight connecting the feature maps.
Because 3D convolution is introduced, the temporal relationship between single frames can be exploited; that is, temporal features can be extracted at the same time as spatial features, so there is no longer a need to use optical flow to carry the temporal stream. However, this brings problems of computational cost and model storage. Therefore, when designing a network structure based on 3D convolution, the network must be able to extract spatiotemporal features effectively while keeping the computational cost as low as possible and the model storage as small as possible. The R(2+1)D algorithm replaces the 3×3×3 convolution kernel with a 1×3×3 spatial convolution kernel and a 3×1×1 temporal convolution kernel; its decomposition is shown schematically in fig. 5. As can be understood from the way the convolution kernels are computed, the 1×3×3 kernel operates on a two-dimensional image at a single time instant, while the 3×1×1 kernel operates only along the time dimension.
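For illustration, the R(2+1)D decomposition described above can be sketched in PyTorch as follows; the intermediate channel count is an assumed choice rather than a value specified in this embodiment.

```python
import torch
import torch.nn as nn

class R2Plus1DConv(nn.Module):
    """Sketch of the R(2+1)D decomposition: a 3x3x3 spatiotemporal convolution
    is replaced by a 1x3x3 spatial convolution followed by a 3x1x1 temporal
    convolution."""
    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        mid_channels = mid_channels or out_channels
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        # The 1x3x3 kernel operates on each frame individually, and the 3x1x1
        # kernel only along the time dimension, matching the decomposition in fig. 5.
        return self.temporal(self.relu(self.spatial(x)))

# Usage on a clip of 16 RGB frames of size 112x112.
block = R2Plus1DConv(3, 64)
out = block(torch.randn(1, 3, 16, 112, 112))   # -> torch.Size([1, 64, 16, 112, 112])
```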
In conclusion, the real scene is captured by a camera and transmitted to the monitoring system; after transmission by the monitoring system, salient features are extracted from the video, the background information in the video is removed, behaviors are recognized by the behavior recognition network, and finally the abnormal behavior is output.
The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (4)

1. A video behavior recognition method based on salient feature extraction, characterized by comprising the following steps:
s1, acquiring a video to be identified, and converting the video to be identified into an image;
s2, extracting salient features of the image and removing background information in the image;
the step S2 includes:
S21, inputting the image into a Res2Net backbone network, dividing the image features into salient features and edge features as the image passes through each layer of the Res2Net backbone network, and finally outputting, by the Res2Net backbone network, the salient feature S0 and the edge feature E0;
S22, alternately training the salient feature S0 and the edge feature E0 as supervision signals in a CRU unit to generate a salient feature S and an edge feature E;
S23, in the asymmetric cross module, computing a loss between the salient feature S and the saliency label Label-S and a loss between the edge feature E and the edge label Label-E, and meanwhile extracting an edge feature from the salient feature S and computing a loss between it and the edge of the saliency label;
the CRU units are 4 CRU structural units stacked end to end, and the stacking operation of the CRU units in step S22 is defined by formulas (1)-(4), which are reproduced as images in the original publication; in these formulas, E_i^n denotes the edge feature generated after the features of the i-th layer of Res2Net pass through n CRU building blocks, S_i^n denotes the salient feature generated after the features of the i-th layer of Res2Net pass through n CRU building blocks, and ⊙ denotes the dot product;
s3, performing behavior recognition on the image;
s4, outputting the identified abnormal behavior.
2. The video behavior recognition method based on salient feature extraction of claim 1, wherein step S21 comprises:
the shallow image features first pass through a 1×1 convolution kernel and then through a group of convolution kernels that divides the feature map with n channels into 4 groups; the output features of one group, together with the input feature maps of the next group, are transmitted to the next group's 3×3 convolution kernel, and this process is repeated twice until all input feature maps have been processed; finally, the feature maps of the 4 groups are concatenated, and the information is fused by a 1×1 convolution kernel to obtain multi-scale features.
3. The video behavior recognition method based on salient feature extraction of claim 1, wherein the implementation of step S23 is as follows:
G_x(x, y) = f(x, y) * g_x(x, y)   (5)
G_y(x, y) = f(x, y) * g_y(x, y)   (6)
G = F(G_x(x, y), G_y(x, y))   (7)
the function F in formula (7) takes into account both the edge features in the vertical direction and those in the horizontal direction, and fuses the features of the two directions according to formula (8) or (9);
F(G_x(x, y), G_y(x, y)) = sqrt(G_x(x, y)^2 + G_y(x, y)^2)   (8)
F(G_x(x, y), G_y(x, y)) = G_x(x, y) + G_y(x, y)   (9)
an edge feature map is obtained after the binary image passes through formula (7); the loss function uses a binary cross-entropy loss:
L = -Σ_i [ p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i)) ]   (10)
where p(x_i) is the true value and q(x_i) is the estimated value.
4. The video behavior recognition method based on salient feature extraction of claim 1, wherein step S3 comprises: performing behavior recognition on the image by adopting a behavior recognition algorithm based on a C3D network structure; the network framework of the C3D network structure consists of 8 convolution layers built from R(2+1)D convolution modules, 5 max-pooling layers and 2 fully connected layers, and ReLU, BatchNorm and Dropout are added to the network structure to optimize it;
the model uses stochastic gradient descent with an initial learning rate of 0.003; training runs for 100 iterations in total, and the learning rate is multiplied by 0.1 every 20 iterations so that the model converges; the two fully connected layers have 4096 outputs, and classification is finally achieved by the softmax classification layer.
CN202010210957.4A 2020-03-24 2020-03-24 Video behavior recognition method based on salient feature extraction Active CN111488805B (en)

Priority Applications (1)

Application Number: CN202010210957.4A | Priority Date: 2020-03-24 | Filing Date: 2020-03-24 | Title: Video behavior recognition method based on salient feature extraction (CN111488805B)

Applications Claiming Priority (1)

Application Number: CN202010210957.4A | Priority Date: 2020-03-24 | Filing Date: 2020-03-24 | Title: Video behavior recognition method based on salient feature extraction (CN111488805B)

Publications (2)

Publication Number Publication Date
CN111488805A CN111488805A (en) 2020-08-04
CN111488805B true CN111488805B (en) 2023-04-25

Family

ID=71794420

Family Applications (1)

Application Number: CN202010210957.4A | Title: Video behavior recognition method based on salient feature extraction (CN111488805B, Active)

Country Status (1)

Country Link
CN (1) CN111488805B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931793B (en) * 2020-08-17 2024-04-12 湖南城市学院 Method and system for extracting saliency target
CN113343760A (en) * 2021-04-29 2021-09-03 暖屋信息科技(苏州)有限公司 Human behavior recognition method based on multi-scale characteristic neural network
CN113205051B (en) * 2021-05-10 2022-01-25 中国科学院空天信息创新研究院 Oil storage tank extraction method based on high spatial resolution remote sensing image
CN113379643B (en) * 2021-06-29 2024-05-28 西安理工大学 Image denoising method based on NSST domain and Res2Net network
CN113537375B (en) * 2021-07-26 2022-04-05 深圳大学 Diabetic retinopathy grading method based on multi-scale cascade


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256562A (en) * 2018-01-09 2018-07-06 深圳大学 Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN110852295A (en) * 2019-10-15 2020-02-28 深圳龙岗智能视听研究院 Video behavior identification method based on multitask supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王晓芳; 齐春. A behavior recognition method using saliency detection. 西安交通大学学报 (Journal of Xi'an Jiaotong University), 2017(02), pp. 29-34. *

Also Published As

Publication number Publication date
CN111488805A (en) 2020-08-04


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant