CN115188066A - Moving target detection system and method based on cooperative attention and multi-scale fusion - Google Patents

Moving target detection system and method based on cooperative attention and multi-scale fusion Download PDF

Info

Publication number
CN115188066A
Authority
CN
China
Prior art keywords
network
attention
module
scale
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210620275.XA
Other languages
Chinese (zh)
Inventor
胡晓
黎锦栋
黄子燊
黄奕秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202210620275.XA priority Critical patent/CN115188066A/en
Publication of CN115188066A publication Critical patent/CN115188066A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention relates to the technical field of computer vision and discloses a moving target detection system and method based on cooperative attention and multi-scale fusion. The system comprises a feature extraction module, a target area identification module and a network based on a cooperative attention mechanism and multi-scale feature fusion; this network is divided into two branches: the first branch extracts features from the image to obtain a corresponding feature map and embeds a cooperative attention module, which takes the positional relationships in the image into account on top of channel attention and combines channel attention with spatial attention; the second branch adds a multi-scale feature fusion module, ASPP, to fuse image features of different scales. The invention reduces the large errors caused by the multi-scale problem, lessens the influence of multi-scale changes in the image on moving target detection, improves the detection performance of the system through the cooperative attention mechanism, and lowers the false detection rate.

Description

Moving target detection system and method based on cooperative attention and multi-scale fusion
Technical Field
The invention relates to the technical field of computer vision, in particular to a moving target detection system and method based on cooperative attention and multi-scale fusion.
Background
Detection and tracking of moving targets is a branch of image processing and computer vision. It is of great theoretical and practical significance and has long attracted the attention of scholars at home and abroad. In the rapidly developing field of autonomous driving, moving target detection has become an important part of how an unmanned vehicle understands its surroundings: visual attribute information about relatively moving objects in the environment must be detected accurately while the vehicle travels at high speed, so that decisions ensuring driving safety can be made in real time; this is an important application of autonomous driving technology. Moreover, in the field of military unmanned aerial vehicle reconnaissance, to ensure that dynamic information about ground targets can be acquired under high-speed motion, a high-performance moving target detection technique must be carried on the aircraft, so that reconnaissance, ground attack, surveillance and other military tasks can be completed and the war-zone and enemy information required for military operations can be obtained. Moving target detection has thus in practice become a core technical means of military reconnaissance. Therefore, the development of moving target detection technology is of great importance and has positive social significance.
Disclosure of Invention
The present invention is directed to a moving object detection system and method based on cooperative attention and multi-scale fusion, so as to solve the above-mentioned problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a moving target detection system based on cooperative attention and multi-scale fusion comprises a feature extraction module, a target area identification module, a network based on a cooperative attention mechanism and multi-scale feature fusion; the network based on the cooperative attention mechanism and the multi-scale feature fusion is divided into two branches: the first branch acquires a corresponding feature map after feature extraction of the image, embeds a Coordinate Attention module, gives consideration to the position relation of the image on the basis of channel Attention, and combines the main Attention and the space Attention of the channel to enable the feature map extracted by the network to focus on a key area; and a multi-scale feature fusion module ASPP is added in the second branch to fuse image features of different scales, so that the neural network is helped to generate a target feature map with higher quality, and a target detection function is realized.
The feature extraction module takes as input a target image to be detected; the first 34 layers of YOLOv3 serve as the front-end feature-map extractor, and after the features extracted by this front-end network are processed by the cooperative attention module Coordinate Attention, the network allocates more attention to the region where the target is located. The features are then passed through the multi-scale fusion module ASPP (Atrous Spatial Pyramid Pooling) to obtain multi-scale information and improve the network's ability to detect targets of different sizes; adding the cooperative attention mechanism during feature extraction makes the extracted features focus more on the key areas of the image and improves the network's accuracy. Coordinate Attention can be regarded as a computational unit that takes any intermediate feature tensor X as input and, through the transformation, outputs a tensor Y = {y_1, y_2, ..., y_C} of the same size as X with enhanced representations. Coordinate Attention encodes channel relationships and long-range dependencies through precise positional information, effectively improving the neural network's perception of key regions. The multi-scale feature fusion module ASPP uses four cascaded dilated (atrous) convolutions with dilation rates of 6, 12, 18 and 24, respectively. Through the ASPP structure the neural network extracts spatial information at different scales and obtains an output containing abstract feature information; in the encoding stage, a low-level feature map containing sufficient local and edge information is fused in to supplement detail, and finally prediction is performed.
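For illustration only, below is a minimal PyTorch-style sketch of such an ASPP module using the dilation rates 6, 12, 18 and 24 named above; the channel counts, the parallel arrangement of the four dilated convolutions (the text above describes them as cascaded) and the final 1 x 1 fusion layer are assumptions made for this sketch rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions, then 1x1 fusion."""
    def __init__(self, in_channels, out_channels, rates=(6, 12, 18, 24)):
        super().__init__()
        # one 3x3 dilated (atrous) convolution per dilation rate
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # 1x1 convolution that fuses the multi-scale responses
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * len(rates), out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # each branch keeps the spatial size but sees a different receptive field
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))
```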
Preferably, the operation of Coordinate Attention during training is divided into two steps: coordinate information embedding and Coordinate Attention generation;
Coordinate information embedding: global pooling is typically used by channel attention to encode spatial information globally, but because it compresses the global spatial information into a channel descriptor it is difficult to preserve positional information. To enable the attention module to capture long-range spatial interactions with precise positional information, the Coordinate Attention mechanism decomposes the global pooling shown in the following formula into a pair of one-dimensional feature encoding operations:
z_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)    (1)
Specifically, given an input X, each channel is first encoded along the horizontal and vertical coordinates using pooling kernels of size (H, 1) or (1, W), respectively, so the output of the c-th channel at height h is:
z^h_c(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i)    (2)
similarly, the output formula of the c-th channel with width w is:
z^w_c(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)    (3)
the 2 transformations aggregate features along two spatial directions respectively to obtain a pair of direction-aware feature maps, and the 2 transformations also allow the attention module to capture long-term dependencies along one spatial direction and save accurate location information along the other spatial direction, which helps the network to more accurately locate the target of interest.
Coordinate Attention generation: the transformations above acquire a global receptive field and encode accurate positional information. To exploit the resulting representations, a second transformation, Coordinate Attention generation, is needed; it is kept as simple as possible while making full use of the captured positional information, so that regions of interest can be located accurately while inter-channel information is captured effectively.
After the transformations of the information-embedding step, this part applies a concatenate operation to the two outputs and then a 1 × 1 convolution transform function F_1:
f = δ(F_1([z^h, z^w]))    (4)
where [·, ·] is the concatenate operation along the spatial dimension, δ is a nonlinear activation function, and f is the intermediate feature map that encodes the spatial information in the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, where r is the reduction ratio that controls the block size (as in SE blocks). Two further 1 × 1 convolution transforms F_h and F_w map f^h and f^w, respectively, to tensors with the same number of channels as the input X, yielding:
g^h = σ(F_h(f^h)),
g^w = σ(F_w(f^w))    (5)
where σ is the sigmoid activation function. To reduce the complexity and computational overhead of the model, a suitable reduction ratio r is usually used here to reduce the number of channels of f. The outputs g^h and g^w are then expanded and used as attention weights. Finally, the output Y of the Coordinate Attention block is:
y_c(i, j) = x_c(i, j) × g^h_c(i) × g^w_c(j)    (6)
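As a reading aid, here is a minimal PyTorch-style sketch of a Coordinate Attention block implementing equations (1)-(6); the reduction ratio, the h-swish choice for the activation δ and the batch-normalization layer are illustrative assumptions rather than details specified in this application.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)             # C/r channels after reduction
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> z^h, eq. (2)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> z^w, eq. (3)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # F_1 in eq. (4)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                       # delta, nonlinear activation
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in eq. (5)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in eq. (5)

    def forward(self, x):
        n, c, h, w = x.size()
        x_h = self.pool_h(x)                            # N x C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # N x C x W x 1
        f = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))  # eq. (4)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        f_w = f_w.permute(0, 1, 3, 2)
        g_h = torch.sigmoid(self.conv_h(f_h))           # eq. (5)
        g_w = torch.sigmoid(self.conv_w(f_w))
        return x * g_h * g_w                            # eq. (6): y_c(i,j) = x_c(i,j) * g^h_c(i) * g^w_c(j)
```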
the method adopts a multi-scale fusion module ASPP and a cooperative Attention mechanism module Coordinate Attention to improve the traditional target algorithm to obtain a target detection result with high accuracy and false detection omission rate;
the collaborative attention-based mechanism and multi-scale feature fusion network comprises: the system comprises a camera data acquisition module, a data import module, a data preprocessing module and a target detection network module;
the camera data acquisition module shoots a specified area to acquire image data;
the data preprocessing module receives and processes the image data imported by the data import module; the image data is cut and normalized to be converted into data which can be processed by a target detection network;
the moving target detection network module receives the processed data, a front 34 layer of YOLOv3 is used as a front end feature mapping extractor, and after the features extracted by the front end network are subjected to data processing through a coordinated Attention mechanism module coordination attribute, the network can distribute more Attention to the area where the target is located; and the characteristics are processed by a multi-scale fusion module ASPP to obtain multi-scale information so as to improve the detection capability of the network on targets with different sizes.
A moving target detection method based on cooperative attention and multi-scale fusion utilizes image information of different scales to increase the characteristic extraction capability of a system, and combines a cooperative attention mechanism to improve the positioning and retrieval capability of the system on key information, and comprises the following steps;
s1: acquiring a video data set for network training through a data acquisition module;
s2: constructing a multi-scale feature fusion network based on a cooperative attention mechanism;
s3: training a network based on a cooperative attention mechanism and multi-scale feature fusion to obtain a trained network;
s4: testing the video data set through the trained network to obtain a test result;
s5: evaluating the trained network according to the test result to obtain an evaluation result, and further optimizing a network weight coefficient;
s6: and inputting the video data set to be detected into the optimized network for target detection to obtain a detection result.
Preferably, S1 specifically includes: arranging a plurality of cameras in a public place with frequent activity and acquiring target images from different angles through these cameras; marking and labelling all moving targets appearing in the camera images to generate an annotation file; and randomly splitting the video data set into a training set and a testing set at a fixed ratio, with 7 parts used for training.
Preferably, the position of each target in the annotation file is represented by a 4-tuple that defines a bounding box, recording the pixel position (x1, y1) of the top-left corner of the region where the target appears in the image together with the pixel width and height (w, h) of that region.
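A hypothetical annotation record following this 4-tuple convention might look as follows; the field names and file layout are illustrative assumptions, not the actual annotation format of this application.

```python
# one annotated frame; bbox is (x1, y1, w, h): top-left pixel position plus width and height
annotation = {
    "image": "frame_000123.jpg",   # hypothetical frame name
    "objects": [
        {"label": "pedestrian", "bbox": (412, 236, 58, 144)},
        {"label": "vehicle",    "bbox": (97, 310, 203, 122)},
    ],
}
```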
Preferably, the step S3 of obtaining the trained network specifically includes the following steps:
s31), estimating regression boxes of all targets in the training image and corresponding classification credibility by using a geometric self-adaptive Gaussian kernel;
s32), preprocessing the collected data set so that the image size is fixed at 512x512, inputting the images into the neural network, and training with the Euclidean distance of the target regression boxes and the classification cross-entropy error as loss functions; during training, the data volume is increased by flipping the images left and right (a sketch of this step follows the list);
s33), storing the trained model;
s34), inputting a low-resolution video data set into the network and repeating steps s32) and s33);
s35), testing the test video data set by using the trained model, and evaluating the network by using mean Average Precision (mAP).
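The sketch below illustrates the preprocessing and loss computation of step s32) under the assumption of a PyTorch-style pipeline; the use of mean-squared error as a stand-in for the Euclidean-distance box term and all hyper-parameter values are assumptions made for illustration.

```python
import torch.nn as nn
import torchvision.transforms as T

# preprocessing: fix the input size to 512x512 and enlarge the data set by left-right flipping
transform = T.Compose([
    T.Resize((512, 512)),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

box_loss = nn.MSELoss()            # stand-in for the Euclidean-distance term on regression boxes
cls_loss = nn.CrossEntropyLoss()   # cross-entropy term on class labels

def training_loss(pred_boxes, gt_boxes, pred_logits, gt_labels):
    # total loss = box regression error + classification cross-entropy error
    return box_loss(pred_boxes, gt_boxes) + cls_loss(pred_logits, gt_labels)
```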
Preferably, in S4, the video data set is tested through the trained network, so as to obtain a test result, which specifically includes:
s41), sampling the test set video according to 30 frames, and extracting to obtain a test image;
s42), preprocessing the test image to fix the size of the image to 512 multiplied by 512;
s43), loading the trained target detection training network, and inputting the preprocessed test set image into a network model for processing to obtain a target detection regression frame and classification reliability;
s44), evaluating the network by using the average value (mAP value) of each type of AP; the specific formula is shown in formulas (7) and (8):
AP = Σ_{i=1}^{n-1} (r_{i+1} - r_i) × P_interp(r_{i+1})    (7)
mAP = (1/N) Σ_{k=1}^{N} AP_k    (8)
where r1, r2, ..., rn are the recall values at the interpolation points of the precision interpolation segments, arranged in ascending order; that is, AP is the area under the precision-recall curve, and mAP is the average of the per-class APs. The precision is calculated as:
Precision = TP / (TP + FP)    (9)
the recall ratio is calculated by the formula:
Recall = TP / (TP + FN)    (10)
where TP is the number of positive samples judged positive, FP the number of negative samples judged positive, FN the number of positive samples judged negative, and TN the number of negative samples judged negative.
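As a small illustration, the precision, recall and AP quantities of equations (7)-(10) could be computed from per-class counts as sketched below; the interpolated precision values are assumed to have been collected beforehand as (recall, precision) pairs sorted by recall.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0   # eq. (9)

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0   # eq. (10)

def average_precision(points):
    """points: list of (recall, interpolated precision) pairs sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        ap += (r - prev_r) * p    # eq. (7): area under the precision-recall curve
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)   # eq. (8)
```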
The moving target detection method and system based on cooperative attention and multi-scale fusion provided by the invention have the following beneficial effects:
(1) In the invention: the moving target detection method and system based on cooperative attention and multi-scale fusion create a neural network for moving target detection, namely a network based on a cooperative attention mechanism and multi-scale feature fusion, by utilizing deep learning, and the method realizes the autonomous detection and classification of moving targets;
(2) In the invention: the moving target detection method and system based on cooperative attention and multi-scale fusion reduce the large errors caused by the multi-scale problem in the prediction process of conventional neural networks and lessen the influence of multi-scale changes in the image on target detection. Meanwhile, a Coordinate Attention cooperative attention module is embedded in the network, so that the neural network allocates more attention to the key areas where targets are located during feature extraction, which greatly improves the accuracy of the neural network and makes the detection and classification results more accurate;
(3) In the invention: the moving target detection method and system based on cooperative attention and multi-scale fusion mainly use image processing and deep learning techniques; by building an image database with multiple target locations and corresponding class labels and training the network based on the cooperative attention mechanism and multi-scale feature fusion on it, predictions of the target bounding boxes and classes in a video can be obtained.
Drawings
FIG. 1 is a schematic flow chart of a moving object detection method based on a cooperative attention mechanism and a multi-scale feature fusion network according to an embodiment of the present invention;
FIG. 2 is a structural diagram of a network based on a cooperative attention mechanism and multi-scale feature fusion in embodiment 1 of the present invention;
FIG. 3 is a structural diagram of the Coordinate Attention model in embodiment 1 of the present invention;
fig. 4 is a diagram of an ASPP model structure in embodiment 1 of the present invention;
FIG. 5 is a flowchart of model training in embodiment 1 of the present invention;
FIG. 6 is a flowchart of a model test in embodiment 1 of the present invention;
fig. 7 is a schematic diagram of a moving object detection method and a working principle based on a cooperative attention mechanism and a multi-scale feature fusion network in embodiment 2 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example one
Referring to fig. 1-6, a moving target detection system based on cooperative attention and multi-scale fusion provided by an embodiment of the present invention comprises a feature extraction module, a target area identification module and a network based on a cooperative attention mechanism and multi-scale feature fusion; the network is divided into two branches: the first branch extracts features from the image to obtain a corresponding feature map and embeds a Coordinate Attention module, i.e. a cooperative attention module, which takes the positional relationships in the image into account on top of channel attention and combines channel attention with spatial attention, so that the feature map extracted by the network is more focused on key areas; the second branch adds a multi-scale feature fusion module, ASPP, to fuse image features of different scales, helping the neural network generate a higher-quality target feature map and realize the target detection function.
The feature extraction module takes as input a target image to be detected; the first 34 layers of YOLOv3 serve as the front-end feature-map extractor, and after the features extracted by this front-end network are processed by the cooperative attention module Coordinate Attention, the network allocates more attention to the region where the target is located. The features are then passed through the multi-scale fusion module ASPP to obtain multi-scale information and improve the network's ability to detect targets of different sizes; adding the cooperative attention mechanism during feature extraction makes the extracted features focus more on the key areas of the image and improves the network's accuracy. Coordinate Attention can be regarded as a computational unit that takes any intermediate feature tensor X as input and, through the transformation, outputs a tensor Y = {y_1, y_2, ..., y_C} of the same size as X with enhanced representations. Coordinate Attention encodes channel relationships and long-range dependencies through precise positional information, effectively improving the neural network's perception of key regions. The multi-scale feature fusion module ASPP uses four cascaded dilated (atrous) convolutions with dilation rates of 6, 12, 18 and 24, respectively. Through the ASPP structure the neural network extracts spatial information at different scales and obtains an output containing abstract feature information; in the encoding stage, a low-level feature map containing sufficient local and edge information is fused in to supplement detail before prediction.
Preferably, the specific operation of the Coordinate Attention in the training process is divided into 2 steps of Coordinate information embedding and Coordinate Attention generation;
Coordinate information embedding: global pooling is typically used by channel attention to encode spatial information globally, but because it compresses the global spatial information into a channel descriptor it is difficult to preserve positional information. To enable the attention module to capture long-range spatial interactions with precise positional information, the Coordinate Attention mechanism decomposes the global pooling shown in the following formula into a pair of one-dimensional feature encoding operations:
z_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)    (1)
Specifically, given an input X, each channel is first encoded along the horizontal and vertical coordinates using pooling kernels of size (H, 1) or (1, W), respectively, so that the output of the c-th channel at height h is:
z^h_c(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i)    (2)
similarly, the output formula of the c-th channel with width w is:
z^w_c(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)    (3)
the 2 transformations aggregate features along two spatial directions respectively to obtain a pair of direction-aware feature maps, and the 2 transformations also allow the attention module to capture long-term dependencies along one spatial direction and save accurate location information along the other spatial direction, which helps the network to more accurately locate the target of interest.
Coordinate Attention generation: the transformations above acquire a global receptive field and encode accurate positional information. To exploit the resulting representations, a second transformation, Coordinate Attention generation, is needed; it is kept as simple as possible while making full use of the captured positional information, so that regions of interest can be located accurately while inter-channel information is captured effectively.
After the transformations of the information-embedding step, this part applies a concatenate operation to the two outputs and then a 1 × 1 convolution transform function F_1:
f = δ(F_1([z^h, z^w]))    (4)
where [·, ·] is the concatenate operation along the spatial dimension, δ is a nonlinear activation function, and f is the intermediate feature map that encodes the spatial information in the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, where r is the reduction ratio that controls the block size (as in SE blocks). Two further 1 × 1 convolution transforms F_h and F_w map f^h and f^w, respectively, to tensors with the same number of channels as the input X, yielding:
g^h = σ(F_h(f^h)),
g^w = σ(F_w(f^w))    (5)
where σ is the sigmoid activation function. To reduce the complexity and computational overhead of the model, a suitable reduction ratio r is usually used here to reduce the number of channels of f. The outputs g^h and g^w are then expanded and used as attention weights. Finally, the output Y of the Coordinate Attention block is:
y_c(i, j) = x_c(i, j) × g^h_c(i) × g^w_c(j)    (6)
the method adopts a multi-scale fusion module ASPP and a cooperative Attention mechanism module Coordinate Attention to improve the traditional target algorithm to obtain a target detection result with high accuracy and false detection omission rate;
the collaborative attention mechanism and multi-scale feature fusion based network comprises: the system comprises a camera data acquisition module, a data import module, a data preprocessing module and a target detection network module;
the camera data acquisition module shoots a specified area to acquire image data;
the data preprocessing module receives and processes the image data imported by the data import module; the image data is cut and normalized to be converted into data which can be processed by a target detection network;
the moving target detection network module receives the processed data, a front 34 layer of YOLOv3 is used as a front end feature mapping extractor, and after the features extracted by the front end network are subjected to data processing through a coordinated attention mechanism module CoordinateAttention, the network can allocate more attention to the area where the target is located; and the characteristics are processed by a multi-scale fusion module ASPP to obtain multi-scale information so as to improve the detection capability of the network on targets with different sizes.
As shown in fig. 1, the moving object detection method based on cooperative attention and multi-scale fusion provided by the embodiment of the present invention utilizes image information of different scales to increase the feature extraction capability of the system, and combines a cooperative attention mechanism to improve the positioning and retrieval capability of the system on key information, including the following steps;
s1: acquiring a video data set for network training through a data acquisition module;
s2: constructing a multi-scale feature fusion network based on a cooperative attention mechanism;
s3: training a network based on a cooperative attention mechanism and multi-scale feature fusion to obtain a trained network;
s4: testing the video data set through the trained network to obtain a test result;
s5: evaluating the trained network according to the test result to obtain an evaluation result, and further optimizing a network weight coefficient;
s6: inputting the video data set to be detected into the optimized network for target detection to obtain a detection result. Acquiring the video data set for network training through the data acquisition module specifically comprises: arranging a plurality of cameras in a public place with frequent activity, with lenses set to common surveillance focal lengths such as 4 mm, 8 mm and 12 mm, and acquiring images of social activity from different angles through these cameras; locating and classifying the moving targets appearing in the surveillance video images and generating an annotation file; and randomly splitting the video data set into a training set and a testing set at a fixed ratio, with 7 parts used for training. The network based on the cooperative attention mechanism and multi-scale feature fusion is divided into two branches: the first branch obtains a corresponding feature map after feature extraction and embeds a Coordinate Attention module, which takes the positional relationships of the feature map into account on top of channel attention and combines channel attention with spatial attention, so that the feature map extracted by the network is more focused on key areas; the second branch adds a multi-scale feature fusion module ASPP to fuse image features of different scales, helping the neural network generate a higher-quality target feature map and realize the target detection function. The feature extraction module takes as input the target image to be detected; the first 34 layers of YOLOv3 are adopted as the front-end feature-map extractor, and after the features extracted by the front-end network are processed by the cooperative attention module Coordinate Attention, the network allocates more attention to the region where the target is located; the features then pass through the multi-scale fusion module ASPP to improve the network's ability to detect targets of different sizes. Regarding the cooperative attention module Coordinate Attention: the cooperative attention mechanism makes the features extracted by the network focus more on the key areas of the image, improving the network's accuracy; Coordinate Attention can be regarded as a computational unit that takes any intermediate feature tensor X as input and, through the transformation, outputs a tensor Y = {y_1, y_2, ..., y_C} of the same size as X with enhanced representations; it encodes channel relationships and long-range dependencies through precise positional information, effectively improving the neural network's perception of key regions. The detailed structure of the cooperative attention module is shown in fig. 3. Regarding the multi-scale feature fusion module ASPP: in target detection, the multi-scale problem often affects the final performance of the detection network.
In order to solve this problem, a common approach is to enlarge the receptive field of the convolution kernel with dilated (atrous) convolution, but up-sampling cannot recover the detail information lost by pooling operations, so this application adopts an ASPP module (Atrous Spatial Pyramid Pooling) to address it; four cascaded dilated convolutions are used, with dilation rates of 6, 12, 18 and 24, respectively. The neural network extracts spatial information at different scales through the multi-scale feature fusion module (ASPP) structure to obtain an output containing abstract feature information; in the encoding stage, a low-level feature map containing sufficient local and edge information is fused in to supplement detail, and finally prediction is performed. The specific structure is shown in fig. 4. The network based on the cooperative attention mechanism and multi-scale feature fusion is trained to obtain a trained network, and step S3 specifically comprises the following steps:
s31), estimating regression boxes of all targets in the training image and corresponding classification credibility by using a geometric self-adaptive Gaussian kernel;
s32), preprocessing the collected data set so that the image size is fixed at 512x512, inputting the images into the neural network, and training with the Euclidean distance of the target regression boxes and the classification cross-entropy error as loss functions; during training, the data volume is increased by flipping the images left and right;
s33), storing the trained model;
s34), inputting a low-resolution video data set into the network and repeating steps s32) and s33);
s35) testing the test video data set by using the trained model, and evaluating the network by using mean Average Precision (mAP). The process is shown in fig. 6.
Further, in S4, the video data set is tested through the trained network, and a test result is obtained, which specifically includes:
s41), sampling the test set video according to 30 frames, and extracting to obtain a test image;
s42), preprocessing the test image to fix the size of the image to be 512 multiplied by 512;
s43), loading the trained target detection training network, and inputting the preprocessed test set image into a network model for processing to obtain a target detection regression frame and classification reliability;
s44), evaluating the network by using the average of the per-class APs, namely the mAP value; the specific formulas are shown in formulas (7) and (8):
AP = Σ_{i=1}^{n-1} (r_{i+1} - r_i) × P_interp(r_{i+1})    (7)
mAP = (1/N) Σ_{k=1}^{N} AP_k    (8)
where r1, r2, ..., rn are the recall values at the interpolation points of the precision interpolation segments, arranged in ascending order; that is, AP is the area under the precision-recall curve, and mAP is the average of the per-class APs. The precision is calculated as:
Precision = TP / (TP + FP)    (9)
the recall ratio is calculated by the formula:
Recall = TP / (TP + FN)    (10)
where TP is the number of positive samples judged positive, FP the number of negative samples judged positive, FN the number of positive samples judged negative, and TN the number of negative samples judged negative.
In the embodiment of the invention, the network model obtained by training on a large-scale moving target data set achieves a very significant effect in moving target detection and classification, with very good robustness and generality. Secondly, the model relies on deep learning to handle the multi-scale problem and to provide the cooperative attention mechanism, which is difficult to achieve with other methods. Finally, the network is trained end to end, runs faster than two-stream network models, and has a slight advantage in real-time performance. Therefore, the invention has clear application value in many fields such as public safety, medicine and agriculture.
Example two
The moving target detection system and method based on the cooperative attention mechanism and the multi-scale feature fusion network, as shown in fig. 7, comprise a camera data acquisition module, a data import module, a data preprocessing module and a moving target detection network module.
Firstly, the camera data acquisition module shoots a specified area to acquire image data. The image data is then transmitted through the data import module to the data preprocessing module for processing, where it is cropped and normalized into data that the target detection network can process. The processed data is then input into the moving target detection network: the first 34 layers of YOLOv3 are adopted as the front-end feature-map extractor, and after the features extracted by the front-end network are processed by the cooperative attention module Coordinate Attention, the network allocates more attention to the region where the target is located; the features are then processed by the multi-scale fusion module ASPP to obtain multi-scale information and improve the network's ability to detect targets of different sizes.
According to the moving target detection method based on cooperative attention and multi-scale fusion, provided by the embodiment of the invention, the characteristic extraction capability of the system is increased by utilizing image information of different scales, and the positioning and retrieval capability of the system on key information is improved by combining a cooperative attention mechanism; and testing the video data set through the trained network to obtain a test result. The invention improves the huge error caused by the multi-scale problem, reduces the influence of the multi-scale change of the image on the detection of the moving target, improves the detection performance of the system through a cooperative attention mechanism and reduces the false detection rate.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. A moving object detection system based on cooperative attention and multi-scale fusion is characterized in that: the system comprises a feature extraction module, a target area identification module and a network based on a cooperative attention mechanism and multi-scale feature fusion; the network is divided into two branches: the first branch obtains a corresponding feature map after extracting features from the image and embeds a Coordinate Attention module, which takes the positional relationships in the image into account on top of channel attention and combines channel attention with spatial attention; and the second branch adds a multi-scale feature fusion module ASPP to fuse image features of different scales.
2. The system for detecting moving objects based on cooperative attention and multi-scale fusion according to claim 1, characterized in that: the feature extraction module takes as input a target image to be detected; the first 34 layers of YOLOv3 serve as the front-end feature-map extractor, and after the features extracted by this front-end network are processed by the cooperative attention module Coordinate Attention, the network allocates more attention to the region where the target is located; the features are then passed through the multi-scale fusion module ASPP to obtain multi-scale information and improve the network's ability to detect targets of different sizes; the cooperative attention mechanism added during feature extraction makes the extracted features focus more on the key areas of the image and improves the network's accuracy; Coordinate Attention, as a cooperative attention mechanism, can be viewed as a computational unit that takes any intermediate feature tensor X as input and, through the transformation, outputs a tensor Y = {y_1, y_2, ..., y_C} of the same size as X with enhanced representations; Coordinate Attention encodes channel relationships and long-range dependencies through precise positional information; the multi-scale feature fusion module ASPP uses four cascaded dilated (atrous) convolutions with dilation rates of 6, 12, 18 and 24, respectively; through the ASPP structure the neural network extracts spatial information at different scales to obtain an output containing abstract feature information, and in the encoding stage a low-level feature map containing sufficient local and edge information is fused in to supplement detail before final prediction.
3. The system for detecting moving objects based on cooperative attention and multi-scale fusion as claimed in claim 2, wherein: the specific operation of the Coordinate Attention in the training process is divided into 2 steps of Coordinate information embedding and Coordinate Attention generation;
global pooling is typically used by channel attention to encode spatial information globally, but because it compresses the global spatial information into a channel descriptor it is difficult to preserve positional information; in order to enable the attention module to capture long-range spatial interactions with precise positional information, the Coordinate Attention mechanism decomposes the global pooling shown in the following formula into a pair of one-dimensional feature encoding operations:
z_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)    (1)
specifically, given an input X, each channel is first encoded along the horizontal and vertical coordinates using pooling kernels of size (H, 1) or (1, W), respectively, so that the output of the c-th channel at height h is:
z^h_c(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i)    (2)
similarly, the output formula of the c-th channel with width w is:
z^w_c(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)    (3)
the 2 transformations respectively aggregate the characteristics along two spatial directions to obtain a pair of direction-sensing characteristic graphs,
Coordinate Attention generation: the transformations above acquire a global receptive field and encode accurate positional information; in order to exploit the resulting representations, a second transformation, Coordinate Attention generation, is needed,
after the transformations of the information-embedding step, this part applies a concatenate operation to the two outputs and then a 1 × 1 convolution transform function F_1:
f = δ(F_1([z^h, z^w]))    (4)
where [·, ·] is the concatenate operation along the spatial dimension, δ is a nonlinear activation function, and f is the intermediate feature map that encodes the spatial information in the horizontal and vertical directions; f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r×H} and f^w ∈ R^{C/r×W}, where r is the reduction ratio that controls the block size (as in SE blocks); two further 1 × 1 convolution transforms F_h and F_w map f^h and f^w, respectively, to tensors with the same number of channels as the input X, yielding:
g^h = σ(F_h(f^h)),
g^w = σ(F_w(f^w))    (5)
where σ is the sigmoid activation function; to reduce the complexity and computational overhead of the model, a suitable reduction ratio r is typically used to reduce the number of channels of f; the outputs g^h and g^w are then expanded and used as attention weights; finally, the output Y of the Coordinate Attention block is:
y_c(i, j) = x_c(i, j) × g^h_c(i) × g^w_c(j)    (6)
4. The system for detecting moving objects based on cooperative attention and multi-scale fusion as claimed in claim 2, wherein: the multi-scale fusion module ASPP and the cooperative attention module Coordinate Attention are adopted to improve the conventional target detection algorithm and obtain detection results with high accuracy and low false-detection and missed-detection rates;
the collaborative attention mechanism and multi-scale feature fusion based network comprises: the system comprises a camera data acquisition module, a data import module, a data preprocessing module and a target detection network module;
the camera data acquisition module shoots a specified area to acquire image data;
the data preprocessing module receives and processes the image data imported by the data import module; the image data is cut and normalized to be converted into data which can be processed by a target detection network;
the moving target detection network module receives the processed data, a front 34 layer of YOLOv3 is used as a front end feature mapping extractor, and after the features extracted by the front end network are subjected to data processing through a coordinated Attention mechanism module coordination attribute, the network can distribute more Attention to the area where the target is located; and the characteristics are processed by a multi-scale fusion module ASPP to obtain multi-scale information so as to improve the detection capability of the network on targets with different sizes.
5. A moving target detection method based on cooperative attention and multi-scale fusion is characterized in that image information of different scales is used for increasing the characteristic extraction capability of a system, and the positioning and retrieval capability of the system on key information is improved by combining a cooperative attention mechanism, and the method comprises the following steps;
s1: acquiring a video data set for network training through a data acquisition module;
s2: constructing a multi-scale feature fusion network based on a cooperative attention mechanism;
s3: training a network based on a cooperative attention mechanism and multi-scale feature fusion to obtain a trained network;
s4: testing the video data set through the trained network to obtain a test result;
s5: evaluating the trained network according to the test result to obtain an evaluation result, and further optimizing a network weight coefficient;
s6: and inputting the video data set to be detected into the optimized network for target detection to obtain a detection result.
6. The method for detecting a moving object based on cooperative attention and multi-scale fusion as claimed in claim 5, wherein: the specific steps of step S1 are as follows: arranging a plurality of cameras in a public place with frequent activity and acquiring target images from different angles through these cameras; marking and labelling all moving targets appearing in the camera images to generate an annotation file; and randomly splitting the video data set into a training set and a testing set at a fixed ratio, with 7 parts used for training.
7. The method for detecting a moving object based on cooperative attention and multi-scale fusion as claimed in claim 6, wherein: the position of each target in the annotation file is represented by a 4-tuple that defines a bounding box, recording the pixel position (x1, y1) of the top-left corner of the region where the target appears in the image together with the pixel width and height (w, h) of that region.
8. The method for detecting a moving object based on cooperative attention and multi-scale fusion as claimed in claim 5, wherein: the step S3 of obtaining the trained network includes the following steps:
s31, estimating regression frames of all targets in the training image and corresponding classification credibility by using a geometric self-adaptive Gaussian kernel;
s32, preprocessing the collected data set so that the image size is fixed at 512x512, inputting the images into the neural network, and training with the Euclidean distance of the target regression boxes and the classification cross-entropy error as loss functions; during training, the data volume is increased by flipping the images left and right;
s33, storing the trained model;
s34, inputting a low-resolution video data set into the network and repeating steps S32 and S33;
and S35, testing the test video data set by using the trained model, and evaluating the network by using mean Average Precision (mAP).
9. The method for detecting a moving object based on cooperative attention and multi-scale fusion as claimed in claim 5, wherein: and S4, testing the video data set through the trained network to obtain a test result, and specifically comprising the following steps:
s41, sampling the test set video according to 30 frames, and extracting to obtain a test image;
s42, preprocessing the test image to fix the size of the image to be 512 multiplied by 512;
s43, loading a trained target detection training network, and inputting the preprocessed test set image into a network model for processing to obtain a target detection regression frame and classification credibility;
s44, evaluating the network by using the average value (mAP value) of each type of AP; the specific formula is shown as formulas (7) and (8):
AP = Σ_{i=1}^{n-1} (r_{i+1} - r_i) × P_interp(r_{i+1})    (7)
mAP = (1/N) Σ_{k=1}^{N} AP_k    (8)
wherein r1, r2, ..., rn are the recall values at the interpolation points of the precision interpolation segments, arranged in ascending order; that is, AP is the area under the precision-recall curve and mAP is the average of the per-class APs, where the precision is calculated as:
Precision = TP / (TP + FP)    (9)
the recall ratio is calculated by the formula:
Recall = TP / (TP + FN)    (10)
wherein TP is the number of positive samples judged positive, FP the number of negative samples judged positive, FN the number of positive samples judged negative, and TN the number of negative samples judged negative.
CN202210620275.XA 2022-06-02 2022-06-02 Moving target detection system and method based on cooperative attention and multi-scale fusion Pending CN115188066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210620275.XA CN115188066A (en) 2022-06-02 2022-06-02 Moving target detection system and method based on cooperative attention and multi-scale fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210620275.XA CN115188066A (en) 2022-06-02 2022-06-02 Moving target detection system and method based on cooperative attention and multi-scale fusion

Publications (1)

Publication Number Publication Date
CN115188066A true CN115188066A (en) 2022-10-14

Family

ID=83514196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210620275.XA Pending CN115188066A (en) 2022-06-02 2022-06-02 Moving target detection system and method based on cooperative attention and multi-scale fusion

Country Status (1)

Country Link
CN (1) CN115188066A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452960A (en) * 2023-04-20 2023-07-18 南京航空航天大学 Multi-mode fusion military cross-domain combat target detection method
CN116645696A (en) * 2023-05-31 2023-08-25 长春理工大学重庆研究院 Contour information guiding feature detection method for multi-mode pedestrian detection
CN116645696B (en) * 2023-05-31 2024-02-02 长春理工大学重庆研究院 Contour information guiding feature detection method for multi-mode pedestrian detection

Similar Documents

Publication Publication Date Title
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108537191B (en) Three-dimensional face recognition method based on structured light camera
Qiu et al. RGB-DI images and full convolution neural network-based outdoor scene understanding for mobile robots
CN115188066A (en) Moving target detection system and method based on cooperative attention and multi-scale fusion
CN107977656A (en) A kind of pedestrian recognition methods and system again
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN111814661A (en) Human behavior identification method based on residual error-recurrent neural network
CN111209811B (en) Method and system for detecting eyeball attention position in real time
CN111639580B (en) Gait recognition method combining feature separation model and visual angle conversion model
CN113139489A (en) Crowd counting method and system based on background extraction and multi-scale fusion network
Chang et al. Changes to captions: An attentive network for remote sensing change captioning
CN115100684A (en) Clothes-changing pedestrian re-identification method based on attitude and style normalization
CN111353429A (en) Interest degree method and system based on eyeball turning
Lee et al. Ev-reconnet: Visual place recognition using event camera with spiking neural networks
CN113343810B (en) Pedestrian re-recognition model training and recognition method and device based on time sequence diversity and correlation
CN115063717A (en) Video target detection and tracking method based on key area live-action modeling
CN115100681A (en) Clothes identification method, system, medium and equipment
Verma et al. Intensifying security with smart video surveillance
CN114743257A (en) Method for detecting and identifying image target behaviors
Shi et al. Cobev: Elevating roadside 3d object detection with depth and height complementarity
CN113869151A (en) Cross-view gait recognition method and system based on feature fusion
Jain et al. Generating Bird’s Eye View from Egocentric RGB Videos
Zhang et al. Multi-Moving Camera Pedestrian Tracking with a New Dataset and Global Link Model
Brander et al. Improving Data-Scarce Image Classification Through Multimodal Synthetic Data Pretraining
Pang et al. PTRSegNet: A Patch-to-Region Bottom-Up Pyramid Framework for the Semantic Segmentation of Large-Format Remote Sensing Images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination