CN112699914A - Target detection method and system based on heterogeneous composite backbone - Google Patents

Target detection method and system based on heterogeneous composite backbone

Info

Publication number
CN112699914A
Authority
CN
China
Prior art keywords
target detection
training
network
data
backbone
Prior art date
2020-12-02
Legal status
Granted
Application number
CN202011388828.0A
Other languages
Chinese (zh)
Other versions
CN112699914B (en)
Inventor
郑慧诚
严志伟
陈蔓薇
李烨
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
2020-12-02
Filing date
2020-12-02
Publication date
2021-04-23
2020-12-02: Application filed by Sun Yat Sen University
2020-12-02: Priority to CN202011388828.0A
2021-04-23: Publication of CN112699914A
Application granted
2023-09-22: Publication of CN112699914B
Current legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and system based on a heterogeneous composite backbone, wherein the method comprises the following steps: acquiring training data and preprocessing the training data to obtain preprocessed data; constructing a target detection network based on a heterogeneous composite backbone architecture; training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network; and acquiring data to be detected, inputting the data to be detected into the trained target detection network, and outputting a detection result. The system comprises: a preprocessing module, a network construction module, a training module and a detection module. With the method and system, the complementary features learned by two heterogeneous backbone networks are integrated while feature redundancy is avoided, thereby enhancing the overall feature representation and target detection performance of the detector. The target detection method and system based on the heterogeneous composite backbone can be widely applied in the field of target detection networks.

Description

Target detection method and system based on heterogeneous composite backbone
Technical Field
The invention belongs to the field of target detection networks, and particularly relates to a target detection method and a target detection system based on a heterogeneous composite backbone.
Background
Object detection is a fundamental and widely used task in the field of computer vision. As an essential component of many vision systems, it plays a significant role in the overall performance of those systems. With the research and application of deep learning in computer vision, the performance of target detectors has been improving continuously and substantially.
In a common deep-learning-based target detection network, the backbone network is mainly responsible for extracting features relevant to the target, and its output feature maps are the basis for target localization and recognition by the detection head, so they are crucial to the overall performance of the network. In a detector, the backbone network typically has more parameters than the detection head, so as to ensure adequate learning and expression of the data distribution. Existing target detectors generally adopt a network pre-trained on a classification task in order to make full use of the large number of training samples with class labels; however, a network structure designed for the classification task is not entirely suitable for the detection task, and a domain shift phenomenon often arises when such a structure is applied to a specific target detection task.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a target detection method and system based on a heterogeneous composite backbone, which address the domain shift problem of the backbone network.
The first technical scheme adopted by the invention is as follows: a target detection method based on a heterogeneous composite backbone, comprising the following steps:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a heterogeneous composite backbone architecture;
training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
Further, the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically includes:
acquiring training data according to a preset problem;
carrying out category and position marking on the training data to obtain marked training data;
and the information in the marked training data comprises an original material picture, and a marking record of a target position and a category in the picture.
Further, the target detection network comprises a detail extraction backbone, a depth backbone and a composite module, and the detail extraction backbone and the depth backbone realize backbone network composite through the composite module.
Further, the composite module includes a 1 × 1 convolutional layer and an addition unit.
Further, the detail extraction backbone is constructed based on the ResNet structure, and comprises a stem section with the first pooling layer removed and an exploration subnet in which the basic module is replaced with a narrow module.
Further, the narrow module comprises two 3 × 3 convolutional layers with a small number of parameters.
Further, the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network specifically includes:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
during training of the target detection network, taking the training set as input and computing the network output through convolution and other operations to obtain a set of prediction boxes;
according to the classification subtask and the positioning subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction-box category vector and the annotation-box category vector as the loss function;
for the positioning subtask, calculating the position loss between the prediction box and the annotation box through the Smooth L1 loss function;
calculating the gradients of the parameters in the convolutional layers, layer by layer, according to the computed loss and stochastic gradient descent, and updating the parameters of each layer in the network;
during training, at intervals of a fixed number of iterations, evaluating the generalization of the network by taking the validation set as input;
and after training is finished, evaluating the performance of the network by taking the test set as input, and at the same time saving parameters such as the convolution kernels and biases in the network, to obtain the trained target detection network.
Further, the step of acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting, through the convolutional layers, a sequence of 4-dimensional vectors representing prediction-box positions and a sequence of N-dimensional vectors representing class predictions;
the detector discards a portion of low-quality results from the N-dimensional class-prediction vector sequence according to a manually preset class confidence threshold, to obtain the remaining detection results;
and for the remaining detection results, calculating the overlap rate between prediction boxes from the position 4-dimensional vectors, and de-duplicating the prediction boxes based on the prediction-box confidences and a non-maximum suppression algorithm, to obtain and output the final detection result of the detector.
The method and the system have the following beneficial effects: through the composite module, the detection network integrates the complementary features learned by the two heterogeneous backbone networks while avoiding feature redundancy, thereby enhancing the overall feature representation and target detection performance of the detector; the network structure is also simplified to reduce the number of network parameters and the computational complexity.
Drawings
FIG. 1 is a network architecture of a target detection network based on heterogeneous composite backbone according to the present invention;
FIG. 2 is a flowchart illustrating steps of a method for detecting a target based on a heterogeneous composite backbone according to the present invention;
fig. 3 is a structural block diagram of a target detection system based on a heterogeneous composite backbone according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
As shown in fig. 1 and fig. 2, the present invention provides a target detection method based on a heterogeneous composite backbone, which comprises the following steps:
s1, acquiring training data and preprocessing the training data to obtain preprocessed data;
s2, constructing a target detection network based on the heterogeneous composite backbone architecture;
s3, training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and S4, acquiring the data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
Further, as a preferred embodiment of the method, the step of obtaining the training data and preprocessing the training data to obtain preprocessed data specifically includes:
acquiring training data according to a preset problem;
specifically, training data is collected according to the problem to be solved (such as general target detection, face detection, and floater detection).
Carrying out category and position marking on the training data to obtain marked training data;
and the information in the marked training data comprises an original material picture, and a marking record of a target position and a category in the picture.
Further, as a preferred embodiment of the method, the target detection network comprises a detail extraction backbone, a depth backbone and a composite module, and the detail extraction backbone and the depth backbone realize backbone network composite through the composite module.
As a further preferred embodiment of the method, the composite module comprises a 1 × 1 convolutional layer and an addition unit.
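By way of a non-limiting illustration, a minimal sketch of such a composite module is given below, assuming a PyTorch implementation; the channel counts used in the example are assumptions, since the text only specifies a 1 × 1 convolution followed by element-wise addition.

    import torch
    import torch.nn as nn

    class CompositeModule(nn.Module):
        """Fuses a detail-backbone feature map into the depth backbone."""
        def __init__(self, detail_channels, depth_channels):
            super().__init__()
            # the 1x1 convolution aligns the detail-backbone channels with the depth backbone
            self.align = nn.Conv2d(detail_channels, depth_channels, kernel_size=1)

        def forward(self, detail_feat, depth_feat):
            # element-wise addition combines the two heterogeneous feature maps
            return depth_feat + self.align(detail_feat)

    # example: fuse a 128-channel detail feature into a 256-channel depth feature map
    fuse = CompositeModule(detail_channels=128, depth_channels=256)
    out = fuse(torch.randn(1, 128, 64, 64), torch.randn(1, 256, 64, 64))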
Further, as a preferred embodiment of the method, the detail extraction backbone is constructed based on a ResNet structure, and comprises a stem part with the first pooling layer removed and an exploration subnet in which the basic module is replaced with a narrow module.
Specifically, the detail extraction backbone is based on a ResNet structure and mainly comprises a stem part and an exploration subnet. Compared with the ordinary ResNet, the difference is that the stem part of the detail extraction backbone proposed by the invention removes the first pooling layer of the original network. In the exploration subnet, the invention replaces the basic module of the original ResNet with a narrow module with fewer parameters, so as to reduce the network capacity and the video memory footprint.
In addition, the detail extraction backbone is used for fine-grained feature extraction and, as shown in fig. 1, comprises five convolution levels, i.e., conv_1, conv2_x, conv3_x, conv4_x and conv5_x. Compared with the original ResNet, the invention removes the first pooling layer, so the output features of each convolution level have a smaller stride, which effectively preserves local detail information.
The feature stride in the proposed detail extraction backbone is half that of ResNet-34, and the receptive field of each convolution level is smaller than the corresponding one in ResNet-34. Compared with ResNet-34, which is designed for the classification task, this network better preserves spatial local information in the image and is therefore more suitable for target detection.
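Purely for illustration, and assuming a PyTorch implementation with the standard ResNet stem hyperparameters (which the description does not spell out), the modification to the stem can be sketched as follows.

    import torch.nn as nn

    # standard ResNet stem, shown for comparison
    resnet_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

    # detail-extraction stem: the first max-pooling layer is removed, so later
    # convolution levels see features at half the stride (twice the resolution)
    detail_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )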
Further, as a preferred embodiment of the method, the narrow module comprises two 3 × 3 convolutional layers with a small number of parameters.
Specifically, reducing the feature extraction stride facilitates the extraction of local detail information, but it also increases the computational burden. For this purpose, the invention designs a narrow module to replace the basic residual module (BasicBlock) in the original ResNet-34. As shown in fig. 1, the narrow module comprises two convolutional layers: for example, when the number of input channels is 256, the number of output channels of the first convolutional layer is 256/γ, where the compression ratio γ is a hyperparameter; the second convolutional layer then expands the feature channels back to 256. For both the narrow module and the BasicBlock, the computational complexity of a single convolutional layer is k²·C_i·C_o·H·W, where k, C_i, C_o, H and W are respectively the size of the square convolution kernel, the number of input feature channels, the number of output feature channels, and the height and width of the output feature map. The parameter amount and computational complexity of the two convolutional layers can therefore be adjusted through the compression ratio γ.
The invention places the narrow modules in the deep layers of the convolutional network, which contain most of the parameters, i.e. the exploration subnet part of fig. 1, so as to greatly reduce the network parameters without changing the stem structure, thereby making full use of the low-level pre-trained parameters, which have better generalization.
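By way of illustration only, a narrow module of this kind could be sketched in PyTorch as below; the use of batch normalization, the residual connection and the example compression ratio of 4 are assumptions carried over from the BasicBlock design rather than values fixed by the text.

    import torch
    import torch.nn as nn

    class NarrowModule(nn.Module):
        """Two 3x3 convolutions with a channel bottleneck controlled by gamma."""
        def __init__(self, channels, gamma=4):
            super().__init__()
            mid = channels // gamma                       # compressed channel count
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(mid)
            self.conv2 = nn.Conv2d(mid, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))      # compress: channels -> channels/gamma
            out = self.bn2(self.conv2(out))               # expand:   channels/gamma -> channels
            return self.relu(out + x)                     # residual connection

    block = NarrowModule(channels=256, gamma=4)
    y = block(torch.randn(1, 256, 32, 32))                # output shape preserved: (1, 256, 32, 32)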
As a preferred embodiment of the method, the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain the trained target detection network specifically includes:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
during training of the target detection network, taking the training set as input and computing the network output through convolution and other operations to obtain a set of prediction boxes;
Specifically, before training, a series of preprocessing rules for the input image are set; the mandatory preprocessing operations include image normalization, which stabilizes training, and image resizing, which controls the computational complexity. During training, on top of these necessary operations, a series of random preprocessing operations such as random cropping are introduced for data augmentation, so as to enhance the performance of the network.
according to the classification subtask and the positioning subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction-box category vector and the annotation-box category vector as the loss function;
for the positioning subtask, calculating the position loss between the prediction box and the annotation box through the Smooth L1 loss function;
calculating the gradients of the parameters in the convolutional layers, layer by layer, according to the computed loss and stochastic gradient descent, and updating the parameters of each layer in the network;
during training, at intervals of a fixed number of iterations, evaluating the generalization of the network by taking the validation set as input;
and after training is finished, evaluating the performance of the network by taking the test set as input, and at the same time saving parameters such as the convolution kernels and biases in the network, to obtain the trained target detection network.
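The training loop itself is conventional; the toy sketch below only illustrates the loss computation, the stochastic-gradient-descent update and the periodic validation, with random data and a placeholder model standing in for the detector and the preprocessed data set, and with the learning rate, batch size and validation interval chosen purely as assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # placeholder for the detector
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

    for step in range(1, 101):
        images = torch.randn(8, 3, 32, 32)                 # stand-in for a preprocessed training batch
        labels = torch.randint(0, 10, (8,))
        loss = F.cross_entropy(model(images), labels)      # classification part of the loss
        optimizer.zero_grad()
        loss.backward()                                    # gradients computed layer by layer
        optimizer.step()                                   # stochastic gradient descent update
        if step % 20 == 0:
            # every fixed number of iterations the validation set would be evaluated here
            print(f"step {step}: training loss {loss.item():.3f}")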
Specifically, in actual detection, the trained model can be restored simply by assigning each saved parameter value, according to its parameter name, to the parameter of the corresponding layer in the network; the restored model then serves as the basis for outputting detection results in the subsequent detection process.
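Assuming a PyTorch implementation, this save-and-restore by parameter name amounts to the state-dict round trip sketched below; the small sequential model is merely a placeholder for the full detection network.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())   # placeholder network
    torch.save(model.state_dict(), "detector_params.pth")              # convolution kernels and biases, keyed by parameter name

    restored = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
    restored.load_state_dict(torch.load("detector_params.pth"))        # values assigned to layers by parameter name
    restored.eval()                                                    # ready for the detection phase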
Specifically, the training strategy mainly comprises the following two parts:
(1) Three-way supervision: if the heterogeneous composite backbone network is trained directly, a suppression phenomenon may occur, that is, the pre-trained backbone converges faster and reaches a local optimum, while most features in the detail extraction network are optimized towards 0 and become ineffective. Simply increasing the learning rate of the detail extraction network only accelerates the suppression and cannot solve the problem. Therefore, the invention sets anchors on each of the three branches, namely the pre-trained depth network, the detail extraction network and the composite part, and adds a supervision signal to each; the total loss can be expressed as:
L_total = Σ_j Σ_i ( L_conf(p_ji, c_i) + [c_i ≥ 1] · L_loc(r_ji, g_i) )
where j is the branch index, p_ji and r_ji respectively denote the class confidence prediction and the positioning prediction of branch j on the i-th anchor, and c_i and g_i respectively denote the ground-truth class label and positioning label corresponding to the i-th anchor. The Iverson bracket [c_i ≥ 1] equals 1 if and only if c_i ≥ 1, and 0 otherwise. Through these additional supervision signals, the three-way supervision ensures that all features, including those of the detail extraction network, possess the discriminability required for detection, so that the network can more easily escape from the local optimum.
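A minimal sketch of this loss, assuming a PyTorch implementation and leaving aside any normalization terms the original formula may carry, is given below; the tensor layout is an assumption made for illustration.

    import torch
    import torch.nn.functional as F

    def three_way_loss(branch_cls, branch_loc, labels, boxes):
        """Sum of classification and localization losses over the three supervised branches.

        branch_cls: list of three tensors of shape (num_anchors, num_classes) with class logits
        branch_loc: list of three tensors of shape (num_anchors, 4) with positioning predictions
        labels:     (num_anchors,) ground-truth class index per anchor, 0 meaning background
        boxes:      (num_anchors, 4) ground-truth positioning targets
        """
        positive = labels >= 1                                      # Iverson bracket [c_i >= 1]
        total = torch.zeros(())
        for cls_logits, loc_pred in zip(branch_cls, branch_loc):
            conf = F.cross_entropy(cls_logits, labels, reduction="sum")
            loc = F.smooth_l1_loss(loc_pred[positive], boxes[positive], reduction="sum")
            total = total + conf + loc
        return total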
(2) Network decomposition: the invention provides a network decomposition training strategy, which divides the network into a shallow stem part and a deep exploration subnet part. The stem part is initialized with pre-trained parameters and is only fine-tuned during training, so that the generalization of the shallow features is fully exploited to reduce the optimization difficulty and accelerate convergence. In the exploration subnet, the basic modules of the original ResNet are replaced with narrow modules to reduce the number of parameters and improve computational efficiency; its parameters are randomly initialized and trained with a larger learning rate to encourage a broader search of the parameter space. The training cost of this decomposed training strategy is essentially the same as that of a fully pre-trained network.
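A sketch of the decomposed optimization, again assuming a PyTorch implementation: the stem keeps its pre-trained parameters and is only fine-tuned with a small learning rate, while the exploration subnet is trained with a larger one. The module layout and the learning-rate values are assumptions for the example.

    import torch
    import torch.nn as nn

    class DetailBackbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.stem = nn.Sequential(                    # shallow part, initialized from pre-trained weights
                nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True))
            self.explore_subnet = nn.Sequential(          # deep part built from narrow modules, randomly initialized
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True))

    backbone = DetailBackbone()
    optimizer = torch.optim.SGD(
        [{"params": backbone.stem.parameters(), "lr": 1e-4},             # fine-tune only
         {"params": backbone.explore_subnet.parameters(), "lr": 1e-2}],  # larger steps for a broader search
        momentum=0.9, weight_decay=5e-4)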
In addition, the loss function L_conf used for training the classification subtask is the cross entropy, and the loss function L_loc used for training the regression subtask is the Smooth L1 loss.
Further, as a preferred embodiment of the method, the step of acquiring the data to be detected, inputting the data to be detected to the trained target detection network, and outputting the detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting, through the convolutional layers, a sequence of 4-dimensional vectors representing prediction-box positions and a sequence of N-dimensional vectors representing class predictions;
the detector discards a portion of low-quality results from the N-dimensional class-prediction vector sequence according to a manually preset class confidence threshold, to obtain the remaining detection results;
and for the remaining detection results, calculating the overlap rate between prediction boxes from the position 4-dimensional vectors, and de-duplicating the prediction boxes based on the prediction-box confidences and a non-maximum suppression algorithm, to obtain and output the final detection result of the detector.
Specifically, the detector first discards a portion of the low-quality results from the N-dimensional class-prediction sequence according to a manually preset class confidence threshold. The remaining detection results are then de-duplicated with a non-maximum suppression (NMS) algorithm, using the prediction-box confidences and the overlap rate between prediction boxes computed from the position 4-dimensional vectors. The prediction boxes that remain at the end constitute the detection result of the detector.
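A compact sketch of this post-processing using torchvision's NMS operator is shown below; the assumption that class index 0 denotes background and the threshold values are illustrative only.

    import torch
    from torchvision.ops import nms

    def postprocess(boxes, class_logits, score_thresh=0.5, iou_thresh=0.5):
        """boxes: (M, 4) predicted box coordinates; class_logits: (M, N) class predictions."""
        scores, labels = torch.softmax(class_logits, dim=1).max(dim=1)
        keep = (scores >= score_thresh) & (labels > 0)        # discard low-confidence and background boxes
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
        kept = nms(boxes, scores, iou_thresh)                 # de-duplicate boxes by overlap rate
        return boxes[kept], scores[kept], labels[kept]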
As shown in fig. 3, a target detection system based on heterogeneous composite backbones includes the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on the heterogeneous composite backbone architecture;
the training module is used for training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and the detection module is used for acquiring data to be detected, inputting the data to be detected into the trained target detection network and outputting a detection result.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A target detection method based on a heterogeneous composite trunk is characterized by comprising the following steps:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a heterogeneous composite backbone architecture;
training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
2. The method according to claim 1, wherein the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically comprises:
acquiring training data according to a preset problem;
carrying out category and position marking on the training data to obtain marked training data;
and the information in the marked training data comprises an original material picture, and a marking record of a target position and a category in the picture.
3. The target detection method based on the heterogeneous composite backbone of claim 2, wherein the target detection network comprises a detail extraction backbone, a depth backbone and a composite module, and the detail extraction backbone and the depth backbone realize backbone network composite through the composite module.
4. The method of claim 3, wherein the composite module comprises a 1 x 1 convolutional layer and an addition unit.
5. The method of claim 4, wherein the detail extraction backbone is constructed based on a ResNet structure, and the detail extraction backbone comprises a stem part with the first pooling layer removed and an exploration subnet in which the basic module is replaced with a narrow module.
6. The method of claim 5, wherein the narrow module comprises two 3 x 3 convolutional layers with a small number of parameters.
7. The target detection method based on the heterogeneous composite backbone according to claim 6, wherein the step of training the target detection network based on the preprocessed data and the preset training strategy to obtain the trained target detection network specifically comprises:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
during training of the target detection network, taking the training set as input and computing the network output through convolution and other operations to obtain a set of prediction boxes;
according to the classification subtask and the positioning subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction-box category vector and the annotation-box category vector as the loss function;
for the positioning subtask, calculating the position loss between the prediction box and the annotation box through the Smooth L1 loss function;
calculating the gradients of the parameters in the convolutional layers, layer by layer, according to the computed loss and stochastic gradient descent, and updating the parameters of each layer in the network;
during training, at intervals of a fixed number of iterations, evaluating the generalization of the network by taking the validation set as input;
and after training is finished, evaluating the performance of the network by taking the test set as input, and at the same time saving parameters such as the convolution kernels and biases in the network, to obtain the trained target detection network.
8. The target detection method based on the heterogeneous composite backbone of claim 7, wherein the step of obtaining the data to be detected, inputting the data to be detected to the trained target detection network, and outputting the detection result specifically comprises:
acquiring data to be detected to obtain an image of a target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting, through the convolutional layers, a sequence of 4-dimensional vectors representing prediction-box positions and a sequence of N-dimensional vectors representing class predictions;
the detector discards a portion of low-quality results from the N-dimensional class-prediction vector sequence according to a manually preset class confidence threshold, to obtain the remaining detection results;
and for the remaining detection results, calculating the overlap rate between prediction boxes from the position 4-dimensional vectors, and de-duplicating the prediction boxes based on the prediction-box confidences and a non-maximum suppression algorithm, to obtain and output the final detection result of the detector.
9. A target detection system based on heterogeneous composite backbones is characterized by comprising the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on the heterogeneous composite backbone architecture;
the training module is used for training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and the detection module is used for acquiring data to be detected, inputting the data to be detected into the trained target detection network and outputting a detection result.

Priority Applications (1)

Application Number: CN202011388828.0A (granted as CN112699914B); Priority date: 2020-12-02; Filing date: 2020-12-02; Title: Target detection method and system based on heterogeneous composite backbone

Publications (2)

Publication Number Publication Date
CN112699914A 2021-04-23
CN112699914B CN112699914B (en) 2023-09-22

Family

ID: 75506104

Family Applications (1)

Application Number: CN202011388828.0A (granted as CN112699914B); Priority date: 2020-12-02; Filing date: 2020-12-02; Status: Active; Title: Target detection method and system based on heterogeneous composite backbone

Country Status (1)

CN: CN112699914B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN110188720A (en) * 2019-06-05 2019-08-30 上海云绅智能科技有限公司 A kind of object detection method and system based on convolutional neural networks
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A kind of small target deteection of Enhanced feature study and recognition methods
CN111797676A (en) * 2020-04-30 2020-10-20 南京理工大学 High-resolution remote sensing image target on-orbit lightweight rapid detection method

Also Published As

Publication number Publication date
CN112699914B (en) 2023-09-22


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant