CN114241250A - Cascade regression target detection method and device and computer readable storage medium - Google Patents

Cascade regression target detection method and device and computer readable storage medium

Info

Publication number
CN114241250A
Authority
CN
China
Prior art keywords
regression
network
detection
cascaded
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111092255.1A
Other languages
Chinese (zh)
Inventor
张可
袁堃
葛绍妹
邓其龙
杨俊
高昱峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
State Grid Electric Power Research Institute
Original Assignee
Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
State Grid Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Nanrui Jiyuan Power Grid Technology Co ltd, Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd, State Grid Electric Power Research Institute filed Critical Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Priority to CN202111092255.1A priority Critical patent/CN114241250A/en
Publication of CN114241250A publication Critical patent/CN114241250A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cascade regression target detection method, a device and a computer readable storage medium. The method comprises the following steps: acquiring a picture to be detected, standardizing its pixels and scaling it to a uniform size; and inputting the adjusted picture to be detected into a trained deep convolutional neural network model, wherein the deep convolutional neural network model is obtained by training on pictures with label information and comprises a backbone network for feature extraction, a cascaded region suggestion module for adjusting preset boxes step by step, a cascaded two-way regressor module for fine-tuning the preset boxes, and a distance loss function and a loss function for optimizing these modules. The invention improves the performance of a two-stage target detector through the idea of cascade regression and reduces the detection time of the algorithm while maintaining high detection precision.

Description

Cascade regression target detection method and device and computer readable storage medium
Technical Field
The invention relates to the field of image recognition, in particular to a cascade regression target detection method and device and a computer readable storage medium.
Background
In recent years, with the improvement of computing performance, artificial intelligence has not only made great progress but has also gradually been applied at scale across industries. Computer vision is one of the important research fields of artificial intelligence and has strongly promoted innovation in production and daily life. Examples include unmanned vehicles representing future travel modes, intelligent manufacturing for improving factory production efficiency, and intelligent security for early warning, with computer vision serving as the technical basis for all of them.
Besides image classification, target detection, semantic segmentation and instance segmentation are three basic visual recognition tasks in the field of computer vision. Target detection not only needs to identify the target category but also needs to predict the position of the target with a rectangular box; it is a basic module required by many scenarios, such as face recognition, pedestrian detection, video analysis and mark detection. Semantic segmentation assigns a specific class label to each pixel, thereby providing a richer image description; unlike target detection, it does not distinguish multiple targets of the same class. Instance segmentation is a combination of target detection and semantic segmentation, requiring the identification of different objects and the assignment of a pixel-by-pixel class mask to each object. In practice, instance segmentation can be regarded as a special case of target detection that locates an object pixel by pixel rather than with a rectangular box.
Target detection algorithms are generally classified into two-stage and single-stage algorithms. A two-stage target detector generally comprises two parts: a candidate region generator, which searches the image for regions in which a target may exist, with the main purpose of finding regions with a high recall rate so that every object in the image is matched to a candidate region as far as possible; and a detector that classifies and regresses the candidate regions. The two-stage target detector has the advantage of high detection accuracy, but often cannot meet the requirement of real-time detection.
The single-stage target detector generally has a simpler architecture. Unlike the two-stage detection algorithm, it does not divide the detection process into candidate box generation and region classification; instead, every pixel on the feature map is considered a possible target location, and a rectangular box and several confidence scores are assigned to it to determine the precise position and probability of an object. The single-stage target detector has a higher detection speed but is often accompanied by lower detection precision.
In general, the requirement on precision outweighs the requirement on speed, so two-stage target detection algorithms have wider applicability. However, existing two-stage algorithms with higher performance usually consume more computational resources and detection time, making it difficult to meet practical requirements. Therefore, a lightweight, high-performance two-stage target detection method is urgently needed.
Disclosure of Invention
Aiming at the problem of poor detection performance in the prior art, the invention provides a cascade regression target detection method and device and a computer readable storage medium, which improve the performance of a two-stage target detector through the idea of cascade regression and reduce the detection time of the algorithm while maintaining high detection precision.
The technical scheme of the invention is as follows.
A cascade regression target detection method comprises the following steps:
acquiring a picture to be detected, standardizing its pixels and scaling it to a uniform size;
inputting the adjusted picture to be detected into a trained deep convolutional neural network model, wherein the deep convolutional neural network model is obtained by training a training picture with label information and comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, a cascaded two-path regressor module for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing the modules;
and detecting the picture to be detected by using the trained deep convolutional neural network model to obtain a detection result.
Preferably, the backbone network extracts the features of the picture through a plurality of feature prediction layers with different resolutions; the backbone network comprises one of a ResNet series network or a VGG series network and an additional convolution network added on the basis of the ResNet series network or the VGG series network.
Preferably, the cascaded region suggestion module performs classification and regression according to the extracted features and comprises a cascaded region suggestion network, a feature fusion module and a classification regression module; the cascaded region suggestion network fine-tunes the size and position of the preset boxes in two steps to generate region suggestions; the feature fusion module fuses features of different scales together; and the classification regression module extracts N × N candidate-region features for classification and regression according to the region suggestions provided by the cascaded region suggestion network.
Preferably, the cascaded two-way regressor module uses a neural network with a fully-connected architecture and a neural network with a fully-convolutional architecture as its two regression branches to regress the same candidate region and adjust the size and position of the candidate region, comprising the following three stages (an illustrative sketch follows them):
the first stage is as follows: taking the position of a preset frame as a candidate region, extracting target features and cutting the target features into specified sizes, respectively inputting two paths of regression branches for prediction, adjusting the size and the position of the candidate region according to a prediction result, extracting two groups of features, and fusing the two groups of features by using a convolutional neural network;
and a second stage: inputting the features fused in the first stage into two regression branches respectively for prediction, adjusting the size and the position of a candidate region according to a prediction result, extracting two groups of features, and fusing the two groups of features by using a convolutional neural network;
and a third stage: and inputting the features fused in the second stage into two regression branches respectively for prediction, adjusting the size and the position of the candidate region according to the prediction result, extracting two groups of features, and fusing the two groups of features by using a convolutional neural network to obtain the size and the position of the candidate region which is finally predicted.
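A minimal PyTorch sketch of the three stages above is given below for illustration; the module names (fc_branch, conv_branch, fuse_conv), the 7 × 7 crop size and the averaging of the two branch predictions are assumptions rather than the exact construction of the invention.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TwoWayCascadeRegressor(nn.Module):
    """Illustrative sketch of the cascaded two-way regressor (assumed sizes and names)."""
    def __init__(self, channels=256, crop=7, stages=3):
        super().__init__()
        self.crop, self.stages = crop, stages
        # regression branch with a fully-connected architecture: predicts 4 box offsets
        self.fc_branch = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * crop * crop, 1024), nn.ReLU(),
            nn.Linear(1024, 4))
        # regression branch with a fully-convolutional architecture: predicts 4 box offsets
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 4))
        # 1x1 convolution that fuses the two groups of region features
        self.fuse_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, feat, boxes):
        # feat: [1, C, H, W] feature map; boxes: [N, 4] candidate regions (x1, y1, x2, y2)
        idx = torch.zeros(len(boxes), 1, dtype=boxes.dtype)          # batch index column
        region = roi_align(feat, torch.cat([idx, boxes], 1), (self.crop, self.crop))
        for _ in range(self.stages):
            d_fc, d_conv = self.fc_branch(region), self.conv_branch(region)
            boxes_fc, boxes_conv = boxes + d_fc, boxes + d_conv      # simplified box adjustment
            # re-extract features at both predicted positions and fuse them pixel by pixel
            r_fc = roi_align(feat, torch.cat([idx, boxes_fc], 1), (self.crop, self.crop))
            r_conv = roi_align(feat, torch.cat([idx, boxes_conv], 1), (self.crop, self.crop))
            region = self.fuse_conv(r_fc + r_conv)
            boxes = (boxes_fc + boxes_conv) / 2                      # assumed fusion of the two predictions
        return boxes, region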
Preferably, at each stage of the regression, a cross entropy function is used to optimize the classification and a Smooth L1 function is used to optimize the regression; the distance loss function is used to minimize the difference between the output results of the two regression branches.
Preferably, the distance loss function is defined as:
L_dist = Σ_i SmoothL1( f_b(x_t)_i - f_d(x_t)_i );
wherein the Smooth L1 function is:
SmoothL1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
wherein x_t is the coordinate-fused feature information predicted according to the two regression branches in the previous stage, f_b represents the regression branch of the fully-connected layer architecture, f_d represents the regression branch of the fully-convolutional architecture, and i denotes the center coordinates and the width and height of the i-th candidate region.
Preferably, the loss function includes a cascaded region suggestion module loss function and a cascaded two-way regressor loss function:
Loss1 = (1/N1) Σ_i [ L_b(p_i, p_i*) + L_r(x_i, g_i*) ];
Loss2 = (1/N2) Σ_i [ L_m(c_i, p_i*) + L_r(t_i, g_i*) ];
wherein Loss1 is the cascaded region suggestion module loss function and Loss2 is the cascaded two-way regressor loss function; i represents the subscript of the preset anchor box, p_i and x_i are the binary classification prediction and coordinate prediction of the cascaded region suggestion module, p_i* and g_i* respectively represent the true class and the offset vector of the preset anchor box with index i, c_i and t_i are the second-stage multi-class prediction probability and coordinate detection result, N1, N2 and N3 respectively represent the numbers of positive samples in the first, second and third detection stages of the cascade regression, L_b is the binary cross-entropy loss, L_m is the multi-class cross-entropy loss for judging the object class, and L_r is the Smooth-L1 loss function; the total loss is the weighted sum of the cascaded region suggestion module loss function and the cascaded two-way regressor loss function.
Preferably, the detecting the picture to be detected by using the trained network model includes:
inputting Q test pictures into the trained network model for detection;
keeping the detection results R = {R_1, R_2, …, R_q, …, R_Q} by category;
and calculating the area intersection ratio between the rotated rectangular frames, performing non-maximum value suppression, and only keeping the detection frames with larger scores and small mutual overlapping areas as final detection results.
Preferably, the non-maximum suppression includes:
for the initial detection result R_q, sorting the prediction scores of all detection boxes under the same category in descending order, the sorted result being R'_q = {R'_c1, R'_c2, …, R'_cf, …, R'_cF}, wherein R'_cf is the sorted detection result of the f-th class;
for any detection box b in R'_cf, calculating the area intersection ratio between b and all detection boxes whose prediction scores are smaller than the current score;
if the area intersection ratio of two detection boxes exceeds the threshold t_iou, discarding the detection box bs with the lower score.
The invention also includes a cascade regression target detection device, comprising:
the acquisition module is used for acquiring the picture to be detected and/or the training picture;
the preprocessing module is used for carrying out pixel standardization and scaling operation on the detection picture and/or the training picture acquired by the acquisition module; the neural network module is used for configuring a deep convolutional neural network and comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, two cascaded regression modules for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing the modules;
the neural network module is also used for training by means of the training pictures and detecting the pictures to be detected so as to obtain a detection result.
The invention also includes a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method described above. The invention adopts a popular convolutional neural network algorithm to solve the target detection problem, adopts a strategy of adjusting the position and the size of a preset frame step by step in a cascaded area suggestion module, and provides more accurate initialization information for a detection probe, thereby improving the performance of the detector. The cascaded regional suggestion network can provide more accurate preset frames for the detector by finely adjusting the size and the position of the preset frames step by step, so that overfitting is avoided. The size and the position of the preset frame are finely adjusted on the global feature map, so that the global feature map is shared by all the preset frame calculations, and a large amount of repetition and resource consumption caused by using a cascading strategy at a detection head are avoided.
In the two cascaded regression modules, each branch adopts a different architecture to predict the position of the target; the features of the target region are then extracted according to the positions predicted by the branches of the two architectures, and classification and regression are carried out in the next stage after convolutional fusion. During training, the same target is regressed using the fully-connected head and the fully-convolutional head, respectively. The fully-connected regressor has stronger spatial sensitivity, while the fully-convolutional regressor can capture more detailed target-level information, so the advantages of the two architectures can be fully exploited and they complement each other.
According to the method, a gradient back propagation algorithm is adopted, and the updating gradients of all learnable parameters in the network are calculated by means of a loss function calculated by the neural network in the training process of the neural network through a chain type derivation method, so that the updating of the parameters of the neural network is completed, and the end-to-end training process is realized.
According to the principles of digital image processing, the method and device can apply various data enhancements to the training pictures, including picture flipping, color space conversion and picture scaling, thereby improving the utilization of the training pictures, increasing sample diversity, reducing the data annotation requirement to a certain extent, and improving the robustness and generalization ability of the model.
In the method and device, non-maximum suppression is used as a post-processing means for the detection results of the picture to be detected, effectively reducing redundant detection results in the picture.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating candidate region feature fusion according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the non-maximum suppression processing flow for detection results according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a training and detection process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions will be clearly and completely described below with reference to the embodiments, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms first, second, third and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail below with specific examples. Embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
Example (b):
a method of cascading regression target detection, operating on a device having a processor and a computer readable storage medium, comprising:
s1, acquiring a training set of the detection picture;
acquiring a training set of detection pictures, wherein the training set comprises M training pictures X = {X_1, X_2, …, X_m, …, X_M}, wherein X_m represents the m-th training picture;
the M training pictures have M one-to-one corresponding labels Y = {Y_1, Y_2, …, Y_m, …, Y_M}, wherein Y_m represents the label of the m-th training picture;
the M labels comprise the category and coordinate information of the N target objects in the corresponding pictures, Y_m = {P_{m,1}, B_{m,1}, P_{m,2}, B_{m,2}, …, P_{m,n}, B_{m,n}, …, P_{m,N}, B_{m,N}}, wherein P_{m,n} represents the category of the n-th target object in the m-th picture, P_{m,n} ∈ {C_0, C_1, C_2, …, C_j, …, C_J}, C_j denotes the j-th class, C_0 represents the background class, J is the total number of classes, B_{m,n} represents the coordinates of the n-th object in the m-th picture, and B_{m,n} = {w_{m,n}, h_{m,n}, cx_{m,n}, cy_{m,n}, θ_{m,n}}, which respectively denote the width w_{m,n}, height h_{m,n}, center-point abscissa cx_{m,n}, center-point ordinate cy_{m,n} and rotation angle θ_{m,n} of the rectangular box of the marked object.
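For illustration only, the label Y_m of one training picture could be laid out as a Python structure like the one below; every name and number here is hypothetical.

label_m = [
    # each object n carries a class P_{m,n} and a rotated box B_{m,n} = (w, h, cx, cy, theta)
    {"cls": 3, "box": {"w": 42.0, "h": 18.5, "cx": 160.0, "cy": 96.0, "theta": 0.35}},
    {"cls": 1, "box": {"w": 80.0, "h": 55.0, "cx": 220.5, "cy": 140.0, "theta": -0.10}},
]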
S2, carrying out picture standardization on the training pictures and the test pictures of the training set;
in some specific embodiments, the step of performing picture normalization on the training pictures and the test pictures of the training set includes:
performing pixel-level standardization on the pictures in the training set X according to a preset pixel mean Pixel_mean and pixel standard deviation Pixel_std;
uniformly scaling the pictures in the training set X to a size of 320 × 320; it should be noted that after a picture is scaled, the labeled positions of the objects in the picture also need to be adjusted accordingly, otherwise mismatching occurs. The pictures may alternatively be scaled to 512 × 512 or 640 × 640; higher-resolution pictures improve detection accuracy but reduce detection speed. Keeping the picture sizes consistent makes all pictures the same size and satisfies the input conditions of the network.
The standardization formula for any picture pixel is:
Pixel_x = (Pixel_x - Pixel_mean) / Pixel_std;
wherein Pixel_mean is the pixel mean and Pixel_std is the pixel standard deviation.
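A minimal sketch of this preprocessing is shown below; the per-channel mean/std layout and the bilinear interpolation mode are assumptions.

import torch
import torch.nn.functional as F

def preprocess(img, pixel_mean, pixel_std, size=320):
    """Pixel-level standardization and scaling to a fixed size (illustrative).
    img: float tensor [3, H, W]; pixel_mean and pixel_std: per-channel tensors [3]."""
    img = (img - pixel_mean[:, None, None]) / pixel_std[:, None, None]
    h, w = img.shape[1:]
    img = F.interpolate(img.unsqueeze(0), size=(size, size),
                        mode="bilinear", align_corners=False).squeeze(0)
    # the labeled box coordinates must be rescaled by the same factors
    return img, (size / w, size / h)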
S3, designing a deep convolutional neural network structure, and training a network model by utilizing a step-by-step fine tuning mode;
in some specific embodiments, the deep convolutional neural network structure comprises: the system comprises a backbone network for extracting picture features, a cascaded regional suggestion module, feature fusion, a cascaded regional suggestion module loss function, two cascaded regression modules, a distance loss function and two cascaded regression module loss functions.
Obtaining a backbone for extracting basic features;
Specifically, a ResNet series network or a VGG series network is used as the backbone network to extract basic features, wherein the ResNet series includes ResNet50, ResNet101 and ResNet152, and the VGG series includes VGG-16 and VGG-19. After the backbone network is selected, an additional convolution network needs to be added on top of it to obtain feature maps with lower resolution; these feature maps have a smaller spatial resolution but a higher degree of feature abstraction and a larger receptive field, and can detect large objects in the picture. Features of 5 scales are therefore extracted from the backbone network and the additional convolution network for network prediction.
Initializing the backbone network and the additionally added network parameters:
M_weight = MP_weight;
MA_weight = Gaussian(0, 1);
wherein M_weight and MA_weight are respectively the parameters of the backbone network and of the additional convolution network; MP represents the pre-training result of the backbone network M on the data set, and MP_weight represents the parameters of the pre-trained network; Gaussian(0, 1) denotes that the weight parameters of the additional convolution network MA satisfy a Gaussian distribution with a mean of 0 and a variance of 1.
In some specific embodiments, the backbone network adopts ResNet50, a pre-training model of the ResNet50 is derived from a classification model on an ImageNet data set, and the learning rate of residual structures of the first two layers of ResNet50 is set to 0, so that the residual structures do not participate in training, and the overfitting risk in the network training process can be reduced.
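The initialization described above might be sketched as follows; which residual stages are frozen and the layer sizes of the additional convolution network MA are illustrative assumptions, and a recent torchvision is assumed for the pretrained-weights argument.

import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # MP_weight: ImageNet pre-training
for p in list(backbone.layer1.parameters()) + list(backbone.layer2.parameters()):
    p.requires_grad = False                                        # learning rate 0 for the first residual stages

extra = nn.Sequential(                                             # additional convolution network MA
    nn.Conv2d(2048, 512, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True))
for m in extra.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=1.0)               # Gaussian(0, 1) initialization
        nn.init.zeros_(m.bias)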
First, a cascaded region suggestion network is built on the basis of the backbone network and the additional convolution network, and binary classification and position/size adjustment are performed on the target regions;
taking the VGG-16 as the backbone network as an example, three layers of networks in the VGG network and an additional layer of network are taken as global feature prediction layers. The four global feature prediction layers are gradually reduced in resolution and gradually increased in receptive field, so that the four global feature prediction layers can be respectively responsible for detecting objects with different scales. Centering around each pixel location on the global feature map, each location is called an anchor point, resulting in K preset boxes with fixed dimensions and aspect ratios, each responsible for matching with a potential target. As shown in fig. 1, the cascaded regional suggestion network consists of two sets of two classes and regression modules. The first secondary classification and regression module uses the original global feature map of the backbone network for prediction, simply classifies each preset frame and outputs the offset relative to the original position of the preset frame. After calculating the overlap ratio of the preset box and the true value box, some simple negative samples are discarded by using a predefined threshold value, and then a cross-entropy function and a Smooth L are used1Calculating deviation of function, classifying the first and second and regressingAnd (6) optimizing. And the preset frame after the first second classification and regression module finely adjusts the position and the size is used as the input of the second classification and regression module. Different from the method that each stage in Cascade R-CNN uses the features output by the backbone network as input, in order to obtain more discriminative information, the second classification and regression module uses the feature fusion block to fuse the original features, so that the requirements of the second classification and regression module are met. The features used by the two second classes and the regression module have the same dimensions. And the second secondary classification and regression module predicts 4 offsets and 2 confidence scores for each fine-tuned preset frame by using the fused features, discards simple samples according to the confidence scores, and filters the preset frames with higher overlapping rate by using a non-maximum suppression algorithm. Through the two-step cascade regression operation, the regional suggestion network provides initialization information with higher accuracy for the subsequent detection head.
The cascaded region suggestion network serves a two-stage target detection algorithm; since it provides more accurate feature information for the regressor, the performance of the target detector can be improved. Unlike Cascade R-CNN, in which the regressors of all stages must separately compute the features of the candidate regions at a very large computational cost in order to obtain more accurate initialization information, all candidate regions in the cascaded region suggestion network share the predicted feature map, so only little computational resource is required.
Adding feature fusion into the cascade region suggestion module to enhance the perception capability of the object;
the junction of the first two classifications and the regression and the second two classifications and the regression uses feature fusion. The feature fusion is composed of a convolutional layer, an active layer and a deconvolution layer, wherein the deconvolution is mainly responsible for enlarging the resolution of the deep feature map.
The feature fusion module is mainly used for transmitting rich semantic information of the deep network to the shallow network, and the detection precision of the algorithm is improved. In addition, if the two classes and regression modules share the prediction feature map, the results of the two cascaded regressions will not bring a good effect. In order to match the dimensions of the inputs of the different network layers, the resolution of the feature map is improved using deconvolution, followed by feature fusion using pixel-by-pixel addition, and finally adding a convolutional layer to produce features with discriminative power. Feature fusion is also used at the second classification and junction of the regression module and the detection head to further enhance the information of the feature map.
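A minimal sketch of such a fusion block (convolution, activation, deconvolution, pixel-by-pixel addition, output convolution) is shown below; the channel count and kernel sizes are assumptions.

import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses a deep (low-resolution) feature map into a shallow one (illustrative sizes)."""
    def __init__(self, channels=256):
        super().__init__()
        self.lateral = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.ReLU(inplace=True))
        # the deconvolution doubles the resolution of the deep feature map
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, shallow, deep):
        fused = self.lateral(shallow) + self.up(deep)   # pixel-by-pixel addition
        return self.out(fused)                          # convolution producing discriminative features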
In the cascaded regional suggestion network, performing foreground and background classification on candidate regions, simultaneously fine-tuning the size and the position of a preset frame, and optimizing the network through a loss function;
the loss function of the cascaded regional recommendation network consists of the loss functions of the two-class and regression modules. In the cascade region suggestion network, a label of two categories (namely whether the label is an object or not) is allocated to each preset frame, and the preset frames are finely adjusted.
Figure BDA0003267980480000081
Wherein i represents the index number of the preset frame in a batch, and liIs the class number g of the real object corresponding to the preset frame iiIs the position and size of the real object corresponding to the preset frame i. p is a radical ofiIs the confidence that the preset box contains an object, xiIs the position of fine tuning to the preset box i. c. CiIs a multi-class label prediction for a preset box i, tiIs the predicted offset. L isbIs based on cross entropy loss for the two classes.
In the two cascaded regression modules, a preset frame provided by a cascaded region suggestion module is used as initial input information, and a target is precisely and accurately positioned in a characteristic fusion mode;
the cascaded two-path regressor is mainly responsible for accurately positioning the target. In the algorithm, the size and the position of a candidate region are preliminarily adjusted by a cascaded region suggestion network, and then the adjusted candidate region is used as the input of two paths of regressors in the next stage to accurately position a target. Specifically, the cascaded two-way regressor is composed of three stages with the same function, and each stage has two regression branches with different architectures (based on the regression branches of the fully-connected architecture and the regression branches of the fully-connected architecture)Based on the regression branch of the full convolution mechanism), the two regression branches respectively position the same object. As shown in fig. 1, in the first stage, according to the candidate region position information provided by the cascaded regional suggestion network, the target feature is extracted and cut to a specified size, and the extracted target feature is used as the input of two regression branches. In the second stage, according to the position information predicted by the two regression branches in the previous stage on the same object, extracting different features of the same object, scaling the two groups of features into the same size, adding the two groups of features pixel by pixel, and then fusing the features of the two groups of features by using a convolution layer. Because the network of the full convolution architecture can capture the spatial information of the target, and the network of the full connection architecture can capture the information of the target level, the prediction of the two regression branches on the same object has deviation, the target characteristics are extracted according to the position information predicted by the two architectures, and then the two sets of characteristics are fused, so that the limitation of positioning of a certain network can be made up, and the network performance is improved. The fused features are used as the input of the third stage network, and the rest operation is basically the same as that of the second stage. For the predicted values of the two-way regressor at each stage, Smooth L is used1And optimizing the functions to enable the predicted values of the two functions to be as close as possible to the same target. At each stage, optimization was performed using a cross-entropy function for classification and Smooth L for regression1The function is optimized.
The distance loss function is mainly used to constrain the fine adjustment made by the two regression branches to the size and position of the same target.
In the two cascaded regression modules, the regression branches of the two architectures each use a Smooth L1 function to optimize the predicted offsets. Because the two regression branches predict the same candidate region, their ability to work cooperatively can be strengthened. Therefore, a Smooth L1 function is also used to optimize the offsets predicted by the two regression branches against each other; since the two branches regress the same target, this function can make the offsets predicted by the two branches as close as possible, thereby improving the accuracy of the detector. The introduced optimization function is named the distance loss, which is expressed as follows:
L_dist = Σ_i SmoothL1( f_b(x_t)_i - f_d(x_t)_i );
In the above formula, the detailed expression of the Smooth L1 function is:
SmoothL1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
wherein x_t is the coordinate-fused feature information predicted according to the two regression branches in the previous stage, f_b represents the regression branch of the fully-connected architecture, f_d represents the regression branch of the fully-convolutional architecture, and i denotes the center coordinates and the width and height of the i-th candidate region.
The Smooth L1 function is usually used to optimize the gap between the offsets predicted by a regressor and the labels. The distance loss function, by contrast, does not calculate the difference between the regressor's predictions and the real annotations, but the difference between the offsets predicted by the two regression branches, so that the IoU (Intersection over Union) between the regression results of the two branches on the same candidate region is as close to 1 as possible. It should be noted that the distance loss function does not directly introduce annotation data; since each regression branch is already constrained separately by the annotation data, the distance loss does not cause the detector to deviate from the true result and is therefore theoretically feasible.
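Assuming both branches output per-region offset vectors, the distance loss can be sketched with PyTorch's built-in Smooth L1 as follows; the mean reduction over candidate regions is an assumed choice.

import torch.nn.functional as F

def distance_loss(offsets_fc, offsets_conv):
    """Smooth-L1 distance loss between the offsets predicted by the two regression
    branches (f_b and f_d) for the same candidate regions; illustrative only."""
    # offsets_fc, offsets_conv: [N, 4] predictions made on the fused features x_t
    return F.smooth_l1_loss(offsets_fc, offsets_conv, beta=1.0, reduction="mean")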
A first-stage classification network CLS1 and localization network LOC1 are constructed on the basis of the basic feature extraction network and the additional convolution network, and CLS1 and LOC1 each consist of F convolution layers. The classification network CLS1 and the localization network LOC1 are respectively denoted as CLS1 = {CLS11, CLS12, …, CLS1f, …, CLS1F} and LOC1 = {LOC11, LOC12, …, LOC1f, …, LOC1F}, where F is the number of feature maps jointly generated by the feature extraction network M and the additional convolution network MA, and CLS1f and LOC1f respectively represent the classification and localization networks on the f-th feature map, expressed as follows:
CLS1f = Conv(channel1f, 2, strideh1, stridew1);
LOC1f = Conv(channel1f, 5, strideh1, stridew1);
wherein Conv represents a single convolution layer, and the number of input channels channel1f represents the number of channels of the f-th feature map obtained by the basic feature extraction network and the additional convolution network; 2 is the number of convolution output channels of CLS1f, indicating that only binary foreground/background classification is performed at this point; 5 is the number of convolution output channels of LOC1f, indicating that 5 coordinates are regressed at this point, corresponding to the object coordinates B_{m,n} described above; strideh1 and stridew1 are the height and width of the convolution kernel.
In some embodiments, the number of feature maps generated by ResNet50 and the additional convolution network is 4, i.e. F = 4, and the numbers of channels of the corresponding feature maps are {512, 1024, 2048, 512}; strideh1 and stridew1 are both 3.
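The first-stage heads can be sketched as one 3 × 3 classification convolution and one 3 × 3 localization convolution per feature map; folding K preset boxes per anchor point into the output channels, as done below, is an assumption on top of the 2 and 5 channels described above.

import torch.nn as nn

channels = [512, 1024, 2048, 512]   # channel counts of the F = 4 feature maps (ResNet50 example)
K = 3                               # preset boxes per anchor point (illustrative)
cls1 = nn.ModuleList([nn.Conv2d(c, K * 2, kernel_size=3, padding=1) for c in channels])  # foreground/background
loc1 = nn.ModuleList([nn.Conv2d(c, K * 5, kernel_size=3, padding=1) for c in channels])  # (w, h, cx, cy, theta)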
A feature pyramid network is constructed on the basis of the basic feature extraction network M and the additional convolution network MA to generate the F first-stage feature maps FEA1 and further generate the higher-resolution second-stage feature maps FEA2.
Specifically, the F first-stage feature maps FEA1 are denoted as FEA1 = {FEA11, FEA12, …, FEA1f, …, FEA1F}, and the widths and heights of the first-stage feature maps are respectively denoted as W1 = {W11, W12, …, W1f, …, W1F} and H1 = {H11, H12, …, H1f, …, H1F}, where W1f and H1f respectively denote the width and height of the f-th first-stage feature map.
When 1 ≤ i ≤ F - 1, W1i = 2 × W1(i+1) and H1i = 2 × H1(i+1) are satisfied. On the basis of FEA1, the feature pyramid can transmit the semantic information of the high-level feature maps to the bottom layer, thereby combining the advantages of both to obtain feature maps with high resolution and rich semantic information. The feature maps generated by the feature pyramid are denoted as FEA2 and are called the second-stage feature maps, FEA2 = {FEA21, FEA22, …, FEA2f, …, FEA2F}, wherein FEA2f represents the f-th second-stage feature map. The number of feature maps in FEA2 is the same as in FEA1, and the width and height of FEA2f remain the same as those of FEA1f.
The feature pyramid network includes a feature map conversion network TS and a feature map scaling network INP, where the feature map conversion network may be expressed as TS = {TS1, TS2, …, TSf, …, TSF}; TS likewise consists of F parts, where TSf represents the f-th feature map conversion network. INP = {INP1, INP2, …, INPf, …, INP(F-1)}; INP consists of F - 1 parts, where INPf represents the feature map scaling network between the f-th and (f+1)-th feature maps, and the width and height of a feature map passing through the scaling network become 2 times the original.
In the construction process of the feature map pyramid, firstly, the feature map of the highest layer is processed independently;
and then sequentially processing according to the sequence of the spatial resolution of the feature map from low to high:
FEA2F = TSF(FEA1F);
t = TSi(FEA1i);
FEA2i = t + INPi(FEA2(i+1));
wherein t is an intermediate feature map in the feature pyramid construction process, the values of i are taken in the order {F - 1, F - 2, …, 1}, and the feature pyramid network comprises the feature map conversion network TS and the feature map scaling network INP.
Specifically, the intermediate feature maps in the feature pyramid construction process do not undergo the final detection step. The formula for FEA2F only needs to be executed once, while the formulas for t and FEA2i are executed F - 1 times in total.
In some specific embodiments, a Res2Net structure is used as the feature map conversion network; Res2Net performs conversion and connection in residual form between different channels of the feature map, thereby enhancing the feature extraction capability. The feature map scaling network is implemented with the feature map interpolation function in the PyTorch function library.
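A sketch of the top-down construction of FEA2 is given below; a plain convolution block stands in for the Res2Net-style conversion network TS, and PyTorch interpolation realizes the scaling network INP.

import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Builds the second-stage maps FEA2 from the first-stage maps FEA1 (illustrative)."""
    def __init__(self, in_channels=(512, 1024, 2048, 512), out_channels=256):
        super().__init__()
        self.ts = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_channels, 3, padding=1), nn.ReLU(inplace=True))
            for c in in_channels])                                   # stand-in for TS_f

    def forward(self, fea1):
        # fea1: list of F first-stage maps, ordered from high to low resolution
        n = len(fea1)
        fea2 = [None] * n
        fea2[n - 1] = self.ts[n - 1](fea1[n - 1])                    # FEA2_F = TS_F(FEA1_F)
        for i in range(n - 2, -1, -1):                               # i = F-1, ..., 1 of the text (0-based here)
            t = self.ts[i](fea1[i])                                  # t = TS_i(FEA1_i)
            up = F.interpolate(fea2[i + 1], size=t.shape[-2:], mode="nearest")  # INP_i
            fea2[i] = t + up                                         # FEA2_i = t + INP_i(FEA2_{i+1})
        return fea2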
As shown in fig. 2, the candidate region is processed by the two regression branches to obtain two regression results; the N × N candidate-region features are respectively extracted from the two regression results for classification and regression, and the features are added and averaged to obtain the fused region features. The scaled feature map is merged with the previous feature map through a channel splicing operation, and the merged feature map only needs to be sent into the feature map conversion network to generate a new feature map. The feature map conversion network comprises 5 identical structures and the feature map scaling network comprises 4 identical structures, each with its own independent trainable parameters.
The first-stage classification network CLS1 and the convolution region CR1 of the convolution network are re-registered using the coordinate detection result LR1 of the first-stage localization network LOC1;
wherein the coordinate detection result can be expressed as LR1 = {w1, h1, cx1, cy1, θ1}, indicating the width, height, center-point coordinates and rotation angle detected on the basis of a preset anchor box.
Taking the result of a 3 × 3 convolution operation on the origin in two-dimensional space as an example:
SP1 = {(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)};
CR2 = Rotate(Scale(Shift(CR1, LR1)));
SP2 = {p1, p2, p3, p4, p5, p6, p7, p8, p9};
wherein CR1 is a 3 × 3 rectangular region and SP1 represents the coordinate set of the sampling points in the convolution region CR1, 9 positions in total; Rotate, Scale and Shift indicate that translation, scaling and rotation operations are performed in order on the convolution region CR1 according to the detection result LR1, and CR2 is the resulting new convolution region; SP2 is the set of sampling points of the new convolution region CR2, and {p1, p2, p3, p4, p5, p6, p7, p8, p9} are the coordinates of the corresponding 9 sampling points.
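The geometric re-registration of the sampling points can be sketched as follows; the composition order of the scale, rotate and shift operations and the division of w and h by 3 are illustrative assumptions.

import torch

def transform_sampling_points(lr):
    """Shift, scale and rotate the 3 x 3 convolution sampling grid according to the
    first-stage detection LR1 = (w, h, cx, cy, theta); purely illustrative geometry."""
    w, h, cx, cy, theta = (torch.as_tensor(v, dtype=torch.float32) for v in lr)
    ys, xs = torch.meshgrid(torch.tensor([-1.0, 0.0, 1.0]),
                            torch.tensor([-1.0, 0.0, 1.0]), indexing="ij")
    sp1 = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)      # SP1: regular 3 x 3 grid, 9 points
    scaled = sp1 * torch.stack([w / 3.0, h / 3.0])                  # Scale to the detected box size
    rot = torch.stack([torch.stack([torch.cos(theta), -torch.sin(theta)]),
                       torch.stack([torch.sin(theta), torch.cos(theta)])])
    sp2 = scaled @ rot.T + torch.stack([cx, cy])                    # Rotate, then Shift to (cx, cy)
    return sp2                                                      # the 9 sampling points p1..p9 of CR2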
On the basis of the second-stage feature maps FEA2 and the reassigned convolution region CR2, the second-stage classification and localization are performed on the basis of the first stage to obtain the second-stage classification network CLS2 and localization network LOC2,
respectively expressed as:
CLS2 = {CLS21, CLS22, …, CLS2f, …, CLS2F},
LOC2 = {LOC21, LOC22, …, LOC2f, …, LOC2F},
where CLS2f and LOC2f respectively represent the classification and localization networks on the f-th feature map, expressed as follows:
CLS2f = Conv(channel2f, J, strideh2, stridew2);
LOC2f = Conv(channel2f, 5, strideh2, stridew2);
wherein channel2f represents the number of channels of the f-th second-stage feature map FEA2f and Conv denotes a convolution layer; J is the number of convolution output channels of CLS2f and is also the total number of object classes in the training and test pictures: compared with CLS1f, no binary classification is performed at this point, and the specific class of the object is judged instead; the number of convolution output channels of LOC2f is 5 and is used for detecting the coordinates of the object; the difference from LOC1f is that the position detection is no longer based on the preset anchor box, but further finely detects the object position on the basis of the first-stage position detection result LR1.
In some specific embodiments, the number of second-stage feature maps is 4, the corresponding numbers of channels are {256, 256, 256, 256}, and strideh2 and stridew2 are both 3.
Defining a loss function in the detection processes of the first stage and the second stage;
specifically, the loss function includes a second classification and regression loss detected in the first stage and a multi-classification and regression loss in the second stage; and training the network by using the training set and obtaining a final network model. The loss function is a numerical value obtained by calculating the final classification and position detection result of the network and the real classification and position in the image marking information, the larger the numerical value is, the worse the network performance is, otherwise, the better the network performance is, and the purpose of training is to reduce the loss value.
The loss functions are:
Loss1 = (1/N1) Σ_i [ L_b(p_i, p_i*) + L_r(x_i, g_i*) ];
Loss2 = (1/N2) Σ_i [ L_m(c_i, p_i*) + L_r(t_i, g_i*) ];
wherein i represents the subscript of a preset anchor box, p_i and x_i respectively represent the binary classification prediction probability and the coordinate detection result of the first stage, p_i* and g_i* respectively represent the true class and the offset vector of the preset anchor box with index i, c_i and t_i are the multi-class prediction probability and coordinate detection result of the second stage, and N1 and N2 respectively represent the numbers of positive samples in the first-stage and second-stage detection processes. L_b is the binary cross entropy loss for judging whether an object is foreground or background, L_m is the multi-class cross-entropy loss for judging the object class, and L_r is the Smooth-L1 loss function.
In some specific embodiments, for all the preset anchor frames, whether the preset anchor frames belong to a positive sample or a negative sample is obtained through calculation with the marked positions in the input picture; all anchor boxes participate in the computation of the classification penalty, but only anchor boxes belonging to positive examples participate in the computation of the location penalty, since the location information is not important for anchor boxes belonging to negative examples, i.e. the background category.
The total loss ultimately used to optimize the objective function is defined as the weighted sum of the losses of the two stages:
Loss = λ1 × Loss1 + λ2 × Loss2;
wherein λ1 and λ2 are weighting coefficients. In particular, λ1 and λ2 are both 1.
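An illustrative composition of Loss1, Loss2 and the weighted total loss is sketched below; the positive-sample selection and the normalization are simplified assumptions.

import torch.nn.functional as F

def detection_loss(p, x, c, t, cls_star, g_star, pos, n1, n2, lam1=1.0, lam2=1.0):
    """p: [A, 2] first-stage objectness logits; x: [A, 5] first-stage box predictions;
    c: [A, J+1] second-stage class logits; t: [A, 5] second-stage box predictions;
    cls_star: [A] true classes (0 = background); g_star: [A, 5] true offset vectors;
    pos: [A] boolean mask of positive anchors; n1, n2: positive counts per stage."""
    loss1 = (F.cross_entropy(p, (cls_star > 0).long(), reduction="sum")
             + F.smooth_l1_loss(x[pos], g_star[pos], reduction="sum")) / max(n1, 1)
    loss2 = (F.cross_entropy(c, cls_star, reduction="sum")
             + F.smooth_l1_loss(t[pos], g_star[pos], reduction="sum")) / max(n2, 1)
    return lam1 * loss1 + lam2 * loss2   # lambda1 = lambda2 = 1 in the described embodiment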
And training the training set to obtain a final network model.
And S4, testing the test pictures with the network model, calculating the area intersection ratios and performing non-maximum suppression to obtain the final detection results.
According to the network model obtained by training, a sample of Q test pictures T = {T_1, T_2, …, T_q, …, T_Q} is used for testing. During testing, the pictures are only sent into the network for forward propagation to obtain the category score and regression coordinates of each anchor point position in the picture; regions judged as background and regions whose scores are smaller than the set score threshold t_score are discarded.
The detection results R = {R_1, R_2, …, R_q, …, R_Q} are kept by category, where R_q represents the detection result of the q-th test picture and R_q = {R_c1, R_c2, …, R_cj, …, R_cJ}, wherein R_cj represents all detection results of the current test picture for the j-th class. In some specific embodiments, the score threshold t_score is 0.5, and all low-confidence results with prediction scores below 0.5 are discarded.
The picture testing steps are as follows:
standardizing a test picture at a pixel level;
zooming the test picture to be the same as the picture for training;
changing the network model into a test mode, not performing loss calculation and gradient backward propagation on the detection result, and only performing a forward propagation process;
obtaining an initial detection result R of the current q test pictureq
In some specific embodiments, the initial detection result is the multi-classification and location detection result of the second stage, and the detection result of the first stage is only used in the forward propagation process of the network and is not used as the final detection result.
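A test-mode forward pass consistent with the steps above might look as follows; the model's output format and the use of index 0 for the background class are assumptions.

import torch

@torch.no_grad()
def detect(model, image, t_score=0.5):
    """Forward propagation only: no loss calculation or gradient back-propagation."""
    model.eval()
    scores, boxes = model(image.unsqueeze(0))          # assumed model outputs
    probs, classes = scores.softmax(dim=-1).max(dim=-1)
    keep = (classes != 0) & (probs >= t_score)         # drop background and low-score regions
    return classes[keep], probs[keep], boxes[keep]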
And finally, calculating the area intersection ratio between the rotated rectangular frames according to the initial detection result R, performing non-maximum value inhibition, and only keeping the detection frames with larger scores and small mutual overlapping areas as final detection results.
As shown in fig. 3, the step of suppressing the non-maximum value includes:
for beginningInitial detection result RqAnd respectively performing descending order sorting on the prediction scores of all the detection frames under the same category, wherein the sorted result is R'q={R'c1,R'c2,…,R'cf,…,R'cFWherein R'cfKeeping the result with the largest current score as the current processing object each time for the sorted detection result on the jth class;
to R'cfAnd (c) calculating an area intersection ratio between any one of the detection frames b and all the detection frames with the prediction scores smaller than the current scores, wherein the area intersection ratio calculation formula is as follows:
T=areab+areabs
I=interw×interh
U=T-I;
IOU=I/U;
wherein, areabIndicates the area of the detection frame b, areabsIndicates the area of any of the detection boxes bs with a score smaller than b, interwAnd interhRespectively representing the width and the height of the intersection area of the two detection frames;
if the area intersection ratio of the two detection frames is smaller than the threshold, discarding the corresponding detection result, and if the area intersection ratio exceeds the threshold tiouIf yes, the detection box bs with lower score is discarded;
selecting the detection result with the largest score except the current detection object as the current processing object from the current rest detection results until the current processing object is the last detection result; and ending the flow and outputting all the reserved detection results.
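The greedy suppression loop can be sketched as follows; axis-aligned boxes are used for simplicity, whereas the rotated rectangles described above would require a polygon intersection routine to compute the area I.

import torch

def nms(boxes, scores, t_iou=0.5):
    """Greedy non-maximum suppression following the area-IoU definition above;
    boxes: [N, 4] as (x1, y1, x2, y2), scores: [N]."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        b = order[0]
        keep.append(b.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        x1 = torch.maximum(boxes[b, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[b, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[b, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[b, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)                 # I = inter_w * inter_h
        area_b = (boxes[b, 2] - boxes[b, 0]) * (boxes[b, 3] - boxes[b, 1])
        area_bs = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_b + area_bs - inter)                                 # IOU = I / U
        order = rest[iou <= t_iou]      # keep boxes below the threshold, discard the rest
    return keep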
Based on an understanding of the foregoing solution, the protection scope of the present application should not be limited to the information expressed by the literal wording, but should also include the supporting relationships and execution logic among the method steps; this embodiment may therefore also be presented in other ways. For example, with reference to fig. 4, this embodiment may also be expressed as the following steps: inputting a picture, performing enhancement and scaling operations on the picture, and synchronously adjusting the position information of the object labels in the picture; acquiring the first-stage feature maps through the feature extraction network and the additional convolution network, and performing the first-stage binary classification and position detection; acquiring the second-stage feature maps through the feature pyramid network, and performing the second-stage classification and position detection according to the detection result of the first stage; on the basis of the detection result of the second stage, performing the third-stage multi-class classification and position detection as the final result; during training, performing loss calculation on the detection results of the first, second and third stages, and updating the network parameters through the gradient back-propagation algorithm; when not training, taking the classification and position detection results as the initial detection results and post-processing them through the non-maximum suppression method.
Through the description of the above embodiments, those skilled in the art will understand that, for convenience and simplicity of description, only the division of the above functional modules is used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of a specific device is divided into different functional modules to complete all or part of the above described functions.
In the embodiments provided in this application, it should be understood that the disclosed structures and methods may be implemented in other ways. For example, the above-described embodiments with respect to structures are merely illustrative, and for example, a module or a unit may be divided into only one logic function, and may have another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another structure, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, structures or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A cascade regression target detection method is characterized by comprising the following steps:
acquiring a picture to be detected, standardizing its pixels, and scaling it to a uniform size;
inputting the adjusted picture to be detected into a trained deep convolutional neural network model, wherein the deep convolutional neural network model is obtained by training on training pictures with label information and comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, a cascaded two-path regressor module for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing these modules;
and detecting the picture to be detected by using the trained deep convolutional neural network model to obtain a detection result.
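As a brief illustration of the first step of claim 1 (pixel standardization and scaling to a uniform size), one common implementation is sketched below; the normalization statistics and the 512 x 512 target size are assumptions for illustration, not values specified in this application.

import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, size: int = 512) -> torch.Tensor:
    # `image` is an H x W x 3 uint8 tensor; mean/std and target size are
    # illustrative assumptions
    x = image.permute(2, 0, 1).float() / 255.0                # HWC -> CHW, scale to [0, 1]
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    x = (x - mean) / std                                      # pixel standardization
    x = F.interpolate(x.unsqueeze(0), size=(size, size),      # scale to a uniform size
                      mode="bilinear", align_corners=False)
    return x

For training pictures, the label positions would be rescaled by the same factor, as noted in the description.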
2. The cascade regression target detection method according to claim 1, wherein the backbone network performs feature extraction on the picture through a plurality of feature prediction layers with different resolutions; the backbone network comprises one of a ResNet series network or a VGG series network and an additional convolution network added on the basis of the ResNet series network or the VGG series network.
3. The cascade regression target detection method according to claim 2, wherein the cascaded region suggestion module performs classification and regression according to the extracted features and comprises a cascaded region suggestion network, a feature fusion module and a classification regression module; the cascaded region suggestion network finely adjusts the size and position of the preset frame in two steps to generate region suggestions; the feature fusion module fuses features of different scales together; and the classification regression module extracts N×N features of the candidate region for classification and regression according to the region suggestions provided by the cascaded region suggestion network.
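To illustrate the step in claim 3 of extracting N x N candidate-region features for classification and regression, the sketch below uses torchvision's roi_align; the 7 x 7 output size, the channel count, the assumed stride and the example boxes are illustrative assumptions only.

import torch
from torchvision.ops import roi_align

# feature map from the backbone: batch of 1, 256 channels, 64 x 64 cells,
# assumed stride of 8 with respect to the input picture
features = torch.randn(1, 256, 64, 64)

# region suggestions from the cascaded region suggestion network, given as
# (batch_index, x1, y1, x2, y2) in input-picture coordinates
proposals = torch.tensor([[0.0, 32.0, 48.0, 160.0, 200.0],
                          [0.0, 100.0, 80.0, 220.0, 240.0]])

# crop an N x N (here 7 x 7) feature patch per candidate region; these patches
# would feed the classification and regression heads
candidate_feats = roi_align(features, proposals, output_size=(7, 7),
                            spatial_scale=1.0 / 8)
print(candidate_feats.shape)  # torch.Size([2, 256, 7, 7])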
4. The method according to claim 3, wherein the cascaded two-path regressor module uses a neural network with a fully-connected architecture and a neural network with a fully-convolutional architecture as its two regression branches to regress the same candidate region and adjust its size and position, and the method comprises:
the first stage: taking the position of the preset frame as the candidate region, extracting target features and cropping them to a specified size, inputting them into the two regression branches respectively for prediction, adjusting the size and position of the candidate region according to the prediction results, extracting two groups of features, and fusing the two groups of features with a convolutional neural network;
the second stage: inputting the features fused in the first stage into the two regression branches respectively for prediction, adjusting the size and position of the candidate region according to the prediction results, extracting two groups of features, and fusing the two groups of features with a convolutional neural network;
the third stage: inputting the features fused in the second stage into the two regression branches respectively for prediction, adjusting the size and position of the candidate region according to the prediction results, extracting two groups of features, and fusing the two groups of features with a convolutional neural network to obtain the finally predicted size and position of the candidate region.
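The three-stage refinement of claim 4 can be sketched as follows; the branch definitions, feature sizes and fusion layer are illustrative assumptions, and the re-cropping of features after each adjustment is elided for brevity.

import torch
import torch.nn as nn

class TwoPathRegressorStageSketch(nn.Module):
    # one stage of the cascaded two-path regressor: a fully-connected branch and a
    # fully-convolutional branch each predict box offsets for the same candidate
    # region, and a small convolution fuses the two groups of features
    def __init__(self, channels=256, roi=7):
        super().__init__()
        self.fc_branch = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * roi * roi, 1024), nn.ReLU(),
            nn.Linear(1024, 4))                                   # box offsets
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 4, roi))                          # box offsets
        self.fuse = nn.Conv2d(2 * channels, channels, 1)          # fuse two feature groups

    def forward(self, roi_feat):
        offsets_fc = self.fc_branch(roi_feat)                     # (N, 4)
        offsets_conv = self.conv_branch(roi_feat).flatten(1)      # (N, 4)
        # in the method, two groups of features would be re-extracted from the
        # adjusted candidate regions before fusion; here the input features are
        # reused purely for illustration
        fused = self.fuse(torch.cat([roi_feat, roi_feat], dim=1))
        return offsets_fc, offsets_conv, fused

# three cascaded stages: each stage regresses on the features fused by the
# previous stage
stages = nn.ModuleList(TwoPathRegressorStageSketch() for _ in range(3))
feat = torch.randn(2, 256, 7, 7)
for stage in stages:
    offsets_fc, offsets_conv, feat = stage(feat)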
5. The cascade regression target detection method according to claim 4, wherein, at each regression stage, a cross-entropy function is used to optimize classification and a Smooth L1 function is used to optimize regression, and the distance loss function is used to minimize the difference between the outputs of the two regression branches.
6. The cascade regression target detection method according to claim 5, wherein the distance loss function is defined as:

L_{dist} = \sum_{i} \mathrm{Smooth}_{L_1}\big( f_b(x_t)_i - f_d(x_t)_i \big)

wherein the Smooth L1 function is:

\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}

wherein x_t is the fused feature information predicted by the two regression branches of the previous stage, f_b denotes the regression branch with the fully-connected architecture, f_d denotes the regression branch with the fully-convolutional architecture, and i indexes the center coordinates, width and height of the i-th candidate region.
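Under the reconstruction above, the distance loss between the two branches could be computed as in the following sketch; the reduction mode and the tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def distance_loss(offsets_fc: torch.Tensor, offsets_conv: torch.Tensor) -> torch.Tensor:
    # Smooth L1 distance between the outputs of the fully-connected branch f_b
    # and the fully-convolutional branch f_d for the same candidate regions;
    # both inputs have shape (num_candidates, 4): center x, center y, width, height
    return F.smooth_l1_loss(offsets_fc, offsets_conv, reduction="sum")

# illustrative usage with random branch outputs
fb = torch.randn(8, 4)
fd = torch.randn(8, 4)
print(distance_loss(fb, fd))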
7. The cascade regression target detection method according to claim 1, wherein the loss function comprises a cascaded region suggestion module loss function and a cascaded two-path regressor loss function:

Loss1: the cascaded region suggestion module loss function [formula image FDA0003267980470000023 as filed];

Loss2: the cascaded two-path regressor loss function [formula image FDA0003267980470000024 as filed];

wherein i denotes the index of the preset anchor box, p_i^* and t_i^* respectively denote the true class and the offset vector of the preset anchor box with index i, c_i and t_i denote the multi-class prediction probability and the coordinate detection of the second stage, N_1, N_2 and N_3 respectively denote the number of positive samples in the first, second and third detection stages of the cascade regression, L_m is the multi-class cross-entropy loss for determining the object class, and L_r is the Smooth L_1 loss function; the total loss is the weighted sum of the cascaded region suggestion module loss function and the cascaded two-path regressor loss function.
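Since Loss1 and Loss2 are given only as formula images in the filing, the sketch below is an assumption-laden approximation of how such a weighted combination is typically computed: a classification loss plus a Smooth L1 regression loss per stage, each normalized by that stage's positive-sample count, then summed with weights. It is not the exact loss of this application.

import torch
import torch.nn.functional as F

def total_loss(stage1, stage2, stage3, weights=(1.0, 1.0, 1.0)):
    # each `stage` tuple holds (class_logits, pred_offsets, true_labels,
    # true_offsets, num_positives); the form of the per-stage losses is an
    # assumption, not the formulas filed as images in claim 7
    losses = []
    for logits, pred, labels, target, num_pos in (stage1, stage2, stage3):
        cls = F.cross_entropy(logits, labels, reduction="sum")           # L_m (or binary CE)
        pos = labels > 0                                                 # positive anchors only
        reg = F.smooth_l1_loss(pred[pos], target[pos], reduction="sum")  # L_r
        losses.append((cls + reg) / max(num_pos, 1))                     # normalize by N_k
    return sum(w * l for w, l in zip(weights, losses))

# illustrative usage with random outputs: two binary stages, one 21-class stage
def fake_stage(num_classes, n=16):
    return (torch.randn(n, num_classes), torch.randn(n, 4),
            torch.randint(0, num_classes, (n,)), torch.randn(n, 4), 8)

print(total_loss(fake_stage(2), fake_stage(2), fake_stage(21)))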
8. The cascade regression target detection method according to claim 1, wherein the detecting the picture to be detected by using the trained network model comprises:
inputting Q test pictures into the trained network model for detection;
obtaining the detection results R = {R_1, R_2, …, R_q, …, R_Q}, stored by category;
and calculating the area intersection-over-union between the rotated rectangular boxes, performing non-maximum suppression, and keeping only detection boxes with higher scores and small mutual overlap as the final detection results.
9. The cascade regression target detection method according to claim 1, wherein the non-maximum suppression comprises: for the initial detection result R_q, sorting the prediction scores of all detection boxes of the same category in descending order, the sorted result being R'_q = {R'_{c1}, R'_{c2}, …, R'_{cf}, …, R'_{cF}}, wherein R'_{cf} is the sorted detection result of the f-th category;
for R'_{cf}, calculating the area intersection-over-union between any detection box b and all detection boxes whose prediction scores are smaller than the current score;
if the area intersection-over-union of two detection boxes exceeds the threshold t_iou, discarding the detection box b_s with the lower score.
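A minimal per-class non-maximum suppression consistent with claim 9 is sketched below; axis-aligned boxes are used for simplicity (the rotated-rectangle IoU mentioned in claim 8 needs a dedicated computation), and the 0.5 threshold is an assumption.

import torch
from torchvision.ops import box_iou

def per_class_nms(boxes: torch.Tensor, scores: torch.Tensor, t_iou: float = 0.5):
    # sort detections of one category by descending score and discard any box
    # whose IoU with a higher-scoring kept box exceeds t_iou
    order = scores.argsort(descending=True)           # descending-order sorting
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= t_iou]               # drop lower-scoring overlaps
    return keep

# illustrative usage: three boxes of one category, the first two overlapping
boxes = torch.tensor([[0.0, 0.0, 100.0, 100.0],
                      [10.0, 10.0, 110.0, 110.0],
                      [200.0, 200.0, 300.0, 300.0]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(per_class_nms(boxes, scores))  # [0, 2]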
10. A cascade regression target detection apparatus, comprising:
the acquisition module is used for acquiring the picture to be detected and/or the training picture;
the preprocessing module is used for carrying out pixel standardization and scaling operation on the detection picture and/or the training picture acquired by the acquisition module;
the neural network module is used for configuring the deep convolutional neural network, which comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, a cascaded two-path regressor module for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing these modules;
the neural network module is also used for training by means of the training pictures and detecting the pictures to be detected so as to obtain a detection result.
11. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
CN202111092255.1A 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium Pending CN114241250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111092255.1A CN114241250A (en) 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111092255.1A CN114241250A (en) 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114241250A (en) 2022-03-25

Family

ID=80742983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111092255.1A Pending CN114241250A (en) 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114241250A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925387A (en) * 2022-04-02 2022-08-19 北方工业大学 Sorting system and method based on end edge cloud architecture and readable storage medium
CN114925387B (en) * 2022-04-02 2024-06-07 北方工业大学 Sorting system, method and readable storage medium based on end-edge cloud architecture
CN114998840A (en) * 2022-07-18 2022-09-02 成都东方天呈智能科技有限公司 Mouse target detection method based on deep cascade supervised learning
CN117408304A (en) * 2023-12-14 2024-01-16 江苏未来网络集团有限公司 6D gesture prediction neural network model and method
CN117408304B (en) * 2023-12-14 2024-02-27 江苏未来网络集团有限公司 6D gesture prediction neural network model system and method

Similar Documents

Publication Publication Date Title
CN112818903B (en) Small sample remote sensing image target detection method based on meta-learning and cooperative attention
US10902615B2 (en) Hybrid and self-aware long-term object tracking
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
Long et al. Object detection in aerial images using feature fusion deep networks
CN114241250A (en) Cascade regression target detection method and device and computer readable storage medium
Zhao et al. Improved vision-based vehicle detection and classification by optimized YOLOv4
Chen et al. Research on recognition of fly species based on improved RetinaNet and CBAM
CN113221787B (en) Pedestrian multi-target tracking method based on multi-element difference fusion
CN111967480A (en) Multi-scale self-attention target detection method based on weight sharing
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
CN104517103A (en) Traffic sign classification method based on deep neural network
CN111310604A (en) Object detection method and device and storage medium
CN114359851A (en) Unmanned target detection method, device, equipment and medium
Chen et al. Corse-to-fine road extraction based on local Dirichlet mixture models and multiscale-high-order deep learning
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
Hou et al. Object detection in high-resolution panchromatic images using deep models and spatial template matching
CN111783523A (en) Remote sensing image rotating target detection method
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
Tian et al. Object localization via evaluation multi-task learning
CN111696136A (en) Target tracking method based on coding and decoding structure
Cao et al. Learning spatial-temporal representation for smoke vehicle detection
CN113128564B (en) Typical target detection method and system based on deep learning under complex background
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Chen et al. A review of object detection: Datasets, performance evaluation, architecture, applications and current trends
Nag et al. ARCN: a real-time attention-based network for crowd counting from drone images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination