CN114241250A - Cascade regression target detection method and device and computer readable storage medium - Google Patents

Cascade regression target detection method and device and computer readable storage medium

Info

Publication number
CN114241250A
Authority
CN
China
Prior art keywords
regression
network
detection
cascaded
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111092255.1A
Other languages
Chinese (zh)
Inventor
张可
袁堃
葛绍妹
邓其龙
杨俊
高昱峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
State Grid Electric Power Research Institute
Original Assignee
Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
State Grid Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Nanrui Jiyuan Power Grid Technology Co ltd, Hangzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd, State Grid Electric Power Research Institute filed Critical Anhui Nanrui Jiyuan Power Grid Technology Co ltd
Priority to CN202111092255.1A priority Critical patent/CN114241250A/en
Publication of CN114241250A publication Critical patent/CN114241250A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cascade regression target detection method, a device and a computer readable storage medium. The method comprises the following steps: acquiring a picture to be detected, standardizing its pixels and scaling it to a uniform size; and inputting the adjusted picture to be detected into a trained deep convolutional neural network model, wherein the deep convolutional neural network model is obtained by training on pictures with label information and comprises a backbone network for feature extraction, a cascaded region suggestion module for adjusting preset boxes step by step, a cascaded two-way regressor module for fine-tuning the preset boxes, and a distance loss function and a loss function for optimizing these modules. The invention improves the performance of a two-stage target detector through the idea of cascade regression and reduces the detection time of the algorithm while maintaining high detection precision.

Description

Cascade regression target detection method and device and computer readable storage medium
Technical Field
The invention relates to the field of image recognition, in particular to a cascade regression target detection method and device and a computer readable storage medium.
Background
In recent years, with the improvement of computing performance, artificial intelligence has not only made great progress but has also gradually been applied at scale across industries. Computer vision is one of the important research fields of artificial intelligence and has strongly promoted innovation in production and daily life. Examples include unmanned vehicles representing future travel modes, intelligent manufacturing for improving factory production efficiency, and intelligent security for early warning, with computer vision serving as the technical basis for all of them.
Besides image classification, target detection, semantic segmentation and instance segmentation are three basic visual recognition tasks in the field of computer vision. Target detection not only needs to identify the target category but also needs to predict the position of the target with a rectangular box; it is a basic module required by many scenarios, such as face recognition, pedestrian detection, video analysis and mark detection. Semantic segmentation assigns a specific class label to each pixel, thereby providing a richer image description; unlike target detection, it does not distinguish multiple targets of the same class. Instance segmentation is a combination of target detection and semantic segmentation, requiring the identification of different objects and the assignment of a pixel-by-pixel class mask to each object. In practice, instance segmentation can be regarded as a special case of target detection that locates an object pixel by pixel rather than with a rectangular box.
Target detection algorithms are generally classified into two-stage and single-stage algorithms. A two-stage target detector generally comprises two parts: a candidate region generator, which searches the image for regions in which a target may exist, with the main purpose of finding regions with a high recall rate so that every object in the image is matched to a candidate region as far as possible; and a detector that classifies and regresses the candidate regions. The two-stage target detector has the advantage of high detection accuracy, but often cannot meet the requirement of real-time detection.
The single-stage target detector generally has a simpler architecture. Unlike the two-stage detection algorithm, it does not divide the detection process into candidate box generation and region classification; instead, every pixel on the feature map is considered a possible target location, and a rectangular box and several confidence scores are assigned to it to determine the precise position and probability of an object. The single-stage target detector has a higher detection speed but is often accompanied by lower detection precision.
In general, the requirement on precision outweighs the requirement on speed, so two-stage target detection algorithms have wider applicability. However, existing two-stage algorithms with higher performance usually consume more computational resources and detection time, making it difficult to meet practical requirements. Therefore, a lightweight, high-performance two-stage target detection method is urgently needed.
Disclosure of Invention
Aiming at the problem of poor detection performance in the prior art, the invention provides a cascade regression target detection method and device and a computer readable storage medium, which improve the performance of a two-stage target detector through the idea of cascade regression and reduce the detection time of the algorithm while maintaining high detection precision.
The technical scheme of the invention is as follows.
A cascade regression target detection method comprises the following steps:
acquiring a picture to be detected, standardizing its pixels and scaling it to a uniform size;
inputting the adjusted picture to be detected into a trained deep convolutional neural network model, wherein the deep convolutional neural network model is obtained by training a training picture with label information and comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, a cascaded two-path regressor module for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing the modules;
and detecting the picture to be detected by using the trained deep convolutional neural network model to obtain a detection result.
Preferably, the backbone network extracts the features of the picture through a plurality of feature prediction layers with different resolutions; the backbone network comprises one of a ResNet series network or a VGG series network and an additional convolution network added on the basis of the ResNet series network or the VGG series network.
Preferably, the cascaded region suggestion module performs classification and regression according to the extracted features and comprises a cascaded region suggestion network, a feature fusion module and a classification regression module; the cascaded region suggestion network fine-tunes the size and position of the preset boxes in two steps to generate region suggestions; the feature fusion module fuses features of different scales together; and the classification regression module extracts N × N candidate-region features for classification and regression according to the region suggestions provided by the cascaded region suggestion network.
Preferably, the cascaded two-way regressor module uses a neural network with a fully-connected architecture and a neural network with a fully-convolutional architecture as its two regression branches to regress the same candidate region and adjust the size and position of the candidate region, comprising the following three stages (an illustrative sketch follows them):
the first stage is as follows: taking the position of a preset frame as a candidate region, extracting target features and cutting the target features into specified sizes, respectively inputting two paths of regression branches for prediction, adjusting the size and the position of the candidate region according to a prediction result, extracting two groups of features, and fusing the two groups of features by using a convolutional neural network;
and a second stage: inputting the features fused in the first stage into two regression branches respectively for prediction, adjusting the size and the position of a candidate region according to a prediction result, extracting two groups of features, and fusing the two groups of features by using a convolutional neural network;
and a third stage: and inputting the features fused in the second stage into two regression branches respectively for prediction, adjusting the size and the position of the candidate region according to the prediction result, extracting two groups of features, and fusing the two groups of features by using a convolutional neural network to obtain the size and the position of the candidate region which is finally predicted.
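A minimal PyTorch sketch of the three stages above is given below for illustration; the module names (fc_branch, conv_branch, fuse_conv), the 7 × 7 crop size and the averaging of the two branch predictions are assumptions rather than the exact construction of the invention.

import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TwoWayCascadeRegressor(nn.Module):
    """Illustrative sketch of the cascaded two-way regressor (assumed sizes and names)."""
    def __init__(self, channels=256, crop=7, stages=3):
        super().__init__()
        self.crop, self.stages = crop, stages
        # regression branch with a fully-connected architecture: predicts 4 box offsets
        self.fc_branch = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * crop * crop, 1024), nn.ReLU(),
            nn.Linear(1024, 4))
        # regression branch with a fully-convolutional architecture: predicts 4 box offsets
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 4))
        # 1x1 convolution that fuses the two groups of region features
        self.fuse_conv = nn.Conv2d(channels, channels, 1)

    def forward(self, feat, boxes):
        # feat: [1, C, H, W] feature map; boxes: [N, 4] candidate regions (x1, y1, x2, y2)
        idx = torch.zeros(len(boxes), 1, dtype=boxes.dtype)          # batch index column
        region = roi_align(feat, torch.cat([idx, boxes], 1), (self.crop, self.crop))
        for _ in range(self.stages):
            d_fc, d_conv = self.fc_branch(region), self.conv_branch(region)
            boxes_fc, boxes_conv = boxes + d_fc, boxes + d_conv      # simplified box adjustment
            # re-extract features at both predicted positions and fuse them pixel by pixel
            r_fc = roi_align(feat, torch.cat([idx, boxes_fc], 1), (self.crop, self.crop))
            r_conv = roi_align(feat, torch.cat([idx, boxes_conv], 1), (self.crop, self.crop))
            region = self.fuse_conv(r_fc + r_conv)
            boxes = (boxes_fc + boxes_conv) / 2                      # assumed fusion of the two predictions
        return boxes, region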
Preferably, at each stage of the regression, a cross entropy function is used to optimize the classification and a Smooth L1 function is used to optimize the regression; the distance loss function is used to minimize the difference between the output results of the two regression branches.
Preferably, the distance loss function is defined as:
L_dist = Σ_i SmoothL1( f_b(x_t)_i - f_d(x_t)_i );
wherein the Smooth L1 function is:
SmoothL1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
wherein x_t is the coordinate-fused feature information predicted according to the two regression branches in the previous stage, f_b represents the regression branch of the fully-connected layer architecture, f_d represents the regression branch of the fully-convolutional architecture, and i denotes the center coordinates and the width and height of the i-th candidate region.
Preferably, the loss function includes a cascaded region suggestion module loss function and a cascaded two-way regressor loss function:
Loss1 = (1/N1) Σ_i [ L_b(p_i, p_i*) + L_r(x_i, g_i*) ];
Loss2 = (1/N2) Σ_i [ L_m(c_i, p_i*) + L_r(t_i, g_i*) ];
wherein Loss1 is the cascaded region suggestion module loss function and Loss2 is the cascaded two-way regressor loss function; i represents the subscript of the preset anchor box, p_i and x_i are the binary classification prediction and coordinate prediction of the cascaded region suggestion module, p_i* and g_i* respectively represent the true class and the offset vector of the preset anchor box with index i, c_i and t_i are the second-stage multi-class prediction probability and coordinate detection result, N1, N2 and N3 respectively represent the numbers of positive samples in the first, second and third detection stages of the cascade regression, L_b is the binary cross-entropy loss, L_m is the multi-class cross-entropy loss for judging the object class, and L_r is the Smooth-L1 loss function; the total loss is the weighted sum of the cascaded region suggestion module loss function and the cascaded two-way regressor loss function.
Preferably, the detecting the picture to be detected by using the trained network model includes:
inputting Q test pictures into the trained network model for detection;
keeping the detection results R = {R_1, R_2, …, R_q, …, R_Q} by category;
and calculating the area intersection ratio between the rotated rectangular frames, performing non-maximum value suppression, and only keeping the detection frames with larger scores and small mutual overlapping areas as final detection results.
Preferably, the non-maximum suppression includes:
for the initial detection result R_q, sorting the prediction scores of all detection boxes under the same category in descending order, the sorted result being R'_q = {R'_c1, R'_c2, …, R'_cf, …, R'_cF}, wherein R'_cf is the sorted detection result of the f-th class;
for any detection box b in R'_cf, calculating the area intersection ratio between b and all detection boxes whose prediction scores are smaller than the current score;
if the area intersection ratio of two detection boxes exceeds the threshold t_iou, discarding the detection box bs with the lower score.
The invention also includes a cascade regression target detection device, comprising:
the acquisition module is used for acquiring the picture to be detected and/or the training picture;
the preprocessing module is used for carrying out pixel standardization and scaling operation on the detection picture and/or the training picture acquired by the acquisition module; the neural network module is used for configuring a deep convolutional neural network and comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, two cascaded regression modules for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing the modules;
the neural network module is also used for training by means of the training pictures and detecting the pictures to be detected so as to obtain a detection result.
The invention also includes a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method described above. The invention adopts a popular convolutional neural network algorithm to solve the target detection problem, adopts a strategy of adjusting the position and the size of a preset frame step by step in a cascaded area suggestion module, and provides more accurate initialization information for a detection probe, thereby improving the performance of the detector. The cascaded regional suggestion network can provide more accurate preset frames for the detector by finely adjusting the size and the position of the preset frames step by step, so that overfitting is avoided. The size and the position of the preset frame are finely adjusted on the global feature map, so that the global feature map is shared by all the preset frame calculations, and a large amount of repetition and resource consumption caused by using a cascading strategy at a detection head are avoided.
In the two cascaded regression modules, each branch adopts a different architecture to predict the position of the target; the features of the target region are then extracted according to the positions predicted by the branches of the two architectures, and classification and regression are carried out in the next stage after convolutional fusion. During training, the same target is regressed using the fully-connected head and the fully-convolutional head, respectively. The fully-connected regressor has stronger spatial sensitivity, while the fully-convolutional regressor can capture more detailed target-level information, so the advantages of the two architectures can be fully exploited and they complement each other.
According to the method, a gradient back propagation algorithm is adopted, and the updating gradients of all learnable parameters in the network are calculated by means of a loss function calculated by the neural network in the training process of the neural network through a chain type derivation method, so that the updating of the parameters of the neural network is completed, and the end-to-end training process is realized.
According to the principles of digital image processing, the method and device can apply various data enhancements to the training pictures, including picture flipping, color space conversion and picture scaling, thereby improving the utilization of the training pictures, increasing sample diversity, reducing the data annotation requirement to a certain extent, and improving the robustness and generalization ability of the model.
In the method and device, non-maximum suppression is used as a post-processing means for the detection results of the picture to be detected, effectively reducing redundant detection results in the picture.
Drawings
FIG. 1 is a schematic diagram of a network architecture according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating candidate region feature fusion according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the non-maximum suppression processing flow for detection results according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a training and detection process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions will be clearly and completely described below with reference to the embodiments, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms first, second, third and the like in the description and in the claims, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution of the present invention will be described in detail below with specific examples. Embodiments may be combined with each other and some details of the same or similar concepts or processes may not be repeated in some embodiments.
Example (b):
a method of cascading regression target detection, operating on a device having a processor and a computer readable storage medium, comprising:
s1, acquiring a training set of the detection picture;
acquiring a training set of detection pictures, wherein the training set comprises M training pictures X = {X_1, X_2, …, X_m, …, X_M}, wherein X_m represents the m-th training picture;
the M training pictures have M one-to-one corresponding labels Y = {Y_1, Y_2, …, Y_m, …, Y_M}, wherein Y_m represents the label of the m-th training picture;
the M labels comprise the category and coordinate information of the N target objects in the corresponding pictures, Y_m = {P_{m,1}, B_{m,1}, P_{m,2}, B_{m,2}, …, P_{m,n}, B_{m,n}, …, P_{m,N}, B_{m,N}}, wherein P_{m,n} represents the category of the n-th target object in the m-th picture, P_{m,n} ∈ {C_0, C_1, C_2, …, C_j, …, C_J}, C_j denotes the j-th class, C_0 represents the background class, J is the total number of classes, B_{m,n} represents the coordinates of the n-th object in the m-th picture, and B_{m,n} = {w_{m,n}, h_{m,n}, cx_{m,n}, cy_{m,n}, θ_{m,n}}, which respectively denote the width w_{m,n}, height h_{m,n}, center-point abscissa cx_{m,n}, center-point ordinate cy_{m,n} and rotation angle θ_{m,n} of the rectangular box of the marked object.
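For illustration only, the label Y_m of one training picture could be laid out as a Python structure like the one below; every name and number here is hypothetical.

label_m = [
    # each object n carries a class P_{m,n} and a rotated box B_{m,n} = (w, h, cx, cy, theta)
    {"cls": 3, "box": {"w": 42.0, "h": 18.5, "cx": 160.0, "cy": 96.0, "theta": 0.35}},
    {"cls": 1, "box": {"w": 80.0, "h": 55.0, "cx": 220.5, "cy": 140.0, "theta": -0.10}},
]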
S2, carrying out picture standardization on the training pictures and the test pictures of the training set;
in some specific embodiments, the step of performing picture normalization on the training pictures and the test pictures of the training set includes:
performing pixel-level standardization on the pictures in the training set X according to a preset pixel mean Pixel_mean and pixel standard deviation Pixel_std;
uniformly scaling the pictures in the training set X to a size of 320 × 320; it should be noted that after a picture is scaled, the labeled positions of the objects in the picture also need to be adjusted accordingly, otherwise mismatching occurs. The pictures may alternatively be scaled to 512 × 512 or 640 × 640; higher-resolution pictures improve detection accuracy but reduce detection speed. Keeping the picture sizes consistent makes all pictures the same size and satisfies the input conditions of the network.
The standardization formula for any picture pixel is:
Pixel_x = (Pixel_x - Pixel_mean) / Pixel_std;
wherein Pixel_mean is the pixel mean and Pixel_std is the pixel standard deviation.
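A minimal sketch of this preprocessing is shown below; the per-channel mean/std layout and the bilinear interpolation mode are assumptions.

import torch
import torch.nn.functional as F

def preprocess(img, pixel_mean, pixel_std, size=320):
    """Pixel-level standardization and scaling to a fixed size (illustrative).
    img: float tensor [3, H, W]; pixel_mean and pixel_std: per-channel tensors [3]."""
    img = (img - pixel_mean[:, None, None]) / pixel_std[:, None, None]
    h, w = img.shape[1:]
    img = F.interpolate(img.unsqueeze(0), size=(size, size),
                        mode="bilinear", align_corners=False).squeeze(0)
    # the labeled box coordinates must be rescaled by the same factors
    return img, (size / w, size / h)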
S3, designing a deep convolutional neural network structure, and training a network model by utilizing a step-by-step fine tuning mode;
in some specific embodiments, the deep convolutional neural network structure comprises: the system comprises a backbone network for extracting picture features, a cascaded regional suggestion module, feature fusion, a cascaded regional suggestion module loss function, two cascaded regression modules, a distance loss function and two cascaded regression module loss functions.
Obtaining a backbone for extracting basic features;
Specifically, a ResNet series network or a VGG series network is used as the backbone network to extract basic features, wherein the ResNet series includes ResNet50, ResNet101 and ResNet152, and the VGG series includes VGG-16 and VGG-19. After the backbone network is selected, an additional convolution network needs to be added on top of it to obtain feature maps with lower resolution; these feature maps have a smaller spatial resolution but a higher degree of feature abstraction and a larger receptive field, and can detect large objects in the picture. Features of 5 scales are therefore extracted from the backbone network and the additional convolution network for network prediction.
Initializing the backbone network and the additionally added network parameters:
M_weight = MP_weight;
MA_weight = Gaussian(0, 1);
wherein M_weight and MA_weight are respectively the parameters of the backbone network and of the additional convolution network; MP represents the pre-training result of the backbone network M on the data set, and MP_weight represents the parameters of the pre-trained network; Gaussian(0, 1) denotes that the weight parameters of the additional convolution network MA satisfy a Gaussian distribution with a mean of 0 and a variance of 1.
In some specific embodiments, the backbone network adopts ResNet50, a pre-training model of the ResNet50 is derived from a classification model on an ImageNet data set, and the learning rate of residual structures of the first two layers of ResNet50 is set to 0, so that the residual structures do not participate in training, and the overfitting risk in the network training process can be reduced.
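The initialization described above might be sketched as follows; which residual stages are frozen and the layer sizes of the additional convolution network MA are illustrative assumptions, and a recent torchvision is assumed for the pretrained-weights argument.

import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")   # MP_weight: ImageNet pre-training
for p in list(backbone.layer1.parameters()) + list(backbone.layer2.parameters()):
    p.requires_grad = False                                        # learning rate 0 for the first residual stages

extra = nn.Sequential(                                             # additional convolution network MA
    nn.Conv2d(2048, 512, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True))
for m in extra.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=1.0)               # Gaussian(0, 1) initialization
        nn.init.zeros_(m.bias)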
First, a cascaded region suggestion network is built on the basis of the backbone network and the additional convolution network, and binary classification and position/size adjustment are performed on the target regions;
taking the VGG-16 as the backbone network as an example, three layers of networks in the VGG network and an additional layer of network are taken as global feature prediction layers. The four global feature prediction layers are gradually reduced in resolution and gradually increased in receptive field, so that the four global feature prediction layers can be respectively responsible for detecting objects with different scales. Centering around each pixel location on the global feature map, each location is called an anchor point, resulting in K preset boxes with fixed dimensions and aspect ratios, each responsible for matching with a potential target. As shown in fig. 1, the cascaded regional suggestion network consists of two sets of two classes and regression modules. The first secondary classification and regression module uses the original global feature map of the backbone network for prediction, simply classifies each preset frame and outputs the offset relative to the original position of the preset frame. After calculating the overlap ratio of the preset box and the true value box, some simple negative samples are discarded by using a predefined threshold value, and then a cross-entropy function and a Smooth L are used1Calculating deviation of function, classifying the first and second and regressingAnd (6) optimizing. And the preset frame after the first second classification and regression module finely adjusts the position and the size is used as the input of the second classification and regression module. Different from the method that each stage in Cascade R-CNN uses the features output by the backbone network as input, in order to obtain more discriminative information, the second classification and regression module uses the feature fusion block to fuse the original features, so that the requirements of the second classification and regression module are met. The features used by the two second classes and the regression module have the same dimensions. And the second secondary classification and regression module predicts 4 offsets and 2 confidence scores for each fine-tuned preset frame by using the fused features, discards simple samples according to the confidence scores, and filters the preset frames with higher overlapping rate by using a non-maximum suppression algorithm. Through the two-step cascade regression operation, the regional suggestion network provides initialization information with higher accuracy for the subsequent detection head.
The cascaded region suggestion network serves a two-stage target detection algorithm; since it provides more accurate feature information for the regressor, the performance of the target detector can be improved. Unlike Cascade R-CNN, in which the regressors of all stages must separately compute the features of the candidate regions at a very large computational cost in order to obtain more accurate initialization information, all candidate regions in the cascaded region suggestion network share the predicted feature map, so only little computational resource is required.
Adding feature fusion into the cascade region suggestion module to enhance the perception capability of the object;
the junction of the first two classifications and the regression and the second two classifications and the regression uses feature fusion. The feature fusion is composed of a convolutional layer, an active layer and a deconvolution layer, wherein the deconvolution is mainly responsible for enlarging the resolution of the deep feature map.
The feature fusion module is mainly used for transmitting rich semantic information of the deep network to the shallow network, and the detection precision of the algorithm is improved. In addition, if the two classes and regression modules share the prediction feature map, the results of the two cascaded regressions will not bring a good effect. In order to match the dimensions of the inputs of the different network layers, the resolution of the feature map is improved using deconvolution, followed by feature fusion using pixel-by-pixel addition, and finally adding a convolutional layer to produce features with discriminative power. Feature fusion is also used at the second classification and junction of the regression module and the detection head to further enhance the information of the feature map.
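A minimal sketch of such a fusion block (convolution, activation, deconvolution, pixel-by-pixel addition, output convolution) is shown below; the channel count and kernel sizes are assumptions.

import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses a deep (low-resolution) feature map into a shallow one (illustrative sizes)."""
    def __init__(self, channels=256):
        super().__init__()
        self.lateral = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.ReLU(inplace=True))
        # the deconvolution doubles the resolution of the deep feature map
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, shallow, deep):
        fused = self.lateral(shallow) + self.up(deep)   # pixel-by-pixel addition
        return self.out(fused)                          # convolution producing discriminative features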
In the cascaded regional suggestion network, performing foreground and background classification on candidate regions, simultaneously fine-tuning the size and the position of a preset frame, and optimizing the network through a loss function;
the loss function of the cascaded regional recommendation network consists of the loss functions of the two-class and regression modules. In the cascade region suggestion network, a label of two categories (namely whether the label is an object or not) is allocated to each preset frame, and the preset frames are finely adjusted.
Figure BDA0003267980480000081
Wherein i represents the index number of the preset frame in a batch, and liIs the class number g of the real object corresponding to the preset frame iiIs the position and size of the real object corresponding to the preset frame i. p is a radical ofiIs the confidence that the preset box contains an object, xiIs the position of fine tuning to the preset box i. c. CiIs a multi-class label prediction for a preset box i, tiIs the predicted offset. L isbIs based on cross entropy loss for the two classes.
In the two cascaded regression modules, a preset frame provided by a cascaded region suggestion module is used as initial input information, and a target is precisely and accurately positioned in a characteristic fusion mode;
the cascaded two-path regressor is mainly responsible for accurately positioning the target. In the algorithm, the size and the position of a candidate region are preliminarily adjusted by a cascaded region suggestion network, and then the adjusted candidate region is used as the input of two paths of regressors in the next stage to accurately position a target. Specifically, the cascaded two-way regressor is composed of three stages with the same function, and each stage has two regression branches with different architectures (based on the regression branches of the fully-connected architecture and the regression branches of the fully-connected architecture)Based on the regression branch of the full convolution mechanism), the two regression branches respectively position the same object. As shown in fig. 1, in the first stage, according to the candidate region position information provided by the cascaded regional suggestion network, the target feature is extracted and cut to a specified size, and the extracted target feature is used as the input of two regression branches. In the second stage, according to the position information predicted by the two regression branches in the previous stage on the same object, extracting different features of the same object, scaling the two groups of features into the same size, adding the two groups of features pixel by pixel, and then fusing the features of the two groups of features by using a convolution layer. Because the network of the full convolution architecture can capture the spatial information of the target, and the network of the full connection architecture can capture the information of the target level, the prediction of the two regression branches on the same object has deviation, the target characteristics are extracted according to the position information predicted by the two architectures, and then the two sets of characteristics are fused, so that the limitation of positioning of a certain network can be made up, and the network performance is improved. The fused features are used as the input of the third stage network, and the rest operation is basically the same as that of the second stage. For the predicted values of the two-way regressor at each stage, Smooth L is used1And optimizing the functions to enable the predicted values of the two functions to be as close as possible to the same target. At each stage, optimization was performed using a cross-entropy function for classification and Smooth L for regression1The function is optimized.
The distance loss function is mainly used to constrain the fine adjustment made by the two regression branches to the size and position of the same target.
In the two cascaded regression modules, the regression branches of the two architectures each use a Smooth L1 function to optimize the predicted offsets. Because the two regression branches predict the same candidate region, their ability to work cooperatively can be strengthened. Therefore, a Smooth L1 function is also used to optimize the offsets predicted by the two regression branches against each other; since the two branches regress the same target, this function can make the offsets predicted by the two branches as close as possible, thereby improving the accuracy of the detector. The introduced optimization function is named the distance loss, which is expressed as follows:
L_dist = Σ_i SmoothL1( f_b(x_t)_i - f_d(x_t)_i );
In the above formula, the detailed expression of the Smooth L1 function is:
SmoothL1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
wherein x_t is the coordinate-fused feature information predicted according to the two regression branches in the previous stage, f_b represents the regression branch of the fully-connected architecture, f_d represents the regression branch of the fully-convolutional architecture, and i denotes the center coordinates and the width and height of the i-th candidate region.
The Smooth L1 function is usually used to optimize the gap between the offsets predicted by a regressor and the labels. The distance loss function, by contrast, does not calculate the difference between the regressor's predictions and the real annotations, but the difference between the offsets predicted by the two regression branches, so that the IoU (Intersection over Union) between the regression results of the two branches on the same candidate region is as close to 1 as possible. It should be noted that the distance loss function does not directly introduce annotation data; since each regression branch is already constrained separately by the annotation data, the distance loss does not cause the detector to deviate from the true result and is therefore theoretically feasible.
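Assuming both branches output per-region offset vectors, the distance loss can be sketched with PyTorch's built-in Smooth L1 as follows; the mean reduction over candidate regions is an assumed choice.

import torch.nn.functional as F

def distance_loss(offsets_fc, offsets_conv):
    """Smooth-L1 distance loss between the offsets predicted by the two regression
    branches (f_b and f_d) for the same candidate regions; illustrative only."""
    # offsets_fc, offsets_conv: [N, 4] predictions made on the fused features x_t
    return F.smooth_l1_loss(offsets_fc, offsets_conv, beta=1.0, reduction="mean")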
A first-stage classification network CLS1 and localization network LOC1 are constructed on the basis of the basic feature extraction network and the additional convolution network, and CLS1 and LOC1 each consist of F convolution layers. The classification network CLS1 and the localization network LOC1 are respectively denoted as CLS1 = {CLS11, CLS12, …, CLS1f, …, CLS1F} and LOC1 = {LOC11, LOC12, …, LOC1f, …, LOC1F}, where F is the number of feature maps jointly generated by the feature extraction network M and the additional convolution network MA, and CLS1f and LOC1f respectively represent the classification and localization networks on the f-th feature map, expressed as follows:
CLS1f = Conv(channel1f, 2, strideh1, stridew1);
LOC1f = Conv(channel1f, 5, strideh1, stridew1);
wherein Conv represents a single convolution layer, and the number of input channels channel1f represents the number of channels of the f-th feature map obtained by the basic feature extraction network and the additional convolution network; 2 is the number of convolution output channels of CLS1f, indicating that only binary foreground/background classification is performed at this point; 5 is the number of convolution output channels of LOC1f, indicating that 5 coordinates are regressed at this point, corresponding to the object coordinates B_{m,n} described above; strideh1 and stridew1 are the height and width of the convolution kernel.
In some embodiments, the number of feature maps generated by ResNet50 and the additional convolution network is 4, i.e. F = 4, and the numbers of channels of the corresponding feature maps are {512, 1024, 2048, 512}; strideh1 and stridew1 are both 3.
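The first-stage heads can be sketched as one 3 × 3 classification convolution and one 3 × 3 localization convolution per feature map; folding K preset boxes per anchor point into the output channels, as done below, is an assumption on top of the 2 and 5 channels described above.

import torch.nn as nn

channels = [512, 1024, 2048, 512]   # channel counts of the F = 4 feature maps (ResNet50 example)
K = 3                               # preset boxes per anchor point (illustrative)
cls1 = nn.ModuleList([nn.Conv2d(c, K * 2, kernel_size=3, padding=1) for c in channels])  # foreground/background
loc1 = nn.ModuleList([nn.Conv2d(c, K * 5, kernel_size=3, padding=1) for c in channels])  # (w, h, cx, cy, theta)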
A feature pyramid network is constructed on the basis of the basic feature extraction network M and the additional convolution network MA to generate the F first-stage feature maps FEA1 and further generate the higher-resolution second-stage feature maps FEA2.
Specifically, the F first-stage feature maps FEA1 are denoted as FEA1 = {FEA11, FEA12, …, FEA1f, …, FEA1F}, and the widths and heights of the first-stage feature maps are respectively denoted as W1 = {W11, W12, …, W1f, …, W1F} and H1 = {H11, H12, …, H1f, …, H1F}, where W1f and H1f respectively denote the width and height of the f-th first-stage feature map.
When 1 ≤ i ≤ F - 1, W1i = 2 × W1(i+1) and H1i = 2 × H1(i+1) are satisfied. On the basis of FEA1, the feature pyramid can transmit the semantic information of the high-level feature maps to the bottom layer, thereby combining the advantages of both to obtain feature maps with high resolution and rich semantic information. The feature maps generated by the feature pyramid are denoted as FEA2 and are called the second-stage feature maps, FEA2 = {FEA21, FEA22, …, FEA2f, …, FEA2F}, wherein FEA2f represents the f-th second-stage feature map. The number of feature maps in FEA2 is the same as in FEA1, and the width and height of FEA2f remain the same as those of FEA1f.
The feature pyramid network includes a feature map conversion network TS and a feature map scaling network INP, where the feature map conversion network may be expressed as TS = {TS1, TS2, …, TSf, …, TSF}; TS likewise consists of F parts, where TSf represents the f-th feature map conversion network. INP = {INP1, INP2, …, INPf, …, INP(F-1)}; INP consists of F - 1 parts, where INPf represents the feature map scaling network between the f-th and (f+1)-th feature maps, and the width and height of a feature map passing through the scaling network become 2 times the original.
In the construction process of the feature map pyramid, firstly, the feature map of the highest layer is processed independently;
and then sequentially processing according to the sequence of the spatial resolution of the feature map from low to high:
FEA2F = TSF(FEA1F);
t = TSi(FEA1i);
FEA2i = t + INPi(FEA2(i+1));
wherein t is an intermediate feature map in the feature pyramid construction process, the values of i are taken in the order {F - 1, F - 2, …, 1}, and the feature pyramid network comprises the feature map conversion network TS and the feature map scaling network INP.
Specifically, the intermediate feature maps in the feature pyramid construction process do not undergo the final detection step. The formula for FEA2F only needs to be executed once, while the formulas for t and FEA2i are executed F - 1 times in total.
In some specific embodiments, a Res2Net structure is used as the feature map conversion network; Res2Net performs conversion and connection in residual form between different channels of the feature map, thereby enhancing the feature extraction capability. The feature map scaling network is implemented with the feature map interpolation function in the PyTorch function library.
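A sketch of the top-down construction of FEA2 is given below; a plain convolution block stands in for the Res2Net-style conversion network TS, and PyTorch interpolation realizes the scaling network INP.

import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Builds the second-stage maps FEA2 from the first-stage maps FEA1 (illustrative)."""
    def __init__(self, in_channels=(512, 1024, 2048, 512), out_channels=256):
        super().__init__()
        self.ts = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, out_channels, 3, padding=1), nn.ReLU(inplace=True))
            for c in in_channels])                                   # stand-in for TS_f

    def forward(self, fea1):
        # fea1: list of F first-stage maps, ordered from high to low resolution
        n = len(fea1)
        fea2 = [None] * n
        fea2[n - 1] = self.ts[n - 1](fea1[n - 1])                    # FEA2_F = TS_F(FEA1_F)
        for i in range(n - 2, -1, -1):                               # i = F-1, ..., 1 of the text (0-based here)
            t = self.ts[i](fea1[i])                                  # t = TS_i(FEA1_i)
            up = F.interpolate(fea2[i + 1], size=t.shape[-2:], mode="nearest")  # INP_i
            fea2[i] = t + up                                         # FEA2_i = t + INP_i(FEA2_{i+1})
        return fea2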
As shown in fig. 2, the candidate region is processed by the two regression branches to obtain two regression results; the N × N candidate-region features are respectively extracted from the two regression results for classification and regression, and the features are added and averaged to obtain the fused region features. The scaled feature map is merged with the previous feature map through a channel splicing operation, and the merged feature map only needs to be sent into the feature map conversion network to generate a new feature map. The feature map conversion network comprises 5 identical structures and the feature map scaling network comprises 4 identical structures, each with its own independent trainable parameters.
The first-stage classification network CLS1 and the convolution region CR1 of the convolution network are re-registered using the coordinate detection result LR1 of the first-stage localization network LOC1;
wherein the coordinate detection result can be expressed as LR1 = {w1, h1, cx1, cy1, θ1}, indicating the width, height, center-point coordinates and rotation angle detected on the basis of a preset anchor box.
Taking the result of a 3 × 3 convolution operation on the origin in two-dimensional space as an example:
SP1 = {(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)};
CR2 = Rotate(Scale(Shift(CR1, LR1)));
SP2 = {p1, p2, p3, p4, p5, p6, p7, p8, p9};
wherein CR1 is a 3 × 3 rectangular region and SP1 represents the coordinate set of the sampling points in the convolution region CR1, 9 positions in total; Rotate, Scale and Shift indicate that translation, scaling and rotation operations are performed in order on the convolution region CR1 according to the detection result LR1, and CR2 is the resulting new convolution region; SP2 is the set of sampling points of the new convolution region CR2, and {p1, p2, p3, p4, p5, p6, p7, p8, p9} are the coordinates of the corresponding 9 sampling points.
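The geometric re-registration of the sampling points can be sketched as follows; the composition order of the scale, rotate and shift operations and the division of w and h by 3 are illustrative assumptions.

import torch

def transform_sampling_points(lr):
    """Shift, scale and rotate the 3 x 3 convolution sampling grid according to the
    first-stage detection LR1 = (w, h, cx, cy, theta); purely illustrative geometry."""
    w, h, cx, cy, theta = (torch.as_tensor(v, dtype=torch.float32) for v in lr)
    ys, xs = torch.meshgrid(torch.tensor([-1.0, 0.0, 1.0]),
                            torch.tensor([-1.0, 0.0, 1.0]), indexing="ij")
    sp1 = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)      # SP1: regular 3 x 3 grid, 9 points
    scaled = sp1 * torch.stack([w / 3.0, h / 3.0])                  # Scale to the detected box size
    rot = torch.stack([torch.stack([torch.cos(theta), -torch.sin(theta)]),
                       torch.stack([torch.sin(theta), torch.cos(theta)])])
    sp2 = scaled @ rot.T + torch.stack([cx, cy])                    # Rotate, then Shift to (cx, cy)
    return sp2                                                      # the 9 sampling points p1..p9 of CR2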
On the basis of the second-stage feature maps FEA2 and the reassigned convolution region CR2, the second-stage classification and localization are performed on the basis of the first stage to obtain the second-stage classification network CLS2 and localization network LOC2,
respectively expressed as:
CLS2 = {CLS21, CLS22, …, CLS2f, …, CLS2F},
LOC2 = {LOC21, LOC22, …, LOC2f, …, LOC2F},
where CLS2f and LOC2f respectively represent the classification and localization networks on the f-th feature map, expressed as follows:
CLS2f = Conv(channel2f, J, strideh2, stridew2);
LOC2f = Conv(channel2f, 5, strideh2, stridew2);
wherein channel2f represents the number of channels of the f-th second-stage feature map FEA2f and Conv denotes a convolution layer; J is the number of convolution output channels of CLS2f and is also the total number of object classes in the training and test pictures: compared with CLS1f, no binary classification is performed at this point, and the specific class of the object is judged instead; the number of convolution output channels of LOC2f is 5 and is used for detecting the coordinates of the object; the difference from LOC1f is that the position detection is no longer based on the preset anchor box, but further finely detects the object position on the basis of the first-stage position detection result LR1.
In some specific embodiments, the number of second-stage feature maps is 4, the corresponding numbers of channels are {256, 256, 256, 256}, and strideh2 and stridew2 are both 3.
Defining a loss function in the detection processes of the first stage and the second stage;
specifically, the loss function includes a second classification and regression loss detected in the first stage and a multi-classification and regression loss in the second stage; and training the network by using the training set and obtaining a final network model. The loss function is a numerical value obtained by calculating the final classification and position detection result of the network and the real classification and position in the image marking information, the larger the numerical value is, the worse the network performance is, otherwise, the better the network performance is, and the purpose of training is to reduce the loss value.
The loss functions are:
Loss1 = (1/N1) Σ_i [ L_b(p_i, p_i*) + L_r(x_i, g_i*) ];
Loss2 = (1/N2) Σ_i [ L_m(c_i, p_i*) + L_r(t_i, g_i*) ];
wherein i represents the subscript of a preset anchor box, p_i and x_i respectively represent the binary classification prediction probability and the coordinate detection result of the first stage, p_i* and g_i* respectively represent the true class and the offset vector of the preset anchor box with index i, c_i and t_i are the multi-class prediction probability and coordinate detection result of the second stage, and N1 and N2 respectively represent the numbers of positive samples in the first-stage and second-stage detection processes. L_b is the binary cross entropy loss for judging whether an object is foreground or background, L_m is the multi-class cross-entropy loss for judging the object class, and L_r is the Smooth-L1 loss function.
In some specific embodiments, for all the preset anchor frames, whether the preset anchor frames belong to a positive sample or a negative sample is obtained through calculation with the marked positions in the input picture; all anchor boxes participate in the computation of the classification penalty, but only anchor boxes belonging to positive examples participate in the computation of the location penalty, since the location information is not important for anchor boxes belonging to negative examples, i.e. the background category.
The total loss ultimately used to optimize the objective function is defined as the weighted sum of the losses of the two stages:
Loss = λ1 × Loss1 + λ2 × Loss2;
wherein λ1 and λ2 are weighting coefficients. In particular, λ1 and λ2 are both 1.
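An illustrative composition of Loss1, Loss2 and the weighted total loss is sketched below; the positive-sample selection and the normalization are simplified assumptions.

import torch.nn.functional as F

def detection_loss(p, x, c, t, cls_star, g_star, pos, n1, n2, lam1=1.0, lam2=1.0):
    """p: [A, 2] first-stage objectness logits; x: [A, 5] first-stage box predictions;
    c: [A, J+1] second-stage class logits; t: [A, 5] second-stage box predictions;
    cls_star: [A] true classes (0 = background); g_star: [A, 5] true offset vectors;
    pos: [A] boolean mask of positive anchors; n1, n2: positive counts per stage."""
    loss1 = (F.cross_entropy(p, (cls_star > 0).long(), reduction="sum")
             + F.smooth_l1_loss(x[pos], g_star[pos], reduction="sum")) / max(n1, 1)
    loss2 = (F.cross_entropy(c, cls_star, reduction="sum")
             + F.smooth_l1_loss(t[pos], g_star[pos], reduction="sum")) / max(n2, 1)
    return lam1 * loss1 + lam2 * loss2   # lambda1 = lambda2 = 1 in the described embodiment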
And training the training set to obtain a final network model.
And S4, testing the test pictures with the network model, calculating the area intersection ratios and performing non-maximum suppression to obtain the final detection results.
According to the network model obtained by training, a sample of Q test pictures T = {T_1, T_2, …, T_q, …, T_Q} is used for testing. During testing, the pictures are only sent into the network for forward propagation to obtain the category score and regression coordinates of each anchor point position in the picture; regions judged as background and regions whose scores are smaller than the set score threshold t_score are discarded.
The detection results R = {R_1, R_2, …, R_q, …, R_Q} are kept by category, where R_q represents the detection result of the q-th test picture and R_q = {R_c1, R_c2, …, R_cj, …, R_cJ}, wherein R_cj represents all detection results of the current test picture for the j-th class. In some specific embodiments, the score threshold t_score is 0.5, and all low-confidence results with prediction scores below 0.5 are discarded.
The picture testing steps are as follows:
standardizing a test picture at a pixel level;
zooming the test picture to be the same as the picture for training;
changing the network model into a test mode, not performing loss calculation and gradient backward propagation on the detection result, and only performing a forward propagation process;
obtaining an initial detection result R of the current q test pictureq
In some specific embodiments, the initial detection result is the multi-classification and location detection result of the second stage, and the detection result of the first stage is only used in the forward propagation process of the network and is not used as the final detection result.
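A test-mode forward pass consistent with the steps above might look as follows; the model's output format and the use of index 0 for the background class are assumptions.

import torch

@torch.no_grad()
def detect(model, image, t_score=0.5):
    """Forward propagation only: no loss calculation or gradient back-propagation."""
    model.eval()
    scores, boxes = model(image.unsqueeze(0))          # assumed model outputs
    probs, classes = scores.softmax(dim=-1).max(dim=-1)
    keep = (classes != 0) & (probs >= t_score)         # drop background and low-score regions
    return classes[keep], probs[keep], boxes[keep]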
And finally, calculating the area intersection ratio between the rotated rectangular frames according to the initial detection result R, performing non-maximum value inhibition, and only keeping the detection frames with larger scores and small mutual overlapping areas as final detection results.
As shown in fig. 3, the step of suppressing the non-maximum value includes:
for beginningInitial detection result RqAnd respectively performing descending order sorting on the prediction scores of all the detection frames under the same category, wherein the sorted result is R'q={R'c1,R'c2,…,R'cf,…,R'cFWherein R'cfKeeping the result with the largest current score as the current processing object each time for the sorted detection result on the jth class;
to R'cfAnd (c) calculating an area intersection ratio between any one of the detection frames b and all the detection frames with the prediction scores smaller than the current scores, wherein the area intersection ratio calculation formula is as follows:
T=areab+areabs
I=interw×interh
U=T-I;
IOU=I/U;
wherein, areabIndicates the area of the detection frame b, areabsIndicates the area of any of the detection boxes bs with a score smaller than b, interwAnd interhRespectively representing the width and the height of the intersection area of the two detection frames;
if the area intersection ratio of the two detection frames is smaller than the threshold, discarding the corresponding detection result, and if the area intersection ratio exceeds the threshold tiouIf yes, the detection box bs with lower score is discarded;
selecting the detection result with the largest score except the current detection object as the current processing object from the current rest detection results until the current processing object is the last detection result; and ending the flow and outputting all the reserved detection results.
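The greedy suppression loop can be sketched as follows; axis-aligned boxes are used for simplicity, whereas the rotated rectangles described above would require a polygon intersection routine to compute the area I.

import torch

def nms(boxes, scores, t_iou=0.5):
    """Greedy non-maximum suppression following the area-IoU definition above;
    boxes: [N, 4] as (x1, y1, x2, y2), scores: [N]."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        b = order[0]
        keep.append(b.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        x1 = torch.maximum(boxes[b, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[b, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[b, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[b, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)                 # I = inter_w * inter_h
        area_b = (boxes[b, 2] - boxes[b, 0]) * (boxes[b, 3] - boxes[b, 1])
        area_bs = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_b + area_bs - inter)                                 # IOU = I / U
        order = rest[iou <= t_iou]      # keep boxes below the threshold, discard the rest
    return keep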
Based on an understanding of the foregoing solution, the protection scope of the present application should not be limited to the information expressed by the literal wording, but should also include the supporting relationships and execution logic among the method steps; this embodiment may therefore also be presented in other ways. For example, with reference to fig. 4, this embodiment may also be expressed as the following steps: inputting a picture, performing enhancement and scaling operations on the picture, and synchronously adjusting the position information of the object labels in the picture; acquiring the first-stage feature maps through the feature extraction network and the additional convolution network, and performing the first-stage binary classification and position detection; acquiring the second-stage feature maps through the feature pyramid network, and performing the second-stage classification and position detection according to the detection result of the first stage; on the basis of the detection result of the second stage, performing the third-stage multi-class classification and position detection as the final result; during training, performing loss calculation on the detection results of the first, second and third stages, and updating the network parameters through the gradient back-propagation algorithm; when not training, taking the classification and position detection results as the initial detection results and post-processing them through the non-maximum suppression method.
Through the description of the above embodiments, those skilled in the art will understand that, for convenience and simplicity of description, only the division of the above functional modules is used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of a specific device is divided into different functional modules to complete all or part of the above described functions.
In the embodiments provided in this application, it should be understood that the disclosed structures and methods may be implemented in other ways. For example, the above-described embodiments with respect to structures are merely illustrative, and for example, a module or a unit may be divided into only one logic function, and may have another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another structure, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, structures or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed to a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially contributed to by the prior art, or all or part of the technical solutions may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A cascade regression target detection method is characterized by comprising the following steps:
acquiring a picture to be detected, standardizing its pixels, and scaling it to a uniform size;
inputting the adjusted picture to be detected into a trained deep convolutional neural network model, wherein the deep convolutional neural network model is obtained by training on training pictures with label information and comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, a cascaded two-path regressor module for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing these modules;
and detecting the picture to be detected by using the trained deep convolutional neural network model to obtain a detection result.
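As a brief illustration of the first step of claim 1 (pixel standardization and scaling to a uniform size), one common implementation is sketched below; the normalization statistics and the 512 x 512 target size are assumptions for illustration, not values specified in this application.

import torch
import torch.nn.functional as F

def preprocess(image: torch.Tensor, size: int = 512) -> torch.Tensor:
    # `image` is an H x W x 3 uint8 tensor; mean/std and target size are
    # illustrative assumptions
    x = image.permute(2, 0, 1).float() / 255.0                # HWC -> CHW, scale to [0, 1]
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    x = (x - mean) / std                                      # pixel standardization
    x = F.interpolate(x.unsqueeze(0), size=(size, size),      # scale to a uniform size
                      mode="bilinear", align_corners=False)
    return x

For training pictures, the label positions would be rescaled by the same factor, as noted in the description.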
2. The cascade regression target detection method according to claim 1, wherein the backbone network performs feature extraction on the picture through a plurality of feature prediction layers with different resolutions; the backbone network comprises one of a ResNet series network or a VGG series network and an additional convolution network added on the basis of the ResNet series network or the VGG series network.
3. The cascade regression target detection method according to claim 2, wherein the cascaded region suggestion module performs classification and regression according to the extracted features and comprises a cascaded region suggestion network, a feature fusion module and a classification regression module; the cascaded region suggestion network finely adjusts the size and position of the preset frame in two steps to generate region suggestions; the feature fusion module fuses features of different scales together; and the classification regression module extracts N×N features of the candidate region for classification and regression according to the region suggestions provided by the cascaded region suggestion network.
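To illustrate the step in claim 3 of extracting N x N candidate-region features for classification and regression, the sketch below uses torchvision's roi_align; the 7 x 7 output size, the channel count, the assumed stride and the example boxes are illustrative assumptions only.

import torch
from torchvision.ops import roi_align

# feature map from the backbone: batch of 1, 256 channels, 64 x 64 cells,
# assumed stride of 8 with respect to the input picture
features = torch.randn(1, 256, 64, 64)

# region suggestions from the cascaded region suggestion network, given as
# (batch_index, x1, y1, x2, y2) in input-picture coordinates
proposals = torch.tensor([[0.0, 32.0, 48.0, 160.0, 200.0],
                          [0.0, 100.0, 80.0, 220.0, 240.0]])

# crop an N x N (here 7 x 7) feature patch per candidate region; these patches
# would feed the classification and regression heads
candidate_feats = roi_align(features, proposals, output_size=(7, 7),
                            spatial_scale=1.0 / 8)
print(candidate_feats.shape)  # torch.Size([2, 256, 7, 7])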
4. The method according to claim 3, wherein the cascaded two-path regressor module uses a neural network with a fully-connected architecture and a neural network with a fully-convolutional architecture as its two regression branches to regress the same candidate region and adjust its size and position, and the method comprises:
the first stage: taking the position of the preset frame as the candidate region, extracting target features and cropping them to a specified size, inputting them into the two regression branches respectively for prediction, adjusting the size and position of the candidate region according to the prediction results, extracting two groups of features, and fusing the two groups of features with a convolutional neural network;
the second stage: inputting the features fused in the first stage into the two regression branches respectively for prediction, adjusting the size and position of the candidate region according to the prediction results, extracting two groups of features, and fusing the two groups of features with a convolutional neural network;
the third stage: inputting the features fused in the second stage into the two regression branches respectively for prediction, adjusting the size and position of the candidate region according to the prediction results, extracting two groups of features, and fusing the two groups of features with a convolutional neural network to obtain the finally predicted size and position of the candidate region.
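The three-stage refinement of claim 4 can be sketched as follows; the branch definitions, feature sizes and fusion layer are illustrative assumptions, and the re-cropping of features after each adjustment is elided for brevity.

import torch
import torch.nn as nn

class TwoPathRegressorStageSketch(nn.Module):
    # one stage of the cascaded two-path regressor: a fully-connected branch and a
    # fully-convolutional branch each predict box offsets for the same candidate
    # region, and a small convolution fuses the two groups of features
    def __init__(self, channels=256, roi=7):
        super().__init__()
        self.fc_branch = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * roi * roi, 1024), nn.ReLU(),
            nn.Linear(1024, 4))                                   # box offsets
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 4, roi))                          # box offsets
        self.fuse = nn.Conv2d(2 * channels, channels, 1)          # fuse two feature groups

    def forward(self, roi_feat):
        offsets_fc = self.fc_branch(roi_feat)                     # (N, 4)
        offsets_conv = self.conv_branch(roi_feat).flatten(1)      # (N, 4)
        # in the method, two groups of features would be re-extracted from the
        # adjusted candidate regions before fusion; here the input features are
        # reused purely for illustration
        fused = self.fuse(torch.cat([roi_feat, roi_feat], dim=1))
        return offsets_fc, offsets_conv, fused

# three cascaded stages: each stage regresses on the features fused by the
# previous stage
stages = nn.ModuleList(TwoPathRegressorStageSketch() for _ in range(3))
feat = torch.randn(2, 256, 7, 7)
for stage in stages:
    offsets_fc, offsets_conv, feat = stage(feat)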
5. The cascade regression target detection method according to claim 4, wherein, at each regression stage, a cross-entropy function is used to optimize classification and a Smooth L1 function is used to optimize regression, and the distance loss function is used to minimize the difference between the outputs of the two regression branches.
6. The cascade regression target detection method according to claim 5, wherein the distance loss function is defined as:

L_{dist} = \sum_{i} \mathrm{Smooth}_{L_1}\big( f_b(x_t)_i - f_d(x_t)_i \big)

wherein the Smooth L1 function is:

\mathrm{Smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}

wherein x_t is the fused feature information predicted by the two regression branches of the previous stage, f_b denotes the regression branch with the fully-connected architecture, f_d denotes the regression branch with the fully-convolutional architecture, and i indexes the center coordinates, width and height of the i-th candidate region.
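Under the reconstruction above, the distance loss between the two branches could be computed as in the following sketch; the reduction mode and the tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def distance_loss(offsets_fc: torch.Tensor, offsets_conv: torch.Tensor) -> torch.Tensor:
    # Smooth L1 distance between the outputs of the fully-connected branch f_b
    # and the fully-convolutional branch f_d for the same candidate regions;
    # both inputs have shape (num_candidates, 4): center x, center y, width, height
    return F.smooth_l1_loss(offsets_fc, offsets_conv, reduction="sum")

# illustrative usage with random branch outputs
fb = torch.randn(8, 4)
fd = torch.randn(8, 4)
print(distance_loss(fb, fd))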
7. The cascade regression target detection method according to claim 1, wherein the loss function comprises a cascaded region suggestion module loss function and a cascaded two-path regressor loss function:

Loss1: the cascaded region suggestion module loss function [formula image FDA0003267980470000023 as filed];

Loss2: the cascaded two-path regressor loss function [formula image FDA0003267980470000024 as filed];

wherein i denotes the index of the preset anchor box, p_i^* and t_i^* respectively denote the true class and the offset vector of the preset anchor box with index i, c_i and t_i denote the multi-class prediction probability and the coordinate detection of the second stage, N_1, N_2 and N_3 respectively denote the number of positive samples in the first, second and third detection stages of the cascade regression, L_m is the multi-class cross-entropy loss for determining the object class, and L_r is the Smooth L_1 loss function; the total loss is the weighted sum of the cascaded region suggestion module loss function and the cascaded two-path regressor loss function.
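Since Loss1 and Loss2 are given only as formula images in the filing, the sketch below is an assumption-laden approximation of how such a weighted combination is typically computed: a classification loss plus a Smooth L1 regression loss per stage, each normalized by that stage's positive-sample count, then summed with weights. It is not the exact loss of this application.

import torch
import torch.nn.functional as F

def total_loss(stage1, stage2, stage3, weights=(1.0, 1.0, 1.0)):
    # each `stage` tuple holds (class_logits, pred_offsets, true_labels,
    # true_offsets, num_positives); the form of the per-stage losses is an
    # assumption, not the formulas filed as images in claim 7
    losses = []
    for logits, pred, labels, target, num_pos in (stage1, stage2, stage3):
        cls = F.cross_entropy(logits, labels, reduction="sum")           # L_m (or binary CE)
        pos = labels > 0                                                 # positive anchors only
        reg = F.smooth_l1_loss(pred[pos], target[pos], reduction="sum")  # L_r
        losses.append((cls + reg) / max(num_pos, 1))                     # normalize by N_k
    return sum(w * l for w, l in zip(weights, losses))

# illustrative usage with random outputs: two binary stages, one 21-class stage
def fake_stage(num_classes, n=16):
    return (torch.randn(n, num_classes), torch.randn(n, 4),
            torch.randint(0, num_classes, (n,)), torch.randn(n, 4), 8)

print(total_loss(fake_stage(2), fake_stage(2), fake_stage(21)))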
8. The cascade regression target detection method according to claim 1, wherein the detecting the picture to be detected by using the trained network model comprises:
inputting Q test pictures into the trained network model for detection;
obtaining the detection results R = {R_1, R_2, …, R_q, …, R_Q}, stored by category;
and calculating the area intersection-over-union between the rotated rectangular boxes, performing non-maximum suppression, and keeping only detection boxes with higher scores and small mutual overlap as the final detection results.
9. The cascade regression target detection method according to claim 1, wherein the non-maximum suppression comprises: for the initial detection result R_q, sorting the prediction scores of all detection boxes of the same category in descending order, the sorted result being R'_q = {R'_{c1}, R'_{c2}, …, R'_{cf}, …, R'_{cF}}, wherein R'_{cf} is the sorted detection result of the f-th category;
for R'_{cf}, calculating the area intersection-over-union between any detection box b and all detection boxes whose prediction scores are smaller than the current score;
if the area intersection-over-union of two detection boxes exceeds the threshold t_iou, discarding the detection box b_s with the lower score.
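A minimal per-class non-maximum suppression consistent with claim 9 is sketched below; axis-aligned boxes are used for simplicity (the rotated-rectangle IoU mentioned in claim 8 needs a dedicated computation), and the 0.5 threshold is an assumption.

import torch
from torchvision.ops import box_iou

def per_class_nms(boxes: torch.Tensor, scores: torch.Tensor, t_iou: float = 0.5):
    # sort detections of one category by descending score and discard any box
    # whose IoU with a higher-scoring kept box exceeds t_iou
    order = scores.argsort(descending=True)           # descending-order sorting
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= t_iou]               # drop lower-scoring overlaps
    return keep

# illustrative usage: three boxes of one category, the first two overlapping
boxes = torch.tensor([[0.0, 0.0, 100.0, 100.0],
                      [10.0, 10.0, 110.0, 110.0],
                      [200.0, 200.0, 300.0, 300.0]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(per_class_nms(boxes, scores))  # [0, 2]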
10. A cascade regression target detection apparatus, comprising:
the acquisition module is used for acquiring the picture to be detected and/or the training picture;
the preprocessing module is used for carrying out pixel standardization and scaling operation on the detection picture and/or the training picture acquired by the acquisition module;
the neural network module is used for configuring the deep convolutional neural network, which comprises a backbone network for feature extraction, a cascaded region suggestion module for stepwise adjusting a preset frame, a cascaded two-path regressor module for fine-tuning the preset frame, and a distance loss function and a loss function for optimizing these modules;
the neural network module is also used for training by means of the training pictures and detecting the pictures to be detected so as to obtain a detection result.
11. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
CN202111092255.1A 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium Pending CN114241250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111092255.1A CN114241250A (en) 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111092255.1A CN114241250A (en) 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114241250A (en) 2022-03-25

Family

ID=80742983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111092255.1A Pending CN114241250A (en) 2021-09-17 2021-09-17 Cascade regression target detection method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114241250A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114925387A (en) * 2022-04-02 2022-08-19 北方工业大学 Sorting system and method based on end edge cloud architecture and readable storage medium
CN114925387B (en) * 2022-04-02 2024-06-07 北方工业大学 Sorting system, method and readable storage medium based on end-edge cloud architecture
CN114998840A (en) * 2022-07-18 2022-09-02 成都东方天呈智能科技有限公司 Mouse target detection method based on deep cascade supervised learning
CN117408304A (en) * 2023-12-14 2024-01-16 江苏未来网络集团有限公司 6D gesture prediction neural network model and method
CN117408304B (en) * 2023-12-14 2024-02-27 江苏未来网络集团有限公司 6D gesture prediction neural network model system and method

Similar Documents

Publication Publication Date Title
CN112818903B (en) Small sample remote sensing image target detection method based on meta-learning and cooperative attention
US10902615B2 (en) Hybrid and self-aware long-term object tracking
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
Long et al. Object detection in aerial images using feature fusion deep networks
CN114241250A (en) Cascade regression target detection method and device and computer readable storage medium
Zhao et al. Improved vision-based vehicle detection and classification by optimized YOLOv4
Chen et al. Research on recognition of fly species based on improved RetinaNet and CBAM
CN113221787B (en) Pedestrian multi-target tracking method based on multi-element difference fusion
CN111967480A (en) Multi-scale self-attention target detection method based on weight sharing
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
CN104517103A (en) Traffic sign classification method based on deep neural network
CN111310604A (en) Object detection method and device and storage medium
CN114359851A (en) Unmanned target detection method, device, equipment and medium
Chen et al. Corse-to-fine road extraction based on local Dirichlet mixture models and multiscale-high-order deep learning
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
Hou et al. Object detection in high-resolution panchromatic images using deep models and spatial template matching
CN111783523A (en) Remote sensing image rotating target detection method
CN114821014A (en) Multi-mode and counterstudy-based multi-task target detection and identification method and device
Tian et al. Object localization via evaluation multi-task learning
CN111696136A (en) Target tracking method based on coding and decoding structure
Cao et al. Learning spatial-temporal representation for smoke vehicle detection
CN113128564B (en) Typical target detection method and system based on deep learning under complex background
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Chen et al. A review of object detection: Datasets, performance evaluation, architecture, applications and current trends
Nag et al. ARCN: a real-time attention-based network for crowd counting from drone images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination