CN116311062A - Highway small target detection method - Google Patents
Highway small target detection method
- Publication number: CN116311062A (Application CN202310269546.6A)
- Authority: CN (China)
- Prior art keywords: features, image, representing, target detection, target
- Prior art date: 2023-03-20
- Legal status: Pending (assumed; Google has not performed a legal analysis)
Classifications
- G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06N3/08 — Neural networks; learning methods
- G06V10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/764 — Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/82 — Image or video recognition using neural networks
- G06V2201/07 — Target detection
Abstract
The invention discloses a method for detecting small targets on a highway, comprising the following steps: acquiring an unlabeled data set and performing data enhancement processing on each input image in the unlabeled data set to form a corresponding reconstructed image $\hat{x}_l$; establishing a target detection network model and detecting the reconstructed image $\hat{x}_l$ to obtain a corresponding target detection result. Different from traditional target detection models, the method achieves higher accuracy in identifying small-pixel targets, adapts well to abnormal highway weather scenes, and can detect abnormal objects on the expressway more accurately. A data enhancement method based on masked reconstruction yields more accurate frames for detecting small target objects; the loss function is improved for the characteristics of small target objects, namely few features and unbalanced samples, and an equalized focal loss is adopted to alleviate the class imbalance problem, thereby improving the accuracy of small target detection and making the method better suited to highway application.
Description
Technical Field
The invention belongs to the technical field of image recognition and computer vision, and particularly relates to a method for detecting a small target on a highway.
Background
The expressway is a symbol of modernization and an expression of a country's comprehensive national strength; its construction and operation bear on many aspects of the national economy and social life. However, objects other than vehicles, such as dropped cargo, animals and garbage, can appear on the expressway and pose serious safety hazards. Through computer vision technology, cameras can collect real-time images to detect foreign objects appearing on the highway, so that measures can be taken to deal with them in time and keep the highway clear.
Existing target detection methods are based on deep learning. Typically, data sets for several target classes are first collected, a generic target detection model is then trained on them, and the trained model is finally used for detection. Although current deep-learning-based detection methods achieve high precision, in images collected on highways foreign objects occupy few pixels, offer few usable features, and require high positioning precision, so such targets are difficult to detect and may even be missed entirely. The actual effect of existing target detection models applied to the expressway is therefore not ideal.
Disclosure of Invention
The invention aims to solve the above problems and provides a method for detecting small targets on an expressway, addressing the difficulty that traditional target detection models can hardly achieve a good detection effect on such targets.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides a method for detecting a small target on a highway, which comprises the following steps:
S1, acquiring an unlabeled data set $X = \{x_1, x_2, \dots, x_l, \dots, x_N\}$ and performing data enhancement processing on each input image in the unlabeled data set to form a corresponding reconstructed image $\hat{x}_l$, where $x_l$ represents the $l$-th input image, $l = 1, 2, \dots, N$;
S2, establishing a target detection network model and detecting the reconstructed image $\hat{x}_l$ to obtain a corresponding target detection result, wherein the target detection network model comprises a feature extraction module, a dynamic instance interaction head and a classification and regression branch unit; the feature extraction module adopts an FPN network, the dynamic instance interaction head comprises N feature extraction units, and each feature extraction unit comprises a self-attention module, a fully connected layer, a first convolution layer, a second convolution layer, a ReLU function and a view operation; the target detection network model executes the following operations:
S21, inputting the reconstructed image $\hat{x}_l$ into the feature extraction module to obtain a corresponding multi-scale feature map;
S22, setting N suggestion frames and corresponding suggestion features, wherein each suggestion frame is expressed as a four-dimensional vector formed by normalized center coordinates, height and width, and the suggestion features have the same dimension as the output features of the feature extraction module;
S23, putting the suggestion frames in one-to-one correspondence with the multi-scale feature map and obtaining the corresponding ROI features through a RoIAlign operation;
S24, inputting the suggestion features and the ROI features of each suggestion frame into the feature extraction units of the dynamic instance interaction head in one-to-one correspondence to obtain corresponding target frames and target features, wherein each feature extraction unit executes the following operations:
performing a self-attention operation on the suggestion features with the self-attention module to obtain first features;
converting the first features into a one-dimensional vector through the fully connected layer to form second features;
inputting the ROI features and the second features into the first convolution layer, passing sequentially through the second convolution layer and the ReLU function, and then adjusting the dimensions with the view operation to obtain the corresponding target features;
S25, updating the suggestion frames and suggestion features to the target frames and target features, and returning to step S23 until the set number of iterations is completed, obtaining interaction features;
S26, inputting the interaction features into the classification and regression branch unit to obtain the target detection result.
Preferably, performing the data enhancement processing on each input image in the unlabeled data set to form the corresponding reconstructed image $\hat{x}_l$ is realized by a data enhancement module; the data enhancement module comprises a first encoder, a second encoder and a decoder and performs the following operations:
S11, training the second encoder and the decoder with the unlabeled data set $X = \{x_1, x_2, \dots, x_l, \dots, x_N\}$, wherein the second encoder $E_\theta$ with learnable parameters $\theta$ and the decoder $D_\phi$ satisfy $\hat{x} = D_\phi(E_\theta(M \odot x))$ ($\odot$ denoting element-wise multiplication), and $M \in \{0,1\}^{W \times H}$ represents a block-wise binary mask over the image pixels, W representing the pixel width of the input image $x$ and H representing its pixel height;
S12, dividing each input image into S image blocks;
S13, executing the following operations on each divided input image:
S131, converting the divided input image into vectors by using the first encoder;
S132, acquiring the attention map $\mathrm{Attn}_i$ of the $i$-th image block based on the attention policy:

$$\mathrm{Attn}_i = q_{cls} \cdot k_i, \qquad i \in \{0, 1, \dots, p^2 - 1\}$$

where $q_{cls}$ represents the query of the image block sequence, $k_i$ represents the key embedding of the $i$-th image block, and $p$ represents the number of image blocks along each side of the image;
S133, acquiring the index set $\Omega$ of the top K blocks by sorting the attention maps:

$$\Omega = \mathrm{top\text{-}rank}(\mathrm{Attn}, K)$$

where $\mathrm{top\text{-}rank}(\cdot, K)$ returns the indices of the K largest elements and $\mathrm{Attn}$ represents the set of all $\mathrm{Attn}_i$;
S134, obtaining the binary mask $M^*$:

$$M^*(a, b) = \begin{cases} 0, & (a, b) = \left(\left\lfloor \Omega_i / p \right\rfloor,\ \mathrm{mod}(\Omega_i, p)\right) \text{ for some } \Omega_i \in \Omega \\ 1, & \text{otherwise} \end{cases}$$

where $\lfloor \cdot \rfloor$ represents the round-down operation, $\mathrm{mod}(\cdot)$ represents the modulo operation, and $\Omega_i$ represents the $i$-th element in the index set $\Omega$;
S135, acquiring the masking image $M^* \odot x$ according to the binary mask $M^*$, dividing the masking image into non-overlapping image blocks and discarding the image blocks blocked by the binary mask, and feeding the remaining visible image blocks into the pre-trained second encoder and decoder to generate the corresponding reconstructed image $\hat{x}_l$.
Preferably, the loss function $\mathcal{L}$ of the target detection network model is calculated as follows:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{diou}\,\mathcal{L}_{diou}$$

wherein

$$\mathcal{L}_{cls} = -\sum_{j=1}^{T} \frac{\gamma_j}{\gamma_b}\,\alpha_t\,(1 - p_t)^{\gamma_j}\,\log(p_t), \qquad \gamma_j = \gamma_b + \gamma_v^j, \qquad \gamma_v^j = s\,(1 - g_j)$$

$$\mathcal{L}_{L1} = \sum_{z=1}^{n} \left| y_{pz} - y_{gz} \right|, \qquad \mathcal{L}_{diou} = 1 - \mathrm{IOU} + \frac{\rho^2(b_p, b_g)}{c^2}$$

where $\mathcal{L}_{cls}$ is the equalized focal loss between the predicted and true classifications, $\mathcal{L}_{L1}$ is the L1 loss between the predicted and real frames, and $\mathcal{L}_{diou}$ is the distance intersection-over-union (DIoU) loss between the predicted and real frames; $\lambda_{cls}$, $\lambda_{L1}$ and $\lambda_{diou}$ are the coefficients corresponding in turn to $\mathcal{L}_{cls}$, $\mathcal{L}_{L1}$ and $\mathcal{L}_{diou}$; $\alpha_t$ is the weight factor balancing the numbers of positive and negative samples; $p_t$ is the predicted probability of being a positive sample; $\gamma_j$ is the focusing coefficient of the $j$-th class, $j = 1, 2, \dots, T$, with T the total number of classes; $\gamma_j$ is decoupled into a first component $\gamma_b$, which controls the basic behavior of the classifier, and a second component $\gamma_v^j$, a variable parameter selected by a gradient guidance mechanism; $g_j$ is the cumulative gradient ratio of the positive to negative samples of the $j$-th class, with value range $[0, 1]$; $s$ is a scale factor determining the upper limit of $\gamma_j$; $y_{pz}$ represents a predicted value and $y_{gz}$ the corresponding true value, $z = 1, 2, \dots, n$, with $n$ the number of target objects; $\rho^2(b_p, b_g)$ represents the squared distance between the center point $b_p$ of the predicted frame and the center point $b_g$ of the real frame; $c$ represents the diagonal distance of the smallest rectangle covering both the predicted and real frames; $\rho^2(b_p, b_g)/c^2$ is the penalty term; and IOU is the intersection-over-union.
Preferably, the number of iterations e = 6, and the number of suggestion frames and suggestion features N = 100.
Compared with the prior art, the invention has the beneficial effects that:
the method is different from the traditional target detection model, has higher precision in identifying the small pixel targets, has the characteristics of strong adaptability to abnormal weather scenes of the expressway and the like, can more accurately detect abnormal objects on the expressway, obtains a more accurate frame for detecting the small target objects by using a data enhancement method of masking reconstruction, aims at the characteristics of small target objects, such as few characteristics and unbalanced samples, improves a loss function, adopts an equilibrium focus loss function to relieve the problem of unbalanced categories, balances the loss contribution of the difficult samples of positive and negative samples, thereby improving the precision of small target detection, and is better applied to the expressway.
Drawings
FIG. 1 is a flow chart of a method for detecting a small target on a highway according to the present invention;
FIG. 2 is a schematic diagram of a target detection network model according to the present invention;
FIG. 3 is a schematic diagram of a data enhancement module according to the present invention;
FIG. 4 is a schematic diagram of a feature extraction module according to the present invention;
FIG. 5 is a schematic diagram of an interaction process of the dynamic instance interaction head of the present invention.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1 to 5, a method for detecting a small target on an expressway includes the steps of:
S1, acquiring an unlabeled data set $X = \{x_1, x_2, \dots, x_l, \dots, x_N\}$ and performing data enhancement processing on each input image in the unlabeled data set to form the corresponding reconstructed image $\hat{x}_l$, where $x_l$ represents the $l$-th input image, $l = 1, 2, \dots, N$.
In one embodiment, performing the data enhancement processing on each input image in the unlabeled data set to form the corresponding reconstructed image $\hat{x}_l$ is realized by a data enhancement module; the data enhancement module comprises a first encoder, a second encoder and a decoder and performs the following operations:
S11, training the second encoder and the decoder with the unlabeled data set $X = \{x_1, x_2, \dots, x_l, \dots, x_N\}$, wherein the second encoder $E_\theta$ with learnable parameters $\theta$ and the decoder $D_\phi$ satisfy $\hat{x} = D_\phi(E_\theta(M \odot x))$ ($\odot$ denoting element-wise multiplication), and $M \in \{0,1\}^{W \times H}$ represents a block-wise binary mask over the image pixels, W representing the pixel width of the input image $x$ and H representing its pixel height;
S12, dividing each input image into S image blocks;
S13, executing the following operations on each divided input image:
S131, converting the divided input image into vectors by using the first encoder;
S132, acquiring the attention map $\mathrm{Attn}_i$ of the $i$-th image block based on the attention policy:

$$\mathrm{Attn}_i = q_{cls} \cdot k_i, \qquad i \in \{0, 1, \dots, p^2 - 1\}$$

where $q_{cls}$ represents the query of the image block sequence, $k_i$ represents the key embedding of the $i$-th image block, and $p$ represents the number of image blocks along each side of the image;
S133, acquiring the index set $\Omega$ of the top K blocks by sorting the attention maps:

$$\Omega = \mathrm{top\text{-}rank}(\mathrm{Attn}, K)$$

where $\mathrm{top\text{-}rank}(\cdot, K)$ returns the indices of the K largest elements and $\mathrm{Attn}$ represents the set of all $\mathrm{Attn}_i$;
S134, obtaining the binary mask $M^*$:

$$M^*(a, b) = \begin{cases} 0, & (a, b) = \left(\left\lfloor \Omega_i / p \right\rfloor,\ \mathrm{mod}(\Omega_i, p)\right) \text{ for some } \Omega_i \in \Omega \\ 1, & \text{otherwise} \end{cases}$$

where $\lfloor \cdot \rfloor$ represents the round-down operation, $\mathrm{mod}(\cdot)$ represents the modulo operation, and $\Omega_i$ represents the $i$-th element in the index set $\Omega$;
S135, acquiring the masking image $M^* \odot x$ according to the binary mask $M^*$, dividing the masking image into non-overlapping image blocks and discarding the image blocks blocked by the binary mask, and feeding the remaining visible image blocks into the pre-trained second encoder and decoder to generate the corresponding reconstructed image $\hat{x}_l$.
As shown in fig. 3, the data enhancement module includes a first Encoder, a second Encoder and a Decoder; the heat map (Heat map) visualizes the relationships between image blocks after the attention operation, and Top-k denotes the sorting of the attention maps.
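To make the masking pipeline concrete, the following is a minimal sketch of steps S132–S134, assuming a ViT-style first encoder from which the query of the image block sequence and the per-block key embeddings have already been extracted; the function names, tensor shapes and the choice to mask the most-attended blocks are illustrative assumptions, not taken from the patent.

```python
import torch

def attention_mask(q_cls, keys, p, K):
    """Build the block-wise binary mask M* of S132-S134.

    q_cls: (d,) query of the image-block sequence; keys: (p*p, d) key
    embeddings of the p*p image blocks; K: number of most-attended
    blocks to mask. Returns a (p, p) mask, 0 = blocked, 1 = visible.
    """
    attn = keys @ q_cls                        # Attn_i = q_cls . k_i
    omega = torch.topk(attn, K).indices        # Omega = top-rank(Attn, K)
    mask = torch.ones(p, p)
    mask[omega // p, omega % p] = 0.0          # round-down / modulo -> block grid coordinates
    return mask

def masking_image(x, mask):
    """Masking image M* ⊙ x: zero out the blocked blocks of x (C, H, W)."""
    p = mask.shape[0]
    bh, bw = x.shape[1] // p, x.shape[2] // p  # pixel size of one image block
    m = mask.repeat_interleave(bh, 0).repeat_interleave(bw, 1)
    return x * m
```

The visible blocks of the resulting masking image $M^* \odot x$ would then be fed to the pre-trained second encoder and decoder (S135) to produce the reconstructed image.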
S2, establishing a target detection network model and reconstructing an imageDetecting to obtain a corresponding target detection result, wherein the target detection network model comprises a feature extraction module and a dynamic instance intersectionThe feature extraction module adopts an FPN network, the dynamic instance interaction head comprises N feature extraction units, each feature extraction unit comprises a self-attention module, a full-connection layer, a first convolution layer, a second convolution layer, a ReLu function and view operation, and the target detection network model executes the following operations:
S21, inputting the reconstructed image $\hat{x}_l$ into the feature extraction module to obtain a corresponding multi-scale feature map;
S22, setting N suggestion frames and corresponding suggestion features, wherein each suggestion frame is expressed as a four-dimensional vector formed by normalized center coordinates, height and width, and the suggestion features have the same dimension as the output features of the feature extraction module;
S23, putting the suggestion frames in one-to-one correspondence with the multi-scale feature map and obtaining the corresponding ROI features through a RoIAlign operation;
S24, inputting the suggestion features and the ROI features of each suggestion frame into the feature extraction units of the dynamic instance interaction head in one-to-one correspondence to obtain corresponding target frames and target features, wherein each feature extraction unit executes the following operations:
performing a self-attention operation on the suggestion features with the self-attention module to obtain first features;
converting the first features into a one-dimensional vector through the fully connected layer to form second features;
inputting the ROI features and the second features into the first convolution layer, passing sequentially through the second convolution layer and the ReLU function, and then adjusting the dimensions with the view operation to obtain the corresponding target features;
S25, updating the suggestion frames and suggestion features to the target frames and target features, and returning to step S23 until the set number of iterations is completed, obtaining interaction features;
S26, inputting the interaction features into the classification and regression branch unit to obtain the target detection result.
As shown in FIG. 5, Propos Feat represents the suggestion features, Roi Feat represents the ROI features, Self-Attention represents the self-attention module, and Parmas represents the second features. In fig. 2, the feature vector represents the suggestion features and ROI features of each suggestion frame.
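For illustration, one feature extraction unit of the dynamic instance interaction head can be sketched as below, in the spirit of the dynamic head of Sparse R-CNN, which steps S22–S25 closely follow; the hidden width, the 8 attention heads and the 7×7 RoIAlign grid are assumptions here, and the target-frame regression branch is omitted.

```python
import torch
import torch.nn as nn

class InteractionUnit(nn.Module):
    """One feature extraction unit of the dynamic instance interaction head (S24):
    self-attention over the N suggestion features, a fully connected layer that
    flattens the result into the 'second feature', and two dynamic 1x1
    convolutions with ReLU whose kernels are generated from that second feature."""
    def __init__(self, d=256, hidden=64, roi=7):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self.fc = nn.Linear(d, 2 * d * hidden)        # second feature = both kernel sets
        self.out = nn.Linear(roi * roi * d, d)        # view operation + projection
        self.d, self.hidden = d, hidden

    def forward(self, prop_feat, roi_feat):
        # prop_feat: (B, N, d); roi_feat: (B*N, roi*roi, d) from RoIAlign (S23)
        first, _ = self.self_attn(prop_feat, prop_feat, prop_feat)  # first features
        second = self.fc(first).flatten(0, 1)         # one-dimensional vector per proposal
        k1 = second[:, : self.d * self.hidden].view(-1, self.d, self.hidden)
        k2 = second[:, self.d * self.hidden :].view(-1, self.hidden, self.d)
        h = torch.relu(roi_feat @ k1)                 # first convolution layer + ReLU
        h = torch.relu(h @ k2)                        # second convolution layer + ReLU
        return self.out(h.flatten(1))                 # target features: (B*N, d)
```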
In one embodiment, the loss function $\mathcal{L}$ of the target detection network model is calculated as follows:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{diou}\,\mathcal{L}_{diou}$$

wherein

$$\mathcal{L}_{cls} = -\sum_{j=1}^{T} \frac{\gamma_j}{\gamma_b}\,\alpha_t\,(1 - p_t)^{\gamma_j}\,\log(p_t), \qquad \gamma_j = \gamma_b + \gamma_v^j, \qquad \gamma_v^j = s\,(1 - g_j)$$

$$\mathcal{L}_{L1} = \sum_{z=1}^{n} \left| y_{pz} - y_{gz} \right|, \qquad \mathcal{L}_{diou} = 1 - \mathrm{IOU} + \frac{\rho^2(b_p, b_g)}{c^2}$$

where $\mathcal{L}_{cls}$ is the equalized focal loss between the predicted and true classifications, $\mathcal{L}_{L1}$ is the L1 loss between the predicted and real frames, and $\mathcal{L}_{diou}$ is the distance intersection-over-union (DIoU) loss between the predicted and real frames; $\lambda_{cls}$, $\lambda_{L1}$ and $\lambda_{diou}$ are the coefficients corresponding in turn to $\mathcal{L}_{cls}$, $\mathcal{L}_{L1}$ and $\mathcal{L}_{diou}$; $\alpha_t$ is the weight factor balancing the numbers of positive and negative samples; $p_t$ is the predicted probability of being a positive sample; $\gamma_j$ is the focusing coefficient of the $j$-th class, $j = 1, 2, \dots, T$, with T the total number of classes; $\gamma_j$ is decoupled into a first component $\gamma_b$, which controls the basic behavior of the classifier, and a second component $\gamma_v^j$, a variable parameter selected by a gradient guidance mechanism; $g_j$ is the cumulative gradient ratio of the positive to negative samples of the $j$-th class, with value range $[0, 1]$; $s$ is a scale factor determining the upper limit of $\gamma_j$; $y_{pz}$ represents a predicted value and $y_{gz}$ the corresponding true value, $z = 1, 2, \dots, n$, with $n$ the number of target objects; $\rho^2(b_p, b_g)$ represents the squared distance between the center point $b_p$ of the predicted frame and the center point $b_g$ of the real frame; $c$ represents the diagonal distance of the smallest rectangle covering both the predicted and real frames; $\rho^2(b_p, b_g)/c^2$ is the penalty term; and IOU is the intersection-over-union.
In one embodiment, the number of iterations e = 6, and the number of suggestion frames and suggestion features N = 100.
Specifically, the feature extraction module in this embodiment uses an FPN network based on ResNet. The FPN network is a feature pyramid, with the structure shown in fig. 4, obtained by the following steps: (1) the bottom-up path: passing through the backbone, the feature activation output of the last residual structure of each stage is used, and the outputs of the residual modules conv2, conv3, conv4, conv5 are denoted as $\{C_2, C_3, C_4, C_5\}$; (2) the top-down path: the deep feature maps are upsampled to a higher resolution, and each upsampled feature map is merged with the corresponding bottom-up feature map through a lateral connection, as shown in fig. 4, constructing the feature maps $P_2$ to $P_5$. The pyramid level is denoted by $l$; the resolution of each level's feature map is $2^l$ times lower than that of the input image, and all pyramid levels have 256 channels. Let the size of the reconstructed image $\hat{x}$ be $h \times w$, with $h$ the height and $w$ the width of the reconstructed image. The outputs of the stages of the FPN network are shown in Table 1 below:
TABLE 1

Pyramid level | Output feature map
---|---
$P_2$ | $h/4 \times w/4 \times 256$
$P_3$ | $h/8 \times w/8 \times 256$
$P_4$ | $h/16 \times w/16 \times 256$
$P_5$ | $h/32 \times w/32 \times 256$
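For orientation, the multi-scale extraction of S21 can be reproduced with torchvision's stock FPN; the C2–C5 channel widths below are standard ResNet-50 values and are assumed here rather than stated in the patent.

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

# Assumed ResNet-50 stage widths for C2..C5; the pyramid has 256 channels as in Table 1.
fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048], out_channels=256)

h, w = 512, 512
feats = OrderedDict(
    (f"p{l}", torch.randn(1, c, h // 2**l, w // 2**l))
    for l, c in zip(range(2, 6), [256, 512, 1024, 2048])
)
pyramid = fpn(feats)
for name, t in pyramid.items():
    print(name, tuple(t.shape))   # p2 (1, 256, 128, 128) ... p5 (1, 256, 16, 16)
```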
The suggestion frames and suggestion features are both learnable and are in one-to-one correspondence. A set of learnable target frames serves as the region proposals; each is represented by four parameters ranging from 0 to 1: the normalized center coordinates, height and width. The parameters of the suggestion frames are updated during training by the backpropagation algorithm. Backpropagation is currently the most common method for training artificial neural networks: it propagates the error at the output layer backwards layer by layer and updates the network parameters by computing partial derivatives so that the error loss function is minimized. These learnable suggestion frames are statistics of potential target locations in the training set and can be seen as initial guesses of the regions most likely to contain targets in an image, regardless of the input. However, these frames provide only rough positioning information and lose the pose and shape of the object, which is detrimental to subsequent classification and regression, so each instance is also characterized by a learnable suggestion feature: a high-dimensional latent vector encoding rich instance properties.
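A minimal sketch of the learnable suggestion frames and suggestion features as embedding tables updated by backpropagation; initializing every frame to cover the whole image is the Sparse R-CNN convention and only an assumption here.

```python
import torch.nn as nn

N, d = 100, 256                          # numbers taken from the embodiment above
proposal_boxes = nn.Embedding(N, 4)      # normalized (cx, cy, h, w), each in [0, 1]
proposal_feats = nn.Embedding(N, d)      # latent vector encoding instance properties

# Assumed initialization: every suggestion frame starts as the full image.
nn.init.constant_(proposal_boxes.weight[:, :2], 0.5)   # center
nn.init.constant_(proposal_boxes.weight[:, 2:], 1.0)   # height, width
```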
The classification and regression branch unit follows the prior art: for example, regression prediction is performed by a three-layer perceptron and classification prediction by a linear mapping layer, which are not described further here. The target detection network model uses a set prediction loss for classification and frame coordinate prediction over a fixed-size set, and this set-based loss produces an optimal bipartite matching between predicted and real objects. For example, the target detection network model predicts 100 target frames, and the set of real frames is padded to 100 entries, so both the prediction set and the real set contain 100 elements. The Hungarian algorithm puts the elements of the two sets in one-to-one correspondence so that the matching loss is minimal, and the loss is then calculated over the matched positive and negative sample pairs.
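The bipartite matching itself can be sketched as follows, assuming the pairwise matching cost between the 100 predicted frames and the padded real frames has already been assembled; `linear_sum_assignment` from SciPy implements the Hungarian algorithm.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost: torch.Tensor):
    """cost: (N_pred, N_gt) matching loss between every predicted frame and
    every real frame; returns index pairs minimizing the total matching loss."""
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(rows), torch.as_tensor(cols)

# e.g. cost = lam_cls * cls_cost + lam_l1 * l1_cost + lam_diou * diou_cost
```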
In the loss function $\mathcal{L}$ of the target detection network model, $\gamma_j$ plays the role of balancing hard and easy samples, while $\gamma_b$ controls the basic behavior of the classifier and does not act on the class imbalance problem. $\gamma_v^j$ is a variable parameter that determines how much attention the $j$-th category pays to the positive-negative imbalance problem, and a gradient guidance mechanism is adopted to select $\gamma_v^j$; to better meet this requirement, $g_j$ is in practice controlled within the range $[0, 1]$. The weight coefficient $\gamma_j/\gamma_b$ balances the loss contributions of different categories so that rare samples contribute more loss than common samples: for rare-class data the weight coefficient is set to a larger value to increase its loss contribution, while for frequent-class data it remains around 1. The final loss $\mathcal{L}$ is the sum over all matched pairs, normalized by the number of objects in the training batch.
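A minimal sketch of the equalized focal loss with the decoupled focusing coefficient $\gamma_j = \gamma_b + s(1 - g_j)$ described above; treating the cumulative gradient ratio $g_j$ as an externally maintained input, and the default values of $\gamma_b$, $s$ and $\alpha_t$, are assumptions of this sketch.

```python
import torch

def equalized_focal_loss(logits, targets, g, gamma_b=2.0, s=8.0, alpha_t=0.25):
    """logits: (B, T) raw class scores; targets: (B, T) one-hot labels;
    g: (T,) cumulative positive/negative gradient ratio per class, in [0, 1]."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)        # probability of the correct decision
    gamma_v = s * (1 - g)                            # variable, category-specific component
    gamma_j = gamma_b + gamma_v                      # decoupled focusing coefficient
    weight = gamma_j / gamma_b                       # re-weights rare classes upward
    loss = -weight * alpha_t * (1 - p_t) ** gamma_j * torch.log(p_t.clamp(min=1e-8))
    return loss.sum() / targets.sum().clamp(min=1)   # normalized by number of objects
```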
The variety of spilled goods, animals and garbage occurring on the expressway is not large, and the numbers of samples in the different categories are extremely unbalanced. To address this extreme class imbalance, a focusing coefficient and a weight coefficient are added on top of the traditional focal loss. This embodiment also adopts the distance intersection-over-union (DIoU) loss: under real-time camera shooting, scattered objects on the expressway occupy a small proportion of the image, so the predicted frame is often larger than the real frame and fully contains it; in that case the plain IoU loss is the same whether the real frame lies at the center or at a corner of the predicted frame. The added penalty term $\rho^2(b_p, b_g)/c^2$ measures the distance between the center points of the target frame and the predicted frame and resolves this ambiguity.
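A sketch of the DIoU loss with its center-distance penalty term $\rho^2(b_p, b_g)/c^2$, for frames given in corner form $(x_1, y_1, x_2, y_2)$; the box representation is an assumption.

```python
import torch

def diou_loss(pred, gt):
    """pred, gt: (N, 4) boxes as (x1, y1, x2, y2); L_diou = 1 - IOU + rho^2 / c^2."""
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter)
    # rho^2: squared distance between the centers b_p and b_g
    rho2 = (((pred[:, :2] + pred[:, 2:]) - (gt[:, :2] + gt[:, 2:])) / 2).pow(2).sum(dim=1)
    # c^2: squared diagonal of the smallest rectangle covering both frames
    c2 = (torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])).pow(2).sum(dim=1)
    return (1 - iou + rho2 / c2).mean()
```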
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this description.
The above-described embodiments are merely representative of the more specific and detailed embodiments described herein and are not to be construed as limiting the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (4)
1. A method for detecting a small target on an expressway, characterized by comprising the following steps:
S1, acquiring an unlabeled data set $X = \{x_1, x_2, \dots, x_l, \dots, x_N\}$ and performing data enhancement processing on each input image in the unlabeled data set to form a corresponding reconstructed image $\hat{x}_l$, $x_l$ representing the $l$-th input image, $l = 1, 2, \dots, N$;
S2, establishing a target detection network model and detecting the reconstructed image $\hat{x}_l$ to obtain a corresponding target detection result, wherein the target detection network model comprises a feature extraction module, a dynamic instance interaction head and a classification and regression branch unit; the feature extraction module adopts an FPN network, the dynamic instance interaction head comprises N feature extraction units, and each feature extraction unit comprises a self-attention module, a fully connected layer, a first convolution layer, a second convolution layer, a ReLU function and a view operation; the target detection network model executes the following operations:
S21, inputting the reconstructed image $\hat{x}_l$ into the feature extraction module to obtain a corresponding multi-scale feature map;
S22, setting N suggestion frames and corresponding suggestion features, wherein each suggestion frame is expressed as a four-dimensional vector formed by normalized center coordinates, height and width, and the suggestion features have the same dimension as the output features of the feature extraction module;
S23, putting the suggestion frames in one-to-one correspondence with the multi-scale feature map and obtaining the corresponding ROI features through a RoIAlign operation;
S24, inputting the suggestion features and the ROI features of each suggestion frame into the feature extraction units of the dynamic instance interaction head in one-to-one correspondence to obtain corresponding target frames and target features, wherein each feature extraction unit executes the following operations:
performing a self-attention operation on the suggestion features with the self-attention module to obtain first features;
converting the first features into a one-dimensional vector through the fully connected layer to form second features;
inputting the ROI features and the second features into the first convolution layer, passing sequentially through the second convolution layer and the ReLU function, and then adjusting the dimensions with the view operation to obtain the corresponding target features;
S25, updating the suggestion frames and suggestion features to the target frames and target features, and returning to step S23 until the set number of iterations is completed, obtaining interaction features;
S26, inputting the interaction features into the classification and regression branch unit to obtain the target detection result.
2. The highway small target detection method according to claim 1, wherein performing the data enhancement processing on each input image in the unlabeled data set to form the corresponding reconstructed image $\hat{x}_l$ is realized by a data enhancement module, the data enhancement module comprising a first encoder, a second encoder and a decoder and performing the following operations:
S11, training the second encoder and the decoder with the unlabeled data set $X = \{x_1, x_2, \dots, x_l, \dots, x_N\}$, wherein the second encoder $E_\theta$ with learnable parameters $\theta$ and the decoder $D_\phi$ satisfy $\hat{x} = D_\phi(E_\theta(M \odot x))$ ($\odot$ denoting element-wise multiplication), and $M \in \{0,1\}^{W \times H}$ represents a block-wise binary mask over the image pixels, W representing the pixel width of the input image $x$ and H representing its pixel height;
S12, dividing each input image into S image blocks;
S13, executing the following operations on each divided input image:
S131, converting the divided input image into vectors by using the first encoder;
S132, acquiring the attention map $\mathrm{Attn}_i$ of the $i$-th image block based on the attention policy:

$$\mathrm{Attn}_i = q_{cls} \cdot k_i, \qquad i \in \{0, 1, \dots, p^2 - 1\}$$

where $q_{cls}$ represents the query of the image block sequence, $k_i$ represents the key embedding of the $i$-th image block, and $p$ represents the number of image blocks along each side of the image;
S133, acquiring the index set $\Omega$ of the top K blocks by sorting the attention maps:

$$\Omega = \mathrm{top\text{-}rank}(\mathrm{Attn}, K)$$

where $\mathrm{top\text{-}rank}(\cdot, K)$ returns the indices of the K largest elements and $\mathrm{Attn}$ represents the set of all $\mathrm{Attn}_i$;
S134, obtaining the binary mask $M^*$:

$$M^*(a, b) = \begin{cases} 0, & (a, b) = \left(\left\lfloor \Omega_i / p \right\rfloor,\ \mathrm{mod}(\Omega_i, p)\right) \text{ for some } \Omega_i \in \Omega \\ 1, & \text{otherwise} \end{cases}$$

where $\lfloor \cdot \rfloor$ represents the round-down operation, $\mathrm{mod}(\cdot)$ represents the modulo operation, and $\Omega_i$ represents the $i$-th element in the index set $\Omega$;
S135, acquiring the masking image $M^* \odot x$ according to the binary mask $M^*$, dividing the masking image into non-overlapping image blocks and discarding the image blocks blocked by the binary mask, and feeding the remaining visible image blocks into the pre-trained second encoder and decoder to generate the corresponding reconstructed image $\hat{x}_l$.
3. The highway small target detection method according to claim 1, wherein the loss function $\mathcal{L}$ of the target detection network model is calculated as follows:

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{diou}\,\mathcal{L}_{diou}$$

wherein

$$\mathcal{L}_{cls} = -\sum_{j=1}^{T} \frac{\gamma_j}{\gamma_b}\,\alpha_t\,(1 - p_t)^{\gamma_j}\,\log(p_t), \qquad \gamma_j = \gamma_b + \gamma_v^j, \qquad \gamma_v^j = s\,(1 - g_j)$$

$$\mathcal{L}_{L1} = \sum_{z=1}^{n} \left| y_{pz} - y_{gz} \right|, \qquad \mathcal{L}_{diou} = 1 - \mathrm{IOU} + \frac{\rho^2(b_p, b_g)}{c^2}$$

where $\mathcal{L}_{cls}$ is the equalized focal loss between the predicted and true classifications, $\mathcal{L}_{L1}$ is the L1 loss between the predicted and real frames, and $\mathcal{L}_{diou}$ is the distance intersection-over-union (DIoU) loss between the predicted and real frames; $\lambda_{cls}$, $\lambda_{L1}$ and $\lambda_{diou}$ are the coefficients corresponding in turn to $\mathcal{L}_{cls}$, $\mathcal{L}_{L1}$ and $\mathcal{L}_{diou}$; $\alpha_t$ is the weight factor balancing the numbers of positive and negative samples; $p_t$ is the predicted probability of being a positive sample; $\gamma_j$ is the focusing coefficient of the $j$-th class, $j = 1, 2, \dots, T$, with T the total number of classes; $\gamma_j$ is decoupled into a first component $\gamma_b$, which controls the basic behavior of the classifier, and a second component $\gamma_v^j$, a variable parameter selected by a gradient guidance mechanism; $g_j$ is the cumulative gradient ratio of the positive to negative samples of the $j$-th class, with value range $[0, 1]$; $s$ is a scale factor determining the upper limit of $\gamma_j$; $y_{pz}$ represents a predicted value and $y_{gz}$ the corresponding true value, $z = 1, 2, \dots, n$, with $n$ the number of target objects; $\rho^2(b_p, b_g)$ represents the squared distance between the center point $b_p$ of the predicted frame and the center point $b_g$ of the real frame; $c$ represents the diagonal distance of the smallest rectangle covering both the predicted and real frames; $\rho^2(b_p, b_g)/c^2$ is the penalty term; and IOU is the intersection-over-union.
4. The highway small target detection method according to claim 1, wherein the number of iterations e = 6, and the number of suggestion frames and suggestion features N = 100.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310269546.6A | 2023-03-20 | 2023-03-20 | Highway small target detection method
Publications (1)

Publication Number | Publication Date
---|---
CN116311062A | 2023-06-23
Family ID: 86802786
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310269546.6A | Highway small target detection method | 2023-03-20 | 2023-03-20
Country Status (1)

Country | Link
---|---
CN | CN116311062A (en)
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117576217A | 2024-01-12 | 2024-02-20 | 电子科技大学 | Object pose estimation method based on single-instance image reconstruction
CN117576217B | 2024-01-12 | 2024-03-26 | 电子科技大学 | Object pose estimation method based on single-instance image reconstruction
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination