CN116311062A - Highway small target detection method - Google Patents

Highway small target detection method

Info

Publication number
CN116311062A
CN202310269546.6A · CN116311062A
Authority
CN
China
Prior art keywords: features, image, representing, target detection, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310269546.6A
Other languages
Chinese (zh)
Inventor
邵奇可
郑泖琛
叶文武
颜世航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310269546.6A priority Critical patent/CN116311062A/en
Publication of CN116311062A publication Critical patent/CN116311062A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects (G06V 20/00 Scenes; scene-specific elements; G06V 20/50 Context or environment of the image)
    • G06N 3/08: Learning methods (G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks)
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation (G06V 10/40 Extraction of image or video features)
    • G06V 10/764: Recognition using pattern recognition or machine learning, using classification, e.g. of video objects (G06V 10/70)
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting (G06V 10/77 Processing image or video features in feature spaces)
    • G06V 10/82: Recognition using pattern recognition or machine learning, using neural networks (G06V 10/70)
    • G06V 2201/07: Target detection (G06V 2201/00 Indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for detecting small targets on an expressway, which comprises the following steps: acquiring an unlabeled data set and performing data enhancement processing on each input image in the unlabeled data set to form a corresponding reconstructed image x̂; establishing a target detection network model and detecting the reconstructed image x̂ to obtain the corresponding target detection result. Unlike conventional target detection models, the method achieves higher accuracy in identifying small-pixel targets and adapts well to abnormal weather scenes on expressways, so abnormal objects on the expressway can be detected more accurately. A data enhancement method based on masked reconstruction yields more accurate bounding boxes for small target objects; the loss function is improved for the characteristics of small targets, namely few available features and unbalanced samples, and a balanced focal loss function is adopted to alleviate the class-imbalance problem, thereby improving the accuracy of small target detection and making the method better suited to expressway application.

Description

Highway small target detection method
Technical Field
The invention belongs to the technical field of image recognition and computer vision, and particularly relates to a method for detecting a small target on a highway.
Background
The expressway is a symbol of modernization and a reflection of a country's comprehensive national strength; its construction and operation touch many aspects of the national economy and social life. However, objects other than vehicles, such as cargo, animals and garbage, can appear on the expressway and pose serious safety hazards. With computer vision technology, cameras can collect real-time images to detect foreign objects appearing on the expressway, so that measures can be taken in time and the expressway kept clear.
Existing target detection methods are mostly based on deep learning. Typically, data sets of several target classes are first collected, a generic target detection model is then trained on them, and the trained model is finally used for detection. Although current deep-learning-based methods achieve high detection accuracy in general, in images collected on expressways the foreign objects occupy few pixels, offer few usable features and require precise localization, so such targets are difficult to detect and may even be missed. The practical effect of existing target detection models applied to the expressway is therefore far from ideal.
Disclosure of Invention
The invention aims to solve the above problems and provides a method for detecting small targets on an expressway, so as to overcome the difficulty that conventional target detection models can hardly achieve a good detection effect in this scenario.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides a method for detecting a small target on a highway, which comprises the following steps:
s1, acquiring an unlabeled data set X= { X 1 ,x 2 ,…,x l ,…,x N Performing data enhancement processing on each input image in the unlabeled dataset to form a corresponding reconstructed image
Figure BDA0004134168850000011
x l Representing the first input image, l=1, 2, …, N;
s2, establishing a target detection network model and reconstructing an image
Figure BDA0004134168850000012
Detecting to obtain a corresponding target detection result, wherein the target detection network model comprises a feature extraction module, a dynamic instance interaction head and a classification and regression branch unit, the feature extraction module adopts an FPN network, the dynamic instance interaction head comprises N feature extraction units, each feature extraction unit comprises a self-attention module, a full connection layer, a first convolution layer, a second convolution layer, a ReLu function and view operation, and the target detection network model executes the following operations:
s21, reconstructing an image
Figure BDA0004134168850000013
Inputting a feature extraction module to obtain a corresponding multi-scale feature map;
s22, setting N suggestion frames and corresponding suggestion features, wherein the suggestion frames are expressed as four-dimensional vectors formed by normalized center coordinates, heights and widths, and the suggestion features have the same dimension as the output features of the feature extraction module;
s23, obtaining corresponding ROI features through RoIAlign operation by correspondingly enabling the suggestion frames and the multi-scale feature map one by one;
s24, inputting the suggested features and the ROI features of each suggested frame into a feature extraction unit of the dynamic instance interaction head in a one-to-one correspondence mode, obtaining corresponding target frames and target features, and executing the following operations by the feature extraction unit:
performing self-attention operation on the suggested features by using a self-attention module to obtain first features;
converting the first feature into a one-dimensional vector through the full connection layer to form a second feature;
inputting the ROI features and the second features into a first convolution layer, sequentially passing through the second convolution layer and a ReLu function, and then adopting view operation to adjust the dimensions to obtain corresponding target features;
s25, updating the suggested frame and the suggested features to be the target frame and the target features, and returning to the step S23 until the iteration times are completed, so as to obtain interaction features;
s26, inputting the interaction characteristics into a classification and regression branch unit to obtain a target detection result.
Preferably, the data enhancement processing of each input image in the unlabeled data set to form the corresponding reconstructed image x̂ is realized by a data enhancement module, the data enhancement module comprising a first encoder, a second encoder and a decoder and performing the following operations:
s11, adopting an unlabeled data set X= { X 1 ,x 2 ,…,x l ,…,x N Training a second encoder and decoder, wherein the second encoder E θ Is satisfied by the learnable parameter theta
Figure BDA0004134168850000021
Decoder->
Figure BDA0004134168850000022
Satisfy->
Figure BDA0004134168850000023
M∈{0,1} W×H A block-wise binary mask representing pixels of an image block size W x H, W representing the pixel width of the input image x and H representing the pixel height of the input image x;
s12, dividing each input image into S image blocks;
s13, executing the following operations on each divided input image:
s131, converting the divided input image into vectors by using a first encoder;
s132, acquiring attention map Attn of the ith image block based on the attention policy i
Attn i =q cls ·k i ,i∈{0,1,…,p 2 -1}
in the formula ,qcls Queries representing sequences of image blocks, k i Key embedding representing the ith image block, p representing the size of the image block;
s133, acquiring the first K index sets omega by sequencing each attention attempt:
Ω=top-rank(Attn,K)
in the formula, top-rank (·, K) represents the index of the last K largest elements returned, and Attn represents Attn i Is a collection of (3);
s134, obtaining binary mask M *
Figure BDA0004134168850000024
in the formula ,
Figure BDA0004134168850000025
represents a round-down operation, mod (·) represents a modulo operation, Ω i Representing the i-th element in the index set Ω;
s135, according to binary mask M * Acquiring a masking image M * As indicated by x, dividing the masking image into non-overlapping image blocks and discarding the image blocks blocked by the binary mask, and feeding the remaining visible image blocks into a pre-trained second encoder and decoder to generate a corresponding reconstructed image
Figure BDA0004134168850000026
Preferably, the loss function L of the target detection network model is calculated as follows:

L = λ_cls · L_cls + λ_L1 · L_L1 + λ_diou · L_diou

L_cls = -α_t (1 - p_t)^{γ_j} log(p_t)

L_L1 = Σ_{z=1}^{n} |y_pz - y_gz|

L_diou = 1 - IOU + ρ²(b_p, b_g) / c²

where L_cls is the balanced focal loss between the predicted and true classifications, L_L1 is the L1 loss between the predicted and real frames, L_diou is the distance intersection-over-union loss between the predicted and real frames, and λ_cls, λ_L1, λ_diou are the coefficients corresponding in turn to L_cls, L_L1 and L_diou; α_t is the weight factor balancing the numbers of positive and negative samples, p_t is the probability that the prediction is a positive sample, γ_j is the focusing coefficient of the j-th class, j = 1, 2, …, T, and T is the total number of classes; γ_j is decoupled into a first component γ_b and a second component γ_j^v, where the first component γ_b controls the basic behavior of the classifier, γ_j^v is a variable parameter, and a gradient guidance mechanism is used to select γ_j^v = s(1 - g_j); g_j represents the cumulative gradient ratio of the positive and negative samples of the j-th class, with value range [0,1], and s is the scale factor determining the upper limit of γ_j; y_pz represents the predicted value, y_gz represents the true value, z = 1, 2, …, n, and n represents the number of target objects; ρ²(b_p, b_g) represents the squared distance between the center point b_p of the predicted frame and the center point b_g of the real frame, c represents the diagonal distance of the smallest rectangle covering both the predicted and real frames, ρ²(b_p, b_g)/c² is the penalty term, and IOU represents the intersection-over-union.
Preferably, the number of iterations E = 6, and the number of suggestion boxes and suggested features N = 100.
Compared with the prior art, the invention has the beneficial effects that:
the method is different from the traditional target detection model, has higher precision in identifying the small pixel targets, has the characteristics of strong adaptability to abnormal weather scenes of the expressway and the like, can more accurately detect abnormal objects on the expressway, obtains a more accurate frame for detecting the small target objects by using a data enhancement method of masking reconstruction, aims at the characteristics of small target objects, such as few characteristics and unbalanced samples, improves a loss function, adopts an equilibrium focus loss function to relieve the problem of unbalanced categories, balances the loss contribution of the difficult samples of positive and negative samples, thereby improving the precision of small target detection, and is better applied to the expressway.
Drawings
FIG. 1 is a flow chart of a method for detecting a small target on a highway according to the present invention;
FIG. 2 is a schematic diagram of a target detection network model according to the present invention;
FIG. 3 is a schematic diagram of a data enhancement module according to the present invention;
FIG. 4 is a schematic diagram of a feature extraction module according to the present invention;
FIG. 5 is a schematic diagram of an interaction process of the dynamic instance interaction head of the present invention.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1 to 5, a method for detecting a small target on an expressway includes the steps of:
s1, acquiring an unlabeled data set X= { X 1 ,x 2 ,…,x l ,…,x N And data augmentation of each input image in the unlabeled datasetPerforming strong processing to form corresponding reconstructed image
Figure BDA0004134168850000041
x l The first input image is represented, i=1, 2, …, N.
In one embodiment, the data enhancement processing of each input image in the unlabeled data set to form the corresponding reconstructed image x̂ is realized by a data enhancement module, the data enhancement module comprising a first encoder, a second encoder and a decoder and performing the following operations:
s11, adopting an unlabeled data set X= { X 1 ,x 2 ,…,x l ,…,x N Training a second encoder and decoder, wherein the second encoder E θ Is satisfied by the learnable parameter theta
Figure BDA0004134168850000042
Decoder->
Figure BDA0004134168850000043
Satisfy->
Figure BDA0004134168850000044
M∈{0,1} W×H A block-wise binary mask representing pixels of an image block size W x H, W representing the pixel width of the input image x and H representing the pixel height of the input image x;
s12, dividing each input image into S image blocks;
s13, executing the following operations on each divided input image:
s131, converting the divided input image into vectors by using a first encoder;
s132, acquiring attention map Attn of the ith image block based on the attention policy i
Attn i =q cls ·k i ,i∈{0,1,…,p 2 -1}
in the formula ,qcls Queries representing sequences of image blocks, k i Key embedding representing the ith image block, p representing the size of the image block;
s133, acquiring the first K index sets omega by sequencing each attention attempt:
Ω=top-rank(Attn,K)
in the formula, top-rank (·, K) represents the index of the last K largest elements returned, and Attn represents Attn i Is a collection of (3);
s134, obtaining binary mask M *
Figure BDA0004134168850000045
in the formula ,
Figure BDA0004134168850000046
represents a round-down operation, mod (·) represents a modulo operation, Ω i Representing the i-th element in the index set Ω;
s135, according to binary mask M * Acquiring a masking image M * As indicated by x, dividing the masking image into non-overlapping image blocks and discarding the image blocks blocked by the binary mask, and feeding the remaining visible image blocks into a pre-trained second encoder and decoder to generate a corresponding reconstructed image
Figure BDA0004134168850000047
As shown in fig. 3, the data enhancement module includes a first encoder (Encoder), a second encoder (Encoder) and a decoder (Decoder); the relationships between data after the attention operation are visualized through a heat map (Heat map), and Top-k denotes the sorting of the attention maps.
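As an aid to reading S132–S134, a minimal sketch of the attention-guided mask construction might look as follows; the tensor shapes and the convention that the top-K attended blocks are the ones set to 0 in the mask are assumptions, not statements about the exact implementation.

```python
# Minimal sketch (assumed shapes and mask convention) of S132-S134.
import torch

def build_attention_mask(q_cls, keys, p, top_k):
    """q_cls: (d,) query of the image-block sequence; keys: (p*p, d) key embeddings of the blocks.
    Returns a (p, p) block-wise binary mask; here the top-K attended blocks are set to 0."""
    attn = keys @ q_cls                                 # Attn_i = q_cls . k_i, i in {0, ..., p^2 - 1}
    omega = torch.topk(attn, top_k).indices             # Omega = top-rank(Attn, K)
    mask = torch.ones(p, p)
    rows = torch.div(omega, p, rounding_mode='floor')   # floor(Omega_i / p)
    cols = omega % p                                    # mod(Omega_i, p)
    mask[rows, cols] = 0.0                              # blocks indexed by Omega are marked in M*
    return mask
```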
S2, establishing a target detection network model and detecting the reconstructed image x̂ to obtain the corresponding target detection result, wherein the target detection network model comprises a feature extraction module, a dynamic instance interaction head and a classification and regression branch unit; the feature extraction module adopts an FPN network, the dynamic instance interaction head comprises N feature extraction units, and each feature extraction unit comprises a self-attention module, a fully connected layer, a first convolution layer, a second convolution layer, a ReLU function and a view operation; the target detection network model executes the following operations:
s21, reconstructing an image
Figure BDA0004134168850000051
Inputting a feature extraction module to obtain a corresponding multi-scale feature map;
s22, setting N suggestion frames and corresponding suggestion features, wherein the suggestion frames are expressed as four-dimensional vectors formed by normalized center coordinates, heights and widths, and the suggestion features have the same dimension as the output features of the feature extraction module;
s23, obtaining corresponding ROI features through RoIAlign operation by correspondingly enabling the suggestion frames and the multi-scale feature map one by one;
s24, inputting the suggested features and the ROI features of each suggested frame into a feature extraction unit of the dynamic instance interaction head in a one-to-one correspondence mode, obtaining corresponding target frames and target features, and executing the following operations by the feature extraction unit:
performing self-attention operation on the suggested features by using a self-attention module to obtain first features;
converting the first feature into a one-dimensional vector through the full connection layer to form a second feature;
inputting the ROI features and the second features into a first convolution layer, sequentially passing through the second convolution layer and a ReLu function, and then adopting view operation to adjust the dimensions to obtain corresponding target features;
s25, updating the suggested frame and the suggested features to be the target frame and the target features, and returning to the step S23 until the iteration times are completed, so as to obtain interaction features;
s26, inputting the interaction characteristics into a classification and regression branch unit to obtain a target detection result.
As shown in fig. 5, Propos Feat denotes the suggested feature, Roi Feat denotes the ROI feature, Self-Attention denotes the self-attention module, and Parmas denotes the second feature. In fig. 2, the feature vector represents the suggested feature and ROI feature of each suggestion box.
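To make the interaction in S24 concrete, the following is a simplified PyTorch sketch of one feature extraction unit; the channel sizes, the use of generated 1×1 (dynamic) convolution parameters and the final projection are assumptions for illustration, and the target-box regression is left to the regression branch.

```python
import torch
import torch.nn as nn

class InteractionUnitSketch(nn.Module):
    """Sketch of one feature extraction unit of the dynamic instance interaction head
    (dimensions and the dynamic-convolution realization are illustrative assumptions)."""
    def __init__(self, d_model=256, d_hidden=64, pooled=7):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.fc_params = nn.Linear(d_model, 2 * d_model * d_hidden)   # produces the "second feature"
        self.norm = nn.LayerNorm(d_hidden)
        self.relu = nn.ReLU(inplace=True)
        self.out_proj = nn.Linear(d_model * pooled * pooled, d_model)
        self.d_model, self.d_hidden = d_model, d_hidden

    def forward(self, proposal_feats, roi_feats):
        """proposal_feats: (N, d_model); roi_feats: (N, d_model, 7, 7) -> target features (N, d_model)."""
        n = proposal_feats.size(0)
        x = proposal_feats.unsqueeze(0)                               # (1, N, d_model)
        first, _ = self.self_attn(x, x, x)                            # self-attention -> "first feature"
        params = self.fc_params(first.squeeze(0))                     # FC layer -> 1-D parameter vector
        w1 = params[:, :self.d_model * self.d_hidden].view(n, self.d_model, self.d_hidden)
        w2 = params[:, self.d_model * self.d_hidden:].view(n, self.d_hidden, self.d_model)
        roi = roi_feats.flatten(2).permute(0, 2, 1)                   # (N, 49, d_model)
        h = self.relu(self.norm(torch.bmm(roi, w1)))                  # stands in for the first convolution
        h = self.relu(torch.bmm(h, w2))                               # stands in for the second convolution
        return self.out_proj(h.reshape(n, -1))                        # view/reshape + projection back
```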
In one embodiment, the loss function L of the target detection network model is calculated as follows:

L = λ_cls · L_cls + λ_L1 · L_L1 + λ_diou · L_diou

L_cls = -α_t (1 - p_t)^{γ_j} log(p_t)

L_L1 = Σ_{z=1}^{n} |y_pz - y_gz|

L_diou = 1 - IOU + ρ²(b_p, b_g) / c²

where L_cls is the balanced focal loss between the predicted and true classifications, L_L1 is the L1 loss between the predicted and real frames, L_diou is the distance intersection-over-union loss between the predicted and real frames, and λ_cls, λ_L1, λ_diou are the coefficients corresponding in turn to L_cls, L_L1 and L_diou; α_t is the weight factor balancing the numbers of positive and negative samples, p_t is the probability that the prediction is a positive sample, γ_j is the focusing coefficient of the j-th class, j = 1, 2, …, T, and T is the total number of classes; γ_j is decoupled into a first component γ_b and a second component γ_j^v, where the first component γ_b controls the basic behavior of the classifier, γ_j^v is a variable parameter, and a gradient guidance mechanism is used to select γ_j^v = s(1 - g_j); g_j represents the cumulative gradient ratio of the positive and negative samples of the j-th class, with value range [0,1], and s is the scale factor determining the upper limit of γ_j; y_pz represents the predicted value, y_gz represents the true value, z = 1, 2, …, n, and n represents the number of target objects; ρ²(b_p, b_g) represents the squared distance between the center point b_p of the predicted frame and the center point b_g of the real frame, c represents the diagonal distance of the smallest rectangle covering both the predicted and real frames, ρ²(b_p, b_g)/c² is the penalty term, and IOU represents the intersection-over-union.
In one embodiment, the number of iterations E = 6, and the number of suggestion boxes and suggested features N = 100.
Specifically, the feature extraction module in this embodiment uses an FPN network based on ResNet. The FPN network is a feature pyramid whose structure is shown in fig. 4 and which is obtained by the following steps: (1) the bottom-up path: through the backbone, the feature activation output of the last residual structure of each stage is used, and the outputs of the residual modules conv2, conv3, conv4 and conv5 are denoted as {C_2, C_3, C_4, C_5}; (2) the top-down path: the deep feature map is up-sampled to obtain a map with higher resolution, and the up-sampled feature map is then fused with the corresponding bottom-up feature map through a lateral connection, as shown in fig. 4, so as to construct the feature maps P_2 to P_5. The pyramid level is denoted by l, the resolution of the feature map at level l is 2^l times lower than that of the input image, and all pyramid levels have 256 channels. The reconstructed image x̂ has size h × w, where h is the height and w is the width of the reconstructed image. The outputs of the stages of the FPN network are listed in Table 1 of the original specification (the table itself is provided as a figure).
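A minimal sketch of the FPN described above is given below; the backbone channel counts for C2–C5 and the use of element-wise addition for the top-down fusion follow the standard FPN design and are assumptions rather than details taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Builds P2-P5 (256 channels each) from backbone maps C2-C5; a standard-FPN sketch."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # top-down path: up-sample the deeper map and fuse it with the lateral connection
        for i in range(len(feats) - 1, 0, -1):
            feats[i - 1] = feats[i - 1] + F.interpolate(feats[i], size=feats[i - 1].shape[-2:])
        return [out(f) for out, f in zip(self.output, feats)]          # [P2, P3, P4, P5]
```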
The suggestion boxes and the suggested features are both learnable and are in one-to-one correspondence. A set of learnable target boxes is used as region proposals; each box is represented by four parameters ranging from 0 to 1, namely the normalized center coordinates, height and width. The parameters of the suggestion boxes are updated during training by the back-propagation algorithm. Back-propagation is currently the most common method for training artificial neural networks: it propagates the error at the output layer backwards layer by layer and updates the network parameters by computing partial derivatives so that the error loss function is minimized. These learnable suggestion boxes are statistics of potential target locations in the training set and can be seen as initial guesses of the regions most likely to contain targets, regardless of the input. However, such boxes only provide rough positioning information and lose the pose and shape of the object, which is detrimental to subsequent classification and regression; therefore, each instance is also characterized by a learnable suggested feature, a high-dimensional latent vector encoding rich instance properties.
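As an illustration of the learnable region proposals described above, such boxes and features can be held as embedding tables that are updated by back-propagation; the centered whole-image initialization shown here is an assumed choice, not one stated in the patent.

```python
import torch.nn as nn

num_proposals, d_model = 100, 256
proposal_boxes = nn.Embedding(num_proposals, 4)        # learnable normalized (cx, cy, w, h) per proposal
proposal_feats = nn.Embedding(num_proposals, d_model)  # learnable high-dimensional suggested features
nn.init.constant_(proposal_boxes.weight[:, :2], 0.5)   # assumed init: boxes centered on the image
nn.init.constant_(proposal_boxes.weight[:, 2:], 1.0)   # assumed init: boxes covering the whole image
```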
The classification and regression branch unit follows the prior art: for example, regression prediction is performed with a three-layer perceptron and classification prediction is realized by a linear mapping layer, which are not described in detail here. The target detection network model applies a set prediction loss to the fixed-size set of classification and box-coordinate predictions, and this loss produces an optimal bipartite matching between the predicted and real objects. For example, if the target detection network model predicts 100 target boxes, the real boxes are padded to a set of 100 as well, so that both the prediction set and the ground-truth set contain 100 elements. The elements of the prediction set and of the ground-truth set are put into one-to-one correspondence by bipartite matching with the Hungarian algorithm so that the matching loss is minimized, and the loss is then computed over the matched positive and negative sample pairs.
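A small sketch of the one-to-one set matching mentioned above, using the Hungarian algorithm as implemented by scipy.optimize.linear_sum_assignment; the random cost matrix is a placeholder, and in practice it would be built from the classification, L1 and DIoU terms of each candidate pair.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix):
    """cost_matrix[i, j]: matching cost between predicted box i and (padded) real box j;
    returns one-to-one index pairs that minimize the total matching cost."""
    rows, cols = linear_sum_assignment(cost_matrix)
    return list(zip(rows.tolist(), cols.tolist()))

costs = np.random.rand(100, 100)    # placeholder 100 x 100 cost matrix for illustration only
pairs = hungarian_match(costs)      # e.g. [(0, 17), (1, 4), ...]
```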
In the loss function L of the target detection network model, γ_j plays the role of balancing hard and easy samples, while γ_b controls the basic behavior of the classifier and does not act on the class-imbalance problem. The variable component γ_j^v determines how much attention the j-th class pays to the positive-negative imbalance problem, and a gradient guidance mechanism is used to select γ_j^v; to better meet the requirements, g_j is kept within the range [0,1] in practice. The weight coefficient is used to balance the loss contributions of different classes, so that rare samples contribute more to the loss than common samples: for rare classes the weight coefficient is set to a larger value to increase their loss contribution, while for frequent classes the weight coefficient remains around 1. The final loss is the sum over all pairs, normalized by the number of objects in the training batch.
The kinds of spilled goods, animals and garbage occurring on the expressway are not very numerous, and the numbers of the different categories are extremely unbalanced. To address this extreme class imbalance, a focusing coefficient and a weight coefficient are added on the basis of the traditional focal loss function. This embodiment also adopts the distance intersection-over-union (DIoU) loss: because scattered objects occupy a small proportion of the image captured by the real-time camera, the predicted box is often larger than the real box and tends to contain it, in which case the plain overlap loss is the same whether the target box lies at the center or at a corner of the predicted box. The penalty term ρ²(b_p, b_g)/c², which measures the distance between the center points of the target box and the predicted box, is therefore added.
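For readers who prefer code, the sketch below shows one plausible realization of the DIoU term and of a class-dependent focal term with a decoupled focusing coefficient; the exact closed form of the balanced focal loss in the patent is given only as an image, so the expressions here (in particular gamma_j = gamma_b + s * (1 - g_j)) are an assumed reading of the surrounding text rather than the patent's own formula.

```python
import torch

def diou_loss(pred, gt):
    """DIoU loss for (x1, y1, x2, y2) boxes: 1 - IOU + rho^2(b_p, b_g) / c^2."""
    ix1, iy1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    centre_p = (pred[:, :2] + pred[:, 2:]) / 2                   # b_p
    centre_g = (gt[:, :2] + gt[:, 2:]) / 2                       # b_g
    rho2 = ((centre_p - centre_g) ** 2).sum(dim=1)               # rho^2(b_p, b_g)
    ex1, ey1 = torch.min(pred[:, 0], gt[:, 0]), torch.min(pred[:, 1], gt[:, 1])
    ex2, ey2 = torch.max(pred[:, 2], gt[:, 2]), torch.max(pred[:, 3], gt[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-7              # squared diagonal of enclosing box
    return 1.0 - iou + rho2 / c2

def balanced_focal_term(p_t, alpha_t, gamma_b, s, g_j):
    """Per-sample focal term with a class-dependent focusing coefficient (assumed closed form)."""
    gamma_j = gamma_b + s * (1.0 - g_j)                          # decoupled focusing coefficient
    return -alpha_t * (1.0 - p_t) ** gamma_j * torch.log(p_t.clamp(min=1e-7))
```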
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combinations of technical features are not contradictory, they should be considered to fall within the scope of this description.
The above embodiments merely represent several specific and detailed implementations of the present application and are not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (4)

1. A method for detecting a small target on an expressway, characterized by comprising the following steps:
s1, acquiring an unlabeled data set X= { X 1 ,x 2 ,…,x l ,…,x N Performing data enhancement processing on each input image in the unlabeled dataset to form a corresponding reconstruction mapImage forming apparatus
Figure FDA0004134168840000011
x l Representing the first input image, l=1, 2, …, N;
s2, establishing a target detection network model and reconstructing an image
Figure FDA0004134168840000012
Detecting to obtain a corresponding target detection result, wherein the target detection network model comprises a feature extraction module, a dynamic instance interaction head and a classification and regression branch unit, the feature extraction module adopts an FPN network, the dynamic instance interaction head comprises N feature extraction units, each feature extraction unit comprises a self-attention module, a full-connection layer, a first convolution layer, a second convolution layer, a ReLu function and view operation, and the target detection network model executes the following operations:
s21, reconstructing an image
Figure FDA0004134168840000013
Inputting a feature extraction module to obtain a corresponding multi-scale feature map;
s22, setting N suggestion boxes and corresponding suggestion features, wherein the suggestion boxes are expressed as four-dimensional vectors formed by normalized center coordinates, heights and widths, and the suggestion features have the same dimension as the output features of the feature extraction module;
s23, obtaining corresponding ROI features through RoIAlign operation by correspondingly enabling the suggestion frames and the multi-scale feature map one by one;
s24, inputting the suggested features and the ROI features of each suggested frame into a feature extraction unit of a dynamic instance interaction head in a one-to-one correspondence mode, obtaining corresponding target frames and target features, and executing the following operations by the feature extraction unit:
performing self-attention operation on the suggested features by using a self-attention module to obtain first features;
converting the first feature into a one-dimensional vector through the full connection layer to form a second feature;
inputting the ROI features and the second features into a first convolution layer, sequentially passing through the second convolution layer and a ReLu function, and then adopting view operation to adjust the dimensions to obtain corresponding target features;
s25, updating the suggested frame and the suggested features to be the target frame and the target features, and returning to the step S23 until the iteration times are completed, so as to obtain interaction features;
s26, inputting the interaction characteristics into a classification and regression branch unit to obtain a target detection result.
2. The highway small target detection method according to claim 1, wherein the data enhancement processing of each input image in the unlabeled data set to form the corresponding reconstructed image x̂ is realized by a data enhancement module, the data enhancement module comprising a first encoder, a second encoder and a decoder and performing the following operations:
s11, adopting an unlabeled data set X= { X 1 ,x 2 ,…,x l ,…,x N Training a second encoder and decoder, wherein the second encoder E θ Is satisfied by the learnable parameter theta
Figure FDA0004134168840000017
Said decoder->
Figure FDA0004134168840000014
Satisfy the following requirements
Figure FDA0004134168840000015
M∈{0,1} W×H A block-wise binary mask representing pixels of an image block size W x H, W representing the pixel width of the input image x and H representing the pixel height of the input image x;
s12, dividing each input image into S image blocks;
s13, executing the following operations on each divided input image:
s131, converting the divided input image into vectors by using a first encoder;
s132, acquiring attention map Attn of the ith image block based on the attention policy i
Attn i =q cls ·k i ,i∈{0,1,…,p 2 -1}
in the formula ,qcls Queries representing sequences of image blocks, k i Key embedding representing the ith image block, p representing the size of the image block;
s133, acquiring the first K index sets omega by sequencing each attention attempt:
Ω=top-rankuttn,K)
in the formula, top-rank (·, K) represents the index of the last K largest elements returned, and Attn represents Attn i Is a collection of (3);
s134, obtaining binary mask M *
Figure FDA0004134168840000021
in the formula ,
Figure FDA0004134168840000022
represents a round-down operation, mod (·) represents a modulo operation, Ω i Representing the i-th element in the index set Ω;
s135, according to binary mask M * Acquiring a masking image M * As indicated by x, dividing the masking image into non-overlapping image blocks and discarding the image blocks blocked by the binary mask, and feeding the remaining visible image blocks into a pre-trained second encoder and decoder to generate a corresponding reconstructed image
Figure FDA0004134168840000023
3. The highway small target detection method according to claim 1, wherein the loss function L of the target detection network model is calculated as follows:

L = λ_cls · L_cls + λ_L1 · L_L1 + λ_diou · L_diou

L_cls = -α_t (1 - p_t)^{γ_j} log(p_t)

L_L1 = Σ_{z=1}^{n} |y_pz - y_gz|

L_diou = 1 - IOU + ρ²(b_p, b_g) / c²

where L_cls is the balanced focal loss between the predicted and true classifications, L_L1 is the L1 loss between the predicted and real frames, L_diou is the distance intersection-over-union loss between the predicted and real frames, and λ_cls, λ_L1, λ_diou are the coefficients corresponding in turn to L_cls, L_L1 and L_diou; α_t is the weight factor balancing the numbers of positive and negative samples, p_t is the probability that the prediction is a positive sample, γ_j is the focusing coefficient of the j-th class, j = 1, 2, …, T, and T is the total number of classes; γ_j is decoupled into a first component γ_b and a second component γ_j^v, where the first component γ_b controls the basic behavior of the classifier, γ_j^v is a variable parameter, and a gradient guidance mechanism is used to select γ_j^v = s(1 - g_j); g_j represents the cumulative gradient ratio of the positive and negative samples of the j-th class, with value range [0,1], and s is the scale factor determining the upper limit of γ_j; y_pz represents the predicted value, y_gz represents the true value, z = 1, 2, …, n, and n represents the number of target objects; ρ²(b_p, b_g) represents the squared distance between the center point b_p of the predicted frame and the center point b_g of the real frame, c represents the diagonal distance of the smallest rectangle covering both the predicted and real frames, ρ²(b_p, b_g)/c² is the penalty term, and IOU represents the intersection-over-union.
4. The highway small target detection method according to claim 1, wherein the number of iterations E = 6, and the number of suggestion boxes and suggested features N = 100.
CN202310269546.6A 2023-03-20 2023-03-20 Highway small target detection method Pending CN116311062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310269546.6A CN116311062A (en) 2023-03-20 2023-03-20 Highway small target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310269546.6A CN116311062A (en) 2023-03-20 2023-03-20 Highway small target detection method

Publications (1)

Publication Number Publication Date
CN116311062A true CN116311062A (en) 2023-06-23

Family

ID=86802786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310269546.6A Pending CN116311062A (en) 2023-03-20 2023-03-20 Highway small target detection method

Country Status (1)

Country Link
CN (1) CN116311062A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576217A (en) * 2024-01-12 2024-02-20 电子科技大学 Object pose estimation method based on single-instance image reconstruction
CN117576217B (en) * 2024-01-12 2024-03-26 电子科技大学 Object pose estimation method based on single-instance image reconstruction


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination