CN117975036A - Small target detection method and system based on detection transformer


Info

Publication number: CN117975036A
Application number: CN202410041088.5A
Authority: CN (China)
Prior art keywords: small target, target image, module, feature, data
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 黄志青; 陈天戈; 余俊
Current Assignee: Guangzhou Hengshayun Technology Co ltd
Application filed by Guangzhou Hengshayun Technology Co ltd; priority to CN202410041088.5A; publication of CN117975036A

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a small target detection method and system based on a detection transformer, wherein the method comprises the following steps: obtaining data to be detected of a small target image, and performing data enhancement processing to obtain the enhanced data to be detected of the small target image; introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model; and carrying out target detection processing on the enhanced data to be detected of the small target image based on the small target detection model to obtain a small target image detection result. The embodiment of the application can fully combine the high-level semantic features and the low-level semantic features of the small target image, extract the detail information of the image and improve the detection result precision of the small target image. The application can be widely applied to the technical field of target detection.

Description

Small target detection method and system based on detection transformer
Technical Field
The application relates to the technical field of target detection, in particular to a small target detection method and system based on a detection transformer.
Background
The DETR (Detection Transformer) target detection model is a brand-new architecture that applies the Transformer to target detection, and is a milestone work in the field. In the prior art, anchor-based and anchor-free target detection methods both require non-maximum suppression, which makes model parameter tuning complex and deployment difficult. DETR solves these problems: it requires neither anchor frames nor a non-maximum suppression operation, exploits the global modeling capability of the Transformer, and treats target detection as a set prediction problem, thereby greatly simplifying the target detection process. For this reason, DETR is also commonly used in the technical field of target detection at the present stage.
However, the current DETR target detection model still cannot detect small targets well. Although its use of the Transformer gives it strong global modeling capability, it performs no dedicated local-attention operation. For the detection of small targets, the resolution of the feature map becomes particularly important: after multiple pooling operations, the resolution of the bottom feature map may be relatively low, which makes it difficult for the existing DETR target detection model to capture the detailed information of small targets. Because DETR only uses the high-semantic, low-resolution features output by the feature extraction network, which contain little detail information, its detection results on small targets are often not ideal; the low-level semantic features are not well utilized. In addition, its loss function combines an average error (L1) loss with a generalized intersection-over-union (GIoU) loss: the L1 loss value grows with the size of the prediction frame, so small targets contribute little to the loss, and the GIoU loss fails when a small prediction frame slides inside the real frame, both of which make the existing model unfriendly to small target detection.
In summary, these technical problems in the related art remain to be solved.
Disclosure of Invention
The embodiment of the application mainly aims to provide a small target detection method and system based on a detection transformer, which can fully combine the high-level semantic features and the low-level semantic features of a small target image, extract the detail information of the image and improve the detection result precision of the small target image.
To achieve the above object, an aspect of an embodiment of the present application provides a small target detection method based on a detection transformer, the method including:
Obtaining data to be detected of a small target image, and performing data enhancement processing to obtain the enhanced data to be detected of the small target image;
introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model;
And performing target detection processing on the enhanced data to be detected of the small target image based on the small target detection model to obtain a small target image detection result.
In some embodiments, the small target detection model includes a first feature extraction module, a second feature extraction module, a multi-scale feature fusion interaction module, a target query module, and a multi-scale transformation parallel decoder module, where an input end of the first feature extraction module is used to obtain the enhanced small target image to be detected data, an output end of the first feature extraction module is connected with an input end of the second feature extraction module, an output end of the second feature extraction module is connected with an input end of the multi-scale feature fusion interaction module, an output end of the multi-scale feature fusion interaction module is connected with a first input end of the multi-scale transformation parallel decoder module, and an output end of the target query module is connected with a second input end of the multi-scale transformation parallel decoder module, and an output end of the multi-scale transformation parallel decoder module is used to output the small target image detection result.
In some embodiments, the performing, based on the small target detection model, target detection processing on the enhanced small target image to-be-detected data to obtain a small target image detection result includes:
Inputting the data to be detected of the enhanced small target image into the small target detection model;
performing feature extraction processing on the enhanced small target image data to be detected based on a first feature extraction module of the small target detection model to obtain first small target image feature data;
the second feature extraction module is based on the small target detection model, and performs feature extraction processing on the first small target image feature data to obtain second small target image feature data;
Performing feature interaction processing on the second small target image feature data based on the multi-scale feature fusion interaction module of the small target detection model to obtain small target image feature interaction data;
and performing target detection query processing on the small target image characteristic interaction data based on a target query module of the small target detection model and a multi-scale transformation parallel decoder module of the small target detection model to obtain a small target image detection result.
In some embodiments, the first feature extraction module, based on the small target detection model, performs feature extraction processing on the enhanced small target image data to be detected to obtain first small target image feature data, and includes:
The enhanced small target image data to be detected is input to a first feature extraction module of the small target detection model, wherein the first feature extraction module comprises a flattening module, a full-connection module, a position coding module and a first Swin Transformer module;
flattening the data to be detected of the enhanced small target image based on the flattening module of the first feature extraction module to obtain third small target image feature data;
Based on a full connection module of the first feature extraction module, performing full connection processing on the third small target image feature data to obtain fourth small target image feature data;
Acquiring position coding information based on a position coding module of the first feature extraction module;
weighting the fourth small target image characteristic data and the position coding information to obtain weighted fourth small target image characteristic data;
And carrying out feature extraction and transformation processing on the weighted fourth small target image feature data based on the first Swin Transformer module of the first feature extraction module to obtain first small target image feature data.
In some embodiments, the second feature extraction module, based on the small target detection model, performs feature extraction processing on the first small target image feature data to obtain second small target image feature data, including:
Inputting the first small target image feature data to a second feature extraction module of the small target detection model, wherein the second feature extraction module comprises a first feature extraction sub-module, a second feature extraction sub-module and a third feature extraction sub-module;
based on a first feature extraction sub-module of the second feature extraction module, performing feature extraction processing on the first small target image feature data to obtain fifth small target image feature data;
based on a second feature extraction sub-module of the second feature extraction module, performing feature extraction processing on the feature data of the fifth small target image to obtain feature data of a sixth small target image;
Based on a third feature extraction sub-module of the second feature extraction module, performing feature extraction processing on the feature data of the sixth small target image to obtain feature data of a seventh small target image;
And integrating the fifth small target image characteristic data, the sixth small target image characteristic data and the seventh small target image characteristic data to obtain the second small target image characteristic data.
In some embodiments, the first feature extraction sub-module, the second feature extraction sub-module and the third feature extraction sub-module each comprise a patch merging module and a Swin Transformer module, an output of the patch merging module being connected to an input of the Swin Transformer module, wherein:
The patch merging module is used for performing resolution reduction processing on input data;
the Swin Transformer module is used for carrying out feature extraction and feature transformation processing on input data.
In some embodiments, the multi-scale feature fusion interaction module based on the small target detection model performs feature interaction processing on the second small target image feature data to obtain small target image feature interaction data, including:
Inputting the second small target image feature data to a multi-scale feature fusion interaction module of the small target detection model, wherein the multi-scale feature fusion interaction module comprises a second Swin Transformer module, a third Swin Transformer module, a fourth Swin Transformer module, a fifth Swin Transformer module, a sixth Swin Transformer module and a seventh Swin Transformer module;
performing feature interaction processing on the fifth small target image feature data based on the second Swin Transformer module of the multi-scale feature fusion interaction module to obtain eighth small target image feature data;
performing feature interaction processing on the sixth small target image feature data based on the third Swin Transformer module of the multi-scale feature fusion interaction module to obtain ninth small target image feature data;
performing feature interaction processing on the seventh small target image feature data based on the fourth Swin Transformer module of the multi-scale feature fusion interaction module to obtain tenth small target image feature data;
Performing bilinear interpolation up-sampling processing on the tenth small target image characteristic data, and then performing element-by-element addition on the tenth small target image characteristic data and the ninth small target image characteristic data to obtain eleventh small target image characteristic data;
Performing bilinear interpolation up-sampling processing on the eleventh small target image characteristic data, and then performing element-by-element addition on the eleventh small target image characteristic data and the eighth small target image characteristic data to obtain twelfth small target image characteristic data;
performing feature interaction processing on the tenth small target image feature data based on the fifth Swin Transformer module of the multi-scale feature fusion interaction module to obtain thirteenth small target image feature data;
performing feature interaction processing on the eleventh small target image feature data based on the sixth Swin Transformer module of the multi-scale feature fusion interaction module to obtain fourteenth small target image feature data;
performing feature interaction processing on the twelfth small target image feature data based on the seventh Swin Transformer module of the multi-scale feature fusion interaction module to obtain fifteenth small target image feature data;
Performing bilinear interpolation downsampling processing on the fifteenth small target image characteristic data, and then performing element-by-element addition on the fifteenth small target image characteristic data and the fourteenth small target image characteristic data to obtain sixteenth small target image characteristic data;
Performing bilinear interpolation downsampling processing on the sixteenth small target image characteristic data, and then performing element-by-element addition on the sixteenth small target image characteristic data and the thirteenth small target image characteristic data to obtain seventeenth small target image characteristic data;
And integrating the fifteenth small target image characteristic data, the sixteenth small target image characteristic data and the seventeenth small target image characteristic data to obtain the small target image characteristic interaction data.
In some embodiments, the target query module based on the small target detection model and the multi-scale transformation parallel decoder module of the small target detection model perform target detection query processing on the small target image feature interaction data to obtain the small target image detection result, including:
acquiring target query information based on a target query module of the small target detection model;
Inputting the small target image feature interaction data to a multi-scale transformation parallel decoder module of the small target detection model, wherein the multi-scale transformation parallel decoder module comprises a first cross attention module, a second cross attention module, a third cross attention module, a first transformer module, a second transformer module, a third transformer module, a fourth transformer module, a fifth transformer module, a sixth transformer module, a seventh transformer module, a first feedforward neural network layer, a second feedforward neural network layer and a third feedforward neural network layer;
combining the fifteenth small target image feature data with the target query information and inputting the combination into a first cross attention module of the multi-scale transformation parallel decoder module for a cross attention operation to obtain a first small target image feature vector;
combining the sixteenth small target image feature data with the target query information and inputting the combination into a second cross attention module of the multi-scale transformation parallel decoder module for a cross attention operation to obtain a second small target image feature vector;
combining the seventeenth small target image feature data with the target query information and inputting the combination into a third cross attention module of the multi-scale transformation parallel decoder module for a cross attention operation to obtain a third small target image feature vector;
performing feature transformation processing on the first small target image feature vector based on a first transformer module of the multi-scale transformation parallel decoder module to obtain a fourth small target image feature vector;
performing feature transformation processing on the second small target image feature vector based on a second transformer module of the multi-scale transformation parallel decoder module to obtain a fifth small target image feature vector;
performing feature transformation processing on the third small target image feature vector based on a third transformer module of the multi-scale transformation parallel decoder module to obtain a sixth small target image feature vector;
splicing the fourth small target image feature vector, the fifth small target image feature vector and the sixth small target image feature vector, and inputting the spliced result into the fourth transformer module for feature transformation processing to obtain a seventh small target image feature vector;
splitting the seventh small target image feature vector to obtain an eighth small target image feature vector, a ninth small target image feature vector and a tenth small target image feature vector;
Performing feature transformation processing on the eighth small target image feature vector based on a fifth transformer module of the multi-scale transformation parallel decoder module to obtain an eleventh small target image feature vector;
performing feature transformation processing on the ninth small target image feature vector based on a sixth transformer module of the multi-scale transformation parallel decoder module to obtain a twelfth small target image feature vector;
performing feature transformation processing on the tenth small target image feature vector based on a seventh transformer module of the multi-scale transformation parallel decoder module to obtain a thirteenth small target image feature vector;
Detecting the feature vector of the eleventh small target image based on a first feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a first small target image detection result;
Detecting the feature vector of the twelfth small target image based on a second feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a second small target image detection result;
Detecting the feature vector of the thirteenth small target image based on a third feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a third small target image detection result;
And integrating the first small target image detection result, the second small target image detection result and the third small target image detection result to obtain the small target image detection result.
In some embodiments, the loss function of the small target detection model includes a normalized average error loss function and a distance intersection-over-union (DIoU) loss function, and the expression of the loss function of the small target detection model is specifically as follows:
$$L_{total}=\sum_{i=1}^{S}\Big[-\log\hat{p}_{\hat{\sigma}_s(i)}\big(c_i^{s}\big)+\mathbb{1}_{\{c_i^{s}\neq\varnothing\}}\,l_{bbox}\big(b_i^{s},\hat{b}_{\hat{\sigma}_s(i)}\big)\Big]+\sum_{i=1}^{M}\Big[-\log\hat{p}_{\hat{\sigma}_m(i)}\big(c_i^{m}\big)+\mathbb{1}_{\{c_i^{m}\neq\varnothing\}}\,l_{bbox}\big(b_i^{m},\hat{b}_{\hat{\sigma}_m(i)}\big)\Big]+\sum_{i=1}^{L}\Big[-\log\hat{p}_{\hat{\sigma}_l(i)}\big(c_i^{l}\big)+\mathbb{1}_{\{c_i^{l}\neq\varnothing\}}\,l_{bbox}\big(b_i^{l},\hat{b}_{\hat{\sigma}_l(i)}\big)\Big]$$

In the above equation, $L_{total}$ denotes the loss function of the small target detection model; $y_i^{s}$, $y_i^{m}$ and $y_i^{l}$ denote the $i$-th target in the first, second and third labeling targets respectively; $S$, $M$ and $L$ denote the total numbers of first, second and third labeling targets; $c_i^{s}$, $c_i^{m}$ and $c_i^{l}$ denote the category of the $i$-th target in the first, second and third labeling targets, and $b_i^{s}$, $b_i^{m}$ and $b_i^{l}$ its position information; the indicators $\mathbb{1}_{\{c_i^{s}\neq\varnothing\}}$, $\mathbb{1}_{\{c_i^{m}\neq\varnothing\}}$ and $\mathbb{1}_{\{c_i^{l}\neq\varnothing\}}$ mean that the regression-frame loss behind them is calculated only when the corresponding labeling target is not the empty set; $l_{bbox}(\cdot)$ denotes the regression-frame position loss; $\hat{p}_{\hat{\sigma}_s(i)}(c_i^{s})$, $\hat{p}_{\hat{\sigma}_m(i)}(c_i^{m})$ and $\hat{p}_{\hat{\sigma}_l(i)}(c_i^{l})$ denote the probability of the category $c_i^{s}$, $c_i^{m}$ or $c_i^{l}$ in the prediction frame matched to the $i$-th target of the corresponding annotation set, obtained through the bipartite-graph Hungarian matching algorithm; and $\hat{b}_{\hat{\sigma}_s(i)}$, $\hat{b}_{\hat{\sigma}_m(i)}$ and $\hat{b}_{\hat{\sigma}_l(i)}$ denote the coordinate information of the prediction frames matched to $b_i^{s}$, $b_i^{m}$ and $b_i^{l}$ in the first, second and third bipartite-graph matchings.
To achieve the above object, another aspect of the embodiments of the present application proposes a small object detection system based on a detection transformer, the system comprising:
The first module is used for acquiring the data to be detected of the small target image and carrying out data enhancement processing to obtain the enhanced data to be detected of the small target image;
The second module is used for introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model;
And the third module is used for carrying out target detection processing on the enhanced data to be detected of the small target image based on the small target detection model to obtain a small target image detection result.
The embodiment of the application at least has the following beneficial effects. The application provides a small target detection method and system based on a detection transformer. The scheme obtains the enhanced small target image data to be detected by acquiring the small target image data to be detected and carrying out data enhancement processing, and then introduces a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model: the multi-scale feature fusion interaction module can make full use of the features of various scales generated by the feature extraction network to improve the network's perception of features at different scales, while the multi-scale transformation parallel decoder module adopts a multi-branch parallel connection mode in which the input of each branch comes from an output of the feature fusion module at a different resolution, so that the features of the various resolutions are fully utilized and each branch independently predicts targets of a different size. Finally, the enhanced small target image data to be detected is subjected to target detection processing based on the small target detection model to obtain a small target image detection result. The scheme can fully combine the high-level semantic features and the low-level semantic features of the small target image, extract the detail information of the image, and improve the detection result precision of the small target image.
Drawings
FIG. 1 is a flow chart of a small target detection method based on a detection transformer provided by an embodiment of the application;
FIG. 2 is a schematic diagram of a structural data processing flow of a small target detection model constructed according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a structural data processing flow of a multi-scale feature fusion interaction module constructed in an embodiment of the present application;
FIG. 4 is a schematic diagram of a structural data processing flow of a multi-scale transform parallel decoder module constructed in accordance with an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a small target detection system based on a detection transformer according to an embodiment of the present application.
Reference numerals: 1. a first feature extraction module; 2. a second feature extraction module; 3. a multi-scale feature fusion interaction module; 4. a multi-scale transform parallel decoder module; 5. a target query module; 6. a first feature extraction sub-module; 7. a second feature extraction sub-module; 8. a third feature extraction sub-module; 9. a first feedforward neural network layer; 10. a second feedforward neural network layer; 11. a third feedforward neural network layer.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the application, but are merely examples of systems and methods consistent with aspects of embodiments of the application as detailed in the accompanying claims.
It is to be understood that the terms "first", "second" and the like, as used herein, may be used to describe various concepts, but these concepts are not limited by the terms unless otherwise specified. The terms are only used to distinguish one concept from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present application. The word "if", as used herein, may be interpreted as "when", "upon" or "in response to a determination", depending on the context.
The terms "at least one", "a plurality", "each", "any" and the like as used herein, at least one includes one, two or more, a plurality includes two or more, each means each of the corresponding plurality, and any one means any of the plurality.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Referring to fig. 1, which is a flowchart of a small target detection method based on a detection transformer according to an embodiment of the application, the method includes the following steps:
s100, acquiring data to be detected of a small target image, and performing data enhancement processing to obtain the enhanced data to be detected of the small target image;
In some embodiments, the data enhancement process means that the image undergoes enhancement operations such as random flipping, random cropping, image scaling and data padding before being sent into the network for inference.
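The patent gives no implementation for this enhancement chain; the following is a minimal Python sketch of the four named operations using torchvision, in which the flip probability, crop ratio, output size and padding multiple are illustrative assumptions rather than values from the source:

```python
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from PIL import Image

def enhance(img: Image.Image, out_size: int = 800) -> torch.Tensor:
    """Sketch of the S100 enhancement chain: random flipping, random cropping,
    image scaling and data padding. Sizes and probabilities are assumptions."""
    if torch.rand(1).item() < 0.5:                      # random flipping
        img = TF.hflip(img)
    w, h = img.size                                     # random cropping to 90%
    img = T.RandomCrop((int(h * 0.9), int(w * 0.9)))(img)
    img = TF.resize(img, out_size)                      # image scaling (short side)
    x = TF.to_tensor(img)                               # (3, H, W) in [0, 1]
    pad_h = (32 - x.shape[1] % 32) % 32                 # data padding so that
    pad_w = (32 - x.shape[2] % 32) % 32                 # H, W are multiples of 32
    return TF.pad(x, [0, 0, pad_w, pad_h])
```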
S200, introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model;
It should be noted that, in some embodiments, the small target detection model includes a first feature extraction module 1, a second feature extraction module 2, a multi-scale feature fusion interaction module 3, a target query module 5, and a multi-scale transformation parallel decoder module 4, where an input end of the first feature extraction module is used to obtain data to be detected of the enhanced small target image, an output end of the first feature extraction module is connected with an input end of the second feature extraction module, an output end of the second feature extraction module is connected with an input end of the multi-scale feature fusion interaction module, an output end of the multi-scale feature fusion interaction module is connected with a first input end of the multi-scale transformation parallel decoder module, and an output end of the target query module is connected with a second input end of the multi-scale transformation parallel decoder module, and an output end of the multi-scale transformation parallel decoder module is used to output a small target image detection result.
In some specific embodiments, as shown in fig. 2, aiming at the current situation that DETR detects small targets poorly, the embodiment of the application provides a module for performing multi-scale fusion on Transformer features, namely the multi-scale feature fusion interaction module, which can make full use of the features of various scales generated by the feature extraction network and improve the network's perception of features at different scales. The multi-scale feature fusion interaction module can alleviate the problem that DETR only utilizes high-level semantic features without fully utilizing low-level semantic features, and the problem that the detection effect on small targets is poor because no multi-scale feature fusion is carried out. Meanwhile, a decoder for decoding the multi-scale Transformer features, namely the multi-scale transformation parallel decoder module, is designed: it adopts a multi-branch parallel mode, the input of each branch comes from an output of the feature fusion module at a different resolution, the features of the various resolutions are fully utilized, and each branch independently predicts targets of a different size. An attention module is further arranged among the branches of the decoder, so that after interacting the branches know what size of target each of the others predicts. The regression-frame loss function in DETR uses L1 loss and GIoU loss, and L1 loss has the defect that larger frames incur larger loss; the embodiment of the application therefore adopts a normalized L1 loss, namely the normalized average error loss function. Moreover, because the network structure is designed so that the network outputs more small targets, the situation often occurs that the predicted frame is very small and slides inside the real frame, in which case GIoU loss fails; the embodiment of the application therefore changes to DIoU loss, which is sensitive to distance information, namely the distance intersection-over-union loss function.
It should be noted that, in the specific embodiment of the application, the loss function of the small target detection model includes a normalized average error loss function and a distance intersection-over-union (DIoU) loss function. Let $y_s$ denote the first labeling targets, i.e. the set of small targets in the network input image whose area is smaller than 0.01 times the image area; let $y_m$ denote the second labeling targets, i.e. the set of medium targets; and let $y_l$ denote the third labeling targets, i.e. the set of large targets whose area is larger than 0.1 times the image area. $S = M = L = 100$: the length of $y_s$ is 100, and since the number of small targets in an image is smaller than 100, the insufficient part is padded with the empty set $\varnothing$; $y_m$ and $y_l$ are expressed in the same way. The outputs $\hat{y}^{s}$, $\hat{y}^{m}$ and $\hat{y}^{l}$ of the three branches of the network decoder are bipartite-graph matched with $y_s$, $y_m$ and $y_l$ respectively by the Hungarian algorithm, whose expression is as follows:

$$\hat{\sigma}_s=\underset{\sigma_s}{\arg\min}\sum_{i=1}^{S}L_{match}\big(y_i^{s},\hat{y}_{\sigma_s(i)}^{s}\big),$$

and likewise for $\hat{\sigma}_m$ and $\hat{\sigma}_l$.
In the above expression, $L_{match}(y_i^{s},\hat{y}_{\sigma_s(i)}^{s})$ denotes the loss between the small target annotation information $y_s$ in the image and the output of the small target branch of the network decoder at index $\sigma_s(i)$; $\hat{\sigma}_m$ and $\hat{\sigma}_l$ are defined in the same way. The matching result $\hat{\sigma}_s$ represents the best match between the prediction frames and the real frames. Category information and regression-frame position information are considered simultaneously in the loss calculation: $y_i^{s}=(c_i^{s},b_i^{s})$, where $c_i^{s}$ denotes the category of the $i$-th small labeling target and $b_i^{s}\in[0,1]^4$ its position information, i.e. the normalized abscissa and ordinate of the center point of the real frame together with its height and width.
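As a concrete illustration, the bipartite-graph matching described above can be realised with SciPy's Hungarian-algorithm solver; the sketch below (the cost weighting, padding handling, and the `l_bbox` callable, which corresponds to the regression-frame loss defined in code further below) reflects assumptions about implementation details rather than code from the patent:

```python
import torch
from scipy.optimize import linear_sum_assignment

@torch.no_grad()
def hungarian_match(pred_logits, pred_boxes, gt_classes, gt_boxes, l_bbox):
    """Bipartite matching for one branch: pred_logits (S, K+1) and
    pred_boxes (S, 4) against the real (non-padded) annotations.
    The match cost mirrors L_match: -p(class) plus the box loss; summing
    the two with equal weight is an assumption.  Returns index pairs."""
    prob = pred_logits.softmax(-1)                       # (S, K+1)
    cost_cls = -prob[:, gt_classes]                      # (S, num_gt)
    cost_box = torch.stack([
        torch.stack([l_bbox(pb, gb) for gb in gt_boxes])
        for pb in pred_boxes
    ])                                                   # (S, num_gt)
    cost = (cost_cls + cost_box).cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)       # Hungarian algorithm
    return pred_idx, gt_idx                              # sigma_hat as pairs
```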
The classification part uses a cross-entropy loss function, so the expression of the loss function of the small target detection model is specifically:

$$L_{total}=\sum_{i=1}^{S}\Big[-\log\hat{p}_{\hat{\sigma}_s(i)}\big(c_i^{s}\big)+\mathbb{1}_{\{c_i^{s}\neq\varnothing\}}\,l_{bbox}\big(b_i^{s},\hat{b}_{\hat{\sigma}_s(i)}\big)\Big]+\sum_{i=1}^{M}\Big[-\log\hat{p}_{\hat{\sigma}_m(i)}\big(c_i^{m}\big)+\mathbb{1}_{\{c_i^{m}\neq\varnothing\}}\,l_{bbox}\big(b_i^{m},\hat{b}_{\hat{\sigma}_m(i)}\big)\Big]+\sum_{i=1}^{L}\Big[-\log\hat{p}_{\hat{\sigma}_l(i)}\big(c_i^{l}\big)+\mathbb{1}_{\{c_i^{l}\neq\varnothing\}}\,l_{bbox}\big(b_i^{l},\hat{b}_{\hat{\sigma}_l(i)}\big)\Big]$$

In the above equation, $\mathbb{1}_{\{c_i^{s}\neq\varnothing\}}$ indicates that the regression-frame loss behind it is calculated only when the $i$-th small labeling target is not the empty set, and likewise for the medium and large terms; $\hat{p}_{\hat{\sigma}_s(i)}(c_i^{s})$ denotes the probability of the category $c_i^{s}$ in the prediction frame matched to the $i$-th small labeling target by the bipartite-graph Hungarian matching algorithm, and likewise for $\hat{p}_{\hat{\sigma}_m(i)}(c_i^{m})$ and $\hat{p}_{\hat{\sigma}_l(i)}(c_i^{l})$; $l_{bbox}$ denotes the regression-frame position loss; and $\hat{b}_{\hat{\sigma}_s(i)}$, $\hat{b}_{\hat{\sigma}_m(i)}$ and $\hat{b}_{\hat{\sigma}_l(i)}$ denote the coordinate information of the prediction frames matched in the first, second and third bipartite-graph matchings.
In DETR, the regression-frame position loss uses L1 loss and GIoU loss; the larger the prediction frame, the larger the loss function value, and the smaller the target, the smaller the loss function value, which is not friendly to the detection of small targets. The embodiment of the application therefore adopts a normalized L1 loss, i.e. the difference between the prediction frame and the real frame is divided by the size of the real frame. A further disadvantage of GIoU loss is that when a small prediction frame slides inside the real frame, the GIoU loss value does not change; the embodiment of the application therefore uses DIoU (Distance-IoU) loss, which takes the distance between the real frame and the prediction frame into account.
Wherein,

$$\omega_{L1}\big(b_i,\hat{b}_{\hat{\sigma}(i)}\big)=\Big\|\big(b_i-\hat{b}_{\hat{\sigma}(i)}\big)\oslash(w_i,h_i,w_i,h_i)\Big\|_1,\qquad\omega_{iou}\big(b_i,\hat{b}_{\hat{\sigma}(i)}\big)=1-IoU\big(b_i,\hat{b}_{\hat{\sigma}(i)}\big)+\frac{\rho^{2}\big(b_i,\hat{b}_{\hat{\sigma}(i)}\big)}{c^{2}}$$

In the above equations, $\omega_{L1}$ represents the normalized average error loss function, in which $\oslash$ denotes element-wise division and $(w_i,h_i)$ is the size of the real frame, and $\omega_{iou}$ represents the distance intersection-over-union loss function, in which $\rho(\cdot)$ is the distance between the center points of the real frame and the prediction frame and $c$ is the diagonal length of the smallest frame enclosing both; the regression-frame position loss $l_{bbox}$ is composed of these two terms.
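A hedged PyTorch rendering of the two loss terms just defined, assuming boxes in normalized (cx, cy, w, h) form and an equal, unweighted sum of the two terms in l_bbox, might look as follows:

```python
import torch

def norm_l1_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """omega_L1: |pred - gt| divided element-wise by the real-frame size,
    so that small boxes are not penalised less than large ones.  Boxes are
    (cx, cy, w, h); dividing all four coordinates by (w, h, w, h) of the
    real frame is one reading of "divided by the size of the real frame"."""
    scale = gt[..., [2, 3, 2, 3]].clamp(min=1e-6)
    return ((pred - gt).abs() / scale).sum(-1)

def diou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """omega_iou: 1 - IoU + d^2 / c^2, with d the centre distance and c the
    diagonal of the smallest frame enclosing both boxes."""
    def corners(b):
        cx, cy, w, h = b.unbind(-1)
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)
    iw = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0)
    ih = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union.clamp(min=1e-6)
    d2 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    return 1 - iou + d2 / (cw ** 2 + ch ** 2).clamp(min=1e-6)

def l_bbox(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    return norm_l1_loss(pred, gt) + diou_loss(pred, gt)  # equal weights assumed
```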
S300, performing target detection processing on the enhanced small target image data to be detected based on a small target detection model to obtain a small target image detection result;
It should be noted that, in some embodiments, step S300 may include: s310, inputting the data to be detected of the enhanced small target image into a small target detection model; s320, performing feature extraction processing on the enhanced data to be detected of the small target image based on a first feature extraction module of the small target detection model to obtain first small target image feature data; s330, a second feature extraction module based on the small target detection model performs feature extraction processing on the first small target image feature data to obtain second small target image feature data; s340, carrying out feature interaction processing on the feature data of the second small target image based on the multi-scale feature fusion interaction module of the small target detection model to obtain feature interaction data of the small target image; s350, performing target detection query processing on the small target image characteristic interaction data based on a target query module of the small target detection model and a multi-scale transformation parallel decoder module of the small target detection model to obtain a small target image detection result.
In some embodiments, the small target image detection result includes category information of the small target and frame position information of the small target.
Further, it should be noted that, in some embodiments, step S320 may include: S321, inputting the enhanced data to be detected of the small target image into the first feature extraction module of the small target detection model, wherein the first feature extraction module comprises a flattening module, a full-connection module, a position coding module and a first Swin Transformer module; S322, flattening the data to be detected of the enhanced small target image based on the flattening module of the first feature extraction module to obtain third small target image feature data; S323, performing full connection processing on the third small target image feature data based on the full-connection module of the first feature extraction module to obtain fourth small target image feature data; S324, acquiring position coding information based on the position coding module of the first feature extraction module; S325, weighting the fourth small target image feature data and the position coding information to obtain weighted fourth small target image feature data; S326, performing feature extraction and transformation processing on the weighted fourth small target image feature data based on the first Swin Transformer module of the first feature extraction module to obtain the first small target image feature data.
In some embodiments, the input image of size (H, W, 3) first passes through the flattening module, which flattens the RGB pixels in each adjacent 4×4 spatial range into a 48-dimensional vector, giving the third small target image feature data Y3 of size (H/4, W/4, 48). The feature then enters the full-connection module, which converts the feature dimension to c, giving the fourth small target image feature data Y4 of size (H/4, W/4, c). Because the Transformer, unlike a recurrent neural network (RNN) or convolutional neural network (CNN), has no explicit order information, position coding is used to help the model understand the relative positions of the tokens in the input sequence. The learnable position codes are therefore added element by element to the fourth small target image feature data Y4, and the result is sent to the two first Swin Transformer modules (Swin Transformer Block); the output feature dimension is unchanged, giving the first small target image feature data Y1.
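A minimal sketch of this first feature extraction stage is given below; the choice c = 96 and the omission of the two Swin Transformer blocks (which leave the dimensions unchanged) are simplifying assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Flattening module + full-connection module + learnable position codes,
    as described for the first feature extraction module.  The two Swin
    Transformer blocks that follow are omitted; c = 96 is an assumption."""
    def __init__(self, img_h: int, img_w: int, c: int = 96):
        super().__init__()
        self.proj = nn.Linear(4 * 4 * 3, c)                   # 48 -> c
        n_tokens = (img_h // 4) * (img_w // 4)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, c))  # learnable codes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W); flatten each 4x4 RGB patch into 48 values (Y3),
        # project to c channels (Y4), then add the position codes element-wise
        b = x.shape[0]
        x = x.unfold(2, 4, 4).unfold(3, 4, 4)                 # (B, 3, H/4, W/4, 4, 4)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 48)    # (B, H/4*W/4, 48)
        return self.proj(x) + self.pos                        # (B, H/4*W/4, c)
```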
Further, it should be noted that, in some embodiments, step S330 may include: s331, inputting the first small target image feature data to a second feature extraction module of a small target detection model, wherein the second feature extraction module comprises a first feature extraction sub-module 6, a second feature extraction sub-module 7 and a third feature extraction sub-module 8; s332, performing feature extraction processing on the feature data of the first small target image based on a first feature extraction sub-module of the second feature extraction module to obtain feature data of a fifth small target image; s333, performing feature extraction processing on the feature data of the fifth small target image based on a second feature extraction sub-module of the second feature extraction module to obtain feature data of the sixth small target image; s334, performing feature extraction processing on the feature data of the sixth small target image based on a third feature extraction sub-module of the second feature extraction module to obtain feature data of the seventh small target image; s335, integrating the fifth small target image characteristic data, the sixth small target image characteristic data and the seventh small target image characteristic data to obtain second small target image characteristic data.
In some embodiments, the first small target image feature data Y1 is then sent sequentially through the first feature extraction sub-module (containing 2 blocks), the second feature extraction sub-module (containing 6 blocks) and the third feature extraction sub-module (containing 2 blocks). The first, second and third feature extraction sub-modules are all combinations of a patch merging module (Patch Merging) and Swin Transformer modules (Swin Transformer Block), and only the first Patch Merging in each combination reduces the feature resolution, so that the fifth small target image feature data Y5 with feature dimensions (H/8, W/8, 2c), the sixth small target image feature data Y6 with feature dimensions (H/16, W/16, 4c) and the seventh small target image feature data Y7 with feature dimensions (H/32, W/32, 8c) are obtained in turn, the channel count doubling at each patch merging.
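The patch merging operation itself is not spelled out in the patent text; the sketch below follows the usual Swin Transformer formulation (concatenating each 2×2 token neighbourhood, then a linear reduction), which is an assumption consistent with the resolution halving and channel doubling described here:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Resolution-reduction module of each feature extraction sub-module:
    concatenate each 2x2 neighbourhood of tokens (4C channels) and project
    down to 2C, halving H and W and doubling the channel count."""
    def __init__(self, c: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * c)
        self.reduction = nn.Linear(4 * c, 2 * c, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) -> (B, H/2, W/2, 2C); H and W assumed even
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)       # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))
```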
Further, it should be noted that, in some embodiments, step S340 may include: S3401, inputting the second small target image feature data into the multi-scale feature fusion interaction module of the small target detection model, wherein the multi-scale feature fusion interaction module comprises a second Swin Transformer module, a third Swin Transformer module, a fourth Swin Transformer module, a fifth Swin Transformer module, a sixth Swin Transformer module and a seventh Swin Transformer module; S3402, performing feature interaction processing on the fifth small target image feature data based on the second Swin Transformer module to obtain eighth small target image feature data; S3403, performing feature interaction processing on the sixth small target image feature data based on the third Swin Transformer module to obtain ninth small target image feature data; S3404, performing feature interaction processing on the seventh small target image feature data based on the fourth Swin Transformer module to obtain tenth small target image feature data; S3405, performing bilinear interpolation upsampling on the tenth small target image feature data and then adding it element by element to the ninth small target image feature data to obtain eleventh small target image feature data; S3406, performing bilinear interpolation upsampling on the eleventh small target image feature data and then adding it element by element to the eighth small target image feature data to obtain twelfth small target image feature data; S3407, performing feature interaction processing on the tenth small target image feature data based on the fifth Swin Transformer module to obtain thirteenth small target image feature data; S3408, performing feature interaction processing on the eleventh small target image feature data based on the sixth Swin Transformer module to obtain fourteenth small target image feature data; S3409, performing feature interaction processing on the twelfth small target image feature data based on the seventh Swin Transformer module to obtain fifteenth small target image feature data; S3410, performing bilinear interpolation downsampling on the fifteenth small target image feature data and then adding it element by element to the fourteenth small target image feature data to obtain sixteenth small target image feature data; S3411, performing bilinear interpolation downsampling on the sixteenth small target image feature data and then adding it element by element to the thirteenth small target image feature data to obtain seventeenth small target image feature data; S3412, integrating the fifteenth, sixteenth and seventeenth small target image feature data to obtain the small target image feature interaction data.
In some embodiments, the fifth small target image feature data, the sixth small target image feature data and the seventh small target image feature data are sent to the multi-scale feature fusion interaction module of the embodiment of the application to perform cross attention operation, so as to obtain fifteenth small target image feature data Y15, sixteenth small target image feature data Y16 and seventeenth small target image feature data Y17 after multi-scale feature fusion.
More specifically, as shown in fig. 3, first the fifth, sixth and seventh small target image feature data are respectively passed through the second, third and fourth Swin Transformer modules, each with output channel dimension c, to obtain three features with output sizes (H/8, W/8, c), (H/16, W/16, c) and (H/32, W/32, c): the eighth small target image feature data Y8, the ninth small target image feature data Y9 and the tenth small target image feature data Y10. Changing the channel counts of the different scales to the same value facilitates feature fusion. Then the lowest-resolution tenth small target image feature data Y10 is upsampled twofold by bilinear interpolation and added element by element to the medium-resolution ninth small target image feature data Y9 to obtain the eleventh small target image feature data Y11, and Y11 is upsampled twofold by bilinear interpolation and added element by element to the highest-resolution eighth small target image feature data Y8 to obtain the twelfth small target image feature data Y12. Next, Y10, Y11 and Y12 are respectively passed through three further Swin Transformer blocks, the fifth, sixth and seventh Swin Transformer modules, to obtain three features with output sizes (H/32, W/32, c), (H/16, W/16, c) and (H/8, W/8, c): the thirteenth small target image feature data Y13, the fourteenth small target image feature data Y14 and the fifteenth small target image feature data Y15. The largest-resolution fifteenth small target image feature data Y15 is downsampled twofold by bilinear interpolation and added element by element to the medium-resolution fourteenth small target image feature data Y14 to obtain the sixteenth small target image feature data Y16, and Y16 is downsampled twofold by bilinear interpolation and added element by element to the smallest-resolution thirteenth small target image feature data Y13 to obtain the seventeenth small target image feature data Y17. The outputs of the multi-scale feature fusion interaction module are the fifteenth, sixteenth and seventeenth small target image feature data. The module can make full use of the features of various scales generated by the feature extraction network, improve the network's perception of features at different scales, and alleviate both the problem that DETR only utilizes high-level semantic features without fully utilizing low-level semantic features and the poor detection of small targets caused by the absence of multi-scale feature fusion.
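The data flow of fig. 3 can be condensed into the following sketch; the 1×1 convolutions standing in for the second to seventh Swin Transformer modules, and the input channel counts, are simplifying assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Multi-scale feature fusion interaction module per fig. 3: unify the
    channels of the three scales (Y8-Y10), top-down upsample-and-add
    (Y11, Y12), transform again (Y13-Y15), then bottom-up downsample-and-add
    (Y16, Y17).  1x1 convolutions stand in for the six Swin Transformer
    modules purely to keep the sketch short."""
    def __init__(self, chans=(192, 384, 768), c=256):
        super().__init__()
        self.inp = nn.ModuleList(nn.Conv2d(ci, c, 1) for ci in chans)   # 2nd-4th
        self.mid = nn.ModuleList(nn.Conv2d(c, c, 1) for _ in range(3))  # 5th-7th

    def forward(self, y5, y6, y7):
        y8, y9, y10 = (m(y) for m, y in zip(self.inp, (y5, y6, y7)))
        resize = lambda t, ref: F.interpolate(
            t, size=ref.shape[-2:], mode="bilinear", align_corners=False)
        y11 = resize(y10, y9) + y9       # top-down: upsample, add element-wise
        y12 = resize(y11, y8) + y8
        y13, y14, y15 = self.mid[0](y10), self.mid[1](y11), self.mid[2](y12)
        y16 = resize(y15, y14) + y14     # bottom-up: downsample, add
        y17 = resize(y16, y13) + y13
        return y15, y16, y17             # resolutions H/8, H/16, H/32
```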
Further, it should be noted that, in some embodiments, step S350 may include: S3501, acquiring target query information based on the target query module of the small target detection model; S3502, inputting the small target image feature interaction data to the multi-scale transformation parallel decoder module of the small target detection model, wherein the multi-scale transformation parallel decoder module comprises a first cross attention module, a second cross attention module, a third cross attention module, a first transformer module, a second transformer module, a third transformer module, a fourth transformer module, a fifth transformer module, a sixth transformer module, a seventh transformer module, a first feedforward neural network layer 9, a second feedforward neural network layer 10 and a third feedforward neural network layer 11; S3503, combining the fifteenth small target image feature data with the target query information and inputting the combination into the first cross attention module for a cross attention operation to obtain a first small target image feature vector; S3504, combining the sixteenth small target image feature data with the target query information and inputting the combination into the second cross attention module for a cross attention operation to obtain a second small target image feature vector; S3505, combining the seventeenth small target image feature data with the target query information and inputting the combination into the third cross attention module for a cross attention operation to obtain a third small target image feature vector; S3506, performing feature transformation processing on the first small target image feature vector based on the first transformer module to obtain a fourth small target image feature vector; S3507, performing feature transformation processing on the second small target image feature vector based on the second transformer module to obtain a fifth small target image feature vector; S3508, performing feature transformation processing on the third small target image feature vector based on the third transformer module to obtain a sixth small target image feature vector; S3509, splicing the fourth, fifth and sixth small target image feature vectors and inputting the spliced result to the fourth transformer module for feature transformation processing to obtain a seventh small target image feature vector; S3510, splitting the seventh small target image feature vector to obtain an eighth, a ninth and a tenth small target image feature vector; S3511, performing feature transformation processing on the eighth small target image feature vector based on the fifth transformer module to obtain an eleventh small target image feature vector; S3512, performing feature transformation processing on the ninth small target image feature vector based on the sixth transformer module to obtain a twelfth small target image feature vector; S3513, performing feature transformation processing on the tenth small target image feature vector based on the seventh transformer module to obtain a thirteenth small target image feature vector; S3514, detecting the eleventh small target image feature vector based on the first feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a first small target image detection result; S3515, detecting the twelfth small target image feature vector based on the second feedforward neural network layer to obtain a second small target image detection result; S3516, detecting the thirteenth small target image feature vector based on the third feedforward neural network layer to obtain a third small target image detection result; S3517, integrating the first, second and third small target image detection results to obtain the small target image detection result.
In some embodiments, the fifteenth, sixteenth and seventeenth small target image feature data output by the multi-scale feature fusion interaction module, together with the target query information (object queries), are sent to the multi-scale transformation parallel decoder module of the embodiment of the application to obtain three final outputs of dimension (100, c), which yield the first, second and third small target image detection results. The target queries are represented by 100 learnable c-dimensional vectors; a target query is a special token used for predicting the category and position of a target, playing a role similar to the anchor frames or candidate frames in previous target detection networks. Each target query corresponds to a location in the output feature map of the model. With these target queries, DETR can predict the classes and positions of all targets present in the image simultaneously without using a priori anchor frames.
More specifically, as shown in fig. 4, the fifteenth, sixteenth and seventeenth small target image feature data each perform a cross attention operation with the 100 learnable c-dimensional target query vectors; these operations are implemented in the first, second and third cross attention modules and yield three feature vectors of dimension (100, c): the first small target image feature vector T1, the second small target image feature vector T2 and the third small target image feature vector T3. T1, T2 and T3 respectively pass through the first, second and third transformer modules to obtain the fourth, fifth and sixth small target image feature vectors T4, T5 and T6, with dimensions unchanged. T4, T5 and T6 are then spliced in the spatial dimension into a feature of dimension (300, c) and sent to the fourth transformer module for a self attention operation, giving the seventh small target image feature vector T7 of dimension (300, c); this self attention over the multi-branch features lets the branches, after interacting, know what size of target each of the others predicts. T7 is split into three parts in the spatial dimension to obtain the eighth, ninth and tenth small target image feature vectors T8, T9 and T10, which respectively pass through the fifth, sixth and seventh transformer modules to obtain the decoder outputs: the eleventh, twelfth and thirteenth small target image feature vectors T11, T12 and T13. Finally, T11, T12 and T13 are respectively processed by the first, second and third feedforward neural network layers (the FFN heads as in DETR) to obtain, for each branch, classification logits of dimension (100, K+1) and box coordinates of dimension (100, 4), where K is the number of prediction categories.
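The decoder data flow of fig. 4 can be sketched as follows; the head count, the use of nn.TransformerEncoderLayer for the transformer modules, and the single-layer FFN heads are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ParallelDecoder(nn.Module):
    """Multi-scale transformation parallel decoder per fig. 4: three fused
    maps cross-attend with 100 learnable object queries, pass per-branch
    transformer blocks, are concatenated for one shared self-attention step,
    split again, transformed once more, and fed to per-branch FFN heads."""
    def __init__(self, c: int = 256, n_query: int = 100, k: int = 80):
        super().__init__()
        self.query = nn.Embedding(n_query, c)            # object queries
        layer = lambda: nn.TransformerEncoderLayer(c, nhead=8, batch_first=True)
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(c, 8, batch_first=True) for _ in range(3))
        self.pre = nn.ModuleList(layer() for _ in range(3))    # 1st-3rd blocks
        self.joint = layer()                                   # 4th block
        self.post = nn.ModuleList(layer() for _ in range(3))   # 5th-7th blocks
        self.cls = nn.ModuleList(nn.Linear(c, k + 1) for _ in range(3))
        self.box = nn.ModuleList(nn.Linear(c, 4) for _ in range(3))

    def forward(self, feats):
        # feats: three maps (B, C, Hi, Wi) from the fusion module
        b = feats[0].shape[0]
        q = self.query.weight.unsqueeze(0).expand(b, -1, -1)   # (B, 100, C)
        branches = []
        for f, ca, tr in zip(feats, self.cross, self.pre):
            kv = f.flatten(2).transpose(1, 2)                  # (B, Hi*Wi, C)
            t, _ = ca(q, kv, kv)                               # cross attention
            branches.append(tr(t))                             # T4 / T5 / T6
        t7 = self.joint(torch.cat(branches, dim=1))            # (B, 300, C)
        outs = []
        for t, tr, ch, bh in zip(t7.chunk(3, dim=1),
                                 self.post, self.cls, self.box):
            t = tr(t)                                          # T11 / T12 / T13
            outs.append((ch(t), bh(t).sigmoid()))              # logits, boxes
        return outs  # three (logits (B,100,K+1), boxes (B,100,4)) pairs
```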
In summary, the embodiment of the application addresses two shortcomings of DETR: its decoder uses only the low-resolution, high-semantic features output by the encoder while ignoring high-resolution features, and its network structure performs no multi-scale fusion of the transformer features. Remedying these improves detection precision, network generalization and robustness; together with a loss function that is friendlier to small target detection, this solves the problem that DETR cannot detect small targets well.
Referring to fig. 5, the embodiment of the present application further provides a small target detection system based on a detection transformer, which can implement the above small target detection method based on the detection transformer. The system includes the following three modules (an illustrative skeleton is sketched after the list):
The first module is used for acquiring the data to be detected of the small target image and carrying out data enhancement processing to obtain the enhanced data to be detected of the small target image;
The second module is used for introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model;
And the third module is used for carrying out target detection processing on the enhanced data to be detected of the small target image based on the small target detection model to obtain a small target image detection result.
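As an illustration only, the three modules can be organized along the following lines in Python; the enhancement operations shown here (normalization, horizontal flip) are placeholder assumptions, not the enhancement scheme of the embodiment:

import torch

class SmallTargetDetectionSystem:
    def __init__(self, model):
        self.model = model                        # built by the second module

    def enhance(self, image):                     # first module: data enhancement
        image = (image - image.mean()) / (image.std() + 1e-6)   # example: normalize
        return torch.flip(image, dims=[-1])       # example: horizontal flip

    def detect(self, image):                      # third module: detection processing
        return self.model(self.enhance(image).unsqueeze(0))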
It can be understood that the content of the above method embodiment applies to this system embodiment; the functions implemented by the system embodiment are the same as those of the method embodiment, and the same beneficial effects are achieved.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method for detecting a small target based on a detection transformer, the method comprising:
Obtaining data to be detected of a small target image, and performing data enhancement processing to obtain the enhanced data to be detected of the small target image;
introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model;
And performing target detection processing on the enhanced data to be detected of the small target image based on the small target detection model to obtain a small target image detection result.
2. The method according to claim 1, wherein the small target detection model comprises a first feature extraction module, a second feature extraction module, a multi-scale feature fusion interaction module, a target query module and a multi-scale transformation parallel decoder module; an input end of the first feature extraction module is used for acquiring the enhanced data to be detected of the small target image; an output end of the first feature extraction module is connected with an input end of the second feature extraction module; an output end of the second feature extraction module is connected with an input end of the multi-scale feature fusion interaction module; an output end of the multi-scale feature fusion interaction module is connected with a first input end of the multi-scale transformation parallel decoder module; an output end of the target query module is connected with a second input end of the multi-scale transformation parallel decoder module; and an output end of the multi-scale transformation parallel decoder module is used for outputting the small target image detection result.
3. The method according to claim 1, wherein the performing, based on the small target detection model, target detection processing on the enhanced small target image data to be detected to obtain a small target image detection result includes:
Inputting the data to be detected of the enhanced small target image into the small target detection model;
performing feature extraction processing on the enhanced small target image data to be detected based on a first feature extraction module of the small target detection model to obtain first small target image feature data;
the second feature extraction module is based on the small target detection model, and performs feature extraction processing on the first small target image feature data to obtain second small target image feature data;
Performing feature interaction processing on the second small target image feature data based on the multi-scale feature fusion interaction module of the small target detection model to obtain small target image feature interaction data;
and performing target detection query processing on the small target image characteristic interaction data based on a target query module of the small target detection model and a multi-scale transformation parallel decoder module of the small target detection model to obtain a small target image detection result.
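The sequence of steps recited in claim 3 corresponds to a forward pass along the lines of the sketch below (PyTorch; every module here is a stand-in assumption, wired only to show the recited order):

import torch.nn as nn

class SmallTargetDetector(nn.Module):
    def __init__(self, stem, backbone, fusion, decoder, num_queries=100, c=256):
        super().__init__()
        self.stem = stem                   # first feature extraction module
        self.backbone = backbone           # second feature extraction module
        self.fusion = fusion               # multi-scale feature fusion interaction module
        self.decoder = decoder             # multi-scale transformation parallel decoder
        self.query_embed = nn.Embedding(num_queries, c)   # target query module

    def forward(self, x):
        f1 = self.stem(x)                  # first small target image feature data
        f2 = self.backbone(f1)             # second small target image feature data (3 scales)
        feats = self.fusion(f2)            # small target image feature interaction data
        q = self.query_embed.weight.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.decoder(feats, q)      # three small target image detection results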
4. The method of claim 3, wherein the performing feature extraction processing on the enhanced small target image data to be detected based on the first feature extraction module of the small target detection model to obtain first small target image feature data includes:
The enhanced small target image data to be detected is input to a first feature extraction module of the small target detection model, wherein the first feature extraction module comprises a flattening module, a full-connection module, a position coding module and a first Swin Transformer module;
flattening the data to be detected of the enhanced small target image based on the flattening module of the first feature extraction module to obtain third small target image feature data;
Based on a full connection module of the first feature extraction module, performing full connection processing on the third small target image feature data to obtain fourth small target image feature data;
Acquiring position coding information based on a position coding module of the first feature extraction module;
weighting the fourth small target image characteristic data and the position coding information to obtain weighted fourth small target image characteristic data;
And carrying out feature extraction and transformation processing on the weighted fourth small target image feature data based on the first Swin Transformer module of the first feature extraction module to obtain the first small target image feature data.
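A minimal sketch of such a first feature extraction module follows, assuming a patch-embedding reading of the flattening and full-connection steps; the patch size, channel width and input resolution are illustrative assumptions, and a plain transformer layer stands in for the Swin Transformer module:

import torch
import torch.nn as nn

class FirstFeatureExtraction(nn.Module):
    def __init__(self, patch=4, in_ch=3, c=96, num_patches=56*56, heads=3):
        super().__init__()
        self.flatten = nn.Unfold(kernel_size=patch, stride=patch)     # flattening module
        self.fc = nn.Linear(in_ch * patch * patch, c)                 # full-connection module
        self.pos = nn.Parameter(torch.zeros(1, num_patches, c))      # position coding module
        self.swin = nn.TransformerEncoderLayer(c, heads, batch_first=True)
        # NOTE: a real Swin Transformer block uses windowed attention; a plain
        # transformer layer is used here only to keep the sketch self-contained.

    def forward(self, x):                       # x: (B, 3, 224, 224)
        t = self.flatten(x).transpose(1, 2)     # third feature data: (B, 3136, 48)
        t = self.fc(t)                          # fourth feature data: (B, 3136, 96)
        t = t + self.pos                        # weighted with position coding
        return self.swin(t)                     # first small target image feature data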
5. The method according to claim 3, wherein the performing feature extraction processing on the first small target image feature data based on the second feature extraction module of the small target detection model to obtain second small target image feature data includes:
Inputting the first small target image feature data to a second feature extraction module of the small target detection model, wherein the second feature extraction module comprises a first feature extraction sub-module, a second feature extraction sub-module and a third feature extraction sub-module;
based on a first feature extraction sub-module of the second feature extraction module, performing feature extraction processing on the first small target image feature data to obtain fifth small target image feature data;
based on a second feature extraction sub-module of the second feature extraction module, performing feature extraction processing on the feature data of the fifth small target image to obtain feature data of a sixth small target image;
Based on a third feature extraction sub-module of the second feature extraction module, performing feature extraction processing on the feature data of the sixth small target image to obtain feature data of a seventh small target image;
And integrating the fifth small target image characteristic data, the sixth small target image characteristic data and the seventh small target image characteristic data to obtain the second small target image characteristic data.
6. The method of claim 5, wherein the first feature extraction sub-module, the second feature extraction sub-module, and the third feature extraction sub-module each comprise a patch merging module and a Swin Transformer module, an output of the patch merging module being connected to an input of the Swin Transformer module, wherein:
The patch merging module is used for performing resolution reduction processing on input data;
the Swin Transformer module is used for carrying out feature extraction and feature transformation processing on input data.
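One common realization of such a patch merging module, following the form used in the Swin Transformer (offered as a hedged sketch, not necessarily the exact implementation of this embodiment), halves the resolution and doubles the channels:

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.norm = nn.LayerNorm(4 * c)
        self.reduce = nn.Linear(4 * c, 2 * c, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                 # gather each 2x2 neighbourhood
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduce(self.norm(x))         # (B, H/2, W/2, 2C)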
7. The method of claim 5, wherein the performing feature interaction processing on the second small target image feature data based on the multi-scale feature fusion interaction module of the small target detection model to obtain small target image feature interaction data includes:
Inputting the second small target image feature data to a multi-scale feature fusion interaction module of the small target detection model, wherein the multi-scale feature fusion interaction module comprises a second Swin Transformer module, a third Swin Transformer module, a fourth Swin Transformer module, a fifth Swin Transformer module, a sixth Swin Transformer module and a seventh Swin Transformer module;
performing feature interaction processing on the fifth small target image feature data based on a second Swin Transformer module of the multi-scale feature fusion interaction module to obtain eighth small target image feature data;
performing feature interaction processing on the sixth small target image feature data based on a third Swin Transformer module of the multi-scale feature fusion interaction module to obtain ninth small target image feature data;
performing feature interaction processing on the seventh small target image feature data based on a fourth Swin Transformer module of the multi-scale feature fusion interaction module to obtain tenth small target image feature data;
Performing bilinear interpolation up-sampling processing on the tenth small target image feature data, and then performing element-by-element addition of the up-sampled result and the ninth small target image feature data to obtain eleventh small target image feature data;
performing bilinear interpolation up-sampling processing on the eleventh small target image feature data, and then performing element-by-element addition of the up-sampled result and the eighth small target image feature data to obtain twelfth small target image feature data;
performing feature interaction processing on the tenth small target image feature data based on a fifth Swin Transformer module of the multi-scale feature fusion interaction module to obtain thirteenth small target image feature data;
performing feature interaction processing on the eleventh small target image feature data based on a sixth Swin Transformer module of the multi-scale feature fusion interaction module to obtain fourteenth small target image feature data;
performing feature interaction processing on the twelfth small target image feature data based on a seventh Swin Transformer module of the multi-scale feature fusion interaction module to obtain fifteenth small target image feature data;
performing bilinear interpolation down-sampling processing on the fifteenth small target image feature data, and then performing element-by-element addition of the down-sampled result and the fourteenth small target image feature data to obtain sixteenth small target image feature data;
performing bilinear interpolation down-sampling processing on the sixteenth small target image feature data, and then performing element-by-element addition of the down-sampled result and the thirteenth small target image feature data to obtain seventeenth small target image feature data;
And integrating the fifteenth small target image characteristic data, the sixteenth small target image characteristic data and the seventeenth small target image characteristic data to obtain the small target image characteristic interaction data.
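The up-and-down fusion path of claim 7 can be sketched as follows (PyTorch; identity assignments stand in for the six Swin Transformer modules, and the feature strides of 8 / 16 / 32 are assumptions):

import torch.nn.functional as F

def fuse(f5, f6, f7):
    # f5, f6, f7: (B, C, H, W) features at strides 8 / 16 / 32 (assumed)
    f8, f9, f10 = f5, f6, f7                      # per-scale interaction (stand-ins)
    # top-down: bilinear up-sampling + element-by-element addition
    f11 = f9 + F.interpolate(f10, size=f9.shape[-2:], mode='bilinear', align_corners=False)
    f12 = f8 + F.interpolate(f11, size=f8.shape[-2:], mode='bilinear', align_corners=False)
    f13, f14, f15 = f10, f11, f12                 # second per-scale interaction (stand-ins)
    # bottom-up: bilinear down-sampling + element-by-element addition
    f16 = f14 + F.interpolate(f15, size=f14.shape[-2:], mode='bilinear', align_corners=False)
    f17 = f13 + F.interpolate(f16, size=f13.shape[-2:], mode='bilinear', align_corners=False)
    return f15, f16, f17                          # small target image feature interaction data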
8. The method according to claim 7, wherein the performing target detection query processing on the small target image feature interaction data based on the target query module of the small target detection model and the multi-scale transformation parallel decoder module of the small target detection model to obtain the small target image detection result includes:
acquiring target query information based on a target query module of the small target detection model;
Inputting the small target image feature interaction data to a multi-scale transformation parallel decoder module of the small target detection model, wherein the multi-scale transformation parallel decoder module comprises a first cross attention module, a second cross attention module, a third cross attention module, a first transformer module, a second transformer module, a third transformer module, a fourth transformer module, a fifth transformer module, a sixth transformer module, a seventh transformer module, a first feedforward neural network layer, a second feedforward neural network layer and a third feedforward neural network layer;
combining the fifteenth small target image feature data with the target query information and inputting them into a first cross attention module of the multi-scale transformation parallel decoder module to perform a cross attention operation to obtain a first small target image feature vector;
combining the sixteenth small target image feature data with the target query information and inputting them into a second cross attention module of the multi-scale transformation parallel decoder module to perform a cross attention operation to obtain a second small target image feature vector;
combining the seventeenth small target image feature data with the target query information and inputting them into a third cross attention module of the multi-scale transformation parallel decoder module to perform a cross attention operation to obtain a third small target image feature vector;
performing feature transformation processing on the first small target image feature vector based on a first transformer module of the multi-scale transformation parallel decoder module to obtain a fourth small target image feature vector;
performing feature transformation processing on the second small target image feature vector based on a second transformer module of the multi-scale transformation parallel decoder module to obtain a fifth small target image feature vector;
performing feature transformation processing on the third small target image feature vector based on a third transformer module of the multi-scale transformation parallel decoder module to obtain a sixth small target image feature vector;
splicing the fourth small target image feature vector, the fifth small target image feature vector and the sixth small target image feature vector, and inputting the spliced result into the fourth transformer module for feature transformation processing to obtain a seventh small target image feature vector;
splitting the seventh small target image feature vector to obtain an eighth small target image feature vector, a ninth small target image feature vector and a tenth small target image feature vector;
Performing feature transformation processing on the eighth small target image feature vector based on a fifth transformer module of the multi-scale transformation parallel decoder module to obtain an eleventh small target image feature vector;
performing feature transformation processing on the ninth small target image feature vector based on a sixth transformer module of the multi-scale transformation parallel decoder module to obtain a twelfth small target image feature vector;
performing feature transformation processing on the tenth small target image feature vector based on a seventh transformer module of the multi-scale transformation parallel decoder module to obtain a thirteenth small target image feature vector;
Detecting the feature vector of the eleventh small target image based on a first feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a first small target image detection result;
Detecting the feature vector of the twelfth small target image based on a second feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a second small target image detection result;
Detecting the feature vector of the thirteenth small target image based on a third feedforward neural network layer of the multi-scale transformation parallel decoder module to obtain a third small target image detection result;
And integrating the first small target image detection result, the second small target image detection result and the third small target image detection result to obtain the small target image detection result.
9. The method according to claim 1, wherein the loss function of the small target detection model comprises a normalized average error loss function and a distance intersection-over-union (DIoU) loss function, and the expression of the loss function of the small target detection model is specifically as follows:
$$
L_{total}=\sum_{i=1}^{S}\Big[-\log\hat{p}_{\hat{\sigma}(i)}\big(c_i^{1}\big)+\mathbb{1}_{\{c_i^{1}\neq\varnothing\}}\,l_{bbox}\big(b_i^{1},\hat{b}_{\hat{\sigma}(i)}^{1}\big)\Big]
+\sum_{i=1}^{M}\Big[-\log\hat{p}_{\hat{\sigma}(i)}\big(c_i^{2}\big)+\mathbb{1}_{\{c_i^{2}\neq\varnothing\}}\,l_{bbox}\big(b_i^{2},\hat{b}_{\hat{\sigma}(i)}^{2}\big)\Big]
+\sum_{i=1}^{L}\Big[-\log\hat{p}_{\hat{\sigma}(i)}\big(c_i^{3}\big)+\mathbb{1}_{\{c_i^{3}\neq\varnothing\}}\,l_{bbox}\big(b_i^{3},\hat{b}_{\hat{\sigma}(i)}^{3}\big)\Big]
$$
In the above equation, $L_{total}$ denotes the loss function of the small target detection model; $y_i^{1}=(c_i^{1},b_i^{1})$, $y_i^{2}=(c_i^{2},b_i^{2})$ and $y_i^{3}=(c_i^{3},b_i^{3})$ denote the $i$-th targets in the first, second and third labeled target sets, with $c_i^{k}$ the category and $b_i^{k}$ the position information; $S$, $M$ and $L$ denote the total numbers of first, second and third labeled targets; the indicator $\mathbb{1}_{\{c_i^{k}\neq\varnothing\}}$ means that the regression box loss is computed only when the corresponding labeled target is not an empty set; $l_{bbox}(\cdot)$ denotes the regression box position loss; $\hat{p}_{\hat{\sigma}(i)}(c_i^{k})$ denotes the probability of category $c_i^{k}$ in the prediction box matched to the $i$-th target of the $k$-th labeled set by the bipartite graph Hungarian matching algorithm; and $\hat{b}_{\hat{\sigma}(i)}^{k}$ denotes the coordinate information of that matched prediction box.
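Under the stated structure, the matched per-branch loss might be sketched as below (PyTorch; the (x1, y1, x2, y2) box format, the L1 term standing in for the normalized average error, and the DIoU implementation are assumptions, and the Hungarian matching itself is omitted for brevity):

import torch
import torch.nn.functional as F

def diou_loss(pred, gt):                         # boxes as (x1, y1, x2, y2)
    inter_lt = torch.max(pred[:, :2], gt[:, :2])
    inter_rb = torch.min(pred[:, 2:], gt[:, 2:])
    inter = (inter_rb - inter_lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).clamp(min=0).prod(dim=1)
    area_g = (gt[:, 2:] - gt[:, :2]).clamp(min=0).prod(dim=1)
    iou = inter / (area_p + area_g - inter + 1e-7)
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_g = (gt[:, :2] + gt[:, 2:]) / 2
    d2 = ((ctr_p - ctr_g) ** 2).sum(dim=1)       # squared centre distance
    enc_lt = torch.min(pred[:, :2], gt[:, :2])
    enc_rb = torch.max(pred[:, 2:], gt[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + 1e-7   # squared hull diagonal
    return (1 - iou + d2 / c2).mean()

def branch_loss(logits, boxes, tgt_cls, tgt_box, num_classes):
    # logits: (100, K+1); boxes: (100, 4); tgt_cls: (100,) with class K meaning
    # "no object"; tgt_box: (100, 4); per-query alignment is assumed to come
    # from the bipartite graph Hungarian matching, not shown here.
    cls_loss = F.cross_entropy(logits, tgt_cls)          # -log p term
    keep = tgt_cls != num_classes                        # box loss only if non-empty
    if keep.any():
        p, g = boxes[keep], tgt_box[keep]
        return cls_loss + F.l1_loss(p, g) + diou_loss(p, g)
    return cls_loss

The total loss then sums branch_loss over the three decoder branches, matching the three summations of the expression above.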
10. A small target detection system based on a detection transformer, the system comprising:
The first module is used for acquiring the data to be detected of the small target image and carrying out data enhancement processing to obtain the enhanced data to be detected of the small target image;
The second module is used for introducing a multi-scale feature fusion interaction module and a multi-scale transformation parallel decoder module to construct a small target detection model;
And the third module is used for carrying out target detection processing on the enhanced data to be detected of the small target image based on the small target detection model to obtain a small target image detection result.
CN202410041088.5A 2024-01-10 2024-01-10 Small target detection method and system based on detection converter Pending CN117975036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410041088.5A CN117975036A (en) 2024-01-10 2024-01-10 Small target detection method and system based on detection converter

Publications (1)

Publication Number Publication Date
CN117975036A true CN117975036A (en) 2024-05-03

Family

ID=90858909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410041088.5A Pending CN117975036A (en) 2024-01-10 2024-01-10 Small target detection method and system based on detection converter

Country Status (1)

Country Link
CN (1) CN117975036A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869138A (en) * 2021-09-06 2021-12-31 深延科技(北京)有限公司 Multi-scale target detection method and device and computer readable storage medium
CN114359565A (en) * 2021-12-14 2022-04-15 阿里巴巴(中国)有限公司 Image detection method, storage medium and computer terminal
US20230184927A1 (en) * 2021-12-15 2023-06-15 Anhui University Contextual visual-based sar target detection method and apparatus, and storage medium
CN116503641A (en) * 2023-03-16 2023-07-28 上海环境物流有限公司 Small target detection method based on multi-scale feature fusion
CN117372706A (en) * 2023-07-11 2024-01-09 电子科技大学 Multi-scale deformable character interaction relation detection method

Similar Documents

Publication Publication Date Title
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN115273464A (en) Traffic flow prediction method based on improved space-time Transformer
Yang et al. Spatio-temporal domain awareness for multi-agent collaborative perception
CN113554039B (en) Method and system for generating optical flow graph of dynamic image based on multi-attention machine system
CN114863407A (en) Multi-task cold start target detection method based on visual language depth fusion
CN116152611A (en) Multistage multi-scale point cloud completion method, system, equipment and storage medium
JP2020038574A (en) Image learning program, image learning method, image recognition program, image recognition method, and image recognition device
CN112232165A (en) Data processing method and device, computer and readable storage medium
CN115661635A (en) Hyperspectral image reconstruction method based on Transformer fusion convolutional neural network
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
CN114332592A (en) Ocean environment data fusion method and system based on attention mechanism
CN117975036A (en) Small target detection method and system based on detection converter
CN113298235A (en) Neural network architecture of multi-branch depth self-attention transformation network and implementation method
CN113240586A (en) Bolt image super-resolution processing method capable of adaptively adjusting amplification factor
CN117372617A (en) Point cloud data reconstruction method based on GCN-converter model and electronic equipment
CN113487530A (en) Infrared and visible light fusion imaging method based on deep learning
CN116030330A (en) Target detection method and device
CN115170807B (en) Image segmentation and model training method, device, equipment and medium
CN116913413A (en) Ozone concentration prediction method, system, medium and equipment based on multi-factor driving
CN111914894A (en) Feature extraction method and device, electronic equipment and computer-readable storage medium
CN116403190A (en) Track determination method and device for target object, electronic equipment and storage medium
CN116631181A (en) Traffic prediction method based on weather perception graph attention network
CN113327226B (en) Target detection method, target detection device, electronic equipment and storage medium
CN116051850A (en) Neural network target detection method, device, medium and embedded electronic equipment
Wang et al. SST: A Simplified Swin Transformer-based Model for Taxi Destination Prediction based on Existing Trajectory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination