CN117576379A - Target detection method based on meta-learning combined attention mechanism network model

Target detection method based on meta-learning combined attention mechanism network model

Info

Publication number
CN117576379A
CN117576379A (application CN202410052133.7A)
Authority
CN
China
Prior art keywords
image
model
network
attention
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410052133.7A
Other languages
Chinese (zh)
Other versions
CN117576379B (en)
Inventor
汪俊
蔡升堰
濮宬涵
林子煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202410052133.7A
Publication of CN117576379A
Application granted
Publication of CN117576379B
Legal status: Active
Anticipated expiration


Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI] (image preprocessing)
    • G06N 3/045: Combinations of networks; auto-encoder networks, encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 3/0985: Hyperparameter optimisation; meta-learning; learning-to-learn
    • G06V 10/764: Recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features at preprocessing, feature-extraction or classification level
    • G06V 10/82: Recognition using neural networks
    • G06V 2201/07: Target detection


Abstract

The invention relates to a target detection method based on a meta-learning combined attention mechanism network model, comprising the following steps: collecting general images and target images to be detected; constructing the meta-learning combined attention detection network MetaSwinNet model; on the training tasks, following the meta-learning Learn-to-initialize paradigm, training the network model with the support sets and query sets of the general images to obtain a network detection model with optimal initialization parameters; on the test task, freezing the backbone network of the network detection model, inputting the support set of target images into the model for training, and using the CWT adaptive update module to update and optimize the network parameters of the multi-scale feature fusion module and the attention decoding detection module, obtaining the final target detection model. The method can overcome difficulties such as few effective target samples, tiny features and severe occlusion, effectively extract image context information, and realize accurate detection of small targets against complex backgrounds.

Description

Target detection method based on meta-learning combined attention mechanism network model
Technical Field
The invention relates to the technical field of image processing, and in particular to a target detection method based on a meta-learning combined attention mechanism network model.
Background
With the continuously increasing degree of industrial intelligence, great demands are placed on target detection technology in industrial production. Target detection is used extensively for product positioning and tracking, quality control, safety monitoring and the like, and plays an increasingly important role in the industrial field. The object of study in target detection is mainly the two-dimensional image: the task is to give a predicted bounding box for the object to be detected in the image and to classify it correctly. In industrial production scenes the image data differ from general images, being characterized in particular by complex backgrounds and an insufficient number of effective samples, which challenges existing target detection techniques. The traditional convolutional-neural-network-based methods adopted in current engineering practice extract target features by convolution; when facing complex scenes, insufficient effective samples, very small targets and severe occlusion, they often find it difficult to model the relations among target features, or even lose features, and cannot effectively solve these problems. Against this background, the invention aims to solve the problems of insufficient effective image samples and of contextual linkage in complex scenes, and to improve the detection accuracy of target detection in industrial production. By overcoming these problems, target detection technology can further improve the efficiency, quality-control level, safety and automation level of industrial production.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the invention provides a target detection method based on a meta-learning combined attention mechanism network model, which addresses the small sample size of image detection tasks and the problem of contextual linkage in complex scenes. The encoder built on the moving-window hierarchical self-attention model Swin-Transformer can model image information globally, so the semantic information of the image is better extracted; the constructed multi-scale feature fusion module further strengthens the semantic relations of the image's scene context and enhances the feature extraction capability of the model. The invention adopts the meta-learning Learn-to-initialize paradigm, which improves the model's adaptability to few-sample conditions in different scenes and thus alleviates the few-sample problem. The support sets and query sets of the training tasks are prepared from the COCO data set, and the support set and query set of the test task are prepared from photographed images of the target to be detected; with the Learn-to-initialize meta-learning paradigm, model training and weight updating are performed, and the final detection model for the target to be detected is obtained after training. The method can overcome difficulties such as few effective target samples, tiny features and severe occlusion, effectively extract image context information, and realize accurate detection of small targets against complex backgrounds.
In order to solve the above technical problems, the invention provides the following technical scheme: a target detection method based on a meta-learning combined attention mechanism network model, comprising the following steps:
S1, collecting general images, used to build the support set and query set of the training tasks, and collecting target images to be detected, used to build the support set and query set of the test task;
S2, constructing the meta-learning combined attention detection network MetaSwinNet model, which consists of an encoder whose backbone network is the moving-window hierarchical self-attention model Swin-Transformer, a multi-scale feature fusion module, an attention decoding detection module, and a CWT adaptive update module;
S3, on the training tasks, following the meta-learning Learn-to-initialize paradigm, training the MetaSwinNet model with the support sets and query sets of the general images and updating the model's initialization parameters to obtain a network detection model with the optimal initialization parameters $\theta^{*}$;
S4, on the test task, freezing the backbone network of the network detection model, inputting the support set of target images into the model for training, and using the CWT adaptive update module to update and optimize the network parameters of the multi-scale feature fusion module and the attention decoding detection module, obtaining the final target detection model.
Further, step S1 specifically comprises the following steps:
S11, obtaining general images by downloading the common-object detection data set COCO and dividing them, to build the support sets and query sets of the training tasks in the meta-learning training paradigm;
S12, acquiring target images in the scene to be detected with an industrial camera, screening and graying them, classifying the images according to the characteristics of the targets, annotating the images once the target types are defined, and selecting the processed target images and their corresponding annotation files to build the support set and query set of the test task.
Further, step S3 specifically comprises the following steps:
S31, the Swin-Transformer of S2 connects in series a patch partition module (Patch Partition) and four consecutive Stage modules built around the moving-window hierarchical self-attention module (Swin Transformer Block);
S32, the support set of general images is input into the patch partition module of the Swin-Transformer backbone, which divides the image into four equal parts along height and width; the Linear Embedding module of the first Stage then embeds the image patches into a low-dimensional space, and the second, third and fourth Stage modules extract the first, second and third image semantic-spatial feature maps, respectively;
S33, the multi-scale feature fusion module comprises a first input-feature convolution layer, a first upsampling layer, a second input-feature convolution layer and a second upsampling layer. The extracted third semantic-spatial feature map is fed into the first input-feature convolution layer; the convolved feature is upsampled by the first upsampling layer and spliced with the second semantic-spatial feature; the result is fed into the second input-feature convolution layer and, after convolution, upsampled by the second upsampling layer; the output is spliced with the first semantic-spatial feature map to obtain the enhanced first semantic-spatial feature. The enhanced first feature is convolved and spliced with the enhanced second feature, which in turn is convolved and spliced with the enhanced third feature; the results are then convolved separately and output as the multi-scale fused features;
S34, the scale attention, spatial attention and task attention of the multi-scale fused features are computed; the features output by the backbone network and by the multi-scale feature fusion module are input into the attention decoding detection module, which finally outputs the bounding box and category of the model's predicted target;
S35, the loss function is calculated with the query sets of the general images, and the initial parameters of the model are updated to obtain a network detection model with the optimal initialization parameters $\theta^{*}$.
Further, in S31 the processing of the moving-window hierarchical self-attention module (Swin Transformer Block) can be divided into two stages. The first stage applies layer normalization to the input, computes window-based multi-head attention, and combines the input with the current output through a residual connection; it then applies layer normalization followed by a multi-layer perceptron (MLP), again with a residual connection, and merges patches through the patch merging module (Patch Merging). The second stage takes the output of the first stage as its input and repeats the first-stage procedure, but uses a moving-window multi-head attention mechanism when computing attention. The specific structure, in order, is: a first normalization layer, a window attention layer, a second normalization layer, a first multi-layer perceptron (MLP) layer, a third normalization layer, a moving-window attention layer, a fourth normalization layer, and a second multi-layer perceptron (MLP) layer. The activation function of the first and second MLP layers is the Gaussian error linear unit (GELU), with the convolution kernel size set to 1x1.
Further, in S33 the numbers of input channels of the first and second input-feature convolution layers are 192 and 384, respectively; the convolution kernel size is set to 1x1, the stride to 1, and the number of convolution kernels to 512 for both.
Further, step S35 specifically comprises: on the training tasks, the outer loop feeds the support sets of the multiple training tasks built from the general images into the meta-learning combined attention detection network MetaSwinNet; the inner loop computes the loss function on the support set of each training task and iteratively updates the current network's task-specific parameters by gradient descent; the outer loop then computes the meta-loss on the query set of each task on the basis of the task-updated networks and performs a gradient update, updating the model's initial parameters to obtain a network detection model with the optimal initialization parameters $\theta^{*}$.
Further, updating the initial parameters of the model in S35 specifically comprises: define the meta-learning knowledge as $\omega$ and the network initialization parameters as $\theta$; under the meta-learning Learn-to-initialize paradigm, $\theta$ is initialized from $\omega$ for each training task. On a training task $i$, the task loss $L(\theta; D^{s}_{i})$ is computed on the support set and used to update the network parameters:

$$\theta'_{i} = \theta - \beta \nabla_{\theta} L(\theta; D^{s}_{i})$$

The meta-learning loss is then computed on the query set and used to update the meta-learning knowledge, i.e.:

$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \sum_{i} L(\theta'_{i}; D^{q}_{i})$$

where $L$ is the loss function, $\alpha$ and $\beta$ are the learning rates for training on the query set and the support set, respectively, $D^{q}_{i}$ is the query set of training task $i$, and $D^{s}_{i}$ is its support set.
Further, in step S4 the input of the CWT weight-adaptive update module takes the form of (Query, Key, Value) triplets, which are then processed by a linear layer, multi-head attention computation, and layer normalization (Layer Norm); residual connections are used to complete the weight update of the feature fusion module and the attention decoding detection module.
Further, step S4 specifically comprises: on the test task, the support set of the test task is input into the network detection model with the optimal initialization parameters $\theta^{*}$, and the weight-adaptive module CWT is used for optimization, yielding the optimal network $\theta^{**}$ on the target task to be detected.
In the update process of the weight-adaptive module CWT, in order to learn discriminative query-conditioned information, the input is designed as:

$$Q = W\,\psi_{q}, \qquad K = F\,\psi_{k}, \qquad V = F\,\psi_{v}$$

where $\psi_{q}$, $\psi_{k}$, $\psi_{v}$ are learnable parameters, each realized by a fully connected layer; $W$ denotes the weights of the feature fusion module and the attention decoding detection module to be updated; and $F$ is the image feature extracted by the backbone network in the scene to be detected. Projecting the weights of the feature fusion module and the attention decoding detection module into the $d_{a}$-dimensional latent space allows these weights to adapt to the target query image in the scene to be detected, forming the attention of the feature fusion module and the attention decoding detection module over the target query image to be detected:

$$W^{*} = W + \psi\!\left(\sigma\!\left(\frac{QK^{\top}}{\sqrt{d_{a}}}\right)V\right)$$

where $\sigma$ is a row-wise softmax function used for attention normalization and $\psi$ is a linear layer with input dimension $d_{a}$; residual learning is adopted so that model convergence is more stable, realizing the adaptive updating of the network for the task to be detected.
By means of the technical scheme, the invention provides a target detection method based on a meta-learning combined attention mechanism network model, which has at least the following beneficial effects:
Firstly, the invention adopts the moving-window hierarchical self-attention model Swin-Transformer as the backbone network. Through its self-attention and multi-head attention mechanisms it better captures the semantic information of an image and models long-range dependencies, overcoming the insufficient context linkage and loss of detail features of conventional convolutional neural network techniques; the feature extraction capability is thereby markedly enhanced and the detection accuracy of the model effectively improved.
Secondly, the invention adopts the meta-learning training paradigm Learn to initialize. The training process is divided into training tasks and test tasks, and a model with good initialization capability is obtained using the support sets and query sets of the training tasks, giving the model the ability to adapt quickly to different few-sample scenes.
Thirdly, the invention combines the meta-learning training paradigm with an adaptive update module: the test-task support set is used for training, and the weight-adaptive module CWT further optimizes the model parameters to obtain the target detection model for the scene to be detected. This overcomes the few-sample problem and solves the technical problem that small-sample data in a real production environment cannot support accurate target detection in images.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a simplified flow chart of a method for detecting targets based on a meta-learning combined attention mechanism network model;
FIG. 2 is a schematic view of the clamp classes defined in the present invention;
fig. 3 is a schematic diagram of the meta-learning combined attention detection network MetaSwinNet according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention become more readily apparent, the invention is described in more detail below with reference to the accompanying drawings and the detailed description, so that the process of applying the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.
Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware; accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Referring to FIGS. 1-3, a specific implementation of this embodiment is shown. On the training tasks, the moving-window hierarchical self-attention model Swin-Transformer serves as the backbone network for multi-scale extraction of general image features; the extracted features are fused by the multi-scale feature fusion module so as to be applicable to different few-sample tasks, and the feature maps output by the backbone network together with the output of the multi-scale fusion module enter the detection module as input. On the test task, the invention uses the CWT adaptive module to update the feature fusion module and the detection module, obtaining a target detection model and realizing adaptive detection of the target. Using the meta-learning Learn-to-initialize paradigm, the model is trained to learn initialization parameters so that it converges faster on the test detection task, improving its generalization capability and adaptability on that task and addressing the insufficient model capacity caused by scarce industrial samples.
Referring to FIG. 1, this embodiment takes clamp detection in an aero-engine pipeline system as an example and provides a target detection method based on a meta-learning combined attention mechanism network model, comprising the following steps:
S1, collecting general images, used to build the support set and query set of the training tasks, and collecting target images to be detected, used to build the support set and query set of the test task;
as a preferred embodiment of step S1, the specific procedure comprises the steps of:
S11, obtaining general images by downloading the common-object detection data set COCO and dividing them, to build the support sets and query sets of the training tasks in the meta-learning training paradigm, with a support:query division ratio of 8:2; the number of training tasks is set to 80 according to the number of target classes in the COCO data set;
S12, acquiring target images, namely clamp images, in the scene to be detected with an industrial camera, then screening and graying them. The images are classified according to the characteristics of the targets: by appearance, the clamps are divided into three types, namely single-hole clamps, double-hole clamps and triple-hole clamps, with the specific clamp class definitions shown in FIG. 2. After the target clamp types are defined, the clamp images are annotated, and 163 processed target clamp images from the scene to be detected, together with their corresponding annotation files, are selected to build the support set and query set of the test task, with a support:query division ratio of 8:2; the image resolution is 2700x2900.
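By way of illustration, a minimal Python sketch of the 8:2 support/query division described in S11-S12 follows; the file names and helper structure are assumptions of this illustration, not part of the patented method:

```python
import random

def split_support_query(image_paths, support_ratio=0.8, seed=0):
    """Split one task's annotated images into a support set and a query set (8:2)."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_support = int(len(paths) * support_ratio)
    return paths[:n_support], paths[n_support:]

# Example: 163 annotated clamp images for the test task (hypothetical file names)
clamp_images = [f"clamp_{i:03d}.png" for i in range(163)]
support_set, query_set = split_support_query(clamp_images)
print(len(support_set), len(query_set))  # 130 support images, 33 query images
```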
S2, constructing the meta-learning combined attention detection network MetaSwinNet model, which consists of an encoder whose backbone network is the moving-window hierarchical self-attention model Swin-Transformer, a multi-scale feature fusion module, an attention decoding detection module, and a CWT adaptive update module;
S3, on the training tasks, following the meta-learning Learn-to-initialize paradigm, training the MetaSwinNet model with the support sets and query sets of the general images and updating the model's initialization parameters to obtain a clamp network detection model with the optimal initialization parameters $\theta^{*}$;
as a preferred embodiment of step S3, the specific procedure comprises the steps of:
S31, the Swin-Transformer of S2 connects in series a patch partition module (Patch Partition) and four consecutive Stage modules built around the moving-window hierarchical self-attention module (Swin Transformer Block). The input image (i.e., a COCO common-object detection image of the training task) is taken as the input of the patch partition module. The Stage 1 module consists of a Linear Embedding module and a moving-window self-attention module (Swin Transformer Block); the Stage 2 to Stage 4 modules share the same structure, each comprising a patch merging module (Patch Merging) and a moving-window hierarchical self-attention module (Swin Transformer Block);
More specifically, in S31 the processing of the moving-window hierarchical self-attention module (Swin Transformer Block) can be divided into two stages. The first stage applies layer normalization to the input, computes window-based multi-head attention, and combines the input with the current output through a residual connection; it then applies layer normalization followed by a multi-layer perceptron (MLP), again with a residual connection, and merges patches through the patch merging module (Patch Merging). The second stage takes the output of the first stage as its input and repeats the first-stage procedure, but uses a moving-window multi-head attention mechanism when computing attention. The specific structure, in order, is: a first normalization layer, a window attention layer, a second normalization layer, a first multi-layer perceptron (MLP) layer, a third normalization layer, a moving-window attention layer, a fourth normalization layer, and a second multi-layer perceptron (MLP) layer. The activation function of the first and second MLP layers is the Gaussian error linear unit (GELU), with the convolution kernel size set to 1x1. The GELU function is:

$$\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715x^{3}\bigr)\right)\right)$$

where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution.
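A minimal PyTorch-style sketch of the two-stage block structure described above follows; nn.MultiheadAttention stands in for the (shifted-)window attention and the dimensions are assumptions, so this is a simplified illustration rather than the actual Swin Transformer Block:

```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Two-stage block: a window-attention stage followed by a moving-window stage,
    each with pre-norm, residual connections, and a GELU MLP, as described above."""
    def __init__(self, dim=96, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.norm1 = nn.LayerNorm(dim)                                          # first normalization layer
        self.attn_w = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # window attention (stand-in)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp1 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm3 = nn.LayerNorm(dim)
        self.attn_sw = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # moving-window attention (stand-in)
        self.norm4 = nn.LayerNorm(dim)
        self.mlp2 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                       # x: (B, L, C) token sequence
        h = self.norm1(x)
        x = x + self.attn_w(h, h, h)[0]         # residual connection around attention
        x = x + self.mlp1(self.norm2(x))        # residual connection around MLP
        h = self.norm3(x)
        x = x + self.attn_sw(h, h, h)[0]
        x = x + self.mlp2(self.norm4(x))
        return x
```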
S32, the support set of general images is input into the patch partition module (Patch Partition) of the Swin-Transformer backbone, which divides the image into four equal parts along height and width; the Linear Embedding module of the first Stage then embeds the image patches into a low-dimensional space, and the second, third and fourth Stage modules extract the first, second and third image semantic-spatial feature maps, respectively, so that image semantics and spatial features are further extracted;
S33, the multi-scale feature fusion module adopts two input convolution layers and two upsampling layers, specifically a first input-feature convolution layer, a first upsampling layer, a second input-feature convolution layer and a second upsampling layer. The extracted third semantic-spatial feature map is fed into the first input-feature convolution layer; the convolved third semantic-spatial feature is upsampled by the first upsampling layer and spliced with the second semantic-spatial feature; the result is fed into the second input-feature convolution layer and, after convolution, upsampled by the second upsampling layer; the output feature map is spliced with the first semantic-spatial feature map. Through these steps the lower semantic feature maps obtain features from the higher semantic space. The enhanced first semantic-spatial feature is then convolved and spliced with the enhanced second semantic-spatial feature, which is convolved in turn and spliced with the enhanced third semantic-spatial feature; through these steps the first, second and third semantic-spatial features realize multi-scale fusion of high- and low-level semantics. They are then convolved separately, and the resulting multi-scale fused features are output into the attention decoding detection module. By constructing the multi-scale feature fusion module, this embodiment realizes the transfer of feature semantic information between high and low dimensions, strengthens the spatial position information of the features, and improves the model's detection of targets at different scales;
More specifically, in S33 the numbers of input channels of the first and second input-feature convolution layers are 192 and 384, respectively; the convolution kernel size is set to 1x1, the stride to 1, and the number of convolution kernels to 512 for both.
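As an illustration, a minimal PyTorch sketch of this top-down fusion path follows; the channel sizes of the three input maps and the output convolutions are assumptions chosen so the tensors compose (the patent specifies 1x1 kernels, stride 1, and 512 filters for the input-feature convolutions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Top-down fusion sketch: convolve the deepest map, upsample, splice with the
    middle map, convolve and upsample again, then splice with the shallowest map."""
    def __init__(self, c1=96, c2=192, c3=384, cf=512):
        super().__init__()
        self.conv_in1 = nn.Conv2d(c3, cf, 1, 1)        # first input-feature conv layer
        self.conv_in2 = nn.Conv2d(cf + c2, cf, 1, 1)   # second input-feature conv layer
        self.out1 = nn.Conv2d(cf + c1, cf, 1, 1)       # per-scale output convolutions
        self.out2 = nn.Conv2d(cf + c2, cf, 1, 1)
        self.out3 = nn.Conv2d(c3, cf, 1, 1)

    def forward(self, f1, f2, f3):   # shallow, middle, deep semantic-spatial maps
        u2 = F.interpolate(self.conv_in1(f3), scale_factor=2, mode="nearest")  # first upsampling layer
        m2 = torch.cat([u2, f2], dim=1)                                        # splice with middle feature
        u1 = F.interpolate(self.conv_in2(m2), scale_factor=2, mode="nearest")  # second upsampling layer
        m1 = torch.cat([u1, f1], dim=1)                                        # splice with shallow feature
        return self.out1(m1), self.out2(m2), self.out3(f3)                     # multi-scale fused features
```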
S34, the attention decoding detection module is constructed on the basis of the dynamic detection head of the prior art. The scale attention, spatial attention and task attention of the multi-scale fused features are computed; the features output by the backbone network and by the multi-scale feature fusion module are input into the attention decoding detection module, which finally outputs the bounding box and category of the model's predicted target. The scale attention is obtained by global pooling that extracts the maximum of the feature map, a 1x1 convolution that integrates the feature-map channels, and then a rectified linear unit (ReLU) and a hard-sigmoid activation function; the spatial attention is obtained through deformable convolution, computing offsets and feature amplitude modulation; the task attention is modeled with a two-layer fully connected neural network;
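As an illustration, a hedged sketch of the scale-attention branch described above (global max pooling, 1x1 convolution, ReLU, then hard sigmoid); the spatial and task attention branches are omitted, and the pyramid-level interface is an assumption:

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    """Scale attention over a list of pyramid levels with shared channel count C."""
    def __init__(self, channels=512):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)                 # global pooling: maximum of each map
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)   # 1x1 conv integrating the channels
        self.act = nn.Sequential(nn.ReLU(), nn.Hardsigmoid())

    def forward(self, feats):                               # feats: list of (B, C, H, W) levels
        gates = [self.act(self.conv(self.pool(f))) for f in feats]  # one scalar gate per level
        return [g * f for g, f in zip(gates, feats)]        # rescaled pyramid levels
```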
S35, the loss function is calculated with the query sets of the general images, and the initial parameters of the model are updated to obtain a clamp network detection model with the optimal initialization parameters $\theta^{*}$; FIG. 3 shows a schematic diagram of the meta-learning combined attention detection network MetaSwinNet.
More specifically, training the model with the meta-learning Learn-to-initialize paradigm includes the following:
the optimizer uses stochastic gradient descent (SGD); the learning strategy uses the warmup preheating strategy; the learning-rate decay strategy is linear; in the encoder part the embedding dimension is set to 96; the network regularization method is random branch failure (DropPath), with the drop rate set to 0.2;
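As a hedged illustration of these settings, the following PyTorch fragment shows one way to wire up SGD with warmup followed by linear decay; the base learning rate, momentum, iteration counts, and the stand-in model are assumptions (DropPath lives inside the encoder and is not shown):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the MetaSwinNet model

# SGD optimizer with a warmup phase followed by linear learning-rate decay
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
decay = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.0,
                                          total_iters=9500)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, decay],
                                                  milestones=[500])
```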
Step S35 specifically comprises: on the training tasks, the outer loop feeds the support set of each training task into the meta-learning combined attention detection network MetaSwinNet; the inner loop computes the loss function on the support set and iteratively updates the current network's task-specific parameters by gradient descent; the outer loop then computes the loss on the query set of each task using the parameters obtained by the inner loop, performs a gradient update, and updates the model's initial parameters to obtain the optimal initialization parameters $\theta^{*}$. Through this training the model learns general parameter-initialization knowledge, yielding a clamp network detection model with the optimal initialization parameters $\theta^{*}$.
More specifically, during training the training-task support set is used to train the network, and the training-task query set is used to compute a loss that reflects the quality of the initialization parameters; this loss is then used to update the initialization parameters. The clamp target detection loss function used in the invention comprises a classification loss $L_{cls}$, a localization loss $L_{box}$ and a confidence loss $L_{obj}$. The total loss is formally defined as:

$$L = \lambda_{box} L_{box} + \lambda_{cls} L_{cls} + \lambda_{obj} L_{obj}$$

where $\lambda_{box}$, $\lambda_{cls}$ and $\lambda_{obj}$ are the bounding-box weight, classification weight and confidence weight, set to the default values 0.05, 0.5 and 1.0, respectively.
The localization loss, namely the bounding-box loss, measures the difference between the predicted and real results in target detection by computing the distance between the network-predicted bounding box and the actual bounding box, thereby guiding the optimization of the predicted box. Each bounding box has 5 parameters: confidence, x, y, w, h, where (x, y) are the coordinates of the box center and w, h are the width and height of the box. The IoU (Intersection over Union) function evaluates the degree of overlap between the predicted and real boxes. Let the predicted box be M and the real box be N; IoU is formally expressed as:

$$\mathrm{IoU} = \frac{|M \cap N|}{|M \cup N|}$$

The value range of IoU is $[0, 1]$: it is 1 when the real and predicted boxes overlap completely and 0 when they do not overlap at all, in which case gradient feedback during network training is affected and training cannot proceed. The invention therefore uses CIoU, whose value range is $(-1, 1]$: the loss does not degenerate to a constant when the predicted and real boxes do not overlap, so the model can continue to train. The CIoU-based bounding-box loss considers the distance and overlap of the target box and the annotated box simultaneously, and also takes the aspect-ratio factor into account during computation. Through CIoU the invention can evaluate the similarity between the detection result and the real box more accurately, making the regression of the bounding-box parameters more stable. The CIoU-based localization loss is formally expressed as:

$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(c, c^{gt})}{d^{2}} + \alpha v$$

where $\rho(c, c^{gt})$ is the Euclidean distance between the predicted-box center point $c$ and the real-box center point $c^{gt}$, and $d$ is the diagonal length of the smallest rectangle containing both the predicted and real boxes. $\alpha$ is an adjustment factor that scales the aspect-ratio penalty according to the degree of overlap; it is a weight parameter formally expressed as:

$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$

Here $v$ evaluates the similarity of the aspect ratios by computing the arctangents of the aspect ratios of the network-generated box and the annotated box:

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$

where $w^{gt}$, $h^{gt}$ are the width and height of the target annotation box and $w$, $h$ are the width and height of the network-generated box. When the overlap is high, the weight parameter $\alpha$ makes the aspect-ratio term a small part of the bounding-box loss, and the model concentrates on optimizing the distance between the generated and annotated boxes; when the overlap is low, $\alpha$ makes the aspect-ratio term relatively large, and the model must adjust and optimize the aspect ratio of the generated boxes.
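By way of illustration, a hedged Python sketch of the CIoU loss just defined follows; representing boxes by their (x1, y1, x2, y2) corners is an assumption of this sketch (the patent parameterizes boxes by center, width and height):

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # intersection and union -> IoU
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared center distance rho^2 and squared enclosing-box diagonal d^2
    cxp = (pred[:, 0] + pred[:, 2]) / 2; cyp = (pred[:, 1] + pred[:, 3]) / 2
    cxt = (target[:, 0] + target[:, 2]) / 2; cyt = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    d2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # aspect-ratio term v and adjustment factor alpha
    wp = pred[:, 2] - pred[:, 0]; hp = pred[:, 3] - pred[:, 1]
    wt = target[:, 2] - target[:, 0]; ht = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / d2 + alpha * v   # per-box CIoU loss
```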
The classification loss function and the confidence loss function used in the invention are the cross-entropy function CE (cross-entropy loss), which judges the distance between the network's classification output and the true expected value. CE is computed as follows: the attention decoding detection module produces a score $x$ for each class defined in the scene to be detected, and the corresponding probability $p(x)$ is computed with the Softmax function, defined as:

$$p(x_{i}) = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$$

where $x$ is the score (or confidence) of each category. The computed probabilities and the true values are then evaluated with the cross-entropy function CE:

$$\mathrm{CE}(y, p) = -\sum_{i} y_{i} \log p_{i}$$
where $y$ is the class vector and $p$ is the computed probability vector over the classes, usually in one-hot form; the smaller the difference between $y$ and $p$, the closer the model output is to the true value, with the minimum attained when they are equal. Finally, the multi-task loss of target detection is obtained, the loss function is differentiated under the meta-learning paradigm, gradients are back-propagated, and the weight update of the network model is completed to obtain the final clamp target detection model.
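Putting the pieces together, the following hedged sketch combines the three loss terms with the stated default weights; it reuses ciou_loss from the sketch above, and the tensor interfaces (logit shapes, float confidence targets, binary CE for the confidence term) are assumptions:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, true_boxes, cls_logits, cls_targets,
                   obj_logits, obj_targets,
                   w_box=0.05, w_cls=0.5, w_obj=1.0):
    """Total loss = w_box * CIoU + w_cls * CE + w_obj * CE, per the section above."""
    l_box = ciou_loss(pred_boxes, true_boxes).mean()              # localization (bounding-box) loss
    l_cls = F.cross_entropy(cls_logits, cls_targets)              # classification loss (softmax + CE)
    l_obj = F.binary_cross_entropy_with_logits(obj_logits,        # confidence loss; float targets
                                               obj_targets)
    return w_box * l_box + w_cls * l_cls + w_obj * l_obj
```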
More specifically, updating the initial parameters of the model in S35 comprises: define the meta-learning knowledge as $\omega$ and the network initialization parameters as $\theta$; under the meta-learning Learn-to-initialize paradigm, $\theta$ is initialized from $\omega$ for each training task. On a training task $i$, the task loss $L(\theta; D^{s}_{i})$ is computed on the support set and used to update the network parameters:

$$\theta'_{i} = \theta - \beta \nabla_{\theta} L(\theta; D^{s}_{i})$$

The meta-learning loss is then computed on the query set and used to update the meta-learning knowledge, i.e.:

$$\omega \leftarrow \omega - \alpha \nabla_{\omega} \sum_{i} L(\theta'_{i}; D^{q}_{i})$$

where $L$ is the loss function, $\alpha$ and $\beta$ are the learning rates for training on the query set and the support set, respectively, $D^{q}_{i}$ is the query set of training task $i$, and $D^{s}_{i}$ is its support set.
According to the Learn-to-initialize paradigm, and since the task type is target detection, $L$ may use the mean squared error (MSE), whose specific form is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}$$

where $n$ is the number of samples, $y_{i}$ is the true value, and $\hat{y}_{i}$ is the predicted value.
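A condensed, first-order PyTorch-style sketch of the two update rules above follows; the single inner step, the loss_fn(model, data) interface and the learning rates are assumptions of this illustration, not a definitive implementation of the patented training procedure:

```python
import copy
import torch

def learn_to_initialize(model, tasks, loss_fn, alpha=1e-3, beta=1e-2, meta_epochs=10):
    """Outer loop: update the shared initialization (meta-knowledge omega) from
    query-set losses; inner loop: adapt a per-task copy on the support set."""
    meta_opt = torch.optim.SGD(model.parameters(), lr=alpha)
    for _ in range(meta_epochs):
        meta_opt.zero_grad()
        for support, query in tasks:                  # each task: (support data, query data)
            task_model = copy.deepcopy(model)         # start from the current initialization
            inner_opt = torch.optim.SGD(task_model.parameters(), lr=beta)
            loss_fn(task_model, support).backward()   # task loss on the support set
            inner_opt.step()                          # theta' = theta - beta * grad
            task_model.zero_grad()
            loss_fn(task_model, query).backward()     # meta loss on the query set
            # accumulate first-order meta-gradients onto the shared initialization
            for p, tp in zip(model.parameters(), task_model.parameters()):
                p.grad = tp.grad if p.grad is None else p.grad + tp.grad
        meta_opt.step()                               # omega <- omega - alpha * summed grads
    return model
```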
S4, on the test task, the backbone network of the network detection model of step S3 is frozen; the support set of target images is input into the model for training, and the CWT adaptive update module updates and optimizes the network parameters of the multi-scale feature fusion module and the attention decoding detection module, improving the network's performance in the small-sample target detection scene. After training on the support set, the weight update is completed and the final target detection model is obtained.
As a preferred embodiment of step S4, the input of the CWT adaptive update module takes the form of (Query, Key, Value) triplets, which are then processed by a linear layer, multi-head attention computation, and layer normalization (Layer Norm); residual connections are used to complete the weight update of the feature fusion module and the attention decoding detection module.
More specifically, S4 comprises: on the test task, the support set of the test task is input into the network detection model with the optimal initialization parameters $\theta^{*}$, and the weight-adaptive module CWT is used for optimization, yielding the optimal network $\theta^{**}$ on the target task to be detected.
In the update process of the weight-adaptive module CWT, in order to learn discriminative query-conditioned information, the input is designed as:

$$Q = W\,\psi_{q}, \qquad K = F\,\psi_{k}, \qquad V = F\,\psi_{v}$$

where $\psi_{q}$, $\psi_{k}$, $\psi_{v}$ are learnable parameters, each realized by a fully connected layer; $W$ denotes the weights of the feature fusion module and the attention decoding detection module to be updated; and $F$ is the image feature extracted by the backbone network in the scene to be detected. Projecting the weights of the feature fusion module and the attention decoding detection module into the $d_{a}$-dimensional latent space allows these weights to adapt to the clamp target query image in the scene to be detected, forming the attention of the feature fusion module and the attention decoding detection module over the clamp target query image to be detected:

$$W^{*} = W + \psi\!\left(\sigma\!\left(\frac{QK^{\top}}{\sqrt{d_{a}}}\right)V\right)$$

where $\sigma$ is a row-wise softmax function used for attention normalization and $\psi$ is a linear layer with input dimension $d_{a}$; residual learning is adopted so that model convergence is stable, realizing the adaptive updating of the network for the clamp target task to be detected.
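A hedged PyTorch sketch of the CWT update defined above follows; flattening the module weights into a token matrix W, the feature layout, and the latent dimension d_a are assumptions of this illustration:

```python
import math
import torch
import torch.nn as nn

class CWTUpdate(nn.Module):
    """Cross-attention weight update: Q from the weights to adapt, K/V from the
    backbone features, with layer normalization and a residual connection."""
    def __init__(self, d_weight, d_feat, d_a=256):
        super().__init__()
        self.psi_q = nn.Linear(d_weight, d_a)   # learnable fully connected projections
        self.psi_k = nn.Linear(d_feat, d_a)
        self.psi_v = nn.Linear(d_feat, d_a)
        self.psi = nn.Linear(d_a, d_weight)     # linear layer mapping back to weight space
        self.norm = nn.LayerNorm(d_weight)      # Layer Norm after the attention update

    def forward(self, w, feats):
        # w: (n_w, d_weight) module weights to update; feats: (n_f, d_feat) features F
        q = self.psi_q(w)
        k = self.psi_k(feats)
        v = self.psi_v(feats)
        attn = torch.softmax(q @ k.t() / math.sqrt(q.size(-1)), dim=-1)  # row-wise softmax
        return self.norm(w + self.psi(attn @ v))                         # residual connection
```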
Finally, the clamp image is predicted, which specifically comprises the following steps:
S501, performing grayscale processing on the clamp target image to be detected in the scene to be detected;
S502, inputting the grayed target image into the trained meta-learning combined attention detection network MetaSwinNet to obtain the detection results for the clamp targets in the image, namely the bounding-box positions of the clamp targets and their categories.
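For completeness, a minimal sketch of prediction steps S501-S502 follows; the image I/O helper, the single-channel input and the (boxes, classes) output format of the model are assumptions of this illustration:

```python
import cv2
import torch

def predict_clamps(model, image_path, device="cuda"):
    """Grayscale the input image (S501) and run the trained detector on it (S502)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)          # S501: graying
    x = torch.from_numpy(gray).float().div(255.0)[None, None]    # shape (1, 1, H, W)
    model.eval()
    with torch.no_grad():
        boxes, classes = model(x.to(device))                     # S502: boxes and categories
    return boxes, classes
```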
In summary, the invention uses the COCO data set to build the training-task support and query sets and uses photographed images of the target to be detected to build the test-task support and query sets; adopting the meta-learning Learn-to-initialize paradigm, it performs model training and weight updating and, after training, obtains the final detection model for the target to be detected. The method can overcome difficulties such as few effective target samples, tiny features and severe occlusion, effectively extract image context information, and realize accurate detection of small targets against complex backgrounds.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
The foregoing embodiments describe the invention in detail; specific examples are applied herein to explain the principles and implementation of the invention, and the above description of the embodiments is intended only to facilitate understanding of the method of the invention and its core concepts. Meanwhile, those skilled in the art may vary the specific embodiments and the scope of application in accordance with the ideas of the present invention; in view of the above, the contents of this description should not be construed as limiting the present invention.

Claims (9)

1. A target detection method based on a meta-learning combined attention mechanism network model, characterized by comprising the following steps:
S1, collecting general images, used to build the support set and query set of the training tasks, and collecting target images to be detected, used to build the support set and query set of the test task;
S2, constructing the meta-learning combined attention detection network MetaSwinNet model, which consists of an encoder whose backbone network is the moving-window hierarchical self-attention model Swin-Transformer, a multi-scale feature fusion module, an attention decoding detection module, and a CWT adaptive update module;
S3, on the training tasks, following the meta-learning Learn-to-initialize paradigm, training the MetaSwinNet model with the support sets and query sets of the general images and updating the model's initialization parameters to obtain a network detection model with the optimal initialization parameters $\theta^{*}$;
S4, on the test task, freezing the backbone network of the network detection model, inputting the support set of target images into the model for training, and using the CWT adaptive update module to update and optimize the network parameters of the multi-scale feature fusion module and the attention decoding detection module, obtaining the final target detection model.
2. The target detection method based on a meta-learning combined attention mechanism network model according to claim 1, wherein step S1 specifically comprises the following steps:
S11, obtaining general images by downloading the common-object detection data set COCO and dividing them, to build the support sets and query sets of the training tasks in the meta-learning training paradigm;
S12, acquiring target images in the scene to be detected with an industrial camera, screening and graying them, classifying the images according to the characteristics of the targets, annotating the images once the target types are defined, and selecting the processed target images and their corresponding annotation files to build the support set and query set of the test task.
3. The target detection method based on a meta-learning combined attention mechanism network model according to claim 1, wherein step S3 specifically comprises the following steps:
S31, the Swin-Transformer of S2 connects in series a patch partition module (Patch Partition) and four consecutive Stage modules built around the moving-window hierarchical self-attention module (Swin Transformer Block);
S32, the support set of general images is input into the patch partition module of the Swin-Transformer backbone, which divides the image into four equal parts along height and width; the Linear Embedding module of the first Stage then embeds the image patches into a low-dimensional space, and the second, third and fourth Stage modules extract the first, second and third image semantic-spatial feature maps, respectively;
S33, the multi-scale feature fusion module comprises a first input-feature convolution layer, a first upsampling layer, a second input-feature convolution layer and a second upsampling layer; the extracted third semantic-spatial feature map is fed into the first input-feature convolution layer, the convolved feature is upsampled by the first upsampling layer and spliced with the second semantic-spatial feature, the result is fed into the second input-feature convolution layer and, after convolution, upsampled by the second upsampling layer, and the output is spliced with the first semantic-spatial feature map to obtain the enhanced first semantic-spatial feature; the enhanced first feature is convolved and spliced with the enhanced second feature, which in turn is convolved and spliced with the enhanced third feature; the results are then convolved separately and output as the multi-scale fused features;
S34, the scale attention, spatial attention and task attention of the multi-scale fused features are computed; the features output by the backbone network and by the multi-scale feature fusion module are input into the attention decoding detection module, which finally outputs the bounding box and category of the model's predicted target;
S35, the loss function is calculated with the query sets of the general images, and the initial parameters of the model are updated to obtain a network detection model with the optimal initialization parameters $\theta^{*}$.
4. The target detection method based on a meta-learning combined attention mechanism network model according to claim 3, wherein in S31 the processing of the moving-window hierarchical self-attention module (Swin Transformer Block) can be divided into two stages: the first stage applies layer normalization to the input, computes window-based multi-head attention, combines the input with the current output through a residual connection, then applies layer normalization followed by a multi-layer perceptron (MLP), again with a residual connection, and merges patches through the patch merging module (Patch Merging); the second stage takes the output of the first stage as input and repeats the first-stage procedure, but uses a moving-window multi-head attention mechanism when computing attention; the specific structure, in order, is: a first normalization layer, a window attention layer, a second normalization layer, a first multi-layer perceptron (MLP) layer, a third normalization layer, a moving-window attention layer, a fourth normalization layer, and a second multi-layer perceptron (MLP) layer; the activation function of the first and second MLP layers is the Gaussian error linear unit (GELU), with the convolution kernel size set to 1x1.
5. The target detection method based on a meta-learning combined attention mechanism network model according to claim 3, wherein in S33 the numbers of input channels of the first and second input-feature convolution layers are 192 and 384, respectively; the convolution kernel size is set to 1x1, the stride to 1, and the number of convolution kernels to 512 for both.
6. The target detection method based on a meta-learning combined attention mechanism network model according to claim 3, wherein step S35 specifically comprises: on the training tasks, the outer loop feeds the support sets of the multiple training tasks built from the general images into the meta-learning combined attention detection network MetaSwinNet; the inner loop computes the loss function on the support set of each training task and iteratively updates the current network's task-specific parameters by gradient descent; the outer loop then computes the meta-loss on the query set of each task on the basis of the task-updated networks and performs a gradient update, updating the model's initial parameters to obtain a network detection model with the optimal initialization parameters $\theta^{*}$.
7. The method for detecting the target based on the meta-learning combined attention mechanism network model according to claim 3, wherein the updating of the initial parameters of the model in S35 specifically comprises: define the meta-learning knowledge as $\phi$ and the network initialization parameters as $\theta$; under the meta-learning learn-to-initialize paradigm (Learn to Initialize), the meta-knowledge is the initialization itself, i.e. $\phi = \theta$. On each training task $i$, the task loss $\mathcal{L}_{S_i}(\theta)$ is computed on the support set and used to update the network parameters, giving $\theta_i'$; the meta-learning loss $\mathcal{L}_{Q_i}(\theta_i')$ is then computed on the query set and used to update the meta-learning knowledge. The specific update formulas are:

$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{S_i}(\theta)$$

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{i} \mathcal{L}_{Q_i}(\theta_i')$$

wherein $\mathcal{L}$ is the loss function; $\beta$ and $\alpha$ are the learning rates of the model trained on the query set and the support set, respectively; $Q_i$ is the query set of the training task, and $S_i$ is the support set of the training task.
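This is a MAML-style learn-to-initialize loop. A minimal PyTorch sketch of one meta-training step is given below; `model`, `tasks`, and `loss_fn` are placeholders (not from the patent) for the MetaSwinNet network, a sampler of support/query splits, and the detection loss.

```python
import torch
from torch.func import functional_call  # PyTorch >= 2.0

def meta_train_step(model, tasks, loss_fn, alpha=0.01, beta=0.001):
    """One MAML-style update of the initialization theta (claims 6-7).

    `tasks` yields (support_x, support_y, query_x, query_y) tuples,
    one per training task.
    """
    names, theta = zip(*[(n, p) for n, p in model.named_parameters() if p.requires_grad])
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:
        # Inner loop: task loss L_{S_i}(theta) on the support set,
        # one gradient step with learning rate alpha -> theta_i'.
        support_loss = loss_fn(model(support_x), support_y)
        grads = torch.autograd.grad(support_loss, theta, create_graph=True)
        theta_prime = {n: p - alpha * g for n, p, g in zip(names, theta, grads)}
        # Outer-loop contribution: meta loss L_{Q_i}(theta_i') on the query set.
        query_pred = functional_call(model, theta_prime, (query_x,))
        meta_loss = meta_loss + loss_fn(query_pred, query_y)
    # Outer update: gradient step on the initialization with learning rate beta.
    meta_grads = torch.autograd.grad(meta_loss, theta)
    with torch.no_grad():
        for p, g in zip(theta, meta_grads):
            p -= beta * g
```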
8. The method for detecting the target based on the meta-learning combined attention mechanism network model according to claim 1, wherein: in step S4, the input of the CWT weight adaptive update module takes the form of Query, Key, Value triplets; a linear layer Linear Layer, multi-head attention computation, layer normalization Layer Norm and a residual connection are then used to complete the weight update of the feature fusion module and the attention decoding detection module.
9. The method for detecting the target based on the meta-learning combined attention mechanism network model according to claim 8, wherein the step S4 specifically comprises: on a test task, the support set of the test task is input into the network detection model with optimal initialization parameters $\theta^{*}$, and optimization with the weight self-adaptive module CWT yields the optimal network $\theta'$ on the target task to be detected.
In the updating process of the weight self-adaptive module CWT, in order to learn discriminative query condition information, the input is designed as:

$$Q = wW_q,\qquad K = FW_k,\qquad V = FW_v$$

wherein $W_q$, $W_k$ and $W_v$ are learnable parameters, each represented by a fully connected layer; $w$ denotes the weights of the feature fusion module and the attention decoding detection module to be updated; and $F$ is the image feature extracted by the backbone network in the scene to be detected. The weights of the feature fusion module and the attention decoding detection module are projected into the same latent space as the query-image features, enabling those weights to adapt to the target query image in the scene to be detected and forming the attention of the feature fusion module and the attention decoding detection module on the query image to be detected, according to the formula:

$$w^{*} = w + \psi\!\left(\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V\right)$$

wherein $\operatorname{softmax}(\cdot)$ is the row-wise softmax function used for attention normalization, and $\psi$ is a linear layer with input dimension $d$. To make model convergence more stable, residual learning is adopted (the leading $w$ term above), realizing the adaptive update of the network for the task to be detected.
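Read literally, claims 8 and 9 describe a single cross-attention update over the module weights, with the scene features as keys and values. A minimal PyTorch sketch under that reading follows; the class name `CWTUpdate` and the shared dimension `d` are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CWTUpdate(nn.Module):
    """Sketch of the CWT weight-adaptive update (claims 8-9).

    The weights w to be updated act as attention queries; the backbone
    features F of the scene to be detected act as keys and values.
    """

    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d)   # learnable projections, one FC layer each
        self.W_k = nn.Linear(d, d)
        self.W_v = nn.Linear(d, d)
        self.psi = nn.Linear(d, d)   # output linear layer with input dimension d
        self.norm = nn.LayerNorm(d)  # layer normalization from claim 8
        self.d = d

    def forward(self, w: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
        # w: (n_weights, d) flattened module weights; F: (n_tokens, d) features.
        Q, K, V = self.W_q(w), self.W_k(F), self.W_v(F)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        # Residual connection keeps convergence stable (claim 9).
        return self.norm(w + self.psi(attn @ V))
```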
CN202410052133.7A 2024-01-15 2024-01-15 Target detection method based on meta-learning combined attention mechanism network model Active CN117576379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410052133.7A CN117576379B (en) 2024-01-15 2024-01-15 Target detection method based on meta-learning combined attention mechanism network model

Publications (2)

Publication Number Publication Date
CN117576379A true CN117576379A (en) 2024-02-20
CN117576379B CN117576379B (en) 2024-03-29

Family

ID=89864656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410052133.7A Active CN117576379B (en) 2024-01-15 2024-01-15 Target detection method based on meta-learning combined attention mechanism network model

Country Status (1)

Country Link
CN (1) CN117576379B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11514323B1 (en) * 2022-05-30 2022-11-29 Deeping Source Inc. Methods for performing multi-view object detection by using homography attention module and devices using the same
CN117197632A (en) * 2023-08-02 2023-12-08 内蒙古工业大学 Transformer-based electron microscope pollen image target detection method
CN117253154A (en) * 2023-11-01 2023-12-19 华南农业大学 Container weak and small serial number target detection and identification method based on deep learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830788A (en) * 2024-03-06 2024-04-05 潍坊科技学院 Image target detection method for multi-source information fusion
CN117830788B (en) * 2024-03-06 2024-05-10 潍坊科技学院 Image target detection method for multi-source information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant