CN114332479A - Training method of target detection model and related device

Training method of target detection model and related device

Info

Publication number
CN114332479A
CN114332479A (application CN202111591732.9A)
Authority
CN
China
Prior art keywords
image
scale
model
sequence
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111591732.9A
Other languages
Chinese (zh)
Inventor
赵健
史宏志
金良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202111591732.9A priority Critical patent/CN114332479A/en
Publication of CN114332479A publication Critical patent/CN114332479A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a training method for a target detection model, comprising the following steps: performing a feature extraction operation on an image with the backbone network of a model to be trained to obtain a multi-scale feature map; encoding the multi-scale feature map with a multi-scale deformable attention encoding module to obtain encoded image features; performing sequence construction based on the ground-truth labels of the image to obtain a target sequence; decoding the encoded image features and the target sequence with a multi-scale deformable attention decoding module to obtain a prediction sequence; and updating the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image. In this way, high-complexity feature maps can be processed, and the efficiency and performance of the model are improved. The application also discloses a training apparatus, a server, and a computer-readable storage medium for the target detection model, which share the same beneficial effects.

Description

Training method of target detection model and related device
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a training method, a training apparatus, a server, and a computer-readable storage medium for a target detection model.
Background
In the related art, although PIX2SEQ (an object detection architecture) converts the object detection task into a language-model generation task, it still uses a Transformer internally. Owing to the Transformer's characteristics, when processing image features it computes the relation between the current position and all other positions, with a complexity directly proportional to the feature-map size, so large-size or multi-scale features cannot be used, which in turn limits the performance of the model.
Therefore, how to maintain efficiency during this processing is a major concern for those skilled in the art.
Disclosure of Invention
The application aims to provide a training method, a training apparatus, a server, and a computer-readable storage medium for a target detection model, so that high-complexity feature maps can be processed and the efficiency and performance of the model are improved.
In order to solve the above technical problem, the present application provides a method for training a target detection model, including:
performing a feature extraction operation on an image with the backbone network of a model to be trained to obtain a multi-scale feature map;
encoding the multi-scale feature map with a multi-scale deformable attention encoding module to obtain encoded image features;
performing sequence construction based on the ground-truth labels of the image to obtain a target sequence;
decoding the encoded image features and the target sequence with a multi-scale deformable attention decoding module to obtain a prediction sequence;
and updating the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
Optionally, performing a feature extraction operation on the image by using a backbone network of the model to be trained to obtain a multi-scale feature map, including:
carrying out image enhancement processing on the image to obtain a transformed image;
and carrying out image feature extraction operation on the transformed image based on the backbone network to obtain the multi-scale feature map.
Optionally, encoding the multi-scale feature map with the multi-scale deformable attention encoding module to obtain encoded image features includes:
acquiring the image features of each scale from the multi-scale feature map;
encoding the position of each feature point in the image features to obtain position codes;
fusing the image features of each scale with the position codes to obtain the encoder input;
and encoding the encoder input with the multi-scale deformable attention encoding module to obtain the encoded image features.
Optionally, performing sequence construction based on the ground-truth labels of the image to obtain a target sequence includes:
performing size normalization on the ground-truth annotations in the image to obtain a normalized image;
and performing sequence construction on each target in the normalized image to obtain the target sequence.
Optionally, updating the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image includes:
constructing a loss function based on the model structure of the model to be trained;
and updating the parameters of the model to be trained based on the loss function, the prediction sequence, and the ground-truth target sequence of the image.
Optionally, the method further includes:
and when the model to be trained has finished training, taking the model to be trained as the target detection model.
Optionally, the method further includes:
testing the target detection model based on the labeled data;
and when the test is passed, sending a test success message.
The present application further provides a training apparatus for a target detection model, including:
a feature extraction module, configured to perform a feature extraction operation on an image with the backbone network of a model to be trained to obtain a multi-scale feature map;
a feature encoding module, configured to encode the multi-scale feature map with the multi-scale deformable attention encoding module to obtain encoded image features;
a sequence construction module, configured to perform sequence construction based on the ground-truth labels of the image to obtain a target sequence;
a feature decoding module, configured to decode the encoded image features and the target sequence with a multi-scale deformable attention decoding module to obtain a prediction sequence;
and a parameter updating module, configured to update the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
The present application further provides a server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the training method of the object detection model as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of training an object detection model as set forth above.
The application provides a training method for a target detection model, comprising the following steps: performing a feature extraction operation on an image with the backbone network of a model to be trained to obtain a multi-scale feature map; encoding the multi-scale feature map with a multi-scale deformable attention encoding module to obtain encoded image features; performing sequence construction based on the ground-truth labels of the image to obtain a target sequence; decoding the encoded image features and the target sequence with a multi-scale deformable attention decoding module to obtain a prediction sequence; and updating the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
By introducing the deformable attention mechanism into both the encoder and the decoder of the Transformer, the complexity of the Transformer in processing image features is greatly reduced, so that the model can use high-resolution image features and multi-scale feature information, further improving model performance.
The application also provides a training apparatus, a server, and a computer-readable storage medium for the target detection model, which share the above beneficial effects and are not repeated herein.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a training method of a target detection model according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a training method of a target detection model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a feature-map processing manner of a training method for a target detection model according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a training method, a training apparatus, a server, and a computer-readable storage medium for a target detection model, so that high-complexity feature maps can be processed and the efficiency and performance of the model are improved.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, object detection is treated as a mapping from an input image to a set. For example, DETR (DEtection TRansformer, a Transformer-based end-to-end object detector) and Deformable-DETR (a deformable-attention-based DETR) both introduce the Transformer technology from NLP (Natural Language Processing) into object detection and treat object detection as set prediction, achieving true end-to-end training. First, a ResNet backbone network extracts features of the whole image; the feature map is then combined with position information and fed into a Transformer; finally, an FFN (feed-forward network) outputs the detection results. In DETR, because the Transformer computes the relation between the current query (feature point) and all keys, and the number of keys is usually very large, the processing complexity is proportional to the feature-map size. An averaging strategy is adopted for all keys at initialization while the attention mechanism ultimately focuses on only a small number of keys, so convergence is slow; at the same time, only small-scale (low-resolution) features can be used, which hurts small-object detection performance.
In PIX2SEQ, object detection is treated as a description of the image: the detection boxes and corresponding categories are regarded as a description, and the model receives an image as input and generates the desired description sequence. The processing flow first extracts features from the enhanced image with a backbone network and constructs a sequence from the bboxes (rectangular-box position information) and corresponding categories; a Transformer encoder (encoding module) then encodes the image features; finally, a generative Transformer decoder (decoding module) outputs the desired sequence, i.e. the detection result, based on the previous sequence and the encoded image representation.
Although PIX2SEQ (an object detection framework) converts the object detection task into a language-model generation task, it still uses a Transformer internally. Owing to the Transformer's characteristics, when processing image features it computes the relation between the current position and all other positions, with a complexity directly proportional to the feature-map size, so large-size or multi-scale features cannot be used, which in turn limits the performance of the model.
Therefore, the training method of a target detection model provided by the application introduces the deformable attention mechanism into both the encoder and the decoder of the Transformer, greatly reducing the complexity of the Transformer in processing image features, so that the model can use high-resolution image features and multi-scale feature information to further improve its performance.
The following describes a training method of a target detection model provided in the present application by an embodiment.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for training a target detection model according to an embodiment of the present disclosure.
In this embodiment, the method may include:
s101, performing feature extraction operation on the image by adopting a backbone network of a model to be trained to obtain a multi-scale feature map;
it can be seen that this step aims to extract the corresponding features from the image first. The backbone network may adopt any one of the main network structures of the target detection network provided in the prior art, which is not specifically limited herein.
Further, the step may include:
step 1, performing image enhancement processing on an image to obtain a transformed image;
and 2, performing image feature extraction operation on the transformed image based on the backbone network to obtain a multi-scale feature map.
It can be seen that this step mainly illustrates how to extract the feature map. In this alternative, image enhancement is performed on the image to obtain a transformed image, and an image feature extraction operation is performed on the transformed image with the backbone network to obtain the multi-scale feature map. The image enhancement includes random geometric transformations and random color transformations: the random geometric transformations include random cropping, random horizontal flipping, random resizing, and the like; the random color transformations include color jittering, i.e. randomly adjusting the image contrast, brightness, saturation, and the like.
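For illustration, the enhancement described above can be sketched with torchvision transforms. This is a minimal sketch under assumptions: the crop size, flip probability, and jitter ranges are not fixed by this embodiment, and a real detection pipeline would apply the geometric transforms to the bounding boxes as well.

```python
# Minimal sketch of the image-enhancement pipeline (illustrative values).
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(size=800, scale=(0.5, 1.0)),  # random cropping + random resizing
    T.RandomHorizontalFlip(p=0.5),                    # random horizontal flipping
    T.ColorJitter(brightness=0.4, contrast=0.4,       # color jittering: randomly adjust
                  saturation=0.4),                    # brightness, contrast, saturation
    T.ToTensor(),
])
```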
S102, encoding the multi-scale feature map with the multi-scale deformable attention encoding module to obtain encoded image features;
On the basis of S101, this step aims to encode the multi-scale feature map with the multi-scale deformable attention encoding module to obtain the encoded image features.
Namely, multi-scale deformable attention encoding is realized on the basis of the original attention encoding.
Referring to fig. 4, fig. 4 is a schematic structural diagram illustrating a feature diagram processing manner of a training method for a target detection model according to an embodiment of the present disclosure.
The encoded image features may be processed in the feature-map processing manner shown in fig. 4, which is not specifically limited herein. The processed feature maps are then further processed by the multi-scale deformable attention module. In general, multi-scale information can effectively improve object detection performance, but convolution operations cause an over-smoothing problem in high-level feature maps. To avoid this problem and make full use of the multi-scale information, the corresponding feature points of each level's feature map are obtained according to its level and fused through the multi-scale attention module.
Further, the step may include:
step 1, acquiring the image features of each scale from the multi-scale feature map;
step 2, encoding the position of each feature point in the image features to obtain position codes;
step 3, fusing the image features of each scale with the position codes to obtain the encoder input;
and step 4, encoding the encoder input with the multi-scale deformable attention encoding module to obtain the encoded image features.
It can be seen that this alternative mainly illustrates how feature encoding is performed. In this alternative, the image features of each scale are acquired from the multi-scale feature map, the position of each feature point in the image features is encoded to obtain position codes, the image features of each scale are fused with the position codes to obtain the encoder input, and the encoder input is encoded with the multi-scale deformable attention encoding module to obtain the encoded image features.
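As a concrete illustration of steps 1 to 4, the following sketch adds a 2-D sinusoidal position code to each scale's features and flattens the result into a single token sequence as the encoder input. The sine/cosine encoding and the shapes are assumptions in the style of Deformable-DETR, not details fixed by this embodiment.

```python
# Sketch: fuse per-scale features with position codes and flatten them.
import math
import torch

def position_encoding(h, w, d_model=256):
    """2-D sinusoidal position code of shape (d_model, h, w)."""
    assert d_model % 4 == 0
    d = d_model // 4
    freqs = torch.exp(torch.arange(d) * (-math.log(10000.0) / d))
    ys = torch.arange(h).float()[:, None] * freqs            # (h, d)
    xs = torch.arange(w).float()[:, None] * freqs            # (w, d)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=1)            # (h, 2d)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=1)            # (w, 2d)
    pe = torch.cat([pe_y[:, None, :].expand(h, w, 2 * d),
                    pe_x[None, :, :].expand(h, w, 2 * d)], dim=-1)
    return pe.permute(2, 0, 1)                               # (d_model, h, w)

def build_encoder_input(feature_maps):
    """feature_maps: list of (B, C, H_l, W_l) tensors, one per scale."""
    tokens = []
    for fm in feature_maps:
        b, c, h, w = fm.shape
        fm = fm + position_encoding(h, w, c).to(fm)   # fuse features with position codes
        tokens.append(fm.flatten(2).transpose(1, 2))  # (B, H_l*W_l, C)
    return torch.cat(tokens, dim=1)                   # one flattened sequence
```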
S103, performing sequence construction based on the ground-truth labels of the image to obtain a target sequence;
On the basis of S101, this step aims to perform sequence construction based on the ground-truth labels of the image, so as to obtain a target sequence.
That is, to generate the sequence content that the decoding process needs.
Further, the step may include:
step 1, performing size normalization on the ground-truth annotations in the image to obtain a normalized image;
and step 2, performing sequence construction on all targets in the normalized image to obtain the target sequence.
It can be seen that this alternative mainly illustrates how to obtain the target sequence. In this alternative, the ground-truth annotations in the image are size-normalized to obtain a normalized image, and sequence construction is performed on the targets in the normalized image to obtain the target sequence.
S104, decoding the encoded image features and the target sequence with the multi-scale deformable attention decoding module to obtain a prediction sequence;
On the basis of S102 and S103, this step aims to decode the encoded image features and the target sequence with the multi-scale deformable attention decoding module to obtain the prediction sequence. That is, the corresponding decoding operation is realized; unlike the prior art, this step adopts a multi-scale deformable attention decoding module, so the model can focus on the important features, improving performance on complex features.
And S105, updating the parameters of the model to be trained based on the preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
On the basis of S104, this step aims to update the parameters of the model to be trained based on the preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
Further, the step may include:
step 1, constructing a loss function based on a model structure of a model to be trained;
and step 2, updating the parameters of the model to be trained based on the loss function, the prediction sequence, and the ground-truth target sequence of the image.
It can be seen that this alternative mainly illustrates how parameter updating is performed. In this alternative, a loss function is constructed based on the model structure of the model to be trained, and the parameters of the model to be trained are updated based on the loss function, the prediction sequence, and the ground-truth target sequence of the image.
Further, this embodiment may further include:
and when the training of the model to be trained is finished, taking the model to be trained as the target detection model.
Therefore, in this alternative, the corresponding target detection model is obtained when training is completed, which helps ensure the accuracy of the target detection model.
Further, this embodiment may further include:
step 1, testing a target detection model based on labeled data;
and step 2, when the test is passed, sending a test success message.
It can be seen that this alternative mainly describes testing of the target detection model. In this alternative, the target detection model is tested based on labeled data, and when the test is passed, a test success message is sent.
In summary, this embodiment introduces the deformable attention mechanism into both the encoder and the decoder of the Transformer, greatly reducing the complexity of the Transformer in processing image features, so that the model can use high-resolution image features and multi-scale feature information, further improving model performance.
The following further describes a training method of the target detection model provided in the present application by a specific embodiment.
In this embodiment, the following four parts are mainly included:
extracting image features with a backbone network and constructing multi-scale features; constructing a Transformer encoder based on the multi-scale deformable attention module; constructing a sequence based on the ground truth (correctly labeled data) of the image and performing corresponding data enhancement; and constructing a generative Transformer decoder based on the multi-scale deformable attention module.
First, image features are extracted with a backbone network and multi-scale features are constructed. Object detection is a form of transfer learning: the model is built from a backbone and a head, where the backbone is generally pre-trained on other datasets and the head is the network bridging the backbone and the object detection data, so the backbone directly influences the final detection result. In addition, each scale has an area of primary interest; if only the last layer of the backbone is used, small-object detection performance is poor, so information from different scales is generally considered comprehensively in object detection.
The Transformer encoder is constructed based on the multi-scale deformable attention module. The multi-head attention mechanism enables the neural network to focus on the more relevant elements; however, because all keys initially adopt an averaging strategy and the number of keys is very large, the network needs a long time to focus on the relevant keys. In addition, the multi-head attention mechanism has high computational and spatial complexity, so high-resolution feature maps cannot be used.
A sequence is constructed based on the ground truth of the image, with corresponding data enhancement. The categories in object detection labels can be represented directly by tokens (symbols), but the detection boxes cannot; they must be discretized with a quantization strategy. After quantization the boxes are unified in one interval, so they can be represented by a very small vocabulary. After each target in the image is quantized, its detection box and category form a group, and the groups are assembled into a sequence in random order. To improve model performance, noise produced by an enhancement method is appended after the real sequence to form a new sequence.
A generative Transformer decoder is constructed based on the multi-scale deformable attention module. Similar to the Transformer encoder, in order to reduce computational and spatial complexity, the deformable attention mechanism is introduced into the cross-attention. The desired sequence is generated in a generative manner based on the encoder-encoded image features and the sequence, a maximum-likelihood loss is constructed against the ground truth, and the model parameters are updated according to the loss.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a training method of a target detection model according to an embodiment of the present disclosure.
Please refer to fig. 2, which is a flowchart of the Transformer-based end-to-end target detection model of the present application. The method comprises the following six steps:
step 1, extracting input image features with a backbone network;
step 2, constructing a Transformer encoder based on a multi-scale deformable attention module;
step 3, constructing a sequence based on the ground truth of the image, and performing corresponding data enhancement;
step 4, constructing a Transformer decoder based on a multi-scale deformable attention module;
step 5, constructing a loss function;
and 6, training and testing the model.
The above steps are described in detail below.
First, input image features are extracted with the backbone network. To improve the detection performance of the model, an image enhancement algorithm is applied to the image, features are then extracted with a backbone network with richer representations, and multiple scales are constructed. The method can comprise the following steps:
step 1, performing image enhancement on the input image. The enhancement is divided into random geometric transformations and random color transformations: the former include random cropping, random horizontal flipping, random resizing, and the like; the latter include color jittering, i.e. randomly adjusting the image contrast, brightness, saturation, and the like;
and 2, extracting image features from the transformed image with a backbone network such as ResNeXt101.
And 3, constructing multi-scale image features based on the backbone network.
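As a sketch of steps 2 and 3, a ResNeXt101 backbone from torchvision can expose several stages as a multi-scale pyramid; taking stages C3-C5 and projecting them to a common channel width with 1x1 convolutions is a common construction and an assumption here, not something this embodiment prescribes.

```python
# Sketch: multi-scale feature extraction with a ResNeXt101 backbone.
import torch
import torchvision
from torchvision.models._utils import IntermediateLayerGetter  # private torchvision helper

backbone = torchvision.models.resnext101_32x8d(weights="IMAGENET1K_V1")
body = IntermediateLayerGetter(
    backbone, return_layers={"layer2": "c3", "layer3": "c4", "layer4": "c5"})

# 1x1 convolutions unify the channel number of every scale (e.g. to 256).
proj = torch.nn.ModuleDict({
    "c3": torch.nn.Conv2d(512, 256, 1),
    "c4": torch.nn.Conv2d(1024, 256, 1),
    "c5": torch.nn.Conv2d(2048, 256, 1),
})

x = torch.randn(1, 3, 800, 800)
multi_scale = [proj[name](fm) for name, fm in body(x).items()]
# Shapes: (1, 256, 100, 100), (1, 256, 50, 50), (1, 256, 25, 25)
```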
The Transformer encoder is constructed based on the multi-scale deformable attention module. Although the multi-head attention mechanism in the Transformer performs well in NLP, when it is introduced into CV (computer vision) the processed object is an image; when the relation between the current query (feature point) and all keys is computed, the computational and spatial complexity are very large because the number of keys is very large, forcing the model to use low-resolution image features.
First, the image feature positions are encoded and fused with the image features. The operation flow is as follows:
step 1, extracting image features by using a backbone network, and constructing multi-scale image features;
step 2, acquiring the image features at each scale, encoding the position of each feature point, and fusing the image features with the corresponding position codes;
and step 3, flattening the position-fused features into a single row as the encoder (encoding module) input.
Then, an encoder with the multi-scale deformable attention mechanism is constructed. On an image feature map, the Transformer attention mechanism attends to all spatial positions, whereas the deformable attention module introduced here focuses on only a small number of keys around a reference point, which alleviates the feature-map spatial-resolution and convergence problems. The Transformer encoder module is constructed on the basis of the deformable attention module, with the following specific steps:
step 1, constructing a single-scale deformable attention module. Let the input feature map be $x \in \mathbb{R}^{C \times H \times W}$, and let the query feature (feature point) and the reference point be $z_q$ and $r_q$ respectively, where $q$ is the corresponding index. The deformable attention feature is calculated by:

$$\mathrm{DeformAttn}(z_q, r_q, x) = \sum_{h=1}^{H} W_h \Big[ \sum_{k=1}^{K} A_{hqk} \cdot W'_h\, x(r_q + \Delta r_{hqk}) \Big]$$

where $h$ indexes the attention head, $k$ indexes the sampled keys, $K$ is the number of sampled keys, $W_h$ and $W'_h$ are learnable projections, $\Delta r_{hqk}$ denotes the sampling offset of the $k$-th sample in the $h$-th head, and $A_{hqk} \in [0,1]$ denotes the attention weight of the $k$-th sample in the $h$-th head. Since $\Delta r_{hqk}$ is a floating-point number, $r_q + \Delta r_{hqk}$ is usually non-integral, and the value at this position is obtained from the surrounding points by bilinear interpolation.
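To make the computation above concrete, the following is a compact PyTorch sketch of single-scale deformable attention, with the bilinear interpolation at the non-integral positions $r_q + \Delta r_{hqk}$ done by F.grid_sample. It is a simplification under assumptions: offsets are predicted directly in normalized (x, y) coordinates and the layer sizes are illustrative; a full implementation such as Deformable-DETR's differs in details like offset scaling and initialization.

```python
# Sketch: single-scale deformable attention (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAttn(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_points=4):
        super().__init__()
        self.h, self.k, self.dh = n_heads, n_points, d_model // n_heads
        self.offsets = nn.Linear(d_model, n_heads * n_points * 2)  # predicts Δr_hqk
        self.weights = nn.Linear(d_model, n_heads * n_points)      # predicts A_hqk
        self.value_proj = nn.Linear(d_model, d_model)               # W'_h
        self.out_proj = nn.Linear(d_model, d_model)                 # W_h

    def forward(self, z_q, r_q, x):
        # z_q: (B, Q, C) query features; r_q: (B, Q, 2) reference points in [0, 1]
        # x:   (B, C, H, W) input feature map
        B, Q, _ = z_q.shape
        v = self.value_proj(x.flatten(2).transpose(1, 2))             # (B, HW, C)
        v = v.transpose(1, 2).reshape(B * self.h, self.dh, *x.shape[2:])
        off = self.offsets(z_q).view(B, Q, self.h, self.k, 2)
        attn = self.weights(z_q).view(B, Q, self.h, self.k).softmax(-1)  # Σ_k A_hqk = 1
        loc = (r_q[:, :, None, None, :] + off) * 2 - 1  # sampling points in [-1, 1]
        loc = loc.permute(0, 2, 1, 3, 4).reshape(B * self.h, Q, self.k, 2)
        sampled = F.grid_sample(v, loc, mode="bilinear",     # bilinear interpolation
                                align_corners=False)         # (B*h, dh, Q, K)
        attn = attn.permute(0, 2, 1, 3).reshape(B * self.h, 1, Q, self.k)
        out = (sampled * attn).sum(-1)                       # Σ_k A_hqk · W'_h x(r_q + Δr)
        out = out.reshape(B, self.h * self.dh, Q).transpose(1, 2)
        return self.out_proj(out)                            # (B, Q, C)
```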
And 2, constructing an adaptive multi-scale deformable attention module. In general, using multi-scale information can effectively improve object detection performance. However, a high-level feature map is obtained by applying convolutions to the shallower feature maps, so the information of all points of a shallow-level feature may be fused into one or a few points of the high-level feature map, causing an over-smoothing problem. Therefore, in order to fully utilize the multi-scale information, a multi-scale deformable attention module is proposed on the basis of the single-scale deformable attention module, adding multiple scales to it. The calculation is as follows:
$$\mathrm{MSDeformAttn}\big(z_q, \hat{r}_q, \{x^l\}_{l=1}^{L}\big) = \sum_{h=1}^{H} W_h \Big[ \sum_{l=1}^{L} \sum_{k=1}^{2^{(L-l)}K} A_{hlqk} \cdot W'_h\, x^l\big(\phi_l(\hat{r}_q) + \Delta r_{hlqk}\big) \Big]$$

where $\{x^l\}_{l=1}^{L}$ is the input multi-scale feature map with $x^l \in \mathbb{R}^{C \times H_l \times W_l}$, $h$ is the attention-head index, $l$ is the current feature level, $L$ is the total number of feature levels, $k$ is the sampled-key index, and $2^{(L-l)}K$ is the number of keys sampled at the current level; the coefficient $2^{(L-l)}$ on $K$ ensures that lower-level features sample more keys and higher-level features sample fewer. $\Delta r_{hlqk}$ denotes the offset of the $k$-th sample of the $l$-th level in the $h$-th head, $A_{hlqk} \in [0,1]$ denotes the corresponding weight, $\hat{r}_q$ are the normalized reference coordinates, and $\phi_l(\hat{r}_q)$ re-scales the normalized coordinates to the $l$-th level feature map.
And 3, constructing the Transformer encoder based on the multi-scale deformable attention module: the attention module in the Transformer is directly replaced with the multi-scale deformable attention module, the number of heads is set to H = 8, the number of sampling points (keys) to K = 4, and the number of channels of the multi-scale features is unified to 256.
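A sketch of one such encoder layer follows: the standard self-attention is swapped for the deformable module, while the feed-forward and residual/normalization structure stays as in an ordinary Transformer layer. DeformAttn is the illustrative single-scale module sketched earlier (standing in for its multi-scale variant), and the FFN width of 1024 is an assumption; H=8, K=4, and 256 channels follow the settings above.

```python
# Sketch: Transformer encoder layer with deformable attention.
import torch.nn as nn

class DeformableEncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_points=4, d_ffn=1024):
        super().__init__()
        self.attn = DeformAttn(d_model, n_heads, n_points)  # replaces self-attention
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens, ref_points, feature_map):
        # tokens: (B, Q, C) flattened features fused with position codes
        tokens = self.norm1(tokens + self.attn(tokens, ref_points, feature_map))
        return self.norm2(tokens + self.ffn(tokens))
```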
A sequence is then constructed based on the ground truth of the image, with corresponding data enhancement. The ground truth for object detection contains detection boxes and corresponding categories. Because image sizes differ, the detection boxes differ as well, so the images need to be unified to a common size; all detection boxes can then be quantized to a uniform scale and represented with a small vocabulary. With the tokens corresponding to the categories added, all targets are constructed into a sequence based on this representation, and a noise sequence is added with a data enhancement algorithm during training. The specific implementation is as follows:
step 1, normalizing the ground truth in the image to a uniform size.
Resize the original image according to its aspect ratio and a set longest-edge size, for example setting the longest edge to 1400; adjust the bounding boxes in the ground truth according to the image's resize ratio, so that all bounding-box coordinates are normalized (quantized) into [1, 1400]. Then randomly select a target in the image and construct a sequence: normalize the bounding box of that target as above, leaving the category unmodified; transform the remaining targets in the same manner and append them to the preceding sequence.
And 2, data enhancement. That is, based on the generated sequence, some sequences are randomly generated and appended to it, for example by randomly adjusting the position of the upper-left or lower-right corner point of a target box, translating a bounding box to another position, or modifying the corresponding category.
And 3, uniformly converting the generated sequences into the corresponding tokens in the manner of natural language processing.
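The sequence construction of steps 1-3 can be sketched as follows. The (ymin, xmin, ymax, xmax, class) token order, the class-token offset, and the helper names are illustrative assumptions; only the longest edge of 1400 and the [1, 1400] quantization interval come from the text.

```python
# Sketch: build a quantized target sequence from ground-truth boxes.
import random

LONGEST_EDGE = 1400          # quantization interval [1, 1400], as in the text
CLASS_OFFSET = LONGEST_EDGE  # class tokens placed after the coordinate tokens

def build_target_sequence(boxes, classes, img_w, img_h):
    """boxes: list of (xmin, ymin, xmax, ymax) in pixels; classes: list of int."""
    scale = LONGEST_EDGE / max(img_w, img_h)   # resize ratio of the image
    order = list(range(len(boxes)))
    random.shuffle(order)                      # targets appended in random order
    seq = []
    for i in order:
        xmin, ymin, xmax, ymax = boxes[i]
        seq += [max(1, min(LONGEST_EDGE, round(c * scale)))   # quantize coordinates
                for c in (ymin, xmin, ymax, xmax)]
        seq.append(CLASS_OFFSET + classes[i])  # category token, not rescaled
    return seq

print(build_target_sequence([(40, 60, 500, 700)], [3], img_w=1000, img_h=1400))
# -> [60, 40, 700, 500, 1403]
```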
The Transformer decoder is constructed based on the multi-scale deformable attention module. The decoder mainly comprises a cross-attention module and a self-attention module. In the cross-attention module, the queries extract features from the feature map, the keys being the feature map output by the encoder; in the self-attention module, the queries interact with each other, the keys being the input queries themselves. Since the deformable attention module is designed for feature maps, only the cross-attention module is replaced with the multi-scale deformable attention module, while the self-attention module remains as before.
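A corresponding decoder-layer sketch, under the same assumptions as the encoder sketch: the self-attention over sequence tokens is an ordinary (causally masked) multi-head attention, and only the cross-attention is the deformable module.

```python
# Sketch: decoder layer with deformable cross-attention.
import torch.nn as nn

class DeformableDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_points=4, d_ffn=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = DeformAttn(d_model, n_heads, n_points)  # deformable
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model)
                                              for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))

    def forward(self, seq, ref_points, memory_map, causal_mask):
        s, _ = self.self_attn(seq, seq, seq, attn_mask=causal_mask)
        seq = self.norm1(seq + s)  # queries interact with each other
        seq = self.norm2(seq + self.cross_attn(seq, ref_points, memory_map))
        return self.norm3(seq + self.ffn(seq))
```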
A loss function is constructed. Because the target detection task is converted into the generation task of the language model, the maximum likelihood loss function is adopted as follows:
$$\mathcal{L} = -\sum_{\iota=1}^{L} w_\iota \log P\big(\tilde{s}_\iota \mid \mathrm{img},\, s_{1:\iota-1}\big)$$

where $\mathrm{img}$ is the input image, $\tilde{s}_\iota$ and $s_\iota$ denote the target sequence and the input sequence respectively, $L$ is the length of the target sequence, and $w_\iota$ is the weight of the $\iota$-th token in the sequence.
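A sketch of this weighted negative log-likelihood over the token sequence follows; the tensor names and shapes are illustrative assumptions.

```python
# Sketch: weighted maximum-likelihood (NLL) loss over a token sequence.
import torch
import torch.nn.functional as F

def sequence_nll_loss(logits, target_tokens, token_weights):
    """logits: (B, L, V); target_tokens: (B, L) int64; token_weights: (B, L)."""
    logp = F.log_softmax(logits, dim=-1)  # log P(s̃_ι | img, s_{1:ι-1})
    tok_logp = logp.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)
    return -(token_weights * tok_logp).sum() / token_weights.sum()

# usage
logits = torch.randn(2, 10, 1500, requires_grad=True)
targets = torch.randint(0, 1500, (2, 10))
loss = sequence_nll_loss(logits, targets, torch.ones(2, 10))
loss.backward()
```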
Finally, the model is trained and tested. Relevant hyper-parameters such as the optimizer, learning rate, and batch size are set based on the built model and loss function, and training begins. During testing, tokens are selected by nucleus sampling; when the model outputs the EOS token, the sequence ends, and the generated sequence is then converted into detection boxes and corresponding categories.
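For reference, nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches a threshold p and samples from it; the value of p and the surrounding generation loop are assumptions, as the text only names the sampling method and the EOS stop condition.

```python
# Sketch: nucleus (top-p) sampling of the next token.
import torch

def nucleus_sample(probs, p=0.9):
    """probs: (V,) next-token distribution; returns a sampled token id."""
    sorted_probs, idx = probs.sort(descending=True)
    before = sorted_probs.cumsum(0) - sorted_probs  # cumulative prob before each token
    kept = sorted_probs * (before < p)              # smallest prefix reaching p
    return idx[torch.multinomial(kept / kept.sum(), 1)].item()

# Generation loop (model(...) is a hypothetical stand-in for the decoder):
#   seq = [BOS]
#   while seq[-1] != EOS and len(seq) < MAX_LEN:
#       seq.append(nucleus_sample(model(img, seq)))
```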
It can be seen that although the Transformer performs well in NLP, when it is introduced into CV and the processed objects are images, its inherent computational and spatial complexity is very large, forcing the model to use only low-resolution image features; this embodiment addresses that problem with the multi-scale deformable attention mechanism.
In the following, a training apparatus of a target detection model provided in an embodiment of the present application is introduced, and the training apparatus of the target detection model described below and the training method of the target detection model described above may be referred to correspondingly.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present disclosure.
In this embodiment, the apparatus may include:
the feature extraction module 100 is configured to perform a feature extraction operation on an image with the backbone network of the model to be trained to obtain a multi-scale feature map;
the feature encoding module 200 is configured to encode the multi-scale feature map with the multi-scale deformable attention encoding module to obtain encoded image features;
the sequence construction module 300 is configured to perform sequence construction based on the ground-truth labels of the image to obtain a target sequence;
the feature decoding module 400 is configured to decode the encoded image features and the target sequence with the multi-scale deformable attention decoding module to obtain a prediction sequence;
and the parameter updating module 500 is configured to update the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
An embodiment of the present application further provides a server, including:
a memory for storing a computer program;
a processor for implementing the steps of the training method of the object detection model as described in the above embodiments when the computer program is executed.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the training method for the target detection model according to the above embodiments.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above detailed description is provided for a training method, a training apparatus, a server and a computer-readable storage medium of an object detection model provided in the present application. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A method for training a target detection model, comprising:
performing a feature extraction operation on an image with the backbone network of a model to be trained to obtain a multi-scale feature map;
encoding the multi-scale feature map with a multi-scale deformable attention encoding module to obtain encoded image features;
performing sequence construction based on the ground-truth labels of the image to obtain a target sequence;
decoding the encoded image features and the enhanced target sequence with a multi-scale deformable attention decoding module to obtain a prediction sequence, the enhanced target sequence being the target sequence after data enhancement and mask processing;
and updating the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
2. The training method of claim 1, wherein performing feature extraction on the image by using a backbone network of the model to be trained to obtain a multi-scale feature map comprises:
carrying out image enhancement processing on the image to obtain a transformed image;
and carrying out image feature extraction operation on the transformed image based on the backbone network to obtain the multi-scale feature map.
3. The training method of claim 1, wherein encoding the multi-scale feature map with the multi-scale deformable attention encoding module to obtain the encoded image features comprises:
acquiring the image features of each scale from the multi-scale feature map;
encoding the position of each feature point in the image features to obtain position codes;
fusing the image features of each scale with the position codes to obtain the encoder input;
and encoding the encoder input with the multi-scale deformable attention encoding module to obtain the encoded image features.
4. The training method of claim 1, wherein performing sequence construction based on the ground-truth labels of the image to obtain the target sequence comprises:
performing size normalization on the ground-truth annotations in the image to obtain a normalized image;
and performing sequence construction on each target in the normalized image to obtain the target sequence.
5. The training method according to claim 1, wherein updating the parameters of the model to be trained based on the preset loss function, the prediction sequence, and the ground-truth target sequence of the image comprises:
constructing a loss function based on the model structure of the model to be trained;
and updating the parameters of the model to be trained based on the loss function, the prediction sequence, and the ground-truth target sequence of the image.
6. The training method of claim 1, further comprising:
and when the model to be trained is trained, taking the model to be trained as a target detection model.
7. The training method of claim 1, further comprising:
testing the target detection model based on the labeled data;
and when the test is passed, sending a test success message.
8. An apparatus for training an object detection model, comprising:
a feature extraction module, configured to perform a feature extraction operation on an image with the backbone network of a model to be trained to obtain a multi-scale feature map;
a feature encoding module, configured to encode the multi-scale feature map with the multi-scale deformable attention encoding module to obtain encoded image features;
a sequence construction module, configured to perform sequence construction based on the ground-truth labels of the image to obtain a target sequence;
a feature decoding module, configured to decode the encoded image features and the target sequence with a multi-scale deformable attention decoding module to obtain a prediction sequence;
and a parameter updating module, configured to update the parameters of the model to be trained based on a preset loss function, the prediction sequence, and the ground-truth target sequence of the image.
9. A server, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of training an object detection model according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method of training an object detection model according to any one of claims 1 to 7.
CN202111591732.9A 2021-12-23 2021-12-23 Training method of target detection model and related device Pending CN114332479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111591732.9A CN114332479A (en) 2021-12-23 2021-12-23 Training method of target detection model and related device

Publications (1)

Publication Number Publication Date
CN114332479A (en) 2022-04-12

Family

ID=81053882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111591732.9A Pending CN114332479A (en) 2021-12-23 2021-12-23 Training method of target detection model and related device

Country Status (1)

Country Link
CN (1) CN114332479A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082430A (en) * 2022-07-20 2022-09-20 中国科学院自动化研究所 Image analysis method and device and electronic equipment
CN117152564A (en) * 2023-10-16 2023-12-01 苏州元脑智能科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN117152564B (en) * 2023-10-16 2024-02-20 苏州元脑智能科技有限公司 Target detection method, target detection device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination