CN118097341B - Target detection method, model training method and related device

Info

Publication number: CN118097341B
Application number: CN202410522310.3A
Authority: CN
Inventors: 李嘉麟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2024-04-28
Filing date: 2024-04-28
Publication date: 2024-08-06
Anticipated expiration: 2044-04-28
Also published as: CN118097341A

Abstract

The embodiments of the application disclose a target detection method, a model training method, and a related device. The target detection method comprises: determining image features corresponding to an image to be detected; determining, by a target decoder in a target detection model, decoding query features corresponding to each query feature in a plurality of query feature sets according to the plurality of query feature sets and the image features, wherein the plurality of query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder, and the query features to be processed are either determined according to initial query features or obtained by screening, based on a high-quality feature condition, the decoding query features output by the decoder of the previous layer of the target decoder; and determining a target detection result corresponding to the image to be detected according to the decoding query features output by the last layer of decoder in the target detection model. The method can improve the calculation efficiency and processing accuracy of the target decoder, and thereby the overall efficiency and accuracy of the target detection task.

Description

Target detection method, model training method and related device

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a target detection method, a model training method and a related device.

Background

Object detection is an important research direction in the field of computer vision, which is used to detect objects of interest in images or videos and to determine the class and location information of the objects. The query-based target detection model is a novel model for executing target detection tasks, and target detection is realized by matching a preset fixed number of query features with targets in an image.

When a query-based target detection model executes a target detection task, each decoder layer needs to use an attention mechanism to calculate the association relations among all query features, as well as the association relations between the image features output by the encoder and all query features. Since a large number of query features are usually set (e.g., 300 query features are set in the far-super image), the computation performed by the decoder has high complexity and low efficiency, which in turn affects the overall execution efficiency of the target detection task.

Disclosure of Invention

The embodiment of the application provides a target detection method, a model training method and a related device, which can reduce the complexity of calculation required to be executed by a decoder, improve the calculation efficiency and further improve the overall execution efficiency of a target detection task.

The first aspect of the present application provides a target detection method, the method comprising:

Determining image characteristics corresponding to an image to be detected;

Determining, by a target decoder in a target detection model, decoding query features corresponding to each query feature in a plurality of query feature sets according to the plurality of query feature sets and the image features; the query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; the query features to be processed are determined according to initial query features, or are obtained by screening, based on a high-quality feature condition, the decoding query features output by the decoder of the previous layer of the target decoder; the target decoder is one layer of decoder in the target detection model;

Determining a target detection result corresponding to the image to be detected according to each decoding query feature output by the last layer of decoder in the target detection model.

Optionally, the determining the image feature corresponding to the image to be detected includes:

Performing feature extraction processing on the image to be detected through a backbone network in the target detection model to obtain initial image features corresponding to the image to be detected;

Determining, by an encoder in the target detection model, the image features corresponding to the image to be detected according to the initial image features and the position coding features corresponding to the image to be detected.

The second aspect of the present application provides a model training method, the method comprising:

Determining training image features corresponding to a training image;

Determining, by a target decoder in a target detection model, training decoding query features corresponding to each training query feature in a plurality of training query feature sets according to the plurality of training query feature sets and the training image features; the training query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; the query features to be processed are determined according to training initial query features, or are obtained by screening, based on a high-quality feature condition, the training decoding query features output by the decoder of the previous layer of the target decoder; the target decoder is one layer of decoder in the target detection model;

Determining a loss value corresponding to each training query feature set according to the training decoding query feature corresponding to each training query feature and the label corresponding to the training image;

Determining a loss value corresponding to the target decoder according to the loss values corresponding to the training query feature sets;

Training the target detection model based on the loss values corresponding to the decoders of each layer in the target detection model.
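The loss construction in the second aspect can be illustrated with a minimal Python sketch. The source only states that the decoder loss is determined "according to" the per-set losses and that training uses the losses of all decoder layers; summing at both levels is an assumption made here for illustration, and the function names and loss values are hypothetical.

```python
from typing import List


def decoder_loss(set_losses: List[float]) -> float:
    """Loss of one decoder layer from its per-set losses (assumed: plain sum)."""
    return sum(set_losses)


def total_loss(per_layer_set_losses: List[List[float]]) -> float:
    """Training loss over all decoder layers (assumed: sum of layer losses)."""
    return sum(decoder_loss(losses) for losses in per_layer_set_losses)


# Hypothetical per-set loss values: layer 0 has three training query
# feature sets, layer 1 has two.
per_layer = [[0.9, 1.2, 0.6], [0.5, 0.7]]
layer_losses = [decoder_loss(losses) for losses in per_layer]
loss = total_loss(per_layer)
```

A weighted sum or a mean would fit the same structure; only the two aggregation functions would change.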

A third aspect of the present application provides an object detection apparatus, the apparatus comprising:

the coding module is used for determining the image characteristics corresponding to the image to be detected;

the decoding module is used for determining, by a target decoder in a target detection model, decoding query features corresponding to each query feature in a plurality of query feature sets according to the plurality of query feature sets and the image features; the query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; the query features to be processed are determined according to initial query features, or are obtained by screening, based on a high-quality feature condition, the decoding query features output by the decoder of the previous layer of the target decoder; the target decoder is one layer of decoder in the target detection model;

and the detection module is used for determining a target detection result corresponding to the image to be detected according to each decoding query characteristic output by the last layer of decoder in the target detection model.

A fourth aspect of the present application provides a model training apparatus, the apparatus comprising:

the coding module is used for determining training image features corresponding to a training image;

the decoding module is used for determining, by a target decoder in a target detection model, training decoding query features corresponding to each training query feature in a plurality of training query feature sets according to the plurality of training query feature sets and the training image features; the training query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; the query features to be processed are determined according to training initial query features, or are obtained by screening, based on a high-quality feature condition, the training decoding query features output by the decoder of the previous layer of the target decoder; the target decoder is one layer of decoder in the target detection model;

the loss construction module is used for determining loss values corresponding to the training query feature sets according to the training decoding query features corresponding to the training query features and the label corresponding to the training image;

the loss construction module is further configured to determine a loss value corresponding to the target decoder according to the loss values corresponding to the training query feature sets;

and the training module is used for training the target detection model based on the loss values corresponding to the decoders of each layer in the target detection model.

A fifth aspect of the application provides a computer apparatus comprising a processor and a memory:

The memory is used for storing a computer program;

The processor is configured to perform the steps of the method according to the first or second aspect described above in accordance with the computer program.

A sixth aspect of the present application provides a computer readable storage medium storing a computer program for performing the steps of the method of the first or second aspect described above.

A seventh aspect of the application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, the computer instructions being executed by the processor to cause the computer device to perform the steps of the method of the first or second aspect described above.

From the above technical solutions, the embodiment of the present application has the following advantages:

According to the target detection method provided by the embodiment of the application, all query features to be processed in the target decoder are grouped to obtain a plurality of query feature sets, so that the target decoder performs calculation group by group, based on each query feature set and the image features corresponding to the image to be detected, to obtain the decoding query features corresponding to each query feature in each query feature set. Because the target decoder performs calculation in units of groups, the number of query features it processes at a time is reduced; accordingly, the calculation complexity of the target decoder is reduced, its calculation efficiency is improved, and the overall execution efficiency of the target detection task is improved as well. In addition, the query features to be processed of the target decoder can be obtained by screening, based on the high-quality feature condition, the decoding query features output by the decoder of the previous layer. In this case the query features to be processed are high-quality query features, so the target decoder, performing the relevant calculation on them, can obtain high-quality decoding query features, that is, decoding query features that reflect the target detection result more accurately. This improves the processing accuracy of the target decoder and, in turn, the accuracy of the target detection result determined according to the decoding query features output by the last layer of decoder in the target detection model. Screening the query features to be processed based on the high-quality feature condition also further reduces their number, which further reduces the calculation complexity of the target decoder and improves both its calculation efficiency and the overall target detection efficiency.
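The effect of grouping on the amount of calculation can be made concrete with a small counting sketch. It assumes, for illustration only, that the association calculation among query features is confined to each group and that the groups are of near-equal size; the group count of 3 and the function names are hypothetical. It counts only query-to-query pairs and does not model the calculation against the image features.

```python
def self_attention_pairs(group_sizes):
    """Count query-to-query score pairs when attention stays inside each group."""
    return sum(n * n for n in group_sizes)


def split_evenly(num_queries, num_groups):
    """Sizes of num_groups near-equal groups that together hold num_queries."""
    base, extra = divmod(num_queries, num_groups)
    return [base + 1 if i < extra else base for i in range(num_groups)]


num_queries = 300  # the example count mentioned in the Background
ungrouped = self_attention_pairs([num_queries])
grouped = self_attention_pairs(split_evenly(num_queries, 3))  # 3 groups: assumed
print(ungrouped, grouped)  # prints: 90000 30000
```

Under these assumptions the pair count falls roughly in proportion to the number of groups, since each group contributes the square of its own, smaller, size.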

Drawings

FIG. 1 is a schematic diagram of an application scenario of a target detection method according to an embodiment of the present application;

FIG. 2 is a flow chart of a method for detecting targets according to an embodiment of the present application;

FIG. 3 is a flowchart for determining image characteristics of an image to be detected according to an embodiment of the present application;

FIG. 4 is a flow chart of determining a set of decoded query features according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a target detection result according to an embodiment of the present application;

FIG. 6 is a flow chart of a model training method according to an embodiment of the present application;

FIG. 7 is an algorithm logic diagram of an object detection model provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an object detection device according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a model training device according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. A pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Machine learning (Machine Learning, ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. The pre-training model is the latest development of deep learning and integrates the above techniques.

Computer vision (Computer Vision, CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as identifying and measuring targets, and further performs graphic processing so that the result is an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large model technology has brought important changes to the development of computer vision; pre-trained models in the vision field, such as Swin Transformer, ViT, V-MoE, and MAE, can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.

The scheme provided by the embodiment of the application relates to the technologies of computer vision, machine learning and the like of artificial intelligence, and is specifically described by the following embodiments:

The target detection method provided by the embodiment of the application can be implemented by computer equipment, and the computer equipment can be terminal equipment or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing service. The terminal equipment comprises, but is not limited to, mobile phones, computers, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.

It should be noted that, the images, information, data and signals related to the embodiments of the present application are authorized by the related objects or fully authorized by the parties, and the collection, use and processing of the related data all comply with the related laws and regulations and standards of the related countries and regions.

To facilitate understanding of the target detection method provided by the embodiment of the present application, an application scenario of the method is described below, taking a server as an example of the execution subject of the target detection method.

Referring to fig. 1, fig. 1 is a schematic application scenario of a target detection method according to an embodiment of the present application, where a terminal device 100 transmits an image to be detected to a server 200, and the server 200 is configured to execute the target detection method and return a target detection result of the image to be detected to the terminal device 100.

When the target detection is performed on the image to be detected, firstly, determining the image characteristics corresponding to the image to be detected, for example, processing the image to be detected through an encoder in a target detection model, and obtaining the image characteristics corresponding to the image to be detected. Further, the image feature is input to a decoder of the object detection model.

The object detection model may include a multi-layer decoder therein; the following describes the processing procedure of the target decoder by taking one layer of decoder as the target decoder. Before the relevant calculation processing is executed by the target decoder, the query features to be processed corresponding to the target decoder can be grouped to obtain a plurality of query feature sets, so that the target decoder can respectively perform relevant calculation on each query feature set, namely, calculate according to each query feature set and image features, and accordingly obtain decoding query features corresponding to each query feature in each query feature set. Compared with the number of all the query features to be processed, the number of the query features to be processed in each query feature set is greatly reduced, so that the target decoder calculates for each query feature set respectively, the calculation amount of the target decoder is greatly reduced, and the calculation efficiency of the target decoder is improved.
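The grouping step described above can be sketched in plain Python. The grouping rule (contiguous chunks) and the stand-in decoder `toy_decode_set` are assumptions for illustration; the source does not fix either, and a real decoder layer would apply attention inside `decode_set` rather than the simple arithmetic used here.

```python
from typing import List, Sequence

Vector = List[float]


def group_queries(queries: Sequence[Vector], num_groups: int) -> List[List[Vector]]:
    """Partition queries into at most num_groups contiguous sets (assumed rule)."""
    size = max(1, -(-len(queries) // num_groups))  # ceiling division
    return [list(queries[i:i + size]) for i in range(0, len(queries), size)]


def decode_grouped(queries, image_features, decode_set, num_groups):
    """Run decode_set on each query feature set and concatenate the outputs."""
    decoded = []
    for query_set in group_queries(queries, num_groups):
        decoded.extend(decode_set(query_set, image_features))
    return decoded


def toy_decode_set(query_set, image_features):
    """Stand-in for a decoder layer: add the mean image feature to each query."""
    dim = len(image_features[0])
    mean = [sum(f[d] for f in image_features) / len(image_features) for d in range(dim)]
    return [[q[d] + mean[d] for d in range(dim)] for q in query_set]


queries = [[float(i), 0.0] for i in range(6)]   # six hypothetical query features
image_features = [[1.0, 1.0], [3.0, 5.0]]       # two hypothetical image features
groups = group_queries(queries, 3)
decoded = decode_grouped(queries, image_features, toy_decode_set, 3)
```

Each call to `decode_set` sees only the queries of one set, which is the property the paragraph relies on: the decoder still produces one decoding query feature per input query, but never handles all of them at once.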

In the embodiment of the application, each query feature to be processed corresponding to the target decoder can be determined according to the initial query features; that is, when the target decoder is the first-layer decoder in the target detection model, its query features to be processed may be determined according to the initial query features. The query features to be processed corresponding to the target decoder can also be obtained by screening, based on the high-quality feature condition, the decoding query features output by the previous-layer decoder of the target decoder; that is, when the target decoder is not the first-layer decoder in the target detection model, its query features to be processed can be screened from the decoding query features output by the previous-layer decoder, and this screening further reduces the number of query features to be processed by the target decoder. Similarly, when the next-layer decoder of the target decoder has the same processing mechanism as the target decoder, its query features to be processed are screened, based on the high-quality feature condition, from the decoding query features output by the target decoder, so their number is reduced further. Therefore, the number of query features processed by the decoders of the target detection model decreases layer by layer, which reduces the calculation amount and calculation complexity of each decoder layer and further improves the target detection efficiency of the target detection model.

After the decoding query features output by the last layer of decoder in the target detection model are obtained, the target detection result corresponding to the image to be detected is determined according to these decoding query features. The query features to be processed that are input to the target decoder may be screened, based on the high-quality feature condition, from the decoding query features output by the previous-layer decoder of the target decoder; likewise, for the next-layer decoder of the target decoder, the corresponding query features to be processed may be screened, based on the high-quality feature condition, from the decoding query features output by the target decoder. The query features to be processed of each layer of decoder thus satisfy the high-quality feature condition, which improves the quality of the query features used in the target detection process. Because the target decoder performs the relevant calculation based on these query features to be processed, the decoding query features it obtains reflect the target detection result more accurately, which improves the processing accuracy of the target decoder and, in turn, the accuracy of the final target detection result.

Therefore, the target detection method provided by the embodiment of the application optimizes the target detection process from two angles: grouping the query features to be processed, and screening the query features to be processed corresponding to the decoder based on the high-quality feature condition. On the one hand, the number of query features processed by the target decoder at a time is reduced; on the other hand, the quality of the query features used by the target decoder in the calculation process is improved. As a result, the calculation amount and calculation complexity of the target decoder when processing each query feature set are reduced, the calculation efficiency and detection quality of the target decoder are improved, and the detection efficiency, detection quality, and result accuracy of the target detection model when executing the target detection task are improved as well.
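Screening between two decoder layers can be pictured with the following sketch. This passage does not define the high-quality feature condition, so the sketch assumes, purely for illustration, that each decoding query feature carries a scalar quality score and that the condition is a threshold on that score; the names and values are hypothetical.

```python
from typing import List, Tuple

# (decoding query feature, hypothetical quality score)
ScoredQuery = Tuple[List[float], float]


def screen_queries(decoded: List[ScoredQuery], threshold: float) -> List[ScoredQuery]:
    """Keep the decoding query features whose score meets the assumed condition."""
    return [(feature, score) for feature, score in decoded if score >= threshold]


# Output of the previous-layer decoder (hypothetical values).
decoded_prev = [
    ([0.1, 0.2], 0.9),
    ([0.4, 0.1], 0.2),
    ([0.3, 0.3], 0.6),
    ([0.0, 0.5], 0.4),
]

# Query features to be processed by the target decoder.
pending = screen_queries(decoded_prev, threshold=0.5)
print(len(decoded_prev), len(pending))  # prints: 4 2
```

Applying the same step before every non-first decoder layer is what makes the number of processed query features shrink, or at least not grow, from layer to layer.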

It should be understood that the application scenario shown in fig. 1 is merely an example, and in practical application, the target detection method provided by the embodiment of the present application may also be applied to other scenarios, for example, the target detection method provided by the embodiment of the present application may be independently executed by the terminal device, and the application scenario of the target detection method provided by the embodiment of the present application is not limited in any way.

Application scenarios of the target detection method provided by the embodiment of the application include, but are not limited to, automatic driving, medical image analysis, security, retail, and agriculture. Specifically, in an automatic driving scene, the method can detect targets such as pedestrians, vehicles, and traffic facilities; in a medical image analysis scene, the method can detect and segment a lesion area in a medical image; in a security scene, the method can detect and track targets such as pedestrians, vehicles, and specific articles; in a retail scene, the method can realize automatic identification and counting of commodities; in an agricultural scene, the method can realize the detection and segmentation of plant diseases and insect pests.

The target detection method provided by the application is described in detail below through method embodiments.

Referring to fig. 2, fig. 2 is a flowchart of a method for detecting an object according to an embodiment of the present application, where the method specifically includes the following steps:

Step 201: Determining the image features corresponding to the image to be detected.

In the embodiment of the application, the image to be detected is an image object to be subjected to target detection, and the embodiment of the application can specifically determine the target detection result corresponding to the image to be detected through a target detection model, for example, determine the category to which the target included in the image to be detected belongs and the position of the target in the image to be detected.

In embodiments of the present application, the target detection model may be a query-based target detection model, including but not limited to the Detection Transformer (DETR) target detector, the label-free distillation (Distillation with No Labels, DINO) model, the Anchor DETR detector, the AdaMixer model, and the Sparse R-CNN model. A query-based target detection model performs target detection on the image to be detected by matching query feature vectors (queries) with the image features of the image to be detected.
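The Background states that the decoder relates query features to image features through an attention mechanism. One common form is scaled dot-product cross-attention, sketched here for a single query. This is an illustrative simplification under that assumption: learned projection matrices, multiple heads, and the exact formulation used by any particular model are omitted, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: List[float],
                 image_features: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Weight image features by scaled dot-product similarity to one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in image_features]
    weights = softmax(scores)
    dim = len(image_features[0])
    attended = [sum(w * feat[d] for w, feat in zip(weights, image_features))
                for d in range(dim)]
    return weights, attended


query = [1.0, 0.0]                           # hypothetical query feature
image_features = [[5.0, 0.0], [0.0, 5.0]]    # hypothetical image features
weights, attended = cross_attend(query, image_features)
```

The query here points along the first axis, so the first image feature receives the larger weight and dominates the attended output.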

When the target detection is carried out on the image to be detected, firstly, the image characteristics corresponding to the image to be detected need to be determined so as to carry out the target detection on the image to be detected based on the image characteristics.

In one or more embodiments, the image features of the image to be detected may be obtained by inputting the image to be detected into a convolutional neural network (Convolutional Neural Network, CNN), or may be obtained through other models such as a support vector machine (Support Vector Machine, SVM); the embodiment of the present application does not limit the manner of obtaining the image features of the image to be detected.

In one possible implementation, step 201 may be specifically implemented by:

S11: Performing feature extraction processing on the image to be detected through a backbone network in the target detection model to obtain initial image features corresponding to the image to be detected;

S12: Determining, by an encoder in the target detection model, the image features corresponding to the image to be detected according to the initial image features and the position coding features corresponding to the image to be detected.

Referring to fig. 3, fig. 3 is a flowchart of determining image features of an image to be detected according to an embodiment of the present application. After the image to be detected is input into the target detection model, the initial image features corresponding to the image to be detected are first extracted through the backbone network (backbone) of the target detection model.

After the initial image features are obtained, because the encoder, the decoder, and the prediction head of the target detection model are insensitive to position information in the image during target detection, corresponding position information needs to be embedded for each initial image feature. It will be appreciated that the position coding feature is used to characterize the position, in the image to be detected, of each sub-image feature in the initial image features, and that the position coding feature may specifically be obtained by processing the image to be detected with a position encoder.

The target detection model can include a multi-layer encoder. In this case, the multi-layer encoder iteratively performs multiple rounds of encoding on the fused feature of the initial image features and the position coding features of the image to be detected, thereby obtaining the image features corresponding to the image to be detected in the embodiment of the application.

In this way, the initial image features corresponding to the image to be detected are determined through the backbone network of the target detection model, then the position coding features of the image to be detected are embedded into the initial image features through the encoder of the target detection model, and the image features corresponding to the image to be detected are obtained, so that the decoder and the prediction head of the target detection model can recognize the position information corresponding to the features, and the accuracy and the detection efficiency of target detection in the subsequent process are improved.
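The fusion of initial image features with position coding features can be illustrated with a minimal sketch. The sketch assumes a fixed sine/cosine position code and element-wise addition as the fusion; the text above fixes neither choice, and the function names `sinusoidal_position_encoding` and `fuse_position` are hypothetical. The subsequent multi-layer encoding is not shown.

```python
import math

def sinusoidal_position_encoding(num_positions, dim):
    """Return a num_positions x dim table of fixed sine/cosine position codes."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def fuse_position(initial_features, position_codes):
    """Element-wise add position codes to the flattened backbone features (assumed fusion)."""
    return [[f + p for f, p in zip(feat, code)]
            for feat, code in zip(initial_features, position_codes)]

# Hypothetical backbone output: 4 spatial positions, 6 channels each, all zeros.
initial_features = [[0.0] * 6 for _ in range(4)]
codes = sinusoidal_position_encoding(4, 6)
fused = fuse_position(initial_features, codes)
```

With identical input features at every position, the fused features differ only through their position codes, which is what lets the later attention layers distinguish positions.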

Step 202: determining, through a target decoder in the target detection model, the decoding query features corresponding to each query feature in a plurality of query feature sets according to the plurality of query feature sets and the image features.

The query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder, and the target decoder is a layer of decoder in the target detection model. For the target decoder of the target detection model, all the query features to be processed corresponding to the target decoder are divided into a plurality of groups, one group is a query feature set, and the query features included in the query feature set are the query features to be processed of the target decoder.

In practical applications, the target detection model may include a multi-layer decoder, and any one or more layers of the multi-layer decoder may serve as the target decoder. The target decoder is a decoder that performs correlation calculation in units of query feature sets obtained by grouping all query features to be processed. For the target detection model, some of the decoders may be set as target decoders, that is, these decoders perform correlation calculation in units of query feature sets each containing a smaller number of query features; alternatively, all of the decoders may be set as target decoders, that is, every decoder layer performs correlation calculation in units of such query feature sets.

When the target decoder is the first-layer decoder in the target detection model, each corresponding query feature to be processed can be determined according to the initial query features. The initial query features are the initial queries preset in the target detection model. Like model parameters, the initial queries are continuously updated while the target detection model is trained, and the initial query features are fixed once training of the target detection model is completed.

When the target decoder is not the first-layer decoder, the corresponding query features to be processed are screened, based on the high-quality feature condition, from the decoding query features output by the previous-layer decoder of the target decoder. It should be understood that, in the embodiment of the present application, the high-quality feature condition may be set according to an evaluation index of the query features, where the evaluation index includes, but is not limited to, the classification accuracy, the position accuracy and the like corresponding to the query features. The high-quality feature condition may be a preference condition, namely, a preset number of decoding query features with the best quality are selected in descending order of the evaluation index; it may also be a qualification condition, namely, the decoding query features whose evaluation index is higher than a certain threshold are selected. The embodiment of the present application does not limit the specific form and content of the high-quality feature condition.

The target decoder may process the respective query feature sets in parallel or in series. In general, in order to improve processing efficiency, multiple query feature sets may be processed in parallel by multiple target decoders having the same model parameters.

For each query feature set, the decoding query feature corresponding to each query feature to be processed in the set is determined. In the embodiment of the present application, a decoding query feature is a feature vector that is calculated from the query feature to be processed and the image features of the image to be detected and that can indicate target classification information and target position information.

In one possible implementation, each layer decoder in the target detection model may be a target decoder; alternatively, the first n layers of decoders in the target detection model may be target decoders, where n is an integer greater than or equal to 1 and less than the total number of decoder layers in the target detection model.

As an example, when the target detection model includes three layers of decoders, all three layers may be determined as target decoders. In this case, the input of each target decoder layer needs to be grouped so that the target decoder performs correlation calculation in units of groups, and the output decoding query features are screened before being passed to the next-layer decoder.

When all decoders in the target detection model are used as target decoders, all decoders in the target detection model have the same processing mechanism, so that the number of query features processed by each layer of decoders can be reduced, the calculation amount and the calculation complexity of each layer of decoders are reduced, and the calculation efficiency of each layer of decoders is further improved.

As another example, when the target detection model includes three layers of decoders, the first two layers may be determined as target decoders. For the first-layer target decoder, each query feature set is obtained by grouping the initial query features, and the decoding query features it outputs are screened by the high-quality feature condition; the decoding query features that are retained serve as the query features to be processed corresponding to the second-layer target decoder. Each query feature set of the second-layer target decoder is obtained by grouping these query features to be processed. The decoding query features output by the second-layer target decoder need neither screening by the high-quality feature condition nor grouping, and can be input directly into the third-layer decoder for calculation.

When the first n layers of decoders in the target detection model are determined as target decoders, the calculated amount in the target detection process can be effectively reduced through the first n layers of target decoders, the calculation efficiency is improved, other non-target decoders can calculate and process based on all decoding query features output by the last layer of target decoders, the number of query features used in the processing process of the non-target decoders is guaranteed, and the accuracy of target detection is improved; in addition, a plurality of different decoders including target decoders are arranged in the target detection model, and the different decoders have different processing mechanisms, so that the flexibility of target detection is improved.
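The layer arrangement described in the two examples above can be sketched as a small control-flow skeleton. This is only an illustration under stated assumptions: queries are represented as hypothetical `(query_id, score)` pairs, the decoder itself is replaced by a stand-in that returns its input unchanged, grouping is contiguous, and screening keeps a fixed percentage. The names `run_decoders`, `split_into_groups` and `keep_top` are not from the source.

```python
def split_into_groups(items, num_groups):
    """Split items into num_groups contiguous chunks of near-equal size."""
    size, rem = divmod(len(items), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        end = start + size + (1 if g < rem else 0)
        groups.append(items[start:end])
        start = end
    return groups

def keep_top(queries, keep_percent):
    """Keep the highest-scoring keep_percent% of (query_id, score) pairs."""
    ranked = sorted(queries, key=lambda q: q[1], reverse=True)
    return ranked[:len(ranked) * keep_percent // 100]

def run_decoders(queries, num_layers, n_target, num_groups, keep_percent, decode):
    """Run num_layers decoders; the first n_target layers decode group by group."""
    trace = []  # (layer index, sizes of the query sets decoded in that layer)
    for layer in range(num_layers):
        is_target = layer < n_target
        groups = split_into_groups(queries, num_groups) if is_target else [queries]
        trace.append((layer, [len(g) for g in groups]))
        queries = [q for g in groups for q in decode(g)]
        # Screen only when the next layer is also a target decoder.
        if layer + 1 < n_target:
            queries = keep_top(queries, keep_percent)
    return queries, trace

def identity_decode(group):
    """Stand-in decoder that returns its input unchanged (illustration only)."""
    return list(group)

initial = [(i, i / 300) for i in range(300)]
final, trace = run_decoders(initial, 3, 2, 5, 80, identity_decode)
```

With three layers, the first two as target decoders, 5 groups and 80% retention, the first layer decodes five sets of 60, the second five sets of 48, and the third all 240 remaining queries at once.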

In one possible implementation manner, when the target decoder is the first-layer decoder in the target detection model, each query feature to be processed corresponding to the target decoder may be determined by the following S21 or S22:

S21: taking each initial query feature as a query feature to be processed corresponding to the target decoder.

When the target decoder is a first layer decoder in the target detection model, the initial query features set in the target detection model can be directly determined to be the query features to be processed corresponding to the target decoder, and after the initial query features are grouped, each obtained query feature set is used as input of the target decoder. The method improves the determination efficiency of the query characteristics to be processed corresponding to the target decoder, and further improves the detection efficiency of the target detection model.

S22: determining the feature quality corresponding to each initial query feature; and selecting, according to the feature quality corresponding to each initial query feature, the initial query features meeting a preset quality condition from the initial query features as the query features to be processed corresponding to the target decoder.

For the initial query features set in the target detection model, the feature quality corresponding to each initial query feature can be determined through a pre-trained quality evaluation model, and the initial query features with better quality are screened out according to the feature quality by using a preset quality condition and serve as the query features to be processed corresponding to the target decoder.

In the embodiment of the application, the preset quality condition is a screening condition for screening out initial query features with better quality from all the initial query features according to the feature quality. The preset quality condition can be a preferential condition or a qualification condition, and the embodiment of the application is not limited to the specific form and content of the preset quality condition.

Among all the initial query features of the target detection model, the initial query features whose feature quality meets the preset quality condition are determined as the query features to be processed of the target decoder, so that the quality of the query features processed in the target decoder can be improved, and the accuracy of target detection is further improved.

The embodiment of the application can determine the query characteristics to be processed of the first-layer target decoder in two different modes, thereby improving the flexibility of determining the query characteristics to be processed of the first-layer target decoder.

In one possible implementation manner, when the target decoder is a non-first layer decoder in the target detection model, each query feature to be processed corresponding to the target decoder is determined through the following flows corresponding to S31 and S32:

S31: determining the target detection result corresponding to each decoding query feature according to each decoding query feature output by the previous-layer decoder.

When the target decoder is a non-first-layer decoder, at least one decoder layer precedes it, and the input of the target decoder is determined from the output of the previous-layer decoder. The target detection result corresponding to each decoding query feature output by the previous-layer decoder is determined, where the target detection result includes the classification score corresponding to the detection target indicated by the decoding query feature.

For example, when two targets, namely a person and a vehicle, are included in the image to be detected, the probability that the detected target indicated by the decoding query feature is the person target, the probability that the detected target is the vehicle target and the probability that the detected target is the background are calculated respectively, and the sum of the three probabilities is 1; in the target detection results corresponding to the decoding query features, the classification scores corresponding to the detection targets are determined according to the probabilities of various targets and backgrounds corresponding to the detection targets.

Specifically, among the probabilities that the detection target corresponds to the various targets and the background, the highest probability value is determined and used as the classification score corresponding to the detection target. For example, when the probability that the detection target is a person is 0.6, the probability that it is a vehicle is 0.3, and the probability that it is background is 0.1, the classification score of the detection target indicated by the decoding query feature is determined to be 0.6 in the target detection result.
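The classification score in the example above is simply the largest of the per-class probabilities. A minimal sketch, where the dictionary layout and the function name `classification_score` are assumptions made for illustration:

```python
def classification_score(class_probs):
    """Return (label, score): the most probable label and its probability."""
    label = max(class_probs, key=class_probs.get)
    return label, class_probs[label]

# Probabilities from the example in the text (person, vehicle, background).
probs = {"person": 0.6, "vehicle": 0.3, "background": 0.1}
label, score = classification_score(probs)
```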

S32: selecting, according to the classification scores in the target detection results corresponding to the decoding query features, the decoding query features meeting the high-quality feature condition from the decoding query features as the query features to be processed corresponding to the target decoder.

After the classification score of the detection target indicated by each decoding query feature is determined from the corresponding target detection result, the high-quality feature condition is used to screen out, based on the classification scores, the decoding query features with better quality from all decoding query features as the query features to be processed corresponding to the target decoder. For example, the decoding query features may be sorted in descending order of classification score, and the top 80% with the highest classification scores are selected according to the high-quality feature condition and determined as the query features to be processed corresponding to the target decoder. For another example, the decoding query features with a classification score greater than or equal to 0.5 may be selected from the decoding query features according to the high-quality feature condition and determined as the query features to be processed corresponding to the target decoder.

By using the high-quality feature condition to screen out, based on the classification scores, the decoding query features with better quality from all decoding query features output by the previous-layer decoder as the query features to be processed of the target decoder, the number of query features to be processed corresponding to the target decoder is reduced and the quality of the query features processed in the target decoder is improved, which in turn improves the calculation efficiency of the target decoder and the accuracy of target detection.
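The two example conditions above (keeping the top 80% by classification score, or keeping scores of at least 0.5) can be sketched as follows. The `(query_id, score)` pair representation and the function names are assumptions for illustration only.

```python
def screen_top_fraction(scored, keep_percent):
    """Preference-style condition: keep the best keep_percent% by score."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return ranked[:len(ranked) * keep_percent // 100]

def screen_by_threshold(scored, threshold):
    """Qualification-style condition: keep items scoring at least threshold."""
    return [item for item in scored if item[1] >= threshold]

# Hypothetical decoding query features with their classification scores.
scored = [("q0", 0.9), ("q1", 0.2), ("q2", 0.6), ("q3", 0.5), ("q4", 0.4)]
top = screen_top_fraction(scored, 80)
passed = screen_by_threshold(scored, 0.5)
```

The first condition always retains a fixed share of the queries, while the second retains a number that depends on the score distribution.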

In one possible implementation manner, the input of the previous-layer decoder includes a plurality of query feature sets obtained by grouping the query features to be processed corresponding to the previous-layer decoder, and the output of the previous-layer decoder includes the decoding query feature set corresponding to each of the plurality of query feature sets. In this case, S32 may be specifically implemented as follows:

For each decoding query feature set, selecting, according to the classification scores in the target detection results corresponding to the decoding query features in that set, the decoding query features meeting the high-quality feature condition from the set as query features to be processed corresponding to the target decoder.

In the previous-layer decoder, each query feature set is calculated separately, so the output obtained is likewise a decoding query feature set corresponding to each query feature set. When the query features to be processed corresponding to the target decoder are determined based on the decoding query features, the decoding query features meeting the high-quality feature condition can be screened out from each decoding query feature set separately.

Specifically, for one decoding query feature set, the classification scores in the target detection results corresponding to the decoding query features in the set are determined, the high-quality feature condition is used to screen out, according to the classification scores, the decoding query features with higher classification scores and better quality, and these are determined as query features to be processed of the target decoder. All query features to be processed corresponding to the target decoder are then obtained from the decoding query features screened out of the individual decoding query feature sets.

Screening the decoding query features meeting the high-quality feature condition group by group, that is, screening each decoding query feature set separately, reduces the amount of data handled in each screening pass and improves screening efficiency.
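Group-wise screening can be sketched as below, assuming the top-80% form of the high-quality feature condition and hypothetical `(query_id, score)` pairs; `screen_per_group` is an illustrative name.

```python
def screen_per_group(decoded_sets, keep_percent):
    """Screen each decoding query feature set separately, then merge the survivors."""
    survivors = []
    for decoded in decoded_sets:
        ranked = sorted(decoded, key=lambda item: item[1], reverse=True)
        survivors.extend(ranked[:len(ranked) * keep_percent // 100])
    return survivors

# Two hypothetical decoding query feature sets output by the previous-layer decoder.
decoded_sets = [
    [("a0", 0.9), ("a1", 0.1), ("a2", 0.7), ("a3", 0.3), ("a4", 0.5)],
    [("b0", 0.2), ("b1", 0.8), ("b2", 0.6), ("b3", 0.4), ("b4", 0.95)],
]
pending = screen_per_group(decoded_sets, 80)
```

Each sort here runs over one set only, so no pass ever has to rank the full collection of decoding query features.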

In addition, when the previous-layer decoder does not process the query features on a grouped basis, the top 80% of the decoding query features may be selected directly, according to the classification scores corresponding to the decoding query features output by the previous-layer decoder, as the query features to be processed of the target decoder.

In one possible implementation, the query feature sets may be obtained by grouping in the following manner:

Randomly dividing, according to a preset number of groups, the query features to be processed corresponding to the target decoder into a plurality of query feature sets.

When grouping the query features to be processed corresponding to the target decoder, a preset grouping number can be determined according to the number of the query features to be processed, then each query feature to be processed is randomly grouped according to the preset grouping number, so as to obtain a plurality of query feature sets, and the number of the query features to be processed in each query feature set is the same.

For example, when the number of query features to be processed is 300 and the preset number of groups is 5, the 300 query features to be processed are randomly divided into 5 query feature sets, and the number of query features to be processed in each query feature set is 60.

The quantity and quality balance of the query features to be processed in each query feature set can be improved through random grouping.
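The random grouping in the example (300 query features to be processed, 5 groups of 60) can be sketched as a shuffle followed by equal slicing. The function name and the optional `seed` parameter are assumptions added for reproducibility.

```python
import random

def random_grouping(queries, num_groups, seed=None):
    """Shuffle the queries and cut them into num_groups equal-sized sets."""
    if len(queries) % num_groups != 0:
        raise ValueError("query count must be divisible by the number of groups")
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // num_groups
    return [shuffled[i * size:(i + 1) * size] for i in range(num_groups)]

# 300 query indices divided into 5 query feature sets of 60 each.
groups = random_grouping(range(300), 5, seed=0)
```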

In addition to random grouping, the query features to be processed corresponding to the target decoder may also be grouped according to their quality to obtain the plurality of query feature sets. In one possible implementation, the query feature sets may be determined by:

S41: determining the feature quality of each to-be-processed query feature corresponding to the target decoder;

S42: distributing, according to the preset number of groups, each query feature to be processed into the plurality of query feature sets according to the feature quality corresponding to each query feature to be processed.

When the query features to be processed are grouped, the feature quality of each query feature to be processed is first determined, and the query features to be processed are then grouped according to feature quality. The feature quality may be determined according to the classification score in the target detection result corresponding to the query feature to be processed, or may be determined by a pre-trained quality evaluation model.

As an example, the embodiment of the application can group the query features to be processed so that quality is balanced across groups: according to the preset number of groups, query features to be processed with similar feature quality are distributed evenly over the query feature sets, so that the overall feature quality of the query features to be processed is approximately equal across the resulting query feature sets, which improves the balance of feature quality among the query feature sets.
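One way to realize grouping that balances quality across groups is to sort the query features by quality and deal them out in serpentine order. This dealing rule is an assumption chosen for illustration; the text does not prescribe a specific procedure, and `balanced_quality_grouping` is a hypothetical name.

```python
def balanced_quality_grouping(qualities, num_groups):
    """Deal query indices into groups in serpentine order of descending quality."""
    order = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    groups = [[] for _ in range(num_groups)]
    for rank, idx in enumerate(order):
        round_no, pos = divmod(rank, num_groups)
        # Reverse the dealing direction on every other round.
        g = pos if round_no % 2 == 0 else num_groups - 1 - pos
        groups[g].append(idx)
    return groups

# Hypothetical feature qualities for 10 query features to be processed.
qualities = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
groups = balanced_quality_grouping(qualities, 2)
```

Reversing direction each round keeps the summed quality of the groups closer than a plain round-robin deal would.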

As another example, the embodiment of the application can group the query features to be processed so that quality is uniform within each group. Specifically, query features to be processed with similar feature quality are assigned to the same group according to the preset number of groups, so that the query features to be processed belonging to the same query feature set have similar feature quality, which improves the uniformity of feature quality within each query feature set.

It should be understood that the embodiment of the present application may also adopt other grouping manners based on feature quality, which is not specifically limited here.
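Grouping query features of similar quality into the same set can be sketched as sorting by quality and cutting the sorted order into contiguous chunks. This is one possible procedure, assumed here for illustration; `similar_quality_grouping` is a hypothetical name.

```python
def similar_quality_grouping(qualities, num_groups):
    """Sort query indices by descending quality and cut them into contiguous groups."""
    order = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    size, rem = divmod(len(order), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        end = start + size + (1 if g < rem else 0)
        groups.append(order[start:end])
        start = end
    return groups

# Hypothetical feature qualities for 6 query features to be processed.
qualities = [0.3, 0.9, 0.1, 0.8, 0.2, 0.7]
groups = similar_quality_grouping(qualities, 3)
```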

Each query feature to be processed is grouped according to the feature quality, so that the grouping flexibility is improved, the quality of the query feature set is improved, and the detection quality of the target detection model is improved.

In addition to random grouping and grouping according to feature quality, each query feature to be processed may also be grouped according to a feature distribution of the query features to be processed. In one possible implementation, the query feature set may be obtained by:

S51: determining the distribution, in a feature space, of each query feature to be processed corresponding to the target decoder;

S52: distributing, according to the preset number of groups, each query feature to be processed into the plurality of query feature sets according to its distribution in the feature space.

The similarity degree between the query features to be processed can be determined through the distribution of the query features to be processed in the feature space, and the distribution of the query features to be processed with higher similarity degree in the feature space is closer. When the query features to be processed are grouped, the query features to be processed can be grouped according to the distribution of the query features to be processed in the feature space.

As an example, the embodiment of the application can assign query features to be processed that are distributed close to one another in the feature space to the same group, so that the query features to be processed in each resulting query feature set are relatively similar. It should be understood that, in the embodiment of the present application, other grouping manners based on the distribution of the query features to be processed in the feature space may also be adopted, for example, assigning query features to be processed that are distributed far apart in the feature space to the same group, which is not specifically limited in the embodiment of the present application.

According to the distribution condition of the query features to be processed in the feature space, the query features to be processed are grouped, so that the grouping flexibility is improved, the quality of the query feature set is improved, and the detection quality of the target detection model is improved.

Thus, when target detection is performed based on each query feature set with similar query features in the group, the accuracy of target detection can be improved.
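Grouping by closeness in the feature space could, for instance, be done with a clustering method. The sketch below uses plain k-means with a fixed number of iterations; this is a swapped-in technique chosen for illustration, not one named in the text, and it does not guarantee equal group sizes. Taking the first `num_groups` vectors as initial centroids is likewise an assumption.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_grouping(features, num_groups, iterations=10):
    """Group feature vectors by nearest centroid (plain k-means, fixed iterations)."""
    # Assumption: the first num_groups vectors serve as initial centroids.
    centroids = [list(f) for f in features[:num_groups]]
    assignment = [0] * len(features)
    for _ in range(iterations):
        assignment = [min(range(num_groups), key=lambda g: sq_dist(f, centroids[g]))
                      for f in features]
        for g in range(num_groups):
            members = [f for f, a in zip(features, assignment) if a == g]
            if members:  # keep the old centroid if a group becomes empty
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return [[i for i, a in enumerate(assignment) if a == g] for g in range(num_groups)]

# Hypothetical 2-D query features forming two well-separated clusters.
features = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.2), (9.8, 10.1), (0.2, 0.1), (10.2, 9.9)]
groups = kmeans_grouping(features, 2)
```

If equal-sized query feature sets are required, a size-constrained assignment step would have to replace the nearest-centroid rule.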

It should be noted that the embodiment of the application may also jointly consider the feature quality of the query features to be processed and their distribution in the feature space when finally determining the grouping manner of the query features to be processed.

As an embodiment, step 202 may be implemented as follows:

S61: determining, through the target decoder using a self-attention mechanism, the intermediate query feature set corresponding to each query feature set according to each query feature set.

Referring to fig. 4, fig. 4 is a schematic flow chart of determining a decoding query feature set according to an embodiment of the present application. Taking the query feature set as a unit, the query features to be processed in a query feature set are input to the target decoder as the input of the multi-head self-attention (Multi-Head Self-Attention, MSA) mechanism in the target decoder. The self-attention mechanism processes the query features to be processed and determines the intermediate query feature corresponding to each of them, and the intermediate query feature set corresponding to the query feature set is then determined from these intermediate query features. The intermediate query feature set includes the intermediate query feature corresponding to each query feature in the corresponding query feature set.

S62: determining, through the target decoder using an attention mechanism, the decoding query feature set corresponding to each intermediate query feature set according to each intermediate query feature set and the image features.

After the intermediate query feature set is determined from the intermediate query features output by the MSA, the intermediate query feature set, taken as a unit, is combined with the image features of the image to be detected output by the encoder and used as the input of the multi-head attention (Multi-Head Attention) mechanism of the target decoder. The attention mechanism combines the intermediate query features with the image features of the image to be detected to determine the decoding query feature corresponding to each intermediate query feature; the position coding features of the image to be detected may additionally be combined with the intermediate query features and the image features when determining the decoding query features. The decoding query feature set corresponding to the intermediate query feature set is then determined from these decoding query features. The decoding query feature set includes the decoding query feature corresponding to each intermediate query feature.

The self-attention mechanism has the capability of calculating the correlation among all the positions in parallel, and the target decoder uses the self-attention mechanism to determine the intermediate query feature set corresponding to the query feature set, so that the calculation efficiency of the target decoder can be improved; the attention mechanism can focus on key query features possibly with target information, so that the key query features are analyzed and processed, and the detection accuracy is improved. The embodiment of the application combines the self-attention mechanism and the attention mechanism in the target decoder, thereby not only improving the calculation efficiency of the target decoder, but also improving the accuracy of the target decoder for decoding each query feature, and further improving the detection efficiency and the detection accuracy of the target detection model.
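The two stages S61 and S62 can be sketched for one query feature set at a time. The sketch is deliberately simplified and rests on several assumptions: single-head scaled dot-product attention, no learned projection matrices, and no residual connections, normalization or feed-forward sublayers, all of which a practical decoder layer would normally include. The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def decode_group(query_set, image_features):
    """S61 then S62 for one query feature set (no learned projections)."""
    intermediate = attend(query_set, query_set, query_set)       # self-attention
    return attend(intermediate, image_features, image_features)  # attention over image features

def decode_all(query_sets, image_features):
    """Decode each query feature set independently of the others."""
    return [decode_group(qs, image_features) for qs in query_sets]

# Hypothetical image features (3 positions) and two query feature sets of 2 queries each.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_sets = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.2, 0.8]]]
decoded_sets = decode_all(query_sets, image_features)
```

Because self-attention runs inside a set only, the decoding query features of one set do not change when the contents of another set change, which is what allows the sets to be processed in parallel.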

Step 203: determining a target detection result corresponding to the image to be detected according to each decoding query feature output by the last-layer decoder in the target detection model.

In the target detection model, the decoding query features output by the last-layer decoder are obtained from the query features to be processed that are input into the last-layer decoder. These query features to be processed may be the high-quality query features remaining after calculation by the preceding decoder layers and screening by the high-quality feature condition, so the target classification information and position information they carry are more accurate, and the decoding query features obtained from them are correspondingly more accurate.

The decoding query features output by the last-layer decoder are input into two parallel feed-forward networks (Feed-Forward Network, FFN), where one FFN is used to determine the target classification information corresponding to the target indicated by a decoding query feature, and the other FFN is used to determine the position information corresponding to that target. As shown in fig. 5, which is a schematic diagram of a target detection result provided by the embodiment of the present application, the target detection result consists of target classification information and position information; the position information is displayed in the image in the form of a detection frame, and the target classification information "vehicle" is labeled above the detection frame in the form of text.
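The two parallel prediction heads can be sketched as follows. Everything concrete here is an assumption made for illustration: each head is reduced to a single linear layer, the classification head ends in a softmax, the position head outputs four sigmoid-normalized box values, and the weights are hand-picked rather than trained.

```python
import math

def linear(x, weights, bias):
    """Compute weights @ x + bias for one feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def predict(decoded_query, cls_head, box_head, labels):
    """Run the classification head and the box head on one decoding query feature."""
    probs = softmax(linear(decoded_query, *cls_head))
    box = [sigmoid(v) for v in linear(decoded_query, *box_head)]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best], box

# Hypothetical, hand-picked weights for a 2-dimensional decoding query feature.
labels = ["vehicle", "person", "background"]
cls_head = ([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
box_head = ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, -0.5]], [0.0, 0.0, 0.0, 0.0])
label, score, box = predict([1.5, 0.2], cls_head, box_head, labels)
```

The two heads read the same decoding query feature but share no weights, so classification and localization are predicted independently of each other.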

Therefore, according to the target detection results determined by the decoding query features output by the last layer of decoder, the classification and the position corresponding to each target in the image to be detected can be accurately described, the target detection of the image to be detected is realized, and the accuracy of the target detection results is improved.

The method comprises the steps of grouping all query features to be processed of a target decoder to obtain a plurality of query feature sets, and further enabling the target decoder to respectively calculate aiming at each query feature set by taking the grouping as a unit to obtain decoding query features corresponding to each query feature to be processed in each query feature set; by enabling the target decoder to calculate in units of groups, the number of query features processed by the target decoder at a time is reduced, and along with the reduction of the number of the processed query features, the calculation complexity of the target decoder is correspondingly reduced, the calculation efficiency is correspondingly improved, and the overall execution efficiency of the target detection task is also improved. In addition, the query feature to be processed of the target decoder can be obtained by screening from all decoding query features output by the decoder at the upper layer based on the high-quality feature condition, and at the moment, the query feature to be processed of the target decoder is the high-quality query feature, so that the target decoder carries out correlation calculation based on the query feature to be processed, and the decoding query feature with higher quality can be obtained, namely, the obtained decoding query feature can reflect the target detection result more accurately, the processing accuracy of the target decoder is improved, and the accuracy of the final target detection result is further improved; and the number of query features to be processed of the target decoder can be further reduced by screening the query features to be processed of the target decoder based on the high-quality feature condition, so that the calculation complexity of the target decoder is further simplified, and the calculation efficiency of the target decoder and the overall target detection efficiency are improved.
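The efficiency argument can be made concrete with a rough count. Self-attention scores every pair of queries it sees, so its cost grows with the square of the set size; under the assumption of equal-sized groups, the sketch below counts those pairs for the 300-query, 5-group example used earlier and for the 240 queries left after keeping 80%. The function name is hypothetical, and the count ignores all other decoder costs.

```python
def self_attention_pairs(num_queries, num_groups=1):
    """Count query-to-query pairs scored by self-attention under equal-size grouping."""
    group_size = num_queries // num_groups
    return num_groups * group_size * group_size

ungrouped = self_attention_pairs(300)                # 300 * 300
grouped = self_attention_pairs(300, num_groups=5)    # 5 * 60 * 60
screened = self_attention_pairs(240, num_groups=5)   # 5 * 48 * 48
```

In this count, grouping into G equal sets divides the number of scored pairs by G, and screening reduces it further.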

Generally, in the model training process, when a query-based target detection model is used for target detection, a certain number of query features are preset to match targets in a training image, so that target detection of the training image is realized. In the process of matching query features with targets, a hungarian matching algorithm is generally used globally.

The Hungary matching algorithm is a classical graph theory algorithm and can solve the problem of maximum matching of bipartite graphs. In the matching process, the query feature is taken as a node on one side, the real label (Ground Truth, GT) corresponding to the target in the training image is taken as a node on the other side, the nodes corresponding to each query feature (hereinafter referred to as query feature nodes) are matched with the nodes corresponding to each GT (hereinafter referred to as target nodes) in sequence, and the matching cost between each query feature node and each target node is calculated. And for each target node, determining the query feature node with the minimum matching cost as the query feature node matched with the target node, and uniformly setting the classification results of other unmatched query feature nodes as background types after determining the query feature nodes for all the target nodes.
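The outcome of the matching step can be illustrated on a tiny cost matrix. The sketch below does not implement the Hungarian algorithm itself; it finds the same minimum-cost one-to-one assignment by brute-force enumeration, which is only practical for very small inputs. The cost values and the function name are hypothetical, and the sketch assumes there are at least as many query features as targets.

```python
from itertools import permutations

def match_queries_to_targets(cost):
    """Brute-force one-to-one matching; cost[q][t] is the cost of query q for target t."""
    num_queries, num_targets = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Try every ordered choice of distinct queries, one per target.
    for chosen in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(chosen))
        if total < best_total:
            best_total, best_assignment = total, chosen
    matched = {q: t for t, q in enumerate(best_assignment)}
    # Unmatched queries are labeled as background.
    return [matched.get(q, "background") for q in range(num_queries)]

# Hypothetical matching costs: 4 query feature nodes (rows), 2 target nodes (columns).
cost = [
    [0.9, 0.8],
    [0.1, 0.7],
    [0.6, 0.2],
    [0.5, 0.9],
]
labels = match_queries_to_targets(cost)
```

Here two of the four queries are matched to targets and the other two fall into the background class, a small-scale picture of the imbalance that arises when queries far outnumber targets.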

Under the global matching strategy, matching between all query features and the GTs is completed in a single pass of the Hungarian matching algorithm, so the computational complexity grows quadratically with the number of query features. The complexity is too high and the amount of computation is large, which makes the matching process inefficient and model training slow. In addition, to ensure a high detection rate, a query-based target detection model uses far more query features than there are targets in the training image. Most query features are therefore matched to the background class and only a small part are matched to the foreground class, which leaves few positive samples among the training samples, unbalances the samples, makes the model harder to optimize, and slows model convergence.

In order to solve the above problems, the embodiment of the application provides a model training method. Referring to fig. 6, fig. 6 is a flowchart of a method for training a model according to an embodiment of the present application, where the method specifically includes the following steps:

Step 601: Determine the training image features corresponding to the training image.

Similar to the target detection method described in fig. 2, in the model training method provided by the embodiment of the application, after the training image is input into the target detection model, the training image features corresponding to the training image need to be determined first, so that the target detection model can perform target detection on the training image based on the training image features during training. The embodiment of the application can determine the training image features through the encoder in the target detection model.

Step 602: Determine, through a target decoder in the target detection model, the training decoding query feature corresponding to each training query feature in a plurality of training query feature sets according to the training query feature sets and the training image features.

The training query feature sets are obtained by grouping the training query features corresponding to the target decoder, and the target decoder is one layer of decoder in the target detection model. For the target decoder, all of its training query features are divided into a plurality of groups; each group is one training query feature set, and the training query features in that set are training query features of the target decoder.

In the target decoder, the relevant computation is performed with the training query feature set as the unit, and the training decoding query feature corresponding to each training query feature in the set is determined. In the target detection model, some of the decoders may be set as target decoders, so that those decoders compute in units of training query feature sets that each contain a smaller number of training query features; alternatively, all of the decoders may be set as target decoders, so that every layer computes in units of such training query feature sets.
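The grouping described above can be sketched as follows. This is a minimal illustration, assuming contiguous groups of near-equal size; the function name `group_queries` and the use of integers as stand-ins for query features are hypothetical, and the source does not fix a particular grouping rule at this point.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_queries(queries: Sequence[T], num_groups: int) -> List[List[T]]:
    """Split queries into num_groups contiguous sets of near-equal size."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    base, extra = divmod(len(queries), num_groups)
    groups: List[List[T]] = []
    start = 0
    for g in range(num_groups):
        # The first `extra` groups take one additional query each.
        size = base + (1 if g < extra else 0)
        groups.append(list(queries[start:start + size]))
        start += size
    return groups


# Ten stand-in query features split into three training query feature sets.
query_sets = group_queries(list(range(10)), 3)
```

Each returned list plays the role of one training query feature set that the target decoder would process as a unit.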

In the training process, the target detection model may include multiple layers of decoders, and any one or more of these layers may be used as the target decoder. When the target decoder is the first-layer decoder in the target detection model, each training query feature in the corresponding training query feature sets may be determined according to the initial training query features, where the initial training query features are the initial query features preset in the target detection model.

When the target decoder is not the first-layer decoder in the target detection model, each training query feature in the corresponding training query feature sets is screened, based on the high-quality feature condition, from the training decoding query features output by the decoder in the layer preceding the target decoder.

Specifically, step 602 may be implemented in any of the ways described for step 202 above.

Step 603: For each training query feature set, determine the loss value corresponding to the training query feature set according to the training decoding query feature corresponding to each training query feature in the set and the label corresponding to the training image.

After the training decoding query features corresponding to the training query features in a training query feature set are determined, label matching is performed with the training query feature set as the unit. Taking one training query feature set as an example, each training decoding query feature corresponding to the set is matched with the label GT of the training image to obtain the matching relation between the training decoding query features and the GT, and the loss value corresponding to the training query feature set is calculated according to that matching relation.

In the matching process, the embodiment of the application can determine the matching relation between the training decoding query feature and the GT by calculating the matching ratio of the training decoding query feature and the GT, and when the matching ratio between the training decoding query feature and the GT is large, the matching degree of the training decoding query feature and the GT is higher, and the matching possibility of the training decoding query feature and the GT is higher.

In the embodiment of the application, the GT is the reference data used for training and evaluating the target detection model, and comprises the position information and category information of each detection target in a training image. During model training, the GT helps the target detection model learn how to recognize the targets in the training image and predict their positions, and the target detection model tries to generate detection results that are as close as possible to the GT.

Specifically, step 603 may be implemented as follows:

S71: Determine the predicted target detection result corresponding to each training decoding query feature according to the training decoding query features corresponding to the training query features in the training query feature set.

Taking the training query feature set as the unit, the predicted target detection result corresponding to each training decoding query feature is determined from the training decoding query features corresponding to the set. The predicted target detection result comprises the classification result and the position information of the detection target indicated by the training decoding query feature.

For the training query feature set, the training query features in the set are in one-to-one correspondence with the training decoding query features corresponding to the set, and in one-to-one correspondence with the predicted target detection results corresponding to the set, that is, for the same training query feature set, a one-to-one correspondence exists among the training query features, the training decoding query features and the predicted target detection results.

S72: Using the Hungarian matching algorithm, determine the optimal matching combination corresponding to the training query feature set according to each predicted target detection result and the label.

The Hungarian algorithm is a combinatorial optimization algorithm that solves the task assignment problem in polynomial time. In the embodiment of the application, the matching between the predicted target detection results and the GT can be realized through the Hungarian algorithm.

When the Hungarian algorithm is used, the matching cost between each predicted target detection result corresponding to the training query feature set and the GT is calculated with the training query feature set as the unit. Because a training query feature set contains relatively few training query features, and its training decoding query features correspond one-to-one to those training query features, the number of training decoding query features, and hence of predicted target detection results, is also small. Computing the matching cost therefore takes less work, which reduces the computation in the matching process and improves computational efficiency and model training efficiency.

According to the matching cost between the prediction target detection result and the GT, the optimal matching combination corresponding to the training query feature set can be determined. The optimal matching combination is used for indicating an optimal matching mode between each prediction target detection result and the tag GT. In the embodiment of the application, the optimal matching combination is the matching combination with the minimum matching cost, and each GT has and only has one corresponding prediction target detection result.

When the number of GTs is smaller than the number of training decoding query features, the predicted target detection results that are not matched with any GT are regarded as the detection results that the target detection model produces for the background of the training image.
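Per-set label matching as described in S72 can be illustrated with a small example. The sketch below is not the Hungarian algorithm itself: it finds the same minimum-cost one-to-one assignment by brute force, which is only practical for tiny inputs, and a real implementation would use a polynomial-time solver (for example SciPy's `linear_sum_assignment`). The function name `match_set` and the cost values are hypothetical.

```python
from itertools import permutations
from typing import Dict, Sequence


def match_set(cost: Sequence[Sequence[float]]) -> Dict[int, int]:
    """Return {gt_index: prediction_index} with minimum total matching cost.

    cost[p][g] is the matching cost between prediction p and label g.
    Brute-force stand-in for the Hungarian algorithm; tiny inputs only.
    """
    num_preds = len(cost)
    num_gts = len(cost[0]) if num_preds else 0
    if num_gts > num_preds:
        raise ValueError("sketch assumes at least as many predictions as labels")
    best_total, best = float("inf"), ()
    # Try every way of giving each label its own distinct prediction.
    for chosen in permutations(range(num_preds), num_gts):
        total = sum(cost[p][g] for g, p in enumerate(chosen))
        if total < best_total:
            best_total, best = total, chosen
    return {g: p for g, p in enumerate(best)}


# One training query feature set: three predictions, two GT labels.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
matching = match_set(cost)
# Predictions left without a label are treated as background.
background = [p for p in range(len(cost)) if p not in matching.values()]
```

Here each GT receives exactly one prediction, and the remaining prediction falls into the background class, mirroring the rule stated above.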

S73: Determine the loss value corresponding to the training query feature set based on the optimal matching combination.

After the optimal matching combination corresponding to the training query feature set is determined, the matching loss value between each prediction target detection result and the matched label in the optimal matching combination can be determined, and the loss value corresponding to the training query feature set is calculated according to the matching loss values between all the prediction target detection results and the matched labels in the optimal matching combination. The embodiment of the application can specifically determine the sum of the matching loss values corresponding to the training query feature set as the loss value corresponding to the training query feature set.
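The summation in S73 can be sketched as below. The per-pair loss used here (an L1 distance between box coordinates) is purely an assumption for illustration, as are the names `set_loss` and `l1_box_loss`; the source does not specify the form of the matching loss value.

```python
from typing import Callable, Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]


def l1_box_loss(pred_box: Box, gt_box: Box) -> float:
    """Hypothetical per-pair matching loss: L1 distance between boxes."""
    return sum(abs(a - b) for a, b in zip(pred_box, gt_box))


def set_loss(
    preds: Sequence[Box],
    labels: Sequence[Box],
    matching: Dict[int, int],
    pair_loss: Callable[[Box, Box], float],
) -> float:
    """Sum the matching loss over every (label, prediction) pair in the matching."""
    return sum(pair_loss(preds[p], labels[g]) for g, p in matching.items())


preds = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
labels = [(1.0, 1.0, 3.0, 4.0), (0.0, 0.0, 2.0, 2.0)]
matching = {0: 1, 1: 0}  # label index -> prediction index, e.g. from S72
loss = set_loss(preds, labels, matching, l1_box_loss)
```

A classification term for matched and background predictions would normally be added as well; it is omitted here to keep the sketch focused on the summation.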

For a given training image, the number of predicted target detection results corresponding to the training image is fixed, so the number of predicted target detection results that can be matched with the GT is also fixed.

Further, the model training method provided by the embodiment of the application can further comprise the following steps:

Determining the corresponding proportion between the label and the predicted target detection results according to the current training state of the target detection model.

At this time, S72 may be implemented as follows:

Using the Hungarian matching algorithm, determining the optimal matching combination corresponding to the training query feature set according to the corresponding proportion, each predicted target detection result, and the label.

In the training process, the current training state of the target detection model can be determined through indicators such as model performance. Specifically, in the initial stage of model training, the model performance of the target detection model is poor and the targets in a training image are difficult to detect accurately. At this point, one-to-one label matching on a training query feature set is inefficient, the model converges slowly, and model training efficiency suffers.

To speed up model convergence in the early stage of training, the corresponding proportion between the label and the predicted target detection results can be raised from one-to-one to one-to-many, so that several predicted target detection results corresponding to the training query feature set can correspond to the same label at the same time.

As model training proceeds, the performance of the target detection model gradually improves, and the corresponding proportion between the label and the predicted target detection results can be gradually reduced, for example from one-to-N to one-to-M, where N is greater than M. That is, as model performance is optimized, the corresponding proportion is gradually reduced until it becomes one-to-one. Reducing the proportion in step with model performance allows the model to be optimized more finely, which better improves the model training effect.
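One way to picture the decreasing proportion is sketched below. The source ties the proportion to the current training state (for example model performance) without giving a formula, so the linear schedule over training steps in `ratio_for_step` is an assumption, as is realizing one-to-many matching by repeating each label in `expand_labels` before a one-to-one matcher is run.

```python
from typing import List, Sequence, TypeVar

LabelT = TypeVar("LabelT")


def ratio_for_step(step: int, total_steps: int, start_ratio: int) -> int:
    """Hypothetical schedule: decay predictions-per-label linearly to 1."""
    if total_steps <= 1:
        return 1
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return max(1, round(start_ratio - progress * (start_ratio - 1)))


def expand_labels(labels: Sequence[LabelT], ratio: int) -> List[LabelT]:
    """Repeat each label `ratio` times so that a one-to-one matcher
    assigns `ratio` predictions to every original label."""
    return [label for label in labels for _ in range(ratio)]


# Proportion over a 10-step toy run, starting at one-to-4.
schedule = [ratio_for_step(s, 10, 4) for s in range(10)]
early_targets = expand_labels(["cat", "dog"], schedule[0])
late_targets = expand_labels(["cat", "dog"], schedule[-1])
```

A performance-driven variant would replace the step counter with a validation metric; the shape of the schedule is a design choice that the source leaves open.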

Step 604: Determine the loss value corresponding to the target decoder according to the loss value corresponding to each of the training query feature sets.

After the loss values corresponding to the training query feature sets are determined, the loss values corresponding to the target decoder in the training process are determined according to the loss values of all the training query feature sets. Specifically, the embodiment of the application can determine the sum of the loss values corresponding to the training query feature sets as the loss value of the target decoder.

It should be understood that, in the embodiment of the present application, other calculation manners may be used to determine the loss value corresponding to the target decoder, for example, a weighted average manner, which is not limited in particular by the embodiment of the present application.
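The two aggregation options mentioned for step 604, a plain sum and a weighted average, can be sketched as follows. The function name `decoder_loss` and the example weights are hypothetical.

```python
from typing import Optional, Sequence


def decoder_loss(
    set_losses: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine per-set loss values into one loss for the target decoder.

    With no weights the set losses are summed; with weights a weighted
    average is returned instead.
    """
    if weights is None:
        return float(sum(set_losses))
    if len(weights) != len(set_losses):
        raise ValueError("need exactly one weight per training query feature set")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * l for w, l in zip(weights, set_losses)) / total_weight


set_losses = [1.0, 2.0, 3.0]
summed = decoder_loss(set_losses)
averaged = decoder_loss(set_losses, weights=[1.0, 1.0, 2.0])
```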

Step 605: Train the target detection model based on the loss values corresponding to the decoders of each layer in the target detection model.

After the single training is finished, adjusting each parameter of the target detection model according to the loss value corresponding to each layer of decoder in the target detection model, and continuing training the target detection model based on the adjusted parameter until the target detection model reaches the training termination condition.

According to the model training method provided by the embodiment of the application, all training query features of the target decoder are grouped to obtain a plurality of training query feature sets, so that the target decoder computes on each training query feature set with the group as the unit and obtains the training decoding query feature corresponding to each training query feature in each set. Because the target decoder computes group by group, the number of training query features it processes at a time during training is reduced; as that number falls, the computational complexity of the target decoder falls, its computational efficiency rises, and model training efficiency improves as well. Label matching is likewise performed group by group on the training decoding query features corresponding to each training query feature set, which reduces the amount and complexity of computation in label matching and improves computational efficiency and model training efficiency. In addition, because label matching is carried out with the training query feature set as the unit, fewer training decoding query features take part in each matching pass. For a fixed number of labels, this raises the rate at which training decoding query features are matched to labels, which increases the proportion of positive samples among the training samples, speeds up model convergence, and improves model training efficiency.
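The effect summarized above can be made concrete with simple counting. The sketch below assumes that the queries divide evenly into groups, that each group contains at least as many queries as there are labels, and that self-attention within a group touches every pair of queries in that group; the function name `grouped_stats` and the numbers are illustrative only.

```python
from typing import Dict


def grouped_stats(num_queries: int, num_labels: int, num_groups: int) -> Dict[str, float]:
    """Count per-call workload and positive samples under grouped matching."""
    if num_queries % num_groups != 0:
        raise ValueError("sketch assumes queries divide evenly into groups")
    group_size = num_queries // num_groups
    # Every group is matched against all labels, so each group contributes
    # up to num_labels positive samples.
    positives = num_groups * min(group_size, num_labels)
    return {
        "queries_per_call": group_size,
        "self_attention_pairs": num_groups * group_size ** 2,
        "positives": positives,
        "positive_fraction": positives / num_queries,
    }


global_stats = grouped_stats(num_queries=900, num_labels=10, num_groups=1)
grouped = grouped_stats(num_queries=900, num_labels=10, num_groups=3)
```

With three groups, the pairwise self-attention count drops from 810000 to 270000 and the number of positive samples rises from 10 to 30 out of 900 queries in this toy setting.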

Referring to fig. 7, fig. 7 is an algorithm logic diagram of a target detection model according to an embodiment of the present application. It should be understood that fig. 7 takes the case where each layer of decoder in the target detection model is a target decoder as an example.

Before introducing the algorithm logic shown in fig. 7, the correspondence between each parameter in the model application process and the model training process is introduced, where the query feature to be processed corresponds to a training query feature, the query feature set corresponds to a training query feature set, the decoded query feature corresponds to a training decoded query feature, and the decoded query feature set corresponds to a feature set composed of each training decoded query feature corresponding to the training query feature set.

First, initial query features are set in the target detection model; when the target decoder is the first-layer decoder in the target detection model, the initial query features are determined to be the query features to be processed by the target decoder. The query features to be processed by the target decoder are then grouped according to the preset number of groups to obtain a plurality of query feature sets, which are input into the target decoder. The target decoder performs the relevant computation with the query feature set as the unit and, once the computation is complete, outputs the decoding query feature set corresponding to each query feature set.

During target detection with the target detection model, after the decoding query feature sets output by the target decoder are obtained, it is judged whether the next-layer decoder after the target decoder is the last-layer decoder in the target detection model. If so, all decoding query feature sets are merged and input directly into the last-layer decoder, which processes all input query features to obtain the decoding query features it outputs, and the target detection result corresponding to the image to be detected is determined from those decoding query features. If not, the decoding query features whose classification scores rank in the top 80% of each decoding query feature set are screened out and taken as the query features to be processed by the next-layer target decoder; these are then grouped into a plurality of query feature sets, so that the next-layer target decoder performs decoding in units of query feature sets.
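The inference-time control flow described above can be sketched as follows. Everything here is a simplified stand-in: a query feature is reduced to an (identifier, classification score) pair, a decoder layer is any callable that maps a list of such pairs to a list of the same shape, groups are formed round-robin, and the 80% condition is read as keeping the top 80% of each set by score. The names `split`, `keep_top`, and `run_decoders` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Query = Tuple[str, float]  # (identifier, classification score) stand-in
Layer = Callable[[List[Query]], List[Query]]


def split(queries: Sequence[Query], num_groups: int) -> List[List[Query]]:
    """Deal queries into num_groups sets round-robin."""
    return [list(queries[g::num_groups]) for g in range(num_groups)]


def keep_top(decoded: Sequence[Query], fraction: float) -> List[Query]:
    """Keep the highest-scoring fraction of one decoded query feature set."""
    keep = max(1, int(len(decoded) * fraction)) if decoded else 0
    return sorted(decoded, key=lambda q: q[1], reverse=True)[:keep]


def run_decoders(
    initial: Sequence[Query],
    layers: Sequence[Layer],
    num_groups: int,
    fraction: float = 0.8,
) -> List[Query]:
    pending = list(initial)
    last = len(layers) - 1
    for index, layer in enumerate(layers):
        if index == last:
            # The last layer processes all merged queries at once.
            return layer(pending)
        decoded_sets = [layer(group) for group in split(pending, num_groups)]
        if index + 1 == last:
            # Next layer is the last one: merge every set without screening.
            pending = [q for s in decoded_sets for q in s]
        else:
            # Otherwise screen each set before regrouping for the next layer.
            pending = [q for s in decoded_sets for q in keep_top(s, fraction)]
    return pending


queries = [(f"q{i}", i / 10) for i in range(10)]
identity: Layer = lambda group: list(group)
result = run_decoders(queries, [identity, identity, identity], num_groups=2)
```

In this toy run the first layer drops the lowest-scoring query from each of the two sets, the second layer passes its outputs on unscreened, and the last layer sees the eight surviving queries together.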

In the training process of the target detection model, after the decoding query feature set output by the target decoder is obtained, performing tag matching by taking the decoding query feature set as a unit, thereby determining an optimal matching combination corresponding to the decoding query feature set, sequentially calculating a loss value corresponding to the decoding query feature set and a loss value corresponding to the target decoder based on the optimal matching combination, and finally training the target detection model according to the loss values corresponding to the decoders of all layers.

The application also provides a corresponding target detection device aiming at the target detection method, so that the target detection method can be practically applied and realized.

Referring to fig. 8, fig. 8 is a schematic structural diagram of a target detection apparatus 800 corresponding to the target detection method shown in fig. 2 above, the apparatus comprising:

The encoding module 801 is configured to determine an image feature corresponding to an image to be detected;

A decoding module 802, configured to determine, by using a target decoder in a target detection model, decoding query features corresponding to each query feature in a plurality of query feature sets according to the plurality of query feature sets and the image features; the query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; the query characteristics to be processed are determined according to the initial query characteristics or are obtained by screening from decoding query characteristics output by a decoder of the upper layer of the target decoder based on high-quality characteristic conditions, wherein the target decoder is one layer of decoder in the target detection model;

and the detection module 803 is configured to determine a target detection result corresponding to the image to be detected according to each decoding query feature output by a last layer of decoder in the target detection model.

In one possible implementation, when the target decoder is a non-first layer decoder in the target detection model, the target detection apparatus 800 further includes a determining module configured to:

Determining target detection results corresponding to each decoding query feature according to each decoding query feature output by the upper layer decoder; the target detection result comprises a classification score corresponding to a detection target indicated by the decoding query characteristic;

And selecting the decoding query features meeting the high-quality feature conditions from the decoding query features according to the classification scores in the target detection results corresponding to the decoding query features, and taking the decoding query features as query features to be processed corresponding to the target decoder.

In a possible implementation manner, the input of the previous layer decoder includes a plurality of query feature sets obtained by grouping each query feature to be processed corresponding to the previous layer decoder; the output of the upper layer decoder comprises decoding query feature sets corresponding to the query feature sets respectively;

The determining module is specifically configured to:

In one possible implementation, when the target decoder is a first layer decoder in the target detection model, the target detection apparatus 800 further includes a determining module configured to:

taking each initial query feature as each query feature to be processed corresponding to the target decoder;

Or determining the feature quality corresponding to each initial query feature; and selecting initial query features meeting preset quality conditions from the initial query features according to the feature quality corresponding to the initial query features, and taking the initial query features as query features to be processed corresponding to the target decoder.

In one possible implementation, the determining module is specifically configured to:

Determining the feature quality of each to-be-processed query feature corresponding to the target decoder;

and distributing each query feature to be processed to the multiple query feature sets according to the feature quality corresponding to each query feature to be processed and the preset grouping number;

Or determining the distribution of each query feature to be processed corresponding to the target decoder in a feature space;

and distributing each query feature to be processed to the multiple query feature sets according to the distribution of each query feature to be processed in the feature space and the preset grouping number.

In one possible implementation, the encoding module 801 is specifically configured to:

In one possible implementation, the decoding module 802 is specifically configured to:

Determining an intermediate query feature set corresponding to each query feature set according to each query feature set by adopting a self-attention mechanism through the target decoder; the intermediate query feature set comprises intermediate query features corresponding to each query feature in the corresponding query feature set;

Determining, by the target decoder, a decoded query feature set corresponding to each intermediate query feature set according to each intermediate query feature set and the image feature by adopting an attention mechanism; the decoded query feature set includes decoded query features corresponding to each intermediate query feature in the corresponding intermediate query feature set.
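The two-stage computation performed by the decoding module can be sketched as below: self-attention restricted to one query feature set, followed by attention between the resulting intermediate query features and the image features. This is a bare illustration with plain lists, with no learned projections, multi-head structure, or feed-forward layers, and with keys equal to values; the second stage is written as cross-attention, which is an assumption about the exact form. The names `attend`, `decode_set`, and `decode_grouped` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def attend(queries: Sequence[Vector], keys_values: Sequence[Vector]) -> List[Vector]:
    """Scaled dot-product attention with keys equal to values."""
    dim = len(queries[0])
    outputs: List[Vector] = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) / norm for d in range(dim)]
        )
    return outputs


def decode_set(query_set: Sequence[Vector], image_features: Sequence[Vector]) -> List[Vector]:
    # Stage 1: self-attention only among queries of the same set.
    intermediate = attend(query_set, query_set)
    # Stage 2: each intermediate query attends to the image features.
    return attend(intermediate, image_features)


def decode_grouped(
    query_sets: Sequence[Sequence[Vector]], image_features: Sequence[Vector]
) -> List[List[Vector]]:
    """Process every non-empty query feature set independently."""
    return [decode_set(s, image_features) for s in query_sets]


query_sets = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
image_features = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
decoded_sets = decode_grouped(query_sets, image_features)
```

Because queries in different sets never attend to one another in stage 1, the pairwise cost of self-attention scales with the size of a set rather than with the total number of queries.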

In one possible implementation, each layer decoder in the target detection model is the target decoder;

Or the first n layers of decoders in the target detection model are the target decoders; and n is an integer greater than or equal to 1 and less than the total layer number of the decoder in the target detection model.

According to the target detection device provided by the embodiment of the application, all query features to be processed by the target decoder are grouped to obtain a plurality of query feature sets, so that the target decoder computes, with the group as the unit, on each query feature set and the image features corresponding to the image to be detected, and obtains the decoding query feature corresponding to each query feature in each set. Because the target decoder computes group by group, the number of query features it processes at a time is reduced; as that number falls, the computational complexity of the target decoder falls and its computational efficiency rises, which also improves the overall execution efficiency of the target detection task. In addition, the query features to be processed by the target decoder can be screened, based on the high-quality feature condition, from the decoding query features output by the decoder in the preceding layer. In that case the query features to be processed are high-quality query features, so the relevant computation performed by the target decoder yields decoding query features of high quality, that is, decoding query features that reflect the target detection result more accurately. This improves the processing accuracy of the target decoder and, in turn, the accuracy of the target detection result determined from the decoding query features output by the last-layer decoder in the target detection model. Screening based on the high-quality feature condition also further reduces the number of query features to be processed by the target decoder, which further lowers its computational complexity and improves both its computational efficiency and the overall target detection efficiency.

The application also provides a corresponding model training device aiming at the model training method, so that the model training method can be practically applied and realized.

Referring to fig. 9, fig. 9 is a schematic structural diagram of a model training apparatus 900 corresponding to the model training method shown in fig. 6 above, the apparatus comprising:

The encoding module 901 is configured to determine training image features corresponding to the training image;

The decoding module 902 is configured to determine, by using a target decoder in a target detection model, training decoding query features corresponding to each training query feature in a plurality of training query feature sets according to the plurality of training query feature sets and the training image features; the training query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; the query characteristics to be processed are determined according to the training initial query characteristics, or are obtained by screening from the training decoding query characteristics output by the decoder of the upper layer of the target decoder based on the high-quality characteristic conditions; the target decoder is a layer of decoder in the target detection model;

the loss construction module 903 is configured to determine, for each training query feature set, a loss value corresponding to the training query feature set according to a training decoding query feature corresponding to each training query feature and a label corresponding to the training image;

The loss construction module 903 is further configured to determine a loss value corresponding to the target decoder according to the loss values corresponding to the training query feature sets;

and the training module 904 is configured to train the target detection model based on the loss values corresponding to the decoders of each layer in the target detection model.

In one possible implementation, the loss construction module 903 is specifically configured to:

According to the training decoding query characteristics corresponding to the training query characteristics in the training query characteristic set, determining the prediction target detection results corresponding to the training decoding query characteristics;

Adopting a Hungarian matching algorithm, and determining an optimal matching combination corresponding to the training query feature set according to each predicted target detection result and the label; the optimal matching combination is used for indicating an optimal matching mode between each predicted target detection result and the label;

And determining a loss value corresponding to the training query feature set based on the optimal matching combination.

In a possible implementation manner, the apparatus further includes a proportion determining unit, configured to: determining the corresponding proportion between the label and the predicted target detection result according to the current training state of the target detection model;

The loss construction module 903 is also configured to: determine, using the Hungarian matching algorithm, the optimal matching combination corresponding to the training query feature set according to the corresponding proportion, each predicted target detection result, and the label.

According to the model training device provided by the embodiment of the application, all training query features of the target decoder are grouped to obtain a plurality of training query feature sets, so that the target decoder computes on each training query feature set with the group as the unit and obtains the training decoding query feature corresponding to each training query feature in each set. Because the target decoder computes group by group, the number of training query features it processes at a time during training is reduced; as that number falls, the computational complexity of the target decoder falls, its computational efficiency rises, and model training efficiency improves as well. Label matching is likewise performed group by group on the training decoding query features corresponding to each training query feature set, which reduces the amount and complexity of computation in label matching and improves computational efficiency and model training efficiency. In addition, because label matching is carried out with the training query feature set as the unit, fewer training decoding query features take part in each matching pass. For a fixed number of labels, this raises the rate at which training decoding query features are matched to labels, which increases the proportion of positive samples among the training samples, speeds up model convergence, and improves model training efficiency.

The embodiment of the application also provides a computer device, which may be a terminal device or a server. The terminal device and the server provided by the embodiment of the application are described below from the perspective of hardware implementation.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 10, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes a computer as an example of the terminal:

Fig. 10 is a block diagram showing a part of the structure of a computer related to a terminal provided by an embodiment of the present application. Referring to fig. 10, the computer includes: radio frequency (RF) circuitry 1210, memory 1220, input unit 1230 (including touch panel 1231 and other input devices 1232), display unit 1240 (including display panel 1241), sensors 1250, audio circuitry 1260 (to which speaker 1261 and microphone 1262 are connected), wireless fidelity (WiFi) module 1270, processor 1280, and power supply 1290. Those skilled in the art will appreciate that the computer architecture shown in fig. 10 is not limiting and that more or fewer components than shown may be included, or that certain components may be combined, or that different arrangements of components may be utilized.

Memory 1220 may be used to store software programs and modules, and processor 1280 performs the various functional applications and data processing of the computer by running the software programs and modules stored in memory 1220. The memory 1220 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer (such as audio data, phonebooks, etc.), and the like. In addition, memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

Processor 1280 is the control center of the computer. It connects the various parts of the entire computer through various interfaces and lines, and performs the various functions of the computer and processes data by running or executing the software programs and/or modules stored in memory 1220 and invoking the data stored in memory 1220. Optionally, processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor and a modem processor, wherein the application processor primarily handles the operating system, user interfaces, application programs, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may also not be integrated into the processor 1280.

In an embodiment of the application, the processor 1280 included in the terminal is configured to perform the steps in the method described in the foregoing embodiments.

Referring to fig. 11, fig. 11 is a schematic structural diagram of a server 1300 according to an embodiment of the present application. The server 1300 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1322 (e.g., one or more processors), memory 1332, and one or more storage media 1330 (e.g., one or more mass storage devices) that store applications 1342 or data 1344. The memory 1332 and the storage medium 1330 may be transitory or persistent storage. The program stored on the storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1322 may be configured to communicate with the storage medium 1330 and to execute, on the server 1300, the series of instruction operations in the storage medium 1330.

The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.

The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 11.

The CPU 1322 is configured to perform the steps in the methods described in the foregoing embodiments.

The embodiments of the present application also provide a computer-readable storage medium storing a computer program for executing the steps of the method described in the foregoing embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the methods described in the foregoing embodiments.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, the software product including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium in which a computer program can be stored.

It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or a similar expression means any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.

The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A method of target detection, the method comprising:

Determining image characteristics corresponding to an image to be detected;

Determining decoding query features corresponding to each query feature in a plurality of query feature sets according to the plurality of query feature sets and the image features by a target decoder in a target detection model; the query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; if the target decoder is a first layer decoder in the target detection model, determining each query feature to be processed corresponding to the target decoder according to the initial query feature; if the target decoder is a non-first layer decoder in the target detection model, screening each query feature to be processed corresponding to the target decoder from each decoding query feature output by a decoder of a previous layer of the target decoder based on a high-quality feature condition, wherein the target decoder is one layer of decoders in the multi-layer decoder in the target detection model;

Determining a target detection result corresponding to the image to be detected according to each decoding query feature output by a last layer of decoder in the target detection model;

if the target decoder is a non-first layer decoder in the target detection model, the obtaining method of the query feature set includes:

Determining the feature quality of each to-be-processed query feature corresponding to the target decoder; the feature quality corresponding to the query feature to be processed is determined according to the classification score in the target detection result corresponding to the query feature to be processed; the target detection result corresponding to the query feature to be processed is a target detection result corresponding to the decoding query feature output by the previous layer decoder, and the target detection result comprises a classification score corresponding to the detection target indicated by the decoding query feature;

2. The method of claim 1, wherein when the target decoder is a non-first layer decoder in the target detection model, each of the pending query features corresponding to the target decoder is determined by:

determining target detection results corresponding to each decoding query feature according to each decoding query feature output by the previous layer decoder;

3. The method according to claim 2, wherein the input of the previous layer decoder includes a plurality of query feature sets obtained by grouping each query feature to be processed corresponding to the previous layer decoder; the output of the previous layer decoder comprises decoding query feature sets corresponding to the query feature sets respectively;

selecting the decoding query feature satisfying the high-quality feature condition from the decoding query features according to the classification scores in the target detection results corresponding to the decoding query features, as the query feature to be processed corresponding to the target decoder, including:

4. The method of claim 1, wherein when the target decoder is a first layer decoder in the target detection model, each of the pending query features corresponding to the target decoder is determined by:

5. The method according to any one of claims 1 to 4, wherein the obtaining the query feature set further comprises:

The allocating each query feature to be processed to the multiple query feature sets according to the feature quality corresponding to each query feature to be processed according to the preset grouping number, including:

and distributing each query feature to be processed to the multiple query feature sets according to the distribution of each query feature to be processed in a feature space and the feature quality corresponding to each query feature to be processed according to the preset grouping number.

6. The method according to any one of claims 1 to 4, wherein determining, by a target decoder in a target detection model, decoded query features corresponding to respective query features in a plurality of query feature sets from the plurality of query feature sets and the image features, comprises:

7. The method according to any one of claims 1 to 4, wherein each layer decoder in the target detection model is the target decoder;

8. A method of model training, the method comprising:

Determining training decoding query features corresponding to each training query feature in a plurality of training query feature sets according to the plurality of training query feature sets and the training image features through a target decoder in a target detection model; the training query feature sets are obtained by grouping the training query features corresponding to the target decoder; if the target decoder is a first layer decoder in the target detection model, determining each training query feature corresponding to the target decoder according to the training initial query feature; if the target decoder is a non-first layer decoder in the target detection model, screening all training query features corresponding to the target decoder from all training decoding query features output by a previous layer decoder of the target decoder based on high-quality feature conditions; the target decoder is one layer of decoders in the multi-layer decoder in the target detection model;

Training the target detection model based on loss values corresponding to decoders of each layer in the target detection model;

If the target decoder is a non-first layer decoder in the target detection model, the obtaining manner of the training query feature set includes:

Determining the feature quality corresponding to each training query feature corresponding to the target decoder; the feature quality corresponding to the training query features is determined according to the classification score in the predicted target detection result corresponding to the training query features; the predicted target detection result corresponding to the training query feature is a target detection result corresponding to the training decoding query feature output by the previous layer decoder, and the target detection result comprises a classification score corresponding to the detection target indicated by the training decoding query feature;

and distributing the training query features to the training query feature sets according to the feature quality corresponding to the training query features according to the preset grouping number.

9. The method of claim 8, wherein determining the loss value for the set of training query features based on the training decoded query features for which each training query feature corresponds and the label for which the training image corresponds comprises:

10. The method according to claim 9, wherein the method further comprises:

determining the corresponding proportion between the label and the predicted target detection result according to the current training state of the target detection model;

the determining, by using a Hungarian matching algorithm, an optimal matching combination corresponding to the training query feature set according to each predicted target detection result and the label includes:

11. A target detection device, the device comprising:

The decoding module is used for determining decoding query features corresponding to each query feature in the query feature sets according to the query feature sets and the image features by a target decoder in the target detection model; the query feature sets are obtained by grouping the query features to be processed corresponding to the target decoder; if the target decoder is a first layer decoder in the target detection model, determining each query feature to be processed corresponding to the target decoder according to the initial query feature; if the target decoder is a non-first layer decoder in the target detection model, screening each query feature to be processed corresponding to the target decoder from each decoding query feature output by a decoder of a previous layer of the target decoder based on a high-quality feature condition, wherein the target decoder is one layer of decoders in the multi-layer decoder in the target detection model;

the detection module is used for determining a target detection result corresponding to the image to be detected according to each decoding query characteristic output by a last layer of decoder in the target detection model;

12. A model training apparatus, the apparatus comprising:

The decoding module is used for determining training decoding query features corresponding to each training query feature in the training query feature sets according to the training query feature sets and the training image features through a target decoder in the target detection model; the training query feature sets are obtained by grouping the training query features corresponding to the target decoder; if the target decoder is a first layer decoder in the target detection model, determining each training query feature corresponding to the target decoder according to the training initial query feature; if the target decoder is a non-first layer decoder in the target detection model, screening all training query features corresponding to the target decoder from all training decoding query features output by a previous layer decoder of the target decoder based on high-quality feature conditions; the target decoder is one layer of decoders in the multi-layer decoder in the target detection model;

The training module is used for training the target detection model based on the loss values corresponding to the decoders of each layer in the target detection model;

13. A computer device, the device comprising a processor and a memory;

The memory is used for storing a computer program;

The processor is configured to execute, according to the computer program, the target detection method according to any one of claims 1 to 7 or the model training method according to any one of claims 8 to 10.

14. A computer-readable storage medium for storing a computer program which, when executed by an electronic device, implements the target detection method of any one of claims 1 to 7 or the model training method of any one of claims 8 to 10.

15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the target detection method of any one of claims 1 to 7 or the model training method of any one of claims 8 to 10.