CN117115835A - Training method, device, equipment and storage medium of target detection model


Info

Publication number
CN117115835A
Authority
CN
China
Prior art keywords
text
model
image
target
initial
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311233461.9A
Other languages
Chinese (zh)
Inventor
杨志雄
杨延展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202311233461.9A
Publication of CN117115835A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure relate to a training method, device, equipment and storage medium for a target detection model. The training method of the target detection model comprises the following steps: acquiring paired image samples and text samples for target detection; inputting the image sample and the text sample into a frozen multi-modal sub-model to generate initial image features and initial text features; based on the initial image features and the initial text features, generating a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through a detection head of the target detection sub-model; and performing model iterative training based on the target object detection frame, the target classification result and the target detection truth value corresponding to the image sample until a model convergence condition is reached. According to the embodiments of the disclosure, the knowledge in the multi-modal sub-model can be kept from being destroyed and the model kept from over-fitting while the detection head is trained, so that the speed and efficiency of model training for open vocabulary target detection are improved.

Description

Training method, device, equipment and storage medium of target detection model
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a training method, device, equipment and storage medium for a target detection model.
Background
With the development of artificial intelligence technology, a need has arisen for open vocabulary object detection (Open-Vocabulary Object Detection) on images. Open vocabulary object detection may be understood as enabling a model to detect, from images, objects of new classes that do not exist in the training set.
In the related art, a large amount of high-quality training data is required to train a preset model end to end so that the model can adapt to open vocabulary application scenes. However, this training approach is inefficient and detects objects of unknown classes poorly.
Disclosure of Invention
In order to solve the technical problems, embodiments of the present disclosure provide a training method, device, equipment and storage medium for a target detection model.
In a first aspect, an embodiment of the present disclosure provides a training method of a target detection model, including:
acquiring paired image samples and text samples for target detection; wherein the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample;
inputting the image sample and the text sample into a frozen multi-modal sub-model to generate initial image features and initial text features;
generating a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through a detection head of a target detection sub-model based on the initial image feature and the initial text feature;
and performing model iterative training based on the target object detection frame, the target classification result and the target detection truth value corresponding to the image sample until a model convergence condition is reached.
In a second aspect, an embodiment of the present disclosure further provides a training apparatus for a target detection model, where the apparatus includes:
a sample acquisition module for acquiring paired image samples and text samples for target detection; wherein the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample;
the feature generation module is used for inputting the image sample and the text sample into a frozen multi-modal sub-model to generate initial image features and initial text features;
the result generation module is used for generating, based on the initial image features and the initial text features, a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through a detection head of a target detection sub-model;
and the model training module is used for performing model iterative training based on the target object detection frame, the target classification result and the target detection truth value corresponding to the image sample until a model convergence condition is reached.
In a third aspect, embodiments of the present disclosure further provide an electronic device, including:
a processor;
a memory for storing executable instructions;
the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the training method of the object detection model described in any embodiment of the disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to implement the training method of the object detection model described in any of the embodiments of the present disclosure.
In a fifth aspect, the disclosed embodiments also provide a computer program product for performing the training method of the object detection model described in any of the embodiments of the disclosure.
According to the training method, device, equipment and storage medium for a target detection model provided by the embodiments of the disclosure, the frozen multi-modal sub-model serves as the backbone network for open vocabulary target detection and the detection head of the target detection sub-model serves as the detection head for open vocabulary target detection, together forming the open vocabulary target detection model; during iterative training, the model parameters of the multi-modal sub-model are kept unchanged and only the model parameters of the detection head are trained. On the one hand, the learned knowledge of the pre-trained multi-modal sub-model is retained and kept from being destroyed, so the open vocabulary target detection model has detection capability for objects of unknown classes; on the other hand, training the detection head constrains the model output, so the constructed open vocabulary target detection model can detect and output prediction results for objects of unknown classes, retraining of all model parameters and model over-fitting are avoided, and the training efficiency of the open vocabulary target detection model is improved.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
Fig. 1 is a flow chart of a training method of a target detection model according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another training method of a target detection model according to an embodiment of the disclosure;
FIG. 3 is a flowchart illustrating a method for generating an enhanced classification result according to an embodiment of the disclosure;
FIG. 4 is a schematic diagram of a model structure for determining enhanced image features of a region according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a cross-attention mechanism sub-model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a model structure of a target detection model according to an embodiment of the disclosure;
fig. 7 is a schematic structural diagram of a training device for a target detection model according to an embodiment of the disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they are to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
With the development of computer technology, open vocabulary target detection can be performed on images; open vocabulary target detection realizes the detection of targets of new classes that do not exist in the training set.
In the related art, there are various methods for realizing open vocabulary target detection. Several examples follow:
Methods based on knowledge distillation. Specifically, knowledge distillation is performed on a pre-trained visual language model, and the distilled model is then endowed with the capability of open target detection. However, realizing open vocabulary target detection through knowledge distillation damages the knowledge in the pre-trained visual language model, and the whole target detection model needs to be retrained, so model training is slow and inefficient.
Methods based on detection-customized pre-training. Specifically, the target detection model is pre-trained with additional sample data and tasks so that it can adapt to open vocabulary scenes. However, this method relies on a large amount of manually designed and customized pre-training data, so model training is slow and inefficient.
Methods based on weakly supervised learning. Specifically, the target detection model is trained with weakly supervised data such as image-level labels, but this is typically less effective than fully supervised training.
Methods based on semantic embedding. Specifically, image content semantics are aligned into a word vector space to achieve open vocabulary target detection. However, since the dictionary (lexicon) is of limited size, the accuracy of semantic alignment of image content is low.
Methods based on generative models. Specifically, training data of new classes is synthesized with a generative model, and the target detection model is trained with this data. However, the quality of the generated data may be low and its quantity small, so a target detection model trained on such data may perform poorly.
Based on the above, the embodiments of the present disclosure provide a training scheme for a target detection model that enables the target detection model to inherit the knowledge of a multi-modal sub-model during training.
The training method of the target detection model provided by the embodiment of the disclosure can be applied to the scene of model training of open vocabulary object detection. The method may be performed by a training device of the object detection model, which may be implemented in software and/or hardware, which may be integrated in an electronic device having certain data processing functions. The electronic device may include, but is not limited to, a mobile terminal having a large data processing capability, and a stationary terminal having a large data processing capability such as a desktop computer, a super computer, and the like.
Fig. 1 shows a flowchart of a training method of a target detection model according to an embodiment of the disclosure. As shown in fig. 1, the training method of the object detection model may include the following steps:
s110, acquiring paired image samples and text samples for target detection; the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample.
The object detection may be to find objects of a specified class in the image and their position in the image. The target detection in embodiments of the present disclosure may be open vocabulary target detection. The object to be detected refers to an object to be detected from the image sample. In the embodiment of the disclosure, the open vocabulary target detection model is a neural network model capable of realizing open vocabulary target detection, and the open vocabulary target detection model at least can include a frozen multi-modal sub-model, a detection head of the target detection sub-model, and the like.
The image sample may include an image in the training set used to train the open vocabulary target detection model, and the text sample may include text in that training set; the text sample may be predefined text. The image sample and the text sample have an image-text correspondence, which this embodiment does not limit. For example, the image-text correspondence may be one-to-one, one-to-many, or many-to-one. A pair of image sample and text sample may have a corresponding target detection truth value (ground truth), where the target detection truth value may be a preset standard result for the image sample and text sample during target detection; the target detection truth value may also be understood as the real result of target detection.
The present embodiment does not limit the sources of the image samples and the text samples. For example, the image sample and the text sample may be obtained from an open source dataset. Alternatively, the image sample may be a key frame of a video and the text sample may be a corpus written for that key frame. Optionally, the training device of the target detection model may perform duplicate detection on the image samples and/or the text samples and delete the duplicated data, so as to prevent the training data from being overly homogeneous.
The descriptive text may be text describing the objects and/or the relationships between the objects comprised by the image sample. For example, if the image sample is an image with blue sky and white clouds drawn, the descriptive text may be text describing the blue sky and white clouds and/or the relative positions of the blue sky and white clouds. The method for generating the descriptive text is not limited in this embodiment, and the descriptive text may include text generated by performing text conversion processing on voice data of an image sample, or the descriptive text may be a corpus of the image sample, for example. The instructional text may be text indicating objects present in the image sample. For example, if the image sample is an image with blue sky and white clouds drawn, the instructional text may include blue sky and/or white clouds.
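For illustration, a paired training sample might be organized as follows. This is a minimal sketch; the field names and values are assumptions for exposition, not a data format prescribed by this disclosure:

```python
# Illustrative paired sample: descriptive text, instructional text, and
# the target detection truth value (ground truth) for one image sample.
# All field names and values here are hypothetical.
sample = {
    "image": "blue_sky_white_clouds.jpg",
    "descriptive_text": "White clouds drift across a clear blue sky.",
    "instructional_text": "Detect: blue sky; white clouds",
    "ground_truth": [
        {"box": [0, 0, 640, 240], "label": "blue sky"},
        {"box": [120, 40, 320, 160], "label": "white clouds"},
    ],
}
```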
In the embodiment of the disclosure, in order to train the open vocabulary target detection model, a training device of the target detection model acquires paired image samples and text samples, determines a training set of the open vocabulary target detection model based on the image samples and the text samples, and trains the open vocabulary target detection model.
S120, inputting the image sample and the text sample into the frozen multi-modal sub-model to generate initial image features and initial text features.
The multi-modal sub-model refers to a model capable of processing information of multiple modalities (such as text, images, audio, video, etc.) as part of the open vocabulary object detection model in embodiments of the present disclosure. For example, the multi-modal sub-model may be a visual language model (Vision-Language Model) capable of processing image information and text information. The present embodiment does not limit the visual language model; for example, it may be a Contrastive Language-Image Pre-training (CLIP) model. For the multi-modal information of images and text, the multi-modal sub-model may include an image encoder and a text encoder. The image encoder may be used for feature extraction from images, and the text encoder for feature extraction from text.
Freezing means that the sub-model's parameters do not change with the training of the open vocabulary target detection model. The frozen (Frozen) multi-modal sub-model is thus a multi-modal sub-model whose model parameters do not change with that training. The initial image features may be features of the image dimension obtained by the frozen multi-modal sub-model's feature extraction processing of the image sample. The initial text features may be features of the text dimension obtained by the frozen multi-modal sub-model's feature extraction processing of the text sample.
In the embodiment of the disclosure, the frozen multi-modal sub-model can be used in advance as the backbone network (Backbone) of the open vocabulary object detection model, so that the open vocabulary object detection model inherits the rich semantic knowledge expression capability of the multi-modal sub-model, laying a foundation for the open vocabulary object detection model to be applicable to open vocabulary scenes. During training of the open vocabulary target detection model, after the image sample and the text sample are input into the model, the frozen multi-modal sub-model's image encoder can serve as the image-dimension backbone network and process the image sample to obtain the initial image features, and the frozen multi-modal sub-model's text encoder can serve as the text-dimension backbone network and process the text sample to obtain the initial text features.
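As a minimal sketch of this step, assuming CLIP is chosen as the multi-modal sub-model (the disclosure does not mandate a specific model), the frozen encoders can be used as follows. Note that encode_image() returns a pooled global embedding, whereas detection additionally requires spatial feature maps; this sketch omits that detail:

```python
import torch
import clip  # OpenAI's open-source CLIP package (an assumed choice)
from PIL import Image

model, preprocess = clip.load("ViT-B/16", device="cpu")  # backbone choice is illustrative
for p in model.parameters():
    p.requires_grad = False  # frozen: parameters never update during training

image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)  # stand-in image sample
tokens = clip.tokenize(["a photo of blue sky and white clouds"])  # stand-in text sample
with torch.no_grad():
    initial_image_features = model.encode_image(image)  # image-dimension features
    initial_text_features = model.encode_text(tokens)   # text-dimension features
```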
S130, based on the initial image features and the initial text features, generating a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through a detection head of the target detection sub-model.
The object detection sub-model may be a model capable of realizing closed vocabulary object detection, that is, a model capable of realizing non-open vocabulary object detection. The network structure of the object detection sub-model is not limited in this embodiment. For example, it may be one of a region proposal network (Region Proposal Network, RPN), a feature pyramid network (Feature Pyramid Network, FPN), etc. The detection head (Head) may be the part of the model that detects the classification and/or position of the object to be detected. The detection box (Bounding Box) may be a box characterizing the predicted position of the object to be detected in the image sample. The target object detection frame may be the detection frame corresponding to the finally determined object to be detected; the number of target object detection frames is not limited in this embodiment. The classification result may be the probability that the object within a predicted detection box is of the corresponding object type. For example, the classification result may be the probability that the object within the detection box is a dog. The data types of the classification result are various and not limited by this embodiment; for example, they may include a classification score, a classification probability, etc. The target classification result may be the classification result corresponding to the finally determined target object detection frame.
In the embodiment of the present disclosure, after determining the initial image feature and the initial text feature, there are various methods for determining the target object detection box and the target classification result, and the embodiment is not limited thereto. Examples are as follows:
in an alternative embodiment, the initial image feature and the initial text feature may be directly input into a detection head of the target detection sub-model, and the detection head of the target detection sub-model determines a corresponding target object detection box and a target classification result according to the initial image feature and the initial text feature.
In another alternative embodiment, the initial image feature and the initial text feature may be further extracted to obtain an intermediate feature, and the intermediate feature is input into a detection head of the target detection sub-model, where the detection head of the target detection sub-model determines a corresponding target object detection frame and a target classification result according to the intermediate feature.
And S140, performing model iterative training based on the target object detection frame, the target classification result and the target detection truth value corresponding to the image sample until a model convergence condition is reached.
The model convergence condition may be a preset condition indicating that the model has converged sufficiently and training can stop. It may be set according to user requirements, etc., and this embodiment does not limit it; for example, the training termination condition may be that the error (Loss) is smaller than a preset error threshold.
In the embodiment of the disclosure, the target object detection frame and the target classification result are data determined by the open vocabulary target detection model from the image sample and the text sample, and the target detection truth value is the true value corresponding to the paired image sample and text sample. The difference between the target object detection frame and target classification result on the one hand and the target detection truth value on the other can be understood as the error of the target detection performed by the current open vocabulary target detection model. Based on this error, the model parameters of the non-frozen sub-models in the open vocabulary target detection model are fine-tuned iteratively until the preset model convergence condition is reached, yielding the trained open vocabulary target detection model.
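A toy sketch of this iterative training setup is given below, with stand-in modules replacing the real sub-models; it illustrates only that gradients flow solely into the non-frozen parameters and that training stops at a convergence condition. All module shapes, the loss, and the threshold are assumptions:

```python
import torch
from torch import nn

backbone = nn.Linear(32, 64)        # stand-in for the frozen multi-modal sub-model
for p in backbone.parameters():
    p.requires_grad = False         # frozen: excluded from optimization

head = nn.Linear(64, 4 + 10)        # stand-in detection head: 4 box coords + 10 class scores
optimizer = torch.optim.AdamW(
    [p for p in head.parameters() if p.requires_grad], lr=1e-4)

for step in range(1000):
    x = torch.randn(8, 32)          # stand-in for paired image/text samples
    truth = torch.randn(8, 4 + 10)  # stand-in for target detection truth values
    pred = head(backbone(x))        # detection frame and classification prediction
    loss = nn.functional.mse_loss(pred, truth)  # placeholder for the detection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < 0.05:          # one possible model convergence condition
        break
```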
According to the training method for the target detection model provided by the embodiment of the disclosure, the frozen multi-modal sub-model serves as the backbone network for open vocabulary target detection and the detection head of the target detection sub-model serves as the detection head for open vocabulary target detection, together forming the open vocabulary target detection model; during iterative training, the model parameters of the multi-modal sub-model are kept unchanged and only the model parameters of the detection head are trained. On the one hand, the learned knowledge of the pre-trained multi-modal sub-model is retained and kept from being destroyed, so the open vocabulary target detection model has detection capability for objects of unknown classes; on the other hand, training the detection head constrains the model output, so the constructed open vocabulary target detection model can detect and output prediction results for objects of unknown classes, retraining of all model parameters and model over-fitting are avoided, and the training efficiency of the open vocabulary target detection model is improved.
In some embodiments of the present disclosure, generating, by the detection head of the target detection sub-model, the target object detection frame corresponding to the object to be detected and the target classification result of the target object detection frame based on the initial image features and the initial text features includes: inputting the initial image features into the detection head to generate at least one initial object detection frame and the initial classification result of the initial object detection frame; generating the target object detection frame and the enhanced classification result of the target object detection frame through the detection head based on the initial image features and the initial text features; and generating the target classification result based on the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame.
Fig. 2 is a flow chart of another training method of an object detection model according to an embodiment of the disclosure, as shown in fig. 2, the method includes:
s210, acquiring paired image samples and text samples for target detection; the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample.
S220, inputting the image sample and the text sample into the frozen multi-modal sub-model, and generating initial image features and initial text features.
S230, inputting the initial image features into a detection head of the target detection sub-model, and generating at least one initial object detection frame and an initial classification result of the initial object detection frame.
The initial object detection frame is a position detection frame of a detection object in an image, wherein the position detection frame is obtained by directly carrying out target detection by utilizing initial image features. The initial classification result is a classification result of the detection object obtained after the target detection is directly performed by utilizing the initial image features. The initial classification result may be a score value, a probability value, or the like that the detection object belongs to the classification category.
In the embodiment of the disclosure, the initial image features are input into the detection head of the target detection sub-model, and the detection head performs target detection based on the initial image features, determining initial object detection frames for the preliminarily detected objects to be detected together with the initial classification result corresponding to each initial object detection frame. It should be noted that, among the initial classification results, the result for an initial object detection frame that actually contains the object to be detected may be low; hence, if the initial object detection frames were used directly as the final object detection frames, frames that truly contain the object to be detected would be more likely to be discarded.
S240, based on the initial image features and the initial text features, generating a target object detection frame and an enhanced classification result of the target object detection frame through a detection head of the target detection sub-model.
The enhanced classification result may be a classification result that raises the chance that a detection frame actually containing the object to be detected is selected.
In the embodiment of the disclosure, the training device of the target detection model may input the initial image features and all of the initial text features into the detection head of the target detection sub-model, directly or indirectly; alternatively, it may input the initial image features and part of the initial text features, directly or indirectly. The detection head generates the target object detection frame and the enhanced classification result from the initial image features and all or part of the initial text features.
In some embodiments of the present disclosure, the initial text feature includes a first text feature corresponding to the descriptive text and a second text feature corresponding to the instructional text; the first text feature may be understood as a text feature obtained by feature extraction of the descriptive text, and the second text feature may be understood as a text feature obtained by feature extraction of the instructional text, and the second text feature may be an embedded representation corresponding to the instructional text.
Accordingly, fig. 3 is a schematic flow chart of generating an enhanced classification result according to an embodiment of the present disclosure, and as shown in fig. 3, based on initial image features and initial text features, the enhanced classification result of the target object detection frame and the target object detection frame is generated by the detection head, including:
s310, fusing the initial image features and the first text features to generate region enhanced image features of the overlapped region of interest.
Wherein the region of interest (Region of Interest, ROI) may be a framed region in the sample image, also called a proposal region. The region enhanced image features may be image features determined based on a visual language model (Vision Language Model, VLM).
In this embodiment, the training device of the target detection model fuses the initial image feature and the first text feature corresponding to the descriptive text, so as to obtain the region enhanced image feature in which the region of interest is superimposed.
In this embodiment, there are various ways to fuse the initial image features and the first text features, and this embodiment is not limited. For example, the fusion approach may be one of direct concatenation, multiplication, addition, alignment, and cross-fusion. Among these, fusion based on alignment and interaction can make the target object detection frame more accurate.
In some embodiments of the present disclosure, fusing the initial image feature and the first text feature to generate a region-enhanced image feature overlaying the region of interest includes:
step a1, generating region image features of the region of interest based on the initial image features.
The region image features may be the image features of the part of the sample image framed by a region of interest, also referred to as proposal box features. The region image features may correspond one-to-one to the regions of interest.
In this embodiment, the training device of the target detection model performs region of interest alignment (ROI Align) on the initial image features to obtain the region image features of the regions of interest.
Fig. 4 is a schematic diagram of a model structure for determining enhanced image features of a region according to an embodiment of the present disclosure, where, as shown in fig. 4, initial image features are aligned by regions of interest, and region image features of each region of interest are extracted.
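A minimal sketch of this extraction step, assuming torchvision's roi_align operator as the region of interest alignment (the shapes and boxes are illustrative):

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 64, 64)  # initial image features (B, C, H, W)
rois = torch.tensor([[0.0,  4.0,  4.0, 28.0, 28.0],   # (batch_index, x1, y1, x2, y2)
                     [0.0, 10.0, 12.0, 40.0, 44.0]])  # two regions of interest
region_feats = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(region_feats.shape)  # torch.Size([2, 256, 7, 7]): one feature per region
```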
And a2, inputting the regional image features and the first text features into a cross attention mechanism submodel to generate weighted graphic features.
The cross attention (Cross Attention) mechanism sub-model may be a neural network model built on a cross-attention mechanism. The weighted image-text features may be features obtained by the cross-attention mechanism sub-model aligning and fusing the image features and the text features, capable of representing multiple dimensions such as the image dimension and the text dimension.
In this embodiment, the training device of the target detection model uses the regional image feature and the first text feature as input data of the cross-attention mechanism sub-model, and the cross-attention mechanism sub-model interactively fuses the regional image feature and the first text feature and outputs the corresponding weighted image-text feature.
As shown in fig. 4, the training device of the target detection model may copy the first text features to obtain a batch of first text features corresponding one-to-one to the regions of interest. The first text features and the region image features are input into the cross-attention mechanism sub-model, interaction between them is realized, and the weighted image-text features are obtained through the cross-attention computation.
Fig. 5 is a schematic diagram of a model structure of a cross-attention mechanism sub-model according to an embodiment of the present disclosure. As shown in fig. 5, after the region image features and the first text features are input into the cross-attention mechanism sub-model, the region image features are mapped to query (Query) and value (Value) representations, and the first text features are mapped to key (Key) representations; the values are also called value representations (Value Representations). A dot-product attention (Attention) operation is then performed on the queries and keys to obtain attention scores, the attention scores are normalized by a normalization function (for example, Softmax) to obtain attention weights, and the attention weights are fused with the values to obtain a fusion result. The fusion calculation may be a weighted summation, etc., which this embodiment does not limit. Further, the fusion result and the region image features are processed through a residual connection to obtain a residual connection result, and layer normalization (LayerNorm) is applied to the residual connection result to obtain the region enhanced image features.
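The following sketch implements a cross-attention fusion of this kind. Note one interpretation choice: fig. 5 maps the region image features to both queries and values, but for dimensional consistency this sketch follows the common cross-attention convention (queries from the region image features, keys and values from the first text features); it is an illustrative reading, not the authoritative design:

```python
import torch
from torch import nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # region image features -> query representations
        self.k = nn.Linear(dim, dim)  # first text features   -> key representations
        self.v = nn.Linear(dim, dim)  # first text features   -> value representations
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_feats, text_feats):
        # region_feats: (num_rois, L_r, dim); text_feats: (num_rois, L_t, dim)
        q, k, v = self.q(region_feats), self.k(text_feats), self.v(text_feats)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # dot-product attention scores
        weights = scores.softmax(dim=-1)                       # normalized attention weights
        fused = weights @ v                                    # fusion by weighted summation
        return self.norm(fused + region_feats)                 # residual connection + LayerNorm

fusion = CrossAttentionFusion(dim=256)
weighted = fusion(torch.randn(2, 49, 256), torch.randn(2, 16, 256))
print(weighted.shape)  # torch.Size([2, 49, 256])
```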
And a step a3, generating regional enhanced image features based on the regional image features and the weighted image-text features.
In this embodiment, after the region image features and the weighted image-text features are determined, feature fusion is performed on them to obtain region enhanced image features whose text-dimension information is strengthened. There are various methods for fusing the region image features and the weighted image-text features, which this embodiment does not limit; for example, the two may be added.
In this scheme, on the basis of the frozen multi-modal sub-model, the association between the image features and the text features is established through the cross-attention mechanism, and the image features and text features are aligned and fused, so that the expressive capability of the image features for specific semantic concepts (Concepts) is enhanced, strengthening the open vocabulary target detection capability of the open vocabulary target detection model.
In some embodiments of the present disclosure, generating a region enhanced image feature based on the region image feature and the weighted teletext feature comprises:
concatenating the region image features and the weighted image-text features to generate weighted enhanced image features; and inputting the weighted enhanced image features into a pooling layer to generate the region enhanced image features.
The pooling (Pooling) layer may be a neural network layer for implementing dimension reduction. The specific structure of the pooling layer is not limited by this embodiment.
In this embodiment, the training device of the target detection model fuses the regional image feature and the weighted image-text feature to obtain the weighted enhanced image feature. And inputting the weighted enhanced image features into a preset pooling layer, and outputting corresponding region enhanced image features by the pooling layer. It should be noted that, during the training process of the open vocabulary target detection model, the parameters of the pooling layer may change.
In the scheme, the dimension reduction processing of the features is realized through the pooling layer, so that the calculated amount is reduced, and the generation of overfitting is avoided.
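A hedged sketch of the concatenation-plus-pooling step follows. The disclosure leaves the pooling layer's structure open, so a token-wise average followed by a learnable linear projection stands in for it here; the shapes are illustrative:

```python
import torch
from torch import nn

region_feats = torch.randn(2, 49, 256)    # region image features
weighted_feats = torch.randn(2, 49, 256)  # weighted image-text features

weighted_enhanced = torch.cat([region_feats, weighted_feats], dim=-1)  # (2, 49, 512)
pool = nn.Linear(512, 256)  # learnable stand-in; its parameters change during training
region_enhanced = pool(weighted_enhanced.mean(dim=1))  # dimension reduction: (2, 256)
```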
In some embodiments of the present disclosure, in training the open vocabulary object detection model, the relevant parameter settings include at least one of the following setting methods:
the method comprises the following steps: the model parameters of the detection head and the pooling layer are initialized to 0.
In this embodiment, the training device of the target detection model may initialize model parameters of a model whose parameters change with training of the open vocabulary target detection model in advance. For example, model parameters of a model related to interaction of image features and text features, such as a detection head, a pooling layer, etc., may be initialized to 0 by a torch.zero () function.
Therefore, the module is initially close to the backbone network of the frozen multi-mode sub-model as much as possible, and the original knowledge of the multi-mode sub-model is prevented from being destroyed. Moreover, training the detection head is also beneficial to keeping the inheritance of the pre-training knowledge by the open vocabulary target detection model.
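A minimal sketch of method one; nn.init.zeros_() is one standard way to realize the zero initialization, and the module below is a hypothetical stand-in:

```python
import torch
from torch import nn

def zero_init(module: nn.Module) -> None:
    for p in module.parameters():
        nn.init.zeros_(p)  # in-place zero initialization of every parameter

proj = nn.Linear(512, 256)  # stand-in for a newly added interaction module
zero_init(proj)
assert torch.all(proj.weight == 0) and torch.all(proj.bias == 0)
# A zero-initialized added module initially contributes nothing, so the
# model starts out behaving like the frozen multi-modal backbone alone.
```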
Method two: the feature resolution of the region image features is adapted to the input feature resolution of the pooling layer.
The feature resolution may be a resolution of the regional image feature itself. The input feature resolution may be the resolution of features that the pooling layer is capable of handling.
In this embodiment, the initial image feature is subjected to the region of interest alignment process to generate a region image feature, where the feature resolution of the region image feature may be the same as the input feature resolution of the pooling layer. Thus, the pooling layer is enabled to perform pooling processing on the regional image features.
Method three: the text length of the text sample is less than a preset text length. The preset text length may be a preset maximum text length, which may be set according to user requirements, etc.; this embodiment does not limit it.
In this embodiment, the text length of the text sample is controlled by the preset text length so that it is not too long, avoiding excessive memory consumption caused by too many embeddings (Embeddings).
Optionally, during training of the open vocabulary target detection model, the related parameter settings further include at least one of the following: the learning rate lies in a preset learning rate interval and the number of training rounds lies in a preset round number interval, to prevent the open vocabulary target detection model from over-fitting; and the size of the image sample lies in a preset image size interval, so that the image sample is large enough to preserve the details in the image.
S320, inputting the region enhanced image features and the second text features into the detection head, and generating an enhanced classification result of the target object detection frame and the target object detection frame.
In this embodiment, the text sample is represented by text embedding to obtain the second text features. After the region enhanced image features are determined, the region enhanced image features and the second text features corresponding to the instructional text are input into the detection head, and the detection head generates the target object detection frame and the enhanced classification result of the target object detection frame.
In this scheme, the region enhanced image features corresponding to each region of interest are determined and the enhanced classification result is determined based on them; combining the enhanced classification result with the initial classification result then lifts object detection frames whose initial classification results rank low, raising the chance that detection frames containing real objects are retained, and thereby realizing open vocabulary target detection.
S250, generating a target classification result based on the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame.
The initial object detection frame matched with a target object detection frame may be an initial object detection frame whose degree of overlap with the target object detection frame is greater than a preset overlap threshold.
In this embodiment, each target object detection frame has a corresponding enhanced classification result. The initial classification result of the initial object detection frame corresponding to each target object detection frame is fused with the enhanced classification result to obtain the target classification result corresponding to that target object detection frame.
In some embodiments of the present disclosure, generating a target classification result based on the enhanced classification result, an initial classification result of an initial object detection box that matches the target object detection box, includes:
determining the geometric mean of the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame as the target classification result.
In this embodiment, there are various methods for fusing the enhanced classification result with the corresponding initial classification result, and this embodiment is not limited thereto. For example, a geometric mean or an arithmetic mean of the enhanced classification result and the corresponding initial classification result may be calculated. When the mean is calculated in weighted form, the weight may be preset by the user or determined by a preset weight calculation method.
In the above scheme, compared with the arithmetic mean, the geometric mean gives the model a stronger tendency to report detections and improves the accuracy of target detection. The weight of the geometric mean can be adjusted according to user requirements and the like, thereby controlling the proportions of the initial classification result and the enhanced classification result, that is, their fusion ratio.
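A small sketch of this fusion as a weighted geometric mean; the weight alpha and the scores are illustrative assumptions controlling the fusion proportion described above:

```python
import torch

initial_score = torch.tensor([0.20, 0.70])   # from the detection head alone
enhanced_score = torch.tensor([0.90, 0.60])  # from the region enhanced image features

alpha = 0.65  # illustrative weight: larger alpha gives the enhanced score more say
target_score = initial_score ** (1 - alpha) * enhanced_score ** alpha
print(target_score)  # a low initial score can be lifted by a high enhanced score
```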
And S260, performing model iterative training based on the target object detection frame, the target classification result and the target detection truth value corresponding to the image sample until a model convergence condition is reached.
In this embodiment, after the model convergence condition is reached, a trained open vocabulary target detection model is obtained, and the trained open vocabulary target detection model is capable of performing open vocabulary object detection.
For example, fig. 6 is a schematic diagram of a model structure of a target detection model according to an embodiment of the present disclosure. As shown in fig. 6, the image sample and the text sample are input into the open vocabulary target detection model: the image sample passes through the image encoder of the frozen multi-modal sub-model to obtain the initial image features, and the text sample passes through the text encoder of the frozen multi-modal sub-model to obtain the first text features. Feature interaction across the image and text dimensions is performed on the initial image features and the first text features through a feature interaction module to obtain the weighted enhanced image features. The weighted enhanced image features are pooled through the pooling layer to obtain the region enhanced image features corresponding to the regions of interest. Target detection is performed on the region enhanced image features and the second text features through the detection head to obtain the enhanced classification result. The image sample is also detected through the detection head to obtain the initial object detection frames and the corresponding initial classification results, and the geometric mean of the initial classification result and the enhanced classification result gives the target classification result. The feature interaction module may be a functional module for interactively fusing image features and text features and may include the cross-attention mechanism sub-model and the like.
In this scheme, the image features and the text features are aligned and fused through the attention mechanism, semantic information is added on top of the image features, and the expression of features for specific semantics is enhanced, which facilitates open vocabulary detection. Training only the detection head while freezing the multi-modal sub-model preserves the knowledge representation in the multi-modal sub-model and avoids both the destruction of that knowledge and model over-fitting. Retraining of the whole model is avoided, reducing the complexity of model training and increasing its speed.
The following is an embodiment of the training apparatus for an object detection model provided by the present disclosure, which belongs to the same inventive concept as the training method for an object detection model in the above embodiments; for details not described in this apparatus embodiment, reference may be made to the embodiments of the training method for an object detection model.
Fig. 7 is a schematic structural diagram of a training device for a target detection model according to an embodiment of the disclosure. As shown in fig. 7, the training apparatus 700 of the object detection model may include:
A sample acquisition module 710 for acquiring paired image samples and text samples for target detection; the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample;
a feature generation module 720 for inputting the image sample and the text sample into the frozen multimodal sub-model to generate an initial image feature and an initial text feature;
the result generating module 730 is configured to generate, based on the initial image feature and the initial text feature, a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through the detection head of the target detection sub-model;
the model training module 740 is configured to perform model iterative training based on the target object detection frame, the target classification result, and the target detection truth value corresponding to the image sample until a model convergence condition is reached.
In some embodiments, the result generation module 730 includes:
the initial generation sub-module is used for inputting the initial image characteristics into the detection head and generating at least one initial object detection frame and an initial classification result of the initial object detection frame;
the enhancement generation sub-module is used for generating the target object detection frame and the enhanced classification result of the target object detection frame through the detection head based on the initial image features and the initial text features;
and the target generation sub-module is used for generating the target classification result based on the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame.
In some embodiments, the initial text feature includes a first text feature corresponding to the descriptive text and a second text feature corresponding to the instructional text;
an enhancement generation sub-module, comprising:
the superposition generating unit is used for fusing the initial image features and the first text features to generate region enhanced image features with superimposed regions of interest;
and the enhancement generation unit is used for inputting the region enhanced image features and the second text features into the detection head to generate the target object detection frame and the enhanced classification result of the target object detection frame.
In some embodiments, the superposition generating unit comprises:
a first generation subunit for generating a region image feature of the region of interest based on the initial image feature;
the second generation subunit is used for inputting the regional image features and the first text features into a cross attention mechanism submodel to generate weighted image-text features;
and the third generation subunit is used for generating the regional enhanced image characteristic based on the regional image characteristic and the weighted image-text characteristic.
In some embodiments, the third generation subunit is configured to:
concatenating the region image features and the weighted image-text features to generate weighted enhanced image features; and inputting the weighted enhanced image features into a pooling layer to generate the region enhanced image features.
In some embodiments, the target generation sub-module is to:
and determining the geometric average value between the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame as a target classification result.
In some embodiments, the model parameters for both the detection head and the pooling layer are initialized to 0; or, the feature resolution of the regional image features is adapted to the input feature resolution of the pooling layer; alternatively, the text length of the text sample is less than the preset text length.
The training device for the target detection model provided by the embodiment of the invention can execute the training method for the target detection model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the training device of the target detection model, each included module, sub-module, unit, and sub-unit are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names are also only for distinguishing from each other, and are not intended to limit the scope of the present disclosure.
Embodiments of the present disclosure also provide a training device for an object detection model, which may include a processor and a memory for storing executable instructions. The processor may be configured to read the executable instructions from the memory and execute them to implement the training method of the object detection model in the above embodiments.
Fig. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
As shown in fig. 8, the electronic device 800 may include a processing means 801 (e.g., a central processing unit, a graphics processor, etc.) that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output interface (I/O interface) 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; output devices 807 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 808 including, for example, a magnetic tape, a hard disk, and the like; and communication means 809, which may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data.
It should be noted that the electronic device 800 shown in fig. 8 is only an example, and should not impose any limitation on the functions and application scope of the embodiments of the present disclosure. That is, while FIG. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. When executed by the processing means 801, the computer program performs the above-described functions defined in the training method of the object detection model of any embodiment of the present disclosure.
The disclosed embodiments also provide a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the training method of the object detection model in any of the embodiments of the disclosure.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present disclosure, a computer readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium other than a computer readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: an electrical wire, an optical fiber cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP, and may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the steps of the training method of the object detection model described in any of the embodiments of the present disclosure.
In embodiments of the present disclosure, computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (10)

1. A method of training a target detection model, comprising:
acquiring paired image samples and text samples for target detection, wherein the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample;
inputting the image sample and the text sample into a frozen multi-modal sub-model to generate initial image features and initial text features;
generating, based on the initial image features and the initial text features, a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through a detection head of a target detection sub-model;
and performing iterative model training based on the target object detection frame, the target classification result, and the target detection ground truth corresponding to the image sample until a model convergence condition is reached.
2. The method according to claim 1, wherein the generating, based on the initial image features and the initial text features, the target object detection frame corresponding to the object to be detected and the target classification result of the target object detection frame through the detection head of the target detection sub-model comprises:
inputting the initial image features into the detection head to generate at least one initial object detection frame and an initial classification result of the initial object detection frame;
generating, by the detection head, the target object detection frame and an enhanced classification result of the target object detection frame based on the initial image features and the initial text features;
and generating the target classification result based on the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame.
3. The method according to claim 2, wherein the initial text features comprise a first text feature corresponding to the descriptive text and a second text feature corresponding to the instructional text;
and the generating, by the detection head, the target object detection frame and the enhanced classification result of the target object detection frame based on the initial image features and the initial text features comprises:
fusing the initial image features and the first text feature to generate a region-enhanced image feature superimposed with the region of interest;
and inputting the region-enhanced image feature and the second text feature into the detection head to generate the target object detection frame and the enhanced classification result of the target object detection frame.
4. The method according to claim 3, wherein the fusing the initial image features and the first text feature to generate the region-enhanced image feature superimposed with the region of interest comprises:
generating region image features of the region of interest based on the initial image features;
inputting the region image features and the first text feature into a cross-attention mechanism sub-model to generate weighted image-text features;
and generating the region-enhanced image feature based on the region image features and the weighted image-text features.
5. The method according to claim 4, wherein the generating the region-enhanced image feature based on the region image features and the weighted image-text features comprises:
connecting the region image features and the weighted image-text features to generate weighted enhanced image features;
and inputting the weighted enhanced image features into a pooling layer to generate the region-enhanced image feature.
6. The method according to claim 2, wherein the generating the target classification result based on the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame comprises:
determining the geometric mean of the enhanced classification result and the initial classification result of the initial object detection frame matched with the target object detection frame as the target classification result.
7. The method according to claim 5, wherein the model parameters of the detection head and the pooling layer are each initialized to 0;
or the feature resolution of the region image features is adapted to the input feature resolution of the pooling layer;
or the text length of the text sample is less than a preset text length.
8. A training device for a target detection model, comprising:
a sample acquisition module for acquiring paired image samples and text samples for target detection, wherein the text sample comprises descriptive text describing the image sample and instructional text designating an object to be detected in the image sample;
the feature generation module is used for inputting the image sample and the text sample into a frozen multi-modal sub-model to generate initial image features and initial text features;
the result generation module is used for generating, based on the initial image features and the initial text features, a target object detection frame corresponding to the object to be detected and a target classification result of the target object detection frame through a detection head of a target detection sub-model;
and the model training module is used for performing iterative model training based on the target object detection frame, the target classification result, and the target detection ground truth corresponding to the image sample until a model convergence condition is reached.
9. An electronic device, comprising:
a processor;
a memory for storing executable instructions;
wherein the processor is configured to read the executable instructions from the memory and execute the executable instructions to implement the training method of the target detection model of any one of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, causes the processor to implement the training method of the target detection model according to any one of claims 1-7.
CN202311233461.9A 2023-09-21 2023-09-21 Training method, device, equipment and storage medium of target detection model Pending CN117115835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311233461.9A CN117115835A (en) 2023-09-21 2023-09-21 Training method, device, equipment and storage medium of target detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311233461.9A CN117115835A (en) 2023-09-21 2023-09-21 Training method, device, equipment and storage medium of target detection model

Publications (1)

Publication Number Publication Date
CN117115835A true CN117115835A (en) 2023-11-24

Family

ID=88802295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311233461.9A Pending CN117115835A (en) 2023-09-21 2023-09-21 Training method, device, equipment and storage medium of target detection model

Country Status (1)

Country Link
CN (1) CN117115835A (en)

Similar Documents

Publication Publication Date Title
US11436739B2 (en) Method, apparatus, and storage medium for processing video image
EP3665676B1 (en) Speaking classification using audio-visual data
CN111368685B (en) Method and device for identifying key points, readable medium and electronic equipment
CN111476309A (en) Image processing method, model training method, device, equipment and readable medium
CN111696176B (en) Image processing method, image processing device, electronic equipment and computer readable medium
US11436863B2 (en) Method and apparatus for outputting data
US11443438B2 (en) Network module and distribution method and apparatus, electronic device, and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111950570B (en) Target image extraction method, neural network training method and device
CN110349161B (en) Image segmentation method, image segmentation device, electronic equipment and storage medium
CN111368525A (en) Information searching method, device, equipment and storage medium
CN112785670B (en) Image synthesis method, device, equipment and storage medium
WO2023232056A1 (en) Image processing method and apparatus, and storage medium and electronic device
CN117437516A (en) Semantic segmentation model training method and device, electronic equipment and storage medium
CN111402113A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN113902636A (en) Image deblurring method and device, computer readable medium and electronic equipment
CN117171573A (en) Training method, device, equipment and storage medium for multi-modal model
CN110619602B (en) Image generation method and device, electronic equipment and storage medium
CN117115835A (en) Training method, device, equipment and storage medium of target detection model
CN117437411A (en) Semantic segmentation model training method and device, electronic equipment and storage medium
CN115270981A (en) Object processing method and device, readable medium and electronic equipment
CN115273224A (en) High-low resolution bimodal distillation-based video human body behavior identification method
CN114792388A (en) Image description character generation method and device and computer readable storage medium
CN116912808B (en) Bridge girder erection machine control method, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination