WO2023123847A1

WO2023123847A1 - Model training method and apparatus, image processing method and apparatus, and device, storage medium and computer program product

Info

Publication number: WO2023123847A1
Application number: PCT/CN2022/095298
Authority: WO
Inventors: 金国强; 杨帆; 孙明珊; 刘亚坤; 李韡; 暴天鹏; 吴立威
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-12-31
Filing date: 2022-05-26
Publication date: 2023-07-06
Also published as: CN114359592A

Abstract

Provided in the embodiments of the present disclosure are a model training method and apparatus, an image processing method and apparatus, and a device, a storage medium and a computer program product. The model training method comprises: acquiring a first augmented image and a second augmented image; performing target detection on the first augmented image by using a first model so as to obtain at least one first detection result that comprises first predictive object sequences, and performing target detection on the second augmented image by using a second model so as to obtain at least one second detection result that comprises second predictive object sequences; matching each first predictive object sequence with each second predictive object sequence to obtain at least one pair of first predictive object sequence and second predictive object sequence that have a target matching relationship; and on the basis of each pair of first predictive object sequence and second predictive object sequence that have a target matching relationship, updating model parameters of the first model at least once to obtain a trained first model.

Description

Model training and image processing method, device, equipment, storage medium and computer program product

Cross References to Related Applications

The embodiments of the present disclosure are based on the application number 202111667489.4, the applicant is Shanghai Shangtang Intelligent Technology Co., Ltd., the application date is December 31, 2021, and the application name is "model training and image processing method, device, equipment, storage medium" in China. A patent application is filed and priority is claimed to this Chinese patent application, the entire contents of which are hereby incorporated by reference into this disclosure.

technical field

The present disclosure relates to but not limited to the field of artificial intelligence, and in particular relates to a model training and image processing method, device, equipment, storage medium and computer program product.

Background technique

Target detection is an important problem in the fields of computer vision and industrial detection. Target detection is to use algorithms to obtain the position and corresponding classification of the target of interest in the image. Compared with image classification, object detection is a prediction-intensive computer vision task. During the training process of the object detection model, the labeling requirements are higher, so the labeling cost is also higher.

Contents of the invention

In view of this, embodiments of the present disclosure provide a model training and image processing method, device, device, storage medium, and computer program product.

The technical scheme of the embodiment of the present disclosure is realized in this way:

An embodiment of the present disclosure provides a model training method, the method is executed by a computer device, and the method includes:

Acquiring a first augmented image and a second augmented image respectively obtained by augmenting the first image sample;

Using the first model to be trained, perform target detection on the first augmented image to obtain at least one first detection result including the first predicted object sequence, and use the second model to perform target detection on the second augmented image Target detection, obtaining at least one second detection result including a second predicted object sequence;

Matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship;

Based on each pair of the first predicted object sequence and the second predicted object sequence having a target matching relationship, the model parameters of the first model are updated at least once to obtain the trained first model.

An embodiment of the present disclosure provides an image processing method, the method is executed by a computer device, including:

Get the image to be processed;

Use the trained fourth model to perform target detection on the image to be processed to obtain a third detection result; wherein, the fourth model includes at least one of the following: the first model obtained by the above-mentioned model training method, using the above-mentioned The third model obtained by the model training method.

An embodiment of the present disclosure provides a model training device, the device comprising:

The first acquisition part is configured to acquire a first augmented image and a second augmented image obtained by respectively augmenting the first image sample;

The first detection part is configured to use the first model to be trained to perform object detection on the first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the second model to perform target detection on the first augmented image. Target detection is performed on the second augmented image, and at least one second detection result including a second predicted object sequence is obtained;

The first matching part is configured to match each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of first predictor sequences and second predictors having a target matching relationship sequence;

The first update part is configured to update the model parameters of the first model at least once based on each pair of the first predicted object sequence and the second predicted object sequence having a target matching relationship, and obtain the trained first predicted object sequence. a model.

An embodiment of the present disclosure provides an image processing device, including:

The third acquiring part is configured to acquire the image to be processed;

The second detection part is configured to use the trained fourth model to perform target detection on the image to be processed to obtain a third detection result; wherein the fourth model includes at least one of the following: using the above-mentioned model training method The obtained first model and the third model obtained by using the above model training method.

An embodiment of the present disclosure provides a computer device, including a memory and a processor. The memory stores a computer program that can run on the processor. When the processor executes the program, part or all of the steps in the above method are implemented.

An embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, part or all of the steps in the above method are implemented.

An embodiment of the present disclosure provides a computer program, including computer readable codes, when the computer readable codes are run in a computer device, a processor in the computer device executes some or all of the steps for implementing the above method .

An embodiment of the present disclosure provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, a part or part of the above-mentioned method is implemented. All steps.

In the embodiment of the present disclosure, the first augmented image and the second augmented image obtained after respectively augmenting the first image sample are acquired; using the first model to be trained, the target detection is performed on the first augmented image, Obtain at least one first detection result including the first predicted object sequence, and use the second model to perform target detection on the second augmented image, and obtain at least one second detection result including the second predicted object sequence; for each first A predictor sequence is matched with each second predictor sequence to obtain at least one pair of the first predictor sequence and the second predictor sequence with the target matching relationship; based on each pair of the first predictor sequence with the target match relationship and the second predicted object sequence, and update the model parameters of the first model at least once to obtain the trained first model. In this way, by maintaining the consistency between the first predicted object sequence and the second predicted object sequence obtained after the first model and the second model respectively process the first augmented image and the second augmented image of the same image sample, The sequence-level self-supervised training process of the target detection model can be realized, and the overall network structure of the target detection model can be trained, so that the performance of the entire target detection model can be effectively improved, and the labeling cost in the training process of the target detection model can be reduced.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Description of drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that need to be used in the embodiments of the present disclosure will be described below. The accompanying drawings here are incorporated into the description and constitute a part of the present description. These drawings show embodiments consistent with the present disclosure, and are used together with the description to explain the technical solution of the present disclosure.

FIG. 1 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of an implementation flow of a model training method provided by an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of an implementation flow of an image processing method provided by an embodiment of the present disclosure;

FIG. 7A is a schematic diagram of an implementation process of model training based on a pre-training method provided by an embodiment of the present disclosure;

FIG. 7B is a schematic diagram of an implementation architecture of a model training method provided by an embodiment of the present disclosure;

FIG. 8 is a schematic diagram of the composition and structure of a model training device provided by an embodiment of the present disclosure;

FIG. 9 is a schematic diagram of the composition and structure of an image processing device provided by an embodiment of the present disclosure;

FIG. 10 is a schematic diagram of a hardware entity of a computer device provided by an embodiment of the present disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure will be described in detail below in conjunction with the accompanying drawings and embodiments, and the described embodiments should not be considered as limiting the present disclosure. All other embodiments obtained under the premise of creative work belong to the protection scope of the present disclosure.

In the following description, references to "some embodiments" describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or a different subset of all possible embodiments, and Can be combined with each other without conflict. The term "first/second/third" involved is only to distinguish similar objects, and does not represent a specific ordering for the objects. It is understandable that "first/second/third" can be used interchangeably when permitted. The specific order or sequential order is not intended to enable the embodiments of the disclosure described herein to be practiced in other orders than those illustrated or described herein.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terms used herein are for the purpose of describing the present disclosure only, and are not intended to limit the present disclosure.

To solve the problem of high labeling costs in the training process of the target detection model in related technologies, a self-supervised training algorithm can be used to help improve the performance of the target detection model by using unlabeled data. However, the self-supervised training algorithm in the related art is mainly applied to the image classification task, and the entire image is regarded as a whole, which is not suitable for the prediction-intensive task of target detection, and the self-supervised training algorithm in the related art usually can only The parameters of some networks in the pre-trained target detection model, for example, can only train the parameters of the backbone network, so the performance improvement of the final target detection model is limited.

An embodiment of the present disclosure provides a model training method, which can be executed by a processor of a computer device. Among them, computer equipment refers to servers, notebook computers, tablet computers, desktop computers, smart TVs, set-top boxes, mobile devices (such as mobile phones, portable video players, personal digital assistants, dedicated messaging devices, portable game devices), etc. Devices with data processing capabilities. Fig. 1 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 1, the method includes the following steps S101 to S104:

Step S101 , acquiring a first augmented image and a second augmented image obtained by respectively performing augmentation processing on a first image sample.

Here, the first image sample may be any suitable image containing at least one object. The objects contained in the first image sample can be determined according to the actual application scene, for example, may include but not limited to at least one of objects such as people, human body parts, animals, animal limbs, plants, flowers, leaves, stones, clouds, and fences.

The augmentation processing performed on the first image sample may include but not limited to at least one of random scaling, random cropping, random flipping, random resizing, color dithering, grayscale processing, Gaussian blurring, random erasing, and the like. The first augmented image and the second augmented image may be obtained by performing different augmentation processes on the same first image sample, or may be obtained by performing the same augmentation process on the same first image sample. During implementation, those skilled in the art may use appropriate augmentation processing on the first image sample to obtain the first augmented image and the second augmented image according to actual conditions, which are not limited by the embodiments of the present disclosure.

Step S102, use the first model to be trained to perform object detection on the first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the second model to perform object detection on the second augmented image Target detection is performed on the wide image, and at least one second detection result including the second predicted object sequence is obtained.

Here, the first model can be any suitable model for object detection based on sequence characteristics, such as Vision Transformer (ViT), Transformer-based object detection model (Detection Transformer, DETR), deformable DETR, etc. The first model can transform the target detection problem into a prediction problem of the feature sequence set, so as to output at least one first detection result including the first predicted object sequence. The first prediction target sequence may be obtained after the first model performs sequence encoding and sequence decoding on the first augmented image. Each sequence of first predictors may represent a predictor in the first image sample. During implementation, those skilled in the art may use any suitable sequence encoding method and sequence decoding method to process the first augmented image according to the actual situation to obtain at least one first prediction object sequence, which is not limited in this embodiment of the present disclosure.

In some embodiments, the first model may be a deformable DETR. The first prediction object sequence in the first detection result can be the prediction object sequence output by the decoder in the transformer (Transformer), or it can be the prediction object sequence output by the decoder in the Transformer after mapping processing such as dimension transformation The resulting mapped prediction object sequence.

In some implementation manners, the first detection result may include a first predicted object sequence, a first object region and a first object category corresponding to the first predicted object sequence. The first predicted object sequence may represent a predicted object, and the first object area and the first object category corresponding to the first predicted object sequence may respectively represent the predicted location area and predicted category of the predicted object.

The second model may have the same network structure as the first model, or may have a different network structure from the first model, which is not limited here. The process of using the second model to perform object detection on the second augmented image corresponds to the process of using the first model to perform object detection on the first augmented image. During implementation, you can refer to the process of using the first model to detect the first The process of augmenting images for object detection. The second prediction target sequence may be obtained after the second model performs sequence encoding and sequence decoding on the second augmented image. Each second sequence of predictors may represent a predictor in the first image sample.

In some implementations, in the case where the third model is a Transformer-based target detection model, the second predicted object sequence in the second detection result may be the predicted object sequence output by the decoder in the Transformer, or a pair of The predicted target sequence output by the decoder in the Transformer is the mapped predicted target sequence obtained after mapping processing such as dimension transformation.

In some implementation manners, the second detection result may include a second predicted object sequence, a second object region and a second object category corresponding to the second predicted object sequence. The second predicted object sequence may represent a predicted object, and the second object area and the second object category corresponding to the second predicted object sequence may respectively represent the predicted location area and predicted category of the predicted object.

Step S103, matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship.

Here, the first sequence of prediction objects and the second sequence of prediction objects having a target matching relationship may represent the same prediction object in the first image sample. During implementation, those skilled in the art may use any suitable matching method to match each first sequence of prediction objects with each second sequence of prediction objects according to actual conditions, which is not limited here.

In some implementations, the output timing of each first predictor sequence and the output timing of each second predictor sequence can be determined, and the first predictor sequence and the second predictor sequence with the same output timing are determined as having The first predictor sequence and the second predictor sequence of the target matching relationship, so that at least one pair of the first predictor sequence and the second predictor sequence having the target matching relationship can be obtained.

In some implementations, bipartite graph matching can be used to match each first predictor sequence with each second predictor sequence to obtain at least one pair of the first predictor sequence and the second predictor sequence with the target matching relationship. Predict sequence of objects. During implementation, any suitable manner may be used to calculate the matching loss used in the bipartite graph matching process, which is not limited here. For example, the matching loss used in the bipartite graph matching process may be determined based on at least one of the following: the similarity between each pair of matched first predictor sequences and each second predictor sequence, each pair of matched The intersection ratio between the first object region and the second object region corresponding to the first prediction object sequence and each second prediction object sequence, each pair of the first prediction object sequence and each second prediction object sequence that match each other The focus loss between the first object category and the second object category respectively corresponding to the object sequences, etc.

Step S104 , based on each pair of the first predicted object sequence and the second predicted object sequence having the target matching relationship, update the model parameters of the first model at least once to obtain the trained first model.

Here, in some implementations, it may be determined whether to update the model parameters of the first model based on each pair of the first predicted object sequence and the second predicted object sequence having a target matching relationship. In the case of updating the model parameters, an appropriate parameter update algorithm is used to update the model parameters of the first model, and after the update, each pair of the first prediction object sequence and the second prediction object sequence with the target matching relationship is re-determined, Based on each re-determined pair of the first predictor sequence and the second predictor sequence having the target matching relationship, it is determined whether the model parameters of the first model need to be continuously updated. If it is determined that the model parameters of the first model do not need to be continuously updated, the finally updated first model is determined as the trained first model.

For example, the target loss value can be determined based on each pair of the first predictor sequence and the second predictor sequence with the target matching relationship, and when the target loss value does not meet the preset condition, the model parameters of the first model Update, when the target loss value satisfies the preset condition or the number of updates to the model parameters of the first model reaches the set threshold, stop updating the model parameters of the first model, and finally update the first model The model is determined to be the first model after training.

In some embodiments, the first model includes a feature extraction network and a converter network; using the first model to be trained in the above step S102 to perform target detection on the first augmented image to obtain at least one The first detection result including the first predicted object sequence includes the following steps S111 to S112:

Step S111, using the feature extraction network of the first model to perform feature extraction on the first augmented image to obtain image feature information.

Here, the feature extraction network can be any suitable network capable of extracting image features, such as a convolutional neural network, a recurrent neural network, a converter-based feature extraction network, and the like. During implementation, those skilled in the art may use an appropriate feature extraction network in the first model according to actual conditions to obtain image feature information, which is not limited here.

Step S112, using the converter network of the first model to perform prediction processing on the image feature information to obtain at least one sequence of first prediction objects.

Here, the converter network may include an encoder network and a decoder network. During implementation, those skilled in the art may use an appropriate converter network in the first model according to actual conditions to perform prediction processing on the image feature information, which is not limited here.

In some embodiments, the image feature information can be position-encoded and then input into the encoder network to obtain at least one encoded feature sequence after the encoder network performs feature encoding processing on the position-encoded image feature information; using the decoder network, the Identifying each coded feature sequence to obtain context identification information corresponding to at least one prediction object, and performing feature decoding processing on each coded feature sequence according to each context identification information to obtain at least one first prediction object sequence.

In the above-mentioned embodiment, the first model includes a feature extraction network and a converter network, so that based on the sequence characteristics of the converter network, the sequence-level self-supervised training process of the target detection model based on the converter network can be realized, and can be based on The overall network structure of the target detection model of the converter network is trained, so that the performance of the entire target detection model can be effectively improved, and the labeling cost in the training process of the target detection model can be reduced.

In some embodiments, the first model further includes a first feed-forward neural network; the above step S112 may include the following steps S121 to S122:

Step S121, using the converter network of the first model to perform prediction processing on the image feature information to obtain at least one feature sequence;

Step S122, using the first feed-forward neural network to map each feature sequence to a target dimension to obtain at least one first sequence of predicted objects.

Here, the first feedforward neural network may be any suitable feedforward neural network capable of mapping the feature sequence to the target dimension, which is not limited here.

Target dimensions can be pre-set. During implementation, those skilled in the art can set appropriate target dimensions according to actual business scenarios.

For example, the feature sequence output by the converter network is a 256-dimensional feature, and the 256-dimensional feature sequence can be mapped to a 512-dimensional first prediction object sequence through the first feedforward neural network.

In the above embodiment, the feature sequence output by the converter network is mapped to the target dimension through the first feed-forward neural network to obtain the second predicted object sequence. In this way, the detection of the first model can be improved by presetting the appropriate target dimension. performance. For example, the detection accuracy of the first model can be improved by setting a higher target dimension; for another example, the detection efficiency of the first model can be improved by setting a lower target dimension.

In some embodiments, the first detection result further includes a first object area and a first object category, and the first model further includes a second feedforward neural network and a third feedforward neural network; the above step S102 The aforementioned method uses the first model to be trained to perform target detection on the first augmented image to obtain at least one first detection result including the first predicted object sequence, and further includes:

Step S131, for each feature sequence, use the second feedforward neural network to perform region prediction on the feature sequence to obtain a first object region, and use the third feedforward neural network to predict the The feature sequence is used for category prediction to obtain the first object category.

Here, the second feedforward neural network may be any suitable feedforward neural network capable of area prediction, which is not limited here. In some implementations, the second feedforward neural network can be used to predict the position area of the predicted object represented by the feature sequence in the first augmented image, and the obtained first object area can be a detection frame of the predicted object.

The third feedforward neural network may be any suitable feedforward neural network capable of category prediction, which is not limited here. In some implementation manners, the object category of the predicted object represented by the feature sequence can be predicted by using the third feedforward neural network to obtain the first object category. During implementation, the output quantity of the third feedforward neural network may be determined according to the quantity of object categories to be detected in an actual business scenario, which is not limited here.

In some embodiments, the second model has the same network structure as the first model. In implementation, the process of performing target detection on the second augmented image by using the second model may refer to the process of performing target detection on the first augmented image by using the first model.

In some embodiments, the above step S101 may include the following steps S141 to S142:

Step S141, performing first image augmentation processing on the first image sample to obtain a first augmented image;

Step S142, performing a second image augmentation process on the first image sample to obtain a second augmented image.

During implementation, the first image augmentation processing and the second image augmentation processing may adopt the same augmentation processing manner, or may adopt different augmentation processing manners, which are not limited here.

In some embodiments, the first image augmentation process includes at least one of the following: color dithering, grayscale processing, Gaussian blur, and random erasure; the second image augmentation process includes at least one of the following: random scaling , random cropping, random flipping, random resizing.

In the above-mentioned embodiment, the first augmented image and the second augmented image are obtained by performing the first image augmentation processing and the second image augmentation processing on the first image sample and the first image sample respectively. The image disturbance caused by random scaling, random cropping, random flipping and random resizing included in the second image augmentation process, the image caused by color dithering, grayscale processing, Gaussian blur, and random erasure included in the first augmentation process The disturbance is stronger, which can make the target detection difficulty of the first model higher than that of the second model, thereby improving the learning ability of the trained first model, and improving the model due to the same learning ability of the first model and the second model. Collapsing situation.

An embodiment of the present disclosure provides a model training method, which can be executed by a processor of a computer device. Fig. 2 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 2, the method includes the following steps S201 to S206:

Step S201, acquiring a first augmented image and a second augmented image obtained by respectively performing augmentation processing on a first image sample.

Step S202, use the first model to be trained to perform object detection on the first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the second model to perform object detection on the second augmented image Target detection is performed on the wide image, and at least one second detection result including the second predicted object sequence is obtained.

Step S203, matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship.

Here, the above steps S201 to S203 respectively correspond to the above steps S101 to S103, and the implementation of the above steps S101 to S103 can be referred to for implementation.

Step S204, based on the similarity between each pair of the first prediction object sequence and the second prediction object sequence having the target matching relationship, determine a target loss value.

Here, any suitable similarity loss function can be used to determine the similarity loss between each pair of the first prediction object sequence and the second prediction object sequence with the target matching relationship, and based on each similarity loss, the target loss value can be determined . The similarity loss function may include but not limited to at least one of absolute value loss function, least square error loss function, cosine loss function, BYOL (Bootstrap Your Own Latent) algorithm, Momentum Contrastive (MOCO) algorithm, etc.

Step S205, if the target loss value does not meet the preset condition, update the model parameters of the first model to obtain an updated first model.

Here, the preset conditions may include, but are not limited to, the target loss value being smaller than a set loss value threshold, the change of the target loss value converging, and the like. During implementation, the preset conditions may be set according to actual conditions, which are not limited here.

The way to update the model parameters of the first model can be determined according to the actual situation, and can include but not limited to at least one of gradient descent method, momentum update method, Newton momentum method, etc., which is not limited here.

Step S206, based on the updated first model, determine the trained first model.

Here, in some implementation manners, the updated first model may be determined as the trained first model.

In some implementation manners, the updated first model may be continuously updated, and the finally updated first model may be determined as the trained first model.

In the embodiment of the present disclosure, the target loss value is determined based on the similarity between each pair of the first prediction object sequence and the second prediction object sequence with the target matching relationship. When the target loss value does not meet the preset condition, The model parameters of the first model are updated to obtain an updated first model, and based on the updated first model, a trained first model is determined. In this way, the model parameters of the first model can be updated at least once when the target loss value does not meet the preset condition, because the target loss value is based on each pair of the first prediction object sequence and the second The similarity between the predicted object sequences is determined, so that the consistency of the predicted object sequences obtained after the first model after training and the second model are processed for different augmented images of the same image sample can be improved, and the post-training can be improved. The performance of the target detection model.

In some embodiments, the above step S205 may include the following step S211:

Step S211, when the target loss value does not meet the preset condition, update the model parameters of the first model and the model parameters of the second model respectively to obtain the updated first model and the updated of the second model.

Here, both the model parameters of the first model and the model parameters of the second model may be updated when the target loss value does not meet the preset condition, so as to realize comparative learning between the first model and the second model.

The way to update the model parameters of the second model can be determined according to the actual situation, and can include but not limited to at least one of gradient descent method, momentum update method, Newton momentum method, etc., which is not limited here. During implementation, the model parameter updating methods of the first model and the second model may be the same or different, which is not limited here.

The above step S206 may include the following step S212:

Step S212: Determine the trained first model based on the updated first model and the updated second model.

In some implementations, based on the updated first model and the updated second model, a new target loss value can be determined, and by judging whether the new target loss value satisfies a preset condition, it is determined whether the updated The first model continues to be updated. When the new target loss value satisfies the preset condition, it can be determined not to continue to update the updated first model, and the updated first model can be determined as the first model after training; in the new target loss If the value does not meet the preset condition, the updated first model may be continuously updated, and the finally updated first model may be determined as the trained first model.

In the above embodiment, in the process of updating the model parameters of the first model, the model parameters of the second model are also updated, so that the learning capabilities of the first model and the second model can be mutually enhanced, thereby improving the training performance. performance of the target detection model.

In some embodiments, the above step S211 may include the following steps S221 to S222:

Step S221, based on the current model parameters of the first model, perform momentum update on the model parameters of the second model to obtain an updated second model.

Here, those skilled in the art may use any appropriate momentum update method according to the actual situation to update the momentum of the model parameters of the second model based on the current model parameters of the first model, which is not limited by the embodiments of the present disclosure.

In some implementations, based on the set weights, the current model parameters of the first model and the current model parameters of the second model may be weighted and summed to obtain an updated second model. For example, the following formula 1 can be used to update the momentum of the model parameters of the second model:

Θ _m+1 = k*Θ _m +(1-k)*Θ _o (1);

Among them, Θ _m and Θ _o are the current model parameters of the second model and the current model parameters of the first model, respectively, Θ _m+1 is the updated second model, and k is the set momentum coefficient. In some embodiments k may be a value greater than or equal to 0.9 and less than 1, for example, k is 0.995.

In step S222, the current model parameters of the first model are updated in a gradient update manner to obtain an updated first model.

Here, any suitable gradient update algorithm may be used to update the current model parameters of the first model, which is not limited in this embodiment of the present disclosure. For example, the gradient update algorithm may include, but is not limited to, at least one of batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

In the above embodiment, based on the current model parameters of the first model, the momentum update is performed on the model parameters of the second model to obtain the updated second model, and the current model parameters of the first model are updated by means of gradient update , to get the updated first model. In this way, the first model and the second model can be updated at different rates, which can improve model collapse and improve the performance of the trained target detection model.

In some embodiments, the above step S212 may include the following steps S231 to S235:

In step S231 , the first augmented image and the second augmented image obtained after augmenting the next first image sample are respectively determined as the current first augmented image and the current second augmented image.

Here, the next first image sample may be the same image as the current first image sample, or an image different from the current first image sample.

Step S232, using the currently updated first model to perform object detection on the current first augmented image, obtaining at least one first detection result including the first predicted object sequence, and using the currently updated second model, Object detection is performed on the current second augmented image to obtain at least one second detection result including a second predicted object sequence.

Step S233, matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship.

Step S234, based on the similarity between each pair of the first prediction target sequence and the second prediction target sequence having a target matching relationship, determine the current target loss value.

Here, the above steps S231 to S234 respectively correspond to the above steps S201 to S204, and for implementation, reference may be made to the implementation manners of the above steps S201 to S204.

Step S235, when the current target loss value satisfies the preset condition or the number of times the model parameters of the first model are updated reaches a number threshold, determine the currently updated first model as training after the first model.

Here, the number of times threshold may be preset by the user according to the actual situation, or may be a default value.

In some embodiments, the above step S212 may further include the following steps S241 to S242:

Step S241, in the case that the current target loss value does not meet the preset condition, respectively update the model parameters of the first model and the model parameters of the second model to obtain the second update after the next update. One model and the second model after the next update.

Step S242, based on the first model after the next update and the second model after the next update, determine the first model after training.

In the above embodiment, when the target loss value does not meet the preset condition, the model parameters of the first model and the model parameters of the second model can be updated next time, and based on the first model after the next update and the second model after the next update to determine the first model after training, so that the performance of the first model after training can be improved through continuous iterative updating.

An embodiment of the present disclosure provides a model training method, which can be executed by a processor of a computer device. Fig. 3 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 3, the method includes the following steps S301 to S310:

Step S301, acquiring a first augmented image and a second augmented image obtained by respectively performing augmentation processing on a first image sample.

Step S302, use the first model to be trained to perform target detection on the first augmented image to obtain at least one first detection result, and use the second model to perform target detection on the second augmented image to obtain At least one of the second detection results includes a second sequence of predicted objects; the first detection result includes a first sequence of predicted objects and a first object region and a first object category corresponding to the first sequence of predicted objects.

Step S303, matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship.

Here, the above step S301 to step S303 respectively correspond to the above step S101 to step S103, and the implementation manner of the above step S101 to step S103 can be referred to for implementation.

Step S304, acquiring at least one candidate object in the first image sample, each of the candidate objects having a candidate object area and a candidate object category.

Here, at least one candidate object in the first image sample may be randomly determined, or may be obtained by performing object detection on the first image sample through any suitable unsupervised algorithm, which is not limited here. For example, the unsupervised detection algorithm may include but not limited to at least one of a sliding window method, a candidate region algorithm, a selective search algorithm, and the like.

The candidate object area of the candidate object is the predicted position area of the candidate object in the first image sample, and the candidate object category of the candidate object is the prediction type of the candidate object. The candidate object category of the candidate object can be used as a pseudo-label of the candidate object region of the candidate object.

In some embodiments, the above step S304 may include: performing object detection on the first image sample in an unsupervised manner to obtain at least one predicted object region and a pseudo-label of each predicted object region; The pseudo-label of the prediction object area is used to characterize the prediction object category of the prediction object area; for each of the prediction object areas, the prediction object area is used as a candidate object area, and the pseudo-label of the prediction object area is used as Candidate object category, get a candidate object. Here, any suitable unsupervised algorithm may be used to implement the unsupervised target detection on the first image sample. In this way, the labeling cost in the training process of the target detection model can be reduced.

Step S305, based on the first object region and the first object category corresponding to each of the first prediction object sequences, and the candidate object region and the candidate object category of each of the candidate objects, for each of the first prediction objects The sequence is matched with each of the candidate objects to obtain at least one pair of the first predicted object sequence and the candidate object having a target matching relationship.

Here, the first predicted object sequence and the candidate object having a target matching relationship may represent the same predicted object in the first image sample. During implementation, those skilled in the art may use any suitable matching manner to match each first predicted object sequence with each candidate object according to actual conditions, which is not limited here.

In some implementations, bipartite graph matching may be used to match each first predicted object sequence and each candidate object to obtain at least one pair of the first predicted object sequence and the candidate object having a target matching relationship. During implementation, any suitable manner may be used to calculate the matching loss used in the bipartite graph matching process, which is not limited here. For example, the matching loss used in the bipartite graph matching process may be determined based on at least one of the following: the intersection and union between the first object region and the candidate object region respectively corresponding to each pair of first predicted object sequences and candidate objects that match each other Each pair of the first predicted object sequence and the candidate object matched with each other corresponds to the focal loss between the first object category and the candidate object category, etc.

Step S306, based on the similarity between each pair of the first prediction object sequence and the second prediction object sequence having the target matching relationship, determine a first loss value.

Here, any suitable similarity loss function may be used to determine the first loss value between each pair of the first prediction object sequence and the second prediction object sequence having the target matching relationship, which is not limited in this embodiment of the present disclosure.

In some implementations, the similarity loss between each pair of the first predicted object sequence and the second predicted object sequence having the target matching relationship may be determined, and each similarity loss may be accumulated to obtain the first loss value. For example, the first loss value may be determined in a manner as shown in the following formula 2:

Among them, N is the number of pairs of the first prediction object sequence and the second prediction object sequence with the target matching relationship, and N is a positive integer; _si is the first prediction object sequence,

is the second prediction object sequence that has a target matching relationship with _si .

is the similarity loss algorithm,

is the determined first loss value.

Step S307, based on each pair of the first predicted object sequence and the candidate object having the target matching relationship, determine a second loss value.

Here, any suitable loss function may be used to determine the second loss value between each pair of the first predicted object sequence and the candidate object having the target matching relationship, which is not limited in this embodiment of the present disclosure. The loss function may include but not limited to at least one of a similarity loss function, a focus loss function, an intersection loss function, a generalized intersection loss function, and the like.

Step S308: Determine a target loss value based on the first loss value and the second loss value.

Here, the target loss value may be determined based on the first loss value and the second loss value in an appropriate manner according to actual conditions, which is not limited in this embodiment of the present disclosure. For example, the sum of the first loss value and the second loss value can be determined as the target loss value, or the average value of the first loss value and the second loss value can be determined as the target loss value, and different weights can be used to determine the target loss value The first loss value and the second loss value are weighted and summed to obtain the target loss value.

Step S309, if the target loss value does not meet the preset condition, update the model parameters of the first model to obtain an updated first model.

Step S310, based on the updated first model, determine the trained first model.

Here, the above-mentioned steps S309 to S310 correspond to the above-mentioned steps S205 to S206 respectively, and for implementation, reference may be made to the implementation manners of the above-mentioned steps S205 to S206.

It should be noted that the execution order of the various steps is not limited to the order shown in FIG. 3 . For example, step S304 may be performed before step S301, step S304 may be performed after step S306, and step S307 may be performed after step S302 and before step S303; this is not limited in this embodiment of the present disclosure.

In the embodiment of the present disclosure, based on the similarity between each pair of the first prediction object sequence and the second prediction object sequence with the target matching relationship, the first loss value is determined, and based on each pair of the first prediction object sequence with the target matching relationship The object sequence and candidate objects are used to determine a second loss value, and based on the first loss value and the second loss value, a target loss value is determined. Since the candidate object category of each candidate object can be used as the pseudo-label of the candidate object region of the candidate object, based on each pair of the first predicted object sequence and the candidate object with the target matching relationship, the determined second loss value can be The object location prediction ability of the first model provides objective supervision, thereby improving the object location ability of the trained first model, and further improving the detection accuracy of the trained first model.

In some embodiments, the above step S307 may include the following steps S321 to S322:

Step S321, for each pair of the first predicted object sequence and the candidate object having the target matching relationship, based on the first object region corresponding to the first predicted object sequence and the candidate object region of the candidate object, determine a first sub- A loss value, and determine a second sub-loss value based on the first object category corresponding to the first predicted object sequence and the candidate object category of the candidate object.

Here, any suitable loss function can be used to determine the first sub-loss value between the first object region and the candidate object region, and the second sub-loss value between the first object category and the candidate object category. This is not limited. For example, the intersection ratio loss function, generalized intersection ratio loss function, etc. can be used to determine the first sub-loss value between the first object area and the candidate object area, and the focus loss function can be used to determine the first sub-loss value between the first object category and the candidate object category. The second sub-loss value between.

Step S322: Determine a second loss value based on each of the first sub-loss values and each of the second sub-loss values.

Here, the second loss value may be determined based on the first sub-loss value and the second sub-loss value in an appropriate manner according to actual conditions, which is not limited in this embodiment of the present disclosure. For example, the sum of the first sub-loss value and the second sub-loss value may be determined as the second loss value, or the average value of the first sub-loss value and the second sub-loss value may be determined as the second loss value, or The first sub-loss value and the second sub-loss value are weighted and summed with different weights to obtain the second loss value.

In some embodiments, each of the first sub-loss values, each of the second sub-loss values, and each pair of the first prediction object sequence and the second prediction object sequence having a target matching relationship can be The target loss value is obtained after the weighted summation of the similarity losses between them. For example, the target loss value can be determined in the manner shown in the following formula 3:

is the second predicted object sequence that has a target matching relationship with _si ,

is the first predictor sequence s _i and the second predictor sequence

The similarity loss between; _ci is the first object category corresponding to the first predicted object sequence _si ,

is the candidate object category of the candidate object that has a target matching relationship with _si ,

is the first object category c _i and the candidate object category calculated using the intersection loss function

The first sub-loss value between;

Indicates that _ci is 0 when it is empty, and 1 when ci is not empty; _{b i} _is the first object area corresponding to the first prediction object sequence _si ,

is the candidate object region of the candidate object that has a target matching relationship with _si ,

is the first object region _bi and the candidate object region calculated using the generalized intersection ratio loss function

The second sub-loss value between; λ _f , λ _b and λ _e are the weights of the first sub-loss value, the second sub-loss value and the similarity loss respectively;

is the first predictor sequence y and the second predictor sequence

The target loss value between.

In the above embodiment, for each pair of the first predicted object sequence and the candidate object having a target matching relationship, a first sub-loss value is determined based on the first object region corresponding to the first predicted object sequence and the candidate object region of the candidate object , and based on the first object category corresponding to the first predicted object sequence and the candidate object category of the candidate object, determine a second sub-loss value; based on each first sub-loss value and each second sub-loss value, determine the second loss value. In this way, the object region regression in the first model detection and the self-supervised representation learning process of the object category can be realized simultaneously, so that the detection accuracy of the trained first model can be improved.

An embodiment of the present disclosure provides a model training method, which can be executed by a processor of a computer device. Fig. 4 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 4, the method includes the following steps S401 to S404:

Step S401, acquiring a first augmented image and a second augmented image obtained by respectively performing augmentation processing on a first image sample.

Step S402, use the first model to be trained to perform object detection on the first augmented image to obtain at least one first detection result, and use the second model to perform object detection on the second augmented image to obtain At least one second detection result; the first detection result includes a first predicted object sequence and a first object area and a first object category corresponding to the first predicted object sequence, and the second detection result includes a second predicted object sequence An object sequence and a second object area and a second object category corresponding to the second predicted object sequence.

Here, the above step S401 to step S402 respectively correspond to the above step S101 to step S102, and the implementation manner of the above step S101 to step S102 can be referred to for implementation.

Wherein, the second object area may be obtained by predicting the position area of the predicted object represented by the second predicted object sequence in the second augmented image, and may be a detection frame of the predicted object. The second object category may be obtained by predicting the object category of the predicted object represented by the second sequence of predicted objects.

Step S403, based on the first object region and the first object category corresponding to each of the first predicted object sequences, and the second object region and the second object category corresponding to each of the second predicted object sequences, for each The first predictor sequence and each of the second predictor sequences perform bipartite graph matching to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship.

Here, any suitable bipartite graph matching algorithm can be used to match each first predictor sequence with each second predictor sequence to obtain at least one pair of first predictor sequence and second predictor sequence with target matching relationship sequence. For example, the bipartite graph matching algorithm used may include but not limited to at least one of the Hungarian matching algorithm, the maximum flow matching algorithm, and the like. During implementation, any suitable manner may be used to calculate the matching loss used in the bipartite graph matching process, which is not limited here. For example, the matching loss used in the bipartite graph matching process may be determined based on at least one of the following: the similarity between each pair of first predicted object sequences and second predicted object sequences that match each other, each pair of first predicted object sequences that match each other The intersection ratio between the first object region and the second object region corresponding to the first prediction object sequence and the second prediction object sequence, and the first prediction object sequence and the second prediction object sequence respectively corresponding to each matching pair Focal loss between one object class and a second object class, etc.

Step S404, based on each pair of the first predicted object sequence and the second predicted object sequence having the target matching relationship, update the model parameters of the first model at least once to obtain the trained first model.

Here, the above-mentioned step S404 corresponds to the above-mentioned step S104, and the implementation of the above-mentioned step S104 can be referred to for implementation.

In some embodiments, the above step S403 may include the following steps S411 to S413:

Step S411, based on each of the first predictor sequences and each of the second predictor sequences, determine at least one set of candidate sequence pairs; each set of candidate sequence pairs includes at least one pair of candidates with a matching relationship The first sequence of predictors and the second sequence of predictors.

Here, any suitable manner may be used to perform one-to-one matching on each first predictor sequence and each second predictor sequence to obtain at least one candidate sequence pair set, which is not limited in this embodiment of the present disclosure. For example, at least one random match may be performed on each first predictor sequence and each second predictor sequence to obtain at least one candidate sequence pair set.

Step S412, for each set of candidate sequence pairs, based on the first predictor sequence corresponding to the first predictor sequence in the second predictor sequence and each pair of candidate sequence pair sets in the set of candidate sequence pairs. The object region and the first object category, and the second object region and the second object category corresponding to the second predicted object sequence determine the matching loss of the set of candidate sequence pairs.

Here, any suitable manner may be used to calculate the matching loss of the set of candidate sequence pairs.

In some implementations, based on the intersection ratio between the first object area and the second object area corresponding to each pair of the first prediction object sequence and the second prediction object sequence that match each other in the candidate sequence pair set, the mutual The focal loss between the first object category and the second object category respectively corresponding to each pair of the first predicted object sequence and each second predicted object sequence is used to determine the matching loss of the set of candidate sequence pairs.

For example, the matching loss of the candidate sequence pair set can be calculated in the manner shown in the following formula 4:

Wherein, N is the number of pairs of the first prediction object sequence and the second prediction object sequence having a target matching relationship, and N is a positive integer;

denotes the Hungarian matching loss,

Representing at least one pair of a first prediction target sequence and a second prediction target sequence that match each other in the candidate sequence pair set;

is the second object category corresponding to the second prediction object sequence in the i-th pair of the first prediction object sequence and the second prediction object sequence having a target matching relationship,

is the first object category of the first predicted object sequence that has a target matching relationship with the second predicted object sequence is

confidence level;

exist

Take 0 when empty,

Take 1 when it is not empty; b _i is the first object region corresponding to the first prediction object sequence in the i-th pair of the first prediction object sequence and the second prediction object sequence with the target matching relationship,

is the second target area of the second predicted target sequence having a target matching relationship with the first predicted target sequence,

is the loss value between the first object region b _i and the second object region b _σ(i) calculated using the generalized intersection ratio loss function.

Step S413, determining each pair of the first prediction object sequence and the second prediction object sequence having a candidate matching relationship in the candidate sequence pair set with the smallest matching loss in the at least one candidate sequence pair set as at least one pair with the target The first predictor sequence and the second predictor sequence of the matching relationship.

In the embodiment of the present disclosure, each first predictor sequence and each second predictor sequence are matched by using bipartite graph matching, which can improve the determined at least one pair of first predictor sequence and each second predictor sequence with the target matching relationship. The second predicts the accuracy of the target matching relationship between the object sequences, thereby improving the detection accuracy of the trained first model.

An embodiment of the present disclosure provides a model training method, which can be executed by a processor of a computer device. Fig. 5 is a schematic diagram of the implementation flow of a model training method provided by an embodiment of the present disclosure. As shown in Fig. 5, the method includes the following steps S501 to S506:

Step S501, acquiring a first augmented image and a second augmented image obtained by respectively performing augmentation processing on a first image sample.

Step S502, use the first model to be trained to perform object detection on the first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the second model to perform object detection on the second augmented image Target detection is performed on the wide image, and at least one second detection result including the second predicted object sequence is obtained.

Step S503, matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship.

Step S504, based on each pair of the first predicted object sequence and the second predicted object sequence having the target matching relationship, update the model parameters of the first model at least once to obtain the trained first model.

Here, the above-mentioned steps S501 to S504 correspond to the above-mentioned steps S101 to S104 respectively, and for implementation, reference may be made to the implementation manners of the above-mentioned steps S101 to S104.

Step S505, based on the trained first model, determine an initial third model.

Here, in some implementation manners, the feedforward neural network in the trained first model may be adjusted according to an actual target detection scenario, and the adjusted first model may be determined as the initial third model.

In some embodiments, the first model includes a feature extraction network, a converter network, and a first feedforward neural network, a second feedforward neural network, and a third feedforward neural network connected to the converter network; the first The feedforward neural network, the second feedforward neural network and the third feedforward neural network are respectively used to output the first predicted object sequence, the first object region corresponding to the first predicted object sequence, and the first predicted object sequence. The first object category corresponding to the sequence; the first feedforward neural network in the first model after training can be removed, and the third feedforward neural network in the first model can be adjusted according to the actual target detection scene, The adjusted first model is determined as an initial third model.

Step S506, based on at least one second image sample, update the model parameters of the third model to obtain the trained third model.

Here, the second image sample may have label information or may not have label information. During implementation, those skilled in the art may determine an appropriate second image sample according to an actual target detection scene, which is not limited here.

In some implementation manners, the model parameters of the third model may be fine-tuned and trained based on at least one second image sample to obtain the trained third model.

In the embodiment of the present disclosure, an initial third model is determined based on the trained first model, and model parameters of the third model are updated based on at least one second image sample to obtain a trained third model. In this way, the model parameters of the trained first model can be transferred to other target detection models to be applied to various target detection scenarios, which can improve the training efficiency of the third model and the detection accuracy of the trained third model.

An embodiment of the present disclosure provides an image processing method, which can be executed by a processor of a computer device. FIG. 6 is a schematic diagram of an implementation flow of an image processing method provided by an embodiment of the present disclosure. As shown in FIG. 6, the method includes the following steps S601 to S602:

Step S601, acquiring an image to be processed;

Step S602, using the trained fourth model to perform object detection on the image to be processed to obtain a third detection result; wherein, the fourth model includes at least one of the following: using the model training described in the above-mentioned embodiments The first model obtained by the method is the third model obtained by using the model training method described in the above embodiment.

Here, the image to be processed can be any suitable image to be detected. During implementation, those skilled in the art can select an appropriate image to be processed according to the actual application scenario, which is not limited by the embodiments of the present disclosure.

In the embodiments of the present disclosure, due to the model training method described in the above embodiments, the first augmented image and the second augmented image obtained after processing the first augmented image and the second augmented image of the same image sample can be obtained by maintaining the first model and the second model respectively. The consistency between the first predicted object sequence and the second predicted object sequence realizes the sequence-level self-supervised training process of the target detection model, and can train the overall network structure of the target detection model, so that the entire target detection model can be effectively improved. Therefore, based on at least one of the first model and the third model obtained by adopting the model training method described in the above embodiments, performing target detection on the image to be processed can improve the accuracy of target detection.

An embodiment of the present disclosure provides a pre-training method of a self-supervised target detection model based on Transformer sequence consistency, the method can be executed by a processor of a computer device, and the method can use unlabeled data to carry out the overall network structure of the target detection model Training, and based on the sequence characteristics of the Transformer, it can simultaneously realize the object region regression in the target detection model detection and the self-supervised representation learning process of the object category. FIG. 7A is a schematic diagram of an implementation process of model training based on a pre-training method provided by an embodiment of the present disclosure. As shown in FIG. 7A, the method may include the following steps S701 to S703:

Step S701, acquire at least one candidate object in the first image sample in an unsupervised manner, each of the candidate objects has a candidate object area and a candidate object category.

During implementation, any suitable unsupervised detection algorithm may be used to detect the target object in the first image sample to obtain at least one candidate object. For example, a selective search algorithm may be employed to unsupervisedly obtain at least one candidate object with a high recall rate from the first image sample.

Step S702, using the pre-training method of the self-supervised target detection model based on Transformer sequence consistency to pre-train the first model.

In some embodiments, the model training architecture as shown in FIG. 7B can be used to realize the pre-training method of the self-supervised target detection model based on Transformer sequence consistency. As shown in FIG. 7B, the model training architecture includes the first model 10 and The second model 20, wherein the network structure of the first model 10 and the second model 20 is the same, and both include a convolutional neural network (Convolutional Neural Networks, CNN) 11 or 21, a Transformer encoder 12 or 22, and a Transformer decoder 13 Or 23, and feedforward neural network (Feed-Forward Networks, FFN) 14 or 24, feedforward neural network can comprise the first feedforward neural network, the second feedforward neural network and the 3rd feedforward neural network; In model training In the process of , the inputs of the first model 10 and the second model 20 are respectively the first augmented image and the second augmented image obtained after augmenting the first image sample 30, wherein the input of the first model 10 The perturbation of the first augmented image contains more color level perturbations. The processes of the first model 10 and the second model 20 performing target detection on the first augmented image and the second augmented image respectively are the same, taking the process of the first model 10 performing target detection on the first augmented image as an example, After the convolutional neural network 11 is used to extract the features of the first augmented image, a position code 40 will be added to the extracted features, and the Transformer encoder 12 and Transformer decoder 13 will be used to process the features after adding the position code. After the processing of the Transformer encoder 12 and the Transformer decoder 13, at least one feature sequence 31 representing the predicted object can be obtained, and the first feedforward neural network, the second feedforward neural network and the third feedforward neural network are used for each feature Sequence 31 is processed, and for each feature sequence 31, the first predicted object sequence Prj1 output by the first feedforward neural network, and the first object region corresponding to the first predicted object sequence output by the second feedforward neural network can be obtained Bx1, the first object category Cls1 corresponding to the first predicted object sequence output by the third feed-forward neural network, correspondingly, after the second augmented image is processed by the second model 20, the feature sequence 32, the second A prediction target sequence Prj2, a second target region Bx2 corresponding to the second prediction target sequence, and a second target class Cls2 corresponding to the second prediction target sequence. For the output results of the first model 10 and the second model 20, at least one first predictor sequence Prj1 and at least one second predictor sequence Prj2 can be matched using a bipartite graph matching algorithm to obtain at least one pair of objects with a target matching relationship. The first predictor sequence and the second predictor sequence (such as the first predictor sequence corresponding to the first target region Bx1-1 and the second predictor sequence corresponding to the second target region Bx2-1, corresponding to the first target region Bx1 The first prediction target sequence of -4 and the second prediction target sequence of the second target area Bx2-2, the first prediction target sequence corresponding to the first target area Bx1-4 and the second prediction of the second target area Bx2-3 object sequence, the first predicted object sequence corresponding to the first object region Bx1-4 and the second predicted object sequence corresponding to the second object region Bx2-4), and then based on at least one pair of the first predicted object sequence with the target matching relationship and the second predicted object sequence, using the absolute value loss function to calculate the similarity loss, based on the similarity loss, the target loss value can be determined, and based on the target loss value, the network parameters of the first model 10 and the network of the second model 20 The parameters are updated to improve the consistency of the Transformer feature sequence of the augmented image after different augmentation processes on the same image sample; wherein, the network parameters of the first model 10 can be updated in a gradient update manner, and the second model 20 The update of the network parameters can adopt the design of stopping gradient, and update the momentum based on the current network parameters of the first model 10 . Wherein, the bipartite graph matching algorithm is a set-based matching method, and the input of the bipartite graph matching algorithm is at least one first prediction object sequence and at least one second prediction object sequence respectively output by the first model 10 and the second model 20, And the confidence degree of the first object region and the first object category corresponding to each first predicted object sequence, and the confidence degree of the second object region and the second object category corresponding to each second predicted object sequence. Compared with the one-to-one sequence matching based on time series, the bipartite graph matching algorithm can find a better sequence matching pair (that is, the first prediction object sequence and the second prediction object sequence with the target matching relationship), and for the first model Self-supervised learning brings more beneficial information, and ultimately improves the efficiency and accuracy of self-supervised learning.

In some implementations, the target loss value considered in the process of updating the network parameters of the first model 10 and the network parameters of the second model 20 may also include at least one first predicted object sequence corresponding to the output of the first target detection network The difference between the first object region of the first object region and the candidate object region of at least one candidate object, and the difference between the first object category corresponding to each first predicted object sequence and the candidate object category of each candidate object. During implementation, the bipartite graph matching algorithm can be used to match the first object region and the first object category corresponding to each first predicted object sequence, as well as the candidate object region and the candidate object category of each candidate object, and then use the generalized intersection and merge ratio The function determines the first sub-loss value between the first object region and the candidate object region corresponding to each pair of the first predicted object sequence and the candidate object with the target matching relationship, and uses the focal loss function to determine that each pair has the target matching relationship The second sub-loss value between the first object category and the candidate object category corresponding to the first predicted object sequence and the candidate object, based on each first sub-loss value, each second sub-loss value and each pair with The similarity loss between the first predicted object sequence and the second predicted object sequence of the target matching relationship may determine a target loss value.

Step S703, migrating the pre-trained first model to the target detection task.

Here, according to the target detection tasks in different target detection scenarios (such as at least one application scenario such as industrial quality inspection, industrial inspection, medical scene detection, automatic driving, etc.), the first previous model in the trained first model can be The feed-forward neural network is removed, and the number of output categories of the third feed-forward neural network in the first model is adjusted according to the actual target detection task, and the adjusted first model is determined as the initial third model, and then the The model parameters of the third model are fine-tuned and trained to obtain a third model that can be used for the target detection task.

FIG. 8 is a schematic diagram of the composition and structure of a model training device provided by an embodiment of the present disclosure. As shown in FIG. An updating part 840, wherein: the first acquiring part 810 is configured to acquire the first augmented image and the second augmented image obtained after augmenting the first image sample respectively; the first detecting part 820 is configured In order to use the first model to be trained to perform target detection on the first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the second model to perform target detection on the second augmented image, Obtain at least one second detection result including the second predictor sequence; the first matching part 830 is configured to match each first predictor sequence with each second predictor sequence, and obtain at least one pair with target matching The first predictor sequence and the second predictor sequence of the relationship; the first update part 840 is configured to update the model of the first model based on each pair of the first predictor sequence and the second predictor sequence with the target matching relationship The parameters are updated at least once to obtain the trained first model.

In some embodiments, the first updating part is further configured to: determine the target loss value based on the similarity between each pair of the first predictor sequence and the second predictor sequence having the target matching relationship; If the preset condition is not satisfied, the model parameters of the first model are updated to obtain an updated first model; based on the updated first model, the trained first model is determined.

In some embodiments, the first updating part is further configured to: update the model parameters of the first model and the model parameters of the second model respectively when the target loss value does not meet the preset condition, and obtain the updated The first model and the updated second model; based on the updated first model and the updated second model, the trained first model is determined.

In some embodiments, the first update part is further configured to: update the momentum of the model parameters of the second model based on the current model parameters of the first model to obtain the updated second model; The current model parameters of the first model are updated to obtain the updated first model.

In some embodiments, the first updating part is further configured to: respectively determine the first augmented image and the second augmented image obtained after augmenting the next first image sample as the current first augmented image. The widened image and the current second augmented image; use the currently updated first model to perform target detection on the current first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the current updated The second model of is to perform target detection on the current second augmented image, and obtain at least one second detection result including a second predicted object sequence; match each first predicted object sequence with each second predicted object sequence, Obtain at least one pair of the first predictor sequence and the second predictor sequence with the target matching relationship; based on the similarity between each pair of the first predictor sequence and the second predictor sequence with the target matching relationship, determine the current target Loss value: when the current target loss value satisfies the preset condition or the number of times the model parameters of the first model are updated reaches a number threshold, the currently updated first model is determined as the trained first model.

In some embodiments, the first updating part is further configured to: when the current target loss value does not meet the preset condition, respectively update the model parameters of the first model and the model parameters of the second model for the next time , to obtain the first model after the next update and the second model after the next update; based on the first model after the next update and the second model after the next update, determine the first model after training.

In some embodiments, the first detection result further includes a first object area and a first object category corresponding to the first predicted object sequence in the first detection result; the device further includes: a second acquisition part configured to acquire At least one candidate object in the first image sample, each candidate object has a candidate object area and a candidate object category; the second matching part is configured to be based on the first object area and the first object corresponding to each first sequence of predicted objects Category, and the candidate object area and candidate object category of each candidate object, matching each first predicted object sequence and each candidate object to obtain at least one pair of first predicted object sequence and candidate object with a target matching relationship; The first updating part is further configured to: determine a first loss value based on the similarity between each pair of the first predictor sequence and the second predictor sequence with the target matching relationship; The first prediction object sequence and the candidate object determine a second loss value; based on the first loss value and the second loss value, determine a target loss value.

In some embodiments, the first update part is further configured to: for each pair of the first predicted object sequence and the candidate object having the target matching relationship, based on the first object region corresponding to the first predicted object sequence and the candidate object The candidate object area, determine a first sub-loss value, and determine a second sub-loss value based on the first object category corresponding to the first predicted object sequence and the candidate object category of the candidate object; based on each first sub-loss The loss value and each second sub-loss value determine the second loss value.

In some embodiments, the second acquisition part is further configured to: use an unsupervised method to perform object detection on the first image sample to obtain at least one predicted object region and a pseudo-label of each predicted object region; each predicted object region The pseudo-label of is used to represent the prediction object category of the prediction object region; for each prediction object region, the prediction object region is used as a candidate object region, and the pseudo-label of the prediction object region is used as a candidate object category to obtain a candidate object.

In some embodiments, the first detection result further includes the first object region and the first object category corresponding to the first predicted object sequence in the first detection result, and the second detection result further includes the first object category corresponding to the first predicted object sequence in the second detection result. The second object region and the second object category corresponding to the two predicted object sequences; the first matching part is further configured to: based on the first object region and the first object category corresponding to each first predicted object sequence, and each second For the second object area and the second object category corresponding to the predicted object sequence, perform bipartite graph matching on each first predicted object sequence and each second predicted object sequence to obtain at least one pair of first predicted object sequences with a target matching relationship and the second predictor sequence.

In some embodiments, the first matching part is further configured to: determine at least one candidate sequence pair set based on each first predictor sequence and each second predictor sequence; each candidate sequence pair set includes at least one For the first prediction target sequence and the second prediction target sequence with a candidate matching relationship; for each candidate sequence pair set, based on each pair of the first prediction target sequence and the second prediction target sequence with a candidate matching relationship in the candidate sequence pair set The first object region and the first object category corresponding to the first predicted object sequence in the object sequence, and the second object region and the second object category corresponding to the second predicted object sequence, determine the matching loss of the candidate sequence pair set; at least Each pair of the first prediction object sequence and the second prediction object sequence with a candidate matching relationship in the candidate sequence pair set with the smallest matching loss in a candidate sequence pair set is determined as at least one pair of first prediction objects with a target matching relationship sequence and the second predictor sequence.

In some embodiments, the first model includes a feature extraction network and a converter network; the first detection part is further configured to: use the feature extraction network of the first model to perform feature extraction on the first augmented image to obtain image feature information ; Using the converter network of the first model to perform prediction processing on the image feature information to obtain at least one sequence of first prediction objects.

In some embodiments, the first model further includes a first feed-forward neural network; the first detection part is further configured to: use the converter network of the first model to predict image feature information to obtain at least one feature sequence; Using the first feed-forward neural network, each feature sequence is mapped to the target dimension to obtain at least one first sequence of predicted objects.

In some embodiments, the first detection result also includes the first object region and the first object category, and the first model also includes a second feedforward neural network and a third feedforward neural network; the first detection part is also configured as: For each feature sequence, the second feedforward neural network is used to perform area prediction on the feature sequence to obtain the first object area, and the third feedforward neural network is used to perform category prediction on the feature sequence to obtain the first object category.

In some embodiments, the second model has the same network structure as the first model.

In some embodiments, the first acquisition part is further configured to: perform a first image augmentation process on the first image sample to obtain a first augmented image; perform a second image augmentation process on the first image sample to obtain a second image augmentation process Two augmented images.

In some embodiments, the first image augmentation process includes at least one of the following: color dithering, grayscale processing, Gaussian blur, random erasure; the second image augmentation process includes at least one of the following: random scaling, random cropping, Flip randomly, resize randomly.

In some embodiments, the apparatus further includes: a determining part configured to determine an initial third model based on the trained first model; a second updating part configured to determine an initial third model based on at least one second image sample The model parameters of the third model are updated to obtain the trained third model.

FIG. 9 is a schematic diagram of the composition and structure of an image processing device provided by an embodiment of the present disclosure. As shown in FIG. 9 , the image processing device 900 includes: a third acquisition part 910 and a second detection part 920, wherein: the third acquisition part 910 , is configured to acquire the image to be processed; the second detection part 920 is configured to use the trained fourth model to perform object detection on the image to be processed to obtain a third detection result; the fourth model includes at least one of the following: using the above The first model obtained by the model training method described in the embodiment, and the third model obtained by the model training method described in the above embodiment are used.

The description of the above device embodiment is similar to the description of the above method embodiment, and has similar beneficial effects as the method embodiment. For technical details not disclosed in the device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

In the embodiments of the present disclosure and other embodiments, a "part" may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course it may also be a unit, a module or a non-modular one.

It should be noted that, in the embodiment of the present disclosure, if the above-mentioned model training method or image processing method is implemented in the form of software function parts, and sold or used as an independent product, it can also be stored in a computer-readable storage medium middle. Based on this understanding, the essence of the technical solution of the embodiments of the present disclosure or the part that contributes to the related technology can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions to make a A computer device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes: various media that can store program codes such as U disk, mobile hard disk, read-only memory (Read Only Memory, ROM), magnetic disk or optical disk. As such, embodiments of the present disclosure are not limited to any specific combination of hardware and software.

An embodiment of the present disclosure provides a computer device, including a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the steps in the above method when executing the program.

An embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in the above method are implemented. The computer readable storage medium may be transitory or non-transitory.

An embodiment of the present disclosure provides a computer program product. The computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, a part or part of the above-mentioned method is implemented. All steps. The computer program product can be realized by hardware, software or a combination thereof. In some embodiments, the computer program product is embodied as a computer-readable storage medium, and in other embodiments, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.

It should be pointed out here that: the above descriptions of the storage medium, computer program product, and device embodiments are similar to the descriptions of the above method embodiments, and have similar beneficial effects to those of the method embodiments. For technical details not disclosed in the storage medium, computer program product, and device embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure for understanding.

It should be noted that FIG. 10 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present disclosure. As shown in FIG. 10 , the hardware entity of the computer device 1000 includes: a processor 1001, a communication interface 1002, and a memory 1003, wherein: The processor 1001 usually controls the overall operation of the computer device 1000; the communication interface 1002 can enable the computer device to communicate with other terminals or servers through the network; the memory 1003 is configured to store instructions and applications executable by the processor 1001, and can also cache 1001 and the data to be processed or processed by each part of the computer device 1000 (for example, image data, audio data, voice communication data and video communication data), can be stored in flash memory (FLASH) or random access memory (Random Access Memory, RAM) accomplish. Data transmission can be performed between the processor 1001 , the communication interface 1002 and the memory 1003 through the bus 1004 .

It should be understood that reference throughout the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout the specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the order of execution, and the execution order of the processes should be determined by their functions and internal logic, rather than by the embodiments of the present disclosure. The implementation process constitutes any limitation. The serial numbers of the above-mentioned embodiments of the present disclosure are for description only, and do not represent the advantages and disadvantages of the embodiments. It should be noted that, in this document, the term "comprising", "comprising" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or apparatus comprising a set of elements includes not only those elements, It also includes other elements not expressly listed, or elements inherent in the process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element.

In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the parts is only a logical function division. In actual implementation, there may be other division methods, such as: multiple parts or components can be combined, or May be integrated into another system, or some features may be ignored, or not implemented. In addition, the coupling, or direct coupling, or communication connection between the various components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or parts may be in electrical, mechanical or other forms. of.

The part described above as a separate component may or may not be physically separated, and the part shown as a part may or may not be a physical part; it may be located in one place or distributed to multiple network parts; Some or all of them can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, each functional part in each embodiment of the present disclosure may be fully integrated into one processing part, or each part may be separately regarded as one part, or two or more parts may be integrated into one part; the above-mentioned integration The part can be implemented in the form of hardware, or in the form of hardware plus software function.

Those of ordinary skill in the art can understand that all or part of the steps to realize the above method embodiments can be completed by hardware related to program instructions, and the aforementioned programs can be stored in computer-readable storage media. When the program is executed, the execution includes: The steps in the foregoing method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks, or optical disks.

Alternatively, if the above-mentioned integrated part of the present disclosure is realized in the form of software function part and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solution of the present disclosure or the part that contributes to the related technology can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a A computer device (which may be a personal computer, a server, or a network device, etc.) executes all or part of the methods described in various embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program codes such as removable storage devices, ROMs, magnetic disks or optical disks.

The above is only the embodiment of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Anyone skilled in the art can easily think of changes or substitutions within the technical scope of the present disclosure, and should within the protection scope of the present disclosure.

Claims

A model training method, the method comprising:

Acquiring a first augmented image and a second augmented image respectively obtained by augmenting the first image sample;

Using the first model to be trained, perform target detection on the first augmented image to obtain at least one first detection result including the first predicted object sequence, and use the second model to perform target detection on the second augmented image Target detection, obtaining at least one second detection result including a second predicted object sequence;

Matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship;

Based on each pair of the first predicted object sequence and the second predicted object sequence having a target matching relationship, the model parameters of the first model are updated at least once to obtain the trained first model.
The method according to claim 1, wherein, based on each pair of the first predicted object sequence and the second predicted object sequence with target matching relationship, the model parameters of the first model are updated at least once to obtain training After the first model, including:

Determining a target loss value based on the similarity between each pair of the first predictor sequence and the second predictor sequence having a target matching relationship;

When the target loss value does not meet the preset condition, update the model parameters of the first model to obtain the updated first model;

Based on the updated first model, the trained first model is determined.
The method according to claim 2, wherein, when the target loss value does not meet a preset condition, updating the model parameters of the first model to obtain an updated first model includes:

When the target loss value does not meet the preset condition, the model parameters of the first model and the model parameters of the second model are respectively updated to obtain the updated first model and the updated second model. Model;

The determining the trained first model based on the updated first model includes:

Based on the updated first model and the updated second model, the trained first model is determined.
The method according to claim 3, wherein the model parameters of the first model and the model parameters of the second model are respectively updated to obtain an updated first model and an updated second model, include:

Based on the current model parameters of the first model, update the momentum of the model parameters of the second model to obtain an updated second model;

The current model parameters of the first model are updated in a gradient update manner to obtain an updated first model.
The method according to claim 3 or 4, wherein said determining the trained first model based on the updated first model and the updated second model comprises:

Determining the first augmented image and the second augmented image obtained after augmenting the next first image sample respectively as the current first augmented image and the current second augmented image;

Using the currently updated first model, perform target detection on the current first augmented image to obtain at least one first detection result including a first predicted object sequence, and use the currently updated second model to perform target detection on the current first augmented image. Perform target detection on the current second augmented image, and obtain at least one second detection result including a second predicted object sequence;

Matching each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of the first predictor sequence and the second predictor sequence having a target matching relationship;

Determine the current target loss value based on the similarity between each pair of the first predictor sequence and the second predictor sequence having a target matching relationship;

When the current target loss value satisfies the preset condition or the number of times the model parameters of the first model are updated reaches a number threshold, determine the currently updated first model as the trained model. Describe the first model.
The method according to claim 5, wherein said determining the trained first model based on the updated first model and the updated second model further comprises:

In the case that the current target loss value does not meet the preset condition, the model parameters of the first model and the model parameters of the second model are respectively updated next time to obtain the first model and the model parameters after the next update. The second model after the next update;

A trained first model is determined based on the next updated first model and the next updated second model.
The method according to any one of claims 2 to 6, wherein the first detection result further includes a first object region and a first object category corresponding to the first predicted object sequence in the first detection result ; The method also includes:

acquiring at least one candidate object in the first image sample, each of the candidate objects having a candidate object region and a candidate object class;

Based on the first object region and the first object category corresponding to each of the first predicted object sequences, and the candidate object region and the candidate object category of each of the candidate objects, for each of the first predicted object sequences and each Matching the candidate objects to obtain at least one pair of the first predicted object sequence and the candidate object having a target matching relationship;

The determining the target loss value based on the similarity between each pair of the first predictor sequence and the second predictor sequence having a target matching relationship includes:

Determining a first loss value based on the similarity between each pair of the first predictor sequence and the second predictor sequence having a target matching relationship;

Determining a second loss value based on each pair of the first predicted object sequence and the candidate object having a target matching relationship;

Based on the first loss value and the second loss value, a target loss value is determined.
The method according to claim 7, wherein said determining the second loss value based on each pair of the first predicted object sequence and the candidate object having a target matching relationship comprises:

For each pair of the first predicted object sequence and the candidate object having a target matching relationship, determine a first sub-loss value based on the first object region corresponding to the first predicted object sequence and the candidate object region of the candidate object, and determining a second sub-loss value based on the first object category corresponding to the first predicted object sequence and the candidate object category of the candidate object;

A second loss value is determined based on each of the first sub-loss values and each of the second sub-loss values.
The method according to claim 7 or 8, wherein the acquiring at least one candidate object in the first image sample, each of the candidate objects having a candidate object area and a candidate object category, comprises:

In an unsupervised manner, perform target detection on the first image sample to obtain at least one predicted object region and a pseudo-label of each predicted object region; the pseudo-label of each predicted object region is used to characterize the prediction the predicted object category of the object area;

For each of the prediction object regions, the prediction object region is used as a candidate object region, and the pseudo-label of the prediction object region is used as a candidate object category to obtain a candidate object.
The method according to any one of claims 1 to 9, wherein the first detection result further includes a first object region and a first object category corresponding to the first predicted object sequence in the first detection result , the second detection result further includes a second object region and a second object category corresponding to a second predicted object sequence in the second detection result;

The matching of each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of first predictor sequences and second predictor sequences having a target matching relationship includes:

Based on the first object region and the first object category corresponding to each of the first predicted object sequences, and the second object region and the second object category corresponding to each of the second predicted object sequences, for each of the first predicted object sequences A bipartite graph matching is performed between a sequence of prediction objects and each of the second sequence of prediction objects to obtain at least one pair of the first sequence of prediction objects and the second sequence of prediction objects having a target matching relationship.
The method according to claim 10, wherein the first object area and the first object category corresponding to each of the first predicted object sequences, and the second objects corresponding to each of the second predicted object sequences area and the second object category, performing bipartite graph matching on each of the first prediction object sequences and each of the second prediction object sequences to obtain at least one pair of the first prediction object sequence and the second prediction object sequence having a target matching relationship sequence of objects, including:

Based on each of the first predictor sequences and each of the second predictor sequences, determine at least one candidate sequence pair set; each of the candidate sequence pair sets includes at least one pair of first predictors with a candidate matching relationship a sequence of objects and a second predicted sequence of objects;

For each set of candidate sequence pairs, based on the first object region and determining the matching loss of the set of candidate sequence pairs for the first object category, and the second object region and the second object category corresponding to the second predicted object sequence;

Determining each pair of the first prediction target sequence and the second prediction target sequence with a candidate matching relationship in the candidate sequence pair set with the smallest matching loss in the at least one candidate sequence pair set as at least one pair of target matching relationship The first sequence of predictors and the second sequence of predictors.
The method according to any one of claims 1 to 11, wherein the first model comprises a feature extraction network and a transformer network;

Using the first model to be trained to perform target detection on the first augmented image to obtain at least one first detection result including a first predicted object sequence, including:

Using the feature extraction network of the first model to perform feature extraction on the first augmented image to obtain image feature information;

Prediction processing is performed on the image feature information by using the converter network of the first model to obtain at least one sequence of first prediction objects.
The method of claim 12, wherein the first model further comprises a first feed-forward neural network;

Using the converter network of the first model to perform prediction processing on the image feature information to obtain at least one first prediction object sequence, including:

Predicting the image feature information by using the converter network of the first model to obtain at least one feature sequence;

Using the first feed-forward neural network, each of the feature sequences is mapped to a target dimension to obtain at least one first sequence of predicted objects.
The method according to claim 13, wherein the first detection result further includes a first object area and a first object category, and the first model further includes a second feedforward neural network and a third feedforward neural network;

The step of using the first model to be trained to perform object detection on the first augmented image to obtain at least one first detection result including the first predicted object sequence further includes:

For each feature sequence, using the second feedforward neural network to perform region prediction on the feature sequence to obtain a first object region, and using the third feedforward neural network to perform region prediction on the feature sequence Category prediction, get the first object category.
The method according to any one of claims 12 to 14, wherein the second model has the same network structure as the first model.
The method according to any one of claims 1 to 15, wherein said acquiring the first augmented image and the second augmented image respectively obtained after augmenting the first image sample includes:

performing a first image augmentation process on the first image sample to obtain a first augmented image;

Performing a second image augmentation process on the first image sample to obtain a second augmented image.
The method of claim 16, wherein,

The first image augmentation process includes at least one of the following: color dithering, grayscale processing, Gaussian blur, random erasing;

The second image augmentation process includes at least one of the following: random scaling, random cropping, random flipping, and random resizing.
The method according to any one of claims 1 to 17, wherein the method further comprises:

determining an initial third model based on the trained first model;

Based on at least one second image sample, the model parameters of the third model are updated to obtain the trained third model.
An image processing method, comprising:

Get the image to be processed;

Use the trained fourth model to perform target detection on the image to be processed to obtain a third detection result; wherein the fourth model includes at least one of the following: using any one of claims 1 to 17 The first model obtained by the model training method, and the third model obtained by the model training method according to claim 18.
A model training device, comprising:

The first acquisition part is configured to acquire a first augmented image and a second augmented image obtained by respectively augmenting the first image sample;

The first detection part is configured to use the first model to be trained to perform object detection on the first augmented image, obtain at least one first detection result including the first predicted object sequence, and use the second model to perform target detection on the first augmented image. Target detection is performed on the second augmented image, and at least one second detection result including a second predicted object sequence is obtained;

The first matching part is configured to match each of the first predictor sequences and each of the second predictor sequences to obtain at least one pair of first predictor sequences and second predictors having a target matching relationship sequence;

The first update part is configured to update the model parameters of the first model at least once based on each pair of the first predicted object sequence and the second predicted object sequence having a target matching relationship, and obtain the trained first predicted object sequence. a model.
An image processing device, comprising:

The third acquiring part is configured to acquire the image to be processed;

The second detection part is configured to use the trained fourth model to perform target detection on the image to be processed to obtain a third detection result; wherein the fourth model includes at least one of the following: using the method according to claim 1 The first model obtained by the model training method described in any one of to 17, adopts the third model obtained by the model training method described in claim 18.
A computer device, comprising a memory and a processor, the memory stores a computer program that can run on the processor, and the processor implements the method described in any one of claims 1 to 19 when executing the program step.
A computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps in the method according to any one of claims 1 to 19 are realized.
A computer program product, the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and when the computer program is read and executed by a computer, it realizes any one of claims 1 to 19 steps in the method.