CN113947195A - Model determination method and apparatus, electronic device, and storage medium - Google Patents

Model determination method and apparatus, electronic device, and storage medium

Info

Publication number
CN113947195A
Authority
CN
China
Prior art keywords
image
text
sample
model
target model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111212328.6A
Other languages
Chinese (zh)
Inventor
Wang Longchao
Sun Yipeng
Yao Kun
Han Junyu
Liu Jingtuo
Ding Errui
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111212328.6A
Publication of CN113947195A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a model determination method and apparatus, an electronic device and a storage medium, relating to the field of computer technologies, in particular to computer vision and deep learning, and applicable to scenes such as image processing and image recognition. The specific implementation scheme is as follows: acquiring a first image sample and a first text sample; training the first image sample and the first text sample to obtain a first target model, wherein the first target model learns local features of the first text sample; acquiring a second image sample and a second text sample, and training the first target model based on the second image sample and the second text sample to obtain a second target model, wherein the second target model learns global features of the second text sample; and determining the second target model as an initialization model of a third target model, thereby solving the technical problem of a low training effect of the initialization model.

Description

Model determination method and apparatus, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to computer vision and deep learning, applicable to scenes such as image processing and image recognition, and specifically to a model determination method and apparatus, an electronic device, and a storage medium.
Background
At present, pre-training schemes for models are usually implemented with a one-step training method, that is, image samples and text samples are directly input into a pre-training network to perform the pre-training task, so that the training index of the resulting initialization model is not high.
Disclosure of Invention
The disclosure provides a model determination method, a model determination apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, a model determination method is provided. The method may include the following steps: acquiring a first image sample and a first text sample, where text in the first text sample is used to give a textual description of a target image in the first image sample; training the first image sample and the first text sample to obtain a first target model, where the first target model learns local features of the first text sample; acquiring a second image sample and a second text sample, and training the first target model based on the second image sample and the second text sample to obtain a second target model, where text in the second text sample is used to give a textual description of a target image in the second image sample, and the second target model learns global features of the second text sample; and determining the second target model as an initialization model of a third target model.
According to an aspect of the present disclosure, another model determination method is provided. The method may include the following steps: sending a model training request to a server, where the model training request includes a first image sample and a first text sample, and text in the first text sample is used to give a textual description of a target image in the first image sample; and receiving an initialization model sent by the server in response to the model training request, where the initialization model is obtained by the server training a first target model based on a second image sample and a second text sample, the first target model is obtained by the server training the first image sample and the first text sample, the first target model learns local features of the first text sample, text in the second text sample is used to give a textual description of a target image in the second image sample, and the initialization model learns global features of the second text sample.
According to an aspect of the present disclosure, an image processing method is provided, including: acquiring an image to be processed; inputting the image to be processed into a third target model, where the third target model is obtained according to the model determination method of the embodiments of the present disclosure; and acquiring a processing result of the third target model.
According to an aspect of the present disclosure, a model determination apparatus is provided. The apparatus may include: a first acquisition unit, configured to acquire a first image sample and a first text sample, where text in the first text sample is used to give a textual description of a target image in the first image sample; a training unit, configured to train the first image sample and the first text sample to obtain a first target model, where the first target model learns local features of the first text sample; a processing unit, configured to acquire a second image sample and a second text sample and train the first target model based on the second image sample and the second text sample to obtain a second target model, where text in the second text sample is used to give a textual description of a target image in the second image sample, and the second target model learns global features of the second text sample; and a determining unit, configured to determine the second target model as an initialization model of a third target model.
According to an aspect of the present disclosure, another model determination apparatus is also provided. The apparatus may include: the sending unit is used for sending a model training request to the server, wherein the model training request comprises a first image sample and a first text sample, and texts in the first text sample are used for describing target images in the first image sample; the receiving unit is used for receiving an initialization model sent by the server in response to the model training request, wherein the initialization model is obtained by the server training a first target model based on a second image sample and a second text sample, the first target model is obtained by the server training the first image sample and the first text sample, the first target model learns the local features of the first text sample, a text in the second text sample is used for performing text description on a target image in the second image sample, and the initialization model learns the global features of the second text sample.
According to an aspect of the present disclosure, an image processing apparatus is also provided. The apparatus may include: a third acquiring unit, configured to acquire an image to be processed; the input unit is used for inputting the image to be processed into a third target model, wherein the third target model is obtained by the model determination method of the embodiment of the disclosure; and the fourth acquisition unit is used for acquiring the processing result of the third target model.
According to an aspect of the present disclosure, there is also provided an electronic device, which may include: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model determination method of the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the model determination method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the model determination method of embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1A is a flow chart of a method of model determination according to an embodiment of the present disclosure;
FIG. 1B is a flow chart of another model determination method according to an embodiment of the present disclosure;
FIG. 1C is a flow chart of a method of image processing according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a multi-stage image-text pre-training system according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a DeiT model structure according to an embodiment of the present disclosure;
FIG. 4 is a schematic illustration of matching image features and text features in accordance with an embodiment of the disclosure;
FIG. 5A is a schematic diagram of a model determination apparatus according to an embodiment of the present disclosure;
FIG. 5B is a schematic diagram of another model determination device according to an embodiment of the present disclosure;
FIG. 5C is a schematic diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1A is a flow chart of a model determination method according to an embodiment of the present disclosure. As shown in fig. 1A, the method may include the steps of:
Step S102, a first image sample and a first text sample are obtained.
In the technical solution provided in step S102 above of the present disclosure, the text (text data) in the first text sample is used to give a textual description of the target image (image data) in the first image sample.
The model determination method of this embodiment is a model determination method for image-text pre-training and may include two stages. In the first stage, image-text pre-training requires a large amount of data, and this embodiment may obtain a first image sample and a first text sample as the training samples of the first stage, where the first text sample corresponds to the first image sample. The first text sample may include a large number of texts, and the first image sample may include a large number of images (for example, pictures); each text may be used to give a textual description of a target image among the images in the first image sample, that is, the texts in the first text sample correspond one-to-one with target images in the first image sample, and each text and its corresponding target image may also be referred to as an image-text pair.
Alternatively, the embodiment may crawl the first image sample and the first text sample through an internet crawler.
Optionally, the first image sample and the first text sample of the embodiment may not need to be labeled and cleaned manually, so as to save labor cost.
Alternatively, the above method may be implemented in the data processing module of the first stage.
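For ease of understanding, the following is a minimal sketch of how such first-stage image-text pairs might be represented in code; the Python class and the sample records are illustrative assumptions and are not part of this disclosure.

```python
# A minimal sketch of first-stage training data; the class and the two
# sample records are illustrative, not taken from the patent.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ImageTextPair:
    image_path: str  # path to the target image
    caption: str     # text giving a textual description of that image

def load_first_stage_samples(records: List[Tuple[str, str]]) -> List[ImageTextPair]:
    # Each text corresponds one-to-one with a target image; noisy pairs
    # are kept as-is, without manual labeling or cleaning.
    return [ImageTextPair(image_path=p, caption=c) for p, c in records]

pairs = load_first_stage_samples([
    ("imgs/0001.jpg", "a brown dog running on grass"),
    ("imgs/0002.jpg", "a red bicycle leaning against a wall"),
])
```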
Step S104, the first image sample and the first text sample are trained to obtain a first target model.
In the technical solution provided in step S104 above of the present disclosure, after the first image sample and the first text sample are obtained, they may be trained to obtain a first target model, where the first target model learns local features of the first text sample.
In the first stage of this embodiment, the first image sample and the first text sample may be trained, for example by machine learning training such as contrastive learning (Contrastive Learning), so as to obtain a first target model, and the first target model may learn some local features of the first text sample.
Step S106, a second image sample and a second text sample are acquired, and the first target model is trained based on the second image sample and the second text sample to obtain a second target model.
In the technical solution provided in step S106 above of the present disclosure, after the first image sample and the first text sample are trained to obtain the first target model, the text in the second text sample is used to give a textual description of the target image in the second image sample, and the second target model learns the global features of the second text sample.
The model determination method of this embodiment includes a second stage. In the second stage, a second image sample and a second text sample are obtained as training samples of the second stage, where the second text sample corresponds to the second image sample, where the second text sample may include a large amount of texts, the second image sample may include a large amount of images, the images may include pictures, and each text may be used to perform text description on a target image of the large amount of images in the second image sample, that is, each text in the second text sample corresponds to a target image in the second image sample one to one, and each text in the second text sample and the corresponding target image may also be referred to as an image-text pair.
Alternatively, the embodiment may crawl the second image sample and the second text sample through an internet crawler.
Optionally, the method may be implemented in the data processing module of the second stage.
In this embodiment, the first target model may be further trained based on the second image sample and the second text sample to obtain the second target model; for example, the first target model may be trained by machine learning, such as contrastive learning, on the second image sample and the second text sample, so as to obtain the second target model, which may learn the global features, that is, the overall features, of the second text sample.
Step S108, the second target model is determined as an initialization model of the third target model.
In the technical solution provided in step S108 above of the present disclosure, after the first target model is trained based on the second image sample and the second text sample to obtain the second target model, the second target model is determined as an initialization model of the third target model.
In this embodiment, the second target model is determined as an initialization model of a third target model, which is used for training to obtain the third target model, and the third target model may be an image detection model, an image segmentation model, an image classification model, or the like.
It should be noted that the image detection model, image segmentation model and image classification model are merely examples of the third target model in the embodiments of the present disclosure; any model that can be obtained by training the initialization model falls within the scope of this embodiment, and the examples are not exhaustive here.
In the generation process of the initialization model of this embodiment, not only the local features of the first text sample but also the global features of the second text sample are considered, so that the pre-training index is improved, the pre-training index being an index used to express the training effect of the initialization model.
Through the above steps S102 to S108, a first image sample and a first text sample are obtained, where the text in the first text sample is used to give a textual description of a target image in the first image sample; the first image sample and the first text sample are trained to obtain a first target model, where the first target model learns local features of the first text sample; a second image sample and a second text sample are acquired, and the first target model is trained based on the second image sample and the second text sample to obtain a second target model, where the text in the second text sample is used to give a textual description of a target image in the second image sample, and the second target model learns the global features of the second text sample; and the second target model is determined as an initialization model of the third target model. That is to say, this embodiment adopts a multi-stage pre-training method: the model better learns some local features of the text in the first-stage pre-training and pays more attention to the global features of the text in the second stage, so that the pre-training index is improved, thereby solving the technical problem of a low training effect of the initialization model and achieving the technical effect of improving that training effect.
The above-described method of this embodiment is further described below.
As an optional implementation, the method further includes: acquiring entity words of the first text sample; and step S104, training the first image sample and the first text sample to obtain a first target model, includes: training the first image sample and the entity words to obtain the first target model.
In this embodiment, in the first stage, after the first text sample is obtained, entity words may be obtained from the first text sample, for example by extraction. Optionally, this embodiment may use a natural language processing (NLP) model to extract the entity words from the first text sample, and then train the first image sample and the entity words to obtain the first target model; for example, the first image sample and the entity words of the first text sample are trained by contrastive learning. This makes the first target model focus more on the entity words during training, so that some local features of the first text sample can be learned and the modal interaction effect is improved, thereby avoiding the problem of shallow modal interaction in the related art.
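As a non-authoritative illustration, the following sketch uses spaCy's small English pipeline as a stand-in for the NLP model (the disclosure does not name a specific model); entity words are approximated here by named entities, falling back to nouns.

```python
# Illustrative only: the disclosure does not name a specific NLP model, so
# spaCy's small English pipeline stands in as the entity-word extractor.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed to be installed

def extract_entity_words(caption: str):
    doc = nlp(caption)
    # Named entities, falling back to nouns, approximate the entity words
    # fed to the first-stage text encoder instead of the full caption.
    entities = [ent.text for ent in doc.ents]
    nouns = [tok.text for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    return entities or nouns

print(extract_entity_words("a red bicycle leaning against a wall"))
# e.g. ['bicycle', 'wall']
```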
As an alternative implementation, training the first image sample and the entity word to obtain the first target model includes: acquiring a first image characteristic of a first image sample; acquiring a first text characteristic of the entity word; and training the first image characteristic and the first text characteristic to obtain a first target model.
In this embodiment, when training the first image sample and the entity words to obtain the first target model, a first image feature of the first image sample may be obtained first: the first image sample may be input to an image encoder (Image Encoder), which extracts the first image feature from the first image sample, for example, I_1, I_2, ..., I_N. This embodiment may further obtain a first text feature of the entity words of the first text sample: the entity words are input to a text encoder (Text Encoder), which extracts the first text feature from them, for example, T_1, T_2, ..., T_N. After the first image feature and the first text feature are obtained, they may be trained, for example by contrastive learning, so as to obtain the first target model.
Alternatively, the image encoder of this embodiment may use a Data-efficient image Transformer (DeiT) model to extract the first image feature; DeiT brings the Transformer architecture from NLP to computer vision (CV).
Optionally, the text encoder of this embodiment may use a RoBERTa model to extract the first text feature. RoBERTa is an upgrade of the BERT language representation model: at the level of model details, it improves the optimization function; at the level of training strategy, it trains with dynamic masking instead, drops the Next Sentence Prediction (NSP) objective after demonstrating its shortcomings, and adopts a larger batch size; at the data level, it uses larger datasets on the one hand and Byte-Pair Encoding (BPE) to process text data on the other.
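The following sketch assumes the timm DeiT checkpoint and the Hugging Face roberta-base checkpoint as stand-ins for the image and text encoders, and shows how the first image features I_1, ..., I_N and first text features T_1, ..., T_N might be extracted and projected into a shared space; the 256-dimensional projection is an illustrative choice, not part of the disclosure.

```python
# A sketch of the two encoders; "deit_base_patch16_224" (timm) and
# "roberta-base" (Hugging Face) are illustrative stand-ins for the
# patent's image and text encoders, and the 256-d projection is assumed.
import torch
import timm
from transformers import RobertaModel, RobertaTokenizer

image_encoder = timm.create_model("deit_base_patch16_224",
                                  pretrained=True, num_classes=0)
text_encoder = RobertaModel.from_pretrained("roberta-base")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

img_proj = torch.nn.Linear(image_encoder.num_features, 256)
txt_proj = torch.nn.Linear(text_encoder.config.hidden_size, 256)

def encode(images: torch.Tensor, texts: list):
    # images: (N, 3, 224, 224) preprocessed batch; texts: N strings
    # (entity words in stage one, whole captions in stage two)
    img_feat = img_proj(image_encoder(images))                 # I_1 ... I_N
    tokens = tokenizer(texts, padding=True, return_tensors="pt")
    txt_feat = txt_proj(text_encoder(**tokens).pooler_output)  # T_1 ... T_N
    # L2-normalize so that the dot product I·T is a cosine similarity
    return (torch.nn.functional.normalize(img_feat, dim=-1),
            torch.nn.functional.normalize(txt_feat, dim=-1))
```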
As an optional implementation, training the first image feature and the first text feature to obtain the first target model includes: matching the plurality of first image features and the plurality of first text features to obtain a plurality of first matching results and a plurality of first unmatched results, wherein the first matching results comprise first image features and first text features which are successfully matched with each other, and the first unmatched results comprise first image features and first text features which are failed to be matched with each other; determining a first model parameter based on the plurality of first match results and the plurality of first non-match results; a first target model is determined based on the first model parameters.
In this embodiment, when the first image feature and the first text feature are trained to obtain the first target model, the plurality of first image features and the plurality of first text features may be matched against each other, for example, I_1, I_2, ..., I_N against T_1, T_2, ..., T_N, so as to obtain a plurality of first matching results and a plurality of first unmatched results. The first matching results may include first image features and first text features that are successfully matched with each other, for example, I_1·T_1, I_2·T_2, ..., I_N·T_N; the first unmatched results may include first image features and first text features that failed to match each other, for example, I_1·T_2, I_1·T_3, ..., I_1·T_N, I_2·T_1, I_2·T_3, ..., I_2·T_N, and the like.
After determining the plurality of first matching results and the plurality of first unmatched results, the first model parameters may be determined based on them. Optionally, this embodiment computes a contrastive loss function (InfoNCE loss) over the plurality of first matching results and the plurality of first unmatched results, for example by the following formula:

L_i = -log( exp(x_i) / Σ_j exp(x_j) )

where x_i is used to indicate the probability that the network output result belongs to the i-th class, and x_j is used to indicate the probability that the network output result belongs to the j-th class. Optionally, in this embodiment, exp(x_i) may be used to represent a matching result in which image features and text features match, while Σ_j exp(x_j) may be used to represent the results of mismatches between the plurality of image features and the plurality of text features.
After determining the first model parameters, a first target model may be generated from the first model parameters.
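A minimal sketch of the InfoNCE loss over the matching matrix follows; the diagonal entries I_i·T_i are the first matching results and the off-diagonal entries the first unmatched results. The temperature term and the symmetric two-direction formulation are common practice and are assumptions here, not part of the formula above.

```python
# A minimal InfoNCE sketch over the similarity matrix: diagonal entries
# I_i·T_i are the matching results, off-diagonal entries the unmatched
# ones. Temperature and the symmetric (image-to-text plus text-to-image)
# form are common-practice assumptions, not taken from the patent.
import torch
import torch.nn.functional as F

def info_nce_loss(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    logits = img_feat @ txt_feat.t() / temperature   # (N, N) of I_i · T_j
    targets = torch.arange(logits.size(0))           # positives on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```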
The second stage of the model determination method of this embodiment is further described below.
As an optional implementation, in step S106, training the first target model based on the second image sample and the second text sample to obtain the second target model includes: acquiring a second image feature of the second image sample; acquiring a second text feature of the second text sample; and training the first target model based on the second image feature and the second text feature to obtain the second target model.
In this embodiment, when the first target model is trained based on the second image sample and the second text sample to obtain the second target model, a second image feature of the second image sample may be obtained first: the second image sample may be input to an image encoder, which extracts the second image feature from the second image sample, for example, I_1′, I_2′, ..., I_N′. This embodiment may further obtain a second text feature of the second text sample: the whole second text sample may be input to a text encoder, which extracts the second text feature from the whole second text sample, for example, T_1′, T_2′, ..., T_N′, so that the model focuses more on the global features of the second text sample. After the second image feature and the second text feature are obtained, the first target model may be trained based on them, for example by contrastive learning, so as to obtain the second target model.
As an optional implementation manner, training the first target model based on the second image feature and the second text feature, and obtaining the second target model includes: matching the plurality of second image features and the plurality of second text features to obtain a plurality of second matching results and a plurality of second unmatched results, wherein the second matching results comprise second image features and second text features which are successfully matched with each other, and the second unmatched results comprise second image features and second text features which are failed to be matched with each other; training the first target model based on the plurality of second matching results and the plurality of second unmatched results to obtain second model parameters; a second target model is determined based on the second model parameters.
In this embodiment, when the first target model is trained based on the second image feature and the second text feature to obtain the second target model, the plurality of second image features and the plurality of second text features may be matched against each other, for example, I_1′, I_2′, ..., I_N′ against T_1′, T_2′, ..., T_N′, so as to obtain a plurality of second matching results and a plurality of second unmatched results. The second matching results may include second image features and second text features that are successfully matched with each other, for example, I_1′·T_1′, I_2′·T_2′, ..., I_N′·T_N′; the second unmatched results may include second image features and second text features that failed to match each other, for example, I_1′·T_2′, I_1′·T_3′, ..., I_1′·T_N′, I_2′·T_1′, I_2′·T_3′, ..., I_2′·T_N′, and the like.
After the second matching results and the second unmatched results are determined, the first target model may be trained based on them to obtain the second model parameters. Optionally, this embodiment computes the InfoNCE loss over the plurality of second matching results and the plurality of second unmatched results.
After the second model parameters are acquired, a second target model can be generated through the second model parameters, and then the second target model is determined to be an initialization model of a third target model, and the initialization model is output.
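The second stage might then be sketched as follows, reusing info_nce_loss from the earlier sketch; `model` and its `encode` helper are hypothetical wrappers around the stage-one encoders, and the optimizer settings are illustrative.

```python
# A sketch of the second training stage; `model` is a hypothetical wrapper
# bundling the stage-one encoders (the first target model), `loader` yields
# (images, full_captions), and the optimizer settings are illustrative.
import torch

def train_stage_two(model, loader, epochs: int = 1):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(epochs):
        for images, captions in loader:
            # Whole captions, not entity words: global text features
            img_feat, txt_feat = model.encode(images, captions)
            loss = info_nce_loss(img_feat, txt_feat)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model  # the second target model, i.e. the initialization model
```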
As an alternative embodiment, the first image sample comprises first image noise data and/or the first text sample comprises first text noise data.
In the related art, pre-training requires a large amount of data and learns poorly from noisy data, as in the related-art double-tower image-text pre-training method. In the first stage of the model determination method of this embodiment, however, some noise data is tolerated: the first image sample may include first image noise data, and the first text sample may include first text noise data; that is, this embodiment may dispense with special processing of the first image noise data and the first text noise data, saving labor cost.
As an alternative embodiment, the second image sample comprises second image noise data and/or the second text sample comprises second text noise data.
The second stage of the model determination method of this embodiment likewise tolerates certain noise data: the second image sample may include second image noise data, and the second text sample may include second text noise data; that is, this embodiment may dispense with special processing of the second image noise data and the second text noise data, saving labor cost.
FIG. 1B is a flowchart of another model determination method according to an embodiment of the present disclosure. As shown in fig. 1B, the method may include the following steps:
Step S1002, sending a model training request to a server, where the model training request includes a first image sample and a first text sample, and the text in the first text sample is used to give a textual description of a target image in the first image sample.
In the technical solution provided in step S1002 above of the present disclosure, training an initialization model with high accuracy requires a large number of image samples and text samples, and the data volume and computation load of the whole training process are large. In order to reduce the resource consumption of user equipment (for example, a smartphone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, and the like), the training of the model may be performed by a server, and only the trained model is deployed on the user equipment for the user's convenience.
In this embodiment, the model training request may be generated according to the model use requirement of the user, and the model training request includes an image sample and a text sample that need to be processed, and may further include an expected processing result and the like.
Alternatively, in this embodiment, a graphical user interface may be provided on the user equipment, and the user inputs the model training request in an input area of the graphical user interface, so that the user equipment can send the model training request to the server via the network. To be more targeted, the server may provide different model training schemes according to the type of user; the user selects a model training scheme in the input area, and the user equipment generates a model training request according to the user's selection and sends it to the server via the network.
Step S1004, receiving an initialization model sent by the server in response to the model training request, where the initialization model is obtained by the server training a first target model based on the second image sample and the second text sample, and the first target model is obtained by the server training the first image sample and the first text sample.
In the technical solution provided in the above step S1004 of the present disclosure, the first target model learns the local features of the first text sample, the text in the second text sample is used to perform text description on the target image in the second image sample, and the initialization model learns the global features of the second text sample.
In this embodiment, the server may use the first image sample and the first text sample as the training samples of the first stage in response to the model training request, the first text sample corresponding to the first image sample. Optionally, the first image sample and the first text sample of the embodiment may not need to be labeled and cleaned manually, so as to save labor cost. The server may train the first image sample and the first text sample, and may perform machine learning training on the first image sample and the first text sample to obtain a first target model, where the first target model may learn some local features of the first text sample.
In the second stage, the server may obtain a second image sample and a second text sample as the training samples of the second stage, where the second text sample corresponds to the second image sample. The server may continue training the first target model based on the second image sample and the second text sample, for example by machine learning training, to obtain the initialization model, where the initialization model may learn the global features, that is, the overall features, of the second text sample, and may be used to train an image detection model, an image segmentation model, an image classification model, and the like.
Furthermore, in order to greatly reduce the computational burden on the user equipment, the trained initialization model may be deployed directly on the server; the user equipment connects to the server through a specific interface, sends a model acquisition request to the server over the network, and obtains over the network the initialization model sent by the server in response to the model acquisition request as the initialization model of the third target model, thereby achieving the purpose of model pre-training.
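Purely as an illustration of this request/response flow, and assuming a hypothetical HTTP endpoint (the disclosure does not define a wire format), the user-equipment side might look like this:

```python
# Hypothetical client-side flow: the endpoint path, archive format and
# weight serialization are all assumptions for illustration only.
import requests

def request_initialization_model(server_url: str, sample_archive: str) -> bytes:
    # sample_archive packs the first image samples and first text samples
    with open(sample_archive, "rb") as f:
        resp = requests.post(f"{server_url}/model-training-request",
                             files={"samples": f}, timeout=3600)
    resp.raise_for_status()
    return resp.content  # serialized initialization model weights

# weights = request_initialization_model("http://training.example.com",
#                                        "first_stage_pairs.tar")
```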
The processing method of this embodiment may be a multi-stage image-text pre-training method: an NLP model may be used to assist in extracting entity words so that the model better learns some local features of the text in the first-stage pre-training, while in the second stage the model pays more attention to the overall features of the text; the combination of local and overall features better improves the pre-training effect of the initialization model, thereby solving the technical problem of a low pre-training effect of the initialization model.
Fig. 1C is a flow chart of a method of image processing according to an embodiment of the present disclosure. As shown in fig. 1C, the method may include the steps of:
Step S10002, an image to be processed is acquired.
In the technical solution provided in the above step S10002 of the present disclosure, the image to be processed may be an image that needs to be subjected to image processing, for example, an image that needs to be subjected to image detection, image segmentation, image classification, image recognition, and the like, and the processing type may be flexibly determined according to an image application scene, for example, flexibly determined according to a road scene, an education scene, a vegetation growth prediction scene, a weather prediction scene, and the like, which is not limited herein.
Alternatively, the embodiment may acquire the image to be processed by an image acquisition device, for example, by a camera deployed in a certain space.
Step S10004, inputting the image to be processed into a third target model, where the third target model is obtained according to the model determination method of the embodiment of the present disclosure.
In the technical solution provided in step S10004 above of the present disclosure, the acquired image to be processed may be input into the third target model. Optionally, the third target model in this embodiment is obtained by training an initialization model, where the initialization model may be obtained by training a first target model with a second image sample and a second text sample and learns the global features of the second text sample, the text in the second text sample being used to give a textual description of a target image in the second image sample; the first target model is obtained by the server training the first image sample and the first text sample and learns the local features of the first text sample. For example, the initialization model may be a recurrent neural network model, which is not specifically limited here.
Optionally, in this embodiment, when training the initialization model to obtain the third target model, a large amount of sample data may be acquired in advance, where the sample data may include a large number of image samples, and the sample data may be labeled to obtain a plurality of labels, where the labels may relate to image processing such as image detection, image segmentation, image classification, and image recognition. The initialization model is then trained with the sample data and the corresponding labels to obtain the third target model.
Optionally, in the sample data, the embodiment may extract features from each sample data through a convolutional neural network to obtain a feature vector including a plurality of features, for example, the feature vector includes features related to the above labels, and training an initialization model through the feature vector and the corresponding labels may obtain target parameters, which may be optimization parameters of the model, and a third target model may be determined through the target parameters and the initialization model.
Optionally, the embodiment may perform preprocessing on the sample data according to an algorithm such as a distribution consistency algorithm and denoising, and then perform feature extraction, feature transformation, feature normalization, feature combination, and the like on the preprocessed data to obtain features for training the initialization model. Optionally, the embodiment may further process the features through an optimization algorithm, an assumption function, a loss function, a decision boundary, a convergence rate, an iteration strategy, and the like, and train the initialization model through the processed features to obtain a third target model.
Optionally, after obtaining the third target model, this embodiment may further perform cross-validation, target evaluation, and over-fitting and under-fitting checks on the third target model, so as to determine the final third target model and thereby implement processing such as image detection, image segmentation, image classification and image recognition on the input image through the third target model.
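A sketch of obtaining the third target model by fine-tuning is given below, assuming an image classification scenario with a linear head on the initialization model's image features; `encode_image` is a hypothetical helper, and detection or segmentation heads would replace the linear head in those scenarios.

```python
# A fine-tuning sketch for the third target model, assuming an image
# classification scenario; `encode_image` is a hypothetical helper that
# returns the 256-d image features of the sketches above.
import torch

def finetune_third_model(init_model, loader, num_classes: int, epochs: int = 3):
    head = torch.nn.Linear(256, num_classes)   # task-specific head
    params = list(init_model.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=1e-4)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:          # labeled downstream samples
            logits = head(init_model.encode_image(images))
            loss = ce(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return init_model, head                    # together: third target model
```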
Step S10006 obtains a processing result of the third target model.
In the technical solution provided in step S10006 above of the present disclosure, the third target model may process the image to be processed, for example by performing image detection, image segmentation, image classification, image recognition and the like on it, to obtain a processing result. The processing result may include an image detection result, an image segmentation result, an image classification result, an image recognition result and the like, and may be output, for example displayed through a graphical user interface, for further analysis.
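For illustration, the inference flow of steps S10002 to S10006 might be sketched as follows, assuming the fine-tuned model and head from the previous sketch and a standard 224x224 preprocessing pipeline:

```python
# An inference sketch for steps S10002 to S10006, reusing the fine-tuned
# model and head from the previous sketch; preprocessing is an assumption.
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def process_image(path: str, init_model, head) -> int:
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = head(init_model.encode_image(image))
    return int(logits.argmax(dim=-1))  # e.g. predicted class index

# result = process_image("camera/frame_001.jpg", init_model, head)
```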
The above technical solutions of the embodiments of the present disclosure are further illustrated below with reference to preferred embodiments.
In the related art, a double-tower-based pre-training scheme can perform real-time offline retrieval. However, the double-tower pre-training scheme is less effective than single-tower retrieval because it lacks deep modal interaction.
In addition, in the related art, the training schemes adopted for double-tower image-text pre-training are all one-step training methods: image and text pairs are directly input into the pre-training network to perform the pre-training task. For example, a text sample is collated and input into a text encoder to extract a global text feature, and the text feature and the image feature are then used for loss calculation; deep inter-modal interaction therefore cannot take place, and the pre-training index of the initialization model is not high. Moreover, double-tower image-text pre-training requires a large amount of data and learns poorly from noisy data.
In view of the above problems, this embodiment provides a multi-stage image-text pre-training method: in the first stage, an NLP model is used to assist in extracting entity words so that the model better learns some local features of the text; in the second stage, the model focuses more on the overall features of the text; and the combination of local and overall features improves the initialization model. This is described further below.
Fig. 2 is a schematic diagram of a multi-stage image-text pre-training system according to an embodiment of the present disclosure. As shown in fig. 2, in the data processing module of the first stage, image-text pre-training requires a large amount of data; a large number of first image samples and corresponding first text samples can be crawled by a web crawler, and certain noise data can be tolerated in them to save labor cost, so the input data of the first stage of this embodiment may be first image samples and first text samples containing certain noise data (noisy product image-text data). Alternatively, this embodiment may employ a large number of unlabeled first text samples and first image samples as the training samples of the first stage. Optionally, in the data processing module of the first stage, an NLP model may be used to extract the entity words of the first text sample; only the entity words are input to the text encoder to be computed into the first text feature, the first image sample is input to the image encoder to be computed into the first image feature, and the first text feature and the first image feature are then subjected to contrastive learning to obtain the first target model.
The processing in the data processing module of the second stage of this embodiment may be similar to that of the first stage, but the second stage operates on the entire text sample: the entire text sample is input to the text encoder to be computed into the second text feature, the image sample is input to the image encoder to be computed into the second image feature, and the second text feature and the second image feature are then used to perform contrastive learning on the first target model, so as to obtain the second target model. The second target model is also the initialization model of the third target model, and the third target model may be an image detection model, an image segmentation model, an image classification model, and the like, which is not specifically limited here.
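Combining the two stages, and reusing the illustrative extract_entity_words and info_nce_loss helpers sketched earlier, a hypothetical driver for the whole pre-training pipeline could read:

```python
# A hypothetical driver combining both stages, reusing the illustrative
# extract_entity_words and info_nce_loss helpers sketched earlier.
import torch

def multi_stage_pretrain(model, stage1_loader, stage2_loader):
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def run(loader, use_entity_words):
        for images, captions in loader:
            texts = ([" ".join(extract_entity_words(c)) for c in captions]
                     if use_entity_words else list(captions))
            img_f, txt_f = model.encode(images, texts)
            loss = info_nce_loss(img_f, txt_f)
            opt.zero_grad()
            loss.backward()
            opt.step()

    run(stage1_loader, use_entity_words=True)    # stage one: local features
    run(stage2_loader, use_entity_words=False)   # stage two: global features
    return model  # initialization model for the third target model
```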
In this embodiment, the text encoder module extracts text features using a RoBERTa model, which is an upgrade of the BERT model. The image encoder extracts image features using a DeiT model, as shown in fig. 3, where fig. 3 is a schematic diagram of the DeiT model structure according to an embodiment of the present disclosure: a class token, patch tokens and a distillation token (distill token) are input, processed by self-attention and a feed-forward network (FFN), and the resulting output is used to obtain the image feature. The DeiT of this example brings the Transformer from NLP to computer vision.
In this embodiment, the contrastive learning module mainly uses the InfoNCE loss, computed by the following formula:

L_i = -log( exp(x_i) / Σ_j exp(x_j) )

where x_i is used to indicate the probability that the network output result belongs to the i-th class and x_j the probability that it belongs to the j-th class; optionally, in this embodiment, exp(x_i) may represent a matching result in which an image feature and a text feature match successfully, while Σ_j exp(x_j) may represent the matching results in which they fail to match, as shown in fig. 4. Fig. 4 is a schematic diagram of matching image features and text features according to an embodiment of the present disclosure. As shown in fig. 4, image features I_1, I_2, ..., I_N are extracted from the input image samples by the image encoder, and text features T_1, T_2, ..., T_N are extracted from the input text samples by the text encoder; the image features I_1, I_2, ..., I_N and the text features T_1, T_2, ..., T_N are matched against each other, the matching results on the diagonal being the pairs in which a text feature and an image feature match successfully, and the results off the diagonal being the pairs in which they fail to match.
The method of this embodiment can be applied to multi-stage multi-tower pre-training: an NLP model may be used to assist in extracting entity words so that the model better learns some local features of the text samples in the first-stage pre-training, while the second-stage model focuses more on the overall features of the text samples (noise data needs no special processing); the combination of local and overall features better improves the pre-training effect of the initialization model.
The embodiment of the present disclosure further provides a model determining apparatus for executing the model determining method of the embodiment shown in fig. 1A.
Fig. 5A is a schematic diagram of a model determination apparatus according to an embodiment of the present disclosure. As shown in fig. 5A, the model determination apparatus 50 may include: a first acquisition unit 51, a training unit 52, a processing unit 53 and a determining unit 54.
The first obtaining unit 51 is configured to obtain a first image sample and a first text sample, where a text in the first text sample is used to describe a target image in the first image sample.
The training unit 52 is configured to train the first image sample and the first text sample to obtain a first target model, where the first target model learns local features of the first text sample.
And the processing unit 53 is configured to obtain a second image sample and a second text sample, and train the first target model based on the second image sample and the second text sample to obtain a second target model, where a text in the second text sample is used to perform text description on a target image in the second image sample, and the second target model learns the global features of the second text sample.
A determining unit 54 for determining the second object model as an initialization model of the third object model.
Optionally, the apparatus further includes: a second acquisition unit, configured to acquire entity words of the first text sample; and the training unit 52 includes: a training module, configured to train the first image sample and the entity words to obtain the first target model.
Optionally, the training module includes: a first acquisition submodule, configured to acquire a first image feature of the first image sample; a second acquisition submodule, configured to acquire a first text feature of the entity words; and a first training submodule, configured to train the first image feature and the first text feature to obtain the first target model.
Optionally, the first training sub-module is configured to train the first image feature and the first text feature to obtain a first target model by: matching the plurality of first image features and the plurality of first text features to obtain a plurality of first matching results and a plurality of first unmatched results, wherein the first matching results comprise first image features and first text features which are successfully matched with each other, and the first unmatched results comprise first image features and first text features which are failed to be matched with each other; determining a first model parameter based on the plurality of first match results and the plurality of first non-match results; a first target model is determined based on the first model parameters.
Optionally, the processing unit 53 includes: a first acquisition module, configured to acquire a second image feature of the second image sample; a second acquisition module, configured to acquire a second text feature of the second text sample; and a first training module, configured to train the first target model based on the second image feature and the second text feature to obtain the second target model.
Optionally, the first training module includes: a matching submodule, configured to match the plurality of second image features and the plurality of second text features to obtain a plurality of second matching results and a plurality of second unmatched results, where the second matching results include second image features and second text features that are successfully matched with each other, and the second unmatched results include second image features and second text features that failed to match each other; a second training submodule, configured to train the first target model based on the plurality of second matching results and the plurality of second unmatched results to obtain second model parameters; and a determination submodule, configured to determine the second target model based on the second model parameters.
Optionally, the first image sample comprises first image noise data and/or the first text sample comprises first text noise data.
Optionally, the second image sample comprises second image noise data and/or the second text sample comprises second text noise data.
The embodiment of the disclosure also provides a model determining apparatus for executing the model determining method of the embodiment shown in fig. 1B.
Fig. 5B is a schematic diagram of another model determination device according to an embodiment of the present disclosure. As shown in fig. 5B, the model determining apparatus 500 may include: a transmitting unit 501 and a receiving unit 502.
A sending unit 501, configured to send a model training request to a server, where the model training request includes a first image sample and a first text sample, and a text in the first text sample is used to perform text description on a target image in the first image sample.
A receiving unit 502, configured to receive an initialization model sent by a server in response to a model training request, where the initialization model is obtained by the server training a first target model based on a second image sample and a second text sample, the first target model is obtained by the server training the first image sample and the first text sample, the first target model learns local features of the first text sample, a text in the second text sample is used to perform text description on a target image in the second image sample, and the initialization model learns global features of the second text sample.
The embodiment of the present disclosure also provides an image processing apparatus for executing the image processing method of the embodiment shown in Fig. 1C.
Fig. 5C is a schematic diagram of an image processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 5C, the image processing apparatus 5000 may include: a third acquisition unit 5001, an input unit 5002, and a fourth acquisition unit 5003.
A third acquisition unit 5001, configured to acquire an image to be processed.
An input unit 5002, configured to input the image to be processed into a third target model, wherein the third target model is obtained by the model determination method of an embodiment of the present disclosure.
A fourth acquisition unit 5003, configured to acquire a processing result of the third target model.
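At inference time the third target model behaves like any other vision model. A sketch, assuming the third target model is an ordinary PyTorch module and that a simple resize-and-tensorize preprocessing suffices (the disclosure fixes neither):

```python
import torch
import torchvision.transforms as T
from PIL import Image

def process_image(third_target_model, image_path):
    """Run the image to be processed through the third target model."""
    preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    third_target_model.eval()
    with torch.no_grad():
        result = third_target_model(image)  # the processing result
    return result
```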
In this embodiment, a multi-stage pre-training scheme is adopted: in the first pre-training stage the model better learns the local features of the text, and in the second stage it attends more to the global features of the text. This improves the pre-training metrics, solves the technical problem of the poor training effect of the initialization model, and achieves the technical effect of improving the training effect of the initialization model.
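Concretely, the two stages differ only in the text view each consumes. A compact sketch of the whole flow, reusing the hypothetical `train_matching_stage` helper above (the loader contents and the entity-word extractor are assumptions, not specified by the disclosure):

```python
def determine_initialization_model(model, stage1_loader, stage2_loader):
    """Two-stage pre-training as summarized above.

    `stage1_loader` pairs first image samples with entity-word tokens,
    a local view of the text; `stage2_loader` pairs second image
    samples with full-caption tokens, a global view of the text.
    """
    # Stage 1: the first target model learns local text features.
    first_target_model = train_matching_stage(model, stage1_loader)
    # Stage 2: the second target model learns global text features.
    second_target_model = train_matching_stage(first_target_model, stage2_loader)
    return second_target_model  # initialization model for the third target model
```

Training of the third target model then starts from these pre-trained weights rather than from a random initialization.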
It should be noted that the above units and modules may be implemented by software or hardware. In the latter case, this may be achieved in, but is not limited to, the following manner: the units and modules are all located in the same processor, or they are distributed among different processors in any combination.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device. The electronic device may include: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model determination method of the embodiments of the present disclosure.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In accordance with an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the model determination method of an embodiment of the present disclosure.
Alternatively, in the present embodiment, the above-mentioned non-transitory storage medium may be configured to store a computer program for executing the following steps:
S1, acquiring a first image sample and a first text sample, wherein the text in the first text sample is used to textually describe a target image in the first image sample;
S2, training the first image sample and the first text sample to obtain a first target model, wherein the first target model learns local features of the first text sample;
S3, acquiring a second image sample and a second text sample, and training the first target model based on the second image sample and the second text sample to obtain a second target model, wherein the text in the second text sample is used to textually describe a target image in the second image sample, and the second target model learns global features of the second text sample;
S4, determining the second target model as an initialization model of a third target model.
Alternatively, in the present embodiment, the above-mentioned non-transitory storage medium may be configured to store a computer program for executing the following steps:
S1, sending a model training request to a server, wherein the model training request includes a first image sample and a first text sample, and the text in the first text sample is used to textually describe a target image in the first image sample;
S2, receiving an initialization model sent by the server in response to the model training request, wherein the initialization model is obtained by the server by training a first target model based on a second image sample and a second text sample, the first target model is obtained by the server by training the first image sample and the first text sample, the first target model learns local features of the first text sample, the text in the second text sample is used to textually describe a target image in the second image sample, and the initialization model learns global features of the second text sample.
Alternatively, in the present embodiment, the above-mentioned non-transitory storage medium may be configured to store a computer program for executing the following steps:
S1, acquiring an image to be processed;
S2, inputting the image to be processed into a third target model, wherein the third target model is obtained by the model determination method of an embodiment of the present disclosure;
S3, acquiring a processing result of the third target model.
Alternatively, in the present embodiment, the non-transitory computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
S1, acquiring a first image sample and a first text sample, wherein the text in the first text sample is used to textually describe a target image in the first image sample;
S2, training the first image sample and the first text sample to obtain a first target model, wherein the first target model learns local features of the first text sample;
S3, acquiring a second image sample and a second text sample, and training the first target model based on the second image sample and the second text sample to obtain a second target model, wherein the text in the second text sample is used to textually describe a target image in the second image sample, and the second target model learns global features of the second text sample;
S4, determining the second target model as an initialization model of a third target model.
Optionally, the computer program, when executed by the processor, further implements the following steps:
S1, sending a model training request to a server, wherein the model training request includes a first image sample and a first text sample, and the text in the first text sample is used to textually describe a target image in the first image sample;
S2, receiving an initialization model sent by the server in response to the model training request, wherein the initialization model is obtained by the server by training a first target model based on a second image sample and a second text sample, the first target model is obtained by the server by training the first image sample and the first text sample, the first target model learns local features of the first text sample, the text in the second text sample is used to textually describe a target image in the second image sample, and the initialization model learns global features of the second text sample.
Optionally, the computer program, when executed by the processor, further implements the following steps:
S1, acquiring an image to be processed;
S2, inputting the image to be processed into a third target model, wherein the third target model is obtained by the model determination method of an embodiment of the present disclosure;
S3, acquiring a processing result of the third target model.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
The program code of this embodiment for implementing the model determination method of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable model determination device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Fig. 6 is a schematic block diagram of an electronic device in accordance with an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 executes the respective methods and processes described above, such as the model determination method. For example, in some embodiments, the model determination method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the model determination method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the model determination method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable model determination device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A model determination method, comprising:
acquiring a first image sample and a first text sample, wherein a text in the first text sample is used to textually describe a target image in the first image sample;
training the first image sample and the first text sample to obtain a first target model, wherein the first target model learns local features of the first text sample;
acquiring a second image sample and a second text sample, and training the first target model based on the second image sample and the second text sample to obtain a second target model, wherein a text in the second text sample is used to textually describe a target image in the second image sample, and the second target model learns global features of the second text sample;
and determining the second target model as an initialization model of a third target model.
2. The method of claim 1, wherein,
the method further comprises: acquiring entity words of the first text sample; and
training the first image sample and the first text sample to obtain a first target model comprises: training the first image sample and the entity words to obtain the first target model.
3. The method of claim 2, wherein training the first image sample and the entity words to obtain the first target model comprises:
acquiring a first image feature of the first image sample;
acquiring a first text feature of the entity words;
and training the first image feature and the first text feature to obtain the first target model.
4. The method of claim 3, wherein training the first image feature and the first text feature to obtain the first target model comprises:
matching the plurality of first image features with the plurality of first text features to obtain a plurality of first matching results and a plurality of first non-matching results, wherein the first matching results comprise the first image features and the first text features that successfully match each other, and the first non-matching results comprise the first image features and the first text features that fail to match each other;
determining first model parameters based on the plurality of first matching results and the plurality of first non-matching results;
determining the first target model based on the first model parameters.
5. The method of claim 1, wherein training the first target model based on the second image sample and the second text sample to obtain a second target model comprises:
acquiring a second image feature of the second image sample;
acquiring a second text feature of the second text sample;
and training the first target model based on the second image feature and the second text feature to obtain the second target model.
6. The method of claim 5, wherein training the first target model based on the second image feature and the second text feature to obtain the second target model comprises:
matching the plurality of second image features with the plurality of second text features to obtain a plurality of second matching results and a plurality of second non-matching results, wherein the second matching results comprise the second image features and the second text features that successfully match each other, and the second non-matching results comprise the second image features and the second text features that fail to match each other;
training the first target model based on the plurality of second matching results and the plurality of second non-matching results to obtain second model parameters;
determining the second target model based on the second model parameters.
7. The method of any of claims 1-6, wherein the first image sample comprises first image noise data and/or the first text sample comprises first text noise data.
8. The method of any of claims 1-6, wherein the second image sample comprises second image noise data and/or the second text sample comprises second text noise data.
9. An image processing method comprising:
acquiring an image to be processed;
inputting the image to be processed into a third target model, wherein the third target model is obtained by the model determination method of any one of claims 1 to 8;
and acquiring a processing result of the third target model.
10. A model determination apparatus, comprising:
a first acquisition unit, configured to acquire a first image sample and a first text sample, wherein a text in the first text sample is used to textually describe a target image in the first image sample;
a training unit, configured to train the first image sample and the first text sample to obtain a first target model, wherein the first target model learns local features of the first text sample;
a processing unit, configured to acquire a second image sample and a second text sample, and train the first target model based on the second image sample and the second text sample to obtain a second target model, wherein a text in the second text sample is used to textually describe a target image in the second image sample, and the second target model learns global features of the second text sample;
a determining unit, configured to determine the second target model as an initialization model of a third target model.
11. The apparatus of claim 10, wherein,
the apparatus further comprises: a second acquisition unit, configured to acquire entity words of the first text sample; and
the training unit comprises: a training module, configured to train the first image sample and the entity words to obtain the first target model.
12. The apparatus of claim 11, wherein the training module comprises:
a first acquisition sub-module, configured to acquire a first image feature of the first image sample;
a second acquisition sub-module, configured to acquire a first text feature of the entity words;
and a first training sub-module, configured to train the first image feature and the first text feature to obtain the first target model.
13. The apparatus of claim 12, wherein the first training sub-module is configured to train the first image feature and the first text feature to obtain the first target model by:
matching the plurality of first image features with the plurality of first text features to obtain a plurality of first matching results and a plurality of first non-matching results, wherein the first matching results comprise the first image features and the first text features that successfully match each other, and the first non-matching results comprise the first image features and the first text features that fail to match each other;
determining first model parameters based on the plurality of first matching results and the plurality of first non-matching results;
determining the first target model based on the first model parameters.
14. The apparatus of claim 10, wherein the processing unit comprises:
a first acquisition module, configured to acquire a second image feature of the second image sample;
a second acquisition module, configured to acquire a second text feature of the second text sample;
and a first training module, configured to train the first target model based on the second image feature and the second text feature to obtain the second target model.
15. The apparatus of claim 14, wherein the first training module comprises:
a matching sub-module, configured to match the plurality of second image features with the plurality of second text features to obtain a plurality of second matching results and a plurality of second non-matching results, wherein the second matching results include the second image features and the second text features that successfully match each other, and the second non-matching results include the second image features and the second text features that fail to match each other;
a second training sub-module, configured to train the first target model based on the plurality of second matching results and the plurality of second non-matching results to obtain second model parameters;
and a determination sub-module, configured to determine the second target model based on the second model parameters.
16. An image processing apparatus comprising:
a third acquisition unit, configured to acquire an image to be processed;
an input unit, configured to input the image to be processed into a third target model, wherein the third target model is obtained by the model determination method according to any one of claims 1 to 8;
and a fourth acquisition unit, configured to acquire a processing result of the third target model.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
CN202111212328.6A 2021-10-18 2021-10-18 Model determination method and device, electronic equipment and memory Pending CN113947195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111212328.6A CN113947195A (en) 2021-10-18 2021-10-18 Model determination method and device, electronic equipment and memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111212328.6A CN113947195A (en) 2021-10-18 2021-10-18 Model determination method and device, electronic equipment and memory

Publications (1)

Publication Number Publication Date
CN113947195A 2022-01-18

Family

ID=79331473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111212328.6A Pending CN113947195A (en) 2021-10-18 2021-10-18 Model determination method and device, electronic equipment and memory

Country Status (1)

Country Link
CN (1) CN113947195A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114219050A (en) * 2022-02-22 2022-03-22 杭州远传新业科技有限公司 Training method, system, device and medium for text similarity model
CN114219050B (en) * 2022-02-22 2022-06-21 杭州远传新业科技股份有限公司 Training method, system, device and medium for text similarity model

Similar Documents

Publication Publication Date Title
CN113837308B (en) Knowledge distillation-based model training method and device and electronic equipment
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN113642583B (en) Deep learning model training method for text detection and text detection method
CN113947700A (en) Model determination method and device, electronic equipment and memory
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN113191261B (en) Image category identification method and device and electronic equipment
CN116152833B (en) Training method of form restoration model based on image and form restoration method
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN115620081A (en) Training method of target detection model, target detection method and device
CN114495101A (en) Text detection method, and training method and device of text detection network
CN114090601A (en) Data screening method, device, equipment and storage medium
CN113947195A (en) Model determination method and device, electronic equipment and memory
CN116935368A (en) Deep learning model training method, text line detection method, device and equipment
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
CN115631502A (en) Character recognition method, character recognition device, model training method, electronic device and medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN115601620A (en) Feature fusion method and device, electronic equipment and computer readable storage medium
CN115496734A (en) Quality evaluation method of video content, network training method and device
CN115273148A (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN115170919A (en) Image processing model training method, image processing device, image processing equipment and storage medium
CN114463734A (en) Character recognition method and device, electronic equipment and storage medium
CN114419327A (en) Image detection method and training method and device of image detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination