CN113065013B - Image annotation model training and image annotation method, system, equipment and medium - Google Patents

Image annotation model training and image annotation method, system, equipment and medium

Info

Publication number
CN113065013B
CN113065013B
Authority
CN
China
Prior art keywords
image annotation
model
image
convolution layer
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110321391.7A
Other languages
Chinese (zh)
Other versions
CN113065013A (en)
Inventor
杨凯
罗超
胡泓
李巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202110321391.7A priority Critical patent/CN113065013B/en
Publication of CN113065013A publication Critical patent/CN113065013A/en
Application granted granted Critical
Publication of CN113065013B publication Critical patent/CN113065013B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval of still image data
    • G06F 16/55 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention discloses an image annotation model training and image annotation method, system, equipment and medium. The image annotation model training method comprises the following steps: acquiring image data and constructing a training data set, wherein the training data set comprises image data labeled with preset classification labels, and the classification labels comprise a plurality of different target labels and a non-target label; adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of the feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected; and inputting the training data set into the image annotation model for training to obtain a target image annotation model. By adding a non-target label to the image classification label system to construct the training data set, and by building the image annotation model from a residual network and an attention mechanism, the method improves the accuracy of image annotation.

Description

Image annotation model training and image annotation method, system, equipment and medium
Technical Field
The invention relates to the technical field of deep learning, and in particular to an image annotation model training and image annotation method, system, equipment and medium.
Background
With the development of information technology, the volume of image information has grown explosively. For example, galleries for scenic-spot sharing and recommendation receive large numbers of new pictures every day, and a backlog of unorganized pictures accumulates in these galleries, making them difficult to use further. Such massive amounts of image data cannot be labeled by manual processing alone, and image classification algorithms based on deep learning models are currently the main method for annotating massive image collections. However, existing open-source image classification models target images from specific narrow domains, and cannot accurately identify and annotate image data that includes massive numbers of irrelevant pictures in open scenarios such as a travel-guide gallery.
Disclosure of Invention
The invention aims to overcome the defect in the prior art that an image classification model targeting a specific narrow domain cannot accurately identify and label massive image data that includes massive numbers of irrelevant pictures, and provides an image annotation model training and image annotation method, system, equipment and medium.
The invention solves the technical problems by the following technical scheme:
the invention provides an image annotation model training method, which comprises the following steps:
acquiring image data and constructing a training data set, wherein the training data set comprises the image data marked by a preset classification label; the classification labels comprise a plurality of different target labels and a non-target label; the non-target label is of a different category than the target labels;
adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of a feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected;
and inputting the training data set into the image annotation model for training to obtain a target image annotation model.
Preferably, the step of adding the attention mechanism module after the convolutional layer included in the residual network structure includes:
inputting the first feature map output by the convolution layer to the attention mechanism module to obtain an attention weight feature map;
and determining a second feature map output by the attention mechanism module according to the first feature map and the attention weight feature map.
Preferably, the residual network structure comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer;
the step of adding the attention mechanism module after the convolution layer included in the residual network structure comprises the following steps:
And adding an attention mechanism module after the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer respectively.
Preferably, the step of inputting the training data set into the image annotation model for training to obtain the target image annotation model includes:
Inputting the training data set into the image annotation model to obtain a model output result;
calculating error loss of the image annotation model by using a first loss function according to the model output result and the balance factor;
the balance factor is the ratio of the number of samples marked by each classification label in the training data set to the total number of samples in the training data set.
Preferably, the step of inputting the training data set into the image annotation model for training to obtain the target image annotation model includes:
calculating constraint loss of the image annotation model by using a second loss function according to the model output result;
Determining a total loss of the image annotation model according to the error loss and the constraint loss;
and adjusting parameters of the image annotation model according to the total loss until convergence conditions are reached.
The invention also provides an image labeling method, which comprises the following steps:
Acquiring image data to be annotated;
And inputting the image data to be annotated into a target image annotation model obtained by using the image annotation model training method, so as to obtain an annotation result of the image data to be annotated.
The invention also provides an image annotation model training system, which comprises:
The data set construction module is used for acquiring image data and constructing a training data set, wherein the training data set comprises the image data marked by a preset classification label; the classification labels comprise a plurality of different target labels and a non-target label; the non-target label is of a different category than the target labels;
the model construction module is used for adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of a feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected;
and the model training module is used for inputting the training data set into the image annotation model for training to obtain a target image annotation model.
The invention also provides an image annotation system, which comprises:
the image acquisition module is used for acquiring image data to be annotated;
And the image labeling module is used for inputting the image data to be labeled into a target image labeling model obtained by using the image labeling model training method to obtain a labeling result of the image data to be labeled.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the image annotation model training method as described above or the image annotation method as described above when executing the computer program.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image annotation model training method as described above or an image annotation method as described above.
The invention has the positive progress effects that:
According to the invention, the training data set is constructed by adding a non-target label to the image classification label system, and the image annotation model is built from a residual network and an attention mechanism. The model is trained on massive image data that includes both images belonging to the target categories, labeled with target labels, and images not belonging to the target categories, labeled with the non-target label. This realizes automatic identification and labeling of images that do not belong to the target categories within massive image collections, greatly saves labor cost, and greatly improves the accuracy of image identification and labeling. It also facilitates subsequent secondary development based on the labeling results, so that more high-quality images can be selected for display during product development, improving user experience.
Drawings
Fig. 1 is a flowchart of an image annotation model training method according to embodiment 1 of the present invention.
Fig. 2 is another flowchart of the image labeling model training method of embodiment 1 of the present invention.
Fig. 3 is a flowchart of an image labeling method according to embodiment 2 of the present invention.
Fig. 4 is a block diagram of an image labeling model training system according to embodiment 3 of the present invention.
Fig. 5 is a block diagram of an image labeling system according to embodiment 4 of the present invention.
Fig. 6 is a schematic hardware structure of an electronic device according to embodiment 5 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
As shown in fig. 1, the present embodiment provides an image labeling model training method, which includes:
S101, acquiring image data and constructing a training data set, wherein the training data set comprises image data marked by preset classification labels; the classification labels comprise a plurality of different target labels and a non-target label; the non-target label is of a different category than the target labels.
Specifically, the plurality of different target labels are used for labeling and identifying images from specific domains; they can be scene labels such as river, lake, waterfall, ocean and seabed, or animal labels such as cat, dog, horse and sheep. The non-target label is different from the target labels and is used for labeling the massive pictures that do not belong to any category corresponding to a target label. Image data tagged with the target labels and the non-target label can be acquired in a variety of ways, including collecting images with crawler techniques, accumulating related image data over time, and supplementing the data through manual labeling.
S102, adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of a feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected.
S103, inputting the training data set into the image annotation model for training to obtain the target image annotation model.
As shown in fig. 2, step S102 includes:
S1021, using a residual network as a basic network, wherein the residual network comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer;
S1022, adding an attention mechanism module after the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer, respectively.
In the training stage, the image annotation model loads pre-training weights based on transfer learning. The closer a structural change is to the front of the model, the more it disturbs the original model structure and the less effectively the pre-training weights can be reused; therefore, no attention mechanism module is added after the first convolution layer, and the modules are added only after the second, third, fourth and fifth convolution layers.
Specifically, Wide ResNet is selected as the residual network, the input picture size is fixed at 224×224, and the number of output nodes of the fully connected layer is set to N+1, corresponding to the N target labels and 1 non-target label.
Inputting the first feature map output by the convolution layer to the attention mechanism module yields an attention weight feature map; the second feature map output by the attention mechanism module is then determined from the first feature map and the attention weight feature map, and the second feature map is input to the next convolution layer or to the fully connected layer. Specifically, the first feature map input to the attention mechanism module by the convolution layer is denoted F_in, with dimensions [C, H, W], where C is the number of channels of the feature map, H is its height and W is its width. In the channel branch, F_in is average-pooled and passed through an MLP (Multi-Layer Perceptron) with one hidden layer and a BN (Batch Normalization) layer to obtain a descriptor of size [C, 1, 1], which is then expanded into a [C, H, W] feature map denoted M_c; the value at any position of each H×W channel map in M_c equals the value of the corresponding channel of the original descriptor. In the spatial branch, F_in is first reduced in dimension by a 1×1 convolution with reduction ratio r, context information is exploited by two 3×3 dilated convolutions, the feature map is then reduced to [1, H, W] by another 1×1 convolution and regularized by a BN layer, and finally this single-channel feature map is copied and expanded into a [C, H, W] feature map denoted M_s. Adding M_c and M_s and applying a sigmoid gives the final attention weight feature map M_total:

M_c(F_in) = BN(MLP(AvgPool(F_in)))

M_s(F_in) = BN(f_1×1(f_3×3(f_3×3(f_1×1(F_in)))))

M_total(F_in) = σ(M_c(F_in) + M_s(F_in))

where f denotes a convolution operation with the indicated kernel size and σ denotes the sigmoid function.

The attention weight feature map M_total and the first feature map F_in are multiplied element-wise, and F_in is added to give the finally adjusted second feature map F_out:

F_out = F_in + F_in ⊗ M_total(F_in)

where ⊗ denotes element-wise multiplication.
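The structure described above matches a BAM-style (bottleneck attention) block: a channel branch and a spatial branch whose outputs are summed and passed through a sigmoid. The following PyTorch sketch is one possible reading of that description; the class name, the reduction ratio r = 16, the dilation rate 4, the ReLU placements, and the use of torchvision's wide_resnet50_2 as the Wide ResNet backbone are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import wide_resnet50_2

class AttentionModule(nn.Module):
    """Channel + spatial attention as described: F_out = F_in + F_in * M_total."""

    def __init__(self, channels: int, r: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // r
        # Channel branch: AvgPool -> MLP (one hidden layer, realised as 1x1 convs) -> BN.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.BatchNorm2d(channels),
        )
        # Spatial branch: 1x1 reduction (ratio r), two 3x3 dilated convolutions,
        # 1x1 projection to a single channel, then BN.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),
            nn.BatchNorm2d(1),
        )

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        m_c = self.channel(f_in)            # [B, C, 1, 1], broadcast over H, W
        m_s = self.spatial(f_in)            # [B, 1, H, W], broadcast over C
        m_total = torch.sigmoid(m_c + m_s)  # attention weight feature map
        return f_in + f_in * m_total        # element-wise product plus residual

def build_model(num_target_labels: int) -> nn.Module:
    """Wide ResNet with attention after the 2nd-5th convolution stages and an
    (N + 1)-way fully connected layer (N target labels + 1 non-target label)."""
    net = wide_resnet50_2()  # Places365 pre-trained weights would be loaded separately
    for name in ("layer1", "layer2", "layer3", "layer4"):  # conv2_x .. conv5_x
        stage = getattr(net, name)
        channels = stage[-1].conv3.out_channels
        setattr(net, name, nn.Sequential(stage, AttentionModule(channels)))
    net.fc = nn.Linear(net.fc.in_features, num_target_labels + 1)
    return net
```

With this wrapping, the original stage weights remain loadable from a pre-trained state dict, while the attention parameters start from scratch, which is what the fine-tuning scheme below relies on.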
As shown in fig. 2, step S103 includes:
S1031, pre-training the image annotation model by adopting a transfer learning method to obtain a pre-training weight, and loading the pre-training weight to adjust parameters of the image annotation model.
In this embodiment, transfer learning is performed based on a pre-training model trained on the public scene-classification dataset Places365, and the pre-training weights of all layers except the fully connected layer are loaded. The weights in the second, third, fourth and fifth convolution layers are fine-tuned, with the initial learning rate set to 0.001 for the second and third convolution layers and to 0.002 for the fourth and fifth convolution layers; the weights of the 4 attention mechanism modules and of the fully connected layer of the residual network are trained from scratch with an initial learning rate of 0.01; the weights in the other layers are frozen and not updated. During training, the learning rate is halved every 5 epochs.
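A hedged sketch of this learning-rate scheme is given below, assuming the wrapped network from the previous sketch (each stage an nn.Sequential of the original stage and an attention module); reading the fully connected layer as the part trained from scratch at 0.01 follows from the fact that its pre-training weights are not loaded. The momentum value 0.9 and the 5-epoch halving come from the surrounding text.

```python
import torch

def build_optimizer(net):
    groups = []
    # Fine-tuned stages: conv2/conv3 (layer1, layer2) at 0.001,
    # conv4/conv5 (layer3, layer4) at 0.002; their attention modules at 0.01.
    for name, lr in (("layer1", 1e-3), ("layer2", 1e-3),
                     ("layer3", 2e-3), ("layer4", 2e-3)):
        stage = getattr(net, name)
        groups.append({"params": list(stage[0].parameters()), "lr": lr})
        groups.append({"params": list(stage[1].parameters()), "lr": 1e-2})
    # Newly initialised fully connected layer, trained from scratch.
    groups.append({"params": list(net.fc.parameters()), "lr": 1e-2})
    # Freeze everything else (first convolution layer, stem BN, ...).
    trainable = {id(p) for g in groups for p in g["params"]}
    for p in net.parameters():
        if id(p) not in trainable:
            p.requires_grad_(False)
    # Momentum-based SGD (momentum = 0.9); learning rates halved every 5 epochs.
    opt = torch.optim.SGD(groups, lr=1e-2, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
    return opt, sched
```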
S1032, inputting the training data set into the image annotation model to obtain a model output result;
S1033, calculating error loss of the image annotation model by using a first loss function according to the model output result and the balance factor;
The balance factor is the ratio of the number of samples labeled by each classification label in the training data set to the total number of samples in the training data set.
The model output is Y = {y_1, y_2, ..., y_{N+1}} and the balance factor is A = {α_1, α_2, ..., α_{N+1}}. Using focal loss as the first loss function, the error loss over the N target labels and the non-target label is expressed as loss_fl:

loss_fl = -α_label · (1 - y_label)^γ · log(y_label)

where the focusing parameter γ = 2, and label denotes the true label index of the picture, an integer in the range [1, N+1].
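As a sketch, the balance factor and focal loss could be computed as follows; the standard focal-loss form is assumed, the model output is taken to be normalized probabilities (with raw logits a softmax would be applied first), labels are 0-based in code, and all names are illustrative.

```python
import torch

def balance_factors(label_counts: torch.Tensor) -> torch.Tensor:
    # alpha_k = (samples labelled k) / (total samples), per the definition above
    return label_counts.float() / label_counts.sum()

def focal_loss(probs: torch.Tensor, labels: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # probs: [B, N+1] class probabilities; labels: [B] true label indices (0-based)
    p_t = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # y_label per sample
    a_t = alpha[labels]                                    # alpha_label per sample
    loss = -a_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))
    return loss.mean()
```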
S1034, calculating constraint loss of the image annotation model by using a second loss function according to the model output result; the total loss of the image annotation model is determined from the error loss and the constraint loss.
Using ring loss as the second loss function, with target modulus length R initialized to the mean feature-vector norm after the first training iteration, the constraint loss is expressed as loss_rl:

loss_rl = (1 / (2m)) · Σ_i (||F(x_i)||_2 - R)^2

where F(x_i) is the feature vector of sample x_i entering the fully connected layer and m is the batch size.

The total loss loss_total of the image annotation model is a weighted sum of the two loss functions:

loss_total = loss_fl + λ · loss_rl

where λ is a weight factor, taking the value 0.01.
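A corresponding sketch of the ring-loss constraint and the total loss follows; treating R as a learnable parameter, as in the original ring-loss formulation, and the batch-mean form of the penalty are assumptions.

```python
import torch
import torch.nn as nn

class RingLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Target modulus length R; re-initialised to the mean feature norm
        # after the first training iteration, then learned jointly.
        self.R = nn.Parameter(torch.ones(1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # loss_rl = (1 / 2m) * sum_i (||F(x_i)||_2 - R)^2
        norms = features.norm(p=2, dim=1)
        return 0.5 * ((norms - self.R) ** 2).mean()

def total_loss(loss_fl: torch.Tensor, features: torch.Tensor,
               ring: RingLoss, lam: float = 0.01) -> torch.Tensor:
    # loss_total = loss_fl + lambda * loss_rl, with lambda = 0.01
    return loss_fl + lam * ring(features)
```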
S1035, adjusting parameters of the image annotation model according to the total loss until convergence conditions are reached.
In this embodiment, back propagation of the loss uses a momentum-based stochastic gradient descent method, with momentum factor momentum = 0.9.
As shown in fig. 2, the image annotation model training method further includes:
s104, testing the target image annotation model, updating the balance factor according to the test result, and retraining the target image annotation model until the accuracy of the target image annotation model is greater than a preset threshold.
In this embodiment, the model is tested with online data and the test results are analyzed: for mislabeled cases, corresponding positive and negative samples are supplemented into the training set, atypical samples detrimental to model training are removed, the balance factor of the error loss is updated, and the model is retrained. This data iteration is repeated several times until the accuracy of the model meets production requirements, at which point training stops. The target image annotation model is packaged and deployed on the TorchServe model-serving framework, and the service interface is developed with the Gunicorn and Flask frameworks.
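By way of illustration, a service interface of the kind described could look like the minimal Flask endpoint below; the route name, preprocessing, label list, and the TorchScript packaging of the model are assumptions, and in the described deployment the model itself would be served by TorchServe with Gunicorn in front.

```python
import io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import transforms

app = Flask(__name__)
model = torch.jit.load("image_annotation_model.pt").eval()  # hypothetical packaged model
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # input size fixed at 224x224 as described
    transforms.ToTensor(),
])
LABELS = ["river", "lake", "waterfall", "ocean", "non-target"]  # illustrative N+1 labels

@app.route("/annotate", methods=["POST"])
def annotate():
    img = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    with torch.no_grad():
        probs = model(preprocess(img).unsqueeze(0)).softmax(dim=1)[0]
    idx = int(probs.argmax())
    return jsonify({"label": LABELS[idx], "confidence": float(probs[idx])})

# Run behind Gunicorn, e.g.: gunicorn -w 4 app:app
```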
According to this embodiment, a non-target label is added to the image classification label system to construct the training data set, an image annotation model is built from a residual network and an attention mechanism, and pre-training model weights are loaded based on transfer learning. The model is trained on massive image data that includes both images belonging to the target categories, labeled with target labels, and images not belonging to the target categories, labeled with the non-target label. During training, the error loss and constraint loss of the model are calculated and the model weights are optimized with a momentum-based stochastic gradient descent method. This realizes automatic identification and labeling of images that do not belong to the target categories within massive image collections, greatly saves labor cost, and greatly improves the accuracy of image identification and labeling. It also facilitates subsequent secondary development based on the labeling results, so that more and better-quality images can be selected for display during product development, further improving user experience.
Example 2
As shown in fig. 3, the present embodiment provides an image labeling method, which includes:
S201, obtaining image data to be marked;
S202, inputting the image data to be annotated into a target image annotation model obtained by using the image annotation model training method of the embodiment 1, and obtaining an annotation result of the image data to be annotated.
In this embodiment, the target image annotation model is utilized so that image data not belonging to the target categories can be automatically identified and annotated within large numbers of pictures.
Example 3
As shown in fig. 4, the present embodiment provides an image annotation model training system, which includes:
the data set construction module 1 is used for acquiring image data and constructing a training data set, wherein the training data set comprises image data marked by preset classification labels; the classification labels comprise a plurality of different target labels and a non-target label; the non-target label is of a different category than the target labels.
Specifically, the plurality of different target labels are used for labeling and identifying images from specific domains; they can be scene labels such as river, lake, waterfall, ocean and seabed, or animal labels such as cat, dog, horse and sheep. The non-target label is different from the target labels and is used for labeling the massive pictures that do not belong to any category corresponding to a target label. Image data tagged with the target labels and the non-target label can be acquired in a variety of ways, including collecting images with crawler techniques, accumulating related image data over time, and supplementing the data through manual labeling.
The model construction module 2 is used for adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of a feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected;
The model training module 3 is used for inputting the training data set into the image annotation model for training to obtain the target image annotation model.
Specifically, the model construction module 2 is further configured to use a residual network as the base network, where the residual network comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer; the model construction module 2 is further configured to add an attention mechanism module after the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer, respectively.
In the training stage, the image annotation model loads pre-training weights based on transfer learning. The closer a structural change is to the front of the model, the more it disturbs the original model structure and the less effectively the pre-training weights can be reused; therefore, no attention mechanism module is added after the first convolution layer, and the modules are added only after the second, third, fourth and fifth convolution layers.
Specifically, Wide ResNet is selected as the residual network, the input picture size is fixed at 224×224, and the number of output nodes of the fully connected layer is set to N+1, corresponding to the N target labels and 1 non-target label.
The model construction module 2 is further configured to input the first feature map output by the convolution layer to the attention mechanism module to obtain an attention weight feature map, to determine the second feature map output by the attention mechanism module from the first feature map and the attention weight feature map, and to input the second feature map to the next convolution layer or to the fully connected layer. Specifically, the first feature map input to the attention mechanism module by the convolution layer is denoted F_in, with dimensions [C, H, W], where C is the number of channels of the feature map, H is its height and W is its width. In the channel branch, F_in is average-pooled and passed through an MLP (Multi-Layer Perceptron) with one hidden layer and a BN (Batch Normalization) layer to obtain a descriptor of size [C, 1, 1], which is then expanded into a [C, H, W] feature map denoted M_c; the value at any position of each H×W channel map in M_c equals the value of the corresponding channel of the original descriptor. In the spatial branch, F_in is first reduced in dimension by a 1×1 convolution with reduction ratio r, context information is exploited by two 3×3 dilated convolutions, the feature map is then reduced to [1, H, W] by another 1×1 convolution and regularized by a BN layer, and finally this single-channel feature map is copied and expanded into a [C, H, W] feature map denoted M_s. Adding M_c and M_s and applying a sigmoid gives the final attention weight feature map M_total:

M_c(F_in) = BN(MLP(AvgPool(F_in)))

M_s(F_in) = BN(f_1×1(f_3×3(f_3×3(f_1×1(F_in)))))

M_total(F_in) = σ(M_c(F_in) + M_s(F_in))

where f denotes a convolution operation with the indicated kernel size and σ denotes the sigmoid function.

The attention weight feature map M_total and the first feature map F_in are multiplied element-wise, and F_in is added to give the finally adjusted second feature map F_out:

F_out = F_in + F_in ⊗ M_total(F_in)

where ⊗ denotes element-wise multiplication.
The model training module 3 is further configured to pre-train the image annotation model using transfer learning to obtain pre-training weights, and to load the pre-training weights to adjust the parameters of the image annotation model.
In this embodiment, transfer learning is performed based on a pre-training model trained on the public scene-classification dataset Places365, and the pre-training weights of all layers except the fully connected layer are loaded. The weights in the second, third, fourth and fifth convolution layers are fine-tuned, with the initial learning rate set to 0.001 for the second and third convolution layers and to 0.002 for the fourth and fifth convolution layers; the weights of the 4 attention mechanism modules and of the fully connected layer of the residual network are trained from scratch with an initial learning rate of 0.01; the weights in the other layers are frozen and not updated. During training, the learning rate is halved every 5 epochs.
The model training module 3 is also used for inputting a training data set into the image annotation model to obtain a model output result; the model training module 3 is further used for calculating error loss of the image annotation model by using the first loss function according to the model output result and the balance factor; the balance factor is the ratio of the number of samples labeled by each classification label in the training data set to the total number of samples in the training data set.
The model output is Y = {y_1, y_2, ..., y_{N+1}} and the balance factor is A = {α_1, α_2, ..., α_{N+1}}. Using focal loss as the first loss function, the error loss over the N target labels and the non-target label is expressed as loss_fl:

loss_fl = -α_label · (1 - y_label)^γ · log(y_label)

where the focusing parameter γ = 2, and label denotes the true label index of the picture, an integer in the range [1, N+1].
The model training module 3 is further used for calculating constraint loss of the image annotation model by using a second loss function according to the model output result;
Model training module 3 is also used to determine the total loss of the image annotation model from the error loss and constraint loss.
Using ring loss as the second loss function, with target modulus length R initialized to the mean feature-vector norm after the first training iteration, the constraint loss is expressed as loss_rl:

loss_rl = (1 / (2m)) · Σ_i (||F(x_i)||_2 - R)^2

where F(x_i) is the feature vector of sample x_i entering the fully connected layer and m is the batch size.

The total loss loss_total of the image annotation model is a weighted sum of the two loss functions:

loss_total = loss_fl + λ · loss_rl

where λ is a weight factor, taking the value 0.01.
The model training module 3 is further configured to adjust parameters of the image annotation model according to the total loss until convergence conditions are reached.
In this embodiment, back propagation of the loss uses a momentum-based stochastic gradient descent method, with momentum factor momentum = 0.9.
The image annotation model training system further comprises:
And the test module 4 is used for testing the target image annotation model, updating the balance factor according to the test result, and retraining the target image annotation model until the accuracy of the target image annotation model is greater than a preset threshold.
In this embodiment, the model is tested with online data and the test results are analyzed: for mislabeled cases, corresponding positive and negative samples are supplemented into the training set, atypical samples detrimental to model training are removed, the balance factor of the error loss is updated, and the model is retrained. This data iteration is repeated several times until the accuracy of the model meets production requirements, at which point training stops. The target image annotation model is packaged and deployed on the TorchServe model-serving framework, and the service interface is developed with the Gunicorn and Flask frameworks.
According to this embodiment, a non-target label is added to the image classification label system to construct the training data set, an image annotation model is built from a residual network and an attention mechanism, and pre-training model weights are loaded based on transfer learning. The model is trained on massive image data that includes both images belonging to the target categories, labeled with target labels, and images not belonging to the target categories, labeled with the non-target label. During training, the error loss and constraint loss of the model are calculated and the model weights are optimized with a momentum-based stochastic gradient descent method. This realizes automatic identification and labeling of images that do not belong to the target categories within massive image collections, greatly saves labor cost, and greatly improves the accuracy of image identification and labeling. It also facilitates subsequent secondary development based on the labeling results, so that more and better-quality images can be selected for display during product development, further improving user experience.
Example 4
As shown in fig. 5, the present invention further provides an image labeling system, where the image labeling system includes:
the image acquisition module 5 is used for acquiring image data to be annotated;
The image labeling module 6 is configured to input the image data to be labeled into a target image labeling model obtained by using the image labeling model training system of embodiment 3, so as to obtain a labeling result of the image data to be labeled.
In this embodiment, the target image annotation model is utilized so that image data not belonging to the target categories can be automatically identified and annotated within large numbers of pictures.
Example 5
Fig. 6 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention. The electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, which when executed implements the image annotation model training method of embodiment 1 or the image annotation method of embodiment 2. The electronic device 30 shown in fig. 6 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 6, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be a server device, for example. Components of electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, a bus 33 connecting the different system components, including the memory 32 and the processor 31.
The bus 33 includes a data bus, an address bus, and a control bus.
Memory 32 may include volatile memory such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 31 executes a computer program stored in the memory 32 to thereby perform various functional applications and data processing, such as the image annotation model training method of embodiment 1 or the image annotation method of embodiment 2 of the present invention.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., keyboard, pointing device, etc.). Such communication may be through an input/output (I/O) interface 35. The electronic device 30 may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the internet, via the network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with the electronic device 30, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of an electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more units/modules described above may be embodied in one unit/module in accordance with embodiments of the present invention. Conversely, the features and functions of one unit/module described above may be further divided into ones that are embodied by a plurality of units/modules.
Example 6
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image annotation model training method of embodiment 1 or the image annotation method of embodiment 2.
More specifically, among others, readable storage media may be employed including, but not limited to: portable disk, hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of implementing the image annotation model training method of example 1 or the image annotation method of example 2, when said program product is run on the terminal device.
Wherein the program code for carrying out the invention may be written in any combination of one or more programming languages, which program code may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on the remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (8)

1. An image annotation model training method, characterized by comprising the following steps:
acquiring image data and constructing a training data set, wherein the training data set comprises the image data marked by a preset classification label; the classification labels comprise a plurality of different target labels and a non-target label; the non-target label is of a different category than the target labels;
adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of a feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected;
inputting the training data set into the image annotation model for training to obtain a target image annotation model;
the residual network structure comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer;
the step of adding the attention mechanism module after the convolution layer included in the residual network structure comprises the following steps:
Respectively adding an attention mechanism module after the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer;
The step of inputting the training data set into the image annotation model for training to obtain a target image annotation model comprises the following steps:
Testing the target image annotation model, updating a balance factor according to a test result, and retraining the target image annotation model until the accuracy of the target image annotation model is greater than a preset threshold;
Inputting the training data set into the image annotation model to obtain a model output result;
calculating error loss of the image annotation model by using a first loss function according to the model output result and the balance factor;
the balance factor is the ratio of the number of samples marked by each classification label in the training data set to the total number of samples in the training data set.
2. The image annotation model training method according to claim 1, wherein the step of adding an attention mechanism module after the convolution layer included in the residual network structure comprises:
inputting the first feature map output by the convolution layer to the attention mechanism module to obtain an attention weight feature map;
and determining a second feature map output by the attention mechanism module according to the first feature map and the attention weight feature map.
3. The method for training an image annotation model of claim 2, wherein the step of inputting the training dataset into the image annotation model for training to obtain a target image annotation model comprises:
calculating constraint loss of the image annotation model by using a second loss function according to the model output result;
Determining a total loss of the image annotation model according to the error loss and the constraint loss;
and adjusting parameters of the image annotation model according to the total loss until convergence conditions are reached.
4. An image labeling method, characterized in that the image labeling method comprises the following steps:
Acquiring image data to be annotated;
inputting the image data to be annotated into a target image annotation model obtained by the image annotation model training method according to any one of claims 1-3, and obtaining an annotation result of the image data to be annotated.
5. An image annotation model training system, comprising:
The data set construction module is used for acquiring image data and constructing a training data set, wherein the training data set comprises the image data marked by a preset classification label; the classification labels comprise a plurality of different target labels and a non-target label; the non-target label is of a different category than the target labels;
The model construction module is used for adding an attention mechanism module after a convolution layer included in a residual network structure to construct an image annotation model, wherein the attention mechanism module is used for adjusting different channels and regions of a feature map output by the convolution layer, and the residual network structure comprises at least one convolution layer and one fully connected layer which are sequentially connected; the residual network structure comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer; the model construction module is further configured to add an attention mechanism module after the second convolution layer, the third convolution layer, the fourth convolution layer and the fifth convolution layer, respectively;
the model training module is used for inputting the training data set into the image annotation model for training to obtain a target image annotation model;
The test module is used for testing the target image annotation model, updating a balance factor according to a test result, and retraining the target image annotation model until the accuracy of the target image annotation model is greater than a preset threshold;
the model training module is also used for inputting the training data set into the image annotation model to obtain a model output result;
the model training module is also used for calculating error loss of the image annotation model by using a first loss function according to the model output result and the balance factor;
the balance factor is the ratio of the number of samples marked by each classification label in the training data set to the total number of samples in the training data set.
6. An image annotation system, the image annotation system comprising:
the image acquisition module is used for acquiring image data to be annotated;
The image labeling module is used for inputting the image data to be labeled into a target image labeling model obtained by using the image labeling model training method according to any one of claims 1 to 3, and obtaining a labeling result of the image data to be labeled.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image annotation model training method of any of claims 1 to 3 or the image annotation method of claim 4 when the computer program is executed by the processor.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the image annotation model training method of any one of claims 1 to 3 or the image annotation method of claim 4.
CN202110321391.7A 2021-03-25 2021-03-25 Image annotation model training and image annotation method, system, equipment and medium Active CN113065013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321391.7A CN113065013B (en) 2021-03-25 2021-03-25 Image annotation model training and image annotation method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321391.7A CN113065013B (en) 2021-03-25 2021-03-25 Image annotation model training and image annotation method, system, equipment and medium

Publications (2)

Publication Number Publication Date
CN113065013A CN113065013A (en) 2021-07-02
CN113065013B true CN113065013B (en) 2024-05-03

Family

ID=76563512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321391.7A Active CN113065013B (en) 2021-03-25 2021-03-25 Image annotation model training and image annotation method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN113065013B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519404B (en) * 2022-04-20 2022-07-12 四川万网鑫成信息科技有限公司 Image sample classification labeling method, device, equipment and storage medium
CN114821207B (en) * 2022-06-30 2022-11-04 浙江凤凰云睿科技有限公司 Image classification method and device, storage medium and terminal
CN117671678A (en) * 2022-08-29 2024-03-08 华为技术有限公司 Image labeling method and device
CN116432770B (en) * 2023-02-28 2023-10-20 阿里巴巴(中国)有限公司 Model training, reasoning and construction method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334715A (en) * 2019-07-04 2019-10-15 电子科技大学 A kind of SAR target identification method paying attention to network based on residual error
CN110503154A (en) * 2019-08-27 2019-11-26 携程计算机技术(上海)有限公司 Method, system, electronic equipment and the storage medium of image classification
CN110991511A (en) * 2019-11-26 2020-04-10 中原工学院 Sunflower crop seed sorting method based on deep convolutional neural network
CN111259982A (en) * 2020-02-13 2020-06-09 苏州大学 Premature infant retina image classification method and device based on attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334715A (en) * 2019-07-04 2019-10-15 电子科技大学 A kind of SAR target identification method paying attention to network based on residual error
CN110503154A (en) * 2019-08-27 2019-11-26 携程计算机技术(上海)有限公司 Method, system, electronic equipment and the storage medium of image classification
CN110991511A (en) * 2019-11-26 2020-04-10 中原工学院 Sunflower crop seed sorting method based on deep convolutional neural network
CN111259982A (en) * 2020-02-13 2020-06-09 苏州大学 Premature infant retina image classification method and device based on attention mechanism

Also Published As

Publication number Publication date
CN113065013A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN113065013B (en) Image annotation model training and image annotation method, system, equipment and medium
US11586880B2 (en) System and method for multi-horizon time series forecasting with dynamic temporal context learning
US9990558B2 (en) Generating image features based on robust feature-learning
US11928600B2 (en) Sequence-to-sequence prediction using a neural network model
CN114241282B (en) Knowledge distillation-based edge equipment scene recognition method and device
CN111797893B (en) Neural network training method, image classification system and related equipment
US20190370659A1 (en) Optimizing neural network architectures
CN110728317A (en) Training method and system of decision tree model, storage medium and prediction method
CN108288067A (en) Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN116702843A (en) Projection neural network
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
US20220245424A1 (en) Microgenre-based hyper-personalization with multi-modal machine learning
WO2022105108A1 (en) Network data classification method, apparatus, and device, and readable storage medium
CN113705603A (en) Incomplete multi-view data clustering method and electronic equipment
WO2023207411A1 (en) Traffic determination method and apparatus based on spatio-temporal data, and device and medium
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN113065443A (en) Training method, recognition method, system, device and medium of image recognition model
CN112380427B (en) User interest prediction method based on iterative graph attention network and electronic device
CN112784157A (en) Training method of behavior prediction model, behavior prediction method, device and equipment
CN113609337A (en) Pre-training method, device, equipment and medium of graph neural network
CN111832435A (en) Beauty prediction method and device based on migration and weak supervision and storage medium
CN110704650A (en) OTA picture tag identification method, electronic device and medium
CN114821248B (en) Point cloud understanding-oriented data active screening and labeling method and device
US20230196067A1 (en) Optimal knowledge distillation scheme
CN112633246A (en) Multi-scene recognition method, system, device and storage medium in open scene

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant