CN117671678A - Image labeling method and device - Google Patents
Image labeling method and device
- Publication number
- CN117671678A (application CN202211042115.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- training
- feature
- language
- target image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/757—Matching configurations of points or features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses an image labeling method and device, belonging to the technical field of computer vision. After a computer device acquires a training image set corresponding to a target image task and a language description text set corresponding to the training image set, it invokes a target image annotation model to determine a prediction label for each training sample image. The prediction label of a training sample image is obtained by the target image annotation model based on the feature matching result between the image features of that training sample image and the language features of each language description text in the language description text set. The target image annotation model is then trained according to the errors between the real labels and the prediction labels of the training sample images in the training image set until the model converges. By converting the image task into a task of matching image features with language features, the method and device greatly improve the initial labeling performance of the image annotation model and thereby improve image labeling efficiency.
Description
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a method and apparatus for labeling images.
Background
In the development of image-oriented artificial intelligence (AI) applications, training images must be labeled manually, one by one. Because a successful AI model requires thousands or even millions of accurately labeled training images, the image labeling task is often time-consuming and costly.
Intelligent image annotation is one of the most practical techniques for developing image-oriented AI applications. Starting from a small number of labeled images, it uses an AI algorithm to rapidly label the remaining images automatically. With intelligent image labeling technology, a user can save a large amount of image labeling cost. When intelligent image labeling is adopted, improving the image labeling efficiency is a problem that currently needs to be solved.
Disclosure of Invention
The application provides an image labeling method and device, which can improve image labeling efficiency.
In a first aspect, an image annotation method is provided. The method is performed by a computer device and comprises the following steps: acquiring a training image set corresponding to a target image task and a language description text set corresponding to the training image set, wherein the training image set includes a plurality of training sample images, each of which is labeled with a real label, the language description text set includes a plurality of language description texts in one-to-one correspondence with the multiple classes of labels corresponding to the training image set, and each language description text describes the semantics of one class of labels among the multiple classes; invoking a target image annotation model corresponding to the target image task to determine a prediction label for each training sample image in the training image set, the prediction label of a training sample image being obtained by the target image annotation model based on the feature matching result between the image features of that training sample image and the language features of each language description text in the language description text set; and training the target image annotation model according to the errors between the real labels and the prediction labels of the plurality of training sample images in the training image set until the target image annotation model converges. The target image annotation model is used to determine the annotation labels of images to be annotated under the target image task.
The image features corresponding to a training sample image are obtained by performing feature extraction on that training sample image. The language features corresponding to a language description text are obtained by performing feature extraction on that language description text. Model convergence may mean that the loss value of the model is smaller than a preset threshold, that the weight change between two adjacent training iterations is smaller than a preset threshold, or that the number of training iterations reaches a preset number.
In this application, by acquiring the language description text set corresponding to the training image set, the computer device introduces prior language knowledge associated with the labeling task for the training sample images, and converts the image task into a task of matching image features with language features inside the image labeling model. Therefore, the initial labeling performance of the image labeling model can be greatly improved, the first-round labeling accuracy is improved even when the number of initial training sample images is small, and the number of training rounds is effectively reduced, thereby improving image labeling efficiency.
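Purely as an illustration of this matching idea (not part of the claims), the following sketch assigns an image the label whose language description feature is most similar to the image feature. The encoder functions, labels, and template here are hypothetical stand-ins.

```python
# Minimal sketch (assumed, not from the patent): labeling as image-text feature matching.
# "encode_image" / "encode_text" stand in for a pre-trained vision-language encoder.
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image: np.ndarray) -> np.ndarray:
    # Placeholder image encoder: any function mapping an image to a feature vector.
    return rng.normal(size=512)

def encode_text(text: str) -> np.ndarray:
    # Placeholder text encoder for a language description text.
    return rng.normal(size=512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

labels = ["dog", "cat"]                                  # label classes of the training set
texts = [f"a photo of {{{lbl}}}" for lbl in labels]      # one language description text per label
language_features = [encode_text(t) for t in texts]      # language feature set

image = np.zeros((224, 224, 3))                          # a training sample or unlabeled image
image_feature = encode_image(image)

similarities = [cosine(image_feature, lf) for lf in language_features]
predicted_label = labels[int(np.argmax(similarities))]   # label of the best-matching description
```

In practice the two encoders would be the image and text branches of a pre-trained visual language model, as described in the embodiments below.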
Optionally, after the target image annotation model converges, the computer device may also invoke the target image annotation model to determine prediction labels for a plurality of verification sample images in a verification image set. The verification image set includes a plurality of verification sample images, each of which is labeled with a real label. The computer device determines the labeling accuracy of the target image annotation model from the real labels and the prediction labels of the plurality of verification sample images. When the labeling accuracy of the target image annotation model does not reach a preset threshold, the computer device executes one or more model training processes until the labeling accuracy reaches the preset threshold. The model training process includes: invoking the target image annotation model to determine prediction labels for a plurality of images to be annotated and the confidence of each prediction label; obtaining, from the plurality of images to be annotated, hard-to-label images whose prediction label confidence is lower than a confidence threshold; outputting the hard-to-label images and their prediction labels for manual correction; in response to receiving a manual annotation result for a hard-to-label image, adding the hard-to-label image to the training image set as a new training sample image to obtain an updated training image set; invoking the target image annotation model to determine the prediction label of each training sample image in the updated training image set, the prediction label being obtained based on the feature matching result between the image features of the training sample image and the language features of each language description text in the language description text set corresponding to the updated training image set; and training the target image annotation model according to the errors between the real labels and the prediction labels of the plurality of training sample images in the updated training image set until the target image annotation model converges again.
Optionally, when the labeling accuracy of the target image labeling model reaches a preset threshold, the computer device uses the prediction label of the image to be labeled, which is determined by calling the target image labeling model, as the labeling label of the image to be labeled.
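The optional accuracy-driven procedure above can be summarized, under assumptions, as the following loop sketch; the callables (train, accuracy_on_verification, predict_with_confidence, manual_correct) are hypothetical placeholders for the steps described in the preceding paragraphs.

```python
# Sketch (assumed) of the verification-driven labeling loop: retrain on hard-to-label
# images corrected by a human until the model reaches the target accuracy.
from typing import Callable, List, Tuple

def annotation_loop(
    model,
    train: Callable[[object, List], object],                  # trains the model on a training set
    accuracy_on_verification: Callable[[object], float],      # accuracy on the verification set
    predict_with_confidence: Callable[[object, List], List[Tuple[object, str, float]]],
    manual_correct: Callable[[List], List],                   # human corrects labels of hard images
    training_set: List,
    unlabeled_images: List,
    accuracy_threshold: float = 0.95,
    confidence_threshold: float = 0.5,
):
    model = train(model, training_set)
    while accuracy_on_verification(model) < accuracy_threshold:
        predictions = predict_with_confidence(model, unlabeled_images)
        hard = [(img, lbl) for img, lbl, conf in predictions if conf < confidence_threshold]
        corrected = manual_correct(hard)           # manual correction of predicted labels
        training_set = training_set + corrected    # updated training image set
        model = train(model, training_set)         # retrain until convergence
    return model
```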
Optionally, the target image annotation model includes a first feature extraction layer, a second feature extraction layer, and a feature matching layer. The first feature extraction layer includes an image feature output and a language feature output. The feature matching layer includes an image feature input and a language feature input. The image feature output is connected to the input of the second feature extraction layer, the language feature output is connected to the language feature input, and the output of the second feature extraction layer is connected to the image feature input. The computer device invokes the target image annotation model corresponding to the target image task and determines the prediction label of each training sample image in the training image set as follows. The computer device performs feature extraction on each language description text in the language description text set through the first feature extraction layer to obtain a language feature set, where the language feature set includes a plurality of groups of language features and each group is the language features corresponding to one language description text in the set. For each training sample image in the training image set, the computer device performs feature extraction on the training sample image through the first feature extraction layer to obtain the global image features corresponding to the training sample image, and performs feature extraction on the global image features through the second feature extraction layer to obtain target image features, which are associated with the target image task. The feature matching layer then matches the target image features against each group of language features in the language feature set to obtain a feature matching result, which includes the feature similarity between the target image features and each group of language features. The label described by the target language description text is used as the prediction label of the training sample image, where the target language description text is the language description text corresponding to the group of language features in the language feature set with the highest feature similarity to the target image features.
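A minimal, illustrative sketch of this layer wiring (not the actual network of this application): the first feature extraction layer is stubbed with simple image and text branches, the second feature extraction layer is a small task-specific network, and the feature matching layer computes similarities. All module shapes and names are assumptions.

```python
# Sketch (assumed) of the target image annotation model wiring:
# first feature extraction layer -> second feature extraction layer -> feature matching layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstFeatureExtractionLayer(nn.Module):
    """Stand-in for a pre-trained vision-language model with image (m1) and text (m2) outputs."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.image_proj = nn.Linear(3 * 224 * 224, dim)    # placeholder image branch
        self.text_proj = nn.Embedding(10000, dim)           # placeholder text branch

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:       # output m1
        return self.image_proj(images.flatten(1))

    def encode_text(self, token_ids: torch.Tensor) -> torch.Tensor:     # output m2
        return self.text_proj(token_ids).mean(dim=1)

class TargetImageAnnotationModel(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.first_layer = FirstFeatureExtractionLayer(dim)
        self.second_layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, images: torch.Tensor, text_token_ids: torch.Tensor) -> torch.Tensor:
        global_image_features = self.first_layer.encode_image(images)       # task-agnostic
        target_image_features = self.second_layer(global_image_features)    # task-specific
        language_features = self.first_layer.encode_text(text_token_ids)
        # Feature matching layer: cosine similarity between image and language features.
        logits = F.normalize(target_image_features, dim=-1) @ F.normalize(language_features, dim=-1).T
        return logits   # shape: (num_images, num_label_classes)
```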
Optionally, the target image annotation model further comprises a supervision module in which a loss function matched with the target image task is set. The input of the supervision module is connected to the output of the feature matching layer. The computer device trains the target image annotation model according to the errors between the real labels and the prediction labels of the plurality of training sample images in the training image set as follows: the computer device calculates, through the supervision module, a loss value of the loss function based on the real labels and the prediction labels of the plurality of training sample images in the training image set, and back-propagates gradient information of the loss function to the second feature extraction layer to adjust the network parameters from the second feature extraction layer to the feature matching layer.
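Continuing the sketch above (still illustrative and under the same assumptions), a possible supervision step: the loss is computed on the matching result, and gradients only adjust the parameters after the frozen first feature extraction layer.

```python
# Sketch (assumed): supervision step that back-propagates only into the second feature
# extraction layer and the matching layer, leaving the pre-trained first layer frozen.
import torch
import torch.nn as nn

def training_step(model, images, text_token_ids, real_label_indices, optimizer):
    # Freeze the first (vision-language) feature extraction layer.
    for p in model.first_layer.parameters():
        p.requires_grad_(False)
    logits = model(images, text_token_ids)                            # feature matching result
    loss = nn.functional.cross_entropy(logits, real_label_indices)    # classification-style loss
    optimizer.zero_grad()
    loss.backward()                                                   # gradients stop at the frozen layer
    optimizer.step()                                                  # adjusts second-layer parameters only
    return loss.item()

# Usage sketch (hypothetical values, building on the model class sketched earlier):
# model = TargetImageAnnotationModel()
# optimizer = torch.optim.Adam(model.second_layer.parameters(), lr=1e-4)
# loss = training_step(model, images, text_token_ids, real_label_indices, optimizer)
```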
Optionally, the image annotation model framework is pre-stored in the computer device. The image annotation model framework comprises a first feature extraction layer, a downstream task model head and a feature matching layer. The image feature output end of the first feature extraction layer is connected with the input end of the downstream task model head. The output end of the downstream task model head is connected with the image feature input end of the feature matching layer. The downstream task model head includes a plurality of feature extraction layers in one-to-one correspondence with a plurality of image tasks. The second feature extraction layer is a feature extraction layer corresponding to the target image task in the downstream task model head. Wherein the image feature output of the first feature extraction layer is configured to connect one feature extraction layer at a time in the downstream task model head.
In this application, a downstream task model head is designed into the image annotation model framework, so that the framework can adapt to diverse downstream requirements, and a corresponding image annotation model can be constructed simply by selecting a different feature extraction layer in the downstream task model head. Because the image annotation model framework can be shared by a plurality of image tasks, a dedicated image annotation model does not need to be designed for each image task; this enables unified, standardized management of intelligent image annotation, reduces the complexity of development and maintenance, and lowers the technical cost.
Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training. The visual language pre-training model can perform feature extraction on an input image to obtain image features, and can also perform feature extraction on an input language description text to obtain language features.
Optionally, the implementation manner of obtaining the language description text set corresponding to the training image set by the computer device includes: in response to receiving a launch instruction for a target image task, the computer device displays a template setting prompt for prompting a user to set a language description template corresponding to the target image task. For each type of label corresponding to the training image set, the computer equipment generates a language description text according to the set language description template and the label.
According to the method and the device, the man-machine interaction interface is provided, so that a user can manually set the language description template corresponding to the image task, the accuracy of semantic expression of the language description text on the label can be improved, the accuracy of the language features obtained by extracting the language description text is further improved, and the accuracy of the image annotation model is improved.
Alternatively, the computer device may display the language description text after generating the language description text.
In the application, the computer equipment enables a user to check whether the generated language description text can accurately express the meaning of the label or not by displaying the language description text so as to enable the user to adjust the language description template.
Optionally, the computer device may further display a plurality of image tasks, the target image task being one of the plurality of image tasks. In response to detecting a selection operation of the target image task, the computer device determines that a launch instruction for the target image task is received.
Optionally, the plurality of image tasks includes, but is not limited to, one or more of image classification, object detection, or motion recognition.
In a second aspect, an image annotation device is provided. The apparatus comprises a plurality of functional modules that interact to implement the method of the first aspect and embodiments thereof described above. The plurality of functional modules may be implemented based on software, hardware, or a combination of software and hardware, and the plurality of functional modules may be arbitrarily combined or divided based on the specific implementation.
In a third aspect, there is provided a computer device comprising: a processor and a memory;
the memory is used for storing a computer program, and the computer program comprises program instructions;
the processor is configured to invoke the computer program to implement the method in the first aspect and embodiments thereof.
In a fourth aspect, a computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the method of the first aspect and embodiments thereof described above.
In a fifth aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of the first aspect and embodiments thereof described above.
In a sixth aspect, a chip is provided, the chip comprising programmable logic circuits and/or program instructions, which when the chip is run, implement the method of the first aspect and embodiments thereof described above.
Drawings
FIG. 1 is a schematic diagram of an image annotation model framework provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a display interface according to an embodiment of the present disclosure;
FIG. 4 is a schematic view of another display interface provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a labeling model for a target image according to an embodiment of the present application;
fig. 6 is a schematic architecture diagram related to an image labeling method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image labeling device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another image labeling apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic hardware structure of an image labeling device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Intelligent image annotation is a technology for automatically annotating unlabeled images by using an AI algorithm based on a small number of annotated images. By means of intelligent image annotation, the user can reduce the image data annotation cost by 50% -90%. Currently, the basic flow of intelligent image annotation comprises the following five steps.
And step 1, training to obtain an AI model corresponding to the image task based on a training image set, wherein the training image set comprises a plurality of training sample images marked with real labels.
And step 2, reasoning the image to be marked based on the AI model to obtain a prediction label (also called a pseudo label).
And step 3, screening valuable samples and predictive labels thereof from the inferred images according to a confidence level strategy so as to be used for manual correction labeling.
And 4, adding the image subjected to the manual correction annotation as a new training sample image into a training image set, and re-optimizing the AI model based on the updated training image set.
And 5, repeating iterative optimization in the steps 1-4 until a high-accuracy AI model is obtained, then using the AI model for reasoning of all images to be marked, outputting a predicted label of the images to be marked, and taking the finally output predicted label as a final marking label corresponding to the images to be marked, so as to obtain an intelligent image marking result.
However, in the existing intelligent image labeling technology, when the number of initial training sample images (images labeled with real labels) is small, the accuracy of the first round of model training is low and the confidence of the predicted labels obtained from AI model inference is unreliable. As a result, it is difficult to find valuable samples in the confidence-based image screening process (step 3), the number of model training rounds and manual correction rounds becomes large, and the existing image labeling efficiency is therefore low.
Based on the above, an embodiment of the application provides an image labeling method. First, a computer device acquires a training image set corresponding to an image task and a language description text set corresponding to the training image set. The training image set includes a plurality of training sample images labeled with real labels. The language description text set includes a plurality of language description texts that correspond one-to-one to the multiple classes of labels of the training image set; that is, the number of language description texts in the set equals the number of label classes of the training sample images. Each language description text describes the semantics of one class of labels corresponding to the training image set. The computer device then invokes an image annotation model corresponding to the image task to determine a prediction label for each training sample image in the training image set, the prediction label being obtained by the image annotation model based on the feature matching result between the image features of the training sample image and the language features of each language description text in the language description text set. Finally, the computer device trains the image annotation model according to the errors between the real labels and the prediction labels of the plurality of training sample images in the training image set until the image annotation model converges. The trained image annotation model, once its labeling accuracy reaches a preset threshold, is used to determine the annotation labels of images to be annotated under the image task. In this embodiment, by acquiring the language description text set corresponding to the training image set, prior language knowledge associated with the labeling task is introduced for the training sample images, and the image task is converted inside the image labeling model into a task of matching image features with language features. Therefore, the initial labeling performance of the image labeling model can be greatly improved, the first-round labeling accuracy is improved even when the number of initial training sample images is small, and the number of training rounds is effectively reduced, thereby improving image labeling efficiency.
In addition, when image-oriented AI application development is performed, image tasks are diversified. For example, the categories of image tasks include, but are not limited to, image classification, object detection, and motion recognition. At present, for each image task, an AI model needs to be designed respectively to train to obtain an image annotation model capable of automatically annotating images under the image task. AI models are designed for each image task, respectively, and the development cost and the maintenance cost are high.
Optionally, an embodiment of the present application provides an image annotation model framework suitable for multiple image tasks. The image annotation model framework comprises a first feature extraction layer, a downstream task model head and a feature matching layer. For example, fig. 1 is a schematic diagram of an image labeling model framework provided in an embodiment of the present application. As shown in fig. 1, the first feature extraction layer includes an image feature output terminal m1 and a language feature output terminal m2. The feature matching layer includes an image feature input n1 and a language feature input n2. The image feature output terminal m1 of the first feature extraction layer is connected with the input terminal of the downstream task model head. The language feature output end m2 of the first feature extraction layer is connected with the language feature input end n2 of the feature matching layer. The output end of the downstream task model head is connected with the image feature input end n1 of the feature matching layer. The image tasks corresponding to the downstream task model head comprise image classification, target detection and action recognition.
The first feature extraction layer is used for extracting features of the input language description text to obtain language features corresponding to the language description text. The first feature extraction layer is also used for carrying out feature extraction on the input image to obtain the global image feature corresponding to the image. Because the global image features obtained by the first feature extraction layer for extracting the features of the image can reflect the global features of the whole image, different image tasks can share the first feature extraction layer. Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training. The visual language pre-training model is a model obtained by training based on large-scale visual image data and corresponding language descriptions, and the training mode of the visual language pre-training model is not repeated here in the embodiment of the application.
The downstream task model head includes a plurality of feature extraction layers in one-to-one correspondence with a plurality of image tasks. Each feature extraction layer in the downstream task model head is used for extracting the image features under the corresponding image task respectively, namely, the image features extracted by the feature extraction layer in the downstream task model head are associated with the corresponding image task. For example, a feature extraction layer corresponding to an image classification task in the downstream task model head is used to extract features of an image region containing the classification object. The feature extraction layer corresponding to the target detection task in the downstream task model head is used for extracting the image region features containing the detection target and the position features of the image region containing the detection target. And the feature extraction layer corresponding to the action recognition task in the downstream task model head is used for extracting the image region features related to the object to be recognized. The image feature output of the first feature extraction layer is configured to connect one feature extraction layer at a time in the downstream task model head. When the image annotation model framework is used, a corresponding feature extraction layer can be selected in a downstream task model head according to an image task to be executed to construct an image annotation model corresponding to the image task, then when the constructed image annotation model is used, the selected feature extraction layer further performs feature extraction on the global image features output by the first feature extraction layer, and finally the extracted image features are output to the feature matching layer.
The feature matching layer is used for carrying out feature matching on the input image features and language features to obtain feature similarity between the image features and the language features. Under different image tasks, the functions of the feature matching layers are the same, so that the feature matching layers can be uniformly constructed for a plurality of image tasks, and then in the model training process, the network parameters of the feature matching layers can be automatically adjusted according to the actual image tasks.
In this embodiment of the application, a downstream task model head is designed into the image annotation model framework, so that the framework can adapt to diverse downstream requirements, and a corresponding image annotation model can be constructed simply by selecting a different feature extraction layer in the downstream task model head. Because the image annotation model framework can be shared by a plurality of image tasks, a dedicated image annotation model does not need to be designed for each image task; this enables unified, standardized management of intelligent image annotation, reduces the complexity of development and maintenance, and lowers the technical cost.
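As an illustrative sketch only (module names and sizes are assumed), the downstream task model head can be thought of as a collection of per-task feature extraction layers, from which one is selected at a time to build the task-specific annotation model.

```python
# Sketch (assumed) of the shared annotation model framework: one first feature extraction
# layer and one feature matching layer shared by all tasks, plus a per-task model head.
import torch.nn as nn

class DownstreamTaskModelHead(nn.Module):
    """Holds one feature extraction layer per image task; only one is wired in at a time."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.heads = nn.ModuleDict({
            "image_classification": nn.Linear(dim, dim),
            "target_detection": nn.Linear(dim, dim),   # would also predict box positions in practice
            "action_recognition": nn.Linear(dim, dim),
        })

    def forward(self, global_image_features, task: str):
        return self.heads[task](global_image_features)

def build_annotation_model(first_feature_extraction_layer, task_model_head, task: str):
    # Selecting one head from the shared framework yields the task-specific second layer.
    second_feature_extraction_layer = task_model_head.heads[task]
    return first_feature_extraction_layer, second_feature_extraction_layer
```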
The technical scheme provided by the application is described in detail from the aspects of application scenes, method flows, software devices, hardware devices and the like.
The application scenario of the embodiment of the present application is illustrated below.
The image labeling method provided by the embodiment of the application is used for computer equipment. The computer device may be a server, or a server cluster comprising a plurality of servers, or a cloud computing center. For example, the image annotation method can be applied to a cloud computing development platform as a web page (web) service. Alternatively, the image labeling method can also be applied to a server, and the user performs functional interaction through a software User Interface (UI).
The following is an example of a method flow of an embodiment of the present application.
For example, fig. 2 is a schematic flow chart of an image labeling method according to an embodiment of the present application. As shown in fig. 2, the method includes:
step 201, a computer device obtains a training image set corresponding to a target image task and a language description text set corresponding to the training image set.
The training image set comprises a plurality of training sample images, and each training sample image in the plurality of training sample images is marked with a real label. The actual labels of the training sample images may be manually labeled. The language description text set comprises a plurality of language description texts, and the plurality of language description texts are in one-to-one correspondence with the multi-class labels corresponding to the training image set, namely, the number of the language description texts in the language description text set is the same as the number of the label categories of the training sample images in the training image set. Each language description text is used to describe the semantics of one type of tag in the multiple types of tags.
Alternatively, the target image task may be image classification, target detection, or motion recognition.
For example, the target image task is image classification, specifically classifying animal pictures. The training image set includes two classes of training sample images: the real label of one class is dog and the real label of the other class is cat, i.e. the training image set corresponds to two classes of labels. The language description text set corresponding to the training image set includes two language description texts, one describing that the image contains a dog and the other describing that the image contains a cat. For example, the language description text may be in the format "a photo of { }", "a xx { }" (where xx is an adjective), or "this is a { }", where "{ }" is filled with the real label of a training sample image.
For another example, the target image task is target detection, specifically detecting left-over garbage, i.e. detecting garbage and pedestrians carrying garbage. The training image set includes two classes of training sample images: one class of training sample image contains only garbage and its real label is garbage; the other class contains pedestrians carrying garbage and its real labels include garbage and pedestrian, i.e. the training image set corresponds to two classes of labels. The language description text set corresponding to the training image set includes two language description texts, one describing that the image contains garbage and the other describing that the image contains a pedestrian. For example, the two language description texts may be "there are bags of {garbage}" and "a {person} on the road", respectively. The language description text under the target detection task may be in the format "detect: { }", "there is { } on the xx" (where xx is a noun), or "{ }, which is xx" (where xx is an adjective), where "{ }" is filled with the real label of a training sample image. It should be noted that, in addition to the real label, a training sample image under the target detection task may also be labeled with a real frame (ground truth, GT) position, which reflects the region of the detection target in the image. The real frame is typically a rectangular frame.
For another example, the target image task is action recognition, specifically recognizing decontamination actions of a person. The training image set includes three classes of training sample images: the real label of one class is foot washing, the real label of another class is hand washing, and the real label of the third class is disinfection, i.e. the training image set corresponds to three classes of labels. The language description text set corresponding to the training image set includes three language description texts, one describing that the person in the image is washing their feet, another describing that the person in the image is washing their hands, and the third describing that the person in the image is disinfecting. For example, the language description text may be in the format "the man is { }", "the human action of { }", or "this is { }, a frame of action", where "{ }" is filled with the real label of a training sample image.
Optionally, one implementation of obtaining a language description text set corresponding to the training image set by the computer device includes the following steps 2011 to 2012.
In step 2011, in response to receiving the start instruction for the target image task, the computer device displays a template setting prompt for prompting a user to set a language description template corresponding to the target image task.
Alternatively, the template setting prompt may include one or more alternative semantic description templates under the target image task. And/or the template setting prompt may include a custom control for the user to enter a semantic description template. For example, the target image task is image classification, and fig. 3 is a schematic diagram of a display interface provided in an embodiment of the present application. The display interface is a template setting interface. As shown in fig. 3, the display interface A includes a language description template option A1 and a custom control A2 corresponding to the target image task. The language description template option A1 includes two language description templates, one being "a photo of { }" and the other being "this is a { }". Custom control A2 includes an input box, an add option, and a confirm option. When the computer device detects a selection operation on the add option, the computer device takes the content entered in the input box as one language description template corresponding to the target image task and clears the input box so the user can add a new language description template. When the computer device detects a selection operation on the confirm option, the computer device takes the content entered in the input box as a language description template corresponding to the target image task and ends the template customization flow. When setting language description templates, the user may select templates provided by the computer device, define templates by themselves, or combine both; that is, the language description templates finally determined by the user for one image task may include templates provided by the computer device for that image task and/or user-defined templates.
Optionally, after the computer device obtains the language description template set by the user, the language description template can be finely adjusted according to the specific image task currently executed, so that the language description template can be more matched with the current image task, and the semantic description accuracy of the language description text generated later on to the label is improved. For example, the computer device currently performs an image task of identifying the type of flower in the picture, the language description template set by the user is "a photo of { }, and the computer device may fine tune the language description template to obtain a language description template of" a flower photo of { }, so as to more accurately express that the task is the task of classifying the flower.
Optionally, before the computer device displays the template setting prompt, the computer device displays a plurality of image tasks, the target image task being one of the plurality of image tasks. In response to detecting a selection operation of the target image task, the computer device determines that a launch instruction for the target image task is received. For example, fig. 4 is a schematic diagram of another display interface provided in an embodiment of the present application. The display interface is an image task starting interface. As shown in fig. 4, the display interface B includes an image task option including three image tasks, which are image classification, object detection, and action recognition, respectively. For example, the target image task is image classification, and when the computer device detects a selection operation of image classification through the display interface B, the computer device may display the display interface a as shown in fig. 3.
In step 2012, for each type of tag corresponding to the training image set, the computer device generates a language description text according to the set language description template and the tag.
For example, the task of the target image is image classification, the training image set includes two types of training sample images, one type of training sample image has a dog as a real tag, the other type of training sample image has a cat as a real tag, the language description template set by the user is "a photo of { }, and the computer device generates two language description texts, namely" a photo of { dog } "and" a photo of { cat }, respectively.
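A minimal sketch of this template-filling step; the helper name and the exact brace handling are assumptions, not from this application.

```python
# Sketch (assumed): generating the language description text set from a user-chosen
# template and the label classes of the training image set.
def build_language_description_texts(template: str, labels: list[str]) -> dict[str, str]:
    # The "{}" placeholder in the template is filled with each class label.
    return {label: template.format(label) for label in labels}

texts = build_language_description_texts("a photo of {}", ["dog", "cat"])
# {'dog': 'a photo of dog', 'cat': 'a photo of cat'}
```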
According to the embodiment of the application, the man-machine interaction interface is provided, so that a user can manually set the language description template corresponding to the image task, the accuracy of semantic expression of the language description text on the label can be improved, the accuracy of the language features obtained by extracting the language description text is further improved, and the accuracy of the image annotation model is improved.
Optionally, the computer device may also display the language description text after generating the language description text. The computer device allows the user to check whether the generated language description text can accurately express the meaning of the label by displaying the language description text so as to adjust the language description template.
Or under the condition that the user knows all the labels corresponding to the training image set, the user can set a language description text according to each label to obtain a language description text set corresponding to the training image set.
Step 202, the computer equipment calls a target image annotation model corresponding to the target image task, and determines a prediction label of each training sample image in the training image set.
The predictive label of the training sample image is obtained by a target image annotation model based on the feature matching result of the image feature corresponding to the training sample image and the language feature corresponding to each language description text in the language description text set.
Optionally, fig. 5 is a schematic structural diagram of a target image labeling model provided in an embodiment of the present application. As shown in fig. 5, the target image annotation model includes a first feature extraction layer, a second feature extraction layer, and a feature matching layer. The first feature extraction layer includes an image feature output terminal m1 and a language feature output terminal m2. The feature matching layer includes an image feature input n1 and a language feature input n2. The image feature output m1 of the first feature extraction layer is connected to the input of the second feature extraction layer. The language feature output end m2 of the first feature extraction layer is connected with the language feature input end n2 of the feature matching layer. The output end of the second feature extraction layer is connected with the image feature input end n1 of the feature matching layer.
Optionally, an image annotation model framework is pre-stored in the computer device, which may be shown in fig. 1, for example. The second feature extraction layer is a feature extraction layer corresponding to the target image task in the downstream task model head. Optionally, in response to receiving a start instruction for the target image task, the computer device selects a feature extraction layer corresponding to the target image task in the downstream task model head, and builds a target image annotation model.
Optionally, in conjunction with the target image annotation model shown in fig. 5, the implementation of step 202 described above may include steps 2021 to 2022 below.
In step 2021, the computer device performs feature extraction on each language description text in the set of language description texts through the first feature extraction layer, to obtain a set of language features.
The language feature set comprises a plurality of groups of language features, and each group of language features corresponds to one language description text in the language description text set. Optionally, the language description text set includes M language description texts, the language features corresponding to the m-th language description text are denoted L_m, and the language feature set may be expressed as L = {L_1, L_2, …, L_m, …, L_M}, where M is an integer greater than 1 and 1 ≤ m ≤ M. The first feature extraction layer performs feature extraction on each language description text in the input language description text set and then outputs the resulting language feature set to the feature matching layer.
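As a small illustrative sketch (helper names assumed), the language feature set can be computed once from the language description texts and then reused for every training sample image.

```python
# Sketch (assumed): the language feature set L = {L_1, ..., L_M} is extracted once by the
# first feature extraction layer and then reused for every training sample image.
import numpy as np

def build_language_feature_set(description_texts, text_encoder):
    # text_encoder is any function mapping a language description text to a feature vector.
    return np.stack([text_encoder(t) for t in description_texts])   # shape: (M, dim)
```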
In step 2022, for each training sample image in the training image set, the computer device performs a label prediction process separately, resulting in a predicted label for each training sample image.
The label prediction process includes the following steps S1 to S4.
In step S1, the computer device performs feature extraction on the training sample image through the first feature extraction layer, so as to obtain a global image feature corresponding to the training sample image.
Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training. The global image features are used to reflect global features of the whole image, which are independent of image tasks. And after the first feature extraction layer performs feature extraction on the input training sample image, outputting the obtained global image features to the second feature extraction layer.
In step S2, the computer device performs feature extraction on the global image feature through the second feature extraction layer, so as to obtain a target image feature, where the target image feature is associated with a target image task.
Optionally, the second feature extraction layer performs feature extraction on the global image features by using a feature extraction algorithm matched with the target image task. The second feature extraction layer is mainly used for extracting feature sets with strong category relevance from global image features. For example, the target image task is image classification or motion recognition, the second feature extraction layer is used to extract features of an image region containing the classified object or the object to be recognized. For example, if the target image task is target detection, the second feature extraction layer is configured to extract features of an image region including the detection target and features of a position of the image region including the detection target. The position features of the image area containing the detection target can be expressed by the prediction frame position. The prediction box is typically a rectangular box. And after the second feature extraction layer performs feature extraction on the input global image features, outputting the obtained target image features to the feature matching layer.
In step S3, the computer device performs feature matching on the target image feature and each group of language features in the language feature set through the feature matching layer, so as to obtain a feature matching result, where the feature matching result includes feature similarity between the target image feature and each group of language features in the language feature set.
Alternatively, the feature similarity between the image feature and the language feature may be cosine similarity or euclidean similarity, or the like.
Optionally, the target image task is image classification or motion recognition. For a training sample image in the training image set, the image features output by the second feature extraction layer are denoted F, and the language feature set is L = {L_1, L_2, …, L_m, …, L_M}.
The feature matching layer may express the feature matching result obtained from the input language feature set and the target image features as: Logits = func_similarity(F, L), where func_similarity is a function that calculates the feature similarity between the image features and each group of language features in the language feature set.
Optionally, the target image task is target detection. For a training sample image in the training image set, the second feature extraction layer may extract N region features, which may be expressed as O_F = {O_1, O_2, …, O_n, …, O_N}, where N is an integer greater than 1 and 1 ≤ n ≤ N. The language feature set is L = {L_1, L_2, …, L_m, …, L_M}. The feature matching layer may express the feature matching result obtained from the input language feature set and the target image features as: Logits = O_F · L^T, where L^T is the transpose of L. The feature matching result includes the feature similarity between each region feature and each group of language features.
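A minimal numpy sketch of this matching result (shapes assumed): Logits is the matrix product of the region features and the transposed language features, and the best-matching label per region is taken by a row-wise argmax.

```python
# Sketch (assumed) of the detection matching result: Logits = O_F · L^T, giving one row of
# similarities per extracted region feature.
import numpy as np

def detection_matching(region_features: np.ndarray, language_features: np.ndarray) -> np.ndarray:
    # region_features: O_F with shape (N, dim); language_features: L with shape (M, dim).
    return region_features @ language_features.T   # Logits with shape (N, M)

logits = detection_matching(np.ones((4, 512)), np.ones((3, 512)))
predicted_label_index_per_region = logits.argmax(axis=1)   # best-matching label per region
```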
In step S4, the computer device uses the label described by the target language description text as a prediction label of the training sample image, where the target language description text is a language description text corresponding to a group of language features with the highest feature similarity between the features of the target image in the language feature set.
For the target detection task, an image may contain a plurality of detection targets, and then the computer device uses, for each detection target in the image, a label described by a language description text corresponding to a group of language features with highest feature similarity between image features of the detection target in the language feature set as a prediction label of the detection target, and finally uses prediction labels of all detection targets in the image as prediction labels of the image. For example, it is necessary to detect garbage and pedestrians who carry garbage from an image, if the image contains only garbage, the predictive label of the image is garbage, and if the image contains pedestrians who carry garbage, the predictive label of the image includes both garbage and pedestrians.
Optionally, the step S4 may be completed by a feature matching layer, that is, after the feature matching layer performs feature matching on the image features and the language features to obtain feature matching results, the feature matching layer determines the prediction label of the training sample image based on the feature matching results, and the feature matching layer outputs the prediction label of the training sample image. Alternatively, the step S4 may be performed by other modules (e.g., a supervision module) in the image annotation model according to the output result of the feature matching layer. Or the step S4 may be separately determined by the computer device according to the output result of the feature matching layer. For the latter two cases, the feature matching layer performs feature matching on the image features and the language features to obtain feature matching results, and then directly outputs the feature matching results.
And 203, training the target image annotation model by the computer equipment according to errors between the real labels and the predicted labels of a plurality of training sample images in the training image set until the target image annotation model converges.
Optionally, referring to fig. 5, the target image labeling model further includes a supervision module, where a loss function matched with the target image task is set. The input end of the supervision module is connected with the output end of the characteristic matching layer. Referring to fig. 5, the feature matching layer may output a feature matching result to the supervision module, which determines a prediction label of the training sample image according to the input feature matching result.
Optionally, the implementation procedure of the above step 203 may include the following steps 2031 to 2032.
In step 2031, the computer device calculates, by the supervision module, a loss value of the loss function based on the real labels and the predictive labels of the plurality of training sample images in the training image set.
Optionally, the target image task is image classification or motion recognition. The loss function set in the supervision module can be expressed as: Loss = func_classification(Logits, G), where G is the real label of the training sample image, and the meaning of Logits refers to the relevant definition for the image classification or motion recognition case in step S3 above. func_classification is a function that computes the classification loss from the logits and the labels, such as a cross-entropy loss function, a focal loss function (a loss for mining difficult samples), or variants thereof.
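A minimal sketch of such a classification loss, assuming PyTorch's built-in cross entropy (a focal loss or other variant could be substituted; shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F_nn

def func_classification(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the label-similarity logits; targets are the indices of the real labels."""
    return F_nn.cross_entropy(logits, targets)

# A batch of 8 training sample images and M = 10 candidate labels (illustrative values).
loss = func_classification(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```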
Optionally, the target image task is target detection. The loss function provided in the supervision module consists of two parts: a classification loss and a localization loss. The classification loss function can be expressed as: Loss_class = func_class(Logits, T_c), where T_c is the set of real labels of the training sample image; assuming that the training sample image includes K detection targets, T_c = {C_1, …, C_k, …, C_K}, K is a positive integer, and 1 ≤ k ≤ K. The meaning of Logits refers to the relevant definition for the target detection case in step S3 above. The localization loss function can be expressed as: Loss_loc = func_iou(O_B, T_B), where O_B is the N prediction box positions corresponding to the N region features extracted by the second feature extraction layer, and O_B may be output by the second feature extraction layer to the supervision module. T_B is the K real box positions corresponding to the K detection targets in the training sample image, which can be expressed as: T_B = {Box_1, …, Box_k, …, Box_K}. func_iou is a function that computes a loss from the intersection over union (IOU) of the prediction boxes and the real boxes, such as a generalized IOU (GIOU) loss, a complete IOU (CIOU) loss, or another IOU-based loss function. The loss value of the loss function under the target detection task may be the sum of the loss value of the classification loss function and the loss value of the localization loss function.
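A sketch of such a combined detection loss, assuming that predicted regions have already been matched one-to-one with ground-truth targets, that boxes are in (x1, y1, x2, y2) format, and that torchvision's generalized_box_iou is available; none of these choices is mandated by the embodiment:

```python
import torch
import torch.nn.functional as F_nn
from torchvision.ops import generalized_box_iou

def detection_loss(logits, target_classes, pred_boxes, gt_boxes):
    """Classification loss plus GIoU-based localization loss over matched box pairs."""
    loss_class = F_nn.cross_entropy(logits, target_classes)
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))  # GIoU of each matched (pred, gt) pair
    loss_loc = (1.0 - giou).mean()                                # GIoU loss
    return loss_class + loss_loc

# K = 3 matched detection targets, M = 10 candidate labels (illustrative values).
K, M = 3, 10
loss = detection_loss(
    torch.randn(K, M),
    torch.randint(0, M, (K,)),
    torch.tensor([[0., 0., 10., 10.], [5., 5., 20., 20.], [2., 2., 8., 8.]]),
    torch.tensor([[1., 1., 11., 11.], [4., 4., 19., 21.], [2., 2., 9., 9.]]),
)
```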
In step 2032, the computer device transmits gradient information of the loss function back to the second feature extraction layer, so as to adjust the network parameters from the second feature extraction layer to the feature matching layer.
Alternatively, the above step 2032 may be replaced by: the computer device transmits gradient information of the loss function back to the first feature extraction layer to adjust network parameters of the first feature extraction layer to the feature matching layer. That is, the first feature extraction layer may not be adjusted during the model training process, or the first feature extraction layer may be fine-tuned during the model training process according to the actual image task.
In one round of model training, the computer equipment repeatedly trains the image annotation model (namely continuously adjusts network parameters) based on the same training image set until the loss function converges, and then the converged image annotation model under the round of training is obtained. The loss function convergence may be that a loss value of the loss function reaches a preset value.
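The following sketch illustrates one such training round with the first feature extraction layer frozen (step 2032); the modules are trivial stand-ins and the loss is a placeholder, since the embodiment does not fix a concrete implementation:

```python
import torch
import torch.nn as nn

# Stand-in modules for the visual-language encoder, the task-matched head and the matching layer.
first_extractor = nn.Linear(768, 512)
second_extractor = nn.Linear(512, 512)
matching_layer = nn.Linear(512, 512)

# Freeze the first feature extraction layer so that gradients only adjust
# the network parameters from the second feature extraction layer to the feature matching layer.
for p in first_extractor.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(second_extractor.parameters()) + list(matching_layer.parameters()), lr=1e-4)

for _ in range(100):                           # repeatedly train on the same training image set
    optimizer.zero_grad()
    batch = torch.randn(8, 768)                # a batch drawn from the training image set (illustrative)
    loss = matching_layer(second_extractor(first_extractor(batch))).pow(2).mean()  # placeholder loss
    loss.backward()                            # gradient information flows back only through unfrozen layers
    optimizer.step()
    if loss.item() < 1e-3:                     # "convergence": the loss value reaches a preset value
        break
```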
It should be noted that the supervision module may be set in the image annotation model during the model training process; after the model training is finished, the supervision module no longer plays a role in the reverse tuning of the network parameters, so it may be deleted or retained.
Steps 201 to 203 above describe a training process of the target image annotation model. After one round of training of the target image annotation model is completed, the computer device can further verify the accuracy of the target image annotation model. If the accuracy of the target image annotation model meets the preset requirement, the computer device stops training the target image annotation model, and the finally trained target image annotation model is used for determining the annotation labels of the images to be annotated under the target image task. If the accuracy of the target image annotation model does not meet the preset requirement, the computer device performs a new round of training on the target image annotation model, and iteratively updates the target image annotation model until its accuracy meets the preset requirement. See the following steps 204 to 207 for a specific implementation flow.
Step 204, after the target image labeling model converges, the computer device invokes the target image labeling model to determine the prediction labels of the plurality of verification sample images in the verification image set.
Wherein the set of verification images includes a plurality of verification sample images, each verification sample image of the plurality of verification sample images being labeled with a genuine label. The verification of the authentic signature of the sample image may be manually noted. And the prediction label of the verification sample image is obtained by a target image annotation model based on the feature matching result of the image feature corresponding to the verification sample image and the language feature corresponding to each language description text in the language description text set. The method for predicting and determining the prediction label of the verification sample image by the target image labeling model may refer to the method for determining the prediction label of the training sample image in step 202, which is not described herein.
Optionally, there is no intersection of the verification image set with the training image set. The set of verification images may be a fixed set of images, i.e. the verification sample image in the set of verification images is unchanged. For example, the intelligent labeling data set includes 1000 images, wherein 100 images are labeled with real labels, the remaining 900 images are images to be labeled, and then 50 images labeled with real labels can be used as verification sample images, so as to obtain a verification image set. And taking the other 50 images marked with the real labels as training sample images to obtain an initial training image set.
Step 205, the computer equipment determines the labeling accuracy of the labeling model of the target image according to the real labels and the predicted labels of the plurality of verification sample images.
For example, the verification image set includes 50 verification sample images, and if the true labels of 30 verification sample images are the same as the predicted labels, the labeling accuracy of the target image labeling model is 60%.
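For illustration only, the labeling accuracy can be computed as the fraction of verification sample images whose predicted label equals the real label (a hypothetical helper, not part of the embodiment):

```python
def labeling_accuracy(real_labels, predicted_labels):
    """Fraction of verification sample images whose predicted label matches the real label."""
    correct = sum(r == p for r, p in zip(real_labels, predicted_labels))
    return correct / len(real_labels)

# 30 correct predictions out of 50 verification sample images -> 0.6, as in the example above.
```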
And 206, when the labeling accuracy of the target image labeling model does not reach the preset threshold, the computer equipment executes one or more model training processes until the labeling accuracy of the target image labeling model reaches the preset threshold.
In the embodiment of the present application, when the initial training sample images are few, the images to be annotated can also be drawn on for model training. By combining an active learning strategy, difficult cases among the images to be annotated are screened out for manual correction, and the manually corrected and annotated images are added to the training image set as new training sample images, which expands the scale of the training image set; the model is then trained anew to improve the model precision. The model training process includes the following steps 2061 to 2066.
In step 2061, the computer device invokes the target image annotation model to determine the predictive labels for the plurality of images to be annotated and the confidence level of the predictive labels for the images to be annotated.
In step 2062, the computer device obtains a refractory image from the plurality of images to be annotated that has a confidence level of the predictive tag below a confidence threshold.
In step 2063, the computer device outputs the refractory image and the predictive label of the refractory image for manual correction.
In step 2064, the computer device adds the refractory image as a new training sample image to the training image set in response to receiving the artificial annotation result for the refractory image, resulting in an updated training image set.
The manual annotation result includes a real label that is manually annotated for the difficult-to-annotate image. Steps 2061 to 2064 constitute the active learning part of the model training process, i.e., the part that involves manual work: the computer device screens out a suitable candidate set and hands it over to the iterative process of manual annotation.
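A sketch of this hard-example selection (steps 2061 to 2063); the model interface, the threshold value and the helper names are purely illustrative assumptions:

```python
CONFIDENCE_THRESHOLD = 0.5   # assumed value; the embodiment does not fix a particular threshold

def select_hard_examples(model, images_to_annotate, threshold=CONFIDENCE_THRESHOLD):
    """Return the images whose prediction confidence falls below the threshold, with their predicted labels."""
    hard_examples = []
    for image in images_to_annotate:
        label, confidence = model.predict(image)   # hypothetical interface of the target image annotation model
        if confidence < threshold:
            hard_examples.append((image, label))   # output for manual correction (step 2063)
    return hard_examples

# After manual correction, the corrected images join the training image set (step 2064), e.g.:
# training_image_set.extend(manually_corrected(hard_examples))
```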
In step 2065, the computer device invokes the target image annotation model to determine a predictive label for each training sample image in the updated training image set, the predictive labels for the training sample images being derived based on feature matching results of image features corresponding to the training sample images with corresponding language features of each language description text in the set of language description texts corresponding to the updated training image set, respectively.
Optionally, if the label corresponding to the updated training image set changes, the computer device re-acquires the language description text set corresponding to the updated training image set, and the specific implementation manner may refer to the above steps 2011 to 2012, and the embodiments of the present application are not repeated herein.
In step 2066, the computer device trains the target image annotation model based on the errors between the actual labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is again converged.
The implementation process of this step 2066 may refer to the implementation process of step 203, and this embodiment is not described herein.
In step 207, when the labeling accuracy of the labeling model of the target image reaches a preset threshold, the computer device uses the predicted label of the image to be labeled determined by calling the labeling model of the target image as the labeling label of the image to be labeled.
Optionally, the computer device may further output labeling tags for all images to be labeled.
For example, fig. 6 is a schematic architecture diagram related to an image labeling method according to an embodiment of the present application. As shown in fig. 6, the architecture includes a data storage medium, a processor, and an interactive UI. The data storage medium is used for storing an intelligent annotation data set, which includes annotated images and images to be annotated. The processor may be a central processing unit (CPU) or a graphics processing unit (GPU). The processor is used for running and training an image annotation model; the image annotation model includes a visual language pre-training model, a downstream task model head and a language image feature matching layer which are sequentially connected in series, and the visual language pre-training model is further connected with the language image feature matching layer. The interactive UI is used for a user to add language descriptions, including setting a language description template and adding language description texts in batches. The interactive UI is also used for outputting annotation results; for example, it can display the prediction labels given by the image annotation model for the images to be annotated, and present the active learning/difficult case mining results for the user to manually correct and annotate. The downstream task model head supports image tasks including, but not limited to, image classification, object detection, and motion recognition.
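As a purely schematic sketch of how such a shared framework could be assembled (placeholder modules and dimensions; not the patented implementation), in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_nn

class CosineMatching(nn.Module):
    """Language-image feature matching layer: cosine similarity between image and language features."""
    def forward(self, image_feature, language_features):
        return F_nn.normalize(language_features, dim=-1) @ F_nn.normalize(image_feature, dim=-1)

class ImageAnnotationFramework(nn.Module):
    """Schematic assembly of the Fig. 6 architecture: a shared visual-language encoder,
    one downstream task head per image task, and a shared feature matching layer."""
    def __init__(self):
        super().__init__()
        self.vl_encoder = nn.Linear(768, 512)            # stands in for the visual language pre-training model
        self.task_heads = nn.ModuleDict({                # downstream task model head (one layer per image task)
            "classification": nn.Linear(512, 512),
            "detection": nn.Linear(512, 512),
            "motion_recognition": nn.Linear(512, 512),
        })
        self.matcher = CosineMatching()                  # language-image feature matching layer

    def forward(self, image_input, language_features, task: str):
        global_feature = self.vl_encoder(image_input)            # first feature extraction layer
        target_feature = self.task_heads[task](global_feature)   # task-matched second feature extraction layer
        return self.matcher(target_feature, language_features)   # feature similarities (Logits)

# Usage: match one image against 10 language descriptions under the classification head.
model = ImageAnnotationFramework()
logits = model(torch.randn(768), torch.randn(10, 512), task="classification")
```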
In summary, in the image labeling method provided in the embodiment of the present application, language prior associated knowledge of a labeling task is introduced to a training sample image by acquiring a language description text set corresponding to a training image set, and the image task is transformed into a matching task of image features and language features in an image labeling model. Therefore, the initial labeling performance of the image labeling model can be greatly improved, the first round labeling accuracy of the image labeling model is improved under the condition that the number of initial training sample images is small, the training round number of the image labeling model is effectively reduced, and therefore the image labeling efficiency is improved. In addition, the embodiment of the application can also provide an image annotation model frame suitable for a plurality of image tasks, and the downstream task model heads are designed in the image annotation model frame, so that the image annotation model frame can adapt to downstream requirements of diversity, and a corresponding image annotation model can be constructed only by selecting different feature extraction layers in the downstream task model heads. Because the image annotation model framework can be shared by a plurality of image tasks, a corresponding image annotation model is not required to be designed for each image task, unified standardized management of intelligent image annotation is realized, the complexity of development and maintenance can be reduced, and the technical cost is reduced.
The order of the steps of the image labeling method provided in the embodiments of the present application can be adjusted appropriately, and steps can be added or removed accordingly as required. Any variation readily conceivable by those skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. For example, the computer device may display information on its own display interface, or the computer device may send the information to another display device for display.
The virtual device according to the embodiment of the present application is illustrated below.
For example, fig. 7 is a schematic structural diagram of an image labeling device according to an embodiment of the present application. As shown in fig. 7, the image labeling apparatus 700 includes: an acquisition module 701, a determination module 702 and a training module 703.
The obtaining module 701 is configured to obtain a training image set corresponding to a target image task and a language description text set corresponding to the training image set, where the training image set includes a plurality of training sample images, each training sample image in the plurality of training sample images is labeled with a real label, the language description text set includes a plurality of language description texts, the plurality of language description texts are in one-to-one correspondence with a plurality of types of labels corresponding to the training image set, and each language description text is used for describing semantics of one type of labels in the plurality of types of labels.
The determining module 702 is configured to invoke a target image annotation model corresponding to the target image task, determine a prediction label of each training sample image in the training image set, where the prediction label of the training sample image is obtained by the target image annotation model based on a feature matching result of an image feature corresponding to the training sample image and a language feature corresponding to each language description text in the language description text set.
The training module 703 is configured to train the target image labeling model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set until the target image labeling model converges, where the target image labeling model is used to determine labeling labels of the images to be labeled under the target image task.
Optionally, the determining module 702 is further configured to, after the target image labeling model converges, invoke the target image labeling model to determine prediction labels of a plurality of verification sample images in a verification image set, where the verification image set includes the plurality of verification sample images, and each verification sample image in the plurality of verification sample images is labeled with a real label. The determining module 702 is further configured to determine labeling accuracy of the labeling model of the target image according to the real labels and the predicted labels of the plurality of verification sample images. The training module 703 is further configured to perform one or more model training processes when the labeling accuracy of the target image labeling model does not reach the preset threshold, until the labeling accuracy of the target image labeling model reaches the preset threshold.
Wherein, the model training process includes: and calling a target image annotation model, and determining the prediction labels of the plurality of images to be annotated and the confidence of the prediction labels of the images to be annotated. And obtaining refractory images with confidence of the predictive labels lower than a confidence threshold from the plurality of images to be annotated. Outputting the difficultly-marked image and the prediction label of the difficultly-marked image for manual correction. And in response to receiving the manual annotation result for the difficultly-annotated image, adding the difficultly-annotated image as a new training sample image to the training image set to obtain an updated training image set. And calling a target image annotation model, determining a prediction label of each training sample image in the updated training image set, and obtaining the prediction label of the training sample image based on the feature matching result of the image features corresponding to the training sample image and the language features corresponding to each language description text in the language description text set corresponding to the updated training image set. And training the target image annotation model according to errors between the actual labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is converged again.
Optionally, the determining module 702 is further configured to, when the labeling accuracy of the target image labeling model reaches a preset threshold, use the predicted label of the image to be labeled determined by calling the target image labeling model as the labeling label of the image to be labeled.
Optionally, the target image labeling model includes a first feature extraction layer, a second feature extraction layer and a feature matching layer, the first feature extraction layer includes an image feature output end and a language feature output end, the feature matching layer includes an image feature input end and a language feature input end, the image feature output end is connected with the input end of the second feature extraction layer, the language feature output end is connected with the language feature input end, and the output end of the second feature extraction layer is connected with the image feature input end. A determining module 702, configured to: and respectively extracting the characteristics of each language description text in the language description text set through the first characteristic extraction layer to obtain a language characteristic set, wherein the language characteristic set comprises a plurality of groups of language characteristics, and each group of language characteristics is a language characteristic corresponding to one language description text in the language description text set. For each training sample image in the training image set, carrying out feature extraction on the training sample image through a first feature extraction layer to obtain global image features corresponding to the training sample image, carrying out feature extraction on the global image features through a second feature extraction layer to obtain target image features, associating the target image features with target image tasks, respectively carrying out feature matching on each group of language features in the target image features and the language feature sets through a feature matching layer to obtain feature matching results, wherein the feature matching results comprise feature similarity between each group of language features in the target image features and the language feature sets, taking a label described by a target language description text as a prediction label of the training sample image, and the target language description text is a language description text corresponding to a group of language features with the highest feature similarity between the language feature sets and the target image features.
Optionally, the target image annotation model further comprises a supervision module, a loss function matched with the target image task is arranged in the supervision module, and the input end of the supervision module is connected with the output end of the feature matching layer. Training module 703 for: the loss value of the loss function is calculated by a supervision module based on the real labels and the predicted labels of a plurality of training sample images in the training image set. And reversely transmitting gradient information of the loss function to the second feature extraction layer to adjust network parameters from the second feature extraction layer to the feature matching layer.
Optionally, an image annotation model frame is pre-stored in the computer device, the image annotation model frame includes a first feature extraction layer, a downstream task model head and a feature matching layer, an image feature output end is connected with an input end of the downstream task model head, an output end of the downstream task model head is connected with an image feature input end of the feature matching layer, the downstream task model head includes a plurality of feature extraction layers corresponding to a plurality of image tasks one to one, and the second feature extraction layer is a feature extraction layer corresponding to a target image task in the downstream task model head, wherein the image feature output end is configured to be connected with one feature extraction layer in the downstream task model head each time.
Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training.
Optionally, as shown in fig. 8, the image labeling apparatus 700 further includes a display module 704.
Optionally, the display module 704 is configured to display, in response to receiving a start instruction for the target image task, a template setting prompt, where the template setting prompt is configured to prompt a user to set a language description template corresponding to the target image task. The obtaining module 701 is configured to generate, for each type of label corresponding to the training image set, a language description text according to the set language description template and the label.
Optionally, the display module 704 is further configured to display the language description text after generating the language description text.
Optionally, the display module 704 is further configured to display a plurality of image tasks, where the target image task is one of the plurality of image tasks. The acquiring module 701 is configured to determine that a start instruction for a target image task is received in response to detecting a selection operation for the target image task.
Optionally, the plurality of image tasks includes one or more of image classification, object detection, or motion recognition.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and is not described in detail here.
The following illustrates the basic hardware structure involved in the embodiments of the present application.
For example, fig. 9 is a schematic hardware structure of an image labeling device according to an embodiment of the present application. As shown in fig. 9, the image labeling apparatus 900 includes a processor 901 and a memory 902, and the processor 901 and the memory 902 are connected via a bus 903. Fig. 9 illustrates the processor 901 and the memory 902 as being independent of each other. Optionally, the processor 901 and the memory 902 are integrated. Alternatively, the image annotation device 900 in fig. 9 is any computer device having computing capabilities.
The memory 902 is used to store a computer program, including an operating system and program code. The memory 902 may be any of various types of storage media, such as read-only memory (ROM), random access memory (RAM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM), flash memory, optical memory, registers, optical disk storage, magnetic disk, or other magnetic storage devices.
The processor 901 is a general-purpose processor or a special-purpose processor. Processor 901 may be a single core processor or a multi-core processor. The processor 901 includes at least one circuit to perform the image labeling method provided in the embodiments of the present application.
Optionally, the image labeling apparatus 900 further comprises a network interface 904, the network interface 904 being connected to the processor 901 and the memory 902 via a bus 903. The network interface 904 enables the image annotation apparatus 900 to communicate with other devices.
Optionally, the image labeling apparatus 900 further comprises an input/output (I/O) interface 905, and the I/O interface 905 is connected to the processor 901 and the memory 902 through the bus 903. The processor 901 can receive input commands or data, etc., through the I/O interface 905. The I/O interface 905 is used for the image labeling apparatus 900 to connect input devices such as a keyboard, a mouse, and the like. Optionally, in some possible scenarios, the above-described network interface 904 and I/O interface 905 are collectively referred to as a communication interface.
Optionally, the image labeling apparatus 900 further comprises a display 906, the display 906 being connected to the processor 901 and the memory 902 via the bus 903. The display 906 can be used to display intermediate and/or final results, etc., resulting from the processor 901 performing the above-described methods, such as displaying image tasks, template setup prompts, language description text, etc. In one possible implementation, the display 906 is a touch screen to provide a human-machine interaction interface.
The bus 903 is any type of communication bus for interconnecting the internal devices of the image annotation device 900, such as a system bus. The embodiment of the present application describes the above-mentioned devices inside the image labeling apparatus 900 as being interconnected by the bus 903. Alternatively, these devices may be communicatively connected to each other by a connection means other than the bus 903; for example, they may be interconnected through a logic interface inside the image labeling apparatus 900.
The above devices may be provided on separate chips, or may be provided at least partially or entirely on the same chip. Whether the individual devices are independently disposed on different chips or integrally disposed on one or more chips is often dependent on the needs of the product design. The embodiment of the application does not limit the specific implementation form of the device.
The image annotation device 900 shown in fig. 9 is merely exemplary, and in implementation, the image annotation device 900 includes other components, which are not listed here. The image labeling apparatus 900 shown in fig. 9 may implement intelligent labeling of an image by performing all or part of the steps of the method provided in the above embodiments.
Embodiments of the present application also provide a computer readable storage medium having instructions stored thereon that, when executed by a processor, implement an image labeling method as shown in fig. 2.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements an image labeling method as shown in fig. 2.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
In the present embodiments, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The term "and/or" in this application is merely an association relation describing an associated object, and indicates that three relations may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, the image data referred to in this application are acquired with sufficient authorization.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall be included within the protection scope of the present application.
Claims (25)
1. An image annotation method for a computer device, the method comprising:
acquiring a training image set corresponding to a target image task and a language description text set corresponding to the training image set, wherein the training image set comprises a plurality of training sample images, each training sample image in the plurality of training sample images is marked with a real label, the language description text set comprises a plurality of language description texts, the plurality of language description texts are in one-to-one correspondence with a plurality of types of labels corresponding to the training image set, and each language description text is used for describing the semantics of one type of labels in the plurality of types of labels;
Invoking a target image annotation model corresponding to the target image task, and determining a prediction label of each training sample image in the training image set, wherein the prediction label of the training sample image is obtained by the target image annotation model based on a feature matching result of image features corresponding to the training sample image and language features corresponding to each language description text in the language description text set;
and training the target image annotation model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set until the target image annotation model converges, wherein the target image annotation model is used for determining the annotation labels of the images to be annotated under the target image task.
2. The method of claim 1, wherein after the target image annotation model converges, the method further comprises:
invoking the target image annotation model, and determining prediction labels of a plurality of verification sample images in a verification image set, wherein the verification image set comprises the plurality of verification sample images, and each verification sample image in the plurality of verification sample images is annotated with a real label;
Determining the labeling accuracy of the target image labeling model according to the real labels and the prediction labels of the verification sample images;
when the labeling accuracy of the target image labeling model does not reach a preset threshold, executing one or more model training processes until the labeling accuracy of the target image labeling model reaches the preset threshold;
wherein the model training process comprises:
invoking the target image annotation model, and determining the prediction labels of a plurality of images to be annotated and the confidence of the prediction labels of the images to be annotated;
acquiring difficultly-marked images with confidence coefficient of the predictive label lower than a confidence coefficient threshold value from the plurality of images to be marked;
outputting the difficultly-marked image and a prediction label of the difficultly-marked image for manual correction;
in response to receiving a manual annotation result for the difficultly-annotated image, adding the difficultly-annotated image as a new training sample image to the training image set to obtain an updated training image set;
invoking the target image annotation model, and determining a prediction label of each training sample image in the updated training image set, wherein the prediction label of the training sample image is obtained based on feature matching results of image features corresponding to the training sample image and language features corresponding to each language description text in a language description text set corresponding to the updated training image set;
And training the target image annotation model according to errors between the real labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is converged again.
3. The method according to claim 2, wherein the method further comprises:
when the labeling accuracy of the target image labeling model reaches the preset threshold, the prediction label of the image to be labeled, which is determined by calling the target image labeling model, is used as the labeling label of the image to be labeled.
4. A method according to any one of claims 1 to 3, wherein the target image annotation model comprises a first feature extraction layer, a second feature extraction layer and a feature matching layer, the first feature extraction layer comprising an image feature output and a language feature output, the feature matching layer comprising an image feature input and a language feature input, the image feature output being connected to the input of the second feature extraction layer, the language feature output being connected to the language feature input, the output of the second feature extraction layer being connected to the image feature input; the step of calling the target image annotation model corresponding to the target image task and determining the prediction label of each training sample image in the training image set comprises the following steps:
Extracting the characteristics of each language description text in the language description text set through the first characteristic extraction layer to obtain a language characteristic set, wherein the language characteristic set comprises a plurality of groups of language characteristics, and each group of language characteristics is a language characteristic corresponding to one language description text in the language description text set;
for each training sample image in the training image set,
extracting the characteristics of the training sample image through the first characteristic extraction layer to obtain the global image characteristics corresponding to the training sample image,
performing feature extraction on the global image features through the second feature extraction layer to obtain target image features, wherein the target image features are associated with the target image task,
the feature matching layer is used for respectively carrying out feature matching on the target image features and the language features of each group in the language feature set to obtain feature matching results, the feature matching results comprise feature similarity between the target image features and each group of language features in the language feature set,
and taking the label described by the target language description text as a prediction label of the training sample image, wherein the target language description text is the language description text corresponding to a group of language features with highest feature similarity between the language feature set and the target image feature.
5. The method of claim 4, wherein the target image annotation model further comprises a supervision module, wherein a loss function matched with the target image task is set in the supervision module, an input end of the supervision module is connected with an output end of the feature matching layer, and training the target image annotation model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set comprises:
calculating, by the supervision module, a loss value of the loss function based on real labels and predictive labels of a plurality of training sample images in the training image set;
and reversely transmitting gradient information of the loss function to the second feature extraction layer so as to adjust network parameters from the second feature extraction layer to the feature matching layer.
6. The method according to claim 4 or 5, wherein an image annotation model framework is pre-stored in the computer device, the image annotation model framework comprises the first feature extraction layer, a downstream task model head and the feature matching layer, the image feature output end is connected with the input end of the downstream task model head, the output end of the downstream task model head is connected with the image feature input end of the feature matching layer, the downstream task model head comprises a plurality of feature extraction layers corresponding to a plurality of image tasks one by one, the second feature extraction layer is a feature extraction layer corresponding to the target image task in the downstream task model head, and the image feature output end is configured to connect one feature extraction layer in the downstream task model head at a time.
7. The method according to any of claims 4 to 6, wherein the first feature extraction layer is implemented by a pre-trained visual language pre-training model.
8. The method according to any one of claims 1 to 7, wherein the obtaining the language description text set corresponding to the training image set includes:
in response to receiving a start instruction for the target image task, displaying a template setting prompt, wherein the template setting prompt is used for prompting a user to set a language description template corresponding to the target image task;
and generating a language description text according to the set language description template and the labels aiming at each type of labels corresponding to the training image set.
9. The method of claim 8, wherein the method further comprises:
after the language description text is generated, the language description text is displayed.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
displaying a plurality of image tasks, wherein the target image task is one of the plurality of image tasks;
in response to detecting a selection operation of the target image task, it is determined that a start instruction for the target image task is received.
11. The method of claim 6 or 10, wherein the plurality of image tasks includes one or more of image classification, object detection, or motion recognition.
12. An image annotation device, the device comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a training image set corresponding to a target image task and a language description text set corresponding to the training image set, the training image set comprises a plurality of training sample images, each training sample image in the plurality of training sample images is marked with a real label, the language description text set comprises a plurality of language description texts, the plurality of language description texts are in one-to-one correspondence with a plurality of types of labels corresponding to the training image set, and each language description text is used for describing the semantics of one type of labels in the plurality of types of labels;
the determining module is used for calling a target image annotation model corresponding to the target image task and determining a prediction label of each training sample image in the training image set, wherein the prediction label of the training sample image is obtained by the target image annotation model based on the feature matching result of the image feature corresponding to the training sample image and the language feature corresponding to each language description text in the language description text set;
The training module is used for training the target image annotation model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set until the target image annotation model converges, and the target image annotation model is used for determining the annotation labels of the images to be annotated under the target image task.
13. The apparatus of claim 12, wherein
the determining module is further configured to invoke the target image annotation model after the target image annotation model converges, determine prediction labels of a plurality of verification sample images in a verification image set, where the verification image set includes the plurality of verification sample images, and each verification sample image in the plurality of verification sample images is annotated with a real label;
the determining module is further used for determining the labeling accuracy of the target image labeling model according to the real labels and the prediction labels of the verification sample images;
the training module is further used for executing one or more model training processes when the labeling accuracy of the target image labeling model does not reach a preset threshold value until the labeling accuracy of the target image labeling model reaches the preset threshold value;
Wherein the model training process comprises:
invoking the target image annotation model, and determining the prediction labels of a plurality of images to be annotated and the confidence of the prediction labels of the images to be annotated;
acquiring difficultly-marked images with confidence coefficient of the predictive label lower than a confidence coefficient threshold value from the plurality of images to be marked;
outputting the difficultly-marked image and a prediction label of the difficultly-marked image for manual correction;
in response to receiving a manual annotation result for the difficultly-annotated image, adding the difficultly-annotated image as a new training sample image to the training image set to obtain an updated training image set;
invoking the target image annotation model, and determining a prediction label of each training sample image in the updated training image set, wherein the prediction label of the training sample image is obtained based on feature matching results of image features corresponding to the training sample image and language features corresponding to each language description text in a language description text set corresponding to the updated training image set;
and training the target image annotation model according to errors between the real labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is converged again.
14. The apparatus of claim 13, wherein
and the determining module is further used for taking the prediction label of the image to be marked, which is determined by calling the target image marking model, as the marking label of the image to be marked when the marking accuracy of the target image marking model reaches the preset threshold.
15. The apparatus according to any one of claims 12 to 14, wherein the target image annotation model comprises a first feature extraction layer, a second feature extraction layer and a feature matching layer, the first feature extraction layer comprising an image feature output and a language feature output, the feature matching layer comprising an image feature input and a language feature input, the image feature output being connected to the input of the second feature extraction layer, the language feature output being connected to the language feature input, the output of the second feature extraction layer being connected to the image feature input; the determining module is used for:
extracting the characteristics of each language description text in the language description text set through the first characteristic extraction layer to obtain a language characteristic set, wherein the language characteristic set comprises a plurality of groups of language characteristics, and each group of language characteristics is a language characteristic corresponding to one language description text in the language description text set;
For each training sample image in the training image set,
extracting the characteristics of the training sample image through the first characteristic extraction layer to obtain the global image characteristics corresponding to the training sample image,
performing feature extraction on the global image features through the second feature extraction layer to obtain target image features, wherein the target image features are associated with the target image task,
the feature matching layer is used for respectively carrying out feature matching on the target image features and the language features of each group in the language feature set to obtain feature matching results, the feature matching results comprise feature similarity between the target image features and each group of language features in the language feature set,
and taking the label described by the target language description text as a prediction label of the training sample image, wherein the target language description text is the language description text corresponding to a group of language features with highest feature similarity between the language feature set and the target image feature.
16. The apparatus of claim 15, wherein the target image annotation model further comprises a supervision module, the supervision module having a loss function matched to the target image task disposed therein, an input of the supervision module connected to an output of the feature matching layer, and the training module configured to:
Calculating, by the supervision module, a loss value of the loss function based on real labels and predictive labels of a plurality of training sample images in the training image set;
and reversely transmitting gradient information of the loss function to the second feature extraction layer so as to adjust network parameters from the second feature extraction layer to the feature matching layer.
17. The apparatus according to claim 15 or 16, wherein an image annotation model framework is pre-stored in the computer device, the image annotation model framework comprising the first feature extraction layer, a downstream task model head and the feature matching layer, the image feature output being connected to an input of the downstream task model head, an output of the downstream task model head being connected to an image feature input of the feature matching layer, the downstream task model head comprising a plurality of feature extraction layers in one-to-one correspondence with a plurality of image tasks, the second feature extraction layer being a feature extraction layer in the downstream task model head corresponding to the target image task, wherein the image feature output is configured to connect one feature extraction layer in the downstream task model head at a time.
18. The apparatus according to any of the claims 15 to 17, wherein the first feature extraction layer is implemented by a pre-trained visual language pre-training model.
19. The apparatus according to any one of claims 12 to 18, further comprising: a display module;
the display module is used for responding to the receiving of the starting instruction aiming at the target image task, displaying a template setting prompt, wherein the template setting prompt is used for prompting a user to set a language description template corresponding to the target image task;
the acquisition module is used for generating a language description text according to the set language description template and the labels aiming at each type of labels corresponding to the training image set.
20. The apparatus of claim 19, wherein
the display module is further used for displaying the language description text after the language description text is generated.
21. The apparatus according to claim 19 or 20, wherein
the display module is further used for displaying a plurality of image tasks, and the target image task is one of the plurality of image tasks;
the acquisition module is used for responding to detection of the selection operation of the target image task and determining that a starting instruction aiming at the target image task is received.
22. The apparatus of claim 17 or 21, wherein the plurality of image tasks comprises one or more of image classification, object detection, or motion recognition.
23. A computer device, comprising: a processor and a memory;
the memory is used for storing a computer program, and the computer program comprises program instructions;
the processor is configured to invoke the computer program to implement the image labeling method according to any of claims 1 to 11.
24. A computer readable storage medium having instructions stored thereon which, when executed by a processor, implement the image annotation method according to any of claims 1 to 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the image annotation method according to any one of claims 1 to 11.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211042115.8A CN117671678A (en) | 2022-08-29 | 2022-08-29 | Image labeling method and device |
PCT/CN2023/089419 WO2024045641A1 (en) | 2022-08-29 | 2023-04-20 | Image annotation method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211042115.8A CN117671678A (en) | 2022-08-29 | 2022-08-29 | Image labeling method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117671678A true CN117671678A (en) | 2024-03-08 |
Family
ID=90064841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211042115.8A Pending CN117671678A (en) | 2022-08-29 | 2022-08-29 | Image labeling method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117671678A (en) |
WO (1) | WO2024045641A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118171208A (en) * | 2024-05-16 | 2024-06-11 | 江西广播电视网络传媒有限公司 | Multi-mode multi-label association classification method, system, storage medium and computer |
CN118192976A (en) * | 2024-05-08 | 2024-06-14 | 工业富联(杭州)数据科技有限公司 | Operation guide generation method and device, electronic equipment and storage medium |
CN118366011A (en) * | 2024-06-19 | 2024-07-19 | 温州电力建设有限公司 | Model training, underground cable pipeline defect identification method, product and equipment |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118296387B (en) * | 2024-06-05 | 2024-08-06 | 烟台海颐软件股份有限公司 | Model bi-directional iteration-based training sample optimization method and model bi-directional iteration-based training sample optimization system |
CN118332127B (en) * | 2024-06-14 | 2024-08-06 | 安徽农业大学 | Zero sample text classification method based on cross-language integration |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108416384B (en) * | 2018-03-05 | 2021-11-05 | 苏州大学 | Image label labeling method, system, equipment and readable storage medium |
US10878296B2 (en) * | 2018-04-12 | 2020-12-29 | Discovery Communications, Llc | Feature extraction and machine learning for automated metadata analysis |
CN111626362B (en) * | 2020-05-28 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Image processing method, device, computer equipment and storage medium |
CN112926654B (en) * | 2021-02-25 | 2023-08-01 | 平安银行股份有限公司 | Pre-labeling model training and certificate pre-labeling method, device, equipment and medium |
CN113065013B (en) * | 2021-03-25 | 2024-05-03 | 携程计算机技术(上海)有限公司 | Image annotation model training and image annotation method, system, equipment and medium |
CN114186056B (en) * | 2021-12-14 | 2024-10-15 | 广州华多网络科技有限公司 | Commodity label marking method and device, equipment, medium and product thereof |
CN114429566A (en) * | 2022-01-20 | 2022-05-03 | 北京沃东天骏信息技术有限公司 | Image semantic understanding method, device, equipment and storage medium |
- 2022-08-29: CN application CN202211042115.8A filed (published as CN117671678A, status: pending)
- 2023-04-20: WO application PCT/CN2023/089419 filed (published as WO2024045641A1, status: unknown)
Also Published As
Publication number | Publication date |
---|---|
WO2024045641A1 (en) | 2024-03-07 |
Similar Documents
Publication | Title
---|---
CN117671678A (en) | Image labeling method and device | |
CN110851641B (en) | Cross-modal retrieval method and device and readable storage medium | |
CN113011186B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium | |
CN115658955B (en) | Cross-media retrieval and model training method, device, equipment and menu retrieval system | |
US11605232B2 (en) | System and method for road sign ground truth construction with a knowledge graph and machine learning | |
CN117390497B (en) | Category prediction method, device and equipment based on large language model | |
CN114998220A (en) | Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN117523275A (en) | Attribute recognition method and attribute recognition model training method based on artificial intelligence | |
CN115438215A (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
EP3882817A2 (en) | Method, apparatus and device for recognizing bill and storage medium | |
CN113095072B (en) | Text processing method and device | |
CN117036778A (en) | Potential safety hazard identification labeling method based on image-text conversion model | |
CN112668608A (en) | Image identification method and device, electronic equipment and storage medium | |
Qi et al. | Cogcom: Train large vision-language models diving into details through chain of manipulations | |
CN114676705B (en) | Dialogue relation processing method, computer and readable storage medium | |
US20240249503A1 (en) | Image processing method and related apparatus | |
CN112926700B (en) | Class identification method and device for target image | |
CN117667979B (en) | Data mining method, device, equipment and medium based on large language model | |
CN117251553B (en) | Intelligent learning interaction method based on custom plug-in and large language model | |
CN116974626B (en) | Analysis sequence chart generation method, device, equipment and computer readable storage medium | |
CN112380861A (en) | Model training method and device and intention identification method and device | |
CN115617975B (en) | Intention recognition method and device for few-sample multi-turn conversation | |
CN112016493A (en) | Image description method and device, electronic equipment and storage medium | |
CN114970666B (en) | Spoken language processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||