WO2023029436A1 - Data labeling method and apparatus, and device and storage medium

Info

Publication number
WO2023029436A1
WO2023029436A1 PCT/CN2022/081097 CN2022081097W WO2023029436A1 WO 2023029436 A1 WO2023029436 A1 WO 2023029436A1 CN 2022081097 W CN2022081097 W CN 2022081097W WO 2023029436 A1 WO2023029436 A1 WO 2023029436A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeling
data
labeled
user
results
Application number
PCT/CN2022/081097
Other languages
French (fr)
Chinese (zh)
Inventor
徐佳
唐蓉玮
Original Assignee
华为云计算技术有限公司
Application filed by 华为云计算技术有限公司
Publication of WO2023029436A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H30/00 ICT specially adapted for the handling or processing of medical images
    • G16H30/40 ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • the present application provides a data labeling method, apparatus, device, and storage medium, which can improve the accuracy of data labeling.
  • the data to be labeled is distributed to at least two labeling users for labeling, and at least two labeling results are obtained. Whether to perform arbitration on the data to be labeled is then determined.
  • when it is determined to perform arbitration on the data to be labeled, the data is sent to the auditing user for labeling; when it is determined not to perform arbitration, the at least two labeling results are fused according to the fusion rule to obtain the label of the data to be labeled.
  • multiple people collaborate to label the same data, which solves the problem of low labeling accuracy for complex data; an arbitration function is also provided to arbitrate the data that requires it, further improving labeling accuracy.
  • the present application provides a data labeling apparatus, which is applied to a data labeling platform, and the apparatus includes:
  • the labeling result management module is further configured to, if it is determined not to perform arbitration on the data to be labeled, fuse the at least two labeling results according to the fusion rule to obtain the label of the data to be labeled.
  • FIG. 1 is a logical schematic diagram of an AI platform provided by an exemplary embodiment of the present application;
  • FIG. 2 is a logical schematic diagram of a data labeling platform provided by an exemplary embodiment of the present application;
  • FIG. 3 is a schematic diagram of interaction between a data labeling module and a management user provided by an exemplary embodiment of the present application;
  • FIG. 10 is a schematic diagram of a labeling interface provided by an exemplary embodiment of the present application;
  • the auxiliary diagnosis function is used to provide auxiliary diagnosis of various types of diseases.
  • the data labeling platform judges whether the at least two labeling results are the same; if they are not, it determines to perform arbitration on the data to be labeled, and if they are, it determines not to perform arbitration on the data to be labeled.
  • the labeling type of the data to be labeled is classification
  • one labeling result among at least two labeling results is determined as the label of the data to be labelled.
  • the labeling type of the data to be labeled is detection or segmentation
  • the union of the labeling frames for the same object is determined as the label of the data to be labeled.
  • the fusion process can also take other forms; for example, when the labeling type of the data to be labeled is detection or segmentation, the labeling frames for the same object are averaged and the result is determined as the label of the data to be labeled.
  • in step 808, the data labeling platform determines to perform arbitration on the data to be labeled and determines an auditing user in the labeling team corresponding to the labeling task.
  • the data labeling platform sends the data to be labeled to the auditing user. For the process by which the auditing user arbitrates the data to be labeled, refer to the description of step 704.
  • the data labeling platform determines not to perform arbitration on the data to be labeled; for the process of determining the label of the data to be labeled, refer to the description of step 705.
  • the data labeling module 201 is configured to send the data to be labeled to at least two labeling users; specifically, it can be used to perform the data labeling function of step 701 and the implicit steps it contains;
  • according to the at least two labeling results, determine whether to perform arbitration on the data to be labeled; specifically, this can be used to perform the labeling result management function of step 702 and step 703 and the implicit steps they contain;

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Neurology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application belongs to the technical field of AI. Provided are a data labeling method and apparatus, and a computing device and a storage medium. The method comprises: a data labeling platform sending, to at least two labeling users, data to be labeled; acquiring at least two labeling results of the at least two labeling users for said data; determining whether to execute arbitration on said data according to the at least two labeling results; if it is determined to execute arbitration on said data, sending said data to an auditing user for labeling; and if it is determined not to execute arbitration on said data, fusing the at least two labeling results according to a fusion rule, so as to obtain a tag of said data. By means of the present application, the same data is labeled by means of multi-person coordination, such that the problem of low labeling accuracy of complex data is solved.

Description

Method, apparatus, device, and storage medium for data labeling
This application claims priority to Chinese Patent Application No. 202111003760.4, filed on August 30, 2021 and entitled "Data labeling method, apparatus, device, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the technical field of artificial intelligence (AI), and in particular to a data labeling method, apparatus, device, and storage medium.
Background
With the widespread application of AI technology, a large amount of labeled data is needed for algorithm training, so labeling data efficiently and accurately has become a top priority.
In the related art, an intelligent labeling algorithm is trained in advance, each piece of data to be labeled is input into the algorithm to obtain a labeling result for that data, and the labeling result is determined as its label. An intelligent labeling algorithm can handle simple labeling in this way, but for more complex labeling its accuracy is often relatively low, so a method with higher labeling accuracy is needed.
Summary
This application provides a data labeling method, apparatus, device, and storage medium that can improve the accuracy of data labeling.
In a first aspect, this application provides a data labeling method. The method includes: sending data to be labeled to at least two labeling users, and acquiring at least two labeling results of the at least two labeling users for the data to be labeled; determining, according to the at least two labeling results, whether to perform arbitration on the data to be labeled; if it is determined to perform arbitration on the data to be labeled, sending the data to be labeled to an auditing user for labeling; and if it is determined not to perform arbitration on the data to be labeled, fusing the at least two labeling results according to a fusion rule to obtain a label of the data to be labeled.
In the solution shown in this application, the data to be labeled is distributed to at least two labeling users, and at least two labeling results are obtained. Whether to perform arbitration on the data to be labeled is then determined. When arbitration is to be performed, the data to be labeled is sent to the auditing user for labeling; when arbitration is not to be performed, the at least two labeling results are fused according to the fusion rule to obtain the label of the data to be labeled. Having multiple people collaboratively label the same data solves the problem of low labeling accuracy for complex data, and the arbitration function, applied to the data that requires it, further improves labeling accuracy.
In a possible implementation, if it is determined to perform arbitration on the data to be labeled, the method further includes: obtaining the auditing user's labeling result for the data to be labeled, and using the auditing user's labeling result as the label of the data to be labeled. In this way, a more accurate label can be obtained for data that requires arbitration.
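The control flow of the first aspect can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name `label_item` and its three callable parameters (`needs_arbitration`, `fuse`, `ask_auditing_user`) are hypothetical names introduced here, and how results are collected from labeling users is left out.

```python
from typing import Callable, Sequence, TypeVar

R = TypeVar("R")  # type of one labeling result (class name, box, mask, ...)


def label_item(
    results: Sequence[R],
    needs_arbitration: Callable[[Sequence[R]], bool],
    fuse: Callable[[Sequence[R]], R],
    ask_auditing_user: Callable[[], R],
) -> R:
    """Return the final label of one item from its labeling users' results.

    All parameter names are illustrative assumptions, not names from the patent.
    """
    if len(results) < 2:
        raise ValueError("the method requires at least two labeling results")
    if needs_arbitration(results):
        # Arbitration: the auditing user labels the item, and that
        # labeling result is used as the label.
        return ask_auditing_user()
    # No arbitration: fuse the results according to the fusion rule.
    return fuse(results)
```

For a classification task, a caller might pass `lambda r: len(set(r)) > 1` as `needs_arbitration` and `lambda r: r[0]` as `fuse`, which matches the classification rules described in the implementations that follow.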
In a possible implementation, the method further includes: receiving configuration information, input by a management user, for the labeling users, the auditing user, and the fusion rule. This allows the labeling users, the auditing user, and the fusion rule to be configured flexibly, which makes the method more flexible to apply.
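One way to picture this configuration information is as a small validated record. The class name `LabelingTaskConfig`, its field names, and the rule name `"union"` are assumptions made for illustration; the source does not specify the format of the configuration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LabelingTaskConfig:
    """Hypothetical container for what the management user configures."""

    labeling_users: Tuple[str, ...]  # identifiers of the labeling users
    auditing_user: str               # identifier of the auditing user
    fusion_rule: str = "union"       # assumed rule name, e.g. "union" or "average"

    def __post_init__(self) -> None:
        # The method sends each item to at least two labeling users.
        if len(set(self.labeling_users)) < 2:
            raise ValueError("at least two distinct labeling users are required")
```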
In a possible implementation, determining, according to the at least two labeling results, whether to perform arbitration on the data to be labeled includes: when the labeling type of the data to be labeled is classification, determining to arbitrate the data to be labeled if the at least two labeling results are not the same, and determining not to arbitrate it if the at least two labeling results are the same; and when the labeling type of the data to be labeled is detection or segmentation, determining to arbitrate the data to be labeled if the difference between the labeling frames for the same object in the at least two labeling results does not satisfy a preset condition, and determining not to arbitrate it if that difference satisfies the preset condition. Because a different criterion is applied to each labeling type, whether to arbitrate can be determined accurately.
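The two criteria can be sketched as below. The source does not say what the preset condition is; this sketch assumes, purely for illustration, that the condition is "every pair of labeling frames for the same object has an intersection over union (IoU) of at least `min_iou`". The box layout and all function names are likewise assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def arbitrate_classification(results: Sequence[str]) -> bool:
    """Classification: arbitrate when the labeling results are not all the same."""
    return len(set(results)) > 1


def arbitrate_boxes(boxes: Sequence[Box], min_iou: float = 0.5) -> bool:
    """Detection: arbitrate when any pair of frames for one object overlaps too little.

    The IoU threshold stands in for the unspecified preset condition.
    """
    return any(
        iou(boxes[i], boxes[j]) < min_iou
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
```

Other measures of difference, such as center distance or area ratio, would fit the same structure; only the pairwise comparison inside `arbitrate_boxes` would change.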
In a possible implementation, fusing the at least two labeling results according to the fusion rule to obtain the label of the data to be labeled includes: when the labeling type of the data to be labeled is classification, determining one of the at least two labeling results as the label of the data to be labeled; and when the labeling type of the data to be labeled is detection or segmentation, taking the union of the labeling frames for the same object and determining it as the label of the data to be labeled.
In the solution shown in this application, the label is determined differently for each labeling type. In this way, when there are multiple labeling results, the label can be determined accurately from them.
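A minimal sketch of these fusion rules follows. It makes two assumptions that the source does not state: for detection, the union of rectangular labeling frames is represented by the smallest axis-aligned box enclosing all of them (a true set union of rectangles is generally not a rectangle), and for segmentation, each mask is represented as a set of pixel coordinates. All names are illustrative.

```python
from typing import Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)
Pixel = Tuple[int, int]


def fuse_classification(results: Sequence[str]) -> str:
    """Without arbitration the results are the same, so any one of them is the label."""
    return results[0]


def fuse_boxes_union(boxes: Sequence[Box]) -> Box:
    """Smallest axis-aligned box enclosing every labeling user's frame."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def fuse_masks_union(masks: Sequence[Set[Pixel]]) -> Set[Pixel]:
    """Pixel-wise union of segmentation masks given as coordinate sets."""
    return set().union(*masks)
```

Taking the union favors recall: any region marked by at least one labeling user ends up in the label. This is reasonable here because fusion only runs when the frames were already judged close enough not to need arbitration.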
In a possible implementation, the method further includes: when it is detected that a labeling user triggers a labeling interface, providing, through the labeling interface, a viewing tool and a labeling tool corresponding to the labeling type of the data to be labeled, where the viewing tool is used by the labeling user to view the data to be labeled, and the labeling tool is used by the labeling user to add a labeling result corresponding to the labeling type to the data to be labeled.
In the solution shown in this application, when it is detected that a labeling user triggers the labeling interface, the viewing tool and labeling tool corresponding to the labeling type are provided to that user through the labeling interface, so that the user can view the data to be labeled with the viewing tool and add a label to it with the labeling tool. This gives labeling users a more intelligent way to label, which both unifies the format of the labeling results and improves labeling efficiency.
In a possible implementation, after the label of the data to be labeled is obtained, the method further includes: acquiring label version information of the data to be labeled that is input by a management user; and storing the label of the data to be labeled in correspondence with the label version information.
In the solution shown in this application, the data labeling platform lets a management user input label version information for the data to be labeled; the label version information indicates the version of the label of the data to be labeled. The data labeling platform stores the label of the data to be labeled in correspondence with the label version information, so that labels of the data to be labeled can be distinguished by label version.
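Storing a label "in correspondence with" its version information can be pictured as keying each label by the pair (data identifier, version). The sketch below uses an in-memory dictionary; the class `LabelStore`, its method names, and the string identifiers are hypothetical, and a real platform would presumably use persistent storage.

```python
from typing import Any, Dict, List, Tuple


class LabelStore:
    """Hypothetical in-memory store keyed by (data_id, label_version)."""

    def __init__(self) -> None:
        self._labels: Dict[Tuple[str, str], Any] = {}

    def save(self, data_id: str, version: str, label: Any) -> None:
        # The label is stored together with the version entered by the management user.
        self._labels[(data_id, version)] = label

    def load(self, data_id: str, version: str) -> Any:
        return self._labels[(data_id, version)]

    def versions(self, data_id: str) -> List[str]:
        """All stored label versions of one data item, sorted as strings."""
        return sorted(v for d, v in self._labels if d == data_id)
```

Keeping every version instead of overwriting makes it possible to distinguish, for example, the labels produced by two successive labeling rounds on the same data.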
In a second aspect, this application provides a data labeling apparatus applied to a data labeling platform. The apparatus includes:
a data labeling module, configured to send data to be labeled to at least two labeling users; and
a labeling result management module, configured to: acquire at least two labeling results of the at least two labeling users for the data to be labeled; and
determine, according to the at least two labeling results, whether to perform arbitration on the data to be labeled.
The data labeling module is further configured to send the data to be labeled to an auditing user for labeling if it is determined to perform arbitration on the data to be labeled.
The labeling result management module is further configured to, if it is determined not to perform arbitration on the data to be labeled, fuse the at least two labeling results according to a fusion rule to obtain a label of the data to be labeled.
In a possible implementation, the labeling result management module is further configured to: if it is determined to perform arbitration on the data to be labeled, obtain the auditing user's labeling result for the data to be labeled, and use the auditing user's labeling result as the label of the data to be labeled.
In a possible implementation, the data labeling module is further configured to receive configuration information, input by a management user, for the labeling users, the auditing user, and the fusion rule.
In a possible implementation, the labeling result management module is configured to: when the labeling type of the data to be labeled is classification, determine to arbitrate the data to be labeled if the at least two labeling results are not the same, and determine not to arbitrate it if the at least two labeling results are the same; and
when the labeling type of the data to be labeled is detection or segmentation, determine to arbitrate the data to be labeled if the difference between the labeling frames for the same object in the at least two labeling results does not satisfy a preset condition, and determine not to arbitrate it if that difference satisfies the preset condition.
In a possible implementation, the labeling result management module is configured to: when the labeling type of the data to be labeled is classification, determine one of the at least two labeling results as the label of the data to be labeled; and
when the labeling type of the data to be labeled is detection or segmentation, take the union of the labeling frames for the same object and determine it as the label of the data to be labeled.
In a possible implementation, the data labeling module is further configured to: when it is detected that a labeling user triggers a labeling interface, provide, through the labeling interface, a viewing tool and a labeling tool corresponding to the labeling type of the data to be labeled, where the viewing tool is used by the labeling user to view the data to be labeled, and the labeling tool is used by the labeling user to add a labeling result corresponding to the labeling type to the data to be labeled.
In a possible implementation, the labeling result management module is further configured to: after the label of the data to be labeled is obtained, acquire label version information of the data to be labeled that is input by a management user; and
store the label of the data to be labeled in correspondence with the label version information.
In a third aspect, this application provides a computing device for data labeling. The computing device includes a processor and a memory, where:
the memory stores computer instructions; and
the processor executes the computer instructions to implement the data labeling method of the first aspect.
In a fourth aspect, this application provides a computer-readable storage medium storing computer instructions. When the computer instructions in the computer-readable storage medium are executed by a computing device, the computing device is caused to perform the data labeling method of the first aspect.
In a fifth aspect, this application provides a computer program product containing instructions which, when run on a computing device, cause the computing device to perform the data labeling method of the first aspect.
Brief Description of the Drawings
FIG. 1 is a logical schematic diagram of an AI platform provided by an exemplary embodiment of this application;
FIG. 2 is a logical schematic diagram of a data labeling platform provided by an exemplary embodiment of this application;
FIG. 3 is a schematic diagram of interaction between a data labeling module and a management user provided by an exemplary embodiment of this application;
FIG. 4 is a logical schematic diagram of a data labeling platform provided by an exemplary embodiment of this application;
FIG. 5 is a logical schematic diagram of a data labeling platform provided by an exemplary embodiment of this application;
FIG. 6 is a schematic structural diagram of a computing device provided by an exemplary embodiment of this application;
FIG. 7 is a schematic flowchart of a data labeling method provided by an exemplary embodiment of this application;
FIG. 8 is a schematic flowchart of a data labeling method provided by an exemplary embodiment of this application;
FIG. 9 is a schematic diagram of a labeling interface provided by an exemplary embodiment of this application;
FIG. 10 is a schematic diagram of a labeling interface provided by an exemplary embodiment of this application;
FIG. 11 is a schematic diagram of a labeling interface provided by an exemplary embodiment of this application;
FIG. 12 is a schematic flowchart of a data labeling method provided by an exemplary embodiment of this application;
FIG. 13 is a schematic structural diagram of a data labeling apparatus provided by an exemplary embodiment of this application.
Detailed Description
To make the purpose, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
With the widespread application of AI technology, a large amount of labeled data is needed for algorithm training, so labeling data efficiently and accurately has become a top priority. Related solutions label data with an intelligent labeling algorithm, but for complex data the accuracy of such an algorithm is often relatively low, so a method with higher labeling accuracy is needed.
The embodiments of this application provide a data labeling method that can be applied to a variety of data labeling scenarios, such as image data labeling and semantic analysis labeling, and that achieves good labeling accuracy in them. For example, the image data may be medical images.
The embodiments of this application use medical image labeling as an example to describe the solution in detail. Medical images are highly specialized, so labeling users need considerable expertise in medical imaging. Because labeling users differ in their level of expertise, their labeling results can differ substantially. Medical image data also comes in many formats and modalities, so the data labeling method must be able to accommodate various kinds of medical images. For example, in medical images of different lungs, pulmonary nodules vary in size and shape. In addition, making a medical AI model more accurate requires a large number of medical images, so labeling them consumes a great deal of manpower and time.
In view of this, in the embodiments of this application, the AI platform is given a data labeling function that supports labeling by multiple people as well as fusion and arbitration of labeling results, so that each piece of data to be labeled is assigned to at least two labeling users. After the at least two labeling users finish labeling, their labeling results are fused. When the labeling results of the at least two labeling users differ substantially, the data to be labeled is arbitrated. Through collaborative labeling by multiple people, with arbitration when results differ substantially, the data labeling function of the AI platform not only makes the labeling results accurate but also keeps their format consistent, which facilitates subsequent training of an AI model, such as a medical auxiliary diagnosis model.
For example, in the embodiments of this application, after a label is added to a piece of data to be labeled, the data and its label can be used as a sample for training an AI model.
In some embodiments, the data labeling method may be executed by an AI platform; for example, it may be executed by a data labeling platform included in the AI platform. An AI platform is a platform that provides AI developers and users with a convenient AI development environment and convenient development tools. The AI platform provides functions such as data labeling and AI model training.
FIG. 1 is a schematic structural diagram of the AI platform 100 provided in an embodiment of this application. It should be understood that FIG. 1 merely shows, by way of example, a structural logical schematic diagram of the AI platform 100, and this application does not limit how the modules in the AI platform 100 are divided. As shown in FIG. 1, the AI platform 100 includes an AI interaction module 101, a cloud base platform 102, platform as a service (PaaS) 103, infrastructure as a service (IaaS) 104, and so on.
The functions of the modules in the AI platform 100 shown in FIG. 1 are briefly described below.
For example, the AI interaction module 101 is used to provide a data management function, an AI algorithm development function, an auxiliary diagnosis and treatment service function, and a high-level AI image assistance function.
The data management function is used to receive imported data and to desensitize data, and it includes an image reading function and a delineation function. The data may be, for example, data to be labeled. Desensitization refers to anonymizing the sensitive information in user data. The image reading function is used to display the data to be labeled to the user. The delineation function is used by the user to add labeling results to the data to be labeled.
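As a rough illustration of desensitization, the sketch below drops sensitive fields from a metadata record before the record is shown to labeling users. The field names in `SENSITIVE_FIELDS` and the function `desensitize` are hypothetical; the source does not say which fields are treated as sensitive or how anonymization is carried out.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical field names; the actual sensitive fields are not specified in the source.
SENSITIVE_FIELDS: FrozenSet[str] = frozenset({"patient_name", "patient_id", "birth_date"})


def desensitize(
    record: Dict[str, Any],
    sensitive: FrozenSet[str] = SENSITIVE_FIELDS,
) -> Dict[str, Any]:
    """Return a copy of the record with the sensitive fields removed."""
    return {key: value for key, value in record.items() if key not in sensitive}
```

The input record is left unmodified, so the original metadata can remain in restricted storage while only the anonymized copy enters the labeling workflow.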
The AI algorithm development function includes a dataset management function, a data labeling function, a model training function, a model evaluation function, a deployment function, an algorithm management function, and the like. The dataset management function includes dataset management and data archiving. Dataset management covers creating, modifying, and deleting datasets. Data archiving refers to archiving the various pieces of information that belong to the same data. The data labeling function includes management of labeling tasks, management of labeling teams, management of labeling tools, and the like. A labeling task is a task of labeling certain data. A labeling team is a team of multiple users who label a dataset. A labeling tool is a tool used to label data, such as an electronic brush. The model training function manages the training of AI models, including but not limited to resource pool management (that is, managing training data) and scheduling of training jobs. The model evaluation function evaluates a trained AI model, including but not limited to model verification, image packaging, and model visualization. The deployment function deploys an AI model that has passed evaluation online, for example through service deployment or model sharing. The algorithm management function manages various algorithms and may include, for example, medical segmentation algorithm management, 3D pre-training (such as pre-segmentation of medical images), 3D image content retrieval, 3D cross-modal registration, self-supervised few-shot learning, and 3D neural architecture search (NAS) optimization.
The auxiliary diagnosis function provides auxiliary diagnosis for various types of diseases.
The advanced AI image assistance function provides AI image processing functions, including but not limited to image content retrieval and image registration.
The cloud base platform 102 may also be called an AI development platform. The cloud base platform 102 provides various basic cloud services. The cloud base platform 102 includes, but is not limited to, a development environment management platform, a data processing platform, a data labeling platform, a training job platform, a model management function, and a service management function. The development environment management platform provides development environment management services. The data processing platform provides data processing services. The data labeling platform provides data labeling services; for example, it manages the labeling results of each piece of data to be labeled. The training job platform provides management services for AI model training jobs. The model management function provides model management services, such as updating and deleting models. The service management function manages the services provided.
The PaaS 103 includes resource pools. The resource pools include, but are not limited to, a shared resource pool and tenant-dedicated resource pools. The shared resource pool includes resources that all tenants can use, such as model training resources. A tenant-dedicated resource pool includes resources that an individual tenant can use, such as model training resources and model service resources. For example, when applied to the field of medical imaging, the resource pools further include an AI-assisted diagnosis resource pool, which includes, for example, AI models for medical auxiliary diagnosis.
The IaaS 104 includes server resources. The server resources include, but are not limited to, an elastic cloud server (ECS), an ECS plus graphics processing unit (GPU) server, and a bare metal server (BMS) plus GPU server. A cloud server may also be called a computing unit. A BMS is a dedicated physical server provided for a tenant.
It should be understood that the modules included in the AI platform shown in FIG. 1 are only an example. In some embodiments, the AI platform may include the functions of only some of these modules; in other embodiments, the AI platform may also include the functions of other modules. This application does not limit this.
FIG. 2 is a schematic structural diagram of the data labeling platform in the AI platform 100 provided in an embodiment of this application (hereinafter referred to as the data labeling platform 200). It should be understood that FIG. 2 merely shows, by way of example, one structure of the data labeling platform 200, and the embodiments of this application do not limit how the modules in the data labeling platform 200 are divided. As shown in FIG. 2, the data labeling platform 200 includes a data labeling module 201, a data storage module 202, and a labeling result management module 203.
The functions of the modules in the data labeling platform 200 shown in FIG. 2 are briefly described below:
The data labeling module 201 provides a labeling project management service, a labeling team management service, a labeling task management service, a data labeling service, and a label version management service.
For example, the labeling project management service is used to manage labeling projects. Each labeling project targets one dataset, and that dataset is used to train one AI model. For example, the dataset is a set of lung medical images used to train an AI model that assists in diagnosing the lungs. As another example, the dataset is a set of brain medical images used to train an AI model that assists in diagnosing the brain. The labeling project management service includes functions for creating a labeling project, modifying a labeling project, deleting a labeling project, viewing the list of labeling projects, viewing an overview of a labeling project, and the like. A management user can manage labeling projects based on these functions.
In the embodiments of this application, the management user is a developer or a third-party user of the AI platform, such as an independent software vendor (ISV).
For example, the labeling team management service is used to manage labeling teams. A labeling team is a team made up of multiple users and includes labeling users and review users; typically, a review user has a higher level of professional knowledge than a labeling user. Both labeling users and review users can label data, and review users can additionally arbitrate the labeling results of the data. The labeling team management service includes functions for creating a labeling team, deleting a labeling team, modifying members of a labeling team, adding members to a labeling team, deleting members from a labeling team, viewing the list of labeling teams, and the like. A management user can manage labeling teams based on these functions.
For example, the labeling task management service is used to manage labeling tasks. A labeling task is a task of labeling at least one piece of data to be labeled; for example, it may be a task of labeling part of the data in a labeling project or all of the data in a labeling project. Each piece of data to be labeled has a corresponding labeling type, which may be any one of classification, detection, or segmentation. A labeling task can be assigned to at least one labeling team, and each piece of data to be labeled in the task is assigned to at least two labeling users for labeling. The labeling task management service includes functions for creating a labeling task, deleting a labeling task, viewing the progress of a labeling task, and the like. A management user can manage labeling tasks based on these functions.
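The requirement that each piece of data be assigned to at least two labeling users can be illustrated with a minimal sketch. The round-robin policy and the names `assign_items`, `item_ids`, and `annotators` are assumptions made for illustration; the application does not specify how the assignment is computed.

```python
def assign_items(item_ids, annotators, per_item=2):
    """Assign each item to `per_item` distinct labeling users (round-robin).

    Hypothetical helper: the assignment policy is an assumption, not the
    platform's actual algorithm.
    """
    if per_item < 2:
        raise ValueError("each item must go to at least two labeling users")
    if len(annotators) < per_item:
        raise ValueError("not enough labeling users for the requested redundancy")
    n = len(annotators)
    assignments = {}
    for i, item in enumerate(item_ids):
        # Consecutive positions modulo n are distinct because per_item <= n.
        assignments[item] = [annotators[(i + k) % n] for k in range(per_item)]
    return assignments


assignments = assign_items(["img1", "img2", "img3"], ["alice", "bob", "carol"])
```

With three items and three labeling users, every item receives two different users and the workload is spread evenly.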
For example, the labeling task management service also includes a function for accepting labeling results and a function for viewing an acceptance report. Through the function for accepting labeling results, the management user can submit labeling results to a review user for arbitration. Through the function for viewing the acceptance report, the management user can view the acceptance result.
For example, the data labeling service includes functions for adding, modifying, deleting, and arbitrating labeling results. The data labeling service can also provide labeling tools for labeling users or review users, who can use these tools to label data.
For example, the label version management service is used to manage the label version information of the data to be labeled. Label version information distinguishes different batches of labels for the same piece of data to be labeled. The label version management service includes functions for querying label version information, deleting label version information, setting the current label version information, publishing label version information, and the like. The management user can query and delete label version information through the corresponding functions. Through the function for setting the current label version information, the management user can select among the different label versions of the data to be labeled and view the labels of that data under each version. The function for publishing label version information instructs the platform to store the labels of the data to be labeled in association with the label version information indicated by the management user.
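The four label version operations described above (query, delete, set current, publish) can be sketched as a small in-memory store. The class `LabelVersionStore` and its method names are hypothetical and only illustrate the relationships among the operations; the application does not describe the storage layout.

```python
class LabelVersionStore:
    """In-memory sketch of label version management (hypothetical names)."""

    def __init__(self):
        self._versions = {}   # version name -> {data_id: label}
        self._current = None  # version currently selected for viewing

    def publish(self, version, labels):
        # Store the labels in association with the indicated version.
        self._versions[version] = dict(labels)

    def list_versions(self):
        return sorted(self._versions)

    def set_current(self, version):
        if version not in self._versions:
            raise KeyError(version)
        self._current = version

    def delete(self, version):
        del self._versions[version]
        if self._current == version:
            self._current = None

    def label_of(self, data_id):
        if self._current is None:
            raise LookupError("no current label version selected")
        return self._versions[self._current].get(data_id)


store = LabelVersionStore()
store.publish("v1", {"img1": "normal"})
store.publish("v2", {"img1": "nodule"})
store.set_current("v2")
```

Switching the current version changes which batch of labels `label_of` returns for the same piece of data.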
For example, the data labeling platform 200 also provides a label viewing service. The management user can view the labels of labeled data through this service.
It should be noted that management users, labeling users, and review users can interact with the data labeling platform 200 through a graphical user interface (GUI) or by calling an application program interface (API).
The data storage module 202 may be a data storage resource corresponding to an object storage service (OBS) provided by a cloud service provider. The data storage module 202 stores the data to be labeled that users upload, together with the labeling results; for example, it stores a dataset to be labeled that a user uploads, where the dataset includes the data to be labeled. For example, the data labeling module 201 reads the data to be labeled from the OBS and, after labeling is completed, writes the labeling results to the OBS.
For example, the data storage module 202 is also used to store information about labeling teams and the like.
The labeling result management module 203 is used to determine whether to perform arbitration on the data to be labeled. When arbitration is not performed, the module fuses the at least two labeling results of each piece of data to be labeled to obtain the label of that data; when arbitration is performed, the module sends the data to be labeled to a review user. For example, the labeling result management module 203 reads the at least two labeling results of each piece of data to be labeled from the data storage module 202 and fuses them to obtain the label of each piece of data. For example, the fusion may be performed with Spark. After obtaining the label of the data to be labeled, the labeling result management module 203 can store the label, for example in a database such as HBase.
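The decision flow of the labeling result management module can be sketched as follows. This is a minimal illustration, not the platform's implementation: the arbitration test and the fusion step are passed in as caller-supplied functions because this passage does not fix either of them, and the names `manage_results`, `Fused`, and `SentToReview` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Union


@dataclass
class Fused:
    label: Any       # label obtained by fusing the labeling results


@dataclass
class SentToReview:
    data_id: str     # data forwarded to a review user for arbitration


def manage_results(
    data_id: str,
    results: Sequence[Any],
    needs_arbitration: Callable[[Sequence[Any]], bool],
    fuse: Callable[[Sequence[Any]], Any],
) -> Union[Fused, SentToReview]:
    """Either fuse the labeling results or route the data to a review user."""
    if len(results) < 2:
        raise ValueError("at least two labeling results are required")
    if needs_arbitration(results):
        return SentToReview(data_id)
    return Fused(fuse(results))


# Illustrative policies only: arbitrate when results differ, otherwise
# take the (identical) first result as the fused label.
differ = lambda rs: len(set(rs)) > 1
first = lambda rs: rs[0]
agreed = manage_results("img1", ["nodule", "nodule"], differ, first)
disputed = manage_results("img2", ["nodule", "normal"], differ, first)
```

Keeping the two policies as parameters lets the same flow serve classification, detection, and segmentation, each with its own rule.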
For example, the data labeling platform 200 further includes an AI inference module 204. The AI inference module 204 performs intelligent labeling on the data to be labeled to obtain an intelligent labeling result, and provides that result to labeling users and/or review users for reference through a user input/output (I/O) function.
For example, the data labeling platform 200 may also include an AI model training module and a data preprocessing module. After the label of each piece of data to be labeled in a dataset has been obtained, the result is a labeled dataset in which all data is labeled; the AI model training module trains an AI model based on this labeled dataset. The data preprocessing module performs preprocessing operations on the labeled dataset. For example, preprocessing can make the labeled data consistent in size, remove inappropriate data from the labeled data, and desensitize the labeled data. The data preprocessing module can store the preprocessed labeled data in the data storage module 202.
It should be noted that the data labeling platform 200 in this application may be a system that interacts with users. The system may be a software system, a hardware system, or a system combining software and hardware, which is not limited in this application.
To better illustrate the interaction between the management user and the data labeling module 201 in the data labeling platform 200 of FIG. 2, the interaction diagram shown in FIG. 3 is provided. In FIG. 3, the management user can use the labeling project management service, labeling team management service, label version management service, labeling task management service, data labeling service, and label viewing service provided by the data labeling module 201. Through the labeling project management service, the management user creates, deletes, and modifies labeling projects and views the list of labeling projects. Through the labeling team management service, the management user creates and deletes labeling teams, modifies, adds, and deletes labeling team members, and views the list of labeling teams. Through the label version management service, the management user creates label version information, deletes label version information, sets the current label version information, and publishes label version information. Through the labeling task management service, the management user creates, deletes, and modifies labeling tasks, views the progress of labeling tasks, and accepts labeling results. Through the label viewing service, the management user views the labels of labeled data. Through the data labeling service, the management user views the labeling results of the data to be labeled.
In addition, in FIG. 3, labeling users and review users can use the data labeling service to label data. For example, a labeling user labels the data to be labeled through the data labeling service and saves the labeling result. A review user reviews the labeling results of the data to be labeled through the data labeling service and saves the labeling results.
FIG. 4 is a schematic diagram of an application scenario of the AI platform 100 provided by way of example in an embodiment of this application. As shown in FIG. 4, the AI platform 100 may be deployed entirely in a cloud environment. A cloud environment is an entity that uses basic resources to provide cloud services to users under the cloud computing model. The cloud environment includes a large number of basic resources (including computing resources, storage resources, and network resources) owned by the cloud service provider; the computing resources may be a large number of computing devices, such as servers. The AI platform 100 may be deployed independently on a server or a virtual machine in the cloud environment, or it may be deployed in a distributed manner on multiple servers in the cloud environment, on multiple virtual machines in the cloud environment, or across both servers and virtual machines in the cloud environment. As shown in FIG. 4, the cloud service provider abstracts the AI platform 100 into a cloud service in the cloud environment and offers it to users; the cloud environment uses the AI platform 100 deployed in it to provide the cloud service. When using the cloud service, the management user can upload the data to be labeled to the cloud environment through an API or a GUI. The AI platform 100 in the cloud environment (for example, the data labeling platform 200) receives the data to be labeled and provides the data labeling service to users (labeling users, review users, and management users).
Deployment of the data labeling platform 200 provided in this application is relatively flexible. As shown in FIG. 5, in another embodiment, the data labeling platform 200 may also be deployed in a distributed manner across different environments. The data labeling platform 200 can be logically divided into multiple parts, each with a different function. For example, in one embodiment the data labeling platform 200 includes the data labeling module 201, the data storage module 202, and the labeling result management module 203. The parts of the data labeling platform 200 can be deployed across any two or all three of the following environments: terminal computing devices, an edge environment, and a cloud environment. Terminal computing devices include terminal servers, smartphones, notebook computers, tablet computers, personal desktop computers, smart cameras, and the like. The edge environment is an environment that includes a set of edge computing devices located relatively close to the terminal computing devices; edge computing devices include edge servers, edge stations with computing capability, and the like. The parts of the data labeling platform 200 deployed in different environments or devices cooperate to provide the data labeling function to users. For example, in one scenario, the data labeling module 201 and the data storage module 202 are deployed on a terminal computing device, and the labeling result management module 203 is deployed on an edge computing device in the edge environment. It should be understood that this application does not restrict which parts of the data labeling platform 200 are deployed in which environment. In practice, the deployment can be adapted according to the computing power of the terminal computing devices, the resource occupancy of the edge and cloud environments, or specific application requirements.
The data labeling platform 200 may also be deployed on its own on a single computing device in any environment (for example, on a single edge server in an edge environment). FIG. 6 is a schematic diagram of the hardware structure of a computing device 600 on which the data labeling platform 200 is deployed. The computing device 600 shown in FIG. 6 includes a memory 601, a processor 602, a communication interface 603, and a bus 604. The memory 601, the processor 602, and the communication interface 603 are communicatively connected to one another through the bus 604.
The memory 601 may be a read-only memory (ROM), a random access memory (RAM), a hard disk, a flash memory, or any combination thereof. The memory 601 can store a program; when the program stored in the memory 601 is executed by the processor 602, the processor 602 and the communication interface 603 are used to perform the data labeling method of the data labeling platform 200. The memory can also store datasets. For example, part of the storage resources in the memory 601 is allocated to storing datasets and the labels of the data in those datasets, and part is allocated to storing the labeling results of the data to be labeled.
The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a GPU, or any combination thereof. The processor 602 may include one or more chips. The processor 602 may include an AI accelerator, for example, a neural processing unit (NPU).
The communication interface 603 uses a transceiver module, such as a transceiver, to implement communication between the computing device 600 and other devices or communication networks. For example, data can be obtained through the communication interface 603.
The bus 604 may include a path for transferring information among the components of the computing device 600 (for example, the memory 601, the processor 602, and the communication interface 603).
The flow of the data labeling method is described below; FIG. 7 shows a schematic diagram of this flow. The method may be executed by the aforementioned data labeling platform 200, referred to below simply as the data labeling platform.
Step 701: send the data to be labeled to at least two labeling users.
The data to be labeled is data without labels that the management user submits to the data labeling platform. It is either a single piece of data that the management user submits on its own, or one of multiple pieces of data to be labeled that the management user submits, in which case the multiple pieces constitute a dataset to be labeled. The data to be labeled may be an image, such as a two-dimensional image or a three-dimensional image, and the three-dimensional image may be a complex medical image.
In this embodiment, the management user instructs the data labeling platform to label the data to be labeled, and the data labeling platform provides the data to at least two labeling users. For example, the data labeling platform sends the at least two labeling users an access path to the labeling interface for the data to be labeled; the access path may be an address link. Alternatively, the data labeling platform sends a labeling notification message to the at least two labeling users. For example, to save the human resources spent on labeling, the at least two labeling users may be exactly two labeling users.
A labeling user can trigger the access path of the labeling interface, which causes the labeling user's terminal to display a login window for the labeling interface. The labeling user logs in with their own account and password. Alternatively, the labeling user can log in to the data labeling platform directly with their account and password and enter the labeling interface from within the platform. The labeling interface displays the data that this labeling user is to label and does not display data that this user is not required to label. For example, the data labeling platform binds the labeling user's account to that user's data to be labeled in advance; after detecting that the labeling user has logged in, the platform provides the user with the data to be labeled that corresponds to the user's account. The labeling user can add labeling results to the data in the labeling interface and, after finishing, submits the labeling results to the data labeling platform. The data labeling platform stores the labeling results of each piece of data to be labeled.
It should be noted that when the data labeling platform stores the labeling results of each piece of data to be labeled, it need not add the labeling results directly onto the data; instead, it can store the labeling results separately. In this way, a labeling user is not influenced by existing labeling results when labeling the data. Each labeling result is stored in association with the identifier of the data to be labeled, so that the labeling result corresponds to that data. For example, when storing a labeling result, the data labeling platform can also associate it with the identifier of the annotator who produced it.
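Storing labeling results separately from the data, keyed by the data identifier and the annotator identifier, can be sketched as follows. The class `LabelingResultStore` and its methods are hypothetical; the application only states that results are stored separately and associated with these identifiers.

```python
from collections import defaultdict


class LabelingResultStore:
    """Keeps labeling results apart from the data itself (illustrative only)."""

    def __init__(self):
        # data_id -> {annotator_id: labeling result}
        self._results = defaultdict(dict)

    def save(self, data_id, annotator_id, result):
        self._results[data_id][annotator_id] = result

    def results_for(self, data_id):
        # .get avoids creating an empty entry for unknown data identifiers.
        return list(self._results.get(data_id, {}).values())

    def is_complete(self, data_id, required=2):
        # Labeling counts as complete once `required` annotators have submitted.
        return len(self._results.get(data_id, {})) >= required


results = LabelingResultStore()
results.save("img1", "alice", "nodule")
partial = results.is_complete("img1")
results.save("img1", "bob", "normal")
```

Because the data record never carries earlier results, the second annotator labels without seeing the first annotator's answer, and the platform can later fetch all results for one data identifier.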
It should also be noted that, for the data to be labeled, labeling results take different forms for different labeling types. For example, when the labeling type is classification, the labeling result is the category to which the data belongs. When the labeling type is detection or segmentation, the labeling result consists of the labeling boxes in the data and the description information of each labeling box, where the description information indicates the content of the box. For instance, if the data to be labeled is a medical image and the content of a labeling box is the right lung, the description information is "right lung". Note that when the labeling type is segmentation, the boundary of a region obtained by segmentation can also be regarded as a labeling box.
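The two forms of labeling result can be modeled with simple data types. This is a sketch under assumptions: the type names and the representation of a box as a sequence of 2D points are chosen for illustration and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Point = Tuple[float, float]


@dataclass(frozen=True)
class ClassificationResult:
    category: str                    # category the data belongs to


@dataclass(frozen=True)
class LabeledBox:
    outline: Tuple[Point, ...]       # box corners, or a segmented region's boundary
    description: str                 # what the box contains, e.g. "right lung"


@dataclass(frozen=True)
class RegionResult:
    """Result for the detection or segmentation labeling types."""
    boxes: Tuple[LabeledBox, ...]


LabelingResult = Union[ClassificationResult, RegionResult]

right_lung = LabeledBox(
    outline=((10.0, 20.0), (110.0, 20.0), (110.0, 180.0), (10.0, 180.0)),
    description="right lung",
)
detection = RegionResult(boxes=(right_lung,))
classification = ClassificationResult(category="pneumonia")
```

Treating a segmentation boundary as a box outline lets detection and segmentation share one result type, mirroring the remark above.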
Step 702, obtaining at least two labeling results produced by at least two labeling users for the data to be labeled.
In this embodiment, after detecting that labeling of the data to be labeled is complete, the data labeling platform obtains the at least two labeling results of the data from the location where labeling results are stored.
Alternatively, the management user submits a dataset to be labeled, and the data to be labeled belongs to that dataset. After detecting that every piece of data in the dataset has been labeled, the data labeling platform obtains at least two labeling results for each piece of data from the location where labeling results are stored.
Step 703, determining, according to the at least two labeling results, whether to perform arbitration on the data to be labeled.
In this embodiment, the data labeling platform determines the difference between the at least two labeling results and, based on that difference, decides whether to perform arbitration on the data to be labeled. When the difference is large, the platform determines to perform arbitration; when the difference is small, it determines not to.
For example, the rule for deciding whether to perform arbitration differs by labeling type. When the labeling type of the data to be labeled is classification, the data labeling platform checks whether the at least two labeling results are the same. If they are not the same, it determines to perform arbitration on the data; if they are the same, it determines not to perform arbitration.
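For the classification case, the check reduces to testing whether all submitted categories agree. A minimal sketch, assuming each labeling result is a hashable category value; the function name is hypothetical.

```python
def needs_arbitration_classification(results):
    """Return True when the category labels disagree, False when they all match."""
    if len(results) < 2:
        raise ValueError("at least two labeling results are required")
    # A set collapses identical categories; more than one entry means disagreement.
    return len(set(results)) > 1
```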
When the labeling type of the data to be labeled is detection or segmentation, the data labeling platform first identifies the labeling boxes of the same object in the at least two labeling results. It then determines the difference between the labeling boxes for the same object and checks whether the difference satisfies a preset condition. The preset condition is configured in advance, for example, that the difference is less than or equal to a target value. Different objects may have different preset conditions. If the difference between the labeling boxes for the same object does not satisfy the preset condition, the platform determines to perform arbitration on the data to be labeled; if the difference satisfies the preset condition, the platform determines not to perform arbitration on the data.
Here, the difference between labeling boxes may be one or more of the following: the difference between the areas of the labeling boxes, the degree of overlap of the labeling boxes, the distance between the centers of the labeling boxes, and the difference between the longest and shortest distances between boundary points of the labeling boxes.
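Three of these measures can be sketched for axis-aligned rectangular boxes. Several points here are assumptions rather than part of the description: boxes are given as `(x1, y1, x2, y2)` corner coordinates, the degree of overlap is computed as intersection over union, and the thresholds `max_center_distance` and `min_iou` stand in for the preset condition.

```python
import math


def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def area_difference(a, b):
    return abs(area(a) - area(b))


def center_distance(a, b):
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def overlap_iou(a, b):
    # Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones.
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, iw) * max(0.0, ih)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def needs_arbitration_boxes(a, b, max_center_distance, min_iou):
    # Arbitrate when the centers are too far apart or the boxes overlap too little.
    return center_distance(a, b) > max_center_distance or overlap_iou(a, b) < min_iou
```

Note that overlap is a similarity rather than a difference, so its condition is a lower bound, whereas the distance and area measures are bounded from above.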
It should be noted that an object is content in the data to be labeled that the labeling user labels. For example, if the data to be labeled is a medical image of the lungs, the objects are the right lung, the left lung, and so on; if the data to be labeled is a medical image of the thyroid, the object is a thyroid nodule or the like.
The data labeling platform identifies the labeling boxes of the same object in the at least two labeling results as follows: it obtains the description information of each labeling box in the at least two labeling results and treats labeling boxes with the same description information as labeling boxes of the same object. A single labeling result may contain several labeling boxes with the same description information; in that case, among the labeling boxes that have the same description information across the at least two labeling results, those closest to each other are treated as labeling boxes of the same object.
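One way to realize this matching for two labeling results is a greedy pairing by description and distance. The sketch below makes assumptions the description does not fix: "distance" is taken as the distance between box centers, boxes are `(x1, y1, x2, y2)` tuples, and each box is paired at most once. The function names are hypothetical.

```python
import math


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def match_same_object(result_a, result_b):
    """Pair boxes that share a description, preferring the closest centers.

    Each result is a list of (description, box) tuples. Returns a sorted list
    of (index_in_a, index_in_b) pairs; boxes without a partner are left out.
    """
    candidates = []
    for i, (desc_a, box_a) in enumerate(result_a):
        for j, (desc_b, box_b) in enumerate(result_b):
            if desc_a == desc_b:
                distance = math.dist(center(box_a), center(box_b))
                candidates.append((distance, i, j))
    candidates.sort()  # closest candidate pairs first

    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```

A box that has no partner in the other result (for example, an object only one labeling user marked) stays unmatched; how such cases feed into the arbitration decision is not specified here.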
Step 704, if it is determined to perform arbitration on the data to be labeled, sending the data to be labeled to a review user for labeling.
A review user usually has more domain expertise than a labeling user; equivalently, a review user's labeling accuracy can be considered higher than a labeling user's. For example, in the medical field, review users are experts and labeling users are general physicians. In some cases, a review user may also act as a labeling user. If the data requiring arbitration was labeled by a review user acting as a labeling user, the data is not submitted to that review user but to another review user.
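The exclusion rule in the last sentence can be sketched as a simple filter. The function name and the choice of returning the first eligible review user are assumptions for illustration.

```python
def pick_reviewer(reviewers, annotators_of_item):
    """Return the first review user who did not label the item, or None."""
    excluded = set(annotators_of_item)
    for reviewer in reviewers:
        if reviewer not in excluded:
            return reviewer
    return None  # every review user also labeled this item
```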
In this embodiment, when it determines to perform arbitration on the data to be labeled, the data labeling platform sends the data to be labeled to the review user. For example, the platform sends the review user an arbitration notification message, which may be delivered by email, short message, or the like. The arbitration notification message may include the access path of the arbitration interface; alternatively, it may omit the access path and merely inform the review user that data to be labeled requires arbitration.
After receiving the arbitration notification message, the review user can cause the terminal to display the login window of the arbitration interface and log in to the arbitration interface with their own account and password. Alternatively, after receiving the message, the review user logs in to the data labeling platform directly with their account and password and enters the arbitration interface from within the platform.
After the review user enters the arbitration interface, the interface displays the data to be labeled that awaits arbitration by this review user, and the review user selects each piece of data in turn for arbitration. When a piece of data is being arbitrated, the at least two labeling results of that data can be displayed as options. The review user judges whether a correct labeling result exists among them. If one exists, the review user submits the correct labeling result. If none exists, the review user either modifies one of the labeling results and submits the modified result, or labels the data anew and submits the new labeling result.
For example, after obtaining the review user's labeling result for the data to be labeled, the data labeling platform uses that labeling result as the label of the data.
Step 705, if it is determined not to perform arbitration on the data to be labeled, fusing the at least two labeling results according to a fusion rule to obtain the label of the data to be labeled.
In this embodiment, the data labeling platform obtains the fusion rule for the data to be labeled. The fusion rule is a rule for deriving a label when a piece of data to be labeled has at least two labeling results. Based on the fusion rule, the platform fuses the at least two labeling results of the data to obtain the label of the data, and then stores the label.
For example, fusion rules differ across application scenarios and across labeling types. The fusion process for image labeling is described below:
When the labeling type of the data to be labeled is classification, one of the at least two labeling results is taken as the label of the data. When the labeling type is detection or segmentation, the union of the labeling boxes for the same object is taken as the label of the data. This is only one example; other fusion processes are possible. For example, when the labeling type is detection or segmentation, the labeling boxes for the same object may be averaged, and the average taken as the label of the data.
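The fusion options above can be sketched for rectangular boxes given as `(x1, y1, x2, y2)`. Two interpretations here are assumptions: the union of rectangles is approximated by their smallest enclosing rectangle, and averaging is done per coordinate. The function names are hypothetical.

```python
def fuse_classification(results):
    # No arbitration was needed, so the results agree; any one of them serves as the label.
    return results[0]


def fuse_boxes_union(boxes):
    # Smallest rectangle that encloses every box for the same object.
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def fuse_boxes_average(boxes):
    # Coordinate-wise mean of the boxes for the same object.
    n = len(boxes)
    return tuple(sum(coords) / n for coords in zip(*boxes))
```

For segmentation regions with arbitrary shapes, a true union would be computed on masks or polygons instead; that case is not covered by this sketch.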
It should be noted that step 704 and step 705 have no fixed order: their execution order can be interchanged, or the two steps can be performed in parallel.
For example, a piece of data to be labeled may be labeled multiple times, and the data labeling platform can attach different label version information to the label from each round so that the rounds can be distinguished. After obtaining the label of the data to be labeled, the platform obtains the label version information entered by the management user for the data and stores it in correspondence with the label. The management user may enter the label version information when submitting the data to the platform, or after the platform has obtained the label of the data.
In the above solution, the data to be labeled is submitted to at least two labeling users, so that at least two labeling results are obtained for it. The platform then determines whether to perform arbitration on the data. If arbitration is not performed, the at least two labeling results are fused to determine the label of the data; if arbitration is performed, a review user carries it out. Having multiple people collaboratively label the same data thus solves the problem of low labeling efficiency and accuracy for complex data (such as three-dimensional images). Moreover, because the entire labeling process is based on the data labeling platform, it can be adapted to various cloud computing scenarios and is more flexible.
FIG. 7 gives an overview of the flow of the data labeling method. FIG. 8 describes the flow of the method shown in FIG. 7 in detail, using the labeling of the dataset to be labeled of one labeling project as an example.
Step 801, the data labeling platform creates a labeling project.
In this embodiment, the task of labeling the dataset required to train an AI model may be called a labeling project. For example, if the AI model is an auxiliary diagnosis model for pulmonary nodules and the data in the dataset are medical images of the lungs, then the task of labeling that dataset of lung images is one labeling project. The data labeling platform provides the management user with a labeling project creation interface, in which the management user enters the name of the labeling project and imports the dataset to be labeled into the platform. The platform records the labeling project. The dataset to be labeled includes multiple pieces of data to be labeled.
In addition, the management user can manage created labeling projects through the data labeling platform, for example by modifying, deleting, or viewing a labeling project.
For example, when creating a labeling project, the management user may also enter label version information, which indicates the label version of the project's dataset to be labeled.
Step 802, the data labeling platform creates a labeling team.
The labeling team is used to label the data to be labeled in the dataset to be labeled.
In this embodiment, the data labeling platform provides the management user with a team management interface, in which the management user can enter the information of the labeling team. This information includes the users in the labeling team and each user's role information; the roles are labeling user and review user. For example, every member of the labeling team is either a labeling user or a review user; or some members are only labeling users or only review users, while other members are both. The management user then submits the labeling team information to the platform, which stores it.
For example, after the management user enters a created labeling project, the data labeling platform provides the management user with a labeling team entry, through which the management user can create a labeling team. After creating the labeling team, the platform stores it as the labeling team of that labeling project.
For example, the management user may also create a labeling team directly through the team management interface, entering the name of the labeling project when creating the team. After creating the labeling team, the data labeling platform stores it as the labeling team of that labeling project.
For example, one labeling project may correspond to one or more labeling teams, each created in the manner described in step 802. One labeling team may also correspond to one or more labeling projects.
Step 803, the data labeling platform creates a labeling task.
A labeling task is a task of labeling at least one piece of data to be labeled. For example, a labeling task performs partial labeling of the dataset of the labeling project to which it belongs. Partial labeling covers two cases. In the first case, the project's dataset to be labeled is divided into multiple sub-datasets whose data differ from one another; each labeling task corresponds to the data in one sub-dataset, and all sub-datasets have the same labeling type. In the second case, every labeling task labels the project's entire dataset, but the labeling types differ; the data in the entire dataset is then the data to be labeled for each labeling task. For example, three labeling tasks perform classification labeling, detection labeling, and segmentation labeling respectively. In this second case, the project's dataset is used to train an AI model that combines classification, detection, and segmentation.
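The two ways of deriving labeling tasks from a project's dataset can be sketched as follows. The task representation (a dictionary with `label_type` and `data_ids`) and the round-robin split are assumptions chosen for brevity; the description does not say how sub-datasets are formed.

```python
def tasks_by_subset(data_ids, num_tasks, label_type):
    """First case: disjoint sub-datasets, one labeling type shared by all tasks."""
    chunks = [data_ids[i::num_tasks] for i in range(num_tasks)]
    return [{"label_type": label_type, "data_ids": chunk} for chunk in chunks]


def tasks_by_type(data_ids, label_types):
    """Second case: every task covers the whole dataset with its own labeling type."""
    return [{"label_type": t, "data_ids": list(data_ids)} for t in label_types]
```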
In this embodiment, after the management user creates a labeling project on the data labeling platform, the platform provides the management user with a labeling task entry. When the management user triggers it, the platform provides a labeling task creation interface, in which the management user enters the name of the labeling task, the data to be labeled for the task, the labeling type, and the labeling team. The management user may also enter a label save path in this interface. The management user then submits the created labeling task to the platform. After receiving the labeling task, the platform stores the task's name, labeling type, data to be labeled, and labeling team in correspondence with the labeling project to which the task belongs.
For example, if the management user did not specify label version information when creating the labeling project, the management user may enter the label version information for the labeling task when creating the task.
It should be noted that, when specifying the data to be labeled for each labeling task, the management user can select the data from the labeling project's dataset without importing it again. The data labeling platform stores each labeling task in correspondence with the identifiers of the data to be labeled for that task.
For example, because fusion rules differ across application scenarios, the management user may also enter the fusion rule for a labeling task when creating the task on the data labeling platform. For example, the platform provides the management user with a fusion rule configuration interface and receives the fusion rule that the management user enters or selects in that interface.
In this embodiment, the data labeling platform provides the management user with a configuration interface for fusion rules. For example, the interface for creating a labeling task includes an entry for fusion rules; when the management user triggers this entry, the platform presents the fusion rule configuration interface.
In the configuration interface, the management user can enter a fusion rule in a prescribed format. Alternatively, the configuration interface offers multiple fusion rules, from which the management user can select one. After entering or selecting a fusion rule, the management user submits it to the data labeling platform. After receiving the fusion rule, the platform stores it in correspondence with the labeling task. In this way, fusion rules can be customized for different scenarios, or different fusion rules can be customized for a single application scenario, which provides greater flexibility.
In addition, if the management user enters a fusion rule in the prescribed format, the data labeling platform can store that rule and later offer it to other management users. The fusion rules entered by multiple management users thus form a set of fusion rules to choose from, which makes selection easier. When storing a fusion rule, the platform may also store a brief summary of it, to help management users understand and select fusion rules.
In addition, after a labeling task is created, the data labeling platform provides the management user with functions such as viewing, modifying, and deleting the labeling task and viewing its progress. For example, the management user can change the fusion rule by using the function for modifying a labeling task.
It should be noted that, when the management user has not selected a fusion rule, the data labeling platform automatically selects one for the labeling task from the stored fusion rules, based on the task's labeling type.
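This fallback can be sketched as a lookup keyed by labeling type. The rule names (`take_one`, `union`) and the default mapping are hypothetical placeholders; the description does not specify which stored rule the platform picks for each type.

```python
# Hypothetical defaults: one stored fusion rule name per labeling type.
DEFAULT_RULES = {
    "classification": "take_one",
    "detection": "union",
    "segmentation": "union",
}


def resolve_fusion_rule(label_type, chosen_rule=None, stored_rules=DEFAULT_RULES):
    """Use the management user's choice if present, else fall back by labeling type."""
    if chosen_rule is not None:
        return chosen_rule
    if label_type not in stored_rules:
        raise ValueError(f"no stored fusion rule for labeling type {label_type!r}")
    return stored_rules[label_type]
```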
Step 804, the data labeling platform provides the labeling task to the labeling team, such that each piece of data to be labeled for the labeling task is assigned to at least two labeling users in the labeling team.
In this embodiment, the data labeling platform provides the labeling task to the labeling team based on a preset data allocation rule, that is, a rule configured in advance for assigning labeling users to data.
For example, the data labeling platform determines the number of labeling users in the labeling team and distributes the data to be labeled for the labeling task evenly among them, with each piece of data assigned to at least two labeling users. Alternatively, the platform provides the management user with a function for dividing up the data to be labeled for the task, and the management user assigns each piece of data to at least two labeling users. Once the labeling users assigned to each piece of data have been determined, the platform can store the correspondence between each labeling user's identifier (which may be the labeling user's account) and the identifiers of the data to be labeled.
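The even distribution with at least two labeling users per item can be sketched with a round-robin cursor. This is one possible allocation rule under stated assumptions, not the rule the platform necessarily uses; `assign_items` and `per_item` are hypothetical names.

```python
def assign_items(item_ids, annotators, per_item=2):
    """Assign each item to `per_item` distinct labeling users, balancing the load."""
    if per_item < 2:
        raise ValueError("each item must go to at least two labeling users")
    if len(annotators) < per_item:
        raise ValueError("not enough labeling users in the team")

    assignment = {user: [] for user in annotators}
    cursor = 0
    for item in item_ids:
        # Consecutive positions in the roster are distinct because per_item <= team size.
        for offset in range(per_item):
            user = annotators[(cursor + offset) % len(annotators)]
            assignment[user].append(item)
        cursor = (cursor + per_item) % len(annotators)
    return assignment
```

The returned mapping from labeling user to item identifiers corresponds to the stored user-to-data correspondence, which the platform can later use to show each user only their own items.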
The data labeling platform provides the labeling task to the labeling team. For example, the platform sends a labeling notification message to each labeling user in the labeling team; the message may be sent by email, short message, or the like, and may include the access path of the labeling interface.
Steps 801 to 804 are one implementation of step 701. Alternatively, steps 802 to 804 are one implementation of step 701. Alternatively, steps 803 to 804 are one implementation of step 701.
Step 805, when detecting that a labeling user has triggered the labeling interface, the data labeling platform provides, through the labeling interface, a viewing tool and a labeling tool corresponding to the labeling type of the data to be labeled. The viewing tool is used by the labeling user to view the data to be labeled, and the labeling tool is used by the labeling user to add a labeling result of the corresponding labeling type to the data.
In this embodiment, after receiving the labeling notification message on the terminal, the labeling user follows the access path of the labeling interface, which causes the terminal to connect to the data labeling platform and display the login window of the labeling interface. The labeling user can log in with their own account and password. After detecting that the labeling user has logged in, the platform provides the labeling user with the data to be labeled, based on the stored correspondence between the labeling user's account and the data to be labeled by that user.
Based on the labeling type indicated by the labeling task, the data labeling platform provides a viewing tool and a labeling tool through the labeling interface. When labeling a piece of data, the labeling user uses the viewing tool to display the data so that it can be examined more easily; the functions of the viewing tool include, but are not limited to, zooming in, zooming out, changing color, flipping, moving a layer down, and moving a layer up. To add a labeling result to the data, the labeling user uses the labeling tool. Providing suitable viewing and labeling tools in this way enables labeling users to label efficiently.
For example, when the labeling type is classification, the labeling tool provides a selection function: the labeling user can select a labeling result from the categories offered, or can enter a category. FIG. 9 shows the labeling interface when the labeling type is classification. In FIG. 9, the data to be labeled is medical images of the lungs. The left column shows the list of data to be labeled and the total number of items (4), the middle column shows the data currently being labeled (data 1), and the right column shows the category list (categories 1, 2, and 3). The viewing tools are displayed at the top of the middle column. FIG. 9 has three categories, from which the labeling user can select one as the labeling result.
When the labeling type is detection, the labeling tool provides a selection function, an add-labeling-box function, and so on. The labeling user can use the add-labeling-box function to add a labeling box to the data to be labeled, and can use the selection function to select the corresponding description information or can enter the description information. When the labeling type is detection, the labeling box and its description information constitute the labeling result; the labeling box can have any shape and any color. FIG. 10 shows the labeling interface when the labeling type is detection. In FIG. 10, the data to be labeled is medical images of the lungs. The left column shows the list of data to be labeled and the total number of items (8), the middle column shows the data currently being labeled, and the right column shows the description information list. The viewing tools are displayed at the top of the middle column. FIG. 10 shows one labeling box, drawn as a dashed rectangle, with circles at its upper-left and lower-right corners; dragging a circle changes the size of the labeling box. In addition, the description information list on the right of FIG. 10 shows two entries, description information 1 and description information 2. When description information 2 is selected, the position coordinates of the corresponding labeling box are also displayed; these may be the coordinates of the upper-left and lower-right corners of the labeling box.
When the labeling type is segmentation, the labeling tool provides a selection function, a segmentation function, and the like. The labeling user can use the segmentation function to mark a segmented region on the data to be labeled, and can use the selection function to select the corresponding description information. For segmentation, the segmented region together with its description information is the labeling result; the segmented region can be any polygon and can have any color. FIG. 11 shows the labeling interface for this labeling type. In FIG. 11, the data to be labeled are medical images of the lung. The left column shows the list of data to be labeled and the total number of items (10), the middle column shows the data to be labeled, and the right column shows the description information list. Viewing tools and the like are displayed at the top of the middle column, and labeling tools, such as a segmentation labeling tool and a segmented-region eraser, are displayed at the bottom of the middle column. FIG. 11 shows one segmented region of the right lung.
After the labeling user has labeled all of the data to be labeled, the labeling results are submitted to the data labeling platform. Alternatively, the labeling user submits to the data labeling platform each time one item has been labeled. The data labeling platform stores the labeling result of each item of data to be labeled.
For example, while the labeling user is labeling the data to be labeled, the labeling interface also provides an option to import historical labels. This option triggers the display of historical labels, where a historical label is a labeling result that the same labeling user previously gave to the same data. The labeling user can refer to the historical labels while labeling, which improves labeling efficiency.
For example, the data labeling platform can also provide intelligent labeling results for the labeling user's reference. For instance, the data labeling platform includes an AI inference model that labels the data to be labeled and produces an intelligent labeling result for it. The labeling interface offers the user an option to import the intelligent labeling result, and selecting this option displays the intelligent labeling result in the labeling interface. Importing intelligent labeling results for the labeling user's reference in this way can improve labeling efficiency.
Step 806: the data labeling platform obtains at least two labeling results for each item of data to be labeled corresponding to the labeling task.
For the processing in step 806, see the description of step 702.
Step 807: the data labeling platform determines, according to the at least two labeling results of each item of data to be labeled, whether to perform arbitration on that item.
For the processing in step 807, see the description of step 703.
Step 808: if it is determined that arbitration is to be performed on the data to be labeled, the data to be labeled is sent to a review user for labeling. If it is determined that arbitration is not to be performed, the at least two labeling results are fused according to a fusion rule to obtain the label of the data to be labeled.
In step 808, when the data labeling platform determines that arbitration is to be performed on the data to be labeled, it selects a review user from the labeling team corresponding to the labeling task and sends the data to be labeled to that review user. For the process by which the review user performs arbitration, see the description of step 704. When the data labeling platform determines that arbitration is not to be performed, see the description of step 705 for the process of determining the label of the data to be labeled.
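The branching in step 808 can be sketched as follows. This is only an illustrative sketch, not the platform's implementation: the function name `route_item`, the callables `needs_arbitration` and `fuse`, and the policy of picking the first review user in the team are all assumptions made for the example.

```python
from typing import Any, Callable, Sequence, Tuple


def route_item(
    results: Sequence[Any],
    needs_arbitration: Callable[[Sequence[Any]], bool],
    fuse: Callable[[Sequence[Any]], Any],
    review_users: Sequence[str],
) -> Tuple[str, Any]:
    """Decide what happens to one item after its labeling results arrive.

    Returns ("review", user) when the item is sent to a review user,
    or ("label", fused_label) when the results are fused directly.
    """
    if needs_arbitration(results):
        if not review_users:
            raise ValueError("no review user configured for this labeling task")
        # Hypothetical selection policy: take the first review user in the team.
        return ("review", review_users[0])
    return ("label", fuse(results))
```

Passing the decision and fusion logic in as callables keeps this routing independent of the labeling type, so the same function could serve classification, detection, and segmentation tasks.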
For example, in this embodiment the data labeling platform provides the data to be labeled to a review user within the labeling team for arbitration; in some cases, the data can instead be submitted to a review user outside the labeling team. After the data labeling platform obtains the review user's labeling result for the data to be labeled, it determines that labeling result as the label of the data.
In addition, if the review user determines during review that the data to be labeled is a hard sample, the review user can also mark it as such for use in subsequent training of the AI model. A hard sample is data whose labeling result is prone to error during labeling.
Step 809: the data labeling platform obtains the label version information that the management user entered for the data to be labeled, and stores the label of each item of data to be labeled in correspondence with the label version information.
In this embodiment, the data labeling platform obtains the label version information (entered when the labeling project or the labeling task is created) and stores the label of each item of data to be labeled corresponding to the labeling task in correspondence with the label version information.
Alternatively, after the data labeling platform determines that a labeling task has been completed, it sends the management user a feedback message indicating that the task is complete. The management user's terminal receives the feedback message, which also asks the management user whether to add label version information. The management user enters the label version information and submits it to the data labeling platform. The data labeling platform then stores the label of each item of data to be labeled corresponding to the labeling task in correspondence with the label version information.
It should be noted that in the process shown in FIG. 8, the data labeling platform automatically provides the data to be labeled to the review user for arbitration. In some implementations, after determining that arbitration is to be performed on the data to be labeled, the data labeling platform instead feeds the data back to the management user. After receiving the data, the management user submits a review task for it to the data labeling platform, and the data labeling platform performs step 808. In addition, the management user can also submit labels obtained through the fusion rule to a review user for arbitration.
It should also be noted that after the data labeling platform has determined the labels of the data to be labeled corresponding to the labeling task, the management user can have the data reviewed multiple times, and the labeling task is determined to be complete once these reviews have finished. Multiple reviews here can also be called multiple acceptance checks.
In addition, after the data set to be labeled corresponding to a labeling project has been fully labeled, the data labeling platform trains the AI model based on that data set.
It should also be noted that "management user" refers to a category of management personnel, as distinct from labeling users and review users; it does not denote a single person.
In addition, to make the data labeling process easier to understand, an embodiment of this application also provides a simplified flow. See FIG. 12, which is a schematic diagram of the interaction among the management user, the labeling users, the data labeling platform (computing device), and the review user. In FIG. 12, it is the management user who provides the data to be labeled to the data labeling platform for arbitration.
Step S1: the management user creates a labeling task.
Step S2: the management user selects at least two labeling users for each item of data to be labeled corresponding to the labeling task, and notifies those labeling users.
Step S3: the labeling users label the data to be labeled.
Step S4: the labeling users submit their labeling results to the computing device.
Step S5: for any item of data to be labeled, the computing device determines whether to perform arbitration on it.
Step S6: the computing device stores the labels of the data that do not require arbitration, and feeds the data that require arbitration back to the management user.
Step S7: the management user notifies the review user to review the data to be labeled.
Step S8: the review user performs arbitration on the data to be labeled.
Step S9: the review user returns to the management user the review user's labeling result for the data to be labeled.
Step S10: the management user has the computing device store the arbitrated labeling result as the label of the data to be labeled.
Once every item of data to be labeled corresponding to the labeling task has been given a label, the management user determines that the labeling task is complete. Only the execution of one labeling task is described here; every labeling task is executed in the same way, so the details are not repeated.
In the embodiments of this application, a unified human-computer interaction platform supports labeling of the same data by multiple people and yields labeling results in the same format. It also supports arbitration and review of labeling results by review users, which can effectively improve overall labeling quality and fault tolerance. Different fusion rules can be configured for different labeling scenarios, which effectively improves the quality of manual labeling. Furthermore, the labeling interface provides viewing tools and labeling tools, which makes it easier for labeling users to label three-dimensional images, a comparatively difficult task.
FIG. 12 is a structural diagram of a data labeling apparatus provided by an embodiment of this application. The apparatus can be part or all of the data labeling platform 200, and can be implemented in software, hardware, or a combination of the two. The apparatus provided by the embodiment of this application can implement the processes described in FIG. 7 and FIG. 8 of the embodiments of this application. The apparatus includes a data labeling module 201 and a labeling result management module 203, where:
the data labeling module 201 is configured to send the data to be labeled to at least two labeling users, and can specifically be used to perform the data labeling function of step 701 and the implicit steps it contains;
the labeling result management module 203 is configured to: obtain at least two labeling results of the at least two labeling users for the data to be labeled;
determine, according to the at least two labeling results, whether to perform arbitration on the data to be labeled, and can specifically be used to perform the labeling result management function of steps 702 and 703 and the implicit steps they contain;
the data labeling module 201 is further configured to, if it is determined that arbitration is to be performed on the data to be labeled, send the data to be labeled to a review user for labeling, and can specifically be used to perform the data labeling function of step 704 and the implicit steps it contains;
the labeling result management module 203 is further configured to, if it is determined that arbitration is not to be performed on the data to be labeled, fuse the at least two labeling results according to a fusion rule to obtain the label of the data to be labeled, and can specifically be used to perform the labeling result management function of step 705 and the implicit steps it contains.
In a possible implementation, the labeling result management module 203 is further configured to:
if it is determined that arbitration is to be performed on the data to be labeled, obtain the review user's labeling result for the data to be labeled, and use the review user's labeling result as the label of the data to be labeled.
In a possible implementation, the data labeling module 201 is further configured to receive configuration information, entered by the management user, for the labeling users, the review user, and the fusion rule.
In a possible implementation, the labeling result management module 203 is configured to:
when the labeling type of the data to be labeled is classification, determine to perform arbitration on the data to be labeled if the at least two labeling results are not the same, and determine not to perform arbitration on the data to be labeled if the at least two labeling results are the same;
when the labeling type of the data to be labeled is detection or segmentation, determine to perform arbitration on the data to be labeled if the difference between the labeling boxes for the same object in the at least two labeling results does not satisfy a preset condition, and determine not to perform arbitration on the data to be labeled if the difference between the labeling boxes for the same object in the at least two labeling results satisfies the preset condition.
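The arbitration decision just described can be illustrated with a short sketch. The text does not specify the preset condition for detection here; the sketch assumes, purely for illustration, that the condition is "every pair of boxes for the same object has an intersection-over-union (IoU) of at least `min_iou`". The function names and the default threshold of 0.5 are likewise hypothetical.

```python
from typing import Sequence, Tuple

# A box is (x1, y1, x2, y2): upper-left and lower-right corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def needs_arbitration_classification(results: Sequence[str]) -> bool:
    # Arbitrate when the labeling results are not all the same.
    return len(set(results)) > 1


def needs_arbitration_detection(boxes: Sequence[Box], min_iou: float = 0.5) -> bool:
    # Assumed preset condition: all pairs of boxes overlap with IoU >= min_iou.
    # Arbitrate as soon as one pair falls below the threshold.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) < min_iou:
                return True
    return False
```

Other difference measures, such as a maximum corner offset in pixels, would fit the same structure; only the pairwise test would change.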
In a possible implementation, the labeling result management module 203 is configured to:
when the labeling type of the data to be labeled is classification, determine one of the at least two labeling results as the label of the data to be labeled;
when the labeling type of the data to be labeled is detection or segmentation, take the union of the labeling boxes for the same object and determine it as the label of the data to be labeled.
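A minimal sketch of these fusion rules follows. Several choices are assumptions made for the example rather than anything stated in the text: for classification the first result is taken (the results agree whenever fusion is reached without arbitration), the "union" of detection boxes is represented by the smallest axis-aligned box enclosing all of them, and a segmented region is modeled as a set of pixel coordinates.

```python
from typing import Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Pixel = Tuple[int, int]


def fuse_classification(results: Sequence[str]) -> str:
    # Use one of the labeling results as the label; here, the first one.
    if not results:
        raise ValueError("at least one labeling result is required")
    return results[0]


def fuse_boxes(boxes: Sequence[Box]) -> Box:
    # Assumed reading of "union": the smallest box that encloses every box.
    if not boxes:
        raise ValueError("at least one labeling box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def fuse_regions(regions: Sequence[Set[Pixel]]) -> Set[Pixel]:
    # Union of segmented regions, each given as a set of pixel coordinates.
    return set().union(*regions)
```

For example, under these assumptions `fuse_boxes([(0, 0, 10, 10), (2, 1, 12, 9)])` gives `(0, 0, 12, 10)`.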
In a possible implementation, the data labeling module 201 is further configured to:
when it is detected that the labeling user has triggered the labeling interface, provide through the labeling interface a viewing tool and a labeling tool corresponding to the labeling type of the data to be labeled, where the viewing tool is used by the labeling user to view the data to be labeled, and the labeling tool is used by the labeling user to add to the data to be labeled a labeling result corresponding to the labeling type.
In a possible implementation, the labeling result management module 203 is further configured to:
after the label of the data to be labeled is obtained, obtain the label version information that the management user entered for the data to be labeled;
store the label of the data to be labeled in correspondence with the label version information.
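Storing labels in correspondence with label version information could look like the sketch below. The class `LabelStore`, its methods, and the in-memory dictionary keyed by data identifier and version string are all hypothetical; a real platform would presumably use persistent storage.

```python
from typing import Any, Dict, List, Optional, Tuple


class LabelStore:
    """In-memory store that keeps each label together with its version info."""

    def __init__(self) -> None:
        # Key: (data identifier, label version information); value: the label.
        self._labels: Dict[Tuple[str, str], Any] = {}

    def save(self, data_id: str, version: str, label: Any) -> None:
        self._labels[(data_id, version)] = label

    def get(self, data_id: str, version: str) -> Optional[Any]:
        # Returns None when no label was stored for this data and version.
        return self._labels.get((data_id, version))

    def versions(self, data_id: str) -> List[str]:
        # All label versions recorded for one item, in sorted order.
        return sorted(v for (d, v) in self._labels if d == data_id)
```

Keying on the pair of data identifier and version lets the same item carry different labels under different versions, so an earlier version is not overwritten when a task is relabeled.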
The division into modules in the embodiments of this application is schematic and is only a division by logical function; other divisions are possible in an actual implementation. In addition, the functional modules in the embodiments of this application can be integrated into one processor, can each exist physically on their own, or two or more modules can be integrated into one module. The integrated modules can be implemented in the form of hardware or in the form of software functional modules.
If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a terminal device (which can be a personal computer, a mobile phone, a network device, or the like) or a processor to execute all or some of the steps of the methods in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, or an optical disc.
An embodiment of this application further provides a computer program product containing instructions which, when run on a computing device, cause the computing device to execute the data labeling method provided above, or cause the computing device to implement the functions of the data labeling apparatus provided above.
The above embodiments can be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a server or terminal, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium can be any available medium that a server or terminal can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium can be a magnetic medium (such as a floppy disk, hard disk, or magnetic tape), an optical medium (such as a digital video disk (DVD)), or a semiconductor medium (such as a solid-state drive).

Claims (17)

  1. A data labeling method, characterized in that the method is applied to a data labeling platform and comprises:
    sending data to be labeled to at least two labeling users;
    obtaining at least two labeling results of the at least two labeling users for the data to be labeled;
    determining, according to the at least two labeling results, whether to perform arbitration on the data to be labeled;
    if it is determined that arbitration is to be performed on the data to be labeled, sending the data to be labeled to a review user for labeling;
    if it is determined that arbitration is not to be performed on the data to be labeled, fusing the at least two labeling results according to a fusion rule to obtain a label of the data to be labeled.
  2. The method according to claim 1, characterized in that, if it is determined that arbitration is to be performed on the data to be labeled, the method further comprises:
    obtaining the review user's labeling result for the data to be labeled, and using the review user's labeling result as the label of the data to be labeled.
  3. The method according to claim 1 or 2, characterized in that the method further comprises: receiving configuration information, entered by a management user, for the labeling users, the review user, and the fusion rule.
  4. The method according to any one of claims 1-3, characterized in that the determining, according to the at least two labeling results, whether to perform arbitration on the data to be labeled comprises:
    when the labeling type of the data to be labeled is classification, determining to perform arbitration on the data to be labeled if the at least two labeling results are not the same, and determining not to perform arbitration on the data to be labeled if the at least two labeling results are the same;
    when the labeling type of the data to be labeled is detection or segmentation, determining to perform arbitration on the data to be labeled if the difference between the labeling boxes for the same object in the at least two labeling results does not satisfy a preset condition, and determining not to perform arbitration on the data to be labeled if the difference between the labeling boxes for the same object in the at least two labeling results satisfies the preset condition.
  5. The method according to any one of claims 1-4, characterized in that the fusing the at least two labeling results according to a fusion rule to obtain the label of the data to be labeled comprises:
    when the labeling type of the data to be labeled is classification, determining one of the at least two labeling results as the label of the data to be labeled;
    when the labeling type of the data to be labeled is detection or segmentation, taking the union of the labeling boxes for the same object and determining it as the label of the data to be labeled.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    when it is detected that the labeling user has triggered a labeling interface, providing through the labeling interface a viewing tool and a labeling tool corresponding to the labeling type of the data to be labeled, where the viewing tool is used by the labeling user to view the data to be labeled, and the labeling tool is used by the labeling user to add to the data to be labeled a labeling result corresponding to the labeling type.
  7. The method according to any one of claims 1-6, characterized in that, after the label of the data to be labeled is obtained, the method further comprises:
    obtaining label version information that a management user entered for the data to be labeled;
    storing the label of the data to be labeled in correspondence with the label version information.
  8. A data labeling apparatus, characterized by comprising:
    a data labeling module, configured to send data to be labeled to at least two labeling users;
    a labeling result management module, configured to: obtain at least two labeling results of the at least two labeling users for the data to be labeled;
    determine, according to the at least two labeling results, whether to perform arbitration on the data to be labeled;
    the data labeling module is further configured to, if it is determined that arbitration is to be performed on the data to be labeled, send the data to be labeled to a review user for labeling;
    the labeling result management module is further configured to, if it is determined that arbitration is not to be performed on the data to be labeled, fuse the at least two labeling results according to a fusion rule to obtain a label of the data to be labeled.
  9. The apparatus according to claim 8, characterized in that the labeling result management module is further configured to:
    if it is determined that arbitration is to be performed on the data to be labeled, obtain the review user's labeling result for the data to be labeled, and use the review user's labeling result as the label of the data to be labeled.
  10. The apparatus according to claim 8 or 9, characterized in that the data labeling module is further configured to receive configuration information, entered by a management user, for the labeling users, the review user, and the fusion rule.
  11. The apparatus according to any one of claims 8-10, characterized in that the labeling result management module is configured to:
    when the labeling type of the data to be labeled is classification, determine to perform arbitration on the data to be labeled if the at least two labeling results are not the same, and determine not to perform arbitration on the data to be labeled if the at least two labeling results are the same;
    when the labeling type of the data to be labeled is detection or segmentation, determine to perform arbitration on the data to be labeled if the difference between the labeling boxes for the same object in the at least two labeling results does not satisfy a preset condition, and determine not to perform arbitration on the data to be labeled if the difference between the labeling boxes for the same object in the at least two labeling results satisfies the preset condition.
  12. The apparatus according to any one of claims 8-11, characterized in that the labeling result management module is configured to:
    when the labeling type of the data to be labeled is classification, determine one of the at least two labeling results as the label of the data to be labeled;
    when the labeling type of the data to be labeled is detection or segmentation, take the union of the labeling boxes for the same object and determine it as the label of the data to be labeled.
  13. The apparatus according to any one of claims 8-12, characterized in that the data labeling module is further configured to:
    when it is detected that the labeling user has triggered a labeling interface, provide through the labeling interface a viewing tool and a labeling tool corresponding to the labeling type of the data to be labeled, where the viewing tool is used by the labeling user to view the data to be labeled, and the labeling tool is used by the labeling user to add to the data to be labeled a labeling result corresponding to the labeling type.
  14. The apparatus according to any one of claims 8-13, characterized in that the labeling result management module is further configured to:
    after the label of the data to be labeled is obtained, obtain label version information that a management user entered for the data to be labeled;
    store the label of the data to be labeled in correspondence with the label version information.
  15. A computing device for data labeling, wherein the computing device comprises a processor and a memory, and wherein:
    the memory stores computer instructions;
    the processor executes the computer instructions to implement the method according to any one of claims 1-7.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and when the computer instructions in the computer-readable storage medium are executed by a computing device, the computing device is caused to perform the method according to any one of claims 1-7.
  17. A computer program product, comprising computer instructions, wherein when the computer instructions are executed by a computing device, the computing device is caused to perform the method according to any one of claims 1-7.
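
Illustrative sketch (not part of the claims): the detection/segmentation branches of claims 11 and 12 can be pictured with the minimal Python example below. Several points are assumptions made only for illustration: the claims do not define how the "difference" between labeling boxes is measured or what the preset condition is, so the sketch assumes intersection-over-union (IoU) with a hypothetical threshold `min_iou`; it also reads "union" as the smallest axis-aligned box enclosing both boxes. The names `Box`, `iou`, `needs_arbitration`, and `merge_boxes` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned labeling box; the field names are illustrative."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    # A box with inverted coordinates (no overlap) has zero area.
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, in [0, 1]."""
    inter = Box(max(a.x_min, b.x_min), max(a.y_min, b.y_min),
                min(a.x_max, b.x_max), min(a.y_max, b.y_max))
    inter_area = area(inter)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def needs_arbitration(a: Box, b: Box, min_iou: float = 0.5) -> bool:
    """Assumed preset condition: two boxes agree when IoU >= min_iou."""
    return iou(a, b) < min_iou


def merge_boxes(a: Box, b: Box) -> Box:
    """Smallest box enclosing both inputs (one reading of 'union')."""
    return Box(min(a.x_min, b.x_min), min(a.y_min, b.y_min),
               max(a.x_max, b.x_max), max(a.y_max, b.y_max))


if __name__ == "__main__":
    first = Box(0, 0, 10, 10)    # box drawn by one labeling user
    second = Box(1, 1, 11, 11)   # box drawn by another labeling user
    if needs_arbitration(first, second):
        print("send to arbitration")
    else:
        print("label:", merge_boxes(first, second))
```

For segmentation masks, the same idea would apply with a pixel-wise union in place of the enclosing box; that variant is likewise an assumption rather than something the claims specify.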
PCT/CN2022/081097 2021-08-30 2022-03-16 Data labeling method and apparatus, and device and storage medium WO2023029436A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111003760.4A CN115732061A (en) 2021-08-30 2021-08-30 Data labeling method, device, equipment and storage medium
CN202111003760.4 2021-08-30

Publications (1)

Publication Number Publication Date
WO2023029436A1 (en) 2023-03-09

Family

ID=85290760

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/081097 WO2023029436A1 (en) 2021-08-30 2022-03-16 Data labeling method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN115732061A (en)
WO (1) WO2023029436A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684468A (en) * 2018-12-13 2019-04-26 四川大学 For the document screening mark platform of evidence-based medicine EBM
US20200226174A1 (en) * 2019-03-29 2020-07-16 Xi'an Jiaotong University Cloud-based large-scale pathological image collaborative annotation method and system
CN111783863A (en) * 2020-06-23 2020-10-16 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG WENJUAN, CHAI XIANGFEI; WEI SHENGMEI; ZOU XIAOGUANG; XIA YUWEI; HE JIAJIA; ZHOU GUANGQUAN; LI WENQIANG; GUO HUI: "Research and Implementation of Addh Image Collaborative Annotation System Based on The Cloud Platform", CHINA MEDICAL DEVICES, vol. 36, no. 3, 29 March 2021 (2021-03-29), pages 33 - 37, XP093041994, ISSN: 1674-1633, DOI: 10.3969/j.issn.1674-1633.2021.03.007 *

Also Published As

Publication number Publication date
CN115732061A (en) 2023-03-03

Similar Documents

Publication Publication Date Title
CN110785736B (en) Automatic code generation
US10140709B2 (en) Automatic detection and semantic description of lesions using a convolutional neural network
US20230359778A1 (en) Configuration of a digital twin for a building or other facility via bim data extraction and asset register mapping
US20210342745A1 (en) Artificial intelligence model and data collection/development platform
US10733754B2 (en) Generating a graphical user interface model from an image
US11556749B2 (en) Domain adaptation and fusion using weakly supervised target-irrelevant data
Lekadir et al. FUTURE-AI: guiding principles and consensus recommendations for trustworthy artificial intelligence in medical imaging
Williams et al. Improving digital hospital transformation: development of an outcomes-based infrastructure maturity assessment framework
Nance Jr et al. The future of the radiology information system
EP2784734A1 (en) System and method for high accuracy product classification with limited supervision
JP2018106662A (en) Information processor, information processing method, and program
US20210057058A1 (en) Data processing method, apparatus, and device
WO2019100635A1 (en) Editing method and apparatus for automated test script, terminal device and storage medium
US10691827B2 (en) Cognitive systems for allocating medical data access permissions using historical correlations
US11907860B2 (en) Targeted data acquisition for model training
JP2024507599A (en) System and computer-implemented method for label data verification
CN115686280A (en) Deep learning model management system, method, computer device and storage medium
US20230141049A1 (en) Method and system for consolidating heterogeneous electronic health data
Cohen et al. An orchestration platform that puts radiologists in the driver’s seat of AI innovation: a methodological approach
US9892451B2 (en) Information processing apparatus, information processing method, and non-transitory computer readable medium
US11822896B2 (en) Contextual diagram-text alignment through machine learning
WO2023179038A1 (en) Data labeling method, ai development platform, computing device cluster, and storage medium
WO2023029436A1 (en) Data labeling method and apparatus, and device and storage medium
US11714813B2 (en) System and method for proposing annotations
CN115203472A (en) Data management method and system based on data annotation

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22862584; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)