WO2022063274A1 - Data labeling method and system, and electronic device - Google Patents

Data labeling method and system, and electronic device

Info

Publication number
WO2022063274A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
training
automl
test set
model
Prior art date
Application number
PCT/CN2021/120710
Other languages
English (en)
French (fr)
Inventor
姚超
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2022063274A1 publication Critical patent/WO2022063274A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • This specification relates to the field of information technology, and in particular to an automatic data labeling method, system and electronic device.
  • At present, network operation and maintenance data is generally labeled in a rather primitive way, namely manually.
  • Manual labeling predominates because, unlike research in the image, audio and video fields, which is relatively concentrated and has accumulated many algorithms and tools, the topics in network optimization and efficiency improvement are diverse. Although some fields, such as anomaly detection, offer general-purpose interface tools or unsupervised algorithms to assist labeling, in most cases researchers must label manually or develop custom programs as the situation requires.
  • With so many topics, the workload of manual annotation is huge, and writing a dedicated assisting program each time is also cumbersome.
  • Facing data volumes of tens of thousands to millions of records to be labeled, the exploration of machine learning algorithms encounters great difficulties and challenges; a method or tool to make data labeling more efficient is urgently needed.
  • AutoML (Automated Machine Learning) is an approach, emerging in recent years, that automates the end-to-end construction of machine learning pipelines for practical problems; it combines feature engineering, model selection and optimization-algorithm selection, and optimizes each part automatically, so that domain experts can build machine learning workflows without relying on data scientists. Because it is general-purpose, however, it is not tuned to any particular problem and often cannot achieve the best results.
  • a data annotation system including a user processing module, a data selection module, a quality inspection module, an AutoML engine and an extensible model library, wherein:
  • the user processing module is used to provide the user with a visual interactive interface so that the user can process and understand the data;
  • the data selection module is used to select a subset of the full data according to a specified strategy, the strategy including random extraction and extraction in order of predicted probability;
  • the quality inspection module is used to provide confidence support for the data the user spot-checks;
  • the extensible model library is used to provide models for AutoML to search over, and can be extended via plug-ins to enhance the capability of the AutoML engine.
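The two selection and extension mechanisms described above can be sketched in code. This is a minimal illustration only; every class and method name below is an assumption for illustration, not an API defined by the patent.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

@dataclass
class ExtensibleModelLibrary:
    """Holds candidate model factories; new ones can be plugged in."""
    factories: Dict[str, Callable] = field(default_factory=dict)

    def register(self, name: str, factory: Callable) -> None:
        self.factories[name] = factory  # plug-in style extension

@dataclass
class DataSelectionModule:
    """Selects part of the data by the two strategies the text names."""
    def select(self, pool: List, n: int,
               probs: Optional[Sequence[float]] = None) -> List:
        if probs is None:  # strategy 1: random extraction
            return random.sample(pool, min(n, len(pool)))
        # strategy 2: sort by predicted probability, take the least confident
        ranked = sorted(zip(pool, probs), key=lambda t: t[1])
        return [item for item, _ in ranked[:n]]
```

The AutoML engine and quality inspection module would sit on top of these two primitives in the same spirit.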
  • an electronic device comprising:
  • a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the operations of the first aspect.
  • a computer-readable storage medium stores one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the operations described in the first aspect.
  • FIG. 1 is a schematic structural diagram of a data labeling system provided by an embodiment of this specification.
  • FIG. 2 is a schematic diagram of steps of a data labeling method provided by an embodiment of the present specification.
  • FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present specification.
  • the data labeling system may include: a user processing module 10, a data selection module 20, a quality inspection module 30, an AutoML engine 40 and an extensible model library 50.
  • the data labeling system is not limited to including the above modules, and may also include other functional modules that assist in realizing data labeling, which will not be described here.
  • the user processing module 10 is used to provide the user with a visual interactive interface, which is convenient for the user to process and understand the data.
  • the data selection module 20 is used to select part of the data from the full amount of big data according to a specified strategy, and the strategy includes random extraction and ordered extraction according to predicted probability.
  • the quality inspection module 30 is used to provide confidence support for the user's sampling data.
  • the extensible model library 50 is used to provide AutoML with searchable models, and can be extended via plug-ins to enhance the capability of the AutoML engine.
  • the AutoML engine 40 is used to receive training data and automatically train a machine learning model, and to receive test data and give inference results using a specified model.
  • Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action.
  • Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them.
  • The model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate.
  • Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
  • the method may include the following steps: create a training set and a test set, and put the data to be labeled into the test set; extract test-set data, label it, and put it into the training set; set training parameters and start AutoML model training on the training-set data; apply the trained AutoML model to the test set and output labeled data that meets the quality inspection requirements. Specifically:
  • Step 101 The training set is empty, and all data is put into the test set.
  • Optionally, the test set and the training set are created from the structured data to be labeled and its field information.
  • The training set starts empty and all data is put into the test set, including one or a combination of the following operations:
  • receiving operation: receive a given piece of structured data to be labeled together with its field information;
  • set-creation operation: the system automatically creates a test set and a training set;
  • the initial training set is empty, and all data to be labeled is placed in the test set by default.
  • Step 102 Perform quality inspection on the test set.
  • Optionally, when quality inspection is performed on the test set in step 102, it is executed as follows: the system randomly draws the number of records specified by the user, the user checks the correctness of the labels, and error-rate indicators at different confidence levels are given based on the spot-check results.
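The patent does not say how the error-rate indicators at different confidence levels are computed; one reasonable choice, shown here purely as an assumed sketch, is the Wilson score interval over the spot-check sample.

```python
import math

def error_rate_interval(errors: int, sampled: int, confidence: float = 0.95):
    """Wilson score interval for the true error rate, estimated from a
    random spot check of `sampled` labels of which `errors` were wrong.
    (An assumed formula; the patent does not specify one.)"""
    # two-sided z-values for common confidence levels
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence]
    p = errors / sampled
    denom = 1 + z * z / sampled
    centre = (p + z * z / (2 * sampled)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / sampled
                                   + z * z / (4 * sampled ** 2))
    return max(0.0, centre - half), min(1.0, centre + half)

# e.g. 3 wrong labels in a 200-record spot check, at three confidence levels
for conf in (0.90, 0.95, 0.99):
    lo, hi = error_rate_interval(3, 200, conf)
    print(f"{conf:.0%}: error rate in [{lo:.3f}, {hi:.3f}]")
```

Higher confidence levels give wider intervals, which is exactly the "multiple confidence intervals" the quality inspection module reports to support the user's decision.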
  • step 103 it is judged whether the quality inspection result meets the requirements.
  • Optionally, step 103 is executed as follows: the user decides whether to accept the spot-check result. If the user accepts, the flow goes to step 108; otherwise it goes to step 104. The user may also continue adding sampled data to raise the confidence of the desired indicators.
  • It should be understood that the prediction results are sampled and checked; when the predetermined accuracy indicator is met, the flow proceeds to step 108, otherwise it continues from step 104.
  • step 104 n pieces of data are extracted from the test set for manual annotation, and placed in the training set.
  • Optionally, step 104 is executed as follows: the system extracts n records from the test set (n can be set by the user) and presents them on the interface, where the user labels them manually.
  • The first time, the system finds the predicted probability empty and selects n records at random; when the predicted probability is not empty, the records are sorted by predicted probability and the top n with the smallest probability are selected.
  • Test-set data can be extracted from the test set in either of the following two ways:
  • Way 1: the first time data is selected, n records (with n small) are randomly chosen from the test set for manual labeling and, once labeled, are placed in the training set.
  • Way 2: after the model's prediction in an iteration, the records are sorted by predicted probability and the top n with the smallest probability, that is, the n records the model is least sure about, are manually corrected and relabeled; the corrected labeled data is put into the training set so that subsequent models converge quickly on the remaining test data.
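The "top n with the smallest probability" selection in Way 2 is a least-confident ranking; a minimal sketch, assuming a class-probability matrix as produced by a typical classifier (the function name is illustrative):

```python
import numpy as np

def pick_most_uncertain(proba: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n rows the model is least sure about:
    rank rows by the probability of their predicted (top) class and
    take the n smallest."""
    top_class_prob = proba.max(axis=1)       # confidence in the prediction
    return np.argsort(top_class_prob)[:n]    # ascending: least confident first

# rows 3 and 1 have the smallest top-class probabilities (0.51, 0.55)
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.70, 0.30],
                  [0.51, 0.49]])
print(pick_most_uncertain(proba, 2))
```

The selected rows are exactly the ones handed back to the human annotator each iteration.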
  • In step 105, the training parameters are set.
  • Optionally, the user sets the training parameters, which can be reused in subsequent iterations.
  • It should be understood that, to speed up training, the user can specify only the columns relevant to the labeling result as input, and can specify the AutoML model search space, for example restricting it to try only three models: GBT, Random Forest and LinearSVC.
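Restricting the search space to the three named model families can be illustrated with a minimal stand-in for the AutoML engine; scikit-learn is assumed here only for illustration (the patent does not name a library), and the tiny cross-validated search is a simplification of what a real engine would do.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# synthetic stand-in for the labeled training set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# user-specified search space: only the three models named in the text
search_space = {
    "GBT": GradientBoostingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "LinearSVC": LinearSVC(dual=False),
}

# pick the model family with the best 3-fold cross-validation score
best = max(search_space,
           key=lambda k: cross_val_score(search_space[k], X, y, cv=3).mean())
print("best model:", best)
```

Narrowing the space this way trades some of AutoML's generality for much faster iterations, which matters because training re-runs on every labeling round.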
  • the model is trained with the training set.
  • Optionally, the AutoML engine is started with the training set to train the model.
  • step 107 the model is applied to the test set, and step 102 is performed.
  • Optionally, after training finishes, the model is applied to the test set to obtain predictions for the test-set data, and step 102 is executed.
  • step 108 the training set and the test set are merged, and the result is output.
  • Optionally, in step 108 the training set and the predicted test set are merged, and the completed labeled data is output.
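Steps 101 to 108 above form one loop, which can be sketched as follows. All callables here (`label_fn` for manual annotation, `train_automl` for the engine, `predict` for inference, `qc_ok` for quality inspection) are placeholders, not APIs defined by the patent, and the first-n pick stands in for random sampling.

```python
def iterative_label(data, label_fn, train_automl, predict, qc_ok, n=20):
    """Sketch of steps 101-108: start with an empty training set and all
    data in the test set; loop until the quality check accepts the labels."""
    train, test = [], list(data)              # step 101: all data in test set
    labels, probs = {}, None
    while not qc_ok(test, labels):            # steps 102-103: spot-check
        # step 104: first n as a stand-in for random; least-confident later
        if probs is None:
            picked = test[:n]
        else:
            picked = [x for x, _ in sorted(zip(test, probs),
                                           key=lambda t: t[1])[:n]]
        for x in picked:
            labels[x] = label_fn(x)           # manual annotation
            test.remove(x)
            train.append(x)
        model = train_automl(train, labels)   # steps 105-106: train via AutoML
        preds, probs = predict(model, test)   # step 107: predict the test set
        labels.update(preds)
    return train + test, labels               # step 108: merge and output
```

Each pass shrinks the manually labeled fraction the human must touch, while the quality check decides when the model's labels on the remainder are good enough to ship.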
  • each step shown in FIG. 2 may refer to the solution of the data labeling system provided in the first embodiment, which will not be repeated here.
  • Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action.
  • Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them.
  • The model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate.
  • Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
  • FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present specification.
  • the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory.
  • the memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage.
  • the electronic equipment may also include hardware required for other services.
  • the processor, the network interface and the memory can be connected to each other through an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one bidirectional arrow is used in FIG. 3, but it does not mean that there is only one bus or one type of bus.
  • the memory is used to store a program; specifically, the program may include program code comprising computer operation instructions.
  • the memory may include memory and non-volatile memory and provide instructions and data to the processor.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and runs it, forming a shared resource access control device at the logical level.
  • the processor executes the program stored in the memory and is specifically used to perform the following operations: create a training set and a test set, and put the data to be labeled into the test set; extract test-set data, label it, and put it into the training set; set training parameters and start AutoML model training on the training-set data; apply the trained AutoML model to the test set and output labeled data that meets the quality inspection requirements.
  • each step of the above-mentioned method can be completed by a hardware integrated logic circuit in a processor or an instruction in the form of software.
  • the above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in conjunction with the embodiments of this specification may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the electronic device can also execute the method shown in the accompanying drawings, and realize the functions of the data labeling system in the embodiments shown in the accompanying drawings, and the embodiments of this specification will not be repeated here.
  • the electronic devices in the embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flows is not limited to individual logic units and can also be hardware or a logic device.
  • Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action.
  • Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them.
  • The model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate.
  • Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
  • the embodiments of this specification also provide a computer-readable storage medium storing one or more programs that include instructions which, when executed by a portable electronic device including a plurality of application programs, cause the portable electronic device to execute the method of the embodiments shown in the accompanying drawings, and specifically to: create a training set and a test set, and put the data to be labeled into the test set; extract test-set data, label it, and put it into the training set; set training parameters and start AutoML model training on the training-set data; apply the trained AutoML model to the test set and output labeled data that meets the quality inspection requirements.
  • Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action.
  • Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them.
  • The model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate.
  • Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or A combination of any of these devices.
  • Computer-readable media include persistent and non-persistent, removable and non-removable media, and can implement information storage by any method or technology.
  • Information may be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cartridges, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this specification disclose a data labeling method comprising the following steps: create a training set and a test set, and put the data to be labeled into the test set; extract test-set data, label it, and put it into the training set; set training parameters and start AutoML model training on the training-set data; apply the trained AutoML model to the test set and output labeled data that meets the quality inspection requirements.

Description

Data labeling method and system, and electronic device
Cross-reference
This application is based on, and claims priority to, Chinese patent application No. 202011031077.7 filed on September 27, 2020, the entire contents of which are incorporated herein by reference.
Technical field
This specification relates to the field of information technology, and in particular to an automatic data labeling method, system and electronic device.
Background
With the rapid development of 5G new infrastructure, communication networks have become increasingly complex, and network operation and maintenance are gradually moving from manual toward semi-automatic or even fully automatic operation, in which machine learning algorithms play an ever larger role. Generally speaking, supervised machine learning algorithms perform better, so large amounts of labeled data are needed to train them and improve their performance.
At present, network operation and maintenance data is generally labeled in a rather primitive way, namely manually. This is mainly because, unlike research in the image, audio and video fields, which is relatively concentrated and has accumulated many algorithms and tools, the topics in network optimization and efficiency improvement are diverse. Although some fields, such as anomaly detection, offer general-purpose interface tools or unsupervised algorithms to assist labeling, in most cases researchers must label manually or develop custom programs as the situation requires. With many topics, the manual annotation workload is huge, and writing a dedicated program each time is cumbersome. Facing data volumes of tens of thousands to millions of records to be labeled, the exploration of machine learning algorithms encounters great difficulties and challenges; a method or tool to make data labeling more efficient is urgently needed.
On the other hand, AutoML (Automated Machine Learning) is an approach, emerging in recent years, that automates the end-to-end construction of machine learning pipelines for practical problems. It combines the three steps needed to build machine learning, namely feature engineering, model selection and optimization-algorithm selection, and optimizes each part automatically, so that domain experts can build machine learning workflows themselves without relying on data scientists, giving it broad generality. Precisely because it is general, however, it is not optimized for any particular problem and often cannot achieve the best results.
Summary
In a first aspect of this application, a data labeling method is provided, comprising the following steps:
creating a training set and a test set, and putting the data to be labeled into the test set;
extracting test-set data, labeling it, and putting it into the training set;
setting training parameters, and starting AutoML model training on the training-set data;
applying the trained AutoML model to the test set, and outputting labeled data that meets the quality inspection requirements.
In a second aspect of this application, a data labeling system is provided, comprising a user processing module, a data selection module, a quality inspection module, an AutoML engine and an extensible model library, wherein:
the user processing module is used to provide the user with a visual interactive interface so that the user can process and understand the data;
the data selection module is used to select a subset of the full data according to a specified strategy, the strategy including random extraction and extraction in order of predicted probability;
the quality inspection module is used to provide confidence support for the data the user spot-checks;
the extensible model library is used to provide models for AutoML to search over, and can be extended via plug-ins to enhance the capability of the AutoML engine.
In a third aspect of this application, an electronic device is provided, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the operations of the first aspect.
In a fourth aspect of this application, a computer-readable storage medium is provided, storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the operations of the first aspect.
Brief description of the drawings
To explain the technical solutions in the embodiments of this specification or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a data labeling system provided by an embodiment of this specification.
FIG. 2 is a schematic diagram of the steps of a data labeling method provided by an embodiment of this specification.
FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of this specification.
Detailed description
To help those skilled in the art better understand the technical solutions in this specification, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments of this specification. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this specification without creative effort shall fall within its protection scope.
Referring to FIG. 1, a data labeling system provided by an embodiment of this specification is introduced below; it may include: a user processing module 10, a data selection module 20, a quality inspection module 30, an AutoML engine 40 and an extensible model library 50. The data labeling system is not limited to these modules and may also include other functional modules that assist in data labeling, which are not described one by one here.
<User processing module>
The user processing module 10 is used to provide the user with a visual interactive interface so that the user can process and understand the data.
<Data selection module>
The data selection module 20 is used to select a subset of the full data according to a specified strategy, the strategy including random extraction and extraction in order of predicted probability.
<Quality inspection module>
The quality inspection module 30 is used to provide confidence support for the data the user spot-checks.
<Extensible model library>
The extensible model library 50 is used to provide AutoML with searchable models, and can be extended via plug-ins to enhance the capability of the AutoML engine.
<AutoML engine>
The AutoML engine 40 is used to receive training data and automatically train a machine learning model, and to receive test data and give inference results using a specified model.
Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action. Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them. Meanwhile, the model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate. Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
Referring to FIG. 2, a schematic diagram of the steps of a data labeling method provided by an embodiment of this specification, the method may include the following steps: create a training set and a test set, and put the data to be labeled into the test set; extract test-set data, label it, and put it into the training set; set training parameters and start AutoML model training on the training-set data; apply the trained AutoML model to the test set and output labeled data that meets the quality inspection requirements. Specifically:
Step 101: The training set is empty and all data is put into the test set.
Optionally, the test set and the training set are created from the structured data to be labeled and its field information. The training set is empty and all data is put into the test set, specifically including one or a combination of the following operations:
receiving operation: receive a given piece of structured data to be labeled together with its field information;
set-creation operation: the system automatically creates a test set and a training set; the initial training set is empty, and all data to be labeled is placed in the test set by default.
Step 102: Perform quality inspection on the test set.
Optionally, when quality inspection is performed on the test set in step 102, it is executed as follows: the system randomly draws the number of records specified by the user, the user checks the correctness of the labels, and error-rate indicators at different confidence levels are given based on the spot-check results.
In step 103, it is judged whether the quality inspection result meets the requirements.
Optionally, when judging whether the quality inspection result meets the requirements, step 103 is executed as follows: the user decides whether to accept the spot-check result. If the user accepts, the flow goes to step 108; otherwise it goes to step 104. The user may also continue adding sampled data to raise the confidence of the desired indicators.
It should be understood that the prediction results are sampled and checked; when the predetermined accuracy indicator is met, the flow proceeds to step 108, otherwise it continues from step 104.
In step 104, n records are extracted from the test set for manual annotation and placed in the training set.
Optionally, step 104 is executed as follows: the system extracts n records from the test set (n can be set by the user) and presents them on the interface, where the user labels them manually. The first time, the system finds the predicted probability empty and selects n records at random; when the predicted probability is not empty, the records are sorted by predicted probability and the top n with the smallest probability are selected.
Further, data can be extracted from the test set in the following two ways:
Way 1:
The first time data is selected, n records (with n small) are randomly chosen from the test set for manual labeling and, once labeled, are placed in the training set.
Way 2:
After the model's prediction in an iteration, the records are sorted by predicted probability and the top n with the smallest probability, that is, the n records the model is least sure about, are manually corrected and relabeled; the corrected labeled data is put into the training set so that subsequent models converge quickly on the remaining test data.
In step 105, the training parameters are set.
Optionally, step 105 is executed as follows: the user sets the training parameters, which can be reused in subsequent iterations.
It should be understood that, to speed up training, the user can specify only the columns relevant to the labeling result as input, and can specify the AutoML model search space, for example restricting it to try only three models: GBT, Random Forest and LinearSVC.
In step 106, the model is trained with the training set.
Optionally, step 106 is executed as follows: the AutoML engine is started with the training set to train the model.
In step 107, the model is applied to the test set, and step 102 is executed.
Optionally, step 107 is executed as follows: after training finishes, the model is applied to the test set to obtain predictions for the test-set data, and step 102 is executed.
In step 108, the training set and the test set are merged and the result is output.
Optionally, step 108 is executed as follows: the training set and the predicted test set are merged, and the completed labeled data is output.
The specific implementation of each step shown in FIG. 2 may refer to the data labeling system solution provided in the first embodiment and is not repeated here.
Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action. Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them. Meanwhile, the model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate. Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of this specification. Referring to FIG. 3, at the hardware level the electronic device includes a processor, and optionally an internal bus, a network interface and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. Of course, the electronic device may also include hardware required by other services.
The processor, the network interface and the memory can be connected to each other through the internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one bidirectional arrow is used in FIG. 3, but this does not mean there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code comprising computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and runs it, forming a shared-resource access control apparatus at the logical level. The processor executes the program stored in the memory and is specifically used to perform the following operations:
creating a training set and a test set, and putting the data to be labeled into the test set;
extracting test-set data, labeling it, and putting it into the training set;
setting training parameters, and starting AutoML model training on the training-set data;
applying the trained AutoML model to the test set, and outputting labeled data that meets the quality inspection requirements.
The method performed by the data labeling system disclosed in the embodiments shown in the figures of this specification can be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and can implement or execute the methods, steps and logical block diagrams disclosed in the embodiments of this specification. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in the embodiments of this specification may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media mature in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method with its hardware.
The electronic device can also execute the method shown in the accompanying drawings and realize the functions of the data labeling system in the illustrated embodiments, which the embodiments of this specification do not repeat here.
Of course, besides the software implementation, the electronic device of the embodiments of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to individual logic units and can also be hardware or a logic device.
Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action. Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them. Meanwhile, the model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate. Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
The embodiments of this specification also provide a computer-readable storage medium storing one or more programs that include instructions which, when executed by a portable electronic device including a plurality of application programs, cause the portable electronic device to execute the method of the embodiments shown in the accompanying drawings, and specifically to execute the following method:
creating a training set and a test set, and putting the data to be labeled into the test set;
extracting test-set data, labeling it, and putting it into the training set;
setting training parameters, and starting AutoML model training on the training-set data;
applying the trained AutoML model to the test set, and outputting labeled data that meets the quality inspection requirements.
Through the above technical solution, AutoML supplemented by manual iteration improves annotation quality: every AutoML step can be automated, making the approach fairly general; each iteration manually processes the data the model predicts worst, quickly improving the model's accuracy on the remaining data; and the quality inspection module gives multiple confidence intervals for the user's spot checks in real time to support the user's next action. Because AutoML applies to a wide range of scenarios, it can assist manual data annotation in most of them. Meanwhile, the model handles most of the data while humans attend only to the small portion the model is unsure about or predicts incorrectly, avoiding the problem that AutoML alone may not be highly accurate. Playing to strengths and avoiding weaknesses greatly reduces the amount of data that must be manually labeled, greatly reducing the human workload.
In summary, the above are only preferred embodiments of this specification and are not intended to limit its protection scope. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this specification shall be included in its protection scope.
The systems, apparatuses, modules or units set forth in the above embodiments may be implemented by a computer chip or entity, or by a product with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include persistent and non-persistent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cartridges, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Each embodiment in this specification is described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are basically similar to the method embodiments, their description is relatively simple, and reference may be made to the corresponding parts of the method embodiments.

Claims (9)

  1. A data labeling method, comprising the following steps:
    creating a training set and a test set, and putting the data to be labeled into the test set;
    extracting test-set data, labeling it, and putting it into the training set;
    setting training parameters, and starting AutoML model training on the training-set data;
    applying the trained AutoML model to the test set, and outputting labeled data that meets the quality inspection requirements.
  2. The data labeling method of claim 1, wherein the step of creating a training set and a test set and putting the data to be labeled into the test set comprises:
    creating the test set and the training set from the structured data to be labeled and its field information;
    the initial training set being empty, with all data put into the test set.
  3. The data labeling method of claim 1, wherein the step of extracting test-set data, labeling it and putting it into the training set comprises:
    when data is selected for the first time, manually labeling data randomly selected from the test set and putting it into the training set;
    sorting by the predicted probabilities of the AutoML model in an iteration, selecting the data with the smallest probabilities for manual labeling, and putting the corrected data into the training set.
  4. The data labeling method of any one of claims 1 to 3, wherein the step of setting training parameters and starting AutoML model training on the training-set data comprises specifying, as training parameters, the columns relevant to the labeling result and the AutoML model search space.
  5. The data labeling method of any one of claims 1 to 4, wherein the step of applying the trained AutoML model to the test set and outputting labeled data that meets the quality inspection requirements comprises:
    obtaining the prediction results of applying the trained AutoML model to the test set;
    judging whether the prediction results meet the quality inspection requirements;
    merging the test set that meets the quality inspection requirements with the training set, and outputting the labeling result;
    for the test set that does not meet the quality inspection requirements, re-extracting test-set data for labeling, putting it into the training set, and performing model training.
  6. The data labeling method of claim 5, wherein the step of judging whether the prediction results meet the quality inspection requirements comprises setting accuracy indicators at different confidence levels, sampling and checking the prediction results, and judging whether the check results meet the preset accuracy indicators.
  7. A data labeling system, comprising a user processing module, a data selection module, a quality inspection module, an AutoML engine and an extensible model library:
    the user processing module is used to provide the user with a visual interactive interface so that the user can process and understand the data;
    the data selection module is used to select a subset of the full data according to a specified strategy, the strategy including random extraction and extraction in order of predicted probability;
    the quality inspection module is used to provide confidence support for the data the user spot-checks;
    the extensible model library is used to provide models for AutoML to search over, and can be extended via plug-ins to enhance the capability of the AutoML engine.
  8. An electronic device, comprising:
    a processor; and
    a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the steps of the data labeling method of any one of claims 1 to 6.
  9. A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the steps of the data labeling method of any one of claims 1 to 6.
PCT/CN2021/120710 2020-09-27 2021-09-26 Data labeling method and system, and electronic device WO2022063274A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011031077.7A CN114282586A (zh) 2020-09-27 2020-09-27 Data labeling method and system, and electronic device
CN202011031077.7 2020-09-27

Publications (1)

Publication Number Publication Date
WO2022063274A1 true WO2022063274A1 (zh) 2022-03-31

Family

ID=80844961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120710 WO2022063274A1 (zh) 2020-09-27 2021-09-26 Data labeling method and system, and electronic device

Country Status (2)

Country Link
CN (1) CN114282586A (zh)
WO (1) WO2022063274A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001791A (zh) * 2022-05-27 2022-09-02 北京天融信网络安全技术有限公司 Attack resource labeling method and apparatus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287324A (zh) * 2019-06-27 2019-09-27 成都冰鉴信息科技有限公司 Dynamic data labeling method and apparatus for coarse-grained text classification
CN110458238A (zh) * 2019-08-02 2019-11-15 南通使爱智能科技有限公司 Method and system for detecting and locating arc points on certificates
CN110766164A (zh) * 2018-07-10 2020-02-07 第四范式(北京)技术有限公司 Method and system for executing a machine learning process
CN111008707A (zh) * 2019-12-09 2020-04-14 第四范式(北京)技术有限公司 Automated modeling method, apparatus and electronic device
US20200151459A1 (en) * 2018-11-14 2020-05-14 Disney Enterprises, Inc. Guided Training for Automation of Content Annotation
CN111611239A (zh) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus, device and storage medium for implementing automated machine learning
CN111611240A (zh) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus and device for executing an automated machine learning process

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9082084B2 (en) * 2013-06-28 2015-07-14 Linkedin Corporation Facilitating machine learning in an online social network
CN110555206A (zh) * 2018-06-01 2019-12-10 中兴通讯股份有限公司 Named entity recognition method, apparatus, device and storage medium
TWI684925B (zh) * 2018-10-17 2020-02-11 新漢智能系統股份有限公司 Method for automatically building an object recognition model
US11023745B2 (en) * 2018-12-27 2021-06-01 Beijing Didi Infinity Technology And Development Co., Ltd. System for automated lane marking
CN110263934B (zh) * 2019-05-31 2021-08-06 中国信息通信研究院 Artificial intelligence data labeling method and apparatus
CN111008706B (zh) * 2019-12-09 2023-05-05 长春嘉诚信息技术股份有限公司 Processing method for automatically labeling, training and predicting massive data
CN111177136B (zh) * 2019-12-27 2023-04-18 上海依图网络科技有限公司 Labeled data cleaning apparatus and method
CN111680793B (zh) * 2020-04-21 2023-06-09 广州中科易德科技有限公司 Blockchain consensus method and system based on deep learning model training

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110766164A (zh) * 2018-07-10 2020-02-07 第四范式(北京)技术有限公司 Method and system for executing a machine learning process
US20200151459A1 (en) * 2018-11-14 2020-05-14 Disney Enterprises, Inc. Guided Training for Automation of Content Annotation
CN110287324A (zh) * 2019-06-27 2019-09-27 成都冰鉴信息科技有限公司 Dynamic data labeling method and apparatus for coarse-grained text classification
CN110458238A (zh) * 2019-08-02 2019-11-15 南通使爱智能科技有限公司 Method and system for detecting and locating arc points on certificates
CN111008707A (zh) * 2019-12-09 2020-04-14 第四范式(北京)技术有限公司 Automated modeling method, apparatus and electronic device
CN111611239A (zh) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus, device and storage medium for implementing automated machine learning
CN111611240A (zh) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus and device for executing an automated machine learning process

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001791A (zh) * 2022-05-27 2022-09-02 北京天融信网络安全技术有限公司 Attack resource labeling method and apparatus
CN115001791B (zh) * 2022-05-27 2024-02-06 北京天融信网络安全技术有限公司 Attack resource labeling method and apparatus

Also Published As

Publication number Publication date
CN114282586A (zh) 2022-04-05

Similar Documents

Publication Publication Date Title
CN108228758B (zh) Text classification method and apparatus
US10831993B2 (en) Method and apparatus for constructing binary feature dictionary
CN110427487B (zh) Data labeling method, apparatus and storage medium
CN110866930B (zh) Semantic segmentation assisted labeling method and apparatus
US20230259712A1 (en) Sound effect adding method and apparatus, storage medium, and electronic device
CN112083897A (zh) Signal declaration system, method, device and medium in digital logic design
CN109857957B (zh) Method for building a label library, electronic device and computer storage medium
CN110717019A (zh) Question answering processing method, question answering system, electronic device and medium
WO2022063274A1 (zh) Data labeling method and system, and electronic device
CN111159354A (zh) Sensitive information detection method, apparatus, device and system
CN108804563B (zh) Data labeling method, apparatus and device
CN111258905A (zh) Defect localization method and apparatus, electronic device and computer-readable storage medium
CN112818126B (zh) Training method, application method and apparatus for a network security corpus construction model
CN112560463B (zh) Text multi-labeling method, apparatus, device and storage medium
CN108255891B (zh) Method and apparatus for discriminating web page types
CN116976432A (zh) Chip simulation method and apparatus supporting parallel task processing, and chip simulator
US11829696B2 (en) Connection analysis method for multi-port nesting model and storage medium
CN115599973A (zh) User group label screening method, system, device and storage medium
CN115185685A (zh) Deep-learning-based artificial intelligence task scheduling method, apparatus and storage medium
CN111324732B (zh) Model training method, text processing method, apparatus and electronic device
CN116029280A (zh) Document key information extraction method, apparatus, computing device and storage medium
CN113742470A (zh) Data retrieval method, system, electronic device and medium
TW202139054A (zh) Form data detection method, computer device and storage medium
CN110750569A (zh) Data extraction method, apparatus, device and storage medium
CN112445784B (zh) Text structuring method, device and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871643

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.08.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21871643

Country of ref document: EP

Kind code of ref document: A1