WO2022174496A1 - Data annotation method and apparatus based on generative model, and device and storage medium - Google Patents
- Publication number
- WO2022174496A1 (PCT/CN2021/083758)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- labeling
- sample
- label
- probability
- text
- Prior art date
Classifications
- G06F16/381—Information retrieval of unstructured textual data; retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using identifiers, e.g. barcodes, RFIDs
- G06F16/215—Information retrieval of structured data; design, administration or maintenance of databases; improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/367—Information retrieval of unstructured textual data; creation of semantic tools, e.g. ontology or thesauri; ontology
- G06F40/211—Handling natural language data; natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284—Handling natural language data; natural language analysis; lexical analysis, e.g. tokenisation or collocates
- G06N3/04—Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
- G06N3/084—Computing arrangements based on biological models; neural networks; learning methods; backpropagation, e.g. using gradient descent
Abstract
A data annotation method and apparatus based on a generative model, and a device and a storage medium, relating to the technical field of artificial intelligence and applicable to the field of natural language processing. The method comprises: acquiring text to be annotated, and performing splitting, word segmentation and merging on the text to obtain target phrases; annotating the target phrases on the basis of multiple preset annotation rules to obtain label samples; acquiring the sample annotation probability of each label sample with respect to the target phrases, iteratively updating the initial parameters of a generative model on the basis of the sample annotation probabilities to obtain a trained generative model, and outputting the annotation accuracy by means of the trained generative model; and then determining a target label sample according to the annotation accuracy. The present invention also relates to blockchain technology, and the text to be annotated is stored in a blockchain. Data is annotated on the basis of multiple preset rules, and the label sample with the highest annotation accuracy is selected according to the generative model, thereby helping to improve the accuracy of data annotation.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 20, 2021, with application number 202110193454.5 and entitled "Data Labeling Method, Apparatus, Device and Storage Medium Based on a Generative Model", the entire contents of which are incorporated herein by reference.
The present application relates to the technical field of artificial intelligence, and in particular to a data labeling method, apparatus, device and storage medium based on a generative model.
As knowledge graphs play an increasingly prominent role in various vertical fields, how to label large-scale unlabeled data has become a focus of attention in the knowledge graph field.
Although the accuracy of named entity recognition on labeled data now exceeds 99%, manually constructing labeled text data for different domains takes an extremely long time. Moreover, labeled data from different domains is not fully transferable: differences in business scenarios, target users and product definitions make it difficult to build large-scale labeled text data that can be applied across domains. How to improve the efficiency of labeling large-scale data has therefore become a difficult problem.
To address the above problem, an existing solution obtains the word sequence corresponding to the original text, transforms and maps the word sequence to obtain entity labeling vectors, and counts the number of preset entity items in the entity labeling vectors, thereby labeling the data. However, the inventor found that because this labeling method relies on transforming and mapping word vectors, it is prone to labeling errors, resulting in low labeling accuracy on large-scale data. A method that can improve the accuracy of data labeling is urgently needed.
SUMMARY OF THE INVENTION
The purpose of the embodiments of the present application is to provide a data labeling method, apparatus, device and storage medium based on a generative model, so as to improve the accuracy of data labeling.
To solve the above technical problem, an embodiment of the present application provides a data labeling method based on a generative model, including:
acquiring text to be labeled, and splitting the text to be labeled to obtain split sentences;
performing word segmentation on the split sentences to obtain target word segments, and merging the target word segments to obtain target phrases;
acquiring a plurality of preset labeling rules, and labeling the target phrases with each of the preset labeling rules respectively, to obtain a label sample corresponding to each preset rule;
acquiring the sample labeling probability of each preset labeling rule's label sample with respect to the target phrases, and obtaining initial parameters of a generative model according to the sample labeling probabilities and the label samples;
iteratively updating the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model, and outputting, through the trained generative model, the labeling accuracy corresponding to each label sample;
selecting the label sample with the highest labeling accuracy as the target label sample.
To solve the above technical problem, an embodiment of the present application provides a data labeling apparatus based on a generative model, including:
a to-be-labeled text splitting module, configured to acquire text to be labeled and split the text to be labeled to obtain split sentences;
a target phrase acquisition module, configured to perform word segmentation on the split sentences to obtain target word segments, and merge the target word segments to obtain target phrases;
a label sample generation module, configured to acquire a plurality of preset labeling rules and label the target phrases with each of the preset labeling rules respectively, to obtain a label sample corresponding to each preset rule;
an initial parameter generation module, configured to acquire the sample labeling probability of each preset labeling rule's label sample with respect to the target phrases, and obtain initial parameters of a generative model according to the sample labeling probabilities and the label samples;
a labeling accuracy output module, configured to iteratively update the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model, and output, through the trained generative model, the labeling accuracy corresponding to each label sample;
a label sample selection module, configured to select the label sample with the highest labeling accuracy as the target label sample.
To solve the above technical problem, a technical solution adopted in the present application is to provide a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions and the processor, when executing the computer-readable instructions, implements the following steps:
acquiring text to be labeled, and splitting the text to be labeled to obtain split sentences;
performing word segmentation on the split sentences to obtain target word segments, and merging the target word segments to obtain target phrases;
acquiring a plurality of preset labeling rules, and labeling the target phrases with each of the preset labeling rules respectively, to obtain a label sample corresponding to each preset rule;
acquiring the sample labeling probability of each preset labeling rule's label sample with respect to the target phrases, and obtaining initial parameters of a generative model according to the sample labeling probabilities and the label samples;
iteratively updating the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model, and outputting, through the trained generative model, the labeling accuracy corresponding to each label sample;
selecting the label sample with the highest labeling accuracy as the target label sample.
To solve the above technical problem, another technical solution adopted in the present application is a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
acquiring text to be labeled, and splitting the text to be labeled to obtain split sentences;
performing word segmentation on the split sentences to obtain target word segments, and merging the target word segments to obtain target phrases;
acquiring a plurality of preset labeling rules, and labeling the target phrases with each of the preset labeling rules respectively, to obtain a label sample corresponding to each preset rule;
acquiring the sample labeling probability of each preset labeling rule's label sample with respect to the target phrases, and obtaining initial parameters of a generative model according to the sample labeling probabilities and the label samples;
iteratively updating the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model, and outputting, through the trained generative model, the labeling accuracy corresponding to each label sample;
selecting the label sample with the highest labeling accuracy as the target label sample.
Embodiments of the present application provide a data labeling method, apparatus, device and storage medium based on a generative model. In the embodiments of the present application, data is labeled according to a plurality of preset rules, and the label sample with the highest labeling accuracy is selected according to the generative model, which helps to improve the accuracy of data labeling.
In order to explain the solutions in the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 2 is an implementation flowchart of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 3 is an implementation flowchart of a sub-process of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 4 is another implementation flowchart of a sub-process of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 5 is another implementation flowchart of a sub-process of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 6 is another implementation flowchart of a sub-process of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 7 is another implementation flowchart of a sub-process of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 8 is another implementation flowchart of a sub-process of the data labeling method based on a generative model provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of the data labeling apparatus based on a generative model provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device provided by an embodiment of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the specification, claims and drawings of this application are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification, claims and drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
The present application is described in detail below with reference to the accompanying drawings and embodiments.
Referring to FIG. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102 and 103, such as web browser applications, search applications and instant messaging tools.
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.
The server 105 may be a server that provides various services, for example a background server that supports the pages displayed on the terminal devices 101, 102 and 103.
It should be noted that the data labeling method based on a generative model provided in the embodiments of the present application is generally executed by the server; accordingly, the data labeling apparatus based on a generative model is generally configured in the server.
It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Referring to FIG. 2, FIG. 2 shows a specific implementation of the data labeling method based on a generative model.
It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the sequence shown in FIG. 2. The method includes the following steps:
S1: Acquire the text to be labeled, and split the text to be labeled to obtain split sentences.
Specifically, after obtaining the text to be labeled, the server preprocesses it, for example by data cleaning, and then splits the text to be labeled into paragraphs, sentences and the like according to the delimiters in the text, thereby obtaining the split sentences. The text to be labeled is the text on which data labeling needs to be performed in order to generate labeled text.
S2: Perform word segmentation on the split sentences to obtain target word segments, and merge the target word segments to obtain target phrases.
Specifically, in the above step the text to be labeled has been split into split sentences, which exist in the form of short sentences. To facilitate subsequent data labeling, a preset word segmentation tool is used to segment the split sentences to generate the target word segments; part-of-speech tagging is then performed according to the part of speech of each target word segment, and the target word segments are merged by means of dependency syntax analysis to generate the target phrases.
It should be noted that the preset word segmentation tools include, but are not limited to, the jieba ("stuttering") segmenter, the NLPIR segmentation system and SnowNLP. Preferably, jieba is used to segment the split sentences to obtain the target word segments. Jieba can cut a sentence in the most accurate way, which is suitable for text analysis, and it can also quickly scan out all the character sequences in a sentence that can form valid words, which makes it suitable for the word segmentation of split sentences in this embodiment.
Dependency syntax analysis was first proposed by the French linguist L. Tesniere. It analyzes a sentence into a dependency syntax tree and describes the dependency relations between words, that is, the syntactic collocation relations between words, which are associated with semantics. In the embodiments of the present application, the target word segments are merged by means of dependency syntax analysis.
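As an illustrative sketch only (not part of the original disclosure), the segmentation and part-of-speech tagging described above could be performed with jieba's part-of-speech interface; the example sentence and tag set below are assumptions for demonstration:

```python
# A minimal sketch, assuming jieba (one of the tools named above) is installed.
import jieba.posseg as pseg

def segment_with_pos(split_sentence):
    """Segment one split sentence and attach a part-of-speech tag to each segment."""
    return [(p.word, p.flag) for p in pseg.cut(split_sentence)]

print(segment_with_pos("我吃苹果"))   # e.g. [('我', 'r'), ('吃', 'v'), ('苹果', 'n')]
```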
S3: Acquire a plurality of preset labeling rules, and label the target phrases with each preset labeling rule respectively, to obtain the label sample corresponding to each preset rule.
Specifically, in the embodiments of the present application, after the text to be labeled has been split, segmented and merged, the target phrases are labeled with a plurality of labeling rules; a generative model is then used to determine the labeling accuracy of each rule, and the label sample with the highest accuracy is selected, thereby completing the labeling of the data. The server therefore acquires a plurality of preset labeling rules and labels the target phrases with the corresponding label according to each preset labeling rule, so that each preset rule produces a corresponding label sample for the target phrases.
It should be noted that the plurality of preset labeling rules includes, but is not limited to, regular-expression recognition, recognition by remote matching against a knowledge base, and matching against external data. Regular-expression recognition means that different SQL query statements are preset to match the corresponding labeling rules, so that different rules label the target phrases. Remote matching against a knowledge base means that the target phrases are matched one by one against an external knowledge base to complete the labeling. Matching against external data means that the target phrases are matched against external data provided, for example, by a crowdsourcing platform, to complete the labeling. Preferably, by labeling the target phrases with a variety of different labeling rules, the labeling accuracy achievable under the various approaches can be screened, thereby improving the accuracy of data labeling.
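For illustration, the three kinds of preset labeling rules could each be written as an independent labeling function; the rule bodies, label names, knowledge base and external data below are hypothetical and are not taken from the application:

```python
import re

KNOWLEDGE_BASE = {"苹果": "FRUIT"}          # hypothetical external knowledge base
EXTERNAL_DATA = {"吃苹果": "EATING_EVENT"}  # hypothetical crowdsourced data

def rule_regex(phrase):
    """Regular-expression style rule: label phrases that look like dates."""
    return "DATE" if re.search(r"\d{4}年\d{1,2}月\d{1,2}日", phrase) else None

def rule_knowledge_base(phrase):
    """Remote knowledge-base rule: label a phrase only when it is found in the base."""
    return KNOWLEDGE_BASE.get(phrase)

def rule_external_data(phrase):
    """External-data rule: label by lookup in externally provided data."""
    return EXTERNAL_DATA.get(phrase)

PRESET_RULES = [rule_regex, rule_knowledge_base, rule_external_data]

def label_samples(target_phrases):
    """Apply every preset rule to every target phrase; None means 'not covered'."""
    return {rule.__name__: [rule(p) for p in target_phrases] for rule in PRESET_RULES}
```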
S4: Acquire the sample labeling probability of each preset labeling rule's label sample with respect to the target phrases, and obtain the initial parameters of the generative model according to the sample labeling probabilities and the label samples.
Specifically, the sample labeling probability refers to the coverage of the target phrases by the sample labels obtained with a preset labeling rule; it is subsequently used to iteratively update the parameters of the generative model. Since each preset labeling rule labels different target phrases with a different probability, the sample labeling probability corresponding to each preset labeling rule must be obtained first. The server also initializes the sample labeling probabilities and the label samples to obtain initial estimated parameters of the generative model, that is, the initial parameters of the generative model.
A generative model is a model that can randomly generate observed data, especially under given hidden parameters; it assigns a joint probability distribution to the observations and the labeled data sequences. In the embodiments of the present application, the hidden parameters correspond to the true labels of the target phrases, the observations correspond to the sample labeling probabilities, and the labeled data sequences correspond to the label samples. Based on the hidden parameters, that is, the true data labels, a model that randomly generates the observed data can judge the probability with which each preset labeling rule labels the target phrases.
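The application does not specify the functional form of this joint distribution. One common way to model k noisy labeling rules together with a latent true label, stated here purely as an assumption for orientation and not as the patent's own formulation, is:

```latex
% Joint distribution over the latent true label y and the outputs
% \lambda_1,\dots,\lambda_k of k labeling rules; \theta_j reflects the
% reliability of rule j. This form is an illustrative assumption only.
P_\theta(y, \lambda_1, \dots, \lambda_k)
  \propto \exp\!\Big(\sum_{j=1}^{k} \theta_j \,\mathbf{1}\{\lambda_j = y\}\Big)
```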
S5: Iteratively update the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model, and output, through the trained generative model, the labeling accuracy corresponding to each label sample.
Specifically, the initial parameters of the generative model are fitted to the sample labeling probabilities: using stochastic gradient descent, the sample labeling probabilities are back-propagated to iteratively update the initial parameters, so that the parameters of the generative model continuously approach the sample labeling probabilities, thereby obtaining a trained generative model. The parameters of the trained generative model are used to estimate the probabilities of the label samples and, after weighted averaging, the labeling accuracy of the label sample under each preset rule is obtained.
Here, iterative updating means fitting the initial parameters of the generative model to the sample labeling probabilities and, using stochastic gradient descent, back-propagating the sample labeling probabilities to iteratively recompute the initial parameters, so that the parameters of the generative model continuously approach the sample labeling probabilities.
S6: Select the label sample with the highest labeling accuracy as the target label sample.
Specifically, the labeling accuracy of the label sample under each preset labeling rule has been obtained in the above steps, so the label sample with the highest labeling accuracy is selected as the target label sample. In this way, multiple labeling rules are tried on the target phrases and the label sample with the highest accuracy among them is chosen, which helps to improve the accuracy of data labeling.
In this embodiment, the acquired text to be labeled is split, segmented and merged to obtain the target phrases, which facilitates the subsequent labeling of the text to be labeled phrase by phrase. A plurality of preset labeling rules is then acquired, and the target phrases are labeled with each preset labeling rule to obtain the label sample corresponding to each preset rule. The sample labeling probability of each preset labeling rule's label sample with respect to the target phrases is acquired, and the initial parameters of the generative model are obtained according to the sample labeling probabilities and the label samples. The initial parameters of the generative model are then iteratively updated according to the sample labeling probabilities to obtain a trained generative model, which outputs the labeling accuracy corresponding to each label sample, and the label sample with the highest labeling accuracy is selected as the target label sample. Data is thus labeled according to a plurality of preset rules, and the label sample with the highest labeling accuracy is selected according to the generative model, which helps to improve the accuracy of data labeling.
Referring to FIG. 3, FIG. 3 shows a specific implementation of step S4, namely acquiring the sample labeling probability of each preset labeling rule's label sample with respect to the target phrases and obtaining the initial parameters of the generative model according to the sample labeling probabilities and the label samples. The detailed process is as follows:
S41: Calculate the coverage of the target phrases by the label sample corresponding to each preset labeling rule, and take the coverage as the sample labeling probability.
Specifically, in order to subsequently train the generative model so that the sample labeling probabilities approach the parameters of the generative model, the sample labeling probabilities must be obtained first. The coverage of the target phrases by the label sample corresponding to each preset labeling rule is therefore calculated, and the coverage is taken as the sample labeling probability. In the embodiments of the present application, the coverage is obtained by calculating the degree to which the label sample covers the target phrases.
In a specific embodiment, when the target phrases are labeled by recognition through remote matching against a knowledge base, some target word segments in a target phrase may not be matched one by one against the external knowledge base; such segments cannot be labeled in this way and the target phrase fails to be labeled. If all target word segments in a target phrase can be matched against the external knowledge base, the target phrase is labeled successfully. Dividing the number of successfully labeled target phrases by the total number of target phrases gives the coverage of the target phrases by the remote knowledge-base matching approach, and this coverage is taken as the sample labeling probability. For example, if 9,000 target phrases are labeled successfully out of a total of 10,000 target phrases, the sample labeling probability is 90%.
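Continuing the illustrative sketch above, the coverage used as the sample labeling probability in S41 is simply the fraction of target phrases that a rule managed to label; a hedged sketch:

```python
def sample_labeling_probability(rule_labels):
    """Coverage of one rule: labeled phrases / total phrases (None means unlabeled)."""
    labeled = sum(1 for label in rule_labels if label is not None)
    return labeled / len(rule_labels) if rule_labels else 0.0

# e.g. 9,000 successfully labeled phrases out of 10,000 gives 0.9, i.e. 90%
```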
S42: Initialize the sample labeling probabilities and the label samples to obtain the initial parameters of the generative model.
Specifically, initialization means assigning estimated parameter values to the initial parameters of the generative model according to the sample labeling probabilities and the label samples, thereby obtaining the initial parameters of the generative model.
In this implementation, the coverage of the target phrases by the label sample corresponding to each preset labeling rule is calculated and taken as the sample labeling probability, and the sample labeling probabilities and label samples are then initialized to obtain the initial parameters of the generative model. Obtaining the sample labeling probabilities and the initial parameters in this way facilitates the subsequent training of the generative model and thus helps to improve the accuracy of data labeling.
Referring to FIG. 4, FIG. 4 shows a specific implementation of step S5, namely iteratively updating the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model and outputting, through the trained generative model, the labeling accuracy corresponding to each label sample. The detailed process is as follows:
S51: Take the difference between the parameters of the generative model and the sample labeling probability as the optimization feature value.
Specifically, in the embodiments of the present application the parameters of the generative model are iteratively updated so that they continuously approach the sample labeling probabilities. The difference between the parameters of the generative model and the sample labeling probability is therefore taken as the optimization feature value, and the training progress of the generative model is judged by evaluating this optimization feature value.
Specifically, once the amount of data reaches a certain scale, a generative model trained on target phrases labeled with a plurality of preset labeling rules estimates the true labels of the target phrases better than random guessing of the sample labels. Because the parameters of the generative model are used to estimate the accuracy of the label samples, and the sample labeling probability is calculated as the coverage of the total number of target phrases by the number of successfully labeled target phrases, the closer the parameters of the generative model are to the sample labeling probability, that is, the smaller the optimization feature value, the closer the generative model is to the completion of training. For example, if the initial parameter of the generative model is 0.4 and the sample labeling probability is 0.92, the optimization feature value is 0.52. As the iterative updating proceeds, the optimization feature value gradually becomes smaller; when it reaches 0.01 the parameter of the generative model is already very close to the sample labeling probability, and the iterative updating ends.
S52: Back-propagate the sample labeling probabilities by means of stochastic gradient descent to iteratively update the initial parameters, where each iterative update produces new parameters of the generative model and changes the optimization feature value.
Specifically, the sample labeling probabilities are back-propagated by means of stochastic gradient descent to iteratively update the initial parameters. Each update produces a new parameter of the generative model, and computing the difference between the new parameter and the sample labeling probability gives a new optimization feature value. Since the optimization feature value is calculated as the difference between the parameters of the generative model and the sample labeling probability, and the parameters of the generative model change after each iterative update, every iterative update changes the optimization feature value.
Gradient descent is an iterative method that can be used to solve least-squares problems and is one of the most commonly used methods for solving for the model parameters of a machine learning algorithm, that is, for unconstrained optimization problems. When solving for the minimum of a loss function, gradient descent iterates step by step to obtain the minimized loss function and the model parameter values; conversely, if the maximum of the loss function is required, gradient ascent is used. In machine learning, two variants have been developed on the basis of basic gradient descent: stochastic gradient descent and batch gradient descent. In the embodiments of the present application, stochastic gradient descent is used to back-propagate the sample labeling probabilities and iteratively update the initial parameters.
The back-propagation algorithm is a learning algorithm suitable for multi-layer neural networks and is based on gradient descent. The input-output relation of a back-propagation network is essentially a mapping: a back-propagation neural network with n inputs and m outputs performs a continuous mapping from n-dimensional Euclidean space to a finite field in m-dimensional Euclidean space, and this mapping is highly nonlinear.
In a specific embodiment, the sample labeling probability is input to the input layer of the neural network, passes through the hidden layers, and finally reaches the output layer, which outputs the result; this is the forward propagation process. Since there is an error between the output of the neural network and the actual result, the error between the estimated value and the actual value, that is, the optimization feature value, is calculated and back-propagated from the output layer to the hidden layers and on to the input layer. During back-propagation, the value associated with the sample labeling probability is adjusted by stochastic descent on the optimization feature value, so that the optimization feature value decreases. The above steps are iterated until the optimization feature value reaches the preset threshold.
S53: When the optimization feature value reaches the preset threshold, stop the iterative updating to obtain the trained generative model.
Specifically, when the optimization feature value reaches the preset threshold, the parameters of the generative model are very close to the sample labeling probability; at this point the updating of the parameters of the generative model is stopped, thereby obtaining the trained generative model.
It should be noted that the preset threshold is set according to the actual situation and is not limited here. In a specific embodiment, the preset threshold is 0.01.
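A minimal sketch of the iterative updating described in S51 to S53, assuming for illustration that a single per-rule parameter is driven toward that rule's sample labeling probability by gradient descent on a squared gap; the learning rate and loss form are assumptions, while the 0.01 stopping threshold follows the embodiment above:

```python
def train_generative_parameter(initial_param, sample_prob,
                               learning_rate=0.1, threshold=0.01):
    """Drive one model parameter toward the sample labeling probability.

    The optimization feature value is the gap between the current parameter
    and the sample labeling probability; updating stops once the gap falls
    below the preset threshold (0.01 in the described embodiment).
    """
    param = initial_param
    while abs(param - sample_prob) > threshold:      # optimization feature value
        gradient = 2 * (param - sample_prob)         # gradient of (param - prob)^2
        param -= learning_rate * gradient            # gradient-descent step
    return param

# e.g. an initial parameter of 0.4 and a sample labeling probability of 0.92
trained_param = train_generative_parameter(0.4, 0.92)
```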
S54: Output, through the trained generative model, the labeling accuracy corresponding to each label sample.
Specifically, the trained generative model has been obtained in the above steps; the probability of each label sample is then estimated with the trained generative model, and the labeling accuracy corresponding to each label sample is output.
In this embodiment, the difference between the parameters of the generative model and the sample labeling probability is taken as the optimization feature value; the sample labeling probabilities are back-propagated by means of stochastic gradient descent to iteratively update the initial parameters; when the optimization feature value reaches the preset threshold the iterative updating stops, giving the trained generative model; and the labeling accuracy corresponding to each label sample is output through the trained generative model. The generative model is thus trained and the labeling accuracy corresponding to each label sample is output, which helps to improve the accuracy of data labeling.
Referring to FIG. 5, FIG. 5 shows a specific implementation of step S54, namely outputting, through the trained generative model, the labeling accuracy corresponding to each label sample. The detailed process is as follows:
S541: Perform probability estimation on the label samples with the current parameters of the trained generative model to obtain basic probabilities.
Specifically, probability estimation is performed on the label samples with the current parameters to obtain the basic probabilities, which are then further processed to obtain the final labeling accuracy. The current parameters are the parameters of the generative model obtained by iterative updating at the point where the optimization feature value reaches the preset threshold.
Specifically, a generative model is a model that can randomly generate observed data, especially under given hidden parameters, and it assigns a joint probability distribution to the observations and the labeled data sequences. In the embodiments of the present application, the hidden parameters correspond to the true labels of the target phrases, the observations correspond to the sample labeling probabilities, and the labeled data sequences correspond to the label samples. Based on the hidden parameters, that is, the true data labels, the model composed of its current parameters randomly generates the observed data and can estimate, for each preset labeling rule, the probability of its label sample, thereby giving the basic probability.
S542: Perform weighted averaging on the basic probabilities to obtain the labeling accuracy corresponding to each label sample.
Specifically, weighted averaging of the basic probabilities makes the labeling accuracy more precise.
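As an illustrative sketch only (the application does not state how the weights are chosen), the weighted averaging of S542 and the selection of S6 might look as follows; the weights and rule names are hypothetical:

```python
def labeling_accuracy(basic_probabilities, weights=None):
    """Weighted average of the basic probabilities estimated for one label sample."""
    if weights is None:
        weights = [1.0] * len(basic_probabilities)   # unweighted average by default
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(basic_probabilities, weights)) / total_weight

def select_target_label_sample(accuracy_by_rule):
    """S6: pick the label sample (rule) with the highest labeling accuracy."""
    return max(accuracy_by_rule, key=accuracy_by_rule.get)

# e.g. {"rule_regex": 0.81, "rule_knowledge_base": 0.93, "rule_external_data": 0.77}
```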
In this embodiment, probability estimation is performed on the label samples with the current parameters of the trained generative model to obtain the basic probabilities, and weighted averaging is performed on the basic probabilities to obtain the labeling accuracy corresponding to each label sample, making the labeling accuracy more precise and thus helping to improve the accuracy of data labeling.
Referring to FIG. 6, FIG. 6 shows a specific implementation of step S1, namely acquiring the text to be labeled and splitting it to obtain the split sentences. The detailed process is as follows:
S11: Acquire the text to be labeled and preprocess it to obtain the basic text.
Specifically, the preprocessing includes data cleaning of the text to be labeled. Data cleaning is the process of re-examining and verifying data in order to delete duplicate information, correct existing errors and provide data consistency.
S12: Obtain the text delimiters contained in the basic text by means of regular-expression matching.
S13: Split the basic text by the text delimiters to obtain the split sentences.
Specifically, regular-expression matching is used to obtain the text delimiters contained in the basic text, which are used in the subsequent step to split the text.
Optionally, the text delimiters include format delimiters and punctuation delimiters.
A format delimiter splits text according to the text encoding type or the structure of the text; format delimiters allow the basic text to be split according to its encoding type or structure.
A punctuation delimiter splits text according to punctuation marks; punctuation delimiters allow the basic text to be split quickly.
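A hedged sketch of the delimiter matching and splitting in S12 and S13; the delimiter set below is an assumption chosen for illustration:

```python
import re

# Hypothetical punctuation delimiters; the application does not fix the exact set.
PUNCTUATION_DELIMITERS = r"[。！？；\n]"

def split_into_sentences(basic_text):
    """S13: split the cleaned basic text on the matched delimiters."""
    parts = re.split(PUNCTUATION_DELIMITERS, basic_text)
    return [part.strip() for part in parts if part.strip()]

print(split_into_sentences("今天天气很好。我吃苹果！"))
```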
In this embodiment, the text to be labeled is acquired and preprocessed to obtain the basic text, the text delimiters contained in the basic text are obtained by regular-expression matching, and the basic text is split by the text delimiters to obtain the split sentences, which facilitates the subsequent generation of the target phrases and their labeling with the corresponding labels.
Referring to FIG. 7, FIG. 7 shows a specific implementation following step S6. This embodiment includes:
S61: Acquire the storage path of the text to be labeled as the target storage path;
S62: Map the target label sample into the target storage path by means of a preset data mapping method.
Specifically, for data traceability and to make it easy to query the target label sample corresponding to the text to be labeled, the target label sample and the file to be labeled are stored in the same path.
The preset data mapping methods include, but are not limited to, hand-coded mapping and graphical (visual) mapping. Hand-coded mapping defines the data correspondence directly in a programming language such as XSLT, JAVA or C++; graphical mapping usually allows the user to draw a line between data items to define the correspondence between them. In a specific embodiment, the target label sample is mapped into the target storage path through a graphical mapping operation.
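A minimal hand-coded illustration of S61 and S62, assuming for simplicity that the mapping writes the target label sample as a JSON file into the same path as the text to be labeled; the file naming and format are assumptions:

```python
import json
from pathlib import Path

def map_labels_to_storage_path(text_path, target_label_sample):
    """Store the target label sample in the same path as the text to be labeled."""
    target_storage_path = Path(text_path).parent           # S61: target storage path
    output_file = target_storage_path / (Path(text_path).stem + "_labels.json")
    output_file.write_text(json.dumps(target_label_sample, ensure_ascii=False),
                           encoding="utf-8")               # S62: mapping completed
    return output_file
```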
Referring to FIG. 8, FIG. 8 shows the specific implementation of merging the target word segments to obtain the target phrases. The detailed process is as follows:
S2A: Perform part-of-speech tagging on the target word segments to obtain part-of-speech tagged segments.
Part-of-speech tagging, also known as grammatical tagging or word-class disambiguation, is a text data processing technique in corpus linguistics that labels the part of speech of each word in a corpus according to its meaning and context. Part-of-speech tagging can be performed manually or by a specific algorithm, and implementing it with machine learning methods is a research topic in natural language processing. Common part-of-speech tagging algorithms include hidden Markov models and conditional random fields. In the embodiments of the present application, part-of-speech tagging is performed on the target word segments to obtain the part-of-speech tagged segments.
S2B: Merge the part-of-speech tagged segments that satisfy the consistency rule by means of dependency syntax analysis to obtain the target phrases.
The consistency rule uses the subject-verb-object (SBV) relation and marks the corresponding words. For example, "我吃苹果" ("I eat apples") is tagged as (我, Subject), (吃, Predicate), (苹果, Object); the extracted part-of-speech tagged segments are mapped onto these syntactic roles, and the segments that satisfy the consistency rule are merged to obtain the target phrase.
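A hedged sketch of the SBV-based merging; the parse shown is the example above, hard-coded here in place of the output of a real dependency parser:

```python
# Hypothetical dependency-parse output for "我吃苹果": (word, syntactic role).
parsed_segments = [("我", "Subject"), ("吃", "Predicate"), ("苹果", "Object")]

def merge_sbv(segments):
    """Merge word segments that form a complete Subject-Predicate-Object pattern."""
    roles = {role: word for word, role in segments}
    if {"Subject", "Predicate", "Object"} <= roles.keys():  # consistency rule satisfied
        return roles["Subject"] + roles["Predicate"] + roles["Object"]
    return None                                             # incomplete pattern: no merge

print(merge_sbv(parsed_segments))   # -> "我吃苹果"
```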
本实施例中,通过词性标注的方式,将目标分词进行词性标注,得到词性分词,并根 据依存句法分析的方式,将符合一致性规则的词性分词进行合并,得到目标短语,实现对目标分词进行合并,便于后续进行数据标注。In this embodiment, the part-of-speech tagging is used for the target word segmentation to obtain the part-of-speech segmentation, and according to the method of dependency syntax analysis, the part-of-speech segmentation that conforms to the consistency rule is combined to obtain the target phrase, and the target word segmentation is realized. Combined to facilitate subsequent data annotation.
需要强调的是,为进一步保证上述待标注文本的私密和安全性,上述待标注文本还可以存储于一区块链的节点中。It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned text to be marked, the above-mentioned text to be marked may also be stored in a node of a blockchain.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机的计算机可读指令来指令相关的硬件来完成，该计算机的计算机可读指令可存储于一计算机可读取存储介质中，该计算机可读指令在执行时，可包括如上述各方法的实施例的流程。其中，前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质，或随机存储记忆体(Random Access Memory,RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through computer-readable instructions, and these computer-readable instructions can be stored in a computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the embodiments of the foregoing methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or a random access memory (RAM), or the like.
请参考图9,作为对上述图2所示方法的实现,本申请提供了一种基于生成模型的数据标注装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。Please refer to FIG. 9. As an implementation of the method shown in FIG. 2, the present application provides an embodiment of a data labeling device based on a generative model. The device embodiment corresponds to the method embodiment shown in FIG. 2. Specifically, the device can be applied to various electronic devices.
如图9所示，本实施例的基于生成模型的数据标注装置包括：待标签文本拆分模块71、目标短语获取模块72、标签样本生成模块73、初始参数生成模块74、标注准确率输出模块75及标签样本选取模块76，其中：As shown in FIG. 9, the generative-model-based data labeling device of this embodiment includes: a to-be-labeled text splitting module 71, a target phrase acquisition module 72, a label sample generation module 73, an initial parameter generation module 74, a labeling accuracy output module 75, and a label sample selection module 76, wherein:
待标签文本拆分模块71,用于获取待标注文本,并对待标注文本进行拆分,得到拆分语句;The to-be-labeled text splitting module 71 is used to obtain the to-be-labeled text, and to split the to-be-labeled text to obtain a split statement;
目标短语获取模块72,用于通过对拆分语句进行分词处理,得到目标分词,并对目标分词进行合并,得到目标短语;The target phrase acquisition module 72 is used to obtain the target word segmentation by performing word segmentation processing on the split sentence, and merge the target word segmentation to obtain the target phrase;
标签样本生成模块73,用于获取多种预设标注规则,并通过多种预设标注规则分别对目标短语进行标注,得到每一种预设规则对应的标签样本;The label sample generation module 73 is configured to obtain multiple preset labeling rules, and respectively label the target phrase through the multiple preset labeling rules to obtain label samples corresponding to each preset rule;
初始参数生成模块74,用于获取每一种预设标注规则对应的标签样本对目标短语的样本标注概率,并根据样本标注概率和标签样本,得到生成模型的初始参数;The initial parameter generation module 74 is used to obtain the sample labeling probability of the target phrase by the label sample corresponding to each preset labeling rule, and obtain the initial parameters of the generation model according to the sample labeling probability and the label sample;
标注准确率输出模块75,用于通过样本标注概率对生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过训练好的生成模型输出标签样本对应的标注准确率;The labeling accuracy rate output module 75 is configured to iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy rate corresponding to the label sample through the trained generative model;
标签样本选取模块76,用于选取标注准确率最高的标签样本,作为目标标签样本。The label sample selection module 76 is configured to select the label sample with the highest label accuracy rate as the target label sample.
进一步的,初始参数生成模块74包括:Further, the initial parameter generation module 74 includes:
样本标注概率获取单元,用于计算每一种预设标注规则对应的标签样本对目标短语的覆盖率,并将覆盖率作为样本标注概率;The sample labeling probability obtaining unit is used to calculate the coverage rate of the target phrase by the label sample corresponding to each preset labeling rule, and use the coverage rate as the sample labeling probability;
初始化处理单元,用于将样本标签概率和标签样本进行初始化处理,得到生成模型的初始参数。The initialization processing unit is used to initialize the sample label probability and the label sample to obtain the initial parameters of the generative model.
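The coverage-rate calculation performed by the sample labeling probability obtaining unit above can be sketched as follows. Treating coverage as the fraction of target phrases for which a rule's label sample provides a non-empty label is an assumption; the patent does not spell out the exact formula.

```python
def sample_labeling_probability(label_sample: dict, target_phrases: list) -> float:
    """Coverage of one labeling rule's label sample over the target phrases,
    used as that rule's sample labeling probability."""
    if not target_phrases:
        return 0.0
    labeled = sum(1 for phrase in target_phrases if label_sample.get(phrase))
    return labeled / len(target_phrases)

sample = {"我吃苹果": "饮食", "今天天气很好": ""}
print(sample_labeling_probability(sample, ["我吃苹果", "今天天气很好"]))   # 0.5
```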
进一步的,标注准确率输出模块75包括:Further, the labeling accuracy output module 75 includes:
优化特征值定义单元，用于将生成模型的参数和样本标注概率的差值作为优化特征值;The optimization feature value definition unit is configured to take the difference between the parameters of the generative model and the sample labeling probability as the optimization feature value;
迭代更新进行单元，用于采用随机梯度下降的方式，将样本标注概率进行反向传播，以对初始参数进行迭代更新，其中，每次迭代更新都得到生成模型新的参数和优化特征值发生改变;The iterative update unit is configured to back-propagate the sample labeling probability by stochastic gradient descent so as to iteratively update the initial parameters, wherein each iterative update yields new parameters of the generative model and a changed optimization feature value;
迭代更新停止单元,用于当优化特征值达到预设阈值时,停止迭代更新,得到训练好的生成模型;The iterative update stop unit is used to stop the iterative update when the optimized feature value reaches a preset threshold to obtain a trained generative model;
标注准确率获取单元,用于通过训练好的生成模型输出标签样本对应的标注准确率。The labeling accuracy rate obtaining unit is used to output the labeling accuracy rate corresponding to the label sample through the trained generation model.
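The behaviour of the optimization feature value definition unit, the iterative update unit, and the stop unit above can be sketched together as follows. The squared-error gradient step, the mean-absolute-difference form of the optimization feature value, and reading "reaches the preset threshold" as falling below it are assumptions; the patent only states that the difference between the model parameters and the sample labeling probability serves as the optimization feature value and that stochastic gradient descent with back-propagation is used.

```python
import numpy as np

def train_generative_model(initial_params, sample_labeling_probs,
                           learning_rate=0.1, threshold=1e-3, max_iters=1000):
    """Iteratively update the initial parameters with the sample labeling probabilities,
    stopping once the optimization feature value reaches the preset threshold."""
    params = np.asarray(initial_params, dtype=float).copy()
    targets = np.asarray(sample_labeling_probs, dtype=float)
    for _ in range(max_iters):
        feature_value = np.mean(np.abs(params - targets))    # optimization feature value
        if feature_value <= threshold:
            break
        params -= learning_rate * (params - targets)          # assumed squared-error gradient step
    return params

print(train_generative_model([0.2, 0.9], [0.6, 0.7]))
```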
进一步的,标注准确率获取单元包括:Further, the labeling accuracy obtaining unit includes:
基础概率获取子单元,用于通过训练好的生成模型的当前参数对标签样本进行概率估计,得到基础概率;The basic probability acquisition subunit is used to estimate the probability of the label sample through the current parameters of the trained generative model to obtain the basic probability;
基础概率处理子单元,用于对基础概率进行加权平均处理,得到标签样本对应的标注准确率。The basic probability processing sub-unit is used to perform weighted average processing on the basic probability to obtain the labeling accuracy corresponding to the label sample.
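The two sub-units above can be sketched as a single helper: the basic probabilities estimated by the trained model for one label sample are combined into a labeling accuracy by a weighted average. Uniform weights are assumed because the patent does not specify the weighting scheme.

```python
import numpy as np

def labeling_accuracy(basic_probabilities, weights=None):
    """Weighted average of the per-phrase basic probabilities for one label sample."""
    probs = np.asarray(basic_probabilities, dtype=float)
    if weights is None:
        weights = np.ones_like(probs)          # uniform weights assumed
    return float(np.average(probs, weights=weights))

print(labeling_accuracy([0.8, 0.6, 0.9]))      # ≈ 0.7667
```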
进一步的,待标签文本拆分模块71包括:Further, the to-be-labeled text splitting module 71 includes:
基础文本生成单元，用于获取待标注文本，并对待标注文本进行预处理，得到基础文本;The basic text generation unit is used to obtain the text to be marked, and preprocess the text to be marked to obtain the basic text;
文本分隔符获取单元,用于采用正则匹配的方式,获取基础文本中包含的文本分隔符;The text separator obtaining unit is used to obtain the text separator contained in the basic text by means of regular matching;
拆分语句生成单元,用于通过文本分隔符对基础文本进行拆分,得到拆分语句。The split statement generation unit is used to split the basic text by the text separator to obtain the split statement.
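The three units of the text splitting module can be sketched as one function. The whitespace normalization used as preprocessing and the particular delimiter pattern are assumptions; the patent only states that the separators are found by regular matching.

```python
import re

def split_text_to_label(text_to_label: str):
    """Preprocess the text to be labeled and split it into sentences on matched separators."""
    basic_text = re.sub(r"\s+", " ", text_to_label).strip()   # simple preprocessing / cleaning
    parts = re.split(r"[。！？!?；;]", basic_text)              # separators found by regular matching
    return [part.strip() for part in parts if part.strip()]

print(split_text_to_label("我吃苹果。今天天气很好！"))          # ['我吃苹果', '今天天气很好']
```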
进一步的,在标签样本选取模块76之后,该基于生成模型的数据标注装置还包括:Further, after the label sample selection module 76, the data labeling device based on the generative model also includes:
目标存储路径获取模块,用于获取待标注文本的存储路径,作为目标存储路径;The target storage path obtaining module is used to obtain the storage path of the text to be marked as the target storage path;
数据映射模块,用于通过预设的数据映射方式,将目标标签样本映射到目标存储路径之中。The data mapping module is used to map the target label samples to the target storage path through a preset data mapping method.
进一步的,目标短语获取模块72还包括:Further, the target phrase acquisition module 72 also includes:
词性分词生成单元,用于通过词性标注的方式,将目标分词进行词性标注,得到词性分词;The part-of-speech and word-segmentation generating unit is used to perform part-of-speech tagging on the target word by means of part-of-speech tagging to obtain part-of-speech segmentation;
目标短语生成单元,用于根据依存句法分析的方式,将符合一致性规则的词性分词进行合并,得到目标短语。The target phrase generation unit is used to combine the part-of-speech segmentations that conform to the consistency rules according to the method of dependency syntax analysis to obtain the target phrase.
需要强调的是,为进一步保证上述待标注文本的私密和安全性,上述待标注文本还可以存储于一区块链的节点中。It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned text to be marked, the above-mentioned text to be marked may also be stored in a node of a blockchain.
为解决上述技术问题,本申请实施例还提供计算机设备。具体请参阅图10,图10为本实施例计算机设备基本结构框图。To solve the above technical problems, the embodiments of the present application also provide computer equipment. Please refer to FIG. 10 for details. FIG. 10 is a block diagram of the basic structure of a computer device according to this embodiment.
计算机设备8包括通过系统总线相互通信连接存储器81、处理器82、网络接口83。需要指出的是，图中仅示出了具有三种组件存储器81、处理器82、网络接口83的计算机设备8，但是应理解的是，并不要求实施所有示出的组件，可以替代的实施更多或者更少的组件。其中，本技术领域技术人员可以理解，这里的计算机设备是一种能够按照事先设定或存储的指令，自动进行数值计算和/或信息处理的设备，其硬件包括但不限于微处理器、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程门阵列(Field-Programmable Gate Array,FPGA)、数字处理器(Digital Signal Processor,DSP)、嵌入式设备等。The computer device 8 includes a memory 81, a processor 82, and a network interface 83 that communicate with one another through a system bus. It should be pointed out that the figure only shows the computer device 8 with these three components, the memory 81, the processor 82, and the network interface 83, but it should be understood that not all of the illustrated components are required to be implemented; more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
计算机设备,包括存储器和处理器,存储器中存储有计算机可读指令,处理器执行计算机可读指令时实现如下步骤:A computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
获取待标注文本,并对待标注文本进行拆分,得到拆分语句;Obtain the text to be marked, and split the text to be marked to obtain a split statement;
通过对拆分语句进行分词处理,得到目标分词,并对目标分词进行合并,得到目标短语;By segmenting the split sentences, the target word segmentation is obtained, and the target word segmentation is merged to obtain the target phrase;
获取多种预设标注规则,并通过多种预设标注规则分别对目标短语进行标注,得到每一种预设规则对应的标签样本;Obtaining a variety of preset labeling rules, and labeling the target phrase respectively through the multiple preset labeling rules, to obtain a label sample corresponding to each preset rule;
获取每一种预设标注规则对应的标签样本对目标短语的样本标注概率,并根据样本标注概率和标签样本,得到生成模型的初始参数;Obtain the sample labeling probability of the target phrase by the label sample corresponding to each preset labeling rule, and obtain the initial parameters of the generation model according to the sample labeling probability and label sample;
通过样本标注概率对生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过训练好的生成模型输出标签样本对应的标注准确率;Iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy corresponding to the label sample through the trained generative model;
选取标注准确率最高的标签样本,作为目标标签样本。Select the label sample with the highest label accuracy as the target label sample.
计算机设备可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。计算机设备可以与用户通过键盘、鼠标、遥控器、触摸板或声控设备等方式进行人机交互。The computer device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer device can interact with the user through a keyboard, a mouse, a remote control, a touchpad, or a voice-activated device.
存储器81至少包括一种类型的可读存储介质，所述计算机可读存储介质可以是非易失性，也可以是易失性，可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如，SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中，存储器81可以是计算机设备8的内部存储单元，例如该计算机设备8的硬盘或内存。在另一些实施例中，存储器81也可以是计算机设备8的外部存储设备，例如该计算机设备8上配备的插接式硬盘，智能存储卡(Smart Media Card,SMC)，安全数字(Secure Digital,SD)卡，闪存卡(Flash Card)等。当然，存储器81还可以既包括计算机设备8的内部存储单元也包括其外部存储设备。本实施例中，存储器81通常用于存储安装于计算机设备8的操作系统和各类应用软件，例如基于生成模型的数据标注方法的计算机可读指令等。此外，存储器81还可以用于暂时地存储已经输出或者将要输出的各类数据。The memory 81 includes at least one type of readable storage medium. The computer-readable storage medium may be non-volatile or volatile, and includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as the hard disk or internal memory of the computer device 8. In other embodiments, the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 8. Of course, the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device. In this embodiment, the memory 81 is generally used to store the operating system and various application software installed on the computer device 8, such as the computer-readable instructions of the generative-model-based data labeling method. In addition, the memory 81 can also be used to temporarily store various types of data that have been output or are to be output.
处理器82在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器82通常用于控制计算机设备8的总体操作。本实施例中，处理器82用于运行存储器81中存储的计算机可读指令或者处理数据，例如运行上述基于生成模型的数据标注方法的计算机可读指令，以实现基于生成模型的数据标注方法的各种实施例。In some embodiments, the processor 82 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 82 is typically used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to run the computer-readable instructions stored in the memory 81 or to process data, for example to run the computer-readable instructions of the above generative-model-based data labeling method, so as to implement the various embodiments of the generative-model-based data labeling method.
网络接口83可包括无线网络接口或有线网络接口,该网络接口83通常用于在计算机设备8与其他电子设备之间建立通信连接。The network interface 83 may comprise a wireless network interface or a wired network interface, and the network interface 83 is typically used to establish a communication connection between the computer device 8 and other electronic devices.
本申请还提供了另一种实施方式，即提供一种计算机可读存储介质，计算机可读存储介质存储有计算机的计算机可读指令，计算机的计算机可读指令可被至少一个处理器执行，以使至少一个处理器执行以下步骤：The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor, so as to cause the at least one processor to perform the following steps:
获取待标注文本,并对待标注文本进行拆分,得到拆分语句;Obtain the text to be marked, and split the text to be marked to obtain a split statement;
通过对拆分语句进行分词处理,得到目标分词,并对目标分词进行合并,得到目标短语;By segmenting the split sentences, the target word segmentation is obtained, and the target word segmentation is merged to obtain the target phrase;
获取多种预设标注规则,并通过多种预设标注规则分别对目标短语进行标注,得到每一种预设规则对应的标签样本;Obtaining a variety of preset labeling rules, and labeling the target phrase respectively through the multiple preset labeling rules, to obtain a label sample corresponding to each preset rule;
获取每一种预设标注规则对应的标签样本对目标短语的样本标注概率,并根据样本标注概率和标签样本,得到生成模型的初始参数;Obtain the sample labeling probability of the target phrase by the label sample corresponding to each preset labeling rule, and obtain the initial parameters of the generation model according to the sample labeling probability and label sample;
通过样本标注概率对生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过训练好的生成模型输出标签样本对应的标注准确率;Iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy corresponding to the label sample through the trained generative model;
选取标注准确率最高的标签样本,作为目标标签样本。Select the label sample with the highest label accuracy as the target label sample.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机，计算机，服务器，空调器，或者网络设备等)执行本申请各个实施例的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods of the various embodiments of the present application.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain)，本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database, namely a chain of data blocks generated in association with one another using cryptographic methods; each data block contains information on a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
显然，以上所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例，附图中给出了本申请的较佳实施例，但并不限制本申请的专利范围。本申请可以以许多不同的形式来实现，相反地，提供这些实施例的目的是使对本申请的公开内容的理解更加透彻全面。尽管参照前述实施例对本申请进行了详细的说明，对于本领域的技术人员而言，其依然可以对前述各具体实施方式所记载的技术方案进行修改，或者对其中部分技术特征进行等效替换。凡是利用本申请说明书及附图内容所做的等效结构，直接或间接运用在其他相关的技术领域，均同理在本申请专利保护范围之内。Obviously, the embodiments described above are only some of the embodiments of the present application, rather than all of them; the accompanying drawings show preferred embodiments of the present application but do not limit the scope of the patent. The present application may be embodied in many different forms; rather, these embodiments are provided so that the understanding of the disclosure of the present application is thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of the technical features. Any equivalent structure made by using the contents of the description and drawings of the present application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.
Claims (20)
- 一种基于生成模型的数据标注方法,包括:A data labeling method based on a generative model, comprising:获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句;Obtain the text to be marked, and split the text to be marked to obtain a split statement;通过对所述拆分语句进行分词处理,得到目标分词,并对所述目标分词进行合并,得到目标短语;By performing word segmentation processing on the split sentence, a target word segmentation is obtained, and the target word segmentation is combined to obtain a target phrase;获取多种预设标注规则,并通过多种所述预设标注规则分别对所述目标短语进行标注,得到每一种所述预设规则对应的标签样本;Obtaining a plurality of preset labeling rules, and labeling the target phrase respectively through the plurality of the predefined labeling rules, to obtain a label sample corresponding to each of the predefined rules;获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数;Obtain the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and obtain the initial parameters of the generation model according to the sample labeling probability and the label sample;通过所述样本标注概率对所述生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过所述训练好的生成模型输出所述标签样本对应的标注准确率;Iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy rate corresponding to the label sample through the trained generative model;选取所述标注准确率最高的所述标签样本,作为目标标签样本。The label sample with the highest labeling accuracy is selected as the target label sample.
- 根据权利要求1所述的基于生成模型的数据标注方法,其中,所述获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数包括:The data labeling method based on a generative model according to claim 1, wherein the obtaining a label sample corresponding to each of the preset labeling rules is a sample labeling probability of the target phrase, and according to the sample labeling probability and the label samples, the initial parameters of the generated model include:计算每一种所述预设标注规则对应的标签样本对所述目标短语的覆盖率,并将所述覆盖率作为所述样本标注概率;Calculate the coverage rate of each label sample corresponding to the preset labeling rule to the target phrase, and use the coverage rate as the sample labeling probability;将所述样本标签概率和所述标签样本进行初始化处理,得到所述生成模型的初始参数。The sample label probability and the label sample are initialized to obtain the initial parameters of the generative model.
- 根据权利要求1所述的基于生成模型的数据标注方法，其中，所述通过所述样本标注概率对所述生成模型的初始参数进行迭代更新，得到训练好的生成模型，并通过所述训练好的生成模型输出所述标签样本对应的标注准确率包括：The data labeling method based on a generative model according to claim 1, wherein the initial parameters of the generative model are iteratively updated by using the sample labeling probability to obtain a trained generative model, and the labeling accuracy rate corresponding to the label sample is output through the trained generative model, comprising:将所述生成模型的参数和所述样本标注概率的差值作为优化特征值；Taking the difference between the parameters of the generation model and the sample labeling probability as the optimization feature value;采用随机梯度下降的方式，将所述样本标注概率进行反向传播，以对所述初始参数进行迭代更新，其中，每次所述迭代更新都得到所述生成模型新的参数和所述优化特征值发生改变；Using the stochastic gradient descent method, the sample labeling probability is back-propagated to iteratively update the initial parameters, wherein each iterative update yields new parameters of the generative model and a changed optimization feature value;当所述优化特征值达到预设阈值时，停止所述迭代更新，得到所述训练好的生成模型；When the optimized feature value reaches a preset threshold, stop the iterative update to obtain the trained generative model;通过所述训练好的生成模型输出所述标签样本对应的标注准确率。The labeling accuracy rate corresponding to the label sample is output through the trained generation model.
- 根据权利要求3所述的基于生成模型的数据标注方法,其中,所述通过所述训练好的生成模型输出所述标签样本对应的标注准确率包括:The data labeling method based on a generative model according to claim 3, wherein the outputting the labeling accuracy rate corresponding to the label sample by the trained generative model comprises:通过所述训练好的生成模型的当前参数对所述标签样本进行概率估计,得到基础概率,其中所述当前参数是指所述优化特征值达到预设阈值时,所述迭代更新得到的参数;Perform probability estimation on the label sample by using the current parameter of the trained generation model to obtain a basic probability, wherein the current parameter refers to the parameter obtained by the iterative update when the optimized feature value reaches a preset threshold;对所述基础概率进行加权平均处理,得到所述标签样本对应的标注准确率。A weighted average process is performed on the basic probability to obtain the labeling accuracy rate corresponding to the labeling sample.
- 根据权利要求1所述的基于生成模型的数据标注方法,其中,所述获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句包括:The data labeling method based on a generative model according to claim 1, wherein the obtaining the text to be labelled and splitting the text to be labelled to obtain a split statement comprises:获取所述待标注文本,并对所述待标注文本进行预处理,得到基础文本;obtaining the text to be marked, and preprocessing the text to be marked to obtain basic text;采用正则匹配的方式,获取所述基础文本中包含的文本分隔符;Obtain the text delimiter contained in the basic text by means of regular matching;通过所述文本分隔符对所述基础文本进行拆分,得到所述拆分语句。The basic text is split by the text separator to obtain the split statement.
- 根据权利要求1所述的基于生成模型的数据标注方法,其中,在所述选取所述标注准确率最高的所述标签样本,作为目标标签样本之后,所述方法还包括:The data labeling method based on a generative model according to claim 1, wherein after selecting the label sample with the highest label accuracy rate as the target label sample, the method further comprises:获取所述待标注文本的存储路径,作为目标存储路径;Obtain the storage path of the text to be marked as the target storage path;通过预设的数据映射方式,将所述目标标签样本映射到所述目标存储路径之中。The target label sample is mapped to the target storage path by a preset data mapping method.
- 根据权利要求1所述的基于生成模型的数据标注方法,其中,所述对所述目标分词进行合并,得到目标短语包括:The data labeling method based on a generative model according to claim 1, wherein the merging the target word segmentation to obtain the target phrase comprises:通过词性标注的方式,将所述目标分词进行词性标注,得到词性分词;By means of part-of-speech tagging, part-of-speech tagging is performed on the target word segmentation to obtain part-of-speech segmentation;根据依存句法分析的方式,将符合一致性规则的所述词性分词进行合并,得到所述目标短语。According to the method of dependency syntax analysis, the part-of-speech segmentations that conform to the consistency rule are combined to obtain the target phrase.
- 一种基于生成模型的数据标注装置,包括:A data labeling device based on a generative model, comprising:待标签文本拆分模块,用于获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句;A to-be-labeled text splitting module, configured to obtain the to-be-labeled text, and to split the to-be-labeled text to obtain a split statement;目标短语获取模块,用于通过对所述拆分语句进行分词处理,得到目标分词,并对所述目标分词进行合并,得到目标短语;a target phrase acquisition module, used to obtain a target word segmentation by performing word segmentation processing on the split statement, and merging the target word segmentation to obtain a target phrase;标签样本生成模块,用于获取多种预设标注规则,并通过多种所述预设标注规则分别对所述目标短语进行标注,得到每一种所述预设规则对应的标签样本;The label sample generation module is configured to obtain a plurality of preset labeling rules, and respectively label the target phrase through the plurality of the predefined labeling rules, to obtain label samples corresponding to each of the predefined rules;初始参数生成模块,用于获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数;The initial parameter generation module is used to obtain the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and obtain the initial parameters of the generation model according to the sample labeling probability and the label sample ;标注准确率输出模块,用于通过所述样本标注概率对所述生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过所述训练好的生成模型输出所述标签样本对应的标注准确率;The labeling accuracy output module is used to iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labels corresponding to the label samples through the trained generative model Accuracy;标签样本选取模块,用于选取所述标注准确率最高的所述标签样本,作为目标标签样本。The label sample selection module is configured to select the label sample with the highest label accuracy rate as the target label sample.
- 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,其中,所述处理器执行所述计算机可读指令时实现如下步骤:A computer device, comprising a memory and a processor, wherein computer-readable instructions are stored in the memory, wherein the processor implements the following steps when executing the computer-readable instructions:获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句;Obtain the text to be marked, and split the text to be marked to obtain a split statement;通过对所述拆分语句进行分词处理,得到目标分词,并对所述目标分词进行合并,得到目标短语;By performing word segmentation processing on the split sentence, a target word segmentation is obtained, and the target word segmentation is combined to obtain a target phrase;获取多种预设标注规则,并通过多种所述预设标注规则分别对所述目标短语进行标注,得到每一种所述预设规则对应的标签样本;Obtaining a plurality of preset labeling rules, and labeling the target phrase respectively through the plurality of the predefined labeling rules, to obtain a label sample corresponding to each of the predefined rules;获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数;Obtain the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and obtain the initial parameters of the generation model according to the sample labeling probability and the label sample;通过所述样本标注概率对所述生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过所述训练好的生成模型输出所述标签样本对应的标注准确率;Iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy rate corresponding to the label sample through the trained generative model;选取所述标注准确率最高的所述标签样本,作为目标标签样本。The label sample with the highest labeling accuracy is selected as the target label sample.
- 根据权利要求9所述的计算机设备,其中,所述获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数包括:The computer device according to claim 9, wherein the obtaining the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and according to the sample labeling probability and the label sample , the initial parameters of the generated model include:计算每一种所述预设标注规则对应的标签样本对所述目标短语的覆盖率,并将所述覆盖率作为所述样本标注概率;Calculate the coverage rate of each label sample corresponding to the preset labeling rule to the target phrase, and use the coverage rate as the sample labeling probability;将所述样本标签概率和所述标签样本进行初始化处理,得到所述生成模型的初始参数。The sample label probability and the label sample are initialized to obtain the initial parameters of the generative model.
- 根据权利要求9所述的计算机设备,其中,所述获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数包括:The computer device according to claim 9, wherein the obtaining the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and according to the sample labeling probability and the label sample , the initial parameters of the generated model include:所述通过所述样本标注概率对所述生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过所述训练好的生成模型输出所述标签样本对应的标注准确率包括:The iteratively updating the initial parameters of the generation model through the sample labeling probability to obtain a trained generation model, and outputting the labeling accuracy rate corresponding to the label sample through the trained generation model includes:将所述生成模型的参数和所述样本标注概率的差值作为优化特征值;Taking the difference between the parameters of the generation model and the sample labeling probability as the optimization feature value;采用随机梯度下降的方式,将所述样本标注概率进行反向传播,以对所述初始参数进行迭代更新,其中,每次所述迭代更新都得到所述生成模型新的参数和所述优化特征值发生改变;Using the stochastic gradient descent method, the sample labeling probability is back-propagated to iteratively update the initial parameters, wherein, new parameters of the generative model and the optimization feature are obtained each time the iterative update is performed. value changes;当所述优化特征值达到预设阈值时,停止所述迭代更新,得到所述训练好的生成模型;When the optimized feature value reaches a preset threshold, stop the iterative update to obtain the trained generative model;通过所述训练好的生成模型输出所述标签样本对应的标注准确率。The labeling accuracy rate corresponding to the label sample is output through the trained generation model.
- 根据权利要求11所述的计算机设备,其中,所述通过所述训练好的生成模型输出所述标签样本对应的标注准确率包括:The computer device according to claim 11, wherein the outputting the labeling accuracy rate corresponding to the labeling sample by the trained generative model comprises:通过所述训练好的生成模型的当前参数对所述标签样本进行概率估计,得到基础概率, 其中所述当前参数是指所述优化特征值达到预设阈值时,所述迭代更新得到的参数;Perform probability estimation on the label sample by using the current parameter of the trained generation model to obtain a basic probability, wherein the current parameter refers to the parameter obtained by the iterative update when the optimized feature value reaches a preset threshold;对所述基础概率进行加权平均处理,得到所述标签样本对应的标注准确率。A weighted average process is performed on the basic probability to obtain the labeling accuracy rate corresponding to the labeling sample.
- 根据权利要求9所述的计算机设备,其中,所述获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句包括:The computer device according to claim 9, wherein the acquiring the text to be marked and splitting the text to be marked, and obtaining the split statement comprises:获取所述待标注文本,并对所述待标注文本进行预处理,得到基础文本;obtaining the text to be marked, and preprocessing the text to be marked to obtain basic text;采用正则匹配的方式,获取所述基础文本中包含的文本分隔符;Obtain the text delimiter contained in the basic text by means of regular matching;通过所述文本分隔符对所述基础文本进行拆分,得到所述拆分语句。The basic text is split by the text separator to obtain the split statement.
- 根据权利要求9所述的计算机设备,其中,在所述选取所述标注准确率最高的所述标签样本,作为目标标签样本之后,所述方法还包括:The computer device according to claim 9, wherein after selecting the label sample with the highest label accuracy rate as the target label sample, the method further comprises:获取所述待标注文本的存储路径,作为目标存储路径;Obtain the storage path of the text to be marked as the target storage path;通过预设的数据映射方式,将所述目标标签样本映射到所述目标存储路径之中。The target label sample is mapped to the target storage path by a preset data mapping method.
- 根据权利要求9所述的计算机设备,其中,所述对所述目标分词进行合并,得到目标短语包括:The computer device according to claim 9, wherein the combining the target word segmentation to obtain the target phrase comprises:通过词性标注的方式,将所述目标分词进行词性标注,得到词性分词;By means of part-of-speech tagging, part-of-speech tagging is performed on the target word segmentation to obtain part-of-speech segmentation;根据依存句法分析的方式,将符合一致性规则的所述词性分词进行合并,得到所述目标短语。According to the method of dependency syntax analysis, the part-of-speech segmentations that conform to the consistency rule are combined to obtain the target phrase.
- 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机可读指令,所述计算机可读指令被一种处理器执行时使得所述一种处理器执行如下步骤:A computer-readable storage medium, wherein the computer-readable storage medium stores computer-readable instructions that, when executed by a processor, cause the processor to perform the following steps:获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句;Obtain the text to be marked, and split the text to be marked to obtain a split statement;通过对所述拆分语句进行分词处理,得到目标分词,并对所述目标分词进行合并,得到目标短语;By performing word segmentation processing on the split sentence, a target word segmentation is obtained, and the target word segmentation is combined to obtain a target phrase;获取多种预设标注规则,并通过多种所述预设标注规则分别对所述目标短语进行标注,得到每一种所述预设规则对应的标签样本;Obtaining a plurality of preset labeling rules, and labeling the target phrase respectively through the plurality of the predefined labeling rules, to obtain a label sample corresponding to each of the predefined rules;获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数;Obtain the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and obtain the initial parameters of the generation model according to the sample labeling probability and the label sample;通过所述样本标注概率对所述生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过所述训练好的生成模型输出所述标签样本对应的标注准确率;Iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy rate corresponding to the label sample through the trained generative model;选取所述标注准确率最高的所述标签样本,作为目标标签样本。The label sample with the highest labeling accuracy is selected as the target label sample.
- 根据权利要求16所述的计算机可读存储介质,其中,所述获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数包括:The computer-readable storage medium according to claim 16, wherein the obtaining the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and according to the sample labeling probability and the sample labeling probability According to the label sample, the initial parameters of the generated model include:计算每一种所述预设标注规则对应的标签样本对所述目标短语的覆盖率,并将所述覆盖率作为所述样本标注概率;Calculate the coverage rate of each label sample corresponding to the preset labeling rule to the target phrase, and use the coverage rate as the sample labeling probability;将所述样本标签概率和所述标签样本进行初始化处理,得到所述生成模型的初始参数。The sample label probability and the label sample are initialized to obtain the initial parameters of the generative model.
- 根据权利要求16所述的计算机可读存储介质,其中,所述获取每一种所述预设标注规则对应的标签样本对所述目标短语的样本标注概率,并根据所述样本标注概率和所述标签样本,得到生成模型的初始参数包括:The computer-readable storage medium according to claim 16, wherein the obtaining the sample labeling probability of the target phrase by the label sample corresponding to each of the preset labeling rules, and according to the sample labeling probability and the sample labeling probability According to the label sample, the initial parameters of the generated model include:所述通过所述样本标注概率对所述生成模型的初始参数进行迭代更新,得到训练好的生成模型,并通过所述训练好的生成模型输出所述标签样本对应的标注准确率包括:The iteratively updating the initial parameters of the generation model through the sample labeling probability to obtain a trained generation model, and outputting the labeling accuracy rate corresponding to the label sample through the trained generation model includes:将所述生成模型的参数和所述样本标注概率的差值作为优化特征值;Taking the difference between the parameters of the generation model and the sample labeling probability as the optimization feature value;采用随机梯度下降的方式,将所述样本标注概率进行反向传播,以对所述初始参数进行迭代更新,其中,每次所述迭代更新都得到所述生成模型新的参数和所述优化特征值发生改变;Using the stochastic gradient descent method, the sample labeling probability is back-propagated to iteratively update the initial parameters, wherein, new parameters of the generative model and the optimization feature are obtained each time the iterative update is performed. value changes;当所述优化特征值达到预设阈值时,停止所述迭代更新,得到所述训练好的生成模型;When the optimized feature value reaches a preset threshold, stop the iterative update to obtain the trained generative model;通过所述训练好的生成模型输出所述标签样本对应的标注准确率。The labeling accuracy rate corresponding to the label sample is output through the trained generation model.
- 根据权利要求18所述的计算机可读存储介质,其中,所述通过所述训练好的生 成模型输出所述标签样本对应的标注准确率包括:The computer-readable storage medium according to claim 18, wherein the outputting the labeling accuracy rate corresponding to the labeling sample by the trained generative model comprises:通过所述训练好的生成模型的当前参数对所述标签样本进行概率估计,得到基础概率,其中所述当前参数是指所述优化特征值达到预设阈值时,所述迭代更新得到的参数;Perform probability estimation on the label sample by using the current parameter of the trained generation model to obtain a basic probability, wherein the current parameter refers to the parameter obtained by the iterative update when the optimized feature value reaches a preset threshold;对所述基础概率进行加权平均处理,得到所述标签样本对应的标注准确率。A weighted average process is performed on the basic probability to obtain the labeling accuracy rate corresponding to the labeling sample.
- 根据权利要求16所述的计算机可读存储介质,其中,所述获取待标注文本,并对所述待标注文本进行拆分,得到拆分语句包括:The computer-readable storage medium according to claim 16, wherein the acquiring the text to be marked and splitting the text to be marked, and obtaining the split sentence comprises:获取所述待标注文本,并对所述待标注文本进行预处理,得到基础文本;obtaining the text to be marked, and preprocessing the text to be marked to obtain basic text;采用正则匹配的方式,获取所述基础文本中包含的文本分隔符;Obtain the text delimiter contained in the basic text by means of regular matching;通过所述文本分隔符对所述基础文本进行拆分,得到所述拆分语句。The basic text is split by the text separator to obtain the split statement.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110193454.5 | 2021-02-20 | ||
CN202110193454.5A CN112860919A (en) | 2021-02-20 | 2021-02-20 | Data labeling method, device and equipment based on generative model and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022174496A1 true WO2022174496A1 (en) | 2022-08-25 |
Family
ID=75988385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/083758 WO2022174496A1 (en) | 2021-02-20 | 2021-03-30 | Data annotation method and apparatus based on generative model, and device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112860919A (en) |
WO (1) | WO2022174496A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196908A (en) * | 2019-04-17 | 2019-09-03 | 深圳壹账通智能科技有限公司 | Data classification method, device, computer installation and storage medium |
CN111507104A (en) * | 2020-03-19 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
US20200320171A1 (en) * | 2019-04-02 | 2020-10-08 | International Business Machines Corporation | Cross-subject model-generated training data for relation extraction modeling |
CN112084752A (en) * | 2020-09-08 | 2020-12-15 | 中国平安财产保险股份有限公司 | Statement marking method, device, equipment and storage medium based on natural language |
Also Published As
Publication number | Publication date |
---|---|
CN112860919A (en) | 2021-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107491534B (en) | Information processing method and device | |
CN107679039B (en) | Method and device for determining statement intention | |
US20190065506A1 (en) | Search method and apparatus based on artificial intelligence | |
US10755048B2 (en) | Artificial intelligence based method and apparatus for segmenting sentence | |
CN107861954B (en) | Information output method and device based on artificial intelligence | |
US20180068221A1 (en) | System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus | |
CN108628830B (en) | Semantic recognition method and device | |
WO2022105122A1 (en) | Answer generation method and apparatus based on artificial intelligence, and computer device and medium | |
CN110795938A (en) | Text sequence word segmentation method, device and storage medium | |
CN107112009B (en) | Method, system and computer-readable storage device for generating a confusion network | |
EP3923159A1 (en) | Method, apparatus, device and storage medium for matching semantics | |
WO2012158572A2 (en) | Exploiting query click logs for domain detection in spoken language understanding | |
WO2020244065A1 (en) | Character vector definition method, apparatus and device based on artificial intelligence, and storage medium | |
CN111985229A (en) | Sequence labeling method and device and computer equipment | |
CN113987169A (en) | Text abstract generation method, device and equipment based on semantic block and storage medium | |
CN113220835A (en) | Text information processing method and device, electronic equipment and storage medium | |
CN113076739A (en) | Method and system for realizing cross-domain Chinese text error correction | |
CN114840671A (en) | Dialogue generation method, model training method, device, equipment and medium | |
CN112101031B (en) | Entity identification method, terminal equipment and storage medium | |
WO2021135469A1 (en) | Machine learning-based information extraction method, apparatus, computer device, and medium | |
WO2021051574A1 (en) | English text sequence labelling method and system, and computer device | |
CN113761923A (en) | Named entity recognition method and device, electronic equipment and storage medium | |
CN115565177A (en) | Character recognition model training method, character recognition device, character recognition equipment and medium | |
WO2022174496A1 (en) | Data annotation method and apparatus based on generative model, and device and storage medium | |
WO2021212681A1 (en) | Semantic role annotation method and apparatus, and computer device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21926213; Country of ref document: EP; Kind code of ref document: A1 |