CN112860919A - Data labeling method, device and equipment based on generative model and storage medium - Google Patents


Info

Publication number
CN112860919A
CN112860919A
Authority
CN
China
Prior art keywords
labeling
sample
target
text
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110193454.5A
Other languages
Chinese (zh)
Inventor
李薿
陈曦
崔艳
庄伯金
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110193454.5A priority Critical patent/CN112860919A/en
Priority to PCT/CN2021/083758 priority patent/WO2022174496A1/en
Publication of CN112860919A publication Critical patent/CN112860919A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/381 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application relates to the technical field of artificial intelligence and discloses a data annotation method, apparatus, device, and storage medium based on a generative model. The method comprises: obtaining a text to be annotated and splitting, word-segmenting, and merging it to obtain target phrases; annotating the target phrases under a plurality of preset labeling rules to obtain label samples; obtaining the sample labeling probability of each label sample over the target phrases, iteratively updating the initial parameters of the generative model according to the sample labeling probabilities to obtain a trained generative model, and outputting the labeling accuracy through the trained model; and determining a target label sample according to the labeling accuracy. The application also relates to blockchain technology: the text to be annotated may be stored on a blockchain. Because the data is annotated under several preset rules and the label sample with the highest annotation accuracy is selected by the generative model, the accuracy of data annotation is improved.

Description

Data labeling method, device and equipment based on generative model and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data annotation method, apparatus, device, and storage medium based on a generative model.
Background
With knowledge graphs playing an increasingly prominent role in every vertical domain, how to annotate large-scale unlabeled data has become a focus of attention in the knowledge-graph field.
Although named-entity recognition with labeled data reaches an accuracy above 99%, manually constructing text annotation data for different domains takes an extremely long time, and annotation data from one domain is not fully reusable in another. Differences in business scenarios, target users, and product definitions mean that large-scale annotation data applicable across many domains is unlikely to exist in the text field. How to improve the efficiency of annotating large-scale data has therefore become a hard problem.
To address this, an existing solution obtains the word sequence corresponding to the original text, converts and maps it to obtain an entity-tagging vector, and counts the occurrences of preset entity information in that vector, thereby annotating the data. However, because this method annotates by converting and mapping word vectors, annotation errors arise easily, leading to low annotation accuracy on large-scale data. A method that can improve the accuracy of data annotation is needed.
Disclosure of Invention
Embodiments of the present application aim to provide a data annotation method, apparatus, device, and storage medium based on a generative model, so as to improve the accuracy of data annotation.
In order to solve the above technical problem, an embodiment of the present application provides a data annotation method based on a generative model, including:
acquiring a text to be marked, and splitting the text to be marked to obtain a split sentence;
performing word segmentation processing on the split sentences to obtain target words, and merging the target words to obtain target phrases;
acquiring a plurality of preset labeling rules, and labeling the target phrase respectively through the plurality of preset labeling rules to obtain a label sample corresponding to each preset rule;
acquiring the sample labeling probability of the label sample corresponding to each preset labeling rule to the target phrase, and obtaining the initial parameters of the generated model according to the sample labeling probability and the label sample;
iteratively updating the initial parameters of the generated model according to the sample labeling probability to obtain a trained generated model, and outputting the labeling accuracy corresponding to the label sample according to the trained generated model;
and selecting the label sample with the highest labeling accuracy as a target label sample.
In order to solve the above technical problem, an embodiment of the present application provides a data annotation device based on a generative model, including:
the to-be-annotated text splitting module, used for acquiring a text to be annotated and splitting it to obtain split sentences;
the target phrase acquisition module is used for performing word segmentation processing on the split sentences to obtain target words, and merging the target words to obtain target phrases;
the label sample generation module is used for acquiring a plurality of preset labeling rules and labeling the target phrase through the plurality of preset labeling rules respectively to obtain a label sample corresponding to each preset rule;
the initial parameter generation module is used for acquiring the sample labeling probability of the label sample corresponding to each preset labeling rule to the target phrase, and obtaining the initial parameter of the generation model according to the sample labeling probability and the label sample;
the labeling accuracy rate output module is used for iteratively updating the initial parameters of the generated model according to the sample labeling probability to obtain a trained generated model, and outputting the labeling accuracy rate corresponding to the label sample according to the trained generated model;
and the label sample selecting module is used for selecting the label sample with the highest labeling accuracy as a target label sample.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, comprising one or more processors and a memory storing one or more programs which, when executed by the one or more processors, cause them to implement any of the generative-model-based data annotation methods described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing any of the generative-model-based data annotation methods described above.
Embodiments of the invention provide a data annotation method, apparatus, device, and storage medium based on a generative model. The acquired text to be annotated is split, word-segmented, and merged to obtain target phrases, which makes subsequent data annotation of the text convenient. A plurality of preset labeling rules are then obtained, and the target phrases are annotated under each preset rule to obtain one label sample per rule. The sample labeling probability of each rule's label sample over the target phrases is obtained, and the initial parameters of the generative model are derived from the sample labeling probabilities and the label samples. The initial parameters are iteratively updated using the sample labeling probabilities to obtain a trained generative model, which outputs the labeling accuracy of each label sample; the label sample with the highest labeling accuracy is selected as the target label sample. Annotating the data under several preset rules and selecting the most accurate label sample with the generative model helps improve the accuracy of data annotation.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of an application environment of a data annotation method based on a generative model according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a data annotation method based on generative models according to an embodiment of the present application;
FIG. 3 is a flow chart of an implementation of a sub-process in the data annotation method based on generative models according to the embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in the data annotation method based on generative models according to the embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in the data annotation method based on generative models according to the embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in the data annotation method based on generative models according to the embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process in the data annotation method based on generative models according to the embodiment of the present application;
FIG. 8 is a flowchart of another implementation of a sub-process in the data annotation method based on generative models according to the embodiment of the present application;
FIG. 9 is a schematic diagram of a data annotation device based on generative models according to an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail below with reference to the accompanying drawings and embodiments.
Referring to fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a web browser application, a search-type application, an instant messaging tool, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
The data annotation method based on the generative model provided in the embodiments of the present application is generally executed by a server, and accordingly, the data annotation apparatus based on the generative model is generally configured in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to fig. 2, fig. 2 shows an embodiment of a data annotation method based on generative model.
It should be noted that, if the result is substantially the same, the method of the present invention is not limited to the flow sequence shown in fig. 2, and the method includes the following steps:
s1: and acquiring a text to be marked, and splitting the text to be marked to obtain a split sentence.
Specifically, after acquiring the text to be annotated, the server may perform preprocessing, such as data cleaning, on the text to be annotated, and then split the text to be annotated according to paragraphs, sentences, and the like in the text, so as to obtain split sentences. The text to be labeled needs to be subjected to data labeling, so that the text with the label is generated.
S2: and performing word segmentation processing on the split sentences to obtain target words, and merging the target words to obtain target phrases.
Specifically, the preceding step has already split the text to be annotated into split sentences, which exist as short clauses. To better annotate them later, word segmentation is performed on the split sentences with a preset segmentation tool to produce the target words; part-of-speech tagging is applied according to the part of speech of each target word; and the target words are merged by dependency parsing to produce the target phrases.
It should be noted that the preset word-segmentation tools include, but are not limited to: jieba, the NLPIR segmentation system, SnowNLP, and the like. Preferably, the split sentences are segmented with jieba to obtain the target words. jieba cuts Chinese sentences accurately, is well suited to text analysis, scans all character sequences in a sentence that can form words, and is fast, making it suitable for segmenting the split sentences.
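The segmentation step above can be illustrated with a toy dictionary-based segmenter. This is only a sketch: the vocabulary is invented, and a real tool such as jieba uses a prefix-dictionary DAG with HMM fallback rather than the simple forward maximum matching shown here.

```python
# Toy forward-maximum-matching segmenter, sketching what a dictionary-based
# word-segmentation tool does. The vocabulary is an invented example; it is
# NOT jieba's actual algorithm.
def forward_max_match(sentence, vocab, max_len=4):
    """Greedily take the longest dictionary word at each position,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + length]
            if length == 1 or cand in vocab:  # single-char fallback
                words.append(cand)
                i += length
                break
    return words

vocab = {"数据", "标注", "方法"}  # assumed toy vocabulary
print(forward_max_match("数据标注方法", vocab))  # ['数据', '标注', '方法']
```

A production pipeline would replace this with `jieba.cut` or an equivalent tool, but the output shape, a list of target words ready for merging, is the same.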
Dependency parsing was first proposed by the French linguist L. Tesnière. It analyzes a sentence into a dependency syntax tree that describes the dependency relations among the words, i.e., it points out the syntactic collocation relations between words, which are associated with semantics. In the embodiment of the present application, the target words are merged by means of dependency parsing.
S3: and acquiring a plurality of preset labeling rules, and labeling the target phrase respectively through the plurality of preset labeling rules to obtain a label sample corresponding to each preset rule.
Specifically, in this embodiment, after the text to be annotated has been split, word-segmented, and merged, the target phrases are annotated under several labeling rules; a generative model then determines the accuracy of the data annotated under each rule, so that the most accurate label sample can be selected and the annotation completed. The server therefore obtains a plurality of preset labeling rules and, under each rule, annotates the target phrases with their corresponding labels, so that the target phrases yield one label sample per preset rule.
It should be noted that the preset labeling rules include, but are not limited to: regular-expression recognition, remote knowledge-base matching, and external-data matching. Regular-expression recognition matches the corresponding labeling rules by presetting different SQL query statements, so that different rules annotate the target phrases. Remote knowledge-base matching matches the target phrases one by one against the knowledge base of a peripheral device to annotate them. External-data matching matches the target phrases against external data, for example data provided by a crowdsourcing platform, to complete the annotation. Preferably, the target phrases are annotated under several different labeling rules, so that the accuracy of the various approaches can be screened and the accuracy of data annotation improved.
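The three rule families can be sketched as labeling functions that each emit one label sample over the target phrases. The patent gives no concrete rules, so every rule, knowledge-base entry, and label below is an invented toy example.

```python
import re

# Hypothetical regex rule: tag anything ending in "公司" (company) as ORG.
def regex_rule(phrase):
    return "ORG" if re.search(r"公司$", phrase) else "O"

KNOWLEDGE_BASE = {"北京": "LOC", "平安科技": "ORG"}  # assumed mini knowledge base

def kb_rule(phrase):
    """Remote knowledge-base matching, mocked as a dict lookup."""
    return KNOWLEDGE_BASE.get(phrase, "O")

EXTERNAL_DATA = {"张三": "PER"}  # assumed crowdsourced labels

def external_rule(phrase):
    return EXTERNAL_DATA.get(phrase, "O")

RULES = [regex_rule, kb_rule, external_rule]

def label_samples(phrases):
    """One label sample per rule: rule name -> labels over the phrases
    ("O" marks a phrase the rule failed to label)."""
    return {rule.__name__: [rule(p) for p in phrases] for rule in RULES}

phrases = ["平安科技", "北京", "某某公司", "张三"]
print(label_samples(phrases))
```

Each rule covers different phrases and disagrees with the others in different places, which is exactly what the generative model in steps S4 and S5 exploits.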
S4: and acquiring the sample labeling probability of the label sample corresponding to each preset labeling rule to the target phrase, and acquiring the initial parameters of the generated model according to the sample labeling probability and the label sample.
Specifically, the sample labeling probability refers to the coverage of the target phrases by the label sample obtained under a preset labeling rule; it is later used to iteratively update the parameters of the generative model. Because each preset labeling rule yields a different sample labeling probability over the target phrases, the sample labeling probability corresponding to each preset labeling rule must be obtained first. The server also initializes the generative model from the sample labeling probabilities and the label samples to obtain its initial estimated parameters, i.e., the initial parameters of the generative model.
A generative model is a model that can randomly generate observed data, especially given some latent parameters; it assigns a joint probability distribution to the observations and the label sequences. In this embodiment, the latent parameters correspond to the true labels of the target phrases, the observations correspond to the sample labeling probabilities, and the label sequences correspond to the label samples. A model of the observed data is therefore generated from the latent parameters, i.e., the true data labels, and the labeling probability of each preset labeling rule over the target phrases can be estimated.
S5: and iteratively updating the initial parameters of the generated model according to the sample labeling probability to obtain a trained generated model, and outputting the labeling accuracy rate corresponding to the label sample according to the trained generated model.
Specifically, the initial parameters of the generative model are fitted to the sample labeling probabilities: the sample labeling probabilities are back-propagated and the initial parameters are iteratively updated by stochastic gradient descent, so that the parameters of the generative model approach the sample labeling probabilities, yielding the trained generative model. The trained parameters are then used for probability estimation over the label samples, followed by weighted averaging, to obtain the labeling accuracy of the label samples under each preset rule.
Here, iterative updating means fitting the initial parameters of the generative model to the sample labeling probabilities: the sample labeling probabilities are back-propagated and the initial parameters are iteratively recomputed by stochastic gradient descent, so that the parameters of the generative model approach the sample labeling probabilities.
S6: and selecting the label sample with the highest labeling accuracy as a target label sample.
Specifically, the labeling accuracy of the label sample under each preset labeling rule is obtained through the steps, so that the label sample with the highest labeling accuracy is selected as the target label sample, the target phrase is labeled by trying multiple labeling rules, the label sample with the highest accuracy is selected, and the data labeling accuracy is improved.
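Step S6 amounts to an argmax over the per-rule labeling accuracies. A minimal sketch, with invented accuracy values and rule names:

```python
def pick_target_sample(accuracy_by_rule):
    """Return the (rule, accuracy) pair with the highest estimated accuracy,
    i.e. the target label sample of step S6."""
    best_rule = max(accuracy_by_rule, key=accuracy_by_rule.get)
    return best_rule, accuracy_by_rule[best_rule]

# Assumed accuracies output by the trained generative model for three rules.
acc = {"regex_rule": 0.81, "kb_rule": 0.93, "external_rule": 0.88}
print(pick_target_sample(acc))  # ('kb_rule', 0.93)
```

The label sample produced by the winning rule is then kept as the final annotation of the target phrases.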
In this embodiment, the acquired text to be annotated is split, word-segmented, and merged into target phrases, which makes subsequent data annotation of the text convenient. A plurality of preset labeling rules are then obtained, and the target phrases are annotated under each rule to obtain one label sample per rule. The sample labeling probability of each rule's label sample over the target phrases is obtained, the initial parameters of the generative model are derived from the sample labeling probabilities and the label samples, and the initial parameters are iteratively updated to obtain a trained generative model, which outputs the labeling accuracy of each label sample. The label sample with the highest labeling accuracy is selected as the target label sample. Annotating the data under several preset rules and selecting the most accurate label sample with the generative model helps improve the accuracy of data annotation.
Referring to fig. 3, fig. 3 shows a specific implementation manner of step S4, where in step S4, a sample labeling probability of a target phrase to a label sample corresponding to each preset labeling rule is obtained, and a specific implementation process of obtaining initial parameters of a generative model according to the sample labeling probability and the label sample is described as follows:
s41: and calculating the coverage rate of the label sample corresponding to each preset labeling rule on the target phrase, and taking the coverage rate as the sample labeling probability.
Specifically, in order to train the generated model subsequently and make the sample labeling probability approach the parameters of the generated model, the sample labeling probability needs to be obtained first. Therefore, the coverage rate of the label sample corresponding to each preset labeling rule to the target phrase is calculated, and the coverage rate is used as the sample labeling probability. In the embodiment of the application, the coverage rate is obtained by calculating the coverage degree of the target phrase by the label sample.
In a specific embodiment, when the target phrases are annotated by remote knowledge-base matching, the peripheral knowledge base may fail to match every target word in a target phrase one by one, in which case labeling of that phrase fails; when every target word in a target phrase matches the peripheral knowledge base, labeling of the phrase succeeds. Dividing the number of successfully labeled target phrases in the label sample by the total number of target phrases yields the coverage of the remote knowledge-base matching rule over the target phrases, and this coverage is taken as the sample labeling probability. For example, if 9000 target phrases are successfully labeled out of 10000 in total, the sample labeling probability is 90%.
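The coverage computation of step S41 is a simple ratio; a sketch reproducing the 9000-out-of-10000 example from the text (the "O" convention for an unlabeled phrase is assumed):

```python
def sample_labeling_probability(labels):
    """Coverage of a label sample: the fraction of target phrases the rule
    managed to label, where "O" marks a failed (unlabeled) phrase."""
    covered = sum(1 for label in labels if label != "O")
    return covered / len(labels)

# The text's example: 9000 successes out of 10000 target phrases -> 0.9
labels = ["LOC"] * 9000 + ["O"] * 1000
print(sample_labeling_probability(labels))  # 0.9
```

One such probability is computed per preset labeling rule; together they initialize and then drive the training of the generative model.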
S42: and initializing the sample label probability and the label sample to obtain initial parameters of the generated model.
Specifically, the initialization processing refers to assigning an estimated parameter value to an initial parameter of the generated model according to the sample label probability and the label sample, so as to obtain the initial parameter of the generated model.
In this embodiment, the coverage of the target phrases by the label sample under each preset labeling rule is calculated and taken as the sample labeling probability, and the sample labeling probabilities and label samples are initialized to obtain the initial parameters of the generative model. Obtaining the sample labeling probabilities and the initial parameters facilitates the subsequent training of the generative model, and thereby helps improve the accuracy of data annotation.
Referring to fig. 4, fig. 4 shows a specific implementation manner of step S5, in step S5, the initial parameters of the generated model are iteratively updated according to the sample labeling probability to obtain a trained generated model, and a specific implementation process of outputting the labeling accuracy corresponding to the label sample according to the trained generated model is described as follows:
s51: and taking the difference value between the parameters of the generated model and the sample labeling probability as an optimization characteristic value.
Specifically, in the embodiment of the present application, the parameters of the generated model are iteratively updated, so that the parameters of the generated model are continuously close to the sample labeling probability, and therefore, the difference value between the parameters of the generated model and the sample labeling probability is used as an optimization characteristic value, and the training degree of the generated model is determined by evaluating the optimization characteristic value.
Specifically, once the data volume reaches a certain scale, the target phrases are annotated under the plurality of preset labeling rules and the resulting generative model is trained; the model's estimate of the true labels of the target phrases is then better than a random guess at the sample labels. Because the parameters of the generative model are used to estimate the accuracy of the label samples, while the sample labeling probability is computed as the number of successfully labeled target phrases over the total, the closer the model's parameters come to the sample labeling probability, i.e., the smaller the optimization feature value, the closer the model is to completing training. For example, if the initial parameter of the generative model is 0.4 and the sample labeling probability is 0.92, the optimization feature value is 0.52; with continued iterative updating it gradually shrinks, and when it reaches 0.01 the parameter is already close to the sample labeling probability and the iterative updating ends.
S52: and (3) performing back propagation on the sample labeling probability by adopting a random gradient descending mode to perform iterative updating on the initial parameter, wherein each iterative updating is to obtain a new parameter and an optimized characteristic value of the generated model to be changed.
Specifically, the sample labeling probability is propagated reversely by adopting a random gradient descent mode so as to iteratively update the initial parameter, a new parameter is obtained every time the model is generated by updating calculation, and a new optimization characteristic value can be obtained by calculating the difference value between the new parameter and the sample labeling probability. The optimization characteristic value is calculated through the difference value between the parameters of the generated model and the sample labeling probability, and the parameters of the generated model are changed after each iteration updating, so that the optimization characteristic value is changed after each iteration updating.
The gradient descent method is one of iteration methods, and can be used for solving a least square problem. Gradient Descent (Gradient decision) is one of the most commonly used methods when solving model parameters of machine learning algorithms, i.e. unconstrained optimization problems. When the minimum value of the loss function is solved, iterative solution can be carried out step by step through a gradient descent method, and the minimized loss function and the model parameter value are obtained. Conversely, if the maximum of the loss function needs to be solved, then the gradient ascent method needs to be iterated. In machine learning, two gradient descent methods, namely a random gradient descent method and a batch gradient descent method, are developed based on a basic gradient descent method. In the embodiment of the application, the sample labeling probability is propagated reversely by adopting a random gradient descent mode so as to iteratively update the initial parameters.
The back-propagation algorithm is a learning algorithm for multi-layer neural networks based on gradient descent. The input-output relationship of a back-propagation network is essentially a mapping: a back-propagation neural network with n inputs and m outputs realizes a continuous mapping from n-dimensional Euclidean space to a finite field in m-dimensional Euclidean space, and this mapping is highly non-linear.

In a specific embodiment, the sample labeling probability is fed into the input layer of the neural network, passes through the hidden layers, and finally reaches the output layer, which outputs a result; this is the forward propagation process. Because the output of the neural network deviates from the actual result, the error between the estimated value and the actual value, i.e. the optimization characteristic value, is calculated and propagated backwards from the output layer through the hidden layers to the input layer. During back-propagation, the parameters are adjusted by stochastic gradient descent so that the optimization characteristic value decreases. These steps are iterated until the optimization characteristic value reaches a preset threshold.
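Steps S52 and S53 can be sketched for a single scalar parameter as follows; the learning rate, the squared-difference loss, and the function names are assumptions made for illustration, not details fixed by the application:

```python
# Minimal sketch of the iterative update in S52/S53: a scalar parameter is
# moved toward the sample labeling probability by gradient descent on the
# squared difference, and updating stops once the optimization
# characteristic value reaches the preset threshold (0.01, as in the
# embodiment above).
def train_parameter(initial_param, sample_label_prob, lr=0.1, threshold=0.01, max_iters=10000):
    param = initial_param
    for _ in range(max_iters):
        characteristic_value = abs(param - sample_label_prob)
        if characteristic_value <= threshold:          # S53: stop at the threshold
            break
        grad = 2.0 * (param - sample_label_prob)       # gradient of (param - p)^2
        param -= lr * grad                             # one descent step (S52)
    return param

trained = train_parameter(0.4, 0.92)                   # converges near 0.92
```

With the example values from the text (0.4 initial parameter, 0.92 sample labeling probability), the gap contracts geometrically and the loop stops within a few dozen iterations.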
S53: stopping the iterative updating when the optimization characteristic value reaches a preset threshold, to obtain the trained generative model.

Specifically, when the optimization characteristic value reaches the preset threshold, the parameters of the generative model are very close to the sample labeling probability; updating of the parameters is then stopped, and the trained generative model is obtained.

The preset threshold is set according to actual conditions and is not limited herein. In one embodiment, the preset threshold is 0.01.
S54: outputting the labeling accuracy corresponding to the label sample through the trained generative model.

Specifically, the trained generative model is obtained in the above steps; probability estimation is performed on the label sample by the trained generative model, and the labeling accuracy corresponding to the label sample is output.

In this embodiment, the difference between the parameters of the generative model and the sample labeling probability is used as the optimization characteristic value, and the sample labeling probability is back-propagated by stochastic gradient descent to iteratively update the initial parameters. When the optimization characteristic value reaches the preset threshold, the iterative updating is stopped and the trained generative model is obtained, through which the labeling accuracy corresponding to the label sample is output. Training of the generative model is thereby achieved and the labeling accuracy corresponding to the label sample is output, which improves the accuracy of data labeling.
Referring to fig. 5, fig. 5 shows an embodiment of step S54. The detailed implementation of outputting the labeling accuracy corresponding to the label sample through the trained generative model in step S54 is as follows:

S541: performing probability estimation on the label sample through the current parameters of the trained generative model to obtain a basic probability.

Specifically, probability estimation is performed on the label sample using the current parameters to obtain the basic probability, so that the basic probability can be further processed subsequently to obtain the final labeling accuracy. The current parameters are the parameters of the generative model obtained by iterative updating at the point where the optimization characteristic value reaches the preset threshold.

Specifically, a generative model is a model that can randomly generate observed data, in particular given certain hidden parameters; it assigns a joint probability distribution to the observed values and the labeled data sequences. In the embodiment of the present application, the hidden parameters correspond to the true labels of the target phrases, the observed values correspond to the sample labeling probabilities, and the labeled data sequences correspond to the label samples. A model of the observed data is therefore generated from the hidden parameters, i.e. the true data labels; this model consists of the current parameters, and the probability estimate of each preset labeling rule on the label sample can be evaluated, thereby obtaining the basic probability.
S542: performing weighted average processing on the basic probabilities to obtain the labeling accuracy corresponding to the label sample.

Specifically, the labeling accuracy is made more accurate by taking a weighted average of the basic probabilities.

In this embodiment, probability estimation is performed on the label sample through the current parameters of the trained generative model to obtain the basic probabilities, and weighted average processing is performed on the basic probabilities to obtain the labeling accuracy corresponding to the label sample. This makes the resulting labeling accuracy more precise, which is beneficial to improving the accuracy of data labeling.
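A weighted average over the per-rule basic probabilities, as in S542, might look like the following; the weights are hypothetical (the application does not specify how they are chosen — they could, for example, reflect each rule's coverage):

```python
# Sketch of S542: combine the basic probabilities estimated for each preset
# labeling rule into one labeling accuracy by weighted average.
def labeling_accuracy(basic_probs, weights):
    # weights are assumed here, e.g. proportional to each rule's coverage
    total = sum(weights)
    return sum(p * w for p, w in zip(basic_probs, weights)) / total

acc = labeling_accuracy([0.9, 0.8, 0.95], [0.5, 0.3, 0.2])  # 0.88
```

With equal weights this reduces to the plain mean of the basic probabilities.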
Referring to fig. 6, fig. 6 shows a specific implementation of step S1, in which the text to be labeled is obtained and split to obtain the split sentences. The details are as follows:

S11: acquiring the text to be labeled, and preprocessing it to obtain a basic text.

Specifically, the preprocessing includes data cleansing of the text to be labeled. Data cleansing refers to the process of reviewing and verifying data; its purpose is to delete duplicate information, correct errors, and provide data consistency.
S12: acquiring the text separators contained in the basic text by regular-expression matching.

S13: splitting the basic text by the text separators to obtain the split sentences.

Specifically, regular-expression matching is used to obtain the text separators contained in the basic text, which are then used to segment the text in the subsequent steps.

Optionally, the text separators comprise format separators and punctuation separators.

A format separator is a separator determined by the encoding type or the structure of the text; the basic text is split according to the encoding type or structure of the text by the format separators.

A punctuation separator divides the text at punctuation marks; the basic text is split rapidly by the punctuation separators.

In this embodiment, the text to be labeled is obtained and preprocessed to obtain the basic text, the text separators contained in the basic text are obtained by regular-expression matching, and the basic text is split by the text separators to obtain the split sentences. This facilitates the subsequent generation of the target phrases and the subsequent labeling of the corresponding labels.
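Steps S11–S13 can be sketched with a regular expression; the particular separator set (Chinese sentence-final punctuation plus the newline as a format separator) is an assumption for illustration only:

```python
import re

# Sketch of S12/S13: find text separators by regular matching and split the
# basic text into split sentences, dropping empty fragments.
SEPARATOR_PATTERN = r"[。！？；\n]"   # punctuation separators + newline (format separator)

def split_text(basic_text):
    parts = re.split(SEPARATOR_PATTERN, basic_text)
    return [s.strip() for s in parts if s.strip()]

sentences = split_text("今天天气很好。我们去公园！好的？\n明天见")
```

The call above yields four split sentences, one per separator-delimited segment.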
Referring to fig. 7, fig. 7 shows a specific implementation after step S6, which includes:

S61: acquiring the storage path of the text to be labeled as a target storage path;

S62: mapping the target label sample into the target storage path by a preset data mapping mode.

Specifically, to facilitate data tracing, i.e. querying the target label sample corresponding to a given text to be labeled, the target label sample and the text to be labeled are stored under the same path.

The preset data mapping modes include, but are not limited to, hand-coded mapping and visual operation. Hand-coded mapping directly defines the data correspondence in programming languages such as XSLT, Java, and C++; a visual operation typically lets the user draw a line between data items to define the correspondence between them. In a specific embodiment, the target label sample is mapped into the target storage path through a visual operation.
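As a minimal sketch of S61–S62, the target label sample can be written next to the text to be labeled; the file name `labels.json` and the JSON serialization are assumptions for illustration, not part of the application:

```python
import json
import os

# Sketch: store the target label sample under the same storage path as the
# text to be labeled, so the sample can be traced back to its source text.
def map_label_sample(text_path, target_label_sample):
    target_dir = os.path.dirname(text_path)              # S61: target storage path
    out_path = os.path.join(target_dir, "labels.json")   # S62: map sample into it
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(target_label_sample, f, ensure_ascii=False)
    return out_path
```

Keeping the sample and the source text under one path makes the reverse lookup (text → labels) a directory listing rather than a database query.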
Referring to fig. 8, fig. 8 shows a specific implementation of merging the target participles to obtain a target phrase, described in detail as follows:

S2A: performing part-of-speech tagging on the target participles to obtain part-of-speech participles.

Part-of-speech tagging, also known as grammatical tagging or part-of-speech disambiguation, is a text data processing technique in which the parts of speech of the words in a corpus are tagged according to their meaning and context during linguistic analysis. Part-of-speech tagging can be done manually or by a specific algorithm; implementing it with machine learning methods is a research topic of natural language processing. Common part-of-speech tagging algorithms include hidden Markov models, conditional random fields, and the like. In the embodiment of the present application, part-of-speech tagging is performed on the target participles to obtain the part-of-speech participles.

S2B: merging the part-of-speech participles that conform to the consistency rule by dependency syntax analysis to obtain the target phrase.

The consistency rule uses the subject-predicate-object (SBV) relation and marks the corresponding words. For example, "I eat apple" is labeled as (I, Subject), (eat, Predicate), (apple, Object); the extracted part-of-speech participles are mapped to these syntactic roles, and the participles that conform to the consistency rule are merged to obtain the target phrase.

In this embodiment, part-of-speech tagging is performed on the target participles to obtain the part-of-speech participles, and the part-of-speech participles conforming to the consistency rule are merged by dependency syntax analysis to obtain the target phrase. The merging of the target participles is thereby achieved, which facilitates subsequent data labeling.
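The "I eat apple" example can be sketched as follows; the (word, role) input format stands in for the output of a dependency parser, which is an assumption of this sketch:

```python
# Sketch of S2B: given participles tagged with syntactic roles by dependency
# syntax analysis, merge the words that satisfy the subject-predicate-object
# consistency rule into one target phrase.
def merge_sbv(tagged_participles):
    roles = ("Subject", "Predicate", "Object")
    kept = [word for word, role in tagged_participles if role in roles]
    return "".join(kept)   # Chinese phrases concatenate without spaces

phrase = merge_sbv([("我", "Subject"), ("吃", "Predicate"), ("苹果", "Object")])  # "我吃苹果"
```

Participles with other roles (auxiliaries, modifiers, and so on) are simply dropped, which is the filtering effect of the consistency rule.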
It should be emphasized that, in order to further ensure the privacy and security of the text to be annotated, the text to be annotated may also be stored in a node of a blockchain.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
Referring to fig. 9, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a data annotation device based on a generative model, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device may be applied to various electronic devices.
As shown in fig. 9, the data annotation apparatus based on generative model according to the present embodiment includes: a to-be-labeled text splitting module 71, a target phrase obtaining module 72, a label sample generating module 73, an initial parameter generating module 74, a labeling accuracy output module 75 and a label sample selecting module 76, wherein:
the to-be-labeled text splitting module 71 is configured to obtain a to-be-labeled text, and split the to-be-labeled text to obtain a split sentence;
a target phrase obtaining module 72, configured to perform word segmentation processing on the split sentences to obtain target words, and merge the target words to obtain target phrases;
the tag sample generation module 73 is configured to obtain multiple preset tagging rules, and tag the target phrase according to the multiple preset tagging rules, so as to obtain a tag sample corresponding to each preset rule;
an initial parameter generating module 74, configured to obtain the sample labeling probability of the label sample corresponding to each preset labeling rule for the target phrase, and obtain the initial parameters of the generative model according to the sample labeling probability and the label sample;

the labeling accuracy output module 75 is configured to iteratively update the initial parameters of the generative model according to the sample labeling probability to obtain a trained generative model, and output the labeling accuracy corresponding to the label sample according to the trained generative model;
and a label sample selecting module 76, configured to select a label sample with the highest labeling accuracy as a target label sample.
Further, the initial parameter generating module 74 includes:
the sample labeling probability obtaining unit is used for calculating the coverage rate of the label sample corresponding to each preset labeling rule on the target phrase, and taking the coverage rate as the sample labeling probability;
and the initialization processing unit is used for initializing the sample labeling probability and the label sample to obtain the initial parameters of the generative model.
Further, the labeling accuracy output module 75 includes:
the optimization characteristic value definition unit is used for taking the difference between the parameters of the generative model and the sample labeling probability as the optimization characteristic value;

the iterative update performing unit is used for back-propagating the sample labeling probability by stochastic gradient descent to iteratively update the initial parameters, wherein each iterative update produces new parameters of the generative model and changes the optimization characteristic value;

the iteration updating stopping unit is used for stopping the iterative updating when the optimization characteristic value reaches the preset threshold, to obtain the trained generative model;

and the labeling accuracy obtaining unit is used for outputting the labeling accuracy corresponding to the label sample through the trained generative model.
Further, the labeling accuracy obtaining unit includes:
the basic probability obtaining subunit is used for performing probability estimation on the label sample through the current parameters of the trained generative model to obtain the basic probability;
and the basic probability processing subunit is used for carrying out weighted average processing on the basic probability to obtain the labeling accuracy corresponding to the label sample.
Further, the to-be-tagged text splitting module 71 includes:
the basic text generating unit is used for acquiring a text to be labeled and preprocessing the text to be labeled to obtain a basic text;
the text separator acquisition unit is used for acquiring text separators contained in the basic text in a regular matching mode;
and the split statement generating unit is used for splitting the basic text through the text separators to obtain split statements.
Further, after the label sample selecting module 76, the data labeling apparatus based on generative model further includes:
the target storage path acquisition module is used for acquiring a storage path of the text to be marked as a target storage path;
and the data mapping module is used for mapping the target label sample into the target storage path in a preset data mapping mode.
Further, the target phrase obtaining module 72 further includes:
the part-of-speech participle generating unit is used for carrying out part-of-speech tagging on the target participle in a part-of-speech tagging mode to obtain part-of-speech participles;
and the target phrase generating unit is used for merging the part-of-speech participles which accord with the consistency rule according to the dependency syntax analysis mode to obtain the target phrase.
It should be emphasized that, in order to further ensure the privacy and security of the text to be annotated, the text to be annotated may also be stored in a node of a blockchain.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 10, fig. 10 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 8 includes a memory 81, a processor 82, and a network interface 83 communicatively connected to each other via a system bus. It is noted that only a computer device 8 having three components, a memory 81, a processor 82, and a network interface 83, is shown, but it should be understood that not all of the shown components are required to be implemented; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 81 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as a hard disk or a memory of the computer device 8. In other embodiments, the memory 81 may be an external storage device of the computer device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the computer device 8. Of course, the memory 81 may also include both internal and external storage devices of the computer device 8. In this embodiment, the memory 81 is generally used for storing the operating system installed in the computer device 8 and various types of application software, such as the program code of the data annotation method based on a generative model. Further, the memory 81 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 82 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 82 is typically used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to execute the program code stored in the memory 81 or process data, for example, execute the program code of the data annotation method based on generative model, so as to implement various embodiments of the data annotation method based on generative model.
The network interface 83 may include a wireless network interface or a wired network interface, and the network interface 83 is generally used to establish communication connections between the computer device 8 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium, which stores a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of a generative model-based data annotation method as described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method of the embodiments of the present application.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It is to be understood that the above-described embodiments are merely illustrative and not restrictive, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions of the foregoing embodiments may still be modified, or some of their features may be replaced by equivalents, without departing from the application. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A data labeling method based on a generative model is characterized by comprising the following steps:
acquiring a text to be marked, and splitting the text to be marked to obtain a split sentence;
performing word segmentation processing on the split sentences to obtain target words, and merging the target words to obtain target phrases;
acquiring a plurality of preset labeling rules, and labeling the target phrase respectively through the plurality of preset labeling rules to obtain a label sample corresponding to each preset rule;
acquiring the sample labeling probability of the label sample corresponding to each preset labeling rule for the target phrase, and obtaining initial parameters of the generative model according to the sample labeling probability and the label sample;

iteratively updating the initial parameters of the generative model according to the sample labeling probability to obtain a trained generative model, and outputting the labeling accuracy corresponding to the label sample according to the trained generative model;
and selecting the label sample with the highest labeling accuracy as a target label sample.
2. The generative model-based data labeling method according to claim 1, wherein the obtaining a sample labeling probability of the label sample corresponding to each of the preset labeling rules to the target phrase, and obtaining initial parameters of the generative model according to the sample labeling probability and the label sample comprises:
calculating the coverage rate of the label sample corresponding to each preset labeling rule on the target phrase, and taking the coverage rate as the sample labeling probability;
and initializing the sample labeling probability and the label sample to obtain the initial parameters of the generative model.
3. The generative model-based data annotation method of claim 1, wherein iteratively updating initial parameters of the generative model according to the sample annotation probability to obtain a trained generative model, and outputting the annotation accuracy corresponding to the label sample according to the trained generative model comprises:
taking the difference between the parameters of the generative model and the sample labeling probability as an optimization characteristic value;

back-propagating the sample labeling probability by stochastic gradient descent to iteratively update the initial parameters, wherein each iterative update produces new parameters of the generative model and changes the optimization characteristic value;

stopping the iterative updating when the optimization characteristic value reaches a preset threshold value, to obtain the trained generative model;

and outputting the labeling accuracy corresponding to the label sample through the trained generative model.
4. The generative model-based data annotation method of claim 3, wherein the outputting of the annotation accuracy corresponding to the label sample by the trained generative model comprises:
performing probability estimation on the label sample through the current parameters of the trained generative model to obtain a basic probability, wherein the current parameters are the parameters obtained by iterative updating when the optimization characteristic value reaches the preset threshold value;
and carrying out weighted average processing on the basic probability to obtain the labeling accuracy corresponding to the label sample.
5. The data annotation method based on the generative model of claim 1, wherein the obtaining of the text to be annotated and the splitting of the text to be annotated to obtain the split sentence comprises:
acquiring the text to be marked, and preprocessing the text to be marked to obtain a basic text;
acquiring text separators contained in the basic text in a regular matching mode;
and splitting the basic text through the text separator to obtain the split sentence.
6. The generative model-based data annotation method of claim 1, wherein after said selecting the label exemplar with the highest annotation accuracy as a target label exemplar, the method further comprises:
acquiring a storage path of the text to be marked as a target storage path;
and mapping the target label sample into the target storage path in a preset data mapping mode.
7. The generative model-based data tagging method according to any one of claims 1 to 6, wherein the merging the target participles to obtain a target phrase comprises:
performing part-of-speech tagging on the target participle in a part-of-speech tagging mode to obtain part-of-speech participles;
and merging the part-of-speech participles which accord with the consistency rule according to a dependency syntax analysis mode to obtain the target phrase.
8. A data annotation device based on generative models, comprising:
the system comprises a to-be-labeled text splitting module, a to-be-labeled text extracting module and a to-be-labeled text extracting module, wherein the to-be-labeled text splitting module is used for acquiring a to-be-labeled text and splitting the to-be-labeled text to obtain a split statement;
the target phrase acquisition module is used for performing word segmentation processing on the split sentences to obtain target words, and merging the target words to obtain target phrases;
the label sample generation module is used for acquiring a plurality of preset labeling rules and labeling the target phrase through the plurality of preset labeling rules respectively to obtain a label sample corresponding to each preset rule;
the initial parameter generation module is used for acquiring the sample labeling probability of the label sample corresponding to each preset labeling rule for the target phrase, and obtaining the initial parameters of the generative model according to the sample labeling probability and the label sample;

the labeling accuracy output module is used for iteratively updating the initial parameters of the generative model according to the sample labeling probability to obtain a trained generative model, and outputting the labeling accuracy corresponding to the label sample according to the trained generative model;
and the label sample selecting module is used for selecting the label sample with the highest labeling accuracy as a target label sample.
9. A computer device comprising a memory in which a computer program is stored and a processor which, when executing the computer program, implements the generative model-based data annotation method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the generative model-based data annotation method according to any one of claims 1 to 7.
CN202110193454.5A 2021-02-20 2021-02-20 Data labeling method, device and equipment based on generative model and storage medium Pending CN112860919A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110193454.5A CN112860919A (en) 2021-02-20 2021-02-20 Data labeling method, device and equipment based on generative model and storage medium
PCT/CN2021/083758 WO2022174496A1 (en) 2021-02-20 2021-03-30 Data annotation method and apparatus based on generative model, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110193454.5A CN112860919A (en) 2021-02-20 2021-02-20 Data labeling method, device and equipment based on generative model and storage medium

Publications (1)

Publication Number Publication Date
CN112860919A true CN112860919A (en) 2021-05-28

Family

ID=75988385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110193454.5A Pending CN112860919A (en) 2021-02-20 2021-02-20 Data labeling method, device and equipment based on generative model and storage medium

Country Status (2)

Country Link
CN (1) CN112860919A (en)
WO (1) WO2022174496A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080016019A1 (en) * 2006-02-09 2008-01-17 Ebay Inc. Method and system to analyze rules based on popular query coverage
CN110196908A (en) * 2019-04-17 2019-09-03 深圳壹账通智能科技有限公司 Data classification method, device, computer installation and storage medium
CN112084752A (en) * 2020-09-08 2020-12-15 中国平安财产保险股份有限公司 Statement marking method, device, equipment and storage medium based on natural language

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11132507B2 (en) * 2019-04-02 2021-09-28 International Business Machines Corporation Cross-subject model-generated training data for relation extraction modeling
CN111507104B (en) * 2020-03-19 2022-03-25 北京百度网讯科技有限公司 Method and device for establishing label labeling model, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jingru Yang et al.: "Cost-Effective Data Annotation using Game-Based Crowdsourcing", Proceedings of the VLDB Endowment, vol. 12, no. 1, pp. 57-70, XP058423373, DOI: 10.14778/3275536.3275541 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515587A (en) * 2021-06-02 2021-10-19 中国神华国际工程有限公司 Object information extraction method and device, computer equipment and storage medium
CN113590729A (en) * 2021-07-30 2021-11-02 博米智能科技(杭州)有限公司 Building equipment point location identification method and device, computer equipment and storage medium
CN113590729B (en) * 2021-07-30 2023-06-20 博米智能科技(杭州)有限公司 Building equipment point location identification method and device, computer equipment and storage medium
CN113761577A (en) * 2021-09-10 2021-12-07 平安科技(深圳)有限公司 Big data desensitization method and device, computer equipment and storage medium
WO2023168964A1 (en) * 2022-03-07 2023-09-14 华为云计算技术有限公司 Data segmentation method and related apparatus

Also Published As

Publication number Publication date
WO2022174496A1 (en) 2022-08-25

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
CN109493977B (en) Text data processing method and device, electronic equipment and computer readable medium
CN107273503B (en) Method and device for generating parallel text in same language
CN112860919A (en) Data labeling method, device and equipment based on generative model and storage medium
CN107861954B (en) Information output method and device based on artificial intelligence
CN110795938B (en) Text sequence word segmentation method, device and storage medium
US20230244704A1 (en) Sequenced data processing method and device, and text processing method and device
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN111709240A (en) Entity relationship extraction method, device, equipment and storage medium thereof
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
US20230103728A1 (en) Method for sample augmentation
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN112528654A (en) Natural language processing method and device and electronic equipment
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN110717333A (en) Method and device for automatically generating article abstract and computer readable storage medium
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN110442858B (en) Question entity identification method and device, computer equipment and storage medium
CN113268560A (en) Method and device for text matching
CN116167382A (en) Intention event extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination