WO2022174496A1 - Method and apparatus for data labeling based on a generative model, device and storage medium - Google Patents

Method and apparatus for data labeling based on a generative model, device and storage medium

Info

Publication number
WO2022174496A1
Authority
WO
WIPO (PCT)
Prior art keywords
labeling
sample
label
probability
text
Prior art date
Application number
PCT/CN2021/083758
Other languages
English (en)
Chinese (zh)
Inventor
李薿
陈曦
崔艳
庄伯金
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022174496A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/38 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/381 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/215 - Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a data labeling method, apparatus, device and storage medium based on a generative model.
  • the existing solution obtains the word sequence corresponding to the original text, transforms and maps the word sequence to obtain an entity labeling vector, and counts the number of preset entity information items in the entity labeling vector, so as to realize data labeling.
  • because this kind of labeling is obtained by transforming and mapping word vectors, it is prone to data labeling errors, resulting in low labeling accuracy on large-scale data.
  • the purpose of the embodiments of the present application is to propose a data labeling method, apparatus, device and storage medium based on a generative model, so as to improve the accuracy of data labeling.
  • the embodiment of the present application provides a data labeling method based on a generative model, including:
  • a target word segmentation is obtained by performing word segmentation processing on the split sentence, and the target word segments are merged to obtain a target phrase;
  • the label sample with the highest labeling accuracy is selected as the target label sample.
  • an embodiment of the present application provides a data labeling device based on a generative model, including:
  • a to-be-labeled text splitting module, configured to obtain the text to be labeled and split it to obtain split sentences;
  • a target phrase acquisition module, configured to perform word segmentation processing on the split sentences to obtain target word segments, and to merge the target word segments to obtain target phrases;
  • a label sample generation module, configured to obtain multiple preset labeling rules and respectively label the target phrase through them, obtaining a label sample corresponding to each preset rule;
  • an initial parameter generation module, configured to obtain the sample labeling probability of each preset labeling rule's label sample for the target phrase, and to obtain the initial parameters of the generative model according to the sample labeling probability and the label samples;
  • a labeling accuracy output module, configured to iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and to output the labeling accuracy corresponding to each label sample through the trained generative model;
  • a label sample selection module, configured to select the label sample with the highest labeling accuracy as the target label sample.
  • a technical solution adopted in the present application provides a computer device, including a memory and a processor, wherein the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • a target word segmentation is obtained by performing word segmentation processing on the split sentence, and the target word segments are merged to obtain a target phrase;
  • the label sample with the highest labeling accuracy is selected as the target label sample.
  • another technical solution adopted in this application is a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
  • a target word segmentation is obtained by performing word segmentation processing on the split sentence, and the target word segments are merged to obtain a target phrase;
  • the label sample with the highest labeling accuracy is selected as the target label sample.
  • Embodiments of the present application provide a method, apparatus, device, and storage medium for data labeling based on a generative model.
  • the data is labeled through multiple preset rules, and the label sample with the highest data labeling accuracy is selected according to the generative model, which is beneficial to improving the accuracy of data labeling.
  • FIG. 1 is a schematic diagram of an application environment of a generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 2 is an implementation flowchart of a generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 3 is an implementation flowchart of a sub-process in the generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 4 is another implementation flowchart of a sub-process in the generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 5 is another implementation flowchart of a sub-process in the generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 6 is another implementation flowchart of a sub-process in the generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 7 is another implementation flowchart of a sub-process in the generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 8 is another implementation flowchart of a sub-process in the generative-model-based data labeling method provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a generative-model-based data labeling device provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, search applications, instant communication tools, and the like.
  • the terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • the data labeling method based on the generative model provided in the embodiment of the present application is generally executed by the server, and accordingly, the data labeling apparatus based on the generative model is generally configured in the server.
  • the terminal devices, networks and servers in FIG. 1 are merely illustrative; there can be any number of terminal devices, networks and servers according to implementation needs.
  • FIG. 2 shows a specific implementation manner of a data labeling method based on a generative model.
  • the method of the present application is not limited to the flow sequence shown in FIG. 2, and the method includes the following steps:
  • S1 Obtain the text to be labeled, and split the text to be labeled to obtain split sentences. After the server obtains the text to be labeled, it preprocesses it, for example by data cleaning, and then splits the text to be labeled into sub-sentences such as paragraphs and sentences.
  • the text to be labeled is the data that needs to be labeled, so as to generate text with labeled labels.
  • at this point, the text to be labeled has been split into split sentences, which exist in the form of short sentences.
  • S2 Perform word segmentation processing on the split sentences to obtain target word segments, and merge the target word segments to obtain target phrases. The split sentences are processed by word segmentation to generate the target word segments; each target word segment is then tagged with its part of speech, and the target word segments are merged by means of dependency syntax analysis to generate the target phrase.
  • preset word segmentation tools include, but are not limited to, jieba ("stuttering" word segmentation), the NLPIR word segmentation system, SnowNLP, and so on. In this embodiment, jieba is used to segment the split sentences to obtain the target word segments. jieba's accurate mode cuts a sentence most precisely and is suitable for text analysis, while its full mode scans all the character sequences in the sentence that can form words and is relatively fast.
  • dependency syntax analysis was first proposed by the French linguist Lucien Tesnière. It analyzes a sentence into a dependency syntax tree and describes the dependency relationships between the words, that is, it points out the syntactic collocations between words, which are related to semantics.
  • the target word segmentation is merged by means of dependency syntax analysis.
  • S3 Acquire multiple preset labeling rules, label the target phrase through each of the preset labeling rules respectively, and obtain a label sample corresponding to each preset rule.
  • specifically, the target phrase is labeled with the various labeling rules, and the generative model then determines the accuracy of data labeling under each rule, so that the label sample with the highest accuracy can be selected to complete the labeling of the data. Therefore, the server obtains multiple preset labeling rules and tags the target phrase with the corresponding labels according to each preset labeling rule, so that the target phrase yields a label sample corresponding to each preset rule.
  • the multiple preset labeling rules include, but are not limited to: regular recognition, remote-matching knowledge base recognition, and external data matching.
  • regular recognition refers to matching the corresponding labeling rules by presetting different SQL query statements, so that different rules are applied to label the target phrase.
  • remote-matching knowledge base recognition means that the target phrase is annotated by matching it, item by item, against the knowledge base of a peripheral device.
  • external data matching refers to matching the target phrase against external data provided by, for example, a crowdsourcing platform, so as to complete the labeling of the target phrase.
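  • purely as a hedged illustration of what such preset labeling rules could look like in code (the regex pattern, knowledge base and external data below are invented placeholders, not the rules of the present application):

    # A sketch of the three kinds of preset labeling rules as labeling
    # functions; each returns a label or None when it cannot label a phrase.
    import re

    KNOWLEDGE_BASE = {"苹果": "FRUIT"}          # hypothetical remote knowledge base
    EXTERNAL_DATA = {"吃苹果": "ACTION_PHRASE"}  # hypothetical crowdsourced data

    def rule_regex(phrase):
        # Regular recognition: a preset pattern stands in for the query statement.
        return "NUMBER" if re.search(r"\d+", phrase) else None

    def rule_knowledge_base(phrase):
        # Remote matching: match the phrase against the knowledge base one by one.
        return KNOWLEDGE_BASE.get(phrase)

    def rule_external_data(phrase):
        # External data matching, e.g. against crowdsourcing-platform data.
        return EXTERNAL_DATA.get(phrase)

    PRESET_RULES = [rule_regex, rule_knowledge_base, rule_external_data]

    def label_samples(target_phrases):
        # One label sample per preset rule: rule index -> {phrase: label or None}.
        return {i: {p: rule(p) for p in target_phrases}
                for i, rule in enumerate(PRESET_RULES)}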
  • in this way, the accuracy of data labeling under the various methods can be screened, thereby improving the accuracy of data labeling.
  • S4 Obtain the sample labeling probability of each preset labeling rule's label sample for the target phrase, and obtain the initial parameters of the generative model according to the sample labeling probability and the label samples.
  • the sample labeling probability refers to the coverage rate of the target phrases by the sample labels obtained using a preset labeling rule; it is used subsequently to iteratively update the parameters of the generative model.
  • since each preset labeling rule has a different sample labeling probability for different target phrases, it is necessary to first obtain the sample labeling probability corresponding to each preset labeling rule.
  • the server also obtains the initial estimated parameters of the generative model after initializing with the sample labeling probability and the label samples, that is, it obtains the initial parameters of the generative model.
  • a generative model refers to a model that can randomly generate observed data, especially given some implicit parameters.
  • Generative models assign a joint probability distribution to observations and labeled data sequences.
  • in the present application, the implicit parameters correspond to the true labels of the target phrases, the observed values correspond to the sample labeling probabilities, and the labeled data sequence corresponds to the label samples; therefore, a model that randomly generates the observed data according to the implicit parameters, that is, the true labels, can determine the labeling probability of each preset labeling rule for the target phrase.
  • S5 Iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy rate corresponding to the label sample through the trained generative model.
  • the initial parameters of the generative model are fitted by the sample labeling probabilities, and the sample labeling probabilities are back-propagated by means of stochastic gradient descent to iteratively update the initial parameters, so that the parameters of the generative model gradually approach the sample labeling probabilities, thereby obtaining a trained generative model.
  • the parameters of the trained generative model are then used to estimate the probability of the label samples, and after weighted-average processing, the labeling accuracy of the label samples under each preset rule is obtained.
  • iterative updating refers to fitting the initial parameters of the generative model through the sample labeling probability, and back-propagating the sample labeling probability by stochastic gradient descent to iteratively recalculate the initial parameters, so that the parameters of the generative model approach the sample labeling probabilities.
  • S6 At this point, the labeling accuracy of the label samples under each preset labeling rule has been obtained, so the label sample with the highest labeling accuracy is selected as the target label sample. Labeling the target phrase with multiple labeling rules and selecting the label sample with the highest accuracy is conducive to improving the accuracy of data labeling.
  • in this embodiment, a target phrase is obtained, which facilitates the subsequent data labeling of the text to be labeled according to the target phrase; multiple preset labeling rules are then obtained, and the target phrase is labeled through them, yielding a label sample corresponding to each preset labeling rule; the sample labeling probability of each label sample for the target phrase is then obtained, and the initial parameters of the generative model are derived from the sample labeling probability and the label samples; the initial parameters are iteratively updated through the sample labeling probability to obtain a trained generative model, which outputs the labeling accuracy corresponding to each label sample; finally, the label sample with the highest labeling accuracy is selected as the target label sample. Labeling data through multiple preset rules and selecting the label sample with the highest data labeling accuracy according to the generative model is beneficial to improving the accuracy of data labeling.
  • FIG. 3 shows a specific implementation of step S4.
  • in step S4, the sample labeling probability of each preset labeling rule's label sample for the target phrase is obtained, and the initial parameters of the generative model are obtained according to the sample labeling probability and the label samples. The specific implementation process is described in detail as follows:
  • S41 Calculate the coverage rate of the target phrase by the label samples corresponding to each preset labeling rule, and use the coverage rate as the sample labeling probability.
  • the sample labeling probability needs to be obtained first. Therefore, the coverage rate of the target phrase by the label samples corresponding to each preset labeling rule is calculated, and the coverage rate is used as the sample labeling probability.
  • the coverage rate is obtained by calculating the coverage degree of the target phrase by the tag sample.
  • for example, when the target phrase is labeled by means of remote-matching knowledge base recognition, some target word segments in the target phrase may fail to match the knowledge base of the peripheral device one by one; those target phrases cannot be labeled in this way, so their labeling fails. If every target word segment in the target phrase matches the knowledge base one by one, the target phrase is labeled successfully. Dividing the number of successfully labeled target phrases by the total number of target phrases gives the coverage rate of the remote-matching knowledge base method for the target phrases, and this coverage rate is taken as the sample labeling probability. For example, if the number of successfully labeled target phrases is 9,000 and the total number of target phrases is 10,000, the sample labeling probability is 90%.
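  • a minimal sketch of this coverage-rate calculation (the label-sample format follows the hypothetical structure from the earlier sketch, not a format fixed by the disclosure):

    def sample_labeling_probability(label_sample):
        # Coverage rate = successfully labeled phrases / total phrases.
        total = len(label_sample)
        labeled = sum(1 for label in label_sample.values() if label is not None)
        return labeled / total if total else 0.0

    # With 9,000 of 10,000 target phrases labeled, this returns 0.9 (90%).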
  • S42 Initialize the sample labeling probability and the label samples to obtain the initial parameters of the generative model. The initialization process refers to assigning estimated parameter values to the initial parameters of the generative model according to the sample labeling probability and the label samples, so that the initial parameters of the generative model are obtained.
  • in this embodiment, the coverage rate is calculated and used as the sample labeling probability, and the initial parameters of the generative model are obtained by initialization, which facilitates the subsequent training of the generative model and thereby improves the accuracy of data labeling.
  • FIG. 4 shows a specific implementation of step S5.
  • in step S5, the initial parameters of the generative model are iteratively updated through the sample labeling probability to obtain a trained generative model, and the labeling accuracy corresponding to the label samples is output through the trained generative model. The specific implementation process is described in detail as follows:
  • S51 During training, the parameters of the generative model are iteratively updated so that they constantly approach the sample labeling probability; therefore, the difference between the parameters of the generative model and the sample labeling probability is used as the optimization eigenvalue, and the optimization eigenvalue is evaluated to judge how far the training of the generative model has progressed.
  • because the target phrase is labeled according to multiple preset labeling rules and the generative model is trained on these labels, the generative model's estimate of the true label of the target phrase is better than random guessing; and because the parameters of the generative model are used to estimate the accuracy of the label samples, while the sample labeling probability is calculated from the coverage of the total number of target phrases by the number of successfully labeled ones, the closer the parameters of the generative model are to the sample labeling probability, that is, the smaller the optimization eigenvalue, the closer the generative model is to completing its training.
  • for example, suppose the initial optimization eigenvalue is 0.52. After continuous iterative updating, the optimization eigenvalue gradually becomes smaller; when it reaches 0.01, the parameters are already close to the sample labeling probability, and the iterative updating ends.
  • S52 The sample labeling probability is back-propagated to iteratively update the initial parameters; each update calculation yields new parameters for the generative model, and the difference between the new parameters and the sample labeling probability is calculated to obtain a new optimization eigenvalue. Since the optimization eigenvalue is computed from the difference between the parameters of the generative model and the sample labeling probability, and the parameters change after every iterative update, each iterative update changes the optimization eigenvalue.
  • the gradient descent method is a kind of iterative method, which can be used to solve the least squares problem.
  • Gradient Descent is one of the most commonly used methods when solving model parameters of machine learning algorithms, i.e. unconstrained optimization problems.
  • the gradient descent method can be used to solve such a problem iteratively, step by step, obtaining the minimized loss function and the model parameter values; conversely, if the maximum value of the loss function is required, gradient ascent is used to iterate.
  • two gradient descent methods have been developed based on the basic gradient descent method, namely stochastic gradient descent and batch gradient descent.
  • the method of stochastic gradient descent is used to back-propagate the sample labeling probability to iteratively update the initial parameters.
  • the back-propagation algorithm is a learning algorithm suitable for multi-layer neuron networks, which is based on the gradient descent method.
  • the input-output relationship of a back-propagation network is essentially a mapping: the function computed by a back-propagation neural network with n inputs and m outputs is a continuous mapping from n-dimensional Euclidean space to a finite field in m-dimensional Euclidean space, and this mapping is highly nonlinear.
  • the sample labeling probability is input into the input layer of the neural network, passes through the hidden layer, and finally reaches the output layer and outputs the result.
  • This process is a forward propagation process.
  • then, the error between the output of the neural network and the actual result, which is also the optimization eigenvalue, is calculated, and the optimization eigenvalue is back-propagated from the output layer to the hidden layer, and onward until it is propagated to the input layer; during back-propagation, the parameter values are adjusted by stochastic gradient descent according to the optimization eigenvalue, so that the optimization eigenvalue is reduced.
  • S53 The above steps are iterated until the optimization eigenvalue reaches a preset threshold.
  • when the optimization eigenvalue reaches the preset threshold, the parameters of the generative model are very close to the sample labeling probability; at this point, the updating of the parameters of the generative model stops, and a trained generative model is obtained.
  • the preset threshold is set according to the actual situation and is not limited here; in a specific embodiment, the preset threshold is 0.01.
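  • the following is a deliberately simplified sketch of this iterative update (the learning rate, starting values and absolute-difference form of the optimization eigenvalue are assumptions; a full implementation would back-propagate through the generative model's likelihood rather than pull parameters directly):

    def train_generative_model(initial_params, sample_probs,
                               threshold=0.01, lr=0.1, max_iters=10_000):
        params = list(initial_params)
        for _ in range(max_iters):
            # Optimization eigenvalue: difference between the model parameters
            # and the sample labeling probabilities.
            eigenvalue = sum(abs(p - q) for p, q in zip(params, sample_probs))
            if eigenvalue <= threshold:
                break  # parameters are close enough; training is complete
            # Gradient-style step pulling each parameter toward its probability.
            params = [p - lr * (p - q) for p, q in zip(params, sample_probs)]
        return params

    # Hypothetical initial parameters and per-rule sample labeling probabilities:
    trained = train_generative_model([0.5, 0.3, 0.7], [0.9, 0.8, 0.85])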
  • S54 The above steps have produced a trained generative model; probability estimation is then performed on the label samples through the trained generative model, and the labeling accuracy corresponding to the label samples is output.
  • in this embodiment, the difference between the parameters of the generative model and the sample labeling probability is used as the optimization eigenvalue, and the sample labeling probability is back-propagated by stochastic gradient descent to iteratively update the initial parameters; when the optimization eigenvalue reaches the preset threshold, the iterative updating stops and a trained generative model is obtained, which outputs the labeling accuracy corresponding to each label sample. Training the generative model and outputting the labeling accuracy in this way is beneficial to improving the accuracy of data labeling.
  • FIG. 5 shows a specific implementation of step S54.
  • in step S54, the labeling accuracy corresponding to the label samples is output through the trained generative model. The specific implementation process is described in detail as follows:
  • S541 Perform probability estimation on the label samples by using the current parameters of the trained generative model to obtain the basic probability.
  • probability estimation is performed on the label samples through the current parameters to obtain the basic probability, which facilitates further processing of the basic probability to obtain the final labeling accuracy.
  • the current parameters refer to the parameters of the generative model obtained by iterative updating at the moment the optimization eigenvalue reaches the preset threshold.
  • a generative model refers to a model that can randomly generate observational data, especially given some implicit parameters.
  • Generative models assign a joint probability distribution to observations and labeled data sequences.
  • in the present application, the implicit parameters correspond to the true labels of the target phrases, the observed values correspond to the sample labeling probabilities, and the labeled data sequence corresponds to the label samples; therefore, the model that randomly generates the observed data according to the implicit parameters, that is, the true labels, composed of its current parameters, can perform probability estimation of each preset labeling rule on the label samples, so as to obtain the basic probability.
  • S542 Perform weighted average processing on the basic probability to obtain the labeling accuracy rate corresponding to the labeling sample.
  • through weighted-average processing, the resulting labeling accuracy is made more accurate.
  • in this embodiment, probability estimation is performed on the label samples through the current parameters of the trained generative model to obtain the basic probability, and weighted-average processing is performed on the basic probability to obtain the labeling accuracy corresponding to each label sample, so that the generated labeling accuracy is more accurate, thereby improving the accuracy of data labeling.
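  • a small sketch of the weighted-average processing (the weights are an assumption, for example proportional to each rule's coverage; the disclosure does not fix the weighting scheme):

    def labeling_accuracy(basic_probs, weights):
        # Weighted average of the per-rule basic probabilities.
        total = sum(weights)
        return (sum(p * w for p, w in zip(basic_probs, weights)) / total
                if total else 0.0)

    # e.g. three rules' basic probabilities, weighted by assumed coverages:
    acc = labeling_accuracy([0.88, 0.92, 0.79], weights=[0.9, 0.8, 0.85])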
  • FIG. 6 shows a specific implementation of step S1.
  • in step S1, the text to be labeled is obtained and split to obtain split sentences. The specific implementation process is described in detail as follows:
  • S11 Acquire the text to be labeled, and preprocess the text to be labeled to obtain basic text.
  • the preprocessing includes data cleaning of the text to be annotated.
  • data cleaning refers to the process of re-examining and verifying data, with the purpose of removing duplicate information, correcting existing errors, and providing data consistency.
  • S12 A regular matching method is used to obtain the text separators contained in the basic text, which are used to segment the text in subsequent steps.
  • the text separators include format separators and punctuation separators.
  • the format separator refers to a separator that divides the text according to the text encoding type or the text structure.
  • through the format separator, the basic text is split according to the encoding type of the text or the structure of the text.
  • the punctuation separator refers to a separator that divides the text according to punctuation characters, allowing the basic text to be split quickly.
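  • a minimal sketch of splitting basic text on regex-matched separators (the pattern combining punctuation separators with newline format separators is an assumption):

    import re

    def split_text(basic_text):
        # Punctuation separators (。！？；) and format separators (newlines).
        parts = re.split(r"[。！？；\n]+", basic_text)
        return [p.strip() for p in parts if p.strip()]

    sentences = split_text("今天天气很好。我们去公园吧！")
    # -> ["今天天气很好", "我们去公园吧"]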
  • in this embodiment, the basic text is obtained by acquiring and preprocessing the text to be labeled, the text separators contained in the basic text are obtained by regular matching, and the basic text is split by the text separators to obtain split sentences, which facilitates the subsequent generation of target phrases and the labeling of the corresponding labels.
  • FIG. 7 shows a specific implementation after step S6. This embodiment includes:
  • S61 Obtain the storage path of the text to be labeled as the target storage path.
  • S62 Map the target label sample to the target storage path by using a preset data mapping method.
  • storing the target label sample and the file to be labeled under the same path makes it convenient to query the target label sample corresponding to the text to be labeled.
  • preset data mapping methods include, but are not limited to, manual coding (hand-coded) and visual operation (graphical, manual).
  • manual coding defines data correspondences directly in languages such as XSLT, Java, or C++; visual operations usually let users draw a line between data items to define the correspondence between them.
  • the target label samples are mapped into the target storage path through a visualization operation.
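  • a minimal sketch of steps S61 and S62 (the file name and JSON format are assumptions; the disclosure itself describes hand-coded or visual mapping tools rather than this particular code):

    import json
    import os

    def map_to_target_path(text_path, target_label_sample):
        # S61: the directory of the text to be labeled is the target storage path.
        target_dir = os.path.dirname(text_path)
        # S62: store the label sample alongside the text for easy lookup.
        out_path = os.path.join(target_dir, "target_label_sample.json")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(target_label_sample, f, ensure_ascii=False, indent=2)
        return out_path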
  • FIG. 8 shows the specific implementation process of merging the target word segments to obtain the target phrase, which is described in detail as follows:
  • S2A Perform part-of-speech tagging on the target word segments to obtain part-of-speech segments.
  • part-of-speech tagging, also known as grammar tagging or part-of-speech disambiguation, is a text data processing technique in corpus linguistics that labels the part of speech of each word in a corpus according to its meaning and context.
  • Part-of-speech tagging can be done manually or by a specific algorithm.
  • implementing part-of-speech tagging with machine learning methods is a research topic in natural language processing.
  • Common part-of-speech tagging algorithms include Hidden Markov Models, Conditional Random Fields, etc.
  • the target word segmentation is marked with the part of speech by means of part of speech tagging to obtain the part of speech segmentation.
  • S2B According to the method of dependency syntax analysis, the part-of-speech segments that conform to the consistency rules are merged to obtain the target phrase.
  • the consistency rule uses the subject-verb-object (SBV) relationship and marks the corresponding words; for example, "I eat apples" is marked as (I, Subject), (eat, Predicate), (apples, Object). The extracted part-of-speech segments correspond to these sentence components, and the part-of-speech segments that conform to the consistency rules are merged to obtain the target phrase.
  • in this embodiment, part-of-speech tagging is applied to the target word segments to obtain part-of-speech segments, and according to dependency syntax analysis, the part-of-speech segments that conform to the consistency rules are merged to obtain the target phrase; the merging of target word segments is thus realized, which facilitates subsequent data labeling.
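  • a hedged sketch of S2A and S2B (jieba.posseg is one possible tagging tool, not necessarily the one used by the disclosure; the dependency triples are hand-written placeholders, since a real system would obtain them from a dependency parser):

    import jieba.posseg as pseg

    def pos_tag(sentence):
        # S2A: tag each target word segment with its part of speech.
        return [(word, flag) for word, flag in pseg.cut(sentence)]

    def merge_sbv(dependency_triples):
        # S2B: merge words standing in a subject-verb-object (SBV-style)
        # relation into one target phrase, e.g. ("我", "吃", "苹果") -> "我吃苹果".
        return ["".join(triple) for triple in dependency_triples]

    print(pos_tag("我吃苹果"))               # e.g. [("我", "r"), ("吃", "v"), ("苹果", "n")]
    print(merge_sbv([("我", "吃", "苹果")]))  # ["我吃苹果"]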
  • the above-mentioned text to be marked may also be stored in a node of a blockchain.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a data labeling device based on a generative model.
  • the device embodiment corresponds to the method embodiment shown in FIG. 2.
  • the device can be applied to various electronic devices.
  • the generative-model-based data labeling device in this embodiment includes: a to-be-labeled text splitting module 71, a target phrase acquisition module 72, a label sample generation module 73, an initial parameter generation module 74, a labeling accuracy output module 75, and a label sample selection module 76, wherein:
  • the to-be-labeled text splitting module 71 is used to obtain the to-be-labeled text, and to split the to-be-labeled text to obtain a split statement;
  • the target phrase acquisition module 72 is used to obtain the target word segmentation by performing word segmentation processing on the split sentence, and merge the target word segmentation to obtain the target phrase;
  • the label sample generation module 73 is configured to obtain multiple preset labeling rules, and respectively label the target phrase through the multiple preset labeling rules to obtain label samples corresponding to each preset rule;
  • the initial parameter generation module 74 is used to obtain the sample labeling probability of each preset labeling rule's label sample for the target phrase, and to obtain the initial parameters of the generative model according to the sample labeling probability and the label samples;
  • the labeling accuracy rate output module 75 is configured to iteratively update the initial parameters of the generative model through the sample labeling probability to obtain a trained generative model, and output the labeling accuracy rate corresponding to the label sample through the trained generative model;
  • the label sample selection module 76 is configured to select the label sample with the highest label accuracy rate as the target label sample.
  • the initial parameter generation module 74 includes:
  • the sample labeling probability obtaining unit is used to calculate the coverage rate of the target phrase by the label sample corresponding to each preset labeling rule, and use the coverage rate as the sample labeling probability;
  • the initialization processing unit is used to initialize the sample label probability and the label sample to obtain the initial parameters of the generative model.
  • the labeling accuracy output module 75 includes:
  • the optimization eigenvalue definition unit is used to use the difference between the parameters of the generated model and the sample labeling probability as the optimization eigenvalue;
  • the iterative update unit is used to back-propagate the sample labeling probability by means of stochastic gradient descent to iteratively update the initial parameters, wherein each iterative update yields new parameters of the generative model and changes the optimization eigenvalue;
  • the iterative update stop unit is used to stop the iterative update when the optimization eigenvalue reaches a preset threshold, to obtain a trained generative model;
  • the labeling accuracy rate obtaining unit is used to output the labeling accuracy rate corresponding to the label sample through the trained generation model.
  • the labeling accuracy obtaining unit includes:
  • the basic probability acquisition subunit is used to perform probability estimation on the label samples through the current parameters of the trained generative model to obtain the basic probability;
  • the basic probability processing sub-unit is used to perform weighted average processing on the basic probability to obtain the labeling accuracy corresponding to the label sample.
  • the to-be-labeled text splitting module 71 includes:
  • the basic text generation unit is used to obtain the text to be labeled, and to preprocess the text to be labeled to obtain the basic text;
  • the text separator obtaining unit is used to obtain the text separator contained in the basic text by means of regular matching;
  • the split statement generation unit is used to split the basic text by the text separator to obtain the split statement.
  • the data labeling device based on the generative model also includes:
  • the target storage path obtaining module is used to obtain the storage path of the text to be marked as the target storage path;
  • the data mapping module is used to map the target label samples to the target storage path through a preset data mapping method.
  • the target phrase acquisition module 72 further includes:
  • the part-of-speech and word-segmentation generating unit is used to perform part-of-speech tagging on the target word by means of part-of-speech tagging to obtain part-of-speech segmentation;
  • the target phrase generation unit is used to combine the part-of-speech segmentations that conform to the consistency rules according to the method of dependency syntax analysis to obtain the target phrase.
  • the above-mentioned text to be marked may also be stored in a node of a blockchain.
  • FIG. 10 is a block diagram of the basic structure of a computer device according to this embodiment.
  • the computer device 8 includes a memory 81, a processor 82, and a network interface 83 that are connected to each other through a system bus. It should be pointed out that the figure only shows the computer device 8 with these three components, but it should be understood that not all of the components shown are required; more or fewer components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and so on.
  • a computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • the target word segmentation is obtained by performing word segmentation processing on the split sentence, and the target word segments are merged to obtain the target phrase;
  • the computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing equipment.
  • Computer devices can interact with users through keyboards, mice, remote controls, touchpads, or voice-activated devices.
  • the memory 81 includes at least one type of readable storage medium, which may be non-volatile or volatile, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, and so on.
  • the memory 81 may be an internal storage unit of the computer device 8 , such as a hard disk or memory of the computer device 8 .
  • the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device.
  • the memory 81 is generally used to store the operating system and various application software installed on the computer device 8 , such as computer-readable instructions of the data labeling method based on the generation model, and the like.
  • the memory 81 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 82 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • the processor 82 is typically used to control the overall operation of the computer device 8 .
  • the processor 82 is configured to run the computer-readable instructions stored in the memory 81 or to process data, for example to run the computer-readable instructions of the above-mentioned generative-model-based data labeling method, so as to implement the various embodiments of the generative-model-based data labeling method.
  • the network interface 83 may comprise a wireless network interface or a wired network interface, and the network interface 83 is typically used to establish a communication connection between the computer device 8 and other electronic devices.
  • the present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to cause the at least one processor to perform the following steps:
  • the target word segmentation is obtained by performing word segmentation processing on the split sentence, and the target word segments are merged to obtain the target phrase;
  • the method of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application can be embodied in the form of a software product in essence or in a part that contributes to the prior art, and the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods of the various embodiments of the present application.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Animal Behavior & Ethology (AREA)
  • Library & Information Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and apparatus for data annotation based on a generative model, as well as a device and a storage medium, which belong to the technical field of artificial intelligence and can be applied to the field of natural language processing. The method comprises: acquiring text to be annotated, then performing splitting, word segmentation and merging processing on said text to obtain a target phrase; annotating the target phrase according to multiple preset annotation rules to obtain label samples; then acquiring the sample annotation probability of the label samples for the target phrase, iteratively updating the initial parameters of a generative model according to the sample annotation probability to obtain a trained generative model, and then outputting an annotation accuracy by means of the trained generative model; and determining a target label sample according to the annotation accuracy. The present invention also relates to blockchain technology, and the text to be annotated is stored in a blockchain. Data is annotated according to multiple preset rules, and the label sample with the highest data annotation accuracy is selected according to a generative model, which helps improve the accuracy of data annotation.
PCT/CN2021/083758 2021-02-20 2021-03-30 Method and apparatus for data labeling based on a generative model, device and storage medium WO2022174496A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110193454.5A CN112860919B (zh) 2021-02-20 2021-02-20 Data labeling method, apparatus, device and storage medium based on a generative model
CN202110193454.5 2021-02-20

Publications (1)

Publication Number Publication Date
WO2022174496A1 true WO2022174496A1 (fr) 2022-08-25

Family

ID=75988385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083758 WO2022174496A1 (fr) 2021-02-20 2021-03-30 Method and apparatus for data labeling based on a generative model, device and storage medium

Country Status (2)

Country Link
CN (1) CN112860919B (fr)
WO (1) WO2022174496A1 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515587B (zh) * 2021-06-02 2024-06-21 中国神华国际工程有限公司 Subject matter information extraction method and apparatus, computer device and storage medium
CN113590729B (zh) * 2021-07-30 2023-06-20 博米智能科技(杭州)有限公司 Building equipment point identification method and apparatus, computer device and storage medium
CN113761577B (zh) * 2021-09-10 2024-05-31 平安科技(深圳)有限公司 Big data desensitization method and apparatus, computer device and storage medium
CN114020877B (zh) * 2021-11-18 2024-05-10 中科雨辰科技有限公司 Data processing system for labeling text
CN116796356A (zh) * 2022-03-07 2023-09-22 华为云计算技术有限公司 Data segmentation method and related apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196908A (zh) * 2019-04-17 2019-09-03 深圳壹账通智能科技有限公司 Data classification method and apparatus, computer device and storage medium
CN111507104A (zh) * 2020-03-19 2020-08-07 北京百度网讯科技有限公司 Method and apparatus for establishing a label tagging model, electronic device and readable storage medium
US20200320171A1 (en) * 2019-04-02 2020-10-08 International Business Machines Corporation Cross-subject model-generated training data for relation extraction modeling
CN112084752A (zh) * 2020-09-08 2020-12-15 中国平安财产保险股份有限公司 Natural-language-based sentence labeling method, apparatus, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7725417B2 (en) * 2006-02-09 2010-05-25 Ebay Inc. Method and system to analyze rules based on popular query coverage
CN106997382B (zh) * 2017-03-22 2020-12-01 山东大学 Automatic labeling method and system for innovation and creativity labels based on big data


Also Published As

Publication number Publication date
CN112860919A (zh) 2021-05-28
CN112860919B (zh) 2024-07-12

Similar Documents

Publication Publication Date Title
WO2022174496A1 (fr) Method and apparatus for data labeling based on a generative model, device and storage medium
WO2022105122A1 (fr) Artificial-intelligence-based response generation method and apparatus, computer device and medium
JP7301922B2 (ja) Semantic retrieval method and apparatus, electronic device, storage medium and computer program
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
WO2021135469A1 (fr) Machine-learning-based information extraction method, apparatus, computer device and medium
WO2020244065A1 (fr) Artificial-intelligence-based character vector definition method, apparatus, device and storage medium
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
CN108628830B (zh) Semantic recognition method and apparatus
CN107861954B (zh) Artificial-intelligence-based information output method and apparatus
WO2021051574A1 (fr) English text sequence labeling method and system, and computer device
CN113076739A (zh) Cross-domain Chinese text error correction method and system
CN111985229A (zh) Sequence labeling method and apparatus, and computer device
CN113987169A (zh) Semantic-block-based text summary generation method, apparatus, device and storage medium
CN112101031B (zh) Entity recognition method, terminal device and storage medium
WO2021212681A1 (fr) Semantic role labeling method and apparatus, computer device and storage medium
CN108804591A (zh) Text classification method and apparatus for medical record text
CN113051914A (zh) Enterprise hidden label extraction method and apparatus based on multi-feature dynamic profiling
CN112949320B (zh) Conditional-random-field-based sequence labeling method, apparatus, device and medium
CN114416995A (zh) Information recommendation method, apparatus and device
CN111967253A (zh) Entity disambiguation method and apparatus, computer device and storage medium
CN113268560A (zh) Method and apparatus for text matching
CN113220835A (zh) Text information processing method and apparatus, electronic device and storage medium
CN116303537A (zh) Data query method and apparatus, electronic device and storage medium
CN112328655A (zh) Text label mining method, apparatus, device and storage medium
WO2021042529A1 (fr) Method for automatically generating an article summary, device, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21926213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21926213

Country of ref document: EP

Kind code of ref document: A1