WO2022142122A1 - Method and apparatus for training entity recognition model, and device and storage medium - Google Patents


Info

Publication number
WO2022142122A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
training
designated
text
training samples
Prior art date
Application number
PCT/CN2021/097543
Other languages
French (fr)
Chinese (zh)
Inventor
阮鸿涛
郑立颖
胡沛弦
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022142122A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • The present application relates to the field of natural language processing in artificial intelligence, and in particular to a method, apparatus, device and storage medium for training an entity recognition model.
  • The training of entity recognition models relies on a large amount of fully annotated data, but high-quality annotation usually requires highly professional annotators, which makes training data difficult and expensive to obtain.
  • To save costs, an entity recognition model can be trained with incompletely labeled data, in which only some entities in the text are labeled while the remaining unlabeled content may be either non-entities or entities.
  • To improve the effect of training with incompletely labeled data, all label sequences consistent with the existing annotations are usually taken into account during model training: the probability distribution over all possible label sequences is estimated and integrated into the training, so that the model can attend to every possible label sequence.
  • The main purpose of the present application is to provide a method for training an entity recognition model, aiming to solve the technical problem that the number of candidate label sequences grows exponentially with the amount of unlabeled content in incompletely labeled data, so that the model's attention is scattered and cannot focus on the true label sequence.
  • To this end, the present application proposes a method for training an entity recognition model, including: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
  • the present application also provides a training device for an entity recognition model, including:
  • a first acquisition module configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
  • an input module, configured to input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • a calculation module, configured to calculate the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • a determination module, configured to determine the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability;
  • an obtaining module, configured to obtain the label sequence set corresponding to the designated training sample according to the masking labels;
  • the second obtaining module is configured to obtain the label sequence sets corresponding to all the training samples in the incompletely labeled data set according to the obtaining method of the label sequence sets corresponding to the designated training samples;
  • the training module is used for training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of the preset loss function.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor implements a training method for an entity recognition model when the processor executes the computer program; wherein,
  • the method for training an entity recognition model includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • the present application further provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements a method for training an entity recognition model; the method includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • This application predicts the label probabilities of unlabeled text with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character. This effectively reduces the number of label sequences whose probability distribution must be estimated, makes it easier for the entity recognition model to identify the true label sequence, and reduces computational complexity.
  • FIG. 1 is a schematic flowchart of a training method for an entity recognition model according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a training system for an entity recognition model according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
  • Referring to FIG. 1, a method for training an entity recognition model according to an embodiment of the present application includes:
  • S1 Acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
  • S2 Input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • S3 Calculate the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • S4 Determine, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample;
  • S5 Obtain the label sequence set corresponding to the designated training sample according to the masking labels;
  • S6 Obtain, according to the manner of obtaining the label sequence set corresponding to the designated training sample, the label sequence sets corresponding to all training samples in the incompletely labeled data set;
  • S7 Train the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
  • In the embodiments of the present application, an incompletely labeled designated training sample is a text sentence sample used for entity recognition in which some characters are not labeled with an entity label type.
  • entity label types vary according to the task domain.
  • entity tags include, but are not limited to, company, address, person's name, time, organization, and the like.
  • An incompletely labeled training sample is denoted x = (x_1, x_2, …, x_n), where each x_i (i = 1, 2, …, n) represents one character in the text sequence of the training sample; its incompletely labeled label sequence is denoted y^u = (-, y_2, -, …, -), where y_i ∈ Y, Y is the entity label set, y_i is the label actually assigned to x_i by the annotator, and "-" represents a position not labeled as an entity, i.e. the label at this position can be filled with a non-entity label or any entity label in the entity label set of the current task domain. Enumerating all such choices for the unlabeled positions yields the set C(y^u) of all complete label sequences the training sample may have; if the true complete label sequence of x is y = (y_1, y_2, …, y_n), then y is one of the complete label sequences in C(y^u).
  • The present application uses the trained probability prediction model to predict the probability that the label of character x_i of the training text x is y'_i, with y'_i ∈ Y, and then uses the Viterbi algorithm to compute the probability of each complete label sequence of x, i.e. the probability that the complete label sequence of x is y' when each x_i takes label y'_i, for y' ∈ C(y^u). Through the Viterbi algorithm, the complete label sequence with the highest probability is obtained, and the masking labels are determined for masking and deletion.
  • The process of obtaining the highest-probability complete label sequence through the Viterbi algorithm is as follows: given the state space of a hidden Markov model (HMM) and the total number of states, the probability of the most likely label sequence is derived from the label probabilities of the initial character and the state transition probabilities of the label sequences from the initial character to the subsequent characters. The Viterbi path is obtained by saving the label probabilities of each character used in the recursion. Then, by enumerating the labels that have appeared for the unlabeled characters, the label sequences of the training sample are combined accordingly, and each label sequence together with the training sample forms the training data used to train the entity recognition model, as in the sketch below.
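  • By way of illustration only, the following is a minimal sketch of the Viterbi decoding step described above, assuming per-character label probabilities from the probability prediction model and a label transition matrix are available as arrays; the function name and array layout are illustrative, not part of the disclosed method.

```python
import numpy as np

def viterbi(emission, transition, initial):
    """Return the highest-probability label sequence and its log-probability.

    emission:   (n_chars, n_labels) per-character label probabilities
                from the probability prediction model.
    transition: (n_labels, n_labels) label-to-label transition probabilities.
    initial:    (n_labels,) label probabilities for the first character.
    """
    log_e = np.log(emission + 1e-12)
    log_t = np.log(transition + 1e-12)
    n, k = emission.shape
    score = np.log(initial + 1e-12) + log_e[0]  # best log-prob of paths ending in each label
    back = np.zeros((n, k), dtype=int)          # backpointers for path recovery
    for i in range(1, n):
        cand = score[:, None] + log_t + log_e[i][None, :]  # (prev label, current label)
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]                # trace the Viterbi path backwards
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(score.max())
```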
  • Both the above entity recognition model and the probability prediction model use the BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) architecture; the difference lies in the model parameters and output variables used.
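  • For context, a tagger of the BERT+CRF kind named above could be sketched as follows, assuming the HuggingFace transformers and pytorch-crf packages; the class name, encoder checkpoint, and interface are illustrative assumptions, and each of the two models would instantiate this architecture with its own weights.

```python
import torch
from torch import nn
from transformers import AutoModel
from torchcrf import CRF  # pytorch-crf package

class BertCrfTagger(nn.Module):
    """Minimal BERT+CRF sequence tagger (illustrative sketch)."""

    def __init__(self, num_labels, encoder_name="bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        if labels is not None:  # training: negative log-likelihood under the CRF
            return -self.crf(emissions, labels, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())  # best label paths
```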
  • The present application thus predicts the label probabilities of unlabeled text with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character, effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and reducing computational complexity.
  • In one embodiment, the step S4 of determining the masking labels corresponding to all unlabeled texts in the designated training sample includes:
  • S41 Acquire the label type set corresponding to all labels in the current entity recognition task;
  • S42 Determine the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample, and the designated labels are one or more labels in the label type set;
  • S43 Use the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text;
  • S44 Determine, in the same manner as the masking labels corresponding to the designated text, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • The Viterbi algorithm in the above k-best case means that each node in the algorithm retains not just the single best value but the K best values, i.e. the top K highest-probability partial paths.
  • In this way, the label types that appear for each previously unlabeled character are determined, and the entity labels that do not appear for an unlabeled character in any of the k complete label sequences are used as masking labels to be masked and deleted, as sketched below.
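  • A sketch of the k-best variant, with the same assumed inputs and illustrative names as the previous sketch: each position keeps the K highest-scoring partial paths per label instead of only the single best one.

```python
import heapq
import numpy as np

def viterbi_k_best(emission, transition, initial, k=3):
    """Return the k highest-probability complete label sequences."""
    log_e = np.log(emission + 1e-12)
    log_t = np.log(transition + 1e-12)
    n, num_labels = emission.shape
    # paths[j] holds up to k (log_prob, sequence) pairs ending in label j.
    paths = {j: [(float(np.log(initial[j] + 1e-12) + log_e[0, j]), [j])]
             for j in range(num_labels)}
    for i in range(1, n):
        new_paths = {}
        for j in range(num_labels):
            cand = [(lp + log_t[pj, j] + log_e[i, j], seq + [j])
                    for pj in range(num_labels) for lp, seq in paths[pj]]
            new_paths[j] = heapq.nlargest(k, cand, key=lambda c: c[0])
        paths = new_paths
    finals = [c for j in range(num_labels) for c in paths[j]]
    return heapq.nlargest(k, finals, key=lambda c: c[0])
```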
  • In one embodiment, the step S5 of obtaining the label sequence set corresponding to the designated training sample according to the masking labels includes:
  • The set K(x) of the k highest-probability label sequences is used to determine the masking labels and construct a predicted label set.
  • For an unlabeled character x_j, its possible labels are a non-entity label or a label in the union of the labels that x_j takes across the sequences in K(x); every entity label outside this union is a masking label.
  • That is, the predicted label set is obtained by masking and deleting the entity labels that never appear at the corresponding position in the k label sequences.
  • For example, if the labels appearing at the position of character x_j in K(x) include "company", "person's name" and "organization", while "address" and "time" do not appear, then "address" and "time" are masking labels, and the predicted label set consists of the labels "company", "person's name" and "organization".
  • This reduces the number of candidate labels for each unlabeled character, which can only be selected from its predicted label set: for each "-" position in y^u = (-, y_2, -, …, -), one label is selected from the corresponding predicted label set to form a complete label sequence. By deleting the masking labels, the number of candidate labels for each unlabeled character is reduced, and the resulting set of possible label sequences is denoted S(y^u, K(x)); the number of label sequences in S(y^u, K(x)) is much smaller than the number in C(y^u), as illustrated in the sketch below.
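  • The masking-and-reduction step might be sketched as follows, assuming the k best sequences K(x) are available as lists of per-character labels; the helper name and the "O" non-entity tag are assumptions for illustration.

```python
def predicted_label_sets(k_best_sequences, unlabeled_positions, label_set, non_entity="O"):
    """Build the predicted label set for every unlabeled position.

    Labels that never appear at position j across K(x) become masking labels
    and are deleted from the label type set; the non-entity label always
    remains a candidate.
    """
    predicted = {}
    for j in unlabeled_positions:
        seen = {seq[j] for seq in k_best_sequences}  # union of labels of x_j over K(x)
        masking = set(label_set) - seen              # masking (cover) labels to delete
        predicted[j] = (set(label_set) - masking) | {non_entity}
    return predicted
```

  • With the example above, masking would remove "address" and "time" at that position, leaving "company", "person's name" and "organization" plus the non-entity label as the only candidates.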
  • In one embodiment, the step S53 of forming the label sequence set includes:
  • S531 Determine whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
  • S532 If so, obtain the labeling label corresponding to the first character;
  • S533 Determine whether a second character arranged after the first character is a labeled character;
  • S534 If not, obtain the predicted label set corresponding to the second character;
  • S535 Mark each label of the predicted label set corresponding to the second character on the second character, and connect each to the labeling label corresponding to the first character, to form the label paths from the first character to the second character;
  • S536 Form, in the same manner as the label paths from the first character to the second character, all label paths corresponding to all characters in the designated training sample;
  • S537 Use all label paths corresponding to all characters in the designated training sample as all the label sequences, forming the label sequence set.
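  • A sketch of how these label paths could be enumerated into the reduced label sequence set S(y^u, K(x)), assuming the predicted label sets from the previous sketch; names and the "-" placeholder convention are illustrative.

```python
from itertools import product

def label_sequence_set(y_u, predicted):
    """Enumerate all complete label sequences for one training sample.

    y_u:       incompletely labeled sequence, e.g. ["-", "B-PER", "-", "-"].
    predicted: {position: predicted label set} for each "-" position.
    Labeled positions keep their annotated label; each "-" position ranges
    over its reduced predicted label set, forming the label paths.
    """
    choices = [sorted(predicted[i]) if tag == "-" else [tag]
               for i, tag in enumerate(y_u)]
    return [list(seq) for seq in product(*choices)]
```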
  • In one embodiment, the step S7 of training the entity recognition model through the label sequence sets corresponding to all training samples includes:
  • S71 Input the label sequence sets corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples;
  • S72 Set the label sequence probabilities corresponding to the label sequences of all training samples as the assigned weights of the corresponding label sequences in one-to-one correspondence;
  • S73 Form training data from the label sequences carrying the assigned weights and the training samples corresponding to each label sequence;
  • S74 Input the training data into the entity recognition model and train until the preset loss function converges.
  • The cross-validation model is also a BERT+CRF architecture; the difference is that the model parameters and output variables used differ from those of the entity recognition model.
  • In one embodiment, the preset loss function is

    loss(w) = - Σ_x Σ_{y' ∈ S(y^u, K(x))} q(y'|x) · log p(y'|x; w)

    where w is the model parameter of the entity recognition model and q(y'|x) is the estimated distribution weight of the label sequence y' for the sample x. That is, the present application estimates the probability distribution p(y'|x) of each label sequence in the reduced label sequence set, and the estimate q(y'|x) of each label sequence is taken as the distribution weight of each possible label sequence corresponding to each x.
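  • A minimal sketch of this weighted objective for a single sample, assuming the entity recognition model exposes log p(y'|x; w) for each candidate sequence; the function and its inputs are illustrative.

```python
import numpy as np

def weighted_nll_loss(seq_log_probs, weights):
    """Weighted negative log-likelihood over a reduced label sequence set.

    seq_log_probs: log p(y'|x; w) from the entity recognition model (e.g. CRF
                   sequence scores), one entry per sequence y' in S(y_u, K(x)).
    weights:       q(y'|x), the distribution weights of the candidate sequences.
    """
    return -float(np.sum(np.asarray(weights) * np.asarray(seq_log_probs)))
```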
  • step S71 of inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples respectively includes:
  • S711 Divide the label sequences in the label sequence sets corresponding to all training samples into a first part of data and a second part of data equally;
  • S712 Input the first part of the data into the cross-validation model, obtain a first validation model through training, input the second part of the data into the cross-validation model, and obtain a second validation model through training;
  • S713 Input the second part of the data into the first verification model, obtain the label sequence probability corresponding to each label sequence in the second part of the data, and input the first part of the data into the second verification model, The label sequence probability corresponding to each label sequence in the first part of the data is obtained.
  • In a specific embodiment, each possible label sequence of each x is first paired with x to obtain training data consisting of the label sequences of all training samples; the training data is then divided into two equal parts, one used as a training set and the other as a validation set. That is, a sequence labeling model is trained on one part of the training data and used to predict the label sequence probability p(y'|x) of each label sequence in the other part, and vice versa; these held-out probabilities serve as the assigned weights q(y'|x).
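  • The two-fold cross-estimation could be sketched as follows, with train_fn and score_fn as assumed stand-ins for training a sequence labeling model (e.g. BERT+CRF) on one fold and scoring a label sequence under it.

```python
def estimate_weights(pairs, train_fn, score_fn):
    """Two-fold cross-estimation of the weights q(y'|x).

    pairs:    list of (sample, label_sequence) training pairs.
    train_fn: trains a sequence labeling model on a list of pairs.
    score_fn: score_fn(model, sample, seq) -> p(seq | sample) under that model.
    Each half of the data is scored by the model trained on the other half.
    """
    half = len(pairs) // 2
    fold_a, fold_b = pairs[:half], pairs[half:]
    model_a, model_b = train_fn(fold_a), train_fn(fold_b)
    weights_a = [score_fn(model_b, x, y) for x, y in fold_a]  # held out from model_b
    weights_b = [score_fn(model_a, x, y) for x, y in fold_b]  # held out from model_a
    return weights_a + weights_b
```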
  • a training device for an entity recognition model includes:
  • the first acquisition module 1 is used to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
  • the input module 2 is used to input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • the calculation module 3 is used to calculate, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
  • the determination module 4 is used to determine the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability;
  • the obtaining module 5 is used to obtain the label sequence set corresponding to the designated training sample according to the masking labels;
  • the second obtaining module 6 is configured to obtain the label sequence sets corresponding to all the training samples in the incompletely labeled data set according to the obtaining method of the label sequence sets corresponding to the designated training samples;
  • the training module 7 is used for training the entity recognition model through the label sequence sets corresponding to all the training samples under the constraint of the preset loss function.
  • the determination module 4 includes:
  • an acquisition unit, used to acquire the label type set corresponding to all labels in the current entity recognition task;
  • a first determination unit, used to determine the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample, and the designated labels are one or more labels in the label type set;
  • a masking unit, used to take the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text;
  • a second determination unit, used to determine, in the same manner as the masking labels corresponding to the designated text, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • the obtaining module 5 includes:
  • a deletion unit, used to delete the masking labels from the label type set to obtain the predicted label set corresponding to the designated text;
  • a first labeling unit, used to label the predicted label set on the designated text;
  • a second labeling unit, used to label, according to the labeling process of the predicted label set corresponding to the designated text, the predicted label sets corresponding to all unlabeled texts in the designated training sample;
  • a combining unit, used to combine the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, one-to-one into all the label sequences corresponding to the designated training sample, forming the label sequence set.
  • the combination unit includes:
  • a first judging subunit, used to judge whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
  • a first obtaining subunit, used to obtain the labeling label corresponding to the first character if it is a labeled character;
  • a second judging subunit, used to judge whether a second character arranged after the first character is a labeled character;
  • a second obtaining subunit, used to obtain the predicted label set corresponding to the second character if it is not a labeled character;
  • a labeling subunit, used to mark each label of the predicted label set corresponding to the second character on the second character, and to connect each to the labeling label corresponding to the first character, forming the label paths from the first character to the second character;
  • training module 7 includes:
  • the input unit is used to input the label sequence sets corresponding to all training samples into the cross-validation model, and obtain the label sequence probabilities corresponding to the label sequences of all training samples respectively;
  • a setting unit configured to set the label sequence probabilities corresponding to the label sequences of all training samples to the assigned weights corresponding to each of the label sequences in a one-to-one correspondence
  • a composition unit configured to form training data by combining the label sequence carrying the assigned weight and the training samples corresponding to each of the label sequences respectively;
  • a training unit configured to input the training data into the entity recognition model, and train until the preset loss function converges.
  • the input unit includes:
  • a dividing subunit, used to divide the label sequences in the label sequence sets corresponding to all training samples equally into a first part of data and a second part of data;
  • a training subunit configured to input the first part of the data into the cross-validation model, obtain a first validation model through training, input the second part of the data into the cross-validation model, and obtain a second validation model through training;
  • an input subunit, used to input the second part of the data into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second part of the data, and to input the first part of the data into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first part of the data.
  • an embodiment of the present application further provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3 .
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the database of the computer device is used to store all the data required for the training process of the entity recognition model.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program when executed by a processor, implements a method of training an entity recognition model.
  • The above processor executes the method for training the entity recognition model, including: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • The above computer device predicts the label probabilities of unlabeled text with the probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character, effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and reducing computational complexity.
  • In one embodiment, the step, executed by the processor, of determining the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability includes: acquiring the label type set corresponding to all labels in the current entity recognition task; determining the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample and the designated labels are one or more labels in the label type set; using the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text; and determining, in the same manner, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • In one embodiment, the step, executed by the above processor, of obtaining the label sequence set corresponding to the designated training sample according to the masking labels includes: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated text; labeling the predicted label set on the designated text; labeling, according to the labeling process of the predicted label set corresponding to the designated text, the predicted label sets corresponding to all unlabeled texts in the designated training sample; and combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, one-to-one into all the label sequences corresponding to the designated training sample, forming the label sequence set.
  • In one embodiment, the step, executed by the above processor, of combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, into all the label sequences corresponding to the designated training sample to form the label sequence set includes: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, obtaining the labeling label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, obtaining the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character and connecting it to the labeling label of the first character, forming the label paths from the first character to the second character; forming, in the same manner, all label paths corresponding to all characters in the designated training sample; and using all label paths corresponding to all characters in the designated training sample as all the label sequences.
  • In one embodiment, the step of training the entity recognition model through the label sequence sets corresponding to all training samples includes: inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples; setting the label sequence probabilities as the assigned weights of the corresponding label sequences in one-to-one correspondence; forming training data from the weighted label sequences and the training samples corresponding to each label sequence; and inputting the training data into the entity recognition model and training until the preset loss function converges.
  • In one embodiment, the step, executed by the above processor, of inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples includes: dividing the label sequences in the label sequence sets corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of the data into the cross-validation model and training to obtain a first validation model, and inputting the second part of the data into the cross-validation model and training to obtain a second validation model; inputting the second part of the data into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second part of the data, and inputting the first part of the data into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first part of the data.
  • FIG. 3 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium, which may be a volatile or non-volatile storage medium, on which a computer program is stored; when the computer program is executed by a processor, a method for training an entity recognition model is implemented, including: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • The above computer-readable storage medium predicts the label probabilities of unlabeled text with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, then selects the most likely label sequence and uses it to reduce the number of possible labels for each unlabeled character, effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and reducing computational complexity.
  • In one embodiment, the step, executed by the processor, of determining the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability includes: obtaining the label type set corresponding to all labels in the current entity recognition task; determining the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample and the designated labels are one or more labels in the label type set; using the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text; and determining, in the same manner, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • In one embodiment, the step, executed by the above processor, of obtaining the label sequence set corresponding to the designated training sample according to the masking labels includes: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated text; labeling the predicted label set on the designated text; labeling, according to the labeling process of the predicted label set corresponding to the designated text, the predicted label sets corresponding to all unlabeled texts in the designated training sample; and combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, one-to-one into all the label sequences corresponding to the designated training sample, forming the label sequence set.
  • In one embodiment, the step, executed by the above processor, of combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, into all the label sequences corresponding to the designated training sample to form the label sequence set includes: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, obtaining the labeling label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, obtaining the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character and connecting it to the labeling label of the first character, forming the label paths from the first character to the second character; forming, in the same manner, all label paths corresponding to all characters in the designated training sample; and using all label paths corresponding to all characters in the designated training sample as all the label sequences.
  • In one embodiment, the step of training the entity recognition model through the label sequence sets corresponding to all training samples includes: inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples; setting the label sequence probabilities as the assigned weights of the corresponding label sequences in one-to-one correspondence; forming training data from the weighted label sequences and the training samples corresponding to each label sequence; and inputting the training data into the entity recognition model and training until the preset loss function converges.
  • In one embodiment, the step, executed by the above processor, of inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples includes: dividing the label sequences in the label sequence sets corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of the data into the cross-validation model and training to obtain a first validation model, and inputting the second part of the data into the cross-validation model and training to obtain a second validation model; inputting the second part of the data into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second part of the data, and inputting the first part of the data into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first part of the data.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of natural language processing of artificial intelligence. Disclosed is a method for training an entity recognition model, the method comprising: acquiring a designated training sample that is not completely labeled; inputting the designated training sample into a probability prediction model, so as to obtain label probabilities corresponding to all unlabeled characters in the designated training sample; calculating a label sequence with the highest probability according to the label probabilities respectively corresponding to all the unlabeled characters and by means of a Viterbi algorithm; determining, according to the label sequence with the highest probability, masking labels respectively corresponding to all the unlabeled characters in the designated training sample; obtaining, according to the masking labels, a label sequence set corresponding to the designated training sample; acquiring, according to an acquisition method for the label sequence set corresponding to the designated training sample, label sequence sets respectively corresponding to all training samples in an incompletely labeled data set; and, under the constraint of a preset loss function, training an entity recognition model by means of the label sequence sets respectively corresponding to all the training samples. The actual label sequence can thus be recognized more easily.

Description

Method, apparatus, device and storage medium for training an entity recognition model
This application claims priority to Chinese patent application No. 202011633046.9, filed with the China Patent Office on December 31, 2020 and entitled "Method, Apparatus, Device and Storage Medium for Training an Entity Recognition Model", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of natural language processing in artificial intelligence, and in particular to a method, apparatus, device and storage medium for training an entity recognition model.
Background
The training of entity recognition models relies on a large amount of fully annotated data, but high-quality annotation usually requires highly professional annotators, which makes training data difficult and expensive to obtain. To save costs, an entity recognition model can be trained with incompletely labeled data, in which only some entities in the text are labeled while the remaining unlabeled content may be either non-entities or entities. To improve the effect of training with incompletely labeled data, all label sequences consistent with the existing annotations are usually taken into account during model training: the probability distribution over all possible label sequences is estimated and integrated into the training, so that the model can attend to every possible label sequence. However, the inventors realized that because named entities usually have multiple categories and are sparsely distributed in the text, the number of candidate label sequences grows exponentially with the amount of unlabeled content, so the attention of the named entity model is scattered and it is difficult for it to focus on the true label sequence, which degrades the recognition effect.
Technical Problem
Because named entities usually have multiple categories and are sparsely distributed in the text, the number of candidate label sequences grows exponentially with the amount of unlabeled content, so the attention of the named entity model is scattered and it is difficult for it to focus on the true label sequence, which degrades the recognition effect.
Technical Solution
The main purpose of this application is to provide a method for training an entity recognition model, aiming to solve the technical problem that the number of candidate label sequences grows exponentially with the amount of unlabeled content, so that the model cannot focus on the true label sequence.
In a first aspect, the present application proposes a method for training an entity recognition model, including:
acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample;
calculating the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample;
obtaining the label sequence set corresponding to the designated training sample according to the masking labels;
obtaining, according to the manner of obtaining the label sequence set corresponding to the designated training sample, the label sequence sets corresponding to all training samples in the incompletely labeled data set;
training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In a second aspect, the present application also provides an apparatus for training an entity recognition model, including:
a first acquisition module, configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
an input module, configured to input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample;
a calculation module, configured to calculate the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
a determination module, configured to determine the masking labels corresponding to all unlabeled characters in the designated training sample according to the label sequence with the highest probability;
an obtaining module, configured to obtain the label sequence set corresponding to the designated training sample according to the masking labels;
a second acquisition module, configured to obtain the label sequence sets corresponding to all training samples in the incompletely labeled data set according to the manner of obtaining the label sequence set corresponding to the designated training sample;
a training module, configured to train the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In a third aspect, the present application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements a method for training an entity recognition model; the method includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements a method for training an entity recognition model; the method includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
Beneficial Effects
The present application predicts the label probabilities of unlabeled characters with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character. This effectively reduces the number of label sequences whose probability distribution must be estimated, makes it easier for the entity recognition model to identify the true label sequence, and reduces computational complexity.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for training an entity recognition model according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a system for training an entity recognition model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, a method for training an entity recognition model according to an embodiment of the present application includes:
S1: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
S2: inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample;
S3: calculating the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
S4: determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample;
S5: obtaining the label sequence set corresponding to the designated training sample according to the masking labels;
S6: obtaining, according to the manner of obtaining the label sequence set corresponding to the designated training sample, the label sequence sets corresponding to all training samples in the incompletely labeled data set;
S7: training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In the embodiments of the present application, an incompletely labeled designated training sample refers to a text sentence sample for entity recognition in which some characters are not labeled with an entity label type. The entity label types vary with the task domain. For example, in promotional text describing a company's business, the entity labels include, but are not limited to, company, address, person name, time and organization.
The present application constructs the incompletely labeled data set according to how complete the manual labeling of each sample is. Suppose an incompletely labeled training sample is denoted x = (x_1, x_2, …, x_n), where each x_i (i = 1, 2, …, n) represents one character in the text sequence of the training sample. The incomplete label sequence corresponding to the incompletely labeled training sample x is denoted y^u = (-, y_2, -, …, -), where y_i ∈ Y and Y denotes the set of entity labels; y_i is the label actually assigned to x_i by the annotator, and '-' denotes a position not marked as an entity, i.e., a position that may be filled with the non-entity label or any entity label in the entity label set of the current task domain. By selecting, for every position not marked as an entity, every possible label from the entity label set of the current task domain, all complete label sequences that the training sample may have are combined into the set C(y^u). If the true complete label sequence of x is y = (y_1, y_2, …, y_n), then y is one of the complete label sequences in C(y^u).
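For illustration only, the following Python sketch enumerates C(y^u) for a toy sample; the label names and the helper function are assumptions made for this example and are not part of the embodiments.

```python
from itertools import product

# Toy label set Y; "-" marks positions whose label was not annotated.
Y = {"O", "COMPANY", "PERSON"}

def all_complete_sequences(y_u, label_set):
    """Enumerate C(y_u): every way of filling each '-' position of the
    partial label sequence with a label from the task's label set."""
    slots = [sorted(label_set) if lab == "-" else [lab] for lab in y_u]
    return [list(seq) for seq in product(*slots)]

seqs = all_complete_sequences(("-", "COMPANY", "-"), Y)
print(len(seqs))  # 3 * 1 * 3 = 9; grows exponentially in the unlabeled positions
```

The exponential growth of C(y^u) in the number of unlabeled positions is what motivates pruning it with masking labels in the steps below.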
In the present application, the trained probability prediction model predicts the probability that the label of character x_i of the training text x is y′_i, with y′_i ∈ Y; the Viterbi algorithm then computes the probability corresponding to each complete label sequence of x, i.e., the probability that the complete label sequence of x is y′ when each x_i takes label y′_i, with y′ ∈ C(y^u). The Viterbi algorithm is used to obtain the complete label sequence with the highest probability, and the masking labels are determined and masked out (deleted). The process of obtaining the highest-probability complete label sequence with the Viterbi algorithm is as follows: given the state space of a hidden Markov model (HMM) with a known number of states, the probability of the most likely label sequence is produced from the label probabilities of the initial character and the state transition probabilities of the label sequences from the initial character to the other characters; the Viterbi path is obtained by saving the label probabilities of each character used in the recursion. Then, by exhaustively combining the labels that have appeared for the unlabeled characters, the label sequences of the training sample are formed, and these label sequences together with the training sample constitute the training data used to train the entity recognition model. Both the entity recognition model and the probability prediction model use the BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model architecture; they differ in the model parameters and output variables used.
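By way of an illustrative sketch only, not a definitive implementation of the embodiments, the following Python function performs the first-order Viterbi decoding described above. The emission scores are assumed to be per-character label log-probabilities such as those from the probability prediction model, the transition scores are assumed to come from, e.g., a CRF layer, and all function and variable names are hypothetical.

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray):
    """Return the highest-probability label sequence as label indices.

    emissions:   (n, L) log-probabilities of each of L labels at each
                 of n positions (e.g. from the probability prediction model).
    transitions: (L, L) log-probabilities of moving from label a to label b.
    """
    n, L = emissions.shape
    score = emissions[0].copy()            # best log-score ending in each label
    backptr = np.zeros((n, L), dtype=int)  # best predecessor label per step
    for t in range(1, n):
        # cand[a, b]: score of extending a path ending in label a with label b
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # trace the Viterbi path backwards from the best final label
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

Saving only the per-step best predecessors (the back-pointers) is what the recursion in the description above refers to; the full path is recovered by a single backward pass.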
In the present application, the probability prediction model predicts the label probabilities of the unlabeled characters and, in combination with the Viterbi algorithm, the probabilities of the label sequences corresponding to the training sentence are obtained; the most likely label sequences are then selected and, based on them, the number of possible labels of each unlabeled character is reduced. This effectively reduces the number of label sequences whose probability distribution must be estimated, makes it easier for the entity recognition model to identify the true label sequence, and lowers the computational complexity.
Further, the step S4 of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
S41: acquiring the label type set corresponding to all labels in the current entity recognition task;
S42: determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
S43: taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character;
S44: determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
In the embodiments of the present application, the k-best variant of the Viterbi algorithm is used to obtain the k complete label sequences with the highest probability, denoted K(x) = {K_i(x)}, i = 1, 2, …, k, where the i-th complete label sequence is K_i(x) = [K_i(x_1), K_i(x_2), …, K_i(x_n)]. In the k-best Viterbi algorithm, each node keeps not only the single best value but the k best values, i.e., the top-k entries of the sorted candidates (the k smallest costs when scores are expressed as negative log-probabilities). The k complete label sequences with the highest probability are then used to determine the label types with which each previously unlabeled character has appeared, and the entity labels with which an unlabeled character never appears in the k complete label sequences are taken as masking labels and are masked out (deleted).
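A minimal sketch of the k-best variant is given below, assuming the same emission and transition log-probabilities as in the earlier sketch; each position keeps the k best partial paths ending in each label instead of only one. This simple list-based formulation is for illustration and is not the most efficient realization.

```python
import heapq
import numpy as np

def viterbi_k_best(emissions: np.ndarray, transitions: np.ndarray, k: int):
    """Return the k highest-scoring complete label sequences K(x)
    as (log_score, path) pairs, best first."""
    n, L = emissions.shape
    # paths[b]: top-k partial hypotheses (log_score, path) ending in label b
    paths = {b: [(float(emissions[0, b]), [b])] for b in range(L)}
    for t in range(1, n):
        new_paths = {}
        for b in range(L):
            cand = [
                (s + transitions[a, b] + emissions[t, b], p + [b])
                for a in range(L)
                for (s, p) in paths[a]
            ]
            # keep only the k best partial paths ending in label b
            new_paths[b] = heapq.nlargest(k, cand, key=lambda c: c[0])
        paths = new_paths
    finals = [c for b in range(L) for c in paths[b]]
    return heapq.nlargest(k, finals, key=lambda c: c[0])
```

Keeping k hypotheses per node instead of one is the only change relative to standard Viterbi decoding; the asymptotic cost grows by roughly a factor of k.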
Further, the step S5 of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
S51: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
S52: marking the predicted label set on the designated character;
S53: labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample;
S54: combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
In the embodiments of the present application, the masking labels are determined and the predicted label sets are constructed from K(x). For an unlabeled character x_j in x, its possible labels are the non-entity label or a label in the union ∪_{i=1,…,k} K_i(x_j), i.e., the union of the labels that the character x_j takes across the sequences in K(x). The predicted label set is obtained by masking out (deleting) the entity labels that never appear at the corresponding position in any of the k label sequences. For example, if the labels with which the character x_j appears include "company", "person name" and "organization", while "address" and "time" never appear for x_j, then "address" and "time" are masking labels, and the predicted label set consists of the labels "company", "person name" and "organization". The number of optional labels for each unlabeled character is thereby reduced, and a label may only be chosen from the predicted label set. For each '-' position in y^u = (-, y_2, -, …, -), selecting one label from the corresponding predicted label set yields one complete label sequence; deleting the masking labels reduces the number of optional labels for every unlabeled character. The set of possible label sequences finally determined in this way is denoted S(y^u, K(x)), and the number of label sequences in S(y^u, K(x)) is far smaller than the number of label sequences in C(y^u).
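A minimal Python sketch of this construction is given below, assuming the k best sequences K(x), a set of label types that includes a non-entity label "O", and a dictionary of manually labeled positions; all names are illustrative rather than part of the embodiments.

```python
def predicted_label_sets(k_best_paths, label_set, labeled_positions):
    """Build, for every position, the set of labels that may still be chosen.

    k_best_paths:      the k complete label sequences K_i(x) (lists of labels)
    label_set:         set of all label types Y of the current task, incl. "O"
    labeled_positions: {position index: manually annotated label}
    """
    n = len(k_best_paths[0])
    sets = []
    for j in range(n):
        if j in labeled_positions:                 # annotated characters keep their label
            sets.append({labeled_positions[j]})
            continue
        seen = {path[j] for path in k_best_paths}  # union of labels of x_j over K(x)
        seen.add("O")                              # the non-entity label stays allowed
        masking = label_set - seen                 # masking labels: never seen at j
        sets.append(label_set - masking)           # i.e. the predicted label set
    return sets
```

For the example above, with label_set = {"O", "company", "person name", "organization", "address", "time"}, an unlabeled x_j seen only with "company", "person name" and "organization" across K(x) would keep those three labels plus "O", while "address" and "time" would be masked out.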
Further, the step S54 of combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises:
S531: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
S532: if so, acquiring the annotation label corresponding to the first character;
S533: judging whether a second character arranged after the first character is a labeled character;
S534: if not, acquiring the predicted label set corresponding to the second character;
S535: marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character;
S536: forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample;
S537: taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
In the embodiments of the present application, the case of a labeled character followed by an unlabeled character is described in detail as an example; this does not restrict the arrangement of the characters, since the label acquisition process is similar for the other combinations of labeled and unlabeled characters. Between a labeled character and the unlabeled character that follows it, as many label paths are formed as there are labels in the predicted label set corresponding to the unlabeled character. By applying the process of forming the label paths between the first character and the second character to all labeled and unlabeled characters in the training sample, all label paths between any two adjacent characters are obtained; these label paths are then connected one after another in the order of the characters, yielding all label sequences and forming the label sequence set, as the sketch after this paragraph illustrates.
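A sketch of this path-forming step is given below: the candidate labels of adjacent positions are connected and the paths are extended position by position in text order, which amounts to an incremental Cartesian product over the per-position label sets (a single annotation label for labeled characters, the predicted label set for unlabeled ones). The function name is illustrative.

```python
def compose_label_sequences(per_position_sets):
    """Connect the candidate labels of adjacent characters into label paths
    and extend the paths position by position, in the order of the text,
    yielding every label sequence in S(y_u, K(x))."""
    paths = [[]]
    for candidates in per_position_sets:          # text order of the sample
        paths = [path + [label] for path in paths for label in candidates]
    return paths

# e.g. per_position_sets = predicted_label_sets(...); a five-character sample
# whose three unlabeled positions keep 3, 2 and 2 candidate labels yields
# 3*2*2 = 12 sequences, far fewer than |Y|**3 without masking.
```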
Further, the step S7 of training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises:
S71: inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples;
S72: setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences;
S73: composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences;
S74: inputting the training data into the entity recognition model and training until the preset loss function converges.
In the embodiments of the present application, the cross-validation model also uses the BERT+CRF model architecture; it differs from the entity recognition model in the model parameters and output variables used. The preset loss function is

min_w Σ_x Σ_{y′ ∈ S(y^u, K(x))} −q(y′|x) · log p_w(y′|x),

where w denotes the model parameters of the entity recognition model and q(y′|x) denotes the assignment weight. The present application estimates, with the cross-validation model, the probability distribution p(y′|x) of each label sequence in the label sequence set, where y′ ∈ S(y^u, K(x)), and from it obtains the probability distribution of each label sequence in the label sequence set as

q(y′|x) = p(y′|x)^(1/T) / Σ_{y″ ∈ S(y^u, K(x))} p(y″|x)^(1/T),

where T is a temperature parameter with T > 0. The estimated probability distribution q(y′|x) over the label sequence set is used as the assignment weight of each possible label sequence corresponding to each x.
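A sketch of the weighting computation is given below, assuming the estimated sequence probabilities are available as log-probabilities; the function names are illustrative and not part of the embodiments.

```python
import numpy as np

def assignment_weights(log_p: np.ndarray, T: float) -> np.ndarray:
    """Turn estimated log-probabilities log p(y'|x) of the sequences in
    S(y_u, K(x)) into assignment weights q(y'|x) with temperature T > 0."""
    logits = log_p / T            # p(y'|x)**(1/T) in log space
    logits -= logits.max()        # shift for numerical stability
    q = np.exp(logits)
    return q / q.sum()            # normalize over the label sequence set

def sample_loss(log_p_model: np.ndarray, q: np.ndarray) -> float:
    """Contribution of one sample x to the preset loss:
    the sum over y' of -q(y'|x) * log p_w(y'|x)."""
    return float(-(q * log_p_model).sum())
```

With T = 1 the weights equal the estimated distribution; larger T makes the weights more uniform over the label sequence set, while smaller T concentrates them on the most probable sequences.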
Further, the step S71 of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
S711: dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data;
S712: inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model;
S713: inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
In the embodiments of the present application, when cross-validation is performed with the cross-validation model, the assignment weight of each possible label sequence of each x is first matched one-to-one with the corresponding x to obtain the training data composed of the label sequences of all training samples; the training data are then divided equally into two parts, one used as a training set and the other as a validation set. That is, a sequence labeling model is trained on one part of the training data, and the trained sequence labeling model predicts the label sequence probability p(y′|x) of each x in the other part of the training data; a sketch of this two-fold estimation follows.
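The two-fold cross-estimation can be sketched as follows; train_fn and predict_fn stand for training a BERT+CRF sequence labeling model and scoring the label sequences of a sample with it, and are placeholders rather than part of the embodiments.

```python
import numpy as np

def cross_estimate(samples, train_fn, predict_fn, seed: int = 0):
    """Split the training data into two equal folds, train one validation
    model per fold, and score each fold's label sequences with the model
    trained on the other fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    fold_a, fold_b = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    model_a = train_fn([samples[i] for i in fold_a])   # first validation model
    model_b = train_fn([samples[i] for i in fold_b])   # second validation model
    probs = {}
    for i in fold_b:
        probs[i] = predict_fn(model_a, samples[i])     # p(y'|x) for fold B
    for i in fold_a:
        probs[i] = predict_fn(model_b, samples[i])     # p(y'|x) for fold A
    return probs
```

Scoring each half with the model trained on the other half keeps the estimated probabilities p(y′|x) out-of-sample, which avoids the overconfident weights that scoring a sample with a model trained on that same sample would produce.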
Referring to FIG. 2, a training apparatus for an entity recognition model according to an embodiment of the present application comprises:
a first acquisition module 1, configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
an input module 2, configured to input the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
a calculation module 3, configured to calculate, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
a determination module 4, configured to determine, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
an obtaining module 5, configured to obtain, according to the masking labels, the label sequence set corresponding to the designated training sample;
a second acquisition module 6, configured to acquire, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set;
a training module 7, configured to train the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
The relevant explanations of the embodiments of the present application are the same as those of the corresponding parts of the method and are not repeated here.
Further, the determination module 4 comprises:
an acquisition unit, configured to acquire the label type set corresponding to all labels in the current entity recognition task;
a first determination unit, configured to determine the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
a taking unit, configured to take the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character;
a second determination unit, configured to determine, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
Further, the obtaining module 5 comprises:
a deletion unit, configured to delete the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
a first labeling unit, configured to mark the predicted label set on the designated character;
a second labeling unit, configured to label, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample;
a combination unit, configured to combine, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
Further, the combination unit comprises:
a first judgment subunit, configured to judge whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
a first acquisition subunit, configured to acquire, if the first character is a labeled character, the annotation label corresponding to the first character;
a second judgment subunit, configured to judge whether a second character arranged after the first character is a labeled character;
a second acquisition subunit, configured to acquire, if the second character is not a labeled character, the predicted label set corresponding to the second character;
a marking subunit, configured to mark each label of the predicted label set corresponding to the second character on the second character, and connect each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character;
a formation subunit, configured to form, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample;
a taking subunit, configured to take all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
Further, the training module 7 comprises:
an input unit, configured to input the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples;
a setting unit, configured to set the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences;
a composition unit, configured to compose training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences;
a training unit, configured to input the training data into the entity recognition model and train until the preset loss function converges.
Further, the input unit comprises:
an equal-division subunit, configured to divide the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data;
a training subunit, configured to input the first part of data into the cross-validation model and train to obtain a first validation model, and input the second part of data into the cross-validation model and train to obtain a second validation model;
an input subunit, configured to input the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and input the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device comprises a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is configured to store all data required for the training process of the entity recognition model. The network interface of the computer device is configured to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method for an entity recognition model.
The processor executes the training method for the entity recognition model, comprising: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample; obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample; acquiring, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
The above computer device predicts the label probabilities of unlabeled characters with a probability prediction model and, in combination with the Viterbi algorithm, obtains the probabilities of the label sequences corresponding to a training sentence; it then selects the most likely label sequences and, based on them, reduces the number of possible labels for each unlabeled character, thereby effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and lowering the computational complexity.
In one embodiment, the step in which the processor determines, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises: acquiring the label type set corresponding to all labels in the current entity recognition task; determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set; taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
In one embodiment, the step in which the processor obtains, according to the masking labels, the label sequence set corresponding to the designated training sample comprises: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character; marking the predicted label set on the designated character; labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
In one embodiment, the step in which the processor combines, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, acquiring the annotation label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, acquiring the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character; forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample; and taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
In one embodiment, the step in which the processor trains the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises: inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples; setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences; composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences; and inputting the training data into the entity recognition model and training until the preset loss function converges.
In one embodiment, the step in which the processor inputs the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises: dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model; and inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium, which is a volatile storage medium or a non-volatile storage medium and on which a computer program is stored. When the computer program is executed by a processor, a training method for an entity recognition model is implemented, the method comprising: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample; obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample; acquiring, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
The above computer-readable storage medium predicts the label probabilities of unlabeled characters with a probability prediction model and, in combination with the Viterbi algorithm, obtains the probabilities of the label sequences corresponding to a training sentence; it then selects the most likely label sequences and, based on them, reduces the number of possible labels for each unlabeled character, thereby effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and lowering the computational complexity.
In one embodiment, the step in which the processor determines, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises: acquiring the label type set corresponding to all labels in the current entity recognition task; determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set; taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
In one embodiment, the step in which the processor obtains, according to the masking labels, the label sequence set corresponding to the designated training sample comprises: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character; marking the predicted label set on the designated character; labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
In one embodiment, the step in which the processor combines, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, acquiring the annotation label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, acquiring the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character; forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample; and taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
In one embodiment, the step in which the processor trains the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises: inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples; setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences; composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences; and inputting the training data into the entity recognition model and training until the preset loss function converges.
In one embodiment, the step in which the processor inputs the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises: dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model; and inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

Claims (20)

  1. A training method for an entity recognition model, comprising:
    acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample;
    acquiring, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
  2. The training method for an entity recognition model according to claim 1, wherein the step of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
    acquiring the label type set corresponding to all labels in the current entity recognition task;
    determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
  3. The training method for an entity recognition model according to claim 1, wherein the step of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
    deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
    marking the predicted label set on the designated character;
    labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and
    combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
  4. The training method for an entity recognition model according to claim 3, wherein the step of combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises:
    judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
    if so, acquiring the annotation label corresponding to the first character;
    judging whether a second character arranged after the first character is a labeled character;
    if not, acquiring the predicted label set corresponding to the second character;
    marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character;
    forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample; and
    taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
  5. The training method for an entity recognition model according to claim 1, wherein the step of training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises:
    inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples;
    setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences;
    composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences; and
    inputting the training data into the entity recognition model and training until the preset loss function converges.
  6. The training method for an entity recognition model according to claim 5, wherein the step of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
    dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data;
    inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model; and
    inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
7. An apparatus for training an entity recognition model, comprising:
    a first acquisition module, configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    an input module, configured to input the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    a calculation module, configured to calculate, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    a determination module, configured to determine, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    an obtaining module, configured to obtain, according to the masking labels, the label sequence set corresponding to the designated training sample;
    a second acquisition module, configured to acquire, in the same manner as the label sequence set corresponding to the designated training sample is obtained, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    a training module, configured to train the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples.
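For the calculation module, a textbook Viterbi decoder over the per-character label probabilities might look like the sketch below. The transition matrix is an illustrative assumption: the claims specify only the per-character label probabilities as input.

```python
# A standard Viterbi decoder in log space; inputs are assumed, not prescribed.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (seq_len, n_labels) per-character label probabilities.
    transitions: (n_labels, n_labels) label-to-label transition probabilities.
    Returns the index path of the highest-probability label sequence."""
    log_e = np.log(emissions + 1e-12)         # log space avoids underflow
    log_t = np.log(transitions + 1e-12)
    seq_len, n_labels = emissions.shape
    score = np.empty((seq_len, n_labels))
    back = np.zeros((seq_len, n_labels), dtype=int)

    score[0] = log_e[0]
    for t in range(1, seq_len):
        cand = score[t - 1][:, None] + log_t  # rows: previous label, cols: current
        back[t] = cand.argmax(axis=0)         # best predecessor per current label
        score[t] = cand.max(axis=0) + log_e[t]

    path = [int(score[-1].argmax())]          # best final label, then backtrack
    for t in range(seq_len - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```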
8. The apparatus for training an entity recognition model according to claim 7, wherein the determination module comprises:
    an acquisition unit, configured to acquire the label type set corresponding to all labels in the current entity recognition task;
    a first determination unit, configured to determine the designated labels respectively corresponding to designated characters in the label sequence with the highest probability, wherein a designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    an assignment unit, configured to take the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    a second determination unit, configured to determine, in the same manner as the masking labels corresponding to the designated character are determined, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
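The masking-label rule in claim 8 reduces to a set difference per unlabeled character: everything in the label type set except the label(s) the highest-probability sequence assigns to it. A minimal sketch with illustrative label names:

```python
# A sketch of the masking rule; LABEL_TYPES is an illustrative task label set.
LABEL_TYPES = {"B-PER", "I-PER", "B-LOC", "I-LOC", "O"}

def masking_labels(designated_labels):
    """designated_labels: the label(s) given to one designated character by the
    highest-probability sequence. Returns that character's masking labels."""
    return LABEL_TYPES - set(designated_labels)

def masking_for_sample(best_labels, unlabeled_positions):
    """Apply the same rule to every unlabeled character in the sample."""
    return {i: masking_labels({best_labels[i]}) for i in unlabeled_positions}

print(masking_labels({"B-LOC"}))   # the other four labels are masked out
```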
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a method for training an entity recognition model, the method comprising:
    acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample;
    acquiring, in the same manner as the label sequence set corresponding to the designated training sample is obtained, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples.
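Read end to end, claim 9 chains the earlier steps: decode the highest-probability sequence, derive the masking labels, build the predicted label sets, then enumerate label paths. The sketch below ties together the helper sketches above (`viterbi`, `masking_for_sample`, `build_label_sequences`); the `Sample` record and label list are hypothetical glue, not part of the claims.

```python
# A sketch of one sample's pass through the pipeline, reusing earlier helpers.
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    chars: list            # the characters of the sample text
    gold: dict             # position -> annotation label (partial annotation)
    emissions: np.ndarray  # (len(chars), n_labels) per-character probabilities

def label_sequence_set(sample, transitions, label_list):
    idx_path = viterbi(sample.emissions, transitions)      # highest-prob sequence
    best = [label_list[j] for j in idx_path]               # indices -> label names
    unlabeled = [i for i in range(len(sample.chars)) if i not in sample.gold]
    masks = masking_for_sample(best, unlabeled)            # claim 10's rule
    label_types = set(label_list)
    tokens = [(c, [sample.gold[i]]) if i in sample.gold    # keep annotations
              else (c, sorted(label_types - masks[i]))     # predicted label set
              for i, c in enumerate(sample.chars)]
    return build_label_sequences(tokens)                   # claims 11-12's paths
```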
10. The computer device according to claim 9, wherein the step of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
    acquiring the label type set corresponding to all labels in the current entity recognition task;
    determining the designated labels respectively corresponding to designated characters in the label sequence with the highest probability, wherein a designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    determining, in the same manner as the masking labels corresponding to the designated character are determined, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
11. The computer device according to claim 9, wherein the step of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
    deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
    labeling the designated character with the predicted label set;
    labeling, following the labeling process for the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and
    combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample, to form the label sequence set.
12. The computer device according to claim 11, wherein the step of combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample to form the label sequence set comprises:
    determining whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
    if so, obtaining the annotation label corresponding to the first character;
    determining whether a second character arranged after the first character is a labeled character;
    if not, obtaining the predicted label set corresponding to the second character;
    labeling the second character with each label in the predicted label set corresponding to the second character, and connecting each of these labels to the annotation label corresponding to the first character, to form the label paths from the first character to the second character;
    forming all label paths corresponding to all characters in the designated training sample in the same manner as the label paths from the first character to the second character are formed; and
    taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample, to form the label sequence set.
13. The computer device according to claim 9, wherein the step of training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples comprises:
    inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples;
    setting the label sequence probabilities, in one-to-one correspondence, as the weights assigned to the respective label sequences;
    composing training data from the label sequences carrying the assigned weights and the training samples respectively corresponding to the label sequences; and
    inputting the training data into the entity recognition model and training until the preset loss function converges.
14. The computer device according to claim 13, wherein the step of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
    dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first data part and a second data part;
    inputting the first data part into the cross-validation model and training to obtain a first validation model, and inputting the second data part into the cross-validation model and training to obtain a second validation model; and
    inputting the second data part into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second data part, and inputting the first data part into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first data part.
15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a method for training an entity recognition model, the method comprising:
    acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample;
    acquiring, in the same manner as the label sequence set corresponding to the designated training sample is obtained, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples.
16. The computer-readable storage medium according to claim 15, wherein the step of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
    acquiring the label type set corresponding to all labels in the current entity recognition task;
    determining the designated labels respectively corresponding to designated characters in the label sequence with the highest probability, wherein a designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    determining, in the same manner as the masking labels corresponding to the designated character are determined, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
17. The computer-readable storage medium according to claim 15, wherein the step of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
    deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
    labeling the designated character with the predicted label set;
    labeling, following the labeling process for the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and
    combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample, to form the label sequence set.
18. The computer-readable storage medium according to claim 17, wherein the step of combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample to form the label sequence set comprises:
    determining whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
    if so, obtaining the annotation label corresponding to the first character;
    determining whether a second character arranged after the first character is a labeled character;
    if not, obtaining the predicted label set corresponding to the second character;
    labeling the second character with each label in the predicted label set corresponding to the second character, and connecting each of these labels to the annotation label corresponding to the first character, to form the label paths from the first character to the second character;
    forming all label paths corresponding to all characters in the designated training sample in the same manner as the label paths from the first character to the second character are formed; and
    taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample, to form the label sequence set.
19. The computer-readable storage medium according to claim 15, wherein the step of training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples comprises:
    inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples;
    setting the label sequence probabilities, in one-to-one correspondence, as the weights assigned to the respective label sequences;
    composing training data from the label sequences carrying the assigned weights and the training samples respectively corresponding to the label sequences; and
    inputting the training data into the entity recognition model and training until the preset loss function converges.
20. The computer-readable storage medium according to claim 19, wherein the step of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
    dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first data part and a second data part;
    inputting the first data part into the cross-validation model and training to obtain a first validation model, and inputting the second data part into the cross-validation model and training to obtain a second validation model; and
    inputting the second data part into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second data part, and inputting the first data part into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first data part.
PCT/CN2021/097543 2020-12-31 2021-05-31 Method and apparatus for training entity recognition model, and device and storage medium WO2022142122A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011633046.9 2020-12-31
CN202011633046.9A CN112733911B (en) 2020-12-31 2020-12-31 Training method, device, equipment and storage medium of entity recognition model

Publications (1)

Publication Number Publication Date
WO2022142122A1 (en) 2022-07-07

Family

ID=75608419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097543 WO2022142122A1 (en) 2020-12-31 2021-05-31 Method and apparatus for training entity recognition model, and device and storage medium

Country Status (2)

Country Link
CN (1) CN112733911B (en)
WO (1) WO2022142122A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733911B (en) * 2020-12-31 2023-05-30 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of entity recognition model
CN113642635B (en) * 2021-08-12 2023-09-15 百度在线网络技术(北京)有限公司 Model training method and device, electronic equipment and medium
CN114399766B (en) * 2022-01-18 2024-05-10 平安科技(深圳)有限公司 Optical character recognition model training method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717410B (en) * 2018-05-17 2022-05-20 达而观信息科技(上海)有限公司 Named entity identification method and system
CN109299458B (en) * 2018-09-12 2023-03-28 广州多益网络股份有限公司 Entity identification method, device, equipment and storage medium
US11551136B2 (en) * 2018-11-14 2023-01-10 Tencent America LLC N-best softmax smoothing for minimum bayes risk training of attention based sequence-to-sequence models
CN111368544B (en) * 2020-02-28 2023-09-19 中国工商银行股份有限公司 Named entity identification method and device
CN111553164A (en) * 2020-04-29 2020-08-18 平安科技(深圳)有限公司 Training method and device for named entity recognition model and computer equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286629A1 (en) * 2014-04-08 2015-10-08 Microsoft Corporation Named entity recognition
CN108363701A (en) * 2018-04-13 2018-08-03 达而观信息科技(上海)有限公司 Name entity recognition method and system
CN110070183A (en) * 2019-03-11 2019-07-30 中国科学院信息工程研究所 A kind of the neural network model training method and device of weak labeled data
CN111611802A (en) * 2020-05-21 2020-09-01 苏州大学 Multi-field entity identification method
CN112733911A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Entity recognition model training method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036869A (en) * 2023-10-08 2023-11-10 之江实验室 Model training method and device based on diversity and random strategy
CN117036869B (en) * 2023-10-08 2024-01-09 之江实验室 Model training method and device based on diversity and random strategy

Also Published As

Publication number Publication date
CN112733911B (en) 2023-05-30
CN112733911A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022142122A1 (en) Method and apparatus for training entity recognition model, and device and storage medium
CN110442870B (en) Text error correction method, apparatus, computer device and storage medium
JP6222821B2 (en) Error correction model learning device and program
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
WO2022142041A1 (en) Training method and apparatus for intent recognition model, computer device, and storage medium
WO2022105083A1 (en) Text error correction method and apparatus, device, and medium
CN110704576B (en) Text-based entity relationship extraction method and device
CN111666775B (en) Text processing method, device, equipment and storage medium
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN110688853A (en) Sequence labeling method and device, computer equipment and storage medium
CN115293138B (en) Text error correction method and computer equipment
CN110808049B (en) Voice annotation text correction method, computer device and storage medium
CN112002310B (en) Domain language model construction method, device, computer equipment and storage medium
CN111400340A (en) Natural language processing method and device, computer equipment and storage medium
WO2022142123A1 (en) Training method and apparatus for named entity model, device, and medium
CN113223504B (en) Training method, device, equipment and storage medium of acoustic model
CN114528387A (en) Deep learning conversation strategy model construction method and system based on conversation flow bootstrap
CN117094325B (en) Named entity identification method in rice pest field
CN111507103B (en) Self-training neural network word segmentation model using partial label set
CN112347780A (en) Judicial fact finding generation method, device and medium based on deep neural network
CN112906398A (en) Sentence semantic matching method, system, storage medium and electronic equipment
CN115098722B (en) Text and image matching method and device, electronic equipment and storage medium
JP2015141253A (en) Voice recognition device and program
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912883

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912883

Country of ref document: EP

Kind code of ref document: A1