WO2022142122A1 - Method and apparatus for training entity recognition model, and device and storage medium - Google Patents


Info

Publication number
WO2022142122A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
training
designated
text
training samples
Prior art date
Application number
PCT/CN2021/097543
Other languages
French (fr)
Chinese (zh)
Inventor
阮鸿涛
郑立颖
胡沛弦
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022142122A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • The present application relates to the field of natural language processing in artificial intelligence, and in particular to a method, apparatus, device and storage medium for training an entity recognition model.
  • The training of entity recognition models relies on a large amount of fully annotated data, but high-quality annotation usually requires highly professional annotators, which makes training data difficult and expensive to obtain.
  • To save costs, an entity recognition model can be trained with incompletely labeled data, in which only some entities in the text are labeled while the remaining unlabeled content may be either non-entities or entities.
  • To improve the effect of training with incompletely labeled data, all label sequences consistent with the existing annotations are usually taken into account during model training: the probability distribution over all possible label sequences is estimated and integrated into the training, so that the model can attend to every possible label sequence.
  • The main purpose of the present application is to provide a method for training an entity recognition model, aiming to solve the technical problem that the number of candidate label sequences grows exponentially with the amount of unlabeled content in incompletely labeled data, so that the model's attention is scattered and cannot focus on the true label sequence.
  • To this end, the present application proposes a method for training an entity recognition model, including: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
  • the present application also provides a training device for an entity recognition model, including:
  • a first acquisition module configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
  • an input module, configured to input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • a calculation module, configured to calculate the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • a determination module, configured to determine the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability;
  • an obtaining module, configured to obtain the label sequence set corresponding to the designated training sample according to the masking labels;
  • the second obtaining module is configured to obtain the label sequence sets corresponding to all the training samples in the incompletely labeled data set according to the obtaining method of the label sequence sets corresponding to the designated training samples;
  • the training module is used for training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of the preset loss function.
  • the present application also provides a computer device, including a memory and a processor, the memory stores a computer program, and the processor implements a training method for an entity recognition model when the processor executes the computer program; wherein,
  • the method for training an entity recognition model includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • the present application further provides a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements a method for training an entity recognition model; the method includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • This application predicts the label probabilities of unlabeled text with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character. This effectively reduces the number of label sequences whose probability distribution must be estimated, makes it easier for the entity recognition model to identify the true label sequence, and reduces computational complexity.
  • FIG. 1 is a schematic flowchart of a training method for an entity recognition model according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a training system for an entity recognition model according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of an internal structure of a computer device according to an embodiment of the present application.
  • Referring to FIG. 1, a method for training an entity recognition model according to an embodiment of the present application includes:
  • S1 Acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
  • S2 Input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • S3 Calculate the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • S4 Determine, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample;
  • S5 Obtain the label sequence set corresponding to the designated training sample according to the masking labels;
  • S6 Obtain, according to the manner of obtaining the label sequence set corresponding to the designated training sample, the label sequence sets corresponding to all training samples in the incompletely labeled data set;
  • S7 Train the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
  • In the embodiments of the present application, an incompletely labeled designated training sample is a text sentence sample used for entity recognition in which some characters are not labeled with an entity label type.
  • entity label types vary according to the task domain.
  • entity tags include, but are not limited to, company, address, person's name, time, organization, and the like.
  • An incompletely labeled training sample is denoted x = (x_1, x_2, …, x_n), where each x_i (i = 1, 2, …, n) represents one character in the text sequence of the training sample; its incompletely labeled label sequence is denoted y^u = (-, y_2, -, …, -), where y_i ∈ Y, Y is the entity label set, y_i is the label actually assigned to x_i by the annotator, and "-" represents a position not labeled as an entity, i.e. the label at this position can be filled with a non-entity label or any entity label in the entity label set of the current task domain. Enumerating all such choices for the unlabeled positions yields the set C(y^u) of all complete label sequences the training sample may have; if the true complete label sequence of x is y = (y_1, y_2, …, y_n), then y is one of the complete label sequences in C(y^u).
  • The present application uses the trained probability prediction model to predict the probability that the label of character x_i of the training text x is y'_i, with y'_i ∈ Y, and then uses the Viterbi algorithm to compute the probability of each complete label sequence of x, i.e. the probability that the complete label sequence of x is y' when each x_i takes label y'_i, for y' ∈ C(y^u). Through the Viterbi algorithm, the complete label sequence with the highest probability is obtained, and the masking labels are determined for masking and deletion.
  • The process of obtaining the highest-probability complete label sequence through the Viterbi algorithm is as follows: given the state space of a hidden Markov model (HMM) and the total number of states, the probability of the most likely label sequence is derived from the label probabilities of the initial character and the state transition probabilities of the label sequences from the initial character to the subsequent characters. The Viterbi path is obtained by saving the label probabilities of each character used in the recursion. Then, by enumerating the labels that have appeared for the unlabeled characters, the label sequences of the training sample are combined accordingly, and each label sequence together with the training sample forms the training data used to train the entity recognition model, as in the sketch below.
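  • By way of illustration only, the following is a minimal sketch of the Viterbi decoding step described above, assuming per-character label probabilities from the probability prediction model and a label transition matrix are available as arrays; the function name and array layout are illustrative, not part of the disclosed method.

```python
import numpy as np

def viterbi(emission, transition, initial):
    """Return the highest-probability label sequence and its log-probability.

    emission:   (n_chars, n_labels) per-character label probabilities
                from the probability prediction model.
    transition: (n_labels, n_labels) label-to-label transition probabilities.
    initial:    (n_labels,) label probabilities for the first character.
    """
    log_e = np.log(emission + 1e-12)
    log_t = np.log(transition + 1e-12)
    n, k = emission.shape
    score = np.log(initial + 1e-12) + log_e[0]  # best log-prob of paths ending in each label
    back = np.zeros((n, k), dtype=int)          # backpointers for path recovery
    for i in range(1, n):
        cand = score[:, None] + log_t + log_e[i][None, :]  # (prev label, current label)
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]                # trace the Viterbi path backwards
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(score.max())
```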
  • Both the above entity recognition model and the probability prediction model use the BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) architecture; the difference lies in the model parameters and output variables used.
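  • For context, a tagger of the BERT+CRF kind named above could be sketched as follows, assuming the HuggingFace transformers and pytorch-crf packages; the class name, encoder checkpoint, and interface are illustrative assumptions, and each of the two models would instantiate this architecture with its own weights.

```python
import torch
from torch import nn
from transformers import AutoModel
from torchcrf import CRF  # pytorch-crf package

class BertCrfTagger(nn.Module):
    """Minimal BERT+CRF sequence tagger (illustrative sketch)."""

    def __init__(self, num_labels, encoder_name="bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        if labels is not None:  # training: negative log-likelihood under the CRF
            return -self.crf(emissions, labels, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())  # best label paths
```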
  • The present application thus predicts the label probabilities of unlabeled text with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character, effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and reducing computational complexity.
  • In one embodiment, the step S4 of determining the masking labels corresponding to all unlabeled texts in the designated training sample includes:
  • S41 Acquire the label type set corresponding to all labels in the current entity recognition task;
  • S42 Determine the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample, and the designated labels are one or more labels in the label type set;
  • S43 Use the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text;
  • S44 Determine, in the same manner as the masking labels corresponding to the designated text, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • The Viterbi algorithm in the above k-best case means that each node in the algorithm retains not just the single best value but the K best values, i.e. the top K highest-probability partial paths.
  • In this way, the label types that appear for each previously unlabeled character are determined, and the entity labels that do not appear for an unlabeled character in any of the k complete label sequences are used as masking labels to be masked and deleted, as sketched below.
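  • A sketch of the k-best variant, with the same assumed inputs and illustrative names as the previous sketch: each position keeps the K highest-scoring partial paths per label instead of only the single best one.

```python
import heapq
import numpy as np

def viterbi_k_best(emission, transition, initial, k=3):
    """Return the k highest-probability complete label sequences."""
    log_e = np.log(emission + 1e-12)
    log_t = np.log(transition + 1e-12)
    n, num_labels = emission.shape
    # paths[j] holds up to k (log_prob, sequence) pairs ending in label j.
    paths = {j: [(float(np.log(initial[j] + 1e-12) + log_e[0, j]), [j])]
             for j in range(num_labels)}
    for i in range(1, n):
        new_paths = {}
        for j in range(num_labels):
            cand = [(lp + log_t[pj, j] + log_e[i, j], seq + [j])
                    for pj in range(num_labels) for lp, seq in paths[pj]]
            new_paths[j] = heapq.nlargest(k, cand, key=lambda c: c[0])
        paths = new_paths
    finals = [c for j in range(num_labels) for c in paths[j]]
    return heapq.nlargest(k, finals, key=lambda c: c[0])
```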
  • In one embodiment, the step S5 of obtaining the label sequence set corresponding to the designated training sample according to the masking labels includes:
  • The set K(x) of the k highest-probability label sequences is used to determine the masking labels and construct a predicted label set.
  • For an unlabeled character x_j, its possible labels are a non-entity label or a label in the union of the labels that x_j takes across the sequences in K(x); every entity label outside this union is a masking label.
  • That is, the predicted label set is obtained by masking and deleting the entity labels that never appear at the corresponding position in the k label sequences.
  • For example, if the labels appearing at the position of character x_j in K(x) include "company", "person's name" and "organization", while "address" and "time" do not appear, then "address" and "time" are masking labels, and the predicted label set consists of the labels "company", "person's name" and "organization".
  • This reduces the number of candidate labels for each unlabeled character, which can only be selected from its predicted label set: for each "-" position in y^u = (-, y_2, -, …, -), one label is selected from the corresponding predicted label set to form a complete label sequence. By deleting the masking labels, the number of candidate labels for each unlabeled character is reduced, and the resulting set of possible label sequences is denoted S(y^u, K(x)); the number of label sequences in S(y^u, K(x)) is much smaller than the number in C(y^u), as illustrated in the sketch below.
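  • The masking-and-reduction step might be sketched as follows, assuming the k best sequences K(x) are available as lists of per-character labels; the helper name and the "O" non-entity tag are assumptions for illustration.

```python
def predicted_label_sets(k_best_sequences, unlabeled_positions, label_set, non_entity="O"):
    """Build the predicted label set for every unlabeled position.

    Labels that never appear at position j across K(x) become masking labels
    and are deleted from the label type set; the non-entity label always
    remains a candidate.
    """
    predicted = {}
    for j in unlabeled_positions:
        seen = {seq[j] for seq in k_best_sequences}  # union of labels of x_j over K(x)
        masking = set(label_set) - seen              # masking (cover) labels to delete
        predicted[j] = (set(label_set) - masking) | {non_entity}
    return predicted
```

  • With the example above, masking would remove "address" and "time" at that position, leaving "company", "person's name" and "organization" plus the non-entity label as the only candidates.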
  • In one embodiment, the step S53 of forming the label sequence set includes:
  • S531 Determine whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
  • S532 If so, obtain the labeling label corresponding to the first character;
  • S533 Determine whether a second character arranged after the first character is a labeled character;
  • S534 If not, obtain the predicted label set corresponding to the second character;
  • S535 Mark each label of the predicted label set corresponding to the second character on the second character, and connect each to the labeling label corresponding to the first character, to form the label paths from the first character to the second character;
  • S536 Form, in the same manner as the label paths from the first character to the second character, all label paths corresponding to all characters in the designated training sample;
  • S537 Use all label paths corresponding to all characters in the designated training sample as all the label sequences, forming the label sequence set.
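  • A sketch of how these label paths could be enumerated into the reduced label sequence set S(y^u, K(x)), assuming the predicted label sets from the previous sketch; names and the "-" placeholder convention are illustrative.

```python
from itertools import product

def label_sequence_set(y_u, predicted):
    """Enumerate all complete label sequences for one training sample.

    y_u:       incompletely labeled sequence, e.g. ["-", "B-PER", "-", "-"].
    predicted: {position: predicted label set} for each "-" position.
    Labeled positions keep their annotated label; each "-" position ranges
    over its reduced predicted label set, forming the label paths.
    """
    choices = [sorted(predicted[i]) if tag == "-" else [tag]
               for i, tag in enumerate(y_u)]
    return [list(seq) for seq in product(*choices)]
```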
  • In one embodiment, the step S7 of training the entity recognition model through the label sequence sets corresponding to all training samples includes:
  • S71 Input the label sequence sets corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples;
  • S72 Set the label sequence probabilities corresponding to the label sequences of all training samples as the assigned weights of the corresponding label sequences in one-to-one correspondence;
  • S73 Form training data from the label sequences carrying the assigned weights and the training samples corresponding to each label sequence;
  • S74 Input the training data into the entity recognition model and train until the preset loss function converges.
  • The cross-validation model is also a BERT+CRF architecture; the difference is that the model parameters and output variables used differ from those of the entity recognition model.
  • In one embodiment, the preset loss function is

    loss(w) = - Σ_x Σ_{y' ∈ S(y^u, K(x))} q(y'|x) · log p(y'|x; w)

    where w is the model parameter of the entity recognition model and q(y'|x) is the estimated distribution weight of the label sequence y' for the sample x. That is, the present application estimates the probability distribution p(y'|x) of each label sequence in the reduced label sequence set, and the estimate q(y'|x) of each label sequence is taken as the distribution weight of each possible label sequence corresponding to each x.
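  • A minimal sketch of this weighted objective for a single sample, assuming the entity recognition model exposes log p(y'|x; w) for each candidate sequence; the function and its inputs are illustrative.

```python
import numpy as np

def weighted_nll_loss(seq_log_probs, weights):
    """Weighted negative log-likelihood over a reduced label sequence set.

    seq_log_probs: log p(y'|x; w) from the entity recognition model (e.g. CRF
                   sequence scores), one entry per sequence y' in S(y_u, K(x)).
    weights:       q(y'|x), the distribution weights of the candidate sequences.
    """
    return -float(np.sum(np.asarray(weights) * np.asarray(seq_log_probs)))
```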
  • step S71 of inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples respectively includes:
  • S711 Divide the label sequences in the label sequence sets corresponding to all training samples into a first part of data and a second part of data equally;
  • S712 Input the first part of the data into the cross-validation model, obtain a first validation model through training, input the second part of the data into the cross-validation model, and obtain a second validation model through training;
  • S713 Input the second part of the data into the first verification model, obtain the label sequence probability corresponding to each label sequence in the second part of the data, and input the first part of the data into the second verification model, The label sequence probability corresponding to each label sequence in the first part of the data is obtained.
  • In a specific embodiment, each possible label sequence of each x is first paired with x to obtain training data consisting of the label sequences of all training samples; the training data is then divided into two equal parts, one used as a training set and the other as a validation set. That is, a sequence labeling model is trained on one part of the training data and used to predict the label sequence probability p(y'|x) of each label sequence in the other part, and vice versa; these held-out probabilities serve as the assigned weights q(y'|x).
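  • The two-fold cross-estimation could be sketched as follows, with train_fn and score_fn as assumed stand-ins for training a sequence labeling model (e.g. BERT+CRF) on one fold and scoring a label sequence under it.

```python
def estimate_weights(pairs, train_fn, score_fn):
    """Two-fold cross-estimation of the weights q(y'|x).

    pairs:    list of (sample, label_sequence) training pairs.
    train_fn: trains a sequence labeling model on a list of pairs.
    score_fn: score_fn(model, sample, seq) -> p(seq | sample) under that model.
    Each half of the data is scored by the model trained on the other half.
    """
    half = len(pairs) // 2
    fold_a, fold_b = pairs[:half], pairs[half:]
    model_a, model_b = train_fn(fold_a), train_fn(fold_b)
    weights_a = [score_fn(model_b, x, y) for x, y in fold_a]  # held out from model_b
    weights_b = [score_fn(model_a, x, y) for x, y in fold_b]  # held out from model_a
    return weights_a + weights_b
```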
  • a training device for an entity recognition model includes:
  • the first acquisition module 1 is used to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
  • the input module 2 is used to input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample;
  • the calculation module 3 is used to calculate, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
  • the determination module 4 is used to determine the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability;
  • the obtaining module 5 is used to obtain the label sequence set corresponding to the designated training sample according to the masking labels;
  • the second obtaining module 6 is configured to obtain the label sequence sets corresponding to all the training samples in the incompletely labeled data set according to the obtaining method of the label sequence sets corresponding to the designated training samples;
  • the training module 7 is used for training the entity recognition model through the label sequence sets corresponding to all the training samples under the constraint of the preset loss function.
  • the determination module 4 includes:
  • an acquisition unit, used to acquire the label type set corresponding to all labels in the current entity recognition task;
  • a first determination unit, used to determine the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample, and the designated labels are one or more labels in the label type set;
  • a masking unit, used to take the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text;
  • a second determination unit, used to determine, in the same manner as the masking labels corresponding to the designated text, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • the obtaining module 5 includes:
  • a deletion unit, used to delete the masking labels from the label type set to obtain the predicted label set corresponding to the designated text;
  • a first labeling unit, used to label the predicted label set on the designated text;
  • a second labeling unit, used to label, according to the labeling process of the predicted label set corresponding to the designated text, the predicted label sets corresponding to all unlabeled texts in the designated training sample;
  • a combining unit, used to combine the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, one-to-one into all the label sequences corresponding to the designated training sample, forming the label sequence set.
  • the combination unit includes:
  • a first judging subunit, used to judge whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
  • a first obtaining subunit, used to obtain the labeling label corresponding to the first character if it is a labeled character;
  • a second judging subunit, used to judge whether a second character arranged after the first character is a labeled character;
  • a second obtaining subunit, used to obtain the predicted label set corresponding to the second character if it is not a labeled character;
  • a labeling subunit, used to mark each label of the predicted label set corresponding to the second character on the second character, and to connect each to the labeling label corresponding to the first character, forming the label paths from the first character to the second character;
  • training module 7 includes:
  • the input unit is used to input the label sequence sets corresponding to all training samples into the cross-validation model, and obtain the label sequence probabilities corresponding to the label sequences of all training samples respectively;
  • a setting unit configured to set the label sequence probabilities corresponding to the label sequences of all training samples to the assigned weights corresponding to each of the label sequences in a one-to-one correspondence
  • a composition unit configured to form training data by combining the label sequence carrying the assigned weight and the training samples corresponding to each of the label sequences respectively;
  • a training unit configured to input the training data into the entity recognition model, and train until the preset loss function converges.
  • the input unit includes:
  • a dividing subunit, used to divide the label sequences in the label sequence sets corresponding to all training samples equally into a first part of data and a second part of data;
  • a training subunit configured to input the first part of the data into the cross-validation model, obtain a first validation model through training, input the second part of the data into the cross-validation model, and obtain a second validation model through training;
  • an input subunit, used to input the second part of the data into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second part of the data, and to input the first part of the data into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first part of the data.
  • an embodiment of the present application further provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3 .
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus, wherein the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the database of the computer device is used to store all the data required for the training process of the entity recognition model.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program when executed by a processor, implements a method of training an entity recognition model.
  • The above processor executes the method for training the entity recognition model, including: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • The above computer device predicts the label probabilities of unlabeled text with the probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character, effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and reducing computational complexity.
  • In one embodiment, the step, executed by the processor, of determining the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability includes: acquiring the label type set corresponding to all labels in the current entity recognition task; determining the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample and the designated labels are one or more labels in the label type set; using the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text; and determining, in the same manner, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • In one embodiment, the step, executed by the above processor, of obtaining the label sequence set corresponding to the designated training sample according to the masking labels includes: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated text; labeling the predicted label set on the designated text; labeling, according to the labeling process of the predicted label set corresponding to the designated text, the predicted label sets corresponding to all unlabeled texts in the designated training sample; and combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, one-to-one into all the label sequences corresponding to the designated training sample, forming the label sequence set.
  • In one embodiment, the step, executed by the above processor, of combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, into all the label sequences corresponding to the designated training sample to form the label sequence set includes: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, obtaining the labeling label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, obtaining the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character and connecting it to the labeling label of the first character, forming the label paths from the first character to the second character; forming, in the same manner, all label paths corresponding to all characters in the designated training sample; and using all label paths corresponding to all characters in the designated training sample as all the label sequences.
  • In one embodiment, the step of training the entity recognition model through the label sequence sets corresponding to all training samples includes: inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples; setting the label sequence probabilities as the assigned weights of the corresponding label sequences in one-to-one correspondence; forming training data from the weighted label sequences and the training samples corresponding to each label sequence; and inputting the training data into the entity recognition model and training until the preset loss function converges.
  • In one embodiment, the step, executed by the above processor, of inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples includes: dividing the label sequences in the label sequence sets corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of the data into the cross-validation model and training to obtain a first validation model, and inputting the second part of the data into the cross-validation model and training to obtain a second validation model; inputting the second part of the data into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second part of the data, and inputting the first part of the data into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first part of the data.
  • FIG. 3 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • An embodiment of the present application further provides a computer-readable storage medium, which may be a volatile or non-volatile storage medium, on which a computer program is stored; when the computer program is executed by a processor, a method for training an entity recognition model is implemented, including: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled texts in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled texts in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and, under the constraint of a preset loss function, training the entity recognition model through the label sequence sets corresponding to all training samples.
  • The above computer-readable storage medium predicts the label probabilities of unlabeled text with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, then selects the most likely label sequence and uses it to reduce the number of possible labels for each unlabeled character, effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and reducing computational complexity.
  • In one embodiment, the step, executed by the processor, of determining the masking labels corresponding to all unlabeled texts in the designated training sample according to the label sequence with the highest probability includes: obtaining the label type set corresponding to all labels in the current entity recognition task; determining the designated labels corresponding to a designated text in the label sequence group with the highest probability, wherein the designated text is any one of all unlabeled texts in the designated training sample and the designated labels are one or more labels in the label type set; using the labels in the label type set other than the designated labels as the masking labels corresponding to the designated text; and determining, in the same manner, the masking labels corresponding to all unlabeled texts in the designated training sample.
  • In one embodiment, the step, executed by the above processor, of obtaining the label sequence set corresponding to the designated training sample according to the masking labels includes: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated text; labeling the predicted label set on the designated text; labeling, according to the labeling process of the predicted label set corresponding to the designated text, the predicted label sets corresponding to all unlabeled texts in the designated training sample; and combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, one-to-one into all the label sequences corresponding to the designated training sample, forming the label sequence set.
  • In one embodiment, the step, executed by the above processor, of combining the predicted label sets corresponding to all unlabeled texts in the designated training sample, according to the text arrangement order in the designated training sample, into all the label sequences corresponding to the designated training sample to form the label sequence set includes: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, obtaining the labeling label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, obtaining the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character and connecting it to the labeling label of the first character, forming the label paths from the first character to the second character; forming, in the same manner, all label paths corresponding to all characters in the designated training sample; and using all label paths corresponding to all characters in the designated training sample as all the label sequences.
  • In one embodiment, the step of training the entity recognition model through the label sequence sets corresponding to all training samples includes: inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples; setting the label sequence probabilities as the assigned weights of the corresponding label sequences in one-to-one correspondence; forming training data from the weighted label sequences and the training samples corresponding to each label sequence; and inputting the training data into the entity recognition model and training until the preset loss function converges.
  • In one embodiment, the step, executed by the above processor, of inputting the label sequence sets corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities corresponding to the label sequences of all training samples includes: dividing the label sequences in the label sequence sets corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of the data into the cross-validation model and training to obtain a first validation model, and inputting the second part of the data into the cross-validation model and training to obtain a second validation model; inputting the second part of the data into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second part of the data, and inputting the first part of the data into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first part of the data.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the field of natural language processing of artificial intelligence. Disclosed is a method for training an entity recognition model, the method comprising: acquiring a designated training sample that is not completely labeled; inputting the designated training sample into a probability prediction model, so as to obtain label probabilities corresponding to all unlabeled characters in the designated training sample; calculating a label sequence with the highest probability according to the label probabilities respectively corresponding to all the unlabeled characters and by means of a Viterbi algorithm; determining, according to the label sequence with the highest probability, masking labels respectively corresponding to all the unlabeled characters in the designated training sample; obtaining, according to the masking labels, a label sequence set corresponding to the designated training sample; acquiring, according to an acquisition method for the label sequence set corresponding to the designated training sample, label sequence sets respectively corresponding to all training samples in an incompletely labeled data set; and, under the constraint of a preset loss function, training an entity recognition model by means of the label sequence sets respectively corresponding to all the training samples. The actual label sequence can thus be recognized more easily.

Description

Method, apparatus, device and storage medium for training an entity recognition model
This application claims priority to Chinese patent application No. 202011633046.9, filed with the China Patent Office on December 31, 2020 and entitled "Method, Apparatus, Device and Storage Medium for Training an Entity Recognition Model", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of natural language processing in artificial intelligence, and in particular to a method, apparatus, device and storage medium for training an entity recognition model.
Background
The training of entity recognition models relies on a large amount of fully annotated data, but high-quality annotation usually requires highly professional annotators, which makes training data difficult and expensive to obtain. To save costs, an entity recognition model can be trained with incompletely labeled data, in which only some entities in the text are labeled while the remaining unlabeled content may be either non-entities or entities. To improve the effect of training with incompletely labeled data, all label sequences consistent with the existing annotations are usually taken into account during model training: the probability distribution over all possible label sequences is estimated and integrated into the training, so that the model can attend to every possible label sequence. However, the inventors realized that because named entities usually have multiple categories and are sparsely distributed in the text, the number of candidate label sequences grows exponentially with the amount of unlabeled content, so the attention of the named entity model is scattered and it is difficult for it to focus on the true label sequence, which degrades the recognition effect.
Technical Problem
Because named entities usually have multiple categories and are sparsely distributed in the text, the number of candidate label sequences grows exponentially with the amount of unlabeled content, so the attention of the named entity model is scattered and it is difficult for it to focus on the true label sequence, which degrades the recognition effect.
Technical Solution
The main purpose of this application is to provide a method for training an entity recognition model, aiming to solve the technical problem that the number of candidate label sequences grows exponentially with the amount of unlabeled content, so that the model cannot focus on the true label sequence.
In a first aspect, the present application proposes a method for training an entity recognition model, including:
acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample;
calculating the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample;
obtaining the label sequence set corresponding to the designated training sample according to the masking labels;
obtaining, according to the manner of obtaining the label sequence set corresponding to the designated training sample, the label sequence sets corresponding to all training samples in the incompletely labeled data set;
training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In a second aspect, the present application also provides an apparatus for training an entity recognition model, including:
a first acquisition module, configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
an input module, configured to input the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample;
a calculation module, configured to calculate the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
a determination module, configured to determine the masking labels corresponding to all unlabeled characters in the designated training sample according to the label sequence with the highest probability;
an obtaining module, configured to obtain the label sequence set corresponding to the designated training sample according to the masking labels;
a second acquisition module, configured to obtain the label sequence sets corresponding to all training samples in the incompletely labeled data set according to the manner of obtaining the label sequence set corresponding to the designated training sample;
a training module, configured to train the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In a third aspect, the present application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements a method for training an entity recognition model; the method includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements a method for training an entity recognition model; the method includes: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample; calculating the label sequence with the highest probability by the Viterbi algorithm according to those label probabilities; determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample; obtaining the label sequence set corresponding to the designated training sample according to the masking labels; obtaining, in the same manner, the label sequence sets corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
Beneficial Effects
The present application predicts the label probabilities of unlabeled characters with a probability prediction model and, combined with the Viterbi algorithm, obtains the label sequence probabilities of the training sentence, selects the most likely label sequence, and uses it to reduce the number of possible labels for each unlabeled character. This effectively reduces the number of label sequences whose probability distribution must be estimated, makes it easier for the entity recognition model to identify the true label sequence, and reduces computational complexity.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for training an entity recognition model according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a system for training an entity recognition model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, a method for training an entity recognition model according to an embodiment of the present application includes:
S1: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
S2: inputting the designated training sample into a probability prediction model to obtain the label probabilities corresponding to all unlabeled characters in the designated training sample;
S3: calculating the label sequence with the highest probability by the Viterbi algorithm according to the label probabilities corresponding to all unlabeled characters in the designated training sample;
S4: determining, according to the label sequence with the highest probability, the masking labels corresponding to all unlabeled characters in the designated training sample;
S5: obtaining the label sequence set corresponding to the designated training sample according to the masking labels;
S6: obtaining, according to the manner of obtaining the label sequence set corresponding to the designated training sample, the label sequence sets corresponding to all training samples in the incompletely labeled data set;
S7: training the entity recognition model through the label sequence sets corresponding to all training samples under the constraint of a preset loss function.
In the embodiments of the present application, an incompletely labeled designated training sample refers to a text sentence sample for entity recognition in which some characters are not labeled with an entity label type. The entity label types vary with the task domain. For example, in promotional text describing a company's business, the entity labels include, but are not limited to, company, address, person name, time and organization.
The present application constructs the incompletely labeled data set according to how complete the manual labeling of each sample is. Suppose an incompletely labeled training sample is denoted x = (x_1, x_2, …, x_n), where each x_i (i = 1, 2, …, n) represents one character in the text sequence of the training sample. The incomplete label sequence corresponding to the incompletely labeled training sample x is denoted y^u = (-, y_2, -, …, -), where y_i ∈ Y and Y denotes the set of entity labels; y_i is the label actually assigned to x_i by the annotator, and '-' denotes a position not marked as an entity, i.e., a position that may be filled with the non-entity label or any entity label in the entity label set of the current task domain. By selecting, for every position not marked as an entity, every possible label from the entity label set of the current task domain, all complete label sequences that the training sample may have are combined into the set C(y^u). If the true complete label sequence of x is y = (y_1, y_2, …, y_n), then y is one of the complete label sequences in C(y^u).
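For illustration only, the following Python sketch enumerates C(y^u) for a toy sample; the label names and the helper function are assumptions made for this example and are not part of the embodiments.

```python
from itertools import product

# Toy label set Y; "-" marks positions whose label was not annotated.
Y = {"O", "COMPANY", "PERSON"}

def all_complete_sequences(y_u, label_set):
    """Enumerate C(y_u): every way of filling each '-' position of the
    partial label sequence with a label from the task's label set."""
    slots = [sorted(label_set) if lab == "-" else [lab] for lab in y_u]
    return [list(seq) for seq in product(*slots)]

seqs = all_complete_sequences(("-", "COMPANY", "-"), Y)
print(len(seqs))  # 3 * 1 * 3 = 9; grows exponentially in the unlabeled positions
```

The exponential growth of C(y^u) in the number of unlabeled positions is what motivates pruning it with masking labels in the steps below.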
In the present application, the trained probability prediction model predicts the probability that the label of character x_i of the training text x is y′_i, with y′_i ∈ Y; the Viterbi algorithm then computes the probability corresponding to each complete label sequence of x, i.e., the probability that the complete label sequence of x is y′ when each x_i takes label y′_i, with y′ ∈ C(y^u). The Viterbi algorithm is used to obtain the complete label sequence with the highest probability, and the masking labels are determined and masked out (deleted). The process of obtaining the highest-probability complete label sequence with the Viterbi algorithm is as follows: given the state space of a hidden Markov model (HMM) with a known number of states, the probability of the most likely label sequence is produced from the label probabilities of the initial character and the state transition probabilities of the label sequences from the initial character to the other characters; the Viterbi path is obtained by saving the label probabilities of each character used in the recursion. Then, by exhaustively combining the labels that have appeared for the unlabeled characters, the label sequences of the training sample are formed, and these label sequences together with the training sample constitute the training data used to train the entity recognition model. Both the entity recognition model and the probability prediction model use the BERT (Bidirectional Encoder Representations from Transformers) + CRF (Conditional Random Field) model architecture; they differ in the model parameters and output variables used.
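By way of an illustrative sketch only, not a definitive implementation of the embodiments, the following Python function performs the first-order Viterbi decoding described above. The emission scores are assumed to be per-character label log-probabilities such as those from the probability prediction model, the transition scores are assumed to come from, e.g., a CRF layer, and all function and variable names are hypothetical.

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray):
    """Return the highest-probability label sequence as label indices.

    emissions:   (n, L) log-probabilities of each of L labels at each
                 of n positions (e.g. from the probability prediction model).
    transitions: (L, L) log-probabilities of moving from label a to label b.
    """
    n, L = emissions.shape
    score = emissions[0].copy()            # best log-score ending in each label
    backptr = np.zeros((n, L), dtype=int)  # best predecessor label per step
    for t in range(1, n):
        # cand[a, b]: score of extending a path ending in label a with label b
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # trace the Viterbi path backwards from the best final label
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

Saving only the per-step best predecessors (the back-pointers) is what the recursion in the description above refers to; the full path is recovered by a single backward pass.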
In the present application, the probability prediction model predicts the label probabilities of the unlabeled characters and, in combination with the Viterbi algorithm, the probabilities of the label sequences corresponding to the training sentence are obtained; the most likely label sequences are then selected and, based on them, the number of possible labels of each unlabeled character is reduced. This effectively reduces the number of label sequences whose probability distribution must be estimated, makes it easier for the entity recognition model to identify the true label sequence, and lowers the computational complexity.
Further, the step S4 of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
S41: acquiring the label type set corresponding to all labels in the current entity recognition task;
S42: determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
S43: taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character;
S44: determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
In the embodiments of the present application, the k-best variant of the Viterbi algorithm is used to obtain the k complete label sequences with the highest probability, denoted K(x) = {K_i(x)}, i = 1, 2, …, k, where the i-th complete label sequence is K_i(x) = [K_i(x_1), K_i(x_2), …, K_i(x_n)]. In the k-best Viterbi algorithm, each node keeps not only the single best value but the k best values, i.e., the top-k entries of the sorted candidates (the k smallest costs when scores are expressed as negative log-probabilities). The k complete label sequences with the highest probability are then used to determine the label types with which each previously unlabeled character has appeared, and the entity labels with which an unlabeled character never appears in the k complete label sequences are taken as masking labels and are masked out (deleted).
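A minimal sketch of the k-best variant is given below, assuming the same emission and transition log-probabilities as in the earlier sketch; each position keeps the k best partial paths ending in each label instead of only one. This simple list-based formulation is for illustration and is not the most efficient realization.

```python
import heapq
import numpy as np

def viterbi_k_best(emissions: np.ndarray, transitions: np.ndarray, k: int):
    """Return the k highest-scoring complete label sequences K(x)
    as (log_score, path) pairs, best first."""
    n, L = emissions.shape
    # paths[b]: top-k partial hypotheses (log_score, path) ending in label b
    paths = {b: [(float(emissions[0, b]), [b])] for b in range(L)}
    for t in range(1, n):
        new_paths = {}
        for b in range(L):
            cand = [
                (s + transitions[a, b] + emissions[t, b], p + [b])
                for a in range(L)
                for (s, p) in paths[a]
            ]
            # keep only the k best partial paths ending in label b
            new_paths[b] = heapq.nlargest(k, cand, key=lambda c: c[0])
        paths = new_paths
    finals = [c for b in range(L) for c in paths[b]]
    return heapq.nlargest(k, finals, key=lambda c: c[0])
```

Keeping k hypotheses per node instead of one is the only change relative to standard Viterbi decoding; the asymptotic cost grows by roughly a factor of k.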
Further, the step S5 of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
S51: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
S52: marking the predicted label set on the designated character;
S53: labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample;
S54: combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
In the embodiments of the present application, the masking labels are determined and the predicted label sets are constructed from K(x). For an unlabeled character x_j in x, its possible labels are the non-entity label or a label in the union ∪_{i=1,…,k} K_i(x_j), i.e., the union of the labels that the character x_j takes across the sequences in K(x). The predicted label set is obtained by masking out (deleting) the entity labels that never appear at the corresponding position in any of the k label sequences. For example, if the labels with which the character x_j appears include "company", "person name" and "organization", while "address" and "time" never appear for x_j, then "address" and "time" are masking labels, and the predicted label set consists of the labels "company", "person name" and "organization". The number of optional labels for each unlabeled character is thereby reduced, and a label may only be chosen from the predicted label set. For each '-' position in y^u = (-, y_2, -, …, -), selecting one label from the corresponding predicted label set yields one complete label sequence; deleting the masking labels reduces the number of optional labels for every unlabeled character. The set of possible label sequences finally determined in this way is denoted S(y^u, K(x)), and the number of label sequences in S(y^u, K(x)) is far smaller than the number of label sequences in C(y^u).
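A minimal Python sketch of this construction is given below, assuming the k best sequences K(x), a set of label types that includes a non-entity label "O", and a dictionary of manually labeled positions; all names are illustrative rather than part of the embodiments.

```python
def predicted_label_sets(k_best_paths, label_set, labeled_positions):
    """Build, for every position, the set of labels that may still be chosen.

    k_best_paths:      the k complete label sequences K_i(x) (lists of labels)
    label_set:         set of all label types Y of the current task, incl. "O"
    labeled_positions: {position index: manually annotated label}
    """
    n = len(k_best_paths[0])
    sets = []
    for j in range(n):
        if j in labeled_positions:                 # annotated characters keep their label
            sets.append({labeled_positions[j]})
            continue
        seen = {path[j] for path in k_best_paths}  # union of labels of x_j over K(x)
        seen.add("O")                              # the non-entity label stays allowed
        masking = label_set - seen                 # masking labels: never seen at j
        sets.append(label_set - masking)           # i.e. the predicted label set
    return sets
```

For the example above, with label_set = {"O", "company", "person name", "organization", "address", "time"}, an unlabeled x_j seen only with "company", "person name" and "organization" across K(x) would keep those three labels plus "O", while "address" and "time" would be masked out.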
Further, the step S54 of combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises:
S531: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
S532: if so, acquiring the annotation label corresponding to the first character;
S533: judging whether a second character arranged after the first character is a labeled character;
S534: if not, acquiring the predicted label set corresponding to the second character;
S535: marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character;
S536: forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample;
S537: taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
In the embodiments of the present application, the case of a labeled character followed by an unlabeled character is described in detail as an example; this does not restrict the arrangement of the characters, since the label acquisition process is similar for the other combinations of labeled and unlabeled characters. Between a labeled character and the unlabeled character that follows it, as many label paths are formed as there are labels in the predicted label set corresponding to the unlabeled character. By applying the process of forming the label paths between the first character and the second character to all labeled and unlabeled characters in the training sample, all label paths between any two adjacent characters are obtained; these label paths are then connected one after another in the order of the characters, yielding all label sequences and forming the label sequence set, as the sketch after this paragraph illustrates.
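A sketch of this path-forming step is given below: the candidate labels of adjacent positions are connected and the paths are extended position by position in text order, which amounts to an incremental Cartesian product over the per-position label sets (a single annotation label for labeled characters, the predicted label set for unlabeled ones). The function name is illustrative.

```python
def compose_label_sequences(per_position_sets):
    """Connect the candidate labels of adjacent characters into label paths
    and extend the paths position by position, in the order of the text,
    yielding every label sequence in S(y_u, K(x))."""
    paths = [[]]
    for candidates in per_position_sets:          # text order of the sample
        paths = [path + [label] for path in paths for label in candidates]
    return paths

# e.g. per_position_sets = predicted_label_sets(...); a five-character sample
# whose three unlabeled positions keep 3, 2 and 2 candidate labels yields
# 3*2*2 = 12 sequences, far fewer than |Y|**3 without masking.
```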
Further, the step S7 of training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises:
S71: inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples;
S72: setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences;
S73: composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences;
S74: inputting the training data into the entity recognition model and training until the preset loss function converges.
In the embodiments of the present application, the cross-validation model also uses the BERT+CRF model architecture; it differs from the entity recognition model in the model parameters and output variables used. The preset loss function is

min_w Σ_x Σ_{y′ ∈ S(y^u, K(x))} −q(y′|x) · log p_w(y′|x),

where w denotes the model parameters of the entity recognition model and q(y′|x) denotes the assignment weight. The present application estimates, with the cross-validation model, the probability distribution p(y′|x) of each label sequence in the label sequence set, where y′ ∈ S(y^u, K(x)), and from it obtains the probability distribution of each label sequence in the label sequence set as

q(y′|x) = p(y′|x)^(1/T) / Σ_{y″ ∈ S(y^u, K(x))} p(y″|x)^(1/T),

where T is a temperature parameter with T > 0. The estimated probability distribution q(y′|x) over the label sequence set is used as the assignment weight of each possible label sequence corresponding to each x.
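A sketch of the weighting computation is given below, assuming the estimated sequence probabilities are available as log-probabilities; the function names are illustrative and not part of the embodiments.

```python
import numpy as np

def assignment_weights(log_p: np.ndarray, T: float) -> np.ndarray:
    """Turn estimated log-probabilities log p(y'|x) of the sequences in
    S(y_u, K(x)) into assignment weights q(y'|x) with temperature T > 0."""
    logits = log_p / T            # p(y'|x)**(1/T) in log space
    logits -= logits.max()        # shift for numerical stability
    q = np.exp(logits)
    return q / q.sum()            # normalize over the label sequence set

def sample_loss(log_p_model: np.ndarray, q: np.ndarray) -> float:
    """Contribution of one sample x to the preset loss:
    the sum over y' of -q(y'|x) * log p_w(y'|x)."""
    return float(-(q * log_p_model).sum())
```

With T = 1 the weights equal the estimated distribution; larger T makes the weights more uniform over the label sequence set, while smaller T concentrates them on the most probable sequences.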
Further, the step S71 of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
S711: dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data;
S712: inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model;
S713: inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
In the embodiments of the present application, when cross-validation is performed with the cross-validation model, the assignment weight of each possible label sequence of each x is first matched one-to-one with the corresponding x to obtain the training data composed of the label sequences of all training samples; the training data are then divided equally into two parts, one used as a training set and the other as a validation set. That is, a sequence labeling model is trained on one part of the training data, and the trained sequence labeling model predicts the label sequence probability p(y′|x) of each x in the other part of the training data; a sketch of this two-fold estimation follows.
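The two-fold cross-estimation can be sketched as follows; train_fn and predict_fn stand for training a BERT+CRF sequence labeling model and scoring the label sequences of a sample with it, and are placeholders rather than part of the embodiments.

```python
import numpy as np

def cross_estimate(samples, train_fn, predict_fn, seed: int = 0):
    """Split the training data into two equal folds, train one validation
    model per fold, and score each fold's label sequences with the model
    trained on the other fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    fold_a, fold_b = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    model_a = train_fn([samples[i] for i in fold_a])   # first validation model
    model_b = train_fn([samples[i] for i in fold_b])   # second validation model
    probs = {}
    for i in fold_b:
        probs[i] = predict_fn(model_a, samples[i])     # p(y'|x) for fold B
    for i in fold_a:
        probs[i] = predict_fn(model_b, samples[i])     # p(y'|x) for fold A
    return probs
```

Scoring each half with the model trained on the other half keeps the estimated probabilities p(y′|x) out-of-sample, which avoids the overconfident weights that scoring a sample with a model trained on that same sample would produce.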
Referring to FIG. 2, a training apparatus for an entity recognition model according to an embodiment of the present application comprises:
a first acquisition module 1, configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
an input module 2, configured to input the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
a calculation module 3, configured to calculate, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
a determination module 4, configured to determine, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
an obtaining module 5, configured to obtain, according to the masking labels, the label sequence set corresponding to the designated training sample;
a second acquisition module 6, configured to acquire, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set;
a training module 7, configured to train the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
The relevant explanations of the embodiments of the present application are the same as those of the corresponding parts of the method and are not repeated here.
Further, the determination module 4 comprises:
an acquisition unit, configured to acquire the label type set corresponding to all labels in the current entity recognition task;
a first determination unit, configured to determine the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
a taking unit, configured to take the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character;
a second determination unit, configured to determine, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
Further, the obtaining module 5 comprises:
a deletion unit, configured to delete the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
a first labeling unit, configured to mark the predicted label set on the designated character;
a second labeling unit, configured to label, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample;
a combination unit, configured to combine, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
Further, the combination unit comprises:
a first judgment subunit, configured to judge whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
a first acquisition subunit, configured to acquire, if the first character is a labeled character, the annotation label corresponding to the first character;
a second judgment subunit, configured to judge whether a second character arranged after the first character is a labeled character;
a second acquisition subunit, configured to acquire, if the second character is not a labeled character, the predicted label set corresponding to the second character;
a marking subunit, configured to mark each label of the predicted label set corresponding to the second character on the second character, and connect each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character;
a formation subunit, configured to form, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample;
a taking subunit, configured to take all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
Further, the training module 7 comprises:
an input unit, configured to input the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples;
a setting unit, configured to set the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences;
a composition unit, configured to compose training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences;
a training unit, configured to input the training data into the entity recognition model and train until the preset loss function converges.
Further, the input unit comprises:
an equal-division subunit, configured to divide the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data;
a training subunit, configured to input the first part of data into the cross-validation model and train to obtain a first validation model, and input the second part of data into the cross-validation model and train to obtain a second validation model;
an input subunit, configured to input the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and input the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
Referring to FIG. 3, an embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device comprises a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is configured to store all data required for the training process of the entity recognition model. The network interface of the computer device is configured to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method for an entity recognition model.
The processor executes the training method for the entity recognition model, comprising: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample; obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample; acquiring, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
The above computer device predicts the label probabilities of unlabeled characters with a probability prediction model and, in combination with the Viterbi algorithm, obtains the probabilities of the label sequences corresponding to a training sentence; it then selects the most likely label sequences and, based on them, reduces the number of possible labels for each unlabeled character, thereby effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and lowering the computational complexity.
In one embodiment, the step in which the processor determines, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises: acquiring the label type set corresponding to all labels in the current entity recognition task; determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set; taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
In one embodiment, the step in which the processor obtains, according to the masking labels, the label sequence set corresponding to the designated training sample comprises: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character; marking the predicted label set on the designated character; labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
In one embodiment, the step in which the processor combines, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, acquiring the annotation label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, acquiring the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character; forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample; and taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
In one embodiment, the step in which the processor trains the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises: inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples; setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences; composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences; and inputting the training data into the entity recognition model and training until the preset loss function converges.
In one embodiment, the step in which the processor inputs the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises: dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model; and inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied.
An embodiment of the present application further provides a computer-readable storage medium, which is a volatile storage medium or a non-volatile storage medium and on which a computer program is stored. When the computer program is executed by a processor, a training method for an entity recognition model is implemented, the method comprising: acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set; inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample; determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample; obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample; acquiring, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
The above computer-readable storage medium predicts the label probabilities of unlabeled characters with a probability prediction model and, in combination with the Viterbi algorithm, obtains the probabilities of the label sequences corresponding to a training sentence; it then selects the most likely label sequences and, based on them, reduces the number of possible labels for each unlabeled character, thereby effectively reducing the number of label sequences whose probability distribution must be estimated, making it easier for the entity recognition model to identify the true label sequence, and lowering the computational complexity.
In one embodiment, the step in which the processor determines, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises: acquiring the label type set corresponding to all labels in the current entity recognition task; determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set; taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
In one embodiment, the step in which the processor obtains, according to the masking labels, the label sequence set corresponding to the designated training sample comprises: deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character; marking the predicted label set on the designated character; labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
In one embodiment, the step in which the processor combines, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises: judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample; if so, acquiring the annotation label corresponding to the first character; judging whether a second character arranged after the first character is a labeled character; if not, acquiring the predicted label set corresponding to the second character; marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character; forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample; and taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
In one embodiment, the step in which the processor trains the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises: inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples; setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences; composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences; and inputting the training data into the entity recognition model and training until the preset loss function converges.
In one embodiment, the step in which the processor inputs the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises: dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data; inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model; and inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media provided in the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

Claims (20)

  1. A training method for an entity recognition model, comprising:
    acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample;
    acquiring, according to the manner of acquiring the label sequence set corresponding to the designated training sample, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of a preset loss function.
  2. The training method for an entity recognition model according to claim 1, wherein the step of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
    acquiring the label type set corresponding to all labels in the current entity recognition task;
    determining the designated labels respectively corresponding to a designated character in the group of label sequences with the highest probability, wherein the designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    determining, according to the manner of determining the masking labels corresponding to the designated character, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
  3. The training method for an entity recognition model according to claim 1, wherein the step of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
    deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
    marking the predicted label set on the designated character;
    labeling, according to the process of labeling the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and
    combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set.
  4. The training method for an entity recognition model according to claim 3, wherein the step of combining, according to the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample and in the order in which the characters are arranged in the designated training sample, all label sequences corresponding to the designated training sample in a one-to-one correspondence to form the label sequence set comprises:
    judging whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
    if so, acquiring the annotation label corresponding to the first character;
    judging whether a second character arranged after the first character is a labeled character;
    if not, acquiring the predicted label set corresponding to the second character;
    marking each label of the predicted label set corresponding to the second character on the second character, and connecting each of them to the annotation label corresponding to the first character to form the label paths from the first character to the second character;
    forming, in the manner in which the label paths from the first character to the second character are formed, all label paths corresponding to all characters in the designated training sample; and
    taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample to form the label sequence set.
  5. The training method for an entity recognition model according to claim 1, wherein the step of training the entity recognition model with the label sequence sets respectively corresponding to all training samples under the constraint of the preset loss function comprises:
    inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples;
    setting the label sequence probabilities respectively corresponding to the label sequences of all training samples, in a one-to-one correspondence, as the assignment weights respectively corresponding to the label sequences;
    composing training data from the label sequences carrying the assignment weights and the training samples respectively corresponding to the label sequences; and
    inputting the training data into the entity recognition model and training until the preset loss function converges.
  6. The training method for an entity recognition model according to claim 5, wherein the step of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
    dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first part of data and a second part of data;
    inputting the first part of data into the cross-validation model and training to obtain a first validation model, and inputting the second part of data into the cross-validation model and training to obtain a second validation model; and
    inputting the second part of data into the first validation model to obtain the label sequence probability respectively corresponding to each label sequence in the second part of data, and inputting the first part of data into the second validation model to obtain the label sequence probability respectively corresponding to each label sequence in the first part of data.
7. An apparatus for training an entity recognition model, comprising:
    a first acquisition module, configured to acquire an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    an input module, configured to input the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    a calculation module, configured to calculate, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    a determination module, configured to determine, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    an obtaining module, configured to obtain, according to the masking labels, the label sequence set corresponding to the designated training sample;
    a second acquisition module, configured to acquire, in the same manner as the label sequence set corresponding to the designated training sample is obtained, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    a training module, configured to train the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples.
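For the calculation module, a textbook Viterbi decoder over the per-character label probabilities might look like the sketch below. The transition matrix is an illustrative assumption: the claims specify only the per-character label probabilities as input.

```python
# A standard Viterbi decoder in log space; inputs are assumed, not prescribed.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (seq_len, n_labels) per-character label probabilities.
    transitions: (n_labels, n_labels) label-to-label transition probabilities.
    Returns the index path of the highest-probability label sequence."""
    log_e = np.log(emissions + 1e-12)         # log space avoids underflow
    log_t = np.log(transitions + 1e-12)
    seq_len, n_labels = emissions.shape
    score = np.empty((seq_len, n_labels))
    back = np.zeros((seq_len, n_labels), dtype=int)

    score[0] = log_e[0]
    for t in range(1, seq_len):
        cand = score[t - 1][:, None] + log_t  # rows: previous label, cols: current
        back[t] = cand.argmax(axis=0)         # best predecessor per current label
        score[t] = cand.max(axis=0) + log_e[t]

    path = [int(score[-1].argmax())]          # best final label, then backtrack
    for t in range(seq_len - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```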
8. The apparatus for training an entity recognition model according to claim 7, wherein the determination module comprises:
    an acquisition unit, configured to acquire the label type set corresponding to all labels in the current entity recognition task;
    a first determination unit, configured to determine the designated labels respectively corresponding to designated characters in the label sequence with the highest probability, wherein a designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    an assignment unit, configured to take the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    a second determination unit, configured to determine, in the same manner as the masking labels corresponding to the designated character are determined, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
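The masking-label rule in claim 8 reduces to a set difference per unlabeled character: everything in the label type set except the label(s) the highest-probability sequence assigns to it. A minimal sketch with illustrative label names:

```python
# A sketch of the masking rule; LABEL_TYPES is an illustrative task label set.
LABEL_TYPES = {"B-PER", "I-PER", "B-LOC", "I-LOC", "O"}

def masking_labels(designated_labels):
    """designated_labels: the label(s) given to one designated character by the
    highest-probability sequence. Returns that character's masking labels."""
    return LABEL_TYPES - set(designated_labels)

def masking_for_sample(best_labels, unlabeled_positions):
    """Apply the same rule to every unlabeled character in the sample."""
    return {i: masking_labels({best_labels[i]}) for i in unlabeled_positions}

print(masking_labels({"B-LOC"}))   # the other four labels are masked out
```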
9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements a method for training an entity recognition model, the method comprising:
    acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample;
    acquiring, in the same manner as the label sequence set corresponding to the designated training sample is obtained, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples.
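Read end to end, claim 9 chains the earlier steps: decode the highest-probability sequence, derive the masking labels, build the predicted label sets, then enumerate label paths. The sketch below ties together the helper sketches above (`viterbi`, `masking_for_sample`, `build_label_sequences`); the `Sample` record and label list are hypothetical glue, not part of the claims.

```python
# A sketch of one sample's pass through the pipeline, reusing earlier helpers.
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    chars: list            # the characters of the sample text
    gold: dict             # position -> annotation label (partial annotation)
    emissions: np.ndarray  # (len(chars), n_labels) per-character probabilities

def label_sequence_set(sample, transitions, label_list):
    idx_path = viterbi(sample.emissions, transitions)      # highest-prob sequence
    best = [label_list[j] for j in idx_path]               # indices -> label names
    unlabeled = [i for i in range(len(sample.chars)) if i not in sample.gold]
    masks = masking_for_sample(best, unlabeled)            # claim 10's rule
    label_types = set(label_list)
    tokens = [(c, [sample.gold[i]]) if i in sample.gold    # keep annotations
              else (c, sorted(label_types - masks[i]))     # predicted label set
              for i, c in enumerate(sample.chars)]
    return build_label_sequences(tokens)                   # claims 11-12's paths
```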
10. The computer device according to claim 9, wherein the step of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
    acquiring the label type set corresponding to all labels in the current entity recognition task;
    determining the designated labels respectively corresponding to designated characters in the label sequence with the highest probability, wherein a designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    determining, in the same manner as the masking labels corresponding to the designated character are determined, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
11. The computer device according to claim 9, wherein the step of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
    deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
    labeling the designated character with the predicted label set;
    labeling, following the labeling process for the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and
    combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample, to form the label sequence set.
12. The computer device according to claim 11, wherein the step of combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample to form the label sequence set comprises:
    determining whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
    if so, obtaining the annotation label corresponding to the first character;
    determining whether a second character arranged after the first character is a labeled character;
    if not, obtaining the predicted label set corresponding to the second character;
    labeling the second character with each label in the predicted label set corresponding to the second character, and connecting each of these labels to the annotation label corresponding to the first character, to form the label paths from the first character to the second character;
    forming all label paths corresponding to all characters in the designated training sample in the same manner as the label paths from the first character to the second character are formed; and
    taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample, to form the label sequence set.
13. The computer device according to claim 9, wherein the step of training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples comprises:
    inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples;
    setting the label sequence probabilities, in one-to-one correspondence, as the weights assigned to the respective label sequences;
    composing training data from the label sequences carrying the assigned weights and the training samples respectively corresponding to the label sequences; and
    inputting the training data into the entity recognition model and training until the preset loss function converges.
14. The computer device according to claim 13, wherein the step of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
    dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first data part and a second data part;
    inputting the first data part into the cross-validation model and training to obtain a first validation model, and inputting the second data part into the cross-validation model and training to obtain a second validation model; and
    inputting the second data part into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second data part, and inputting the first data part into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first data part.
15. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a method for training an entity recognition model, the method comprising:
    acquiring an incompletely labeled designated training sample, wherein the designated training sample is any sample in an incompletely labeled data set;
    inputting the designated training sample into a probability prediction model to obtain the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    calculating, by the Viterbi algorithm, the label sequence with the highest probability according to the label probabilities respectively corresponding to all unlabeled characters in the designated training sample;
    determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample;
    obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample;
    acquiring, in the same manner as the label sequence set corresponding to the designated training sample is obtained, the label sequence sets respectively corresponding to all training samples in the incompletely labeled data set; and
    training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples.
16. The computer-readable storage medium according to claim 15, wherein the step of determining, according to the label sequence with the highest probability, the masking labels respectively corresponding to all unlabeled characters in the designated training sample comprises:
    acquiring the label type set corresponding to all labels in the current entity recognition task;
    determining the designated labels respectively corresponding to designated characters in the label sequence with the highest probability, wherein a designated character is any one of all unlabeled characters in the designated training sample, and the designated labels are one or more labels in the label type set;
    taking the labels in the label type set other than the designated labels as the masking labels corresponding to the designated character; and
    determining, in the same manner as the masking labels corresponding to the designated character are determined, the masking labels respectively corresponding to all unlabeled characters in the designated training sample.
17. The computer-readable storage medium according to claim 15, wherein the step of obtaining, according to the masking labels, the label sequence set corresponding to the designated training sample comprises:
    deleting the masking labels from the label type set to obtain the predicted label set corresponding to the designated character;
    labeling the designated character with the predicted label set;
    labeling, following the labeling process for the predicted label set corresponding to the designated character, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample; and
    combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample, to form the label sequence set.
18. The computer-readable storage medium according to claim 17, wherein the step of combining, in one-to-one correspondence and in the order in which the characters are arranged in the designated training sample, the predicted label sets respectively corresponding to all unlabeled characters in the designated training sample into all label sequences corresponding to the designated training sample to form the label sequence set comprises:
    determining whether a first character is a labeled character, wherein the first character is any character in the designated training sample;
    if so, obtaining the annotation label corresponding to the first character;
    determining whether a second character arranged after the first character is a labeled character;
    if not, obtaining the predicted label set corresponding to the second character;
    labeling the second character with each label in the predicted label set corresponding to the second character, and connecting each of these labels to the annotation label corresponding to the first character, to form the label paths from the first character to the second character;
    forming all label paths corresponding to all characters in the designated training sample in the same manner as the label paths from the first character to the second character are formed; and
    taking all label paths corresponding to all characters in the designated training sample as all label sequences corresponding to the designated training sample, to form the label sequence set.
19. The computer-readable storage medium according to claim 15, wherein the step of training the entity recognition model, under the constraint of a preset loss function, through the label sequence sets respectively corresponding to all training samples comprises:
    inputting the label sequence sets respectively corresponding to all training samples into a cross-validation model to obtain the label sequence probability corresponding to each label sequence of all training samples;
    setting the label sequence probabilities, in one-to-one correspondence, as the weights assigned to the respective label sequences;
    composing training data from the label sequences carrying the assigned weights and the training samples respectively corresponding to the label sequences; and
    inputting the training data into the entity recognition model and training until the preset loss function converges.
20. The computer-readable storage medium according to claim 19, wherein the step of inputting the label sequence sets respectively corresponding to all training samples into the cross-validation model to obtain the label sequence probabilities respectively corresponding to the label sequences of all training samples comprises:
    dividing the label sequences in the label sequence sets respectively corresponding to all training samples equally into a first data part and a second data part;
    inputting the first data part into the cross-validation model and training to obtain a first validation model, and inputting the second data part into the cross-validation model and training to obtain a second validation model; and
    inputting the second data part into the first validation model to obtain the label sequence probability corresponding to each label sequence in the second data part, and inputting the first data part into the second validation model to obtain the label sequence probability corresponding to each label sequence in the first data part.
PCT/CN2021/097543 2020-12-31 2021-05-31 Method and apparatus for training entity recognition model, and device and storage medium WO2022142122A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011633046.9 2020-12-31
CN202011633046.9A CN112733911B (en) 2020-12-31 2020-12-31 Training method, device, equipment and storage medium of entity recognition model

Publications (1)

Publication Number Publication Date
WO2022142122A1 (en) 2022-07-07

Family

ID=75608419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097543 WO2022142122A1 (en) 2020-12-31 2021-05-31 Method and apparatus for training entity recognition model, and device and storage medium

Country Status (2)

Country Link
CN (1) CN112733911B (en)
WO (1) WO2022142122A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733911B (en) * 2020-12-31 2023-05-30 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of entity recognition model
CN113642635B (en) * 2021-08-12 2023-09-15 百度在线网络技术(北京)有限公司 Model training method and device, electronic equipment and medium
CN114399766B (en) * 2022-01-18 2024-05-10 平安科技(深圳)有限公司 Optical character recognition model training method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717410B (en) * 2018-05-17 2022-05-20 达而观信息科技(上海)有限公司 Named entity identification method and system
CN109299458B (en) * 2018-09-12 2023-03-28 广州多益网络股份有限公司 Entity identification method, device, equipment and storage medium
US11551136B2 (en) * 2018-11-14 2023-01-10 Tencent America LLC N-best softmax smoothing for minimum bayes risk training of attention based sequence-to-sequence models
CN111368544B (en) * 2020-02-28 2023-09-19 中国工商银行股份有限公司 Named entity identification method and device
CN111553164A (en) * 2020-04-29 2020-08-18 平安科技(深圳)有限公司 Training method and device for named entity recognition model and computer equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286629A1 (en) * 2014-04-08 2015-10-08 Microsoft Corporation Named entity recognition
CN108363701A (en) * 2018-04-13 2018-08-03 达而观信息科技(上海)有限公司 Name entity recognition method and system
CN110070183A (en) * 2019-03-11 2019-07-30 中国科学院信息工程研究所 A kind of the neural network model training method and device of weak labeled data
CN111611802A (en) * 2020-05-21 2020-09-01 苏州大学 Multi-field entity identification method
CN112733911A (en) * 2020-12-31 2021-04-30 平安科技(深圳)有限公司 Entity recognition model training method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036869A (en) * 2023-10-08 2023-11-10 之江实验室 Model training method and device based on diversity and random strategy
CN117036869B (en) * 2023-10-08 2024-01-09 之江实验室 Model training method and device based on diversity and random strategy

Also Published As

Publication number Publication date
CN112733911B (en) 2023-05-30
CN112733911A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
WO2022142122A1 (en) Method and apparatus for training entity recognition model, and device and storage medium
CN110442870B (en) Text error correction method, apparatus, computer device and storage medium
JP6222821B2 (en) Error correction model learning device and program
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
WO2022142041A1 (en) Training method and apparatus for intent recognition model, computer device, and storage medium
WO2022105083A1 (en) Text error correction method and apparatus, device, and medium
CN110704576B (en) Text-based entity relationship extraction method and device
CN111666775B (en) Text processing method, device, equipment and storage medium
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN110688853A (en) Sequence labeling method and device, computer equipment and storage medium
CN115293138B (en) Text error correction method and computer equipment
CN110808049B (en) Voice annotation text correction method, computer device and storage medium
CN112002310B (en) Domain language model construction method, device, computer equipment and storage medium
CN111400340A (en) Natural language processing method and device, computer equipment and storage medium
WO2022142123A1 (en) Training method and apparatus for named entity model, device, and medium
CN113223504B (en) Training method, device, equipment and storage medium of acoustic model
CN114528387A (en) Deep learning conversation strategy model construction method and system based on conversation flow bootstrap
CN117094325B (en) Named entity identification method in rice pest field
CN111507103B (en) Self-training neural network word segmentation model using partial label set
CN112347780A (en) Judicial fact finding generation method, device and medium based on deep neural network
CN112906398A (en) Sentence semantic matching method, system, storage medium and electronic equipment
CN115098722B (en) Text and image matching method and device, electronic equipment and storage medium
JP2015141253A (en) Voice recognition device and program
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912883

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912883

Country of ref document: EP

Kind code of ref document: A1