CN112329466A - Method, device and equipment for constructing named entity recognition model and storage medium - Google Patents


Publication number
CN112329466A
CN112329466A
Authority
CN
China
Prior art keywords
text sequence, model, labeled, text, sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011099769.5A
Other languages
Chinese (zh)
Inventor
许慧敏 (Xu Huimin)
陈玉念 (Chen Yunian)
孙剑 (Sun Jian)
曹雪智 (Cao Xuezhi)
谢睿 (Xie Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202011099769.5A
Publication of CN112329466A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method for constructing a named entity recognition model. The method comprises: performing the first N iterations of training on a preset model using text sequences containing entity information to obtain an initial model, wherein the text sequences containing entity information comprise partially labeled text sequences and/or fully labeled text sequences; re-labeling unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences; performing iterations N+1 through N+m of training on the preset model using all partially labeled text sequences, including the new partially labeled text sequences, to obtain an intermediate model; and adjusting parameters of the intermediate model using the fully labeled text sequences to obtain the named entity recognition model. No entities in an unlabeled text sequence are labeled, only some entities in a partially labeled text sequence are labeled, and all entities in a fully labeled text sequence are labeled.

Description

Method, device and equipment for constructing named entity recognition model and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for constructing a named entity recognition model.
Background
Named Entity Recognition (NER) is one of the most important tasks in Natural Language Processing (NLP). A named entity generally refers to an entity in text with a specific meaning or strong referential force, such as a person name, place name, organization name, date or time, or proper noun.
In the related art, named entity recognition is generally performed by a named entity recognition model, so such a model must first be constructed. Models are typically constructed by rule-based methods, statistical-model-based methods, or deep-learning-based methods.
However, all three construction methods require a sample text sequence in which every entity character is labeled. In practice, some sample text sequences used in training are not completely labeled, yet these methods crudely apply the same processing to all samples regardless of their differences in labeling completeness. As a result, the named entity recognition model constructed in this way has low recognition accuracy.
Disclosure of Invention
In order to solve the above problems, the present application provides a method, an apparatus, a device, and a storage medium for constructing a named entity recognition model, which aim to improve the accuracy of the named entity recognition model in named entity recognition.
In a first aspect of the embodiments of the present disclosure, a method for constructing a named entity recognition model is provided, where the method includes:
performing the first N iterations of training on a preset model using text sequences containing entity information to obtain an initial model, wherein the text sequences containing entity information comprise partially labeled text sequences and/or fully labeled text sequences;
re-labeling unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences;
performing iterations N+1 through N+m of training on the preset model using all partially labeled text sequences, including the new partially labeled text sequences, to obtain an intermediate model;
adjusting parameters of the intermediate model using fully labeled text sequences to obtain a named entity recognition model;
wherein no entities in the unlabeled text sequences are labeled, only some entities in the partially labeled text sequences are labeled, and all entities in the fully labeled text sequences are labeled.
Optionally, re-labeling the unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences comprises:
performing named entity recognition on the unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain a prediction result for each sequence;
determining, according to the prediction results, the texts in the unlabeled text sequences and/or partially labeled text sequences that meet preset conditions, the preset conditions being: the text has no original label, the predicted labels of the text represent an entity and form a complete entity sequence, and the average confidence of the predicted labels in the entity sequence is greater than a preset threshold;
and re-labeling the texts meeting the preset conditions as the entities represented by their predicted labels to obtain new partially labeled text sequences.
Optionally, during the first N iterations of training on the preset model, each iteration continues from the model parameters obtained in the previous iteration;
during iterations N+1 through N+m, each iteration starts from the parameters of the preset model;
and across all iterations, the number of training rounds per iteration increases with the iteration count.
Optionally, the method further comprises:
according to the category of each entity in the fully labeled text sequences, replacing some entities with other entities of the same category from a preset entity dictionary with a preset probability to obtain enhanced text sequences;
wherein adjusting the parameters of the intermediate model using the fully labeled text sequences to obtain the named entity recognition model comprises:
adjusting the parameters of the intermediate model using the fully labeled text sequences and/or the enhanced text sequences to obtain the named entity recognition model.
Optionally, the method further comprises:
obtaining a text sequence to be recognized;
and inputting the text sequence to be recognized into the named entity recognition model or the intermediate model to obtain a recognition result of the text sequence to be recognized.
In a second aspect of the embodiments of the present disclosure, an apparatus for constructing a named entity recognition model is provided, where the apparatus includes:
a first training module, configured to perform the first N iterations of training on a preset model using text sequences containing entity information to obtain an initial model, wherein the text sequences containing entity information comprise partially labeled text sequences and/or fully labeled text sequences;
a labeling module, configured to re-label unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences;
a second training module, configured to perform iterations N+1 through N+m of training on the preset model using all partially labeled text sequences, including the new partially labeled text sequences, to obtain an intermediate model;
a fine-tuning module, configured to adjust parameters of the intermediate model using fully labeled text sequences to obtain a named entity recognition model;
wherein no entities in the unlabeled text sequences are labeled, only some entities in the partially labeled text sequences are labeled, and all entities in the fully labeled text sequences are labeled.
In a third aspect of the embodiments of the present disclosure, an electronic device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a named entity recognition model according to the first aspect.
In a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, whose instructions, when executed by a processor, cause the processor to perform the method for constructing a named entity recognition model according to any one of the first aspect.
In the embodiments of the present application, a preset model is first trained with partially labeled and/or fully labeled text sequences to obtain an initial model; unlabeled and/or partially labeled text sequences are then re-labeled by the initial model to obtain new partially labeled text sequences; the preset model is iteratively trained with these sequences to obtain an intermediate model; and finally the parameters of the intermediate model are adjusted with fully labeled text sequences to obtain the named entity recognition model.
In other words, depending on how completely each text sequence is labeled, sequences of different labeling degrees serve as training samples at different stages of model construction: during the first N iterations, the preset model is trained with partially labeled text sequences; during iterations N+1 through N+m, it is trained with the re-labeled text sequences; and in the final stage, once the intermediate model is obtained, it is fine-tuned with fully labeled text sequences. This construction method, on the one hand, fully accounts for the differences between labeled text sequences; on the other hand, by assigning sequences of different labeling degrees (fully labeled, partially labeled, and unlabeled) to different construction stages, it enriches the construction strategy and exploits the contribution each kind of sequence makes to training. For example, incompletely labeled and unlabeled text sequences help improve the generalization of the model, while fully labeled text sequences improve its accuracy, so that the accuracy of the constructed named entity recognition model is improved overall.
Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. The drawings described below are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a design concept diagram of a method for constructing a named entity recognition model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of a method for constructing a named entity recognition model according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating one embodiment of the re-labeling steps;
fig. 4 is a schematic diagram of a framework of a device for constructing a named entity recognition model according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
For ease of understanding the present application, the labeled text sequences used in named entity recognition are described first:
Sequence labeling means that, given a natural language sequence, each element of the sequence is assigned a label, yielding a labeled sequence; Chinese word segmentation, named entity recognition, and part-of-speech tagging all belong to the category of sequence labeling tasks. Sequence labeling algorithms include the HMM (Hidden Markov Model), MEMM (Maximum Entropy Markov Model), CRF (Conditional Random Field), and LSTM-CRF models.
Sequences can be labeled under a number of different schemes, such as BIO, BIEO, and BIEUO. Taking BIEO as an example, "B" marks the beginning of an entity, "I" the inside of an entity, "E" the end of an entity, and "O" a character outside any entity. Table 1-1 below shows a BIEO labeling case in which "雪山飞狐" (Snow Mountain Flying Fox, a TV series) and "金庸" (Jin Yong, a person name) are a TV entity and a person-name entity, represented by the tag sequences "B-I-I-E" and "B-E" respectively:
Table 1-1

| 雪 (Snow) | 山 (Mountain) | 飞 (Flying) | 狐 (Fox) | 金 (Jin) | 庸 (Yong) | 武 (Wu) | 侠 (Xia) | 剧 (Drama) |
| B-TV | I-TV | I-TV | E-TV | B-PER | E-PER | O | O | O |
As shown in Table 1-1, "雪" (snow) is the first character of an entity and is therefore labeled "B"; the type suffix makes the full label "B-TV", which clearly distinguishes the TV-series entity from the person-name entity beginning at "金". The characters of "武侠剧" (wuxia drama) are not part of any entity and are therefore labeled "O".
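To make the BIEO scheme concrete, the short Python sketch below assigns BIEO tags to entity spans in a character sequence; the function name and the span format are illustrative assumptions for this description, not code from the patent.

```python
# A minimal BIEO tagging sketch. Spans are (start, end_exclusive, type)
# over the character sequence; untagged characters keep the "O" label.

def bieo_tags(chars, spans):
    tags = ["O"] * len(chars)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{etype}"
        tags[end - 1] = f"E-{etype}"   # assumes every entity spans >= 2 chars
    return tags

chars = list("雪山飞狐金庸武侠剧")
spans = [(0, 4, "TV"), (4, 6, "PER")]
print(list(zip(chars, bieo_tags(chars, spans))))
# [('雪', 'B-TV'), ('山', 'I-TV'), ('飞', 'I-TV'), ('狐', 'E-TV'),
#  ('金', 'B-PER'), ('庸', 'E-PER'), ('武', 'O'), ('侠', 'O'), ('剧', 'O')]
```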
When labeled text sequences are produced, annotator errors or noise in data generated by distant supervision lead to wrong or missing labels in the training samples. Labeled text sequences are therefore divided into fully labeled text sequences, partially labeled text sequences, and unlabeled text sequences.
As shown in Table 1-2, a fully labeled text sequence is one in which all entities in the sentence are correctly labeled; a partially labeled text sequence is one in which some entities are labeled but others are not; and an unlabeled text sequence is one in which the sentence carries no labels at all.
Table 1-2

|   | 雪 | 山 | 飞 | 狐 | 金 | 庸 | 武 | 侠 | 剧 |
| 1 | B-TV | I-TV | I-TV | E-TV | B-PER | E-PER | O | O | O |
| 2 | O | O | O | O | B-PER | E-PER | O | O | O |
| 3 | O | O | O | O | O | O | O | O | O |
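As a minimal illustration (the data layout is an assumption for this description, not mandated by the patent), the three rows of Table 1-2 can be represented as parallel tag lists:

```python
# The three labeling degrees of Table 1-2 as parallel tag lists (a sketch).
chars = list("雪山飞狐金庸武侠剧")
fully_labeled = ["B-TV", "I-TV", "I-TV", "E-TV", "B-PER", "E-PER", "O", "O", "O"]
partially_labeled = ["O", "O", "O", "O", "B-PER", "E-PER", "O", "O", "O"]  # "雪山飞狐" missed
unlabeled = ["O"] * len(chars)  # no entity is labeled at all
```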
In the prior art, labeled text sequences are used as training samples to train a preset model and thereby construct a named entity recognition model. From the perspective of model construction, rule-based methods, statistical-model-based methods, and deep-learning-based methods are mainly used, but all of them assume fully labeled text sequences. In practice, noise and missed labels in manual annotation inevitably produce partially labeled and unlabeled text sequences, and it is difficult to guarantee that every labeled text sequence is complete, so the recognition accuracy of named entity recognition models constructed by these methods is low.
In view of the above, the applicant proposes a method for constructing a named entity recognition model whose main idea is to fully account for how completely each labeled training sequence is labeled and to formulate construction strategies accordingly, so that every labeled text sequence contributes to model training. For example, incompletely labeled text sequences can be used to train an initial model and improve its generalization, while fully labeled text sequences are used to fine-tune the parameters of the resulting model and improve its accuracy.
Because the initial model trained on incompletely labeled text sequences already has a basic named entity recognition capability, and because incompletely labeled sequences help improve the generalization of the model, the initial model can predict labels for the incompletely labeled and unlabeled text sequences. According to these predictions, those sequences can be re-labeled, which raises the labeling quality of the training data; when the re-labeled text sequences are subsequently used to train the preset model again, the improved sample quality further improves the accuracy of the model.
Referring to fig. 1, which shows the design concept of the construction method of this embodiment: labeled text sequences are first divided, by completeness of labeling, into partially labeled, fully labeled, and unlabeled text sequences. The preset model is trained into an initial model with the partially labeled text sequences; the initial model then re-labels the partially labeled and unlabeled text sequences; the preset model is further trained into an intermediate model with the re-labeled text sequences; and finally the intermediate model is fine-tuned with the fully labeled text sequences to obtain the named entity recognition model.
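A high-level Python sketch of this pipeline follows. The three callables are hypothetical placeholders for the operations described in this embodiment; nothing here is code disclosed by the patent itself.

```python
# A sketch of the construction pipeline of Fig. 1, parameterized by three
# hypothetical operations: train_iterations, relabel, and fine_tune.

def build_ner_model(train_iterations, relabel, fine_tune,
                    model, partial, full, unlabeled, N, m):
    # Stage 1: first N iterations on partially and/or fully labeled
    # sequences, each iteration continuing from the previous parameters.
    initial_model = train_iterations(model, partial + full,
                                     n_iters=N, reset_params=False)

    # Re-label unlabeled and partially labeled sequences with the initial
    # model, yielding new partially labeled sequences.
    new_partial = relabel(initial_model, unlabeled + partial)

    # Stage 2: iterations N+1..N+m on all partially labeled sequences
    # (old and new), resetting parameters at the start of each iteration.
    intermediate = train_iterations(model, partial + new_partial,
                                    n_iters=m, reset_params=True)

    # Stage 3: fine-tune the intermediate model on fully labeled sequences.
    return fine_tune(intermediate, full)
```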
Referring to fig. 2, a flowchart illustrating steps of a method for constructing a named entity recognition model according to an embodiment of the present application is shown, and as shown in fig. 2, the method may specifically include the following steps:
step S201: and performing previous N times of iterative training on the preset model by using a text sequence containing entity information to obtain an initial model, wherein the text sequence containing the entity information comprises a partial labeling text sequence and/or a complete labeling text sequence.
In this embodiment, the entity information refers to an entity having a specific meaning or strong reference, such as a name of a person, a name of a place, a name of an organization, a date and time, a proper noun, and the like. For example, a text sequence is "snowy flying fox is martial arts novel written by gold inferior", wherein "snowy flying fox" and "gold inferior" belong to entity information. In the previous N times of iterative training of the preset model, the text sequence containing the entity information may be a partial annotation text sequence, a complete annotation text sequence, or a partial annotation text sequence and a complete annotation text sequence.
As described above, a full annotation text sequence may refer to a text sequence in which all entity information is fully annotated, i.e., all entities in the full annotation text sequence are annotated. For example, the text sequence described in line 1 of tables 1-2. The partial annotation text sequence may refer to a text sequence in which only part of the entity information is annotated, that is, a part of the entities in the partial annotation text sequence is annotated. For example, the text sequence described in line 2 of tables 1-2.
In this embodiment, the preset model may be a BiLSTM-CRF model, and the iterative training may be performed on the preset model N times before, where the iterative training is performed on the model obtained by the previous training N times after the previous training. That is, when the next training is completed, the model parameters are updated again on the model parameters obtained in the previous training.
In this embodiment, N may be set to be greater than or equal to 1, the initial model obtained by performing previous N times of iterative training on the preset model may be considered as an initial construction stage of the whole model construction, and in this stage, when training is performed by using a part of tagged text sequence and/or a complete tagged text sequence, since the tagged text sequence used for training is a sequence with incomplete tags, the use of the incomplete tagged text sequence may help the preset model to improve fault-tolerance and cover more entity types, thereby improving generalization capability of the preset model, and thus, the initial model may have more basic named entity recognition capability.
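Since the embodiment names BiLSTM-CRF as one possible preset model, a compact definition is sketched below in PyTorch. It assumes the third-party pytorch-crf package and is only one plausible realization, not the implementation fixed by the patent.

```python
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (an assumption)

class BiLSTMCRF(nn.Module):
    """A minimal BiLSTM-CRF tagger; one possible 'preset model'."""

    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))
        return self.proj(out)          # per-character tag scores

    def loss(self, token_ids, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self.emissions(token_ids), tags, mask=mask)

    def decode(self, token_ids, mask):
        # Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(self.emissions(token_ids), mask=mask)
```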
Step S202: re-label the unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences.
In this embodiment, an unlabeled text sequence is a text sequence in which no entity information is labeled, for example the sequence in row 3 of Table 1-2.
Because the initial model has a basic named entity recognition capability, it can be used to recognize the unlabeled and/or partially labeled text sequences, which can then be re-labeled according to the recognition results to obtain new partially labeled text sequences.
When an unlabeled text sequence is recognized, it is re-labeled according to its own recognition result; likewise, when a partially labeled text sequence is recognized, it is re-labeled according to its own recognition result.
It should be noted that re-labeling in this embodiment means labeling the entity information that had not yet been labeled in the partially labeled and unlabeled text sequences.
Re-labeling an unlabeled or partially labeled text sequence expands the sample size of partially labeled text sequences, and re-labeling a partially labeled text sequence may also expand the sample size of fully labeled text sequences. Thus, between training stages, the trained initial model can be used to enlarge the sample pools at each labeling degree, continuously raising the completeness of the labeled text sequences and narrowing the differences in labeling degree between them.
Step S203: perform iterations N+1 through N+m of training on the preset model using all partially labeled text sequences, including the new partially labeled text sequences, to obtain an intermediate model.
In this embodiment, after the new partially labeled text sequences are obtained, they can be gathered together with the existing partially labeled text sequences, and iterations N+1 through N+m of training are then performed on the preset model with all of them to obtain the intermediate model.
Iterations N+1 through N+m can be understood as the intermediate stage of the construction process. Since re-labeling either unlabeled or partially labeled text sequences expands the sample size of partially labeled sequences, training the preset model at this stage with the enlarged pool of partially labeled text sequences yields a better training effect, so the trained model recognizes named entities more accurately.
Of course, in a specific implementation, the fully labeled text sequences may also be used together with all the partially labeled text sequences in iterations N+1 through N+m.
Step S204: adjust the parameters of the intermediate model using the fully labeled text sequences to obtain the named entity recognition model.
In this embodiment, adjusting the parameters of the intermediate model with the fully labeled text sequences means training the intermediate model with those sequences to update its parameters. The complete labels help the model quickly converge to a highly usable state, improving its recognition accuracy.
With the technical solution of the present application, different stages of model training use samples of different labeling degrees: in the first N iterations, incompletely labeled text sequences are used for training; the incompletely labeled sequences are then re-labeled to obtain new partially labeled text sequences; the preset model is iteratively trained with all partially labeled sequences, including the new ones, to obtain an intermediate model; and the parameters of the intermediate model are finally adjusted with the fully labeled text sequences. Text sequences of every labeling degree are thus used appropriately: incompletely labeled and unlabeled text sequences help improve the generalization of the model, and fully labeled text sequences improve its accuracy, so the accuracy of the constructed named entity recognition model is improved overall.
In a specific implementation, different training strategies may be formulated for the different model construction stages, i.e., for the initial stage of the first N iterations and for the subsequent stage of iterations N+1 through N+m.
In the stage of iterations N+1 through N+m, the model parameters may be reset at each iteration to avoid accumulating the errors of the previous iteration. And within every construction stage, to obtain a more accurate named entity recognition model, the number of training rounds per iteration may grow with the iteration count. Specifically:
during the first N iterations, each iteration continues from the model parameters obtained in the previous iteration; during iterations N+1 through N+m, each iteration starts from the parameters of the preset model; and throughout, the number of rounds of each iteration increases with the iteration count.
In this embodiment, the first N iterations may be called the initial construction stage. Because the preset model is trained with partially labeled text sequences in this stage, each iteration can continue from the model parameters produced by the previous iteration; that is, the next iteration updates the parameters obtained at the end of the previous one.
The process of iterations N+1 through N+m may be called the intermediate construction stage. Here the preset model is trained with the partially labeled text sequences including the new ones; compared with the initial stage, the pool of training samples at the same labeling degree has been enlarged, which achieves a better training effect. The model parameters may therefore be reset at each iteration: for example, the preset model parameters are restored at the start of the (N+3)-th iteration, and restored again at the start of the (N+4)-th iteration.
Further, within each iteration of the intermediate stage, the inventors observed that the model's precision is high and its recall low in the first few rounds, and that precision and recall approach balance as the number of training rounds grows. Therefore, to obtain a model with precision as high as possible early in the iterations, the number of training rounds per iteration can be increased with the iteration count; for example, the (N+3)-th iteration may train for 100 rounds and the (N+4)-th for 150 rounds.
The difference between the round counts of consecutive iterations may be a preset fixed value, i.e., each iteration adds the same increment of rounds over the previous one; the increment may, of course, also be random.
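The two parameter strategies and the growing round counts can be sketched as follows. train_rounds is a hypothetical helper that trains the model for a given number of rounds, and base_rounds and round_increment are illustrative assumptions rather than values fixed by the patent.

```python
import copy

# A sketch of the staged training strategy described above, assuming a
# PyTorch-style model with state_dict()/load_state_dict().

def staged_training(model, train_rounds, stage1_data, stage2_data, N, m,
                    base_rounds=50, round_increment=50):
    preset_state = copy.deepcopy(model.state_dict())  # preset-model parameters

    # Initial stage (iterations 1..N): each iteration continues from the
    # parameters left behind by the previous iteration.
    for i in range(N):
        train_rounds(model, stage1_data,
                     rounds=base_rounds + i * round_increment)

    # Intermediate stage (iterations N+1..N+m): reset to the preset
    # parameters at the start of every iteration to avoid accumulating the
    # previous iteration's errors; round counts keep growing.
    for j in range(m):
        model.load_state_dict(copy.deepcopy(preset_state))
        train_rounds(model, stage2_data,
                     rounds=base_rounds + (N + j) * round_increment)
    return model
```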
In one example, to improve the precision of the parameter adjustment of the intermediate model and thereby the recognition accuracy of the named entity recognition model, the parameters of the intermediate model may also be adjusted with enhanced text sequences. An enhanced text sequence is a text sequence that differs from a fully labeled text sequence but shares its entity structure. In a specific implementation, the enhanced text sequences may be obtained as follows:
according to the category of each entity in the fully labeled text sequences, some entities are replaced, with a preset probability, by other entities of the same category from a preset entity dictionary to obtain enhanced text sequences.
Then, when adjusting the parameters of the intermediate model to obtain the named entity recognition model, the fully labeled text sequences and/or the enhanced text sequences may be used.
The category of an entity is the class to which it belongs, such as person name, place name, organization name, date/time, or proper noun; for example, the category of "金庸" (Jin Yong) is person name. The preset entity dictionary stores entities of a number of different categories in advance, possibly several per category; for example, the person-name category may include "金庸" (Jin Yong), "徐志摩" (Xu Zhimo), "郭老" (Guo Lao), and the like.
The preset probability used for replacement may be specified in advance. During replacement, all entities in a fully labeled text sequence may be replaced by other entities of the same category in the preset entity dictionary, or only some of them.
The replacement entity may be chosen at random, i.e., another entity of the same category as the entity to be replaced is randomly selected from the preset entity dictionary.
Illustratively, Table 1-3 below shows how entities in a fully labeled text sequence are replaced by other entities of the same category in the preset entity dictionary.
Table 1-3

| Fully labeled text sequence | After entity replacement |
| 雪山飞狐是金庸写的武侠小说 | 再别康桥是徐志摩写的武侠小说 |
As shown in Table 1-3, the fully labeled text sequence is "雪山飞狐是金庸写的武侠小说" ("Snow Mountain Flying Fox is a wuxia novel written by Jin Yong"), in which "雪山飞狐" and "金庸" are entity information. "雪山飞狐" may then be replaced with the dictionary entry "再别康桥" ("Farewell to Cambridge") and "金庸" with "徐志摩" (Xu Zhimo); "金庸" may equally well be replaced with "郭老" (Guo Lao) from the preset entity dictionary.
It should be noted that the above examples are merely illustrative and do not limit the present application; in particular, they imply no association between entities of different categories. For example, although "再别康桥" was written by Xu Zhimo, this does not mean that "再别康桥" and "徐志摩" are bound together or must appear in the same text sequence.
With the above approach, the parameters of the intermediate model may be adjusted with the fully labeled text sequences alone, with the enhanced text sequences alone, or with both; any of these may be chosen in an actual implementation.
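The dictionary-based enhancement can be sketched as below. The dictionary contents mirror the examples above, and every name here (including the category keys) is illustrative rather than prescribed by the patent.

```python
import random

# A sketch of entity replacement for data enhancement. ENTITY_DICT maps a
# category to candidate surface forms; replace_prob plays the role of the
# "preset probability".

ENTITY_DICT = {
    "PER": ["金庸", "徐志摩", "郭老"],
    "WORK": ["雪山飞狐", "再别康桥"],  # illustrative same-category entries
}

def enhance(text, entities, replace_prob=0.5):
    """entities: list of (surface, category) pairs labeled in the full sequence."""
    for surface, category in entities:
        candidates = [e for e in ENTITY_DICT.get(category, []) if e != surface]
        if candidates and random.random() < replace_prob:
            # Randomly pick another same-category entity from the dictionary.
            text = text.replace(surface, random.choice(candidates))
    return text

print(enhance("雪山飞狐是金庸写的武侠小说",
              [("雪山飞狐", "WORK"), ("金庸", "PER")]))
# e.g. 再别康桥是徐志摩写的武侠小说
```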
Referring to fig. 3, which shows a flowchart of the re-labeling steps in one example, re-labeling the unlabeled text sequences and/or partially labeled text sequences with the initial model to obtain new partially labeled text sequences may include the following steps:
step S2021: and carrying out named entity identification on the unlabeled text sequence and/or the partially-labeled text sequence by using the initial model to obtain respective prediction results of the unlabeled text sequence and/or the partially-labeled text sequence.
In this embodiment, since the initial model may have a relatively basic named entity recognition capability, the initial model may be used to perform named entity recognition on an unlabeled text sequence and/or a partially labeled text sequence, that is, the initial model is used to recognize a text included in an input text sequence.
The recognizing of the text may refer to predicting a label of each character included in the text, and thus the prediction result may include a prediction label for predicting the label of each character. The prediction result of the unlabeled text sequence may refer to a prediction tag of each character in the unlabeled text sequence, and the prediction result of each of the partially-labeled text sequences may refer to a prediction tag of each character in the partially-labeled text sequence.
Step S2022: determine, according to the prediction results, the texts in the unlabeled text sequences and/or partially labeled text sequences that meet preset conditions, the preset conditions being: the text has no original label, the predicted labels of the text represent an entity and form a complete entity sequence, and the average confidence of the predicted labels in the entity sequence is greater than a preset threshold.
In this embodiment, the texts meeting the preset conditions in the unlabeled text sequences are determined from their prediction results, and/or those in the partially labeled text sequences from theirs. A text meeting the preset conditions is a text that represents an entity within a sequence, for example the text "雪山飞狐" within "雪山飞狐金庸武侠剧".
The predicted label of each character indicates whether the text is an entity and whether it is a complete entity sequence; a complete entity sequence is one in which every character belongs to the entity. For example, "雪山飞狐" is a complete entity sequence while "雪山" alone is not.
The average confidence of the predicted labels in the entity sequence must exceed a preset threshold. Every character in the entity sequence has one predicted label, and the confidence of a predicted label expresses how trustworthy that label is. In one example, texts whose characters' predicted labels have an average confidence of at least 0.7 are taken as texts to be re-labeled. For instance, for the text "雪山飞狐", the confidences of "B", "I", "I", and "E" are 0.95, 0.99, 0.95, and 0.50, giving an average of about 0.85, so "雪山飞狐" qualifies as a text to be re-labeled.
The original labels of a text are the labels already assigned to it, such as the tag sequences "B-I-I-E" and "B-E"; a text with no original label is an unlabeled text. For example, when the text sequence "雪山飞狐金庸武侠剧" is labeled as in row 2 of Table 1-2, "雪山飞狐" carries no tags, i.e., it has no original label.
Because of these conditions, only texts that the prediction identifies as entities and that carry no existing labels are re-labeled, so the re-labeled texts have high accuracy.
Step S2023: re-label the texts meeting the preset conditions in the unlabeled text sequences and/or partially labeled text sequences as the entities represented by their predicted labels to obtain new partially labeled text sequences.
In this embodiment, each character of a text determined to meet the preset conditions is re-labeled with the entity label predicted for it. For example, for the text "雪山飞狐", the predicted labels of "雪", "山", "飞", and "狐" are "B", "I", "I", and "E", indicating that "雪山飞狐" is a complete entity sequence; each character is therefore re-labeled accordingly, so that after re-labeling "雪" carries "B", "山" carries "I", "飞" carries "I", and "狐" carries "E".
After the unlabeled and/or partially labeled text sequences are re-labeled with the initial model, entities that previously had no labels now carry them, which raises the coverage of labeled entities in the text sequences.
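The preset condition of steps S2021 to S2023 can be sketched as the following filter. The per-character tuple layout and the 0.7 threshold follow the example above, while the function itself is an illustrative assumption.

```python
# A sketch of the re-labeling filter. Each prediction is a tuple
# (char, original_tag, predicted_tag, confidence); "O" stands for an
# absent original label.

def relabel_sentence(preds, threshold=0.7):
    tags = [orig for _, orig, _, _ in preds]
    i = 0
    while i < len(preds):
        if preds[i][2].startswith("B-"):
            j = i
            while j < len(preds) and not preds[j][2].startswith("E-"):
                j += 1
            if j < len(preds):                       # a complete B-...-E span
                span = preds[i:j + 1]
                untagged = all(p[1] == "O" for p in span)       # no original label
                avg_conf = sum(p[3] for p in span) / len(span)  # mean confidence
                if untagged and avg_conf > threshold:
                    for k, p in enumerate(span):
                        tags[i + k] = p[2]           # adopt the predicted labels
                i = j
        i += 1
    return tags

preds = [("雪", "O", "B-TV", 0.95), ("山", "O", "I-TV", 0.99),
         ("飞", "O", "I-TV", 0.95), ("狐", "O", "E-TV", 0.50),
         ("金", "B-PER", "B-PER", 0.99), ("庸", "E-PER", "E-PER", 0.99),
         ("武", "O", "O", 0.99), ("侠", "O", "O", 0.99), ("剧", "O", "O", 0.99)]
print(relabel_sentence(preds))
# "雪山飞狐": untagged, complete span, mean confidence 0.8475 > 0.7 -> re-labeled
```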
After the named entity recognition model is obtained by the above process, it can be used to recognize named entities in a text sequence to be recognized. Specifically, a text sequence to be recognized is obtained and input into the named entity recognition model or the intermediate model, yielding a recognition result for that sequence.
In this embodiment, a text sequence to be recognized is a text sequence containing entity information whose entities are yet to be labeled. Since the intermediate model was trained with the partially labeled text sequences including the new ones, it already recognizes named entities fairly accurately, so the text sequence to be recognized may be recognized either by the intermediate model or by the named entity recognition model.
Of course, since the named entity recognition model is obtained by further adjusting the parameters of the intermediate model, using it to recognize the text sequence yields a more accurate recognition result.
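Applying the constructed model to a text sequence to be recognized then reduces to a decode call. The sketch below reuses the hypothetical BiLSTMCRF class from the earlier sketch and an assumed char_to_id vocabulary; neither is specified by the patent.

```python
import torch

# A sketch of inference with the constructed model; `model` is an instance
# of the hypothetical BiLSTMCRF above, char_to_id an assumed vocabulary.

def recognize(model, text, char_to_id):
    ids = torch.tensor([[char_to_id.get(c, 0) for c in text]])  # batch of 1
    mask = torch.ones_like(ids, dtype=torch.bool)
    return model.decode(ids, mask)[0]  # best tag-id sequence for the text

# Usage (assuming a trained model and vocabulary):
# tag_ids = recognize(named_entity_model, "雪山飞狐是金庸写的武侠小说", vocab)
```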
Based on the same inventive concept as the above embodiments, a second aspect of the embodiments of the present disclosure provides an apparatus 400 for constructing a named entity recognition model. As shown in fig. 4, the apparatus 400 may specifically include the following modules:
a first training module 401, configured to perform the first N iterations of training on a preset model using text sequences containing entity information to obtain an initial model, where the text sequences containing entity information comprise partially labeled text sequences and/or fully labeled text sequences;
a labeling module 402, configured to re-label unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences;
a second training module 403, configured to perform iterations N+1 through N+m of training on the preset model using all partially labeled text sequences, including the new partially labeled text sequences, to obtain an intermediate model;
a fine-tuning module 404, configured to adjust parameters of the intermediate model using fully labeled text sequences to obtain a named entity recognition model;
wherein no entities in the unlabeled text sequences are labeled, only some entities in the partially labeled text sequences are labeled, and all entities in the fully labeled text sequences are labeled.
Optionally, the labeling module 402 may specifically include the following units:
a prediction unit, configured to perform named entity recognition on the unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain a prediction result for each sequence;
a screening unit, configured to determine, according to the prediction results, the texts in the unlabeled text sequences and/or partially labeled text sequences that meet preset conditions, the preset conditions being: the text has no original label, the predicted labels of the text represent an entity and form a complete entity sequence, and the average confidence of the predicted labels in the entity sequence is greater than a preset threshold;
and a labeling unit, configured to re-label the texts meeting the preset conditions as the entities represented by their predicted labels to obtain new partially labeled text sequences.
Optionally, the first training module 401 is configured so that, during the first N iterations of training on the preset model, each iteration continues from the model parameters obtained in the previous iteration;
the second training module 403 is configured so that, during iterations N+1 through N+m, each iteration starts from the parameters of the preset model;
and across all iterations, the number of training rounds per iteration increases with the iteration count.
Optionally, the apparatus may further include the following modules:
an enhanced-text-sequence obtaining module, configured to replace, according to the category of each entity in the fully labeled text sequences and with a preset probability, some entities with other entities of the same category from a preset entity dictionary to obtain enhanced text sequences;
wherein the fine-tuning module 404 is specifically configured to adjust the parameters of the intermediate model using the fully labeled text sequences and/or the enhanced text sequences to obtain the named entity recognition model.
Optionally, the apparatus may further include the following modules:
an obtaining module, configured to obtain a text sequence to be recognized;
and a recognition module, configured to input the text sequence to be recognized into the named entity recognition model or the intermediate model to obtain a recognition result for the text sequence.
It should be noted that the device embodiments are similar to the method embodiments, so that the description is simple, and reference may be made to the method embodiments for relevant points.
An embodiment of the present invention further provides an electronic device, which may include a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the above method for constructing a named entity recognition model.
The disclosed embodiments also provide a non-transitory computer-readable storage medium whose instructions, when executed by a processor, cause the processor to perform the method for constructing a named entity recognition model according to the present disclosure.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises", "comprising", and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises it.
The method, apparatus, device, and storage medium for constructing a named entity recognition model provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the application scope; in summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

1. A method for constructing a named entity recognition model, the method comprising:
performing the first N iterations of training on a preset model using text sequences containing entity information to obtain an initial model, wherein the text sequences containing entity information comprise partially labeled text sequences and/or fully labeled text sequences;
re-labeling unlabeled text sequences and/or partially labeled text sequences using the initial model to obtain new partially labeled text sequences;
performing iterations N+1 through N+m of training on the preset model using all partially labeled text sequences, including the new partially labeled text sequences, to obtain an intermediate model;
adjusting parameters of the intermediate model using fully labeled text sequences to obtain a named entity recognition model;
wherein no entities in the unlabeled text sequences are labeled, only some entities in the partially labeled text sequences are labeled, and all entities in the fully labeled text sequences are labeled.
2. The method of claim 1, wherein re-labeling the unlabeled text sequence and/or the partially labeled text sequence by using the initial model to obtain a new partially labeled text sequence comprises:
performing named entity recognition on the unlabeled text sequence and/or the partially labeled text sequence by using the initial model to obtain a respective prediction result for each sequence;
determining, according to the respective prediction results, the texts in the unlabeled text sequence and/or the partially labeled text sequence that satisfy preset conditions, the preset conditions being that: the original label of the text does not identify an entity, the predicted label of the text represents an entity and belongs to a complete entity sequence, and the average confidence of the predicted labels included in the entity sequence is greater than a preset threshold; and
re-labeling the texts that satisfy the preset conditions in the unlabeled text sequence and/or the partially labeled text sequence as the entities represented by their predicted labels, to obtain the new partially labeled text sequence.
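A minimal sketch of claim 2's selection rule follows, assuming a BIO tag scheme and per-token confidences; both representations are assumptions of this illustration, as the claim itself does not fix a tagging scheme.

```python
def relabel_sequence(tokens, orig_labels, pred_labels, pred_conf,
                     threshold=0.9):
    """Claim-2 selection rule for one sequence (BIO scheme assumed).

    A predicted entity span is adopted only if (a) none of its tokens
    already carries an entity label, (b) it forms a complete
    B-X, I-X, ... sequence, and (c) the mean confidence of its
    predicted tags exceeds the preset threshold.
    """
    new_labels = list(orig_labels)
    i = 0
    while i < len(tokens):
        if not pred_labels[i].startswith("B-"):
            i += 1
            continue
        ent_type = pred_labels[i][2:]
        j = i + 1
        while j < len(tokens) and pred_labels[j] == "I-" + ent_type:
            j += 1
        span = range(i, j)
        not_yet_labeled = all(orig_labels[k] == "O" for k in span)
        avg_conf = sum(pred_conf[k] for k in span) / len(span)
        if not_yet_labeled and avg_conf > threshold:
            for k in span:
                new_labels[k] = pred_labels[k]
        i = j
    return new_labels
```

For example, given a sequence whose original labels are all "O" and a predicted B-DISH/I-DISH span with mean confidence above the threshold, the span's predicted tags are written back into the sequence; spans overlapping an existing entity label are left untouched.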
3. The method according to claim 1 or 2, wherein, during the first N iterations of training on the preset model, each iteration continues training from the model parameters obtained in the previous iteration;
during the (N+1)-th to (N+m)-th iterations of training on the preset model, each iteration starts training from the parameters of the preset model; and
during each iteration of training on the preset model, the number of training rounds per iteration increases with the iteration count.
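Claim 3 thus distinguishes warm-started iterations (1..N) from cold-started ones (N+1..N+m) and lets the per-iteration round count grow. One plausible rendering follows, with a linear growth rule invented for illustration (the claim only requires that the count increase):

```python
import copy

def staged_iterations(preset_model, train_epoch, data_for, N, m,
                      base_epochs=1, step=1):
    """Iteration schedule of claim 3; linear epoch growth is an
    invented example of "increases with the iteration count".

    data_for(t) supplies the training set for iteration t, and
    train_epoch(model, data) runs one round of training in place.
    """
    model = copy.deepcopy(preset_model)
    # Iterations 1..N: warm start -- parameters carry over from the
    # previous iteration.
    for t in range(1, N + 1):
        for _ in range(base_epochs + step * t):
            train_epoch(model, data_for(t))
    # Iterations N+1..N+m: cold start -- each iteration begins again
    # from the preset model's parameters; only the (re-labeled)
    # training data differs between iterations.
    for t in range(N + 1, N + m + 1):
        model = copy.deepcopy(preset_model)
        for _ in range(base_epochs + step * t):
            train_epoch(model, data_for(t))
    return model
```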
4. The method of claim 1, further comprising:
replacing, according to the category of each entity in the fully labeled text sequence and with a preset probability, some of the entities with other entities of the same category from a preset entity dictionary, to obtain an augmented text sequence;
wherein adjusting the parameters of the intermediate model by using the fully labeled text sequence to obtain the named entity recognition model comprises:
adjusting the parameters of the intermediate model by using the fully labeled text sequence and/or the augmented text sequence to obtain the named entity recognition model.
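Claim 4 is dictionary-based data augmentation. A minimal sketch, assuming entities are represented as (surface form, category) pairs and the dictionary maps categories to candidate surface forms (both the representation and the names are assumptions of this illustration):

```python
import random

def augment(entities, entity_dict, p=0.3, rng=random):
    """Claim-4 augmentation: each entity is replaced, with preset
    probability p, by another entity of the same category drawn from
    the preset entity dictionary.

    `entities` is a list of (surface_form, category) pairs -- a flat
    representation assumed for this illustration.
    """
    augmented = []
    for surface, category in entities:
        candidates = entity_dict.get(category, [])
        if candidates and rng.random() < p:
            surface = rng.choice(candidates)
        augmented.append((surface, category))
    return augmented

# e.g. augment([("Beijing", "LOC"), ("roast duck", "DISH")],
#              {"LOC": ["Shanghai"], "DISH": ["hotpot", "noodles"]})
```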
5. The method according to any one of claims 1 to 4, further comprising:
obtaining a text sequence to be recognized; and
inputting the text sequence to be recognized into the named entity recognition model or the intermediate model to obtain a recognition result for the text sequence to be recognized.
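Claim 5 is plain inference. A sketch, assuming a `.predict` interface returning BIO tags (an assumption; any tagging model fits the claim), that also decodes the tags into entity spans:

```python
def recognize(model, tokens):
    """Claim-5 inference: run the named entity recognition model (or
    the intermediate model) on a token sequence and decode the BIO
    tags it returns into (start, end, entity_type) spans."""
    tags = model.predict(tokens)  # .predict interface is assumed
    spans, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            ent_type, j = tags[i][2:], i + 1
            while j < len(tags) and tags[j] == "I-" + ent_type:
                j += 1
            spans.append((i, j, ent_type))
            i = j
        else:
            i += 1
    return spans
```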
6. An apparatus for constructing a named entity recognition model, the apparatus comprising:
a first training module, configured to perform the first N iterations of training on a preset model by using a text sequence containing entity information to obtain an initial model, wherein the text sequence containing entity information comprises a partially labeled text sequence and/or a fully labeled text sequence;
a labeling module, configured to re-label an unlabeled text sequence and/or the partially labeled text sequence by using the initial model to obtain a new partially labeled text sequence;
a second training module, configured to perform the (N+1)-th to (N+m)-th iterations of training on the preset model by using all partially labeled text sequences, including the new partially labeled text sequence, to obtain an intermediate model; and
a fine-tuning module, configured to adjust parameters of the intermediate model by using the fully labeled text sequence to obtain the named entity recognition model;
wherein the unlabeled text sequence carries no entity labels, only some of the entities in the partially labeled text sequence are labeled, and all of the entities in the fully labeled text sequence are labeled.
7. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of constructing a named entity recognition model according to any one of claims 1 to 5.
8. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the method of constructing a named entity recognition model according to any one of claims 1 to 5.
CN202011099769.5A 2020-10-13 2020-10-13 Method, device and equipment for constructing named entity recognition model and storage medium Pending CN112329466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011099769.5A CN112329466A (en) 2020-10-13 2020-10-13 Method, device and equipment for constructing named entity recognition model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011099769.5A CN112329466A (en) 2020-10-13 2020-10-13 Method, device and equipment for constructing named entity recognition model and storage medium

Publications (1)

Publication Number Publication Date
CN112329466A (en) 2021-02-05

Family

ID=74314906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011099769.5A Pending CN112329466A (en) 2020-10-13 2020-10-13 Method, device and equipment for constructing named entity recognition model and storage medium

Country Status (1)

Country Link
CN (1) CN112329466A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723104A (en) * 2021-09-15 2021-11-30 云知声智能科技股份有限公司 Method and device for entity extraction under noisy data
CN117574906A (en) * 2024-01-15 2024-02-20 深圳市客路网络科技有限公司 Named entity identification method, device and equipment
CN117574906B (en) * 2024-01-15 2024-05-24 深圳市客路网络科技有限公司 Named entity identification method, device and equipment

Similar Documents

Publication Publication Date Title
CN107622050B (en) Bi-LSTM and CRF-based text sequence labeling system and method
CN107729468B (en) answer extraction method and system based on deep learning
CN106570180B (en) Voice search method and device based on artificial intelligence
US20210406464A1 (en) Skill word evaluation method and device, electronic device, and non-transitory computer readable storage medium
CN112966106A (en) Text emotion recognition method, device and equipment and storage medium
CN112070138A (en) Multi-label mixed classification model construction method, news classification method and system
CN110795938A (en) Text sequence word segmentation method, device and storage medium
CN110751234A (en) OCR recognition error correction method, device and equipment
CN112188311B (en) Method and apparatus for determining video material of news
CN107844531B (en) Answer output method and device and computer equipment
CN116070632A (en) Informal text entity tag identification method and device
CN112329466A (en) Method, device and equipment for constructing named entity recognition model and storage medium
CN114218379A (en) Intelligent question-answering system-oriented method for attributing questions which cannot be answered
US20200226325A1 (en) Converting unstructured technical reports to structured technical reports using machine learning
CN110442858B (en) Question entity identification method and device, computer equipment and storage medium
CN110472231A (en) It is a kind of identification legal documents case by method and apparatus
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN115858781A (en) Text label extraction method, device, equipment and medium
CN115238093A (en) Model training method and device, electronic equipment and storage medium
CN114741512A (en) Automatic text classification method and system
CN114638229A (en) Entity identification method, device, medium and equipment of record data
CN111538898B (en) Web service package recommendation method and system based on combined feature extraction
CN113934833A (en) Training data acquisition method, device and system and storage medium
CN114298048A (en) Named entity identification method and device
CN109885827B (en) Deep learning-based named entity identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination