CN112101032A - Named entity identification and error correction method based on self-distillation
- Publication number: CN112101032A
- Application number: CN202010897066.0A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F40/295: Handling natural language data; Natural language analysis; Recognition of textual entities; Phrasal analysis; Named entity recognition
- G06F40/216: Handling natural language data; Natural language analysis; Parsing; Parsing using statistical methods
Abstract
The invention discloses a named entity recognition and error correction method based on self-distillation, which comprises the following steps: training a named entity recognition model, the model comprising a first layer model, a second layer model and a third layer model, wherein the first layer model is trained on unlabeled data and its probability distribution is compressed into the second layer model, the second layer model performs named entity extraction, and the third layer model performs error detection and error correction on the extracted named entities; acquiring the text from which named entities are to be extracted and inputting the text into the model; and extracting the named entities through the named entity recognition model. The method addresses the problem that, when a text contains grammatical or punctuation errors, overly strong contextual features cause an erroneous named entity to be extracted. It can also automatically correct a wrongly extracted named entity, achieving more accurate extraction.
Description
Technical Field
The invention relates to the field of text processing, and in particular to a named entity recognition and error correction method, apparatus, storage medium and computing device based on self-distillation.
Background
Named entity recognition over text data is a natural language processing technology widely used in human-machine dialogue systems. After the text data is cleaned, mapped to vectors and matched against semantic understanding, words that meet preset attributes can be extracted automatically, so that a particular class of special words can be specifically recognized.
The prior art has the following defects:
1) In a typical named entity recognition method, if a grammatical or punctuation error occurs around a named entity in the text, overly strong contextual features cause an erroneous word to be extracted, and such abnormal errors cannot be resolved.
2) Typical techniques lack large-scale external knowledge and grammatical-logic support, so it is difficult to extract special words accurately.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present invention provides a named entity recognition and error correction method, apparatus, storage medium and computing device based on self-distillation, which accurately extract the named entities of a text by training a three-layer named entity recognition model. The object of the invention is achieved by the following scheme:
A self-distillation based named entity recognition and error correction method, comprising:
training a named entity recognition model; the named entity recognition model comprises a first layer model, a second layer model and a third layer model; the first layer model is configured to be trained on unlabeled data and to compress its probability distribution into the second layer model; the second layer model is configured to perform named entity extraction; the third layer model is configured to perform error detection on the extracted named entities and to correct errors once they are detected;
acquiring the text from which named entities are to be extracted and inputting the text into the named entity recognition model;
and extracting the named entities through the named entity recognition model.
Further, the first layer model is a bert-large model;
training the bert-large model, comprising:
pre-training the bert-large model on unlabeled data, and fine-tuning the bert-large model on vertical-domain data.
Further, the second layer model is a Transformer Encoder model;
training the Transformer Encoder model, including:
training the Transformer Encoder model on the contextual features of each word in the text, wherein the probability distribution of the first layer model is compressed into the Transformer Encoder model in advance by means of branch self-distillation, and the fully connected layer output of the first layer model is used as pre-trained word vectors.
Further, the third layer model is a CBOW model;
training the CBOW model, including:
training a vertical-domain CBOW model on structured text data.
Further, performing error detection on the extracted named entity includes:
calculating and scoring the grammatical plausibility of the named entity;
if the grammatical plausibility score of the named entity is higher than a preset first threshold, outputting the extracted named entity as the result;
and if the grammatical plausibility score of the named entity is lower than the preset first threshold, determining that the extracted named entity is erroneous.
Further, performing error correction after an error is detected includes:
generating a correction candidate set according to the extracted named entity;
evaluating the named entities in the correction candidate set;
and selecting a named entity from the correction candidate set according to the evaluation result as the correction result for the extracted named entity.
Further, evaluating the named entities in the correction candidate set includes:
acquiring modification rate statistics for the named entities in the correction candidate set;
calculating a first semantic similarity between the named entities in the correction candidate set and the named entities in a proprietary lexicon;
calculating a second semantic similarity between the named entities in the correction candidate set and the named entities in a public lexicon;
and evaluating the named entities in the correction candidate set according to the modification rate statistics, the first semantic similarity and the second semantic similarity.
A named entity recognition and error correction device based on self-distillation, comprising:
a model training module, configured to train a named entity recognition model; the named entity recognition model comprises a first layer model, a second layer model and a third layer model; the first layer model is configured to be trained on unlabeled data and to compress its probability distribution into the second layer model; the second layer model is configured to perform named entity extraction; the third layer model is configured to perform error detection on the extracted named entities and to correct errors once they are detected;
a text acquisition module, configured to acquire the text from which named entities are to be extracted and input the text into the named entity recognition model;
and a named entity extraction module, configured to extract the named entities of the text through the named entity recognition model.
A readable storage medium having executable instructions thereon which, when executed, cause a computer to perform a method as described above.
A computing device, comprising:
one or more processors;
a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the methods described above.
Compared with the prior art, the invention has the following advantages: an externally trained knowledge model is introduced into the three-layer named entity recognition model to inject knowledge into it, so that even when common grammatical or punctuation errors occur in the text, the named entity recognition model can discard certain words according to a preset threshold or automatically correct them, achieving more accurate named entity extraction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a flow chart of a named entity recognition and error correction method based on self-distillation according to an embodiment of the present invention;
FIG. 2 is a schematic view of attention scores at different layers according to an embodiment of the present invention;
FIG. 3 is a structural block diagram of a named entity recognition and error correction device based on self-distillation according to an embodiment of the present invention.
Detailed description of the preferred embodiments
The present invention will be described in further detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 schematically shows a self-distillation based named entity recognition and error correction method according to the present invention. The method starts with step S100: training the named entity recognition model. The named entity recognition model comprises three layers, namely a first-layer model, a second-layer model and a third-layer model, and in order to extract named entities accurately, the three layers are trained on different content.
The first-layer model is a bert-large model. Training it consists of pre-training on unlabeled data followed by fine-tuning on vertical-domain data. The first-layer model is trained on this unlabeled data, and its probability distribution is compressed into the second-layer model.
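As an illustrative sketch only (not a definition of the claimed method), the following snippet shows one way the unlabeled-data training and vertical-domain fine-tuning of the first-layer model could be set up with a masked-language-modeling objective in a PyTorch/HuggingFace environment; the checkpoint name, the placeholder corpus and all hyper-parameters are assumptions.

```python
# Hypothetical sketch: domain-adaptive masked-LM training of the first-layer
# (teacher) bert-large model on unlabeled vertical-domain text.
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
teacher = BertForMaskedLM.from_pretrained("bert-large-uncased")

# Unlabeled vertical-domain sentences (placeholder data).
domain_sentences = ["first unlabeled domain sentence", "second unlabeled domain sentence"]
encodings = [tokenizer(s, truncation=True, max_length=128) for s in domain_sentences]

# Randomly masks 15% of tokens so the model adapts to domain-specific wording.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=16, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(teacher.parameters(), lr=2e-5)
teacher.train()
for epoch in range(3):
    for batch in loader:
        loss = teacher(**batch).loss   # masked-LM loss on the domain text
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```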
The second-layer model is a Transformer Encoder model. It is trained on the contextual features of each word in the text; the probability distribution of the first-layer model is compressed into the Transformer Encoder model in advance by branch self-distillation, and the fully connected layer output of the first-layer model is used as pre-trained word vectors. The named entity recognition model is then trained on the high-dimensional vector features of the corresponding words in these pre-trained word vectors, so that accurate named entity extraction can be obtained from the model in subsequent use. The second-layer model performs named entity extraction.
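A minimal sketch of the distillation objective assumed here is given below: the teacher's per-token probability distribution is combined, via a temperature-scaled KL term, with the ordinary hard-label loss of the student Transformer Encoder. The temperature, weighting and tensor shapes are illustrative choices, not values taken from the patent.

```python
# Hedged sketch of a branch self-distillation loss: soft teacher targets + hard NER labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_labels, T=2.0, alpha=0.5):
    """student_logits/teacher_logits: (batch, seq_len, num_tags); gold_labels: (batch, seq_len)."""
    # Soft targets: the teacher's probability distribution at temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy against the labeled NER tags.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         gold_labels.view(-1))
    return alpha * kd + (1.0 - alpha) * ce
```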
By training the first-layer model and the second-layer model, more accurate external knowledge is provided for the named entity recognition model, and meanwhile, the named entity recognition model has higher speed and smaller size.
The third-layer model is a CBOW model. It is trained as follows: a vertical-domain CBOW model is trained on structured text data.
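For illustration, a vertical-domain CBOW model of the kind described here could be trained with gensim roughly as follows; the toy corpus, file name and hyper-parameters are assumptions.

```python
# Illustrative sketch: training a vertical-domain CBOW model with gensim.
from gensim.models import Word2Vec

# Structured vertical-domain text, already tokenized into word lists (placeholder data).
corpus = [["contract", "termination", "clause"],
          ["invoice", "amount", "payable"]]

cbow = Word2Vec(sentences=corpus,
                vector_size=200,   # embedding dimension
                window=5,
                min_count=1,
                sg=0,              # sg=0 selects the CBOW architecture
                epochs=10)
cbow.save("vertical_cbow.model")
```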
The third-layer model performs error detection on the extracted named entities and corrects errors once they are detected.
After model training is finished, the method proceeds to step S101, in which the text from which named entities are to be extracted is obtained and input into the named entity recognition model, and then to step S102, in which the named entities are extracted by the named entity recognition model.
After the named entities have been extracted, the trained third-layer model performs error detection on them and corrects errors once they are detected. The trained third-layer model provides an accurate error detection and correction scheme for named entity extraction and improves the semantic understanding capability of the model.
Meanwhile, whereas a common named entity recognition method can only extract words or sentences as they appear and cannot recognize and correct errors from the existing data, the method disclosed by the invention extends the capability boundary of the named entity model through the trained third-layer model.
Error detection on the extracted named entity comprises the following steps:
calculating and scoring the grammatical plausibility of the named entity; if the grammatical plausibility score of the named entity is higher than a preset first threshold (preferably 0.5), confirming that the named entity extracted by the named entity recognition model is correct, and outputting the extracted named entity as the extraction result;
and if the grammatical plausibility score of the named entity is lower than the preset first threshold, confirming that the named entity extracted by the named entity recognition model is erroneous.
The grammatical logic is learned from a large amount of unlabeled vertical-domain data and is embodied in the form of a language model. The language model scores an entity according to the words surrounding it to obtain its grammatical plausibility; if the score is too low, the corresponding wording does not exist, or rarely exists, in the large-scale corpus, which means the grammatical logic is likely problematic and the text is likely an abnormal description, and the named entity extracted by the named entity recognition model is confirmed to be erroneous.
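The following sketch illustrates this detection step under the assumption that the language model is a causal LM and that the plausibility score is a length-normalized likelihood mapped into (0, 1]; the patent fixes only the threshold (0.5 in this embodiment), not a particular scoring formula.

```python
# Hedged sketch: score an extracted entity in context with a language model and
# flag it when the score falls below the first threshold.
import math
import torch

def plausibility_score(lm, tokenizer, sentence):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # For a causal LM, .loss is the mean per-token negative log-likelihood.
        loss = lm(input_ids=ids, labels=ids).loss
    return math.exp(-loss.item())   # in (0, 1]; higher means more plausible

def detect_error(lm, tokenizer, sentence, entity, threshold=0.5):
    score = plausibility_score(lm, tokenizer, sentence)
    return ("accept", entity) if score >= threshold else ("needs_correction", entity)
```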
After error detection, if the grammatical plausibility score of the named entity is lower than the preset first threshold, the extracted named entity is confirmed to be erroneous and needs to be corrected.
The error correction process comprises the following steps:
generating a correction candidate set according to the extracted named entity;
evaluating the named entities in the correction candidate set, which comprises:
acquiring modification rate statistics for the named entities in the correction candidate set;
calculating a first semantic similarity between the named entities in the correction candidate set and the named entities in the proprietary lexicon; the proprietary lexicon here is specific to the enterprise for which named entity extraction is performed;
calculating a second semantic similarity between the named entities in the correction candidate set and the named entities in the public lexicon; the public lexicon can be any existing lexicon open to public query, such as WiKi, Babel or Zhishime;
and evaluating the named entities in the correction candidate set according to the modification rate statistics, the first semantic similarity and the second semantic similarity.
During the evaluation, the modification rate is defined as 1 minus the word-vector similarity of the two words, i.e. the more similar the two words are, the lower the modification rate.
In the specific evaluation process, if the modification rate is higher than 0.5, or the first semantic similarity is higher than a preset third threshold (preferably 0.7), or the second semantic similarity is higher than a preset fourth threshold (preferably 0.5), the named entity in the correction candidate set is determined to be correct; the corrected named entity is then selected from the correction candidate set as the correction result for the extracted named entity and is output as the named entity extraction result.
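A sketch of this evaluation rule is shown below; the thresholds follow this embodiment (modification rate above 0.5, first similarity above 0.7, second similarity above 0.5), while the cosine-similarity helper and the word-vector lookups are illustrative assumptions.

```python
# Hedged sketch of evaluating a correction candidate against the embodiment's thresholds.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def modification_rate(original_vec, candidate_vec):
    # Defined above as 1 minus the word-vector similarity of the two words.
    return 1.0 - cosine(original_vec, candidate_vec)

def candidate_is_correct(original_vec, candidate_vec, private_lexicon_vecs, public_lexicon_vecs):
    mod_rate = modification_rate(original_vec, candidate_vec)
    sim_private = max(cosine(candidate_vec, v) for v in private_lexicon_vecs)  # first similarity
    sim_public = max(cosine(candidate_vec, v) for v in public_lexicon_vecs)    # second similarity
    return mod_rate > 0.5 or sim_private > 0.7 or sim_public > 0.5
```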
Referring to the specification and accompanying Fig. 2, which shows the attention scores at different layers in an embodiment of the invention: in the pre-training stage, the model is fine-tuned on the vertical-domain data set by freezing weights layer by layer, and the attention score is found to be highest at 3 layers, which shows that a 3-layer model already contains most of the information needed for named entity recognition.
On this basis, a 3-layer Transformer Encoder structure is used as the knowledge acceptor, and a bert-large model fine-tuned on a large amount of unlabeled vertical-domain data serves as the guiding teacher. The probability distribution of the bert-large model is compressed into the 3-layer Transformer Encoder model by branch self-distillation, and the fully connected layer output of the model is used as pre-trained word vectors, which provides more accurate external knowledge for the named entity recognition model while keeping it faster and smaller.
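As a sketch of how the 3-layer knowledge acceptor could be built, the snippet below initializes a 3-layer Transformer encoder student from the lower layers of the fine-tuned teacher; the checkpoint name, layer selection and copying strategy are assumptions, since the patent only states that a 3-layer Transformer Encoder is used.

```python
# Hedged sketch: build a 3-layer student encoder from the teacher's lower layers.
import copy
from transformers import BertConfig, BertModel

teacher = BertModel.from_pretrained("bert-large-uncased")   # illustrative checkpoint name

student_cfg = BertConfig.from_pretrained("bert-large-uncased", num_hidden_layers=3)
student = BertModel(student_cfg)

# Reuse the teacher's embeddings and first three encoder layers as the starting point;
# the student is then trained with the distillation loss sketched earlier.
student.embeddings = copy.deepcopy(teacher.embeddings)
for i in range(3):
    student.encoder.layer[i] = copy.deepcopy(teacher.encoder.layer[i])
```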
The following table compares the effect and performance of the named entity recognition model with other models, where recall denotes the recall ratio, precision and F1-score denote the precision and F1 measures, and predict_speed denotes the prediction speed (i.e. performance) of the model on CPU:
From the above table it can be seen that, compared with the latest mainstream techniques on the vertical-domain data set, the (distilled Transformer + BiLSTM + CRF) model has obvious advantages in recall and precision; in effect it is almost on par with the current SOTA technical path (BERT + BiLSTM + CRF), but its performance is more than 8 times that of the SOTA path, making it more practical.
The invention also discloses an accurate named entity recognition and error correction technique based on self-distillation, with the following specific steps:
the method comprises the following steps: the named entity extraction technology is accurate and rapid, namely, named entities which are judged through a named entity model in a text and accord with preset categories are accurately extracted:
the method comprises the steps of training a named entity recognition model through the context features of each word in a text and the corresponding named entity high-dimension vector features in a miniaturized pre-training word vector obtained based on knowledge distillation, and obtaining a relatively accurate named entity through the named entity recognition model in the subsequent use.
Step two: automatic error correction of named entities according to the named entity recognition model and the vocabulary of existing lexicons.
The grammatical plausibility of each suspected named entity is calculated and scored by a language model trained on a large amount of text data. Suspected named entities with higher scores enter the named entity library; those with lower scores are modified according to candidate modification schemes generated by the grammar model. The modification rate of the automatically corrected word and its semantic similarity with the other named entities of the enterprise and with all named entities in the existing lexicons are then calculated, and if the semantic similarity is higher than a preset threshold, the correction is judged to be correct.
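Purely as a reading aid, the pseudocode-style sketch below strings the two steps together; extract_entities, generate_candidates and vectorize are hypothetical stand-ins for the modules described above, and detect_error and candidate_is_correct refer to the earlier sketches.

```python
# Hypothetical end-to-end flow: extract, score, and correct named entities.
def recognize_and_correct(text, ner_model, lm, tokenizer, lexicons):
    results = []
    for entity in extract_entities(ner_model, text):               # step one: extraction
        status, _ = detect_error(lm, tokenizer, text, entity)      # grammatical plausibility check
        if status == "accept":
            results.append(entity)
            continue
        for candidate in generate_candidates(entity):              # step two: correction
            vectors = vectorize(entity, candidate, lexicons)       # word vectors for the rule below
            if candidate_is_correct(*vectors):
                results.append(candidate)
                break
    return results
```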
Referring to Fig. 3, an embodiment of the present invention provides a named entity recognition and error correction device based on self-distillation, comprising:
a model training module 200, configured to train a named entity recognition model; the named entity recognition model comprises a first layer model, a second layer model and a third layer model; the first layer model is configured to be trained on unlabeled data and to compress its probability distribution into the second layer model; the second layer model is configured to perform named entity extraction; the third layer model is configured to perform error detection on the extracted named entities and to correct errors once they are detected;
a text acquisition module 201, configured to acquire the text from which named entities are to be extracted and input it into the named entity recognition model;
and a named entity extraction module 202, configured to extract the named entities of the text through the named entity recognition model.
Optionally, the device further comprises:
an error detection module, configured to calculate and score the grammatical plausibility of the named entity;
if the grammatical plausibility score of the named entity is higher than a preset first threshold, the extracted named entity is output as the result;
and if the grammatical plausibility score of the named entity is lower than the preset first threshold, the extracted named entity is confirmed to be erroneous.
Optionally, the device further includes an error correction module, configured to generate a correction candidate set according to the extracted named entity, evaluate the named entities in the correction candidate set, and select a named entity from the correction candidate set according to the evaluation result as the correction result for the extracted named entity.
Optionally, the device further includes a named entity evaluation module, configured to acquire modification rate statistics for the named entities in the correction candidate set; calculate a first semantic similarity between the named entities in the correction candidate set and the named entities in the proprietary lexicon; calculate a second semantic similarity between the named entities in the correction candidate set and the named entities in the public lexicon; and evaluate the named entities in the correction candidate set according to the modification rate statistics, the first semantic similarity and the second semantic similarity.
The invention also discloses a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements: training a named entity recognition model; acquiring the text from which named entities are to be extracted and inputting it into the named entity recognition model; and extracting the named entities through the named entity recognition model.
The invention also discloses a terminal, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps: training a named entity recognition model; acquiring the text from which named entities are to be extracted and inputting it into the named entity recognition model; and extracting the named entities through the named entity recognition model.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules or other data. Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various disclosed aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.
Claims (10)
1. A named entity recognition and error correction method based on self-distillation, comprising:
training a named entity recognition model; the named entity recognition model comprises a first layer model, a second layer model and a third layer model; the first layer model is configured to be trained on unlabeled data and to compress its probability distribution into the second layer model; the second layer model is configured to perform named entity extraction; the third layer model is configured to perform error detection on the extracted named entities and to correct errors once they are detected;
acquiring the text from which named entities are to be extracted and inputting the text into the named entity recognition model;
and extracting the named entities through the named entity recognition model.
2. The method of claim 1, wherein the first layer model is a bert-large model;
training the bert-large model, comprising:
pre-training the bert-large model on unlabeled data, and fine-tuning the bert-large model on vertical-domain data.
3. The method of claim 1, wherein the second layer model is a Transformer Encoder model;
training the Transformer Encoder model, including:
training the Transformer Encoder model on the contextual features of each word in the text, wherein the probability distribution of the first layer model is compressed into the Transformer Encoder model in advance by means of branch self-distillation, and the fully connected layer output of the first layer model is used as pre-trained word vectors.
4. The method of claim 1, wherein the third layer model is a CBOW model;
training the CBOW model, including:
training a vertical-domain CBOW model on structured text data.
5. The method of claim 1, wherein performing error detection on the extracted named entity comprises:
calculating and scoring the grammatical plausibility of the named entity;
if the grammatical plausibility score of the named entity is higher than a preset first threshold, outputting the extracted named entity as the result;
and if the grammatical plausibility score of the named entity is lower than the preset first threshold, determining that the extracted named entity is erroneous.
6. The method of claim 1, wherein performing error correction after detecting an error comprises:
generating a correction candidate set according to the extracted named entity;
evaluating the named entities in the correction candidate set;
and selecting a named entity from the correction candidate set according to the evaluation result as the correction result for the extracted named entity.
7. The method of claim 6, wherein evaluating the named entities in the correction candidate set comprises:
acquiring modification rate statistics for the named entities in the correction candidate set;
calculating a first semantic similarity between the named entities in the correction candidate set and the named entities in a proprietary lexicon;
calculating a second semantic similarity between the named entities in the correction candidate set and the named entities in a public lexicon;
and evaluating the named entities in the correction candidate set according to the modification rate statistics, the first semantic similarity and the second semantic similarity.
8. A named entity recognition and error correction device based on self-distillation, comprising:
a model training module, configured to train a named entity recognition model; the named entity recognition model comprises a first layer model, a second layer model and a third layer model; the first layer model is configured to be trained on unlabeled data and to compress its probability distribution into the second layer model; the second layer model is configured to perform named entity extraction; the third layer model is configured to perform error detection on the extracted named entities and to correct errors once they are detected;
a text acquisition module, configured to acquire the text from which named entities are to be extracted and input the text into the named entity recognition model;
and a named entity extraction module, configured to extract the named entities of the text through the named entity recognition model.
9. A readable storage medium having executable instructions thereon that, when executed, cause a computer to perform the method of any one of claims 1-7.
10. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method recited in any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010897066.0A CN112101032B (en) | 2020-08-31 | 2020-08-31 | Named entity identification and error correction method based on self-distillation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010897066.0A CN112101032B (en) | 2020-08-31 | 2020-08-31 | Named entity identification and error correction method based on self-distillation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112101032A true CN112101032A (en) | 2020-12-18 |
CN112101032B CN112101032B (en) | 2024-09-24 |
Family
ID=73756986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010897066.0A Active CN112101032B (en) | 2020-08-31 | 2020-08-31 | Named entity identification and error correction method based on self-distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112101032B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110782881A (en) * | 2019-10-25 | 2020-02-11 | 四川长虹电器股份有限公司 | Video entity error correction method after speech recognition and entity recognition |
CN111126068A (en) * | 2019-12-25 | 2020-05-08 | 中电云脑(天津)科技有限公司 | Chinese named entity recognition method and device and electronic equipment |
CN111242033A (en) * | 2020-01-13 | 2020-06-05 | 南京大学 | Video feature learning method based on discriminant analysis of video and character pairs |
CN111523324A (en) * | 2020-03-18 | 2020-08-11 | 大箴(杭州)科技有限公司 | Training method and device for named entity recognition model |
CN111553479A (en) * | 2020-05-13 | 2020-08-18 | 鼎富智能科技有限公司 | Model distillation method, text retrieval method and text retrieval device |
Non-Patent Citations (2)
Title |
---|
SUBHABRATA MUKHERJEE et al.: "XtremeDistil: Multi-stage Distillation for Massive Multilingual Models", arXiv:2004.05686v2, 5 May 2020, pages 1-9 *
HU Panpan: "Natural Language Processing from Introduction to Practice" (《自然语言处理从入门到实战》), 30 June 2020, Beijing: China Railway Publishing House, pages 74-75 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113012701A (en) * | 2021-03-16 | 2021-06-22 | 联想(北京)有限公司 | Identification method, identification device, electronic equipment and storage medium |
CN113012701B (en) * | 2021-03-16 | 2024-03-22 | 联想(北京)有限公司 | Identification method, identification device, electronic equipment and storage medium |
CN113095061A (en) * | 2021-03-31 | 2021-07-09 | 京华信息科技股份有限公司 | Method, system and device for extracting document header and storage medium |
CN113095061B (en) * | 2021-03-31 | 2023-08-29 | 京华信息科技股份有限公司 | Method, system, device and storage medium for extracting document header |
CN114648029A (en) * | 2022-03-31 | 2022-06-21 | 河海大学 | Electric power field named entity identification method based on BiLSTM-CRF model |
CN116681074A (en) * | 2023-08-04 | 2023-09-01 | 中科航迈数控软件(深圳)有限公司 | Method, device, equipment and storage medium for detecting misoperation of numerical control system |
CN116681074B (en) * | 2023-08-04 | 2024-04-05 | 中科航迈数控软件(深圳)有限公司 | Method, device, equipment and storage medium for detecting misoperation of numerical control system |
Also Published As
Publication number | Publication date |
---|---|
CN112101032B (en) | 2024-09-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||