CN117610544A

CN117610544A - Term error correction method, apparatus, electronic device, and storage medium

Info

Publication number: CN117610544A
Application number: CN202311542825.1A
Authority: CN
Inventors: 邓乔波
Original assignee: Iol Wuhan Information Technology Co ltd
Current assignee: Iol Wuhan Information Technology Co ltd
Priority date: 2023-11-16
Filing date: 2023-11-16
Publication date: 2024-02-27

Abstract

The invention provides a term error correction method, a term error correction apparatus, an electronic device, and a storage medium. The method comprises: determining a text to be corrected in a target field; inputting the text to be corrected into an error entity recognition model to obtain an error entity in the text to be corrected, as output by the error entity recognition model; and matching the error entity against a term library of the target field, and correcting the text to be corrected based on the correct term matched with the error entity. By combining the error entity recognition model with term library matching, the method, apparatus, electronic device, and storage medium provided by the invention perform term error correction specific to the target field, which keeps the correction domain-specific and improves its accuracy.

Description

Term error correction method, apparatus, electronic device, and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a term error correction method, a term error correction device, an electronic device, and a storage medium.

Background

With the development of natural language processing technology, a text correction method based on a neural network model gradually becomes a mainstream technology of text correction.

Trained on a large corpus, such a model can acquire good error correction capability, so neural-network-based text error correction performs well on general-domain text. However, terms specific to a professional field occur only rarely in a general corpus, and model training relies heavily on the general corpus, so conventional methods perform poorly at correcting terms in professional fields.

Disclosure of Invention

The invention provides a term error correction method, a term error correction apparatus, an electronic device, and a storage medium, which address the poor performance of prior-art term error correction in professional fields.

The invention provides a term error correction method, comprising the following steps:

determining a text to be corrected in the target field;

inputting the text to be corrected into an error entity recognition model to obtain an error entity in the text to be corrected, which is output by the error entity recognition model;

and matching the error entity with a term library in the target field, and correcting the text to be corrected based on the correct term matched with the error entity.

According to the term error correction method provided by the invention, the training step of the error entity identification model comprises the following steps:

acquiring a sample text in the target field and an error entity tag of the sample text;

inputting the sample text into a sequence labeling model to obtain an error entity prediction result of the sample text output by the sequence labeling model;

and performing parameter iteration on the sequence labeling model based on the error entity prediction result and the error entity label to obtain the error entity recognition model.

According to the term error correction method provided by the invention, the method for acquiring the sample text in the target field and the error entity label of the sample text comprises the following steps:

acquiring an original text of the target field;

and replacing correct terms in the original text with error terms based on the term library in the target field, obtaining the sample text, and determining error entity labels of the sample text based on the error terms used for replacement.

According to the term error correction method provided by the invention, the matching of the error entity with the term library in the target field comprises the following steps:

matching the erroneous entity with candidate erroneous terms in the term library;

and in the case of matching to the candidate error term, taking the correct term corresponding to the matched candidate error term in the term library as the correct term matched with the error entity.

According to the term error correction method provided by the invention, the matching of the error entity with the candidate error term in the term library further comprises the following steps:

under the condition that the candidate error terms are not matched, converting the error entity into a first pinyin sequence, and matching the first pinyin sequence with a second pinyin sequence of each correct term in the term library;

and under the condition of matching with the second pinyin sequence, taking the correct term corresponding to the second pinyin sequence as the correct term matched with the error entity.

According to the term error correction method provided by the invention, the converting of the error entity into a first pinyin sequence comprises:

converting the error entity into a pinyin sequence, and performing fuzzy processing on the pinyin sequence of the error entity to obtain the first pinyin sequence;

the second pinyin sequence is obtained by converting the correct term into a pinyin sequence and performing fuzzy processing on that sequence.

According to the term error correction method provided by the invention, the matching of the first pinyin sequence with the second pinyin sequence of each correct term in the term base further comprises the following steps:

in the case that no second pinyin sequence is matched, performing similar-shape character replacement on the error entity to obtain a similar-shape entity, and matching the similar-shape entity with the correct terms in the term library;

and in the case of matching to a correct term, taking the matched correct term as the correct term matched with the error entity.

The present invention also provides a term error correction apparatus comprising:

the text acquisition unit is used for determining a text to be corrected in the target field;

the entity identification unit is used for inputting the text to be corrected into an error entity identification model to obtain an error entity in the text to be corrected, which is output by the error entity identification model;

and the term matching unit is used for matching the error entity with a term library in the target field and correcting the text to be corrected based on the correct term matched with the error entity.

The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the term error correction method as defined in any of the above when executing said program.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the term error correction method as described in any of the above.

The invention also provides a computer program product comprising a computer program which when executed by a processor implements the term error correction method as described in any of the above.

By combining the error entity recognition model with term library matching, the term error correction method, apparatus, electronic device, and storage medium provided by the invention perform term error correction specific to the target field, which keeps the correction domain-specific and improves its accuracy.

Drawings

In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a first schematic flow chart of the term error correction method provided by the present invention;

FIG. 2 is a first schematic flow chart of the training method of the error entity recognition model provided by the present invention;

FIG. 3 is a second schematic flow chart of the training method of the error entity recognition model provided by the present invention;

FIG. 4 is a second schematic flow chart of the term error correction method provided by the present invention;

FIG. 5 is a schematic structural diagram of the term error correction apparatus provided by the present invention;

FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In many fields, the use of incorrect terms may lead to serious misunderstandings and even serious consequences. Accurate use of professional terms is therefore critical.

In neural-network-based text error correction schemes, terms specific to a professional field occur only rarely in the general corpus, and model training relies heavily on that corpus, so existing models perform poorly at term error correction in professional fields. The problem is especially pronounced in fields such as medicine, law, and science and technology, where the spelling and usage of professional terms differ greatly from everyday general-domain text.

FIG. 1 is a first schematic flow chart of the term error correction method provided by the present invention. As shown in FIG. 1, the method includes:

Step 110, determining a text to be corrected in the target field.

Specifically, the target field is a specific field in which text error correction is required, for example the legal field, the scientific field, or the medical field, or a sub-field of any of these; the medical field, for example, may be subdivided into internal medicine, imaging, pathology, and so on.

The text to be corrected is the text on which error correction is to be performed. It belongs to the target field: it contains terms of the target field or describes information in the target field.

The text to be corrected may be text entered directly by the user, text captured from the Internet, or text obtained by transcribing speech or video supplied by the user; the embodiment of the present invention does not specifically limit this.

Step 120, inputting the text to be corrected into an error entity recognition model to obtain an error entity in the text to be corrected, as output by the error entity recognition model.

Specifically, once the text to be corrected has been obtained, it can be input into the error entity recognition model. The terms in the text that the model outputs as possibly erroneous are recorded as error entities.

It will be appreciated that the error entity recognition model is specific to the target field, and different fields may have different error entity recognition models, each of which recognizes the terms likely to be erroneous in its own field.

The error entity recognition model may be a pre-trained entity recognition model; that is, it may be built on NER (Named Entity Recognition) techniques. NER is generally used to identify specific entities in text, such as person names, place names, organization names, and proper nouns. In the embodiment of the invention, erroneous terms in the target field are treated as a type of entity to be recognized, which enables recognition of error entities.

Step 130, matching the error entity with a term library in the target field, and correcting the text to be corrected based on the correct term matched with the error entity.

Specifically, after obtaining the error entity in the text to be corrected based on the error entity recognition model, the correct term matched with the error entity can be obtained by means of term matching.

A term library of the target field can be established in advance; the term library stores the professional terms of the target field. By matching the error entity against these professional terms, the matched professional term, namely the correct term, can be taken as the term that should have been written in the text to be corrected, and term error correction is achieved by replacing the error entity with the correct term. The error entity may be matched with a correct term by pronunciation, that is, the correct term whose pronunciation is consistent with that of the error entity is taken as the matched correct term. Alternatively, the edit distance between the error entity and each correct term may be calculated, and the correct term with the smallest edit distance taken as the matched correct term.
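The edit-distance alternative can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the `max_distance` cutoff are assumptions introduced here (the description only says that the correct term with the smallest edit distance is chosen).

```python
from typing import Iterable, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def closest_term(error_entity: str, correct_terms: Iterable[str],
                 max_distance: int = 2) -> Optional[str]:
    """Return the correct term with the smallest edit distance.

    max_distance is an assumed cutoff (not in the source) that avoids
    "correcting" an entity to an unrelated term; None means no match.
    """
    best, best_d = None, max_distance + 1
    for term in correct_terms:
        d = edit_distance(error_entity, term)
        if d < best_d:
            best, best_d = term, d
    return best
```

With a cutoff, an entity that is far from every term in the library is left unmatched rather than forced onto the nearest term.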

In addition, the term library may store not only terms under the target field, but also common error terms corresponding to the terms, which may be referred to herein as candidate error terms. Here, the way to match the incorrect entity with the correct term may be to match the incorrect entity with a candidate incorrect term in the term base, and in case of matching to the candidate incorrect term, the correct term corresponding to the candidate incorrect term in the term base is used as the correct term matched with the incorrect entity.

After obtaining the correct term matched with the error entity, the error entity in the text to be corrected can be replaced based on the correct term, so that text correction for the text to be corrected is realized.

The method provided by the embodiment of the invention combines the error entity recognition model with term library matching to perform term error correction specific to the target field, which keeps the correction domain-specific and improves its accuracy.

Based on the above embodiment, FIG. 2 is a first schematic flow chart of the training method of the error entity recognition model provided by the present invention. As shown in FIG. 2, the training of the error entity recognition model includes:

Step 210, obtaining a sample text in the target field and an error entity label of the sample text.

Here, the sample text is a training sample for the error entity recognition model, and the error entity label of the sample text is the label required for supervised training of that model. The sample text is text in the target field that contains error entities of the target field, and its error entity label is a pre-annotated label identifying those error entities.

Step 220, inputting the sample text into a sequence labeling model to obtain an error entity prediction result of the sample text output by the sequence labeling model.

Step 230, performing parameter iteration on the sequence labeling model based on the error entity prediction result and the error entity label to obtain the error entity recognition model.

Specifically, once the sample texts in the target field have been collected and their error entity labels annotated, the error entity recognition model can be trained on them. Training can start from a sequence labeling model as the initial model. The sequence labeling model may be BERT (Bidirectional Encoder Representations from Transformers), BiLSTM+CRF (Bidirectional Long Short-Term Memory + Conditional Random Field), or the like.

The sample text is input into the sequence labeling model, which predicts the terms in the sample text that are likely to be erroneous; the terms it outputs are recorded as the error entity prediction result.

The error entity prediction result is then compared with the pre-annotated error entity label to calculate a loss, parameter iteration is performed on the sequence labeling model based on the loss, and the sequence labeling model obtained after parameter iteration is used as the error entity recognition model for recognizing erroneous terms.

Optionally, after model training is completed, the sequence labeling model can be evaluated and verified, and the evaluated and verified sequence labeling model is used as the error entity recognition model.
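The description does not say which metric the evaluation uses. As one possible sketch, the snippet below computes entity-level precision, recall, and F1 between the labelled and the predicted tag sequences, assuming both use BIO tags; the tag names and function names are hypothetical.

```python
from typing import List, Set, Tuple


def bio_to_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Decode a BIO tag sequence into a set of [start, end) spans."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B") or tag == "O":
            if start is not None:
                spans.add((start, i))
                start = None
            if tag.startswith("B"):
                start = i
        elif tag.startswith("I") and start is None:
            start = i  # tolerate an I tag with no preceding B
    return spans


def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 with exact span matching."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Exact span matching is a strict choice: a prediction that covers only part of an error term counts as a miss.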

Based on any of the above embodiments, step 210 includes:

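acquiring an original text of the target field;

and replacing correct terms in the original text with error terms based on the term library in the target field, obtaining the sample text, and determining error entity labels of the sample text based on the error terms used for replacement.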

Specifically, to construct samples for training the sequence labeling model, so that the trained model is able to extract possibly erroneous terms from text in the target field, original text in the target field may be collected first.

Here, the original text of the target field may be sentences collected from sources such as papers and textbooks in the target field.

After the original text is collected, the correct terms in it can be replaced with error terms to construct the sample text. Because the error terms introduced during construction are known, the error entity labels do not need to be annotated manually and can be generated automatically from the introduced error terms.

The sample text can be constructed based on the term library of the target field. That is, the individual word segments of the original text may be matched against the terms in the term library to locate the correct terms in the original text. Each located correct term can then be replaced with an error term, for example by applying homophone rules or similar-shape character rules, or by arbitrarily adding or deleting a character, to obtain the sample text. For example, by homophone rules, "penicillin" may be replaced with "qingmycin" or with "green plum".

On this basis, the error terms in the sample text can be marked using the BIO (Beginning, Inside, Outside) tagging scheme to form the error entity labels for subsequent model training.
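The sample construction described above can be sketched as follows. This is a simplified illustration under stated assumptions: correct terms are located by plain substring matching rather than by word segmentation, tags are assigned per character, and the tag names `B-ERR` and `I-ERR` as well as the function name are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def build_sample(original: str, term_library: Dict[str, List[str]],
                 rng: random.Random) -> Tuple[str, List[str]]:
    """Swap each correct term found in `original` for one of its error
    variants and emit character-level BIO tags for the swapped spans.

    term_library maps a correct term to its (non-empty) error variants.
    """
    chars: List[str] = []
    tags: List[str] = []
    # Try longer terms first so a short term cannot shadow a longer one.
    terms = sorted(term_library, key=len, reverse=True)
    i = 0
    while i < len(original):
        for term in terms:
            if term_library[term] and original.startswith(term, i):
                wrong = rng.choice(term_library[term])
                chars.extend(wrong)
                # The label comes for free: we know what we inserted.
                tags.extend(["B-ERR"] + ["I-ERR"] * (len(wrong) - 1))
                i += len(term)
                break
        else:
            # No term starts here: copy the character, tag it Outside.
            chars.append(original[i])
            tags.append("O")
            i += 1
    return "".join(chars), tags
```

Passing in a seeded `random.Random` keeps the generated dataset reproducible. In practice one would probably also keep some sentences unmodified so that the model sees correct terms as negative examples; that choice is not specified in the description.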

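Based on any of the foregoing embodiments, in step 130, the matching of the error entity with the term library in the target field includes:

matching the error entity with candidate error terms in the term library;

and in the case of matching to a candidate error term, taking the correct term corresponding to the matched candidate error term in the term library as the correct term matched with the error entity.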

Specifically, the term library may store not only terms under the target field, but also common error terms corresponding to the terms, which may be denoted as candidate error terms herein.

After the error entity in the text to be corrected has been obtained from the error entity recognition model, it may be matched against the candidate error terms in the term library, for example by character matching. If a candidate error term is matched, that is, the term library contains a candidate error term identical to the error entity, the correct term that the term library associates with that candidate error term is taken as the correct term matched with the error entity.
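A minimal sketch of this candidate error term lookup, assuming the term library is held as a mapping from each correct term to its list of candidate error terms (the data layout and function names are assumptions made for illustration):

```python
from typing import Dict, List, Optional


def build_error_index(term_library: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert {correct: [errors, ...]} into {error: correct} for direct lookup."""
    index: Dict[str, str] = {}
    for correct, errors in term_library.items():
        for wrong in errors:
            # If two correct terms share an error variant, keep the first.
            index.setdefault(wrong, correct)
    return index


def match_candidate_error(error_entity: str,
                          index: Dict[str, str]) -> Optional[str]:
    """Exact character match against the candidate error terms."""
    return index.get(error_entity)
```

Building the inverted index once turns each lookup into a single dictionary access; a `None` result signals that a fallback matching mode is needed.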

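Based on any of the foregoing embodiments, in step 130, the matching of the error entity with the candidate error terms in the term library further includes:

in the case that no candidate error term is matched, converting the error entity into a first pinyin sequence, and matching the first pinyin sequence with a second pinyin sequence of each correct term in the term library;

and in the case of matching to a second pinyin sequence, taking the correct term corresponding to that second pinyin sequence as the correct term matched with the error entity.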

Specifically, when the error entity is matched against the candidate error terms in the term library, those candidate error terms reflect only some common term errors and cannot exhaust all possible ones, so the error entity may fail to match any candidate error term in the term library.

For such cases, matching may be based on homophone rules. Specifically, the error entity may be converted to pinyin, referred to here as the first pinyin sequence, and the first pinyin sequence may be matched against the second pinyin sequence of each correct term in the term library. The second pinyin sequence is the pinyin sequence of a correct term. Both sequences are the representation obtained by converting text into pinyin; they differ only in that the first belongs to the error entity and the second to a correct term in the term library.

The first and second pinyin sequences are matched by character matching between the pinyin sequences. A match with a second pinyin sequence means that the term library contains a correct term whose pronunciation is consistent with that of the error entity. In that case, the correct term corresponding to the matched second pinyin sequence is taken as the correct term matched with the error entity.

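Based on any of the above embodiments, the converting of the error entity into the first pinyin sequence includes:

converting the error entity into a pinyin sequence, and performing fuzzy processing on the pinyin sequence of the error entity to obtain the first pinyin sequence;

the second pinyin sequence is obtained by converting the correct term into a pinyin sequence and performing fuzzy processing on that sequence.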

Specifically, obtaining the first pinyin sequence requires not only converting the error entity into a pinyin sequence but also applying fuzzy processing to the converted sequence; the fuzzified pinyin sequence is used as the first pinyin sequence. Fuzzy processing blurs the pronunciation features of the pinyin sequence. It may, for example, merge flat-tongue and retroflex initials or front and back nasal finals, such as replacing zh with z and ang with an.

Similarly, obtaining the second pinyin sequence involves converting the correct term into a pinyin sequence and then applying fuzzy processing to the converted sequence; the fuzzified pinyin sequence is used as the second pinyin sequence.

Because both the first and the second pinyin sequence have been fuzzified, matching them ignores subtle differences in pronunciation and is more fault-tolerant, so correct terms are matched from the term library more easily.
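The fuzzy pinyin matching can be sketched as below. Several things here are assumptions made for illustration: the tiny `PINYIN` table stands in for a full character-to-pinyin dictionary, the Chinese example terms are chosen for this sketch and do not come from the description, and only the zh to z and ang to an rules are named in the text (the ch, sh, eng, and ing rules are analogous extensions).

```python
from typing import Iterable, Optional, Tuple

# Hypothetical toy table; a real system would use a full pinyin dictionary.
PINYIN = {"阑": "lan", "狼": "lang", "尾": "wei", "炎": "yan",
          "青": "qing", "氢": "qing", "霉": "mei", "素": "su"}


def fuzzify(syllable: str) -> str:
    """Blur one toneless syllable: retroflex -> flat initial, back -> front nasal."""
    for retroflex, flat in (("zh", "z"), ("ch", "c"), ("sh", "s")):
        if syllable.startswith(retroflex):
            syllable = flat + syllable[len(retroflex):]
            break
    for back, front in (("ang", "an"), ("eng", "en"), ("ing", "in")):
        if syllable.endswith(back):
            syllable = syllable[:-len(back)] + front
            break
    return syllable


def fuzzy_pinyin(text: str) -> Optional[Tuple[str, ...]]:
    """Fuzzified pinyin sequence of `text`, or None if a character is unknown."""
    syllables = []
    for ch in text:
        if ch not in PINYIN:
            return None
        syllables.append(fuzzify(PINYIN[ch]))
    return tuple(syllables)


def match_by_pinyin(error_entity: str,
                    correct_terms: Iterable[str]) -> Optional[str]:
    """Return the first correct term whose fuzzified pinyin equals the entity's."""
    first = fuzzy_pinyin(error_entity)  # the "first pinyin sequence"
    if first is None:
        return None
    for term in correct_terms:
        if fuzzy_pinyin(term) == first:  # the "second pinyin sequence"
            return term
    return None
```

In a real system the second pinyin sequences would presumably be precomputed once per term library and stored in a dictionary keyed by the fuzzified sequence, rather than recomputed for every lookup.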

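Based on any of the foregoing embodiments, in step 130, the matching of the first pinyin sequence with the second pinyin sequence of each correct term in the term library further includes:

in the case that no second pinyin sequence is matched, performing similar-shape character replacement on the error entity to obtain a similar-shape entity, and matching the similar-shape entity with the correct terms in the term library;

and in the case of matching to a correct term, taking the matched correct term as the correct term matched with the error entity.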

Specifically, the error in an error entity is not necessarily a homophone error; it may also be caused by writing a visually similar character by mistake. Therefore, when no second pinyin sequence is matched, matching can be performed based on similar-shape character rules. One or more characters in the error entity are replaced with their similar-shape characters, and the entity after replacement is referred to as a similar-shape entity. The similar-shape entity is then matched against the correct terms in the term library, for example by character matching.

If a correct term is matched, that is, the term library contains a correct term identical to the similar-shape entity, the matched correct term is taken as the correct term matched with the error entity.
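A minimal sketch of similar-shape character matching follows. The `SIMILAR` table and the example term are illustrative assumptions, not taken from the description; a real system would load a full confusion set of visually similar characters.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, List, Optional

# Hypothetical toy table of visually similar characters.
SIMILAR: Dict[str, List[str]] = {"未": ["末"], "末": ["未"], "己": ["已", "巳"]}


def shape_variants(entity: str,
                   similar: Dict[str, List[str]]) -> Iterator[str]:
    """Yield every entity obtained by swapping one or more characters
    for a visually similar character."""
    options = [[ch] + similar.get(ch, []) for ch in entity]
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != entity:  # skip the unchanged original
            yield candidate


def match_by_shape(error_entity: str, correct_terms: Iterable[str],
                   similar: Dict[str, List[str]]) -> Optional[str]:
    """Return a correct term identical to some similar-shape variant."""
    terms = set(correct_terms)
    for candidate in shape_variants(error_entity, similar):
        if candidate in terms:
            return candidate
    return None
```

The number of variants grows multiplicatively with the number of replaceable characters, which stays manageable for short terms but would need a cap for long entities or large confusion sets.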

In addition, if no correct term matching the error entity is obtained by any of the above embodiments, the error entity may simply be output without being corrected.

Based on any of the above embodiments, FIG. 3 is a second schematic flow chart of the training method of the error entity recognition model provided by the present invention. As shown in FIG. 3, the method includes the following steps:

First, term library construction:

The professional terms of the target field may be collected to construct a term library of the target field. The term library may contain the correct terms as well as error terms reflecting common miswritings of each correct term, for example {'urticaria': ['hives', 'nettle', …]}.

Secondly, NER dataset construction:

Original text in the target field is collected, the correct terms in it are located by dictionary matching and replaced with error terms to obtain the sample text, and the error terms in the sample text are marked using the BIO tagging scheme to serve as the error entity labels of the sample text.

Then, model training:

A suitable sequence labeling model is selected as the NER model, trained on the sample texts and error entity labels constructed in the previous step, and then evaluated and verified to obtain the error entity recognition model.

Based on any of the above embodiments, FIG. 4 is a second schematic flow chart of the term error correction method provided by the present invention. As shown in FIG. 4, the method includes the following steps:

First, text input:

A piece of text to be corrected in the target field is taken as input and fed into the error entity recognition model.

Secondly, term extraction:

Based on the error entity recognition model, the error terms that may exist in the text to be corrected are recognized and recorded as error entities.

Next, matching by at least one of term library matching, homophone matching, and similar-shape character matching:

The correct term corresponding to the error entity is matched by at least one of term library matching, homophone matching, and similar-shape character matching. The three matching modes may be used as alternatives to one another, or, when one mode fails to obtain the corresponding correct term, another mode may be applied.

Term library matching refers to matching the error entity against the common error terms in the term library. Homophone matching refers to converting the error entity into pinyin and fuzzifying it, and matching the result against the correct terms in the term library that have likewise been converted into pinyin and fuzzified. Similar-shape character matching refers to replacing characters in the error entity with their similar-shape characters and matching the resulting entity against the correct terms in the term library.

Finally, term error correction:

After the correct term corresponding to each error entity has been obtained by matching, the correct term is used to replace the error entity in the text to be corrected, thereby correcting the text.
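The flow above can be sketched as a cascade of matchers followed by replacement. This is an illustrative sketch only: it assumes the error entity recognition model has already produced character spans for the error entities, and it treats each matching mode as an opaque callable that returns a correct term or `None`; all names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Matcher = Callable[[str], Optional[str]]
Span = Tuple[int, int]  # [start, end) character offsets of an error entity


def correct_text(text: str, error_spans: Sequence[Span],
                 matchers: Sequence[Matcher]) -> str:
    """Replace each error entity with the first correct term that any
    matcher returns, trying the matchers in the given order."""
    # Work from right to left so earlier offsets stay valid after a
    # replacement changes the length of the text.
    for start, end in sorted(error_spans, reverse=True):
        entity = text[start:end]
        for match in matchers:
            correct = match(entity)
            if correct is not None:
                text = text[:start] + correct + text[end:]
                break
        # If no matcher succeeds, the entity is left uncorrected.
    return text


# Stand-ins for the three matching modes (assumed, for illustration).
term_library_match = {"hives": "urticaria"}.get
homophone_match = {"qingmycin": "penicillin"}.get
corrected = correct_text("hives and qingmycin", [(0, 5), (10, 19)],
                         [term_library_match, homophone_match])
```

Ordering the matchers from the most specific (candidate error terms) to the most permissive (similar-shape replacement) mirrors the fallback order in the description.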

The method provided by the embodiment of the invention corrects term errors in text by combining NER recognition with term library matching, which addresses the low accuracy of seq2seq and similar models at term error correction in professional fields.

Based on any of the above embodiments, FIG. 5 is a schematic structural diagram of the term error correction apparatus provided by the present invention. As shown in FIG. 5, the apparatus includes:

a text obtaining unit 510, configured to determine a text to be corrected in the target field;

the entity recognition unit 520 is configured to input the text to be corrected into an error entity recognition model, so as to obtain an error entity in the text to be corrected output by the error entity recognition model;

and a term matching unit 530, configured to match the error entity with a term library in the target domain, and correct the text to be corrected based on the correct term matched with the error entity.

The apparatus provided by the embodiment of the invention combines the error entity recognition model with term library matching to perform term error correction specific to the target field, which keeps the correction domain-specific and improves its accuracy.

Based on any of the above embodiments, the apparatus further comprises a model training unit configured to:

Based on any of the above embodiments, the model training unit is specifically configured to:

acquiring an original text of the target field;

Based on any of the above embodiments, the term matching unit is specifically used to:

Based on any of the above embodiments, the term matching unit is further used to:

FIG. 6 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with one another via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the term error correction method, which includes: determining a text to be corrected in the target field; inputting the text to be corrected into an error entity recognition model to obtain an error entity in the text to be corrected, as output by the error entity recognition model; and matching the error entity with a term library in the target field, and correcting the text to be corrected based on the correct term matched with the error entity.

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, is capable of performing the term error correction method provided by the methods as described above, the method comprising: determining a text to be corrected in the target field; inputting the text to be corrected into an error entity recognition model to obtain an error entity in the text to be corrected, which is output by the error entity recognition model; and matching the error entity with a term library in the target field, and correcting the text to be corrected based on the correct term matched with the error entity.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the term error correction method provided above, the method comprising: determining a text to be corrected in the target field; inputting the text to be corrected into an error entity recognition model to obtain an error entity in the text to be corrected, as output by the error entity recognition model; and matching the error entity with a term library in the target field, and correcting the text to be corrected based on the correct term matched with the error entity.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A term error correction method, comprising:

determining a text to be corrected in the target field;

2. The term error correction method of claim 1, wherein the training step of the error entity recognition model comprises:

3. The term error correction method of claim 2, wherein the obtaining the sample text in the target field and the error entity tag of the sample text comprises:

acquiring an original text of the target field;

4. The term error correction method of any one of claims 1 to 3, wherein said matching said error entity with a term library in said target field comprises:

5. The method of claim 4, wherein said matching said error entity with candidate erroneous terms in said term library further comprises:

6. The method of claim 5, wherein said converting said error entity into a first pinyin sequence comprises:

7. The method of claim 5, wherein the matching the first pinyin sequence with the second pinyin sequence for each correct term in the term library further comprises:

8. A term error correction device comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the term error correction method according to any of claims 1 to 7 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the term error correction method according to any of claims 1 to 7.