CN115455948A - Spelling error correction model training method, spelling error correction method and storage medium


Info

Publication number
CN115455948A
CN115455948A
Authority
CN
China
Prior art keywords
domain
error correction
correction model
spelling error
text
Prior art date
Legal status: Pending
Application number
CN202211415838.8A
Other languages
Chinese (zh)
Inventor
马永亮
甘子发
周明
Current Assignee
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd
Priority to CN202211415838.8A
Publication of CN115455948A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention relates to the field of spelling error correction technology, and in particular to a spelling error correction model training method, a spelling error correction method, and a storage medium. The spelling error correction model training method comprises the following steps: acquiring a keyword dictionary containing domain knowledge and, for each keyword, adding context containing domain knowledge to the keyword using domain search paradigms summarized in advance, to obtain extended domain texts; replacing characters in the extended domain texts using a preset confusion set according to a preset replacement rule, to obtain domain spelling error correction data; and training a spelling error correction model based on the domain spelling error correction data. The invention expands context using a domain dictionary and a small number of domain knowledge paradigms to obtain a text corpus containing rich domain knowledge. When synthesizing data, spelling error correction data that better fits the domain search scenario can be obtained by replacing only part of the characters in the text corpus according to certain rules, so that the trained model has better domain adaptability.

Description

Spelling error correction model training method, spelling error correction method and storage medium
Technical Field
The present invention relates to the field of spelling error correction technologies, and in particular, to a spelling error correction model training method, a spelling error correction method, and a storage medium.
Background
Chinese search spelling error correction studies how to detect and correct Chinese spelling errors in the query input of a search engine and return the correct query. In a search engine, the user wants to obtain high-quality web pages or documents related to the query input, but for various reasons the query entered by the user may be of low quality or simply wrong, which can lead to recalling wrong results, or few or no results; in that case the search engine needs to correct the query in order to improve the user experience. Existing Chinese search spelling error correction schemes usually construct corresponding index data from keywords. During error correction, the query is first segmented; correction candidates are obtained from the index data according to similar pinyin, edit distance, the user's search history and so on for each segment; the original segment is then replaced by the candidates, the candidates are scored with an n-gram model, a pre-trained model and the like, and a final result is selected. For spelling error correction in a domain search engine, the traditional scheme uses a variety of strategies to correct a query, so the whole process is complicated, and long texts, or complex spelling errors within them, are difficult to handle. Moreover, when knowledge in the domain is updated, related index data must be rebuilt and the corresponding candidate scoring models retrained, so domain adaptability is poor.
Disclosure of Invention
In order to solve the problem that the existing spelling error correction model is difficult to adapt to the change of field requirements, the invention provides a spelling error correction model training method, a spelling error correction method and a storage medium.
In order to solve the technical problems, the invention provides the following technical scheme: a spelling error correction model training method comprises the following steps:
acquiring a keyword dictionary containing domain knowledge, and for each keyword, adding context containing the domain knowledge to the keyword by using a domain search paradigm summarized in advance to obtain an extended domain text;
replacing characters in the extended domain text using a preset confusion set according to a preset replacement rule to obtain domain spelling error correction data;
a spell correction model is trained based on the domain spell correction data.
Preferably, the spell correction model employs a Soft-Masked BERT spell correction model.
Preferably, the confusion set contains each character and its corresponding homophones, near-phonetic characters and near-form characters.
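As an illustrative sketch (the specific entries below are assumptions for illustration, not taken from the patent), such a confusion set can be represented as a mapping from each character to its confusable characters:

```python
# Hypothetical confusion set: each character maps to the characters it is
# commonly confused with (homophones, near-phonetic and near-form characters).
CONFUSION_SET = {
    "司": ["思", "私", "式"],  # illustrative homophone / near-phonetic entries
    "股": ["古", "谷"],        # illustrative near-phonetic entries
}

def confusable(char):
    """Return the confusable candidates for `char` (empty list if none)."""
    return CONFUSION_SET.get(char, [])
```

A lookup for a character with no entry simply yields no candidates, so such characters are never replaced during data synthesis.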
Preferably, replacing the characters in the text using the preset confusion set according to the preset replacement rule comprises the following steps:
when the preset replacement rule is domain knowledge replacement, replacing the domain knowledge in the extended domain text; and/or
when the preset replacement rule is character replacement, randomly replacing characters of the extended domain text using the confusion set.
Preferably, before randomly replacing the characters of the extended domain text using the confusion set, character frequencies are counted over all the extended domain texts, the characters in the confusion set are sorted accordingly, and the characters in the extended domain text are replaced according to the sorted confusion set.
Preferably, when the characters in the extended domain text are replaced, at most 15% of the characters are selected for replacement.
Preferably, after the context containing domain knowledge is added to the keywords by using the domain paradigm summarized in advance, the method further includes the following steps:
acquiring related domain texts on the network by using the keywords and a pre-summarized domain paradigm;
and these related domain texts are also used as extended domain texts.
Preferably, training the spell correction model based on the domain spell correction data comprises the steps of:
inputting spelling error correction data into a spelling error correction model, regarding each character as a token, and converting each token into a corresponding number in a BERT dictionary;
the Embedding layer of the spelling error correction model converts each character in the spelling error correction data into a vector;
extracting the features of the vectors to obtain feature vectors, inputting the feature vectors into a classifier, and converting the feature vectors into vectors with dimensions the same as those of a BERT dictionary to serve as final output vectors;
and finally, converting the output vector into a prediction result.
In order to solve the above technical problems, the present invention provides another technical solution as follows: a method of spell correction comprising the steps of:
obtaining a spelling error correction model, wherein the spelling error correction model is obtained by adopting the training method of the spelling error correction model;
inputting the text to be corrected into the spell correction model, and outputting the corrected text by the spell correction model.
In order to solve the above technical problems, the present invention provides another technical solution as follows: a computer storage medium having stored thereon a computer program which, when executed, implements the steps of a method of spell correction as previously described.
Compared with the prior art, the spelling error correction model training method, the spelling error correction method and the storage medium provided by the invention have the following beneficial effects:
1. The spelling error correction model training method provided by the embodiment of the invention expands context using a domain dictionary and a small number of domain knowledge paradigms to obtain a text corpus containing rich domain knowledge. In addition, during data synthesis, spelling error correction data that better fits the domain search scenario can be obtained by replacing only part of the characters in the text corpus according to certain rules, for example by replacing only the domain knowledge in the corpus. After synthesizing the corpus text, related domain texts are retrieved from the network using the keywords, enriching the domain corpus so that it matches the complex scenarios of domain search. The trained model therefore has better domain adaptability, and because no accumulated user search history is required, the model also has a better cold-start effect.
2. According to the spelling error correction model training method provided by the embodiment of the invention, a Soft-Masked BERT spelling error correction model is adopted, compared with the traditional error correction scheme, the model can process more complex spelling errors due to strong semantic representation capability, and the model is an end-to-end model integrating detection and correction, has no additional input and output in the middle, can be conveniently and quickly iterated, and can be better adapted to the change and update of domain knowledge.
3. According to the spelling error correction model training method provided by the embodiment of the invention, the different preset replacement rules can be used for obtaining the field spelling error correction data with different complexity degrees, and only the field knowledge can be replaced, so that the model is more concerned about the error correction of the field knowledge; the confusion set can also be sorted according to the word frequency counted from the corpus, and more common confusion words in the field can be selected for replacement, so that the input method is more consistent with the real input scene.
4. The embodiment of the present invention further provides a spelling error correction method, which has the same beneficial effects as the spelling error correction model obtained by training with the spelling error correction model training method, and details are not repeated herein.
5. The embodiment of the present invention further provides a computer storage medium, which has the same beneficial effects as the spelling error correction method described above, and is not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating steps of a method for training a spell correction model according to a first embodiment of the present invention.
Fig. 2 is a flowchart illustrating steps after step S1 of a method for training a spell correction model according to a first embodiment of the invention.
Fig. 3 is a flowchart illustrating the step S2 of the method for training the spell correction model according to the first embodiment of the present invention.
Fig. 4 is a flowchart illustrating a step S3 of a method for training a spell correction model according to a first embodiment of the invention.
Fig. 5 is a flowchart illustrating a method for training a spell correction model according to a first embodiment of the present invention.
Fig. 6 is a flowchart illustrating steps of a method for spell correction according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and implementation examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, a first embodiment of the present invention provides a method for training a spell correction model, including the following steps:
s1: acquiring a keyword dictionary containing domain knowledge, and for each keyword, adding context containing the domain knowledge to the keyword by using a domain paradigm summarized in advance to obtain an extended domain text;
s2: replacing characters in the extended domain text using a preset confusion set according to a preset replacement rule to obtain domain spelling error correction data;
s3: a spell correction model is trained based on the domain spell correction data.
The keywords containing domain knowledge can be entity names of a certain domain, such as company names and person names in the financial domain. The added context containing domain knowledge can be terms of the financial domain, such as profits and annual reports; these professional terms carry financial domain knowledge, and the aim is to let the model learn to recognize knowledge related to the domain. Since the context is manually summarized domain knowledge and the keywords are domain keywords, combining the two constructs a domain text.
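A minimal sketch of this expansion step, assuming a few hand-written search paradigms with a `{kw}` slot for the keyword (the paradigm strings and the keyword below are illustrative assumptions, not taken from the patent):

```python
# Illustrative domain search paradigms; "{kw}" marks the keyword slot.
PARADIGMS = [
    "{kw} annual report",
    "major shareholders of {kw} reduce their holdings",
    "profit forecast for {kw}",
]

def expand_keyword(keyword, paradigms=PARADIGMS):
    """Wrap a domain keyword in each search paradigm, yielding
    extended domain texts that embed the keyword in domain context."""
    return [p.format(kw=keyword) for p in paradigms]

texts = expand_keyword("Beijing Lanzhou Technology")
```

Each keyword in the dictionary is run through every paradigm, so a small number of paradigms multiplied by a large keyword dictionary yields a sizable corpus of extended domain texts.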
Specifically, referring to fig. 2, after the context containing domain knowledge is added to the keywords by using the domain paradigm summarized in advance, the method further includes the following steps:
s11: acquiring related domain texts on the network by using keywords and a domain paradigm summarized in advance;
s12: and these related domain texts are also used as extended domain texts.
Continuing to use the keywords to acquire related domain texts from the network merges new domain knowledge into the corpus, so that the corpus contains rich semantic knowledge and matches the complex scenarios of domain search.
Specifically, in the present embodiment, the confusion set includes each character and its corresponding homophone, near-consonant character and near-font character.
Referring to fig. 3, the step S2 of replacing characters in the extended domain text using the preset confusion set according to the preset replacement rule includes the following steps:
S21: when the preset replacement rule is domain knowledge replacement, replacing the domain knowledge in the extended domain text; and/or
S22: when the preset replacement rule is character replacement, randomly replacing characters of the extended domain text using the confusion set.
According to different preset replacement rules, domain spelling error correction data of different complexity can be obtained. For example, replacing only the characters that represent domain knowledge in the extended domain text makes the model pay more attention to correcting domain knowledge; alternatively, characters can be replaced at random using the confusion set, so that some characters are replaced by their corresponding homophones, near-phonetic characters or near-form characters.
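A sketch of the first rule, replacing only characters inside domain-knowledge terms; the confusion set, the term list and the sample text are illustrative assumptions:

```python
import random

# Illustrative confusion set for the characters of the term "公司" (company).
CONFUSION = {"公": ["工", "功"], "司": ["式", "思"]}

def replace_domain_knowledge(text, knowledge_terms, confusion,
                             rng=random.Random(0)):
    """Corrupt only characters that belong to domain-knowledge terms,
    leaving the rest of the extended domain text untouched."""
    chars = list(text)
    for term in knowledge_terms:
        start = text.find(term)
        if start == -1:
            continue
        offset = rng.randrange(len(term))  # pick one character of the term
        candidates = confusion.get(term[offset])
        if candidates:
            chars[start + offset] = rng.choice(candidates)
    return "".join(chars)

corrupted = replace_domain_knowledge("公司年报", ["公司"], CONFUSION)
```

Because only the span covered by a knowledge term is eligible for corruption, the resulting errors concentrate exactly where the model should learn to focus its correction.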
Before randomly replacing characters of the extended domain text using the confusion set, character frequencies are counted over all the extended domain texts and the characters in the confusion set are sorted so that characters with higher frequency come first. Characters in the extended domain text are then replaced according to the sorted confusion set, so that the more common confusable characters of the domain are chosen for replacement. The replacement character is selected according to its rank in the confusion set: for example, when a character ranks ahead of the others in the confusion set, the original character is replaced by that corresponding homophone, near-phonetic or near-form character, which better matches real input scenarios.
Specifically, in the present embodiment, when characters in the extended domain text are replaced, at most 15% of the characters are selected for replacement.
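The random-replacement rule, together with the frequency sorting of the confusion set and the 15% cap, can be sketched as follows (the confusion set and corpus data are illustrative assumptions):

```python
import random
from collections import Counter

def sort_confusion_by_frequency(confusion, corpus_texts):
    """Sort each character's confusable candidates by how often they occur
    in the extended domain corpus, most frequent first."""
    freq = Counter("".join(corpus_texts))
    return {ch: sorted(cands, key=lambda c: -freq[c])
            for ch, cands in confusion.items()}

def randomly_corrupt(text, sorted_confusion, max_ratio=0.15,
                     rng=random.Random(0)):
    """Replace at most `max_ratio` of the characters, substituting each
    chosen character with its top-ranked (most common) confusable candidate."""
    chars = list(text)
    replaceable = [i for i, c in enumerate(chars) if c in sorted_confusion]
    if not replaceable:
        return text
    budget = min(len(replaceable), max(1, int(len(chars) * max_ratio)))
    for i in rng.sample(replaceable, budget):
        chars[i] = sorted_confusion[chars[i]][0]
    return "".join(chars)

ranked = sort_confusion_by_frequency({"a": ["x", "y"]}, ["y y y x"])
noisy = randomly_corrupt("a" * 20, ranked)
```

With a 20-character text the budget works out to three replacements, so exactly three positions receive the top-ranked confusable candidate and the rest stay intact.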
Specifically, referring to fig. 4 and 5, training the spell correction model based on the domain spell correction data includes the following steps:
s31: inputting spelling error correction data into a spelling error correction model, regarding each character as a token, and converting each token into a corresponding number in a BERT dictionary;
s32: the Embedding layer of the spelling error correction model converts each character in the spelling error correction data into a vector;
s33: extracting the features of the vectors to obtain feature vectors, inputting the feature vectors into a classifier, and converting the feature vectors into vectors with dimensions the same as those of a BERT dictionary to serve as final output vectors;
s34: and finally, converting the output vector into a prediction result.
Specifically, in the embodiment, the spelling error correction model is a Soft-Masked BERT model, and the model mainly comprises a detection module, a Soft-Masking module and a correction module.
When the Soft-Masked BERT model is trained, the detection model predicts the positions that may be misspelled and outputs an error probability for each position. The soft-masking module uses this probability to mix the input vector at that position with the vector of the [MASK] character: the higher the probability of a spelling error, the larger the [MASK] proportion. Because BERT learns during pre-training to restore a [MASK] character to a Chinese character, the soft-masking mechanism can exploit BERT's pre-training knowledge to strengthen the model's error detection and correction capability.
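The soft-masking step can be written as a convex combination of the input embedding and the [MASK] embedding, weighted by the detector's error probability: e_i' = p_i * e_mask + (1 - p_i) * e_i. A minimal pure-Python sketch (the embedding dimension and probabilities are illustrative):

```python
def soft_mask(embeddings, error_probs, mask_embedding):
    """Blend each token embedding with the [MASK] embedding in proportion
    to the detector's predicted error probability for that position."""
    return [
        [p * m + (1.0 - p) * e for e, m in zip(vec, mask_embedding)]
        for vec, p in zip(embeddings, error_probs)
    ]

# Two 2-dimensional token embeddings; position 0 is surely correct (p=0)
# and passes through unchanged, position 1 is surely an error (p=1) and
# is fully replaced by the [MASK] embedding.
blended = soft_mask([[1.0, 0.0], [0.0, 1.0]], [0.0, 1.0], [0.5, 0.5])
```

Intermediate probabilities interpolate smoothly between the two extremes, which is what lets the detector's uncertainty propagate into the correction network instead of forcing a hard mask/no-mask decision.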
Specifically, a text sequence is input, and each character of the text sequence is converted by a data-loading function into its number in the BERT dictionary and fed into the model. The Embedding layer of the model looks up the embedding vector of each character in the embedding matrix according to this number, converting the input character sequence into vectors. These vectors pass through the detection model and the multi-layer neural network of the error correction model to extract features; the error correction model outputs the feature vectors while the detection model predicts the positions that may be errors. The feature vectors are then fed into a classifier, which converts each feature vector into a vector whose dimension equals the size of the BERT dictionary, and this is taken as the final output vector of the model; for example, if the BERT dictionary contains 21128 tokens, the classifier's output vector has 21128 dimensions. The output vector at each position predicted to be an error is converted into a prediction result by taking the index of the dimension with the maximum score: for example, if the 100th dimension has the maximum score, the token numbered 100 is taken from the BERT dictionary as the model's corrected result for that position, while the tokens output at the other positions are the same as the input tokens.
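The final decoding step can be sketched with a tiny stand-in vocabulary (the real BERT dictionary has 21128 tokens): take the argmax over the vocabulary-sized output vector at each position predicted to be an error, and keep the input token everywhere else.

```python
# Tiny illustrative vocabulary standing in for the 21128-token BERT dictionary.
VOCAB = ["[PAD]", "公", "式", "司"]

def decode(position_logits, error_positions, input_tokens):
    """At each position flagged as an error, take the token whose score in
    the vocabulary-sized output vector is largest; keep other tokens as-is."""
    out = list(input_tokens)
    for pos in error_positions:
        scores = position_logits[pos]
        out[pos] = VOCAB[scores.index(max(scores))]
    return out

# Position 1 is flagged as an error; its highest score is at index 3 ("司"),
# so the input "式" is corrected while position 0 passes through unchanged.
corrected = decode([[0.0, 0.0, 0.0, 0.0], [0.0, 0.1, 0.2, 0.9]], [1], ["公", "式"])
```

Restricting the argmax to flagged positions is what the description means by the other positions outputting the same tokens as the input.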
The Soft-Masked BERT model dynamically combines the detection model with the correction model through the soft-masking module, so the pre-training knowledge contained in the BERT model can be better utilized and the impact of incorrect error detection on subsequent correction is reduced. Thanks to the strong semantic representation capability of the BERT pre-trained model, the model can handle complex spelling errors, such as inputs containing multiple spelling errors, compared with traditional search error correction schemes. Moreover, there is no extra input or output between the error detection module and the correction module, so the model is highly integrated, can be iterated quickly, and adapts well to changes and updates of domain knowledge.
In the training example shown in fig. 5, suppose the text input into the spelling error correction model is "公式大股东减持股份", in which "公式" ("formula") is a misspelling of "公司" ("company"). Through training, the spelling error correction model learns to correct the wrongly written characters, and the final corrected output is "公司大股东减持股份" ("the company's major shareholder reduces its shareholding").
Referring to fig. 6, a spelling error correction method according to a second embodiment of the present invention includes the following steps:
s100: obtaining a spelling error correction model, wherein the spelling error correction model is obtained by training with a spelling error correction model training method according to the first embodiment;
s200: inputting the text to be corrected into the spell correction model, and outputting the corrected text by the spell correction model.
It can be understood that, in the application scenario of a domain search engine, the query input by a user is usually a complex proper noun, or may be a long text containing rich semantics, and the query may contain multiple spelling errors.
The third embodiment of the present invention also provides a computer storage medium having a computer program stored thereon which, when executed, implements the steps of the spelling error correction method described above. It has the same beneficial effects as the spelling error correction method and is not described again here.
In the embodiments provided herein, it should be understood that "B corresponding to a" means that B is associated with a, from which B can be determined. It should also be understood that determining B from a does not mean determining B from a alone, but may also be determined from a and/or other information.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The flowchart and block diagrams in the figures of the present application illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The spelling error correction model training method, the spelling error correction method and the storage medium disclosed in the embodiments of the present invention are introduced in detail, and a specific example is applied in the present document to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and applications, and in view of the above, the content of the present specification should not be construed as a limitation to the present invention, and any modifications, equivalent substitutions and improvements made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A spelling error correction model training method is characterized in that: the method comprises the following steps:
acquiring a keyword dictionary containing domain knowledge, and for each keyword, adding context containing the domain knowledge to the keyword by using a domain paradigm summarized in advance to obtain an extended domain text;
replacing characters in the extended domain text using a preset confusion set according to a preset replacement rule to obtain domain spelling error correction data;
training a spell correction model based on the domain spell correction data.
2. The method of spell correction model training as recited in claim 1, wherein: the spelling error correction model adopts a Soft-Masked BERT spelling error correction model.
3. The method of spell correction model training as recited in claim 1, wherein: the confusion set comprises each character and the corresponding homophone character, near-pronunciation character and near-shape character.
4. The spelling error correction model training method as recited in claim 1, wherein replacing characters in the text with the preset confusion set according to the preset replacement rules comprises the following steps:
when the preset replacement rule is domain-knowledge replacement, replacing the domain knowledge in the extended domain text; and/or
when the preset replacement rule is character replacement, randomly replacing characters in the extended domain text using the confusion set.
5. The spelling error correction model training method as recited in claim 4, wherein, before characters in the extended domain text are randomly replaced using the confusion set, character frequencies are counted over all extended domain texts, the characters in the confusion set are sorted by frequency, and the characters in the extended domain text are replaced according to the sorted confusion set.
6. The spelling error correction model training method as recited in claim 4, wherein, when characters in the extended domain text are replaced, at most 15% of the characters are selected for replacement.
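Claims 4 to 6 describe character-replacement corruption: substitute characters from a confusion set, optionally ordered by corpus frequency, while touching at most 15% of the text. The following is a minimal sketch under the assumption of a toy Latin-letter confusion set; real confusion sets map Chinese characters to homophones and similar-shaped characters:

```python
import random
from collections import Counter

# Toy confusion set (claim 3 analogue): each character maps to confusable
# variants. Real sets pair Chinese characters with homophones and near-shapes.
CONFUSION = {"a": ["e", "o"], "i": ["l", "y"], "o": ["0", "a"]}

def sort_confusion_by_frequency(confusion, corpus):
    """Order each variant list by how often the variant occurs in the domain
    corpus (claim 5), so more frequent confusions rank first."""
    freq = Counter("".join(corpus))
    return {c: sorted(vs, key=lambda v: -freq[v]) for c, vs in confusion.items()}

def corrupt(text, confusion, max_ratio=0.15, seed=0):
    """Randomly replace at most 15% of the characters (claims 4 and 6),
    returning a (noisy input, correct target) training pair."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c in confusion]
    budget = int(len(chars) * max_ratio)
    for i in rng.sample(candidates, min(budget, len(candidates))):
        chars[i] = confusion[chars[i]][0]  # take the top-ranked variant
    return "".join(chars), text

sorted_set = sort_confusion_by_frequency(CONFUSION, ["domain spelling data"])
noisy, target = corrupt("domain spelling correction", sorted_set)
```

The noisy/target pairs produced this way are exactly the domain spelling error correction data the model is trained on.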
7. The spelling error correction model training method as recited in claim 1, further comprising, after the context containing domain knowledge is added to the keywords using the pre-summarized domain paradigms, the following steps:
acquiring related domain texts from the Internet using the keywords and the pre-summarized domain paradigms;
using these related domain texts as extended domain texts as well.
8. The spelling error correction model training method as recited in claim 1, wherein training the spelling error correction model on the domain spelling error correction data comprises the following steps:
inputting the spelling error correction data into the spelling error correction model, treating each character as a token, and converting each token into its corresponding index in the BERT dictionary;
converting, by the Embedding layer of the spelling error correction model, each character in the spelling error correction data into a vector;
extracting features from the vectors to obtain feature vectors, inputting the feature vectors into a classifier, and converting them into vectors whose dimension equals the size of the BERT dictionary, as the final output vectors;
finally, converting the output vectors into prediction results.
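The first and last steps of claim 8 can be sketched with a toy character vocabulary standing in for the BERT dictionary; a real system would use the pretrained model's tokenizer and run the Embedding, encoder, and classifier layers in between:

```python
# Toy stand-in for the BERT dictionary: maps character tokens to integer ids.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "c": 2, "a": 3, "t": 4, "r": 5}
INV_VOCAB = {i: ch for ch, i in VOCAB.items()}

def encode(text, vocab=VOCAB):
    """Treat each character as a token and map it to its dictionary index."""
    return [vocab.get(ch, vocab["[UNK]"]) for ch in text]

def decode(output_vectors, inv_vocab=INV_VOCAB):
    """Convert each vocabulary-sized output vector to a predicted character
    by taking the argmax index (the final step of claim 8)."""
    preds = []
    for vec in output_vectors:
        idx = max(range(len(vec)), key=lambda i: vec[i])
        preds.append(inv_vocab[idx])
    return "".join(preds)

ids = encode("cat")            # [2, 3, 4]
logits = [[0, 0, 9, 1, 1, 1],  # highest score at index 2 -> "c"
          [0, 0, 1, 9, 1, 1],  # index 3 -> "a"
          [0, 0, 1, 1, 1, 9]]  # index 5 -> "r"
print(decode(logits))          # prints "car"
```

Here "cat" corrupted to "car"-style logits illustrates how each output vector over the dictionary is collapsed to one predicted character per input token.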
9. A spelling error correction method, characterized by comprising the following steps:
obtaining a spelling error correction model, wherein the spelling error correction model is trained with the spelling error correction model training method according to any one of claims 1-8;
inputting the text to be corrected into the spelling error correction model, which outputs the corrected text.
10. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed, performs the steps of the spelling error correction method as claimed in claim 9.
CN202211415838.8A 2022-11-11 2022-11-11 Spelling error correction model training method, spelling error correction method and storage medium Pending CN115455948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211415838.8A CN115455948A (en) 2022-11-11 2022-11-11 Spelling error correction model training method, spelling error correction method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211415838.8A CN115455948A (en) 2022-11-11 2022-11-11 Spelling error correction model training method, spelling error correction method and storage medium

Publications (1)

Publication Number Publication Date
CN115455948A true CN115455948A (en) 2022-12-09

Family

ID=84295688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211415838.8A Pending CN115455948A (en) 2022-11-11 2022-11-11 Spelling error correction model training method, spelling error correction method and storage medium

Country Status (1)

Country Link
CN (1) CN115455948A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133813A (en) * 2024-05-08 2024-06-04 北京澜舟科技有限公司 Training method of Chinese spelling error correction model and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100180198A1 (en) * 2007-09-24 2010-07-15 Robert Iakobashvili Method and system for spell checking
CN110232129A (en) * 2019-06-11 2019-09-13 北京百度网讯科技有限公司 Scene error correction method, device, equipment and storage medium
CN111523305A (en) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, device and system
CN111859092A (en) * 2020-07-29 2020-10-30 苏州思必驰信息科技有限公司 Text corpus amplification method and device, electronic equipment and storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium


Similar Documents

Publication Publication Date Title
Cotterell et al. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
US11182435B2 (en) Model generation device, text search device, model generation method, text search method, data structure, and program
US5715469A (en) Method and apparatus for detecting error strings in a text
CN101133411B (en) Fault-tolerant romanized input method for non-roman characters
EP2153352B1 (en) Recognition architecture for generating asian characters
Pedler Computer correction of real-word spelling errors in dyslexic text
CN106202153A (en) The spelling error correction method of a kind of ES search engine and system
Lee et al. Deep learning-based context-sensitive spelling typing error correction
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN113449514B (en) Text error correction method and device suitable for vertical field
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
CN114970503A (en) Word pronunciation and font knowledge enhancement Chinese spelling correction method based on pre-training
Gillani et al. Simple dynamic word embeddings for mapping perceptions in the public sphere
CN115455948A (en) Spelling error correction model training method, spelling error correction method and storage medium
CN112446217B (en) Emotion analysis method and device and electronic equipment
Tedla et al. Analyzing word embeddings and improving POS tagger of tigrinya
KR100542757B1 (en) Automatic expansion Method and Device for Foreign language transliteration
KR102109858B1 (en) System and Method for Korean POS Tagging Using the Concatenation of Jamo and Syllable Embedding
JP2010128774A (en) Inherent expression extraction apparatus, and method and program for the same
Doush et al. Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction
CN111090720B (en) Hot word adding method and device
Laukaitis et al. Sentence level alignment of digitized books parallel corpora
Dymetman et al. Log-linear rnns: Towards recurrent neural networks with flexible prior knowledge
Hasan et al. SweetCoat-2D: Two-Dimensional Bangla Spelling Correction and Suggestion Using Levenshtein Edit Distance and String Matching Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221209