CN113076739A - Method and system for realizing cross-domain Chinese text error correction - Google Patents
Method and system for realizing cross-domain Chinese text error correction
- Publication number: CN113076739A
- Application number: CN202110383985.0A
- Authority: CN (China)
- Prior art keywords: sentence, error, text, model, error detection
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/232 — Handling natural language data; natural language analysis; orthographic correction, e.g. spell checking or vowelisation
- G06F40/289 — Handling natural language data; natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06N3/044 — Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
Abstract
The invention provides a method for realizing cross-domain Chinese text error correction, comprising the following steps: performing error detection with a sequence-labeling error detection model trained on general-domain supervised data; retrieving candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set; substituting the words in the error-replacement set for the error in turn, computing the perplexity of each substituted sentence with an RNN language model (rnnlm), and selecting the correct word from the error-replacement set according to the computed sentence perplexity, thereby completing the Chinese text correction. The invention thus provides a method for realizing cross-domain Chinese text error correction, namely an error detection → candidate recall → correction ranking model, which handles the correction of cross-domain text more generally.
Description
Technical Field
The invention relates to the field of text error correction, in particular to a method and a system for realizing cross-domain Chinese text error correction.
Background
In daily life, wrong characters frequently appear when users browse web pages or read official-account articles in social tools such as WeChat and Weibo, making the text ambiguous. Chinese text error correction is an important natural language processing technology for automatically checking and correcting Chinese sentences, aiming to improve the correctness of the language and the efficiency and value of text interaction. Mainstream text correction techniques fall into two categories: one is a pipeline that locates text errors through sequence learning and then corrects them by ranking candidates; the other is an end-to-end model based on NMT (neural machine translation) that maps incorrect input text directly to correct output text.
However, the former recall-and-rank approach is inefficient, and because its candidate set is finite, the corrected text has a limited range of application and may even introduce ambiguity. The latter end-to-end approach requires a large supervised training set, and its very high model complexity prevents it from being embedded as a basic module in many downstream applications; it is too inefficient.
Disclosure of Invention
The invention mainly aims to overcome the above defects in the prior art and provides a method for realizing cross-domain Chinese text error correction, namely an error detection → candidate recall → correction ranking model, which handles the correction of cross-domain text more generally: candidates are recalled and rescored with a language model obtained by deep-learning training, which lowers the perplexity of the recalled text, and the models are decoupled from one another, which improves efficiency.
The invention adopts the following technical scheme:
a method for realizing cross-domain Chinese text error correction comprises the following steps:
performing error detection with a sequence-labeling error detection model trained on general-domain supervised data;
retrieving candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set;
and substituting the words in the error-replacement set for the error in turn, computing the perplexity of each substituted sentence with an RNN language model (rnnlm), selecting the correct word from the error-replacement set according to the computed sentence perplexity, and completing the Chinese text correction.
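The three steps above can be sketched as a minimal detect → recall → rank loop. The function names (`detect_errors`, `recall_candidates`, `rank_and_correct`) and the detector, lexicon, and perplexity callables are hypothetical stand-ins for the patent's trained models, not its actual implementation:

```python
# Minimal sketch of the detect -> recall -> rank pipeline.
# detector, lexicon, and perplexity are stand-in components.

def detect_errors(sentence, detector):
    """Return character positions flagged as errors by the detection model."""
    return [i for i, ch in enumerate(sentence) if detector(ch)]

def recall_candidates(wrong_char, lexicon):
    """Return replacement candidates for a flagged character from a lexicon."""
    return lexicon.get(wrong_char, [])

def rank_and_correct(sentence, positions, lexicon, perplexity):
    """Try each candidate in place and keep the lowest-perplexity sentence."""
    best = sentence
    for pos in positions:
        for cand in recall_candidates(sentence[pos], lexicon):
            trial = best[:pos] + cand + best[pos + 1:]
            if perplexity(trial) < perplexity(best):
                best = trial
    return best

# Toy usage: flag 'x', recall 'b' as a candidate, score by counting 'x'.
positions = detect_errors("axc", lambda ch: ch == "x")
corrected = rank_and_correct("axc", positions, {"x": ["b"]},
                             lambda s: s.count("x"))
```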
Specifically, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which represents the text with a BERT pre-trained model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
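The CRF layer's optimal-path computation is a standard Viterbi decode over the Bi-LSTM's per-position emission scores and the tag-transition matrix. A toy-sized NumPy sketch (the function name and the toy scores are assumptions, not the patent's trained parameters):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag path.

    emissions:   (n, num_tags) scores from the Bi-LSTM for each position
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    """
    n, num_tags = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    backpointers = []
    for t in range(1, n):
        # score[i] + transitions[i, j] + emissions[t, j], maximised over i
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # Walk back from the best final tag
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]
```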
In an alternative embodiment, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text via skip-gram or CBOW and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
Specifically, before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises the following steps:
filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices;
reading the character-to-entity-label data in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of each sentence.
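The preprocessing described above can be sketched as follows. The `[CLS]`/`[SEP]` tokens follow BERT conventions; the filtering regex and function name are simplified stand-ins, not the patent's exact rules:

```python
import re

def preprocess(sentences):
    """Filter special characters, build a vocabulary, numericalize,
    and wrap each sentence with [CLS]/[SEP]."""
    # Keep word characters (incl. CJK); drop emoticons and special symbols.
    cleaned = [re.sub(r"[^\w\u4e00-\u9fff]", "", s) for s in sentences]
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3}
    for s in cleaned:
        for ch in s:
            vocab.setdefault(ch, len(vocab))
    encoded = [
        [vocab["[CLS]"]]
        + [vocab.get(ch, vocab["[UNK]"]) for ch in s]
        + [vocab["[SEP]"]]
        for s in cleaned
    ]
    return vocab, encoded
```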
Specifically, filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices further comprises:
processing the characters and the annotated entity labels into one-to-one correspondence, and building the pinyin dictionary with word segmentation.
Specifically, the words in the error-replacement set are substituted for the error in turn, and an RNN language model (rnnlm) computes the perplexity of each substituted sentence; the rnnlm comprises:
a representation layer, which represents the sentence by combining characters and words and vectorizes them with word2vec;
an RNN layer, a recurrent neural network that models the text as a sequence and learns the word order of the sentence, the output of each hidden state depending on the current input and the output of the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
Specifically, the perplexity is calculated as
PP(S) = P(w_1 w_2 \cdots w_N)^{-1/N},
where S denotes a sentence, w_i denotes the i-th word of the sentence, i = 1, 2, …, N, and N is the number of words in the sentence.
An aspect of an embodiment of the present invention further provides a system for implementing cross-domain Chinese text error correction, comprising:
an error detection module, which performs error detection with a sequence-labeling error detection model trained on general-domain supervised data;
an error recall module, which recalls candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set;
and a correction ranking module, which substitutes the words in the error-replacement set for the error in turn, computes the perplexity of each substituted sentence with an RNN language model (rnnlm), selects the correct word from the error-replacement set according to the computed sentence perplexity, and completes the Chinese text correction.
In another aspect, an apparatus is further provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above cross-domain Chinese text error correction method when executing the computer program.
Still another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above cross-domain Chinese text error correction method.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
(1) The invention provides a method for realizing cross-domain Chinese text error correction: error detection is performed with a sequence-labeling error detection model trained on general-domain supervised data; candidates for each detected error are retrieved from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, yielding an error-replacement set; the words in the error-replacement set are substituted for the error in turn, an RNN language model (rnnlm) computes the perplexity of each substituted sentence, and the correct word is selected from the error-replacement set according to the computed perplexity, completing the Chinese text correction. The invention thus provides an error detection → candidate recall → correction ranking model that handles the correction of cross-domain text more comprehensively; recalled candidates are rescored with a deep-learning-trained language model, lowering the perplexity of the recalled text, and the models are decoupled from one another, which improves efficiency.
(2) By performing error detection with a sequence-labeling error detection model trained on general-domain supervised data, the invention can correct erroneous text in different fields, realizing cross-domain text correction.
Drawings
FIG. 1 is a flowchart of a method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
FIG. 3 is a block diagram of a system for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
The invention is described in further detail below with reference to the figures and specific examples.
Detailed Description
The embodiment of the invention provides a method for realizing cross-domain Chinese text error correction, namely an error detection → candidate recall → correction ranking model, which handles the correction of cross-domain text more generally: candidates are recalled and rescored with a language model obtained by deep-learning training, which lowers the perplexity of the recalled text, and the models are decoupled from one another, which improves efficiency.
As shown in fig. 1, the method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention comprises the following steps:
s101: carrying out error detection by combining an error detection model labeled by a sequence with a supervision data training model in the general field;
specifically, the error detection is performed by combining a sequence-labeled error detection model with a universal-field supervision data training model, and the sequence-labeled error detection model is combined with the universal-field supervision data training model, and specifically comprises the following steps:
the text representation layer is used for carrying out text representation through a bert pre-training model, the text is represented as a matrix of n x k, wherein n is the maximum length of a sentence, and k is a word vector dimension;
the Bi-LSTM layer realizes the output of each word in the sentence through a long-short term memory network and keeps the information of the long-distance word through a mathematical structure, and the output matrix of the Bi-LSTM layer is n x 2 x h, wherein h is the dimensionality of the text representation layer;
and the CRF layer is combined with the output of the Bi-LSTM layer to calculate the optimal path of the entity label of each sentence by initializing the transition matrix.
Here h denotes the hidden-layer dimension.
The BERT model is a language model built on a bidirectional Transformer. Earlier pre-trained models that produce word vectors (such as word2vec and ELMo) perform feature-based (domain) transfer, whereas BERT performs model transfer.
BERT couples the pre-trained model with the downstream task model: the same BERT model is still used for the downstream task, it naturally supports text classification, and no modification of the model is needed for the text classification task, which improves efficiency.
In another embodiment, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text via skip-gram or CBOW and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
Skip-gram and CBOW are the two models in word2vec: CBOW predicts the current word from its known context, while skip-gram does the opposite and predicts the context from the known current word.
Both models consist of three layers (an input layer, a projection layer, and an output layer) and are built on a Huffman tree, in which the intermediate vectors stored at non-leaf nodes are initialized to zero vectors while the word vectors at the leaf nodes are initialized randomly.
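The two prediction tasks can be illustrated by how their training pairs are generated (a sketch; the function names and window size are assumptions for illustration):

```python
# CBOW predicts the centre word from its context window;
# skip-gram predicts each context word from the centre word.

def cbow_pairs(tokens, window=1):
    """(context_words, centre_word) training pairs for CBOW."""
    pairs = []
    for i, centre in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if ctx:
            pairs.append((ctx, centre))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(centre_word, context_word) training pairs for skip-gram."""
    return [
        (centre, ctx_word)
        for ctx, centre in cbow_pairs(tokens, window)
        for ctx_word in ctx
    ]
```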
Specifically, before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises the following steps:
filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices;
reading the character-to-entity-label data in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of each sentence.
Specifically, filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices further comprises:
processing the characters and the annotated entity labels into one-to-one correspondence, and building the pinyin dictionary with word segmentation.
By performing error detection with a sequence-labeling error detection model trained on general-domain supervised data, the embodiment of the invention can correct erroneous text in different fields, realizing cross-domain text correction.
S102: retrieve candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set.
the edit distance, also called the Levenshtein distance, refers to the minimum number of edit operations required to change from one string to another. Permitted editing operations include replacing one character with another, inserting one character, deleting one character;
for example, convert kitten's word to sitting: sitten (k → s); sittin (e → i); sitting (→ g);
finding the editing distance of the character string, namely changing a character string s1 into a programming character string s2 through a minimum of operations, wherein the operations comprise three operations, namely adding a character, deleting a character and modifying a character;
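The edit distance described above is computed by the standard dynamic-programming recurrence over the three operations (a sketch; the function name is an assumption):

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]
```

This reproduces the kitten → sitting example above: `levenshtein("kitten", "sitting")` is 3.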
the Jaccard distance is the Jaccard distance, and the distance is the proportion of different elements in the two sets in all the elements to measure the distinguishing degree of the two sets; the concept opposite to the Jacard similarity factor is the Jacard Distance (Jaccard Distance), which can be expressed by the following formula:
the proportion of the number of intersection elements of the two sets A and B in the A, B union is called the Jacard coefficient of the two sets and is represented by a symbol J (A, B). The Jacard similarity factor is an index for measuring the similarity between two sets (cosine distance can also be used to measure the similarity between two sets).
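The two definitions above translate directly to code (a sketch; function names are assumptions):

```python
def jaccard_similarity(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """1 − J(A, B): the fraction of elements the two sets do not share."""
    return 1.0 - jaccard_similarity(a, b)
```

For example, comparing the character sets of "abc" and "bcd" gives a similarity of 2/4 = 0.5 and a distance of 0.5.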
S103: substitute the words in the error-replacement set for the error in turn, compute the perplexity of each substituted sentence with an RNN language model (rnnlm), select the correct word from the error-replacement set according to the computed sentence perplexity, and complete the Chinese text correction.
Specifically, the rnnlm used to compute the perplexity of each substituted sentence comprises:
a representation layer, which represents the sentence by combining characters and words and vectorizes them with word2vec;
an RNN layer, a recurrent neural network that models the text as a sequence and learns the word order of the sentence, the output of each hidden state depending on the current input and the output of the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
Specifically, the perplexity is calculated as
PP(S) = P(w_1 w_2 \cdots w_N)^{-1/N},
where S denotes a sentence, w_i denotes the i-th word of the sentence, i = 1, 2, …, N, and N is the number of words in the sentence.
Perplexity is a metric used in natural language processing (NLP) to evaluate the quality of a language model. It estimates the probability of a sentence word by word: the lower the perplexity, the higher the probability of the sentence, and the more fluent the sentence.
For a sentence S, the probability of the sentence is
P(S) = P(w_1, w_2, …, w_N) = p(w_1) p(w_2 | w_1) … p(w_N | w_1, w_2, …, w_{N-1}),
i.e. the joint probability obtained by multiplying the conditional probability of each word.
The perplexity of sentence S is then
PP(S) = P(w_1, w_2, …, w_N)^{-1/N}.
Taking the logarithm of both sides and solving for PP(S) yields the form of an exponential of the average negative log-probability of each word:
PP(S) = exp( −(1/N) Σ_{i=1}^{N} log p(w_i | w_1, …, w_{i−1}) ).
The exponent is exactly a cross-entropy loss: the higher the probability of the sentence, the lower the perplexity, so the perplexity measures the fluency of the sentence.
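The log form of the perplexity computation can be sketched as follows. The per-word conditional probabilities here are toy inputs standing in for the rnnlm's outputs:

```python
import math

def perplexity(word_probs):
    """PP(S) = exp(-(1/N) * sum(log p(w_i | w_1..w_{i-1}))).

    word_probs: conditional probability of each word in the sentence.
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A 4-word sentence with uniform probability 0.25 per word has perplexity 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```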
Fig. 2 is an architecture diagram of a method for implementing cross-domain chinese text error correction according to an embodiment of the present invention.
As shown in fig. 3, an aspect of the embodiment of the present invention further provides a system for implementing cross-domain Chinese text error correction, comprising:
the error detection module 301, which performs error detection with a sequence-labeling error detection model trained on general-domain supervised data.
In the error detection module 301, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which represents the text with a BERT pre-trained model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
The BERT model is a language model built on a bidirectional Transformer. Earlier pre-trained models that produce word vectors (such as word2vec and ELMo) perform feature-based (domain) transfer, whereas BERT performs model transfer.
BERT couples the pre-trained model with the downstream task model: the same BERT model is still used for the downstream task, it naturally supports text classification, and no modification of the model is needed for the text classification task, which improves efficiency.
In another embodiment, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text via skip-gram or CBOW and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
Skip-gram and CBOW are the two models in word2vec: CBOW predicts the current word from its known context, while skip-gram does the opposite and predicts the context from the known current word.
Both models consist of three layers (an input layer, a projection layer, and an output layer) and are built on a Huffman tree, in which the intermediate vectors stored at non-leaf nodes are initialized to zero vectors while the word vectors at the leaf nodes are initialized randomly.
Specifically, before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises the following steps:
filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices;
reading the character-to-entity-label data in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of each sentence.
Specifically, filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices further comprises:
processing the characters and the annotated entity labels into one-to-one correspondence, and building the pinyin dictionary with word segmentation.
The error recall module 302 recalls candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set.
In the error recall module, the edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
For example, kitten is converted to sitting in three steps: sitten (k → s); sittin (e → i); sitting (→ g).
Finding the edit distance between strings thus means transforming a string s1 into a string s2 with the minimum number of operations, of which there are three kinds: inserting a character, deleting a character, and replacing a character.
The Jaccard distance measures how dissimilar two sets are as the proportion of elements that belong to one set but not both. The Jaccard similarity coefficient of two sets A and B, denoted J(A, B), is the proportion of the intersection within the union: J(A, B) = |A ∩ B| / |A ∪ B|.
The opposite concept is the Jaccard distance, d_J(A, B) = 1 − J(A, B). The Jaccard similarity coefficient is an index for measuring the similarity of two sets (the cosine distance can also be used to measure the similarity of two sets).
The error correction sorting module 303: the words in the error replacement set are substituted for the error in turn, the perplexity of each substituted sentence is calculated with an rnnlm language model, the correct word in the error replacement set is determined according to the calculated sentence perplexity, and the Chinese text error correction is completed.
In the error correction sorting module, specifically, the words in the error replacement set are substituted for the error in turn, and the rnnlm language model used to calculate the perplexity of each substituted sentence is specifically:
a representation layer, which represents sentences by combining characters and words and vectorizes them with word2vec;
an RNN layer, which comprises a recurrent neural network that performs sequence modeling on the text and learns the word order of the sentence, where the output of each hidden layer depends on the current input and the output at the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
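The recurrence described for the RNN layer — each hidden state depending on the current input and the previous state — can be illustrated with a toy, dependency-free scalar sketch. The weights below are invented constants for illustration; a real rnnlm uses trained weight matrices:

```python
import math

def rnn_states(xs, w=0.5, u=0.8, b=0.0, h0=0.0):
    """Toy scalar RNN recurrence: h_t = tanh(w*x_t + u*h_{t-1} + b)."""
    h, states = h0, []
    for x in xs:
        h = math.tanh(w * x + u * h + b)  # depends on x_t and h_{t-1}
        states.append(h)
    return states

states = rnn_states([1.0, 0.0, 0.0])
print(states)
# The hidden state decays but stays nonzero after the zero inputs:
# information from the first word is carried forward through h_{t-1}.
```

This is exactly the property the patent relies on: the hidden state at each step summarizes the sentence prefix, so the model can score each next word in context.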
Specifically, the perplexity is calculated as follows:
PP(S) = P(w1, w2, …, wN)^(−1/N)
where S denotes a sentence, wi denotes the i-th word in the sentence, i = 1, 2, …, N, and N is the number of words in the sentence.
Perplexity is an index used in the field of natural language processing (NLP) to measure the quality of a language model. It estimates the probability that a sentence occurs from the probability of each of its words: the lower the perplexity, the higher the probability that the sentence occurs, and the more natural the sentence is.
For a sentence S, the probability of the sentence occurring is:
P(S) = P(w1, w2, …, wN)
= p(w1)p(w2|w1)…p(wN|w1, w2, …, wN−1)
that is, the joint probability equals the product of the conditional probability of each word occurring;
the perplexity of sentence S is then:
PP(S) = P(w1, w2, …, wN)^(−1/N)
taking the logarithm of both sides and solving for PP(S) gives the form of an exponential of the sum of each word's negative log-probability:
PP(S) = exp(−(1/N) Σi log p(wi | w1, …, wi−1)), with the sum taken over i = 1, …, N.
the exponent is in fact a cross-entropy loss: the higher the probability of the sentence occurring, the lower the perplexity. Since the probability of the sentence occurring is thus reflected in its perplexity, perplexity is used as the index for scoring the candidate sentences.
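As a sanity check of the formula above, perplexity can be computed directly from per-word conditional probabilities. The probability values here are invented, standing in for the rnnlm model's outputs:

```python
import math

def perplexity(word_probs):
    """PP(S) = exp(-(1/N) * sum(log p(w_i | w_1..w_{i-1})))."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

likely   = [0.5, 0.4, 0.6]   # a plausible sentence
unlikely = [0.5, 0.01, 0.6]  # same sentence with one wrong word
print(perplexity(likely), perplexity(unlikely))
# The sentence containing the improbable word gets the higher perplexity,
# so the candidate replacement yielding the lowest perplexity is chosen.
```

This is the ranking criterion of the error correction sorting module: each candidate from the error replacement set is substituted in, and the substitution with the lowest sentence perplexity wins.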
Referring to fig. 4, another aspect of the present invention further provides an apparatus comprising a memory 410, a processor 420, and a computer program 411 stored in the memory and executable on the processor, where the processor 420, when executing the computer program 411, implements the steps of the above method for realizing cross-domain Chinese text error correction.
In a specific implementation, when the processor 420 executes the computer program 411, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic device described in this embodiment is the device used to implement the data processing apparatus of the embodiment of the present invention, a person skilled in the art can, based on the method described herein, understand the specific implementation of the electronic device and its variations; how the electronic device implements the method of this embodiment is therefore not described in detail here. Any device that a person skilled in the art uses to implement the method of the embodiment of the present invention falls within the protection scope of the present invention.
As shown in fig. 5, a further aspect of the embodiment of the present invention provides a computer-readable storage medium 500 on which a computer program 511 is stored, which, when executed by a processor, implements the steps of the above method for realizing cross-domain Chinese text error correction.
The computer instructions, when loaded and executed on a computer, produce, in whole or in part, the processes or functions described in the embodiments of the invention. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another via wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only one embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made to the present invention using this design concept shall constitute an infringement of the protection scope of the present invention.
Claims (10)
1. A method for realizing cross-domain Chinese text error correction is characterized by comprising the following steps:
performing error detection with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain;
recalling errors in the pinyin library of the vocabulary through the edit distance or the Jaccard distance to obtain an error replacement set;
and substituting the words in the error replacement set for the error in turn, calculating the perplexity of each substituted sentence with an rnnlm language model, determining the correct word in the error replacement set according to the calculated sentence perplexity, and completing the Chinese text error correction.
2. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein error detection is performed with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain, the model specifically comprising:
a text representation layer, which performs text representation through a BERT pre-training model, the text being represented as an n × k matrix, where n is the maximum sentence length and k is the word vector dimension;
a Bi-LSTM layer, which produces the output of each word in the sentence through a long short-term memory network and retains long-distance word information through its mathematical structure, the output matrix of the Bi-LSTM layer being n × 2h, where h is the dimensionality of the text representation layer;
and a CRF layer, which, combined with the output of the Bi-LSTM layer, calculates the optimal path of entity labels for each sentence by initializing a transition matrix.
3. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein error detection is performed with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain, the model specifically comprising:
a text representation layer, which embeds the text using skip-gram or cbow, the text being represented as an n × k matrix, where n is the maximum sentence length and k is the word vector dimension;
a Bi-LSTM layer, which produces the output of each word in the sentence through a long short-term memory network and retains long-distance word information through its mathematical structure, the output matrix of the Bi-LSTM layer being n × 2h, where h is the dimensionality of the text representation layer;
and a CRF layer, which, combined with the output of the Bi-LSTM layer, calculates the optimal path of entity labels for each sentence by initializing a transition matrix.
4. The method of claim 1, wherein before performing error detection by using the error detection model with sequence labeling in combination with the supervised data training model in the general field, the method further comprises:
filtering the text by special characters and emoticons, forming a word list, and digitizing the words in each sentence;
reading the data corresponding to the characters and the entity labels in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and the end of each sentence.
5. The method of claim 1, wherein filtering special characters and emoticons from the text, forming a word list, and digitizing the words in each sentence further comprises:
processing the characters and the labeled entity labels into a one-to-one correspondence, and processing the pinyin dictionary with word segmentation.
6. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein the words in the error replacement set are substituted for the error in turn, and the rnnlm language model used to calculate the perplexity of each substituted sentence specifically comprises:
a representation layer, which represents sentences by combining characters and words and vectorizes them with word2vec;
an RNN layer, which comprises a recurrent neural network that performs sequence modeling on the text and learns the word order of the sentence, where the output of each hidden layer depends on the current input and the output at the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
8. A system for implementing cross-domain Chinese text error correction, comprising:
an error detection module: performing error detection with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain;
an error recall module: performing error recall in the pinyin library of the vocabulary through the edit distance or the Jaccard distance to obtain an error replacement set;
and an error correction sorting module: substituting the words in the error replacement set for the error in turn, calculating the perplexity of each substituted sentence with an rnnlm language model, determining the correct word in the error replacement set according to the calculated sentence perplexity, and completing the Chinese text error correction.
9. An apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110383985.0A CN113076739A (en) | 2021-04-09 | 2021-04-09 | Method and system for realizing cross-domain Chinese text error correction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113076739A true CN113076739A (en) | 2021-07-06 |
Family
ID=76615941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110383985.0A Pending CN113076739A (en) | 2021-04-09 | 2021-04-09 | Method and system for realizing cross-domain Chinese text error correction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113076739A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642318A (en) * | 2021-10-14 | 2021-11-12 | 江西风向标教育科技有限公司 | Method, system, storage medium and device for correcting English article |
CN113836919A (en) * | 2021-09-30 | 2021-12-24 | 中国建筑第七工程局有限公司 | Building industry text error correction method based on transfer learning |
CN114611494A (en) * | 2022-03-17 | 2022-06-10 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN114818669A (en) * | 2022-04-26 | 2022-07-29 | 北京中科智加科技有限公司 | Method for constructing name error correction model and computer equipment |
CN115048907A (en) * | 2022-05-31 | 2022-09-13 | 北京深言科技有限责任公司 | Text data quality determination method and device |
CN115204151A (en) * | 2022-09-15 | 2022-10-18 | 华东交通大学 | Chinese text error correction method, system and readable storage medium |
CN115221866A (en) * | 2022-06-23 | 2022-10-21 | 平安科技(深圳)有限公司 | Method and system for correcting spelling of entity word |
CN115293138A (en) * | 2022-08-03 | 2022-11-04 | 北京中科智加科技有限公司 | Text error correction method and computer equipment |
CN115659958A (en) * | 2022-12-27 | 2023-01-31 | 中南大学 | Chinese spelling error checking method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019085779A1 (en) * | 2017-11-01 | 2019-05-09 | 阿里巴巴集团控股有限公司 | Machine processing and text correction method and device, computing equipment and storage media |
CN110717031A (en) * | 2019-10-15 | 2020-01-21 | 南京摄星智能科技有限公司 | Intelligent conference summary generation method and system |
CN110751234A (en) * | 2019-10-09 | 2020-02-04 | 科大讯飞股份有限公司 | OCR recognition error correction method, device and equipment |
CN111695343A (en) * | 2020-06-23 | 2020-09-22 | 深圳壹账通智能科技有限公司 | Wrong word correcting method, device, equipment and storage medium |
CN112149406A (en) * | 2020-09-25 | 2020-12-29 | 中国电子科技集团公司第十五研究所 | Chinese text error correction method and system |
Non-Patent Citations (1)
Title |
---|
Shi Xiaohua: "Matrix Factorization Learning and Network Community Detection Methods", Beijing: Beijing University of Posts and Telecommunications Press, pages: 137 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||