CN116127952A - Multi-granularity Chinese text error correction method and device

Multi-granularity Chinese text error correction method and device

Info

Publication number
CN116127952A
CN116127952A
Authority
CN
China
Prior art keywords
word
granularity
error
text
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310088091.8A
Other languages
Chinese (zh)
Inventor
赵鑫安
宋伟
朱世强
谢冰
袭向明
尹越
王雨菡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310088091.8A priority Critical patent/CN116127952A/en
Publication of CN116127952A publication Critical patent/CN116127952A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/169: Annotation, e.g. comment data or footnotes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A multi-granularity Chinese text error correction method, comprising: preprocessing the Chinese text to be corrected; constructing a noun knowledge base and a text error correction training corpus; vector-encoding the input text to be corrected with a pre-trained language model and fusing the phonetic information of the text to obtain a character vector sequence; detecting character-granularity and word-granularity errors in the text with neural networks to obtain an erroneous character set and an erroneous word set; correcting the detected character-granularity and word-granularity errors respectively to obtain candidate replacement characters and candidate replacement words; jointly training the whole model in a multi-task learning manner; and fusing the character-granularity and word-granularity correction results to obtain the corrected text. The invention also comprises a multi-granularity Chinese text error correction device. The invention can effectively correct errors of multiple granularities (character granularity and word granularity) in text, trains the whole model in a multi-task learning manner, and achieves good correction accuracy and practicality.

Description

Multi-granularity Chinese text error correction method and device
Technical Field
The invention relates to the field of text correction, in particular to a multi-granularity Chinese text correction method and device.
Background
In recent years, with the rapid development of artificial intelligence technology, automatic speech recognition has made major breakthroughs in theory and application and is widely applied in fields such as intelligent robots, intelligent customer service, and speech-recognition dictation machines. However, owing to the user's accent, dialect, and manner of expression, background noise, and defects of the speech recognition model itself, speech recognition technology cannot convert the speech signal into text completely correctly; in particular, the recognition accuracy for proper nouns in vertical domains (such as person names and object nouns, which occur with low frequency) is low. Speech recognition errors in turn greatly affect the accuracy and recall of downstream tasks (such as intent recognition, text retrieval, and entity recognition).
To solve the problem of speech recognition errors, text error correction technology is required to detect and correct errors in the text obtained by speech recognition. Text error correction plays an important role in many natural language processing (NLP) applications and serves as an indispensable upstream module. The existing technical schemes for text error correction and their problems are as follows:
(1) Text error correction methods based on statistical language models first detect erroneous characters by scoring local n-grams of the text to be corrected, then sequentially substitute the candidate characters corresponding to each erroneous character using a pre-constructed confusion set to generate candidate texts, and finally screen the optimal text by computing perplexity scores of the candidate texts or by applying set rules. Since statistical language models only use word frequency information from the text corpus, they cannot exploit semantic representations or long-range context information, and the correction effect is limited by the confusion set, so the accuracy of such methods is limited.
(2) In recent years, deep learning models represented by CNNs and RNNs, and especially pre-trained language models represented by BERT, have demonstrated their effectiveness on many NLP tasks. Accordingly, many approaches use pre-trained language models such as BERT to achieve end-to-end text correction; thanks to their powerful language understanding and representation capabilities, these approaches achieve good results on many text correction tasks, particularly on common character-granularity errors. However, text correction methods based on pre-trained language models are prone to interference from contextual noise, correct texts containing multiple errors poorly, and handle word-granularity errors poorly, especially proper nouns in vertical domains.
(3) To handle word-granularity errors, most existing methods identify potentially erroneous words through word segmentation (using tools such as jieba or CoreNLP) or named entity recognition (using models such as hidden Markov models (HMM), conditional random fields (CRF), or BiLSTM+CRF), and then correct them through candidate word recall, ranking, and screening. However, limited by the performance of segmentation algorithms, word mis-segmentation often occurs; moreover, the named entity recognition models listed above often mis-recognize or miss entities and in particular cannot handle nested entity nouns in text, so the overall correction accuracy is limited, and only word-granularity errors can be corrected.
In addition, all the above methods are easily affected by the distribution of the training data and recognize low-frequency proper noun errors poorly. Therefore, correcting character-granularity and word-granularity errors at the same time, covering both high-frequency common character-granularity errors and low-frequency proper noun errors (especially proper noun errors in vertical domains), is of great significance for improving the performance of text error correction algorithms.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a multi-granularity Chinese text error correction method and device.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a first aspect of the embodiment of the invention provides a multi-granularity Chinese text error correction method, which comprises the following steps:
S1: preprocessing the Chinese text to be corrected;
S2: constructing a noun knowledge base, collecting raw error correction data, and constructing the training corpus of the text error correction model;
S3: vector-encoding the input text to be corrected with a pre-trained language model and fusing the phonetic information of the text to obtain the corresponding character vector sequence;
S4: after the text character vector sequence is obtained in step S3, detecting character-granularity errors in the text with a character-granularity error detection layer consisting of a fully connected layer, and detecting word-granularity errors in the text with a word-granularity error detection neural network built from the head character, tail character, and relative distance features of character segments, obtaining an erroneous character set and an erroneous word set;
S5: correcting the erroneous character set and the erroneous word set detected in step S4 respectively to obtain candidate replacement characters for the character-granularity errors and candidate replacement words for the word-granularity errors; for the erroneous characters detected in S4, predicting the correct characters with a character-granularity error correction layer consisting of one fully connected layer to obtain candidate replacement characters; for the erroneous words detected in S4, obtaining the corresponding correct words from the noun knowledge base constructed in S2 by candidate recall and candidate ranking-screening to obtain candidate replacement words;
S6: using the training corpus constructed in S2, jointly training, in a multi-task learning manner, the pinyin encoding module in S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in S4, and the character-granularity error correction layer in S5;
S7: for the text to be corrected, after the character-granularity candidate replacement characters and word-granularity candidate replacement words are obtained from S5, fusing the character-granularity and word-granularity correction results according to a preset rule to obtain the corrected text.
Further, the step S1 specifically includes: to improve the accuracy of subsequent correction, the Chinese text to be corrected is preprocessed in advance; the preprocessing step comprises Unicode text normalization, traditional-simplified Chinese conversion, punctuation restoration, and digit preprocessing.
Further, the noun knowledge base constructed in step S2 includes common nouns obtained from public existing noun knowledge bases and proper nouns obtained from text corpus data of a vertical or limited domain by manual or statistical/rule-based methods. Further, the raw error correction data collected from an intelligent question-answering system or a speech recognition system are preprocessed as described in step S1, and the character-granularity error label sequence, character-granularity correct character sequence, and word-granularity error labels of each text are labeled manually; after labeling, the data are randomly divided into a training set, a validation set, and a test set for training the subsequent error correction model.
Further, the step S3 specifically includes: first, the character index sequence of the input text to be corrected is obtained using the vocabulary of a pre-trained language model; then the embedding module of the pre-trained language model yields the character embedding vector and position embedding vector of each character; next, a neural network encodes the pinyin sequence of each character to obtain its phonetic embedding vector; the three are added to obtain the final embedding vector of the character, which is then input into the Transformer-based encoder of the pre-trained language model to obtain the character vector sequence of the text.
Further, in step S4, the method for performing character-granularity error detection and word-granularity error detection on the text to be corrected specifically includes: after the text character vector sequence is obtained in step S3, character-granularity error detection uses a character-granularity error detection layer consisting of a fully connected layer to obtain the character-granularity error probability of each character in the text, and the characters whose error probability exceeds the preset character-granularity error threshold ε_char are collected as the erroneous character set.
Still further, the word-granularity error detection step operates on the character segment from character x_i to character x_j (1<=i<j<=n, n is the text length). Taking the head character, tail character, and relative distance of the character segment as features, it obtains the head-character representation vector h_i^start and the tail-character representation vector h_j^end, each through a fully connected layer and a ReLU layer, and encodes the distance between x_i and x_j with a relative distance encoding function to obtain e_dist. The three vectors h_i^start, h_j^end, and e_dist are input into an erroneous segment classification layer to obtain the probability that the segment is erroneous; if this probability is greater than or equal to the preset word-granularity error threshold ε_span, the character segment is erroneous and needs correction. All character segments whose error probability exceeds the preset threshold are collected as the erroneous word set. Character-granularity error correction and word-granularity error correction are then performed respectively on the erroneous character set and erroneous word set detected in step S4.
Further, the step S5 specifically includes: for the erroneous characters detected in S4, the character-granularity error correction step takes the erroneous character representations and error probabilities as features, uses a character-granularity error correction layer consisting of one fully connected layer to predict the probability of correcting each erroneous character into each character in the vocabulary, and selects the character with the maximum probability as the candidate replacement of the erroneous character; for the erroneous words detected in step S4, the word-granularity error correction step recalls equal-length and unequal-length candidate words from the noun knowledge base constructed in step S2 based on pinyin edit distance, then uses a truncated linear regression model that takes Chinese-character-level and pinyin-level features as input to compute a comprehensive similarity for ranking and screening the candidate words, and takes the candidate word with the highest score that exceeds a preset threshold as the candidate replacement word of the erroneous word.
Further, the total loss function in step S6 is a weighted average of the loss of the character-granularity error detection layer in S4, the loss of the word-granularity error detection neural network in S4, and the loss of the character-granularity error correction layer in S5; the parameters of the model are optimized by minimizing the total loss function.
Further, the step S7 specifically includes: fusing the character-granularity and word-granularity correction results according to a preset rule so as to avoid conflicts and mis-corrections between them, and replacing each erroneous character or word in the input text to be corrected with its replacement to obtain the corrected text; when a character-granularity correction result conflicts with a word-granularity correction result, the preset rule preferentially adopts the word-granularity correction result.
A second aspect of an embodiment of the present invention provides a multi-granularity Chinese text error correction apparatus, comprising a memory and a processor, the memory coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to implement the above multi-granularity Chinese text error correction method.
A third aspect of an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-granularity Chinese text error correction method described above.
The beneficial effects of the invention are as follows:
1. The method and the device use the strong text feature extraction capability of the pre-trained language model to encode the input text to be corrected, fusing the phonetic information of the text during encoding, which improves the accuracy of subsequent text correction.
2. In detecting errors in the text, character-granularity errors and word-granularity errors are detected separately based on the text vector sequence encoded by the pre-trained language model, covering both granularities at once. The word-granularity error detection neural network, built from the head character, tail character, and relative distance of character segments, detects word-granularity errors without being affected by word segmentation errors and can solve the problem that traditional named entity recognition models cannot recognize nested entity nouns.
3. In the correction process, character-granularity errors and word-granularity errors are corrected separately. For character-granularity errors, the pre-trained language model is used directly to predict the correct characters. For word-granularity errors, candidate recall and ranking are used: based on Chinese character, pinyin, and other features, a well-performing comprehensive similarity function is built on a truncated linear regression model, which can correct both high-frequency and low-frequency word-granularity errors without being affected by the data distribution. The method can therefore correct multi-granularity errors (character-granularity and word-granularity errors) while covering both high-frequency character-granularity errors and low-frequency word-granularity errors (especially low-frequency proper noun errors), and the two granularities complement each other to achieve higher correction accuracy.
4. The invention jointly trains the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer, the word-granularity error detection neural network, and the character-granularity error correction layer in a multi-task learning manner to optimize the parameters of the model. On one hand, the modules share the Transformer encoder of the pre-trained language model, reducing memory usage and inference time; on the other hand, information is shared among the modules so that they complement one another and improve each other's accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the overall model architecture according to the present invention;
FIG. 3 is a block diagram of a word granularity error detection neural network;
FIG. 4 is a block diagram of the character-granularity error detection layer and error correction layer;
FIG. 5 is a schematic diagram of the structure of the apparatus of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
Example 1
The following describes a multi-granularity Chinese text error correction method in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
As shown in FIG. 1, the invention provides a multi-granularity Chinese text error correction method, comprising the following steps:
s1: and preprocessing the Chinese text to be corrected.
Specifically, the preprocessing step comprises Unicode text normalization, traditional-simplified Chinese conversion, punctuation restoration, and digit preprocessing, described as follows:
s11: unicode text normalization.
Since the input text to be corrected may come from different sources and use different encoding schemes, Unicode normalization needs to be performed on the text to be corrected to avoid affecting subsequent modules.
S12: and (5) complex conversion processing.
Because Chinese users from different regions can use different Chinese character standards, open-source kits such as OpenCC are required to uniformly convert input text to be corrected into simplified Chinese.
S13: punctuation recovery.
Text obtained from systems such as speech recognition generally contains no punctuation marks, yielding unsegmented character sequences with low readability and incoherent semantics, which affects downstream tasks such as the subsequent error correction model. It is therefore necessary to add the omitted punctuation marks to the text using punctuation restoration techniques. This method uses a neural-network-based sequence labeling method (e.g., RNN, CNN, Transformer) to predict the missing punctuation marks in the text.
S14: digital preprocessing.
Systems such as speech recognition may incorrectly transcribe Chinese characters as Arabic numerals; for example, a speech recognition model may transcribe "how is this scheme implemented" with the character "实" (shí, part of "实现", "implement") misrecognized as its homophone "10" (shí), and Arabic numerals cannot be converted to pinyin by the pinyin tools. It is therefore necessary to replace the simple Arabic numerals (0, 1, …, 9, 10, 20, …, 90, 100, 200, …, 900) that are not part of dates, times, sequence numbers, or other identifiers in the text with the corresponding Chinese characters, for example converting "10" into "十" ("ten").
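As an illustration of this preprocessing pipeline, the following minimal Python sketch combines S11, S12, and S14; the digit table, the use of the opencc package, and the helper name preprocess are assumptions for illustration, not part of the patent.

    import re
    import unicodedata

    # Simple Arabic numerals -> Chinese characters, as described in S14.
    DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                 "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
                 "10": "十"}

    def preprocess(text: str) -> str:
        # S11: Unicode normalization (NFKC folds full-width forms, etc.).
        text = unicodedata.normalize("NFKC", text)
        # S12: traditional -> simplified conversion via OpenCC, if available.
        try:
            from opencc import OpenCC
            text = OpenCC("t2s").convert(text)
        except ImportError:
            pass
        # S14: replace simple numerals with Chinese characters; a real
        # system would first exclude dates, times, and sequence numbers.
        return re.sub(r"\d+",
                      lambda m: DIGIT_MAP.get(m.group(), m.group()), text)

    print(preprocess("这个方案是如何10现的"))  # -> 这个方案是如何十现的

Punctuation restoration (S13) is omitted here because it requires a trained sequence labeling model.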
S2: and constructing a noun knowledge base, collecting error correction original data and constructing a training corpus of a text error correction model.
Specifically, this step comprises two sub-steps, constructing the noun knowledge base and constructing the training corpus of the text error correction model, described as follows:
s21: and constructing a noun knowledge base.
Specifically, before text correction is performed, a noun knowledge base containing common nouns and proper nouns needs to be built in advance. Nouns refer to words denoting entities or abstract things such as people, objects, places, and concepts, and can be divided into common nouns and proper nouns according to their generality. Common nouns may be obtained from public existing noun knowledge bases, and proper nouns may be obtained from text corpus data of a vertical or limited domain by manual or statistical/rule-based methods; the specific method is not limited herein.
S22: and collecting error correction original data and constructing a training corpus of the text error correction model. The method comprises the following steps:
Training the text error correction model requires an error correction corpus. The raw error correction data can be obtained from the history logs of an intelligent question-answering system, or generated manually using a speech recognition system.
After the raw error correction data are obtained, each text is preprocessed as described in step S1.
Word recognition errors occurring in the texts of the raw error correction data are labeled manually and used to construct the training corpus of the text error correction model. Specifically, for each text X to be corrected with character sequence (x_1, x_2, …, x_n) (n is the length of the text, x_i is the i-th character; X may or may not contain errors), manual labeling yields the character-granularity error label sequence G = (g_1, g_2, …, g_n), where g_i = 1 means the i-th character x_i is erroneous and g_i = 0 means it is correct; the character-granularity correct character sequence Y = (y_1, y_2, …, y_n), where y_i is the correct character corresponding to x_i (if x_i contains no error, x_i = y_i); and the word-granularity (character segment) error labels Z = {(i, j, z_{i,j}): 1<=i<j<=n}, where z_{i,j} = 1 means the character segment X_{i:j} from i to j is an erroneous proper noun and z_{i,j} = 0 means X_{i:j} contains no error or is a correct word. Note that the speech recognition system may recognize a noun as an erroneous noun of unequal length; therefore, when labeling word-granularity errors, if the erroneous noun is shorter than the correct noun, a longest-common-subsequence alignment is used to obtain the correct character corresponding to each character of the erroneous noun; if the erroneous noun is longer than the correct noun, the correct characters corresponding to the extra characters are set to the null character. In this way, the labeled sample corresponding to each text X to be corrected is (X, G, Y, Z).
For example, suppose the speech recognition system misrecognizes "这个项目的负责人是谁?" ("Who is the person responsible for this project?") as "这个项木的负责人是水?", where "目" in "项目" ("project") is misrecognized as "木" ("wood") and "谁" ("who") is misrecognized as "水" ("water"). From the text to be corrected "这个项木的负责人是水?", the labeled data obtained are:
X = (这, 个, 项, 木, 的, 负, 责, 人, 是, 水, ?),
G = (0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0),
Y = (这, 个, 项, 目, 的, 负, 责, 人, 是, 谁, ?),
Z = {(3, 4, 1)} ∪ {(i, j, 0): 1<=i<j<=11, (i, j) != (3, 4)}.
If the collected raw error correction data contain K texts, labeling in the above manner yields the dataset {(X_k, G_k, Y_k, Z_k): 1<=k<=K}, where X_k = (x_{k,1}, …, x_{k,n_k}), G_k = (g_{k,1}, …, g_{k,n_k}), Y_k = (y_{k,1}, …, y_{k,n_k}), and Z_k = {(i, j, z_{k,i,j}): 1<=i<j<=n_k} are the character sequence, character-granularity error label sequence, character-granularity correct character sequence, and word-granularity (character segment) error labels of the k-th text, n_k is the character sequence length of that text, and K is the total number of samples in the dataset.
After the raw error correction data are labeled, the labeled dataset is randomly shuffled and divided in an 8:1:1 ratio into a training set, a validation set, and a test set, used respectively for training the text error correction model, tuning the model hyperparameters, and evaluating the model.
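The labeled sample layout and the 8:1:1 split can be sketched as follows; the dictionary field names are illustrative assumptions, and the spans in this sketch use 0-based Python indices (the patent text counts characters from 1).

    import random

    sample = {
        "X": list("这个项木的负责人是水?"),        # text with errors
        "G": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],    # char-level error labels
        "Y": list("这个项目的负责人是谁?"),        # char-level corrections
        "Z": [(2, 3, 1)],                           # erroneous span "项木"
    }

    def split_dataset(samples, seed=42):
        rng = random.Random(seed)
        samples = samples[:]
        rng.shuffle(samples)
        n = len(samples)
        n_train, n_dev = int(0.8 * n), int(0.1 * n)
        return (samples[:n_train],                  # training set
                samples[n_train:n_train + n_dev],   # validation set
                samples[n_train + n_dev:])          # test set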
S3: and vector encoding is carried out on the input text to be corrected by using a pre-training language model, and the voice information of the text is fused to obtain a corresponding character vector sequence.
Specifically, for the input Chinese text to be corrected X = (x_1, x_2, …, x_n) (n is the length of the text, x_i is the i-th character), a corresponding integer index sequence is obtained using the vocabulary of a pre-trained language model (e.g., BERT, RoBERTa, ALBERT; the vocabulary contains common characters and character fragments and has size N). The embedding module of the pre-trained model then yields each character x_i's character embedding vector e_i^char and position embedding vector e_i^pos.
Considering that a speech recognition system usually misrecognizes words as homophones or near-homophones, phonetic information (i.e., pinyin information) is introduced when encoding the text, providing richer features and improving the accuracy of the text error correction model. For character x_i, if it is a Chinese character, a Python Chinese-character-to-pinyin tool (e.g., xpinyin) is used to obtain the corresponding pinyin letter sequence p_i; if it is not a Chinese character (e.g., an English letter or a digit), its pinyin letter sequence is set to p_i = x_i. For example, the pinyin letter sequence of the Chinese character "项" is (x, i, a, n, g), that of the English character "b" is (b), and that of the digit character "3" is (3). After the pinyin letter sequence p_i of character x_i is obtained, a pinyin encoding module consisting of one neural network layer (an RNN, LSTM, GRU, CNN, or similar network can be used) encodes p_i to obtain the phonetic embedding vector e_i^pin of character x_i. The character embedding vector e_i^char, position embedding vector e_i^pos, and phonetic embedding vector e_i^pin are then added to obtain the final embedding vector of the character, e_i = e_i^char + e_i^pos + e_i^pin. After the embedding vector of every character is obtained in this way, the embedding vector sequence of text X is E = (e_1, e_2, …, e_n).
Further, the embedding vector sequence E of text X is input into the encoder of the pre-trained language model, which consists of multiple Transformer layers. Taking BERT_base as an example, the BERT_base pre-trained language model consists of 12 identical Transformer layers. The input of each Transformer layer is the hidden state vector sequence output by the previous layer, obtained via multi-head self-attention, a feed-forward network, residual connections, and layer normalization. In this example, the hidden state vector sequence output by the last Transformer layer is taken as the final encoding vector sequence of text X, denoted H = (h_1, h_2, …, h_n), where h_i is the encoding vector of character x_i. Relying on the strong performance of the pre-trained language model, the semantic and grammatical information of each character in the text to be corrected is captured effectively, and fusing the phonetic information of the text further improves the subsequent correction effect.
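A minimal PyTorch sketch of this encoding step is given below, assuming HuggingFace Transformers and a GRU-based pinyin encoder; the module names, layer sizes, and the choice of bert-base-chinese are assumptions. When inputs_embeds is supplied, the BERT embedding module adds the position embeddings internally, so the sum e^char + e^pin + e^pos is formed as described above.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class PinyinEncoder(nn.Module):
        """Encodes each character's pinyin letter sequence with a GRU."""
        def __init__(self, n_letters=64, dim=768):
            super().__init__()
            self.letter_emb = nn.Embedding(n_letters, dim)
            self.gru = nn.GRU(dim, dim, batch_first=True)

        def forward(self, pinyin_ids):            # (n_chars, max_letters)
            _, h = self.gru(self.letter_emb(pinyin_ids))
            return h[-1]                          # e^pin: (n_chars, dim)

    bert = BertModel.from_pretrained("bert-base-chinese")
    pinyin_enc = PinyinEncoder()

    def encode(input_ids, pinyin_ids):
        char_emb = bert.embeddings.word_embeddings(input_ids)  # e^char
        pin_emb = pinyin_enc(pinyin_ids).unsqueeze(0)          # e^pin
        # H = (h_1, ..., h_n): output of the last Transformer layer
        return bert(inputs_embeds=char_emb + pin_emb).last_hidden_state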
S4: after obtaining the text character vector sequence in the step S3, detecting the word granularity error in the text by using a word granularity error detection layer formed by a full connection layer, and detecting the word granularity error in the text by using a word granularity error detection neural network formed by the first character, the last character and the relative distance of character fragments to obtain an error word set and an error word set, wherein the specific steps are as follows:
S41: and detecting word granularity errors in the text by using a word granularity error detection layer formed by a full connection layer to obtain an error word set. The method comprises the following steps:
For character-granularity errors, the encoding vector h_i of each character x_i in text X is input into a character-granularity error detection layer consisting of a fully connected layer to obtain the probability that the character is erroneous, defined as:

p_i^char = σ(W_d h_i + b_d), (1)

where W_d and b_d are the weight matrix and bias term of the fully connected layer, and σ is the sigmoid function.

According to the preset character-granularity error threshold ε_char, if the error probability p_i^char is greater than or equal to ε_char, the character is judged erroneous and needs correction; if p_i^char is less than ε_char, the character is judged error-free and needs no correction. This yields the erroneous character set {x_i : i ∈ SET_char}, where SET_char is the index set of the erroneous characters detected by the character-granularity detection module.
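A sketch of this detection layer, assuming PyTorch and a hidden size of 768; the threshold value is illustrative.

    import torch
    import torch.nn as nn

    class CharErrorDetector(nn.Module):
        """Fully connected layer + sigmoid over each h_i (equation (1))."""
        def __init__(self, hidden=768):
            super().__init__()
            self.fc = nn.Linear(hidden, 1)        # W_d, b_d

        def forward(self, H):                     # H: (batch, n, hidden)
            return torch.sigmoid(self.fc(H)).squeeze(-1)  # p_i^char

    detector = CharErrorDetector()
    H = torch.randn(1, 11, 768)                   # encoder output
    p_char = detector(H)
    eps_char = 0.5                                # assumed threshold value
    SET_char = (p_char[0] >= eps_char).nonzero().flatten().tolist()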
S42: and detecting word granularity errors in the text by using a word granularity error detection neural network formed based on the first character, the last character and the relative distance of the character fragments, so as to obtain an error word set. The method comprises the following steps:
For word-granularity errors, the character segment from character x_i to character x_j in text X is denoted X_{i:j} = [x_i, x_{i+1}, …, x_j] (where i, j satisfy 1<=i<j<=n), and its corresponding character vector sequence is H_{i:j} = [h_i, h_{i+1}, …, h_j]. The probability that it is an erroneous word is computed with a multi-layer neural network as follows:

h_i^start = ReLU(W_start h_i + b_start), (2)
h_j^end = ReLU(W_end h_j + b_end), (3)
e_dist = f_dist(j − i), (4)
h_{i:j}^span = [h_i^start; h_j^end; e_dist], (5)
p_{i:j}^span = σ(W_span h_{i:j}^span + b_span), (6)

where the encoding vector h_i of the head character x_i is input into a fully connected layer to obtain the head-character representation vector h_i^start of the character segment (W_start and b_start are the weight matrix and bias term of the fully connected layer, and ReLU(·) is the ReLU activation function, ReLU(x) = max(x, 0)); the encoding vector h_j of the tail character x_j is input into a fully connected layer to obtain the tail-character representation vector h_j^end (W_end and b_end are the weight matrix and bias term of that layer); f_dist is a distance encoding function that maps the relative distance j − i between characters x_i and x_j to the distance encoding e_dist, preserving the length information of the character segment; h_i^start, h_j^end, and e_dist are concatenated to obtain the representation vector h_{i:j}^span of the segment from x_i to x_j; and h_{i:j}^span is input into a word-granularity error classification layer to obtain the probability p_{i:j}^span that the character segment is erroneous (W_span and b_span are the weight matrix and bias term of the fully connected layer in the word-granularity error classification layer, and σ is the sigmoid function).

According to the preset word-granularity error threshold ε_span, if the error probability p_{i:j}^span is greater than or equal to ε_span, the character segment is an erroneous word and needs correction; if p_{i:j}^span is less than ε_span, the character segment is error-free and needs no correction. This erroneous-segment detection scheme can effectively identify nested erroneous nouns and can be accelerated in parallel. It yields the erroneous word (erroneous character segment) set {X_{i:j} : (i, j) ∈ SET_span}, where SET_span is the set of start-end index pairs of the erroneous words (erroneous character segments) detected by the word-granularity error detection module.
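The span detection network of equations (2)-(6) can be sketched as follows; the distance embedding table used for f_dist, the layer sizes, and the single-span interface are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SpanErrorDetector(nn.Module):
        def __init__(self, hidden=768, dist_dim=64, max_dist=128):
            super().__init__()
            self.start_fc = nn.Linear(hidden, hidden)         # W_start, b_start
            self.end_fc = nn.Linear(hidden, hidden)           # W_end, b_end
            self.dist_emb = nn.Embedding(max_dist, dist_dim)  # f_dist
            self.cls = nn.Linear(2 * hidden + dist_dim, 1)    # W_span, b_span

        def forward(self, H, i, j):               # H: (batch, n, hidden)
            h_start = torch.relu(self.start_fc(H[:, i]))      # eq. (2)
            h_end = torch.relu(self.end_fc(H[:, j]))          # eq. (3)
            e_dist = self.dist_emb(torch.tensor([j - i]))     # eq. (4)
            span = torch.cat([h_start, h_end, e_dist], dim=-1)  # eq. (5)
            return torch.sigmoid(self.cls(span)).squeeze(-1)  # eq. (6)

    det = SpanErrorDetector()
    H = torch.randn(1, 11, 768)
    p_span = det(H, 2, 3)   # probability that characters 2..3 are an error

In practice all candidate spans up to a maximum length can be scored in one batched pass, which is what enables the parallel acceleration mentioned above.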
In the prior art, erroneous nouns are usually detected with a word segmentation tool or with named entity recognition methods such as CRF, BiLSTM+CRF, and BERT+CRF. However, segmentation tools frequently mis-segment, especially on text data in limited domains, and existing named entity recognition methods cannot handle nested entities. The erroneous-character-segment detection method above detects erroneous words without word segmentation, avoiding the influence of segmentation errors, can recognize nested erroneous nouns, and achieves more accurate word-granularity detection.
S5: and (3) correcting the error word set and the error word set detected in the step (S4) respectively to obtain candidate replacement words with wrong word granularity and candidate replacement words with wrong word granularity. For the error word detected in the S4, predicting the correct character by using a word granularity error correction layer formed by a layer of full-connection layers to obtain a candidate replacement word; and (3) for the error words detected in the step (S4), obtaining corresponding correction words from the noun knowledge base constructed in the step (S2) by adopting a candidate recall and candidate sorting screening mode, and obtaining candidate replacement words. The method comprises the following specific steps:
S51: and in the step of word granularity error correction, for each error word in the error word set obtained by detection in the step S41, a word granularity error correction layer formed by a full-connection layer is used for predicting correct characters, and candidate replacement words are obtained. The method comprises the following steps:
For each erroneous character x_i in the erroneous character set obtained by the S41 character-granularity error detection, whose character-granularity error probability is p_i^char, the character representation is input into a character-granularity error correction layer consisting of one fully connected layer to obtain the probability that x_i should be corrected to the j-th character in the vocabulary:

P(y_i = j | X) = softmax(W_c h_i + b_c + o_i)_j, (7)

where W_c and b_c are the weight matrix and bias term of the character-granularity error correction layer, o_i is the one-hot vector of character x_i (the vector length is the number of characters in the vocabulary; the position of character x_i has value 1 and the remaining positions value 0), softmax(·) is the normalized exponential function used to compute the probability that the correct character is the j-th character in the vocabulary, and y_i is the correct character corresponding to x_i.

After the probabilities of correcting character x_i into each character in the vocabulary are obtained, the character with the highest probability is taken as the candidate replacement character of x_i, denoted ŷ_i. This method effectively corrects the erroneous characters in high-frequency character-granularity errors.
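A sketch of this correction layer; adding the one-hot vector o_i to the logits is one reading of equation (7), and the vocabulary size of bert-base-chinese (21128) is an assumption.

    import torch
    import torch.nn as nn

    class CharCorrector(nn.Module):
        def __init__(self, hidden=768, vocab=21128):
            super().__init__()
            self.fc = nn.Linear(hidden, vocab)    # W_c, b_c

        def forward(self, h_i, char_id):          # h_i: (1, hidden)
            o_i = nn.functional.one_hot(char_id, self.fc.out_features).float()
            return torch.softmax(self.fc(h_i) + o_i, dim=-1)  # eq. (7)

    corrector = CharCorrector()
    probs = corrector(torch.randn(1, 768), torch.tensor([672]))
    candidate = probs.argmax(dim=-1)              # highest-probability char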
S52: and step of correcting word granularity errors, namely detecting each error word in the error word set in step S42, and obtaining a corresponding correction word from the noun knowledge base constructed in step S2 by adopting a candidate recall and candidate sorting screening mode to obtain a candidate replacement word. The method comprises the following steps:
For the erroneous word set obtained by the S42 word-granularity error detection, suppose the segment X_{i:j} from character x_i to character x_j is an erroneous word (erroneous character segment); candidate recall and candidate ranking-screening are then used to obtain the corresponding correct word from the pre-constructed noun knowledge base as the candidate replacement word.
Firstly, defining the required editing distance similarity and Jaccard similarity as follows:
the edit distance, also called Levenshtein, is a measure of the degree of differentiation between two strings, defined as how many times at least processing is required to change one string to another. Defining similarity based on edit distance
Figure BDA0004074043460000121
Where x, y are two strings, edit_dist (x, y) is the edit distance between strings x, y, and len (x) is the length of string x. The Jaccard similarity is used for comparing the similarity between two limited sample sets, for two character strings, corresponding character sets A and B are obtained first, and then the Jaccard similarity J (A, B) of the two sets is defined as the proportion of the number of intersection elements of the two character sets in the union of the two sets: / >
Figure BDA0004074043460000122
The greater the Jaccard similarity, the more similar the two conjugates.
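Both similarities are easy to state in code; this sketch assumes the normalization by max(len(x), len(y)) given above.

    def edit_dist(x: str, y: str) -> int:
        # Standard dynamic-programming Levenshtein distance.
        m, n = len(x), len(y)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
        return d[m][n]

    def edit_sim(x: str, y: str) -> float:
        return 1 - edit_dist(x, y) / max(len(x), len(y))

    def jaccard_sim(x: str, y: str) -> float:
        a, b = set(x), set(y)
        return len(a & b) / len(a | b)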
Considering that most errors in speech recognition scenarios are homophone and near-homophone recognition errors caused by accents, dialects, expression habits, and background noise, the candidate recall step uses pinyin-based similarity to recall candidate words.
For the character segment X_{i:j} to be corrected detected in S42, a coarse search is used to select candidate words from the noun knowledge base. Considering that speech recognition may recognize a noun as an erroneous noun of a different length (e.g., one character more or one character fewer), both equal-length and unequal-length nouns in the knowledge base are recalled.
Specifically, for equal length noun recall, all nouns with lengths of j-i+1 in a noun knowledge base are obtained first to obtain pinyin sequences and character segments X of each equal length noun in the knowledge base i:j Calculates the edit distance similarity between the two,and taking all nouns with the similarity larger than or equal to a preset equal-length noun recall threshold value as initial equal-length candidate words to obtain an initial equal-length candidate word set.
For recall of nouns with different lengths, firstly, all nouns with lengths not being j-i+1 in a noun knowledge base are obtained, and pinyin sequences and character segments X of each noun with different lengths are obtained i:j Calculating the edit distance similarity between the two, and taking all nouns with the similarity being more than or equal to a preset unequal length noun recall threshold value as initially selected unequal length candidate words to obtain an initial unequal length candidate word set.
For each character segment X_{i:j} to be corrected, after the recalled initial equal-length and unequal-length candidate word sets are obtained, the candidate words are further scored, ranked, and screened, and the candidate with the highest score above a preset threshold is selected as the replacement word for the erroneous segment.
In conventional candidate ranking-screening methods, the erroneous character segments in the sentence to be corrected are replaced with candidate words to obtain a series of candidate sentences, and then the best candidate sentence (i.e., the best candidate word) is selected by computing a perplexity score with an n-gram language model or a neural language model, realizing text correction. However, n-gram and neural language models depend on the scale and distribution of the training corpus: when a proper noun occurs rarely or not at all in the training corpus (the low-frequency case), the perplexity computed for the correct sentence containing that proper noun is high, so the language model tends to select another noun that occurs frequently in the training corpus, leading to missed corrections and mis-corrections. Hence, candidate ranking-screening based on language model perplexity scores cannot solve low-frequency word-granularity errors.
To address the difficulty of correcting low-frequency word-granularity errors, the candidate ranking-screening method in this scheme ranks and screens candidate words using only features such as the similarity between the erroneous character segment and the candidate word, without features such as perplexity that are computed by models like n-gram language models and are easily affected by the training corpus distribution.
The features for candidate ranking-screening are computed as follows: for the character segment X_{i:j} to be corrected and each candidate word in the recalled initial equal-length and unequal-length candidate word sets, Chinese-character-level, phonetic-level, and length features are computed:
Chinese-character-level features: convert the character segment X_{i:j} to be corrected and the candidate word into Chinese character sequences, then compute the edit distance similarity and Jaccard similarity between them as the Chinese-character-level features.
Phonetic-level features: convert the character segment X_{i:j} to be corrected and the candidate word into pinyin sequences, obtain the initial (consonant) and final (vowel) sequences, and compute the edit distance similarity and Jaccard similarity between them as the phonetic-level features. The phonetic-level features capture the pronunciation similarity between the character segment to be corrected and the candidate word.
Length features: compute the length of each character segment X_{i:j} to be corrected and the length difference between it and the candidate word as the character length features. The length features allow the correction threshold to adapt to character segments of different lengths.
The scoring function used in candidate ranking-screening is as follows: after the Chinese-character-level, phonetic-level, and length features are obtained, a linear regression model is trained on pairs of erroneous words and their corresponding correct words extracted from the training set constructed in step S2. The linear regression model computes a comprehensive similarity score s_overall between the character segment to be corrected and the candidate word from these features, which is truncated to obtain the score s = max(0, min(s_overall, 1)). This similarity score integrates all feature aspects and can effectively improve correction accuracy.
For the character segment to be corrected and each candidate word in the initial equal-length and unequal-length candidate word sets, the score s between them is computed; candidates with score s greater than or equal to a preset screening threshold are kept, and those below the threshold are removed, yielding new equal-length and unequal-length candidate word sets.
If the filtered equal-length candidate word set is not empty, its candidates are sorted by score s in descending order and the equal-length candidate with the highest score is selected as the final replacement word of the character segment to be corrected; if the filtered equal-length set is empty but the filtered unequal-length set is not, the unequal-length candidates are sorted by score s in descending order and the highest-scoring one is selected as the final replacement word; if both candidate word sets are empty, the character segment is not corrected. This erroneous-word correction method can effectively correct not only high-frequency word-granularity errors but also low-frequency and unseen errors (especially proper nouns that occur rarely in the training corpus).
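The recall and ranking stages of S52 can be sketched as follows, reusing edit_sim and jaccard_sim from the sketch above; the pypinyin tool, the feature order, the regression weights, and the thresholds are all assumptions (in the method the weights are fitted on error/correct word pairs from the training set).

    from pypinyin import lazy_pinyin   # assumed pinyin tool

    def pinyin_of(word: str) -> str:
        return "".join(lazy_pinyin(word))

    def recall_candidates(span, lexicon, thr_eq=0.6, thr_neq=0.8):
        p = pinyin_of(span)
        cands = []
        for noun in lexicon:
            thr = thr_eq if len(noun) == len(span) else thr_neq
            if edit_sim(pinyin_of(noun), p) >= thr:
                cands.append(noun)
        return cands

    def score(span, cand, w=(0.3, 0.2, 0.4, 0.2, -0.1), b=0.0):
        feats = [edit_sim(span, cand),                        # character level
                 jaccard_sim(span, cand),
                 edit_sim(pinyin_of(span), pinyin_of(cand)),  # phonetic level
                 jaccard_sim(pinyin_of(span), pinyin_of(cand)),
                 abs(len(span) - len(cand))]                  # length feature
        s = sum(wi * f for wi, f in zip(w, feats)) + b
        return max(0.0, min(s, 1.0))                          # truncation

    best = max(recall_candidates("项木", ["项目", "木马"]),
               key=lambda c: score("项木", c), default=None)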
S6: the training corpus is constructed in the S2, and the pinyin coding module in the S3 and the embellishing module and the encoder of the pre-training language model are jointly trained in a multitask learning mode, wherein the word granularity error detection layer, the word granularity error detection neural network and the word granularity error correction layer in the S4 are adopted. The method comprises the following steps:
Specifically, the three steps of character-granularity error detection and word-granularity error detection in S4 and character-granularity error correction in S5 are all based on the encoder of a pre-trained language model (e.g., BERT, RoBERTa, ALBERT); they share the parameters of the pre-trained language model and complement each other. Improving the accuracy of character-granularity error detection improves the accuracy of character-granularity error correction, and character-granularity and word-granularity error detection benefit each other: if a character in the text is detected as erroneous, a word (character segment) containing that character is also erroneous; if a character segment is detected as erroneous, it must contain some erroneous character. To share information among the multiple steps, let them complement each other, and improve their mutual accuracy, the invention jointly trains the pinyin encoding module in S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in S4, and the character-granularity error correction layer in S5 in a multi-task learning manner. In addition, the multiple steps share one model, which reduces memory usage and speeds up model inference and prediction.
Specifically, taking a sample (X, G, Y, Z) as an example, wherein x= (X) 1 ,x 2 ,…,x n )、G=(g 1 ,g 2 ,…,g n )、Y=(y 1 ,y 2 ,…,y n )、Z={(i,j,z i,j ):1<=i<j<N, n is the length of text X.
Defining the loss of the word granularity error detection layer in S4 as:
Figure BDA0004074043460000151
defining the loss of the word granularity error detection neural network in S4 as:
Figure BDA0004074043460000152
defining the loss of the word granularity error correction layer in S5 as:
Figure BDA0004074043460000153
the three above loss functions are linearly combined to obtain the following total loss function:
Figure BDA0004074043460000154
wherein the method comprises the steps of0<λ 123 <=1(λ 123 =1) is the weight coefficient of three loss functions for balancing the effect of the individual losses. Lambda (lambda) 123 And selecting through the effect of the model on the verification set.
Using the training dataset constructed in S2, the parameters of the model are optimized with the AdamW optimizer, using a mini-batch gradient descent algorithm to minimize the total loss L.
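A sketch of the joint objective of equations (8)-(11), assuming PyTorch; the cross-entropy forms and the lambda values are illustrative.

    import torch
    import torch.nn.functional as F

    def total_loss(p_char, g, p_span, z, char_logits, y,
                   lam=(0.4, 0.3, 0.3)):
        l_det_char = F.binary_cross_entropy(p_char, g)   # eq. (8)
        l_det_span = F.binary_cross_entropy(p_span, z)   # eq. (9)
        # char_logits are the pre-softmax scores of the correction layer
        l_cor_char = F.cross_entropy(char_logits, y)     # eq. (10)
        return (lam[0] * l_det_char + lam[1] * l_det_span
                + lam[2] * l_cor_char)                   # eq. (11)

    # optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)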
After training, the model structure and parameters are saved; at prediction time, the model is loaded for inference.
S7: and for the text to be corrected, after the candidate replacement words with the wrong word granularity and the candidate replacement words with the wrong word granularity are obtained from S5, fusing word granularity correction results according to a preset rule to obtain the corrected text. The specific process is as follows:
obtaining correction results from the word granularity correction step of S51
Figure BDA0004074043460000155
Wherein SET char Is the sequence number set of the error characters detected by S41 word granularity error, x i Is the i-th character of the text X to be corrected, is entered>
Figure BDA0004074043460000156
Is the correct character predicted by the model in the step of S51 word granularity correction, i.e. the candidate replacement word.
Obtaining a correction result COR from the step of correcting the granularity of the S52 words span ={(X i:j ,W ij ):(i,j)∈
SET span } wherein SET span Is the sequence number set of the starting and ending positions of the error character fragments obtained by S42 word granularity error detection, X i:j Is the character segment to be corrected from the sequence number i to j in the text to be corrected X is input, W ij The candidate replacement words are obtained from candidate recall, sorting and screening in the step of correcting the granularity of the S52 words.
Considering that each character-granularity error in the text may be contained within some word-granularity error, the character-granularity and word-granularity error correction steps may produce inconsistent corrections for the same error. To avoid redundant and conflicting corrections, the present invention processes each character-granularity correction result (x_i, \hat{x}_i) as follows: if x_i is contained in some erroneous character span in COR_{span}, it is not corrected at the character granularity; otherwise, x_i in the input text X to be corrected is replaced by \hat{x}_i.
For each word-granularity correction result (X_{i:j}, W_{ij}), the span X_{i:j} in the input text X to be corrected is replaced by W_{ij}. The corrected text X_{cor} is finally obtained in this manner.
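The fusion rule of S7 can be summarized in a few lines; the sketch below uses invented container shapes (cor_char maps 1-based character indices to predicted characters, cor_span maps 1-based inclusive (start, end) pairs to replacement words) and assumes the erroneous spans do not overlap:

def fuse_corrections(text, cor_char, cor_span):
    # Collect every position covered by an erroneous word-granularity span.
    covered = set()
    for (i, j) in cor_span:
        covered.update(range(i, j + 1))
    chars = list(text)
    # Character-granularity corrections are length-preserving; apply them first,
    # skipping characters that fall inside an erroneous span (word granularity wins).
    for i, c in cor_char.items():
        if i not in covered:
            chars[i - 1] = c
    # Word-granularity replacements may change the length, so apply them
    # right-to-left to keep the earlier indices valid.
    for (i, j), word in sorted(cor_span.items(), reverse=True):
        chars[i - 1:j] = list(word)
    return "".join(chars)

# fuse_corrections("天气真号", cor_char={4: "好"}, cor_span={(3, 4): "真好"})
# returns "天气真好"; the position-4 character fix is skipped in favor of the span.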
Example 2
The invention also provides an embodiment of the multi-granularity Chinese text error correction device corresponding to the embodiment of the multi-granularity Chinese text error correction method.
Referring to fig. 5, the multi-granularity Chinese text error correction device provided by the embodiment of the present invention includes a memory and one or more processors, where the memory stores executable code, and the one or more processors, when executing the executable code, implement the multi-granularity Chinese text error correction method of the above embodiment.
The embodiment of the multi-granularity Chinese text error correction device provided by the embodiment of the present invention can be applied to any device with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the device in a logical sense is formed by the processor of the device on which it runs reading the corresponding computer program instructions from a nonvolatile memory into memory. In terms of hardware, fig. 5 shows a hardware structure diagram of the device with data processing capability on which the multi-granularity Chinese text error correction device is located; in addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 5, the device generally also includes other hardware according to its actual function, which will not be described here again.
The implementation process of the functions and roles of each unit in the above device is described in the implementation process of the corresponding steps in the above method, and will not be repeated here.
Since the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the invention without creative effort.
Example 3
The embodiment of the invention also provides a computer readable storage medium on which a program is stored; when executed by a processor, the program implements the multi-granularity Chinese text error correction method of the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the previous embodiments. The computer readable storage medium may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A multi-granularity Chinese text error correction method, characterized by comprising the following steps:
S1: preprocessing a Chinese text to be corrected;
S2: constructing a noun knowledge base, collecting error correction raw data, and constructing a training corpus for the text error correction model;
S3: vector-encoding the input text to be corrected using a pre-trained language model and fusing the phonetic information of the text to obtain the corresponding character vector sequence;
S4: after the text character vector sequence is obtained in step S3, using a character-granularity error detection layer formed by a fully-connected layer to detect character-granularity errors in the text, and using a word-granularity error detection neural network based on the head character, tail character, and relative distance features of character segments to detect word-granularity errors in the text, thereby obtaining an erroneous character set and an erroneous word set;
S5: correcting the erroneous character set and the erroneous word set detected in step S4 respectively to obtain candidate replacement characters for character-granularity errors and candidate replacement words for word-granularity errors; for each erroneous character detected in S4, predicting the correct character using a character-granularity error correction layer formed by one fully-connected layer to obtain a candidate replacement character; for each erroneous word detected in S4, obtaining the corresponding corrected word from the noun knowledge base constructed in S2 by means of candidate recall and candidate ranking and screening, to obtain candidate replacement words;
S6: using the training corpus constructed in S2, jointly training, in a multi-task learning manner, the pinyin coding module in S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in S4, and the character-granularity error correction layer in S5;
S7: for the text to be corrected, after the character-granularity candidate replacement characters and the word-granularity candidate replacement words are obtained from S5, fusing the character-granularity and word-granularity correction results according to preset rules to obtain the corrected text.
2. The multi-granularity Chinese text error correction method according to claim 1, wherein the preprocessing of the Chinese text to be corrected in step S1 includes Unicode text normalization, traditional-to-simplified Chinese conversion, punctuation recovery, and digit preprocessing.
3. The multi-granularity Chinese text error correction method according to claim 1, wherein the noun knowledge base constructed in step S2 includes common nouns obtained from public existing noun knowledge bases, and proper nouns obtained from text corpus data of a vertical or restricted domain by manual or statistical/rule-based methods; the training corpus in step S2 is constructed as follows: error correction raw data collected from an intelligent question-answering system or a speech recognition system is preprocessed in the manner of step S1; the character-granularity error label sequence, character-granularity correct character sequence, and word-granularity error labels of each text are annotated manually; after annotation, the data is randomly divided into a training set, a validation set, and a test set for training the subsequent error correction model.
4. The multi-granularity Chinese text error correction method according to claim 1, wherein the text to be corrected is vector-encoded in step S3 as follows: first, the character index sequence of the input text to be corrected is obtained using the vocabulary of the pre-trained language model; then, the character embedding vector and position embedding vector of each character are obtained using the embedding module of the pre-trained language model; next, the pinyin sequence of each character is encoded by a neural network to obtain a phonetic embedding vector; the three are added to obtain the final embedding vector of the character, which is then input into the Transformer-based encoder of the pre-trained language model to obtain the character vector sequence of the text.
5. The multi-granularity Chinese text error correction method according to claim 1, wherein character-granularity error detection and word-granularity error detection are performed on the text to be corrected in S4 as follows: after the text character vector sequence is obtained in step S3, the error probability of each character in the text is obtained using a character-granularity error detection layer formed by a fully-connected layer, and the characters whose error probability is greater than a preset threshold are collected to obtain the erroneous character set; word-granularity error detection calculates the error probability of each character segment using a multi-layer neural network based on the head character, tail character, and relative distance of the segment, and the character segments whose error probability is greater than a preset threshold are collected to obtain the erroneous word set.
6. The multi-granularity Chinese text error correction method according to claim 1, wherein step S5 performs character-granularity error correction and word-granularity error correction on the erroneous character set and the erroneous word set detected in step S4 respectively: in the character-granularity error correction step, a character-granularity error correction layer formed by a fully-connected layer predicts, for each erroneous character, the probability of correcting it into each character of the vocabulary, and the character with the maximum probability is selected as the candidate replacement for the erroneous character; the word-granularity error correction step first recalls equal-length and unequal-length candidate words from the noun knowledge base constructed in step S2 based on the pinyin edit distance, then uses a truncated linear regression model, whose inputs include Chinese-character-level, pinyin-level, and other features, to compute a comprehensive similarity for ranking and screening the candidate words, and takes the highest-scoring candidate whose score is greater than a preset threshold as the candidate replacement word for the erroneous word.
7. The multi-granularity Chinese text error correction method according to claim 1, wherein step S6 uses the training corpus constructed in step S2 to jointly train, in a multi-task learning manner, the pinyin coding module in step S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in step S4, and the character-granularity error correction layer in step S5; the total loss function is a weighted average of the loss of the character-granularity error detection layer in S4, the loss of the word-granularity error detection neural network in S4, and the loss of the character-granularity error correction layer in S5, and the parameters of the model are optimized by minimizing the total loss function.
8. The multi-granularity Chinese text error correction method according to claim 1, wherein in step S7, after the character-granularity candidate replacement characters and the word-granularity candidate replacement words are obtained from S5, the character-granularity and word-granularity correction results are fused according to preset rules to obtain the corrected text; when a character-granularity correction result conflicts with a word-granularity correction result, the preset rule preferentially adopts the word-granularity correction result.
9. A multi-granularity Chinese text error correction device, comprising a memory and a processor, wherein the memory is coupled to the processor; the memory is configured to store program data, and the processor is configured to execute the program data to implement the multi-granularity Chinese text error correction method according to any one of claims 1-8.
10. A computer readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the multi-granularity Chinese text error correction method according to any one of claims 1-8.
CN202310088091.8A 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device Pending CN116127952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088091.8A CN116127952A (en) 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310088091.8A CN116127952A (en) 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device

Publications (1)

Publication Number Publication Date
CN116127952A true CN116127952A (en) 2023-05-16

Family

ID=86309757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310088091.8A Pending CN116127952A (en) 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device

Country Status (1)

Country Link
CN (1) CN116127952A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306600B (en) * 2023-05-25 2023-08-11 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method
CN116306600A (en) * 2023-05-25 2023-06-23 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method
CN116502629A (en) * 2023-06-20 2023-07-28 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching
CN116502629B (en) * 2023-06-20 2023-08-18 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching
CN116681070A (en) * 2023-08-04 2023-09-01 北京永辉科技有限公司 Text error correction method, system, model training method, medium and equipment
CN116991874B (en) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 Text error correction and large model-based SQL sentence generation method and device
CN116991874A (en) * 2023-09-26 2023-11-03 海信集团控股股份有限公司 Text error correction and large model-based SQL sentence generation method and device
CN117094311A (en) * 2023-10-19 2023-11-21 山东齐鲁壹点传媒有限公司 Method for establishing error correction filter for Chinese grammar error correction
CN117094311B (en) * 2023-10-19 2024-01-26 山东齐鲁壹点传媒有限公司 Method for establishing error correction filter for Chinese grammar error correction
CN117151084B (en) * 2023-10-31 2024-02-23 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment
CN117151084A (en) * 2023-10-31 2023-12-01 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment
CN117556363A (en) * 2024-01-11 2024-02-13 中电科大数据研究院有限公司 Data set abnormality identification method based on multi-source data joint detection
CN117556363B (en) * 2024-01-11 2024-04-09 中电科大数据研究院有限公司 Data set abnormality identification method based on multi-source data joint detection

Similar Documents

Publication Publication Date Title
CN110135457B (en) Event trigger word extraction method and system based on self-encoder fusion document information
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
Abandah et al. Automatic diacritization of Arabic text using recurrent neural networks
WO2019085779A1 (en) Machine processing and text correction method and device, computing equipment and storage media
CN110046350B (en) Grammar error recognition method, device, computer equipment and storage medium
US8185376B2 (en) Identifying language origin of words
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN111046670B (en) Entity and relationship combined extraction method based on drug case legal documents
CN112836496B (en) Text error correction method based on BERT and feedforward neural network
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN114169330A (en) Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN114818668B (en) Name correction method and device for voice transcription text and computer equipment
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN110909144A (en) Question-answer dialogue method and device, electronic equipment and computer readable storage medium
Jauhiainen et al. Language model adaptation for language and dialect identification of text
CN111930939A (en) Text detection method and device
CN114818669B (en) Method for constructing name error correction model and computer equipment
Hládek et al. Learning string distance with smoothing for OCR spelling correction
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
CN114386399A (en) Text error correction method and device
Chen et al. Integrated semantic and phonetic post-correction for chinese speech recognition
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113901210B (en) Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination