CN116127952A - Multi-granularity Chinese text error correction method and device

Multi-granularity Chinese text error correction method and device

Info

Publication number
CN116127952A
CN116127952A
Authority
CN
China
Prior art keywords
word
granularity
error
text
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310088091.8A
Other languages
Chinese (zh)
Inventor
赵鑫安
宋伟
朱世强
谢冰
袭向明
尹越
王雨菡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310088091.8A priority Critical patent/CN116127952A/en
Publication of CN116127952A publication Critical patent/CN116127952A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/169: Annotation, e.g. comment data or footnotes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A multi-granularity Chinese text error correction method, comprising: preprocessing the Chinese text to be corrected; constructing a noun knowledge base and a text error correction training corpus; vector-encoding the input text to be corrected with a pre-trained language model and fusing the phonetic information of the text to obtain a character vector sequence; detecting character-granularity and word-granularity errors in the text with neural networks to obtain an erroneous character set and an erroneous word set; correcting the detected character-granularity and word-granularity errors respectively to obtain candidate replacement characters and candidate replacement words; jointly training the whole model in a multi-task learning manner; and fusing the character-granularity and word-granularity correction results to obtain the corrected text. The invention also comprises a multi-granularity Chinese text error correction device. The invention can effectively correct errors of multiple granularities (character granularity and word granularity) in text, trains the whole model in a multi-task learning manner, and achieves good correction accuracy and practicality.

Description

Multi-granularity Chinese text error correction method and device
Technical Field
The invention relates to the field of text correction, in particular to a multi-granularity Chinese text correction method and device.
Background
In recent years, with the rapid development of artificial intelligence technology, automatic speech recognition has made major breakthroughs in theory and application and is widely applied in fields such as intelligent robots, intelligent customer service, and speech-recognition dictation machines. However, owing to the user's accent, dialect, and manner of expression, background noise, and defects of the speech recognition model itself, speech recognition technology cannot convert the speech signal into text completely correctly; in particular, the recognition accuracy for proper nouns in vertical domains (such as person names and object nouns, which occur with low frequency) is low. Speech recognition errors in turn greatly affect the accuracy and recall of downstream tasks (such as intent recognition, text retrieval, and entity recognition).
To solve the problem of speech recognition errors, text error correction technology is required to detect and correct errors in the text obtained by speech recognition. Text error correction plays an important role in many natural language processing (NLP) applications and serves as an indispensable upstream module. The existing technical schemes for text error correction and their problems are as follows:
(1) Text error correction methods based on statistical language models first detect erroneous characters by scoring local n-grams of the text to be corrected, then sequentially substitute the candidate characters corresponding to each erroneous character using a pre-constructed confusion set to generate candidate texts, and finally screen the optimal text by computing perplexity scores of the candidate texts or by applying set rules. Since statistical language models only use word frequency information from the text corpus, they cannot exploit semantic representations or long-range context information, and the correction effect is limited by the confusion set, so the accuracy of such methods is limited.
(2) In recent years, deep learning models represented by CNNs and RNNs, and especially pre-trained language models represented by BERT, have demonstrated their effectiveness on many NLP tasks. Accordingly, many approaches use pre-trained language models such as BERT to achieve end-to-end text correction; thanks to their powerful language understanding and representation capabilities, these approaches achieve good results on many text correction tasks, particularly on common character-granularity errors. However, text correction methods based on pre-trained language models are prone to interference from contextual noise, correct texts containing multiple errors poorly, and handle word-granularity errors poorly, especially proper nouns in vertical domains.
(3) To handle word-granularity errors, most existing methods identify potentially erroneous words through word segmentation (using tools such as jieba or CoreNLP) or named entity recognition (using models such as hidden Markov models (HMM), conditional random fields (CRF), or BiLSTM+CRF), and then correct them through candidate word recall, ranking, and screening. However, limited by the performance of segmentation algorithms, word mis-segmentation often occurs; moreover, the named entity recognition models listed above often mis-recognize or miss entities and in particular cannot handle nested entity nouns in text, so the overall correction accuracy is limited, and only word-granularity errors can be corrected.
In addition, all the above methods are easily affected by the distribution of the training data and recognize low-frequency proper noun errors poorly. Therefore, correcting character-granularity and word-granularity errors at the same time, covering both high-frequency common character-granularity errors and low-frequency proper noun errors (especially proper noun errors in vertical domains), is of great significance for improving the performance of text error correction algorithms.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a multi-granularity Chinese text error correction method and device.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a first aspect of the embodiment of the invention provides a multi-granularity Chinese text error correction method, which comprises the following steps:
S1: preprocessing the Chinese text to be corrected;
S2: constructing a noun knowledge base, collecting raw error correction data, and constructing the training corpus of the text error correction model;
S3: vector-encoding the input text to be corrected with a pre-trained language model and fusing the phonetic information of the text to obtain the corresponding character vector sequence;
S4: after the text character vector sequence is obtained in step S3, detecting character-granularity errors in the text with a character-granularity error detection layer consisting of a fully connected layer, and detecting word-granularity errors in the text with a word-granularity error detection neural network built from the head character, tail character, and relative distance features of character segments, obtaining an erroneous character set and an erroneous word set;
S5: correcting the erroneous character set and the erroneous word set detected in step S4 respectively to obtain candidate replacement characters for the character-granularity errors and candidate replacement words for the word-granularity errors; for the erroneous characters detected in S4, predicting the correct characters with a character-granularity error correction layer consisting of one fully connected layer to obtain candidate replacement characters; for the erroneous words detected in S4, obtaining the corresponding correct words from the noun knowledge base constructed in S2 by candidate recall and candidate ranking-screening to obtain candidate replacement words;
S6: using the training corpus constructed in S2, jointly training, in a multi-task learning manner, the pinyin encoding module in S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in S4, and the character-granularity error correction layer in S5;
S7: for the text to be corrected, after the character-granularity candidate replacement characters and word-granularity candidate replacement words are obtained from S5, fusing the character-granularity and word-granularity correction results according to a preset rule to obtain the corrected text.
Further, the step S1 specifically includes: to improve the accuracy of subsequent correction, the Chinese text to be corrected is preprocessed in advance; the preprocessing step comprises Unicode text normalization, traditional-simplified Chinese conversion, punctuation restoration, and digit preprocessing.
Further, the noun knowledge base constructed in step S2 includes common nouns obtained from public existing noun knowledge bases and proper nouns obtained from text corpus data of a vertical or limited domain by manual or statistical/rule-based methods. Further, the raw error correction data collected from an intelligent question-answering system or a speech recognition system are preprocessed as described in step S1, and the character-granularity error label sequence, character-granularity correct character sequence, and word-granularity error labels of each text are labeled manually; after labeling, the data are randomly divided into a training set, a validation set, and a test set for training the subsequent error correction model.
Further, the step S3 specifically includes: first, the character index sequence of the input text to be corrected is obtained using the vocabulary of a pre-trained language model; then the embedding module of the pre-trained language model yields the character embedding vector and position embedding vector of each character; next, a neural network encodes the pinyin sequence of each character to obtain its phonetic embedding vector; the three are added to obtain the final embedding vector of the character, which is then input into the Transformer-based encoder of the pre-trained language model to obtain the character vector sequence of the text.
Further, in step S4, the method for performing character-granularity error detection and word-granularity error detection on the text to be corrected specifically includes: after the text character vector sequence is obtained in step S3, character-granularity error detection uses a character-granularity error detection layer consisting of a fully connected layer to obtain the character-granularity error probability of each character in the text, and the characters whose error probability exceeds the preset character-granularity error threshold ε_char are collected as the erroneous character set.
Still further, the word-granularity error detection step operates on the character segment from character x_i to character x_j (1<=i<j<=n, n is the text length). Taking the head character, tail character, and relative distance of the character segment as features, it obtains the head-character representation vector h_i^start and the tail-character representation vector h_j^end, each through a fully connected layer and a ReLU layer, and encodes the distance between x_i and x_j with a relative distance encoding function to obtain e_dist. The three vectors h_i^start, h_j^end, and e_dist are input into an erroneous segment classification layer to obtain the probability that the segment is erroneous; if this probability is greater than or equal to the preset word-granularity error threshold ε_span, the character segment is erroneous and needs correction. All character segments whose error probability exceeds the preset threshold are collected as the erroneous word set. Character-granularity error correction and word-granularity error correction are then performed respectively on the erroneous character set and erroneous word set detected in step S4.
Further, the step S5 specifically includes: for the erroneous characters detected in S4, the character-granularity error correction step takes the erroneous character representations and error probabilities as features, uses a character-granularity error correction layer consisting of one fully connected layer to predict the probability of correcting each erroneous character into each character in the vocabulary, and selects the character with the maximum probability as the candidate replacement of the erroneous character; for the erroneous words detected in step S4, the word-granularity error correction step recalls equal-length and unequal-length candidate words from the noun knowledge base constructed in step S2 based on pinyin edit distance, then uses a truncated linear regression model that takes Chinese-character-level and pinyin-level features as input to compute a comprehensive similarity for ranking and screening the candidate words, and takes the candidate word with the highest score that exceeds a preset threshold as the candidate replacement word of the erroneous word.
Further, the total loss function in step S6 is a weighted average of the loss of the character-granularity error detection layer in S4, the loss of the word-granularity error detection neural network in S4, and the loss of the character-granularity error correction layer in S5; the parameters of the model are optimized by minimizing the total loss function.
Further, the step S7 specifically includes: fusing the character-granularity and word-granularity correction results according to a preset rule so as to avoid conflicts and mis-corrections between them, and replacing each erroneous character or word in the input text to be corrected with its replacement to obtain the corrected text; when a character-granularity correction result conflicts with a word-granularity correction result, the preset rule preferentially adopts the word-granularity correction result.
A second aspect of an embodiment of the present invention provides a multi-granularity Chinese text error correction apparatus, comprising a memory and a processor, the memory coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to implement the above multi-granularity Chinese text error correction method.
A third aspect of an embodiment of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-granularity Chinese text error correction method described above.
The beneficial effects of the invention are as follows:
1. The method and the device use the strong text feature extraction capability of the pre-trained language model to encode the input text to be corrected, fusing the phonetic information of the text during encoding, which improves the accuracy of subsequent text correction.
2. In detecting errors in the text, character-granularity errors and word-granularity errors are detected separately based on the text vector sequence encoded by the pre-trained language model, covering both granularities at once. The word-granularity error detection neural network, built from the head character, tail character, and relative distance of character segments, detects word-granularity errors without being affected by word segmentation errors and can solve the problem that traditional named entity recognition models cannot recognize nested entity nouns.
3. In the correction process, character-granularity errors and word-granularity errors are corrected separately. For character-granularity errors, the pre-trained language model is used directly to predict the correct characters. For word-granularity errors, candidate recall and ranking are used: based on Chinese character, pinyin, and other features, a well-performing comprehensive similarity function is built on a truncated linear regression model, which can correct both high-frequency and low-frequency word-granularity errors without being affected by the data distribution. The method can therefore correct multi-granularity errors (character-granularity and word-granularity errors) while covering both high-frequency character-granularity errors and low-frequency word-granularity errors (especially low-frequency proper noun errors), and the two granularities complement each other to achieve higher correction accuracy.
4. The invention jointly trains the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer, the word-granularity error detection neural network, and the character-granularity error correction layer in a multi-task learning manner to optimize the parameters of the model. On one hand, the modules share the Transformer encoder of the pre-trained language model, reducing memory usage and inference time; on the other hand, information is shared among the modules so that they complement one another and improve each other's accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the overall model architecture according to the present invention;
FIG. 3 is a block diagram of a word granularity error detection neural network;
FIG. 4 is a block diagram of the character-granularity error detection layer and error correction layer;
FIG. 5 is a schematic diagram of the structure of the apparatus of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
Example 1
The following describes a multi-granularity Chinese text error correction method in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
As shown in FIG. 1, the invention provides a multi-granularity Chinese text error correction method, comprising the following steps:
s1: and preprocessing the Chinese text to be corrected.
Specifically, the preprocessing step comprises Unicode text normalization, traditional-simplified Chinese conversion, punctuation restoration, and digit preprocessing, described as follows:
s11: unicode text normalization.
Since the input text to be corrected may come from different sources and use different encoding schemes, Unicode normalization needs to be performed on the text to be corrected to avoid affecting subsequent modules.
S12: and (5) complex conversion processing.
Because Chinese users from different regions can use different Chinese character standards, open-source kits such as OpenCC are required to uniformly convert input text to be corrected into simplified Chinese.
S13: punctuation recovery.
Text obtained from systems such as speech recognition generally contains no punctuation marks, yielding unsegmented character sequences with low readability and incoherent semantics, which affects downstream tasks such as the subsequent error correction model. It is therefore necessary to add the omitted punctuation marks to the text using punctuation restoration techniques. This method uses a neural-network-based sequence labeling method (e.g., RNN, CNN, Transformer) to predict the missing punctuation marks in the text.
S14: digital preprocessing.
Systems such as speech recognition may incorrectly transcribe Chinese characters as Arabic numerals; for example, a speech recognition model may transcribe "how is this scheme implemented" with the character "实" (shí, part of "实现", "implement") misrecognized as its homophone "10" (shí), and Arabic numerals cannot be converted to pinyin by the pinyin tools. It is therefore necessary to replace the simple Arabic numerals (0, 1, …, 9, 10, 20, …, 90, 100, 200, …, 900) that are not part of dates, times, sequence numbers, or other identifiers in the text with the corresponding Chinese characters, for example converting "10" into "十" ("ten").
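As an illustration of this preprocessing pipeline, the following minimal Python sketch combines S11, S12, and S14; the digit table, the use of the opencc package, and the helper name preprocess are assumptions for illustration, not part of the patent.

    import re
    import unicodedata

    # Simple Arabic numerals -> Chinese characters, as described in S14.
    DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
                 "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
                 "10": "十"}

    def preprocess(text: str) -> str:
        # S11: Unicode normalization (NFKC folds full-width forms, etc.).
        text = unicodedata.normalize("NFKC", text)
        # S12: traditional -> simplified conversion via OpenCC, if available.
        try:
            from opencc import OpenCC
            text = OpenCC("t2s").convert(text)
        except ImportError:
            pass
        # S14: replace simple numerals with Chinese characters; a real
        # system would first exclude dates, times, and sequence numbers.
        return re.sub(r"\d+",
                      lambda m: DIGIT_MAP.get(m.group(), m.group()), text)

    print(preprocess("这个方案是如何10现的"))  # -> 这个方案是如何十现的

Punctuation restoration (S13) is omitted here because it requires a trained sequence labeling model.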
S2: and constructing a noun knowledge base, collecting error correction original data and constructing a training corpus of a text error correction model.
Specifically, this step comprises two sub-steps, constructing the noun knowledge base and constructing the training corpus of the text error correction model, described as follows:
s21: and constructing a noun knowledge base.
Specifically, before text correction is performed, a noun knowledge base containing common nouns and proper nouns needs to be built in advance. Nouns refer to words denoting entities or abstract things such as people, objects, places, and concepts, and can be divided into common nouns and proper nouns according to their generality. Common nouns may be obtained from public existing noun knowledge bases, and proper nouns may be obtained from text corpus data of a vertical or limited domain by manual or statistical/rule-based methods; the specific method is not limited herein.
S22: and collecting error correction original data and constructing a training corpus of the text error correction model. The method comprises the following steps:
Training the text error correction model requires an error correction corpus. The raw error correction data can be obtained from the history logs of an intelligent question-answering system, or generated manually using a speech recognition system.
After the raw error correction data are obtained, each text is preprocessed as described in step S1.
Word recognition errors occurring in the texts of the raw error correction data are labeled manually and used to construct the training corpus of the text error correction model. Specifically, for each text X to be corrected with character sequence (x_1, x_2, …, x_n) (n is the length of the text, x_i is the i-th character; X may or may not contain errors), manual labeling yields the character-granularity error label sequence G = (g_1, g_2, …, g_n), where g_i = 1 means the i-th character x_i is erroneous and g_i = 0 means it is correct; the character-granularity correct character sequence Y = (y_1, y_2, …, y_n), where y_i is the correct character corresponding to x_i (if x_i contains no error, x_i = y_i); and the word-granularity (character segment) error labels Z = {(i, j, z_{i,j}): 1<=i<j<=n}, where z_{i,j} = 1 means the character segment X_{i:j} from i to j is an erroneous proper noun and z_{i,j} = 0 means X_{i:j} contains no error or is a correct word. Note that the speech recognition system may recognize a noun as an erroneous noun of unequal length; therefore, when labeling word-granularity errors, if the erroneous noun is shorter than the correct noun, a longest-common-subsequence alignment is used to obtain the correct character corresponding to each character of the erroneous noun; if the erroneous noun is longer than the correct noun, the correct characters corresponding to the extra characters are set to the null character. In this way, the labeled sample corresponding to each text X to be corrected is (X, G, Y, Z).
For example, suppose the speech recognition system misrecognizes "这个项目的负责人是谁?" ("Who is the person responsible for this project?") as "这个项木的负责人是水?", where "目" in "项目" ("project") is misrecognized as "木" ("wood") and "谁" ("who") is misrecognized as "水" ("water"). From the text to be corrected "这个项木的负责人是水?", the labeled data obtained are:
X = (这, 个, 项, 木, 的, 负, 责, 人, 是, 水, ?),
G = (0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0),
Y = (这, 个, 项, 目, 的, 负, 责, 人, 是, 谁, ?),
Z = {(3, 4, 1)} ∪ {(i, j, 0): 1<=i<j<=11, (i, j) != (3, 4)}.
If the collected raw error correction data contain K texts, labeling in the above manner yields the dataset {(X_k, G_k, Y_k, Z_k): 1<=k<=K}, where X_k = (x_{k,1}, …, x_{k,n_k}), G_k = (g_{k,1}, …, g_{k,n_k}), Y_k = (y_{k,1}, …, y_{k,n_k}), and Z_k = {(i, j, z_{k,i,j}): 1<=i<j<=n_k} are the character sequence, character-granularity error label sequence, character-granularity correct character sequence, and word-granularity (character segment) error labels of the k-th text, n_k is the character sequence length of that text, and K is the total number of samples in the dataset.
After the raw error correction data are labeled, the labeled dataset is randomly shuffled and divided in an 8:1:1 ratio into a training set, a validation set, and a test set, used respectively for training the text error correction model, tuning the model hyperparameters, and evaluating the model.
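The labeled sample layout and the 8:1:1 split can be sketched as follows; the dictionary field names are illustrative assumptions, and the spans in this sketch use 0-based Python indices (the patent text counts characters from 1).

    import random

    sample = {
        "X": list("这个项木的负责人是水?"),        # text with errors
        "G": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],    # char-level error labels
        "Y": list("这个项目的负责人是谁?"),        # char-level corrections
        "Z": [(2, 3, 1)],                           # erroneous span "项木"
    }

    def split_dataset(samples, seed=42):
        rng = random.Random(seed)
        samples = samples[:]
        rng.shuffle(samples)
        n = len(samples)
        n_train, n_dev = int(0.8 * n), int(0.1 * n)
        return (samples[:n_train],                  # training set
                samples[n_train:n_train + n_dev],   # validation set
                samples[n_train + n_dev:])          # test set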
S3: and vector encoding is carried out on the input text to be corrected by using a pre-training language model, and the voice information of the text is fused to obtain a corresponding character vector sequence.
Specifically, for the input Chinese text to be corrected X = (x_1, x_2, …, x_n) (n is the length of the text, x_i is the i-th character), a corresponding integer index sequence is obtained using the vocabulary of a pre-trained language model (e.g., BERT, RoBERTa, ALBERT; the vocabulary contains common characters and character fragments and has size N). The embedding module of the pre-trained model then yields each character x_i's character embedding vector e_i^char and position embedding vector e_i^pos.
Considering that a speech recognition system usually misrecognizes words as homophones or near-homophones, phonetic information (i.e., pinyin information) is introduced when encoding the text, providing richer features and improving the accuracy of the text error correction model. For character x_i, if it is a Chinese character, a Python Chinese-character-to-pinyin tool (e.g., xpinyin) is used to obtain the corresponding pinyin letter sequence p_i; if it is not a Chinese character (e.g., an English letter or a digit), its pinyin letter sequence is set to p_i = x_i. For example, the pinyin letter sequence of the Chinese character "项" is (x, i, a, n, g), that of the English character "b" is (b), and that of the digit character "3" is (3). After the pinyin letter sequence p_i of character x_i is obtained, a pinyin encoding module consisting of one neural network layer (an RNN, LSTM, GRU, CNN, or similar network can be used) encodes p_i to obtain the phonetic embedding vector e_i^pin of character x_i. The character embedding vector e_i^char, position embedding vector e_i^pos, and phonetic embedding vector e_i^pin are then added to obtain the final embedding vector of the character, e_i = e_i^char + e_i^pos + e_i^pin. After the embedding vector of every character is obtained in this way, the embedding vector sequence of text X is E = (e_1, e_2, …, e_n).
Further, the embedding vector sequence E of text X is input into the encoder of the pre-trained language model, which consists of multiple Transformer layers. Taking BERT_base as an example, the BERT_base pre-trained language model consists of 12 identical Transformer layers. The input of each Transformer layer is the hidden state vector sequence output by the previous layer, obtained via multi-head self-attention, a feed-forward network, residual connections, and layer normalization. In this example, the hidden state vector sequence output by the last Transformer layer is taken as the final encoding vector sequence of text X, denoted H = (h_1, h_2, …, h_n), where h_i is the encoding vector of character x_i. Relying on the strong performance of the pre-trained language model, the semantic and grammatical information of each character in the text to be corrected is captured effectively, and fusing the phonetic information of the text further improves the subsequent correction effect.
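A minimal PyTorch sketch of this encoding step is given below, assuming HuggingFace Transformers and a GRU-based pinyin encoder; the module names, layer sizes, and the choice of bert-base-chinese are assumptions. When inputs_embeds is supplied, the BERT embedding module adds the position embeddings internally, so the sum e^char + e^pin + e^pos is formed as described above.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class PinyinEncoder(nn.Module):
        """Encodes each character's pinyin letter sequence with a GRU."""
        def __init__(self, n_letters=64, dim=768):
            super().__init__()
            self.letter_emb = nn.Embedding(n_letters, dim)
            self.gru = nn.GRU(dim, dim, batch_first=True)

        def forward(self, pinyin_ids):            # (n_chars, max_letters)
            _, h = self.gru(self.letter_emb(pinyin_ids))
            return h[-1]                          # e^pin: (n_chars, dim)

    bert = BertModel.from_pretrained("bert-base-chinese")
    pinyin_enc = PinyinEncoder()

    def encode(input_ids, pinyin_ids):
        char_emb = bert.embeddings.word_embeddings(input_ids)  # e^char
        pin_emb = pinyin_enc(pinyin_ids).unsqueeze(0)          # e^pin
        # H = (h_1, ..., h_n): output of the last Transformer layer
        return bert(inputs_embeds=char_emb + pin_emb).last_hidden_state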
S4: after obtaining the text character vector sequence in the step S3, detecting the word granularity error in the text by using a word granularity error detection layer formed by a full connection layer, and detecting the word granularity error in the text by using a word granularity error detection neural network formed by the first character, the last character and the relative distance of character fragments to obtain an error word set and an error word set, wherein the specific steps are as follows:
S41: and detecting word granularity errors in the text by using a word granularity error detection layer formed by a full connection layer to obtain an error word set. The method comprises the following steps:
For character-granularity errors, the encoding vector h_i of each character x_i in text X is input into a character-granularity error detection layer consisting of a fully connected layer to obtain the probability that the character is erroneous, defined as:

p_i^char = σ(W_d h_i + b_d), (1)

where W_d and b_d are the weight matrix and bias term of the fully connected layer, and σ is the sigmoid function.

According to the preset character-granularity error threshold ε_char, if the error probability p_i^char is greater than or equal to ε_char, the character is judged erroneous and needs correction; if p_i^char is less than ε_char, the character is judged error-free and needs no correction. This yields the erroneous character set {x_i : i ∈ SET_char}, where SET_char is the index set of the erroneous characters detected by the character-granularity detection module.
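A sketch of this detection layer, assuming PyTorch and a hidden size of 768; the threshold value is illustrative.

    import torch
    import torch.nn as nn

    class CharErrorDetector(nn.Module):
        """Fully connected layer + sigmoid over each h_i (equation (1))."""
        def __init__(self, hidden=768):
            super().__init__()
            self.fc = nn.Linear(hidden, 1)        # W_d, b_d

        def forward(self, H):                     # H: (batch, n, hidden)
            return torch.sigmoid(self.fc(H)).squeeze(-1)  # p_i^char

    detector = CharErrorDetector()
    H = torch.randn(1, 11, 768)                   # encoder output
    p_char = detector(H)
    eps_char = 0.5                                # assumed threshold value
    SET_char = (p_char[0] >= eps_char).nonzero().flatten().tolist()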
S42: and detecting word granularity errors in the text by using a word granularity error detection neural network formed based on the first character, the last character and the relative distance of the character fragments, so as to obtain an error word set. The method comprises the following steps:
For word-granularity errors, the character segment from character x_i to character x_j in text X is denoted X_{i:j} = [x_i, x_{i+1}, …, x_j] (where i, j satisfy 1<=i<j<=n), and its corresponding character vector sequence is H_{i:j} = [h_i, h_{i+1}, …, h_j]. The probability that it is an erroneous word is computed with a multi-layer neural network as follows:

h_i^start = ReLU(W_start h_i + b_start), (2)
h_j^end = ReLU(W_end h_j + b_end), (3)
e_dist = f_dist(j − i), (4)
h_{i:j}^span = [h_i^start; h_j^end; e_dist], (5)
p_{i:j}^span = σ(W_span h_{i:j}^span + b_span), (6)

where the encoding vector h_i of the head character x_i is input into a fully connected layer to obtain the head-character representation vector h_i^start of the character segment (W_start and b_start are the weight matrix and bias term of the fully connected layer, and ReLU(·) is the ReLU activation function, ReLU(x) = max(x, 0)); the encoding vector h_j of the tail character x_j is input into a fully connected layer to obtain the tail-character representation vector h_j^end (W_end and b_end are the weight matrix and bias term of that layer); f_dist is a distance encoding function that maps the relative distance j − i between characters x_i and x_j to the distance encoding e_dist, preserving the length information of the character segment; h_i^start, h_j^end, and e_dist are concatenated to obtain the representation vector h_{i:j}^span of the segment from x_i to x_j; and h_{i:j}^span is input into a word-granularity error classification layer to obtain the probability p_{i:j}^span that the character segment is erroneous (W_span and b_span are the weight matrix and bias term of the fully connected layer in the word-granularity error classification layer, and σ is the sigmoid function).

According to the preset word-granularity error threshold ε_span, if the error probability p_{i:j}^span is greater than or equal to ε_span, the character segment is an erroneous word and needs correction; if p_{i:j}^span is less than ε_span, the character segment is error-free and needs no correction. This erroneous-segment detection scheme can effectively identify nested erroneous nouns and can be accelerated in parallel. It yields the erroneous word (erroneous character segment) set {X_{i:j} : (i, j) ∈ SET_span}, where SET_span is the set of start-end index pairs of the erroneous words (erroneous character segments) detected by the word-granularity error detection module.
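The span detection network of equations (2)-(6) can be sketched as follows; the distance embedding table used for f_dist, the layer sizes, and the single-span interface are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SpanErrorDetector(nn.Module):
        def __init__(self, hidden=768, dist_dim=64, max_dist=128):
            super().__init__()
            self.start_fc = nn.Linear(hidden, hidden)         # W_start, b_start
            self.end_fc = nn.Linear(hidden, hidden)           # W_end, b_end
            self.dist_emb = nn.Embedding(max_dist, dist_dim)  # f_dist
            self.cls = nn.Linear(2 * hidden + dist_dim, 1)    # W_span, b_span

        def forward(self, H, i, j):               # H: (batch, n, hidden)
            h_start = torch.relu(self.start_fc(H[:, i]))      # eq. (2)
            h_end = torch.relu(self.end_fc(H[:, j]))          # eq. (3)
            e_dist = self.dist_emb(torch.tensor([j - i]))     # eq. (4)
            span = torch.cat([h_start, h_end, e_dist], dim=-1)  # eq. (5)
            return torch.sigmoid(self.cls(span)).squeeze(-1)  # eq. (6)

    det = SpanErrorDetector()
    H = torch.randn(1, 11, 768)
    p_span = det(H, 2, 3)   # probability that characters 2..3 are an error

In practice all candidate spans up to a maximum length can be scored in one batched pass, which is what enables the parallel acceleration mentioned above.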
In the prior art, erroneous nouns are usually detected with a word segmentation tool or with named entity recognition methods such as CRF, BiLSTM+CRF, and BERT+CRF. However, segmentation tools frequently mis-segment, especially on text data in limited domains, and existing named entity recognition methods cannot handle nested entities. The erroneous-character-segment detection method above detects erroneous words without word segmentation, avoiding the influence of segmentation errors, can recognize nested erroneous nouns, and achieves more accurate word-granularity detection.
S5: and (3) correcting the error word set and the error word set detected in the step (S4) respectively to obtain candidate replacement words with wrong word granularity and candidate replacement words with wrong word granularity. For the error word detected in the S4, predicting the correct character by using a word granularity error correction layer formed by a layer of full-connection layers to obtain a candidate replacement word; and (3) for the error words detected in the step (S4), obtaining corresponding correction words from the noun knowledge base constructed in the step (S2) by adopting a candidate recall and candidate sorting screening mode, and obtaining candidate replacement words. The method comprises the following specific steps:
S51: and in the step of word granularity error correction, for each error word in the error word set obtained by detection in the step S41, a word granularity error correction layer formed by a full-connection layer is used for predicting correct characters, and candidate replacement words are obtained. The method comprises the following steps:
For each erroneous character x_i in the erroneous character set obtained by the S41 character-granularity error detection, whose character-granularity error probability is p_i^char, the character representation is input into a character-granularity error correction layer consisting of one fully connected layer to obtain the probability that x_i should be corrected to the j-th character in the vocabulary:

P(y_i = j | X) = softmax(W_c h_i + b_c + o_i)_j, (7)

where W_c and b_c are the weight matrix and bias term of the character-granularity error correction layer, o_i is the one-hot vector of character x_i (the vector length is the number of characters in the vocabulary; the position of character x_i has value 1 and the remaining positions value 0), softmax(·) is the normalized exponential function used to compute the probability that the correct character is the j-th character in the vocabulary, and y_i is the correct character corresponding to x_i.

After the probabilities of correcting character x_i into each character in the vocabulary are obtained, the character with the highest probability is taken as the candidate replacement character of x_i, denoted ŷ_i. This method effectively corrects the erroneous characters in high-frequency character-granularity errors.
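A sketch of this correction layer; adding the one-hot vector o_i to the logits is one reading of equation (7), and the vocabulary size of bert-base-chinese (21128) is an assumption.

    import torch
    import torch.nn as nn

    class CharCorrector(nn.Module):
        def __init__(self, hidden=768, vocab=21128):
            super().__init__()
            self.fc = nn.Linear(hidden, vocab)    # W_c, b_c

        def forward(self, h_i, char_id):          # h_i: (1, hidden)
            o_i = nn.functional.one_hot(char_id, self.fc.out_features).float()
            return torch.softmax(self.fc(h_i) + o_i, dim=-1)  # eq. (7)

    corrector = CharCorrector()
    probs = corrector(torch.randn(1, 768), torch.tensor([672]))
    candidate = probs.argmax(dim=-1)              # highest-probability char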
S52: and step of correcting word granularity errors, namely detecting each error word in the error word set in step S42, and obtaining a corresponding correction word from the noun knowledge base constructed in step S2 by adopting a candidate recall and candidate sorting screening mode to obtain a candidate replacement word. The method comprises the following steps:
For the erroneous word set obtained by the S42 word-granularity error detection, suppose the segment X_{i:j} from character x_i to character x_j is an erroneous word (erroneous character segment); candidate recall and candidate ranking-screening are then used to obtain the corresponding correct word from the pre-constructed noun knowledge base as the candidate replacement word.
Firstly, defining the required editing distance similarity and Jaccard similarity as follows:
the edit distance, also called Levenshtein, is a measure of the degree of differentiation between two strings, defined as how many times at least processing is required to change one string to another. Defining similarity based on edit distance
Figure BDA0004074043460000121
Where x, y are two strings, edit_dist (x, y) is the edit distance between strings x, y, and len (x) is the length of string x. The Jaccard similarity is used for comparing the similarity between two limited sample sets, for two character strings, corresponding character sets A and B are obtained first, and then the Jaccard similarity J (A, B) of the two sets is defined as the proportion of the number of intersection elements of the two character sets in the union of the two sets: / >
Figure BDA0004074043460000122
The greater the Jaccard similarity, the more similar the two conjugates.
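Both similarities are easy to state in code; this sketch assumes the normalization by max(len(x), len(y)) given above.

    def edit_dist(x: str, y: str) -> int:
        # Standard dynamic-programming Levenshtein distance.
        m, n = len(x), len(y)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if x[i - 1] == y[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
        return d[m][n]

    def edit_sim(x: str, y: str) -> float:
        return 1 - edit_dist(x, y) / max(len(x), len(y))

    def jaccard_sim(x: str, y: str) -> float:
        a, b = set(x), set(y)
        return len(a & b) / len(a | b)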
Considering that most errors in speech recognition scenarios are homophone and near-homophone recognition errors caused by accents, dialects, expression habits, and background noise, the candidate recall step uses pinyin-based similarity to recall candidate words.
For the character segment X_{i:j} to be corrected detected in S42, a coarse search is used to select candidate words from the noun knowledge base. Considering that speech recognition may recognize a noun as an erroneous noun of a different length (e.g., one character more or one character fewer), both equal-length and unequal-length nouns in the knowledge base are recalled.
Specifically, for equal length noun recall, all nouns with lengths of j-i+1 in a noun knowledge base are obtained first to obtain pinyin sequences and character segments X of each equal length noun in the knowledge base i:j Calculates the edit distance similarity between the two,and taking all nouns with the similarity larger than or equal to a preset equal-length noun recall threshold value as initial equal-length candidate words to obtain an initial equal-length candidate word set.
For recall of nouns with different lengths, firstly, all nouns with lengths not being j-i+1 in a noun knowledge base are obtained, and pinyin sequences and character segments X of each noun with different lengths are obtained i:j Calculating the edit distance similarity between the two, and taking all nouns with the similarity being more than or equal to a preset unequal length noun recall threshold value as initially selected unequal length candidate words to obtain an initial unequal length candidate word set.
For each character segment X_{i:j} to be corrected, after the recalled initial equal-length and unequal-length candidate word sets are obtained, the candidate words are further scored, ranked, and screened, and the candidate with the highest score above a preset threshold is selected as the replacement word for the erroneous segment.
In conventional candidate ranking-screening methods, the erroneous character segments in the sentence to be corrected are replaced with candidate words to obtain a series of candidate sentences, and then the best candidate sentence (i.e., the best candidate word) is selected by computing a perplexity score with an n-gram language model or a neural language model, realizing text correction. However, n-gram and neural language models depend on the scale and distribution of the training corpus: when a proper noun occurs rarely or not at all in the training corpus (the low-frequency case), the perplexity computed for the correct sentence containing that proper noun is high, so the language model tends to select another noun that occurs frequently in the training corpus, leading to missed corrections and mis-corrections. Hence, candidate ranking-screening based on language model perplexity scores cannot solve low-frequency word-granularity errors.
To address the difficulty of correcting low-frequency word-granularity errors, the candidate ranking-screening method in this scheme ranks and screens candidate words using only features such as the similarity between the erroneous character segment and the candidate word, without features such as perplexity that are computed by models like n-gram language models and are easily affected by the training corpus distribution.
The features for candidate ranking-screening are computed as follows: for the character segment X_{i:j} to be corrected and each candidate word in the recalled initial equal-length and unequal-length candidate word sets, Chinese-character-level, phonetic-level, and length features are computed:
Chinese-character-level features: convert the character segment X_{i:j} to be corrected and the candidate word into Chinese character sequences, then compute the edit distance similarity and Jaccard similarity between them as the Chinese-character-level features.
Phonetic-level features: convert the character segment X_{i:j} to be corrected and the candidate word into pinyin sequences, obtain the initial (consonant) and final (vowel) sequences, and compute the edit distance similarity and Jaccard similarity between them as the phonetic-level features. The phonetic-level features capture the pronunciation similarity between the character segment to be corrected and the candidate word.
Length features: compute the length of each character segment X_{i:j} to be corrected and the length difference between it and the candidate word as the character length features. The length features allow the correction threshold to adapt to character segments of different lengths.
The scoring function used in candidate ranking-screening is as follows: after the Chinese-character-level, phonetic-level, and length features are obtained, a linear regression model is trained on pairs of erroneous words and their corresponding correct words extracted from the training set constructed in step S2. The linear regression model computes a comprehensive similarity score s_overall between the character segment to be corrected and the candidate word from these features, which is truncated to obtain the score s = max(0, min(s_overall, 1)). This similarity score integrates all feature aspects and can effectively improve correction accuracy.
For the character segment to be corrected and each candidate word in the initial equal-length and unequal-length candidate word sets, the score s between them is computed; candidates with score s greater than or equal to a preset screening threshold are kept, and those below the threshold are removed, yielding new equal-length and unequal-length candidate word sets.
If the filtered equal-length candidate word set is not empty, its candidates are sorted by score s in descending order and the equal-length candidate with the highest score is selected as the final replacement word of the character segment to be corrected; if the filtered equal-length set is empty but the filtered unequal-length set is not, the unequal-length candidates are sorted by score s in descending order and the highest-scoring one is selected as the final replacement word; if both candidate word sets are empty, the character segment is not corrected. This erroneous-word correction method can effectively correct not only high-frequency word-granularity errors but also low-frequency and unseen errors (especially proper nouns that occur rarely in the training corpus).
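The recall and ranking stages of S52 can be sketched as follows, reusing edit_sim and jaccard_sim from the sketch above; the pypinyin tool, the feature order, the regression weights, and the thresholds are all assumptions (in the method the weights are fitted on error/correct word pairs from the training set).

    from pypinyin import lazy_pinyin   # assumed pinyin tool

    def pinyin_of(word: str) -> str:
        return "".join(lazy_pinyin(word))

    def recall_candidates(span, lexicon, thr_eq=0.6, thr_neq=0.8):
        p = pinyin_of(span)
        cands = []
        for noun in lexicon:
            thr = thr_eq if len(noun) == len(span) else thr_neq
            if edit_sim(pinyin_of(noun), p) >= thr:
                cands.append(noun)
        return cands

    def score(span, cand, w=(0.3, 0.2, 0.4, 0.2, -0.1), b=0.0):
        feats = [edit_sim(span, cand),                        # character level
                 jaccard_sim(span, cand),
                 edit_sim(pinyin_of(span), pinyin_of(cand)),  # phonetic level
                 jaccard_sim(pinyin_of(span), pinyin_of(cand)),
                 abs(len(span) - len(cand))]                  # length feature
        s = sum(wi * f for wi, f in zip(w, feats)) + b
        return max(0.0, min(s, 1.0))                          # truncation

    best = max(recall_candidates("项木", ["项目", "木马"]),
               key=lambda c: score("项木", c), default=None)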
S6: the training corpus is constructed in the S2, and the pinyin coding module in the S3 and the embellishing module and the encoder of the pre-training language model are jointly trained in a multitask learning mode, wherein the word granularity error detection layer, the word granularity error detection neural network and the word granularity error correction layer in the S4 are adopted. The method comprises the following steps:
Specifically, the three steps of character-granularity error detection and word-granularity error detection in S4 and character-granularity error correction in S5 are all based on the encoder of a pre-trained language model (e.g., BERT, RoBERTa, ALBERT); they share the parameters of the pre-trained language model and complement each other. Improving the accuracy of character-granularity error detection improves the accuracy of character-granularity error correction, and character-granularity and word-granularity error detection benefit each other: if a character in the text is detected as erroneous, a word (character segment) containing that character is also erroneous; if a character segment is detected as erroneous, it must contain some erroneous character. To share information among the multiple steps, let them complement each other, and improve their mutual accuracy, the invention jointly trains the pinyin encoding module in S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in S4, and the character-granularity error correction layer in S5 in a multi-task learning manner. In addition, the multiple steps share one model, which reduces memory usage and speeds up model inference and prediction.
Specifically, taking a sample (X, G, Y, Z) as an example, wherein x= (X) 1 ,x 2 ,…,x n )、G=(g 1 ,g 2 ,…,g n )、Y=(y 1 ,y 2 ,…,y n )、Z={(i,j,z i,j ):1<=i<j<N, n is the length of text X.
Defining the loss of the word granularity error detection layer in S4 as:
Figure BDA0004074043460000151
defining the loss of the word granularity error detection neural network in S4 as:
Figure BDA0004074043460000152
defining the loss of the word granularity error correction layer in S5 as:
Figure BDA0004074043460000153
the three above loss functions are linearly combined to obtain the following total loss function:
Figure BDA0004074043460000154
wherein the method comprises the steps of0<λ 123 <=1(λ 123 =1) is the weight coefficient of three loss functions for balancing the effect of the individual losses. Lambda (lambda) 123 And selecting through the effect of the model on the verification set.
Using the training dataset constructed in S2, the parameters of the model are optimized with the AdamW optimizer, using a mini-batch gradient descent algorithm to minimize the total loss L.
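A sketch of the joint objective of equations (8)-(11), assuming PyTorch; the cross-entropy forms and the lambda values are illustrative.

    import torch
    import torch.nn.functional as F

    def total_loss(p_char, g, p_span, z, char_logits, y,
                   lam=(0.4, 0.3, 0.3)):
        l_det_char = F.binary_cross_entropy(p_char, g)   # eq. (8)
        l_det_span = F.binary_cross_entropy(p_span, z)   # eq. (9)
        # char_logits are the pre-softmax scores of the correction layer
        l_cor_char = F.cross_entropy(char_logits, y)     # eq. (10)
        return (lam[0] * l_det_char + lam[1] * l_det_span
                + lam[2] * l_cor_char)                   # eq. (11)

    # optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)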
After training, the model structure and parameters are saved; at prediction time, the model is loaded for inference.
S7: and for the text to be corrected, after the candidate replacement words with the wrong word granularity and the candidate replacement words with the wrong word granularity are obtained from S5, fusing word granularity correction results according to a preset rule to obtain the corrected text. The specific process is as follows:
obtaining correction results from the word granularity correction step of S51
Figure BDA0004074043460000155
Wherein SET char Is the sequence number set of the error characters detected by S41 word granularity error, x i Is the i-th character of the text X to be corrected, is entered>
Figure BDA0004074043460000156
Is the correct character predicted by the model in the step of S51 word granularity correction, i.e. the candidate replacement word.
Obtaining a correction result COR from the step of correcting the granularity of the S52 words span ={(X i:j ,W ij ):(i,j)∈
SET span } wherein SET span Is the sequence number set of the starting and ending positions of the error character fragments obtained by S42 word granularity error detection, X i:j Is the character segment to be corrected from the sequence number i to j in the text to be corrected X is input, W ij The candidate replacement words are obtained from candidate recall, sorting and screening in the step of correcting the granularity of the S52 words.
Considering that each character-granularity error in the text may be contained within some word-granularity error, the character-granularity and word-granularity error correction steps may produce inconsistent corrections for the same error. To avoid redundant and conflicting corrections, the present invention processes each character-granularity correction result (x_i, \hat{x}_i) as follows: if x_i is contained in some erroneous character span in COR_{span}, it is not corrected at the character granularity; otherwise, x_i in the input text X to be corrected is replaced by \hat{x}_i.
For each word-granularity correction result (X_{i:j}, W_{ij}), the span X_{i:j} in the input text X to be corrected is replaced by W_{ij}. The corrected text X_{cor} is finally obtained in this manner.
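The fusion rule of S7 can be summarized in a few lines; the sketch below uses invented container shapes (cor_char maps 1-based character indices to predicted characters, cor_span maps 1-based inclusive (start, end) pairs to replacement words) and assumes the erroneous spans do not overlap:

def fuse_corrections(text, cor_char, cor_span):
    # Collect every position covered by an erroneous word-granularity span.
    covered = set()
    for (i, j) in cor_span:
        covered.update(range(i, j + 1))
    chars = list(text)
    # Character-granularity corrections are length-preserving; apply them first,
    # skipping characters that fall inside an erroneous span (word granularity wins).
    for i, c in cor_char.items():
        if i not in covered:
            chars[i - 1] = c
    # Word-granularity replacements may change the length, so apply them
    # right-to-left to keep the earlier indices valid.
    for (i, j), word in sorted(cor_span.items(), reverse=True):
        chars[i - 1:j] = list(word)
    return "".join(chars)

# fuse_corrections("天气真号", cor_char={4: "好"}, cor_span={(3, 4): "真好"})
# returns "天气真好"; the position-4 character fix is skipped in favor of the span.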
Example 2
The invention also provides an embodiment of the multi-granularity Chinese text error correction device corresponding to the embodiment of the multi-granularity Chinese text error correction method.
Referring to fig. 5, the multi-granularity Chinese text error correction device provided by the embodiment of the present invention includes a memory and one or more processors, where the memory stores executable code, and the one or more processors, when executing the executable code, implement the multi-granularity Chinese text error correction method of the above embodiment.
The embodiment of the multi-granularity Chinese text error correction device provided by the embodiment of the present invention can be applied to any device with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the device in a logical sense is formed by the processor of the device on which it runs reading the corresponding computer program instructions from a nonvolatile memory into memory. In terms of hardware, fig. 5 shows a hardware structure diagram of the device with data processing capability on which the multi-granularity Chinese text error correction device is located; in addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 5, the device generally also includes other hardware according to its actual function, which will not be described here again.
The implementation process of the functions and roles of each unit in the above device is described in the implementation process of the corresponding steps in the above method, and will not be repeated here.
Since the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the invention without creative effort.
Example 3
The embodiment of the invention also provides a computer readable storage medium on which a program is stored; when executed by a processor, the program implements the multi-granularity Chinese text error correction method of the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the previous embodiments. The computer readable storage medium may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A multi-granularity Chinese text error correction method, characterized by comprising the following steps:
S1: preprocessing a Chinese text to be corrected;
S2: constructing a noun knowledge base, collecting error correction raw data, and constructing a training corpus for the text error correction model;
S3: vector-encoding the input text to be corrected using a pre-trained language model and fusing the phonetic information of the text to obtain the corresponding character vector sequence;
S4: after the text character vector sequence is obtained in step S3, using a character-granularity error detection layer formed by a fully-connected layer to detect character-granularity errors in the text, and using a word-granularity error detection neural network based on the head character, tail character, and relative distance features of character segments to detect word-granularity errors in the text, thereby obtaining an erroneous character set and an erroneous word set;
S5: correcting the erroneous character set and the erroneous word set detected in step S4 respectively to obtain candidate replacement characters for character-granularity errors and candidate replacement words for word-granularity errors; for each erroneous character detected in S4, predicting the correct character using a character-granularity error correction layer formed by one fully-connected layer to obtain a candidate replacement character; for each erroneous word detected in S4, obtaining the corresponding corrected word from the noun knowledge base constructed in S2 by means of candidate recall and candidate ranking and screening, to obtain candidate replacement words;
S6: using the training corpus constructed in S2, jointly training, in a multi-task learning manner, the pinyin coding module in S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in S4, and the character-granularity error correction layer in S5;
S7: for the text to be corrected, after the character-granularity candidate replacement characters and the word-granularity candidate replacement words are obtained from S5, fusing the character-granularity and word-granularity correction results according to preset rules to obtain the corrected text.
2. The multi-granularity Chinese text error correction method according to claim 1, wherein the preprocessing of the Chinese text to be corrected in step S1 includes Unicode text normalization, traditional-to-simplified Chinese conversion, punctuation recovery, and digit preprocessing.
3. The multi-granularity Chinese text error correction method according to claim 1, wherein the noun knowledge base constructed in step S2 includes common nouns obtained from public existing noun knowledge bases, and proper nouns obtained from text corpus data of a vertical or restricted domain by manual or statistical/rule-based methods; the training corpus in step S2 is constructed as follows: error correction raw data collected from an intelligent question-answering system or a speech recognition system is preprocessed in the manner of step S1; the character-granularity error label sequence, character-granularity correct character sequence, and word-granularity error labels of each text are annotated manually; after annotation, the data is randomly divided into a training set, a validation set, and a test set for training the subsequent error correction model.
4. The multi-granularity Chinese text error correction method according to claim 1, wherein the text to be corrected is vector-encoded in step S3 as follows: first, the character index sequence of the input text to be corrected is obtained using the vocabulary of the pre-trained language model; then, the character embedding vector and position embedding vector of each character are obtained using the embedding module of the pre-trained language model; next, the pinyin sequence of each character is encoded by a neural network to obtain a phonetic embedding vector; the three are added to obtain the final embedding vector of the character, which is then input into the Transformer-based encoder of the pre-trained language model to obtain the character vector sequence of the text.
5. The multi-granularity Chinese text error correction method according to claim 1, wherein character-granularity error detection and word-granularity error detection are performed on the text to be corrected in S4 as follows: after the text character vector sequence is obtained in step S3, the error probability of each character in the text is obtained using a character-granularity error detection layer formed by a fully-connected layer, and the characters whose error probability is greater than a preset threshold are collected to obtain the erroneous character set; word-granularity error detection calculates the error probability of each character segment using a multi-layer neural network based on the head character, tail character, and relative distance of the segment, and the character segments whose error probability is greater than a preset threshold are collected to obtain the erroneous word set.
6. The multi-granularity Chinese text error correction method according to claim 1, wherein step S5 performs character-granularity error correction and word-granularity error correction on the erroneous character set and the erroneous word set detected in step S4 respectively: in the character-granularity error correction step, a character-granularity error correction layer formed by a fully-connected layer predicts, for each erroneous character, the probability of correcting it into each character of the vocabulary, and the character with the maximum probability is selected as the candidate replacement for the erroneous character; the word-granularity error correction step first recalls equal-length and unequal-length candidate words from the noun knowledge base constructed in step S2 based on the pinyin edit distance, then uses a truncated linear regression model, whose inputs include Chinese-character-level, pinyin-level, and other features, to compute a comprehensive similarity for ranking and screening the candidate words, and takes the highest-scoring candidate whose score is greater than a preset threshold as the candidate replacement word for the erroneous word.
7. The multi-granularity Chinese text error correction method according to claim 1, wherein step S6 uses the training corpus constructed in step S2 to jointly train, in a multi-task learning manner, the pinyin coding module in step S3, the embedding module and encoder of the pre-trained language model, the character-granularity error detection layer and word-granularity error detection neural network in step S4, and the character-granularity error correction layer in step S5; the total loss function is a weighted average of the loss of the character-granularity error detection layer in S4, the loss of the word-granularity error detection neural network in S4, and the loss of the character-granularity error correction layer in S5, and the parameters of the model are optimized by minimizing the total loss function.
8. The multi-granularity Chinese text error correction method according to claim 1, wherein in step S7, after the character-granularity candidate replacement characters and the word-granularity candidate replacement words are obtained from S5, the character-granularity and word-granularity correction results are fused according to preset rules to obtain the corrected text; when a character-granularity correction result conflicts with a word-granularity correction result, the preset rule preferentially adopts the word-granularity correction result.
9. A multi-granularity Chinese text error correction device, comprising a memory and a processor, wherein the memory is coupled to the processor; the memory is configured to store program data, and the processor is configured to execute the program data to implement the multi-granularity Chinese text error correction method according to any one of claims 1-8.
10. A computer readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the multi-granularity Chinese text error correction method according to any one of claims 1-8.
CN202310088091.8A 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device Pending CN116127952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088091.8A CN116127952A (en) 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310088091.8A CN116127952A (en) 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device

Publications (1)

Publication Number Publication Date
CN116127952A true CN116127952A (en) 2023-05-16

Family

ID=86309757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310088091.8A Pending CN116127952A (en) 2023-01-16 2023-01-16 Multi-granularity Chinese text error correction method and device

Country Status (1)

Country Link
CN (1) CN116127952A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306600B (en) * 2023-05-25 2023-08-11 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method
CN116306600A (en) * 2023-05-25 2023-06-23 山东齐鲁壹点传媒有限公司 MacBert-based Chinese text error correction method
CN116502629A (en) * 2023-06-20 2023-07-28 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching
CN116502629B (en) * 2023-06-20 2023-08-18 神州医疗科技股份有限公司 Medical direct reporting method and system based on self-training text error correction and text matching
CN116681070A (en) * 2023-08-04 2023-09-01 北京永辉科技有限公司 Text error correction method, system, model training method, medium and equipment
CN116991874B (en) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 Text error correction and large model-based SQL sentence generation method and device
CN116991874A (en) * 2023-09-26 2023-11-03 海信集团控股股份有限公司 Text error correction and large model-based SQL sentence generation method and device
CN117094311A (en) * 2023-10-19 2023-11-21 山东齐鲁壹点传媒有限公司 Method for establishing error correction filter for Chinese grammar error correction
CN117094311B (en) * 2023-10-19 2024-01-26 山东齐鲁壹点传媒有限公司 Method for establishing error correction filter for Chinese grammar error correction
CN117151084B (en) * 2023-10-31 2024-02-23 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment
CN117151084A (en) * 2023-10-31 2023-12-01 山东齐鲁壹点传媒有限公司 Chinese spelling and grammar error correction method, storage medium and equipment
CN117556363A (en) * 2024-01-11 2024-02-13 中电科大数据研究院有限公司 Data set abnormality identification method based on multi-source data joint detection
CN117556363B (en) * 2024-01-11 2024-04-09 中电科大数据研究院有限公司 Data set abnormality identification method based on multi-source data joint detection

Similar Documents

Publication Publication Date Title
CN110135457B (en) Event trigger word extraction method and system based on self-encoder fusion document information
CN116127952A (en) Multi-granularity Chinese text error correction method and device
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
Abandah et al. Automatic diacritization of Arabic text using recurrent neural networks
WO2019085779A1 (en) Machine processing and text correction method and device, computing equipment and storage media
CN110046350B (en) Grammar error recognition method, device, computer equipment and storage medium
US8185376B2 (en) Identifying language origin of words
CN111339750B (en) Spoken language text processing method for removing stop words and predicting sentence boundaries
CN111046670B (en) Entity and relationship combined extraction method based on drug case legal documents
CN112836496B (en) Text error correction method based on BERT and feedforward neural network
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN114169330A (en) Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN114818668B (en) Name correction method and device for voice transcription text and computer equipment
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN110909144A (en) Question-answer dialogue method and device, electronic equipment and computer readable storage medium
Jauhiainen et al. Language model adaptation for language and dialect identification of text
CN111930939A (en) Text detection method and device
CN114818669B (en) Method for constructing name error correction model and computer equipment
Hládek et al. Learning string distance with smoothing for OCR spelling correction
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
CN114386399A (en) Text error correction method and device
Chen et al. Integrated semantic and phonetic post-correction for chinese speech recognition
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113901210B (en) Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination