CN113076739A - Method and system for realizing cross-domain Chinese text error correction - Google Patents
Method and system for realizing cross-domain Chinese text error correction
- Publication number: CN113076739A
- Application number: CN202110383985.0A
- Authority: CN (China)
- Prior art keywords: sentence, error, text, model, error detection
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/232 — Handling natural language data; natural language analysis; orthographic correction, e.g. spell checking or vowelisation
- G06F40/289 — Handling natural language data; natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06N3/044 — Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
Abstract
The invention provides a method for realizing cross-domain Chinese text error correction, comprising the following steps: performing error detection with a sequence-labeling error detection model trained on general-domain supervised data; retrieving candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set; substituting the words in the error-replacement set for the error in turn, computing the perplexity of each substituted sentence with an RNN language model (rnnlm), and selecting the correct word from the error-replacement set according to the computed sentence perplexity, thereby completing the Chinese text correction. The invention thus provides a method for realizing cross-domain Chinese text error correction, namely an error detection → candidate recall → correction ranking model, which handles the correction of cross-domain text more generally.
Description
Technical Field
The invention relates to the field of text error correction, in particular to a method and a system for realizing cross-domain Chinese text error correction.
Background
In daily life, wrong characters frequently appear when users browse web pages or read official-account articles in social tools such as WeChat and Weibo, making the text ambiguous. Chinese text error correction is an important natural language processing technology for automatically checking and correcting Chinese sentences, aiming to improve the correctness of the language and the efficiency and value of text interaction. Mainstream text correction techniques fall into two categories: one is a pipeline that locates text errors through sequence learning and then corrects them by ranking candidates; the other is an end-to-end model based on NMT (neural machine translation) that maps incorrect input text directly to correct output text.
However, the former recall-and-rank approach is inefficient, and because its candidate set is finite, the corrected text has a limited range of application and may even introduce ambiguity. The latter end-to-end approach requires a large supervised training set, and its very high model complexity prevents it from being embedded as a basic module in many downstream applications; it is too inefficient.
Disclosure of Invention
The invention mainly aims to overcome the above defects in the prior art and provides a method for realizing cross-domain Chinese text error correction, namely an error detection → candidate recall → correction ranking model, which handles the correction of cross-domain text more generally: candidates are recalled and rescored with a language model obtained by deep-learning training, which lowers the perplexity of the recalled text, and the models are decoupled from one another, which improves efficiency.
The invention adopts the following technical scheme:
a method for realizing cross-domain Chinese text error correction comprises the following steps:
performing error detection with a sequence-labeling error detection model trained on general-domain supervised data;
retrieving candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set;
and substituting the words in the error-replacement set for the error in turn, computing the perplexity of each substituted sentence with an RNN language model (rnnlm), selecting the correct word from the error-replacement set according to the computed sentence perplexity, and completing the Chinese text correction.
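The three steps above can be sketched as a minimal detect → recall → rank loop. The function names (`detect_errors`, `recall_candidates`, `rank_and_correct`) and the detector, lexicon, and perplexity callables are hypothetical stand-ins for the patent's trained models, not its actual implementation:

```python
# Minimal sketch of the detect -> recall -> rank pipeline.
# detector, lexicon, and perplexity are stand-in components.

def detect_errors(sentence, detector):
    """Return character positions flagged as errors by the detection model."""
    return [i for i, ch in enumerate(sentence) if detector(ch)]

def recall_candidates(wrong_char, lexicon):
    """Return replacement candidates for a flagged character from a lexicon."""
    return lexicon.get(wrong_char, [])

def rank_and_correct(sentence, positions, lexicon, perplexity):
    """Try each candidate in place and keep the lowest-perplexity sentence."""
    best = sentence
    for pos in positions:
        for cand in recall_candidates(sentence[pos], lexicon):
            trial = best[:pos] + cand + best[pos + 1:]
            if perplexity(trial) < perplexity(best):
                best = trial
    return best

# Toy usage: flag 'x', recall 'b' as a candidate, score by counting 'x'.
positions = detect_errors("axc", lambda ch: ch == "x")
corrected = rank_and_correct("axc", positions, {"x": ["b"]},
                             lambda s: s.count("x"))
```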
Specifically, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which represents the text with a BERT pre-trained model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
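The CRF layer's optimal-path computation is a standard Viterbi decode over the Bi-LSTM's per-position emission scores and the tag-transition matrix. A toy-sized NumPy sketch (the function name and the toy scores are assumptions, not the patent's trained parameters):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag path.

    emissions:   (n, num_tags) scores from the Bi-LSTM for each position
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    """
    n, num_tags = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    backpointers = []
    for t in range(1, n):
        # score[i] + transitions[i, j] + emissions[t, j], maximised over i
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # Walk back from the best final tag
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]
```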
In an alternative embodiment, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text via skip-gram or CBOW and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
Specifically, before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises the following steps:
filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices;
reading the character-to-entity-label data in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of each sentence.
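The preprocessing described above can be sketched as follows. The `[CLS]`/`[SEP]` tokens follow BERT conventions; the filtering regex and function name are simplified stand-ins, not the patent's exact rules:

```python
import re

def preprocess(sentences):
    """Filter special characters, build a vocabulary, numericalize,
    and wrap each sentence with [CLS]/[SEP]."""
    # Keep word characters (incl. CJK); drop emoticons and special symbols.
    cleaned = [re.sub(r"[^\w\u4e00-\u9fff]", "", s) for s in sentences]
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3}
    for s in cleaned:
        for ch in s:
            vocab.setdefault(ch, len(vocab))
    encoded = [
        [vocab["[CLS]"]]
        + [vocab.get(ch, vocab["[UNK]"]) for ch in s]
        + [vocab["[SEP]"]]
        for s in cleaned
    ]
    return vocab, encoded
```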
Specifically, filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices further comprises:
processing the characters and the annotated entity labels into one-to-one correspondence, and building the pinyin dictionary with word segmentation.
Specifically, the words in the error-replacement set are substituted for the error in turn, and an RNN language model (rnnlm) computes the perplexity of each substituted sentence; the rnnlm comprises:
a representation layer, which represents the sentence by combining characters and words and vectorizes them with word2vec;
an RNN layer, a recurrent neural network that models the text as a sequence and learns the word order of the sentence, the output of each hidden state depending on the current input and the output of the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
Specifically, the perplexity is calculated as
PP(S) = P(w_1 w_2 \cdots w_N)^{-1/N},
where S denotes a sentence, w_i denotes the i-th word of the sentence, i = 1, 2, …, N, and N is the number of words in the sentence.
An aspect of an embodiment of the present invention further provides a system for implementing cross-domain Chinese text error correction, comprising:
an error detection module, which performs error detection with a sequence-labeling error detection model trained on general-domain supervised data;
an error recall module, which recalls candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set;
and a correction ranking module, which substitutes the words in the error-replacement set for the error in turn, computes the perplexity of each substituted sentence with an RNN language model (rnnlm), selects the correct word from the error-replacement set according to the computed sentence perplexity, and completes the Chinese text correction.
In another aspect, an apparatus is further provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above cross-domain Chinese text error correction method when executing the computer program.
Still another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above cross-domain Chinese text error correction method.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
(1) The invention provides a method for realizing cross-domain Chinese text error correction: error detection is performed with a sequence-labeling error detection model trained on general-domain supervised data; candidates for each detected error are retrieved from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, yielding an error-replacement set; the words in the error-replacement set are substituted for the error in turn, an RNN language model (rnnlm) computes the perplexity of each substituted sentence, and the correct word is selected from the error-replacement set according to the computed perplexity, completing the Chinese text correction. The invention thus provides an error detection → candidate recall → correction ranking model that handles the correction of cross-domain text more comprehensively; recalled candidates are rescored with a deep-learning-trained language model, lowering the perplexity of the recalled text, and the models are decoupled from one another, which improves efficiency.
(2) By performing error detection with a sequence-labeling error detection model trained on general-domain supervised data, the invention can correct erroneous text in different fields, realizing cross-domain text correction.
Drawings
FIG. 1 is a flowchart of a method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
FIG. 3 is a block diagram of a system for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
The invention is described in further detail below with reference to the figures and specific examples.
Detailed Description
The embodiment of the invention provides a method for realizing cross-domain Chinese text error correction, namely an error detection → candidate recall → correction ranking model, which handles the correction of cross-domain text more generally: candidates are recalled and rescored with a language model obtained by deep-learning training, which lowers the perplexity of the recalled text, and the models are decoupled from one another, which improves efficiency.
As shown in fig. 1, the method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention comprises the following steps:
s101: carrying out error detection by combining an error detection model labeled by a sequence with a supervision data training model in the general field;
specifically, the error detection is performed by combining a sequence-labeled error detection model with a universal-field supervision data training model, and the sequence-labeled error detection model is combined with the universal-field supervision data training model, and specifically comprises the following steps:
the text representation layer is used for carrying out text representation through a bert pre-training model, the text is represented as a matrix of n x k, wherein n is the maximum length of a sentence, and k is a word vector dimension;
the Bi-LSTM layer realizes the output of each word in the sentence through a long-short term memory network and keeps the information of the long-distance word through a mathematical structure, and the output matrix of the Bi-LSTM layer is n x 2 x h, wherein h is the dimensionality of the text representation layer;
and the CRF layer is combined with the output of the Bi-LSTM layer to calculate the optimal path of the entity label of each sentence by initializing the transition matrix.
Here h denotes the hidden-layer dimension.
The BERT model is a language model built on a bidirectional Transformer. Earlier pre-trained models that produce word vectors (such as word2vec and ELMo) perform feature-based (domain) transfer, whereas BERT performs model transfer.
BERT couples the pre-trained model with the downstream task model: the same BERT model is still used for the downstream task, it naturally supports text classification, and no modification of the model is needed for the text classification task, which improves efficiency.
In another embodiment, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text via skip-gram or CBOW and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
Skip-gram and CBOW are the two models in word2vec: CBOW predicts the current word from its known context, while skip-gram does the opposite and predicts the context from the known current word.
Both models consist of three layers (an input layer, a projection layer, and an output layer) and are built on a Huffman tree, in which the intermediate vectors stored at non-leaf nodes are initialized to zero vectors while the word vectors at the leaf nodes are initialized randomly.
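The two prediction tasks can be illustrated by how their training pairs are generated (a sketch; the function names and window size are assumptions for illustration):

```python
# CBOW predicts the centre word from its context window;
# skip-gram predicts each context word from the centre word.

def cbow_pairs(tokens, window=1):
    """(context_words, centre_word) training pairs for CBOW."""
    pairs = []
    for i, centre in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if ctx:
            pairs.append((ctx, centre))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(centre_word, context_word) training pairs for skip-gram."""
    return [
        (centre, ctx_word)
        for ctx, centre in cbow_pairs(tokens, window)
        for ctx_word in ctx
    ]
```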
Specifically, before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises the following steps:
filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices;
reading the character-to-entity-label data in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of each sentence.
Specifically, filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices further comprises:
processing the characters and the annotated entity labels into one-to-one correspondence, and building the pinyin dictionary with word segmentation.
By performing error detection with a sequence-labeling error detection model trained on general-domain supervised data, the embodiment of the invention can correct erroneous text in different fields, realizing cross-domain text correction.
S102: retrieve candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set.
the edit distance, also called the Levenshtein distance, refers to the minimum number of edit operations required to change from one string to another. Permitted editing operations include replacing one character with another, inserting one character, deleting one character;
for example, convert kitten's word to sitting: sitten (k → s); sittin (e → i); sitting (→ g);
finding the editing distance of the character string, namely changing a character string s1 into a programming character string s2 through a minimum of operations, wherein the operations comprise three operations, namely adding a character, deleting a character and modifying a character;
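The edit distance described above is computed by the standard dynamic-programming recurrence over the three operations (a sketch; the function name is an assumption):

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]
```

This reproduces the kitten → sitting example above: `levenshtein("kitten", "sitting")` is 3.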
the Jaccard distance is the Jaccard distance, and the distance is the proportion of different elements in the two sets in all the elements to measure the distinguishing degree of the two sets; the concept opposite to the Jacard similarity factor is the Jacard Distance (Jaccard Distance), which can be expressed by the following formula:
the proportion of the number of intersection elements of the two sets A and B in the A, B union is called the Jacard coefficient of the two sets and is represented by a symbol J (A, B). The Jacard similarity factor is an index for measuring the similarity between two sets (cosine distance can also be used to measure the similarity between two sets).
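The two definitions above translate directly to code (a sketch; function names are assumptions):

```python
def jaccard_similarity(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """1 − J(A, B): the fraction of elements the two sets do not share."""
    return 1.0 - jaccard_similarity(a, b)
```

For example, comparing the character sets of "abc" and "bcd" gives a similarity of 2/4 = 0.5 and a distance of 0.5.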
S103: substitute the words in the error-replacement set for the error in turn, compute the perplexity of each substituted sentence with an RNN language model (rnnlm), select the correct word from the error-replacement set according to the computed sentence perplexity, and complete the Chinese text correction.
Specifically, the rnnlm used to compute the perplexity of each substituted sentence comprises:
a representation layer, which represents the sentence by combining characters and words and vectorizes them with word2vec;
an RNN layer, a recurrent neural network that models the text as a sequence and learns the word order of the sentence, the output of each hidden state depending on the current input and the output of the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
Specifically, the perplexity is calculated as
PP(S) = P(w_1 w_2 \cdots w_N)^{-1/N},
where S denotes a sentence, w_i denotes the i-th word of the sentence, i = 1, 2, …, N, and N is the number of words in the sentence.
Perplexity is a metric used in natural language processing (NLP) to evaluate the quality of a language model. It estimates the probability of a sentence word by word: the lower the perplexity, the higher the probability of the sentence, and the more fluent the sentence.
For a sentence S, the probability of the sentence is
P(S) = P(w_1, w_2, …, w_N) = p(w_1) p(w_2 | w_1) … p(w_N | w_1, w_2, …, w_{N-1}),
i.e. the joint probability obtained by multiplying the conditional probability of each word.
The perplexity of sentence S is then
PP(S) = P(w_1, w_2, …, w_N)^{-1/N}.
Taking the logarithm of both sides and solving for PP(S) yields the form of an exponential of the average negative log-probability of each word:
PP(S) = exp( −(1/N) Σ_{i=1}^{N} log p(w_i | w_1, …, w_{i−1}) ).
The exponent is exactly a cross-entropy loss: the higher the probability of the sentence, the lower the perplexity, so the perplexity measures the fluency of the sentence.
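The log form of the perplexity computation can be sketched as follows. The per-word conditional probabilities here are toy inputs standing in for the rnnlm's outputs:

```python
import math

def perplexity(word_probs):
    """PP(S) = exp(-(1/N) * sum(log p(w_i | w_1..w_{i-1}))).

    word_probs: conditional probability of each word in the sentence.
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A 4-word sentence with uniform probability 0.25 per word has perplexity 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```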
Fig. 2 is an architecture diagram of a method for implementing cross-domain chinese text error correction according to an embodiment of the present invention.
As shown in fig. 3, an aspect of the embodiment of the present invention further provides a system for implementing cross-domain Chinese text error correction, comprising:
the error detection module 301, which performs error detection with a sequence-labeling error detection model trained on general-domain supervised data.
In the error detection module 301, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which represents the text with a BERT pre-trained model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
The BERT model is a language model built on a bidirectional Transformer. Earlier pre-trained models that produce word vectors (such as word2vec and ELMo) perform feature-based (domain) transfer, whereas BERT performs model transfer.
BERT couples the pre-trained model with the downstream task model: the same BERT model is still used for the downstream task, it naturally supports text classification, and no modification of the model is needed for the text classification task, which improves efficiency.
In another embodiment, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text via skip-gram or CBOW and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information about distant words; its output matrix is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which combines the Bi-LSTM output with an initialized transition matrix to compute the optimal path of entity labels for each sentence.
Skip-gram and CBOW are the two models in word2vec: CBOW predicts the current word from its known context, while skip-gram does the opposite and predicts the context from the known current word.
Both models consist of three layers (an input layer, a projection layer, and an output layer) and are built on a Huffman tree, in which the intermediate vectors stored at non-leaf nodes are initialized to zero vectors while the word vectors at the leaf nodes are initialized randomly.
Specifically, before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises the following steps:
filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices;
reading the character-to-entity-label data in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of each sentence.
Specifically, filtering special characters and emoticons from the text, building a vocabulary, and converting the words of each sentence to indices further comprises:
processing the characters and the annotated entity labels into one-to-one correspondence, and building the pinyin dictionary with word segmentation.
The error recall module 302 recalls candidates for each detected error from a pinyin lexicon of the vocabulary using edit distance or Jaccard distance, to obtain an error-replacement set.
In the error recall module, the edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
For example, kitten is converted to sitting in three steps: sitten (k → s); sittin (e → i); sitting (→ g).
Finding the edit distance between strings thus means transforming a string s1 into a string s2 with the minimum number of operations, of which there are three kinds: inserting a character, deleting a character, and replacing a character.
The Jaccard distance measures how dissimilar two sets are as the proportion of elements that belong to one set but not both. The Jaccard similarity coefficient of two sets A and B, denoted J(A, B), is the proportion of the intersection within the union: J(A, B) = |A ∩ B| / |A ∪ B|.
The opposite concept is the Jaccard distance, d_J(A, B) = 1 − J(A, B). The Jaccard similarity coefficient is an index for measuring the similarity of two sets (the cosine distance can also be used to measure the similarity of two sets).
The error correction sorting module 303: the words in the error replacement set are substituted for the error in turn, the perplexity of each substituted sentence is calculated with an rnnlm language model, the correct word in the error replacement set is determined according to the calculated sentence perplexity, and the Chinese text error correction is completed.
In the error correction sorting module, specifically, the words in the error replacement set are substituted for the error in turn, and the rnnlm language model used to calculate the perplexity of each substituted sentence is specifically:
a representation layer, which represents sentences by combining characters and words and vectorizes them with word2vec;
an RNN layer, which comprises a recurrent neural network that performs sequence modeling on the text and learns the word order of the sentence, where the output of each hidden layer depends on the current input and the output at the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
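The recurrence described for the RNN layer — each hidden state depending on the current input and the previous state — can be illustrated with a toy, dependency-free scalar sketch. The weights below are invented constants for illustration; a real rnnlm uses trained weight matrices:

```python
import math

def rnn_states(xs, w=0.5, u=0.8, b=0.0, h0=0.0):
    """Toy scalar RNN recurrence: h_t = tanh(w*x_t + u*h_{t-1} + b)."""
    h, states = h0, []
    for x in xs:
        h = math.tanh(w * x + u * h + b)  # depends on x_t and h_{t-1}
        states.append(h)
    return states

states = rnn_states([1.0, 0.0, 0.0])
print(states)
# The hidden state decays but stays nonzero after the zero inputs:
# information from the first word is carried forward through h_{t-1}.
```

This is exactly the property the patent relies on: the hidden state at each step summarizes the sentence prefix, so the model can score each next word in context.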
Specifically, the perplexity is calculated as follows:
PP(S) = P(w1, w2, …, wN)^(−1/N)
where S denotes a sentence, wi denotes the i-th word in the sentence, i = 1, 2, …, N, and N is the number of words in the sentence.
Perplexity is an index used in the field of natural language processing (NLP) to measure the quality of a language model. It estimates the probability that a sentence occurs from the probability of each of its words: the lower the perplexity, the higher the probability that the sentence occurs, and the more natural the sentence is.
For a sentence S, the probability of the sentence occurring is:
P(S) = P(w1, w2, …, wN)
= p(w1)p(w2|w1)…p(wN|w1, w2, …, wN−1)
that is, the joint probability equals the product of the conditional probability of each word occurring;
the perplexity of sentence S is then:
PP(S) = P(w1, w2, …, wN)^(−1/N)
taking the logarithm of both sides and solving for PP(S) gives the form of an exponential of the sum of each word's negative log-probability:
PP(S) = exp(−(1/N) Σi log p(wi | w1, …, wi−1)), with the sum taken over i = 1, …, N.
the exponent is in fact a cross-entropy loss: the higher the probability of the sentence occurring, the lower the perplexity. Since the probability of the sentence occurring is thus reflected in its perplexity, perplexity is used as the index for scoring the candidate sentences.
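As a sanity check of the formula above, perplexity can be computed directly from per-word conditional probabilities. The probability values here are invented, standing in for the rnnlm model's outputs:

```python
import math

def perplexity(word_probs):
    """PP(S) = exp(-(1/N) * sum(log p(w_i | w_1..w_{i-1})))."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

likely   = [0.5, 0.4, 0.6]   # a plausible sentence
unlikely = [0.5, 0.01, 0.6]  # same sentence with one wrong word
print(perplexity(likely), perplexity(unlikely))
# The sentence containing the improbable word gets the higher perplexity,
# so the candidate replacement yielding the lowest perplexity is chosen.
```

This is the ranking criterion of the error correction sorting module: each candidate from the error replacement set is substituted in, and the substitution with the lowest sentence perplexity wins.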
Referring to fig. 4, another aspect of the present invention further provides an apparatus comprising a memory 410, a processor 420, and a computer program 411 stored in the memory and executable on the processor, where the processor 420, when executing the computer program 411, implements the steps of the above method for realizing cross-domain Chinese text error correction.
In a specific implementation, when the processor 420 executes the computer program 411, any of the embodiments corresponding to fig. 1 may be implemented.
Since the electronic device described in this embodiment is the device used to implement the data processing apparatus of the embodiment of the present invention, a person skilled in the art can, based on the method described herein, understand the specific implementation of the electronic device and its variations; how the electronic device implements the method of this embodiment is therefore not described in detail here. Any device that a person skilled in the art uses to implement the method of the embodiment of the present invention falls within the protection scope of the present invention.
As shown in fig. 5, a further aspect of the embodiment of the present invention provides a computer-readable storage medium 500 on which a computer program 511 is stored, which, when executed by a processor, implements the steps of the above method for realizing cross-domain Chinese text error correction.
The computer instructions, when loaded and executed on a computer, produce, in whole or in part, the processes or functions described in the embodiments of the invention. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another via wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above description is only one embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made to the present invention using this design concept shall constitute an infringement of the protection scope of the present invention.
Claims (10)
1. A method for realizing cross-domain Chinese text error correction is characterized by comprising the following steps:
performing error detection with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain;
recalling errors in the pinyin library of the vocabulary through the edit distance or the Jaccard distance to obtain an error replacement set;
and substituting the words in the error replacement set for the error in turn, calculating the perplexity of each substituted sentence with an rnnlm language model, determining the correct word in the error replacement set according to the calculated sentence perplexity, and completing the Chinese text error correction.
2. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein error detection is performed with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain, the model specifically comprising:
a text representation layer, which performs text representation through a BERT pre-training model, the text being represented as an n × k matrix, where n is the maximum sentence length and k is the word vector dimension;
a Bi-LSTM layer, which produces the output of each word in the sentence through a long short-term memory network and retains long-distance word information through its mathematical structure, the output matrix of the Bi-LSTM layer being n × 2h, where h is the dimensionality of the text representation layer;
and a CRF layer, which, combined with the output of the Bi-LSTM layer, calculates the optimal path of entity labels for each sentence by initializing a transition matrix.
3. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein error detection is performed with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain, the model specifically comprising:
a text representation layer, which embeds the text using skip-gram or cbow, the text being represented as an n × k matrix, where n is the maximum sentence length and k is the word vector dimension;
a Bi-LSTM layer, which produces the output of each word in the sentence through a long short-term memory network and retains long-distance word information through its mathematical structure, the output matrix of the Bi-LSTM layer being n × 2h, where h is the dimensionality of the text representation layer;
and a CRF layer, which, combined with the output of the Bi-LSTM layer, calculates the optimal path of entity labels for each sentence by initializing a transition matrix.
4. The method of claim 1, wherein before performing error detection by using the error detection model with sequence labeling in combination with the supervised data training model in the general field, the method further comprises:
filtering the text by special characters and emoticons, forming a word list, and digitizing the words in each sentence;
reading the data corresponding to the characters and the entity labels in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and the end of each sentence.
5. The method of claim 1, wherein filtering special characters and emoticons from the text, forming a word list, and digitizing the words in each sentence further comprises:
processing the characters and the labeled entity labels into a one-to-one correspondence, and processing the pinyin dictionary with word segmentation.
6. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein the words in the error replacement set are substituted for the error in turn, and the rnnlm language model used to calculate the perplexity of each substituted sentence specifically comprises:
a representation layer, which represents sentences by combining characters and words and vectorizes them with word2vec;
an RNN layer, which comprises a recurrent neural network that performs sequence modeling on the text and learns the word order of the sentence, where the output of each hidden layer depends on the current input and the output at the previous time step;
and an output layer, which applies a linear transformation followed by an activation function to obtain the loss value of each sentence.
8. A system for implementing cross-domain Chinese text error correction, comprising:
an error detection module: performing error detection with a sequence-labeling error detection model combined with a model trained on supervised data in the general domain;
an error recall module: performing error recall in the pinyin library of the vocabulary through the edit distance or the Jaccard distance to obtain an error replacement set;
and an error correction sorting module: substituting the words in the error replacement set for the error in turn, calculating the perplexity of each substituted sentence with an rnnlm language model, determining the correct word in the error replacement set according to the calculated sentence perplexity, and completing the Chinese text error correction.
9. An apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110383985.0A CN113076739A (en) | 2021-04-09 | 2021-04-09 | Method and system for realizing cross-domain Chinese text error correction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113076739A true CN113076739A (en) | 2021-07-06 |
Family
ID=76615941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110383985.0A Pending CN113076739A (en) | 2021-04-09 | 2021-04-09 | Method and system for realizing cross-domain Chinese text error correction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113076739A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113642318A (en) * | 2021-10-14 | 2021-11-12 | 江西风向标教育科技有限公司 | Method, system, storage medium and device for correcting English article |
CN113836919A (en) * | 2021-09-30 | 2021-12-24 | 中国建筑第七工程局有限公司 | Building industry text error correction method based on transfer learning |
CN114611494A (en) * | 2022-03-17 | 2022-06-10 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN114818669A (en) * | 2022-04-26 | 2022-07-29 | 北京中科智加科技有限公司 | Method for constructing name error correction model and computer equipment |
CN115048907A (en) * | 2022-05-31 | 2022-09-13 | 北京深言科技有限责任公司 | Text data quality determination method and device |
CN115204151A (en) * | 2022-09-15 | 2022-10-18 | 华东交通大学 | Chinese text error correction method, system and readable storage medium |
CN115221866A (en) * | 2022-06-23 | 2022-10-21 | 平安科技(深圳)有限公司 | Method and system for correcting spelling of entity word |
CN115293138A (en) * | 2022-08-03 | 2022-11-04 | 北京中科智加科技有限公司 | Text error correction method and computer equipment |
CN115659958A (en) * | 2022-12-27 | 2023-01-31 | 中南大学 | Chinese spelling error checking method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019085779A1 (en) * | 2017-11-01 | 2019-05-09 | 阿里巴巴集团控股有限公司 | Machine processing and text correction method and device, computing equipment and storage media |
CN110717031A (en) * | 2019-10-15 | 2020-01-21 | 南京摄星智能科技有限公司 | Intelligent conference summary generation method and system |
CN110751234A (en) * | 2019-10-09 | 2020-02-04 | 科大讯飞股份有限公司 | OCR recognition error correction method, device and equipment |
CN111695343A (en) * | 2020-06-23 | 2020-09-22 | 深圳壹账通智能科技有限公司 | Wrong word correcting method, device, equipment and storage medium |
CN112149406A (en) * | 2020-09-25 | 2020-12-29 | 中国电子科技集团公司第十五研究所 | Chinese text error correction method and system |
Non-Patent Citations (1)
Title |
---|
Shi Xiaohua: "Matrix Factorization Learning and Network Community Detection Methods", Beijing: Beijing University of Posts and Telecommunications Press, pages: 137 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||