CN113076739A - Method and system for realizing cross-domain Chinese text error correction - Google Patents

Method and system for realizing cross-domain Chinese text error correction

Info

Publication number
CN113076739A
Authority
CN
China
Prior art keywords
sentence
error
text
model
error detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110383985.0A
Other languages
Chinese (zh)
Inventor
宋正博 (Song Zhengbo)
肖龙源 (Xiao Longyuan)
李稀敏 (Li Ximin)
李威 (Li Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202110383985.0A
Publication of CN113076739A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for realizing cross-domain Chinese text error correction, comprising the following steps: performing error detection with a sequence-labeling error detection model trained on supervised data from the general domain; retrieving the detected errors in a pinyin (phonetic) lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set; replacing the error with each word in the error-replacement set in turn, computing the perplexity of each resulting sentence with an RNN language model (RNNLM), selecting the correct word in the error-replacement set according to the computed sentence perplexity, and completing the Chinese text error correction. The invention thus provides a pipeline of error detection → candidate recall → error-correction ranking that handles cross-domain text error correction more generally; the recalled candidates are reranked by the perplexity of a language model trained by deep learning, which improves the quality of the recalled text, and the mutually decoupled models improve efficiency.

Description

Method and system for realizing cross-domain Chinese text error correction
Technical Field
The invention relates to the field of text error correction, in particular to a method and a system for realizing cross-domain Chinese text error correction.
Background
In daily life, wrong characters often appear when users browse web pages or read official-account articles in social tools such as WeChat and Weibo, making the meaning of the text ambiguous. Chinese text error correction is an important natural language processing technology that automatically checks and corrects Chinese sentences by algorithm; its purpose is to improve the correctness of the language and to raise the efficiency and value of text interaction. Existing mainstream text error correction technology falls mainly into two types: one is a pipeline that finds the position of a text error by sequence learning and then corrects the erroneous text by ranking candidates; the other is an end-to-end model based on NMT (neural machine translation) that maps erroneous input text directly to correct output text.
However, the former ranking-and-recall approach to correcting erroneous text is inefficient, and because its candidate set is finite, the corrected text it can produce has a limited range of application and may even introduce ambiguity. The latter end-to-end approach requires a large supervised training set, and its very high model complexity prevents it from being embedded as a basic module in many downstream applications; it is too inefficient.
Disclosure of Invention
The invention mainly aims to overcome the above defects in the prior art by providing a method for realizing cross-domain Chinese text error correction, namely a pipeline of error detection → candidate recall → error-correction ranking that handles cross-domain text error correction more generally; the recalled candidates are reranked by the perplexity of a language model trained by deep learning, which improves the quality of the recalled text, and the mutually decoupled models improve efficiency.
The invention adopts the following technical scheme:
a method for realizing cross-domain Chinese text error correction comprises the following steps:
carrying out error detection by combining an error detection model labeled by a sequence with a supervision data training model in the general field;
retrieving errors in a phonetic library of the vocabulary through the editing distance or the Jaccard distance to obtain an error replacement set;
and sequentially replacing the words in the error replacement set with errors, calculating the confusion degree of the wrongly replaced sentences by adopting an rnnlm language model, determining the correct words in the error replacement set according to the calculated sentence confusion degree, and completing Chinese text error correction.
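To make the pipeline concrete, the following is a minimal Python sketch of the detect → recall → rerank loop, assuming hypothetical helper functions: `detect_errors` (the sequence-labeling detector), `recall_candidates` (edit-distance/Jaccard retrieval from the pinyin lexicon), and `sentence_perplexity` (the RNNLM score). It also assumes each candidate has the same length as the error span it replaces.

```python
def correct_sentence(sentence, detect_errors, recall_candidates, sentence_perplexity):
    """Replace each detected error span with the candidate whose substitution
    yields the lowest sentence perplexity (lower = more fluent)."""
    for start, end in detect_errors(sentence):            # error spans from the tagger
        wrong = sentence[start:end]
        best, best_ppl = wrong, sentence_perplexity(sentence)
        for cand in recall_candidates(wrong):             # candidates from the pinyin lexicon
            trial = sentence[:start] + cand + sentence[end:]
            ppl = sentence_perplexity(trial)              # RNNLM scoring
            if ppl < best_ppl:
                best, best_ppl = cand, ppl
        sentence = sentence[:start] + best + sentence[end:]
    return sentence
```

Because detection, recall and ranking are separate callables, each model can be swapped or retrained independently, which is the decoupling the invention relies on for efficiency.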
Specifically, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which represents the text through a BERT pre-training model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
Specifically, in another configuration, the sequence-labeling error detection model trained on general-domain supervised data comprises the following layers:
a text representation layer, which embeds the text in skip-gram or CBOW mode and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
Specifically, before the error detection is performed with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises:
filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence;
reading the data corresponding to the characters and entity labels in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of the sentence.
Specifically, filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence further comprises:
processing the characters and the labeled entity labels into one-to-one correspondence, and building the pinyin dictionary by word segmentation.
Specifically, the error is replaced in turn by each word in the error-replacement set, and an RNN language model (RNNLM) computes the perplexity of each sentence after replacement; the RNNLM specifically comprises:
a representation layer, which represents sentences as a combination of characters and words and vectorizes them with word2vec;
an RNN layer, which contains a recurrent neural network that models the text as a sequence and learns the expression order of the sentence, the output of each hidden layer depending on the current input and the output of the previous time step;
and an output layer, which is connected to a linear activation to obtain the loss value of each sentence.
Specifically, the perplexity is computed as:

PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}

where S denotes the sentence, w_i denotes the i-th word, i = 1, 2, …, N, and N is the number of words in the sentence.
An aspect of an embodiment of the present invention further provides a system for implementing cross-domain Chinese text error correction, comprising:
an error detection module: performing error detection with a sequence-labeling error detection model trained on supervised data from the general domain;
an error recall module: recalling errors from a pinyin (phonetic) lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set;
an error-correction ranking module: replacing the error with each word in the error-replacement set in turn, computing the perplexity of each resulting sentence with an RNN language model (RNNLM), selecting the correct word in the error-replacement set according to the computed sentence perplexity, and completing the Chinese text error correction.
In another aspect, an apparatus is further provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above cross-domain Chinese text error correction method when executing the computer program.
Still another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above cross-domain Chinese text error correction method.
As can be seen from the above description, compared with the prior art, the present invention has the following advantages:
(1) The invention provides a method for realizing cross-domain Chinese text error correction that performs error detection with a sequence-labeling error detection model trained on general-domain supervised data; retrieves the detected errors in a pinyin lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set; replaces the error with each word in the error-replacement set in turn, computes the perplexity of each resulting sentence with an RNN language model, and selects the correct word according to the computed sentence perplexity, completing the Chinese text error correction. The invention provides a pipeline of error detection → candidate recall → error-correction ranking that handles cross-domain text error correction more comprehensively; the recalled candidates are reranked by the perplexity of a deep-learning-trained language model, improving the quality of the recalled text, and the mutually decoupled models improve efficiency.
(2) By performing error detection with a sequence-labeling error detection model trained on general-domain supervised data, the invention can correct erroneous text in different fields, realizing cross-domain text error correction.
Drawings
FIG. 1 is a flowchart of a method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a method for implementing cross-domain Chinese text error correction according to an embodiment of the present invention;
FIG. 3 is a block diagram of a system for implementing cross-domain Chinese text correction according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an embodiment of a computer-readable storage medium according to an embodiment of the present invention.
The invention is described in further detail below with reference to the figures and specific examples.
Detailed Description
The embodiment of the invention provides a method for realizing cross-domain Chinese text error correction, namely a pipeline of error detection → candidate recall → error-correction ranking that handles cross-domain text error correction more generally; the recalled candidates are reranked by the perplexity of a language model trained by deep learning, which improves the quality of the recalled text, and the mutually decoupled models improve efficiency.
As shown in FIG. 1, the flow of the method for realizing cross-domain Chinese text error correction according to an embodiment of the present invention includes the following steps:
s101: carrying out error detection by combining an error detection model labeled by a sequence with a supervision data training model in the general field;
specifically, the error detection is performed by combining a sequence-labeled error detection model with a universal-field supervision data training model, and the sequence-labeled error detection model is combined with the universal-field supervision data training model, and specifically comprises the following steps:
the text representation layer is used for carrying out text representation through a bert pre-training model, the text is represented as a matrix of n x k, wherein n is the maximum length of a sentence, and k is a word vector dimension;
the Bi-LSTM layer realizes the output of each word in the sentence through a long-short term memory network and keeps the information of the long-distance word through a mathematical structure, and the output matrix of the Bi-LSTM layer is n x 2 x h, wherein h is the dimensionality of the text representation layer;
and the CRF layer is combined with the output of the Bi-LSTM layer to calculate the optimal path of the entity label of each sentence by initializing the transition matrix.
Wherein hidden is a hidden layer.
The BERT model is a language model built on a bidirectional Transformer. Earlier pre-trained models that generate word vectors (word2vec, ELMo, etc.) belong to domain transfer, while BERT belongs to model transfer.
BERT unifies the pre-trained model with the downstream task model: the BERT model itself is still used for the downstream task, and it naturally supports text classification without modifying the model, which improves efficiency.
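As an illustration only (not the patent's exact implementation), the following compact PyTorch sketch shows such a BiLSTM-CRF tagger; for brevity the BERT representation layer is replaced by a plain `nn.Embedding`, the CRF comes from the third-party `pytorch-crf` package, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, k=128, h=256, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)           # stand-in for the n x k representation
        self.bilstm = nn.LSTM(k, h, batch_first=True,
                              bidirectional=True)          # output is n x 2h per sentence
        self.proj = nn.Linear(2 * h, num_tags)             # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)         # learned transition matrix

    def loss(self, tokens, tags, mask):
        emissions = self.proj(self.bilstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)       # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.proj(self.bilstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)       # optimal tag path per sentence
```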
In another embodiment, the sequence-labeling error detection model trained on general-domain supervised data specifically comprises:
a text representation layer, which embeds the text in skip-gram or CBOW mode and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
The skip-gram model and the CBOW model are the two models involved in word2vec: CBOW predicts the current word from its known context, while skip-gram does the opposite, predicting the context from the known current word.
The skip-gram and CBOW models each comprise three layers, namely an input layer, a projection layer and an output layer, and both are based on a Huffman tree; the intermediate vectors stored in the non-leaf nodes of the Huffman tree are initialized to zero vectors, while the word vectors of the words corresponding to the leaf nodes are initialized randomly.
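A minimal sketch of training this representation layer with gensim's word2vec follows; `sg=1` selects skip-gram and `sg=0` selects CBOW, while `hs=1` enables the Huffman-tree-based hierarchical softmax described above. The two-sentence corpus is a toy assumption.

```python
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很", "好"], ["明天", "天气", "怎么样"]]  # pre-tokenized sentences
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1,
                 sg=1,               # 1 = skip-gram, 0 = CBOW
                 hs=1, negative=0)   # hierarchical softmax over a Huffman tree
vec = model.wv["天气"]  # the k-dimensional vector for one word
```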
Specifically, before the error detection is performed with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises:
filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence;
reading the data corresponding to the characters and entity labels in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of the sentence.
Specifically, filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence further comprises:
processing the characters and the labeled entity labels into one-to-one correspondence, and building the pinyin dictionary by word segmentation.
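A minimal sketch of this preprocessing, assuming the Hugging Face `transformers` tokenizer for a Chinese BERT checkpoint (the patent does not name a specific tokenizer implementation):

```python
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def preprocess(sentence: str) -> list[int]:
    # keep CJK characters, letters and digits; drop emoticons and special symbols
    cleaned = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", sentence)
    encoded = tokenizer(cleaned)   # adds [CLS] ... [SEP] automatically
    return encoded["input_ids"]    # digitized token ids for the model
```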
By performing error detection with a sequence-labeling error detection model trained on general-domain supervised data, the embodiment of the invention can correct erroneous text in different fields, realizing cross-domain text error correction.
S102: retrieving the detected errors in a pinyin (phonetic) lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set;
The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to transform one string into another. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character.
For example, converting the word kitten into sitting takes three operations: kitten → sitten (k → s); sitten → sittin (e → i); sittin → sitting (insert g).
Finding the edit distance between strings thus means transforming string s1 into string s2 with the minimum number of operations, where the operations are of three kinds: inserting a character, deleting a character, and substituting a character.
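The minimum number of operations can be computed with the standard dynamic-programming recurrence; the sketch below is the textbook algorithm, not code from the patent.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                                # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

assert edit_distance("kitten", "sitting") == 3      # the example above
```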
The Jaccard distance measures the degree to which two sets differ, as the proportion of elements that belong to exactly one of the two sets among all elements. It is the complement of the Jaccard similarity coefficient and can be expressed by the following formulas:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}

The proportion of the number of elements in the intersection of two sets A and B within their union is called the Jaccard coefficient of the two sets, denoted J(A, B). The Jaccard similarity coefficient is an index for measuring the similarity of two sets (cosine distance can also be used to measure the similarity of two sets).
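The two formulas translate directly into code; the character-set granularity here is an illustrative assumption (a pinyin-unit set would work the same way).

```python
def jaccard_coefficient(a: str, b: str) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over the character sets of two words."""
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 1.0

def jaccard_distance(a: str, b: str) -> float:
    """d_J(A, B) = 1 - J(A, B): larger means the words differ more."""
    return 1.0 - jaccard_coefficient(a, b)

assert jaccard_distance("abc", "abd") == 1.0 - 2 / 4  # A∩B={a,b}, A∪B={a,b,c,d}
```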
S103: replacing the error with each word in the error-replacement set in turn, computing the perplexity of each resulting sentence with an RNN language model (RNNLM), selecting the correct word in the error-replacement set according to the computed sentence perplexity, and completing the Chinese text error correction.
Specifically, the error is replaced in turn by each word in the error-replacement set, and the RNNLM computes the perplexity of each sentence after replacement; the RNNLM specifically comprises:
a representation layer, which represents sentences as a combination of characters and words and vectorizes them with word2vec;
an RNN layer, which contains a recurrent neural network that models the text as a sequence and learns the expression order of the sentence, the output of each hidden layer depending on the current input and the output of the previous time step;
and an output layer, which is connected to a linear activation to obtain the loss value of each sentence.
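A minimal PyTorch sketch of this three-layer RNNLM follows (representation layer → RNN layer → linear output layer); the sizes and the plain `nn.RNN` cell are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, k=128, h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)   # word2vec-style representation layer
        self.rnn = nn.RNN(k, h, batch_first=True)  # each step sees current input + previous state
        self.out = nn.Linear(h, vocab_size)        # linear output layer over the vocabulary

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)                    # next-word logits at every position
```

Training such a model on next-word prediction with cross-entropy loss directly yields the per-sentence loss value described above.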
Specifically, the perplexity is computed as:

PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}

where S denotes the sentence, w_i denotes the i-th word, i = 1, 2, …, N, and N is the number of words in the sentence.
Perplexity is an index used in natural language processing (NLP) to measure the quality of a language model. It estimates the probability of a sentence from the probabilities of its words: the lower the perplexity, the higher the probability that the sentence occurs.
For a sentence S, the probability of the sentence occurring is

P(S) = P(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_N \mid w_1, w_2, \ldots, w_{N-1}),

i.e., the joint probability decomposed into the product of each word's conditional probability.
The perplexity of sentence S is

PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}},

and substituting the decomposition above gives

PP(S) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_1, \ldots, w_{i-1})}}.

Taking the logarithm of both sides and solving for PP(S) yields an exponential of the averaged negative log probability of each word:

PP(S) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1}) \right).

The exponent is exactly a cross-entropy loss: the higher the probability of the sentence, the lower the perplexity, so perplexity is used as the index for measuring how plausible the sentence is.
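Computed from per-word log probabilities (for example, the RNNLM's output after log-softmax), the perplexity is a few lines of Python:

```python
import math

def perplexity(log_probs: list[float]) -> float:
    """log_probs[i] = ln p(w_i | w_1 ... w_{i-1}); PP = exp(mean negative log prob)."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# toy check: a 3-word sentence whose words each have probability 0.1
assert abs(perplexity([math.log(0.1)] * 3) - 10.0) < 1e-9
```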
Fig. 2 is an architecture diagram of a method for implementing cross-domain chinese text error correction according to an embodiment of the present invention.
As shown in FIG. 3, an aspect of the embodiment of the present invention further provides a system for implementing cross-domain Chinese text error correction, comprising:
the error detection module 301: performing error detection with a sequence-labeling error detection model trained on supervised data from the general domain;
in the error detection module 301, the sequence-labeling error detection model trained on general-domain supervised data specifically comprises:
a text representation layer, which represents the text through a BERT pre-training model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
The BERT model is a language model built on a bidirectional Transformer. Earlier pre-trained models that generate word vectors (word2vec, ELMo, etc.) belong to domain transfer, while BERT belongs to model transfer.
BERT unifies the pre-trained model with the downstream task model: the BERT model itself is still used for the downstream task, and it naturally supports text classification without modifying the model, which improves efficiency.
In another embodiment, the sequence-labeling error detection model trained on general-domain supervised data specifically comprises:
a text representation layer, which embeds the text in skip-gram or CBOW mode and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
The skip-gram model and the CBOW model are the two models involved in word2vec: CBOW predicts the current word from its known context, while skip-gram does the opposite, predicting the context from the known current word.
The skip-gram and CBOW models each comprise three layers, namely an input layer, a projection layer and an output layer, and both are based on a Huffman tree; the intermediate vectors stored in the non-leaf nodes of the Huffman tree are initialized to zero vectors, while the word vectors of the words corresponding to the leaf nodes are initialized randomly.
Specifically, before the error detection is performed with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises:
filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence;
reading the data corresponding to the characters and entity labels in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of the sentence.
Specifically, filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence further comprises:
processing the characters and the labeled entity labels into one-to-one correspondence, and building the pinyin dictionary by word segmentation.
The error recall module 302: recalling errors from a pinyin (phonetic) lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set;
in the error recall module, the edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to transform one string into another; the permitted edit operations are substituting one character for another, inserting a character, and deleting a character.
For example, converting the word kitten into sitting takes three operations: kitten → sitten (k → s); sitten → sittin (e → i); sittin → sitting (insert g).
Finding the edit distance between strings thus means transforming string s1 into string s2 with the minimum number of operations, where the operations are of three kinds: inserting a character, deleting a character, and substituting a character.
The Jaccard distance measures the degree to which two sets differ, as the proportion of elements that belong to exactly one of the two sets among all elements. It is the complement of the Jaccard similarity coefficient and can be expressed by the following formulas:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}

The proportion of the number of elements in the intersection of two sets A and B within their union is called the Jaccard coefficient of the two sets, denoted J(A, B). The Jaccard similarity coefficient is an index for measuring the similarity of two sets (cosine distance can also be used to measure the similarity of two sets).
The error-correction ranking module 303: replacing the error with each word in the error-replacement set in turn, computing the perplexity of each resulting sentence with an RNN language model (RNNLM), selecting the correct word in the error-replacement set according to the computed sentence perplexity, and completing the Chinese text error correction.
In the error-correction ranking module, specifically, the error is replaced in turn by each word in the error-replacement set and the RNNLM computes the perplexity of each sentence after replacement; the RNNLM specifically comprises:
a representation layer, which represents sentences as a combination of characters and words and vectorizes them with word2vec;
an RNN layer, which contains a recurrent neural network that models the text as a sequence and learns the expression order of the sentence, the output of each hidden layer depending on the current input and the output of the previous time step;
and an output layer, which is connected to a linear activation to obtain the loss value of each sentence.
Specifically, the perplexity is computed as:

PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}

where S denotes the sentence, w_i denotes the i-th word, i = 1, 2, …, N, and N is the number of words in the sentence.
Perplexity is an index used in natural language processing (NLP) to measure the quality of a language model. It estimates the probability of a sentence from the probabilities of its words: the lower the perplexity, the higher the probability that the sentence occurs.
For a sentence S, the probability of the sentence occurring is

P(S) = P(w_1, w_2, \ldots, w_N) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_N \mid w_1, w_2, \ldots, w_{N-1}),

i.e., the joint probability decomposed into the product of each word's conditional probability.
The perplexity of sentence S is

PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}},

and substituting the decomposition above gives

PP(S) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_1, \ldots, w_{i-1})}}.

Taking the logarithm of both sides and solving for PP(S) yields an exponential of the averaged negative log probability of each word:

PP(S) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1}) \right).

The exponent is exactly a cross-entropy loss: the higher the probability of the sentence, the lower the perplexity, so perplexity is used as the index for measuring how plausible the sentence is.
Referring to FIG. 4, another aspect of the present invention further provides an apparatus comprising a memory 410, a processor 420, and a computer program 411 stored in the memory and executable on the processor, where the processor 420 implements the above steps of the cross-domain Chinese text error correction method when executing the computer program 411.
In a specific implementation, when the processor 420 executes the computer program 411, any of the embodiments corresponding to FIG. 1 may be implemented.
Since the electronic device described in this embodiment is the device used for implementing a data processing apparatus in the embodiment of the present invention, a person skilled in the art can understand, based on the method described herein, the specific implementation of this electronic device and its variations; how the electronic device implements the method of this embodiment is therefore not described in detail here. Any device that a person skilled in the art uses to implement the method of the embodiment of the present invention falls within the intended protection scope of the present invention.
As shown in FIG. 5, a further aspect of the embodiment of the present invention provides a computer-readable storage medium 500 on which a computer program 511 is stored; when executed by a processor, the computer program 511 implements the above steps of the cross-domain Chinese text error correction method.
The computer instructions, when loaded and executed on a computer, produce, in whole or in part, the processes or functions described in accordance with the embodiments of the invention. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can store, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid-state disk (SSD)), among others.
The above description presents only embodiments of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made using this design concept shall fall within the protection scope of the present invention.

Claims (10)

1. A method for realizing cross-domain Chinese text error correction, characterized by comprising the following steps:
performing error detection with a sequence-labeling error detection model trained on supervised data from the general domain;
retrieving the detected errors in a pinyin (phonetic) lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set;
and replacing the error with each word in the error-replacement set in turn, computing the perplexity of each resulting sentence with an RNN language model (RNNLM), selecting the correct word in the error-replacement set according to the computed sentence perplexity, and completing the Chinese text error correction.
2. The method for implementing cross-domain Chinese text error correction according to claim 1, wherein the sequence-labeling error detection model trained on general-domain supervised data specifically comprises:
a text representation layer, which represents the text through a BERT pre-training model as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
3. The method for implementing cross-domain Chinese text error correction according to claim 1, wherein the sequence-labeling error detection model trained on general-domain supervised data specifically comprises:
a text representation layer, which embeds the text in skip-gram or CBOW mode and represents it as an n × k matrix, where n is the maximum sentence length and k is the word-vector dimension;
a Bi-LSTM layer, which produces an output for each word in the sentence through a long short-term memory network whose structure retains information from distant words; the output matrix of the Bi-LSTM layer is n × 2h, where h is the hidden-layer dimension;
and a CRF layer, which initializes a transition matrix and, combined with the output of the Bi-LSTM layer, computes the optimal path of entity labels for each sentence.
4. The method of claim 1, wherein before performing error detection with the sequence-labeling error detection model trained on general-domain supervised data, the method further comprises:
filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence;
reading the data corresponding to the characters and entity labels in batches, tokenizing each sentence, and adding [CLS] and [SEP] to the beginning and end of the sentence.
5. The method of claim 1, wherein filtering special characters and emoticons out of the text, forming a word list, and digitizing the words in each sentence further comprises:
processing the characters and the labeled entity labels into one-to-one correspondence, and building the pinyin dictionary by word segmentation.
6. The method for realizing cross-domain Chinese text error correction according to claim 1, wherein the error is replaced in turn by each word in the error-replacement set and an RNN language model (RNNLM) computes the perplexity of each sentence after replacement, the RNNLM specifically comprising:
a representation layer, which represents sentences as a combination of characters and words and vectorizes them with word2vec;
an RNN layer, which contains a recurrent neural network that models the text as a sequence and learns the expression order of the sentence, the output of each hidden layer depending on the current input and the output of the previous time step;
and an output layer, which is connected to a linear activation to obtain the loss value of each sentence.
7. The method for implementing cross-domain Chinese text error correction according to claim 5, wherein the perplexity is computed as:

PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}

where S denotes the sentence, w_i denotes the i-th word, i = 1, 2, …, N, and N is the number of words in the sentence.
8. A system for implementing cross-domain Chinese text error correction, characterized by comprising:
an error detection module: performing error detection with a sequence-labeling error detection model trained on supervised data from the general domain;
an error recall module: recalling errors from a pinyin (phonetic) lexicon of the vocabulary by edit distance or Jaccard distance to obtain an error-replacement set;
and an error-correction ranking module: replacing the error with each word in the error-replacement set in turn, computing the perplexity of each resulting sentence with an RNN language model (RNNLM), selecting the correct word in the error-replacement set according to the computed sentence perplexity, and completing the Chinese text error correction.
9. An apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110383985.0A 2021-04-09 2021-04-09 Method and system for realizing cross-domain Chinese text error correction Pending CN113076739A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110383985.0A CN113076739A (en) 2021-04-09 2021-04-09 Method and system for realizing cross-domain Chinese text error correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110383985.0A CN113076739A (en) 2021-04-09 2021-04-09 Method and system for realizing cross-domain Chinese text error correction

Publications (1)

Publication Number Publication Date
CN113076739A true CN113076739A (en) 2021-07-06

Family

ID=76615941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110383985.0A Pending CN113076739A (en) 2021-04-09 2021-04-09 Method and system for realizing cross-domain Chinese text error correction

Country Status (1)

Country Link
CN (1) CN113076739A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019085779A1 (en) * 2017-11-01 2019-05-09 阿里巴巴集团控股有限公司 Machine processing and text correction method and device, computing equipment and storage media
CN110751234A (en) * 2019-10-09 2020-02-04 科大讯飞股份有限公司 OCR recognition error correction method, device and equipment
CN110717031A (en) * 2019-10-15 2020-01-21 南京摄星智能科技有限公司 Intelligent conference summary generation method and system
CN111695343A (en) * 2020-06-23 2020-09-22 深圳壹账通智能科技有限公司 Wrong word correcting method, device, equipment and storage medium
CN112149406A (en) * 2020-09-25 2020-12-29 中国电子科技集团公司第十五研究所 Chinese text error correction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
施晓华 (Shi Xiaohua): 《矩阵分解学习及其网络社区发现方法》 [Matrix Factorization Learning and Its Network Community Discovery Methods], Beijing: Beijing University of Posts and Telecommunications Press, p. 137 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836919A (en) * 2021-09-30 2021-12-24 中国建筑第七工程局有限公司 Building industry text error correction method based on transfer learning
CN113642318B (en) * 2021-10-14 2022-01-28 江西风向标教育科技有限公司 Method, system, storage medium and device for correcting English article
CN113642318A (en) * 2021-10-14 2021-11-12 江西风向标教育科技有限公司 Method, system, storage medium and device for correcting English article
CN114611494B (en) * 2022-03-17 2024-02-02 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN114611494A (en) * 2022-03-17 2022-06-10 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN114818669A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method for constructing name error correction model and computer equipment
CN114818669B (en) * 2022-04-26 2023-06-27 北京中科智加科技有限公司 Method for constructing name error correction model and computer equipment
CN115048907A (en) * 2022-05-31 2022-09-13 北京深言科技有限责任公司 Text data quality determination method and device
CN115048907B (en) * 2022-05-31 2024-02-27 北京深言科技有限责任公司 Text data quality determining method and device
CN115221866A (en) * 2022-06-23 2022-10-21 平安科技(深圳)有限公司 Method and system for correcting spelling of entity word
CN115221866B (en) * 2022-06-23 2023-07-18 平安科技(深圳)有限公司 Entity word spelling error correction method and system
CN115293138A (en) * 2022-08-03 2022-11-04 北京中科智加科技有限公司 Text error correction method and computer equipment
CN115204151A (en) * 2022-09-15 2022-10-18 华东交通大学 Chinese text error correction method, system and readable storage medium
CN115659958A (en) * 2022-12-27 2023-01-31 中南大学 Chinese spelling error checking method

Similar Documents

Publication Publication Date Title
US11574122B2 (en) Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN113076739A (en) Method and system for realizing cross-domain Chinese text error correction
CN112528672B (en) Aspect-level emotion analysis method and device based on graph convolution neural network
CN111309915B (en) Method, system, device and storage medium for training natural language of joint learning
WO2021179897A1 (en) Entity linking method and apparatus
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN112800776B (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
US20150170051A1 (en) Applying a Genetic Algorithm to Compositional Semantics Sentiment Analysis to Improve Performance and Accelerate Domain Adaptation
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN111709243A (en) Knowledge extraction method and device based on deep learning
CN108664512B (en) Text object classification method and device
WO2022174496A1 (en) Data annotation method and apparatus based on generative model, and device and storage medium
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
Jiang et al. An LSTM-CNN attention approach for aspect-level sentiment classification
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN110874536A (en) Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN115269834A (en) High-precision text classification method and device based on BERT
Wong et al. isentenizer-: Multilingual sentence boundary detection model
Kim et al. Weakly labeled data augmentation for social media named entity recognition
CN112528653B (en) Short text entity recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination