CN114548116A - Chinese text error detection method and system based on language sequence and semantic joint analysis - Google Patents

Chinese text error detection method and system based on language sequence and semantic joint analysis

Info

Publication number
CN114548116A
Authority
CN
China
Prior art keywords
matrix
attention
text
hidden state
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210178120.5A
Other languages
Chinese (zh)
Inventor
周仁杰
沈佳冰
任永坚
张纪林
万健
曾艳
寇亮
袁俊峰
王星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210178120.5A
Publication of CN114548116A
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/103: Formatting, i.e. changing of presentation of documents
    • G06F 40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese text error detection method and system based on joint analysis of word order and semantics. To address the inability of existing Chinese text error detection methods to deeply understand the semantics of Chinese text and to assign weights automatically, a Chinese text error prediction model is designed that treats the text as a one-dimensional picture, fits the text with a bidirectional recurrent neural network, and assigns weights with a self-attention mechanism. The invention adopts a semantic understanding module (FR) composed of a fully convolutional network (FCN) and a residual network (ResNet), which has two advantages: first, the FCN treats one-dimensional text data as a one-dimensional picture to understand text semantics, remedying the lack of semantic processing means in the prior art; second, ResNet deepens the network, increases the number of features, and deepens the understanding of text semantics.

Description

Chinese text error detection method and system based on language sequence and semantic joint analysis
Technical Field
The invention belongs to the field of Chinese text processing, text cleaning and text error detection, and relates to a Chinese text error detection method and system based on word order and semantic joint analysis.
Background
With the development of science and technology and the spread of 4G and 5G, society is increasingly informatized: online and remote work are commonplace, and the paperless era has arrived. With paperless work, information is increasingly stored on storage devices as electronic text. Because of the particularity of text, a slight difference can carry a completely different meaning, and adding a single word can change the meaning of a whole sentence. Such problems cause great trouble and loss. For texts such as official documents, academic papers, legal instruments, and case files, the information is precious, and a misunderstanding often brings unpredictable consequences.
Chinese is among the most complex and elegant languages in the world. This complexity and beauty bring variability: sentences with the same words or characters can take on different meanings in different contexts, and an error in a Chinese text can greatly change the meaning of the whole passage. For example, because many characters are similar in form or pronunciation, people often write a character with a different meaning in place of the intended one. China is a vast, multi-ethnic country: people in different regions speak different dialects, pronounce the same character differently, and often describe the same thing differently. These problems also await solutions. In addition, Chinese text correction has long lacked shared resources and common practice, so error detection for Chinese text in real scenarios has become a hot spot of current research.
Successfully solving these problems can free people from heavy, mechanical manual error detection and comparison. Using people to compare errors first increases cost; second, many professional errors can only be identified by people with domain knowledge, which wastes human resources. Proposing a solution to these problems is imperative.
Across text error detection technology, mainstream methods such as convolutional neural networks and recurrent neural networks have achieved good results, but their performance in the Chinese text domain is not ideal. The main reason is that the semantics of Chinese text are complex: a model must understand the semantics and detect errors on the basis of that understanding. For example, the original sentence "Xiaosheng has a strong desire to live" and the erroneous sentence "Xiaosheng has a strong desire to win" are both structurally well-formed, but only one word choice is correct given the context. Current mainstream technology struggles to mine such word-level semantic problems, so errors cannot be detected well. Moreover, the interrelationships between different words differ, and different weights must be assigned to express these correlations; existing methods do not assign such weights well.
Disclosure of Invention
One objective of the present invention is to provide a Chinese text error detection method based on joint analysis of word order and semantics. The method can take both semantic understanding and word weight assignment into account while fitting the text.
The technical scheme adopted by the invention is as follows:
step 1: preprocessing data;
1-1, acquiring original text data, dividing all texts in the original text data according to word levels, and constructing a Chinese character set D (w); inserting identifiers into the Chinese character set D (w), and then marking the Chinese character set D (w) by using indexes, wherein each word corresponds to a dictionary index to form a dictionary Dic (w, k);
1-2, converting a text statement in original text data into Token, adding an identifier, and fixing the sentence length;
preferably, adding identifiers in step 1-2 means adding a "START" initiator at the beginning of the sentence, a "CLS" spacer within the sentence, and an "END" terminator at the end of the sentence;
preferably, fixing the sentence length means truncating the over-length part of long sentences and padding short sentences to the fixed length with the "PAD" symbol;
1-3 serializing the text sentences after the Token conversion in the step 1-2 according to the dictionary index in the step 1-1;
1-4 mapping the data subjected to index serialization in the step 1-3 into 768-dimensional vectors by a word Embedding (Embedding) technology;
step 2: Chinese text error detection is realized through the Chinese text error detection model RFRA based on joint analysis of word order and semantics;
the Chinese text error detection model based on joint analysis of word order and semantics comprises an information extraction module, a Self-Attention module (Self-Attention), and an output layer;
the information extraction module comprises a bidirectional gated recurrent neural network (BiGRU) and a semantic understanding module (FR);
the input of the bidirectional gated recurrent neural network (BiGRU) is the 768-dimensional vector preprocessed in step 1 together with the hidden state the network generated at the previous moment, and it is used to extract text timing information; specifically:
the bidirectional gated recurrent unit model comprises two gated recurrent units (GRUs);
each GRU has a reset gate R and an update gate Z; the reset gate R_t at time t and the update gate Z_t at time t are calculated as follows:
R_t = σ(X_t W_xr + H_{t-1} W_hr + b_rr)   (2.1)
Z_t = σ(X_t W_xz + H_{t-1} W_hz + b_rz)   (2.2)
wherein X_t is the mapped 768-dimensional vector from step 1 at time t, H_{t-1} is the hidden state at time t-1, W_xr is the reset gate input weight parameter, W_xz is the update gate input weight parameter, W_hr is the reset gate hidden state weight parameter, W_hz is the update gate hidden state weight parameter, and b_rr and b_rz are the bias parameters of the reset gate and the update gate, respectively; σ is the Sigmoid function, which constrains the values of the reset gate and the update gate to the range 0 to 1;
the reset gate is used to generate the candidate hidden state H̃_t, calculated as follows:
H̃_t = tanh(X_t W_xh + (R_t ⊙ H_{t-1}) W_hh + b_h)   (2.3)
wherein W_xh is the candidate hidden state input weight parameter, W_hh is the weight parameter of the candidate hidden state with respect to the hidden state, b_h is the candidate hidden state bias parameter, and tanh is the activation function;
the update gate is used to generate the hidden state H_t at the current moment, calculated as follows:
H_t = Z_t ⊙ H_{t-1} + (1 - Z_t) ⊙ H̃_t   (2.4)
wherein ⊙ denotes the Hadamard product, i.e. element-wise multiplication;
of the two gated recurrent units (GRUs), one receives the forward input and the other the reversed input, producing the forward hidden state →H_t and the reverse hidden state ←H_t, calculated as follows:
→H_t = GRU(X_t, →H_{t-1})   (2.5)
←H_t = GRU(X_t, ←H_{t+1})   (2.6)
wherein GRU(X_t, →H_{t-1}) indicates generating hidden states with the GRU in forward order, GRU(X_t, ←H_{t+1}) indicates generating hidden states with the GRU in reverse order, →H_t denotes the forward hidden state at time t, and ←H_t denotes the reverse hidden state at time t;
the hidden state H_t is generated not by simple addition but by concatenation, as follows:
H_t = →H_t ⊕ ←H_t   (2.7)
wherein ⊕ denotes the dimension concatenation operation;
the input of the semantic understanding module (FR) is the 768-dimensional vector preprocessed in step 1, and it is used to extract text semantic information; the module comprises a plurality of semantic understanding units, each comprising a fully convolutional network (FCN); the units are connected by a residual network (ResNet) and use an improved Sigmoid function; the input of each semantic understanding unit is the outputs of the preceding two units;
the residual network ResNet is calculated as follows:
O_t^(l) = FCN(O_t^(l-1)) + O_t^(l-2)   (2.8)
wherein O_t^(l) denotes the output of ResNet for the l-th semantic understanding unit at time t, and O_t^(l-1) and O_t^(l-2) denote the outputs of the preceding two semantic understanding units at time t; the improved Sigmoid activation function (2.9) is applied as the final activation inside each unit;
the input of the Self-Attention module (Self-Attention) is the superposed output of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), and it is used to assign word weights; the input is transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value), then a Similarity matrix (Similarity) is calculated from the key matrix and the query matrix, the similarity matrix is normalized, and finally the normalized similarity matrix and the value matrix are weighted to obtain the Attention matrix (Attention); specifically:
(a) the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) are superposed and then transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value); specifically
I_t = O_t^BiGRU + O_t^FR   (2.10)
Query = I_t × W_q   (2.11)
Key = I_t × W_k   (2.12)
Value = I_t × W_v   (2.13)
wherein W_q is the query matrix weight parameter, W_k is the key matrix weight parameter, and W_v is the value matrix weight parameter; I_t denotes the superposed output, at time t, of the bidirectional recurrent neural network BiGRU and the FR semantic understanding module in the information extraction module;
(b) calculating a Similarity matrix (Similarity) from the query matrix and the key matrix:
Similarity(Query, Key) = Query × Key^T   (2.14)
(c) normalizing each row of the similarity matrix:
a_ij = exp(Similarity_ij) / Σ_{k=1..n} exp(Similarity_ik)   (2.15)
wherein a_ij denotes the value of the normalized similarity matrix at row i, column j; n denotes the number of elements in each row of the similarity matrix; Similarity_ij denotes the value of the similarity matrix at row i, column j; and exp(Similarity_ij) denotes the power of e with Similarity_ij as the exponent;
(d) weighting the normalized similarity matrix and the value matrix to obtain the Attention matrix (Attention):
attention_ij = Σ_{k=1..l} a_ik × value_kj   (2.16)
wherein attention_ij denotes the value of the Attention matrix Attention at row i, column j; value_kj denotes the value of the value matrix at row k, column j; and l denotes the number of elements in each column of the normalized similarity matrix;
the output layer comprises a fully connected layer (Fully Connected Layer) and a Sigmoid activation function, and is used to judge whether an output word is erroneous.
Another objective of the present invention is to provide a Chinese text error detection system based on joint analysis of word order and semantics, which comprises:
a data preprocessing module for converting text data into 768-dimensional vectors;
and a Chinese text error detection module that realizes Chinese text error detection using the Chinese text error detection model based on joint analysis of word order and semantics.
The technical scheme provided by the invention has the following beneficial effects:
(1) The invention adopts a semantic understanding module (FR) composed of a fully convolutional network (FCN) and a residual network (ResNet), which has two advantages: first, the FCN treats one-dimensional text data as a one-dimensional picture to understand text semantics, remedying the lack of semantic processing means in the prior art; second, ResNet deepens the network, increases the number of features, and deepens the understanding of text semantics.
(2) The invention uses a bidirectional gated recurrent neural network (BiGRU) to fit the text data, which has two advantages: first, the gated recurrent unit (GRU) avoids the defect that an ordinary recurrent neural network (RNN) cannot fit long sentences; second, past and future text information are used simultaneously, so the current text is fitted with more feature information.
(3) The invention superposes the output of the semantic understanding module (FR) and the output of the bidirectional gated recurrent network (BiGRU), avoiding the loss of timing information when it passes through the pooling and padding layers of the fully convolutional network.
(4) The invention adopts a self-attention mechanism (Self-Attention), which has two advantages: first, the attention mechanism (Attention) can assign weights automatically, and words with closer relations are assigned larger weights, indicating a higher degree of correlation; second, the self-attention mechanism (Self-Attention) is resistant to interference, effectively avoiding the disturbance that erroneous words bring to the semantics.
Drawings
FIG. 1 is a flow chart according to the present invention;
FIG. 2 is a diagram of the semantic understanding module architecture (dashed lines in the figure are residual network connections);
FIG. 3 is a diagram of the bidirectional gated recurrent network;
FIG. 4 is a diagram of the residual network architecture;
FIG. 5 is a diagram of the model structure.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. The specific flow description is shown in fig. 1, wherein:
step 1: input data acquired by the model is preprocessed.
The preprocessing process comprises the following four steps:
1-1 Create a dictionary. All text sentences are divided into single words to construct a candidate Chinese character set. The occurrence frequency of each word in this candidate set is counted, words with frequency lower than 3 are filtered out, and the filtered set is deduplicated to form the Chinese character set D(w). Special symbols are inserted into D(w), such as the "START" initiator, "END" terminator, "CLS" spacer, "unknown" symbol, and "PAD" filler; these symbols help the computer fit the text better. Each word in the set D(w) is then indexed, each word having a unique mapping, to form the dictionary Dic(w, k).
1-2 Data tokenization. The data are sentences: a "START" initiator is added at the beginning of each sentence, a "CLS" spacer is added within the sentence, an "END" terminator is added at the end of the sentence, and any character that does not appear in the dictionary is replaced by the "unknown" symbol. Sentences are not of fixed length, so the sentence length must be normalized: long sentences have their over-length part truncated, and short sentences have the remainder filled with the "PAD" filler.
1-3 Data serialization: each word in the tokenized text is converted into its dictionary index using the dictionary Dic(w, k) obtained in step 1-1.
1-4 Word embedding mapping. The number of words in the dictionary is too large, and one-hot encoding would produce a sparse matrix, wasting storage space and slowing computation. The word Embedding technique is therefore used to map the serialized index of each word into a 768-dimensional vector.
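As an illustrative sketch only (not part of the patent text), the preprocessing of steps 1-1 to 1-4 could be implemented as follows in Python with PyTorch; the frequency threshold of 3 and the special symbols are taken from the description above, while the function names, the fixed length of 128, and the example sentence are hypothetical:

```python
from collections import Counter

import torch
import torch.nn as nn

SPECIALS = ["PAD", "unknown", "START", "END", "CLS"]

def build_dictionary(sentences, min_freq=3):
    """Step 1-1: split sentences into single characters, count frequencies,
    filter words occurring fewer than min_freq times, and index the rest."""
    freq = Counter(ch for s in sentences for ch in s)
    vocab = SPECIALS + sorted(ch for ch, n in freq.items() if n >= min_freq)
    return {w: k for k, w in enumerate(vocab)}  # the dictionary Dic(w, k)

def tokenize_and_serialize(sentence, dic, fixed_len=128):
    """Steps 1-2 and 1-3: add "START"/"END" identifiers, replace characters not
    in the dictionary with "unknown", fix the length with "PAD", map to indices.
    (Insertion of the "CLS" spacer within the sentence is omitted here.)"""
    tokens = ["START"] + list(sentence) + ["END"]
    tokens = [t if t in dic else "unknown" for t in tokens]
    tokens = tokens[:fixed_len] + ["PAD"] * max(0, fixed_len - len(tokens))
    return torch.tensor([dic[t] for t in tokens])

# Step 1-4: an embedding table maps each index to a 768-dimensional vector.
# dic = build_dictionary(corpus)
# embedding = nn.Embedding(len(dic), 768)
# x = embedding(tokenize_and_serialize("这是一个句子", dic))  # shape (fixed_len, 768)
```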
Step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the language sequence and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic joint analysis comprises an information extraction module, a Self-Attention module (Self-Attention) and an output layer;
the information extraction module comprises a bidirectional gated recurrent neural network (BiGRU) and a semantic understanding module (FR);
the adoption of the semantic understanding module has two advantages: first, the fully convolutional network (FCN) treats one-dimensional text data as a one-dimensional picture to understand text semantics, remedying the lack of semantic processing means in the prior art; second, the residual network (ResNet) deepens the network, increases the number of features, and deepens the understanding of text semantics; the use of the bidirectional gated recurrent network has two advantages: first, the gated recurrent unit (GRU) avoids the defect that an ordinary recurrent neural network (RNN) cannot fit long sentences; second, past and future text information are used simultaneously, so the current text is fitted with more feature information;
the output of the semantic understanding module and the output of the bidirectional gated recurrent network are superposed, which avoids the loss of timing information when it passes through the pooling and padding layers of the fully convolutional network;
the input of the bidirectional gated recurrent neural network (BiGRU) is the 768-dimensional vector preprocessed in step 1 together with the hidden state at the previous moment, and it is used to extract text timing information; specifically:
the bidirectional gated recurrent unit model comprises two gated recurrent units (GRUs);
each GRU has a reset gate R and an update gate Z; the reset gate R_t at time t and the update gate Z_t at time t are calculated as follows:
R_t = σ(X_t W_xr + H_{t-1} W_hr + b_rr)   (2.1)
Z_t = σ(X_t W_xz + H_{t-1} W_hz + b_rz)   (2.2)
wherein X_t is the mapped 768-dimensional vector from step 1 at time t, H_{t-1} is the hidden state at time t-1, W_xr is the reset gate input weight parameter, W_xz is the update gate input weight parameter, W_hr is the reset gate hidden state weight parameter, W_hz is the update gate hidden state weight parameter, and b_rr and b_rz are the bias parameters of the reset gate and the update gate, respectively. σ is the Sigmoid function, which constrains the values of the reset gate and the update gate to the range 0 to 1.
The reset gate may be used to generate the candidate hidden state H̃_t, calculated as follows:
H̃_t = tanh(X_t W_xh + (R_t ⊙ H_{t-1}) W_hh + b_h)   (2.3)
wherein W_xh is the candidate hidden state input weight parameter, W_hh is the weight parameter of the candidate hidden state with respect to the hidden state, b_h is the candidate hidden state bias parameter, and tanh is the activation function.
The update gate may generate the hidden state H_t at the current moment, calculated as follows:
H_t = Z_t ⊙ H_{t-1} + (1 - Z_t) ⊙ H̃_t   (2.4)
wherein ⊙ is the Hadamard product, i.e. element-wise multiplication.
Of the two gated recurrent units (GRUs), one receives the forward input and the other the reversed input, producing the forward hidden state →H_t and the reverse hidden state ←H_t, calculated as follows:
→H_t = GRU(X_t, →H_{t-1})   (2.5)
←H_t = GRU(X_t, ←H_{t+1})   (2.6)
wherein GRU(X_t, →H_{t-1}) indicates generating the forward hidden state with the GRU in forward order, and GRU(X_t, ←H_{t+1}) indicates generating the hidden state with the GRU in reverse order. The hidden state H_t is generated not by simple addition but by concatenation, as follows:
H_t = →H_t ⊕ ←H_t   (2.7)
wherein ⊕ is the dimension concatenation operation.
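For illustration only, a minimal PyTorch sketch of equations (2.1) to (2.7) follows; it is not the patent's reference implementation. The cell mirrors the gate formulas above (the bias terms b_rr, b_rz and b_h live inside the input-side Linear layers), and the bidirectional wrapper concatenates the forward and reverse hidden states; class names and hidden sizes are hypothetical:

```python
import torch
import torch.nn as nn

class GRUCellManual(nn.Module):
    """One GRU step implementing equations (2.1)-(2.4)."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.W_xr = nn.Linear(d_in, d_hid)             # reset gate input weights + b_rr
        self.W_hr = nn.Linear(d_hid, d_hid, bias=False)
        self.W_xz = nn.Linear(d_in, d_hid)             # update gate input weights + b_rz
        self.W_hz = nn.Linear(d_hid, d_hid, bias=False)
        self.W_xh = nn.Linear(d_in, d_hid)             # candidate state weights + b_h
        self.W_hh = nn.Linear(d_hid, d_hid, bias=False)

    def forward(self, x_t, h_prev):
        r_t = torch.sigmoid(self.W_xr(x_t) + self.W_hr(h_prev))        # (2.1)
        z_t = torch.sigmoid(self.W_xz(x_t) + self.W_hz(h_prev))        # (2.2)
        h_cand = torch.tanh(self.W_xh(x_t) + self.W_hh(r_t * h_prev))  # (2.3)
        return z_t * h_prev + (1 - z_t) * h_cand                       # (2.4)

class BiGRUManual(nn.Module):
    """Runs one GRU forward and one in reverse, then concatenates (2.5)-(2.7)."""
    def __init__(self, d_in=768, d_hid=256):
        super().__init__()
        self.fwd = GRUCellManual(d_in, d_hid)
        self.bwd = GRUCellManual(d_in, d_hid)
        self.d_hid = d_hid

    def forward(self, x):                     # x: (seq_len, d_in)
        T = x.size(0)
        h_f = h_b = torch.zeros(self.d_hid)
        outs_f, outs_b = [], [None] * T
        for t in range(T):                    # forward pass, equation (2.5)
            h_f = self.fwd(x[t], h_f)
            outs_f.append(h_f)
        for t in reversed(range(T)):          # reverse pass, equation (2.6)
            h_b = self.bwd(x[t], h_b)
            outs_b[t] = h_b
        # concatenate forward and reverse hidden states, equation (2.7)
        return torch.stack([torch.cat([f, b]) for f, b in zip(outs_f, outs_b)])
```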
The input of the semantic understanding module (FR) is the 768-dimensional vector preprocessed in step 1, and it is used to extract text semantic information. The input of the first unit is the 768-dimensional vector preprocessed in step 1; the input of the second unit is that 768-dimensional vector together with the output of the first unit; and the input of each subsequent unit is the outputs of the preceding two units.
Each unit comprises a fully convolutional network (FCN) consisting of a convolutional layer, a ReLU activation function, an average pooling layer, a deconvolution layer, and the improved Sigmoid activation function; the units are connected by a residual network (ResNet).
The residual network ResNet is calculated as follows:
O_t^(l) = FCN(O_t^(l-1)) + O_t^(l-2)   (2.8)
wherein O_t^(l) denotes the output of ResNet for the l-th unit at time t, and O_t^(l-1) and O_t^(l-2) denote the outputs of the preceding two units at time t; the improved Sigmoid activation function (2.9) is applied as the final activation inside each unit.
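The following PyTorch sketch illustrates one possible reading of the FR module; it is not the patent's reference implementation. The exact convolution hyperparameters are not given in the text, and the closed form of the improved Sigmoid appears only as equation (2.9) in the original figures, so a standard Sigmoid stands in for it; the kernel size, unit count, and all names are assumptions, and an even sequence length is assumed so that pooling and deconvolution restore the original length:

```python
import torch
import torch.nn as nn

class SemanticUnit(nn.Module):
    """One FR unit: Conv1d -> ReLU -> AvgPool -> ConvTranspose1d -> activation.
    Treats the (768, seq_len) text tensor as a one-dimensional picture."""
    def __init__(self, channels=768, kernel=3):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.AvgPool1d(2),                                      # halves the length
            nn.ConvTranspose1d(channels, channels, 2, stride=2),  # restores the length
            nn.Sigmoid(),  # placeholder for the patent's improved Sigmoid (2.9)
        )

    def forward(self, x):
        return self.fcn(x)

class FRModule(nn.Module):
    """Stacks units with residual connections from two layers back, per (2.8)."""
    def __init__(self, n_units=4, channels=768):
        super().__init__()
        self.units = nn.ModuleList(SemanticUnit(channels) for _ in range(n_units))

    def forward(self, x):                  # x: (batch, 768, seq_len)
        prev2, prev1 = x, self.units[0](x)
        for unit in self.units[1:]:
            # O^(l) = FCN(O^(l-1)) + O^(l-2): input is the preceding two outputs
            prev2, prev1 = prev1, unit(prev1) + prev2
        return prev1
```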
The input of the Self-Attention module (Self-Attention) is the superposed output of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), and it is used to assign word weights. The input is transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value), then a Similarity matrix (Similarity) is calculated from the key matrix and the query matrix, the similarity matrix is normalized, and finally the normalized similarity matrix and the value matrix are weighted to obtain the Attention matrix (Attention); specifically:
(a) The outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) are superposed and then transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value); specifically
I_t = O_t^BiGRU + O_t^FR   (2.10)
Query = I_t × W_q   (2.11)
Key = I_t × W_k   (2.12)
Value = I_t × W_v   (2.13)
wherein W_q is the query matrix weight parameter, W_k is the key matrix weight parameter, and W_v is the value matrix weight parameter; I_t denotes the superposed output, at time t, of the BiGRU and the FR semantic understanding module.
(b) Calculating a Similarity matrix (Similarity) from the query matrix and the key matrix:
Similarity(Query, Key) = Query × Key^T   (2.14)
(c) Normalizing each row of the similarity matrix:
a_ij = exp(Similarity_ij) / Σ_{k=1..n} exp(Similarity_ik)   (2.15)
wherein a_ij denotes the value of the normalized similarity matrix at row i, column j, and n denotes the number of elements in each row.
(d) Weighting the normalized similarity matrix and the value matrix to obtain the Attention matrix (Attention):
attention_ij = Σ_{k=1..l} a_ik × value_kj   (2.16)
wherein attention_ij denotes the value of the Attention matrix (Attention) at row i, column j; value_kj denotes the value of the value matrix at row k, column j; and l denotes the number of elements in each column of the normalized similarity matrix.
The output layer includes a fully connected layer (Fully Connected Layer) and a Sigmoid activation function. The input of the output layer comes from the Attention matrix (Attention); the probability of a word error is output through the fully connected layer and the activation function, and if the error probability is greater than 0.5, the word is judged to be wrongly written.
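A compact PyTorch sketch of equations (2.10) to (2.16) together with the output layer is given below for illustration; it is not the patent's reference implementation. Softmax implements the row normalization of (2.15), the superposition (2.10) is read as an element-wise sum (which requires the BiGRU and FR outputs to share the same dimension), and the 0.5 threshold follows the description above; class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn

class SelfAttentionDetector(nn.Module):
    """Self-attention per (2.10)-(2.16) plus the FC + Sigmoid output layer."""
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.fc = nn.Linear(d_model, 1)

    def forward(self, h_bigru, h_fr):          # each: (seq_len, d_model)
        i_t = h_bigru + h_fr                   # superposition (2.10)
        q = self.W_q(i_t)                      # (2.11)
        k = self.W_k(i_t)                      # (2.12)
        v = self.W_v(i_t)                      # (2.13)
        sim = q @ k.transpose(-2, -1)          # similarity matrix (2.14)
        a = torch.softmax(sim, dim=-1)         # row-wise normalization (2.15)
        attn = a @ v                           # attention matrix (2.16)
        return torch.sigmoid(self.fc(attn)).squeeze(-1)  # per-word error probability

# usage: probs = detector(h_bigru, h_fr); errors = probs > 0.5
```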
The self-attention mechanism (Self-Attention) has two advantages: first, the attention mechanism (Attention) can assign weights automatically, and words with closer relations are assigned larger weights, indicating a higher degree of correlation; second, the self-attention mechanism (Self-Attention) is resistant to interference, effectively avoiding the disturbance that erroneous words bring to the semantics.
Training uses the self-collected dataset Merge, and performance evaluation uses the public Chinese spelling dataset SIGHAN15. The model runs experiments on this dataset to predict wrongly written words, and the indices are counted for comparison. The following table shows the data volumes of the Merge and SIGHAN15 datasets.

                         Merge    SIGHAN15
Number of paragraphs     2390     1100
Number of errors         3740     1602
The performance evaluation indices adopted by the invention are Precision, Recall, F1 and F0.5, defined from the following confusion matrix:

                      True value 1           True value -1
Predicted value 1     TP (True Positive)     FP (False Positive)
Predicted value -1    FN (False Negative)    TN (True Negative)
Precision: among all samples predicted to be positive, the probability that the sample is actually positive.
Precision = TP / (TP + FP)
Recall: among all samples that are actually positive, the probability of being predicted positive.
Recall = TP / (TP + FN)
F1 and F0.5 find a balance point between the two; they take both precision and recall into account and serve as comprehensive measures of model quality.
F1 = 2 × Precision × Recall / (Precision + Recall)
F0.5 = 1.25 × Precision × Recall / (0.25 × Precision + Recall)
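For reference, the four indices can be computed directly from the confusion matrix counts; a minimal sketch (assuming non-degenerate counts, i.e. no division by zero) follows:

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, F1 and F0.5 as defined above (F_beta with beta = 0.5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    f05 = 1.25 * precision * recall / (0.25 * precision + recall)
    return precision, recall, f1, f05
```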
The following table shows the results of the Chinese text error detection experiments of the present invention on the SIGHAN15 dataset:

                   Precision (%)   Recall (%)   F1 (%)   F0.5 (%)
LSTM               56.16           47.03        51.19    54.06
GRU                70.17           46.18        55.70    63.57
BiGRU-CNN          81.94           89.38        85.50    83.33
BiGRU-Attention    64.45           99.06        78.09    69.29
RFRA               84.60           98.01        90.81    87.00
In the above table of Chinese text error detection results, LSTM and GRU are conventional recurrent neural network detectors, BiGRU-CNN combines a recurrent neural network with a convolutional neural network, and BiGRU-Attention combines a recurrent neural network with an attention mechanism. RFRA is the Chinese text error detection model of the invention based on joint analysis of word order and semantics.
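For completeness, the sketches above could be assembled into one module roughly as follows; this wiring (embedding, then BiGRU and FR in parallel, superposition inside the attention module, then the output layer) is an assumption-laden illustration reusing the hypothetical classes defined earlier, not the patent's reference implementation:

```python
import torch.nn as nn

class RFRA(nn.Module):
    """Hypothetical end-to-end assembly: embedding -> (BiGRU || FR) ->
    self-attention with superposition -> per-word error probabilities."""
    def __init__(self, vocab_size, d_embed=768, d_hid=384, n_units=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_embed)
        self.bigru = BiGRUManual(d_embed, d_hid)   # concat output is 2*d_hid = d_embed
        self.fr = FRModule(n_units, d_embed)
        self.attn = SelfAttentionDetector(d_embed)

    def forward(self, token_ids):                  # token_ids: (seq_len,)
        x = self.embedding(token_ids)              # (seq_len, 768)
        h_bigru = self.bigru(x)                    # timing information
        h_fr = self.fr(x.t().unsqueeze(0)).squeeze(0).t()  # text as a 1-D picture
        return self.attn(h_bigru, h_fr)            # error probability per word
```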

Claims (8)

1. A Chinese text error detection method based on word order and semantic joint analysis is characterized by comprising the following steps:
step 1: preprocessing data;
1-1, acquiring original text data, dividing all texts in the original text data according to word levels, and constructing a Chinese character set D (w); inserting identifiers into the Chinese character set D (w), and then marking the Chinese character set D (w) by using indexes, wherein each word corresponds to a dictionary index to form a dictionary Dic (w, k);
1-2, converting a text statement in original text data into Token, adding an identifier, and fixing the sentence length;
1-3 serializing the text sentences after the Token conversion in the step 1-2 according to the dictionary index in the step 1-1;
1-4 mapping the data subjected to index serialization in the step 1-3 into 768-dimensional vectors by a word Embedding technology;
step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the word order and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic combined analysis comprises an information extraction module, a Self-Attention module Self-Attention and an output layer;
the information extraction module comprises a bidirectional gating recurrent neural network (BiGRU) and a semantic understanding module FR;
the input of the semantic understanding module FR is 768-dimensional vectors preprocessed in the step 1 and used for extracting text semantic information; the system comprises a plurality of semantic understanding units, wherein each semantic understanding unit comprises a full convolution neural network (FCN); each semantic understanding unit is connected by adopting a residual error network ResNet and adopts an improved Sigmoid function; the input of each semantic understanding unit is the output of the first two layers of units;
the input of the Self-Attention module Self-Attention is the superposed output of the bidirectional gated recurrent neural network BiGRU and the semantic understanding module FR, and it is used to assign word weights; the input is transformed into a Key matrix Key, a query matrix Query, and a Value matrix Value, then a Similarity matrix Similarity is calculated from the key matrix and the query matrix, the similarity matrix is normalized, and finally the normalized similarity matrix and the value matrix are weighted to obtain the Attention matrix Attention;
the output layer is used to judge whether an output word is erroneous.
2. The method of claim 1, wherein adding identifiers in step 1-2 comprises adding a "START" initiator at the beginning of the sentence, a "CLS" spacer within the sentence, and an "END" terminator at the end of the sentence.
3. The method of claim 1, wherein fixing the sentence length in step 1-2 comprises truncating the over-length part of long sentences and padding short sentences to the fixed length with the "PAD" character.
4. The method of claim 1, wherein the input of the bidirectional gated recurrent neural network BiGRU is the 768-dimensional vector preprocessed in step 1 together with the hidden state the network generated at the previous moment, and it is used to extract text timing information; specifically:
the bidirectional gated recurrent unit model comprises two gated recurrent units GRU;
each GRU has a reset gate R and an update gate Z; the reset gate R_t at time t and the update gate Z_t at time t are calculated as follows:
R_t = σ(X_t W_xr + H_{t-1} W_hr + b_rr)   (2.1)
Z_t = σ(X_t W_xz + H_{t-1} W_hz + b_rz)   (2.2)
wherein X_t is the mapped 768-dimensional vector from step 1 at time t, H_{t-1} is the hidden state at time t-1, W_xr is the reset gate input weight parameter, W_xz is the update gate input weight parameter, W_hr is the reset gate hidden state weight parameter, W_hz is the update gate hidden state weight parameter, and b_rr and b_rz are the bias parameters of the reset gate and the update gate, respectively; σ is the Sigmoid function, which constrains the values of the reset gate and the update gate to the range 0 to 1;
the reset gate is used to generate the candidate hidden state H̃_t, calculated as follows:
H̃_t = tanh(X_t W_xh + (R_t ⊙ H_{t-1}) W_hh + b_h)   (2.3)
wherein W_xh is the candidate hidden state input weight parameter, W_hh is the weight parameter of the candidate hidden state with respect to the hidden state, b_h is the candidate hidden state bias parameter, and tanh is the activation function;
the update gate is used to generate the hidden state H_t at the current moment, calculated as follows:
H_t = Z_t ⊙ H_{t-1} + (1 - Z_t) ⊙ H̃_t   (2.4)
wherein ⊙ denotes the Hadamard product, i.e. element-wise multiplication;
of the two gated recurrent units GRU, one receives the forward input and the other the reversed input, producing the forward hidden state →H_t and the reverse hidden state ←H_t, calculated as follows:
→H_t = GRU(X_t, →H_{t-1})   (2.5)
←H_t = GRU(X_t, ←H_{t+1})   (2.6)
wherein GRU(X_t, →H_{t-1}) indicates generating hidden states with the GRU in forward order, GRU(X_t, ←H_{t+1}) indicates generating hidden states with the GRU in reverse order, →H_t denotes the forward hidden state at time t, and ←H_t denotes the reverse hidden state at time t;
the hidden state H_t is generated by concatenation, specifically as follows:
H_t = →H_t ⊕ ←H_t   (2.7)
wherein ⊕ denotes the dimension concatenation operation.
5. The method of claim 1, wherein the residual network ResNet is calculated as follows:
O_t^(l) = FCN(O_t^(l-1)) + O_t^(l-2)   (2.8)
and the improved Sigmoid activation function (2.9) is applied inside each semantic understanding unit; wherein O_t^(l) denotes the output of ResNet for the l-th semantic understanding unit at time t, and O_t^(l-1) and O_t^(l-2) denote the outputs of the preceding two semantic understanding units at time t.
6. The method according to claim 1, wherein the Self-Attention module Self-Attention specifically comprises:
(a) superposing the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) and transforming the result into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value); specifically
I_t = O_t^BiGRU + O_t^FR   (2.10)
Query = I_t × W_q   (2.11)
Key = I_t × W_k   (2.12)
Value = I_t × W_v   (2.13)
wherein W_q is the query matrix weight parameter, W_k is the key matrix weight parameter, and W_v is the value matrix weight parameter; I_t denotes the superposed output, at time t, of the bidirectional recurrent neural network BiGRU and the FR semantic understanding module in the information extraction module;
(b) calculating a Similarity matrix (Similarity) from the query matrix and the key matrix:
Similarity(Query, Key) = Query × Key^T   (2.14)
(c) normalizing each row of the similarity matrix:
a_ij = exp(Similarity_ij) / Σ_{k=1..n} exp(Similarity_ik)   (2.15)
wherein a_ij denotes the value of the normalized similarity matrix at row i, column j; n denotes the number of elements in each row of the similarity matrix; Similarity_ij denotes the value of the similarity matrix at row i, column j; and exp(Similarity_ij) denotes the power of e with Similarity_ij as the exponent;
(d) weighting the normalized similarity matrix and the value matrix to obtain the Attention matrix (Attention):
attention_ij = Σ_{k=1..l} a_ik × value_kj   (2.16)
wherein attention_ij denotes the value of the Attention matrix Attention at row i, column j; value_kj denotes the value of the value matrix at row k, column j; and l denotes the number of elements in each column of the normalized similarity matrix.
7. The method of claim 1, wherein the output layer comprises two fully connected layers (Fully Connected Layer) and two Gelu activation functions.
8. A Chinese text error detection system based on language order and semantic joint analysis is characterized by comprising:
the data preprocessing module is used for converting the text data into 768-dimensional vectors;
and the Chinese text error detection module is used for realizing Chinese text error detection by using a Chinese text error detection model based on the word order and semantic combined analysis.
CN202210178120.5A 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis Pending CN114548116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210178120.5A CN114548116A (en) 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210178120.5A CN114548116A (en) 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis

Publications (1)

Publication Number Publication Date
CN114548116A true CN114548116A (en) 2022-05-27

Family

ID=81678632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210178120.5A Pending CN114548116A (en) 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis

Country Status (1)

Country Link
CN (1) CN114548116A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115886830A (en) * 2022-12-09 2023-04-04 中科南京智能技术研究院 Twelve-lead electrocardiogram classification method and system
CN116975863A (en) * 2023-07-10 2023-10-31 福州大学 Malicious code detection method based on convolutional neural network


Similar Documents

Publication Publication Date Title
CN109871535B (en) French named entity recognition method based on deep neural network
CN107832400B (en) A kind of method that location-based LSTM and CNN conjunctive model carries out relationship classification
CN110134946B (en) Machine reading understanding method for complex data
CN109697232A (en) A kind of Chinese text sentiment analysis method based on deep learning
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN110083710A (en) It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN114548116A (en) Chinese text error detection method and system based on language sequence and semantic joint analysis
CN112309528B (en) Medical image report generation method based on visual question-answering method
CN112561718A (en) Case microblog evaluation object emotion tendency analysis method based on BilSTM weight sharing
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN113051887A (en) Method, system and device for extracting announcement information elements
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN112990196B (en) Scene text recognition method and system based on super-parameter search and two-stage training
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN112347783A (en) Method for identifying types of alert condition record data events without trigger words
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN112949284A (en) Text semantic similarity prediction method based on Transformer model
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN115757775A (en) Text implication-based triggerless text event detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination