CN114548116A - Chinese text error detection method and system based on language sequence and semantic joint analysis - Google Patents
- Publication number
- CN114548116A (application number CN202210178120.5A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- attention
- text
- hidden state
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/30—Semantic analysis
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/048—Activation functions
Abstract
The invention discloses a Chinese text error detection method and system based on joint analysis of word order and semantics. To address the inability of existing Chinese text error detection methods to deeply understand the semantics of Chinese text and to assign weights automatically, a Chinese text error prediction model is designed that treats the text as a one-dimensional picture, fits the text with a bidirectional recurrent neural network, and assigns weights with a self-attention mechanism. The invention adopts a semantic understanding module (FR) composed of a fully convolutional network (FCN) and a residual network (ResNet), which has two advantages: first, the FCN treats one-dimensional text data as a one-dimensional picture and understands the text semantics, remedying the lack of semantic processing means in the prior art; second, ResNet deepens the network, increases the number of features, and deepens the understanding of text semantics.
Description
Technical Field
The invention belongs to the field of Chinese text processing, text cleaning and text error detection, and relates to a Chinese text error detection method and system based on word order and semantic joint analysis.
Background
With the development of science and technology and the popularization of 4G and 5G, the level of informatization across society keeps rising: online and remote work are now commonplace, and the paperless era has arrived. As paperless work spreads, information is increasingly stored on storage devices as electronic data. Text is special in that even slight differences can carry completely different meanings; adding a single word can change the meaning of a whole sentence. Such problems cause people great trouble and losses. Official documents, academic papers, legal documents, and case files, for example, contain valuable information, and misunderstanding them often brings unpredictable consequences.
Chinese is among the most complex and elegant languages in the world, and this complexity and beauty bring great variability: sentences with the same words or characters can take on different meanings in different contexts, and an error in a Chinese text can greatly change the meaning of the whole passage. For example, because many characters are similar in form or pronunciation, people often confuse them and write characters with entirely different meanings. China is a vast, multi-ethnic country: people in different regions speak different dialects, pronounce the same character differently, and often describe the same thing in different ways. These problems also await solutions. In addition, past work on Chinese text correction has lacked shared resources, so error detection for Chinese text in real-world scenarios has become a hot spot of current research.
Successfully solving these problems can free people from heavy, mechanized manual error detection and comparison. Using human reviewers to compare errors first increases cost; second, many domain-specific errors can only be identified by people with professional knowledge, which often wastes human resources. Proposing a solution to these problems is therefore imperative.
Across text error detection technology, mainstream methods such as convolutional neural networks and recurrent neural networks have achieved good results, but their performance in the Chinese text domain is not ideal. The main reason is that the semantics of Chinese text are complex: a model must understand the semantics and detect errors on that basis. For example, the original sentence is "Xiaosheng has a strong desire to survive" and the erroneous sentence is "Xiaosheng has a strong desire to win"; the two are structurally identical, and only the context determines which is correct. Current mainstream techniques struggle to mine such word-level semantic problems and therefore cannot detect the error well. Moreover, the interrelations between different words differ, so different weights must be assigned to express their relevance, and existing methods do not assign these weights well.
Disclosure of Invention
One objective of the present invention is to provide a Chinese text error detection method based on joint analysis of word order and semantics. The method takes both semantic understanding and word-weight assignment into account while fitting the text.
The technical scheme adopted by the invention is as follows:
step 1: preprocessing data;
1-1, acquiring original text data, dividing all texts in the original text data according to word levels, and constructing a Chinese character set D (w); inserting identifiers into the Chinese character set D (w), and then marking the Chinese character set D (w) by using indexes, wherein each word corresponds to a dictionary index to form a dictionary Dic (w, k);
1-2, converting the text sentences in the original text data into Tokens, adding identifiers, and fixing the sentence length;
preferably, adding identifiers in step 1-2 means adding a "START" initiator at the beginning of the sentence, "CLS" spacers within the sentence, and an "END" terminator at the end of the sentence;
preferably, fixing the sentence length means truncating the overlong part of long sentences and padding short sentences to the fixed length with the "PAD" symbol;
1-3 serializing the text sentences after the Token conversion in the step 1-2 according to the dictionary index in the step 1-1;
1-4 mapping the data subjected to index serialization in the step 1-3 into 768-dimensional vectors by a word Embedding (Embedding) technology;
step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the language sequence and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic joint analysis comprises an information extraction module, a Self-Attention module (Self-Attention) and an output layer;
the information extraction module comprises a bidirectional gated recurrent neural network (BiGRU) and a semantic understanding module (FR);
the input of the bidirectional gated recurrent neural network (BiGRU) is the 768-dimensional vector preprocessed in step 1 together with the hidden state the network produced at the previous time step, and it is used to extract text timing information; specifically:
the bidirectional gated loop unit model comprises two gated loop units (GRUs);
the GRU has a reset gate R and an update gate Z; the reset gate $R_t$ and update gate $Z_t$ at time t are computed as:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$

where $X_t$ is the mapped 768-dimensional vector from step 1 at time t, $H_{t-1}$ is the hidden state at time t-1, $W_{xr}$ and $W_{xz}$ are the input weight parameters of the reset gate and the update gate, $W_{hr}$ and $W_{hz}$ are their hidden-state weight parameters, and $b_r$ and $b_z$ are their bias parameters; $\sigma$ is the Sigmoid function, which constrains the reset gate and update gate to values between 0 and 1;

the candidate hidden state $\tilde{H}_t$ is computed as:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$

where $W_{xh}$ is the input weight parameter of the candidate hidden state, $W_{hh}$ is its weight parameter with respect to the hidden state, $b_h$ is its bias parameter, and tanh is the activation function;

the update gate produces the hidden state $H_t$ at the current time:

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$

of the two gated recurrent units (GRU), one takes the input in forward order and one in reverse order, yielding the forward hidden state $\overrightarrow{H}_t$ and the reverse hidden state $\overleftarrow{H}_t$:

$$\overrightarrow{H}_t = \mathrm{GRU}(X_t, \overrightarrow{H}_{t-1}), \qquad \overleftarrow{H}_t = \mathrm{GRU}(X_t, \overleftarrow{H}_{t+1})$$

where $\overrightarrow{H}_t$ denotes the forward hidden state at time t, generated by running the GRU over the sequence in order, and $\overleftarrow{H}_t$ denotes the reverse hidden state at time t, generated by running the GRU in reverse order;

the hidden state H is generated not by simple addition but by concatenation:

$$H_t = [\overrightarrow{H}_t ; \overleftarrow{H}_t]$$
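The BiGRU described above can be sketched in NumPy as follows. This is an illustration, not the patent's trained model: the dimensions, the random initialization, and the parameter-dictionary names (`Wxr`, `Whr`, and so on) are stand-ins chosen to mirror the symbols in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step: reset gate R, update gate Z, candidate state, new state."""
    r = sigmoid(x_t @ p["Wxr"] + h_prev @ p["Whr"] + p["br"])
    z = sigmoid(x_t @ p["Wxz"] + h_prev @ p["Whz"] + p["bz"])
    h_cand = np.tanh(x_t @ p["Wxh"] + (r * h_prev) @ p["Whh"] + p["bh"])
    return z * h_prev + (1.0 - z) * h_cand

def bigru(xs, p_fwd, p_bwd, hidden):
    """Run one GRU forward and one in reverse, concatenating states per step."""
    T = xs.shape[0]
    h_f = np.zeros(hidden)
    h_b = np.zeros(hidden)
    fwd, bwd = [], [None] * T
    for t in range(T):                       # forward-order input
        h_f = gru_step(xs[t], h_f, p_fwd)
        fwd.append(h_f)
    for t in reversed(range(T)):             # reverse-order input
        h_b = gru_step(xs[t], h_b, p_bwd)
        bwd[t] = h_b
    # concatenation, not addition, of forward and reverse hidden states
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

def init_params(d_in, d_h, rng):
    shapes = {"Wxr": (d_in, d_h), "Wxz": (d_in, d_h), "Wxh": (d_in, d_h),
              "Whr": (d_h, d_h), "Whz": (d_h, d_h), "Whh": (d_h, d_h),
              "br": (d_h,), "bz": (d_h,), "bh": (d_h,)}
    return {k: rng.standard_normal(s) * 0.1 for k, s in shapes.items()}
```

For a sequence of 5 toy 8-dimensional inputs and hidden size 4, the output has shape (5, 8): each time step carries the 4-dimensional forward state concatenated with the 4-dimensional reverse state.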
the input of the semantic understanding module (FR) is the 768-dimensional vector preprocessed in step 1, and it is used to extract text semantic information; the module comprises several semantic understanding units, each containing a fully convolutional network (FCN); the units are connected by residual (ResNet) links and use an improved Sigmoid function; the input of each semantic understanding unit is the output of the preceding two units;
the residual network ResNet combines the outputs of the two preceding units, which can be expressed as:

$$y_t = \mathrm{FCN}(y_{t-1} + y_{t-2}) + y_{t-1} + y_{t-2}$$

where $y_t$ denotes the output of ResNet at time t, $y_{t-1}$ the output of the semantic understanding unit at time t-1, and $y_{t-2}$ the output of the semantic understanding unit at time t-2;
the input of the Self-Attention module (Self-Attention) is the superimposed output of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), and it is used to assign word weights; the input is transformed into a Key matrix (Key), a Query matrix (Query), and a Value matrix (Value); a Similarity matrix (Similarity) is then computed from the Key and Query matrices, the Similarity matrix is normalized, and finally the normalized Similarity matrix is weighted with the Value matrix to obtain the Attention matrix (Attention); specifically:
(a) superimpose the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), then transform the result into the Key matrix (Key), Query matrix (Query), and Value matrix (Value):

$$Q_t = H_t W_q, \qquad K_t = H_t W_k, \qquad V_t = H_t W_v$$

where $W_q$ is the Query matrix weight parameter, $W_k$ is the Key matrix weight parameter, $W_v$ is the Value matrix weight parameter, and $H_t$ denotes the output of the bidirectional recurrent neural network BiGRU and the FR semantic understanding module in the information extraction module at time t;
(b) compute the Similarity matrix (Similarity) from the Key and Query matrices:

Similarity(Query, Key) = Query × Keyᵀ (2.14)
(c) normalize each row of the similarity matrix:

$$a_{ij} = \frac{e^{\mathrm{Similarity}_{ij}}}{\sum_{j=1}^{n} e^{\mathrm{Similarity}_{ij}}}$$

where $a_{ij}$ is the value of the normalized similarity matrix at row i, column j; n is the number of elements in each row of the similarity matrix; $\mathrm{Similarity}_{ij}$ is the value of the similarity matrix at row i, column j; and $e^{\mathrm{Similarity}_{ij}}$ denotes exponentiation with base e and exponent $\mathrm{Similarity}_{ij}$;
(d) weight the normalized similarity matrix with the value matrix to obtain the Attention matrix (Attention):

$$\mathrm{attention}_{ij} = \sum_{k=1}^{l} a_{ik}\,\mathrm{value}_{kj}$$

where $\mathrm{attention}_{ij}$ is the value of the Attention matrix at row i, column j, $\mathrm{value}_{kj}$ is the value of the value matrix at row k, column j, and l is the number of elements in each column of the normalized similarity matrix;
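Steps (a) through (d) above can be sketched compactly in NumPy. The weight matrices and sizes here are illustrative stand-ins, and the row-max subtraction is a standard numerical-stability step added for the sketch, not part of the patent's formulas.

```python
import numpy as np

def self_attention(H, Wq, Wk, Wv):
    """Q/K/V projection, similarity, row-wise normalization, weighting."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv            # (a) Query/Key/Value matrices
    sim = Q @ K.T                               # (b) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stabilization (added)
    A = np.exp(sim)
    A = A / A.sum(axis=1, keepdims=True)        # (c) normalize each row
    return A @ V                                # (d) attention matrix
```

For an input of 6 rows (words) with feature dimension 4 and square 4x4 projection weights, the output attention matrix keeps the shape (6, 4): each row is a relevance-weighted mixture of the value rows.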
the output layer comprises a fully connected layer and a Sigmoid activation function, and is used to judge whether an output word is erroneous.
Another objective of the present invention is to provide a chinese text error detection system based on joint analysis of word order and semantics, which includes:
the data preprocessing module is used for converting the text data into 768-dimensional vectors;
and the Chinese text error detection module realizes Chinese text error detection by using a Chinese text error detection model based on the word order and semantic combined analysis.
The technical scheme provided by the invention has the following beneficial effects:
(1) the invention adopts a semantic understanding module (FR) composed of a fully convolutional network (FCN) and a residual network (ResNet), which has two advantages: first, the FCN treats one-dimensional text data as a one-dimensional picture and understands the text semantics, remedying the lack of semantic processing means in the prior art; second, ResNet deepens the network, increases the number of features, and deepens the understanding of text semantics.
(2) The invention uses a bidirectional gated recurrent neural network (BiGRU) to fit the text data, which has two advantages: first, the gated recurrent unit (GRU) avoids the inability of the ordinary recurrent neural network (RNN) to fit long sentences; second, past and future text information are used together, so the current text is fitted with more feature information.
(3) The invention superimposes the output of the semantic understanding module (FR) and the output of the bidirectional gated recurrent network (BiGRU), avoiding the loss of timing information in the pooling and padding layers of the fully convolutional network.
(4) The invention adopts a Self-Attention mechanism, which has two advantages: first, the attention mechanism assigns weights automatically, giving larger weights to more closely related words to indicate a higher degree of relevance; second, the Self-Attention mechanism is robust to interference, effectively avoiding the semantic disturbance caused by erroneous words.
Drawings
FIG. 1 is a flow chart according to the present invention;
FIG. 2 is a diagram of a semantic understanding module architecture (dashed lines in the figure are residual network connections);
FIG. 3 is a diagram of the bidirectional gated recurrent network;
FIG. 4 is a diagram of the residual network architecture;
FIG. 5 is a diagram of a model structure;
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. The specific flow description is shown in fig. 1, wherein:
step 1: input data acquired by the model is preprocessed.
The pretreatment process comprises the following four steps:
1-1 Create a dictionary. Split all text sentences into characters to construct a candidate Chinese character set, count the frequency of each character in this set, filter out characters with frequency lower than 3, and deduplicate the filtered set to form the Chinese character set D(w). Insert special symbols into the Chinese character set D(w), such as the "START" initiator, "END" terminator, "CLS" spacer, "unknown" symbol, and "PAD" filler; these symbols help the computer fit the text better. Then index each character in D(w), each character having a unique mapping, to form the dictionary Dic(w, k).
1-2 Data Token-ization. The data takes the form of sentences: add the "START" initiator at the beginning of each sentence, "CLS" spacers within the sentence, and the "END" terminator at the end of the sentence; characters that do not appear in the dictionary are replaced by the "unknown" symbol. In addition, sentences are not of fixed length, so the length must be processed: long sentences are truncated to the fixed length, and short sentences are padded with the "PAD" filler.
1-3 data serialization: converting each word in the Token-quantized text into a dictionary index by using the dictionary Dic (w, k) obtained in the step 1-1.
1-4 Word-embedding mapping. The dictionary contains too many characters, and one-hot coding would produce a sparse matrix, wasting storage space and slowing computation. The word Embedding technique therefore maps the serialized index of each character to a 768-dimensional vector.
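A minimal Python sketch of steps 1-1 through 1-3: the frequency threshold of 3, the special symbols, truncation, and "PAD" padding follow the text, while the function names, the toy fixed length of 8, and the omission of the trained 768-dimensional embedding lookup are illustrative simplifications.

```python
from collections import Counter

SPECIALS = ["PAD", "unknown", "START", "END", "CLS"]

def build_dict(sentences, min_freq=3):
    """Step 1-1: count characters, filter low-frequency ones, build Dic(w, k)."""
    freq = Counter(ch for s in sentences for ch in s)
    kept = sorted(ch for ch, c in freq.items() if c >= min_freq)
    return {w: i for i, w in enumerate(SPECIALS + kept)}

def serialize(sentence, dic, fixed_len=8):
    """Steps 1-2 and 1-3: add identifiers, fix the length, map to indexes."""
    tokens = ["START"] + list(sentence) + ["END"]       # add identifiers
    tokens = tokens[:fixed_len]                         # truncate long sentences
    tokens += ["PAD"] * (fixed_len - len(tokens))       # pad short sentences
    return [dic.get(t, dic["unknown"]) for t in tokens] # index serialization
```

A full pipeline would then feed each index list through an embedding table to obtain the 768-dimensional vectors of step 1-4.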
Step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the language sequence and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic joint analysis comprises an information extraction module, a Self-Attention module (Self-Attention) and an output layer;
the information extraction module comprises a bidirectional gated recurrent neural network (BiGRU) and a semantic understanding module (FR);
The semantic understanding module has two advantages: first, the fully convolutional network (FCN) treats one-dimensional text data as a one-dimensional picture and understands the text semantics, remedying the lack of semantic processing means in the prior art; second, the residual network (ResNet) deepens the network, increases the number of features, and deepens the understanding of text semantics. The bidirectional gated recurrent network has two advantages: first, the gated recurrent unit (GRU) avoids the inability of the ordinary recurrent neural network (RNN) to fit long sentences; second, past and future text information are used together, so the current text is fitted with more feature information.
Superimposing the output of the semantic understanding module and the output of the bidirectional gated recurrent network avoids the loss of timing information in the pooling and padding layers of the fully convolutional network.
The input of the bidirectional gated recurrent neural network (BiGRU) is the 768-dimensional vector preprocessed in step 1 together with the hidden state at the previous time step, and it is used to extract text timing information; specifically:
the bidirectional gated loop unit model comprises two gated loop units (GRUs);
the GRU has a reset gate R and an update gate Z; the reset gate $R_t$ and update gate $Z_t$ at time t are computed as:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$
$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$

where $X_t$ is the mapped 768-dimensional vector from step 1 at time t, $H_{t-1}$ is the hidden state at time t-1, $W_{xr}$ and $W_{xz}$ are the input weight parameters of the reset gate and the update gate, $W_{hr}$ and $W_{hz}$ are their hidden-state weight parameters, and $b_r$ and $b_z$ are their bias parameters. $\sigma$ is the Sigmoid function, which constrains the reset gate and update gate to values between 0 and 1.

The candidate hidden state $\tilde{H}_t$ is computed as:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$

where $W_{xh}$ is the input weight parameter of the candidate hidden state, $W_{hh}$ is its weight parameter with respect to the hidden state, $b_h$ is its bias parameter, and tanh is the activation function.

The update gate produces the hidden state $H_t$ at the current time:

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$

Of the two gated recurrent units (GRU), one takes the input in forward order and one in reverse order, yielding the forward hidden state $\overrightarrow{H}_t$ and the reverse hidden state $\overleftarrow{H}_t$:

$$\overrightarrow{H}_t = \mathrm{GRU}(X_t, \overrightarrow{H}_{t-1}), \qquad \overleftarrow{H}_t = \mathrm{GRU}(X_t, \overleftarrow{H}_{t+1})$$

where $\overrightarrow{H}_t$ denotes the forward hidden state generated by running the GRU over the sequence in order, and $\overleftarrow{H}_t$ denotes the reverse hidden state generated by running the GRU in reverse order. The hidden state H is generated not by simple addition but by concatenation:

$$H_t = [\overrightarrow{H}_t ; \overleftarrow{H}_t]$$
The input of the semantic understanding module (FR) is the 768-dimensional vector preprocessed in step 1, and it is used to extract text semantic information. The input of the first unit is the 768-dimensional vector preprocessed in step 1; the input of the second unit is that 768-dimensional vector together with the output of the first unit; the input of the third unit is the output of the first unit together with the output of the second unit, and so on;
each unit comprises a fully convolutional network (FCN) consisting of a convolutional layer, a ReLU activation function, an average pooling layer, a deconvolution layer, and an improved Sigmoid activation function; the units are connected by residual (ResNet) links;
the residual network ResNet combines the outputs of the two preceding units, which can be expressed as:

$$y_t = \mathrm{FCN}(y_{t-1} + y_{t-2}) + y_{t-1} + y_{t-2}$$

where $y_t$ denotes the output of ResNet at time t, $y_{t-1}$ its output at time t-1, and $y_{t-2}$ its output at time t-2.
The input of the Self-Attention module (Self-Attention) is the superimposed output of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), and it is used to assign word weights. The input is transformed into a Key matrix (Key), a Query matrix (Query), and a Value matrix (Value); a Similarity matrix (Similarity) is then computed from the Key and Query matrices, the Similarity matrix is normalized, and finally the normalized Similarity matrix is weighted with the Value matrix to obtain the Attention matrix (Attention); specifically:
(a) superimpose the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), then transform the result into the Key matrix (Key), Query matrix (Query), and Value matrix (Value):

$$Q_t = H_t W_q, \qquad K_t = H_t W_k, \qquad V_t = H_t W_v$$

where $W_q$ is the Query matrix weight parameter, $W_k$ is the Key matrix weight parameter, and $W_v$ is the Value matrix weight parameter.
(b) Compute the Similarity matrix (Similarity) from the Key and Query matrices:

Similarity(Query, Key) = Query × Keyᵀ (2.14)
(c) normalize each row of the similarity matrix:

$$a_{ij} = \frac{e^{\mathrm{Similarity}_{ij}}}{\sum_{j=1}^{n} e^{\mathrm{Similarity}_{ij}}}$$

where $a_{ij}$ is the value of the normalized similarity matrix at row i, column j, and n is the number of elements in each row.
(d) Weight the normalized similarity matrix with the value matrix to obtain the Attention matrix (Attention):

$$\mathrm{attention}_{ij} = \sum_{k=1}^{l} a_{ik}\,\mathrm{value}_{kj}$$

where $\mathrm{attention}_{ij}$ is the value of the Attention matrix at row i, column j, $\mathrm{value}_{kj}$ is the value of the value matrix at row k, column j, and l is the number of elements summed over.
The output layer comprises a fully connected layer and a Sigmoid activation function. Its input comes from the Attention matrix; the fully connected layer and activation function output the probability that each word is erroneous, and if this probability exceeds 0.5 the word is judged to be wrongly written.
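The decision rule of the output layer above (fully connected layer, Sigmoid, 0.5 threshold) can be sketched as below; `W_out` and `b_out` are illustrative placeholder parameters rather than the patent's trained values.

```python
import numpy as np

def detect_errors(attention, W_out, b_out, threshold=0.5):
    """Map each attention row to an error probability and apply the 0.5 rule."""
    logits = attention @ W_out + b_out     # fully connected layer
    probs = 1.0 / (1.0 + np.exp(-logits))  # Sigmoid: per-word error probability
    return probs, probs > threshold        # True = judged wrongly written
```

Each row of the attention matrix corresponds to one word, so the boolean vector returned marks exactly which words the model flags as errors.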
The Self-Attention mechanism has two advantages: first, the attention mechanism assigns weights automatically, giving larger weights to more closely related words to indicate a higher degree of relevance; second, it is robust to interference, effectively avoiding the semantic disturbance caused by erroneous words.
Training uses the self-collected dataset Merge, and performance is evaluated on the public Chinese spelling dataset SIGHAN15. The model is run on this dataset to predict wrongly written characters, and the resulting metrics are compared. The following table shows the sizes of the Merge and SIGHAN15 datasets.
| | Merge | SIGHAN15 |
| --- | --- | --- |
| Number of paragraphs | 2390 | 1100 |
| Number of errors | 3740 | 1602 |
The performance evaluation indexes adopted by the invention are Precision, Recall, F1, and F0.5.
| | True value = 1 | True value = 0 |
| --- | --- | --- |
| Predicted value = 1 | TP (True Positive) | FP (False Positive) |
| Predicted value = 0 | FN (False Negative) | TN (True Negative) |
Precision: among all samples predicted as positive, the probability that the sample is actually positive.
Recall: among all samples that are actually positive, the probability that the sample is predicted as positive.
F1 and F0.5 find a balance point between the two, taking both precision and recall into account as comprehensive measures of model quality.
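The three measures can be written down directly from the confusion-matrix counts; the sketch below uses the standard F-beta definition, with beta = 1 and beta = 0.5 giving F1 and F0.5 respectively.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, Recall and F-beta computed from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # predicted positive that are correct
    recall = tp / (tp + fn)                     # actual positives that are found
    b2 = beta * beta                            # beta < 1 emphasizes precision
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

With beta = 0.5 the score weights precision more heavily than recall, which is why F0.5 is reported alongside F1 for error detection, where false alarms are costly.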
The following table shows the results of the error detection experiments of the present invention on the SIGHAN15 dataset:
| | Precision (%) | Recall (%) | F1 (%) | F0.5 (%) |
| --- | --- | --- | --- | --- |
| LSTM | 56.16 | 47.03 | 51.19 | 54.06 |
| GRU | 70.17 | 46.18 | 55.70 | 63.57 |
| BiGRU-CNN | 81.94 | 89.38 | 85.50 | 83.33 |
| BiGRU-Attention | 64.45 | 99.06 | 78.09 | 69.29 |
| RFRA | 84.60 | 98.01 | 90.81 | 87.00 |
In the above table of Chinese text error detection results, LSTM and GRU are conventional recurrent-neural-network detectors, BiGRU-CNN combines a recurrent neural network with a convolutional neural network, and BiGRU-Attention combines a recurrent neural network with an attention mechanism. RFRA is the Chinese text error detection model based on joint word-order and semantic analysis of the present invention.
Claims (8)
1. A Chinese text error detection method based on word order and semantic joint analysis is characterized by comprising the following steps:
step 1: preprocessing data;
1-1, acquiring original text data, dividing all texts in the original text data according to word levels, and constructing a Chinese character set D (w); inserting identifiers into the Chinese character set D (w), and then marking the Chinese character set D (w) by using indexes, wherein each word corresponds to a dictionary index to form a dictionary Dic (w, k);
1-2, converting the text sentences in the original text data into Tokens, adding identifiers, and fixing the sentence length;
1-3 serializing the text sentences after the Token conversion in the step 1-2 according to the dictionary index in the step 1-1;
1-4 mapping the data subjected to index serialization in the step 1-3 into 768-dimensional vectors by a word Embedding technology;
step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the word order and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic combined analysis comprises an information extraction module, a Self-Attention module Self-Attention and an output layer;
the information extraction module comprises a bidirectional gating recurrent neural network (BiGRU) and a semantic understanding module FR;
the input of the semantic understanding module FR is 768-dimensional vectors preprocessed in the step 1 and used for extracting text semantic information; the system comprises a plurality of semantic understanding units, wherein each semantic understanding unit comprises a full convolution neural network (FCN); each semantic understanding unit is connected by adopting a residual error network ResNet and adopts an improved Sigmoid function; the input of each semantic understanding unit is the output of the first two layers of units;
the input of the Self-Attention module Self-Attention is the superimposed output of the bidirectional gated recurrent neural network BiGRU and the semantic understanding module FR, and is used for assigning word weights; the input is transformed into a Key matrix Key, a Query matrix Query, and a Value matrix Value, then a Similarity matrix Similarity is computed from the Key and Query matrices, the Similarity matrix is normalized, and finally the normalized Similarity matrix is weighted with the Value matrix to obtain the Attention matrix Attention;
the output layer is used for judging whether the output word has errors.
2. The method of claim 1, wherein adding identifiers in step 1-2 comprises adding a "START" initiator at the beginning of the sentence, "CLS" spacers within the sentence, and an "END" terminator at the end of the sentence.
3. The method of claim 1, wherein fixing the sentence length in step 1-2 comprises truncating the overlong part of long sentences and padding short sentences to the fixed length with the "PAD" symbol.
4. The method of claim 1, wherein the input of the bidirectional gated recurrent neural network BiGRU is the 768-dimensional vector preprocessed in step 1 together with the hidden state it generated at the previous time step, used for extracting text timing information; specifically:
the bidirectional gating cycle unit model comprises two gating cycle units GRU;
the GRU has a reset gate R and an update gate Z, the reset gate R at time ttUpdate gate Z with time ttThe calculation is as follows:
whereinIs the mapped 768-dimensional vector, H, from step 1 at time tt-1Is a hidden state at time t-1, WxrIs to reset the gate input weight parameter, WxzIs to update the gate input weight parameter, WhrIs to reset the gate hidden state weight parameter, WhzIs to update the door hidden state weight parameter, brrAnd brzBias parameters for the reset gate and the update gate, respectively; sigma is a Sigmoid function, and the size range of the reset gate and the updating gate is controlled to be between 0 and 1;
wherein WxhIs a candidate hidden state input weight parameter, WhhIs a weight parameter of the candidate hidden state with respect to the hidden state, bhIs a candidate hidden state bias parameter, tahn is an activation function;
updating the gate for generating the hidden state H at the current momenttThe calculation is expressed as follows:
one of the two gated loop units GRU is a forward input,one is a reverse input, which is in a forward hidden stateAnd reverse hidden stateThe calculation is expressed as follows:
whereinIndicating that the hidden states are generated sequentially using GRUs,indicating that the GRU is used in reverse to generate the hidden state,indicating a forward hidden state at time t,representing a reverse hidden state at the time t;
the hidden state H is generated specifically as follows:
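The GRU equations of claim 4 can be sketched directly in NumPy. The parameter dictionary, hidden size, and random initialization below are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following the claim's equations (⊙ is elementwise product)."""
    r = sigmoid(x_t @ p["Wxr"] + h_prev @ p["Whr"] + p["br"])             # reset gate R_t
    z = sigmoid(x_t @ p["Wxz"] + h_prev @ p["Whz"] + p["bz"])             # update gate Z_t
    h_cand = np.tanh(x_t @ p["Wxh"] + (r * h_prev) @ p["Whh"] + p["bh"])  # candidate state
    return z * h_prev + (1 - z) * h_cand                                  # hidden state H_t

def bigru(xs, p_fwd, p_bwd, hidden):
    """Run one GRU forward and one in reverse, then concatenate per time step."""
    fwd, bwd = [], []
    h = np.zeros(hidden)
    for x in xs:                          # forward pass
        h = gru_step(x, h, p_fwd); fwd.append(h)
    h = np.zeros(hidden)
    for x in reversed(xs):                # reverse pass
        h = gru_step(x, h, p_bwd); bwd.append(h)
    return [np.concatenate([f, b]) for f, b in zip(fwd, reversed(bwd))]

rng = np.random.default_rng(0)
d_in, d_h = 768, 64                       # input dim per the claim; hidden size assumed
def params():
    return {k: rng.standard_normal((d_in if k.startswith("Wx") else d_h, d_h)) * 0.01
            for k in ("Wxr", "Wxz", "Wxh", "Whr", "Whz", "Whh")} | \
           {k: np.zeros(d_h) for k in ("br", "bz", "bh")}

xs = rng.standard_normal((5, d_in))       # a toy 5-token sentence
hs = bigru(xs, params(), params(), d_h)
print(len(hs), hs[0].shape)               # 5 hidden states, each of size 2 * d_h
```

Concatenating the forward and reverse states gives each token a representation informed by both its left and right context.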
5. The method of claim 1 wherein said residual network ResNet calculation is expressed as follows:
the improved Sigmoid function calculation formula is as follows:
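The residual and improved-Sigmoid formulas of claim 5 are not reproduced in this extraction. As a hedged sketch, the standard ResNet identity shortcut y = F(x) + x is shown below, with the ordinary Sigmoid standing in for the patent's unspecified improved variant:

```python
import numpy as np

def sigmoid(x):
    """Standard Sigmoid — a stand-in; the patent's improved variant is not shown here."""
    return 1 / (1 + np.exp(-x))

def residual_unit(x, f):
    """Identity-shortcut residual connection in its standard ResNet form: y = F(x) + x."""
    return f(x) + x

x = np.linspace(-2.0, 2.0, 5)
y = residual_unit(x, sigmoid)   # toy F(x); the claim's exact unit function is not shown
print(y.shape)
```

The shortcut lets each semantic understanding unit learn only a residual correction, which keeps gradients well-behaved as units are stacked.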
6. The method of claim 1, wherein the Self-Attention module operates as follows:

(a) the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) are superposed and then projected into a Key matrix (Key), a Query matrix (Query) and a Value matrix (Value); specifically:

Query = H_t W_q
Key = H_t W_k
Value = H_t W_v

where W_q is the Query matrix weight parameter, W_k is the Key matrix weight parameter, and W_v is the Value matrix weight parameter; H_t denotes the output of the bidirectional recurrent neural network BiGRU and the FR semantic understanding module in the information extraction module at time t;
(b) the Similarity matrix (Similarity) is calculated from the Key matrix and the Query matrix:
Similarity(Query,Key)=Query×Key (2.14)
(c) each row of the Similarity matrix is normalized:

a_ij = e^{Similarity_ij} / Σ_{j=1}^{n} e^{Similarity_ij}

where a_ij represents the value of the normalized similarity matrix at row i, column j, and n represents the number of elements in each row of the similarity matrix; Similarity_ij represents the value of the similarity matrix at row i, column j, and e^{Similarity_ij} denotes raising e to the power Similarity_ij;
(d) the normalized similarity matrix and the Value matrix are weighted to obtain the Attention matrix (Attention):

attention_ij = Σ_{k=1}^{l} a_ik value_kj

where attention_ij represents the value of the Attention matrix at row i, column j, value_ij represents the value of the Value matrix at row i, column j, and l represents the number of elements in each column of the normalized similarity matrix.
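Steps (a) through (d) of claim 6 map directly onto a few matrix operations. The sketch below uses toy dimensions and random weights, and takes the similarity as Query × Keyᵀ so the matrix shapes agree:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 16                       # toy sequence length and feature size
H = rng.standard_normal((n, d))    # stand-in for the superposed BiGRU + FR output

# (a) derive Query, Key, Value with separate weight matrices
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Query, Key, Value = H @ Wq, H @ Wk, H @ Wv

# (b) similarity matrix (Key transposed so an n×n matrix results)
Similarity = Query @ Key.T

# (c) normalize each row with a softmax
e = np.exp(Similarity - Similarity.max(axis=1, keepdims=True))  # stabilized exponentials
A = e / e.sum(axis=1, keepdims=True)                            # each row sums to 1

# (d) weight the Value matrix by the normalized similarities
Attention = A @ Value
print(Attention.shape)  # (6, 16)
```

Row i of `Attention` is a similarity-weighted mixture of all value vectors, which is how the module redistributes weight across the words of the sentence.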
7. The method of claim 1, wherein the output layer comprises two fully connected layers (Fully Connected Layer) and two GELU activation functions.
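A minimal sketch of such an output layer follows, assuming the widely used tanh approximation of GELU and a per-token two-class (correct/error) output; the actual layer sizes are not specified in the claim:

```python
import math
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def output_layer(h, W1, b1, W2, b2):
    """Two fully connected layers, each followed by a GELU, per claim 7."""
    return gelu(gelu(h @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(2)
n, d = 6, 16                                   # toy token count and feature size
h = rng.standard_normal((n, d))                # stand-in for the attention output
W1, b1 = rng.standard_normal((d, d)) * 0.1, np.zeros(d)
W2, b2 = rng.standard_normal((d, 2)) * 0.1, np.zeros(2)  # assumed 2 classes per token
logits = output_layer(h, W1, b1, W2, b2)
print(logits.shape)  # (6, 2)
```

An argmax over the last axis then yields a correct/error decision for each token.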
8. A Chinese text error detection system based on language sequence and semantic joint analysis, characterized by comprising:
a data preprocessing module, used for converting the text data into 768-dimensional vectors; and
a Chinese text error detection module, used for performing Chinese text error detection with the Chinese text error detection model based on language sequence and semantic joint analysis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210178120.5A CN114548116A (en) | 2022-02-25 | 2022-02-25 | Chinese text error detection method and system based on language sequence and semantic joint analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114548116A true CN114548116A (en) | 2022-05-27 |
Family
ID=81678632
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210178120.5A Pending CN114548116A (en) | 2022-02-25 | 2022-02-25 | Chinese text error detection method and system based on language sequence and semantic joint analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114548116A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115886830A (en) * | 2022-12-09 | 2023-04-04 | 中科南京智能技术研究院 | Twelve-lead electrocardiogram classification method and system |
CN116975863A (en) * | 2023-07-10 | 2023-10-31 | 福州大学 | Malicious code detection method based on convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||