CN114548116A - Chinese text error detection method and system based on language sequence and semantic joint analysis - Google Patents

Chinese text error detection method and system based on language sequence and semantic joint analysis

Info

Publication number
CN114548116A
Authority
CN
China
Prior art keywords
matrix
attention
text
hidden state
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210178120.5A
Other languages
Chinese (zh)
Inventor
周仁杰
沈佳冰
任永坚
张纪林
万健
曾艳
寇亮
袁俊峰
王星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210178120.5A
Publication of CN114548116A
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/103: Formatting, i.e. changing of presentation of documents
    • G06F 40/117: Tagging; Marking up; Designating a block; Setting of attributes
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese text error detection method and system based on joint analysis of word order and semantics. To address the inability of existing Chinese text error detection methods to deeply understand the semantics of Chinese text and to assign weights automatically, a Chinese text error prediction model is designed that treats the text as a one-dimensional picture, fits the text with a bidirectional recurrent neural network, and assigns weights with a self-attention mechanism. The invention adopts a semantic understanding module (FR) composed of a fully convolutional network (FCN) and a residual network (ResNet), which has two advantages: first, the FCN treats one-dimensional text data as a one-dimensional picture to understand text semantics, remedying the lack of semantic processing means in the prior art; second, ResNet deepens the network, increases the number of features, and deepens the understanding of text semantics.

Description

Chinese text error detection method and system based on language sequence and semantic joint analysis
Technical Field
The invention belongs to the field of Chinese text processing, text cleaning and text error detection, and relates to a Chinese text error detection method and system based on word order and semantic joint analysis.
Background
With the development of science and technology and the spread of 4G and 5G, society is increasingly informatized: online and remote work are commonplace, and the paperless era has arrived. With paperless work, information is increasingly stored on storage devices as electronic text. Because of the particularity of text, a slight difference can carry a completely different meaning, and adding a single word can change the meaning of a whole sentence. Such problems cause great trouble and loss. For texts such as official documents, academic papers, legal instruments, and case files, the information is precious, and a misunderstanding often brings unpredictable consequences.
Chinese is among the most complex and elegant languages in the world. This complexity and beauty bring variability: sentences with the same words or characters can take on different meanings in different contexts, and an error in a Chinese text can greatly change the meaning of the whole passage. For example, because many characters are similar in form or pronunciation, people often write a character with a different meaning in place of the intended one. China is a vast, multi-ethnic country: people in different regions speak different dialects, pronounce the same character differently, and often describe the same thing differently. These problems also await solutions. In addition, Chinese text correction has long lacked shared resources and common practice, so error detection for Chinese text in real scenarios has become a hot spot of current research.
Successfully solving these problems can free people from heavy, mechanical manual error detection and comparison. Using people to compare errors first increases cost; second, many professional errors can only be identified by people with domain knowledge, which wastes human resources. Proposing a solution to these problems is imperative.
Across text error detection technology, mainstream methods such as convolutional neural networks and recurrent neural networks have achieved good results, but their performance in the Chinese text domain is not ideal. The main reason is that the semantics of Chinese text are complex: a model must understand the semantics and detect errors on the basis of that understanding. For example, the original sentence "Xiaosheng has a strong desire to live" and the erroneous sentence "Xiaosheng has a strong desire to win" are both structurally well-formed, but only one word choice is correct given the context. Current mainstream technology struggles to mine such word-level semantic problems, so errors cannot be detected well. Moreover, the interrelationships between different words differ, and different weights must be assigned to express these correlations; existing methods do not assign such weights well.
Disclosure of Invention
One objective of the present invention is to provide a Chinese text error detection method based on joint analysis of word order and semantics. The method can take both semantic understanding and word weight assignment into account while fitting the text.
The technical scheme adopted by the invention is as follows:
step 1: preprocessing data;
1-1, acquiring original text data, dividing all texts in the original text data according to word levels, and constructing a Chinese character set D (w); inserting identifiers into the Chinese character set D (w), and then marking the Chinese character set D (w) by using indexes, wherein each word corresponds to a dictionary index to form a dictionary Dic (w, k);
1-2, converting a text statement in original text data into Token, adding an identifier, and fixing the sentence length;
preferably, adding identifiers in step 1-2 means adding a "START" initiator at the beginning of the sentence, a "CLS" spacer within the sentence, and an "END" terminator at the end of the sentence;
preferably, fixing the sentence length means truncating the over-length part of long sentences and padding short sentences to the fixed length with the "PAD" symbol;
1-3 serializing the text sentences after the Token conversion in the step 1-2 according to the dictionary index in the step 1-1;
1-4 mapping the data subjected to index serialization in the step 1-3 into 768-dimensional vectors by a word Embedding (Embedding) technology;
step 2: Chinese text error detection is realized through the Chinese text error detection model RFRA based on joint analysis of word order and semantics;
the Chinese text error detection model based on joint analysis of word order and semantics comprises an information extraction module, a Self-Attention module (Self-Attention), and an output layer;
the information extraction module comprises a bidirectional gated recurrent neural network (BiGRU) and a semantic understanding module (FR);
the input of the bidirectional gated recurrent neural network (BiGRU) is the 768-dimensional vector preprocessed in step 1 together with the hidden state the network generated at the previous moment, and it is used to extract text timing information; specifically:
the bidirectional gated recurrent unit model comprises two gated recurrent units (GRUs);
each GRU has a reset gate R and an update gate Z; the reset gate R_t at time t and the update gate Z_t at time t are calculated as follows:
R_t = σ(X_t W_xr + H_{t-1} W_hr + b_rr)   (2.1)
Z_t = σ(X_t W_xz + H_{t-1} W_hz + b_rz)   (2.2)
wherein X_t is the mapped 768-dimensional vector from step 1 at time t, H_{t-1} is the hidden state at time t-1, W_xr is the reset gate input weight parameter, W_xz is the update gate input weight parameter, W_hr is the reset gate hidden state weight parameter, W_hz is the update gate hidden state weight parameter, and b_rr and b_rz are the bias parameters of the reset gate and the update gate, respectively; σ is the Sigmoid function, which constrains the values of the reset gate and the update gate to the range 0 to 1;
the reset gate is used to generate the candidate hidden state H̃_t, calculated as follows:
H̃_t = tanh(X_t W_xh + (R_t ⊙ H_{t-1}) W_hh + b_h)   (2.3)
wherein W_xh is the candidate hidden state input weight parameter, W_hh is the weight parameter of the candidate hidden state with respect to the hidden state, b_h is the candidate hidden state bias parameter, and tanh is the activation function;
the update gate is used to generate the hidden state H_t at the current moment, calculated as follows:
H_t = Z_t ⊙ H_{t-1} + (1 - Z_t) ⊙ H̃_t   (2.4)
wherein ⊙ denotes the Hadamard product, i.e. element-wise multiplication;
of the two gated recurrent units (GRUs), one receives the forward input and the other the reversed input, producing the forward hidden state →H_t and the reverse hidden state ←H_t, calculated as follows:
→H_t = GRU(X_t, →H_{t-1})   (2.5)
←H_t = GRU(X_t, ←H_{t+1})   (2.6)
wherein GRU(X_t, →H_{t-1}) indicates generating hidden states with the GRU in forward order, GRU(X_t, ←H_{t+1}) indicates generating hidden states with the GRU in reverse order, →H_t denotes the forward hidden state at time t, and ←H_t denotes the reverse hidden state at time t;
the hidden state H_t is generated not by simple addition but by concatenation, as follows:
H_t = →H_t ⊕ ←H_t   (2.7)
wherein ⊕ denotes the dimension concatenation operation;
the input of the semantic understanding module (FR) is the 768-dimensional vector preprocessed in step 1, and it is used to extract text semantic information; the module comprises a plurality of semantic understanding units, each comprising a fully convolutional network (FCN); the units are connected by a residual network (ResNet) and use an improved Sigmoid function; the input of each semantic understanding unit is the outputs of the preceding two units;
the residual network ResNet is calculated as follows:
O_t^(l) = FCN(O_t^(l-1)) + O_t^(l-2)   (2.8)
wherein O_t^(l) denotes the output of ResNet for the l-th semantic understanding unit at time t, and O_t^(l-1) and O_t^(l-2) denote the outputs of the preceding two semantic understanding units at time t; the improved Sigmoid activation function (2.9) is applied as the final activation inside each unit;
the input of the Self-Attention module (Self-Attention) is the superposed output of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), and it is used to assign word weights; the input is transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value), then a Similarity matrix (Similarity) is calculated from the key matrix and the query matrix, the similarity matrix is normalized, and finally the normalized similarity matrix and the value matrix are weighted to obtain the Attention matrix (Attention); specifically:
(a) the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) are superposed and then transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value); specifically
I_t = O_t^BiGRU + O_t^FR   (2.10)
Query = I_t × W_q   (2.11)
Key = I_t × W_k   (2.12)
Value = I_t × W_v   (2.13)
wherein W_q is the query matrix weight parameter, W_k is the key matrix weight parameter, and W_v is the value matrix weight parameter; I_t denotes the superposed output, at time t, of the bidirectional recurrent neural network BiGRU and the FR semantic understanding module in the information extraction module;
(b) calculating a Similarity matrix (Similarity) from the query matrix and the key matrix:
Similarity(Query, Key) = Query × Key^T   (2.14)
(c) normalizing each row of the similarity matrix:
a_ij = exp(Similarity_ij) / Σ_{k=1..n} exp(Similarity_ik)   (2.15)
wherein a_ij denotes the value of the normalized similarity matrix at row i, column j; n denotes the number of elements in each row of the similarity matrix; Similarity_ij denotes the value of the similarity matrix at row i, column j; and exp(Similarity_ij) denotes the power of e with Similarity_ij as the exponent;
(d) weighting the normalized similarity matrix and the value matrix to obtain the Attention matrix (Attention):
attention_ij = Σ_{k=1..l} a_ik × value_kj   (2.16)
wherein attention_ij denotes the value of the Attention matrix Attention at row i, column j; value_kj denotes the value of the value matrix at row k, column j; and l denotes the number of elements in each column of the normalized similarity matrix;
the output layer comprises a fully connected layer (Fully Connected Layer) and a Sigmoid activation function, and is used to judge whether an output word is erroneous.
Another objective of the present invention is to provide a Chinese text error detection system based on joint analysis of word order and semantics, which comprises:
a data preprocessing module for converting text data into 768-dimensional vectors;
and a Chinese text error detection module that realizes Chinese text error detection using the Chinese text error detection model based on joint analysis of word order and semantics.
The technical scheme provided by the invention has the following beneficial effects:
(1) The invention adopts a semantic understanding module (FR) composed of a fully convolutional network (FCN) and a residual network (ResNet), which has two advantages: first, the FCN treats one-dimensional text data as a one-dimensional picture to understand text semantics, remedying the lack of semantic processing means in the prior art; second, ResNet deepens the network, increases the number of features, and deepens the understanding of text semantics.
(2) The invention uses a bidirectional gated recurrent neural network (BiGRU) to fit the text data, which has two advantages: first, the gated recurrent unit (GRU) avoids the defect that an ordinary recurrent neural network (RNN) cannot fit long sentences; second, past and future text information are used simultaneously, so the current text is fitted with more feature information.
(3) The invention superposes the output of the semantic understanding module (FR) and the output of the bidirectional gated recurrent network (BiGRU), avoiding the loss of timing information when it passes through the pooling and padding layers of the fully convolutional network.
(4) The invention adopts a self-attention mechanism (Self-Attention), which has two advantages: first, the attention mechanism (Attention) can assign weights automatically, and words with closer relations are assigned larger weights, indicating a higher degree of correlation; second, the self-attention mechanism (Self-Attention) is resistant to interference, effectively avoiding the disturbance that erroneous words bring to the semantics.
Drawings
FIG. 1 is a flow chart according to the present invention;
FIG. 2 is a diagram of the semantic understanding module architecture (dashed lines in the figure are residual network connections);
FIG. 3 is a diagram of the bidirectional gated recurrent network;
FIG. 4 is a diagram of the residual network architecture;
FIG. 5 is a diagram of the model structure.
Detailed Description
Embodiments of the present invention will be described in further detail below with reference to the accompanying drawings. The specific flow description is shown in fig. 1, wherein:
step 1: input data acquired by the model is preprocessed.
The preprocessing process comprises the following four steps:
1-1 Create a dictionary. All text sentences are divided into single words to construct a candidate Chinese character set. The occurrence frequency of each word in this candidate set is counted, words with frequency lower than 3 are filtered out, and the filtered set is deduplicated to form the Chinese character set D(w). Special symbols are inserted into D(w), such as the "START" initiator, "END" terminator, "CLS" spacer, "unknown" symbol, and "PAD" filler; these symbols help the computer fit the text better. Each word in the set D(w) is then indexed, each word having a unique mapping, to form the dictionary Dic(w, k).
1-2 Data tokenization. The data are sentences: a "START" initiator is added at the beginning of each sentence, a "CLS" spacer is added within the sentence, an "END" terminator is added at the end of the sentence, and any character that does not appear in the dictionary is replaced by the "unknown" symbol. Sentences are not of fixed length, so the sentence length must be normalized: long sentences have their over-length part truncated, and short sentences have the remainder filled with the "PAD" filler.
1-3 Data serialization: each word in the tokenized text is converted into its dictionary index using the dictionary Dic(w, k) obtained in step 1-1.
1-4 Word embedding mapping. The number of words in the dictionary is too large, and one-hot encoding would produce a sparse matrix, wasting storage space and slowing computation. The word Embedding technique is therefore used to map the serialized index of each word into a 768-dimensional vector.
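As an illustrative sketch only (not part of the patent text), the preprocessing of steps 1-1 to 1-4 could be implemented as follows in Python with PyTorch; the frequency threshold of 3 and the special symbols are taken from the description above, while the function names, the fixed length of 128, and the example sentence are hypothetical:

```python
from collections import Counter

import torch
import torch.nn as nn

SPECIALS = ["PAD", "unknown", "START", "END", "CLS"]

def build_dictionary(sentences, min_freq=3):
    """Step 1-1: split sentences into single characters, count frequencies,
    filter words occurring fewer than min_freq times, and index the rest."""
    freq = Counter(ch for s in sentences for ch in s)
    vocab = SPECIALS + sorted(ch for ch, n in freq.items() if n >= min_freq)
    return {w: k for k, w in enumerate(vocab)}  # the dictionary Dic(w, k)

def tokenize_and_serialize(sentence, dic, fixed_len=128):
    """Steps 1-2 and 1-3: add "START"/"END" identifiers, replace characters not
    in the dictionary with "unknown", fix the length with "PAD", map to indices.
    (Insertion of the "CLS" spacer within the sentence is omitted here.)"""
    tokens = ["START"] + list(sentence) + ["END"]
    tokens = [t if t in dic else "unknown" for t in tokens]
    tokens = tokens[:fixed_len] + ["PAD"] * max(0, fixed_len - len(tokens))
    return torch.tensor([dic[t] for t in tokens])

# Step 1-4: an embedding table maps each index to a 768-dimensional vector.
# dic = build_dictionary(corpus)
# embedding = nn.Embedding(len(dic), 768)
# x = embedding(tokenize_and_serialize("这是一个句子", dic))  # shape (fixed_len, 768)
```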
Step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the language sequence and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic joint analysis comprises an information extraction module, a Self-Attention module (Self-Attention) and an output layer;
the information extraction module comprises a bidirectional gated recurrent neural network (BiGRU) and a semantic understanding module (FR);
the adoption of the semantic understanding module has two advantages: first, the fully convolutional network (FCN) treats one-dimensional text data as a one-dimensional picture to understand text semantics, remedying the lack of semantic processing means in the prior art; second, the residual network (ResNet) deepens the network, increases the number of features, and deepens the understanding of text semantics; the use of the bidirectional gated recurrent network has two advantages: first, the gated recurrent unit (GRU) avoids the defect that an ordinary recurrent neural network (RNN) cannot fit long sentences; second, past and future text information are used simultaneously, so the current text is fitted with more feature information;
the output of the semantic understanding module and the output of the bidirectional gated recurrent network are superposed, which avoids the loss of timing information when it passes through the pooling and padding layers of the fully convolutional network;
the input of the bidirectional gated recurrent neural network (BiGRU) is the 768-dimensional vector preprocessed in step 1 together with the hidden state at the previous moment, and it is used to extract text timing information; specifically:
the bidirectional gated recurrent unit model comprises two gated recurrent units (GRUs);
each GRU has a reset gate R and an update gate Z; the reset gate R_t at time t and the update gate Z_t at time t are calculated as follows:
R_t = σ(X_t W_xr + H_{t-1} W_hr + b_rr)   (2.1)
Z_t = σ(X_t W_xz + H_{t-1} W_hz + b_rz)   (2.2)
wherein X_t is the mapped 768-dimensional vector from step 1 at time t, H_{t-1} is the hidden state at time t-1, W_xr is the reset gate input weight parameter, W_xz is the update gate input weight parameter, W_hr is the reset gate hidden state weight parameter, W_hz is the update gate hidden state weight parameter, and b_rr and b_rz are the bias parameters of the reset gate and the update gate, respectively. σ is the Sigmoid function, which constrains the values of the reset gate and the update gate to the range 0 to 1.
The reset gate may be used to generate the candidate hidden state H̃_t, calculated as follows:
H̃_t = tanh(X_t W_xh + (R_t ⊙ H_{t-1}) W_hh + b_h)   (2.3)
wherein W_xh is the candidate hidden state input weight parameter, W_hh is the weight parameter of the candidate hidden state with respect to the hidden state, b_h is the candidate hidden state bias parameter, and tanh is the activation function.
The update gate may generate the hidden state H_t at the current moment, calculated as follows:
H_t = Z_t ⊙ H_{t-1} + (1 - Z_t) ⊙ H̃_t   (2.4)
wherein ⊙ is the Hadamard product, i.e. element-wise multiplication.
Of the two gated recurrent units (GRUs), one receives the forward input and the other the reversed input, producing the forward hidden state →H_t and the reverse hidden state ←H_t, calculated as follows:
→H_t = GRU(X_t, →H_{t-1})   (2.5)
←H_t = GRU(X_t, ←H_{t+1})   (2.6)
wherein GRU(X_t, →H_{t-1}) indicates generating the forward hidden state with the GRU in forward order, and GRU(X_t, ←H_{t+1}) indicates generating the hidden state with the GRU in reverse order. The hidden state H_t is generated not by simple addition but by concatenation, as follows:
H_t = →H_t ⊕ ←H_t   (2.7)
wherein ⊕ is the dimension concatenation operation.
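For illustration only, a minimal PyTorch sketch of equations (2.1) to (2.7) follows; it is not the patent's reference implementation. The cell mirrors the gate formulas above (the bias terms b_rr, b_rz and b_h live inside the input-side Linear layers), and the bidirectional wrapper concatenates the forward and reverse hidden states; class names and hidden sizes are hypothetical:

```python
import torch
import torch.nn as nn

class GRUCellManual(nn.Module):
    """One GRU step implementing equations (2.1)-(2.4)."""
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.W_xr = nn.Linear(d_in, d_hid)             # reset gate input weights + b_rr
        self.W_hr = nn.Linear(d_hid, d_hid, bias=False)
        self.W_xz = nn.Linear(d_in, d_hid)             # update gate input weights + b_rz
        self.W_hz = nn.Linear(d_hid, d_hid, bias=False)
        self.W_xh = nn.Linear(d_in, d_hid)             # candidate state weights + b_h
        self.W_hh = nn.Linear(d_hid, d_hid, bias=False)

    def forward(self, x_t, h_prev):
        r_t = torch.sigmoid(self.W_xr(x_t) + self.W_hr(h_prev))        # (2.1)
        z_t = torch.sigmoid(self.W_xz(x_t) + self.W_hz(h_prev))        # (2.2)
        h_cand = torch.tanh(self.W_xh(x_t) + self.W_hh(r_t * h_prev))  # (2.3)
        return z_t * h_prev + (1 - z_t) * h_cand                       # (2.4)

class BiGRUManual(nn.Module):
    """Runs one GRU forward and one in reverse, then concatenates (2.5)-(2.7)."""
    def __init__(self, d_in=768, d_hid=256):
        super().__init__()
        self.fwd = GRUCellManual(d_in, d_hid)
        self.bwd = GRUCellManual(d_in, d_hid)
        self.d_hid = d_hid

    def forward(self, x):                     # x: (seq_len, d_in)
        T = x.size(0)
        h_f = h_b = torch.zeros(self.d_hid)
        outs_f, outs_b = [], [None] * T
        for t in range(T):                    # forward pass, equation (2.5)
            h_f = self.fwd(x[t], h_f)
            outs_f.append(h_f)
        for t in reversed(range(T)):          # reverse pass, equation (2.6)
            h_b = self.bwd(x[t], h_b)
            outs_b[t] = h_b
        # concatenate forward and reverse hidden states, equation (2.7)
        return torch.stack([torch.cat([f, b]) for f, b in zip(outs_f, outs_b)])
```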
The input of the semantic understanding module (FR) is the 768-dimensional vector preprocessed in step 1, and it is used to extract text semantic information. The input of the first unit is the 768-dimensional vector preprocessed in step 1; the input of the second unit is that 768-dimensional vector together with the output of the first unit; and the input of each subsequent unit is the outputs of the preceding two units.
Each unit comprises a fully convolutional network (FCN) consisting of a convolutional layer, a ReLU activation function, an average pooling layer, a deconvolution layer, and the improved Sigmoid activation function; the units are connected by a residual network (ResNet).
The residual network ResNet is calculated as follows:
O_t^(l) = FCN(O_t^(l-1)) + O_t^(l-2)   (2.8)
wherein O_t^(l) denotes the output of ResNet for the l-th unit at time t, and O_t^(l-1) and O_t^(l-2) denote the outputs of the preceding two units at time t; the improved Sigmoid activation function (2.9) is applied as the final activation inside each unit.
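The following PyTorch sketch illustrates one possible reading of the FR module; it is not the patent's reference implementation. The exact convolution hyperparameters are not given in the text, and the closed form of the improved Sigmoid appears only as equation (2.9) in the original figures, so a standard Sigmoid stands in for it; the kernel size, unit count, and all names are assumptions, and an even sequence length is assumed so that pooling and deconvolution restore the original length:

```python
import torch
import torch.nn as nn

class SemanticUnit(nn.Module):
    """One FR unit: Conv1d -> ReLU -> AvgPool -> ConvTranspose1d -> activation.
    Treats the (768, seq_len) text tensor as a one-dimensional picture."""
    def __init__(self, channels=768, kernel=3):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.AvgPool1d(2),                                      # halves the length
            nn.ConvTranspose1d(channels, channels, 2, stride=2),  # restores the length
            nn.Sigmoid(),  # placeholder for the patent's improved Sigmoid (2.9)
        )

    def forward(self, x):
        return self.fcn(x)

class FRModule(nn.Module):
    """Stacks units with residual connections from two layers back, per (2.8)."""
    def __init__(self, n_units=4, channels=768):
        super().__init__()
        self.units = nn.ModuleList(SemanticUnit(channels) for _ in range(n_units))

    def forward(self, x):                  # x: (batch, 768, seq_len)
        prev2, prev1 = x, self.units[0](x)
        for unit in self.units[1:]:
            # O^(l) = FCN(O^(l-1)) + O^(l-2): input is the preceding two outputs
            prev2, prev1 = prev1, unit(prev1) + prev2
        return prev1
```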
The input of the Self-Attention module (Self-Attention) is the superposed output of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR), and it is used to assign word weights. The input is transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value), then a Similarity matrix (Similarity) is calculated from the key matrix and the query matrix, the similarity matrix is normalized, and finally the normalized similarity matrix and the value matrix are weighted to obtain the Attention matrix (Attention); specifically:
(a) The outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) are superposed and then transformed into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value); specifically
I_t = O_t^BiGRU + O_t^FR   (2.10)
Query = I_t × W_q   (2.11)
Key = I_t × W_k   (2.12)
Value = I_t × W_v   (2.13)
wherein W_q is the query matrix weight parameter, W_k is the key matrix weight parameter, and W_v is the value matrix weight parameter; I_t denotes the superposed output, at time t, of the BiGRU and the FR semantic understanding module.
(b) Calculating a Similarity matrix (Similarity) from the query matrix and the key matrix:
Similarity(Query, Key) = Query × Key^T   (2.14)
(c) Normalizing each row of the similarity matrix:
a_ij = exp(Similarity_ij) / Σ_{k=1..n} exp(Similarity_ik)   (2.15)
wherein a_ij denotes the value of the normalized similarity matrix at row i, column j, and n denotes the number of elements in each row.
(d) Weighting the normalized similarity matrix and the value matrix to obtain the Attention matrix (Attention):
attention_ij = Σ_{k=1..l} a_ik × value_kj   (2.16)
wherein attention_ij denotes the value of the Attention matrix (Attention) at row i, column j; value_kj denotes the value of the value matrix at row k, column j; and l denotes the number of elements in each column of the normalized similarity matrix.
The output layer includes a fully connected layer (Fully Connected Layer) and a Sigmoid activation function. The input of the output layer comes from the Attention matrix (Attention); the probability of a word error is output through the fully connected layer and the activation function, and if the error probability is greater than 0.5, the word is judged to be wrongly written.
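A compact PyTorch sketch of equations (2.10) to (2.16) together with the output layer is given below for illustration; it is not the patent's reference implementation. Softmax implements the row normalization of (2.15), the superposition (2.10) is read as an element-wise sum (which requires the BiGRU and FR outputs to share the same dimension), and the 0.5 threshold follows the description above; class and parameter names are hypothetical:

```python
import torch
import torch.nn as nn

class SelfAttentionDetector(nn.Module):
    """Self-attention per (2.10)-(2.16) plus the FC + Sigmoid output layer."""
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.fc = nn.Linear(d_model, 1)

    def forward(self, h_bigru, h_fr):          # each: (seq_len, d_model)
        i_t = h_bigru + h_fr                   # superposition (2.10)
        q = self.W_q(i_t)                      # (2.11)
        k = self.W_k(i_t)                      # (2.12)
        v = self.W_v(i_t)                      # (2.13)
        sim = q @ k.transpose(-2, -1)          # similarity matrix (2.14)
        a = torch.softmax(sim, dim=-1)         # row-wise normalization (2.15)
        attn = a @ v                           # attention matrix (2.16)
        return torch.sigmoid(self.fc(attn)).squeeze(-1)  # per-word error probability

# usage: probs = detector(h_bigru, h_fr); errors = probs > 0.5
```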
The self-attention mechanism (Self-Attention) has two advantages: first, the attention mechanism (Attention) can assign weights automatically, and words with closer relations are assigned larger weights, indicating a higher degree of correlation; second, the self-attention mechanism (Self-Attention) is resistant to interference, effectively avoiding the disturbance that erroneous words bring to the semantics.
Training uses the self-collected dataset Merge, and performance evaluation uses the public Chinese spelling dataset SIGHAN15. The model runs experiments on this dataset to predict wrongly written words, and the indices are counted for comparison. The following table shows the data volumes of the Merge and SIGHAN15 datasets.

                         Merge    SIGHAN15
Number of paragraphs     2390     1100
Number of errors         3740     1602
The performance evaluation indices adopted by the invention are Precision, Recall, F1 and F0.5, defined from the following confusion matrix:

                      True value 1           True value -1
Predicted value 1     TP (True Positive)     FP (False Positive)
Predicted value -1    FN (False Negative)    TN (True Negative)
Precision: among all samples predicted to be positive, the probability that the sample is actually positive.
Precision = TP / (TP + FP)
Recall: among all samples that are actually positive, the probability of being predicted positive.
Recall = TP / (TP + FN)
F1 and F0.5 find a balance point between the two; they take both precision and recall into account and serve as comprehensive measures of model quality.
F1 = 2 × Precision × Recall / (Precision + Recall)
F0.5 = 1.25 × Precision × Recall / (0.25 × Precision + Recall)
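For reference, the four indices can be computed directly from the confusion matrix counts; a minimal sketch (assuming non-degenerate counts, i.e. no division by zero) follows:

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, F1 and F0.5 as defined above (F_beta with beta = 0.5)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    f05 = 1.25 * precision * recall / (0.25 * precision + recall)
    return precision, recall, f1, f05
```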
The following table shows the results of the Chinese text error detection experiments of the present invention on the SIGHAN15 dataset:

                   Precision (%)   Recall (%)   F1 (%)   F0.5 (%)
LSTM               56.16           47.03        51.19    54.06
GRU                70.17           46.18        55.70    63.57
BiGRU-CNN          81.94           89.38        85.50    83.33
BiGRU-Attention    64.45           99.06        78.09    69.29
RFRA               84.60           98.01        90.81    87.00
In the above table of Chinese text error detection results, LSTM and GRU are conventional recurrent neural network detectors, BiGRU-CNN combines a recurrent neural network with a convolutional neural network, and BiGRU-Attention combines a recurrent neural network with an attention mechanism. RFRA is the Chinese text error detection model of the invention based on joint analysis of word order and semantics.
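For completeness, the sketches above could be assembled into one module roughly as follows; this wiring (embedding, then BiGRU and FR in parallel, superposition inside the attention module, then the output layer) is an assumption-laden illustration reusing the hypothetical classes defined earlier, not the patent's reference implementation:

```python
import torch.nn as nn

class RFRA(nn.Module):
    """Hypothetical end-to-end assembly: embedding -> (BiGRU || FR) ->
    self-attention with superposition -> per-word error probabilities."""
    def __init__(self, vocab_size, d_embed=768, d_hid=384, n_units=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_embed)
        self.bigru = BiGRUManual(d_embed, d_hid)   # concat output is 2*d_hid = d_embed
        self.fr = FRModule(n_units, d_embed)
        self.attn = SelfAttentionDetector(d_embed)

    def forward(self, token_ids):                  # token_ids: (seq_len,)
        x = self.embedding(token_ids)              # (seq_len, 768)
        h_bigru = self.bigru(x)                    # timing information
        h_fr = self.fr(x.t().unsqueeze(0)).squeeze(0).t()  # text as a 1-D picture
        return self.attn(h_bigru, h_fr)            # error probability per word
```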

Claims (8)

1. A Chinese text error detection method based on word order and semantic joint analysis is characterized by comprising the following steps:
step 1: preprocessing data;
1-1, acquiring original text data, dividing all texts in the original text data according to word levels, and constructing a Chinese character set D (w); inserting identifiers into the Chinese character set D (w), and then marking the Chinese character set D (w) by using indexes, wherein each word corresponds to a dictionary index to form a dictionary Dic (w, k);
1-2, converting a text statement in original text data into Token, adding an identifier, and fixing the sentence length;
1-3 serializing the text sentences after the Token conversion in the step 1-2 according to the dictionary index in the step 1-1;
1-4 mapping the data subjected to index serialization in the step 1-3 into 768-dimensional vectors by a word Embedding technology;
step 2: the Chinese text error detection is realized through a Chinese text error detection model RFRA based on the word order and semantic joint analysis;
the Chinese text error detection model based on the language order and semantic combined analysis comprises an information extraction module, a Self-Attention module Self-Attention and an output layer;
the information extraction module comprises a bidirectional gating recurrent neural network (BiGRU) and a semantic understanding module FR;
the input of the semantic understanding module FR is 768-dimensional vectors preprocessed in the step 1 and used for extracting text semantic information; the system comprises a plurality of semantic understanding units, wherein each semantic understanding unit comprises a full convolution neural network (FCN); each semantic understanding unit is connected by adopting a residual error network ResNet and adopts an improved Sigmoid function; the input of each semantic understanding unit is the output of the first two layers of units;
the input of the Self-Attention module Self-Attention is the superposed output of the bidirectional gated recurrent neural network BiGRU and the semantic understanding module FR, and it is used to assign word weights; the input is transformed into a Key matrix Key, a query matrix Query, and a Value matrix Value, then a Similarity matrix Similarity is calculated from the key matrix and the query matrix, the similarity matrix is normalized, and finally the normalized similarity matrix and the value matrix are weighted to obtain the Attention matrix Attention;
the output layer is used to judge whether an output word is erroneous.
2. The method of claim 1, wherein adding identifiers in step 1-2 comprises adding a "START" initiator at the beginning of the sentence, a "CLS" spacer within the sentence, and an "END" terminator at the end of the sentence.
3. The method of claim 1, wherein fixing the sentence length in step 1-2 comprises truncating the over-length part of long sentences and padding short sentences to the fixed length with the "PAD" character.
4. The method of claim 1, wherein the input of the bidirectional gated recurrent neural network BiGRU is the 768-dimensional vector preprocessed in step 1 together with the hidden state the network generated at the previous moment, and it is used to extract text timing information; specifically:
the bidirectional gated recurrent unit model comprises two gated recurrent units GRU;
each GRU has a reset gate R and an update gate Z; the reset gate R_t at time t and the update gate Z_t at time t are calculated as follows:
R_t = σ(X_t W_xr + H_{t-1} W_hr + b_rr)   (2.1)
Z_t = σ(X_t W_xz + H_{t-1} W_hz + b_rz)   (2.2)
wherein X_t is the mapped 768-dimensional vector from step 1 at time t, H_{t-1} is the hidden state at time t-1, W_xr is the reset gate input weight parameter, W_xz is the update gate input weight parameter, W_hr is the reset gate hidden state weight parameter, W_hz is the update gate hidden state weight parameter, and b_rr and b_rz are the bias parameters of the reset gate and the update gate, respectively; σ is the Sigmoid function, which constrains the values of the reset gate and the update gate to the range 0 to 1;
the reset gate is used to generate the candidate hidden state H̃_t, calculated as follows:
H̃_t = tanh(X_t W_xh + (R_t ⊙ H_{t-1}) W_hh + b_h)   (2.3)
wherein W_xh is the candidate hidden state input weight parameter, W_hh is the weight parameter of the candidate hidden state with respect to the hidden state, b_h is the candidate hidden state bias parameter, and tanh is the activation function;
the update gate is used to generate the hidden state H_t at the current moment, calculated as follows:
H_t = Z_t ⊙ H_{t-1} + (1 - Z_t) ⊙ H̃_t   (2.4)
wherein ⊙ denotes the Hadamard product, i.e. element-wise multiplication;
of the two gated recurrent units GRU, one receives the forward input and the other the reversed input, producing the forward hidden state →H_t and the reverse hidden state ←H_t, calculated as follows:
→H_t = GRU(X_t, →H_{t-1})   (2.5)
←H_t = GRU(X_t, ←H_{t+1})   (2.6)
wherein GRU(X_t, →H_{t-1}) indicates generating hidden states with the GRU in forward order, GRU(X_t, ←H_{t+1}) indicates generating hidden states with the GRU in reverse order, →H_t denotes the forward hidden state at time t, and ←H_t denotes the reverse hidden state at time t;
the hidden state H_t is generated by concatenation, specifically as follows:
H_t = →H_t ⊕ ←H_t   (2.7)
wherein ⊕ denotes the dimension concatenation operation.
5. The method of claim 1, wherein the residual network ResNet is calculated as follows:
O_t^(l) = FCN(O_t^(l-1)) + O_t^(l-2)   (2.8)
and the improved Sigmoid activation function (2.9) is applied inside each semantic understanding unit; wherein O_t^(l) denotes the output of ResNet for the l-th semantic understanding unit at time t, and O_t^(l-1) and O_t^(l-2) denote the outputs of the preceding two semantic understanding units at time t.
6. The method according to claim 1, wherein the Self-Attention module Self-Attention specifically comprises:
(a) superposing the outputs of the bidirectional gated recurrent neural network (BiGRU) and the semantic understanding module (FR) and transforming the result into a Key matrix (Key), a query matrix (Query), and a Value matrix (Value); specifically
I_t = O_t^BiGRU + O_t^FR   (2.10)
Query = I_t × W_q   (2.11)
Key = I_t × W_k   (2.12)
Value = I_t × W_v   (2.13)
wherein W_q is the query matrix weight parameter, W_k is the key matrix weight parameter, and W_v is the value matrix weight parameter; I_t denotes the superposed output, at time t, of the bidirectional recurrent neural network BiGRU and the FR semantic understanding module in the information extraction module;
(b) calculating a Similarity matrix (Similarity) from the query matrix and the key matrix:
Similarity(Query, Key) = Query × Key^T   (2.14)
(c) normalizing each row of the similarity matrix:
a_ij = exp(Similarity_ij) / Σ_{k=1..n} exp(Similarity_ik)   (2.15)
wherein a_ij denotes the value of the normalized similarity matrix at row i, column j; n denotes the number of elements in each row of the similarity matrix; Similarity_ij denotes the value of the similarity matrix at row i, column j; and exp(Similarity_ij) denotes the power of e with Similarity_ij as the exponent;
(d) weighting the normalized similarity matrix and the value matrix to obtain the Attention matrix (Attention):
attention_ij = Σ_{k=1..l} a_ik × value_kj   (2.16)
wherein attention_ij denotes the value of the Attention matrix Attention at row i, column j; value_kj denotes the value of the value matrix at row k, column j; and l denotes the number of elements in each column of the normalized similarity matrix.
7. The method of claim 1, wherein the output layer comprises two fully connected layers (Fully Connected Layer) and two Gelu activation functions.
8. A Chinese text error detection system based on language order and semantic joint analysis is characterized by comprising:
the data preprocessing module is used for converting the text data into 768-dimensional vectors;
and the Chinese text error detection module is used for realizing Chinese text error detection by using a Chinese text error detection model based on the word order and semantic combined analysis.
CN202210178120.5A 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis Pending CN114548116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210178120.5A CN114548116A (en) 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210178120.5A CN114548116A (en) 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis

Publications (1)

Publication Number Publication Date
CN114548116A true CN114548116A (en) 2022-05-27

Family

ID=81678632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210178120.5A Pending CN114548116A (en) 2022-02-25 2022-02-25 Chinese text error detection method and system based on language sequence and semantic joint analysis

Country Status (1)

Country Link
CN (1) CN114548116A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115886830A (en) * 2022-12-09 2023-04-04 中科南京智能技术研究院 Twelve-lead electrocardiogram classification method and system
CN116975863A (en) * 2023-07-10 2023-10-31 福州大学 Malicious code detection method based on convolutional neural network


Similar Documents

Publication Publication Date Title
CN109871535B (en) French named entity recognition method based on deep neural network
CN107832400B (en) A kind of method that location-based LSTM and CNN conjunctive model carries out relationship classification
CN110134946B (en) Machine reading understanding method for complex data
CN109697232A (en) A kind of Chinese text sentiment analysis method based on deep learning
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN110083710A (en) It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN114548116A (en) Chinese text error detection method and system based on language sequence and semantic joint analysis
CN112309528B (en) Medical image report generation method based on visual question-answering method
CN112561718A (en) Case microblog evaluation object emotion tendency analysis method based on BilSTM weight sharing
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN114356990A (en) Base named entity recognition system and method based on transfer learning
CN113051887A (en) Method, system and device for extracting announcement information elements
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN112990196B (en) Scene text recognition method and system based on super-parameter search and two-stage training
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN112347783A (en) Method for identifying types of alert condition record data events without trigger words
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN112949284A (en) Text semantic similarity prediction method based on Transformer model
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN115757775A (en) Text implication-based triggerless text event detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination