CN110134950B - Automatic text proofreading method combining words - Google Patents

Automatic text proofreading method combining words

Info

Publication number
CN110134950B
CN110134950B (application CN201910349756.XA)
Authority
CN
China
Prior art keywords
word
model
pos
error
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910349756.XA
Other languages
Chinese (zh)
Other versions
CN110134950A (en
Inventor
苏萌
苏海波
王然
檀玉飞
孙伟
高体伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Percent Technology Group Co ltd
Original Assignee
Beijing Percent Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Percent Technology Group Co ltd
Priority to CN201910349756.XA
Publication of CN110134950A
Application granted
Publication of CN110134950B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text proofreading method combining words, which first applies the following two error checking methods independently: 1) an error checking method based on an n-gram language model; 2) an error checking method based on an LSTM language model; the intersection of the two error checking results is then taken as the final error checking result. The method performs word segmentation and part-of-speech tagging of the input text based on word embedding, a bidirectional LSTM network, a CRF (Conditional Random Field) model and the like, and locates errors in the text based on an n-gram model, a bidirectional LSTM language model and rule-based strategies.

Description

Automatic text proofreading method combining words
Technical Field
The invention relates to the technical field of data processing, and in particular to an automatic text proofreading method combining words.
Background
Automatic text proofreading is a technology for finding and correcting errors in characters, words, word collocations, semantics and grammar in text, and it is one of the main application fields of natural language processing.
Early natural language processing systems were based primarily on manually written rules, which were time-consuming and laborious to produce and could not cover all language phenomena. In the late 1980s, as the computing power of computers kept increasing, machine learning algorithms were introduced into natural language processing. Research then focused mainly on statistical models, whose parameters are learned automatically by training on large-scale corpora; compared with the earlier rule-based methods, this approach is more robust.
Statistical language models were proposed in this environment and context. They are widely used for many natural language processing problems, such as speech recognition, machine translation, word segmentation and part-of-speech tagging. In short, a language model is a model used to compute the probability of a sentence, i.e., P(w_1, w_2, …, w_k). With a language model, one can determine which word sequence is more likely, or, given several words, predict the most likely next word.
The n-gram model, also known as an (n-1)-order Markov model, makes a finite-history assumption: the probability of the current word depends only on the preceding n-1 words. Given a sentence (word sequence) S = w_1, w_2, …, w_k, its probability can be expressed as:

P(S) = \prod_{i=1}^{k} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
When n is 1, 2 or 3, the n-gram model is called a unigram, bigram or trigram language model, respectively. The parameters of the n-gram model are the conditional probabilities P(w_i | w_{i-n+1}, …, w_{i-1}). Assuming a vocabulary size of 100,000, the n-gram model has on the order of 100,000^n parameters. The larger n is, the more accurate and complex the model, and the greater the computation required. Bigrams are the most commonly used, followed by unigrams and trigrams; n >= 4 is rarely used.
The biggest problem with n-grams is that the probability estimates are not very accurate, especially when n is large: guaranteeing accuracy requires a large amount of data, but in practice the counts become sparse because unlimited training data is unavailable. In addition, an n-gram can only count occurrences of word sequences of a fixed, short length (generally no more than 3) and cannot capture long-range context information.
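As an illustration only (not part of the patent text), the following is a minimal Python sketch of maximum-likelihood bigram estimation from corpus counts; the function names and toy corpus are assumptions:

from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """Maximum-likelihood estimate of P(w | w_prev); 0.0 if w_prev is unseen."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

corpus = [["我", "爱", "北京"], ["我", "爱", "学习"]]
uni, bi = train_bigram(corpus)
print(bigram_prob("我", "爱", uni, bi))  # 1.0 in this toy corpus

In practice, smoothing (e.g., add-one or interpolation) would be needed precisely because of the sparsity problem described above.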
Explanation of some of the technical terms:
Word segmentation and part-of-speech tagging: dividing text into individual words and identifying the part of speech of each word (noun, verb, adjective, etc.).
Word2vec: an algorithm developed by Google that, through unsupervised training, turns each word into a vector of several hundred dimensions; these vectors capture the semantic correlations between words. Also called word vectors or word embeddings.
Tensorflow: Google's open-source deep learning platform, providing rich interfaces, multi-platform support (CPU, GPU, Hadoop), distributed training and visual monitoring.
LSTM: the Long Short-Term Memory network is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. Memory and forget gates control how much historical information is retained, effectively alleviating the long-range dependency problem of conventional recurrent neural networks (the standard gate equations are given after this glossary).
CRF: the Conditional Random Field is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition and part-of-speech tagging. A CRF uses a Markov chain as the probability transfer model over hidden variables and infers the hidden variables from observable states; it is a discriminative model.
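For reference, the gate mechanism mentioned in the LSTM entry above is conventionally written as follows (standard formulation, not quoted from the patent):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)   (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)   (input/memory gate)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)   (output gate)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where \sigma is the sigmoid function referred to in the detailed description and \odot denotes element-wise multiplication.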
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide an automatic text proofreading method combining words, which performs word segmentation and part-of-speech tagging of the input text based on word embedding, a bidirectional LSTM network, a CRF (Conditional Random Field) model and the like, and locates errors in the text based on an n-gram model, a bidirectional LSTM language model and rule-based strategies.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for automatically proofreading texts combined by words comprises the following steps:
s1, debugging is carried out by respectively adopting the following two debugging methods:
1) An error checking method based on an n-gram language model;
2) An error checking method based on an Lstm language model;
the debugging method based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method to obtain S = w_1, w_2, …, w_n, where w_i is a word obtained after segmentation and its part of speech is pos_i, i = 1, 2, …, n;
1.2) Use the unigram, bigram and trigram language models to judge whether errors exist in the segmented result;
1.2.1) Determine the part of speech pos_i of w_i: if it is a person name or place name, go to step 1.2.2); otherwise, use the unigram model to look up the frequency P(w_i): if P(w_i) >= threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) P(w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to check the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) >= threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is not a person or place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) >= T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) P(w_{i-2}, w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-2}, w_{i-1} and w_i. If pos_i is a person name or place name, use the trigram model to check the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) >= threshold T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is not a person or place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) >= T_2 (T_2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error;
the error checking method based on the LSTM language model specifically comprises the following steps:
2.1) Vectorize each character using a character vector model;
2.2) Perform automatic feature extraction with a bidirectional LSTM model to obtain an output sequence;
2.3) For each character x_t, pass its output h_t through a Softmax activation function to obtain the probability of the character at the next time step; compare this probability with a set threshold: if the probability is greater than the threshold, the character is considered correct; otherwise it is marked as an error;
S2, for an input text, after the processing of step S1, an error checking result based on the n-gram language model and an error checking result based on the LSTM language model are obtained respectively, and the intersection of the two error checking results is taken as the final error checking result.
Further, in step 2.1), a word2vec model is adopted for character vectorization.
Furthermore, the word2vec model is trained with the Skip-gram method.
Further, the specific method of step 2.2) is as follows:
First, the character vectors generated in step 2.1) are loaded, and the bidirectional LSTM computation begins: the output of the forward LSTM is h_ft and the output of the backward LSTM is h_bt; after vector concatenation, the output of each character x_t is h_t = [h_ft, h_bt], and the outputs of all characters constitute the output sequence.
Furthermore, the number of occurrences of a new word appearing in the input text is counted, denoted P(new word); if P(new word) >= a preset threshold, the new word is considered correct and no error is reported.
The invention has the following beneficial effects: the method performs word segmentation and part-of-speech tagging of the input text based on word embedding, a bidirectional LSTM network, a CRF (Conditional Random Field) model and the like, and locates errors in the text based on an n-gram model, a bidirectional LSTM language model and rule-based strategies. The advantages of the invention are as follows:
1) A deep learning method is used for word segmentation and part-of-speech tagging, so person and place names in the text can be accurately extracted, reducing false alarms caused by them.
2) The proofreading method combining the n-gram and LSTM models can extract both the local and the global features of the text and accurately locate the errors occurring in it.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a CBOW training method;
FIG. 3 is a schematic diagram of the Skip-gram method;
FIG. 4 is a schematic diagram of a word vector model training process according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. It should be noted that this embodiment is based on the above technical scheme and provides a detailed implementation and a specific operation process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides an automatic text proofreading method combining words, as shown in FIG. 1, comprising the following steps:
S1, error checking is performed independently with each of the following two methods:
1) An error checking method based on an n-gram language model;
2) An error checking method based on an LSTM language model;
The error checking method based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method to obtain S = w_1, w_2, …, w_n, where w_i is a word obtained after segmentation and its part of speech is pos_i, i = 1, 2, …, n;
The deep-learning-based method for word segmentation and part-of-speech tagging of the input text essentially converts the problem into a sequence labeling problem. The specific steps are as follows:
1.1.1) Convert the input text into computable vectors by the word2vec method;
1.1.2) Feed the vectors into the LSTM-CRF model to obtain the segmentation result and the corresponding part of speech of each word;
The method mainly involves three algorithms: word2vec, bidirectional LSTM and CRF.
The word vector model (word2vec) algorithm transforms an incomputable Chinese word into a vector in a low-dimensional space, typically of several hundred dimensions, that a computer can calculate with. Compared with the traditional one-hot method, the dimensionality of the vector is reduced and the representation changes from sparse to dense, greatly reducing the amount of computation. In addition, the distance between vectors approximately describes the semantic relatedness of the corresponding characters.
Recurrent neural networks (RNNs) have been widely demonstrated to be advantageous in the field of natural language processing. For an arbitrary input text sequence (x_1, x_2, …, x_n), an RNN returns a set of output values (h_1, h_2, …, h_n) for the sequence, where each element x_i of the input sequence may be a character (or word) and the corresponding output gives the probability of the character (or word) at time i+1. However, during optimization conventional RNNs suffer from the vanishing gradient problem, so their model parameters can only "memorize" context information a short distance before and after the current character and cannot cope with long-range dependencies. The LSTM model solves this problem: the input and output of historical information are controlled through several gates, each gate being nonlinearly normalized by a sigmoid function to a value between 0 and 1; the closer the value is to 0, the less historical information passes through the gate, and conversely, the closer it is to 1, the more information passes. Through this special design, the LSTM effectively solves the long-range dependency problem in sequence labeling tasks.
The CRF (Conditional Random Field) model is a typical discriminative model proposed by John Lafferty in 2001. It has the advantages of a discriminative model while retaining characteristics of generative models, namely considering the transition probabilities between context labels and performing global parameter optimization and decoding over the whole sequence, and it solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find difficult to avoid. An illustrative model sketch follows.
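As an illustrative sketch only, not the patent's exact implementation: a character-level Bi-LSTM sequence tagger in Keras. All dimensions and names are assumptions, and the CRF decoding layer described above is replaced by a per-position softmax for brevity:

import tensorflow as tf

VOCAB_SIZE = 5000  # assumed character vocabulary size
EMBED_DIM = 128    # assumed embedding dimension
NUM_TAGS = 8       # assumed tag set size (e.g., {B, M, E, S} x POS groups)

# Embedding -> bidirectional LSTM -> per-character tag distribution.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(100, return_sequences=True)),
    # A true LSTM-CRF would replace this softmax with a CRF layer
    # (e.g., tensorflow_addons.layers.CRF) for globally normalized decoding.
    tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

The softmax head makes independent per-character decisions; the CRF layer the patent relies on additionally models transitions between adjacent labels, which is what addresses the label bias problem discussed above.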
1.2) Use the unigram, bigram and trigram language models to judge whether errors exist in the result obtained in step 1.1);
1.2.1) Determine the part of speech pos_i of w_i: if it is a person name or place name, go to step 1.2.2); otherwise, use the unigram model to look up the frequency P(w_i): if P(w_i) >= T_0 (T_0 being a manually set threshold), go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) P(w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to check the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) >= T_1 (T_1 being a manually set threshold), go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is not a person or place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) >= T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) P(w_{i-2}, w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-2}, w_{i-1} and w_i. If pos_i is a person name or place name, use the trigram model to check the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) >= T_2 (T_2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is not a person or place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) >= T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error. A sketch of this three-stage check is given below.
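A minimal sketch of the three-stage check, assuming pre-computed count tables and manually set thresholds; the data structures and names are hypothetical, not taken from the patent:

NAME_POS = {"nr", "ns"}  # assumed POS tags for person and place names

def check_ngram(words, tags, uni, bi, tri, t0, t1, t2):
    """Return indices of words flagged by the unigram/bigram/trigram rules.

    uni, bi and tri map tuples of words (with the last element replaced
    by the POS tag for person/place names) to corpus counts."""
    errors = []
    for i, (w, pos) in enumerate(zip(words, tags)):
        named = pos in NAME_POS
        # Stage 1.2.1): unigram frequency (skipped for person/place names).
        if not named and uni.get((w,), 0) < t0:
            errors.append(i)
            continue
        # Stage 1.2.2): co-occurrence with the previous word.
        left1 = words[i - 1] if i >= 1 else None
        if left1 is not None and bi.get((left1, pos if named else w), 0) < t1:
            errors.append(i)
            continue
        # Stage 1.2.3): co-occurrence with the two previous words.
        left2 = words[i - 2] if i >= 2 else None
        if left2 is not None and tri.get((left2, left1, pos if named else w), 0) < t2:
            errors.append(i)
    return errors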
the error checking method based on the LSTM language model specifically comprises the following steps:
2.1) Vectorize each character in the input text using the character vector model to generate character vectors;
compared with the commonly used word vector, the character-based vectorization technology can bring the following advantages: character features of finer granularity can be represented; because the number of characters is far smaller than the number of words, the obtained model occupies extremely small space, and the loading speed of the model is greatly improved; over time, new words are emerging, and the word vector models trained previously have increasingly severe feature hit rate drop problems that are effectively avoided by character-based vectors, which are created relatively rarely every year.
In this embodiment, a word2vec model is used for character vectorization. It is an unsupervised learning method, i.e., the model can be trained without manually labeled corpora. There are two common training methods: CBOW and Skip-gram. CBOW predicts the center word from its context, combining the vectors of the surrounding characters w(t-2), w(t-1), w(t+1) and w(t+2) of the current character w(t), which fully preserves the context information; see FIG. 2. The Skip-gram method is the opposite, using w(t) to predict the surrounding characters w(t-2), w(t-1), w(t+1) and w(t+2); see FIG. 3. The Skip-gram method is suitable when the data volume is large, so it is used in this embodiment.
As shown in FIG. 4, the specific steps for training the character vectors are:
(1) First, collect relevant balanced corpora (since the learning is unsupervised, the more data the better, and no labeling is needed); the corpora should target the corresponding application scenario and cover as many of its data types as possible;
(2) Preprocess the collected balanced corpora, including filtering garbage data, low-frequency words and meaningless symbols, and then organize the data into the training format, i.e., specify the inputs and outputs, in preparation for defining the training objective;
(3) Feed the training data to the Skip-gram model and train it to obtain the character vector model; a sketch of this step is given below.
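A minimal sketch of step (3) using gensim (the library choice is an assumption; the patent names no implementation), training character-level Skip-gram vectors:

from gensim.models import Word2Vec

# Each "sentence" is a list of single characters, per the character-level setup.
sentences = [list("今天天气很好"), list("我爱自然语言处理")]

model = Word2Vec(
    sentences,
    vector_size=100,  # assumed dimensionality ("several hundred" in the text above)
    window=2,         # assumed context window of two characters per side
    min_count=1,
    sg=1,             # sg=1 selects Skip-gram training (sg=0 would be CBOW)
)
vec = model.wv["天"]  # the learned vector for one character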
2.2) Automatically extract features with a bidirectional LSTM (Bi-LSTM) model to obtain an output sequence. The specific steps are as follows:
First, the character vectors generated in step 2.1) are loaded, and the bidirectional LSTM computation begins with a forward LSTM and a backward LSTM. The output of the forward LSTM is h_ft and the output of the backward LSTM is h_bt; after vector concatenation, the output of each character x_t is h_t = [h_ft, h_bt], and the outputs of all characters constitute the output sequence. The forward LSTM output h_ft characterizes the historical context information, while the backward LSTM output h_bt characterizes the future context information.
2.3) For each character x_t, pass its output h_t through a Softmax activation function to obtain the probability of the character at the next time step; compare this probability with a set threshold: if the probability is greater than the threshold, the character is considered correct; otherwise it is marked as an error. A sketch of steps 2.2)-2.3) follows.
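An illustrative sketch of steps 2.2)-2.3), in the same Keras setting as the tagger sketch above; the threshold and all dimensions are assumptions, and the model would need to be trained before use:

import numpy as np
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, THRESHOLD = 5000, 128, 1e-4  # all assumed values

# Bi-LSTM language model: each h_t = [h_ft, h_bt] is scored over the vocabulary.
lm = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

def lstm_errors(char_ids):
    """Flag position t+1 when the softmax probability assigned to that
    character, given the output at position t, falls below THRESHOLD."""
    probs = lm.predict(np.array([char_ids]), verbose=0)[0]
    flagged = []
    for t in range(len(char_ids) - 1):
        if probs[t, char_ids[t + 1]] < THRESHOLD:
            flagged.append(t + 1)
    return flagged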
S2, for an input text, after the processing of step S1, an error checking result based on the n-gram language model and an error checking result based on the LSTM language model are obtained respectively, and the intersection of the two error checking results is taken as the final error checking result.
In addition, the number of occurrences of a new word appearing in the input text is counted, denoted P(new word); if P(new word) >= a preset threshold, the new word is considered correct and no error is reported. A combined sketch is given below.
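Putting the pieces together, a sketch of step S2 combined with the new-word rule; check_ngram, lstm_errors and the count table are the hypothetical helpers from the sketches above, and both error sets are assumed to have been mapped to word indices:

def final_errors(ngram_errors, lstm_errors, words, new_word_count, t_new):
    """Intersect the two detectors' outputs, then drop frequent new words."""
    candidates = set(ngram_errors) & set(lstm_errors)
    return {
        i for i in candidates
        # A flagged word seen at least t_new times in the input is treated
        # as a legitimate new word and is not reported as an error.
        if new_word_count.get(words[i], 0) < t_new
    }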
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (4)

1. A method for automatically proofreading text by combining words, characterized by comprising the following steps:
S1, error checking is performed independently with each of the following two methods:
1) An error checking method based on an n-gram language model;
2) An error checking method based on an LSTM language model;
the debugging method based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method to obtain S = w_1, w_2, …, w_n, where w_i is a word obtained after segmentation and its part of speech is pos_i, i = 1, 2, …, n;
wherein the deep-learning-based word segmentation and part-of-speech tagging of the input text comprises the following specific steps:
1.1.1) Convert the input text into computable vectors by the word2vec method;
1.1.2) Feed the vectors into the LSTM-CRF model to obtain the segmentation result and the corresponding part of speech of each word;
1.2) Use the unigram, bigram and trigram language models to judge whether errors exist in the segmented result;
1.2.1) Determine the part of speech pos_i of w_i: if it is a person name or place name, go to step 1.2.2); otherwise, use the unigram model to look up the frequency P(w_i): if P(w_i) >= threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) P(w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to check the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) >= threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is not a person or place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) >= T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) P(w_{i-2}, w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-2}, w_{i-1} and w_i. If pos_i is a person name or place name, use the trigram model to check the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) >= threshold T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is not a person or place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) >= T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error;
the error checking method based on the LSTM language model specifically comprises the following steps:
2.1) Vectorize each character using a character vector model;
2.2) Perform automatic feature extraction with a bidirectional LSTM model to obtain an output sequence. The specific method is as follows: first, the character vectors generated in step 2.1) are loaded, and the bidirectional LSTM computation begins; the output of the forward LSTM is h_ft and the output of the backward LSTM is h_bt; after vector concatenation, the output of each character x_t is h_t = [h_ft, h_bt], and the outputs of all characters constitute the output sequence, wherein the forward LSTM output characterizes the historical context information and the backward LSTM output characterizes the future context information;
2.3) For each character x_t, pass its output h_t through a Softmax activation function to obtain the probability of the character at the next time step; compare this probability with a set threshold: if the probability is greater than the threshold, the character is considered correct; otherwise it is marked as an error;
S2, for an input text, after the processing of step S1, an error checking result based on the n-gram language model and an error checking result based on the LSTM language model are obtained respectively, and the intersection of the two error checking results is taken as the final error checking result.
2. The method for automatic text proofreading by combining words according to claim 1, characterized in that in step 2.1), a word2vec model is adopted for character vectorization.
3. The method for automatic text proofreading by combining words according to claim 2, characterized in that the Skip-gram method is used to train the word2vec model.
4. The method for automatic text proofreading by combining words according to claim 1, characterized in that the number of occurrences of a new word appearing in the input text is counted, denoted P(new word), and if P(new word) >= a preset threshold, the new word is considered correct and no error is reported.
CN201910349756.XA 2019-04-28 2019-04-28 Automatic text proofreading method combining words Active CN110134950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Publications (2)

Publication Number Publication Date
CN110134950A CN110134950A (en) 2019-08-16
CN110134950B true CN110134950B (en) 2022-12-06

Family

ID=67575430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910349756.XA Active CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Country Status (1)

Country Link
CN (1) CN110134950B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460827B (en) * 2020-04-01 2020-12-15 北京爱咔咔信息技术有限公司 Text information processing method, system, equipment and computer readable storage medium
CN112364631B (en) * 2020-09-21 2022-08-02 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
CN112380850A (en) * 2020-11-30 2021-02-19 沈阳东软智能医疗科技研究院有限公司 Wrongly-written character recognition method, wrongly-written character recognition device, wrongly-written character recognition medium and electronic equipment
CN113836912A (en) * 2021-09-08 2021-12-24 上海蜜度信息技术有限公司 Method, system and device for sequence labeling word segmentation of language model and word stock correction


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142881A (en) * 1999-11-16 2001-05-25 Nippon Telegr & Teleph Corp <Ntt> Statistic language model and probability calculating method using the same
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic grammar error correction for ESL articles based on LSTM and N-gram; Tan Yongmei et al.; Journal of Chinese Information Processing; 2018-06-15 (No. 06); pp. 19-27 *
Research on automatic error detection in Chinese text based on part-of-speech prediction; Wang Hong et al.; Journal of Guizhou Normal University (Natural Science Edition); 2001-05-10 (No. 02); pp. 72-75 *

Also Published As

Publication number Publication date
CN110134950A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110134950B (en) Automatic text proofreading method combining words
CN109726389B (en) Chinese missing pronoun completion method based on common sense and reasoning
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112541356B (en) Method and system for recognizing biomedical named entities
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN110502742B (en) Complex entity extraction method, device, medium and system
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN112836039B (en) Voice data processing method and device based on deep learning
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN115168580A (en) Text classification method based on keyword extraction and attention mechanism
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN112883713A (en) Evaluation object extraction method and device based on convolutional neural network
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN114386425B (en) Big data system establishing method for processing natural language text content
CN116561320A (en) Method, device, equipment and medium for classifying automobile comments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: 100081 16 / F, block a, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant