CN110134950B - Automatic text proofreading method combining words - Google Patents

Automatic text proofreading method combining words

Info

Publication number
CN110134950B
CN110134950B (application CN201910349756.XA)
Authority
CN
China
Prior art keywords
word
model
pos
error
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910349756.XA
Other languages
Chinese (zh)
Other versions
CN110134950A (en
Inventor
苏萌
苏海波
王然
檀玉飞
孙伟
高体伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Percent Technology Group Co ltd
Original Assignee
Beijing Percent Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Percent Technology Group Co ltd
Priority to CN201910349756.XA
Publication of CN110134950A
Application granted
Publication of CN110134950B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text proofreading method combining words, which first applies the following two error checking methods independently: 1) an error checking method based on an n-gram language model; 2) an error checking method based on an LSTM language model; the intersection of the two error checking results is then taken as the final error checking result. The method performs word segmentation and part-of-speech tagging of the input text based on word embedding, a bidirectional LSTM network, a CRF (Conditional Random Field) model and the like, and locates errors in the text based on an n-gram model, a bidirectional LSTM language model and rule-based strategies.

Description

Automatic text proofreading method combining words
Technical Field
The invention relates to the technical field of data processing, and in particular to an automatic text proofreading method combining words.
Background
Automatic text proofreading is a technology for finding and correcting errors in characters, words, word collocations, semantics and grammar in text, and it is one of the main application fields of natural language processing.
Early natural language processing systems were based primarily on manually written rules, which were time-consuming and laborious to produce and could not cover all language phenomena. In the late 1980s, as the computing power of computers kept increasing, machine learning algorithms were introduced into natural language processing. Research then focused mainly on statistical models, whose parameters are learned automatically by training on large-scale corpora; compared with the earlier rule-based methods, this approach is more robust.
Statistical language models were proposed in this environment and context. They are widely used for many natural language processing problems, such as speech recognition, machine translation, word segmentation and part-of-speech tagging. In short, a language model is a model used to compute the probability of a sentence, i.e., P(w_1, w_2, …, w_k). With a language model, one can determine which word sequence is more likely, or, given several words, predict the most likely next word.
The n-gram model, also known as an (n-1)-order Markov model, makes a finite-history assumption: the probability of the current word depends only on the preceding n-1 words. Given a sentence (word sequence) S = w_1, w_2, …, w_k, its probability can be expressed as:

P(S) = \prod_{i=1}^{k} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
When n is 1, 2 or 3, the n-gram model is called a unigram, bigram or trigram language model, respectively. The parameters of the n-gram model are the conditional probabilities P(w_i | w_{i-n+1}, …, w_{i-1}). Assuming a vocabulary size of 100,000, the n-gram model has on the order of 100,000^n parameters. The larger n is, the more accurate and complex the model, and the greater the computation required. Bigrams are the most commonly used, followed by unigrams and trigrams; n >= 4 is rarely used.
The biggest problem with n-grams is that the probability estimates are not very accurate, especially when n is large: guaranteeing accuracy requires a large amount of data, but in practice the counts become sparse because unlimited training data is unavailable. In addition, an n-gram can only count occurrences of word sequences of a fixed, short length (generally no more than 3) and cannot capture long-range context information.
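As an illustration only (not part of the patent text), the following is a minimal Python sketch of maximum-likelihood bigram estimation from corpus counts; the function names and toy corpus are assumptions:

from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """Maximum-likelihood estimate of P(w | w_prev); 0.0 if w_prev is unseen."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

corpus = [["我", "爱", "北京"], ["我", "爱", "学习"]]
uni, bi = train_bigram(corpus)
print(bigram_prob("我", "爱", uni, bi))  # 1.0 in this toy corpus

In practice, smoothing (e.g., add-one or interpolation) would be needed precisely because of the sparsity problem described above.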
Explanation of some of the technical terms:
Word segmentation and part-of-speech tagging: dividing text into individual words and identifying the part of speech of each word (noun, verb, adjective, etc.).
Word2vec: an algorithm developed by Google that, through unsupervised training, turns each word into a vector of several hundred dimensions; these vectors capture the semantic correlations between words. Also called word vectors or word embeddings.
Tensorflow: Google's open-source deep learning platform, providing rich interfaces, multi-platform support (CPU, GPU, Hadoop), distributed training and visual monitoring.
LSTM: the Long Short-Term Memory network is a recurrent neural network suited to processing and predicting important events with relatively long intervals and delays in a time series. Memory and forget gates control how much historical information is retained, effectively alleviating the long-range dependency problem of conventional recurrent neural networks (the standard gate equations are given after this glossary).
CRF: the Conditional Random Field is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition and part-of-speech tagging. A CRF uses a Markov chain as the probability transfer model over hidden variables and infers the hidden variables from observable states; it is a discriminative model.
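For reference, the gate mechanism mentioned in the LSTM entry above is conventionally written as follows (standard formulation, not quoted from the patent):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)   (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)   (input/memory gate)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)   (output gate)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where \sigma is the sigmoid function referred to in the detailed description and \odot denotes element-wise multiplication.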
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide an automatic text proofreading method combining words, which performs word segmentation and part-of-speech tagging of the input text based on word embedding, a bidirectional LSTM network, a CRF (Conditional Random Field) model and the like, and locates errors in the text based on an n-gram model, a bidirectional LSTM language model and rule-based strategies.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for automatically proofreading texts combined by words comprises the following steps:
s1, debugging is carried out by respectively adopting the following two debugging methods:
1) An error checking method based on an n-gram language model;
2) An error checking method based on an Lstm language model;
the debugging method based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method to obtain S = w_1, w_2, …, w_n, where w_i is a word obtained after segmentation and its part of speech is pos_i, i = 1, 2, …, n;
1.2) Use the unigram, bigram and trigram language models to judge whether errors exist in the segmented result;
1.2.1) Determine the part of speech pos_i of w_i: if it is a person name or place name, go to step 1.2.2); otherwise, use the unigram model to look up the frequency P(w_i): if P(w_i) >= threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) P(w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to check the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) >= threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is not a person or place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) >= T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) P(w_{i-2}, w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-2}, w_{i-1} and w_i. If pos_i is a person name or place name, use the trigram model to check the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) >= threshold T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is not a person or place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) >= T_2 (T_2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error;
the error checking method based on the LSTM language model specifically comprises the following steps:
2.1) Vectorize each character using a character vector model;
2.2) Perform automatic feature extraction with a bidirectional LSTM model to obtain an output sequence;
2.3) For each character x_t, pass its output h_t through a Softmax activation function to obtain the probability of the character at the next time step; compare this probability with a set threshold: if the probability is greater than the threshold, the character is considered correct; otherwise it is marked as an error;
S2, for an input text, after the processing of step S1, an error checking result based on the n-gram language model and an error checking result based on the LSTM language model are obtained respectively, and the intersection of the two error checking results is taken as the final error checking result.
Further, in step 2.1), a word2vec model is adopted for character vectorization.
Furthermore, the word2vec model is trained with the Skip-gram method.
Further, the specific method of step 2.2) is as follows:
First, the character vectors generated in step 2.1) are loaded, and the bidirectional LSTM computation begins: the output of the forward LSTM is h_ft and the output of the backward LSTM is h_bt; after vector concatenation, the output of each character x_t is h_t = [h_ft, h_bt], and the outputs of all characters constitute the output sequence.
Furthermore, the number of occurrences of a new word appearing in the input text is counted, denoted P(new word); if P(new word) >= a preset threshold, the new word is considered correct and no error is reported.
The invention has the following beneficial effects: the method performs word segmentation and part-of-speech tagging of the input text based on word embedding, a bidirectional LSTM network, a CRF (Conditional Random Field) model and the like, and locates errors in the text based on an n-gram model, a bidirectional LSTM language model and rule-based strategies. The advantages of the invention are as follows:
1) A deep learning method is used for word segmentation and part-of-speech tagging, so person and place names in the text can be accurately extracted, reducing false alarms caused by them.
2) The proofreading method combining the n-gram and LSTM models can extract both the local and the global features of the text and accurately locate the errors occurring in it.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a CBOW training method;
FIG. 3 is a schematic diagram of the Skip-gram method;
FIG. 4 is a schematic diagram of a word vector model training process according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings. It should be noted that this embodiment is based on the above technical scheme and provides a detailed implementation and a specific operation process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides an automatic text proofreading method combining words, as shown in FIG. 1, comprising the following steps:
S1, error checking is performed independently with each of the following two methods:
1) An error checking method based on an n-gram language model;
2) An error checking method based on an LSTM language model;
The error checking method based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method to obtain S = w_1, w_2, …, w_n, where w_i is a word obtained after segmentation and its part of speech is pos_i, i = 1, 2, …, n;
The deep-learning-based method for word segmentation and part-of-speech tagging of the input text essentially converts the problem into a sequence labeling problem. The specific steps are as follows:
1.1.1) Convert the input text into computable vectors by the word2vec method;
1.1.2) Feed the vectors into the LSTM-CRF model to obtain the segmentation result and the corresponding part of speech of each word;
The method mainly involves three algorithms: word2vec, bidirectional LSTM and CRF.
The word vector model (word2vec) algorithm transforms an incomputable Chinese word into a vector in a low-dimensional space, typically of several hundred dimensions, that a computer can calculate with. Compared with the traditional one-hot method, the dimensionality of the vector is reduced and the representation changes from sparse to dense, greatly reducing the amount of computation. In addition, the distance between vectors approximately describes the semantic relatedness of the corresponding characters.
Recurrent neural networks (RNNs) have been widely demonstrated to be advantageous in the field of natural language processing. For an arbitrary input text sequence (x_1, x_2, …, x_n), an RNN returns a set of output values (h_1, h_2, …, h_n) for the sequence, where each element x_i of the input sequence may be a character (or word) and the corresponding output gives the probability of the character (or word) at time i+1. However, during optimization conventional RNNs suffer from the vanishing gradient problem, so their model parameters can only "memorize" context information a short distance before and after the current character and cannot cope with long-range dependencies. The LSTM model solves this problem: the input and output of historical information are controlled through several gates, each gate being nonlinearly normalized by a sigmoid function to a value between 0 and 1; the closer the value is to 0, the less historical information passes through the gate, and conversely, the closer it is to 1, the more information passes. Through this special design, the LSTM effectively solves the long-range dependency problem in sequence labeling tasks.
The CRF (Conditional Random Field) model is a typical discriminative model proposed by John Lafferty in 2001. It has the advantages of a discriminative model while retaining characteristics of generative models, namely considering the transition probabilities between context labels and performing global parameter optimization and decoding over the whole sequence, and it solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find difficult to avoid. An illustrative model sketch follows.
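As an illustrative sketch only, not the patent's exact implementation: a character-level Bi-LSTM sequence tagger in Keras. All dimensions and names are assumptions, and the CRF decoding layer described above is replaced by a per-position softmax for brevity:

import tensorflow as tf

VOCAB_SIZE = 5000  # assumed character vocabulary size
EMBED_DIM = 128    # assumed embedding dimension
NUM_TAGS = 8       # assumed tag set size (e.g., {B, M, E, S} x POS groups)

# Embedding -> bidirectional LSTM -> per-character tag distribution.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(100, return_sequences=True)),
    # A true LSTM-CRF would replace this softmax with a CRF layer
    # (e.g., tensorflow_addons.layers.CRF) for globally normalized decoding.
    tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

The softmax head makes independent per-character decisions; the CRF layer the patent relies on additionally models transitions between adjacent labels, which is what addresses the label bias problem discussed above.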
1.2) Use the unigram, bigram and trigram language models to judge whether errors exist in the result obtained in step 1.1);
1.2.1) Determine the part of speech pos_i of w_i: if it is a person name or place name, go to step 1.2.2); otherwise, use the unigram model to look up the frequency P(w_i): if P(w_i) >= T_0 (T_0 being a manually set threshold), go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) P(w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to check the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) >= T_1 (T_1 being a manually set threshold), go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is not a person or place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) >= T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) P(w_{i-2}, w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-2}, w_{i-1} and w_i. If pos_i is a person name or place name, use the trigram model to check the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) >= T_2 (T_2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is not a person or place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) >= T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error. A sketch of this three-stage check is given below.
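A minimal sketch of the three-stage check, assuming pre-computed count tables and manually set thresholds; the data structures and names are hypothetical, not taken from the patent:

NAME_POS = {"nr", "ns"}  # assumed POS tags for person and place names

def check_ngram(words, tags, uni, bi, tri, t0, t1, t2):
    """Return indices of words flagged by the unigram/bigram/trigram rules.

    uni, bi and tri map tuples of words (with the last element replaced
    by the POS tag for person/place names) to corpus counts."""
    errors = []
    for i, (w, pos) in enumerate(zip(words, tags)):
        named = pos in NAME_POS
        # Stage 1.2.1): unigram frequency (skipped for person/place names).
        if not named and uni.get((w,), 0) < t0:
            errors.append(i)
            continue
        # Stage 1.2.2): co-occurrence with the previous word.
        left1 = words[i - 1] if i >= 1 else None
        if left1 is not None and bi.get((left1, pos if named else w), 0) < t1:
            errors.append(i)
            continue
        # Stage 1.2.3): co-occurrence with the two previous words.
        left2 = words[i - 2] if i >= 2 else None
        if left2 is not None and tri.get((left2, left1, pos if named else w), 0) < t2:
            errors.append(i)
    return errors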
the error checking method based on the LSTM language model specifically comprises the following steps:
2.1) Vectorize each character in the input text using the character vector model to generate character vectors;
compared with the commonly used word vector, the character-based vectorization technology can bring the following advantages: character features of finer granularity can be represented; because the number of characters is far smaller than the number of words, the obtained model occupies extremely small space, and the loading speed of the model is greatly improved; over time, new words are emerging, and the word vector models trained previously have increasingly severe feature hit rate drop problems that are effectively avoided by character-based vectors, which are created relatively rarely every year.
In this embodiment, a word2vec model is used for character vectorization. It is an unsupervised learning method, i.e., the model can be trained without manually labeled corpora. There are two common training methods: CBOW and Skip-gram. CBOW predicts the center word from its context, combining the vectors of the surrounding characters w(t-2), w(t-1), w(t+1) and w(t+2) of the current character w(t), which fully preserves the context information; see FIG. 2. The Skip-gram method is the opposite, using w(t) to predict the surrounding characters w(t-2), w(t-1), w(t+1) and w(t+2); see FIG. 3. The Skip-gram method is suitable when the data volume is large, so it is used in this embodiment.
As shown in FIG. 4, the specific steps for training the character vectors are:
(1) First, collect relevant balanced corpora (since the learning is unsupervised, the more data the better, and no labeling is needed); the corpora should target the corresponding application scenario and cover as many of its data types as possible;
(2) Preprocess the collected balanced corpora, including filtering garbage data, low-frequency words and meaningless symbols, and then organize the data into the training format, i.e., specify the inputs and outputs, in preparation for defining the training objective;
(3) Feed the training data to the Skip-gram model and train it to obtain the character vector model; a sketch of this step is given below.
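A minimal sketch of step (3) using gensim (the library choice is an assumption; the patent names no implementation), training character-level Skip-gram vectors:

from gensim.models import Word2Vec

# Each "sentence" is a list of single characters, per the character-level setup.
sentences = [list("今天天气很好"), list("我爱自然语言处理")]

model = Word2Vec(
    sentences,
    vector_size=100,  # assumed dimensionality ("several hundred" in the text above)
    window=2,         # assumed context window of two characters per side
    min_count=1,
    sg=1,             # sg=1 selects Skip-gram training (sg=0 would be CBOW)
)
vec = model.wv["天"]  # the learned vector for one character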
2.2) Automatically extract features with a bidirectional LSTM (Bi-LSTM) model to obtain an output sequence. The specific steps are as follows:
First, the character vectors generated in step 2.1) are loaded, and the bidirectional LSTM computation begins with a forward LSTM and a backward LSTM. The output of the forward LSTM is h_ft and the output of the backward LSTM is h_bt; after vector concatenation, the output of each character x_t is h_t = [h_ft, h_bt], and the outputs of all characters constitute the output sequence. The forward LSTM output h_ft characterizes the historical context information, while the backward LSTM output h_bt characterizes the future context information.
2.3) For each character x_t, pass its output h_t through a Softmax activation function to obtain the probability of the character at the next time step; compare this probability with a set threshold: if the probability is greater than the threshold, the character is considered correct; otherwise it is marked as an error. A sketch of steps 2.2)-2.3) follows.
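An illustrative sketch of steps 2.2)-2.3), in the same Keras setting as the tagger sketch above; the threshold and all dimensions are assumptions, and the model would need to be trained before use:

import numpy as np
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, THRESHOLD = 5000, 128, 1e-4  # all assumed values

# Bi-LSTM language model: each h_t = [h_ft, h_bt] is scored over the vocabulary.
lm = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

def lstm_errors(char_ids):
    """Flag position t+1 when the softmax probability assigned to that
    character, given the output at position t, falls below THRESHOLD."""
    probs = lm.predict(np.array([char_ids]), verbose=0)[0]
    flagged = []
    for t in range(len(char_ids) - 1):
        if probs[t, char_ids[t + 1]] < THRESHOLD:
            flagged.append(t + 1)
    return flagged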
S2, for an input text, after the processing of step S1, an error checking result based on the n-gram language model and an error checking result based on the LSTM language model are obtained respectively, and the intersection of the two error checking results is taken as the final error checking result.
In addition, the number of occurrences of a new word appearing in the input text is counted, denoted P(new word); if P(new word) >= a preset threshold, the new word is considered correct and no error is reported. A combined sketch is given below.
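Putting the pieces together, a sketch of step S2 combined with the new-word rule; check_ngram, lstm_errors and the count table are the hypothetical helpers from the sketches above, and both error sets are assumed to have been mapped to word indices:

def final_errors(ngram_errors, lstm_errors, words, new_word_count, t_new):
    """Intersect the two detectors' outputs, then drop frequent new words."""
    candidates = set(ngram_errors) & set(lstm_errors)
    return {
        i for i in candidates
        # A flagged word seen at least t_new times in the input is treated
        # as a legitimate new word and is not reported as an error.
        if new_word_count.get(words[i], 0) < t_new
    }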
Various corresponding changes and modifications can be made by those skilled in the art based on the above technical solutions and concepts, and all such changes and modifications should be included in the protection scope of the present invention.

Claims (4)

1. A method for automatically proofreading text by combining words, characterized by comprising the following steps:
S1, error checking is performed independently with each of the following two methods:
1) An error checking method based on an n-gram language model;
2) An error checking method based on an LSTM language model;
the debugging method based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method to obtain S = w_1, w_2, …, w_n, where w_i is a word obtained after segmentation and its part of speech is pos_i, i = 1, 2, …, n;
wherein the deep-learning-based word segmentation and part-of-speech tagging of the input text comprises the following specific steps:
1.1.1) Convert the input text into computable vectors by the word2vec method;
1.1.2) Feed the vectors into the LSTM-CRF model to obtain the segmentation result and the corresponding part of speech of each word;
1.2) Use the unigram, bigram and trigram language models to judge whether errors exist in the segmented result;
1.2.1) Determine the part of speech pos_i of w_i: if it is a person name or place name, go to step 1.2.2); otherwise, use the unigram model to look up the frequency P(w_i): if P(w_i) >= threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) P(w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to check the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) >= threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is not a person or place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) >= T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) P(w_{i-2}, w_{i-1}, w_i) denotes the number of co-occurrences of w_{i-2}, w_{i-1} and w_i. If pos_i is a person name or place name, use the trigram model to check the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) >= threshold T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is not a person or place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) >= T_2, consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error;
the error checking method based on the LSTM language model specifically comprises the following steps:
2.1) Vectorize each character using a character vector model;
2.2) Perform automatic feature extraction with a bidirectional LSTM model to obtain an output sequence. The specific method is as follows: first, the character vectors generated in step 2.1) are loaded, and the bidirectional LSTM computation begins; the output of the forward LSTM is h_ft and the output of the backward LSTM is h_bt; after vector concatenation, the output of each character x_t is h_t = [h_ft, h_bt], and the outputs of all characters constitute the output sequence, wherein the forward LSTM output characterizes the historical context information and the backward LSTM output characterizes the future context information;
2.3) For each character x_t, pass its output h_t through a Softmax activation function to obtain the probability of the character at the next time step; compare this probability with a set threshold: if the probability is greater than the threshold, the character is considered correct; otherwise it is marked as an error;
S2, for an input text, after the processing of step S1, an error checking result based on the n-gram language model and an error checking result based on the LSTM language model are obtained respectively, and the intersection of the two error checking results is taken as the final error checking result.
2. The method for automatic text proofreading by combining words according to claim 1, characterized in that in step 2.1), a word2vec model is adopted for character vectorization.
3. The method for automatic text proofreading by combining words according to claim 2, characterized in that the Skip-gram method is used to train the word2vec model.
4. The method for automatic text proofreading by combining words according to claim 1, characterized in that the number of occurrences of a new word appearing in the input text is counted, denoted P(new word), and if P(new word) >= a preset threshold, the new word is considered correct and no error is reported.
CN201910349756.XA 2019-04-28 2019-04-28 Automatic text proofreading method combining words Active CN110134950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Publications (2)

Publication Number Publication Date
CN110134950A CN110134950A (en) 2019-08-16
CN110134950B true CN110134950B (en) 2022-12-06

Family

ID=67575430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910349756.XA Active CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Country Status (1)

Country Link
CN (1) CN110134950B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460827B (en) * 2020-04-01 2020-12-15 北京爱咔咔信息技术有限公司 Text information processing method, system, equipment and computer readable storage medium
CN112364631B (en) * 2020-09-21 2022-08-02 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
CN112380850A (en) * 2020-11-30 2021-02-19 沈阳东软智能医疗科技研究院有限公司 Wrongly-written character recognition method, wrongly-written character recognition device, wrongly-written character recognition medium and electronic equipment
CN113836912A (en) * 2021-09-08 2021-12-24 上海蜜度信息技术有限公司 Method, system and device for sequence labeling word segmentation of language model and word stock correction


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142881A (en) * 1999-11-16 2001-05-25 Nippon Telegr & Teleph Corp <Ntt> Statistic language model and probability calculating method using the same
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic grammar error correction for ESL articles based on LSTM and N-gram; Tan Yongmei et al.; Journal of Chinese Information Processing; 2018-06-15 (No. 06); pp. 19-27 *
Research on automatic error detection in Chinese text based on part-of-speech prediction; Wang Hong et al.; Journal of Guizhou Normal University (Natural Science Edition); 2001-05-10 (No. 02); pp. 72-75 *

Also Published As

Publication number Publication date
CN110134950A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110134950B (en) Automatic text proofreading method combining words
CN109726389B (en) Chinese missing pronoun completion method based on common sense and reasoning
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112541356B (en) Method and system for recognizing biomedical named entities
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN110502742B (en) Complex entity extraction method, device, medium and system
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN112836039B (en) Voice data processing method and device based on deep learning
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN115168580A (en) Text classification method based on keyword extraction and attention mechanism
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN112883713A (en) Evaluation object extraction method and device based on convolutional neural network
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN114386425B (en) Big data system establishing method for processing natural language text content
CN116561320A (en) Method, device, equipment and medium for classifying automobile comments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: 100081 16 / F, block a, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant