CN110134950A - Automatic text proofreading method combining words and characters - Google Patents

Automatic text proofreading method combining words and characters

Info

Publication number
CN110134950A
CN110134950A (application CN201910349756.XA)
Authority
CN
China
Prior art keywords
pos
model
error
text
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910349756.XA
Other languages
Chinese (zh)
Other versions
CN110134950B (en)
Inventor
苏萌
苏海波
王然
檀玉飞
孙伟
高体伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baifendian Information Science & Technology Co Ltd
Original Assignee
Beijing Baifendian Information Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baifendian Information Science & Technology Co Ltd
Priority to CN201910349756.XA
Publication of CN110134950A
Application granted
Publication of CN110134950B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text proofreading method combining words and characters. First, the following two error-detection methods are applied separately: 1) error detection based on an n-gram language model; 2) error detection based on an LSTM language model. The intersection of the two detection results is then taken as the final detection result. The method uses word-embedding (word embeddings) technology, a bidirectional LSTM network, a CRF (Conditional Random Field) model, etc. to perform word segmentation and part-of-speech tagging on the input text, and on this basis uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate the errors present in the text.

Description

Automatic text proofreading method combining words and characters
Technical field
The present invention relates to the technical field of data processing, and in particular to an automatic text proofreading method combining words and characters.
Background art
Automatic text proofreading is a technique for finding and correcting errors occurring in text, such as character errors, word errors, faulty collocations, and semantic or grammatical mistakes. It is one of the main application fields of natural language processing.
Early natural language processing systems were mainly based on hand-written rules. This approach is not only time-consuming and labor-intensive but also cannot cover the full variety of linguistic phenomena. In the late 1980s, as the computing power of machines kept improving, machine learning algorithms were introduced into natural language processing. Research then concentrated mainly on statistical models, which learn model parameters automatically from a large-scale training corpus. Compared with the earlier rule-based methods, this approach is more robust.
The statistical language model (Statistical Language Model) emerged in exactly this environment and against this background. It is widely used in various natural language processing problems such as speech recognition, machine translation, word segmentation, and part-of-speech tagging. Briefly, a language model is a model that computes the probability of a sentence, i.e. P(w_1, w_2, …, w_k). With a language model one can determine which word sequence is more likely, or, given several words, predict the most likely next word.
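For reference (this factorization is standard and implied by the context, not reproduced from the patent's figures), the sentence probability decomposes by the chain rule, which the n-gram model below approximates with a truncated history:

```latex
P(w_1, w_2, \ldots, w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \ldots, w_{i-1})
```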
The n-gram model is also called the (n-1)-th order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Given a sentence (word sequence) S = w_1, w_2, …, w_k, its probability can be expressed as:

P(S) = ∏_{i=1}^{k} P(w_i | w_{i−n+1}, …, w_{i−1})
When n is 1, 2, or 3, the n-gram model is called the unigram, bigram, and trigram language model respectively. The parameters of an n-gram model are the conditional probabilities P(w_i | w_{i−n+1}, …, w_{i−1}). If the vocabulary size is 100,000, then the n-gram model has 100,000^n parameters. The larger n is, the more accurate the model, but also the more complex, and the more computation it requires. The most commonly used is the bigram, followed by the unigram and trigram; n ≥ 4 is rare.
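For illustration only (not part of the patent text), a minimal sketch of maximum-likelihood n-gram estimation from counts; the toy corpus and the helper names `ngram_counts` and `trigram_prob` are assumptions for this example:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; in practice the counts come from a large balanced corpus.
tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))

def trigram_prob(w1, w2, w3):
    """MLE estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    denom = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / denom if denom else 0.0

print(trigram_prob("the", "cat", "sat"))  # 0.5: "the cat" is followed once by "sat", once by "ran"
```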
The biggest problem of the n-gram model is that its probability estimates are not very precise, especially when n is large: guaranteeing precision requires a huge amount of data, but in practice it is impossible to obtain that much training data, so the data become sparse. In addition, an n-gram can only count the frequencies of fixed-length word sequences (usually of length no greater than 3) and cannot capture longer contextual information.
Explanation of some technical terms:
Word segmentation and part-of-speech tagging: dividing a sentence into individual words and labeling each word with its part of speech (e.g. noun, verb, adjective).
Word2vec: an algorithm developed by Google. Through unsupervised training it turns a word into a vector of several hundred dimensions; this vector captures the semantic correlations between words. Also called word vectors or word embeddings.
Tensorflow: TensorFlow is Google's open-source deep learning platform. It provides rich interfaces, multi-platform support (CPU, GPU, Hadoop), distributed execution, and visual monitoring.
LSTM: LSTM (Long Short-Term Memory) is a long short-term memory network, a kind of recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series. It controls the retention and discarding of historical information through a "memory gate" and a "forget gate", which effectively solves the long-range dependence problem of conventional recurrent neural networks.
CRF: CRF (Conditional Random Field) is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging, and so on. A CRF is a probabilistic transition model that treats a Markov chain as the hidden variables and discriminates the hidden variables from the observable states; it belongs to the class of discriminative models.
Summary of the invention
In view of the deficiencies of the prior art, the present invention intends to provide an automatic text proofreading method combining words and characters. Based on word-embedding (word embeddings) technology, a bidirectional LSTM network, a CRF (Conditional Random Field) model, etc., it performs word segmentation and part-of-speech tagging on the input text; on this basis it uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate the errors present in the text.
To achieve the above goals, the present invention adopts the following technical scheme:
An automatic text proofreading method combining words and characters, comprising the following steps:
S1. Apply the following two error-detection methods separately:
1) error detection based on an n-gram language model;
2) error detection based on an LSTM language model.
The error detection based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n.
1.2) Use the unigram, bigram, and trigram language models to judge whether the segmented result contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or a place name, go to step 1.2.2); otherwise use the unigram model to judge the frequency P(w_i): if P(w_i) ≥ threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error.
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of the words w_{i-1} and w_i. If pos_i is a person name or a place name, use the bigram model to judge the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) ≥ threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error.
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or a place name, use the trigram model to judge the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ threshold T_2, w_i is considered error-free; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T_2 (T_2 being a manually set threshold), w_i is considered error-free; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error.
The error detection based on the LSTM language model specifically comprises:
2.1) Vectorize each character using a character vector model.
2.2) Perform automatic feature extraction through a bidirectional LSTM model to obtain an output sequence.
2.3) For the output h_t of each character x_t, obtain the probability distribution of the next character through a Softmax activation function, then compare the probability of the character actually appearing at the next position with a set threshold: if that probability is greater than the threshold, the character is correct; otherwise mark the character as an error.
S2. For an input text, the error-detection result based on the n-gram language model and the error-detection result based on the LSTM language model are each obtained through step S1; take the intersection of the two results as the final error-detection result.
Further, in step 2.1), character vectorization is performed using the word2vec model.
Further, the word2vec model is trained using the Skip-gram method.
Further, the specific method of step 2.2) is:
First, load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation. The output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; after the two are concatenated, the output of each character x_t is obtained as h_t = [hf_t, hb_t], and the outputs of all characters constitute the output sequence.
Further, count the occurrences of new words appearing in the input text, i.e. P(new word); if P(new word) > a preset threshold, the new word is considered correct and no error is prompted.
The beneficial effects of the present invention are as follows: the method uses word-embedding (word embeddings) technology, a bidirectional LSTM network, a CRF (Conditional Random Field) model, etc. to perform word segmentation and part-of-speech tagging on the input text, and on this basis uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate the errors present in the text. The advantages of the invention are as follows:
1) Using a deep-learning method for word segmentation and part-of-speech tagging can accurately extract the person names and place names in the text, reducing the false positives caused by person names, place names, and the like.
2) The proofreading method combining n-gram and LSTM can extract both the local features and the global features of the text, and can accurately locate the errors occurring in the text.
Brief description of the drawings
Fig. 1 is a flow diagram of the implementation of the method of the present invention;
Fig. 2 is a schematic diagram of the principle of the CBOW training method;
Fig. 3 is a schematic diagram of the principle of the Skip-gram method;
Fig. 4 is a flow diagram of training the character vector model in the embodiment of the present invention.
Specific embodiment
The invention will be further described below with reference to the drawings. It should be noted that this embodiment is based on the technical scheme above and gives a detailed implementation method and specific operation process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides an automatic text proofreading method combining words and characters which, as shown in Fig. 1, comprises the following steps:
S1. Apply the following two error-detection methods separately:
1) error detection based on an n-gram language model;
2) error detection based on an LSTM language model.
The error detection based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n.
Performing word segmentation and part-of-speech tagging on the input text with a deep-learning-based method in principle converts segmentation and part-of-speech tagging into a sequence labeling problem. The specific steps are as follows:
1.1.1) Convert the input text into computable vectors by the word2vec method;
1.1.2) Feed the vectors into the LSTM-CRF model to obtain the segmentation result and the part of speech corresponding to each word.
This mainly involves three algorithms: word2vec, bidirectional LSTM, and CRF.
The word vector model (word2vec) algorithm turns Chinese words, which a computer cannot compute with directly, into vectors in a low-dimensional space, usually of several hundred dimensions. Compared with the traditional one-hot method this reduces the dimensionality of the vectors, turning them from a sparse representation into a dense one and greatly reducing the amount of computation. In addition, the semantic correlation between characters can then be approximately described by the distance between their vectors.
Recurrent neural networks (RNNs) have been widely demonstrated to be advantageous in the field of natural language processing. For an arbitrary input text sequence (x_1, x_2, …, x_n), an RNN returns a set of output values (h_1, h_2, …, h_n) for the sequence, where any element x_i of the input sequence can be a character (or a word) and the output is the probability of the character (or word) at position i+1. Traditional RNNs, however, suffer from vanishing gradients during optimization, so their parameters can only "remember" the short-range context around the current character and are helpless against long-range dependencies. The advent of the LSTM model solved this problem perfectly: several "gates" control the input and output of historical information, each gate being normalized to between 0 and 1 by a nonlinear sigmoid function. A value closer to 0 means less historical information passes through the gate; conversely, a value closer to 1 means more information passes through. With this special design, the LSTM solves the long-range dependence problem in sequence labeling tasks.
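For reference, the gate computations described above are commonly written as follows (this is the standard LSTM formulation, not an equation reproduced from the patent):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) &&\text{(input/memory gate)}\\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) &&\text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state)}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```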
The CRF (conditional random field) model is a typical discriminative model proposed by John Lafferty in 2001. It models the target sequence on the basis of an observation sequence, focusing on sequence labeling problems. Conditional random field models not only have the advantages of discriminative models but also, like generative models, take into account the transition probabilities between context labels and perform global parameter optimization through sequence-level decoding, which solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find hard to avoid.
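A minimal sketch of the LSTM-CRF tagger of step 1.1.2), assuming TensorFlow with the TensorFlow Addons CRF ops (the patent names no specific library); the vocabulary size, tag set size, and dimensions are placeholder assumptions, and the training loop is omitted:

```python
import tensorflow as tf
import tensorflow_addons as tfa

VOCAB_SIZE, EMB_DIM, HIDDEN, NUM_TAGS = 6000, 300, 128, 8  # assumed sizes

# Character ids in, per-position tag potentials out (Bi-LSTM encoder).
inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True))(x)
potentials = tf.keras.layers.Dense(NUM_TAGS)(x)
encoder = tf.keras.Model(inputs, potentials)

# CRF on top: a label-transition matrix learned jointly with the encoder.
transitions = tf.Variable(tf.random.uniform((NUM_TAGS, NUM_TAGS)))

def crf_neg_log_likelihood(tags, potentials, lengths):
    """Training loss: negative CRF log-likelihood of the gold tag sequences."""
    ll, _ = tfa.text.crf_log_likelihood(potentials, tags, lengths, transitions)
    return -tf.reduce_mean(ll)

def decode(char_ids, lengths):
    """Viterbi-decode the best segmentation/POS tag sequence for a batch."""
    pots = encoder(char_ids)
    tags, _ = tfa.text.crf_decode(pots, transitions, lengths)
    return tags
```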
1.2) Use the unigram, bigram, and trigram language models to judge whether the result obtained in step 1.1) contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or a place name, go to step 1.2.2); otherwise use the unigram model to judge the frequency P(w_i): if P(w_i) ≥ T_0 (T_0 being a manually set threshold), go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error.
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of the words w_{i-1} and w_i. If pos_i is a person name or a place name, use the bigram model to judge the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) ≥ T_1 (T_1 being a manually set threshold), go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error.
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or a place name, use the trigram model to judge the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ T_2 (T_2 being a manually set threshold), w_i is considered error-free; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T_2, w_i is considered error-free; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error. An illustrative sketch of this three-step cascade follows.
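The sketch below (illustrative only, not patent text) implements the threshold cascade of steps 1.2.1)–1.2.3). The count tables `P1`, `P2`, `P3`, the tag set `NAME_TAGS`, and the thresholds are assumptions; as the steps specify, person/place names are backed off to their part-of-speech tag in the bigram and trigram lookups:

```python
NAME_TAGS = {"nr", "ns"}  # assumed person-name / place-name POS tags

def detect_errors_ngram(words, tags, P1, P2, P3, T0, T1, T2):
    """Return indices of words flagged as errors by the n-gram cascade.

    P1[w]         -- unigram frequency of word w
    P2[(a, b)]    -- co-occurrence count of a followed by b
    P3[(a, b, c)] -- co-occurrence count of the triple a, b, c
    """
    errors = []
    for i, (w, pos) in enumerate(zip(words, tags)):
        is_name = pos in NAME_TAGS
        # Step 1.2.1: unigram check (skipped for person/place names).
        if not is_name and P1.get(w, 0) < T0:
            errors.append(i)
            continue
        # Step 1.2.2: bigram check; names are backed off to their POS tag.
        cur = pos if is_name else w
        if i >= 1 and P2.get((words[i - 1], cur), 0) < T1:
            errors.append(i)
            continue
        # Step 1.2.3: trigram check with the same back-off.
        if i >= 2 and P3.get((words[i - 2], words[i - 1], cur), 0) < T2:
            errors.append(i)
    return errors
```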
The error detection based on the LSTM language model specifically comprises:
2.1) Vectorize each character of the input text using the character vector model, generating character vectors.
Compared with ordinary word vectors, character-based vectorization brings the following advantages: it captures character features of finer granularity; since the number of characters is far smaller than the number of words, the resulting model occupies very little space, which greatly improves model loading speed; and as time passes, new words keep emerging, so a previously trained word vector model suffers an increasingly serious drop in feature hit rate, whereas character-based vectors effectively avoid this problem, because relatively few brand-new characters are coined each year.
In this embodiment character vectorization uses the word2vec model. This is an unsupervised learning method, i.e. the model can be trained without manually annotated corpora. There are two common training methods, CBOW and Skip-gram. CBOW predicts the center word from its context: the current character w(t) is predicted from the surrounding characters w(t−2), w(t−1), w(t+1), w(t+2), whose vectors are concatenated, which fully preserves the contextual information (see Fig. 2). The Skip-gram method is exactly the opposite: it uses w(t) to predict the surrounding characters w(t−2), w(t−1), w(t+1), w(t+2) (see Fig. 3). Under big-data conditions the Skip-gram method is the more suitable, so this embodiment uses Skip-gram.
As shown in Fig. 4, the specific steps for training the character vectors are as follows (a minimal training sketch is given after this list):
(1) First collect a relevant balanced corpus (since the learning is unsupervised, the more data the better, and no annotation is needed); the corpus should target the intended application scenario and cover most of that scenario's data types as far as possible;
(2) Preprocess the collected balanced corpus, including filtering out spam data, low-frequency characters, and meaningless symbols, and then organize it into the training data format, i.e. specify the inputs and outputs, ready for setting up the training objective;
(3) Feed the training data to the Skip-gram model; training yields the character vector model.
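A minimal sketch of this training step, assuming the gensim library (not named in the patent); the file name `corpus.txt` (one sentence per line), the dimensionality, and the window size are placeholder choices:

```python
from gensim.models import Word2Vec

# Each training sample is a list of characters, per step (2) above.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [list(line.strip()) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=300,  # a few hundred dimensions, as the description suggests
    window=2,         # the w(t-2)..w(t+2) context window
    sg=1,             # sg=1 selects Skip-gram rather than CBOW
    min_count=5,      # drop low-frequency characters
)
model.save("char_vectors.model")
```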
2.2) Perform automatic feature extraction through a bidirectional LSTM (Bi-LSTM) model to obtain the output sequence. Specifically:
First, load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation, which consists of a forward LSTM and a backward LSTM. The output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; after the two are concatenated, the output of each character x_t is obtained as h_t = [hf_t, hb_t], and the outputs of all characters constitute the output sequence. The forward output hf_t characterizes the historical context, while the backward output hb_t characterizes the future context.
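A sketch of this step under the same Keras assumptions as above (placeholder dimensions; in practice the embedding layer would be initialized from the character vectors of step 2.1)): the Bidirectional wrapper concatenates hf_t and hb_t into h_t, and a Softmax layer turns each h_t into a distribution over the next character, as used in step 2.3):

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HIDDEN = 6000, 300, 128  # assumed sizes

char_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(char_ids)
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),
    merge_mode="concat")(emb)                  # h_t = [hf_t, hb_t]
probs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(h)
lstm_lm = tf.keras.Model(char_ids, probs)
```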
2.3) For the output h_t of each character x_t, obtain the probability distribution of the next character through the Softmax activation function, then compare the probability of the character actually appearing at the next position with a set threshold: if that probability is greater than the threshold, the character is correct; otherwise mark the character as an error.
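Continuing the sketch, the threshold check of this step might look as follows; `THRESHOLD` and the id encoding are assumptions, and `lstm_lm` is the model built above:

```python
import numpy as np

THRESHOLD = 1e-4  # assumed; tuned on held-out data in practice

def detect_errors_lstm(char_ids):
    """char_ids: 1-D array of character ids for one sentence.
    Returns positions whose probability under the LM falls below THRESHOLD."""
    char_ids = np.asarray(char_ids)
    probs = lstm_lm.predict(char_ids[None, :])[0]  # (seq_len, VOCAB_SIZE)
    errors = []
    for t in range(len(char_ids) - 1):
        # h_t predicts the character at position t+1 (step 2.3).
        if probs[t, char_ids[t + 1]] < THRESHOLD:
            errors.append(t + 1)
    return errors
```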
S2. For an input text, the error-detection result based on the n-gram language model and the error-detection result based on the LSTM language model are each obtained through step S1; take the intersection of the two results as the final error-detection result.
In addition, count the occurrences of new words appearing in the input text, i.e. P(new word); if P(new word) > a preset threshold, the new word is considered correct and no error is prompted.
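Putting S2 and the new-word rule together, a final-merge sketch (illustrative; it assumes both detectors report flags over the same token positions, and `new_word_counts` and `threshold` are placeholders):

```python
def final_errors(ngram_errors, lstm_errors, tokens, new_word_counts, threshold):
    """S2 plus the new-word rule: intersect both detectors' flags, then
    un-flag tokens frequent enough to count as legitimate new words."""
    merged = set(ngram_errors) & set(lstm_errors)
    return sorted(i for i in merged
                  if new_word_counts.get(tokens[i], 0) <= threshold)
```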
Those skilled in the art can make various corresponding changes and modifications according to the above technical scheme and concept, and all such changes and modifications shall be construed as falling within the protection scope of the claims of the present invention.

Claims (5)

1. An automatic text proofreading method combining words and characters, characterized by comprising the following steps:
S1. Apply the following two error-detection methods separately:
1) error detection based on an n-gram language model;
2) error detection based on an LSTM language model.
The error detection based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n;
1.2) Use the unigram, bigram, and trigram language models to judge whether the segmented result contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or a place name, go to step 1.2.2); otherwise use the unigram model to judge the frequency P(w_i): if P(w_i) ≥ threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of the words w_{i-1} and w_i. If pos_i is a person name or a place name, use the bigram model to judge the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) ≥ threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or a place name, use the trigram model to judge the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ threshold T_2, w_i is considered error-free; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T_2 (T_2 being a manually set threshold), w_i is considered error-free; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error;
The error detection based on the LSTM language model specifically comprises:
2.1) Vectorize each character using a character vector model;
2.2) Perform automatic feature extraction through a bidirectional LSTM model to obtain an output sequence;
2.3) For the output h_t of each character x_t, obtain the probability distribution of the next character through a Softmax activation function, then compare the probability of the character actually appearing at the next position with a set threshold: if that probability is greater than the threshold, the character is correct; otherwise mark the character as an error;
S2. For an input text, the error-detection result based on the n-gram language model and the error-detection result based on the LSTM language model are each obtained through step S1; take the intersection of the two results as the final error-detection result.
2. The automatic text proofreading method combining words and characters according to claim 1, characterized in that, in step 2.1), character vectorization is performed using the word2vec model.
3. The automatic text proofreading method combining words and characters according to claim 2, characterized in that the word2vec model is trained using the Skip-gram method.
4. The automatic text proofreading method combining words and characters according to claim 1, characterized in that the specific method of step 2.2) is: first, load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation; the output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; after the two are concatenated, the output of each character x_t is obtained as h_t = [hf_t, hb_t], and the outputs of all characters constitute the output sequence.
5. The automatic text proofreading method combining words and characters according to claim 1, characterized in that the occurrences of new words appearing in the input text are counted, i.e. P(new word); if P(new word) > a preset threshold, the new word is considered correct and no error is prompted.
CN201910349756.XA 2019-04-28 2019-04-28 Automatic text proofreading method combining words Active CN110134950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Publications (2)

Publication Number Publication Date
CN110134950A 2019-08-16
CN110134950B CN110134950B (en) 2022-12-06

Family

ID=67575430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910349756.XA Active CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Country Status (1)

Country Link
CN (1) CN110134950B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460827A (en) * 2020-04-01 2020-07-28 北京爱咔咔信息技术有限公司 Text information processing method, system, equipment and computer readable storage medium
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
CN112380850A (en) * 2020-11-30 2021-02-19 沈阳东软智能医疗科技研究院有限公司 Wrongly-written character recognition method, wrongly-written character recognition device, wrongly-written character recognition medium and electronic equipment
CN113836912A (en) * 2021-09-08 2021-12-24 上海蜜度信息技术有限公司 Method, system and device for sequence labeling word segmentation of language model and word stock correction

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142881A (en) * 1999-11-16 2001-05-25 Nippon Telegr & Teleph Corp <Ntt> Statistic language model and probability calculating method using the same
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142881A (en) * 1999-11-16 2001-05-25 Nippon Telegr & Teleph Corp <Ntt> Statistic language model and probability calculating method using the same
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Hong et al., "Research on automatic error detection in Chinese text based on part-of-speech prediction", Journal of Guizhou Normal University (Natural Sciences) *
Tan Yongmei et al., "Automatic correction of grammatical errors in ESL essays based on LSTM and N-gram", Journal of Chinese Information Processing *


Also Published As

Publication number Publication date
CN110134950B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN107992597B (en) Text structuring method for power grid fault case
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN110083831A (en) A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110263325B (en) Chinese word segmentation system
CN110134950A (en) A kind of text auto-collation that words combines
CN109003601A (en) A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112541356B (en) Method and system for recognizing biomedical named entities
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN109684928B (en) Chinese document identification method based on internet retrieval
CN112163425A (en) Text entity relation extraction method based on multi-feature information enhancement
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN112784604A (en) Entity linking method based on entity boundary network
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN109766523A (en) Part-of-speech tagging method and labeling system
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN104317882A (en) Decision-based Chinese word segmentation and fusion method
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN115130475A (en) Extensible universal end-to-end named entity identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: 100081 16 / F, block a, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant