CN110134950A - Automatic text proofreading method combining words and characters - Google Patents

Automatic text proofreading method combining words and characters

Info

Publication number
CN110134950A
CN110134950A (application CN201910349756.XA)
Authority
CN
China
Prior art keywords
pos
model
error
text
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910349756.XA
Other languages
Chinese (zh)
Other versions
CN110134950B (en)
Inventor
苏萌
苏海波
王然
檀玉飞
孙伟
高体伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baifendian Information Science & Technology Co Ltd
Original Assignee
Beijing Baifendian Information Science & Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baifendian Information Science & Technology Co Ltd
Priority to CN201910349756.XA
Publication of CN110134950A
Application granted
Publication of CN110134950B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text proofreading method combining words and characters. First, the following two error-detection methods are applied separately: 1) error detection based on an n-gram language model; 2) error detection based on an LSTM language model. The intersection of the two detection results is then taken as the final detection result. The method uses word-embedding (word embeddings) technology, a bidirectional LSTM network, a CRF (Conditional Random Field) model, etc. to perform word segmentation and part-of-speech tagging on the input text, and on this basis uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate the errors present in the text.

Description

Automatic text proofreading method combining words and characters
Technical field
The present invention relates to the technical field of data processing, and in particular to an automatic text proofreading method combining words and characters.
Background art
Automatic text proofreading is a technique for finding and correcting errors occurring in text, such as character errors, word errors, faulty collocations, and semantic or grammatical mistakes. It is one of the main application fields of natural language processing.
Early natural language processing systems were mainly based on hand-written rules. This approach is not only time-consuming and labor-intensive but also cannot cover the full variety of linguistic phenomena. In the late 1980s, as the computing power of machines kept improving, machine learning algorithms were introduced into natural language processing. Research then concentrated mainly on statistical models, which learn model parameters automatically from a large-scale training corpus. Compared with the earlier rule-based methods, this approach is more robust.
The statistical language model (Statistical Language Model) emerged in exactly this environment and against this background. It is widely used in various natural language processing problems such as speech recognition, machine translation, word segmentation, and part-of-speech tagging. Briefly, a language model is a model that computes the probability of a sentence, i.e. P(w_1, w_2, …, w_k). With a language model one can determine which word sequence is more likely, or, given several words, predict the most likely next word.
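For reference (this factorization is standard and implied by the context, not reproduced from the patent's figures), the sentence probability decomposes by the chain rule, which the n-gram model below approximates with a truncated history:

```latex
P(w_1, w_2, \ldots, w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \ldots, w_{i-1})
```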
The n-gram model is also called the (n-1)-th order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Given a sentence (word sequence) S = w_1, w_2, …, w_k, its probability can be expressed as:

P(S) = ∏_{i=1}^{k} P(w_i | w_{i−n+1}, …, w_{i−1})
When n is 1, 2, or 3, the n-gram model is called the unigram, bigram, and trigram language model respectively. The parameters of an n-gram model are the conditional probabilities P(w_i | w_{i−n+1}, …, w_{i−1}). If the vocabulary size is 100,000, then the n-gram model has 100,000^n parameters. The larger n is, the more accurate the model, but also the more complex, and the more computation it requires. The most commonly used is the bigram, followed by the unigram and trigram; n ≥ 4 is rare.
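For illustration only (not part of the patent text), a minimal sketch of maximum-likelihood n-gram estimation from counts; the toy corpus and the helper names `ngram_counts` and `trigram_prob` are assumptions for this example:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; in practice the counts come from a large balanced corpus.
tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))

def trigram_prob(w1, w2, w3):
    """MLE estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    denom = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / denom if denom else 0.0

print(trigram_prob("the", "cat", "sat"))  # 0.5: "the cat" is followed once by "sat", once by "ran"
```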
The biggest problem of the n-gram model is that its probability estimates are not very precise, especially when n is large: guaranteeing precision requires a huge amount of data, but in practice it is impossible to obtain that much training data, so the data become sparse. In addition, an n-gram can only count the frequencies of fixed-length word sequences (usually of length no greater than 3) and cannot capture longer contextual information.
Explanation of some technical terms:
Word segmentation and part-of-speech tagging: dividing a sentence into individual words and labeling each word with its part of speech (e.g. noun, verb, adjective).
Word2vec: an algorithm developed by Google. Through unsupervised training it turns a word into a vector of several hundred dimensions; this vector captures the semantic correlations between words. Also called word vectors or word embeddings.
Tensorflow: TensorFlow is Google's open-source deep learning platform. It provides rich interfaces, multi-platform support (CPU, GPU, Hadoop), distributed execution, and visual monitoring.
LSTM: LSTM (Long Short-Term Memory) is a long short-term memory network, a kind of recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series. It controls the retention and discarding of historical information through a "memory gate" and a "forget gate", which effectively solves the long-range dependence problem of conventional recurrent neural networks.
CRF: CRF (Conditional Random Field) is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging, and so on. A CRF is a probabilistic transition model that treats a Markov chain as the hidden variables and discriminates the hidden variables from the observable states; it belongs to the class of discriminative models.
Summary of the invention
In view of the deficiencies of the prior art, the present invention intends to provide an automatic text proofreading method combining words and characters. Based on word-embedding (word embeddings) technology, a bidirectional LSTM network, a CRF (Conditional Random Field) model, etc., it performs word segmentation and part-of-speech tagging on the input text; on this basis it uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate the errors present in the text.
To achieve the above goals, the present invention adopts the following technical scheme:
An automatic text proofreading method combining words and characters, comprising the following steps:
S1. Apply the following two error-detection methods separately:
1) error detection based on an n-gram language model;
2) error detection based on an LSTM language model.
The error detection based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n.
1.2) Use the unigram, bigram, and trigram language models to judge whether the segmented result contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or a place name, go to step 1.2.2); otherwise use the unigram model to judge the frequency P(w_i): if P(w_i) ≥ threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error.
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of the words w_{i-1} and w_i. If pos_i is a person name or a place name, use the bigram model to judge the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) ≥ threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error.
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or a place name, use the trigram model to judge the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ threshold T_2, w_i is considered error-free; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T_2 (T_2 being a manually set threshold), w_i is considered error-free; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error.
The error detection based on the LSTM language model specifically comprises:
2.1) Vectorize each character using a character vector model.
2.2) Perform automatic feature extraction through a bidirectional LSTM model to obtain an output sequence.
2.3) For the output h_t of each character x_t, obtain the probability distribution of the next character through a Softmax activation function, then compare the probability of the character actually appearing at the next position with a set threshold: if that probability is greater than the threshold, the character is correct; otherwise mark the character as an error.
S2. For an input text, the error-detection result based on the n-gram language model and the error-detection result based on the LSTM language model are each obtained through step S1; take the intersection of the two results as the final error-detection result.
Further, in step 2.1), character vectorization is performed using the word2vec model.
Further, the word2vec model is trained using the Skip-gram method.
Further, the specific method of step 2.2) is:
First, load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation. The output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; after the two are concatenated, the output of each character x_t is obtained as h_t = [hf_t, hb_t], and the outputs of all characters constitute the output sequence.
Further, count the occurrences of new words appearing in the input text, i.e. P(new word); if P(new word) > a preset threshold, the new word is considered correct and no error is prompted.
The beneficial effects of the present invention are as follows: the method uses word-embedding (word embeddings) technology, a bidirectional LSTM network, a CRF (Conditional Random Field) model, etc. to perform word segmentation and part-of-speech tagging on the input text, and on this basis uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate the errors present in the text. The advantages of the invention are as follows:
1) Using a deep-learning method for word segmentation and part-of-speech tagging can accurately extract the person names and place names in the text, reducing the false positives caused by person names, place names, and the like.
2) The proofreading method combining n-gram and LSTM can extract both the local features and the global features of the text, and can accurately locate the errors occurring in the text.
Brief description of the drawings
Fig. 1 is a flow diagram of the implementation of the method of the present invention;
Fig. 2 is a schematic diagram of the principle of the CBOW training method;
Fig. 3 is a schematic diagram of the principle of the Skip-gram method;
Fig. 4 is a flow diagram of training the character vector model in the embodiment of the present invention.
Specific embodiment
The invention will be further described below with reference to the drawings. It should be noted that this embodiment is based on the technical scheme above and gives a detailed implementation method and specific operation process, but the protection scope of the present invention is not limited to this embodiment.
This embodiment provides an automatic text proofreading method combining words and characters which, as shown in Fig. 1, comprises the following steps:
S1. Apply the following two error-detection methods separately:
1) error detection based on an n-gram language model;
2) error detection based on an LSTM language model.
The error detection based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n.
Performing word segmentation and part-of-speech tagging on the input text with a deep-learning-based method in principle converts segmentation and part-of-speech tagging into a sequence labeling problem. The specific steps are as follows:
1.1.1) Convert the input text into computable vectors by the word2vec method;
1.1.2) Feed the vectors into the LSTM-CRF model to obtain the segmentation result and the part of speech corresponding to each word.
This mainly involves three algorithms: word2vec, bidirectional LSTM, and CRF.
The word vector model (word2vec) algorithm turns Chinese words, which a computer cannot compute with directly, into vectors in a low-dimensional space, usually of several hundred dimensions. Compared with the traditional one-hot method this reduces the dimensionality of the vectors, turning them from a sparse representation into a dense one and greatly reducing the amount of computation. In addition, the semantic correlation between characters can then be approximately described by the distance between their vectors.
Recurrent neural networks (RNNs) have been widely demonstrated to be advantageous in the field of natural language processing. For an arbitrary input text sequence (x_1, x_2, …, x_n), an RNN returns a set of output values (h_1, h_2, …, h_n) for the sequence, where any element x_i of the input sequence can be a character (or a word) and the output is the probability of the character (or word) at position i+1. Traditional RNNs, however, suffer from vanishing gradients during optimization, so their parameters can only "remember" the short-range context around the current character and are helpless against long-range dependencies. The advent of the LSTM model solved this problem perfectly: several "gates" control the input and output of historical information, each gate being normalized to between 0 and 1 by a nonlinear sigmoid function. A value closer to 0 means less historical information passes through the gate; conversely, a value closer to 1 means more information passes through. With this special design, the LSTM solves the long-range dependence problem in sequence labeling tasks.
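For reference, the gate computations described above are commonly written as follows (this is the standard LSTM formulation, not an equation reproduced from the patent):

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) &&\text{(input/memory gate)}\\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) &&\text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{(cell state)}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```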
The CRF (conditional random field) model is a typical discriminative model proposed by John Lafferty in 2001. It models the target sequence on the basis of an observation sequence, focusing on sequence labeling problems. Conditional random field models not only have the advantages of discriminative models but also, like generative models, take into account the transition probabilities between context labels and perform global parameter optimization through sequence-level decoding, which solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find hard to avoid.
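A minimal sketch of the LSTM-CRF tagger of step 1.1.2), assuming TensorFlow with the TensorFlow Addons CRF ops (the patent names no specific library); the vocabulary size, tag set size, and dimensions are placeholder assumptions, and the training loop is omitted:

```python
import tensorflow as tf
import tensorflow_addons as tfa

VOCAB_SIZE, EMB_DIM, HIDDEN, NUM_TAGS = 6000, 300, 128, 8  # assumed sizes

# Character ids in, per-position tag potentials out (Bi-LSTM encoder).
inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True))(x)
potentials = tf.keras.layers.Dense(NUM_TAGS)(x)
encoder = tf.keras.Model(inputs, potentials)

# CRF on top: a label-transition matrix learned jointly with the encoder.
transitions = tf.Variable(tf.random.uniform((NUM_TAGS, NUM_TAGS)))

def crf_neg_log_likelihood(tags, potentials, lengths):
    """Training loss: negative CRF log-likelihood of the gold tag sequences."""
    ll, _ = tfa.text.crf_log_likelihood(potentials, tags, lengths, transitions)
    return -tf.reduce_mean(ll)

def decode(char_ids, lengths):
    """Viterbi-decode the best segmentation/POS tag sequence for a batch."""
    pots = encoder(char_ids)
    tags, _ = tfa.text.crf_decode(pots, transitions, lengths)
    return tags
```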
1.2) Use the unigram, bigram, and trigram language models to judge whether the result obtained in step 1.1) contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or a place name, go to step 1.2.2); otherwise use the unigram model to judge the frequency P(w_i): if P(w_i) ≥ T_0 (T_0 being a manually set threshold), go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error.
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of the words w_{i-1} and w_i. If pos_i is a person name or a place name, use the bigram model to judge the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) ≥ T_1 (T_1 being a manually set threshold), go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error.
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or a place name, use the trigram model to judge the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ T_2 (T_2 being a manually set threshold), w_i is considered error-free; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T_2, w_i is considered error-free; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error. An illustrative sketch of this three-step cascade follows.
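The sketch below (illustrative only, not patent text) implements the threshold cascade of steps 1.2.1)–1.2.3). The count tables `P1`, `P2`, `P3`, the tag set `NAME_TAGS`, and the thresholds are assumptions; as the steps specify, person/place names are backed off to their part-of-speech tag in the bigram and trigram lookups:

```python
NAME_TAGS = {"nr", "ns"}  # assumed person-name / place-name POS tags

def detect_errors_ngram(words, tags, P1, P2, P3, T0, T1, T2):
    """Return indices of words flagged as errors by the n-gram cascade.

    P1[w]         -- unigram frequency of word w
    P2[(a, b)]    -- co-occurrence count of a followed by b
    P3[(a, b, c)] -- co-occurrence count of the triple a, b, c
    """
    errors = []
    for i, (w, pos) in enumerate(zip(words, tags)):
        is_name = pos in NAME_TAGS
        # Step 1.2.1: unigram check (skipped for person/place names).
        if not is_name and P1.get(w, 0) < T0:
            errors.append(i)
            continue
        # Step 1.2.2: bigram check; names are backed off to their POS tag.
        cur = pos if is_name else w
        if i >= 1 and P2.get((words[i - 1], cur), 0) < T1:
            errors.append(i)
            continue
        # Step 1.2.3: trigram check with the same back-off.
        if i >= 2 and P3.get((words[i - 2], words[i - 1], cur), 0) < T2:
            errors.append(i)
    return errors
```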
The error detection based on the LSTM language model specifically comprises:
2.1) Vectorize each character of the input text using the character vector model, generating character vectors.
Compared with ordinary word vectors, character-based vectorization brings the following advantages: it captures character features of finer granularity; since the number of characters is far smaller than the number of words, the resulting model occupies very little space, which greatly improves model loading speed; and as time passes, new words keep emerging, so a previously trained word vector model suffers an increasingly serious drop in feature hit rate, whereas character-based vectors effectively avoid this problem, because relatively few brand-new characters are coined each year.
In this embodiment character vectorization uses the word2vec model. This is an unsupervised learning method, i.e. the model can be trained without manually annotated corpora. There are two common training methods, CBOW and Skip-gram. CBOW predicts the center word from its context: the current character w(t) is predicted from the surrounding characters w(t−2), w(t−1), w(t+1), w(t+2), whose vectors are concatenated, which fully preserves the contextual information (see Fig. 2). The Skip-gram method is exactly the opposite: it uses w(t) to predict the surrounding characters w(t−2), w(t−1), w(t+1), w(t+2) (see Fig. 3). Under big-data conditions the Skip-gram method is the more suitable, so this embodiment uses Skip-gram.
As shown in Fig. 4, the specific steps for training the character vectors are as follows (a minimal training sketch is given after this list):
(1) First collect a relevant balanced corpus (since the learning is unsupervised, the more data the better, and no annotation is needed); the corpus should target the intended application scenario and cover most of that scenario's data types as far as possible;
(2) Preprocess the collected balanced corpus, including filtering out spam data, low-frequency characters, and meaningless symbols, and then organize it into the training data format, i.e. specify the inputs and outputs, ready for setting up the training objective;
(3) Feed the training data to the Skip-gram model; training yields the character vector model.
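A minimal sketch of this training step, assuming the gensim library (not named in the patent); the file name `corpus.txt` (one sentence per line), the dimensionality, and the window size are placeholder choices:

```python
from gensim.models import Word2Vec

# Each training sample is a list of characters, per step (2) above.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [list(line.strip()) for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=300,  # a few hundred dimensions, as the description suggests
    window=2,         # the w(t-2)..w(t+2) context window
    sg=1,             # sg=1 selects Skip-gram rather than CBOW
    min_count=5,      # drop low-frequency characters
)
model.save("char_vectors.model")
```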
2.2) Perform automatic feature extraction through a bidirectional LSTM (Bi-LSTM) model to obtain the output sequence. Specifically:
First, load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation, which consists of a forward LSTM and a backward LSTM. The output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; after the two are concatenated, the output of each character x_t is obtained as h_t = [hf_t, hb_t], and the outputs of all characters constitute the output sequence. The forward output hf_t characterizes the historical context, while the backward output hb_t characterizes the future context.
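A sketch of this step under the same Keras assumptions as above (placeholder dimensions; in practice the embedding layer would be initialized from the character vectors of step 2.1)): the Bidirectional wrapper concatenates hf_t and hb_t into h_t, and a Softmax layer turns each h_t into a distribution over the next character, as used in step 2.3):

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HIDDEN = 6000, 300, 128  # assumed sizes

char_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(char_ids)
h = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(HIDDEN, return_sequences=True),
    merge_mode="concat")(emb)                  # h_t = [hf_t, hb_t]
probs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(h)
lstm_lm = tf.keras.Model(char_ids, probs)
```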
2.3) For the output h_t of each character x_t, obtain the probability distribution of the next character through the Softmax activation function, then compare the probability of the character actually appearing at the next position with a set threshold: if that probability is greater than the threshold, the character is correct; otherwise mark the character as an error.
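Continuing the sketch, the threshold check of this step might look as follows; `THRESHOLD` and the id encoding are assumptions, and `lstm_lm` is the model built above:

```python
import numpy as np

THRESHOLD = 1e-4  # assumed; tuned on held-out data in practice

def detect_errors_lstm(char_ids):
    """char_ids: 1-D array of character ids for one sentence.
    Returns positions whose probability under the LM falls below THRESHOLD."""
    char_ids = np.asarray(char_ids)
    probs = lstm_lm.predict(char_ids[None, :])[0]  # (seq_len, VOCAB_SIZE)
    errors = []
    for t in range(len(char_ids) - 1):
        # h_t predicts the character at position t+1 (step 2.3).
        if probs[t, char_ids[t + 1]] < THRESHOLD:
            errors.append(t + 1)
    return errors
```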
S2. For an input text, the error-detection result based on the n-gram language model and the error-detection result based on the LSTM language model are each obtained through step S1; take the intersection of the two results as the final error-detection result.
In addition, count the occurrences of new words appearing in the input text, i.e. P(new word); if P(new word) > a preset threshold, the new word is considered correct and no error is prompted.
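Putting S2 and the new-word rule together, a final-merge sketch (illustrative; it assumes both detectors report flags over the same token positions, and `new_word_counts` and `threshold` are placeholders):

```python
def final_errors(ngram_errors, lstm_errors, tokens, new_word_counts, threshold):
    """S2 plus the new-word rule: intersect both detectors' flags, then
    un-flag tokens frequent enough to count as legitimate new words."""
    merged = set(ngram_errors) & set(lstm_errors)
    return sorted(i for i in merged
                  if new_word_counts.get(tokens[i], 0) <= threshold)
```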
Those skilled in the art can make various corresponding changes and modifications according to the above technical scheme and concept, and all such changes and modifications shall be construed as falling within the protection scope of the claims of the present invention.

Claims (5)

1. An automatic text proofreading method combining words and characters, characterized by comprising the following steps:
S1. Apply the following two error-detection methods separately:
1) error detection based on an n-gram language model;
2) error detection based on an LSTM language model.
The error detection based on the n-gram language model comprises the following steps:
1.1) For an input text S, perform word segmentation and part-of-speech tagging on the text using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n;
1.2) Use the unigram, bigram, and trigram language models to judge whether the segmented result contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or a place name, go to step 1.2.2); otherwise use the unigram model to judge the frequency P(w_i): if P(w_i) ≥ threshold T_0, go to step 1.2.2); if P(w_i) < T_0, mark w_i as an error;
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of the words w_{i-1} and w_i. If pos_i is a person name or a place name, use the bigram model to judge the co-occurrence count P(w_{i-1}, pos_i) of w_{i-1} and pos_i: if P(w_{i-1}, pos_i) ≥ threshold T_1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T_1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T_1, go to step 1.2.3); if P(w_{i-1}, w_i) < T_1, mark w_i as an error;
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or a place name, use the trigram model to judge the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ threshold T_2, w_i is considered error-free; if P(w_{i-2}, w_{i-1}, pos_i) < T_2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T_2 (T_2 being a manually set threshold), w_i is considered error-free; if P(w_{i-2}, w_{i-1}, w_i) < T_2, mark w_i as an error;
The error detection based on the LSTM language model specifically comprises:
2.1) Vectorize each character using a character vector model;
2.2) Perform automatic feature extraction through a bidirectional LSTM model to obtain an output sequence;
2.3) For the output h_t of each character x_t, obtain the probability distribution of the next character through a Softmax activation function, then compare the probability of the character actually appearing at the next position with a set threshold: if that probability is greater than the threshold, the character is correct; otherwise mark the character as an error;
S2. For an input text, the error-detection result based on the n-gram language model and the error-detection result based on the LSTM language model are each obtained through step S1; take the intersection of the two results as the final error-detection result.
2. The automatic text proofreading method combining words and characters according to claim 1, characterized in that, in step 2.1), character vectorization is performed using the word2vec model.
3. The automatic text proofreading method combining words and characters according to claim 2, characterized in that the word2vec model is trained using the Skip-gram method.
4. The automatic text proofreading method combining words and characters according to claim 1, characterized in that the specific method of step 2.2) is: first, load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation; the output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; after the two are concatenated, the output of each character x_t is obtained as h_t = [hf_t, hb_t], and the outputs of all characters constitute the output sequence.
5. The automatic text proofreading method combining words and characters according to claim 1, characterized in that the occurrences of new words appearing in the input text are counted, i.e. P(new word); if P(new word) > a preset threshold, the new word is considered correct and no error is prompted.
CN201910349756.XA 2019-04-28 2019-04-28 Automatic text proofreading method combining words Active CN110134950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910349756.XA CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Publications (2)

Publication Number Publication Date
CN110134950A 2019-08-16
CN110134950B CN110134950B (en) 2022-12-06

Family

ID=67575430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910349756.XA Active CN110134950B (en) 2019-04-28 2019-04-28 Automatic text proofreading method combining words

Country Status (1)

Country Link
CN (1) CN110134950B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460827A (en) * 2020-04-01 2020-07-28 北京爱咔咔信息技术有限公司 Text information processing method, system, equipment and computer readable storage medium
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
CN112380850A (en) * 2020-11-30 2021-02-19 沈阳东软智能医疗科技研究院有限公司 Wrongly-written character recognition method, wrongly-written character recognition device, wrongly-written character recognition medium and electronic equipment
CN113836912A (en) * 2021-09-08 2021-12-24 上海蜜度信息技术有限公司 Method, system and device for sequence labeling word segmentation of language model and word stock correction

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142881A (en) * 1999-11-16 2001-05-25 Nippon Telegr & Teleph Corp <Ntt> Statistic language model and probability calculating method using the same
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001142881A (en) * 1999-11-16 2001-05-25 Nippon Telegr & Teleph Corp <Ntt> Statistic language model and probability calculating method using the same
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Hong et al., "Research on automatic error detection in Chinese text based on part-of-speech prediction", Journal of Guizhou Normal University (Natural Sciences) *
Tan Yongmei et al., "Automatic correction of grammatical errors in ESL essays based on LSTM and N-gram", Journal of Chinese Information Processing *


Also Published As

Publication number Publication date
CN110134950B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN107992597B (en) Text structuring method for power grid fault case
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN110083831A (en) A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110263325B (en) Chinese word segmentation system
CN110134950A (en) A kind of text auto-collation that words combines
CN109003601A (en) A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112541356B (en) Method and system for recognizing biomedical named entities
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
CN109684928B (en) Chinese document identification method based on internet retrieval
CN112163425A (en) Text entity relation extraction method based on multi-feature information enhancement
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN112966525B (en) Law field event extraction method based on pre-training model and convolutional neural network algorithm
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN112784604A (en) Entity linking method based on entity boundary network
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN109766523A (en) Part-of-speech tagging method and labeling system
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN104317882A (en) Decision-based Chinese word segmentation and fusion method
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN115130475A (en) Extensible universal end-to-end named entity identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing

Applicant after: Beijing PERCENT Technology Group Co.,Ltd.

Address before: 100081 16 / F, block a, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing

Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 Patent grant