CN110134950A - Automatic text proofreading method combining characters and words - Google Patents
Automatic text proofreading method combining characters and words Download PDF Info
- Publication number
- CN110134950A CN110134950A CN201910349756.XA CN201910349756A CN110134950A CN 110134950 A CN110134950 A CN 110134950A CN 201910349756 A CN201910349756 A CN 201910349756A CN 110134950 A CN110134950 A CN 110134950A
- Authority
- CN
- China
- Prior art keywords
- pos
- model
- mistake
- text
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an automatic text proofreading method that combines characters and words. First, two error-detection methods are applied independently: 1) error detection based on an n-gram language model; 2) error detection based on an LSTM language model. The intersection of the two detection results is then taken as the final result. The method uses word-embedding technology, a bidirectional LSTM network, and a CRF (Conditional Random Field) model to segment the input text and tag parts of speech; on that basis it uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate errors in the text.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to an automatic text proofreading method that combines characters and words.
Background technique
Automatic text proofreading is a technique for detecting and correcting errors in text, such as wrong characters, wrong words, improper collocations, and semantic or grammatical mistakes. It is one of the main application areas of natural language processing.
Early natural language processing systems relied mainly on hand-written rules. This approach was not only time-consuming and labor-intensive but also unable to cover the full variety of linguistic phenomena. In the 1980s, as the computing power of machines kept improving, machine learning algorithms were introduced into natural language processing. Research then concentrated on statistical models, which learn model parameters automatically from large-scale training corpora; compared with the earlier rule-based methods, this approach is more robust.
The Statistical Language Model was proposed in this environment and against this background. It is widely used in natural language processing tasks such as speech recognition, machine translation, word segmentation, and part-of-speech tagging. Briefly, a language model computes the probability of a sentence, i.e. P(W1, W2, …, Wk). With a language model one can decide which word sequence is more likely, or, given several words, predict the word most likely to appear next.
The n-gram model is also called an (n-1)-order Markov model. It makes a limited-history assumption: the probability of the current word depends only on the preceding n-1 words. Given a sentence (word sequence) S = W1, W2, …, Wk, its probability can be written as:
P(S) = ∏ P(Wi | Wi-n+1, …, Wi-1), i = 1, …, k
When n is 1, 2, or 3, the n-gram model is called a unigram, bigram, or trigram language model respectively. The parameters of an n-gram model are the conditional probabilities P(Wi | Wi-n+1, …, Wi-1). If the vocabulary size is 100,000, the n-gram model has on the order of 100,000^n parameters. The larger n is, the more accurate the model, but also the more complex it becomes and the more computation it requires. The bigram is the most common, followed by the unigram and trigram; the case n ≥ 4 is rare.
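The maximum-likelihood estimation behind these models is just counting. A minimal sketch (the toy corpus is a hypothetical example, not from the patent):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy segmented corpus (hypothetical example).
corpus = "the cat sat on the mat the cat ate".split()

uni = ngram_counts(corpus, 1)
bi = ngram_counts(corpus, 2)

def bigram_prob(w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    return bi[(w_prev, w)] / uni[(w_prev,)]

p = bigram_prob("the", "cat")  # "the" occurs 3 times, "the cat" twice, so 2/3
```

The same counting scheme extends to trigrams; the sparsity problem described below appears as zero counts for unseen n-grams.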
The biggest problem with n-grams is that the probability estimates are imprecise, especially when n is large: guaranteeing accuracy requires a very large amount of data, which in practice cannot be obtained, so the data become sparse. Moreover, an n-gram can only count word sequences of a fixed, short length (usually no more than 3) and cannot capture longer-range contextual information.
Some technical terms are explained below:
Word segmentation and part-of-speech tagging: splitting a sentence into individual words and labeling each word with its part of speech (noun, verb, adjective, etc.).
Word2vec: an algorithm developed by Google that, through unsupervised training, turns each word into a vector of several hundred dimensions; this vector captures semantic relations between words. The vectors are also called word vectors or word embeddings.
TensorFlow: Google's open-source deep learning platform, providing rich interfaces, multi-platform (CPU, GPU, Hadoop) and distributed support, and visual monitoring.
LSTM: Long Short-Term Memory, a kind of recurrent neural network suited to processing and predicting events in a time series separated by relatively long intervals and delays. It controls which historical information is kept or discarded through gates (such as the "forget gate"), effectively solving the long-range dependence problem of conventional recurrent neural networks.
CRF: a Conditional Random Field is one of the algorithms commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging, and so on. A CRF is a probabilistic transition model that treats a Markov chain as the hidden variable and discriminates the hidden variables from the observations; it belongs to the class of discriminative models.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides an automatic text proofreading method that combines characters and words. Based on word-embedding technology, a bidirectional LSTM network, and a CRF (Conditional Random Field) model, it segments the input text and tags parts of speech; on that basis it uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate errors in the text.
To achieve the goals above, the present invention adopts the following technical scheme:
An automatic text proofreading method combining characters and words comprises the following steps:
S1. Apply the following two error-detection methods independently:
1) error detection based on the n-gram language model;
2) error detection based on the LSTM language model;
The error-detection method based on the n-gram language model comprises the following steps:
1.1) For the input text S, perform word segmentation and part-of-speech tagging using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n.
1.2) Use the unigram, bigram, and trigram language models to judge whether the segmented result contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or place name, go to step 1.2.2). Otherwise use the unigram model to look up the frequency P(w_i): if P(w_i) ≥ threshold T0, go to step 1.2.2); if P(w_i) < T0, mark w_i as an error.
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of words w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to look up the co-occurrence count P(w_{i-1}, pos_i): if P(w_{i-1}, pos_i) ≥ threshold T1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T1, go to step 1.2.3); if P(w_{i-1}, w_i) < T1, mark w_i as an error.
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of words w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or place name, use the trigram model to look up the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ threshold T2, consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T2 (T2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T2, mark w_i as an error.
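The cascading unigram/bigram/trigram check of steps 1.2.1)–1.2.3) can be sketched as follows. The count tables, the POS tag strings ("nr"/"ns"), and the thresholds T0–T2 are hypothetical stand-ins for statistics gathered from a training corpus:

```python
def check_word(i, words, pos_tags, uni, bi, tri, T0=2, T1=1, T2=1):
    """Flag words[i] as an error when its unigram/bigram/trigram counts
    fall below the thresholds T0, T1, T2 (steps 1.2.1-1.2.3).
    For person/place names the POS tag is matched instead of the word."""
    w, pos = words[i], pos_tags[i]
    is_name = pos in ("nr", "ns")  # hypothetical person-name / place-name tags
    # Step 1.2.1: unigram check (skipped for names, which are open-class)
    if not is_name and uni.get(w, 0) < T0:
        return "error"
    # Step 1.2.2: bigram co-occurrence with the previous word
    if i >= 1:
        key2 = (words[i - 1], pos if is_name else w)
        if bi.get(key2, 0) < T1:
            return "error"
    # Step 1.2.3: trigram co-occurrence with the two previous words
    if i >= 2:
        key3 = (words[i - 2], words[i - 1], pos if is_name else w)
        if tri.get(key3, 0) < T2:
            return "error"
    return "ok"
```

The cascade stops at the first failing level, so a word unseen in the unigram table is flagged without consulting the higher-order models.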
The error-detection method based on the LSTM language model is as follows:
2.1) Vectorize each character using a character vector model.
2.2) Extract features automatically with a bidirectional LSTM model to obtain an output sequence.
2.3) For the output h_t of each character x_t, obtain the probability of the next character through a Softmax activation function, then compare this probability with a set threshold: if the probability of the character at the next position exceeds the threshold, the character is considered correct; otherwise it is marked as an error.
S2. For a given input text, after step S1 two results are obtained: the n-gram-based detection result and the LSTM-based detection result. The intersection of the two is taken as the final detection result.
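Taking the intersection in step S2 trades recall for precision: only positions flagged by both checkers survive. A minimal sketch, assuming both checkers report error positions as indices:

```python
def final_errors(ngram_errors, lstm_errors):
    """Step S2: keep only the positions flagged by BOTH the n-gram
    checker and the LSTM checker, reducing false alarms."""
    return sorted(set(ngram_errors) & set(lstm_errors))

final_errors([2, 5, 7], [5, 7, 9])  # only positions 5 and 7 survive
```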
Further, in step 2.1), character vectorization is performed with a word2vec model.
Further, the word2vec model is trained with the Skip-gram method.
Further, step 2.2) proceeds as follows: first load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation. The output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; concatenating the two yields the output h_t = [hf_t, hb_t] for each character x_t, and the outputs of all characters form the output sequence.
Further, the number of occurrences of a new word in the input text, P(new word), is counted; if P(new word) exceeds a preset threshold, the new word is considered correct and no error is reported.
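The new-word rule above can be sketched as a post-filter on the flagged positions. The vocabulary set, threshold, and token list are hypothetical illustrations:

```python
from collections import Counter

def suppress_new_words(flagged, tokens, known_vocab, min_count=3):
    """If a token outside the known vocabulary repeats at least min_count
    times in the input, treat it as a valid new word and drop any error
    flags on it (the new-word rule)."""
    counts = Counter(tokens)
    return [i for i in flagged
            if not (tokens[i] not in known_vocab
                    and counts[tokens[i]] >= min_count)]
```

The intuition is that a genuinely new term (a product name, say) tends to recur in the same document, while a typo usually appears once.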
The beneficial effects of the present invention are as follows. The method uses word-embedding technology, a bidirectional LSTM network, and a CRF (Conditional Random Field) model to segment the input text and tag parts of speech, and on that basis uses the n-gram model, the bidirectional LSTM language model, and rule-based strategies to locate errors in the text. Its advantages are:
1) Word segmentation and part-of-speech tagging with deep learning methods can accurately extract person and place names in the text, reducing false alarms caused by such names.
2) The proofreading method combining n-gram and LSTM can extract both local and global features of the text and accurately locate the errors that occur in it.
Detailed description of the invention
Fig. 1 is a flow diagram of the implementation of the method of the present invention;
Fig. 2 is a schematic diagram of the CBOW training method;
Fig. 3 is a schematic diagram of the Skip-gram method;
Fig. 4 is a flow diagram of character vector model training in the embodiment of the present invention.
Specific embodiment
The invention is further described below with reference to the drawings. Note that this embodiment is based on the technical solution above and gives the detailed implementation and specific operating process, but the protection scope of the invention is not limited to this embodiment.
This embodiment provides an automatic text proofreading method combining characters and words which, as shown in Fig. 1, comprises the following steps:
S1. Apply the following two error-detection methods independently:
1) error detection based on the n-gram language model;
2) error detection based on the LSTM language model;
The error-detection method based on the n-gram language model comprises the following steps:
1.1) For the input text S, perform word segmentation and part-of-speech tagging using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n.
Performing word segmentation and part-of-speech tagging with a deep-learning-based method converts, in principle, segmentation and tagging into a sequence labeling problem. The specific steps are:
1.1.1) Convert the input text into computable vectors with the word2vec method;
1.1.2) Feed the vectors into an LSTM-CRF model to obtain the segmentation result and the part of speech of each word.
Three algorithms are mainly involved: word2vec, bidirectional LSTM, and CRF.
The word vector model (word2vec) algorithm turns Chinese characters, which a computer cannot compute with directly, into vectors in a low-dimensional space, usually of several hundred dimensions. Compared with the traditional one-hot method it reduces the vector dimensionality, turning sparse representations into dense ones and greatly reducing the computational load. In addition, the semantic relatedness between characters can then be approximated by the distance between their vectors.
Recurrent neural networks (RNNs) have widely demonstrated their advantages in the field of natural language processing. For any input text sequence (x_1, x_2, …, x_n), an RNN returns a set of output values (h_1, h_2, …, h_n) for the sequence, where each input x_i can be a character (or word) and the output is the probability of the character (or word) at position i+1. However, traditional RNNs suffer from vanishing gradients during optimization, so the model parameters can only "remember" short-range context around the current character and are helpless against long-range dependencies. The LSTM model solves this problem: several "gates" control the input and output of historical information, each gate being normalized to between 0 and 1 by a sigmoid nonlinearity. A value closer to 0 means less historical information passes through the gate; conversely, a value closer to 1 means more information passes through. With this special design, LSTM solves the long-range dependence problem in sequence labeling tasks.
The Conditional Random Field (CRF) model is a typical discriminative model proposed by John Lafferty in 2001. It models the target sequence on top of the observation sequence, focusing on sequence labeling problems. CRF models not only have the advantages of discriminative models but also, like generative models, take into account the transition probabilities between context labels and perform global parameter optimization when decoding the sequence, avoiding the label bias problem that other discriminative models (such as the maximum entropy Markov model) find hard to escape.
1.2) Use the unigram, bigram, and trigram language models to judge whether the result obtained in step 1.1) contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or place name, go to step 1.2.2). Otherwise use the unigram model to look up the frequency P(w_i): if P(w_i) ≥ T0 (T0 being a manually set threshold), go to step 1.2.2); if P(w_i) < T0, mark w_i as an error.
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of words w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to look up the co-occurrence count P(w_{i-1}, pos_i): if P(w_{i-1}, pos_i) ≥ T1 (T1 being a manually set threshold), go to step 1.2.3); if P(w_{i-1}, pos_i) < T1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T1, go to step 1.2.3); if P(w_{i-1}, w_i) < T1, mark w_i as an error.
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of words w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or place name, use the trigram model to look up the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ T2 (T2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T2, consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T2, mark w_i as an error.
The error-detection method based on the LSTM language model is as follows:
2.1) Vectorize each character of the input text with the character vector model to generate character vectors.
Compared with ordinary word vectors, character-based vectorization brings the following advantages: it captures finer-grained character features; since the number of characters is far smaller than the number of words, the resulting model occupies very little space, greatly speeding up model loading; and, as time passes, new words keep emerging, so a previously trained word vector model suffers an increasingly severe drop in feature hit rate, whereas character-based vectors effectively avoid this problem, because relatively few new characters are created each year.
In this embodiment, character vectorization uses the word2vec model, an unsupervised learning method that can train a model without manually labeled corpora. There are two common training schemes: CBOW and Skip-gram. CBOW predicts the center word from its context: given the characters w(t-2), w(t-1), w(t+1), w(t+2) around the current character w(t), the vectors of these characters are concatenated, which fully preserves the context information (see Fig. 2). Skip-gram does the opposite: it uses w(t) to predict the surrounding characters w(t-2), w(t-1), w(t+1), w(t+2) (see Fig. 3). With large amounts of data, Skip-gram is preferable, so this embodiment uses the Skip-gram method.
As shown in Fig. 4, the character vector model is trained as follows:
(1) First collect a relevant balanced corpus (since the learning is unsupervised, the more data the better, and no annotation is needed); the corpus should target the intended application scenario and cover most of the data types of that scenario as far as possible;
(2) Preprocess the collected corpus: filter out spam data, low-frequency words, and meaningless symbols, then organize it into the training-data format, i.e. specify the inputs and outputs that establish the training objective;
(3) Feed the training data to the Skip-gram model; training yields the character vector model.
2.2) Extract features automatically with a bidirectional LSTM (Bi-LSTM) model to obtain an output sequence. Specifically: first load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation with a forward LSTM and a backward LSTM. The output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; concatenating the two yields the output h_t = [hf_t, hb_t] for each character x_t, and the outputs of all characters form the output sequence. The forward output hf_t encodes the historical (left) context, while the backward output hb_t encodes the future (right) context.
2.3) For the output h_t of each character x_t, obtain the probability of the next character through a Softmax activation function, then compare this probability with a set threshold: if the probability of the character at the next position exceeds the threshold, the character is considered correct; otherwise it is marked as an error.
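The threshold test of step 2.3) can be sketched as follows; the logit vector and the threshold value are hypothetical, standing in for the Bi-LSTM's per-character output scores over the vocabulary:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def char_is_plausible(logits, char_index, threshold=0.05):
    """Step 2.3: accept the character at the next position when its
    softmax probability exceeds the threshold; otherwise flag it."""
    return softmax(logits)[char_index] > threshold
```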
S2. For a given input text, after step S1 two results are obtained: the n-gram-based detection result and the LSTM-based detection result. The intersection of the two is taken as the final detection result.
In addition, the number of occurrences of a new word in the input text, P(new word), is counted; if P(new word) exceeds a preset threshold, the new word is considered correct and no error is reported.
Those skilled in the art can make various corresponding changes and modifications according to the above technical solution and concept, and all such changes and modifications shall be construed as falling within the protection scope of the claims of the present invention.
Claims (5)
1. An automatic text proofreading method combining characters and words, characterized by comprising the following steps:
S1. Apply the following two error-detection methods independently:
1) error detection based on the n-gram language model;
2) error detection based on the LSTM language model;
The error-detection method based on the n-gram language model comprises the following steps:
1.1) For the input text S, perform word segmentation and part-of-speech tagging using a deep-learning-based method, obtaining S = w_1, w_2, …, w_n, where w_i is a word obtained by segmentation and pos_i is its part of speech, i = 1, 2, …, n;
1.2) Use the unigram, bigram, and trigram language models to judge whether the segmented result contains errors:
1.2.1) Check the part of speech pos_i of w_i. If it is a person name or place name, go to step 1.2.2); otherwise use the unigram model to look up the frequency P(w_i): if P(w_i) ≥ threshold T0, go to step 1.2.2); if P(w_i) < T0, mark w_i as an error;
1.2.2) Let P(w_{i-1}, w_i) denote the co-occurrence count of words w_{i-1} and w_i. If pos_i is a person name or place name, use the bigram model to look up the co-occurrence count P(w_{i-1}, pos_i): if P(w_{i-1}, pos_i) ≥ threshold T1, go to step 1.2.3); if P(w_{i-1}, pos_i) < T1, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with P(w_{i-1}, w_i): if P(w_{i-1}, w_i) ≥ T1, go to step 1.2.3); if P(w_{i-1}, w_i) < T1, mark w_i as an error;
1.2.3) Let P(w_{i-2}, w_{i-1}, w_i) denote the co-occurrence count of words w_{i-2}, w_{i-1}, and w_i. If pos_i is a person name or place name, use the trigram model to look up the co-occurrence count P(w_{i-2}, w_{i-1}, pos_i): if P(w_{i-2}, w_{i-1}, pos_i) ≥ threshold T2, consider w_i correct; if P(w_{i-2}, w_{i-1}, pos_i) < T2, mark w_i as an error. If pos_i is neither a person name nor a place name, judge with the co-occurrence count P(w_{i-2}, w_{i-1}, w_i): if P(w_{i-2}, w_{i-1}, w_i) ≥ T2 (T2 being a manually set threshold), consider w_i correct; if P(w_{i-2}, w_{i-1}, w_i) < T2, mark w_i as an error;
The error-detection method based on the LSTM language model is as follows:
2.1) Vectorize each character with a character vector model;
2.2) Extract features automatically with a bidirectional LSTM model to obtain an output sequence;
2.3) For the output h_t of each character x_t, obtain the probability of the next character through a Softmax activation function, then compare this probability with a set threshold: if the probability of the character at the next position exceeds the threshold, the character is considered correct; otherwise it is marked as an error;
S2. For a given input text, after step S1 two results are obtained: the n-gram-based detection result and the LSTM-based detection result. The intersection of the two is taken as the final detection result.
2. The automatic text proofreading method combining characters and words according to claim 1, characterized in that in step 2.1) character vectorization is performed with a word2vec model.
3. The automatic text proofreading method combining characters and words according to claim 2, characterized in that the word2vec model is trained with the Skip-gram method.
4. The automatic text proofreading method combining characters and words according to claim 1, characterized in that step 2.2) proceeds as follows: first load the character vectors generated in step 2.1), then enter the bidirectional LSTM computation; the output of the forward LSTM is hf_t and the output of the backward LSTM is hb_t; concatenating the two yields the output h_t = [hf_t, hb_t] for each character x_t, and the outputs of all characters form the output sequence.
5. The automatic text proofreading method combining characters and words according to claim 1, characterized in that the number of occurrences of a new word in the input text, P(new word), is counted; if P(new word) exceeds a preset threshold, the new word is considered correct and no error is reported.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910349756.XA CN110134950B (en) | 2019-04-28 | 2019-04-28 | Automatic text proofreading method combining words |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910349756.XA CN110134950B (en) | 2019-04-28 | 2019-04-28 | Automatic text proofreading method combining words |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134950A true CN110134950A (en) | 2019-08-16 |
CN110134950B CN110134950B (en) | 2022-12-06 |
Family
ID=67575430
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910349756.XA Active CN110134950B (en) | 2019-04-28 | 2019-04-28 | Automatic text proofreading method combining words |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134950B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460827A (en) * | 2020-04-01 | 2020-07-28 | 北京爱咔咔信息技术有限公司 | Text information processing method, system, equipment and computer readable storage medium |
CN112364631A (en) * | 2020-09-21 | 2021-02-12 | 山东财经大学 | Chinese grammar error detection method and system based on hierarchical multitask learning |
CN112380850A (en) * | 2020-11-30 | 2021-02-19 | 沈阳东软智能医疗科技研究院有限公司 | Wrongly-written character recognition method, wrongly-written character recognition device, wrongly-written character recognition medium and electronic equipment |
CN113836912A (en) * | 2021-09-08 | 2021-12-24 | 上海蜜度信息技术有限公司 | Method, system and device for sequence labeling word segmentation of language model and word stock correction |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001142881A (en) * | 1999-11-16 | 2001-05-25 | Nippon Telegr & Teleph Corp <Ntt> | Statistic language model and probability calculating method using the same |
CN105279149A (en) * | 2015-10-21 | 2016-01-27 | 上海应用技术学院 | Chinese text automatic correction method |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001142881A (en) * | 1999-11-16 | 2001-05-25 | Nippon Telegr & Teleph Corp <Ntt> | Statistic language model and probability calculating method using the same |
CN105279149A (en) * | 2015-10-21 | 2016-01-27 | 上海应用技术学院 | Chinese text automatic correction method |
Non-Patent Citations (2)
Title |
---|
Wang Hong et al., "Research on automatic error detection in Chinese text based on part-of-speech prediction", Journal of Guizhou Normal University (Natural Science Edition) * |
Tan Yongmei et al., "Automatic grammatical error correction for ESL articles based on LSTM and N-gram", Journal of Chinese Information Processing * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460827A (en) * | 2020-04-01 | 2020-07-28 | 北京爱咔咔信息技术有限公司 | Text information processing method, system, equipment and computer readable storage medium |
CN112364631A (en) * | 2020-09-21 | 2021-02-12 | 山东财经大学 | Chinese grammar error detection method and system based on hierarchical multitask learning |
CN112380850A (en) * | 2020-11-30 | 2021-02-19 | 沈阳东软智能医疗科技研究院有限公司 | Wrongly-written character recognition method, wrongly-written character recognition device, wrongly-written character recognition medium and electronic equipment |
CN113836912A (en) * | 2021-09-08 | 2021-12-24 | 上海蜜度信息技术有限公司 | Method, system and device for sequence labeling word segmentation of language model and word stock correction |
Also Published As
Publication number | Publication date |
---|---|
CN110134950B (en) | 2022-12-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107992597B (en) | Text structuring method for power grid fault case | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN110083831A (en) | A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF | |
CN110263325B (en) | Chinese word segmentation system | |
CN110134950A (en) | A kind of text auto-collation that words combines | |
CN109003601A (en) | A kind of across language end-to-end speech recognition methods for low-resource Tujia language | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN111062217B (en) | Language information processing method and device, storage medium and electronic equipment | |
CN112818118B (en) | Reverse translation-based Chinese humor classification model construction method | |
CN113761890B (en) | Multi-level semantic information retrieval method based on BERT context awareness | |
CN109684928B (en) | Chinese document identification method based on internet retrieval | |
CN112163425A (en) | Text entity relation extraction method based on multi-feature information enhancement | |
CN113377897B (en) | Multi-language medical term standard standardization system and method based on deep confrontation learning | |
CN112966525B (en) | Law field event extraction method based on pre-training model and convolutional neural network algorithm | |
CN113177412A (en) | Named entity identification method and system based on bert, electronic equipment and storage medium | |
CN112784604A (en) | Entity linking method based on entity boundary network | |
CN114298010A (en) | Text generation method integrating dual-language model and sentence detection | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN109766523A (en) | Part-of-speech tagging method and labeling system | |
CN115600597A (en) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium | |
CN104317882A (en) | Decision-based Chinese word segmentation and fusion method | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN115510230A (en) | Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism | |
CN115130475A (en) | Extensible universal end-to-end named entity identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||

Address after: 100081 No.101, 1st floor, building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing Applicant after: Beijing PERCENT Technology Group Co.,Ltd. Address before: 100081 16 / F, block a, Beichen Century Center, building 2, courtyard 8, Beichen West Road, Chaoyang District, Beijing Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co.,Ltd.

GR01 | Patent grant | ||