CN108959260A - A Chinese grammar error detection method based on text-contextualized word vectors - Google Patents

A Chinese grammar error detection method based on text-contextualized word vectors

Info

Publication number
CN108959260A
CN108959260A CN201810735068.2A
Authority
CN
China
Prior art keywords
text
word vector
word
neural network
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810735068.2A
Other languages
Chinese (zh)
Other versions
CN108959260B (en)
Inventor
李思
赵建博
李明正
徐雅静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201810735068.2A priority Critical patent/CN108959260B/en
Publication of CN108959260A publication Critical patent/CN108959260A/en
Application granted granted Critical
Publication of CN108959260B publication Critical patent/CN108959260B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a Chinese grammar error detection method and device, belonging to the field of information processing. The method comprises: mapping each word of the input text to a word vector and concatenating the word vectors into a text matrix; using a recurrent neural network to form a mask indicating the importance of each component of the word vectors; rebuilding the text matrix; extracting contextual information with a recurrent neural network; computing an error score for each word with a feedforward neural network; and inferring error positions from the error scores. By grounding detection in text-contextualized word vectors, the invention improves the effect of Chinese grammar error detection and has great practical value.

Description

A Chinese grammar error detection method based on text-contextualized word vectors
Technical field
The present invention relates to the field of information processing, and in particular to a neural-network-based Chinese grammar error detection method.
Background technique
Chinese grammar error detection is a relatively new task in Chinese natural language processing. Its goal is to judge whether a sentence written by a non-native speaker of Chinese contains errors and to provide error information.
The most common approach to Chinese grammar error detection at present is to treat it as a supervised sequence labeling task. Commonly used models for grammatical error detection include N-gram models and recurrent neural networks. However, these models rely heavily on hand-crafted features and require considerable manual feature engineering. Recently, because neural networks can learn textual features on their own and thereby replace complex hand-crafted features, much work has attempted to apply neural networks to Chinese grammar error detection. However, most of this work does not make good use of the information expressed by Chinese vocabulary and ignores the fact that the same word may have different meanings in different texts. To solve this problem, the present invention uses a recurrent neural network to obtain a mask indicating the importance of each component of a word vector, and then applies a recurrent neural network again, achieving better error detection results.
Summary of the invention
To solve the existing technical problems, the present invention provides a neural-network-based Chinese grammar error detection method. The scheme is as follows:
Step 1: each word of the input text is mapped to a word vector, so that the input text is mapped to a text matrix.
Step 2: the text matrix is processed by a recurrent neural network to obtain a mask indicating the importance of each word vector component in the current text.
Step 3: the text matrix is processed with this mask to obtain a text matrix formed from the rebuilt word vectors.
Step 4: the rebuilt text matrix is fed into a recurrent neural network to obtain a feature representation of each word vector in the text.
Step 5: the feature representation of each word vector is passed through a feedforward neural network to obtain an error score for each word.
Step 6: error positions are inferred from the error scores of all words at the level of the whole text, yielding information on the erroneous words.
Brief description of the drawings
Fig. 1 is the network structure of the Chinese grammar error detection method provided by the invention.
Fig. 2 is the internal structure of a long short-term memory network unit.
Specific embodiment
Embodiments of the present invention are described in more detail below.
Fig. 1 shows the network structure of the error detection method provided by the invention, which includes:
Step S1: word-vectorize the input text;
Step S2: a recurrent neural network forms a mask indicating the importance of each word vector component;
Step S3: rebuild the text matrix;
Step S4: a recurrent neural network extracts contextual information;
Step S5: a feedforward neural network computes an error score for each word;
Step S6: error positions are inferred from the error scores;
Each step is described in detail below:
Step S1: word-vectorize the input text. The invention first builds a mapping dictionary from words to word indices and maps each word in the text to its corresponding word index. A word vector matrix is then built, in which each row corresponds to one word index and represents one word vector, so the matrix maps word indices to word vectors. The word vectors of the words in the text are concatenated to form the text matrix. Assuming the Chinese vocabulary contains N words, the word vector matrix can be expressed as an N*d matrix, where d is the dimension of the word vectors. After vectorization, the whole input text can be expressed as the text matrix x:
$x = x_1 \oplus x_2 \oplus \cdots \oplus x_n$
where $x_i$ is the word vector of the i-th word in the text, n is the text length, i.e. the number of words in the text, and $\oplus$ denotes column-wise concatenation of the vectors.
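To make this step concrete, the following is a minimal PyTorch sketch, an illustration under assumed names and sizes rather than the patented implementation: a toy word-to-index dictionary, an N*d embedding matrix, and the lookup that turns the input text into the text matrix x.

```python
import torch
import torch.nn as nn

# Assumed toy vocabulary: word -> word index (0 reserved for unknown words)
vocab = {"<unk>": 0, "我": 1, "喜欢": 2, "学习": 3, "中文": 4}
d = 8                                        # assumed word-vector dimension
embedding = nn.Embedding(len(vocab), d)      # the N*d word vector matrix

words = ["我", "喜欢", "学习", "中文"]          # segmented input text
ids = torch.tensor([[vocab.get(w, 0) for w in words]])   # word indices, shape (1, n)

# Text matrix x: the word vectors of the n words, concatenated
x = embedding(ids)                           # shape (1, n, d)
print(x.shape)                               # torch.Size([1, 4, 8])
```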
Step S2: a recurrent neural network forms the mask indicating the importance of each word vector component. Recurrent neural networks can extract the contextual information of text effectively and are widely used in text processing, for example in Chinese word segmentation and text classification. Compared with N-gram methods based on statistical learning, recurrent neural networks can attend to dependencies over longer spans and better capture the global information of a document. Traditional recurrent neural networks suffer from vanishing and exploding gradients, and the long short-term memory network (LSTM) solves this problem well: its input gate, forget gate and output gate allow it to control long-range dependencies more effectively. Current Chinese grammar error detection methods do not fully take into account the limitation of a single word vector in expressing word meaning; the same word may have different meanings in different texts. Therefore, a recurrent neural network is used here to form a mask indicating the importance of each component of the word vector, and the word vectors are rebuilt with this mask.
Fig. 2 shows the cell structure of a long short-term memory network; at time t it can be described as:
$i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)$
$f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)$
$o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)$
$\tilde{C}_t = \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$h_t = o_t \odot \tanh(C_t)$
where x is the input vector, C is the memory cell, i is the input gate, f is the forget gate, and o is the output gate. $\sigma$ is the sigmoid activation function, $\odot$ denotes element-wise multiplication, and $\cdot$ denotes matrix multiplication. W and U are the weight matrices of the input and the hidden layer respectively, and b is the bias. $\tilde{C}_t$ is the candidate value of the memory cell, determined jointly by the current input and the previous hidden state. $C_t$ is the combined effect of the input gate acting on the candidate value and the forget gate acting on the memory cell value of the previous moment. The mask indicating the importance of each word vector component is given by the output of the long short-term memory network at the last position, which contains information about the entire text from beginning to end. The computation of the mask can be described as:
$mask = h_n$
where mask is the mask indicating the importance of each component of the word vectors.
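As a sketch of this step, an illustration under assumed names and sizes rather than the original code, a unidirectional LSTM can be run over the text matrix and its output at the last position taken as the mask, mirroring mask = h_n:

```python
import torch
import torch.nn as nn

d = 8                                     # word-vector dimension (assumed)
mask_lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

x = torch.randn(1, 4, d)                  # text matrix from step S1: (batch, n, d)

# Run the LSTM over the whole text; its output at the last position plays
# the role of mask = h_n, one importance weight per word-vector component.
outputs, _ = mask_lstm(x)                 # (1, n, d)
mask = outputs[:, -1, :]                  # (1, d)
```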
Step S3: rebuild the text matrix. Step S1 produced the text matrix formed by concatenating the word vectors, and Step S2 produced the mask indicating the importance of each word vector component in the current text. We consider each component of a word vector to represent a different aspect of the word's meaning, and the meaning of the same word may differ greatly between texts. The word vectors are therefore reconstructed as the element-wise product of the mask and each word vector. The mask indicates which aspects of word meaning the current text emphasizes: components suited to the current text are amplified, and components not expressed in the text are reduced.
The reconstructed word vectors are concatenated again to form the text matrix of rebuilt word vectors.
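A minimal sketch of the rebuilding step, assuming the shapes used in the previous sketches; each word vector is reweighted component by component with the mask:

```python
import torch

x = torch.randn(1, 4, 8)           # text matrix from step S1: (batch, n, d)
mask = torch.randn(1, 8)           # component-importance mask from step S2: (batch, d)

# Element-wise product of the mask with every word vector: components that
# matter in this text are amplified, the others are reduced.
x_rebuilt = x * mask.unsqueeze(1)  # broadcast over the n words -> (1, n, d)
```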
Step S4: a recurrent neural network extracts contextual information. The text matrix of rebuilt word vectors obtained in Step S3 is processed by a recurrent neural network to obtain the feature representation of each word vector in the text. A unidirectional long short-term memory network can extract forward information but cannot extract backward information well, so a bidirectional long short-term memory network is used to extract the contextual information of the text. A bidirectional long short-term memory network has memory units in two directions, extracting forward and backward text information respectively. The output vectors of the forward and backward units are concatenated as the feature representation of each word vector in the text:
$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$
where $\overrightarrow{h_t}$ is the output of the forward long short-term memory network at time t and $\overleftarrow{h_t}$ is the output of the backward long short-term memory network at time t; the two vectors are concatenated directly to form the feature representation of the word vector at time t.
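For illustration only (variable names and sizes are assumptions of this sketch), a bidirectional LSTM over the rebuilt text matrix yields, for each word, the concatenation of the forward and backward outputs:

```python
import torch
import torch.nn as nn

d = 8
bilstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True, bidirectional=True)

x_rebuilt = torch.randn(1, 4, d)   # rebuilt text matrix from step S3

# For every position t, PyTorch concatenates the forward and backward outputs,
# giving the per-word feature h_t = [h_t(forward); h_t(backward)] of size 2*d.
h, _ = bilstm(x_rebuilt)           # (1, n, 2*d)
```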
Step S5: a feedforward neural network computes an error score for each word. Step S4 produced the feature representation of each word vector in the text after processing by the recurrent neural network. The error score of each word is computed from the feature representation of its word vector:
$score_t = h_t \cdot W$
where W is the weight used to compute the score. Computing the scores of all words in the text yields the error scores of all words at the text level.
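A minimal sketch of the scoring layer, with a single linear layer standing in for the feedforward network (names and sizes are assumptions): score_t = h_t · W applied to every word.

```python
import torch
import torch.nn as nn

h = torch.randn(1, 4, 16)              # per-word features from the bidirectional LSTM: (batch, n, 2*d)

# One linear layer implements score_t = h_t . W, giving one error score per word.
scorer = nn.Linear(16, 1, bias=False)
scores = scorer(h).squeeze(-1)         # (1, n)
```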
Step S6: infer error positions from the error scores. Taking the two labels correct and erroneous as an example, correct is labeled 0 and erroneous is labeled 1. The larger a word's error score, the more likely that word is erroneous. Error positions in the text are judged from the scores, and error information is provided.
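As an illustration only (the 0.5 threshold is an assumption of this sketch, not part of the patent), the positions whose error score exceeds a threshold can be reported as erroneous words:

```python
import torch

scores = torch.tensor([[0.1, 0.8, 0.2, 0.6]])   # per-word error scores from step S5

# Binary labels: 0 = correct, 1 = erroneous; larger scores mean more likely erroneous.
labels = (scores > 0.5).long()                  # (1, n)
error_positions = labels.nonzero(as_tuple=False)[:, 1].tolist()
print(error_positions)                          # [1, 3]
```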
The specific embodiments of the proposed Chinese grammar error detection method based on text-contextualized word vectors and of its modules have been described above with reference to the accompanying drawings. From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with a necessary general-purpose hardware platform.
According to the idea of the present invention, changes may be made to the specific implementation and the scope of application. In conclusion, the content of this description should not be construed as limiting the invention.
The embodiments described above do not limit the protection scope of the invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (10)

1. A Chinese grammar error detection method based on text-contextualized word vectors, characterized in that the method comprises the following structure and steps:
(1) word-vectorizing the input text: mapping the words of the input text, converting each word into its corresponding word vector, and thereby numerically converting the segmented input text into a text matrix formed by concatenating the word vectors of the individual words;
(2) forming, with a recurrent neural network, a mask indicating the importance of each word vector component: processing the text matrix obtained in step (1), where the recurrent neural network processes contextual information, to obtain a mask indicating the importance of each word vector component in the current text;
(3) rebuilding the text matrix: processing the word vectors obtained in step (1) with the mask obtained in step (2), which indicates the importance of each word vector component in the current text, to obtain the text matrix formed from the rebuilt word vectors;
(4) extracting contextual information with a recurrent neural network: processing the text matrix of rebuilt word vectors obtained in step (3), where the recurrent neural network extracts contextual information, to obtain the feature representation of each word vector in the text;
(5) computing an error score for each word with a feedforward neural network: processing the feature representation of each word vector in the text obtained in step (4), the feature representation being passed through a feedforward neural network to obtain the error score of each word in the text;
(6) inferring error positions from the error scores: processing the error scores obtained in step (5) and drawing inferences from the error scores of the individual words at the level of the whole text, to obtain information on the erroneous words.
2. The method as described in claim 1, characterized in that step (1) specifically comprises:
(1.1) initializing the mapping dictionary from words to word indices and the word vector matrix;
(1.2) mapping each word to its corresponding word index via the mapping dictionary;
(1.3) obtaining the corresponding word vector from the word vector matrix via the word index of each word;
(1.4) concatenating the word vectors to obtain the text matrix formed by concatenating the word vectors of the individual words.
3. The method as described in claim 1, characterized in that step (2) specifically comprises:
(2.1) initializing the parameters of the recurrent neural network;
(2.2) processing the text matrix obtained in step (1) with the recurrent neural units to obtain an output matrix;
(2.3) processing the output matrix to obtain the mask indicating the importance of each word vector component in the current text.
4. The method as described in claim 1, characterized in that the recurrent neural network of step (2) is a long short-term memory network.
5. The method as described in claim 3, characterized in that step (2.3) specifically comprises:
processing the output matrix obtained in step (2.2) and taking the last-layer output of the output matrix as the mask indicating the importance of each word vector component in the current text.
6. The method as described in claim 1, characterized in that step (3) specifically comprises:
taking the element-wise product of the mask obtained in step (2), which indicates the importance of each word vector component in the current text, with each word vector in the text matrix obtained in step (1), to obtain the text matrix formed from the rebuilt word vectors.
7. The method as described in claim 1, characterized in that step (4) specifically comprises:
(3.1) initializing the parameters of the recurrent neural network;
(3.2) processing the text matrix of rebuilt word vectors obtained in step (3) with the recurrent neural network to obtain the feature representation of each word vector in the text.
8. The method as described in claim 1, characterized in that the recurrent neural network of step (4) is a bidirectional long short-term memory network.
9. The method as described in claim 1, characterized in that step (5) specifically comprises:
(4.1) initializing the parameters of the feedforward neural network;
(4.2) inputting the feature representation of each word vector in the text obtained in step (4) into the feedforward neural network, to obtain the error score of each word in the text.
10. The method as described in claim 1, characterized in that step (6) specifically comprises:
processing the error scores of the words in the text obtained in step (5); at the level of the whole text, the word with the highest error score is considered erroneous, yielding information on the erroneous word.
CN201810735068.2A 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors Active CN108959260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810735068.2A CN108959260B (en) 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810735068.2A CN108959260B (en) 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors

Publications (2)

Publication Number Publication Date
CN108959260A true CN108959260A (en) 2018-12-07
CN108959260B CN108959260B (en) 2019-05-28

Family

ID=64486041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810735068.2A Active CN108959260B (en) 2018-07-06 2018-07-06 A Chinese grammar error detection method based on text-contextualized word vectors

Country Status (1)

Country Link
CN (1) CN108959260B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377882A (en) * 2019-07-17 2019-10-25 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for determining the pinyin of a text
CN110889284A (en) * 2019-12-04 2020-03-17 成都中科云集信息技术有限公司 Multi-task learning Chinese language disease diagnosis method based on bidirectional long-time and short-time memory network
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111950292A (en) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method of text error correction model, and text error correction processing method and device
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150248608A1 (en) * 2014-02-28 2015-09-03 Educational Testing Service Deep Convolutional Neural Networks for Automated Scoring of Constructed Responses
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEILIANG REN ET AL: "A dynamic weighted method with support vector machines for Chinese word segmentation", 《2005 INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING》 *
陈珂 et al.: "Sentiment analysis of Chinese microblogs based on multi-channel convolutional neural networks" (基于多通道卷积神经网络的中文微博情感分析), Journal of Computer Research and Development (《计算机研究与发展》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377882A (en) * 2019-07-17 2019-10-25 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for determining the pinyin of a text
CN110377882B (en) * 2019-07-17 2023-06-09 标贝(深圳)科技有限公司 Method, apparatus, system and storage medium for determining pinyin of text
CN110889284A (en) * 2019-12-04 2020-03-17 成都中科云集信息技术有限公司 Multi-task learning Chinese language disease diagnosis method based on bidirectional long-time and short-time memory network
CN110889284B (en) * 2019-12-04 2023-04-07 成都中科云集信息技术有限公司 Multi-task learning Chinese language sickness diagnosis method based on bidirectional long-time and short-time memory network
US20210216725A1 (en) * 2020-01-14 2021-07-15 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
US11775776B2 (en) * 2020-01-14 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing information
CN111950292A (en) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Training method of text error correction model, and text error correction processing method and device
CN111950292B (en) * 2020-06-22 2023-06-27 北京百度网讯科技有限公司 Training method of text error correction model, text error correction processing method and device
CN111767718A (en) * 2020-07-03 2020-10-13 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN111767718B (en) * 2020-07-03 2021-12-07 北京邮电大学 Chinese grammar error correction method based on weakened grammar error feature representation
CN112364631A (en) * 2020-09-21 2021-02-12 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning
CN112364631B (en) * 2020-09-21 2022-08-02 山东财经大学 Chinese grammar error detection method and system based on hierarchical multitask learning

Also Published As

Publication number Publication date
CN108959260B (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN108959260B (en) A Chinese grammar error detection method based on text-contextualized word vectors
Song et al. Mass: Masked sequence to sequence pre-training for language generation
CN107291693B (en) Semantic calculation method for improved word vector model
CN108984525B (en) A kind of Chinese grammer error-detecting method based on the term vector that text information is added
CN106294322A (en) A kind of Chinese based on LSTM zero reference resolution method
CN110555083B (en) Non-supervision entity relationship extraction method based on zero-shot
CN110334219A (en) The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
CN109697285A (en) Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN109726389A (en) A kind of Chinese missing pronoun complementing method based on common sense and reasoning
CN109299462A (en) Short text similarity calculating method based on multidimensional convolution feature
CN113553440B (en) Medical entity relationship extraction method based on hierarchical reasoning
CN109492223A (en) A kind of Chinese missing pronoun complementing method based on ANN Reasoning
CN110175221A (en) Utilize the refuse messages recognition methods of term vector combination machine learning
CN108345583A (en) Event recognition and sorting technique based on multi-lingual attention mechanism and device
CN109766553A (en) A kind of Chinese word cutting method of the capsule model combined based on more regularizations
Zhang et al. Multi-gram CNN-based self-attention model for relation classification
Narayan et al. Neural network based parts of speech tagger for hindi
CN114254645A (en) Artificial intelligence auxiliary writing system
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN108932222A (en) A kind of method and device obtaining the word degree of correlation
CN111723572B (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN116702760A (en) Geographic naming entity error correction method based on pre-training deep learning
pal Singh et al. Naive Bayes classifier for word sense disambiguation of Punjabi language
CN115659981A (en) Named entity recognition method based on neural network model
CN113177120B (en) Quick information reorganizing method based on Chinese text classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant