CN106294322A - A Chinese zero anaphora resolution method based on LSTM - Google Patents

A Chinese zero anaphora resolution method based on LSTM

Info

Publication number
CN106294322A
CN106294322A
Authority
CN
China
Prior art keywords
lstm
word
zero
layer
extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610633621.2A
Other languages
Chinese (zh)
Inventor
赵铁军 (Zhao Tiejun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201610633621.2A
Publication of CN106294322A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A Chinese zero anaphora resolution method based on LSTM. The present invention relates to a Chinese zero anaphora resolution method based on LSTM, and aims to solve the problems that existing methods achieve low accuracy on the Chinese zero anaphora resolution task and understand semantic information poorly. Step 1: each word in the existing text data is preprocessed, and the word2vec tool is used to train on each word of the processed text data, yielding a word vector dictionary. Step 2: an antecedent candidate set is selected for each zero anaphor. Step 3: if a candidate phrase in the antecedent candidate set of the current zero anaphor is the true antecedent of the zero anaphor, the corresponding training sample is a positive example; otherwise it is a negative example. Step 4: a Dropout layer is followed by a logistic regression layer whose output represents the probability that the model input sample is judged a positive example, and this value serves as the output of the model. The present invention is applicable to the field of natural language processing.

Description

A Chinese zero anaphora resolution method based on LSTM
Technical field
The present invention relates to a Chinese zero anaphora resolution method based on LSTM.
Background technology
Anaphora means that a referring expression in a discourse points back to a linguistic unit mentioned earlier. In linguistics, the referring expression is called the anaphor, and the object or content it refers to is called the antecedent. Anaphora is a rhetorical phenomenon in which the same word, the same person, or the same thing is mentioned again and again within a passage or a discourse. Anaphora resolution is precisely the process of determining the relation between an anaphor and its antecedent, and is one of the key problems of natural language processing. In natural language, parts whose referents the reader can infer from context are often omitted; the omitted part still serves a syntactic role in the sentence and refers back to a linguistic unit mentioned earlier. This phenomenon is called zero anaphora, i.e., the position where a referring expression should appear is instead filled by a zero pronoun. For example, in a sentence like "When Xiao Qin was nine, (she) was cooking lunch at noon, (she) heard her mother humming pleasantly, (she) stood by the table and listened for a while, and (she) even forgot about the cooking," every omitted subject refers to "Xiao Qin," yet no overt personal pronoun appears; zero anaphora is used throughout, but the understanding of the full sentence is unaffected.
In East Asian languages such as Chinese, up to 36% of syntactic constituents may be omitted, which shows how pervasive zero anaphora is in Chinese. The prevalence of zero anaphora makes research in many areas of Chinese language processing difficult. For example, in machine translation, when the meaning represented by the omitted part is unknown, the Chinese sentence cannot be translated correctly into the target language. Research on Chinese zero anaphora is therefore one of the key and hot topics of natural language processing and is of great importance for natural-language text understanding. Within a discourse, much information is often omitted to keep the text concise; human readers recover it from context, but machines cannot understand the omitted parts, so a method is needed to recover the omitted information from the text. Research on Chinese zero anaphora was proposed precisely to solve this kind of problem. It not only plays an important role in information extraction, but is also crucial in applications such as machine translation and text classification.
Early research on zero anaphora mainly resolved it with logical rules built from the syntactic features of the language; representative approaches include centering theory and syntax-based methods. The main problems of such methods are that the representations and procedures are very difficult to build and they require substantial manual intervention, while the resulting systems have poor portability and a low degree of automation. Machine learning methods were therefore applied to anaphora resolution, such as decision trees, SVMs, and tree-kernel methods. However, methods based on syntactic feature vectors or syntax-tree structures have struggled to further improve the accuracy of zero anaphora resolution. With the rise and development of deep learning research, word vectors are increasingly used in natural language processing and have achieved good results; the "word vector" representation of words is a core technique that deep learning has brought to the NLP field. Using word vectors and neural network methods to tackle zero anaphora resolution has thus become a necessary attempt and innovation.
At present, methods for Chinese zero anaphora resolution fall into three classes:
(1) Chinese zero anaphora resolution is treated as a binary classification task. For each zero-anaphor position in a sentence, its antecedent candidate set is first determined by rules; features are then extracted from the complete syntax tree according to designed feature templates to obtain positive and negative training samples, and a binary classifier is trained to perform Chinese zero anaphora resolution.
(2) The problem is likewise treated as binary classification. The zero-anaphor positions, antecedent candidates, and positive/negative labels are first determined on the complete syntax tree; the subtree containing the zero-anaphor position and the antecedent candidate is extracted, and, following the tree-kernel principle, the SVM-TK tool is used to train a binary classifier for zero anaphora resolution.
(3) Unsupervised methods. Many unsupervised methods have also been applied to the Chinese zero anaphora resolution problem, such as ranking models, integer linear programming models, and probabilistic models.
The traditional methods above only exploit the syntactic information of the context around the zero-anaphor position in the sentence and do not exploit its semantic information, which leads to low accuracy on the Chinese zero anaphora resolution task and poor understanding of semantic information.
Summary of the invention
The purpose of the invention is to overcome the shortcomings of existing methods, namely low accuracy on the Chinese zero anaphora resolution task and poor understanding of semantic information, by proposing a Chinese zero anaphora resolution method based on LSTM.
The above object of the invention is achieved through the following technical solution:
Step 1: each word in the existing text data is preprocessed, and the word2vec tool is used to train on each word of the processed text data, yielding a word vector dictionary in which every word corresponds to a word vector;
Step 2: the Chinese data in the OntoNotes 5.0 corpus are used; in these data, the zero anaphors of each sentence and their antecedents are explicitly annotated. Each sentence with annotated zero-anaphor positions is first converted into a complete syntax tree with a syntactic parser; in the complete syntax tree, among all NP nodes that appear before the zero-anaphor position, the maximal NP nodes and the modifier NP nodes are chosen as the antecedent candidate set of that zero anaphor;
NP denotes a noun phrase;
Step 3: keywords are extracted from the part of the sentence that follows the zero-anaphor position, and each noun phrase in the antecedent candidate set of the zero anaphor forms one training sample with those keywords; if the candidate phrase in the antecedent candidate set of the current zero anaphor is the true antecedent of the zero anaphor, the training sample is a positive example, otherwise a negative example;
Step 4: all words in the positive and negative samples form a word dictionary, and each word is assigned an id label; every word in the positive and negative samples is replaced with its id label, producing the word sequence that serves as the model input. The input word sequence is connected to an Embedding layer, which converts the input id labels into word vectors; all word vectors of the Embedding layer are initialized with the word vector dictionary obtained in Step 1. The Embedding layer is connected to a bidirectional LSTM layer; the outputs of the bidirectional LSTM layer at each time step are concatenated and fed into a Dropout layer. The Dropout layer is connected to a logistic regression layer, which outputs a value between 0 and 1 representing the probability that the model input sample is judged a positive example; this value serves as the output of the model;
The Embedding layer is an embedding layer; LSTM stands for long short-term memory.
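The word-dictionary and id-label replacement of Step 4 can be sketched as follows; the sample word lists and the choice of reserving id 0 for the "*" filler symbol are illustrative assumptions, not taken from the patent:

```python
def build_word_dict(samples):
    """Assign an integer id label to every word seen in the training samples.
    Id 0 is reserved for the '*' filler so it can map to a zero vector later."""
    word_dict = {"*": 0}
    for sample in samples:
        for word in sample:
            if word not in word_dict:
                word_dict[word] = len(word_dict)
    return word_dict

def to_id_sequence(sample, word_dict):
    """Replace each word in a sample with its id label (the model input)."""
    return [word_dict[w] for w in sample]

# Two toy samples in the [candidate words + keywords] form used by the method.
samples = [["产品", "对外", "贸易", "占", "比重", "*"],
           ["*", "*", "中国", "占", "比重", "*"]]
word_dict = build_word_dict(samples)
seq = to_id_sequence(samples[0], word_dict)
```

The resulting id sequences are what the Embedding layer consumes; the Embedding row for id 0 stays a zero vector.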
Effects of the invention
The related research of the present invention not only provides evidence for theories in informatics and linguistics, but also advances natural language understanding. The present invention addresses the problem that traditional methods only exploit lexical and syntactic-structure information or statistical probability information and do not resolve the Chinese zero anaphora task at the level of semantic analysis, and it novelly proposes to perform this task with word vectors and an LSTM model. On the same data set, the F1-score of the present invention is 5.8% higher than that of traditional supervised methods and 2% higher than that of unsupervised methods. Word vectors trained on corpus data have been shown to carry specific structural and semantic information and are a good form of semantic representation. The present invention proposes a keyword extraction method that extracts, from the words following the zero-anaphor position in a sentence, those related to the antecedent, and forms one sample with each antecedent candidate; Chinese zero anaphora resolution is thus converted into a binary classification task, a bidirectional LSTM neural network structure suited to this classification problem is redesigned, and the binary classification model is obtained by training. With this model, performing Chinese zero anaphora resolution only requires converting a sentence into the corresponding input form and feeding it to the model to obtain the classification result. This overcomes the shortcoming of existing methods that only exploit the syntactic information of the context around the zero-anaphor position without exploiting its semantic information, leading to low accuracy on the Chinese zero anaphora resolution task and poor understanding of semantic information; by taking semantic information into account, the present invention improves both the accuracy of the Chinese zero anaphora resolution task and the accuracy of semantic understanding.
The present invention proposes a keyword extraction method: the related nouns and verbs are extracted from the text that follows the zero-anaphor position in the sentence, and a keyword-length parameter is set; if more keywords are extracted than this parameter allows, they are trimmed, otherwise they are padded. For the antecedent candidate phrases of the zero anaphor, since the number of words in a phrase is not fixed, a word-count parameter is likewise set and the phrase is trimmed or padded accordingly.
The present invention uses word vectors as a form of semantic representation and uses a bidirectional LSTM neural network to model semantic relations. By semantically modeling the antecedent candidate phrase together with the keywords following the zero anaphor in the sentence, the semantic relation between the two is discovered, so that Chinese zero anaphora resolution can be performed better at the semantic level.
The present invention uses a bidirectional LSTM network. The word vector dictionary obtained by training is used to initialize the Embedding layer parameters in the LSTM network. The bidirectional LSTM layer consists of a forward LSTM layer and a backward LSTM layer; the outputs of these two LSTM layers at each time step serve as the input of a logistic regression layer, and the output of the logistic regression layer finally serves as the output of the binary classification model.
The flow and effect of the invention are illustrated with the sentence "China's foreign trade in mechanical and electrical products continues to increase; *pro* accounts for a share of total imports and exports that continues to rise." Here "*pro*" marks the position where the zero anaphor occurs. The sentence is converted into a complete syntax tree, and from the NP nodes appearing before "*pro*" the antecedent candidate phrases of the zero anaphor are determined as: "China's foreign trade in mechanical and electrical products", "China", and "mechanical and electrical products". The true antecedent of "*pro*" is "China's foreign trade in mechanical and electrical products". Following the keyword extraction rules, the keywords extracted from the text after the zero-anaphor position "*pro*" are: "accounts for", "imports and exports", "share", "continues", "rise". The maximum number of keywords is set to 6 and the maximum number of words in an antecedent candidate phrase to 3; where there are not enough words, the symbol "*" is used as filler. Three samples are obtained: [products foreign trade accounts-for imports-exports share continues rise *], [* * China accounts-for imports-exports share continues rise *] and [* mechanical-electrical products accounts-for imports-exports share continues rise *]. The words in these samples are replaced with word ids via the word dictionary and then input into the trained binary classification model based on the bidirectional LSTM. The model classifies [products foreign trade accounts-for imports-exports share continues rise *] as a positive example and the other two samples as negative examples, recognizing "China's foreign trade in mechanical and electrical products" as the true antecedent of "*pro*".
Accompanying drawing explanation
Fig. 1 is the overall flowchart of Chinese zero anaphora resolution based on a bidirectional LSTM;
Fig. 2 is the structure diagram of the bidirectional LSTM network proposed in Embodiment 1;
Fig. 3 is the structure diagram of a conventional network;
Fig. 4 is the structure diagram of a dropout network.
Detailed description of the invention
Embodiment 1: this embodiment is described with reference to Fig. 1. A Chinese zero anaphora resolution method based on word vectors and a bidirectional LSTM according to this embodiment is specifically carried out according to the following steps:
Step 1: each word in the existing text data is simply preprocessed, and the word2vec tool is used to train on each word of the processed text data (word2vec is an open-source tool that converts the words of pre-segmented text into corresponding vectors through its internal model), yielding a word vector dictionary in which every word corresponds to a word vector;
Step 2: the Chinese portion of the OntoNotes 5.0 corpus is used; in this portion, the zero anaphors of each sentence and their antecedents are explicitly annotated. Each sentence text with annotated zero-anaphor positions is first converted into a complete syntax tree with a syntactic parser (a tool that converts a sentence into tree form, e.g. the Stanford Parser). In the complete syntax tree, among all NP (noun phrase) nodes appearing before the zero-anaphor position, the maximal NP nodes (those with no NP node among their ancestors) and the modifier NP nodes (those whose parent is an NP node and whose right sibling is also an NP node) are chosen as the antecedent candidate set of the zero anaphor;
NP denotes a noun phrase;
Step 3: keywords are extracted from the part of the sentence after the zero-anaphor position (from the zero-anaphor position to the end of the sentence), and each noun phrase NP in the antecedent candidate set of the zero anaphor forms one training sample with those keywords; if the candidate phrase in the antecedent candidate set of the current zero anaphor is the true antecedent of the zero anaphor, the training sample is a positive example, otherwise a negative example;
Step 4: all words in the positive and negative samples form a word dictionary, and each word is assigned an id label; every word in the positive and negative samples is replaced with its id label, producing the word sequence that serves as the model input. The input word sequence is connected to an Embedding layer, which converts the input id labels into word vectors; all word vector parameters of the Embedding layer are initialized with the word vector dictionary obtained in Step 1. The Embedding layer is connected to a bidirectional LSTM layer, which is used to extract features; the outputs of the bidirectional LSTM layer at each time step are concatenated and fed into a Dropout layer. The Dropout layer is connected to a logistic regression layer, which outputs a value between 0 and 1 representing the probability that the model input sample is judged a positive example; this value serves as the output of the model;
The Embedding layer is an embedding layer; LSTM stands for long short-term memory; the Dropout layer is a special network structure: during model training, the dropout layer randomly disables a certain proportion of the hidden units, as shown in Fig. 3 and Fig. 4 (Fig. 3 is the structure diagram of a conventional network and Fig. 4 that of a dropout network);
Dropout means that during model training the weights of some hidden-layer nodes of the network are randomly made inactive. The inactive nodes can temporarily be regarded as not being part of the network structure, but their weights must be kept (they are merely not updated), because they may be active again when the next sample is input (this is somewhat abstract; see the experimental part below for the concrete implementation). Dropout may be considered a special kind of network structure.
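The dropout behavior described above can be sketched as follows; the mask-based formulation and the fixed seed are illustrative assumptions (the patent does not specify an implementation, and the common inverted-scaling trick is omitted for simplicity):

```python
import random

def dropout_mask(n_units, p=0.5, seed=42):
    """During training, randomly disable a proportion p of hidden units.
    Returns a 0/1 mask; 0 means the unit does not participate in this pass."""
    rng = random.Random(seed)
    return [0 if rng.random() < p else 1 for _ in range(n_units)]

def apply_dropout(activations, mask):
    """Dropped units output 0; their weights are kept, just not used/updated."""
    return [a * m for a, m in zip(activations, mask)]

mask = dropout_mask(8, p=0.5, seed=42)
out = apply_dropout([1.0] * 8, mask)
```

At test time no mask is applied, so the full connectivity is restored, matching the description in the experimental part below.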
Embodiment 2: this embodiment differs from Embodiment 1 in that the simple preprocessing of the existing text data in Step 1 is: a word segmentation program is used to segment the sentences of the existing text data, and special characters are removed so that only Chinese characters, English and punctuation are retained (special characters include Greek letters, Russian letters, phonetic notation symbols, special symbols, etc.).
Embodiment 3: this embodiment differs from Embodiments 1 and 2 in that the antecedent candidate set in Step 2 is processed as follows:
The maximum number of words in an antecedent candidate is set to n, 1 ≤ n ≤ maxW, where maxW denotes the maximum number of words in a sentence;
If the number of words in an antecedent candidate is less than n, the symbol * is used as filler until the number of words equals n;
If the number of words in an antecedent candidate is greater than n, only the last n words are kept;
At the stage where words are mapped to word vectors, * is mapped to the zero vector.
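Under the assumption, suggested by the worked example above (where the one-word candidate "China" becomes [* * China]), that the filler is applied on the left, the candidate processing can be sketched as:

```python
def fit_candidate(words, n):
    """Left-pad an antecedent candidate with '*' up to n words, or keep only
    its last n words if it is too long. Left padding is an assumption drawn
    from the worked example; the rule text does not fix the direction."""
    if len(words) < n:
        return ["*"] * (n - len(words)) + words
    return words[-n:]

padded = fit_candidate(["中国"], 3)              # too short: pad
trimmed = fit_candidate(["甲", "乙", "丙", "丁"], 3)  # too long: keep last 3
```

Keeping the last n words preserves the head of a Chinese noun phrase, which is typically phrase-final; this is consistent with the right-aligned padding.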
Embodiment 4: this embodiment differs from Embodiments 1 to 3 in that keywords are extracted in Step 3 from the part of the sentence after the zero-anaphor position (from the zero-anaphor position to the end of the sentence); the detailed procedure is:
The maximum number of keywords is set to m, 1 ≤ m ≤ maxW, where maxW denotes the maximum number of words in a sentence; the keyword extraction rule is: extract the nouns and verbs in the sentence;
If the total number of extracted words is less than m, the symbol * is used as filler until m words are reached;
If the total number of extracted words equals m, no extra processing is needed;
If the total number of extracted words is greater than m, the extracted words are trimmed: first the modifier nouns are deleted and the total number of remaining words is computed; if it equals m, no extra processing is needed, and if it is less than m, the symbol * is used as filler until m words are reached;
If the total number of extracted words is still greater than m, the nouns other than the modifier nouns are deleted next and the total number of remaining words is computed; if it is less than m, the symbol * is used as filler until m words are reached, and if it equals m, no extra processing is needed; if it is still greater than m, the verbs are deleted next and the total number of words remaining after deleting the verbs is computed; if it is less than m, the symbol * is used as filler until m words are reached, and if it equals m, no extra processing is needed.
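The padding/trimming cascade can be sketched as follows; the POS tag names ("mod_noun", "noun", "verb") and the sample tagged words are illustrative assumptions, and trimming deletes one whole category at a time in the order the rules give:

```python
def extract_keywords(tagged_words, m):
    """Keep nouns and verbs from the text after the zero-anaphor position,
    then trim whole categories (modifier nouns, then other nouns, then verbs)
    while more than m words remain, and finally pad with '*' up to m words."""
    kept = [(w, t) for w, t in tagged_words if t in ("mod_noun", "noun", "verb")]
    for tag in ("mod_noun", "noun", "verb"):
        if len(kept) > m:
            kept = [(w, t) for w, t in kept if t != tag]
    words = [w for w, _ in kept]
    return words + ["*"] * (m - len(words)) if len(words) < m else words[:m]

tagged = [("占", "verb"), ("进出口", "noun"), ("总额", "mod_noun"),
          ("比重", "noun"), ("继续", "verb"), ("上升", "verb"), ("的", "other")]
k6 = extract_keywords(tagged, 6)   # fits exactly, no trimming
k5 = extract_keywords(tagged, 5)   # one over: modifier noun dropped
k8 = extract_keywords(tagged, 8)   # too few: padded with '*'
```

Deleting a whole category can overshoot below m, which is why the rules follow each deletion with a padding step; the function mirrors that behavior.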
Embodiment 5: this embodiment differs from Embodiments 1 to 4 in that the bidirectional LSTM layer in Step 4 consists of a forward LSTM layer and a backward LSTM layer, as shown in Fig. 2; the role of the LSTM layers is to extract features from the input keyword sequence;
All words in a positive or negative sample are input in forward order into the forward LSTM layer and in reverse order into the backward LSTM layer; the bidirectional LSTM layers are used to store the information of the two input directions separately. In theory this enables the model, when processing the data of the current time step, to use the contextual information of the whole sequence; finally the outputs of the two LSTM layers at each time step are concatenated. The design of the three gates and of the independent memory cell gives the LSTM unit the ability to store, read, reset and update long-distance historical information.
Embodiment 6: this embodiment differs from Embodiments 1 to 5 in that the LSTM layer consists of LSTM units, with one LSTM unit per time step. An LSTM unit receives one word vector at each time step and outputs one value; the output values of all time steps are concatenated (the concatenation of two vectors can be regarded as appending the second vector to the end of the first, merging them into a new vector) to obtain a feature vector, which is fed into the Dropout layer. The Dropout layer is connected to the logistic regression layer, which outputs a value between 0 and 1 representing the probability that the input sample is judged a positive example; this value serves as the output of the model.
Embodiment 7: this embodiment differs from Embodiments 1 to 6 in that the LSTM layer consists of LSTM units, with one LSTM unit per time step; an LSTM unit receives one word vector at each time step and outputs one value. The detailed procedure is:
The LSTM unit uses a specially designed memory cell to store historical information; the updating and use of the historical information are controlled by three gates: the input gate, the forget gate, and the output gate.
Let h denote the output data of the LSTM unit, c the memory cell value (with c̃ the candidate memory cell value), and x the input data of the LSTM unit;
(1) The candidate memory cell value at the current time step is computed as in a traditional RNN, where W_xc and W_hc are the weight parameters of the LSTM unit's input x_t at the current time step and of the LSTM unit's output h_{t-1} at the previous time step, b_c is a bias parameter, and tanh is the activation function:
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
(2) The value i_t of the input gate is computed; the input gate controls the influence of the current input on the memory cell state value. The computation of every gate is influenced not only by the current input x_t and the previous LSTM unit output h_{t-1}, but also by the memory cell value c_{t-1} of the previous time step:
i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
where W_xi is the weight parameter of the current input x_t, W_hi the weight parameter of the previous output h_{t-1}, W_ci the weight parameter of the previous memory cell value c_{t-1}, and b_i a bias parameter; σ is the activation function;
(3) The value f_t of the forget gate is computed; the forget gate controls the influence of historical information on the current memory cell state value:
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
where W_xf is the weight parameter of the current input x_t, W_hf the weight parameter of the previous output h_{t-1}, W_cf the weight parameter of the previous memory cell value c_{t-1}, and b_f a bias parameter;
(4) The memory cell value c_t at the current time step is computed:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
where ⊙ denotes the pointwise product. The formula shows that the memory cell update depends on the memory cell value c_{t-1} of the previous time step and on the candidate memory cell value c̃_t of the current time step, and that these two contributions are regulated by the forget gate and the input gate respectively.
(5) The output gate o_t is computed; it controls the output of the memory cell state value:
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_{t-1} + b_o)
where W_xo is the weight parameter of the current input x_t, W_ho the weight parameter of the previous output h_{t-1}, W_co the weight parameter of the previous memory cell value c_{t-1}, and b_o a bias parameter;
(6) Finally, the output of the LSTM unit is:
h_t = o_t ⊙ tanh(c_t).
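Equations (1) to (6) can be checked with a minimal scalar LSTM step (one-dimensional states, so every weight W is a single number); the concrete weight values are arbitrary assumptions for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step with peephole terms W_c*, following equations (1)-(6).
    W maps parameter names to scalar weights and biases."""
    c_tilde = math.tanh(W["xc"] * x_t + W["hc"] * h_prev + W["bc"])                 # (1)
    i = sigmoid(W["xi"] * x_t + W["hi"] * h_prev + W["ci"] * c_prev + W["bi"])      # (2)
    f = sigmoid(W["xf"] * x_t + W["hf"] * h_prev + W["cf"] * c_prev + W["bf"])      # (3)
    c = f * c_prev + i * c_tilde                                                    # (4)
    o = sigmoid(W["xo"] * x_t + W["ho"] * h_prev + W["co"] * c_prev + W["bo"])      # (5)
    h = o * math.tanh(c)                                                            # (6)
    return h, c

W = {k: 0.5 for k in ("xc", "hc", "xi", "hi", "ci", "xf", "hf", "cf",
                      "xo", "ho", "co")}
W.update(bc=0.0, bi=0.0, bf=0.0, bo=0.0)
h1, c1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W)
```

With zero initial state, equation (4) reduces to c_1 = i_1 · c̃_1, which makes the step easy to verify by hand.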
Embodiment 8: this embodiment differs from Embodiments 1 to 7 in that σ is usually taken to be the logistic sigmoid function, whose values lie in the range (0, 1).
Embodiment 9: this embodiment differs from Embodiments 1 to 8 in that logistic regression performs the binary classification on the values output by the LSTM units. The output of the logistic regression layer is the probability that the model's input sample is predicted to be a positive example (the final output of the model proposed in this patent is exactly this probability value; the more accurate this probability, the better the model). The detailed procedure by which this value serves as the model output is:
The classification formula is:
p(y = 1 | x) = exp(w·x + b) / (1 + exp(w·x + b))
where x is the feature vector output by the dropout network, w is the weight vector, b is the bias, and y is the classification label, either the positive-example label or the negative-example label. The logistic regression p(y = 1 | x) computes the probability that y is the positive-example label given the input feature vector x, within the Chinese zero anaphora resolution framework based on the bidirectional LSTM model.
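The classification formula can be computed directly; the feature vector, weights and bias below are arbitrary illustrative values:

```python
import math

def positive_probability(x, w, b):
    """p(y=1|x) = exp(w·x+b) / (1+exp(w·x+b)): the logistic sigmoid of the
    score w·x+b for the feature vector x coming from the dropout layer."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.exp(score) / (1.0 + math.exp(score))

p = positive_probability(x=[0.2, -0.1, 0.4], w=[1.0, 2.0, -0.5], b=0.1)
```

Dividing numerator and denominator by exp(w·x+b) shows this is the standard sigmoid 1/(1+exp(-(w·x+b))), so the output always lies strictly between 0 and 1.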
To prevent the neural network from overfitting, the dropout technique is used. During model training, the dropout layer randomly deactivates a certain proportion of the hidden nodes (the proportion p is usually taken to be 0.5); the weights corresponding to the inactive nodes are not updated in the current training pass. When the model is used, however, all nodes are used and the full connectivity is restored. Overfitting is prevented through this mechanism.
The construction process of the whole LSTM binary classification network is: in the data preprocessing stage, the extracted keyword sequences are converted into word-label sequences using the word dictionary; these word-label sequences then serve as the input of the neural network and are connected to the Embedding layer, which converts the word label of each time step into a word vector and passes the sequence in forward order to the forward LSTM layer and in reverse order to the backward LSTM layer. The two LSTM layers each produce an output at every time step; these outputs are spliced horizontally (joined by a concatenate operation) and fed into the dropout layer; the output of the dropout layer is fed into the logistic regression classification layer, which finally outputs the classification probability.
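The splice-and-classify head at the end of this network can be sketched as follows; the per-time-step forward and backward outputs are made-up stand-ins for the two LSTM layers, and dropout is treated as the identity, as it is at test time:

```python
import math

def splice(forward_outputs, backward_outputs):
    """Horizontally concatenate the forward and backward LSTM outputs of each
    time step into a single feature vector."""
    feats = []
    for f, b in zip(forward_outputs, backward_outputs):
        feats.extend([f, b])
    return feats

def classify(features, w, b):
    """Logistic regression head on the (dropout-passed) feature vector."""
    score = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-score))

fwd = [0.1, 0.3, -0.2]   # assumed per-time-step forward LSTM outputs
bwd = [0.2, -0.1, 0.4]   # assumed per-time-step backward LSTM outputs
feats = splice(fwd, bwd)
prob = classify(feats, w=[0.5] * len(feats), b=0.0)
```

A sample is then labeled a positive example (true antecedent) when this probability exceeds a threshold such as 0.5.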
The following example verifies the beneficial effects of the present invention:
Example 1: this example is specifically carried out according to the following steps:
(1) Sample extraction. Sentences containing Chinese zero anaphors and their complete syntax trees are extracted from the OntoNotes 5.0 corpus. The antecedent candidate set is extracted from the complete syntax tree of each sentence. Each antecedent candidate phrase forms one sample with the zero anaphor; whether the sample is a positive or a negative example is determined by whether the candidate phrase is the true antecedent of the zero anaphor.
(2) Keyword extraction. With the keyword extraction strategy proposed by the present invention, the keywords from the zero-anaphor position to the end of the sentence and the keywords of the candidate phrase are extracted; finally these keywords are replaced with word labels according to the word dictionary.
(3) The positive and negative training samples are fed into the bidirectional LSTM model framework proposed by the present invention; after training, a Chinese zero anaphora resolution model is obtained.
(4) Finally, new test samples (obtained with the same method and from the same corpus) are fed into the model; the test data are obtained from the model's predictions and the true labels of the test samples.
The test results are as follows:

Precision    Recall    F1
50.7         50.7      50.7
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and variations according to the present invention, but all such changes and variations shall fall within the protection scope of the appended claims of the present invention.

Claims (9)

1. A Chinese zero anaphora resolution method based on LSTM, characterized in that the method is specifically carried out according to the following steps:
Step 1: each word in the existing text data is processed; the word2vec tool is used to train on each word of the processed text data, obtaining a word-vector dictionary in which every word corresponds to a word vector;
Step 2: the Chinese data of the OntoNotes 5.0 corpus are used; in these data the zero anaphors and the antecedents of each sentence are explicitly annotated. A sentence whose zero-anaphor position has been annotated is first converted into a complete syntax tree with a syntactic parsing tool; among all NP nodes of the complete syntax tree that occur before the zero-anaphor position, the maximal NP nodes and the modified NP nodes are chosen as the antecedent candidate set of that zero anaphor;
NP here denotes a noun phrase;
Step 3: keywords are extracted from the part of the sentence that occurs after the zero-anaphor position, and one training sample is formed with each noun phrase in the zero anaphor's antecedent candidate set; if the candidate phrase from the antecedent candidate set is the true antecedent of the zero anaphor, the training sample is a positive example, otherwise it is a negative example;
Step 4: all words of the positive and negative example samples form a word dictionary, and each word is assigned an id label; all words in the samples are replaced by their id labels, yielding word sequences that serve as the model input. The input word sequence is connected to an Embedding layer, which converts the input id labels into word vectors; the word-vector dictionary obtained in Step 1 is used to initialize all word vectors of the Embedding layer. The Embedding layer is connected to a bidirectional LSTM network layer; at each time step the output results of the bidirectional LSTM layer are concatenated and fed into a Dropout layer. The Dropout layer is connected to a logistic-regression layer, which outputs a value between 0 and 1 representing the probability that the model input sample is judged a positive example; this value is the output of the model;
The Embedding layer is an embedding layer; LSTM denotes the long short-term memory model.
2. The Chinese zero anaphora resolution method based on LSTM, characterized in that the processing of the existing text data in Step 1 is: a word-segmentation program is used to segment the sentences of the existing text data, and special characters are removed so that only Chinese characters, English and punctuation are retained.
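The cleanup step of this claim might be expressed as a regular-expression filter like the one below. The exact character classes retained (the CJK ideograph range, ASCII letters, and a handful of punctuation marks) are assumptions, since the claim only says Chinese characters, English and punctuation are kept:

```python
import re

# Characters to keep: CJK ideographs, ASCII letters, and common Chinese/
# English punctuation. Everything else counts as a special character.
_DROP = re.compile(r"[^\u4e00-\u9fffA-Za-z，。！？、；：,.!?;: ]")

def clean_sentence(s):
    """Remove special characters, retaining only Chinese, English and punctuation."""
    return _DROP.sub("", s)
```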
3. The Chinese zero anaphora resolution method based on LSTM, characterized in that the antecedent candidate set in Step 2 is processed as follows:
the maximum word count of an antecedent candidate is set to n, 1 ≤ n ≤ maxW, where maxW denotes the maximum word count of a sentence;
if the candidate's word count is less than n, it is padded with the symbol * until the word count equals n;
if the candidate's word count is greater than n, only the last n words are retained;
if the candidate's word count equals n, no processing is needed;
at the stage where words are mapped to word vectors, * is mapped to the zero vector.
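The length normalization of claim 3 is straightforward to express in code (a sketch; representing a candidate as a Python list of word strings is an illustrative choice):

```python
def normalize_candidate(words, n):
    """Pad with '*' up to n words, or keep only the last n words.
    At the vector-mapping stage '*' maps to the zero vector."""
    if len(words) < n:
        return words + ["*"] * (n - len(words))
    if len(words) > n:
        return words[-n:]          # keep only the last n words
    return words
```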
4. The Chinese zero anaphora resolution method based on LSTM, characterized in that in Step 3 keywords are extracted from the part of the sentence that occurs after the zero-anaphor position; the detailed process is:
the maximum number of keywords is set to m, 1 ≤ m ≤ maxW, where maxW denotes the maximum word count of a sentence; the keyword-extraction rule is: extract the nouns and the verbs of the sentence;
if the total number of extracted words is less than m, pad with the symbol * until m words are reached;
if the total number of extracted words equals m, do nothing;
if the total number of extracted words is greater than m, prune the extracted words: first delete the modified nouns and recount the extracted words; if the total now equals m, do nothing; if it is less than m, pad with the symbol * until m words are reached;
if the total is still greater than m, further delete the nouns other than the modified nouns and recount; if the total is less than m, pad with the symbol * until m words are reached; if it equals m, do nothing; if it is still greater than m, delete the verbs and recount; if the total is less than m, pad with the symbol * until m words are reached; if it equals m, do nothing.
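The pruning cascade of claim 4 (drop modified nouns, then the remaining nouns, then verbs, padding with `*` whenever the count falls to m or below) can be sketched as below. The part-of-speech tags and the test for a "modified noun" are hypothetical placeholders; the claim does not specify how they are computed:

```python
def extract_keywords(tagged, m, is_modified_noun=lambda w: False):
    """tagged: list of (word, pos) pairs, pos in {'n', 'v', ...}.
    Returns exactly m keywords, pruned in the order the claim gives."""
    words = [(w, p) for w, p in tagged if p in ("n", "v")]  # nouns and verbs

    def finish(ws):
        # pad with '*' up to m words (no-op when already m words long)
        return [w for w, _ in ws] + ["*"] * (m - len(ws))

    if len(words) <= m:
        return finish(words)
    # 1) delete the modified nouns
    words = [(w, p) for w, p in words if not (p == "n" and is_modified_noun(w))]
    if len(words) <= m:
        return finish(words)
    # 2) delete the remaining nouns
    words = [(w, p) for w, p in words if p != "n"]
    if len(words) <= m:
        return finish(words)
    # 3) delete the verbs
    words = [(w, p) for w, p in words if p != "v"]
    return finish(words)
```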
5. The Chinese zero anaphora resolution method based on LSTM, characterized in that the bidirectional LSTM network layer in Step 4 comprises a forward LSTM layer and a backward LSTM layer; all words of the positive and negative example samples are input to the forward LSTM layer in forward order and to the backward LSTM layer in reverse order; the two LSTM layers preserve the information of the two input directions respectively.
6. The Chinese zero anaphora resolution method based on LSTM, characterized in that the LSTM layer is composed of LSTM units, one LSTM unit per time step; at each time step the LSTM unit receives one word vector as input and outputs one value; the output values of all time steps are concatenated into a feature vector and fed into the Dropout layer; the Dropout layer is connected to the logistic-regression layer, which outputs a value between 0 and 1 representing the probability that the input sample is judged a positive example; this value is the output of the model.
7. The Chinese zero anaphora resolution method based on LSTM, characterized in that the LSTM layer is composed of LSTM units, one LSTM unit per time step; at each time step the LSTM unit receives one word vector as input and outputs one value; the detailed process is:
(1) The candidate memory value of the current time step is calculated according to the formula of a traditional RNN:
c̃_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
where W_xc and W_hc are the weight parameters of the LSTM unit's current input x_t and of the previous output h_{t-1} respectively, b_c is a bias parameter, and tanh is the activation function; RNN denotes a recurrent neural network;
(2) The value i_t of the input gate is calculated:
i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
where W_xi is the weight parameter of the current input x_t, W_hi is the weight parameter of the previous output h_{t-1}, W_ci is the weight parameter of the previous memory-cell value c_{t-1}, and b_i is a bias parameter; σ is the activation function;
(3) The value f_t of the forget gate is calculated:
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
where W_xf is the weight parameter of the current input x_t, W_hf is the weight parameter of the previous output h_{t-1}, W_cf is the weight parameter of the previous memory-cell value c_{t-1}, and b_f is a bias parameter;
(4) The memory-cell value c_t of the current time step is calculated:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
where ⊙ denotes the pointwise (element-wise) product;
(5) The output gate o_t is calculated:
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_{t-1} + b_o)
where W_xo is the weight parameter of the current input x_t, W_ho is the weight parameter of the previous output h_{t-1}, W_co is the weight parameter of the previous memory-cell value c_{t-1}, and b_o is a bias parameter;
(6) The output of the LSTM unit is
h_t = o_t ⊙ tanh(c_t).
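Taken together, the six equations of the claim translate directly into NumPy. The sketch below uses randomly shaped parameters for illustration (the peephole terms W_ci, W_cf, W_co are applied element-wise, an assumption; these are not the patent's trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM time step following equations (1)-(6); the dictionary P
    holds the weight/bias parameters named as in the text."""
    c_cand = np.tanh(P["Wxc"] @ x_t + P["Whc"] @ h_prev + P["bc"])                     # (1)
    i_t = sigmoid(P["Wxi"] @ x_t + P["Whi"] @ h_prev + P["Wci"] * c_prev + P["bi"])    # (2)
    f_t = sigmoid(P["Wxf"] @ x_t + P["Whf"] @ h_prev + P["Wcf"] * c_prev + P["bf"])    # (3)
    c_t = f_t * c_prev + i_t * c_cand                                                  # (4)
    o_t = sigmoid(P["Wxo"] @ x_t + P["Who"] @ h_prev + P["Wco"] * c_prev + P["bo"])    # (5)
    h_t = o_t * np.tanh(c_t)                                                           # (6)
    return h_t, c_t
```

Running the step over a word-vector sequence forwards and backwards, and concatenating the per-step h_t values, yields the bidirectional feature vector that is fed to the dropout layer.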
8. The Chinese zero anaphora resolution method based on LSTM, characterized in that the value of σ lies in the range 0 ≤ σ ≤ 1.
9. The Chinese zero anaphora resolution method based on LSTM, characterized in that the values output by the LSTM units are concatenated into a feature vector and fed into the Dropout layer, which is connected to the logistic-regression layer; logistic regression is used for binary classification; the logistic-regression layer outputs a value between 0 and 1 representing the probability that the input sample is judged a positive example, and this value serves as the output of the model; the detailed process is:
The classification formula is
p(y = 1 | x) = exp(w·x + b) / (1 + exp(w·x + b))
where x is the feature vector output by the dropout network, w is a weight vector, b is a bias term, and y is the classification label, which is either the positive-example label or the negative-example label; the logistic regression p(y = 1 | x) computes the probability that y is the positive-example label given that the input feature vector is x, within the Chinese zero anaphora resolution framework based on the bidirectional LSTM model.
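The classification formula above is the standard logistic sigmoid applied to a linear score; a minimal sketch:

```python
import math

def logistic_probability(w, x, b):
    """p(y=1 | x) = exp(w.x + b) / (1 + exp(w.x + b)): the probability
    that the input feature vector x is a positive example."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.exp(score) / (1.0 + math.exp(score))
```

A score of zero gives probability 0.5, and the output always lies strictly between 0 and 1, matching the range required of the model's output.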
CN201610633621.2A 2016-08-04 2016-08-04 A kind of Chinese based on LSTM zero reference resolution method Pending CN106294322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610633621.2A CN106294322A (en) 2016-08-04 2016-08-04 A kind of Chinese based on LSTM zero reference resolution method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610633621.2A CN106294322A (en) 2016-08-04 2016-08-04 A kind of Chinese based on LSTM zero reference resolution method

Publications (1)

Publication Number Publication Date
CN106294322A true CN106294322A (en) 2017-01-04

Family

ID=57664940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610633621.2A Pending CN106294322A (en) 2016-08-04 2016-08-04 A kind of Chinese based on LSTM zero reference resolution method

Country Status (1)

Country Link
CN (1) CN106294322A (en)

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919646A (en) * 2017-01-18 2017-07-04 南京云思创智信息科技有限公司 Chinese text summarization generation system and method
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN107203813A (en) * 2017-05-22 2017-09-26 成都准星云学科技有限公司 A kind of new default entity nomenclature and its system
CN107330032A (en) * 2017-06-26 2017-11-07 北京理工大学 A kind of implicit chapter relationship analysis method based on recurrent neural network
CN107679035A (en) * 2017-10-11 2018-02-09 石河子大学 A kind of information intent detection method, device, equipment and storage medium
CN107797989A (en) * 2017-10-16 2018-03-13 平安科技(深圳)有限公司 Enterprise name recognition methods, electronic equipment and computer-readable recording medium
CN108287817A (en) * 2017-05-08 2018-07-17 腾讯科技(深圳)有限公司 A kind of information processing method and equipment
CN108319666A (en) * 2018-01-19 2018-07-24 国网浙江省电力有限公司电力科学研究院 A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion
CN108566627A (en) * 2017-11-27 2018-09-21 浙江鹏信信息科技股份有限公司 A kind of method and system identifying fraud text message using deep learning
CN108595408A (en) * 2018-03-15 2018-09-28 中山大学 A kind of reference resolution method based on end-to-end neural network
CN108897896A (en) * 2018-07-13 2018-11-27 深圳追科技有限公司 Keyword abstraction method based on intensified learning
CN108959630A (en) * 2018-07-24 2018-12-07 电子科技大学 A kind of character attribute abstracting method towards English without structure text
CN109165386A (en) * 2017-08-30 2019-01-08 哈尔滨工业大学 A kind of Chinese empty anaphora resolution method and system
CN109446517A (en) * 2018-10-08 2019-03-08 平安科技(深圳)有限公司 Reference resolution method, electronic device and computer readable storage medium
CN109471919A (en) * 2018-11-15 2019-03-15 北京搜狗科技发展有限公司 Empty anaphora resolution method and device
CN109492223A (en) * 2018-11-06 2019-03-19 北京邮电大学 A kind of Chinese missing pronoun complementing method based on ANN Reasoning
CN109726389A (en) * 2018-11-13 2019-05-07 北京邮电大学 A kind of Chinese missing pronoun complementing method based on common sense and reasoning
CN109783801A (en) * 2018-12-14 2019-05-21 厦门快商通信息技术有限公司 A kind of electronic device, multi-tag classification method and storage medium
CN109885841A (en) * 2019-03-20 2019-06-14 苏州大学 Reference resolution method based on node representation
CN109948166A (en) * 2019-03-25 2019-06-28 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment
CN110019788A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 File classification method and device
CN110134944A (en) * 2019-04-08 2019-08-16 国家计算机网络与信息安全管理中心 A kind of reference resolution method based on intensified learning
CN110162600A (en) * 2019-05-20 2019-08-23 腾讯科技(深圳)有限公司 A kind of method of information processing, the method and device of conversational response
CN110321342A (en) * 2019-05-27 2019-10-11 平安科技(深圳)有限公司 Business valuation studies method, apparatus and storage medium based on intelligent characteristic selection
CN110377750A (en) * 2019-06-17 2019-10-25 北京百度网讯科技有限公司 Comment generates and comment generates model training method, device and storage medium
TWI685760B (en) * 2018-01-10 2020-02-21 威盛電子股份有限公司 Method for analyzing semantics of natural language
CN111488733A (en) * 2020-04-07 2020-08-04 苏州大学 Chinese zero-index resolution method and system based on Mask mechanism and twin network
CN111626042A (en) * 2020-05-28 2020-09-04 成都网安科技发展有限公司 Reference resolution method and device
CN111737949A (en) * 2020-07-22 2020-10-02 江西风向标教育科技有限公司 Topic content extraction method and device, readable storage medium and computer equipment
CN112256868A (en) * 2020-09-30 2021-01-22 华为技术有限公司 Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
WO2021164293A1 (en) * 2020-02-18 2021-08-26 平安科技(深圳)有限公司 Big-data-based zero anaphora resolution method and apparatus, and device and medium
CN114676709A (en) * 2022-04-11 2022-06-28 昆明理工大学 Chinese-Yue data enhancement method based on zero-pronoun completion
WO2023279921A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Neural network model training method, data processing method, and apparatuses
US11645465B2 (en) 2020-12-10 2023-05-09 International Business Machines Corporation Anaphora resolution for enhanced context switching

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298635A (en) * 2011-09-13 2011-12-28 苏州大学 Method and system for fusing event information


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WU, Bingbing et al.: "A Context-Aware Model Using Distributed Representations for Chinese Zero Pronoun Resolution" *
YIN, Qingyu et al.: "A Deep Neural Network for Chinese Zero Pronoun Resolution", arXiv:1604.05800 *
HU, Xinchen: "Research on Semantic Relation Classification Based on LSTM" (基于LSTM的语义关系分类研究), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919646A (en) * 2017-01-18 2017-07-04 南京云思创智信息科技有限公司 Chinese text summarization generation system and method
CN106919646B (en) * 2017-01-18 2020-06-09 南京云思创智信息科技有限公司 Chinese text abstract generating system and method
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN108287817B (en) * 2017-05-08 2020-08-11 腾讯科技(深圳)有限公司 Information processing method and device
CN108287817A (en) * 2017-05-08 2018-07-17 腾讯科技(深圳)有限公司 A kind of information processing method and equipment
CN107203813A (en) * 2017-05-22 2017-09-26 成都准星云学科技有限公司 A kind of new default entity nomenclature and its system
CN107330032A (en) * 2017-06-26 2017-11-07 北京理工大学 A kind of implicit chapter relationship analysis method based on recurrent neural network
CN109165386A (en) * 2017-08-30 2019-01-08 哈尔滨工业大学 A kind of Chinese empty anaphora resolution method and system
CN110019788A (en) * 2017-09-30 2019-07-16 北京国双科技有限公司 File classification method and device
CN107679035A (en) * 2017-10-11 2018-02-09 石河子大学 A kind of information intent detection method, device, equipment and storage medium
CN107679035B (en) * 2017-10-11 2020-06-12 石河子大学 Information intention detection method, device, equipment and storage medium
CN107797989A (en) * 2017-10-16 2018-03-13 平安科技(深圳)有限公司 Enterprise name recognition methods, electronic equipment and computer-readable recording medium
CN108566627A (en) * 2017-11-27 2018-09-21 浙江鹏信信息科技股份有限公司 A kind of method and system identifying fraud text message using deep learning
TWI685760B (en) * 2018-01-10 2020-02-21 威盛電子股份有限公司 Method for analyzing semantics of natural language
CN108319666A (en) * 2018-01-19 2018-07-24 国网浙江省电力有限公司电力科学研究院 A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion
CN108319666B (en) * 2018-01-19 2021-09-28 国网浙江省电力有限公司营销服务中心 Power supply service assessment method based on multi-modal public opinion analysis
CN108595408A (en) * 2018-03-15 2018-09-28 中山大学 A kind of reference resolution method based on end-to-end neural network
WO2020010955A1 (en) * 2018-07-13 2020-01-16 深圳追一科技有限公司 Keyword extraction method based on reinforcement learning, and computer device and storage medium
CN108897896A (en) * 2018-07-13 2018-11-27 深圳追科技有限公司 Keyword abstraction method based on intensified learning
CN108959630A (en) * 2018-07-24 2018-12-07 电子科技大学 A kind of character attribute abstracting method towards English without structure text
CN109446517B (en) * 2018-10-08 2022-07-05 平安科技(深圳)有限公司 Reference resolution method, electronic device and computer readable storage medium
CN109446517A (en) * 2018-10-08 2019-03-08 平安科技(深圳)有限公司 Reference resolution method, electronic device and computer readable storage medium
CN109492223A (en) * 2018-11-06 2019-03-19 北京邮电大学 A kind of Chinese missing pronoun complementing method based on ANN Reasoning
CN109492223B (en) * 2018-11-06 2020-08-04 北京邮电大学 Chinese missing pronoun completion method based on neural network reasoning
CN109726389A (en) * 2018-11-13 2019-05-07 北京邮电大学 A kind of Chinese missing pronoun complementing method based on common sense and reasoning
CN109726389B (en) * 2018-11-13 2020-10-13 北京邮电大学 Chinese missing pronoun completion method based on common sense and reasoning
CN109471919A (en) * 2018-11-15 2019-03-15 北京搜狗科技发展有限公司 Empty anaphora resolution method and device
CN109783801B (en) * 2018-12-14 2023-08-25 厦门快商通信息技术有限公司 Electronic device, multi-label classification method and storage medium
CN109783801A (en) * 2018-12-14 2019-05-21 厦门快商通信息技术有限公司 A kind of electronic device, multi-tag classification method and storage medium
CN109885841B (en) * 2019-03-20 2023-07-11 苏州大学 Reference digestion method based on node representation method
CN109885841A (en) * 2019-03-20 2019-06-14 苏州大学 Reference resolution method based on node representation
CN109948166A (en) * 2019-03-25 2019-06-28 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment
CN110134944A (en) * 2019-04-08 2019-08-16 国家计算机网络与信息安全管理中心 A kind of reference resolution method based on intensified learning
CN110162600A (en) * 2019-05-20 2019-08-23 腾讯科技(深圳)有限公司 A kind of method of information processing, the method and device of conversational response
CN110162600B (en) * 2019-05-20 2024-01-30 腾讯科技(深圳)有限公司 Information processing method, session response method and session response device
CN110321342A (en) * 2019-05-27 2019-10-11 平安科技(深圳)有限公司 Business valuation studies method, apparatus and storage medium based on intelligent characteristic selection
CN110377750A (en) * 2019-06-17 2019-10-25 北京百度网讯科技有限公司 Comment generates and comment generates model training method, device and storage medium
CN110377750B (en) * 2019-06-17 2022-05-27 北京百度网讯科技有限公司 Comment generation method, comment generation device, comment generation model training device and storage medium
WO2021164293A1 (en) * 2020-02-18 2021-08-26 平安科技(深圳)有限公司 Big-data-based zero anaphora resolution method and apparatus, and device and medium
CN111488733A (en) * 2020-04-07 2020-08-04 苏州大学 Chinese zero-index resolution method and system based on Mask mechanism and twin network
CN111488733B (en) * 2020-04-07 2023-12-19 苏州大学 Chinese zero reference resolution method and system based on Mask mechanism and twin network
CN111626042A (en) * 2020-05-28 2020-09-04 成都网安科技发展有限公司 Reference resolution method and device
CN111737949A (en) * 2020-07-22 2020-10-02 江西风向标教育科技有限公司 Topic content extraction method and device, readable storage medium and computer equipment
CN112256868A (en) * 2020-09-30 2021-01-22 华为技术有限公司 Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
US11645465B2 (en) 2020-12-10 2023-05-09 International Business Machines Corporation Anaphora resolution for enhanced context switching
WO2023279921A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Neural network model training method, data processing method, and apparatuses
CN114676709A (en) * 2022-04-11 2022-06-28 昆明理工大学 Chinese-Yue data enhancement method based on zero-pronoun completion

Similar Documents

Publication Publication Date Title
CN106294322A (en) A kind of Chinese based on LSTM zero reference resolution method
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN110032648A (en) A kind of case history structuring analytic method based on medical domain entity
Fahad et al. Inflectional review of deep learning on natural language processing
CN106776562A (en) A kind of keyword extracting method and extraction system
Suleiman et al. The use of hidden Markov model in natural ARABIC language processing: a survey
CN110209822A (en) Sphere of learning data dependence prediction technique based on deep learning, computer
CN112836046A (en) Four-risk one-gold-field policy and regulation text entity identification method
CN106599032A (en) Text event extraction method in combination of sparse coding and structural perceptron
CN110321563A (en) Text emotion analysis method based on mixing monitor model
CN110532328A (en) A kind of text concept figure building method
CN110717330A (en) Word-sentence level short text classification method based on deep learning
CN110222344B (en) Composition element analysis algorithm for composition tutoring of pupils
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN114330338A (en) Program language identification system and method fusing associated information
CN110851593A (en) Complex value word vector construction method based on position and semantics
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
Al-Harbi et al. Lexical disambiguation in natural language questions (nlqs)
CN107894976A (en) A kind of mixing language material segmenting method based on Bi LSTM
Sarmah et al. Survey on word sense disambiguation: an initiative towards an Indo-Aryan language
Zhao Research and design of automatic scoring algorithm for English composition based on machine learning
Putra et al. Sentence boundary disambiguation for Indonesian language
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
Muaidi Levenberg-Marquardt learning neural network for part-of-speech tagging of Arabic sentences
Cui et al. Aspect level sentiment classification based on double attention mechanism

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170104