CN107656921A - Short text dependency analysis method based on deep learning - Google Patents

Short text dependency analysis method based on deep learning

Info

Publication number
CN107656921A
CN107656921A (application CN201710934201.2A)
Authority
CN
China
Prior art keywords
sentence
short text
dependency
dependency analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710934201.2A
Other languages
Chinese (zh)
Other versions
CN107656921B (en)
Inventor
肖仰华
谢晨昊
梁家卿
崔万云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Several Eyes Technology Development Co Ltd
Original Assignee
Shanghai Several Eyes Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Several Eyes Technology Development Co Ltd filed Critical Shanghai Several Eyes Technology Development Co Ltd
Priority to CN201710934201.2A priority Critical patent/CN107656921B/en
Publication of CN107656921A publication Critical patent/CN107656921A/en
Application granted granted Critical
Publication of CN107656921B publication Critical patent/CN107656921B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text dependency analysis method based on deep learning, comprising: Step 1) obtaining, from search engine logs, the HTML files associated with user query statements, to serve as a training data set; Step 2) generating dependency analysis trees for the query statements from the training data set; Step 3) using the dependency trees to train a part-of-speech tagger and a parser based on a neural network model. Using an off-the-shelf sentence-level dependency parser, the invention automatically generates a massive short-text dependency analysis data set, and applies several methods to denoise and optimize the generated data set. We trained a short-text dependency analysis model on this data set; experiments show that its annotation quality on short texts is greatly improved compared with the sentence-level dependency parser.

Description

Short text dependency analysis method based on deep learning
Technical field
The invention relates to a short text dependency analysis method based on deep learning.
Background technology
Phrase structure and dependency structure are the two most widely studied classes of grammatical structure in current syntactic analysis. Dependency grammar was first proposed by the French linguist L. Tesnière in his work Éléments de syntaxe structurale (1959). Dependency grammar reveals the syntactic structure of a sentence by analyzing the dependency relations between its constituents. It holds that the verb at the center of a sentence is the central constituent governing the other constituents, while not itself being governed by any other constituent, and that every subordinate constituent is attached to a governor through some dependency relation.
For example, performing dependency analysis on the text "Its apple watch charging stand is my favorite stand." yields the dependency analysis tree shown in Fig. 2.
From the dependency analysis tree, the overall syntactic structure of the sentence and the modification relations between words can be obtained clearly, and the semantics of the sentence can be understood to a certain extent.
Dependency analysis of short texts is essential for understanding their grammatical components, the parts of speech of their words, and their semantics. Consider the following search queries and their corresponding syntactic structures, shown in Fig. 3:
The result for the short text "cover iphone 6plus" shows that the head of this phrase is the cover: the user's need is to find a protective cover for the iphone, not the iphone itself. With this knowledge, a search engine can reasonably display advertisements for iphone covers. For "distance earth moon", the head is distance, indicating that the user intends to query the distance between the earth and the moon. For "faucet adapter female", the intent is to find a faucet adapter. In short, if the dependency relations of a short text can be identified correctly, the relations between the core head and the modifiers in the short text can be extracted, leading to a better understanding of its semantics.
The main challenges in performing dependency analysis on short texts are:
1. Short texts usually lack the complete grammatical features that would aid analysis. In fact, short texts generally have very high ambiguity. For example, the short text "kids toys" may mean "toys for kids" or "kids with toys"; the dependency edge between toys and kids points in exactly opposite directions in the two cases, as shown in Fig. 4.
2. To date there are no linguistic rules for performing dependency analysis on short texts. In the manual annotation of dependencies, the lack of a standard may lead to unclear annotations. Moreover, the cost of manual annotation is enormous: a dependency annotation data set typically takes several years to complete.
In dependency analysis, the semantic information of a short text is mainly carried by the dependency edges. That is, for any two words x, y ∈ q in a short text, we judge whether a dependency relation exists between x and y and, if it exists, which dependency relation it is.
To make this judgment, the usable semantics of the short text fall broadly into two classes: context-free information and context-sensitive information.
● Context-free information: when using context-free information, we model P(e | x, y) directly, where e denotes the dependency edge between x and y (x → y or x ← y). This modeling is context-free because we do not consider the relative positions of x and y in the input.
To obtain P(e | x, y), one approach is to use a corpus with existing annotations, such as the Google syntactic n-grams data set. For two words x and y, we count how many times x modifies y and how many times y modifies x in the corpus, and estimate P(e | x, y) from these counts.
● Context-sensitive information: using only context-free information has two major drawbacks: 1) considering the relation between two words directly, without their context, is risky; 2) context-free information often cannot characterize the type of the direct dependency relation between two words, and therefore cannot fully represent the semantics of the whole input.
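To make the context-free estimate concrete, here is a minimal sketch (ours, not the patent's) of deriving P(e | x, y) from corpus modification counts; the function name and input convention are assumptions:

```python
def edge_prob(count_x_modifies_y: int, count_y_modifies_x: int):
    """Relative-frequency estimate of P(e | x, y): the probability of
    each edge direction given only the two words, ignoring context.
    Counts could come from a corpus such as Google syntactic n-grams.
    Falls back to a uniform guess when the pair is unseen."""
    total = count_x_modifies_y + count_y_modifies_x
    if total == 0:
        return (0.5, 0.5)
    return (count_x_modifies_y / total, count_y_modifies_x / total)
```

As the text notes, such an estimate ignores word order and context, which is exactly the limitation that motivates the context-sensitive model P(e | x, y, q).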
To take contextual information into account, i.e., to estimate P(e | x, y, q) for any two words x and y, our design goal becomes constructing a dependency parser for short texts. Constructing such a parser requires a massive training data set. We design a method to generate this data set automatically, avoiding the cost of manual annotation. The whole method rests on the following assumption: the intent of a short text q agrees with the intent of the click sentences of this short text. We say a sentence s is a click sentence of a short text q if and only if: 1) s was clicked a large number of times by users within the search results of q; 2) every word of q occurs in s. For example, suppose the sentence s = "... my favorite Thai food in Houston ..." is a click sentence of the short text q = "thai food houston"; then the overall intents of the two are similar, and the dependency between a word pair in the short text also resembles the direct relation of that word pair in the sentence. However, since a word pair in the short text may not be directly connected in the sentence, we still need a method to map the dependencies in the sentence onto the short text in a reasonable way.
In recent years, deep learning has been shown to be highly applicable to natural language processing (NLP) problems. As early as the beginning of the 21st century, language models based on neural networks were proposed, marking the start of applying deep learning to natural language processing tasks. Subsequent research showed that deep learning based on convolutional neural networks achieves excellent performance on many natural language processing tasks such as part-of-speech tagging, chunking, and named entity recognition. Later still, with the popularization of recurrent neural networks, deep learning performed even better on NLP problems and found wider application in fields such as machine translation.
Summary of the invention
The technical problem to be solved by the invention is to provide a short text dependency analysis method based on deep learning, so as to solve the problems existing in the prior art.
The technical scheme adopted by the invention to solve the above technical problem is as follows:
A short text dependency analysis method based on deep learning, comprising:
Step 1) obtaining, from search engine logs, the HTML files associated with user query statements, to serve as a training data set;
Step 2) generating dependency analysis trees for the query statements from the training data set;
Step 3) using the dependency trees to train a part-of-speech tagger and a parser based on a neural network model.
Preferably, step 1) specifically includes:
for each query q in the search log and the list of URLs with high user click-through rates under its search results, obtaining the corresponding HTML documents;
extracting the sentences s that contain every word of the query, thereby obtaining a number of triples (q, s, count), where count denotes the number of occurrences in the sentence;
using the resulting set of triples as the training data set for generating dependency analysis trees.
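The click-sentence and triple extraction described above can be sketched as follows; the naive sentence splitting, the function names, and the input format are simplifying assumptions of ours, not the patent's:

```python
import re

def is_click_sentence(query: str, sentence: str) -> bool:
    """A sentence qualifies as a click sentence of a query when every
    word of the query occurs in it (criterion 2 above)."""
    words = lambda text: set(re.findall(r"[a-z0-9]+", text.lower()))
    return words(query) <= words(sentence)

def extract_triples(query: str, clicked_pages: list) -> list:
    """Build (q, s, count) triples from the pages users clicked for
    this query. `clicked_pages` is a list of (page_text, count) pairs;
    sentences are split naively on terminal punctuation."""
    triples = []
    for page_text, count in clicked_pages:
        for sentence in re.split(r"[.!?]\s+", page_text):
            if sentence and is_click_sentence(query, sentence):
                triples.append((query, sentence.strip(), count))
    return triples
```

A real pipeline would additionally parse the HTML and apply the click-frequency threshold (criterion 1), both omitted here for brevity.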
Preferably, a short text may have multiple sentences clicked by the corresponding users; further, generating a dependency analysis tree for a short text q from a sentence s specifically includes:
letting T_s denote the set of all subtrees of the dependency tree of s;
finding the minimal subtree t ∈ T_s such that each word x ∈ q has one and only one match x′ ∈ t;
for any two words x and y in q, generating the dependency tree t_{q,s} of q from t as follows:
if there is an edge x′ → y′ in t, creating an identical edge x → y in t_{q,s};
if there is a path from x′ to y′ in t, creating an edge x → y in t_{q,s} and temporarily labeling it dep.
After a dependency tree has been generated from each sentence, a unique dependency tree must be selected for the short text. We define a scoring function f to assess the quality of a dependency tree t_q generated from a sentence s corresponding to q:
wherein (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge occurs over the whole data set, dist(x, y) is the distance between the words x and y in the dependency analysis tree of the original sentence, and α is a parameter that adjusts the relative importance of the two scoring terms;
finally, the labels need to be refined.
Preferably, the types of some dependency edges are set to the placeholder "dep"; we must infer a real label for each "dep", since otherwise inconsistencies would arise in the training data set.
To solve this problem, we use majority voting (majority vote);
specifically: for each edge whose type is the placeholder, counting the number of occurrences of each specific label for that word pair in the training data set; if the frequency of one specific label exceeds a threshold, for example occurring 10 times more often than the other labels, we change the placeholder dep to that label.
Preferably, step 3), training a part-of-speech tagger and a parser based on a neural network model, specifically includes:
for each word in a sentence, building a fixed window centered on the word and extracting features, including the word form, capitalization, prefix, and suffix;
for word features, using pre-trained word2vec embeddings; for capitalization, prefixes, and suffixes, initializing the embeddings randomly;
next, analyzing the sentence with a dependency analysis system based on ArcStandard; the features used are shown in the following table:
In the table, s_i (i = 1, 2, ...) denotes the i-th element from the top of the stack, b_i (i = 1, 2, ...) the i-th element of the buffer, and lc_k(s_i) and rc_k(s_i) the k-th leftmost and k-th rightmost child of s_i. w denotes the word form, t the part-of-speech tag, and l the dependency label.
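The fixed-window feature extraction for the tagger can be sketched as below; the window size, affix length, and feature naming scheme are our assumptions, since the patent's feature table is not reproduced in this text:

```python
def window_features(words, i, size=2):
    """Features for the word at position i from a fixed window of
    `size` words on each side: lowercased word form, capitalization
    flag, and 3-character prefix/suffix, with padding at sentence
    boundaries. Each feature would then be looked up in an embedding
    table (word2vec for forms, random init for the rest)."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        w = words[j] if 0 <= j < len(words) else "<PAD>"
        feats[f"w{off}"] = w.lower()
        feats[f"cap{off}"] = w[:1].isupper()
        feats[f"pre{off}"] = w[:3].lower()
        feats[f"suf{off}"] = w[-3:].lower()
    return feats
```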
Using an off-the-shelf sentence-level dependency parser, the invention automatically generates a massive short-text dependency analysis data set and applies several methods to denoise and optimize the generated data set. We trained a short-text dependency analysis model on this data set; experiments show that its annotation quality on short texts is greatly improved compared with the sentence-level dependency parser.
Other features and advantages of the invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the written description, the claims, and the accompanying drawings.
Brief description of the drawings
The invention is described in detail below with reference to the accompanying drawings, so that the above advantages of the invention become clearer. In the drawings,
Fig. 1 is a schematic diagram of the overall structure of the dependency parser of the short text dependency analysis method based on deep learning of the invention;
Figs. 2 to 4 are schematic diagrams of sentence analyses referred to in the background art of the invention;
Figs. 5 to 18 are schematic diagrams of sentence analyses referred to in the embodiments of the invention.
Embodiment
Embodiments of the invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve technical problems and achieve technical effects can be fully understood and implemented accordingly. It should be noted that, as long as no conflict arises, the embodiments of the invention and the features in each embodiment may be combined with one another, and the resulting technical schemes all fall within the protection scope of the invention.
In addition, the steps illustrated in the flowcharts of the drawings may be performed in a computer system executing, for example, a set of computer-executable instructions; and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that given here.
Specifically, the invention constructs an end-to-end system that, starting from search engine logs and using an off-the-shelf sentence-level dependency parser, automatically generates a massive short-text dependency analysis data set and applies several methods to denoise and optimize the generated data set. We trained a short-text dependency analysis model on this data set; experiments show that its annotation quality on short texts is greatly improved compared with the sentence-level dependency parser.
In a specific embodiment:
5.1. Data source
The data come from the search logs of a search engine. For each query q in the search log and the list of URLs most clicked by users under its search results, we obtain the HTML document corresponding to each URL and extract from it the sentences that contain every word of the query, as click sentences of the query. This yields triples:
(q, s, count). Afterwards, we analyze each sentence s with a sentence-level dependency parser to obtain its dependency analysis tree, which we take to be correct in the main.
5.2. Inferring the dependency analysis tree
A short text q may have multiple sentences clicked by the corresponding users. This step maps the dependency analysis tree of one such sentence s onto the short text q.
We use the heuristics below to map the dependencies of the sentence s onto the short text q.
1. Let T_s denote the set of all subtrees of the dependency tree of s.
2. Find the minimal subtree t ∈ T_s such that each word x ∈ q has one and only one match x′ ∈ t.
3. Derive the dependency tree t_{q,s} of q from t as follows. For two words x and y in q:
a. If there is an edge x′ → y′ in t, create a corresponding edge x → y in t_{q,s}.
b. If there is a path from x′ to y′ in t (a path whose edges all point in the same direction), create an edge x → y in t_{q,s}, temporarily mark its type as dep, and update it to a more specific dependency in the subsequent optimization step.
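The projection in steps 1 to 3 can be sketched as follows, under the simplifying assumption that the word matching and the sentence tree (as head pointers and edge labels) are already given; the helper names and data layout are ours, not the patent's:

```python
def project_tree(query_words, match, heads, labels):
    """Project a sentence dependency tree onto the query. `match`
    maps each query word to its unique match in the minimal subtree;
    `heads` maps each sentence word to its head (None for the root);
    `labels` maps each sentence word to the label of its incoming
    edge. Returns {dependent: (head, label)} edges over query words,
    with 'dep' as the placeholder for path-derived edges."""
    def ancestors(w):
        chain = []
        while heads.get(w) is not None:
            w = heads[w]
            chain.append(w)
        return chain

    edges = {}
    for x in query_words:
        for y in query_words:
            if x == y:
                continue
            xp, yp = match[x], match[y]
            if heads.get(yp) == xp:              # direct edge x' -> y'
                edges[y] = (x, labels[yp])
            elif xp in ancestors(yp):            # path x' -> ... -> y'
                edges.setdefault(y, (x, "dep"))  # placeholder label
    return edges
```

Run on the "crude price" example below, the path price → oil → crude yields a placeholder edge from price to crude, matching the "crude ← oil ← price" case discussed in section 5.2.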
Below we classify the dependency relations common in short texts, together with the way each is mapped from the sentence.
Direct connection: in this case we directly copy the edge and its type from the sentence. Consider the sentence corresponding to the short text "party supplies cheap", shown in Fig. 5:
in this sentence, the word pairs (party, supplies) and (supplies, cheap) are both directly connected. The dependencies of this short text can therefore be inherited directly from the relations in the sentence, as in Fig. 6:
Connection through a function word: in short text queries, omitting prepositions is very common. Take the sentence corresponding to the short text "moon landing", shown in Fig. 7:
we can map it to obtain the dependency tree shown in Fig. 8.
For the sentence corresponding to the short text "side effects b12", shown in Fig. 9:
the dependency tree in Fig. 10 is obtained.
In both cases, temporary dependency edges of type "dep" appear; we handle them in a later step.
Connection through a modifier word: many search queries consist of noun phrases, and their corresponding sentences may omit many modifiers. Depending on the noun phrase bracketing, the nouns of the phrase may be connected either directly or indirectly.
For "offshore work" and its corresponding sentence, omitting the modifier "drilling" causes no problem: "offshore" and "work" remain directly connected, so the dependency can be inherited directly, as in Fig. 11.
But for the short text "crude price" and its corresponding sentence this is not the case, as in Fig. 12.
In this case, considering the path crude ← oil ← price, an edge can be inherited, as in Fig. 13. Connection through a head word: in some cases the head word of a noun phrase may be omitted. Consider "country singers" and its corresponding sentence, shown in Fig. 14:
their semantics are clearly consistent, but the head word "music" is omitted in the short text. There is still a path from "singers" to "country" in the sentence, from which the dependency tree in Fig. 15 can be obtained.
Connection through a verb: a common example is the omission of a copula. Consider the sentence corresponding to the example "plants poisonous to goats", shown in Fig. 16: in this case, omitting "are" does not affect the direct connections between the words of the short text. But consider the short text "pain between breasts" and its corresponding sentence, shown in Fig. 17:
in this case, a dependency can be inherited, as in Fig. 18:
5.3. dependency analysis tree is merged
In the previous step, for a short text q we obtained a collection of dependency analysis trees mapped from multiple corresponding sentences. These dependency trees may not agree with one another. The main reasons are: 1. the sentence-level dependency parser is not perfect; 2. the short text itself may be ambiguous; 3. for some short texts, no sentence with consistent semantics may exist. The main purpose of this step is to merge these multiple, possibly disagreeing dependency trees into a unique dependency tree for the short text.
To select a unique dependency tree for a short text q, we define a scoring function f to assess the quality of a dependency tree t_q generated from a sentence s corresponding to q:
wherein (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge occurs over the whole data set, dist(x, y) is the distance between the words x and y in the dependency analysis tree of the original sentence, and α is a parameter that adjusts the mutual importance of the two scoring terms.
The first term of the scoring function captures the compactness of the short-text dependency analysis tree; a compact dependency tree can usually describe the semantics of the short text more concisely. For example, the short text "deep learning" has the following two corresponding sentences:
in the first sentence, the connection between "deep" and "learning" is very loose, so its semantics deviate greatly from those of the short text. In the second sentence, the two words are directly connected, and the whole sentence also has good semantic similarity with the short text.
The second term of the scoring function captures the global consistency of the short-text dependency analysis tree. For a word pair (x, y), if over the whole data set the number of occurrences of x → y is far higher than that of y → x, then the latter is likely to be wrong. A special case to consider in this process is word order: if two words appear in different orders in short texts, their corresponding grammatical relations may also differ. For example, "child of" and "of child" both consist of the two words "child" and "of", but the two have different correct dependencies.
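The formula for f appears only as an image in the source text, so the sketch below is a hedged guess at how its two described terms might combine; the exact functional form (reciprocal distance for compactness, additive combination) is an assumption of ours:

```python
def tree_score(edges, edge_count, dist, alpha=1.0):
    """One plausible reading of the scoring function f: a compactness
    term rewarding edges whose endpoints were close in the source
    sentence's tree (small dist(x, y)), plus alpha times a global
    consistency term rewarding edges that occur frequently across the
    whole data set (large count(x -> y))."""
    compact = sum(1.0 / dist[(x, y)] for (x, y) in edges)
    consistent = sum(edge_count.get((x, y), 0) for (x, y) in edges)
    return compact + alpha * consistent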
5.4. Result optimization
In the previous steps, the types of some dependency edges were set to the placeholder "dep". Before training a dependency parser with the resulting data set, we must infer a real label for each "dep"; otherwise, specific and unspecific labels would coexist in the training data set, causing inconsistency. For example, for the short text "crude price", the edge type obtained from a sentence containing "crude oil price" is dep, while the edge type obtained from a sentence containing "crude price" may be amod.
To infer "dep", we first use majority voting: for each edge whose type is the placeholder, we count the occurrences of each specific label for that word pair in the training data and, if the frequency of one label far exceeds the others, replace dep with that label.
On our training data set, this process resolves about 90% of the dependencies. Edges that remain unresolved provide no dependency information and could simply be deleted. However, considering that the other word pairs in these short texts may still carry meaningful information, we adopt a bootstrap approach: first, delete the short-text dependency data that contain edges of uncertain type and train a short-text parser; then, predict again on this roughly 10% of the data, and if the prediction agrees with the direction recorded in these data, backfill the specific type the parser outputs for the "dep" edge into the dependency analysis tree; finally, add the backfilled dependency trees to the training set and retrain the dependency parser to obtain the final model.
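The majority-voting part of this refinement can be sketched as follows; the data layout and the exact threshold rule (most frequent label versus all other specific labels combined) are our assumptions:

```python
from collections import Counter

def refine_dep_labels(edge_labels, threshold=10):
    """Majority-vote refinement of placeholder 'dep' edge types.
    `edge_labels` maps a word pair to the list of edge labels observed
    for it across the training data. A pair is resolved only when its
    most frequent specific label occurs `threshold` times more often
    than all other specific labels combined."""
    resolved = {}
    for pair, labels in edge_labels.items():
        counts = Counter(lab for lab in labels if lab != "dep")
        if not counts:
            continue  # nothing but placeholders: leave unresolved
        best, n = counts.most_common(1)[0]
        others = sum(counts.values()) - n
        if others == 0 or n >= threshold * others:
            resolved[pair] = best
    return resolved
```

Pairs left unresolved here are the roughly 10% handled by the bootstrap step described above.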
5.5. Short text dependency analysis model
Short text dependency analysis uses a neural-network-based dependency parser structure similar to that of (Danqi 2014). The main features used are as follows:
In the table, si (i = 1, 2, ...) denotes the i-th element from the top of the stack, bi (i = 1, 2, ...) the i-th element of the buffer, and lck(si) and rck(si) the k-th leftmost child and the k-th rightmost child of si. w denotes the word itself, t the part-of-speech tag, and l the dependency label.
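A hedged reconstruction of these feature templates, in the style of the (Danqi 2014) parser; the exact template set is an assumption, since the original table is not reproduced here:

```python
# Positions from which features are read: stack tops (s1..s3), buffer fronts
# (b1..b3), and the first/second leftmost and rightmost children of s1, s2.
POSITIONS = (
    ["s1", "s2", "s3", "b1", "b2", "b3"]
    + [f"{c}({s})" for s in ("s1", "s2")
       for c in ("lc1", "rc1", "lc2", "rc2")]
)

WORD_FEATURES = [f"w[{p}]" for p in POSITIONS]               # the word w itself
POS_FEATURES = [f"t[{p}]" for p in POSITIONS]                # part-of-speech tag t
LABEL_FEATURES = [f"l[{p}]" for p in POSITIONS if "(" in p]  # dependency label l,
# defined only for child positions (the label of the arc to the parent)

print(len(WORD_FEATURES), len(POS_FEATURES), len(LABEL_FEATURES))
```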
It should be noted that, for the sake of brevity, the foregoing method embodiments are described as a series of action combinations. However, those skilled in the art should understand that the present application is not limited by the described order of actions, since in accordance with the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
Finally, it should be noted that the foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein, or replace some of the technical features with equivalents. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (5)

  1. A deep-learning-based short text dependency analysis method, characterized by comprising:
    Step 1) obtaining, from search engine logs, the HTML files associated with user query statements as a training dataset;
    Step 2) generating dependency analysis trees for the query statements from the training dataset;
    Step 3) training a part-of-speech tagger and a parser based on a neural network model using the dependency trees.
  2. The deep-learning-based short text dependency analysis method according to claim 1, characterized in that step 1) specifically comprises:
    for each query q in the search log, and for each URL in the list of results with a high user click-through rate under that query, obtaining the corresponding HTML document;
    extracting the sentences s that contain every word of the query, thereby obtaining a number of triples (q, s, count), where count denotes the number of times the words occur in the sentence;
    using the resulting set of triples as the training dataset for generating dependency analysis trees.
  3. The deep-learning-based short text dependency analysis method according to claim 2, characterized in that a short text may correspond to multiple sentences clicked by users, and generating the dependency analysis tree for a short text q from a sentence s specifically comprises:
    letting T_s denote all subtrees of the dependency tree of s;
    finding the minimal subtree t ∈ T_s such that each word x ∈ q has one and only one match x′ ∈ t;
    for any two words x and y in q, generating the dependency tree t_{q,s} of q from t as follows:
    if an edge x′ → y′ exists in t, creating an identical edge x → y in t_{q,s};
    if a path from x′ to y′ exists in t, creating an edge x → y in t_{q,s} and temporarily marking it as dep;
    after generating a dependency tree from each sentence, a unique dependency tree must be selected for the short text; we define a scoring function f to evaluate the quality of the dependency tree t_q generated from a sentence s corresponding to q:

    f(t_q, s) = Σ_{(x→y) ∈ t_q} ( −α · dist(x, y) + log( count(x→y) / count(y→x) ) )

    where (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge occurs across the whole dataset, dist(x, y) is the distance between words x and y on the dependency analysis tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms;
    finally, the labels are refined.
  4. The deep-learning-based short text dependency analysis method according to claim 3, characterized in that the types of some dependency edges are set to the placeholder "dep", and the true label behind each "dep" must be inferred, since otherwise the training dataset would contain both specific and unspecific labels and become inconsistent;
    to solve this problem, majority voting (majority vote) is used;
    comprising: for any edge marked dep, counting in the training dataset the number of occurrences of each specific label for that word pair; if the frequency of one specific label exceeds a threshold, for example 10 times the occurrence count of every other label, replacing the placeholder dep with that label.
  5. The deep-learning-based short text dependency analysis method according to claim 1, characterized in that step 3), training the part-of-speech tagger and the parser based on a neural network model, specifically comprises:
    for each word in a sentence, establishing a fixed window centered on the word and extracting features, including the word itself, its capitalization, prefix, and suffix;
    for word features, using pre-trained word2vec embeddings; for capitalization and affix features, randomly initializing the embeddings;
    then analyzing the sentence with an ArcStandard-based dependency analysis system, using the features shown in the following table:
    in the table, si (i = 1, 2, ...) denotes the i-th element from the top of the stack, bi (i = 1, 2, ...) the i-th element of the buffer, and lck(si) and rck(si) the k-th leftmost child and the k-th rightmost child of si; w denotes the word itself, t the part-of-speech tag, and l the dependency label.
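A minimal sketch of the scoring function f of claim 3, with toy counts and a constant distance function standing in for real data (all names are illustrative):

```python
import math

# f(t_q, s) = sum over edges (x -> y) in t_q of
#             -alpha * dist(x, y) + log( count(x -> y) / count(y -> x) )
def score(tree_edges, dist, count, alpha=1.0):
    total = 0.0
    for x, y in tree_edges:
        total += -alpha * dist(x, y)                       # locality term
        total += math.log(count[(x, y)] / count[(y, x)])   # coherence term
    return total

# Toy usage: one edge whose forward direction is 20x more frequent than the
# reverse, with the two words at distance 2 in the source sentence's tree.
counts = {("crude", "price"): 20, ("price", "crude"): 1}
print(score([("crude", "price")], lambda x, y: 2, counts, alpha=0.5))
```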
CN201710934201.2A 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning Active CN107656921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710934201.2A CN107656921B (en) 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning

Publications (2)

Publication Number Publication Date
CN107656921A true CN107656921A (en) 2018-02-02
CN107656921B CN107656921B (en) 2021-01-08

Family

ID=61117779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710934201.2A Active CN107656921B (en) 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning

Country Status (1)

Country Link
CN (1) CN107656921B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491512A (en) * 2018-03-23 2018-09-04 北京奇虎科技有限公司 The method of abstracting and device of headline
CN108647785A (en) * 2018-05-17 2018-10-12 普强信息技术(北京)有限公司 A kind of neural network method for automatic modeling, device and storage medium
CN110189751A (en) * 2019-04-24 2019-08-30 中国联合网络通信集团有限公司 Method of speech processing and equipment
CN111523302A (en) * 2020-07-06 2020-08-11 成都晓多科技有限公司 Syntax analysis method and device, storage medium and electronic equipment
WO2020232943A1 (en) * 2019-05-23 2020-11-26 广州市香港科大霍英东研究院 Knowledge graph construction method for event prediction and event prediction method
CN112446405A (en) * 2019-09-04 2021-03-05 杭州九阳小家电有限公司 User intention guiding method for home appliance customer service and intelligent home appliance

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298642A (en) * 2011-09-15 2011-12-28 苏州大学 Method and system for extracting text information
CN102968431A (en) * 2012-09-18 2013-03-13 华东师范大学 Control device for mining relation between Chinese entities on basis of dependency tree
US20130117010A1 (en) * 2010-07-13 2013-05-09 Sk Planet Co., Ltd. Method and device for filtering a translation rule and generating a target word in hierarchical-phase-based statistical machine translation
CN103473223A (en) * 2013-09-25 2013-12-25 中国科学院计算技术研究所 Rule extraction and translation method based on syntax tree
CN105740235A (en) * 2016-01-29 2016-07-06 昆明理工大学 Phrase tree to dependency tree transformation method capable of combining Vietnamese grammatical features
CN105893346A (en) * 2016-03-30 2016-08-24 齐鲁工业大学 Graph model word sense disambiguation method based on dependency syntax tree
CN106598951A (en) * 2016-12-23 2017-04-26 北京金山办公软件股份有限公司 Dependency structure treebank acquisition method and system
CN106776686A (en) * 2016-11-09 2017-05-31 武汉泰迪智慧科技有限公司 Chinese domain short text understanding method and system based on many necks
CN107168948A (en) * 2017-04-19 2017-09-15 广州视源电子科技股份有限公司 A kind of sentence recognition methods and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shen Chao: "Research on Transition-Based Dependency Parsing", China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN107656921B (en) 2021-01-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant