CN107656921A - A kind of short text dependency analysis method based on deep learning - Google Patents
- Publication number
- CN107656921A CN107656921A CN201710934201.2A CN201710934201A CN107656921A CN 107656921 A CN107656921 A CN 107656921A CN 201710934201 A CN201710934201 A CN 201710934201A CN 107656921 A CN107656921 A CN 107656921A
- Authority
- CN
- China
- Prior art keywords
- sentence
- short text
- dependency
- dependency analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a deep-learning-based short-text dependency analysis method, comprising: Step 1) obtaining, from search-engine logs, the HTML files associated with user query sentences as a training data set; Step 2) generating the dependency parse trees of the query sentences from the training data set; Step 3) training a part-of-speech tagger and a parser based on a neural network model with the dependency trees. Using an off-the-shelf sentence-level dependency parser, the invention automatically generates a massive short-text dependency-analysis data set and applies several methods to denoise and optimize it. We trained a short-text dependency analysis model on this data set; experiments show that its annotation accuracy on short texts is greatly improved over the sentence-level parser.
Description
Technical field
The invention relates to a deep-learning-based short-text dependency analysis method.
Background technology
Phrase structure and dependency structure are the two most widely studied classes of grammatical structure in current syntactic analysis. Dependency grammar was first proposed by the French linguist L. Tesnière in his work "Éléments de syntaxe structurale" (1959). Dependency grammar reveals the syntactic structure of a sentence by analyzing the dependency relations among its linguistic components. It holds that the verb at the center of the sentence is the central component governing all the others, while itself governed by no other component, and that every subordinate component is attached to its governor through some dependency relation.
For example, performing dependency analysis on the text "Its apple watch charging stand is my favorite stand." yields the dependency parse tree of Fig. 2. From the tree, the overall syntactic structure of the sentence and the modification relations between words can be read off clearly, and the semantics of the sentence can be understood to some extent.
Dependency analysis of short texts is essential for understanding their grammatical components, the parts of speech of their words, and their semantics. Consider the following search queries and their syntactic structures (Fig. 3). The parse of the short text "cover iphone 6plus" shows that the head of the phrase is the cover: the user wants to find a cover for an iphone, not an iphone. With this knowledge, a search engine can reasonably show advertisements for iphone covers. For "distance earth moon", the head is distance, indicating that the user intends to query the distance between the earth and the moon. For "faucet adapter female", the intent is to find a faucet adapter. In short, if the dependency relations of a short text can be identified correctly, the relations between the core head of the short text and its modifiers can be extracted, and the semantics of the short text can be better understood.
The main challenges in performing dependency analysis on short texts are:
1. Short texts usually lack the complete grammatical features that would aid analysis; in fact, they are often highly ambiguous. For example, the short text "kids toys" may mean "toys for kids" or "kids with toys", and the direction of the dependency edge between toys and kids is opposite in the two cases (Fig. 4).
2. So far there are no linguistic rules for dependency analysis of short texts. In manual annotation, the lack of a standard may lead to unclear annotations, and the cost of manual annotation is enormous: a dependency-analysis annotation set typically takes several years to complete.
In dependency analysis, the semantic information of a short text is mainly carried by the dependency edges: for any two words x, y ∈ q, decide whether a dependency relation exists between x and y and, if so, which relation it is.
To make this decision, the usable semantics of a short text fall into two broad classes: context-free information and context-sensitive information.
● Context-free information: here we model P(e | x, y) directly, where e denotes the dependency edge between x and y (x → y or x ← y). This modeling is context-free because it does not consider the relative positions of x and y in the input.
One way to obtain P(e | x, y) is from an annotated corpus such as the Google syntactic n-grams data set: for two words x and y, count in the corpus how often x modifies y and how often y modifies x, and estimate P(e | x, y) from these counts.
● Context-sensitive information: using only context-free information has two major drawbacks. 1) Judging the relation between two words without considering their context is risky. 2) Context-free information often cannot characterize the type of the dependency between the two words, and therefore cannot fully represent the semantics of the whole input.
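The context-free counting estimate described above can be sketched as follows; the edge-list corpus format and the toy data are assumptions for illustration, not from the patent:

```python
from collections import Counter

def estimate_edge_probs(edges):
    """Estimate P(e | x, y) from a corpus of directed dependency edges.

    `edges` is an iterable of (head, dependent) pairs, so (y, x) means
    "x modifies y" (edge x -> y in the notation above).
    """
    counts = Counter(edges)
    probs = {}
    for (head, dep), n in counts.items():
        rev = counts.get((dep, head), 0)
        # P(x -> y | x, y): share of this direction among both directions.
        probs[(dep, head)] = n / (n + rev)
    return probs

probs = estimate_edge_probs([
    ("supplies", "party"), ("supplies", "party"), ("party", "supplies"),
])
print(round(probs[("party", "supplies")], 2))  # P(party -> supplies): 0.67
```

As the drawbacks above suggest, such an estimator ignores where x and y occur in the query, which is exactly what the context-sensitive model is meant to recover.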
To take contextual information into account, i.e. to estimate P(e | x, y, q) for any two words x and y, our design goal becomes constructing a dependency parser for short texts. Building such a parser requires a massive training data set, so we designed a method that generates this data set automatically, avoiding the cost of manual annotation. The whole method rests on one assumption: the intent of a short text q is consistent with the intent of its click sentences. We call a sentence s a click sentence of a short text q if and only if: 1) s is clicked by users a large number of times within the search results of q; 2) every word of q occurs in s. For example, suppose the sentence s = "... my favorite Thai food in Houston ..." is a click sentence of the short text q = "thai food houston"; then their overall intents are similar, and the dependency between a word pair in the short text corresponds closely to the direct relation of that pair in the sentence. However, since some word pairs of the short text may not be directly connected in the sentence, a method is still needed to map the dependencies of the sentence onto the short text reasonably.
In recent years, deep learning has proven highly applicable to natural language processing (NLP) problems. As early as the beginning of the 21st century, language models based on neural networks were proposed, marking the start of applying deep learning to natural language processing tasks. Research then showed that deep learning based on convolutional neural networks performs excellently on many NLP tasks such as part-of-speech tagging, chunking and named entity recognition. Later still, with the popularization of recurrent neural networks, deep learning performed even better on NLP problems and found wider application in fields such as machine translation.
Summary of the invention
The technical problem to be solved by the invention is to provide a deep-learning-based short-text dependency analysis method that addresses the problems of the prior art.
The technical scheme adopted by the invention to solve the above technical problem is as follows:
A deep-learning-based short-text dependency analysis method, comprising:
Step 1) obtaining, from search-engine logs, the HTML files associated with user query sentences as a training data set;
Step 2) generating the dependency parse trees of the query sentences from the training data set;
Step 3) training a part-of-speech tagger and a parser based on a neural network model with the dependency trees.
Preferably, step 1) specifically comprises:
For each query q in the search log and the list of URLs with high user click-through rate under its search results, obtaining the corresponding HTML documents;
Extracting the sentences s that contain every word of the query, thereby obtaining a set of triples (q, s, count), where count denotes the number of times the words occur in the sentence;
Using the resulting set of triples as the training data set for generating dependency parse trees.
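As a sketch under assumptions — the log is reduced to pre-extracted sentences with associated counts, and word matching is plain lowercase token overlap — the triple construction of step 1) might look like:

```python
import re

def extract_triples(log):
    """Build (q, s, count) triples from a toy search log.

    `log` maps each query to a list of (sentence, count) pairs taken from
    frequently clicked result pages. A sentence qualifies only if it
    contains every word of the query.
    """
    triples = []
    for q, pages in log.items():
        q_words = set(q.lower().split())
        for sentence, count in pages:
            s_words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
            if q_words <= s_words:
                triples.append((q, sentence, count))
    return triples

log = {"thai food houston": [
    ("My favorite Thai food in Houston is spicy.", 42),
    ("Great food in Houston.", 7),  # missing "thai": rejected
]}
print(extract_triples(log))
```

A real pipeline would first strip HTML and segment the documents into sentences; both steps are elided here.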
Preferably, since a short text may have multiple corresponding click sentences, generating the dependency parse tree of a short text q from a sentence s specifically comprises:
Let T_s denote the set of all subtrees of the dependency tree of s;
Find the minimal subtree t ∈ T_s such that every word x ∈ q has exactly one match x′ ∈ t;
For any two words x and y in q, generate the dependency tree t_{q,s} of q from t as follows:
If there is an edge x′ → y′ in t, create the identical edge x → y in t_{q,s};
If there is a path from x′ to y′ in t, create an edge x → y in t_{q,s} and temporarily label it dep.
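The two projection rules above can be sketched as follows; the tree encoding (a child-to-(head, label) map) and the exact-word matching are simplifying assumptions, and the minimal-subtree matching step is elided:

```python
def project_tree(sent_heads, q_words):
    """Project a sentence dependency tree onto a query.

    `sent_heads` maps each dependent word to (head_word, label); each
    query word is assumed to match exactly one sentence word of the same
    form. Direct edges keep their label; longer paths get placeholder "dep".
    """
    q = set(q_words)
    edges = {}
    for x in q_words:
        # Walk from x toward the root until another query word is met.
        steps, y = 0, x
        while y in sent_heads:
            y, label = sent_heads[y]
            steps += 1
            if y in q:
                edges[(x, y)] = label if steps == 1 else "dep"
                break
    return edges

# "crude oil price": crude and oil each attach to the next word.
sent = {"crude": ("oil", "compound"), "oil": ("price", "compound")}
print(project_tree(sent, ["crude", "price"]))  # path -> temporary "dep" edge
```

The edge direction here follows a dependent-to-head convention; the patent's arrow notation leaves the orientation to the figures, so this choice is an assumption.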
After a dependency tree has been generated from each sentence, a unique dependency tree must be selected for the short text. We define a scoring function f to assess the quality of the dependency tree t_q generated from a corresponding sentence s of q. In f, (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge occurs over the whole data set, dist(x, y) is the distance between words x and y in the dependency parse tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms.
Finally, the labels need to be refined.
Preferably, since the type of some dependency edges has been set to the placeholder "dep", we must infer a real label for each "dep"; otherwise the training data set would be inconsistent.
To solve this problem, we use majority voting (majority vote):
For every edge whose type is the placeholder, count over the training data set the occurrences of each specific label for that word pair. If the frequency of one specific label exceeds a threshold, for example ten times the occurrences of all other labels, we change the placeholder dep to that label.
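A minimal sketch of the majority-vote refinement, assuming the per-pair label counts have already been collected; the handling of pairs where only one label was ever observed is our own choice:

```python
from collections import Counter

def resolve_dep(edge_labels, threshold=10):
    """Majority-vote refinement of placeholder "dep" labels.

    `edge_labels` maps a word pair to a Counter of specific labels seen
    for it across the data set. The placeholder is replaced only when one
    label occurs at least `threshold` times as often as all the others
    combined; otherwise None is returned and the edge stays unresolved.
    """
    resolved = {}
    for pair, counts in edge_labels.items():
        (best, n), = counts.most_common(1)
        rest = sum(counts.values()) - n
        resolved[pair] = best if n >= threshold * max(rest, 1) else None
    return resolved

labels = {
    ("crude", "price"): Counter({"amod": 50, "nmod": 3}),   # clear winner
    ("pain", "breasts"): Counter({"nmod": 4, "amod": 3}),   # too close to call
}
print(resolve_dep(labels))
```

Edges left as None are exactly the cases handled later by deletion or bootstrapping.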
Preferably, step 3), training the part-of-speech tagger and the parser based on a neural network model, specifically comprises:
For each word in the sentence, establishing a fixed window centered on the word and extracting features, including the word form, capitalization, prefix and suffix;
For word features, using pre-trained word2vec embeddings; for capitalization and affixes, initializing the embeddings randomly;
Next, parsing the sentence with an ArcStandard-based dependency parsing system, using the features shown in the following table:
In the table, s_i (i = 1, 2, ...) denotes the i-th element from the top of the stack, b_i (i = 1, 2, ...) the i-th element of the buffer, and lc_k(s_i) and rc_k(s_i) the k-th leftmost and the k-th rightmost child of s_i; w denotes the word form, t the part-of-speech tag, and l the dependency label.
Using an off-the-shelf sentence-level dependency parser, the invention automatically generates a massive short-text dependency-analysis data set and applies several methods to denoise and optimize it. We trained a short-text dependency analysis model on this data set; experiments show that its annotation accuracy on short texts is greatly improved over the sentence-level parser.
Other features and advantages of the invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and attained by the structure particularly pointed out in the written description, the claims and the accompanying drawings.
Brief description of the drawings
The invention is described in detail below with reference to the accompanying drawings, so that the above advantages of the invention become clearer. In the drawings:
Fig. 1 is a schematic diagram of the overall structure of the dependency parser of the deep-learning-based short-text dependency analysis method of the invention;
Figs. 2 to 4 are schematic diagrams of the sentence analyses referred to in the background section of the invention;
Figs. 5 to 18 are schematic diagrams of the sentence analyses referred to in the embodiments of the invention.
Embodiments
Embodiments of the invention are described below in detail with reference to the drawings and examples, so that how the invention applies technical means to solve technical problems and achieve its technical effects can be fully understood and put into practice. It should be noted that, as long as no conflict arises, the embodiments of the invention and the features within them may be combined with one another, and the resulting technical schemes all fall within the protection scope of the invention.
In addition, the steps illustrated in the flowcharts of the drawings may be performed in a computer system such as one executing a set of computer-executable instructions, and, although a logical order is shown in the flowcharts, the steps shown or described may in some cases be performed in an order different from the one shown.
Specifically, the invention constructs an end-to-end system that, starting from search-engine logs and using an off-the-shelf sentence-level dependency parser, automatically generates a massive short-text dependency-analysis data set and denoises and optimizes it with several methods. We trained a short-text dependency analysis model on this data set; experiments show that its annotation accuracy on short texts is greatly improved over the sentence-level parser.
In a specific embodiment:
5.1. Data source
The data comes from the search logs of a search engine. For each query q in the search log and the frequently clicked results under it, we obtain for each URL the corresponding HTML document and extract the sentences that contain every word of the query as click sentences of the query. This yields triples (q, s, count). We then analyze each sentence s with the sentence-level dependency parser to obtain its dependency parse tree, which we assume to be largely correct.
5.2. Inferring the dependency parse tree
A short text q may have multiple corresponding click sentences. This step maps the dependency parse tree of one sentence s onto the short text q, using the following heuristic:
1. Let T_s denote the set of all subtrees of the dependency tree of s.
2. Find the minimal subtree t ∈ T_s such that every word x ∈ q has exactly one match x′ ∈ t.
3. Derive the dependency tree t_{q,s} of q from t as follows, for two words x and y in q:
a. If there is an edge x′ → y′ in t, create the edge x → y in t_{q,s}.
b. If there is a path from x′ to y′ in t consisting of edges in the same direction, create an edge x → y in t_{q,s}, temporarily mark its type as dep, and update it to a more specific dependency in the subsequent optimization step.
Below we classify the dependency relations common in short texts and their mapping from sentences.
Direct connection: in this case we directly copy the edge and its type from the sentence. Consider the sentence corresponding to the short text "party supplies cheap" (Fig. 5): in this sentence, the word pairs (party, supplies) and (supplies, cheap) are both directly connected, so the dependencies of the short text can be inherited directly from the relations in the sentence (Fig. 6).
Connection through a function word (functionword): omitting prepositions is very common in short-text queries. For the sentence corresponding to the short text "moonlanding" (Fig. 7), the mapping yields the dependency tree of Fig. 8. For the sentence corresponding to the short text "side effects b12" (Fig. 9), we obtain the dependency tree of Fig. 10. In both cases a temporary dependency edge of type "dep" appears, which we handle in a later step.
Connection through a modifier word: many search queries consist of noun phrases, and the corresponding sentences may omit many modifiers. Depending on the noun-phrase bracketing, the nouns may be either directly or indirectly connected.
For "offshore work" and its corresponding sentence, omitting the modifier "drilling" causes no problem: "offshore" and "work" remain directly connected, so the dependency can be inherited directly (Fig. 11). But for the short text "crude price" and its corresponding sentence this is not the case (Fig. 12): here, considering the path crude ← oil ← price, an edge can be inherited (Fig. 13).
Connection through a head word (headword): in some cases the head word of a noun phrase may be omitted. Consider "country singers" and its corresponding sentence (Fig. 14): their semantics clearly agree, but the head word "music" is omitted in the short text. Since the sentence still contains a path from "singers" to "country", the dependency tree of Fig. 15 can be obtained in turn.
Connection through a verb: a common example is the omission of a copula. Consider the sentence corresponding to "plants poisonous to goats" (Fig. 16): omitting "are" does not affect the direct connections among the words of the short text. But consider the short text "pain between breasts" and its corresponding sentence (Fig. 17): in this case the dependency can be inherited as shown in Fig. 18.
5.3. Merging dependency parse trees
In the previous step we obtained, for a short text q, a collection of dependency parse trees mapped from multiple corresponding sentences. These trees may not agree with one another, mainly because: 1. the sentence-level dependency parser is not perfect; 2. the short text itself may be ambiguous; 3. some short texts may have no semantically consistent sentence. The main purpose of this step is to merge these possibly inconsistent trees into the unique dependency tree of the short text.
To select a unique dependency tree for a short text q, we define a scoring function f to assess the quality of the dependency tree t_q generated from a corresponding sentence s of q. In f, (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge occurs over the whole data set, dist(x, y) is the distance between words x and y in the dependency parse tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms.
The first term of the scoring function captures the compactness of the short-text dependency tree: a compact tree tends to express the semantics of the short text more concisely. For example, the short text "deep learning" has the following two corresponding sentences: in the first, "deep" and "learning" are only loosely connected, so its semantics deviate greatly from those of the short text; in the second, the two words are directly connected, and the whole sentence has good semantic similarity with the short text.
The second term of the scoring function captures the global consistency of the short-text dependency tree. For a word pair (x, y), if over the whole data set x → y occurs far more often than y → x, then the latter is likely to be wrong. A special case to consider here is word order: if the two words appear in a different order in the short text, the corresponding grammatical relations may differ. For example, "child of" and "of child" both consist of the words "child" and "of", but their correct dependencies are different.
5.4. Result optimization
In the previous steps, the type of some dependency edges was set to the placeholder "dep". Before training a dependency parser on the resulting data set, we must infer a real label for each "dep"; otherwise specific and unspecific labels would coexist in the training data and cause inconsistency. For example, for the short text "crude price", the edge type obtained from a sentence containing "crude oil price" is dep, while the edge type obtained from a sentence containing "crude price" may be amod.
To infer "dep", we first apply majority voting. On our training data set, this resolves about 90% of the dependencies. Edges that cannot be resolved in this way carry no dependency information and could simply be deleted; however, since the other word pairs in these short texts may still contain significant information, we process them by bootstrapping (bootstrap): first, delete the short-text dependency data containing unresolved edge types and train a short-text parser; then run this parser on the remaining roughly 10% of the data, and whenever the predicted direction agrees with the data, backfill the specific type the parser outputs for the "dep" edge into the dependency tree; finally, add the backfilled dependency trees to the training set and retrain the dependency parser to obtain the final model.
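The bootstrap procedure above can be sketched as follows; the parser is abstracted to a hypothetical train/predict pair of callables, so this shows only the control flow, not a real parser:

```python
def bootstrap_labels(resolved, unresolved, train, predict):
    """Backfill unresolved "dep" edges with a bootstrapped parser.

    `resolved` are (pair, label) examples; `unresolved` are (pair,
    direction) edges whose type is still the placeholder. `train` builds
    a model from labeled examples; `predict` returns (direction, label)
    for a pair. Agreeing predictions are backfilled, then the model is
    retrained on the enlarged data.
    """
    model = train(resolved)                      # step 1: drop unresolved data
    backfilled = []
    for pair, direction in unresolved:           # step 2: predict the ~10% rest
        pred_dir, pred_label = predict(model, pair)
        if pred_dir == direction:                # directions agree -> backfill
            backfilled.append((pair, pred_label))
    return train(resolved + backfilled)          # step 3: retrain

# Toy stand-in parser: memorize labels, default to ("->", "nmod") otherwise.
train = lambda data: dict(data)
predict = lambda m, pair: ("->", m.get(pair, "nmod"))
final = bootstrap_labels([(("crude", "price"), "amod")],
                         [(("side", "effects"), "->")], train, predict)
print(final)
```

In the real system `train` and `predict` would be the neural short-text parser of Section 5.5.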
5.5. Short-text dependency analysis model
Short-text dependency analysis uses a neural-network-based dependency parser structure similar to that of (Danqi 2014). The main features used are as follows:
In the table, s_i (i = 1, 2, ...) denotes the i-th element from the top of the stack, b_i (i = 1, 2, ...) the i-th element of the buffer, and lc_k(s_i) and rc_k(s_i) the k-th leftmost and the k-th rightmost child of s_i; w denotes the word form, t the part-of-speech tag, and l the dependency label.
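The feature table did not survive in this text; the sketch below enumerates the standard feature positions of the Chen and Manning (2014) parser that the notation above describes — 18 word positions, 18 POS positions and 12 label positions — as a best-effort reconstruction rather than the patent's verbatim table:

```python
def feature_positions():
    """Enumerate the parser-configuration positions used as features.

    Following Chen & Manning (2014): words and POS tags are read from 18
    positions (top 3 of stack and buffer, the first and second leftmost
    and rightmost children of the top 2 stack items, and the leftmost-of-
    leftmost / rightmost-of-rightmost grandchildren); dependency labels
    are read from the 12 child and grandchild positions only.
    """
    core = ["s1", "s2", "s3", "b1", "b2", "b3"]
    children = [f"{c}({s})" for s in ("s1", "s2")
                for c in ("lc1", "rc1", "lc2", "rc2")]
    grand = [f"{c}({c}({s}))" for s in ("s1", "s2") for c in ("lc1", "rc1")]
    word_pos = core + children + grand        # 6 + 8 + 4 = 18 positions
    labels = children + grand                 # 8 + 4 = 12 positions
    return word_pos, labels

words, labels = feature_positions()
print(len(words), len(labels))  # 18 12
```

Each position contributes its word embedding and POS-tag embedding, and the 12 child positions additionally contribute a dependency-label embedding, all concatenated as parser input.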
It should be noted that, for brevity of description, the above method embodiments are expressed as a series of action combinations; however, those skilled in the art should know that the application is not limited by the described order of actions, because according to the application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the application.
Those skilled in the art should understand that embodiments of the application may be provided as a method, a system, or a computer program product. Therefore, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, and the like) that contain computer-usable program code.
Finally, it should be noted that the foregoing is merely a description of the preferred embodiments of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or replace some of their technical features with equivalents. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (5)
- 1. A short text dependency analysis method based on deep learning, characterized by comprising: step 1) obtaining, from search engine logs, the HTML files in which user query sentences occur, as a training data set; step 2) generating dependency analysis trees for the query sentences from the training data set; step 3) training a part-of-speech tagger and a parser based on a neural network model using the dependency trees.
- 2. The short text dependency analysis method based on deep learning according to claim 1, characterized in that step 1) specifically comprises: for each query q in the search log and for the URLs with a high user click-through rate under the results of that search, obtaining the corresponding HTML documents; extracting every sentence s that contains each word of the query, thereby obtaining a number of triples (q, s, count), where count denotes the number of times the sentence occurs; the resulting set of triples serves as the training data set for generating dependency analysis trees.
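A toy sketch of the triple extraction in claim 2, under the assumption that count is how often a qualifying sentence occurs across the clicked documents, with naive period-based sentence splitting (all names illustrative):

```python
from collections import Counter

def extract_triples(query, clicked_docs):
    """Keep every sentence from the clicked documents (here: plain
    text) that contains all words of the query, and emit
    (q, s, count) triples, where count is how often the sentence
    occurs. Real HTML parsing and sentence splitting are omitted."""
    q_words = set(query.lower().split())
    counts = Counter()
    for doc_text in clicked_docs:
        for sent in doc_text.split("."):
            sent = sent.strip()
            if sent and q_words <= set(sent.lower().split()):
                counts[sent] += 1
    return [(query, s, c) for s, c in counts.items()]
```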
- 3. The short text dependency analysis method based on deep learning according to claim 2, characterized in that a short text may correspond to multiple sentences clicked by users, and in that generating the dependency analysis tree for a short text q from a sentence s specifically comprises: letting Ts denote all subtrees of the dependency tree of s; finding the minimal subtree t ∈ Ts such that each word x ∈ q has one and only one match x′ ∈ t; for any two words x and y in q, generating the dependency tree tq,s of q from t in the following way: if there is an edge x′ → y′ in t, creating an identical edge x → y in tq,s; if there is a path from x′ to y′ in t, creating an edge x → y in tq,s and temporarily marking it as dep. After a dependency tree has been generated for each sentence, a unique dependency tree must be selected for the short text. We define a scoring function f to assess the quality of a dependency tree tq generated from a sentence s corresponding to q: f(tq, s) = Σ(x→y)∈tq [ −α·dist(x, y) + log( count(x→y) / count(y→x) ) ], where (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge occurs over the whole data set, dist(x, y) is the distance between words x and y in the dependency analysis tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms; finally, the labels are refined.
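The scoring function of claim 3 transcribes directly into code; the dictionaries of precomputed distances and edge counts, and the smoothing of zero counts to 1, are added assumptions:

```python
import math

def score_tree(tq_edges, dist, edge_count, alpha=1.0):
    """f(t_q, s) = sum over edges (x -> y) of t_q of
    -alpha * dist(x, y) + log(count(x -> y) / count(y -> x)).
    `dist` maps word pairs to their distance in the source sentence's
    parse tree; `edge_count` maps directed edges to data-set counts."""
    total = 0.0
    for x, y in tq_edges:
        forward = edge_count.get((x, y), 1)   # count(x -> y), smoothed
        backward = edge_count.get((y, x), 1)  # count(y -> x), smoothed
        total += -alpha * dist[(x, y)] + math.log(forward / backward)
    return total
```

The tree with the highest score among all candidate sentences would then be kept as the short text's unique dependency tree.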
- 4. The short text dependency analysis method based on deep learning according to claim 3, characterized in that the type of some dependency edges is set to the placeholder "dep", and "dep" must be resolved to a real label, since it would otherwise cause inconsistencies in the training data set; to solve this problem we use majority voting (majority vote), comprising: for each edge marked dep, counting over the training data set the number of occurrences of each concrete label for that edge; if the frequency of one concrete label exceeds a threshold, for example 10 times the occurrence count of every other label, we replace the placeholder dep with that label.
- 5. The short text dependency analysis method based on deep learning according to claim 1, characterized in that step 3), training the part-of-speech tagger and the parser based on a neural network model, specifically comprises: for each word in a sentence, establishing a fixed-size window centred on the word and extracting features, including the word itself, its capitalisation, prefix, and suffix; for word features, using pre-trained word2vec embeddings; for capitalisation and affix features, initialising the embeddings randomly; next, analysing the sentence with an ArcStandard-based dependency analysis system, using the features shown in the table, in which si (i = 1, 2, …) denotes the i-th element from the top of the stack, bi (i = 1, 2, …) denotes the i-th element of the buffer, lck(si) and rck(si) denote the k-th leftmost child and the k-th rightmost child of si, w denotes the word itself, t denotes the part-of-speech tag, and l denotes the dependency-relation label.
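The ArcStandard system named in claim 5 can be sketched as follows. This is a minimal unlabeled sketch; the `oracle` stands in for the neural network that, in the patent, scores the features of each configuration:

```python
def arc_standard_parse(tokens, oracle):
    """Minimal arc-standard transition system: a stack, a buffer of
    token indices, and three transitions (SHIFT, LEFT-ARC, RIGHT-ARC).
    `oracle(stack, buffer)` picks the next transition at each step.
    Returns unlabeled (head, dependent) arcs."""
    stack, buffer, arcs = [], list(range(len(tokens))), []
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":    # attach s2 as dependent of s1
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":   # attach s1 as dependent of s2
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs
```

For the two-word sentence "I eat", the transition sequence SHIFT, SHIFT, LEFT-ARC attaches "I" as a dependent of "eat".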
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710934201.2A CN107656921B (en) | 2017-10-10 | 2017-10-10 | Short text dependency analysis method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107656921A true CN107656921A (en) | 2018-02-02 |
CN107656921B CN107656921B (en) | 2021-01-08 |
Family
ID=61117779
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710934201.2A Active CN107656921B (en) | 2017-10-10 | 2017-10-10 | Short text dependency analysis method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107656921B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108491512A (en) * | 2018-03-23 | 2018-09-04 | 北京奇虎科技有限公司 | The method of abstracting and device of headline |
CN108647785A (en) * | 2018-05-17 | 2018-10-12 | 普强信息技术(北京)有限公司 | A kind of neural network method for automatic modeling, device and storage medium |
CN110189751A (en) * | 2019-04-24 | 2019-08-30 | 中国联合网络通信集团有限公司 | Method of speech processing and equipment |
CN111523302A (en) * | 2020-07-06 | 2020-08-11 | 成都晓多科技有限公司 | Syntax analysis method and device, storage medium and electronic equipment |
WO2020232943A1 (en) * | 2019-05-23 | 2020-11-26 | 广州市香港科大霍英东研究院 | Knowledge graph construction method for event prediction and event prediction method |
CN112446405A (en) * | 2019-09-04 | 2021-03-05 | 杭州九阳小家电有限公司 | User intention guiding method for home appliance customer service and intelligent home appliance |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102298642A (en) * | 2011-09-15 | 2011-12-28 | 苏州大学 | Method and system for extracting text information |
CN102968431A (en) * | 2012-09-18 | 2013-03-13 | 华东师范大学 | Control device for mining relation between Chinese entities on basis of dependency tree |
US20130117010A1 (en) * | 2010-07-13 | 2013-05-09 | Sk Planet Co., Ltd. | Method and device for filtering a translation rule and generating a target word in hierarchical-phase-based statistical machine translation |
CN103473223A (en) * | 2013-09-25 | 2013-12-25 | 中国科学院计算技术研究所 | Rule extraction and translation method based on syntax tree |
CN105740235A (en) * | 2016-01-29 | 2016-07-06 | 昆明理工大学 | Phrase tree to dependency tree transformation method capable of combining Vietnamese grammatical features |
CN105893346A (en) * | 2016-03-30 | 2016-08-24 | 齐鲁工业大学 | Graph model word sense disambiguation method based on dependency syntax tree |
CN106598951A (en) * | 2016-12-23 | 2017-04-26 | 北京金山办公软件股份有限公司 | Dependency structure treebank acquisition method and system |
CN106776686A (en) * | 2016-11-09 | 2017-05-31 | 武汉泰迪智慧科技有限公司 | Chinese domain short text understanding method and system based on many necks |
CN107168948A (en) * | 2017-04-19 | 2017-09-15 | 广州视源电子科技股份有限公司 | A kind of sentence recognition methods and system |
- 2017-10-10: CN application CN201710934201.2A filed; granted as CN107656921B (status: Active)
Non-Patent Citations (1)
Title |
---|
沈超 (Shen Chao): "Research on Transition-based Dependency Parsing", China Master's Theses Full-text Database, Information Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN107656921B (en) | 2021-01-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107656921A (en) | A kind of short text dependency analysis method based on deep learning | |
CN104462057B (en) | For the method and system for the lexicon for producing language analysis | |
Orosz et al. | PurePos 2.0: a hybrid tool for morphological disambiguation | |
CN101251862B (en) | Content-based problem automatic classifying method and system | |
CN104615589A (en) | Named-entity recognition model training method and named-entity recognition method and device | |
CN103488724A (en) | Book-oriented reading field knowledge map construction method | |
CN108509409A (en) | A method of automatically generating semantic similarity sentence sample | |
CN112925563B (en) | Code reuse-oriented source code recommendation method | |
CN103593335A (en) | Chinese semantic proofreading method based on ontology consistency verification and reasoning | |
CN102693279A (en) | Method, device and system for fast calculating comment similarity | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
Feldman et al. | TEG—a hybrid approach to information extraction | |
CN107943940A (en) | Data processing method, medium, system and electronic equipment | |
CN113064985A (en) | Man-machine conversation method, electronic device and storage medium | |
CN104391969A (en) | User query statement syntactic structure determining method and device | |
Guo et al. | Chase: A large-scale and pragmatic chinese dataset for cross-database context-dependent text-to-sql | |
Stepanov et al. | Language style and domain adaptation for cross-language SLU porting | |
CN117473054A (en) | Knowledge graph-based general intelligent question-answering method and device | |
CN113807102B (en) | Method, device, equipment and computer storage medium for establishing semantic representation model | |
CN101246473B (en) | Segmentation system evaluating method and segmentation evaluating system | |
Murugathas et al. | Domain specific question & answer generation in tamil | |
Lee | Natural Language Processing: A Textbook with Python Implementation | |
Gleize et al. | A unified kernel approach for learning typed sentence rewritings | |
JP2018010481A (en) | Deep case analyzer, deep case learning device, deep case estimation device, method, and program | |
CN107451295B (en) | Method for obtaining deep learning training data based on grammar network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||