CN107656921B - Short text dependency analysis method based on deep learning

Publication number: CN107656921B
Application number: CN201710934201.2A
Authority: CN (China)
Prior art keywords: dependency, sentence, short text, tree, data set
Legal status: Active (granted)
Other versions: CN107656921A (Chinese)
Inventors: 肖仰华, 谢晨昊, 梁家卿, 崔万云
Current and original assignee: Shanghai Shuyan Technology Development Co ltd
Application filed by Shanghai Shuyan Technology Development Co ltd
Priority and filing date: 2017-10-10
Publication of CN107656921A: 2018-02-02
Publication of CN107656921B (grant): 2021-01-08

Classifications

    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking (under G06F 40/20 Natural language analysis, G06F 40/279 Recognition of textual entities)
    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (under G06F 40/205 Parsing)
    • G06N 3/08: Learning methods (under G06N 3/02 Neural networks)


Abstract

The invention discloses a short text dependency analysis method based on deep learning, which comprises the following steps: step 1) acquiring, from a search engine log, the HTML files in which user query statements are located as a training data set; step 2) generating dependency analysis trees of the query statements according to the training data set; and step 3) training a part-of-speech annotator and a syntactic analyzer based on a neural network model using the dependency trees. The invention uses an existing sentence-level dependency analyzer to automatically generate a massive short text dependency analysis data set, and applies several methods to denoise and optimize the generated data set. A dependency analysis model for short text is trained on this data set, and experiments show that its annotation quality on short text is greatly improved compared with a sentence-level dependency analyzer.

Description

Short text dependency analysis method based on deep learning
Technical Field
The invention relates to the field of natural language processing, and in particular to a short text dependency analysis method based on deep learning.
Background
Phrase structure and dependency structure are the two most widely studied types of grammatical structure in current syntactic analysis. Dependency grammar was originally proposed by the French linguist L. Tesnière in his work "Éléments de syntaxe structurale" (Elements of Structural Syntax, 1959). Dependency grammar reveals the syntactic structure of a sentence by analyzing the dependency relationships among the components of a language unit: it holds that the core verb of a sentence is the central component that governs the other components while itself being governed by none, and that every governed component depends on its governor through some dependency relationship.
For example, for the text "Its apple gathering stand is my favorite stand", the dependency analysis tree obtained after dependency analysis is shown in FIG. 2:
From the dependency analysis tree, the overall syntactic structure of the sentence and the modification relations between words can be clearly read, and the semantics of the sentence can be understood to a certain extent.
Dependency analysis of short text is important for understanding its grammatical composition, the parts of speech of its words, and its semantics. Consider the following search queries and their corresponding syntactic structures, as in FIG. 3:
the result of the short text "cover iphone 6 plus" indicates the protective shell (cover) of the body of this phrase, and the user's desire is to find the protective shell of the iphone, rather than the iphone. Based on the knowledge, the search engine can reasonably display the related advertisements of the iphone protective shell. For "distance earth", the subject is distance (distance), indicating that the user's intention is to ask for the distance between the earth (earth) and the moon (moon). For the faucet adapter simple, the intent is to find the faucet adapter. In short, if the dependency relationship of the short text can be correctly identified, the relationship between the core main body and the modification in the short text can be extracted, so that the semantics of the short text can be better understood.
The main challenges of dependency analysis on short text are:
1. In short texts, there are usually no complete grammatical elements to assist the analysis. In fact, short texts are often highly ambiguous. For example, the short text "kids toys" may mean either "toys for kids" or "kids with toys"; in these two readings the dependency edge between toys and kids points in opposite directions, as in FIG. 4.
2. There are no linguistic rules for dependency analysis of short texts. In manual annotation of dependency structures, the lack of an annotation standard can make the labels unclear. Moreover, the cost of manual labeling is huge: completing a dependency annotation set can take years.
In dependency analysis, the semantic information of a short text is mainly contained in its dependency edges. That is, for any two words x, y ∈ q in the short text, we determine whether there is a dependency between x and y, and if so, which dependency it is.
To make this determination, the semantic information of the short text that can be utilized falls into two main categories: context-free information and context-related information.
● Context-free information: with context-free information, we model P(e | x, y) directly, where e denotes the dependency edge between x and y (x → y or x ← y). This modeling approach is context-free because it does not consider the relative positions of x and y in the input.
One way to obtain P(e | x, y) is from an annotated corpus such as Google's Syntactic N-grams data set. For two words x and y, we estimate P(e | x, y) by counting the number of times x modifies y and the number of times y modifies x in the corpus, as in the counting sketch below.
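As an illustration, the following minimal sketch estimates this probability by the relative frequency of directed arc counts. The corpus interface, the observe() helper, and the fallback for unseen word pairs are assumptions of the sketch, not part of the invention.

```python
from collections import Counter

# (modifier, head) -> number of times the modifier was observed attached to the head
arc_counts: Counter = Counter()

def observe(modifier: str, head: str, n: int = 1) -> None:
    """Record n occurrences of the arc modifier -> head in the corpus."""
    arc_counts[(modifier, head)] += n

def p_edge(x: str, y: str) -> dict:
    """Relative-frequency estimate of P(x -> y) versus P(y -> x)."""
    fwd, bwd = arc_counts[(x, y)], arc_counts[(y, x)]
    total = fwd + bwd
    if total == 0:
        return {"x->y": 0.5, "y->x": 0.5}  # no evidence either way
    return {"x->y": fwd / total, "y->x": bwd / total}

# Hypothetical counts: "iphone" mostly modifies "cover" in the corpus
observe("iphone", "cover", 120)
observe("cover", "iphone", 7)
print(p_edge("iphone", "cover"))  # x->y (iphone modifies cover) dominates
```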
● Context-related information: using only context-free information has two main disadvantages: 1) directly judging the relation between two words without considering context is risky; 2) context-free information often cannot characterize the type of the dependency between two words, and thus cannot fully represent the semantics of the whole input.
To take context into account, i.e., to estimate P(e | x, y, q) for any two words x, y, our goal becomes constructing a dependency parser designed for short text. Building such a dependency analyzer requires a massive training data set, so we devised a method for generating this data set automatically, avoiding the cost of manual labeling. The whole method is based on the following assumption: the intention of a short text q coincides with the intention of a click sentence of q. We say a sentence s is a click sentence of a short text q if and only if: 1) sentence s is clicked many times by users in the search results of q; 2) every word in q appears in s. For example, if the sentence s = "… my favorite Thai food in Houston …" is a click sentence of a short text, the overall intentions of the two are similar, and the dependency relation between a word pair in the short text is similar to the relation between the corresponding word pair in the sentence. However, since a word pair in the short text may not be directly connected in the sentence, a method is still needed to map the dependency relations in the sentence onto the short text reasonably.
In recent years, deep learning has proven highly applicable to natural language processing (NLP) problems. As early as the beginning of the 21st century, a neural-network-based language model was proposed, opening the way for applying deep learning to natural language processing tasks. Subsequent studies showed that deep learning based on convolutional neural networks performs excellently in many natural language processing tasks such as part-of-speech tagging, chunking, and named entity recognition. Later still, with the popularization of recurrent neural networks, deep learning achieved even better performance on NLP problems and found wider application in fields such as machine translation.
Disclosure of Invention
The invention aims to solve the technical problem of providing a short text dependency analysis method based on deep learning, which is used for solving the problems in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a short text dependency analysis method based on deep learning comprises the following steps:
step 1) acquiring an HTML file where a user query statement is located from a search engine log as a training data set;
step 2) generating a dependency analysis tree of the query statement according to the training data set;
and 3) training a part of speech annotator and a syntactic analyzer based on the neural network model by using the dependency tree.
Preferably, the step 1) specifically includes:
for each query q in the search log and the URL list with a high user click rate under the search result, acquiring an HTML document corresponding to the query q;
the sentences s containing every word of the query are extracted, so that a number of triples are obtained: (q, s, count), wherein count represents the number of occurrences in the sentence;
the resulting triplet set is used as the training data set for generating the dependency analysis tree.
Preferably, a short text may have a plurality of corresponding sentences clicked by users, wherein further, generating a dependency analysis tree for the short text q from a sentence s specifically includes:
letting T_s denote the set of all subtrees of the dependency tree of s;
finding the minimum subtree t ∈ T_s such that each word x ∈ q has one and only one match x' ∈ t;
for any two words x and y in q, generating a dependency tree t_{q,s} for q from t in the following manner:
if there is an edge x' → y' in t, creating the same edge x → y in t_{q,s};
if there is a path from x' to y' in t, creating an edge x → y in t_{q,s} and temporarily marking it as dep.
After a dependency tree has been generated from each sentence, a unique dependency tree needs to be selected for the short text. We define a scoring function f to evaluate the quality of the dependency tree t_{q,s} generated from a corresponding sentence s of q:

f(t_{q,s}) = Σ_{(x→y) ∈ t_{q,s}} [ 1/dist(x, y) + α · count(x→y) ]

wherein (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge appears over the whole data set, dist(x, y) is the distance between the words x and y in the dependency analysis tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms;
the label is finally refined.
Preferably, the types of some dependency edges are set to the placeholder "dep", which must be inferred as a real label; otherwise, inconsistencies would arise in the training data set;
to solve this problem, we use majority voting;
the method comprises the following steps: for any edge x → y labeled dep, count the number of occurrences of x → y with each specific label in the training data set. If the frequency of a particular label is greater than a threshold, for example if it occurs 10 times more often than the other labels, we change the placeholder dep to that label.
Preferably, the step 3) of training the part of speech annotator and the syntactic analyzer based on the neural network model specifically includes:
establishing a fixed window centered on each word of the sentence, and extracting features including the word itself, capitalization, and prefixes and suffixes;
for word features, pre-trained word2vec embeddings are used; for capitalization and prefix/suffix features, the embeddings are randomly initialized;
next, the sentence is parsed using an arc-standard dependency analysis system with the following features:

Word features S^w: s1.w, s2.w, s3.w, b1.w, b2.w, b3.w; lc1(si).w, rc1(si).w, lc2(si).w, rc2(si).w (i = 1, 2); lc1(lc1(si)).w, rc1(rc1(si)).w (i = 1, 2)
Part-of-speech features S^t: the part-of-speech tags of the 18 elements above
Dependency label features S^l: the dependency labels of the 12 child elements above

In the table, s_i (i = 1, 2, …) denotes the i-th element from the top of the stack, b_i (i = 1, 2, …) denotes the i-th element of the buffer, and lc_k(s_i) and rc_k(s_i) denote the k-th leftmost and k-th rightmost child nodes of s_i. w denotes the word itself, t a part-of-speech tag, and l a dependency label.
The invention uses an existing sentence-level dependency analyzer to automatically generate a massive short text dependency analysis data set, and applies several methods to denoise and optimize the generated data set. A dependency analysis model for short text is trained on this data set, and experiments show that its annotation quality on short text is greatly improved compared with a sentence-level dependency analyzer.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
The present invention will be described in detail below with reference to the accompanying drawings, so that the above advantages of the present invention become more apparent. In the drawings:
FIG. 1 is a schematic diagram of the overall structure of a dependency analyzer of the deep learning-based short text dependency analysis method of the present invention;
FIG. 2 is a diagram of sentence analysis related to the background art of the present invention;
FIG. 3 is a diagram of sentence analysis related to the background art of the present invention;
FIG. 4 is a diagram of sentence analysis related to the background art of the present invention;
FIG. 5 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 6 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 7 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 8 is a diagram of sentence analysis in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 10 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 11 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 12 is a diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 13 is a diagram of sentence analysis in accordance with an embodiment of the present invention;
FIG. 14 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 15 is a schematic diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 16 is a diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 17 is a diagram of sentence analysis involved in an embodiment of the present invention;
FIG. 18 is a diagram of sentence analysis according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail with reference to the accompanying drawings and examples, so that how to apply the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as no conflict is formed, the embodiments and the features of the embodiments of the present invention may be combined with each other, and the technical solutions formed are within the scope of the present invention.
Additionally, the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions, and while a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different than here.
Specifically, the invention constructs an end-to-end system, automatically generates massive short text dependency analysis data sets from a search engine log by using an existing sentence-level dependency analyzer, and performs noise reduction and optimization on the generated data sets by using various methods. Based on the data set, a dependence analysis model of the short text is trained, and experiments show that the labeling effect of the model on the short text is greatly improved compared with that of a sentence-level dependence analyzer.
Specifically, the method for analyzing the short text dependence based on deep learning comprises the following steps:
step 1) acquiring an HTML file where a user query statement is located from a search engine log as a training data set;
step 2) generating a dependency analysis tree of the query statement according to the training data set;
and 3) training a part of speech annotator and a syntactic analyzer based on the neural network model by using the dependency tree.
Preferably, the step 1) specifically includes:
for each query q in the search log and the URL list with a high user click rate under the search result, acquiring an HTML document corresponding to the query q;
the sentences s containing every word of the query are extracted, so that a number of triples are obtained: (q, s, count), wherein count represents the number of occurrences in the sentence;
the resulting triplet set is used as the training data set for generating the dependency analysis tree.
Preferably, a short text may have a plurality of corresponding sentences clicked by users, wherein further, generating a dependency analysis tree for the short text q from a sentence s specifically includes:
letting T_s denote the set of all subtrees of the dependency tree of s;
finding the minimum subtree t ∈ T_s such that each word x ∈ q has one and only one match x' ∈ t;
for any two words x and y in q, generating a dependency tree t_{q,s} for q from t in the following manner:
if there is an edge x' → y' in t, creating the same edge x → y in t_{q,s};
if there is a path from x' to y' in t, creating an edge x → y in t_{q,s} and temporarily marking it as dep.
After a dependency tree has been generated from each sentence, a unique dependency tree needs to be selected for the short text. We define a scoring function f to evaluate the quality of the dependency tree t_{q,s} generated from a corresponding sentence s of q:

f(t_{q,s}) = Σ_{(x→y) ∈ t_{q,s}} [ 1/dist(x, y) + α · count(x→y) ]

wherein (x → y) denotes an edge of the tree, count(x → y) is the number of times this edge appears over the whole data set, dist(x, y) is the distance between the words x and y in the dependency analysis tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms;
the label is finally refined.
Preferably, the types of some dependency edges are set to the placeholder "dep", which must be inferred as a real label; otherwise, inconsistencies would arise in the training data set;
to solve this problem, we use majority voting;
the method comprises the following steps: for any edge x → y labeled dep, count the number of occurrences of x → y with each specific label in the training data set. If the frequency of a particular label is greater than a threshold, for example if it occurs 10 times more often than the other labels, we change the placeholder dep to that label.
Preferably, the step 3) of training the part of speech annotator and the syntactic analyzer based on the neural network model specifically includes:
establishing a fixed window centered on each word of the sentence, and extracting features including the word itself, capitalization, and prefixes and suffixes;
for word features, pre-trained word2vec embeddings are used; for capitalization and prefix/suffix features, the embeddings are randomly initialized (a sketch of this feature construction is given below);
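As an illustration of this feature construction, the sketch below builds the concatenated input vector for one token; the window radius, the two-character affixes, and the names of the lookup tables are assumptions made for the example.

```python
import numpy as np

def window_features(words, i, word_emb, case_emb, affix_emb, radius=2):
    """Concatenate embeddings over a fixed window centred on words[i]:
    the word itself (pretrained word2vec), capitalization, and the prefix
    and suffix (randomly initialized lookup tables)."""
    parts = []
    for j in range(i - radius, i + radius + 1):
        w = words[j] if 0 <= j < len(words) else "<PAD>"
        parts.append(word_emb.get(w.lower(), word_emb["<UNK>"]))         # word2vec
        parts.append(case_emb["upper" if w[:1].isupper() else "lower"])  # case
        parts.append(affix_emb.get(w[:2], affix_emb["<UNK>"]))           # prefix
        parts.append(affix_emb.get(w[-2:], affix_emb["<UNK>"]))          # suffix
    return np.concatenate(parts)
```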
next, the sentence is parsed using an arc-standard dependency analysis system with the following features:

Word features S^w: s1.w, s2.w, s3.w, b1.w, b2.w, b3.w; lc1(si).w, rc1(si).w, lc2(si).w, rc2(si).w (i = 1, 2); lc1(lc1(si)).w, rc1(rc1(si)).w (i = 1, 2)
Part-of-speech features S^t: the part-of-speech tags of the 18 elements above
Dependency label features S^l: the dependency labels of the 12 child elements above

In the table, s_i (i = 1, 2, …) denotes the i-th element from the top of the stack, b_i (i = 1, 2, …) denotes the i-th element of the buffer, and lc_k(s_i) and rc_k(s_i) denote the k-th leftmost and k-th rightmost child nodes of s_i. w denotes the word itself, t a part-of-speech tag, and l a dependency label.
The invention uses an existing sentence-level dependency analyzer to automatically generate a massive short text dependency analysis data set, and applies several methods to denoise and optimize the generated data set. A dependency analysis model for short text is trained on this data set, and experiments show that its annotation quality on short text is greatly improved compared with a sentence-level dependency analyzer.
In one embodiment:
5.1. data source
The data source is the search log of a search engine. For each query q in the search log and the list of URLs frequently clicked by users under its search results, we acquire the HTML document corresponding to each URL and extract the sentences containing every word of the query as click sentences of the query. This yields triples:
(q, s, count). Then we run the sentence-level dependency analyzer on each sentence s to obtain its dependency analysis tree, which we consider to be basically correct.
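A minimal sketch of this collection step is given below. The fetch_html callable and the crude sentence splitter are stand-ins for real crawling and tokenization code; only the (q, s, count) bookkeeping follows the text above.

```python
import re
from collections import Counter

def sentences_of(html: str) -> list[str]:
    """Very crude HTML-to-sentence extraction, for illustration only."""
    text = re.sub(r"<[^>]+>", " ", html)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def click_sentence_triples(query: str, clicked_urls: list[str], fetch_html):
    """Return (q, s, count) triples for sentences containing every query word."""
    q_words = set(query.lower().split())
    counts: Counter = Counter()
    for url in clicked_urls:
        for sent in sentences_of(fetch_html(url)):
            if q_words <= set(sent.lower().split()):  # every query word occurs in s
                counts[sent] += 1
    return [(query, s, c) for s, c in counts.items()]
```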
5.2. Inferring dependency parse trees
A short text q may have a plurality of corresponding sentences clicked by users. This step maps the dependency parse tree of one such sentence s onto the short text q.
We use the following heuristic to map the dependency relations of the sentence s onto the short text q; a code sketch follows the list.
1. Let T_s denote the set of all subtrees of the dependency tree of s.
2. Find the minimum subtree t ∈ T_s such that each word x ∈ q has one and only one match x' ∈ t.
3. Inherit q's dependency tree t_{q,s} from t in the following manner: for two words x and y in q:
a. if there is an edge x' → y' in t, create the same edge x → y in t_{q,s};
b. if there is a path from x' to y' in t (a path consists of a series of co-directional edges), create an edge x → y in t_{q,s}, temporarily mark its type as dep, and update it to a more specific dependency in a subsequent optimization step.
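The sketch below illustrates rules (a) and (b) of this heuristic. The tree encoding (a map from each token index to its head index and label) and the precomputed alignment of query words to their unique matches are representational assumptions; finding the minimum subtree is taken as already done.

```python
def project_tree(q_words, align, heads):
    """Project the arcs of the chosen subtree t onto the short text q.
    align: query word -> index of its unique match x' in t.
    heads: token index -> (head index, label) in the sentence parse.
    Returns modifier word -> (head word, label), the arcs of t_{q,s}."""
    def has_path(i, j):
        # directed path from i up to j along head links (co-directional edges)
        while i in heads:
            i = heads[i][0]
            if i == j:
                return True
        return False

    arcs = {}
    for x in q_words:                  # rule (a): copy direct edges with their labels
        for y in q_words:
            if x != y and align[x] in heads and heads[align[x]][0] == align[y]:
                arcs[x] = (y, heads[align[x]][1])
    for x in q_words:                  # rule (b): a mere path becomes placeholder "dep"
        if x in arcs:
            continue
        for y in q_words:
            if x != y and has_path(align[x], align[y]):
                arcs[x] = (y, "dep")
                break
    return arcs
```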
The following classifies the dependency relationships common in short texts and the mapping method from sentences.
Direct connection: in this case, we directly copy the edges and their types from the sentence. Consider a sentence corresponding to the short text "party supplies", as shown in FIG. 5:
in this situation, the words "party" and "supplies" are directly connected in the sentence. Therefore, the dependency relation of the short text can be obtained by directly inheriting the relation in the sentence, as shown in FIG. 6:
Connection through function words: in short text queries, prepositions are commonly omitted. For example, consider a sentence corresponding to the short text "moon landing", as shown in FIG. 7:
we can map the following dependency tree, as in FIG. 8:
For the sentence corresponding to the short text "side effects b12", as shown in FIG. 9:
the following dependency tree can be obtained, as in FIG. 10:
in both cases, a temporary "dep" type dependency edge will appear, which we will process in the following step.
Connection through modifiers: many search queries are noun phrases, and their corresponding sentences may contain modifiers that the query omits. Depending on how the noun phrase is bracketed, the remaining words may end up directly or indirectly connected.
For "offset work" and its corresponding sentence, omitting the modifier "drilling" poses no problem: "offset" and "work" are still directly connected, so the dependency relation can be inherited directly, as shown in FIG. 11.
But this is not the case for the short text "raw price" and its corresponding sentence, as in FIG. 12.
In this case, considering the path raw ← oil ← price, an edge can be inherited, as in FIG. 13.
Connection through a head word: in some cases, the head word of a noun phrase may be omitted. Consider "country singers" and its corresponding sentence, as in FIG. 14:
obviously their semantics are consistent, but the head word "music" is omitted in the short text. There is still a path from "singers" to "country" in the sentence, and the dependency tree can be obtained accordingly, as shown in FIG. 15.
Connection through verbs: a common example is the omission of linking verbs. Consider the sentence corresponding to the example "plants poisonous to targets", as in FIG. 16: in this case, omitting "are" does not affect the direct connection of the words in the short text. But consider the short text "pain in between branches" and its corresponding sentence, as in FIG. 17:
in this case, the dependency relation can still be inherited, as in FIG. 18:
5.3. merging dependency parse trees
In the previous step, for a short text q we obtained a set of dependency parse trees derived from the mappings of its multiple corresponding sentences. These dependency trees may be inconsistent, mainly because: 1. the sentence-level dependency parser is not perfect; 2. the short text itself may be ambiguous, and some short texts may have no semantically consistent sentences at all. The main purpose of this step is to merge these possibly differing dependency trees into the unique dependency tree of the short text.
To select a unique dependency tree for a short text q, we define a scoring function f to evaluate the quality of the dependency tree t_{q,s} generated from a corresponding sentence s of q:

f(t_{q,s}) = Σ_{(x→y) ∈ t_{q,s}} [ 1/dist(x, y) + α · count(x→y) ]

where (x → y) denotes an edge of the tree, count(x → y) is the number of occurrences of this edge over the entire data set, dist(x, y) is the distance between the words x and y in the dependency analysis tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms.
The first term of the scoring function characterizes the compactness of the short text dependency analysis tree; a more compact dependency tree characterizes the semantics of the short text more simply. For example, the short text "deep learning" has two kinds of corresponding sentences:
in the first, the connection between "deep" and "learning" is very loose, so its semantics deviates substantially from the short text; in the second, the two words are directly connected, and the whole sentence has good semantic similarity to the short text.
The second term of the scoring function characterizes the global consistency of the short text dependency analysis tree. For a word pair (x, y), if the edge x → y occurs far more often than the edge y → x over the entire data set, the latter is likely to be erroneous. One special case to consider here is word order: if two words appear in different orders in short texts, their grammatical relations may differ. For example, "child of" and "of child" are both composed of the two words "child" and "of", but their correct dependencies are not the same.
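The following sketch applies this scoring to select the final tree. Since the patent gives the formula only as a figure, the exact combination of the two terms here (1/dist for compactness plus an α-weighted count term for consistency) is a reconstruction from the definitions above, not the authoritative form.

```python
def tree_score(arcs, dist, edge_counts, alpha=0.5):
    """Score one candidate tree t_{q,s}.
    arcs: iterable of (x, y) edges; dist(x, y): distance between x and y in the
    original sentence's parse tree; edge_counts[(x, y)]: occurrences of the
    edge x -> y over the whole data set; alpha: importance trade-off."""
    compactness = sum(1.0 / dist(x, y) for x, y in arcs)            # first term
    consistency = sum(edge_counts.get((x, y), 0) for x, y in arcs)  # second term
    return compactness + alpha * consistency

def best_tree(candidates, edge_counts, alpha=0.5):
    """candidates: list of (arcs, dist) pairs, one per click sentence of q."""
    return max(candidates,
               key=lambda c: tree_score(c[0], c[1], edge_counts, alpha))[0]
```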
5.4. Result optimization
In the previous step, the types of some dependency edges were set to the placeholder "dep". Before training the dependency analyzer on the resulting data set, we must infer the real label behind each "dep"; otherwise the training data would mix specific and unspecific labels for the same construction and become inconsistent. For example, the same edge of a short text may be labeled dep when derived from one click sentence and amod when derived from another.
To infer "dep", we first use the majority voting method described above: for each dep edge, count the occurrences of each specific label on that edge in the training data, and adopt a label whose frequency exceeds the threshold.
On our training data set, this process resolves approximately 90% of the dep edges. An unresolved edge could simply be deleted, since it provides no dependency information; but because the other word pairs in those short texts may still carry meaningful information, we instead take a bootstrap approach: delete the short text dependency data containing undetermined edge types and train a short text analyzer; run it on the deleted (roughly 10%) portion, and if the direction of the prediction agrees with the data, backfill the specific type output by the analyzer into the dependency analysis tree in place of the dep edge; finally, add the backfilled dependency trees back into the training set and retrain the dependency analyzer to obtain the final model.
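A sketch of the majority-voting part of this step is shown below. The 10-times dominance test follows the text; the data layout (a counter over (modifier, head, label) observations) and the handling of ties are assumptions.

```python
from collections import Counter

def resolve_dep(x, y, label_counts: Counter, ratio: float = 10.0):
    """Infer a specific label for a placeholder edge x -dep-> y, or return None.
    label_counts: Counter over (modifier, head, label) observations from the
    whole training data set."""
    votes = Counter({lab: n for (m, h, lab), n in label_counts.items()
                     if m == x and h == y and lab != "dep"})
    if not votes:
        return None                       # leave for the bootstrap pass
    ranked = votes.most_common()
    best_label, best_n = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # adopt the label only if it occurs `ratio` times more often than the rest
    return best_label if best_n > ratio * max(runner_up, 1) else None
```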
5.5. Short text dependency analysis model
Short text dependency analysis uses a neural-network-based dependency analyzer with a structure similar to that of (Chen and Manning, 2014). The main features used are as follows:

Word features S^w: s1.w, s2.w, s3.w, b1.w, b2.w, b3.w; lc1(si).w, rc1(si).w, lc2(si).w, rc2(si).w (i = 1, 2); lc1(lc1(si)).w, rc1(rc1(si)).w (i = 1, 2)
Part-of-speech features S^t: the part-of-speech tags of the 18 elements above
Dependency label features S^l: the dependency labels of the 12 child elements above

In the table, s_i (i = 1, 2, …) denotes the i-th element from the top of the stack, b_i (i = 1, 2, …) denotes the i-th element of the buffer, and lc_k(s_i) and rc_k(s_i) denote the k-th leftmost and k-th rightmost child nodes of s_i. w denotes the word itself, t a part-of-speech tag, and l a dependency label.
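For concreteness, below is a compact sketch of the arc-standard transition system such a parser executes. The choose() callback stands in for the trained neural network that scores transitions from the features above; its interface is an assumption of the sketch.

```python
def arc_standard_parse(n_words, choose):
    """Parse tokens 1..n_words (0 is the artificial root) with arc-standard.
    choose(stack, buffer, arcs) returns "SHIFT", ("LEFT", label) or
    ("RIGHT", label); in the trained system it is the neural network."""
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
    while buffer or len(stack) > 1:
        action = choose(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action != "SHIFT" and len(stack) > 1:
            kind, label = action
            if kind == "LEFT":                    # top of stack heads the second item
                arcs.append((stack[-1], stack[-2], label))
                del stack[-2]
            else:                                 # "RIGHT": second item heads the top
                arcs.append((stack[-2], stack[-1], label))
                stack.pop()
        else:
            break  # illegal action; stop rather than loop forever
    return arcs  # list of (head, dependent, label)
```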
It should be noted that for simplicity of description, the above method embodiments are described as a series of acts or combination of acts, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that the acts and modules involved are not necessarily required for this application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A short text dependency analysis method based on deep learning is characterized by comprising the following steps:
step 1) acquiring an HTML file where a user query statement is located from a search engine log as a training data set;
step 2) generating a dependency analysis tree of the query statement according to the training data set;
step 3) training a part of speech annotator and a syntactic analyzer based on the neural network model by using a dependency tree;
the step 1) specifically comprises the following steps:
for each query q in the search log and a URL list with a high user click rate under the search result, acquiring an HTML document corresponding to the query q;
the sentence s containing each word in the query is extracted, so that several triples are obtained: (q, s, count), wherein count represents the number of times the word occurs in the sentence;
the obtained ternary group set is used as a training data set for generating a dependency analysis tree;
the method includes that a short text has a plurality of corresponding sentences clicked by users, wherein a dependency analysis tree is generated in a sentence s for the short text q, and the method specifically comprises the following steps:
letting T_s denote the set of all subtrees of the dependency tree of s;
finding the minimum subtree t ∈ T_s such that each word x ∈ q has one and only one match x' ∈ t;
for any two words x and y in q, generating a dependency tree t_{q,s} for q from t in the following manner:
if there is an edge x' → y' in t, creating the same edge x → y in t_{q,s};
if there is a path from x' to y' in t, creating an edge x → y in t_{q,s} and temporarily marking it as dep,
after a dependency tree has been generated from each sentence, a unique dependency tree needs to be selected for the short text, and a scoring function f is defined to evaluate the quality of the dependency tree t_{q,s} generated from a corresponding sentence s of q:

f(t_{q,s}) = Σ_{(x→y) ∈ t_{q,s}} [ 1/dist(x, y) + α · count(x→y) ]

wherein (x → y) denotes an edge of the tree, count(x → y) is the number of times the edge appears over the whole data set, dist(x, y) is the distance between the words x and y in the dependency analysis tree of the original sentence, and α is a parameter adjusting the relative importance of the two scoring terms;
and finally refining the label.
2. The deep-learning-based short text dependency analysis method according to claim 1, wherein the types of some dependency edges are set to the placeholder "dep", and "dep" is inferred to be a real label, since otherwise inconsistencies would be caused in the training data set;
correspondingly, a majority vote is used;
the method comprises the following steps: for any edge x → y labeled dep, counting the number of occurrences of x → y with each specific label in the training data set; if the frequency of a particular label is greater than the threshold, for example when it occurs more than 10 times more often than the other labels, the placeholder dep is changed to that label.
3. The deep-learning-based short text dependency analysis method according to claim 1, wherein the step 3) of training the part-of-speech annotator and the syntactic analyzer based on the neural network model specifically comprises:
establishing a fixed window centered on each word of the sentence, and extracting features including the word itself, capitalization, and prefixes and suffixes;
for word features, using pre-trained word2vec embeddings; for capitalization and prefix/suffix features, randomly initializing the embeddings;
next, parsing the sentence using an arc-standard dependency analysis system with the features shown in the table:

Word features S^w: s1.w, s2.w, s3.w, b1.w, b2.w, b3.w; lc1(si).w, rc1(si).w, lc2(si).w, rc2(si).w (i = 1, 2); lc1(lc1(si)).w, rc1(rc1(si)).w (i = 1, 2)
Part-of-speech features S^t: the part-of-speech tags of the 18 elements above
Dependency label features S^l: the dependency labels of the 12 child elements above

In the table, s_i (i = 1, 2, …) denotes the i-th element from the top of the stack, b_i (i = 1, 2, …) denotes the i-th element of the buffer, lc_k(s_i) and rc_k(s_i) denote the k-th leftmost and k-th rightmost child nodes of s_i, w denotes the word itself, t denotes a part-of-speech tag, and l denotes a dependency label.
CN201710934201.2A 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning Active CN107656921B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710934201.2A CN107656921B (en) 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710934201.2A CN107656921B (en) 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning

Publications (2)

Publication Number Publication Date
CN107656921A CN107656921A (en) 2018-02-02
CN107656921B true CN107656921B (en) 2021-01-08

Family

ID=61117779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710934201.2A Active CN107656921B (en) 2017-10-10 2017-10-10 Short text dependency analysis method based on deep learning

Country Status (1)

Country Link
CN (1) CN107656921B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491512A (en) * 2018-03-23 2018-09-04 北京奇虎科技有限公司 The method of abstracting and device of headline
CN108647785A (en) * 2018-05-17 2018-10-12 普强信息技术(北京)有限公司 A kind of neural network method for automatic modeling, device and storage medium
CN110189751A (en) * 2019-04-24 2019-08-30 中国联合网络通信集团有限公司 Method of speech processing and equipment
CN110263177B (en) * 2019-05-23 2021-09-07 广州市香港科大霍英东研究院 Knowledge graph construction method for event prediction and event prediction method
CN112446405A (en) * 2019-09-04 2021-03-05 杭州九阳小家电有限公司 User intention guiding method for home appliance customer service and intelligent home appliance
CN111523302B (en) * 2020-07-06 2020-10-02 成都晓多科技有限公司 Syntax analysis method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473223A (en) * 2013-09-25 2013-12-25 中国科学院计算技术研究所 Rule extraction and translation method based on syntax tree
CN105740235A (en) * 2016-01-29 2016-07-06 昆明理工大学 Phrase tree to dependency tree transformation method capable of combining Vietnamese grammatical features

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101794274B1 (en) * 2010-07-13 2017-11-06 에스케이플래닛 주식회사 Method and apparatus for filtering translation rules and generating target word in hierarchical phrase-based statistical machine translation
CN102298642B (en) * 2011-09-15 2012-09-05 苏州大学 Method and system for extracting text information
CN102968431B (en) * 2012-09-18 2018-08-10 华东师范大学 A kind of control device that the Chinese entity relationship based on dependency tree is excavated
CN105893346A (en) * 2016-03-30 2016-08-24 齐鲁工业大学 Graph model word sense disambiguation method based on dependency syntax tree
CN106776686A (en) * 2016-11-09 2017-05-31 武汉泰迪智慧科技有限公司 Chinese domain short text understanding method and system based on many necks
CN106598951B (en) * 2016-12-23 2019-08-16 北京金山办公软件股份有限公司 A kind of dependency structure treebank acquisition methods and system
CN107168948A (en) * 2017-04-19 2017-09-15 广州视源电子科技股份有限公司 A kind of sentence recognition methods and system


Also Published As

Publication number Publication date
CN107656921A (en) 2018-02-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant