CN110750704B - Method and device for automatically completing query - Google Patents
- Publication number
- CN110750704B (CN201911014061.2A)
- Authority
- CN
- China
- Prior art keywords
- dictionary tree
- query
- nodes
- internal
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9032—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Abstract
Embodiments of the invention provide a method and a device for query autocompletion. The method comprises the following steps: receiving a query prefix from a user side; matching character results for the query prefix based on a nested dictionary tree (trie) structure; adding the character results to an interval list according to the nodes of the nested dictionary tree; and sorting the interval list according to an analysis of the user's target character string to obtain a result set. The nested dictionary tree locates the string intervals that match the prefix more accurately and supports query autocompletion with abbreviated keywords, which greatly reduces the length of the query the user has to enter and improves the user experience.
Description
Technical Field
The invention relates to the technical field of search, and in particular to a method and a device for query autocompletion.
Background
Query autocompletion is an important technique for guiding users to enter queries correctly and for reducing the number of characters they need to type. In search engines (e.g., Google, Baidu), users often want to enter a small amount of information and get back the results they want. For example, a user who enters the query "MJ" expects the search engine to return results about Michael Jordan. When a user types a query in a search box, query autocompletion offers suggestions that have the typed characters as a prefix.
To better enhance human-computer interaction experience, query autocompletion is often used in various error-prone applications that require a lot of human input, such as command lines, desktop searches, mobile devices, and so on. Because of its importance, the query autocomplete technology has been widely regarded and applied to information extraction and database search.
With existing query autocompletion methods, the user must manually separate the keywords of a query, and matching treats the query characters as prefixes of those keywords. These methods are ineffective when the user prefers not to, or cannot conveniently, separate the keywords manually.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a method for query autocompletion and a corresponding apparatus that overcome or at least partially solve these problems.
In order to solve the above problems, an embodiment of the present invention discloses a method for query automatic completion, including:
receiving a query prefix from a user side;
matching the character result of the query prefix based on a nested dictionary tree structure;
adding the character result into an interval list according to the nested dictionary tree nodes;
and sorting the interval list according to the analysis of the user target character string to obtain a result set.
Further, after the step of sorting the interval list according to the analysis of the user target character string to obtain a result set, the method further includes:
and returning a target result set by adopting a Top-K algorithm according to the user requirement.
Further, before the step of receiving the query prefix from the user side, the method includes:
and establishing the nested dictionary tree structure.
Further, the step of establishing the nested trie structure includes:
dividing the keywords and establishing a dictionary tree;
the dictionary trees are linked together to form a nested dictionary tree structure.
Further, the dictionary tree includes an internal dictionary tree and an external dictionary tree, and the step of dividing the keywords and establishing the dictionary tree includes:
the first letter of the keyword is added to the external dictionary tree and the other letters of the corresponding keyword are added to the internal dictionary tree.
Further, the step of linking the tries together to form a nested trie structure includes:
linking the outer dictionary tree and the inner dictionary tree together to form a nested dictionary tree.
Further, the step of sorting the interval list according to the analysis of the user target character string to obtain a result set includes:
calculating the segmentation matching probability of the target character string by using Bayes theorem and a Gaussian mixture model;
and sorting the interval list in descending order of the segmentation matching probability.
An embodiment of the invention discloses a device for query autocompletion, which comprises:
the receiving module is used for receiving the query prefix from the user side;
the matching module is used for matching the character result of the query prefix based on a nested dictionary tree structure;
the interval list merging module is used for adding the character result into an interval list according to the nested dictionary tree nodes;
and the interval result sorting module is used for sorting the interval list according to the analysis of the user target character string to obtain a result set.
An embodiment of the invention discloses an electronic device comprising a processor, a memory, and a computer program that is stored in the memory and can run on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for query autocompletion.
An embodiment of the invention discloses a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for query autocompletion.
Embodiments of the invention have the following advantages: the nested dictionary tree locates the string intervals that match the prefix more accurately and supports query autocompletion with abbreviated keywords, which greatly reduces the length of the query the user has to enter and improves the user experience.
Drawings
FIG. 1 is a diagram illustrating a nested trie structure in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a fast query dictionary tree algorithm in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of steps in an embodiment of a method for query autocomplete of the present invention;
FIG. 4 is a flow chart of steps of another embodiment of a method for query autocomplete of the present invention;
FIG. 5 is a block diagram illustrating an embodiment of an apparatus for query autocomplete according to the present invention;
FIG. 6 is a block diagram of another embodiment of an apparatus for query autocomplete according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
One of the core concepts of the embodiments of the present invention is to provide a method and a device for query autocompletion, where the method includes: receiving a query prefix from a user side; matching character results for the query prefix based on the nested dictionary tree structure; adding the character results to the interval list according to the nodes of the nested dictionary tree; and sorting the interval list according to an analysis of the user's target character string to obtain a result set. The nested dictionary tree locates the string intervals that match the prefix more accurately and supports query autocompletion with abbreviated keywords, which greatly reduces the length of the query the user has to enter and improves the user experience.
Referring to fig. 1 to 4, a flowchart illustrating steps of an embodiment of a method for query autocomplete of the present invention is shown, which may specifically include the following steps:
S100, receiving a query prefix from a user side;
In this embodiment, $\Sigma$ is a finite set of characters, and a string $s$ is an ordered sequence of characters drawn from $\Sigma$. $|s|$ denotes the length of $s$, $s[i]$ denotes the $i$-th character of $s$, and $s[i..j]$ denotes the substring from the $i$-th to the $j$-th character of $s$. Given two strings $s$ and $t$, $s$ is a prefix of $t$, written $s \le t$, if and only if $s[i] = t[i]$ for all $1 \le i \le |s|$. The concatenation of $s$ and $t$, in that order, is written $st$. A sequence of strings $[s_1, s_2, \ldots, s_n]$ ($n > 1$) is called a segmentation of $s$ if $s = s_1 s_2 \ldots s_n$. $s^{<}$ denotes any prefix of $s$.
Let $S$ be a string data set and assume that $\Sigma$ contains the English letters. Each string $s \in S$ can be cut into a sequence of keywords. The segmentation symbol can be a space, a punctuation mark, a capital letter, and so on. For example, "AddNextValue" is divided into three parts: "Add", "Next", and "Value". Suppose a string $s$ is segmented into the keywords $[s_1, \ldots, s_n]$. Given a query string $q$, $q$ is a prefix abbreviation match (PAM) for $s$ if and only if $q = s_1^{<} s_2^{<} \ldots s_i^{<}$ for some $1 \le i \le n$; that is, $q$ is the concatenation of prefix abbreviations of the first $i$ keywords of $s$. For example, "gene" is a prefix abbreviation match for the string "GetNextValue", because "ge" and "ne" are prefixes of "Get" and "Next".
Given a string data set $S$ and a query string $q$, prefix abbreviation query autocompletion (QACA) finds all strings $s_i \in S$ such that $q$ is a PAM for $s_i$. The output results are computed incrementally from the characters the user has entered so far.
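The PAM definition above can be illustrated with a minimal Python sketch. The function names `split_keywords` and `is_pam` are hypothetical, matching is made case-insensitive so that "gene" matches "GetNextValue" as in the example, and each abbreviated keyword is assumed to contribute at least its initial letter; none of these choices is stated as the patented implementation.

```python
import re

def split_keywords(s):
    """Split a string into keywords at capital letters, spaces and punctuation."""
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", s)

def is_pam(q, keywords):
    """Return True if q concatenates non-empty prefixes of the first i keywords."""
    q = q.lower()
    positions = {0}  # offsets in q reachable after the keywords consumed so far
    for kw in (k.lower() for k in keywords):
        reached = set()
        for p in positions:
            for length in range(1, len(kw) + 1):
                if not q.startswith(kw[:length], p):
                    break  # a longer prefix cannot match either
                reached.add(p + length)
        if len(q) in reached:
            return True  # the whole query has been consumed
        if not reached:
            return False
        positions = reached
    return False
```

For instance, `is_pam("gene", split_keywords("GetNextValue"))` holds, while `"gv"` does not match because it skips the keyword "Next".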
The query autocompletion method allows a user to enter a concatenation of abbreviated keyword prefixes as the query, which improves the user experience. An index structure and a query method are designed for this concatenated-prefix scenario. A ranking algorithm is incorporated into query processing to ensure a high-quality ordering of the output, i.e., the top-ranked results are the ones the user most likely wants. A Top-K method returns a small number K of high-quality results.
This embodiment comprises a nested dictionary tree index structure, a query algorithm, an interval list merging method, an interval result ranking method, and an interval Top-K algorithm. Offline, once the raw data is given, it is preprocessed as required (for example, noise and dirty data are removed) and the index structure is built. Online, when the user issues a query, the query algorithm runs until the results are presented to the user.
The index data structure in this embodiment is a nested trie structure, which consists of a number of internal tries nested within an external trie. Referring to FIG. 1, a diagram of a nested dictionary tree structure is shown. To build a nested dictionary tree, for each input string in S, the initial letter of each keyword of the string is added to the external dictionary tree. Then, at the external node where each initial is located, the remaining letters of the corresponding keyword are added to an internal dictionary tree. Nodes and edges of the external dictionary tree are called external nodes and outer edges, and nodes and edges of an internal dictionary tree are called internal nodes and inner edges. The root node of the nested trie is the root node of the external trie. Links from internal nodes to external nodes are also added. For an internal node n, the root node of the internal trie that contains n is called its initial node. For each internal node that holds a non-initial character of a keyword in a data string, if that keyword is immediately followed by another keyword, a shortcut link is added from the internal node to an external node below its initial node. The label of this shortcut link is the first letter of the next keyword.
To reduce the space taken by shortcut links, most of the links do not need to be stored physically. The targets of a node's links are always a subset of the targets of the outer edges. Based on this observation, a bit vector with one bit per outer edge is stored: the i-th bit indicates that the node has a link whose target is the same as the target of the i-th outer edge. This avoids storing duplicate edges that serve the same function. Compared with a traditional dictionary tree, the nested dictionary tree merges keywords that share the same initial. As the algorithm description below shows, this data structure effectively reduces the number of active nodes and also allows active nodes to be found quickly.
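One way to picture the bit-vector encoding of shortcut links is the following Python sketch. It assumes the outer edges of an initial node are kept in a fixed order; the helper names `link_bits` and `follow_link` are hypothetical and the layout is only one possible reading of the description.

```python
def link_bits(outer_labels, link_labels):
    """Encode a node's shortcut links as a bit vector over the outer edges.

    Bit i is set when the node has a link with the same label (and hence
    the same target) as the i-th outer edge of its initial node.
    """
    bits = 0
    for i, label in enumerate(outer_labels):
        if label in link_labels:
            bits |= 1 << i
    return bits

def follow_link(outer_labels, outer_targets, bits, ch):
    """Return the target of the shortcut link labelled ch, or None if absent."""
    for i, label in enumerate(outer_labels):
        if label == ch and (bits >> i) & 1:
            return outer_targets[i]
    return None
```

With this layout an internal node stores a single integer instead of a table of link targets, and the targets themselves are stored once, on the outer edges.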
S200, matching character results of the query prefix based on a nested dictionary tree structure;
In the nested dictionary tree structure, an active node n is a node for which at least one path (through edges or links) from the root node to n exactly matches the query string entered by the user. The algorithm starts from the external root node and, for each character the user enters, finds the new active nodes from the existing active nodes. An entered character may match either the initial character or a non-initial character of a keyword, and the nested trie supports both cases well. For a non-initial character, a new active node is found by following an inner edge. For an initial character, a new active node is found by following an outer edge. In addition, a new active node can be reached by jumping from an internal node to an external node through a shortcut link.
In this embodiment, not all of the data under a node are desired results. Strings that are not results are removed by merging interval lists. Let $I_n$ denote the ordered sequence of intervals at node n; the merge operation combines two such sequences, whose elements are intervals $x_i$ and $y_j$. Property 1: given a path $n_1, \ldots, n_k$ from the root node to n, the results of query q can exist only in the merge of the interval lists along that path. Based on Property 1, the complexity of the fast query dictionary tree algorithm in this embodiment is $O(\log |I_{n'}|)$. The specific algorithm is shown in FIG. 2.
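The construction and the active-node search can be sketched together in Python. This is a simplified illustration under several assumptions: a single hypothetical `Node` class serves for both external and internal nodes, an external node doubles as the root of its internal trie, shortcut links are stored as plain dictionaries rather than bit vectors, and interval lists are omitted.

```python
class Node:
    """Trie node: `outer` holds outer edges (keyword initials), `inner` holds
    inner edges (non-initial letters), `links` holds shortcut links."""
    def __init__(self):
        self.outer, self.inner, self.links = {}, {}, {}

def insert(root, keywords):
    """Add one segmented string to the nested trie."""
    ext, pending = root, []
    for kw in keywords:
        kw = kw.lower()
        ext = ext.outer.setdefault(kw[0], Node())
        for n in pending:            # internal nodes of the previous keyword
            n.links[kw[0]] = ext     # shortcut link labelled with the next initial
        node, pending = ext, []
        for ch in kw[1:]:
            node = node.inner.setdefault(ch, Node())
            pending.append(node)

def active_nodes(root, query):
    """Return the nodes reachable from the root along the query characters."""
    active = [root]
    for ch in query.lower():
        reached = []
        for n in active:
            for table in (n.outer, n.inner, n.links):
                child = table.get(ch)
                if child is not None and child not in reached:
                    reached.append(child)
        active = reached
    return active
```

Note that an active node in this sketch can still cover strings that do not match the query (after inserting "GetNext" and "GapNice", the query "geni" reaches a node), which is why the interval lists described next are needed to filter non-results.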
S300, adding the character result into an interval list according to the nodes of the nested dictionary tree;
In this embodiment, a query on the nested trie may not match all of the strings below an active node. To avoid reporting non-result data, each node in the trie is given an ordered interval list describing the strings for which a prefix matches a path in the trie. To compute the intervals in the list, each string is taken in turn, the corresponding nodes of the dictionary tree are traversed, and the ID of the string is added to the interval list of each such node. A basic method is to merge interval lists with a sweep-line algorithm, whose time complexity is $O(|I_n| + |I_{n'}|)$, where $|\cdot|$ denotes the number of intervals in a list. Because of the merge operation, $|I_n|$ is usually very small in practice and much smaller than $|I_{n'}|$. If $|I_n|$ is regarded as a constant, the time complexity becomes $O(|I_{n'}|)$. When deep nodes of the nested trie are traversed, the intervals become highly fragmented and $|I_{n'}|$ grows, which makes merging expensive. To address this problem, the present embodiment uses a list merging algorithm based on binary search: for each interval $[u, v]$ in the list $I_n$, a binary search with u as the key finds the first interval in $I_{n'}$ that intersects $[u, v]$.
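The binary-search merge can be sketched as follows in Python. The sketch assumes that merging two interval lists means intersecting them, that each list holds sorted, disjoint, closed intervals of string IDs, and that the list of interval end points would in practice be stored with the index rather than rebuilt per call; `intersect_intervals` is a hypothetical name.

```python
from bisect import bisect_left

def intersect_intervals(small, large):
    """Intersect two sorted lists of disjoint closed intervals (u, v).

    For each interval of the small list, binary search locates the first
    interval of the large list that can intersect it.
    """
    ends = [v for _, v in large]  # assumed precomputed alongside the index
    out = []
    for u, v in small:
        i = bisect_left(ends, u)  # first interval in `large` ending at or after u
        while i < len(large) and large[i][0] <= v:
            out.append((max(u, large[i][0]), min(v, large[i][1])))
            i += 1
    return out
```

When the small list has only a handful of intervals, the cost is dominated by the binary searches rather than by a full scan of the large list.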
S400, sorting the interval list according to the analysis of the user target character string to obtain a result set.
In this embodiment, the results of the output are sorted according to the target string of the estimated user based on the analysis of the user's needs.
In this embodiment, before the step of receiving the query prefix from the user side, S100 includes:
and establishing a nested dictionary tree structure.
In this embodiment, the step of establishing the nested trie structure includes:
dividing the keywords and establishing a dictionary tree;
the tries are linked together to form a nested trie structure.
In this embodiment, the trie includes an internal trie and an external trie, and the step of dividing the keyword and establishing the trie includes:
the first letter of the keyword is added to the external dictionary tree and the other letters of the corresponding keyword are added to the internal dictionary tree.
In this embodiment, the step of linking the tries together to form a nested trie structure includes:
the outer dictionary tree and the inner dictionary tree are linked together to form a nested dictionary tree.
In this embodiment, step S400 of sorting the interval list according to the analysis of the user target character string to obtain a result set includes:
calculating the segmentation matching probability of the target character string by using Bayes theorem and a Gaussian mixture model;
and sorting the interval list in descending order of the segmentation matching probability.
In the present embodiment, suppose a data string s is cut into $[s_1, \ldots, s_n]$, the first m keywords have been abbreviated in the query, and the remaining $(n - m)$ keywords have not yet been entered. Then q can be cut into $[q_1, \ldots, q_m]$ with $q_i \le s_i$ for $1 \le i \le m \le n$. $(n - m)$ empty strings, denoted $q_{m+1}, \ldots, q_n$, are appended so that q and s have the same number of segments. The score used to rank s is defined as the probability that s is the string the user intends, given the segmentations $[q_1, \ldots, q_n]$ and $[s_1, \ldots, s_n]$: $\mathrm{score}(s, q) = P(s_1 \ldots s_n \mid q_1 \ldots q_n)$. If there are several possible segmentations, the one that gives the maximum score is selected. All PAM results of q are sorted by $\mathrm{score}(s, q)$ in descending order to obtain the result set.
To calculate $\mathrm{score}(s, q)$, Bayes' theorem is applied:
$$
\begin{aligned}
\mathrm{score}(s, q) &= P(s_1 \ldots s_n \mid q_1 \ldots q_n) \\
&= P(q_1 \ldots q_n \mid s_1 \ldots s_n) \cdot P(s_1 \ldots s_n) / P(q_1 \ldots q_n) \\
&\propto P(q_1 \ldots q_n \mid s_1 \ldots s_n) \cdot P(s_1 \ldots s_n) \\
&= P(q_1 \ldots q_n \mid s_1 \ldots s_n) \cdot P(s)
\end{aligned}
$$
The denominator $P(q_1 \ldots q_n)$ in the above formula can be safely ignored because $P(q_1 \ldots q_n) = P(q)$, which is the same value for all strings that are PAM results. $P(s)$ represents the popularity of s. To calculate $P(q_1 \ldots q_n \mid s_1 \ldots s_n)$, the probabilities $P(q_i \mid s_i)$, $1 \le i \le n$, are assumed to be mutually independent, so that $P(q_1 \ldots q_n \mid s_1 \ldots s_n) = P(q_1 \mid s_1) \cdot \ldots \cdot P(q_n \mid s_n)$. This gives:
$$
\mathrm{score}(s, q) \propto P(q_1 \mid s_1) \cdot \ldots \cdot P(q_n \mid s_n) \cdot P(s)
$$
Each $P(q_i \mid s_i)$ describes the probability that the user enters the query segment $q_i$ as a prefix of the keyword $s_i$. For keywords that have not been entered yet, $P(q_i \mid s_i) = 1$ is assumed for $m < i \le n$, because these keywords may still be entered by the user later. Setting these probabilities to 1 keeps the score of s from being lowered by the repeated multiplication, especially when n is much larger than m.
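The scoring formula above can be sketched in a few lines of Python. The function `score` and its parameters are hypothetical: `popularity` stands in for $P(s)$ and `prefix_prob(q_i, s_i, i)` stands in for whatever model supplies $P(q_i \mid s_i)$; the toy probability used in the usage note is an assumption for illustration only.

```python
def score(q_parts, s_parts, popularity, prefix_prob):
    """Rank score proportional to P(q_1|s_1) * ... * P(q_n|s_n) * P(s).

    q_parts holds the m entered segments, s_parts the n keywords (m <= n).
    Keywords not yet entered contribute a factor of 1.
    """
    if len(q_parts) > len(s_parts):
        return 0.0
    total = popularity
    for i, q_i in enumerate(q_parts):
        total *= prefix_prob(q_i, s_parts[i], i + 1)
    return total
```

For example, with a toy `prefix_prob` that returns `len(q_i) / len(s_i)` for a true prefix, `score(["ge", "ne"], ["get", "next", "value"], 0.5, prefix_prob)` multiplies 0.5 by 2/3 and 2/4, and the unentered keyword "value" leaves the product unchanged.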
To calculate $P(q_i \mid s_i)$ more accurately, it is observed that users habitually abbreviate certain character sequences, for example by omitting consonant portions, and that these omissions follow certain patterns. The pattern is therefore described by a vector of features: (1) the length of $q_i$; (2) the number of vowels in $q_i$; (3) the number of consonants in $q_i$; (4) whether $q_i$ ends with a consonant; (5) the value of i, i.e., the position of the keyword $s_i$ in the string. The pattern is thus represented by a 5-dimensional vector. Note that $s_i$ is not fully encoded in the vector, for the following reason. Let $p_i$ denote the pattern vector of the user abbreviating $s_i$ to $q_i$. Treating the way a keyword is abbreviated as independent of the keyword itself gives $P(q_i, s_i) = P(p_i) \cdot P(s_i)$. Because $P(q_i, s_i) = P(q_i \mid s_i) \cdot P(s_i)$, it follows that $P(p_i) = P(q_i \mid s_i)$.
Given a pattern vector, the value of $P(p_i)$ is calculated using a Gaussian mixture model (GMM). The Gaussian mixture model uses unknown parameters to describe the density function of p, which gives the probability as follows:
$$
P(p) = \sum_{i=1}^{l} w_i \, N(p \mid \mu_i, \Sigma_i)
$$
where l is the number of Gaussian components, $w_i$ is the weight of each component, and $N(p \mid \mu_i, \Sigma_i)$ is the probability density function of p with mean $\mu_i$ and covariance matrix $\Sigma_i$. The parameter l can be fine-tuned during training. The other parameters can be learned by clustering with the EM algorithm: a series of data strings is given by the user, all prefixes of these data are collected, and they are converted into keyword and prefix pairs that serve as the features of the training data.
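The five features listed above can be computed with a short Python sketch. The name `pattern_vector` is hypothetical, and treating only a, e, i, o, u as vowels is an assumption made for the illustration.

```python
VOWELS = set("aeiou")

def pattern_vector(q_i, i):
    """Build the 5-dimensional pattern vector for segment q_i of keyword number i."""
    q = q_i.lower()
    vowels = sum(c in VOWELS for c in q)
    consonants = sum(c.isalpha() and c not in VOWELS for c in q)
    ends_with_consonant = int(bool(q) and q[-1].isalpha() and q[-1] not in VOWELS)
    return [len(q), vowels, consonants, ends_with_consonant, i]
```

For the query "gene" against "GetNextValue", the two segments "ge" and "ne" would map to `[2, 1, 1, 0, 1]` and `[2, 1, 1, 0, 2]`, which differ only in the keyword position.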
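The mixture density can be evaluated with a small pure-Python sketch. It assumes diagonal covariance matrices (each component is described by a list of per-dimension variances), which is a simplification of the general $\Sigma_i$; the function names are hypothetical and parameter learning with EM is not shown.

```python
import math

def gaussian_diag(p, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at vector p."""
    log_density = 0.0
    for x, m, v in zip(p, mean, var):
        log_density += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return math.exp(log_density)

def gmm_density(p, weights, means, variances):
    """Weighted sum of the component densities: sum_i w_i * N(p | mu_i, Sigma_i)."""
    return sum(
        w * gaussian_diag(p, m, v)
        for w, m, v in zip(weights, means, variances)
    )
```

The weights are expected to sum to 1, and the number of entries in `weights` plays the role of the parameter l.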
In this embodiment, after the step of sorting the interval list according to the analysis of the user target character string to obtain the result set, S400 further includes:
and S500, returning a target result set by adopting a Top-K algorithm according to the user requirement.
In this embodiment, while entering a query the user may not be interested in all of the results, but usually only in the top K. Under this assumption, results that are unlikely to enter the top K can be filtered out early: an upper bound on the score of an active node is estimated, and the node is filtered out in advance if this upper bound is lower than the lowest score among the current top K results. In the interval list algorithm, each active node yields a merged interval list as its set of valid results. To obtain the Top-K results, the interval list of each active node is traversed, a score is computed for each string in the intervals, and the strings are then sorted by score to extract the Top-K results. The greatest cost in this implementation is the use of the Gaussian mixture model to compute the probabilities $P(q_i \mid s_i)$. Because the number of strings in the intervals is large in practice, especially for short query strings, an efficient Top-K algorithm is needed to reduce the number of Gaussian mixture model computations.
In a specific embodiment, the maximum possible score in a merged interval list is defined. A merged list has the following property: for each interval $[u, v]$ in the merged list $J_n$ there always exists an interval $[u', v'] \in I_n$ with $u' \le u$ and $v \le v'$. The maximum possible score of the strings in list $J_n$ is therefore bounded above by the maximum for list $I_n$. To compute the score bound of each interval, consider a trie node n and let d denote its depth. Every string in list $I_n$ has at least d keywords, and when n becomes an active node the query q has exactly d non-empty segments. For each interval $[u, v] \in I_n$, the strings $s_u, \ldots, s_v$ can therefore be processed offline, and the maximum value is used as the bound for online queries. Given a string $s_i$, all possible prefixes of each of its first d keywords (indexed $j = 1, \ldots, d$) are enumerated and the corresponding probabilities are computed. Note that when $j = d$ there is only one possible prefix, because the match ends at node n. The maximum probability is computed for each string $s_i$, and the maximum over the interval is stored with the interval $[u, v]$ in the dictionary tree.
This embodiment discloses an online Top-K result extraction algorithm. At the beginning, a priority queue R is initialized to store the Top-K results. For each active node n, the intervals in list $J_n$ are first sorted in descending order of their maximum score. Then, for each interval $[u, v]$ of $J_n$, the score of each string is computed in turn and the priority queue is updated. If an interval is reached whose maximum score is not greater than the score of the K-th result, the processing of n can safely end.
In another embodiment, some Gaussian mixture model computations are skipped. Strings in the same interval very likely share some keywords, and therefore share the same probabilities $P(q_i \mid s_i)$. For two adjacent strings $s_i$ and $s_{i+1}$ in an interval $[u, v] \in I_n$, the number of leading keywords they share is checked offline and recorded with $s_{i+1}$ as $s_{i+1}.spr$. During online query processing, if $s_i$ and $s_{i+1}$ lie in the same interval of $J_n$, the Gaussian mixture model computation for the first $s_{i+1}.spr$ keywords of $s_{i+1}$ can be skipped, because it has already been done. To make better use of keyword sharing, the strings in S are sorted in lexicographic order.
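The early-termination idea can be sketched with Python's `heapq` module. The sketch assumes each interval arrives as a pair of its precomputed maximum score and the list of string IDs it covers, and that `score_fn` (a hypothetical callable) returns the exact score of one string; it handles a single active node.

```python
import heapq

def top_k(intervals, k, score_fn):
    """Return the k best (score, string_id) pairs, skipping hopeless intervals.

    intervals: list of (max_score, [string ids]) for one active node.
    """
    heap = []  # min-heap, so heap[0] is the current K-th best result
    for bound, ids in sorted(intervals, key=lambda t: t[0], reverse=True):
        if len(heap) == k and bound <= heap[0][0]:
            break  # no remaining interval can beat the K-th result
        for sid in ids:
            sc = score_fn(sid)
            if len(heap) < k:
                heapq.heappush(heap, (sc, sid))
            elif sc > heap[0][0]:
                heapq.heapreplace(heap, (sc, sid))
    return sorted(heap, reverse=True)
```

Because intervals are visited in descending order of their bound, the first interval that fails the check ends the scan, and the expensive scoring is never run for the strings in the remaining intervals.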
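The offline bookkeeping for shared keywords can be sketched as follows in Python. The function name `shared_prefix_counts` is hypothetical, and the input is assumed to be the already sorted list of segmented strings, each given as a list of keywords.

```python
def shared_prefix_counts(segmented):
    """For each string, count the leading keywords it shares with its predecessor.

    The first string has no predecessor, so its count is 0.
    """
    if not segmented:
        return []
    spr = [0]
    for prev, cur in zip(segmented, segmented[1:]):
        shared = 0
        for a, b in zip(prev, cur):
            if a != b:
                break
            shared += 1
        spr.append(shared)
    return spr
```

At query time, a scorer could keep the per-keyword probabilities of the previous string and reuse the first `spr` of them for the next string instead of evaluating the mixture model again.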
The application allows the user to decide the number of results. If the user expects to obtain all results and screens the results one by one, the step of returning the target result set by adopting a Top-K algorithm according to the user requirement in the step S500 can be skipped, and if the user only wants a limited number of high-quality results, the step of returning the K results most wanted by the user is carried out.
The application discloses a method for inquiring automatic completion, which is based on a model for inquiring prefix abbreviation matching of a completion technology, wherein the model for inquiring prefix abbreviation matching of the completion technology is a new algorithm in the completion technology. Compared with the prior art, the method and the device fully consider various scenes, and particularly do not display separators which indicate key words for users. The method and the device can save 20% of the number of characters input by a user. The embedded dictionary tree is a new data structure for supporting the automatic completion technology. Compared with a traditional dictionary tree index structure, the embedded dictionary tree can more accurately position the character string interval matched with the prefix. To return more meaningful results, a ranking algorithm is designed that uses the probability of the query string versus the data string versus the segmentation, and uses bayesian formulas and gaussian mixture model structures to compute its probability value. The ranking algorithm can return results that are more desirable to the user. Considering the interesting result of the user, two Top-K optimization algorithms are designed, namely the calculation times of designing the score upper bound of each interval list and skipping the Gaussian mixture model with higher complexity. Compared with the existing algorithm, the Top-K optimization algorithm has higher efficiency and accuracy.
The method can be applied, without limitation, to technical fields such as database query input prompting, search box optimization for search engines, code completion in integrated development environments, query prompting systems in biochemistry and medicine, quick input interfaces of input methods, and input interfaces of constrained terminals.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to Figs. 5 and 6, block diagrams of embodiments of an apparatus for query autocompletion according to the present invention are shown. The apparatus may specifically include the following modules:
a receiving module 100, configured to receive a query prefix from a user side;
a matching module 200, configured to match the character result of the query prefix based on the nested dictionary tree structure;
an interval list merging module 300, configured to add the character result to an interval list according to the nested dictionary tree nodes;
and an interval result sorting module 400, configured to sort the interval list according to analysis of the user's target character string to obtain a result set.
In this embodiment, the apparatus further includes:
a result screening module 500, configured to return the target result set using a Top-K algorithm according to the user's requirements.
In this embodiment, the apparatus further includes:
a structure building module, configured to build the nested dictionary tree structure.
In this embodiment, the structure building module includes:
a splitting unit, configured to split the keywords and build the dictionary trees;
and a linking unit, configured to link the dictionary trees together to form the nested dictionary tree structure.
In this embodiment, the splitting unit includes:
a splitting subunit, configured to add the first letter of a keyword to the external dictionary tree and the other letters of that keyword to the internal dictionary tree.
In this embodiment, the linking unit includes:
a linking subunit, configured to link the external dictionary tree and the internal dictionary tree together to form the nested dictionary tree.
In this embodiment, the interval result sorting module includes:
a segmentation probability calculation unit, configured to calculate the segmentation matching probability of the target character string using Bayes' theorem and a Gaussian mixture model;
and a sorting unit, configured to sort the interval list in descending order of segmentation matching probability.
Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
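The roles of the structure building module and the matching module can be sketched as follows. This is a simplified illustration under stated assumptions, not the structure as claimed: keywords are assumed to be separated by spaces in the stored strings, each node stores a set of string ids instead of interval lists, and the links from internal nodes to external nodes are modelled implicitly by keeping the current external node in the search state. All class and function names are hypothetical.

```python
class InnerNode:
    """Node of an internal dictionary tree: non-initial letters of one keyword."""
    def __init__(self):
        self.children = {}
        self.ids = set()    # ids of strings whose keyword passes through this node

class OuterNode:
    """Node of the external dictionary tree: reached by a keyword's first letter."""
    def __init__(self):
        self.children = {}          # first letter of the next keyword -> OuterNode
        self.inner = InnerNode()    # root of the internal tree for this keyword
        self.ids = set()

def build_nested_trie(strings):
    root = OuterNode()
    for sid, text in enumerate(strings):
        outer = root
        for keyword in text.split():
            # First letter goes to the external tree ...
            outer = outer.children.setdefault(keyword[0], OuterNode())
            outer.ids.add(sid)
            # ... and the remaining letters go to its internal tree.
            inner = outer.inner
            for ch in keyword[1:]:
                inner = inner.children.setdefault(ch, InnerNode())
                inner.ids.add(sid)
    return root

def match(root, query, n_strings):
    """Ids of strings whose leading keywords have prefixes that concatenate to query."""
    states = [(root, None, set(range(n_strings)))]
    for ch in query:
        nxt = []
        for outer, inner, cands in states:
            # Option 1: ch continues the current keyword inside the internal tree.
            if inner is not None and ch in inner.children:
                node = inner.children[ch]
                kept = cands & node.ids
                if kept:
                    nxt.append((outer, node, kept))
            # Option 2: ch is the first letter of the next keyword.
            if ch in outer.children:
                node = outer.children[ch]
                kept = cands & node.ids
                if kept:
                    nxt.append((node, node.inner, kept))
        states = nxt
    result = set()
    for _, _, cands in states:
        result |= cands
    return result

strings = ["new york times", "new yorker", "national youth theatre", "nyc taxi"]
trie = build_nested_trie(strings)
print(sorted(match(trie, "nyt", len(strings))))   # [0, 2, 3]
print(sorted(match(trie, "neyo", len(strings))))  # [0, 1]
```

The query "nyt" matches "new york times" and "national youth theatre" through the segmentation n|y|t, and "nyc taxi" through ny|t, without the user typing any separator.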
An embodiment of the invention discloses an electronic device comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for query autocompletion.
An embodiment of the invention discloses a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for query autocompletion.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method for query autocompletion and the corresponding apparatus provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.
Claims (6)
1. A method for query autocompletion, comprising:
building a nested dictionary tree structure; specifically, the first letter of a keyword is added to an external dictionary tree, and the other letters of the corresponding keyword are added to an internal dictionary tree; linking the external dictionary tree and the internal dictionary tree together to form a nested dictionary tree; adding a link from an internal node to an external node in the nested dictionary tree, and if an immediate keyword is behind a character string where a non-initial character is located, adding a link to the external node for the initial node of the internal node corresponding to the non-initial character, wherein the label of the link is the initial letter of the immediate keyword; wherein the internal nodes are nodes of the internal dictionary tree, the external nodes are nodes of the external dictionary tree, and the initial nodes are root nodes containing internal fields of the internal nodes;
receiving a query prefix from a user side; the query prefix is the concatenation of prefix abbreviations of any previous keyword of a character string formed by sequentially splicing a plurality of keywords;
matching the character result of the query prefix based on a nested dictionary tree structure;
adding the character result into an interval list according to the nested dictionary tree nodes;
and sorting the interval list according to the analysis of the user target character string to obtain a result set.
2. The method of claim 1, wherein after the step of sorting the interval list according to the analysis of the user target string to obtain a result set, further comprising:
and returning a target result set by adopting a Top-K algorithm according to the user requirement.
3. The method of claim 1, wherein the step of sorting the interval list according to the analysis of the user target string to obtain a result set comprises:
calculating the segmentation matching probability of the target character string by using Bayes theorem and a Gaussian mixture model;
and sorting the interval list in descending order of the segmentation matching probability.
4. An apparatus for query autocomplete, comprising:
the structure building module is used for building a nested dictionary tree structure; specifically, the first letter of a keyword is added to an external dictionary tree, and the other letters of the corresponding keyword are added to an internal dictionary tree; linking the external dictionary tree and the internal dictionary tree together to form a nested dictionary tree; adding a link from an internal node to an external node in the nested dictionary tree, and if an immediate keyword is behind a character string where a non-initial character is located, adding a link to the external node for the initial node of the internal node corresponding to the non-initial character, wherein the label of the link is the initial letter of the immediate keyword; wherein the internal nodes are nodes of the internal dictionary tree, the external nodes are nodes of the external dictionary tree, and the initial nodes are root nodes containing internal fields of the internal nodes;
the receiving module is used for receiving the query prefix from the user side; the query prefix is the concatenation of prefix abbreviations of any previous keyword of a character string formed by sequentially splicing a plurality of keywords;
the matching module is used for matching the character result of the query prefix based on a nested dictionary tree structure;
the interval list merging module is used for adding the character result into an interval list according to the nested dictionary tree nodes;
and the interval result sorting module is used for sorting the interval list according to the analysis of the user target character string to obtain a result set.
5. Electronic device, characterized in that it comprises a processor, a memory and a computer program stored on said memory and capable of running on said processor, said computer program, when executed by said processor, implementing the steps of the method for query autocompletion according to any one of claims 1 to 3.
6. Computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method for query autocompletion according to any one of claims 1 to 3.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911014061.2A CN110750704B (en) | 2019-10-23 | 2019-10-23 | Method and device for automatically completing query |
PCT/CN2019/126590 WO2021077585A1 (en) | 2019-10-23 | 2019-12-19 | Method and device for auto-completing query |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911014061.2A CN110750704B (en) | 2019-10-23 | 2019-10-23 | Method and device for automatically completing query |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110750704A (en) | 2020-02-04 |
CN110750704B (en) | 2022-03-11 |
Family
ID=69279673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911014061.2A Active CN110750704B (en) | 2019-10-23 | 2019-10-23 | Method and device for automatically completing query |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110750704B (en) |
WO (1) | WO2021077585A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112256821B (en) * | 2020-09-23 | 2024-05-17 | 北京捷通华声科技股份有限公司 | Chinese address completion method, device, equipment and storage medium |
CN113312549B (en) * | 2021-05-25 | 2024-01-26 | 北京天空卫士网络安全技术有限公司 | Domain name processing method and device |
CN113360666A (en) * | 2021-05-31 | 2021-09-07 | 珠海大横琴科技发展有限公司 | Data dictionary management method and device, electronic equipment and storage medium |
CN117546155A (en) * | 2021-06-10 | 2024-02-09 | 维萨国际服务协会 | Systems, methods, and computer program products for feature analysis using embedded trees |
CN115878924B (en) * | 2021-09-27 | 2024-03-12 | 小沃科技有限公司 | Data processing method, device, medium and electronic equipment based on double dictionary trees |
CN114969242A (en) * | 2022-01-19 | 2022-08-30 | 支付宝(杭州)信息技术有限公司 | Method and device for automatically completing query content |
US12079279B2 (en) | 2023-02-06 | 2024-09-03 | Walmart Apollo, Llc | Systems and methods for generating query suggestions |
US12032608B1 (en) | 2023-02-06 | 2024-07-09 | Walmart Apollo, Llc | Systems and methods for generating query suggestions |
CN117640259B (en) * | 2024-01-25 | 2024-06-04 | 武汉思普崚技术有限公司 | Script step-by-step detection method and device, electronic equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063508A (en) * | 2011-01-10 | 2011-05-18 | 浙江大学 | Generalized suffix tree based fuzzy auto-completion method for Chinese search engine |
CN105447080A (en) * | 2015-11-05 | 2016-03-30 | 华建宇通科技(北京)有限责任公司 | Query completion method in community ask-answer search |
CN106663100A (en) * | 2014-05-30 | 2017-05-10 | 苹果公司 | Multi-domain query completion |
CN107169045A (en) * | 2017-04-19 | 2017-09-15 | 中国人民解放军国防科学技术大学 | A kind of query word method for automatically completing and device based on temporal signatures |
CN109325635A (en) * | 2018-10-25 | 2019-02-12 | 电子科技大学中山学院 | Position prediction method based on automatic completion |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8073869B2 (en) * | 2008-07-03 | 2011-12-06 | The Regents Of The University Of California | Method for efficiently supporting interactive, fuzzy search on structured data |
CN104052669B (en) * | 2013-03-12 | 2018-12-07 | 凯为公司 | For handling the device for the longest prefix match table being alternately arranged |
CN108241695B (en) * | 2016-12-26 | 2021-11-02 | 北京国双科技有限公司 | Information processing method and device |
CN108427756B (en) * | 2018-03-16 | 2021-02-12 | 中国人民解放军国防科技大学 | Personalized query word completion recommendation method and device based on same-class user model |
Non-Patent Citations (1)
Title |
---|
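String approximate matching algorithms and optimization techniques based on local filtering; Wang Yaoshu; China Master's Theses Full-text Database, Information Science and Technology; 2016-08-15; p. I138-1404 * |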
Also Published As
Publication number | Publication date |
---|---|
CN110750704A (en) | 2020-02-04 |
WO2021077585A1 (en) | 2021-04-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||