CN110750704B - Method and device for automatically completing query - Google Patents

Info

Publication number
CN110750704B
CN110750704B CN201911014061.2A CN201911014061A CN110750704B CN 110750704 B CN110750704 B CN 110750704B CN 201911014061 A CN201911014061 A CN 201911014061A CN 110750704 B CN110750704 B CN 110750704B
Authority
CN
China
Prior art keywords
dictionary tree
query
nodes
internal
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911014061.2A
Other languages
Chinese (zh)
Other versions
CN110750704A (en
Inventor
秦建斌
王尧舒
毛睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Computing Sciences
Original Assignee
Shenzhen Institute of Computing Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Computing Sciences filed Critical Shenzhen Institute of Computing Sciences
Priority to CN201911014061.2A priority Critical patent/CN110750704B/en
Priority to PCT/CN2019/126590 priority patent/WO2021077585A1/en
Publication of CN110750704A publication Critical patent/CN110750704A/en
Application granted granted Critical
Publication of CN110750704B publication Critical patent/CN110750704B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a method and a device for query autocompletion. The method comprises the following steps: receiving a query prefix from a user side; matching character results for the query prefix based on a nested dictionary tree (trie) structure; adding the character results to an interval list according to the nodes of the nested dictionary tree; and sorting the interval list according to an analysis of the user's target character string to obtain a result set. The nested dictionary tree locates the character string intervals matched by the prefix more accurately and supports query autocompletion with abbreviated keywords, which greatly reduces the query length the user must enter and improves the user experience.

Description

Method and device for automatically completing query
Technical Field
The invention relates to the technical field of search, and in particular to a method and a device for query autocompletion.
Background
Query autocompletion is an important technique for guiding users to enter queries correctly and for reducing the number of characters they need to type. In search engines (e.g., Google, Baidu), users often want to enter a small amount of information and get back the results they are looking for. For example, a user who enters the query MJ may expect the search engine to return results about Michael Jordan. When a user types a query in a search box, query autocompletion offers suggestions that have the typed characters as a prefix.
To improve the human-computer interaction experience, query autocompletion is often used in error-prone applications that require a lot of manual input, such as command lines, desktop search, and mobile devices. Because of its importance, query autocompletion has received wide attention and has been applied to information extraction and database search.
With existing query autocompletion methods, the user must manually separate the keywords of a query, and the methods match the query characters as prefixes of those keywords. These methods are ineffective when the user is unwilling or unable to separate the keywords manually.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a method for query autocompletion and a corresponding device for query autocompletion that overcome or at least partially solve the above problems.
In order to solve the above problems, an embodiment of the present invention discloses a method for query autocompletion, including:
receiving a query prefix from a user side;
matching the character result of the query prefix based on a nested dictionary tree structure;
adding the character result into an interval list according to the nested dictionary tree nodes;
and sorting the interval list according to the analysis of the user target character string to obtain a result set.
Further, after the step of sorting the interval list according to the analysis of the user target character string to obtain a result set, the method further includes:
and returning a target result set by adopting a Top-K algorithm according to the user requirement.
Further, before the step of receiving the query prefix from the user side, the method includes:
and establishing the nested dictionary tree structure.
Further, the step of establishing the nested trie structure includes:
dividing the keywords and establishing a dictionary tree;
the dictionary trees are linked together to form a nested dictionary tree structure.
Further, the dictionary tree includes an internal dictionary tree and an external dictionary tree, and the step of dividing the keywords and establishing the dictionary tree includes:
the first letter of the keyword is added to the external dictionary tree and the other letters of the corresponding keyword are added to the internal dictionary tree.
Further, the step of linking the tries together to form a nested trie structure includes:
linking the outer dictionary tree and the inner dictionary tree together to form a nested dictionary tree.
Further, the step of sorting the interval list according to the analysis of the user target character string to obtain a result set includes:
calculating the segmentation matching probability of the target character string by using Bayes theorem and a Gaussian mixture model;
and sorting the interval list in descending order of the segmentation matching probability.
The embodiment of the invention discloses a device for automatically completing inquiry, which comprises:
the receiving module is used for receiving the query prefix from the user side;
the matching module is used for matching the character result of the query prefix based on a nested dictionary tree structure;
the interval list merging module is used for adding the character result into an interval list according to the nested dictionary tree nodes;
and the interval result sorting module is used for sorting the interval list according to the analysis of the user target character string to obtain a result set.
The embodiment of the invention discloses electronic equipment, which comprises a processor, a memory and a computer program which is stored on the memory and can run on the processor, wherein the computer program realizes the steps of the method for automatically completing inquiry when being executed by the processor.
The embodiment of the invention discloses a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for automatically completing the query are realized.
The embodiments of the invention have the following advantages: the nested dictionary tree locates the character string intervals matched by the prefix more accurately and supports query autocompletion with abbreviated keywords, which greatly reduces the query length the user must enter and improves the user experience.
Drawings
FIG. 1 is a diagram illustrating a nested trie structure in accordance with an embodiment of the present invention;
FIG. 2 is a block diagram of a fast query dictionary tree algorithm in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of steps in an embodiment of a method for query autocomplete of the present invention;
FIG. 4 is a flow chart of steps of another embodiment of a method for query autocomplete of the present invention;
FIG. 5 is a block diagram illustrating an embodiment of an apparatus for query autocomplete according to the present invention;
FIG. 6 is a block diagram of another embodiment of an apparatus for query autocomplete according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
One of the core concepts of the embodiments of the present invention is to provide a method and a device for query autocompletion, where the method includes: receiving a query prefix from a user side; matching character results for the query prefix based on the nested dictionary tree structure; adding the character results to the interval list according to the nested dictionary tree nodes; and sorting the interval list according to the analysis of the user target character string to obtain a result set. The nested dictionary tree locates the character string intervals matched by the prefix more accurately and supports query autocompletion with abbreviated keywords, which greatly reduces the query length the user must enter and improves the user experience.
Referring to fig. 1 to 4, a flowchart illustrating steps of an embodiment of a method for query autocomplete of the present invention is shown, which may specifically include the following steps:
S100, receiving a query prefix from a user side;
In this embodiment, $\Sigma$ is a finite set of characters, and a string $s$ is an ordered array of characters drawn from $\Sigma$. $|s|$ denotes the length of $s$, $s[i]$ denotes the $i$-th character of $s$, and $s[i..j]$ denotes the substring from the $i$-th to the $j$-th character of $s$. Given two strings $s$ and $t$, $s$ is a prefix of $t$, written $s \le t$, if and only if $s[1..i] = t[1..i]$ for $1 \le i \le |s|$. The concatenation of $s$ and $t$, in that order, is denoted $st$. An array of strings $[s_1, s_2, \ldots, s_n]$ ($n > 1$) is called a segmentation of $s$ if $s$ equals the concatenation $s_1 s_2 \ldots s_n$. $s^{<}$ denotes an arbitrary prefix of $s$. Let $S$ be a string dataset; assuming that $\Sigma$ contains the English letters, each string $s \in S$ can be segmented into a set of keywords. The delimiter can be a space, a punctuation mark, a capital letter, and so on. For example, "AddNextValue" is segmented into three keywords: "Add", "Next", and "Value". Suppose a string $s$ is segmented into the keywords $[s_1, \ldots, s_n]$. Given a query string $q$, $q$ is said to be a prefix abbreviation match for $s$, expressed as
Figure BDA0002245111450000041
if and only if $q = s_1^{<} s_2^{<} \ldots s_i^{<}$ with $1 \le i \le n$; that is, $q$ is the concatenation of prefix abbreviations of the first $i$ keywords of $s$. For example, "gene" is a prefix abbreviation match for the string "GetNextValue" because "ge" and "ne" are prefixes of "Get" and "Next". Prefix abbreviation matching is abbreviated PAM. Given a string dataset $S$ and a query string $q$, prefix-abbreviation query autocompletion (QACA) finds all strings $s_i \in S$ that satisfy
Figure BDA0002245111450000051
The output results are incrementally computed based on the user's current input characters.
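The prefix abbreviation match defined above can be illustrated with a small sketch. It is not the patented index-based algorithm, only a direct check of the definition for one string. The function names `split_keywords` and `is_pam` are hypothetical, and the sketch assumes case-insensitive comparison, non-empty prefixes, and keywords delimited by capital letters, spaces, or punctuation.

```python
import re


def split_keywords(s):
    """Split a string such as 'GetNextValue' into keywords.

    Assumption: a keyword starts at a capital letter or after a
    non-alphanumeric delimiter (space, punctuation).
    """
    return re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", s)


def is_pam(q, keywords):
    """Return True if q is a concatenation of non-empty prefixes of the
    first i keywords, for some 1 <= i <= len(keywords)."""
    q = q.lower()
    kws = [k.lower() for k in keywords]

    def match(pos, k):
        # pos: query characters consumed so far; k: index of the next keyword.
        if pos == len(q):
            return k >= 1  # at least one keyword must have been abbreviated
        if k == len(kws):
            return False
        # Try every non-empty prefix of keyword k that matches q at pos.
        for length in range(1, len(kws[k]) + 1):
            end = pos + length
            if end > len(q) or q[end - 1] != kws[k][length - 1]:
                break
            if match(end, k + 1):
                return True
        return False

    return match(0, 0)
```

For instance, `is_pam("gene", split_keywords("GetNextValue"))` holds because "ge" and "ne" are prefixes of "Get" and "Next", while "gv" does not match because keywords cannot be skipped.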
The query autocompletion method allows a user to enter a concatenation of abbreviated keyword prefixes as the query, which improves the user experience. An index structure and a query method are designed for this keyword-prefix concatenation scenario. A ranking algorithm is also proposed and integrated into query processing to ensure the quality of the output ranking, i.e., the top-ranked results are the ones the user most likely wants. A Top-K method returns a small number K of high-quality results.
This embodiment comprises a nested dictionary tree index structure, a query algorithm, an interval list merging method, an interval result ranking method and an interval Top-K algorithm. Offline, once the original data is given, the data is preprocessed according to the requirements at hand (for example, noise and dirty data are removed) and the index structure is built. Online, when a user issues a query, the query algorithm is executed until the output results are presented to the user.
The index data structure in this embodiment is a nested trie structure, in which a plurality of internal tries are nested within an external trie. FIG. 1 shows a diagram of the nested dictionary tree structure. To build the nested dictionary tree, for each input string $s \in S$, the initial letter of each keyword of the string is added to the external dictionary tree. Then, at the external node where each initial is located, the remaining letters of the corresponding keyword are added to that node's internal dictionary tree. Nodes and edges of the external dictionary tree are called external nodes and edges, and nodes and edges of an internal dictionary tree are called internal nodes and edges. The root node of the nested trie is the root node of the external trie. Links from internal nodes to external nodes are also added. For an internal node n, the root of the internal trie that contains n is called the initial node of n. For every data string that passes through an internal node (a non-initial character), if the current keyword is immediately followed by another keyword, a shortcut link is added from the internal node to the corresponding external child of its initial node. The label of this shortcut link is the first letter of the next keyword.
To reduce the space taken by shortcut links, most links need not be stored physically. The targets of a node's links are always a subset of the targets of the outer edges of its initial node. Based on this observation, a bit vector with one bit per outer edge is stored for the node: the i-th bit indicates that the node has a link whose target is the same as the target of the i-th outer edge. This avoids storing duplicate edges that serve the same function. Compared with a traditional dictionary tree, the nested dictionary tree merges keywords that share the same initial. As the algorithm description below shows, this data structure effectively reduces the number of active nodes and also allows active nodes to be found quickly.
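The construction described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class and function names (`ExternalNode`, `InternalNode`, `build_nested_trie`) are hypothetical, keywords are lowercased, and the shortcut links are kept as a set of labels rather than the bit vector mentioned in the text, because the set is easier to read in a sketch.

```python
import re


class InternalNode:
    def __init__(self, initial):
        self.children = {}      # inner edges: letter -> InternalNode
        self.initial = initial  # external node that owns this internal trie
        self.links = set()      # shortcut-link labels; the target of label c
                                # is self.initial.children[c]


class ExternalNode:
    def __init__(self):
        self.children = {}                    # outer edges: initial -> ExternalNode
        self.inner_root = InternalNode(self)  # root of the nested internal trie


def split_keywords(s):
    # Assumption: keywords start at capital letters or after delimiters.
    return [k.lower() for k in re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", s)]


def build_nested_trie(strings):
    root = ExternalNode()
    for s in strings:
        keywords = split_keywords(s)
        ext = root
        for i, kw in enumerate(keywords):
            # The initial letter of the keyword goes into the external trie.
            if kw[0] not in ext.children:
                ext.children[kw[0]] = ExternalNode()
            ext = ext.children[kw[0]]
            next_initial = keywords[i + 1][0] if i + 1 < len(keywords) else None
            # The remaining letters go into the internal trie of that node.
            node = ext.inner_root
            for ch in kw[1:]:
                if ch not in node.children:
                    node.children[ch] = InternalNode(ext)
                node = node.children[ch]
                # If another keyword follows, record a shortcut link labeled
                # with its first letter.
                if next_initial is not None:
                    node.links.add(next_initial)
    return root
```

Because every link target is a child of the initial node, a production version could replace `links` with one bit per outer edge, as the text describes.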
S200, matching character results of the query prefix based on a nested dictionary tree structure;
In the nested dictionary tree structure, an active node n is a node for which at least one path (through edges or links) from the root node to n exactly matches the query string entered by the user. The algorithm starts from the external root node and, for each character the user enters, finds the new active nodes from the existing active nodes. An entered character may match either the initial character of a keyword or a non-initial character, and the nested trie supports both cases well. For a non-initial character, a new active node is found by following an internal edge. For an initial character, a new active node is found by following an outer edge. In addition, a new active node can be generated by jumping from an internal node to an external node through a shortcut link.
In this embodiment, not all of the data under each node belongs to the desired result. Strings that are not results are removed by list merging. Define $I_n$ as an ordered sequence of intervals
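The active-node search described above can be sketched in a few lines. The sketch is self-contained and deliberately simplified: a single hypothetical `Node` class is used for both node kinds, each external node doubles as the root of its internal trie, and interval lists are omitted, so the code only tracks which nodes are active after each typed character.

```python
import re


class Node:
    """Node of a simplified nested trie (all names are illustrative)."""

    def __init__(self, initial=None):
        self.outer = {}         # outer edges (used on external nodes)
        self.inner = {}         # inner edges
        self.links = set()      # shortcut-link labels (internal nodes)
        self.initial = initial  # owning external node; None if external


def build(strings):
    root = Node()
    for s in strings:
        kws = [k.lower() for k in re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", s)]
        ext = root
        for i, kw in enumerate(kws):
            if kw[0] not in ext.outer:
                ext.outer[kw[0]] = Node()
            ext = ext.outer[kw[0]]
            node = ext
            for ch in kw[1:]:
                if ch not in node.inner:
                    node.inner[ch] = Node(initial=ext)
                node = node.inner[ch]
                if i + 1 < len(kws):
                    node.links.add(kws[i + 1][0])
    return root


def step(active, ch):
    """Advance every active node by one typed character."""
    nxt = []
    for n in active:
        if ch in n.inner:                            # non-initial character
            nxt.append(n.inner[ch])
        if n.initial is None and ch in n.outer:      # initial via outer edge
            nxt.append(n.outer[ch])
        if n.initial is not None and ch in n.links:  # jump via shortcut link
            nxt.append(n.initial.outer[ch])
    return list(dict.fromkeys(nxt))                  # drop duplicates, keep order


def active_nodes(root, query):
    active = [root]
    for ch in query.lower():
        active = step(active, ch)
    return active
```

With the strings "GetNextValue", "GetName" and "SetValue" indexed, the queries "gn" and "getn" end at the same external node, and "gv" leaves no active node because a keyword cannot be skipped.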
Figure BDA0002245111450000061
The following operation merges two interval sequences.
Figure BDA0002245111450000062
Figure BDA0002245111450000063
where $x_i$ and $y_j$ denote two intervals. Property 1: given a path $n_1, \ldots, n_k$ from the root node to $n$, the results of query $q$ exist only in
Figure BDA0002245111450000064
Based on Property 1, the complexity of the fast query dictionary tree algorithm in the present embodiment is $O(\log |I_{n'}|)$. The specific algorithm is shown in FIG. 2.
S300, adding the character result into an interval list according to the nodes of the nested dictionary tree;
In this embodiment, a query in the nested trie algorithm may not match all of the strings below an active node. To avoid reporting non-result data, each node in the trie is associated with an ordered list of intervals that describes the strings for which a prefix matches a path in the trie. To compute the intervals in the list, for a given string, the nodes in the dictionary tree are traversed and the ID of the string is added to the interval list of each corresponding node. One basic method is to use a sweep-line algorithm to merge interval lists, with time complexity $O(|I_n| + |I_{n'}|)$, where $|\cdot|$ denotes the number of intervals in a list. Because of the merge operation, $|I_n|$ is generally very small in practice and much smaller than $|I_{n'}|$. If $|I_n|$ is regarded as a constant, the time complexity becomes $O(|I_{n'}|)$. When deep nodes in the nested trie are traversed, the stored intervals can become very fragmented and $|I_{n'}|$ becomes large, which introduces a large merging cost. To address this problem, the present embodiment provides a list merging algorithm: for each interval $[u, v]$ in the list, binary search with $u$ as the key is used to find in $I_{n'}$ the first interval that intersects $[u, v]$.
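A minimal sketch of the binary-search merge follows. It assumes that the merge operation is the intersection of two sorted lists of disjoint closed intervals of string IDs, which is consistent with the containment property stated later for $J_n$ and $I_n$ but is not spelled out in the surviving text. The function name `merge_intervals` is hypothetical.

```python
from bisect import bisect_left


def merge_intervals(small, large):
    """Intersect two sorted lists of disjoint closed intervals (u, v).

    `small` plays the role of the short list I_n and `large` the role of
    the long list I_n'. For each interval of `small`, binary search finds
    the first interval of `large` that can intersect it.
    """
    # End points of `large`, ascending. In an index these would be stored
    # once offline rather than rebuilt on every call.
    ends = [v for _, v in large]
    out = []
    for u, v in small:
        i = bisect_left(ends, u)  # first interval in `large` with end >= u
        while i < len(large) and large[i][0] <= v:
            out.append((max(u, large[i][0]), min(v, large[i][1])))
            i += 1
    return out
```

With the end points precomputed, the cost per interval of the short list is one binary search plus the number of intersecting intervals, instead of a full sweep over the long list.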
S400, sorting the interval list according to the analysis of the user target character string to obtain a result set.
In this embodiment, based on an analysis of the user's needs, the output results are sorted according to the estimated likelihood that each is the user's target string.
In this embodiment, before step S100 of receiving the query prefix from the user side, the method includes:
and establishing a nested dictionary tree structure.
In this embodiment, the step of establishing the nested trie structure includes:
dividing the keywords and establishing a dictionary tree;
the tries are linked together to form a nested trie structure.
In this embodiment, the trie includes an internal trie and an external trie, and the step of dividing the keyword and establishing the trie includes:
the first letter of the keyword is added to the external dictionary tree and the other letters of the corresponding keyword are added to the internal dictionary tree.
In this embodiment, the step of linking the tries together to form a nested trie structure includes:
the outer dictionary tree and the inner dictionary tree are linked together to form a nested dictionary tree.
In this embodiment, step S400 of sorting the interval list according to the analysis of the user target character string to obtain a result set includes:
calculating the segmentation matching probability of the target character string by using Bayes theorem and a Gaussian mixture model;
and sorting the interval list in descending order of the segmentation matching probability.
In the present embodiment, suppose a data string $s$ is segmented into $[s_1, \ldots, s_n]$, the first $m$ keywords have been abbreviated into the query, and the remaining $(n - m)$ keywords have not been entered. Thus,
Figure BDA0002245111450000081
$q$ can be segmented into $[q_1, \ldots, q_m]$ such that $q_i \le s_i$ for $1 \le i \le m \le n$. $(n - m)$ empty strings, denoted $q_{m+1}, \ldots, q_n$, are appended so that $q$ and $s$ have the same number of segments. The score used to rank $s$ is defined as the probability that the string $s$ is the intended match for the query with respect to the segmentations $[q_1, \ldots, q_n]$ and $[s_1, \ldots, s_n]$, written $\mathrm{score}(s, q) = P(s_1 \ldots s_n \mid q_1 \ldots q_n)$. If there are multiple possible segmentations, the one that yields the maximum score is selected. All PAM results of $q$ are sorted by the $\mathrm{score}(s, q)$ function to obtain a result set in descending order.
To calculate $\mathrm{score}(s, q)$, Bayes' theorem is applied:
$$
\begin{aligned}
\mathrm{score}(s, q) &= P(s_1 \ldots s_n \mid q_1 \ldots q_n) \\
&= P(q_1 \ldots q_n \mid s_1 \ldots s_n) \cdot P(s_1 \ldots s_n) / P(q_1 \ldots q_n) \\
&\propto P(q_1 \ldots q_n \mid s_1 \ldots s_n) \cdot P(s_1 \ldots s_n) \\
&= P(q_1 \ldots q_n \mid s_1 \ldots s_n) \cdot P(s)
\end{aligned}
$$
The denominator $P(q_1 \ldots q_n)$ in the above formula can be safely ignored because $P(q_1 \ldots q_n) = P(q)$, which is the same value for all strings matched by PAM. $P(s)$ characterizes the popularity of $s$. To calculate $P(q_1 \ldots q_n \mid s_1 \ldots s_n)$, the probabilities $P(q_i \mid s_i)$, $1 \le i \le n$, are assumed to be mutually independent. Thus $P(q_1 \ldots q_n \mid s_1 \ldots s_n) = P(q_1 \mid s_1) \cdot \ldots \cdot P(q_n \mid s_n)$, and the following formula is obtained:
$$
\mathrm{score}(s, q) \propto P(q_1 \mid s_1) \cdot \ldots \cdot P(q_n \mid s_n) \cdot P(s)
$$
Each $P(q_i \mid s_i)$ describes the probability that the user enters the query segment $q_i$ as a prefix of the keyword $s_i$. For keywords that have not been entered yet, $P(q_i \mid s_i) = 1$ is assumed for $m < i \le n$. The reason is that these keywords may still be entered by the user later. Setting these probabilities to 1 also ensures that the score of $s$ is not lowered by the repeated multiplication, especially when $n$ is much larger than $m$.
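The final scoring formula can be tried on a small example. The sketch below only multiplies given factors; the probability and popularity values are hypothetical numbers chosen for illustration, and the names `score`, `rank` and `candidates` are not from the source. Factors for keywords that have not been typed are simply left out, which is equivalent to setting them to 1.

```python
from math import prod


def score(typed_probs, popularity):
    """score(s, q) up to a constant: product of P(q_i | s_i) for the typed
    keywords (untyped keywords contribute a factor of 1) times P(s)."""
    return prod(typed_probs) * popularity


def rank(candidates):
    """Sort candidate strings in descending order of score."""
    return sorted(
        candidates,
        key=lambda c: score(c["probs"], c["popularity"]),
        reverse=True,
    )


# Hypothetical candidates for the query "gene".
candidates = [
    # "ge" | "Get" and "ne" | "Next"; "Value" has not been typed.
    {"s": "GetNextValue", "probs": [0.4, 0.3], "popularity": 0.02},
    # "gene" | "Generate"; "Name" has not been typed.
    {"s": "GenerateName", "probs": [0.1], "popularity": 0.05},
]
ranking = [c["s"] for c in rank(candidates)]
```

In this made-up example the more popular string ranks first even though its single prefix factor is smaller, which shows how $P(s)$ and the per-keyword factors trade off.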
To better calculate $P(q_i \mid s_i)$, it is observed that users habitually abbreviate certain character sequences, for example by omitting some consonants or vowels, and that such omissions follow certain patterns. The abbreviation is therefore described by a feature vector: (1) the length of $q_i$, (2) the number of vowels in $q_i$, (3) the number of consonants in $q_i$, (4) whether $q_i$ ends with a consonant, and (5) the value of $i$, i.e., the position of the keyword $s_i$ in the string. As described above, the features form a 5-dimensional vector. Note that $s_i$ is not fully encoded in the vector, for the following reason. Let $p_i$ denote the pattern vector of the user abbreviating $s_i$ to $q_i$. Since $q_i$ is determined by the keyword and by how the keyword is abbreviated, $P(q_i, s_i) = P(p_i) \cdot P(s_i)$. Because $P(q_i, s_i) = P(q_i \mid s_i) \cdot P(s_i)$, it follows that $P(q_i \mid s_i) = P(p_i)$.
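The five features listed above can be computed directly from a query segment and its position. This is a small sketch; the function name `pattern_vector` is hypothetical, and it assumes that the vowels are a, e, i, o, u and that every other alphabetic character counts as a consonant.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def pattern_vector(q_i, i):
    """Return the 5-dimensional pattern vector for query segment q_i,
    which abbreviates the i-th keyword (1-based) of the data string."""
    q = q_i.lower()
    vowels = sum(ch in VOWELS for ch in q)
    consonants = sum(ch.isalpha() and ch not in VOWELS for ch in q)
    ends_with_consonant = int(
        bool(q) and q[-1].isalpha() and q[-1] not in VOWELS
    )
    return (len(q), vowels, consonants, ends_with_consonant, i)
```

For example, the segment "ne" typed for the second keyword yields `(2, 1, 1, 0, 2)`, and the vowel-free abbreviation "nxt" yields `(3, 0, 3, 1, 2)`.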
Given a pattern vector, the value of $P(p_i)$ is calculated using a Gaussian mixture model (GMM). The Gaussian mixture model uses parameters to be learned to express the density function of $p$, i.e., the probability, as follows:
$$
P(p) = \sum_{i=1}^{l} w_i \, N(p \mid \mu_i, \Sigma_i)
$$
where $l$ is the number of Gaussian distributions, $w_i$ is the weight of the $i$-th Gaussian distribution, and $N(p \mid \mu_i, \Sigma_i)$ is the probability density function of $p$ with mean $\mu_i$ and covariance matrix $\Sigma_i$. The parameter $l$ can be tuned during training. The other parameters can be learned by clustering with the EM algorithm: a series of data strings is given by the user, all prefixes of the data are collected, and they are converted into keyword and prefix pairs that serve as the features of the training data.
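Evaluating the mixture density for one pattern vector can be sketched in plain Python. To keep the example short it assumes diagonal covariance matrices, which is a simplification of the general covariance matrix in the text, and it takes the weights, means and variances as given rather than learning them with EM. The function names are hypothetical.

```python
import math


def gaussian_pdf_diag(p, mean, var):
    """Density at point p of a Gaussian with diagonal covariance.

    `var` holds the per-dimension variances (assumed positive).
    """
    log_pdf = 0.0
    for x, m, v in zip(p, mean, var):
        log_pdf += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_pdf)


def gmm_density(p, weights, means, variances):
    """P(p) = sum_i w_i * N(p | mu_i, Sigma_i) with diagonal Sigma_i."""
    return sum(
        w * gaussian_pdf_diag(p, m, v)
        for w, m, v in zip(weights, means, variances)
    )
```

In practice a library implementation with full covariance matrices and EM fitting would normally be used instead; the sketch only makes the formula concrete.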
In this embodiment, after step S400 of sorting the interval list according to the analysis of the user target character string to obtain the result set, the method further includes:
and S500, returning a target result set by adopting a Top-K algorithm according to the user requirement.
In this embodiment, while entering a query the user may not be interested in all results, but usually only in the top K. Under this assumption, results that are unlikely to reach the top K can be filtered out early: an upper bound on the score of an active node is estimated, and the node is pruned in advance if this upper bound is lower than the lowest score among the current top K results. In the interval list algorithm, one merged interval list is obtained at each active node as its result set. To obtain the Top-K results, the interval list of each active node is traversed, a score is calculated for each character string in each interval, and the strings are then sorted by score to extract the Top-K results. The greatest cost in the current implementation is using the Gaussian mixture model to compute the probabilities $P(q_i \mid s_i)$. Because the number of strings in the intervals is large in practice, especially for short query strings, an efficient Top-K algorithm is needed to reduce the number of Gaussian mixture model evaluations.
In a specific embodiment, the maximum possible score in a merged interval list is defined. The merged list has the following property: for each interval $[u, v] \in J_n$, there always exists an interval $[u', v'] \in I_n$ such that $u' \le u$ and $v' \ge v$. Thus, the maximum possible score of the strings in list $J_n$ is bounded above by that of list $I_n$. To calculate the score bound for each interval, consider a node $n$ of the dictionary tree and let $d$ denote its depth. It can be deduced that every string in the list $I_n$ has at least $d$ keywords and that, when $n$ becomes an active node, the query $q$ has exactly $d$ non-empty segments. Thus, for each interval $[u, v] \in I_n$, the strings $s_u, \ldots, s_v$ can be processed offline, and the maximum value is used to bound the online query. Given a string $s_i$, for its first $d$ keywords
Figure BDA0002245111450000111
enumerate all possible prefixes for the string $s_i$
Figure BDA0002245111450000112
Then calculate the probability
Figure BDA0002245111450000113
Note that when $j = d$ there is only one possible prefix, since the match is made at node $n$. The maximum probability
Figure BDA0002245111450000114
is calculated for the string $s_i$; the maximum over the strings of the interval is taken and stored with the interval $[u, v]$ in the dictionary tree.
The embodiment discloses an online Top-K result extraction algorithm. At the beginning, a priority queue R is initialized to store the Top-K results. For each active node $n$, the intervals in list $J_n$ are first sorted in descending order of their maximum score. Then, for each interval $[u, v]$ of $J_n$, the score of each string is computed in turn and the priority queue is updated. If an interval is reached whose maximum score is not greater than the score of the K-th result, processing of $n$ can safely end.
In another embodiment, some Gaussian mixture model computations are skipped. Character strings in the same interval share some keywords with high probability and therefore have the same probability $P(q_i \mid s_i)$ for those keywords. For two adjacent character strings $s_i$ and $s_{i+1}$ in an interval $[u, v] \in I_n$, the number of keywords they share as a prefix is checked offline and recorded with $s_{i+1}$, denoted $s_{i+1}.spr$. During online query processing, if $s_i$ and $s_{i+1}$ are in the same interval of $J_n$, the Gaussian mixture model computation for the first $s_{i+1}.spr$ keywords of $s_{i+1}$ can be skipped because it has already been computed. To make better use of keyword sharing, the strings in S are sorted in the order of the earliest points.
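The extraction loop described above might look as follows for a single active node. This is a sketch under stated assumptions: each interval is represented as a pair of a precomputed score upper bound and the list of strings it covers, `score_fn` stands in for the Gaussian-mixture-based score, and the names `top_k`, `intervals` and `score_fn` are hypothetical.

```python
import heapq


def top_k(intervals, score_fn, k):
    """Return the k best (score, string) pairs, highest score first.

    intervals: list of (max_score, strings) pairs for one list J_n, where
    max_score is an upper bound on the score of every string in the pair.
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the best k results seen so far
    for bound, strings in sorted(intervals, key=lambda iv: iv[0], reverse=True):
        # Once the heap is full, an interval whose bound does not exceed
        # the current k-th score cannot contribute, nor can later ones.
        if len(heap) == k and bound <= heap[0][0]:
            break
        for s in strings:
            sc = score_fn(s)
            if len(heap) < k:
                heapq.heappush(heap, (sc, s))
            elif sc > heap[0][0]:
                heapq.heapreplace(heap, (sc, s))
    return sorted(heap, reverse=True)
```

To process several active nodes with one shared queue R, the same heap could be passed in and reused across calls; the sketch keeps it local for brevity.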
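The keyword-sharing shortcut can be illustrated with a sketch that reuses partial products between adjacent strings. It assumes that all strings of the interval are matched with the same query segmentation and have at least as many keywords as the query has segments; `prob_fn` is a hypothetical stand-in for the Gaussian mixture model evaluation of $P(q_j \mid s_j)$, and the popularity factor $P(s)$ is left out.

```python
def shared_prefix_keywords(a, b):
    """Number of leading keywords that two keyword lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def score_interval(strings, query_segs, prob_fn):
    """Product of P(q_j | s_j) over the typed segments for each string.

    strings: keyword lists of one interval, in index order.
    query_segs: the query segmentation [q_1, ..., q_m].
    prob_fn(q_j, s_j, j): stand-in for the per-keyword probability.
    """
    # spr values; in the described scheme these are computed offline.
    spr = [0] + [
        shared_prefix_keywords(strings[i - 1], strings[i])
        for i in range(1, len(strings))
    ]
    scores = []
    partial = []  # partial[j]: product of the first j + 1 factors
    for kws, shared in zip(strings, spr):
        reuse = min(shared, len(partial), len(query_segs))
        partial = partial[:reuse]  # keep factors of the shared keywords
        for j in range(reuse, len(query_segs)):
            prev = partial[-1] if partial else 1.0
            partial.append(prev * prob_fn(query_segs[j], kws[j], j + 1))
        scores.append(partial[-1] if partial else 1.0)
    return scores
```

For three strings that share their first one or two keywords, the number of `prob_fn` calls drops from six to three in the test below, which is the saving the shared-prefix count is meant to provide.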
The application allows the user to decide the number of results. If the user expects to obtain all results and screens the results one by one, the step of returning the target result set by adopting a Top-K algorithm according to the user requirement in the step S500 can be skipped, and if the user only wants a limited number of high-quality results, the step of returning the K results most wanted by the user is carried out.
The application discloses a method for query autocompletion based on a prefix abbreviation matching model, which is a new model in query autocompletion. Compared with the prior art, the application takes a wider range of scenarios into account, in particular the case where the user does not explicitly enter separators between keywords. The method and the device can save 20% of the characters a user has to enter. The nested dictionary tree is a new data structure supporting autocompletion. Compared with a traditional dictionary tree index, the nested dictionary tree locates the character string intervals matched by a prefix more accurately. To return more meaningful results, a ranking algorithm is designed that uses the matching probability of the query string and the data string with respect to their segmentations, computed with Bayes' theorem and a Gaussian mixture model. The ranking algorithm returns the results the user is more likely to want. Because users are typically interested in only the top results, two Top-K optimizations are designed: an upper bound on the score of each interval list, and skipping of the computationally expensive Gaussian mixture model evaluations. Compared with existing algorithms, the Top-K optimization algorithm has higher efficiency and accuracy.
The method may be applied, without limitation, to technical fields such as database query input prompts, search box optimization in search engines, code completion in integrated development environments, query prompt systems in biochemistry and medicine, quick input interfaces of input methods, and input interfaces of constrained terminals.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to FIGS. 5 and 6, there are shown structural block diagrams of an embodiment of an apparatus for query autocompletion according to the present invention, which may specifically include the following modules:
a receiving module 100, configured to receive a query prefix from a user side;
a matching module 200, configured to match character results for the query prefix based on the nested dictionary tree structure;
an interval list merging module 300, configured to add the character results to an interval list according to the nested dictionary tree nodes; and
an interval result sorting module 400, configured to sort the interval list according to an analysis of the user's target character string to obtain a result set.
In this embodiment, the apparatus further includes:
a result screening module 500, configured to return a target result set using a Top-K algorithm according to user requirements.
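The idea of bounding each interval list's score so that expensive scoring can be skipped may be sketched as follows. This is a generic threshold-style Top-K loop written for illustration, not the patented algorithm. The function `top_k`, the `(upper_bound, items)` layout, and the `score` callback (standing in for the costly Gaussian mixture model evaluation) are all hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def top_k(interval_lists: Sequence[Tuple[float, Sequence[str]]],
          score: Callable[[str], float],
          k: int) -> Tuple[List[Tuple[float, str]], int]:
    """Return the k best (score, item) pairs and the number of score() calls.

    Assumes score(item) <= upper_bound for every item of a given list.
    """
    if k <= 0:
        return [], 0
    heap: List[Tuple[float, str]] = []   # min-heap holding the current best k
    evaluated = 0
    # Visit the lists with the most promising upper bound first.
    for bound, items in sorted(interval_lists, key=lambda t: t[0], reverse=True):
        if len(heap) == k and bound <= heap[0][0]:
            break   # no remaining list can beat the current k-th best score
        for item in items:
            evaluated += 1
            s = score(item)              # the expensive step we try to avoid
            if len(heap) < k:
                heapq.heappush(heap, (s, item))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, item))
    return sorted(heap, reverse=True), evaluated
```

When the best lists already fill the result set, the remaining lists are never scored, so the count of `score` calls returned alongside the results can be well below the total number of candidates.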
In this embodiment, the apparatus further includes:
a structure building module, configured to build the nested dictionary tree structure.
In this embodiment, the structure building module includes:
a splitting unit, configured to split the keywords and build dictionary trees; and
a linking unit, configured to link the dictionary trees together to form the nested dictionary tree structure.
In this embodiment, the splitting unit includes:
a splitting subunit, configured to add the first letter of each keyword to the external dictionary tree and to add the remaining letters of that keyword to the internal dictionary tree.
In this embodiment, the linking unit includes:
a linking subunit, configured to link the external dictionary tree and the internal dictionary tree together to form a nested dictionary tree.
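One possible reading of the structure built by the splitting and linking units can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the class names `OuterNode` and `InnerNode`, the sentinel root, whitespace-separated keywords, and the use of plain sets of string IDs in place of interval lists are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class InnerNode:
    """Node of an internal trie: holds the non-initial letters of a keyword."""
    children: Dict[str, "InnerNode"] = field(default_factory=dict)
    ids: Set[int] = field(default_factory=set)


@dataclass
class OuterNode:
    """Node of the external trie: one first letter at one keyword position."""
    inner_root: InnerNode = field(default_factory=InnerNode)
    # Links labeled with the first letter of the keyword that follows.
    next_keyword: Dict[str, "OuterNode"] = field(default_factory=dict)
    ids: Set[int] = field(default_factory=set)


def build(strings: List[str]) -> OuterNode:
    root = OuterNode()                       # sentinel above the first keyword
    for sid, text in enumerate(strings):
        root.ids.add(sid)
        outer = root
        for keyword in text.split():
            # First letter goes to the external trie ...
            outer = outer.next_keyword.setdefault(keyword[0], OuterNode())
            outer.ids.add(sid)
            # ... and the remaining letters go to its internal trie.
            inner = outer.inner_root
            for ch in keyword[1:]:
                inner = inner.children.setdefault(ch, InnerNode())
                inner.ids.add(sid)
    return root


def match(root: OuterNode, query: str) -> Set[int]:
    """IDs of strings whose leading keywords admit `query` as joined prefixes."""
    states = [(root, root.inner_root, set(root.ids))]
    for ch in query:
        advanced = []
        for outer, inner, ids in states:
            child = inner.children.get(ch)       # stay inside the keyword
            if child is not None and ids & child.ids:
                advanced.append((outer, child, ids & child.ids))
            nxt = outer.next_keyword.get(ch)     # jump to the next keyword
            if nxt is not None and ids & nxt.ids:
                advanced.append((nxt, nxt.inner_root, ids & nxt.ids))
        states = advanced
    result: Set[int] = set()
    for _, _, ids in states:
        result |= ids
    return result
```

At every query character the matcher either continues inside the current keyword or follows a labeled link to the next keyword, which is why no separator is needed in the query. Intersecting ID sets along the path keeps only strings that are consistent with every step.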
In this embodiment, the interval result sorting module includes:
a segmentation probability calculation unit, configured to calculate the segmentation matching probability of the target character string using Bayes' theorem and a Gaussian mixture model; and
a sorting unit, configured to sort the interval list in descending order of segmentation matching probability.
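The combination of Bayes' theorem, a Gaussian mixture model, and a descending sort can be illustrated with the minimal Python sketch below. The features and parameters of the model are not given in this passage, so everything specific here is an assumption: each candidate carries a hypothetical prior and a single scalar feature of its segmentation (for example, the average number of characters typed per keyword), and the mixture components are made-up inputs.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def gmm_pdf(x: float, components: Sequence[Tuple[float, float, float]]) -> float:
    """Mixture density; components are (weight, mean, std), weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def rank(candidates: Sequence[Tuple[str, float, float]],
         components: Sequence[Tuple[float, float, float]]) -> List[Tuple[str, float]]:
    """candidates are (string, prior, feature); returns (string, posterior)
    pairs sorted by descending posterior probability."""
    # Bayes' theorem: posterior is proportional to prior times likelihood.
    joint = [(s, prior * gmm_pdf(x, components)) for s, prior, x in candidates]
    total = sum(p for _, p in joint)
    if total > 0:
        joint = [(s, p / total) for s, p in joint]   # normalize over candidates
    return sorted(joint, key=lambda pair: pair[1], reverse=True)
```

With equal priors, the candidate whose feature lies closer to a heavily weighted mixture component receives the higher posterior and is listed first.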
Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
An embodiment of the invention discloses an electronic device comprising a processor, a memory, and a computer program that is stored on the memory and can run on the processor, wherein the computer program, when executed by the processor, implements the steps of the query autocompletion method.
An embodiment of the invention discloses a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the query autocompletion method.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method for query autocompletion and the corresponding apparatus provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, in accordance with the idea of the invention, make changes to the specific implementations and to the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (6)

1. A method for query autocompletion, comprising:
building a nested dictionary tree structure, specifically: adding the first letter of each keyword to an external dictionary tree and adding the other letters of that keyword to an internal dictionary tree; linking the external dictionary tree and the internal dictionary tree together to form a nested dictionary tree; and adding links from internal nodes to external nodes in the nested dictionary tree, wherein, if a keyword immediately follows the character string in which a non-initial character is located, a link to the external node is added to the initial node of the internal node corresponding to the non-initial character, the label of the link being the first letter of the immediately following keyword; wherein the internal nodes are nodes of the internal dictionary tree, the external nodes are nodes of the external dictionary tree, and the initial node is the root node of the internal field that contains the internal node;
receiving a query prefix from a user side, wherein the query prefix is a concatenation of prefix abbreviations of any number of leading keywords of a character string formed by splicing a plurality of keywords in sequence;
matching character results for the query prefix based on the nested dictionary tree structure;
adding the character results to an interval list according to the nested dictionary tree nodes; and
sorting the interval list according to an analysis of the user's target character string to obtain a result set.
2. The method of claim 1, further comprising, after the step of sorting the interval list according to the analysis of the user's target character string to obtain a result set:
returning a target result set using a Top-K algorithm according to user requirements.
3. The method of claim 1, wherein the step of sorting the interval list according to the analysis of the user's target character string to obtain a result set comprises:
calculating the segmentation matching probability of the target character string using Bayes' theorem and a Gaussian mixture model; and
sorting the interval list in descending order of segmentation matching probability.
4. An apparatus for query autocompletion, comprising:
a structure building module, configured to build a nested dictionary tree structure, specifically: adding the first letter of each keyword to an external dictionary tree and adding the other letters of that keyword to an internal dictionary tree; linking the external dictionary tree and the internal dictionary tree together to form a nested dictionary tree; and adding links from internal nodes to external nodes in the nested dictionary tree, wherein, if a keyword immediately follows the character string in which a non-initial character is located, a link to the external node is added to the initial node of the internal node corresponding to the non-initial character, the label of the link being the first letter of the immediately following keyword; wherein the internal nodes are nodes of the internal dictionary tree, the external nodes are nodes of the external dictionary tree, and the initial node is the root node of the internal field that contains the internal node;
a receiving module, configured to receive a query prefix from a user side, wherein the query prefix is a concatenation of prefix abbreviations of any number of leading keywords of a character string formed by splicing a plurality of keywords in sequence;
a matching module, configured to match character results for the query prefix based on the nested dictionary tree structure;
an interval list merging module, configured to add the character results to an interval list according to the nested dictionary tree nodes; and
an interval result sorting module, configured to sort the interval list according to an analysis of the user's target character string to obtain a result set.
5. An electronic device, comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of the method for query autocompletion according to any one of claims 1 to 3.
6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for query autocompletion according to any one of claims 1 to 3.
CN201911014061.2A 2019-10-23 2019-10-23 Method and device for automatically completing query Active CN110750704B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911014061.2A CN110750704B (en) 2019-10-23 2019-10-23 Method and device for automatically completing query
PCT/CN2019/126590 WO2021077585A1 (en) 2019-10-23 2019-12-19 Method and device for auto-completing query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911014061.2A CN110750704B (en) 2019-10-23 2019-10-23 Method and device for automatically completing query

Publications (2)

Publication Number Publication Date
CN110750704A CN110750704A (en) 2020-02-04
CN110750704B true CN110750704B (en) 2022-03-11

Family

ID=69279673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911014061.2A Active CN110750704B (en) 2019-10-23 2019-10-23 Method and device for automatically completing query

Country Status (2)

Country Link
CN (1) CN110750704B (en)
WO (1) WO2021077585A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256821B (en) * 2020-09-23 2024-05-17 北京捷通华声科技股份有限公司 Chinese address completion method, device, equipment and storage medium
CN113312549B (en) * 2021-05-25 2024-01-26 北京天空卫士网络安全技术有限公司 Domain name processing method and device
CN113360666A (en) * 2021-05-31 2021-09-07 珠海大横琴科技发展有限公司 Data dictionary management method and device, electronic equipment and storage medium
WO2022261345A1 (en) * 2021-06-10 2022-12-15 Visa International Service Association System, method, and computer program product for feature analysis using an embedding tree
CN115878924B (en) * 2021-09-27 2024-03-12 小沃科技有限公司 Data processing method, device, medium and electronic equipment based on double dictionary trees
CN114969242A (en) * 2022-01-19 2022-08-30 支付宝(杭州)信息技术有限公司 Method and device for automatically completing query content
CN117640259A (en) * 2024-01-25 2024-03-01 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102063508A (en) * 2011-01-10 2011-05-18 浙江大学 Generalized suffix tree based fuzzy auto-completion method for Chinese search engine
CN105447080A (en) * 2015-11-05 2016-03-30 华建宇通科技(北京)有限责任公司 Query completion method in community ask-answer search
CN106663100A (en) * 2014-05-30 2017-05-10 苹果公司 Multi-domain query completion
CN107169045A (en) * 2017-04-19 2017-09-15 中国人民解放军国防科学技术大学 A kind of query word method for automatically completing and device based on temporal signatures
CN109325635A (en) * 2018-10-25 2019-02-12 电子科技大学中山学院 Position prediction method based on automatic completion

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN102084363B (en) * 2008-07-03 2014-11-12 加利福尼亚大学董事会 A method for efficiently supporting interactive, fuzzy search on structured data
CN104052669B (en) * 2013-03-12 2018-12-07 凯为公司 For handling the device for the longest prefix match table being alternately arranged
CN108241695B (en) * 2016-12-26 2021-11-02 北京国双科技有限公司 Information processing method and device
CN108427756B (en) * 2018-03-16 2021-02-12 中国人民解放军国防科技大学 Personalized query word completion recommendation method and device based on same-class user model

Non-Patent Citations (1)

Title
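Approximate string matching algorithms and optimization techniques based on local filtering; Wang Yaoshu (王尧舒); China Master's Theses Full-text Database, Information Science and Technology; 2016-08-15; pp. I138-1404 *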

Also Published As

Publication number Publication date
CN110750704A (en) 2020-02-04
WO2021077585A1 (en) 2021-04-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant