WO2022012205A1 - Word completion method and device - Google Patents
Word completion method and device
- Publication number
- WO2022012205A1 (application PCT/CN2021/098072)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- words
- node
- hot
- user
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/322—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3322—Query formulation using system suggestions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9027—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Definitions
- the present application relates to the field of computer technology, and in particular, to a word completion method and device.
- Location search is widely used in various application scenarios such as map navigation, travel, and social communication.
- the location search specifically includes functions such as query suggestion, point of information (POI) search, and obtaining POI details.
- The query suggestion function accounts for 75% of search requests and provides users with search suggestions for incomplete query texts.
- When the user's input is incomplete, the query suggestion function completes it automatically and provides the user with text-related hot words (i.e., popular search terms) or popular POIs.
- When a user inputs an incomplete character string, the server usually performs prefix matching in a hot word database according to the character string input by the user and enumerates all hot words that satisfy the prefix match. It then sorts these hot words in descending order of search popularity or a similar measure, and takes the top N (Top-N) results as the hot words finally recommended to the user.
- In this approach, prefix matching is performed on the character string input by the user, and all hot words that satisfy the prefix match are obtained.
- The number of hot words that satisfy the prefix match is huge, especially when the input string is short. The search efficiency for hot words is therefore low, which affects the user experience.
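- The conventional approach described above can be sketched as follows. This is only an illustrative baseline, not code from the application: the function name `naive_complete`, the `HOT_WORDS` table, and its popularity counts are hypothetical.

```python
from typing import Dict, List


def naive_complete(prefix: str, popularity: Dict[str, int], top_n: int = 3) -> List[str]:
    """Baseline: enumerate every hot word with the prefix, then sort and cut to Top-N."""
    # Full scan of the hot word table; the match list grows as the prefix gets shorter.
    matches = [word for word in popularity if word.startswith(prefix)]
    # Descending popularity, ties broken alphabetically.
    matches.sort(key=lambda word: (-popularity[word], word))
    return matches[:top_n]


# Hypothetical hot word table: word -> search popularity.
HOT_WORDS = {"airport": 50, "air museum": 20, "aquarium": 30, "bank": 40}

result = naive_complete("a", HOT_WORDS, top_n=2)
print(result)
```

- Every query pays for the enumeration and the sort, which is the cost the scheme in this application aims to avoid.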
- the embodiment of the present application provides a word completion method, which is used to improve the word completion efficiency, and can avoid recommending words to the user when the user inputs a string that is too short.
- A first aspect of the embodiments of the present application provides a word completion method, including: obtaining a character string input by a user; and searching a dictionary tree (Trie) for a target node matching the character string, to output at least one word, where the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each of the first nodes stores at least one word that includes a string composed of the characters passed on the path from the root node to that first node, and the words stored by the target node include the output at least one word.
- An efficient dictionary tree structure is constructed as an improvement on the traditional dictionary tree: hot words are stored in the nodes of the dictionary tree, and each stored hot word includes the characters passed on the path from the root node to that node.
- The hot words stored in the nodes are words with a high probability of being completed. In other words, because hot words prefixed with a string that is too short have a low probability of being completed, they are not stored in the corresponding nodes of the dictionary tree.
- A target node that matches the character string input by the user is searched for in the improved dictionary tree; the string formed by the characters passed on the path from the root node of the dictionary tree to the target node is the same as the character string input by the user.
- Based on the words stored in the target node, at least one word is output as a recommended hot word. Because a completed hot word is output to the user only when the target node stores hot words, and the output is based on the words stored in the target node, there is no need to search for all hot words that meet the prefix condition for the string input by the user. This improves the efficiency of hot word completion and makes the completed hot words presented to the user better match the user's needs.
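- A minimal sketch of such a lookup is given below, assuming a node type with a `children` map and a `words` list. The names `Node`, `attach`, and `complete`, and the sample words, are hypothetical and only illustrate the idea of reading the answer from the target node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    words: List[str] = field(default_factory=list)  # non-empty means a "first node"


def attach(root: Node, prefix: str, words: List[str]) -> None:
    """Store hot words on the node reached by walking `prefix` from the root."""
    node = root
    for ch in prefix:
        node = node.children.setdefault(ch, Node())
    node.words = list(words)


def complete(root: Node, typed: str) -> List[str]:
    """Walk the typed characters; the answer is whatever the target node stores."""
    node = root
    for ch in typed:
        node = node.children.get(ch)
        if node is None:  # no node matches the typed string
            return []
    return list(node.words)


root = Node()
attach(root, "ai", ["airport", "air museum"])
attach(root, "aq", ["aquarium"])

print(complete(root, "a"))   # node "a" stores nothing, so nothing is recommended
print(complete(root, "ai"))
```

- The cost of a query is one step per typed character plus a copy of the stored list; no enumeration of the subtree is needed.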
- The prefix of the at least one word stored by the first node is the string composed of the characters passed on the path from the root node to the first node.
- That is, the stored hot words are all hot words prefixed with the string composed of the characters passed on the path from the root node to the first node, which conforms to the query logic of the dictionary tree and makes lookup more efficient.
- The Trie includes a plurality of second nodes, and the second nodes do not store words.
- That is, the Trie further includes a plurality of second nodes that do not store hot words.
- When the character string input by the user matches a second node, no hot word is output to the user, which can improve the user experience.
- The character string includes the first character to the Nth character arranged in the order input by the user. Searching the dictionary tree (Trie) for a target node matching the character string, to output at least one word, includes: searching the dictionary tree according to the input order, where the string consisting of the first character to the (N-1)th character input by the user matches a second node in the dictionary tree.
- While the user is entering the earlier characters, the matched node is a second node that does not store hot words, so no recommendation is output.
- The node matched by the string consisting of the first character to the (N-1)th character is still a second node that does not store hot words, so no hot word recommendation is output.
- After the Nth character is input, a first node that stores words is matched, namely the target node described above. Thus, when the character string input by the user is short, no recommended hot words are output, which avoids disturbing the user and improves the user experience.
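- The keystroke-by-keystroke behaviour can be illustrated with a small sketch. The nested-dict encoding of the tree, the helper names, and the sample word are assumptions made for this example only.

```python
from typing import Dict, List, Tuple


def make_node(words=()) -> Dict:
    """Hypothetical node encoding: child map plus stored hot words (empty = second node)."""
    return {"next": {}, "words": list(words)}


# "s" and "sh" are second nodes (no words); "sha" is a first node.
root = make_node()
root["next"]["s"] = make_node()
root["next"]["s"]["next"]["h"] = make_node()
root["next"]["s"]["next"]["h"]["next"]["a"] = make_node(["shanghai"])


def suggestions_per_keystroke(root: Dict, typed: str) -> List[Tuple[str, List[str]]]:
    """Return (string typed so far, recommended words) after each input character."""
    trace = []
    node = root
    for i, ch in enumerate(typed, start=1):
        node = node["next"].get(ch) if node is not None else None
        words = list(node["words"]) if node is not None else []
        trace.append((typed[:i], words))
    return trace


trace = suggestions_per_keystroke(root, "sha")
for typed_so_far, words in trace:
    print(repr(typed_so_far), words)
```

- In this sketch N is 3: the first two characters reach second nodes and produce no recommendation, and the third character reaches the target node.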
- The character string includes the first character to the Nth character arranged in the order input by the user, and the string formed by the first character to the (N-1)th character input by the user matches a first node in the dictionary tree; the words stored in that first node may be the same as or different from the words stored in the target node.
- The first node further stores a completion probability for each word prefixed with the string composed of the characters from the root node to the first node, where the completion probability of a word indicates the probability of outputting that word when the first node is matched.
- The completion probability of a word is the ratio, in the database, of the word frequency of the word to the sum of the word frequencies of all words prefixed with the characters from the root node to the first node.
- The first node stores a plurality of hot words, arranged in descending order of completion probability.
- Because the first node also stores the completion probability of each hot word, the order of the words recommended to the user can be determined from the completion probabilities, so that the user can quickly obtain words with a high completion probability, which improves the user experience.
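- The completion probability defined above is a simple frequency ratio, which the following sketch computes for one prefix. The function name and the word frequencies in `FREQ` are hypothetical.

```python
from typing import Dict


def completion_probabilities(prefix: str, freq: Dict[str, int]) -> Dict[str, float]:
    """Word frequency divided by the total frequency of all words sharing the prefix."""
    candidates = {word: f for word, f in freq.items() if word.startswith(prefix)}
    total = sum(candidates.values())
    if total == 0:
        return {}
    # Descending frequency (hence descending probability), ties broken alphabetically.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: f / total for word, f in ranked}


# Hypothetical word frequencies taken from some hot word database.
FREQ = {"airport": 60, "air museum": 20, "aquarium": 20}

probs = completion_probabilities("a", FREQ)
print(probs)
```

- With these frequencies, "airport" has probability 60/100 at node "a" and 60/80 at node "ai": the probability of a given word depends on the node.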
- The first node stores data of at least one key-value structure. The key-value structure includes a key and a value associated with the key: the key is a word prefixed with the characters from the root node to the first node, and the value is the completion probability of the word.
- Storing data of one or more key-value structures in the first node is a convenient and clear storage method that makes it easy to sort words.
- the order of the output at least one word is related to the arrangement order of words stored in the target node.
- In the word completion method, the words stored in the target node may be ordered in various ways: optionally, by the alphabetical order of the characters at the same position in different hot words; optionally, by the pinyin order of the characters at the same position in different hot words; or by the word frequency of the hot words.
- the order of the output at least one word may be the same as the order of the words stored in the target node, thereby reducing the time required for hot word recommendation and improving the efficiency of hot word recommendation.
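- One way to picture a node whose stored order is reused directly at query time is a key-value payload that is sorted once when the tree is built. The sketch below assumes a Python `dict` as the key-value structure (iteration follows insertion order in Python 3.7+); `build_payload` and `top_words` are hypothetical names.

```python
from typing import Dict, List


def build_payload(probabilities: Dict[str, float]) -> Dict[str, float]:
    """Key = word, value = completion probability, stored in descending order."""
    ordered = sorted(probabilities.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ordered)


def top_words(payload: Dict[str, float], n: int) -> List[str]:
    """Query time: no sorting, just the first n keys in stored order."""
    return list(payload)[:n]


payload = build_payload({"air museum": 0.25, "airport": 0.75})
print(top_words(payload, 1))
```

- Sorting at build time moves the ordering cost out of the query path, which is the effect the paragraph above attributes to outputting words in their stored order.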
- The words stored in the plurality of first nodes are derived from point of information (POI) data or user log data.
- the hot word stored in the first node may be derived from various existing hot word databases, such as POI data or user log data. Different databases can be selected according to actual application scenarios. For example, in map applications, hot words derived from POI data can usually be considered.
- A second aspect of the embodiments of the present application provides a word completion method, including: acquiring a character string input by a user; and searching a dictionary tree (Trie) for a target node matching the character string, to output at least one word, where the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each of the first nodes stores at least one word that includes the character string corresponding to that first node, and the words stored by the target node include the output at least one word.
- This method differs from the first aspect in the storage form of the nodes in the dictionary tree: the character string corresponding to each node is one character longer than that of the previous node and includes the string corresponding to the previous node on the path from the root node to the node. In this case, the first node stores at least one word that includes the character string corresponding to the first node.
- The parts other than the storage form of the dictionary tree are similar to the first aspect of the embodiments of the present application, and details are not repeated here.
- A third aspect of the embodiments of the present application provides a word completion method, including: acquiring a character string input by a user; and searching a first dictionary tree (Trie), constructed based on a first word database, for a target node matching the character string, to output a first word set, where the first word set includes at least one word, the first Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each of the first nodes stores at least one word that includes a string composed of the characters passed on the path from the root node to that first node, and the words stored in the target node include the words in the first word set.
- The target node stores at least one word prefixed with the string consisting of the characters from the root node to the first node.
- The method provided by the embodiments of the present application can combine the hot word sources of at least two hot word databases to recommend hot words for users. The nodes of the first dictionary tree (Trie) constructed based on the first word database store hot words, and each stored hot word includes the characters passed on the path from the root node to the node.
- The hot words stored in the nodes are words with a high probability of being completed. In other words, because hot words prefixed with a string that is too short have a low probability of being completed, they are not stored in the corresponding nodes of the dictionary tree.
- Because the first word set used for completion is output to the user only when the target node stores hot words, hot word completion is not triggered in scenarios where the user has input only a short string and the intent is unclear, or where the hot word is unlikely to be completed.
- Outputting hot words for the user by combining the first word set and the second word set can improve the accuracy of hot word recommendation.
- The first word set includes at least two words arranged in order; the second word set includes at least two words arranged in order; and the target word set includes the first word in the first word set and the first word in the second word set.
- That is, the hot words output to the user include the hot word ranked first in the first hot word set and the hot word ranked first in the second hot word set.
- The recommended hot words are therefore more likely to match the user's intent, which can improve the accuracy of hot word recommendation.
- Outputting at least one word recommended for the user according to the first word set and the second word set includes: determining, according to a preset first weight of the first word set and a preset second weight of the second word set, the completion probability of each word in the union of the first word set and the second word set; and determining, according to the completion probability, the at least one word recommended for the user.
- The method provided by the embodiments of the present application can set recommendation weights for hot word databases of different data sources according to the actual application scenario of hot word recommendation, such as the type of application, and use them to calculate the completion probability of each word in the union of the first word set and the second word set, which can improve the accuracy of hot word recommendation.
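- The text does not spell out how the two weights are combined. The sketch below assumes a weighted sum of the per-source completion probabilities, with a word missing from one source contributing zero; this combination rule, the function name, and the sample values are all assumptions for illustration.

```python
from typing import Dict, List


def merge_word_sets(first: Dict[str, float], second: Dict[str, float],
                    w1: float, w2: float, top_n: int) -> List[str]:
    """Score every word in the union as w1 * p_first + w2 * p_second (assumed rule)."""
    union = set(first) | set(second)
    scores = {word: w1 * first.get(word, 0.0) + w2 * second.get(word, 0.0)
              for word in union}
    # Descending combined score, ties broken alphabetically.
    return sorted(scores, key=lambda word: (-scores[word], word))[:top_n]


# Hypothetical completion results: word -> completion probability per source.
poi_set = {"airport": 0.6, "air museum": 0.4}   # from the first (POI) word database
log_set = {"airport": 0.3, "air show": 0.7}     # from the second (user log) word database

merged = merge_word_sets(poi_set, log_set, w1=0.5, w2=0.5, top_n=3)
print(merged)
```

- Raising `w1` favours the first word database and raising `w2` favours the second, which is how a per-scenario weighting could shift the recommendations.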
- Outputting the second word set prefixed with the character string based on the second word database includes: obtaining the second word set from a Trie constructed according to the user log word database; or inputting the character string into a machine learning algorithm trained on the user log word database to output the second word set; or obtaining the second word set from a hash tree constructed according to the user log word database.
- The data source of the first word database is the POI database, and the data source of the second word database is the user log database.
- The second word set can be obtained through a variety of existing hot word completion methods.
- By combining the completion results of POI data with the completion results of log data, both the timeliness of the completed hot words and their relevance to POIs can be ensured.
- A fourth aspect of the embodiments of the present application provides a method for constructing a dictionary tree, including: constructing a dictionary tree (Trie) according to a word database, where the dictionary tree includes a plurality of first nodes, and each of the first nodes stores at least one word that includes a string composed of the characters passed on the path from the root node to that first node; and pruning the words stored in the nodes, retaining in each node the words whose completion probability is greater than or equal to a first threshold, where the completion probability indicates the probability of outputting a word when the node is matched.
- An efficient method for constructing a dictionary tree structure is proposed as an improvement on the traditional dictionary tree: hot words are stored in the nodes of the dictionary tree, and each stored hot word includes the characters passed on the path from the root node to the node.
- The hot words stored in a node are words whose probability of being completed is greater than or equal to the first threshold. In other words, because hot words prefixed with a string that is too short have a low probability of being completed, they are not stored in the corresponding nodes of the dictionary tree.
- During completion, the target node that matches the string input by the user is found, and at least one word is output as a recommended hot word based on the words stored in the target node. Because completed hot words are output to the user only when the target node stores hot words, and all output hot words come from the words stored in the target node, there is no need to search for hot words that meet the prefix condition for the string input by the user, so the efficiency of hot word completion can be improved. In addition, this avoids triggering hot word completion in scenarios where the user's input string is short and the intent is unclear, or where the hot word is unlikely to be completed.
- the completion probability is a ratio of the word frequency of the word to the sum of the word frequencies of all words prefixed with characters from the root node to the first node.
- the first threshold is a preset value, and the value range is [0.1, 0.2].
- The nodes of the Trie store data of a key-value structure. The key-value structure includes a key and a value associated with the key: the key is a word prefixed with the characters from the root node to the node, and the value is the completion probability of the word, that is, the ratio of the word frequency of the word to the sum of the word frequencies of all words prefixed with the characters from the root node to the node.
- the method further includes: arranging words stored in the nodes of the Trie in descending order of completion probability.
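- The construction steps of the fourth aspect (build, compute completion probabilities, prune below the first threshold, sort in descending order) can be sketched as follows. This is a minimal illustration under stated assumptions: the names `Node`, `build_pruned_trie`, and `find`, the sample frequencies, and the choice of 0.1 from the stated range [0.1, 0.2] are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    # word -> frequency while building, word -> completion probability afterwards
    words: Dict[str, float] = field(default_factory=dict)


def build_pruned_trie(freq: Dict[str, int], threshold: float = 0.1) -> Node:
    root = Node()
    # Step 1: record each word's frequency on every node along its path.
    for word, f in freq.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
            node.words[word] = f
    # Step 2: per node, turn frequencies into probabilities, prune, and sort.
    stack = [root]
    while stack:
        node = stack.pop()
        total = sum(node.words.values())
        kept = {w: f / total for w, f in node.words.items() if f / total >= threshold}
        node.words = dict(sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])))
        stack.extend(node.children.values())
    return root


def find(root: Node, prefix: str) -> Optional[Node]:
    node: Optional[Node] = root
    for ch in prefix:
        node = node.children.get(ch) if node is not None else None
    return node


# Hypothetical word frequencies.
trie = build_pruned_trie({"airport": 90, "aquarium": 5, "air museum": 5}, threshold=0.1)
print(list(find(trie, "a").words))   # "aquarium" (5/100) is pruned at node "a"
print(list(find(trie, "aq").words))  # and first appears at node "aq"
```

- Because pruning is done per node, each word ends up with its own trigger length: a dominant word is offered early, while a rare word is offered only once the typed prefix has narrowed the candidates.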
- A fifth aspect of the embodiments of the present application provides a word completion device, including: an acquisition unit, configured to acquire a character string input by a user; and an output unit, configured to search a dictionary tree (Trie) for a target node matching the character string, to output at least one word, where the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each of the first nodes stores one or more words, the one or more words all include a string composed of the characters passed on the path from the root node of the Trie to the first node where the one or more words are located, and the words stored by the target node include the output at least one word.
- The prefix of the one or more words is the string composed of the characters passed on the path from the root node of the Trie to the first node where the one or more words are located.
- The Trie includes a plurality of second nodes, and each of the second nodes does not store a word.
- The character string includes the first character to the Nth character arranged in the order input by the user; the output unit is specifically configured to search the dictionary tree according to the input order, where the string consisting of the first character to the (N-1)th character input by the user matches a second node in the dictionary tree.
- The first node further stores a completion probability corresponding to each of the one or more words, where the completion probability of a word indicates the probability of outputting that word when the first node is matched.
- The first node stores data of at least one key-value structure. The key-value structure includes a key and a value associated with the key: the key is a word prefixed with the string consisting of the characters passed on the path from the root node to the first node, and the value is the completion probability of the word, where the completion probability indicates the probability of outputting the key when the first node is matched.
- the order of the output at least one word is related to the arrangement order of words stored in the target node.
- The words stored in the plurality of first nodes are derived from point of information (POI) data or user log data.
- A sixth aspect of the embodiments of the present application provides a word completion device, which differs from the word completion device of the fifth aspect in the storage form of the nodes in the dictionary tree.
- The character string corresponding to each node is one character longer than that of the previous node and includes the string corresponding to any node on the path from the root node to this node.
- The first node stores at least one word that includes the character string corresponding to the first node.
- The parts other than the storage form of the dictionary tree are similar to the fifth aspect of the embodiments of the present application, and details are not repeated here.
- A seventh aspect of the embodiments of the present application provides a word completion device, including: an acquisition unit, configured to acquire a character string input by a user; and an output unit, configured to search a first dictionary tree (Trie) for a target node matching the character string, to output a first word set, where the first word set includes at least one word, the first Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each of the first nodes stores at least one word that includes a string composed of the characters passed on the path from the root node of the Trie to that first node, the words stored in the target node include the words in the first word set, and the words stored in the plurality of first nodes come from a first word database. The output unit is further configured to output, based on a second word database, a second word set prefixed with the character string, and to output at least one word recommended for the user according to the first word set and the second word set.
- The first word set includes at least two words arranged in order; the second word set includes at least two words arranged in order; and the target word set includes the first word in the first word set and the first word in the second word set.
- The output unit is specifically configured to output at least one word in the union of the first word set and the second word set as a recommendation for the user according to the probability of each word in the union being output, where that probability is determined according to a preset first weight of the first word set and a preset second weight of the second word set.
- The output unit is specifically configured to: obtain the second word set from a Trie constructed from the user log word database; or input the character string into a machine learning algorithm trained on the user log word database to output the second word set; or obtain the second word set from a hash tree constructed from the user log word database.
- An eighth aspect of the embodiments of the present application provides an apparatus for constructing a dictionary tree, including: a construction unit, configured to construct a dictionary tree (Trie) according to a word database, where the dictionary tree includes a plurality of first nodes, and each of the first nodes stores at least one word that includes a string composed of the characters passed on the path from the root node to that first node; and a deletion unit, configured to prune the words stored in the nodes, retaining in each node the words whose completion probability is greater than or equal to a first threshold, where the completion probability indicates the probability of outputting the word when the node is matched.
- the completion probability is a ratio of the word frequency of the word to the sum of the word frequencies of all words prefixed with characters from the root node to the first node.
- the first threshold is a preset value, and the value range is [0.1, 0.2].
- The nodes of the Trie store data of a key-value structure. The key-value structure includes a key and a value associated with the key: the key is a word prefixed with the characters from the root node to the node, and the value is the completion probability of the word, that is, the ratio of the word frequency of the word to the sum of the word frequencies of all words prefixed with the characters from the root node to the node.
- the method further includes: arranging words stored in the nodes of the Trie in descending order of completion probability.
- A ninth aspect of the embodiments of the present application provides a terminal, including one or more processors and a memory, where the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the terminal to implement the method described in any one of the first to fourth aspects and their possible implementations.
- A tenth aspect of the embodiments of the present application provides a server, including one or more processors and a memory, where the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the server to implement the method described in any one of the first to fourth aspects and their possible implementations.
- An eleventh aspect of the embodiments of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the method described in any one of the first to fourth aspects and their possible implementations.
- A twelfth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions which, when executed on a computer, cause the computer to execute the method described in any one of the first to fourth aspects and their possible implementations.
- a sixth aspect of the embodiments of the present application provides a chip, including a processor.
- the processor is configured to read and execute the computer program stored in the memory, so as to execute the method described in any one of the above-mentioned first to fourth aspects and various possible implementation manners.
- The chip includes a memory, and the processor is connected to the memory through a circuit or a wire.
- the chip further includes a communication interface, and the processor is connected to the communication interface.
- the communication interface is used for receiving data and/or information to be processed, the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface.
- the communication interface may be an input-output interface.
- the embodiments of the present application have the following advantages:
- The hot word completion method performs word completion based on an improved dictionary tree that includes a plurality of first nodes storing words. The improved dictionary tree is searched for the target node that matches the character string input by the user, and at least one word is output as a recommended hot word based on the words stored in the target node. Because a completed hot word is output to the user only when the target node stores hot words, and the output is based on the words stored in the target node, there is no need to search for hot words that meet the prefix condition for the string input by the user, so the efficiency of hot word completion can be improved.
- The completion probability of the hot words is used for screening.
- This scheme calculates the completion probability of each hot word prefixed by the string corresponding to the node.
- The completion trigger length of each hot word can therefore be determined individually, which better matches actual completion needs. This avoids triggering hot word completion when the string input by the user is short and the intent is unclear, or when the hot word is unlikely to be completed, and effectively improves completion efficiency.
- This solution produces its output from the completion results of the first word database and the second word database. For example, combining the completion results of POI data and log data can ensure both the timeliness of the completed hot words and their relevance to POIs.
- FIG. 1 is a schematic diagram of an example of a basic Trie tree structure.
- FIG. 2a is a schematic diagram of a hot word completion interface.
- FIG. 2b is a schematic diagram of a system architecture of a hot word completion method in an embodiment of the present application.
- FIG. 3a is a schematic diagram of an embodiment of constructing a C-Trie tree in an embodiment of the present application.
- FIG. 3b is a schematic diagram of hot word data stored by a node of a C-Trie tree in an embodiment of the present application.
- FIG. 3c is a schematic diagram of the hot word data stored by the nodes of the C-Trie tree in an embodiment of the present application.
- FIG. 3d is another schematic diagram of hot word data stored by a node of a C-Trie tree in an embodiment of the present application.
- FIG. 4 is a schematic diagram of an embodiment of a method for completing words in an embodiment of the present application.
- FIG. 5 is a schematic diagram of another embodiment of the method for completing words in the embodiment of the present application.
- FIG. 6 is a schematic diagram of an input and output structure of a prediction model in an embodiment of the present application.
- FIG. 7 is a schematic diagram of another embodiment of the method for completing words in the embodiment of the present application.
- FIG. 8 is a schematic diagram of an embodiment of a word completion device in an embodiment of the present application.
- FIG. 9 is a schematic diagram of another embodiment of a word completion device in an embodiment of the present application.
- FIG. 10 is a schematic diagram of another embodiment of the device for completing words in the embodiment of the present application.
- FIG. 11 is a schematic diagram of an embodiment of an apparatus for constructing a Trie tree in an embodiment of the present application.
- FIG. 12 is a schematic diagram of an embodiment of a terminal in an embodiment of the present application.
- FIG. 13 is a schematic diagram of an embodiment of a server in an embodiment of the present application.
- the embodiment of the present application provides a word completion method, which is used to improve the word completion efficiency, and can avoid recommending words to the user when the user inputs a string that is too short.
- the word completion method proposed in the present application is applied to the search field. Specifically, when the user inputs characters one by one in the search bar area, a complete word is recommended for the user by predicting the search word the user expects, so that the user does not need to input the complete word character by character and can directly choose from the recommended words, which improves the user's search efficiency and the user experience.
- the data sources of the words recommended for the user include historical data such as user logs. After statistics and screening, combined with the incomplete characters that the user has entered, the words that are as close to the user's expectation as possible are recommended.
- the word mentioned in the embodiments of the present application is a broad concept. Depending on the user's search language, the "word" includes words in different languages; depending on the data source, the word may be a single word or a phrase consisting of multiple words.
- the "words" recommended for users can be obtained through statistics and screening of a word database built from user logs and other sources. Since they are usually popular search words with a high word frequency in the database, they are often called "hot words" in the technical field; these are the "words" referred to in the embodiments of the present application.
- the word frequency source of the word is not limited in this application.
- the data source of the word can be a hot word database commonly used in the prior art, which is also not limited in this application. In the following embodiments, "hot words" are used for introduction.
- a POI can be a house, a shop, a mailbox, a bus stop, etc.
- POI refers to the name of the place on the map, such as “Golden Mountain Park”, “illy Cafe”, “Xintianxia Apartment”, “China Post” and so on.
- POI hot words are selected based on POI data. For example, the word frequency of each word in the POI data is counted, and a certain percentage of words with the highest word frequency are selected as POI hot words.
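The selection described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function name `select_hot_words`, the `top_fraction` parameter, and the sample data are assumptions, and the input is assumed to be already split into words.

```python
import math
from collections import Counter


def select_hot_words(words, top_fraction):
    """Count word frequencies and keep the most frequent fraction of distinct words."""
    counts = Counter(words)
    if not counts:
        return []
    # Keep at least one word; round the cut-off up (an assumption of this sketch).
    keep = max(1, math.ceil(len(counts) * top_fraction))
    return counts.most_common(keep)   # [(word, frequency)], most frequent first


# Hypothetical word occurrences extracted from POI names.
poi_words = ["hotel"] * 5 + ["starbucks"] * 3 + ["park"] + ["bank"]
hot_words = select_hot_words(poi_words, top_fraction=0.5)
```

The returned frequencies are what a later step would need in order to derive completion probabilities.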
- the query text entered by the user in the search box is usually an incomplete word and may include one or more characters, which are collectively referred to as character strings in this embodiment of the present application.
- a dictionary tree, also called a search tree or a prefix tree, is referred to as a Trie tree in this embodiment.
- the typical application of Trie tree is to count, sort and save a large number of strings, so it is often used for text word frequency statistics by search engine systems. Its advantages are: use the common prefix of strings to reduce query time and minimize unnecessary string comparisons.
- a common solution for incomplete prefix completion is to build a Trie tree for all candidate words, and search strings in the Trie tree to improve query efficiency.
- please refer to FIG. 1 for a schematic diagram of an example of a basic Trie tree structure. If there are 5 strings, namely "code", "cook", "file", "fat", and "find", the Trie tree structure established for them is shown in FIG. 1.
- the method of finding a node matching a string through the Trie tree is to start from the root node, first determine the node that is the same as the first character in the string, then find, among the child nodes of that node, the node that is the same as the second character in the string, and so on.
- the children of a node are nodes that are directly connected down to the node.
- the characters corresponding to the child nodes of a node are not the same.
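The basic Trie tree and the character-by-character search described above can be sketched as follows, using the five example strings. This is an illustrative sketch; the class and function names (`TrieNode`, `insert`, `find_node`, `words_with_prefix`) are assumptions and do not come from the application.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> child TrieNode (characters are distinct)
        self.is_word = False   # True if a stored string ends at this node


def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def find_node(root, prefix):
    """Walk down from the root one character at a time; None if nothing matches."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def words_with_prefix(root, prefix):
    """Collect every stored string below the node that matches `prefix`."""
    start = find_node(root, prefix)
    if start is None:
        return []
    found, stack = [], [(start, prefix)]
    while stack:
        node, text = stack.pop()
        if node.is_word:
            found.append(text)
        for ch, child in node.children.items():
            stack.append((child, text + ch))
    return sorted(found)


root = TrieNode()
for w in ["code", "cook", "file", "fat", "find"]:
    insert(root, w)
```

Note that `words_with_prefix` has to traverse the whole subtree below the matched node; this is the per-query search cost that storing hot words directly on nodes is meant to avoid.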
- the server or terminal recommends the hot word "starbucks" for the user; the user can click on the hot word and then use it for a POI search to obtain a more comprehensive and detailed list of search results.
- the completed hot words are usually word categories such as category or brand name, such as "hotel”, “starbucks”, etc., rather than pointing to a specific place, such as "Oriental Pearl”.
- the present invention constructs a C-Trie tree; thanks to the special structure of the C-Trie, the candidate hot words can be conditionally screened, sorted, and additionally stored, which greatly improves hot word completion performance.
- the timeliness of the completion hot words and the correlation with POI data can be guaranteed at the same time.
- FIG. 2b is a schematic diagram of the system architecture of the hot word completion method in the embodiment of the present application.
- the system architecture includes a user, a terminal and a server, and the terminal and the server are connected through various communication links.
- the user inputs a character string through the input module of the terminal for searching, the terminal searches locally according to the user input, and outputs the recommended hot words to the user through the display module of the terminal.
- the terminal sends the request input by the user to the server, the server completes the hot word according to the character string input by the user, and sends the output hot word to the terminal, and the terminal receives the hot word sent by the server and outputs it to the user through the output module .
- the hot word completion method provided in the embodiment of the present application may be executed by a terminal, or may be executed by a network device, such as a server, which is not specifically limited.
- the traditional Trie tree is improved so that only hot words that meet the conditions are stored, and only in some nodes of the Trie tree. Therefore, in the embodiment of the present application, the improved Trie tree is called a conditional dictionary tree, or C-Trie tree for short. It can be understood that any Trie tree that stores hot words meeting certain conditions on some of its nodes belongs to the improved Trie tree of the embodiment of the present application; the present application does not limit the name of the improved Trie tree.
- the C-Trie tree adds a key-value (Key-Value, KV) structure to some nodes of the common Trie tree, where Key is a certain hot word prefixed by the string represented by the node.
- for example, the string represented by the node is "wine", and the key of one piece of KV-structured data stored by the node is "hotel".
- the value is the ratio of the word frequency of the hot word corresponding to the key to the total word frequency of all hot words prefixed by the string of this node, that is, the probability that this hot word is the completion, which is referred to as the "completion probability" in this embodiment of the present application.
- for example, if two hot words are prefixed with "wine", namely "hotel" and "bar", the word frequency of "hotel" is 70, and the word frequency of "bar" is 30, then the value corresponding to the key "hotel" is 70%. (The prefix relation is lost in translation; it holds in Chinese, where 酒 "wine" is the first character of both 酒店 "hotel" and 酒吧 "bar".)
- the K-V structure data on each node is sorted in descending order of value, that is, of completion probability.
- other data formats can also be used to store hot words and completion probabilities on some nodes of the C-Trie tree, such as a mapping table, a linked list, etc., which is not limited in this application.
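The completion probability of the example above (word frequencies 70 and 30) can be worked through in a few lines. This is a sketch only; the function name `completion_probabilities` is an assumption, and a list of pairs is used here as one possible stand-in for the KV structure.

```python
def completion_probabilities(word_freq):
    """Map {hot word: frequency} to [(hot word, completion probability)], sorted descending.

    `word_freq` holds the hot words that share the prefix represented by one node.
    """
    total = sum(word_freq.values())
    if total == 0:
        return []
    pairs = [(word, freq / total) for word, freq in word_freq.items()]
    pairs.sort(key=lambda kv: kv[1], reverse=True)
    return pairs


# The two hot words sharing one prefix node in the example above.
kv = completion_probabilities({"bar": 30, "hotel": 70})
```

Here `kv` holds "hotel" with 0.7 first and "bar" with 0.3 second, matching the 70% in the text.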
- Step 1: Build an ordinary Trie tree based on the hot word database
- each node in the figure shows the character string represented by the node, that is, the characters passed on the path from the root node to the node, concatenated in order.
- each node can also directly represent the string shown in Figure 3a.
- the hot word database may be an existing hot word database from any source, such as a hot word database based on a user log or a POI hot word database, etc., which is not specifically limited here.
- Step 2: Store hot words on multiple nodes of the Trie tree
- the hot words stored in each node include at least one word that contains the string composed of the characters passed on the path from the root node to the node.
- the hot words stored in each node include at least one word prefixed with a string composed of characters passed on the path from the root node to the node.
- the source of the word can be the hot word database for building the Trie tree in step 1, or another hot word database that is different from the hot word database for building the Trie tree in step 1.
- the source of the stored hot words is not limited here.
- since the Trie tree structure itself stores hot words according to their common prefixes, for each node, all words prefixed with the characters on the path from the root node to the node can be obtained one by one by querying the child nodes of the node. In this embodiment, however, a node storing hot words does not mean obtaining hot words by querying the Trie tree, but storing the hot words directly at the node. When the solution is used, once the target node is obtained, the hot words stored in it can be read directly, with no need to query further through the Trie tree structure.
- each node also stores the completion probability corresponding to the hot word, and the completion probability indicates the probability of outputting the hot word when the node is matched.
- the hot words stored on the nodes are K-V structured data, and each node may store one or more K-V structured data.
- Key is a hot word prefixed by the string represented by the node
- Value is the completion probability of the hot word
- the completion probability is the proportion of the word frequency of the hot word in the total word frequency of all hot words prefixed by the string represented by the node. In other words, among all the hot words prefixed by this node, the ratio of the word frequency of the hot word to the total word frequency of all those hot words is the value corresponding to the hot word, that is, the probability that the hot word is the completion.
- the completion probability is the ratio of the word frequency of a hot word prefixed with the characters on the path from the root node to the node to the sum of the word frequencies of all hot words prefixed with the characters on that path.
- for example, the hot words prefixed with "st" include "starbucks", "state", "start", "stir", etc. These hot words are used as keys, and the completion probability of each hot word is calculated. Among them, the completion probability of "starbucks" is 0.012, that of "state" is 0.0012, that of "start" is 0.002, and that of "stir" is 0.0004.
- hot word data of K-V structure is added to each node of the Trie tree.
- step 1 and step 2 can be performed synchronously, that is, while constructing each node of the trie tree, hot words are stored for each node.
- Step 3: Prune the hot words stored in each node.
- the hot words stored in each node are pruned according to the first threshold; optionally, the data of the K-V structure is pruned.
- if the completion probability of a hot word is less than the first threshold, its K-V structure is deleted.
- the first threshold is a number in the interval (0, 1), and the specific value is not limited, for example 0.15, 0.18 or 0.2. It can be understood that by setting the first threshold reasonably, hot words can be recommended to the user at the right time. Under the same conditions, the larger the first threshold, the longer the string needed to trigger the hot word recommendation, and the more likely the recommended hot word is to match what the user expects.
- the value of the first threshold needs to be set reasonably according to the actual hot word completion scenario; it can be preset or adjusted according to the application scenario or user needs.
- Step 1, Step 2 and Step 3 can be executed synchronously, that is, while constructing each node of the trie tree, store the hot words for each node, and delete the hot words synchronously.
- Step 4: Sort the hot words stored on the node
- the hot words stored on the node can be arranged in order, and the sorting rules are not limited here. For example, for English hot words, they are sorted according to the order of the first letter, and the order of the second letter of the same first letter is compared, and so on; for Chinese hot words Words, sorted according to the pinyin sequence of characters; or, sorted according to the word frequency of hot words.
- optionally, the remaining K-V hot word data on each node is sorted in descending order of value. As shown in FIG. 3d, the hot words stored in the node "star" are sorted in descending order of completion probability.
- step 4 is an optional step, which may or may not be performed.
- the execution order of step 3 and step 4 is not limited.
- the C-Trie tree involved in the embodiment of the present application is constructed and obtained, and it can be seen that the main features of the C-Trie tree are:
- the node may not be associated with the data of the K-V structure, or may be associated with one or more pairs of data of the K-V structure;
- the completion probability of the indicated hot word is greater than or equal to the first threshold
- the data of the K-V structure stored by a node is arranged in descending order of the completion probability of the hot words.
- the C-Trie tree may delete some nodes that do not store hot words, which will not be repeated here.
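Steps 1 to 4 above can be sketched together as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the names `CTrieNode`, `build_ctrie` and `find_node`, the sample frequencies, and the threshold of 0.5 are all hypothetical, and word frequencies are assumed to be positive.

```python
class CTrieNode:
    def __init__(self):
        self.children = {}   # character -> child CTrieNode
        self.freq = {}       # hot word -> word frequency, for words with this prefix
        self.hot = []        # [(hot word, completion probability)] kept after pruning


def build_ctrie(word_freq, first_threshold):
    root = CTrieNode()
    # Steps 1 and 2: build the tree and record each hot word on every node
    # along its path, i.e. on every node whose string is a prefix of the word.
    for word, freq in word_freq.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, CTrieNode())
            node.freq[word] = freq
    # Steps 3 and 4: per node, convert frequencies to completion probabilities,
    # drop entries below the first threshold, and sort the rest in descending order.
    stack = list(root.children.values())
    while stack:
        node = stack.pop()
        total = sum(node.freq.values())
        kept = [(w, f / total) for w, f in node.freq.items()
                if total and f / total >= first_threshold]
        node.hot = sorted(kept, key=lambda kv: kv[1], reverse=True)
        stack.extend(node.children.values())
    return root


def find_node(root, text):
    node = root
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


# Hypothetical hot words and frequencies, with an illustrative threshold.
tree = build_ctrie({"starbucks": 40, "starhub": 30, "state": 30}, first_threshold=0.5)
```

With these sample values the nodes for "s", "st" and "sta" keep no hot words, because no single hot word reaches the threshold there, while the node for "star" keeps only "starbucks" (40/70). This illustrates how the threshold postpones completion until the input is long enough.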
- since some nodes of the C-Trie tree store words, the C-Trie tree is searched for the target node that matches the string input by the user and, based on the words stored in the target node, at least one word is output to the user as a recommended hot word. Since a completed hot word is output to the user only when a target node that stores hot words is matched, and the output is taken from the words stored in the target node, there is no need to search for hot words that satisfy the prefix condition based on the string input by the user, which improves the efficiency of hot word completion.
- FIG. 4 is a schematic diagram of an embodiment of a hot word completion method in an embodiment of the present application.
- the server or terminal obtains the search text input by the user. Since the user needs to input character by character, the search text is usually an incomplete string. Intelligent completion based on the prefix of the string input by the user at an appropriate time can save the user's input time and improve the user experience.
- the output hot words are recommended to the user, who can select the word that meets their expectation from the at least one output word without having to input the complete word. This also helps when the user has difficulty spelling, or has partially forgotten, the word they want to input, and improves the user experience.
- the character string obtained by sequentially concatenating the characters on the path from the root node to a certain node is the character string corresponding to the node.
- the Trie tree in the embodiment of the present application is the C-Trie tree introduced in the above-mentioned embodiments.
- the character string represented by the node of the C-Trie tree is the same as that of the conventional Trie tree.
- some nodes of the C-Trie tree store hot words.
- the node storing the hot words may be called the first node, and the C-Trie tree includes multiple first nodes.
- the hot word stored in the first node of the C-Trie tree includes a character string obtained by sequentially concatenating characters passed by a path from the root node to the first node.
- for example, the character string obtained by sequentially concatenating the characters on the path from the root node to the first node is "rea".
- the hot words stored in the first node include words prefixed with "rea", such as "read"; they can also include words that contain "rea" elsewhere in the word.
- the hot word stored in the first node of the C-Trie tree is a hot word prefixed with the character string obtained by concatenating, in sequence, the characters on the path from the root node to the node. For example, if that string is "rea", the hot words stored in the first node are "ready", "rear", "really", etc.
- a node of the C-Trie tree stores data in a key-value pair (KV) structure, where K is a hot word prefixed with the string represented by the node, and V is the completion probability of that hot word.
- the completion probability indicates the probability of outputting the hot word when the first node is matched.
- the completion probability of the reserved hot words is greater than or equal to the first threshold, that is, the hot words whose completion probability is less than the first threshold are deleted.
- some nodes of the C-Trie tree may not be associated with any KV-structured data, which corresponds to scenarios where the string is too short. Optionally, among the multiple hot words prefixed by the string corresponding to a node, some have a completion probability less than the first threshold and are deleted, while others have a completion probability greater than or equal to the first threshold and are retained; the node then stores some hot words and belongs to the first nodes. Optionally, the completion probabilities of all the hot words prefixed by the string corresponding to a node are greater than or equal to the first threshold and all are retained; the node stores the hot words and belongs to the first nodes.
- the node that does not store the hot word is called the second node, and the C-Trie tree includes a plurality of second nodes.
- a first node of the C-Trie tree may be associated with a pair of KV-structured data, including that there is only one hot word prefixed by the character corresponding to the node, or the completion probability of only one hot word is greater than or equal to the first threshold;
- a first node of the C-Trie tree may also be associated with multiple pairs of KV structure data.
- the specific number of pieces of KV-structured data is not limited; they can also be pruned by completion probability, retaining only a preset number of KV-structured data entries corresponding to the hot words with the largest completion probabilities.
- hot word databases from different sources may be used when constructing the C-Trie tree in this embodiment of the present application.
- the hot word database may be a user log hot word database, in which the log hot words are popular search words obtained by filtering log data; the hot word database can also be a POI hot word database.
- POI hot words are obtained by counting the word frequency of each word in the POI data of the map, and selecting a certain percentage of words with the highest word frequency.
- the server or terminal searches the C-Trie tree according to the character string, and can determine the target node matching the character string, that is, the character string represented by the target node is consistent with the character string.
- if the target node stores hot word data, at least one hot word is output as a recommended hot word according to the stored hot words.
- the number of recommended hot words is limited, which can be preset by the server or set according to user needs.
- the specific number of hot words is not limited. If the set number of recommended hot words is less than the hot words stored on the target node, the output hot words are part of all the hot words stored on the target node.
- the hot words stored on the node can be arranged in order, and the sorting rules are not limited here. For example, for English hot words, they are sorted according to the order of the first letter, and the order of the second letter of the same first letter is compared, and so on; for Chinese hot words Words, sorted according to the pinyin sequence of characters; or, sorted according to the word frequency of hot words.
- sort the hot word data of the remaining K-V structure on each node in descending order according to the value of the value. As shown in Figure 3d, the hot words stored in the node "star" are sorted in descending order of completion probability.
- the order of the at least one output word is related to the arrangement order of the hot words stored in the target node. For example, the first N hot words are output directly in the arrangement order in which they are stored in the target node, where N is the preset number of recommended hot words.
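The lookup and output described above can be sketched as follows. The sketch assumes a tree whose first nodes already hold hot words sorted by completion probability; the names `Node`, `add_path` and `complete`, and the stored values, are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.hot = []        # [(hot word, completion probability)], sorted descending


def add_path(root, text, hot):
    """Create the nodes along `text` and attach `hot` to the last one."""
    node = root
    for ch in text:
        node = node.children.setdefault(ch, Node())
    node.hot = hot


def complete(root, typed, n):
    """Return up to n recommended hot words, or [] if completion is not triggered."""
    node = root
    for ch in typed:
        node = node.children.get(ch)
        if node is None:
            return []                      # no node matches the typed string
    # A second node stores nothing, so the slice is empty and nothing is recommended.
    return [word for word, _ in node.hot[:n]]


# Hypothetical stored data: only the node for "star" is a first node.
root = Node()
add_path(root, "star", [("starbucks", 0.6), ("starhub", 0.3), ("starstreet", 0.1)])
```

Because the stored list is already pruned and sorted, the output step is a slice of the target node's data rather than a traversal of its subtree.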
- the hot word completion method provided by the embodiment of the present application can also combine the improved Trie tree with the existing hot word completion method to recommend hot words for users.
- the following is a detailed introduction. Please refer to FIG. 5, which shows another embodiment of the present application.
- Either a server or a terminal can be used as the execution body of the solution.
- the following takes the implementation of the solution by a server as an example for introduction.
- the server obtains the search text input by the user. Since the user needs to input character by character, the search text is usually an incomplete string. Intelligent completion based on the prefix of the string input by the user at an appropriate time can save the user's input time and improve the user experience.
- the server searches the pre-built C-Trie tree according to the string, and can determine the target node that matches the string.
- the character string obtained by connecting the characters on the path from the root node to a certain node in sequence is the character string corresponding to the node.
- the string corresponding to the target node that matches the string is the same as the string entered by the user.
- the C-Trie tree in this embodiment is constructed based on the POI hot word database.
- the specific process of constructing the C-Trie tree please refer to the preceding embodiment, and details are not repeated here.
- if the target node stores hot word data, a first hot word set prefixed with the character string is output according to the hot word data;
- when the target node stores hot word data, the stored hot words prefixed with the character string are used as the output; optionally, the N hot words with the highest completion probability are selected from the hot word data stored in the target node.
- N hot words are the first hot word set, N is an integer greater than or equal to 1, and N is a preset value.
- the specific value is not limited here.
- for example, N is 3. It can be understood that, if the number of hot words prefixed with the string is less than N, it is sufficient to output all the hot words as the hot words in the first hot word set.
- optionally, based on the user log hot word database, a user log hot word Trie is constructed to perform hot word completion and obtain the second hot word set;
- the character string is input into a machine learning algorithm trained based on the user log hot word database, and the second hot word set is output, and the machine learning algorithm includes a recurrent neural network (recurrent neural network, RNN), Long short-term memory network (long short-term memory, LSTM), gated recurrent unit (GRU) or support vector machine (SVM), etc.
- a user log hot word hash tree is constructed to perform hot word completion, and the second hot word set is obtained.
- the prediction model is used to predict the hot word result for the character string input by the user, and the Top-N result is taken as the second hot word set.
- the prediction model used in this embodiment is GRU, and the internal structure and principle of the GRU will not be described in detail.
- the general input and output structure of the GRU is shown in FIG. 6 .
- the time series model GRU has an input character x_t at the current moment and a hidden state h_{t-1} passed down from the previous moment, which contains the relevant information of the previous node. Combining x_t and h_{t-1}, the GRU produces the output y_t at the current moment and the hidden state h_t passed to the next moment, until the output y is the terminator.
- all the predicted characters y are spliced into the string input by the user, which is the final predicted hot word result.
- each column represents the input and output at a certain moment.
- the predicted Top-3 results are starbucks, starhub, and starstreet, with completion probabilities of 0.77, 0.15, and 0.08, respectively.
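The data flow of this character-by-character prediction can be sketched as a decoding loop. Note the assumptions: `toy_step` below is a hand-written lookup that stands in for a trained GRU cell and is not a GRU; the terminator symbol `END`, the `max_len` cut-off, and the function names are also hypothetical. Only the loop structure (input character plus previous hidden state in, output character plus new hidden state out, until the terminator) follows the description.

```python
END = "\n"   # assumed terminator symbol


def toy_step(x_t, h_prev):
    """Stand-in for a trained GRU cell: takes (x_t, h_{t-1}), returns (y_t, h_t)."""
    h_t = h_prev + x_t                 # the toy "hidden state" is just the text seen so far
    target = "starbucks"               # the only word this toy model can complete
    if target.startswith(h_t) and len(h_t) < len(target):
        return target[len(h_t)], h_t   # predict the next character
    return END, h_t


def greedy_complete(typed, step, max_len=32):
    h, y = "", END
    for ch in typed:                   # feed the characters the user has typed
        y, h = step(ch, h)
    out = typed
    while y != END and len(out) < max_len:
        out += y                       # splice the predicted character onto the string
        y, h = step(y, h)              # the prediction becomes the next input
    return out


prediction = greedy_complete("star", toy_step)
```

This loop yields a single result; a Top-N variant would keep several candidate strings with their probabilities (for example with beam search), which this sketch does not attempt.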
- the machine learning algorithm deployed in the server can train user log hot words in the form of online learning, which can reflect changes in user log data in real time and ensure the timeliness of completing hot words.
- the server determines at least one hot word that is finally recommended to the user according to the first hot word set and the second hot word set, and sends it to the terminal, and the terminal displays the recommended hot word to the user through a display device.
- the terminal determines at least one hot word recommended to the user according to the first hot word set and the second hot word set and displays it to the user.
- the server can fuse the results with higher confidence and reorder them to recommend Top-N hot word results.
- the at least one hot word finally recommended may all belong to the first hot word set, may all belong to the second hot word set, or may include hot words from both the first hot word set and the second hot word set, which is not specifically limited here.
- the Top-1 hot words are taken from the POI hot word completion candidate set and from the log hot word completion candidate set and added to the hot word completion result set, and these two Top-1 words are removed from the candidate sets.
- the remaining 2N-2 candidate results are weighted, summed and sorted, with the weights adjusted according to the business scenario, and the Top (N-K) results are taken and added to the hot word completion result set.
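One way to read the fusion step above is sketched below. Several points are assumptions of this sketch rather than statements of the application: K is taken to be the number of distinct Top-1 words already placed in the result set, the weighted sum is taken over the completion probabilities, ties are broken alphabetically, and the names `fuse`, `w_poi`, `w_log` and the sample candidates are hypothetical.

```python
def fuse(poi, log, n, w_poi=0.5, w_log=0.5):
    """Merge two candidate lists of (word, probability), each sorted descending."""
    result = []
    # Take the Top-1 word of each candidate set first (one entry if they coincide).
    for cands in (poi, log):
        if cands and cands[0][0] not in result:
            result.append(cands[0][0])
    # Weight and sum the scores of the remaining candidates.
    scores = {}
    for cands, weight in ((poi, w_poi), (log, w_log)):
        for word, prob in cands[1:]:
            if word not in result:
                scores[word] = scores.get(word, 0.0) + weight * prob
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    # K = len(result); fill the remaining N-K places from the ranked list.
    return (result + ranked)[:n]


# Hypothetical candidate sets for the typed string "star".
poi_set = [("starbucks", 0.6), ("star hotel", 0.3), ("starhub", 0.1)]
log_set = [("starhub", 0.5), ("starbucks", 0.3), ("star wars", 0.2)]
recommended = fuse(poi_set, log_set, n=3)
```

Placing both Top-1 words first guarantees that each source contributes its most confident result, and the weights let a deployment favor POI relevance or log timeliness for the remaining places.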
- the embodiment of the present application uses POI data to construct a C-Trie tree, performs conditional screening on hot words with common prefixes, filters hot words with too low probability, reduces the number of prefix candidate hot words, and stores candidate hot words in an orderly manner.
- the hot word completion starts when the user inputs the first letter
- the solution of the embodiment of the present application judges whether to trigger hot word completion according to the C-Trie tree. This avoids triggering hot word completion when the user's input is too short and the search intent is not yet clear, which would interfere with non-hot-word search logic; it solves the problems of completion being triggered by too short an input and of low completion efficiency, and can effectively improve hot word completion performance.
- the hot word completion method integrates the POI hot word database and the user log hot word database, which can improve the correlation between the recommended hot words and the POI data, and reduce the situation that no results are returned when the POI search is performed according to the recommended hot words .
- FIG. 7 is a schematic diagram of another embodiment of the hot word completion method in the embodiment of the present application.
- S1: Select POI hot words based on POI data. The word frequency of each word in the POI data is counted, and a certain percentage of words with the highest word frequency are selected as POI hot words.
- S3: Select log hot words based on log data. The word frequency of each search word in the log data is counted, and a certain percentage of words with the highest word frequency are selected as the log hot words.
- a character-level generation model is trained to predict and complete incomplete strings.
- the prediction models include but are not limited to commonly used sequential models such as RNN and LSTM.
- the input to the model is an incomplete string and the output is the predicted complete hotword.
- S5: For the character string input by the user, determine in the C-Trie tree whether the hot word completion condition is satisfied. The judgment is based on whether the node corresponding to the string in the C-Trie tree stores KV-structured data. If the corresponding node does not contain KV-structured data, hot word completion is not triggered; if the corresponding node contains KV-structured data, hot word completion is triggered. When the hot word completion condition is met, the Top-N results are taken from the KV-structured data of the node corresponding to the string, in descending order of completion probability, as the hot word completion candidate set based on POI data.
- FIG. 8 is a schematic diagram of an embodiment of the word completion device in the embodiment of this application.
- the software or firmware includes, but is not limited to, computer program instructions or code, and can be executed by a hardware processor.
- the hardware includes, but is not limited to, various types of integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or an application specific integrated circuit (ASIC).
- an output unit 802, configured to search a dictionary tree (Trie) for a target node matching the character string and to output at least one word, where the Trie includes a plurality of first nodes and the target node is one of the plurality of first nodes; each first node stores one or more words, the one or more words include the string composed of the characters passed on the path from the root node of the Trie to the first node where the one or more words are located, and the words stored in the target node include the at least one output word.
- the prefix of the one or more words is a string composed of characters passing on the path from the root node of the Trie tree to the first node where the one or more words are located.
- the Trie includes a plurality of second nodes, each of which does not store a word.
- the character string includes the first character to the Nth character arranged in the order of user input; the output unit 802 is specifically configured to search the dictionary tree in the input order, where the character string consisting of the first character to the (N-1)th character input by the user is matched to a second node in the dictionary tree.
- the first node also stores a completion probability for each of the one or more words, where the completion probability of a word indicates the probability of outputting that word in the case where the first node is matched.
- the first node stores data of at least one key-value structure
- the key-value structure includes a key and a value associated with the key
- the key is a word whose prefix is the character string consisting of the characters on the path from the root node to the first node
- the value is the completion probability of the word, where the completion probability indicates the probability of outputting the key in the case where the first node is matched.
- the order of the output of the at least one word is related to the arrangement order of words stored in the target node.
- the words stored in the plurality of first nodes are derived from point of interest (POI) data or user log data.
- the storage form of the nodes in the dictionary tree may differ from the storage form of the nodes in this embodiment, and the character string corresponding to each node has one more character than that of the previous node; the character string corresponding to any node consists of the characters on the path from the root node to that node.
- the first node stores at least one word including the character string corresponding to the first node.
- the word completion device in the embodiment of the present application can be used to execute the word completion method provided by the foregoing embodiments.
- the nodes of the improved dictionary tree store hot words, and the stored hot words include the characters passed on the path from the root node to the node.
- the hot words stored in the nodes are words with a high probability of being completed. In other words, because the hot words prefixed with a string that is too short have a low probability of being completed, they are not stored in the nodes of the dictionary tree.
- the word completion device obtains the character string input by the user through the obtaining unit, and the output unit outputs at least one word as a recommended hot word by searching the improved dictionary tree for a target node matching the character string input by the user. Since the completed hot word is output to the user only when the target node stores hot words, and the output is based on the words stored in the target node, there is no need to search for hot words that meet the prefix condition based on the character string input by the user. Therefore, the word completion device provided by the embodiment of this application can improve the efficiency of hot word completion, and also make the completed hot words presented to the user better match the user's needs.
- FIG. 9 is a schematic diagram of another embodiment of the word completion apparatus in the embodiment of the present application.
- the software or firmware includes, but is not limited to, computer program instructions or code, and can be executed by a hardware processor.
- the hardware includes, but is not limited to, various types of integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field programmable gate array (FPGA), or an application specific integrated circuit (ASIC).
- An output unit 902, configured to search a first dictionary tree (Trie) for a target node matching the character string, to output a first word set, where the first word set includes at least one word, the first Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each of the first nodes stores at least one word that includes the character string consisting of the characters on the path from the root node of the Trie to the first node, the words stored by the target node include the words in the first word set, and the words stored by the plurality of first nodes are from the first word database;
- the output unit 902 is further configured to output a second set of words prefixed with the character string based on the second word database;
- the output unit 902 is further configured to output at least one word recommended for the user according to the first word set and the second word set.
- the first word set includes at least two words arranged in order; the second word set includes at least two words arranged in order; the target word set includes the word ranked first in the first word set and the word ranked first in the second word set.
- the output unit 902 is specifically configured to: output at least one word recommended for the user according to the probability that each word in the union of the first word set and the second word set is output, where the probability that each word in the union is output is determined according to a preset first weight of the first word set and a preset second weight of the second word set.
- the output unit 902 is specifically configured to: obtain the second word set according to a Trie constructed from the word database of the user log; or input the character string into a machine learning algorithm trained on the word database of the user log, to output the second word set; or obtain the second word set according to a hash tree constructed from the word database of the user log.
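The weighted combination of the first word set and the second word set can be sketched as follows. The source only states that the output probability of each word in the union is determined from the two preset weights; the linear combination used here, the function name `merge_candidates`, and the weight values 0.6 and 0.4 are assumptions made for illustration.

```python
def merge_candidates(first_set, second_set, w1=0.6, w2=0.4, top_n=3):
    """Combine two candidate sets into one ranked list.

    first_set / second_set map word -> probability within that source.
    w1 / w2 are the preset weights of the two sources (illustrative values).
    """
    scores = {}
    for word, p in first_set.items():
        scores[word] = scores.get(word, 0.0) + w1 * p
    for word, p in second_set.items():
        scores[word] = scores.get(word, 0.0) + w2 * p
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


# Assumed example data: POI-based candidates and user-log-based candidates.
poi_candidates = {"beijing": 0.7, "beihai": 0.3}
log_candidates = {"beihai": 0.5, "beijing zoo": 0.5}
recommended = merge_candidates(poi_candidates, log_candidates)
```

A word that appears in both sets accumulates weight from both sources, so it can outrank a word that is strong in only one source.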
- the word completion device in the embodiment of the present application can be used to execute the word completion method provided by the foregoing embodiments.
- Hot words are recommended for users by combining at least two hot word databases, where hot words are stored in the nodes of the first dictionary tree (Trie) constructed from the first word database, and the stored hot words include the characters on the path from the root node to the node.
- the hot words stored in the nodes are words with a high probability of being completed. In other words, because the hot words prefixed with a string that is too short have a low probability of being completed, they are not stored in the nodes of the dictionary tree.
- Since the first word set used for completion is output to the user only when the target node stores hot words, the word completion device provided by the embodiment of the present application can avoid triggering hot word completion in scenarios where the user's input string is short and the intent is not obvious, or where the hot word is unlikely to be completed. In addition, outputting hot words for the user by combining the first word set and the second word set can improve the accuracy of hot word recommendation.
- FIG. 10 is a schematic diagram of another embodiment of the word completion device in the embodiment of the present application.
- the word completion device may be implemented as a software system that, once deployed, provides services externally through a remote interface.
- the device for completing the word includes two modules: an offline module 1001 and an online module 1002, wherein:
- the main task of the offline module 1001 is to process data from different data sources into a specific data structure or model
- the online module 1002 is mainly responsible for responding to the user's query request.
- FIG. 11 is a schematic diagram of an embodiment of an apparatus for constructing a Trie tree in an embodiment of the present application.
- the device for constructing a dictionary tree includes: a construction unit 1101, configured to construct a dictionary tree (Trie) according to a word database, where the dictionary tree includes a plurality of first nodes, and each of the first nodes stores at least one word that includes the character string formed by the characters on the path from the root node to the first node; and a deletion unit 1102, configured to delete words stored in the nodes so that each node keeps only the words whose completion probability is greater than or equal to a first threshold, where the completion probability indicates the probability of outputting the word if the node is matched.
- the completion probability is the ratio of the word frequency of the word to the sum of the word frequencies of all words prefixed with the character string from the root node to the first node.
- the first threshold is a preset value, and the value range is [0.1, 0.2].
- the node of the Trie stores data of a key-value structure
- the key-value structure includes a key and a value associated with the key
- the key is a word prefixed with the character string from the root node to the node
- the value is the completion probability of the word
- the completion probability of the word is the ratio of the word frequency of the word to the sum of the word frequencies of all words prefixed with the character string from the root node to the first node.
- the words stored in the nodes of the Trie are arranged in descending order of completion probability.
- the device for constructing a dictionary tree provided by the embodiment of the present application is used to construct the improved Trie tree provided by the embodiment of the present application, that is, a C-Trie tree. Improvements are made on the basis of the traditional dictionary tree, specifically, hot words are stored in the nodes of the dictionary tree, and the stored hot words include the characters passed on the path from the root node to the node.
- the hot words stored in the node are words whose probability of being completed is greater than or equal to the first threshold. In other words, because hot words prefixed with a string that is too short have a low probability of being completed, they are not stored in the nodes of the dictionary tree.
- During word completion, the target node that matches the string input by the user is found, and at least one word is output as the recommended hot word based on the words stored in the target node. Since completed words are output to the user only when the target node stores hot words, and all output hot words are based on the words stored in the target node, there is no need to search for hot words that meet the prefix condition based on the string input by the user, so the efficiency of hot word completion can be improved. In addition, this avoids triggering hot word completion in scenarios where the user's input string is short and the intent is not obvious, or where the hot word is unlikely to be completed.
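The construction and pruning performed by the construction unit 1101 and the deletion unit 1102 can be sketched as below. For brevity, a flat dictionary keyed by prefix stands in for the tree nodes, which is an assumption of this sketch rather than the storage form of the embodiment; the function name `build_c_trie` and the sample word frequencies are likewise hypothetical. The default threshold of 0.1 is taken from the stated range [0.1, 0.2].

```python
from collections import defaultdict


def build_c_trie(word_freq, threshold=0.1):
    """Map each prefix to {word: completion probability}, pruned by threshold.

    A flat dict keyed by prefix stands in for the tree nodes; the path from
    the root to a node is exactly that node's prefix string.
    """
    # Sum of word frequencies of all words that share each prefix.
    prefix_total = defaultdict(int)
    for word, freq in word_freq.items():
        for i in range(1, len(word) + 1):
            prefix_total[word[:i]] += freq

    nodes = {}
    for word, freq in word_freq.items():
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            prob = freq / prefix_total[prefix]
            if prob >= threshold:  # keep only sufficiently likely words
                nodes.setdefault(prefix, {})[word] = prob

    # Store each node's words in descending order of completion probability.
    return {
        prefix: dict(sorted(kv.items(), key=lambda item: -item[1]))
        for prefix, kv in nodes.items()
    }


# Assumed word frequencies, for illustration only.
word_freq = {"beijing": 60, "beihai": 30, "bengbu": 5, "berlin": 5}
tree = build_c_trie(word_freq, threshold=0.1)
```

With these sample frequencies, the node for "b" keeps "beijing" (0.6) and "beihai" (0.3) and drops the two words whose completion probability is 0.05. A prefix shared by many equally frequent words ends up with no stored words at all, which is how short, ambiguous inputs avoid triggering completion.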
- FIG. 12 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
- the terminal 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
- the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
- the terminal 100 may include more or fewer components than shown, or some components may be combined, or some components may be separated, or the components may be arranged differently.
- the illustrated components may be implemented in hardware, software, or a combination of software and hardware.
- the processor 110 may include one or more processing units, for example, an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. Different processing units may be independent devices, or may be integrated in one or more processors.
- the controller may be the nerve center and command center of the terminal 100 .
- the controller can generate an operation control signal according to the instruction operation code and timing signal, and complete the control of fetching and executing instructions.
- the controller may implement the word completion method provided by the embodiment of the present application according to the instruction.
- a memory may also be provided in the processor 110 for storing instructions and data.
- the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instruction or data again, it can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby increasing the efficiency of the system.
- the memory stores a pre-built Trie tree.
- the processor 110 may include one or more interfaces.
- the interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
- the interface connection relationship between the modules illustrated in the embodiments of the present application is only a schematic illustration, and does not constitute a structural limitation of the terminal 100 .
- the terminal 100 may also adopt interface connection manners different from those in the foregoing embodiments, or a combination of multiple interface connection manners.
- the charging management module 140 is used to receive charging input from the charger.
- the charger may be a wireless charger or a wired charger.
- the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
- the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
- the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
- the wireless communication function of the terminal 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modulation and demodulation processor, the baseband processor, and the like.
- the terminal 100 may communicate with other devices using a wireless communication function.
- the terminal 100 may communicate with the second electronic device, the terminal 100 establishes a screen projection connection with the second electronic device, and the terminal 100 outputs the screen projection data to the second electronic device.
- the screen projection data output by the terminal 100 may be audio and video data.
- Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
- Each antenna in terminal 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
- the antenna 1 can be multiplexed as a diversity antenna of the wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
- the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G, etc. applied on the terminal 100.
- the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA) and the like.
- the mobile communication module 150 can receive electromagnetic waves from the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modulation and demodulation processor for demodulation.
- the mobile communication module 150 can also amplify the signal modulated by the modulation and demodulation processor, and then convert it into electromagnetic waves and radiate it out through the antenna 1.
- at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110 .
- at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110 .
- the terminal can communicate with the server through the mobile communication module.
- the modem processor may include a modulator and a demodulator.
- the modulator is used to modulate the low frequency baseband signal to be sent into a medium and high frequency signal.
- the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
- the low frequency baseband signal is processed by the baseband processor and passed to the application processor.
- the application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194 .
- the modem processor may be a stand-alone device.
- the modem processor may be independent of the processor 110, and may be provided in the same device as the mobile communication module 150 or other functional modules.
- the wireless communication module 160 can provide applications on the terminal 100 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
- the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
- the wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
- the wireless communication module 160 can also receive the signal to be sent from the processor 110, perform frequency modulation on it, amplify it, and convert it into electromagnetic waves for radiation through the antenna 2.
- the antenna 1 of the terminal 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the terminal 100 can communicate with the network and other devices through wireless communication technology.
- the wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
- the GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a Beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS) and/or satellite based augmentation systems (SBAS).
- the terminal 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
- the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor.
- the GPU is used to perform mathematical and geometric calculations for graphics rendering.
- Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
- Display screen 194 is used to display images, videos, and the like.
- Display screen 194 includes a display panel.
- the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), Mini-LED, Micro-LED, Micro-OLED, quantum dot light-emitting diodes (QLED), and so on.
- the terminal 100 may include one or N display screens 194 , where N is a positive integer greater than one.
- the display screen 194 may display the output words to the user.
- the touch-type display can also obtain strings entered by the user.
- the display screen 194 may be used to display various interfaces output by the system of the terminal 100 .
- For each interface output by the terminal 100, reference may be made to related descriptions in subsequent embodiments.
- the terminal 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.
- the ISP is used to process the data fed back by the camera 193 .
- For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, and the light signal is converted into an electrical signal; the camera photosensitive element transmits the electrical signal to the ISP for processing, where it is converted into an image visible to the naked eye.
- ISP can also perform algorithm optimization on image noise, brightness, and skin tone.
- ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
- the ISP may be provided in the camera 193 .
- Camera 193 is used to capture still images or video.
- the object is projected through the lens to generate an optical image onto the photosensitive element.
- the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
- the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image signal.
- the ISP outputs the digital image signal to the DSP for processing.
- DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
- the terminal 100 may include 1 or N cameras 193 , where N is a positive integer greater than 1.
- a digital signal processor is used to process digital signals, in addition to processing digital image signals, it can also process other digital signals.
- Video codecs are used to compress or decompress digital video.
- Terminal 100 may support one or more video codecs.
- the terminal 100 can play or record videos in various encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
- the NPU is a neural-network (NN) computing processor.
- Applications such as intelligent cognition of the terminal 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
- the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100.
- the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function, for example, to save files such as music and videos in the external memory card.
- Internal memory 121 may be used to store computer executable program code, which includes instructions.
- the processor 110 executes various functional applications and data processing of the terminal 100 by executing the instructions stored in the internal memory 121 .
- the internal memory 121 may include a storage program area and a storage data area.
- the storage program area can store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like.
- the storage data area may store data (such as audio data, phone book, etc.) created during the use of the terminal 100 and the like.
- the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), and the like.
- the terminal 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
- the audio module 170 can be used to play the sound corresponding to the video. For example, when the display screen 194 displays a video playing screen, the audio module 170 outputs the sound of the video playing.
- the audio module 170 is used for converting digital audio information into analog audio signal output, and also for converting analog audio input into digital audio signal.
- the speaker 170A, also referred to as a “loudspeaker”, is used to convert audio electrical signals into sound signals.
- the receiver 170B, also referred to as an “earpiece”, is used to convert audio electrical signals into sound signals.
- the microphone 170C, also called a “mic”, is used to convert sound signals into electrical signals.
- the earphone jack 170D is used to connect wired earphones.
- the earphone interface 170D can be the USB interface 130, or can be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
- the pressure sensor 180A is used to sense pressure signals, and can convert the pressure signals into electrical signals.
- the pressure sensor 180A may be provided on the display screen 194 .
- the gyro sensor 180B may be used to determine the motion attitude of the terminal 100 .
- the air pressure sensor 180C is used to measure air pressure.
- the acceleration sensor 180E can detect the magnitude of the acceleration of the terminal 100 in various directions (including three axes or six axes). When the terminal 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the terminal posture, and can be used in horizontal and vertical screen switching, pedometer and other applications.
- the distance sensor 180F is used to measure distance.
- the ambient light sensor 180L is used to sense ambient light brightness.
- the fingerprint sensor 180H is used to collect fingerprints.
- the temperature sensor 180J is used to detect the temperature.
- the touch sensor 180K is also called a “touch panel”.
- the touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touchscreen.
- the touch sensor 180K is used to detect a touch operation on or near it.
- the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
- Visual output related to touch operations may be provided through display screen 194 .
- the touch sensor 180K may also be disposed on the surface of the terminal 100 , which is different from the position where the display screen 194 is located.
- the keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys.
- the terminal 100 may receive key input and generate key signal input related to user settings and function control of the terminal 100 .
- Motor 191 can generate vibrating cues.
- the indicator 192 can be an indicator light, which can be used to indicate the charging state, the change of the power, and can also be used to indicate a message, a missed call, a notification, and the like.
- the SIM card interface 195 is used to connect a SIM card.
- FIG. 13 is a schematic diagram of an embodiment of a server in an embodiment of the present application.
- the server 1300 provided in this embodiment may vary greatly due to different configurations or performance, and may include one or more processors 1301 and a memory 1302, where programs or data are stored in the memory 1302.
- the memory 1302 may be volatile storage or non-volatile storage.
- the processor 1301 is one or more central processing units (CPU), each of which can be a single-core CPU or a multi-core CPU.
- the processor 1301 can communicate with the memory 1302 and execute, on the server 1300, a series of instructions in the memory 1302.
- the server 1300 also includes one or more wired or wireless network interfaces 1303, such as Ethernet interfaces.
- the server 1300 may also include one or more power supplies; one or more input and output interfaces, which may be used to connect a monitor, mouse, keyboard, touch screen device or sensing device etc., the input and output interfaces are optional components, which may or may not exist, and are not limited here.
- the disclosed system, apparatus and method may be implemented in other manners.
- the apparatus embodiments described above are only illustrative.
- the division of the units is only a logical function division. In actual implementation, there may be other division methods.
- multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- the technical solutions of the present application, in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Abstract
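A word completion method, applied to search scenarios, completes incomplete words entered by a user. The method is based on an improved dictionary tree (Trie) in which hot words are stored in some of the nodes. A target node matching the character string entered by the user is looked up in the Trie, and at least one completed word is output to the user based on the hot words stored in the target node. This can improve word completion efficiency and avoid recommending words to the user when the entered character string is too short.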
Description
本申请要求于2020年7月15日提交中国国家知识产权局、申请号为202010683199.8、发明名称为“词补全方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及计算机技术领域,尤其涉及一种词补全方法和装置。
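Location search is widely used in application scenarios such as map navigation, travel, and social communication. Location search specifically includes functions such as query suggestion, point of information (POI) search, and retrieval of POI details. Among them, the query suggestion function accounts for 75% of search requests and provides search suggestions for incomplete query text. When the user's input is incomplete, the query suggestion function performs automatic completion and provides the user with hot words (that is, popular search terms) or popular POIs related to the text.
In the prior art, when a user enters an incomplete character string, the server usually performs prefix matching in a hot-word database according to the entered string and enumerates all hot words that satisfy the prefix match. It then sorts all matching hot words in descending order by search popularity or the like, and takes the first N (Top-N) results as the hot words finally recommended to the user.
In existing hot-word completion methods, prefix matching is performed according to the character string entered by the user to obtain all hot words that satisfy the prefix match. The number of matching hot words is usually huge, especially when the entered string is short; as a result, hot-word search is inefficient, which affects the user experience.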
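The prior-art flow described above can be sketched as follows. This is a minimal illustration rather than code from the application: the function name `naive_complete`, the in-memory dictionary, and the popularity counts are all hypothetical. It shows why short inputs are costly, since every hot word is scanned and every match is sorted on each request.

```python
from typing import Dict, List


def naive_complete(prefix: str, hot_words: Dict[str, int], n: int = 3) -> List[str]:
    """Enumerate all hot words starting with `prefix`, then rank by popularity."""
    # Full scan: the shorter the prefix, the more words match.
    matches = [word for word in hot_words if word.startswith(prefix)]
    # Descending sort by (hypothetical) search popularity, then keep the Top-N.
    matches.sort(key=lambda word: hot_words[word], reverse=True)
    return matches[:n]


# Hypothetical hot-word database: word -> search popularity.
hot_words = {"starbucks": 120, "start": 20, "state": 12, "stir": 4, "hotel": 300}
top2 = naive_complete("st", hot_words, n=2)
print(top2)
```

With only two characters typed, four of the five toy entries already match, which is the behaviour the application sets out to avoid.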
发明内容
本申请实施例提供了一种词补全方法,用于提升词补全效率,可以避免在用户输入过短字符串时向用户推荐词。
本申请实施例的第一方面提供了一种词的补全方法,包括:获取用户输入的字符串;在字典树Trie中查找与所述字符串匹配的目标节点,以输出至少一个词,所述Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储有包括从所述根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词,所述目标节点存储的词包括所述输出的所述至少一个词。
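In the word completion method provided by the embodiments of this application, an efficient Trie structure is constructed that improves on the traditional Trie: hot words are stored in the nodes of the Trie, and the hot words stored in a node contain the characters on the path from the root node to that node. The hot words stored in a node are words with a relatively high probability of being the completion; in other words, because hot words prefixed by an overly short string usually have a low completion probability, they are not stored on the nodes of the Trie. In this method, the target node matching the string entered by the user is looked up in the improved Trie, where the string composed of the characters on the path from the root node to the target node matches the entered string, and at least one word is output as a recommended hot word based on the words stored at the target node. Because completed hot words are output to the user only when the target node stores hot words, and the output is based on the words stored at that node, there is no need to search for hot words satisfying the prefix condition based on the entered string. This improves hot-word completion efficiency and makes the completed hot words presented to the user better match the user's needs.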
在第一方面的一种可能的实现方式中,所述第一节点存储的所述至少一个词的前缀为从所述根节点至所述第一节点的路径上经过的字符所组成的字符串。
本申请实施例提供的词补全方法中,存储的热词都是以从所述根节点至所述第一节点的路径上经过的字符所组成的字符串为前缀的热词,符合字典树的查询逻辑,搜索效率更高。
在第一方面的一种可能的实现方式中,所述Trie包括多个第二节点,所述第二节点未存储词。
本申请实施例提供的词补全方法中,Trie还包括多个未存储热词的第二节点,当用户输入的字符串匹配到第二节点时,将不会向用户输出热词,可以提升用户体验。
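In a possible implementation of the first aspect, the character string includes a first character to an N-th character arranged in the order in which the user entered them; looking up, in the Trie, the target node matching the character string to output at least one word includes: searching the Trie in the input order, where the character string composed of the first character to the (N-1)-th character entered by the user matches a second node in the Trie.
In the word completion method provided by the embodiments of this application, as the user enters the string character by character, the node matched after the first character is entered is a second node that stores no hot words, so no recommended hot word is output. Similarly, up to the (N-1)-th character, the node matched by the string composed of the first to (N-1)-th characters is still a second node that stores no hot words, so no recommended hot word is output. Only after the N-th character is entered is a first node that stores words matched, namely the target node described above. Thus, when the string entered by the user is short, no recommended hot word is output, which avoids disturbing the user and improves the user experience.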
在第一方面的一种可能的实现方式中,所述字符串包括按所述用户输入顺序排列的第一字符至第N字符,所述用户输入的所述第一字符至第N-1字符组成的字符串匹配到的是所述字典树中的一个第一节点,所述第一节点存储的词与所述目标节点存储的词相同或者不同。在第一方面的一种可能的实现方式中,所述第一节点还存储有每个以从根节点的字符至所述第一节点的字符组成的字符串为前缀的词各自对应的补全概率,其中,每个所述词对应的所述补全概率指示在匹配到所述第一节点的情况下,输出所述词的概率。可选地,一个词的补全概率为,在数据库中,该词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。可选地,第一节点存储多个热词,所述多个热词根据补全概率由大到小排列。
本申请实施例提供的词补全方法中,所述第一节点还存储有每个热词的补全概率,由此,可以基于词的补全概率确定向用户推荐的词的顺序,便于用户迅速获取补全概率较高的词,可以提升用户体验。
在第一方面的一种可能的实现方式中,所述第一节点存储有至少一个键值结构的数据,所述键值结构包括键和与所述键关联的值,所述键为以从根节点至所述第一节点的字符为前缀的词,所述值为所述词的补全概率。
本申请实施例提供的词补全方法中,第一节点存储一个或多个键值结构的数据,是一种便捷清晰的存储方式,每个词对应词的补全概率,可以方便地对多个词进行排序。
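In a possible implementation of the first aspect, the order of the at least one output word is related to the order in which the words are stored in the target node.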
本申请实施例提供的词补全方法中,目标节点中存储的词的排列顺序可以有多种,可选地,根据不同热词中相同字符位置的字母顺序排序;可选地,根据不同热词中相同字符位置的拼音顺序排序;或者,根据热词的词频进行排序。输出的至少一个词的顺序可以与目标节点中存储的词的顺序相同,由此,可以减少热词推荐所需的时间,提升热词推荐效率。
在第一方面的一种可能的实现方式中,所述多个第一节点中存储的词来源于信息点POI数据或者用户日志数据。
本申请实施例提供的词补全方法中,第一节点中存储的热词可以来源于现有的各类热词数据库,例如POI数据或者用户日志数据。可以根据实际应用场景针对性选取不同的数据库,例如,在地图类应用中,通常可以考虑使用来源于POI数据的热词。
本申请实施例的第二方面提供了一种词的补全方法,包括:获取用户输入的字符串;在字典树Trie中查找与所述字符串匹配的目标节点,以输出至少一个词,所述Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储有包括所述第一节点对应的字符串的至少一个词,所述目标节点存储的词包括所述输出的所述至少一个词。
与本申请实施例第一方面的区别在于,本申请实施例中,字典树中节点的存储形式不同,每个节点对应的字符串较上一节点多一个字符,包括从根节点该节点的路径上任一节点对应的字符串。此时,第一节点存储有包括所述第一节点对应的字符串的至少一个词。除字典树的存储形式之外的部分与本申请实施例第一方面类似,具体此处不再赘述。
本申请实施例的第三方面提供了一种词的补全方法,包括:获取用户输入的字符串;在基于第一词数据库构建的第一字典树Trie中查找与所述字符串匹配的目标节点,以输出第一词集合,所述第一词集合包括至少一个词,所述第一Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储包括从所述根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词,所述目标节点存储的词包括所述第一词集合中的词;基于第二词数据库,输出以所述字符串为前缀的第二词集合;根据所述第一词集合和所述第二词集合输出为用户推荐的至少一个词。可选地,目标节点存储有以从根节点的字符至所述第一节点的字符组成的字符串为前缀的至少一个词。
本申请实施例提供的方法,可以结合至少两个热词数据库的热词来源,为用户进行热词推荐,其中,基于第一词数据库构建的第一字典树Trie的节点中存储了热词,存储的热词包括以从根节点至该节点的路径上经过的字符。节点中存储的热词是被补全的概率较高的词,换言之,由于以过短的字符串为前缀的热词被补全的概率通常较低,不被存储在字典树的节点上。由于只有目标节点存储了热词时才向用户输出补全的第一词集合,因此可以避免用户输入字符串较短等意图不明显,或热词被补全的可能性较低场景下触发热词补全。此外,结合第一词集合和第二词集合为用户输出热词,可以提高热词推荐的准确度。
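In a possible implementation of the third aspect, the first word set includes at least two words in an ordered arrangement; the second word set includes at least two words in an ordered arrangement; and the target word set includes the first-ranked word in the first word set and the first-ranked word in the second word set.
In the method provided by the embodiments of this application, the hot words output as recommendations for the user include the first-ranked hot word in the first hot-word set and the first-ranked hot word in the second hot-word set. Compared with the prior art, the recommended hot words are more likely to match the user's intent, which can improve the accuracy of hot-word recommendation.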
在第三方面的一种可能的实现方式中,所述根据所述第一词集合和所述第二词集合输出为用户推荐的至少一个词包括:根据预设的第一词集合的第一权重和第二词集合的第二权重,确定所述第一词集合和所述第二词集合的并集中每个词的补全概率;根据所述补全概率确定所述为用户推荐的至少一个词。
本申请实施例提供的方法,可以根据热词推荐的实际应用场景,例如进行推荐的应用类型,对不同数据来源的热词数据库设置推荐权重,以计算第一词集合和所述第二词集合的并集中每个词的补全概率,可以提高热词推荐的准确度。
在第三方面的一种可能的实现方式中,所述基于第二词数据库,输出以所述字符串为前缀的第二词集合包括:根据所述用户日志词数据库构建的Trie,获取所述第二词集合;或者,将所述字符串输入基于所述用户日志词数据库训练的机器学习算法中,以输出所述第二词集合;或者,根据所述用户日志词数据库构建的哈希树,获取所述第二词集合。
可选地,第一词数据库的数据来源为POI数据库,第二词数据库的数据来源为用户日志数据库。
本申请实施例提供的方法,可以通过多种已有的热词补全方法获取第二词集合。此外,通过融合POI数据的补全结果和日志数据的补全结果,可以保证补全热词的时效性和与POI的相关性。
本申请实施例的第四方面提供了一种字典树的构建方法,包括:根据词数据库构建字典树Trie,所述字典树中包括多个第一节点,每个所述第一节点存储有包括从所述根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词;对所述节点存储的词进行删减,保留每个节点中补全概率大于或等于第一阈值的词,所述补全概率指示在匹配到该节点的情况下,输出词的概率。
本申请实施例提供的字典树的构建方法中,提出了一种高效的字典树结构的构建方法,在传统字典树的基础上进行了改进,具体是在字典树的节点中存储了热词,存储的热词包括以从根节点至该节点的路径上经过的字符。节点中存储的热词是被补全的概率大于或等于第一阈值的词,换言之,由于以过短的字符串为前缀的热词被补全的概率通常较低,不被存储在字典树的节点上。基于改进的字典树,查找与用户输入的字符串匹配的目标节点,基于该目标节点存储的词,输出至少一个词作为推荐的热词,由于只有目标节点存储了热词时才向用户输出补全的热词,且基于该目标节点存储的词进行输出,不需要基于用户输入的字符串查找符合前缀条件的热词,因此可以提高热词补全效率。此外,还可以避免用户输入字符串较短等意图不明显,或热词被补全的可能性较低场景下触发热词补全。
在第四方面的一种可能的实现方式中,所述补全概率为所述词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。
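In a possible implementation of the fourth aspect, the first threshold is a preset value in the range [0.1, 0.2]. Setting the first threshold appropriately ensures that hot words are recommended to the user at a suitable moment. Under otherwise identical conditions, the larger the first threshold, the longer the string that triggers hot-word recommendation, and the more likely the recommended hot words are to match the user's intent. However, if the first threshold is too large, the user obtains recommended hot words only after a long wait (or after entering many characters), which does not help the user experience either. The first threshold therefore needs to be set reasonably for the actual hot-word completion scenario; it may be preset or adjusted according to the application scenario or user needs.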
在第四方面的一种可能的实现方式中,所述Trie的节点存储键值结构的数据,所述键值结构包括键和与所述键关联的值,所述键为以从根节点至所述节点的字符为前缀的词,所述值为所述词的补全概率,所述词的补全概率为所述词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。
在第四方面的一种可能的实现方式中,所述方法还包括:所述Trie的节点存储的词按照补全概率由大到小排列。
本申请实施例的第五方面提供了一种词的补全装置,其特征在于,包括:获取单元,用于获取用户输入的字符串;输出单元,用于在字典树Trie中查找与所述字符串匹配的目标节点,以输出至少一个词,所述Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储有一个或多个词,所述一个或多个词都包括从所述Trie树的根节点至所述第一个或多个词所在的第一节点的路径上经过的字符所组成的字符串,所述目标节点存储的词中包括输出的所述至少一个词。
在第五方面的一种可能的实现方式中,所述一个或多个词的前缀为从所述Trie树的根节点至所述一个或多个词所在的第一节点的路径上经过的字符所组成的字符串。
在第五方面的一种可能的实现方式中,所述Trie包括多个第二节点,每个所述第二节点未存储词。
在第五方面的一种可能的实现方式中,所述字符串包括按所述用户的输入顺序排列的第一字符至第N字符;所述输出单元具体用于:按照所述输入顺序在所述字典树中查找,其中,所述用户输入的所述第一字符至第N-1字符组成的字符串匹配到的是所述字典树中的一个第二节点。
在第五方面的一种可能的实现方式中,所述第一节点还存储有所述一个或多个词各自对应的补全概率,其中,每个所述词对应的所述补全概率指示在匹配到所述第一节点的情况下,输出所述词的概率。
在第五方面的一种可能的实现方式中,所述第一节点存储有至少一个键值结构的数据,所述键值结构包括键和与所述键关联的值,所述键为以从根节点至所述第一节点的路径上经过的字符所组成的字符串为前缀的词,所述值为词的补全概率,其中,所述补全概率指示在匹配到所述第一节点的情况下,输出所述键的概率。
在第五方面的一种可能的实现方式中,所述输出的所述至少一个词的顺序与所述目标节点中存储的词的排列顺序有关。
在第五方面的一种可能的实现方式中,所述多个第一节点中存储的词来源于信息点POI数据或者用户日志数据。
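A sixth aspect of the embodiments of this application provides a word completion apparatus. It differs from the word completion apparatus of the fifth aspect in the storage form of the nodes in the Trie: the character string corresponding to each node has one more character than that of the preceding node and includes the string corresponding to any node on the path from the root node to that node. In this case, the first node stores at least one word that includes the string corresponding to the first node. Apart from the storage form of the Trie, the remainder is similar to the fifth aspect and is not repeated here.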
本申请实施例的第七方面提供了一种词的补全装置,其特征在于,包括:获取单元,用于获取用户输入的字符串;输出单元,用于在字典树Trie中查找与所述字符串匹配的目标节点,以输出第一词集合,所述第一词集合包括至少一个词,所述第一Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储包括从所述Trie的根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词,所述目标节点存储的词包括所述第一词集合中的词,所述多个第一节点存储的词来自第一词数据库;所述输出单元,还用于基于第二词数据库,输出以所述字符串为前缀的第二词集合;所述输出单元,还用于根据所述第一词集合和所述第二词集合输出为用户推荐的至少一个词。
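In a possible implementation of the seventh aspect, the first word set includes at least two words in an ordered arrangement; the second word set includes at least two words in an ordered arrangement; and the target word set includes the first-ranked word in the first word set and the first-ranked word in the second word set.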
在第七方面的一种可能的实现方式中,所述输出单元具体用于:根据所述第一词集合和所述第二词集合的并集中每个词被输出的概率,输出为用户推荐的至少一个词,所述并集中每个词被输出的概率根据预设的第一词集合的第一权重和第二词集合的第二权重确定。
在第七方面的一种可能的实现方式中,所述输出单元具体用于:根据所述用户日志的词数据库构建的Trie,获取所述第二词集合;或者,将所述字符串输入基于所述用户日志的词数据库训练的机器学习算法中,以输出所述第二词集合;或者,根据所述用户日志的词数据库构建的哈希树,获取所述第二词集合。
本申请实施例的第八方面提供了一种字典树的构建装置,包括:构建单元,用于根据词数据库构建字典树Trie,所述字典树中包括多个第一节点,每个所述第一节点存储有包括从所述根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词;删减单元,用于对所述节点存储的词进行删减,保留每个节点中补全概率大于或等于第一阈值的词,所述补全概率指示在匹配到该节点的情况下,输出词的概率。
在第八方面的一种可能的实现方式中,所述补全概率为所述词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。
在第八方面的一种可能的实现方式中,第一阈值为预设值,取值范围为[0.1,0.2]。
在第八方面的一种可能的实现方式中,所述Trie的节点存储键值结构的数据,所述键值结构包括键和与所述键关联的值,所述键为以从根节点至所述节点的字符为前缀的词,所述值为所述词的补全概率,所述词的补全概率为所述词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。
在第八方面的一种可能的实现方式中,所述方法还包括:所述Trie的节点存储的词按照补全概率由大到小排列。
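A ninth aspect of the embodiments of this application provides a terminal, including one or more processors and a memory, where the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the terminal to perform the method according to any one of the first to fourth aspects and their possible implementations.
A tenth aspect of the embodiments of this application provides a server, including one or more processors and a memory, where the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the server to perform the method according to any one of the first to fourth aspects and their possible implementations.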
本申请实施例第十一方面提供了一种包含指令的计算机程序产品,其特征在于,当其在计算机上运行时,使得所述计算机执行如上述第一方面至第四方面以及各种可能的实现方式中任一项所述的方法。
本申请实施例第十二方面提供了一种计算机可读存储介质,包括指令,其特征在于,当所述指令在计算机上运行时,使得计算机执行如上述第一方面至第四方面以及各种可能的实现方式中任一项所述的方法。
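A thirteenth aspect of the embodiments of this application provides a chip including a processor. The processor is configured to read and execute a computer program stored in a memory, so as to perform the method according to any one of the first to fourth aspects and their possible implementations. Optionally, the chip further includes the memory, and the processor is connected to the memory through a circuit or a wire. Further optionally, the chip further includes a communication interface connected to the processor. The communication interface receives the data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes it, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.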
其中,第五方面至第十二方面中任一种实现方式所带来的技术效果可参见第一方面至第四方面中相应实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:
本申请提供的热词补全方法,基于改进的字典树进行词补全,其中,改进的字典树包括多个存储有词的第一节点,通过在改进的字典树上查找与用户输入的字符串匹配的目标节点,基于该目标节点存储的词,输出至少一个词作为推荐的热词。由于只有目标节点存储了热词时才向用户输出补全的热词,且基于该目标节点存储的词进行输出,不需要基于用户输入的字符串查找符合前缀条件的热词,因此可以提高热词补全效率。
此外,本方案中通过热词的补全概率进行筛选,相较严格设置最短触发长度来限制热词补全时机的方案,本方案通过统计节点对应的字符串为前缀的热词的补全概率,可以针对性确定每个热词的补全触发长度,更符合实际补全需求。可以避免用户输入字符串较短等意图不明显,或热词被补全的可能性较低场景下触发热词补全,由于避免了过短的字符串为前缀的热词搜索,本方案还可以有效提高补全效率。
此外,本方案基于第一词数据库和第二词数据库的补全结果进行输出,例如融合POI数据的补全结果和日志数据的补全结果,可以保证补全热词的时效性和与POI的相关性。
图1为基本Trie树结构示例的示意图;
图2a为热词补全界面的示意图;
图2b为本申请实施例中热词补全方法的系统架构示意图;
图3a为本申请实施例中构建C-Trie树的一个实施例示意图;
图3b为本申请实施例中C-Trie树的节点存储的热词数据的一个示意图;
图3c为本申请实施例中C-Trie树的节点删减存储的热词数据的一个示意图;
图3d为本申请实施例中C-Trie树的节点存储的热词数据的另一个示意图;
图4为本申请实施例中词的补全方法的一个实施例示意图;
图5为本申请实施例中词的补全方法的另一个实施例示意图;
图6为本申请实施例中预测模型的输入输出结构的示意图;
图7为本申请实施例中词的补全方法的另一个实施例示意图;
图8为本申请实施例中词的补全装置的一个实施例示意图;
图9为本申请实施例中词的补全装置的另一个实施例示意图;
图10为本申请实施例中词的补全装置的另一个实施例示意图;
图11为本申请实施例中Trie树的构建装置的一个实施例示意图;
图12为本申请实施例中终端的一个实施例示意图;
图13为本申请实施例中服务器的一个实施例示意图。
本申请实施例提供了一种词的补全方法,用于提升词补全效率,可以避免在用户输入过短字符串时向用户推荐词。
为了便于理解,下面对本申请实施例涉及的部分技术术语进行简要介绍:
1、词
本申请涉及词的补全,这里首先对“词”的含义进行简要介绍。
本申请提出的词的补全方法应用于搜索领域,具体是当用户在搜索栏区域逐个输入字符时,通过预测用户期望的搜索词为用户推荐完整的词,由此,用户无需逐个字符输入完整的词,而可以直接从推荐的词中进行选择,可以提升用户的搜索效率,改善用户体验。通常,为用户推荐的词的数据来源包括用户日志等历史数据,经统计和筛选,结合用户已经输入的不完整字符,推荐尽可能接近用户预期的词。基于该技术背景,可以理解的是,本申请实施例提及的“词”为广义的概念,基于用户的搜索语言,“词”包括不同语言的词;基于词的数据来源,词可以包括单个词或多个词组成的短语。下面,以中文的词和英文的词进行举例介绍:
1)“超市”,属于中文的一个词语,“超市”属于本申请中涉及的“词”,包括两个字符;
2)“农业银行”,虽然包括“农业”和“银行”两个词语,但是,若基于用户日志等数据来源,“农业银行”作为整体属于用户常用的搜索词,则“农业银行”也属于本申请中涉及的“词”,包括四个字符;
3)“starbucks”为一个英文词汇,属于本申请涉及的“词”,包括9个字符;
4)“burger king”,按照狭义的理解,由“burger”和“king”两个英文词汇组成,但是,由于“burger king”作为整体属于用户常用的搜索词,则“burger king”也属于本申请中涉及的“词”,值得注意的是,在本申请中将“burger king”视为一个整体,“burger king”包括11个字符,即“burger”和“king”之间的空格也占用一个字符。
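The "words" recommended to the user can be obtained, through statistics and filtering, from a word database derived from sources such as user logs. Because they are usually popular search terms with a high word frequency in the database, they are often called "hot words" in this technical field; these are the "words" of the embodiments of this application. This application does not limit the source of the word frequencies. In addition, the data source of the words may be any hot-word database commonly used in the prior art, which is likewise not limited. The following embodiments are described in terms of "hot words".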
2、POI
是“Point of Information”的缩写,中文可以翻译为“信息点”。在地理信息系统中,一个POI可以是一栋房子、一个商铺、一个邮筒、一个公交站等。
本申请实施例中POI是指地图上的地点名称,例如“黄金山公园”、“illy咖啡厅”、“新天下公寓”、“中国邮政”等等。
POI热词,是基于POI数据筛选得到的,例如,统计POI数据中每个词语的词频,取一定比例词频最高的词语为POI热词。
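3. Character string
The query text entered by the user in the search box is usually an incomplete word and may include one or more characters; in the embodiments of this application it is collectively referred to as a character string.
4. Trie
A dictionary tree, also known as a lookup tree or prefix tree, is also called a Trie tree in this embodiment. Typical applications of a Trie tree are counting, sorting, and storing large numbers of strings, so it is often used by search engine systems for text word-frequency statistics. Its advantage is that it uses the common prefixes of strings to reduce query time and minimizes unnecessary string comparisons.
A common solution for completing an incomplete prefix is to build a Trie tree over all candidate words and to search for the character string in the Trie tree, which improves query efficiency.
Refer to FIG. 1, a schematic diagram of an example of a basic Trie tree structure. Suppose there are five strings, "code", "cook", "file", "fat", and "find"; the Trie tree built for them is shown in FIG. 1. To find the node matching a string in a Trie tree, start from the root node and first determine the node whose character is the same as the first character of the string; then, among the child nodes of that node, find the node whose character is the same as the second character of the string, and so on. The child nodes of a node are the nodes directly connected below it. The characters corresponding to the child nodes of any one node are all different.
Retrieving prefix-matching strings through a Trie tree reduces unnecessary string comparisons and uses common prefixes to improve query efficiency; for example, since "code" and "cook" share the prefix "co", the common prefix "co" does not need to be compared twice. In practical applications of a Trie tree, the word frequency is usually additionally stored in the node corresponding to each candidate word and serves as the basis for ranking candidate words that share a prefix.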
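The basic Trie lookup described for FIG. 1 can be sketched as below, using the same five strings. The class and method names (`TrieNode`, `Trie`, `find`, `words_with_prefix`) are illustrative choices and do not come from the application.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode (characters are unique per node)
        self.is_word = False  # True if the path from the root to this node spells a stored word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def find(self, prefix):
        """Walk from the root one character at a time; return None if no node matches."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def words_with_prefix(self, prefix):
        """Collect every stored word below the node that matches `prefix`."""
        start = self.find(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = Trie()
for w in ["code", "cook", "file", "fat", "find"]:
    trie.insert(w)
print(trie.words_with_prefix("co"))
print(trie.words_with_prefix("f"))
```

Note that collecting the completions still requires traversing the whole subtree below the matched node, which is the cost the improved structure later avoids by storing hot words directly at the node.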
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。 在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。
如图2a所示,当用户输入“star”时,服务器或终端为用户推荐热词“starbucks”,用户可以通过点击该热词,进而采用该热词进行POI搜索,以获取更加全面、详细的搜索结果列表。需要说明的是,补全的热词通常是类别或品牌名等词类,例如“酒店”、“星巴克”等,而不是指向某一个明确的地点,例如“东方明珠”。
在现有技术中,当用户输入不完整的字符串时,通常对用户输入的字符串在所有热词中进行前缀匹配,枚举所有满足前缀匹配的热词。接下来对满足前缀匹配的热词,按照搜索热度或者字符串与热词的相关性进行降序排序,取Top-N结果作为最终向用户推荐的热词。该方法在不完整文本补全场景中十分简洁有效,但是在实际应用中,该方法还存在以下技术问题:通常满足前缀匹配的热词数量巨大,尤其在字符串较短时,搜索效率较低;不同热词的补全触发时机难以确定,当用户输入的字符串过短,例如只有1到2个字母时,用户搜索意图尚未明确,符合前缀匹配的热词数量巨大。此时向用户推荐热门搜索词,有可能影响用户体验,并且干扰非热词搜索场景,例如在地图场景中,输入1到2个字母的情况下通常返回周围重要的行政区。
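To solve the above problems of hot-word completion, the present invention constructs a C-Trie tree structure. Thanks to the special structure of the C-Trie, candidate hot words can be conditionally filtered, sorted, and additionally stored, which greatly improves hot-word completion performance.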
此外,通过融合基于POI数据的补全结果和基于搜索日志的补全结果,还可以同时保障补全热词的时效性和与POI数据的相关性。
下面对本申请方法的系统架构进行简要介绍,请参阅图2b,为本申请实施例中热词补全方法的系统架构示意图。
该系统架构中包括用户、终端和服务器,终端和服务器之间通过各类通信链路进行通信连接。
通常,用户通过终端的输入模块输入字符串用于进行搜索,终端根据用户输入在本地进行查找搜索,通过终端的显示模块向用户输出推荐的热词。
或者,终端将用户输入的请求发送给服务器,由服务器根据用户输入的字符串进行热词补全,并将输出的热词发送给终端,终端接收服务器发送的热词并通过输出模块向用户输出。
可见,本申请实施例提供的热词补全方法可以由终端执行,也可以由网络设备,例如服务器执行,具体不做限定。
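The hot-word completion method of this application improves on the traditional Trie tree: only hot words that satisfy a condition are stored on some of the nodes of the Trie tree. The improved Trie tree is therefore called a conditional dictionary tree, or C-Trie tree for short, in the embodiments of this application. It can be understood that any Trie tree in which hot words satisfying a certain condition are stored on some of its nodes is an improved Trie tree in the sense of these embodiments; this application does not limit the name of the improved Trie tree.
In one implementation, the C-Trie tree adds key-value (K-V) data structures to some nodes of an ordinary Trie tree. The Key is a hot word prefixed by the string that the node represents; for example, if the node represents the string "酒", the Key of one K-V entry stored at that node is "酒店" (hotel). The Value is, among all hot words prefixed by the string of that node, the ratio of the word frequency of this hot word (the hot word corresponding to the Key) to the total word frequency of all those hot words, that is, the probability that this hot word is the completion, called the "completion probability" in the embodiments of this application. For example, if there are two hot words prefixed by "酒", namely "酒店" (hotel) and "酒吧" (bar), with word frequencies 70 and 30 respectively, then the Value for the Key "酒店" is 70%. Optionally, the K-V entries on each node are sorted in descending order of Value. Of course, other data formats, such as a mapping table or a linked list, can also be used to store hot words and completion probabilities on some nodes of the C-Trie tree; this application does not limit this.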
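The completion probability defined above is a simple frequency ratio. The following sketch reproduces the "酒店"/"酒吧" example; the function name `completion_probabilities` and the dictionary representation are assumptions made for illustration.

```python
def completion_probabilities(prefix, word_freq):
    """Return {hot word: completion probability} for all hot words starting with `prefix`.

    The probability of a word is its frequency divided by the total frequency
    of every hot word that shares the prefix.
    """
    matching = {w: f for w, f in word_freq.items() if w.startswith(prefix)}
    total = sum(matching.values())
    if total == 0:
        return {}
    return {w: f / total for w, f in matching.items()}


# Word frequencies taken from the example in the text.
word_freq = {"酒店": 70, "酒吧": 30}
probs = completion_probabilities("酒", word_freq)
print(probs)
```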
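The method for constructing the C-Trie tree is described below:
Step 1. Build an ordinary Trie tree based on a hot-word database;
Optionally, refer to FIG. 3a. For ease of understanding, each node in the figure shows the string that the node represents, that is, all the characters on the path from the root node to that node joined together. Alternatively, each node may directly represent the string shown in FIG. 3a. The hot-word database may be an existing hot-word database from any source, commonly a hot-word database based on user logs or a POI hot-word database; this is not limited here.
Step 2. Store hot words on multiple nodes of the Trie tree;
The hot words stored at each node include at least one word that contains the string composed of the characters on the path from the root node to that node. Optionally, the hot words stored at each node include at least one word prefixed by the string composed of the characters on the path from the root node to that node. The words may come from the hot-word database used to build the Trie tree in step 1, or from another hot-word database of a different origin; the source of the stored hot words is not limited here.
It should be noted that the Trie tree structure itself stores hot words by common prefix, so for each node, every word prefixed by the characters on the path from the root node to that node can be obtained one by one by querying the node's child nodes. In other words, all words with that prefix can be obtained from the Trie structure itself. In this embodiment, however, a node storing hot words does not mean obtaining hot words by querying the Trie tree; the hot words are stored directly at the node. When the solution is used, once the target node is found, the hot words stored at the target node can be read directly without any further query through the Trie structure.
Optionally, each node also stores the completion probability corresponding to each hot word; the completion probability indicates the probability of outputting that hot word when that node is matched.
Optionally, the hot words on a node are stored as K-V data, and each node may store one or more K-V entries. The Key is a hot word prefixed by the string the node represents, and the Value is the completion probability of that hot word. Optionally, the completion probability is the ratio of the word frequency of the hot word to the sum of the word frequencies of all hot words prefixed by the string the node represents.
For example, as shown in FIG. 3b, the hot words prefixed by "st" include "starbucks", "state", "start", and "stir". Taking these hot words as Keys, the completion probability of each is computed: 0.012 for "starbucks", 0.0012 for "state", 0.002 for "start", and 0.0004 for "stir".
Similarly, K-V hot-word data is added to every node of the Trie tree.
It should be noted that step 1 and step 2 can be performed together; that is, hot words are stored for each node at the same time as the node is constructed.
Step 3. Prune the hot words stored at each node.
The hot words stored at each node are pruned according to a first threshold; optionally, the K-V data is pruned.
Optionally, if the Value in a K-V entry stored at a node is less than the specified first threshold, that K-V entry is deleted. As shown in FIG. 3c, assume the first threshold is 0.1; the probabilities of all candidate hot words in the "st" and "sta" nodes then fail the threshold condition, and all of them are deleted. The first threshold is a number in the interval (0, 1); its specific value is not limited. For example, the first threshold lies in the range [0.1, 0.2] and may be 0.1, 0.12, 0.15, 0.18, or 0.2. It can be understood that setting the first threshold appropriately ensures that hot words are recommended to the user at a suitable moment. Under otherwise identical conditions, the larger the first threshold, the longer the string that triggers hot-word recommendation, and the more likely the recommended hot words are to match the user's intent. However, if the first threshold is too large, the user obtains recommended hot words only after a long wait (or after entering many characters), which does not help the user experience either. The first threshold therefore needs to be set reasonably for the actual hot-word completion scenario; it may be preset or adjusted according to the application scenario or user needs.
It should be noted that steps 1, 2, and 3 can be performed together; that is, hot words are stored for each node, and pruned, at the same time as the node is constructed.
Step 4. Sort the hot words stored on the nodes;
The hot words stored on a node can be arranged in order; the sorting rule is not limited here. For example, English hot words can be sorted alphabetically by first letter, with ties broken by the second letter, and so on; Chinese hot words can be sorted by the pinyin of their characters; or hot words can be sorted by word frequency.
Optionally, the K-V hot-word data remaining on each node after pruning is sorted in descending order of Value. As shown in FIG. 3d, the hot words stored in the node "star" are arranged in descending order of completion probability.
It should be noted that step 4 is optional and may or may not be performed.
It can be understood that the order in which step 3 and step 4 are performed is not limited.
The steps above yield the C-Trie tree of the embodiments of this application. Its main features are:
(1) A node may have no K-V data associated with it, or may have one or more K-V pairs associated with it;
(2) Each K-V pair indicates the probability that a particular hot word, prefixed by the string the node represents, is the completion;
(3) In the K-V data stored at a node, the completion probability of each indicated hot word is greater than or equal to the first threshold;
(4) The K-V data stored at a node is arranged in descending order of the hot words' completion probability.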
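Steps 1 to 4 can be combined into one small construction routine. The sketch below is one possible reading of the procedure, not the applicant's implementation: the names `Node`, `build_c_trie`, and `find` are hypothetical, the toy corpus and its frequencies are invented so that short prefixes dilute every probability below 0.1, and frequencies are assumed to be positive.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.freq = {}      # scratch data: word -> frequency for words under this prefix
        self.hot = []       # final K-V data: [(hot word, completion probability), ...]


def build_c_trie(word_freq, threshold=0.1):
    root, nodes = Node(), []
    # Steps 1 and 2: build the trie and store each word at every node on its path.
    for word, freq in word_freq.items():
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node()
                nodes.append(node.children[ch])
            node = node.children[ch]
            node.freq[word] = freq
    for node in nodes:
        total = sum(node.freq.values())
        # Step 3: keep only entries whose completion probability reaches the threshold.
        kept = [(w, f / total) for w, f in node.freq.items() if f / total >= threshold]
        # Step 4: sort by completion probability, highest first.
        kept.sort(key=lambda kv: kv[1], reverse=True)
        node.hot = kept
        node.freq = {}  # scratch data is no longer needed
    return root


def find(root, text):
    """Return the node whose path from the root spells `text`, or None."""
    node = root
    for ch in text:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


# Invented corpus: three "star..." words plus ten other "sta..." words.
corpus = {"starbucks": 50, "starhub": 30, "starhotel": 20}
for filler in ["state", "station", "stable", "stack", "stadium",
               "staff", "stage", "stair", "stamp", "stand"]:
    corpus[filler] = 46

root = build_c_trie(corpus, threshold=0.1)
print(find(root, "st").hot)    # no entry survives pruning at this node
print(find(root, "star").hot)  # three entries, highest probability first
```

In this toy corpus the "s", "st", and "sta" nodes end up with no K-V data, while "star" keeps all three words, mirroring features (1) to (4).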
在另一种可能的实现方式中,C-Trie树可以删去一部分未存储热词的节点,此处不再赘述。
由于C-Trie树的部分节点存储有词,通过在改进的C-Trie树上查找与用户输入的字符串匹配的目标节点,基于该目标节点存储的词,向用户输出至少一个词作为推荐的热词。由于只有匹配到存储了热词的目标节点时才向用户输出补全的热词,且基于该目标节点存储的词进行输出,不需要基于用户输入的字符串查找符合前缀条件的热词,因此可以提高热词补全效率。
请参阅图4,为本申请实施例中热词补全方法的一个实施例示意图;
401、获取用户输入的字符串;
服务器或终端获取用户输入的搜索文本,由于用户需要一个字符一个字符地进行输入,搜索文本通常为不完整的字符串,在适当的时机根据用户输入的字符串为前缀进行智能补全可以节省用户的输入时长,提升用户体验。
402、在预置的字典树Trie中查找与所述字符串匹配的目标节点,以输出至少一个词;
输出的热词用于向用户进行推荐,用户可以在输出的至少一个词中挑选符合自己期望的词,从而不必输入完整的词,在用户对于自己期望输入的词存在拼写困难或者部分遗忘等场景下得到帮助,从而提升用户体验。
从根节点到某一节点的路径上经过的字符按顺序连接得到的字符串为该节点对应的字符串。本申请实施例中的Trie树为上述实施例中介绍的C-Trie树,根据C-Trie树的构建方法可知,C-Trie树的节点代表的字符串与常规的Trie树一样,都是从根节点到该节点的路径经过的字符顺序连接得到的字符串。不同之处在于,C-Trie树的部分节点存储了热词,为描述方便,可以将存储了热词的节点称为第一节点,C-Trie树包括多个第一节点。可选地,C-Trie树的第一节点存储的热词包含从根节点到该第一节点的路径经过的字符顺序连接得到的字符串。例如,从根节点到该第一节点的路径经过的字符顺序连接得到的字符串为“rea”,第一节点存储的热词,包括以rea为前缀的词,例如read;也可以包括包含rea的词,例如entreat。可选地,C-Trie树的第一节点存储的热词为以从根节点到该节点的路径经过的字符顺序连接得到的字符串为前缀的热词,例如,从根节点到该第一节点的路径经过的字符顺序连接得到的字符串为“rea”,第一节点存储的热词为ready、rear、really等。
可选地,C-Trie树的节点存储键值对(K-V)结构的数据,K为以该节点代表的字符串为前缀的热词,V为该热词的补全概率,即该热词的词频占所有以该节点代表的字符串为前缀的热词的词频总和的比例。补全概率指示在匹配到所述第一节点的情况下,输出该热词的概率。根据C-Trie树的构建过程可知,保留的热词的补全概率大于或等于第一阈值,也就是说,补全概率小于第一阈值的热词被删减。可选地,由于节点对应的字符串为前缀的多个热词的补全概率均小于第一阈值,C-Trie树的部分节点可能没有关联K-V结构的数据,这一情况常见于节点对应的字符串过短的场景;可选地,节点对应的字符串为前缀的多个热词中,部分热词的补全概率小于第一阈值而被删减,部分热词的补全概率大于或等于第一阈值而被保留,因此该节点存储了部分热词,属于第一节点;可选地,节点对应的字符串为前缀的多个热词的补全概率均大于或等于第一阈值,被全部保留,该节点存储了热词,属于第一节点。为便于描述,将未存储热词的节点称为第二节点,C-Trie树包括多个第二节点。
C-Trie树的一个第一节点可能关联一对K-V结构的数据,包括该节点对应的字符为前缀的热词仅有一个,或者仅有一个热词的补全概率大于或等于第一阈值;C-Trie树的一个第一节点还可能关联多对K-V结构的数据,K-V结构的数据的具体数量不做限定,也可以根据补全概率的大小进行删减,保留预设数量且补全概率较大的热词对应的K-V结构的数据。
可选地,构建本申请实施例的C-Trie树时可以使用不同来源的热词数据库,示例性地,热词数据库可以是用户的日志热词数据库,日志热词是通过日志数据筛选得到的热门搜索词;热词数据库还可以是POI热词数据库,POI热词是通过统计地图POI数据中每个词语的词频,取词频最高的一定比例词语得到。
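The server or terminal searches the C-Trie tree according to the character string and can determine the target node that matches the string, that is, the node whose represented string is identical to the entered string.
If the target node stores hot-word data, at least one hot word prefixed by the string is output according to the stored hot-word data.
If the target node stores hot-word data, at least one hot word is output as a recommended hot word according to the stored hot words. The number of recommended hot words is usually limited; it may be preset by the server or set according to user needs, and the specific number is not limited. If the configured number of recommended hot words is smaller than the number of hot words stored on the target node, the output hot words are a subset of all the hot words stored on the target node.
The hot words stored on a node can be arranged in order; the sorting rule is not limited here. For example, English hot words can be sorted alphabetically by first letter, with ties broken by the second letter, and so on; Chinese hot words can be sorted by the pinyin of their characters; or hot words can be sorted by word frequency. Optionally, the K-V hot-word data remaining on each node after pruning is sorted in descending order of Value. As shown in FIG. 3d, the hot words stored in the node "star" are arranged in descending order of completion probability.
Optionally, the order of the at least one output word is related to the order of the hot words stored in the target node. For example, the first N hot words are output directly in the order in which they are stored in the target node, where N is the preset number of hot words to recommend.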
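The lookup in step 402 reduces to walking to the target node and reading its stored list. The sketch below hand-builds a tiny C-Trie for illustration; `Node`, `add_path`, and `complete` are hypothetical names, and the probabilities attached to the "star" node are invented, since the values of FIG. 3d are not reproduced in the text.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.hot = []       # [(hot word, completion probability)], sorted descending


def add_path(root, text, hot):
    """Hypothetical helper: create the path for `text` and attach pre-sorted K-V data."""
    node = root
    for ch in text:
        node = node.children.setdefault(ch, Node())
    node.hot = hot


def complete(root, typed, n=3):
    """Return up to n recommended hot words, or [] when completion is not triggered."""
    node = root
    for ch in typed:
        node = node.children.get(ch)
        if node is None:
            return []          # no node matches the typed string
    # A second node has an empty list, so nothing is recommended for it.
    return [word for word, _ in node.hot[:n]]


root = Node()
add_path(root, "star", [("starbucks", 0.5), ("starhub", 0.3), ("starhotel", 0.2)])

# Simulate the user typing "star" one character at a time.
for i in range(1, 5):
    typed = "star"[:i]
    print(typed, complete(root, typed))
```

Because the stored list is already sorted, returning the first N entries needs no subtree traversal and no sort at query time.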
本申请实施例提供的热词补全方法,还可以将改进的Trie树结合已有的热词补全方法为用户进行热词推荐,下面具体进行介绍,请参阅图5,为本申请实施例中热词补全方法的另一个实施例示意图。
501、获取用户输入的字符串;
服务器或终端都可以作为方案的执行主体,下面以服务器实施本方案为例进行介绍。服务器获取用户输入的搜索文本,由于用户需要一个字符一个字符地进行输入,搜索文本通常为不完整的字符串,在适当的时机根据用户输入的字符串为前缀进行智能补全可以节省用户的输入时长,提升用户体验。
502、在预置的C-Trie树中查找与所述字符串匹配的目标节点,以输出第一热词集合;
服务器根据字符串在预先构建的C-Trie树中进行查找,可以确定与该字符串匹配的目标节点。从根节点到某一节点的路径上经过的字符按顺序连接得到的字符串为节点对应的字符串。与字符串匹配的目标节点对应的字符串与用户输入的字符串相同。
本实施例中的C-Trie树是基于POI热词数据库构建的,构建C-Trie树的具体过程请参考前序实施例,具体此处不再赘述。
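If the target node stores hot-word data, a first hot-word set prefixed by the string is output according to the hot-word data;
If the target node stores hot-word data, the stored hot words prefixed by the string can be used as the output; optionally, the N hot words with the highest completion probability are selected from the hot-word data stored at the target node.
The N output hot words constitute the first hot-word set. N is a preset integer greater than or equal to 1 whose specific value is not limited here; for example, N is 3. It can be understood that if fewer than N hot words are prefixed by the string, all of them are output as the first hot-word set.
For example, take "st": when the user enters "st", the corresponding node has no K-V data stored, so hot-word completion is not triggered. When the user enters "star", the corresponding node has K-V data stored, and the Top-N hot words are taken by completion probability. Assuming N = 3, the C-Trie hot-word completion result, that is, the first hot-word set, is: starbucks, starhub, starhotel.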
503、将字符串输入生成模型,获取第二热词集合;
基于用户日志热词数据库,输出以所述字符串为前缀的第二热词集合,所述第二热词集合包括至少一个热词。
基于用户日志热词数据库,输出以所述字符串为前缀的第二热词集合的方式有多种,
可选地,根据所述用户日志热词数据库,构建用户日志热词Trie进行热词补全,获取所述第二热词集合;
可选地,将所述字符串输入基于所述用户日志热词数据库训练的机器学习算法中,输出所述第二热词集合,该机器学习算法包括循环神经网络(recurrent neural network,RNN)、长短期记忆网络(long short-term memory,LSTM)、门控循环单元(GRU)或支持向量机(SVM)等,具体算法此处不做限定,本申请实施例中也将该机器学习算法称为生成模型或者预测模型;
可选地,根据所述用户日志热词数据库,构建用户日志热词哈希树进行热词补全,获取所述第二热词集合。
下面对根据用户日志热词数据库,利用机器学习算法进行热词补全的方案进行介绍:
若满足热词补全条件,通过预测模型对用户输入的字符串进行热词结果预测,取Top-N结果为第二热词集合。
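For example, the prediction model used in this embodiment is a GRU. The internal structure and principles of the GRU are not elaborated here; its approximate input/output structure is shown in FIG. 6. The sequential model GRU takes the input character x_t at the current time step and the hidden state h_{t-1} passed down from the previous time step, which contains information about the preceding nodes. Combining x_t and h_{t-1}, the GRU produces the output y_t for the current time step and the hidden state h_t passed on to the next time step, until the output y is the end symbol. Concatenating all predicted characters y onto the string entered by the user gives the final predicted hot word.
For example, with "star" as the model input, each column represents the input and output at one time step. The predicted Top-3 results are starbucks, starhub, and starstreet, with completion probabilities of 0.77, 0.15, and 0.08 respectively.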
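The character-by-character decoding loop around the model can be sketched without committing to any particular GRU implementation. In the sketch below, `toy_step` is a stand-in for one GRU step and is purely hypothetical: its "hidden state" is just the text seen so far, and it always continues towards the single word "starbucks". Only the loop structure (feed x_t and h_{t-1}, obtain y_t and h_t, stop at the end symbol) reflects the description; a real model would also return probabilities for ranking Top-N results.

```python
END = "\n"  # assumed end symbol


def toy_step(x, h):
    """Stand-in for one GRU step: takes input character x and state h, returns (y, new h)."""
    h = h + x
    target = "starbucks"
    if target.startswith(h) and len(h) < len(target):
        return target[len(h)], h   # predict the next character of the known word
    return END, h                  # otherwise emit the end symbol


def greedy_complete(prefix, step, max_len=32):
    h, y = "", END
    # Feed the characters the user has already typed.
    for ch in prefix:
        y, h = step(ch, h)
    out = prefix
    # Feed each predicted character back in until the end symbol appears.
    while y != END and len(out) < max_len:
        out += y
        y, h = step(y, h)
    return out


print(greedy_complete("star", toy_step))
```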
需要说明的是,部署在服务器中的该机器学习算法可以采用在线学习的形式对用户日志热词进行训练,能够实时反映用户日志数据的变化,保障补全热词的时效性。
504、根据所述第一热词集合和第二热词集合,确定向用户推荐的至少一个热词。
服务器根据所述第一热词集合和第二热词集合,确定最终向用户推荐的至少一个热词,并发送给终端,终端通过显示设备向用户显示推荐的热词。或者,终端根据第一热词集合和第二热词集合,确定向用户推荐的至少一个热词并向用户显示。
服务器可以基于上述两组热词集合提供的热词候选结果集,融合置信度较高的结果并重排序,推荐Top-N热词结果。
可选地,最终推荐的至少一个热词,可能均属于第一热词集合中的热词;或者均属于第二热词集合中的热词;也可能即包含第一热词集合中的热词,也包含第二热词集合中的热词,具体此处不做限定。
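Optionally, based on the two candidate result sets above, first take the Top-1 hot word from each of the POI hot-word completion candidate set and the log hot-word completion candidate set, add them to the hot-word completion result set, and deduplicate the two Top-1 words to obtain K hot words (K = 1 or 2). Then compute a weighted sum for the remaining 2N-2 candidate results and sort them, with the weights adjusted according to the business scenario, and add the Top (N-K) results to the hot-word completion result set. Finally, re-rank the hot-word completion result set to obtain the final hot-word completion result.
For example:
1. Take the Top-1 hot word from each of the POI hot-word completion candidate set and the log hot-word completion candidate set; they are "starbucks" and "starbucks". After deduplication, one hot word, "starbucks", is obtained and added to the result set.
2. Compute a weighted sum for the remaining results and sort them, with a weight of 3 for POI hot-word completion and a weight of 1 for log hot-word completion. Take the Top (3-1) results, "starhub" and "starhotel", and add them to the result set.
The weighted sums of the remaining results, sorted, are shown in Table 1:
Table 1
Hot word | Weighted sum |
starhub | 0.60 |
starhotel | 0.33 |
starstreet | 0.08 |
3. Re-rank the completion result set to obtain the final hot-word completion result: "starbucks", "starhub", "starhotel".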
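The fusion procedure can be checked against Table 1 with a short sketch. Several things here are assumptions: the function name `fuse`; the POI completion probabilities (0.74, 0.15, 0.11), which are not stated in the text and were chosen only so that the weighted sums reproduce Table 1; and the final ordering, for which the text gives no rule, so the Top-1 words are simply kept first. The log probabilities (0.77, 0.15, 0.08) and the weights 3 and 1 come from the example.

```python
def fuse(poi, log, w_poi=3.0, w_log=1.0, n=3):
    """Merge two ranked candidate lists of (word, probability) into n recommendations."""
    # Take the Top-1 of each list and deduplicate, giving K = 1 or 2 words.
    result = []
    for candidates in (poi, log):
        if candidates and candidates[0][0] not in result:
            result.append(candidates[0][0])
    k = len(result)
    # Weighted sum over the remaining candidates of both lists.
    scores = {}
    for candidates, weight in ((poi[1:], w_poi), (log[1:], w_log)):
        for word, prob in candidates:
            if word in result:
                continue
            scores[word] = scores.get(word, 0.0) + weight * prob
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Add the Top (N-K) of the remaining candidates.
    return result + ranked[:max(n - k, 0)], scores


poi_set = [("starbucks", 0.74), ("starhub", 0.15), ("starhotel", 0.11)]  # assumed values
log_set = [("starbucks", 0.77), ("starhub", 0.15), ("starstreet", 0.08)]  # from the example

final, scores = fuse(poi_set, log_set)
print(final)
print({word: round(score, 2) for word, score in scores.items()})
```

For "starhub" the score is 3 × 0.15 + 1 × 0.15 = 0.60, which matches the first row of Table 1 under the assumed POI value.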
本申请实施例采用POI数据构建C-Trie树,对共同前缀的热词进行条件筛选,过滤概率过低的热词,减少前缀候选热词数量,并且对候选热词有序存储。相较已有技术从用户输入第一个字母时开始进行热词补全,本申请实施例的方案根据C-trie树判断是否触发热词补全。可以避免用户输入词过短、搜索意图尚未明确时触发热词补全,干扰非热词搜索逻辑,解决了输入过短时触发热词补全,以及热词补全效率低的问题,可以有效提高热词补全性能。
此外,若热词补全方法中仅考虑用户搜索日志,将无法保证补全的热词与POI数据的相关性,采用热词进行POI搜索有可能无返回结果。本申请实施例提供的热词补全方法,综合POI热词数据库和用户日志热词数据库,可以提高推荐热词与POI数据的相关性,减少根据推荐热词进行POI搜索时无返回结果的情况。
请参阅图7,为本申请实施例中热词补全方法的另一个实施例示意图。
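S1: Filter POI hot words based on POI data. Count the word frequency of each word in the POI data and take a certain proportion of the words with the highest frequency as POI hot words.
S2: Build a C-Trie tree based on the filtered POI hot words. For the specific construction process, refer to the foregoing embodiments; details are not repeated here.
S3: Filter log hot words based on log data. Count the word frequency of each search term in the log data and take a certain proportion of the terms with the highest frequency as log hot words.
S4: Based on the filtered log hot words, train a character-level generative model for predictive completion of incomplete strings. The prediction model includes, but is not limited to, common sequential models such as RNN and LSTM. The input of the model is an incomplete string, and the output is the predicted complete hot word.
S5: For the string entered by the user, determine in the C-Trie tree whether the hot-word completion condition is satisfied. The criterion is whether the node corresponding to the string in the C-Trie tree stores K-V data: if the node contains no K-V data, hot-word completion is not triggered; if it does, hot-word completion is triggered. When the condition is satisfied, the Top-N results are taken from the K-V data of the node corresponding to the string, in descending order of completion probability, as the POI-based hot-word completion candidate set.
S6: When the hot-word completion condition is satisfied, the character-level generative model trained on the log data is used to predict hot words for the user input; the Top-N results are likewise taken as the log-based hot-word completion candidate set.
S7: Based on the two candidate result sets above, first take the Top-1 hot word from each of the POI hot-word completion candidate set and the log hot-word completion candidate set, add them to the hot-word completion result set, and deduplicate to obtain K hot words (K = 1 or 2). Then compute a weighted sum for the remaining 2N-2 candidate results and sort them, with the weights adjusted according to the business scenario, and add the Top (N-K) results to the hot-word completion result set. Finally, re-rank the hot-word completion result set to obtain the final hot-word completion result.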
上面介绍了本申请提供的词的补全方法,下面对实现该词的补全方法的词的补全装置进行介绍,请参阅图8,为本申请实施例中词的补全装置的一个实施例示意图。
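One or more of the modules in FIG. 8 can be implemented in software, hardware, firmware, or a combination thereof. The software or firmware includes, but is not limited to, computer program instructions or code, and can be executed by a hardware processor. The hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).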
该词的补全装置,包括:
获取单元801,用于获取用户输入的字符串;
输出单元802,用于在字典树Trie中查找与所述字符串匹配的目标节点,以输出至少一个词,所述Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储有一个或多个词,所述一个或多个词都包括从所述Trie树的根节点至所述第一个或多个词所在的第一节点的路径上经过的字符所组成的字符串,所述目标节点存储的词中包括输出的所述至少一个词。
可选地,所述一个或多个词的前缀为从所述Trie树的根节点至所述一个或多个词所在的第一节点的路径上经过的字符所组成的字符串。
可选地,所述Trie包括多个第二节点,每个所述第二节点未存储词。
可选地,所述字符串包括按所述用户的输入顺序排列的第一字符至第N字符;所述输出单元802具体用于:按照所述输入顺序在所述字典树中查找,其中,所述用户输入的所述第一字符至第N-1字符组成的字符串匹配到的是所述字典树中的一个第二节点。
可选地,所述第一节点还存储有所述一个或多个词各自对应的补全概率,其中,每个所述词对应的所述补全概率指示在匹配到所述第一节点的情况下,输出所述词的概率。
可选地,所述第一节点存储有至少一个键值结构的数据,所述键值结构包括键和与所述键关联的值,所述键为以从根节点至所述第一节点的路径上经过的字符所组成的字符串为前缀的词,所述值为词的补全概率,其中,所述补全概率指示在匹配到所述第一节点的情况下,输出所述键的概率。
可选地,所述输出的所述至少一个词的顺序与所述目标节点中存储的词的排列顺序有关。
可选地,所述多个第一节点中存储的词来源于信息点POI数据或者用户日志数据。
需要说明的是,在词的补全装置的另一个可能的实现方式中,字典树中节点的存储形式与本实施例中节点的存储形式不同,每个节点对应的字符串较上一节点多一个字符,包括从根节点该节点的路径上任一节点对应的字符串。此时,第一节点存储有包括所述第一节点对应的字符串的至少一个词。除字典树的存储形式之外的部分与本申请实施例类似,具体此处不再赘述。
本申请实施例中词的补全装置,可以用于执行前述实施例提供的词的补全方法。在传统字典树的基础上进行了改进,改进的字典树的节点中存储了热词,存储的热词包括以从根节点至该节点的路径上经过的字符。节点中存储的热词是被补全的概率较高的词,换言之,由于以过短的字符串为前缀的热词被补全的概率通常较低,不被存储在字典树的节点上。词的补全装置通过获取单元获取用户输入的字符串,输出单元通过在改进的字典树上查找与用户输入的字符串匹配的目标节点,输出至少一个词作为推荐的热词。由于只有目标节点存储了热词时才向用户输出补全的热词,且基于该目标节点存储的词进行输出,不需要基于用户输入的字符串查找符合前缀条件的热词,因此本申请实施例提供的词的补全装置可以提高热词补全效率,也使得呈现给用户的补全后的热词能够更符合用户的要求。
请参阅图9,为本申请实施例中词的补全装置的另一个实施例示意图。
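One or more of the modules in FIG. 9 can be implemented in software, hardware, firmware, or a combination thereof. The software or firmware includes, but is not limited to, computer program instructions or code, and can be executed by a hardware processor. The hardware includes, but is not limited to, various integrated circuits, such as a central processing unit (CPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or an application-specific integrated circuit (ASIC).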
该词的补全装置,包括:
获取单元901,用于获取用户输入的字符串;
输出单元902,用于在字典树Trie中查找与所述字符串匹配的目标节点,以输出第一词集合,所述第一词集合包括至少一个词,所述第一Trie包括多个第一节点,所述目标节点为所述多个第一节点中的一个,每个所述第一节点存储包括从所述Trie的根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词,所述目标节点存储的词包括所述第一词集合中的词,所述多个第一节点存储的词来自第一词数据库;
所述输出单元902,还用于基于第二词数据库,输出以所述字符串为前缀的第二词集合;
所述输出单元902,还用于根据所述第一词集合和所述第二词集合输出为用户推荐的至少一个词。
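Optionally, the first word set includes at least two words in an ordered arrangement; the second word set includes at least two words in an ordered arrangement; and the target word set includes the first-ranked word in the first word set and the first-ranked word in the second word set.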
可选地,所述输出单元902具体用于:根据所述第一词集合和所述第二词集合的并集中每个词被输出的概率,输出为用户推荐的至少一个词,所述并集中每个词被输出的概率根据预设的第一词集合的第一权重和第二词集合的第二权重确定。
可选地,所述输出单元902具体用于:根据所述用户日志的词数据库构建的Trie,获取所述第二词集合;或者,将所述字符串输入基于所述用户日志的词数据库训练的机器学习算法中,以输出所述第二词集合;或者,根据所述用户日志的词数据库构建的哈希树,获取所述第二词集合。
本申请实施例中词的补全装置,可以用于执行前述实施例提供的词的补全方法。结合至少两个热词数据库的热词来源,为用户进行热词推荐,其中,基于第一词数据库构建的第一字典树Trie的节点中存储了热词,存储的热词包括以从根节点至该节点的路径上经过的字符。 节点中存储的热词是被补全的概率较高的词,换言之,由于以过短的字符串为前缀的热词被补全的概率通常较低,不被存储在字典树的节点上。由于只有目标节点存储了热词时才向用户输出补全的第一词集合,因此本申请实施例提供的词的补全装置可以避免用户输入字符串较短等意图不明显,或热词被补全的可能性较低场景下触发热词补全。此外,结合第一词集合和第二词集合为用户输出热词,可以提高热词推荐的准确度。
请参阅图10,为本申请实施例中词的补全装置的另一个实施例示意图。
该补全装置可以以软件系统的形式实现,该软件系统部署后通过远程接口的形式对外提供服务。该词的补全装置包括:离线模块1001和在线模块1002两个模块,其中:
离线模块1001的主要任务是将不同数据源的数据处理成特定数据结构或模型;
在线模块1002主要负责响应用户的查询请求。
请参阅图11,为本申请实施例中Trie树的构建装置的一个实施例示意图。
该字典树的构建装置,包括:构建单元1101,用于根据词数据库构建字典树Trie,所述字典树中包括多个第一节点,每个所述第一节点存储有包括从所述根节点至所述第一节点的路径上经过的字符所组成的字符串的至少一个词;删减单元1102,用于对所述节点存储的词进行删减,保留每个节点中补全概率大于或等于第一阈值的词,所述补全概率指示在匹配到该节点的情况下,输出词的概率。
可选地,所述补全概率为所述词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。
可选地,第一阈值为预设值,取值范围为[0.1,0.2]。
可选地,所述Trie的节点存储键值结构的数据,所述键值结构包括键和与所述键关联的值,所述键为以从根节点至所述节点的字符为前缀的词,所述值为所述词的补全概率,所述词的补全概率为所述词的词频占所有以从根节点至所述第一节点的字符为前缀的词的词频总和的比例。
可选地,所述方法还包括:所述Trie的节点存储的词按照补全概率由大到小排列。
本申请实施例提供的字典树的构建装置,用于构建本申请实施例提供的改进的Trie树,即C-Trie树。在传统字典树的基础上进行了改进,具体是在字典树的节点中存储了热词,存储的热词包括以从根节点至该节点的路径上经过的字符。节点中存储的热词是被补全的概率大于或等于第一阈值的词,换言之,由于以过短的字符串为前缀的热词被补全的概率通常较低,不被存储在字典树的节点上。基于改进的字典树,查找与用户输入的字符串匹配的目标节点,基于该目标节点存储的词,输出至少一个词作为推荐的热词,由于只有目标节点存储了热词时才向用户输出补全的热词,且基于该目标节点存储的词进行输出,不需要基于用户输入的字符串查找符合前缀条件的热词,因此可以提高热词补全效率。此外,还可以避免用户输入字符串较短等意图不明显,或热词被补全的可能性较低场景下触发热词补全。
为便于理解,下面将对本申请实施例提供的终端100的结构进行示例说明。参见图12,图12是本申请实施例提供的终端的结构示意图。
如图12所示,终端100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬 声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对终端100的具体限定。在本申请另一些实施例中,终端100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是终端100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。本申请中,控制器可以根据指令实现本申请实施例提供的词的补全方法。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。可选地,存储器存储了预先构建的Trie树。
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface.
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对终端100的结构限定。在本申请另一些实施例中,终端100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。
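The power management module 141 is used to connect the battery 142, the charging management module 140, and the processor 110. It receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and so on.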
终端100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
在一些可行的实施方式中,终端100可以使用无线通信功能和其他设备通信。例如,终端100可以和第二电子设备通信,终端100与第二电子设备建立投屏连接,终端100输出投屏数据至第二电子设备等。其中,终端100输出的投屏数据可以为音视频数据。
天线1和天线2用于发射和接收电磁波信号。终端100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
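The mobile communication module 150 can provide wireless communication solutions applied on the terminal 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna 2. In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some functional modules of the mobile communication module 150 may be disposed in the same device as at least some modules of the processor 110. The terminal can communicate with the server through the mobile communication module.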
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在终端100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线1接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,终端100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得终端100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期 演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
终端100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,终端100可以包括1个或N个显示屏194,N为大于1的正整数。具体地,显示屏194可以将输出的词显示给用户。触摸类显示屏还可以获取用户输入的字符串。
在一些可行的实施方式中,显示屏194可用于显示终端100的系统输出的各个界面。终端100输出的各个界面可参考后续实施例的相关描述。
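The terminal 100 can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when a photo is taken, the shutter opens, light is transmitted through the lens to the camera's photosensitive element, the optical signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP, which converts it into an image visible to the naked eye. The ISP can also apply algorithmic optimization to image noise, brightness, and skin tone, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be located in the camera 193.
The camera 193 is used to capture still images or video. An object passes through the lens to produce an optical image that is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then passes it to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the terminal 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
The digital signal processor (DSP) is used to process digital signals; besides digital image signals, it can also process other digital signals.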
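The video codec is used to compress or decompress digital video. The terminal 100 may support one or more video codecs, so that it can play or record video in multiple encoding formats, for example moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.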
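The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transmission mode between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the terminal 100 can be implemented through the NPU, for example image recognition, face recognition, speech recognition, and text understanding.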
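The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example saving files such as music and videos on the external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes the various functional applications and data processing of the terminal 100 by running the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function). The data storage area can store data created during use of the terminal 100 (such as audio data and the phone book). In addition, the internal memory 121 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or universal flash storage (UFS).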
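The terminal 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like. In some feasible implementations, the audio module 170 can be used to play the sound corresponding to a video. For example, when the display screen 194 displays a video playback picture, the audio module 170 outputs the sound of the video.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert analog audio input into a digital audio signal.
The speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals.
The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.
The microphone 170C, also called a "mic" or "sound transmitter", is used to convert sound signals into electrical signals.
The headset jack 170D is used to connect wired headsets. The headset jack 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.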
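The pressure sensor 180A is used to sense pressure signals and can convert them into electrical signals. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The gyroscope sensor 180B can be used to determine the motion posture of the terminal 100. The barometric pressure sensor 180C is used to measure air pressure.
The acceleration sensor 180E can detect the magnitude of the acceleration of the terminal 100 in various directions (including three or six axes). When the terminal 100 is stationary, it can detect the magnitude and direction of gravity. It can also be used to recognize the posture of the terminal, for applications such as landscape/portrait switching and pedometers.
The distance sensor 180F is used to measure distance.
The ambient light sensor 180L is used to sense ambient light brightness.
The fingerprint sensor 180H is used to collect fingerprints.
The temperature sensor 180J is used to detect temperature.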
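The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touchscreen, also called a "touch-control screen". The touch sensor 180K is used to detect touch operations on or near it. The touch sensor can pass a detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the terminal 100 at a position different from that of the display screen 194.
The buttons 190 include a power button, volume buttons, and the like. The buttons 190 may be mechanical buttons or touch buttons. The terminal 100 can receive button input and generate key signal input related to the user settings and function control of the terminal 100.
The motor 191 can generate vibration prompts.
The indicator 192 may be an indicator light, which can be used to indicate the charging state and battery level changes, and can also be used to indicate messages, missed calls, notifications, and the like.
The SIM card interface 195 is used to connect a SIM card.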
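Refer to FIG. 13, which is a schematic diagram of an embodiment of a server in the embodiments of this application.
The server 1300 provided in this embodiment may vary considerably depending on configuration or performance, and may include one or more processors 1301 and a memory 1302 that stores programs or data.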
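The memory 1302 may be volatile or non-volatile storage. Optionally, the processor 1301 is one or more central processing units (CPU), and each CPU may be a single-core or multi-core CPU. The processor 1301 can communicate with the memory 1302 and execute, on the server 1300, the series of instructions in the memory 1302.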
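The server 1300 further includes one or more wired or wireless network interfaces 1303, such as an Ethernet interface.
Optionally, although not shown in FIG. 13, the server 1300 may further include one or more power supplies and one or more input/output interfaces. The input/output interfaces may be used to connect a display, a mouse, a keyboard, a touchscreen device, a sensing device, or the like. The input/output interfaces are optional components that may or may not be present, which is not limited here.
For the flow executed by the processor 1301 in the server 1300 in this embodiment, refer to the method flow described in the foregoing method embodiments; details are not repeated here.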
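The method flow that the processor 1301 runs centers on a Trie lookup. The following is a minimal Python sketch of that lookup, assuming (as the claims of this application describe) that some nodes store whole words together with completion probabilities while other nodes store nothing. The names `Node`, `CompletionTrie`, `add`, and `complete`, the `limit` parameter, and the sample words and probabilities are hypothetical and are not taken from the application. The sketch also assumes the narrower case in which the path string is a prefix of each stored word.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Child nodes keyed by the next input character.
    children: dict = field(default_factory=dict)
    # Key-value data: word -> completion probability.
    # An empty dict marks a "second node", which stores no words.
    words: dict = field(default_factory=dict)


class CompletionTrie:
    def __init__(self):
        self.root = Node()

    def add(self, prefix, word, probability):
        """Store `word` on the node reached by `prefix` (hypothetical helper)."""
        if not word.startswith(prefix):
            raise ValueError("word must start with the node's path string")
        node = self.root
        for ch in prefix:
            node = node.children.setdefault(ch, Node())
        node.words[word] = probability

    def complete(self, typed, limit=3):
        """Return up to `limit` words stored on the node that matches `typed`."""
        node = self.root
        for ch in typed:  # walk the characters in input order
            node = node.children.get(ch)
            if node is None:
                return []  # no node matches the input
        # No subtree traversal: the candidates are read from the node itself,
        # ordered by completion probability, highest first.
        ranked = sorted(node.words.items(), key=lambda kv: kv[1], reverse=True)
        return [word for word, _ in ranked[:limit]]


trie = CompletionTrie()
trie.add("be", "beijing", 0.6)
trie.add("be", "berlin", 0.3)
trie.add("ber", "berlin", 0.9)

print(trie.complete("b"))    # [] ("b" reaches a node that stores no words)
print(trie.complete("be"))   # ['beijing', 'berlin']
print(trie.complete("ber"))  # ['berlin']
```

Because each word-bearing node already holds its candidates, a lookup costs one step per typed character plus a sort of that node's own entries; keeping the entries pre-sorted at build time would remove the sort as well.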
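A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.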
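If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
In conclusion, the foregoing embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.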
Claims (28)
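- A word completion method, characterized by comprising: obtaining a character string input by a user; and searching a dictionary tree (Trie) for a target node matching the character string, to output at least one word, wherein the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each first node stores one or more words, each of the one or more words includes the character string formed by the characters passed on the path from the root node of the Trie to the first node in which the one or more words are located, and the words stored in the target node include the at least one word that is output.
- The method according to claim 1, characterized in that the prefix of the one or more words is the character string formed by the characters passed on the path from the root node of the Trie to the first node in which the one or more words are located.
- The method according to claim 1 or 2, characterized in that the Trie includes a plurality of second nodes, and each second node stores no words.
- The method according to claim 3, characterized in that the character string includes a first character to an Nth character arranged in the user's input order; and the searching a dictionary tree (Trie) for a target node matching the character string, to output at least one word, comprises: searching the dictionary tree in the input order, wherein the character string formed by the first character to the (N-1)th character input by the user matches a second node in the dictionary tree.
- The method according to any one of claims 1 to 4, characterized in that the first node further stores a completion probability corresponding to each of the one or more words, wherein the completion probability corresponding to each word indicates the probability of outputting that word when the first node is matched.
- The method according to any one of claims 1 to 4, characterized in that the first node stores at least one piece of data in a key-value structure, the key-value structure includes a key and a value associated with the key, the key is a word whose prefix is the character string formed by the characters passed on the path from the root node to the first node, and the value is the completion probability of the word, wherein the completion probability indicates the probability of outputting the key when the first node is matched.
- The method according to any one of claims 1 to 6, characterized in that the order of the at least one word that is output is related to the order in which the words are stored in the target node.
- The method according to any one of claims 1 to 7, characterized in that the words stored in the plurality of first nodes come from point of interest (POI) data or user log data.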
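- A word completion method, characterized by comprising: obtaining a character string input by a user; searching a dictionary tree (Trie) for a target node matching the character string, to output a first word set, wherein the first word set includes at least one word, the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each first node stores at least one word that includes the character string formed by the characters passed on the path from the root node of the Trie to that first node, the words stored in the target node include the words in the first word set, and the words stored in the plurality of first nodes come from a first word database; outputting, based on a second word database, a second word set whose words have the character string as a prefix; and outputting, according to the first word set and the second word set, at least one word recommended to the user.
- The method according to claim 9, characterized in that the first word set includes at least two words arranged in order; the second word set includes at least two words arranged in order; and the target word set includes the top-ranked word in the first word set and the top-ranked word in the second word set.
- The method according to claim 9, characterized in that the outputting, according to the first word set and the second word set, at least one word recommended to the user comprises: outputting the at least one word recommended to the user according to the probability that each word in the union of the first word set and the second word set is output, wherein the probability that each word in the union is output is determined according to a preset first weight of the first word set and a preset second weight of the second word set.
- The method according to any one of claims 9 to 11, characterized in that the outputting, based on a second word database, a second word set whose words have the character string as a prefix comprises: obtaining the second word set from a Trie built from the word database of the user log; or inputting the character string into a machine learning algorithm trained on the word database of the user log, to output the second word set; or obtaining the second word set from a hash tree built from the word database of the user log.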
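- A word completion apparatus, characterized by comprising: an obtaining unit, configured to obtain a character string input by a user; and an output unit, configured to search a dictionary tree (Trie) for a target node matching the character string, to output at least one word, wherein the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each first node stores one or more words, each of the one or more words includes the character string formed by the characters passed on the path from the root node of the Trie to the first node in which the one or more words are located, and the words stored in the target node include the at least one word that is output.
- The apparatus according to claim 13, characterized in that the prefix of the one or more words is the character string formed by the characters passed on the path from the root node of the Trie to the first node in which the one or more words are located.
- The apparatus according to claim 13 or 14, characterized in that the Trie includes a plurality of second nodes, and each second node stores no words.
- The apparatus according to claim 15, characterized in that the character string includes a first character to an Nth character arranged in the user's input order; and the output unit is specifically configured to: search the dictionary tree in the input order, wherein the character string formed by the first character to the (N-1)th character input by the user matches a second node in the dictionary tree.
- The apparatus according to any one of claims 13 to 16, characterized in that the first node further stores a completion probability corresponding to each of the one or more words, wherein the completion probability corresponding to each word indicates the probability of outputting that word when the first node is matched.
- The apparatus according to any one of claims 13 to 16, characterized in that the first node stores at least one piece of data in a key-value structure, the key-value structure includes a key and a value associated with the key, the key is a word whose prefix is the character string formed by the characters passed on the path from the root node to the first node, and the value is the completion probability of the word, wherein the completion probability indicates the probability of outputting the key when the first node is matched.
- The apparatus according to any one of claims 13 to 18, characterized in that the order of the at least one word that is output is related to the order in which the words are stored in the target node.
- The apparatus according to any one of claims 13 to 19, characterized in that the words stored in the plurality of first nodes come from point of interest (POI) data or user log data.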
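- A word completion apparatus, characterized by comprising: an obtaining unit, configured to obtain a character string input by a user; and an output unit, configured to search a dictionary tree (Trie) for a target node matching the character string, to output a first word set, wherein the first word set includes at least one word, the Trie includes a plurality of first nodes, the target node is one of the plurality of first nodes, each first node stores at least one word that includes the character string formed by the characters passed on the path from the root node of the Trie to that first node, the words stored in the target node include the words in the first word set, and the words stored in the plurality of first nodes come from a first word database; the output unit is further configured to output, based on a second word database, a second word set whose words have the character string as a prefix; and the output unit is further configured to output, according to the first word set and the second word set, at least one word recommended to the user.
- The apparatus according to claim 21, characterized in that the first word set includes at least two words arranged in order; the second word set includes at least two words arranged in order; and the target word set includes the top-ranked word in the first word set and the top-ranked word in the second word set.
- The apparatus according to claim 21, characterized in that the output unit is specifically configured to: output the at least one word recommended to the user according to the probability that each word in the union of the first word set and the second word set is output, wherein the probability that each word in the union is output is determined according to a preset first weight of the first word set and a preset second weight of the second word set.
- The apparatus according to any one of claims 21 to 23, characterized in that the output unit is specifically configured to: obtain the second word set from a Trie built from the word database of the user log; or input the character string into a machine learning algorithm trained on the word database of the user log, to output the second word set; or obtain the second word set from a hash tree built from the word database of the user log.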
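- A server, characterized by comprising: one or more processors and a memory, wherein the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the server to implement the method according to any one of claims 1 to 12.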
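- A terminal, characterized by comprising: one or more processors and a memory, wherein the memory stores computer-readable instructions, and the one or more processors read the computer-readable instructions to cause the terminal to implement the method according to any one of claims 1 to 12.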
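- A computer program product, characterized by comprising computer-readable instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 12.
- A computer-readable storage medium, characterized by comprising computer-readable instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 12.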
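The claims above that combine a first word set and a second word set say only that each word's output probability is determined from a preset first weight and a preset second weight; they do not give a formula here. The Python sketch below shows one possible reading, a linear weighted sum over the union of the two sets. The function name `merge_candidates`, the weight values, the tie-breaking rule, and the sample data are all assumptions made for illustration.

```python
def merge_candidates(first_set, second_set,
                     first_weight=0.6, second_weight=0.4, limit=3):
    """Rank the union of two candidate sets by a weighted sum of probabilities.

    `first_set` and `second_set` map each word to the probability assigned by
    its own source. The linear combination is an assumed scoring rule.
    """
    scores = {}
    for word, probability in first_set.items():
        scores[word] = scores.get(word, 0.0) + first_weight * probability
    for word, probability in second_set.items():
        scores[word] = scores.get(word, 0.0) + second_weight * probability
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:limit]]


# Hypothetical candidates for the input "beijing": one set read from the Trie
# built on the first word database, one set obtained from the second word database.
from_first_db = {"beijing zoo": 0.7, "beijing station": 0.3}
from_second_db = {"beijing station": 0.8, "beijing duck": 0.2}

print(merge_candidates(from_first_db, from_second_db))
# ['beijing station', 'beijing zoo', 'beijing duck']
```

A word that appears in both sets accumulates both weighted contributions, which is why "beijing station" outranks "beijing zoo" in this example even though it is not the top entry of the first set.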
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/096,952 US12056192B2 (en) | 2020-07-15 | 2023-01-13 | Word completion method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010683199.8A CN113946719A (zh) | 2020-07-15 | 2020-07-15 | Word completion method and apparatus |
CN202010683199.8 | 2020-07-15 |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/096,952 Continuation US12056192B2 (en) | 2020-07-15 | 2023-01-13 | Word completion method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022012205A1 (zh) | 2022-01-20 |
Family ID=79326239
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/098072 WO2022012205A1 (zh) | 2020-07-15 | 2021-06-03 | Word completion method and apparatus |
Country Status (3)
Country | Link |
---|---|
US (1) | US12056192B2 (zh) |
CN (1) | CN113946719A (zh) |
WO (1) | WO2022012205A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114822533A (zh) * | 2022-04-12 | 2022-07-29 | Guangzhou Xiaopeng Motors Technology Co., Ltd. | Voice interaction method, model training method, electronic device, and storage medium |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
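CN114969242A (zh) * | 2022-01-19 | 2022-08-30 | Alipay (Hangzhou) Information Technology Co., Ltd. | Method and apparatus for automatically completing query content |
CN114822532A (zh) * | 2022-04-12 | 2022-07-29 | Guangzhou Xiaopeng Motors Technology Co., Ltd. | Voice interaction method, electronic device, and storage medium |
CN115113740A (zh) * | 2022-07-04 | 2022-09-27 | Tencent Technology (Shanghai) Co., Ltd. | Information input method, apparatus, device, storage medium, and program product |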
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
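CN107665217A (zh) * | 2016-07-29 | 2018-02-06 | Suning Commerce Group Co., Ltd. | Vocabulary processing method and system for search services |
US20180225572A1 (en) * | 2017-02-03 | 2018-08-09 | Baidu Online Network Technology(Beijing) Co, Ltd. | Neural network machine translation method and apparatus |
CN110851722A (zh) * | 2019-11-12 | 2020-02-28 | Tencent Cloud Computing (Beijing) Co., Ltd. | Dictionary-tree-based search processing method, apparatus, device, and storage medium |
CN111400584A (zh) * | 2020-03-16 | 2020-07-10 | Southern University of Science and Technology | Associated-word recommendation method, apparatus, computer device, and storage medium |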
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2323856A1 (en) * | 2000-10-18 | 2002-04-18 | 602531 British Columbia Ltd. | Method, system and media for entering data in a personal computing device |
US20100235780A1 (en) * | 2009-03-16 | 2010-09-16 | Westerman Wayne C | System and Method for Identifying Words Based on a Sequence of Keyboard Events |
WO2010139277A1 (en) * | 2009-06-03 | 2010-12-09 | Google Inc. | Autocompletion for partially entered query |
US8676828B1 (en) * | 2009-11-04 | 2014-03-18 | Google Inc. | Selecting and presenting content relevant to user input |
US20120246133A1 (en) * | 2011-03-23 | 2012-09-27 | Microsoft Corporation | Online spelling correction/phrase completion system |
US8700654B2 (en) * | 2011-09-13 | 2014-04-15 | Microsoft Corporation | Dynamic spelling correction of search queries |
US9158758B2 (en) * | 2012-01-09 | 2015-10-13 | Microsoft Technology Licensing, Llc | Retrieval of prefix completions by way of walking nodes of a trie data structure |
US8972388B1 (en) * | 2012-02-29 | 2015-03-03 | Google Inc. | Demotion of already observed search query completions |
US8825474B1 (en) * | 2013-04-16 | 2014-09-02 | Google Inc. | Text suggestion output using past interaction data |
US9477782B2 (en) * | 2014-03-21 | 2016-10-25 | Microsoft Corporation | User interface mechanisms for query refinement |
US9659109B2 (en) * | 2014-05-27 | 2017-05-23 | Wal-Mart Stores, Inc. | System and method for query auto-completion using a data structure with trie and ternary query nodes |
US11061893B2 (en) * | 2014-05-30 | 2021-07-13 | Apple Inc. | Multi-domain query completion |
US10078631B2 (en) * | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US10824678B2 (en) * | 2017-06-03 | 2020-11-03 | Apple Inc. | Query completion suggestions |
US11106690B1 (en) * | 2018-02-20 | 2021-08-31 | A9.Com, Inc. | Neural query auto-correction and completion |
US11010179B2 (en) * | 2018-04-20 | 2021-05-18 | Facebook, Inc. | Aggregating semantic information for improved understanding of users |
US20200042104A1 (en) * | 2018-08-03 | 2020-02-06 | International Business Machines Corporation | System and Method for Cognitive User-Behavior Driven Messaging or Chat Applications |
US11475053B1 (en) * | 2018-09-28 | 2022-10-18 | Splunk Inc. | Providing completion recommendations for a partial natural language request received by a natural language processing system |
CN110046298B (zh) | 2019-04-24 | 2021-04-13 | National University of Defense Technology | Query word recommendation method, apparatus, terminal device, and computer-readable medium |
2020
- 2020-07-15 CN CN202010683199.8A patent/CN113946719A/zh active Pending
2021
- 2021-06-03 WO PCT/CN2021/098072 patent/WO2022012205A1/zh active Application Filing
2023
- 2023-01-13 US US18/096,952 patent/US12056192B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
US12056192B2 (en) | 2024-08-06 |
CN113946719A (zh) | 2022-01-18 |
US20230195801A1 (en) | 2023-06-22 |
Similar Documents
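Publication | Publication Date | Title |
---|---|---|
WO2022012205A1 (zh) | | Word completion method and apparatus |
US10956771B2 (en) | | Image recognition method, terminal, and storage medium |
CN116797684B (zh) | | Image generation method, apparatus, electronic device, and storage medium |
WO2023125335A1 (zh) | | Method for generating question-answer pairs and electronic device |
WO2021018154A1 (zh) | | Information representation method and apparatus |
WO2020108234A1 (zh) | | Image index generation method, image search method, apparatus, terminal, and medium |
CN111914113B (zh) | | Image retrieval method and related apparatus |
CN113378556A (zh) | | Method and apparatus for extracting text keywords |
WO2022100221A1 (zh) | | Retrieval processing method, apparatus, and storage medium |
CN109918669A (zh) | | Entity determination method, apparatus, and storage medium |
WO2024040865A1 (zh) | | Video editing method and electronic device |
CN105531758A (zh) | | Speech recognition using foreign word grammar |
WO2021180109A1 (zh) | | Electronic device, and search method and medium for the electronic device |
WO2024036616A1 (zh) | | Terminal-based question answering method and apparatus |
CN114757208B (zh) | | Question-answer matching method and apparatus |
WO2021169351A1 (zh) | | Coreference resolution method, apparatus, and electronic device |
KR20200083159A (ko) | | Method and system for searching photos on a user terminal |
CN110110045A (zh) | | Method, apparatus, and storage medium for retrieving similar text |
US20220012436A1 (en) | | Electronic device and method for translating language |
US20160012078A1 (en) | | Intelligent media management system |
CN111738000B (zh) | | Phrase recommendation method and related apparatus |
CN113742460A (zh) | | Method and apparatus for generating a virtual character |
WO2023040603A1 (zh) | | Search method, terminal, server, and system |
CN113505596B (zh) | | Topic switch marking method, apparatus, and computer device |
KR20240052055A (ko) | | Cross-modal retrieval method and related device |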
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21842851; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 21842851; Country of ref document: EP; Kind code of ref document: A1 |