WO2015025467A1 - Text character string search device, text character string search method, and text character string search program - Google Patents
- Publication number
- WO2015025467A1 (PCT/JP2014/003817)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- prefix
- character string
- search
- score
- specified
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Definitions
- The present invention relates to a character string search device, a character string search method, and a character string search program for searching for a key that partially matches an input character string.
- For example, search keyword candidates are displayed on the input form of a search engine, or candidate URLs (Uniform Resource Locators) are displayed on the URL input form of a Web browser.
- displaying conversion candidates at the time of prediction conversion of IME (Input Method Editor) or displaying correct spelling candidates in the spell checker are examples of input support.
- Such input support is realized as a dictionary search.
- a character string that the user is likely to input is registered in advance in the dictionary as a key.
- the dictionary is searched using the character string input by the user as a search query, an appropriate key is acquired as an input candidate, and displayed on the screen.
- In search keyword recommendation, for example, search keywords previously input by users are registered in the dictionary in advance and used as input candidates.
- Top-k search (Top-k dictionary search)
- Non-Patent Document 1 describes a data structure that obtains high-ranking keys at high speed from among the matching keys by using an RMQ Trie and an RMQ (Range Minimum Query) structure.
- FIG. 9 is an explanatory diagram showing the RMQ Trie.
- the node v having the search query P as a prefix is found, and the key range [a, b] under the node v is obtained. All keys included in [a, b] have the search query P as a prefix.
- The top k keys having the search query P as a prefix are then obtained.
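- The prefix-range lookup described above can be sketched as follows. This is a minimal illustration, not the data structure of Non-Patent Document 1: a sorted Python list stands in for the trie, `heapq.nlargest` stands in for the RMQ structure, and the names `KEYS`, `SCORES`, `prefix_range`, and `top_k_by_prefix` as well as the toy keys and scores are assumptions made for this example.

```python
import bisect
import heapq

# Hypothetical toy dictionary: keys in lexicographic order, with their scores.
KEYS = ["aggressive", "congress", "congressmen", "progress"]
SCORES = [12, 45, 13, 21]

def prefix_range(keys, query):
    """Half-open range [a, b) of sorted keys that have `query` as a prefix.

    Assumes no key contains the sentinel character U+FFFF.
    """
    a = bisect.bisect_left(keys, query)
    b = bisect.bisect_left(keys, query + "\uffff")
    return a, b

def top_k_by_prefix(keys, scores, query, k):
    """Top-k keys starting with `query`, highest score first."""
    a, b = prefix_range(keys, query)
    # heapq.nlargest scans the range; an RMQ structure would avoid the scan.
    best = heapq.nlargest(k, range(a, b), key=scores.__getitem__)
    return [(keys[i], scores[i]) for i in best]
```

- With these toy values, `top_k_by_prefix(KEYS, SCORES, "con", 2)` finds "congress" and "congressmen", while a query such as "gres" that occurs only inside the keys yields an empty range, which is the limitation of prefix-only search.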
- Non-Patent Document 1 also describes two other types of data structures that, like the RMQ Trie, are used to acquire high-ranking keys at high speed from among the matching keys.
- Non-patent document 2 describes Top-k search in document search. This method realizes Top-k search by adding additional data necessary for Top-k search to the data structure based on the data structure for document search.
- By using the data structures described in Non-Patent Document 1, key candidates that match by prefix (forward match) can be acquired at high speed, but it is difficult to acquire key candidates that partially match.
- With the method described in Non-Patent Document 2, Top-k search in document search can be realized.
- However, since the data used for document search is large, directly applying the search method used for document search to a dictionary makes the target data size large.
- Therefore, an object of the present invention is to provide a character string search device, a character string search method, and a character string search program that can perform partial match search of a character string at high speed while reducing the amount of data.
- A character string search device according to the present invention searches for a search candidate character string including an input character string from a set of search candidate character strings, each associated with a character string score indicating the degree to which it should be preferentially searched. The device includes: a prefix set specifying unit that specifies a set of prefixes ending with the input character string from a set of prefixes, each prefix being a string of one or more consecutive characters extracted from the first character of a search candidate character string; a prefix specifying unit that specifies, from the set of prefixes ending with the input character string, the prefix with the highest prefix score, where the prefix score of each prefix is defined as the largest of the character string scores associated with the search candidate character strings beginning with that prefix; and a character string specifying unit that specifies the search candidate character string with the highest character string score from among the search candidate character strings starting with the specified prefix.
- A character string search method according to the present invention searches for a search candidate character string including an input character string from a set of search candidate character strings, each associated with a character string score indicating the degree to which it should be preferentially searched. The method includes: a prefix set specifying step of specifying a set of prefixes ending with the input character string from a set of prefixes, each prefix being a string of one or more consecutive characters extracted from the first character of a search candidate character string; a prefix specifying step of specifying, from the set of prefixes ending with the input character string, the prefix with the highest prefix score, where the prefix score of each prefix is defined as the largest of the character string scores associated with the search candidate character strings beginning with that prefix; and a character string specifying step of specifying the search candidate character string with the highest character string score from among the search candidate character strings starting with the specified prefix.
- A character string search program according to the present invention is applied to a computer that searches for a search candidate character string including an input character string from a set of search candidate character strings, each associated with a character string score indicating the degree to which it should be preferentially searched. The program causes the computer to execute: prefix set specifying processing that specifies a set of prefixes ending with the input character string from a set of prefixes, each prefix being a string of one or more consecutive characters extracted from the first character of a search candidate character string; prefix specifying processing that specifies, from the set of prefixes ending with the input character string, the prefix with the highest prefix score, where the prefix score of each prefix is defined as the largest of the character string scores associated with the search candidate character strings beginning with that prefix; and character string specifying processing that specifies the search candidate character string with the highest character string score from among the search candidate character strings starting with the specified prefix.
- According to the present invention, a partial match search of a character string can be performed at high speed while reducing the amount of data.
- The present invention realizes a data structure for searching, at high speed, for high-ranking keys that partially match an input character string, by extending XBW, a dictionary data structure, to Top-k search.
- Each key that is a search candidate character string is assigned a score (hereinafter referred to as a character string score) indicating its search priority, and the set of keys is represented by a trie structure.
- all the prefixes of the keys included in the key set are represented by the XBW structure used for dictionary search.
- the character string search device of the present invention uses this XBW structure to specify a prefix range ending with the input character string.
- Each prefix is associated with a maximum score (hereinafter referred to as a prefix score) among keys starting with the prefix. Therefore, the character string search device identifies the prefix with the largest prefix score within the identified prefix range.
- the RMQ structure is used to specify the maximum prefix score within the specified prefix range.
- the RMQ structure used to represent the relationship between the prefix and the prefix score is referred to as a first RMQ structure.
- the character string search device uses the first RMQ structure to specify the prefix with the largest prefix score within the specified prefix range.
- the character string search device identifies the key with the largest character string score among the keys that start with the identified prefix.
- the identified prefix corresponds to one node in the trie tree. Therefore, in order to specify the maximum character string score in the range of keys existing under each node, the RMQ structure is used as in the case of specifying the maximum prefix score.
- the RMQ structure used to represent the relationship between the key and the character string score is referred to as a second RMQ structure.
- the character string search device uses the second RMQ structure to specify the key with the maximum character string score from the range of keys starting with the specified prefix.
- After specifying the key with the maximum character string score, the character string search device searches for the keys with the second and subsequent highest character string scores in order to apply this character string search to Top-k search.
- The second and subsequent keys are found either among the remaining keys starting with an already specified prefix, or among the keys starting with prefixes that have not yet been specified.
- the character string search device holds the prefix score of the specified prefix and the character string score of the specified key.
- the character string search device selects a key or prefix that is the maximum score among the character string scores and prefix scores it holds. If it is a key, the key with the next highest string score is searched for among the keys that start with the same prefix. If it is a prefix, a prefix having the next highest prefix score is searched for. By repeating this, it is possible to efficiently search for keys having a higher character string score among keys including the input character string.
- FIG. 1 is a block diagram showing a configuration example of a first embodiment of a character string search device according to the present invention.
- The character string search device according to the present embodiment includes an input unit 10, a prefix set specifying unit 20, a search management unit 30, a prefix specifying unit 31, a character string specifying unit 32, an output unit 40, and a search information storage unit 50.
- the input unit 10 inputs a character string of one character or more.
- the character string search device searches for a key that partially matches an input character string.
- an input character string is referred to as a search query (or simply a query) P.
- the search information storage unit 50 stores a set of keys that are search candidate character strings.
- the character string score is associated with the key used in this embodiment. That is, the character string search device of the present embodiment preferentially searches for a key having a higher character string score from the set of keys.
- the search target key is represented using the structure of the trie tree.
- FIG. 2 is an explanatory diagram illustrating an example of a trie tree corresponding to a key.
- the trie tree is configured to place a common character at a common node.
- the search information storage unit 50 may store the key itself represented by the trie tree, or may store only the structure of the trie tree, as will be described later.
- each leaf node represented by a tree structure corresponds to each key. Accordingly, the search information storage unit 50 stores the score (character string score) of each key illustrated in FIG. 2 in association with each leaf node. Thereby, when the trie tree is searched and the leaf node is reached, the character string score corresponding to the key represented by the leaf node can be acquired.
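- A minimal sketch of a trie whose key-end nodes carry the character string score is shown below. Nested dictionaries are used purely for illustration, the marker "#" for the end of a key follows the convention used later in this description, and the function names and toy keys are assumptions. Keys are assumed not to contain "#".

```python
def build_trie(scored_keys):
    """Build a nested-dict trie; the entry '#' under a node stores the key's score."""
    root = {}
    for key, score in scored_keys.items():
        node = root
        for ch in key:
            # Keys sharing a prefix share the nodes along that prefix.
            node = node.setdefault(ch, {})
        node["#"] = score  # leaf reached: attach the character string score
    return root

def lookup_score(trie, key):
    """Return the score stored at the leaf for `key`, or None if `key` is not registered."""
    node = trie
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("#")

# Hypothetical toy dictionary.
TRIE = build_trie({"congress": 45, "congressmen": 13, "progress": 21})
```

- Here `lookup_score(TRIE, "congress")` returns the score attached to the leaf, while `lookup_score(TRIE, "congres")` returns `None` because that node is an inner node, not the end of a key.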
- the search information storage unit 50 stores a set of prefixes p so that a character string ending with the query P can be searched.
- the prefix p is a character string of one or more consecutive characters extracted from the first character of each key. This set of prefixes p may be sorted lexicographically from the end.
- A structure called XBW is used to represent such a set of prefixes.
- XBW is a data structure that can efficiently represent a labeled tree structure. By expressing the trie tree using this XBW structure, a range search of the prefix p ending with the query P becomes possible.
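- The effect of sorting the prefixes from the end can be sketched without XBW itself. In the illustration below, a plain Python list of prefixes sorted by their reversed text stands in for the XBW structure, so it shows only why prefixes ending with the query P form one contiguous range, not how XBW stores them compactly. All names and the toy keys are assumptions, and the sentinel U+FFFF is assumed not to occur in any key.

```python
import bisect

def all_prefixes(keys):
    """Every non-empty prefix of every key, without duplicates."""
    return {key[:i] for key in keys for i in range(1, len(key) + 1)}

def sort_from_end(keys):
    """Sort the prefixes lexicographically by comparing from the last character."""
    return sorted(all_prefixes(keys), key=lambda p: p[::-1])

def prefixes_ending_with(sorted_prefixes, query):
    """Binary-search the contiguous range of prefixes that end with `query`."""
    reversed_prefixes = [p[::-1] for p in sorted_prefixes]
    q = query[::-1]
    a = bisect.bisect_left(reversed_prefixes, q)
    b = bisect.bisect_left(reversed_prefixes, q + "\uffff")
    return sorted_prefixes[a:b]

# Hypothetical toy dictionary.
PREFIXES = sort_from_end(["aggressive", "congress", "congressmen", "progress"])
```

- For the query "gres", `prefixes_ending_with(PREFIXES, "gres")` returns the prefixes "aggres", "congres", and "progres", each of which identifies the keys that contain "gres".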
- the XBW is known to be able to realize a data structure that realizes an equivalent operation by two types of methods.
- the first XBW has a structure in which characters representing child nodes are associated with each prefix of the dictionary at a node on the trie tree corresponding to the prefix.
- the second XBW has a structure in which each prefix of the dictionary is associated with an ID of a prefix that becomes a parent node in a node on the trie tree corresponding to the prefix.
- FIG. 3 is an explanatory diagram showing an example of the first XBW.
- prefixes corresponding to the nodes of the trie tree are arranged in lexicographic order from the end, and characters representing child nodes are associated with the respective prefixes.
- FIG. 4 is an explanatory diagram showing an example of the second XBW.
- the prefixes corresponding to the nodes of the trie tree are arranged in lexicographic order from the end, and an ID is assigned to each prefix.
- Each prefix is associated with its parent ID.
- Such a structure makes it possible to move to the next parent node.
- a range search can be performed for the prefix p ending with the query P.
- Since the second XBW has only the parent ID, it is difficult to search for a child node. However, even when the second XBW is used, a range search of the prefix p ending with the query P is possible. In this embodiment, either XBW can be used.
- the first XBW is described in Reference Document 1, and the second XBW is described in Reference Document 2.
- <Reference 1> Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini and S. Muthukrishnan, "Structuring labeled trees for optimal succinctness, and beyond", FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, pages 184-196
- a score (that is, a prefix score) is defined for each prefix.
- the prefix score is defined by the largest character string score among character string scores associated with keys starting with the prefix. When this score is expressed by an expression, it becomes as shown in Expression 1.
- the score on the right side in Expression 1 represents a character string score, and the score on the left side in Expression 1 represents a prefix score.
- \( \mathrm{Score}(p) = \max \{\, \mathrm{Score}(k) \mid k \text{ is a key starting with prefix } p \,\} \) (Formula 1)
- the prefix score is the largest character string score among the keys existing under the node.
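- Formula 1 can be evaluated directly for a small dictionary. The sketch below computes the prefix score of every prefix by brute force; the function name `prefix_scores` and the toy keys and scores are assumptions made for illustration.

```python
def prefix_scores(scored_keys):
    """Map each prefix p to the largest character string score among keys starting with p."""
    scores = {}
    for key, score in scored_keys.items():
        for i in range(1, len(key) + 1):
            prefix = key[:i]
            # Keep the maximum over all keys that share this prefix (Formula 1).
            scores[prefix] = max(scores.get(prefix, score), score)
    return scores

# Hypothetical toy dictionary.
PREFIX_SCORES = prefix_scores(
    {"aggressive": 12, "congress": 45, "congressmen": 13, "progress": 21}
)
```

- The prefix "congres" is shared by "congress" (45) and "congressmen" (13), so its prefix score is 45.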
- a first RMQ structure is added to the XBW structure, and a prefix score corresponding to each node can be specified using the first RMQ structure.
- the prefix score of each prefix is stored in an array used in RMQ.
- Hereinafter, the array in which the prefix scores are stored is referred to as the prefix score sequence R_p. Since the prefixes are sorted from the end, prefixes that end with the same string are specified as a single contiguous range. Therefore, by using the first RMQ structure, the maximum value in any range of the prefix score sequence R_p can be specified.
- the character string score of each key can be specified using the second RMQ structure.
- the character string score of each key is stored in an array used in RMQ.
- Hereinafter, the array in which the character string scores are stored is referred to as the character string score sequence R_k. Since the keys are sorted from the beginning, keys starting with a given prefix are specified as a single contiguous range. Therefore, the maximum value in an arbitrary range of the character string score sequence R_k can be specified by using the second RMQ structure.
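- The description does not say which RMQ implementation is used. As one possible stand-in, the sketch below uses a sparse table that answers "position of the maximum in a range" in constant time after preprocessing; it is not succinct, so it illustrates the query only, not the space savings. The class name `RangeMax` and the sample scores are assumptions.

```python
class RangeMax:
    """Sparse table over a score array; argmax(a, b) covers the half-open range [a, b)."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[j][i] is the index of the maximum of values[i : i + 2**j].
        self.table = [list(range(n))]
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            row = []
            for i in range(n - 2 * span + 1):
                left, right = prev[i], prev[i + span]
                row.append(left if self.values[left] >= self.values[right] else right)
            self.table.append(row)
            span *= 2

    def argmax(self, a, b):
        """Index of the largest value in values[a:b]; requires a < b."""
        level = (b - a).bit_length() - 1
        left = self.table[level][a]
        right = self.table[level][b - (1 << level)]
        return left if self.values[left] >= self.values[right] else right

# Hypothetical score sequence, e.g. character string scores of keys in sorted order.
R_K = RangeMax([12, 45, 13, 21])
```

- The same class can be instantiated once over the prefix score sequence and once over the character string score sequence, mirroring the first and second RMQ structures.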
- FIG. 5 is an explanatory diagram illustrating an example of a data structure stored in the search information storage unit.
- The XBW structure in the present embodiment is represented by a set S having a triple of elements S_last, S_α, and S_π for each node of the trie tree.
- S_last is a binary flag, and is 1 when the node is the last child of its parent node, and 0 otherwise.
- S_α is the character represented by the node.
- S_π is the prefix corresponding to the parent node of the node, and is the character string obtained by sequentially combining characters from the root to the parent node. Note that S_π does not include the character of the node itself.
- The triples are sorted in lexicographic order by comparing the prefixes they contain from the last character to the first character.
- Row numbers are assigned to the sorted triples in order from the top.
- $ indicates the beginning of the key
- # indicates the end of the key.
- a prefix score Rp is defined for each prefix. Since the prefix score Rp is calculated from the character string score associated with each key as described above, the prefix score does not need to be explicitly retained.
- the prefix IDs illustrated in FIG. 5 are assigned in the order in which all prefixes included in the dictionary are sorted from the end. Therefore, the order of prefix IDs matches the order of prefixes for which 1 is set in S last .
- a range of prefixes ending with the query P can be specified.
- For example, the lines corresponding to the prefixes ending with the query “ab” are line numbers 7 to 9 (that is, the lines corresponding to “$ ab” and “$ cab”).
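- The row layout of this table can be sketched as follows. Each row is a triple (last-child flag, node character, parent prefix), written here as `(s_last, s_alpha, s_pi)`, with "$" marking the beginning of a key and "#" its end. The toy keys are hypothetical and do not reproduce FIG. 5, and ordering rows with the same parent prefix by node character is an assumption of this sketch.

```python
def xbw_rows(keys):
    """Return (s_last, s_alpha, s_pi) rows sorted by the parent prefix read from its end."""
    children = {}  # parent prefix -> set of child characters
    for key in keys:
        path = "$"
        for ch in key + "#":
            children.setdefault(path, set()).add(ch)
            path += ch
    rows = []
    for parent, chars in children.items():
        ordered = sorted(chars)
        for ch in ordered:
            s_last = 1 if ch == ordered[-1] else 0  # 1 only for the last child
            rows.append((s_last, ch, parent))
    # Compare parent prefixes from the last character to the first.
    rows.sort(key=lambda row: (row[2][::-1], row[1]))
    return rows

# Hypothetical toy dictionary.
ROWS = xbw_rows(["ab", "abc", "cab"])
HITS = [i for i, row in enumerate(ROWS) if row[2].endswith("ab")]
```

- In this toy table the rows whose parent prefix ends with "ab" occupy the consecutive positions in `HITS`, which is the property that makes the range search possible.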
- the ID of the prefix having the maximum score in the range can be acquired with the first RMQ structure. Furthermore, by using the first RMQ structure recursively, it is possible to acquire a prefix ID having a score of second or lower.
- the prefix set specifying unit 20 specifies a set of prefixes including the input character string from the set of prefixes stored in the search information storage unit 50. Specifically, the prefix set specifying unit 20 specifies a set of prefixes ending with the input character string. For example, when the search information storage unit 50 stores a set of prefixes illustrated in FIG. 5, when “ab” is input as a character string, the prefix set specifying unit 20 sets the line numbers 7 to 9. A prefix existing in the range (ie, “$ ab”, “$ cab”) is specified as a set of prefixes.
- the prefix specifying unit 31 specifies a prefix having a higher prefix score from the set of prefixes specified by the prefix set specifying unit 20.
- the prefix identification unit 31 may identify a prefix having the largest prefix score or a prefix corresponding to the top n prefix scores (n is an arbitrary natural number).
- the character string specifying unit 32 specifies a key having a higher character string score among keys starting with the specified prefix.
- The character string specifying unit 32 may search for the key having the largest character string score or the keys corresponding to the top m character string scores (m is an arbitrary natural number).
- the prefix identifying unit 31 identifies the prefix as “$ ab”.
- the keys starting with the identified prefix “$ ab” are “aba” and “abcc”.
- the character string score of “abba” is 3, and the character string score of “abcc” is 9.
- the character string specifying unit 32 may select “abcc” as a key.
- the search management unit 30 specifies a prefix range to be searched by the prefix specifying unit 31.
- the search management unit 30 specifies a key range to be searched by the character string specifying unit 32, and specifies the key specified by the character string specifying unit 32 as a search target key.
- the search management unit 30 first specifies the prefix range specified by the prefix set specifying unit 20 as the prefix range searched by the prefix specifying unit 31. Then, the search management unit 30 specifies a key starting with a prefix within the specified range as a key range to be searched by the character string specifying unit 32. Then, the search management unit 30 specifies the key specified by the character string specifying unit 32 as the search target key.
- the search management unit 30 specifies a range excluding the already specified key from the keys starting with the key prefix specified by the character string specifying unit 32. Further, the search management unit 30 specifies a range obtained by removing the prefix specified by the prefix specifying unit 31 from the set of prefixes specified by the prefix set specifying unit 20.
- the search management unit 30 causes the prefix specifying unit 31 and the character string specifying unit 32 to execute each process.
- the prefix specifying unit 31 specifies the prefix having the maximum prefix score from the prefix range specified by the search management unit 30.
- the character string specifying unit 32 specifies the key having the maximum character string score from the key range specified by the search management unit 30.
- The search management unit 30 compares the prefix score specified from the prefix range with the character string score of the key specified from the key range. If the highest score is a character string score, the key having the next highest character string score is searched for among the keys starting with the same prefix as that key. Specifically, the search management unit 30 excludes the key from the key range used when the key was specified, and bisects the key range to specify two ranges.
- The character string specifying unit 32 specifies the key having the maximum character string score from each of the two ranges.
- On the other hand, if the highest score is a prefix score, the search management unit 30 excludes the prefix from the prefix range used when the prefix was specified, and bisects the prefix range to specify two ranges.
- The prefix specifying unit 31 specifies the prefix having the maximum prefix score from each of the two ranges.
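- The "exclude the maximum and bisect the range" step can be sketched as a generator that lists the positions of a score array in descending order of score. A linear scan (`argmax`) stands in for the RMQ structure, and the function names and sample scores are assumptions made for illustration.

```python
import heapq

def argmax(values, lo, hi):
    """Index of the largest value in values[lo:hi]; stand-in for an RMQ query."""
    return max(range(lo, hi), key=values.__getitem__)

def descending(values, a, b):
    """Lazily yield the indices of values[a:b] from the highest score to the lowest."""
    heap = []

    def push(lo, hi):
        # Register the maximum of a non-empty range as a candidate.
        if lo < hi:
            m = argmax(values, lo, hi)
            heapq.heappush(heap, (-values[m], m, lo, hi))

    push(a, b)
    while heap:
        _, m, lo, hi = heapq.heappop(heap)
        yield m
        # Exclude position m and bisect the range it came from.
        push(lo, m)
        push(m + 1, hi)

# Hypothetical score sequence.
ORDER = list(descending([12, 45, 13, 21], 0, 4))
```

- Because the generator is lazy, taking only the first few elements performs only the range queries needed for those elements, which is what makes the approach suitable for Top-k search.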
- the output unit 40 outputs the key specified by the search management unit 30 as a search result.
- the prefix set specifying unit 20, the search managing unit 30, the prefix specifying unit 31, and the character string specifying unit 32 are realized by a CPU of a computer that operates according to a program (character string search program).
- For example, the program may be stored in a storage unit (not shown) of the character string search device, and the CPU may read the program and operate as the prefix set specifying unit 20, the search management unit 30, the prefix specifying unit 31, and the character string specifying unit 32 according to the program.
- each of the prefix set specifying unit 20, the search managing unit 30, the prefix specifying unit 31, and the character string specifying unit 32 may be realized by dedicated hardware.
- FIG. 6 is a flowchart showing an operation example of the character string search apparatus according to the present embodiment.
- k keys are selected as candidates.
- The search management unit 30 holds a priority queue (not shown) that stores the prefix and prefix score pairs specified by the prefix specifying unit 31 and the key and character string score pairs specified by the character string specifying unit 32.
- This priority queue is a queue for holding candidate information. In the following description, a priority queue is simply referred to as a queue.
- the input unit 10 inputs a character string to be searched (step S11).
- the prefix set specifying unit 20 refers to the search information storage unit 50 and specifies a set of prefixes including the input character string (step S12).
- the prefix identification unit 31 identifies a prefix having the largest prefix score from the prefix set identified by the prefix set identification unit 20, and holds the identified prefix / prefix score pair in the queue. (Step S13).
- the character string specifying unit 32 specifies the key having the highest character string score from the keys starting with the specified prefix, and holds the specified key / character string score pair in the queue (step S14).
- the search management unit 30 specifies the prefix or key of the maximum score among the prefix scores or character string scores held in the queue (step S15). Then, the search management unit 30 determines whether the maximum score is a prefix score or a character string score (step S16).
- If the maximum score is a character string score, the search management unit 30 specifies the key with that character string score as an output target and excludes it from the queue (step S17). Then, the character string specifying unit 32 specifies the key having the next highest character string score after the excluded key in the key range used when specifying the excluded key, and holds the specified key and character string score pair in the queue (step S18).
- On the other hand, if the maximum score is a prefix score, the search management unit 30 excludes the prefix with that prefix score from the queue (step S19). Then, the prefix specifying unit 31 specifies the prefix having the next largest prefix score after the excluded prefix in the prefix range used when specifying the excluded prefix, and holds the specified prefix and prefix score pair in the queue (step S20).
- The character string specifying unit 32 specifies the key having the largest character string score from the keys starting with the prefix specified in step S20, and holds the specified key and character string score pair in the queue (step S21).
- If the queue is empty or the maximum score in the queue is lower than the k-th largest character string score found so far (Yes in step S22), the search management unit 30 outputs the keys found so far as the top keys (step S23). On the other hand, when the queue is not empty and the maximum score in the queue is not lower than the k-th largest character string score found so far (No in step S22), the processing from step S15 onward is repeated.
- In this way, by putting the prefix and prefix score pairs specified by the prefix specifying unit 31 and the key and character string score pairs specified by the character string specifying unit 32 in the same priority queue, the pair with the largest score, whether a prefix score or a character string score, can be taken out.
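- The flow of steps S11 to S23 can be sketched end to end as follows. This is an illustration under several assumptions, not the patented implementation: sorted Python lists stand in for the XBW structure, a linear `argmax` stands in for both RMQ structures, prefix scores are stored explicitly, the loop simply stops once k keys have been output, and a `seen` set (an addition of this sketch) suppresses a key that is reachable through more than one prefix. All names and the toy dictionary are hypothetical.

```python
import bisect
import heapq

def argmax(values, lo, hi):
    """Index of the largest value in values[lo:hi]; stand-in for an RMQ query."""
    return max(range(lo, hi), key=values.__getitem__)

def build_index(scored_keys):
    keys = sorted(scored_keys)                    # keys sorted from the beginning
    key_scores = [scored_keys[k] for k in keys]   # character string scores
    prefixes = sorted({k[:i] for k in keys for i in range(1, len(k) + 1)},
                      key=lambda p: p[::-1])      # prefixes sorted from the end
    # Key range [lo, hi) under each prefix (assumes U+FFFF is not used in keys).
    ranges = [(bisect.bisect_left(keys, p), bisect.bisect_left(keys, p + "\uffff"))
              for p in prefixes]
    prefix_scores = [max(key_scores[lo:hi]) for lo, hi in ranges]
    return keys, key_scores, prefixes, ranges, prefix_scores

def top_k_partial(scored_keys, query, k):
    keys, key_scores, prefixes, ranges, prefix_scores = build_index(scored_keys)
    rev = [p[::-1] for p in prefixes]
    a = bisect.bisect_left(rev, query[::-1])                 # step S12
    b = bisect.bisect_left(rev, query[::-1] + "\uffff")
    queue, results, seen = [], [], set()

    def push_key(lo, hi):                                    # steps S14, S18, S21
        if lo < hi:
            m = argmax(key_scores, lo, hi)
            heapq.heappush(queue, (-key_scores[m], 0, m, lo, hi))

    def push_prefix(lo, hi):                                 # steps S13, S20
        if lo < hi:
            m = argmax(prefix_scores, lo, hi)
            heapq.heappush(queue, (-prefix_scores[m], 1, m, lo, hi))
            push_key(*ranges[m])       # best key under the newly found prefix

    push_prefix(a, b)
    while queue and len(results) < k:
        _, kind, m, lo, hi = heapq.heappop(queue)            # steps S15, S16
        if kind == 0:                                        # a key: output it (S17)
            if keys[m] not in seen:
                seen.add(keys[m])
                results.append((keys[m], key_scores[m]))
            push_key(lo, m)
            push_key(m + 1, hi)
        else:                                                # a prefix (S19)
            push_prefix(lo, m)
            push_prefix(m + 1, hi)
    return results

# Hypothetical toy dictionary.
TOY = {"aggressive": 12, "congress": 45, "congressmen": 13, "progress": 21}
```

- For the query "gres" and k = 3, `top_k_partial(TOY, "gres", 3)` outputs "congress", "progress", and "congressmen" in that order. Keeping keys and prefixes in one queue means the next candidate is always the entry with the highest score, regardless of its kind.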
- FIG. 7 is an explanatory diagram illustrating an example of processing for selecting a key having a large character string score.
- the list illustrated in the left frame of FIG. 7 is a list schematically showing the XBW structure, where a number represents a prefix score and a letter represents a prefix.
- the list illustrated in the right frame of FIG. 7 is a list schematically showing a trie tree, where a number represents a character string score and a character represents a key.
- The prefix set specifying unit 20 identifies “aggres”, “congres”, and “progres” from the set of prefixes represented by the XBW structure as candidates for the range that partially matches the character string “gres”. Once a prefix is specified, the keys that begin with that prefix can be specified.
- The prefix specifying unit 31 selects the prefix having the highest prefix score from among the identified prefixes ending with the input character string “gres”.
- FIG. 7 shows a state in which the selected prefixes are arranged in descending order of prefix score.
- the prefix score of “congres” is 45, which is the largest. Therefore, the prefix specifying unit 31 specifies “congres” as a prefix.
- the character string specifying unit 32 selects a key having the largest character string score from keys starting with the selected prefix.
- a key having the largest character string score is “congress”. Therefore, the character string specifying unit 32 specifies “congress” as the first key, and the search management unit 30 specifies the specified “congress” as a search target key.
- The search management unit 30 includes a priority queue (not shown) for holding candidate information, and keeps the prefixes and keys found so far in the queue together with their scores.
- the search management unit 30 refers to the queue and selects the one with the highest score from the prefixes and keys held in the queue.
- If the selected one is a key, the character string specifying unit 32 searches for the key having the next highest character string score within the same key range as that used when the key was found. If the selected one is a prefix, the prefix specifying unit 31 searches for the prefix having the next highest prefix score within the same prefix range as that used when the prefix was found.
- the prefix “congres” of the prefix score 45 and the key “congress” of the character string score 45 are held in the queue.
- the search management unit 30 first pops the key “congress” and removes it from the queue.
- the character string specifying unit 32 searches for a key having the next highest character string score after the key “congress” among keys starting with the same prefix “congres” as when the key “congress” was obtained.
- Specifically, the search management unit 30 bisects the key range that was searched when the key “congress” was obtained, this time excluding “congress”, and the character string specifying unit 32 finds the key with the largest character string score in each of the two ranges.
- When searching for a prefix, the search management unit 30 first pops the prefix “congres” and removes it from the queue. Then, the prefix specifying unit 31 searches for the prefix having the next highest prefix score after the prefix “congres”. Specifically, the search management unit 30 bisects the prefix range that was searched when the prefix “congres” was obtained, excluding “congres”, and the prefix specifying unit 31 finds the prefix with the highest prefix score in each of the two ranges. At this time, the prefixes with the largest prefix scores in the two ranges are the prefix “aggres” with prefix score 12 and the prefix “progres” with prefix score 21, respectively. Therefore, the search management unit 30 newly holds these two prefixes in the queue.
- the character string specifying unit 32 acquires the key having the maximum character string score starting with each prefix for the prefix “aggres” and the prefix “progres”. Thereby, the key “aggressive” of the character string score 12 and the key “progress” of the character string score 21 are obtained. Thereby, it can be confirmed that the prefix score of the prefix “aggres” is 12 and the prefix score of the prefix “progres” is 21 of the two prefixes acquired earlier.
- only the RMQ structure is retained, and the prefix scores themselves are not retained.
- the RMQ structure alone can find which prefix has the largest prefix score, but cannot give the value of that score. Therefore, after obtaining the prefix with the largest prefix score in the range, the largest character string score among the keys starting with that prefix must be obtained in order to determine the specific value of the prefix score.
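- As an illustration of how a prefix score is defined, the following Python sketch computes it directly as the largest character string score among the keys starting with a prefix. The names `KEY_SCORES` and `prefix_score` are hypothetical, the scores are those of the running example, and a linear scan stands in for the RMQ query followed by the key lookup.

```python
# Hypothetical dictionary: key -> character string score (values from the running example).
KEY_SCORES = {"congress": 45, "congressmen": 13, "aggressive": 12, "progress": 21}


def prefix_score(prefix, key_scores=KEY_SCORES):
    """Return the largest character string score among keys starting with `prefix`.

    Returns None when no key starts with the prefix.
    """
    matching = [score for key, score in key_scores.items() if key.startswith(prefix)]
    return max(matching) if matching else None
```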
- the queue then holds the prefix “progres” with prefix score 21, the prefix “aggres” with prefix score 12, the key “progress” with character string score 21, the key “congressmen” with character string score 13, and the key “aggressive” with character string score 12.
- the highest score belongs to the prefix “progres” (prefix score 21) and the key “progress” (character string score 21), so it is sufficient to continue the search from either of them.
- when the score of a newly found prefix is lower than the kth largest character string score found so far, the search management unit 30 does not register the prefix in the queue. This is because any prefix found later from that range has an even smaller prefix score. Similarly, when the score of a newly found key is lower than the kth largest character string score found so far, the search management unit 30 does not register the key in the queue. This avoids searching for prefixes with small prefix scores or keys with small character string scores, so the top k keys can be collected efficiently.
- the search ends when the queue is empty or the maximum score in the queue falls below the kth largest string score found so far.
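- The queue-driven search walked through above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the patented implementation: `top_k_substring` is a hypothetical name, prefixes are sorted from the front rather than from the end, a linear `max` scan stands in for both RMQ structures, the `seen` set is an added guard against reporting a key twice, and the loop simply stops after k results instead of pruning by the kth largest score.

```python
import heapq
from bisect import bisect_left


def top_k_substring(keys, scores, query, k):
    """keys: lexicographically sorted list; scores[i] is the score of keys[i]."""
    # Prefixes ending with the query, each with the contiguous range of keys it starts.
    prefixes = sorted({key[:n] for key in keys for n in range(1, len(key) + 1)
                       if key[:n].endswith(query)})
    ranges = []
    for p in prefixes:
        lo = bisect_left(keys, p)
        hi = lo
        while hi + 1 < len(keys) and keys[hi + 1].startswith(p):
            hi += 1
        ranges.append((lo, hi))

    def best_key(a, b):  # naive stand-in for the second RMQ structure
        return max(range(a, b + 1), key=lambda i: scores[i])

    def prefix_score(j):  # prefix score = largest key score under the prefix
        lo, hi = ranges[j]
        return scores[best_key(lo, hi)]

    heap = []  # entries: (-score, kind, index, range_lo, range_hi); kind 0 = key, 1 = prefix

    def push_key(a, b):
        if a <= b:
            i = best_key(a, b)
            heapq.heappush(heap, (-scores[i], 0, i, a, b))

    def push_prefix(a, b):
        if a <= b:
            j = max(range(a, b + 1), key=prefix_score)  # stand-in for the first RMQ
            heapq.heappush(heap, (-prefix_score(j), 1, j, a, b))
            push_key(*ranges[j])  # also queue the best key under that prefix

    results, seen = [], set()
    push_prefix(0, len(prefixes) - 1)
    while heap and len(results) < k:
        _, kind, idx, a, b = heapq.heappop(heap)
        if kind == 0:
            if idx not in seen:  # a key may start with several matching prefixes
                seen.add(idx)
                results.append((keys[idx], scores[idx]))
            push_key(a, idx - 1)   # bisect the key range around the popped key
            push_key(idx + 1, b)
        else:
            push_prefix(a, idx - 1)  # bisect the prefix range around the popped prefix
            push_prefix(idx + 1, b)
    return results
```

- With the keys and scores of the running example and the query "gres", this sketch passes through the same queue state as the walkthrough (two prefixes and three keys) after “congress” and “congres” have been popped.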
- the prefix set specifying unit 20 specifies, from the set of prefixes, the set of prefixes ending with the input character string, and the prefix specifying unit 31 identifies the prefix with the highest prefix score from the set of prefixes ending with the input character string. The character string specifying unit 32 then specifies the key with the largest character string score from the keys that begin with the specified prefix.
- the dictionary size can be reduced as compared with the case where the index is created for all the partial character strings.
- the prefix specifying unit 31 specifies a prefix having a large prefix score, and the character string specifying unit 32 searches for a key having a large character string score among the keys starting with that prefix. Because the search proceeds from the largest scores downward, the top k keys can be found efficiently. Therefore, a partial match search for character strings can be performed at high speed while reducing the amount of data.
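- The principle behind this reduction (a key contains the input character string exactly when some prefix of the key ends with it) can be checked with a naive Python baseline. The function name `substring_top_k_naive` is hypothetical, and the code enumerates all prefixes explicitly instead of using XBW and RMQ structures, so it shows only the logic, not the space or time behavior.

```python
def substring_top_k_naive(key_scores, query, k):
    """key_scores: dict mapping key -> character string score."""
    # Every prefix of every key (one or more leading characters).
    prefixes = {key[:n] for key in key_scores for n in range(1, len(key) + 1)}
    # Prefixes that end with the input character string.
    matching = [p for p in prefixes if p.endswith(query)]
    # Keys that start with at least one matching prefix.
    hits = {key for p in matching for key in key_scores if key.startswith(p)}
    # The hits are exactly the keys containing the query as a substring.
    assert hits == {key for key in key_scores if query in key}
    ranked = sorted(hits, key=lambda key: key_scores[key], reverse=True)
    return [(key, key_scores[key]) for key in ranked[:k]]
```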
- the trie tree is used as a data structure capable of collecting common prefixes, so that the data size can be reduced.
- the case where the keys are represented using the data structure of a trie tree has been illustrated, but the data structure may instead be a Patricia tree. Using a Patricia tree can reduce the data size further than the tree structure of a trie tree.
- the character string search device of the present embodiment includes a search management unit 30 that manages the search range.
- the search management unit 30 specifies, from the keys starting with the prefix of the key specified by the character string specifying unit 32, a range of keys excluding the already specified keys, and specifies, from the set of prefixes specified by the prefix set specifying unit 20, a range of prefixes excluding the prefix specified by the prefix specifying unit 31. The prefix specifying unit 31 specifies the prefix having the largest prefix score from the prefix range specified by the search management unit 30, and the character string specifying unit 32 identifies the key with the highest character string score from the key range specified by the search management unit 30.
- XBW used as a data structure for a dictionary can be expanded to a Top-k search, and therefore, space saving and high-speed processing can be realized when a partial match search is performed for the top k candidates.
- Embodiment 2. Next, a second embodiment of the character string search device according to the present invention will be described.
- the configuration of the character string search device of this embodiment is the same as that of the first embodiment.
- the character string search device according to the second embodiment can reduce the amount of data to be retained more than the character string search device according to the first embodiment.
- Two data structures are generated from the trie tree T described in the first embodiment.
- One is a data structure related to a prefix, and the other is a data structure related to a key.
- the data structure related to the prefix includes xbw which is the XBW representation of the trie tree T and the accompanying first RMQ structure. As described in the first embodiment, on xbw, the prefixes are arranged in the order sorted from the end.
- a first RMQ structure is generated for the prefix score string R p shown in the first embodiment.
- the search information storage unit 50 need not explicitly hold the prefix score string R p ; it may hold only the first RMQ structure calculated from the prefix score string R p .
- the data structure relating to the key includes a Patricia tree T c generated from the trie tree T, a second RMQ structure, and a character string score string R k .
- The tree structure of the Patricia tree T c is represented by DFUDS. Further, for the Patricia tree T c , a bit string with as many bits as there are nodes is prepared in order to identify the leaf nodes of the tree structure.
- a general Patricia tree holds a character string corresponding to each node.
- the search information storage unit 50 of this embodiment removes the character string corresponding to each node and stores only the tree structure representing the parent-child relationship between the nodes. The reason for storing only such a tree structure will be described later.
- each key is sorted in lexicographic order from the first character, and each key is assigned a key ID in that order.
- Each prefix is also sorted in lexicographic order from the end, and each prefix is assigned a prefix ID in that order. Further, the range of the prefix ID is written as [s p , e p ], and the range on the set S representing the prefix is written as [s s , e s ].
- the prefix set specifying unit 20 specifies the range [s p , e p ] of prefix IDs of the prefixes ending with the input character string. Specifically, the prefix set specifying unit 20 uses the XBW to identify the range [s s , e s ] of prefixes that end with the input character string. However, since this range [s s , e s ] is a range on the set S, it must be converted to a prefix ID range [s p , e p ]. The prefix set specifying unit 20 therefore specifies [s p , e p ] by identifying, on S last , the ordinal numbers of the first 1 and the last 1 included in [s s , e s ]. This works because the elements that are 1 on S last correspond one-to-one, in the same order, to the prefix IDs.
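- The conversion from a range on S to a prefix ID range can be sketched in Python as a rank computation over the bits of S last. This is a minimal sketch under assumptions: `to_prefix_id_range` is a hypothetical name, S last is given as a plain list of 0/1 values, positions and prefix IDs are 0-based, and rank is computed by summation rather than by a succinct rank structure.

```python
def to_prefix_id_range(s_last, s_s, e_s):
    """Convert the inclusive range [s_s, e_s] on S into a prefix ID range.

    Returns (s_p, e_p), or None when the range contains no 1 bit.
    """
    ones = [i for i in range(s_s, e_s + 1) if s_last[i] == 1]
    if not ones:
        return None

    def rank1(i):
        # Number of 1 bits in s_last[0..i], inclusive.
        return sum(s_last[: i + 1])

    # The n-th 1 bit (counting from 1) corresponds to prefix ID n - 1.
    return rank1(ones[0]) - 1, rank1(ones[-1]) - 1
```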
- the prefix specifying unit 31 specifies the prefix having the maximum prefix score from the range [s p , e p ] of the specified prefix IDs. Specifically, the prefix specifying unit 31 uses the first RMQ structure to specify the position of the prefix having the maximum prefix score within the range [s p , e p ]. The position of the prefix identified here is denoted i p .
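- A range maximum query that returns the position of the maximum can be illustrated with a sparse table. Note that this is a swapped-in textbook technique, not the RMQ structure of this embodiment: the class name `SparseTableRMQ` is hypothetical, and unlike the structure described here it keeps the score array itself and uses O(n log n) words of space.

```python
class SparseTableRMQ:
    """Answers argmax queries over values[lo..hi] (inclusive) in O(1) time."""

    def __init__(self, values):
        self.values = values
        n = len(values)
        # table[j][i] = position of the maximum in values[i .. i + 2**j - 1]
        self.table = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev = self.table[-1]
            half = 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                a, b = prev[i], prev[i + half]
                row.append(a if values[a] >= values[b] else b)
            self.table.append(row)
            j += 1

    def argmax(self, lo, hi):
        # Cover [lo, hi] with two overlapping blocks of length 2**j.
        j = (hi - lo + 1).bit_length() - 1
        a = self.table[j][lo]
        b = self.table[j][hi - (1 << j) + 1]
        return a if self.values[a] >= self.values[b] else b
```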
- From the position i p of the identified prefix, the search management unit 30 identifies the range of keys that start with that prefix.
- a range of keys starting with the specified prefix is denoted as [s k , e k ].
- the search management unit 30 first identifies, as the corresponding position i s on S, the last node having the prefix that corresponds to the prefix position i p .
- the search management unit 30 restores the character string representing this prefix from xbw. Specifically, the search management unit 30 restores the character string by concatenating the characters encountered while following parents from the node represented by the i s -th row of the XBW. The number of moves from a node to its parent is equal to the prefix length.
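- Restoring a prefix by following parents can be sketched in Python with plain arrays. The arrays `parent` and `label` and the function `restore_prefix` are hypothetical stand-ins; the embodiment navigates the XBW representation instead of explicit parent pointers.

```python
def restore_prefix(parent, label, node):
    """Follow parent links from `node` to the root and rebuild the prefix.

    parent[v] is the parent of node v (None for the root) and label[v] is the
    character on the edge into v. Returns (prefix, number_of_moves).
    """
    chars = []
    moves = 0
    while parent[node] is not None:
        chars.append(label[node])
        node = parent[node]
        moves += 1
    # Characters were collected from the end of the prefix toward its start.
    return "".join(reversed(chars)), moves
```

- The returned number of moves equals the length of the restored prefix, matching the statement above.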
- the search management unit 30 moves the target position from the parent node to the child node according to the order stored in the array d in the Patricia tree Tc . However, when the corresponding value of the array d is 1, the search management unit 30 ignores the value and performs processing for the next value.
- Using DFUDS, the search management unit 30 specifies the range [s k , e k ] of keys corresponding to the descendants of the node u c that has been reached. Since the keys included in [s k , e k ] are all descendants of u c , [s k , e k ] indicates the range of keys starting with the specified prefix.
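- The mapping from a node to the key range of its descendants can be sketched in Python as follows. This is an illustration under assumptions: `leaf_ranges` is a hypothetical name, the tree is given as an explicit child-list dictionary rather than a DFUDS bit sequence, and leaves are numbered from 0 in depth-first order so that leaf numbers play the role of key IDs.

```python
def leaf_ranges(children, root=0):
    """Map every node to the inclusive range (s_k, e_k) of leaf numbers below it.

    children: dict mapping a node to its ordered list of child nodes;
    nodes absent from the dict (or with an empty list) are leaves.
    """
    ranges = {}
    next_leaf = 0

    def visit(node):
        nonlocal next_leaf
        kids = children.get(node, [])
        if not kids:
            ranges[node] = (next_leaf, next_leaf)
            next_leaf += 1
            return
        first = next_leaf
        for kid in kids:
            visit(kid)
        ranges[node] = (first, next_leaf - 1)

    visit(root)
    return ranges
```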
- the character string specifying unit 32 specifies the key ID (hereinafter referred to as i k ) having the maximum character string score from the specified key range [s k , e k ]. Specifically, the character string specifying unit 32 uses the second RMQ structure to specify the position i k of the key having the maximum character string score within the range [s k , e k ].
- From the position i k of the specified key ID, the character string specifying unit 32 identifies the character string of the key. i k corresponds to the i k -th leaf node u i on the Patricia tree T c . Therefore, the character string specifying unit 32 traces from u i up through the parent nodes of the Patricia tree T c and stores the child numbers in the array d in the reverse of the order in which they were traced toward the root.
- the character string specifying unit 32 can specify the position of the node on xbw corresponding to u i by tracing xbw sequentially from the root according to this array d.
- the character string specifying unit 32 can accurately restore the key by following a single chain.
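- The step from a key ID to the array d can be sketched in Python as a select on the leaf bit string followed by an upward walk. All names (`path_to_leaf`, `parent`, `child_rank`, `is_leaf`) are hypothetical, nodes are assumed to be numbered in preorder with 0-based key IDs and child numbers, and plain lists replace the DFUDS operations.

```python
def path_to_leaf(parent, child_rank, is_leaf, i_k):
    """Return (leaf_node, d) for the i_k-th leaf (0-based).

    parent[v]: parent of node v (None for the root).
    child_rank[v]: position of v among its siblings (None for the root).
    is_leaf[v]: 1 if v is a leaf, else 0 (the leaf bit string).
    d lists the child numbers to follow from the root down to the leaf.
    """
    # select: node index of the i_k-th set bit in the leaf bit string
    leaf = [v for v, bit in enumerate(is_leaf) if bit][i_k]
    d = []
    node = leaf
    while parent[node] is not None:
        d.append(child_rank[node])  # recorded while moving toward the root
        node = parent[node]
    d.reverse()  # store in root-to-leaf order
    return leaf, d
```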
- the key with the second highest score is either the second-ranked key with the same prefix or the top-ranked key with another prefix.
- the data size when the data structure described in this embodiment is used will now be described. Given a trie tree T and a score array R k , the number of nodes t is greater than the number of keys l. In general, the number of nodes t is about 10 times the number of keys l.
- the data size is expressed by the following equation 2.
- T c (Patricia tree) is generated from the trie tree T, and the tree structure is expressed in DFUDS.
- a bit string with as many bits as there are nodes is prepared, and each bit of this bit string is used to identify whether or not the corresponding node is a leaf node.
- the Patricia tree of this embodiment is represented only by a tree structure from which character strings are removed. This is because character string information is obtained from xbw as described above.
- a second RMQ structure (key) is generated for the key character string score array R k (score).
- is represented by 2l + o (l) bits.
- the calculation amount is calculated by O (k (log (k) +
- indicates the length of the query
- h indicates the average length of the keys registered in the dictionary.
- FIG. 8 is a block diagram showing an outline of a character string search apparatus according to the present invention.
- a character string search device includes a search candidate character string including an input character string from a set of search candidate character strings (for example, keys) associated with a character string score indicating a degree to be preferentially searched.
- the prefix score defined by the largest character string score among the character string scores associated with the search candidate character string starting with the prefix is the largest.
- the prefix specifying unit 82 specifies a prefix having a large prefix score, and the character string specifying unit 83 searches for a search candidate character string having a large character string score among those starting with that prefix. Because the search proceeds from the largest scores downward, the top k search candidate character strings can be found efficiently.
- the character string search device may include a search management unit (for example, the search management unit 30) that manages the search range.
- the search management unit may specify, from the search candidate character strings starting with the prefix of the search candidate character string specified by the character string specifying unit 83, a range of search candidate character strings excluding the already specified search candidate character strings, and may specify, from the set of prefixes specified by the prefix set specifying unit 81, a range of prefixes excluding the prefix specified by the prefix specifying unit 82.
- the prefix specifying unit 82 may specify the prefix having the maximum prefix score from the prefix range specified by the search management unit, and the character string specifying unit 83 may specify the search candidate character string having the maximum character string score from the search candidate character string range specified by the search management unit.
- the search management unit may include a queue (for example, a priority queue) that holds pairs of a prefix and a prefix score specified by the prefix specifying unit 82 and pairs of a search target character string and a character string score specified by the character string specifying unit 83. The search management unit then identifies, from the pairs held in the queue, the prefix or search target character string having the maximum score among the prefix scores and character string scores. If the maximum score is a character string score, the search target character string with that score may be excluded from the queue and specified as an output target; if the maximum score is a prefix score, the prefix with that score may be excluded from the queue.
- if the maximum score is a prefix score, the prefix specifying unit 82 may specify the prefix with the next largest prefix score after the prefix excluded from the queue. If the maximum score is a character string score, the character string specifying unit 83 may specify, among the search target character strings starting with the same prefix that was used to specify the search target character string excluded from the queue, the one with the next largest character string score after the excluded search target character string; if the maximum score is a prefix score, it may specify the search target character string having the largest character string score among the search target character strings starting with the prefix specified by the prefix specifying unit 82.
- the prefix score or character that is retained in the queue by retaining both the prefix-prefix score pair and both the search target string-string score pair pair in one queue.
- the prefix specifying unit 82 and the character string specifying unit 83 repeat the above-described processing, whereby the search target character string having the higher character string score can be efficiently specified.
- the character string search device may include a search information storage unit (for example, the search information storage unit 50) that stores a set of prefixes (for example, xbw) generated from the set of search candidate character strings represented by a trie tree data structure and having an XBW data structure, and a Patricia tree generated from the trie tree data structure, the Patricia tree having only a tree structure that represents the parent-child relationships between nodes, with the character string corresponding to each node removed.
- the prefix specifying unit 82 may specify the position of the prefix having the largest prefix score from the set of prefixes having the XBW data structure, and the search management unit may specify, from the position of the specified prefix, the position of the corresponding node in the Patricia tree (for example, u c ). With such a configuration, the amount of stored data used for the search can be reduced.
- the character string specifying unit 83 may specify the position (for example, u i ) of the search candidate character string having the maximum character string score from among the search candidate character strings existing under the position of the node specified by the search management unit, and may specify the search candidate character string corresponding to the specified position from the set of prefixes having the XBW data structure.
- the prefix specifying unit 82 may specify the prefix with the largest prefix score by using the first RMQ structure, based on the relationship between prefixes and prefix scores that it represents, to perform a range search on the specified set of prefixes.
- the character string specifying unit 83 may specify the search candidate character string having the maximum character string score by using the second RMQ structure, based on the relationship between search candidate character strings and character string scores that it represents, to perform a range search on the search candidate character strings starting with the specified prefix.
- the present invention is preferably applied to a character string search device that searches for a key that partially matches an input character string.
- the character string search device according to the present invention can be used, for example, when providing a search service.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
Description
FIG. 1 is a block diagram showing a configuration example of a first embodiment of a character string search device according to the present invention. The character string search device according to the present embodiment includes an input unit 10, a prefix set specifying unit 20, a search management unit 30, a prefix specifying unit 31, a character string specifying unit 32, an output unit 40, and a search information storage unit 50.
Reference 1: Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini and S. Muthukrishnan, "Structuring labeled trees for optimal succinctness, and beyond", FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, Pages 184-196
Reference 2: Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter, "Faster compressed dictionary matching", SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval, Pages 191-200
The first XBW is described in
Next, a second embodiment of the character string search device according to the present invention will be described. The configuration of the character string search device of this embodiment is the same as that of the first embodiment. However, the character string search device according to the second embodiment can reduce the amount of data to be retained more than the character string search device according to the first embodiment.
$$|\mathrm{XBW}| + |\text{first RMQ structure (prefix)}| + |T_c \text{ (Patricia tree)}| + |\text{second RMQ structure (key)}| + |R_k \text{ (score)}| \qquad \text{(Equation 2)}$$
20 Prefix set specifying unit
30 Search management unit
31 Prefix specifying unit
32 Character string specifying unit
40 Output unit
50 Search information storage unit
Claims (11)
- A character string search device for searching for a search candidate character string including an input character string from a set of search candidate character strings each associated with a character string score indicating a degree to be preferentially searched, the character string search device comprising:
a prefix set specifying unit that specifies a set of prefixes ending with the input character string from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string;
a prefix specifying unit that specifies, from the set of prefixes ending with the input character string, the prefix having the maximum prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix; and
a character string specifying unit that specifies the search candidate character string having the maximum character string score from the search candidate character strings starting with the specified prefix.
- The character string search device according to claim 1, further comprising a search management unit that manages a search range, wherein
the search management unit specifies, from the search candidate character strings starting with the prefix of the search candidate character string specified by the character string specifying unit, a range of search candidate character strings excluding the already specified search candidate character strings, and specifies, from the set of prefixes specified by the prefix set specifying unit, a range of prefixes excluding the prefix specified by the prefix specifying unit,
the prefix specifying unit specifies the prefix having the maximum prefix score from the range of prefixes specified by the search management unit, and
the character string specifying unit specifies the search candidate character string having the maximum character string score from the range of search candidate character strings specified by the search management unit.
- The character string search device according to claim 2, wherein
the search management unit includes a queue that holds pairs of a prefix and a prefix score specified by the prefix specifying unit and pairs of a search target character string and a character string score specified by the character string specifying unit,
the search management unit identifies, from the pairs held in the queue, the prefix or search target character string having the maximum score among the prefix scores and character string scores, excludes the search target character string having that character string score from the queue and specifies it as an output target when the maximum score is a character string score, and excludes the prefix having that prefix score from the queue when the maximum score is a prefix score,
the prefix specifying unit specifies, when the maximum score is a prefix score, the prefix having the next largest prefix score after the prefix excluded from the queue, and
the character string specifying unit specifies, when the maximum score is a character string score, the search target character string having the next largest character string score after the excluded search target character string among the search target character strings starting with the same prefix as the prefix used to specify the search target character string excluded from the queue, and specifies, when the maximum score is a prefix score, the search target character string having the largest character string score from among the search target character strings starting with the prefix specified by the prefix specifying unit.
- The character string search device according to any one of claims 1 to 3, further comprising a search information storage unit that stores a set of prefixes that is generated from the set of search candidate character strings represented by a trie tree data structure and has an XBW data structure, and a Patricia tree that is generated from the trie tree data structure and has only a tree structure representing the parent-child relationships between nodes, the character string corresponding to each node of the Patricia tree being excluded, wherein
the prefix specifying unit specifies the position of the prefix having the maximum prefix score from the set of prefixes having the XBW data structure, and
the search management unit specifies, from the position of the specified prefix, the position of the corresponding node in the Patricia tree.
- The character string search device according to claim 4, wherein the character string specifying unit specifies the position of the search candidate character string having the maximum character string score from among the search candidate character strings existing under the position of the node specified by the search management unit, and specifies the search candidate character string corresponding to the specified position from the set of prefixes having the XBW data structure.
- The character string search device according to any one of claims 1 to 5, wherein the prefix specifying unit specifies the prefix having the maximum prefix score by using a first RMQ structure, based on the relationship between prefixes and prefix scores represented by the first RMQ structure, to perform a range search on the specified set of prefixes.
- The character string search device according to any one of claims 1 to 6, wherein the character string specifying unit specifies the search candidate character string having the maximum character string score by using a second RMQ structure, based on the relationship between search candidate character strings and character string scores represented by the second RMQ structure, to perform a range search on the search candidate character strings starting with the specified prefix.
- A character string search method for searching for a search candidate character string including an input character string from a set of search candidate character strings each associated with a character string score indicating a degree to be preferentially searched, the character string search method comprising:
a prefix set specifying step of specifying a set of prefixes ending with the input character string from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string;
a prefix specifying step of specifying, from the set of prefixes ending with the input character string, the prefix having the maximum prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix; and
a character string specifying step of specifying the search candidate character string having the maximum character string score from the search candidate character strings starting with the specified prefix.
- The character string search method according to claim 8, further comprising a search management step of managing a search range, wherein
in the search management step, a range of search candidate character strings excluding the already specified search candidate character strings is specified from the search candidate character strings starting with the prefix of the search candidate character string specified in the character string specifying step, and a range of prefixes excluding the prefix specified in the prefix specifying step is specified from the set of prefixes specified in the prefix set specifying step,
in the prefix specifying step, the prefix having the maximum prefix score is specified from the range of prefixes specified in the search management step, and
in the character string specifying step, the search candidate character string having the maximum character string score is specified from the range of search candidate character strings specified in the search management step.
- A character string search program applied to a computer that searches for a search candidate character string including an input character string from a set of search candidate character strings each associated with a character string score indicating a degree to be preferentially searched, the character string search program causing the computer to execute:
a prefix set specifying process of specifying a set of prefixes ending with the input character string from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string;
a prefix specifying process of specifying, from the set of prefixes ending with the input character string, the prefix having the maximum prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix; and
a character string specifying process of specifying the search candidate character string having the maximum character string score from the search candidate character strings starting with the specified prefix.
- The character string search program according to claim 10, causing the computer to execute a search management process of managing a search range, wherein
in the search management process, a range of search candidate character strings excluding the already specified search candidate character strings is specified from the search candidate character strings starting with the prefix of the search candidate character string specified in the character string specifying process, and a range of prefixes excluding the prefix specified in the prefix specifying process is specified from the set of prefixes specified in the prefix set specifying process,
in the prefix specifying process, the prefix having the maximum prefix score is specified from the range of prefixes specified in the search management process, and
in the character string specifying process, the search candidate character string having the maximum character string score is specified from the range of search candidate character strings specified in the search management process.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP14838200.5A EP3037986A4 (en) | 2013-08-21 | 2014-07-18 | Text character string search device, text character string search method, and text character string search program |
JP2015532688A JP6072922B2 (en) | 2013-08-21 | 2014-07-18 | Character string search device, character string search method, and character string search program |
CN201480046496.4A CN105474214A (en) | 2013-08-21 | 2014-07-18 | Text character string search device, text character string search method, and text character string search program |
US14/909,793 US20160196303A1 (en) | 2013-08-21 | 2014-07-18 | String search device, string search method, and string search program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2013-171291 | 2013-08-21 | ||
JP2013171291 | 2013-08-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015025467A1 (en) | 2015-02-26 |
Family
ID=52483264
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/003817 WO2015025467A1 (en) | 2013-08-21 | 2014-07-18 | Text character string search device, text character string search method, and text character string search program |
Country Status (5)
Country | Link |
---|---|
US (1) | US20160196303A1 (en) |
EP (1) | EP3037986A4 (en) |
JP (1) | JP6072922B2 (en) |
CN (1) | CN105474214A (en) |
WO (1) | WO2015025467A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222238B (en) * | 2019-04-30 | 2022-02-25 | 上海交通大学 | Query method and system for bidirectional mapping of character string and identifier |
US11860884B2 (en) * | 2021-03-30 | 2024-01-02 | Snap Inc. | Search query modification database |
CN114065733B (en) * | 2021-10-18 | 2024-07-26 | 浙江香侬慧语科技有限责任公司 | Dependency syntax analysis method, device and medium based on machine reading understanding |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008090606A1 (en) * | 2007-01-24 | 2008-07-31 | Fujitsu Limited | Information search program, recording medium containing the program, information search device, and information search method |
WO2011104754A1 (en) * | 2010-02-24 | 2011-09-01 | 三菱電機株式会社 | Search device and search program |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7941310B2 (en) * | 2003-09-09 | 2011-05-10 | International Business Machines Corporation | System and method for determining affixes of words |
US8156156B2 (en) * | 2006-04-06 | 2012-04-10 | Universita Di Pisa | Method of structuring and compressing labeled trees of arbitrary degree and shape |
CN102084363B (en) * | 2008-07-03 | 2014-11-12 | 加利福尼亚大学董事会 | A method for efficiently supporting interactive, fuzzy search on structured data |
CN101916263B (en) * | 2010-07-27 | 2012-10-31 | 武汉大学 | Fuzzy keyword query method and system based on weighing edit distance |
US8930391B2 (en) * | 2010-12-29 | 2015-01-06 | Microsoft Corporation | Progressive spatial searching using augmented structures |
2014
- 2014-07-18: CN application CN201480046496.4A (CN105474214A), status: Pending
- 2014-07-18: EP application EP14838200.5A (EP3037986A4), status: Ceased
- 2014-07-18: WO application PCT/JP2014/003817 (WO2015025467A1), status: Application Filing
- 2014-07-18: US application US14/909,793 (US20160196303A1), status: Abandoned
- 2014-07-18: JP application JP2015532688A (JP6072922B2), status: Active
Non-Patent Citations (7)
Title |
---|
BO-JUNE (PAUL) HSU ET AL.: "Space-efficient data structures for Top-k completion", WWW '13 PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 13 May 2013 (2013-05-13), pages 583 - 593, XP058019880, Retrieved from the Internet <URL:http://dl.acm.org/citation.cfm?id=2488440> [retrieved on 20141007] * |
BO-JUNE (PAUL) HSU; GIUSEPPE OTTAVIANO: "Space-Efficient Data Structures for Top-k Completion", WWW '13 PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, May 2013 (2013-05-01), pages 583 - 594, XP058019880, DOI: 10.1145/2488388.2488440 |
PAOLO FERRAGINA ET AL.: "Structuring labeled trees for optimal succinctness, and beyond", FOCS '05 PROCEEDINGS OF THE 46TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, 2005, pages 184 - 193, XP010851712, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1530713> [retrieved on 20141007] * |
PAOLO FERRAGINA; FABRIZIO LUCCIO; GIOVANNI MANZINI; S. MUTHUKRISHNAN: "Structuring labeled trees for optimal succinctness, and beyond", FOCS '05 PROCEEDINGS OF THE 46TH ANNUAL IEEE SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, pages 184 - 196 |
See also references of EP3037986A4 |
WING-KAI HON; RAHUL SHAH; SHARMA V. THANKACHAN: "Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval", CPM '12 PROCEEDINGS OF THE 23RD ANNUAL CONFERENCE ON COMBINATORIAL PATTERN MATCHING, pages 173 - 184 |
WING-KAI HON; TSUNG-HAN KU; RAHUL SHAH; SHARMA V. THANKACHAN; JEFFREY SCOTT VITTER: "Faster compressed dictionary matching", SPIRE '10 PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON STRING PROCESSING AND INFORMATION RETRIEVAL, pages 191 - 200 |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9892789B1 (en) | 2017-01-16 | 2018-02-13 | International Business Machines Corporation | Content addressable memory with match hit quality indication |
US10347337B2 (en) | 2017-01-16 | 2019-07-09 | International Business Machines Corporation | Content addressable memory with match hit quality indication |
US10614884B2 (en) | 2017-01-16 | 2020-04-07 | International Business Machines Corporation | Content addressable memory with match hit quality indication |
JP2020098583A (en) * | 2017-03-15 | 2020-06-25 | センシェア アーゲー | Efficient use of trie data structure in databases |
US11275740B2 (en) | 2017-03-15 | 2022-03-15 | Censhare Gmbh | Efficient use of trie data structure in databases |
US11347741B2 (en) | 2017-03-15 | 2022-05-31 | Censhare Gmbh | Efficient use of TRIE data structure in databases |
JP7198192B2 (en) | 2017-03-15 | 2022-12-28 | センシェア ゲーエムベーハー | Efficient Use of Trie Data Structures in Databases |
US11899667B2 (en) | 2017-03-15 | 2024-02-13 | Censhare Gmbh | Efficient use of trie data structure in databases |
Also Published As
Publication number | Publication date |
---|---|
CN105474214A (en) | 2016-04-06 |
JP6072922B2 (en) | 2017-02-01 |
EP3037986A4 (en) | 2017-01-04 |
US20160196303A1 (en) | 2016-07-07 |
JPWO2015025467A1 (en) | 2017-03-02 |
EP3037986A1 (en) | 2016-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7912818B2 (en) | Web graph compression through scalable pattern mining | |
JP6072922B2 (en) | Character string search device, character string search method, and character string search program | |
CN111159990B (en) | Method and system for identifying general special words based on pattern expansion | |
CN109902142B (en) | Character string fuzzy matching and query method based on edit distance | |
CN105843882A (en) | Information matching method and apparatus | |
US20050120017A1 (en) | Efficient retrieval of variable-length character string data | |
CN105426412A (en) | Multi-mode string matching method and device | |
CN104268176A (en) | Recommendation method and system based on search keyword | |
CN104021202B (en) | The entry processing unit and method of a kind of knowledge sharing platform | |
JP5980520B2 (en) | Method and apparatus for efficiently processing a query | |
CN105608201A (en) | Text matching method supporting multi-keyword expression | |
US9529835B2 (en) | Online compression for limited sequence length radix tree | |
CN113065419B (en) | Pattern matching algorithm and system based on flow high-frequency content | |
KR101089722B1 (en) | Method and apparatus for prefix tree based indexing, and recording medium thereof | |
CN111814009B (en) | Mode matching method based on search engine retrieval information | |
CN101576877A (en) | Fast word segmentation realization method | |
JP5190192B2 (en) | SEARCH DEVICE, SEARCH METHOD, AND PROGRAM | |
JP2000194713A (en) | Method and device for retrieving character string, and storage medium stored with character string retrieval program | |
JP2011138365A (en) | Term extraction device, method, and data structure of term dictionary | |
JP5628365B2 (en) | Search device | |
CN113641783B (en) | Content block retrieval method, device, equipment and medium based on key sentences | |
CN118193543B (en) | Method for searching node tree based on EDA, electronic equipment and storage medium | |
KR101080898B1 (en) | Method and apparatus for indexing character string | |
JP5237400B2 (en) | Search device | |
CN111061884A (en) | Method for constructing K12 education knowledge graph based on DeepDive technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | WWE | WIPO information: entry into national phase | Ref document number: 201480046496.4; Country of ref document: CN |
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 14838200; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2015532688; Country of ref document: JP; Kind code of ref document: A |
 | WWE | WIPO information: entry into national phase | Ref document number: 14909793; Country of ref document: US |
 | WWE | WIPO information: entry into national phase | Ref document number: 2014838200; Country of ref document: EP |
 | NENP | Non-entry into the national phase | Ref country code: DE |