WO2015025467A1

WO2015025467A1 - Text character string search device, text character string search method, and text character string search program

Info

Publication number: WO2015025467A1
Application number: PCT/JP2014/003817
Authority: WO
Inventors: 穣岡嶋; 康高山本
Original assignee: Ｎｅｃソリューションイノベータ株式会社
Priority date: 2013-08-21
Filing date: 2014-07-18
Publication date: 2015-02-26
Also published as: CN105474214A; JP6072922B2; EP3037986A4; US20160196303A1; JPWO2015025467A1; EP3037986A1

Abstract

A prefix set specification unit (81) specifies, from a set of prefixes which are text character strings of one or more continuous text characters which are extracted from lead text characters of each search candidate text character string, a set of prefixes which end with an inputted text character string. A prefix specification unit (82) specifies the prefix with the highest prefix score from the set of prefixes which end with the inputted text character string, a prefix score being defined for each prefix by the highest text character string score among text character string scores associated with search candidate text character strings beginning with the said prefix. A text character string specification unit (83) specifies, from among the search candidate text character strings which begin with the specified prefix, the search candidate text character string with the highest text character string score.

Description

Character string search device, character string search method, and character string search program

The present invention relates to a character string search device, a character string search method, and a character string search program for searching for a key that partially matches an input character string.

Methods of supporting human text input have become widespread and are now indispensable in daily life. Examples of such input support include displaying candidate search keywords in the input form of a search engine and displaying candidate URLs in the URL (Uniform Resource Locator) input form of a Web browser. Displaying conversion candidates during predictive conversion in an IME (Input Method Editor) and displaying correctly spelled candidates in a spell checker are further examples of input support.

Such input support is realized as a dictionary search. A character string that the user is likely to input is registered in advance in the dictionary as a key. When the user newly starts inputting a character string, the dictionary is searched using the character string input by the user as a search query, an appropriate key is acquired as an input candidate, and displayed on the screen. For example, in search keyword recommendation, search keywords previously input by the user are registered in the dictionary in advance and used as input candidates.

In actual situations, it is not necessary to enumerate all keys that are candidates. For example, in a scene where a search keyword is recommended, the top k items with the highest input frequency may be recommended as candidates. The problem of searching for the top k keys having a large score in this way is called a Top-k search (Top-k dictionary search).

Non-Patent Document 1 describes a data structure that obtains high-ranking keys at high speed from the keys that forward-match a query by using an RMQ Trie, that is, a trie combined with an RMQ (Range Minimum Query) structure.

FIG. 9 is an explanatory diagram showing the RMQ Trie. In the example shown in FIG. 9, the node v corresponding to the search query P is found, and the range [a, b] of keys under the node v is obtained. All keys included in [a, b] have the search query P as a prefix. The top k keys having the search query P as a prefix are then obtained by searching the range [a, b] of the score array R, whose entries are arranged in association with the keys.
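As a rough illustration of this forward-match Top-k search, the following Python sketch finds the range [a, b] of keys that have P as a prefix and then picks the k best scores in that range. It is only a sketch under stated assumptions: a sorted key list with binary search stands in for the trie, `heapq.nlargest` stands in for the RMQ structure, and the function names, the sample keys, and the score of "cac" are hypothetical.

```python
from bisect import bisect_left
import heapq

# Sample keys in lexicographic order with one score per key (array R).
# The keys and scores are illustrative; the score of "cac" is made up.
KEYS = ["aba", "abcc", "cab", "cac"]
R = [3, 9, 4, 5]

def key_range(sorted_keys, query):
    """Return the inclusive range [a, b] of keys that start with query."""
    a = bisect_left(sorted_keys, query)
    b = a
    while b < len(sorted_keys) and sorted_keys[b].startswith(query):
        b += 1
    return a, b - 1  # b < a means that no key matches

def top_k_forward_match(sorted_keys, scores, query, k):
    """Top-k keys having query as a prefix, best score first."""
    a, b = key_range(sorted_keys, query)
    best = heapq.nlargest(k, range(a, b + 1), key=lambda i: scores[i])
    return [(sorted_keys[i], scores[i]) for i in best]

print(top_k_forward_match(KEYS, R, "ca", 2))
print(top_k_forward_match(KEYS, R, "ab", 3))
```

The second call returns only "abcc" and "aba". The key "cab" also contains "ab", but a forward-match search cannot reach it.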

Non-Patent Document 1 also describes two other types of data structures that, like the RMQ Trie, are used to acquire high-ranking keys at high speed from the keys that forward-match a query.

Non-patent document 2 describes Top-k search in document search. This method realizes Top-k search by adding additional data necessary for Top-k search to the data structure based on the data structure for document search.

If the number of keys included in the dictionary becomes large, the number of keys corresponding to the input character string increases, and the search takes time. It is therefore necessary to obtain candidate keys at high speed.

By using the data structures described in Non-Patent Document 1, key candidates that forward-match the input can be acquired at high speed. However, it is difficult to acquire key candidates that partially match.

Further, by using the data structure described in Non-Patent Document 2, Top-k search in document search can be realized. However, since the data used for document search is large in size, there is a problem that when the search method used for document search is directly applied to a dictionary, the target data size becomes large.

Therefore, an object of the present invention is to provide a character string search device, a character string search method, and a character string search program that can perform partial match search of a character string at high speed while reducing the amount of data.

A character string search device according to the present invention searches a set of search candidate character strings, each associated with a character string score indicating the degree to which it should be preferentially searched, for a search candidate character string that includes an input character string. The device is characterized by comprising: a prefix set specifying unit that specifies a set of prefixes ending with the input character string from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string; a prefix specifying unit that specifies, from the set of prefixes ending with the input character string, the prefix with the largest prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings beginning with that prefix; and a character string specifying unit that specifies, from among the search candidate character strings beginning with the specified prefix, the search candidate character string with the largest character string score.

A character string search method according to the present invention searches a set of search candidate character strings, each associated with a character string score indicating the degree to which it should be preferentially searched, for a search candidate character string that includes an input character string. The method is characterized by comprising: a prefix set specifying step of specifying a set of prefixes ending with the input character string from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string; a prefix specifying step of specifying, from the set of prefixes ending with the input character string, the prefix with the largest prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings beginning with that prefix; and a character string specifying step of specifying, from among the search candidate character strings beginning with the specified prefix, the search candidate character string with the largest character string score.

A character string search program according to the present invention is applied to a computer that searches a set of search candidate character strings, each associated with a character string score indicating the degree to which it should be preferentially searched, for a search candidate character string that includes an input character string. The program is characterized by causing the computer to execute: a prefix set specifying process of specifying a set of prefixes ending with the input character string from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string; a prefix specifying process of specifying, from the set of prefixes ending with the input character string, the prefix with the largest prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings beginning with that prefix; and a character string specifying process of specifying, from among the search candidate character strings beginning with the specified prefix, the search candidate character string with the largest character string score.

According to the present invention, a partial match search of a character string can be performed at high speed while reducing the amount of data.

FIG. 1 is a block diagram showing a configuration example of the first embodiment of the character string search device according to the present invention. FIG. 2 is an explanatory diagram showing an example of a trie tree corresponding to keys. FIG. 3 is an explanatory diagram showing an example of the first XBW. FIG. 4 is an explanatory diagram showing an example of the second XBW. FIG. 5 is an explanatory diagram showing an example of the data structure stored in the search information storage unit. FIG. 6 is a flowchart showing an operation example of the character string search device of the first embodiment. FIG. 7 is an explanatory diagram showing an example of processing for selecting a key having a large character string score. FIG. 8 is a block diagram showing an outline of the character string search device according to the present invention. FIG. 9 is an explanatory diagram showing the RMQ Trie.

First, an outline of the character string search device of the present invention will be described. The present invention extends XBW, which is a dictionary data structure, to Top-k search, and thereby realizes a data structure for searching at high speed for high-ranking keys that partially match an input character string.

In the present invention, each key that is a search candidate character string is assigned a score (hereinafter referred to as a character string score) indicating the degree of search priority, and the set of keys is represented by a trie structure.

In addition, all the prefixes of the keys included in the key set are represented by the XBW structure used for dictionary search. The character string search device of the present invention uses this XBW structure to specify a prefix range ending with the input character string. Each prefix is associated with a maximum score (hereinafter referred to as a prefix score) among keys starting with the prefix. Therefore, the character string search device identifies the prefix with the largest prefix score within the identified prefix range.

In the present invention, the RMQ structure is used to specify the maximum prefix score within the specified prefix range. Hereinafter, in order to specify the maximum prefix score, the RMQ structure used to represent the relationship between the prefix and the prefix score is referred to as a first RMQ structure. The character string search device uses the first RMQ structure to specify the prefix with the largest prefix score within the specified prefix range.

Furthermore, the character string search device identifies the key with the largest character string score among the keys that start with the identified prefix. At this time, the identified prefix corresponds to one node in the trie tree. Therefore, in order to specify the maximum character string score in the range of keys existing under each node, the RMQ structure is used as in the case of specifying the maximum prefix score. Hereinafter, the RMQ structure used to represent the relationship between the key and the character string score is referred to as a second RMQ structure. The character string search device uses the second RMQ structure to specify the key with the maximum character string score from the range of keys starting with the specified prefix.

After specifying the key with the maximum character string score, the character string search device searches for the keys with the second-highest and subsequent character string scores in order to apply this character string search to Top-k search. These keys are found either among the remaining keys that start with the already specified prefix, or among the keys that start with prefixes that have not yet been specified.

Therefore, the character string search device holds the prefix score of the specified prefix and the character string score of the specified key. The character string search device selects the key or prefix with the maximum score among the character string scores and prefix scores it holds. If the selection is a key, the key with the next highest character string score is searched for among the keys that start with the same prefix. If the selection is a prefix, the prefix with the next highest prefix score is searched for. By repeating this, keys having higher character string scores can be searched for efficiently among the keys that include the input character string.

Hereinafter, embodiments of the character string search device of the present invention will be described more specifically with reference to the drawings.

Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of a first embodiment of a character string search device according to the present invention. The character string search device according to the present embodiment includes an input unit 10, a prefix set specifying unit 20, a search management unit 30, a prefix specifying unit 31, a character string specifying unit 32, an output unit 40, and a search information storage unit 50.

The input unit 10 inputs a character string of one character or more. The character string search device according to the present embodiment searches for a key that partially matches an input character string. In the following description, an input character string is referred to as a search query (or simply a query) P.

The search information storage unit 50 stores a set of keys that are search candidate character strings. As described above, the character string score is associated with the key used in this embodiment. That is, the character string search device of the present embodiment preferentially searches for a key having a higher character string score from the set of keys.

In this embodiment, in order to reduce the amount of data, the search target key is represented using the structure of the trie tree. FIG. 2 is an explanatory diagram illustrating an example of a trie tree corresponding to a key. For example, when there are four words (aba, abcc, cab, cac) illustrated in FIG. 2, the trie tree is configured to place a common character at a common node. The search information storage unit 50 may store the key itself represented by the trie tree, or may store only the structure of the trie tree, as will be described later.

In addition, each leaf node represented by a tree structure corresponds to each key. Accordingly, the search information storage unit 50 stores the score (character string score) of each key illustrated in FIG. 2 in association with each leaf node. Thereby, when the trie tree is searched and the leaf node is reached, the character string score corresponding to the key represented by the leaf node can be acquired.

Further, the search information storage unit 50 stores a set of prefixes p so that a character string ending with the query P can be searched. Here, the prefix p is a character string of one or more consecutive characters extracted from the first character of each key. This set of prefixes p may be sorted lexicographically from the end.

In this embodiment, a structure called XBW is used to represent such a set of prefixes. XBW is a data structure that can efficiently represent a labeled tree structure. By expressing the trie tree using this XBW structure, a range search of the prefix p ending with the query P becomes possible.

XBW is known to be able to realize a data structure that realizes an equivalent operation by two types of methods. The first XBW has a structure in which characters representing child nodes are associated with each prefix of the dictionary at a node on the trie tree corresponding to the prefix. The second XBW has a structure in which each prefix of the dictionary is associated with an ID of a prefix that becomes a parent node in a node on the trie tree corresponding to the prefix. Hereinafter, the contents of each XBW will be described.

FIG. 3 is an explanatory diagram showing an example of the first XBW. In the first XBW illustrated in FIG. 3, prefixes corresponding to the nodes of the trie tree are arranged in lexicographic order from the end, and characters representing child nodes are associated with the respective prefixes. With such a structure, it is possible to move from each prefix to a child node representing a specific character, and an operation equivalent to a trie tree can be realized. Further, it is possible to perform a range search for the prefix p ending with the query P.

FIG. 4 is an explanatory diagram showing an example of the second XBW. In the second XBW illustrated in FIG. 4, the prefixes corresponding to the nodes of the trie tree are arranged in lexicographic order from the end, and an ID is assigned to each prefix. Each prefix is associated with its parent ID. Such a structure makes it possible to move to the next parent node. Similarly to the first XBW, a range search can be performed for the prefix p ending with the query P.

In the second XBW, each prefix holds only its parent ID, so it is difficult to search for a child node. However, even when the second XBW is used, a range search for the prefix p ending with the query P is possible. In this embodiment, either XBW can be used.

The first XBW is described in Reference Document 1, and the second XBW is described in Reference Document 2.
<Reference 1> Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini and S. Muthukrishnan, “Structuring labeled trees for optimal succinctness, and beyond”, FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, Pages 184-196
<Reference 2> Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter, “Faster compressed dictionary matching”, SPIRE '10 Proceedings of the 17th International Conference on String Processing and Information Retrieval, Pages 191-200

Also, in this embodiment, a score (that is, a prefix score) is defined for each prefix. The prefix score of a prefix is the largest character string score among the character string scores associated with the keys starting with that prefix. Expressed as a formula, this is Formula 1 below, in which Score on the right side represents a character string score and Score on the left side represents a prefix score.

Score(p) = max { Score(key) | key starts with prefix p }   (Formula 1)

In this embodiment, since a set of keys is represented by a tree structure, a key starting with a certain prefix exists under a node corresponding to the prefix. Therefore, the prefix score is the largest character string score among the keys existing under the node.
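Formula 1 can be checked on the four example keys with a few lines of Python. This is a minimal sketch, not the stored data structure: it enumerates every prefix explicitly, the function name `prefix_scores` is hypothetical, and the score of "cac" is a made-up placeholder because the text does not state it.

```python
# Character string scores for the example keys. The scores of "aba", "abcc"
# and "cab" follow the text; the score of "cac" is a made-up placeholder.
KEY_SCORES = {"aba": 3, "abcc": 9, "cab": 4, "cac": 5}

def prefix_scores(key_scores):
    """Return {prefix: largest score of any key starting with that prefix}."""
    best = {}
    for key, score in key_scores.items():
        for end in range(1, len(key) + 1):
            prefix = key[:end]
            if prefix not in best or score > best[prefix]:
                best[prefix] = score
    return best

scores = prefix_scores(KEY_SCORES)
print(scores["ab"], scores["cab"])  # prints: 9 4
```

The prefix "ab" inherits the score 9 from "abcc", which lies under the node for "ab" in the trie, while "cab" has only itself below it.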

In the present embodiment, a first RMQ structure is added to the XBW structure, and the prefix score corresponding to each node can be specified using the first RMQ structure. Specifically, the prefix score of each prefix is stored in an array used by the RMQ. Hereinafter, the array in which the prefix scores are stored is referred to as the prefix score array R_p. Since the prefixes are sorted from the end, the prefixes that end with the same character string fall in a single contiguous range. Therefore, by using the first RMQ structure, the maximum value in any range of the prefix score array R_p can be specified.

Furthermore, in this embodiment, the character string score of each key can be specified using the second RMQ structure. Specifically, the character string score of each key is stored in an array used by the RMQ. Hereinafter, the array in which the character string scores are stored is referred to as the character string score array R_k. Since the keys are sorted from the first character, the keys starting with a certain prefix fall in a single contiguous range. Therefore, the maximum value in an arbitrary range of the character string score array R_k can be specified by using the second RMQ structure.
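The text does not say how the RMQ structures are implemented. One common choice is a sparse table, sketched below for range-maximum queries, since the maximum is what the search needs here. The class name `RangeMaxQuery` is hypothetical, and the score array is the illustrative R_k for the keys aba, abcc, cab, cac with a made-up score for "cac".

```python
class RangeMaxQuery:
    """Sparse table that returns the index of the maximum of values[lo..hi]."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[j][i] = index of the maximum in values[i : i + 2**j]
        self.table = [list(range(n))]
        j = 1
        while (1 << j) <= n:
            prev = self.table[j - 1]
            half = 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                left, right = prev[i], prev[i + half]
                row.append(left if self.values[left] >= self.values[right] else right)
            self.table.append(row)
            j += 1

    def argmax(self, lo, hi):
        """Index of the largest value in values[lo..hi] (inclusive, lo <= hi)."""
        j = (hi - lo + 1).bit_length() - 1
        left = self.table[j][lo]
        right = self.table[j][hi - (1 << j) + 1]
        return left if self.values[left] >= self.values[right] else right

# R_k for the sorted keys aba, abcc, cab, cac (the last score is a placeholder).
rmq = RangeMaxQuery([3, 9, 4, 5])
print(rmq.argmax(0, 1))  # keys starting with "ab": index 1, i.e. "abcc"
```

The table costs O(n log n) space and answers each query in constant time. More compact RMQ encodings exist, and the patent's concern with data size suggests one of those may be intended, but the text does not specify.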

FIG. 5 is an explanatory diagram illustrating an example of a data structure stored in the search information storage unit. The XBW structure in the present embodiment is represented by a set S that contains one triple (S_π, S_α, S_last) for each node of the trie tree. S_last is a binary flag, and is 1 when the node is the last child of its parent node, and 0 otherwise. S_α is the character represented by the node. S_π is the prefix corresponding to the parent node of the node, and is the character string obtained by concatenating the characters from the root to the parent node in order. Note that S_π does not include the character of the node itself. The triples are sorted in lexicographic order by comparing the prefix S_π included in each triple from its last character toward its first character. In the example shown in FIG. 5, row numbers are assigned to the sorted triples (S_π, S_α, S_last) in order from the top. In FIG. 5, $ indicates the beginning of a key, and # indicates the end of a key.
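The triples can be generated from the four example keys as follows. This is a sketch under assumptions that the text does not spell out: "$" is taken to sort before the letters, siblings are ordered by character code, and the function name `build_xbw_rows` is hypothetical.

```python
def build_xbw_rows(keys):
    """Return the sorted list of (S_pi, S_alpha, S_last) triples for a key set."""
    # children[prefix] = characters that can follow that prefix in the trie
    children = {}
    for key in keys:
        path = "$"                      # "$" marks the beginning of a key
        for ch in key + "#":            # "#" marks the end of a key
            children.setdefault(path, set()).add(ch)
            path += ch
    rows = []
    # Sort the parent prefixes by comparing from the last character backwards.
    for s_pi in sorted(children, key=lambda p: p[::-1]):
        chars = sorted(children[s_pi])
        for i, s_alpha in enumerate(chars):
            s_last = 1 if i == len(chars) - 1 else 0
            rows.append((s_pi, s_alpha, s_last))
    return rows

rows = build_xbw_rows(["aba", "abcc", "cab", "cac"])
for number, row in enumerate(rows, start=1):
    print(number, *row)
```

Under these assumptions the listing has 13 rows, and rows 7 to 9 carry the prefixes "$ab", "$ab" and "$cab", which agrees with the row range given in the text for the query "ab".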

Also, as shown in FIG. 5, a prefix score R_p is defined for each prefix. Since the prefix score R_p can be calculated from the character string score associated with each key as described above, the prefix score does not need to be retained explicitly. The prefix IDs illustrated in FIG. 5 are assigned in the order in which all prefixes included in the dictionary are sorted from the end. Therefore, the order of the prefix IDs matches the order of the rows for which 1 is set in S_last.

With the structure illustrated in FIG. 5, the range of prefixes ending with the query P can be specified. For example, it can be seen that the rows corresponding to the prefixes ending with the query “ab” are rows 7 to 9 (that is, the rows corresponding to “$ab” and “$cab”). It can also be seen that the prefix scores R_p of “$ab” and “$cab”, which correspond to prefix IDs 4 and 5, are 9 and 4, respectively.
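The range lookup can be imitated with a plain sorted list. The sketch below is not the XBW range search itself: it sorts the prefixes by their reversed spelling and binary-searches for the reversed query. The function names are hypothetical, the score of "cac" is a placeholder, and the 0-based ID numbering with "$" sorting first is an assumption chosen because it reproduces the IDs 4 and 5 quoted in the text.

```python
from bisect import bisect_left

# Scores of "aba", "abcc" and "cab" follow the text; "cac" is a placeholder.
KEY_SCORES = {"aba": 3, "abcc": 9, "cab": 4, "cac": 5}

def build_prefix_table(key_scores):
    """Return (prefixes, r_p): prefixes sorted from the end, and their scores."""
    best = {}
    for key, score in key_scores.items():
        for end in range(len(key) + 1):
            prefix = "$" + key[:end]          # "$" marks the beginning of a key
            if prefix not in best or score > best[prefix]:
                best[prefix] = score
    prefixes = sorted(best, key=lambda p: p[::-1])
    return prefixes, [best[p] for p in prefixes]

def prefix_id_range(prefixes, query):
    """Inclusive ID range [lo, hi] of prefixes ending with query (hi < lo if none)."""
    reversed_prefixes = [p[::-1] for p in prefixes]   # already in ascending order
    target = query[::-1]
    lo = bisect_left(reversed_prefixes, target)
    hi = lo
    while hi < len(prefixes) and reversed_prefixes[hi].startswith(target):
        hi += 1
    return lo, hi - 1

prefixes, r_p = build_prefix_table(KEY_SCORES)
lo, hi = prefix_id_range(prefixes, "ab")
print([(i, prefixes[i], r_p[i]) for i in range(lo, hi + 1)])
```

For the query "ab" this yields IDs 4 and 5, the prefixes "$ab" and "$cab", and the prefix scores 9 and 4.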

Once the prefix range is specified, the ID of the prefix having the maximum score in the range can be acquired with the first RMQ structure. Furthermore, by using the first RMQ structure recursively, the IDs of the prefixes having the second-highest and subsequent scores can be acquired.

From the above, by using the XBW structure, it is possible to select any number of prefixes p having a higher prefix score from the prefixes p ending with the query P.

The prefix set specifying unit 20 specifies a set of prefixes including the input character string from the set of prefixes stored in the search information storage unit 50. Specifically, the prefix set specifying unit 20 specifies a set of prefixes ending with the input character string. For example, when the search information storage unit 50 stores the set of prefixes illustrated in FIG. 5 and “ab” is input as the character string, the prefix set specifying unit 20 specifies the prefixes existing in the range of line numbers 7 to 9 (that is, “$ab” and “$cab”) as the set of prefixes.

The prefix specifying unit 31 specifies a prefix having a higher prefix score from the set of prefixes specified by the prefix set specifying unit 20. The prefix identification unit 31 may identify a prefix having the largest prefix score or a prefix corresponding to the top n prefix scores (n is an arbitrary natural number).

The character string specifying unit 32 specifies a key having a higher character string score among the keys starting with the specified prefix. The character string specifying unit 32 may search for the key having the largest character string score, or for the keys corresponding to the top m character string scores (m is an arbitrary natural number).

For example, in FIG. 5, it is assumed that the prefix specifying unit 31 specifies “$ab” as the prefix. In this case, the keys starting with the specified prefix “$ab” are “aba” and “abcc”. The character string score of “aba” is 3, and the character string score of “abcc” is 9. In this case, the character string specifying unit 32 may select “abcc” as the key.

The search management unit 30 specifies a prefix range to be searched by the prefix specifying unit 31. In addition, the search management unit 30 specifies a key range to be searched by the character string specifying unit 32, and specifies the key specified by the character string specifying unit 32 as a search target key.

Specifically, the search management unit 30 first specifies the prefix range specified by the prefix set specifying unit 20 as the prefix range searched by the prefix specifying unit 31. Then, the search management unit 30 specifies a key starting with a prefix within the specified range as a key range to be searched by the character string specifying unit 32. Then, the search management unit 30 specifies the key specified by the character string specifying unit 32 as the search target key.

After that, the search management unit 30 specifies a range excluding the already specified key from the keys starting with the key prefix specified by the character string specifying unit 32. Further, the search management unit 30 specifies a range obtained by removing the prefix specified by the prefix specifying unit 31 from the set of prefixes specified by the prefix set specifying unit 20.

Then, the search management unit 30 causes the prefix specifying unit 31 and the character string specifying unit 32 to execute each process. Specifically, the prefix specifying unit 31 specifies the prefix having the maximum prefix score from the prefix range specified by the search management unit 30. Furthermore, the character string specifying unit 32 specifies the key having the maximum character string score from the key range specified by the search management unit 30.

The search management unit 30 compares the prefix score specified from the prefix range with the character string score of the key specified from the key range. If the highest score is a character string score, the key having the next highest character string score is searched for among the keys starting with the same prefix as that key. Specifically, the search management unit 30 excludes the key from the key range that was used when the key was specified, which divides that key range into two ranges, and specifies the two ranges. The character string specifying unit 32 specifies the key having the maximum character string score from the two ranges.

If the highest score is a prefix score, the prefix having the next highest prefix score after that prefix is searched for. Specifically, the search management unit 30 excludes the prefix from the prefix range that was used when the prefix was specified, which divides that prefix range into two ranges, and specifies the two ranges. The prefix specifying unit 31 specifies the prefix having the maximum prefix score from the two ranges.
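The same exclude-and-split step serves both the key range and the prefix range, so it can be sketched once over a generic score array. In the following Python sketch the names `argmax_in_range` and `top_n_in_range` are hypothetical, a linear scan stands in for the RMQ structure, and the score values are made up for illustration.

```python
import heapq

def argmax_in_range(values, lo, hi):
    """Stand-in for an RMQ structure: index of the maximum of values[lo..hi]."""
    return max(range(lo, hi + 1), key=lambda i: values[i])

def top_n_in_range(values, lo, hi, n):
    """Return up to n indices of values[lo..hi] in descending order of score."""
    result = []
    heap = []  # entries: (-score, index, range_lo, range_hi)

    def push(a, b):
        if a <= b:  # skip empty ranges
            i = argmax_in_range(values, a, b)
            heapq.heappush(heap, (-values[i], i, a, b))

    push(lo, hi)
    while heap and len(result) < n:
        _, i, a, b = heapq.heappop(heap)
        result.append(i)
        push(a, i - 1)   # best entry to the left of the excluded winner
        push(i + 1, b)   # best entry to the right of the excluded winner
    return result

scores = [7, 2, 9, 4, 9, 1]   # made-up scores
print(top_n_in_range(scores, 0, 5, 3))  # prints: [2, 4, 0]
```

Each extraction costs at most two further range-maximum queries, so retrieving the top n entries of a range needs a number of queries proportional to n rather than to the length of the range.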

The output unit 40 outputs the key specified by the search management unit 30 as a search result.

The prefix set specifying unit 20, the search management unit 30, the prefix specifying unit 31, and the character string specifying unit 32 are realized by a CPU of a computer that operates according to a program (character string search program). For example, the program may be stored in a storage unit (not shown) of the character string search device, and the CPU may read the program and operate as the prefix set specifying unit 20, the search management unit 30, the prefix specifying unit 31, and the character string specifying unit 32 according to the program.

Further, each of the prefix set specifying unit 20, the search managing unit 30, the prefix specifying unit 31, and the character string specifying unit 32 may be realized by dedicated hardware.

Next, the operation of the character string search device of this embodiment will be described. FIG. 6 is a flowchart showing an operation example of the character string search device according to the present embodiment. Here, k keys are selected as candidates. The search management unit 30 holds a priority queue (not shown) that stores the prefix and prefix score pairs specified by the prefix specifying unit 31 and the key and character string score pairs specified by the character string specifying unit 32. This priority queue is a queue for holding candidate information. In the following description, the priority queue is simply referred to as the queue.

The input unit 10 inputs a character string to be searched (step S11). The prefix set specifying unit 20 refers to the search information storage unit 50 and specifies a set of prefixes including the input character string (step S12).

The prefix specifying unit 31 specifies the prefix having the largest prefix score from the prefix set specified by the prefix set specifying unit 20, and holds the specified prefix and prefix score pair in the queue (step S13).

The character string specifying unit 32 specifies the key having the highest character string score from the keys starting with the specified prefix, and holds the specified key / character string score pair in the queue (step S14).

Next, the search management unit 30 specifies the prefix or key of the maximum score among the prefix scores or character string scores held in the queue (step S15). Then, the search management unit 30 determines whether the maximum score is a prefix score or a character string score (step S16).

When the maximum score is a character string score (“character string score” in step S16), the search management unit 30 specifies the key with that character string score as an output target and removes it from the queue (step S17). Then, the character string specifying unit 32 specifies the key having the next highest character string score after the removed key within the key range that was used when the removed key was specified, and holds the specified key and character string score pair in the queue (step S18).

On the other hand, when the maximum score is a prefix score (“prefix score” in step S16), the search management unit 30 removes the prefix with that prefix score from the queue (step S19). Then, the prefix specifying unit 31 specifies the prefix having the next largest prefix score after the removed prefix within the prefix range that was used when the removed prefix was specified, and holds the specified prefix and prefix score pair in the queue (step S20).

Further, the character string specifying unit 32 specifies the key having the largest character string score from the keys starting with the prefix specified in step S20, and holds the specified key and character string score pair in the queue (step S21).

If the queue is empty, or the maximum score in the queue is lower than the k-th largest character string score found so far (Yes in step S22), the search management unit 30 outputs the keys found so far as the top keys (step S23). On the other hand, when the queue is not empty and the maximum score in the queue is not lower than the k-th largest character string score found so far (No in step S22), the processing from step S15 onward is repeated.

In this way, by placing the prefix and prefix score pairs specified by the prefix specifying unit 31 and the key and character string score pairs specified by the character string specifying unit 32 in the same priority queue, the pair with the largest score can be taken out regardless of whether it is a prefix score or a character string score.
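The flow of steps S11 to S23 can be sketched end to end in Python. This is an illustrative sketch under several assumptions rather than the patented implementation: linear scans stand in for the XBW range search and for both RMQ structures, the loop simply stops once k keys have been output (entries leave the queue in descending score order, so this is a simplification of the test in step S22), a `seen` set is added because one key can lie under several matching prefixes, and all function names and the score of "cac" are hypothetical.

```python
import heapq

def build_index(key_scores):
    keys = sorted(key_scores)                     # keys sorted from the first character
    r_k = [key_scores[key] for key in keys]       # character string score array R_k
    key_range = {}                                # prefix -> inclusive range of keys below it
    for pos, key in enumerate(keys):
        for end in range(1, len(key) + 1):
            lo, _ = key_range.get(key[:end], (pos, pos))
            key_range[key[:end]] = (lo, pos)
    prefixes = sorted(key_range, key=lambda p: p[::-1])   # sorted from the end
    r_p = [max(r_k[lo:hi + 1]) for lo, hi in (key_range[p] for p in prefixes)]
    return keys, r_k, prefixes, key_range, r_p

def argmax(values, lo, hi):
    """Stand-in for an RMQ structure."""
    return max(range(lo, hi + 1), key=values.__getitem__)

def top_k_substring(index, query, k):
    keys, r_k, prefixes, key_range, r_p = index
    # Step S12: prefixes ending with the query (one contiguous block of IDs).
    hits = [i for i, p in enumerate(prefixes) if p.endswith(query)]
    heap, found, seen = [], [], set()   # heap entries: (-score, kind, index, lo, hi)

    def push_key(a, b):                 # steps S14, S18, S21
        if a <= b:
            i = argmax(r_k, a, b)
            heapq.heappush(heap, (-r_k[i], 0, i, a, b))

    def push_prefix(a, b):              # steps S13, S20, followed by S14 or S21
        if a <= b:
            i = argmax(r_p, a, b)
            heapq.heappush(heap, (-r_p[i], 1, i, a, b))
            push_key(*key_range[prefixes[i]])

    if hits:
        push_prefix(hits[0], hits[-1])
    while heap and len(found) < k:      # steps S15, S16, S22
        _, kind, i, a, b = heapq.heappop(heap)
        if kind == 0:                   # a key: output it, then split its key range
            if i not in seen:
                seen.add(i)
                found.append((keys[i], r_k[i]))
            push_key(a, i - 1)
            push_key(i + 1, b)
        else:                           # a prefix: split its prefix range
            push_prefix(a, i - 1)
            push_prefix(i + 1, b)
    return found

# Scores of "aba", "abcc" and "cab" follow the text; "cac" is a placeholder.
index = build_index({"aba": 3, "abcc": 9, "cab": 4, "cac": 5})
print(top_k_substring(index, "ab", 3))
```

For the query "ab" with k = 3, the sketch returns "abcc" (9), "cab" (4) and "aba" (3) in that order, and "cac" is never visited because none of its prefixes ends with "ab".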

Hereinafter, the operation illustrated in FIG. 6 will be described using a specific example. FIG. 7 is an explanatory diagram illustrating an example of processing for selecting a key having a large character string score. In the example shown in FIG. 7, a character string “gres” is input, and a method of searching for three keys (k = 3) partially matching this character string is shown. The list illustrated in the left frame of FIG. 7 is a list schematically showing the XBW structure, where a number represents a prefix score and a letter represents a prefix. In addition, the list illustrated in the right frame of FIG. 7 is a list schematically showing a trie tree, where a number represents a character string score and a character represents a key.

For the key range that partially matches the character string “gres”, the prefix set specifying unit 20 identifies “aggres”, “congres”, and “progres” as candidates from the set of prefixes represented by the XBW structure. Once a prefix is specified, the keys that begin with that prefix can be specified.

The prefix specifying unit 31 selects a prefix having the highest prefix score from prefixes ending with the input character string “gres” from the determined set of prefixes. FIG. 7 shows a state in which the selected prefixes are arranged in descending order of prefix score. In the example shown in FIG. 7, the prefix score of “congres” is 45, which is the largest. Therefore, the prefix specifying unit 31 specifies “congres” as a prefix.

The character string specifying unit 32 selects a key having the largest character string score from keys starting with the selected prefix. In the example illustrated in FIG. 7, there are three keys having “congres” as a prefix, “congress”, “congressional”, and “congressmen”. Among these, the key having the largest character string score is “congress”. Therefore, the character string specifying unit 32 specifies “congress” as the first key, and the search management unit 30 specifies the specified “congress” as a search target key.

At this stage, since only one key has been specified, the process of specifying the key is repeated.

As described above, the search management unit 30 includes a priority queue (not shown) for holding candidate information, and keeps the prefixes and keys found so far in the queue together with their scores.

The search management unit 30 refers to the queue and selects the entry with the highest score from the prefixes and keys held in the queue. When the selected entry is a key, the character string specifying unit 32 searches for the key having the next highest character string score within the same key range as when that key was searched for. When the selected entry is a prefix, the prefix specifying unit 31 searches for the prefix having the next highest prefix score after that prefix within the same prefix range as when that prefix was searched for.

In this example, the prefix “congres” with prefix score 45 and the key “congress” with character string score 45 are held in the queue. Since the two scores are equal, either the key or the prefix may be searched first. When searching for a key, the search management unit 30 first pops the key “congress” and removes it from the queue. Then, the character string specifying unit 32 searches for the key having the next highest character string score after the key “congress” among the keys starting with the same prefix “congres” as when the key “congress” was obtained. Specifically, the search management unit 30 divides the key range searched when the key “congress” was obtained into two ranges, this time excluding “congress”, and the character string specifying unit 32 searches each of the two ranges for the key with the highest character string score. Of the two ranges obtained by excluding the key “congress”, the range preceding “congress” in dictionary order contains no key. Therefore, it is only necessary to obtain the key having the maximum character string score in the range following “congress” in dictionary order. That key is “congressmen”, with character string score 13. Therefore, the search management unit 30 newly holds this key in the queue.

When searching for a prefix, the search management unit 30 first pops the prefix “congres” and removes it from the queue. Then, the prefix specifying unit 31 searches for the prefix having the next highest prefix score after the prefix “congres”. Specifically, the search management unit 30 divides the prefix range searched when the prefix “congres” was obtained into two ranges, excluding “congres”, and the prefix specifying unit 31 finds the prefix with the highest prefix score in each range. The prefixes with the largest prefix scores in the two ranges obtained by excluding the prefix “congres” are the prefix “aggres”, with prefix score 12, and the prefix “progres”, with prefix score 21, respectively. Therefore, the search management unit 30 newly holds these two prefixes in the queue.

Further, the character string specifying unit 32 acquires the key having the maximum character string score starting with each prefix for the prefix “aggres” and the prefix “progres”. Thereby, the key “aggressive” of the character string score 12 and the key “progress” of the character string score 21 are obtained. Thereby, it can be confirmed that the prefix score of the prefix “aggres” is 12 and the prefix score of the prefix “progres” is 21 of the two prefixes acquired earlier.

In this embodiment, only the RMQ structure is retained, and the prefix scores themselves are not retained. The RMQ structure alone can find which prefix has the largest prefix score, but cannot give the specific value of that prefix score. Therefore, after the prefix with the largest prefix score in the range is obtained, the largest character string score among the keys starting with that prefix must be obtained in order to determine the specific value of the prefix score.
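The relationship between a prefix score and the character string scores beneath it can be illustrated with a few lines of Python. This is only a sketch: the dictionary `SCORES` and the function `prefix_score` are hypothetical names, the scores are those quoted in the FIG. 7 walk-through, and the value for “congressional” is an assumed one because the text does not state it.

```python
# Key -> character string score, as quoted in the FIG. 7 walk-through.
# The score of "congressional" is not given in the text; 7 is an assumed value.
SCORES = {
    "aggressive": 12,
    "congress": 45,
    "congressional": 7,
    "congressmen": 13,
    "progress": 21,
}


def prefix_score(prefix, scores=SCORES):
    """Largest character string score among the keys that start with `prefix`."""
    return max(s for key, s in scores.items() if key.startswith(prefix))


for p in ("aggres", "congres", "progres"):
    print(p, prefix_score(p))
```

A linear scan like this is used only for readability; the embodiment instead obtains the same value through a range maximum query over the key range of the prefix.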

As a result of the above processing, five entries are held in the queue: the prefix “progres” (prefix score 21), the prefix “aggres” (prefix score 12), the key “progress” (character string score 21), the key “congressmen” (character string score 13), and the key “aggressive” (character string score 12).

Of these, the highest score belongs to the prefix “progres” (prefix score 21) and the key “progress” (character string score 21), so the search may continue from either of them.

This process is repeated. If the prefix score of a newly found prefix is lower than the k-th largest character string score found so far, the search management unit 30 does not register the prefix in the queue. This is because any prefix found after that prefix in the same range has an even smaller prefix score. Similarly, when the score of a newly found key is lower than the k-th largest character string score found so far, the search management unit 30 does not register the key in the queue. Thereby, it is possible to avoid searching for a prefix having a small prefix score or a key having a small character string score, and the top k keys can be collected efficiently.

The search ends when the queue is empty or the maximum score in the queue falls below the kth largest string score found so far.
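The walk-through above can be condensed into a short Python sketch. It is not the patented implementation: a plain linear scan (`argmax`) stands in for the RMQ structures, the loop simply stops once k keys have been popped rather than applying the step S22 test, and the names `FIG7`, `top_k`, `add_key`, and `add_prefix` are hypothetical. The score 7 for “congressional” is an assumed value.

```python
import heapq

KEY, PREFIX = 0, 1  # entry kinds; on equal scores a key is popped before a prefix

# (prefix, [(key, score), ...]) with keys in dictionary order, after FIG. 7.
# The score 7 for "congressional" is an assumed value; the text does not state it.
FIG7 = [
    ("aggres", [("aggressive", 12)]),
    ("congres", [("congress", 45), ("congressional", 7), ("congressmen", 13)]),
    ("progres", [("progress", 21)]),
]


def argmax(scores, lo, hi):
    """Naive stand-in for an RMQ structure: index of the maximum in scores[lo..hi]."""
    if lo > hi:
        return None
    return max(range(lo, hi + 1), key=scores.__getitem__)


def top_k(prefixes, k):
    pscore = [max(s for _, s in keys) for _, keys in prefixes]
    heap = []  # max-heap emulated by negating scores

    def add_key(p, lo, hi):
        scores = [s for _, s in prefixes[p][1]]
        i = argmax(scores, lo, hi)
        if i is not None:
            heapq.heappush(heap, (-scores[i], KEY, (p, i, lo, hi)))

    def add_prefix(lo, hi):
        p = argmax(pscore, lo, hi)
        if p is not None:
            heapq.heappush(heap, (-pscore[p], PREFIX, (p, lo, hi)))
            add_key(p, 0, len(prefixes[p][1]) - 1)  # best key of the new prefix

    add_prefix(0, len(prefixes) - 1)
    found = []
    while heap and len(found) < k:
        neg, kind, payload = heapq.heappop(heap)
        if kind == KEY:
            p, i, lo, hi = payload
            found.append((prefixes[p][1][i][0], -neg))
            add_key(p, lo, i - 1)   # split the key range around the popped key
            add_key(p, i + 1, hi)
        else:
            p, lo, hi = payload
            add_prefix(lo, p - 1)   # split the prefix range around the popped prefix
            add_prefix(p + 1, hi)
    return found


print(top_k(FIG7, 3))
```

Keeping prefixes and keys in one heap means a single pop always yields the current best candidate of either kind, which is the property the embodiment relies on.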

As described above, according to the present embodiment, the prefix set specifying unit 20 specifies, from the set of prefixes, the set of prefixes ending with the input character string, and the prefix specifying unit 31 specifies the prefix having the largest prefix score from the set of prefixes ending with the input character string. The character string specifying unit 32 then specifies the key having the largest character string score from the keys that begin with the specified prefix.

Specifically, in this embodiment, since the index is created for the prefixes and the keys, the dictionary size can be reduced compared with the case where an index is created for all partial character strings. In addition, the prefix specifying unit 31 specifies a prefix having a large prefix score, and the character string specifying unit 32 searches for a key having a large character string score among the keys starting with that prefix. Searching for keys in descending order of score in this way allows the top k keys to be found efficiently. Therefore, a partial match search for character strings can be performed at high speed while the amount of data is reduced.

For example, Japanese and English dictionaries, query logs and URLs often have a common prefix. In the character string search apparatus according to the present embodiment, the trie tree is used as a data structure capable of collecting common prefixes, so that the data size can be reduced. In the present embodiment, the case where the key is represented using the data structure of the trie tree is illustrated, but the data structure may be a Patricia tree. By using the Patricia tree, the data size can be reduced more than the tree structure of the trie tree.

In addition, the character string search device of the present embodiment includes a search management unit 30 that manages the search range. Specifically, the search management unit 30 specifies a key range that excludes the already specified keys from the keys starting with the prefix of the key specified by the character string specifying unit 32, and specifies a prefix range that excludes the prefix specified by the prefix specifying unit 31 from the set of prefixes specified by the prefix set specifying unit 20. Then, the prefix specifying unit 31 specifies the prefix having the largest prefix score from the prefix range specified by the search management unit 30, and the character string specifying unit 32 specifies the key having the largest character string score from the key range specified by the search management unit 30. In this way, XBW, which is used as a data structure for a dictionary, can be extended to top-k search, so that both space saving and high-speed processing can be realized when a partial match search is performed for the top k candidates.

Embodiment 2.
Next, a second embodiment of the character string search device according to the present invention will be described. The configuration of the character string search device of this embodiment is the same as that of the first embodiment. However, the character string search device according to the second embodiment can reduce the amount of data to be retained more than the character string search device according to the first embodiment.

Two data structures are generated from the trie tree T described in the first embodiment. One is a data structure related to a prefix, and the other is a data structure related to a key.

The data structure related to the prefix includes xbw which is the XBW representation of the trie tree T and the accompanying first RMQ structure. As described in the first embodiment, on xbw, the prefixes are arranged in the order sorted from the end.

Further, a first RMQ structure is generated for the prefix score string R _p shown in the first embodiment. In this case, the search information storage unit 50 need not explicitly hold the prefix score string R _p ; it may hold only the first RMQ structure calculated from R _p .

The data structure relating to the key includes a Patricia tree T _c generated from the trie tree T, a second RMQ structure, and a character string score string R _k . The tree structure of the Patricia tree T _c is represented by DFUDS. Further, for the Patricia tree T _c , a bit string with the same number of bits as the number of nodes is prepared in order to identify which nodes of the tree structure are leaf nodes.

A general Patricia tree holds a character string corresponding to each node. On the other hand, the search information storage unit 50 of this embodiment removes the character string corresponding to each node and stores only the tree structure representing the parent-child relationship between the nodes. The reason for storing only such a tree structure will be described later.

In the following description, as illustrated in FIG. 5, it is assumed that each key is sorted in lexicographic order from the first character, and each key is assigned a key ID in that order. Each prefix is also sorted in lexicographic order from the end, and each prefix is assigned a prefix ID in that order. Further, the range of the prefix ID is written as [s _p , e _p ], and the range on the set S representing the prefix is written as [s _s , e _s ].

The prefix set specifying unit 20 specifies the range [s _p , e _p ] of prefix IDs of the prefixes ending with the input character string. Specifically, the prefix set specifying unit 20 uses the XBW to identify the range [s _s , e _s ] of prefixes whose end is the input character string. However, since this range [s _s , e _s ] is a range on the set S, it is necessary to convert it to a prefix ID range [s _p , e _p ]. Therefore, the prefix set specifying unit 20 specifies [s _p , e _p ] by determining what number, among the 1s on S _last , the first 1 and the last 1 included in [s _s , e _s ] are. This is because the elements that are 1 on S _last correspond one-to-one, in the same order, to the prefix IDs.
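The conversion from a range on S to a range of prefix IDs amounts to a rank query on S _last . The sketch below assumes 0-based positions and prefix IDs, and uses a hypothetical bit vector `S_LAST` that was chosen only so that the numbers agree with the worked example later in this embodiment ([7, 9] maps to [4, 5], and prefix ID 4 sits at position 8); it is not copied from FIG. 5. The helper names `rank1`, `select1`, and `prefix_id_range` are likewise illustrative.

```python
# Hypothetical S_last bit vector (0-based positions); chosen only so that the
# numbers match the worked example in the text, not copied from FIG. 5.
S_LAST = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]


def rank1(bits, i):
    """Number of 1 bits strictly before position i (the 0-based ID of a 1 at i)."""
    return sum(bits[:i])


def select1(bits, j):
    """Position of the 1 bit whose 0-based rank is j."""
    return [i for i, b in enumerate(bits) if b][j]


def prefix_id_range(s_last, s_s, e_s):
    """Convert a range [s_s, e_s] on S into a prefix ID range [s_p, e_p]."""
    ones = [i for i in range(s_s, e_s + 1) if s_last[i]]
    if not ones:
        return None  # no prefix ends inside this range
    return rank1(s_last, ones[0]), rank1(s_last, ones[-1])


print(prefix_id_range(S_LAST, 7, 9))
print(select1(S_LAST, 4))
```

The linear scans here are for clarity only; succinct bit vectors usually answer rank and select queries in constant time with a small auxiliary index.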

The prefix specifying unit 31 specifies the prefix having the maximum prefix score from the range [s _p , e _p ] of the specified prefix IDs. Specifically, the prefix specifying unit 31 uses the first RMQ structure to specify the position of the prefix having the maximum prefix score within the range [s _p , e _p ]. Here, the position of the specified prefix is denoted as i _p .

The search management unit 30 identifies, from the position i _p of the specified prefix, the range of keys that start with that prefix. Hereinafter, the range of keys starting with the specified prefix is denoted as [s _k , e _k ]. Specifically, the search management unit 30 first identifies, as the corresponding position i _s on S, the last node having the prefix that corresponds to the position i _p .

Next, the search management unit 30 restores the character string representing this prefix on xbw. Specifically, the search management unit 30 restores the character string by concatenating the characters encountered while following parents from the node represented by the i _s -th row in xbw. The number of moves from a node to its parent is equal to the prefix length.

At this time, for each position i _s on S that is traced, the search management unit 30 calculates the difference d = i _s - i _f between i _s and the nearest position i _f before i _s at which S _last [i _f ] = 1. Then, the search management unit 30 stores the calculated values in the array d in the reverse order of the order traced toward the parent. However, if no such i _f exists, the search management unit 30 stores in the array d the value obtained with i _f = 0.

Next, the search management unit 30 moves the target position from the parent node to the child node in the Patricia tree T _c according to the order stored in the array d. However, when the corresponding value of the array d is 1, the search management unit 30 ignores the value and performs processing for the next value.

Since xbw and T _c are generated from the same trie tree T, the number of children of each node and their order are the same, except for nodes that have exactly one child. Therefore, when the target position on T _c is moved according to the array d, the position reaches the node u _c on T _c corresponding to the prefix.

The search management unit 30 then uses DFUDS to specify the range [s _k , e _k ] of keys corresponding to the descendants of the reached node u _c . Since the keys included in [s _k , e _k ] are all descendants of u _c , [s _k , e _k ] indicates the range of keys starting with the specified prefix.

The character string specifying unit 32 specifies the key ID (hereinafter referred to as i _k ) having the maximum character string score from the specified key range [s _k , e _k ]. Specifically, the character string specifying unit 32 uses the second RMQ structure to specify the key position i _k having the maximum character string score within the range [s _k , e _k ].
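A range maximum query over the character string score string can be sketched as follows. Note the substitution: the embodiment uses a succinct RMQ structure of 2l + o(l) bits, whereas this sketch uses an ordinary sparse table, which is much larger but easy to read. The class name `SparseTableRMQ` is hypothetical, and the score list contains only the three scores mentioned in the worked example (key IDs 0, 1, 2 with scores 3, 9, 4); the dictionary in FIG. 5 may contain more keys.

```python
class SparseTableRMQ:
    """Sparse-table range-maximum query returning the index of the maximum.

    A readable stand-in for the succinct RMQ structure; it keeps the score
    array, which is acceptable here because R_k is stored anyway.
    """

    def __init__(self, scores):
        self.scores = list(scores)
        n = len(self.scores)
        self.table = [list(range(n))]  # level 0: every index is its own maximum
        j = 1
        while (1 << j) <= n:
            prev, half = self.table[-1], 1 << (j - 1)
            self.table.append([
                self._better(prev[i], prev[i + half])
                for i in range(n - (1 << j) + 1)
            ])
            j += 1

    def _better(self, a, b):
        # Ties go to the smaller index.
        return a if self.scores[a] >= self.scores[b] else b

    def argmax(self, lo, hi):
        """Index of the maximum score in the inclusive range [lo, hi]."""
        j = (hi - lo + 1).bit_length() - 1
        return self._better(self.table[j][lo], self.table[j][hi - (1 << j) + 1])


# Scores for key IDs 0, 1, 2 as mentioned in the worked example (assumed complete).
rmq = SparseTableRMQ([3, 9, 4])
print(rmq.argmax(0, 1))
```

With these scores, querying the key range of a prefix returns the key ID whose score is then read from R _k .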

The character string specifying unit 32 identifies the character string of the key from the position i _k of the specified key ID. The position i _k corresponds to the i _k -th leaf node u _i on the Patricia tree T _c . Therefore, the character string specifying unit 32 traces from u _i toward the parent nodes of the Patricia tree T _c and stores the numbers of the child nodes in the array d in the reverse order of the order traced toward the parent. The character string specifying unit 32 can specify the position of the node on xbw corresponding to u _i by tracing xbw sequentially from the root according to this array d. When a node has only one child, it is only necessary to move unconditionally toward the leaf without referring to the array d. There is no branching among the descendants of the trie node corresponding to a leaf node of the Patricia tree T _c . Therefore, the character string specifying unit 32 can accurately restore the key by following a single chain.

As described above, key information can be obtained from xbw. Therefore, it is sufficient to exclude the character string held by each node of the Patricia tree and leave only the parent-child relationship between the nodes.
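The idea of restoring a key from a path of child numbers can be sketched on an ordinary in-memory trie. This is only an illustration under stated assumptions: a nested dictionary stands in for xbw, child numbers are 1-based and follow character order, and the three-key dictionary is hypothetical rather than the one in FIG. 5. The function names `build_trie` and `restore_key` are illustrative.

```python
def build_trie(keys):
    """Build a nested-dict trie; each dict maps a character to a child node."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
    return root


def restore_key(trie, d):
    """Follow the 1-based child numbers in d from the root down to a leaf.

    Nodes with a single child are passed through unconditionally, so d only
    has to record the choice made at each branching node.
    """
    node, chars, path = trie, [], list(d)
    while node:  # an empty dict is a leaf
        children = sorted(node)
        if len(children) == 1:
            ch = children[0]                 # no choice to make
        else:
            ch = children[path.pop(0) - 1]   # consume one entry of d
        chars.append(ch)
        node = node[ch]
    return "".join(chars)


# Hypothetical dictionary; "$" marks the start and "#" the end of a key.
trie = build_trie(["$ab#", "$abcc#", "$cab#"])
print(restore_key(trie, [1, 2]))
```

Because only branching nodes consume an entry of d, the same path is valid on the Patricia tree, where single-child chains are collapsed.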

For example, in FIG. 5, “$ ab” can be reached by selecting the first child from the root and then selecting the first child at the selected node. Therefore, the search management unit 30 may store the information d = 1, 1 in the array d for specifying the selected node.

Hereinafter, the specific operation of the character string search device of this embodiment will be described using the example shown in FIG. Here, the search query is P = “ab” and k = 2. The range on S that ends with P is [s _s , e _s ] = [7, 9]. The corresponding range on R _p is [s _p , e _p ] = [4, 5]. The position of the largest prefix score in this range is i _p = 4. This corresponds to the position i _s = 8 on S. Since the prefix corresponding to i _s is “$ ab”, and each node on its path is a first child, the array d = 1, 1 is obtained.

On T _c , starting from the root node, moving to the first child node and then once more to the first child node reaches the node corresponding to “$ ab”. The maximum character string score under this node is 9, and the corresponding key is the key identified by key ID = 1. Therefore, the key ID and character string score pair <1, 9> is obtained. This is the key of the maximum character string score in the dictionary in FIG.

The key ranked second is either the second key under the same prefix or the first key under another prefix. The second key under the same prefix is the key identified by key ID = 0, whose character string score is 3 (hereinafter written as <0, 3>). On the other hand, to obtain the first key under another prefix, the prefix range that excludes the prefix i _p = 4, previously specified as having the maximum prefix score, is specified. Excluding the prefix i _p = 4 and dividing the prefix range into two yields the range [s _p , e _p ] = [5, 5]. Therefore, the prefix with the largest prefix score is specified in the range [s _p , e _p ] = [5, 5], and the key with the maximum character string score under that prefix is then specified. In the example shown in FIG. 5, this corresponds to specifying the key with the maximum character string score among the keys starting with the prefix “$ cab”. As a result, a key ID / character string score pair <2, 4> is newly specified.
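The range division used in this step is small enough to state directly in code. The following sketch uses a hypothetical helper `split_range` that removes one position from an inclusive range and returns whatever non-empty sub-ranges remain.

```python
def split_range(lo, hi, i):
    """Return the non-empty sub-ranges of [lo, hi] left after removing position i."""
    return [(a, b) for a, b in ((lo, i - 1), (i + 1, hi)) if a <= b]


# Prefix range [4, 5] with the prefix at position 4 excluded, as in the text.
print(split_range(4, 5, 4))
```

Dropping empty sub-ranges at this point means the RMQ structure is never queried on a range that cannot contain a candidate.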

So far, three candidate pairs have been identified, of which <0, 3>, having the smallest score, is excluded. The pairs that finally remain are <1, 9> and <2, 4>. After these two pairs are specified, the key is restored from each pair. The paths d on T _c are 1, 2 and 2, 1, respectively. The key in the original dictionary corresponding to each path is uniquely obtained by tracing xbw from the root, giving “$ abcc #” and “$ cab #”.

Next, the data size when the data structure described in this embodiment is used will be described. If a trie tree T and a score array R _k are given, the relation t > l holds between the number of nodes t and the number of keys l. In general, the number of nodes t is about 10 times the number of keys l.

When the data structure described in this embodiment is used, the data size is expressed by Equation 2 below.

| XBW | + | first RMQ structure (prefix) | + | T _c (Patricia tree) | + | second RMQ structure (key) | + | R _k (score) | ... (Equation 2)

In Equation 2, | XBW | represents the data size when the trie tree T is represented by xbw, and | R _k (score) | represents the size of the array of character string scores.

T _c (Patricia tree) is generated from the trie tree T, and its tree structure is expressed in DFUDS. In the present embodiment, a bit string with the same number of bits as the number of nodes is prepared, and each bit of this bit string is used to identify whether or not the corresponding node is a leaf node. The Patricia tree of this embodiment is represented only by a tree structure from which the character strings are removed. This is because the character string information is obtained from xbw as described above.

The Patricia tree has at most 2l - 1 nodes. DFUDS requires twice as many bits as the number of nodes, and the bit string for identifying leaf nodes requires the same number of bits as the number of nodes. Therefore, | T _c (Patricia tree) | is represented by 6l + o (l) bits.
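The bound can be checked term by term. The following is a sketch of the arithmetic, assuming the Patricia tree has its maximum of 2l - 1 nodes and folding the auxiliary indexes into the o(l) term:

```latex
\begin{align*}
|T_c| &= \underbrace{2\,(2l-1)}_{\text{DFUDS}}
       + \underbrace{(2l-1)}_{\text{leaf bit string}}
       + o(l) \\
      &= 6l - 3 + o(l) \;\le\; 6l + o(l) \quad \text{bits}
\end{align*}
```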

Further, a second RMQ structure (key) is generated for the key character string score array R _k (score). The | second RMQ structure (key) | is represented by 2l + o (l) bits.

Assuming that | XBW | and | R _k (score) | are the minimum data necessary for realizing the dictionary and the scores, the overhead of the data structure described in this embodiment is at most 2t + 6l + o (t) bits. Thus, the amount of data is reduced compared with the general method.

Next, the amount of calculation when the data structure described in this embodiment is used will be described. The amount of calculation is O (k (log (k) + | P | + h)). Here, | P | indicates the length of the query, and h indicates the average length of the keys registered in the dictionary. Thus, when the data structure described in the present embodiment is used, the search process can be executed in an amount of calculation that does not depend on the data size.

Next, the outline of the present invention will be described. FIG. 8 is a block diagram showing an outline of a character string search device according to the present invention. The character string search device according to the present invention searches, from a set of search candidate character strings (for example, keys) each associated with a character string score indicating the degree to which it should be preferentially searched, for search candidate character strings that include an input character string. The device includes: a prefix set specifying unit 81 (for example, the prefix set specifying unit 20) that specifies a set of prefixes ending with the input character string from a set of prefixes, each being a character string of one or more consecutive characters extracted from the first character of a search candidate character string (for example, a set of prefixes based on an XBW data structure); a prefix specifying unit 82 (for example, the prefix specifying unit 31) that specifies, from the set of prefixes ending with the input character string, the prefix having the maximum prefix score, where the prefix score is defined for each prefix by the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix (for example, the prefix score defined by Equation 1); and a character string specifying unit 83 (for example, the character string specifying unit 32) that specifies the search candidate character string having the maximum character string score from among the search candidate character strings starting with the specified prefix.

With this configuration, the prefix specifying unit 82 specifies a prefix having a large prefix score, and the character string specifying unit 83 searches for a search candidate character string having a large character string score among those starting with that prefix. Searching in descending order of score in this way allows the top k search candidate character strings to be found efficiently.

Furthermore, the character string search device may include a search management unit (for example, the search management unit 30) that manages the search range. The search management unit may specify a range of search candidate character strings that excludes the already specified search candidate character strings from the search candidate character strings starting with the prefix of the search candidate character string specified by the character string specifying unit 83, and may specify a range of prefixes that excludes the prefix specified by the prefix specifying unit 82 from the set of prefixes specified by the prefix set specifying unit 81. Further, the prefix specifying unit 82 may specify the prefix having the maximum prefix score from the range of prefixes specified by the search management unit, and the character string specifying unit 83 may specify the search candidate character string having the maximum character string score from the range of search candidate character strings specified by the search management unit.

At this time, the search management unit may include a queue (for example, a priority queue) that holds the prefix/prefix score pairs specified by the prefix specifying unit 82 and the search target character string/character string score pairs specified by the character string specifying unit 83. The search management unit may then identify, from the pairs held in the queue, the prefix or search target character string whose prefix score or character string score is the maximum. If the maximum score is a character string score, the search management unit may exclude the search target character string having that character string score from the queue and specify it as an output target; if the maximum score is a prefix score, it may exclude the prefix having that prefix score from the queue. Furthermore, if the maximum score is a prefix score, the prefix specifying unit 82 may specify the prefix having the next largest prefix score after the prefix excluded from the queue. If the maximum score is a character string score, the character string specifying unit 83 may specify, among the search target character strings starting with the same prefix as the one used to specify the search target character string excluded from the queue, the search target character string having the next largest character string score after the excluded one; if the maximum score is a prefix score, it may specify the search target character string having the largest character string score from the search target character strings starting with the prefix specified by the prefix specifying unit 82.

In this way, by holding both the prefix/prefix score pairs and the search target character string/character string score pairs in one queue, it can be determined from the scores held in the queue whether the highest score is a prefix score or a character string score. Based on the highest score, the prefix specifying unit 82 and the character string specifying unit 83 repeat the above-described processing, whereby search target character strings having higher character string scores can be specified efficiently.

The character string search device may also include a search information storage unit (for example, the search information storage unit 50) that stores a set of prefixes having an XBW data structure (for example, xbw) generated from a set of search candidate character strings represented by a trie tree data structure, and a Patricia tree generated from the trie tree data structure and having only a tree structure that represents the parent-child relationship between nodes, with the character string corresponding to each node of the Patricia tree excluded (for example, the Patricia tree T _c ). Then, the prefix specifying unit 82 may specify the position of the prefix having the largest prefix score from the set of prefixes having the XBW data structure, and the search management unit may specify, from the position of the specified prefix, the position of the corresponding node in the Patricia tree (for example, u _c ). With such a configuration, the amount of stored data used for the search can be reduced.

At this time, the character string specifying unit 83 may specify the position (for example, u _i ) of the search candidate character string having the maximum character string score from the search candidate character strings existing under the position of the node specified by the search management unit, and may specify the search candidate character string corresponding to the specified position from the set of prefixes having the XBW data structure.

In addition, the prefix specifying unit 82 may specify the prefix having the maximum prefix score by performing a range search on the specified set of prefixes using a first RMQ structure that represents the relationship between prefixes and prefix scores.

In addition, the character string specifying unit 83 may specify the search candidate character string having the maximum character string score by performing a range search on the search candidate character strings starting with the specified prefix using a second RMQ structure that represents the relationship between search candidate character strings and character string scores.

Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to those embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2013-171291 filed on August 21, 2013, the entire disclosure of which is incorporated herein.

The present invention is preferably applied to a character string search device that searches for a key that partially matches an input character string. The character string search device according to the present invention can be used, for example, when providing a search service.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Prefix set specifying unit
30 Search management unit
31 Prefix specifying unit
32 Character string specifying unit
40 Output unit
50 Search information storage unit

Claims

A character string search device for searching for a search candidate character string that includes an input character string from a set of search candidate character strings each associated with a character string score indicating the degree to which it should be preferentially searched, the device comprising:
a prefix set specifying unit that specifies a set of prefixes ending with the input character string from a set of prefixes, each being a character string of one or more consecutive characters extracted from the first character of a search candidate character string;
a prefix specifying unit that specifies, from the set of prefixes ending with the input character string, the prefix having the maximum prefix score, the prefix score being defined for each prefix by the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix; and
a character string specifying unit that specifies the search candidate character string having the maximum character string score from the search candidate character strings starting with the specified prefix.
The character string search device according to claim 1, further comprising a search management unit that manages a search range, wherein
the search management unit specifies a range of search candidate character strings that excludes the already specified search candidate character strings from the search candidate character strings starting with the prefix of the search candidate character string specified by the character string specifying unit, and specifies a range of prefixes that excludes the prefix specified by the prefix specifying unit from the set of prefixes specified by the prefix set specifying unit,
the prefix specifying unit specifies the prefix having the maximum prefix score from the range of prefixes specified by the search management unit, and
the character string specifying unit specifies the search candidate character string having the maximum character string score from the range of search candidate character strings specified by the search management unit.
The character string search device according to claim 2, wherein
the search management unit includes a queue that holds pairs each consisting of a prefix specified by the prefix specifying unit and its prefix score, and pairs each consisting of a search target character string specified by the character string specifying unit and its character string score,
the search management unit specifies, from the pairs held in the queue, the prefix or search target character string having the largest score among the prefix scores and character string scores, excludes the search target character string having that character string score from the queue when the largest score is a character string score, and excludes the prefix having that prefix score from the queue when the largest score is a prefix score,
the prefix specifying unit specifies, when the largest score is a prefix score, the prefix having the next largest prefix score after the prefix excluded from the queue, and
the character string specifying unit specifies, when the largest score is a character string score, the search target character string having the next largest character string score after the excluded search target character string, from among the search target character strings starting with the same prefix as the prefix used to specify the search target character string excluded from the queue, and specifies, when the largest score is a prefix score, the search target character string having the largest character string score from among the search target character strings starting with the prefix specified by the prefix specifying unit.
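As a non-authoritative sketch, the queue-driven enumeration recited above can be modeled in Python with a binary heap. The function name `top_k`, the pre-sorted lists standing in for "next largest" lookups, the sample data, and the `seen` set that suppresses a candidate reached through more than one prefix are all assumptions of this sketch and are not taken from the claims.

```python
import heapq
from itertools import count


def top_k(candidates, query, k):
    """Return up to k (string, score) pairs containing `query`, best first."""
    # For each prefix ending with the query, list the candidates starting
    # with it, ordered by descending score (stand-in for a range search).
    under = {}
    for s in candidates:
        for i in range(max(1, len(query)), len(s) + 1):
            if s[:i].endswith(query):
                under.setdefault(s[:i], [])
    for p in under:
        under[p] = sorted((s for s in candidates if s.startswith(p)),
                          key=lambda s: (-candidates[s], s))
    # Prefixes ordered by prefix score (score of the best string under each).
    ranked = sorted(under, key=lambda p: (-candidates[under[p][0]], p))

    heap, tie = [], count()  # heapq is a min-heap, so scores are negated

    def push_prefix(i):
        if i < len(ranked):
            score = candidates[under[ranked[i]][0]]
            heapq.heappush(heap, (-score, next(tie), "prefix", i, 0))

    def push_string(i, j):
        strings = under[ranked[i]]
        if j < len(strings):
            heapq.heappush(heap, (-candidates[strings[j]], next(tie), "string", i, j))

    push_prefix(0)
    results, seen = [], set()
    while heap and len(results) < k:
        _, _, kind, i, j = heapq.heappop(heap)
        if kind == "prefix":
            # Largest score is a prefix score: queue the next-best prefix
            # and the best string under the prefix just removed.
            push_prefix(i + 1)
            push_string(i, 0)
        else:
            # Largest score is a string score: output it, then queue the
            # next-best string under the same prefix.
            s = under[ranked[i]][j]
            if s not in seen:  # assumption: report each candidate once
                seen.add(s)
                results.append((s, candidates[s]))
            push_string(i, j + 1)
    return results


sample = {"tokyo": 5, "kyoto": 9, "toy": 3, "yokohama": 7}
print(top_k(sample, "to", 3))
```

Because each entry is queued only when its predecessor is removed, the queue holds at most one pending entry per active prefix plus one pending prefix, which is what allows results to be produced in descending score order without ranking every match in advance.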
The character string search device according to any one of claims 1 to 3, further comprising a search information storage unit that stores a set of prefixes that has an XBW data structure and is generated from a set of search candidate character strings represented by a trie data structure, and a Patricia tree that is generated from the trie data structure and has only a tree structure in which the character string corresponding to each node of the tree is excluded and the parent-child relationships between the nodes are included, wherein
the prefix specifying unit specifies the position of the prefix having the largest prefix score from the set of prefixes having the XBW data structure, and
the search management unit specifies the position of the corresponding node in the Patricia tree from the specified position of the prefix.
The character string search device according to claim 4, wherein the character string specifying unit specifies the position of the search candidate character string having the largest character string score from among the search candidate character strings existing under the position of the node specified by the search management unit, and specifies the search candidate character string corresponding to the specified position from the set of prefixes having the XBW data structure.
The character string search device according to any one of claims 1 to 5, wherein the prefix specifying unit specifies the prefix having the largest prefix score by performing a range search on the specified set of prefixes using a first RMQ structure, based on the relationship between prefixes and prefix scores represented by the first RMQ structure.
The character string search device according to any one of claims 1 to 6, wherein the character string specifying unit specifies the search candidate character string having the largest character string score by performing a range search on the search candidate character strings starting with the specified prefix using a second RMQ structure, based on the relationship between search candidate character strings and character string scores represented by the second RMQ structure.
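The range searches recited for the first and second RMQ structures can be illustrated with a range maximum query over an array of scores. The following is a minimal sketch that uses a sparse table, a standard RMQ technique chosen here for brevity and not necessarily the structure used in the embodiment. The class name `SparseTableRMQ` and the assumption that the prefixes (or candidates) of interest occupy a contiguous index range `[lo, hi)` are hypothetical.

```python
class SparseTableRMQ:
    """Answer 'index of the largest score in scores[lo:hi]' in O(1) per query."""

    def __init__(self, scores):
        self.scores = list(scores)
        n = len(self.scores)
        # table[k][i] = index of the maximum in scores[i : i + 2**k]
        self.table = [list(range(n))]
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            self.table.append([self._better(prev[i], prev[i + span])
                               for i in range(n - 2 * span + 1)])
            span *= 2

    def _better(self, i, j):
        # Ties are resolved in favour of the lower index.
        return i if self.scores[i] >= self.scores[j] else j

    def argmax(self, lo, hi):
        """Index of the largest score in the half-open range [lo, hi)."""
        if not 0 <= lo < hi <= len(self.scores):
            raise ValueError("empty or out-of-bounds range")
        level = (hi - lo).bit_length() - 1
        span = 1 << level
        # Two overlapping blocks of length `span` cover the whole range.
        return self._better(self.table[level][lo], self.table[level][hi - span])


# Hypothetical prefix scores stored in the order of the prefix array.
rmq = SparseTableRMQ([3, 9, 5, 7, 1])
print(rmq.argmax(2, 5))
```

Excluding an already specified entry at index `m` from a range `[lo, hi)`, as the search management unit does, then amounts to querying the two sub-ranges `[lo, m)` and `[m + 1, hi)` separately.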
A character string search method for searching for a search candidate character string that includes an input character string, from a set of search candidate character strings each associated with a character string score indicating the degree to which the character string should be preferentially searched, the method comprising:
a prefix set specifying step of specifying a set of prefixes ending with the input character string, from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string;
a prefix specifying step of specifying, from the set of prefixes ending with the input character string, the prefix having the largest prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix; and
a character string specifying step of specifying the search candidate character string having the largest character string score from among the search candidate character strings starting with the specified prefix.
The character string search method according to claim 8, further comprising a search management step of managing the search range, wherein
in the search management step, a range of search candidate character strings excluding the already specified search candidate character string is specified from the search candidate character strings starting with the prefix of the search candidate character string specified in the character string specifying step, and a range of prefixes excluding the prefix specified in the prefix specifying step is specified from the set of prefixes specified in the prefix set specifying step,
in the prefix specifying step, the prefix having the largest prefix score is specified from the range of prefixes specified in the search management step, and
in the character string specifying step, the search candidate character string having the largest character string score is specified from the range of search candidate character strings specified in the search management step.
A character string search program applied to a computer that searches for a search candidate character string that includes an input character string, from a set of search candidate character strings each associated with a character string score indicating the degree to which the character string should be preferentially searched, the program causing the computer to execute:
a prefix set specifying process of specifying a set of prefixes ending with the input character string, from a set of prefixes, each prefix being a character string of one or more consecutive characters extracted from the first character of a search candidate character string;
a prefix specifying process of specifying, from the set of prefixes ending with the input character string, the prefix having the largest prefix score, the prefix score being defined for each prefix as the largest character string score among the character string scores associated with the search candidate character strings starting with that prefix; and
a character string specifying process of specifying the search candidate character string having the largest character string score from among the search candidate character strings starting with the specified prefix.
The character string search program according to claim 10, further causing the computer to execute a search management process of managing the search range, wherein
in the search management process, a range of search candidate character strings excluding the already specified search candidate character string is specified from the search candidate character strings starting with the prefix of the search candidate character string specified in the character string specifying process, and a range of prefixes excluding the prefix specified in the prefix specifying process is specified from the set of prefixes specified in the prefix set specifying process,
in the prefix specifying process, the prefix having the largest prefix score is specified from the range of prefixes specified in the search management process, and
in the character string specifying process, the search candidate character string having the largest character string score is specified from the range of search candidate character strings specified in the search management process.