JP2012190379A

JP2012190379A - Search device, program and method

Info

Publication number: JP2012190379A
Application number: JP2011055019A
Authority: JP
Inventors: Nobuhiro Yugami; 伸弘湯上
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-03-14
Filing date: 2011-03-14
Publication date: 2012-10-04
Anticipated expiration: 2031-03-14
Also published as: JP5614338B2

Abstract

PROBLEM TO BE SOLVED: To search for similar character strings at high speed. SOLUTION: The device includes: a data storage unit that stores first search data for the search target character strings and, for each integer N from 1 up to a first predetermined number of characters, second search data for reversed character strings in which the order of the Nth and subsequent characters of each search target character string is reversed; a first search unit that performs a forward-match search of the search character string against the first search data and detects search target character strings that forward-match in at least a second predetermined number of characters; a second search unit that performs a forward-match search of a search key, obtained by reversing the order of the Mth and subsequent characters of the search character string for an integer M, against the second search data corresponding to the integer M, detects reversed character strings that forward-match in at least the second predetermined number of characters, and determines whether a reversed character string matching in M-2 characters exists; and a control unit that outputs to the second search unit a search instruction for each integer M from the second predetermined number of characters down to 1, except immediately after the second search unit has determined that no reversed character string matching in M-2 characters exists.

Description

The present technology relates to character string search technology.

Name identification is the process of collecting, from records described by multiple attributes such as name, address, and telephone number, the records that represent the same entity. Whether two records represent the same entity is judged attribute by attribute, for example by evaluating the similarity of character strings. Attribute values of records that represent the same entity should in principle be identical, but for various reasons they may be merely similar. Examples include a variant character being entered ("渡邊商店" where "渡辺商店" was intended), an abbreviation being used ("ＡＡ研" where "ＡＡ研究所" was intended), and an input error ("Ａっつ研究所" where "ＡＡ研究所" was intended).

In name identification, records are extracted even when their values are similar rather than identical, and they are evaluated together with other factors to identify the group of records presumed to represent the same entity.

When a database storing the URLs (Uniform Resource Locators) of harmful Web sites is searched with an input URL, there is a technique that performs the character string search by exploiting the characteristics of URLs. Specifically, the host + domain part of a URL becomes more specific from right to left, the host part can easily be changed, and subdomains can be defined for the domain part. To give priority to matches from the right, the host + domain part is therefore registered in the database with its character order reversed; for example, www.abc.def.co.jp is registered as pj.oc.fed.cba.www. At search time, the host + domain part of the input URL is reversed in the same way and a forward-match search of the character string is performed. However, this technique merely reorders characters at the host + domain delimiters based on the characteristics of URLs and cannot be applied to name identification as it is. In general, the position at which two character strings differ is indeterminate, and there is no fixed delimiter position such as a slash.
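The prior-art idea can be illustrated with a minimal sketch. The function name `reversed_key` and the host name `mail.example.com` are assumptions made for illustration only; the cited publication may store and match the keys differently.

```python
def reversed_key(host):
    """Reverse the host + domain part so the top-level domain comes first."""
    return host[::-1]


# Hypothetical blocklist, stored as sorted reversed keys.
blocklist = sorted(reversed_key(h) for h in ["www.abc.def.co.jp", "mail.example.com"])

# An input host under the same domain shares a long prefix once reversed.
query = reversed_key("ftp.abc.def.co.jp")
domain_prefix = reversed_key(".abc.def.co.jp")
matches = [k for k in blocklist if k.startswith(domain_prefix)]
```

Because the reversed keys share their leading characters whenever the domain is the same, a right-to-left comparison becomes an ordinary forward-match search.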

When a general search character string is compared with a general search target character string, the character positions at which they differ vary. Extracting, from a large number of search target character strings, those that are similar to the search character string not only from the front but also from the back takes considerable processing time once all the possible positions of the difference are taken into account.

JP 2006-221294 A

Accordingly, an object of the present technology is, in one aspect, to provide a technique for searching for similar character strings at high speed.

The search device includes: (A) a first data storage unit that stores first search data for the search target character strings and, for each integer N from 1 up to a first predetermined number of characters, second search data for reversed character strings obtained by reversing the order of the Nth and subsequent characters of each search target character string; (B) a first search unit that performs a forward-match search of the search character string against the first search data stored in the first data storage unit, detects search target character strings that forward-match in at least a second predetermined number of characters, and stores the identifiers of the detected search target character strings in a second data storage unit; (C) a second search unit that performs a forward-match search of a search key, obtained by reversing the order of the Mth and subsequent characters of the search character string for an integer M, against the second search data for the integer M stored in the first data storage unit, detects reversed character strings that forward-match in at least the second predetermined number of characters, stores the identifiers of the search target character strings corresponding to the detected reversed character strings in the second data storage unit, and determines whether a reversed character string matching in M-2 characters exists; and (D) a control unit that outputs to the second search unit a search instruction for each integer M from the second predetermined number of characters down to 1, except immediately after the second search unit has determined that no reversed character string matching in M-2 characters exists.

Similar character strings can be searched for at high speed.

FIG. 1 is a diagram for explaining comparison between character strings.
FIG. 2 is a diagram showing a specific example of processing in the present embodiment.
FIG. 3A is a diagram for explaining the outline of the present embodiment.
FIG. 3B is a diagram for explaining the outline of the present embodiment.
FIG. 3C is a diagram for explaining the outline of the present embodiment.
FIG. 3D is a diagram for explaining the outline of the present embodiment.
FIG. 4 is a functional block diagram of the search device according to the present embodiment.
FIG. 5 is a functional block diagram of the search data generation unit.
FIG. 6 is a functional block diagram of the search processing unit.
FIG. 7 is a diagram showing the processing flow of the search data generation processing.
FIG. 8 is a diagram showing an example of search target character strings.
FIG. 9 is a diagram showing an example of the search target character strings after sorting.
FIG. 10A is a diagram showing an example of the first group.
FIG. 10B is a diagram showing an example of the second group.
FIG. 11 is a diagram showing an example of the search data generation processing.
FIG. 12 is a diagram showing an example of processing for a group.
FIG. 13 is a diagram showing an example of the search data for the last character "e".
FIG. 14 is a diagram showing an example of the search data for the last character "g".
FIG. 15 is a diagram showing the processing flow of the search processing.
FIG. 16 is a diagram showing a specific example for explaining the search processing.
FIG. 17 is a diagram showing the processing flow of the search processing.
FIG. 18 is a diagram showing a specific example for explaining the search processing.
FIG. 19 is a diagram showing a specific example for explaining the search processing.
FIG. 20 is a functional block diagram of a computer.

When data is entered into a database that is the subject of name identification, an input error or the like is unlikely to occur more than once in the same attribute of the same record. It is therefore considered harmless to assume that the character string of each attribute differs in at most one place and that the length of the differing part is small compared with the whole character string.

If there is only one differing part, it is the section from the position where the first different character appears when comparing from the beginning to the position where the first different character appears when comparing from the end. A simple search method therefore combines a forward-match search with a backward-match search. For example, when character string S and character string T are compared as shown in FIG. 1, "a" and "b" match from the front toward the back and "g" and "f" match from the back toward the front. Two characters match at each end, so a total of four characters are judged to match.

In the present embodiment, the following processing is performed in order to extract, at high speed, character strings that satisfy the condition "number of characters matching from the beginning + number of characters matching from the end ≥ minimum number of matching characters".
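The condition can be checked directly for a single pair of strings. The following is a minimal sketch, not the patented method itself: the function names are assumptions, and the two example strings are hypothetical because FIG. 1 is not reproduced in the text (only the matching characters "a", "b" and "g", "f" are known). The tail count is capped so that no character is counted at both ends.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def match_count(s, t):
    """Return (head, tail): characters matching from the front and from the back."""
    head = common_prefix_len(s, t)
    # Cap the tail so that head and tail never overlap in the shorter string.
    limit = min(len(s), len(t)) - head
    tail = min(common_prefix_len(s[::-1], t[::-1]), limit)
    return head, tail


def is_similar(s, t, min_match):
    head, tail = match_count(s, t)
    return head + tail >= min_match
```

For the hypothetical pair "abcdefg" and "abxyfg", `match_count` returns two matches at each end, so the pair satisfies the condition for a minimum of four matching characters. Applying this check to every search target character string is the slow baseline that the embodiment aims to avoid.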

Specifically, as shown in FIG. 2, the following describes the processing when the search character string "abcde" shown in (g) is entered against the search target character strings shown in (a) and the character strings that match in four or more characters are to be extracted.

In the present embodiment, the following are prepared: (b) the result of sorting (a) as it is in dictionary order; (c) the result of sorting in dictionary order the reversed character strings obtained by reversing the order of the fourth and subsequent characters of each search target character string; (d) the same for the third and subsequent characters; (e) the same for the second and subsequent characters; and (f) the same for the first and subsequent characters.
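The preparation of (b) to (f) can be sketched as follows. The helper name `reverse_from` and the data layout (sorted lists of `(string, ID)` pairs) are assumptions for illustration. Only three target strings are used: "abcdg" (ID 5), and "abcge" (ID 4) and "bcdbe" as inferred from the reversed forms quoted in the walk-through; the ID given to "bcdbe" is hypothetical, and the full contents of FIG. 2 (a) are not available.

```python
def reverse_from(s, n):
    """Reverse the order of the n-th and subsequent characters (n is 1-based)."""
    return s[:n - 1] + s[n - 1:][::-1]


# Partial, partly hypothetical stand-in for FIG. 2 (a): ID -> search target string.
targets = {1: "bcdbe", 4: "abcge", 5: "abcdg"}

# (b): the strings themselves in dictionary order.
plain_index = sorted((s, ident) for ident, s in targets.items())

# (c) to (f): reversed strings for reversal positions 4, 3, 2 and 1, each sorted.
reversed_indexes = {
    n: sorted((reverse_from(s, n), ident) for ident, s in targets.items())
    for n in (4, 3, 2, 1)
}
```

The search character string is transformed with the same helper, so `reverse_from("abcde", 4)` gives the key "abced" that is looked up in the data for reversal position 4.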

Then, (b) is searched with the first four characters "abcd" of the search character string, and the search target character strings that forward-match in four or more characters are extracted. In this example, "abcdg" with ID "5" is identified.

Next, (c) is searched with the first four characters "abce" of the reversed character string (i) "abced", which is obtained by reversing the order of the fourth and subsequent characters of the search character string, and the reversed character strings that forward-match in four or more characters are detected. In this example, "abceg" with ID "4" is identified. In the present embodiment, it is also determined whether there is a reversed character string that matches in exactly 2 characters, that is, the reversal position (also called the cut point) "4" minus 2. No such string exists in (c). When none exists, the search processing for reversal position "3" is skipped for the reason described below. When one exists, the search processing for reversal position "3" is performed.

The reason the search processing can be skipped partway through in this manner is explained with reference to FIGS. 3A to 3D.

For example, as shown on the left side of FIG. 3A, when the cut point (reversal position) is the fourth character and the first five characters match, moving the cut point to the third character results in a match of the first four characters, as shown on the right side of FIG. 3A. In other words, when the number of matches ≥ cut point - 1, shifting the cut point by one toward the beginning always reduces the number of matches by one.

As shown on the left side of FIG. 3B, when the cut point (reversal position) is the fourth character and only the first character matches, moving the cut point to the third character leaves the one-character match unchanged, as shown on the right side of FIG. 3B. In other words, when the number of matches < cut point - 2, shifting the cut point by one toward the beginning does not change the number of matches.

Further, as shown on the left side of FIG. 3C, when the cut point (reversal position) is the fourth character and the first two characters match, the two-character match is unchanged when the cut point is 3, as shown on the right side of FIG. 3C. Here the original character string in the upper row is "abcdefg", the original character string in the lower row is "abhefk", and their last characters do not match. In other words, when the number of matches = cut point - 2 and the last characters do not match, shifting the cut point by one toward the beginning does not change the number of matches.

In FIG. 3D, by contrast, the original character string in the upper row is "abcdefg", the original character string in the lower row is "abhefg", and their last characters match. In this case, as shown on the left side of FIG. 3D, the first two characters match when the cut point (reversal position) is the fourth character, but the first five characters match when the cut point is 3, as shown on the right side of FIG. 3D.

Thus, when the number of matches = cut point - 2, the number of matches increases when the cut point is shifted to the left if the last characters of the original character strings match, and does not change if they do not match. Consequently, even without knowing whether the last characters match, if the forward-match search finds a search target character string whose number of matching characters equals the cut point minus 2, performing the forward-match search with the cut point shifted to the left may yield a desired search target character string.
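The behavior described for FIGS. 3C and 3D can be reproduced with the strings quoted in the text. This is a minimal sketch; the helper names `reverse_from`, `common_prefix_len` and `matches_at` are assumptions for illustration.

```python
def reverse_from(s, n):
    """Reverse the order of the n-th and subsequent characters (n is 1-based)."""
    return s[:n - 1] + s[n - 1:][::-1]


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def matches_at(cut, s, t):
    """Forward-match count of s and t after both are reversed at the cut point."""
    return common_prefix_len(reverse_from(s, cut), reverse_from(t, cut))


upper = "abcdefg"
fig_3c_lower = "abhefk"   # last characters differ
fig_3d_lower = "abhefg"   # last characters match

fig_3c = (matches_at(4, upper, fig_3c_lower), matches_at(3, upper, fig_3c_lower))
fig_3d = (matches_at(4, upper, fig_3d_lower), matches_at(3, upper, fig_3d_lower))
```

With cut point 4 both pairs match in two characters (cut point - 2). Moving the cut point to 3 leaves the FIG. 3C pair at two but raises the FIG. 3D pair to five, which is why a match count of exactly cut point - 2 is the only case in which the next cut point can produce new hits.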

If the forward-match search can be restricted to the search target character strings whose last character matches, the desired search target character strings can be obtained even more efficiently.

Returning to FIG. 2, as described above, (c) contains no reversed character string that matches in exactly 2 characters (cut point "4" minus 2), so the forward-match search for cut point "3" (the current cut point "4" minus 1) can be skipped.

Next, (e) is searched with the first four characters "aedc" of the reversed character string (k) "aedcb", which is obtained by reversing the order of the second and subsequent characters of the search character string, and the reversed character strings that forward-match in four or more characters are detected. No such reversed character string exists here. It is also determined whether there is a reversed character string that matches in exactly 0 characters (cut point "2" minus 2). Among the reversed character strings in (e) of FIG. 2, "bebdc" meets this condition, so the forward-match search for cut point "1" is executed.

Then, (f) is searched with the first four characters "edcb" of the reversed character string (l) "edcba", which is obtained by reversing the order of the first and subsequent characters of the search character string, and the reversed character strings that forward-match in four or more characters are detected. No such reversed character string exists here.

In this way, the search target character strings with IDs "4" and "5" are obtained as the final search result.

With this processing, the forward-match searches to be performed are thinned out, which reduces the processing load and speeds up the search.
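The walk-through above can be put together as a minimal sketch under several assumptions. Sorted Python lists searched with `bisect` stand in for whatever index structure the device actually uses; all function names are hypothetical; the sentinel character U+10FFFF is assumed never to occur in the data; reversed data is built only for cut points up to the minimum match count; and the target strings with IDs 2 and 3 are invented, since FIG. 2 (a) is only partly recoverable from the text.

```python
from bisect import bisect_left

SENTINEL = "\U0010FFFF"  # assumed to sort after every character in the data


def reverse_from(s, n):
    """Reverse the order of the n-th and subsequent characters (n is 1-based)."""
    return s[:n - 1] + s[n - 1:][::-1]


def build_indexes(targets, min_match):
    """Key 0 holds the plain sorted strings; key n holds strings reversed at n."""
    indexes = {0: sorted((s, ident) for ident, s in targets.items())}
    for n in range(1, min_match + 1):
        indexes[n] = sorted((reverse_from(s, n), ident) for ident, s in targets.items())
    return indexes


def prefix_range(index, prefix):
    """Slice bounds of the entries whose string starts with prefix."""
    lo = bisect_left(index, (prefix,))
    hi = bisect_left(index, (prefix + SENTINEL,))
    return lo, hi


def search(indexes, query, min_match):
    """Return (IDs of hits, cut points actually searched)."""
    hits = set()
    lo, hi = prefix_range(indexes[0], query[:min_match])
    hits.update(ident for _, ident in indexes[0][lo:hi])
    searched, skip_next = [], False
    for m in range(min_match, 0, -1):
        if skip_next:  # no string matched in exactly (m + 1) - 2 characters
            skip_next = False
            continue
        searched.append(m)
        key = reverse_from(query, m)
        lo, hi = prefix_range(indexes[m], key[:min_match])
        hits.update(ident for _, ident in indexes[m][lo:hi])
        if m >= 2:
            # Strings sharing key[:m-2] but not key[:m-1] match in exactly m-2 characters.
            wide_lo, wide_hi = prefix_range(indexes[m], key[:m - 2])
            narrow_lo, narrow_hi = prefix_range(indexes[m], key[:m - 1])
            skip_next = (wide_hi - wide_lo) == (narrow_hi - narrow_lo)
    return hits, searched


targets = {1: "bcdbe", 2: "acbde", 3: "bcdeg", 4: "abcge", 5: "abcdg"}
indexes = build_indexes(targets, 4)
hits, searched = search(indexes, "abcde", 4)
```

With this data the sketch finds IDs 4 and 5 and searches cut points 4, 2 and 1 while skipping 3, in line with the walk-through. The exact-match test is done with two range lookups rather than a scan, so the skip decision costs only a few binary searches.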

Except when the minimum number of matching characters is satisfied by the forward match alone, when a mismatch occurs in the middle of the character string as shown in FIG. 1, the minimum number of matching characters is satisfied only if the last character of the search target character string matches the last character of the search character string. From the standpoint of FIGS. 3A to 3D as well, knowing whether the last characters match makes it possible to reduce the processing load further. Therefore, if the search target character strings are grouped in advance by the type of their last character, the number of search target character strings that must be subjected to the forward-match search can be reduced further.

From the above standpoint, a search device as shown in FIGS. 4 to 19 is introduced.

As shown in FIG. 4, the search device 1000 according to the present embodiment includes: a search data generation unit 1100 that generates search data from the search target character strings stored in a search target character string storage unit 3000; a search data storage unit 1200 that stores the search data generated by the search data generation unit 1100; a search processing unit 1300 that accepts input of search conditions and performs the search processing described below on the search data storage unit 1200; a search result storage unit 1400 that stores the search results of the search processing unit 1300; and an output unit 1500 that outputs the search results stored in the search result storage unit 1400 based on the search data storage unit 1200.

As shown in FIG. 5, the search data generation unit 1100 includes a data reading unit 1110, a data division unit 1120, a character string group storage unit 1130, a character reversal unit 1140, a reversed character string storage unit 1150, and a sort processing unit 1160. The data reading unit 1110 reads the search target character strings from the search target character string storage unit 3000 and outputs them to the data division unit 1120 and the sort processing unit 1160. The data division unit 1120 groups the search target character strings obtained from the data reading unit 1110 and stores the result in the character string group storage unit 1130. Based on the data stored in the character string group storage unit 1130, the character reversal unit 1140 performs, for each character string group, the character reversal processing described below on the search target character strings belonging to the group, and stores the result in the reversed character string storage unit 1150. The sort processing unit 1160 sorts the reversed character strings stored in the reversed character string storage unit 1150 group by group, generates search index data, and stores it in the search data storage unit 1200.

As shown in FIG. 6, the search processing unit 1300 includes a search condition acquisition unit 1310, a search unit 1320, and a control unit 1340. The search condition acquisition unit 1310 accepts input of search conditions and outputs them to the search unit 1320 and the control unit 1340. The search unit 1320 includes a first search unit 1321 and a second search unit 1322. The first search unit 1321 performs a forward-match search of the search character string against the search data stored in the search data storage unit 1200 and stores the search result in the search result storage unit 1400. The second search unit 1322 performs a forward-match search against the search data stored in the search data storage unit 1200 in accordance with instructions from the control unit 1340 and stores the search result in the search result storage unit 1400. Part of the search result data is output to the control unit 1340.

The control unit 1340 includes a search key generation unit 1341 and a condition determination unit 1342. The search key generation unit 1341 generates a search key by reversing part of the characters of the search character string. The condition determination unit 1342 determines, before the search key is output, whether the conditions described in detail below are satisfied. A search flag used to decide whether the second search unit 1322 should perform a forward-match search is assumed to be held in a memory area that both the control unit 1340 and the second search unit 1322 can reference.

The processing of the search device 1000 is described below with reference to FIGS. 7 to 19. First, the processing of the search data generation unit 1100 is described with reference to FIGS. 7 to 14. The data reading unit 1110 reads the search target character strings from the search target character string storage unit 3000 and outputs them to the data division unit 1120 and the sort processing unit 1160 (FIG. 7: step S1). For example, assume that the search target character strings shown in FIG. 8 have been read.

The sort processing unit 1160 then acquires the search target character strings from the data reading unit 1110, sorts them in dictionary order, generates index data for the forward-match search without a cut point, and stores it in the search data storage unit 1200 (step S3). The search target character strings of FIG. 8 are sorted in ordinary dictionary order as shown in FIG. 9. The index data is generated from the sorted data.

The data division unit 1120 acquires the search target character strings from the data reading unit 1110 and groups them by their last character (step S5). In the example of FIG. 8, there is a group whose last character is "g" (FIG. 10A) and a group whose last character is "e" (FIG. 10B), so the strings are divided into two groups. Various variations of grouping are possible. For example, a group may be created for each distinct character, or several kinds of characters may be combined into one group. As the grouping result, the search target character strings of each group are stored in the character string group storage unit 1130.
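Step S5 can be sketched as follows for the simplest variation, one group per distinct last character. The function name, the dictionary layout, and the sample strings are assumptions for illustration; the actual contents of FIG. 8 are not reproduced in the text.

```python
from collections import defaultdict


def group_by_last_char(targets):
    """Map each last character to the {ID: string} entries that end with it."""
    groups = defaultdict(dict)
    for ident, s in targets.items():
        if s:  # empty strings have no last character and are left out
            groups[s[-1]][ident] = s
    return dict(groups)


# Hypothetical search target character strings.
targets = {1: "bcdbe", 3: "bcdeg", 4: "abcge", 5: "abcdg"}
groups = group_by_last_char(targets)
```

At search time only the group whose key equals the last character of the search character string needs to be consulted for the reversed-string searches, which is where the reduction in candidates comes from.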

Next, the character reversal unit 1140 identifies one unprocessed group (step S7). The character reversal unit 1140 then sets N to (the longest character string length in the identified group - 1) (step S9). The processing proceeds to FIG. 11 via terminal A.

Moving on to the processing of FIG. 11, the character reversal unit 1140 initializes a counter i to 1 (step S11). For each search target character string in the identified group, the character reversal unit 1140 then generates a reversed character string in which the i-th and subsequent characters are reversed, and stores it in the reversed character string storage unit 1150 (step S13).

For example, when the group whose last character is "e" is processed, the search target character strings shown in (a) of FIG. 12 are the objects of processing. If i = 3, the order of the third and subsequent characters is reversed, and the reversed character strings shown in (b) are generated.

Further, the sort processing unit 1160 sorts the generated reversed character strings stored in the reversed character string storage unit 1150 in dictionary order, generates index data for the forward-match search, and stores it in the search data storage unit 1200 (step S15). In the example of FIG. 12, the sort result is as shown in (c) of FIG. 12.

In summary, when the group whose last character is "e" is processed, as shown in FIG. 13, the data in (a) is generated for i = 1, the data in (b) for i = 2, the data in (c) for i = 3, and the data in (d) for i = 4.

Likewise, when the group whose last character is "g" is processed, as shown in FIG. 14, the data in (a) is generated for i = 1, the data in (b) for i = 2, the data in (c) for i = 3, and the data in (d) for i = 4.

Then, the character reversing unit 1140 determines whether i has reached N (step S17). If i is less than N, the character reversing unit 1140 increments i by 1 (step S20) and returns to step S13. If i is greater than or equal to N, the character reversing unit 1140 determines whether an unprocessed group remains in the character string group storage unit 1130 (step S19). If an unprocessed group exists, the processing returns to step S7 in FIG. 7 via terminal B. If no unprocessed group exists, the processing ends.
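The data generation loop (grouping by last character, then steps S11 to S20) can be sketched as follows. This is an illustrative sketch only: the names `build_indexes` and `reverse_from`, the dictionary layout, and the sample strings are assumptions and do not come from the patent or its figures.

```python
from collections import defaultdict


def reverse_from(s: str, i: int) -> str:
    """s with its i-th (1-based) and subsequent characters reversed."""
    return s[: i - 1] + s[i - 1 :][::-1]


def build_indexes(targets: dict, n_max: int) -> dict:
    """Build one sorted index per (last character, cut point i).

    targets maps identifier -> search target string.
    Returns {(last_char, i): sorted list of (inverted_string, identifier)}.
    """
    # Group the search target strings by their last character.
    groups = defaultdict(dict)
    for ident, s in targets.items():
        if s:  # empty strings have no last character; skipped in this sketch
            groups[s[-1]][ident] = s

    indexes = {}
    for last_char, members in groups.items():
        # i runs from 1 to n_max (steps S11, S17, S20).
        for i in range(1, n_max + 1):
            entries = [(reverse_from(s, i), ident) for ident, s in members.items()]
            # Dictionary-order sort prepares the data for forward match search (step S15).
            indexes[(last_char, i)] = sorted(entries)
    return indexes


# Hypothetical sample data (not the strings of FIG. 12 to FIG. 14).
sample = {1: "zzzzg", 2: "acxye", 3: "xbyze", 4: "abcxe", 5: "abcdf"}
idx = build_indexes(sample, 4)
```

Keeping the identifier next to each inverted string lets a hit be mapped back to the original search target character string without a separate lookup table.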

Through the above processing, the search data storage unit 1200 stores search data such as that shown in FIG. 9, FIG. 13, and FIG. 14.

When grouping is not performed, all search target character strings may be set as a single group and the above processing carried out. In that case, the data shown in (b) to (f) in the upper part of FIG. 2 is generated instead of the data in FIG. 13 and FIG. 14.

Next, the processing at search time will be described with reference to FIGS. 15 to 19. First, the search condition acquisition unit 1310 of the search processing unit 1300 accepts input of a search condition (a search character string and a minimum number of matching characters L) from a searcher (FIG. 15: step S21). The search device 1000 may also be connected to a network and receive the search condition from another terminal connected to that network. The search condition acquisition unit 1310 outputs the acquired search condition data to the control unit 1340 and the search unit 1320. In the following, it is assumed that the search character string “abcde” is input with the minimum number of matching characters L = 4.

Then, the first search unit 1321 of the search unit 1320 performs a forward match search for the search character string on the search index without a cut point in the search data storage unit 1200, extracts the identifiers of the character strings that match L or more characters, and stores them in the search result storage unit 1400 (step S23).

For example, in FIG. 16, a forward match search for the search character string (f) “abcde” is performed on the search data without a cut point shown in (a). More specifically, the forward match search is performed using the first L characters, “abcd”. In this case, ID “5” is obtained.
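The first search (step S23) amounts to a prefix lookup in a sorted list. A minimal sketch using Python's standard `bisect` module is given below; the function name `forward_match` and the sample index are assumptions made for illustration, and the strings are not those of FIG. 16.

```python
from bisect import bisect_left


def forward_match(index, key):
    """Return identifiers of all entries whose string starts with key.

    index is a list of (string, identifier) tuples sorted in dictionary order.
    """
    # (key,) sorts before every (key, identifier), so this finds the first
    # entry whose string is >= key.
    pos = bisect_left(index, (key,))
    hits = []
    # Entries sharing the prefix are contiguous in a sorted list.
    while pos < len(index) and index[pos][0].startswith(key):
        hits.append(index[pos][1])
        pos += 1
    return hits


# Hypothetical uncut index.
uncut = sorted([("zzzzg", 1), ("acxye", 2), ("xbyze", 3), ("abcxe", 4), ("abcdf", 5)])
L = 4
first_hits = forward_match(uncut, "abcde"[:L])  # search with the first L characters
```

With this sample data the lookup for “abcd” returns only identifier 5, mirroring the outcome described for FIG. 16.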

Meanwhile, the condition determination unit 1342 of the control unit 1340 sets the cut point N to the minimum number of matching characters L (step S25). The control unit 1340 also initially sets a search flag indicating that the next forward match search should be performed (step S27). The processing then proceeds to FIG. 17 via terminal C.

Then, the condition determination unit 1342 of the control unit 1340 determines whether N = 0 (step S29). If N is not 0, the condition determination unit 1342 determines whether the search flag is set (step S31). On the first pass the search flag is always set, so the processing proceeds toward step S33.

If the search flag is set, the condition determination unit 1342 turns the search flag off (step S32).

Thereafter, the condition determination unit 1342 outputs an instruction to the search key generation unit 1341. In response, the search key generation unit 1341 generates a search key in which the order of the N-th and subsequent characters of the search character string is reversed, and outputs the search key to the second search unit 1322 together with the cut point N and the last character of the search character string (step S33). On receiving the search key together with the cut point and the last character of the search character string from the control unit 1340, the second search unit 1322 performs a forward match search for the search key on the index for cut point N in the group for the last character of the search character string in the search data storage unit 1200, extracts the identifiers of the inverted character strings that match at least the minimum number of matching characters L (in this embodiment, the same identifiers as those of the original search target character strings), and stores them in the search result storage unit 1400 (step S35). If grouping by last character has not been performed, the processing may treat the data as a single group.

Since the search character string is “abcde”, the group whose last character is “e” is used. If N = 4, a forward match search is performed on the search data shown in FIG. 16(b). Because the fourth and subsequent characters are reversed, the search key is “abced”, as shown in FIG. 16(g). As shown in FIG. 18, the number of matching characters is “1” for the inverted character string with ID “2”, “4” for the inverted character string with ID “4”, and “0” for the inverted character string with ID “3”. Accordingly, ID “4” is extracted and stored in the search result storage unit 1400.
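The match counts used in this example are lengths of the common prefix between the search key and each inverted character string. The sketch below shows one way to compute them. The inverted strings are invented so that they reproduce the counts 1, 4, and 0 reported for FIG. 18; the actual strings of the figure are not given in the text, and the names `match_count` and `second_search` are assumptions.

```python
def match_count(key: str, entry: str) -> int:
    """Number of leading characters on which key and entry agree."""
    n = 0
    for a, b in zip(key, entry):
        if a != b:
            break
        n += 1
    return n


def second_search(index, key, min_match):
    """Search one cut-point index.

    index is a list of (inverted_string, identifier).
    Returns (identifiers matching at least min_match characters,
             {identifier: match count}).
    A linear scan is used for clarity; a sorted index would allow bisection.
    """
    counts = {ident: match_count(key, inv) for inv, ident in index}
    hits = [ident for ident, c in counts.items() if c >= min_match]
    return hits, counts


# Hypothetical index for cut point N = 4 in the group ending in "e".
index_n4 = [("abcex", 4), ("acxey", 2), ("xbyez", 3)]
hits, counts = second_search(index_n4, "abced", 4)
```

Returning the per-entry counts alongside the hits makes the later check for an entry matching exactly N-2 characters (step S37) a simple lookup.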

The second search unit 1322 also determines whether an inverted character string whose number of matching characters is (N-2) exists, and sets the search flag if one exists (step S37). As described above, the search flag is set when an inverted character string with at least the minimum number of matching characters L may be extracted in the next search; otherwise the search flag is left off.

In the example of FIG. 18, no inverted character string has a number of matching characters of 2, which is N = 4 minus 2, so the search flag remains off. The processing then proceeds to step S41.

In step S41, the condition determination unit 1342 decrements N by 1. The processing then returns to step S29.

In the specific example above, the search flag is off for N = 3, so the second search unit 1322 does not perform the forward match search. Accordingly, in step S31 it is determined that the search flag is not set, and the condition determination unit 1342 sets the search flag (step S39). The fact that the search flag has been turned off once does not mean that all subsequent forward match searches can be skipped. The search flag is therefore set here so that the forward match search is performed for the next N. The processing then proceeds to step S41.

In the example of FIG. 16 as well, the search key (h) for N = 3 is not generated, and no forward match search is performed on the search data (c).

When N = 2, the search flag is on, so a search key in which the order of the second and subsequent characters is reversed is generated, as shown in FIG. 16(i). A forward match search for the search key is then performed on the search data (d) for N = 2. No inverted character string matches four or more characters. In this case, however, as shown in FIG. 19, there is an inverted character string whose number of matching characters is “0”, which is N = 2 minus 2. Accordingly, the search flag is set.

N then becomes 1, and since the search flag is on, a search key in which the order of the first and subsequent characters is reversed is generated, as shown in FIG. 16(j). A forward match search for the search key is then performed on the search data (e) for N = 1. No inverted character string matches four or more characters. When N = 1, step S37 may be skipped.

Through the above processing, the identifiers {4, 5} of the search target character strings are stored in the search result storage unit 1400.

If it is determined in step S29 that N = 0, the output unit 1500 extracts, from the search data storage unit 1200, the search target character strings corresponding to the identifiers stored in the search result storage unit 1400, and outputs them (step S43).
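The control flow of steps S23 to S41, including the search flag that lets a cut point be skipped, can be sketched end to end as follows. This is an illustrative sketch, not the disclosed implementation: all function names and the sample strings are assumptions, linear scans stand in for the sorted indexes, and the query is assumed to be non-empty and at least L characters long.

```python
def reverse_from(s, i):
    """s with its i-th (1-based) and subsequent characters reversed."""
    return s[: i - 1] + s[i - 1 :][::-1]


def match_count(key, entry):
    """Number of leading characters on which key and entry agree."""
    n = 0
    for a, b in zip(key, entry):
        if a != b:
            break
        n += 1
    return n


def search(targets, query, L):
    """targets: {identifier: string}.

    Returns (sorted hit identifiers, list of cut points actually searched).
    """
    # Step S23: first search on the uncut data, using the first L characters.
    results = {i for i, s in targets.items() if s.startswith(query[:L])}

    # Only the group sharing the query's last character is searched.
    group = {i: s for i, s in targets.items() if s and s[-1] == query[-1]}

    n = L          # step S25: the cut point starts at L
    flag = True    # step S27
    searched = []
    while n != 0:  # step S29
        if flag:   # step S31
            flag = False                  # step S32
            key = reverse_from(query, n)  # step S33
            counts = {i: match_count(key, reverse_from(s, n))
                      for i, s in group.items()}
            # Step S35: keep entries matching at least L characters.
            results.update(i for i, c in counts.items() if c >= L)
            # Step S37: search the next cut point only if some entry
            # matches exactly n - 2 characters.
            if any(c == n - 2 for c in counts.values()):
                flag = True
            searched.append(n)
        else:
            flag = True                   # step S39: never skip twice in a row
        n -= 1                            # step S41
    return sorted(results), searched


# Hypothetical data chosen to mirror the walkthrough (query "abcde", L = 4).
sample = {1: "zzzzg", 2: "acxye", 3: "xbyze", 4: "abcxe", 5: "abcdf"}
found, cut_points = search(sample, "abcde", 4)
```

With this sample data the cut point N = 3 is skipped and the identifiers 4 and 5 are returned, which matches the walkthrough above; the strings themselves are invented.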

Through the above processing, on the premise that the character strings differ in at most one place, similar character strings that satisfy the condition that the sum of the number of matching characters from the front and the number of matching characters from the back is at least a given value can be searched for at high speed. Moreover, the processing is simplified because forward matching and backward matching need not be used together.
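The link between the cut-point indexes and the "front plus back" condition can be made explicit. The following is an editorial reading of the scheme rather than a formula stated in the patent; the symbols pre, suf, lcp, and the superscript notation are introduced here for illustration, and both strings are assumed to be at least L characters long.

```latex
% pre(s,q): length of the longest common prefix of target s and query q
% suf(s,q): length of the longest common suffix of s and q
% s^{(N)}:  s with its N-th and subsequent characters reversed
% lcp(x,y): length of the longest common prefix of x and y
\[
\operatorname{lcp}\bigl(s^{(N)}, q^{(N)}\bigr) \ge L
\;\Longleftrightarrow\;
\operatorname{pre}(s,q) \ge N-1
\ \text{and}\
\operatorname{suf}(s,q) \ge L-(N-1)
\qquad (1 \le N \le L,\ |s|,|q| \ge L)
\]
\[
\Longrightarrow\quad \operatorname{pre}(s,q) + \operatorname{suf}(s,q) \ge L
\]
```

Conversely, if pre(s,q) + suf(s,q) is at least L and pre(s,q) is less than L, choosing N = pre(s,q) + 1 satisfies both inequalities, so s is detected at that cut point; if pre(s,q) is L or more, s is detected by the first search on the uncut data.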

One embodiment of the present technology has been described above, but the present technology is not limited to it.

For example, the functional block diagrams of FIGS. 4 to 6 are examples and may not necessarily correspond to actual program modules. In the processing flows as well, the order of steps may be changed or steps may be executed in parallel as long as the same processing results are obtained. For example, the search processing of the first search unit 1321 and that of the second search unit 1322 may be performed in parallel.

Similarly, although the case where the search data generation unit 1100 and the search processing unit 1300 are implemented in a single search device 1000 has been shown, they may be implemented in different devices. Furthermore, the processing of the search data generation unit 1100 may be performed by a plurality of devices, and the processing of the search processing unit 1300 may likewise be performed by a plurality of devices.

The search device 1000 described above is a computer device in which, as shown in FIG. 20, a memory 2501, a CPU (Central Processing Unit) 2503, a hard disk drive (HDD) 2505, a display control unit 2507 connected to a display device 2509, a drive device 2513 for a removable disk 2511, an input device 2515, and a communication control unit 2517 for connecting to a network are connected by a bus 2519. An operating system (OS) and an application program for performing the processing in this embodiment are stored in the HDD 2505 and are read from the HDD 2505 into the memory 2501 when executed by the CPU 2503. The CPU 2503 controls the display control unit 2507, the communication control unit 2517, and the drive device 2513 according to the processing content of the application program so that they perform predetermined operations. Data in the middle of processing is mainly stored in the memory 2501 but may be stored in the HDD 2505. In the embodiment of the present technology, the application program for performing the processing described above is stored on and distributed via the computer-readable removable disk 2511 and is installed from the drive device 2513 onto the HDD 2505. It may also be installed onto the HDD 2505 via a network such as the Internet and the communication control unit 2517. Such a computer device realizes the various functions described above through organic cooperation between hardware such as the CPU 2503 and the memory 2501 and programs such as the OS and the application program.

The embodiment described above can be summarized as follows.

The search device according to the present embodiment includes:
(A) a first data storage unit that stores first search data for search target character strings and, for each integer N from 1 to a first predetermined number of characters, second search data for inverted character strings in which the order of the N-th and subsequent characters of each search target character string is reversed;
(B) a first search unit that performs a forward match search for a search character string (for example, the leading portion of the search character string consisting of a second predetermined number of characters) on the first search data stored in the first data storage unit, detects search target character strings that forward-match by at least the second predetermined number of characters, and stores the identifiers of the detected search target character strings in a second data storage unit;
(C) a second search unit that performs, on the second search data for an integer M stored in the first data storage unit, a forward match search for a search key in which the order of the M-th and subsequent characters of the search character string is reversed (for example, the leading portion of the search key consisting of the second predetermined number of characters), detects inverted character strings that forward-match by at least the second predetermined number of characters, stores the identifiers of the search target character strings corresponding to the detected inverted character strings in the second data storage unit, and determines whether an inverted character string that matches M-2 characters exists; and
(D) a control unit that outputs, to the second search unit, a search instruction for each integer M from the second predetermined number of characters down to 1, except immediately after the second search unit has determined that no inverted character string matching M-2 characters exists.

This reduces the number of forward match searches performed by the second search unit, which speeds up the search processing.

The second search data described above may be grouped by the type of the last character of the search target character strings. In that case, the second search unit may perform the forward match search on the second search data belonging to the group whose last-character type matches the last character of the search character string. This narrows down the search target character strings further and speeds up the search processing.

The control unit described above may output a search instruction to the second search unit with the second predetermined number of characters as the integer M; when the second search unit determines that an inverted character string matching M-2 characters exists, output a search instruction to the second search unit with M-1 as the new integer M; and when the second search unit determines that no inverted character string matching M-2 characters exists, output a search instruction to the second search unit with M-2 as the new integer M.
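This variant of the control can be sketched as a loop that steps the cut point down by 1 or by 2. The sketch below only tracks which cut points are visited; the predicate `has_m_minus_2_match` is a hypothetical stand-in for the determination made by the second search unit, and the function name is likewise an assumption.

```python
from typing import Callable, List


def cut_points_visited(L: int, has_m_minus_2_match: Callable[[int], bool]) -> List[int]:
    """Return the cut points M for which a search instruction is issued.

    L is the second predetermined number of characters (the first M).
    has_m_minus_2_match(M) reports whether the search at cut point M found
    an inverted string matching exactly M - 2 characters (hypothetical hook).
    """
    m = L
    visited = []
    while m >= 1:
        visited.append(m)
        # Step down by 1 if the next cut point may still yield hits,
        # otherwise skip it and step down by 2.
        m = m - 1 if has_m_minus_2_match(m) else m - 2
    return visited


# Outcome pattern of the earlier walkthrough: no match at M = 4, a match at M = 2.
order = cut_points_visited(4, lambda m: m == 2)
```

For the walkthrough pattern this visits the cut points 4, 2, and 1, the same sequence that the search flag produces in steps S29 to S41.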

Furthermore, the search device described above may further include a search data generation unit that, for each integer N from 1 to the first predetermined number of characters, generates inverted character strings in which the order of the N-th and subsequent characters of each search target character string is reversed, and sorts the inverted character strings to generate the second search data. The second search data is thereby generated automatically.

The search device may also further include a search data generation unit that groups the search target character strings by the type of their last character and, for each group and for each integer N from 1 to the first predetermined number of characters, generates inverted character strings in which the order of the N-th and subsequent characters of each search target character string belonging to the group is reversed, and sorts the inverted character strings to generate the second search data. The second search data is thus generated automatically even when grouping by last character is used.

A program for causing a computer to perform the above processing can be created, and the program is stored in a computer-readable storage medium or storage device such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a hard disk. Intermediate processing results are temporarily stored in a storage device such as a main memory.

The following appendices are further disclosed regarding the embodiments including the above example.

(Appendix 1)
A search device comprising:
a first data storage unit that stores first search data for search target character strings and, for each integer N from 1 to a first predetermined number of characters, second search data for inverted character strings in which the order of the N-th and subsequent characters of each of the search target character strings is reversed;
a first search unit that performs a forward match search for a search character string on the first search data stored in the first data storage unit, detects a search target character string that forward-matches by at least a second predetermined number of characters, and stores an identifier of the detected search target character string in a second data storage unit;
a second search unit that performs, on the second search data for an integer M stored in the first data storage unit, a forward match search for a search key in which the order of the M-th and subsequent characters of the search character string is reversed, detects an inverted character string that forward-matches by at least the second predetermined number of characters, stores the identifier of the search target character string corresponding to the detected inverted character string in the second data storage unit, and determines whether an inverted character string that matches M-2 characters exists; and
a control unit that outputs, to the second search unit, a search instruction for each integer M from the second predetermined number of characters down to 1, except immediately after the second search unit has determined that no inverted character string matching M-2 characters exists.

(Appendix 2)
The search device according to Appendix 1, wherein
the second search data is grouped by the type of the last character of the search target character strings, and
the second search unit performs the forward match search on the second search data belonging to the group whose last-character type matches the last character of the search character string.

(Appendix 3)
The search device according to Appendix 1 or 2, wherein the control unit
outputs a search instruction to the second search unit with the second predetermined number of characters as the integer M,
outputs, when the second search unit determines that an inverted character string matching M-2 characters exists, a search instruction to the second search unit with M-1 as the new integer M, and
outputs, when the second search unit determines that no inverted character string matching M-2 characters exists, a search instruction to the second search unit with M-2 as the new integer M.

(Appendix 4)
The search device according to any one of Appendices 1 to 3, further comprising
a search data generation unit that, for each integer N from 1 to the first predetermined number of characters, generates inverted character strings in which the order of the N-th and subsequent characters of each of the search target character strings is reversed, and sorts the inverted character strings to generate the second search data.

(Appendix 5)
The search device according to Appendix 2, further comprising
a search data generation unit that groups the search target character strings by the type of their last character and, for each group and for each integer N from 1 to the first predetermined number of characters, generates inverted character strings in which the order of the N-th and subsequent characters of each of the search target character strings belonging to the group is reversed, and sorts the inverted character strings to generate the second search data.

(Appendix 6)
A search processing method executed by a computer, the method comprising:
a step of performing a forward match search for a search character string on first search data stored in a first data storage unit, the first data storage unit storing the first search data for search target character strings and, for each integer N from 1 to a first predetermined number of characters, second search data for inverted character strings in which the order of the N-th and subsequent characters of each of the search target character strings is reversed, detecting a search target character string that forward-matches by at least a second predetermined number of characters, and storing an identifier of the detected search target character string in a second data storage unit; and
a step of performing search processing for each integer M from the second predetermined number of characters down to 1, except immediately after it has been determined in the search processing that no inverted character string matching M-2 characters exists,
wherein the search processing is processing of performing, on the second search data for the integer M stored in the first data storage unit, a forward match search for a search key in which the order of the M-th and subsequent characters of the search character string is reversed, detecting an inverted character string that forward-matches by at least the second predetermined number of characters, storing the identifier of the search target character string corresponding to the detected inverted character string in the second data storage unit, and determining whether an inverted character string that matches M-2 characters exists.

(Appendix 7)
A search processing program that causes a computer to execute:
a step of performing a forward match search for a search character string on first search data stored in a first data storage unit, the first data storage unit storing the first search data for search target character strings and, for each integer N from 1 to a first predetermined number of characters, second search data for inverted character strings in which the order of the N-th and subsequent characters of each of the search target character strings is reversed, detecting a search target character string that forward-matches by at least a second predetermined number of characters, and storing an identifier of the detected search target character string in a second data storage unit; and
a step of performing search processing for each integer M from the second predetermined number of characters down to 1, except immediately after it has been determined in the search processing that no inverted character string matching M-2 characters exists,
wherein the search processing is processing of performing, on the second search data for the integer M stored in the first data storage unit, a forward match search for a search key in which the order of the M-th and subsequent characters of the search character string is reversed, detecting an inverted character string that forward-matches by at least the second predetermined number of characters, storing the identifier of the search target character string corresponding to the detected inverted character string in the second data storage unit, and determining whether an inverted character string that matches M-2 characters exists.

1000 search device
3000 search target character string storage unit
1100 search data generation unit
1200 search data storage unit
1300 search processing unit
1400 search result storage unit
1500 output unit

Claims

A search device comprising:
a first data storage unit that stores first search data for search target character strings and, for each integer N from 1 to a first predetermined number of characters, second search data for inverted character strings in which the order of the N-th and subsequent characters of each of the search target character strings is reversed;
a first search unit that performs a forward match search for a search character string on the first search data stored in the first data storage unit, detects a search target character string that forward-matches by at least a second predetermined number of characters, and stores an identifier of the detected search target character string in a second data storage unit;
a second search unit that performs, on the second search data for an integer M stored in the first data storage unit, a forward match search for a search key in which the order of the M-th and subsequent characters of the search character string is reversed, detects an inverted character string that forward-matches by at least the second predetermined number of characters, stores the identifier of the search target character string corresponding to the detected inverted character string in the second data storage unit, and determines whether an inverted character string that matches M-2 characters exists; and
a control unit that outputs, to the second search unit, a search instruction for each integer M from the second predetermined number of characters down to 1, except immediately after the second search unit has determined that no inverted character string matching M-2 characters exists.

2. The search device according to claim 1, wherein the second search data is grouped by the type of the last character of the search target character strings, and the second search unit performs the forward match search on the second search data belonging to the group whose last-character type matches the last character of the search character string.

3. The search device according to claim 1, further comprising a search data generation unit that, for each integer N that is greater than or equal to 1 and less than or equal to the first predetermined number of characters, generates inverted character strings in each of which the order of the N-th and subsequent characters of the corresponding search target character string is reversed, and generates the second search data by sorting the inverted character strings.

4. The search device according to claim 2, further comprising a search data generation unit that groups the search target character strings by the type of the last character of each search target character string and, for each group and for each integer N that is greater than or equal to 1 and less than or equal to the first predetermined number of characters, generates inverted character strings in each of which the order of the N-th and subsequent characters of each search target character string belonging to the group is reversed, and generates the second search data by sorting the inverted character strings.

5. A search processing method executed by a computer, the method comprising:
a step of performing a forward match search of a search character string on first search data stored in a first data storage unit, the first data storage unit storing the first search data for search target character strings and, for each integer N that is greater than or equal to 1 and less than or equal to a first predetermined number of characters, second search data for inverted character strings in each of which the order of the N-th and subsequent characters of the corresponding search target character string is reversed, detecting search target character strings whose leading characters match for a second predetermined number of characters or more, and storing identifiers of the detected search target character strings in a second data storage unit; and
a step of performing search processing for each integer M from the second predetermined number of characters to 1, except immediately after it is determined in the search processing that no inverted character string matching M-2 characters exists,
wherein the search processing includes performing, on the second search data for the integer M stored in the first data storage unit, a forward match search of a search key in which the order of the M-th and subsequent characters of the search character string is reversed, detecting inverted character strings whose leading characters match for the second predetermined number of characters or more, storing identifiers of the search target character strings corresponding to the detected inverted character strings in the second data storage unit, and determining whether an inverted character string matching M-2 characters exists.

6. A search processing program for causing a computer to execute:
a step of performing a forward match search of a search character string on first search data stored in a first data storage unit, the first data storage unit storing the first search data for search target character strings and, for each integer N that is greater than or equal to 1 and less than or equal to a first predetermined number of characters, second search data for inverted character strings in each of which the order of the N-th and subsequent characters of the corresponding search target character string is reversed, detecting search target character strings whose leading characters match for a second predetermined number of characters or more, and storing identifiers of the detected search target character strings in a second data storage unit; and
a step of performing search processing for each integer M from the second predetermined number of characters to 1, except immediately after it is determined in the search processing that no inverted character string matching M-2 characters exists,
wherein the search processing includes performing, on the second search data for the integer M stored in the first data storage unit, a forward match search of a search key in which the order of the M-th and subsequent characters of the search character string is reversed, detecting inverted character strings whose leading characters match for the second predetermined number of characters or more, storing identifiers of the search target character strings corresponding to the detected inverted character strings in the second data storage unit, and determining whether an inverted character string matching M-2 characters exists.