JP5237400B2

JP5237400B2 - Search device

Info

Publication number: JP5237400B2
Application number: JP2011010528A
Authority: JP
Inventors: 潤小林; 篤哉鈴木; 哲夫佐藤; 時広松村
Original assignee: Bank of Tokyo Mitsubishi UFJ Trust Co
Current assignee: MUFG Bank Ltd
Priority date: 2011-01-21
Filing date: 2011-01-21
Publication date: 2013-07-17
Anticipated expiration: 2031-01-21
Also published as: JP2012150751A

Description

The present invention relates to a search device. In particular, it relates to a search device that searches a set of registered character strings for a given character string and detects whether the given character string is one of the registered character strings.

Detecting whether a character string is a registered character string is a basic technique in information processing and is used across a wide range of technical fields. In banking, for example, it is used to confirm whether the addressee of a transfer from another bank, particularly a foreign bank, holds an account at the receiving bank.

One known method for detecting whether a character string is a registered character string uses a B-tree (see, for example, Patent Document 1). In a B-tree, as shown in FIG. 10, each node of the tree structure stores values and pointers to child nodes. Slot 1001 stores a pointer to the child node holding values smaller than “23” (for example, values that precede “23” in dictionary order), slot 1002 stores a pointer to the child node holding values larger than “23” and smaller than “71”, and slot 1003 stores a pointer to the child node holding values larger than “71”. When a value is given, it is compared with “23” and “71” in turn. If it equals “23” or “71”, it is found to be a registered value; if it differs from both, the search moves to a child node according to the result of the comparison.

Another known method determines whether a given character string is contained in a character string table as follows: a character string is taken from the table and its first character is compared with the first character of the given character string; if they match, the lengths are compared; and if the lengths also match, the characters are compared starting from the end of the character string (see, for example, Patent Document 2).

For full-text search of documents, which are a kind of character string, a technique is also known that records the length of each word together with the word in order to eliminate duplicate words (see, for example, Patent Document 3).

Patent Document 1: JP-A-8-87511
Patent Document 2: JP-A-7-85070
Patent Document 3: JP-A-2000-200287

Conventional techniques using a B-tree or the like make the search more efficient by narrowing down the registered character strings. Within the narrowed range, however, the character strings must still be compared one by one. To confirm that a character string is not registered, every character string within the narrowed range must therefore be compared. Consequently, detecting that a character string is registered takes little effort, but detecting that it is not registered takes considerable effort.

One object of the present invention is to provide a search device and the like that detects, with little effort, both that a given character string is registered and that it is not registered.

Accordingly, one embodiment of the present invention provides a search device comprising: an index memory that stores tree structure data, the tree structure data being formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging common token sequences from the start end, with each node of the tree structure data associated with the maximum and minimum values of the length to a terminal node; a tokenizer that converts a character string to be searched into a sequence of one or more tokens; and a search unit that detects whether the character string to be searched exists among the plurality of character strings by scanning the token sequence produced by the tokenizer while selecting child nodes of the nodes of the tree structure data stored in the index memory, thereby searching for a path starting from a root node. If none of the child nodes of a node has a range, from the minimum to the maximum value of the length to a terminal node, that contains the number of unscanned tokens, the search unit detects that the character string to be searched does not exist among the plurality of character strings.

More specifically, the search unit includes a token sequence holding unit, a token count holding unit, a token number holding unit, and a node information holding unit. The token sequence holding unit holds information on the token sequence produced by the tokenizer. The token count holding unit holds the number of unscanned tokens. The token number holding unit holds the position, within the token sequence, of the token being scanned. The node information holding unit holds information on the node currently reached. If none of the child nodes of the node currently reached has a range, from the minimum to the maximum value of the length to a terminal node, that contains the number held by the token count holding unit, the search unit detects that the character string to be searched does not exist among the plurality of character strings.
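The length-range check performed by the search unit can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the implementation of the embodiment: the `Node` class, its field names, and the `search` function are hypothetical; lengths are counted in nodes; and no registered token sequence is assumed to be a proper prefix of another, so that a registered sequence always ends at a terminal node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    token: str
    max_len: int = 1  # longest length (in nodes) from this node to a terminal node
    min_len: int = 1  # shortest length (in nodes) from this node to a terminal node
    children: List["Node"] = field(default_factory=list)


def search(roots: List[Node], tokens: List[str]) -> bool:
    """Return True if the token sequence is registered in the forest `roots`."""
    if not tokens:
        return False
    candidates = roots
    remaining = len(tokens)  # number of unscanned tokens
    for tok in tokens:
        # Keep only the nodes whose [min_len, max_len] range contains `remaining`.
        in_range = [c for c in candidates if c.min_len <= remaining <= c.max_len]
        if not in_range:
            return False  # early rejection: no path of a suitable length exists
        node = next((c for c in in_range if c.token == tok), None)
        if node is None:
            return False  # no candidate carries the token being scanned
        remaining -= 1
        candidates = node.children
    # The last matched node had min_len == 1, so it is a terminal node.
    return True
```

In this sketch a query whose token count falls outside the range stored at a root node is rejected before any token is compared, which is the early detection of non-registration that the paragraph above describes.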

Another embodiment of the present invention provides a program for a computer whose storage unit stores tree structure data, the tree structure data being formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging common token sequences from the start end, with each node of the tree structure data associated with the maximum and minimum values of the length to a terminal node. The program causes the computer to convert a character string to be searched into a sequence of one or more tokens and, by scanning the converted token sequence while selecting child nodes of the nodes of the tree structure data stored in the index memory, to search for a path starting from a root node and thereby detect whether the character string to be searched exists among the plurality of character strings. If none of the child nodes of a node has a range, from the minimum to the maximum value of the length to a terminal node, that contains the number of unscanned tokens, the program causes the computer to detect that the character string to be searched does not exist among the plurality of character strings.

A further embodiment of the present invention provides a method of operating a computer whose storage unit stores tree structure data, the tree structure data being formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging common token sequences from the start end, with each node of the tree structure data associated with the maximum and minimum values of the length to a terminal node. In the method, the computer converts a character string to be searched into a sequence of one or more tokens and, by scanning the converted token sequence while selecting child nodes of the nodes of the tree structure data stored in the index memory, searches for a path starting from a root node and thereby detects whether the character string to be searched exists among the plurality of character strings. If none of the child nodes of a node has a range, from the minimum to the maximum value of the length to a terminal node, that contains the number of unscanned tokens, the computer detects that the character string to be searched does not exist among the plurality of character strings.

According to the present invention, the fact that a given character string is not registered is detected at an early stage of the search for whether it is a registered character string. Both the registration and the non-registration of a given character string can therefore be detected with less effort than before.

FIG. 1 is a functional block diagram of a search device according to an embodiment of the present invention.
FIG. 2 shows an example of data structures corresponding to a token sequence into which a character string has been converted.
FIG. 3 shows an example of a data structure for constructing tree structure data according to an embodiment of the present invention.
FIG. 4 shows an example of a plurality of character strings and the tree structure corresponding to them according to an embodiment of the present invention.
FIG. 5 is a flowchart of the process of constructing tree structure data according to an embodiment of the present invention.
FIG. 6 is a functional block diagram of the search unit of a search device according to an embodiment of the present invention.
FIG. 7 is a flowchart of the processing of the search unit of a search device according to an embodiment of the present invention.
FIG. 8 shows an example of a tree structure and an example of the processing of the search unit according to an embodiment of the present invention.
FIG. 9 shows an example of tree structure data and a data structure that implements it according to an embodiment of the present invention.
FIG. 10 shows an example of a B-tree.

Modes for carrying out the present invention are described below as embodiments. The present invention is not limited to the following embodiments and can be carried out with various modifications.

(Embodiment 1)
FIG. 1 shows a functional block diagram of a search device according to an embodiment of the present invention. The search device 100 includes an index memory 101, a tokenizer 102, and a search unit 103.

(Index memory)
The index memory 101 stores one or more pieces of tree structure data. The tree structure represented by the tree structure data stored in the index memory 101 satisfies the following two conditions.
(Condition 1) It is formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging common token sequences from the start end.
(Condition 2) Each node of the tree structure is associated with the maximum and minimum values of the length to a terminal node.

(Definitions concerning tree structures)
In this embodiment, a tree structure is a mathematically defined graph structure that (a) has one or more nodes, (b) connects a start node and an end node by an edge having those nodes as its start point and end point, (c) is simply connected, and (d) contains no edges forming a loop. Data representing a tree structure is defined as tree structure data. Tree structure data can be expressed by a data structure representing nodes and edges. One example of such a data structure is the one described later with reference to FIG. 3.

(1) A tree structure, together with (2) its root node, (3) its terminal nodes, and (4) the maximum and minimum values of the length from the root node to a terminal node, can be defined inductively as follows.
(Definition 1) (1) A single node is a tree structure. (2, 3) That node is both the root node and a terminal node of the tree structure. (4) In this tree structure, the maximum and minimum values of the length from the root node to a terminal node are both 1.
(Definition 2) (1) Given one or more tree structures T_1, ..., T_n, the structure obtained by connecting a new node p, which is not a node of any of T_1, ..., T_n, to the root node of each of T_1, ..., T_n by n edges, each having node p as its start point and one of those root nodes as its end point, is a new tree structure. (2) The root node of this new tree structure is node p. (3) The terminal nodes of this new tree structure are the terminal nodes of T_1, ..., T_n. (4) The maximum value of the length from the root node of this new tree structure to a terminal node is 1 plus the largest of {the maximum length from the root node of T_1 to a terminal node, ..., the maximum length from the root node of T_n to a terminal node}, and the minimum value is 1 plus the smallest of {the minimum length from the root node of T_1 to a terminal node, ..., the minimum length from the root node of T_n to a terminal node}.
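The inductive definition of the maximum and minimum lengths translates directly into a recursive function. The following Python sketch is illustrative only: the representation of a tree as a `(label, children)` tuple and the name `path_lengths` are assumptions made for this example, and lengths are counted in nodes as in Definition 1.

```python
def path_lengths(tree):
    """Return (max_len, min_len) from the root of `tree` to its terminal nodes.

    `tree` is a (label, children) tuple, where `children` is a list of trees.
    """
    _label, children = tree
    if not children:
        # Definition 1: a single node is its own root and terminal node.
        return 1, 1
    # Definition 2: add 1 to the extreme values over the subtrees T_1, ..., T_n.
    sub = [path_lengths(child) for child in children]
    return 1 + max(mx for mx, _ in sub), 1 + min(mn for _, mn in sub)
```

For a node p whose subtrees are a single node and a chain of three nodes, this function gives a maximum of 4 and a minimum of 2.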

If, from an arbitrary node q of a tree structure, there is an edge e_1 with node q as its start point and node r_1 as its end point, an edge e_2 with node r_1 as its start point and node r_2 as its end point, ..., and an edge e_n with node r_(n-1) as its start point and node r_n as its end point, then the node sequence q, r_1, r_2, ..., r_(n-1), r_n is defined as a “path starting from node q”, and the length from node q to node r_n is defined as “n”.

(Definitions concerning token sequences)
FIG. 2 is a diagram for explaining the conversion of a character string into a sequence of one or more tokens. A token is a data structure representing a character or a word, a word being a sequence of characters. Converting a character string into a sequence of one or more tokens means converting the character string into a sequence of data structures representing its constituent characters or words. This conversion is sometimes called lexical analysis.

As an example, FIG. 2 shows that when the character string “TY HK BMB NKYM” is converted into a token sequence, treating the space character “ ” as a delimiter between words divides it into a sequence of four tokens corresponding to the words “TY”, “HK”, “BMB”, and “NKYM”.

(a) and (b) of FIG. 2 each show an example of a data structure representing the sequence of tokens obtained by converting a character string. In (a) of FIG. 2, an array of five elements (slots) is prepared. Slot 201 stores a pointer to the character string “TY” (the memory address at which “TY” is stored), slot 202 stores a pointer to the character string “HK”, slot 203 stores a pointer to the character string “BMB”, slot 204 stores a pointer to the character string “NKYM”, and slot 205 stores a value indicating the end of the token sequence (for example, NULL). Instead of an array of five slots, a data structure such as five list cells linked in series may be used.

(b) of FIG. 2 also shows an array of five slots, as in (a) of FIG. 2, but each slot stores an address within the memory area 210 holding the character string “TY HK BMB NKYM” together with the length of the character string that the token represents. That is, slot 211 stores the address in area 210 of “T”, the first character of “TY”, and 2, the number of characters in “TY”. Slot 212 stores the address in area 210 of “H”, the first character of “HK”, and 2, the number of characters in “HK”. Slot 213 stores the address in area 210 of “B”, the first character of “BMB”, and 3, the number of characters in “BMB”. Slot 214 stores the address in area 210 of “N”, the first character of “NKYM”, and 4, the number of characters in “NKYM”. Slot 215 stores a value indicating the end of the token sequence together with, for example, 0.
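The representation in (b) of FIG. 2 can be sketched in Python by recording, for each token, its offset in the original character string and its length. This is a minimal sketch under assumptions: the function name `tokenize_spans` is hypothetical, string offsets stand in for the memory addresses within area 210, and `(None, 0)` stands in for the end-of-sequence value of slot 215.

```python
def tokenize_spans(text, sep=" "):
    """Split `text` at `sep` and return (offset, length) pairs plus an end marker."""
    spans = []
    start = None  # offset of the token currently being read, if any
    for i, ch in enumerate(text):
        if ch == sep:
            if start is not None:
                spans.append((start, i - start))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text) - start))
    spans.append((None, 0))  # end of the token sequence
    return spans
```

Because each slot refers into the original string instead of holding a copy, the token text can be recovered as `text[offset:offset + length]` when it is needed for comparison.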

When converting a character string into a token sequence, the character string can be split at space characters or other symbols, as shown in FIG. 2, to obtain a sequence of word tokens. When the character string contains no such characters, or when they are not used as delimiters, predetermined words may instead be extracted in order from the beginning of the character string (its first character) to form the token data structures. For example, if the predetermined words are “TY” and “HK”, the character string “HKTY” can be converted into the token sequence corresponding to “HK” and “TY”. In the character string “TYKTJMHK”, “KTJM” is not a predetermined word, but “TY” before it and “HK” after it are, so the presence of the predetermined words allows the string to be converted into the token sequence of the words “TY”, “KTJM”, and “HK”.
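Tokenization by predetermined words can be sketched as a left-to-right scan. The following Python example is a sketch under assumptions, since the text does not specify how competing matches are resolved: the function name `tokenize_by_words` is hypothetical, the longest predetermined word is preferred at each position, characters that match no predetermined word are collected into a single token, and all predetermined words are assumed to be non-empty.

```python
def tokenize_by_words(text, words):
    """Split `text` into predetermined words and the runs of text between them."""
    ordered = sorted(words, key=len, reverse=True)  # try longer words first
    tokens = []
    unknown = ""  # characters seen since the last predetermined word
    i = 0
    while i < len(text):
        match = next((w for w in ordered if text.startswith(w, i)), None)
        if match is not None:
            if unknown:
                tokens.append(unknown)
                unknown = ""
            tokens.append(match)
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.append(unknown)
    return tokens
```

With the predetermined words “TY” and “HK”, this sketch splits “HKTY” into “HK”, “TY” and splits “TYKTJMHK” into “TY”, “KTJM”, “HK”, matching the two examples in the text.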

A character string can also be converted into a token sequence by making each of its characters correspond to one token. For example, the character string “TYHK” can be converted into the token sequence of the characters “T”, “Y”, “H”, and “K”. In general, depending on the character encoding, not every character occupies the same number of bytes. When such an encoding is used, the character string can be converted into a token sequence by treating each character as one token in this way.

In a token sequence, the first token is called the “start end of the token sequence”. In the example of FIG. 2, the token corresponding to “TY” is the start end of the token sequence.

(Definitions concerning the tree structure formed by merging token sequences)
In (Condition 1) above, “merging common token sequences from the start end” refers to the following process. Suppose a first token sequence A_1 A_2 ... A_n B_1 B_2 ... B_m and a second token sequence A_1 A_2 ... A_n C_1 C_2 ... C_k are given. The two token sequences share the subsequence A_1 A_2 ... A_n of n tokens from the start end A_1. The nodes corresponding to the tokens A_1 to A_n are therefore connected in series by edges, in order, to form a tree structure with the node corresponding to A_1 as its root node and the node corresponding to A_n as its terminal node. Next, the node corresponding to B_1 and the node corresponding to C_1 are made child nodes of the node corresponding to A_n, the nodes corresponding to the tokens B_1 to B_m are connected in series by edges in order, and the nodes corresponding to C_1 to C_k are likewise connected in series by edges in order.

If two token sequences share no common token sequence from the start end, two separate tree structures are formed. For example, for DEF and GEF, whose start-end tokens D and G differ, one tree structure is formed by connecting the nodes corresponding to tokens D, E, and F in series by edges in order, and another is formed by connecting the nodes corresponding to tokens G, E, and F in series by edges in order.

If an empty character string of zero characters is regarded as present at the head of every character string and that leading empty string is treated as a token, a single tree structure can be generated even when no common token sequence exists from the start end. An embodiment of the present invention is applicable in that case as well.

(Definitions concerning subtrees)
In the following, for a token sequence D_1 D_2 ... D_n, the tree structure obtained by connecting the nodes corresponding to D_1 to D_n in series by edges, in order, is called the tree structure corresponding to the token sequence D_1 D_2 ... D_n. A path obtained by starting at the root node of a tree structure and following edges from start point to end point in order is called a subtree of the tree structure. For example, the subtrees of the tree structure corresponding to the token sequence E_1 E_2 E_3 E_4 E_5 include the tree structures corresponding to E_1, E_1 E_2, E_1 E_2 E_3, and E_1 E_2 E_3 E_4.

When a tree structure formed by merging common token sequences from the start end already exists, adding a new token sequence means the following. If the tree structure corresponding to a subsequence from the start end of the new token sequence is a subtree of one of the existing tree structures, the root node of the tree structure corresponding to the remaining tokens is made a child node of the last node of that subtree. If the tree structure corresponding to the leading part of the new token sequence is not a subtree of any existing tree structure, the tree structure corresponding to the new token sequence is added.
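Adding a token sequence to existing tree structures in this way amounts to following the longest matching path from a root node and then attaching the remaining tokens as a chain. The Python sketch below illustrates this under assumptions: the `Node` class and the `add_sequence` function are hypothetical names, the set of tree structures is held as a list of root nodes, and the maximum and minimum lengths of (Condition 2) are left out to keep the example short.

```python
class Node:
    def __init__(self, token):
        self.token = token
        self.children = []  # child nodes; empty for a terminal node


def add_sequence(roots, tokens):
    """Merge `tokens` into the forest `roots`, sharing any common leading part."""
    siblings = roots
    for tok in tokens:
        # Reuse an existing node for this token if the path so far is shared.
        node = next((n for n in siblings if n.token == tok), None)
        if node is None:
            node = Node(tok)
            siblings.append(node)
        siblings = node.children
```

Starting from an empty list, adding DEF and then GEF yields two separate tree structures, because their start-end tokens differ; adding a sequence that begins with D afterwards extends the first of them.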

(Example of a data structure representing nodes and edges)
In this embodiment, as stated in (Condition 2), each node of the tree structure is associated with the maximum and minimum values of the length to a terminal node of the tree structure having that node as its root node. The tree structure data stored in the index memory 101 is therefore stored using, for example, the data structure shown in FIG. 3, which represents a node and the edges starting from that node.

In FIG. 3, one data structure contains four slots 301 to 304. The first slot 301 stores the character or character string represented by the token. The second slot stores the maximum value of the length to a terminal node of the tree structure having the node as its root node. The third slot stores the minimum value of that length. The fourth slot 304 stores information on child nodes, which represents the edges. The first slot 301 need not store the character or character string itself; it is sufficient that it stores information identifying the character or character string represented by the token, such as a pointer to it.

The child node information stored in slot 304 is the information needed to refer to the data structures representing the child nodes, for example the memory address at which the data structure corresponding to a child node is stored. If a node has several child nodes, several memory addresses are stored in the fourth slot 304. In the data structure representing a terminal node, no child node information is stored.
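The four-slot data structure of FIG. 3 could be modelled in Python as follows. This is only a sketch: the class name `NodeRecord` and its field names are hypothetical, and object references take the place of the memory addresses stored in the fourth slot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeRecord:
    token: str        # first slot 301: the character or character string of the token
    max_len: int = 1  # second slot: maximum length to a terminal node
    min_len: int = 1  # third slot: minimum length to a terminal node
    # fourth slot 304: references to child nodes; empty for a terminal node
    children: List["NodeRecord"] = field(default_factory=list)


# A terminal node and its parent, as in a two-node chain.
leaf = NodeRecord("D")
parent = NodeRecord("C", max_len=2, min_len=2, children=[leaf])
```

An empty `children` list here plays the role of a data structure in which no child node information is stored.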

(Example of a tree structure corresponding to token sequences)
FIG. 4 shows an example of a plurality of token sequences and the tree structure formed by merging their common token sequences from the start end.

As shown in (a) of FIG. 4, suppose that ABCD is given as the first token sequence, ABDEC as the second, ACEBEF as the third, and ACEBGH as the fourth. The tree structure formed by merging the common token sequences of these sequences from the start end is built, as shown in (b) of FIG. 4, from data structures 401 to 414, one for each node. For example, data structure 401 corresponds to the node for token A. Because token A appears at the start end of every token sequence, the maximum value of the length from the node corresponding to data structure 401 to a terminal node is 6, the number of tokens in ACEBEF and in ACEBGH, and the minimum value is 4, the number of tokens in ABCD.

The node corresponding to data structure 401 has two child nodes: the node corresponding to data structure 402 and the node corresponding to data structure 408. The former corresponds to token B and the latter to token C. The fourth slot of data structure 401 therefore stores the addresses at which data structure 402 and data structure 408 are stored, followed by the value NULL indicating the end of the list of addresses. Data structure 402 stores 4 and 3, corresponding to BDEC and BCD, as the maximum and minimum values of the length to a terminal node of the tree structure having the node corresponding to data structure 402 as its root node. As child node information, it stores the addresses of data structure 403, which corresponds to the root node of the tree structure for the token sequence CD, and data structure 405, which corresponds to the root node of the tree structure for the sequence DEC. The remaining data structures up to 414 are described in the same way.

In particular, for data structures 404, 407, 412, and 414, which correspond to terminal nodes, the maximum and minimum lengths to a terminal node of the tree rooted at the corresponding node are equal, both being 1. Whether a node is a terminal node can therefore be determined by checking whether its maximum and minimum lengths to a terminal node are both 1.
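The node layout and the terminal-node test described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `Node`, `build`, and `count` are hypothetical, Python object references stand in for stored addresses, a Python list replaces the NULL-terminated address list, and no token sequence is assumed to be a proper prefix of another.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str        # token string held by the node
    max_len: int      # longest path, counted in nodes, from here to a terminal node
    min_len: int      # shortest such path
    children: list = field(default_factory=list)  # child nodes (stands in for the address list)

    def is_terminal(self) -> bool:
        # A terminal node is recognized by max and min both being 1.
        return self.max_len == 1 and self.min_len == 1


def build(sequences):
    """Merge token sequences that share a leading token; return the root nodes."""
    tails_by_token = {}
    for seq in sequences:
        tails_by_token.setdefault(seq[0], []).append(seq[1:])
    nodes = []
    for token, tails in tails_by_token.items():
        lengths = [len(tail) + 1 for tail in tails]
        children = build([tail for tail in tails if tail])
        nodes.append(Node(token, max(lengths), min(lengths), children))
    return nodes


def count(nodes, predicate=lambda n: True):
    """Count the nodes in a forest that satisfy predicate."""
    return sum(predicate(n) + count(n.children, predicate) for n in nodes)


# The four token sequences of FIG. 4(a), one character per token.
roots = build(["ABCD", "ABDEC", "ACEBEF", "ACEBGH"])
node_a = roots[0]
node_b, node_c = node_a.children
print(node_a.max_len, node_a.min_len)                # 6 4
print(node_b.max_len, node_b.min_len)                # 4 3
print(count(roots), count(roots, Node.is_terminal))  # 14 4
```

The 14 nodes and 4 terminal nodes agree with data structures 401 to 414 and the terminal data structures 404, 407, 412, and 414 named in the text.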

(Example flowchart of the process of merging shared token sequences from the start end)
The method of building a tree structure from nodes corresponding to tokens, by merging shared token sequences from the start end, has been described above in general terms. FIG. 5 shows a concrete example of it as a flowchart.

In step S501, the variables L and M are initialized: L is set to the number of tokens in the token sequence to be merged, and M is set to 1. The flowchart of FIG. 5 scans the token sequence, takes the token strings in order from the start end, and compares them with the node strings in order from the root node of the tree structure. During this scan, L holds the number of tokens not yet scanned, and M holds the position, counted from the start end, of the token currently under consideration.

In step S502, the array N is initialized. Array N stores information on a path that starts at a root node of the tree structure. Since M is 1 at step S502, the M-th element N[M] is set to the address of the data structure that corresponds to a root node of the tree structure and that stores the string of the start-end token in its first slot. If no such root-node data structure exists, NULL is assigned to N[M].

In step S503, it is determined whether the value of N[M] is NULL. If it is NULL, no stored token sequence shares a start end with the sequence being merged. Processing therefore moves to step S504, where a tree structure corresponding to the token sequence is built and added.

If the value of N[M] is not NULL in step S503, a shared token sequence exists from the start end. Processing therefore moves to step S505, where the maximum and minimum values of the node represented by N[M] are adjusted. If L exceeds the node's maximum length to a terminal node, L is stored as the node's new maximum. If L is less than the node's minimum length to a terminal node, L is stored as the node's new minimum.

In step S506, because one token has been processed, L is decremented by 1, and M is incremented by 1 so that the next token can be processed. In step S507, it is determined whether L equals 0. If it does, nodes corresponding to every token in the token sequence already exist in the tree structure, and processing ends. If L does not equal 0, processing moves to step S508.

In step S508, among the child nodes of the node represented by N[M-1], the address of the node holding the string of the M-th token is assigned to N[M].

In step S509, it is determined whether the value of N[M] is NULL. If it is, processing moves to step S510, where a tree structure corresponding to the token sequence from the M-th token onward is generated and its root node is added as a child node of N[M-1].

If the value of N[M] is not NULL in step S509, processing returns to step S505 to handle the M-th token and the node corresponding to N[M].
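Steps S501 to S510 can be sketched in Python as follows. This is an illustrative reading of the flowchart, not the patent's code: `Node`, `make_chain`, `find`, and `insert` are hypothetical names, a single variable `node` replaces the path array N because only the most recent element is consulted, and positions are 0-based where the flowchart's M is 1-based. The token sequence is assumed to be non-empty.

```python
class Node:
    def __init__(self, token, length):
        self.token = token
        self.max_len = length   # longest path (in nodes) to a terminal node
        self.min_len = length   # shortest such path
        self.children = []


def make_chain(tokens):
    """Build a single-path subtree for tokens (used for S504 and S510)."""
    head = Node(tokens[0], len(tokens))
    node = head
    for i in range(1, len(tokens)):
        child = Node(tokens[i], len(tokens) - i)
        node.children.append(child)
        node = child
    return head


def find(nodes, token):
    """Return the node holding token, or None (the flowchart's NULL)."""
    return next((n for n in nodes if n.token == token), None)


def insert(roots, tokens):
    remaining = len(tokens)                  # S501: L
    pos = 0                                  # S501: M - 1
    node = find(roots, tokens[0])            # S502
    if node is None:                         # S503
        roots.append(make_chain(tokens))     # S504
        return
    while True:
        node.max_len = max(node.max_len, remaining)   # S505
        node.min_len = min(node.min_len, remaining)
        remaining -= 1                                # S506
        pos += 1
        if remaining == 0:                            # S507
            return
        child = find(node.children, tokens[pos])      # S508
        if child is None:                             # S509
            node.children.append(make_chain(tokens[pos:]))  # S510
            return
        node = child                                  # back to S505


roots = []
for seq in ["ABCD", "ABDEC", "ACEBEF", "ACEBGH"]:   # FIG. 4(a), one character per token
    insert(roots, list(seq))
a = roots[0]
b = a.children[0]
print(a.max_len, a.min_len, b.max_len, b.min_len)   # 6 4 4 3
```

Inserting the four sequences of FIG. 4(a) reproduces the values stated for data structures 401 and 402.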

When a plurality of character strings is given, the index memory 101 stores the tree structure data obtained by executing, for example, the flowchart of FIG. 5 for each character string.

When a plurality of character strings has been given and the index memory 101 stores tree structure data corresponding to a tree structure that satisfies Condition 1 and Condition 2, whether a new character string is among those character strings can be detected as follows. The new character string is converted into a token sequence and the flowchart of FIG. 5 is executed; when the variable L becomes 0 in step S507, the result is given by whether N[M] is a terminal node. The following describes in more detail the process of detecting, under the same assumptions, whether a new character string is among the plurality of character strings. The new character string is hereinafter called the "character string to be searched".

(Tokenizer)
The tokenizer 102 converts the character string to be searched into a sequence of one or more tokens. The resulting token sequence is stored in the memory of the search device 100, and, for example, the memory address at which it is stored is passed to the search unit 103.
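This passage does not fix how tokens are cut; the claims mention splitting at delimiter characters and treating each character as one token. A minimal Python sketch of those two options follows. The function names and the choice of a space as the delimiter are assumptions made for illustration.

```python
def tokenize_by_delimiter(text, delimiter=" "):
    """Split text at the delimiter and drop empty pieces."""
    return [piece for piece in text.split(delimiter) if piece]


def tokenize_by_character(text):
    """Treat every character of text as one token."""
    return list(text)


tokens = tokenize_by_delimiter("MR MAINET COLTD")
print(tokens, len(tokens))
print(tokenize_by_character("ABCD"))
```

The token count returned alongside the tokens is what the search unit later uses as the initial number of unscanned tokens.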

(Search unit)
The search unit 103 scans the token sequence that the tokenizer 102 produced from the character string to be searched, and searches for a path starting at a root node by selecting child nodes of the nodes in the tree structure data stored in the index memory 101. Specifically, the search unit 103 scans the token sequence in order from the start end and obtains information on the token being scanned. Using that information, it selects a node from among the child nodes of the node currently being scanned in the tree structure represented by the stored tree structure data. This is called "selecting a child node".

A child node is selected so that the number of unscanned tokens falls within the range from the minimum to the maximum length from that child node to a terminal node. If the number of unscanned tokens exceeds the maximum length to a terminal node for every child node, or falls below the minimum length for every child node, the number of tokens in the character string to be searched cannot equal the number of tokens in any of the registered character strings. It can therefore be detected immediately that the character string to be searched differs from every one of the plurality of character strings.
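The selection rule can be sketched as a small Python function. The names `Node` and `select_child` and the sample nodes are hypothetical; `unscanned` is taken to count the token that is about to be matched, which is how the variable L behaves in the flowcharts.

```python
from collections import namedtuple

Node = namedtuple("Node", "token max_len min_len children")


def select_child(children, token, unscanned):
    """Return the child that holds token and whose [min_len, max_len] range
    contains the number of unscanned tokens, or None if there is none."""
    for child in children:
        if child.token == token and child.min_len <= unscanned <= child.max_len:
            return child
    return None


# Hypothetical children of one node: "X" is itself a terminal node,
# "Y" is followed by exactly one more node, "Z".
children = [
    Node("X", 1, 1, []),
    Node("Y", 2, 2, [Node("Z", 1, 1, [])]),
]
print(select_child(children, "Y", 2))  # the "Y" node
print(select_child(children, "Y", 1))  # None: only one token left, but "Y" needs two
```

The second call is rejected on the length range alone, without descending into the subtree below "Y".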

The search unit 103 may therefore include, as shown in FIG. 6, a token sequence holding unit 602, a token count holding unit 603, a token number holding unit 604, and a node information holding unit 605. The token sequence holding unit 602 holds the token sequence into which the tokenizer converted the character string to be searched; for example, it stores the address of an array such as those shown in FIG. 2(a) and FIG. 2(b), using a register or the like. The token count holding unit 603 holds the number of tokens in that token sequence, for example in a register. The token number holding unit 604 holds the position of the token currently being scanned within the token sequence, for example in a register. The node information holding unit 605 stores information on the data structure corresponding to the node of the tree structure data currently being scanned; for example, it stores the address of that data structure in a register or the like.

FIG. 7 is a flowchart explaining the processing flow of the search unit 103. The variable L holds the number of unscanned tokens, and the variable M indicates which token, counted from the start, is being scanned. In step S701, L is set to the number of tokens in the token sequence passed from the tokenizer, and M is set to 1. The variable L corresponds to the token count holding unit 603, and the variable M corresponds to the token number holding unit 604.

In step S702, the variable N is assigned information on the root node that holds the string of the first token and for which the value of L lies within the range between the minimum and maximum lengths to a terminal node. The variable N corresponds to the node information holding unit 605.

In step S703, it is determined whether the value of N is NULL. If it is, there is no root node that holds the string of the first token and whose range between the minimum and maximum lengths to a terminal node contains L, so the character string to be searched is not registered; that is, it is not among the plurality of character strings. Processing therefore moves to step S704, where it is determined that the character string is not registered.

If the value of N is not NULL in step S703, processing moves to step S705. Because one token has been scanned, the variable L, which holds the number of unscanned tokens, is decremented by 1, and the variable M is incremented by 1 so that the next token can be scanned.

In step S706, it is determined whether L equals 0. If it does, every token in the token sequence has been scanned, and processing moves to step S707, where it is determined whether the variable N represents a terminal node. If N is not a terminal node, processing moves to step S708, where it is determined that the character string is not registered. If N is determined to be a terminal node in step S707, processing moves to step S709, where it is determined that the character string is registered.

If it is determined in step S706 that L does not equal 0, processing moves to step S710. Among the child nodes of the node represented by N, information on the node that holds the string of the M-th token and for which the value of L lies within the range between the minimum and maximum lengths to a terminal node is assigned to N.

In step S711, it is determined whether the value of N is NULL. If it is, no node corresponds to the M-th token. Processing therefore moves to step S712, where it is determined that the character string is not registered.

If it is determined in step S711 that the value of N is not NULL, processing returns to step S705.
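The flow of steps S701 to S712 can be sketched in Python as follows. This is one possible reading of the flowchart, with hypothetical names (`Node`, `select`, `is_registered`) and a small hand-built index that is not taken from the figures. It assumes a non-empty token sequence and an index in which no registered token sequence is a proper prefix of another, since the terminal-node test relies on the maximum and minimum both being 1.

```python
class Node:
    def __init__(self, token, max_len, min_len, children=()):
        self.token = token
        self.max_len = max_len
        self.min_len = min_len
        self.children = list(children)

    def is_terminal(self):
        return self.max_len == 1 and self.min_len == 1


def select(nodes, token, remaining):
    """Pick the node that holds token and whose length range contains remaining."""
    for node in nodes:
        if node.token == token and node.min_len <= remaining <= node.max_len:
            return node
    return None


def is_registered(roots, tokens):
    remaining = len(tokens)                      # S701: L
    position = 1                                 # S701: M (1-based)
    node = select(roots, tokens[0], remaining)   # S702: N
    if node is None:                             # S703
        return False                             # S704
    while True:
        remaining -= 1                           # S705
        position += 1
        if remaining == 0:                       # S706
            return node.is_terminal()            # S707, then S709 or S708
        node = select(node.children, tokens[position - 1], remaining)  # S710
        if node is None:                         # S711
            return False                         # S712


# Hypothetical index holding the token sequences [A, B] and [A, C, D].
roots = [Node("A", 3, 2, [Node("B", 1, 1),
                          Node("C", 2, 2, [Node("D", 1, 1)])])]
print(is_registered(roots, ["A", "B"]))       # True
print(is_registered(roots, ["A", "C", "D"]))  # True
print(is_registered(roots, ["A", "C"]))       # False
print(is_registered(roots, ["A"]))            # False
```

The last two queries are rejected by the length range: `["A", "C"]` fails at step S710 because one remaining token cannot reach a terminal node through "C", and `["A"]` fails at step S702 before any child is examined.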

The search device according to this embodiment can be realized by having a computer execute a program. That is, the index memory 101 is implemented as the computer's memory or secondary storage, and the computer's CPU executes a program containing modules that implement the tokenizer 102 and the search unit 103, respectively.

The module implementing the tokenizer 102, for example, reads the character string to be searched from the memory address at which it is stored, builds a data structure representing the token sequence in an area of memory, and passes the address of that area to the module implementing the search unit 103.

The module implementing the search unit 103 operates the registers and the like that correspond to the token sequence holding unit 602, the token count holding unit 603, the token number holding unit 604, and the node information holding unit 605, and detects, among other things, whether the character string converted into the token sequence is among the character strings corresponding to the tree structure stored in the index memory 101.

The search device can also be realized entirely in hardware, for example by combining LSIs.

(Example)
FIG. 8 shows an example of the processing of the flowcharts of FIG. 5 and FIG. 7. As shown in FIG. 8(a), if the character strings to be registered are "MR MAINET COLTD" and "MR MAINET SYSTEM COLTD", the processing of the flowchart of FIG. 5 generates the tree structure shown in FIG. 8(b). The index memory 101 therefore stores tree structure data representing the tree structure shown in FIG. 8(b).

If the character string to be searched is "MR MAINET", the tokenizer 102 generates the token sequence represented by the data structure shown in FIG. 8(c). When this token sequence is passed to the search unit 103, the value of N becomes NULL in step S702. Node 1 does hold "MR", the string of the first token, but the token sequence contains 2 tokens, and 2 is outside the range between the minimum of 3 and the maximum of 4 for the length from node 1 to a terminal node. Step S703 therefore branches to step S704, and it is determined that the character string is not registered. Indeed, FIG. 8(a) contains no character string matching "MR MAINET". It can thus be determined that "MR MAINET" is not registered without even scanning the token corresponding to "MAINET".

If the character string to be searched is "MR MAINET COLTD", the tokenizer 102 generates the token sequence represented by the data structure shown in FIG. 8(d). When this token sequence is passed to the search unit 103, the values of the variables L, M, and N change as shown in FIG. 8(e). At step S701, L, M, and N have the values shown in row 21; N has not yet been initialized and is shown as NULL. When step S702 is executed, N comes to represent node 1, as shown in row 22. When step S705 finishes, the values are as shown in row 23. After steps S706 and S710 are executed, the values are as shown in row 24. Because N is not NULL in step S711, processing returns to step S705, and the values become those shown in row 25.

Steps S706 and S710 then bring the values of L, M, and N to those shown in row 26. Because N is not NULL in step S711, processing returns to step S705, and the values become those shown in row 27. In step S706, L is now 0, so processing branches to step S707; node 3 is a terminal node, so processing branches to step S709, where it is determined that the character string is registered. Indeed, "MR MAINET COLTD" appears as the second character string in FIG. 8(a). Thus, even when a registered character string is searched for, no node that fails to match the token sequence is scanned.
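The two searches of this example can be replayed with a short Python sketch that records L, M, and the token of N after each change. The names are hypothetical, and only node 1's range (maximum 4, minimum 3) is stated in the text; the ranges of the other nodes below are derived here from the two registered strings and are an assumption about FIG. 8(b).

```python
class Node:
    def __init__(self, token, max_len, min_len, children=()):
        self.token = token
        self.max_len = max_len
        self.min_len = min_len
        self.children = list(children)

    def is_terminal(self):
        return self.max_len == 1 and self.min_len == 1


def select(nodes, token, remaining):
    for node in nodes:
        if node.token == token and node.min_len <= remaining <= node.max_len:
            return node
    return None


def search_with_trace(roots, tokens):
    """Run the FIG. 7 steps; return (registered?, list of (L, M, token of N))."""
    label = lambda node: node.token if node else None
    L, M = len(tokens), 1
    rows = [(L, M, None)]                         # S701: N not yet set
    N = select(roots, tokens[0], L)               # S702
    rows.append((L, M, label(N)))
    if N is None:                                 # S703, then S704
        return False, rows
    while True:
        L, M = L - 1, M + 1                       # S705
        rows.append((L, M, label(N)))
        if L == 0:                                # S706, then S707
            return N.is_terminal(), rows
        N = select(N.children, tokens[M - 1], L)  # S710
        rows.append((L, M, label(N)))
        if N is None:                             # S711, then S712
            return False, rows


# Index for "MR MAINET COLTD" and "MR MAINET SYSTEM COLTD".
mainet = Node("MAINET", 3, 2, [Node("COLTD", 1, 1),
                               Node("SYSTEM", 2, 2, [Node("COLTD", 1, 1)])])
roots = [Node("MR", 4, 3, [mainet])]

found, rows = search_with_trace(roots, ["MR", "MAINET", "COLTD"])
print(found)
for row in rows:
    print(row)
print(search_with_trace(roots, ["MR", "MAINET"]))
```

For "MR MAINET COLTD" the trace has seven entries, one per row 21 to 27 of the walk-through, and ends with L equal to 0 at the "COLTD" terminal node. For "MR MAINET" the search stops at step S702 with N still unset.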

(Application example)
The search device according to this embodiment can be used, for example, to detect whether the addressee of a bank transfer is registered as a customer and, if so, to obtain the customer's account number and the like. In advance, each character string representing the name of a customer account holder is converted into a token sequence, and tree structure data formed by merging shared token sequences from the start end is stored in the index memory 101. The maximum and minimum lengths to a terminal node are associated with the nodes of the tree structure data.

When the name of a transfer addressee is entered, the character string representing the name is passed to the tokenizer 102 and converted into a token sequence, and the search unit 103 detects whether that character string is among the character strings of customer account holder names.

For example, when step S708 or S712 of the flowchart of FIG. 7 is executed, the transfer addressee is treated as not registered as a customer. In addition, if a customer's account number and the like are associated with each terminal node in advance, then when step S709 is executed, the account number and the like associated with the node represented by N can be obtained, and the transfer can be processed.
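Attaching data to terminal nodes can be sketched in Python as follows. Everything named here is hypothetical: the `account` field, the `lookup_account` function, and the placeholder value "ACCOUNT-PLACEHOLDER", which stands for whatever record a real system would keep. The loop is a compact restatement of the FIG. 7 steps rather than a line-by-line copy.

```python
class Node:
    def __init__(self, token, max_len, min_len, children=(), account=None):
        self.token = token
        self.max_len = max_len
        self.min_len = min_len
        self.children = list(children)
        self.account = account   # payload, set only on terminal nodes


def lookup_account(roots, tokens):
    """Return the payload of the matching terminal node, or None if unregistered."""
    remaining = len(tokens)
    candidates = roots
    node = None
    for token in tokens:
        node = next((n for n in candidates
                     if n.token == token and n.min_len <= remaining <= n.max_len),
                    None)
        if node is None:
            return None              # corresponds to S704 or S712
        remaining -= 1
        candidates = node.children
    if node is not None and node.max_len == 1 and node.min_len == 1:
        return node.account          # corresponds to S709
    return None                      # corresponds to S708


# Index for the single account-holder name "MR MAINET COLTD".
roots = [Node("MR", 3, 3, [
    Node("MAINET", 2, 2, [
        Node("COLTD", 1, 1, account="ACCOUNT-PLACEHOLDER")])])]

print(lookup_account(roots, ["MR", "MAINET", "COLTD"]))
print(lookup_account(roots, ["MR", "MAINET"]))
```

A `None` result would be handled as "addressee not registered"; any other result is the record needed to continue processing the transfer.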

(Embodiment 2)
As Embodiment 2 of the present invention, FIG. 9 shows a modification of the data structure shown in FIG. 3. In FIG. 9(a), the data structure has six slots; slots 901, 904, 905, and 906 correspond to slots 301, 302, 303, and 304 in FIG. 3, respectively. FIG. 9 additionally shows slots 902 and 903. Slot 902 stores the first character of the string stored in slot 901, and slot 903 stores the length of that string. For example, when "MTU" is stored in slot 901, slots 902 and 903 store M and 3, respectively.

FIG. 9(b) shows tree structure data corresponding to the character strings shown in FIG. 8(a), built using the data structure shown in FIG. 9(a).

When such a data structure is used, steps S702 and S710 of the flowchart of FIG. 7 first obtain the address of a node for which the first character of the token string equals the character stored in slot 902 and the length of the token string equals the value stored in slot 903. If no root node or child node satisfies both conditions, it can be detected that the character string to be searched is not among the plurality of character strings, without comparing the token string against the string in slot 901.
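The first-character and length check can be sketched in Python as follows. The names are hypothetical, and the `stats` counter exists only to show how many full string comparisons are actually performed; it is not part of the described data structure.

```python
class Node:
    def __init__(self, token, max_len, min_len, children=()):
        self.token = token            # slot 901: token string
        self.first_char = token[0]    # slot 902: first character of the string
        self.length = len(token)      # slot 903: length of the string
        self.max_len = max_len
        self.min_len = min_len
        self.children = list(children)


def select(nodes, token, remaining, stats):
    """Select a node as in S702/S710, checking first character and length first."""
    for node in nodes:
        if node.first_char != token[0] or node.length != len(token):
            continue                  # rejected without a full string comparison
        stats["full_compares"] += 1
        if node.token == token and node.min_len <= remaining <= node.max_len:
            return node
    return None


nodes = [Node("MTU", 1, 1), Node("MAINET", 1, 1)]
stats = {"full_compares": 0}
print(select(nodes, "MT", 1, stats), stats)            # None, no full comparison made
print(select(nodes, "MTU", 1, stats).token, stats)     # MTU, one full comparison
```

Comparing one character and one integer is cheaper than comparing whole strings, which is the point of adding slots 902 and 903.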

Alternatively, instead of the maximum and minimum lengths to a terminal node, a list of the lengths to terminal nodes may be associated with each node. In this case, if the number of unscanned tokens is not in the list associated with any child node, it can be detected that the character string to be searched is not among the plurality of character strings.
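The difference between a range and a list of lengths can be seen in a short Python sketch. The names are hypothetical, nodes are plain dictionaries, and a Python set is used in place of a list so that the membership test is direct.

```python
def build(sequences):
    """Merge token sequences by leading token; each node records the set of
    lengths (in nodes) from itself to every terminal node below it."""
    tails_by_token = {}
    for seq in sequences:
        tails_by_token.setdefault(seq[0], []).append(seq[1:])
    return [
        {"token": token,
         "lengths": {len(tail) + 1 for tail in tails},
         "children": build([tail for tail in tails if tail])}
        for token, tails in tails_by_token.items()
    ]


def select(nodes, token, remaining):
    """Select the node that holds token and lists remaining among its lengths."""
    for node in nodes:
        if node["token"] == token and remaining in node["lengths"]:
            return node
    return None


# Two registered sequences with 4 and 6 tokens, one character per token.
roots = build(["ABCD", "ACEBEF"])
print(sorted(roots[0]["lengths"]))      # [4, 6]
print(select(roots, "A", 5) is None)    # True
print(select(roots, "A", 6)["token"])   # A
```

A query with 5 tokens lies between the minimum 4 and the maximum 6, so the range test would let it through at the root, whereas the list rejects it immediately.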

100 search device, 101 index memory, 102 tokenizer, 103 search unit

Claims

A search device comprising: an index memory that stores tree structure data which is formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging shared token sequences from the start end, and in which each node is associated with a maximum value and a minimum value of the length from that node to a terminal node;
a tokenizer that converts a character string to be searched into a sequence of one or more tokens; and
a search unit that, while scanning the token sequence produced by the tokenizer, searches for a path starting at a root node by selecting child nodes of nodes of the tree structure data stored in the index memory, and thereby detects whether the character string to be searched is among the plurality of character strings,
wherein the search unit detects that the character string to be searched is not among the plurality of character strings if no child node of a node has a range between the minimum and maximum values of the length to a terminal node that contains the number of unscanned tokens.

The search device according to claim 1, wherein each node of the tree structure data stored in the index memory is associated with the length of the string of the token that the node represents, and
the search unit detects that the character string to be searched is not among the plurality of character strings if none of the child nodes of a node is associated with the length of the string of the token to be scanned next.

The search device according to claim 1, wherein each node of the tree structure data stored in the index memory is associated with a list of lengths to terminal nodes, and
the search unit detects that the character string to be searched is not among the plurality of character strings if none of the child nodes of a node has an associated list that contains the number of unscanned tokens.

The search device according to claim 1, wherein the tokenizer converts the character string to be searched into a token sequence by splitting it at delimiter characters in the character string to be searched.

The search device according to claim 1, wherein the tokenizer converts the character string to be searched into a token sequence by dividing the character string into words according to the presence or absence of a predetermined word.

The search device according to claim 1, wherein the tokenizer sets one character in the character string to be searched as one token.

A program for a computer that stores, in a storage unit, tree structure data which is formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging shared token sequences from the start end, and in which each node is associated with a maximum value and a minimum value of the length from that node to a terminal node, the program causing the computer to:
convert a character string to be searched into a sequence of one or more tokens;
detect whether the character string to be searched is among the plurality of character strings by searching, while scanning the converted token sequence, for a path starting at a root node by selecting child nodes of nodes of the tree structure data stored in the index memory; and
detect that the character string to be searched is not among the plurality of character strings if no child node of a node has a range between the minimum and maximum values of the length to a terminal node that contains the number of unscanned tokens.

A method of operating a computer that stores, in a storage unit, tree structure data which is formed by converting each of a plurality of character strings into a sequence of one or more tokens and merging shared token sequences from the start end, and in which each node is associated with a maximum value and a minimum value of the length from that node to a terminal node, the method comprising, by the computer:
converting a character string to be searched into a sequence of one or more tokens;
detecting whether the character string to be searched is among the plurality of character strings by searching, while scanning the converted token sequence, for a path starting at a root node by selecting child nodes of nodes of the tree structure data stored in the index memory; and
detecting that the character string to be searched is not among the plurality of character strings if no child node of a node has a range between the minimum and maximum values of the length to a terminal node that contains the number of unscanned tokens.