JPH11203183A - Data management device and record medium

Info

Publication number: JPH11203183A
Application number: JP10002715A
Authority: JP
Inventors: Toshiaki Ando; 俊明安藤
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-01-09
Filing date: 1998-01-09
Publication date: 1999-07-30
Anticipated expiration: 2018-01-09
Also published as: JP3855423B2

Abstract

PROBLEM TO BE SOLVED: To identify identifiers at high speed. SOLUTION: An identifier input means 1 inputs a first identifier set, and an identifier division means 2 divides each identifier belonging to the first identifier set into segments of a prescribed data length. A tag generation means 3 generates a tag for each identifier by computing the exclusive OR of the segments obtained from that identifier, and an index generation means 4 generates an index from the tags of the first identifier set. When a second identifier set is then input to be identified against the first identifier set, a tag is generated for each identifier of the second identifier set by the same processing. The index is consulted to judge whether the same tag is present, and only when it is present are the identifiers themselves compared with each other.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a data management device and a recording medium, and more particularly to a data management device and a recording medium that manage data by means of identifiers.

[0002]

DESCRIPTION OF THE RELATED ART

When managing a plurality of data items recorded in a recording device, it is often necessary to perform operations on sets of data.

[0003] In general, operations on data sets frequently use a data identification process that determines whether two data items are the same. Conventionally, each data item has been managed in association with an identifier that uniquely identifies it, and data items have been identified by comparing these identifiers for equality.

[0004] Because an identifier is normally much shorter than the data itself, identifying data by identifier has been efficient. However, as the number of managed data items grows, for example when a data management system is operated in a distributed environment, the data length of the identifiers has gradually increased as well.

[0005] Accordingly, a method has been proposed in which the location where data (a file) is recorded is specified by a path name. In this method, a file is identified by its path name.

[0006] Path names are chosen to be easy for users to recognize, so the length of a path name (its data length) tends to grow as the number of managed files increases. In particular, as the file hierarchy deepens, path names inevitably become longer. Because identifying a file (data) then amounts to comparing path names (character strings), the processing cost increases with the length of the path name.

[0007] One approach to this problem is Japanese Patent Application Laid-Open No. 4-338844, "File path name management control method". In that invention, an ID corresponding to each path name is created and used in place of the path name. The ID is created so that it has fewer bytes than the path name and so that the correspondence between IDs and path names is unique. Using the ID instead of the path name saves memory and speeds up processing; as a result, files can be identified more quickly by comparing IDs than by comparing path names.

[0008] Meanwhile, the "trie" is known as an efficient search method for keys with a long data length. In a trie, the characters of a long character string are mapped to the nodes of a tree structure that serves as an index. The target character string is then found by comparing the search string against this tree structure one node at a time.
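The character-by-character lookup described for the trie can be sketched as follows. This is a minimal illustration only, assuming nested dictionaries as nodes and an empty-string key as an end-of-key marker; the class name `Trie` and its methods are hypothetical and do not come from the cited prior art.

```python
class Trie:
    """Minimal trie: each node is a dict mapping one character to a child node."""

    def __init__(self) -> None:
        self.root: dict = {}

    def insert(self, key: str) -> None:
        node = self.root
        for ch in key:
            # Descend to the child for this character, creating it if needed.
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-key marker (assumed convention)

    def contains(self, key: str) -> bool:
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return False  # the path diverges: the key is not stored
        return "" in node


trie = Trie()
trie.insert("/usr/bin")
trie.insert("/usr/lib")
```

Each lookup step compares a single character, so the cost depends on the key length rather than on the number of stored keys.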

[0009] For example, Japanese Patent Application Laid-Open No. 5-2607, "High-speed search method using a tree-structured data structure", divides a long identifier into compact blocks and applies a trie with identifiers in place of character strings and blocks in place of characters. That is, the blocks are represented as nodes of the tree-structured index, and the cost of comparison is reduced by comparing block sequences rather than whole identifiers.

[0010] In a search using a B-tree, the entire identifier must be compared several times, once at each tree node visited. The method disclosed in JP-A-5-2607 reduces the cost of comparison by comparing blocks with a small data length instead.

[0011]

PROBLEMS TO BE SOLVED BY THE INVENTION

However, in the method disclosed in JP-A-4-338844, the path name must be retrieved from the ID in order to access a file, so a table or management means that associates path names with IDs must be provided. Because this table must contain every path name, its size grows in proportion to the number of files. Consequently, the memory occupied increases as the number of files increases.

[0012] The method disclosed in JP-A-5-2607, on the other hand, reduces the number of identifier comparisons in a search that uses a tree structure; it does not make the identifier comparison itself more efficient. Its original purpose is to make searching efficient when keys with a long data length are used. It is therefore difficult to apply the method as is in situations such as set operations, where identifier comparisons must be repeated many times.

[0013] If the method were nevertheless used for set operations, a tree-structured index would have to be built from one of the identifier sets being operated on, and the elements of the other identifier set would then be looked up one at a time. Because an index built in advance over all identifiers cannot be used, the index must be built on the spot, which lowers the processing speed.

[0014] The present invention has been made in view of these points. It provides a data management device that, even for identifiers with a long data length, can quickly determine the identity of identifiers among a plurality of identifiers and thereby identify data during set operations on data.

[0015]

MEANS FOR SOLVING THE PROBLEMS

To solve the above problems, the present invention provides a data management device that manages data by means of identifiers, comprising: identifier input means through which an identifier is input; identifier dividing means for dividing the input identifier into a plurality of segments; tag generation means for generating a tag having a data length shorter than the identifier by applying a predetermined logical operation to the obtained segments; and index generation means for generating an index based on the tags generated by the tag generation means.

[0016] Here, an identifier is input through the identifier input means. The identifier dividing means divides the input identifier into a plurality of segments. The tag generation means applies a predetermined logical operation to the obtained segments to generate a tag having a data length shorter than the identifier. The index generation means generates an index based on the tags generated by the tag generation means.

[0017]

DESCRIPTION OF THE EMBODIMENTS

Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a diagram illustrating the principle of the data management device of the present invention. In this figure, an identifier for identifying data is input through the identifier input means 1. The identifier dividing means 2 divides the input identifier into a plurality of segments and outputs them. The tag generation means 3 applies a predetermined logical operation to the obtained segments to generate a tag having a data length shorter than the identifier. The index generation means 4 generates an index, for example one having a tree structure, based on the generated tags.

[0018] Next, an example of an embodiment of the present invention is described with reference to FIG. 2. In this figure, an identifier for identifying data is input through the identifier input means 11. The identifier dividing means 12 divides the input identifier into a plurality of segments, each the same length as the tag, and outputs them. The tag generation means 13 generates a tag having a data length shorter than the identifier by computing the exclusive OR of the obtained segments. The index generation means 14 generates an index having a tree structure based on the generated tags. When a new identifier is input, the same tag presence determination means 15 determines whether the tag corresponding to that identifier already exists in the index. When the same tag presence determination means 15 determines that the same tag already exists, the identifier identity determination means 16 compares the identifier corresponding to that tag with the newly input identifier, determines whether they are the same, and outputs the result.

[0019] Next, the operation of the above embodiment is described. The description below uses, as an example, the identification process that identifies identifiers between a first identifier set and a second identifier set when both are input.

[0020] This identification process is used when computing, for example, the logical sum (union) or logical product (intersection) of identifier sets. FIG. 3 is a flowchart illustrating an example of the processing executed when the identification process is performed in the embodiment shown in FIG. 2. When this flowchart starts, the following processing is executed.
[S1] The identifier input means 11 inputs the first identifier set.
[S2] The identifier dividing means 12 and the tag generation means 13 generate tags.

[0021] That is, the identifier dividing means 12 takes one identifier at a time from the input identifier set and divides it into segments having the same data length as the tag. The tag generation means 13 computes the exclusive OR of the segments obtained from that one identifier and uses the result as the tag of the identifier.
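The segment division and exclusive-OR step can be sketched as follows. This is a minimal illustration under stated assumptions: the 4-byte tag length, the zero padding of a short final segment, and the function names `split_segments` and `make_tag` are all hypothetical choices, since the description does not fix them at this point.

```python
TAG_LEN = 4  # assumed tag length in bytes


def split_segments(identifier: bytes, tag_len: int = TAG_LEN) -> list[bytes]:
    """Split an identifier into tag_len-byte segments, zero-padding the last one."""
    padded_len = -(-len(identifier) // tag_len) * tag_len  # round up to a multiple
    padded = identifier.ljust(padded_len, b"\x00")  # padding rule is an assumption
    return [padded[i:i + tag_len] for i in range(0, padded_len, tag_len)]


def make_tag(identifier: bytes, tag_len: int = TAG_LEN) -> bytes:
    """XOR all segments of one identifier together to obtain its tag."""
    acc = bytes(tag_len)
    for segment in split_segments(identifier, tag_len):
        acc = bytes(a ^ b for a, b in zip(acc, segment))
    return acc


tag = make_tag(b"/usr/local/share/doc/readme.txt")
```

Two identifiers with different tags are certainly different; two identifiers with equal tags may still differ, which is why a full comparison follows a tag match.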

[0022] Another hash function may be used in place of the exclusive OR.
[S3] The index generation means 14 generates an index based on the generated tags. That is, the index generation means 14 uses the generated tags to generate an index (a binary tree) corresponding to the first identifier set. FIG. 4 shows an example of an index generated by the above processing. In this example, the index at the left end is the portion created by the above processing. This index is composed of nodes n1 to n7, and each node carries the information needed for branching. A detailed description of such binary trees can be found in Iwanami Course Information Science 11, "Data Management Algorithms", p. 33.

[0023] Each of the rightmost nodes of the index (n4 to n7) has at least one tag associated with it. In this example, tags t1 and t2 are associated with node n4, and tags t3 to t5 are associated with nodes n5 to n7, respectively. Each tag in turn has an identifier associated with it; in this example, identifiers i1 to i5 are associated with tags t1 to t5, respectively. In this embodiment, the index, tags, and identifiers shown in FIG. 4 are stored in a memory or the like inside the index generation means 14.
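One way to picture an index that maps tags to their identifiers is sketched below. It is an assumption-laden simplification: it uses an unbalanced binary search tree keyed directly on the tag, with the identifiers stored at the matching node, rather than the exact node layout of FIG. 4. The names `Node`, `TagIndex`, `insert`, and `lookup` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    tag: bytes
    identifiers: list[bytes] = field(default_factory=list)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


class TagIndex:
    """Binary search tree keyed on tags; each node keeps the identifiers sharing its tag."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, tag: bytes, identifier: bytes) -> None:
        if self.root is None:
            self.root = Node(tag, [identifier])
            return
        node = self.root
        while True:
            if tag == node.tag:
                node.identifiers.append(identifier)  # tag collision: same bucket
                return
            branch = "left" if tag < node.tag else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(tag, [identifier]))
                return
            node = child

    def lookup(self, tag: bytes) -> list[bytes]:
        node = self.root
        while node is not None:
            if tag == node.tag:
                return node.identifiers
            node = node.left if tag < node.tag else node.right
        return []  # no identifier in the index has this tag


index = TagIndex()
index.insert(b"\x01", b"i1")
index.insert(b"\x01", b"i2")
index.insert(b"\x00", b"i3")
```

Because only short tags are compared while descending the tree, the full identifiers are touched only after a tag match.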

[0024] Using such an index, identifiers in the first identifier set that share the same tag (identifiers whose tags collide) can be grouped efficiently. An index created in practice is larger than the one shown in FIG. 4.
[S4] The identifier input means 11 inputs the second identifier set.
[S5] The identifier dividing means 12 and the tag generation means 13 generate a tag for each identifier belonging to the input second identifier set by the same processing as described above.
[S6] The same tag presence determination means 15 and the identifier identity determination means 16 cooperate to execute the identifier identification process. That is, the same tag presence determination means 15 consults the index to determine whether the same tag exists; if it does, the identifier identity determination means 16 further determines whether the identifiers themselves are the same. The details of this process are described later with reference to FIG. 5.
[S7] The identifier identity determination means 16 determines whether all identifiers have been identified. If so, the processing is completed; if not, the processing returns to step S6.
Next, the details of the identification process shown in FIG. 3 are described with reference to FIG. 5.

[0025] This flowchart is called and executed when the "identification process" of step S6 shown in FIG. 3 starts. When this flowchart starts, the following processing is executed.
[S21] The same tag presence determination means 15 selects one of the tags generated by the tag generation means 13 for the identifiers belonging to the second identifier set, and searches the index generated by the index generation means 14 for the first identifier set to find out whether the same tag exists.
[S22] If the processing of step S21 shows that the same tag exists, the same tag presence determination means 15 proceeds to step S23; if the same tag does not exist, it proceeds to step S25.
[S23] When the same tag exists, the identifier identity determination means 16 compares, as binary data, the identifier corresponding to that tag (an identifier belonging to the first identifier set) with the identifier being processed (an identifier belonging to the second identifier set) and determines whether they are the same. If the identifiers are the same, the processing proceeds to step S24; if not, it proceeds to step S25.

[0026] If step S22 finds that a plurality of tags in the index are identical to the tag being processed, the identifier is first compared with the first of the corresponding identifiers; if they differ, it is compared with the next identifier having the same tag. Once an identical identifier is found, it is judged to be the same identifier, and the remaining identifiers with the same tag are ignored.
[S24] The identifier identity determination means 16 saves the identified identifier in a third identifier set.
[S25] The same tag presence determination means 15 determines whether any tags belonging to the second identifier set (tags to be identified) remain. If tags remain, the processing returns to step S21 and the next tag is identified. If no tags remain, the processing returns to the flow of FIG. 3.
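Steps S1 to S7 and S21 to S25 can be condensed into the following sketch. It makes several assumptions for brevity: a Python dictionary stands in for the tree-structured index, the tag is a 4-byte XOR fold with implicit zero padding, and the function names `xor_tag` and `identify` are hypothetical.

```python
from collections import defaultdict


def xor_tag(identifier: bytes, tag_len: int = 4) -> bytes:
    """XOR-fold an identifier into tag_len bytes (equivalent to XORing its segments)."""
    tag = bytearray(tag_len)
    for i, byte in enumerate(identifier):
        tag[i % tag_len] ^= byte
    return bytes(tag)


def identify(first_set: list[bytes], second_set: list[bytes]) -> list[bytes]:
    # S1-S3: build the index for the first identifier set (dict as a stand-in).
    index: dict[bytes, list[bytes]] = defaultdict(list)
    for ident in first_set:
        index[xor_tag(ident)].append(ident)

    third_set: list[bytes] = []
    for ident in second_set:                          # S21/S25: one tag at a time
        candidates = index.get(xor_tag(ident), [])    # S21-S22: does the tag exist?
        for candidate in candidates:                  # S23: full binary comparison
            if candidate == ident:
                third_set.append(ident)               # S24: save to the third set
                break                                 # ignore remaining candidates
    return third_set


first = [b"/a/b/c", b"/x/y", b"abcdwxyz"]
second = [b"/x/y", b"wxyzabcd", b"/q"]
matched = identify(first, second)
```

In this example `b"abcdwxyz"` and `b"wxyzabcd"` produce the same tag, so the tag lookup succeeds, but the full comparison in S23 correctly rejects the pair.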

[0027] According to the above processing, the index is first used to determine at high speed whether the same tag exists, and the corresponding identifiers are compared with each other only when it does. Because the number of identifiers sharing a tag is far smaller than the number of elements in the identifier set, the identification process can be executed quickly.

[0028] The third identifier set generated in this way is the logical product (intersection) of the input first and second identifier sets. In addition, once the identifiers that duplicate those in the first identifier set have been removed from the second identifier set, no duplicates remain between the two, so the logical sum (union) of the first and second identifier sets can be created simply by concatenating the first identifier set with this reduced second identifier set.

[0029] Next, a configuration example of a second embodiment of the present invention is described with reference to FIG. 6. In this figure, parts corresponding to those in FIG. 2 are denoted by the same reference numerals, and their description is omitted.

[0030] Compared with FIG. 2, this embodiment adds a data structure information recording means 20, and the processing in the identifier dividing means 12, the tag generation means 13, and the identifier identity determination means 16 differs. The rest of the configuration is the same as in FIG. 2.

[0031] The data structure information recording means 20 records information on the data structure of the identifiers and, on request, supplies the necessary information to the identifier dividing means 12, the tag generation means 13, or the identifier identity determination means 16.

[0032] The identifier dividing means 12 divides an identifier according to its data structure. The tag generation means 13 refers to the information recorded in the data structure information recording means 20 and generates tags with few collisions by allocating more of the tag's data area to the portions of the identifier whose values vary widely.

[0033] When the same tag exists, the identifier identity determination means 16 refers to the information recorded in the data structure information recording means 20 and compares the identifiers part by part, starting with the parts that have the lowest comparison cost.

[0034] Next, the operation of the above embodiment is described. As before, the identification of a first and a second identifier set is used as the example, and the description of processing that is the same as in the previous case is omitted.

[0035] When an identifier is input, the identifier dividing means 12 refers to the structure information recorded in the data structure information recording means 20 and divides the identifier into areas, each having a single attribute.

[0036] The tag generation means 13 generates the corresponding tag from every input identifier. That is, the tag generation means 13 refers to the identifier structure information recorded in advance in the data structure information recording means 20 and generates the tag according to the structure of the identifier. For example, by allocating more of the tag's data area to the portions where the identifier values vary widely (in the statistical sense), the tags are optimized so that tags generated from different identifiers are unlikely to coincide.

[0037] Consider now the case where tags are generated with reference to the structure information shown in FIG. 7. In this example, an identifier consists of the following data.
(A) length: 4-byte integer data; all identifiers have the same value (a fixed value).
(B) name: a character string with a fixed value.
(C) unknown: 2048 bytes of binary data, which takes specific values.
(D) number: array data consisting of sixteen 4-byte elements; each element can take any numerical value representable in 4 bytes.
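Dividing an identifier according to the FIG. 7 structure information might look like the sketch below. The 8-byte size chosen for `name` is purely an assumption (the description gives no length for it), and the names `Field`, `LAYOUT`, and `split_by_structure` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    size: int    # size in bytes
    fixed: bool  # True when every identifier carries the same value


# Structure information modeled on FIG. 7; the 8-byte name size is an assumption.
LAYOUT = [
    Field("length", 4, True),
    Field("name", 8, True),
    Field("unknown", 2048, False),
    Field("number", 16 * 4, False),
]


def split_by_structure(identifier: bytes, layout: list[Field] = LAYOUT) -> dict[str, bytes]:
    """Cut a fixed-layout identifier into its named fields."""
    expected = sum(f.size for f in layout)
    if len(identifier) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(identifier)}")
    parts: dict[str, bytes] = {}
    offset = 0
    for f in layout:
        parts[f.name] = identifier[offset:offset + f.size]
        offset += f.size
    return parts


sample = (8).to_bytes(4, "big") + b"document" + bytes(2048) + bytes(range(64))
parts = split_by_structure(sample)
```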

[0038] Instead of the structure information above, structure information such as that shown in FIG. 8 can also be used. With this kind of structure information, a field whose "role" is "length" or "delimiter" determines the length of other fields, so that identifiers that are variable-length data can be represented.
(A) length: a 4-byte integer giving the length of name.
(B) name: a character string whose data length is given by length above.
(C) unknown1: binary data; the end of the data is determined by the boundary field.
(D) boundary: a 4-byte integer whose value is 0; it marks the division between the data before and after it.
(E) unknown2: binary data; the start of the data is determined by the boundary field.
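A variable-length identifier following the FIG. 8 structure information could be parsed roughly as follows. This sketch assumes big-endian integers, assumes that the first run of four zero bytes after name is the boundary field, and therefore assumes that unknown1 neither contains such a run nor ends in a zero byte. The function name `parse_variable` is hypothetical.

```python
import struct

BOUNDARY = struct.pack(">I", 0)  # 4-byte integer with value 0 (byte order assumed)


def parse_variable(identifier: bytes) -> dict[str, bytes]:
    """Split a variable-length identifier using its length and boundary fields."""
    (name_len,) = struct.unpack_from(">I", identifier, 0)  # "length" gives len(name)
    name_end = 4 + name_len
    if name_end > len(identifier):
        raise ValueError("length field exceeds identifier size")
    cut = identifier.find(BOUNDARY, name_end)  # boundary ends unknown1
    if cut < 0:
        raise ValueError("boundary field not found")
    return {
        "length": identifier[:4],
        "name": identifier[4:name_end],
        "unknown1": identifier[name_end:cut],
        "boundary": BOUNDARY,
        "unknown2": identifier[cut + 4:],  # boundary also marks where unknown2 starts
    }


sample = struct.pack(">I", 3) + b"doc" + b"\x01\x02" + BOUNDARY + b"\x09"
fields = parse_variable(sample)
```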

[0039] The discussion here proceeds with the structure information shown in FIG. 7. In this example, as indicated in the "degree of variation" column, "length" and "name" are fixed values, so it is desirable to allocate tag area only to the other fields.

[0040] Therefore, the identifier dividing means 12 first divides the identifier into four areas, "length", "name", "unknown", and "number", according to the structure information. The tag generation means 13 then refers to the "degree of variation" in the structure information, allocates a large part of the tag to the number field, and does not use the fixed-value length and name fields for tag generation. Here, if the data length of the tag is 4 bytes and the weights for the degree of variation are as shown in FIG. 9, the total weight is 5 (= 4 + 1).

[0041] As a result, unknown receives 4 × 1 ÷ 5 = 0.8 bytes and number receives 4 × 4 ÷ 5 = 3.2 bytes, which are rounded to 1 byte and 3 bytes, respectively. Then, for example, a hash function converts unknown into a 1-byte hash value and number into a 3-byte hash value.
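The weight-based allocation and per-field hashing can be sketched as follows. The description does not name a hash function, so `hashlib.blake2b` with a configurable digest size is used here purely as an assumed stand-in; the function names `allocate_bytes` and `structured_tag` are hypothetical, and the sketch does not handle weights whose rounded shares fail to add up to the tag length.

```python
import hashlib


def allocate_bytes(weights: dict[str, int], tag_len: int) -> dict[str, int]:
    """Share tag_len bytes among fields in proportion to weight, rounding half up."""
    total = sum(weights.values())
    return {name: int(tag_len * w / total + 0.5) for name, w in weights.items()}


def structured_tag(fields: dict[str, bytes], weights: dict[str, int], tag_len: int = 4) -> bytes:
    """Concatenate per-field hashes whose sizes follow the weight-based allocation."""
    tag = b""
    for name, size in allocate_bytes(weights, tag_len).items():
        if size > 0:
            # blake2b is an assumed choice; any hash with an n-byte output would do.
            tag += hashlib.blake2b(fields[name], digest_size=size).digest()
    return tag


weights = {"unknown": 1, "number": 4}          # weights as in the worked example
allocation = allocate_bytes(weights, 4)        # 0.8 -> 1 byte, 3.2 -> 3 bytes
tag = structured_tag({"unknown": bytes(2048), "number": bytes(range(64))}, weights)
```

Fixed-value fields such as length and name are simply left out of `weights`, so they consume no tag bytes.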

[0042] Determining the tag generation method from the structure information in this way reduces tag collisions. If instead each of the four fields were allotted 1 byte equally, the 2 bytes of the tag corresponding to length and name would have the same value in every tag, so the tag could effectively express only 2 (= 4 − 2) bytes of values, and tag collisions would become more likely.

[0043] In the above embodiment, the tag area is allocated to the fields according to their weights; alternatively, the hash function may be changed according to other information, such as the data type.

[0044] Based on the tags created as described above, the index generation means 14 generates an index (the index for the first identifier set). Next, the identification process performed when a second identifier set is input is described.

[0045] When the second identifier set is input, the identifier dividing means 12 divides each identifier belonging to the second identifier set by the same division method as for the first identifier set. The tag generation means 13 likewise generates tags by the same processing as described above.

[0046] The same tag presence determination means 15 refers to the index for the first identifier set to determine whether the same tag exists. If it does, the identifier identity determination means 16 determines whether the identifiers corresponding to these tags are the same. That is, the identifier identity determination means 16 determines whether the identifiers merely happen to share a tag or are in fact the same identifier.

【００４７】The identifier identity determination means 16 first obtains the structure information of the identifier from the data structure information recording means 20 and decides the order in which the fields are compared. Specifically, comparison starts from fields whose values vary widely and from fields that are cheap to compare, for example because they contain few bytes. Fields holding a fixed value (the same value in every identifier) are excluded from comparison. In this example, the field corresponding to "number" has a large degree of variation, so this field is compared first. If no difference is found in this field, the field corresponding to "unknown" is compared next. Because "length" and "name" are fixed values, these fields are not compared. If even one field differs, the two identifiers are determined at that point to be different. Conversely, if all the fields subject to comparison are the same, the identifiers are determined to be the same.
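The comparison order described above can be sketched as follows. The field names and the 4-byte size of "length" come from the text; the offsets, the other field sizes, the "medium" variation of "unknown", and the names `Field`, `comparison_order`, and `same_identifier` are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    offset: int     # byte offset within the identifier (assumed layout)
    length: int     # field size in bytes
    variation: str  # "const", "small", "medium", or "large"


# Lower rank means the field is compared earlier.
VARIATION_RANK = {"large": 0, "medium": 1, "small": 2}

# Hypothetical layout for the example identifier.
EXAMPLE_FIELDS = [
    Field("length", 0, 4, "const"),
    Field("name", 4, 8, "const"),
    Field("unknown", 12, 8, "medium"),
    Field("number", 20, 4, "large"),
]


def comparison_order(fields):
    """Drop fixed-value fields; sort by variation first, then by byte count."""
    candidates = [f for f in fields if f.variation != "const"]
    return sorted(candidates,
                  key=lambda f: (VARIATION_RANK[f.variation], f.length))


def same_identifier(a: bytes, b: bytes, ordered_fields) -> bool:
    """Compare field by field and stop at the first difference."""
    for f in ordered_fields:
        if a[f.offset:f.offset + f.length] != b[f.offset:f.offset + f.length]:
            return False
    return True


order = comparison_order(EXAMPLE_FIELDS)  # computed once, reused for all pairs
```

Because the order depends only on the structure information, it can be computed once and reused for every pair of identifiers.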

【００４８】In the identifier identity determination processing, the selection of fields based on the degree of variation needs to be performed only once, even when many identifiers are judged. According to the embodiment described above, a larger share of the tag's data area is allocated to the portions of the identifier that vary widely, in accordance with the data structure of the identifier, which reduces tag collisions. Furthermore, when the identification processing finds an identical tag, the comparison refers to the data structure and gives priority to the portions least likely to match and to the portions cheapest to compare, so the identification processing can be executed at high speed.

【００４９】Next, a third embodiment of the present invention will be described with reference to FIG. 10. In this figure, parts corresponding to those in FIG. 6 are denoted by the same reference numerals, and their description is omitted.

【００５０】In this embodiment, a structure description data input means 30, a structure description data analysis means 31, and a writing means 32 are newly added. The rest of the configuration is the same as in FIG. 6.

【００５１】The structure description data input means 30 inputs structure description data (see FIG. 11), which is data describing the structure of the identifier. The structure description data analysis means 31 analyzes the structure description data input from the structure description data input means 30 and generates the structure information of the identifier (see FIGS. 7 and 8 described above).

【００５２】The writing means 32 writes the structure information of the identifier obtained by the structure description data analysis means 31 to the data structure information recording means 20, where it is recorded. Next, the operation of this embodiment will be described. As before, the description uses the identification processing of the first and second identifier sets as an example, and processing that is the same as described above is not explained again.

【００５３】In this embodiment, when the user knows the data structure of the identifier, the user inputs structure description data describing that data structure; the structure description data is analyzed to generate structure information, which is written to the data structure information recording means 20. The tag generation processing and the identification processing are then performed with reference to this structure information, in the same way as in the second embodiment. Because the identifier identification method is the same as in the second embodiment, only the procedure for registering the structure information is described below.

【００５４】For example, the structure description data for an identifier having the structure information shown in FIG. 7 is as shown in FIG. 11. The first line of this example, "int length 4*1 const;", indicates that this field is of integer type (int), has a length of 4 bytes (4*1), and holds a fixed value (const). In this way, the structure description data specifies the data type, the field name, the data length, and the degree of variation.

【００５５】Similarly, "string" denotes a character string type and "bin" a binary type, while "change" indicates small variation and "volatile" indicates large variation. FIG. 12 shows an example of structure description data for an identifier having the structure information shown in FIG. 8. In this example, the fourth line contains "bound", which marks a delimiter in the data. The syntax of the structure description data is not restricted; it suffices that the structure description data can express the structure information.

【００５６】Such structure description data is created, for example, with a text editor and input from the structure description data input means 30. The input structure description data is analyzed by the structure description data analysis means 31. Specifically, the structure description data analysis means 31 parses the input structure description data and creates a parse tree containing the structure information. The structure information is then created from the parse tree obtained by the parsing.
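A parser for the four-token statement form quoted in the text ("int length 4*1 const;") might look like the sketch below. Only that first statement is taken from the document; the other sample statements, the reading of "4*1" as a product, the `FieldInfo` record, and the function names are assumptions, and the "bound" delimiter of FIG. 12 is not handled because its syntax is not given here.

```python
from dataclasses import dataclass

TYPES = {"int", "string", "bin"}
# Keyword in the description -> degree of variation in the structure info.
VARIATIONS = {"const": "fixed", "change": "small", "volatile": "large"}


@dataclass(frozen=True)
class FieldInfo:
    data_type: str
    name: str
    length: int     # bytes
    variation: str


def parse_length(expr: str) -> int:
    """Evaluate a length such as '4*1' as a product (assumed meaning)."""
    result = 1
    for factor in expr.split("*"):
        result *= int(factor)
    return result


def parse_structure_description(text: str):
    """Turn ';'-terminated statements into a list of FieldInfo records."""
    fields = []
    for statement in text.split(";"):
        tokens = statement.split()
        if not tokens:
            continue  # skip blank space after the last ';'
        if len(tokens) != 4:
            raise ValueError(f"unsupported statement: {statement.strip()!r}")
        data_type, name, length_expr, variation = tokens
        if data_type not in TYPES or variation not in VARIATIONS:
            raise ValueError(f"unknown keyword in: {statement.strip()!r}")
        fields.append(FieldInfo(data_type, name, parse_length(length_expr),
                                VARIATIONS[variation]))
    return fields


# First statement is from the text; the remaining three are hypothetical.
SAMPLE = """
int length 4*1 const;
string name 8*1 const;
bin unknown 8*1 change;
int number 4*1 volatile;
"""
structure_info = parse_structure_description(SAMPLE)
```

Since the document states that the syntax is not restricted, a flat list of records is used here in place of an explicit parse tree.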

【００５７】The structure information created as described above is written by the writing means 32 to a predetermined area of the data structure information recording means 20. As explained for the second embodiment, the structure information created in this way is referred to in the tag creation processing and the identification processing.

【００５８】Next, a configuration example of a fourth embodiment of the present invention will be described with reference to FIG. 13. In this figure, parts corresponding to those in FIG. 10 are denoted by the same reference numerals, and their description is omitted.

【００５９】In this embodiment, a data structure analysis means 40, an instruction information providing means 41, and a candidate narrowing means 42 are newly added, and the processing of the identifier identity determination means 16 is different. The rest of the configuration is the same as in FIG. 10.

【００６０】The data structure analysis means 40 analyzes, by a statistical method, the data structure of a plurality of identifiers input from the identifier input means 11. When the same tag presence determination means 15 determines that an identical tag exists, the instruction information providing means 41 compares the identifier corresponding to that tag with the newly input identifier, identifies the portions that differ between them, and adds instruction information indicating the identified portions to the index.

【００６１】When the same tag presence determination means 15 determines that a plurality of identical tags exist, the candidate narrowing means 42 refers to the instruction information added by the instruction information providing means 41 and narrows down the candidates by comparing only the differing portions.

【００６２】The identifier identity determination means 16 determines whether the narrowed-down candidate and the newly input identifier are the same by referring to the data structure of the identifier. Next, the operation of this embodiment will be described. As before, the description uses the identification processing of the first and second identifier sets as an example, and processing that is the same as described above is not explained again.

【００６３】The data structure analysis means 40 analyzes a plurality of identifiers input from the identifier input means 11 by a statistical method and generates the structure information of the identifier. FIG. 14 is a flowchart illustrating an example of the processing executed by the data structure analysis means 40. When this flowchart starts, the following processing is executed.
[S41] The identifiers to be analyzed are input from the identifier input means 11.

【００６４】Here, it is assumed that 100 identifiers have been input.
[S42] All the identifiers input from the identifier input means 11 are divided into segments of a size suitable for processing.

【００６５】FIG. 15 shows this division. Each identifier is divided into n segments of 4 bytes each. The index is a number assigned to identify each segment.
[S43] Statistical information for each segment position is obtained from all the segments.

【００６６】Specifically, the distribution of values in the i-th (1 ≦ i ≦ n) segment is examined to find how many identical values exist. These results are collected into the analysis result. FIG. 16 shows an example of the analysis result. In this example, indexes 1 to 4 have only one kind of value (all 100 segments have the same value). For indexes 261 to 272, there are seven kinds of values, and the most frequent value occurs 79 times.
[S44] Structure information is generated based on the analysis result of step S43.

【００６７】Specifically, the data type and the degree of variation are estimated from how the values of each segment are distributed. Adjacent segments with the same data type and the same degree of variation are grouped together. Here, for example, the boundary between "large" and "medium" variation is set at 20 kinds of values, and the boundary between "small" and "medium" is set at 5 kinds of values.

【００６８】For example, applying this grouping to the analysis result shown in FIG. 16, so that consecutive indexes with the same degree of variation are merged, yields structure information such as that shown in FIG. 17.
[S45] The data structure analysis means 40 outputs the generated structure information to the writing means 32.

【００６９】With the above processing, structure information can be estimated from a plurality of identifiers even when the structure of the identifier is unknown. This processing need not run at the same time as the identification processing of the identifier sets; it can be executed in advance to obtain the structure information.
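Steps S42 to S44 can be sketched as follows. The 4-byte segment size and the thresholds of 5 and 20 kinds of values are taken from the text; which side of each boundary a count falls on, the treatment of a single value as "const", and all function names are assumptions. Data type estimation is omitted, so only the degree of variation is derived.

```python
from collections import Counter

SEGMENT_LEN = 4   # bytes per segment, as in FIG. 15
SMALL_MAX = 5     # up to this many kinds of values: "small" (assumed inclusive)
LARGE_MIN = 20    # this many kinds or more: "large" (assumed inclusive)


def analyze(identifiers):
    """S42/S43: per segment index, (kinds of values, count of commonest value)."""
    n_segments = len(identifiers[0]) // SEGMENT_LEN
    stats = []
    for i in range(n_segments):
        counts = Counter(ident[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
                         for ident in identifiers)
        stats.append((len(counts), max(counts.values())))
    return stats


def classify(kinds: int) -> str:
    """Map the number of distinct values to a degree of variation."""
    if kinds == 1:
        return "const"
    if kinds <= SMALL_MAX:
        return "small"
    if kinds >= LARGE_MIN:
        return "large"
    return "medium"


def build_structure(stats):
    """S44: merge adjacent segments that share the same degree of variation."""
    groups = []  # each entry: [first index, last index, variation], 1-based
    for idx, (kinds, _) in enumerate(stats, start=1):
        variation = classify(kinds)
        if groups and groups[-1][2] == variation:
            groups[-1][1] = idx
        else:
            groups.append([idx, idx, variation])
    return [tuple(g) for g in groups]


# 100 synthetic identifiers: a fixed header, a 7-valued field, a unique counter.
sample = [b"HEAD" + (i % 7).to_bytes(4, "big") + i.to_bytes(4, "big")
          for i in range(100)]
structure = build_structure(analyze(sample))
```

The sketch assumes all identifiers have the same length and that this length is a multiple of the segment size.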

【００７０】Next, the processing for generating tags and the index will be described with reference to FIG. 18. When this flowchart starts, the following processing is executed.
[S61] The tag generation means 13 generates a tag from the input identifier based on the structure information recorded in the data structure information recording means 20.
[S62] The same tag presence determination means 15 refers to the index and determines whether an identical tag exists. If an identical tag exists, the process proceeds to step S63; otherwise, it proceeds to step S65.
[S63] The instruction information providing means 41 compares the identifiers corresponding to the colliding tags and identifies the portions that differ.
[S64] The instruction information providing means 41 adds to the index instruction information indicating the differing portions of the identifiers identified in step S63.
[S65] The tag generation means 13 determines whether tag generation has finished for all identifiers. If it has finished, the processing ends; if not, the process returns to step S61.

【００７１】With the above processing, when a tag collision occurs, the identifiers corresponding to the colliding tags are compared and the portions that differ are identified. Instruction information indicating the identified portions is then added to the index.

【００７２】Next, the processing for identifying an identifier with reference to the index created as described above will be described. FIG. 19 is a flowchart illustrating an example of the processing for identifying an identifier with reference to the index, with instruction information attached, created by the processing of FIG. 18. When this flowchart starts, the following processing is executed.
[S81] The same tag presence determination means 15 refers to the index and obtains the tags identical to the tag of the identifier being compared.
[S82] The same tag presence determination means 15 determines whether exactly one identical tag exists. If exactly one exists, the process proceeds to step S85; if more than one exists, it proceeds to step S83.
[S83] The candidate narrowing means 42 obtains the instruction information from the index.
[S84] The candidate narrowing means 42 refers to the obtained instruction information, compares only the differing portions of the identifiers, and narrows the candidates down to one.
[S85] The identifier identity determination means 16 refers to the structure information recorded in the data structure information recording means 20 and identifies the portions of the identifiers to be compared.
[S86] The identifier identity determination means 16 determines whether the candidate identifier narrowed down by the candidate narrowing means 42 differs from the identifier being compared, by comparing the portions identified in step S85.

【００７３】According to the above processing, when the index is created, information indicating which fields have different values among identifiers sharing the same tag is stored in the index, and when identifiers are identified, they are compared on the basis of this information rather than the structure information. Therefore, even if a tag collision causes several identifiers to appear as candidates for the same identifier, they can be narrowed down to a single identifier with few comparisons.
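The index construction of FIG. 18 and the lookup of FIG. 19 can be sketched together as follows. This is an illustrative simplification: the instruction information is represented as a set of differing byte offsets rather than fields, the tag is a plain XOR of 4-byte segments, identifiers are assumed to have equal length, and the final check in steps S85 and S86 is a full comparison instead of a structure-guided one. All names are hypothetical.

```python
TAG_LEN = 4  # assumed tag length in bytes


def make_tag(identifier: bytes) -> bytes:
    """XOR of the 4-byte segments (stand-in for structure-aware tagging)."""
    padded = identifier + b"\x00" * (-len(identifier) % TAG_LEN)
    tag = bytearray(TAG_LEN)
    for start in range(0, len(padded), TAG_LEN):
        for i in range(TAG_LEN):
            tag[i] ^= padded[start + i]
    return bytes(tag)


def differing_offsets(a: bytes, b: bytes):
    """Byte offsets at which two equal-length identifiers differ."""
    return {i for i, (x, y) in enumerate(zip(a, b)) if x != y}


def build_index(identifiers):
    """FIG. 18: tag -> identifiers plus instruction info on collisions."""
    index = {}
    for ident in identifiers:
        tag = make_tag(ident)                               # S61
        entry = index.get(tag)
        if entry is None:                                   # S62: no collision
            index[tag] = {"identifiers": [ident], "diff_offsets": set()}
            continue
        for other in entry["identifiers"]:                  # S63
            entry["diff_offsets"] |= differing_offsets(ident, other)  # S64
        if ident not in entry["identifiers"]:
            entry["identifiers"].append(ident)
    return index


def lookup(index, ident: bytes):
    """FIG. 19: return the stored identifier equal to ident, or None."""
    entry = index.get(make_tag(ident))                      # S81
    if entry is None:
        return None
    candidates = entry["identifiers"]
    if len(candidates) > 1:                                 # S82
        offsets = sorted(entry["diff_offsets"])             # S83
        candidates = [c for c in candidates                 # S84
                      if len(c) == len(ident)
                      and all(c[i] == ident[i] for i in offsets)]
    for c in candidates:                                    # S85/S86 (simplified)
        if c == ident:
            return c
    return None


index = build_index([b"AAAABBBB", b"BAAAABBB", b"CCCCDDDD"])
```

Here `b"AAAABBBB"` and `b"BAAAABBB"` share a tag and differ only at offsets 0 and 4, so a lookup among them needs to inspect just those two bytes before the final confirmation.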

【００７４】This embodiment is effective when an index is created for one identifier set and operations are performed against many other identifier sets, and when tags are prone to collide (for example, when much of the identifier is unknown and the processing is not well optimized).

【００７５】Next, a case in which the above embodiments are applied to file copy management will be described. Consider two data management devices, where data in one data management device A is copied to the other data management device B, and the correspondence between the original and the copy is managed in order to maintain consistency between them.

【００７６】Given an identifier set A obtained as a search result from the data management device A and an identifier set B obtained as a search result from the data management device B, an identifier a in set A and an identifier b in set B may refer to the same data (the original data and its copy). If the search results are simply merged as they are, the copied data will therefore appear more than once. This problem becomes more serious as more management systems are used at the same time and as more copies of the data are made.

【００７７】FIG. 20 shows an example of processing for handling such a case. When this flowchart starts, the following processing is executed.
[S101] An identifier set A is obtained as the search result from the data management device A, and an identifier set B is obtained as the search result from the data management device B.
[S102] Using the correspondence table between the identifiers of original data and copied data, the elements of identifier set B are converted into the identifiers of the corresponding original data.
[S103] The identifiers that could not be converted to original data identifiers in step S102 form identifier set B′, and the identifiers that were converted form identifier set B″.
[S104] Using the embodiment of FIG. 2, FIG. 6, FIG. 10, or FIG. 13, the identifiers that appear in both identifier set A and identifier set B″ are detected.
[S105] Using the embodiment of FIG. 2, FIG. 6, FIG. 10, or FIG. 13, the logical sum (union) of identifier set A and identifier set B″ is computed.
[S106] The result obtained in step S105 is merged with identifier set B′.

【００７８】This processing yields a search result without duplicates. As shown above, using the union of identifier sets produced by the present invention makes it possible to obtain, with simple processing, a search result across multiple data management devices that contains no duplicated data.
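The flow of steps S101 to S106 can be sketched as follows. Python's built-in `set` is used here as a stand-in for the tag-and-index identification of the embodiments, and the function name, the sample identifiers, and the dictionary form of the correspondence table are assumptions made for illustration.

```python
def merge_search_results(set_a, set_b, copy_to_original):
    """Merge two search results so that copied data is not listed twice.

    copy_to_original maps an identifier in device B to the identifier of
    the original data in device A (the correspondence table of S102).
    """
    b_converted = set()     # B'': identifiers converted to originals
    b_unconverted = set()   # B' : identifiers with no known original
    for ident in set_b:                         # S102 / S103
        if ident in copy_to_original:
            b_converted.add(copy_to_original[ident])
        else:
            b_unconverted.add(ident)
    union = set(set_a) | b_converted            # S104 / S105
    return union | b_unconverted                # S106


result = merge_search_results(
    {b"a1", b"a2"},          # search result from device A
    {b"b1", b"b9"},          # search result from device B
    {b"b1": b"a1"},          # b1 is a copy of a1
)
```

In this sample, the copy `b"b1"` is folded into its original `b"a1"`, while `b"b9"`, which has no entry in the table, is kept as is.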

【００７９】The processing functions described above can be realized by a computer. In that case, the processing content of the functions that the data management device should have is described in a program recorded on a computer-readable recording medium, and the above processing is realized on the computer by executing this program.

【００８０】Computer-readable recording media include magnetic recording devices and semiconductor memories. For distribution on the market, the program may be stored on a portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a floppy disk, or it may be stored in the storage device of a computer connected via a network and transferred to other computers over the network. For execution, the program is stored on a hard disk device or the like in the computer and loaded into main memory.

【００８１】[0081]

[Effects of the Invention] As described above, in the present invention, an identifier is input from the identifier input means, the identifier division means divides the input identifier into a plurality of segments, the tag generation means generates a tag with a shorter data length than the identifier by performing a predetermined logical operation on the obtained segments, and the index generation means generates an index based on the tags generated by the tag generation means. As a result, the identification processing of identifiers can be executed at high speed.

[Brief description of the drawings]

FIG. 1 is a diagram showing the principle of the present invention.

FIG. 2 is a diagram showing a configuration example of a first embodiment of the present invention.

FIG. 3 is a flowchart illustrating an example of processing executed in the embodiment shown in FIG. 2.

FIG. 4 is a diagram showing an example of an index generated in the embodiment shown in FIG. 2.

FIG. 5 is a flowchart illustrating an example of identification processing executed in the embodiment shown in FIG. 2.

FIG. 6 is a diagram showing a configuration example of a second embodiment of the present invention.

FIG. 7 is a diagram showing an example of structure information used in the embodiment shown in FIG. 6.

FIG. 8 is a diagram showing another example of the structure information.

FIG. 9 is a diagram showing the relationship between variation and weight.

FIG. 10 is a diagram showing a configuration example of a third embodiment of the present invention.

FIG. 11 is a diagram showing an example of structure description data used in the embodiment shown in FIG. 10.

FIG. 12 is a diagram showing another example of structure description data used in the embodiment shown in FIG. 10.

FIG. 13 is a diagram showing a configuration example of a fourth embodiment of the present invention.

FIG. 14 is a diagram showing an example of processing executed in the embodiment shown in FIG. 13.

FIG. 15 is a diagram showing an identifier divided into segments by the embodiment shown in FIG. 13.

FIG. 16 is a diagram showing an example of segment information analyzed by the embodiment shown in FIG. 13.

FIG. 17 is a diagram showing an example of structure information generated from the analysis result shown in FIG. 16.

FIG. 18 is a flowchart illustrating an example of processing executed in the embodiment shown in FIG. 13.

FIG. 19 is a flowchart illustrating another example of processing executed in the embodiment shown in FIG. 13.

FIG. 20 is a flowchart illustrating an example of processing when the present invention is applied to data copy management.

[Explanation of symbols]

1 Identifier input means
2 Identifier division means
3 Tag generation means
4 Index generation means
11 Identifier input means
12 Identifier division means
13 Tag generation means
14 Index generation means
15 Same tag presence determination means
16 Identifier identity determination means
20 Data structure information recording means
30 Structure description data input means
31 Structure description data analysis means
32 Writing means
40 Data structure analysis means
41 Instruction information providing means
42 Candidate narrowing means

Claims

[Claims]

1. A data management device for managing data by means of identifiers, comprising: identifier input means for inputting an identifier; identifier division means for dividing the input identifier into a plurality of segments; tag generation means for generating a tag having a shorter data length than the identifier by performing a predetermined logical operation on the obtained segments; and index generation means for generating an index based on the tag generated by the tag generation means.

2. The data management device according to claim 1, wherein the identifier division means divides the identifier into segments having the same data length as the tag, and the tag generation means generates the tag by computing the exclusive OR of the segments.

3. The data management device according to claim 1, wherein the identifier division means divides the identifier into segments according to the data structure of the identifier, and the tag generation means allocates the data area of the tag to the segments according to the degree of variation in the value of each segment obtained by the identifier division means.

4. The data management device according to claim 3, further comprising data structure information recording means for recording information on the data structure of the identifier, wherein the identifier division means and the tag generation means perform their processing with reference to the information recorded in the data structure information recording means.

5. The data management device according to claim 4, further comprising: structure description data input means to which structure description data describing the data structure is input; structure description data analysis means for analyzing the input structure description data and generating information on the data structure of the identifier; and writing means for writing the obtained information on the data structure of the identifier to the data structure information recording means.

6. The data management device according to claim 4, further comprising: data structure analysis means for analyzing, by a statistical method, the data structure of a plurality of identifiers input from the identifier input means; and writing means for writing the obtained information on the data structure of the identifier to the data structure information recording means.

7. The data management device according to claim 1, further comprising: same tag presence determination means for determining, when a tag of a newly input identifier is generated by the tag generation means, whether an identical tag exists by referring to the index generated by the index generation means; and identifier identity determination means for determining, when an identical tag is determined to exist, whether the identifier having that tag and the newly input identifier are the same.

8. The data management device according to claim 7, wherein the identifier identity determination means compares the identifiers with reference to the data structure of the identifiers, proceeding in order from the portion with the lowest comparison cost.

9. The data management device according to claim 7, further comprising instruction information providing means for, when the same tag presence determination means determines that an identical tag exists, comparing the identifier corresponding to that tag with the newly input identifier, identifying the portions that differ between them, and adding instruction information indicating the identified portions to the index.

10. The data management device according to claim 9, further comprising candidate narrowing means for narrowing down candidates, when the same tag presence determination means determines that a plurality of identical tags exist, by comparing only the differing portions of the identifiers with reference to the instruction information added by the instruction information providing means, wherein the identifier identity determination means determines the identity between the narrowed-down candidate and the newly input identifier by referring to the data structure of the identifier.

11. A computer-readable recording medium on which is recorded a data management program for causing a computer to manage data by means of identifiers, the program causing the computer to function as: identifier input means for inputting an identifier; identifier division means for dividing the input identifier into a plurality of segments; tag generation means for generating a tag having a shorter data length than the identifier by performing a predetermined logical operation on the obtained segments; and index generation means for generating an index based on the tag generated by the tag generation means.