JP2011186852A - File dividing device, method and program

Info

Publication number: JP2011186852A
Application number: JP2010052266A
Authority: JP
Inventors: Akiyoshi Kawada; 明良川田; Harushio Hidaka; 東潮日高; Takashi Hoshino; 隆星野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-03-09
Filing date: 2010-03-09
Publication date: 2011-09-22
Anticipated expiration: 2030-03-09
Also published as: JP5367622B2

Abstract

PROBLEM TO BE SOLVED: To reduce the maximum size of the divided files, shorten file division time, and reduce memory consumption and accesses to an external storage device.

SOLUTION: The number of divisions and the division duplication number are acquired. The head position of the input file is set in a pointer storage area. The row corresponding to the pointer in the pointer storage area is read from the input file. As many distribution destinations as the division duplication number (where division duplication number < number of divisions) are determined out of the number of divisions, such that rows with the same primary key are stored in the same divided files. The row is copied according to the division duplication number and appended to those divided files, and this processing is repeated until the pointer reaches the end position of the input file. After division, the sizes (row counts) of the divided files are compared, and the top N (N = division duplication number - 1) divided files with the most rows are eliminated.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a file dividing device, method, and program, and more particularly to a file dividing device, method, and program for dividing an input file to be processed into a plurality of files.

One method of file division is round robin, in which distribution destinations are assigned in turn in a repeating cycle (see, for example, Non-Patent Document 1).

Another method appends key duplication counts to the input file and assigns keys, in descending order of duplication count, starting from the smallest divided file (see, for example, Patent Document 1).

A third method builds an optimal even distribution plan from statistical information about the input file using integer programming (see, for example, Non-Patent Document 2).

Patent Document 1: JP 2007-86951 A

Non-Patent Document 1: Hideo Aiso and Hidehiko Tanaka (eds.), "Computer Dictionary, Second Edition", p. 502, Asakura Shoten
Non-Patent Document 2: Hideo Aiso and Hidehiko Tanaka (eds.), "Computer Dictionary, Second Edition", p. 731, Asakura Shoten

However, the round-robin method has the problem that the sizes of the divided files are not equalized when the distribution of key duplication counts in the input file is non-uniform.
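The imbalance can be illustrated with a minimal sketch. It assumes that round robin is applied per primary key (each newly seen key goes to the next file in the cycle); the function name `round_robin_split` and the sample data are hypothetical and are not taken from the cited documents.

```python
def round_robin_split(rows, num_files):
    """Assign each newly seen primary key to the next file in cyclic order.

    rows: iterable of (primary_key, payload) tuples.
    Returns a list of num_files lists of rows.
    """
    files = [[] for _ in range(num_files)]
    destination = {}   # primary key -> file index
    next_index = 0
    for key, payload in rows:
        if key not in destination:
            destination[key] = next_index
            next_index = (next_index + 1) % num_files
        files[destination[key]].append((key, payload))
    return files

# Skewed input: key "A" has 8 rows, keys "B" and "C" have 1 row each.
rows = [("A", i) for i in range(8)] + [("B", 0), ("C", 0)]
sizes = [len(f) for f in round_robin_split(rows, 3)]
print(sizes)  # [8, 1, 1]
```

With uniform key counts the three files would be equal in size; with the skewed counts above, one file holds eight of the ten rows.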

Moreover, the technique that assigns keys in descending order of duplication count starting from the smallest divided file, and the technique that uses integer programming, cannot begin distribution until the entire input file has been read (to build the key duplication count information). Because distribution starts only after that full read, these techniques incur additional memory usage and external storage device accesses.

The present invention has been made in view of the above points, and an object of the present invention is to provide a file dividing device, method, and program capable of reducing the maximum size of the divided files and shortening file division time.

A further object is to provide a file dividing device, method, and program capable of reducing memory usage and external storage device accesses.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (Claim 1) is a file dividing device that divides an input file 1 to be processed, in which duplicate primary keys are permitted, and stores rows in divided files 3 such that rows with the same primary key value have the same distribution destinations, the device comprising:
division number information storage means 6, in which the number of divisions is stored;
division duplication number information storage means 7, in which the division duplication number is stored;
file output means 23, which acquires the number of divisions from the division number information storage means 6, acquires the division duplication number from the division duplication number information storage means 7, sets the head position of the input file in a pointer storage area, and then repeats the following until the pointer reaches the end position of the input file: reading from the input file 1 the row corresponding to the pointer in the pointer storage area; determining, out of the number of divisions, as many distribution destinations as the division duplication number (where division duplication number < number of divisions) such that rows with the same primary key have the same distribution destinations; distributing the row by making copies according to the division duplication number; and appending the copies to the divided files 3; and
divided file exclusion means 24, which, after division is complete, compares the sizes (row counts) of the divided files 3 and excludes the top N (N = division duplication number - 1) divided files with the most rows.

Further, the present invention (Claim 2) comprises:
distribution destination information storage means having a distribution destination list that stores the distribution destinations for each primary key; and
distribution destination candidate information storage means having a distribution destination candidate list that stores the distribution destination candidates for each primary key,
wherein the file output means 23 includes:
means for reading a row from the storage area and extracting the primary key from the row; and
means for setting the elements 1 through the number of divisions ({1, 2, ..., number of divisions}) as the initial value of the distribution destination candidate list for the primary key, determining as many distribution destinations as the division duplication number by fair selection from the candidates, and storing them in the distribution destination list for the primary key.

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (Claim 3) is a file dividing method that divides an input file to be processed, in which duplicate primary keys are permitted, and stores rows in divided files such that rows with the same primary key value have the same distribution destinations. In a device having division number information storage means in which the number of divisions is stored and division duplication number information storage means in which the division duplication number is stored, the method performs:
a division number acquisition step (step 1) of acquiring the number of divisions from the division number information storage means;
a division duplication number acquisition step (step 2) of acquiring the division duplication number from the division duplication number information storage means;
a setting step of acquiring the number of divisions from the division number information storage means (step 1), acquiring the division duplication number from the division duplication number information storage means (step 2), and setting the head position of the input file in a pointer storage area (step 3);
a file output step of repeating the following until the pointer reaches the end position of the input file (step 8): reading from the input file the row corresponding to the pointer in the pointer storage area (step 4); determining, out of the number of divisions, as many distribution destinations as the division duplication number (where division duplication number < number of divisions) such that rows with the same primary key have the same distribution destinations (step 5); distributing the row by making copies according to the division duplication number (step 6); and appending the copies to the divided files (step 7); and
a divided file exclusion step (step 9) of comparing the sizes (row counts) of the divided files after division is complete and excluding the top N (N = division duplication number - 1) divided files with the most rows.

Further, in the present invention (Claim 4), the file output step:
reads a row from the storage area and extracts the primary key from the row; and
sets the elements 1 through the number of divisions ({1, 2, ..., number of divisions}) as the initial value of the distribution destination candidate list for the primary key, determines as many distribution destinations as the division duplication number by fair selection from the candidates, and stores them in the distribution destination list for the primary key.

The present invention (Claim 5) is a file division program for causing a computer to function as each of the means constituting the file dividing device according to Claim 1 or 2.

Primary key values are permitted to repeat across the input file. Under the constraint that rows with the same primary key value have the same distribution destinations, division candidates are generated by distributing row-level copies according to the division duplication number, and after division is complete the top N (N = division duplication number - 1) divided files by size are excluded. This reduces the maximum size of the divided files and shortens file division time.

In addition, the head position of the input file is set in the pointer storage area, rows are read starting from that position, and rows are distributed until the end position of the input file is reached. The input file is therefore scanned only once, which reduces memory usage and the number of accesses to the external storage device.

FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is a diagram for explaining the principle of the present invention.
FIG. 3 is a configuration diagram of the file dividing device in an embodiment of the present invention.
FIG. 4 shows the format of the input file in an embodiment of the present invention.
FIG. 5 is an example of the distribution destination information storage unit in an embodiment of the present invention.
FIG. 6 is an example of the distribution destination candidate information storage unit in an embodiment of the present invention.
FIG. 7 is a configuration diagram of the file division processing unit in an embodiment of the present invention.
FIG. 8 is a flowchart of the file division processing unit in an embodiment of the present invention.
FIG. 9 is a flowchart of the file output process (S120) in an embodiment of the present invention.
FIG. 10 is a detailed flowchart of the row distribution process (S123) in an embodiment of the present invention.
FIG. 11 is a detailed flowchart of the distribution destination determination process (S1233) in an embodiment of the present invention.
FIG. 12 is a detailed flowchart of the divided file exclusion process (S130) in an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 3 shows the configuration of the file dividing device in an embodiment of the present invention.

The file dividing device shown in the figure comprises an input file 1, a file division processing unit 2, divided files 3, a distribution destination information storage unit 4, a distribution destination candidate information storage unit 5, a division number information storage unit 6, a division duplication number information storage unit 7, and a memory 8. Of these, the distribution destination information storage unit 4, the distribution destination candidate information storage unit 5, the division number information storage unit 6, and the division duplication number information storage unit 7 are held in a storage device connected to the file division processing unit 2 by a data bus. The memory 8 has a pointer storage area whose contents are set based on the handle of the input file 1 that has been read.

As shown in FIG. 4, the input file 1 is a collection of rows, each consisting of a primary key a, a secondary key b, and a data part c. A row is uniquely identified by the combination of primary key a and secondary key b, but primary key values are permitted to repeat across the input file. Because the purpose of file division is to process aggregate computations over primary key values in parallel, the constraint is imposed that rows with the same primary key value have the same distribution destinations.
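The row layout can be sketched as a small data structure. The on-disk encoding is not specified in the text, so the tab separator, the `Row` type, and the helper names below are assumptions made only for illustration.

```python
from typing import NamedTuple

class Row(NamedTuple):
    primary_key: str    # field a: may repeat across the file
    secondary_key: str  # field b: unique together with the primary key
    data: str           # field c: payload

def parse_row(line: str, sep: str = "\t") -> Row:
    """Split one line into its three fields (separator is an assumption)."""
    a, b, c = line.rstrip("\n").split(sep, 2)
    return Row(a, b, c)

def primary_key_counts(lines):
    """Count how many rows share each primary key value."""
    counts = {}
    for line in lines:
        key = parse_row(line).primary_key
        counts[key] = counts.get(key, 0) + 1
    return counts

sample = ["u1\t001\tx\n", "u1\t002\ty\n", "u2\t001\tz\n"]
print(primary_key_counts(sample))  # {'u1': 2, 'u2': 1}
```

Here the primary key `u1` appears twice, distinguished by its secondary key, which is the situation the same-destination constraint is meant to handle.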

The distribution destination information storage unit 4 has a distribution destination list as shown in FIG. 5, in which the distribution destinations for each primary key are set.

The distribution destination candidate information storage unit 5 has a distribution destination candidate list as shown in FIG. 6, in which the distribution destination candidates for each primary key are set.

The division number information storage unit 6 stores the number of divisions.

The division duplication number information storage unit 7 stores the division duplication number.

The file division processing unit 2 generates the divided files 3 from the input file 1.

FIG. 7 shows the configuration of the file division processing unit in an embodiment of the present invention.

The file division processing unit 2 includes a division number acquisition unit 21, a division duplication number acquisition unit 22, a file output unit 23, and a divided file exclusion unit 24. The file output unit 23 includes a pointer storage area setting unit 231, a distribution destination determination unit 232, and a divided file storage unit 233.

The division number acquisition unit 21 acquires the number of divisions from the division number information storage unit 6 and outputs it to the distribution destination determination unit 232.

The division duplication number acquisition unit 22 acquires the division duplication number from the division duplication number information storage unit 7 and outputs it to the divided file exclusion unit 24 and the distribution destination determination unit 232.

The pointer storage area setting unit 231 of the file output unit 23 reads the input file 1 and stores the position of a row of the input file in the pointer storage area P in the memory 8.

The distribution destination determination unit 232 obtains the row position from the pointer storage area P of the memory 8, reads the row, and distributes it to the divided files 3 for storage. In doing so, it extracts the primary key from the row that was read. If that primary key is already registered in the distribution destination information storage unit 4, it appends copies of the row to as many of the divided files 3 as the division duplication number (where division duplication number < number of divisions), so that rows with the same primary key have the same distribution destinations. If the primary key is not registered, the distribution destinations are determined by the process of FIG. 11 described later and are added to the distribution destination information storage unit 4.

The process of generating the divided files 3 from the input file 1 in the above configuration is described below.

FIG. 8 is a flowchart of the file division processing unit in an embodiment of the present invention.

Step 100) The file division processing unit 2 reads the division number information storage unit 6 and extracts the number of divisions.

Step 110) It reads the division duplication number information storage unit 7 and extracts the division duplication number.

Step 120) As the file output process, it performs the process shown in FIG. 9.

Step 130) As the divided file exclusion process, it performs the process shown in FIG. 12.

Next, the process of step 120 is described.

FIG. 9 is a flowchart of the file output process in an embodiment of the present invention.

Step 121) Using the handle of the input file 1, the head position of the input file is obtained and stored in the pointer storage area P of the memory 8.

Step 122) The row corresponding to the pointer in the pointer storage area P is read from the input file 1.

Step 123) The row that was read is distributed. The detailed operation is described below with reference to FIG. 10.

Step 1231) The primary key is extracted from the contents of the row that was read.

Step 1232) The distribution destination information storage unit 4 is searched by primary key to determine whether the key is already registered. If it is not registered, processing proceeds to step 1233; if it is registered, processing proceeds to step 1235.

Step 1233) The distribution destinations are determined. Details are given later with reference to FIG. 11.

Step 1234) The determined distribution destinations are added to the distribution destination list in the distribution destination information storage unit 4, and processing proceeds to step 1236.

Step 1235) If the primary key is registered in the distribution destination information storage unit 4, that unit is searched by primary key and the distribution destinations are obtained.

Step 1236) The contents of the row are appended to the divided files 3 at the distribution destinations.

Step 124) It is determined whether the pointer in the pointer storage area P has reached the end position of the input file 1. If it has not, processing proceeds to step 125; if it has, processing proceeds to step 130.

Step 125) The position of the next row is stored in the pointer storage area P, and processing returns to step 122.
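Steps 121 to 125 can be sketched as a single pass over the input. This is only an illustration under stated assumptions: rows are tab-separated with the primary key first, the divided files are modeled as in-memory lists, and `distribute` and the `choose_destinations` callback (standing in for step 1233) are hypothetical names.

```python
import io

def distribute(input_file, num_files, choose_destinations):
    """Read input_file once and copy each row to its key's destinations.

    choose_destinations(key) returns the list of file indices for a key
    that has not been seen before (a stand-in for step 1233).
    """
    outputs = [[] for _ in range(num_files)]  # stand-ins for divided files 3
    destinations = {}                         # distribution destination list (unit 4)
    while True:
        line = input_file.readline()          # step 122; the file position plays
        if not line:                          # the role of pointer P (step 125)
            break                             # step 124: end of input reached
        key = line.split("\t", 1)[0]          # step 1231 (assumed tab-separated)
        if key not in destinations:           # step 1232
            destinations[key] = choose_destinations(key)  # steps 1233-1234
        for index in destinations[key]:       # steps 1235-1236: one copy each
            outputs[index].append(line)
    return outputs, destinations

src = io.StringIO("k1\ta\tx\nk2\ta\ty\nk1\tb\tz\n")
fixed = {"k1": [0, 2], "k2": [1, 2]}          # fixed choices, for a repeatable demo
outputs, _ = distribute(src, 3, lambda key: fixed[key])
print([len(o) for o in outputs])  # [2, 1, 3]
```

Because destinations are decided the first time a key is seen and looked up afterwards, no second scan of the input is needed.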

Next, the distribution destination determination process of step 1233 is described with reference to FIG. 11.

FIG. 11 is a flowchart of the distribution destination determination process in an embodiment of the present invention.

The process in the figure is one example using the Monte Carlo method. Any method that can fairly select as many distribution destinations as the division duplication number from the candidates 1 through the number of divisions may be used instead.

Step 12331) First, the distribution destination list for the primary key in the distribution destination information storage unit 4 is initialized to empty.

Step 12332) The distribution destination candidate information storage unit 5 is searched by primary key, and {1, 2, ..., number of divisions} is set as the initial value of the corresponding distribution destination candidate list.

Step 12333) The number of elements in the distribution destination candidate list for the primary key in the distribution destination candidate information storage unit 5 is obtained. The interval from 0 to 1 inclusive is divided equally into that many sub-intervals, and the sub-intervals are assigned, in order, one-to-one to the elements of the candidate list. For example, when the number of divisions is 5, the interval is divided into five equal sub-intervals, and each is assigned the number of one element (a distribution destination).

Step 12334) A uniform pseudo-random number in the interval from 0 to 1 inclusive is obtained.

Step 12335) It is determined which of the sub-intervals obtained in step 12333 contains the pseudo-random number x (that is, the index m of that sub-interval), and the distribution destination is determined from the structure that represents the assignment of sub-intervals to candidates.

Step 12336) The determined (m-th) distribution destination candidate is added to the distribution destination list for the primary key in the distribution destination information storage unit 4.

Step 12337) The determined (m-th) distribution destination candidate is removed from the distribution destination candidate list for the primary key in the distribution destination candidate information storage unit 5.

Step 12338) If the number of elements in the distribution destination list for the primary key in the distribution destination information storage unit 4 is greater than or equal to the division duplication number, the process ends; otherwise, processing returns to step 12333.
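The selection loop of steps 12331 to 12338 can be sketched as follows. The function name `choose_destinations` and its parameters are hypothetical. Python's `random.random()` returns values in [0, 1) rather than the closed interval mentioned in the text, so the index is clamped to cover an endpoint value of exactly 1.

```python
import random

def choose_destinations(num_files, num_copies, rng=random):
    """Pick num_copies distinct destinations out of 1..num_files."""
    if not 0 < num_copies < num_files:
        raise ValueError("need 0 < num_copies < num_files")
    chosen = []                                  # step 12331: empty destination list
    candidates = list(range(1, num_files + 1))   # step 12332: {1, ..., num_files}
    while len(chosen) < num_copies:              # step 12338: stop when enough
        x = rng.random()                         # step 12334: uniform random number
        # Steps 12333 and 12335: the unit interval is split into
        # len(candidates) equal sub-intervals; m is the one containing x.
        m = min(int(x * len(candidates)), len(candidates) - 1)
        chosen.append(candidates.pop(m))         # steps 12336-12337: move candidate
    return chosen

picked = choose_destinations(5, 2, random.Random(0))
print(picked)  # two distinct values from 1..5
```

Removing each chosen candidate before the next draw is what guarantees that the destinations for one key are distinct, and the equal sub-intervals give every remaining candidate the same chance on each draw.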

Next, the divided file exclusion process of step 130 is described.

FIG. 12 is a detailed flowchart of the divided file exclusion process in an embodiment of the present invention.

Step 131) The sizes (row counts) of the divided files 3 are obtained, and the handles of the top N (N = division duplication number - 1) divided files by size are obtained.

Step 132) File exclusion (for example, file deletion) is executed using the handles of the divided files obtained in step 131.
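Steps 131 and 132 can be sketched as below. Divided files are modeled as a dictionary from a file id to its list of rows, and each row is reduced to its primary key. These modeling choices, the name `exclude_largest`, and the sample data are assumptions for illustration only.

```python
def exclude_largest(files, num_copies):
    """Drop the N = num_copies - 1 files with the most rows.

    files: dict mapping a file id to its list of rows.
    Returns the surviving files as a new dict.
    """
    n = num_copies - 1
    # Step 131: rank file ids by row count, largest first.
    ranked = sorted(files, key=lambda name: len(files[name]), reverse=True)
    dropped = set(ranked[:n])
    # Step 132: exclude the selected files.
    return {name: rows for name, rows in files.items() if name not in dropped}

# Key "A" (8 rows) was copied to files 1 and 2, key "B" (2 rows) to files
# 1 and 3, and key "C" (1 row) to files 2 and 3, with two copies per key.
files = {
    1: ["A"] * 8 + ["B"] * 2,
    2: ["A"] * 8 + ["C"],
    3: ["B"] * 2 + ["C"],
}
kept = exclude_largest(files, num_copies=2)
print(sorted(kept))  # [2, 3]
```

Since every key is written to as many distinct files as the division duplication number and only one fewer than that number of files is removed, each key still appears in at least one surviving file; in the sample, all of "A", "B", and "C" remain.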

As described above, when the key duplication counts of the input file 1 are skewed so that a subset of the keys (say, the top L keys) accounts for a large number of rows, division candidates are generated and, after division is complete, the largest divided files are excluded. Because the number of possible combinations of distribution destinations grows rapidly, this makes the probability that all rows identified by the top L keys end up in the same distribution destination small. The maximum size of the divided files is therefore reduced, and file division time is shortened.

The operation of the file division processing unit 2 described above can be built as a program, which can be installed and executed on a computer used as the file dividing device, or distributed over a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

1 Input file
2 File division processing unit (CPU)
3 Divided file
4 Distribution destination information storage unit
5 Distribution destination candidate information storage unit
6 Division number information storage unit
7 Division duplication number information storage unit
8 Memory
21 Division number acquisition unit
22 Division duplication number acquisition unit
23 File output unit
24 Divided file exclusion unit
231 Pointer storage area setting unit
232 Distribution destination determination unit
233 Divided file storage unit

Claims

A file dividing device that divides an input file to be processed, in which duplicate primary keys are permitted, and stores rows in divided files such that rows with the same primary key value have the same distribution destinations, the device comprising:
division number information storage means in which the number of divisions is stored;
division duplication number information storage means in which the division duplication number is stored;
setting means for acquiring the number of divisions from the division number information storage means, acquiring the division duplication number from the division duplication number information storage means, and setting the head position of the input file in a pointer storage area;
file output means for repeating, until the pointer reaches the end position of the input file, a process of reading from the input file the row corresponding to the pointer in the pointer storage area, determining, out of the number of divisions, as many distribution destinations as the division duplication number (where division duplication number < number of divisions) such that rows with the same primary key have the same distribution destinations, distributing the row by making copies according to the division duplication number, and appending the copies to the divided files; and
divided file exclusion means for comparing the sizes (row counts) of the divided files after division is complete and excluding the top N (N = division duplication number - 1) divided files with the most rows.

The file dividing device according to claim 1, further comprising:
distribution destination information storage means having a distribution destination list that stores the distribution destinations for each primary key; and
distribution destination candidate information storage means having a distribution destination candidate list that stores the distribution destination candidates for each primary key,
wherein the file output means includes:
means for reading a row from the storage area and extracting the primary key from the row; and
means for setting the elements 1 through the number of divisions ({1, 2, ..., number of divisions}) as the initial value of the distribution destination candidate list for the primary key, determining as many distribution destinations as the division duplication number by fair selection from the candidates, and storing them in the distribution destination list for the primary key.

A file dividing method that divides an input file to be processed, in which duplicate primary keys are permitted, and stores rows in divided files such that rows with the same primary key value have the same distribution destinations, the method comprising, in a device having division number information storage means in which the number of divisions is stored and division duplication number information storage means in which the division duplication number is stored:
a division number acquisition step of acquiring the number of divisions from the division number information storage means;
a division duplication number acquisition step of acquiring the division duplication number from the division duplication number information storage means;
a setting step of acquiring the number of divisions from the division number information storage means, acquiring the division duplication number from the division duplication number information storage means, and setting the head position of the input file in a pointer storage area;
a file output step of repeating, until the pointer reaches the end position of the input file, a process of reading from the input file the row corresponding to the pointer in the pointer storage area, determining, out of the number of divisions, as many distribution destinations as the division duplication number (where division duplication number < number of divisions) such that rows with the same primary key have the same distribution destinations, distributing the row by making copies according to the division duplication number, and appending the copies to the divided files; and
a divided file exclusion step of comparing the sizes (row counts) of the divided files after division is complete and excluding the top N (N = division duplication number - 1) divided files with the most rows.

The file dividing method according to claim 3, wherein, in the file output step:
a row is read from the storage area and the primary key is extracted from the row; and
the elements 1 through the number of divisions ({1, 2, ..., number of divisions}) are set as the initial value of the distribution destination candidate list for the primary key, as many distribution destinations as the division duplication number are determined by fair selection from the candidates, and they are stored in the distribution destination list for the primary key.

A file division program for causing a computer to function as each means constituting the file division apparatus according to claim 1.