JPS6134626A

JPS6134626A - Method of executing external distribution and sorting

Info

Publication number: JPS6134626A
Application number: JP10768084A
Authority: JP
Inventors: Eugene Emil Lindstrom; Jeffrey Scott Vitter
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1984-05-29
Filing date: 1984-05-29
Publication date: 1986-02-18
Also published as: JPH048814B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application] The present invention relates to a computer-implementable distribution method for externally sorting extremely large files.

[Prior Art] Sorting is a method of arranging items into some order. The items of interest are records, each having an associated key value. The purpose of sorting is to determine a permutation of the input records that puts the key values into nondecreasing order. A sorting operation is characterized by the manner in which the rearrangement is performed, and by whether it takes place entirely within internal memory local to the CPU performing the sort, that is, whether the data fits within the CPU's random-access internal (main) memory.

A "distribution sort" partitions the records into contiguous ranges such that all records in one range have keys smaller than the keys of the records in the next range. A "merge sort", by contrast, combines two or more ordered lists such that the combined list is also ordered. The two-way merge sort is typical: pairs of items are compared and each pair is ordered, the pairs are then merged into ordered quadruples, the quadruples are merged into sorted octuples, and so on until no further merging is possible.

An "external sort" is a sorting technique applied to files of data that exceed the capacity of primary (internal) memory. It relies on secondary storage such as DASD (direct access storage devices), tapes, and drums during the sorting process. In a merge sort, which is one form of external sort, portions of the file are read into internal memory, ordered internally, and then written back to the external (secondary) storage device.

In the technique known as "replacement selection", an intermediate file containing one or more ordered lists (strings) is created from an unordered input file. Replacement selection produces ordered strings of varying length, whose average length is twice the capacity of internal memory. See U.S. Patent No. 2,983,904. An optimal merging of the strings into a single ordered string can be achieved by forming a "minimal merge Huffman tree". The minimal merge tree consists of terminal nodes that represent the string lengths, and it is arranged so that the value of the merge tree is as small as possible.

Most external sorting methods used for data stored on disk devices are merge-based. As noted above, they require that replacement selection be used to generate a number of initial sorted strings, which are then merged repeatedly until only one string remains. U.S. Patent No. 4,210,961, "Sorting System", describes a typical merge-based external sorting method.

Associative storage devices, meanwhile, have been used in both sorting and searching applications. Chang et al., "Associative Search Bubble Devices for Content Addressable Memories", 18 IBM Technical Disclosure Bulletin, pp. 598-602, July 1975, describes magnetic bubble memory devices as word-wise and bit-wise content-addressable memories. U.S. Patent No. 4,168,535, "Non-Volatile Bubble Domain Memory System", describes a plurality of minor/major loop bit storage arrays. Similarly, Doty et al., "Magnetic Bubble Memory Architectures for Supporting Associative Searching of Relational Databases", 29 IEEE Transactions on Computers, pp. 957-970, November 1980, substantially extends the parallel architecture of magnetic bubble memory to support associative searching in relational databases. C. S. Lin, "Sorting with Associative Secondary Storage Devices", Proceedings of the AFIPS National Computer Conference, 1977, pp. 691-695, describes a computer-implementable method for performing a distribution sort using a histogram of the keys together with associative searching on a head-per-track disk. That method is operable only when the key values are uniformly distributed.

[Problems to Be Solved by the Invention] The conventional external sorting methods described above are merge-based: after a number of initial sorted strings have been generated by replacement selection, the merge must be repeated until only one string remains. Sorting therefore takes a long time, and the methods are difficult to apply to very large files.

[Means for Solving the Problems] It is an object of the present invention to devise an external sorting method that is applicable to extremely large files accessed associatively on secondary storage and that substantially reduces the elapsed time of the sort function.

This object is achieved by performing a distribution sort. The data to be rearranged comprises keyed records that are stored on, and accessible through, an associative secondary storage device. The method of the invention uses random sampling of a predetermined number of keys and an internal sort of the sampled keys. In a single pass it builds a histogram of the keys over ranges of key values and, by combining adjacent sparsely populated ranges, forms partitions of records of roughly equal size, each of which fits in the CPU's main memory. It then associatively retrieves all records whose keys lie within a given range and sorts the retrieved records internally.

The method can be applied at very high speed regardless of the initial order of the keys or the distribution of the key values. After the sampled keys have been sorted internally, a sequential pointer list is built that points to the keys in sorted order. Accordingly, building the histogram of keys and the key ranges that partition the records includes performing, for each key, a binary search in internal memory over the pointer list to identify the range to which the key belongs. The distribution method can be advantageously optimized for associative storage devices, as opposed to conventional plain DASD.

[Embodiment] The method of the invention can be executed on a computer system comprising at least one CPU, each CPU having internal memory, I/O channels, control units, direct access storage devices, and other I/O devices attached to them. Such a system is described in U.S. Patent No. 3,400,371, "Data Processing System".

The system described in that patent includes, as resources, all of the facilities of either the computer system or the operating system running on it that are required for the execution of a process.

Typical resources include internal memory, I/O devices, a CPU, data sets, and control or processing programs.

"Sort", as a set of functions, can be invoked by any executing application process or by a database management system. Typically, the invoking function calls the sort by file name. The operating system locates the file and, if the file does not fit in internal memory, invokes the "external sort" function.

The method of the invention requires an associative secondary storage device. Such a device can be realized either by a large DASD with logic-per-track capability or by magnetic bubble memory (MBM).

The method is described with reference to three phases: (1) the sample phase, (2) the range (bucket) formation phase, and (3) the internal sort phase.

In the sample phase, a predetermined number of key values are sampled at random, and the sampled keys are sorted internally. In the range formation phase, a single pass over the file counts how many records belong to each range defined by the sorted sample. Using this histogram, adjacent ranges are combined into larger ranges, each containing approximately as many records as fit in internal memory. Finally, in the internal sort phase, for each range in increasing key order, an associative search of the file is performed to retrieve into internal memory all records whose key values lie in that range. The records are then sorted internally and appended to the output file.
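The three phases can be sketched end to end in a few lines of Python. This is only an illustration under simplifying assumptions: the "file" is an in-memory list of bare keys, the associative search is simulated by a list comprehension, and the names `distribution_sort`, `sample_size`, and `capacity` are hypothetical rather than taken from the patent.

```python
import bisect
import random


def distribution_sort(records, sample_size, capacity, seed=0):
    """Three-phase distribution sort over an in-memory list of keys."""
    rng = random.Random(seed)
    # Sample phase: draw keys at random and sort the sample internally.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Range formation, counting: one pass over the file builds the histogram.
    counts = [0] * (len(sample) + 1)
    for key in records:
        counts[bisect.bisect_left(sample, key)] += 1
    # Range formation, combining: group adjacent ranges up to `capacity` records.
    bounds, running = [], 0
    for i, count in enumerate(counts):
        if running and running + count > capacity:
            bounds.append(sample[i - 1])  # upper bound of the group just closed
            running = 0
        running += count
    # Internal sort phase: one simulated range query and one sort per group.
    output = []
    for j in range(len(bounds) + 1):
        group = [k for k in records if bisect.bisect_left(bounds, k) == j]
        output.extend(sorted(group))
    return output
```

Because every key in one group is no larger than any key in the next, concatenating the sorted groups yields the fully sorted file without any merge step.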

Also described in detail are an analysis of the method, modifications that improve its time and space efficiency, the use of magnetic bubble memory as the associative secondary storage, and improvements to the efficiency of the internal sort phase.

In this description it is first assumed that each record has a fixed length of R bytes, of which the key field occupies K bytes (K ≤ R). The CPU has an internal memory of M bytes, and the file to be sorted consists of N records. The number F of records that can be sorted internally is approximately M/R. Finally, the number S of keys sampled is approximately M/K.

Sample phase

A sample of S keys (where S is approximately M/K) is taken using a random sampling method. The choice of the value of S, which also determines the amount of internal memory M required, is central to the method and is discussed below. The sample is held in internal memory in a balanced tree, such as an AVL tree or an RB tree. See D. E. Knuth, "The Art of Computer Programming", Volume 3: "Sorting and Searching", Addison-Wesley, 1973. Sorting by means of a balanced tree allows the sort time to overlap partially with the sampling. For completeness, a "tree" is a connected graph with no cycles.

Further, a "directed tree" is a directed graph that contains no cycles and no alternate paths. A directed tree has a unique node (the root) whose set of descendants consists of all the other nodes.

In the present invention, each node of the balanced tree requires storage space for a key and two pointers used for sorting. The terms ancestor and descendant are used to describe the relationships between nodes. Thus the key value of a left descendant is less than or equal to the key value of its parent, and likewise the key value of a right descendant is greater than or equal to the key value of its parent. An example of a balanced tree of keys and pointers is shown in FIG. 1. The last part of the sample phase requires an in-order traversal of the balanced tree to build a sequential list of pointers to the keys of the tree in ascending order. The pointer list corresponding to the balanced tree of FIG. 1 is shown in FIG. 2.
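The in-order traversal that turns the tree into a sorted pointer list can be illustrated as follows. This is a minimal sketch, not the structure of FIG. 1: for brevity it uses a plain binary search tree without the AVL or RB rebalancing the text calls for, and the names `Node`, `insert`, and `pointer_list` are hypothetical.

```python
class Node:
    """One sampled key plus the two child pointers described in the text."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into a binary search tree (no rebalancing); return the root."""
    new = Node(key)
    if root is None:
        return new
    node = root
    while True:
        if key <= node.key:  # left descendants are <= their parent
            if node.left is None:
                node.left = new
                return root
            node = node.left
        else:  # in this sketch, right descendants are strictly greater
            if node.right is None:
                node.right = new
                return root
            node = node.right


def pointer_list(root):
    """In-order traversal returning node references in ascending key order."""
    result, stack, node = [], [], root
    while stack or node is not None:
        while node is not None:  # descend as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()
        result.append(node)  # a "pointer" to the node, not a copy of the key
        node = node.right
    return result
```

The traversal is iterative so that a deep, unbalanced tree cannot exhaust the recursion limit; with a properly balanced tree a recursive version would work equally well.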

As stated above, this step comprises random sampling of S keys and an internal sort of the sampled keys. In an associative store in the form of magnetic bubble memory, each record is stored in a random array of the MBM. The arrays are shown in FIG. 3 and are described below.

Sampling is performed while the records are being written to the MBM. If the records of the input file already reside in the MBM, they are assumed to be stored in random arrays. If they do not, they must be scrambled randomly among the arrays.

Sampling in such a device is performed as follows.

First, the number of records to be selected as samples from each array is chosen at random according to a multinomial distribution. Methods for generating independent multinomial random variables are described in D. E. Knuth, "The Art of Computer Programming", Volume 2: "Seminumerical Algorithms", Addison-Wesley, 2nd edition, 1981. A sequential random sampling method is then used to select the desired number of keys from each array; see J. S. Vitter, "Faster Method for Random Sampling", Technical Report CS-82-21, Brown University, August 1982. The sampled key values are then read into internal memory and inserted into the balanced tree as described above.
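The two sampling steps can be sketched as below. The assumptions are loud ones: each array is modelled as a Python list of keys, the multinomial draw is done naively by choosing an array for each sample with probability proportional to its size, and the sequential step uses textbook selection sampling (one random draw per record) rather than the faster method of the cited report. All function names are hypothetical.

```python
import random
from collections import Counter


def multinomial_counts(sizes, total_samples, rng):
    """Samples per array, drawn with probability proportional to array size."""
    picks = rng.choices(range(len(sizes)), weights=sizes, k=total_samples)
    tally = Counter(picks)
    return [tally[j] for j in range(len(sizes))]


def sequential_sample(keys, wanted, rng):
    """Selection sampling: choose `wanted` keys in one sequential pass."""
    wanted = min(wanted, len(keys))  # cannot take more than the array holds
    chosen = []
    for seen, key in enumerate(keys):
        remaining = len(keys) - seen
        still_needed = wanted - len(chosen)
        # Select this key with probability still_needed / remaining.
        if rng.random() * remaining < still_needed:
            chosen.append(key)
    return chosen


def sample_keys(arrays, total_samples, seed=0):
    """Draw about `total_samples` keys spread over all arrays."""
    rng = random.Random(seed)
    counts = multinomial_counts([len(a) for a in arrays], total_samples, rng)
    sample = []
    for keys, wanted in zip(arrays, counts):
        sample.extend(sequential_sample(keys, wanted, rng))
    return sample
```

Selection sampling always returns exactly the requested number of keys from an array, because the selection probability reaches 1 once the records left equal the samples still needed.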

Range (bucket) formation phase

This phase is divided into a counting subphase and a combining subphase. The sampled key values that were sorted in the sample phase serve to partition the file into S+1 ranges. Let the sorted key values be denoted X_1, X_2, ..., X_S, and let X_0 be set to −∞ and X_{S+1} to +∞. For each value i in the closed interval from 1 to S+1, the i-th range is defined as the set of records whose key values lie in the half-open interval X_{i-1} < key ≤ X_i.

Each range contains an average of N/S records, with a standard deviation of approximately N/S. The value of S is chosen so that N/S is much smaller than F, the number of records that can be held in internal memory; this is discussed below.

The sorted order of the keys is represented by the sequential pointer list built in the last part of the sample phase.

In essence, the counting subphase consists of processing each key in the associative store and performing either a Fibonacci search or a uniform binary search on the pointer list to determine which of the small ranges contains the key.

The count for that range is then incremented by 1.

Since the pointer fields of the balanced tree are no longer needed, the count for each range can be stored in the tree, in a pointer field of the key that defines the upper bound of the range. This holds for every range except the highest, whose count requires a separate storage location.

The sequence of steps in the counting subphase is described with reference to the Pascal-like pseudocode of Table I.

Table I: Counting subphase of the range formation phase

    {Initialize the counts}
    for i := 1 to S+1 do P[i].count := 0;
    {Process each key and increment its range count}
    for each key in the file do
      begin
        Perform a binary search on the pointer list P in order to find
        the value i such that P[i-1].key < "key value" <= P[i].key;
        if P[i].count < V then P[i].count := P[i].count + 1
      end;

Here the pointer list built in the last part of the sample phase is denoted P. Each element of P is the address of a record having a key field and a count field. For each i between 1 and S, P[i].key is the i-th sampled key in sorted order. P[S+1].key is taken to be +∞, an arbitrary number larger than the largest value a key can assume. The maximum value V that can be stored in P[i].count, for each i, is greater than F, the number of records that can fit in internal memory during the internal sort phase.
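A runnable rendering of the counting subphase might look like the following. It is a sketch under assumptions: the pointer list is replaced by a plain sorted list of sampled keys, the counts live in a separate list rather than in reused pointer fields, and `count_ranges` and its parameters are hypothetical names. The saturation at `v` mirrors the `P[i].count < V` guard in the pseudocode.

```python
import bisect


def count_ranges(file_keys, sorted_sample, v):
    """Return counts[1..S+1] (index 0 unused), each saturating at v."""
    s = len(sorted_sample)
    counts = [0] * (s + 2)
    for key in file_keys:
        # Binary search for i with X[i-1] < key <= X[i].
        i = bisect.bisect_left(sorted_sample, key) + 1
        if counts[i] < v:  # never count past the largest storable value
            counts[i] += 1
    return counts
```

Saturating the counter loses no useful information, because any range whose count exceeds F is an overflow range regardless of its exact size.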

The purpose of the counting subphase is to set P[i].count, for each i between 1 and S+1, to the number of records whose key values lie in the interval P[i-1].key < key value ≤ P[i].key if that number is at most V, and to the value V otherwise.

In the combining subphase, adjacent ranges are grouped together to form larger ranges. Each grouping combines as many small ranges as possible, subject to the resulting combined range containing at most F records. In most cases each resulting range contains approximately F records. A range that contains more than F records is called an "overflow range". It is established below that the mean and the standard deviation of the number of overflow ranges are less than 1. The keys that delimit the partitions are output sequentially to conventional disk or tape if there is no room for them in internal memory. For an overflow range, a count or a special marker is output along with the key that defines the upper bound of the range.

The manner in which the combining subphase is carried out is described with reference to the Pascal-like pseudocode sequence of Table II.

Table II: Combining subphase of the range formation phase

    ...
    end;

In this Pascal-like sequence, the value P[i].key for each i between 1 and S is the i-th sampled key value in sorted order. P[0].key := −∞ and P[S+1].key := +∞ are taken to be numbers respectively smaller and larger than any value a key can assume. The maximum number that can be stored in P[i].count for each i is denoted V.

For i between 1 and S+1, the value of P[i].count is the smaller of V and the number of records whose key values satisfy P[i-1].key < key value ≤ P[i].key. The value of V is greater than F, the number of records that can fit in internal memory during the internal sort phase. To simplify programming, an extra count field P[S+2].count is assumed to exist.
S+2] count.

The output of the combining subphase is a list of the key values that delimit the combined ranges.
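Since the listing for the combining subphase is not legible here, the following is only one plausible greedy rendering of the prose description, not a reconstruction of the patent's table. It assumes 0-based small ranges, represents the unbounded top range with an upper bound of `None`, and uses the hypothetical names `combine_ranges` and `capacity` (the latter standing in for F).

```python
def combine_ranges(sorted_sample, counts, capacity):
    """Greedily merge adjacent ranges; return [(upper_key, is_overflow), ...].

    counts[i] is the record count of the i-th small range (0-based, S+1
    entries), whose upper bound is sorted_sample[i] for i < S.
    """
    groups, running = [], 0
    for i, count in enumerate(counts):
        if running and running + count > capacity:
            # Close the current group at the upper bound of range i-1.
            groups.append((sorted_sample[i - 1], running > capacity))
            running = 0
        running += count
    groups.append((None, running > capacity))  # last group is unbounded above
    return groups
```

A single small range that alone exceeds the capacity ends up as a group of its own and is flagged as an overflow range, which matches the marker the text says is written alongside its upper-bound key.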

If the records are of variable rather than fixed length, the method can be modified as follows. In the counting subphase, the count for each range records the total space occupied by the records in the range rather than their number. In the combining subphase, as many adjacent ranges as possible are combined, subject to the records of each resulting range fitting in internal memory.

Internal sort phase

In this phase the keys that delimit the ranges are processed sequentially. For each range, the records in it are retrieved by associative search, sorted, and then appended to the output file. If a range is not an overflow range, all of its records fit in internal memory and can be sorted internally. Replacement selection is used to sort the records within a range, because it does not need to be reinitialized for each range. As soon as the records of one range have been retrieved from secondary storage, retrieval of the records of the next range begins. This relies on the facts that all key values in the first range are less than or equal to all key values in the second range, and that the records of a range fit in the replacement selection tree. In the rare event that a range is an overflow range, it can be sorted either by applying the method of the invention recursively or by using a conventional sort-merge. As noted above, the number of such overflow ranges is almost always less than 1.
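The point about not reinitializing between ranges can be illustrated with a single priority queue shared by all ranges. This sketch borrows the run-number tag used in replacement selection: each key is stored with the number of its range, so keys of the next range can be loaded while the previous range is still being written out. It is an assumption-laden illustration, not the patent's selection tree; `range_query(lo, hi)` is a hypothetical stand-in for the associative device, and overflow ranges are not handled.

```python
import heapq


def internal_sort_phase(range_query, bounds):
    """Yield all keys in ascending order, one range query per range.

    range_query(lo, hi) returns the keys k with lo < k <= hi, where None
    means unbounded. bounds are the upper keys of all ranges but the last.
    """
    heap = []
    lo = None
    for j, hi in enumerate(list(bounds) + [None]):
        # Load range j; its tag keeps it behind anything left from range j-1.
        for key in range_query(lo, hi):
            heapq.heappush(heap, (j, key))
        # Write out the previous range while range j is already resident.
        while heap and heap[0][0] < j:
            yield heapq.heappop(heap)[1]
        lo = hi
    while heap:  # drain the final range
        yield heapq.heappop(heap)[1]
```

Because tuples compare by range number first, the output order stays correct even though two ranges share the heap at the same time.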

Selection of the sample size S and the internal memory size M

The number S of sampled keys is chosen so as to minimize the amount of internal memory required, subject to the objective that the mean and the standard deviation of the number of overflow ranges be less than 1. The values of M and S are related by the following equation.

(1)  S = (M − 2B) / (K + 3P)

The values of M and S can be computed by solving two equations in the two variables M and r, where r = F(S+1)/(N+1). The first equation, which guarantees that the above objective can be achieved, is as follows.

(2)  [equation not legible in the source text]

Here B and B' are the sizes of the input and output buffers, and P is the number of bytes per pointer field. The second equation is as follows.

(3)  (M − 2B + K + 3P)(M − 2(B + B')) = r(N + 1)(R + P)(K + 3P)

This equation guarantees that the S sampled keys can be stored internally. The approximate value of M obtained by solving these two equations is as follows.

If more internal memory is available than the value of M computed above, say kM bytes, the total sort time is reduced by roughly a factor of k.
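Equations (1) and (3) are simple enough to evaluate directly. The sketch below computes S from equation (1) and rearranges equation (3) to give r for a chosen M. The numeric values used in the usage note are invented purely for illustration, and the function and parameter names are hypothetical.

```python
def sample_size(m, b, k, p):
    """Equation (1): S = (M - 2B) / (K + 3P), rounded down to whole keys."""
    return (m - 2 * b) // (k + 3 * p)


def ratio_r(m, b, b_prime, k, p, record_bytes, n):
    """Equation (3) solved for r, given a trial memory size M."""
    numerator = (m - 2 * b + k + 3 * p) * (m - 2 * (b + b_prime))
    denominator = (n + 1) * (record_bytes + p) * (k + 3 * p)
    return numerator / denominator
```

For example, with M = 1,000,000 bytes, B = B' = 4,096, K = 8, P = 4, R = 100, and N = 10,000,000, `sample_size` gives 49,590 sampled keys and `ratio_r` gives a value of roughly 47.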

Associative secondary storage implemented in magnetic bubble memory

An associative secondary storage device processes queries, called associative searches or range queries, of the following form (written in programming-language style).

    Given values a and b, retrieve all records such that a < "key value" ≤ b

This query is made once for each range during the internal sort phase of the method, which yields a substantial reduction in execution time. Details of the use of magnetic bubble memory as associative secondary storage in database applications are given in the Chang, Doty, and Lin references cited above.
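In software terms, the range query is a filter over the stored records. The sketch below simulates it on an in-memory list and assumes the lower bound is exclusive and the upper bound inclusive, in line with the half-open ranges defined earlier. The name `range_query` and the record layout (key first) are hypothetical.

```python
def range_query(records, a, b, key=lambda rec: rec[0]):
    """Return every record whose key satisfies a < key <= b."""
    return [rec for rec in records if a < key(rec) <= b]
```

On an associative device this scan happens inside the storage unit, so only the matching records travel to the CPU; the list comprehension merely models the result.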

FIG. 3 shows a magnetic bubble memory (MBM) secondary storage device comprising a plurality of components arranged in a hierarchy of several levels. At the highest level, the memory consists of several boxes, each with a capacity of several hundred megabytes. A box is divided into boards, each of which contains several modules of several chips each. The basic unit of storage is called an array. FIG. 3 shows several arrays 15, 17, 19. Each array has a plurality of magnetic bubble loops 21, 23, 25, 27. It is desirable for each array to have a large storage capacity. A RAM buffer 3 is associated with the arrays and serves as a read or write buffer for several magnetic bubble arrays.

Approximately 2000 bits of RAM are used for each 1-megabit array. An associative search is performed by transferring each record from array 15, 17, or 19 to RAM buffer 3. A microprocessor 33 coupled to RAM 3 selects all records whose key values satisfy the range query and transfers them from path 2 to path 1. The paths 30 and 32 that couple, for example, array 15 to RAM 3 are driven at a rate of 1 megabit per second.

As explained below, assuming that array 15 can deliver data from its loops to the read shift register (RSR) at this rate, the entire contents of the array can be searched in 1 second. Since all arrays can be searched in parallel, the total time per associative search is advantageously 1 second.

A typical 1-megabit array 15 consists of up to 1000 synchronized minor loops 21, 23, 25, 27. Each loop contains magnetic bubbles coded to represent 1000 bits, which circulate around the loop in response to a varying magnetic field.

Records are stored across the minor loops; that is, the bits of a record are stored at the same relative position in the loops, one bit per loop.

Records containing more than 1000 bits are stored either by spreading them across several arrays or by dividing them into contiguous 1000-bit sections. In either case, each bit of a section is stored in a separate minor loop at the same relative position.

Assuming that records are extended across several arrays where necessary, all bits of a record are synchronized and arrive at the same point in their loops simultaneously. Each read access of the small loops is considered non-destructive.

A read is performed by copying each bit and loading the copied bits into a 1000-loop read buffer. If read shift register RSR 29 is empty and available, the bits can be loaded into the RSR. The bits are then shifted serially past read head 9.

Write operations are performed in a similar manner using write shift register WSR 31.

The time required for one circulation of a small loop equals the time taken to move 1,000 bits through the RSR past the read head, which is approximately 0.001 second. By the time the shift out of the RSR is complete, the 1000-loop read buffer has had time to load the next record. The entire file can therefore be loaded into the RSR and read in 1000 × 0.001 = 1 second. This corresponds to the time taken to transfer 1 megabit from array 15 to RAM 3. The time per associative search is therefore 1 second.

If two arrays, such as 15 and 17, share a single read path 30, the associative search time becomes 2 seconds rather than 1 second. An associative search time of about 1 to 2 seconds has been found to be quite adequate for sorting performance. Using the marking technique described below, the effective associative search time can be reduced to a fraction of a second.
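The timing argument above can be checked with a short calculation. The following is only a sketch of that arithmetic: the function name and parameters are hypothetical, and the 0.001 second per record and 1000 records per array are the figures quoted in the text.

```python
def associative_search_seconds(records_per_array: int = 1000,
                               seconds_per_record: float = 0.001,
                               arrays_per_read_path: int = 1) -> float:
    """Estimate the time for one full associative search.

    Every record position in every array sharing a read path must be
    shifted out once, so the time grows linearly with both factors.
    """
    return records_per_array * seconds_per_record * arrays_per_read_path


# One array per read path, then two arrays sharing a read path.
single = associative_search_seconds()
shared = associative_search_seconds(arrays_per_read_path=2)
```

Under these assumptions, `single` comes to about 1 second and `shared` to about 2 seconds, matching the figures given in the description.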

[Optimizing modifications] Economies can be achieved both in execution time and in the amount of internal memory needed to support the external distribution sort of the invention. The method broadly comprises three steps: (1) random sampling of S keys and internal sorting of the S sampled keys; (2) forming, in a single pass, equal-size partitions of records, each of which fits into internal CPU memory and constitutes a range of key values; and (3) performing an associative retrieval of all records whose keys lie within a range and sorting those records internally.
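The three steps can be illustrated with a small in-memory simulation. This is a simplified sketch, not the patented procedure: the function name and parameters are hypothetical, a Python list stands in for the file, a linear scan stands in for the associative search, and the sorted sample is used directly as range boundaries, so the histogram and the guarantee that every partition fits in internal memory are omitted.

```python
import bisect
import random


def external_distribution_sort(records, sample_size, key=lambda r: r, seed=0):
    """Simulate sampling, range formation, and per-range internal sorting."""
    rng = random.Random(seed)

    # Step 1: randomly sample keys and sort the sample internally.
    picked = rng.sample(records, min(sample_size, len(records)))
    sample = sorted(key(r) for r in picked)

    # Step 2 (simplified): distinct sampled keys become range boundaries.
    splitters = sorted(set(sample))

    # Step 3: collect the records of each semi-closed range (a, b] and
    # sort each range internally; ranges are already in key order.
    buckets = [[] for _ in range(len(splitters) + 1)]
    for r in records:
        buckets[bisect.bisect_left(splitters, key(r))].append(r)

    out = []
    for bucket in buckets:
        out.extend(sorted(bucket, key=key))
    return out
```

Because the ranges are disjoint and ordered, concatenating the individually sorted ranges yields the fully sorted file without any merge pass.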

[Assuming a uniform distribution of keys] The random sampling and internal sorting of the sampled keys can be sped up by the fast partitioning that is possible when the keys are uniformly distributed. This is achieved by carrying out the following processing in a reserved portion of internal memory. First, the range of 256^K values that a key can take is divided into approximately cNR/M intervals of equal size, where c > 1 is a constant. As each key is processed, the counter for its interval is incremented. If the first few bytes of each key yield a nearly uniform partition, the pass over the file in the range formation phase can be omitted and replaced by the uniform interval counts.
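The interval counting described above might look as follows. This is a minimal sketch under stated assumptions: keys are byte strings, only the first `prefix_bytes` bytes are examined, and the function name and parameters are hypothetical rather than taken from the patent.

```python
def interval_counts(keys, num_intervals, prefix_bytes=2):
    """Count keys per equal-width interval of the leading-byte key space."""
    space = 256 ** prefix_bytes          # number of distinct prefixes
    counts = [0] * num_intervals
    for k in keys:
        # Interpret the leading bytes as a big-endian integer,
        # padding short keys with zero bytes.
        prefix = int.from_bytes(k[:prefix_bytes].ljust(prefix_bytes, b"\x00"), "big")
        # Map the prefix to its interval and bump that counter.
        counts[prefix * num_intervals // space] += 1
    return counts
```

If the resulting counts are close to equal, the interval boundaries themselves can serve as range boundaries, which is the situation in which the text says the extra pass over the file can be skipped.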

[Marking technique to assist internal sorting] In the step of associatively retrieving all records whose key values lie within a range and sorting them internally, the search described above required 1 to 2 seconds. However, the path coupling the magnetic bubble memory (MBM) associative store to the CPU can transfer M bytes (one load of internal memory) in only a fraction of that time. Approximately one record in every NR/M records in the file is known to belong to a given range. Most of the work in retrieving the records of a given range therefore consists of processing records that do not satisfy the range query. Retrieval can be sped up by avoiding the examination of most of the records. It is advantageous to use one extra bit per record, stored in the MBM and used as a "mark bit". A record may be either marked or unmarked. The associative memory architecture must be modified so that only marked records are transferred from the large loop to read shift register 29. A query to obtain the keys within a given range of key values takes the following form:

Given values a and b, retrieve all marked records in the semi-closed range a < "key value" ≤ b.
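The marked range query can be modeled in a few lines. This is only an illustrative sketch: the `Record` class and `range_query` function are hypothetical names, the semi-closed range is read as a < key ≤ b, and a list comprehension stands in for the hardware search.

```python
from dataclasses import dataclass


@dataclass
class Record:
    key: int
    marked: bool = True  # the one extra "mark bit" per record


def range_query(records, a, b):
    """Return marked records whose key lies in the semi-closed range (a, b]."""
    return [r for r in records if r.marked and a < r.key <= b]


store = [Record(2), Record(5), Record(5, marked=False), Record(9)]
hits = range_query(store, 2, 5)  # only the marked record with key 5 qualifies
```

Unmarked records never reach the comparison in the modified hardware; in this model they are simply filtered out first.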

For purposes of illustration, assume that there are K−1 ordered key values that divide the file into K regions of approximately equal size. The K−1 ordered key values are obtained during the combining subphase of the range formation phase. The internal sort phase can be divided into K consecutive subphases.

Here the i-th subphase comprises the retrieval and sorting of each range lying in the i-th region formed by the K−1 adjacent key values. If, at the start of the i-th subphase, all records of the i-th region are marked and all other records are unmarked, then approximately N/K records are marked at any one time, so the number of records processed in each range retrieval is reduced by a factor of K. Each of the K processing steps requires one complete associative search. The total time for these steps is negligible.

A further improvement is possible if, during each associative search, the mark bit of every record satisfying the range query is turned off. At the start of each subphase approximately N/K records are then marked, and the number of marked records decreases linearly until it reaches zero at the end of the subphase. This is shown in FIG. 4. Consequently, the average number of records processed in each associative search becomes approximately N/(2K).
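The N/(2K) figure follows from averaging a linearly decreasing count. The sketch below reproduces that average under an added assumption that is not in the text: every region contains the same number of equal-size ranges (`ranges_per_region`, a hypothetical parameter).

```python
def average_marked_per_search(n_records, k_regions, ranges_per_region):
    """Average number of marked records seen per search within one subphase."""
    per_region = n_records / k_regions            # marked at subphase start
    per_range = per_region / ranges_per_region    # unmarked by each search
    # Marked records still present at the start of each successive search.
    marked_at_search = [per_region - j * per_range
                        for j in range(ranges_per_region)]
    return sum(marked_at_search) / ranges_per_region


n, k = 1_000_000, 10
avg = average_marked_per_search(n, k, ranges_per_region=100)
```

With these numbers the average is within about one percent of N/(2K), and the gap shrinks as the number of ranges per region grows.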

Note that the range query can be rewritten as the following query, which requires only one comparison instead of two:

Given value b, retrieve and unmark all marked records such that "key value" ≤ b.

This form of query is equivalent to the previous one because, by the time the range (a < "key value" ≤ b) is processed, the marks have already been removed from the records whose key values are less than or equal to a.
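The equivalence of the one-comparison query can be seen in a small model. This is a sketch only: `retrieve_and_unmark` is a hypothetical name, parallel lists stand in for the stored keys and mark bits, and the upper bounds are assumed to be issued in ascending order.

```python
def retrieve_and_unmark(keys, marked, b):
    """Return indices of marked records with key <= b and clear their marks."""
    hits = [i for i, k in enumerate(keys) if marked[i] and k <= b]
    for i in hits:
        marked[i] = False  # later queries will skip these records
    return hits


keys = [5, 1, 9, 3, 7]
marked = [True] * len(keys)
# Upper bounds issued in ascending order: 3, then 7, then 9.
first = retrieve_and_unmark(keys, marked, 3)   # keys <= 3
second = retrieve_and_unmark(keys, marked, 7)  # effectively 3 < key <= 7
third = retrieve_and_unmark(keys, marked, 9)   # effectively 7 < key <= 9
```

Each call returns exactly the records of one semi-closed range, even though only the upper bound is compared, because records at or below the previous bound were already unmarked.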

[Effects of the Invention] Although preferred embodiments of the invention have been described, it should be understood that various modifications may be made in accordance with the principles of the invention. For example, experience has shown that it is preferable for RAM 3 to have extra storage for additional mark bits or for forming a queue of addresses. Furthermore, utilization of the MBM associative store can be maximized either by using very small physical records in the MBM, so that each logical record of the input file spans several physical records, or by using very large physical records, so that each physical record contains several logical records. The marking technique described above is more efficient when the physical record size is small relative to the logical record size.

If the time per associative search and the final output time for each stored range are sufficiently short, the running time of the internal sort phase is dominated by CPU time rather than by input/output time. In that case, the internal sort phase can be made much faster by allocating roughly twice as much internal memory and dividing it into two halves. The internal memory capacity increases to M' = 2(M − 2(B + B')), where M is the previous internal memory capacity.

Note that because the internal sorting of each range is static rather than dynamic, a static internal sorting method such as quicksort or radix sort, which is roughly twice as fast as replacement selection, can be used in place of replacement selection. CPU time for the internal sort phase is thereby reduced by approximately 50 percent.

Currently, large databases are stored on DASD in random-access fashion, using hashing and indexing techniques to speed up search times. With the MBM technique described above, a database can be stored in the MBM and benefit from faster random access. The same hashing and indexing techniques can be used. A database system of this kind can process transactions much faster than one stored on DASD.

Furthermore, a database stored in the MBM can execute general relational database queries quickly, on the order of one second, by performing associative searches. This could not be achieved with the database stored on DASD. In other words, the MBM can be used either as a random access device or as an associative device, whichever is more efficient for the current transaction or query. Sorting can be regarded as one of the general operations of a relational database.

[Brief explanation of the drawing]

FIG. 1 is a diagram of the balanced tree of sorted sampled keys, including the necessary pointers, created during the sample phase. FIG. 2 is a diagram of the pointer list created during the sample phase. FIG. 3 is a diagram of a typical associative memory card for carrying out the associative search of the method of the invention. FIG. 4 is a diagram of the marking technique as implemented.

3: RAM buffer; 5: error code test; 9: read; 11: write; 15, 17, 19: arrays.

Applicant: International Business Machines Corporation

Claims

A method of performing an external distribution sort for reordering data comprising keyed records accessed from an associative secondary storage device by a CPU having available internal memory, the method comprising: performing a random sampling of S keys and sorting the sampled keys internally within the CPU; forming equal-size partitions in a single pass by obtaining a histogram of the keys and ranges of key values and by combining adjacent low-density ranges into larger ranges, so that each partition fits into the internal memory of the CPU; and performing an associative retrieval of all records whose keys lie within the range of a partition and sorting those records internally.