JPH10105522A - Multicomputer system

Info

Publication number: JPH10105522A
Application number: JP9151811A
Authority: JP
Inventors: Michael R. Krause (マイケル・アール・クラウス)
Original assignee: Hewlett Packard Co
Current assignee: HP Inc
Priority date: 1996-06-27
Filing date: 1997-06-10
Publication date: 1998-04-24

Abstract

PROBLEM TO BE SOLVED: To improve availability, compute performance, and I/O performance in a cluster environment by modifying the STREAMS framework so that distributed streams can be created and deployed across the nodes of the cluster. SOLUTION: The basic architecture of a STREAMS stack is a stream head 30 and a driver 44, with zero or more modules 32 pushed on top of the driver. In addition, at a minimum, a control thread 34 and a physical cluster interconnect driver (P-ICS) 36 are provided. The P-ICS 36 provides a high-speed interconnect link that uses a lightweight, low-latency protocol and comprises both software and hardware components. The interconnect relies on virtual circuits, which can transparently move an instance from one application to another.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to interoperation among networked computers, and more particularly to a distributed operating system used for a cluster of computer nodes.

[0002]

PRIOR ART

A node is a computer consisting of one or more processors, local memory, an I/O subsystem, and optional peripherals (the processors may potentially form an SMP, that is, a symmetric multiprocessor). A cluster is a set of two or more nodes that cooperate at some level to solve a problem. The nodes in a cluster are interconnected using various software and hardware solutions, depending on price/performance requirements. Depending on the implementation, such as shared memory or message passing, a cluster provides: high availability, in which a distributed application keeps running despite a single point of failure; better use of cluster-wide resources through application load balancing; hardware sharing; increased aggregate compute and I/O bandwidth; and increased access to resources such as mass storage and networking.

[0003] STREAMS is a relatively lightweight message-passing framework for implementing networking protocols and client/server applications, and it has become a de facto industry standard. The prior-art STREAMS mechanism is described in S. A. Rago, UNIX System V Network Programming (1993), Chapters 3 and 9. For example, NFS, UDP/IP, TCP/IP, SPX, NetBIOS, SNA, and DLPI are all network protocols that many vendors have implemented using the STREAMS framework. Applications built on these protocols and services can enjoy the cluster benefits listed above, provided that clustering succeeds.

[0004]

PROBLEM TO BE SOLVED BY THE INVENTION

Accordingly, there is a need to understand the requirements of the cluster infrastructure and to modify the STREAMS framework so that applications and cluster functions can run in a STREAMS framework environment.

[0005]

MEANS FOR SOLVING THE PROBLEM

An object of the present invention is to enable STREAMS to operate in a multicomputer environment (generally speaking, a cluster environment) in order to solve the problems of large-scale users in such environments, such as high availability, improved compute and I/O performance, and single-point management. In accordance with the invention, the STREAMS framework is modified so that distributed streams can be created and deployed across the nodes of a cluster.

[0006] Another object of the invention is that all of its work, in both user space and kernel space, be transparent to user applications. To this end, the STREAMS framework is modified so that software need not be aware that STREAMS is running in a clustered environment.

[0007] As a means of solving the above problems, the present invention provides a multicomputer system having distributed STREAMS capability. The multicomputer system comprises: a cluster of at least two nodes interconnected via a data communication interconnect subsystem, each node having a computer that includes one or more system processor units, local memory, and an I/O subsystem; an operating system that runs on each of the system processor units in the cluster and includes a STREAMS message-passing mechanism used in implementing one or more networking protocols, client/server applications, and services; a software application that runs on the system processor unit of at least one of the nodes, under control of the operating system, to perform a task or solve a problem; means for creating, on a first, originating node of the plurality of nodes, a distributed STREAMS instance that is independent of the application and of the STREAMS message-passing mechanism; and means for migrating at least part of the distributed STREAMS instance to a second, target node of the plurality of nodes, so that a selected task of the software application executes on the second node transparently to the networking protocols, client/server applications, and services on the first node.

[0008]

EMBODIMENTS OF THE INVENTION

I. Overview of Embodiments

A cluster is a set of computers of any size interconnected via high-speed communication links. The cluster itself can also be connected to non-cluster nodes via standard network links, such as FDDI, ATM, or 100VG hardware links and drivers, installed on a subset of cluster nodes that provide gateway services. These standard networking links can use standard networking protocols such as TCP/IP for communication, whereas the cluster interconnect preferably uses a purpose-built lightweight protocol.

[0009] A node is a computer consisting of one or more processor units, local memory, an I/O subsystem, and optionally peripherals such as mass-storage devices, and it is used to solve problems. A distributed stream is a stream split between two nodes at a predetermined point, typically between a module and a driver. For example, the TCP/IP protocol can be split so that TCP or UDP resides on one node and IP on the other. The rules that determine the split generally depend on what state information must be maintained, where it resides, and performance considerations. The description below uses TCP/IP as an example, for ease of understanding and because its implementation is common.
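The split described in [0009] can be pictured with a small user-space model. The following Python sketch is illustrative only and is not kernel code: the names `Placement`, `split_stream`, `node_a`, and `node_b` are hypothetical, and the stack is reduced to a list of component names ordered from the stream head down to the driver.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Placement:
    node: str               # node that hosts this part of the stream
    components: List[str]   # ordered from the stream head down to the driver


def split_stream(stack: List[str], split_after: str,
                 upper_node: str, lower_node: str) -> Tuple[Placement, Placement]:
    """Split `stack` just below `split_after` and place the halves on two nodes."""
    if split_after not in stack:
        raise ValueError(f"{split_after!r} is not in the stack")
    cut = stack.index(split_after) + 1
    if cut == len(stack):
        raise ValueError("the lower half would be empty")
    return Placement(upper_node, stack[:cut]), Placement(lower_node, stack[cut:])


# TCP stays with the application on node_a; the IP driver runs on node_b.
upper, lower = split_stream(["stream_head", "tcp", "ip"], "tcp", "node_a", "node_b")
```

Here the split point is passed in explicitly; in the described system it would follow from the state and performance rules mentioned above.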

[0010] Distributed STREAMS is STREAMS that has any of the following features or attributes.
* The application and the entire STREAM stack can reside on the same node, as in a non-cluster environment, while using the STREAMS-related framework infrastructure to gain cluster benefits such as load balancing and high availability. The STREAM stack can also run on a node different from the one running the application that accesses it.
* A STREAM can be split at the module, driver, or stream head level into multiple components, each running on a different node within the cluster.
* Each endpoint of a STREAMS-based pipe can run on a different node within the cluster.
* A STREAM stack can migrate, in whole or in part, from one node to another.

[0011] The invention implements distributed STREAMS in a cluster using the following system components: a control thread, a set of preview functions, a P-ICS driver, and preferably an S-ICS driver. A thread is a software processing element that provides a set of functionality by executing application instructions on a computer. The control thread is a special thread that serves as a third-party communication and management point for a distributed STREAM (its functions are detailed later). The S-ICS driver is a STREAMS software interconnect driver that moves messages among the application's STREAMS stack, the control thread, the P-ICS, and the preview functions. The P-ICS is the physical cluster interconnect that provides the underlying communication fabric. Preview functions are STREAMS-based functions enabled through the STREAMS Dynamic Function Replacement facility; they inspect messages in order to multiplex them among the various components of the system.

[0012] FIG. 6 shows a distributed stream, the interconnect drivers, and independent middleware software threads. Although the middleware shown in the figure is important in enabling distributed STREAMS, note that the process described is entirely independent of the middleware implementation. This is a major advantage, because it permits cooperation with multiple middleware vendors as long as the communication interface is standardized.

[0013] As shown in FIG. 6, the stream is split across at least two nodes and uses middleware that runs on either of those two nodes or on a third node. The interconnect driver is further divided into an intelligent STREAMS-based driver (S-ICS) and a non-intelligent physical driver (P-ICS), which can be reused for non-STREAMS inter-node communication. The intelligent STREAMS-based driver (S-ICS) includes the protocols for communicating with the middleware and with the kernel daemons that handle STREAMS requests and responses.

[0014] The basic stream-creation algorithm that produces the configuration shown in FIG. 6 is as follows. The algorithm uses a new in-kernel STREAMS interface. The device file is opened through an ordinary open(). Because the file system appears to the application as a single system, no application changes are required, but the correct node on which to perform the open must be determined.

[0015] When the file system performs the open, the VFS (virtual file system) ultimately calls the open routine of the STREAMS framework. Using the framework's autopush and configuration facilities, the interconnect driver is opened transparently instead of the target driver (IP in this case). This is accomplished by substituting the interconnect driver's device number for the target device number, which is saved for later use. The kernel remapping actually takes place when the driver is installed into the system during system boot. The STREAMS framework must be aware that the driver has been configured for clustering. STREAMS then opens the interconnect driver as it normally would any driver. At this point there is only one stream head and one driver instance, both on the node where the application currently resides.
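The remapping step in [0015] can be sketched as a lookup table that is filled in at configuration time and consulted at open time. This Python model is a simplification under stated assumptions: `StreamsFramework`, `configure_for_clustering`, and the integer device numbers 7, 3, and 99 are hypothetical and do not correspond to any real STREAMS interface.

```python
class StreamsFramework:
    """Toy model of the open-time device-number remapping."""

    def __init__(self, interconnect_dev: int):
        self.interconnect_dev = interconnect_dev
        self.remap = {}          # target device number -> interconnect device number
        self.saved_target = {}   # stream id -> original target device number
        self._next_stream = 0

    def configure_for_clustering(self, target_dev: int) -> None:
        # Done once, when the driver is installed at boot.
        self.remap[target_dev] = self.interconnect_dev

    def open(self, dev: int):
        stream_id = self._next_stream
        self._next_stream += 1
        opened = self.remap.get(dev, dev)
        if opened != dev:
            # Remember the real target so the remote open can use it later.
            self.saved_target[stream_id] = dev
        return stream_id, opened


fw = StreamsFramework(interconnect_dev=99)
fw.configure_for_clustering(target_dev=7)      # e.g. the IP driver
clustered_id, clustered_dev = fw.open(7)       # redirected to the interconnect driver
local_id, local_dev = fw.open(3)               # not clustered, opened as usual
```

The application calls `open` with the same device number in both cases; only the framework knows that one of them was redirected.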

[0016] The next step is to perform the open on the node where the actual driver resides. To do this, the saved device number is remapped to a node address. The STREAMS-based interconnect driver communicates with the middleware via the appropriate protocol to determine the target remote node and all relevant communication information. This communication is either performed as part of the interconnect driver's open routine or initiated through a transparent ioctl that tells the driver to look up the information and take the necessary action. Which approach to use is an implementation matter, but it should be kept in mind.

[0017] The interconnect driver then builds a message that is sent to the target node at a well-known address. On that node, an in-kernel daemon or control thread listens for such requests and reacts when a message arrives. The daemon interprets the request and chooses one of three actions: (1) process the request itself, (2) create a thread to handle the request, or (3) hand the request to an existing thread that already multiplexes several target driver instances.
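The three choices open to the daemon in [0017] can be illustrated with a small dispatcher. The text does not say how the daemon chooses among them, so the policy below (hand off if a multiplexing thread already serves the device, handle inline if the request is flagged as lightweight, otherwise spawn a thread) is purely an assumption, as are the names `Action`, `dispatch`, and the `lightweight` flag.

```python
from enum import Enum


class Action(Enum):
    HANDLE_INLINE = 1   # (1) the daemon processes the request itself
    SPAWN_THREAD = 2    # (2) a new thread is created for the request
    HAND_OFF = 3        # (3) an existing multiplexing thread takes the request


def dispatch(request: dict, multiplexers: dict):
    """Return (action, handler name) for an incoming open request.

    `multiplexers` maps a device number to the name of a thread that
    already multiplexes instances of that driver.
    """
    dev = request["device"]
    if dev in multiplexers:
        return Action.HAND_OFF, multiplexers[dev]
    if request.get("lightweight", False):
        return Action.HANDLE_INLINE, "daemon"
    return Action.SPAWN_THREAD, f"worker-{dev}"


existing = {7: "ip-mux"}
handed_off = dispatch({"device": 7}, existing)
inline = dispatch({"device": 8, "lightweight": True}, existing)
spawned = dispatch({"device": 9}, existing)
```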

[0018] At this point, the thread handling the request performs an in-kernel streams_open() on the target node, using the original device number taken from the application's open request. streams_open() opens the IP driver just as if the IP driver resided on the same node as the application. Once IP is open, the stream head address is returned to the interconnect driver on the originating node. Combining this address with the address of the remote node yields an addressing tuple that is unique within the cluster, without the need for a separate naming scheme. This tuple is used later for stream migration and verification. At this point the application must be able to return from open() and continue processing unmodified. Note that the process requires no modification to the user application, the TCP module, or the IP driver. This is a key property for successful clustering.
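The addressing tuple of [0018] is simply the pair (remote node address, stream head address). A minimal Python sketch, with the hypothetical type name `StreamAddress` and made-up node names and addresses, shows why the pair is unique cluster-wide even when two nodes happen to return the same local stream head address.

```python
from typing import NamedTuple


class StreamAddress(NamedTuple):
    node_address: str          # address of the node that performed the open
    stream_head_address: int   # stream head address returned by that node


# Two nodes may return the same local address; the tuple still differs.
on_node_b = StreamAddress("node_b", 0x1000)
on_node_c = StreamAddress("node_c", 0x1000)

# The tuple can key per-stream bookkeeping used later for migration.
registry = {on_node_b: "connection 1", on_node_c: "connection 2"}
```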

[0019] Stream migration deals with moving either or both halves of a split distributed stream to a new node within the cluster. Migration must be transparent to the application and must not require changes to the components of the stream stack. Using the same example as in the creation of the distributed STREAMS configuration above, there are two points of failure. The first is the node where the IP driver resides. If this node fails for any reason, connections passing through it at that moment should neither fail nor become aware of the failure. The interconnect driver recognizes that it has lost its connection to that node and starts a recovery protocol that uses the stream-creation process described above.

[0020] As for the upper half of the stream, it is often necessary to migrate the application for reasons such as load balancing or system maintenance. This type of migration is relatively difficult. The upper half of the stream and the application must be quiesced until the migration is complete, and in the meantime the lower half must not deliver inbound data to the upper half, because the upper half cannot process it and timing or data-corruption problems could result.

[0021] Prior-art implementations that split TCP/IP across nodes depend on modifying the modules involved, which directly affects performance, supportability, and maintenance cost, and increases development time and cost. This is an unnecessary burden, because it limits which modules and drivers can readily run within a cluster. The following process is provided to remove this burden.

[0022] The process for migrating TCP within a distributed STREAMS (TCP is used as an example; the process applies to any module) uses key properties of the STREAMS framework, combined with a new ioctl, to migrate the TCP module transparently: the module is not aware that a migration is taking place, and no changes to the module itself are required. The process requires one function written by the module or driver developer. This function is used to associate data structures and messages with the structure pointed to by q->q_ptr. The details of this function are not needed at this point, but it is an essential step in migrating a stream transparently.

[0023] Basic stream migration proceeds as follows. An independent thread on node A, or the application itself (which must then be designed to migrate itself or its connection from one node to another), issues a new ioctl, interpreted by the STREAMS framework, indicating that a migration should occur. This ioctl contains all the information needed to move the stream to the new node.

[0024] The framework issues an ioctl to the interconnect driver, which notifies its remote driver-handling thread that a migration is taking place and that the stream should be treated as flow-controlled. The framework then freezes the local stack and runs a cleanup function on the q_ptr structure of each module, which is the module's private data structure. The cleanup function returns all the information needed to replicate that structure and its associated memory on the target node.

[0025] Once that information has been gathered, the framework sends it, together with all framework-specific information such as stream head state and messages, to the daemon on the target node. The daemon then either creates the stream stack immediately or waits for the process to be re-established, and then rebuilds the stream according to which node is opening it. For example, if this is an in-kernel application such as NFS, the stream can be fully rebuilt immediately, and the NFS process is notified of its creation and re-initialization. If it is a user-space socket application, reconstruction waits until the process has finished migrating. When the stream stack has been rebuilt, the new interconnect driver instance notifies the remote node that it can now receive messages, and communication resumes. All of this is transparent to the applications, modules, and drivers involved.
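The migration sequence of [0023] to [0025] (freeze the local stack, collect each module's private state, ship it with the stream head state, rebuild on the target) can be sketched in user space as follows. This is a toy model, not the patented mechanism: `Module`, `Stream`, `export_state`, `migrate_out`, and `rebuild` are hypothetical names, `export_state` stands in for the developer-supplied function that operates on q_ptr, and the transfer between nodes is reduced to passing a dictionary.

```python
import copy


class Module:
    def __init__(self, name: str, private_state: dict):
        self.name = name
        self.q_ptr = private_state   # module-private data, as q->q_ptr would point to

    def export_state(self) -> dict:
        # Stand-in for the developer-supplied function: returns everything
        # needed to recreate the private structure on another node.
        return copy.deepcopy(self.q_ptr)


class Stream:
    def __init__(self, modules: list, head_state: dict):
        self.modules = modules
        self.head_state = head_state
        self.frozen = False


def migrate_out(stream: Stream) -> dict:
    """Freeze the local stack and collect the state to send to the target node."""
    stream.frozen = True
    return {
        "head_state": copy.deepcopy(stream.head_state),
        "modules": [(m.name, m.export_state()) for m in stream.modules],
    }


def rebuild(snapshot: dict) -> Stream:
    """Run by the daemon on the target node to recreate the stream stack."""
    modules = [Module(name, state) for name, state in snapshot["modules"]]
    return Stream(modules, snapshot["head_state"])


old_stream = Stream([Module("tcp", {"seq": 42})], {"queued": ["msg1"]})
snapshot = migrate_out(old_stream)
new_stream = rebuild(snapshot)
```

Note that the module class itself has no migration logic beyond the one state-export function, which mirrors the claim that modules need no other changes.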

[0026] Similarly, if the node on which the IP driver runs fails, the user application must continue operating unmodified. To make this possible, the interconnect driver must recognize that a failure has occurred and re-establish its connection on a new node via the open process described above.
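The recovery path of [0019] and [0026] amounts to repeating the remote open on a surviving node with the saved device number. In this Python sketch the class `InterconnectDriver`, the callback `remote_open`, the node names, and the rule of picking the first surviving candidate are all assumptions made for illustration; failure detection itself is not modelled.

```python
class InterconnectDriver:
    def __init__(self, saved_target_dev: int, remote_open):
        self.saved_target_dev = saved_target_dev
        self.remote_open = remote_open   # callable(node, dev) -> stream head address
        self.peer = None                 # (node, stream head address) once connected

    def connect(self, node: str) -> None:
        head = self.remote_open(node, self.saved_target_dev)
        self.peer = (node, head)

    def on_node_failure(self, failed_node: str, candidates: list) -> bool:
        """Re-open the lower half elsewhere if our peer was on the failed node."""
        if self.peer is None or self.peer[0] != failed_node:
            return False                 # failure does not affect this stream
        for node in candidates:
            if node != failed_node:
                self.connect(node)       # same open process as at creation
                return True
        raise RuntimeError("no surviving node can host the driver")


open_calls = []


def fake_remote_open(node: str, dev: int) -> int:
    # Test double for the remote streams_open(); records each request.
    open_calls.append((node, dev))
    return 0x2000 + len(open_calls)


driver = InterconnectDriver(saved_target_dev=7, remote_open=fake_remote_open)
driver.connect("node_b")
recovered = driver.on_node_failure("node_b", ["node_b", "node_c"])
```

The application above the stream head takes no part in this exchange, which is the point of the transparency requirement.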

[0027] The process described above provides the following advantages in a cluster environment. Streams are split and distributed across multiple nodes without requiring changes to the drivers or modules involved. Previously, distributing a stream was not possible without substantial modification of these components, which added complexity and raised cost. Leaving components unmodified means shorter development time, less time spent on support and problem resolution, and less testing time. Moreover, well-designed STREAMS drivers and modules cooperate just as they do on a single node, so developers need not learn how the cluster environment works. This means that users of the invention can open the cluster environment to third-party independent software vendors more quickly and at lower cost than with prior-art systems.

[0028] Streams can migrate between nodes without modification of drivers or modules, which removes the complexity and performance loss of the mechanisms currently in use. What migration requires is limited to a single function that is independent of the normal interaction of the driver or module with the STREAMS framework. The system uses the P-ICS driver without requiring changes to the underlying implementation of the control thread or the S-ICS. The technique provides a set of common data structures and interfaces that achieve P-ICS independence, so it can transparently benefit from future P-ICS improvements. STREAMS-based implementations of protocol or application stacks can be incorporated, implemented, and run within a cluster transparently, that is, without modification.

[0029] The base STREAMS framework is implemented without modification of the DDI/DKI (Device Driver Interface/Driver Kernel Interface) routines. Furthermore, all STREAMS framework messages, commands, logging and administration drivers, system calls, Dynamic Function Replacement, Dynamic Function Registration, and stream head behavior operate as they do in a non-distributed environment. This allows stacks to be designed with the usual single-system paradigm; apart from the cleanup function (if supported), designers need not understand the cluster configuration or develop with it in mind. Distributed STREAMS can be implemented without modifying the VFS (virtual file system), which ensures portability to other operating systems that implement STREAMS under a VFS.

[0030] The design of the invention is protocol-independent. This means that it can be used for a variety of networking or other applications without implementation changes. Some preview functions may need minor modification, but such modification mainly relates to the particular solution being built. For example, the IP bind preview function is specific to the global port management solution and is not needed to implement basic TCP/IP using the technique of the invention.

[0031] The design can take advantage of a variety of P-ICS driver implementations without requiring design changes to the S-ICS driver or the control thread. It can therefore exploit new technologies or communication-path paradigms, such as the shift from receiver-based communication to sender-based communication, which offers numerous simplifications and performance improvements.

[0032] The design shows how all aspects of the STREAMS framework can be realized within a distributed environment. These include commands, logging (strlog), administration (SAD), pipes, system calls, DDI routines, and so on. In most cases the STREAMS framework needs few changes, because the control thread, the S-ICS driver, and a few preview functions do most of the work. This makes it possible to integrate the technique quickly into STREAMS implementations other than Hewlett-Packard's. In addition, because the S-ICS driver and the control thread are independent of the STREAMS implementation, apart from requiring the Dynamic Function Replacement facility, it should be possible to port their design to other vendors' platforms.

[0033] STREAMS-based modules and drivers need no design changes to operate within a distributed environment. This is significant because the Sun and Locus systems appear to require extensive modification to make protocol modules such as TCP, UDP, and IP work. In this respect the design differs from anything previously proposed. The design concentrates its changes in components that lie outside the modules and drivers supplied by applications, which makes porting and supporting a module or driver faster and more cost-effective.

【００３４】[0034] The design also eliminates the need to modify the VFS (virtual file system) in order to open distributed drivers. It proposes unique, protocol-independent open and migration algorithms.

【００３５】[0035] II. Detailed Description of Embodiments
1.0 Overview
1.1 Example Cluster Configuration
The cluster shown in FIG. 1 is an example of a cluster based on a combination of a number of new and existing technologies that together form a unique cluster solution, one that cannot easily be replicated using prior-art distributed application techniques. Remote client applications can use this cluster to access a distributed database that implements a highly available data warehouse. A client application can use a World Wide Web browser such as Netscape to compose queries and examine the results. A browser has an advantage over traditional client software in that it offers a consistent interface and consistent operating conventions, which make it easier for remote users to learn and to use when examining data.

【００３６】[0036] The number of nodes used to implement the various parts of this solution within the cluster depends on the size of the data warehouse and on the number of clients accessing the data concurrently. For example, a small data warehouse holding a few hundred gigabytes of data can be implemented using four to eight nodes for the data warehouse. A large data warehouse holding terabytes of data, on the other hand, would be implemented with 16 to 32 nodes, including four to eight nodes serving as Internet gateways, if there are many remote clients and a substantial number of nodes must act as database servers. Each Internet gateway node can communicate with remote clients over four 622-megabit ATM links. All nodes are interconnected by a cluster interconnect solution made up of a combination of hardware, software, and firmware technologies.

【００３７】[0037] The example shown in FIG. 1 uses a small cluster 20 of four nodes, A, B, C, and D. Nodes A and B implement the distributed database that provides the data warehouse, while nodes C and D act as Internet gateways and query engines that give remote clients access. The actual technology used to implement the interconnect is independent of distributed STREAMS, although some technologies are naturally preferred because they make STREAMS simpler to design and implement. This independence is a design advantage because new technologies can be incorporated transparently. The result is a lower-cost solution with a relatively long service life, since application protocols and the various cluster services can be built on top of distributed STREAMS.

【００３８】[0038] The following assumptions are made about the cluster.
* The cluster is interconnected through software and hardware over which STREAMS and STREAMS-based drivers run.
* The operating system instance on each node must be multithreaded and multiprocessor-capable.
* The cluster may provide a middleware framework from which STREAMS obtains information or to which it supplies information. Such communication takes place through well-defined interfaces. A middleware framework, such as the control thread, is needed when decisions must be made about the various operations that STREAMS performs.
Note: Although the STREAMS design itself is independent of the middleware, the degree to which an implementation of the design actually uses the middleware depends heavily on that implementation. This should be kept in mind when considering what the control thread is and how, depending on the implementation, it is used to perform some of the middleware tasks.

【００３９】[0039] A distributed stream is a stream that can operate in a clustered environment in forms such as the following.
* The application and the entire stream stack may reside on the same node, as in a non-clustered environment, yet by using the STREAMS-related cluster framework infrastructure they gain the benefits of cluster facilities such as load balancing and high availability.
* The stream stack may run on a node different from the node on which the application accessing it runs.
* A stream may be split, at the module, driver, or stream-head level, into multiple components, each potentially running on a separate node within the cluster.
* For a STREAMS-based pipe, the two ends of the pipe may reside on different nodes within the cluster.
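The first three forms listed above can be told apart purely from where the components of a stream run relative to the application. The following Python sketch is only an illustrative model of that classification; the names `Component` and `distribution_form` and the three labels are assumptions made for this example and do not come from the specification. The pipe case is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One piece of a stream (hypothetical model, not a kernel structure)."""
    name: str  # e.g. "stream head", "tcp module", "dlpi driver"
    node: str  # cluster node on which the component runs


def distribution_form(app_node, components):
    """Classify a stream by where its components run relative to the application."""
    if not components:
        raise ValueError("a stream needs at least one component")
    nodes = {c.node for c in components}
    if nodes == {app_node}:
        return "co-located"    # application and whole stack on one node
    if len(nodes) == 1:
        return "remote stack"  # whole stack on a node other than the application's
    return "split stack"       # components spread over several nodes
```

For example, a stream head on node A with a driver on node C would be classified as a split stack when the application runs on node A.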

【００４０】[0040] The important point is that the stream is in some way decoupled from the application that accesses it, in a manner that does not normally occur when the stream runs on a single node. This decoupling is done for a number of reasons, described below. Streams are distributed for reasons that include, but are not limited to, the following. To take advantage of cluster facilities, applications need a cluster 20 that presents a single-system view within which they are executed and managed, together with high availability, load balancing, and hardware sharing. Side effects of distribution are reduced CPU overhead and higher overall cluster performance. The following sections use the example cluster to explain these concepts in detail.

【００４１】[0041] 1.2 Hardware Sharing
A cluster 20 implemented according to the invention allows extensive hardware sharing, which offers many advantages over prior-art system configurations. Because each node can be sized in terms of processing power, memory, I/O backplane, and so on, hardware choice and flexibility increase for the task at hand. In other words, the cluster need not be a set of homogeneous nodes. In the example cluster of FIG. 1, gateway nodes C and D are configured as a 4-way SMP with 32-bit processors, two 4 GB SCSI hard disks, 256 MB of memory, two 622 Mb ATM network cards, and one interconnect card (or two if high availability (HA) is a concern). A database server might be configured as a 4-way SMP with 64-bit processors offering a large address space, a Fibre Channel-attached disk array with terabytes of disk capacity, and two interconnect cards.

【００４２】[0042] Purchase and maintenance costs are lower over the lifetime of the cluster. In example cluster 20, only gateway nodes C and D need to have 622 Mb ATM cards installed. This reduces the number of cards, and because the fixed cost of the cards can be amortized over the lifetime of the cluster, it lowers cost in both the short and the long term. It also reduces the number of communication lines that must be installed. Communication lines capable of supporting 622 Mb ATM cards are very expensive, and each line carries recurring usage and maintenance costs that may be difficult to amortize over the lifetime of the cluster.

【００４３】[0043] Partitioning the cluster improves security. The three main parts in this example, namely the remote clients, the gateways, and the database servers, all run on different nodes. Such an arrangement makes both physical and software security easier to achieve.

【００４４】[0044] Partitioning the cluster also reduces the number and kinds of points of failure. For nodes A and B, the points of failure are limited to mass storage, the database software, the distributed application, and the cluster interconnect. For nodes C and D, they are limited to mass storage (on a much smaller scale, provided these nodes do not require large capacity), the gateway hardware, the networking protocol stack, the query engine, the Internet software, and the cluster interconnect. In either case, because there are fewer points of failure and database or gateway administration can be centralized, the solution leaves less room for Murphy's law to apply, improves fault isolation, eases debugging, reduces cluster complexity, and lowers administration and support costs.

【００４５】[0045] Within the cluster environment of this embodiment, hardware sharing also improves overall cluster performance and efficiency by confining execution of the TCP/IP stack to the gateway nodes. This is achieved by using the standard RPC (remote procedure call) function-shipping paradigm, so that only the results of remote execution are sent to each node. In a conventional distributed client/server application, each node typically runs a TCP/IP stack and communicates with remote clients using packets that are then routed through a gateway node. The problem with the conventional approach is that the work of creating, processing, and routing each packet through the gateway costs thousands of cycles per packet. Given the possible number of remote clients and the volume of data routed, spending that many cycles per packet is far from optimal and degrades cluster performance. Cycles spent on networking are cycles not spent on database serving, so users do not get the maximum system efficiency for their money. In addition, the instruction cache must switch frequently between database application code and networking code, causing frequent cache misses that degrade processor performance. Running a network protocol over the cluster interconnect also increases packet latency, because contention increases for system resources (memory, processors, I/O bus, timers, and so on) and for the interconnect I/O cards. This reduces overall cluster performance for the distributed database and increases response time for remote applications. Note: Latency is regarded as the key to building a high-performance cluster, so lower latency generally translates into faster response times and higher throughput and performance.

【００４６】[0046] If the interconnect solution had to support all of the transport protocol capabilities that TCP or UDP require, it would become too complex and could result in poor performance and high design cost. Instead of building a client/server application between the database application on node A and the remote Internet application, the invention distributes the application, for example by using STREAMS-based TCP/IP, so that the TCP/IP connection need exist only on node C while remaining transparently accessible from the application on node A through the standard RPC function-shipping paradigm. Function shipping essentially ships the execution of a function from one location to another. In this embodiment, for example, the cluster can therefore offload all of the expensive TCP/IP processing cycles from nodes A and B onto nodes C and D. This should translate into improved system performance for nodes A and B. How this is accomplished within the STREAMS framework is described in detail later.
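The idea of function shipping can be illustrated with a small user-space model. The Python sketch below is not the mechanism of the invention and involves no real RPC or STREAMS calls. The class names `GatewayNode` and `ShippedCall`, the function name `send_reply`, and the packet counters are all assumptions chosen for illustration. It shows only that the caller on node A names a function and receives a result, while the protocol work is charged to the gateway.

```python
class GatewayNode:
    """Stands in for node C, the only node that runs the TCP/IP stack."""

    def __init__(self, name):
        self.name = name
        self.packets_processed = 0
        # Table of functions that other nodes may ship to this gateway.
        self._functions = {"send_reply": self._send_reply}

    def _send_reply(self, client, payload):
        self.packets_processed += 1  # protocol work is charged to the gateway
        return f"{self.name} sent {len(payload)} bytes to {client}"

    def execute(self, function_name, *args):
        """Run a shipped function locally and return only its result."""
        return self._functions[function_name](*args)


class ShippedCall:
    """Proxy used on node A; it forwards the call instead of running it."""

    def __init__(self, gateway):
        self._gateway = gateway
        self.local_packets_processed = 0  # never incremented on node A

    def send_reply(self, client, payload):
        return self._gateway.execute("send_reply", client, payload)
```

In this model a database thread on node A would call `ShippedCall(gateway).send_reply(...)` exactly as if the function were local, which is the transparency the paragraph above describes.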

【００４７】[0047] 1.3 Load Balancing
Load balancing is used to even out the workload among the nodes in a cluster. In example cluster 20, load balancing applies in two areas: the database servers and the Internet gateways. For the database server nodes, if the database is distributed correctly, each server node should have the same probability of being the target of a request. Unfortunately, probability does not always reflect reality. If requests create a "hot spot" in the database, the cluster nodes are not used equally and overall cluster performance may suffer. Similarly, if a disproportionate number of remote clients enter the cluster through node C rather than node D, cluster response time to client queries may increase. In such situations, the solution applies a load-balancing policy and migrates applications, application instances, data, connections, or any combination of these to a more lightly loaded node. The following example shows how such migration is carried out using STREAMS-based pipes.

【００４８】[0048] A STREAMS-based pipe is a bidirectional queue that can be used to pass data between at least two cooperating threads of execution. Practical uses range from the simple, such as command-line processing, to the complex, such as multiple clients communicating with a server application through pipe endpoints. Applications use STREAMS-based pipes instead of memory-based pipes when additional filtering or preprocessing of the data is required. With a memory-based pipe, a filtering module cannot be pushed into the middle of the pipe to do this work.
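The difference between the two kinds of pipe can be modeled in a few lines. The following Python sketch is a toy, user-space model and not the STREAMS kernel interface. The class `StreamsPipe`, its methods, and the `drop_blank` filter are hypothetical names introduced for this example. It shows a bidirectional pipe into which filtering modules can be pushed so that every message is processed on its way through.

```python
from collections import deque


def drop_blank(message):
    """Example filter module: trim whitespace and discard empty messages."""
    stripped = message.strip()
    return stripped or None


class StreamsPipe:
    """Toy model of a bidirectional pipe with a chain of pushed modules."""

    def __init__(self):
        self._queues = {"a": deque(), "b": deque()}  # data waiting at each end
        self._modules = []  # filters, ordered from end "a" toward end "b"

    def push(self, module):
        """Insert a filtering module into the middle of the pipe."""
        self._modules.append(module)

    def write(self, end, message):
        """Send a message from one end; it passes through every module."""
        if end not in self._queues:
            raise ValueError(f"unknown end: {end!r}")
        other = "b" if end == "a" else "a"
        chain = self._modules if end == "a" else reversed(self._modules)
        for module in chain:
            message = module(message)
            if message is None:  # the module consumed the message
                return
        self._queues[other].append(message)

    def read(self, end):
        """Return the next message waiting at this end, or None if empty."""
        queue = self._queues[end]
        return queue.popleft() if queue else None
```

A plain memory pipe corresponds to a `StreamsPipe` on which `push` is never called: data still flows in both directions, but nothing can inspect or transform it in transit.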

【００４９】[0049] Within example cluster 20, the database server application can consist of a set of cooperating threads that communicate through pipes. When the cluster becomes unbalanced, some components of the database, or some database server instances, are migrated to other nodes within the cluster to correct the imbalance. In a prior-art distributed environment, this approach means the application must be redesigned, either to use a different communication paradigm such as sockets or to understand how to switch to a new paradigm when a migration takes place. Such design changes are unacceptable to many users because they are expensive and reduce the number of applications that can benefit from clustering. To solve this problem, in embodiments of the invention the application can continue to use pipes. (A STREAMS-based pipe can transparently replace a memory-based pipe even if the additional functionality is not used.) When an imbalance is detected, the pipe endpoints or the entire pipe, together with the associated threads of execution, are migrated to a new node within the cluster. All of this takes place without modifying a single line of code in the application.
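As a rough illustration of the imbalance check and endpoint migration described above, the following Python sketch picks a source and target node from per-node load figures and then updates a placement table. The function names, the load scale from 0 to 1, and the `threshold` value are assumptions made for this example. The specification does not define a particular policy.

```python
def pick_migration(load_by_node, threshold=0.25):
    """Return (busiest, idlest) if their load gap exceeds the threshold, else None.

    load_by_node maps a node name to a load figure between 0 and 1
    (a hypothetical metric chosen for this sketch).
    """
    busiest = max(load_by_node, key=load_by_node.get)
    idlest = min(load_by_node, key=load_by_node.get)
    if load_by_node[busiest] - load_by_node[idlest] <= threshold:
        return None
    return busiest, idlest


def migrate_endpoint(placement, endpoint, target):
    """Return a new placement table with one pipe endpoint moved to target.

    placement maps an endpoint name to the node that currently hosts it.
    The application keeps using the same endpoint name throughout.
    """
    if endpoint not in placement:
        raise KeyError(endpoint)
    updated = dict(placement)
    updated[endpoint] = target
    return updated
```

The application-visible name of the endpoint never changes in this model, only the node recorded against it, which mirrors the claim that no application code needs to be modified.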

【００５０】[0050] 1.4 High Availability
High availability is a term commonly used for software or hardware solutions that provide fault-tolerant behavior without necessarily incurring all of the complexity or cost of a fault-tolerant system. Extending STREAMS so that streams can be distributed or migrated as described above gives the software and hardware within the cluster the ability to recover from most single points of failure. For example, in FIG. 1, if a remote Internet application is communicating with an application on node B through the node D gateway and the node D gateway fails, a recovery mechanism can be started on node C that transparently takes over all application activity. Similarly, if the application includes synchronization points on node A and node B, then whichever node fails, the remote client application continues to run unaware that recovery is taking place. That is, it continues to operate normally without detecting that a problem occurred. Note: Distributed STREAMS can be used to recover from some single points of failure, mainly those where the component is a stateless entity such as DLPI. For stateful components such as TCP, distributed STREAMS cannot be used to recover from a single point of failure when a node fails. In such cases the remote client application must detect that the connection has been lost and restart. Users may tolerate this today, but they will not in the future. Distributed STREAMS would probably make use of the Brevix research, a Hewlett-Packard Laboratories project, and its prototype solution. Brevix defines a mechanism that installs specific traps to invoke error-handling routines instead of simply crashing the system. Such error-handling routines could be used to migrate client applications and STREAMS instances to other nodes in the cluster. If this research were extended to cover more traps, and if the use of the "panic" function were replaced by an evaluation routine that determines whether it is safe to migrate applications before the system goes down, distributed STREAMS could provide an unrivaled high-availability solution in the cluster market.
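The distinction drawn in the note above, between stateless components that can be recovered on another node and stateful ones whose connections are lost, can be sketched as follows. This Python model is purely illustrative. The function name, the dictionary layout, and the `stateful` flag are assumptions for this example, not structures from the specification.

```python
def recover_after_node_failure(streams, failed_node, standby_node):
    """Move stateless streams off a failed node and report stateful ones as lost.

    streams is a list of dicts with keys "name", "node", and "stateful"
    (a hypothetical layout). Entries for recovered streams are updated in
    place. Returns a (recovered, lost) pair of name lists.
    """
    recovered, lost = [], []
    for stream in streams:
        if stream["node"] != failed_node:
            continue  # unaffected by the failure
        if stream["stateful"]:
            # e.g. a TCP connection: the remote client must detect the loss
            # and restart, as the note above explains.
            lost.append(stream["name"])
        else:
            # e.g. a DLPI-style stateless component: re-create it elsewhere.
            stream["node"] = standby_node
            recovered.append(stream["name"])
    return recovered, lost
```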

【００５１】[0051] Brevix is limited in the number of system traps it can handle. These are normally limited to trap 15, for data segment faults, and traps 18, 26, 27, and 28, which handle memory protection and unaligned memory references. An advantage of Brevix is that it can be applied on a per-subsystem basis, so the recovery mechanism should be able to judge with a fair degree of confidence whether it is safe to migrate the components that depend on a given subsystem. Many kernel subsystems call the function panic() when they cannot make forward progress or when something illegal is detected. If panic() were changed from taking only a message string to also taking a policy parameter, recovery and migration could be carried out more intelligently. The policy parameter could, for example, indicate that the entire subsystem is suspect and that the effect on other subsystems and the recovery steps should be considered.
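One way to picture a panic() call that carries a policy is the following Python sketch. It is a speculative model only: the `PanicPolicy` values, the function `evaluate_panic`, and the dependency table are all hypothetical, and the specification does not define the form the policy parameter would take.

```python
from enum import Enum


class PanicPolicy(Enum):
    """Hypothetical policies that a subsystem might pass along with its message."""
    HALT = "halt"                  # classic behavior: nothing is migrated
    SUSPECT_SUBSYSTEM = "suspect"  # treat the whole calling subsystem as suspect


def evaluate_panic(subsystem, policy, depends_on):
    """Return the components judged safe to migrate before the node goes down.

    depends_on maps a component name to the set of subsystems it relies on.
    A component is considered safe only if it does not depend on the
    subsystem that raised the panic.
    """
    if policy is PanicPolicy.HALT:
        return set()
    return {name for name, subsystems in depends_on.items()
            if subsystem not in subsystems}
```

In this model a panic raised by a networking subsystem would leave components that depend only on other subsystems eligible for migration, while everything that touches networking stays behind.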

【００５２】[0052] 1.5 Single System View
Because cluster 20 is made up of multiple nodes, it is inherently complex and potentially difficult to manage node by node. If the cluster is viewed as a single system, at least from the application's point of view, many problems become much easier to solve. For example, suppose an application uses TCP/IP and binds itself to a particular port, and the application then needs to migrate to another node for some reason. It must be confirmed that no other application is already bound to that port on the target node. Otherwise, the migration cannot be performed. The way to solve this problem is to use cluster middleware to manage a cluster-wide port space, which prevents the conflict from arising in the first place. The distributed STREAMS design of the invention provides techniques, algorithms, and STREAMS framework modifications that eliminate the need to modify existing modules or drivers and that allow unmodified applications to run transparently on any combination of nodes.
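A cluster-wide port space can be modeled as a single registry that every node consults before binding. The Python sketch below is an illustration under that assumption. The class `ClusterPortSpace` and its methods are hypothetical names, and a real middleware would also need to make the registry itself highly available.

```python
class ClusterPortSpace:
    """Toy registry that keeps port bindings unique across the whole cluster."""

    def __init__(self):
        self._owner = {}  # port -> (application, node)

    def bind(self, port, app, node):
        """Bind a port cluster-wide; fail if any node already holds it."""
        if port in self._owner:
            holder, where = self._owner[port]
            raise ValueError(f"port {port} already bound by {holder} on {where}")
        self._owner[port] = (app, node)

    def migrate(self, port, app, target_node):
        """Move an existing binding to another node.

        No conflict check is needed on the target, because the port is
        unique across the cluster by construction.
        """
        owner = self._owner.get(port)
        if owner is None or owner[0] != app:
            raise KeyError(f"{app} does not hold port {port}")
        self._owner[port] = (app, target_node)

    def node_of(self, port):
        """Return the node currently hosting the binding."""
        return self._owner[port][1]
```

Because uniqueness is enforced when the port is first bound, the migration step in this model can never fail on a port conflict, which is the effect the paragraph above attributes to managing the port space cluster-wide.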

【００５３】[0053] 1.6 DLKM and Single-System-View Issues
If the operating system supports dynamically loadable kernel modules (DLKM), each node within the cluster can configure itself to load and run only the modules needed by the applications currently running. In a distributed STREAMS environment, this is achieved by providing a flat file, accessible from every node, that describes how the STREAMS subsystem is to be loaded. STREAMS can then bring in the modules and drivers it needs and load and configure STREAMS-based subsystems such as TCP/IP for the applications being run. From the single-system point of view, there must be a mechanism by which each node can obtain node-specific configuration data as well as cluster-wide configuration data.

【００５４】[0054] For DLKM to work, each node needs a flat file that lists the nodes from which the STREAMS subsystem can be loaded. The STREAMS initialization function processes this list and communicates through the existing connection management protocol, which lets the booting node determine a node from which the subsystem can be downloaded and then perform the actual download.

【００５５】[0055] The flat file also contains the default STREAMS drivers that must be loaded whenever STREAMS is loaded, together with the parameter sets for loading those drivers. At a minimum, the clone, strlog, and sad drivers must always be loaded. It could be argued that these are not needed until an application tries to open a driver, but because almost every STREAMS-based driver uses them directly or indirectly, loading them at initialization is simpler and faster. The file also contains a list of the STREAMS-based subsystems that may need to be loaded and their corresponding server nodes. The basic idea is to assemble enough information in advance that these subsystems do not need flat files of their own. This works by changing str_install() to read this information through communication with a server node and to download the current data structures. The difference is that str_install() would not actually load the subsystem. It would only create the required STREAMS infrastructure so that the subsystem load can take place more quickly and with less information (that is, with a new flat file). Because everything then happens through STREAMS, STREAMS subsystems no longer need to create new load mechanisms of their own. In addition, because the environment is a cluster, subsystem loading can retain HA qualities.
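The specification does not give the layout of the flat file, so the following Python sketch invents a simple `key: values` format purely for illustration. The keys `load_from`, `driver`, and `subsystem`, the node names in the usage example, and the function `parse_flat_file` are all assumptions. The only detail taken from the text is that the clone, strlog, and sad drivers must always be present.

```python
REQUIRED_DRIVERS = {"clone", "strlog", "sad"}  # minimum set named in the text


def parse_flat_file(text):
    """Parse a hypothetical STREAMS load description.

    Assumed format (one entry per line, '#' starts a comment):
        load_from: <node> [<node> ...]
        driver: <name>
        subsystem: <name> <server node> [<server node> ...]
    """
    config = {"load_from": [], "drivers": [], "subsystems": {}}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        key, fields = key.strip(), value.split()
        if not fields:
            raise ValueError(f"missing value in line: {raw!r}")
        if key == "load_from":
            config["load_from"].extend(fields)
        elif key == "driver":
            config["drivers"].append(fields[0])
        elif key == "subsystem":
            config["subsystems"][fields[0]] = fields[1:]
        else:
            raise ValueError(f"unknown key: {key!r}")
    missing = REQUIRED_DRIVERS - set(config["drivers"])
    if missing:
        raise ValueError(f"default drivers missing: {sorted(missing)}")
    return config
```

A file listing two load-source nodes, the three default drivers, and a TCP/IP subsystem served by nodes C and D would parse into one dictionary that an initialization routine could walk in order.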

【００５６】[0056] 1.7 Design Issues
The invention provides solutions to a number of design problems associated with distributed STREAMS and STREAMS-based applications. To help explain how the design works and why it was chosen, the design problems are listed below.

【００５７】[0057] From the single-system view of the cluster, there is only one set of device files that an application can open. The problem to be solved is how to determine which node is the correct node on which to open a device, given that the corresponding target hardware or software may not be present on every node. Moreover, how is the correct node determined from a device file name that conveys no such information? This situation is illustrated by the hardware sharing example above.

【００５８】If different parts of a STREAMS-based stack must run on different nodes, how are these components created and interconnected without requiring changes to component design? This is an important goal in creating an open cluster environment that encourages developers to port their modules and drivers to the cluster.

【００５９】If a stream should migrate from one node to another, whether for load balancing, high availability, or any other reason, how are state and correctness maintained, and how is the migration performed, without modifying component design? (Note: for some cluster features, a component may need additional functionality to take advantage of them, while other features have default behavior and require neither modification nor added code.) Attention must be paid to STREAMS put and service routines that may be executing asynchronously with the migration effort. For modules or drivers that use strlog(), a DDI/DKI standard utility, how are messages delivered to the single cluster log driver against which an administrator runs strace? How is logging from one node distinguished from logging from another? And how is that done without losing information or modifying the module or driver? Many STREAMS implementations (HP, Sun, OSF, Mentat, Unixware, and so on) use many different queue synchronization levels. These synchronization levels are key to ensuring correct queue access and behavior. How, and to what degree, are these capabilities provided within a cluster?

【００６０】When a STREAMS-based multiplexer is created, how are the components interconnected without incurring unnecessary overhead and performance degradation? What happens if one half of the multiplexer migrates to a new node but the other half cannot? Many other issues exist besides those listed above, but this list provides a basis for understanding what must be solved before distributed STREAMS can be an effective solution within a cluster.

【００６１】1.8 Design Rules
There are three design rules to consider when evaluating a distributed STREAMS design. The first rule is that no modification to the STREAMS framework may degrade performance for non-distributed STREAMS applications. Developers tend to add features here and there to satisfy some customers. A single feature usually costs little in performance, but as features accumulate over time, the accumulated degradation can become a serious overall performance loss that affects the entire customer base. Therefore, adding distributed STREAMS functionality must not introduce performance degradation that the entire customer base experiences. The temptation to spend cycles here and there for convenience must be resisted as far as possible.

【００６２】The second rule is that no STREAMS module or driver should require modification in order to operate within a distributed STREAMS environment. This is key to letting third-party developers develop their software on the target platform, and to reducing the overall time and cost of developing, deploying, and supporting applications for a cluster environment. In the past, cluster software developers have gone to extremes such as modifying the main code path to add checks for whether a cluster-specific action must be performed immediately. This kind of modification is unacceptable in terms of performance and cost, and it also restricts how the cluster can be used and reduces its appeal to customers. The only potential exception to this rule is that a module or driver may need added functionality in order to migrate an instance from one node to another. This functionality does not require modifying existing code; rather, it adds functions that the STREAMS framework uses to migrate the module's or driver's private data, of which STREAMS normally remains unaware. Module and driver functionality can also be augmented using STREAMS Dynamic Function Replacement and Registration, which are described in U.S. patent application Ser. Nos. 08/545,561 and 08/593,313.

【００６３】The third rule is to make the distributed STREAMS solution as independent of middleware as possible. The design must limit, as far as possible, the points and circumstances at which communication with middleware occurs. This independence from middleware yields a more flexible design and allows faster migration to new middleware technologies at lower cost in terms of time to market, porting time, and overall product support.

【００６４】2.0 Design Overview
The basic architecture of a STREAMS stack, shown in FIGS. 2 and 3, is a stream head 30 and a driver 44, with zero or more modules 32 pushed on top of the driver. Although the figures show a stack that is accessed and handled as a single stack instance, more complex stacks can be created using software multiplexers, which can combine multiple stream stacks into a tree-like structure. With this in mind, the details of the distributed STREAMS design are described below.

【００６５】The design of the present invention includes, at a minimum, a control thread 34 and a physical cluster interconnect driver (P-ICS) 36. In addition, there may be one or more instances of a STREAMS-based driver called the software cluster interconnect driver (S-ICS) 38. Before these components are described in detail, two potential design configurations are examined, along with how they can be used to solve various cluster problems. The two configurations are shown in FIGS. 2 and 3.

【００６６】The P-ICS provides a high-speed interconnect link that uses a lightweight, low-latency protocol and comprises both software and hardware components. These interconnects are, for all intents and purposes, virtual circuits. Virtual circuits offer many conceptual and implementation advantages, not least the ability to move an instance from one to another transparently to the application. In addition, virtual circuits eliminate the need to maintain protocol-specific information within the stream, within the physical interconnect driver, or within the STREAMS framework. This simplifies the overall design and implementation, improves performance, and increases code reuse and flexibility. Note that the simple architecture shown in FIGS. 2 and 3 is not meant to limit the number of streams managed, or the cluster features available to each stack, to what is illustrated.
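
The transparency property of a virtual circuit can be illustrated with a small model. The following Python sketch is not taken from the specification: the class name VirtualCircuitTable, its methods, and the node names are hypothetical, and it assumes that "moving an instance" means rebinding a circuit's endpoint to a different node. It shows only that a sender addressing a circuit identifier need not change when the endpoint moves.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualCircuitTable:
    """Toy model: maps a circuit id to the node currently hosting its endpoint."""
    endpoint_node: dict = field(default_factory=dict)

    def open(self, circuit_id: int, node: str) -> None:
        self.endpoint_node[circuit_id] = node

    def migrate(self, circuit_id: int, new_node: str) -> None:
        # Only the table entry changes; senders keep using the same circuit id.
        if circuit_id not in self.endpoint_node:
            raise KeyError(f"unknown circuit {circuit_id}")
        self.endpoint_node[circuit_id] = new_node

    def send(self, circuit_id: int, payload: bytes) -> tuple:
        # Returns (destination node, payload); a real interconnect would transmit it.
        return (self.endpoint_node[circuit_id], payload)
```

In this model the sender holds no node-specific or protocol-specific state, which is the simplification the paragraph above attributes to virtual circuits.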

【００６７】The first configuration, shown in FIG. 2, comprises a TCP/IP stack with multiple DLPI ATM drivers 40, 42 linked under an IP MUX (IP multiplexer) 44. This is the standard configuration found in STREAMS-based TCP/IP implementations. There are three additional components. The first two are the control thread 34 and the P-ICS 36. The third is a set of preview functions 31, 33 that can augment TCP and IP behavior depending on which cluster features the stack uses. For TCP this might be a cluster-wide port management scheme, while for IP it might be a high availability set of functions that handle error and card-failure conditions. In either case, this is optional functionality that requires access to the control thread and the P-ICS. Note: because every STREAMS-based stack is known to the control thread when it is created, every stack should, at a minimum, be able to migrate between nodes without modification to its non-cluster implementation; however, the stack is required to provide a pair of functions for each component that has associated private data.

【００６８】In the second configuration, shown in FIG. 3, a TCP stack again has multiple DLPI ATM drivers linked under the IP multiplexer, but here one DLPI instance 42 is actually running on a different node. To make this possible, an S-ICS driver 38 linked under the IP multiplexer is responsible for delivering data from IP to the remote DLPI instance 42. The stack may be created this way from the start, or it may come about as the result of a high availability recovery scheme, for example when DLPI detects a card failure and sends an M_ERROR message. That message is caught by the IP augmentation function, the control thread is notified, and a recovery policy routine is invoked. In this situation the application is unaware of the card failure and continues to operate as before. How this is achieved is described later.
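
The recovery sequence just described (M_ERROR caught by an IP augmentation function, control thread notified, recovery policy invoked) can be sketched as follows. This is a minimal Python model under stated assumptions, not the patented implementation: the classes ControlThread and IpMux, the preview() hook, the link names, and the "swap in the first standby link" policy are all hypothetical illustrations. Only the message type name M_ERROR comes from the text.

```python
M_DATA, M_ERROR = "M_DATA", "M_ERROR"  # message type tags; M_ERROR is named in the text


class ControlThread:
    """Hypothetical stand-in for the control thread 34."""

    def __init__(self, standby_links):
        self.standby_links = list(standby_links)  # e.g. S-ICS links to remote DLPIs
        self.log = []

    def recover(self, ip_mux, failed_link):
        # Assumed recovery policy: replace the failed link with the first standby.
        if not self.standby_links:
            raise RuntimeError("no standby link available")
        replacement = self.standby_links.pop(0)
        position = ip_mux.lower_links.index(failed_link)
        ip_mux.lower_links[position] = replacement
        self.log.append((failed_link, replacement))


class IpMux:
    """Hypothetical model of the IP multiplexer with an augmentation hook."""

    def __init__(self, lower_links, control_thread):
        self.lower_links = list(lower_links)
        self.control_thread = control_thread
        self.delivered = []

    def preview(self, link, msg_type, payload=None):
        # Augmentation hook: runs before normal IP read-side processing.
        if msg_type == M_ERROR:
            self.control_thread.recover(self, link)
            return  # swallowed here, so the error never reaches TCP or the application
        self.delivered.append(payload)
```

Because the hook consumes the error message and the link list is repaired underneath IP, the layers above see no change, which corresponds to the application continuing to operate as before.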

【００６９】In both configurations the middleware entity is an optional component and is therefore not discussed here (middleware is described later). In these configurations, the control thread has sufficient information to coordinate the different features and recovery mechanisms across all nodes running this TCP/IP stack. This coordination is detailed in a later section that describes the interactions among the components.

【００７０】2.1 Overview of the Control Kernel Thread
To establish and migrate streams transparently within a cluster, there must be a third-party entity that can communicate with the STREAMS framework, with other control thread instances, and with the cluster-wide management middleware. This is achieved by creating one or more kernel management threads. These threads have, at a minimum, the following responsibilities and capabilities. If a stream is created normally, as shown in the first configuration, the control thread is responsible for detecting its creation and notifying the STREAMS framework so that the framework can perform function augmentation. The stack notifies the STREAMS framework using str_install(). The str_install() command is used to install a STREAMS driver or module into the kernel. The parameters passed to this function define items such as the synchronization level, the streamtab entry, and so on. For clustering, a new parameter is added that contains the set of policies to be invoked. Because even a non-cluster port of a driver or module requires recompilation, the driver or module automatically picks up the default set of policies without any change to its implementation. The default policies allow stack migration for load balancing and high availability.

【００７１】If a stream is being created with components that reside on different nodes within the cluster, the control thread is responsible for creating the components on the remote node. The control thread is responsible for creating the remote DLPI instance and the local stream head. If modules need to be pushed onto this driver instance, the control thread is also responsible for pushing them.

【００７２】The control thread uses an in-kernel STREAMS interface to communicate with the stream instances on the local node. A preferred form of the STREAMS interface is implemented in Hewlett-Packard HP-UX releases 10.10 and 10.20. The reasons for keeping the control thread in the kernel are performance and security. An in-kernel interface does not need to cross the system call interface excessively to communicate with its partners. As for security, it keeps the routing tables and management structures protected from other threads and from snooping. Does this mean that the control thread cannot be implemented in user space? It does not: the issue is the ability to send messages, not where the messages travel, so a user-space implementation requires no design changes to this solution.

【００７３】The control thread is responsible for establishing the S-ICS instance for the node. There may be only one instance for all control threads, in which case the threads either share the stream head pointer address, or the information held within the S-ICS may be arranged per thread, corresponding to the functionality that each thread supports. The control thread is responsible for passing stream-head-specific messages.

【００７４】The control thread is responsible for passing messages between components and performing any necessary flow control. The control thread participates in stream migration. If a node fails and some of a stream instance's components were running on that node, the control thread participates in the error recovery associated with the high availability solution. As one of the two places in the cluster that have detailed knowledge of the middleware and of how to communicate with it, the control thread can communicate with the cluster-wide middleware to look up information and to update the middleware about changes to what it manages. By confining this knowledge, the STREAMS framework remains independent of the middleware, which increases the flexibility of the solution design with respect to new middleware technologies. The details of how all of this is accomplished are described in the following sections.

【００７５】2.2 Overview of the S-ICS
The S-ICS 38 is the key component in making the STREAMS framework design solution independent of middleware. The S-ICS is a STREAMS-based software driver that sits above the physical interconnect driver, P-ICS 36. The S-ICS driver was chosen to be STREAMS-based so that it integrates easily into the STREAMS framework, and because such a driver provides some unique, standard mechanisms that simplify stream migration and recovery. In this embodiment, the choice also isolates P-ICS implementation-specific dependencies to a single point from the viewpoint of the STREAMS framework.

【００７６】At a minimum, the S-ICS 38 has the following capabilities and responsibilities.
＊The S-ICS participates in the creation of distributed stream components.
＊The S-ICS is responsible for querying the middleware, and for updating it with low-level STREAMS framework and application-specific information if that information is not stored in the control thread.
＊The S-ICS handles messages among the P-ICS, the STREAMS framework, the control management thread, and the middleware, as required.
＊The S-ICS potentially maintains application-specific cache information in order to improve performance and to simplify the integration of modules and drivers within the distributed STREAMS environment. These caches may contain information such as routing tables for packets that arrive at a node that may not have the target application running.
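
The caching responsibility in the last item can be sketched as a lookup table that consults the middleware only on a miss. The Python model below is an assumption-laden illustration, not the S-ICS design: the class SicsRouteCache, the (protocol, port) key, the callable standing in for the middleware, and the returned strings are all hypothetical.

```python
class SicsRouteCache:
    """Sketch of an S-ICS cache mapping (protocol, port) to the owning node."""

    def __init__(self, local_node, middleware_lookup):
        self.local_node = local_node
        self.middleware_lookup = middleware_lookup  # callable: key -> node name
        self.cache = {}

    def route(self, proto, port):
        key = (proto, port)
        if key not in self.cache:
            # Cache miss: ask the middleware once, then remember the answer.
            self.cache[key] = self.middleware_lookup(key)
        node = self.cache[key]
        if node == self.local_node:
            return "deliver-locally"
        return f"forward-to:{node}"

    def invalidate(self, proto, port):
        # Called when the owning application migrates; forces a fresh lookup.
        self.cache.pop((proto, port), None)
```

Keeping the middleware call behind the cache is one way to limit the points at which middleware is contacted, in the spirit of the third design rule.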

【００７７】2.3 Example of Component Operation
To show how all of the components described above work together, the most common area in which distributed STREAMS is applied, namely TCP/IP, is examined below as an example. The example is shown in FIG. 4 as two alternative configurations. How a stream is initially created is described in a later section; this section focuses on how the stream functions and on which design solutions are used to satisfy the three design rules described above.

【００７８】2.3.1 Non-Split Stack TCP/IP
In the first configuration of FIG. 4, the stream stack 30, 32, 44 is created normally from the stack's point of view, but with additional processing. At open time, the control thread 34 is notified that a stack is being created. When the STREAMS framework open() code executes, the control thread learns the stream head address, which driver has been opened, and which modules have been pushed. The control thread then examines each driver and module to determine whether an associated cluster policy set exists. Depending on those policies, the control thread 34 then augments the stack with the appropriate functionality so that the stack can take advantage of the features the cluster provides.

【００７９】Once the functionality has been added, the control thread intervenes or reacts only in accordance with what is sent to it and the policies that have been set. In most cases, therefore, the TCP/IP stack operates just as it does in a non-cluster environment, and application performance appears the same. This configuration clearly satisfies all of the design rules.

【００８０】2.3.2 Split Stack TCP/IP
When the stack is distributed, there are two different forms, each with trade-offs. The first option splits the stack between the TCP and IP modules. The advantage of this split is that, because TCP maintains state and IP does not, only TCP needs to be migrated when the stack must move to a new node; IP needs no update, which reduces the migration work. Furthermore, if the node running IP fails, a new IP instance can easily be established on another node in the cluster without affecting the TCP module or the application. In STREAMS-based TCP there are TCP default queues, for example for connection indications. A default queue may require additional code to ensure that connection indications and the like are delivered to the correct target node. Such a situation arises when an application is migrated to a new node. If the stack were split at this level, the code required to perform this default queue processing could also be reused for TCP/IP processing, at least for determining where in the cluster a message should be sent.

【００８１】The disadvantages are as follows. This option is not independent of the protocol implementation. In most cases there appears on the surface to be complete modularity, but in reality there is not. If this split is used, the design must take into account special forms of communication between the two modules, as well as private data and assumptions about access to it. In other words, this option requires in-depth knowledge of the TCP/IP stack implementation (knowledge that should be far from necessary). This design requires the S-ICS to be almost completely protocol dependent, rather than 95% or more protocol independent as it should be.

【００８２】Much of the S-ICS cannot be reused for other transport providers. This means that the cost of developing and maintaining a clustered TCP/IP stack and other protocol stacks increases unacceptably. To route a message to its destination, there must be a mechanism that can contact the cluster middleware for routing updates. Because TCP and IP sit in the middle of the stream stack, they are not allowed to sleep, and a request to a potentially different node can cause serious problems for stack execution, especially when packet processing occurs on the interrupt control stack. Splitting at this level produces a large number of distributed stack instances within the cluster, namely one instance per connection.

【００８３】The second option splits the stack at the IP/DLPI level and always keeps the TCP and IP modules together on each node. This form is shown in FIG. 4(B). The advantages of this design are as follows. The design is 95% or more protocol independent. It is sufficient to understand that STREAMS-based TCP supports TPI (the Transport Provider Interface). TPI is supported by various transport providers, such as SNA, OSI, Netware, and Appletalk, that use DLPI connected beneath the transport provider's modules and drivers. Much of the S-ICS can be reused for other protocol stacks, and it can potentially be generalized enough to be completely protocol independent. This reduces the solution cost of transport stacks ported to the cluster environment, increases flexibility, and shortens the time to market for new technologies.

【００８４】Depending on the implementation, the number of distributed stacks is smaller with a split at the IP/DLPI level. If IP is implemented as a multiplexer, it has only one DLPI instance per I/O card. If the split is made at this level, each remote DLPI is linked under IP, and there is no limit on the number of connections that pass through this single link. (In practice the S-ICS is the linked driver, but it records enough information to know which DLPI instance on the remote node it is communicating with, so that correct, predictable message routing is performed.) If the split were made at the TCP/IP or UDP/IP level, there would be one interconnect per TCP connection; such setup occurs within the cluster far more often than with a split at the IP/DLPI level, which degrades cluster performance and increases migration cost. Because DLPI is stateless, the DLPI portion does not need to be migrated when the stack is to be migrated.
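
The difference in interconnect count between the two split levels is simple arithmetic, shown here as a worked example. The Python function below is only an illustration of the counting argument in the paragraph above; the function name, its parameters, and the sample numbers are hypothetical.

```python
def interconnects_needed(split_level, tcp_connections, remote_io_cards):
    """Count cluster interconnects for a stack split at the given level.

    Per the text: a split at the TCP/IP level needs one interconnect per TCP
    connection, while a split at the IP/DLPI level needs one link per remote
    DLPI instance (one DLPI instance per I/O card), shared by all connections.
    """
    if split_level == "TCP/IP":
        return tcp_connections
    if split_level == "IP/DLPI":
        return remote_io_cards
    raise ValueError(f"unknown split level: {split_level!r}")
```

For instance, with 1000 open TCP connections and 2 remote I/O cards, the first split would need 1000 interconnects and the second only 2.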

【００８５】If a DLPI instance fails, the TCP/IP portion does not have to be migrated to a new DLPI instance. If the card is local to the stack, a new or existing DLPI instance is simply informed that it now has the following additional IP addresses assigned to it. If the DLPI is a remote instance, failure handling uses the same process, but in this case the S-ICS serves as the DLPI instance. This can be achieved by having all DLPI instances in the cluster linked on every node, or the configuration code can select which DLPI to use as the failover component. This is implementation dependent.
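
The failover step described above, in which a surviving or standby DLPI instance is told that it now carries additional IP addresses, can be sketched as a pure function over an address map. This Python example is a hedged illustration: the function fail_over, the dictionary layout, the instance names, and the sample addresses are assumptions, and the real mechanism for informing a DLPI instance is not shown.

```python
def fail_over(address_map, failed, standby):
    """Return a new map in which `standby` takes over the addresses of `failed`.

    address_map: dict mapping a DLPI (or S-ICS) instance name to a list of
    IP addresses. The input map is left unchanged.
    """
    if failed not in address_map:
        raise KeyError(failed)
    updated = {
        name: list(addrs) for name, addrs in address_map.items() if name != failed
    }
    # The standby may be an existing instance or a new one (for example an S-ICS link).
    updated.setdefault(standby, []).extend(address_map[failed])
    return updated
```

The same function covers both cases in the text: passing an existing local DLPI as the standby models the local-card case, and passing an S-ICS link models the remote case.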

【００８６】The disadvantages are as follows. The S-ICS requires more development effort and time in order to be protocol independent, but this effort is offset by the ability to reuse the same design for different stacks. If the default TCP queues require a routing module, that module must be created, at added cost. Since this code is of course a subset of the S-ICS code, the design could borrow heavily from the S-ICS code and keep the cost within acceptable limits. This is where most of the 5% dependency that cannot be eliminated lies.

【００８７】Based on the advantages and disadvantages examined above, the second option, shown in FIG. 4(B), is preferred. The next section describes how global port mapping, one issue related to the cluster's single-system view, is solved using the second option, and how that solution applies to other protocol stacks.

【００８８】Note: FIG. 4 is labeled with three numerals, 1, 2, and 3. Each refers to one of the following notes on potential performance optimizations that could be used if protective measures are added.
1. Consider a cluster with multiple instances of the second option of FIG. 4(B), and add the possibility that an application accessing the TCP/IP stack portion can migrate from node to node. A packet may then be delivered to a node before all intra-cluster routing has been updated to reflect the new location of the stack. This is detected when IP, using its fanout table, determines that the correct TCP or UDP instance is not on this node. The packet is then rerouted through the S-ICS and sent to the new node. Because IP has already finished processing the packet by then, there is no reason to repeat that processing by simply sending the packet to the correct node and back into IP, and no reason to worry about the effects of, or recovery from, processing that exists in the stack. Thus, depending on the characteristics provided by the control cluster, the packet can be delivered directly to the specific TCP or UDP instance on the new node. As noted above, this also eliminates the IP reprocessing problem and improves overall stack performance. The rerouting is described in detail later; in brief, the fanout table contains a set of TCP/UDP target queues. Rerouting is achieved by changing the TCP or UDP target queue (a read queue) to an S-ICS write queue, so that when putnext() is executed it invokes the S-ICS put routine instead of the TCP or UDP put routine. This eliminates the need to scan messages or modify the IP code, but it does require understanding the basic IP implementation design and writing code that can correctly modify the fanout table. (This is the main place where one must actually understand how a STREAMS-based IP is implemented, and it is the reason this design achieves 95% protocol independence.)
2. As described above, the first option lends itself to passing messages directly between the S-ICS and IP to improve performance. This again requires implementation-dependent knowledge. In addition, the first option requires an implementation that handles the possibility that the S-ICS decides to add extra protective code against remote node failure.
3. As in note 2, performance can be improved by passing messages directly between the S-ICS and DLPI. The same considerations apply. In either case, the fact that a message need not pass through the stream head, and need not incur a potential context switch to the control kernel thread, yields a significant performance gain. Whether to favor the performance gain or to avoid the added complexity and protective code is decided per protocol implementation.

【００８９】FIG. 5 shows how packets are forwarded when the second option of FIG. 3 is implemented. The control thread 34 accesses the local S-ICS 30 and the local stack instances 30, 32, 44 by using a hash function to determine the stream head addresses stored in its local data structure 46. The local stack instance also holds a pointer 48 to the managing S-ICS instance 38, which it can use to direct messages straight to the S-ICS instance without going through the control thread. Similarly, the S-ICS does not need to go through the control thread to send data to the DLPI driver 42.

【００９０】2.3.3 Global Port Mapping
To build the configuration described above, global port mapping is implemented in some cases. This function controls the allocation of TCP ports across the entire cluster rather than node by node. One way to implement it is simply to divide the port space so that each node's share is the port space size divided by the number of nodes in the cluster. This is simple to implement, but it is not flexible enough and does not adapt to the functions that each node in the cluster may be performing.
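As an illustration of the static split described above, the following C sketch divides the 16-bit port space evenly among the nodes. It is not part of the disclosure: the names `static_port_range` and `static_port_owner`, and the choice to give the last node any remainder, are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PORT_SPACE_SIZE 65536u  /* size of the 16-bit TCP port space */

/* Half-open range [first, first + count) of ports owned by one node. */
struct port_range {
    uint32_t first;
    uint32_t count;
};

/* Split the port space evenly; the last node absorbs any remainder
 * (an assumption of this sketch, the text does not say how to round). */
struct port_range static_port_range(uint32_t node_id, uint32_t num_nodes)
{
    assert(num_nodes > 0 && num_nodes <= PORT_SPACE_SIZE);
    assert(node_id < num_nodes);
    uint32_t share = PORT_SPACE_SIZE / num_nodes;
    struct port_range r;
    r.first = node_id * share;
    r.count = (node_id == num_nodes - 1) ? PORT_SPACE_SIZE - r.first : share;
    return r;
}

/* Node that owns a given port under the static split. */
uint32_t static_port_owner(uint16_t port, uint32_t num_nodes)
{
    assert(num_nodes > 0 && num_nodes <= PORT_SPACE_SIZE);
    uint32_t share = PORT_SPACE_SIZE / num_nodes;
    uint32_t owner = port / share;
    return owner < num_nodes ? owner : num_nodes - 1;
}
```

The sketch makes the inflexibility visible: a node's share is fixed by `num_nodes` alone, regardless of how many connections that node actually serves.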

【００９１】A better solution, shown in FIG. 6, is to create a cluster middleware control thread 50 that maintains a list of which ports are active, which ports are available, and where in the cluster each port is actually in use. This approach lets the number of ports per node vary as needed while still allowing an application to migrate to any node in the cluster without concern about duplicate port use. Note: FIG. 6 shows the operation of both unsplit and split stacks. A TCP or UDP instance can run over the local DLPI link 40 or the remote DLPI link 42, but the algorithm is essentially the same. For a TCP/IP cluster implementation, it may be best to link all DLPI instances under each IP at cluster initialization. This allows faster recovery from card failures, and the S-ICS on each node automatically knows how to move large numbers of packets between nodes when a migration occurs, which eliminates the need for S-ICS updates and for path failure recovery.
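The list kept by the middleware thread can be pictured as one owner slot per port. The C sketch below is only an illustration under that assumption; the type `port_registry`, the `PORT_FREE` marker, and all function names are hypothetical and do not come from the disclosure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PORTS 65536
#define PORT_FREE (-1)  /* hypothetical marker: no node is using the port */

/* One owner slot per port: PORT_FREE, or the id of the node using it. */
struct port_registry {
    int16_t owner[NUM_PORTS];
};

void registry_init(struct port_registry *reg)
{
    assert(reg != NULL);
    for (int p = 0; p < NUM_PORTS; p++)
        reg->owner[p] = PORT_FREE;
}

/* Current owner of the port, or PORT_FREE if it is available. */
int registry_owner(const struct port_registry *reg, uint16_t port)
{
    return reg->owner[port];
}

/* Returns 0 and records the owner if the port was free, -1 if in use. */
int registry_claim(struct port_registry *reg, uint16_t port, int16_t node)
{
    assert(node >= 0);
    if (reg->owner[port] != PORT_FREE)
        return -1;
    reg->owner[port] = node;
    return 0;
}

/* Releases the port only when the caller is the recorded owner. */
int registry_release(struct port_registry *reg, uint16_t port, int16_t node)
{
    if (reg->owner[port] != node)
        return -1;
    reg->owner[port] = PORT_FREE;
    return 0;
}

/* Application migration: the port stays in use, only its location changes. */
int registry_migrate(struct port_registry *reg, uint16_t port,
                     int16_t from, int16_t to)
{
    assert(to >= 0);
    if (reg->owner[port] != from)
        return -1;
    reg->owner[port] = to;
    return 0;
}
```

Because ownership is tracked per port rather than per range, a node can hold as many or as few ports as its workload needs, and a migration never frees the port in between.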

【００９２】The following setup is performed for each node in the cluster. The middleware may reside on a separate node; as shown in FIG. 6, if its data cache does not hold sufficient information, the S-ICS may consult the middleware for packet routing. The S-ICS is of course always free to simply discard the packet. If the S-ICS routes the packet to another node, the packet is placed directly on the TCP or UDP read queue and does not pass through IP again.

【００９３】Packets arriving at IP flow as follows. A packet arrives at the IP lower MUX (multiplexer). IP inspects the packet and uses the IP fanout table 52 to determine which queue to send it to. The fanout table is nothing more than a <port, queue> mapping function. If the packet is bound for a local TCP or UDP instance, IP directly executes putnext(tcp_queue, mp) or putnext(udp_queue, mp). If the packet is not for a local endpoint but has a fanout table entry, IP executes putnext(S_ICS_0 queue, mp). If there is no fanout table entry, IP either discards the packet or, if IP forwarding is enabled, may forward it to the S-ICS lower MUX. The S-ICS examines the packet for embedded routing information and consults its data cache to determine the route. If the S-ICS is a MUX, it uses its knowledge of which S-ICS instance sent the data to find the route.
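The decision IP makes for each arriving packet can be summarized in a small user-space sketch. This is a simplified model, not the kernel code: the real table maps a port to a queue pointer, whereas here each entry only records whether the target is local or remote, and the names `fanout_table`, `ip_classify`, and the enum values are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What a fanout entry points at in this simplified model. */
enum fanout_kind { FANOUT_NONE = 0, FANOUT_LOCAL, FANOUT_REMOTE };

/* Action IP takes for an arriving packet. */
enum ip_action {
    DELIVER_LOCAL,      /* putnext() to the local TCP or UDP read queue */
    DELIVER_S_ICS,      /* putnext() to the S-ICS queue in the entry    */
    FORWARD_LOWER_MUX,  /* no entry, IP forwarding enabled              */
    DROP_PACKET         /* no entry, IP forwarding disabled             */
};

struct fanout_table {
    unsigned char kind[65536];  /* indexed by destination port */
    int ip_forwarding;          /* nonzero when forwarding is enabled */
};

/* Classify a packet by destination port, following the flow in the text. */
enum ip_action ip_classify(const struct fanout_table *t, uint16_t dst_port)
{
    assert(t != NULL);
    switch (t->kind[dst_port]) {
    case FANOUT_LOCAL:
        return DELIVER_LOCAL;
    case FANOUT_REMOTE:
        return DELIVER_S_ICS;
    default:
        return t->ip_forwarding ? FORWARD_LOWER_MUX : DROP_PACKET;
    }
}
```

In the design described here the local and remote cases need no separate branch in IP at all, since both are a putnext() on whatever queue the entry holds; the sketch separates them only to make the three outcomes explicit.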

【００９４】In essence, this solution has three components: a set of extended scanning functions, a control thread that manages the effort, and a middleware thread. However, because the control thread is involved in almost every management aspect, the middleware thread may in fact be the control thread. The only drawback of doing so is that the control thread may not be as protocol independent as desired; it is simpler and quicker to design a middleware thread specific to the protocol stack in question while keeping the control thread completely protocol independent. Regardless of how this is actually implemented, the functionality involved and an outline of the implementation are described below.

【００９５】1. When a transport endpoint (a TCP or UDP endpoint in this example) is created, the STREAMS Dynamic Function Replacement facility described in U.S. patent application Ser. No. 08/545,561 is used to extend the put, service, and close routines associated with the read and write queues (details are given later). These functions are extended to preview messages passing to and from the stack. For the put and service routines, the messages of interest are M_PROTO messages, which may contain the TPI T_BIND_REQ and T_UNBIND_REQ messages for the write queue functions and the T_BIND_ACK, T_UNBIND_ACK, and T_ERROR_ACK messages for the read queue functions. For the close routine, some work must be done if it is called while the endpoint has not yet been released. These functions look for specific message types or activities and take action when the conditions match. Using STREAMS Dynamic Function Replacement instead of a module pushed onto the stack (the alternative way to apply this technique) eliminates an extra set of put routines (read and write) that would be invoked for every message, and so improves performance. Note: all other messages are handled immediately by the original put or service routine, so the only performance cost is the overhead of a function call.
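The idea of replacing a queue's put routine with a previewing wrapper can be shown with plain function pointers. The sketch below is a user-space model only and does not use the actual Dynamic Function Replacement interface, whose API is not given in this text; `struct queue`, `replace_put`, `preview_put`, and the counters are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

enum msg_type { MSG_DATA, MSG_BIND_REQ, MSG_UNBIND_REQ };

struct msg {
    enum msg_type type;
};

struct queue;
typedef void (*put_fn)(struct queue *q, struct msg *m);

/* Simplified queue: the installed put routine plus the saved original. */
struct queue {
    put_fn put;       /* routine currently invoked for each message  */
    put_fn orig_put;  /* original routine, saved by replace_put()    */
    int to_module;    /* demo counter: messages handled by module    */
    int to_control;   /* demo counter: messages sent to control thread */
};

/* Stand-in for the transport module's own put routine. */
void module_put(struct queue *q, struct msg *m)
{
    (void)m;
    q->to_module++;
}

/* Extended routine: divert bind/unbind requests, pass everything else on. */
void preview_put(struct queue *q, struct msg *m)
{
    if (m->type == MSG_BIND_REQ || m->type == MSG_UNBIND_REQ)
        q->to_control++;    /* stand-in for routing to the control thread */
    else
        q->orig_put(q, m);  /* only cost: one extra function call */
}

/* Install a replacement put routine, remembering the original. */
void replace_put(struct queue *q, put_fn replacement)
{
    assert(q != NULL && replacement != NULL);
    q->orig_put = q->put;
    q->put = replacement;
}
```

Compared with pushing an extra module, nothing is added to the stream: the same queue simply calls a different function, which is why ordinary messages pay only for the extra call.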

【００９６】2. The control thread, the same thread described above, optionally cooperates with the middleware and the extended functions. Continuing the same example, the endpoint has been opened and its functions have been extended. When the application sends a T_BIND_REQ TPI message to the TCP/UDP module, the extended write put routine detects it and temporarily redirects the message to the control thread. After the message is redirected, processing continues. Processing can continue because the application's request is not truly complete until the application receives a TPI T_BIND_ACK or T_ERROR_ACK in return. This is an asynchronous event, and all TPI implementations and transport providers support this behavior.

【００９７】3. The control thread extracts the bind information, in this case the relevant port information, and consults the middleware thread, which may be running on a different node in the cluster.
4. The middleware thread examines the bind information and determines whether the specified port is available. If it is not, the middleware thread returns an error to the control thread, which takes the appropriate action. If the port is available, the middleware thread marks it "in use" and notifies the rest of the cluster, that is, the remaining control threads, that the port is in use and which node is using it.

【００９８】5. The control thread on each node issues a bind request to IP. For this bind, the fanout table entry <port, read queue address> is extended so that the read queue address is actually the S-ICS write queue address. The S-ICS is also notified of the bind and updates its data cache to reflect the new routing information. Each node thus has an IP instance that holds the bound port address, which is used to route packets that arrive at a node where the application is not currently running (potentially because of application migration or recovery, where a DLPI component failed at some point and a new instance was started on another node).

【００９９】6. When the cluster has completed the bind operation on each node, the middleware thread notifies the control thread of success or failure. On success, the control thread uses the in-kernel STREAMS interface to write the original bind message to the write queue of the transport module (that is, TCP or UDP). That module and its IP instance perform the necessary bind operations and generate a T_BIND_ACK or T_ERROR_ACK. If a T_ERROR_ACK occurs, it is known that a bind operation was in progress, so the control thread is notified that the bind failed; it notifies the middleware thread, which unbinds the port on all other nodes. If the bind succeeds, the stream stack operates normally whether or not its components are distributed. That is, the technique above can also be applied to a stream that was created normally and whose components all run on the same node.
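Steps 5 and 6 amount to an all-or-nothing bind across the cluster: if any node cannot bind the port, the nodes that already did are unbound. The following C sketch models only that rollback logic in user space; `struct cluster`, the per-node bitmaps, and the function names are assumptions for illustration, and the real exchange happens through messages between control threads rather than direct calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8

/* Simplified cluster state: one bound-port bitmap per node. */
struct cluster {
    int num_nodes;
    unsigned char bound[MAX_NODES][65536 / 8];
};

int node_is_bound(const struct cluster *c, int node, uint16_t port)
{
    return (c->bound[node][port / 8] >> (port % 8)) & 1;
}

/* Returns 0 on success, -1 if the port is already bound on this node. */
int node_bind(struct cluster *c, int node, uint16_t port)
{
    if (node_is_bound(c, node, port))
        return -1;
    c->bound[node][port / 8] |= (unsigned char)(1u << (port % 8));
    return 0;
}

void node_unbind(struct cluster *c, int node, uint16_t port)
{
    c->bound[node][port / 8] &= (unsigned char)~(1u << (port % 8));
}

/* Bind the port on every node; on any failure undo the binds made so far. */
int cluster_bind(struct cluster *c, uint16_t port)
{
    assert(c != NULL && c->num_nodes > 0 && c->num_nodes <= MAX_NODES);
    int n;
    for (n = 0; n < c->num_nodes; n++) {
        if (node_bind(c, n, port) != 0)
            break;
    }
    if (n == c->num_nodes)
        return 0;           /* success reported to the control thread */
    while (--n >= 0)
        node_unbind(c, n, port);  /* roll back only what this call bound */
    return -1;
}
```

Note that the rollback touches only the nodes bound by this attempt; a node that refused because another owner already held the port keeps that earlier binding.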

【０１００】7. Later, the application either closes the stream instance or issues an unbind request. In either case the extended routine detects it and takes action. If the application closes the stream stack, the close routine sends a message to the control thread, which notifies the local S-ICS instance so that it cleans up its cache. The control thread also notifies the middleware thread, which notifies all other nodes in the cluster, and an unbind request is issued to the relevant IP instance on each. This causes the S-ICS instance on each node to clean up its data cache. If the application issues an unbind request instead of a close, the algorithm is the same, except that the extended function suspends message processing until the control threads on all nodes have executed the unbind before continuing, just as a bind request waits. Table 1 below is a sample of bind preview code.

【０１０１】[0101]

【表１】[Table 1]
tpi_bind_w_preview(q, mp)
{
    union T_primitives *tpi;

    if (mp->b_datap->db_type != M_PROTO) {
        /* Not a protocol message: hand it to the original put routine */
        (*q->q_qinfo->qi_putp)(q, mp);
    } else {
        tpi = (union T_primitives *)mp->b_rptr;
        if (tpi->PRIM_type == T_BIND_REQ || tpi->PRIM_type == T_UNBIND_REQ) {
            /* Route the message to the control thread */
            route_msg(tpi_ct_endpoint, mp, write_queue);
        } else {
            (*q->q_qinfo->qi_putp)(q, mp);
        }
    }
}

【０１０２】Within the control thread, multiple communication points are monitored, one of which is associated with this activity. The control thread performs the necessary tasks described above and invokes streams_putmsg() to write the message to the correct queue. If the operation fails, the control thread creates a T_ERROR_ACK and executes putnext(tp_read_queue, mp). If the operation succeeds across the cluster, the control thread executes putnext(tp_write_queue, mp) on the original message. Table 2 below is a sample of the read-side preview function.

【０１０３】[0103]

【表２】[Table 2]
tpi_bind_r_preview(q, mp)
{
    union T_primitives *tpi;

    if (mp->b_datap->db_type != M_PCPROTO) {
        /* Not a priority protocol message: hand it to the original put routine */
        (*q->q_qinfo->qi_putp)(q, mp);
    } else {
        tpi = (union T_primitives *)mp->b_rptr;
        if (tpi->PRIM_type == T_ERROR_ACK) {
            /* The bind operation failed: notify the control thread */
            /* so that it can clean up */
            route_msg(tpi_ct_endpoint, mp, read_queue);
        } else {
            (*q->q_qinfo->qi_putp)(q, mp);
        }
    }
}

【０１０４】For close processing, a close function such as the sample in Table 3 below is executed.

【０１０５】[0105]

【表３】[Table 3]
tpi_close(q)
{
    mblkP mp;

    /* Create a close message to notify the control thread of the close */
    mp = create_close_msg();
    route_msg(tpi_ct_endpoint, mp, read_queue);

    /* Wait for the message from the control thread indicating */
    /* that the cluster has been cleaned up */
    recv_msg(q, mp);

    /* Invoke the original close routine */
    (*q->q_qinfo->qi_putp)(q, mp);
}

【０１０６】This function coordinates with the control thread. The control thread notifies the other control threads that the port has been released and that every S-ICS must update its routing tables to reflect that the close has been performed and the given queue is no longer valid. If a race condition exists, the S-ICS contacts the control thread, which must check whether the port is available and tells the S-ICS to discard all messages for this port for the time being.

【０１０７】2.4 Port Mapping Optimization
In the description above, the control thread was used to pass the bind request to the middleware thread and return the result. This message passing lets the control thread remain protocol independent, but it slows the port mapping function somewhat, and where many connections are created, as with a web browser, the speed at which connections are established matters. To address this, the control thread can be made somewhat protocol dependent by having it recognize that it is executing a bind and must invoke a protocol-specific processing function.

【０１０８】In this embodiment the TCP/IP processing function is defined as follows. There are 64K ports available to applications. Of these, about 5K are reserved for applications that let the protocol choose the port. The processing function takes this into account to optimize locally. When an application requests "any" port, the processing function allocates one from a local pool of "any" ports, which are divided among the cluster gateway nodes. In this case there is no need to coordinate port allocation between nodes; it is enough to note which ports are currently in use and to establish the binding later. That is, a weak cluster-wide bind is performed, and routing requests are handled as needed. "As needed" applies because an application may set a "lifetime" using an option that may be set at initialization. For example, a web browser potentially makes many connections per multipart document, and each connection is short-lived. A well-designed browser indicates that its connections are short-lived, which clustering can exploit to reduce port management overhead. When a node exhausts its local pool of "any address" ports, it either requests to borrow ports from another node or starts a protocol that rebalances the available ports within the cluster. For the rest of the port space, the control thread must coordinate port allocation and routing information and clean up on unbind or close. Table 4 below shows a pseudo-algorithm for this.

【０１０９】[0109]

【表４】[Table 4]
if (the bind address is "any address") {
    if (the current node has no available ports) {
        Initiate a group notify request that rebalances the port space so
        that the initiating node is guaranteed to obtain at least one port
        if any port is available within the cluster. If no port is available,
        notify the application that the bind failed by issuing a T_ERROR_ACK
        containing a port-in-use error condition, executing
        putnext(OTHERQ(transport provider write queue), mp), and prevent the
        local transport provider from receiving the bind request.
    }
    If the connection is not known to be short-lived, or if the policy is
    always to notify all nodes that track port state of every bind-related
    activity, notify the remote nodes of the port and node address so that
    they can take appropriate action. This must be done using a reliable
    broadcast protocol.
} else {
    <for a specific port>
    Port allocation must be coordinated among all nodes that track port
    state. The group notify request is used for this. Group notify allows a
    low-latency packet to be delivered reliably to a set of nodes, each of
    which sends a response back to the initiating node. In this example the
    request carries the port address and the associated node, and the
    response is acceptance or rejection of the request. A node that responds
    with acceptance locally (1) marks the port in use, (2) if the connection
    is long-lived, creates a bind instance to update the fanout table, and
    (3) updates its routing information. If a node responds with rejection,
    the requesting node assumes that the port is in use. No further
    assumptions are made about what the rejecting node does.
}
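The "any address" branch of Table 4 can be sketched as allocation from a local pool with a fallback to other nodes. In this C sketch, borrowing from a peer pool stands in for the group notify rebalancing request, which is not modeled; `port_pool`, `alloc_any_port`, and the pool capacity are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_CAP 64  /* illustrative capacity, not from the source */

/* A node's local pool of "any address" ports, used as a stack. */
struct port_pool {
    uint16_t ports[POOL_CAP];
    size_t count;
};

/* Add a port to the pool; returns -1 if the pool is full. */
int pool_put(struct port_pool *p, uint16_t port)
{
    if (p->count == POOL_CAP)
        return -1;
    p->ports[p->count++] = port;
    return 0;
}

/* Remove a port from the pool; returns -1 if the pool is empty. */
int pool_take(struct port_pool *p, uint16_t *out)
{
    if (p->count == 0)
        return -1;
    *out = p->ports[--p->count];
    return 0;
}

/* Allocate an "any" port: local pool first, then borrow from a peer.
 * Returns -1 when no port is available anywhere, the case in which the
 * text issues a T_ERROR_ACK with a port-in-use condition. */
int alloc_any_port(struct port_pool *local, struct port_pool *peers,
                   size_t num_peers, uint16_t *out)
{
    assert(local != NULL && out != NULL);
    if (pool_take(local, out) == 0)
        return 0;  /* fast path: no inter-node coordination needed */
    for (size_t i = 0; i < num_peers; i++) {
        if (pool_take(&peers[i], out) == 0)
            return 0;
    }
    return -1;
}
```

The fast path is the point of the optimization: as long as the local pool is not empty, a bind to "any" port completes without any message to another node.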

【０１１０】In the above scenario, a race condition can arise in which two nodes try to bind the same port address. This is indicated by the group notify reporting only partial success to each initiating node. When this happens, the implementation chooses either to fail the bind request or to run a random backoff algorithm and retry the bind. Random backoff can be achieved by using the same random number generation routine on every node, seeded with a node-specific key, which makes the behavior somewhat predictable. This can allow one application to succeed in binding while the other fails.
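One way to obtain a backoff that is random across nodes yet reproducible per node is to run the same generator everywhere and seed it with a node-specific key. The sketch below uses a 32-bit xorshift generator purely as an example; the text does not name a generator, and `struct backoff`, `backoff_init`, and `backoff_next_ms` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-node backoff state, seeded from a node-specific key. */
struct backoff {
    uint32_t state;
};

/* xorshift needs a nonzero state, so a zero key is replaced by a constant. */
void backoff_init(struct backoff *b, uint32_t node_key)
{
    assert(b != NULL);
    b->state = node_key ? node_key : 0x9E3779B9u;
}

/* Next delay in [0, max_ms), from one xorshift32 step. The same key
 * always yields the same sequence, so a run can be replayed. */
uint32_t backoff_next_ms(struct backoff *b, uint32_t max_ms)
{
    assert(b != NULL && max_ms > 0);
    uint32_t x = b->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    b->state = x;
    return x % max_ms;
}
```

Because two nodes with different keys follow different sequences, they are unlikely to retry at the same instant, while a given node's sequence can be reproduced for debugging.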

【０１１１】When a bind fails, the initiating node notifies the nodes that granted the port of the failure, and those nodes perform the necessary cleanup. If the interconnect cannot report which nodes failed, the initiating node must notify all nodes that it did not bind the port. Each remote node then checks the initiator's address and the port to determine whether cleanup is needed.

【０１１２】With partial grants, a race exists between one application and another for a bind to the same port. When this happens, there are two possible solutions. In the first, each node examines its ratio of grants to rejections, and the node for which a majority of nodes granted the bind wins the port. The losing node issues a T_ERROR_ACK containing a port-in-use condition to its application and sends an error message to the nodes that granted it ownership of the port so that they clean up their tables and allocated resources. The winning node waits long enough for the other nodes to be notified, that is, an interval of (time to process an error × N) + (time to send an error cleanup completion message × N), and then initiates the bind request to the nodes that originally failed.
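The majority rule and the waiting interval are simple enough to write down directly. In this C sketch, N is taken to be the number of nodes that must be notified, which is an assumption since the text does not define N; the function names and the microsecond unit are likewise illustrative.

```c
#include <assert.h>

/* A node wins the port when more nodes granted its bind than rejected it.
 * A tie counts as a loss in this sketch; the text does not cover ties. */
int wins_port(unsigned grants, unsigned rejects)
{
    return grants > rejects;
}

/* Interval the winner waits before reissuing the bind:
 * (time to process an error * N) + (time to send a cleanup message * N).
 * N is assumed here to be the number of nodes to be notified. */
unsigned long winner_wait_us(unsigned long error_process_us,
                             unsigned long cleanup_msg_us,
                             unsigned n)
{
    assert(n > 0);
    return error_process_us * n + cleanup_msg_us * n;
}
```

For example, with 50 µs to process an error, 20 µs to send a cleanup completion message, and four nodes, the winner would wait 280 µs before retrying.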

【０１１３】The other option is for each node to notify all granting nodes that the bind failed and that cleanup is needed. Each node then waits a node-specific random amount of time (using a random number generator with a node-specific key, so that each node follows a predictable pattern that can be replayed if needed for debugging or testing). When the timer expires, the bind is initiated again and the result is checked. If another collision occurs, a maximum collision counter is used to decide whether another attempt should be made. This method can result in both applications failing to bind, but that is very unlikely.
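The retry loop with a maximum collision counter can be sketched as follows. The bind attempt and the wait are passed in as callbacks so that the loop itself stays independent of how they are implemented; `bind_with_retry` and the scripted helpers are hypothetical names used only to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Callback types: attempt returns 0 on success, nonzero on collision. */
typedef int (*bind_attempt_fn)(void *ctx);
typedef void (*wait_fn)(void *ctx, int collisions);

/* Retry a bind until it succeeds or max_collisions is exceeded.
 * Returns the number of collisions seen on success, -1 on giving up. */
int bind_with_retry(bind_attempt_fn attempt, wait_fn wait, void *ctx,
                    int max_collisions)
{
    assert(attempt != NULL && wait != NULL && max_collisions >= 0);
    int collisions = 0;
    for (;;) {
        if (attempt(ctx) == 0)
            return collisions;
        if (++collisions > max_collisions)
            return -1;          /* both applications may end up failing */
        wait(ctx, collisions);  /* node-specific random delay goes here */
    }
}

/* Scripted stand-ins for the example: fail a fixed number of times. */
struct scripted_bind {
    int fail_first;  /* number of attempts that collide */
    int attempts;    /* attempts made so far */
    int waits;       /* waits performed so far */
};

int scripted_attempt(void *ctx)
{
    struct scripted_bind *s = ctx;
    s->attempts++;
    return s->attempts <= s->fail_first ? -1 : 0;
}

void scripted_wait(void *ctx, int collisions)
{
    struct scripted_bind *s = ctx;
    (void)collisions;
    s->waits++;
}
```

In a real node the wait callback would sleep for the node-specific random interval, and the attempt callback would issue the group notify and inspect the responses.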

【０１１４】In case it is not obvious: if a cluster is to handle many thousands of connections simultaneously, TCP/IP needs a larger port space. That requires a protocol change approved by the IETF. Some remedy for this problem will hopefully be devised in the future; until then, an artificial limit will have to be placed on the number of connections a cluster can hold at any one time.

【０１１５】2.4.1 Further Optimization of Port Mapping
It may be possible to optimize the port mapping solution further by performing the bind only when an application migrates. In the example cluster implementation of the invention, there is at least one IP address per DLPI driver. This means that a packet cannot arrive at any node or interface other than the one the application is currently using. That holds for the lifetime of the bind as long as the application does not migrate to a new node. If it does migrate, the S-ICS must create a new forwarding route so that every packet arriving at this node is forwarded to the new node, and the fan-out table must be updated to hold a <port address, S-ICS write queue> entry. If all of this holds, then apart from notifying the other nodes which ports are in use, there is no need to create a fan-out table entry for every bound application, nor to perform any additional cleanup on an unbind or close beyond notifying the cluster that the port is available again.
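The optimization can be pictured as a fan-out table that stays empty until an application migrates. The following sketch assumes a small fixed-size table and uses `struct queue` as a stand-in for a STREAMS queue; the table size and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define FANOUT_SLOTS 64          /* illustrative size, not from the source */

struct queue { int id; };        /* stand-in for a STREAMS queue_t */

/* One <port address, S-ICS write queue> entry. */
struct fanout_entry {
    uint16_t      port;
    struct queue *s_ics_wq;      /* NULL marks a free slot */
};

struct fanout_table {
    struct fanout_entry slot[FANOUT_SLOTS];
};

/* Return the forwarding queue for a migrated port, or NULL to deliver locally. */
struct queue *fanout_lookup(const struct fanout_table *t, uint16_t port)
{
    for (size_t i = 0; i < FANOUT_SLOTS; i++)
        if (t->slot[i].s_ics_wq != NULL && t->slot[i].port == port)
            return t->slot[i].s_ics_wq;
    return NULL;
}

/* Called when the application bound to `port` migrates.
 * Returns 0 on success, -1 if wq is NULL or the table is full. */
int fanout_on_migrate(struct fanout_table *t, uint16_t port, struct queue *wq)
{
    struct fanout_entry *free_slot = NULL;

    if (wq == NULL)
        return -1;
    for (size_t i = 0; i < FANOUT_SLOTS; i++) {
        struct fanout_entry *e = &t->slot[i];
        if (e->s_ics_wq != NULL && e->port == port) {
            e->s_ics_wq = wq;    /* migrated again: update the route */
            return 0;
        }
        if (e->s_ics_wq == NULL && free_slot == NULL)
            free_slot = e;
    }
    if (free_slot == NULL)
        return -1;
    free_slot->port = port;
    free_slot->s_ics_wq = wq;
    return 0;
}
```

A lookup that returns NULL corresponds to the common case in which the application has never migrated, so the packet is delivered locally and no table entry was ever needed.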

【０１１６】2.5 Use of STREAMS Dynamic Function Replacement
Within a cluster, a stream stack instance operates in one of the following modes.
* The stream stack may be unaware that it is part of a cluster. It may behave as if it were on a standalone node; it can still be migrated and so on, but it either does not recognize the cluster facilities themselves or does not necessarily have the ability to exploit them.
* The entire stream stack runs on a single node, but it is aware that it is part of a cluster and is able to use cluster facilities such as global port mapping. This does not mean that the stream stack implementation was modified directly; it was only potentially extended.
* The components of the stream stack may run on two or more nodes within the cluster, as described in the TCP/IP/DLPI split example.

【０１１７】The latter two modes require extending various STREAMS functions associated with each module or driver. The STREAMS dynamic function replacement facility allows all instances of a module or driver to be extended, or the extension to be applied on a per-stream, per-queue basis. For the global port mapping example shown in FIG. 6, it is probably simplest to add the functionality at the TCP or UDP module level by having an application (potentially the control thread) perform the function replacement step through the SAD driver. Details are given in U.S. patent application Ser. No. 08/545,561. This change can be made as part of system/cluster initialization.

【０１１８】On the other hand, it is undesirable for cluster support to require that every instance of a stream stack always be modified. In that case the extensions must be applied later, based on cluster-specific information. To achieve this, only the open routine needs to be extended for all instances of the module or driver. The new open routine looks up specific information about the cluster (such a lookup is permissible because STREAMS open routines are allowed to sleep) and decides whether the whole stack, or a particular module or driver in the stream stack being created, needs further extension.
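The deferred extension can be illustrated with a much simplified model in which a stream carries a single replaceable put routine. This is a sketch under stated assumptions, not the STREAMS interface: `struct stream`, `struct cluster_info`, and both put routines are hypothetical stand-ins, and the dummy put routines only return distinguishable values.

```c
#include <stdbool.h>
#include <stddef.h>

struct stream;
typedef int (*put_fn)(struct stream *s, int msg_type);

/* Simplified per-stream state; a stand-in for a STREAMS queue pair. */
struct stream {
    put_fn put;        /* current put routine                 */
    bool   extended;   /* set once the extension is applied   */
};

/* Hypothetical result of the cluster lookup for the stack being opened. */
struct cluster_info {
    bool node_in_cluster;
    bool stack_is_split;   /* e.g. split at the IP/DLPI level */
};

/* Dummy put routines; the return values only make them distinguishable. */
int base_put(struct stream *s, int msg_type)    { (void)s; return msg_type; }
int cluster_put(struct stream *s, int msg_type) { (void)s; return -msg_type; }

/*
 * Extended open: only this routine is replaced for every instance.
 * It decides, from cluster information, whether this particular
 * instance needs the cluster-aware put routine as well.
 */
int extended_open(struct stream *s, const struct cluster_info *ci)
{
    s->put = base_put;
    s->extended = false;
    if (ci != NULL && ci->node_in_cluster && ci->stack_is_split) {
        s->put = cluster_put;   /* per-stream function replacement */
        s->extended = true;
    }
    return 0;
}
```

Streams opened outside a cluster, or in stacks that are not split, keep the unmodified put routine, so only the instances that need cluster behavior pay for it.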

【０１１９】This example is a TCP/IP/DLPI stack split at the IP/DLPI level, as shown in FIG. 4(B). It illustrates a potential performance improvement: messages are sent directly from DLPI to the S-ICS, rather than passing through the local stream head and then being redirected to the remote node by the control thread. This can be achieved in two ways. (1) The q_next field of the DLPI read queue is rerouted to point to the S-ICS write queue. Such rerouting is possible, although it complicates the S-ICS implementation. (2) The put routine of the DLPI stream head read queue is extended to preview the data and then perform the rerouting. The second method is preferred, for the following reasons.

【０１２０】No queues are altered, so the basic functionality and the message flow stay the same. Messages still flow from the DLPI driver to the stream head as before, which reduces potential support and debugging problems. If the first method were used, the S-ICS would have to decide whether each message should be reflected or routed to the control thread for processing. Some messages, such as M_SETOPTS, require modifying the local stream head parameters as well as those of the remote stream head. This would not only complicate the S-ICS design and implementation but also increase the amount of protocol-specific information the S-ICS must know.

【０１２１】With the second method, the control thread can select the type of preview to be performed for each module or driver. This can be done in two ways: (1) the cluster designer writes the preview function code per class, based on where the stack is split, for example for a generic DLPI class; or (2) each module or driver developer defines a protocol-specific stream head preview function. Either way, the preview code focuses on the immediate needs of the stack, which shortens design and development time and increases flexibility while keeping the control thread protocol-independent. One example is a DLPI driver that detects a hardware problem and generates an M_ERROR message. If the DLPI stream head preview function needs nothing special to be invoked for this condition, the message is sent to the S-ICS like any ordinary message. If, however, the cluster designer decides that for the DLPI driver class, or for a particular driver type such as DLPI over ATM, the control thread must take some action for a high-availability recovery scheme (for example, migrating the DLPI instance to a new card or a new node), the preview function invokes the stream head put routine to hand the message to the control thread. The control thread then decides, based on how it has been configured to act for this message type and this class of STREAMS module or driver, whether the message must be forwarded to the application.

【０１２２】For messages that require no STREAMS framework action, such as M_DATA or M_PROTO, the preview function automatically performs putnext() on the S-ICS write queue. This avoids queuing the message at the stream head, context switching to the control thread, and having the control thread examine the message only to find that it need not be involved. Table 5 below is sample code for the preview function.

【０１２３】[0123]

【表５】
dlpi_sth_read_Preview(q, mp)
{
    switch (mp->b_datap->db_type) {
    case M_HANGUP:
    case M_ERROR:
    /* The control thread initiates local ioctl commands, even if it   */
    /* only forwards the request, so that the acknowledgment is routed */
    /* back through the control thread.                                */
    case M_IOCACK:
    case M_IOCNAK:
        /* Invoke the standard stream head read put routine so that    */
        /* the control thread can decide what to forward or, in the    */
        /* M_HANGUP/M_ERROR case, whether to start a recovery scheme.  */
        (*q->q_qinfo->qi_putp)(q, mp);
        break;
    case M_DATA:
    case M_PROTO:
    case M_PCPROTO:
    default:
        /* Any message type not listed above is forwarded automatically. */
        if (canput(q->q_ptr->s_ics->write_queue))
            putnext(q->q_ptr->s_ics->write_queue, mp);
        else {
            /* The upper S-ICS mux is flow controlled; handle it. */
            initiate_flow_control_recovery(q, mp);
        }
    }
}

【０１２４】The s_ics pointer is stored in the private data structure (q->q_ptr) that defines the stream head associated with this DLPI instance. It holds the address of the S-ICS stream head associated with this instance, and messages are placed on the S-ICS write queue, which sends them on to the remote node. Note that the sample function above does not show how the routing information is embedded in the message.

【０１２５】3.0 Control Thread Design
The previous sections gave an overview of the control thread; this section describes it in detail. It should be understood, however, that the final design depends on the target operating system, the STREAMS implementation, and the platform on which distributed STREAMS is implemented. This section also sets out the high-level design, the data the control thread must track, and the way the control thread interacts with other components. Details of control thread design and interaction for distributed stream creation, stream migration, and policy management are given in later sections.

【０１２６】Note: in the previous sections, the control thread 34 was shown operating directly on the P-ICS and the S-ICS. These two drivers have entirely different interfaces and potentially different operating characteristics. For example, one may be a sender-managed communication mechanism and the other a receiver-managed one. Such differences can lead to design complexity and potential performance trade-offs. If the differences are large, the control thread can instead be implemented as two threads that share a data set so as to appear as a single thread of execution. One thread manages STREAMS-related activity, and the other manages cluster management activity such as port space and device driver management. This arrangement improves the protocol independence of the STREAMS-related drivers and code.

【０１２７】3.1 Control Thread == Middleware
Can the control thread 34 replace the middleware component 50 within the cluster? The answer is yes. The control thread is essentially implemented as distributed middleware with the described functionality added. This was already touched on in the discussion of the global port mapping optimization (FIG. 6); later sections note, wherever possible, the enhancements needed to handle this functionality.

【０１２８】3.2 Creation
The control thread 34 is created as follows. If the control thread must be running before the system completes its initialization, the thread must be created as part of the standard system initialization process. This is the case when the node configures itself according to the requirements of the system. For example, if the system must load a particular networking subsystem, and that subsystem is part of the cluster facility or depends on it, the control thread must be operational before the subsystem is initialized.

【０１２９】If the control thread does not have to be running at system startup, it can be started at any time through a cluster start command, but the following must be considered. If the cluster facility includes subsystems that are brought up at system initialization, these subsystems must either be unaffected by clustering or, if they are affected, be converted into some form in which the control thread can communicate with them. One example is a node running an NFS service that has bound a set of ports. Assuming NFS runs over a STREAMS-based transport provider, the port information must be reported to the middleware thread, and the appropriate ports must be bound on the other nodes in the cluster to guarantee that packets are routed correctly to this node, or are forwarded correctly if the application migrates. In addition, the function replacement described above must be performed for the put, service, and close routines.

【０１３０】In general, the control thread must run as a real-time thread to guarantee that the cluster facilities using it receive timely responses to their requests. Typically this is arranged as part of thread creation, that is, through the way the thread is started or through a set-priority command.

【０１３１】Termination of this thread halts STREAMS-related clustering activity on that node unless steps were taken at creation time to allow the thread to be restarted. Those steps are as follows. (1) Allocate a well-known data structure within the kernel and store its address in a kernel global variable. (2) In this data structure, the control thread must store the following information: the addresses of all stream heads opened by or assigned to this control thread, the address of the middleware interface device (communication with the middleware takes place via a file pointer, an interconnect descriptor, or the like), and the policy information communicated to the thread. (3) If sender-managed communication is used to talk to the middleware thread, some mechanism is needed to preserve the user space. This may entail telling the cluster binding management subsystem how to retain this data and how to restore it for a new control thread.
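Steps (1) and (2) can be sketched in user-space C as below. A file-scope pointer stands in for the kernel global variable and `calloc` for kernel allocation; the structure layout, the bound on stream heads, and the function names are all hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_STREAM_HEADS 16   /* illustrative bound, not from the source */

/* State that must outlive a control thread so a new one can take over. */
struct ctl_preserved {
    void  *stream_heads[MAX_STREAM_HEADS]; /* opened or assigned stream heads */
    size_t n_stream_heads;
    void  *middleware_dev;                 /* middleware interface device     */
    void  *policy;                         /* policy information              */
};

/* Stand-in for the kernel global variable holding the structure's address. */
static struct ctl_preserved *ctl_preserved_global = NULL;

/*
 * Called when a control thread starts. Returns the preserved state;
 * *restarted is set to 1 when state left by an earlier thread was found,
 * otherwise to 0. Returns NULL if allocation fails.
 */
struct ctl_preserved *ctl_attach(int *restarted)
{
    if (ctl_preserved_global != NULL) {
        *restarted = 1;
        return ctl_preserved_global;
    }
    *restarted = 0;
    ctl_preserved_global = calloc(1, sizeof *ctl_preserved_global);
    return ctl_preserved_global;
}

/* Record a stream head so that a restarted thread can find it; -1 if full. */
int ctl_track_stream_head(struct ctl_preserved *p, void *sth)
{
    if (p->n_stream_heads >= MAX_STREAM_HEADS)
        return -1;
    p->stream_heads[p->n_stream_heads++] = sth;
    return 0;
}
```

A replacement control thread calling `ctl_attach` sees `restarted == 1` and finds the stream heads recorded by its predecessor, which is what allows clustering activity to resume on the node.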

【０１３２】The node structure depends on how nodes are addressed and how routing is accomplished within the cluster. The sample structure contains the fields considered essential, but others can be added in the course of implementation. The flags given are the minimum required. Table 6 below is a sample node structure definition.

【０１３３】[0133]

【表６】
#define NODE_INACTIVE  0x0000
#define NODE_ACTIVE    0x0001
#define NODE_IN_ERROR  0x0002

struct node_t {
    int32 node_flags;              /* state flags of the remote node */
    node_address_t node_address;   /* address of the remote node */
    node_route_t node_route;       /* route from the local node to the remote node */
    struct node_t *next, *prev;    /* next and previous nodes in the list */
};

#define NODE_HASH_SIZE 1024
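One way such a node table might be searched is sketched below. The source does not define `node_address_t` or the hash function, so a 32-bit address and a simple mask are assumed; the table of bucket pointers, the reduced `struct node`, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NODE_INACTIVE  0x0000
#define NODE_ACTIVE    0x0001
#define NODE_IN_ERROR  0x0002
#define NODE_HASH_SIZE 1024

typedef uint32_t node_address_t;   /* assumption: a 32-bit node address */

/* Reduced form of the node structure (the route field is omitted). */
struct node {
    int32_t        node_flags;
    node_address_t node_address;
    struct node   *next, *prev;    /* bucket chain */
};

static struct node *node_table[NODE_HASH_SIZE];

static size_t node_hash(node_address_t a)
{
    return (size_t)(a & (NODE_HASH_SIZE - 1));  /* size is a power of two */
}

/* Link a node at the head of its bucket. */
void node_insert(struct node *n)
{
    struct node **head = &node_table[node_hash(n->node_address)];
    n->prev = NULL;
    n->next = *head;
    if (*head != NULL)
        (*head)->prev = n;
    *head = n;
}

/* Find a node by address; returns NULL when the address is unknown. */
struct node *node_find(node_address_t a)
{
    for (struct node *n = node_table[node_hash(a)]; n != NULL; n = n->next)
        if (n->node_address == a)
            return n;
    return NULL;
}

/* A node is a usable target only if it is active and not in error. */
int node_usable(const struct node *n)
{
    return (n->node_flags & NODE_ACTIVE) && !(n->node_flags & NODE_IN_ERROR);
}
```

Addresses that differ by a multiple of NODE_HASH_SIZE land in the same bucket and are told apart by walking the next pointers.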

【０１３４】The structures shown below contain P-ICS-specific structure types; their contents cannot be given without knowing what the P-ICS does and what it needs to communicate. At a minimum, the structure must contain an endpoint definition sufficient for message handling, including state flags and a generic function vector that the S-ICS may invoke implicitly to trigger operations. In this way the S-ICS remains as ignorant as possible of the P-ICS implementation, of the migration policy that applies when the S-ICS must move from one P-ICS instance to another, and of the recovery policy for recovering from a P-ICS error or an interconnect failure.

【０１３５】[0135]

【表７】
struct p_ics_endpoint_t {
    uint32 p_ics_state;
    uint32 p_ics_flags;
    int32 (*p_ics_memory_alloc)();
    int32 (*p_ics_memory_free)();
    int32 (*p_ics_open)();
    int32 (*p_ics_close)();
    int32 (*p_ics_attach)();
    int32 (*p_ics_detach)();
    int32 (*p_ics_ctl)();
    int32 (*p_ics_send)();
    int32 (*p_ics_recv)();
    int32 (*p_ics_misc)();
};
typedef struct p_ics_endpoint_t P_ICS_ENDPOINT;

struct p_ics_data {
    P_ICS_ENDPOINT p_ics_endpoint;
    P_ICS_MIGRATION_POLICY p_ics_migration_policy;
    P_ICS_RECOVERY_POLICY p_ics_recovery_policy;
};

【０１３６】In the sample above, p_ics_endpoint defines the P-ICS endpoint and the function vector representing the set of possible operations. p_ics_migration_policy is a policy structure that defines the content of the policy, when the policy is invoked, and the set of function vectors that carry it out. p_ics_recovery_policy is a policy structure that defines the recovery algorithm, the cases in which a remote node fails or the local P-ICS instance fails, and the function vector that carries out the policy.
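The value of the function vector is that the S-ICS can drive any P-ICS without knowing how it is built. The sketch below illustrates this with a reduced endpoint that has only send and receive slots, and with an in-memory loopback standing in for a real interconnect. The argument lists, the `priv` field, and the loopback are assumptions made for the example; the source leaves the vector's signatures unspecified.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced endpoint: only the send and recv slots of the function vector. */
struct p_ics_endpoint {
    uint32_t p_ics_state;
    int32_t (*p_ics_send)(struct p_ics_endpoint *ep, const void *buf, size_t len);
    int32_t (*p_ics_recv)(struct p_ics_endpoint *ep, void *buf, size_t len);
    void    *priv;                       /* implementation-private data */
};

/* One hypothetical P-ICS implementation: an in-memory loopback. */
struct loopback {
    char   data[64];
    size_t len;
};

static int32_t lb_send(struct p_ics_endpoint *ep, const void *buf, size_t len)
{
    struct loopback *lb = ep->priv;
    if (len > sizeof lb->data)
        return -1;
    memcpy(lb->data, buf, len);
    lb->len = len;
    return (int32_t)len;
}

static int32_t lb_recv(struct p_ics_endpoint *ep, void *buf, size_t len)
{
    struct loopback *lb = ep->priv;
    size_t n = lb->len < len ? lb->len : len;
    memcpy(buf, lb->data, n);
    lb->len = 0;
    return (int32_t)n;
}

void loopback_endpoint_init(struct p_ics_endpoint *ep, struct loopback *lb)
{
    lb->len = 0;
    ep->p_ics_state = 0;
    ep->p_ics_send = lb_send;
    ep->p_ics_recv = lb_recv;
    ep->priv = lb;
}

/* S-ICS side: knows only the vector, never the implementation behind it. */
int32_t s_ics_forward(struct p_ics_endpoint *ep, const void *msg, size_t len)
{
    if (ep == NULL || ep->p_ics_send == NULL)
        return -1;
    return ep->p_ics_send(ep, msg, len);
}
```

Moving the S-ICS to a different P-ICS instance then amounts to handing `s_ics_forward` a different endpoint; the forwarding code itself is unchanged.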

【０１３７】[0137]

【表８】
#define CTL_S_ICS_TBL_SIZE 64
#define CTL_STH_HASH_SIZE 512

/* Both table sizes are powers of two, so (size - 1) serves as the index mask. */
#define CTL_HASH_FUNC(dev) \
    (&ctl->ctl_thread_streams[((unsigned)major(dev) ^ (unsigned)minor(dev)) & (CTL_STH_HASH_SIZE - 1)])
#define CTL_S_ICS_HASH_FUNC(dev) \
    (&ctl->ctl_s_ics_streams[((unsigned)major(dev) ^ (unsigned)minor(dev)) & (CTL_S_ICS_TBL_SIZE - 1)])

【０１３８】The assumption underlying this implementation is that stream heads can be linked to one another through the sth->sth_next field. If they cannot, the implementation requires a different hash bucket management strategy.
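Chaining stream heads within a bucket through sth_next might look like the following sketch. The device number layout (major number in the upper 16 bits, minor number in the lower 16) and the reduced `struct sth_s` are assumptions made so that the example is self-contained; a real system would use its own major() and minor() macros and stream head definition.

```c
#include <stddef.h>

#define CTL_STH_HASH_SIZE 512   /* must stay a power of two for the mask below */

/* Hypothetical device number layout: major in bits 16..31, minor in bits 0..15. */
typedef unsigned long dev_num;
#define DEV_MAJOR(d) ((unsigned)(((d) >> 16) & 0xffffu))
#define DEV_MINOR(d) ((unsigned)((d) & 0xffffu))

/* Reduced stream head: only the fields the hash table needs. */
struct sth_s {
    dev_num       sth_dev;
    struct sth_s *sth_next;     /* next stream head in the same bucket */
};

static struct sth_s *ctl_thread_streams[CTL_STH_HASH_SIZE];

static struct sth_s **ctl_bucket(dev_num dev)
{
    return &ctl_thread_streams[(DEV_MAJOR(dev) ^ DEV_MINOR(dev))
                               & (CTL_STH_HASH_SIZE - 1)];
}

/* Push a stream head onto the front of its bucket chain. */
void ctl_sth_add(struct sth_s *sth)
{
    struct sth_s **b = ctl_bucket(sth->sth_dev);
    sth->sth_next = *b;
    *b = sth;
}

/* Find the stream head for a device; NULL when none is registered. */
struct sth_s *ctl_sth_find(dev_num dev)
{
    for (struct sth_s *s = *ctl_bucket(dev); s != NULL; s = s->sth_next)
        if (s->sth_dev == dev)
            return s;
    return NULL;
}

/* Unlink a stream head; returns 0 on success, -1 if it was not present. */
int ctl_sth_remove(struct sth_s *sth)
{
    for (struct sth_s **pp = ctl_bucket(sth->sth_dev); *pp != NULL;
         pp = &(*pp)->sth_next) {
        if (*pp == sth) {
            *pp = sth->sth_next;
            sth->sth_next = NULL;
            return 0;
        }
    }
    return -1;
}
```

Because the chain pointer lives inside the stream head itself, the table needs no separate bucket nodes; this is the property that would be lost if sth_next were unavailable.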

【０１３９】[0139]

【表９】
struct thread_management {
    struct sth_s *ctl_thread_streams[CTL_STH_HASH_SIZE];
    struct sth_s *ctl_s_ics_streams[CTL_S_ICS_TBL_SIZE];
    struct inter_con_descriptor *ctl_middleware_descriptor;
    struct policy *ctl_known_policies;
    struct preserve_policy *ctl_preserver_policy;
    struct node_t ctl_cluster_nodes[NODE_HASH_SIZE];
    struct p_ics_data *ctl_p_ics_instance;
    struct driver_management_policy_t *ctl_driver_policy;
};

struct thread_management ctl_thread_global_data;

【０１４０】In the sample above, ctl_thread_streams holds the stream head pointers for the distributed components running on this node (for example, distributed DLPI instances). ctl_s_ics_streams holds the stream head pointers for the local S-ICS instances. ctl_middleware_descriptor is self-explanatory. ctl_known_policies is the set of policies describing how this control thread should behave; these structures are the mechanism that lets the control thread keep underlying protocol-specific requirements hidden from the outside. ctl_preserver_policy is the self-preservation policy described above. ctl_cluster_nodes is the node table created when the control thread is initialized; its data can be obtained from the interconnect driver, to which it was downloaded as part of that driver's initialization. ctl_p_ics_instance is the P-ICS data the control thread needs in order to communicate with other control threads within the cluster. ctl_driver_policy describes the driver classes and how to handle them; this data is downloaded during control thread initialization. Example policies are those that determine where a driver should be opened and whether a stream stack needs to be split across nodes.

【０１４１】Once the thread has been created, it accesses the existing cluster facility policies and sets up the appropriate infrastructure. Policies are used to specify how particular tasks are to be carried out, such as how and where cluster-wide information is to be downloaded and how to determine which node is the current cluster management node. This information is either stored at a well-known location on each node or downloaded by way of a broadcast to the running middleware threads within the cluster.

【０１４２】If a cluster node is a multiprocessor system such as an SMP, and the node manages many distributed streams, it may be necessary to create several control thread instances. Each instance may be responsible for an individual protocol stack or, if the space is too large for a single thread to handle, for a subset of the instances of each protocol stack. Alternatively, threads can be dedicated to different aspects of the cluster facility, for example one control thread handling strlog(), one handling SAD, and one handling protocols. How the work is divided and how many control threads are needed depend heavily on the implementation. FIG. 7 shows the data structure layout and interconnections. Note that the S-ICS 38 points to these data structures from its private data structure area.

【０１４３】3.3 Control Thread Configuration
The control thread 34 can be configured using any of the following techniques. At a minimum, the control thread must have a default mechanism by which it can communicate with the P-ICS 36 (FIG. 4). This communication mechanism depends on the P-ICS implementation and requires the initialization protocol to be built in as a microprogram.

【０１４４】Instead of broadcasting, the control thread 34 can rely on a less sophisticated but less error-prone mechanism: reading bootstrap configuration data from the file system. This data specifies how the thread should behave, where the middleware (if any) is located, and any other initial steps required. The configuration file can, of course, also contain everything the thread needs to know. Such a file is maintained on a single management node and distributed from there throughout the cluster.
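A file-based bootstrap of this kind might be read as in the following sketch. The source does not define a file format, so the "key=value" syntax, the three keys, and all structure and function names here are purely hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical bootstrap settings; keys and fields are illustrative only. */
struct ctl_bootstrap {
    char middleware_node[32];  /* where the middleware lives, "" if none */
    int  realtime_priority;
    int  split_stacks;         /* 1 if stacks may be split across nodes  */
};

/* Apply one "key=value" line. Returns 0 if recognised, -1 otherwise. */
int ctl_bootstrap_line(struct ctl_bootstrap *cfg, const char *line)
{
    char key[32], val[32];

    if (line[0] == '#' ||
        sscanf(line, " %31[^= \t] = %31s", key, val) != 2)
        return -1;
    if (strcmp(key, "middleware_node") == 0) {
        strcpy(cfg->middleware_node, val);   /* val holds at most 31 chars */
    } else if (strcmp(key, "realtime_priority") == 0) {
        return sscanf(val, "%d", &cfg->realtime_priority) == 1 ? 0 : -1;
    } else if (strcmp(key, "split_stacks") == 0) {
        cfg->split_stacks = (strcmp(val, "yes") == 0);
    } else {
        return -1;
    }
    return 0;
}

/* Read every line of the file at `path`. Returns the number of lines
 * applied, or -1 if the file cannot be opened. */
int ctl_bootstrap_load(struct ctl_bootstrap *cfg, const char *path)
{
    char line[128];
    int applied = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL)
        if (ctl_bootstrap_line(cfg, line) == 0)
            applied++;
    fclose(fp);
    return applied;
}
```

Because the file is maintained on one management node and copied to the others, every control thread starts from the same settings without needing any interconnect traffic.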

【０１４５】Another mechanism is to bootstrap the thread 34 only to the point where it can communicate with other control threads of the same kind or with the middleware. The control thread discovers other threads or the middleware 50 by performing an "ARP" in which it identifies itself and its current state. After that, the control thread responds to whatever is sent to it. For example, when the control thread opens the S-ICS or local stream components, it can look up the data and policies needed to satisfy the requirements of those components. This capability allows the control thread to meet local requirements while remaining largely implementation-independent.

【０１４６】3.4 コンポーネント間交信以下の節は、制御スレッドと対話する分散ＳＴＲＥＡＭ
Ｓの主要コンポーネント間の交信を詳述する。各節は、
そのような交信を通して問題が解決される様態を示すサ
ンプルを含む。3.4 Inter-Component Communication The following sections detail the interactions among the main components of distributed STREAMS that interact with the control thread. Each section includes a sample showing how a problem is solved through such interactions.

【０１４７】3.4.1 ミドルウェア交信 (図６の)ミドルウェア５０との交信の量およびタイミン
グは、ミドルウェアによって提供されるサービスおよび
分散ＳＴＲＥＡＭＳスタックに関する方針要件に依存す
る。これは、分散ＳＴＲＥＡＭＳソルーションをミドル
ウェア非依存に保つための設計上の判定要因である。最
低限、ストリーム作成、ストリーム分散およびエラー回
復という分野が交信を必要とする。交信がどのようには
たらくかを示すため、上述したハードウェア共用、負荷
平均化、高い可用性および単一システム視点などのクラ
スタの使用様態を以下に検討する。3.4.1 Middleware Communication The amount and timing of communication with the middleware 50 (of FIG. 6) depend on the services the middleware provides and on the policy requirements for the distributed STREAMS stack. This is a deliberate design decision that keeps the distributed STREAMS solution middleware independent. At a minimum, stream creation, stream distribution, and error recovery require such communication. To show how the communication works, the cluster usage models described above (hardware sharing, load averaging, high availability, and single system view) are examined below.

【０１４８】ハードウェア共用：ハードウェア共用クラ
スタが定義される時、ある所与の１時点でクラスタ内の
すべてのノードが活動的であることはない。すなわち使
用可能なクラスタ・ハードウェアは流動的であり、依存
できるノードは存在しない。このことは、クラスタの範
囲内の１つまたは複数のノードに存在するソフトウェア
・エンティティ内にクラスタ・ハードウェア構成が存在
しなければならないこと、そして、情報の場所および時
間を決定する動的(静的でも可能)プロトコールがなけれ
ばならないことを意味する。これは制御スレッドの範囲
内で達成されることができるが、その形態は設計を複雑
にし、実施およびデータをストリーム型でない他のクラ
スタ機構について利用できないものとする可能性があ
る。このような状況を改善するのがミドルウェアであ
る。 Hardware Sharing: When a hardware sharing cluster is defined, not all nodes in the cluster are necessarily active at any given point in time. That is, the available cluster hardware is fluid, and no single node can be relied upon. This means that the cluster hardware configuration must be held in a software entity that resides on one or more nodes within the cluster, and that there must be a dynamic (or possibly static) protocol for determining where and when that information is available. This could be achieved within the control thread, but doing so would complicate the design and could make the implementation and its data unavailable to other, non-STREAMS cluster mechanisms. Middleware remedies this situation.

【０１４９】例示されているハードウェア共用の例では
ストリーム・スタック全体がノードＣおよびＤに存在
し、ＴＣＰ／ＩＰスタックとの間でのデータ交換が関数
送達に依存し、アプリケーションがノードＡおよびノー
ドＢ上で実行される。アプリケーションがＴＣＰ／ＩＰ
インスタンスをオープンするため(オープンの詳細なア
ルゴリズムは後述する)、アプリケーションはミドルウ
ェア・スレッドと連絡をとる制御スレッドと交信しなけ
ればならない。制御スレッドは、アクセスされているハ
ードウェアのタイプ、実行されているアプリケーション
のタイプ、および遠隔目標ネットワーク・アドレスやア
プリケーション優先度のような付加的修飾因子を示すメ
ッセージを送付することによって、ミドルウェア５０を
調べる。この情報を使用してミドルウェア方針要求が作
成される。ミドルウェアはこの要求を調べて、要求がど
のノードで充足されているかを判断する。このアルゴリ
ズムは図８に示されている。In the illustrated hardware sharing example, the entire stream stack resides on nodes C and D, data exchange with the TCP/IP stack relies on function delivery, and the applications run on node A and node B. To open a TCP/IP instance (the detailed open algorithm is described later), an application must interact with a control thread, which in turn contacts the middleware thread. The control thread queries the middleware 50 by sending a message that indicates the type of hardware being accessed, the type of application being run, and additional qualifiers such as the remote target network address and the application priority. This information is used to build a middleware policy request. The middleware examines the request and determines on which node it can be satisfied. This algorithm is shown in FIG. 8.
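The policy request described above can be pictured with a small sketch. The actual algorithm is the one in FIG. 8, which is not reproduced here; the structures, field names, and the first-match rule below are assumptions chosen only to show how hardware type and an application qualifier might select a node.

```c
#include <assert.h>
#include <stddef.h>

enum hw_type { HW_LAN, HW_DISK };   /* illustrative hardware classes */

/* Hypothetical policy request built by the control thread. */
struct policy_request {
    enum hw_type hw;     /* type of hardware being accessed */
    int app_priority;    /* additional qualifier */
};

/* Hypothetical per-node record held by the middleware. */
struct node_caps {
    int active;          /* nonzero if the node is currently up */
    unsigned hw_mask;    /* bit (1u << hw_type) set if hardware is present */
    int min_priority;    /* lowest application priority the node accepts */
};

/* Return the index of the first active node that satisfies the request,
 * or -1 if no node qualifies. */
int mw_resolve(const struct policy_request *req,
               const struct node_caps *nodes, size_t n)
{
    size_t i;

    assert(req != NULL && nodes != NULL);
    for (i = 0; i < n; i++) {
        if (nodes[i].active &&
            (nodes[i].hw_mask & (1u << req->hw)) &&
            req->app_priority >= nodes[i].min_priority)
            return (int)i;
    }
    return -1;
}
```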

【０１５０】これらのメッセージの実際の内容は、ミド
ルウェアに依存してはいるが、制御スレッドがミドルウ
ェアから独立的であることを排除しない。独立性を維持
するため、一般的メッセージおよびアクションを実行す
る関数ベクトルを作成してミドルウェア固有メッセージ
との間の変換を行うことによって、制御スレッドは上記
情報のすべてを隠すこともできる。この技術は、ＳＴＲ
ＥＡＭＳおよび多くのオペレーティング・システムで使
用されてきたもので、オブジェクト指向アプリケーショ
ンにおいて非常に一般的である。分散ストリームに関し
ては最低限以下の動作がサポートされなければならな
い。＊ミドルウェアの位置およびミドルウェア故障回復方針
を取得すること。＊クラスタ・ノード構成および関連ハードウェアを取得
すること。ノード構成は、ＳＴＲＥＡＭＳドライバおよ
びモジュールが配置されている場所、そのノード上で実
行されている制御スレッド、どのＳＴＲＥＡＭＳサービ
スがサポートされているか等の項目を含む。＊クラスタ・ノード経路指定テーブルを取得すること。＊活動クラスタ・ノードのリストを取得すること。この
リストは動的であるので、制御スレッドが周期的にこの
情報を確かめるか、あるいは、ミドルウェアが変更のす
べての制御スレッドを更新するかもしれない。＊ノード障害回復および高可用性に関連するその他の活
動を実行する方法を決定する方針を取得すること。例え
ばスタックが分割されている場所、スタックが移行する
場所、性能向上のため動的関数置換を使用すべきか否か
など実行中のＳＴＲＥＡＭＳスタックに特有の方針を入
手すること。Although the actual content of these messages depends on the middleware, this does not prevent the control thread from being middleware independent. To maintain independence, the control thread can hide all of the above information by creating a function vector of generic messages and actions that translates to and from the middleware-specific messages. This technique has been used in STREAMS and in many operating systems, and is very common in object-oriented applications. At a minimum, the following operations must be supported for distributed streams:
* Obtain the middleware location and the middleware failure recovery policy.
* Obtain the cluster node configuration and associated hardware. The node configuration includes items such as where STREAMS drivers and modules are located, which control threads are running on the node, and which STREAMS services are supported.
* Obtain the cluster node routing table.
* Obtain the list of active cluster nodes. Because this list is dynamic, either the control thread verifies this information periodically or the middleware updates all control threads when it changes.
* Obtain the policies that determine how to perform node failure recovery and other activities related to high availability. For example, obtain policies specific to the running STREAMS stack, such as where the stack is split, where the stack migrates to, and whether dynamic function replacement should be used to improve performance.
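The function vector mentioned above can be sketched as a table of function pointers. Everything named here (`mw_ops`, `ctl_request`, the text commands of the imaginary middleware "A") is a hypothetical illustration; the specification does not define a concrete interface.

```c
#include <assert.h>
#include <string.h>

/* Generic requests the control thread issues (illustrative subset). */
enum ctl_request { CTL_GET_ACTIVE_NODES, CTL_GET_ROUTING_TABLE };

/* Function vector: one instance per middleware product. */
struct mw_ops {
    /* Encode a generic request into the middleware's own message format. */
    int (*encode)(enum ctl_request req, char *buf, size_t len);
};

/* Encoder for a hypothetical text-based middleware "A". */
static int mw_a_encode(enum ctl_request req, char *buf, size_t len)
{
    const char *verb = (req == CTL_GET_ACTIVE_NODES) ? "LIST NODES"
                                                     : "LIST ROUTES";
    size_t n = strlen(verb) + 1;    /* include the terminating NUL */

    if (n > len)
        return -1;                  /* caller's buffer is too small */
    memcpy(buf, verb, n);
    return 0;
}

const struct mw_ops mw_a_ops = { mw_a_encode };

/* The control thread only ever calls through the vector, so it never
 * sees middleware-specific message formats. */
int ctl_send_request(const struct mw_ops *ops, enum ctl_request req,
                     char *buf, size_t len)
{
    assert(ops != NULL && ops->encode != NULL);
    return ops->encode(req, buf, len);
}
```

Supporting a different middleware would then mean supplying another `mw_ops` instance, with no change to the control thread's own logic.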

【０１５１】上記がミドルウェア・スレッドを使用せず
に達成されるようにするためには、制御スレッドは以下
の構成データを必要とする。＊どのノードがストリーム・ドライバおよびモジュール
を含むか。＊どのノードが現在活動的で完全に動作可能であり、初
期設定の要件を満たすために必要な資源および能力を持
っているか。＊どのようにデバイス・ファイルにアクセスして、それ
をdev_t構造に対応づけるか。＊ドライバのアクセスを要求するアプリケーションに関
して保護／セキュリティ問題をどのように取り扱うか。＊目標ノード・セットへの現在時システム・ロードは何
で、特定目標を選択する基準は何であるか。ラウンドロ
ビン方式を使用することができるが、ロード・データな
しでも個々のノードは負荷過剰となり得る。＊システム内のすべてのＳ−ＩＣＳのアドレスは何で、
それらの能力は何であるか。＊この要求を満たすため制御スレッドは何と対話すべき
か。To achieve the above without a middleware thread, the control thread needs the following configuration data:
* Which nodes contain the stream drivers and modules.
* Which nodes are currently active and fully operational, and have the resources and capabilities needed to meet the initialization requirements.
* How to access a device file and map it to a dev_t structure.
* How to handle protection and security issues for applications that request access to a driver.
* What the current system load on the set of target nodes is, and what the criteria are for selecting a particular target. A round-robin scheme can be used, but without load data individual nodes can still become overloaded.
* What the addresses of all S-ICS instances in the system are, and what their capabilities are.
* What the control thread should interact with to satisfy the request.
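The target-selection item in the list above can be sketched as follows: prefer the least-loaded active node when load data exists, and fall back to round robin when it does not. The `node_info` structure, the convention that a negative load means "unknown", and the function name are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative per-node record; load < 0 means "no load data". */
struct node_info {
    int active;   /* nonzero if the node is up and fully operational */
    int load;     /* current system load, or -1 if unknown */
};

/* Pick the least-loaded active node. If no active node has load data,
 * fall back to round robin, starting after *rr_cursor.
 * Returns the node index, or -1 if no node is active. */
int select_target(const struct node_info *nodes, size_t n, size_t *rr_cursor)
{
    int best = -1;
    size_t i;

    assert(nodes != NULL && rr_cursor != NULL);
    for (i = 0; i < n; i++) {
        if (!nodes[i].active || nodes[i].load < 0)
            continue;
        if (best < 0 || nodes[i].load < nodes[best].load)
            best = (int)i;
    }
    if (best >= 0)
        return best;

    /* Round robin over the active nodes. */
    for (i = 1; i <= n; i++) {
        size_t idx = (*rr_cursor + i) % n;
        if (nodes[idx].active) {
            *rr_cursor = idx;
            return (int)idx;
        }
    }
    return -1;
}
```

The round-robin branch spreads requests evenly but, as the text notes, cannot prevent a node from being overloaded by work it already carries.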

【０１５２】負荷平均化：負荷平均化には、スレッドに
従ってストリーム型スタックが全体としてまたは部分的
に異なるノードに移行することができなければならな
い。制御スレッドは、ミドルウェアを利用して、移行方
針、スタック分割場所、目標ノード等を決定する。この
情報の大多数は、スタックが作成される時点に決定され
るが、移行すべきタイミングおよび移行目標は通常唯一
の動的エレメントである。図９は、制御スレッド３４と
ミドルウェア５０の間の潜在的メッセージ交換を示す。 Load Averaging: Load averaging requires that a STREAMS-based stack be able to migrate, in whole or in part, to a different node as directed by the thread. The control thread uses the middleware to determine the migration policy, the stack split point, the target node, and so on. Most of this information is determined when the stack is created; when to migrate and the migration target are usually the only dynamic elements. FIG. 9 illustrates a potential message exchange between the control thread 34 and the middleware 50.

【０１５３】高い可用性：高可用性に関しては、(ミド
ルウェアではなく)制御スレッドが回復始動プログラム
である。これが必要な理由は、回復がミドルウェア障害
の結果起きるものである可能性があり、制御スレッドが
新しいミドルウェアと連絡を再確立する必要があるから
である。ミドルウェア障害が発生すれば、障害検出方法
に従って、以下のステップがとられなければならない。 High Availability: For high availability, the control thread (not the middleware) is the recovery initiator. This is necessary because the recovery may itself be the result of a middleware failure, in which case the control thread must re-establish contact with a new middleware instance. If a middleware failure occurs, the following steps must be taken, depending on how the failure is detected.

【０１５４】1. 制御スレッドが要求または仕掛かり中
要求のタイムアウトを始動していたとすれば、制御スレ
ッドはそのミドルウェア回復方針を検査して、次にとる
べきアクションを決定する。これはミドルウェアの実施
形態に応じて変ることもある。以下は方針の例である。
(A)他のノード上の制御スレッドへメッセージを同報通
信し、障害を検出しそれがそのノード特有のものである
か判断する。そうであれば、制御スレッドは、その他の
制御スレッドの１つを経由してドルウェアに通知し回復
を始動する必要がある。(B)ノード・リストを調べてミ
ドルウェアがそこで動作中であるか否かを判断する。そ
の場合は、通信を再確立し、次のノードへ進む。もしい
ずれのノードもミドルウェアを実行していなければ、代
替的ミドルウェア・ノードのリストを調べ、そのノード
にミドルウェア回復を始動し実行するように通知する。1. If the control thread issued a request, or a pending request timed out, the control thread consults its middleware recovery policy to determine the next action to take. This may vary with the middleware implementation. The following are example policies. (A) Broadcast a message to the control threads on other nodes to detect the failure and determine whether it is specific to this node. If it is, the control thread must notify the middleware via one of the other control threads and initiate recovery. (B) Walk the node list to determine whether middleware is running on each node. If it is, re-establish communication; otherwise, proceed to the next node. If no node is running middleware, consult the list of alternate middleware nodes and notify one of them to initiate and perform middleware recovery.

【０１５５】2. 制御スレッドは、前のステップの結果
として生成された新しいミドルウェア・インスタンスに
よってミドルウェア障害を知ることもある。その場合に
は、新しいインスタンスは、変更についてすべての既知
の制御スレッドに通知し、回復プロセスが始動されるこ
とを防ぐ。2. The control thread may also learn of a middleware failure from a new middleware instance created as a result of the previous step. In that case, the new instance notifies all known control threads of the change, which prevents them from initiating the recovery process themselves.

【０１５６】3. ミドルウェアをもはや発見することが
できないことを制御スレッドはＳ−ＩＣＳを経由して通
知されることもある。その場合は制御スレッドがその回
復メカニズムを始動する。回復が完了すれば、制御スレ
ッドは新しいインスタンスをＳ−ＩＣＳに通知し、Ｓ−
ＩＣＳはその処理を続ける。注：回復が完了するまでＳ
−ＩＣＳがすべてのミドルウェア要求を制御スレッドに
転送するようにＳ−ＩＣＳおよび制御スレッドを設計す
ることは簡単であろう。いくつかの活動はストリーム・
スタックの範囲内よりむしろストリームヘッド上に設計
する方が本来的に簡単であるので、この方法はＳ−ＩＣ
Ｓを単純化する。3. The control thread may be notified via the S-ICS that the middleware can no longer be found. In that case, the control thread initiates its recovery mechanism. When recovery is complete, the control thread notifies the S-ICS of the new instance, and the S-ICS continues processing. Note: It would be straightforward to design the S-ICS and the control thread so that the S-ICS forwards all middleware requests to the control thread until recovery is complete. This approach simplifies the S-ICS, because some activities are inherently easier to design on top of the stream head than within the stream stack.
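Policy (B) from step 1 above can be sketched as a walk over the node list. The `mw_node` record, the result codes, and the rule of picking the first suitable node are assumptions for illustration; a real policy would depend on the middleware implementation, as the text states.

```c
#include <assert.h>
#include <stddef.h>

enum mw_recovery {
    MW_RECONNECTED,        /* found a node already running middleware */
    MW_RESTART_REQUESTED,  /* asked an alternate node to start recovery */
    MW_UNRECOVERABLE       /* nothing suitable found */
};

/* Illustrative entry of the control thread's node list. */
struct mw_node {
    int runs_middleware;   /* node currently runs a middleware instance */
    int is_alternate;      /* node may host a restarted instance */
};

/* Walk the node list: reconnect to the first node running middleware;
 * otherwise pick the first alternate node to perform recovery.
 * On success, *chosen receives the index of the selected node. */
enum mw_recovery mw_recover(const struct mw_node *nodes, size_t n,
                            size_t *chosen)
{
    size_t i;

    assert(nodes != NULL && chosen != NULL);
    for (i = 0; i < n; i++) {
        if (nodes[i].runs_middleware) {
            *chosen = i;
            return MW_RECONNECTED;
        }
    }
    for (i = 0; i < n; i++) {
        if (nodes[i].is_alternate) {
            *chosen = i;
            return MW_RESTART_REQUESTED;
        }
    }
    return MW_UNRECOVERABLE;
}
```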

【０１５７】ストリーム・スタック・コンポーネントが
故障する場合、回復メカニズムおよびミドルウェアとの
対話は、障害を通知するものおよび障害が発生した場所
に応じて行われる。このためには以下のステップがとら
れなければならない。 1. 制御スレッド３４は、ノードが故障したことをミド
ルウェア５０から通知されるかもしれない。これは、ミ
ドルウェアが障害回復に対して能動的アプローチをとる
ケースである。図４に示されたＴＣＰ／ＩＰ／ＤＬＰＩ
構成の場合、ストリーム・スタックの分割位置に応じて
ＴＣＰであろうとＴＣＰ／ＩＰであろうと、ストリーム
の上方部分を実行しているノードが障害を起こせば、回
復方針は、単にＤＬＰＩインスタンスをクローズし、す
べてのＳ−ＩＣＳおよび制御スレッド・データ構造を更
新することであるかもしれない。別のストリーム・スタ
ックについて、回復方針は、あたかもストリーム型パイ
プの一方のエンドポイントが新しいアプリケーションに
再接続されたかのようこのストリーム・コンポーネント
を別のノードへ再接続することであるかもしれない。こ
の場合ミドルウェアは送出されたメッセージ内に新しい
接続点を含む可能性がある。これは、ストリーム型パイ
プの場合アプリケーション移行を実施する１つの方法で
あろう。If a stream stack component fails, the recovery mechanism and the interaction with the middleware depend on who reports the failure and where the failure occurred. The following steps must be taken. 1. The control thread 34 may be notified by the middleware 50 that a node has failed. This is the case when the middleware takes an active approach to failure recovery. For the TCP/IP/DLPI configuration shown in FIG. 4, if the node running the upper part of the stream fails (whether that part is TCP or TCP/IP, depending on where the stream stack is split), the recovery policy may simply be to close the DLPI instance and update all S-ICS and control thread data structures. For another stream stack, the recovery policy may be to reconnect the stream component to a different node, as if one endpoint of a STREAMS-based pipe had been reconnected to a new application. In this case, the middleware may include the new connection point in the message it sends. This would be one way to implement application migration for STREAMS-based pipes.

【０１５８】2. 制御スレッドは、Ｓ−ＩＣＳがもはや
遠隔ノードに要求を回付することができないことを知ら
される場合がある。制御スレッドは、ストリーム・スタ
ックが作成された時にダウンロードされている回復方針
を発動する。この方針がミドルウェアに知らされ新しい
遠隔ノードが場合によっては決定される。当然Ｓ−ＩＣ
Ｓもこの動作を実行することができるが、それは回復の
実施されるべき所望の場所に依存する。Ｓ−ＩＣＳ内で
回復を実行する方が速いかもこともある。なぜなら、制
御スレッドへの文脈スイッチは存在しないが、データが
分割され各コンポーネント内で反映される形態に基づい
てＳ−ＩＣＳと制御スレッドの間でタイミング問題が発
生する可能性があるからである 3. 問題があることをストリーム・スタック・コンポー
ネント自体によって制御スレッドが通知される場合があ
る。例えばコンポーネントがM_HANGUPまたはM_ERRORメ
ッセージを送り出す場合、問題が解決されるまでスタッ
クは基本的に役に立たない。M_ERRORメッセージが出さ
れる場合、それはドライバがハードウェアと通信するこ
とができないことを示す。制御スレッドは、ハードウェ
アがローカル・ノード上で複製されたことを知っていれ
ば、ストリーム・コンポーネントの新しいインスタンス
を単に作成し、関連データを更新する。制御スレッド
は、また、そのハードウェア知識方針を検査し、分散ス
トリームをオープンするアルゴリズムを実行することに
よって、新しいインスタンスを始動するかもしれない。
あるいは、制御スレッドは、代替ハードウェアのための
ミドルウェアを探りあて、同じアルゴリズムを始動する
こともある。いずれの場合でもデータが分割され反映さ
れる形態に基づいて、制御スレッドは、新しいスタック
・コンポーネントの位置についてミドルウェア、他の制
御スレッドおよびＳ−ＩＣＳに通知する必要がある。2. The control thread may be informed that the S-ICS can no longer route requests to the remote node. The control thread invokes the recovery policy that was downloaded when the stream stack was created. This policy is communicated to the middleware, and a new remote node may be determined. The S-ICS could, of course, perform this operation itself; the choice depends on where recovery is meant to take place. Performing recovery within the S-ICS may be faster because there is no context switch to the control thread, but timing problems can arise between the S-ICS and the control thread, depending on how the data is split and mirrored in each component.
3. The control thread may be notified of a problem by the stream stack component itself. For example, if a component sends an M_HANGUP or M_ERROR message, the stack is essentially useless until the problem is resolved. An M_ERROR message indicates that the driver cannot communicate with the hardware. If the control thread knows that the hardware is replicated on the local node, it simply creates a new instance of the stream component and updates the associated data. The control thread may also start a new instance by checking its hardware knowledge policy and running the algorithm for opening a distributed stream. Alternatively, the control thread may query the middleware for alternative hardware and start the same algorithm. In every case, depending on how the data is split and mirrored, the control thread must notify the middleware, the other control threads, and the S-ICS of the location of the new stack component.

【０１５９】単一システム視点：クラスタが、クラスタ
全体資源について単一システム視点を維持すべきもので
ある場合、一般的資源基準リストを特定資源識別リスト
に変換するミドルウェア・メカニズムが存在しなければ
ならない。上述のハードウェア共用交信において、基準
メッセージが制御スレッドによって生成されミドルウェ
アに送り出された。ミドルウェアはこのメッセージを解
読して、この要件を満たす特定の資源インスタンスで応
答する。制御スレッドはそれによって動作を続行するこ
とができる。これらのメッセージの生成方法の詳細は、
分散ストリームのオープンと共に、後述する。 Single System View: If the cluster is to maintain a single system view of cluster-wide resources, there must be a middleware mechanism that translates a list of generic resource criteria into a list of specific resource identifiers. In the hardware sharing interaction described above, a criteria message was generated by the control thread and sent to the middleware. The middleware interprets this message and responds with a specific resource instance that satisfies the criteria, after which the control thread can continue its operation. The details of how these messages are generated are described later, together with the opening of a distributed stream.

【０１６０】3.4.2 Ｓ−ＩＣＳ作成および交信制御スレッドは、カーネル内ＳＴＲＥＡＭＳインターフ
ェースを経由してＳ−ＩＣＳインスタンスを作成、管理
および破棄することに対して責任を持つ。以下の表１０
は、制御スレッドが、複製可能なストリーム型ドライバ
であるＳ−ＩＣＳインスタンスを作成するサンプル・コ
ードである。3.4.2 S-ICS Creation and Communication The control thread is responsible for creating, managing, and destroying S-ICS instances via the in-kernel STREAMS interface. Table 10 below is sample code in which the control thread creates an instance of the S-ICS, which is a clonable STREAMS driver.

【０１６１】[0161]

【表１０】 Table 10
/* Determine the device identifier for the S-ICS driver */
if ((open_dev = str_dev_lookup("/dev/s_ics", &err)) && err) {
    /* If the device file does not exist, log an error and stop. */
    /* If the operating system supports dynamically loadable     */
    /* kernel modules, explicitly initiate the subsystem load    */
    /* of this driver.                                           */
}

/* Open the driver using the device identifier and obtain the stream head */
if ((sth = streams_clone_open(open_dev, O_RDWR, &error)) == NULL) {
    /* Log the error and attempt error recovery, such as */
    /* discovering the S-ICS driver major/minor number.  */
}

/* Hash the stream head address to a bucket entry and store the address */
struct sth_s *bucket;
bucket = CTL_S_ICS_HASH_FUNC(sth);
bucket->sth_next = sth;

【０１６２】Ｓ−ＩＣＳが非複製ドライバとして実施さ
れる場合、streams_clone_open()呼び出しは、streams_
open()呼び出しと置き換えられる。コードの残り部分は
変更されない。各制御スレッドは、１つまたは複数のＳ
−ＩＣＳインスタンスを作成するかもしれない。作成さ
れるインスタンスの数は以下に基づく。If the S-ICS is implemented as a non-clonable driver, the streams_clone_open() call is replaced with a streams_open() call. The rest of the code is unchanged. Each control thread may create one or more S-ICS instances. The number of instances created depends on the following considerations.

【０１６３】制御スレッドが、１つのノードに関するモ
ジュールおよびドライバstrlog()活動処理のような特定
のタスクを割り当てられたとすれば、ただ１つのＳ−Ｉ
ＣＳタスクを作成するだけでよい。If the control thread has been assigned a specific task, such as processing the strlog() activity of the modules and drivers on one node, it needs to create only one S-ICS.

【０１６４】分散ストリーム・コンポーネント間のメッ
セージの大多数が主にM_DATAおよびM_PROＭであり、制
御スレッドが関与していなければ、制御スレッドは、帯
域スループット／効率を向上させ、分散ＳＴＲＥＡＭＳ
ソルーションのＭＰ尺度を改善するため、プロセッサあ
たりあるいはクラスタ相互接続カードあたり少くとも１
つのＳ−ＩＣＳを作成することを望むであろう。通信す
る目標ノードのような単純なものに基づくか、あるは管
理しているＤＬＰＩインスタンスのようなハードウェア
・カードに基づいて、管理されている分散ストリーム・
コンポーネントにＳ−ＩＣＳインスタンスを割り当てる
ことができる。これによって、システム・キャッシュの
使用効率および性能が向上する。If the majority of the messages between distributed stream components are M_DATA and M_PROTO messages in which the control thread is not involved, the control thread will want to create at least one S-ICS per processor or per cluster interconnect card, in order to increase bandwidth throughput and efficiency and to improve the MP scaling of the distributed STREAMS solution. S-ICS instances can be assigned to the distributed stream components being managed on the basis of something as simple as the target node being communicated with, or on the basis of a hardware card, such as the DLPI instance being managed. This improves system cache utilization and performance.

【０１６５】Ｓ−ＩＣＳの実施がプロトコールから独立
しているように実施することができない場合、例えばほ
とんどすべてのストリーム型伝送プロバイダがアプリケ
ーションの観点からＴＰＩを使用しハードウェア・イン
ターフェースの観点からＤＬＰＩを使用しながら、なお
一般的方法で解決することができない一部の状況が存在
すれば、特別なプロトコール・スタックに関してＳ−Ｉ
ＣＳ実施を行うことが必要になる。これは、Ｓ−ＩＣＳ
設計の多くが再利用されることができないことを意味す
るだけでなく、経路指定キャッシュのようなＳ−ＩＣＳ
の部分はプロトコール毎に基づいて実施されるのが最良
であることを意味する。この効果はＳ−ＩＣＳの一部が
簡略化されその他は簡略化されない程度かもしれない。
例えば、すべてのプロトコール・スタックが高い可用性
のソルーションを必要としているとは限らないし、ある
いは、簡素化された設計につながる異なる経路指定方式
を持つかもしれない。注：ＳＴＲＥＡＭＳ動的関数置換
(Dynamic Function Replacement)を使用してＳ−ＩＣＳ
の機能の拡張を行う一般的形態でこれらの問題を解決す
ることができる。ＳＴＲＥＡＭＳ動的関数登録(Dynamic
Function Registration)[3]を使用して、メッセージを
解読し、その処理においてＳ−ＩＣＳを援助する情報を
組み込みあるいは除去するためにＳ−ＩＣＳを経由して
どのような事前／事後処理を行うことができるかを決定
する支援を行う。制御スレッドは、付加的動的割り当て
要求に備えて多数のＳ−ＩＣＳを迅速参照キャッシュに
前もって記憶する。事前記憶によって制御スレッドが遠
隔要求およびスタック移行に迅速に応答することを可能
にする。If the S-ICS cannot be implemented in a protocol-independent manner (for example, almost all STREAMS-based transport providers use TPI on the application side and DLPI on the hardware interface side, yet some situations may still exist that cannot be solved generically), it becomes necessary to implement the S-ICS for a specific protocol stack. This means not only that much of the S-ICS design cannot be reused, but also that parts of the S-ICS, such as the routing cache, are best implemented on a per-protocol basis. The effect may be that some parts of the S-ICS are simplified while others are not. For example, not every protocol stack requires a high availability solution, and some may have a different routing scheme that leads to a simpler design. Note: These problems can be solved in a generic way by using STREAMS Dynamic Function Replacement to extend the functionality of the S-ICS. STREAMS Dynamic Function Registration [3] can be used to help determine what pre- and post-processing can be performed via the S-ICS in order to interpret a message and to add or remove information that assists the S-ICS in its processing. The control thread pre-allocates a number of S-ICS instances in a quick-reference cache in anticipation of additional dynamic allocation requests. Pre-allocation allows the control thread to respond quickly to remote requests and stack migrations.

【０１６６】作成されるＳ−ＩＣＳの各々に関して以下
のステップがとられなければならない。 1. 制御スレッドは、ノード・テーブル、ノード経路指
定テーブル、ミドルウェア通信エンドポイントなどのク
ラスタ下部構造データを伝える。Ｓ−ＩＣＳはこのデー
タを利用してデータ・キャッシュを作成する。加えて、
制御スレッドは、Ｓ−ＩＣＳが上述した特定の動作特性
を持っていれば、付加的Ｓ−ＩＣＳ固有情報を伝達する
かもしれない。 2. 制御スレッドは、Ｓ−ＩＣＳ読取り書き込みを待ち
行列アドレスを決定する。次に、制御スレッドは、この
Ｓ−ＩＣＳがデータを渡す先の分散ストリーム・コンポ
ーネントのストリームヘッド・プレビュー関数の各々に
通知する。これによって、ストリームヘッド関数が、M_
DATAおよびM_PROTOメッセージのようなデータをストリ
ームヘッドを経由してバイパスすることが可能となる。
前出のプレビュー関数では、これらのメッセージは直接
Ｓ−ＩＣＳ書込み待ち行列に送られる。The following steps must be taken for each S-ICS that is created. 1. The control thread passes down cluster infrastructure data such as the node table, the node routing table, and the middleware communication endpoints. The S-ICS uses this data to build its data cache. In addition, the control thread may pass down further S-ICS-specific information if the S-ICS has the special operating characteristics described above. 2. The control thread determines the S-ICS read and write queue addresses. It then informs each of the stream head preview functions of the distributed stream components to which this S-ICS passes data. This allows the stream head preview function to route data such as M_DATA and M_PROTO messages so that they bypass the stream head: the preview function sends these messages directly to the S-ICS write queue.

【０１６７】3. 同様に、制御スレッドは、分散ストリ
ームのＳ−ＩＣＳにストリームヘッド・アドレスおよび
そのsth->wq->q_nextおとびsth->rq->q_nexｔアドレス
を伝える。Ｓ−ＩＣＳは、このデータを使用してファン
アウト・テーブルを構築する。ファンアウト・キーは、
メッセージに埋め込まれるノード識別子およびストリー
ムヘッド・アドレスであり、ユニークでクラスタ全体の
識別タプルを提供する。ステップ２およびステップ３を
使用して、必要でなければ制御スレッドを含める必要性
を排除することによって性能を向上させる。3. Similarly, the control thread passes the stream head address and its sth->wq->q_next and sth->rq->q_next addresses to the S-ICS of the distributed stream. The S-ICS uses this data to build a fan-out table. The fan-out key consists of the node identifier and the stream head address embedded in each message, which together form a unique, cluster-wide identification tuple. Steps 2 and 3 improve performance by removing the control thread from the data path when it is not needed.

【０１６８】4.上述の性能向上を実現できるようにする
ため、制御スレッドは、このバイパスに関係するすべ
てのコンポーネントが未完了活動を除去しなければなら
ないことを認識させ、スタックを使用禁止にし未完了の
活動をキャンセルすることを確認するため、クローズ処
理の間に使用されるフラグをストリームヘッド内に設定
しなければならない。このコードがＳＴＲＥＡＭＳ実施
形態において提供されないとすれば、システムはシステ
ム・パニックに結びつく可能性のある競争状態となる。4. To make the performance improvement described above possible, the control thread must set a flag in the stream head that is used during close processing. The flag makes every component involved in the bypass aware that outstanding activity must be drained, and ensures that the stack is disabled and that outstanding activity is canceled. If the STREAMS implementation does not provide this code, the system is exposed to a race condition that can lead to a system panic.

【０１６９】5. 制御スレッドがミドルウェアが故障し
たことを検出すれば、回復を始動し、新しいインスタン
スのＳ−ＩＣＳに通知する。同様にＳーＩＣＳがミドル
ウェア障害を検出すれば、制御スレッドに通知し、制御
スレッドは、回復を始動し、新しいミドルウェア・イン
スタンスのＳ−ＩＣＳを更新する。Ｓ−ＩＣＳが作成さ
れ初期化される場合、制御スレッドとＳ−ＩＣＳの間の
交信は、クラスタ機構が提供する能力および分散ＳＴＲ
ＥＡＭＳの各コンポーネントが実施されたレベルに依存
する。以下は、制御スレッドとＳ−ＩＣＳの間の潜在的
交信リストおよびそれらがどのようにまたどのような環
境下で実施されるかに関する概略の説明である。5. If the control thread detects that the middleware has failed, it initiates recovery and informs the S-ICS of the new instance. Similarly, if the S-ICS detects a middleware failure, it notifies the control thread, which initiates recovery and updates the S-ICS with the new middleware instance. Once the S-ICS has been created and initialized, the interactions between the control thread and the S-ICS depend on the capabilities the cluster mechanism provides and on the level at which each component of distributed STREAMS is implemented. The following is a list of potential interactions between the control thread and the S-ICS, with an outline of how and under what circumstances they occur.
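The fan-out table from step 3 above, keyed by the tuple of node identifier and stream head address, can be sketched as a small open-addressing hash table. The table size, the hash function, and all type and function names are illustrative assumptions; an in-kernel S-ICS would store queue pointers where this sketch uses `void *`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FANOUT_SLOTS 64   /* illustrative size; must be a power of two */

struct fanout_entry {
    int       in_use;
    uint32_t  node_id;    /* cluster node identifier */
    uintptr_t sth_addr;   /* stream head address on that node */
    void     *write_q;    /* stands in for sth->wq->q_next */
};

struct fanout_table {
    struct fanout_entry slot[FANOUT_SLOTS];
};

/* Mix both halves of the key into a slot index. */
static size_t fanout_hash(uint32_t node_id, uintptr_t sth_addr)
{
    uintptr_t h = sth_addr ^ ((uintptr_t)node_id * 2654435761u);
    return (size_t)(h ^ (h >> 7)) & (FANOUT_SLOTS - 1);
}

/* Insert or update an entry using linear probing.
 * Returns 0 on success, -1 if the table is full. */
int fanout_put(struct fanout_table *t, uint32_t node_id,
               uintptr_t sth_addr, void *write_q)
{
    size_t i, h = fanout_hash(node_id, sth_addr);

    assert(t != NULL);
    for (i = 0; i < FANOUT_SLOTS; i++) {
        struct fanout_entry *e = &t->slot[(h + i) & (FANOUT_SLOTS - 1)];
        if (!e->in_use ||
            (e->node_id == node_id && e->sth_addr == sth_addr)) {
            e->in_use = 1;
            e->node_id = node_id;
            e->sth_addr = sth_addr;
            e->write_q = write_q;
            return 0;
        }
    }
    return -1;
}

/* Look up the target queue for a message's (node, stream head) tuple.
 * Returns NULL if the tuple is unknown. */
void *fanout_get(const struct fanout_table *t, uint32_t node_id,
                 uintptr_t sth_addr)
{
    size_t i, h = fanout_hash(node_id, sth_addr);

    assert(t != NULL);
    for (i = 0; i < FANOUT_SLOTS; i++) {
        const struct fanout_entry *e =
            &t->slot[(h + i) & (FANOUT_SLOTS - 1)];
        if (!e->in_use)
            return NULL;   /* probe chain ended without a match */
        if (e->node_id == node_id && e->sth_addr == sth_addr)
            return e->write_q;
    }
    return NULL;
}
```

Because the same stream head address can occur on different nodes, both parts of the tuple take part in the comparison; the sketch omits deletion, which linear probing would need to handle with tombstones.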

【０１７０】ストリーム・スタックがオープンされその
コンポーネントが少なくとも２つのノード間で分散され
ると、制御スレッドは遠隔目標ノードの制御スレッドと
連絡をとって、遠隔スタック・コンポーネントを作成し
Ｓ−ＩＣＳデータ・キャッシュを初期化する。これを達
成するため、２つのスレッド・インスタンスの間および
スレッドとＳ−ＩＣＳインスタンスの間で１つの通信プ
ロトコールが適用される必要がある。加えて、制御スレ
ッドは、ミドルウェア知識および更新を調整する必要が
ある。この点は若干複雑であるので別の節で触れること
にする。When a stream stack is opened and its components are distributed across at least two nodes, the control thread contacts the control thread on the remote target node to create the remote stack components and initialize the S-ICS data cache. To achieve this, a communication protocol must be in place between the two thread instances and between each thread and its S-ICS instance. In addition, the control threads must coordinate middleware knowledge and updates. This point is somewhat complicated and is covered in a separate section.

【０１７１】本設計で示される性能チューニングの考え
の一部は、ＳＴＲＥＡＭＳモジュールおよびドライバが
如何に実施されるべきか、メッセージがスタック内で如
何に送られるべきかという理想を技術的に破っている。
多くのモジュールおよびドライバの設計が通常貧弱であ
り従って最小の努力で目標を達成するためそのような違
反を行かなければならないので、これは異常ではない。
例えば、ネットワーキング・プロトコールの例でいえ
ば、非ＳＴＲＥＡＭＳスタックで元々実施されたプロト
コールをＳＴＲＥＡＭＳスタックに適用させる場合であ
る。この設計の範囲内で、あるスタックとＳ−ＩＣＳス
タックとの間でのメッセージの経路指定を可能にするこ
とによってメッセージを通常経路指定する方法すなわち
ストリームヘッド経由でのみ行うという方法に違反して
いる。このような違反を許容しない主な理由はクローズ
動作競争状態にある。Some of the performance tuning ideas presented in this design technically violate the ideals of how STREAMS modules and drivers should be implemented and how messages should be passed within a stack. This is not unusual: many modules and drivers are poorly designed and must commit such violations to reach their goals with minimal effort. Networking protocols are one example, where a protocol originally implemented in a non-STREAMS stack is adapted to a STREAMS stack. This design violates the rule that messages are normally routed only via the stream head, because it allows messages to be routed directly between a stack and the S-ICS stack. The main reason such violations are not normally allowed is the race condition that can occur during close.

【０１７２】スタックがクローズされつつある時、個々
のコンポーネントは、ストリームヘッドの下から最も低
位のドライバへとトップダウン形態でスタックから取り
出される。各モジュール／ドライバが取り出される前
に、それらの関連クローズ・ルーチンが呼び出されるの
で、各コンポーネントが(q_ptrによってポイントされて
いる)プライベート・データのクリーンアップを実行
し、現在時メッセージを消去することが可能となる。ま
た、モジュール／ドライバ構造が解放される前にいかな
る待ち行列またはメッセージに対しても未完了の参照が
存在しないことを保証するため、すべての仕掛かり中の
putおよびserviceルーチンをＳＴＲＥＡＭＳフレームワ
ークはキャンセルする。もしもそのようなルーチンがキ
ャンセルされなかったとすれば、それらは、システム破
壊を生むことになる無効なポインタ・アドレスを参照す
る可能性がある。When a stack is being closed, its components are popped off the stack one at a time in top-down order, from just below the stream head down to the lowest driver. Before each module or driver is popped, its close routine is called so that the component can clean up its private data (pointed to by q_ptr) and discard any current messages. In addition, the STREAMS framework cancels all pending put and service routines to guarantee that no outstanding references to any queue or message remain before the module or driver structures are freed. If these routines were not canceled, they could dereference invalid pointers and crash the system.

【０１７３】ある１つのストリーム・スタックのputま
たはservic関数が、オリジナルであろうと拡張されたも
のであろうと、別のストリーム・スタックにメッセージ
を置き、その待ち行列インスタンスのputあるいはservi
ceルーチンを起動することによって、クローズが進行中
であれば、目標待ち行列のアドレスのような参照ポイン
タが無効となりシステムが壊れる可能性がある。これを
防止するため、２つのスタックに関して使用されるクロ
ーズ・プロトコールが開発される必要がある。この場
合、Ｓ−ＩＣＳ、制御スレッドおよび経路指定を実行す
るストリームヘッド・プレビュー関数が調整されなけれ
ばならない。If a put or service function of one stream stack, whether original or extended, places a message on another stream stack and invokes the put or service routine of that queue instance while a close is in progress, a reference pointer such as the address of the target queue may become invalid and the system may crash. To prevent this, a close protocol covering the two stacks must be developed. In this case, the S-ICS, the control thread, and the stream head preview function that performs the routing must be coordinated.

【０１７４】以下に記述されるクローズ・プロトコール
は、図４の(Ｂ)に示されるように分割がＩＰとＤＬＰＩ
の間で行われているＴＣＰ／ＩＰ／ＤＬＰＩ構成の例を
使用している。ＤＬＰＩスタックがクローズされつつあ
るとすれば、メッセージがＳ−ＩＣＳ読取り待ち行列か
ら、そのputおよびserviceルーチンを経由して、ＤＬＰ
Ｉ書込み待ち行列に送り出されている場合にのみクロー
ズ競争状態が発生する。これを防ぐため、クローズ動作
の始動元である制御スレッドは、クローズが保留されて
いること、すべての保留中のメッセージをキャンセル
し、クローズを開始することができる前に目標ＤＬＰＩ
インスタンスへの現在の動作が完了したことを確認すべ
きことをＳ−ＩＣＳに通知しなければならない。Ｓ−Ｉ
ＣＳは、Ｓ−ＩＣＳがすべてのクリーンアップを完了す
るまでioctlは終了しないのでioctl経由で通知される。
Ｓ−ＩＣＳは、目標ＤＬＰＩに関連するそのプライベー
ト・データ構造内の「活動」カウンタおよび「クロー
ズ」フラグを使用することによってこのクリーンアップ
を実行する。このカウンタはputまたはserviceサービス
・ルーチンが開始する毎に増分され、完了した時に減分
される。カウンタがゼロになると、Ｓ‐ＩＣＳは、すべ
ての仕掛かり中の活動が完了したことを確認することが
でき、従ってこのインスタンスを目標とする待ち行列か
らすべてのメッセージを安全に消去することができる。
以下の表１１はこれらの変更を示すＳ−ＩＣＳ読み取り
書き込みルーチンのサンプルである。カウンタはＳＴＲ
ＥＡＭＳフレームワーク同期メカニズムによって制御さ
れるので、追加されるべき余分なスピンロックがない点
に注意する必要がある。クローズ・フラグもいつでもセ
ットできまた決してクリアされないので、スピンクロッ
クを必要としない。The close protocol described below uses the example of a TCP/IP/DLPI configuration in which the split is between IP and DLPI, as shown in FIG. 4(B). If the DLPI stack is being closed, a close race condition arises only when a message is being passed from the S-ICS read queue, via its put and service routines, to the DLPI write queue. To prevent this, the control thread that initiates the close must notify the S-ICS that a close is pending, and that before the close can begin the S-ICS must cancel all pending messages and make sure that any operation currently in progress against the target DLPI instance has completed. The S-ICS is notified via an ioctl, because the ioctl does not return until the S-ICS has completed all of its cleanup. The S-ICS performs this cleanup using an "activity" counter and a "close" flag in the private data structure associated with the target DLPI. The counter is incremented each time a put or service routine starts and decremented when it completes. When the counter reaches zero, the S-ICS knows that all pending activity has completed and can safely discard every message on its queues that targets this instance. Table 11 below is a sample S-ICS read put routine showing these changes. Note that no extra spinlock needs to be added, because the counter is protected by the STREAMS framework synchronization mechanism. The close flag does not need a spinlock either, because it can be set at any time and is never cleared.

【０１７５】この構造は、各Ｓ−ＩＣＳインスタンスq-
>q_ptrに記憶されるＳ−ＩＣＳプライベート・データ構
造の一部分である。This structure is part of the S-ICS private data structure stored in q->q_ptr of each S-ICS instance.

【０１７６】[0176]

【表１１】 Table 11
struct target_queue_t {
    queue_t *target_read_queue;
    queue_t *target_write_queue;
    int32 target_pending_actions;
    int32 target_flags;
    /* other additional information */
};

/* Sample of the modified put routine */
int32 s_ics_read_put(queue_t *q, mblk_t *mp)
{
    struct target_queue_t *target_queue;
    struct s_ics_q_ptr *s_ics_data;    /* local S-ICS data */

    s_ics_data = (struct s_ics_q_ptr *) q->q_ptr;

    /* If M_DATA or M_PROTO, forward immediately to the correct instance */
    if (mp->b_datap->db_type == M_DATA || mp->b_datap->db_type == M_PROTO) {
        if ((target_queue = find_target(mp, s_ics_data)) == NULL) {
            /* initiate error recovery */
        } else {
            /* Drop the message if the target is being closed */
            if (target_queue->target_flags & TARGET_CLOSING) {
                freemsg(mp);
                return 0;
            }
            target_queue->target_pending_actions++;
            putnext(target_queue->target_write_queue, mp);
            target_queue->target_pending_actions--;
        }
    } else {
        /* For all other message types, decide whether the */
        /* message should be sent to the control thread.   */
        switch (mp->b_datap->db_type) {
        case M_PCPROTO: /* perform checks, process the message, break */
        case M_IOCTL:   /* perform checks, process the message, break */
        case M_FLUSH:   /* perform checks, process the message, break */
        /* ... add as many cases as the S-ICS needs to handle ... */
            freemsg(mp);
            return 0;
        }
    }
    return 0;
}

【０１７７】If an S-ICS instance is being closed, every DLPI instance that the S-ICS instance was managing must be notified, to guarantee either that all current and pending actions complete or that references to invalid queue addresses are prevented. Routing to these queue addresses takes place only when the stream head read put routine has been extended to preview and forward data, so the solution is simple: either remove the extension function from the stream head or, when a DLPI instance is migrated from one S-ICS to another, update the target queue information of the extended function. To remove the function, the same technique used by STREAMS Dynamic Function Replacement is applied. Because the stream head has no read service routine, no counter or close flag needs to be kept. Consequently, when the stream head is acquired to perform this activity, it is guaranteed that all active calls have completed and that all pending activity will run through the updated stream head put routine.

【０１７８】At least three areas of interaction must be addressed in order to implement distributed streams: splitting an existing stream, handling flow control within the cluster, and handling distributed and remote stream heads. These are described in detail later.

【０１７９】3.4.3 Local Stack Component
The local stack component is responsible for passing messages between components and for performing any necessary flow control.

【０１８０】3.4.4 In-Kernel STREAMS Interface
The controlling thread uses the in-kernel STREAMS interface to communicate with the S-ICS instances it controls and with the various module/driver stacks. The controlling thread uses the following in-kernel interface calls.
* streams_open() and streams_close(): used to create and destroy S-ICS instances and distributed stream components.
* streams_read() and streams_write(): used to send messages generated by the controlling thread. These routines create M_DATA mblks via allocb() and have the recovery function bufcall() built in, which means the controlling thread does not have to handle allocation failure itself. However, because every M_DATA message would have to be scanned to determine its originator, M_DATA messages are of limited use as a communication mechanism for distributed stream management.
* streams_putmsg() and streams_getmsg(): used to send message mblks that the controlling thread creates. In this case the controlling thread must handle allocb() failures itself, so it either uses a new routine, allocb_wait(), which returns as soon as memory has been allocated, or maintains a pool of mblks that it manages itself. This pool is very small, on the order of two or three messages, because it only has to supply the controlling thread in an emergency. Since most of the messages sent to the controlling thread are mblks, the main allocation strategy in practice is mblk reuse.
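
The small self-managed reserve described above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: `msg_blk`, `msg_reserve`, `msg_get()` and `msg_put()` are hypothetical names, not part of STREAMS or of the patent, and the `allow_heap` parameter merely simulates allocb() succeeding or failing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a STREAMS message block; not the kernel mblk_t. */
typedef struct msg_blk {
    struct msg_blk *next;
    int from_pool;              /* 1 if the block belongs to the reserve */
    unsigned char data[64];
} msg_blk;

#define RESERVE_SIZE 3          /* "two or three messages", per the text */

typedef struct {
    msg_blk slots[RESERVE_SIZE];
    msg_blk *free_list;
} msg_reserve;

/* Thread every reserve slot onto the free list. */
void reserve_init(msg_reserve *r)
{
    r->free_list = NULL;
    for (int i = 0; i < RESERVE_SIZE; i++) {
        r->slots[i].from_pool = 1;
        r->slots[i].next = r->free_list;
        r->free_list = &r->slots[i];
    }
}

/* Try the normal allocator first; fall back to the reserve on failure.
 * allow_heap simulates the normal allocator succeeding (1) or failing (0). */
msg_blk *msg_get(msg_reserve *r, int allow_heap)
{
    msg_blk *m;

    if (allow_heap) {
        m = calloc(1, sizeof *m);
        if (m != NULL)
            return m;           /* from_pool is 0 after calloc */
    }
    m = r->free_list;
    if (m != NULL) {
        r->free_list = m->next;
        m->next = NULL;
    }
    return m;                   /* NULL only when the reserve is exhausted */
}

/* Return a block: reserve blocks are reused, heap blocks are freed. */
void msg_put(msg_reserve *r, msg_blk *m)
{
    assert(m != NULL);
    if (m->from_pool) {
        m->next = r->free_list;
        r->free_list = m;
    } else {
        free(m);
    }
}
```

In this sketch a block returned with `msg_put()` is immediately available to the next `msg_get()`, which mirrors the remark that reuse, rather than fresh allocation, is the main strategy.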

【０１８１】By using mblks, the controlling thread can communicate directly with the S-ICS or a remote S-ICS via M_CTL, M_DELAY, M_RSE, M_PCRSE, M_STARTI, M_STOPI or M_HPDATA messages without causing performance problems. This is possible because most modules and drivers make only limited use of these message types, so the S-ICS and the stream head preview function can detect this specific subset of messages without degrading performance and forward all other messages automatically. In addition, these utilities let the controlling thread recognize messages such as M_ERROR directly and take appropriate action, such as running a recovery mechanism and not delivering the message to the application.
* streams_ioctl(): used for all ioctl processing. The in-kernel interface defines a new set of kernel ioctls and how they are used. Most of the standard STREAMS ioctls defined in streamio(7) are supported.

【０１８２】4.0 S-ICS Design
The S-ICS is a STREAMS driver, preferably a clonable driver. Because the S-ICS does not control a physical hardware device, it is a software driver. Although the controlling thread is illustrated as communicating through the in-kernel interface, the communication can also be carried out through the standard system call interface. The choice between the two is made with regard to performance and, potentially, security. Furthermore, when several controlling threads run at the same time, balancing the S-ICS load and sharing data among them is more difficult, because it requires additional code and shared memory segments.

【０１８３】Apart from the fact that the node table and the P-ICS instance data must be maintained within the S-ICS rather than in the controlling thread, the points above do not affect the S-ICS design. The S-ICS communicates with the P-ICS through a P_ICS_ENDPOINT structure, which contains a function vector describing the P-ICS-specific operations. By restricting communication to these functions, the S-ICS remains essentially independent of the P-ICS implementation. The S-ICS maintains a routing table and several data tables that it uses to send messages between itself and the target queues it manages. The other purpose of these tables is to improve performance by eliminating the need to move the majority of messages through the controlling thread, leaving the controlling thread to handle only exceptional conditions. Because of the way these routing tables are updated, the tables are always kept synchronized, so the S-ICS does not need much error handling. The S-ICS must therefore be involved in the creation and migration of every distributed stream whose components are split between nodes. These points are described in more detail below.
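
The function-vector indirection described above for P_ICS_ENDPOINT can be illustrated with a minimal user-space sketch. Everything here is an assumption made for illustration: the structure `p_ics_endpoint`, its `send` and `ctx` fields, and the loopback transport are hypothetical and do not reproduce the patent's actual P_ICS_ENDPOINT layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical endpoint: one operation plus transport-private context. */
struct p_ics_endpoint {
    int (*send)(void *ctx, const void *buf, size_t len);
    void *ctx;
};

/* One possible transport, used here only as a stand-in: a loopback buffer. */
struct loopback {
    unsigned char buf[64];
    size_t len;
};

/* Copy the payload into the loopback buffer; reject oversized payloads. */
int loopback_send(void *ctx, const void *buf, size_t len)
{
    struct loopback *lb = ctx;

    if (len > sizeof lb->buf)
        return -1;
    memcpy(lb->buf, buf, len);
    lb->len = len;
    return 0;
}

/* S-ICS-side code calls only through the vector, so it never depends
 * on which transport sits behind the endpoint. */
int s_ics_forward(const struct p_ics_endpoint *ep, const void *buf, size_t len)
{
    assert(ep != NULL && ep->send != NULL);
    return ep->send(ep->ctx, buf, len);
}
```

Swapping the transport then means filling in a different `send` pointer and context; `s_ics_forward()` is unchanged, which is the independence property the paragraph describes.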

【０１８４】4.1 Creation and Destruction
This is described in the section on creating and communicating with the controlling thread.

【０１８５】4.1.1 Driver Open Routine Design
Like all drivers, the S-ICS 38 has an open routine. Because a STREAMS driver is allowed to sleep during open, the S-ICS performs the following steps, which may include sleeping.
1. If the S-ICS is implemented as a clonable driver, it must obtain a new minor number, which is returned to the caller as part of the dev_t structure. This minor number is an integer that serves as an index into an array of potential S-ICS device structures. In that case the array element is marked as in use and the open routine initializes it. If the S-ICS is implemented as a non-clonable driver and it is already active, multiple opens at the same time are not permitted, and an error must be returned.

【０１８６】2. If the P-ICS dev_t is obtained, either from a default value or through a known function call that returns a dev_t structure, the S-ICS can attach itself to this hardware driver using the implementation-specific P-ICS protocol (data hiding should be used so that this is a generic function call for every subsystem that accesses the P-ICS). If the P-ICS sleeps during this operation, the attachment should preferably be completed during open. If it does not, a simpler and more flexible approach is to have the controlling thread query the middleware for the best P-ICS instance for this node (this allows the middleware to balance multiple P-ICS instances among the various other subsystems running on each node).
3. At this point the S-ICS must allocate all private data structures, locks and so on, and store the control address in the q_ptr field of its queue. A possible layout of this information is shown in the next section.
4. If there are no errors, return the minor number.
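
Step 1 above, obtaining a minor number as an index into an array of device structures, can be sketched as follows. This is a user-space illustration only: `s_ics_dev`, `alloc_minor()`, `release_minor()` and the limit `MAX_S_ICS_DEVICES` are hypothetical names chosen for the sketch. Unlike the sample open routine in the patent, the sketch reports failure with -1 so that minor number 0 remains usable.

```c
#include <assert.h>
#include <string.h>

#define MAX_S_ICS_DEVICES 8     /* hypothetical limit for this sketch */

/* Hypothetical per-instance device structure. */
struct s_ics_dev {
    int in_use;
    int flags;
};

static struct s_ics_dev s_ics_devices[MAX_S_ICS_DEVICES];

/* Return the first free minor number and mark its slot in use,
 * or -1 if every slot is taken. */
int alloc_minor(void)
{
    for (int minor = 0; minor < MAX_S_ICS_DEVICES; minor++) {
        if (!s_ics_devices[minor].in_use) {
            /* Reset the slot so the open routine starts from a clean state. */
            memset(&s_ics_devices[minor], 0, sizeof s_ics_devices[minor]);
            s_ics_devices[minor].in_use = 1;
            return minor;
        }
    }
    return -1;
}

/* Give a minor number back, as the close routine or an open failure would. */
void release_minor(int minor)
{
    assert(minor >= 0 && minor < MAX_S_ICS_DEVICES);
    s_ics_devices[minor].in_use = 0;
}
```

A real driver would protect the array with a lock; that is omitted here to keep the allocation logic visible.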

【０１８７】Table 12 below is a sample S-ICS open routine that includes the steps above. The actual implementation details are left to the individual developer of this driver.

【０１８８】[0188]

【表１２】[Table 12]

    int32 s_ics_open(q, devp, flag, sflag, credp)
    {
        int32 minor;
        int32 error;
        struct p_ics_attach attach;

        /* Find an available minor number */
        if ((minor = find_minor_num(q, devp, flag, sflag)) == 0)
            return (0);

        /* Determine the P-ICS and perform the attachment */
        if (streams_P_ICS) {
            if ((error = p_ics_attachment(DERIVE_DEV_T(streams_P_ICS),
                                          &attach)) != 0) {
                release_minor_number(minor);
                return (error);
            }
        }

        /* Allocate and initialize the private data structures */
        /* and the S-ICS fan-out tables */
        if ((q->q_ptr = allocate_s_ics_structures(q, minor, flag)) == NULL) {
            if (streams_P_ICS)
                detach_p_ics(attach);
            release_minor_number(minor);
            return (ENOMEM);
        }
        return (minor);
    }

【０１８９】4.1.2 Driver Close Routine Design
The S-ICS driver close routine is nearly the reverse of the open routine. When the S-ICS is closed, it is assumed that every module or driver queue that was being communicated with directly, bypassing the stream head, has been notified of the pending close, that the extended stream heads have been adjusted appropriately, and that all pending requests have been cancelled or rerouted to another S-ICS instance. The close routine therefore simply undoes what the open routine did. The sample code in Table 13 below is an example of performing these steps.

【０１９０】[0190]

【表１３】[Table 13]

    int32 s_ics_close(q)
    {
        int32 minor;
        struct s_ics_data *sp;

        /* Cast the private data to the known S-ICS structure */
        sp = (struct s_ics_data *) q->q_ptr;
        minor = sp->minor;

        /* If the P-ICS is still attached, detach it */
        if (sp->p_ics_attached)
            detach_p_ics(sp->attach);

        /* Free the private data structures */
        free_s_ics_structure(sp);

        /* Release the minor number */
        release_minor_number(minor);
        return (0);
    }

【０１９１】4.1.3 Write Put (write_put) Routine Design
Table 14 below is sample code for the write put (write_put) routine.

【０１９２】[0192]

【表１４】[Table 14]

    int32 s_ics_write_put(q, mp)
    {
        struct target_queue_t   *target_queue;
        struct s_ics_q_ptr      *s_ics;
        struct rem_s_ics_descrp *target_s_ics;
        struct p_ics_data_t     *p_ics_data;

        /* Extract the local S-ICS data and the source queue */
        s_ics = (struct s_ics_q_ptr *) q->q_ptr;

        /* If the message is M_DATA or M_*PROTO, or is not from the */
        /* controlling thread, derive the target S-ICS and send at once */
        if (mp->b_datap->db_type == M_DATA ||
            mp->b_datap->db_type == M_PROTO ||
            mp->b_datap->db_type == M_PCPROTO ||
            !from_controlling_thread(mp)) {

            /* Find the S-ICS instance derived from the mblk */
            if ((target_s_ics = find_s_ics(mp, s_ics)) == NULL) {
                /* Start error recovery and notify the source via M_ERROR */
                mp->b_datap->db_type = M_ERROR;
                set_m_error(mp, ENXIO);
                target_queue = target_queue_index[mp->b_datap->source_queue_index];

                /* Update the pending actions to prevent a close race condition */
                target_queue->target_pending_actions++;
                putnext(target_queue->target_write_queue, mp);
                target_queue->target_pending_actions--;
            } else {
                if (p_ics_s_ics_status & S_ICS_FLOW_CONTROLLED) {
                    p_ics_s_ics_flw_ctl(target_s_ics, mp);
                    return (0);
                }
                mp = s_ics->p_ics_instance->p_ics_memory_alloc(mblk_header_size);
                CONVERT_mblk(mblk, mp);
                db = s_ics->p_ics_instance->p_ics_memory_alloc(dblk_size);
                CONVERT_dblk(dblk, db);
                rem = P_ICS_CREATE_HEADER(remote_node, remote_s_ics);
                (*s_ics->p_ics_instance->p_ics_send)(rem, mp, db);

                /* If no send error is detected, free the message */
                freemsg(mp);
            }
        } else {
            /* For all other message types, take the appropriate action. */
            /* For M_IOCTL, for example, the controlling thread is likely */
            /* to be updating infrastructure data, which must be reflected */
            /* in the local S-ICS data. Other messages depend on the */
            /* protocol adopted between the controlling thread and the S-ICS. */
            /* Note: do not use M_DATA or M_*PROTO messages to communicate */
            /* with the S-ICS, because that may confuse this routine. */
            switch (mp->b_datap->db_type) {
            case M_IOCTL:   /* perform checks and process the message */
                break;
            case M_FLUSH:   /* perform checks and process the message */
                break;
            ...
            /* add as many cases as are needed for processing */
            default:
                freemsg(mp);
                return (0);
            }
        }
        return (0);
    }

【０１９３】4.2 S-ICS Private Data
Each of the S-ICS instances 38, 38A ... 38N (FIG. 6) maintains a private data structure in the q_ptr field of its queue. Table 15 below is a sample of that private data structure and its related structures.

【０１９４】[0194]

【表１５】[Table 15]

    #define MAX_FREE_FUNC 10
    #define MAX_FREE_ARG  10

    struct target_queue_t {
        struct sth_s    *target_sth;
        struct dev_t     dev;
        uint32           cluster_route_id;
        uint32           target_queue_index;
        queue_t         *target_read_queue;
        queue_t         *target_write_queue;
        queue_t         *local_read_queue;
        queue_t         *local_write_queue;
        int32            target_pending_actions;
        uint32           target_flags;      /* current target state/capabilities */
        short            free_func_index[MAX_FREE_FUNC];
        short            free_func_arg_index[MAX_FREE_ARG];
        target_recover_t rem_target_recovery_policy;
        struct target_queue_t *next, *prev; /* next item hashed to the same bucket */
    };

    #define TARGET_HASH_SIZE 1024 /* large, to minimize hash array collisions */

【０１９５】target_sth: the stream head associated with this queue instance. This address is kept in case the S-ICS needs to update the stream head, for example to update sth_cluster_route_id.
dev: the device identifier associated with the target stream head. The device identifier is initially sent in an mblk during a system call or an in-kernel STREAMS interface call. It is then extracted, and the stream head is updated with the cluster route identifier.
cluster_route_id: an index into the S-ICS routing table.
target_queue_index: an index into target_queue_hash that uniquely identifies this target queue.
target_write_queue and target_read_queue: the local queue addresses at which messages are to be placed.

【０１９６】local_read_queue and local_write_queue: the local queue addresses that indicate where a message came from. These addresses are used to hash to the remote target index when cluster_route_id has not been set in the mblk. This happens when the mblk was created by a module or driver for local communication; one example is while a DLPI driver is being configured under IP. Many control messages may be generated, and the S-ICS needs to know which IP sent the data in order to know which DLPI to deliver it to.
target_pending_actions: a counter used to prevent close race conditions.
free_func_index and free_func_arg_index: explained in the description of mblk_header.
rem_target_recovery_policy: a policy structure returned as part of distributed stream initialization, so that the remote node knows what steps to take if this target can no longer be found because the route tables are out of sync.

【０１９７】The route tables may be out of sync during a migration, so the policy can be as simple as contacting the controlling thread to find the new location of the remote component. A request arriving from a remote node contains the target queue index, which is used to find the correct S-ICS instance (in case the target queue has migrated between S-ICS instances) and the associated real target queue address, which is a local node address. Using a table index such as a uint32 value avoids pointer coercion and any concern about whether a node is a 32-bit or a 64-bit system.
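
The advantage of exchanging a fixed-width index rather than a pointer can be shown with a short sketch. The names `put_index()`, `get_index()`, `lookup_target()` and the table size are assumptions made for this illustration, and the big-endian wire layout is simply one reasonable choice, not something the patent specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INDEX_TABLE_SIZE 16     /* hypothetical; the patent's sample uses 65536 */

struct target_queue { int id; };    /* stand-in for the real target queue */

/* Local table: index -> local pointer. Only the index ever leaves the node. */
static struct target_queue *index_table[INDEX_TABLE_SIZE];

/* Encode an index as 4 big-endian bytes, independent of pointer width. */
void put_index(uint8_t out[4], uint32_t idx)
{
    assert(out != NULL);
    out[0] = (uint8_t)(idx >> 24);
    out[1] = (uint8_t)(idx >> 16);
    out[2] = (uint8_t)(idx >> 8);
    out[3] = (uint8_t)idx;
}

/* Decode the 4-byte wire form back into an index. */
uint32_t get_index(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Resolve a received index to a local pointer.
 * Returns NULL if the index is out of range or the slot is empty. */
struct target_queue *lookup_target(uint32_t idx)
{
    if (idx >= INDEX_TABLE_SIZE)
        return NULL;
    return index_table[idx];
}
```

Because the receiver validates the index before dereferencing anything, a stale or corrupted value yields NULL (a recoverable routing error) instead of a wild pointer.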

【０１９８】[0198]

【表１６】[Table 16]

    struct target_queue_index_t {
        struct s_ics_q_ptr    *s_ics_instance; /* back reference, in case a */
                                               /* different S-ICS handles it */
        struct target_queue_t *target_queue;   /* pointer to the real target */
    };

    #define TARGET_MAX_INDEX 65536

【０１９９】The route table is an array of structures that define remote route entries. There is one entry per remote target queue. The index of this entry is stored in the target_queue element and in the originating stream head. Each entry contains at least the fields shown in Table 17 below.

【０２００】[0200]

【表１７】[Table 17]

    #define REMOTE_INDEX_ACTIVE    0x0000
    #define REMOTE_INDEX_INACTIVE  0x0001
    #define REMOTE_INDEX_MIGRATING 0x0002
    #define REMOTE_INDEX_IN_ERROR  0x0003
    #define REMOTE_INDEX_IN_FLUX   0x0004

    struct s_ics_rem_route_t {
        int32                    remote_index_state;
        int32                    remote_target_index;
        struct node_t           *remote_node;
        struct rem_s_ics_descrp *remote_s_ics;
        struct sth_s            *source_sth;
    };

【０２０１】remote_index_state: reflects the current known state of this element. In the normal, active state no additional checks are needed when a message is routed. If the element is in flux or migrating, operation may need to be suspended briefly. The suspension can be achieved by disabling the module/driver and by forcing flow control into a state in which no messages can be sent. Messages in flight need to be cached for a short time. This is independent of the S-ICS and stream head preview functions; once the state is cleared, all processing proceeds as normal.
remote_target_index: an index into the remote target_queue_index array that defines the actual remote target queue read/write addresses. This index is exchanged when a distributed stream is created, migrated or closed, at which point the route table is updated to reflect the current state.
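
A routing decision driven by remote_index_state might look like the sketch below. The state values follow Table 17; everything else (`route_entry`, `route_message()`, `route_resume()`, the hold-buffer size, and representing messages by integer ids) is a hypothetical simplification for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* State values as listed in Table 17. */
#define REMOTE_INDEX_ACTIVE    0x0000
#define REMOTE_INDEX_INACTIVE  0x0001
#define REMOTE_INDEX_MIGRATING 0x0002
#define REMOTE_INDEX_IN_ERROR  0x0003
#define REMOTE_INDEX_IN_FLUX   0x0004

enum route_action { ROUTE_SEND, ROUTE_HOLD, ROUTE_FAIL };

#define HOLD_MAX 4              /* hypothetical size of the short-term cache */

struct route_entry {
    int state;
    int held[HOLD_MAX];         /* ids of messages cached while unsettled */
    int held_count;
};

/* Decide what to do with one message for this route entry. */
enum route_action route_message(struct route_entry *e, int msg_id)
{
    assert(e != NULL);
    switch (e->state) {
    case REMOTE_INDEX_ACTIVE:
        return ROUTE_SEND;      /* fast path: no extra checks */
    case REMOTE_INDEX_MIGRATING:
    case REMOTE_INDEX_IN_FLUX:
        if (e->held_count < HOLD_MAX) {
            e->held[e->held_count++] = msg_id;
            return ROUTE_HOLD;  /* cache briefly until the state clears */
        }
        return ROUTE_FAIL;      /* cache full: caller must apply flow control */
    default:
        return ROUTE_FAIL;      /* inactive or in error */
    }
}

/* Called once migration completes: mark the entry active and report
 * how many cached messages the caller should now resend from held[]. */
int route_resume(struct route_entry *e)
{
    int n = e->held_count;

    e->state = REMOTE_INDEX_ACTIVE;
    e->held_count = 0;
    return n;
}
```

The active case returns before touching the cache, which reflects the statement that no additional checks are needed in the normal state.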

【０２０２】remote_node: points to the node table managed by the controlling thread. This information is used to create the P-ICS header for sending a message to the remote node.
remote_s_ics: points to a structure that defines the remote S-ICS address and how to reach it via the P-ICS. This structure may be pointed to by many remote index elements.
source_sth: the stream head address of the local stream component. This address is kept so that error handling can be redirected to the appropriate stream head instance.

【０２０３】[0203]

【表１８】[Table 18]

    #define REM_S_ICS_ACTIVE   0x0000
    #define REM_S_ICS_INACTIVE 0x0001
    #define REM_S_ICS_IN_FLUX  0x0002

    struct rem_s_ics_descrp {
        P_ICS_S_ICS_ADDR p_ics_s_ics_address;
        int32            p_ics_s_ics_state;
        P_ICS_DATA_BUF   p_ics_s_ics_data_buf;
        int32            p_ics_s_ics_offset;
    };

【０２０４】p_ics_s_ics_address: points to a structure describing address information specific to the P-ICS implementation. This structure includes the set of functions used to create and manage P-ICS headers that target the remote S-ICS.
p_ics_s_ics_state: the current state of the remote S-ICS.
p_ics_s_ics_data_buf: a data buffer description structure. If the P-ICS uses a receiver-managed paradigm, this may be a pre-allocated memory pool managed by the P-ICS. If a sender-managed paradigm is used, this is the remote target address range, and p_ics_s_ics_offset is the offset into this memory pool at which the data is actually placed. The buffer pool is implemented by functions defined within the P_ICS_ENDPOINT structure.

【０２０５】The S-ICS private data structure contains structure types that cannot be known until the driver is implemented. At a minimum, the defines and the structure shown in Table 19 below must exist.

【０２０６】[0206]

【表１９】[Table 19]

    #define S_ICS_INACTIVE         0x0000
    #define S_ICS_ACTIVE           0x0001
    #define S_ICS_CLOSING          0x0002
    #define S_ICS_MIGRATION        0x0003
    #define S_ICS_IN_ERROR         0x0004
    #define S_ICS_MIDDLEWARE_ERROR 0x0005
    #define S_ICS_ORPHANED         0x0006
    #define S_ICS_OUTBOUND         0x0007
    #define S_ICS_INBOUND          0x0008
    #define S_ICS_FLOW_CONTROL     0x0009

    struct s_ics_q_ptr {
        int32                       s_ics_flags;
        struct p_ics_data          *p_ics_instance;
        struct middleware_t        *s_ics_middleware;
        struct target_queue_index_t target_queue_index_hash[TARGET_MAX_INDEX];
        struct target_queue_t       target_queue_hash[TARGET_HASH_SIZE];
        struct node_t              *node_tbl_ptr;
        struct sth_s               *s_ics_sth[S_ICS_HASH_SIZE];
        struct s_ics_rem_route_t    remote_route_tbl[TARGET_MAX_INDEX];
        struct s_ics_rem_route_t    s_ics_route_tbl[REM_MAX_INDEX];
        S_ICS_MIGRATION_POLICY      s_ics_migration_policy;
        S_ICS_RECOVERY_POLICY       s_ics_recovery_policy;
        S_ICS_CONTROL_POLICY        s_ics_control_policy;
    };

【０２０７】s_ics_flags: indicates the state of each S-ICS instance and what it is currently doing.
p_ics_instance: the P-ICS instance data defined by the preceding structure.
s_ics_middleware: describes how to contact the middleware, its policies, and how to handle middleware migration or failure through standard functions.
target_queue_index_hash: the index array used to locate the real target queue structure. Despite its name this is not a hash array; there is exactly one element per index.
target_queue_hash: a hash table containing all of the target queue elements. There may be several target queues per bucket, so the functions that find elements and update this hash must be able to follow a doubly linked list.
node_hash_tbl: a hash table containing the node definitions (potentially several nodes per bucket).
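
The bucket handling described for target_queue_hash, where several entries may chain off one bucket through next/prev links, can be sketched as follows. The structure `tq`, the modulo hash, and the function names are assumptions for illustration; only the doubly linked chaining itself comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_SIZE 8             /* hypothetical; Table 15 uses 1024 */

struct tq {
    uint32_t key;               /* e.g. a queue identifier */
    struct tq *next, *prev;     /* entries hashed to the same bucket */
};

static struct tq *buckets[HASH_SIZE];

static unsigned bucket_of(uint32_t key)
{
    return key % HASH_SIZE;     /* placeholder hash function */
}

/* Insert at the head of the bucket's chain. */
void tq_insert(struct tq *e)
{
    unsigned b;

    assert(e != NULL);
    b = bucket_of(e->key);
    e->prev = NULL;
    e->next = buckets[b];
    if (buckets[b] != NULL)
        buckets[b]->prev = e;
    buckets[b] = e;
}

/* Walk the chain for this key's bucket; NULL if the key is absent. */
struct tq *tq_find(uint32_t key)
{
    for (struct tq *e = buckets[bucket_of(key)]; e != NULL; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}

/* Unlink in O(1): the prev pointer avoids a second walk of the chain. */
void tq_remove(struct tq *e)
{
    if (e->prev != NULL)
        e->prev->next = e->next;
    else
        buckets[bucket_of(e->key)] = e->next;
    if (e->next != NULL)
        e->next->prev = e->prev;
    e->next = e->prev = NULL;
}
```

The prev link is what makes removal during close or migration cheap, since the element can be unlinked without searching for its predecessor.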

[0208] s_ics_sth: a hash table containing all local stream head addresses associated with each upper-multiplexer S-ICS instance. It is searched to find the queue addresses stored in the stream head for the plumbing operation in progress.
remote_route_tbl: as described above.
s_ics_route_tbl: the S-ICS routing table. It describes where every S-ICS instance resides in the cluster and how to reach it. It is needed when an application or stream component migrates and messages must be routed to a different node. When a node fails, the S-ICS consults the control thread or the middleware to update its routing table. These updates are initiated by an S-ICS in the system so that all other S-ICSs learn of the change, rather than each one reacting to a routing error. Note: this structure uses the same definition as remote_route_tbl, except that it does not use the index values.

[0209] The remaining three elements are policy structures. Their contents depend on the S-ICS implementation and are not detailed here. At a minimum, each policy must provide a function vector that the S-ICS can call to handle a particular condition.
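A policy function vector of this kind might look like the sketch below. Since the patent leaves the policy contents open, everything here is an assumption made for illustration: the handler signature, the condition numbering, `POLICY_MAX_CONDITIONS`, and the choice to report "drop the message" when no handler is installed.

```c
#include <assert.h>
#include <stddef.h>

#define POLICY_MAX_CONDITIONS 4   /* hypothetical number of conditions */

enum policy_result { POLICY_HANDLED, POLICY_DROP_MESSAGE };

/* Hypothetical handler signature. */
typedef enum policy_result (*policy_fn)(void *ctx, int condition);

struct policy {
    policy_fn vec[POLICY_MAX_CONDITIONS];   /* one slot per condition */
};

/* Call the handler for a condition, or fall back when none is defined. */
static enum policy_result policy_dispatch(const struct policy *p,
                                          int condition, void *ctx)
{
    if (condition < 0 || condition >= POLICY_MAX_CONDITIONS ||
        p->vec[condition] == NULL)
        return POLICY_DROP_MESSAGE;
    return p->vec[condition](ctx, condition);
}

/* Example handler that only counts how often it ran. */
static enum policy_result count_calls(void *ctx, int condition)
{
    (void)condition;
    ++*(int *)ctx;
    return POLICY_HANDLED;
}
```

Keeping the handlers behind a table lets the S-ICS code stay unaware of what any particular protocol needs, at the cost of one indirect call per event.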

[0210] To exchange data between two nodes, the S-ICS must be able to convert an mblk header to and from a byte string that is transmitted as part of the message. This means that the various non-pointer fields must be assembled into a new structure. Table 20 below shows one example of such a structure.

[0211]

[Table 20]
    #define PROT_NORMAL       0x00
    #define PROT_ROUTE_ERROR  0x01

    struct mblk_header {
        unsigned short protocol_msg_type;
        unsigned short b_flag;
        unsigned char  b_band;
        unsigned char  db_ref;
        unsigned char  db_type;
        unsigned short db_flag;
        unsigned int   db_size;
        unsigned int   free_func_index;
        unsigned int   free_func_arg_index;
        unsigned int   source_queue_index;
        unsigned int   target_queue_index;
    };

[0212] protocol_msg_type: indicates whether this message is a normal exchange between S-ICSs, an error recovery message, or something else. Error recovery may occur when the target queue index embedded in a transmitted mblk cannot be found. The error return message contains the source and target queue indexes; the S-ICS examines it and decides whether to look up a new route, raise an error, or drop the message.
b_flag, b_band, db_ref, db_type, db_flag and db_size: values taken directly from the mblk and used to rebuild the mblk header on the remote node.
free_func_index: an index into an array of free-function addresses. The source node validates the index mapping for the remote node and embeds it in the header. It is needed to handle esballoc() and its variants, where a driver or module defines an alternative free function for its data.
free_func_arg_index: set up in a similar way. However, the argument passed to esballoc() may be an offset pointer into the encapsulated data, or a pointer to a structure that supplies the actual arguments of the free function, so this index may require protocol-specific interpretation. Further investigation is needed to arrive at a good design.
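The conversion between the header fields and a byte string could be done along the lines of the sketch below. The fixed-width types, the big-endian field order on the wire, and the `hdr_pack` / `hdr_unpack` names are assumptions chosen for illustration; the patent leaves the wire format to the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the fields of Table 20; the fixed-width types are an assumption. */
struct wire_hdr {
    uint16_t protocol_msg_type;
    uint16_t b_flag;
    uint8_t  b_band;
    uint8_t  db_ref;
    uint8_t  db_type;
    uint16_t db_flag;
    uint32_t db_size;
    uint32_t free_func_index;
    uint32_t free_func_arg_index;
    uint32_t source_queue_index;
    uint32_t target_queue_index;
};

#define WIRE_HDR_LEN 29   /* 3 * 2 + 3 * 1 + 5 * 4 bytes, no padding */

static uint8_t *put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
    return p + 2;
}

static uint8_t *put32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
    return p + 4;
}

static const uint8_t *get16(const uint8_t *p, uint16_t *v)
{
    *v = (uint16_t)((p[0] << 8) | p[1]);
    return p + 2;
}

static const uint8_t *get32(const uint8_t *p, uint32_t *v)
{
    *v = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    return p + 4;
}

/* Serialize field by field so that struct padding never reaches the wire. */
static size_t hdr_pack(const struct wire_hdr *h, uint8_t *buf)
{
    uint8_t *p = buf;

    p = put16(p, h->protocol_msg_type);
    p = put16(p, h->b_flag);
    *p++ = h->b_band;
    *p++ = h->db_ref;
    *p++ = h->db_type;
    p = put16(p, h->db_flag);
    p = put32(p, h->db_size);
    p = put32(p, h->free_func_index);
    p = put32(p, h->free_func_arg_index);
    p = put32(p, h->source_queue_index);
    p = put32(p, h->target_queue_index);
    return (size_t)(p - buf);
}

/* Rebuild the header fields on the receiving node. */
static void hdr_unpack(const uint8_t *buf, struct wire_hdr *h)
{
    const uint8_t *p = buf;

    p = get16(p, &h->protocol_msg_type);
    p = get16(p, &h->b_flag);
    h->b_band = *p++;
    h->db_ref = *p++;
    h->db_type = *p++;
    p = get16(p, &h->db_flag);
    p = get32(p, &h->db_size);
    p = get32(p, &h->free_func_index);
    p = get32(p, &h->free_func_arg_index);
    p = get32(p, &h->source_queue_index);
    (void)get32(p, &h->target_queue_index);
}
```

Packing explicitly, rather than copying the structure with memcpy, keeps the format independent of compiler padding and of the byte order of either node.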

[0213] source_queue_index: can be interpreted in two ways. When the mblk is created by a stream head routine, as in a system call, the routine embeds sth_cluster_route_id in this field. Normally the field is an S-ICS route table index, so that route lookup is fast. The exception is when the S-ICS has not been able to update the field; in that case it still holds its initial value, the device identifier (dev_t) associated with the stream head. The routing function can use either form to determine the route table entry. During error recovery, source_queue_index in the mblk is always a valid route table index, never a device identifier.
target_queue_index: an index into the remote target queue hash table.
The data structures that handle policies may contain function vectors that are executed through the S-ICS. These function vectors let the S-ICS retain the protocol independence that component designers need in order to develop new functionality. If no function is defined, the S-ICS discards the message or, if that is a problem, shuts down the distributed stream and issues an M_HANGUP or M_ERROR message to notify the application. FIG. 10 shows the S-ICS data structures and their interaction with the control thread.

[0214] 4.3 S-ICS Routing

The S-ICS 38 routes messages using a cluster-wide unique addressing scheme, detailed below. Outgoing messages are described first.
1. An outgoing function call that passes through the stream head is made, such as write, putmsg, putpmsg, ioctl, streams_putmsg, streams_write or streams_ioctl. After an mblk has been allocated via allocb(), or received from the caller via streams_putmsg, STREAMS stores sth->sth_cluster_route_id in mp->b_datap->source_queue_index in the mblk header. When the stream is created, the route identifier is initialized to sth->sth_dev. The supporting S-ICS routing algorithm understands either form. The framework finishes processing the mblk and sends it downstream.

[0215] 2. When the S-ICS receives the mblk through its write put routine, it examines mp->b_datap->source_queue_index and applies a hash function. This hash function is a range check that verifies whether the route identifier corresponds to an array index into the route table. If it does not, a second hash function is applied that treats the value as a dev_t and converts it to a route index. If this also fails, the error recovery mechanism is invoked.

[0216] 3. Once the route index has been determined, mp->b_datap->target_queue_index is updated with remote_target_index. source_target_index is left unchanged so that error handling can be performed for messages initiated from a remote node. Next, the mblk is passed to the P-ICS, which converts it to an implementation-dependent format and sends it to the remote S-ICS. The remote node / S-ICS identifier is also retrieved from the route table and passed to the P-ICS.

[0217] 4. The remote P-ICS examines the arriving data and determines whether it is intended for the STREAMS subsystem. If not, it forwards the message to the correct target subsystem or thread, such as the control thread. Note: if sender-managed communication is used, the target address is already known, and all the P-ICS has to do is notify the correct subsystem of the message's arrival, provided the application has asked to be notified. If an interrupt routine is used, it converts the message into an mblk and determines which S-ICS to invoke.

[0218] 5. The remote S-ICS (through its read queue put routine) examines target_queue_index and determines whether the index is valid, or whether the associated queue has been closed or migrated to a new node. If the index is valid, it forwards the message to the target queue. The S-ICS need only hash the registered queues associated with it on each node. Combining these queues with the node table yields a set of mappings from one end of the stack to the other. For example, in a split TCP/IP/DLPI stack, messages are transferred between IP and DLPI through the S-ICS. Because the S-ICS took part in setting up this configuration, it knows which identifiers and queues correspond to one another. Note: stream-based stacks usually fragment a packet by duplicating the mblk header with dupb() and simply adjusting b_rptr and b_wptr to reflect the fragment position. Because the identifiers are stored as part of the mblk header, fragmentation is not a problem. The same applies to duplicates made for retransmission.
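Steps 2 and 3 of the outgoing path can be sketched as a two-stage lookup. This is a simplified illustration, not the patented implementation: the table layout, `ROUTE_TBL_SIZE`, the `ROUTE_NONE` marker and the function names are hypothetical, and a linear scan stands in for the second hash function that treats the value as a dev_t.

```c
#include <assert.h>
#include <stdint.h>

#define ROUTE_TBL_SIZE 16          /* hypothetical table size */
#define ROUTE_NONE 0xFFFFFFFFu     /* hypothetical "no route" marker */

struct route_entry {
    int      in_use;
    uint32_t dev_id;               /* device identifier the stream was opened with */
    uint32_t remote_target_index;  /* index into the remote node's target table */
};

struct route_tbl {
    struct route_entry e[ROUTE_TBL_SIZE];
};

/* Step 2: first treat the value as a table index, then as a dev_t. */
static uint32_t resolve_route(const struct route_tbl *t,
                              uint32_t source_queue_index)
{
    uint32_t i;

    /* Range check: is the value already a valid route table index? */
    if (source_queue_index < ROUTE_TBL_SIZE &&
        t->e[source_queue_index].in_use)
        return source_queue_index;

    /* Fallback: interpret the value as a device identifier. */
    for (i = 0; i < ROUTE_TBL_SIZE; i++)
        if (t->e[i].in_use && t->e[i].dev_id == source_queue_index)
            return i;

    return ROUTE_NONE;             /* the caller starts error recovery */
}

/* Step 3: fill in the target index; the source index is left untouched. */
static int route_outgoing(const struct route_tbl *t,
                          uint32_t source_queue_index,
                          uint32_t *target_queue_index)
{
    uint32_t r = resolve_route(t, source_queue_index);

    if (r == ROUTE_NONE)
        return -1;
    *target_queue_index = t->e[r].remote_target_index;
    return 0;
}
```

The sketch assumes that device identifiers never fall inside the range of valid table indexes; a real implementation would need some way to tell the two forms apart when they can overlap.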

[0219] Incoming messages, that is, messages generated by a remote component such as a remote DLPI instance, are routed as follows.
1. For each incoming message, the stream head read queue preview function determines whether the message should be delivered to the control thread or directly to the associated S-ICS. If the message goes to the control thread, the preview function invokes sth_rput(q, mp) and performs the same framework processing as in a non-clustered environment. If the message goes to the S-ICS, the preview function finds the stream head corresponding to the S-ICS in its sth_cluster_data field and uses it to route the message.

[0220] 2. Before the preview function actually executes putnext(s_ics_sth->wq, mp), it embeds the index value in the mblk. The preview function stores sth->sth_cluster_route_id, the actual route table index that was supplied when the distributed stream was created, in source_queue_index. Depending on the implementation, this route index should not change, so that no mechanism is needed to update the stream head of this component. Any route change should be confined to an update of the route table, which the S-ICS recognizes automatically.

[0221] 3. The rest of the processing is the same as the routing case described above, because the S-ICS is not concerned with where a message originated, only with the source and target index values.

[0222] Error recovery proceeds as follows when the S-ICS (read queue put routine) determines that target_queue_index is not valid. target_queue_index is always valid unless the stack has been closed or has migrated. In either of those cases there is a race between the sending S-ICS and the local S-ICS. When a close or migration occurs, the S-ICS notifies all remote S-ICSs of the route change. This can be done in one of two implementation-dependent ways. In the first, the S-ICS simply broadcasts a message and assumes that every node is notified and every S-ICS is updated. The problem is that there is no guarantee that messages already in transit will be routed correctly using the new information.

[0223] In the second method, the S-ICS notifies all S-ICSs that a route update is about to take place; each of them completes whatever it has in progress and waits for the update. Once all S-ICSs have responded, the S-ICS updates the route and waits for acknowledgments. In essence, the S-ICS uses a two-phase or three-phase commit protocol to guarantee that all S-ICSs are always synchronized. The second method clearly eliminates the need for error recovery, but a route update takes more work and more time. This may be acceptable if routes are not updated very often (the concern does not apply to stream creation, because the S-ICSs involved are notified at that point and are therefore always synchronized). The choice between the two methods thus comes down to where the complexity is added and how much time must be spent in each area.
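The second method can be pictured as a small two-phase update, sketched below as a single-process simulation. All names are hypothetical, the peers are plain structures rather than remote S-ICS instances, and the abort path (resuming with the old route when a peer does not answer) is an assumption; the patent does not say what happens in that case.

```c
#include <assert.h>
#include <stdint.h>

enum peer_state { PEER_RUNNING, PEER_QUIESCED };

/* Stand-in for one remote S-ICS. */
struct peer {
    enum peer_state state;
    int      reachable;   /* 0 simulates a peer that never answers */
    uint32_t route;       /* the single route entry being changed */
};

/* Phase 1: ask a peer to finish in-flight work and hold new sends. */
static int peer_prepare(struct peer *p)
{
    if (!p->reachable)
        return 0;
    p->state = PEER_QUIESCED;
    return 1;
}

/* Phase 2: install the new route and resume. */
static void peer_commit(struct peer *p, uint32_t new_route)
{
    p->route = new_route;
    p->state = PEER_RUNNING;
}

/* Abort: resume with the old route (assumed behavior). */
static void peer_abort(struct peer *p)
{
    p->state = PEER_RUNNING;
}

/* Returns 0 when every peer switched to new_route, -1 when none did. */
static int update_route(struct peer *peers, int n, uint32_t new_route)
{
    int i;
    int ready = 0;

    for (i = 0; i < n; i++)
        ready += peer_prepare(&peers[i]);

    if (ready != n) {
        for (i = 0; i < n; i++)
            if (peers[i].state == PEER_QUIESCED)
                peer_abort(&peers[i]);
        return -1;
    }

    for (i = 0; i < n; i++)
        peer_commit(&peers[i], new_route);
    return 0;
}
```

Because no peer sends while it is quiesced, no message can be routed with a stale entry, which is why this variant does not need the error recovery path; the price is the extra round of notifications on every route change.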

[0224] If a broadcast mechanism is used (or rather a multicast mechanism, since only the S-ICSs involved in the route need to learn of the change), the S-ICS must generate an error response to the originating S-ICS. The S-ICS must also retain information for a certain period, namely the time needed to ensure that all route tables have been updated, so that it can tell the remote S-ICS what the failure was. This is optional, because the mechanism can always return a route failure message and have the S-ICS check its route table and query the state of the control thread for that distributed stream. The control thread may in fact be performing failure recovery, for example after a node failure, in which case it may have the S-ICS forward the message to it. The control thread then holds the data until recovery is complete and flow-controls the originating stream (this is described further below).

[0225] The only potential problem with this routing solution arises when a module or driver generates outgoing messages itself, such as M_CTL messages used for inter-module communication or M_PROTO/M_DATA messages used to issue hardware driver bind requests. Such messages carry no source or target index values. There are two ways to address this. The first is simply to avoid the problem by never splitting a stack between stream components that communicate in this way. This is possible but not practical. The second is to modify the STREAMS framework plumbing routines to record additional information (these are the routines associated with the autopush(1M) facility and with the I_LINK, I_PLINK, I_PUSH and I_POP ioctls). In addition, depending on how the stream stack is plumbed, it may be necessary to implement the S-ICS not as a stream-based driver but as an N x 1 multiplexing driver, with multiple S-ICS instances feeding a lower multiplexer that sits above the P-ICS. The TCP/IP/DLPI example illustrates this as follows.

[0226] The IP component can be implemented as a multiplexer for one or more DLPI instances linked directly beneath it. If the stack is split at the IP/DLPI level, the IP read and write queues associated with each DLPI instance are recorded. This happens during the I_LINK operation if the stream is initially created as a distributed stream; if the stream is split after creation, the IP and DLPI read and write queues are updated to reflect the S-ICS being inserted. Splitting after creation is described in detail later. In either case, the queue addresses are recorded in the S-ICS private data structure and used as hashed queues to determine the correct route. This is possible because there is now a one-to-one correspondence between IP and S-ICS instances.

[0227] The sample code in Table 21 below includes a macro that is used to determine which S-ICS is involved, based on its stream head; this works because the private data is accessible to all multiplexer instances.

[0228]

[Table 21]
    s_ics_private_data = (struct s_ics_data *)sth->wq->q_ptr;
    RECORD_QUEUE(upper_sth, s_ics_sth, s_ics_private_data);

[0229] Whether the S-ICS must be implemented as a multiplexer depends on how the S-ICS is used, how many different protocol stacks a single S-ICS supports, how far the S-ICS can maintain protocol independence, how much load is observed, and so on. This is a judgment call, but the overall concept and design remain the same regardless of the implementation.

[0230] Table 22 below is sample code in which the S-ICS handles both the outgoing and the incoming case of message routing. The function uses the message pointer to retrieve the source queue address index. If the index cannot be retrieved, the S-ICS must start the error recovery algorithm and find the route some other way. If the source is found, the S-ICS uses it as a hash key to find the target queue index, which is then embedded in the message. The function next uses the target queue index to find the remote node (if the data structures that manage the indexes are cross-linked, this is a simple lookup). From the node address, the corresponding S-ICS instance address is determined and used to send the message to the remote node.

[0231]

[Table 22]
    struct s_ics_instance *
    find_s_ics(mp, s_ics_data)
    {
        struct target_queue_t *target_queue;
        uint32 source_queue_index;
        struct node_t *node_address;
        struct s_ics_instance *s_ics;
        struct sth_s *sth;
        struct stroptions *sop;
        mblkP update_mp;
        uint32 route_index;
        uint32 remote_index;

        /* Retrieve the source queue address. */
        source_queue_index = mp->b_datap->source_queue_index;
        if (!VALIDATE_INDEX(source_queue_index)) {
            if ((route_index = VALIDATE_DEV_ID(source_queue_index))) {
                /* The stream head must be updated with the correct route */
                /* index. In this sample code the update is made with an */
                /* M_SETOPTS message that carries the updated route. The */
                /* usual hash function finds the local stream head address */
                /* for this stack, the message is placed directly on the */
                /* stream head read queue, and sth_rput() then processes */
                /* it and updates sth->sth_cluster_route_id. */
                update_mp = allocb(sizeof(struct stroptions), BPRI_MED);
                update_mp->b_datap->db_type = M_SETOPTS;
                sop = (struct stroptions *)update_mp->b_rptr;
                sop->so_flags = SO_CLUSTER_ROUTE_UPDATE;
                sop->cluster_route = route_index;
                sth = STH_HASH(source_queue_index);
                putq(sth->sth_rq, update_mp);
                goto route_msg;
            } else {
                /* This may be a message created by a module or driver. */
                /* Use the local S-ICS data to determine the source queue */
                /* and find the target_queue element by examining the */
                /* local_read_queue or local_write_queue field. */
                if (!(remote_index = FIND_TARGET_QUEUE(s_ics_data))) {
                    /* The message cannot be routed: consult the control */
                    /* thread or discard the message. */
                }
            }
        }
    route_msg:
        /* Returns NULL on error. */
        /* Otherwise returns the address of the remote_sics structure. */
        return (&s_ics_data->remote_route_tbl[remote_index].remote_s_ics);
    }

[0232] The function in Table 23 below uses the message pointer to retrieve the local target queue address index. This value is used to hash to the target_queue structure, which is then consulted to handle the close race condition.

[0233]

[Table 23]
    struct target_queue_t *
    find_target(mp, s_ics_data)
    {
        uint32 target_queue_index;

        /* Retrieve the target queue index and return the */
        /* validated target queue, or NULL. */
        target_queue_index = mp->b_datap->target_queue_index;
        return (VALIDATE_TARGET_QUEUE(target_queue_index));
    }

[0234] FIND_TARGET_QUEUE fails if a migration or close is taking place and a message was already in transit before the originating S-ICS was updated (a potential timing problem). Each local S-ICS instance is considered current for all operations at the moment the mapping function is applied. Every S-ICS instance is updated whenever a close or migration occurs. If a packet is in transit, however, there is a timing window in which the updated information has not yet been applied. It is therefore up to the receiving S-ICS instance to interpret the address and take the appropriate action on success or failure.
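The decision the receiving S-ICS has to make for a message that raced with a close or migration might be structured as in the sketch below. The slot states, the three possible actions and all names are hypothetical; in particular, redirecting to the new node is only one conceivable reaction, and the text itself leaves the appropriate action to the receiving instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TARGET_TBL_SIZE 8   /* hypothetical table size */

enum tq_state  { TQ_FREE, TQ_OPEN, TQ_CLOSED, TQ_MIGRATED };
enum rx_action { RX_DELIVER, RX_ROUTE_ERROR, RX_REDIRECT };

struct target_slot {
    enum tq_state state;
    uint32_t new_node;      /* meaningful only when state == TQ_MIGRATED */
};

/* Classify an incoming message by the state of its target slot. */
static enum rx_action classify_incoming(const struct target_slot *tbl,
                                        uint32_t target_queue_index,
                                        uint32_t *redirect_node)
{
    const struct target_slot *s;

    if (target_queue_index >= TARGET_TBL_SIZE)
        return RX_ROUTE_ERROR;          /* index is out of range */

    s = &tbl[target_queue_index];
    switch (s->state) {
    case TQ_OPEN:
        return RX_DELIVER;              /* normal case: forward to the queue */
    case TQ_MIGRATED:
        *redirect_node = s->new_node;   /* the stack moved while in transit */
        return RX_REDIRECT;
    default:
        return RX_ROUTE_ERROR;          /* closed or never registered */
    }
}
```

A route error result would correspond to returning a message of type PROT_ROUTE_ERROR to the originating S-ICS, which then decides whether to look up a new route or drop the message.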

[0235] Thus, for the S-ICS to route messages within the cluster, it must be told where module or driver instances reside and when their locations change. In the global port mapping example, the S-ICS must be notified when a bind, unbind or close request is issued, or when a stream is opened as a distributed stream or split at a later time. The S-ICS then has sufficient information about all routes, which generally removes the need to consult the middleware and update its route cache. FIGS. 11 to 13, FIG. 14 and FIG. 15 illustrate message routing within the cluster.

[0236] 4.4 Modifications to S-ICS-Related Stream Data Structures

The preceding sections defined several functions that retrieve routing information from the mblk header. Embedding this information in the header reduces the amount of information that must be accessed and distributed within the cluster. To achieve this, the following data structures must be modified. The datab structure is modified to add the additional routing addresses. As of the current SVR4.2 DDI/DKI version, the datab structure contains several padding words that keep the data cache-line aligned. These padding words are replaced with the following fields:
    uint32 source_queue_index;
    uint32 target_queue_index;
These fields are reused to carry the stream head route identifier on outgoing messages. The route identifier, which defaults to sth->sth_dev when the stream head is allocated, is stored in every mblk generated at the stream head. This is done as shown in Table 24 below.

[0237]

[Table 24]
    if ((mp = allocb(size, pri)) == NULL) {
        /* Perform error recovery, which is usually bufcall(). */
    }
    mp->b_datap->source_queue_index = sth->sth_cluster_route_id;

[0238] The performance cost is minimal: one load and one store. It is not worth maintaining a stream head flag to check whether this has already been done, because the cost of the check would exceed the cost of the assignment. The framework does, however, maintain stream head flags that indicate whether particular queue addresses should be recorded in a given structure during a plumbing operation.

[0239]

[Table 25]
    #define F_STH_CLUSTER_INTERCONNECT  0x2000
    #define F_STH_CLUSTER_CTL_CHECK     0x4000

    struct sth_cluster_data_t {
        queue_t *read_queue;
        queue_t *write_queue;
    };

[0240] The read queue and write queue are updated during framework plumbing operations to reflect the actual queues above this S-ICS. For example, because IP is a multiplexer and may generate messages directly to the driver, the S-ICS needs to know which IP queue pair is being used to generate a message in order to determine the remote DLPI target. Sometimes this is obvious and sometimes it is not. For example, when IP sends a message to a remote DLPI through the S-ICS by way of its incoming fan-out table, the S-ICS cannot rely on q_next to determine which IP sent the message, because, depending on the IP implementation, q_next may not be accurate or sufficient for finding the correct DLPI. The structure shown in Table 26 below is added to the stream head.

【０２４１】[0241]

【表２６】[Table 26]
struct sth_s {
    ...
    struct sth_cluster_data_t *sth_cluster_data;
};

【０２４２】ここで、クラスタの範囲内の異なるノード
に存在する待ち行列アドレスをどのように調整するのか
という疑問が出る。ある場合は調整しある場合は調整し
ない。アドレス自体は交換されないが、実際のアドレス
を見つけるために使用することができる構造に対するイ
ンデックスが交換される。例えば、メッセージが定義さ
れたsource_queue_address_indexを持つＳＩＣＳに送り
出される時、Ｓ−ＩＣＳは、このインデックスを使用し
て遠隔目標待ち行列アドレス・インデックスを見つけ出
す。この目標待ち行列アドレス・インデックスは、待ち
行列位置を定義している遠隔テーブルへのインデックス
である。
The question now arises of how queue addresses that exist on different nodes within the cluster are reconciled. In some cases they are reconciled and in others they are not. The addresses themselves are not exchanged; what is exchanged are indexes into structures that can be used to find the actual addresses. For example, when a message with a defined source_queue_address_index is sent to the S-ICS, the S-ICS uses this index to find the remote target queue address index. This target queue address index is an index into the remote table that defines the queue location.
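By way of illustration only, the two-step index lookup described above can be sketched as follows. This is a minimal user-space sketch, not the implementation of the embodiment: the table names `route_table` and `target_queue_table`, the slot count, and the stand-in type `queue_sketch_t` are assumptions introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a STREAMS queue; only its address matters here. */
typedef struct queue_sketch { int id; } queue_sketch_t;

#define ROUTE_SLOTS 8      /* assumed table size */
#define NO_ROUTE    (-1)   /* assumed "no entry" marker */

/* Sending node: source_queue_address_index -> remote target queue index. */
int route_table[ROUTE_SLOTS];

/* Receiving node: target queue index -> actual queue address on that node. */
queue_sketch_t *target_queue_table[ROUTE_SLOTS];

void tables_init(void)
{
    for (int i = 0; i < ROUTE_SLOTS; i++) {
        route_table[i] = NO_ROUTE;
        target_queue_table[i] = NULL;
    }
}

/* Sending side: translate the index carried in the message into the
 * index that the remote node understands. No address crosses the wire. */
int lookup_remote_target_index(int source_queue_address_index)
{
    if (source_queue_address_index < 0 ||
        source_queue_address_index >= ROUTE_SLOTS)
        return NO_ROUTE;
    return route_table[source_queue_address_index];
}

/* Receiving side: resolve the received index to a queue on this node. */
queue_sketch_t *resolve_target_queue(int target_queue_index)
{
    if (target_queue_index < 0 || target_queue_index >= ROUTE_SLOTS)
        return NULL;
    return target_queue_table[target_queue_index];
}
```

Because only small integers are exchanged, each node remains free to place its queues at whatever addresses it likes.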

【０２４３】4.5 コンポーネント交信Ｓ−ＩＣＳとコンポーネントの間で行うことができる交
信については上述した。以下の節では、Ｓ−ＩＣＳの観
点からのそのような交信の詳細およびそれらがどのよう
に設計されるかを検討する。更に、Ｓ−ＩＣＳ３８がど
のようにＰＩＣＳ３６と交信するかを検討する。
4.5 Component Interaction
The interactions that can take place between the S-ICS and the other components were described above. The following sections examine the details of those interactions from the point of view of the S-ICS and how they are designed. They also examine how the S-ICS 38 interacts with the P-ICS 36.

【０２４４】4.5.1 制御スレッド交信２つのコンポーネント間の対話については既に記述した
が、以下にいくつかの重要なポイントを述べる。制御ス
レッド３４に関する限り、Ｓ−ＩＣＳ３８はまさにスト
リーム型ドライバであり、従って、制御スレッドは、ス
トリーム型アプリケーション以外の何物でもない。交信
は標準的メッセージ・パラダイムを使用して行われる。
Ｓ−ＩＣＳは、メッセージを制御スレッドに送り出す必
要がある時、標準的メッセージ・タイプを使用する。し
かし、制御スレッドがメッセージをＳ−ＩＣＳに送り出
す必要がある時、制御スレッドは、ストリームヘッドの
バイパスのため送信元を見分けることができないので標
準的メッセージ・タイプを使用することができない。
4.5.1 Control Thread Interaction
The dialogue between these two components has already been described; some important points are restated here. As far as the control thread 34 is concerned, the S-ICS 38 is simply a STREAMS driver, and the control thread is therefore nothing more than a STREAMS application. Interaction takes place using the standard message paradigm. When the S-ICS needs to send a message to the control thread, it uses standard message types. When the control thread needs to send a message to the S-ICS, however, it cannot use standard message types, because with the stream head bypassed the source of the message cannot be identified.

【０２４５】制御スレッドは、Ｓ−ＩＣＳと通信するた
めカーネル内ＳＴＲＥＡＭＳインターフェースを使用す
る。このため、使用中のものはユーザ空間からアクセス
することが通常できないので、制御スレッドはメッセー
ジ・タイプを指定することができる。制御スレッドがユ
ーザ空間アプリケーションとして実施されたとすれば、
すべてのＳ−ＩＣＳ通信は透過的ioctlを利用してデー
タを通過させなければならないであろう。なぜなら、そ
れらＳ−ＩＣＳだけがモジュール間通信と明確に区別す
ることができるからである。
The control thread uses the in-kernel STREAMS interface to communicate with the S-ICS. This allows the control thread to specify the message type, and the types it uses are normally not accessible from user space. If the control thread were implemented as a user-space application, all S-ICS communication would have to pass data using transparent ioctls, because those are the only messages that the S-ICS could clearly distinguish from inter-module communication.

【０２４６】Ｓ−ＩＣＳはそのノード・テーブルおよび
Ｐ−ＩＣＳアクセス情報を制御スレッド・データ構造か
ら直接取り出す。Ｓ−ＩＣＳは、その作成の間これらの
構造のアドレスを入手する。Ｓ−ＩＣＳを読み取り専用
の同じ情報にアクセスさせることによって、制御スレッ
ドは、ロックせずに、また、新しい情報をすべてのＳ−
ＩＣＳに直接通知する必要もなしに、このデータを更新
することができる。このように、経路更新またはノード
状態は、すべてのＳ−ＩＣＳに自動的に反映されること
ができる。制御スレッドがユーザ空間で実施されるとす
れば、これは可能でない。そのような実施形態では、上
記のような更新を取り扱うメッセージ交換プロトコール
が存在しなければならない。注：Ｓ−ＩＣＳは、通常最
初に検知すべきものであるからノードが障害を起こした
かそれへの経路が障害を起こしたかを標示するため、ノ
ードの状態を直接更新することもできる。
The S-ICS retrieves its node table and P-ICS access information directly from the control thread's data structures. It obtains the addresses of these structures when it is created. Because every S-ICS has read-only access to the same information, the control thread can update this data without locking and without having to notify each S-ICS of the new information directly. Route updates and node state changes are thus reflected automatically in every S-ICS. This would not be possible if the control thread were implemented in user space; such an implementation would need a message exchange protocol to handle these updates. Note: because the S-ICS is usually the first to detect a failure, it may also update a node's state directly to indicate that the node, or the route to it, has failed.

【０２４７】ノード状態が４バイト数量であり、PA_RIS
Cのようなアーキテクチャにおける４バイト書き込みが
アトミック動作であるので、この方法はいかなるロック
も必要としないであろう。経路指定アルゴリズムの範囲
内で、Ｓ−ＩＣＳは、メッセージ経路指定命令が失敗し
ている場合制御スレッドを調査する能力を持つ。この障
害はメッセージが生成されたローカル・ノードで起こ
る。この障害の発生によって、source_queue_index、ロ
ーカルS_ICS のlocal_read_queueまたはlocal_write_qu
eue情報を(未定義のため)使用することができないの
で、remote_target_indexを取り出す。この問題に対処
するためＳ−ＩＣＳができることは、ルート障害が発生
したことを制御スレッドに通知し、受け取っていたいく
つかのデータおよびメッセージを提供することぐらいで
ある。ほとんどの場合、上方マルチプレクサ・インスタ
ンスが分散ポイントにとってユニークであるような形態
でＳ−ＩＣＳがマルチプレクサとして実施されていれ
ば、このような問題は絶対に起きない。
Because the node state is a 4-byte quantity, and a 4-byte write is an atomic operation on architectures such as PA-RISC, this approach requires no locking. Within its routing algorithm, the S-ICS is able to query the control thread when a message routing operation fails. Such a failure occurs on the local node where the message was generated, when the source_queue_index or the local S-ICS local_read_queue or local_write_queue information cannot be used (because it is undefined) to retrieve the remote_target_index. About all the S-ICS can do to address this problem is notify the control thread that a routing failure has occurred and hand over whatever data and messages it has received. In most cases, the problem never arises if the S-ICS is implemented as a multiplexer whose upper multiplexer instance is unique to the distribution point.
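By way of illustration only, the lock-free node state update can be sketched as follows. The text relies on the atomicity of an aligned 4-byte store on architectures such as PA-RISC; this sketch swaps in C11 `<stdatomic.h>` to express the same single-word update portably. The state values and the names `node_entry`, `node_set_state`, and `node_is_usable` are assumptions introduced for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical node states; the actual encoding is not given in the text. */
enum { NODE_UP = 0, NODE_DOWN = 1, NODE_ROUTE_FAILED = 2 };

/* One entry of the node table shared between the control thread and
 * every S-ICS. The state is a single 4-byte word. */
struct node_entry {
    _Atomic uint32_t state;
};

/* Writer: the control thread, or an S-ICS that detects a failure first.
 * A single store publishes the new state; no lock is taken. */
void node_set_state(struct node_entry *n, uint32_t s)
{
    atomic_store_explicit(&n->state, s, memory_order_release);
}

/* Reader: any S-ICS consulting the shared table during routing. */
uint32_t node_get_state(struct node_entry *n)
{
    return atomic_load_explicit(&n->state, memory_order_acquire);
}

/* Convenience check used before a message is routed to the node. */
int node_is_usable(struct node_entry *n)
{
    return node_get_state(n) == NODE_UP;
}
```

A reader always observes either the old or the new state word, never a torn value, which is the property the text depends on.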

【０２４８】4.5.2 ミドルウェア交信Ｓ_ＩＣＳは、必要な経路指定情報が制御スレッド３４
内に存在しない時、ミドルウェア５０とだけ交信する。
例えば、ノードが故障したことをＳ−ＩＣＳが検出する
場合、別のノード上に新しい遠隔インスタンスを自動的
に作成することを可能にする回復方針をＳ−ＩＣＳは起
動するかもしれない。この回復措置は、Ｓ−ＩＣＳが代
替経路を求めてミドルウェアを調査するので、ローカル
の制御スレッドがこれに関与することを必要としない。
この調査は、制御スレッドがアドレスを求めてミドルウ
ェアを調べる場合と同様であるが、相違している点は、
ミドルウェアが更新されその同じ経路を代替経路として
戻すことのないように、Ｓ−ＩＣＳが無効な経路情報を
提供する点である。このすべてを制御スレッド内で実施
することができるであろうか。答えは可能であり、ＳＴ
ＲＥＡＭＳを経由して提供されているサービスの内容お
よびＳ_ＩＣＳが実施される形態に応じて、これを一層
簡単に実施することができるであろう。
4.5.2 Middleware Interaction
The S-ICS interacts with the middleware 50 only when the routing information it needs is not present in the control thread 34. For example, when the S-ICS detects that a node has failed, it may invoke a recovery policy that allows a new remote instance to be created automatically on another node. This recovery does not require the local control thread to be involved, because the S-ICS queries the middleware for an alternate route. The query resembles the one the control thread makes when it asks the middleware for an address, except that the S-ICS also supplies the invalid route information, so that the middleware is updated and does not return the same route as the alternate. Could all of this be done within the control thread? Yes, and depending on the services being provided over STREAMS and on how the S-ICS is implemented, doing so might be simpler.

【０２４９】4.5.3 Ｐ−ＩＣＳ交信Ｐ−ＩＣＳ３６はクラスタ依存の技術であり、Ｓ−ＩＣ
Ｓは「ブラック・ボックス」として取り扱われるように
設計される。通信は、小さい「にかわ」層に限定されて
いる命令を含むよく定義されたＡＰＩを使用して実行さ
れなければならない。このＡＰＩは一般的Ｓ−ＩＣＳ命
令を特定のＰ−ＩＣＳ命令に変換する。最低限以下の命
令が提供されなければならない。特定のノードおよび経
路データを所与とするクラスタ内部のＳ−ＩＣＳエンド
ポイントとの間でデータが送受信されることができなけ
ればならない。これは、相互接続ヘッダがＰ−ＩＣＳに
よって自動的に生成されなければならず、Ｓ−ＩＣＳの
責任ではないことを意味する。Ｐ−ＩＣＳ３６は、散乱
／収集をサポートしなければならない。散乱／収集動作
は、性能を向上しmblkヘッダおよび関連データの送信を
単純化するため、問題とされる再分類を排除する。すな
わち、データが到着すると、ヘッダおよびＳ−ＩＣＳは
そのデータの内容を判断することを試みなければなら。
4.5.3 P-ICS Interaction
The P-ICS 36 is a cluster-dependent technology, and the S-ICS is designed to treat it as a "black box". Communication must take place through a well-defined API whose operations are confined to a small "glue" layer. This API translates generic S-ICS operations into specific P-ICS operations. At a minimum, the following must be provided. It must be possible to send data to, and receive data from, S-ICS endpoints within the cluster, given a particular node and route data. This implies that the interconnect header must be generated automatically by the P-ICS and is not the responsibility of the S-ICS. The P-ICS 36 must support scatter/gather. The scatter/gather operation eliminates the reclassification in question in order to improve performance and simplify transmission of the mblk header and its associated data. That is, when data arrives, the header and the S-ICS must attempt to determine the content of the data.

【０２５０】従って、データ・バッファは一組の隣接す
るバイトとして扱われなければならない。それにより、
Ｓ−ＩＣは、データをmblkへまたはmblkからコピーする
必要がなくなる。送出メッセージに関しては、Ｓ−ＩＣ
Ｓは、目標ノード上で再作成されることができるよう
に、mblkヘッダを定義する小さいバッファ領域を作成す
る。加えて、Ｐ−ＩＣＳにmp->b_rptrアドレスおよび長
さ(mp->b_wptr_mp->b_rptr)が与えられる。これらは、
送信用の生データとしてまた散乱／収集メッセージの第
２の部分として取り扱われる。到来メッセージに関して
は、Ｓ−ＩＣＳは、esballoc()を経由してmblkを割り当
て、メッセージの第２部分として送られるすべてのデー
タをポイントするmblkヘッダ・ポインタを割り当てる。
Ｓ−ＩＣＳは、メッセージの第１部分内に含まれるヘッ
ダ情報をmblkヘッダに記入する。到来および送出につい
て、送信されているmblkは、b_contポインタを経由して
リンクされるメッセージ・セットである場合もある。送
出の場合、散乱／収集データ部分を定義し、各mblkヘッ
ダ毎に小さいデータ・ブロックを割り当て、ヘッダ・デ
ータおよびユーザ・データのすべてを送り出すことだけ
を実行するように、mblkはすべて同じメッセージ・タイ
プでなければならない。到来の場合、メッセージ・セッ
トが解読されると、mblkヘッダは各部分毎に割り当てら
れ、各データ部分はesballoc()されるであろう。
The data buffer must therefore be treated as a set of contiguous bytes, so that the S-ICS need not copy data into or out of an mblk. For an outbound message, the S-ICS creates a small buffer that describes the mblk header so that the mblk can be re-created on the target node. In addition, the P-ICS is given the mp->b_rptr address and the length (mp->b_wptr - mp->b_rptr); these are treated as raw data to be sent and form the second part of the scatter/gather message. For an inbound message, the S-ICS allocates an mblk via esballoc() and sets the mblk header pointers to point at the data sent as the second part of the message. The S-ICS then fills in the mblk header with the header information contained in the first part of the message. In both directions, the mblk being sent may be a set of messages linked via b_cont pointers. In the outbound case, the mblks must all be of the same message type, so that all that has to be done is to define the scatter/gather data portions, allocate a small data block for each mblk header, and send all of the header data and user data. In the inbound case, when the message set is decoded, an mblk header is allocated for each part and each data part is esballoc()'d.
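By way of illustration only, the outbound scatter/gather construction can be sketched as follows. This is a user-space sketch under stated assumptions: `msgb_sketch` is a simplified stand-in for a STREAMS mblk, `hdr_sketch` carries only a subset of the header fields, and `sg_entry` and `build_sg_list` are hypothetical names, not part of any P-ICS API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a STREAMS message block (mblk). */
struct msgb_sketch {
    struct msgb_sketch *b_cont;   /* next block of the same message */
    unsigned char      *b_rptr;   /* first valid data byte */
    unsigned char      *b_wptr;   /* one past the last valid data byte */
    unsigned short      db_type;  /* message type */
};

/* Small per-mblk descriptor sent ahead of the data (subset of fields). */
struct hdr_sketch {
    unsigned short db_type;
    int32_t        data_size;
};

/* One scatter/gather element handed to the interconnect. */
struct sg_entry {
    const void *base;
    size_t      len;
};

/* Builds a (header, data) pair for each mblk in the b_cont chain without
 * copying user data. hdrs must hold max_sg / 2 entries. Returns the number
 * of sg entries used, or -1 if the chain mixes message types or the
 * arrays are too small. */
int build_sg_list(struct msgb_sketch *mp, struct hdr_sketch *hdrs,
                  struct sg_entry *sg, int max_sg)
{
    int n = 0;
    unsigned short type;

    if (mp == NULL)
        return 0;
    type = mp->db_type;
    for (int i = 0; mp != NULL; mp = mp->b_cont, i++) {
        if (mp->db_type != type || n + 2 > max_sg)
            return -1;
        hdrs[i].db_type = mp->db_type;
        hdrs[i].data_size = (int32_t)(mp->b_wptr - mp->b_rptr);
        sg[n].base = &hdrs[i];            /* first part: header */
        sg[n].len = sizeof hdrs[i];
        n++;
        sg[n].base = mp->b_rptr;          /* second part: raw data in place */
        sg[n].len = (size_t)hdrs[i].data_size;
        n++;
    }
    return n;
}
```

The data elements point directly at `b_rptr`, so no payload bytes are copied on the sending side, which is the benefit the text attributes to scatter/gather.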

【０２５１】散乱／収集がサポートされないとすれば、
処理性能が問題であるが、プロトコール・ヘッダの修正
および障害受け取り問題への対処が必要な点を除いて、
上述のアルゴリズムは同じままとされよう。mblkヘッダ
は例えば次の表２７のような形式である。
If scatter/gather is not supported, performance becomes a concern, but the algorithm described above remains the same, except that the protocol header must be modified and receive failures must be handled. The mblk header has, for example, the format shown in Table 27 below.

【０２５２】[0252]

【表２７】[Table 27]
struct mblk_header {
    unsigned short b_flag;      /* original b_flag */
    unsigned short b_band;      /* original b_band */
    unsigned short db_type;     /* original message type */
    unsigned short m_flags;     /* mblk header flags for interpreting the data */
    int32 data_size;            /* actual data delta */
    int32 target_queue_index;   /* index into the target queue index table */
    int32 mblk_seq_number;      /* sequence number used to match the data */
};

【０２５３】今やデータはそれに追加されるべきプロト
コール・ヘッダを必要とする。これは次のいずれかを必
要とする。すなわち、一般的ＳＴＲＥＡＭアクションで
あるがヘッダを追加できるようにヘッダ空間を事前割り
当てするため、ストリームヘッド書込みオフセットの使
用を必要とするか、あるいは、性能を低下させるが新し
いバッファにプロトコール・ヘッダと共にデータを実際
にコピーすることを必要とするかいずれかである。以下
の表２８はプロトコール・ヘッダ形式の１例である。
The data now needs a protocol header added to it. This requires one of two things: either using the stream head write offset to pre-allocate header space so that the header can be added, which is common STREAMS practice, or actually copying the data, together with the protocol header, into a new buffer, which degrades performance. Table 28 below shows an example of a protocol header format.

【０２５４】[0254]

【表２８】[Table 28]
struct data_header {
    int32 mblk_seq_number;      /* sequence number used to match the mblk */
    unsigned short data_flags;
};

【０２５５】Ｐ−ＩＣＳが輻輳制御の処理のようなパケ
ット配送を保証することができなければ、制御スレッド
およびおそらくＳ−ＩＣＳが回復プロトコールを確立す
る必要がある。この回復プロトコールは、失われている
メッセージからの回復を行う上位プロトコールに依存す
るかもしれない。例えばＩＰ断片が失われるとすれば、
ＩＰはタイムアウトを起こし、伝送プロバイダ要件に基
づいてその通常のエラー回復を実行する。可能であれば
上記のようになるべきであるが、制御スレッドの間のポ
ート更新のようなメッセージに関しては、制御スレッド
は、タイムアウト機構およびエラー回復メカニズムを必
要とする。もちろん、Ｐ−ＩＣＳが送信者によって管理
されている場合、この問題は起こらない。
If the P-ICS cannot guarantee packet delivery, for example because of the way it handles congestion control, the control thread, and possibly the S-ICS, must establish a recovery protocol. This recovery protocol may rely on the upper-layer protocols to recover from lost messages. For example, if an IP fragment is lost, IP times out and performs its normal error recovery according to the transport provider's requirements. This should be the approach wherever possible, but for messages such as port updates between control threads, the control thread needs its own timeout and error recovery mechanisms. Of course, the problem does not arise if the P-ICS is sender-managed.

【０２５６】4.5.4 フロー制御どれだけのデータがソフトウェア・コンポーネントまた
はカードとの間で送受信されるべきかをを統制するた
め、ＳＴＲＥＡＭＳモジュールおよびドライバはどこか
でフロー制御を起動する。ＳＴＲＥＡＭＳは、内蔵フロ
ー制御機構およびフロー制御状態を判断するためのいく
つかのＤＤＩルーチンを含む。分散ＳＴＲＥＡＭＳ環境
内においては、これらのフロー制御および状態関数は、
モジュールまたはドライバを変更することなく開発者が
意図するように動作しなければならない。これを実現す
るため、Ｓ−ＩＣＳおよびストリームヘッド・プレビュ
ー関数は、スタックの次のモジュールの各ローカル・ス
トリーム・コンポーネントの見え方を反映しなければな
らない(スタックが複数ノードにわたって分割されてな
い場合、フロー制御は非クラスタ化環境の場合と同様
で、変更の必要はない)。ここでの説明も、ＩＰおよび
ＤＬＰＩ間で分割が行われているＴＣＰ／ＩＰ／ＤＬＰ
Ｉスタックを使用する。先ずＤＬＰＩに関するフロー制
御から考察を始める。
4.5.4 Flow Control
STREAMS modules and drivers invoke flow control at some point to regulate how much data is sent to or received from a software component or a card. STREAMS includes a built-in flow control mechanism and several DDI routines for determining flow control state. In a distributed STREAMS environment, these flow control and state functions must behave as the developer intended, without any change to the modules or drivers. To achieve this, the S-ICS and the stream head preview functions must reflect each local stream component's view of the next module in the stack. (If the stack is not split across nodes, flow control is the same as in a non-clustered environment and nothing needs to change.) The discussion here again uses a TCP/IP/DLPI stack that is split between IP and DLPI, and begins with flow control as it concerns DLPI.

【０２５７】ＤＬＰＩは両方向でのフロー制御を持つ。
送出ケースに関しては、ＤＬＰＩはハードウェアをオー
バーランすることはできない。従って、ハードウェアが
後戻り可能な数を最小に維持することができるように、
ＤＬＰＩはその高位および低位しきい値を設定する。す
なわち、高位および低位しきい値の比率は、データにあ
たるパケットの(１つや２つではなく)多数に相当する。
低い比率は過大なフロー制御状況の原因となり、潜在的
には、経路およびスケジューリングの観点からオーバー
ヘッドが付加されるサービス・ルーチンを経由して大多
数のデータが処理されることとなる。ＤＬＰＩがフロー
制御を始動させるため実行することは、ただ１つ、その
待ち行列からメッセージを除去することである。そのこ
とによって、上位しきい値を越えるため、上方モジュー
ルはcanput()を停止させ、メッセージをローカルに待ち
行列に記憶させる。クラスタにおいては、Ｓ−ＩＣＳが
どこかにメッセージを書き込み、ＤＬＰＩインスタンス
がメッセージ伝送を停止させるようにフロー制御されて
いることを遠隔ノードに通知しなければならないであろ
う。このＳ−ＩＣＳインスタンスがＤＬＰＩと１対１で
対応付けされていれば、Ｓ−ＩＣＳはメッセージをロー
カルに待ち行列記憶し、その高位しきい値を越えるまで
メッセージの受け取りを続行することができる。Ｓ−Ｉ
ＣＳ高位しきい値の設定は、Ｓ−ＩＣＳ間での制御メッ
セージ交換を避ける上で重要である。以下の点が実行さ
れなければならない。 1. Ｓ−ＩＣＳ高位しきい値に到達すれば、Ｓ−ＩＣＳ
に関する転送中メッセージの受け取りは続けられ、それ
らは待ち行列に記憶される。 2. Ｓ−ＩＣＳはフロー制御を標示する少待ち時間メッ
セージを遠隔Ｓ−ＩＣＳに送出し、フロー制御されるべ
き経路テーブルp_ics_s_ics_statusを更新する。 3. 遠隔Ｓ−ＩＣＳ上方マルチプレクサ・インスタンス
もまた、その高位しきい値を確実に越えるようにputq
(q, mp)を介してその書込み待ち行列上にmblkを記憶す
ることによって、フロー制御状態に置かれる。mblkは、
mblkが多量のデータを含むが実際にはいずれのデータも
存在しないことをその(b_wptr_b_rptr)が示す長さゼロ
のmblkであってよい。このmblkは、Ｓ−ＩＣＳ作成時に
事前割り当てされ、bufcall()処理を避けるためローカ
ルに記憶される。
DLPI has flow control in both directions. In the outbound case, DLPI must not overrun the hardware. DLPI therefore sets its high and low thresholds so that the hardware can keep the number of backtracks to a minimum. That is, the ratio between the high and low thresholds corresponds to many packets' worth of data, not just one or two. A low ratio causes excessive flow control events and can force most of the data to be processed through service routines, which adds overhead in terms of code path and scheduling. All that DLPI has to do to trigger flow control is stop removing messages from its queue. Its high threshold is then exceeded, so canput() fails for the upstream module, which queues messages locally. In a cluster, the S-ICS must store the messages somewhere and notify the remote node that the DLPI instance is flow controlled so that message transmission stops. If the S-ICS instance is mapped one-to-one to the DLPI, the S-ICS can queue messages locally and continue to accept messages until its own high threshold is exceeded. The choice of the S-ICS high threshold is important for avoiding control message exchanges between S-ICS instances. The following must take place:
1. Once the S-ICS high threshold is reached, the S-ICS continues to accept messages that are already in transit and queues them.
2. The S-ICS sends a low-latency message indicating flow control to the remote S-ICS and updates p_ics_s_ics_status in the route table to show that the route is flow controlled.
3. The remote S-ICS upper multiplexer instance is also placed in a flow-controlled state by storing an mblk on its write queue via putq(q, mp), so that its high threshold is certain to be exceeded. This mblk may be a zero-length mblk whose (b_wptr - b_rptr) indicates that it holds a large amount of data although no data is actually present. The mblk is pre-allocated when the S-ICS is created and kept locally to avoid bufcall() processing.

【０２５８】4. 遠隔フロー制御については、ローカル
ＤＬＰＩはその待ち行列およびローカルＳ−ＩＣＳ待ち
行列を放出する。ローカルＳ−ＩＣＳ低位しきい値に達
する時、あるいはＳ−ＩＣＳが再び使用可能にされたこ
とを検知する時、Ｓ−ＩＣＳはもう1つの少待ち時間メ
ッセージを送り出して、フロー制御がもはや存在せず、
遠隔Ｓ−ＩＣＳがメッセージを送出し始めることができ
るということを知らせる。 5. 遠隔Ｓ−ＩＣＳはその制御mblkが見つけられるとメ
ッセージの送出を開始する。その場合、遠隔Ｓ−ＩＣＳ
はそれをローカルに記憶し、それより上のモジュール／
ドライバ・インスタンスを再使用可能にさせる。
4. While the remote side is flow controlled, the local DLPI drains its own queue and the local S-ICS queue. When the local S-ICS low threshold is reached, or when the S-ICS detects that it has been re-enabled, the S-ICS sends another low-latency message indicating that flow control is no longer in effect and that the remote S-ICS may resume sending messages.
5. The remote S-ICS begins sending messages again when it finds its control mblk. It then stores that mblk locally and re-enables the module or driver instances above it.
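By way of illustration only, the high/low threshold behaviour of steps 1 to 5 can be sketched as a small state machine. This is a user-space sketch, assuming byte counts as the queue measure; the names `sics_queue`, `FC_STOP`, and `FC_RESUME` are hypothetical and stand for the low-latency control messages described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical control messages exchanged between S-ICS instances. */
enum fc_msg { FC_NONE, FC_STOP, FC_RESUME };

/* Local S-ICS queue state with high and low thresholds. */
struct sics_queue {
    size_t count;     /* bytes currently queued */
    size_t hiwat;     /* high threshold */
    size_t lowat;     /* low threshold */
    int    flow_ctl;  /* nonzero once FC_STOP has been sent to the peer */
};

/* A message arrives from the interconnect. It is always queued, because
 * messages already in transit must still be accepted (step 1). The return
 * value says whether a stop message must be sent to the peer (step 2). */
enum fc_msg sics_enqueue(struct sics_queue *q, size_t len)
{
    q->count += len;
    if (!q->flow_ctl && q->count >= q->hiwat) {
        q->flow_ctl = 1;
        return FC_STOP;
    }
    return FC_NONE;
}

/* The local driver drains the queue. Once the low threshold is reached,
 * a single resume message is sent to the peer (step 4). */
enum fc_msg sics_dequeue(struct sics_queue *q, size_t len)
{
    q->count = (len > q->count) ? 0 : q->count - len;
    if (q->flow_ctl && q->count <= q->lowat) {
        q->flow_ctl = 0;
        return FC_RESUME;
    }
    return FC_NONE;
}
```

The gap between `hiwat` and `lowat` provides hysteresis: the wider it is, the fewer stop and resume messages cross the interconnect, which matches the remark that a low ratio causes excessive flow control events.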

【０２５９】しきい値が慎重に選択されていれば、Ｓ−
ＩＣＳ間で交換されている制御メッセージの数は最小と
なる。ローカルＤＬＰＩから遠隔ＩＰへの到来メッセー
ジの場合、ＤＬＰＩは上流ストリーム・コンポーネント
へメッセージをputnext()する。この場合、それはロー
カル・ストリームヘッドであり遠隔ＩＰではない。ＤＬ
ＰＩがcanput()を実行する場合、それは、(遠隔ＩＰで
はなく)ローカル・ストリームヘッドがメッセージを受
け取ることを決定することができる。より正確には、制
御スレッドによって認識されなければならないメッセー
ジだけがストリームヘッドに実際に記憶され、その他の
すべてはＳ−ＩＣＳへ直接伝送されなければならない。
この点は以下の通り解決される。
If the thresholds are chosen carefully, the number of control messages exchanged between S-ICS instances is kept to a minimum. For inbound messages from the local DLPI to the remote IP, DLPI performs a putnext() of the message to the upstream stream component, which in this case is the local stream head rather than the remote IP. When DLPI calls canput(), it therefore determines whether the local stream head (not the remote IP) can accept the message. More precisely, only messages that must be seen by the control thread are actually stored at the stream head; everything else must be passed directly to the S-ICS. This is resolved as follows.

【０２６０】1. ドライバがcanput()またはその変形を
実行する時、ローカルＳ−ＩＣＳ書込み待ち行列または
遠隔ＩＰの状態を返される。これは次のように達成され
る。 (a)ストリームヘッド・プレビュー関数がＳ−ＩＣＳ書
込み待ち行列上にcanput()を実行する。(b)canput()の
成否に関係なく、プレビュー関数は常にＳ−ＩＣＳ書込
み待ち行列上にメッセージを置く。(c)canput()が失敗
する時、プレビュー関数はローカル・ストリームヘッド
QFULLフラグを修正する。これによって、すべての将来
のＤＬＰＩcanput()が失敗することとなる。(d)Ｓ−Ｉ
ＣＳがもはやフロー制御されていない時、それはtarget
_queue->target_sthを経由してローカル・ストリームヘ
ッドに通知される。Ｓ−ＩＣＳが１対１の対応関係で設
計されていようが複数対１の対応関係で設計されていよ
うが、これは可能である。後者の場合、Ｓ−ＩＣＳは、
それに対応づけるtarget qurur_index->s_ics_instance
をtarget_queue_indexから取り出し、target_queue->ta
rget_sthを直接更新する。
1. When the driver calls canput() or one of its variants, it is returned the state of the local S-ICS write queue or of the remote IP. This is achieved as follows.
(a) The stream head preview function performs a canput() on the S-ICS write queue.
(b) Regardless of whether the canput() succeeds, the preview function always places the message on the S-ICS write queue.
(c) If the canput() fails, the preview function sets the QFULL flag of the local stream head, which causes all subsequent DLPI canput() calls to fail.
(d) When the S-ICS is no longer flow controlled, the local stream head is notified via target_queue->target_sth. This is possible whether the S-ICS is designed with a one-to-one or a many-to-one mapping. In the latter case, the S-ICS retrieves the corresponding target_queue_index->s_ics_instance mapping from the target_queue_index and updates target_queue->target_sth directly.
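By way of illustration only, steps (a) to (d) can be sketched in user space as follows. The types and functions here (`q_sketch`, `canput_sketch`, `putq_sketch`, `sth_preview`, `sics_release`) are simplified stand-ins introduced for the example and are not the STREAMS DDI routines themselves; in particular, the release step simply empties the S-ICS queue, which is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

#define QFULL_SK 0x01u   /* stand-in for the STREAMS QFULL flag */

/* Simplified queue: a byte count, a high threshold, and a flag word. */
struct q_sketch {
    size_t   count;
    size_t   hiwat;
    unsigned flags;
};

/* Stand-in for canput(): succeeds unless the queue is marked full. */
int canput_sketch(const struct q_sketch *q)
{
    return !(q->flags & QFULL_SK);
}

/* Stand-in for putq(): queues the data and marks the queue full once
 * the high threshold is reached. */
void putq_sketch(struct q_sketch *q, size_t len)
{
    q->count += len;
    if (q->count >= q->hiwat)
        q->flags |= QFULL_SK;
}

/* Stream head preview function for a message arriving from the driver. */
void sth_preview(struct q_sketch *sth_q, struct q_sketch *sics_wq, size_t len)
{
    int ok = canput_sketch(sics_wq);   /* (a) test the S-ICS write queue */
    putq_sketch(sics_wq, len);         /* (b) always queue the message */
    if (!ok)
        sth_q->flags |= QFULL_SK;      /* (c) later driver canput() fails */
}

/* (d) The S-ICS is no longer flow controlled: clear the stream head flag
 * so that the driver's next canput() succeeds again. */
void sics_release(struct q_sketch *sth_q, struct q_sketch *sics_wq)
{
    sics_wq->count = 0;
    sics_wq->flags &= ~QFULL_SK;
    sth_q->flags &= ~QFULL_SK;
}
```

The driver is unchanged: it keeps calling its usual canput() against the stream head and sees the state of the S-ICS write queue reflected there.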

【０２６１】2. 遠隔Ｓ−ＩＣＳ３８Ａがフロー制御さ
れた場合、ローカルＳ−ＩＣＳ３８だけがフロー制御さ
れる。遠隔Ｓ−ＩＣＳは、上方ｍｕｘインスタンスのca
nput()が失敗する時フロー制御される。これが発生する
と、遠隔Ｓ−ＩＣＳはメッセージをその読取り待ち行列
上に記憶する。遠隔Ｓ−ＩＣＳは、それがフロー制御さ
れていることあるいは上位しきい値に達するまでメッセ
ージの受領を続行することをローカルＳ−ＩＣＳに通知
することを選択することができる。Ｓ−ＩＣＳ間で生成
される制御メッセージを最低限に保ち、アプリケーショ
ンが可能な限り迅速にデータを受け取れるように維持す
る上で、この上位しきい値の選択は重要である。
2. The local S-ICS 38 is flow controlled only when the remote S-ICS 38A is flow controlled. The remote S-ICS becomes flow controlled when the canput() of its upper mux instance fails. When this happens, the remote S-ICS queues messages on its read queue. The remote S-ICS may choose either to notify the local S-ICS that it is flow controlled or to continue accepting messages until its high threshold is reached. The choice of this high threshold is important for keeping the control messages generated between S-ICS instances to a minimum and for keeping data flowing to the application as quickly as possible.

【０２６２】3. ローカルＳ−ＩＣＳがフロー制御メッ
セージを受け取る時、送出経路上に使用されたものと同
じmblk技術を使用するようにローカル・フロー制御を強
制することもできる。本実施形態ではＤＬＰＩがこの状
況を取り扱うことができるように設計され、フロー制御
の解放とともに既に受け取ってあるメッセージを送り出
すことを保証することができるメカニズムが適用されて
いるので、この方法の使用は望ましい。
3. When the local S-ICS receives the flow control message, it can force local flow control using the same mblk technique that is used on the outbound path. Using this technique is desirable because, in this embodiment, DLPI is designed to handle this situation, and a mechanism is in place to ensure that messages already received are delivered once flow control is released.

【０２６３】4. 遠隔ＩＰに対するフロー制御が解放さ
れると、ＩＰは遠隔Ｓ−ＩＣＳを再使用可能にし、次に
遠隔Ｓ−ＩＣＳはフロー制御を解放するため遠隔Ｓ−Ｉ
ＣＳに対する制御メッセージを直ちに生成するか、ある
いは下位いきい値に達するまで待機する。直ちにメッセ
ージを生成する利点は、ローカル・メッセージを解放す
るために必要な時間量が、メッセージを生成してそれを
送り出し再びデータ転送を開始する時間に比較して、非
常に小さい点である。従って、スタックおよびアプリケ
ーションによる遅延が秘匿される。
4. When flow control for the remote IP is released, IP re-enables the remote S-ICS, which then either immediately generates a control message to the peer S-ICS to release flow control or waits until its low threshold is reached. The advantage of generating the message immediately is that the time needed to drain the locally queued messages is very small compared with the time needed to generate the control message, send it, and restart the data transfer. The delay attributable to the stack and the application is therefore masked.

【０２６４】以下はいくつかの留意すべきポイントであ
る。Ｓ−ＩＣＳインスタンス間の制御メッセージは少待
ち時間でなくてはならず、ノード間に確実に送り出され
なければならない。これらのメッセージが消失すると、
スタックはフロー制御されたままであり、アプリケーシ
ョンは待機状態を続ける。Ｐ−ＩＣＳが輻輳制御が原因
でメッセージを消失すれば、Ｓ−ＩＣＳは、この状況を
処理するため要求応答プロトコールを必要とする(これ
は当然処理性能を低下させる)。このプロトコールは、
また、応答が迅速に戻されない場合要求を再転送させる
ために使用されるタイムアウトを必要とする。
The following points should be kept in mind. Control messages between S-ICS instances must be low latency and must be delivered reliably between nodes. If one of these messages is lost, the stack remains flow controlled and the application keeps waiting. If the P-ICS can lose messages because of congestion control, the S-ICS needs a request/response protocol to handle the situation (which of course degrades performance). This protocol also needs a timeout so that a request is retransmitted if the response does not come back promptly.
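By way of illustration only, the timeout side of such a request/response protocol can be sketched as follows. The text does not specify the protocol; the structure `fc_request`, the sequence-number matching, and the fixed retransmission interval are all assumptions introduced for the example, and time is represented by a caller-supplied tick count.

```c
#include <assert.h>

/* One outstanding flow control request awaiting a response. */
struct fc_request {
    unsigned      seq;       /* sequence number carried in the request */
    unsigned long deadline;  /* tick at which the request times out */
    unsigned      retries;   /* number of retransmissions so far */
    int           acked;     /* nonzero once the response has arrived */
};

/* Record that a request with the given sequence number was just sent. */
void fc_request_start(struct fc_request *r, unsigned seq,
                      unsigned long now, unsigned long timeout)
{
    r->seq = seq;
    r->deadline = now + timeout;
    r->retries = 0;
    r->acked = 0;
}

/* A response arrived; it only counts if it matches the request. */
void fc_request_ack(struct fc_request *r, unsigned seq)
{
    if (seq == r->seq)
        r->acked = 1;
}

/* Called periodically. Returns 1 when the caller should retransmit the
 * request because no matching response arrived before the deadline. */
int fc_request_poll(struct fc_request *r, unsigned long now,
                    unsigned long timeout)
{
    if (r->acked || now < r->deadline)
        return 0;
    r->retries++;
    r->deadline = now + timeout;
    return 1;
}
```

A real implementation would also bound the number of retries and report a route failure to the control thread when the bound is exceeded; that policy is left out here.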

【０２６５】Ｓ−ＩＣＳ制御メッセージは少待ち時間で
なければならない。ＰーＩＣＳがそのような概念をサポ
ートしない場合、制御メッセージを大量データ転送の後
につなげることもできる。これが可能であれば、Ｓ−Ｉ
ＣＳがメモリを使いすぎないようにＳ−ＩＣＳ上位およ
び下位しきい値を設定しなければならない。Ｓ−ＩＣＳ
設計の複雑性も増すかもしれない。
S-ICS control messages must be low latency. If the P-ICS does not support such a concept, control messages may end up queued behind bulk data transfers. If this can happen, the S-ICS high and low thresholds must be set so that the S-ICS does not use too much memory. The design of the S-ICS may also become more complex.

【０２６６】Ｐ−ＩＣＳがデータの送信元による管理形
態を提供するとすれば、送信元が使用可能なバッファ空
間を持っていれば制御メッセージが常に送り出されるこ
とができるので、輻輳制御問題は絶対存在しない。更
に、送信元による管理形態は、受取りバッファを細分化
するかこれらのタスクのために専用バッファを用意する
ことができるので、一般的にいえば少待ち時間メッセー
ジを許容する。このように、送信元管理型の相互接続を
使用すれば、分散ＳＴＲＥＡＭＳの実施は一層容易で
ある。
If the P-ICS provides a sender-managed scheme, congestion control is never a problem, because a control message can always be sent as long as the sender has buffer space available. In addition, a sender-managed scheme generally allows low-latency messages, because the receive buffers can be partitioned or dedicated buffers can be set aside for these tasks. Distributed STREAMS is therefore easier to implement over a sender-managed interconnect.

【０２６７】フロー制御の一部ではないが、ＤＤＩルー
チンnoenable()およびenableok()を使用して、モジュー
ル／ドライバがメッセージを処理するか否かが制御され
る。これらのルーチンはQNOENBフラグを検査する。通
常、このフラグはアクセスされているローカル待ち行列
上でのみ使用され、スタックの次の待ち行列には使用さ
れない。(望ましくはないが)遠隔アクセスに関して使用
されるならば、Ｓ−ＩＣＳは、QFULLフラグに関して同
様の制御メッセージ論理を使用する必要がある。
Although they are not part of flow control, the DDI routines noenable() and enableok() control whether a module or driver processes messages. These routines check the QNOENB flag. Normally this flag is used only on the local queue being accessed and not on the next queue in the stack. If it is used for remote access (which is not desirable), the S-ICS needs to use control message logic similar to that used for the QFULL flag.

【０２６８】5.0 分散ストリームの作成分散ＳＴＲＥＡＭは、(1)１つのノード上に存在するが
クラスタ全体の機構を利用するＳＴＲＥＡＭ、(2)アプ
リケーションとは異なるノード上で実行されるストリー
ム、(3)複数のノードにわたって分割されたコンポーネ
ントを持つストリーム、および(4)各々のエンドポイン
トが異なるノード上で実行されるストリーム型パイプを
含むストリーム、という４つのカテゴリの１つに属す
る。これらタイプの各々がどのように作成されるかを以
下に記述する。
5.0 Creating Distributed Streams
A distributed stream falls into one of four categories: (1) a stream that resides on one node but uses cluster-wide facilities; (2) a stream that runs on a different node from the application; (3) a stream whose components are split across multiple nodes; and (4) a stream containing a STREAMS-based pipe whose endpoints run on different nodes. How each of these types is created is described below.

【０２６９】5.1 クラスタ全体の機構ストリームこのタイプのストリームは、上述のグロ―バル・ポート
対応付けの例において示されている。このストリームの
スタックの種々のコンポーネントは、どのようなサービ
スが必要とされるかに応じて、制御スレッドまたはおそ
らくＳ−ＩＣＳと通信するため拡張される。このような
拡張を実施するためには、制御スレッドは、適切な拡張
を実行することができるように、ストリームが作成され
ていることを通知されなければならない。これは次のよ
うに達成される。
5.1 Cluster-Wide Facility Streams
This type of stream was illustrated in the global port mapping example above. The various components of the stream's stack are extended to communicate with the control thread, or possibly the S-ICS, depending on what services are needed. For these extensions to be applied, the control thread must be notified that the stream is being created so that it can perform the appropriate extensions. This is accomplished as follows.

【０２７０】1. アプリケーションは、ファイルのオー
プンによってストリーム型ドライバをオープンする。こ
のファイルの形式は、例えば、/dev/driver_nameであ
る。クラスタがファイル・システムの単一システム視点
を提供していれば、オープン・プログラムは、ノード特
有のdev_t構造を取り出す。dev_tは、オープンされつつ
ある装置のメジャーおよびマイナー番号を定義する。こ
の構造を検査することによって、ＳＴＲＥＡＭＳフレー
ムワークは、ドライバのストリームタブ・エントリのqi
nit構造内に記憶されている対応するドライバ・オープ
ン・ルーチンを実行する。
1. The application opens the STREAMS driver by opening a file, for example /dev/driver_name. If the cluster provides a single-system view of the file system, the open code retrieves the node-specific dev_t structure. The dev_t defines the major and minor numbers of the device being opened. By examining this structure, the STREAMS framework executes the corresponding driver open routine stored in the qinit structure of the driver's streamtab entry.

【０２７１】2. クラスタ化が制御スレッドの作成を介
して使用可能とされると、ＳＴＲＥＡＭＳフレームワー
クの範囲内のグロ―バル構造はTRUE(真)に設定され、マ
スタ制御スレッドのアドレス・データが有効となる。オ
ープン処理の間フレームワーク・プログラムはこのグロ
ーバル構造を検査する。もしFALSE(偽)であれば、オー
プン処理は非クラスタ化環境の場合と同様に進められ
る。TRUE(真)であれば、オープン・プログラムは、この
制御スレッドにオープンされつつあるdev_tを渡し、応
答するように要求する。
2. When clustering is enabled through the creation of the control thread, a global structure within the STREAMS framework is set to TRUE and the address data of the master control thread becomes valid. During open processing, the framework code checks this global structure. If it is FALSE, the open proceeds as in a non-clustered environment. If it is TRUE, the open code passes the dev_t being opened to the control thread and asks it to respond.

【０２７２】3. 制御スレッドはdev_tを検査して、処理
すべきと想定するドライバとdev_tを比較する。ドライ
バが特別の処理を必要としなければ、制御スレッドは呼
び出し元へ制御を単に戻し、オープン・プログラムは従
来技術の通り通り進行する。特別のものであれば、制御
スレッドは、関連するドライバ・オープン方針を起動す
る。
3. The control thread examines the dev_t and compares it against the drivers it is expected to handle. If the driver needs no special handling, the control thread simply returns control to the caller and the open proceeds as in the prior art. If the driver is special, the control thread invokes the associated driver open policy.

【０２７３】4. ドライバ・オープン方針は、オープン
をどのように実行すべきか、ストリームが分割ストリー
ムとして作成されるべきか否か、ストリームが作成され
すべてのモジュールがautopush(待ち行列への自動書き
込み)が完了した時どのような関数が起動されるべきか
を標示する。このストリームに関して、この方針は、オ
ープンがローカルに実行されるべきこと、そして、スト
リーム・スタックが完了されたならば以下の関数が起動
されなければならないことを示す。
4. The driver open policy indicates how the open should be performed, whether the stream should be created as a split stream, and which functions should be invoked once the stream has been created and all modules have been autopushed. For this type of stream, the policy indicates that the open is to be performed locally and that the following function must be invoked once the stream stack is complete.

【０２７４】5. グロ―バル・ポート対応付けの例の場
合、ＴＣＰおよびＩＰコードを拡張する関数が呼び出さ
れる。フレームワークがそのオープン・タスクのすべて
を完了する直前に、この関数が起動される(これは制御
スレッドが関数のいかなる仕様をも理解する必要なしに
実行される、すなわち、この関数は実施上制御スレッド
から独立している）。
5. In the global port mapping example, a function that extends the TCP and IP code is called. This function is invoked just before the framework completes all of its open tasks. (This is done without the control thread needing to understand any specifics of the function; that is, the function is implementation-independent of the control thread.)
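By way of illustration only, the open-time decision described in steps 2 to 5 can be sketched as follows. This is a user-space sketch: the policy table, the major number 17, and the names `open_policy`, `cluster_enabled`, `control_thread_lookup`, and `framework_open` are assumptions introduced for the example, and the driver open and autopush steps are represented only by a comment.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int major_sk_t;   /* stand-in for a device major number */

/* Hypothetical driver open policy kept by the control thread. */
struct open_policy {
    major_sk_t major;              /* driver the policy applies to */
    int        split_stream;       /* create the stream as a split stream? */
    void     (*post_open)(void);   /* run once the stack is complete */
};

int cluster_enabled = 0;   /* global set TRUE when the control thread exists */
int post_open_calls = 0;   /* counts invocations, for demonstration only */

/* Stand-in for the function that extends the TCP and IP code. */
static void extend_tcp_ip(void)
{
    post_open_calls++;
}

static struct open_policy policies[] = {
    { 17, 0, extend_tcp_ip },      /* 17 is an arbitrary example major */
};

/* Control thread: is this a driver that needs special handling? */
const struct open_policy *control_thread_lookup(major_sk_t major)
{
    for (size_t i = 0; i < sizeof policies / sizeof policies[0]; i++)
        if (policies[i].major == major)
            return &policies[i];
    return NULL;
}

/* Framework open path. Returns 1 if a policy was applied, 0 if the open
 * proceeded exactly as in a non-clustered environment. */
int framework_open(major_sk_t major)
{
    const struct open_policy *p;

    if (!cluster_enabled)
        return 0;
    p = control_thread_lookup(major);
    if (p == NULL)
        return 0;
    /* ... driver open and autopush of modules would happen here ... */
    if (p->post_open != NULL)
        p->post_open();            /* invoked just before the open completes */
    return 1;
}
```

The control thread only stores and calls the function pointer; it needs no knowledge of what the function does, which mirrors the independence noted in step 5.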

【０２７５】このタイプのストリームはドライバあるい
はモジュールの実施形態の変更を必要としないし、制御
スレッドは、このスタックによって実施されている下部
構造のプロトコールを知らないままでよい。
This type of stream requires no changes to the driver or module implementations, and the control thread can remain unaware of the underlying protocol implemented by the stack.

【０２７６】5.2 異なるノード・ストリームアプリケーションが実行している場所と異なるノードに
ストリーム・スタックが存在しているとすれば、制御ス
レッド３４は、このノードがどこにあるかを判断し、相
互接続を確立することができなければならない。これは
以下のように達成できる(図１６参照)。
5.2 Different-Node Streams
If the stream stack resides on a different node from the one on which the application is running, the control thread 34 must be able to determine where that node is and establish the interconnect. This can be accomplished as follows (see FIG. 16).

【０２７７】1. The open proceeds as in the prior art, except that the dev_t returned by the VFS is that of the local S-ICS. In addition, this dev_t uses a special device major number, similar to a clone major number. As with a clone device, the minor number is the encoded major number of the real device driver being opened. The control thread sees that the S-ICS is being opened and extracts the real major number by calling minor(dev_t).
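
The encoding in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the 8/24-bit split of the device number, the value of SICS_MAJOR, and every sketch_* and sics_* name are hypothetical and do not come from the patent or from any real kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit device number: 8-bit major, 24-bit minor. */
typedef uint32_t sketch_dev_t;

#define SKETCH_MINOR_BITS 24u
#define SKETCH_MINOR_MASK ((1u << SKETCH_MINOR_BITS) - 1u)
#define SICS_MAJOR        200u  /* assumed major number reserved for the S-ICS */

sketch_dev_t sketch_makedev(uint32_t maj, uint32_t min)
{
    return (maj << SKETCH_MINOR_BITS) | (min & SKETCH_MINOR_MASK);
}

uint32_t sketch_major(sketch_dev_t d) { return d >> SKETCH_MINOR_BITS; }
uint32_t sketch_minor(sketch_dev_t d) { return d & SKETCH_MINOR_MASK; }

/* What the VFS would return for a remote driver: the S-ICS major,
 * with the real driver's major number carried in the minor field. */
sketch_dev_t sics_wrap(uint32_t real_major)
{
    return sketch_makedev(SICS_MAJOR, real_major);
}

/* Control-thread side: returns 1 and stores the real major number when
 * dev names the S-ICS, or returns 0 and leaves *real_major untouched. */
int sics_unwrap(sketch_dev_t dev, uint32_t *real_major)
{
    assert(real_major != 0);
    if (sketch_major(dev) != SICS_MAJOR)
        return 0;
    *real_major = sketch_minor(dev);
    return 1;
}
```

The same trick is what makes clone-style opens cheap: no extra lookup table is needed to recover the driver being opened, because the number travels inside the dev_t itself.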

【０２７８】2. Next, the control thread 34 examines the open policy for the major number to decide how to determine the node on which the open should be performed. The policy may be, for example, any of the following:
* Look up a local data structure, built during the control thread and cluster initialization steps, to find the corresponding node identifier.
* Broadcast a request to the other control threads to determine which threads have the requested driver and which is the best choice in terms of load.
* Contact a middleware entity that should hold equivalent data.
In each case the control thread can determine, depending on the implementation, the target node address, the remote control thread address, and the remote S-ICS address.
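
Two of the policies listed in step 2 can be sketched as follows. The structures and function names are hypothetical stand-ins chosen for illustration; the patent does not define the table layout or the format of a broadcast reply, and "lowest load value wins" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry of the table built during cluster initialization. */
struct table_entry {
    unsigned major;    /* real driver major number */
    int      node_id;  /* node that hosts the driver */
};

/* Hypothetical answer from one remote control thread to a broadcast. */
struct node_reply {
    int node_id;
    int has_driver;    /* nonzero if the node hosts the requested driver */
    int load;          /* assumed metric: lower is better */
};

/* Policy 1: local table lookup. Returns the node id, or -1 if unknown. */
int lookup_node(const struct table_entry *tab, size_t n, unsigned major)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].major == major)
            return tab[i].node_id;
    return -1;
}

/* Policy 2: among the broadcast replies, pick the least loaded node that
 * actually has the driver. Returns the node id, or -1 if none qualifies. */
int pick_least_loaded(const struct node_reply *r, size_t n)
{
    int best = -1;
    int best_load = 0;
    for (size_t i = 0; i < n; i++) {
        if (!r[i].has_driver)
            continue;
        if (best < 0 || r[i].load < best_load) {
            best = r[i].node_id;
            best_load = r[i].load;
        }
    }
    return best;
}
```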

【０２７９】3. The control thread 34 contacts the remote control thread 34A and passes it the driver name, that is, a dev_t whose minor number now serves as the major number.

【０２８０】4. The remote control thread performs an in-kernel streams_open() of the given driver. As in a non-clustered environment, this open carries out any configured autopush activity. When the open completes, the remote thread returns a message containing the remote S-ICS address to be used and the associated remote_target_index.

【０２８１】5. The control thread now knows that the open can proceed, and it opens the local S-ICS driver using the standard framework open code.

【０２８２】6. As with the function extension described above, the control thread stores a vector of functions to be executed just before the open routine finally returns. The functions invoked send a message to the local S-ICS using the remote_target_index. This serves two purposes. First, it lets the local S-ICS replace the associated dev_t in sth->sth_cluster_route_id with the correct route-table index. Second, it lets the remote S-ICS update its routing table and, if necessary, signal that messages are being sent between the two nodes of the stream. This completes the open.
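
The stored function vector of step 6 can be sketched as below. Everything here is a hypothetical illustration: the patent names only sth_cluster_route_id and remote_target_index, so open_ctx, post_open_add(), post_open_run() and the route_slot stand-in are assumptions, and stopping at the first failing hook is a design choice of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POST_OPEN 4

typedef int (*post_open_fn)(void *arg);  /* returns 0 on success */

/* Hypothetical per-open context holding the deferred functions. */
struct open_ctx {
    post_open_fn fn[MAX_POST_OPEN];
    void        *arg[MAX_POST_OPEN];
    size_t       count;
};

/* Register a function to run just before the open routine returns. */
int post_open_add(struct open_ctx *c, post_open_fn f, void *arg)
{
    assert(c != NULL && f != NULL);
    if (c->count == MAX_POST_OPEN)
        return -1;
    c->fn[c->count] = f;
    c->arg[c->count] = arg;
    c->count++;
    return 0;
}

/* Run the registered functions in order; stop at the first failure. */
int post_open_run(struct open_ctx *c)
{
    for (size_t i = 0; i < c->count; i++) {
        int rc = c->fn[i](c->arg[i]);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Example hook: overwrite the stored device value with the route index,
 * standing in for the update of sth->sth_cluster_route_id. */
struct route_slot   { unsigned route_id; };
struct route_update { struct route_slot *slot; unsigned remote_target_index; };

int set_route_index(void *arg)
{
    struct route_update *u = arg;
    u->slot->route_id = u->remote_target_index;
    return 0;
}
```

Because the framework only calls through the vector, it never needs to know what a hook does, which matches the independence between framework and control thread described in the text.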

【０２８３】The advantages are as follows. STREAMS system calls need no modification to work. Complexity does not increase. Coordination work between nodes does not increase. No new potential timing problems are introduced. Even select() and poll() work as before. If other operating system components are to be handled in the same way, an S-ICS equivalent can be built to provide the same kind of function dispatch. If that is not done, file_ops can be remapped at open time to the STREAMS file_ops, as in a non-clustered environment, and the other components can operate as in the alternative environment. When a system call is executed in a non-clustered environment, the configuration causes no performance degradation.

【０２８４】Because the configuration uses existing functionality and algorithms, no additional work is needed to support it. Synchronous system calls do not impose undue complexity or modification on any component, because the actual waiting takes place on the local system rather than remotely. For example, if an application requests a 2000-byte read and only 1000 bytes have been received, the system call sleeps until the remaining 1000 bytes arrive or an interrupt occurs. In the alternative embodiment described below, the migration software would have to capture the local stack-frame variables and save that information so that the system call could resume in the same state. Because system calls take place on the local node via the S-ICS, they never need to be restarted and behave as in the prior art.

【０２８５】Node failure detection and recovery may also be easier to implement transparently to the application, because system calls execute on the local node. This means that once the stack has been re-created, the local stack-frame variables can be reused to restart the system call. The configuration described below would not have this capability. getmsg(), getpmsg(), putmsg(), putpmsg(), poll() and select() require no modification to work in this configuration, and they also work when the stack is local. These functions are singled out because only select() involves the file_ops function vector, while the rest operate as ordinary system calls.

【０２８６】The disadvantages are as follows. Data may have to pass through more components, which can cause performance problems. The configuration is not the same for STREAMS and non-STREAMS devices, so there may be two system-call paths to handle. This is not necessarily bad, but it does mean that troubleshooting must take the situation into account.

【０２８７】Alternative embodiment: The scenario and system configuration described above are not the only way to practice the invention. An alternative system configuration and algorithm are discussed below with reference to FIG. 17. This alternative configuration is built as follows.
1. The VFS must be modified to recognize that the target device resides on a remote node, and a mechanism must be created to redirect the open to that node.
2. On the remote node there must be a kernel thread, similar to the control thread, that performs the actual open.
3. For the most part the open proceeds as usual, and the control thread is involved only when a local component needs to use cluster-wide facilities.

【０２８８】The advantages are as follows. At least on the surface, fewer components are involved in sending a message. This configuration and implementation may be preferable for non-STREAMS drivers.

【０２８９】The disadvantages are as follows. All STREAMS system calls and handling functions must be modified to deal with moving data into an intermediate structure for routing to the remote node. For example, read() and write() use copyin() and copyout() to move data between user space and kernel space via the uio structure. The user-space addresses held in this structure would now have to be changed to kernel-space addresses, and copyin() and copyout() would have to be replaced by lbcopy() with a kernel sid. A new kernel thread must exist to handle all synchronous requests, which can cause excessive context switching and degrade performance. The thread is needed because the P-ICS cannot be held until all the data required to satisfy the request can be processed by the local stream stack.

【０２９０】This configuration requires a distinct middleware entity to be involved in handling migration and management issues, whereas the earlier configuration relied only on the control thread for these tasks. Unless the VFS function vector is replaced with a cluster-specific function vector, rather than pushing the decision down into the specfs layer (the special-file interface code layer), a non-clustered environment pays a performance penalty just to determine whether it is clustered. Which configuration to use depends on how the rest of the operating system is modified for clustering and on actual performance. As a first step it is desirable to implement the first system configuration, because it takes less work and needs no coordination with the rest of the operating system. The rest of the operating system could then decide to create S-ICS equivalents to handle this situation and so avoid the problems that might otherwise be encountered. If this is not done, the non-clustered environment suffers a performance degradation that is unacceptable.

【０２９１】5.3 Split-Component Streams
A split-component stream is simply a special instance of the different-node stream configuration, so its open processing is likewise a special case. The steps are as follows.
1. Step 1 is the same.
2. When the control thread examines the open policy for the major number, it learns that a split-stream configuration is to be created. It contacts the remote target node and passes it the driver name, that is, a dev_t whose minor number now serves as the major number. As in step 4 above, the remote control thread opens the driver and carries out any configured autopush activity. When this completes, the lower part of the stream stack has been built.
3. The local control thread opens the S-ICS as before. The only difference is that the upper part of the stack, that is, any local modules, is now pushed on top of the S-ICS; the stack is then complete, and the associated messages are generated and processed.

【０２９２】Thus there is no great difference. An example is splitting a TCP/IP/DLPI stack at the TCP/IP level rather than at the IP/DLPI level. This allows IP to be opened where the local DLPI is connected directly beneath IP. As noted above, such a split is not preferred, but it should be considered in the multiplexer case. The preferred split at the IP/DLPI level can also be applied to a multiplexer whose components reside on different nodes.

【０２９３】Note that a multiplexer is created by opening two devices and joining them with an I_LINK or P_LINK ioctl call. In this case one component, for example IP, is local and the other is remote, and the open processing described above applies. The open for IP takes place on the local node. For DLPI, the open takes place on the remote node, with a local stream head that actually belongs to the S-ICS. Once the opens are complete, the application issues the usual link command, and the components can communicate.

【０２９４】5.3.1 Creating a Split Stream from a Single Stream Stack
For load balancing and potential hardware sharing, it may become necessary at some point to split a stack into components that run on different nodes. In most cases this is achieved by using parts of the open algorithm described above together with the migration facility. The basic approach is as follows.

【０２９５】1. The control thread is notified that a stream stack needs to be split into its components. The message indicating the split also indicates the nodes on which the components will reside after the split (the local node and a remote node, or all remote nodes).

【０２９６】2. Using the in-kernel STREAMS interface, the control thread issues a STREAMS framework-specific ioctl to the stream head via streams_ioctl(). The ioctl freezes the stream and, depending on where the split is to occur, invokes and creates all the necessary components. For a split between IP and DLPI, all that has to be collected is the private data and state of the TCP or UDP module. The ioctl returns a migration structure (described later) containing this information. Note: for a split at the IP/DLPI level, IP will eventually be flow-controlled on the inbound path; at that point IP either starts discarding packets or notifies DLPI to stop sending for this address.

【０２９７】3. When the ioctl returns, the control thread migrates the components to their respective nodes.
4. In the hardware-sharing case, some components stay local while the upper components migrate; for example, the DLPI part stays while TCP/IP moves to a new node. Alternatively, the local hardware card may have failed, and DLPI is re-created on a new node. In these situations further steps must be added.

【０２９８】5. If DLPI is being migrated, IP needs to have a DLPI linked beneath it. This requires the control thread to replace the old DLPI with an S-ICS instance that can communicate with the new remote DLPI. This can be done in either of two ways: (1) the control thread issues unlink/link commands, in which case the initialization messages from IP to DLPI must be recognized and acted on appropriately; or (2) the control thread issues an ioctl to the S-ICS that causes the framework structures on IP to be modified to point to the S-ICS for this path. The first approach is probably the safest and the simplest to implement. Note: once DLPI has migrated, an ARP must be issued so that remote drivers learn the new mapping from physical to logical addresses.

【０２９９】6. If TCP/IP is migrated, the control thread performs the same steps as in the global-mapping example. It creates an S-ICS upper mux instance, ensures that the route is uniquely determined, then performs a local bind and updates the IP fan-out table to point to this S-ICS. All incoming packets are thereby routed correctly through the S-ICS to the remote TCP/IP.

【０３００】7. The control thread then exchanges messages with the remote node to activate the route and confirm the remote_target_index, and closes the local TCP/IP stack to release its current resources. When TCP/IP is migrated, the other VFS state is assumed to be updated by the VFS migration software. Whatever accesses TCP or IP, including sockets if that is all that is attached, has its instance-specific data migrated and its state re-created on the target node. This requires cooperation between the control thread and the VFS or socket migration software, at least up to the point where the stream-head mapping structures are re-created.

【０３０１】8. If an error occurs, the stack still exists up until the final confirmation message, so the control thread simply cleans up the resources it allocated and releases the TCP/IP components. The connection then continues to run as before.

【０３０２】5.4 STREAMS-Based Pipes
STREAMS-based pipes raise a problem that ordinary stream drivers do not: a STREAMS-based pipe carries no driver information that could be used to obtain a remote node address. Consequently, a pipe whose endpoints are on different nodes cannot be opened transparently without outside intervention. There are two approaches. The first is to create the pipe as usual and then migrate one endpoint to another node. The second is to remap the pipe routine to a cluster-specific function that opens a pipe endpoint on each node. With either approach there must be an external entity that obtains the node data and does the appropriate work. Whether the second approach is worthwhile is open to question. The first approach is preferred, because it reuses the given migration facility and because an application that uses a pipe receives both endpoints and then hands one of them to another application. The design of this embodiment therefore assumes that the pipe() code produces two file descriptors, two file pointers, and two stream heads whose read and write queues are cross-connected. Pipe migration is described in detail later.

【０３０３】6.0 Stream Migration
STREAMS migration is the movement of a stream stack, as a whole or component by component, from one node to another. The details are described below using two configurations: one in which the entire stack resides on a node separate from the application, and one in which the stack is split between two nodes.

【０３０４】6.1 Marshalling Functions
For each module or driver that is to be migrated, the developer must write two functions that are entirely module- or driver-specific. Within STREAMS, each queue contains a pointer (q_ptr) to a private data structure that the developer can use to store implementation-specific data for that queue. Because this structure may hold state-specific information such as timers, packets awaiting retransmission and protocol state, the data must be migrated to the new node and used to re-establish the module or driver in its original state. The developer has to write these marshalling functions because STREAMS cannot know what q_ptr points to.

【０３０５】These functions must meet the following requirements. They must provide two capabilities: collecting the private data and restoring it. Data stored in mblks is migrated by the STREAMS framework. The functions must be callable by the STREAMS framework without restriction. Each function must take two parameters: a pointer to the queue and a pointer to the marshall data structure. The function uses the queue address to dereference q_ptr and OTHERQ(q)->q_ptr, which give access to the private data held for the queue pair. Each function allocates, with MALLOC(), enough memory to hold all of its private data that is not contained in mblks. The allocation must use the M_WAITOK flag to make sure that it succeeds. The memory is released with FREE().

【０３０６】Within this allocated memory, the function lays out all data contiguously using the ABI (Application Binary Interface) format. For example, if there are four integers and one pointer to a private structure, the integers are copied into memory in multiples of four bytes, and the contents the pointer refers to must be copied byte by byte. This allows the data to be transferred as a byte stream and reassembled on the remote node. The function must cancel all pending timers and record the time remaining; when the component has been migrated, the function must restart the timers. The function must also cancel any pending bufcall() operations and reissue the requests on the new node. Because bufcall() uses a callback function, that function must exist on the new node. The function addresses are stored in the queue_t structure, as shown in Table 29 below.
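
The layout rule above (four integers followed by the contents of a pointed-to structure, copied as a byte stream) can be sketched as a marshall/demarshall pair. This is a user-space illustration only: mod_private, conn_extra and both function names are hypothetical, malloc() stands in for the kernel MALLOC(), and the fixed big-endian four-byte encoding is an assumption standing in for whatever ABI format a real implementation would use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct conn_extra { uint8_t bytes[6]; };   /* hypothetical pointed-to data */

struct mod_private {                        /* hypothetical q_ptr contents */
    int32_t state, retries, window, timer_left_ms;  /* four integers */
    struct conn_extra *extra;               /* pointer: its target is copied */
};

#define MARSH_SIZE (4 * 4 + sizeof(struct conn_extra))

static void put_be32(uint8_t *p, int32_t v)
{
    uint32_t u = (uint32_t)v;
    p[0] = (uint8_t)(u >> 24); p[1] = (uint8_t)(u >> 16);
    p[2] = (uint8_t)(u >> 8);  p[3] = (uint8_t)u;
}

static int32_t get_be32(const uint8_t *p)
{
    uint32_t u = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return (int32_t)u;
}

/* Collect: flatten the private data into one contiguous buffer.
 * Returns NULL if allocation fails; the caller frees the buffer. */
uint8_t *marshall_private(const struct mod_private *pd)
{
    assert(pd != NULL && pd->extra != NULL);
    uint8_t *buf = malloc(MARSH_SIZE);
    if (buf == NULL)
        return NULL;
    put_be32(buf + 0,  pd->state);
    put_be32(buf + 4,  pd->retries);
    put_be32(buf + 8,  pd->window);
    put_be32(buf + 12, pd->timer_left_ms);
    memcpy(buf + 16, pd->extra->bytes, sizeof pd->extra->bytes);
    return buf;
}

/* Restore: rebuild the private data on the new node. The pointer is
 * re-established locally; the old address is never transmitted. */
void demarshall_private(const uint8_t *buf, struct mod_private *pd,
                        struct conn_extra *extra)
{
    pd->state         = get_be32(buf + 0);
    pd->retries       = get_be32(buf + 4);
    pd->window        = get_be32(buf + 8);
    pd->timer_left_ms = get_be32(buf + 12);
    memcpy(extra->bytes, buf + 16, sizeof extra->bytes);
    pd->extra = extra;
}
```

Copying the target of the pointer rather than the pointer value is the essential point: an address from the old node is meaningless on the new one. The recorded timer_left_ms field plays the role of the "time remaining" that the text says must be saved when a timer is cancelled.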

【０３０７】[0307]

【表２９】[Table 29]
struct queue {
    struct qinit  *q_qinfo;     /* procedures and limits for the queue */
    struct msgb   *q_first;     /* head of the message queue */
    struct msgb   *q_last;      /* tail of the message queue */
    struct queue  *q_next;      /* next queue in the stream */
    struct queue  *q_link;      /* link to the scheduling queue */
    void          *q_ptr;       /* to the private data structure */
    ulong          q_count;     /* weighted character count on the queue */
    ulong          q_flag;      /* queue state */
    long           q_minpsz;    /* minimum packet size accepted */
    long           q_maxpsz;    /* maximum packet size accepted */
    ulong          q_hiwat;     /* high-water mark for flow control */
    ulong          q_lowat;     /* low-water mark for flow control */
    struct qband  *q_bandp;     /* band information */
    unsigned char  q_nband;     /* number of bands */
    unsigned char  q_pad1[3];   /* reserved */
    struct queue  *q_other;     /* pointer to the other queue of the pair */
    int32 (*q_marshall_func)(queue_t *q, struct marshall *marsh_ptr);
    int32 (*q_demarshall_func)(queue_t *q, struct marshall *marsh_ptr);
    QUEUE_KERNEL_FIELDS
};

【０３０８】The marshall data structure is shown in Table 30 below. Note: not all fields are filled in by the marshalling functions; some are filled in by the STREAMS framework.

【０３０９】[0309]

【表３０】[Table 30]
struct marshall {
    struct streamtab *str_tab;
    caddr_t           str_frame_wk;
    caddr_t           q_private_data;
    mblkP             read_mp;
    mblkP             write_mp;
    mblkP             read_mp_private;
    mblkP             write_mp_private;
};

【０３１０】str_tab: The current streamtab entry for this module or driver. The current streamtab reflects any changes made by dynamic function replacement. It is filled in by the framework.
str_frame_wk: Points to memory containing all STREAMS framework data for this component. This information includes the local stack-frame structures for system calls in progress, the synchronization queues, and so on.
q_private_data: A pointer to the structure allocated by the marshalling function. It must contain the q_ptr data for both the read queue and the write queue.
read_mp and write_mp: The mblk chains of messages stored on the queues awaiting processing. They are built by the framework.
read_mp_private and write_mp_private: The mblk chains currently held by the module or driver. These may be messages awaiting retransmission.
Note: all mblks being migrated are linked through the b_next/b_prev fields. All mblks linked through the b_cont pointer are assumed to be part of a single message.
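
The note on chaining can be illustrated with a short walk over such a chain. The sketch_mblk structure below is a trimmed, hypothetical stand-in for the real STREAMS message block that keeps only the fields needed here, and chain_stats() is an illustrative helper, not part of any framework.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed stand-in for a STREAMS message block (assumption: only the
 * link and data pointers matter for this illustration). */
struct sketch_mblk {
    struct sketch_mblk *b_next;  /* next message in the chain */
    struct sketch_mblk *b_prev;  /* previous message in the chain */
    struct sketch_mblk *b_cont;  /* next block of the same message */
    unsigned char      *b_rptr;  /* first unread byte */
    unsigned char      *b_wptr;  /* one past the last written byte */
};

/* Bytes in one message: follow b_cont across all of its blocks. */
size_t msg_bytes(const struct sketch_mblk *mp)
{
    size_t n = 0;
    for (; mp != NULL; mp = mp->b_cont)
        n += (size_t)(mp->b_wptr - mp->b_rptr);
    return n;
}

/* Walk a migration chain: b_next separates messages, b_cont joins the
 * blocks of one message. Reports message count and total data bytes. */
void chain_stats(const struct sketch_mblk *head, size_t *msgs, size_t *bytes)
{
    assert(msgs != NULL && bytes != NULL);
    *msgs = 0;
    *bytes = 0;
    for (; head != NULL; head = head->b_next) {
        (*msgs)++;
        *bytes += msg_bytes(head);
    }
}
```

A receiver that rebuilds the chain on the remote node needs exactly this distinction to know where one message ends and the next begins.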

【０３１１】Once this structure has been filled in, the control thread transmits the data to the remote node, where the inverse function is applied to the data and the structures are re-created on that node. The developer is responsible for re-creating the read-queue and write-queue q_ptr structures and for releasing q_private_data with FREE().

【０３１２】6.2 Entire Stack on Another Node
Two embodiments were presented in the preceding section. In the alternative embodiment there is no local STREAMS framework component, so migration consists simply of migrating the stack as usual and updating the VFS layers on the two nodes. The description below therefore concentrates on the first configuration and explains how it is migrated. This will generally serve as the basis for migration.

【０３１３】1. The control thread obtains the S-ICS write queue and has the local S-ICS instance apply flow control on the "write" path. This prevents further messages from being sent from this node to the remote node.
2. The control thread notifies the remote control thread that a migration is about to take place, using the remote target index to identify the component.
3. The remote control thread disables the driver's local stream head, disables the stream-head preview function, and sets the QFULL flag in the stream head. Between them, these two steps stop all downstream activity from sending data to the local S-ICS or to the local stream head.

【０３１４】4. 遠隔制御スレッドは、これらのコンポ
ーネントに対し通常の移行アルゴリズムを起動する。こ
れが完了すると、遠隔制御スレッドは、移行が現在発生
しているというメッセージを始動元制御スレッドへ送
る。 5. 移行が完了すると、遠隔制御スレッドは始動元スレ
ッドへメッセージを送り、移行状態および新たな経路テ
ーブル・データを確認する。 6. 制御スレッドは、次に、新しい情報を用いてＳ−Ｉ
ＣＳ経路テーブルを更新する。 7. この時点で、制御スレッドは、フロー制御を解放
し、Ｓ−ＩＣＳがメッセージを送り出すのを可能にす
る。フロー制御の解放は遠隔Ｓ−ＩＣＳへのメッセージ
送付を可能にし、そのメッセージにより遠隔Ｓ−ＩＣＳ
がストリーム・スタックからのメッセージを送付するこ
とが可能となる。4. The remote control thread invokes the normal migration algorithm for these components. When this is done, the remote control thread sends the initiating control thread a message indicating that the migration is now under way. 5. When the migration is complete, the remote control thread sends a message to the initiating thread confirming the migration status and the new route table data. 6. The controlling thread then updates the S-ICS route table with the new information. 7. At this point, the controlling thread releases flow control, allowing the S-ICS to send messages. Releasing flow control allows a message to be sent to the remote S-ICS, and that message in turn allows the remote S-ICS to send messages from the stream stack.

【０３１５】6.3 移行アルゴリズムストリーム・スタック全体を移行させる場合と１つの分
散コンポーネントを移行させる場合との間に実際に相違
はない。移行のステップは以下の通りである。 1. 制御スレッドがＳ−ＩＣＳ書込み待ち行列を取得し
「書込み」経路上のフロー制御をローカルＳ−ＩＣＳイ
ンスタンスに置かせる。これは、更なるメッセージがこ
のノードから遠隔ノードへ送り出されるのを防止する。
これを行うため、制御スレッドは、各Ｓ−ＩＣＳインス
タンスの範囲内に記憶されているフロー制御mblkを利用
する。制御スレッドは、Ｓ‐ＩＣＳ書込み待ち行列上に
putq(wq, mp)を実行し、Ｓ−ＩＣＳ内にこの状態を書き
込む。6.3 Migration Algorithm. There is no real difference between migrating an entire stream stack and migrating a single distributed component. The migration steps are as follows. 1. The controlling thread acquires the S-ICS write queue and places flow control on the "write" path of the local S-ICS instance. This prevents further messages from being sent from this node to the remote node. To do this, the controlling thread uses the flow control mblk stored within each S-ICS instance. The controlling thread executes putq(wq, mp) on the S-ICS write queue, which records this state in the S-ICS.
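The effect of queuing a flow control mblk in step 1 can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the types `struct msg` and `struct sics_queue` and the helpers `sics_putq`, `sics_hold`, `sics_next_to_send`, and `sics_release` are hypothetical stand-ins, not the kernel's `mblk_t`, `queue_t`, or `putq()`. The sketch also assumes that messages queued ahead of the marker may still drain, while nothing behind it is sent until the marker is removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for STREAMS message blocks and queues. */
enum msg_kind { MSG_DATA, MSG_FLOW_CTL };

struct msg {
    enum msg_kind kind;
    struct msg *next;
};

struct sics_queue {
    struct msg *head, *tail;
    struct msg flow_ctl;   /* pre-allocated marker, one per S-ICS instance */
    bool held;             /* true while the marker sits on the queue */
};

void sics_queue_init(struct sics_queue *q)
{
    q->head = q->tail = NULL;
    q->flow_ctl.kind = MSG_FLOW_CTL;
    q->flow_ctl.next = NULL;
    q->held = false;
}

/* Append a message to the write queue, in the spirit of putq(wq, mp). */
void sics_putq(struct sics_queue *q, struct msg *mp)
{
    mp->next = NULL;
    if (q->tail) q->tail->next = mp; else q->head = mp;
    q->tail = mp;
    if (mp->kind == MSG_FLOW_CTL) q->held = true;
}

/* Step 1: the controlling thread queues the marker to hold the write path. */
void sics_hold(struct sics_queue *q)
{
    assert(!q->held);
    sics_putq(q, &q->flow_ctl);
}

/* Dequeue the next sendable message; returns NULL once the marker is reached. */
struct msg *sics_next_to_send(struct sics_queue *q)
{
    struct msg *mp = q->head;
    if (mp == NULL || mp->kind == MSG_FLOW_CTL) return NULL;
    q->head = mp->next;
    if (q->head == NULL) q->tail = NULL;
    return mp;
}

/* Step 12: unlinking the marker releases flow control. */
void sics_release(struct sics_queue *q)
{
    struct msg **pp = &q->head, *prev = NULL;
    while (*pp && *pp != &q->flow_ctl) { prev = *pp; pp = &(*pp)->next; }
    if (*pp == NULL) return;               /* marker was not queued */
    *pp = q->flow_ctl.next;
    if (q->tail == &q->flow_ctl) q->tail = prev;
    q->flow_ctl.next = NULL;
    q->held = false;
}
```

In this model a message queued after `sics_hold()` stays on the queue until `sics_release()` is called, which mirrors the way the description holds outbound traffic for the duration of a migration.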

【０３１６】2. 制御スレッドは、移行が行われること
を遠隔制御スレッドに通知し、遠隔目標インデックス(r
emote_target_index)を使用してコンポーネントを識別
する。遠隔目標インデッックスは経路テーブル・エント
リを取り出すことによって決定される。経路テーブル・
エントリを取り出すためには、制御スレッドはこのスト
リーム・スタックを管理するＳ−ＩＣＳを識別する必要
がある。Ｓ−ＩＣＳアドレスは、cluster_route_idと共
に、ローカル・コンポーネントのストリームヘッド内で
見出される。これらのコンポーネントを使用して、正し
い経路テーブル・エントリがインデックスされデータが
取り出される。2. The controlling thread notifies the remote control thread that a migration is about to take place, using the remote target index (remote_target_index) to identify the component. The remote target index is determined by retrieving the route table entry. To retrieve the route table entry, the controlling thread must identify the S-ICS that manages this stream stack. The S-ICS address is found, together with the cluster_route_id, in the stream head of the local component. Using these two values, the correct route table entry is indexed and its data retrieved.
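The lookup described in step 2 can be sketched as follows. The structure layouts are assumptions made for illustration: only `cluster_route_id` and `remote_target_index` are named in the description, while `struct sics`, `struct stream_head`, `sics_addr`, `remote_node`, `in_use`, the table size, and `route_lookup` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ROUTE_TABLE_SIZE 8   /* illustrative size only */

/* One route table entry; only remote_target_index is named in the text. */
struct route_entry {
    int in_use;
    int remote_node;           /* hypothetical: node holding the component */
    int remote_target_index;   /* identifies the component on that node */
};

/* Hypothetical S-ICS instance that owns the route table for a stream stack. */
struct sics {
    struct route_entry routes[ROUTE_TABLE_SIZE];
};

/* Hypothetical stream head fields: the S-ICS address and cluster_route_id. */
struct stream_head {
    struct sics *sics_addr;
    int cluster_route_id;
};

/*
 * Find the managing S-ICS through the stream head, then index its route
 * table with cluster_route_id. Returns NULL if the id is out of range or
 * the entry is unused.
 */
const struct route_entry *route_lookup(const struct stream_head *sth)
{
    if (sth == NULL || sth->sics_addr == NULL) return NULL;
    if (sth->cluster_route_id < 0 || sth->cluster_route_id >= ROUTE_TABLE_SIZE)
        return NULL;
    const struct route_entry *e = &sth->sics_addr->routes[sth->cluster_route_id];
    return e->in_use ? e : NULL;
}
```

The controlling thread would then pass the entry's `remote_target_index` to the remote control thread so that the remote side can identify the component being migrated.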

【０３１７】3. 遠隔制御スレッドは、ドライバのロー
カル・ストリームヘッドを使用禁止にし、ストリームヘ
ッド・プレビュー関数を使用禁止にする。遠隔制御スレ
ッドは、remote_target_indexに基づいて対応するtarge
t_queueエントリを見出すことによって、関連ストリー
ムヘッドを識別する(このアルゴリズムは上述した)。ta
rget_queueの範囲内で target_sthが取り出される。カ
ーネル内streams_loctl()を使用して、遠隔制御スレッ
ドは、プレビュー関数を設定するためプロセスを逆転さ
せる。これによって、低位モジュール／ドライバがスト
リームヘッドをバイパスしローカルＳ−ＩＣＳにメッセ
ージを置くのが防止される。3. The remote control thread disables the driver's local stream head and disables the stream head preview function. The remote control thread identifies the relevant stream head by finding the target_queue entry that corresponds to the remote_target_index (this algorithm is described above). The target_sth is retrieved from within the target_queue. Using the in-kernel streams_ioctl(), the remote control thread reverses the process that was used to set the preview function. This prevents lower modules and drivers from bypassing the stream head and placing messages on the local S-ICS.

【０３１８】4. 遠隔制御スレッドは、また、ストリー
ムヘッド読取り待ち行列の範囲内にQFULLフラグを設定
する。これによって、低位モジュールまたはドライバが
ストリームヘッドにメッセージを記憶させようとする試
みが防止される。この条件が検出されると、モジュール
／ドライバはフロー制御されるか、あるいはそれらの標
準的プロトコール命令の一部としてメッセージを放棄す
る。いずれにせよ、メッセージはもはや上流へ流れな
い。4. The remote control thread also sets the QFULL flag in the stream head read queue. This prevents lower modules or drivers from attempting to store messages at the stream head. When a module or driver detects this condition, it is either flow controlled or discards the message as part of its standard protocol handling. Either way, messages no longer flow upstream.
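The behavior that step 4 relies on can be modeled in a few lines. This is a user-space sketch, not kernel code: `QFULL_FLAG`, `struct read_queue`, `queue_can_accept`, `mark_full`, and `put_upstream` are hypothetical names, and the test in `queue_can_accept` only imitates the kind of check a module makes before passing a message upstream.

```c
#include <assert.h>
#include <stdbool.h>

#define QFULL_FLAG 0x01u   /* stand-in for the STREAMS QFULL queue flag */

/* Hypothetical model of the stream head read queue. */
struct read_queue {
    unsigned flags;
    int count;             /* messages accepted so far */
};

enum put_result { PUT_QUEUED, PUT_FLOW_CONTROLLED, PUT_DROPPED };

/* A lower module tests the queue before passing a message upstream. */
bool queue_can_accept(const struct read_queue *q)
{
    return (q->flags & QFULL_FLAG) == 0;
}

/* Step 4: the remote control thread marks the read queue full. */
void mark_full(struct read_queue *q)
{
    q->flags |= QFULL_FLAG;
}

/*
 * What a lower module or driver does with an upstream message. When the
 * queue is marked full, the module either holds the message (flow control)
 * or discards it, depending on its protocol; may_drop selects which.
 */
enum put_result put_upstream(struct read_queue *q, bool may_drop)
{
    if (!queue_can_accept(q))
        return may_drop ? PUT_DROPPED : PUT_FLOW_CONTROLLED;
    q->count++;
    return PUT_QUEUED;
}
```

Either outcome keeps new messages from reaching the stream head, which is all the migration needs at this point.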

【０３１９】5. 次に、遠隔制御スレッドは、目標遠隔
ノードと連絡をとる。遠隔制御スレッドは、目標遠隔制
御スレッドに通常のオープン命令を実行させ、このノー
ドに存在するものと同じスタック構成を生成させる。こ
れが完了すると、目標ノードは、データ交換を開始する
ため適切なメッセージおよびデータを用いて応答する。
注：目標ノード制御スレッドがスタックを作成する時、
ＳＴＲＥＡＭＳフレームワークにスタック・インスタン
スに関するすべての同期待ち行列を取得させる。これに
よって、移行が完了するまでいかなる活動もこれらの待
ち行列上で発生することが防止される。5. Next, the remote control thread contacts the target remote node. It has the target remote control thread execute a normal open operation, creating the same stack configuration that exists on this node. When this is complete, the target node responds with the appropriate messages and data to begin the data exchange. Note: When the target node's control thread creates the stack, it has the STREAMS framework acquire all synchronization queues for the stack instance. This prevents any activity from occurring on these queues until the migration is complete.

【０３２０】6. この時点で、遠隔制御スレッドは、保
証されてはいないが、移行が可能であることを確信す
る。遠隔制御スレッドは、最も低位のコンポーネントか
ら開始してすべてのスタック同期待ち行列を取得する。
これは、すべての仕掛かり中のサービス・ルーチンの動
作を中止させて、サービス・ルーチンがその実行を完了
した状態およびサービス・ルーチンが必要とされる同期
待ち行列を取得することができないため「反転」実行経
路に置かれた状態という２つの状態の１つへ置く。これ
が起きると、ＳＴＲＥＡＭＳフレームワークは、すべて
の仕掛かり中サービス・ルーチンおよび関連メッセージ
を目標スタックに移行する。注：ＳＴＲＥＡＭＳ実施
が、プロセス領域の外部のサービス・ルーチンを実行す
ると、それらルーチンおよび関連データは取り出され移
行される。また、ソフトウェア割込みがサービス・ルー
チンを実行するため使用されるとすれば、移行が完了す
る時、すべてのサービス・ルーチンは新しいソフトウェ
ア割り込みとして再開される必要がある。これは、スタ
ックが再開されるまでサービス・ルーチン要求が載せら
れているリストが存在しなければならないことを意味す
る。6. At this point, the remote control thread is confident, though not guaranteed, that the migration can proceed. The remote control thread acquires all of the stack's synchronization queues, starting from the lowest component. This halts all in-flight service routines and leaves each of them in one of two states: either the service routine has completed its execution, or it has been placed on the "reversed" execution path because it could not acquire the required synchronization queue. Once this has happened, the STREAMS framework migrates all in-flight service routines and their associated messages to the target stack. Note: If the STREAMS implementation executes service routines outside the process context, those routines and their associated data are extracted and migrated. Also, if software interrupts are used to execute service routines, all service routines must be restarted as new software interrupts when the migration completes. This means that a list must exist on which service routine requests are held until the stack is restarted.

【０３２１】7. 制御スレッドは今やすべての非使用禁
止システム呼び出しの完了を可能にする。すべての使用
禁止システム呼び出しは凍結され、それらの状態は記録
され、再開始のため新しいノードへ送り出される。スタ
ックが移動中でアプリケーションがとどまっている場
合、フレームワークはローカル・システムの範囲内で内
容切り替えを実行しなければならない。これを実行する
ため、ファイル・システムの実施形態に基づいて、例え
ば、vnode v_rdevをＳ−ＩＣＳと関連するdev_tをポイ
ントするように更新し、同様に、vnode v_streamをＳ−
ＩＣＳインスタンスのストリームヘッド・ポインタと置
き換える。これらステップは、vnode構造をポイントす
る関連ファイルポインタf_dataフィールドをＳ−ＩＣＳ
オープン・シーケンスの間に生成されたＳ−ＩＣＳvnod
eインスタンスと置き換えることによって達成される。7. The controlling thread now allows all system calls that are not disabled to complete. All disabled system calls are frozen, and their state is recorded and sent to the new node for restart. If the stack is moving while the application stays where it is, the framework must perform a content switch within the local system. How this is done depends on the file system implementation; for example, vnode v_rdev is updated to point to the dev_t associated with the S-ICS, and likewise vnode v_stream is replaced with the stream head pointer of the S-ICS instance. These steps are accomplished by replacing the f_data field of the associated file pointer, which points to the vnode structure, with the S-ICS vnode instance created during the S-ICS open sequence.

【０３２２】8. 制御スレッドは、このスタックに関す
るＳＴＲＥＡＭＳフレームワーク・データを整理する。
これは、すべての中間システム呼び出しの結果、ストリ
ームヘッド・フラグ、内部システム呼び出し処理構造、
同期待ち行列データと仕掛かり中要求、すべての待ち行
列フラグ、すべての動的関数置換データ、すべての動的
関数登録データ、すべてのメッセージとそれらの帯域、
M_SETOPTSに関連するすべての修正等々を含む。このデ
ータは、スタック・フレームワークを再構築する目標制
御スレッドに送り出される。 9. 次に、制御スレッドは、各ストリーム・コンポーネ
ントに関して整理関数を起動する。8. The controlling thread marshals the STREAMS framework data for this stack. This includes the results of all intermediate system calls, the stream head flags, the internal system call processing structures, the synchronization queue data and in-flight requests, all queue flags, all dynamic function replacement data, all dynamic function registration data, all messages and their bands, all modifications associated with M_SETOPTS, and so on. This data is sent to the target control thread, which rebuilds the stack framework. 9. Next, the controlling thread invokes the marshalling function for each stream component.

【０３２３】10.この構造が目標制御スレッドに渡さ
れ、目標制御スレッドは整理関数を起動してこのスタッ
ク上でプロセスを反転させる。その際、目標ノードが潜
在的メモリ制約のためデータおよびメッセージのすべて
を再生成させることができない可能性があるかもしれな
い。この点は次のように処理される。移行が失敗した場
合、遠隔制御スレッドはその失敗を検知し、状態を復元
し、割り当てられた資源をクリーンアップする。次に、
始動元制御スレッドが新しい目標ノードを選択して新た
にプロセスを開始するのを待つ。別の方法は、ＳＴＲＥ
ＡＭＳフレームワークにメモリ回復アルゴリズムを作成
させることである。これは、作業は増加するが、制御ス
レッドに関する移行の複雑性を減少させる。10. This structure is passed to the target control thread, which invokes the marshalling function to reverse the process on this stack. In doing so, the target node may be unable to recreate all of the data and messages because of potential memory constraints. This case is handled as follows. If the migration fails, the remote control thread detects the failure, restores its state, and cleans up the resources that were allocated. It then waits for the initiating control thread to select a new target node and start the process again. An alternative is to have the STREAMS framework implement a memory recovery algorithm. This involves more work, but it reduces the complexity of migration for the controlling thread.

【０３２４】11. 目標ノードが動作を完了した時、目標
ノードはそれを遠隔制御スレッドを経由して制御スッド
に通知する。そこで遠隔制御スレッドは、既存のストリ
ーム・スタックを取り壊しすべての資源をクリーンアッ
プする。次に、目標制御スレッドは、すべての同期待ち
行列を解放し、スタックおよび保留中のシステム呼び出
しの再開を可能にする。11. When the target node has completed its work, it notifies the controlling thread by way of the remote control thread. The remote control thread then tears down the existing stream stack and cleans up all of its resources. Next, the target control thread releases all of the synchronization queues, allowing the stack and any pending system calls to resume.

【０３２５】12. 制御スレッドは、次に、新しいデータ
を使用してＳ−ＩＣＳ経路テーブルを更新して、mblkを
除去することによってフロー制御を解放する。これによ
って通信が新しいノードから始まることが可能にされ
る。制御スレッドがいずれかの時点でエラーを検出する
場合は、実行されたことを元に戻し、スタックを再開す
る。これは、移行プロセスが移行完了までスタックのい
かなる局面をも変えていないので、可能である。注：多
くの状況において、移行はノードまたはハードウェア障
害によって発生する可能性がある。これは、制御スレッ
ドが始動元エンティティであってクラスタ管理スレッド
ではないことを意味する。更にスタック構成に応じて、
一部の移行は、障害を検出する制御スレッドとアプリケ
ーションが共存するＳ−ＩＣＳの間で実行されることが
ある。回復動作が、同一ノードまたは新しいノード上の
新しいインターフェース・カードに進むことを含む場合
もある。どのような状況でも、上述のステップのすべて
は、あらゆる移行状況と関係するかもしれないし関係し
ないかもしれないので、ケース・バイ・ケースでこれら
ステップを取捨選択して適用しなければならない。12. The controlling thread then updates the S-ICS route table with the new data and releases flow control by removing the mblk. This allows communication to begin from the new node. If the controlling thread detects an error at any point, it undoes what has been done and restarts the stack. This is possible because the migration process does not change any aspect of the stack until the migration is complete. Note: In many situations, a migration may be triggered by a node or hardware failure. This means that the controlling thread, not the cluster management thread, is the initiating entity. In addition, depending on the stack configuration, some migrations may be carried out between the controlling thread that detects the failure and the S-ICS with which the application is co-located. The recovery action may also involve moving to a new interface card on the same node or on a new node. In every case, not all of the steps above will apply to every migration scenario, so the steps must be selected and applied on a case-by-case basis.

【０３２６】7.0 ストリーム同期ヒューレット・パッカード社および多数の他の業者のＳ
ＴＲＥＡＭＳ実施形態は、現在、種々の異なるレベルの
自動フレームワーク同期をサポートしている。データに
対するそれら自身の同期機構を作成し、またどのような
putまたはserviceルーチンを並列的に実行させるかを制
御する必要性からモジュールまたはドライバ開発者を解
放するために、これらのレベルは作成された。これらの
レベルには、待ち行列、待ち行列ペア、モジュール、そ
の他およびグローバルという５つのレベルがある。これ
らは次のように定義されている。7.0 Stream Synchronization. The STREAMS implementations of Hewlett-Packard and many other vendors currently support several different levels of automatic framework synchronization. These levels were created to free module and driver developers from having to build their own synchronization mechanisms for their data and from having to control which put or service routines are allowed to run in parallel. There are five levels: queue, queue pair, module, "other," and global. They are defined as follows.

【０３２７】待ち行列同期レベルは、ほとんどの並列性
を提供する。この待ち行列のputまたはserviceルーチン
のインスタンスのただ１つだけが１時点で実行されるよ
うに、それは待ち行列に対するアクセスを逐次化する。
異なる待ち行列インスタンスがそれらのputまたはservi
ceルーチンを異なるプロセッサ上で並列的に実行させる
場合もある。待ち行列ペア同期レベルは、read put, re
ad service, write putまたはwriteserviceという４つ
の関数のうちの１つだけが一度に実行されるように、読
取り書込み待ち行列ペアに対するアクセスを逐次化させ
る。異なる待ち行列ペア・インスタンスが異なるプロセ
ッサ上で並列的に実行することも可能である。The queue synchronization level provides the most parallelism. It serializes access to a queue so that only one instance of that queue's put or service routine executes at a time. Different queue instances may execute their put or service routines in parallel on different processors. The queue pair synchronization level serializes access to a read/write queue pair so that only one of its four functions (read put, read service, write put, or write service) executes at a time. Different queue pair instances may execute in parallel on different processors.

【０３２８】モジュール同期レベルは、モジュールの待
ち行列ペアまたはインスタンスのすべてに対するアクセ
スを逐次化する。すなわち、ただ１つのインスタンスの
readput, read service, write putまたはwrite servic
eルーチンだけが１度に実行される。異なるモジュール
が異なるプロセッサ上で並列的に実行することもでき
る。「その他」同期は、１つの関数だけを一度に実行す
るため異なるモジュールからなる１つのグループを提供
する。これは、複数ドライバ／モジュール間の協調タス
クを意味する。実際に使用されると、それは複雑なアプ
リケーションのための単純な同期機構を提供することが
できる。グロ―バル同期は、並列処理を提供しない。グ
ローバル同期として構成されたモジュールのうちただ１
つのモジュールが一度に実行することが許容される。The module synchronization level serializes access to all of a module's queue pairs, or instances. That is, only one instance's read put, read service, write put, or write service routine executes at a time. Different modules may execute in parallel on different processors. "Other" synchronization allows a group of different modules to execute only one function at a time. It is intended for cooperative tasks among multiple drivers and modules. Used in practice, it can provide a simple synchronization mechanism for complex applications. Global synchronization provides no parallelism. Of all the modules configured for global synchronization, only one is allowed to execute at a time.
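The scope of each of the five levels can be summarized as a predicate that says whether two routine invocations are serialized against each other. The sketch below is an illustration only: `enum sync_level`, `struct queue_id`, its fields, and `serialized` are hypothetical names, and the model makes the simplifying assumption that two queues configured at different levels never serialize against each other.

```c
#include <assert.h>

/* The five synchronization levels described above. */
enum sync_level { SYNC_QUEUE, SYNC_QUEUE_PAIR, SYNC_MODULE, SYNC_OTHER, SYNC_GLOBAL };

/* Hypothetical identity of the queue on which a put or service routine runs. */
struct queue_id {
    enum sync_level level;  /* level configured for the owning module */
    int module;             /* module or driver identity */
    int instance;           /* queue pair (instance) within the module */
    int is_write;           /* 0 = read queue, 1 = write queue */
    int group;              /* "other" group identifier; used only for SYNC_OTHER */
};

/* Returns 1 if routines on a and b must not run at the same time, else 0. */
int serialized(const struct queue_id *a, const struct queue_id *b)
{
    if (a->level != b->level)
        return 0;   /* simplification made by this sketch */
    switch (a->level) {
    case SYNC_QUEUE:       /* one routine per queue */
        return a->module == b->module && a->instance == b->instance &&
               a->is_write == b->is_write;
    case SYNC_QUEUE_PAIR:  /* one routine per read/write pair */
        return a->module == b->module && a->instance == b->instance;
    case SYNC_MODULE:      /* one routine per module, across all instances */
        return a->module == b->module;
    case SYNC_OTHER:       /* one routine per cooperating group of modules */
        return a->group == b->group;
    case SYNC_GLOBAL:      /* one routine among all global-level modules */
        return 1;
    }
    return 0;
}
```

Reading the cases from top to bottom shows the trade-off the text describes: each level widens the set of routines that exclude one another, giving up parallelism in exchange for easier data sharing.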

【０３２９】すべての同期レベルに共通な最も重要な特
長は、付加的ロッキング・メカニズムを必要とせずに、
各々のレベルが情報の共有できる異なる度合いを提供す
る点である。例えば、「その他」同期は、異なるモジュ
ールがスピンロックのような機構を調整する必要性を持
たずにデータを簡単に共有することを可能にする。フレ
ームワークは、アクセス可能なものを標示する単純な識
別子を提供し、それによって、すべてのアクセスは、付
加的スピンロックまたは信号を使用することなく、自動
的に逐次化される。The most important feature common to all of the synchronization levels is that each provides a different degree of information sharing without requiring additional locking mechanisms. For example, "other" synchronization lets different modules share data easily without having to coordinate a mechanism such as a spin lock. The framework provides a simple identifier that indicates what may be accessed, and all accesses are then serialized automatically without the use of additional spin locks or signals.

【０３３０】スタックまたはカーネルの内の異なる位置
に記憶される親の同期待ち行列を処理することによっ
て、これらの同期レベルの実際の実施は達成される。例
えば、待ち行列および待ち行列ペア同期はその待ち行列
にとってローカルである親の同期待ち行列を記憶し、一
方、モジュール同期はfmodswテーブル内に親同期待ち行
列アドレスを記憶する。「その他」およびグローバル同
期は、ローカルのカーネル・インスタンスに対してのみ
ユニークである。このように、クラスタ構成は、待ち行
列または待ち行列ペア同期をサポートするドライバおよ
びモジュールを分散するだけでよい。この場合、これは
オペレーティング・システムから他のレベルを削除しな
いが、それらはローカル・ノード内でのみ動作できる。
例えば、ＴＲＥＡＭＳ型ＴＣＰＡＲＰモジュールは、
モジュール・レベル同期を使用するので、分散すること
はできない。多くの場合、これは問題ではない。なぜな
ら、ＡＲＰと通信するＩＰは同じノードに存在し、目標
ドライバも分散されないからである。These synchronization levels are actually implemented by manipulating a parent synchronization queue, which is stored at different locations in the stack or the kernel. For example, queue and queue pair synchronization store the parent synchronization queue locally to the queue, whereas module synchronization stores the parent synchronization queue address in the fmodsw table. "Other" and global synchronization are unique only to the local kernel instance. A cluster configuration therefore should distribute only those drivers and modules that support queue or queue pair synchronization. This does not remove the other levels from the operating system, but they can operate only within the local node. For example, a STREAMS-based TCP ARP module uses module-level synchronization and so cannot be distributed. In most cases this is not a problem, because the IP that communicates with ARP resides on the same node and the target driver is not distributed either.
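The distribution rule stated in this paragraph reduces to a simple check on a module's synchronization level. The following is a minimal sketch; the enumeration and the function name `may_distribute` are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* The five synchronization levels named in the description. */
enum sync_level { SYNC_QUEUE, SYNC_QUEUE_PAIR, SYNC_MODULE, SYNC_OTHER, SYNC_GLOBAL };

/*
 * Only levels whose parent synchronization queue is local to the queue
 * itself can follow a component to another node. The remaining levels
 * depend on state held in the local kernel instance.
 */
int may_distribute(enum sync_level level)
{
    assert(level >= SYNC_QUEUE && level <= SYNC_GLOBAL);
    return level == SYNC_QUEUE || level == SYNC_QUEUE_PAIR;
}
```

Under this rule a module such as the ARP example, which uses module-level synchronization, would be kept on a single node.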

【０３３１】8.0 ドライバ／モジュール切り替えテーブ
ルモジュールまたはドライバがシステムにロードされる
時、そのストリームタブ構造および構成データは、ＳＴ
ＲＥＡＭＳフレームワークによる迅速で簡単なアクセス
のためグロ―バル切り替えテーブル内に記憶される。ほ
とんどの実施形態において、ドライバはdmodswテーブル
に記憶され、モジュールはfmodswテーブルに記憶され
る。ほとんどの場合、これらのテーブルは、完全独立で
あるので、クラスタ内での同期をとる必要はない。例外
は、モジュールまたはドライバがモジュール、グローバ
ルまたは「その他」の同期レベルを使用する場合であ
る。モジュール同期の場合、これは通常問題ではない。
なぜならほとんどのモジュールはデータがすべてのノー
ドの間で同期させられることを要求せず、単に特定のノ
ードにとってローカルなインスタンスの間での同期を必
要とするにすぎないからである。グローバルおよび「そ
の他」の同期の場合は問題である。これらのレベルは、
活動を協調するため独立したモジュールおよびドライバ
を提供する。これらのレベルをサポートするため、ＳＴ
ＲＥＡＭＳフレームワークは分散ロッキング・メカニズ
ムを提供する必要がある。そのようなメカニズムの実施
は難しくないが、それらは待ち時間と複雑性をすべての
動作に加える。8.0 Driver/Module Switch Tables. When a module or driver is loaded into the system, its streamtab structure and configuration data are stored in a global switch table for quick and easy access by the STREAMS framework. In most implementations, drivers are stored in the dmodsw table and modules in the fmodsw table. In most cases these tables are completely independent, so they do not need to be synchronized across the cluster. The exception is when a module or driver uses the module, global, or "other" synchronization level. For module synchronization this is usually not a problem, because most modules do not require their data to be synchronized across all nodes; they require synchronization only among the instances local to a particular node. Global and "other" synchronization do pose a problem. These levels allow independent modules and drivers to coordinate their activities. To support them, the STREAMS framework must provide a distributed locking mechanism. Such mechanisms are not difficult to implement, but they add latency and complexity to every operation.

【０３３２】そのような機構を作成する場合以下を考慮
しなければならない。同期ロックが現在遠隔にあり動作
が割り込みスタックから分離して実行されているとすれ
ば、要求は、取得要求を記録しその実行を後刻行うＳＴ
ＲＥＡＭＳデーモンの待ち行列に記憶する必要がある。
下部構造のＰ−ＩＣＳがメッセージ配信を保証しない場
合、分散ロッキング・メカニズムは、現在時ロック保有
者に対しロックを正式に解放する前に目標ノードから送
られてくる受信承認メッセージを待つように要求する。
これは、現在時保有者に対し、再転送タイマを記録しメ
ッセージの潜在的消失またはノード障害を処理するよう
に要求する。周期的に、そのようなモデルに関して単一
ポイントの障害が存在するので、現在時保有者は、ロッ
クが現在ありすべて順調に行われている他のロック管理
スレッドに同時通報する必要がある。これは心臓の鼓動
のように行われ、心拍が検知されたならば他のスレッド
が回復処理を実行することを可能にする。When such a mechanism is built, the following must be considered. If the synchronization lock is currently remote and the operation is executing off the interrupt stack, the request must be queued to a STREAMS daemon, which records the acquisition request and carries it out later. If the underlying P-ICS does not guarantee message delivery, the distributed locking mechanism requires the current lock holder to wait for an acknowledgment from the target node before it formally releases the lock. This in turn requires the current holder to keep a retransmission timer and to handle potential message loss or node failure. Because such a model has a single point of failure, the current holder must periodically broadcast to the other lock management threads where the lock currently resides and that all is well. This works like a heartbeat, allowing another thread to carry out recovery when the heartbeat is no longer detected.
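The acknowledged hand-off of a lock over an unreliable interconnect can be pictured as a small state machine. This is a sketch under stated assumptions, not the mechanism of any particular implementation: the state names, `struct lock_release`, the retry budget, and the three event functions are all hypothetical, and timer and message delivery are represented only as calls made by the surrounding code.

```c
#include <assert.h>

/* Hypothetical states of a lock holder that is handing the lock to a peer. */
enum rel_state { REL_IDLE, REL_WAIT_ACK, REL_RELEASED, REL_PEER_FAILED };

struct lock_release {
    enum rel_state state;
    int retries_left;   /* retransmissions still permitted */
    int sends;          /* hand-off messages sent so far */
};

/* The holder sends the hand-off message and starts its retransmission timer. */
void release_begin(struct lock_release *r, int max_retries)
{
    assert(max_retries >= 0);
    r->state = REL_WAIT_ACK;
    r->retries_left = max_retries;
    r->sends = 1;
}

/* The acknowledgment arrived: only now is the lock formally released. */
void release_on_ack(struct lock_release *r)
{
    if (r->state == REL_WAIT_ACK)
        r->state = REL_RELEASED;
}

/* The timer expired: resend while retries remain, else treat the peer as failed. */
void release_on_timeout(struct lock_release *r)
{
    if (r->state != REL_WAIT_ACK)
        return;
    if (r->retries_left > 0) {
        r->retries_left--;
        r->sends++;
    } else {
        r->state = REL_PEER_FAILED;
    }
}
```

A holder that reaches `REL_PEER_FAILED` still owns the lock, so it can select another target or start recovery, which is the reason the release is deferred until the acknowledgment is seen.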

【０３３３】また、役に立つ可能性のある付加的情報を
記憶し取り出すめ、切り替えテーブルを使用することも
できる。例えば、ストリームタブ構造を検査することに
よって、ドライバがマルチプレクサであるか否かを判断
することができる。この点は、オープンを実行すべき適
切なノードを決定するオープン方針に含めることができ
る。更に、クラスタ構成データを検査することによっ
て、下方ｍｕｘが同一ノード上にあるか否か、また、こ
れが重要であるか否かを判断することができる。制御ス
レッドは、また、SADの範囲内に記憶されているストリ
ームタブおよび自動プッシュ(autopush)データを調べ
て、スタックが最終的にはどのようになるかを判断し、
スタックを分割することが適切かあるいはスタック全体
をノード上に生成するのが適切かを判断することができ
る。切り替えテーブルのエントリは、エラー、H/Aおよ
び移行回復方針のためのグロ―バル・レポジトリとなる
こともできる。これらの方針は、動的関す置換の実行方
法と同様に、クラスタ管理スレッドによって更新するこ
ともできる。The switch tables can also be used to store and retrieve additional information that may prove useful. For example, examining the streamtab structure reveals whether a driver is a multiplexer. This can be factored into the open policy that determines the appropriate node on which to perform an open. In addition, examining the cluster configuration data reveals whether the lower mux is on the same node and whether that matters. The controlling thread can also examine the streamtab and the autopush data stored in the SAD to determine what the stack will ultimately look like, and so decide whether it is appropriate to split the stack or to create the entire stack on one node. Switch table entries can also serve as a global repository for error, H/A, and migration recovery policies. These policies can be updated by the cluster management thread in the same way that dynamic function replacement is performed.

【０３３４】9.0 分散STRLOG ＳＴＲＥＡＭＳは、モジュールまたはドライバがロギン
グ・メッセージを送り出すことを可能にするlogドライ
バを提供する。これらのロギング・メッセージは、エラ
ー・ロギングまたはデバッギングのために使用される。
ＤＤＩは、これを実行する標準的ユーティリティstrlog
()を定義する。カーネルの範囲内で、strlog()はログ・
ドライバの読取り待ち行列へメッセージを書き込み、そ
れを処理するログ・ドライバをスケジュールする。stre
rrまたはstraceのいずれかがメッセージを引き出すため
ユーザ・スレッドによって実行されるまで、ログ・ドラ
イバはメッセージをそのストリームヘッド読取り待ち行
列上に記憶する。カーネル・メモリのすべてを消費しな
いようにするため、ログ・ドライバは、それが記憶する
メッセージの数を制限し、超過メッセージのすべてを解
放する。9.0 Distributed STRLOG. STREAMS provides a log driver that allows a module or driver to emit logging messages. These messages are used for error logging or debugging. The DDI defines a standard utility for this purpose, strlog(). Within the kernel, strlog() writes the message to the log driver's read queue and schedules the log driver to process it. The log driver holds the messages on its stream head read queue until a user thread runs either strerr or strace to retrieve them. To avoid consuming all of kernel memory, the log driver limits the number of messages it holds and frees any messages beyond that limit.

【０３３５】単一システム視点を示すクラスタの範囲内
では、strerrまたはstraceは１つのノード上で実行さ
れ、クラスタはすべてのノードに関するすべてのログ・
メッセージを返し、迅速な障害分離のためログ・メッセ
ージの生成元のノードを明らかにする。問題は、strlog
()定義また使用法を修正せずにクラスタ・ノード・メッ
セージのすべてを特定ノードに届ける方法である。その
解決策を、図１８を参照して以下に示す。Within a cluster that presents a single system view, strerr or strace is run on one node, and the cluster returns all log messages from all nodes, identifying the node that generated each message so that faults can be isolated quickly. The problem is how to deliver all of the cluster nodes' messages to a particular node without modifying the definition or usage of strlog(). The solution is described below with reference to FIG. 18.

【０３３６】straceが発せられる時、その命令はログ・
ドライバをオープンし、getmsg()を使用してそれが停止
させられるまでドライバからメッセージを同期的に取り
出す。これは以下のステップのように行われる。 1. ログ・ドライバ実施が次のように変更される。これ
は、クラスタ設計規則の違反として解釈されるかもしれ
ないが、logドライバはＳＴＲＥＡＭＳ実施の一部であ
るので、例外である。(この点はSADドライバにもあては
まる)。logドライバがマルチプレクサ・ドライバに変換
される。その下方部分は、基本的には現行のlogドライ
バ実施形態であって、この環境で動作する上での変更は
必要とされない。 2.ノードがクラスタの範囲内にあるかどうかに関係な
く、上方ｍｕｘの上方部分は、カーネル内ＳＴＲＥＡＭ
Ｓインターフェースを使用して下方ｍｕｘインスタンス
str_dev_lookup(), streams_open(), およびstreams_io
ctl(I_LINK)を作成する。 3.上方ｍｕｘがオープンされていることを認識する制御
スレッドは、logドライバを管理しているクラスタの範
囲内のすべての制御スレッドと連絡をとり、ノード毎に
別々の接続およびＳ−ＩＣＳインスタンスを確立し、上
方muxの下に各Ｓ−ＩＣＳを接続する。これによって、
上方muxが各ログ・メッセージに対するノード識別子を
コマンドに届ける前に事前に保留状態にさせることを可
能にする。When strace is issued, the command opens the log driver and uses getmsg() to retrieve messages from the driver synchronously until it is stopped. In the cluster this works through the following steps. 1. The log driver implementation is changed as follows. This might be construed as a violation of the cluster design rules, but the log driver is an exception because it is part of the STREAMS implementation (the same applies to the SAD driver). The log driver is converted into a multiplexer driver. Its lower part is essentially the current log driver implementation and needs no changes to operate in this environment. 2. Regardless of whether the node is part of a cluster, the upper part of the mux creates the lower mux instance using the in-kernel STREAMS interfaces str_dev_lookup(), streams_open(), and streams_ioctl(I_LINK). 3. The controlling thread, on recognizing that the upper mux has been opened, contacts all of the control threads in the cluster that manage a log driver, establishes a separate connection and S-ICS instance for each node, and links each S-ICS beneath the upper mux. This allows the upper mux to prepend a node identifier to each log message before delivering it to the command.

【０３３７】4. 接続が最初に実行される時、各ノード
は、すべての保留中メッセージを現在時ノードに自動的
に転送する。新しいメッセージは、標準的メッセージ経
路指定アルゴリズムを介して自動的に発送される。注：
コンポーネントは、過大なカーネル・メモリ消費の原因
となる過大なstrlog()呼び出しを実行する可能性があ
る。非クラスタ化環境においては、logドライバはＮ個
が待ち行列記憶されるとメッセージを放棄するように仕
組まれる。上方マルチプレクサ内でも、同様に、Ｎ＊Ｍ
のノード・メッセージまで許容されそれ以上は破棄され
るように公式化する必要がある。 5. ロギングが頻繁に使用される場合、制御スレッド
は、すべてのノードと多分１つか２つのクラスタ管理ノ
ードの間でのＳ−ＩＣＳ接続を事前実行してもよい。こ
れによって、クラスタはＳＴＲＥＡＭコンポーネント・
メッセージを自動的に収集することが可能となり、その
能力をヒューレット・パッカード社のOpenviewのような
管理機構に組み入れることが可能となる。4. When the connection is first made, each node automatically forwards all of its pending messages to the current node. New messages are forwarded automatically by the standard message routing algorithm. Note: A component may make an excessive number of strlog() calls and so cause excessive kernel memory consumption. In a non-clustered environment, the log driver is designed to discard messages once N of them have been queued. The upper multiplexer needs a similar rule: up to N*M node messages are allowed, and any beyond that are discarded. 5. If logging is used frequently, the controlling thread may pre-establish S-ICS connections between all nodes and perhaps one or two cluster management nodes. This allows the cluster to collect STREAMS component messages automatically and lets that capability be built into a management facility such as Hewlett-Packard's OpenView.
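The two duties of the upper multiplexer, tagging each message with its node and capping the total held, can be sketched in user space as follows. The sketch assumes that M in the N*M limit is the number of nodes linked beneath the upper mux, which the text does not state explicitly. The names `struct log_mux`, `log_mux_init`, `log_mux_put`, the constants, and the `node<id>: ` prefix format are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_N          4    /* "N": per-node message limit, illustrative value */
#define LOG_MAX_NODES  8    /* capacity of this sketch, not a cluster limit */
#define LOG_MSG_LEN    64

/* Hypothetical model of the upper log multiplexer's held messages. */
struct log_mux {
    int nodes;      /* "M": nodes linked under the upper mux (assumed meaning) */
    int queued;     /* messages currently held */
    int dropped;    /* messages discarded because the cap was reached */
    char msgs[LOG_N * LOG_MAX_NODES][LOG_MSG_LEN];
};

void log_mux_init(struct log_mux *m, int nodes)
{
    assert(nodes > 0 && nodes <= LOG_MAX_NODES);
    memset(m, 0, sizeof *m);
    m->nodes = nodes;
}

/*
 * Accept a log message arriving from node_id. The node identifier is
 * prepended so the reader can tell where the message originated.
 * Returns 1 if the message was queued, 0 if it was discarded because
 * N*M messages are already held.
 */
int log_mux_put(struct log_mux *m, int node_id, const char *text)
{
    if (m->queued >= LOG_N * m->nodes) {
        m->dropped++;
        return 0;
    }
    snprintf(m->msgs[m->queued], LOG_MSG_LEN, "node%d: %s", node_id, text);
    m->queued++;
    return 1;
}
```

With two nodes linked and N set to 4, the ninth message offered is discarded, which bounds the memory the multiplexer can consume no matter how often components call strlog().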

【０３３８】10.0 分散ＳＡＤＳＡＤは、STREAMS Administration Driverの略称でス
トリーム管理ドライバを意味する。ＳＡＤは、ノード上
でＳＴＲＥＡＭＳドライバ管理を実行するために作成さ
れる。ＳＡＤの主要な用途は、autopush要件を取り扱
い、モジュール・セットがシステムの範囲内に存在する
か否かを判断するためシステムを照会することである。
クラスタの範囲内で、ＳＡＤを使用するアプリケーショ
ンはローカル・ノード上またはクラスタ全体にわたる場
所で動作する。もしローカル・ノード上だけで動作する
とすれば、アプリケーションおよびＳＡＤは変更を必要
としないであろう。クラスタに広がった動作を行う場
合、アプリケーションおよびＳＡＤの両方は変更を必要
とする。10.0 Distributed SAD. SAD stands for STREAMS Administration Driver. The SAD exists to perform STREAMS driver administration on a node. Its main uses are to handle autopush requirements and to query the system to determine whether a set of modules is present in it. Within a cluster, an application that uses the SAD operates either on the local node or cluster-wide. If it operates only on the local node, neither the application nor the SAD needs to change. For cluster-wide operation, both the application and the SAD require changes.

【０３３９】autopush(1m)のような管理アプリケーショ
ンは、ＳＡＤドライバと通信するためioctsを使用す
る。単一システム視点が維持されるとすれば、ioctlsの
実施はクラスタ全体に伝えられなければならない。log
ドライバの場合と同様に、ＳＡＤはマルチプレクサに構
成されるか、ＳＡＤが進行中メッセージを持たないの
で、メッセージ分配を行うだけの１回限りのアプローチ
を使用してもよい。このアプローチを使用すれば、ＳＡ
Ｄドライバは、動的関数置換を使用して、クラスタ初期
状態設定の間に修正される。Administrative applications such as autopush(1m) use ioctls to communicate with the SAD driver. If a single system view is to be maintained, the effect of these ioctls must be propagated across the entire cluster. As with the log driver, the SAD can be restructured as a multiplexer; alternatively, because the SAD has no ongoing messages, a one-shot approach that simply distributes each message can be used. With this approach, the SAD driver is modified during cluster initialization by means of dynamic function replacement.

【０３４０】修正されたput/serviceルーチンは、アク
ションをとり結果を戻す制御スレッドにその要求を通知
する。ＳＡＤがioctlsを使用するので、M_IOCACKが生成
されるまでアプリケーションは安全に待つことができ
る。初期状態設定時に認識されている既知のストリーム
位置に対するカーネル内ＳＴＲＥＡＭＳインターフェー
スを使用して交信が行われる。ほとんどの場合、このタ
スクは非常に簡単であり、クラスタ管理ソルーションに
含めることもできる。The modified put/service routine passes the request to the controlling thread, which takes the action and returns the result. Because the SAD uses ioctls, the application can safely wait until the M_IOCACK is generated. Communication takes place through the in-kernel STREAMS interface to a well-known stream location established at initialization. In most cases this task is very simple, and it can also be included in a cluster management solution.

【０３４１】11.0 分散ストリームヘッド要件ストリーム・スタックが複数のノード間に分割される
時、ストリームヘッド内容および関連機能性がこれらの
ノード間に反映されなければならない。そのため、制御
スレッドは、ストリームヘッド特有のメッセージすべて
を検査して、M_HANGUP状況からの回復のようなローカル
のアクションをとるか、あるいは新しいストリームヘッ
ド書込みオフセットを記録してそのメッセージを遠隔ノ
ードに送信するかいずれかを行う。以下は、メッセージ
・タイプとオプション、および制御スレッド内で取られ
る対応するアクションのリストである。11.0 Distributed Stream Head Requirements. When a stream stack is split across multiple nodes, the stream head contents and the associated functionality must be mirrored across those nodes. To that end, the controlling thread examines every stream-head-specific message and either takes a local action, such as recovering from an M_HANGUP condition, or records a new stream head write offset and sends the message to the remote node. The following is a list of the message types and options and the corresponding actions taken in the controlling thread.

【０３４２】M_HANGUP：このメッセージが到着すると、
sth_rput()がストリームヘッドをエラー状態(ENXIO)に
置く。それは、また、保留中のシステム呼び出しを再起
動し事象をポーリングする。ＳＴＲＥＡＭＳの範囲内
で、ストリームヘッド・フラグF_STH_CLUSTER_CTL_CHEC
Kがセットされているか否かを検証するようにコードが
拡張されなければならない。これがセットされていれ
ば、いかなる通常のステップも実行せず、メッセージを
待ち行列に記憶し、制御スレッドを再起動して、エラー
回復処理を実行させるか、単にメッセージを遠隔ノード
へ送付し応答を待つようにさせる。M_HANGUP: When this message arrives, sth_rput() places the stream head in an error state (ENXIO). It also wakes up any pending system calls and poll events. Within STREAMS, the code must be extended to check whether the stream head flag F_STH_CLUSTER_CTL_CHECK is set. If it is set, none of the normal steps are performed; instead the message is queued and the controlling thread is woken up, either to carry out error recovery or simply to forward the message to the remote node and wait for a response.

【０３４３】M_IOCACKおよびM_IOCNAK：制御スレッドに
よって使用されるか、あるいはアプリケーションioctl
に対する応答である。制御スレッドが使用を望むか否か
を検査するようにsth_rput()は修正されなければならな
い。それがセットされていれば、制御スレッドは再起動
し、経路データおよびioc_idフィールドを検査してioct
lがそれに属しているかどうかを判断する。セットされ
ていなければ、それは、目標経路を検査し、streams_pu
tmsg()を使用して遠隔ノードにメッセージを転送するＳ
−ＩＣＳにメッセージを送出する。M_IOCACK and M_IOCNAK: These are either consumed by the controlling thread or are responses to an application ioctl. sth_rput() must be modified to check a flag indicating whether the controlling thread wants to use them. If the flag is set, the controlling thread is woken up and examines the route data and the ioc_id field to determine whether the ioctl belongs to it. If not, it examines the target route and sends the message to the S-ICS, which uses streams_putmsg() to forward it to the remote node.

【０３４４】M_COPYINおよびM_COPYOUT：これらは巧妙
なメッセージである。それらは、ストリームヘッドへア
ドレスを渡しデータをある方向へコピーするように要求
するために使用される。目標アドレス空間は異なるノー
ド上に存在するため整列されてないので、制御スレッド
およびＳ−ＩＣＳは正しく動作するため協調しなければ
ならない。これらのメッセージはioctl処理の間にのみ
生成されることができるので、この情報を使用して問題
を解決することができる。いずれのメッセージに関して
も、ドライバは、データをcq_addrとの間でコピーする
ため目標アドレスを指定する。M_COPYIN and M_COPYOUT: These are tricky messages. They are used to pass an address to the stream head and ask it to copy data in one direction or the other. Because the target address spaces reside on different nodes and do not line up, the controlling thread and the S-ICS must cooperate for this to work correctly. These messages can be generated only during ioctl processing, and that fact can be used to solve the problem. For either message, the driver specifies the target address for the copy in cq_addr.

【０３４５】[0345] For M_COPYIN, the format is a single message block containing the target address. To handle it, the control thread records, against the target address, the fact that the driver is performing an M_COPYIN operation. The control thread then forwards the message to the S-ICS as an M_CTL message so that the S-ICS recognizes it as a special request. The remote S-ICS receives the message, allocates cq_size bytes of local kernel memory, updates cq_addr, and sends the message as an M_COPYIN request to the local stream head, which performs the copyin(). When this completes, the S-ICS sends the data back to the originating S-ICS, which passes it to the control thread. The control thread then moves the data to the driver's original address.
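The M_COPYIN sequence above can be mimicked in user space to make the address handling concrete. This is a loose simulation under stated assumptions: struct copy_req is a hypothetical stand-in for the copy request, malloc() stands in for kernel memory allocation, and memcpy() stands in for both copyin() and the transfer across the interconnect.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the copy request in an M_COPYIN. */
struct copy_req {
    void  *cq_addr;  /* address named by the driver; rewritten on the remote node */
    size_t cq_size;  /* number of bytes to copy */
};

/*
 * Remote S-ICS side: allocate cq_size bytes locally, point cq_addr at the
 * new buffer, and fill it from the application's buffer.
 * Returns the buffer to send back to the originating node, or NULL.
 */
void *remote_sics_copyin(struct copy_req *req, const void *user_buf)
{
    void *kbuf = malloc(req->cq_size);
    if (kbuf == NULL)
        return NULL;
    req->cq_addr = kbuf;                       /* address is now node-local */
    memcpy(req->cq_addr, user_buf, req->cq_size);
    return kbuf;
}

/* Originating side: the control thread moves the returned data to the
 * address it recorded before forwarding the request. */
void control_thread_complete(void *recorded_addr, const void *data, size_t size)
{
    memcpy(recorded_addr, data, size);
}
```

The point of the sketch is that the driver's original address must be remembered on the originating node, because cq_addr is overwritten with a node-local address once the request crosses the interconnect.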

【０３４６】[0346] For M_COPYOUT, the driver supplies the data in M_DATA blocks. Once the control thread has rewritten it as an M_CTL message, this message chain is forwarded to the remote S-ICS, which thereby recognizes that special handling is required. The S-ICS updates the target cq_addr to reflect the new local address and passes the message to the stream head, which performs the copyout().

【０３４７】[0347] M_IOCTL: Seen at the stream head when pipes are implemented. In that case an M_IOCNAK must be generated and returned to the remote sending application. Note that the S-ICS forwards these messages directly to the control thread, which makes that decision.
M_ERROR: As with M_HANGUP, a recovery policy may exist. If the flag is set, the control thread decides what to do; otherwise processing proceeds as usual.
M_SIG and M_PCSIG: Used by drivers and modules to generate signals for the application. If the control thread does not use signals, the message is forwarded automatically without being interpreted. The flag is checked in sth_rput() to determine whether the message should be processed there.

【０３４８】[0348] M_PASSFP: This message may be meaningless here, because it is assumed to move a pipe's file pointer from one endpoint to the other so that I_RECVFD can be performed. If the pipe ends reside on different nodes, there is no need to forward this information: a file pointer is node-specific and cannot be addressed across nodes.
M_FLUSH: These messages must be processed both locally and remotely. The stream head preview function or the control thread must forward the message.
M_SETOPTS: There are a few options that a driver may set on the stream head. When the remote driver sets these options, the data may need to be reflected locally or modified to reflect local values. Depending on the STREAMS implementation, the stream head preview function or the control thread performs a copyb() and corrects the message with the proper values; the M_SETOPTS is forwarded to the remote node, while interpretation is carried out locally on the modified copy. The fields and the corresponding actions are listed below.

【０３４９】[0349] SO_READOPT, SO_WROFF, SO_MINPSZ, SO_MAXPSZ, SO_MREADON, SO_MREADOFF, SO_ISTTY, SO_ISNTTY, SO_NDELON, SO_NDELOFF, SO_BAND, SO_TOSTOP and SO_TONSTOP: Passed to the remote stream head as is; no local modification is required.
SO_HIWAT and SO_LOWAT: The local stream head values must be updated, and the options are passed unmodified to the remote stream head. The values are recorded, and while flow control is in effect the local node behaves as it would in a non-clustered environment with respect to the amount of data allowed to reside on the local stream head.

【０３５０】[0350] SO_COWENABLE and SO_COWDISABLE: These make little sense within a cluster environment, because the driver is remote and the messages being processed arrive inside the kernel. Whether they should be forwarded depends on whether the driver or module needs this knowledge in order to take specific actions.
SO_FUNC_DISABLE and SO_FUNC_ENABLE: Used to control dynamic function registration. The control thread must examine the index in the M_SETOPTS message and map it to the remote function registration array. If the arrays are not set up with the same layout for each module and driver, the message must be updated. In addition, the referenced function must exist in the remote kernel image, and for it to be known there the driver or module must have been installed through str_install(). These bits cannot be interpreted locally, because the data transformation or copyin/checksum is performed on the remote node on behalf of the calling application.
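The per-option handling listed above for M_SETOPTS amounts to a small classification step. The following is a minimal sketch of that step, assuming placeholder bit values for the SO_* flags and hypothetical action names; only a subset of the "forward as is" options is spelled out, since every flag not named in a special group falls through to the same action.

```c
#include <assert.h>

/* Placeholder bit values chosen for this sketch only. */
#define SO_READOPT      0x0001u
#define SO_WROFF        0x0002u
#define SO_MINPSZ       0x0004u
#define SO_MAXPSZ       0x0008u
#define SO_HIWAT        0x0010u
#define SO_LOWAT        0x0020u
#define SO_COWENABLE    0x0040u
#define SO_COWDISABLE   0x0080u
#define SO_FUNC_ENABLE  0x0100u
#define SO_FUNC_DISABLE 0x0200u

/* Hypothetical names for the four kinds of handling described in the text. */
enum so_action {
    SO_ACT_FORWARD,            /* pass unchanged to the remote stream head */
    SO_ACT_UPDATE_AND_FORWARD, /* record the value locally, then pass unchanged */
    SO_ACT_FORWARD_IF_NEEDED,  /* COW bits: forward only if driver/module needs them */
    SO_ACT_REMAP_INDEX         /* function registration: remap index for remote node */
};

/* Classify a single SO_* flag taken from an M_SETOPTS message. */
enum so_action classify_so_flag(unsigned flag)
{
    if (flag & (SO_HIWAT | SO_LOWAT))
        return SO_ACT_UPDATE_AND_FORWARD;
    if (flag & (SO_COWENABLE | SO_COWDISABLE))
        return SO_ACT_FORWARD_IF_NEEDED;
    if (flag & (SO_FUNC_ENABLE | SO_FUNC_DISABLE))
        return SO_ACT_REMAP_INDEX;
    return SO_ACT_FORWARD;
}
```

A real message may carry several flags at once; in that case the classification would be applied to each set bit in turn.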

【０３５１】[0351] 12.0 Effects on Copy Avoidance
Copy avoidance and copy_on_write (COW) operate within the distributed STREAMS environment as they do in the prior art. Both must remain usable in a distributed STREAMS environment, and they require no changes to the current VM implementation. To show that this is possible, copy avoidance, that is, page remapping, is described below as an example. Page remapping is the ability to remap a kernel page from one space to another, for example from kernel space to user space or from user space to kernel space. Within a distributed STREAMS environment there are two STREAMS stacks, each communicating with a driver. (Note: if functions are simply shipped to the remote node without passing through a local stream, copy avoidance and COW are impossible and performance problems follow. This must be taken into account when choosing how the stack is implemented.) On the node where the application resides, the stack communicates with the local interconnect driver, which is a physical driver and produces page-aligned data that can easily be remapped into user space. The remote node, where the actual application driver resides, produces page-aligned data where possible. Because the in-kernel thread must pass this data to its local interconnect driver instance, the data does not need to be remapped (if necessary, the existing kernel-to-kernel page remap facility can be used).

【０３５２】[0352] For COW, this mechanism can be provided on the application's node, since the interconnect driver must always be developed with this capability. On the remote node there is no block of user memory to lock, so copy_on_write is unnecessary. For applications such as Sockets, which probe whether the driver supports checksum offload (checksum_off_load) as the condition for enabling copy_on_write, this works without modification, but STREAMS must ignore the result. The simplest way to do this is for the interconnect driver to send an M_SETOPTS message containing a new bit. The message tells the stream head that the stream is distributed and, as one of the resulting actions, the stream head sets the F_STH_COW flag.

【０３５３】[0353] 13.0 MP versus UP Emulation
The STREAMS implementations of Hewlett-Packard and many other vendors support multiprocessor (MP) and uniprocessor (UP) emulation drivers and modules. Hewlett-Packard STREAMS does not support mixing MP and UP modules and drivers within a single stream stack instance. Therefore, in this embodiment, when a UP module is pushed onto an MP driver, the driver and all other modules are converted to use UP emulation. There is no reverse process: when the UP module is popped, the stack is not converted back, even if every remaining module and driver is MP.

【０３５４】[0354] To split a UP stream stack, the conversion information that is applied at open() or I_PUSH ioctl time must be conveyed to the remote stack segment. To do this, a new stream head ioctl, I_UP_CONVERT, needs to be created. The ioctl is issued as an M_IOCTL message, which the control thread converts into a streams_ioctl(I_UP_CONVERT) call that invokes the algorithm converting all existing modules and drivers to UP. Of course, a simpler solution is not to support UP emulation in such an environment.
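The conversion rule described above (pushing a UP module converts the whole stack, including the remote segment, and nothing is converted back on pop) can be sketched as follows. The types and function names are hypothetical; in particular, the direct call to up_convert() on the remote segment merely stands in for delivering I_UP_CONVERT through the control thread.

```c
#include <assert.h>
#include <stddef.h>

enum sync_mode { MODE_MP, MODE_UP };

#define MAX_DEPTH 8

/* One stack segment: slot 0 is the driver, higher slots are pushed modules. */
struct stack_segment {
    enum sync_mode mode[MAX_DEPTH];
    size_t depth;
};

/* What a hypothetical I_UP_CONVERT handler would do on one segment. */
void up_convert(struct stack_segment *seg)
{
    for (size_t i = 0; i < seg->depth; i++)
        seg->mode[i] = MODE_UP;
}

/*
 * Push a module onto the local segment. If the module is UP, or the stack
 * is already running under UP emulation, convert both segments.
 * Returns 0 on success, -1 if the segment is full.
 */
int push_module(struct stack_segment *local, struct stack_segment *remote,
                enum sync_mode m)
{
    int already_up = local->depth > 0 && local->mode[0] == MODE_UP;

    if (local->depth >= MAX_DEPTH)
        return -1;
    local->mode[local->depth++] = m;
    if (m == MODE_UP || already_up) {
        up_convert(local);
        up_convert(remote);  /* stands in for sending I_UP_CONVERT */
    }
    return 0;
}
```

No pop-side logic is shown because the text specifies that no reverse conversion takes place.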

【０３５５】[0355] 14.0 Dynamic Function Replacement
Dynamic function replacement is used for a variety of tasks. Within a cluster this functionality requires no modification to keep working. The one exception is when a stream is split across nodes. In that case the index into the alternate streamtab entries is not the same on each node. The problem is solved by having the control thread perform the operation on the correct node using the corresponding index value for that node. This is done by modifying the ioctl implementation as shown in Table 31 below.

【０３５６】[0356]

[Table 31]

    case I_ALTSTRTAB_ACTIVE & 0xff:
        /* Allow this ioctl to be called on a lower mux */
        close_out_check((RWEL_ERROR_FLAGS & (~F_STH_LINKED)), sth);
        error = copyin(*(caddr_t *)data, buf, sizeof(struct straltactive));
        if (sth->sth_flags & F_STH_CLUSTER_CTL_CHECK) {
            error = func_replace_probe_ctl(sth->ctl_address, buf, I_ALTSTRTAB_ACTIVE);
            ioctl_dequeue_osr_wakeup(sth, osr);
            return (error);
        }
        error = osr_altstractive(osr);
        break;

    case I_ALTSTRTAB_FUTURE & 0xff:
        close_out_check(RWHL_ERROR_FLAGS, sth);
        error = copyin(*(caddr_t *)data, buf, sizeof(struct straltfuture));
        if (sth->sth_flags & F_STH_CLUSTER_CTL_CHECK) {
            func_replace_probe_ctl(sth->ctl_address, buf, I_ALTSTRTAB_FUTURE);
            error = func_reg_probe_ctl(sth->ctl_address, buf);
            ioctl_dequeue_osr_wakeup(sth, osr);
            return (error);
        }
        error = osr_altstrfuture(osr);
        break;

【０３５７】[0357] func_replace_probe_ctl() sends and receives a message using a buffer that is updated to reflect the appropriate index value. The control thread actually queries the remote control thread, which performs the operation on its node and returns the result. Note: this code is not on the main data path, so no performance degradation is observed by the application or by other stacks. Also, because everything takes place at the stream head level, the thread is able to sleep while waiting for the control thread's response.
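The text does not say how the control thread finds the corresponding index on the other node. One possible approach, shown here purely as an assumption, is to match alternate streamtab entries by name; the table layout, the entry names and the function names below are all hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_ALT 4

/* Hypothetical per-node table of alternate streamtab entries. The same
 * entry may sit at a different index on each node. */
struct alt_table {
    const char *names[MAX_ALT];
    int count;
};

/* Return the index of 'name' in table 't', or -1 if it is not registered. */
int lookup_index(const struct alt_table *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return i;
    return -1;
}

/* Translate an index valid on the local node into the index of the same
 * entry on the remote node; -1 if the index is invalid or the entry is
 * missing remotely. */
int remap_index(const struct alt_table *local, const struct alt_table *remote,
                int local_idx)
{
    if (local_idx < 0 || local_idx >= local->count)
        return -1;
    return lookup_index(remote, local->names[local_idx]);
}
```

A failed lookup corresponds to the case where the referenced entry does not exist on the remote node, which the caller would have to report as an error.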

【０３５８】[0358] 15.0 Requirements for a Source-Based Interconnect
Few changes are needed to run distributed streams in a source-based interconnect environment. STREAMS operates as usual as far as memory management and the use of allocb()/freeb() are concerned. When a message is about to be sent, the data must be iomapped so that the interconnect has an IOVA (I/O virtual address) for it. This mapping operation costs about twelve instructions, but it can be avoided by creating a few TLB entries dedicated to interconnect kernel operations.

【０３５９】[0359] Because the data always resides in the kernel, it does not need to be locked in memory the way data in user space would. If it is a COW (copy_on_write) buffer, the user data has already been locked by the COW call, so there is no difference. On the receiving side, the S-ICS driver allocates and locks enough memory to satisfy the distributed STREAMS requirements. All user data and mblk headers arrive in this memory, and esballoc() is applied to them. The free function is an interconnect API call that releases the memory and acknowledges receipt to the remote node.
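The receive-side arrangement above relies on the caller-supplied free routine that esballoc() attaches to an externally allocated buffer. The user-space sketch below imitates that pattern under stated assumptions: struct free_rtn is only modeled on the free-routine argument of esballoc(), malloc() stands in for locked interconnect memory, and the acknowledgement is reduced to a counter. None of the names come from an actual interconnect API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in modeled on the free-routine descriptor of esballoc(). */
struct free_rtn {
    void (*free_func)(void *arg);
    void *free_arg;
};

/* Hypothetical receive region owned by the interconnect. */
struct rx_region {
    unsigned char *mem;
    int acks_sent;          /* counts acknowledgements to the remote node */
};

/* Hypothetical message wrapper around externally owned memory. */
struct ext_buf {
    unsigned char *base;
    size_t size;
    struct free_rtn frtn;
};

/* Free routine: release the memory and acknowledge receipt. */
void interconnect_release(void *arg)
{
    struct rx_region *r = arg;
    free(r->mem);
    r->mem = NULL;
    r->acks_sent++;
}

/* Wrap a freshly received region, as esballoc() would, with the release
 * routine attached. Returns 0 on success, -1 on allocation failure. */
int wrap_received(struct ext_buf *b, struct rx_region *r, size_t size)
{
    r->mem = malloc(size);
    if (r->mem == NULL)
        return -1;
    b->base = r->mem;
    b->size = size;
    b->frtn.free_func = interconnect_release;
    b->frtn.free_arg = r;
    return 0;
}

/* Called when the last reference to the message is dropped. */
void ext_buf_free(struct ext_buf *b)
{
    b->frtn.free_func(b->frtn.free_arg);
    b->base = NULL;
}
```

Tying the acknowledgement to the free routine means the remote node is only told the data has been received once the consumer has actually finished with the buffer.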

【０３６０】[0360] Because the receive buffer area is locked by the interconnect, the page remap facility does not work between the local node and the application. To solve this, the application is modified to supply a receive buffer area that the S-ICS can associate with this stream instance. This requires the application to issue an additional ioctl giving the address and length of the buffer. STREAMS can continue to allocate mblks out of this area with esballoc(), but it must mark them as already residing in user space. The stream head routine checks for this case and skips the data movement step. To avoid degrading performance the stream head routines should stay simple, so the only change made is to remap the STREAMS f_ops to a cluster-specific vector that understands this behavior. A similar change can be made on the write side. Note: for sockets, this can be done in a similar way, depending on whether STREAMS is able to move the data.

【０３６１】[0361] The only problem is the interconnect_ctl_block, but the S-ICS sets it up in kernel memory and modifies it so that the other interconnect APIs recognize it. In most cases, a source-based interconnect implementation has many advantages over an FC4 implementation. It does not suffer from congestion control problems. Because it is source-managed, it does not overrun buffers. It does not drop packets. It has built-in low-latency mechanisms and a group notification mechanism, so global port mapping and other database synchronization operations can be implemented faster and more easily. Overall, the source-based architecture is the best architecture for implementing this environment; other architectures require additional design and complexity to deal with the problems described above.

【０３６２】[0362] The principles of the invention have been described above with reference to specific embodiments, but it will be apparent that the arrangement and details of the invention can be modified without departing from those principles.

【０３６３】[0363] The present invention includes, by way of example, the following embodiments.
(1) A multicomputer system having a distributed STREAMS capability, comprising: a cluster of at least two nodes interconnected through a data communication interconnect subsystem, each node having a computer including one or more system processor units, local memory, and an input/output subsystem; an operating system that runs on each of the system processor units in the cluster and includes a STREAMS message-passing mechanism used in implementing one or more of networking protocols, client/server applications, and services; a software application that runs on the system processor unit of at least one of the nodes under control of the operating system in order to perform a task or solve a problem; means for creating, on a first, originating node of the nodes, a distributed STREAMS instance independent of the application and of the STREAMS message-passing mechanism; and means for migrating at least a portion of the distributed STREAMS instance to a second, target node of the nodes so that a selected task of the software application is performed on the second target node transparently to the networking protocols, client/server applications, and services on the first node.

【０３６４】[0364] (2) The multicomputer system according to (1) above, wherein the STREAMS message-passing mechanism includes a STREAMS data structure located in each of the computers to provide a bidirectional data path between a user process and a device driver through modules, which are intermediate elements that can be dynamically added or removed; the device driver resides in the system kernel portion of the operating system and operates to control a peripheral device and transfer data between the kernel and the peripheral device, so that data received by the device driver can move toward the user process; and the STREAMS data structure includes a stack stored in local memory and holds a set of function pointers for controlling execution of module and driver functions.
(3) The multicomputer system according to (2) above, further comprising: control thread means, included in the operating system, for communicating with and controlling the STREAMS message-passing mechanism in accordance with cluster parameters; and means for defining a set of preview functions, including a dynamic function replacement function, for multiplexing messages among a plurality of components of the system; wherein the device driver includes a physical cluster interconnect driver (P-ICS) that provides basic communication between nodes in the system under control of the control thread and in accordance with the preview functions.

【０３６５】[0365] (4) The multicomputer system according to (3) above, further comprising means for defining a STREAMS software interconnect driver (S-ICS) that moves messages between at least two of the following system components: the application's STREAMS stack, the control thread, the P-ICS, and the preview functions.
(5) The multicomputer system according to (4) above, further comprising means for defining a middleware thread that operates to process and pass messages between the operating system and the S-ICS.
(6) The multicomputer system according to (4) above, wherein the means for creating a distributed STREAMS instance includes: a file system, located in the operating system of each node, that performs an open function in response to a request from the application initiating an open operation on the STREAMS message-passing mechanism; means for saving the device number of the interconnect driver on the first, originating node in association with a target device number; means, included in the STREAMS message-passing mechanism, for opening a selected driver on the first node; means for remapping the saved device number to a node address for the second target node; means, located in the interconnect driver, for transmitting to the target node a message containing the application's open request and the saved device number; and a file system, located in the target node, that opens the target driver in response to the message.

【０３６６】[0366] (7) The multicomputer system according to (6) above, wherein the operating system on the target node includes means for returning to the interconnect driver of the originating node a message containing the stream head address for the target driver.
(8) The multicomputer system according to (2) above, wherein the means for migrating at least a portion of the distributed STREAMS instance to the second target node includes means for transmitting at least a portion of the stack on the first node to the second node.
(9) The multicomputer system according to (8) above, further comprising means for issuing to the STREAMS message-passing mechanism an ioctl indicating that a migration is to be performed, wherein the STREAMS message-passing mechanism, in response to the ioctl, freezes the stack on the originating node and transmits to the target node a message containing information on which node's stack can be used to recreate the stack on the target node.
(10) The multicomputer system according to (9) above, wherein the STREAMS message-passing mechanism performs an organizing function on a portion of the data structure and includes a private data structure for the modules represented in the stack on the target node.

【０３６７】[0367] (11) A method of creating a distributed STREAMS data structure in a multicomputer system having a distributed STREAMS capability, the system comprising: a cluster of at least two nodes interconnected through a data communication interconnect subsystem, each node having a computer including one or more system processor units, local memory, and an input/output subsystem; an operating system that runs on each of the system processor units in the cluster and includes a STREAMS message-passing mechanism used in implementing one or more of networking protocols, client/server applications, and services; and a software application that runs on the system processor unit of at least one of the nodes under control of the operating system in order to perform a task or solve a problem; the method comprising the steps of: starting a control thread on each node in the cluster, including determining where the STREAMS drivers are located within each node and which facilities are associated with each driver; setting, in the STREAMS message-passing mechanism on each node in the cluster, a flag indicating that clustering is in effect; assigning, in order to uniquely identify a particular driver on a node in the cluster, a file name that encodes major number table identification for each driver and a minor number parameter that selectively identifies at least one driver on a remote node and a local facility on the local node; propagating the file names of the drivers on each node to the other nodes in the cluster through the file system; and causing a STREAMS open function to be executed to open the driver indicated by the file name, examining the file name to determine whether it is a clustered facility, and if so passing its initial major and minor numbers to the control thread; wherein the control thread uses the major and minor numbers to look up the device and facility they represent, retrieves new major and minor numbers, and returns those numbers to the STREAMS message-passing mechanism together with a set of facility identifiers; if the initial major and minor numbers relate to a facility on a remote node, the STREAMS message-passing mechanism of the local node opens a STREAMS software interconnect driver (S-ICS) on the local node, and the S-ICS driver notifies the control thread on the remote node of the open request; the control thread operates to establish the S-ICS on the local node together with its initial state; and the control thread on the remote node performs an internal STREAMS open in order to create, on the remote node, a distributed STREAMS instance of the STREAMS data structure instance on the local node.

【０３６８】（１２）上記初期メジャー番号およびマイ
ナー番号がローカル・ノード上の機構に関連する場合、
ローカル・ノードにおける上記ＳＴＲＥＡＭＳメッセー
ジ伝達メカニズムがそのドライバに関連したローカル機
構を使用可能にするため、指定されたローカル・ドライ
バをオープンするステップを含む、上記（１１）に記載
された方法。（１３）上記ＳＴＲＥＡＭＳメッセージ伝達メカニズム
が、動的に付加または削除が可能な中間エレメントであ
るモジュールを介してユーザ・プロセスとデバイス・ド
ライバの間の双方向データ経路を提供するため上記コン
ピュータの各々に配置されるＳＴＲＥＡＭＳデータ構造
を含み、上記デバイス・ドライバが上記オペレーティン
グ・システムのシステム・カーネル部に駐在し、当該デ
バイス・ドライバによって受け取られるデータがユーザ
・プロセスに向かって移動することができるようにする
ため周辺装置を制御して上記カーネルと上記周辺装置間
でデータを転送するように上記デバイス・ドライバが動
作し、上記ＳＴＲＥＡＭデータ構造が、ローカル・メモ
リに記憶されたスタックを含み、モジュールおよびドラ
イバ関数の実行を制御するため一組の関数ポインタを保
持し、上記第１のノード上におけるスタックの少なくと
も一部を上記第２のノードへ送信することによって上記
第２の目標ノードへ上記分散ＳＴＲＥＡＭＳインスタン
スの少なくとも一部を移行させるステップを更に含む、
上記（１１）に記載の方法。（１４）ＳＴＲＥＡＭＳスタックが、それをアクセスし
ているアプリケーションが稼働しているノードと異なる
ノード上で実行される、上記（１３）に記載の方法。（１５）ＳＴＲＥＡＭが、モジュール／ドライバ／スト
リームヘッド・レベルで、上記クラスタ内の異なる個々
のノード上で実行される構成コンポーネントに分割され
る、上記（１３）に記載の方法。（１６）上記分散ＳＴＥＡＭＳデータ構造が、各々が上
記クラスタ内の異なるノード上で実行される２つのパイ
プ端点を持つＳＴＲＥＡＭＳ型パイプを含む、上記（１
３）に記載の方法。（１７）ＳＴＲＥＡＭＳスタックが、全体としてまたは
部分的に、あるノードから別のノードへ移行される、上
記（１３）に記載の方法。（１８）上記第２のノードの上での障害状態を検出しし
てエラー回復を始動するステップを含む、上記（１３）
に記載の方法。（１９）上記目標ノード上のスタックにおいて表される
モジュールに関するプライベート・データ構造を複製す
るために必要とされる情報を収集するため上記データ構
造の一部に関して整理関数を実行するステップを含む上
記（１３）に記載の方法。
(12) The method according to (11) above, including the step of, when the initial major and minor numbers relate to a mechanism on the local node, opening the designated local driver so that the STREAMS messaging mechanism on the local node enables the local mechanism associated with that driver.
(13) The method according to (11) above, wherein the STREAMS messaging mechanism includes a STREAMS data structure located on each of the computers to provide a bidirectional data path between a user process and a device driver via modules, which are intermediate elements that can be dynamically added or removed; the device driver resides in the system kernel portion of the operating system and operates to control a peripheral device and transfer data between the kernel and the peripheral device so that data received by the device driver can move toward the user process; and the STREAMS data structure includes a stack stored in local memory and holds a set of function pointers for controlling execution of module and driver functions; the method further including the step of migrating at least a portion of the distributed STREAMS instance to the second target node by transmitting at least a portion of the stack on the first node to the second node.
(14) The method according to (13) above, wherein the STREAMS stack is executed on a node different from the node on which the application accessing it is running.
(15) The method according to (13) above, wherein the stream is divided, at the module/driver/stream head level, into constituent components that run on different individual nodes in the cluster.
(16) The method according to (13) above, wherein the distributed STREAMS data structure includes a STREAMS-based pipe having two pipe endpoints, each running on a different node in the cluster.
(17) The method according to (13) above, wherein the STREAMS stack is migrated, in whole or in part, from one node to another.
(18) The method according to (13) above, including the step of detecting a failure condition on the second node and initiating error recovery.
(19) The method according to (13) above, including the step of executing an organizing function on a portion of the data structure to gather the information needed to replicate the private data structures of the modules represented in the stack on the target node.
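Items (13), (17), and (19) describe a stack that holds function pointers and can be moved, in whole or in part, by sending it to another node together with each module's private data. A minimal sketch of that idea follows, assuming a toy model in which a registry of Python functions stands in for the function pointer table and JSON stands in for the wire format. `REGISTRY`, `Stream`, `pack`, and `unpack` are hypothetical names, not part of STREAMS or of the patent.

```python
import json


def upper_put(state, msg):
    """Toy module: upper-cases the message and counts calls."""
    state["count"] += 1
    return msg.upper()


def tag_put(state, msg):
    """Toy module: prefixes the message with a tag and counts calls."""
    state["count"] += 1
    return f"[{state['tag']}] {msg}"


# Stand-in for the set of function pointers: module name -> put function.
# Every node is assumed to have the same registry installed.
REGISTRY = {"upper": upper_put, "tag": tag_put}


class Stream:
    """A stack of (module name, private state) pairs, stream head first."""

    def __init__(self, entries):
        self.entries = [(name, dict(state)) for name, state in entries]

    def write(self, msg):
        # Pass the message downstream through each module in order.
        for name, state in self.entries:
            msg = REGISTRY[name](state, msg)
        return msg

    def pack(self, start=0):
        # Gather what the target needs to rebuild the stack from `start`
        # downward: the module names and their private data.
        return json.dumps(self.entries[start:])

    @classmethod
    def unpack(cls, blob):
        # Rebuild the stack on the target from the transmitted description.
        return cls([(name, state) for name, state in json.loads(blob)])
```

In this model, `Stream.unpack(src.pack())` yields a copy whose modules resume with the private state they had on the source, and `pack(start=1)` transmits only the lower part of the stack for a partial migration.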

【０３６９】（２０）分散ＳＴＲＥＡＭＳ機能を有する
マルチコンピュータ・システムであって、１つまたは複
数のシステム・プロセッサ装置を含むコンピュータ、ロ
ーカル・メモリおよび入出力サブシステムをそれぞれが
有し、データ通信相互接続サブシステムを介して相互接
続された少なくとも２つのノードから構成されるクラス
タと、上記クラスタにおいて上記システム・プロセッサ
装置の各々の上で稼働し、ネットワーキング・プロトコ
ール、クライアント／サーバ・アプリケーションおよび
サービスの１つまたは複数の実施の際に使用されるＳＴ
ＲＥＡＭＳメッセージ伝達メカニズムを含むオペレーテ
ィング・システムと、タスクの実行または問題の解決の
ため上記オペレーティング・システムの制御の下で少く
とも１つの上記ノードのシステム・プロセッサ装置上で
稼働するソフトウェア・アプリケーションと、上記複数
ノードのうちの第１の始動元ノード上に、上記アプリケ
ーションおよび上記ＳＴＲＥＡＭＳメッセージ伝達メカ
ニズムとは独立した分散ＳＴＲＥＡＭＳインスタンスを
作成する手段と、上記クラスタにおいて上記システム・
プロセッサ装置の各々の上で稼働し、ネットワーキング
・プロトコール、クライアント／サーバ・アプリケーシ
ョンおよびサービスの１つまたは複数の実施の際に使用
されるＳＴＲＥＡＭＳメッセージ伝達メカニズムを含む
オペレーティング・システムと、タスクの実行または問
題の解決のため上記オペレーティング・システムの制御
の下で少くとも１つの上記ノードのシステム・プロセッ
サ装置上で稼働するソフトウェア・アプリケーション
と、上記複数ノードのうちの第１の始動元ノード上で、
上記アプリケーションおよび上記ＳＴＲＥＡＭＳメッセ
ージ伝達メカニズムとは独立した分散ＳＴＲＥＡＭＳイ
ンスタンスを作成する手段と、上記第１のノード上の分
散ＳＴＲＥＡＭＳインスタンスと上記第２のノード上の
ＳＴＲＥＡＭＳメッセージ伝達メカニズムの間の通信を
制御するプレビュー関数を含むクラスタ機構を分散ＳＴ
ＲＥＡＭＳインスタンスの制御の下で上記第２のノード
上で使用する手段と、を備えるマルチコンピュータ・シ
ステム。
(20) A multi-computer system having distributed STREAMS capability, comprising: a cluster consisting of at least two nodes interconnected via a data communication interconnect subsystem, each node having a computer including one or more system processor units, a local memory, and an input/output subsystem; an operating system running on each of the system processor units in the cluster and including a STREAMS messaging mechanism used in implementing one or more of networking protocols, client/server applications, and services; a software application running on the system processor unit of at least one of the nodes under the control of the operating system to perform a task or solve a problem; means for creating, on a first initiating node of the plurality of nodes, a distributed STREAMS instance independent of the application and the STREAMS messaging mechanism; and means for using, on the second node under the control of the distributed STREAMS instance, a cluster mechanism including a preview function that controls communication between the distributed STREAMS instance on the first node and the STREAMS messaging mechanism on the second node.

【０３７０】（２１）上記ＳＴＲＥＡＭＳメッセージ伝
達メカニズムが、動的に付加または削除が可能な中間エ
レメントであるモジュールを介してユーザ・プロセスと
デバイス・ドライバの間の双方向データ経路を提供する
ため上記コンピュータの各々に配置されるＳＴＲＥＡＭ
Ｓデータ構造を含み、上記デバイス・ドライバが上記オ
ペレーティング・システムのシステム・カーネル部に駐
在し、当該デバイス・ドライバによって受け取られるデ
ータがユーザ・プロセスに向かって移動することができ
るようにするため周辺装置を制御して上記カーネルと上
記周辺装置間でデータを転送するように上記デバイス・
ドライバが動作し、上記ＳＴＲＥＡＭデータ構造が、ロ
ーカル・メモリに記憶されたスタックを含み、モジュー
ルおよびドライバ関数の実行を制御するため一組の関数
ポインタを保持し、上記アプリケーションおよび上記Ｓ
ＴＲＥＡＭＳ全体は同一ノード上に存在するが、ＳＴＲ
ＥＡＭＳメッセージ伝達メカニズムはハードウェア共
用、付加平均化および高い可用性というクラスタ機能の
利点を引き出すため利用される、上記（２０）に記載の
マルチコンピュータ・システム。（２２）上記オペレーティング・システムが、上記ノー
ドの１つのローカル・メモリに記憶され、コンピュータ
の上でのアプリケーション命令の実行を通して一組の機
能性を提供するソフトウェア処理エレメントによって定
義されるスレッドを含み、上記スレッドが、分散ＳＴＲ
ＥＡＭに関する第三者通信および制御ポイントとしての
役割をはたす専用スレッドによって定義される制御スレ
ッドを含む、上記（２０）に記載のマルチコンピュータ
・システム。
(21) The multi-computer system according to (20) above, wherein the STREAMS messaging mechanism includes a STREAMS data structure located on each of the computers to provide a bidirectional data path between a user process and a device driver via modules, which are intermediate elements that can be dynamically added or removed; the device driver resides in the system kernel portion of the operating system and operates to control a peripheral device and transfer data between the kernel and the peripheral device so that data received by the device driver can move toward the user process; the STREAMS data structure includes a stack stored in local memory and holds a set of function pointers for controlling execution of module and driver functions; and the application and the entire STREAMS reside on the same node, while the STREAMS messaging mechanism is used to draw on the cluster advantages of hardware sharing, load balancing, and high availability.
(22) The multi-computer system according to (20) above, wherein the operating system includes threads, each defined by a software processing element that is stored in the local memory of one of the nodes and provides a set of functionality through execution of application instructions on a computer, the threads including a control thread defined by a dedicated thread that serves as a third-party communication and control point for the distributed STREAMS.

【０３７１】[0371]

【発明の効果】マルチコンピュータ・システムのクラス
タ環境において、本発明の分散ＳＴＲＥＡＭＳを実施す
ることによって、単一地点が故障しても分散アプリケー
ションが停止しない高い可用性、アプリケーション負荷
の平準化によるクラスタ全体資源の有効活用、ハードウ
ェア共用、計算およびＩ／Ｏ帯域総和の増加、および大
容量記憶域やネットワーキング資源へのアクセス増加な
どの効果を奏することができる。
[Effects of the Invention] Implementing the distributed STREAMS of the present invention in the cluster environment of a multi-computer system provides benefits such as high availability, in which a distributed application does not stop even if a single point fails; effective use of cluster-wide resources through application load balancing; hardware sharing; an increase in aggregate computation and I/O bandwidth; and increased access to mass storage and networking resources.

[Brief description of the drawings]

【図１】本発明に従って高速ディジタル・データが相互
接続機構によって相互接続されたコンピュータ・ノード
・クラスタの一例を示すブロック図である。FIG. 1 is a block diagram illustrating an example of a cluster of computer nodes interconnected by a high-speed digital data interconnect in accordance with the present invention.

【図２】図１のクラスタの一部の詳細を示すもので、ノ
ードＣおよびノードＤのモード交替および相互動作を示
すブロック図である。FIG. 2 is a block diagram showing details of part of the cluster of FIG. 1, illustrating mode switching and interoperation of nodes C and D.

【図３】図２の代替実施形態を示すブロック図である。FIG. 3 is a block diagram illustrating an alternative embodiment of FIG. 2.

【図４】図１のクラスタの一部の詳細を示すもので、ノ
ードＡおよびノードＢ上のＴＣＰ／ＩＰの動作のためス
トリーム・スタックを分割する代替的構成を示すブロッ
ク図である。FIG. 4 is a block diagram showing details of part of the cluster of FIG. 1, illustrating alternative configurations for splitting the stream stack for TCP/IP operation on nodes A and B.

【図５】図４の（Ｂ）の構成におけるデータ構造パケッ
ト経路指定プロセスを示すブロック図である。FIG. 5 is a block diagram showing a data structure packet routing process in the configuration of FIG. 4 (B).

【図６】ミドルウェア・スレッドの制御の下複数のポー
トのグローバル対応付けを行うため複数インスタンスに
おけるパケット経路指定プロセスを示す図４の（Ｂ）に
対応するブロック図である。FIG. 6 is a block diagram corresponding to FIG. 4B showing a packet routing process in a plurality of instances for performing global association of a plurality of ports under the control of a middleware thread.

【図７】図５および図６のパケット経路指定プロセスを
実施するために使用されるデータ構造のブロック図であ
る。FIG. 7 is a block diagram of a data structure used to implement the packet routing process of FIGS. 5 and 6.

【図８】制御スレッドとミドルウェアの間のメッセージ
のフローを示すブロック図である。FIG. 8 is a block diagram showing a message flow between a control thread and middleware.

【図９】図６のプロセスにおける制御スレッドとミドル
ウェアの間のメッセージのフローを示すブロック図であ
る。FIG. 9 is a block diagram showing the message flow between a control thread and middleware in the process of FIG. 6.

【図１０】図５および図６のＳ−ＩＣＳを実施するため
に使用されるデータ構造の例を示すブロック図である。FIG. 10 is a block diagram illustrating an example of a data structure used to implement the S-ICS of FIGS. 5 and 6.

【図１１】図１０のデータ構造を使用して遠隔インデッ
クスを特定するプロセスの状態／フローを示すブロック
図である。FIG. 11 is a block diagram illustrating the state/flow of a process for identifying a remote index using the data structure of FIG. 10.

【図１２】図１０のデータ構造における目標待ち行列を
特定するため図５および図６のＳ−ＩＣＳによって使用
されるプロセスの状態／フローを示すブロック図であ
る。FIG. 12 is a block diagram illustrating the state/flow of a process used by the S-ICS of FIGS. 5 and 6 to identify a target queue in the data structure of FIG. 10.

【図１３】ノードＤ上で図３のプレビュー機能を実施す
るため図１０のデータ構造を使用して目標待ち行列イン
デックスを特定するプロセスの状態／フローを示すブロ
ック図である。FIG. 13 is a block diagram illustrating the state/flow of a process for identifying a target queue index using the data structure of FIG. 10 to implement the preview function of FIG. 3 on node D.

【図１４】図１５と共に、図２および図３においてＰ−
ＩＣＳおよびＳ−ＩＣＳを経由してＤＬＰＩまたはＩＰ
へのメッセージ経路指定を行う動作の流れ図である。FIG. 14 is, together with FIG. 15, a flowchart of the operation of routing messages to DLPI or IP via the P-ICS and S-ICS in FIGS. 2 and 3.

【図１５】図１４と共に、図２および図３においてＰ−
ＩＣＳおよびＳ−ＩＣＳを経由してＤＬＰＩまたはＩＰ
へのメッセージ経路指定を行う動作の流れ図である。FIG. 15 is, together with FIG. 14, a flowchart of the operation of routing messages to DLPI or IP via the P-ICS and S-ICS in FIGS. 2 and 3.

【図１６】回復メカニズムを備えたプロセスが図１のク
ラスタの２つのノードに分散されたＳＴＲＥＡＭＳフレ
ームワークにおいてドライバをオープンする代替実施形
態を示すブロック図である。FIG. 16 is a block diagram illustrating an alternative embodiment in which a process with a recovery mechanism opens a driver in a STREAMS framework distributed across two nodes of the cluster of FIG. 1.

【図１７】回復メカニズムを備えていない図１のクラス
タにおいて１つの遠隔ノード上でのみＳＴＲＥＡＭＳフ
レームワークのドライバをオープンする第２の代替プロ
セスを示すブロック図である。FIG. 17 is a block diagram illustrating a second alternative process of opening the driver of the STREAMS framework on only one remote node in the cluster of FIG. 1 without a recovery mechanism.

【図１８】図１のシステムにおいてログ・マルチプレク
サによってＳＴＲＥＡＭＳログ記録を行う単一システム
様態を示すブロック図である。FIG. 18 is a block diagram illustrating a single-system mode in which STREAMS logging is performed by a log multiplexer in the system of FIG. 1.

[Explanation of symbols]

２０クラスタ３０ストリームヘッド３１、３３、３５プレビュー関数セット３２ＴＣＰ／ＵＤＰ３４制御スレッド３６物理的クラスタ相互接続ドライバ(Ｐ−ＩＣＳ) ３８ソフトウェア・クラスタ相互接続ドライバ(Ｓ−ＩＣＳ) ４０、４２ＤＬＰＩ４４ＩＰＭＵＸ(マルチプレクサ) ４６ローカル・データ構造４８ポインタ５０ミドルウェア・スレッド５２ファンアウト・テーブル
Reference Signs List
20 cluster
30 stream head
31, 33, 35 preview function set
32 TCP/UDP
34 control thread
36 physical cluster interconnect driver (P-ICS)
38 software cluster interconnect driver (S-ICS)
40, 42 DLPI
44 IP MUX (multiplexer)
46 local data structure
48 pointer
50 middleware thread
52 fanout table

Claims

[Claims]

1. A multi-computer system having distributed STREAMS capabilities, comprising: a cluster consisting of at least two nodes interconnected through a data communication interconnect subsystem, each node having a computer including one or more system processor units, a local memory, and an input/output subsystem; an operating system running on each of the system processor units in the cluster and including a STREAMS messaging mechanism used in implementing one or more of networking protocols, client/server applications, and services; a software application running on the system processor unit of at least one of the nodes under the control of the operating system to perform a task or solve a problem; means for creating, on a first initiating node of the plurality of nodes, a distributed STREAMS instance independent of the application and the STREAMS messaging mechanism; and means for migrating at least a portion of the distributed STREAMS instance to a second target node of the plurality of nodes, transparently to the networking protocols, applications, and services on the first node, in order to perform selected tasks of the software application on the second target node.