JP2000207233A - Multi-thread processor

Info

Publication number: JP2000207233A
Application number: JP11005863A
Authority: JP
Inventors: Naoki Nishi (西直樹); Atsushi Torii (鳥居淳); Junji Sakai (酒井淳嗣); Hiroshi Osawa (大澤拓); Satoshi Matsushita (松下智)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-01-12
Filing date: 1999-01-12
Publication date: 2000-07-28
Anticipated expiration: 2019-01-12
Also published as: JP3604029B2

Abstract

PROBLEM TO BE SOLVED: To make both a hardware scheduling function and a software scheduling function available. SOLUTION: Thread execution units 10-0 to 10-3 have machine-language instructions that enable a thread being processed to generate a new thread and a thread being processed to terminate. The processor further has a function by which hardware directly assigns the new thread, a function by which software instructs whether or not the hardware assignment is to be performed, a function of holding the register context of the generated thread when the hardware assignment is not performed, and a function of restoring the held register context on the thread execution units 10-0 to 10-3 when the thread being processed terminates. As needed, a thread execution unit itself processes the new thread generated by the thread it was processing, after that thread has finished.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a multi-thread microprocessor that executes a plurality of instructions simultaneously in an information processing apparatus, and more particularly to a multi-thread processor characterized by its technique for scheduling the execution of a plurality of threads.

[0002]

[Prior Art] One technique for executing a single program at high speed is to divide the program into a plurality of threads (instruction streams) and process them in parallel at the thread level. In this field, schemes have also been proposed that aim at fast execution even for finer-grained threads by implementing basic operations at the machine-language instruction level: thread generation, assignment of threads to processor execution resources and their execution, and thread termination and disposal.

[0003] Prior art of this kind includes, for example, the technique disclosed in "Multiscalar Processor" (G. S. Sohi, S. E. Breach and T. N. Vijaykumar, The 22nd International Symposium on Computer Architecture, IEEE Computer Society Press, 1995, pp. 414-425) and the technique disclosed in "Proposal of MUSCAT, an On-Chip Multiprocessor-Oriented Control-Parallel Architecture" (Torii, Kondo, Motomura, Konagaya, Nishi, Proceedings of the Parallel Processing Symposium JSPP97, Information Processing Society of Japan, pp. 229-236, May 1997).

[0004] According to the techniques disclosed in these documents, a program written in a programming language such as C can be parallelized at the level of basic blocks, or of macro basic blocks that group several basic blocks together, to speed up processing. A speedup is possible even at a very small program granularity (on the order of μsec).

[0005] Two major features of fine-grained thread execution schemes such as Multiscalar and MUSCAT are hardware scheduling, which provides fast thread generation and assignment to execution resources, and register content inheritance, which enables fast data transfer between threads. Prior art relating to hardware scheduling of threads includes, for example, the thread execution method disclosed in Japanese Patent Application Laid-Open No. 10-27108. As a technique relating to register content inheritance, there is a patent application filed by the present applicant (Japanese Patent Application No. 10-117666). These techniques divide a program based on a sequential execution model at fine granularity and execute the pieces in parallel, enabling high-speed execution. They also anticipate having a compiler divide a program into threads automatically.

[0006] Among techniques for parallel thread execution, besides those that support multithreading in processor hardware as described above, there are also parallel programming models in which parallelism is described explicitly. There have also been many attempts to achieve actual parallel execution on SMP-type multiprocessors. For example, as disclosed in "Implementation of a Highly Portable Lightweight Process Mechanism on BSD UNIX" (Abe, Matsuura, Taniguchi, Transactions of the Information Processing Society of Japan, Vol. 36, No. 2, pp. 296-303, Feb. 1995), research using user-level library schedulers is being actively pursued to realize lighter and faster software threads within the limits of what software techniques allow.

[0007] When hardware scheduling and software scheduling of the parallel execution of a plurality of threads are compared in these conventional techniques, the strengths and weaknesses of one are the weaknesses and strengths of the other.

[0008] That is, the strength of the hardware approach is that parallel processing yields a speedup even when the parallel processing granularity is very small. Its weaknesses are that the number of schedulable threads is tightly limited, because the hardware directly manages and stores them, and that it lacks flexibility, because hardware scheduling often fails to deliver the performance characteristics that the software wants.

[0009] The software approach, on the other hand, has the following strengths. The number of manageable threads can be made far larger than with the hardware approach. It is well suited to the event-driven programming style often used in GUI (Graphical User Interface) programs and communication transaction processing. It is also highly flexible, since the scheduling method can be tuned individually to the characteristics that each target application requires. Its weakness is that overhead grows when the parallel processing granularity is small, which makes a speedup difficult.

[0010] In other words, because the hardware approach and the software approach complement each other, it is desirable that both functions can operate together and that the two approaches can be freely combined.

[0011] Next, the problems of executing a program that mixes hardware scheduling and software scheduling are examined in more depth. FIGS. 20 to 22 are conceptual diagrams that illustrate the relationship between parallelization granularity and scheduling, using as an example six jobs on the order of msec that can be processed in parallel (threads T0 to T5).

[0012] FIG. 20 shows the schedule that results when Multiscalar or MUSCAT performs hardware scheduling, with the original msec-order jobs generated and executed in the order thread T0 → thread T1 → ... → thread T5. Here, to enable fast thread generation and assignment to execution resources, a new thread can be generated and started between the thread execution units (PE#0 to PE#3) only in the direction indicated by the directional ring. Consequently, when thread T3 running on PE#3 tries to generate a new thread T4, T4 starts only after PE#0 has finished executing T0. PE#1 and PE#2, which have gone idle because T1 and T2 finished earlier than T0, are therefore not put to use, and time is wasted. Whether this problem can be solved is in practice closely tied to thread granularity. If the idle time of PE#1 and PE#2 is on the order of μsec, exploiting it is difficult; if it is on the order of msec or more, software scheduling can schedule the work better.
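
The ring constraint described above can be illustrated with a small timing model. This is only a sketch under stated assumptions: the function name `ring_schedule`, the job durations, and the fixed `fork_delay` (the time after its own start at which a thread forks its successor) are hypothetical and do not come from the patent or from FIG. 20.

```python
def ring_schedule(durations, fork_delay, num_pes=4):
    """Return (pe, start, end) per thread when thread i+1 may only start
    on the ring successor of the PE that runs thread i."""
    pe_free = [0] * num_pes      # time at which each PE becomes idle
    schedule = []
    fork_time = 0                # earliest time the next thread can be forked
    for i, duration in enumerate(durations):
        pe = i % num_pes         # ring order: PE#0, PE#1, PE#2, PE#3, PE#0, ...
        start = max(fork_time, pe_free[pe])  # wait for the successor PE
        end = start + duration
        pe_free[pe] = end
        schedule.append((pe, start, end))
        fork_time = start + fork_delay
    return schedule


# Hypothetical durations for T0..T5 (arbitrary time units).
for name, (pe, start, end) in zip("012345", ring_schedule([6, 3, 3, 4, 5, 8], 1)):
    print(f"T{name} on PE#{pe}: {start}..{end}")
```

With these assumed numbers, T4 is forked at time 4 but cannot start until PE#0 finishes T0 at time 6, even though PE#1 has been idle since time 4.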

[0013] FIG. 21 shows scheduling by software, using a user-level library scheduler. With a software scheduler, a PE that has gone idle can transfer control to the library scheduler and look for the next thread to execute. PE#1 and PE#2 can thus start executing T4 and T5 immediately after T1 and T2 finish, respectively, so the overall completion time is shorter than in FIG. 20. Even in the state of FIG. 21, however, PE#0, PE#1, and PE#3 wait idle until PE#2, which ended up executing T5, finishes its processing, so some inefficiency remains.
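
The behavior of a user-level library scheduler described above can be sketched as a greedy list scheduler: whichever PE becomes idle first takes the next waiting thread. The function name `list_schedule` and the durations are hypothetical, and scheduler overhead is assumed to be zero, which the text notes is not realistic at fine granularity.

```python
import heapq


def list_schedule(durations, num_pes=4):
    """Return (pe, start, end) per thread when any idle PE may take
    the next waiting thread (software scheduling, zero overhead assumed)."""
    idle = [(0, pe) for pe in range(num_pes)]  # (time PE becomes free, PE id)
    heapq.heapify(idle)
    schedule = []
    for duration in durations:
        free_at, pe = heapq.heappop(idle)      # earliest-idle PE runs the thread
        schedule.append((pe, free_at, free_at + duration))
        heapq.heappush(idle, (free_at + duration, pe))
    return schedule


# Hypothetical durations for T0..T5 (arbitrary time units).
for i, (pe, start, end) in enumerate(list_schedule([6, 3, 3, 4, 5, 8])):
    print(f"T{i} on PE#{pe}: {start}..{end}")
```

With these assumed durations, PE#1 and PE#2 pick up T4 and T5 as soon as T1 and T2 end, and PE#2 is the last to finish while the other PEs sit idle, which is the residual inefficiency the paragraph points out.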

[0014] In view of the problems presented with reference to FIGS. 20 and 21, FIG. 22 shows an execution example that mixes hardware scheduling and software scheduling. In FIG. 22, the first level of parallelization is the original parallelization of the jobs T0 to T5, in units of whole jobs, using the user-level scheduler. As the second level of parallelization, the program is assumed to have the interior of each of T0 to T5 parallelized in a form handled by the hardware scheduler. That is, as FIG. 22 shows, T5, the last job in progress, is further accelerated by fine-grained parallel processing at the hardware scheduling level once the other units PE#0, PE#1, and PE#3 have become idle.

[0015] To perform the processing shown in FIG. 22, however, the interior of T0 to T5 must be made parallelizable at the hardware level, and the program must be controlled as follows: if another PE is free, that PE is used; if the other PE is in use under software scheduling, the new thread about to be started at the hardware level is serialized in time and processed on the PE itself.
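
The control requirement stated above amounts to a fork-time decision plus a follow-up at thread termination. The sketch below is a minimal illustration, assuming that the candidate PE is the ring neighbor and that each thread forks at most once; the names `on_fork`, `on_terminate`, `busy`, and `deferred` are hypothetical.

```python
def on_fork(pe, busy, deferred, child):
    """Start `child` on the ring neighbor if it is free; otherwise defer
    it so that it runs later on `pe` itself.

    Returns the PE that starts `child` now, or None if it was deferred.
    """
    neighbor = (pe + 1) % len(busy)
    if not busy[neighbor]:
        busy[neighbor] = True     # hardware-level assignment to the free PE
        return neighbor
    deferred[pe] = child          # serialize in time on the forking PE
    return None


def on_terminate(pe, busy, deferred):
    """When the thread on `pe` ends, run its deferred child there, if any."""
    child = deferred.pop(pe, None)
    busy[pe] = child is not None  # the PE stays busy only if a child takes over
    return child


busy = [True, False, True, True]  # PE#2 and PE#3 held by software scheduling
deferred = {}
print(on_fork(0, busy, deferred, "child-of-T0"))  # neighbor PE#1 is free
print(on_fork(1, busy, deferred, "child-of-T1"))  # neighbor PE#2 is busy
print(on_terminate(1, busy, deferred))            # deferred child runs on PE#1
```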

[0016]

[Problems to Be Solved by the Invention] As described above, thread execution methods in conventional multi-thread processors include hardware scheduling and software scheduling. Because the two approaches have mutually opposing strengths and weaknesses, it is desirable to mix both functions and operate them in free combination.

[0017] To operate in this way, however, the interior of the threads must be made parallelizable at the hardware level, and each thread execution unit must implement control that selects how the program is processed according to the usage of the other thread execution units.

[0018] An object of the present invention is to solve the above conventional problems and to provide a multi-thread processor in which, for the parallel execution of a plurality of threads, a hardware scheduling function and a software scheduling function coexist and can be operated in free combination.

[0019]

[Means for Solving the Problems] To achieve the above object, the present invention is a multi-thread processor that has a plurality of program counters and a plurality of thread execution units that simultaneously fetch, decode, and execute the instructions of a plurality of threads according to those program counters. The thread execution units have a machine-language instruction that allows a thread being processed to generate a new thread at most once, and a machine-language instruction that allows a thread being processed in any thread execution unit to terminate. The processor further comprises: thread assignment means that performs, directly in hardware, the process of assigning a new thread generated by the machine-language instruction to a thread execution unit other than the one executing that instruction; thread assignment instruction means by which software specifies, when a thread being processed in any thread execution unit executes the machine-language instruction that generates a new thread, whether the thread assignment means is to perform the process of assigning the new thread to a thread execution unit; register context holding means that, when the thread assignment instruction means has instructed the thread assignment means not to perform the assignment, stores and holds the register context needed to start executing the new thread generated by the machine-language instruction; thread execution end instruction execution detecting means that detects that the thread being processed in the thread execution unit that executes the machine-language instruction has executed a thread execution end instruction; and register context restoring means that, when the machine-language instruction terminating the thread being processed is executed in any thread execution unit, can restore onto that thread execution unit the register context, held by the register context holding means, that is needed to start executing the thread to be newly generated. As needed, the new thread that is generated when a thread being processed in any thread execution unit executes the machine-language instruction that generates a new thread is processed by that same thread execution unit after the thread being processed on it has terminated.

[0020] Another aspect of the present invention that achieves the above object is a multi-thread processor that has a plurality of program counters and a plurality of thread execution units that simultaneously fetch, decode, and execute the instructions of a plurality of threads according to those program counters. The thread execution units have a machine-language instruction that allows a thread being processed to generate a new thread at most once, and a machine-language instruction that allows a thread being processed in any thread execution unit to terminate. The processor comprises a register content inheritance device that has: a shared physical register file, made up of a plurality of physical registers and shared by the plurality of thread execution units; a conversion table, provided in each of the thread execution units, that defines a mapping between one logical register in the thread execution unit and one of a specific plurality of physical registers in the shared physical register file; and conversion table information copying means that copies the conversion table information of the thread execution units to the adjacent thread execution units. The device inherits register contents by grouping the plurality of physical registers for which a mapping to one logical register is defined, and by defining the mapping with information indicating the position within the group added to the conversion table information. The processor further comprises: thread assignment instruction means by which software specifies, when a thread being processed in any thread execution unit executes the machine-language instruction that generates a new thread, whether the thread assignment means is to perform the process of assigning the new thread to a thread execution unit; register context holding means that, when the thread assignment instruction means has instructed the thread assignment means not to perform the assignment, stores and holds the register context needed to start executing the new thread generated by the machine-language instruction; thread execution end instruction execution detecting means that detects that the thread being processed in the thread execution unit that executes the machine-language instruction has executed a thread execution end instruction; and register context restoring means that, when the machine-language instruction terminating the thread being processed is executed in any thread execution unit, can restore onto that thread execution unit the register context, held by the register context holding means, that is needed to start executing the thread to be newly generated. As needed, the new thread that is generated when a thread being processed in any thread execution unit executes the machine-language instruction that generates a new thread is processed by that same thread execution unit after the thread being processed on it has terminated.

[0021] In another aspect, in either of the above inventions, the register context holding means is a memory area in the main memory of the thread execution unit, and stores the register context of the new thread to be generated by software processing that uses an instruction trap.

[0022] In another aspect, in either of the above inventions, the register context holding means operates as follows when the thread assignment instruction means has instructed the thread assignment means not to perform the thread assignment. When an attempt is made to process the machine-language instruction that generates a new thread, it traps to an instruction address X predetermined by hardware, and stores the address of the thread-generating instruction in a trap source instruction address register that each thread execution unit has of its own. The program whose entry point is instruction address X then saves to main memory the register context needed to start executing the new thread; calculates, from the instruction address held in the trap source instruction address register, the start instruction address of the new thread designated by the thread-generating instruction, and saves the result to main memory; and, based on the instruction address held in the trap source instruction address register, returns program control to the instruction following the thread-generating instruction. Under the same instruction not to perform the assignment, the thread execution end instruction execution detecting means traps to an instruction address Y predetermined by hardware when the thread being processed in the thread execution unit executes a thread execution end instruction. The register context restoring means, by the program whose entry point is instruction address Y, restores to the registers the register context of the new thread that was saved in main memory, retrieves the start instruction address of the new thread that was saved in main memory, and transfers program control to that address.
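
The two trap handlers described above can be modeled in a few lines. This is a behavioral sketch only, not the patent's implementation. It assumes word-addressed instructions (so the instruction after the fork is at address + 1) and a fork instruction that names its target as an offset from its own address; the class `PE` and the functions `execute_fork`, `trap_x`, and `trap_y` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    regs: dict
    pc: int = 0
    trap_source: int = 0                           # trap source instruction address register
    save_area: dict = field(default_factory=dict)  # stands in for main memory


def trap_x(pe, fork_offset):
    """Handler entered at address X: defer the fork on this PE."""
    pe.save_area["regs"] = dict(pe.regs)                 # context the child starts with
    pe.save_area["start"] = pe.trap_source + fork_offset  # child's start address
    pe.pc = pe.trap_source + 1                           # resume parent after the fork


def execute_fork(pe, fork_offset, hw_assign_enabled):
    """Fork instruction: trap to X when hardware assignment is disabled."""
    if hw_assign_enabled:
        raise NotImplementedError("hardware assignment path is not modeled here")
    pe.trap_source = pe.pc
    trap_x(pe, fork_offset)


def trap_y(pe):
    """Handler entered at address Y on thread end: start the deferred child.

    Returns False when no child was deferred (an assumption of this sketch).
    """
    if "start" not in pe.save_area:
        return False
    pe.regs = pe.save_area.pop("regs")
    pe.pc = pe.save_area.pop("start")
    return True


pe = PE(regs={"r1": 5}, pc=100)
execute_fork(pe, fork_offset=20, hw_assign_enabled=False)
pe.regs["r1"] = 99           # parent keeps running and changes its registers
trap_y(pe)                   # parent ends; child starts with the saved context
print(pe.pc, pe.regs)
```

The point of saving the context at fork time rather than at termination is that the child must see the register values the parent had when it forked, not whatever the parent computed afterward.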

[0023] In yet another aspect, in either of the above inventions, the register context holding means is a set of storage devices provided in each thread execution unit for saving register contexts, and stores the register context of the new thread to be generated directly by a hardware sequence; and the register context restoring means restores the register context held in the register context saving storage device onto the thread execution unit directly by a hardware sequence.

[0024]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing a configuration that implements switching between hardware scheduling and software scheduling in a multi-thread processor according to one embodiment of the present invention. As shown in FIG. 1, the multi-thread processor of this embodiment is a processor that executes four threads in parallel, and comprises a thread management unit 30, four thread execution units (PE#0 to PE#3) 10-0 to 10-3, a physical shared register file 20, and an execution unit status 50.

[0025] The thread execution units 10-0 to 10-3 comprise, respectively, instruction caches (#0 to #3) 11-0 to 11-3, instruction decoders (#0 to #3) 12-0 to 12-3, register mapping tables (#0 to #3) 13-0 to 13-3, arithmetic units (#0 to #3) 14-0 to 14-3, and selectors 15-0 to 15-3. Each of the register mapping tables 13-0 to 13-3 is connected to the adjacent register mapping table by a mapping information transfer bus 40 so that together they form a ring.

[0026] The execution unit status 50 is connected to the instruction decoders 12-0 to 12-3 in the thread execution units (PE#0 to PE#3) 10-0 to 10-3, and to the selectors 15-0 to 15-3, which select either the bus that carries data and commands from the physical shared register file 20 to the arithmetic units 14-0 to 14-3 or the bus that returns execution results from the arithmetic units 14-0 to 14-3 to the physical shared register file 20.

[0027] FIG. 1 shows only the components characteristic of this embodiment and omits other, general components. For example, a processor naturally also needs a load/store unit, a data cache memory, an external interface, and so on in addition to the above. In the following description, where the individual instances of the thread execution units 10-0 to 10-3 and of their components (the instruction caches 11-0 to 11-3, instruction decoders 12-0 to 12-3, register mapping tables 13-0 to 13-3, arithmetic units 14-0 to 14-3, and selectors 15-0 to 15-3) need not be distinguished, the suffix of the reference numeral is omitted as appropriate, as in thread execution unit 10, instruction cache 11, and instruction decoder 12.

[0028] The multi-thread processor of this embodiment presupposes the following. It has a machine-language instruction that allows a thread being processed in any thread execution unit to generate a new thread, and a machine-language instruction that allows a thread being processed in any thread execution unit to terminate. A thread being processed in any thread execution unit generates a new thread at most once. In processing the thread-generating instruction, hardware directly controls the assignment of the new thread to a thread execution unit other than the one executing the instruction, and the processor has register content inheritance means realized in this way. The thread management unit 30, the thread execution units 10-0 to 10-3, and the physical shared register file 20 described above implement the register content inheritance means. The register content inheritance means is therefore described first.

[0029] FIG. 12 shows the pipeline stages of the thread execution units 10-0 to 10-3 of FIG. 1. Referring to FIG. 12, execution in the thread execution units 10-0 to 10-3 completes after passing through five stages: an instruction fetch stage 1201, an instruction decode stage 1202, a register conversion stage 1203, an execution stage 1204, and a register write-back stage 1205.

[0030] FIG. 13 shows the detailed configuration of the physical shared register file 20 of FIG. 1. Referring to FIG. 13, the physical register file 20 has, for each logical register number 202, twice as many physical registers 201 as there are thread execution units 10-0 to 10-3. In this example, therefore, eight physical registers 201 are associated with each logical register. The physical registers 201 are divided by the group selection bit 203 into two groups, A and B (204 and 205), each of which has physical extension bits 206 corresponding to the number of thread execution units 10-0 to 10-3.

[0031] FIG. 14 shows the format of the physical register 201 of FIG. 13. Referring to FIG. 14, for an instruction set with a set of 32 logical registers, the physical register 201 consists of the physical extension bits 206, the group selection bit 203, and the logical register number 202. If the number of logical registers changes, the number of bits representing the logical register number 202 changes; if the number of thread execution units 10-0 to 10-3 changes, the value of the physical extension bits 206 changes.
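As a rough illustration of the format described for FIG. 14, the three fields can be packed into a single physical register number. This is a minimal sketch: the field order (extension bits highest, logical register number lowest), the field widths, and the function names are assumptions made for the example and are not taken from the figure.

```python
# Illustrative only: field order and widths are assumptions, not read from FIG. 14.
NUM_UNITS = 4      # thread execution units 10-0 to 10-3
NUM_LOGICAL = 32   # logical registers in the instruction set

EXT_BITS = (NUM_UNITS - 1).bit_length()        # 2 bits for four units
LOGICAL_BITS = (NUM_LOGICAL - 1).bit_length()  # 5 bits for 32 registers


def physical_registers_per_logical() -> int:
    """Twice the number of thread execution units (groups A and B)."""
    return 2 * NUM_UNITS


def physical_register_number(ext: int, group: int, logical: int) -> int:
    """Pack extension bits, group selection bit (0 = A, 1 = B) and logical number."""
    if not (0 <= ext < NUM_UNITS and group in (0, 1) and 0 <= logical < NUM_LOGICAL):
        raise ValueError("field out of range")
    return (ext << (1 + LOGICAL_BITS)) | (group << LOGICAL_BITS) | logical
```

With four thread execution units, each logical register number maps to eight distinct packed values, matching the eight physical registers per logical register stated above.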

[0032] FIG. 15 shows the detailed configuration of the register mapping table 13 of FIG. 1. Referring to FIG. 15, the register mapping table 13 is divided, for each logical register number 202, into groups A and B selected by the group selection bit 1501, and consists of physical extension bits 1503, a change bit 1504, and a write-back bit 1505 for each group, together with an inheritance-time group selection bit 1502 and a group-selection-change-instruction-incomplete bit 1506. The group selection bit 1501 indicates which group of the physical shared register file 20 the thread execution unit 10-0 to 10-3 refers to, and the physical extension bits 1503 indicate which physical register 201 within that group is referenced. The change bit 1504 indicates whether the thread execution unit has decoded, one or more times, an instruction that updates the physical register 201 selected by the group selection bit 1501. The write-back bit 1505 indicates whether one or more instructions that update the physical register 201 have actually completed. The inheritance-time group selection bit 1502 is a copy of the contents of the group selection bit 1501 at the time the registers were inherited from this thread execution unit by another thread execution unit.

[0033] FIG. 16 shows the detailed configuration of one entry of the register mapping table 13 of FIG. 1. Referring to FIG. 16, in addition to the bits shown in FIG. 15, the register mapping table 13 includes adders 1601a and 1601b, multiplexers 1602a to 1602d, and write operation logic 1603.

[0034] The group selection bit 1501 is set when, after a fork in the fork-once model (that is, after thread creation), an instruction of the thread execution unit 10-0 to 10-3 changes the register value for the first time. Whether a write is the first one after the fork is determined by taking the exclusive OR of the group selection bit 1501 and the inheritance-time group selection bit 1502. This determination is possible because the inheritance-time group selection bit 1502 holds a copy of the group selection bit 1501 as it was at thread creation.
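The exclusive-OR test described in this paragraph can be sketched in a few lines. The polarity is an assumption made for the example: the two bits are taken to be equal while no write has occurred since the fork, so a result of 0 means the next write is the first one. The function name is hypothetical.

```python
def is_first_write_after_fork(group_select: int, inherit_group_select: int) -> bool:
    """XOR of bit 1501 and bit 1502; 0 is assumed to mean 'not yet written since the fork'."""
    return (group_select ^ inherit_group_select) == 0
```

Once the first write toggles the group selection bit, the two bits differ and later writes are no longer treated as the first.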

[0035] The change bits 1504a and 1504b are reset when the unit's own thread starts. Thereafter, when an instruction that changes the register value is received from the instruction decoder 12-0 to 12-3, the change bit 1504a or 1504b on the side selected by the group selection bit 1501 is set.

[0036] The write-back bits 1505a and 1505b are reset when the unit's own thread starts, and are set when the actual result computed by the arithmetic unit 14 is written back to the physical shared register file 20. With these bits, the physical register number is extended for each logical register according to the following reference policy.

[0037] First, for a read reference, the multiplexers 1602a and 1602b output, when the change bits 1504a and 1504b are reset, the value obtained by the adders 1601a and 1601b adding to the physical extension bits 1503a and 1503b. Adding 1 to the values of the physical extension bits 1503a and 1503b prevents register conflicts in the physical shared register file 20 used by the non-selected side. Because this conflict prevention is used when the unit itself makes a change on the non-selected side, it can be achieved by ensuring that the preceding unit and this unit, or this unit and the succeeding unit, do not use the same register.

[0038] The multiplexer 1602c selects, according to the group selection bit 1501, whether the value of group A or of group B is read out and output as the physical extension bits 206 for read reference. The physical extension bits 206 for write reference, by contrast, must always be the value of the physical extension bits 1503a or 1503b plus 1, whichever of group A or group B is selected.

[0039] Accordingly, the inputs to the multiplexer 1602d, for both group A and group B, are the values of the physical extension bits 1503a and 1503b after they have passed through the adders 1601a and 1601b. The choice between group A and group B basically follows the value of the group selection bit 1501, but when the group selection bit is being switched as described above, the destination group is selected in advance. This control is performed by the write operation logic 1603. The physical extension bits 1503a and 1503b return to 0 when the addition overflows the available digits. Further, at thread creation, the group selection bit 1501 and the physical extension bits 1503a and 1503b output from the multiplexers 1602a and 1602b are copied via the register mapping table 13 of the unit on which the thread is created.

[0040] The normal register reference operation after thread start, the operation at thread creation, and the register reference operation after thread creation are described below in chronological order. The description mainly concerns operations performed in the register conversion stage 1203 of FIG. 12.

[0041] FIG. 17 illustrates the transitions of the group selection bit 1501, the physical extension bits 1503a and 1503b, and the change bits 1504a and 1504b during normal operation, and the mechanism by which they realize register inheritance. The operation of the write-back bits 1505a and 1505b is described later.

[0042] In the thread execution unit (#0) 10-0, at time (a), when a new thread starts, the group selection bit 1501 is "A", the physical extension bits 1503a and 1503b are 0, and the change bits 1504a and 1504b are 0. In this case, the logical register is read from the physical register 201 at position 0 of "A".

[0043] When a write reference occurs, that is, at the register change at time (b), the change bit 1504a of "A" is set to 1. The change is made to the physical register 201 at position 1 of "A", and subsequent read references also go to that register. Later write references to the same register do not change the group selection bit 1501 or the change bits 1504a and 1504b.
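Steps (a) and (b) of this walkthrough can be mimicked for a single group of one mapping-table entry. This is a minimal sketch, assuming that a read uses the stored extension value until the change bit is set and the stored value plus 1 afterwards, that a write always uses the stored value plus 1, and that the addition wraps to 0 at the number of thread execution units. The class and method names are hypothetical.

```python
NUM_UNITS = 4  # assumed wrap-around point for the physical extension bits


class GroupEntry:
    """One group (A or B) of a register mapping table entry, simplified."""

    def __init__(self, ext: int = 0) -> None:
        self.ext = ext          # physical extension bits 1503
        self.changed = False    # change bit 1504

    def write(self) -> int:
        """Record a write reference and return the position that is written."""
        self.changed = True
        return (self.ext + 1) % NUM_UNITS

    def read(self) -> int:
        """Return the position used by a read reference."""
        return (self.ext + 1) % NUM_UNITS if self.changed else self.ext
```

Starting from extension value 0, a read before any write returns position 0, the first write goes to position 1, and later reads and writes stay at position 1, as in steps (a) and (b).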

[0044] Next, at time (c), when a new thread is created, the group selection bit 1501 is "A", so "A" is set. The change bit 1504a is set on the selected side, and the change bit 1504b is on the non-selected side, so the values obtained by adding 1 to the physical extension bits 1503a and 1503b are sent to the register mapping table 13-1 of the thread execution unit (#1) 10-1.

[0045] When the thread execution unit (#0) 10-0 performs its first register write reference after thread creation, that is, at time (d), the group selection bit 1501 is changed from "A" to "B" and the change bit 1504b is set. The change is made to the physical register 201 at position 1 of "B", and subsequent read references also go to that register. Later write references to the same register do not change the group selection bit 1501 or the change bits 1504a and 1504b. As a result, the register that the thread execution unit (#1) 10-1 may reference remains held at position 0 of "A".

[0046] In the thread execution unit (#1) 10-1, a new thread is created at time (e) without any register write reference having occurred. The physical extension bits 1503a of "A", the selected group, are therefore sent unchanged. Consequently, the register contents of the thread running on the thread execution unit (#0) 10-0 are inherited as they are by the thread executed on the thread execution unit (#2) 10-2. When a register change occurs at time (f), it follows a fork, so the group selection bit 1501 is changed from "A" to "B".
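The values sent to the new thread's mapping table in steps (c) and (e) can be summarized by one small function. This is a sketch of one reading of the text, not the patented circuit: the selected group's extension value is assumed to be sent unchanged when its change bit is clear and incremented when it is set, and the non-selected group's value is assumed always to be incremented. The function name and the wrap-around at four units are assumptions.

```python
def inherited_extension(ext: int, changed: bool, selected: bool, num_units: int = 4) -> int:
    """Extension value handed to the created thread for one group (assumed rule)."""
    if selected and not changed:
        return ext                      # step (e): no write occurred, value sent as is
    return (ext + 1) % num_units        # step (c): written selected side, or non-selected side
```

For example, at step (c) both groups start at 0, the selected group has been written, and both values sent are 1; at step (e) the selected group has not been written, so its value is passed through unchanged.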

[0047] FIG. 18 illustrates the transitions of the group selection bit 1501, the physical extension bits 1503a and 1503b, and the change bits 1504a and 1504b when speculative thread creation is involved, and the mechanism by which they realize register inheritance. Of the operations shown in FIG. 18, (a) to (d) are identical to (a) to (d) in FIG. 17.

[0048] At time (e) in FIG. 17, the thread execution unit (#0) 10-0 cancels the creation of the thread created at time (c). It then creates a thread again at time (f). The group selection bit 1501 is "B", so "B" is set. The change bit 1504a "A" is set on the selected side and "B" is the non-selected side, so the values obtained by adding 1 to the physical extension bits 1503a and 1503b are sent to the register mapping table 13-1 of the thread execution unit (#1) 10-1. As a result, the value changed at time (d) is inherited by the thread executed on the thread execution unit (#1) 10-1. If the register is changed again at time (g), the group selection bit 1501 is returned to "A".

[0049] FIG. 19 shows the timing at which the mapping information is copied in the pipeline operation shown in FIG. 12. Referring to FIG. 19, the register mapping information is copied as follows. When the thread creation instruction is in the register conversion stage (cycle 5 in FIG. 19), the register inheritance information is sent from the thread execution unit (#0) 10-0, and in cycle 6 it is written into the register mapping table 13-1 of the thread execution unit (#1) 10-1. The ordinary instruction E accesses the inherited registers in cycle 7 by referring to that register mapping table 13-1. If the destination thread execution unit 10-0 to 10-3 is executing another thread and cannot accept the request to create a new thread, then once it becomes able to accept the request, the value of the inheritance-time group selection bit 1502 is sent in place of the group selection bit 1501.

[0050] Next, the operation of the write-back bits 1505a and 1505b is described. The write-back bits 1505a and 1505b are used to restore the change bits 1504a and 1504b to their correct values when an instruction that writes to a register is cancelled for some reason (for example, a mispredicted conditional branch).

[0051] The write-back bits 1505a and 1505b are reset by the group selection bit 1501 when the unit's own thread starts. Once reset, they are set when the actual result computed by the arithmetic unit 14 is written back to the physical shared register file 20. That is, if a change bit 1504a or 1504b is set and the corresponding write-back bit 1505a or 1505b is not, the instruction that set the change bit has not yet completed. If an instruction cancellation occurs at this stage, the contents of the write-back bits 1505a and 1505b are copied into the change bits 1504a and 1504b to return them to their initial values, which restores the register mapping table 13 to its correct state when instructions are cancelled.

[0052] Further, the group-selection-change-instruction-incomplete bit 1506 is reset at thread start, set when an instruction that changes the group selection bit 1501 reaches the register conversion stage 1203, and reset when that instruction reaches the register write-back stage 1205. While the bit 1506 is set, therefore, the instruction that changes the group selection bit 1501 has not completed. If instructions are cancelled in this state, the group selection bit 1501 is inverted in response to the set bit 1506, after which the bit 1506 is reset.
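The recovery steps for cancelled instructions can be sketched as follows. The data layout (two-element lists indexed by group, 0 for A and 1 for B) and all names are assumptions made for illustration; only the two recovery rules come from the description: the write-back bits are copied into the change bits, and a pending group selection change is undone by inverting the group selection bit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MapEntry:
    group_select: int = 0                                                    # bit 1501
    changed: List[bool] = field(default_factory=lambda: [False, False])      # bits 1504a/b
    written_back: List[bool] = field(default_factory=lambda: [False, False]) # bits 1505a/b
    group_change_pending: bool = False                                       # bit 1506


def cancel_instructions(entry: MapEntry) -> None:
    """Restore the entry after in-flight register-writing instructions are cancelled."""
    # Change bits revert to what has actually been written back.
    entry.changed = list(entry.written_back)
    # An incomplete group selection change is undone by inverting bit 1501.
    if entry.group_change_pending:
        entry.group_select ^= 1
        entry.group_change_pending = False
```

A change bit whose write has already completed survives the cancellation, because its write-back bit is also set.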

[0053] With the method described above, register inheritance can be realized without copying the actual contents of the registers and with the physical shared register file 20 as the only shared resource. Each set of physical extension bits 206 is incremented by at most 1 at an inheritance, and the mechanism can be implemented by providing two sets of register groups, each sized to the number of thread execution units 10-0 to 10-3.

[0054] Next, this embodiment is described, which realizes switching between hardware scheduling and software scheduling in a processor that uses the register content inheritance means described above.

[0055] As shown in FIG. 1, this embodiment takes the configuration consisting of the thread management unit 30, the thread execution units 10-0 to 10-3, and the physical shared register file 20, which realize the register content inheritance means, adds logic to the instruction decoders 12-0 to 12-3, and further includes an execution unit status 50. The execution unit status 50 lets software indicate whether, when a thread being processed in any thread execution unit is about to process the machine-language instruction that creates a new thread, the hardware should directly assign the new thread to be created at the request of that instruction to a thread execution unit other than that one.

[0056] FIG. 2 is a circuit diagram showing the internal structure of the execution unit status 50 in FIG. 1. Referring to FIG. 2, the execution unit status 50 includes thread execution unit statuses #0 to #3 (21-0 to 21-3), NOR logic gates (22-0 to 22-3), selectors (23-0 to 23-3) for selecting the destination of a status write from each thread execution unit, and selectors (24-0 to 24-3) for selecting the entry of a status read from each thread execution unit.

[0057] FIG. 3 is a circuit diagram showing the internal structure of the instruction decoders 12-0 to 12-3 in FIG. 1. Referring to FIG. 3, each of the instruction decoders 12-0 to 12-3 includes an instruction decoder circuit 25, which provides a function equivalent to the instruction decoder function of a processor using the register content inheritance means described above, together with an AND gate 26, an inverter 27, and a set/reset flip-flop 28.

[0058] How the circuits added to the instruction decoder circuit 25 relate to the overall function is described later; here, their circuit-level operation is explained.

[0059] When the instruction decoder circuit 25 finds, in the instruction sequence being executed, a machine-language new-thread creation instruction that is subject to hardware scheduling, it decides whether to trap to instruction address X according to the indication from the AND gate 26. If the AND gate output is "0", a trap to instruction address X occurs; if the output of the AND gate 26 is "1", no trap to instruction address X occurs and the new thread is created directly by hardware scheduling. When the instruction decoder circuit 25 decides that it should trap to instruction address X, it performs a set operation on the set/reset flip-flop 28, making the value of the flip-flop 28 "1".

[0060] Likewise, when the instruction decoder circuit 25 finds, in the instruction sequence being executed, a machine-language thread termination instruction that is subject to hardware scheduling, it decides whether to trap to instruction address Y according to the indication from the AND gate 26. If the AND gate output is "0", a trap to instruction address Y occurs; if the output of the AND gate 26 is "1", no trap to instruction address Y occurs and the thread termination operation is performed directly by hardware scheduling. When the instruction decoder circuit 25 decides that it should trap to instruction address X, it performs a reset operation on the set/reset flip-flop 28, making the value of the flip-flop 28 "0".

[0061] The output of the set/reset flip-flop 28 is connected to the AND gate 26 through the inverter 27. Once a new-thread creation instruction has been trapped to address X, these circuits ensure that when the instruction decoder circuit 25 later finds the thread termination instruction paired with that instruction, it always traps to address Y. The set/reset flip-flop 28 is set to the value "0" at processor initialization.
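The interplay of the AND gate 26, the inverter 27, and the set/reset flip-flop 28 can be modeled behaviorally. This is a minimal sketch under stated assumptions: the second AND-gate input, named `hw_sched_allowed` here, is a hypothetical signal standing in for whatever the execution unit status 50 supplies, and the flip-flop is reset when a termination instruction traps. The class name and the returned strings are purely illustrative.

```python
class DecoderTrapLogic:
    """Behavioral model of the logic added around instruction decoder circuit 25."""

    def __init__(self) -> None:
        self.ff = 0  # set/reset flip-flop 28, "0" at processor initialization

    def _and_gate(self, hw_sched_allowed: int) -> int:
        # Inverter 27 feeds the negated flip-flop output into AND gate 26.
        return hw_sched_allowed & (self.ff ^ 1)

    def on_thread_create(self, hw_sched_allowed: int) -> str:
        if self._and_gate(hw_sched_allowed) == 0:
            self.ff = 1                      # remember that creation was trapped
            return "trap to X"
        return "hardware thread creation"

    def on_thread_end(self, hw_sched_allowed: int) -> str:
        if self._and_gate(hw_sched_allowed) == 0:
            self.ff = 0                      # pairing complete, clear the flip-flop
            return "trap to Y"
        return "hardware thread termination"
```

Because the flip-flop stays set after a trap to X, the paired termination instruction traps to Y even if the other AND-gate input has meanwhile become 1, which is the pairing guarantee stated in the paragraph above.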

[0062] The additional circuit added to the instruction decoder circuit 25, shown in FIG. 3, detects, in combination with the execution unit status 50 of FIG. 1, that the thread being processed in a given thread execution unit 10-0 to 10-3 has executed a thread termination instruction.

[0063] Next, the structure of a program that realizes the mixed execution of software scheduling and hardware scheduling shown in FIG. 22 is described with reference to the flowcharts of FIGS. 4 to 7.

[0064] FIG. 4 is a flowchart showing the operations related to the mixed execution of software scheduling and hardware scheduling during initialization of a user program that is first started by the OS or the like and begins running.

[0065] Referring to the program shown in FIG. 4 for the part related to this execution, threads T1 to T5, which are managed by the software scheduler, are created first, and thread T0 is executed directly by the PE that performed the program initialization once initialization is complete (steps 401 and 402). Threads T1 to T5 are created by calling the thread creation routine in the user-level thread library that was dynamically or statically linked when the execution load module of the user program was built. The change of the "thread execution status" described in step 401 is processing that concerns controlling from software whether hardware scheduling is permitted. It updates the status of the one thread execution unit that is about to be started under software scheduling, among the thread execution unit #0 to #3 statuses (21-0 to 21-3) in FIG. 2.

[0066] In FIG. 2, a thread execution unit #i status value of "0" indicates that "thread execution unit #i is in a state in which it cannot be used by hardware scheduling", and a thread execution unit #i status value of "1" indicates that "thread execution unit #i is in a state in which it can be used by software scheduling". Accordingly, "set the 'thread execution status' of the PE to user-level execution" in step 401 of FIG. 4 means setting the thread execution unit status of FIG. 2 for the PE being started to the value "1".

[0067] In FIG. 1, this update consists of sending data and a command from the PE #j that is performing the thread creation processing to the execution unit status 50 and updating the value of the target thread execution unit status #i in FIG. 2. This can be done by controlling one of the selectors 23-0 to 23-3; the implementation will be obvious to an engineer with ordinary hardware design knowledge, so its description is omitted.

[0068] FIGS. 5 to 7 are flowcharts showing the software structure of the user-level thread library.

[0069] This user-level thread library must not be called from a parallel execution environment under hardware scheduling. From a parallel processing environment under hardware scheduling, it is always called only after execution has been reduced to a single thread execution unit. The programs that make up threads T0 to T5 can be structured so that a call into the user-level thread library is always made only after execution has been reduced to a single thread execution unit. In addition, a means of guaranteeing in hardware, under software instruction, that only one thread execution unit is executing is disclosed, for example, in Japanese Patent Publication No. 10-27108, "Thread Execution Method", so a detailed description is omitted here.

[0070] FIG. 5 shows the thread creation routine. When called, the routine shown in FIG. 5 registers the information of the thread to be created (step 502), then checks whether any thread execution unit is waiting; if so, it selects one of the waiting units and starts it to run the dispatcher (step 503). Whether a thread execution unit is waiting can be managed in software by providing a flag area in the memory region that the library uses when the library scheduler starts or terminates a thread execution unit.

[0071] If several thread execution units performed the "thread creation processing" at the same time, data that may only be accessed exclusively would be corrupted. The thread creation routine shown in FIG. 5 must therefore acquire and release the scheduler lock in steps 501 and 504 to guarantee that it operates exclusively with respect to calls from the individual thread execution units.
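The thread creation routine of FIG. 5 can be sketched in software as follows. This is an illustration under assumptions, not the library's actual code: the class and attribute names are hypothetical, the waiting flags are kept in a plain list, and starting the dispatcher on a selected unit is represented only by recording the unit number.

```python
import threading
from typing import Any, List


class UserLevelScheduler:
    """Simplified stand-in for the user-level thread library's scheduler state."""

    def __init__(self, num_units: int = 4) -> None:
        self.lock = threading.Lock()          # scheduler lock (steps 501 and 504)
        self.ready: List[Any] = []            # registered thread information
        # Waiting flags kept in memory; unit 0 is assumed to be running the caller.
        self.waiting = [False] + [True] * (num_units - 1)
        self.dispatcher_started: List[int] = []

    def create_thread(self, thread_info: Any) -> None:
        with self.lock:                                    # step 501 ... step 504
            self.ready.append(thread_info)                 # step 502: register the thread
            for unit, is_waiting in enumerate(self.waiting):
                if is_waiting:                             # step 503: pick one waiting unit
                    self.waiting[unit] = False
                    self.dispatcher_started.append(unit)   # placeholder for starting it
                    break
```

Holding the lock for the whole routine keeps the registration and the waiting-flag update atomic with respect to calls from other thread execution units, which is the exclusivity requirement stated above.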

[0072] FIG. 6 shows the thread termination routine. This routine must also execute exclusively among the thread execution units. When called, it therefore acquires the scheduler lock (step 601) and then, as the thread termination processing, deletes the information of the terminating thread from the thread information managed by the library (step 602). Because there is no need to return execution control to the software-managed user thread that has finished, the routine passes control directly to the dispatcher to obtain the next thread to execute.

[0073] FIG. 7 shows the processing of the dispatcher. This processing must also execute exclusively among the thread execution units. When called, it therefore first acquires the scheduler lock (step 701). It then consults the thread management information to check whether any executable user-level thread has not yet been processed (step 702). If there is a thread to execute, it takes out that thread's information (step 703) and transfers control to that thread so that it runs on the calling thread execution unit.

[0074] On the other hand, if there is no thread to execute, the thread execution unit must enter the standby state. Specifically, the unit that is about to enter standby determines whether more than one other thread execution unit is still operating (step 705). If so, it releases the scheduler lock (step 706), sets its own "thread execution status" so that hardware scheduling is enabled, and enters the standby state (step 707).

[0075] If, once the calling thread execution unit enters standby, only one operating thread execution unit will remain, the dispatcher additionally sets the "thread execution status" of that last remaining unit so that hardware scheduling is enabled (step 708).

[0076] The purpose of steps 705 to 708 is this: when only one thread execution unit remains in the operating state, that unit is told that it may use, under hardware scheduling, a total of three thread execution units, namely the two that are already on standby and the one that is just about to enter standby.

[0077] The result of steps 705 to 708 is communicated to the hardware when, through execution of steps 708 and 707, all the thread execution unit status flags in FIG. 2 take the value "0". How the hardware interprets this result and operates is explained later from the viewpoint of the overall operation.
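A minimal sketch of the dispatcher logic of FIG. 7 is given below, under stated assumptions: the `Dispatcher` class, its `operating` list (the library's own record of which units are running), and the point at which the lock is released on the "thread found" and step 708 paths are hypothetical, since the description gives step numbers only for the paths shown in the comments.

```python
import threading

HW, SW = 0, 1  # status flag values: 0 = hardware scheduling enabled, 1 = software scheduling


class Dispatcher:
    def __init__(self, n_units=4):
        self.lock = threading.Lock()
        self.status = [SW] * n_units       # thread execution unit status flags
        self.operating = [True] * n_units  # library's own record of operating units
        self.ready = []                    # runnable user-level threads not yet dispatched

    def dispatch(self, unit):
        """Return the thread to run on `unit`, or None if the unit goes on standby."""
        self.lock.acquire()                          # step 701
        if self.ready:                               # step 702
            thread = self.ready.pop(0)               # step 703
            self.lock.release()                      # assumed: release before jumping
            return thread
        others = [u for u, on in enumerate(self.operating)
                  if on and u != unit]               # step 705
        if len(others) <= 1:
            for u in others:                         # step 708: last operating unit
                self.status[u] = HW
        self.operating[unit] = False
        self.lock.release()                          # step 706
        self.status[unit] = HW                       # step 707
        return None


d = Dispatcher()
d.ready = ["T4"]
first = d.dispatch(1)        # PE#1 picks up T4
d.dispatch(0)                # PE#0 finds nothing and goes on standby
after_pe0 = list(d.status)
d.dispatch(1)                # PE#1 finishes T4 and goes on standby
d.dispatch(3)                # PE#3 goes on standby; only PE#2 would remain
print(first, after_pe0, d.status)
```

In this run the status flags stay mixed while two or more units are still operating, and they all become 0 only when the second-to-last unit goes on standby, which is the condition the hardware observes.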

[0078] Next, with reference to FIGS. 1, 2, and 3 and the flowcharts of FIGS. 8 and 9, two means are described: the means that, when software has instructed that hardware should not directly assign new threads, stores and holds the register context needed to start execution of the new thread to be generated in response to a thread generation request; and the means that makes it possible to restore that held register context onto the thread execution unit.

[0079] The processing shown in FIGS. 8 and 9 is given as flowcharts of the operation that runs when, while a thread execution unit is operating under software scheduling, execution of a thread generation instruction based on hardware scheduling is initiated within that unit. Whether the processing of FIGS. 8 and 9 is invoked is determined as follows.

[0080] First, the outputs of NOR gates 22-0 to 22-3 change according to whether the value of each thread execution unit #i status in FIG. 2 is "0" or "1". Seen from a given thread execution unit #i, the NOR gate output takes the value "1" if the thread execution unit #j status (j ≠ i) of every other thread execution unit is "0", that is, if all the other thread execution units are in the hardware-schedulable state. Otherwise it takes the value "0".
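The gate behavior just described amounts to a small Boolean function. The sketch below is an illustration only; the function name `nor_outputs` and the list representation of the status flags are assumptions made for the example.

```python
def nor_outputs(status):
    """NOR gate i outputs 1 only if every status flag except flag i is 0."""
    return [
        int(not any(flag for j, flag in enumerate(status) if j != i))
        for i in range(len(status))
    ]


# Only unit #1 is still under software scheduling (status 1).
print(nor_outputs([0, 1, 0, 0]))  # prints [0, 1, 0, 0]
```

A single unit left under software scheduling is therefore enough to hold the outputs seen by all the other units at 0, while that unit itself sees a 1.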

[0081] As shown in FIG. 3, the outputs of NOR gates 22-0 to 22-3 are connected through an AND gate 26 to the instruction decode circuit 25 of each thread execution unit. Because the set/reset flip-flop 28 has an initial value of "0" and this value is fed to the AND gate 26 inverted, the AND gate 26 passes the outputs of NOR gates 22-0 to 22-3 through unchanged.

[0082] Accordingly, if the output of the connected NOR gate 22-0 to 22-3 of FIG. 2 is "0", the instruction decode circuit 25 of each thread execution unit, on finding a thread generation instruction (a machine-language instruction that performs hardware scheduling directly) in the instruction sequence it is about to execute, interprets the instruction as one that must be executed in serialized form and traps to an instruction address X predetermined by the hardware.

[0083] On the other hand, if the output of NOR gate 22-0 to 22-3 that is input to the instruction decode circuit 25 of each thread execution unit is "1", then on finding a thread generation instruction (a machine-language instruction that performs hardware scheduling directly) in the instruction sequence it is about to execute, the unit generates a new thread on the adjacent thread execution unit in descending number order and has it executed there, as shown in the prior art.

[0084] NOR gates 22-0 to 22-3 of FIG. 2 behave in the same way when a thread execution unit finds, in the instruction sequence it is about to execute, a thread end instruction, that is, a machine-language instruction that performs hardware scheduling directly. For the thread end instruction, however, the circuit of FIG. 3 also plays an additional role.

[0085] That is, if the output of the connected NOR gate 22-0 to 22-3 of FIG. 2 is "0", the instruction decode circuit 25 of each thread execution unit, on finding a thread end instruction (a machine-language instruction that performs hardware scheduling directly) in the instruction sequence it is about to execute, traps to an instruction address Y predetermined by the hardware. If, on the other hand, the output of NOR gate 22-0 to 22-3 that is input to the instruction decoder 12 of each thread execution unit is "1", then on finding such a thread end instruction, the basic operation is to end thread execution directly and enter the standby state, as shown in the prior art.

[0086] For the thread end instruction, however, the additional circuit of FIG. 3 has the final say: if the thread generation instruction processed before the thread end instruction was found had been executed in serialized form, the output of NOR gate 22-0 to 22-3 of FIG. 2 is masked by the AND gate 26 of FIG. 3, and in this case too the unit traps to instruction address Y.
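The decode-time decision described in the preceding paragraphs can be summarized in a short sketch. The function names, the instruction mnemonics `"fork"` and `"term"`, and the returned action strings are hypothetical labels chosen for the example. The sketch also assumes that the decoder sees the AND-gated signal for both instruction kinds, which is how the wiring through AND gate 26 reads.

```python
def decoder_input(nor_out, ff28):
    """AND gate 26: the NOR output gated by the inverted flip-flop 28 value."""
    return nor_out & (ff28 ^ 1)


def decode(instr, nor_out, ff28):
    """Return the action taken for a hardware-scheduling instruction."""
    direct = decoder_input(nor_out, ff28)
    if instr == "fork":      # thread generation instruction
        return "fork_on_adjacent_unit" if direct else "trap_to_X"
    if instr == "term":      # thread end instruction
        return "stop_and_standby" if direct else "trap_to_Y"
    return "execute"         # any other instruction is unaffected


print(decode("fork", nor_out=0, ff28=0))  # prints trap_to_X
print(decode("term", nor_out=1, ff28=1))  # prints trap_to_Y
```

The second call shows the masking case: even though the NOR output has become 1, a thread whose earlier fork was serialized (flip-flop set) still ends through the trap at address Y, so the saved child thread is not lost.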

[0087] The mechanism by which the processor traps to address X or address Y when it attempts to execute a machine-language thread generation or thread end instruction is well known as a technique for trapping instructions that would be complex to implement in wired logic and emulating them in software or firmware. A detailed description is therefore omitted here.

[0088] Next, the processing performed after trapping to address X or address Y is described with reference to FIGS. 8 and 9. FIG. 8 is a flowchart of the processing, starting at address X, that serializes and executes thread generation.

[0089] In the processor of this embodiment, when a thread execution unit generates a new thread under hardware scheduling, the child thread is generated by inheriting the register context at the point where the fork instruction (the instruction that generates a new thread) appears. In addition, a newly generated thread can itself generate a new thread at most once before it ends. If a thread has been generated erroneously, the computation model requires that the erroneously generated thread be forcibly terminated before the thread is regenerated.

Under this computation model, multiple threads can therefore be serialized and executed on a single thread execution unit, without using multiple units, as follows. When a thread generation instruction is about to be executed, the register context at that point and the instruction start address of the thread to be generated are saved. The thread that was executing then continues. When execution reaches the thread end instruction, the saved context of the new thread is restored to the registers and execution control is passed to the new thread's instruction start address.

The context save area needed for serialized execution need only hold the information of at most one thread per thread execution unit. Accordingly, in step 802 of FIG. 8, the information of the hardware-initiated thread to be generated (its register context and thread start instruction address) is stored in the memory area dedicated to each thread execution unit, after which control returns to the address where the trap occurred and the original thread continues executing.

[0090] In this way, the means is realized that, when software has instructed that hardware should not directly assign new threads, stores and holds the register context needed to start execution of the new thread to be generated in response to a thread generation request.

[0091] In the processing of FIG. 8, which starts at address X and serializes thread generation, step 801 also sets the set/reset flip-flop 28 of FIG. 3. This ensures that the thread to be generated, which has been saved for serialized processing, can always be restored later when the thread end instruction is about to be executed.

[0092] Next, the processing performed when a thread execution unit detects a thread end instruction and, because serialized execution is in progress, traps to address Y is described with reference to FIG. 9. As FIG. 9 shows, the termination processing has two possible cases.

[0093] The first case is that the terminating thread had performed new-thread generation processing. In this case, the context of the new thread to be generated is restored to the registers, and control is transferred to the start address of the new thread (steps 902 and 903).

[0094] The second case is that the terminating thread had not performed new-thread generation processing. In this case, the thread execution unit enters the standby state (steps 902 and 904).

[0095] This realizes the means that makes it possible to restore onto the thread execution unit the held register context needed to start execution of the thread to be newly generated.

[0096] When a thread end instruction is detected and the unit traps to address Y because serialized execution is in progress, the set/reset flip-flop 28 of FIG. 3 is also reset as additional processing.
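The two trap handlers of FIGS. 8 and 9 can be modeled as below. This is a sketch under assumptions, not the patent's implementation: the class `ThreadExecutionUnit`, the dictionary used as a register file, and the method names `trap_x` and `trap_y` are invented for illustration, and the position of the flip-flop reset inside the address Y handler is assumed because the description does not give it a step number.

```python
class ThreadExecutionUnit:
    """Hypothetical per-unit state used while hardware forks are serialized."""

    def __init__(self):
        self.regs = {}          # register context
        self.pc = 0             # program counter
        self.ff28 = 0           # set/reset flip-flop 28
        self.save_area = None   # holds at most one pending child thread
        self.standby = False

    def trap_x(self, child_start, resume_pc):
        """Address X handler (FIG. 8): defer the fork instead of using another unit."""
        self.ff28 = 1                                    # step 801
        self.save_area = (dict(self.regs), child_start)  # step 802: context + start address
        self.pc = resume_pc                              # return to the trapping thread

    def trap_y(self):
        """Address Y handler (FIG. 9): run the deferred child or go on standby."""
        self.ff28 = 0                                    # reset (placement assumed)
        if self.save_area is not None:                   # step 902
            self.regs, self.pc = self.save_area          # step 903
            self.save_area = None
        else:
            self.standby = True                          # step 904


pe = ThreadExecutionUnit()
pe.regs = {"r1": 7}
pe.trap_x(child_start=0x200, resume_pc=0x104)  # parent forks, then keeps running
pe.regs["r1"] = 99                             # parent changes registers after the fork
pe.trap_y()                                    # parent ends: child starts with fork-time context
print(pe.regs, hex(pe.pc), pe.standby)         # prints {'r1': 7} 0x200 False
pe.trap_y()                                    # child ends without forking
print(pe.standby)                              # prints True
```

Because a thread may fork at most once before it ends, a single save slot per unit is enough in this model, which matches the sizing argument in the description.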

[0097] Next, the overall operation of this embodiment is described with reference to FIG. 10. FIG. 10 shows the mixed execution of software scheduling and hardware scheduling illustrated in FIG. 22, at a finer resolution along the time axis.

[0098] In the depiction of each thread execution unit's execution state, the annotation "value 1" or "value 0" attached to the downward-pointing arrow line on the right indicates the value of the thread execution unit status 23-0 to 23-3 of FIG. 2 for that thread execution unit.

[0099] Program processing begins when thread execution unit PE#0 starts the program initialization processing shown in FIG. 4. When PE#0 starts the initialization and performs step 401, the thread execution status of PE#0 takes the value 1 (the state usable by software scheduling). As a result, thread execution units PE#1 to PE#3 enter a mode in which threads generated by hardware scheduling are serialized and executed.

[0100] Continuing the program initialization processing, thread execution unit PE#0 performs step 402. It executes the thread generation processing of FIG. 5 five times to generate threads T1 to T5, which are subject to software scheduling. When generating T1 to T3, it also sets the thread execution unit status of each of PE#1 to PE#3 to the state usable by software scheduling and starts the dispatcher on them.

[0101] Thread execution units PE#1 to PE#3 receive threads T1 to T3 through the dispatcher and each start executing. After performing step 402, thread execution unit PE#0 starts executing thread T0.

[0102] While threads T0 to T3 are executing in parallel, each software-scheduled thread proceeds while generating and terminating threads by means of the machine-language instructions that are subject to hardware scheduling. However, because the thread execution unit status of every thread execution unit indicates the software-scheduling state, executing a machine-language thread generation instruction or thread end instruction causes a trap, and the instructions are serialized by the means described with reference to FIGS. 8 and 9. When threads T0 to T3 finish, each thread execution unit calls the thread termination processing of FIG. 6 to end execution of its software-scheduled thread.

[0103] On the time axis, software-scheduled thread T1 ends first, followed by thread T2. The units that were executing them, PE#1 and PE#2, each proceed through the thread termination processing of FIG. 6 to the dispatcher of FIG. 7, and then start executing threads T4 and T5, which have not yet been processed.

[0104] Next on the time axis, thread execution unit PE#0, which was executing software-scheduled thread T0, reaches the end of execution and performs the thread termination processing of FIG. 6, after which control passes to the dispatcher of FIG. 7.

[0105] At this point, no software-scheduled thread remains unassigned to a thread execution unit. Thread execution unit PE#0 therefore reaches step 707, changes its own thread execution unit status to the state usable by hardware scheduling, and enters the standby state.

[0106] Next, thread execution unit PE#1, which was executing thread T4, likewise finishes its processing, changes its own thread execution unit status to the state usable by hardware scheduling, and enters the standby state.

[0107] Then thread execution unit PE#3, which was executing thread T3, finishes its processing, and control passes to the dispatcher of FIG. 7. Here, step 705 determines that, once PE#3 itself enters standby, only one PE will remain executing, so step 708 is performed. When step 708 completes, the thread execution unit statuses of FIG. 2 for PE#0, PE#1, and PE#2 indicate the state usable by hardware scheduling. At this point the outputs of NOR circuits 22-0 to 22-3 are, in order, "0", "0", "0", "1". At the instruction decoders 12-0 to 12-3 of the thread execution units in FIG. 1, to which these outputs are connected, only PE#3 is in a state where it can judge that hardware-scheduled threads may be executed directly.

[0108] However, because thread execution unit PE#3 executes step 707 after step 708, it does not generate any hardware-scheduled thread at this point. Once step 707 has been executed, the thread execution unit statuses 21-0 to 21-3 of FIG. 2 all indicate the state usable by hardware scheduling. In this state the outputs of NOR circuits 22-0 to 22-3 are, in order, "1", "1", "1", "1", and all PEs reach the state in which hardware-scheduled threads can be executed.

[0109] From this point on, when thread execution unit PE#2, which has been executing software-scheduled thread T5, attempts to execute a machine-language thread generation instruction subject to hardware scheduling, it no longer falls into the trap-and-serialize execution shown in FIG. 5. Instead, it progressively uses all the thread execution unit resources and executes at high speed under hardware scheduling.
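The status transitions in this part of the walkthrough can be checked with a few lines of code. The sketch below is illustrative only: the helper `nor_outputs` and the list layout (index = PE number) are assumptions, while the status values at each step follow the description of steps 708 and 707 for PE#3.

```python
def nor_outputs(status):
    """NOR gate i outputs 1 only if every status flag except flag i is 0."""
    return [int(all(s == 0 for j, s in enumerate(status) if j != i))
            for i in range(len(status))]


# PE#0 and PE#1 are on standby (0); PE#2 runs T5 and PE#3 has just finished T3.
status = [0, 0, 1, 1]
trace = [nor_outputs(status)]

status[2] = 0                 # step 708: PE#3 enables hardware scheduling on PE#2
trace.append(nor_outputs(status))

status[3] = 0                 # step 707: PE#3 enables it on itself and goes on standby
trace.append(nor_outputs(status))

for outputs in trace:
    print(outputs)
# prints:
# [0, 0, 0, 0]
# [0, 0, 0, 1]
# [1, 1, 1, 1]
```

The middle line is the short interval in which only PE#3 could fork directly, and the last line is the state from which PE#2 continues T5 without trapping.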

[0110] In FIG. 6, thread T5 is labeled "T5a" and "T5b", with a or b appended as the final character, to distinguish two periods: T5a, during which thread execution under hardware scheduling is serialized, and T5b, during which parallel execution takes place directly on multiple thread execution unit resources under hardware scheduling.

[0111] FIG. 11 is a block diagram showing a configuration that realizes switching between hardware scheduling and software scheduling in a multi-thread processor according to another embodiment of the present invention.

[0112] Like the first embodiment shown in FIG. 1, the multi-thread processor of this embodiment is a four-thread parallel execution processor comprising a thread management unit 30, four thread execution units (PE#0 to PE#3) 10-0 to 10-3, a physical shared register file 20, and an execution unit status 50.

[0113] Like thread execution units 10-0 to 10-3 of the first embodiment, thread execution units 110-0 to 110-3 each comprise an instruction cache (#0 to #3) 11-0 to 11-3, an instruction decoder (#0 to #3) 12-0 to 12-3, a register mapping table (#0 to #3) 13-0 to 13-3, an arithmetic unit (#0 to #3) 14-0 to 14-3, and a selector 15-0 to 15-3. In addition, each comprises a save storage (#0 to #3) 111-0 to 111-3. As in the first embodiment, register mapping tables 13-0 to 13-3 are each connected to the adjacent register mapping table by a mapping information transfer bus 40 so as to form a ring.

[0114] The save storages 111-0 to 111-3 are the means that, when software has instructed that hardware should not directly assign new threads, store and hold the register context needed to start execution of the new thread to be generated in response to the thread generation request. They are used to save the information of threads that are subject to hardware scheduling.

[0115] Because the save storages 111-0 to 111-3 are provided, a unit in this embodiment that is serializing hardware-scheduled threads and attempts to execute a machine-language thread generation instruction or thread end instruction subject to hardware scheduling need not perform software processing using an instruction trap as shown in FIG. 2. Instead, thread information can be saved and restored directly by a hardware sequence, using the save storage 111-0 to 111-3 provided in each thread execution unit.

[0116] The other components and operations are the same as those of the first embodiment described with reference to FIGS. 1 to 10, so their description is omitted.

[0117] The present invention has been described above with reference to preferred embodiments, but the present invention is not necessarily limited to those embodiments.

[0118]

[Effects of the Invention] As described above, the multi-thread processor of the present invention can mix fine-grained parallel program execution based on hardware scheduling with flexible parallel execution scheduling that uses a software library scheduler. Programs are therefore executed efficiently, which shortens the processing time of the program as a whole.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a configuration that realizes switching between hardware scheduling and software scheduling in a multi-thread processor according to an embodiment of the present invention.

FIG. 2 is a circuit diagram showing the internal structure of the execution unit status in the embodiment.

FIG. 3 is a circuit diagram showing the internal structure of the instruction decoder of a thread execution unit in the embodiment.

FIG. 4 is a flowchart showing the operations related to mixed execution of software scheduling and hardware scheduling during initialization of a user program that is launched by an OS or the like and begins running.

FIG. 5 is a flowchart representing the software structure of the user-level thread library, showing the thread generation routine.

FIG. 6 is a flowchart representing the software structure of the user-level thread library, showing the thread termination routine.

FIG. 7 is a flowchart representing the software structure of the user-level thread library, showing the processing of the dispatcher.

FIG. 8 is a flowchart showing the thread generation operation performed when, while a thread execution unit is operating under software scheduling, execution of a thread generation instruction based on hardware scheduling is initiated within that unit.

FIG. 9 is a flowchart showing the thread termination operation performed when, while a thread execution unit is operating under software scheduling, execution of a thread generation instruction based on hardware scheduling is initiated within that unit.

FIG. 10 is a diagram showing the mixed execution of software scheduling and hardware scheduling at a finer resolution along the time axis.

FIG. 11 is a block diagram showing a configuration that realizes switching between hardware scheduling and software scheduling in a multi-thread processor according to another embodiment of the present invention.

FIG. 12 is a diagram illustrating the pipeline stages of a thread execution unit in the embodiment of the present invention.

FIG. 13 is a diagram showing the detailed configuration of the physical shared register file in the embodiment of the present invention.

FIG. 14 is a diagram showing the format of a physical register in the embodiment of the present invention.

FIG. 15 is a diagram showing the detailed configuration of the register mapping table in the embodiment of the present invention.

FIG. 16 is a diagram showing the detailed configuration of one entry of the register mapping table in the embodiment of the present invention.

FIG. 17 is a diagram illustrating, for the embodiment of the present invention, the transitions of the values of the group selection bit, the physical extension bit, and the change bit during normal operation, and the mechanism by which register inheritance is thereby realized.

FIG. 18 is a diagram illustrating, for the embodiment of the present invention, the transitions of the values of the group selection bit, the physical extension bit, and the change bit when speculative thread generation is involved, and the mechanism by which register inheritance is thereby realized.

FIG. 19 is a diagram showing the timing at which mapping information is copied during operation of the pipeline shown in FIG. 12.

FIG. 20 is a diagram illustrating a thread execution method based on hardware scheduling.

FIG. 21 is a diagram illustrating a thread execution method based on software scheduling.

FIG. 22 is a diagram illustrating a thread execution method in which hardware scheduling and software scheduling are mixed.

[Explanation of symbols]

10-0 to 10-3 Thread execution unit
11-0 to 11-3 Instruction cache
12-0 to 12-3 Instruction decoder
13-0 to 13-3 Register mapping table
14-0 to 14-3 Operation unit
15-0 to 15-3 Selector
20 Physical shared register file
30 Thread management unit
40 Mapping information transfer bus
50 Execution unit status

Continued from the front page
(72) Inventor: Junji Sakai, c/o NEC Corporation, 5-7-1 Shiba, Minato-ku, Tokyo
(72) Inventor: Taku Osawa, c/o NEC Corporation, 5-7-1 Shiba, Minato-ku, Tokyo
(72) Inventor: Satoshi Matsushita, c/o NEC Corporation, 5-7-1 Shiba, Minato-ku, Tokyo
F-term (reference): 5B013 DD01 DD04 5B033 AA13 DD01 5B098 AA02 AA10 GA05 GC01

Claims

[Claims]

1. A multi-thread processor comprising a plurality of program counters and a plurality of thread execution units that simultaneously fetch, interpret, and execute instructions of a plurality of threads according to the plurality of program counters, the processor having a machine language instruction that enables a thread being processed in any of the thread execution units to generate a new thread at most once and a machine language instruction that enables the thread being processed to terminate, the processor comprising:
thread allocation means for directly performing, by hardware, a process of allocating a new thread generated based on the machine language instruction to a thread execution unit other than the thread execution unit executing the machine language instruction;
thread allocation instructing means for instructing, by software, whether or not the thread allocation means is to perform the process of allocating the new thread generated based on the machine language instruction to a thread execution unit other than the thread execution unit executing the machine language instruction, when a thread being processed in any of the thread execution units executes the machine language instruction for generating a new thread;
register context holding means for storing and holding the register context required to start execution of the new thread generated based on the machine language instruction, when the thread allocation instructing means has instructed the thread allocation means not to perform the thread allocation;
thread execution end instruction execution detecting means for detecting that the thread being processed in the thread execution unit that executed the machine language instruction has executed the machine language instruction for terminating the thread; and
register context restoring means for restoring onto the thread execution unit, when the machine language instruction for terminating the thread being processed in any of the thread execution units is issued, the register context that is held by the register context holding means and is required to start execution of the new thread;
wherein, as necessary, the new thread generated when a thread being processed in any of the thread execution units executes the machine language instruction for generating a new thread is processed by that same thread execution unit after the thread being processed on it has ended.
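
As an informal illustration of the control flow recited in claim 1, the following Python sketch models the two outcomes of the thread-generating instruction. It is a simplified model under stated assumptions, not the claimed hardware: the names `Processor`, `fork`, and `terminate`, the idle-unit search order, and the error raised when no unit is idle are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Thread:
    name: str
    registers: dict = field(default_factory=dict)


@dataclass
class ExecutionUnit:
    uid: int
    running: Optional[Thread] = None
    # Stand-in for the register context holding means (one slot is enough
    # because a thread may generate a new thread at most once).
    held: Optional[Thread] = None


class Processor:
    def __init__(self, n_units: int, hw_allocation_enabled: bool):
        self.units: List[ExecutionUnit] = [ExecutionUnit(i) for i in range(n_units)]
        # Stand-in for the thread allocation instructing means (set by software).
        self.hw_allocation_enabled = hw_allocation_enabled

    def fork(self, uid: int, child: Thread) -> str:
        """Model the thread-generating instruction executed on unit `uid`."""
        if self.hw_allocation_enabled:
            # Hardware allocation to another unit; idle-first search is an assumption.
            for other in self.units:
                if other.uid != uid and other.running is None:
                    other.running = child
                    return f"hw:{other.uid}"
            raise RuntimeError("no idle unit; this case is outside the sketch")
        # Allocation suppressed: hold the child's register context locally.
        self.units[uid].held = child
        return "held"

    def terminate(self, uid: int) -> Optional[Thread]:
        """Model the thread-ending instruction: restore any held context."""
        unit = self.units[uid]
        unit.running, unit.held = unit.held, None
        return unit.running
```

With hardware allocation disabled, `fork` leaves the other units untouched and `terminate` makes the held thread the running thread of the same unit, which is the "processed by that same thread execution unit" behaviour described in the final clause of the claim.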

2. The multi-thread processor according to claim 1, wherein the register context holding means is a memory area on a main memory of the thread execution unit and stores the register context of the generated new thread by software processing using an instruction trap.

3. The multi-thread processor according to claim 2, wherein:
the register context holding means, when the thread allocation instructing means has instructed the thread allocation means not to perform the thread allocation and a thread execution unit attempts to process the machine language instruction for generating a new thread, traps the instruction to an instruction address X predetermined by hardware; stores the instruction address at which the machine language instruction for generating the new thread was executed in an address register that indicates the trap source instruction and is provided individually in the thread execution unit; saves to the main memory, by a program having the instruction address X as its entry point, the register context required to start execution of the new thread; calculates the start instruction address of the new thread based on the instruction address held in the address register and saves the calculation result to the main memory; and, based on the instruction address held in the address register, returns program control to the instruction address following the machine language instruction that generated the new thread;
the thread execution end instruction execution detecting means, when the thread being executed in the thread execution unit executes a thread execution end instruction, traps the instruction to an instruction address Y predetermined by hardware; and
the register context restoring means, by a program having the instruction address Y as its entry point, restores to the registers the register context of the new thread saved on the main memory and returns program control to the start instruction address of the new thread saved on the main memory.
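
The two trap handlers described in claim 3 can be sketched roughly as follows. This is only an illustrative model: the fixed instruction width `INSN_SIZE`, the PC-relative `offset` used to compute the new thread's start address, and the dictionary standing in for the main-memory save area are assumptions made for the example and are not taken from the claim.

```python
from dataclasses import dataclass, field

INSN_SIZE = 4  # assumed fixed instruction width in bytes (hypothetical)


@dataclass
class Unit:
    pc: int = 0
    regs: dict = field(default_factory=dict)
    trap_source: int = 0  # address register indicating the trap source instruction
    save_area: dict = field(default_factory=dict)  # stand-in for the main-memory area


def on_fork_trap(unit: Unit, fork_addr: int, offset: int) -> None:
    """Handler entered at address X when the thread-generating instruction traps."""
    unit.trap_source = fork_addr                    # record the trapping instruction address
    unit.save_area["regs"] = dict(unit.regs)        # save the context the new thread starts with
    # Start address of the new thread; PC-relative encoding is an assumption.
    unit.save_area["start"] = unit.trap_source + offset
    unit.pc = unit.trap_source + INSN_SIZE          # resume the parent after the fork instruction


def on_term_trap(unit: Unit) -> None:
    """Handler entered at address Y when the thread-ending instruction traps."""
    unit.regs = dict(unit.save_area.pop("regs"))    # restore the saved register context
    unit.pc = unit.save_area.pop("start")           # transfer control to the new thread
```

In this model the parent continues to modify its registers after the fork trap returns, yet the new thread later starts from the snapshot taken at the fork, because the context was copied into the save area rather than shared.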

4. The multi-thread processor according to claim 1, wherein the register context holding means is a set of storage devices for register context saving provided exclusively for each thread execution unit and stores the register context of the generated new thread directly by a hardware sequence, and the register context restoring means restores the register context held in the storage devices for register context saving onto the thread execution unit directly by a hardware sequence.

5. A multi-thread processor comprising a plurality of program counters and a plurality of thread execution units that simultaneously fetch, interpret, and execute instructions of a plurality of threads according to the plurality of program counters, the processor having a machine language instruction that enables a thread being processed in any of the thread execution units to generate a new thread at most once and a machine language instruction that enables the thread being processed to terminate, the processor comprising:
a register inheritance device that comprises a shared physical register file shared by the plurality of thread execution units and including a plurality of physical registers, a conversion table provided in each of the plurality of thread execution units and defining a mapping relationship between each logical register in that thread execution unit and one of the plurality of physical registers, and conversion table information copying means for copying the conversion table information of a thread execution unit to an adjacent thread execution unit, the register inheritance device inheriting register contents by grouping the plurality of physical registers for which a mapping relationship with one logical register is defined and by adding information indicating a position within the group to the conversion table information to define the mapping relationship;
thread allocation instructing means for instructing, by software, whether or not the thread allocation means is to perform the process of allocating the new thread generated based on the machine language instruction to a thread execution unit other than the thread execution unit executing the machine language instruction, when a thread being processed in any of the thread execution units executes the machine language instruction for generating a new thread;
register context holding means for storing and holding the register context required to start execution of the new thread generated based on the machine language instruction, when the thread allocation instructing means has instructed the thread allocation means not to perform the thread allocation;
thread execution end instruction execution detecting means for detecting that the thread being processed in the thread execution unit that executed the machine language instruction has executed a thread execution end instruction; and
register context restoring means for restoring onto the thread execution unit, when the machine language instruction for terminating the thread being processed in any of the thread execution units is issued, the register context that is held by the register context holding means and is required to start execution of the new thread;
wherein, as necessary, the new thread generated when a thread being processed in any of the thread execution units executes the machine language instruction for generating a new thread is processed by that same thread execution unit after the thread being processed on it has ended.
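
The register inheritance idea in claim 5 (copying the conversion table to an adjacent unit, with each logical register backed by a group of physical registers and the table recording a position within the group) can be pictured with the deliberately simplified Python model below. The group size, the round-robin choice of the next slot, and all names are assumptions for illustration; the actual bit-level encoding (group selection bit, physical extension bit, change bit) is what FIGS. 13 to 18 describe and is not modeled here.

```python
GROUP_SIZE = 4  # assumed number of physical registers per logical register (hypothetical)


class SharedRegisterFile:
    """Physical registers shared by all units, grouped per logical register."""

    def __init__(self, n_logical: int):
        # physical[l][p] is the p-th physical register in the group of logical register l
        self.physical = [[0] * GROUP_SIZE for _ in range(n_logical)]


class MappingTable:
    """Per-unit conversion table: position within the group for each logical register."""

    def __init__(self, n_logical: int):
        self.position = [0] * n_logical

    def copy_from(self, other: "MappingTable") -> None:
        # Copying only the table, not the data, is what lets the new thread inherit registers.
        self.position = list(other.position)


def read(rf: SharedRegisterFile, table: MappingTable, logical: int) -> int:
    return rf.physical[logical][table.position[logical]]


def write_after_fork(rf: SharedRegisterFile, table: MappingTable, logical: int, value: int) -> None:
    """Write into a different slot of the group so an inherited value stays intact."""
    table.position[logical] = (table.position[logical] + 1) % GROUP_SIZE
    rf.physical[logical][table.position[logical]] = value
```

In this model the child's table still points at the slot that held the parent's value at fork time, so a later write by the parent lands in another member of the group and does not disturb what the child reads.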

6. The multi-thread processor according to claim 5, wherein the register context holding means is a memory area on a main memory of the thread execution unit and stores the register context of the generated new thread by software processing using an instruction trap.

7. The multi-thread processor according to claim 6, wherein:
the register context holding means, when the thread allocation instructing means has instructed the thread allocation means not to perform the thread allocation and a thread execution unit attempts to process the machine language instruction for generating a new thread, traps the instruction to an instruction address X predetermined by hardware; stores the instruction address at which the machine language instruction for generating the new thread was executed in an address register that indicates the trap source instruction and is provided individually in the thread execution unit; saves to the main memory, by a program having the instruction address X as its entry point, the register context required to start execution of the new thread; calculates the start instruction address of the new thread based on the instruction address held in the address register and saves the calculation result to the main memory; and, based on the instruction address held in the address register, returns program control to the instruction address following the machine language instruction that generated the new thread;
the thread execution end instruction execution detecting means, when the thread being executed in the thread execution unit executes a thread execution end instruction, traps the instruction to an instruction address Y predetermined by hardware; and
the register context restoring means, by a program having the instruction address Y as its entry point, restores to the registers the register context of the new thread saved on the main memory and returns program control to the start instruction address of the new thread saved on the main memory.

8. The multi-thread processor according to claim 5, wherein the register context holding means is a set of storage devices for register context saving provided exclusively for each thread execution unit and stores the register context of the generated new thread directly by a hardware sequence, and the register context restoring means restores the register context held in the storage devices for register context saving onto the thread execution unit directly by a hardware sequence.