JPH1166033A

JPH1166033A - PE array device and associative memory block

Info

Publication number: JPH1166033A
Application number: JP24195397A
Authority: JP
Inventors: Takeshi Ikenaga; 剛池永; Takeshi Ogura; 武小倉
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1997-08-22
Filing date: 1997-08-22
Publication date: 1999-03-09
Anticipated expiration: 2017-08-22
Also published as: JP3627953B2

Abstract

PROBLEM TO BE SOLVED: To add the data of all the processing elements (PEs) constituting a PE array device at high speed without providing any special additional hardware outside the PE array device. SOLUTION: The PE array device is provided with n (n is a natural number of 2 or more) associative memory blocks 20, each having w (w is an arbitrary natural number) PEs 24, a hit flag register 25 capable of a shift operation, a function-shared register 26 operable as a pipeline register or as a counter, and a control circuit 27. An inter-block dedicated bus 14 connects the associative memory blocks 20. The control circuit 27 selects either a means for operating the function-shared register 26 as a pipeline register for transferring data between the associative memory blocks 20 or a means for operating it as a counter for counting the number of hit flags.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a massively parallel processing element array device (PE array device) for building various massively parallel computing devices such as image processing systems.

[0002]

2. Description of the Related Art

As network services become more visual and offer higher added value, the need for advanced image processing, audio processing, and knowledge processing is increasing. Such processing generally demands enormous processing performance, so it is often difficult to execute with existing microprocessors or signal processors based on the von Neumann architecture.

[0003] A massively parallel PE array device is known as an effective device for such processing. It carries a large number of PEs (processing elements) that perform various logical and arithmetic operations, and it uses the single instruction stream, multiple data stream (SIMD) scheme: one control circuit gives a single instruction sequence to every PE, so that all PEs execute the operations simultaneously.

[0004] An associative memory is also known (reference: Ogura, T. et al., "A 20-kbit Associative Memory LSI for Artificial Intelligence Machines", IEEE J. Solid-State Circuits, Vol. 24, No. 4, pp. 1014-1020, Aug. 1989). This associative memory is an integrated circuit that can realize a massively parallel PE array device of the kind described above with an extremely small amount of hardware.

[0005] A two-dimensional PE array device is also known. It uses the above associative memory as a building block and carries several hundred thousand PEs arranged two-dimensionally (reference: Ikenaga, T. et al., "CAM²: A Highly-parallel 2-D Cellular Automata Architecture for Real-time and Palm-top Pixel-level Image Processing", Euro-Par '96, Aug. 1996).

[0006] FIG. 4 is a diagram showing a conventional PE array device 40.

[0007] As shown in FIG. 4, the conventional PE array device 40 comprises a mask register 46, an address decoder 45, a hit flag register 48, W PEs 47, and a control circuit 44. Like an ordinary memory, the PE array device 40 can read or write data from or to any one of the W PEs 47 via the data input/output port 42 when a given address (value) is applied to the address input/output port 41.

[0008] The conventional PE array device 40 also has a mask search function, which compares search data supplied from the data input/output port 42 against the contents of the PEs in parallel and sets a hit flag for each matching PE, and a parallel partial write function, which writes data supplied from the data input/output port 42 in parallel to the PEs whose hit flags are set.

[0009] Using these two functions, various data transfers and logical and arithmetic operations can be executed in a word-parallel, bit-serial manner. In addition, data can be transferred between neighboring PEs (between words) by performing a mask search (comparing the search data against the contents of the PEs in parallel and setting a hit flag for each matching PE), then shifting the hit flag register, and then performing a parallel partial write.
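
The neighbor transfer described above (mask search, hit flag shift, parallel partial write) can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names, the convention that a 1 in `mask` marks a bit that takes part in the operation, and the shift direction (toward the higher word index) are all hypothetical and are not taken from the patent.

```python
from typing import List

def mask_search(words: List[int], key: int, mask: int) -> List[int]:
    """Compare every word with key on the bits selected by mask; one hit flag per word."""
    return [1 if (w & mask) == (key & mask) else 0 for w in words]

def parallel_partial_write(words: List[int], hits: List[int], data: int, mask: int) -> List[int]:
    """Write the selected bits of data into every word whose hit flag is set."""
    return [(w & ~mask) | (data & mask) if h else w for w, h in zip(words, hits)]

def shift_hits(hits: List[int]) -> List[int]:
    """Shift the hit flag register by one word (assumed direction: toward higher index)."""
    return [0] + hits[:-1]

def copy_bit_to_neighbor(words: List[int], src_bit: int, dst_bit: int) -> List[int]:
    """Copy bit src_bit of word i into bit dst_bit of word i + 1 (src_bit != dst_bit)."""
    src, dst = 1 << src_bit, 1 << dst_bit
    # Pass 1: words whose source bit is 1 flag their neighbor, which receives a 1.
    hits = shift_hits(mask_search(words, src, src))
    words = parallel_partial_write(words, hits, dst, dst)
    # Pass 2: words whose source bit is 0 flag their neighbor, which receives a 0.
    hits = shift_hits(mask_search(words, 0, src))
    words = parallel_partial_write(words, hits, 0, dst)
    return words
```

In this model one bit moves per pair of search/write passes, which is what makes the transfer word-parallel but bit-serial; word 0 has no left neighbor and keeps its destination bit.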

[0010]

[Problems to be solved by the invention]

However, in the conventional example, when only the parallel processing function within each PE or the data transfer function between neighboring PEs is used, the data of all the PEs constituting the PE array device 40 cannot be summed; that is, global processing cannot be realized. For example, when black-and-white image data is stored in the PE array device 40, the total number of PEs whose data is 1 (black pixels) cannot be obtained over all the black-and-white image data stored in the PE array device 40.

[0011] In the conventional example, summing the data of all the PEs constituting the PE array device 40 requires an additional circuit, such as a processor or an adder, to be provided in advance outside the PE array device 40. Data must be read out PE by PE through the single data input/output port 42 and added by the processor or adder, and this read-and-add operation must be repeated.

[0012] Consequently, in the conventional example, providing the additional circuit increases the amount of hardware, complicates the system, and lengthens the processing time.

[0013] Many massively parallel algorithms, in image processing and other fields, necessarily include a step that sums the data of all PEs. One example is the area calculation in pattern spectrum computation using morphology (reference: Obata, Morphology, Chapter 7, Corona Publishing).

[0014] In recent years, image processing applications that require real-time processing have increased. To serve these applications, the processing time of the whole algorithm must fit within the video rate (33 milliseconds). It is therefore desirable that the summation of the data of all PEs be achievable in an extremely short time.

[0015] Furthermore, there has recently been strong demand for compact, low-cost image processing systems. To realize them, devices are desired whose system components use as little hardware as possible and have a simple configuration.

[0016] An object of the present invention is to provide a PE array device that can add the data of all the PEs constituting the PE array device at high speed without any special additional hardware outside the PE array device.

[0017]

[Means for solving the problems]

The present invention is a PE array device comprising n (n is a natural number of 2 or more) associative memory blocks, each having w (w is an arbitrary natural number) PEs, a hit flag register capable of a shift operation, a function-shared register operable as a pipeline register or as a counter, and a control circuit, together with an inter-block dedicated bus that connects the associative memory blocks. The control circuit is a circuit that selects one of two means: a means for operating the function-shared register as a pipeline register for data transfer between the associative memory blocks, and a means for operating the function-shared register as a counter that counts the number of hit flags.

[0018]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a diagram showing the basic configuration of a PE array device 10 according to one embodiment of the present invention.

[0019] As shown in FIG. 1(1), the PE array device 10 has n (n is a natural number of 2 or more) associative memory (CAM) blocks 20(1), 20(2), ..., 20(n), and the associative memory blocks 20(1), 20(2), ..., 20(n) are connected to one another by an inter-block dedicated bus 14.

[0020] As shown in FIG. 1(2), the associative memory block 20(1) is composed of w (w is an arbitrary natural number) PEs (words) 24(1), 24(2), ..., 24(n), each of which can be used as a PE of the single instruction stream, multiple data stream (SIMD) scheme. The associative memory blocks 20(2), ..., 20(n) each have the same configuration as the associative memory block 20(1). The PE array device 10 as a whole therefore has n × w PEs (processing elements).

[0021] As shown in FIG. 1(2), the associative memory block 20(1) comprises an address decoder 22, a mask register 23, the PEs 24(1), 24(2), ..., 24(n), a hit flag register 25, a function-shared register 26, a control unit 27, a control line 28, and a path switching circuit 29.

[0022] As noted above, the associative memory blocks 20(2), ..., 20(n) have the same configuration as the associative memory block 20(1), so they are described below collectively as the associative memory block 20. Likewise, the PEs 24(1), 24(2), ..., 24(n) are described collectively as the PE 24.

[0023] The associative memory block 20 has a mask search function that compares search data supplied from the data input/output port 11 against the contents of the PEs 24(1), 24(2), ..., 24(n) in parallel and sets a hit flag for each matching PE. It also has a parallel partial write function that writes data supplied from the data input/output port 11 in parallel to the PEs whose hit flags are set. In both functions, the bits to be processed can be restricted by the mask register 23. Using these functions, various data transfers and logical and arithmetic operations, including addition, can be executed in a word-parallel, bit-serial manner.
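
How addition can be built from nothing but mask search and parallel partial write is not spelled out in the text, so the following is a hedged sketch of one well-known way to do it rather than the patent's own procedure. Each word is assumed to hold an operand field A, an operand field B, and a spare carry bit that is 0 on entry; the field layout, the function names, and the order of the four truth-table passes are all assumptions made for illustration.

```python
from typing import List

def mask_search(words: List[int], key: int, mask: int) -> List[bool]:
    """Hit flag per word: True where the word equals key on the bits selected by mask."""
    return [(w & mask) == (key & mask) for w in words]

def partial_write(words: List[int], hits: List[bool], data: int, mask: int) -> List[int]:
    """Write the selected bits of data into every word whose hit flag is set."""
    return [(w & ~mask) | (data & mask) if h else w for w, h in zip(words, hits)]

# Full-adder rows (a, b, carry) whose (b, carry) must change, with the new (b, carry).
# The order is chosen so that a word rewritten by one pass never matches a later pass.
PASSES = [((0, 0, 1), (1, 0)),
          ((0, 1, 1), (0, 1)),
          ((1, 1, 0), (0, 1)),
          ((1, 0, 0), (1, 0))]

def bit_serial_add(words: List[int], a_lo: int, b_lo: int, carry: int, width: int) -> List[int]:
    """In every word at once: B <- (A + B) mod 2**width, overflow left in the carry bit."""
    for i in range(width):  # bit-serial: least significant bit first
        a_bit, b_bit, c_bit = 1 << (a_lo + i), 1 << (b_lo + i), 1 << carry
        for (a, b, c), (new_b, new_c) in PASSES:  # word-parallel: all words per pass
            hits = mask_search(words, a * a_bit | b * b_bit | c * c_bit,
                               a_bit | b_bit | c_bit)
            words = partial_write(words, hits, new_b * b_bit | new_c * c_bit,
                                  b_bit | c_bit)
    return words
```

For example, with A in bits 0 to 2, B in bits 3 to 5, and the carry in bit 6, `bit_serial_add([3 | 5 << 3], 0, 3, 6, 3)` leaves B equal to 0 and the carry bit set, since 3 + 5 = 8. The cost is four search/write pairs per bit regardless of how many words there are.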

[0024] The associative memory block 20 also has a function of counting the number of hit flags by operating the function-shared register 26 as a counter and sequentially shifting the contents of the hit flag register 25 via the hit flag control line 21. This counting function for the hit flag register 25 can operate independently in each associative memory block 20.

[0025] The associative memory block 20 further has a function of switching the path from the PEs 24 to the function-shared register 26 by means of the path switching circuit 29 and operating the function-shared register 26 as a pipeline register. With this function, PE data can be transferred between adjacent associative memory blocks 20 via the inter-block dedicated bus 14.

[0026] Although the function-shared register 26 is a counter, it shares its register section, so both a pipeline register and a counter can be realized with a small amount of hardware. The function-shared register 26 switches between pipeline register operation and counter operation according to a control line, and the two modes share the same flip-flops (F/F). To keep the amount of hardware small, it is realized by the following functional description.

[0027] Namely, it is realized by the following functional description:

    output = register;
    if control line = 0, register <= input;
    else if control line = 1, register <= register + hit flag data;

(That is, the register counts up only when the control line is 1 and the hit flag data, which is 0 or 1, is 1.)
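
The functional description above can be mirrored in a few lines of Python. This is a behavioral sketch only: the class and method names are hypothetical, and the register width is left unbounded because the source does not state it.

```python
class FunctionSharedRegister:
    """Behavioral model: one register used as pipeline register or hit flag counter."""

    def __init__(self) -> None:
        self.value = 0  # the shared flip-flops

    def clock(self, control: int, data_in: int = 0, hit_flag: int = 0) -> int:
        """Advance one clock cycle and return the register output."""
        if control == 0:
            # Pipeline register mode: latch the input.
            self.value = data_in
        elif control == 1:
            # Counter mode: count up only when the hit flag data is 1.
            self.value += hit_flag
        else:
            raise ValueError("control must be 0 or 1")
        return self.value
```

Clocking with `control=0` and `data_in=0` is one way to clear the counter in this model before counting starts; whether the hardware clears it that way is not stated in the source.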

[0028] The two functions, namely the counting function for the hit flag register 25 and the inter-block data transfer function, are controlled by a control signal that is output from the control circuit 27 and passes through the control line 28.

[0029] Because the PE array device 10 is not a processor, it does not operate on its own. The control circuit 27 in each associative memory block 20 is merely an instruction decoder. A sequencer (not shown) that supplies the instruction sequence shown in FIG. 2 to the instruction input port 13 of the PE array device 10 must be provided externally.

[0030] Next, the procedure for adding all the PEs (words) in the PE array device 10 is described.

[0031] FIG. 2 is a flowchart showing the procedure for adding the contents of all the PEs constituting one associative memory 20 in the above embodiment.

[0032] The addition of all the PEs (words) constituting one associative memory 20 is performed in two parts: intra-block addition (S10) and inter-block addition (S20).

[0033] First, in the intra-block addition (S10), a search mask is set (S11) so that, among the bits of each PE, every bit other than the bit position (1 bit) holding the data to be added is masked. A mask search is then performed (S12), which transfers the data to be added stored in each PE to the hit flag register 25.

[0034] Next, the function-shared register 26 is operated as a counter and the hit flag register 25 is shifted, whereby the number of hit flags is counted (S13). This counting step is repeated W times, W being the number of PEs provided in the associative memory block 20 (S14). The counting is executed simultaneously in every associative memory block 20. Finally, the addition result held in the function-shared register 26 is written to a predetermined PE 24 (S15). Through this series of steps, a PE addition result is obtained for each associative memory block 20.
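
Steps S11 to S14 can be modeled as follows. This is a minimal sketch, assuming that the flag shifted out of the last position of the hit flag register is the one fed to the counter; the function names and the list-of-integers representation of a block are hypothetical. In hardware all blocks count at the same time, whereas the list comprehension here simply visits them in turn.

```python
from typing import List

def intra_block_count(words: List[int], bit: int) -> int:
    """Count the words of one block whose selected bit is 1 (steps S11 to S14)."""
    # S11 + S12: mask all other bits and search; the hit flags copy the selected bit.
    hits = [(w >> bit) & 1 for w in words]
    counter = 0  # function-shared register in counter mode
    for _ in range(len(words)):  # S14: repeat once per PE in the block
        counter += hits[-1]      # S13: the shifted-out flag increments the counter
        hits = [0] + hits[:-1]   # shift the hit flag register by one position
    return counter

def count_all_blocks(blocks: List[List[int]], bit: int) -> List[int]:
    """Per-block counts; each block's counter works independently of the others."""
    return [intra_block_count(block, bit) for block in blocks]
```

For a black-and-white image stored one pixel per word, `count_all_blocks(blocks, 0)` would give the number of black pixels in each block, which is the quantity that S15 then writes back to a predetermined PE.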

[0035] In the inter-block addition (S20), an inter-block transfer (S21) and a bit-serial addition (S22) are performed, and the two steps are repeated until the addition results have been aggregated into a single PE (S23); the per-block addition results are thus added while being aggregated in a tree-like manner. Finally, the addition result aggregated in the single PE is taken out of the PE array device 10 by a PE read (S24).

[0036] Next, the inter-block addition (S20) is described in detail.

[0037] FIG. 3 is a diagram showing an example of the inter-block addition (S20) for the case n = 4 in the above embodiment.

[0038] In the inter-block addition (S20), of the intra-block addition results A, B, C, and D held in a predetermined PE (word) of each associative memory block 20, the results A and C are each transferred by inter-block transfer to the PE (word), in the associative memory block 20 adjacent on the right, that stores the result B or D, respectively.

[0039] In this case, the function-shared register 26 is used as a pipeline register, and the results A and C are transferred via the inter-block dedicated bus 14. Next, the additions A + B (= E) and C + D (= F) are executed bit-serially by repeating mask search and parallel partial write. These transfers and additions may be executed simultaneously.

[0040] In the same way, the addition result E is transferred to the PE storing the addition result F by executing the inter-block transfer twice, and the sum E + F (= G) is then computed bit-serially by repeating mask search and parallel partial write. This result G is the sum over all the PEs constituting one associative memory 20.
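
The tree-shaped aggregation walked through above for n = 4 can be sketched as a loop over a doubling stride. This is an illustrative model, not the patent's control sequence: it assumes n is a power of two, counts one hop per adjacent-block transfer, and uses Python's `+` in place of the bit-serial addition performed by mask search and parallel partial write. The function name and return convention are hypothetical.

```python
from typing import List, Tuple

def inter_block_add(partials: List[int]) -> Tuple[int, int, int]:
    """Tree-reduce per-block sums; return (total, rounds, adjacent-block hops)."""
    n = len(partials)
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes the block count is a power of two")
    sums = list(partials)
    rounds = hops = 0
    stride = 1
    while stride < n:
        # S21 + S22: each source sends its sum `stride` blocks to the right, where
        # it is added to the sum already stored there. All pairs work in parallel.
        for dst in range(2 * stride - 1, n, 2 * stride):
            sums[dst] += sums[dst - stride]
        hops += stride   # a transfer over `stride` blocks takes `stride` hops
        rounds += 1      # S23: repeat until a single result remains
        stride *= 2
    return sums[n - 1], rounds, hops
```

With `[A, B, C, D]` as input, the first round forms E = A + B and F = C + D after one hop each, and the second round moves E two hops to form G = E + F, which matches the two transfers mentioned in the text.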

[0041] Finally, the addition result G is taken out of the PE array device 10 by a PE read (word read). When the number n of associative memories 20 is larger, the inter-block addition (S20) can be executed by the same procedure.

[0042] In the conventional PE array device 40, by contrast, adding all the PEs (words) as described above requires a processor or the like outside the PE array device 40; data must be read out PE by PE through the single data input/output port 42 and added repeatedly by the external processor. This needs as many processing cycles as there are PEs, so the processing time is long, and a special circuit for performing the addition must be provided outside the PE array device 40.

[0043] In the above embodiment, however, the intra-block addition (S10) can be executed in (total number of PEs (words)) / n cycles, and the inter-block addition (S20) can be executed by repeating block transfer and addition log₂ n times. Therefore, when the number n of associative memories 20 is increased, the addition can be completed in roughly 1/n of the processing time required by the conventional example.
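
The cycle counts in this paragraph can be written out symbolically. The block below is a sketch rather than a formula from the source: $N$ denotes the total number of PEs, and $c_k$ is a hypothetical symbol for the cost of the $k$-th inter-block round (transfer plus bit-serial addition), which the source does not quantify.

```latex
\begin{align*}
N &= n\,w && \text{(total number of PEs)}\\
T_{\mathrm{conv}} &\approx N && \text{(one read-and-add cycle per PE)}\\
T_{\mathrm{new}} &\approx \frac{N}{n} + \sum_{k=1}^{\log_2 n} c_k
  && \text{(intra-block count plus } \log_2 n \text{ inter-block rounds)}\\
\frac{T_{\mathrm{new}}}{T_{\mathrm{conv}}} &\approx \frac{1}{n}
  && \text{when } \sum_{k=1}^{\log_2 n} c_k \ll \frac{N}{n}
\end{align*}
```

The "roughly 1/n" figure therefore holds to the extent that the inter-block rounds are cheap compared with the W = N/n counting cycles inside each block.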

[0044] That is, in the example of FIG. 3, where the number of associative memory blocks 20 is n = 4, repeating inter-block transfer and bit-serial addition twice reduces the number of per-block addition results as 4 → 2 → 1. Similarly, for n = 8, three repetitions reduce the count as 8 → 4 → 2 → 1, and for n = 16, four repetitions reduce it as 16 → 8 → 4 → 2 → 1. In other words, log₂ n repetitions of inter-block transfer and bit-serial addition suffice. The inter-block addition therefore takes time of logarithmic order in n, and as n is increased, the inter-block addition time becomes negligibly small.

[0045] In addition, in the above embodiment, the addition over the data of all the PEs constituting the associative memory 20 can be executed using only the functions of the PE array device 10, without attaching any external device to it.

[0046] In the above embodiment, the pipeline register used for transferring data between the associative memory blocks 20 (the function-shared register 26) is also given a counter function, and the shift output of the hit flag register 25 is fed into it, so that the number of hit flags can be counted. In this way, the sum over all the PEs of each associative memory block 20 is obtained for every block. These per-block results are then added repeatedly while being aggregated in a tree-like manner by inter-block transfer, so that the sum over all PEs is finally collected in one word of a particular associative memory block 20. By reading out this sum, the addition of the data of all the PEs constituting the PE array device 10 can be carried out inside the PE array device 10 without any special additional hardware outside.

[0047] Of the above processing, the hit flag counting can be performed in parallel in all the associative memory blocks 20, and the addition between associative memory blocks 20 is performed with tree-like aggregation, so both take little time. Compared with the conventional approach of adding while reading out the PEs (words) one at a time, the addition of the data of all PEs can therefore be executed in a short time.

[0048]

[Effects of the invention]

According to the present invention, the data of all the PEs constituting the PE array device can be added without any special additional hardware outside the PE array device, and the addition is fast.

[Brief description of the drawings]

FIG. 1 is a diagram showing the basic configuration of a PE array device 10 according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the procedure for adding the contents of all the PEs constituting one associative memory 20 in the above embodiment.

FIG. 3 is a diagram showing an example of the inter-block addition (S20) for the case n = 4 in the above embodiment.

FIG. 4 is a diagram showing a conventional PE array device 40.

[Explanation of symbols]

10: PE array device; 14: inter-block dedicated bus; 20: associative memory block; 22: address decoder; 23: mask register; 24: PE (word); 25: hit flag register; 26: function-shared register; 27: control unit; 29: path switching circuit.

Claims

[Claims]

1. A PE array device comprising: n (n is a natural number of 2 or more) associative memory blocks, each having w (w is an arbitrary natural number) PEs, a hit flag register capable of a shift operation, a function-shared register operable as a pipeline register or as a counter, and a control circuit; and an inter-block dedicated bus connecting the associative memory blocks; wherein the control circuit is a circuit that selects one of a means for operating the function-shared register as a pipeline register for data transfer between the associative memory blocks and a means for operating the function-shared register as a counter for counting the number of hit flags.

2. An associative memory block comprising: w (w is an arbitrary natural number) PEs; a hit flag register capable of a shift operation; a function-shared register operable as a pipeline register or as a counter; and a control circuit that selects one of a means for operating the function-shared register as a pipeline register for data transfer between associative memory blocks and a means for operating the function-shared register as a counter for counting the number of hit flags.

3. An operation method using a PE array device, comprising: a search mask setting step of masking, among the bits of each PE, the bits other than the bit position where data to be added is stored; a mask search step of transferring the data to be added stored in each PE to a hit flag register; a hit flag counting step of counting the number of hit flags by operating a function-shared register as a counter and shifting the hit flag register; a count repetition step of repeating the counting step as many times as the number of PEs provided in an associative memory block, the counting being executed simultaneously in each associative memory block; an addition result writing step of writing the addition result held in the function-shared register to a predetermined PE; an aggregation addition step of adding the addition results while aggregating them in a tree-like manner into one PE, by repeating an inter-block transfer step of transferring data between blocks via the function-shared register functioning as a pipeline register and a bit-serial addition step executed by repeating mask search and parallel partial write; and an addition result extraction step of taking the addition result aggregated in the one PE out of the PE array device.

4. An apparatus comprising: a PE array device that includes n (n is a natural number of 2 or more) associative memory blocks, each having w (w is an arbitrary natural number) PEs, a hit flag register capable of a shift operation, a function-shared register operable as a pipeline register or as a counter, and a control circuit, and an inter-block dedicated bus connecting the associative memory blocks, the control circuit being a circuit that selects one of a means for operating the function-shared register as a pipeline register for data transfer between the associative memory blocks and a means for operating the function-shared register as a counter for counting the number of hit flags; and a sequencer that controls addition processing within the associative memory blocks and addition processing between the associative memory blocks; wherein the PE array device and the sequencer are connected via an instruction input port.