JP5575947B2 - Multiprocessor device

Info

Publication number: JP5575947B2
Application number: JP2013078334A
Authority: JP
Inventors: 幸一石見
Original assignee: Renesas Electronics Corp
Current assignee: Renesas Electronics Corp
Priority date: 2013-04-04
Filing date: 2013-04-04
Publication date: 2014-08-20
Anticipated expiration: 2027-01-22
Also published as: JP2013140630A

Description

The present invention relates to an optimal bus configuration and layout configuration for a multiprocessor device in which a plurality of multiprocessor groups are implemented on the same LSI.

Conventional bus configurations for a multiprocessor device that implements, on a single semiconductor chip, a plurality of multiprocessors of the same architecture together with processors of different architectures such as CPUs and DSPs have taken one of two forms. In the first, described in Non-Patent Document 1 below, all the multiprocessors are connected to a single bus. In the second, described in Non-Patent Document 2 below, each CPU has its own local bus so that multiprocessors and buses sharing the same protocol can be connected, and the local buses are joined to one another by bridges.

When all the multiprocessors are connected to a single bus, the external bus I/Fs are connected to that same bus, whether the LSI has one external bus I/F or several.

When the design is split into multiple buses joined by bridges, each local bus has exactly one processor connected to it, each local bus is connected to a single bus master, and only one bus is connected to the external bus I/F.

Non-Patent Document 1: Toshiba, EmotionEngine; SCE/IBM/Toshiba, Cell, February 9, 2005 [retrieved January 9, 2007], Internet <http://ascii24.com/news/i/tech/article/2005/02/09/654178-000.html>
Non-Patent Document 2: Renesas, G1, February 2006, ISSCC 2006, Fig. 29.5.1, "A Power Management Scheme Controlling 20 Power Domains for a Single-Chip Mobile Processor"

However, when a plurality of multiprocessors are connected to one bus and different architectures are included, processing speed usually differs between processors, so the low-speed processors impede the high-speed processors and the performance of the high-speed processors does not improve. Furthermore, when processors that mainly perform data processing, such as DSPs or SIMD massively parallel processors, are included, they handle large volumes of data, so bus accesses from the multiprocessor side are kept waiting for long periods and the performance benefit of multiprocessing is lost.

In addition, cache coherency is guaranteed for multiprocessors of the same architecture but in most cases is not guaranteed across different architectures, so consistency cannot be maintained.

Furthermore, when a multiprocessor-capable OS is run, processors of different architectures come from different developers, so an OS is rarely written to support several of them; in most cases a multiprocessor-capable OS supports only processors of the same architecture. A processor of a different architecture therefore runs a separate OS. When different OSs share a bus, however, the situation is equivalent to having a bus master IP unknown to the OS connected to that bus, which hinders performance gains such as those from scheduling by the multiprocessor-capable OS.

Even when the design is split into multiple buses joined by bridges, each local bus has only one bus master, so a CPU together with its local bus can be regarded as a single CPU, and the same problems arise as when different architectures share one bus.

Moreover, because the external bus I/Fs are connected to a single bus whether the LSI has one or several of them, the bus to which the external bus I/F is connected is frequently stalled by external bus access requests from other buses and cannot deliver the desired performance. Conversely, a bus to which no external bus I/F is connected suffers reduced performance when it accesses the external bus I/F through another bus.

The present invention has been made to solve these problems, and its object is to obtain a high-performance multiprocessor device by providing an independent bus and external bus I/F for each different architecture.

A multiprocessor device according to one embodiment of the present invention comprises, on a single semiconductor chip, a plurality of first processors, a plurality of second processors, a first bus to which the plurality of first processors are connected, a second bus to which the plurality of second processors are connected, a first external bus I/F to which the first bus is connected, and a second external bus I/F to which the second bus is connected. The first processors and the second processors are controlled by separate clock systems and differ in frequency or phase.

According to one embodiment of the present invention, when a plurality of multiprocessor groups are implemented on the same semiconductor chip, an independent bus and external bus I/F are provided for each different architecture. With this configuration each multiprocessor group can operate almost independently, so arbitration and bus contention between processors of different architectures are reduced, and a high-performance multiprocessor system can be realized at low cost and low power.

FIG. 1 is a configuration diagram showing a multiprocessor device according to Embodiment 1 of the present invention.
FIG. 2 is a layout diagram according to Embodiment 2 of the present invention.
FIG. 3 is a layout diagram according to Embodiment 2 of the present invention.
FIG. 4 is a layout diagram according to Embodiment 2 of the present invention.
FIG. 5 is a layout diagram according to Embodiment 3 of the present invention.
FIG. 6 is a configuration diagram showing a multiprocessor device according to Embodiment 4 of the present invention.
FIG. 7 is a configuration diagram showing a multiprocessor device according to Embodiment 5 of the present invention.
FIG. 8 is a timing chart according to Embodiment 6 of the present invention.
FIG. 9 is a diagram showing a clock supply circuit in the prior art.
FIG. 10 is a diagram showing a clock supply circuit according to Embodiment 6 of the present invention.
FIG. 11 is a software block diagram according to Embodiment 7 of the present invention.
FIG. 12 is a software block diagram according to Embodiment 7 of the present invention.

[Embodiment 1]
FIG. 1 is a configuration diagram showing a multiprocessor device according to Embodiment 1 of the present invention; the device is formed on a single semiconductor chip. A plurality of processors, CPUs 1 to 8, are arranged in parallel (the first processor group) in an SMP (Symmetric Multiple Processor) configuration. Each CPU contains primary caches (I-cache, D-cache), internal memory (U-LM), an MMU (memory management unit), and an SDI (debugger). The eight CPUs are connected to the CPU bus 10 (the first bus), which is connected to the secondary cache 12 via the CPU bus control unit 11. The secondary cache 12 is connected to external bus 1 via the DDR2 I/F 13 (the first external bus I/F).

The inside of each CPU operates at up to 533 MHz. The frequency is converted by the bus I/F inside the CPU, and the CPU connects to the CPU bus 10 at up to 266 MHz. The secondary cache 12 and the DDR2 I/F 13 operate at up to 266 MHz.

In addition to the CPU bus 10, the LSI of the present invention has an internal peripheral bus 14 (the second bus) on the same semiconductor chip. Connected to the internal peripheral bus 14 are peripheral circuits 15 such as an ICU (interrupt controller), ITIM (periodic timer), UART (Universal Asynchronous Receiver Transmitter: asynchronous serial I/O), CSIO (clock-synchronous serial I/O), and CLKC (clock controller); a DMAC 16 (DMA controller); a built-in SRAM 17; matrix-type massively parallel processors in an SMP configuration (SIMD massively parallel processors 31 and 32, the second processor group); an external bus control unit 18 (the second external bus I/F); and a CPU 19 of a different architecture. The internal peripheral bus 14 is connected to external bus 2 via the external bus control unit 18, forming an external bus access path to external devices such as SDRAM, ROM, RAM, and I/O.

The internal peripheral bus 14 operates at up to 133 MHz, as do the DMAC 16, the built-in SRAM 17, and the peripheral circuits 15. The inside of each SIMD massively parallel processor operates at up to 266 MHz; the frequency is converted by the bus I/F inside the processor, which connects to the internal peripheral bus 14. The inside of the CPU 19 likewise operates at up to 266 MHz, and its frequency is converted by the bus I/F inside the CPU 19 for connection to the internal peripheral bus 14. Because processing speed differs in this way, each processor group is controlled by a separate clock system, and frequency, phase, and so on differ between the groups.
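The frequencies quoted above can be tabulated to check the core-to-bus ratio in each clock domain. The following is a minimal sketch: the maximum frequencies come from the text, while the dictionary layout and the names `DOMAINS`, `core_to_bus_ratio`, and `RATIOS` are illustrative assumptions.

```python
# Maximum frequencies in MHz as stated in the description. The data layout
# and all identifiers are assumptions made for this sketch.
DOMAINS = {
    "CPU 1-8": {"core_mhz": 533, "bus": "CPU bus 10", "bus_mhz": 266},
    "SIMD 31/32": {"core_mhz": 266, "bus": "internal peripheral bus 14", "bus_mhz": 133},
    "CPU 19": {"core_mhz": 266, "bus": "internal peripheral bus 14", "bus_mhz": 133},
}


def core_to_bus_ratio(core_mhz, bus_mhz):
    """Return the nearest integer core:bus frequency ratio."""
    return round(core_mhz / bus_mhz)


# Ratio handled by the bus I/F inside each processor at maximum frequency.
RATIOS = {
    name: core_to_bus_ratio(d["core_mhz"], d["bus_mhz"])
    for name, d in DOMAINS.items()
}
```

At the stated maximum frequencies every processor runs at a nominal 2:1 ratio to its own bus, while the two buses themselves differ by a factor of two (266 MHz against 133 MHz).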

The CPU bus 10 and the internal peripheral bus 14 are connected through the secondary cache 12. CPUs 1 to 8 can therefore access not only external bus 1, through the secondary cache 12 and the DDR2 I/F 13, but also the resources on the internal peripheral bus 14 through the secondary cache 12. Consequently CPUs 1 to 8 can also access the other external bus, external bus 2, through the external bus control unit 18, although data transfer performance on that route is low because the path is long and the frequency is low. Each module connected to the internal peripheral bus 14 can access external bus 2 through the external bus control unit 18 but cannot access external bus 1.
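The access rules of Embodiment 1 can be checked with a small reachability model. This is a sketch only: the directed links follow the connections described in the text, but the node labels and the function name `reachable` are assumptions introduced here.

```python
from collections import deque

# Directed links of FIG. 1 as described in the text. Node labels are
# illustrative names chosen for this sketch.
LINKS = {
    "CPU1-8": ["CPU bus 10"],
    "CPU bus 10": ["L2 cache 12"],  # via CPU bus control unit 11
    "L2 cache 12": ["DDR2 I/F 13", "internal peripheral bus 14"],
    "DDR2 I/F 13": ["external bus 1"],
    "SIMD 31/32": ["internal peripheral bus 14"],
    "CPU 19": ["internal peripheral bus 14"],
    "DMAC 16": ["internal peripheral bus 14"],
    "internal peripheral bus 14": ["external bus control unit 18", "SRAM 17"],
    "external bus control unit 18": ["external bus 2"],
}


def reachable(start):
    """Return every node that `start` can reach by following LINKS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in LINKS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

In this model `reachable("CPU1-8")` contains both external buses, whereas `reachable("SIMD 31/32")` contains external bus 2 only, because no link leads from the internal peripheral bus 14 back toward the secondary cache 12.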

CPUs 1 to 8 share the same architecture. The contents of the primary and secondary caches are kept consistent by coherency control, so there is no risk of a CPU malfunctioning. Even when a multiprocessor-capable OS is used, high performance can be achieved, because the only components on the CPU bus 10 are the eight same-architecture CPUs and the secondary cache 12, and accesses to external bus 1 are limited to those from CPUs 1 to 8. In particular, a SIMD massively parallel processor runs slower than the CPUs and handles large amounts of data when processing, so it tends to occupy a bus for long periods; because it accesses external bus 2 through the internal peripheral bus 14, however, it has no effect on the CPU bus 10 side.

From the viewpoint of the SIMD massively parallel processors, the CPUs mainly use the path from the CPU bus 10 to external bus 1, so the internal peripheral bus 14 no longer has to be released for the CPUs during a data transfer, and data can be transferred efficiently. The effect is especially pronounced because the CPUs form a multiprocessor. This example has eight CPUs; if 16, 32, or more processors shared a bus with a data-processing processor such as a SIMD massively parallel processor, data processing would stall, so the benefit of the present invention becomes even greater.
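How the CPU count affects a SIMD processor on a shared bus can be illustrated with a toy arbitration model. The sketch below assumes a round-robin arbiter in which every other master is served once before the requester; the burst lengths `CPU_BURST` and `SIMD_BURST` are hypothetical values, not figures from the patent.

```python
CPU_BURST = 4    # hypothetical bus cycles per CPU transfer
SIMD_BURST = 64  # hypothetical bus cycles per SIMD transfer


def worst_case_wait(bursts, requester):
    """Cycles `requester` waits if every other master is served once first."""
    return sum(length for name, length in bursts.items() if name != requester)


def simd_worst_case_wait(n_cpus, shared_bus):
    """Worst-case wait of SIMD31 with or without the CPUs on its bus."""
    bursts = {"SIMD31": SIMD_BURST, "SIMD32": SIMD_BURST}
    if shared_bus:
        bursts.update({f"CPU{i}": CPU_BURST for i in range(1, n_cpus + 1)})
    return worst_case_wait(bursts, "SIMD31")


# Wait in cycles for 8, 16, and 32 CPUs on a shared bus and on split buses.
SHARED = {n: simd_worst_case_wait(n, shared_bus=True) for n in (8, 16, 32)}
SPLIT = {n: simd_worst_case_wait(n, shared_bus=False) for n in (8, 16, 32)}
```

Under these assumptions the shared-bus wait grows linearly with the number of CPUs, while the split-bus wait stays constant, which is the trend the paragraph above describes.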

The CPU 19 is a microprocessor with lower operating speed and processing performance than CPUs 1 to 8 but with lower power consumption and smaller area. It can handle tasks that need little computing performance, such as starting the peripheral circuits 15, checking timers, and power management using the CLKC. Placing it on the same bus as the SIMD massively parallel processors therefore does not degrade their performance.

[Embodiment 2]
FIGS. 2 to 4 show layouts of the multiprocessor device according to Embodiment 2 of the present invention. FIG. 2 is a layout example in which the modules of Embodiment 1 are placed on an actual silicon wafer. FIG. 3 shows the layout of FIG. 2 with the CPU bus related modules (CPUs 1 to 8 and CPU bus control) grouped into a CPU bus area 20 and the internal peripheral bus related modules (SIMD massively parallel processors 31 and 32, CPU 19, built-in SRAM 17, peripheral circuits 15, external bus control unit 18, and DMAC 16) grouped into an internal peripheral bus area 21. FIG. 4 is a conceptual diagram of the power supply/GND wiring 22 in Embodiment 2.

With the layout of FIG. 2, the internal peripheral bus 14 and the CPU bus 10 can each be routed over the shortest distance, as shown in the figure. This allows high-speed operation and makes congestion from forced wire crossings unlikely, so the area shrinks and cost falls. Fewer signal lines other than the buses cross one another as well, so slowdowns caused by routing congestion or long wires are less likely, and a low-power, low-cost LSI can be realized. Because the chip is partitioned by bus area, controls such as power shutdown are also easier to apply.

The internal peripheral bus area 21 and the CPU bus area 20 also differ in operating frequency and processing capability, and therefore in power consumption. The CPU bus area 20, with its fast clock and high power consumption, needs low-impedance wiring, whereas the internal peripheral bus area 21, with its slow clock and low power consumption, can tolerate relatively high impedance. Low-impedance wiring for a high-power area can be obtained by widening the wires or narrowing the spacing between them, but the power supply/GND wiring 22 then occupies more of the wiring layer, which makes other signal lines harder to route. The result is larger LSI area and higher cost, and power consumption rises because detoured signal wires add wiring capacitance. If the two kinds of area are intermixed, the whole chip must use low-impedance wiring to guarantee stable operation, which increases area and cost.

When the chip is divided into a high-power CPU bus area 20 and a low-power internal peripheral bus area 21 as in FIG. 3, the low-impedance power supply/GND wiring 22 need be applied only to the CPU bus area 20. For example, as in FIG. 4, thick wires can be laid densely in the CPU bus area 20 and thin wires sparsely in the internal peripheral bus area 21. This eliminates unnecessary power wiring and lowers cost while still guaranteeing stable operation. The same applies to the power terminals: as in FIG. 4, the power supply/GND terminals 23 can be dense in the CPU bus area 20 and sparse in the internal peripheral bus area 21.

The lines drawn over the areas in FIG. 4 are power supply or GND lines, and the circles along the chip edge are power supply or GND terminals. The figure shows a few thick wires for illustration; in practice many more, thinner wires are laid. For example, in a manufacturing process whose minimum signal wire width is 0.2 μm, the CPU bus area 20 is wired with 1 μm wide lines at a 4 μm pitch and the internal peripheral bus area 21 with 0.4 μm wide lines at a 100 μm pitch. This eliminates unnecessary power supply/GND terminals 23 while still guaranteeing stable operation. As FIG. 1 shows, no external bus is connected to the CPU bus area 20, so it has few terminals, and the arrangement of this embodiment can be adopted without significant impact.
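The example widths and pitches translate directly into the fraction of a wiring layer taken up by power supply/GND lines. The sketch below assumes parallel stripes running in one direction on a single layer, so that coverage is simply width divided by pitch; the names `metal_coverage`, `CPU_AREA`, `PERIPHERAL_AREA`, and `COVERAGE_RATIO` are illustrative.

```python
from fractions import Fraction


def metal_coverage(width_um, pitch_um):
    """Fraction of the layer covered by parallel stripes (width / pitch)."""
    return Fraction(width_um) / Fraction(pitch_um)


# Widths and pitches in micrometres, taken from the example in the text.
CPU_AREA = metal_coverage("1", "4")            # CPU bus area 20
PERIPHERAL_AREA = metal_coverage("0.4", "100")  # internal peripheral bus area 21

# How many times more power/GND metal the CPU bus area carries per unit area.
COVERAGE_RATIO = CPU_AREA / PERIPHERAL_AREA
```

With these figures the CPU bus area 20 gives 25% of the layer to power supply/GND wiring and the internal peripheral bus area 21 gives 0.4%, a ratio of 62.5. If metal thickness and sheet resistance are assumed equal in both areas, the same factor would apply to the effective grid resistance per unit width, although the patent itself does not state that figure.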

In this embodiment, external bus 1 and external bus 2 are placed apart at the top and bottom of the chip, as in FIG. 2. The external bus control unit 18 and the DDR2 I/F 13 have high drive capability, so they consume a lot of power and readily cause power supply noise. In this arrangement, however, the large current sources, namely the external bus control unit 18, the DDR2 I/F 13, and the CPUs, are placed apart from one another, so power is not concentrated locally and heat generation is evened out. The external bus control unit 18, the DDR2 I/F 13, and the CPUs are also sensitive to noise and temperature changes, and placing them apart reduces the effect of each other's noise and heat.

Placing the high-power, noise-sensitive modules apart in this way reduces their mutual noise, so the design can assume a smaller noise margin. Overall power consumption is evened out and no local power concentration occurs, so the power wiring can be simplified; and with no local hot spots, the design can assume a smaller margin for temperature variation. As a result, a small-area, low-cost, low-power LSI can be realized while stable operation is guaranteed.

[Embodiment 3]
FIG. 5 is a layout example of the modules of Embodiment 1 placed on an actual silicon wafer. Compared with Embodiment 2, the positional relationship between the CPU bus control module and the peripheral modules, the position and size of the built-in SRAM 17, and the shapes of the CPU 19 and the secondary cache 12 are different.

Regarding the positional relationship between the CPU bus control module and the peripheral modules, a layout produced with an automatic routing tool often does not have straight buses in strictly partitioned areas as in FIG. 2; instead the buses are routed toward the CPU bus control module as in FIG. 5. Some overlap with the internal peripheral bus 14 then occurs, but nearly the same effects as in Embodiment 2 are obtained. In some cases it is also better to place the CPU bus control module near the centroid of the CPUs and the secondary cache 12, as in FIG. 5. For example, when the built-in SRAM 17 can be smaller than in Embodiment 2, is accessed infrequently, and has timing margin, changing the placement flexibly as in FIG. 5 reduces the total area and lowers cost.
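The centroid placement mentioned above can be sketched numerically. The module coordinates below are hypothetical values invented for illustration (the patent gives no dimensions), and `centroid` and `CONTROL_MODULE_POSITION` are names introduced here.

```python
def centroid(points):
    """Return the arithmetic mean position of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Hypothetical module centres in millimetres: eight CPUs in two columns
# with the secondary cache above them.
MODULES = {
    "CPU1": (1, 1), "CPU2": (1, 3), "CPU3": (1, 5), "CPU4": (1, 7),
    "CPU5": (5, 1), "CPU6": (5, 3), "CPU7": (5, 5), "CPU8": (5, 7),
    "L2 cache 12": (3, 9),
}

# Candidate location for the CPU bus control module.
CONTROL_MODULE_POSITION = centroid(list(MODULES.values()))
```

The mean position minimises the sum of squared distances to the connected modules, which is one simple reason a location near the centroid tends to keep the bus wiring short. A real placement tool would also account for blockages and routing congestion.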

For the bus wiring to the built-in SRAM 17, a buffer circuit 24 is placed at the point where it branches from the internal peripheral bus 14, as in FIG. 5. This prevents the speed degradation and power increase that a longer internal peripheral bus 14 would otherwise cause. Accesses to the built-in SRAM 17 have timing margin, so inserting the buffer circuit 24 poses no problem.

[Embodiment 4]
FIG. 6 is a configuration diagram showing a multiprocessor device according to Embodiment 4 of the present invention. The differences from Embodiment 1 are described below. The CPU bus 10 and the internal peripheral bus 14 are connected through a bus bridge circuit (Bus bridge 25). CPUs 1 to 8 can therefore access not only external bus 1, through the secondary cache 12 and the DDR2 I/F 13, but also the resources on the internal peripheral bus 14 through the Bus bridge 25. Consequently they can also access the other external bus, external bus 2, through the external bus control unit 18, although data transfer performance is low because the path is long and the frequency is low. Accesses to external bus 2 and the internal peripheral bus 14 through the Bus bridge 25 are not cached by the secondary cache 12. The modules connected to the internal peripheral bus 14 can likewise access external bus 2 through the external bus control unit 18, and can also access external bus 1 through the Bus bridge 25.
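The caching rule of Embodiment 4 follows from which path a CPU access takes. The sketch below models only the CPU-initiated direction, since the text does not spell out the detailed route from the peripheral side to external bus 1; the node labels and the functions `find_path` and `is_cached` are assumptions made for illustration.

```python
from collections import deque

# CPU-side view of FIG. 6 as described in the text. Only the direction from
# the CPU bus toward the peripheral bus is modelled in this sketch.
LINKS = {
    "CPU1-8": ["CPU bus 10"],
    "CPU bus 10": ["L2 cache 12", "Bus bridge 25"],
    "L2 cache 12": ["DDR2 I/F 13"],
    "DDR2 I/F 13": ["external bus 1"],
    "Bus bridge 25": ["internal peripheral bus 14"],
    "internal peripheral bus 14": ["external bus control unit 18"],
    "external bus control unit 18": ["external bus 2"],
}


def find_path(start, goal):
    """Breadth-first search; returns the list of nodes or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def is_cached(start, goal):
    """An access is cacheable here only if it passes through the L2 cache."""
    path = find_path(start, goal)
    return path is not None and "L2 cache 12" in path
```

In this model `is_cached("CPU1-8", "external bus 1")` is true, while accesses to the internal peripheral bus 14 or external bus 2 bypass the secondary cache 12 and are therefore not cached, matching the description.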

CPUs 1 to 8 share the same architecture. The contents of the primary and secondary caches are kept consistent by coherency control, so there is no risk of a CPU malfunctioning. Even when a multiprocessor-capable OS is used, high performance can be achieved, because the only components on the CPU bus 10 are the eight same-architecture CPUs, the secondary cache 12, and the Bus bridge 25, and because accesses to external bus 1 from the modules on the internal peripheral bus 14 are rare, so nearly all accesses come from CPUs 1 to 8.

The other configurations and effects are the same as in Embodiment 1, and their description is omitted.

[Embodiment 5]
FIG. 7 is a configuration diagram showing a multiprocessor device according to Embodiment 5 of the present invention. It differs from Embodiment 1 in that DSPs 41 and 42 are connected in place of the SIMD massively parallel processors 31 and 32. In this embodiment the secondary cache 12 serves as the bridge between the CPU bus 10 and the internal peripheral bus 14, but a dedicated Bus bridge 25 may be used instead, as in Embodiment 4. The other configurations and effects are the same as in Embodiment 1, and their description is omitted.

[Embodiment 6]
FIG. 8 is a timing chart showing the relationship between the CPU clock and the CPU bus clock (bus clock) in Embodiments 1 to 5. The case considered is one in which the CPU clock frequency is higher than the bus clock frequency. FIG. 8 shows examples in which the ratio of the CPU clock frequency to the bus clock frequency is 1:1, 2:1, 4:1, and 8:1. The n-divided clock (n = 1, 2, 4, 8) is the CPU clock divided according to the frequency ratio.

In the present invention, the bus clock of FIG. 8, rather than the n-divided clock, is used as the clock of the CPU bus 10 (see FIG. 1). FIG. 9 shows the clock supply circuit when the n-divided clock is used, and FIG. 10 shows the clock supply circuit when Sync. and the bus clock are used. Both the divided clock of FIG. 9 and the Sync. signal of FIG. 10 are generated in the CLKC. An LSI normally has only one CLKC, so for some CPUs the n-divided clock or Sync. must travel a long distance, and in practice buffers and the like are inserted along the way.

Comparing the n-divided clock with Sync., both switch the same number of times, but the n-divided clock must be aligned precisely in phase with the CPU clock whereas Sync. need not be. Sync. therefore requires no oversized buffers or otherwise wasteful delay-generation buffers, and a small-area, low-cost, low-power LSI can be realized.

As for the quality of the CPU bus 10 clock, the n-divided clock of FIG. 9 branches from the CPU clock at a distant point and passes through a frequency divider, whereas the bus clock of FIG. 10 branches from the CPU clock at a nearby point and passes through only an AND circuit. The phase difference (skew) from the CPU clock can therefore be made smaller in FIG. 10, allowing operation at a higher frequency. Transfers between the CPU and the CPU bus 10 also need few or no hold-guarantee buffers. A small-area, low-cost, low-power LSI can therefore be realized.
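The two clocking schemes can be compared with a small waveform model. This is a sketch under stated assumptions: time is sampled in half CPU-clock cycles, Sync. is assumed to be high for exactly one CPU cycle in every n and aligned to the CPU cycle boundary (the exact timing of FIG. 8 is not reproduced in the text), and the bus clock is formed as CPU clock AND Sync. as in FIG. 10. All function names are illustrative.

```python
def cpu_clock(cycles):
    """CPU clock sampled per half cycle: high in the first half of each cycle."""
    return [1 if t % 2 == 0 else 0 for t in range(2 * cycles)]


def divided_clock(cycles, n):
    """n-divided clock with 50% duty (FIG. 9 style), period of n CPU cycles."""
    return [1 if (t // n) % 2 == 0 else 0 for t in range(2 * cycles)]


def sync_signal(cycles, n):
    """Assumed Sync.: high during one CPU cycle out of every n."""
    return [1 if (t // 2) % n == 0 else 0 for t in range(2 * cycles)]


def gated_bus_clock(cycles, n):
    """Bus clock formed by ANDing the CPU clock with Sync. (FIG. 10 style)."""
    return [c & s for c, s in zip(cpu_clock(cycles), sync_signal(cycles, n))]


def rising_edges(wave):
    """Count 0 -> 1 transitions, treating the level before the wave as 0."""
    count, prev = 0, 0
    for value in wave:
        if prev == 0 and value == 1:
            count += 1
        prev = value
    return count


def toggles(wave):
    """Count level changes between adjacent samples."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a != b)
```

Over 16 CPU cycles, `gated_bus_clock` and `divided_clock` produce the same number of rising edges for n = 1, 2, 4, 8, so the bus sees the same clock rate either way. For n = 2, 4, 8, `sync_signal` and `divided_clock` also toggle equally often, consistent with the statement that both switch the same number of times. The model ignores skew and glitches; a real clock gate would typically latch the enable signal.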

This embodiment has described the relationship between the CPUs and the CPU bus clock, but the same applies to the SIMD type massively parallel processors and the internal peripheral bus 14, and to the CPU 19 and the internal peripheral bus 14.

[Embodiment 7]
FIG. 11 is a software block diagram of a system using Embodiments 1 to 6. Each processor has its own device driver (driver), with an OS in the layer above it. CPUs 1 to 8 are controlled by OS1, and the SIMD type massively parallel processors 31 and 32 and the CPU 19 are controlled by OS2. For example, OS1 may be a non-real-time OS such as Linux (registered trademark) and OS2 a real-time OS such as ITRON. OS1 is optimized for the CPU architecture, and the only components on the CPU bus 10 are the eight CPUs of the same architecture, the secondary cache 12, and the Bus bridge 25. Moreover, the modules connected to the internal peripheral bus 14 rarely access the external bus 1, so nearly all accesses to it come from CPUs 1 to 8, and high performance can be obtained. OS1 also performs coherency control over the contents of the primary and secondary cache memories to keep them consistent, so coherency problems can be handled optimally. The OS2 side, meanwhile, has its own external bus 2 independent of OS1, so almost no resource arbitration with OS1 is needed and high performance can be obtained.

FIG. 12 shows the configuration of FIG. 11 with a separate OS3 additionally provided for the CPU 19. In addition to the effects of FIG. 11, efficiency improves further because each OS handles processors of only a single architecture.
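The difference between the two software partitions can be written out as a small table and checked mechanically. The following Python sketch is illustrative only. The architecture labels are assumptions made for the example, since the description does not state whether the CPU 19 shares an architecture with CPUs 1 to 8, and the helper names are hypothetical.

```python
# Architecture labels are illustrative assumptions; the source does not state
# whether CPU 19 shares an architecture with CPUs 1 to 8.
ARCH = {f"CPU{i}": "main-cpu" for i in range(1, 9)}
ARCH.update({"SIMD31": "simd", "SIMD32": "simd", "CPU19": "cpu19"})

# OS-to-processor assignment as described for FIG. 11.
FIG11 = {
    "OS1": [f"CPU{i}" for i in range(1, 9)],
    "OS2": ["SIMD31", "SIMD32", "CPU19"],
}

# OS-to-processor assignment as described for FIG. 12 (OS3 added for CPU 19).
FIG12 = {
    "OS1": [f"CPU{i}" for i in range(1, 9)],
    "OS2": ["SIMD31", "SIMD32"],
    "OS3": ["CPU19"],
}

def architectures_per_os(partition):
    """Map each OS name to the set of architecture labels it has to manage."""
    return {os_name: {ARCH[p] for p in procs} for os_name, procs in partition.items()}

def is_single_architecture(partition):
    """True when every OS in the partition manages exactly one architecture."""
    return all(len(archs) == 1 for archs in architectures_per_os(partition).values())

if __name__ == "__main__":
    print("FIG. 11 single-architecture per OS:", is_single_architecture(FIG11))
    print("FIG. 12 single-architecture per OS:", is_single_architecture(FIG12))
```

Under these assumed labels, OS2 in the FIG. 11 partition manages two kinds of processor, while every OS in the FIG. 12 partition manages one, which matches the efficiency argument made for FIG. 12.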

1 to 8, 19: CPU; 10: CPU bus; 11: CPU bus control unit; 12: secondary cache; 13: DDR2 I/F; 14: internal peripheral bus; 15: peripheral circuit; 16: DMAC; 17: built-in SRAM; 18: external bus control unit; 20: CPU bus area; 21: internal peripheral bus area; 22: power supply/GND wiring; 23: power supply/GND terminal; 24: buffer; 25: Bus bridge; 31, 32: SIMD type massively parallel processor; 41, 42: DSP.

Claims

1. A multiprocessor device comprising, on a single semiconductor chip:
a plurality of first processors;
a plurality of second processors;
a first bus to which the plurality of first processors are connected;
a second bus to which the plurality of second processors are connected;
a first external bus I/F to which a first external bus can be connected; and
a second external bus I/F to which a second external bus can be connected,
wherein the first processors and the second processors are controlled by separate system clocks that differ in frequency or phase,
the plurality of first processors can access the first external bus, provided outside the semiconductor chip, via the first bus and the first external bus I/F without going through the second bus,
the plurality of first processors can access the second external bus via the first bus, the second bus, and the second external bus I/F, and
the plurality of second processors can access the second external bus, provided outside the semiconductor chip, via the second bus and the second external bus I/F without going through the first bus.

2. The multiprocessor device according to claim 1, wherein, among the first processors and the second processors, the wiring density of the power supply wiring is increased in the area of the processors with the higher clock frequency and reduced in the area of the processors with the lower clock frequency.

3. The multiprocessor device according to claim 1 or 2, wherein, among the first processors and the second processors, the number of power supply terminals is increased in the area of the processor group with the higher clock frequency and reduced in the area of the processor group with the lower clock frequency.

4. The multiprocessor device according to any one of claims 1 to 3, wherein, for whichever of the first processors and the second processors has the higher clock frequency, the CPU area and the external bus I/F are located apart from each other on the semiconductor chip.

5. The multiprocessor device according to claim 4, wherein a fast clock is supplied to the CPUs of the processors with the higher clock frequency, and the bus clock used for data processing with the external bus I/F is generated by gating part of that clock.

6. The multiprocessor device according to claim 1, further comprising a secondary cache coupled between the first bus and the second bus,
wherein the plurality of first processors can access the second external bus via the secondary cache.

7. The multiprocessor device according to claim 1, further comprising a bus bridge coupled between the first bus and the second bus,
wherein the plurality of first processors can access the second external bus via the bus bridge.

8. The multiprocessor device according to claim 7, further comprising a secondary cache coupled to the first bus and the first external bus I/F,
wherein the plurality of first processors can access the first external bus via the secondary cache.

9. The multiprocessor device according to claim 1, wherein the plurality of first processors have an SMP configuration, and the plurality of second processors also have an SMP configuration.