DE19910451C2

DE19910451C2 - multiprocessor

Info

Publication number: DE19910451C2
Application number: DE1999110451
Authority: DE
Inventors: Trong Son Dao; Petra Leber; Jens Leenstra
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1998-05-02
Filing date: 1999-03-10
Publication date: 2003-08-14
Anticipated expiration: 2019-03-11
Also published as: DE19910451A1

Description

The invention relates to a multiprocessor according to the preamble of claim 1.

Fig. 1 shows a simple superscalar processor 10 with a single instruction queue and a single, unified cache. The invention can also be applied to far more complex processor structures. Instructions are fetched from the cache array 12, which has a single port, and loaded into the instruction queue 14 via the result bus 16. The width of the data read from the cache array, and the width of the result bus, is much greater than a single 32-bit instruction, e.g. 256 bits. Thus, up to 4 instructions could be delivered per cache read cycle, which in this figure is a single machine cycle.

Three execution units receive ready instructions from the dispatch logic at a rate of up to one instruction per execution unit per machine cycle. These units are a branch unit 18, a fixed point execution unit with fixed point register file 20, and a floating point unit with floating point register file 22. Memory instructions that require the calculation of an effective address are sent to the fixed point unit, and the resulting address is then sent to the memory management unit (MMU) 24.

Likewise, the instruction addresses are generated in the instruction fetch unit 26. Both the MMU and the instruction fetch unit send their addresses to the cache tags 28 and thus to the array addressing logic. A hit in the tags results in the data from the array being placed on the result bus 16. A miss in the cache tags results in a memory request, which is placed in the memory queue 30 and transferred via the bus interface unit to the external address and data bus 32. Each of these units can operate simultaneously, which results in a superscalar pipelined processor with a peak execution rate of 3 instructions per cycle. If the floating point execution unit 22 has a latency of 2 cycles, the pipeline sequence would be: fetch, dispatch, decode, execute 1, execute 2, and writeback.

Fig. 2 shows this simple processor 10a extended by a saturating-arithmetic multimedia (SIMD) instruction execution unit 34 with its register file. This unit operates similarly to the floating point unit 22. In the execution sequence of ordinary programs, instructions for this unit do not normally occur in the immediate vicinity of floating point operations. Indeed, in the Intel™ MMX architecture the MMX registers are mapped into the floating point register space, and an explicit context switch is required to enable MMX instructions after floating point instructions have been executed, and vice versa.

Placing a multiprocessor on a single die would yield a chip similar to the one shown in Fig. 3 or Fig. 4. Obvious common elements, such as the interface 36 to the external bus, are shared. A conventional design would also have only one instance on the chip of the clock generation, the test processor, the boundary scan controller, and so on. Note that the tag logic and the MMU are assumed to have the proper data sharing protocol logic (e.g. MESI) for a multiprocessor. Extensive changes are possible when more than one processor is placed on a die, such as adding further cache levels or more external buses. Such a chip would be large and, because of the well-known multiprocessor effects, would not deliver twice the performance of a single processor. If the area could be reduced significantly, however, the number of instructions executed per second per unit of silicon area, i.e. cost/performance, would be equal to or better than that of a single processor. This could be achieved by sharing area-intensive units, e.g. the floating point unit, between the two processors.

Unfortunately, arbitration between two CPUs is already a complex problem. Particularly in the case of a floating point unit, where instructions can take a very long time, the second CPU may stall for a long time once the first CPU is being served by the floating point unit, even if the second CPU only has a few short instructions to execute. Conversely, if the second CPU, which issues short instructions, has priority, the first CPU may never be served and must stall for a long time.
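The stall described above can be illustrated with a minimal sketch, assuming a non-pipelined shared unit that serves requests strictly in arrival order. The function name `completion_times` and the latencies of 30 and 2 cycles are hypothetical values chosen for illustration; they are not taken from the patent.

```python
from typing import List, Tuple


def completion_times(requests: List[Tuple[int, int]]) -> List[int]:
    """Return the cycle at which each request finishes.

    `requests` is a list of (arrival_cycle, latency) pairs in arrival order.
    The shared unit is modeled as handling exactly one request at a time.
    """
    unit_free_at = 0
    finished = []
    for arrival, latency in requests:
        # A request starts when it has arrived AND the unit is free.
        start = max(arrival, unit_free_at)
        unit_free_at = start + latency
        finished.append(unit_free_at)
    return finished


# Hypothetical workload: CPU 1 issues a long operation at cycle 0,
# CPU 2 issues a short operation one cycle later.
long_op = (0, 30)
short_op = (1, 2)
print(completion_times([long_op, short_op]))
```

In this toy model the short operation cannot start before cycle 30, so the second CPU waits 29 cycles for a 2-cycle instruction.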

Designing the interface that makes one CPU or the other stall is likewise a complex task and very prone to errors.

Arbitration among processors using complex schemes sometimes leads to deadlock situations, which tend to be detected too late, e.g. at the time of shipping or during hardware debugging. Fixing such errors is time-consuming and expensive.

The performance of a multiprocessor system containing two or more CPUs can degrade considerably when a floating-point-intensive application (one that uses many short, fast instructions) runs at the same time as another application that uses some long-running floating point instructions. This queuing effect is also observed in conventional multiprocessor systems that share a memory bus or a memory system, e.g. the same L2 cache.

Although sharing a memory system can already be difficult in known multiprocessor systems, sharing a processing unit, e.g. a floating point, multimedia, or data compression unit, is a far more complex task. These units have their own register files, status registers, program counters, and so on, and all of them have serialization effects. Sharing therefore requires at least one separate set of registers with a dedicated assignment for each CPU. From an implementation point of view there are many problems, e.g. area and timing, with an additional need for data multiplexers and synchronization control logic.

A multiprocessor known from US 3,980,992 A has four identical processors that share one execution unit. The execution unit shared by all four processors is, in particular, the so-called ALU (arithmetic logic unit), which performs arithmetic operations on data transmitted to it by the four processors. In a first clock cycle the ALU processes data of the first processor, in a second clock cycle data of the second processor, in a third clock cycle data of the third processor, in a fourth clock cycle data of the fourth processor, in a fifth clock cycle data of the first processor again, and so on.

A disadvantage of this known multiprocessor is that the data of each processor are processed only in every fourth clock cycle. Because four processors share a single ALU, the processing speed of each processor drops to 25% or less of that of a processor with its own ALU.
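The 25% figure follows directly from the fixed rotation among four processors. The short sketch below is only an illustration of that arithmetic; the function names `alu_owner` and `alu_share` are hypothetical, and clock cycles and processors are numbered from 1 as in the text.

```python
def alu_owner(cycle: int, num_processors: int = 4) -> int:
    """Processor (1-based) whose data the shared ALU handles in a 1-based clock cycle."""
    return (cycle - 1) % num_processors + 1


def alu_share(processor: int, num_processors: int = 4, cycles: int = 1000) -> float:
    """Fraction of the simulated clock cycles in which `processor` gets the ALU."""
    served = sum(
        1 for cycle in range(1, cycles + 1)
        if alu_owner(cycle, num_processors) == processor
    )
    return served / cycles


# Cycles 1..5 go to processors 1, 2, 3, 4 and then 1 again.
print([alu_owner(c) for c in range(1, 6)])
print(alu_share(processor=1))
```

Each of the four processors receives one cycle in four, which is an upper bound on its ALU throughput relative to a processor with a dedicated ALU.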

The object of the invention is to provide a multiprocessor with a space-saving and therefore inexpensive execution unit that can nevertheless be shared advantageously by the processors of the multiprocessor.

This unit can be a floating point unit and/or a data compression unit, or another unit whose chip area requirement is unfavorable in relation to how often the processors use it.

This object is achieved by a multiprocessor having the features of claim 1. Advantageous embodiments of the invention are specified in the dependent claims.

According to the invention, the efficiency of known multiprocessors is improved by stripping out execution units that the multiprocessor needs less often and sharing them among several symmetric microprocessors of the multiprocessor. As a result, each CPU can occupy a smaller area; the symmetric structure of the processors in the multiprocessor, and the associated advantage of a simpler software structure for multiprocessors, are nevertheless retained.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the present invention, and for further details and advantages thereof, reference is now made to the following detailed description in conjunction with the accompanying drawings, in which:

Fig. 1 shows a schematic of a typical microprocessor;

Fig. 2 shows a typical microprocessor with a multimedia instruction unit;

Fig. 3 shows two microprocessors on a single chip, each with its own FPU and FXU;

Fig. 4 shows two microprocessors on a single chip, each with its own FPU, FXU and MMXU;

Fig. 5 shows two microprocessors sharing a single FPU according to the present invention; and

Fig. 6 shows the FPU depicted in Fig. 5, with its pipeline feedback scheme shown in more detail.

DETAILED DESCRIPTION OF THE INVENTION

Figs. 5 and 6 show an embodiment of the present invention. It is known that in most programs floating point instructions occur in bursts separated by fixed point and control instructions. The more often operating system code is executed, the less frequently floating point instructions are executed. Based on this observation, the area of the chip 10 shown in Fig. 3 can be reduced considerably by restructuring it as shown schematically in Fig. 5.

The schematically illustrated multiprocessor 100 shown in Fig. 5 contains a first processor 101, a second processor 102, a multiplexer 105, a floating point unit 106 and a demultiplexer 107. The first processor 101 is connected to the input of the multiplexer 105 via a first data bus 103, and the second processor 102 is connected to the input of the multiplexer 105 via a second data bus 104. The output of the multiplexer 105 is connected to the input of the floating point unit 106, and the output of the floating point unit 106 is coupled to the input of the demultiplexer 107. The demultiplexer 107 has two outputs, one of which is connected to the first processor 101 via a data bus 107a and the other of which is coupled to the second processor 102 via a data bus 107b.

The floating point unit 106 contains several filter stages (pipeline stages), indicated by a first stage 106_1, a second stage 106_2, a third stage 106_3 and, if necessary, further stages. The output of the multiplexer 105 is connected via a data bus 106a to the input of the first filter stage 106_1, the output of the first filter stage 106_1 is connected via a data bus 106b to the input of the second filter stage 106_2, the output of the second stage is connected via a data bus 106c to the input of the third stage 106_3, and so on. The output of the final stage 106_n of the floating point unit 106 is connected via a data bus 109 to the input of the demultiplexer 107.

As can be seen from Fig. 5, the floating point unit is removed from each of the processors 101 and 102, and only a local register file (not shown) remains in each processor. The single floating point unit is placed at an optimal physical location between the two processors in order to balance the propagation time of the electrical signals from each processor to the floating point unit 106.

In every odd clock cycle, the first processor 101 transmits data to be processed by the floating point unit 106 via the multiplexer 105. In every even clock cycle, the second processor 102 transmits its data to be processed by the floating point unit 106 via the multiplexer 105. Once the data have been completely processed by the floating point unit 106, the processed data are transmitted to the demultiplexer 107. In every odd clock cycle, data of the first processor that have been completely processed by the floating point unit 106 are transmitted back to the first processor 101 via the demultiplexer 107 and the data bus 107a. In every even clock cycle, data of the second processor 102 that have been completely processed by the floating point unit 106 are transmitted from the output of the floating point unit 106 back to the second processor 102 via the demultiplexer 107 and the data bus 107b.

According to the invention, each processor (CPU) can access the floating point unit 106 every two clock cycles, and no processor has to stall. Since floating point instructions occur far less frequently than fixed point instructions in most programs, the two CPUs can run at their normal performance most of the time; only when floating point instructions have to be processed does the bottleneck of a single floating point unit serving two CPUs arise. The proposal of the invention therefore permits a significant reduction of the cost/performance ratio of such a single-chip multiprocessor.
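The odd/even time slots can be pictured with a small cycle-by-cycle model. This is a minimal sketch, not the patented circuit: it assumes each operation advances one stage per clock cycle with no feedback, and the function name `simulate_shared_fpu` is hypothetical. Under these assumptions the demultiplexer routes every result to the processor that issued it only when the number of stages is even, which is an inference of this model rather than a statement from the text.

```python
from collections import deque
from typing import List, Tuple


def simulate_shared_fpu(num_stages: int, num_cycles: int) -> List[Tuple[int, int, int]]:
    """Return (cycle, issuing_cpu, receiving_cpu) for every result leaving the FPU.

    Odd clock cycles belong to CPU 1 and even clock cycles to CPU 2, both at
    the multiplexer (input side) and at the demultiplexer (output side).
    """
    # Index 0 is the first stage, index -1 the final stage; None = empty stage.
    stages = deque([None] * num_stages)
    results = []
    for cycle in range(1, num_cycles + 1):
        slot_cpu = 1 if cycle % 2 == 1 else 2
        # Demultiplexer side: whatever leaves the final stage is routed by parity.
        leaving = stages.pop()
        if leaving is not None:
            results.append((cycle, leaving, slot_cpu))
        # Multiplexer side: the CPU that owns this cycle feeds the first stage.
        stages.appendleft(slot_cpu)
    return results


for cycle, issued_by, routed_to in simulate_shared_fpu(num_stages=4, num_cycles=8):
    print(f"cycle {cycle}: result of CPU {issued_by} routed to CPU {routed_to}")
```

No identification bits travel with the data in this model; the clock parity alone determines ownership at both ends of the unit.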

Fig. 6 shows the floating point unit 106 of Fig. 5 in more detail. According to the invention, the resulting data of every second filter stage are fed back to the input of the first filter stage 106_1 of the floating point unit 106, and all internal feedback of data from each processor takes place after every two clock cycles via the data buses 108a, 108b and 108c.

During a first clock cycle, the floating point data from the processor 101 enter the first filter stage 106_1. During a second clock cycle, the data processed by the first stage are passed to the next filter stage 106_2, while the floating point data from the second processor 102 enter the first filter stage 106_1.

During a third clock cycle, the floating point data of the first processor 101 that have been processed by the second filter stage 106_2 are fed back via the data bus 108a to the input of the first filter stage 106_1 and are also sent to the third filter stage 106_3 for further processing. During the third clock cycle, the multiplexer 105 again allows the first processor 101 to access the floating point unit 106, and so on.

During a fourth clock cycle, the floating point data of the first processor 101 that have been processed by the third stage 106_3 are passed to the fourth filter stage 106_4, while the floating point data of the second processor 102 that have already been processed by the first and second filter stages 106_1 and 106_2 are sent to the input of the first filter stage 106_1 and to the input of the third filter stage 106_3 for further processing. During the fourth clock cycle, the second processor is again allowed to access the floating point unit 106 via the multiplexer 105.

During a fifth clock cycle, the floating point data of the first processor 101 are sent to the fifth filter stage 106_5 and are fed back from the output of the fourth filter stage 106_4 to the input of the first filter stage 106_1 via the data bus 108b.

During a sixth clock cycle, the floating point data of the second processor 102 that have been processed by the fourth filter stage 106_4 are fed back to the input of the first filter stage 106_1 via the data bus 108b and are sent to the next filter stage. The same applies to the further filter stages and feedback loops of the floating point unit in subsequent clock cycles. The lines marked with an "X" indicate that feedback of data across an odd number of stages is prohibited. This ensures that the data of the first processor 101 and of the second processor 102 are sent to different filter stages during every clock cycle, so that the data of the different processors remain strictly separated.
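Why only feedback across an even number of stages preserves the separation can be checked with a short parity argument. The sketch below is an illustrative model under stated assumptions, not the circuit of Fig. 6: data advance one stage per clock cycle, data leaving stage k in cycle t re-enter the first stage in cycle t + 1, and the helper names `slot_owner` and `feedback_keeps_separation` are hypothetical.

```python
def slot_owner(cycle: int) -> int:
    """CPU that owns a 1-based clock cycle: CPU 1 on odd cycles, CPU 2 on even cycles."""
    return 1 if cycle % 2 == 1 else 2


def feedback_keeps_separation(stage: int, last_cycle: int = 100) -> bool:
    """Check whether feeding the output of `stage` back to stage 1 respects the slots.

    Data sitting in `stage` during cycle t entered stage 1 in cycle t - (stage - 1),
    so they belong to the owner of that earlier cycle. Fed back, they re-enter
    stage 1 in cycle t + 1, which must be a slot of the same CPU.
    """
    for t in range(stage, last_cycle + 1):
        data_owner = slot_owner(t - (stage - 1))
        reentry_owner = slot_owner(t + 1)
        if data_owner != reentry_owner:
            return False
    return True


for stage in range(1, 7):
    verdict = "allowed" if feedback_keeps_separation(stage) else "prohibited (X)"
    print(f"feedback from stage {stage}: {verdict}")
```

In this model the feedback from stage 2 in the third clock cycle and from stage 4 in the fifth clock cycle both land in slots of the first processor, consistent with the walk-through above.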

Because the floating point execution of data from the first processor 101 and of data from the second processor 102 is strictly separated into odd and even clock cycles, and the filter feedback is designed accordingly, there is no need for special synchronization or for additional control bits to identify the data of each CPU. The floating point unit 106 of the invention can thus be regarded as an unintelligent element, like a memory or a cache, that can simply be shared by two CPUs.

Clearly, the concept of the invention can also be used for a multiprocessor containing more than two processors, provided the floating point unit and the described feedback loops of the floating point unit are modified in accordance with the invention to maintain the separation of each CPU's data. The same holds if a different number of clock cycles is assigned to each CPU or processor. The concept of the invention makes it possible to establish a simple connection that serves both CPUs on one chip without complex data mapping. The chip area required for a multiprocessor can be reduced significantly without a separate, dedicated set of registers. The design of the invention permits low cycle times without multiplexing of data being necessary. That is, the filter does not have to be empty before the instructions of the other CPU can be started. Out-of-order processing can be implemented easily, and there is no interlock that could cause blocking problems.
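One possible way to picture the extension to more than two processors is a round-robin assignment of input slots. The sketch below is an assumption made for illustration only; the description does not specify how the feedback loops would be modified, and the names slot_owner, feedback_keeps_separation and allowed_feedback_stages are hypothetical. Each stage is again assumed to take one clock cycle.

```python
def slot_owner(cycle, num_cpus):
    """0-based index of the CPU that owns the stage-1 input slot in a
    1-based clock cycle, assuming a simple round-robin assignment."""
    return (cycle - 1) % num_cpus


def feedback_keeps_separation(from_stage, num_cpus, cycles_to_check=32):
    """True if data leaving `from_stage` always re-enters stage 1 in a slot
    owned by the CPU that originally issued it."""
    for cycle in range(from_stage, from_stage + cycles_to_check):
        issue_cycle = cycle - (from_stage - 1)  # when the data entered stage 1
        if slot_owner(issue_cycle, num_cpus) != slot_owner(cycle + 1, num_cpus):
            return False
    return True


def allowed_feedback_stages(num_stages, num_cpus):
    """Stages whose output may be fed back to stage 1 in this model."""
    return [
        stage
        for stage in range(1, num_stages + 1)
        if feedback_keeps_separation(stage, num_cpus)
    ]


if __name__ == "__main__":
    print(allowed_feedback_stages(6, 2))
    print(allowed_feedback_stages(6, 3))
```

Under these assumptions, feedback preserves the separation only from stages whose number is a multiple of the number of CPUs, which reduces to the even-stage rule in the two-processor case.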

Claims

1. A multiprocessor (100) with at least:

a) a first processor (101);
b) a second processor (102);
c) a shared execution unit (106) connected to both the first and second processors; and
d) means (105) for controlling the transfer of first data from the first processor (101) to the shared execution unit (106) for execution in a first clock cycle and for controlling the transfer of second data from the second processor (102) to the shared execution unit (106) for execution in a second clock cycle,

characterized in
that the shared execution unit (106) has at least two successive execution stages (106_1, ..., 106_n), each of which has an input and an output data bus (106a, ..., 106n), and the output data bus (106b) of a preceding execution stage (106_1) is coupled to the input data bus (106b) of the subsequent execution stage (106_2), wherein one execution stage (106_2) processes the data of the first processor (101) and the other execution stage (106_1) processes the data of the second processor (102) in the same clock cycle.

2. The multiprocessor according to claim 1, wherein the data from the first processor (101) is transmitted to the shared execution unit (106) on all odd clock cycles and the data from the second processor (102) is transmitted to the shared execution unit (106) on all even clock cycles.

3. The multiprocessor according to claim 1, wherein the data from the first processor (101) is processed by the shared execution unit (106) in a first number of clock cycles and the data from the second processor (102) is processed by the shared execution unit (106) in a second number of clock cycles.

4. The multiprocessor according to claim 3, wherein the first number of clock cycles differs from the second number of clock cycles.

5. The multiprocessor according to claim 1, wherein the register file is kept in each processor (101, 102) and the control means (105) sends only the first and second data that are to be processed by the shared execution unit (106).

6. The multiprocessor according to claim 1, wherein the data that have been processed by the shared execution unit (106) are transmitted via a demultiplexer (107) to the assigned first or second processor (101, 102).

7. The multiprocessor according to claim 1, wherein the output data bus of every second execution stage (106_2, 106_4, 106_6) is coupled to the input data bus of the first execution stage (106_1).

8. The multiprocessor according to claim 7, wherein the output data of every second execution stage (106_2, 106_4, ..., 106_n) are transmitted to the input of the first execution stage (106_1, 106_3, ..., 106_n-1) after every two clock cycles or after any multiple of two clock cycles.

9. The multiprocessor according to claim 1, wherein the execution stages are filter stages (106_1, ..., 106_n).

10. The multiprocessor according to claim 1, wherein the shared execution unit (106) is located between the first and second processors (101, 102).

11. The multiprocessor according to claim 1, wherein the shared execution unit is a floating point unit (106).

12. The multiprocessor according to claim 1, wherein the shared execution unit is a data compression unit.

13. The multiprocessor according to claim 1, wherein the shared execution unit is a multimedia execution unit.

14. The multiprocessor according to claim 1, wherein the first and second processors (101, 102) each contain a local register file (20, 22).

15. The multiprocessor according to claim 1, wherein the control means is a multiplexer (105).

16. The multiprocessor according to claim 1, further including means (107) for controlling the transfer of a result from the shared execution unit (106) to either the first or the second processor (101, 102).

17. The multiprocessor according to claim 16, wherein the control means is a demultiplexer (107).

18. The multiprocessor according to claim 14, wherein the first and second processors (101, 102) each keep a copy of their floating point, multimedia and/or data compression registers (20, 22).