CN1229953A - Cache coherency protocol with global and local tagged states
- Publication number
- CN1229953A (Application CN98125993A)
- Authority
- CN
- China
- Prior art keywords
- cache line
- cache
- state
- modification value
- coherence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Memory System Of A Hierarchy Structure (AREA)
Abstract
A cache coherency protocol uses a 'Tagged' coherency state to track responsibility for writing a modified value back to system memory, allowing intervention of the value without immediately writing it back to system memory, thus increasing memory bandwidth. The Tagged state can migrate across the caches (horizontally) when assigned to a cache line that has most recently loaded the modified value. Historical states relating to the Tagged state may further be used. The invention may also be applied to a multiprocessor computer system having clustered processing units, such that the Tagged state can be applied to one of the cache lines in each group of caches that support separate processing unit clusters. Priorities are assigned to different cache states, including the Tagged state, for responding to a request to access a corresponding memory block.
Description
The present invention relates generally to computer systems, and more specifically to a cache coherency protocol which provides a new coherency state for modified data, allowing improved cache intervention operations by not requiring the intervened data to be written back to system memory.
The basic structure of a conventional multiprocessor computer system 10 is shown in Fig. 1. Computer system 10 has several processing units, two of which, 12a and 12b, are depicted. The processing units are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, graphical pointer (mouse), and a permanent storage device or hard disk), memory device 16 (such as random access memory, or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent storage device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20, or direct-memory-access (DMA) channels (not shown). Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Other components might further be used in conjunction with those shown in the block diagram of Fig. 1; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. The computer can also have more than two processing units.
In a symmetric multiprocessor (SMP) computer, all of the processing units are generally identical; that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture is shown in Fig. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC™ processor marketed by International Business Machines Corporation. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data and instructions between the processor core and the cache memory.

A processing unit can include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 30. Although Fig. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels (L3, L4, etc.) of serially connected caches. If a block is present in the L1 cache of a given processing unit, it is also present in the L2 and L3 caches of that processing unit. This property is known as inclusion. It is assumed hereinafter that the principle of inclusion applies to the caches relevant to the present invention.
In an SMP computer, it is important to provide a coherent memory system; that is, to cause write operations to each individual memory location to be serialized in some order for all processors. For example, assume a location in memory is modified by a sequence of write operations to take on the values 1, 2, 3, 4. In a cache-coherent system, all processors will observe the writes to the given location to take place in the order shown. It is possible, however, for a processing element to miss a write to the memory location: a given processing element reading the memory location could see the sequence 1, 3, 4, missing the update to the value 2. A system that implements these properties is said to be "coherent". Virtually all coherency protocols operate only at the granularity of the size of a cache block; that is, the coherency protocol controls the movement of, and write permissions for, data on a cache-block basis, and not separately for each individual memory location (hereinafter, the term "data" is used to refer to either a memory value that is a numerical value used by a program, or a value that corresponds to a program instruction).
There are a number of protocols and techniques for achieving cache coherence that are known to those skilled in the art. All of these mechanisms for maintaining coherency require that the protocols allow only one processor to have "permission" to write to a given memory location (cache block) at any given point in time. As a consequence of this requirement, whenever a processing element attempts to write to a memory location, it must first inform all other processing elements of its desire to write the location, and receive permission from all other processing elements to carry out the write.
To implement cache coherency in a system, the processors communicate over a common generalized interconnect (e.g., bus 20). The processors pass messages over the interconnect indicating their desire to read or write memory locations. When an operation is placed on the interconnect, all of the other processors "snoop" (monitor) this operation and decide whether the state of their caches can allow the requested operation to proceed and, if so, under what conditions. There are several bus transactions that require snooping and follow-up action to honor the bus transactions and maintain memory coherency. A snoop operation is triggered by the receipt of a qualified snoop request, generated by the assertion of certain bus signals. Instruction processing is interrupted only when a snoop hit occurs and the snoop state machine determines that an additional cache snoop is required to resolve the coherency of the offended sector.
This communication is necessary because, in systems with caches, the most recent valid copy of a given block of memory may have moved from the system memory 16 to one or more of the caches in the system (as mentioned above). If a processor (say 12a) attempts to access a memory location not present within its cache hierarchy, the correct version of the block, which contains the actual (current) value for the memory location, may be either in the system memory 16 or in one or more of the caches of another processing unit, e.g., processing unit 12b. If the correct version is in one or more of the other caches in the system, it is necessary to obtain the correct value from the cache(s) in the system instead of from system memory.
For example, consider a processor, say 12a, attempting to read a location in memory. It first polls its own L1 cache (24 or 26). If the block is not present in the L1 cache, the request is forwarded to the L2 cache (30). If the block is not present in the L2 cache, the request is forwarded on to lower cache levels, e.g., the L3 cache. If the block is not present in the lower-level caches, the request is then presented on the generalized interconnect (20) to be serviced. Once an operation has been placed on the generalized interconnect, all of the other processing units snoop the operation and determine if the block is present in their caches. If a given processing unit has the requested block in its L1 cache, and the value in that block is modified, then by the principle of inclusion the L2 cache and any lower-level caches also have copies of the block (however, their copies are stale, since the copy in the processor's cache is modified). Therefore, when the lowest-level cache of the processing unit (e.g., L3) snoops the read operation, it will determine that the requested block is present and modified in a higher-level cache. When this occurs, the L3 cache places a message on the generalized interconnect informing the processing unit that it must "retry" its operation again at a later time, because the actual value of the memory location is in the L1 cache at the top of the memory hierarchy, and must be retrieved to make it available to service the read request of the initiating processing unit.
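By way of illustration only, the polling-and-snoop flow just described can be sketched as follows. This is a minimal model, assuming a bus that answers "retry" whenever any snooper holds a modified copy; all class, method and variable names here are invented for the sketch and do not appear in the patent.

```python
class CacheLevel:
    def __init__(self, name):
        self.name = name
        self.lines = {}                    # address -> value

    def lookup(self, addr):
        return self.lines.get(addr)

class Interconnect:
    """Stand-in for the generalized interconnect (bus 20)."""
    def __init__(self, memory, snoopers):
        self.memory = memory               # address -> value
        self.snoopers = snoopers           # callables: addr -> True if a modified copy is held

    def broadcast_read(self, addr):
        if any(snoop(addr) for snoop in self.snoopers):
            return "retry"                 # value must first be pushed down the hierarchy
        return self.memory.get(addr)

def read(addr, levels, bus):
    for level in levels:                   # poll L1, then L2, then L3
        value = level.lookup(addr)
        if value is not None:
            return value
    return bus.broadcast_read(addr)        # miss everywhere: present on the bus

# Example: another unit holds address 0x40 modified, so the read is retried.
l1, l2, l3 = CacheLevel("L1"), CacheLevel("L2"), CacheLevel("L3")
bus = Interconnect({0x40: 7}, [lambda addr: addr == 0x40])
assert read(0x40, [l1, l2, l3], bus) == "retry"
```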
Once the request from the initiating processing unit has been retried, the L3 cache begins a process to retrieve the modified value from the L1 cache and make it available at the L3 cache, main memory, or both, depending on the exact implementation details. To retrieve the block from the higher-level caches, the L3 cache sends messages through the inter-cache connections to the higher-level caches, requesting that the block be retrieved. These messages propagate up the processing unit hierarchy until they reach the L1 cache, and cause the block to be moved down the hierarchy to the lowest level (L3 or main memory) so that the request from the initiating processing unit can be serviced.
The initiating processing unit eventually re-presents the read request on the generalized interconnect. At this point, however, the modified value has been retrieved from the L1 cache of the owning processing unit and placed in system memory, so the read request from the initiating processor will be satisfied. The scenario just described is commonly referred to as a "snoop push": a read request is snooped on the generalized interconnect, which causes the processing unit to "push" the block to the bottom of the hierarchy to satisfy the read request made by the initiating processing unit.
The key point is that, when a processor wishes to read or write a block, it must communicate that desire with the other processing units in the system in order to maintain cache coherency. To achieve this, the cache coherency protocol associates, with each block in each level of the cache hierarchy, a status indicator indicating the current "state" of the block. The state information is used to allow certain optimizations in the coherency protocol that reduce message traffic on the generalized interconnect and the inter-cache connections. As one example of this mechanism, when a processing unit executes a read, it receives a message indicating whether or not the read must be retried later. If the read operation is not retried, the message usually also includes information allowing the processing unit to determine if any other processing unit still has an active copy of the block (this is accomplished by having the other lowest-level caches give a "shared" or "not shared" indication for any read they do not retry). In this manner, a processing unit can determine whether any other processor in the system has a copy of the block. If no other processing unit has an active copy of the block, the reading processing unit marks the state of the block as "exclusive". If a block is marked exclusive, it is permissible to allow the processing unit to later write the block without first communicating with the other processing units in the system, because no other processing unit has a copy of the block. It is therefore possible for a processor to read or write a location without first communicating this intention onto the interconnect, but only where the coherency protocol has ensured that no other processor has an interest in the block.
The foregoing cache coherency technique is implemented in a specific protocol referred to as "MESI", illustrated in Fig. 2. In this protocol, a cache block can be in one of four states: "M" (Modified), "E" (Exclusive), "S" (Shared), or "I" (Invalid). Under the MESI protocol, each cache entry (e.g., a 32-byte sector) has two additional bits which indicate which of the four possible states the entry is in. Depending upon the initial state of the entry and the type of access sought by the requesting processor, the state may be changed, and a particular state is set for the entry in the requesting processor's cache. For example, when a sector is in the Modified state, the addressed sector is valid only in the cache having the modified sector, and the modified value has not been written back to system memory. When a sector is Exclusive, it is present only in the noted sector, and is consistent with system memory. If a sector is Shared, it is valid in that cache and in at least one other cache, all of the shared sectors being consistent with system memory. Finally, when a sector is Invalid, the addressed sector is not resident in the cache. As seen in Fig. 2, if a sector is in any of the Modified, Shared or Invalid states, it can move between those states depending upon the particular bus transaction. While a sector in the Exclusive state can move to any other state, a sector can only become Exclusive if it is first Invalid.
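For illustration, the four states and the two common snoop-induced transitions can be modeled as below. This is a generic textbook rendering of MESI, not a transcription of the exact transitions of Fig. 2:

```python
from enum import Enum

class MESI(Enum):
    MODIFIED = "M"    # only valid copy, not yet written back to memory
    EXCLUSIVE = "E"   # only cached copy, consistent with system memory
    SHARED = "S"      # valid here and in at least one other cache
    INVALID = "I"     # the addressed sector is not resident in this cache

def on_snooped_read(state):
    """Snooper's new state when another processor's read is observed."""
    if state in (MESI.MODIFIED, MESI.EXCLUSIVE):
        return MESI.SHARED            # the value is sourced, then shared
    return state

def on_snooped_write(state):
    """Snooper's new state when another processor gains write permission."""
    return MESI.INVALID if state is not MESI.INVALID else state

assert on_snooped_read(MESI.MODIFIED) is MESI.SHARED
assert on_snooped_write(MESI.SHARED) is MESI.INVALID
```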
Access to cache blocks can further be improved using the cache coherency protocol. An improvement known as "intervention" allows a cache having control over a memory block to provide the data in that block directly to another cache requesting the value (for a read-type operation); in other words, it bypasses the need to write the data to system memory and then have the requesting processor read it back out of memory. Intervention can only be performed by a cache having the value in a block whose state is Modified or Exclusive. In both of these states, only one cache block has a valid copy of the value, so it is a simple matter to source (write) the value over the bus 20 without the necessity of first writing it to system memory. The intervention procedure thus speeds up processing by avoiding the longer process of writing to and reading from system memory (which actually involves three bus operations and two memory operations). This procedure not only results in better latency, but also in increased usable bus bandwidth.
As part of the intervention procedure, the memory controller of the system memory also picks up the intervention response from the cache line having the Modified state, so that the memory controller knows to read the modified data as it is provided to the other processor in a parallel fashion. At the end of the procedure, the modified data will have been copied to system memory, the cache holding the data in the Modified state switches to the Shared state, and the cache block of the other processor also switches from the Invalid state to the Shared state. In conventional cache coherency protocols, the modified data must be written to system memory upon the intervention operation; while this may be performed in a parallel fashion to speed up handling, it may be unnecessary in many cases. For example, if the modified cache block in a first processor is used to source data to a cache of a second processor, and the second processor is likely to further modify the data, another write to system memory will ultimately be required anyway. If no other processor needs the data (memory block) in the time between the sourcing of the data by the first processor and the further modification of the data by the second processor, then the first write to system memory (as part of the intervention procedure) is unnecessary.
One scheme that avoids the unnecessary write to system memory forces the second processor to hold the sourced data in the Modified state even if that processor only wants to read the data. In this manner, the second processor becomes responsible for writing the data back to system memory in the future. The major problem with this implementation, however, is that the sourcing (first) processor must switch its cache line to the Invalid state, prohibiting sharing of the data; only one processor at a time can ever read the data, and the data must repeatedly move around, creating unnecessary bus traffic between the processors. It would therefore be desirable to devise a method of maintaining cache coherency that allowed for efficient intervention of data, but avoided unnecessary write operations to system memory. It would be further advantageous if the method allowed modified data to migrate among caches without affecting memory, while still enabling sharing of the data.
It is therefore one object of the present invention to provide an improved method of maintaining cache coherency in a multiprocessor computer system.

It is another object of the present invention to provide such a method that allows for cache intervention while avoiding unnecessary write operations to system memory.

It is yet another object of the present invention to provide such a method that allows intervention of modified data while also allowing several different caches to hold the data in a shared fashion.
The foregoing objects are achieved in a method of maintaining cache coherency in a multiprocessor computer system, in which a "Tagged" coherency state is used to indicate that a particular cache line contains a modified value (that is, a value that is not consistent with the corresponding memory block in the system memory device) and that this cache line is currently responsible for ensuring that the modified value is eventually written back at least to the system memory device (or elsewhere in the memory hierarchy, e.g., by intervention). All other cache lines containing a copy of the modified value (lines supporting other processing units in the system) are assigned a second coherency state (Shared), which also indicates that the lines contain the modified value, but that the lines are not responsible for ensuring that system memory is actually updated with the modified value. The Tagged state can migrate (horizontally) across the caches, being assigned to the cache line that has most recently loaded the latest modified value. A historical coherency state may further be used to indicate that a particular cache line contained the modified value and sourced it recently, so that, e.g., when the cache line holding the current Tagged state is evicted as a result of a least-recently-used (LRU) algorithm, the cache line having the "secondary" historical state can be converted to the Tagged state, relieving the "primary" tagged cache line of the write-back responsibility.
The invention may also be applied to a multiprocessor computer system having a plurality of processing units grouped into clusters, with a plurality of caches supporting a given cluster, such that the Tagged coherency state can be applied to one of the cache lines in each group of caches that support separate processing unit clusters. The Tagged state can also be implemented at lower levels in such a clustered system.
In a preferred embodiment, each coherency state used by the present invention has an associated priority, so that when a request to access a block is made, only the response having the highest priority is forwarded to the requesting processing unit. A crossbar can be used so that any tagged-intervention response is forwarded only to selected caches that could be affected by the intervention response.
The tagged protocol can be combined with other cache protocols, e.g., one including a "Recent" state, which indicates that a cache contains a copy of a value that was most recently accessed and thereby allows shared intervention. The "T" state can further be used to assign collision priority for claiming a memory block, which overrides conflicting requests (DClaim operations) from other caches. The three functions of (i) sourcing data by intervention, (ii) tracking responsibility for writing modified data to the memory hierarchy, and (iii) providing DClaim collision priority, need not be combined in a single Tagged state, and more complex implementations of the invention can carry out these functions separately.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
Fig. 1 is a block diagram of a prior-art multiprocessor computer system;

Fig. 2 is a state diagram depicting a prior-art cache coherency protocol (MESI);

Fig. 3 is a state diagram depicting the cache coherency protocol of the present invention, providing a tagged state for modified data to allow intervention without requiring the data to be written to system memory;

Fig. 4 is a block diagram of a multiprocessor computer system having a multi-level cache architecture, adapted to utilize the tagged coherency state of the present invention on both a global scale and a local (CPU cluster) scale; and

Fig. 5 is a state diagram depicting the cache coherency protocol of the present invention combined with a coherency protocol that allows identification of the cache line that has most recently read a value.
The present invention is directed to a method of maintaining cache coherency in a multiprocessor system, such as the system of Fig. 1, but the present invention could be applied to computer systems that are not necessarily conventional; e.g., they could include new hardware components not shown in Fig. 1, or have a novel interconnection architecture for existing components. Therefore, those skilled in the art will appreciate that the present invention is not limited to the generalized system shown in that figure.
With reference now to Fig. 3, there is depicted a state diagram of one embodiment of the cache coherency protocol of the present invention. This protocol is similar to the prior-art MESI protocol of Fig. 2 in that it includes the same four prior-art states (Modified, Exclusive, Shared and Invalid), but it also includes a new "T" state (Tagged), which provides an indication that a cache block has been modified by some processor but has not yet been written back to system memory. For example, when a cache block is in the Modified state in one processor and a read-type operation is requested by a different processor, the first processor will send a modified-intervention response, and the reading processor can thereafter hold the data in the T state (while the first processor switches from Modified to Shared). This operation can be repeated with additional processors, such that the cache that has most recently read a copy of the modified data is the one holding the value in the T state, while all other processors having copies of the value hold it in the Shared state. In this manner, one cache is "tagged" to indicate that it is currently responsible for writing the modified data to the memory hierarchy at some time in the future, if necessary, either by sourcing it to another cache during a modified-intervention response, or by writing it back to system memory. This approach reduces the total number of write operations to system memory.
In the prior-art MESI protocol, a cache reading a copy of a modified value would switch from the Invalid state to the Shared state (rather than to the T state), and the memory controller would also pick up the modified-intervention response so the data would be written to memory. In the basic protocol of the present invention, referred to herein as the "T-MESI" protocol, the memory controller ignores the transaction, freeing up memory bandwidth. The modified value is written to system memory only when required, e.g., as a result of a least-recently-used (LRU) cache deallocation algorithm.
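The read transitions of the T-MESI protocol described above can be sketched as follows. The helper names are invented, and only the Modified/Tagged sourcing cases discussed in the text are modeled:

```python
from enum import Enum

class TMESI(Enum):
    M = "Modified"; E = "Exclusive"; S = "Shared"; T = "Tagged"; I = "Invalid"

def sourcer_after_read(state):
    """New state of the line that intervenes (sources) a snooped read."""
    if state in (TMESI.M, TMESI.T):
        return TMESI.S                 # sourcing cache falls back to Shared
    return state

def requester_after_read(sourcer_state):
    """New state of the requesting line, previously Invalid."""
    if sourcer_state in (TMESI.M, TMESI.T):
        return TMESI.T                 # reader inherits write-back duty
    return TMESI.S

# Rows 2-4 of Table 1 below: P0 does RWITM, then P1 reads, then P2 reads.
p0, p1, p2 = TMESI.M, TMESI.I, TMESI.I
p1, p0 = requester_after_read(p0), sourcer_after_read(p0)
p2, p1 = requester_after_read(p1), sourcer_after_read(p1)
assert (p0, p1, p2) == (TMESI.S, TMESI.S, TMESI.T)
```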
As with the prior-art protocol, the four M-E-S-I states can change based on the initial state of the entry and the type of access sought by the requesting processor. The manner in which these four states change is generally identical to the prior-art MESI protocol, with the following additions. As shown in Fig. 3, a cache line can also go from the Invalid state to the Tagged state, from the Tagged state to the Invalid state, and from the Tagged state to the Shared state. This embodiment of the T-MESI protocol may further be understood with reference to Table 1, which illustrates the cache coherency states of a particular cache block in three different processors, P0, P1 and P2:
Table 1
|  | P0 | P1 | P2 |
| --- | --- | --- | --- |
| Initial state | I | I | I |
| P0 RWITM | M | I | I |
| P1 Read | S | T | I |
| P2 Read | S | S | T |
| Snoop push (P1 DClaim) | S | S | I |
| P1 DClaim (after retry) | I | M | I |
In the first row, the cache blocks of all three processors begin in the Invalid state. In the second row, processor P0 executes a read-with-intent-to-modify operation (RWITM), so its cache line goes from Invalid to Modified. Thereafter, processor P1 requests a read of the cache line; processor P0 intervenes and switches to the Shared state, while processor P1 goes from the Invalid state to the Tagged state (the third row of Table 1). Later, processor P2 requests a read of the cache line; processor P1 intervenes and switches to the Shared state, while processor P2 goes from the Invalid state to the Tagged state (the fourth row of Table 1).
Table 1 also illustrates that a cache line in the T state might not be forced to write its data to system memory even when it is deallocated. Some processor architectures, including the PowerPC™ processor, allow the execution of a special instruction, other than an RWITM, when a processor wants permission to write to a block; the "DClaim" instruction is one example. In the fifth row of Table 1, processor P1 has issued a DClaim request for the particular cache line; processor P2 snoops the DClaim, sends a retry message, and undertakes to push the data to system memory. Once the push is complete, its cache line switches from Tagged to Invalid, but the cache lines in processors P0 and P1 are still Shared. After the retry, processor P1 will issue the DClaim again; this time it is not retried, so the cache line in P0 becomes Invalid, and that in P1 becomes Modified.
A modified value might migrate among the caches without ever actually being written to system memory. For example, consider a processor requesting an RWITM of a value that is already held in the T state; the cache that "owns" the value (the one in the T state) sources the value by intervention, after which the corresponding cache lines in all of the other processors that were in the Shared state, as well as the line that was in the T state, become Invalid. The cache line of the processor executing the RWITM is set to the Modified state, and so the value previously held in the T state is never written to system memory.
The T state has qualities of both the Shared state (since the data is held in the Shared state in one or more other processors) and the Modified state (since the data has been modified and has not yet been written back to system memory). Indeed, from the perspective of a CPU, the T state is equivalent to the S state, but from the perspective of the system bus, a cache line with the T state is essentially treated like a Modified block.
In the embodiment of Fig. 3, the "T" state migrates among the cache lines; in another, alternative embodiment, the "T" state remains with the cache line of the processor that originally modified the value. In other words, the line holding the value in the Modified state switches to the Tagged state (rather than to the Shared state) when it sources the data to another processor. The state diagram for this alternative embodiment is similar to Fig. 3, except that the cache line in the Modified state converts to the Tagged state instead of the Shared state. This alternative embodiment may be desirable in a construction that lets a value "age out" of a cache. For example, if the cache hierarchy is multi-level (down to at least L3), then by pushing the value from the L2 cache into the L3 cache, the value can thereafter be sourced more quickly to other L3 caches, without the system having to wait while it is retrieved from the L2 level. This push can take place in the background, e.g., as the result of an LRU deallocation rather than in response to a specific bus operation, resulting in more efficient overall operation.
The present invention may be implemented in a multiprocessor computer system by formulating a protocol in which specific coherency responses are transmitted from the snoopers of all caches associated with other processors to a processor requesting a read operation. The responses of one embodiment of the present invention are formulated as shown in Table 2:
Table 2
| Access response | Priority | Definition |
| --- | --- | --- |
| 000 | — | Reserved |
| 001 | 3 (1) | Shared intervention |
| 010 | 6 | Remote status |
| 011 | 4 | Rerun |
| 100 | 1 (2) | Retry |
| 101 | 2 (3) | Tagged intervention |
| 110 | 5 | Shared |
| 111 | 7 | Null or clean |
The signals take the form of a 3-bit snoop response, whose value (access response) and definition are set forth in Table 2. These signals are encoded to indicate the snoop result after the address tenure. Table 2 shows responses for a shared line and a clean (invalid) line, as well as a retry response; these three responses are known in the prior art. Table 2 also shows four new responses: "tagged intervention", "shared intervention", "remote status" and "rerun". The tagged-intervention response is used when a cache block holds a value in either the Modified state or the Tagged state, to indicate that it can source the value, but that the new cache block requesting the value must temporarily become responsible for copying the value back to system memory (an implementation that selectively uses the "T" state, as discussed further below, can provide different responses for Modified blocks and Tagged blocks where appropriate).
The other three responses are not directly pertinent to the present invention. The shared-intervention response allows a block holding a valid copy of a value to source it (the R-MESI protocol discussed below). The remote-status response, used only for read operations, indicates that the read will be successful, and the coherency response of Shared or Exclusive will be returned later with the data using another signal. The rerun response is used when the coherency response cannot be determined immediately and the request must be forwarded lower in the hierarchy. The rerun response differs from retry in that the former message must be reissued with the same identifier, so that it can be matched up with the previously forwarded message.
A priority value may further be associated with each response, allowing system logic to determine which of the responses should take priority when formulating a single response to be forwarded to the requesting processor, as shown in Table 2. For example, if one or more caches respond with a tagged-intervention response (priority 2), and one or more caches respond with a retry response (priority 1), then the retry response takes priority, and the system logic will return the retry response to the requesting processor. This system logic may reside in various components, such as a system control point unit, or even within the memory controller.
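A sketch of this response combination, using the primary priority scheme of Table 2; the dictionary layout and the function name are assumptions of the sketch, though the encodings and priorities are transcribed from the table:

```python
RESPONSES = {
    "shared_intervention": (0b001, 3),
    "remote_status":       (0b010, 6),
    "rerun":               (0b011, 4),
    "retry":               (0b100, 1),
    "tagged_intervention": (0b101, 2),
    "shared":              (0b110, 5),
    "null_or_clean":       (0b111, 7),
}

def combine(snoop_responses):
    """Return the single response forwarded to the requesting processor:
    the one with the best (numerically lowest) priority."""
    return min(snoop_responses, key=lambda name: RESPONSES[name][1])

# One cache answers tagged intervention (priority 2), another retry (1):
assert combine(["tagged_intervention", "retry"]) == "retry"
```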
The primary priority values shown in Table 2 give the highest priority to the retry response. An alternative priority scheme, however, may be provided to enhance the use of the T-MESI protocol. In this alternative scheme, indicated by the priority numbers in parentheses in Table 2, the shared-intervention response has the highest priority, followed by the retry response, and then the tagged-intervention response; all other priorities are the same as in the first scheme. In this alternative scheme, the shared-intervention response may be allowed to override all other responses for several reasons. First of all, if a cache line is holding a value (data or instruction) in the "R" state described below (used for shared intervention), then no other cache can hold a value corresponding to the same address in the Modified or Tagged state, so clearly no other cache will be able to respond with tagged intervention. Also, if any other cache issues a retry, then any later response from that same cache, based on the retry, could at most be Shared, which again means that it is acceptable to issue the shared-intervention response in the first place.
The present invention may optionally be implemented in a computer system having some caches that support the T-MESI protocol and other caches that do not. For example, a multiprocessor computer system might initially be manufactured and sold with four processing units mounted on a system board, with another four sockets provided so that additional processing units can be added later. The original processing units (or their cache controllers) may be less expensive and therefore not support the T-MESI protocol, even though the system logic (system control point unit) does. These original processing units can, however, inexpensively be provided with a means of indicating whether they support the protocol, e.g., a single-bit flag informing the system logic whether T-MESI is supported. Then, if new processing units having caches that do support the T-MESI protocol are added to the sockets, the system logic can distinguish those caches by means of the flag, and use the protocol with the appropriate processing units.
To further explain the foregoing, consider a system having several processing units that support the T-MESI protocol and several that do not. When each unit issues a read request, the request includes a flag identifying T-MESI support. If a value is held in a cache line in the Modified state (held by either class of processing unit) and the value is requested by a processing unit that does not support the T-MESI protocol, the system logic forwards a modified-intervention response to the requesting processor and the memory controller; the cache line in the requesting processor goes from the Invalid state to the Shared state, and the memory controller picks up the value during the intervention and stores it in system memory. If, however, the requesting processing unit supports T-MESI, the system logic issues a tagged-intervention response (it converts the non-T-compliant modified-intervention response from the cache into a tagged-intervention response); the cache line in the requesting processing unit goes from Invalid to Tagged, and the memory controller ignores the transaction. In either case, the cache line in the sourcing processing unit goes from Modified to Shared. This construction allows a computer system to utilize any processing units that support the T-MESI protocol, regardless of the reason why T-MESI and normal MESI caches are intermixed; selective implementation of the protocol could also be used for diagnostic purposes.
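A sketch of the conversion performed by the system logic in this heterogeneous case; the response names and the shape of the returned pair are invented for illustration:

```python
def system_response(source_response, requester_supports_tmesi):
    """Convert the sourcing cache's response according to the requester's
    (hypothetical) T-MESI support flag."""
    if source_response == "modified_intervention":
        if requester_supports_tmesi:
            # Requester takes the line Tagged; the memory controller ignores it.
            return "tagged_intervention", "memory_ignores"
        # Legacy requester takes the line Shared; memory picks up the value.
        return "modified_intervention", "memory_updates"
    return source_response, "memory_ignores"

assert system_response("modified_intervention", True) == (
    "tagged_intervention", "memory_ignores")
assert system_response("modified_intervention", False) == (
    "modified_intervention", "memory_updates")
```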
In addition to selectively implementing the Tagged state based on the requesting processor's flag (heterogeneous support), a system flag can be provided to enable or disable the Tagged state on a global basis, i.e., using a one-bit field in the system logic. For example, the master (requesting) processor might support the T-MESI protocol, but the system might nonetheless want the modified value brought down in the memory hierarchy, e.g., into a vertical L3 cache.
As noted above, for the states and operations shown in Tables 1 and 2, transitions and coherency responses are performed in accordance with the prior-art MESI protocol, with the following points pertaining to the T-MESI protocol of Fig. 3: an entry can switch to the T state only if it is currently in the Invalid state (if it is already in the Shared state, it simply stays there, and no other cache can hold the value in the M or E state if one cache holds it in the T state); and a Tagged entry can only switch to Shared (upon a modified-intervention response), Invalid (upon deallocation or a DClaim push), or Modified (if the same processor further modifies the already-modified data).
With this new T-MESI protocol, the ownership of a block migrates to the cache that last read the data, which has the added benefit of keeping the most recently used copy resident, thereby lessening the chance of deallocation when a least-recently-used (LRU) cache replacement mechanism is employed. The "T" cache state can also be advantageously used in other applications. For example, an intelligent input/output (I/O) controller can interrupt the processor/cache that has most recently read a cached I/O status location, since that processor/cache would most likely also have cached the I/O device driver code, and can therefore execute the code faster than another processor that would have to fetch the code into its cache. A particular advantage of the present invention, of course, is that it allows modified, intervened data to be shared.
" T " state be cache line in being in this state when being released on the other hand, all processors can be seen this release by universal interconnect.It is that it supports historical cache state information that the observability of this release also has an advantage.Consider the example of a set forth in the table 1, wherein three corresponding cache lines that processor had are in the disarmed state when initial.When first processor is carried out a RWITM operation, its cache line is transformed to modification from invalid, and when second processor asks to read cache line later on, first processor inserts this data, its cache line is become shared state, and the cache line of second processor then becomes flag state (being similar to first three row of table 1) from disarmed state.Yet first processor demarcates its cache line for having the special shape of shared state now, is called " S
T" (sharing-mark).Then, discharge cache line (for example by LRU mechanism) as the 3rd processor in " T " state, first processor is perceived this and is discharged also and it can be in " S
T" cache line in the state becomes different conditions with the table response; This different conditions is decided by concrete enforcement.For example, the cache line of mark can be written to system storage, and is in " S
T" cache line in the state can be changed into special state, is called " R " state (visit recently), it can be used for the insertion of shared data." R " state will further be discussed below, and that includes here also has this point in the U.S. Patent Application Serial 08/839,557 of application on April 14th, 1997.
In an alternative implementation, rather than writing the modified value from the "T" state cache line to system memory upon deallocation, a cache line in the "S_T" state can simply be converted back to the "T" state, skipping the push operation. Since the shared-tagged block holds identical data, there is no need to copy data into the tagged cache block; only the cache states are updated. These steps are illustrated in the first four rows of Table 3:
Table 3
|  | P0 | P1 | P2 |
| --- | --- | --- | --- |
| Initial state | I | I | I |
| P0 RWITM | M | I | I |
| P1 Read | S_T | T | I |
| P1 LRU deallocation | T | I | I |
| P1 Read | S_T | T | I |
| P2 Read | S | S_T | T |
| P2 LRU deallocation | S | T | I |
| P2 Read | S | S_T | T |
Use of the shared-tagged state allows the computer system to maintain intervention of the value even after deallocation of the tagged cache line, thereby improving system performance.
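Under the alternative implementation above, the deallocation handling might be sketched as follows, assuming at most one Tagged holder per block; the data structures and function name are illustrative:

```python
def deallocate_tagged(states):
    """states maps processor id -> coherency state for one cache block."""
    holder = next(p for p, s in states.items() if s == "T")
    states[holder] = "I"
    for p, s in states.items():
        if s == "S_T":
            states[p] = "T"      # peer inherits write-back responsibility
            return states
    # No shared-tagged peer: the modified value must be pushed to memory.
    print("push modified value to system memory")
    return states

# Rows 3-4 of Table 3: P0 holds S_T, P1 holds T, and P1's line is
# deallocated by the LRU mechanism.
assert deallocate_tagged({"P0": "S_T", "P1": "T", "P2": "I"}) == \
       {"P0": "T", "P1": "I", "P2": "I"}
```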
The first four rows illustrate the use of the "S_T" state arising from the conversion of a cache block in the "M" state; the last four rows of Table 3 illustrate how Tagged cache blocks can also become Shared-Tagged. The fifth and sixth rows show the "T" cache block migrating between processors: the cache block in processor P1 goes from the "T" state to the "S_T" state, while the cache block in processor P0, originally Shared-Tagged, becomes Shared. In the seventh row, the cache block in processor P2 is deallocated, causing the cache block in processor P1 to go from the "S_T" state back to the "T" state. At this point one block is in the "S" state and another is in the "T" state, but no cache block is in the "S_T" state (this situation could also arise if the Shared-Tagged cache line were deallocated first). The "S_T" state can nevertheless reappear, as in the last row, where processor P2 requests another read operation.
The method may be implemented as follows: the cache line in the "T" state broadcasts an appropriate message expressing its desire to rely on a cache block in the "S_T" state and thereby avoid the write operation to system memory. If a cache line in the "S_T" state receives this message, that cache line sends an appropriate response, and the line in the "T" state can then simply be deallocated. If no response is received (i.e., no cache line is in the "S_T" state), then the processor having the Tagged cache line must write the modified value to system memory upon deallocation.
In the implementation of the shared-tagged state described above, only one cache line can be converted to a different state when the Tagged cache line is deallocated. In a more elaborate implementation, multiple levels of historical cache information can be provided. For example, more than one cache block could be in a shared-tagged state at any given time, with an ordinal designation assigned according to the history level of each shared-tagged block; that is, an "S_T1" state for the cache line that sourced the data to the current "T" cache line, an "S_T2" state for the cache line that earlier sourced the data to the "S_T1" cache line, an "S_T3" state for the cache line that even earlier sourced the data to the "S_T2" cache line, and so on. When a Tagged cache line is deallocated, all of the shared-tagged cache lines move up one level, as illustrated in Table 4:
Table 4
|  | P0 | P1 | P2 | P3 |
| --- | --- | --- | --- | --- |
| Initial state | I | I | I | I |
| P0 RWITM | M | I | I | I |
| P1 Read | S_T1 | T | I | I |
| P2 Read | S_T2 | S_T1 | T | I |
| P3 Read | S_T3 | S_T2 | S_T1 | T |
| P3 LRU deallocation | S_T2 | S_T1 | T | I |
In the first three rows of Table 4, similarly to Tables 1 and 3, the cache line of processor P0 becomes Modified and then sources the value to processor P1, whose cache line becomes Tagged; the cache line of processor P0 becomes Shared-Tagged 1st level. In the next two rows, the Tagged cache line migrates to processors P2 and P3, with each previously Tagged cache line becoming Shared-Tagged 1st level. Any line that was Shared-Tagged 1st level becomes Shared-Tagged 2nd level, and in the fifth row the line in processor P0 moves from Shared-Tagged 2nd level to Shared-Tagged 3rd level. In the sixth row, the cache line in processor P3 is deallocated by the LRU mechanism; the "S_T1" cache line in processor P2 becomes "T", the "S_T2" cache line in processor P1 becomes "S_T1", and the "S_T3" cache line in processor P0 becomes "S_T2".
The use of ordinal levels of the shared-tagged state in conjunction with the LRU deallocation mechanism again improves performance, since it lessens the chance that a cache line in a given Shared-Tagged level will itself be deallocated, thereby increasing the chance that the modified value will remain somewhere in the horizontal cache structure. The only limitation on the number of levels of historical cache information is the number of bits available in the cache coherency state field of the cache line.
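The promotion of history levels upon deallocation of the Tagged line (the last two rows of Table 4) might be sketched as follows; the suffix-integer encoding of the levels is merely this sketch's plain-text convention:

```python
def promote_history(states):
    """Promote every shared-tagged line one level when the T line is freed."""
    for p, s in states.items():
        if s == "T":
            states[p] = "I"
        elif s == "S_T1":
            states[p] = "T"      # first-level history inherits the T state
        elif s.startswith("S_T"):
            states[p] = "S_T%d" % (int(s[3:]) - 1)
    return states

# The last two rows of Table 4:
before = {"P0": "S_T3", "P1": "S_T2", "P2": "S_T1", "P3": "T"}
assert promote_history(before) == \
       {"P0": "S_T2", "P1": "S_T1", "P2": "T", "P3": "I"}
```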
The foregoing description of the invention is generally applicable to cache architectures at any level, e.g., L2, L3, etc., but in the case of a multi-level cache hierarchy it may be more advantageous to use the present invention with a slightly different model. Referring to Fig. 4, a multiprocessor computer system 40 includes two CPU clusters 42a and 42b. CPU cluster 42a has four CPUs 44a, 44b, 44c and 44d, each having a processor core with on-board (L1) instruction and data caches, and an L2 cache. The L2 caches of these four CPUs 44a, 44b, 44c and 44d are connected to a shared L3 cache 46a, which is connected to the system memory (RAM) 48 via the generalized interconnect or bus 50. CPU cluster 42b similarly has four CPUs 44e, 44f, 44g and 44h, each also having a processor core with on-board (L1) instruction and data caches, and an L2 cache. The L2 caches of CPUs 44e, 44f, 44g and 44h are connected to another shared L3 cache 46b, which is again connected to memory 48 via bus 50. In a hierarchical variation of the T-MESI protocol, up to three corresponding cache lines can be found in the Tagged state: one cache line among the L2 caches of CPUs 44a, 44b, 44c and 44d; one cache line among the L2 caches of CPUs 44e, 44f, 44g and 44h; and one cache line between the two L3 caches 46a and 46b.
Consider the following example, in which all of the corresponding cache lines in CPUs 44a-44h begin in the Invalid state. Processor 44a executes an RWITM operation, so its cache line (L2) goes from the Invalid state to the Modified state; the corresponding cache line in L3 cache 46a also goes from Invalid to Modified. Thereafter, processor 44b requests a read of the cache line; processor 44a intervenes and its cache line (L2) switches to the Shared state, while the cache line (L2) of processor 44b goes from the Invalid state to the Tagged state. The cache line in L3 cache 46a remains Modified. Later, processor 44e requests a read of the cache line; processor 44b intervenes, but its cache line (L2) remains Tagged, since it is in a different CPU cluster from processor 44e. The cache line (L2) in processor 44e nonetheless goes from Invalid to Tagged. Also, since the intervened data passed through both L3 caches, the cache line in L3 cache 46a goes from Modified to Shared, and the cache line in L3 cache 46b goes from Invalid to Tagged. If processor 44f thereafter requests a read of the cache line, it can be sourced by the cache line (L2) of processor 44e. In that case, the cache line (L2) of processor 44e goes from the Tagged state to the Shared state, and the cache line (L2) of processor 44f goes from the Invalid state to the Tagged state. These steps are illustrated in Table 5:
Table 5
|  | L2 (44a) | L2 (44b) | L2 (44e) | L2 (44f) | L3 (46a) | L3 (46b) |
| --- | --- | --- | --- | --- | --- | --- |
| Initial state | I | I | I | I | I | I |
| P44a RWITM | M | I | I | I | M | I |
| P44b Read | S | T | I | I | M | I |
| P44e Read | S | T | T | I | S | T |
| P44f Read | S | T | S | T | S | T |
In the last row of Table 5, one cache line in each CPU cluster is in the "T" state, and one of the L3 cache lines is in the "T" state as well. This condition allows modified data to be sourced at the L2 level by a local processor (i.e., one in the same cluster as the requesting processor), further enhancing performance. So if processor 44c thereafter requests a read of the cache line, that request will be satisfied by the cache line (L2) of processor 44b, but if processor 44g thereafter requests a read of the cache line, that request will be satisfied by the cache line (L2) of processor 44f; both of these operations take place at the L2 level, without requiring any action by the L3 caches 46a and 46b. If more than two CPU clusters are provided, the "T" cache line can similarly migrate among the additional L3 caches. This concept can be extended to cache architectures having even more than three cache levels (L1, L2, L3), and the "T" state need not be implemented at all of the levels.
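A sketch of the local-versus-global sourcing decision in the clustered arrangement of Table 5; the cluster membership tables and the function name are assumptions of the sketch:

```python
def source_for_read(requester, clusters, l2_states):
    """clusters: cluster id -> list of CPUs; l2_states: CPU -> state."""
    local = next(c for c, cpus in clusters.items() if requester in cpus)
    for cpu in clusters[local]:
        if l2_states[cpu] == "T":
            return cpu, "local L2 intervention"
    # No local Tagged line: the request must cross the L3 level.
    return None, "forward to L3 / other cluster"

clusters = {"A": ["44a", "44b", "44c", "44d"],
            "B": ["44e", "44f", "44g", "44h"]}
l2 = {"44a": "S", "44b": "T", "44c": "I", "44d": "I",
      "44e": "S", "44f": "T", "44g": "I", "44h": "I"}
assert source_for_read("44c", clusters, l2) == ("44b", "local L2 intervention")
assert source_for_read("44g", clusters, l2) == ("44f", "local L2 intervention")
```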
The present invention can also be used in conjunction with other variants of the MESI protocol, such as the R-MESI protocol discussed in the aforementioned U.S. Patent Application Serial No. 08/839,557. According to that protocol, a Recent state is applied to the cache that has most recently read shared data, allowing it to provide a shared-intervention response. A combined RT-MESI protocol can accordingly be devised, one embodiment of which is shown in Fig. 5. In this hybrid protocol, a cache line in the Tagged state switches to the Recent state once the modified value has been sourced to some other unit in the memory hierarchy (i.e., to another cache or to system memory), and a cache line in the Modified state similarly switches to the Recent state upon sourcing the value. Table 6 provides an example:
Table 6
|  | P0 | P1 | P2 |
| --- | --- | --- | --- |
| Initial state | I | I | I |
| P0 RWITM | M | I | I |
| P1 Read | R | T | I |
| P2 Read | S | R | T |
| P2 Deallocation | S | R | I |
In Table 6, all three corresponding cache lines in processors P0, P1 and P2 begin in the Invalid state (similar to Table 1). When processor P0 executes an RWITM on the corresponding memory block, its cache line becomes Modified. When processor P1 then executes a read operation, its corresponding cache line still becomes Tagged, but the cache line of processor P0 now becomes Recent rather than Shared (the third row of Table 6). Thereafter, when processor P2 executes a read operation, its cache line becomes Tagged, the cache line of processor P1 becomes Recent, and the cache line of processor P0 becomes Shared (the fourth row of Table 6). When processor P2 then deallocates the block (due to the LRU algorithm), processor P1 continues to hold the value in the "R" state; in this manner, processor P1 can source the value in the future by means of a shared-intervention response. In a variation of this protocol, the Exclusive state can effectively be omitted and replaced by the Recent state.
Those skilled in the art will appreciate that more complex forms of the RT-MESI protocol are possible, e.g., using the aforementioned "S_T" state in a hybrid version wherein a cache line in the "S_T" state becomes "R" (rather than "T") when the cache line holding the value in the "T" state is deallocated and the value is written back to system memory. Similar embodiments can be devised using the multiple levels of historical cache information provided by the shared-tagged states. The RT-MESI protocol can also be implemented in the global/local cache construction of Fig. 4. For example, consider a local cache of processing unit 44d holding a value in the "M" state, which then sources the value to processing unit 44h. As before, the cache line in processing unit 44h goes from Invalid to Tagged, but the cache line in processing unit 44d can now go from Modified to Recent.
" T " and " R " state the both provide a kind of mechanism, is used for discerning a cacheline uniquely from one group of cacheline of sharing a value.As noted, this one side this piece easy to use of these states is in inserting operation.This piece is demarcated other advantage in addition uniquely.This relates to above-mentioned DClaim operation.Many processors all can be carried out this operation simultaneously in the reality, thereby cause collision." T " state can be used for composing gives collision priority, thereby changes the DClaim request from the conflict of other high-speed cache.By this collision priority is provided, DClaim from " T " status block operates can come forth (promptly being put in the cache operations formation so that in fact be broadcast to the remainder of memory hierarchy), but can finish the DClaim storage instruction immediately, total faster operation of feasible system like this, as what discussed in the U.S. Patent Application Serial 08/024,587.
Although the "T" state can advantageously be used to (i) source data by intervention, (ii) track responsibility for writing modified data to the memory hierarchy, and (iii) provide DClaim collision priority, these three functions need not be combined in a single coherency state. Table 7 below illustrates a more complicated coherency protocol in which these functions are performed independently:
Table 7
| Cache block | Possible states of corresponding (horizontal) cache blocks |
| --- | --- |
| I | Q, QD, QT, QDT, R, RD, RT, RDT, S, SD, ST, SDT, H, M, I |
| H | Q, QD, QT, QDT, R, RD, RT, RDT, S, SD, ST, SDT, H, M, I |
| M | I |
| Q | R, RD, RT, RDT, S, SD, ST, SDT, H, I |
| QD | R, RT, S, ST, H, I |
| QT | R, RD, S, SD, H, I |
| QDT | R, S, H, I |
| R | Q, QD, QT, QDT, S, SD, ST, SDT, H, I |
| RD | Q, QT, S, ST, H, I |
| RT | Q, QD, S, SD, H, I |
| RDT | Q, S, H, I |
| S | Q, QD, QT, QDT, R, RD, RT, RDT, S, SD, ST, SDT, H, I |
| SD | Q, QT, R, RT, S, ST, H, I |
| ST | Q, QD, R, RD, S, SD, H, I |
| SDT | Q, R, S, H, I |
In Table 7, the left column indicates the state of a particular cache block, and the right column identifies the possible coherency states of the corresponding blocks in other horizontal caches. This protocol scheme provides 15 coherency states, so the coherency field requires four bits. The three functions noted above are assigned separately, as follows. First, any coherency state having a "D" subscript (QD, QDT, RD, RDT, SD, or SDT) is allowed to post a DClaim operation (i.e., such a block will have collision priority if a conflicting DClaim request appears). Second, any coherency state having a "T" subscript (QT, QDT, RT, RDT, ST, or SDT) is responsible for writing the modified value downward in the memory hierarchy. Third, any "R[x]" coherency state (R, RD, RT, or RDT) has the right to source the value by intervention. The "Q[x]" coherency states (Q, QD, QT, or QDT) provide intervention when no "R[x]" state is present; that is, the "R[x]" states allow first-level intervention, and the (historical) "Q[x]" states allow second-level intervention. The "H" state is the "hover" state mentioned below. This embodiment does not use the "E" state.
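Purely as an illustration of this functional decomposition (not part of the original disclosure; the struct layout and all identifiers below are assumptions for this sketch, not the patent's actual 4-bit encoding), each of the 15 states can be viewed in C as a base state plus independent "D" and "T" attributes:

```c
#include <stdbool.h>

/* Illustrative decomposition of the Table 7 states: a base state plus
 * independent "D" and "T" flags, so e.g. R_DT is { BASE_R, true, true }.
 * The real protocol packs all 15 states into a 4-bit coherency field. */
typedef enum { BASE_I, BASE_H, BASE_M, BASE_S, BASE_R, BASE_Q } Base;

typedef struct {
    Base base;
    bool d;   /* "D" subscript: may post a DClaim (collision priority) */
    bool t;   /* "T" subscript: responsible for writing the value down */
} CohState;

/* Function 1: DClaim collision priority. */
static bool can_post_dclaim(CohState s) { return s.d; }

/* Function 2: write-back responsibility for the modified value. */
static bool owes_writeback(CohState s) { return s.t; }

/* Function 3: intervention rights -- R[x] states intervene first;
 * the historical Q[x] states intervene only when no R[x] copy exists. */
static bool may_intervene(CohState s, bool r_state_present) {
    if (s.base == BASE_R) return true;
    return s.base == BASE_Q && !r_state_present;
}
```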
Although the three functions noted are implemented independently, any two of them can be combined in particular coherency states. The intervention and write-back responsibility functions are combined in states RT and QT. The intervention and DClaim priority functions are combined in states RD and QD. The write-back responsibility and DClaim priority functions are combined in state SDT. All three functions are combined in states QDT and RDT. The independence of these three functions can be controlled by setting system bits, for example by means of a data-flow engine. This concept can also be used to support caches for clustered CPUs.
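A short self-contained usage sketch of these combinations, under the same illustrative base-plus-flags decomposition as above (again, the names are assumptions, not the disclosure's):

```c
#include <stdbool.h>
#include <stdio.h>

/* The illustrative base-plus-flags decomposition from the previous sketch. */
typedef enum { BASE_I, BASE_H, BASE_M, BASE_S, BASE_R, BASE_Q } Base;
typedef struct { Base base; bool d, t; } CohState;

int main(void) {
    CohState r_dt = { BASE_R, true,  true };  /* combines all three functions          */
    CohState q_t  = { BASE_Q, false, true };  /* write-back + historical intervention  */
    CohState s_dt = { BASE_S, true,  true };  /* write-back + DClaim, no intervention  */
    printf("R_DT: dclaim=%d writeback=%d\n", r_dt.d, r_dt.t);
    printf("Q_T:  dclaim=%d writeback=%d\n", q_t.d,  q_t.t);
    printf("S_DT: dclaim=%d writeback=%d\n", s_dt.d, s_dt.t);
    return 0;
}
```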
Finally, the present invention is also particularly compatible with the use of crossbars. Prior-art cache designs use address crossbars and data crossbars to enhance communications. Caches generally do not have point-to-point communications, but instead must broadcast requests and responses to the other units in the memory hierarchy. A crossbar is simply a switch or relay used to route requests and responses onto different paths of the bus so that the bus is used somewhat more efficiently. In other words, all of the caches are interconnected through the crossbar, which maintains a queue so that cache operations can be distributed evenly among the different bus paths, establishing a much wider effective bus bandwidth. A system controller may control these crossbars. A given cache (e.g., an L2 cache) must notify the crossbar controller that the cache must watch for operations relating to a given tag.
The T-MESI protocol is useful with crossbars because certain address and data operations need be presented only to those devices that require them. Consider an example with four processing units, one of which has a cache block in the "T" state, another of which has the corresponding block in the "ST" state, and the remaining two of which have the corresponding block in the "I" state. When one of the latter two processors requests a read of the value, the system logic can determine that the highest-priority response (tagged intervention) need be delivered to only three of the four processors. The address operation therefore is not presented to the fourth processor (the non-requesting processor whose block is in the "I" state). Similarly, the data crossbar can be used to deliver the value itself only to the original requesting processor. If the tagged-intervention response is used with a priority scheme that overrides retry responses, however, the response may also need to be presented to any processor issuing a retry.
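A minimal sketch of this address-crossbar filtering, assuming the crossbar controller already knows each cache's snoop state for the requested tag (all identifiers here are illustrative, not from the disclosure):

```c
#include <stdbool.h>

#define NUM_CPUS 4

/* Per-cache snoop state for the requested tag in the four-processor example. */
typedef enum { ST_I, ST_S, ST_SHARED_T, ST_T } SnoopState;

/* Present the address operation (and the tagged-intervention response) only
 * to the requester and to caches holding the block in a non-"I" state; the
 * non-requesting "I"-state processor is skipped entirely. */
static void route_address_op(const SnoopState st[NUM_CPUS], int requester,
                             bool present[NUM_CPUS]) {
    for (int i = 0; i < NUM_CPUS; i++)
        present[i] = (i == requester) || (st[i] != ST_I);
}
```

With the example states (T, ST, I, I) and processor 2 as the requester, `present` is set only for processors 0, 1 and 2, matching the three-of-four delivery described above.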
While the invention has been described with reference to specific embodiments, this description is not meant to be construed in a limiting sense. Various modifications of the disclosed embodiments, as well as alternative embodiments of the invention, will become apparent to persons skilled in the art upon reference to the description of the invention. For example, this protocol can be used in conjunction with cache coherency protocols other than the R-MESI protocol; U.S. patent application Serial No. 08/024,610 describes an "H-MESI" protocol in which a cache line enters a "hover" state to await the transmission of valid data, and the H-MESI protocol can be combined with the present T-MESI protocol such that, for example, a cache line in the "H" state becomes "ST" when the hovering cache line loads the valid data. It is therefore contemplated that such modifications can be made without departing from the spirit or scope of the present invention as defined in the appended claims.
Claims (20)
1. A method of maintaining cache coherency in a multiprocessor computer system having a plurality of processing units grouped into at least two clusters, each processing-unit cluster having at least two cache levels, wherein a given one of a plurality of caches in a first cache level is used by only a single processing unit, and a given one of a plurality of caches in a second cache level is used by two or more processing units in the same cluster, the method comprising the steps of:
assigning a tagged coherency state to a first cache line of a cache in the first cache level associated with a first processing unit in a first processing-unit cluster, to indicate that the first cache line contains a modified value corresponding to a memory block of a system memory device of the computer system and that the modified value has not yet been written to the memory block of the system memory device; and
assigning the tagged coherency state to a second cache line of a cache in the first cache level associated with a second processing unit in a second processing-unit cluster, to indicate that the second cache line contains the modified value and that the modified value has not yet been written to the memory block of the system memory device.
2. The method of claim 1, further comprising the step of transmitting the modified value from the first cache line to the second cache line after said step of assigning the tagged coherency state to the first cache line.
3. The method of claim 1, further comprising the step of writing the modified value to the memory block of the system memory device.
4. The method of claim 1, further comprising the step of assigning the tagged coherency state to a third cache line of a cache in the second cache level associated with the first processing-unit cluster.
5. The method of claim 1, further comprising the steps of:
prior to said step of assigning the tagged coherency state, transmitting the modified value to the first cache line from a third cache line of another cache in the first cache level associated with a third processing unit in the first processing-unit cluster; and
assigning a shared coherency state to the third cache line to indicate that the third cache line contains a shared copy of the modified value.
6. The method of claim 1, further comprising the steps of:
transmitting the modified value from the second cache line to a third cache line of another cache in the first cache level associated with a third processing unit in the second processing-unit cluster;
assigning the tagged coherency state to the third cache line to indicate that the third cache line contains the modified value and that the modified value has not yet been written to the memory block of the system memory device; and
after said transmitting step, assigning a shared coherency state to the second cache line to indicate that the second cache line contains a shared copy of the modified value.
7. The method of claim 3, further comprising the step of assigning the tagged coherency state to a third cache line of a cache in the second cache level associated with the first processing-unit cluster, wherein said writing step is performed in response to deallocation of the modified value from the third cache line.
8. The method of claim 3, wherein said writing step is performed in response to snooping an operation that requires the modified value to be pushed.
9. The method of claim 4, further comprising the step of transmitting the modified value from the third cache line to the second cache line after said step of assigning the tagged coherency state to the third cache line.
10. The method of claim 4, further comprising the steps of:
after said step of assigning the tagged coherency state to the third cache line, transmitting the modified value from the third cache line to a fourth cache line of another cache in the second cache level associated with the second processing-unit cluster;
assigning the tagged coherency state to the fourth cache line; and
after said transmitting step, assigning a shared coherency state to the third cache line to indicate that the third cache line contains a shared copy of the modified value.
11. The method of claim 6, further comprising the steps of:
transmitting the modified value from the first cache line to a fourth cache line of another cache in the first cache level associated with a fourth processing unit in the first processing-unit cluster;
assigning the tagged coherency state to the fourth cache line to indicate that the fourth cache line contains the modified value and that the modified value has not yet been written to the memory block of the system memory device; and
after said step of transmitting the modified value to the fourth cache line, assigning a shared coherency state to the first cache line to indicate that the first cache line contains a shared copy of the modified value.
12. A computer system comprising:
a system memory device;
a bus connected to said system memory device;
a first plurality of processing units, each of said first plurality of processing units having a cache for storing values from said system memory device, said first plurality of processing units being grouped into a first processing-unit cluster;
a second plurality of processing units, each of said second plurality of processing units having a cache for storing values from said system memory device, said second plurality of processing units being grouped into a second processing-unit cluster;
a first cluster cache connected to each said cache of said first plurality of processing units and to said bus;
a second cluster cache connected to each said cache of said second plurality of processing units and to said bus; and
cache coherency means for assigning a tagged coherency state to a first cache line of a cache associated with a first processing unit in the first processing-unit cluster, to indicate that the first cache line contains a modified value corresponding to a memory block of said system memory device and that the modified value has not yet been written to said memory block of said system memory device, and for assigning the tagged coherency state to a second cache line of a cache associated with a second processing unit in the second processing-unit cluster, to indicate that the second cache line contains the modified value and that the modified value has not yet been written to said memory block of said system memory device.
13. The computer system of claim 12, wherein said cache coherency means includes means for transmitting the modified value to the second cache line, after the tagged coherency state has been assigned to the first cache line, in response to a read request from said second processing unit.
14. The computer system of claim 12, wherein said cache coherency means includes means for writing the modified value to the memory block of the system memory device in response to deallocation of the modified value and in response to snooping an operation that requires the modified value to be pushed.
15. The computer system of claim 12, wherein said cache coherency means includes means for assigning the tagged coherency state to a third cache line of said first cluster cache.
16. The computer system of claim 12, wherein said cache coherency means includes means for transmitting the modified value to the first cache line, prior to assignment of the tagged coherency state, from a third cache line of another cache associated with a third processing unit in the first processing-unit cluster, and for assigning a shared coherency state to the third cache line to indicate that the third cache line contains a shared copy of the modified value.
17. The computer system of claim 12, wherein said cache coherency means includes means for (i) transmitting the modified value from the second cache line to a third cache line of another cache associated with a third processing unit in the second processing-unit cluster, (ii) assigning the tagged coherency state to the third cache line to indicate that the third cache line contains the modified value and that the modified value has not yet been written to said memory block of said system memory device, and (iii) after the modified value has been transmitted, assigning a shared coherency state to the second cache line to indicate that the second cache line contains a shared copy of the modified value.
18. The computer system of claim 15, wherein said cache coherency means includes means for transmitting the modified value from the third cache line to the second cache line after the tagged coherency state has been assigned to the third cache line.
19. The computer system of claim 15, further comprising means for (i) transmitting the modified value, after the tagged coherency state has been assigned to the third cache line, from the third cache line to a fourth cache line of said second cluster cache, (ii) assigning the tagged coherency state to the fourth cache line, and (iii) after the modified value has been transmitted, assigning a shared coherency state to the third cache line to indicate that the third cache line contains a shared copy of the modified value.
20. The computer system of claim 17, wherein said cache coherency means includes means for (i) transmitting the modified value from the first cache line to a fourth cache line of another cache associated with a fourth processing unit in the first processing-unit cluster, (ii) assigning the tagged coherency state to the fourth cache line to indicate that the fourth cache line contains the modified value and that the modified value has not yet been written to said memory block of said system memory device, and (iii) after the modified value has been transmitted to the fourth cache line, assigning a shared coherency state to the first cache line to indicate that the first cache line contains a shared copy of the modified value.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US2461498A | 1998-02-17 | 1998-02-17 | |
US09/024,614 | 1998-02-17 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1229953A true CN1229953A (en) | 1999-09-29 |
Family
ID=21821509
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 98125993 Pending CN1229953A (en) | 1998-02-17 | 1998-12-31 | Cache coherency protocol with global and local tagged states |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1229953A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100444135C (en) * | 2005-12-06 | 2008-12-17 | 国际商业机器公司 | Method and processor for transient cache storage |
CN100458737C (en) * | 2006-01-03 | 2009-02-04 | 国际商业机器公司 | Apparatus, system, and method for regulating the number of write requests in a fixed-size cache |
WO2014146425A1 (en) * | 2013-03-22 | 2014-09-25 | 浪潮电子信息产业股份有限公司 | Method for partial construction of share-f state in multilevel cache coherency domain system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1143227C (en) | Cache coherency protocol with independent implementation of optimized cache operations | |
US6334172B1 (en) | Cache coherency protocol with tagged state for modified values | |
US6289420B1 (en) | System and method for increasing the snoop bandwidth to cache tags in a multiport cache memory subsystem | |
US6405289B1 (en) | Multiprocessor system in which a cache serving as a highest point of coherency is indicated by a snoop response | |
US6018791A (en) | Apparatus and method of maintaining cache coherency in a multi-processor computer system with global and local recently read states | |
EP0598535B1 (en) | Pending write-back controller for a cache controller coupled to a packet switched memory bus | |
US5297269A (en) | Cache coherency protocol for multi processor computer system | |
CN1081360C (en) | Virtual channel memory system | |
US6779036B1 (en) | Method and apparatus for achieving correct order among bus memory transactions in a physically distributed SMP system | |
US6145059A (en) | Cache coherency protocols with posted operations and tagged coherency states | |
US6484220B1 (en) | Transfer of data between processors in a multi-processor system | |
US6330643B1 (en) | Cache coherency protocols with global and local posted operations | |
EP0847011B1 (en) | Method for reducing the number of coherency cycles within a directory-based cache coherency memory system utilizing a memory state cache | |
US6725307B1 (en) | Method and system for controlling data transfers with physical separation of data functionality from address and control functionality in a distributed multi-bus multiprocessor system | |
US20060053257A1 (en) | Resolving multi-core shared cache access conflicts | |
CN1869966A (en) | Ring management | |
US8176261B2 (en) | Information processing apparatus and data transfer method | |
CN1841342A (en) | Data processing system and method | |
JP2003030169A (en) | Decentralized global coherency management in multinode computer system | |
CN1534487A (en) | High speed buffer storage distribution | |
CN1620651A (en) | Method and apparatus for using global snooping to provide cache coherence to distributed computer nodes in a single coherent system | |
JPH08235061A (en) | Multiprocessor data-processing system | |
US20050160240A1 (en) | System and method for blocking data responses | |
US6247098B1 (en) | Cache coherency protocol with selectively implemented tagged state | |
US7360021B2 (en) | System and method for completing updates to entire cache lines with address-only bus operations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |