CN102681890A

CN102681890A - Restrictive value delivery method and device applied to thread-level speculative parallelism

Info

Publication number: CN102681890A
Application number: CN2012101330669A
Authority: CN
Inventors: 安虹; 邓博斌; 李颀; 李功明; 毛梦捷
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2012-04-28
Filing date: 2012-04-28
Publication date: 2012-09-19
Anticipated expiration: 2032-04-28
Also published as: CN102681890B

Abstract

The invention provides a restrictive value delivery method and device applied to thread-level speculative parallelism, and when a conflict happens, the total execution time of a system can be reduced in a method of value delivery, i.e. a conflict thread can receive needed data only when a specific condition is met, and otherwise execution is conducted only according to the mode of the original system. The method disclosed by the invention is a lightweight class value delivery method, and compared with a complete value delivery and value prediction method, the method has the advantage of low hardware and protocol complexity, but under an ordinary condition, the performance is inferior to that of complete value delivery and value prediction. Through experimental data analysis, compared with a value prediction model, a restrictive value delivery model does not have large performance loss. The device is realized and verified on a LogSPoTM model, and is also suitable for other thread-level systems.

Description

A kind of thread-level that is applied to is inferred parallel restricted value transmission method and device

Technical field

The invention belongs to computer microprocessor structural design field, the particularly a kind of lightweight value transmission method and device that can improve the multi-threaded system performance effectively.

Background technology

Speculative multithreading technology and affairs memory technology

Along with multi core chip (Chip Multi-Processor; CMP) arrival in epoch; The serial program threading that how will be difficult to parallelization traditionally is to quicken single program implementation; Also for but increasing computation core on the sheet provides the calculation task of more executed in parallel to improve the utilization factor of resource on the sheet, become the common hot research problem of paying close attention to of academia and industry member simultaneously.

In order to develop available Thread-Level Parallelism property on the more multi core chip; Solve that the concurrent program correctness is safeguarded the complicacy brought to multiple programming and to the restriction problem of performance; Academia has proposed thread-level supposition (Thread-Level Speculation respectively from different angles; TLS) and affairs storages (Transactional Memory, TM) two kinds of technology.TLS is intended to break cross-thread and relies on the restriction that thread parallel is carried out, and increases the chance that program parallelization is carried out.When compiler or programmer can't confirm dependence complete between the thread candidate; It is machine-processed need not adopt conservative strategy to abandon synchronous protection parallel or that add bulk redundancy; Can ignore the direct parallelization of the dependence that possibly exist; The maintenance of serial semantics supports to infer that by when operation the hardware mechanisms of carrying out guarantees, thereby has the technological potentiality of the concurrency in the excavation program farthest.TM is intended to seek replacement scheme for explicit lock synchronization mechanism, through the implicit expression synchronization mechanism that runtime system provides, realizes not having the shared storage programming of lock.Owing to need not ask lock and release, so this also is a kind of method of synchronization of unblock, both solved correctness problems such as the deadlock that exists in the lock mechanism, pirority inversion, also solved the influence that lock granularity possibly cause performance.TM is automatically safeguarded from the concurrent operations of a plurality of threads or the performance element semantic consistency when sharing the storage organization status modifier by system.The common ground of TLS and TM is to reduce the difficulty of multiple programming, increases the chance that thread parallel is carried out; The requirement of similar support presumed access, metadata cache, conflict cancellation has also been proposed at hardware aspect.

In order to enlarge this two kinds of The Application of Technology scopes, researchers have proposed the mixture model of some TM and TLS, and the TCC of Stanford University is wherein more typically arranged, the Bulk in champagne branch school, Illinois, the LogSPoTM that also has China Science & Technology University to propose.Wherein, LogSPoTM is based on and realizes on the basis of LogTM, has increased the semantic support to TLS for original TM system.The execution pattern of LogSPoTM is divided into parallel (gray line) and two parts of serial (black line) shown in accompanying drawing 1.For the serial part, in order to guarantee program correctness, thread must be submitted to according to priority from high to low according to the order of sequence.LogSPoTM is based on the structure (TCC and Bulk are based on bus interconnection) of network interconnection, and it is better to expand putting property.Aspect the system hardware complexity, LogSPoTM is more moderate, is fit to further expanding of systemic-function.

The existing problems of speculative multithreading technology and affairs memory technology

As above-mentioned, when the dependence of the data collision between program ratio was lower, TLS and TM technology can both obtain reasonable performance on multi-core platform.But when existing more data collision to rely in the program, it is very bad that system performance can become.In view of this, the procedural application scope of TLS and TM system is more limited, and this also is why general processor manufacturer does not adopt these two kinds of main causes that technology is carried out actual production always.

In order to alleviate this problem, there are some scholars to propose some value transmission methods and quicken to conflict and rely on many parallel multithread programs, rely on affairs memory technologies (DATM) such as consciousness.This is a kind of radical value transmission technology; That is to say that as long as affairs have triggered the data dependence, it just can receive needed data and continue and carry out from last affairs.Though DATM obtains reasonable speed-up ratio in some test procedures, its practicality is also bad.At first, DATM is based on that the bus consistency protocol realizes, can not directly be suitable for for expanding the reasonable catalogue consistency protocol of putting property, need carry out that a large amount of agreements is improved and checking.Secondly, DATM has carried out large-scale modification to MSI bus consistency protocol, having only 3 status state machine to increase sharply to 13 states originally, and changes the mechanism with complex state.If want intactly to be transplanted to this thought in the general bibliographic structure, it is more complicated more than bus structure that state transition graph will become, and this system in reality can not realize.At last, DATM has added bigger hardware costs on general affair storage system, and this also is that processor manufacturer is unwilling to accept.

The present invention is to be target to improve system performance, relies on when serious in the routine data conflict, comes the contradiction of mitigation performance difference through the method for restricted value transmission.In improved process, also want emphasis to consider the problem of hardware costs, remove to revise hardware configuration as far as possible less.The false sharing problem of some TLS&TM mixture models (like LogSPoTM) can also be solved in addition by value transmission technology, system performance can be further improved.

Summary of the invention

Key problem of the present invention is to adopt few hardware costs of trying one's best to alleviate TM and TLS performance tradeoff, so we have proposed a kind of parallel restricted value transmission method and device of thread-level supposition that be applied to.It is a kind of compromise of primal system and radical value transmission system.From the empirical average result, restricted value transmission method also can be obtained more attracting performance, and hardware complexity is relatively low, and consistency protocol is changed little, has good transplantability.We are in the LogSPoTM model this restricted value transmission technology that has been Platform Implementation.We choose the LogSPoTM system mainly is because its hardware complexity is lower than existing other TM&TLS commingled systems, and is based on bibliographic structure, has to expand putting property preferably.But restricted value transmission technology is equally applicable to other TM and TLS system.

LogSPoTM supports TM and two kinds of semantemes of TLS, and in this manual, we only explain with TLS as an example.Rely on the performance loss that is brought in order to reduce data collision, we can adopt the added value pass through mechanism to alleviate.In the accompanying drawing 2 (a) is the conflict dependence processing mode of original LogSPoTM model, and when conflict took place, the thread that priority is low only could continue to carry out after waiting for the high thread submission completion of priority.The method that Fig. 2 (b) has adopted value to transmit; When the data collision dependence took place, the thread that priority is high can send needed data to the low thread of priority, let the low follow-up thread of priority continue to carry out; And submit to according to the order of sequence, thereby save total system's execution time.If but send thread the data of transmitting are revised, must in time be notified and accept thread.In order to guarantee the correctness of system, receiving thread must rollback and execution again.

Shortcoming about radical value transmission (DATM) above-mentioned is so the present invention proposes the method that a kind of restricted value is transmitted.Accompanying drawing 3 (b) has provided the execution pattern that restricted value is transmitted.The priority of thread is following among the figure: T1＞T2＞T3.Shared data X is occupied by T1.As T2 during to the T1 request msg X that also do not have to submit to, T1 can transmit this data to T2, lets unlike equally direct transmissions of original LogSPoTM NACK T2 is empty to wait for that at this moment T2 can continue execution.As T3 during also to T1 request msg X, restricted value transmission system can be the same with original LogSPoTM, lets T1 send NACK to T3, and unlike radical value transmission system that kind to T3 transmission data X.That is to say that system be as long as send corresponding data just can for the thread of request satisfying under certain agreed terms, otherwise only can carry out according to original LogSPoTM pattern.Restricted value transmission technology must satisfy following rule simultaneously and just allow Data transmission:

(1) data owner's thread must not send any data to other threads.

(2) data owner's thread must not receive the data that other any threads send over.

(3) data owner's thread sends data (if thread is orderly) only can for its low thread of priority ratio.

Transmit technological hardware costs for restricted value, we have only done micro-modification in original LogSPoTM system.We have increased by two registers group for each processor: data transmitter register group and data are accepted registers group.According to data limit property delivery rules above-mentioned, each two registers group above the processor can not be used simultaneously.Accompanying drawing 4 has provided concrete explanation of field.RID and SID have write down processor numbering that receives data and the processor numbering of sending data respectively.What ADD preserved is the address of this Data transmission.What DATA preserved is the value of these data.What be worth explanation is, receiving thread receives after the data, and the retouching operation of these data is only carried out in receiving register, when this thread is submitted to, just can write back among the cache at last.BITS is with solving false sharing problem, and the byte number that the figure place of BITS and cache are capable is identical and corresponding one by one.False sharing problem is because the institutional framework of cache and the acting in conjunction of collision detection mechanism cause.In the computer system of current main-stream, the cache structure is all organized with behavior unit, i.e. the corresponding a plurality of continuous address of each row.And when LogSPoTM carries out the collision detection of data, detect with cache behavior unit.That is to say that if two thread accesses is the capable different address locations of same cache, system also can handle it as data collision, though in fact do not send conflict, Here it is so-called false sharing problem.In some TM&TLS commingled systems (like TCC), false sharing problem has obtained reasonable solution.But these methods also are not suitable for the LogSPoTM model, and false sharing problem never is well solved in LogSPoTM, and becomes bottleneck of performance many times.The present invention is through having introduced the value pass through mechanism; Only need to increase a BITS registers group and just can alleviate the false performance loss that brings of sharing largely; Further improve system performance; Through the test of 8 benchmark programs, on average, in the configuration of 4 threads, obtained 25.8% performance boost.

In short, the present invention is the method and system that a kind of restricted value is transmitted, and infers that with things storage and thread-level mixture model is the basis.Between thread, clash when relying on, the thread that priority is high can send data to the low thread of priority under situation about satisfying condition, let it stop to wait for, continues executive routine.If do not satisfy the transmission condition, just the same with general hypothetical system, the thread that priority is low can be waited for.In order to guarantee the correctness of program structure, also set up the authentication mechanism of cover transmission data through sending the method for revising data.

The present invention proposes a kind of thread-level that is applied to and infers parallel restricted value transfer device; Comprise on-chip multi-processor, affairs memory function parts, value transferring elements; Also comprise and support to infer the processor core of carrying out; Increase the cache controller of timestamp, increased the L1 data cache that reads and writes the position, the L2cache that has increased the read-write position transmits normal data transmitter register group and the Data Receiving registers group of carrying out with assurance.

The present invention also proposes a kind of device according to claim 1 and carries out restricted value transmission method, may further comprise the steps:

Step 1, system detects data collision, and the thread that priority is high detects the condition that whether satisfies Data transmission, just satisfies and sends data to the low thread of priority, otherwise only send Nack message, and the low thread of priority this moment can only be waited for;

Step 2, the thread that priority is low are received after the data that pass over, and can leave it in the receive data register in, and low priority thread continues executive routine with these data then; As long as this low priority thread does not take place to submit to or rollback, all directly on receive data register, operate when having access to this colliding data piece later on;

Step 3, if the higher transmission thread of priority has rewritten the data block of sending, this high priority thread will send to receiving thread to the part of revising; Receiving thread can be verified, looks at whether the part of revising was used; If used, will carry out rollback operation.If do not use, just carry out the data fusion operation.

Advantage of the present invention and good effect mainly show:

Alleviate TM or TLS system through restricted value transmission method and rely on the contradiction of poor performance when serious, reduce the situation of system performance performance extreme difference effectively at data collision.

Solve the false sharing problem of LogSPoTM effectively, further improve system performance, and expand the range of application of multi-threaded systems such as LogSPoTM.

Hardware complexity is lower, and consistency protocol is revised little, has good portability.Increase description of drawings here

Description of drawings

Fig. 1 .LogSPoTM execution pattern synoptic diagram;

Fig. 2. it is the thread process method after clashing in the original LogSPoTM system that value is delivered in application model figure (a) among the LogSPoTM; Figure (b) is the thread process method of added value pass through mechanism;

Fig. 3. the restricted value in the multi-thread environment is transmitted LogSPoTM model execution pattern, and figure (a) is the disposal route of LogSPoTM under the same data block situation of multithreading request; Figure (b) is the disposal route behind the added value pass through mechanism;

Fig. 4. the hardware that restricted value TRANSFER MODEL is added on original LogSPoTM model basis;

Fig. 5. an implementation that has the false program of sharing at restricted value transmission LogSPoTM model, figure (a) is two and has the false thread code of sharing; Figure (b) is the situation of change that restricted value is transmitted LogSPoTM system registers group when execution graph (a) code.

Embodiment

With having the situation of change that the false usability of program fragments of sharing of data is showed the value of two groups of registers that add through one, the technological concrete course of work in LogSPoTM of restricted value transmissions is described below with this.Accompanying drawing 5 (a) has provided two affairs, and they carry out corresponding separately code.We suppose that the priority of affairs 1 is higher than affairs 2, and each cache is capable to have 4 bytes, that is to say that it is capable that 0 to No. 3 address of address space belongs to same cache.Like this, affairs 1 and affairs 2 just might trigger when collision detection conflicts, though in fact they do not have data and rely on.

Accompanying drawing 5 (b) has provided the execution change procedure of affairs.The data transmitter register group on the left side is from the processor core at affairs 1 place, and the Data Receiving register on the right then comes the processor core of self-operating affairs 2.Execution in step is following:

(1) transmitter register group and receiving register group all are initial values, and affairs 1 are all carried out the affairs sign on affairs 2.

(2) write operation has been carried out in 1 pair of address of affairs 0.Current write operation is directly to cache operation, not other transactions requests data.So the transmitting data register group of affairs 1 does not change.

(3) write operation is carried out in 2 pairs of addresses of affairs 1.Because collision detection mechanism and cache institutional framework, system triggers data collision.This situation is to meet the condition that restricted value is transmitted, so affairs 1 send data (whole cache is capable) for affairs 2.Concrete operations be that affairs 1 copy data the transmitting data register group to from cache, carry out relevant information record.And then send to affairs 2; Affairs 2 are carried out record at receive data register after receiving data; And write down the capable operative position of 2 couples of cache of affairs (read or write all can record, and write operation has been carried out in 2 pairs of addresses of affairs 1 in this example, so second of BITS becomes 1 from 0) in the BITS position.Continue executive routine then.

(4) write operation is carried out in 1 pair of address of affairs 2.Because the cache at 2 places, address is capable identical with the cache that is kept at the transmitter register group; So system is when the cache to affairs 1 carries out write operation; Also to carry out write operation to transmitting data register; And a data division (part of modification, it is capable to be not necessarily complete cache) of revising sends to affairs 2.

(5) affairs 2 receive and revise after the data message, detect the BITS position, and the data oneself that discovery affairs 1 send over were not used, so only need merge the modification data that receive, need not to carry out rollback operation, have avoided false sharing problem effectively.If but oneself had used the data that rigidly connect the modification of receiving, genuine data collision has just taken place in that, at this moment in order to guarantee the correctness of procedure result, just must carry out rollback operation.At this time affairs 2 have executed whole operations, but because the higher also not submission of affairs 1 of priority, so can only wait for.

(6) affairs 1 complete and submission empties the transmitting data register group.Affairs 2 just allow to submit to then, and write back the value of receiving register among the cache, empty the receive data register group at last.

Can find out that through above example the present invention not only gives the advantage that value transmission is provided between the conflict thread, and can also the most false sharing problem of resolution system, bigger performance boost space had.

Claims

1. one kind is applied to the parallel restricted value transfer device of thread-level supposition; Comprise on-chip multi-processor, affairs memory function parts, value transferring elements; It is characterized in that: also comprise and support to infer the processor core of carrying out; Increase the cache controller of timestamp, increased the L1 data cache that reads and writes the position, the L2cache that has increased the read-write position transmits normal data transmitter register group and the Data Receiving registers group of carrying out with assurance.

2. the device according to claim 1 carries out restricted value transmission method, it is characterized in that may further comprise the steps: