CN102681890B

CN102681890B - Restricted value-forwarding method and apparatus for thread-level speculative parallelism

Info

Publication number: CN102681890B
Application number: CN201210133066.9A
Authority: CN
Inventors: 安虹; 邓博斌; 李颀; 李功明; 毛梦捷
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2012-04-28
Filing date: 2012-04-28
Publication date: 2015-09-09
Anticipated expiration: 2032-04-28
Also published as: CN102681890A

Abstract

The present invention proposes a restricted value-forwarding method and apparatus for thread-level speculative parallelism, which reduces total system execution time by forwarding values when conflicts occur. The data needed by a conflicting thread is forwarded only when specific conditions are met; otherwise execution proceeds as in the original system. This is a lightweight value-forwarding method: compared with full value forwarding and value prediction, it has lower hardware and protocol complexity, although in general its performance may fall short of theirs. Experimental data show that, compared with a value prediction model, the performance loss of the restricted value-forwarding model is not large. The apparatus is implemented and verified on the LogSPoTM model, but it is also applicable to other thread-level speculation systems.

Description

A restricted value-forwarding method and apparatus for thread-level speculative parallelism

Technical field

The invention belongs to the field of computer microprocessor architecture design, and in particular relates to a lightweight value-forwarding method and apparatus that can effectively improve the performance of multithreaded systems.

Background technology

Speculative multithreading and transactional memory

With the arrival of the chip multiprocessor (Chip Multi-Processor, CMP) era, two questions have become research topics of shared interest to academia and industry: how to thread serial programs that are traditionally hard to parallelize so that a single program runs faster, and how to supply the growing number of on-chip cores with more tasks that can execute in parallel so that on-chip resources are better utilized.

To exploit more of the thread-level parallelism available on multi-core chips, and to address the programming complexity and performance limits caused by maintaining the correctness of parallel programs, academia has proposed two techniques from different angles: thread-level speculation (Thread-Level Speculation, TLS) and transactional memory (Transactional Memory, TM). TLS aims to break the restriction that inter-thread dependences place on parallel execution and so increase the opportunities for a program to run in parallel. When the compiler or programmer cannot fully determine the dependences between candidate threads, there is no need to fall back on a conservative strategy that gives up parallelism or adds large amounts of redundant synchronization; possible dependences can simply be ignored and the code parallelized directly. Serial semantics are maintained at run time by hardware mechanisms that support speculative execution, which gives TLS the potential to extract as much parallelism from a program as possible. TM aims to replace explicit lock-based synchronization with an implicit synchronization mechanism provided by the runtime system, enabling lock-free shared-memory programming. Because no locks need to be acquired or released, it is also a non-blocking form of synchronization; it avoids correctness problems of locking such as deadlock and priority inversion, and it removes the performance impact of lock granularity. In TM, the system automatically maintains semantic consistency when concurrent operations from multiple threads or execution units modify the state of shared memory structures. What TLS and TM have in common is that both reduce the difficulty of parallel programming and increase the opportunities for parallel thread execution; they also place similar requirements on hardware, namely support for speculative accesses, data buffering, and conflict rollback.

To broaden the applicability of these two techniques, researchers have proposed several hybrid TM and TLS models. Typical examples are TCC from Stanford University, Bulk from the University of Illinois at Urbana-Champaign, and LogSPoTM, proposed at the University of Science and Technology of China. LogSPoTM is built on LogTM and adds support for TLS semantics to the original TM system. The execution model of LogSPoTM is shown in Fig. 1 and is divided into parallel (gray lines) and serial (black lines) parts. In the serial part, to guarantee program correctness, threads must commit in order of priority, from high to low. LogSPoTM is based on an interconnection network (TCC and Bulk are based on a bus interconnect), so it scales better. Its hardware complexity is moderate, which makes it suitable for further extension of system functionality.

Problems with speculative multithreading and transactional memory

As described above, when the proportion of data conflicts and dependences in a program is low, both TLS and TM achieve good performance on multi-core platforms. When a program contains many data conflicts and dependences, however, system performance becomes very poor. The range of programs that suit TLS and TM systems is therefore limited, which is the main reason general-purpose processor manufacturers have not adopted these two techniques in production.

To alleviate this problem, some researchers have proposed value-forwarding methods that accelerate multithreaded programs with many conflicts and dependences, for example dependence-aware transactional memory (DATM). DATM is an aggressive value-forwarding technique: whenever a transaction triggers a data dependence, it receives the required data from the earlier transaction and continues executing. Although DATM achieves good speedups on some benchmarks, it is not very practical. First, DATM is implemented on a bus-based coherence protocol and cannot be applied directly to directory-based coherence protocols, which scale better; doing so would require extensive protocol changes and verification. Second, DATM makes large modifications to the MSI bus coherence protocol: the state machine grows from 3 states to 13, with complex state transitions. Porting this idea in full to a general directory structure would make the state transition diagram even more complex than in the bus-based design, to the point of being essentially unimplementable in a real system. Finally, DATM adds considerable hardware cost to a general transactional memory system, which processor manufacturers are unwilling to accept.

The goal of the present invention is to improve system performance: when a program has severe data conflicts and dependences, restricted value forwarding is used to mitigate the resulting poor performance. Hardware cost is also a key consideration, and the hardware structure is modified as little as possible. In addition, value forwarding can solve the false sharing problem of some hybrid TLS and TM models (such as LogSPoTM), which improves system performance further.

Summary of the invention

The core problem addressed by the present invention is to alleviate the performance problems of TM and TLS at the lowest possible hardware cost. We therefore propose a restricted value-forwarding method and apparatus for thread-level speculative parallelism. It is a compromise between the original system and an aggressive value-forwarding system. On average in our experiments, restricted value forwarding also achieves attractive performance, with relatively low hardware complexity, few changes to the coherence protocol, and good portability. We implemented restricted value forwarding on the LogSPoTM model. We chose LogSPoTM mainly because its hardware complexity is lower than that of other existing hybrid TM and TLS systems and because it is directory-based and therefore scales well. Restricted value forwarding is nevertheless equally applicable to other TM and TLS systems.

LogSPoTM supports both TM and TLS semantics; in this specification we use only TLS as the example. The performance loss caused by data conflicts and dependences can be reduced by adding a value-forwarding mechanism. Fig. 2(a) shows how the original LogSPoTM model handles a conflict: when a conflict occurs, the lower-priority thread can continue only after the higher-priority thread has committed. Fig. 2(b) uses value forwarding: when a data conflict occurs, the higher-priority thread sends the required data to the lower-priority thread, so that the lower-priority successor thread can continue executing and then commit in order, which shortens the total execution time of the system. If the sending thread later modifies the forwarded data, however, it must notify the receiving thread promptly. To guarantee correctness, the receiving thread must then roll back and re-execute.

In view of the shortcomings of aggressive value forwarding (DATM) described above, the present invention proposes a restricted value-forwarding method. Fig. 3(b) shows its execution model. The thread priorities in the figure are T1 > T2 > T3, and the shared data X is held by T1. When T2 requests X from T1, which has not yet committed, T1 forwards the data to T2 instead of sending a NACK that leaves T2 waiting idly, as the original LogSPoTM would; T2 can then continue executing. When T3 also requests X from T1, the restricted value-forwarding system behaves like the original LogSPoTM and has T1 send a NACK to T3, rather than sending X to T3 as an aggressive value-forwarding system would. In other words, the system sends data to a requesting thread only when certain agreed conditions are met; otherwise it executes in the original LogSPoTM manner. Data may be forwarded only when all of the following rules are satisfied:

(1) The data owner thread must not have already sent any data to another thread.

(2) The data owner thread must not have received data sent by any other thread.

(3) The data owner thread may send data only to threads whose priority is lower than its own (if the threads are ordered).
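The three rules can be read as a single predicate evaluated by the data owner when a conflicting request arrives. The following is a minimal sketch in Python, not the hardware logic of the invention. The names `ThreadState`, `may_forward` and `respond` are hypothetical, and the convention that a larger integer means a higher priority is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    """Hypothetical per-thread bookkeeping for the forwarding rules."""
    priority: int               # assumption: larger value = higher priority
    has_sent: bool = False      # thread has already forwarded data (rule 1)
    has_received: bool = False  # thread has already received data (rule 2)


def may_forward(owner: ThreadState, requester: ThreadState) -> bool:
    """Return True only if all three forwarding rules hold."""
    return (not owner.has_sent                        # rule (1)
            and not owner.has_received                # rule (2)
            and requester.priority < owner.priority)  # rule (3)


def respond(owner: ThreadState, requester: ThreadState) -> str:
    """Answer a conflicting request with forwarded data or a NACK."""
    if may_forward(owner, requester):
        owner.has_sent = True
        requester.has_received = True
        return "DATA"
    return "NACK"
```

With priorities T1 > T2 > T3 as in Fig. 3(b), the first request from T2 is answered with data, and a later request from T3 to T1 is answered with a NACK because T1 has already forwarded once.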

As for hardware cost, restricted value forwarding requires only minor modifications to the original LogSPoTM system. We add two register groups to each processor: a data send register group and a data receive register group. Under the forwarding rules above, the two register groups of a processor cannot be in use at the same time. Fig. 4 explains the individual fields. RID and SID record the number of the processor that receives the data and the number of the processor that sends the data, respectively. ADD holds the address of the forwarded data, and DATA holds its value. Note that after the receiving thread has received the data, its modifications to that data are made only in the receive register group and are written back to the cache only when the thread commits. BITS is used to solve the false sharing problem; the number of bits in BITS equals the number of bytes in a cache line, with a one-to-one correspondence. False sharing results from the combination of cache organization and the conflict detection mechanism. In current mainstream computer systems, caches are organized in units of lines, each of which corresponds to multiple contiguous addresses. LogSPoTM also performs conflict detection at cache line granularity. Thus, if two threads access different addresses within the same cache line, the system treats this as a data conflict even though no conflict has actually occurred; this is the so-called false sharing problem. Some hybrid TM and TLS systems (such as TCC) handle false sharing reasonably well, but their methods do not apply to the LogSPoTM model, where false sharing has never been solved satisfactorily and often becomes the performance bottleneck. By introducing value forwarding, the present invention needs only the added BITS register to greatly reduce the performance loss caused by false sharing and further improve system performance. In tests with 8 benchmark programs, an average performance improvement of 25.8% was obtained in a 4-thread configuration.
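The field layout of the two register groups can be pictured with a small software model. This is an illustrative sketch only: the class names, the `touch` helper and the 4-byte `LINE_BYTES` value are assumptions (the line size is borrowed from the embodiment), and real hardware would hold these fields in registers rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LINE_BYTES = 4  # assumption: 4-byte cache line, as in the embodiment


@dataclass
class SendRegisterGroup:
    rid: Optional[int] = None  # RID: number of the processor receiving the data
    add: Optional[int] = None  # ADD: address of the forwarded data
    data: bytearray = field(default_factory=lambda: bytearray(LINE_BYTES))  # DATA


@dataclass
class ReceiveRegisterGroup:
    sid: Optional[int] = None  # SID: number of the processor sending the data
    add: Optional[int] = None  # ADD: address of the forwarded data
    # BITS: one bit per byte of the cache line, set when that byte is used
    bits: List[int] = field(default_factory=lambda: [0] * LINE_BYTES)
    data: bytearray = field(default_factory=lambda: bytearray(LINE_BYTES))  # DATA

    def touch(self, offset: int) -> None:
        """Record a read or write of the byte at `offset` within the line."""
        self.bits[offset] = 1
```

Keeping one bit per byte is what lets the receiver later distinguish a true conflict (the modified byte was used) from false sharing (a different byte of the same line was used).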

In short, the present invention is a restricted value-forwarding method and system based on a hybrid model of transactional memory and thread-level speculation. When a conflict or dependence arises between threads, the higher-priority thread sends data to the lower-priority thread if the conditions are met, so that the latter stops waiting and continues executing. If the forwarding conditions are not met, the lower-priority thread waits, just as in an ordinary speculative system. To guarantee the correctness of program results, a verification mechanism for forwarded data is also established, in which the sender forwards any data it subsequently modifies.

The present invention proposes a restricted value-forwarding apparatus for thread-level speculative parallelism, comprising a chip multiprocessor, transactional memory functional components and value-forwarding components, and further comprising processor cores that support speculative execution, a cache controller extended with timestamps, an L1 data cache extended with read and write bits, an L2 cache extended with read and write bits, and a data send register group and a data receive register group that ensure forwarding executes correctly.

The present invention also proposes a restricted value-forwarding method performed with the apparatus according to claim 1, comprising the following steps:

Step 1: the system detects a data conflict, and the higher-priority thread checks whether the conditions for forwarding data are met. If they are, it sends the data to the lower-priority thread; otherwise it sends only a NACK message, and the lower-priority thread can only wait;

Step 2: after the lower-priority thread receives the forwarded data, it stores the data in the data receive register group and then continues executing the program with that data. As long as this lower-priority thread has not committed or rolled back, every later access to the conflicting data block operates directly on the data receive register group;

Step 3: if the higher-priority sending thread overwrites the data block it has sent, it sends the modified part to the receiving thread. The receiving thread checks whether it has used the modified part. If it has, it performs a rollback; if it has not, it performs a data merge.
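The check in step 3 can be sketched as a function on the receiver side. This is a hedged illustration in Python rather than the patented circuit: `handle_update` is a hypothetical name, `bits` stands for the BITS field, `data` for the DATA field of the receive register group, and `modified` is an assumed representation of the sender's update as a mapping from byte offset to new value.

```python
from typing import Dict, List


def handle_update(bits: List[int], data: bytearray,
                  modified: Dict[int, int]) -> str:
    """Decide between rollback and merge for an update from the sender.

    bits[i] is 1 if the receiver has read or written byte i of the line.
    """
    # A modified byte that the receiver already used is a true conflict.
    if any(bits[offset] for offset in modified):
        return "rollback"
    # Otherwise only untouched bytes changed: merge them into the copy.
    for offset, value in modified.items():
        data[offset] = value
    return "merged"
```

For example, if the receiver has used only byte 1 of the line, an update to byte 2 is merged, while an update to byte 1 forces a rollback and leaves the local copy unchanged.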

The advantages and positive effects of the present invention are mainly the following:

Restricted value forwarding alleviates the poor performance of TM and TLS systems when data conflicts and dependences are severe, and effectively reduces the cases in which system performance is extremely poor.

It effectively solves the false sharing problem of LogSPoTM, further improves system performance, and broadens the applicability of multithreaded systems such as LogSPoTM.

Hardware complexity is low, the coherence protocol requires few modifications, and portability is good.

Brief description of the drawings

Fig. 1. Schematic diagram of the LogSPoTM execution model;

Fig. 2. Application model of value forwarding in LogSPoTM: (a) thread handling after a conflict in the original LogSPoTM system; (b) thread handling with the value-forwarding mechanism added;

Fig. 3. Execution model of restricted value-forwarding LogSPoTM in a multithreaded environment: (a) how LogSPoTM handles multiple threads requesting the same data block; (b) handling after the value-forwarding mechanism is added;

Fig. 4. Hardware added by the restricted value-forwarding model to the original LogSPoTM model;

Fig. 5. Execution of a program with false sharing on the restricted value-forwarding LogSPoTM model: (a) code of two threads that exhibit false sharing; (b) changes in the register groups of the restricted value-forwarding LogSPoTM system while executing the code in (a).

Detailed description of the embodiments

The working process of restricted value forwarding in LogSPoTM is illustrated below with a program fragment that exhibits false data sharing, by showing how the values of the two added register groups change. Fig. 5(a) shows two transactions, each executing its own code. We assume that transaction 1 has higher priority than transaction 2 and that each cache line is 4 bytes, that is, addresses 0 to 3 belong to the same cache line. Transactions 1 and 2 may therefore trigger a conflict during conflict detection even though there is in fact no data dependence between them.
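The mapping assumed in this example, with 4-byte lines so that addresses 0 to 3 share one line, can be written down directly. The sketch below is illustrative only; the function names are hypothetical, and the integer-division mapping from address to line is an assumption consistent with the example rather than a statement about the LogSPoTM implementation.

```python
LINE_BYTES = 4  # cache line size assumed in this example


def cache_line(addr: int) -> int:
    """Index of the cache line that contains a byte address."""
    return addr // LINE_BYTES


def detects_conflict(addr_a: int, addr_b: int) -> bool:
    """Line-granularity detection: same line means a reported conflict."""
    return cache_line(addr_a) == cache_line(addr_b)


def is_false_sharing(addr_a: int, addr_b: int) -> bool:
    """A reported conflict between two different bytes of the same line."""
    return detects_conflict(addr_a, addr_b) and addr_a != addr_b
```

Under these assumptions a write to address 0 and a write to address 1 are reported as a conflict although they touch different bytes, whereas addresses 3 and 4 fall in different lines and are not reported at all.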

Fig. 5(b) shows how execution of the transactions proceeds. The data send register group on the left belongs to the processor core running transaction 1, and the data receive register group on the right belongs to the processor core running transaction 2. The execution steps are as follows:

(1) The send register group and the receive register group both hold their initial values, and transactions 1 and 2 each execute the transaction begin instruction.

(2) Transaction 1 writes to address 0. The write goes directly to the cache, and no other transaction has requested the data, so the data send register group of transaction 1 does not change.

(3) Transaction 2 writes to address 1. Because of the conflict detection mechanism and the cache organization, the system triggers a data conflict. This case meets the conditions for restricted value forwarding, so transaction 1 sends the data (the whole cache line) to transaction 2. Specifically, transaction 1 copies the data from the cache into its data send register group, records the related information, and then sends the data to transaction 2. On receiving the data, transaction 2 records it in its data receive register group and marks in BITS the positions of the cache line that it operates on (both reads and writes are recorded; in this example transaction 2 writes to address 1, so the second bit of BITS changes from 0 to 1). Transaction 2 then continues executing.

(4) Transaction 1 writes to address 2. Because the cache line containing address 2 is the same line held in the send register group, the system writes to the data send register group as well as to the cache of transaction 1, and sends the modified part of the data (only the modified part, not necessarily the complete cache line) to transaction 2.

(5) On receiving the modified data, transaction 2 checks BITS and finds that it has not used the data modified by transaction 1, so it only needs to merge the received modification and does not need to roll back; false sharing is thus effectively avoided. If transaction 2 had used the data that was just modified, a true data conflict would have occurred, and a rollback would be required to guarantee correct program results. At this point transaction 2 has executed all of its operations but cannot yet commit, because the higher-priority transaction 1 has not committed, so it can only wait.

(6) Transaction 1 finishes executing and commits, then clears its data send register group. Transaction 2 is then allowed to commit; it writes the value of the receive register group back to the cache and finally clears its data receive register group.
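Steps (1) to (6) can be traced with a small software model. The sketch below is a simplification under stated assumptions, not the hardware of the invention: the `Sender` and `Receiver` classes, the `peer` reference and the byte values 10, 11 and 12 are hypothetical, a single 4-byte line starting at address 0 is modeled, and commit ordering is enforced simply by the order of the calls in `run_example`.

```python
LINE_BYTES = 4  # line size assumed in the embodiment


class Receiver:
    """Core running the lower-priority transaction (transaction 2)."""

    def __init__(self, pid):
        self.pid = pid
        self.cache = bytearray(LINE_BYTES)
        self.sid = self.add = self.data = None  # receive register group
        self.bits = [0] * LINE_BYTES            # BITS field

    def on_data(self, sid, add, line):
        # Step (3): record the forwarded line in the receive register group.
        self.sid, self.add = sid, add
        self.data = bytearray(line)
        self.bits = [0] * LINE_BYTES

    def access(self, offset, value=None):
        # Reads and writes both mark BITS and act only on the register copy.
        self.bits[offset] = 1
        if value is not None:
            self.data[offset] = value
        return self.data[offset]

    def on_update(self, modified):
        # Step (5): roll back on a true conflict, otherwise merge.
        if any(self.bits[off] for off in modified):
            return "rollback"
        for off, val in modified.items():
            self.data[off] = val
        return "merged"

    def commit(self):
        # Step (6): write the register copy back, then clear the registers.
        self.cache[:] = self.data
        self.sid = self.add = self.data = None
        self.bits = [0] * LINE_BYTES


class Sender:
    """Core running the higher-priority transaction (transaction 1)."""

    def __init__(self, pid):
        self.pid = pid
        self.cache = bytearray(LINE_BYTES)
        self.rid = self.add = self.data = None  # send register group
        self.peer = None                        # hypothetical link to receiver

    def forward(self, add, receiver):
        # Step (3): copy the line into the send registers and send it.
        self.rid, self.add, self.peer = receiver.pid, add, receiver
        self.data = bytearray(self.cache)
        receiver.on_data(self.pid, add, self.data)

    def write(self, offset, value):
        self.cache[offset] = value
        if self.peer is None:       # step (2): nothing forwarded yet
            return None
        self.data[offset] = value   # step (4): update the send registers too
        return self.peer.on_update({offset: value})

    def commit(self):
        self.rid = self.add = self.data = self.peer = None


def run_example():
    t1, t2 = Sender(pid=1), Receiver(pid=2)  # step (1)
    t1.write(0, 10)                          # step (2)
    t1.forward(0, t2)                        # step (3): conflict, line sent
    t2.access(1, 11)                         # step (3): BITS[1] becomes 1
    outcome = t1.write(2, 12)                # steps (4) and (5)
    t1.commit()                              # step (6): sender commits first
    t2.commit()
    return outcome, list(t2.cache)
```

In this trace the update to address 2 is merged because transaction 2 has only touched address 1, so after both commits the cache of transaction 2 holds the bytes written by both transactions.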

As the example above shows, the present invention not only provides the benefit of value forwarding between conflicting threads but also solves most of the system's false sharing problems, leaving considerable room for performance improvement.

Claims

1. A restricted value-forwarding apparatus for thread-level speculative parallelism, comprising a chip multiprocessor, transactional memory functional components and value-forwarding components, characterized in that: it further comprises processor cores that support speculative execution, a cache controller extended with timestamps, an L1 data cache extended with read and write bits, an L2 cache extended with read and write bits, and a data send register group and a data receive register group that ensure forwarding executes correctly; two register groups are added to each processor: the data send register group and the data receive register group; the data send register group consists of three data fields, RID, ADD and DATA; the data receive register group consists of four data fields, SID, ADD, BITS and DATA; the two register groups of a processor cannot be in use at the same time; RID and SID record the number of the processor that receives the data and the number of the processor that sends the data, respectively; ADD holds the address of the forwarded data; DATA holds the value of the data; BITS is a bit register used to solve the false sharing problem, and the number of bits in BITS equals the number of bytes in a cache line, with a one-to-one correspondence; BITS is checked, and when it is found that the data sent by transaction 1 has not been used, only the received modified data needs to be merged and no rollback is required, which effectively avoids the false sharing problem;

The specific execution steps are as follows:

(1) the data send register group and the data receive register group both hold their initial values, and transactions 1 and 2 each execute the transaction begin instruction;

(2) transaction 1 writes to address 0; the write goes directly to the cache, and no other transaction has requested the data, so the data send register group of transaction 1 does not change;

(3) transaction 2 writes to address 1, and because of the conflict detection mechanism and the cache organization, the system triggers a data conflict; this case meets the conditions for restricted value forwarding, so transaction 1 sends the data to transaction 2; specifically, transaction 1 copies the data from the cache into the data send register group, records the related information, and then sends the data to transaction 2; on receiving the data, transaction 2 records it in the data receive register group and marks in BITS the positions of the cache line that it operates on; both reads and writes are recorded, and transaction 2 writes to address 1, so the second bit of BITS changes from 0 to 1; transaction 2 then continues executing;

(4) transaction 1 writes to address 2; because the cache line containing address 2 is the same line held in the data send register group, the system writes to the data send register group as well as to the cache of transaction 1, and sends the modified part of the data to transaction 2;

(5) on receiving the modified data, transaction 2 checks BITS and finds that it has not used the data modified by transaction 1, so it only needs to merge the received modification and does not need to roll back, which effectively avoids the false sharing problem; if transaction 2 had used the data that was just modified, a true data conflict would have occurred, and a rollback would be required to guarantee correct program results; at this point transaction 2 has executed all of its operations but cannot yet commit, because the higher-priority transaction 1 has not committed, so it can only wait;

(6) transaction 1 finishes executing and commits, then clears the data send register group; transaction 2 is then allowed to commit, writes the value of the data receive register group back to the cache, and finally clears the data receive register group.

2. A restricted value-forwarding method performed with the apparatus according to claim 1, characterized by comprising the following steps:

Step 2: after the lower-priority thread receives the forwarded data, it stores the data in the data receive register group and then continues executing the program with that data; as long as this lower-priority thread has not committed or rolled back, every later access to the conflicting data block operates directly on the data receive register group;

Step 3: if the higher-priority sending thread overwrites the data block it has sent, it sends the modified part to the receiving thread; the receiving thread checks whether it has used the modified part; if it has, it performs a rollback; if it has not, it performs a data merge.