WO2012027959A1 - A Multiprocessor System and Its Synchronization Engine Device - Google Patents

A Multiprocessor System and Its Synchronization Engine Device

Info

Publication number
WO2012027959A1
Authority
WO
WIPO (PCT)
Prior art keywords
primitive
reduce
barrier
storage structure
synchronization
Prior art date
Application number
PCT/CN2011/001458
Other languages
English (en)
French (fr)
Inventor
孙凝晖 (Ninghui Sun)
陈飞 (Fei Chen)
曹政 (Zheng Cao)
王凯 (Kai Wang)
安学军 (Xuejun An)
Original Assignee
中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算技术研究所 (Institute of Computing Technology, Chinese Academy of Sciences)
Priority to US13/819,886 (granted as US9411778B2)
Publication of WO2012027959A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F 9/30076 Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F 9/30087 Synchronisation or serialisation instructions
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Definitions

  • the present invention relates to data processing techniques for parallel programs, and more particularly to a multiprocessor system and its synchronization engine device. Background Art
  • FIG. 1 is a schematic diagram of an example of using Barrier synchronization primitives to achieve read and write order in a parallel program.
  • the synchronization primitive Barrier can guarantee that when process P2 performs its read, the value it reads is exactly the value written by process P1.
  • the Lock primitive is generally used in scientific computing to ensure mutually exclusive access to certain resources among multiple processes. Its implementation typically relies on special instructions provided by the processor, such as the typical LL/SC instruction pair.
  • the Reduce primitive can be written as Reduce(Root, Ai, Op, Com), where Root represents the root process of the Reduce operation; Ai represents the source data of process i participating in the Reduce; Op represents the reduction operation, commonly "Plus", "Subtract", "Maximum", "Minimum", etc.; and Com represents the set of processes participating in this Reduce.
  • the meaning of Reduce (Root, Ai, Op, Com) is as follows: The data Ai of each process i in the set Com is calculated using Op mode, and the operation result is returned to Root.
  • implementing the above synchronization primitives in software, as in the prior art, is flexible but inefficient, the inefficiency being mainly manifested as large startup overhead, slow execution, and a large amount of inter-process communication.
  • a software-implemented Reduce algorithm is similar to the above Barrier.
  • the data of each process is also calculated, and the result of the calculation is stored in the variable Value of the shared memory.
  • the data of process 0 is Value0, the data of process 1 is Value1, ..., and the data of process N is ValueN.
  • the root process initializes Value according to the operation type of the Reduce. For example, if the operation type is "maximum", Value is initialized to the minimum value the computer can represent, and then each process n performs the following operations:
  • each process must guarantee the atomicity of the above operations, so that when all processes have determined, via a counter as described for the Barrier, that the calculation is complete, the final content of Value is the maximum over all processes; each process can then read Value, completing a Reduce whose operation type is "maximum".
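  • the software "maximum" Reduce just described can be sketched as follows; this is an illustrative thread-per-process sketch, and the names (SoftwareMaxReduce, contribute, run) are not from the patent. A shared lock provides the atomicity, and a Barrier-style counter tells when all N participants have folded in their data.

```python
import threading

class SoftwareMaxReduce:
    def __init__(self, n_procs):
        self.n = n_procs
        self.value = float("-inf")    # root initializes Value to the minimum representable value
        self.counter = 0              # Barrier-style counter of completed contributions
        self.lock = threading.Lock()  # guarantees atomicity of the fold

    def contribute(self, data):
        with self.lock:
            self.value = max(self.value, data)
            self.counter += 1

    def result(self):
        # spin until all N processes have contributed
        while True:
            with self.lock:
                if self.counter == self.n:
                    return self.value

def run(values):
    r = SoftwareMaxReduce(len(values))
    threads = [threading.Thread(target=r.contribute, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return r.result()
```

As the patent notes, every contribution serializes on the shared lock and the result is obtained by polling, which is exactly the startup-overhead and communication cost the hardware engine is meant to remove.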
  • Disclosure of the Invention: it is an object of the present invention to provide a multiprocessor system and its synchronization engine apparatus that can well support the common synchronization operations in a multiprocessor environment, with the advantages of fast execution, low demand on processor communication bandwidth, and independence from whether or not the processors share a Cache;
  • the synchronization engine device is a hardware-implemented device, and the atomicity required for calculation is easily guaranteed.
  • a multiprocessor synchronization engine apparatus provided for the purpose of the present invention includes: a plurality of storage queues for receiving synchronization primitives sent by a plurality of processors, each queue storing all synchronization primitives from one processor;
  • a plurality of scheduling modules, each of which, after selecting a synchronization primitive for execution from its storage queue, sends it to the corresponding processing module for processing according to the type of the synchronization primitive, the scheduling modules being in one-to-one correspondence with the storage queues;
  • a virtual synchronous storage structure module, which uses a small amount of storage space and, through control logic, maps the direct storage spaces of all processors onto the synchronous storage structure, in order to implement the functions of the various synchronization primitives;
  • a main memory port for communicating with the virtual synchronous storage structure module, reading and writing the direct storage of each processor, and initiating interrupts to the processors;
  • a configuration register for storing the various configuration information required by the processing modules.
  • the process number information is also saved to distinguish the synchronization primitives sent by different processes on the same processor.
  • the processing module includes: a Reduce processing module, a Barrier processing module, a Load/Store processing module, and a Put/Get processing module.
  • the processing module in the synchronization engine device can be extended according to the type of synchronization primitive supported by the synchronization engine device.
  • the synchronous storage structure is virtualized using a small amount of on-chip storage and does not occupy any of the processors' direct storage space.
  • the synchronous storage structure is {Count, P, L, Value}, where {Count, P, L} is called the synchronous storage tag.
  • the bit widths of the tag fields, Count, and Value can be set differently according to system requirements. Value: a storage unit for storing data; L: a Lock flag, used to support the Lock/Unlock primitives; P: a Produce flag, used to implement the Put/Get primitives; Count: a counter used to implement the Barrier primitive, the Reduce primitive, and the multiple modes of the Put/Get primitives.
  • the bit width of the counter is related to the maximum number of parallel processes supported by the synchronization engine device: an n-bit Count can support up to 2^n processes.
  • the virtualization method of the synchronous storage structure is: an on-chip storage is used as a hash table, and the structure of each entry in the hash table is {key value, Tag, Value}. When a processing module writes a synchronous storage structure, it uses the address of the instruction as the key value, uses a hash algorithm to select a row in the hash table as the storage unit, and stores the synchronous storage structure there; when a processing module reads a synchronous storage structure, the same hash algorithm is used to find the entry corresponding to the address.
  • the hash table outputs the content {Tag, Value} of the row found; if the hash algorithm does not find a corresponding entry during a read, the current instruction is suspended.
  • the main memory port can also send an interrupt to the corresponding processor, and a hash table is then constructed in the direct memory of the processor to store synchronous storage structures; wherein {Count, P, L} is called the synchronous storage tag, and Value is the storage unit.
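  • the hash-table virtualization above can be sketched behaviorally as follows; this is an illustrative model, and the modulo hash, table size, and names (SyncStore, write, read, release) are assumptions, not the hardware's actual hash algorithm.

```python
class SyncStore:
    """Each row holds {key value, tag = (Count, P, L), Value}, keyed by a physical address."""

    def __init__(self, rows=256):
        self.rows = rows
        self.table = {}  # row index -> (key, count, p, l, value)

    def _row(self, addr):
        return addr % self.rows  # stand-in for the hardware hash algorithm

    def write(self, addr, count, p, l, value):
        # select a row by hashing the instruction's address and store the structure there
        self.table[self._row(addr)] = (addr, count, p, l, value)

    def read(self, addr):
        # the same hash algorithm locates the entry; a miss suspends the instruction
        entry = self.table.get(self._row(addr))
        if entry is None or entry[0] != addr:
            return None  # no corresponding entry: current instruction is suspended
        return entry[1:]  # (Count, P, L, Value)

    def release(self, addr):
        # free the row once the synchronous storage structure is no longer needed
        self.table.pop(self._row(addr), None)
```

A real implementation would also need a collision/overflow policy (the patent's fallback is a hash table in the processor's direct memory, reached via the main memory port).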
  • a multiprocessor system using the multiprocessor synchronization engine device is also provided for the purpose of the present invention.
  • the system includes: a plurality of processors and a processing chip, wherein the processing chip includes:
  • the synchronization engine device, wherein the storage queues are interconnected with a plurality of device ports; in the device discovery process, each processor finds the device port interconnected with it through a standard device search process and allocates the resources the device port applies for.
  • the synchronization engine device maps its own resources to the operating system of the corresponding processor through the device ports, and the software on the plurality of processors operates the synchronization engine device through this mapping relationship; the synchronization engine device is thus shared by the multiple processors.
  • a method for processing a Barrier primitive by the synchronization engine device of the multiprocessor is also provided, including the following steps:
  • synchronization engine device system initialization (step 110): designate the synchronous storage structures of multiple consecutive addresses as multiple Barrier variables, and maintain a register Barrier_HW_State indicating the completion status of the multiple Barrier variables; each processor applies for a space Barrier_State in its direct memory as its representation of the Barrier completion status;
  • N processes call the Barrier primitive and perform a Barrier operation using a Barrier variable, at the same time each saving the state of the nth bit of the Barrier_State of its processor to its local variable Local_Barrier_State;
  • after receiving a Store instruction for the Barrier variable, the synchronization engine device writes a value to the Barrier variable according to the physical address of the Store, the value being equal to N-1;
  • the Barrier processing module reads the synchronous storage structure of the corresponding address. If the synchronous storage structure does not exist, or the Count read from the synchronous storage structure is equal to 0, a synchronous storage structure is established in which Count equals the Store value; if the Count read from the synchronous storage structure is not equal to 0, the next step 150 is performed;
  • step 150: the Barrier processing module reads the synchronous storage structure of the corresponding address and decrements the Count of the read synchronous storage structure by one;
  • after sending the Store instruction, each process periodically queries the nth bit of the Barrier_State of the processor it belongs to; if it equals Local_Barrier_State, the Barrier has not yet completed and the process queries again later; if it differs, the Barrier has completed and the process exits the query state.
  • Step 110 of the synchronization engine device system initialization includes the following steps:
  • each processor applies for a space Barrier_State in its direct memory as its representation of the Barrier completion state.
  • each processor's Barrier_State is initialized to all 0s, and the Barrier_State is shared among the processes of the corresponding processor, so each process can read it;
  • each processor sends the physical address of the applied-for Barrier_State to the configuration register.
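  • the Barrier flow above can be sketched behaviorally as follows; this is an illustrative model (class and field names are not from the patent): the first Store to the Barrier variable establishes a structure with Count = N-1, each later Store decrements Count, and when Count reaches 0 the engine flips the corresponding Barrier_State bit of every processor, which the polling processes observe.

```python
class BarrierEngine:
    def __init__(self, n_processors):
        self.count = None                        # Count field of the Barrier variable (None = no structure)
        self.barrier_state = [0] * n_processors  # one Barrier_State word per processor

    def store(self, n_minus_1, bit=0):
        if self.count is None or self.count == 0:
            self.count = n_minus_1               # first arrival: establish the structure, Count = N-1
        else:
            self.count -= 1                      # a later arrival: decrement Count
        if self.count == 0:
            self.count = None                    # Barrier complete: release the structure
            for i in range(len(self.barrier_state)):
                self.barrier_state[i] ^= (1 << bit)  # flip bit n of every Barrier_State
```

A process that saved Local_Barrier_State before its Store simply polls until the bit differs from the saved copy.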
  • for the purpose of the present invention, a method for the synchronization engine device of the multiprocessor to process a Reduce primitive is also provided, comprising the following steps:
  • synchronization engine device system initialization: designate the synchronous storage structures of multiple consecutive addresses as multiple Reduce variables, and maintain a register Reduce_HW_State indicating the completion status of the multiple Reduce variables; each processor applies for a space Reduce_State in its direct memory as its representation of the Reduce completion state;
  • N processes call the Reduce primitive and perform a Reduce operation using the Reduce variable Rn; each process saves the state of the nth bit of the Reduce_State of its processor to the local variable Local_Reduce_State;
  • Each process participating in the reduction sends the data structure of the Reduce primitive to the synchronization engine device.
  • after receiving the Reduce data structure, the synchronization engine device writes a value to the Reduce variable Rn according to the physical address of the Reduce, the value being equal to N-1;
  • the Reduce processing module reads the synchronous storage structure of the corresponding address. If the synchronous storage structure does not exist, or the Count read from the synchronous storage structure is equal to 0, a synchronous storage structure is established in which Count equals N-1, and the source data in the Reduce data structure is stored in the Value of the synchronous storage structure; if the Count read from the synchronous storage structure is not equal to 0, the next step 250 is performed;
  • step 250: the Reduce processing module reads the synchronous storage structure of the corresponding address, decrements the corresponding Count by 1, performs the operation between the Value in the read synchronous storage structure and the source data in the Reduce data structure, and stores the result in the Value of the synchronous storage structure;
  • after sending the Reduce data structure, each process periodically queries the value of the nth bit of the Reduce_State of the processor it belongs to. If the queried state equals Local_Reduce_State, the Reduce of the current process has not yet completed, and the process queries again later; if the queried state does not equal Local_Reduce_State, the Reduce has completed and the process exits the query state; where Value is the storage unit and Count is the counter.
  • the synchronization engine device system initialization includes the following steps:
  • each processor applies for a space Reduce_State in its direct memory as its representation of the Reduce completion state, and each processor's Reduce_State is initialized to all 0s;
  • each processor sends the physical address of the applied-for Reduce_State to the configuration register of the synchronization engine device.
  • the data structure of the Reduce primitive is {Reduce variable address, operator, data type, number of processes - 1, source data}.
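  • the Reduce path above can be sketched behaviorally as follows; the operator table and names (ReduceEngine, receive) are illustrative, not the hardware's. The first Reduce data structure establishes Count = N-1 and Value = its source data; each later one decrements Count and folds its source data into Value with the operator; Count reaching 0 marks completion (where the engine would flip Reduce_State bits).

```python
OPS = {"max": max, "min": min, "plus": lambda a, b: a + b}  # illustrative operator set

class ReduceEngine:
    def __init__(self):
        self.count = None   # Count field of the Reduce variable (None = no structure)
        self.value = None   # Value field holding the partial reduction result
        self.done = False   # stands in for flipping the Reduce_State bits

    def receive(self, op, n_minus_1, source):
        if self.count is None or self.count == 0:
            # first arrival: establish the structure with Count = N-1, Value = source data
            self.count, self.value = n_minus_1, source
        else:
            # later arrival: decrement Count and fold the source data into Value
            self.count -= 1
            self.value = OPS[op](self.value, source)
        if self.count == 0:
            self.done = True  # engine would now flip every processor's Reduce_State bit
```

The real data structure also carries a data type and the Reduce variable address, which this sketch omits.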
  • a method for processing a Lock primitive by a synchronization engine device of the multiprocessor includes the following steps:
  • each processor applies for a variable Lock_Result in its direct memory, clears the contents of the variable, and uses the physical address of the variable as the return physical address of the Lock primitive data structure;
  • after receiving the data structure of a Lock primitive sent by a process, the synchronization engine device reads the synchronous storage structure according to the target physical address in it. If no synchronous storage structure is read, or the L bit in the read synchronous storage structure is equal to 0, the physical address has not been locked, the Lock primitive succeeds, and the process proceeds to the next step 330. If the L bit of the read synchronous storage structure is equal to 1, the physical address has already been locked, this attempt to lock is abandoned, and the Lock primitive is placed back into the corresponding storage queue, waiting to be scheduled for execution again;
  • step 330: the L bit in the synchronous storage structure is set to 1 and saved, and, according to the return physical address, 1 is written to the Lock_Result in direct memory;
  • the process periodically queries Lock_Result. If the read Lock_Result is equal to 0, the lock has not yet succeeded, and the query is retried after a delay; if the read Lock_Result is equal to 1, the lock has succeeded, and the process exits the Lock call; where L is the Lock flag.
  • the Lock primitive data structure is {return physical address, target physical address}; the return physical address indicates where in main memory the synchronization engine device stores the success message when locking succeeds; the target physical address indicates which physical address the software wants to lock.
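  • the Lock path can be sketched as follows; names (LockEngine, lock, l_bits) are illustrative. If the target address has no synchronous storage structure or its L bit is 0, the lock succeeds: L is set to 1 and 1 is written to the caller's Lock_Result; otherwise the primitive is put back in the queue for rescheduling.

```python
class LockEngine:
    def __init__(self):
        self.l_bits = {}  # target physical address -> L bit (absent = no structure, i.e. unlocked)
        self.queue = []   # Lock primitives put back to await rescheduling

    def lock(self, return_addr, target_addr, memory):
        if self.l_bits.get(target_addr, 0) == 0:
            # address not locked: set L := 1 and report success at the return physical address
            self.l_bits[target_addr] = 1
            memory[return_addr] = 1   # the polling process sees Lock_Result == 1
        else:
            # address already locked: abandon this attempt and requeue the primitive
            self.queue.append((return_addr, target_addr))
```

The caller's side is the polling loop described above: spin on Lock_Result until it reads 1.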
  • a method for processing an Unlock primitive by a synchronization engine device of the multiprocessor includes the following steps:
  • the process sends the data structure to the synchronization engine device and exits the Unlock call; the synchronization engine device receives the Unlock data structure and, according to the target address, reads the synchronous storage structure from the hash table and clears its L bit. If the synchronous storage structure is all 0s after the L bit is cleared, the synchronous storage structure is released; otherwise, the structure with the L bit cleared is simply written back; where L is the Lock flag.
  • the data structure of the Unlock primitive is {target address}; the only element in the data structure represents the address of the variable that needs to be unlocked.
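  • the Unlock path can be sketched as a single function over a table of {Count, P, L, Value} tuples (an illustrative representation): clear the L bit, release the structure if everything is then zero, otherwise write the cleared structure back.

```python
def unlock(table, target_addr):
    # read the synchronous storage structure for the target address and clear its L bit
    count, p, l, value = table[target_addr]
    l = 0
    if (count, p, l, value) == (0, 0, 0, 0):
        del table[target_addr]                  # all zero after clearing L: release the structure
    else:
        table[target_addr] = (count, p, l, value)  # otherwise just write the cleared structure back
```

Releasing all-zero structures keeps the small on-chip hash table from filling with dead entries.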
  • a method for processing a Put primitive by a synchronization engine device of the multiprocessor includes the following steps:
  • the process sends the data structure of the Put primitive to the synchronization engine device, and exits the Put call;
  • the Put processing module of the synchronization engine device reads the corresponding synchronous storage structure according to the target address in the data structure of the Put primitive; if it does not exist, a new synchronous storage structure is created, and if it does, the existing synchronous storage structure is read;
  • the P bit of the synchronous storage structure read according to the target address in the data structure of the Put primitive is set to 1, and the source data in the received Put primitive is stored in the Value of the synchronous storage structure; where P is the Produce flag bit and Value is the storage unit.
  • the Put primitive data structure is {target address, source data}, wherein the target address represents the physical address at which the source data of the Put primitive is stored, and the source data represents the data content moved by the Put primitive.
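  • the Put path is short enough to sketch as one function over the same illustrative table of {Count, P, L, Value} tuples: read or create the structure for the target address, set P := 1, and place the source data in Value.

```python
def put(table, target_addr, source_data):
    # read the existing structure, or start from an empty {Count, P, L, Value}
    count, p, l, value = table.get(target_addr, (0, 0, 0, 0))
    # set the Produce flag and store the moved data in Value
    table[target_addr] = (count, 1, l, source_data)
```

The P bit is what later lets a matching Get proceed, which is how Put/Get maintains the read/write order without a Barrier.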
  • a method for processing a Get primitive by the synchronization engine device of the multiprocessor includes the following steps:
  • 510. the processor applies for a space Result in its direct memory; Result is used to store the data structure of the Get return value, and the applied-for space is cleared.
  • the Get processing module reads the corresponding synchronous storage structure according to the target address in the Get primitive data structure sent by the process. If there is no corresponding synchronous storage structure, execution of the Get is abandoned and the Get primitive is put back into the storage queue, waiting to be scheduled for execution again; if there is a corresponding synchronous storage structure but its P bit is 0, execution of the Get is likewise abandoned and the Get primitive is put back into the storage queue, waiting to be scheduled for execution again; if there is a corresponding synchronous storage structure in the hash table and its P bit is 1, the next step 530 is performed;
  • the Get primitive data structure is {return physical address, target physical address}.
  • the meaning of each element in the data structure is as follows:
  • the return physical address is the storage address of the Get data and the completion identifier when the Get is successfully executed.
  • the data structure of the Get return value is {return data, completion identifier}, and the return value is stored contiguously.
  • the target physical address represents the physical address of the data that Get is trying to obtain.
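  • the Get path can be sketched as follows, again over an illustrative table of {Count, P, L, Value} tuples: a Get whose target has no structure, or whose P bit is still 0, is put back in the queue; once P is 1 the engine writes {return data, completion identifier} to the return physical address.

```python
def get(table, memory, queue, return_addr, target_addr):
    entry = table.get(target_addr)
    if entry is None or entry[1] == 0:            # no structure, or Produce flag still 0
        queue.append((return_addr, target_addr))  # requeue: wait for a matching Put
        return False
    count, p, l, value = entry
    memory[return_addr] = (value, 1)              # write {return data, completion identifier}
    return True
```

The caller polls the cleared Result space for the completion identifier, exactly as with Lock_Result; because a Get cannot complete before the matching Put sets P, the Put/Get pair maintains the read/write order of FIG. 1 with no Barrier.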
  • the beneficial effects of the present invention are:
  • the synchronization engine device of the present invention uses a unified shared storage structure to support and accelerate basic synchronization primitives such as Barrier, Lock/Unlock, Reduce, and Put/Get, greatly improving the execution speed of synchronization primitives, reducing inter-process communication, and simplifying the interface of the synchronization primitives; it is independent of whether the multiple processors are Cache-coherent and of any special instruction set of the processor, making synchronization primitives more convenient for parallel programs to use, with the characteristics of simple use, wide applicability, and fast execution.
  • FIG. 1 is a schematic diagram showing an example of using a Barrier synchronization primitive to implement a read/write order in a parallel program
  • FIG. 2 is a schematic diagram showing the structure of a multiprocessor synchronization engine apparatus of the present invention
  • FIG. 3 is a schematic structural diagram of a synchronization engine apparatus of a multiprocessor according to another embodiment of the present invention
  • FIG. 4 is a schematic diagram of a synchronous storage structure of the present invention
  • FIG. 5 is a schematic structural diagram of communication between a processing module and a synchronous storage structure in the present invention
  • FIG. 6 is a flow chart showing steps of supporting a barrier primitive in a synchronous storage structure in the synchronization engine of the present invention
  • FIG. 7 is a flow chart showing the steps of the initialization of the synchronization engine system in the support of the Barrier primitive in the present invention
  • FIG. 8 is a flow chart showing the steps of the synchronous storage structure supporting the Reduce primitive in the synchronization engine of the present invention
  • Figure 9 is a flow chart showing the steps of the synchronization engine system initialization in the support of the Reduce primitive in the present invention.
  • Figure 10 is a flow chart showing the steps of the synchronization storage structure supporting the Lock primitive in the synchronization engine of the present invention
  • FIG. 11 is a flow chart showing the steps of the synchronization storage structure supporting the Put primitive in the synchronization engine of the present invention.
  • Figure 12 is a flow chart showing the steps of the synchronization storage structure supporting the Get primitive in the synchronization engine of the present invention
  • FIG. 13 is a block diagram showing the structure of a multiprocessor system of the present invention. Best Mode for Carrying Out the Invention
  • the multiprocessor system and its synchronization engine of the invention change the practice of implementing synchronization primitives in software, and use a unified shared storage structure in a hardware device to support and accelerate basic synchronization primitives such as Barrier, Lock/Unlock, Reduce, and Put/Get. The approach depends neither on the multiprocessor Cache nor on any special instruction set of the processor, and has the characteristics of simple use, wide applicability, and fast execution.
  • FIG. 2 is a schematic structural diagram of a multiprocessor synchronization engine according to the present invention. As shown in FIG. 2, the synchronization engine includes:
  • the synchronization primitives sent by processors 1 through n are stored in queues 1 through n when sent to the synchronization engine.
  • the scheduling module is used at the exit of the queue to schedule synchronization primitives from different processes in the same queue to ensure that synchronization primitives from different processes do not block each other.
  • a plurality of scheduling modules 2, each configured to select a synchronization primitive for execution from its storage queue and send it to the corresponding processing module for processing according to the type of the synchronization primitive, the scheduling modules being in one-to-one correspondence with the storage queues;
  • a plurality of processing modules 3 configured to receive synchronization primitives sent by the scheduling module, and perform different functions;
  • the processing modules are, as shown in FIG. 2, a Reduce processing module, a Barrier processing module, a Load/Store processing module, and a Put/Get processing module. If the synchronization engine supports more types of synchronization primitives, more processing modules can be added here.
  • the processing module may also have a form as shown in FIG. 3, where the processing module includes a Scatter processing module, a calculation processing module, a Load/Store processing module, and Gather processing module.
  • the Gather processing module is used to implement a many-to-one aggregation communication mode, aggregating the messages of multiple source ports and sending them to a specific destination port; it serves the source-data collection phase of the Barrier and Reduce operations.
  • the Scatter processing module is used to implement a one-to-many aggregation communication mode, distributing a source port's message to multiple destination ports; it serves the result-distribution phase of the Barrier and Reduce operations.
  • the Calculate processing module is used to implement the calculation function; it performs the operations submitted by the aggregation (Gather) processing module, to implement the functions required for Lock and Put/Get.
  • the Load/Store processing module is used to implement access to the synchronous storage structure in Load/Store mode.
  • the Barrier operation can be completed by the cooperation of the Gather processing module and the Scatter processing module.
  • the Reduce operation is completed by the Gather processing module, the Scatter processing module, and the Calculate processing module; the Put/Get operation requires the Calculate module and the Load/Store processing module to complete.
  • the virtual synchronous storage structure module 4 is implemented using RAM as storage plus control logic. The goal is to use a small amount of storage space to map the direct storage space of all processors, through the control logic, onto the synchronous storage structure {Count, P, L, Value}, thereby implementing the various synchronization primitives using this synchronous storage structure.
  • the mapping method implemented by the control logic is described in the following text.
  • the main memory port 5 is used for reading and writing the direct storage of each processor, and initiating an interrupt to the processor.
  • configuration register 6 is used to store configuration information sent by software.
  • the configuration information includes, for example, the interrupt number to be used when an interrupt is sent, and the physical addresses to be written when the synchronization engine writes necessary information to direct storage.
  • the configuration information is held in registers, and each functional module reads the configuration information it requires from here.
  • the data used is the synchronous storage structure {Count, P, L, Value} provided by the virtual synchronous storage structure module; the storage structure used by the synchronization engine is virtualized from on-chip storage and does not occupy any of the processors' direct storage space.
  • a storage queue 1 stores all synchronization primitives from one processor; when a synchronization primitive is stored in the corresponding storage queue 1, the process number information is saved at the same time, to distinguish synchronization primitives sent by different processes on the same processor;
  • the scheduling module 2 schedules synchronization primitives from different processes in the same storage queue according to the type of the synchronization primitive, to ensure that synchronization primitives from different processes do not block each other;
  • the processing modules 3 respectively implement the support and acceleration of the synchronization primitives in the parallel program by using the hardware device according to the scheduling of the scheduling module 2 and using the data provided by the virtual synchronous storage structure module 4 to execute the corresponding synchronization primitives.
  • the storage structure above is merely a structure for storing data.
  • Store(A, 3556) simply means that the immediate value 3556 is written to address A, after which the value 3556 is stored in the space at address A.
  • after the Store instruction is executed, reading the contents of address A with a Load(A) instruction should return 3556.
  • this is the usual sense in which the direct storage of a typical processor serves as memory. However, such a storage structure does not guarantee the execution order of Store and Load; additional methods are needed to ensure that the Load instruction is executed after the Store instruction, so that the data read is 3556. One possible method of guaranteeing the read/write order was given in FIG. 1.
  • the synchronization engine maps the storage structure of the address space into a new storage structure (synchronous storage structure), and FIG. 4 is a schematic diagram of the synchronous storage structure in the present invention.
  • using the synchronous storage structure not only supports and accelerates basic synchronization primitives such as Barrier, Lock/Unlock, and Reduce, but also supports a new synchronization primitive that automatically maintains the read/write order shown in FIG. 1 without using a Barrier.
  • this synchronization primitive, which automatically maintains the read/write order, is called the Put/Get primitive.
  • the Put primitive represents a write operation and Get represents a read operation.
  • the synchronization engine described in this patent can maintain the execution order of Put and Get.
  • the content read by the Get operation must be the content written by the Put operation.
  • the synchronous storage structure adds a tag in front of the ordinary single storage space Value, giving a structure of the form {Count, P, L, Value}.
  • {Count, P, L} is called the tag of the synchronous storage.
  • the bit widths of Count and Value in the synchronous storage structure can be set differently according to system requirements.
  • the bit widths marked in FIG. 4 are only an example: there, every 32 bits of storage space is preceded by 1 bit of L, 1 bit of P, and 8 bits of Count.
  • the meaning of each element in the synchronous storage structure is as follows:
  • Value: a storage unit that stores data.
  • a general-purpose processor can write content to it using ordinary Store instructions, and can read its contents using ordinary Load instructions.
  • L: Lock flag. This flag is automatically set to 1 by the synchronization engine when a Lock instruction executes successfully; an L of 1 indicates that the storage unit is locked, and other Lock primitives on this storage unit cannot succeed. When an Unlock instruction executes successfully, the synchronization engine automatically clears this flag to 0; an L of 0 indicates that the unit is not locked, and the next Lock primitive can succeed. This flag supports the Lock/Unlock primitives.
  • P: Produce flag.
  • this flag is set to 1 by the synchronization engine when the corresponding condition is met; a P of 1 indicates that the data of this storage unit may be obtained by a Get instruction.
  • the synchronization engine automatically clears this flag to 0 when the corresponding condition is met. This flag implements Put/Get and guarantees read/write order.
  • Count: a counter, used to implement the Barrier and Reduce primitives, as well as several modes of the Put/Get primitives.
  • the bit width of the counter is related to the maximum number of parallel processes supported by the synchronization engine: an n-bit Count can support up to 2^n processes.
  • the Value element in the synchronous storage structure is a standard storage unit of the current computer, occupying a real storage space.
  • the tags in the synchronous storage structure are generated by the synchronization engine mapping, exist only inside the synchronization engine, and do not occupy the storage space of the processor.
  • the tag of the synchronous storage structure cannot be accessed by the processor's Load/Store instruction and must be indirectly accessed through various synchronization primitives supported by the synchronization engine.
  • the address of a synchronous storage structure equals the address of its Value before the tag mapping.
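The {Count, P, L, Value} structure described above can be sketched in C. This is a minimal illustration using the example widths from FIG. 4 (8-bit Count, 1-bit P and L, 32-bit Value); the field names and layout are illustrative assumptions, not the actual hardware layout.

```c
#include <stdint.h>

/* One synchronous storage entry: tag {Count, P, L} plus the Value word.
 * Widths follow the example in the text and are configurable in practice. */
typedef struct {
    uint8_t  count;    /* Count: an n-bit counter supports up to 2^n processes */
    unsigned p : 1;    /* P: Produce flag, orders Put before Get */
    unsigned l : 1;    /* L: Lock flag for Lock/Unlock           */
    uint32_t value;    /* Value: the ordinary storage word       */
} sync_entry_t;

/* A tag of all zeros means the entry carries no pending synchronization
 * state, so its storage may be released. */
static int tag_is_clear(const sync_entry_t *e) {
    return e->count == 0 && e->p == 0 && e->l == 0;
}
```

Only the Value field occupies real processor-visible storage; the tag exists inside the engine and is reached indirectly through the synchronization primitives.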
  • FIG. 5 is a schematic structural diagram of the communication between the processing modules and the synchronous storage structure in the present invention.
  • the synchronization engine virtualizes the synchronous storage structure using a small amount of on-chip storage.
  • the virtualization method is: a block of on-chip storage is used as a hash table.
  • the structure of each entry in the hash table is {key, Tag, Value}.
  • when a processing module writes a synchronous storage structure, for example when the Put/Get module executes a Put instruction, the address of the Put instruction is used as the key, and a hash algorithm selects a row in the hash table as the storage unit for the synchronization structure.
  • when reading, the hash algorithm is likewise used to find the entry corresponding to the address, and the hash table outputs the contents of the found row.
  • if no corresponding entry is found, for example because a Get command reached the synchronization engine before the corresponding Put instruction, the current instruction must be deferred: it is returned to the corresponding storage queue, where it awaits the next scheduled execution.
  • after a synchronization primitive executes, the tag of the synchronous storage structure changes according to the instruction. If the resulting tag equals all 0s, the synchronous storage structure has been completely consumed, and the corresponding storage space in the hash table is released.
  • data stored in the hash table is typically used within a very short period, after which the synchronization engine releases its space, so the probability of hash-table overflow is relatively small and the on-chip storage space required to construct the hash table need not be large.
  • a similar example is the processor's Cache, whose size is much smaller than the size of memory.
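The hash-table virtualization described above can be sketched as follows. This is a minimal direct-mapped illustration; the table size, hash function, and collision/eviction handling are illustrative assumptions, not the patented design.

```c
#include <stdint.h>
#include <string.h>

/* Each row mirrors the {key, Tag, Value} entry structure of the text. */
#define ROWS 256
typedef struct { uint64_t key; uint16_t tag; uint32_t value; int used; } row_t;
static row_t table[ROWS];

/* Illustrative hash over the physical address used as the key. */
static unsigned row_hash(uint64_t addr) { return (unsigned)(addr * 2654435761u) % ROWS; }

/* Write a synchronous storage structure for an address. */
static row_t *ht_put(uint64_t addr, uint16_t tag, uint32_t value) {
    row_t *r = &table[row_hash(addr)];
    r->key = addr; r->tag = tag; r->value = value; r->used = 1;
    return r;
}

/* Read: NULL means "no entry found", telling the engine to return the
 * current primitive to its storage queue and retry later. */
static row_t *ht_get(uint64_t addr) {
    row_t *r = &table[row_hash(addr)];
    return (r->used && r->key == addr) ? r : NULL;
}

/* Release a row once its tag has returned to all zeros. */
static void ht_release(row_t *r) { memset(r, 0, sizeof *r); }
```

A failed lookup corresponds to the deferred-execution case in the text (e.g. a Get arriving before its Put), and releasing a zero-tag row is what keeps the on-chip table small.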
  • when the hash table does overflow, the main memory port shown in FIG. 2 is used to send an interrupt to the corresponding processor; a software interrupt routine then emulates the synchronization engine's processing, constructing a hash table in the processor's directly attached memory.
  • in this way a tag implementing the synchronization function is kept in front of each Value without directly occupying the processor's directly attached storage, which supports the Barrier, Lock/Unlock, Reduce, and Put/Get primitives.
  • FIG. 6 is a flow chart of the steps by which the synchronous storage structure in the synchronization engine of the present invention supports the Barrier primitive, and FIG. 7 is a flow chart of the synchronization engine system initialization in the support of the Barrier primitive. The process includes the following steps:
  • 110. Synchronization engine system initialization (this step only needs to be executed once when the system is powered on):
  • the software designates synchronous storage structures {Count, P, L, Value} at multiple consecutive addresses as multiple Barrier variables, each variable occupying 4 Bytes of space, and writes the designation into the configuration registers of the synchronization engine;
  • each processor allocates a region Barrier_State in its directly attached memory as the completion-state representation of the Barriers.
  • the size of Barrier_State equals the number of Barrier variables divided by 8, in Bytes, and each bit corresponds to the completion state of one Barrier variable.
  • each processor's Barrier_State is initialized to all 0s.
  • Barrier_State is shared by the processes on the corresponding processor and can be read by each process.
  • the completion state of Barrier variable B0 corresponds to bit0 of Barrier_State, and the completion state of B1 corresponds to bit1 of Barrier_State.
  • each processor sends the physical address of its allocated Barrier_State to the configuration registers of the synchronization engine: the Barrier_State address requested by processor 1 is Barrier_Mask1, the Barrier_State address requested by processor 2 is Barrier_Mask2, and the Barrier_State address requested by processor n is Barrier_Maskn.
  • each process participating in the Barrier writes a value to the Barrier variable using an ordinary Store instruction; the value equals the number of processes in the entire multiprocessor system participating in the Barrier minus one. The synchronization engine uses this value to determine whether all processes have reached the synchronization point.
  • if m processes participate, the Store value equals m-1. Regardless of the execution order of the processes, the Store value is fixed at m-1 and is the same for every process.
  • after receiving a Store instruction to a Barrier variable (the designation having been written into the configuration registers in step 111), the synchronization engine interprets the Store instruction as a Barrier primitive according to the physical address of the Store.
  • the Barrier processing module reads the synchronous storage structure at the corresponding address. If the structure does not exist in the hash table shown in FIG. 5, or its Count equals 0, this is the first Barrier primitive to reach the synchronization engine: a synchronous storage structure is created in the hash table with Count equal to the Store value, i.e., m-1. If the Count read is not 0, proceed to step 150;
  • the Barrier processing module reads the synchronous storage structure at the corresponding address and decrements the Count it read by 1;
  • if the Count value equals 0, all processes have reached the synchronization point and one Barrier is complete: the bit corresponding to this Barrier in Barrier_HW_State is inverted, and Barrier_HW_State is broadcast to the Barrier_State locations of the multiple processors; the synchronous storage structure is not released in the hash table. Otherwise, return to step 150;
  • after sending its Store instruction, each process periodically queries the value of bit n of its processor's Barrier_State. If the queried value equals the Local_Barrier_Staten saved in step 120, this process's Barrier is not yet complete, and the process checks again later. If the queried state is not equal to Local_Barrier_Staten, the Barrier is complete; the process exits the query state and exits the Barrier primitive call.
  • step 110 only needs to be performed once at initialization; at all other times, each process's call to the Barrier primitive starts at step 120 and ends at step 160.
  • thus, for software, the Barrier synchronization primitive is equivalent to a single Store(Barrier variable address, m-1).
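The engine-side Barrier count-down described above can be sketched in C. This is a minimal single-variable illustration, assuming the first arriving Store creates the entry with Count = m-1 and each later Store decrements it; names and types are illustrative, not the hardware implementation.

```c
#include <stdint.h>

typedef struct { int valid; uint32_t count; } barrier_var_t;
static uint32_t barrier_hw_state;   /* one completion bit per Barrier variable */

/* Handle one Store of m-1 to a Barrier variable.
 * Returns 1 when this Store completes the barrier. */
static int barrier_store(barrier_var_t *b, int bit, uint32_t m_minus_1) {
    if (!b->valid || b->count == 0) {      /* first process to arrive */
        b->valid = 1;
        b->count = m_minus_1;
        if (m_minus_1 != 0) return 0;      /* still waiting for others */
    } else if (--b->count != 0) {
        return 0;                          /* not everyone is here yet */
    }
    barrier_hw_state ^= 1u << bit;         /* invert + broadcast point */
    b->valid = 0;
    return 1;
}
```

Each waiting process then spots the inverted bit in its local copy of the broadcast state and exits its query loop.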
  • the support and implementation of the Reduce primitive by the synchronous storage structure in the synchronization engine of the present invention is similar to that of the Barrier.
  • the data structure of the Reduce primitive supported by the synchronous storage structure is {Reduce variable address, operator, data type, number of processes - 1, source data}.
  • the meaning of each element in the data structure is as follows:
  • the Reduce variable address indicates the physical address selected for this Reduce operation; the operator indicates what kind of Reduce operation this is, such as "add", "subtract", etc.;
  • the data type indicates the type of the source data participating in the operation, such as "double-precision floating point", "64-bit fixed point", etc.;
  • the number of processes indicates how many processes participate in this Reduce operation;
  • the source data is the data participating in the Reduce.
  • FIG. 8 is a flow chart of the steps by which the synchronous storage structure in the synchronization engine of the present invention supports the Reduce primitive, and FIG. 9 is a flow chart of the synchronization engine system initialization in the support of the Reduce primitive. As shown in FIG. 8 and FIG. 9, the specific method includes the following steps:
  • 210. Synchronization engine system initialization (this step only needs to be executed once when the system is powered on):
  • the software designates synchronous storage structures {Count, P, L, Value} at multiple consecutive addresses as multiple Reduce variables, each variable occupying 8 Bytes of space, and writes the designation into the configuration registers of the synchronization engine.
  • each processor allocates a region Reduce_State in its directly attached memory as the completion-state representation of the Reduces.
  • the size of Reduce_State equals the number of Reduce variables divided by 8, in Bytes.
  • each Reduce variable corresponds to 1 bit of Reduce_State.
  • each processor's Reduce_State is initialized to all 0s.
  • the completion state of Reduce variable R0 corresponds to bit0 of Reduce_State, and the completion state of R1 corresponds to bit1 of Reduce_State.
  • each processor sends the physical address of its requested Reduce_State to the configuration registers of the synchronization engine: the Reduce_State address requested by processor 1 is Reduce_Mask1, the Reduce_State address requested by processor 2 is Reduce_Mask2, and the Reduce_State address requested by processor n is Reduce_Maskn.
  • 220. N processes call the Reduce primitive, performing one Reduce operation with a Reduce variable Rn; each process saves the state of bit n of its processor's Reduce_State into the local variable Local_Reduce_State;
  • 230. Each process participating in the Reduce sends the data structure of the Reduce primitive to the synchronization engine; after receiving the Reduce data structure, the synchronization engine writes a value to the Reduce variable Rn according to the physical address of the Reduce, the value being N-1;
  • 240. After receiving the Reduce data structure, the Reduce processing module reads the synchronous storage structure at the corresponding address. If the structure does not exist in the hash table shown in FIG. 5, or the Count read from it equals 0, this is the first Reduce primitive to arrive at the synchronization engine: a synchronous storage structure is created in the hash table with Count equal to N-1, and the source data in the Reduce data structure is stored in the Value of the synchronous storage structure. If the Count read is not 0, proceed to the next step 250;
  • 250. The Reduce processing module reads the synchronous storage structure at the corresponding address, decrements the corresponding Count by 1, combines the Value read from the synchronous storage structure with the source data in the Reduce data structure, and stores the result in the Value of the synchronous storage structure;
  • 260. If the Count value equals 0, one Reduce is complete: the corresponding bit of Reduce_HW_State is inverted and Reduce_HW_State is broadcast to the Reduce_State locations of the n processors; otherwise, return to step 250;
  • 270. After sending the Reduce data structure, each process periodically queries the value of bit n of its processor's Reduce_State. If the queried state equals the Local_Reduce_Staten saved in step 220, this process's Reduce is not yet complete, and the process queries again later; if the queried state is not equal to Local_Reduce_Staten, the Reduce is complete and the software exits the query state.
  • for example, if the queried value of bit n of the processor's Reduce_State equals 0 and the Local_Reduce_Staten saved by the process also equals 0, the Reduce is not yet complete; if Local_Reduce_Staten is 0 and the queried bit is 1, the Reduce is complete. Since only 1 bit indicates the completion status, its only values are 0 and 1.
  • once the query shows completion, the ordinary Load instruction is used to read the value of the corresponding Reduce variable.
  • step 210 only needs to be performed once at system initialization; at all other times, when the software calls the Reduce primitive, it starts at step 220 and ends at step 260.
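The engine-side Reduce accumulation described above can be sketched in C for the "add, 64-bit fixed point" case: the first arriving contribution initializes {Count = N-1, Value = data}, later ones decrement Count and fold their data in, and Count reaching 0 completes the reduction. The names and the single-operator restriction are illustrative assumptions.

```c
#include <stdint.h>

typedef struct { int valid; uint32_t count; int64_t value; } reduce_var_t;

/* Handle one process's contribution to a Reduce with operator "add".
 * Returns 1, leaving the result in r->value, when the Reduce completes. */
static int reduce_add(reduce_var_t *r, uint32_t n_minus_1, int64_t data) {
    if (!r->valid || r->count == 0) {  /* first contribution to arrive */
        r->valid = 1;
        r->count = n_minus_1;
        r->value = data;
        return n_minus_1 == 0;
    }
    r->value += data;                  /* Op = "add" on the source data */
    return --r->count == 0;
}
```

Because the combining step runs inside the engine, the atomicity that a software Reduce must buy with locks comes for free.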
  • the Lock primitive data structure supported by the synchronous storage structure in the synchronization engine of the present invention is {return physical address, target physical address}, where the meaning of each element is as follows:
  • the return physical address means that when the lock succeeds, the synchronization engine stores the success message at this return physical address in main memory; the target physical address indicates which physical address the software wishes to lock.
  • FIG. 10 is a flow chart of the steps by which the synchronous storage structure in the synchronization engine of the present invention supports the Lock primitive. As shown in FIG. 10, the specific method includes the following steps:
  • 310. Each processor allocates a variable Lock_Result in its directly attached memory, clears its contents, and uses the physical address of this variable as the return physical address of the Lock primitive data structure; the process then sends the data structure of the Lock primitive to the synchronization engine.
  • 320. After receiving the data structure of the Lock primitive, the synchronization engine reads the synchronous storage structure according to the target physical address in it. If no synchronous storage structure is read from the hash table shown in FIG. 5, or the L bit of the structure read equals 0, the physical address has not been locked: the Lock primitive succeeds, and execution proceeds to step 330. If the L bit of the structure read equals 1, the physical address is already locked: this Lock attempt is abandoned, and the Lock primitive is placed back into the corresponding storage queue to await being scheduled for execution again;
  • 330. The L bit of the synchronous storage structure is set to 1 and saved, and, according to the return physical address, 1 is written to Lock_Result in the directly attached memory;
  • 340. The process periodically queries Lock_Result: if the value read equals 0, the lock has not yet succeeded and the query is retried after a delay; if the value read equals 1, the lock has succeeded and the process exits the Lock call.
  • the data structure of the Unlock primitive supported by the synchronous storage structure in the synchronization engine of the present invention is {target address}; the only element in the data structure is the address of the variable to be unlocked.
  • the process only needs to send the data structure to the synchronization engine and can then exit the Unlock call.
  • after receiving the Unlock data structure, the synchronization engine hardware reads the synchronous storage structure from the hash table according to the target address and clears its L bit. If the synchronous storage structure equals all 0s after the L bit is cleared, it is released from the hash table; otherwise, only the structure with the cleared L bit is written back.
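The L-bit protocol above can be sketched as follows. This is a minimal illustration of the Lock/Unlock semantics on one tag: Lock succeeds only when L is 0 (otherwise the primitive would be requeued), and Unlock clears L. Names are illustrative.

```c
/* A tag {Count, P, L}; only the L bit matters for Lock/Unlock. */
typedef struct { unsigned l : 1; unsigned p : 1; unsigned count; } tag_t;

/* Returns 1 if the lock was acquired; 0 means "already locked",
 * i.e. the engine would requeue this Lock primitive. */
static int try_lock(tag_t *t) {
    if (t->l) return 0;
    t->l = 1;
    return 1;
}

/* Unlock simply clears L; the engine then releases the entry
 * if the whole tag is zero, or writes it back otherwise. */
static void tag_unlock(tag_t *t) { t->l = 0; }
```

In the real engine the test-and-set is trivially atomic, since the hardware processes one primitive at a time from the scheduled queues.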
  • the Put primitive data structure supported by the synchronous storage structure in the synchronization engine of the present invention is {target address, source data}, where the meaning of each element in the data structure is as follows:
  • the target address indicates the physical address at which the source data in the Put primitive is stored; the source data is the data content moved by the Put primitive.
  • FIG. 11 is a flow chart of the steps by which the synchronous storage structure in the synchronization engine of the present invention supports the Put primitive. As shown in FIG. 11, the specific method includes the following steps:
  • 410. The process sends the data structure of the Put primitive to the synchronization engine and exits the Put call;
  • 420. The Put processing module of the synchronization engine reads the corresponding synchronous storage structure from the hash table shown in FIG. 5 according to the target address in the data structure of the Put primitive. If it does not exist, a new synchronous storage structure is created; if it exists, the existing synchronous storage structure is read;
  • 430. The P bit of the synchronous storage structure obtained in step 420 is set to 1, and the source data in the received Put primitive is stored in the Value field of the synchronous storage structure.
  • the Get primitive data structure supported by the synchronous storage structure in the synchronization engine of the present invention is {return physical address, target physical address}.
  • the meaning of each element in the data structure is as follows:
  • the return physical address is the address at which the returned Get data and a completion identifier are stored when the Get executes successfully.
  • the data structure of the Get return value is {return data, completion identifier}, and this return value is stored contiguously at the return physical address; the target physical address is the physical address of the data the Get attempts to obtain.
  • FIG. 12 is a flow chart of the steps by which the synchronous storage structure in the synchronization engine of the present invention supports the Get primitive. As shown in FIG. 12, the specific method includes the following steps:
  • 510. The processor allocates a region Get_Result in its directly attached memory space to store the data structure of the Get return value, and clears the allocated space; the process then sends the Get primitive data structure to the synchronization engine;
  • 520. The Get processing module reads the corresponding synchronous storage structure from the hash table shown in FIG. 5 according to the target address in the received Get primitive data structure. If no corresponding synchronous storage structure exists in the hash table, the execution of the Get is abandoned, and the Get primitive is put back into the storage queue to await being scheduled again. If a corresponding synchronous storage structure exists in the hash table but its P bit is 0, the execution of the Get is likewise abandoned and the Get primitive is put back into the storage queue to await being scheduled again. If a corresponding synchronous storage structure exists in the hash table and its P bit is 1, proceed to step 530;
  • 530. The P bit of the synchronous storage structure read is cleared, and, according to the return physical address in the Get primitive data structure, the return value {1, Value} is written into the processor's directly attached storage;
  • 540. If the Tag of the synchronous storage structure equals all zeros after clearing, the corresponding synchronous storage structure entry is released; otherwise the structure with the cleared P bit is written back. Note that the corresponding row may still be in use for other purposes: for example, if its L bit is set to 1, the row is also locked and cannot be released. Only when the Tag equals all zeros is the row's data no longer needed.
  • 550. The completion identifier of the data structure in Get_Result is queried periodically. If the query result is 0, the Get primitive has not yet finished executing, and the query continues after a delay; if the query result is 1, the Get primitive has finished executing, and step 560 is performed;
  • 560. The return data in Get_Result is read, and the Get call exits.
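The P-bit handshake that orders Put before Get can be sketched as follows. This is a minimal single-entry illustration: Put stores the data and sets P; Get succeeds only when P is already 1, otherwise the engine requeues it. Names are illustrative assumptions.

```c
#include <stdint.h>

typedef struct { unsigned p : 1; uint32_t value; } pg_entry_t;

/* Put: store the source data in Value and mark it consumable. */
static void pg_put(pg_entry_t *e, uint32_t data) {
    e->value = data;
    e->p = 1;
}

/* Get: returns 1 and fills *out once the matching Put has happened;
 * returns 0 to signal "requeue and retry" otherwise. */
static int pg_get(pg_entry_t *e, uint32_t *out) {
    if (!e->p) return 0;
    *out = e->value;
    e->p = 0;          /* consume: clear P after a successful Get */
    return 1;
}
```

This is exactly how the engine guarantees that the content read by a Get is the content written by the corresponding Put, without any Barrier in the program.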
  • a multiprocessor system is also provided.
  • FIG. 13 is a schematic structural diagram of the multiprocessor system of the present invention. As shown in FIG. 13, the system 7 includes a plurality of processors 71 and a processing chip 72, wherein:
  • the processing chip 72 includes:
  • the synchronization engine, whose storage queues are interconnected with the plurality of device ports inside the chip.
  • FIG. 13 shows the topology of the synchronization engine in an n-processor environment.
  • n device ports are implemented in one chip for high-speed interconnection with the processors, and each processor is connected to one device port.
  • during device discovery, each processor finds the device port connected to it through the standard device search flow and allocates the various resources the device port requests, mainly address-space resources.
  • the multiple device ports inside the chip and the resources required by the synchronization engine are all requested, via the device ports, from the operating systems running on the processors.
  • when an operating system allocates resources to the corresponding device port, it is actually allocating resources to the synchronization engine.
  • the synchronization engine also maps its own resources into the corresponding operating systems through the device ports.
  • since the synchronization engine maps itself to the operating systems on the n processors through the n device ports, software on the n processors can operate the synchronization engine through these mappings, for example operating its exclusive resources, so the synchronization engine is in effect shared by the n processors.
  • the beneficial effects of the present invention are:
  • the synchronization engine of the present invention uses a unified shared storage structure to support and accelerate basic synchronization primitives such as Barrier, Lock/Unlock, Reduce, and Put/Get, greatly improving the execution speed of synchronization primitives.


Abstract

The present invention discloses a multiprocessor system and its synchronization engine. The synchronization engine comprises: multiple storage queues, where one queue stores all synchronization primitives from one processor; multiple scheduling modules, which, after selecting a synchronization primitive for execution from the storage queues, dispatch it to the corresponding processing module for processing according to the type of the primitive, the scheduling modules corresponding one-to-one to the storage queues; multiple processing modules, which receive the synchronization primitives sent by the scheduling modules and perform different functions; a virtual synchronous storage structure module, which uses a small amount of storage space and, through control logic, maps the directly attached storage space of all processors into synchronous storage structures to implement the functions of the various synchronization primitives; a main memory port, which communicates with the virtual synchronous storage structure module, reads and writes each processor's directly attached storage, and raises interrupts to the processors; and configuration registers, which store the various configuration information needed by the processing modules.

Description

Multiprocessor System and Synchronization Engine Device Thereof

Technical Field

The present invention relates to the field of parallel-program data processing, and in particular to a multiprocessor system and a synchronization engine device thereof.

Background Art
In a parallel program, multiple processes/threads (hereinafter collectively called processes) work in parallel to complete a processing task. During their run, the processes need synchronization primitives to synchronize with one another. Synchronization primitives are key to guaranteeing the correctness of parallel programs. Commonly used primitives with synchronization semantics are Lock and Barrier. FIG. 1 is a schematic diagram of an example of using the Barrier synchronization primitive in a parallel program to enforce read/write order: as shown in FIG. 1, Barrier guarantees that when process P2 reads variable A, the value it reads is the one written by process P1. In scientific computing, the Lock primitive is generally used to guarantee mutually exclusive access to some resource among multiple processes. Its implementation usually relies on special processor instructions, such as the typical LL/SC instructions.
Besides Barrier and Lock, which carry only synchronization semantics, parallel programs commonly use other primitives with implicit synchronization, such as Reduce or All-Reduce. Reduce can be written simply as Reduce(Root, Ai, Op, Com), where Root is the root process of this reduction; Ai is the source data with which process i participates in the Reduce; Op is the operation mode of the Reduce, commonly "add", "subtract", "max", "min", and so on; and Com is the set of processes participating in this Reduce. Reduce(Root, Ai, Op, Com) means: the data Ai of every process i in the set Com is combined using the Op mode, and the result is returned to Root. This Reduce implicitly synchronizes all processes in Com with the Root process: Root can obtain the final result only after every process in Com has reached a certain point in time, and while achieving synchronization, data is also moved between processes.

All-Reduce differs from Reduce only in that the final result is broadcast to all processes in Com rather than only to Root. Below, unless otherwise stated, All-Reduce and Reduce are both referred to as Reduce.
In the prior art, implementing the above synchronization primitives in software is flexible but inefficient: startup overhead is large, execution is slow, and inter-process communication is frequent. For example, a software Barrier can use a counter-like method with a shared-memory counter A, initialized to 0 by the Root process; every process participating in the Barrier executes A = A + 1 and then repeatedly reads A in a loop; when the value of A equals the total number of processes participating in the Barrier, all processes have reached the synchronization point. This software method has two problems. First, when executing A = A + 1, since A is shared and may be operated on by several processes at once, each process must make its operation atomic, using either lock techniques or locking the memory bus, and these operations are time-consuming and hurt processor performance. Second, each process reads the value of A in a loop; since A is allocated in the memory of some processor, in a multiprocessor environment with cache coherence this causes a large amount of coherence control traffic between processors, and without cache coherence the looped reads of A cause many remote Load operations. Either way, much of the multiprocessor communication bandwidth is consumed, degrading system performance.
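The counter-based software barrier criticized above can be sketched in C, using C11 atomics for the required atomicity. This is a minimal illustration of the approach whose costs (the atomic increment and the spinning reads) the hardware engine is designed to remove; names are illustrative.

```c
#include <stdatomic.h>

static atomic_int arrived;   /* the shared counter A, initialized to 0 */

/* Each participating process calls this; it returns only once all
 * nprocs processes have arrived. */
static void sw_barrier(int nprocs) {
    atomic_fetch_add(&arrived, 1);          /* A = A + 1, atomically */
    while (atomic_load(&arrived) < nprocs)  /* spin until all arrive */
        ;                                   /* busy-wait */
}
```

Every spin iteration is a read of a shared location, which is exactly what generates the coherence traffic or remote Loads described above.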
A software implementation of Reduce is similar to the Barrier above: besides determining whether all processes have reached the synchronization point, it must also combine the data of the processes, storing the result in a shared-memory variable Value. Suppose process 0's data is Value0, process 1's data is Value1, ..., and process N's data is ValueN. The Root process initializes Value according to the Reduce operation type; for example, if the operation type is "max", Value is initialized to the smallest value the computer can represent, and each process n then performs:

if (Value_n > Value)
    Value = Value_n;

Likewise, each process must guarantee the atomicity of the above operation, so that once all processes are determined (via the counter A described for Barrier) to have finished computing, the final value of Value is the largest of all processes' data; each process can then read the value of Value, completing a Reduce whose operation type is "max".
Software implementations of Reduce and Lock operations across multiple processors have problems similar to those of Barrier. Although improved algorithms can reduce the drawbacks above, they cannot solve the problem fundamentally; slow execution and consumption of processor execution resources remain.

Disclosure of the Invention

The object of the present invention is to provide a multiprocessor system and a synchronization engine device thereof, which support the various common synchronization operations well in a multiprocessor environment, execute fast, occupy little processor communication bandwidth, and are applicable whether or not the processors are cache coherent; moreover, since the synchronization engine device is implemented in hardware, the atomicity required by the computations is easily guaranteed.
To achieve the object of the present invention, a multiprocessor synchronization engine device is provided, comprising:

multiple storage queues for receiving the synchronization primitives sent by multiple processors, one queue storing all the synchronization primitives from one processor;

multiple scheduling modules for, after selecting a synchronization primitive for execution from the multiple storage queues, dispatching it to the corresponding processing module for processing according to the type of the primitive, the scheduling modules corresponding one-to-one to the storage queues;

multiple processing modules for receiving the synchronization primitives sent by the scheduling modules and performing different functions; a virtual synchronous storage structure module, which uses a small amount of storage space and, through control logic, maps the directly attached storage space of all processors into synchronous storage structures to implement the functions of the various synchronization primitives;

a main memory port for communicating with the virtual synchronous storage structure module, reading and writing each processor's directly attached storage, and raising interrupts to the processors;

configuration registers for storing the various configuration information needed by the processing modules.
When a synchronization primitive is stored into its corresponding storage queue, the process number is saved along with it, to distinguish primitives sent by different processes on the same processor.

The processing modules include: a Reduce processing module, a Barrier processing module, a Load/Store processing module, and a Put/Get processing module.

The processing modules in the synchronization engine device can be extended according to the types of synchronization primitives the device supports.

The synchronous storage structure is virtualized using a small amount of on-chip storage and does not directly occupy space in the processors' directly attached storage.
The synchronous storage structure is {Count, P, L, Value}, where {Count, P, L} is called the tag of the synchronous storage; the bit widths of Count and Value can be set differently according to system requirements. Value is the storage unit that stores data; L is the Lock flag, used to support the Lock/Unlock primitives; P is the Produce flag, used to implement the Put/Get primitives; Count is a counter used to implement the Barrier primitive, the Reduce primitive, and several modes of the Put/Get primitives. The bit width of the counter is related to the maximum number of parallel processes the synchronization engine device supports; an n-bit Count can support up to 2^n processes. The virtualization method of the synchronous storage structure is: a block of on-chip storage is used as a hash table, each entry of which has the structure {key, Tag, Value}. When a processing module writes a synchronous storage structure, it executes the instruction with the instruction's address as the key, uses a hash algorithm to select a row of the hash table as the storage unit, and stores the synchronization structure there. When a processing module reads a synchronous storage structure, the same hash algorithm finds the entry corresponding to the address, and the hash table outputs the {Tag, Value} of that row. If no corresponding entry is found during a read, the current instruction should be deferred: it is returned to the corresponding storage queue to await the next scheduled execution. After a synchronization primitive executes, if the resulting Tag equals all 0s, the synchronous storage structure has been completely consumed and its corresponding storage space in the hash table is released. When the hash table overflows, the main memory port is used to send an interrupt to the corresponding processor, and a hash table is constructed in that processor's directly attached memory to store the synchronous storage structures. Here {Count, P, L} is the tag of the synchronous storage, and Value is the storage unit.
To achieve the object of the present invention there is also provided a multiprocessor system using the described multiprocessor synchronization engine device, the system comprising multiple processors and one processing chip, wherein the processing chip comprises:

multiple device ports for high-speed interconnection with the multiple processors, each processor connected to one device port;

the synchronization engine device, wherein the storage queues are interconnected with the multiple device ports. During device discovery, each processor finds the device port interconnected with it through the standard device search flow and allocates the various resources the device port requests; the synchronization engine device maps its own resources through the device ports into the operating systems of the corresponding processors; software on the multiple processors operates the synchronization engine device through these mappings, and the synchronization engine device is shared by the multiple processors.
To achieve the object of the present invention there is also provided a method by which the multiprocessor synchronization engine device processes the Barrier primitive, the method comprising the following steps:

110. Synchronization engine device system initialization: designate synchronous storage structures at multiple consecutive addresses as multiple Barrier variables, and maintain a register Barrier_HW_State representing the completion states of the multiple Barrier variables; each processor allocates a region Barrier_State in its directly attached memory as the Barrier completion-state representation;

120. N processes call the Barrier primitive, performing one Barrier operation with the Barrier variable, and each saves the state of bit n of its processor's Barrier_State into the process-local variable Local_Barrier_State;

130. Upon receiving a Store instruction to the Barrier variable, the synchronization engine device writes a value to the Barrier variable according to the physical address of the Store, the value being N-1;

140. The Barrier processing module reads the synchronous storage structure at the corresponding address; if this synchronous storage structure does not exist, or the Count read from it equals 0, a synchronous storage structure is created whose Count equals the Store value; if the Count read is not 0, proceed to the next step 150;

150. The Barrier processing module reads the synchronous storage structure at the corresponding address and decrements the Count of the structure read by 1;

160. Judge whether the Count value equals 0. If so, all processes have reached the synchronization point and one Barrier is complete: invert the corresponding bit of Barrier_HW_State and broadcast Barrier_HW_State to the Barrier_State locations of the multiple processors; otherwise, return to step 150;

170. After sending its Store instruction, each process periodically queries the value of bit n of its processor's Barrier_State. If the queried value equals the value of Local_Barrier_Staten, this process's Barrier is not yet complete and it queries again later; if the queried state is not equal to Local_Barrier_Staten, the Barrier is complete, and the process exits the query state and exits the Barrier primitive call. Here Count is the counter.

Step 110, synchronization engine device system initialization, comprises the following steps:

111. Designate synchronous storage structures at multiple consecutive addresses as multiple Barrier variables and write the designation into the configuration registers;

112. Maintain inside the synchronization engine device a register Barrier_HW_State representing the completion states of the multiple Barrier variables, each bit corresponding to the completion state of one Barrier variable;

113. Each processor allocates a region Barrier_State in its directly attached memory as the Barrier completion-state representation; every processor's Barrier_State is initialized to all 0s; Barrier_State is shared by the processes within the corresponding processor, and each process can read it;

114. Each processor sends the physical address of its allocated Barrier_State to the configuration registers.
To achieve the object of the present invention there is also provided a method by which the multiprocessor synchronization engine device processes the Reduce primitive, the method comprising the following steps:

210. Synchronization engine device system initialization: designate synchronous storage structures at multiple consecutive addresses as multiple Reduce variables, and maintain a register Reduce_HW_State representing the completion states of the multiple Reduce variables; each processor allocates a region Reduce_State in its directly attached memory as the Reduce completion-state representation;

220. N processes call the Reduce primitive, performing one Reduce operation with the Reduce variable Rn; each process saves the state of bit n of its processor's Reduce_State into the local variable Local_Reduce_State;

230. Each process participating in the Reduce sends the data structure of the Reduce primitive to the synchronization engine device; after receiving the Reduce data structure, the synchronization engine device writes a value to the Reduce variable Rn according to the physical address of the Reduce, the value being N-1;

240. The Reduce processing module reads the synchronous storage structure at the corresponding address; if this synchronous storage structure does not exist, or the Count read from it equals 0, a synchronous storage structure is created with Count equal to N-1, and the source data in the Reduce data structure is stored in the Value of the synchronous storage structure; if the Count read is not 0, proceed to the next step 250;

250. The Reduce processing module reads the synchronous storage structure at the corresponding address, decrements the corresponding Count by 1, combines the Value read from the synchronous storage structure with the source data in the Reduce data structure, and stores the result in the Value of the synchronous storage structure;

260. Judge whether the Count value equals 0; if so, one Reduce is complete: invert the corresponding bit of Reduce_HW_State and broadcast Reduce_HW_State to the Reduce_State locations of the n processors; otherwise, return to step 250;

270. After sending the Reduce data structure, each process periodically queries the value of bit n of its processor's Reduce_State; if the queried state equals Local_Reduce_Staten, this process's Reduce is not yet complete and it queries again later; if the queried state is not equal to Local_Reduce_Staten, the Reduce is complete and the process exits the query state. Here Value is the storage unit and Count is the counter.

The synchronization engine device system initialization step comprises the following steps:

211. Designate synchronous storage structures {Count, P, L, Value} at multiple consecutive addresses as multiple Reduce variables, and write the designation into the configuration registers of the synchronization engine device;

212. Maintain inside the synchronization engine device a register Reduce_HW_State representing the completion states of the multiple Reduce variables, each bit corresponding to the completion state of one Reduce variable;

213. Each processor allocates a region Reduce_State in its directly attached memory as the Reduce completion-state representation; every processor's Reduce_State is initialized to all 0s;

214. Each processor sends the physical address of its Reduce_State to the configuration registers of the synchronization engine device.

The data structure of the Reduce primitive is {Reduce variable address, operator, data type, number of processes - 1, source data}.
To achieve the object of the present invention there is also provided a method by which the multiprocessor synchronization engine device processes the Lock primitive, the method comprising the following steps:

310. Each processor allocates a variable Lock_Result in its directly attached memory, clears the contents of this variable, and uses the physical address of this variable as the return physical address of the Lock primitive data structure;

320. After receiving the data structure of the Lock primitive sent by a process, the synchronization engine device reads the synchronous storage structure according to the target physical address in it; if no synchronous storage structure can be read, or the L bit of the structure read equals 0, the physical address has not been locked and the Lock primitive succeeds: go to the next step 330; if the L bit of the structure read equals 1, the physical address is already locked: this Lock attempt is abandoned, and the Lock primitive is placed back into the corresponding storage queue to await being scheduled for execution again;

330. Set the L bit of the synchronous storage structure to 1 and save it, and, according to the return physical address, write 1 to Lock_Result in the directly attached memory;

340. The process periodically queries Lock_Result; if the Lock_Result read equals 0, the lock has not yet succeeded and the query is retried after a delay; if the Lock_Result read equals 1, the lock has succeeded and the process exits the Lock call. Here L is the Lock flag.

The Lock primitive data structure is {return physical address, target physical address}; the return physical address means that when the lock succeeds, the synchronization engine device stores the success message at the return physical address in main memory; the target physical address indicates which physical address the software wishes to lock.

To achieve the object of the present invention there is also provided a method by which the multiprocessor synchronization engine device processes the Unlock primitive, the method comprising the following steps:

the process sends the data structure to the synchronization engine device and exits the Unlock call; after receiving the Unlock data structure, the synchronization engine device reads the synchronous data structure from the hash table according to the target address and clears its L bit; if the synchronous data structure equals all 0s after the L bit is cleared, this synchronous data structure is released; otherwise only the data structure with the cleared L bit is written back. Here L is the Lock flag.

The data structure of the Unlock primitive is {target address}; the only element in the data structure is the address of the variable to be unlocked.
为实现本发明的目的还提供一种所述多处理器的同步引擎装置对于 Put原 语的处理方法, 所述方法, 包括下列步骤:
410.进程将 Put原语的数据结构发送到所述同步引擎装置, 并退出 Put调 用;
420.所述同步引擎装置的 Put处理模块根据 Put原语的数据结构中的目标 地址读取相应的同步存储结构, 如果不存在, 则新建一项同步存储结构, 如果 存在则读取已存在的同步存储结构;
430.把根据 Put原语的数据结构中的目标地址读取的相应的同步存储结构 的 P位置 1, 并把接收到的 Put原语中的源数据存储到所述同步存储结构的 Value位; 其中 P为 Produce标志位; Value为存储单元。
所述 Put原语数据结构为{目标地址, 源数据 }, 其中, 目标地址表示 Put 原语中源数据被存放的物理地址; 源数据表示 Put原语中移动的数据内容。
为实现本发明的目的还提供一种所述多处理器的同步引擎装置对于 Get原语的处理方法, 所述方法, 包括下列步骤:
510.处理器在其直属的内存空间中申请一段空间 Get_Result用于存储 Get返回值的数据结构, 并把申请到的空间清零;
520.所述 Get处理模块根据接收到的进程发送的 Get原语数据结构中的目 标地址, 读取对应的同步存储结构, 如果不存在对应的同步存储结构, 则放弃 Get的执行, 把 Get原语放回所述存储队列中, 等待再次被调度执行; 如果存 在对应的同步存储结构, 但是其中的 P位为 0, 则放弃 Get的执行, 把 Get原 语放回到所述存储队列中,等待再次被调度执行;如果哈希表中存在对应的同 步存储结构, 而且其中的 P位为 1, 则执行下一步骤 530;
530.把读取到的同步存储结构中的 P清零, 并根据 Get原语数据结构中的返回物理地址, 把返回值 {1, Value}(其中 Value为读取到的同步存储结构中的存储内容)写入处理器的直属存储中;
540.如果清零后同步存储结构的 Tag等于全零, 则释放对应的同步存储结构项, 否则把 P位清零的同步存储结构写回;
550.定时查询 Get_Result中数据结构的完成标识, 如果查询结果为 0, 表示 Get原语尚未执行完毕, 则延时一段时间后继续查询; 如果查询结果为 1, 则表示 Get原语执行完毕, 执行下一步骤 560;
560.读取 Get_Result中数据结构的返回数据, 退出 Get调用; 其中 P为 Produce标志位; Value为存储单元; {Count, P, L,}称为同步存储的 Tag。
所述 Get原语数据结构为{返回物理地址, 目标物理地址 }。数据结构中各 个元素的含义如下: 返回物理地址是当 Get成功执行后, 返回 Get的数据以及 完成标识的存放地址, Get返回值的数据结构如 {返回数据, 完成标识}, 此返 回值被连续存放在返回物理地址中;目标物理地址表示 Get试图去获得的数据 的物理地址。
本发明的有益效果是: 本发明的同步引擎装置使用统一的共享存储结构, 对 Barrier、Lock/Unlock、Reduce、Put/Get等基本同步原语进行支持和加速, 大幅度提高同步原语的执行速度, 减少进程间的通信量, 并简化同步原语的接口, 不依赖于多处理器的 Cache一致性, 不依赖于处理器的特殊指令集, 使得并行程序在使用同步原语时更加方便, 具有使用简单, 应用范围广, 执行速度快的特点。 附图简要说明
图 1是并行程序中使用 Barrier同步原语实现读写次序的例子示意图; 图 2是本发明的多处理器的同步引擎装置的结构示意图;
图 3是本发明另一实施例中多处理器的同步引擎装置的结构示意图; 图 4是本发明中同步存储结构的示意图;
图 5是本发明中处理模块与同步存储结构进行通讯的结构示意图; 图 6是本发明的同步引擎中同步存储结构对于 Barrier原语的支持的步骤 流程图;
图 7是本发明中对于 Barrier原语的支持中同步引擎系统初始化的步骤流 程图; 图 8是本发明的同步引擎中同步存储结构对于 Reduce原语的支持的步骤 流程图;
图 9是本发明中对于 Reduce原语的支持中同步引擎系统初始化的步骤流 程图;
图 10是本发明的同步引擎中同步存储结构对于 Lock原语的支持的步骤流 程图;
图 11是本发明的同步引擎中同步存储结构对于 Put原语的支持的步骤流 程图;
图 12是本发明的同步引擎中同步存储结构对于 Get原语的支持的步骤流 程图;
图 13是本发明的多处理器系统的结构示意图。 实现本发明的最佳方式
为了使本发明的目的、技术方案及优点更加清楚明白, 以下结合附图及实 施例,对本发明的一种多处理器系统及其同步引擎进行进一步详细说明。应当 理解, 此处所描述的具体实施例仅仅用以解释本发明, 并不用于限定本发明。
本发明的一种多处理器系统及其同步引擎, 改变使用软件实现同步原语的方式, 使用统一的共享存储结构, 通过硬件装置实现对 Barrier、Lock/Unlock、Reduce、Put/Get等基本同步原语进行支持和加速, 不依赖于多处理器的 Cache一致性, 不依赖于处理器的特殊指令集, 具有使用简单, 应用范围广, 执行速度快的特点。
下面结合上述目标详细介绍本发明的一种多处理器的同步引擎,所述同步 引擎, 图 2是本发明的多处理器的同步引擎的结构示意图, 如图 2所示, 所述 同步引擎, 包括:
多个存储队列 1, 用于接收多个处理器发送的同步原语, 一个队列存储来 自一个处理器的所有同步原语;
处理器 1到 n发送的同步原语在发送到同步引擎时, 分别被存储在队列 1 到 n。
由于一个队列存储来自一个处理器的所有同步原语, 而一个处理器上可能运行多个进程, 因此在同步原语存储到对应的队列里时, 应该同时保存进程号信息, 以区别同一处理器上不同进程发送过来的同步原语。 在队列的出口处使用调度模块对同一个队列里的来自不同进程的同步原语进行调度, 以保证来自不同进程的同步原语不会相互阻塞。
多个调度模块 2, 用于在所述多个存储队列中选定用于执行的同步原语之后, 根据同步原语的类型, 发送到相对应的处理模块进行处理, 所述调度模块与所述存储队列一一对应;
多个处理模块 3,用于接收所述调度模块发来的同步原语,执行不同功能; 所述处理模块如图 2中 Reduce处理模块、 Barrier处理模块、 Load/Store 处理模块, 以及 Put/Get处理模块。如果有同步引擎支持更多类型的同步原语, 可以在这里扩展更多的处理模块。
在另一实施例中, 所述处理模块还可存在如图 3所示的构成形式, 所述处理模块包括分散(Scatter)处理模块、运算(Calculate)处理模块、Load/Store处理模块, 以及聚集(Gather)处理模块。 其中聚集(Gather)处理模块用于实现多对一的集合通信模式, 将多个源端口的消息聚集后, 发送给特定的目的端口, 该模块用于 Barrier和 Reduce操作的收集源数据过程。 分散(Scatter)处理模块用于实现一对多的集合通信模式, 将一个源端口的消息分发至多个目的端口, 该模块用于 Barrier和 Reduce操作的分发结果过程。 运算(Calculate)处理模块用于实现计算功能, 该模块用于聚集处理模块提交的运算, 实现 Lock以及 Put/Get所需的运算功能。 Load/Store处理模块用于实现以 Load/Store方式对同步存储结构的访问。 Barrier操作可以通过聚集(Gather)处理模块和分散(Scatter)处理模块的配合完成, Reduce操作通过聚集(Gather)处理模块、分散(Scatter)处理模块和运算(Calculate)处理模块完成, Put/Get操作需要运算(Calculate)处理模块和 Load/Store处理模块完成。
虚拟同步存储结构模块 4, 使用 RAM作为存储, 以及控制逻辑实现。 目 的在于使用少量的存储空间,通过控制逻辑把所有处理器的直属存储空间都映 射为同步存储结构 { Count, P, L, Value }。从而达到使用这种同步存储结构来实 现各种同步原语的功能。 控制逻辑实现的映射方法在下面的文字中有描述。
主存端口 5, 用于对各处理器的直属存储进行读和写操作, 以及向处理器 发起中断。 配置寄存器 6, 用于存储各种软件发送过来的配置信息。 配置信息包括如 发送中断时中断号是多少, 同步引擎在向直属存储写入必要信息时, 写入的物 理地址是多少等。配置寄存器用寄存器实现存储,各个功能模块需要用到的配 置信息都从这里读取。
在图 2所示的处理模块中,使用到的数据都是由所述虚拟同步存储结构模 块提供的同步存储结构 {Count, P, L, Value} , 这个存储结构是由所述同步引擎 使用少量的片内存储虚拟出来的, 并不直接占用处理器直属存储的空间。
一个存储队列 1存储来自一个处理器的所有同步原语。 在同步原语存储到对应的存储队列 1里时, 同时保存进程号信息, 以区别同一处理器上不同进程发送过来的同步原语; 在存储队列 1的出口处使用调度模块 2, 根据同步原语的类型, 对同一个存储队列里的来自不同进程的同步原语进行调度, 以保证来自不同进程的同步原语不会相互阻塞; 多个处理模块 3分别根据调度模块 2的调度, 并使用由所述虚拟同步存储结构模块 4提供的数据执行相应的同步原语, 实现采用硬件装置对并行程序中的同步原语的支持和加速。
对于普通的地址空间, 存储结构仅仅是存储数据的结构, 如对于 32位地址 32'h0000_8600的地址 A来说, Store (A, 3556)的含义仅仅表示向地址 A写入立即数 3556, 则 3556这个值被存储在地址 A的空间里。 当这个 Store指令被执行完毕后, 使用 Load (A)指令读取地址 A的内容, 则返回的值应该是 3556。 这就是典型的处理器的直属存储, 也就是通常意义上的内存。 但是对于这种存储结构, 并不能保证 Store和 Load的执行次序, 需要额外的方法来保证 Load指令在 Store指令后面执行, 才能保证读出来的数据是 3556。 在图 1中已经给出保证读写次序的一种可行方法。
在本发明中, 所述同步引擎把地址空间的存储结构映射成为一种新的存储结构(同步存储结构), 图 4是本发明中同步存储结构的示意图, 如图 4所示, 使用同步存储结构不仅可以对 Barrier、Lock/Unlock、Reduce等基本同步原语进行支持和加速, 还能够支持一种自动维护如图 1所示的读写次序而无需使用 Barrier的新同步原语。 这种能够自动维护读写次序的同步原语称之为 Put/Get原语。 Put原语代表一种写操作, Get代表一种读操作。 使用本专利所述的同步引擎能够维护 Put和 Get的执行次序, 保证 Get操作读取到的内容一定是 Put操作写入的内容。
所述同步存储结构是在普通的单一存储空间 Value前面附加一个头部, 使之构成 {Count, P, L, Value}这样的同步存储结构。 在同步存储结构中, {Count, P, L,}称之为同步存储的 Tag。 同步存储结构里面的 Count和 Value的位宽可以根据系统需求有不同的设定, 图 4标注的位宽仅仅作为一个例子。 在图 4中, 每 32位的存储空间前面有 1位 L, 1位 P, 8位 Count。 同步存储结构里各个元素的含义如下:
Value: 存储单元, 用于存储数据, 通用处理器可以使用常见的 Store指令 向里面写入内容, 也可以使用常见的 Load指令读取里面存储的内容。
L: Lock标志位, 当 Lock指令成功执行后, 由同步引擎自动把这一标志 位赋值为 1。 L内容为 1表示此存储单元被加锁, 则另外的对这个存储单元的 加锁原语均不能被成功执行。 当 Unlock指令被成功执行后, 由同步引擎自动 把这一标志位赋值为 0。 L内容为 0表示此单元未被加锁, 之后的第一条加锁 原语可以被成功执行。 使用这个标志位可以支持 Lock/Unlock原语。
P: Produce标志位, 当 Put指令成功执行后, 在条件满足时由同步引擎把 这一标志位赋值为 1。 P内容为 1表示此存储单元的数据允许被 Get指令获得, 当 Get指令成功执行后, 在条件满足时, 由同步引擎自动把这一标志位赋值为 0。 使用这个标志位可以实现 Put/Get, 保证读写次序。
Count: 计数器, 可以用于实现 Barrier、Reduce, 以及多种模式的 Put/Get原语。 计数器的位宽和同步引擎支持的最大并行进程有关, n位 Count可以支持最大 2^n个进程。
所述同步存储结构中的 Value元素是当前计算机的标准存储单元, 占据真 实的存储空间。而同步存储结构中的 Tag是由同步引擎映射而产生的, 仅仅在 同步引擎内部存在, 并不占用处理器的存储空间。 同步存储结构的 Tag不能由 处理器的 Load/Store指令访问,必须通过所述同步引擎支持的各种同步原语进 行间接的访问。 同步存储结构的地址等于映射 Tag之前的 Value的地址。
图 5是本发明中处理模块与同步存储结构进行通讯的结构示意图, 如图 5所示, 所述同步引擎使用少量的片内存储虚拟所述同步存储结构的虚拟方法是: 使用一块片内存储作为哈希表, 哈希表中每一项的结构如 {关键值, Tag, Value}。 当处理模块写入一项同步存储结构时, 比如 Put/Get模块执行 Put指令, 则把 Put指令的地址作为关键值, 使用哈希算法在哈希表内选择一行作为存储单元, 把同步结构存储下来。 当处理模块读取一项同步存储结构时, 同样使用哈希算法找到对应于这个地址的项, 哈希表输出找到的那一行的内容 {Tag, Value}。 如果读取过程中使用哈希算法没有找到对应的项, 如 Get指令先于对应的 Put指令被发送到同步引擎, 则说明当前指令应该暂缓执行, 则指令被重新回归图 2所示的存储队列中, 等候下次调度执行。 在同步原语被执行后, 同步存储结构的 Tag会根据指令的不同有所改变, 如果同步存储结构的 Tag的执行结果等于全 0, 则说明这一项同步存储结构已经被完全执行完毕, 则在哈希表中释放对应的存储空间。 由此, 能够使用较小的片内存储空间虚拟出我们所需要的同步存储结构, 而不需要改变各处理器的直属存储结构。
一般说来, 存储在哈希表内的数据在很短暂的时间内就会被使用, 从而同步引擎会释放其占用哈希表中的空间, 因此哈希表溢出的概率是比较小的, 构造哈希表所需要的片内存储空间也不会很大。 一个比较相近的例子可以参考处理器的 Cache, Cache的大小是远远小于内存的大小的。
当哈希表溢出的时候,则使用图 2所示的主存端口向对应的处理器发送中 断, 由软件中断程序模拟同步引擎的过程, 在处理器直属内存中构造哈希表, 来处理这个过程。通过这个构造同步存储的过程,就可以在不直接占用处理器 直属存储的前提下, 为每个 Value前面增加一个实现同步功能的 Tag。 从而支 持 Barrier、 Lock/Unlock、 Reduce以及 Put/Get原语。
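上述用哈希表按需建立、按需释放同步存储结构的过程, 可以用下面的 Python 草图帮助理解(仅为示意性示例: SyncStore 这个类名、用 Python 字典代替硬件哈希算法等均为本示例的假设, 并非专利所述的硬件实现):

```python
class SyncStore:
    """用字典模拟片内哈希表: 地址 -> {Count, P, L, Value}。"""

    def __init__(self):
        self.table = {}

    def read(self, addr):
        # 找不到对应项时返回 None, 表示当前原语应暂缓执行、重新回归存储队列
        return self.table.get(addr)

    def write_back(self, addr, entry):
        # Tag {Count, P, L} 等于全 0 时释放该项, 否则写回哈希表
        if entry["Count"] == 0 and entry["P"] == 0 and entry["L"] == 0:
            self.table.pop(addr, None)
        else:
            self.table[addr] = entry


store = SyncStore()
store.write_back(0x100, {"Count": 0, "P": 0, "L": 1, "Value": 7})   # L 位非 0, 项被保留
store.write_back(0x200, {"Count": 0, "P": 0, "L": 0, "Value": 7})   # Tag 全 0, 项被释放
```

这个草图只体现"Tag 全 0 即释放"的存储管理策略; 真实装置中关键值由哈希算法从地址计算得到, 溢出时还需按上文所述转交软件中断程序处理。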
以下详细说明本发明中的同步存储结构对于 Barrier、Lock/Unlock、Reduce、Put/Get等基本同步原语进行支持和加速的执行步骤。
一. 以 m个进程 (m大于等于 1 )执行 Barrier操作为例, 图 6是本发明的同步引擎中同步存储结构对于 Barrier原语的支持的步骤流程图, 图 7是本发明中对于 Barrier原语的支持中同步引擎系统初始化的步骤流程图, 如图 6和图 7所示, 包括下列步骤:
110.同步引擎系统初始化 (这个步骤只需要在系统上电的时候执行一次)
111.软件指定多个连续地址的同步存储结构 {Count, P, L, Value}作为多个 Barrier变量, 每个变量占用 4Byte空间, 并把指定信息写入到同步引擎的配置 寄存器中;
112.在同步引擎内部维护一个表示多个 Barrier变量完成状态的寄存器 Barrier_HW_State, 每一位对应一个 Barrier变量的完成状态;
113.每个处理器在其直属的内存中申请一段空间 Barrier_State作为 Barrier的完成状态表示, Barrier_State的大小等于 Barrier变量个数除以 8, 单位为 Byte, 每个 Barrier变量的完成状态对应于 Barrier_State的 1 bit。每个处理器的 Barrier_State都被初始化为全 0。 Barrier_State在对应的处理器内部的各个进程是共享的, 各个进程都可以读取。
假设有 32个 Barrier变量, 则每个处理器在其直属的内存中申请 4 Byte的 Barrier_State。 Barrier变量 B0的完成状态对应于 Barrier_State的 bit0, Barrier变量 B1的完成状态对应于 Barrier_State的 bit1。
114.每个处理器把申请 Barrier_State的物理地址发送到同步引擎的配置寄存器中, 处理器 1申请到的 Barrier_State地址为 Barrier_Mask1, 处理器 2申请到的 Barrier_State地址为 Barrier_Mask2, ……, 处理器 n申请到的 Barrier_State地址为 Barrier_Maskn。
120.多个进程调用 Barrier原语, 使用某个 Barrier变量 Bn执行一次 Barrier操作, 每个进程保存本处理器的 Barrier_State的第 n位的状态到这个进程的本地变量 Local_Barrier_State;
130.每个参与 Barrier的进程使用普通的 Store指令向所述 Barrier变量写 入某个值, 此值的大小等于整个多处理器系统参与 Barrier的进程个数减 1。此 值将被同步引擎用于计算是否所有进程都达到了同步点。
比如有 m个进程 (m大于等于 1 )使用这个 Barrier变量执行一次 Barrier操作。 则 Store的地址等于这个 Barrier变量所代表的同步存储结构的地址, 而 Store的值等于 m-1。 不管多个进程的执行次序如何, Store的值都固定等于 m-1, 对于各个进程来说 Store的值都是相同的。
140.同步引擎在接收到对 Barrier变量的 Store指令后, 由于步骤 111中软件已经把指定信息写入到配置寄存器中, 根据 Store的物理地址, 这些 Store指令都被同步引擎解释为 Barrier原语。 Barrier处理模块读取对应地址的同步存储结构, 如果在图 5所示的哈希表中不存在这项同步存储结构, 或者读取到同步存储结构中的 Count等于 0, 说明这是第一次到达同步引擎的 Barrier原语, 则在哈希表中建立一项同步存储结构, 其中的 Count等于 Store的值, 也就是 m-1。 如果读取到同步存储结构 Count不等于 0, 则执行步骤 150;
150.所述 Barrier处理模块读取对应地址的同步存储结构, 并把读取到的 同步存储结构的 Count减 1 ;
160.如果在执行步骤 140中的 Barrier原语之后, Count值等于 0, 则说明所有进程都已经到达了同步点, 一次 Barrier已经完成, 则把 Barrier_HW_State对应的位取反, 然后把 Barrier_HW_State广播到多个处理器的 Barrier_State的位置, 同步存储结构在哈希表中不释放; 否则, 返回步骤 150;
170.每个进程在发送 Store指令之后, 定时查询本处理器所属的 Barrier_State的第 n位的值, 如果查询到的值等于步骤 120的 Local_Barrier_Staten, 则表示本进程的 Barrier尚未完成, 则稍后再次查询。 如果查询到的状态不等于 Local_Barrier_Staten, 则表示 Barrier已经完成, 所述进程退出查询状态, 同时退出 Barrier原语调用。 比如查询到本处理器所属的 Barrier_State的第 n位的值等于 0, 而本进程保存的 Local_Barrier_Staten等于 0, 则表示 Barrier尚未完成; 而如果查询到 Local_Barrier_Staten等于 1, 则表示 Barrier已经完成。 由于只使用 1 bit来表示完成状态, 因此它们的取值只有 0和 1两种。 上述步骤中, 除了步骤 110只需要系统初始化一次, 其他时候在各进程调用 Barrier原语的时候, 都是从步骤 120开始执行, 结束于步骤 160。 Barrier的同步原语等效于 Store(Barrier变量地址, m-1)。
二.以 m个进程(m大于等于 1 )执行 Reduce操作为例, 本发明的同步引擎中同步存储结构对于 Reduce的支持和实现与 Barrier的过程类似。 同步存储结构支持的 Reduce原语的数据结构为 {Reduce变量地址, 操作符, 数据类型, 进程数目-1, 源数据}。 数据结构中各个元素的含义如下: Reduce变量地址表示选择的被用来做 Reduce运算的物理地址; 操作符表示这次的 Reduce的运算类型是什么, 比如是"加"、"减"等; 数据类型表示参与运算的源数据的类型是什么, 比如是"双精度浮点"、"64位定点"等; 进程数目表示有多少个进程参与这次的 Reduce运算; 源数据表示参与 Reduce的数据。 图 8是本发明的同步引擎中同步存储结构对于 Reduce原语的支持的步骤流程图, 图 9是本发明中对于 Reduce原语的支持中同步引擎系统初始化的步骤流程图, 如图 8和图 9所示, 具体方法, 包括下列步骤:
210.同步引擎系统初始化 (这个步骤只需要在系统上电的时候执行一次)
211.软件指定多个连续地址的同步存储结构 {Count, P, L, Value}作为多个 Reduce变量, 每个变量占用 8Byte空间, 并把指定信息写入到同步引擎的配 置寄存器中。
212.在同步引擎内部维护一个表示多个 Reduce变量完成状态的寄存器 Reduce_HW_State, 每一位对应一个 Reduce变量的完成状态;
213.每个处理器在其直属的内存中申请一段空间 Reduce_State作为 Reduce的完成状态表示, Reduce_State的大小等于 Reduce变量个数除以 8, 单位为 Byte, 每个 Reduce变量的完成状态对应于 Reduce_State的 1 bit。 每个处理器的 Reduce_State都被初始化为全 0。
假设有 32个 Reduce变量, 则每个处理器在其直属的内存中申请 4 Byte的 Reduce_State。 Reduce变量 R0的完成状态对应于 Reduce_State的 bit0, Reduce变量 R1的完成状态对应于 Reduce_State的 bit1。
214.每个处理器把申请 Reduce_State的物理地址发送到同步引擎的配置寄存器中, 处理器 1申请到的 Reduce_State地址为 Reduce_Mask1, 处理器 2申请到的 Reduce_State地址为 Reduce_Mask2, ……, 处理器 n申请到的 Reduce_State地址为 Reduce_Maskn。
220.某个进程调用 Reduce原语, 使用某个 Reduce变量 Rn执行一次 Reduce操作, 每个进程保存本处理器的 Reduce_State的第 n位的状态到本地变量 Local_Reduce_State;
230.每个参与 Reduce的进程把上述 Reduce原语的数据结构发送到同步引擎, 所述同步引擎在接收到 Reduce数据结构之后, 根据 Reduce的物理地址向所述 Reduce变量 Rn写入值, 值的大小等于 m-1;
240.同步引擎在接收到 Reduce数据结构之后, Reduce处理模块读取对应地址的同步存储结构, 如果在图 5所示的哈希表中不存在这项同步存储结构, 或者读取到同步存储结构中的 Count等于 0, 说明这是第一次到达同步引擎的 Reduce原语, 则在哈希表中建立一项同步存储结构, 其中的 Count等于 m-1, 并把 Reduce数据结构中的源数据存储在同步存储结构中的 Value中。 如果读取到同步存储结构 Count不等于 0, 执行下一步骤 250;
250.把对应的 Count减 1, 并把读取到的同步存储结构中的 Value和 Reduce数据结构中的源数据进行运算, 把结果存放到同步存储结构中的 Value中;
260.如果在执行 Reduce原语之后, Count值等于 0, 则说明一次 Reduce完成, 则把 Reduce_HW_State对应的位取反, 然后把 Reduce_HW_State广播到 n个处理器的 Reduce_State的位置, 同步存储结构在哈希表中不释放; 否则, 返回上一步骤 250;
270.每个进程在发送 Reduce数据结构之后, 定时查询本处理器所属的 Reduce_State的第 n位的值, 如果查询到的状态等于步骤 220的 Local_Reduce_Staten, 则表示本次进程的 Reduce尚未完成, 则稍后再次查询; 如果查询到的状态不等于 Local_Reduce_Staten, 则表示 Reduce已经完成, 软件退出查询状态。 比如查询到本处理器所属的 Reduce_State的第 n位的值等于 0, 而本进程保存的 Local_Reduce_Staten等于 0, 则表示 Reduce尚未完成; 而如果查询到 Local_Reduce_Staten等于 1, 则表示 Reduce已经完成。 由于只使用 1 bit来表示完成状态, 因此它们的取值只有 0和 1两种。 当查询到 Reduce已经完成后, 使用普通 Load指令来读取相应的 Reduce变量的值。
上述步骤中, 除了步骤 210只需要系统初始化一次, 其他时候在软件调用 Reduce原语的时候, 都是从步骤 220开始执行, 结束于步骤 260。
三.以 m个进程(m大于等于 1 )执行 Lock操作为例, 本发明的同步引擎中同步存储结构支持的 Lock原语数据结构为{返回物理地址, 目标物理地址}, 其中各个元素的含义如下: 返回物理地址表示当加锁成功的时候, 同步引擎会把成功消息存放在主存中的返回物理地址中; 目标物理地址表示软件希望对哪个物理地址进行加锁。 图 10是本发明的同步引擎中同步存储结构对于 Lock原语的支持的步骤流程图, 如图 10所示, 具体方法, 包括下列步骤:
310.每个处理器在其直属的内存中申请变量 Lock_Result, 把这个变量的内容清零, 并把这个变量的物理地址作为 Lock原语数据结构的返回物理地址; 进程把 Lock原语的数据结构发送到同步引擎;
320.同步引擎在接收到 Lock原语的数据结构后, 根据其中的目标物理地址读取同步存储结构, 如果从图 5所示的哈希表中读取不到同步存储结构, 或者读取到的同步存储结构中的 L位等于 0, 则说明此物理地址尚未被加锁, 则 Lock原语执行成功, 转至步骤 330; 如果读取到的同步存储结构的 L位等于 1, 则说明此物理地址已经被加锁, 则放弃本次 Lock的执行, 把 Lock原语放入相应的存储队列中, 等候再次调度执行;
330.把同步存储结构中的 L位置 1, 并且保存至哈希表中。 根据 Lock数据结构中的返回物理地址, 向直属的内存的 Lock_Result写入 1;
340.所述进程定时查询 Lock_Result, 如果读取到的 Lock_Result等于 0, 则表示尚未加锁成功, 则延时后再查询; 如果读取到的 Lock_Result等于 1, 则表示加锁成功, 退出 Lock调用。
进程每次调用 Lock时, 都会执行上面 4个步骤。
本发明的同步引擎中同步存储结构支持的 Unlock原语的数据结构为 {目 标地址 }, 数据结构中仅有的一个元素表示需要解锁的变量地址。 进程只需要 把所述数据结构发送到同步引擎即可退出 Unlock调用。 而同步引擎硬件在接 收到 Unlock数据结构后, 根据目标地址从哈希表中读取同步数据结构, 把其 中的 L位清零, 如果同步数据结构在 L位清零后等于全 0, 则在哈希表中释放 这一项同步数据结构, 否则仅仅把 L位清零后的数据结构写回即可。
四. 本发明的同步引擎中同步存储结构支持的 Put原语数据结构为{目标地址, 源数据}, 其中数据结构中的各个元素的含义如下: 目标地址表示 Put原语中源数据被存放的物理地址; 源数据表示 Put原语中移动的数据内容。 图 11是本发明的同步引擎中同步存储结构对于 Put原语的支持的步骤流程图, 如图 11所示, 具体方法, 包括下列步骤:
410.进程将 Put原语的数据结构发送到同步引擎即可退出 Put调用;
420.同步引擎的 Put处理模块根据 Put原语的数据结构中的目标地址在图 5所示的哈希表中读取相应的同步存储结构, 如果不存在, 则新建一项同步存储结构, 如果存在则读取已存在的同步存储结构;
430.把步骤 420所得的同步存储结构的 P位置 1, 以及把接收到的 Put原语中的源数据存储到同步存储结构的 Value位置。
五.本发明的同步引擎中同步存储结构支持的 Get原语数据结构为{返回物理地址, 目标物理地址}。 数据结构中各个元素的含义如下: 返回物理地址是当 Get成功执行后, 返回 Get的数据以及完成标识的存放地址, Get返回值的数据结构如 {返回数据, 完成标识}, 此返回值被连续存放在返回物理地址中; 目标物理地址表示 Get试图去获得的数据的物理地址。 图 12是本发明的同步引擎中同步存储结构对于 Get原语的支持的步骤流程图, 如图 12所示, 具体方法, 包括下列步骤:
510.处理器在其直属的内存空间中申请一段空间 Get_Result用于存储 Get返回值的数据结构, 并把申请到的空间清零; 进程发送 Get原语数据结构到同步引擎;
520.Get处理模块根据接收到的 Get原语数据结构中的目标地址, 读取图 5所示的哈希表中对应的同步存储结构。 如果哈希表中不存在对应的同步存储结构, 则放弃 Get的执行, 把 Get原语放回存储队列中, 等待再次被调度执行。 如果哈希表中存在对应的同步存储结构, 但是其中的 P位为 0, 则放弃 Get的执行, 把 Get原语放回到存储队列中, 等待再次被调度执行; 如果哈希表中存在对应的同步存储结构, 而且其中的 P位为 1, 转步骤 530;
530.把读取到的同步存储结构中的 P清零, 把读取到的同步存储结构中的 Value内容, 根据 Get原语数据结构中的返回物理地址, 把返回值 { 1, Value}写 入处理器的直属存储中;
540.如果清零后同步存储结构的 Tag等于全零, 则释放哈希表中对应的同 步存储结构项, 否则把 P位清零的同步存储结构写回。
虽然 Get把 P位清零了, 但是对应的这一行同步存储结构还可能被用来做其他用途, 比如其中的 L位是置 1的, 表示这一行同步存储结构还被加锁了, 那么就不能释放这行同步存储。 只有 Tag等于全零了, 才表示这行数据无用了。
550.定时查询 Get_Result中数据结构的完成标识, 如果查询结果为 0, 表示 Get原语尚未执行完毕, 则延时一段时间后继续查询; 如果查询结果为 1, 则表示 Get原语执行完毕, 执行步骤 560。
560.读取 Get_Result中数据结构的返回数据, 退出 Get调用。
相应于本发明的一种多处理器的同步引擎, 还提供一种多处理器系统, 图 13是本发明的多处理器系统的结构示意图, 如图 13所示, 所述系统 7, 包括: 多个处理器 71和一个处理芯片 72, 其中:
所述处理芯片 72, 包括:
多个设备端口 721, 用于和所述多个处理器的高速互联, 每个处理器和一 个设备端口连接;
所述同步引擎, 用于与芯片内部的多个设备端口互联。
图 13显示了同步引擎在 n个处理器环境下的拓扑结构。 在一个芯片内实现 n个设备端口用于和处理器的高速互联, 每个处理器和一个设备端口连接。 在设备发现过程中, 每个处理器通过标准设备搜索流程都可以搜索到与之互联的设备端口, 并分配设备端口申请的各种资源, 主要是地址空间资源。 芯片内部的多个设备端口都和同步引擎互联, 同步引擎所需要的资源都由设备端口代为向处理器之上的操作系统申请。 操作系统分配资源给对应的设备端口, 实际上是把资源分配给同步引擎。 同时, 同步引擎也把自身的资源通过设备端口映射到对应的操作系统上面。 由于同步引擎通过 n个设备端口把自身映射到 n个处理器之上的操作系统中, 因此在 n个处理器上的软件都可以通过映射关系, 像操作独占资源那样去操作同步引擎, 因此同步引擎实际上被 n个处理器共享。
本发明的有益效果是: 本发明的同步引擎使用统一的共享存储结构, 对 Barrier、Lock/Unlock、Reduce、Put/Get等基本同步原语进行支持和加速, 大幅度提高同步原语的执行速度, 减少进程间的通信量, 并简化同步原语的接口, 不依赖于多处理器的 Cache一致性, 不依赖于处理器的特殊指令集, 使得并行程序在使用同步原语时更加方便, 具有使用简单, 应用范围广, 执行速度快的特点。
通过结合附图对本发明具体实施例的描述,本发明的其它方面及特征对本 领域的技术人员而言是显而易见的。
以上对本发明的具体实施例进行了描述和说明, 这些实施例应被认为其只 是示例性的, 并不用于对本发明进行限制, 本发明应根据所附的权利要求进行 解释。

Claims

权利要求书
1. 一种多处理器的同步引擎装置, 其特征在于, 所述同步引擎装置, 包 括:
多个存储队列,用于接收多个处理器发送的同步原语,一个队列存储来自 一个处理器的所有同步原语;
多个调度模块, 用于在所述多个存储队列中选定用于执行的同步原语之 后, 根据同步原语的类型, 发送到相对应的处理模块进行处理, 所述调度模块 与所述存储队列一一对应;
多个处理模块, 用于接收所述调度模块发来的同步原语, 执行不同功能; 虚拟同步存储结构模块, 使用少量的存储空间,通过控制逻辑把所有处理 器的直属存储空间都映射为同步存储结构来实现各种同步原语的功能;
主存端口,用于与所述虚拟同步存储结构模块进行通讯,对各处理器的直 属存储进行读和写, 以及向所述处理器发起中断;
配置寄存器, 用于存储所述处理模块需要用到的各种配置信息。
2.根据权利要求 1所述的多处理器的同步引擎装置, 其特征在于, 在同步 原语存储到对应的存储队列里时,将同时保存进程号信息, 以区别同一处理器 上不同进程发送过来的同步原语。
3.根据权利要求 1所述的多处理器的同步引擎装置, 其特征在于, 所述处 理模块, 包括: Reduce处理模块、 Barrier处理模块、 Load/Store处理模块, 以 及 Put/Get处理模块。
4.根据权利要求 1所述的多处理器的同步引擎装置, 其特征在于, 所述同 步引擎装置中的处理模块,能够根据同步引擎装置支持的同步原语的类型进行 扩展。
5.根据权利要求 1所述的多处理器的同步引擎装置, 其特征在于, 所述同 步存储结构是使用少量的片内存储虚拟出来的,并不直接占用处理器直属存储 的空间。
6.根据权利要求 1所述的多处理器的同步引擎装置, 其特征在于, 所述同步存储结构是 {Count, P, L, Value}, {Count, P, L,}称为同步存储的 Tag, Count和 Value的位宽可以根据系统需求进行不同的设定; Value: 存储单元, 用于存储数据; L: Lock标志位, 用于支持 Lock/Unlock原语; P: Produce标志位, 用于实现 Put/Get原语; Count: 计数器, 用于实现 Barrier原语、Reduce原语, 以及多种模式的 Put/Get原语; 计数器的位宽和同步引擎装置支持的最大并行进程有关, n位 Count可以支持最大 2^n个进程。
7.根据权利要求 1所述的多处理器的同步引擎装置, 其特征在于, 所述同步存储结构的虚拟方法是: 使用一块片内存储作为哈希表, 哈希表中每一项的结构为 {关键值, Tag, Value}, 当处理模块写入一项同步存储结构时, 所述处理模块执行指令, 把指令的地址作为关键值, 使用哈希算法在哈希表内选择一行作为存储单元, 把同步结构存储下来; 当处理模块读取一项同步存储结构时, 同样使用哈希算法找到对应于这个地址的项, 哈希表输出找到的那一行的内容 {Tag, Value}; 如果读取过程中使用哈希算法没有找到对应的项, 则说明当前指令应该暂缓执行, 则指令被重新回归相应的存储队列中, 等候下次调度执行; 在同步原语被执行后, 如果同步存储结构的 Tag的执行结果等于全 0, 则说明这一项同步存储结构已经被完全执行完毕, 则在哈希表中释放对应的存储空间; 当所述哈希表溢出的时候, 则使用所述主存端口向对应的处理器发送中断, 在所述处理器直属内存中构造哈希表, 来存储所述同步存储结构; 其中 {Count, P, L,}称为同步存储的 Tag; Value为存储单元。
8.一种采用权利要求 1-7中的一项所述的多处理器的同步引擎装置的多处理器系统, 其特征在于, 所述系统, 包括: 多个处理器和一个处理芯片, 其中: 所述处理芯片, 包括:
多个设备端口,用于和所述多个处理器高速互联,每个处理器和一个设备 端口连接;
所述同步引擎装置, 其中, 所述存储队列与所述多个设备端口互联; 在设备发现过程中,每个处理器通过标准设备搜索流程搜索到与之互联的 设备端口, 并分配设备端口申请的各种资源; 所述同步引擎装置把自身的资源 通过设备端口映射到对应的处理器的操作系统中,在多个处理器上的软件通过 映射关系操作所述同步引擎装置, 所述同步引擎装置被多个处理器共享。
9.根据权利要求 1所述的多处理器的同步引擎装置对于 Barrier原语的处理 方法, 其特征在于, 所述方法, 包括下列步骤:
110.同步引擎装置系统初始化: 指定多个连续地址的同步存储结构作为多个 Barrier变量, 同时维护一个表示多个 Barrier变量完成状态的寄存器 Barrier_HW_State, 每个处理器在其直属的内存中申请一段空间 Barrier_State作为 Barrier的完成状态表示;
120.N个进程调用 Barrier原语, 使用所述 Barrier变量执行一次 Barrier操作, 同时分别保存相应处理器的 Barrier_State的第 n位的状态到所述 N个进程的本地变量 Local_Barrier_State;
130.所述同步引擎装置在接收到对所述 Barrier变量的 Store指令后, 根据 Store的物理地址向所述 Barrier变量写入值, 值的大小等于 N-1;
140.Barrier处理模块读取对应地址的同步存储结构, 如果不存在这项同步 存储结构, 或者读取到同步存储结构中的 Count等于 0, 则建立一项同步存储 结构, 其中的 Count等于 Store的值, 如果读取到同步存储结构 Count不等于 0, 则执行下一步骤 150;
150.所述 Barrier处理模块读取对应地址的同步存储结构, 并把读取到的同步存储结构的 Count减 1;
160.判断 Count值是否等于 0, 若是, 则所有进程都已经到达了同步点, 一次 Barrier已经完成, 把 Barrier_HW_State对应的位取反, 然后把 Barrier_HW_State广播到多个处理器的 Barrier_State的位置; 否则, 返回步骤 150;
170.每个所述进程在发送 Store指令之后, 定时查询本处理器所属的 Barrier_State的第 n位的值, 如果查询到的值等于 Local_Barrier_Staten的值, 则表示本进程的 Barrier尚未完成, 则稍后再次查询; 如果查询到的状态不等于 Local_Barrier_Staten, 则表示 Barrier已经完成, 所述进程退出查询状态, 同时退出 Barrier原语调用; 其中 Count为计数器。
10.根据权利要求 9所述的多处理器的同步引擎装置对于 Barrier原语的处 理方法, 其特征在于, 同步引擎装置系统初始化的步骤 110, 包括下列步骤:
111.指定多个连续地址的同步存储结构作为多个 Barrier变量, 并把指定信息写入到所述配置寄存器中;
112.在所述同步引擎装置内部维护一个表示多个 Barrier变量完成状态的寄存器 Barrier_HW_State, 每一位对应一个 Barrier变量的完成状态;
113.每个处理器在其直属的内存中申请一段空间 Barrier_State作为 Barrier的完成状态表示, 每个处理器的 Barrier_State都被初始化为全 0, Barrier_State在对应的处理器内部的各个进程是共享的, 各个进程都可以读取;
114.每个处理器把申请 Barrier_State的物理地址发送到所述配置寄存器中。
11. 根据权利要求 1所述的多处理器的同步引擎装置对于 Reduce原语的 处理方法, 其特征在于, 所述方法, 包括下列步骤:
210.同步引擎装置系统初始化: 指定多个连续地址的同步存储结构作为多 个 Reduce 变量, 同时维护一个表示多个 Reduce 变量完成状态的寄存器 Reduce_HW_State; 每个处理器在其直属的内存中申请一段空间 Reduce_State 作为 Reduce的完成状态表示;
220.N个进程调用 Reduce原语, 使用所述 Reduce变量 Rn执行一次 Reduce操作, 每个进程保存本处理器的 Reduce_State的第 n位的状态到本地变量 Local_Reduce_State;
230.每个参与 Reduce的进程把上述 Reduce原语的数据结构发送到同步引擎装置; 所述同步引擎装置在接收到 Reduce数据结构之后, 根据 Reduce的物理地址向所述 Reduce变量 Rn写入值, 值的大小等于 N-1;
240.Reduce处理模块读取对应地址的同步存储结构, 如果不存在这项同步存储结构, 或者读取到同步存储结构中的 Count等于 0, 则建立一项同步存储结构, 其中的 Count等于 N-1, 并把 Reduce数据结构中的源数据存储在同步存储结构中的 Value中; 如果读取到同步存储结构 Count不等于 0, 则执行下一步骤 250;
250.Reduce 处理模块读取对应地址的同步存储结构, 并把对应的 Count 减 1,并把读取到的同步存储结构中的 Value和 Reduce数据结构中的源数据进 行运算, 把结果存放到同步存储结构中的 Value中;
260.判断 Count值是否等于 0, 若是, 则说明一次 Reduce完成, 则把 Reduce_HW_State对应的位取反, 然后把 Reduce_HW_State广播到 n个处理器的 Reduce_State的位置; 否则, 返回步骤 250;
270.每个进程在发送 Reduce数据结构之后, 定时查询本处理器所属的 Reduce_State的第 n位的值, 如果查询到的状态等于 Local_Reduce_Staten, 则表示本次进程的 Reduce尚未完成, 则稍后再次查询; 如果查询到的状态不等于 Local_Reduce_Staten, 则表示 Reduce已经完成, 退出查询状态; 其中 Value为存储单元; Count为计数器。
12.根据权利要求 11所述的多处理器的同步引擎装置对于 Reduce原语的 处理方法, 其特征在于, 同步引擎装置系统初始化的步骤, 包括下列步骤:
211.指定多个连续地址的同步存储结构 {Count, P, L, Value}作为多个 Reduce变量, 并把指定信息写入到同步引擎装置的配置寄存器中;
212.在同步引擎装置内部维护一个表示多个 Reduce变量完成状态的寄存器 Reduce_HW_State, 每一位对应一个 Reduce变量的完成状态;
213.每个处理器在其直属的内存中申请一段空间 Reduce_State作为 Reduce的完成状态表示, 每个处理器的 Reduce_State都被初始化为全 0;
214.每个处理器把申请 Reduce_State的物理地址发送到同步引擎装置的配置寄存器中。
13.根据权利要求 11所述的多处理器的同步引擎装置对于 Reduce原语的 处理方法, 其特征在于, 所述 Reduce原语的数据结构为 {Reduce变量地址, 操作符, 数据类型, 进程数目 -1, 源数据 }。
14.根据权利要求 1所述的多处理器的同步引擎装置对于 Lock原语的处理 方法, 其特征在于, 所述方法, 包括下列步骤:
310.每个处理器在其直属的内存中申请变量 Lock_Result, 把这个变量的内容清零, 并把这个变量的物理地址作为 Lock原语数据结构的返回物理地址;
320.所述同步引擎装置在接收到进程发送的 Lock原语的数据结构后, 根据其中的目标物理地址读取同步存储结构, 如果读取不到同步存储结构, 或者读取到的同步存储结构中的 L位等于 0, 则说明此物理地址尚未被加锁, 则 Lock原语执行成功, 转至下一步骤 330; 如果读取到的同步存储结构的 L位等于 1, 则说明此物理地址已经被加锁, 则放弃本次 Lock的执行, 把 Lock原语放入相应的存储队列中, 等候再次调度执行;
330.把同步存储结构中的 L位置 1并且保存, 根据所述返回物理地址, 向直属的内存的 Lock_Result写入 1;
340.所述进程定时查询 Lock_Result, 如果读取到的 Lock_Result等于 0, 则表示尚未加锁成功, 则延时后再查询; 如果读取到的 Lock_Result等于 1, 则表示加锁成功, 退出 Lock调用; 其中 L为 Lock标志位。
15.根据权利要求 14所述的多处理器的同步引擎装置对于 Lock原语的处 理方法, 其特征在于, 所述 Lock原语数据结构为{返回物理地址, 目标物理地 址} , 返回物理地址表示当加锁成功的时候, 同步引擎装置会把成功消息存放 在主存中的返回物理地址中; 目标物理地址表示软件希望对哪个物理地址进行 加锁。
16.根据权利要求 1所述的多处理器的同步引擎装置对于 Unlock原语的处 理方法, 其特征在于, 所述方法, 包括下列步骤:
进程把数据结构发送到所述同步引擎装置, 并退出 Unlock调用; 所述同步引擎装置在接收到 Unlock数据结构后, 根据目标地址从哈希表 中读取同步数据结构, 把其中的 L位清零, 如果同步数据结构在 L位清零后 等于全 0, 则释放这一项同步数据结构, 否则仅仅把 L位清零后的数据结构写 回; 其中 L为 Lock标志位。
17.根据权利要求 16所述的多处理器的同步引擎装置对于 Unlock原语的 处理方法, 其特征在于, 所述 Unlock原语的数据结构为 {目标地址}, 数据结 构中仅有的一个元素表示需要解锁的变量地址。
18.根据权利要求 1所述的多处理器的同步引擎装置对于 Put原语的处理方 法, 其特征在于, 所述方法, 包括下列步骤:
410.进程将 Put原语的数据结构发送到所述同步引擎装置, 并退出 Put调 用;
420.所述同步引擎装置的 Put处理模块根据 Put原语的数据结构中的目标 地址读取相应的同步存储结构, 如果不存在, 则新建一项同步存储结构, 如果 存在则读取已存在的同步存储结构;
430.把根据 Put原语的数据结构中的目标地址读取的相应的同步存储结构 的 P位置 1, 并把接收到的 Put原语中的源数据存储到所述同步存储结构的
Value位; 其中 P为 Produce标志位; Value为存储单元。
19.根据权利要求 18所述的多处理器的同步引擎装置对于 Put原语的处理 方法, 其特征在于, 所述 Put原语数据结构为{目标地址, 源数据 }, 其中, 目 标地址表示 Put原语中源数据被存放的物理地址; 源数据表示 Put原语中移动 的数据内容。
20.根据权利要求 1所述的多处理器的同步引擎装置对于 Get原语的处理 方法, 其特征在于, 所述方法, 包括下列步骤:
510.处理器在其直属的内存空间中申请一段空间 Get_Result用于存储 Get返回值的数据结构, 并把申请到的空间清零;
520.所述 Get处理模块根据接收到的进程发送的 Get原语数据结构中的目 标地址, 读取对应的同步存储结构, 如果不存在对应的同步存储结构, 则放弃 Get的执行, 把 Get原语放回所述存储队列中, 等待再次被调度执行; 如果存 在对应的同步存储结构, 但是其中的 P位为 0, 则放弃 Get的执行, 把 Get原 语放回到所述存储队列中,等待再次被调度执行;如果哈希表中存在对应的同 步存储结构, 而且其中的 P位为 1, 则执行下一步骤 530;
530.把读取到的同步存储结构中的 P清零, 把读取到的同步存储结构中的 Value内容, 根据 Get原语数据结构中的返回物理地址, 把返回值 { 1, Value}写 入处理器的直属存储中;
540.如果清零后同步存储结构的 Tag等于全零, 则释放对应的同步存储结 构项, 否则把 P位清零的同步存储结构写回;
550.定时查询 Get_Result中数据结构的完成标识, 如果查询结果为 0, 表示 Get原语尚未执行完毕, 则延时一段时间后继续查询; 如果查询结果为 1, 则表示 Get原语执行完毕, 执行下一步骤 560;
560.读取 Get_Result中数据结构的返回数据, 退出 Get调用; 其中 P为 Produce标志位; Value为存储单元; {Count, P, L,}称为同步存储的 Tag。
21.根据权利要求 20所述的多处理器的同步引擎装置对于 Get原语的处理方法, 其特征在于, 所述 Get原语数据结构为{返回物理地址, 目标物理地址}; 数据结构中各个元素的含义如下: 返回物理地址是当 Get成功执行后, 返回 Get的数据以及完成标识的存放地址, Get返回值的数据结构如 {返回数据, 完成标识}, 此返回值被连续存放在返回物理地址中; 目标物理地址表示 Get试图去获得的数据的物理地址。