EP4081898A1 - A system and method for optimizing time overhead in multi-core synchronization - Google Patents

A system and method for optimizing time overhead in multi-core synchronization

Publication number
EP4081898A1
Authority
EP
European Patent Office
Prior art keywords
range
instruction
sync instruction
instructions
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20701305.3A
Other languages
German (de)
French (fr)
Inventor
Leonid Dubrovin
Alexander Rabinovitch
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of EP4081898A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3834 Maintaining memory consistency
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Definitions

  • the present invention, in some embodiments thereof, relates to computer systems and, more specifically, but not exclusively, to a system and method for optimizing time overhead in multi-core synchronization.
  • a memory barrier is a type of barrier instruction that causes a Central Processing Unit (CPU) or a compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This means that operations issued prior to the barrier instruction are guaranteed to be performed before operations issued after the barrier instruction.
  • Memory barriers are necessary because most modern CPUs employ performance optimizations that can result in out-of-order execution. This reordering of memory operations (loads and stores) normally goes unnoticed within a single thread of execution, but in cases of interaction between threads or interaction between software and hardware, it can cause unpredicted behavior.
  • Memory barriers are typically used when implementing low-level machine code that operates on memory shared by multiple devices, which includes synchronization primitives and lock-free data structures on multiprocessor systems, and device drivers that communicate with computer hardware.
  • a system for optimizing time overhead in multi-core synchronization comprising: a processor adapted to execute a plurality of instructions including write instructions to one or more memory devices and a sync instruction; and a counter configured for storing a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; wherein the processor executes the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
  • a method for optimizing time overhead in multi-core synchronization comprising: executing by at least one processor a plurality of instructions including write instructions to one or more memory devices and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
  • a computer program provided on a non-transitory computer readable storage medium storing instructions for performing a method for optimizing time overhead in multi-core synchronization, comprising: executing by at least one processor a plurality of instructions including write instructions to one or more memory devices and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
  • the counter is configured to store a three bits size range attribute value.
  • a counter is provided for every defined range.
  • the range attribute value stored in the counter is increased by one; and for every message from the memory device announcing the execution of the range write instruction of the same defined range is done, the range attribute value stored in the counter is decreased by one.
  • FIG. 1 schematically shows a diagram of a multi-core parallel processing pipeline in comparison to a single-core processing pipeline
  • FIG. 2 schematically shows an example of the execution of a pseudo-code of a simple case which requires the use of a sync instruction
  • FIG. 3 schematically shows an example of the execution of a pseudo-code of a more complex case where two hardware devices have to be programmed
  • FIG. 4 schematically shows an example of a system executing a pseudo-code according to some embodiments of the present invention.
  • FIG. 5 schematically shows a flowchart of a method for implementing a range write instruction and a range sync instruction, according to one or more embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to a system and method for multi-core synchronization and, more specifically, but not exclusively, to a system and method for optimizing time overhead in multi-core synchronization by implementing a range sync instruction.
  • AXI Advanced eXtensible Interface
  • RISC reduced instruction set computing
  • ARM Advanced RISC Machine
  • AMBA Advanced Microcontroller Bus Architecture
  • the AXI interface provides a communication interface for devices and/or components and/or processes in a system to exchange information with each other.
  • the relationship between the components is defined as master-slave, where the master is the requesting component and the slave is the responding component.
  • a multi-core processor (a processor is also referred to herein as a processing unit or Central Processing Unit (CPU)) or a group of processors is configured to co-operate in executing a plurality of instructions simultaneously. Instructions are operations or tasks which the processor is configured to execute.
  • FIG. 1 schematically shows a diagram of a multi-core parallel processing pipeline in comparison to a single-core processing pipeline.
  • Task 1 is executed by core 1
  • Task 2 is executed simultaneously by core 2
  • Task 3 is executed by core 3 simultaneously, without waiting for Task 1 and Task 2 to be completed.
  • Task 2 is completed first, so core 2 continues to execute task 4.
  • Task 3 is completed second, so core 3 continues to execute task 5, and only then is task 1 completed, so core 1 continues to execute task 6, and so on.
  • the parallel processing architecture saves time. However, when software interacts with hardware, that is, when software “writes” to or “reads” from memory-mapped hardware devices, memory flags, semaphores or the like, the exact order of hardware accesses to the memory must be ensured; in other words, synchronization is required. An example is a variable that is shared between two or more tasks. Without synchronization, the instructions of the two tasks may be interleaved in any order and result in out-of-order execution. Usually, synchronization is achieved by using barrier instructions such as sync instructions, which ensure that the execution of all the instructions appearing prior to the sync instruction is completed before the execution of the instructions appearing after the sync instruction.
  • a standard hardware implementation of the sync instruction feature has a counter, which is a memory element, for write operations.
  • the value stored in the counter is increased by 1 when a write instruction is sent out from a master to a slave, and decreased by 1 when a response is received from the slave confirming that the write instruction was successfully completed.
  • the sync operation is finished when the counter value equals 0.
  • FIG. 2 presents a schematic execution of pseudo-code 210 using the sync instruction.
  • DMA Direct Memory Access
  • a processor 201 executes a set of write instructions to a DMA device 202, for programming DMA device 202, and initiating transfer of data between the DMA device 202 and memory 220.
  • a write instruction is executed for every register in the DMA device, e.g. for a DMA with five registers, five write instructions are executed.
  • processor 201, which is the master device, reads a TRANSFER FINISHED register of DMA 202 to test the transfer status.
  • the write instructions to the DMA registers must be separated by a sync instruction from the instruction that reads the TRANSFER FINISHED register, in order to ensure that all the write instructions to the DMA registers are completed before the processor reads from the DMA registers.
  • FIG. 3 schematically shows an example of the execution of a pseudo-code 310, of a more complex case where two hardware devices have to be programmed:
  • there are two DMA devices, DMA_1 301 and DMA_2 302, which have to be programmed to transfer data to memory 320, each DMA device having N registers.
  • a standard sync instruction is used, and processor 301 waits until all the write operations to the two DMA devices are completed; only then can processor 301 read from the DMA_1 TRANSFER FINISHED register.
  • the time problem can be very significant when the hardware devices are located very far in the pipeline from the reading and/or writing master, when there are many DMA devices.
  • the present invention, in some embodiments thereof, provides a range write instruction and a range sync instruction, where the range write instruction contains a range attribute, which defines a range to be monitored by the range sync instruction.
  • the range write instruction to a specific device defines a range for which the range sync instruction is executed.
  • the processor waits until all range write instructions of the defined range are completed, instead of waiting for all the write instructions to be completed as with a standard sync instruction.
  • the implementation of the range write instruction and the range sync instruction requires an additional counter, three bits in size, to store a range attribute value of the range write instruction.
  • the advantages achieved, in accordance with one or more embodiments of the present invention, include an improvement in the performance of various communication and multimedia applications which require hardware and software interaction. Specifically, only minor additional hardware (e.g. die size) is required to significantly improve the performance of applications, and specifically of hardware drivers.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • LAN local area network
  • WAN wide area network
  • Internet Service Provider for example, AT&T, MCI, Sprint, EarthLink, MSN, GTE, etc.
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • FPGA field-programmable gate arrays
  • PLA programmable logic arrays
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
  • System 400 comprises a processor 401 (also referred to as a CPU or processing unit), a memory 420, and m DMA devices: DMA 1, 401, DMA 2, 402, ... and DMA m, 403.
  • Each DMA device comprises N registers.
  • processor 401 executes a set of range write instructions to program the m DMA devices to transfer data to memory 420.
  • the range write instruction to DMA_1 device 402 has a range attribute 1
  • the range write instruction to DMA_2 device 403 has a range attribute 2, and so on.
  • Processor 401 is configured to execute a range sync instruction for range 1 before executing a read instruction from DMA_1 device 402. By executing the range sync instruction for range 1, processor 401 waits only until all the write instructions to DMA_1 device 402 are completed, and then moves on to the execution of the read instruction from DMA_1 402 (the read DMA_1 TRANSFER FINISHED instruction).
  • the use of the range sync instruction and range write instruction enables performance improvement by handling different register ranges separately from each other.
  • the range write instruction and range sync instruction are implemented on chip as part of the Instruction Set Architecture (ISA) of the processor, where a counter is provided for every range defined.
  • ISA Instruction Set Architecture
  • FIG. 5 schematically shows a flowchart of a method for executing a range write instruction and a range sync instruction by a processor, according to one or more embodiments of the invention.
  • a counter x is provided on chip for range x, to store a range attribute value, which initially equals zero, for a range write instruction for a DMA_X device.
  • the range attribute value stored in counter x is increased by 1.
  • the range attribute value stored in counter x is decreased by 1.
  • the range attribute value is checked. When the range attribute value equals 0, all the range write instructions of the defined range are completed and the execution of the range sync instruction ends.
  • the range sync instruction saves time: it does not wait for all the write instructions to be completed, but only for the range write instructions of the same defined range. Otherwise, when the range attribute value does not equal 0, the processor goes back to 502.
  • a possible syntax for the range write instruction may be as follows: ST.A32.R Ae, Ai, imm3.
  • ST.A32.R is a 32-bit store-to-memory instruction that specifies a range.
  • the instruction stores the data in Ae register to a memory address specified by Ai register.
  • the instruction specifies in imm3, the range number to be used by the range sync instruction.
  • a range attribute value equal to 0 is stored in a counter for the range defined in imm3, as explained in 501. Every time a store instruction ST.A32.R for the same range specified by imm3 is executed by the processor, the range attribute value stored in the counter is increased by 1, as explained in 502. Every time a store instruction ST.A32.R for the same range specified by imm3 is completed, and a response is received from the slave, the range attribute value stored in the counter is decreased by 1, as explained in 503.
  • a possible syntax for the range sync instruction may be as follows: SYNC.REGION imm3.
  • the SYNC.REGION instruction is used to make sure that all store accesses to the region specified by imm3 are completed and have reached their destination.
  • the processor checks, as explained in 504, whether the range attribute value of the range defined by imm3, which is stored in the counter, equals 0. When the answer is yes, all write responses of all store accesses to the region specified by imm3 have been received, and the SYNC.REGION instruction ends, as explained in 504. When the answer is no, the processor goes back to step 502 and waits until all the ST.A32.R instructions for the range specified by imm3 are completed.
  • DMA device with the syntax defined above may be as follows:
  • the meaning of the code line ST.A32.R A0, A1, 1 is: write the data in A0 (the data that has to be written to register 1 of DMA_1) to the address specified by A1 (which is the address of register 1 of DMA_1), and so on for the rest of the code lines.
  • the invention relates to a computer program for performing a method for optimizing time overhead in multi-core synchronization.
  • the computer program may be provided on a non-transitory computer-readable storage medium.
  • the computer program is adapted to perform a method, which comprises: executing by at least one processor a plurality of instructions including write instructions to one or more memory devices and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
  • It is expected that during the life of a patent maturing from this application many relevant range sync instructions and range write instructions will be developed, and the scope of the terms range sync instruction and range write instruction is intended to include all such new technologies a priori.
  • the terms “comprising”, “includes”, “including”, “having” and their conjugates mean “including but not limited to”. This term encompasses the terms “consisting of” and “consisting essentially of”.
  • the phrase “consisting essentially of” means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
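By way of a non-limiting illustration, the time saving attributed above to the range sync instruction can be sketched with a toy timing model. The model is not the claimed hardware: the function names, the representation of a write as a (range, completion time) pair, and the latencies are all assumptions made for this sketch. It compares the moment at which a standard sync would release the processor with the moment at which a range sync for range 1 would release it, when one device is near and another is far.

```python
# Toy timing model (illustrative only; names and latencies are assumed).
# Each posted write is a pair: (range_id, completion_time).

def standard_sync_release(writes):
    """A standard sync waits until every outstanding write has completed."""
    return max((t for _, t in writes), default=0)


def range_sync_release(writes, range_id):
    """A range sync waits only for writes tagged with the given range."""
    return max((t for r, t in writes if r == range_id), default=0)


# Five register writes to a near device (range 1) and five register
# writes to a far device (range 2); completion times are made up.
writes = [(1, 10 + i) for i in range(5)] + [(2, 100 + i) for i in range(5)]

print(standard_sync_release(writes))  # 104: blocked by the far device
print(range_sync_release(writes, 1))  # 14: only range 1 is awaited
```

In this sketch the read of the near device's status register could proceed at time 14 instead of time 104, which is the effect the range attribute is intended to achieve.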

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

A system for optimizing time overhead in multi-core synchronization, comprising: a processor (401) adapted to execute a plurality of instructions including write instructions to one or more memory devices (402) and a sync instruction; and a counter configured for storing a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; wherein the processor (401) executes the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.

Description

A SYSTEM AND METHOD FOR OPTIMIZING TIME OVERHEAD IN MULTI-CORE SYNCHRONIZATION
FIELD AND BACKGROUND OF THE INVENTION
The present invention, in some embodiments thereof, relates to computer systems and, more specifically, but not exclusively, to a system and method for optimizing time overhead in multi-core synchronization.
A memory barrier is a type of barrier instruction that causes a Central Processing Unit (CPU) or a compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction. This means that operations issued prior to the barrier instruction are guaranteed to be performed before operations issued after the barrier instruction.
Memory barriers are necessary because most modern CPUs employ performance optimizations that can result in out-of-order execution. This reordering of memory operations (loads and stores) normally goes unnoticed within a single thread of execution, but in cases of interaction between threads or interaction between software and hardware, it can cause unpredicted behavior.
Memory barriers are typically used when implementing low-level machine code that operates on memory shared by multiple devices, which includes synchronization primitives and lock-free data structures on multiprocessor systems, and device drivers that communicate with computer hardware.
SUMMARY
According to a first aspect, there is provided a system for optimizing time overhead in multi-core synchronization, comprising: a processor adapted to execute a plurality of instructions including write instructions to one or more memory devices and a sync instruction; and a counter configured for storing a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; wherein the processor executes the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
According to a second aspect, there is provided a method for optimizing time overhead in multi-core synchronization, comprising: executing by at least one processor a plurality of instructions including write instructions to one or more memory devices and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
According to a third aspect, there is provided a computer program provided on a non-transitory computer readable storage medium storing instructions for performing a method for optimizing time overhead in multi-core synchronization, comprising: executing by at least one processor a plurality of instructions including write instructions to one or more memory devices and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
In a further implementation of the first, second and third aspects, the counter is configured to store a three-bit range attribute value.
In a further implementation of the first, second and third aspects, a counter is provided for every defined range.
In a further implementation of the first, second and third aspects, for every execution of a range write instruction, once the execution is finished, a message from the memory device announcing that the execution is done is received by the processor.
In a further implementation of the first, second and third aspects, for every execution of a range write instruction of the same defined range, the range attribute value stored in the counter is increased by one; and for every message from the memory device announcing that the execution of a range write instruction of the same defined range is done, the range attribute value stored in the counter is decreased by one.
In a further implementation of the first, second and third aspects, when the range attribute value stored in the counter equals zero, the execution of the range sync instruction is done.
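The counter behaviour set out in the implementations above can be sketched, purely as an illustration, by a small software model. The class and method names below are hypothetical, and the choice of eight ranges is an assumption (it presumes the range is selected by a 3-bit immediate); the source states only that a counter is provided for every defined range, that it is increased on each range write, decreased on each completion message, and that the range sync is done when it equals zero.

```python
class RangeCounters:
    """Toy software model of per-range outstanding-write counters.

    Assumption: eight ranges (ids 0..7), as if selected by a 3-bit
    immediate. This is an illustrative sketch, not the claimed hardware.
    """

    NUM_RANGES = 8  # assumed

    def __init__(self):
        # One counter per defined range, all starting at zero.
        self.count = [0] * self.NUM_RANGES

    def range_write_issued(self, rng):
        # A range write was sent towards the device: one more outstanding.
        self.count[rng] += 1

    def write_response(self, rng):
        # The device announced completion of one range write.
        if self.count[rng] == 0:
            raise ValueError("response without an outstanding write")
        self.count[rng] -= 1

    def range_sync_done(self, rng):
        # The range sync is done once the counter of its range is zero.
        return self.count[rng] == 0


c = RangeCounters()
c.range_write_issued(1)
c.range_write_issued(1)
c.range_write_issued(2)
c.write_response(1)
c.write_response(1)
print(c.range_sync_done(1))  # True: range 1 has drained
print(c.range_sync_done(2))  # False: range 2 still has a pending write
```

Keeping one counter per range is what allows a sync on range 1 to finish while a write to range 2 is still in flight.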
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 schematically shows a diagram of a multi-core parallel processing pipeline in comparison to a single-core processing pipeline;
FIG. 2 schematically shows an example of the execution of a pseudo-code of a simple case which requires the use of a sync instruction;
FIG. 3 schematically shows an example of the execution of a pseudo-code of a more complex case where two hardware devices have to be programmed;
FIG. 4 schematically shows an example of a system executing a pseudo-code according to some embodiments of the present invention; and
FIG. 5 schematically shows a flowchart of a method for implementing a range write instruction and a range sync instruction, according to one or more embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to a system and method for multi-core synchronization and, more specifically, but not exclusively, to a system and method for optimizing time overhead in multi-core synchronization by implementing a range sync instruction.
Advanced eXtensible Interface (AXI) is a parallel, high-performance, synchronous, high-frequency, multi-master, multi-slave communication interface, mainly designed for on-chip communication, and is a part of the Advanced reduced instruction set computing (RISC) Machine (ARM) Advanced Microcontroller Bus Architecture (AMBA). The AXI interface provides a communication interface for devices and/or components and/or processes in a system to exchange information with each other. The relationship between the components is defined as master-slave, where the master is the requesting component and the slave is the responding component.
In parallel processing, a multi-core processor (a processor is also referred to herein as a processing unit or Central Processing Unit (CPU)) or a group of processors is configured to co-operate in executing a plurality of instructions simultaneously. Instructions are operations or tasks which the processor is configured to execute.
Unlike a single-core system, in which serial processing is performed and the processor executes a task only after the previous task is completed, in parallel processing tasks are executed simultaneously by the multi-core system without waiting for the previous task to be completed. Once one of the tasks is completed, the executing core continues to the next task for execution.
FIG. 1 schematically shows a diagram of a multi-core parallel processing pipeline in comparison to a single-core processing pipeline. As can be seen in FIG. 1, Task 1 is executed by core 1, Task 2 is executed simultaneously by core 2 without waiting for Task 1 to be completed, and Task 3 is executed simultaneously by core 3 without waiting for Task 1 and Task 2 to be completed. Task 2 is completed first, so core 2 continues to execute Task 4. Task 3 is completed second, so core 3 continues to execute Task 5. Only then is Task 1 completed, so core 1 continues to execute Task 6, and so on.
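The dispatch order of FIG. 1 can be reproduced with a short sketch. The task durations, and the rule of handing the next task to whichever core becomes free first, are assumptions made for illustration; the figure itself gives no timing values.

```python
import heapq


def schedule(durations, num_cores):
    """Dispatch each task to the core that becomes free first.

    Returns a list of (task_number, core_number, start, finish) tuples.
    """
    # Min-heap of (time at which the core becomes free, core number).
    free_at = [(0, core) for core in range(1, num_cores + 1)]
    heapq.heapify(free_at)
    timeline = []
    for task, duration in enumerate(durations, start=1):
        start, core = heapq.heappop(free_at)
        finish = start + duration
        timeline.append((task, core, start, finish))
        heapq.heappush(free_at, (finish, core))
    return timeline


# Hypothetical durations, chosen so that Task 2 finishes first,
# then Task 3, then Task 1, as in the description of FIG. 1.
DURATIONS = [5, 2, 3, 4, 4, 4]

timeline = schedule(DURATIONS, num_cores=3)
serial_time = sum(DURATIONS)
parallel_time = max(finish for _, _, _, finish in timeline)

for task, core, start, finish in timeline:
    print(f"Task {task} on core {core}: {start} -> {finish}")
```

With these assumed durations, Tasks 4, 5 and 6 land on cores 2, 3 and 1 respectively, and the three cores finish all six tasks in 9 time units, against 22 for a single core running them serially.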
The parallel processing architecture saves time. However, when software interacts with hardware, that is, when software writes or reads memory-mapped hardware devices, memory flags, semaphores or the like, the exact order of hardware accesses to the memory must be ensured; in other words, synchronization is required. One example is a variable that is shared between two or more tasks. Without synchronization, the instructions of the two tasks may be interleaved in any order and result in out-of-order execution. Usually, synchronization is achieved by using barrier instructions such as sync instructions, which ensure that the execution of all the instructions appearing prior to the sync instruction is completed before the execution of the instructions appearing after the sync instruction.
A standard hardware implementation of the sync instruction feature has a counter, which is a memory element, for write operations. The value stored in the counter is increased by 1 when a write instruction is sent out from a master to a slave, and decreased by 1 when a response is received from the slave confirming that the write instruction was successfully completed. The sync operation is finished when the counter value equals 0. An example of a simple case where the use of a sync instruction is necessary is presented in FIG. 2, which presents a schematic execution of pseudo-code 210 using the sync instruction.
In pseudo-code 210, a series of write instructions is executed to program a hardware Direct Memory Access (DMA) device. DMA is a feature of computer systems that allows certain hardware subsystems to access main system memory (random-access memory) independently of the Central Processing Unit (CPU). Without DMA, when the CPU is using programmed input and/or output, the CPU is typically fully occupied for the entire duration of a read or write operation, and is thus unavailable to perform other work. With DMA, the CPU first initiates the transfer and then performs other operations while the DMA device carries out the transfer. Finally, the CPU receives a message from the DMA device when the operation is done.
As can be seen in FIG. 2, a processor 201 executes a set of write instructions to a DMA device 202, for programming DMA device 202 and initiating transfer of data between DMA device 202 and memory 220. As shown in pseudo-code 210, a write instruction is executed for every register in the DMA device, e.g. for a DMA device with five registers, five write instructions are executed. After the programming is finished and DMA device 202 transfers the data to memory 220, processor 201, which is the master device, reads a TRANSFER FINISHED register of DMA device 202 to test the transfer status. In this case, the instructions that write to the DMA registers must be separated by a sync instruction from the instruction that reads the TRANSFER FINISHED register, in order to verify that all the write instructions to the DMA registers are completed before the processor reads from the DMA registers.
However, in a more complex case, when several hardware devices have to be programmed and then the status of a particular device has to be read, the sync instruction may take a very long time. Using a sync instruction after every set of write instructions to the same DMA device does not help, as executing a sync instruction for every DMA device may also take a very long time.
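The counter behind a standard sync instruction, as described above, can be traced for the five-register DMA example. This is a minimal software model only; the event encoding ("W" for a write sent out, "R" for a response received) and the order in which the responses arrive are assumptions made for illustration.

```python
def sync_counter_trace(events):
    """Return the counter value after each event.

    "W" models a write instruction sent from master to slave (+1),
    "R" models a write response received from the slave (-1).
    """
    value = 0
    history = []
    for event in events:
        if event == "W":
            value += 1
        elif event == "R":
            if value == 0:
                raise ValueError("response received with no write outstanding")
            value -= 1
        else:
            raise ValueError(f"unknown event {event!r}")
        history.append(value)
    return history


# Five DMA register writes; the interleaving of responses is hypothetical.
EVENTS = "WWWRWWRRRR"

history = sync_counter_trace(EVENTS)
# The sync instruction can only finish once the counter is back at 0.
sync_can_finish = history[-1] == 0
print(history, sync_can_finish)
```

In this model the read of the TRANSFER FINISHED register would be held back until the last response brings the counter to 0.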
FIG. 3 schematically shows an example of the execution of a pseudo-code 310, for a more complex case where two hardware devices have to be programmed.
In this case, there are two DMA devices, DMA_1 301 and DMA_2 302, which have to be programmed to transfer data to memory 320, each DMA device having N registers. A standard sync instruction is used, and processor 301 waits until all the write operations to the two DMA devices are completed; only then can processor 301 read from the DMA_1 TRANSFER FINISHED register. The delay can be very significant when the hardware devices are located far down the pipeline from the reading and/or writing master, and when there are many DMA devices.
The present invention, in some embodiments thereof, provides a range write instruction and a range sync instruction, where the range write instruction contains a range attribute that defines a range to be monitored by the range sync instruction. This means that the range write instruction to a specific device defines a range for which the range sync instruction is executed. Once the range sync instruction is executed, the processor waits until all range write instructions of the defined range are completed, instead of waiting for all the write instructions to be completed as with a standard sync instruction. In one or more embodiments of the invention, the implementation of the range write instruction and the range sync instruction requires an additional counter, 3 bits in size, to store a range attribute value of the range write instruction.
The advantages achieved, in accordance with one or more embodiments of the present invention, include improved performance of various communication and multimedia applications that require hardware and software interaction. Specifically, only minor additional hardware (e.g. die size) is required to significantly improve the performance of applications, and of hardware drivers in particular.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions. The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 4, which shows an exemplary schematic system 400 executing a pseudo-code 410 according to some embodiments of the present invention. System 400 comprises a processor 401 (also referred to as a CPU or processing unit), a memory 420, and m DMA devices - DMA 1, 401, DMA 2, 402... and DMA m, 403. Each DMA device comprises N registers.
According to some embodiments of the invention, and as shown in FIG. 4, processor 401 executes a set of range write instructions to program the m DMA devices to transfer data to memory 420. The range write instruction to DMA 1 device 402 has a range attribute 1, the range write instruction to DMA 2 device 403 has a range attribute 2, and so on. Processor 401 is configured to execute a range sync instruction for range 1 before executing a read instruction from DMA 1 device 402. By executing the range sync instruction for range 1, processor 401 waits only until all the write instructions to DMA 1 device 402 are completed, and then moves on to execute the read from DMA 1 device 402 (the read DMA 1 TRANSFER FINISHED instruction).
According to some embodiments of the present invention, the use of the range sync instruction and range write instruction enables performance improvement by handling different register ranges separately from each other.
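The difference in waiting time can be illustrated with a small sketch that compares a standard sync against a range sync over the same set of writes. The response times below are hypothetical and model a second DMA device located much further from the processor; the function name is likewise an illustrative assumption.

```python
def sync_completion_time(write_responses, sync_range=None):
    """Time at which a sync can complete.

    write_responses: list of (range_number, response_arrival_time).
    sync_range=None models a standard sync, which waits for every write;
    otherwise only writes tagged with sync_range are waited for.
    """
    relevant = [
        time
        for rng, time in write_responses
        if sync_range is None or rng == sync_range
    ]
    return max(relevant, default=0)


# Hypothetical arrival times: range 1 (DMA 1) responds quickly,
# range 2 (DMA 2) sits far down the pipeline and responds late.
WRITE_RESPONSES = [(1, 10), (1, 11), (1, 12), (2, 40), (2, 41), (2, 42)]

standard_wait = sync_completion_time(WRITE_RESPONSES)
range_1_wait = sync_completion_time(WRITE_RESPONSES, sync_range=1)
print(standard_wait, range_1_wait)
```

Under these assumed latencies, the read from DMA 1 can proceed at time 12 with a range sync on range 1, whereas a standard sync would hold it until time 42.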
In one or more embodiments of the invention, the range write instruction and range sync instruction are implemented on chip as part of the Instruction Set Architecture (ISA) of the processor, where a counter is provided for every range defined.
Reference is now made to FIG. 5, which schematically shows a flowchart of a method for executing a range write instruction and a range sync instruction by a processor, according to one or more embodiments of the invention.
At 501, a counter x is provided on chip for range x, to store a range attribute value, initially equal to zero, for range write instructions to a DMA_X device. At 502, every time a range write instruction to a register of the DMA_X device, with a range attribute of range x, is executed by the processor, the range attribute value stored in counter x is increased by 1. At 503, every time a response is received from the slave confirming that a range write instruction to DMA_X, with range x, was successfully executed, the range attribute value stored in counter x is decreased by 1. At 504, the range attribute value is checked. When the range attribute value equals 0, all the range write instructions of the defined range are completed and the execution of the range sync instruction ends. Otherwise, when the range attribute value does not equal 0, the processor goes back to 502. The range sync instruction saves time because it does not wait for all the write instructions to be completed, but only for the range write instructions of the same defined range.
In one or more embodiments of the invention, a possible syntax for the range write instruction may be as follows: ST.A32.R Ae, Ai, imm3.
ST.A32.R is a 32-bit store-to-memory instruction that specifies a range. The instruction stores the data in the Ae register to the memory address specified by the Ai register. In addition, the instruction specifies in imm3 the range number to be used by the range sync instruction. A range attribute value equal to 0 is stored in a counter for the range defined in imm3, as explained in 501. Every time a store instruction ST.A32.R for the same range specified by imm3 is executed by the processor, the range attribute value stored in the counter is increased by 1, as explained in 502. Every time a store instruction ST.A32.R for the same range specified by imm3 is completed, and a response is received from the slave, the range attribute value stored in the counter is decreased by 1, as explained in 503.
In one or more embodiments of the invention, a possible syntax for the range sync instruction may be as follows: SYNC.REGION imm3.
The SYNC.REGION instruction is used to make sure that all store accesses to the region specified by imm3 are completed and have reached their destination.
The next instruction after the SYNC.REGION instruction is not executed until receiving the write responses to all store accesses to the specified region.
As explained in 504, the processor checks whether the range attribute value of the range defined by imm3, which is stored in the counter, equals 0. If so, all write responses for all store accesses to the region specified by imm3 have been received, and the SYNC.REGION instruction ends. If not, the processor goes back to step 502 and waits until all the ST.A32.R instructions for the range specified by imm3 are completed.
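Steps 501 to 504 can be summarised as a small software model with one counter per range. This is a sketch, not the hardware design: the class and method names are hypothetical, and the choice of eight ranges assumes that imm3 is a 3-bit immediate, which the text suggests but does not state outright.

```python
class RangeSyncUnit:
    """Software model of per-range write counters (steps 501 to 504)."""

    NUM_RANGES = 8  # assumption: imm3 is a 3-bit immediate, so ranges 0..7

    def __init__(self):
        # Step 501: one counter per range, each starting at zero.
        self.counters = [0] * self.NUM_RANGES

    def _check(self, rng):
        if not 0 <= rng < self.NUM_RANGES:
            raise ValueError(f"range {rng} does not fit in imm3")

    def range_write(self, rng):
        # Step 502: a range write with this range attribute is executed.
        self._check(rng)
        self.counters[rng] += 1

    def write_response(self, rng):
        # Step 503: the slave confirms that a write in this range completed.
        self._check(rng)
        if self.counters[rng] == 0:
            raise ValueError(f"no outstanding write in range {rng}")
        self.counters[rng] -= 1

    def range_sync_done(self, rng):
        # Step 504: the range sync ends once this range's counter is zero.
        self._check(rng)
        return self.counters[rng] == 0


unit = RangeSyncUnit()
for _ in range(3):
    unit.range_write(1)   # three register writes to DMA 1
for _ in range(3):
    unit.range_write(2)   # three register writes to DMA 2
for _ in range(3):
    unit.write_response(1)  # only DMA 1 has responded so far

print(unit.range_sync_done(1), unit.range_sync_done(2))
```

Here a range sync on range 1 would already be allowed to finish, while range 2 still has three writes outstanding.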
An example of code that programs two DMA devices, each with three registers, using the syntax defined above may be as follows:
ST.A32.R A0, A1, 1 // Write DMA 1 REG 1, RANGE 1
ST.A32.R A2, A3, 1 // Write DMA 1 REG 2, RANGE 1
ST.A32.R A3, A4, 1 // Write DMA 1 REG 3, RANGE 1
ST.A32.R A4, A1, 2 // Write DMA 2 REG 1, RANGE 2
ST.A32.R A5, A3, 2 // Write DMA 2 REG 2, RANGE 2
ST.A32.R A6, A4, 2 // Write DMA 2 REG 3, RANGE 2
SYNC.REGION 1 // SYNC RANGE 1
LD.A32.R A7, A5 // Read DMA 1 TRANSFER FINISHED
In the above syntax example, the code line ST.A32.R A0, A1, 1 means: write the data in A0 (the data to be written to register 1 of DMA 1) to the address specified by A1 (the address of register 1 of DMA 1), with range attribute 1. The remaining code lines are read in the same way.
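The effect of the range sync in the example above can be traced with a toy interpreter. It is a sketch under stated assumptions: one instruction issues per cycle, write responses arrive after a fixed per-range latency (the values are hypothetical), and the model looks only at the mnemonic and the range operand, so the register names play no role.

```python
from collections import Counter

PROGRAM = """\
ST.A32.R A0, A1, 1
ST.A32.R A2, A3, 1
ST.A32.R A3, A4, 1
ST.A32.R A4, A1, 2
ST.A32.R A5, A3, 2
ST.A32.R A6, A4, 2
SYNC.REGION 1
LD.A32.R A7, A5
"""

# Hypothetical write-response latency, in cycles, for each range.
LATENCY = {1: 10, 2: 40}


def run(program, latency):
    """Return ([(cycle, mnemonic), ...], outstanding writes per range)."""
    counters = Counter()  # one outstanding-write counter per range
    pending = []          # (arrival_cycle, range_number) of responses in flight
    issued = []
    cycle = 0

    def collect(now):
        # Decrement the counter for every response that has arrived by `now`.
        for arrival, rng in [p for p in pending if p[0] <= now]:
            pending.remove((arrival, rng))
            counters[rng] -= 1

    for line in program.strip().splitlines():
        mnemonic, _, operands = line.partition(" ")
        collect(cycle)
        if mnemonic == "ST.A32.R":
            rng = int(operands.split(",")[2])
            counters[rng] += 1
            pending.append((cycle + latency[rng], rng))
        elif mnemonic == "SYNC.REGION":
            rng = int(operands)
            # Stall until the counter of the named range drops to zero.
            while counters[rng] != 0:
                cycle += 1
                collect(cycle)
        issued.append((cycle, mnemonic))
        cycle += 1
    return issued, dict(counters)


issued, outstanding = run(PROGRAM, LATENCY)
load_cycle = issued[-1][0]
print(load_cycle, outstanding)
```

With these assumed latencies the load issues at cycle 13, while three range 2 writes are still outstanding; their last response would not arrive until cycle 45, which is how long a standard sync would have stalled.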
In one or more embodiments, the invention relates to a computer program for performing a method for optimizing time overhead in multi-core synchronization. The computer program may be provided on a non-transitory computer-readable storage medium. The computer program is adapted to perform a method, which comprises: executing by at least one processor a plurality of instructions including write instructions to one or more memory devices and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant range sync instructions and range write instructions will be developed, and the scope of the terms range sync instruction and range write instruction is intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of". The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the invention may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. A system for optimizing time overhead in multi-core synchronization, comprising: a processor (401) adapted to execute a plurality of instructions including write instructions to one or more memory devices (402) and a sync instruction; and a counter configured for storing a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; wherein the processor (401) is configured to execute the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
2. The system of claim 1, wherein the counter is configured to store a range attribute value of three bits in size.
3. The system of claim 1, wherein a counter is provided for every defined range.
4. The system of claim 1, wherein for every execution of a range write instruction, once the execution is finished, a message from the memory device (402) announcing the execution is done, is received by the processor (401).
5. The system of claim 4, wherein for every execution of a range write instruction of the same defined range, the range attribute value stored in the counter is increased by one; and for every message from the memory device announcing the execution of the range write instruction of the same defined range is done, the range attribute value stored in the counter is decreased by one.
6. The system of claim 5, wherein, when the range attribute value stored in the counter equals zero, the execution of the range sync instruction is done.
7. A method for optimizing time overhead in multi-core synchronization, comprising: executing by at least one processor (401) a plurality of instructions including write instructions to one or more memory devices (402) and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
8. A computer program provided on a non-transitory computer readable storage medium storing instructions for performing a method for optimizing time overhead in multi-core synchronization, comprising: executing by at least one processor (401) a plurality of instructions including write instructions to one or more memory devices (402) and a sync instruction; storing by a counter a range attribute value for a plurality of range write instructions defining a range to be monitored by a range sync instruction; and executing the range sync instruction for a range defined by the range attribute value, to ensure the plurality of range write instructions appearing before the range sync instruction are completed before execution of instructions appearing after the range sync instruction.
EP20701305.3A 2020-01-17 2020-01-17 A system and method for optimizing time overhead in multi-core synchronization Pending EP4081898A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/051175 WO2021144034A1 (en) 2020-01-17 2020-01-17 A system and method for optimizing time overhead in multi-core synchronization

Publications (1)

Publication Number Publication Date
EP4081898A1 (en) 2022-11-02

Family

ID=69182518

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20701305.3A Pending EP4081898A1 (en) 2020-01-17 2020-01-17 A system and method for optimizing time overhead in multi-core synchronization

Country Status (2)

Country Link
EP (1) EP4081898A1 (en)
WO (1) WO2021144034A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7243200B2 (en) * 2004-07-15 2007-07-10 International Business Machines Corporation Establishing command order in an out of order DMA command queue
US8832403B2 (en) * 2009-11-13 2014-09-09 International Business Machines Corporation Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses
US9164690B2 (en) * 2012-07-27 2015-10-20 Nvidia Corporation System, method, and computer program product for copying data between memory locations
US10395623B2 (en) * 2017-04-01 2019-08-27 Intel Corporation Handling surface level coherency without reliance on fencing

Also Published As

Publication number Publication date
WO2021144034A1 (en) 2021-07-22

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220726

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20240206