EP2401676A1 - Zuordnungs- und überwachungseinheit - Google Patents

Zuordnungs- und überwachungseinheit

Info

Publication number
EP2401676A1
Authority
EP
European Patent Office
Prior art keywords
auxiliary
apu
logical
unit
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP10705159A
Other languages
English (en)
French (fr)
Inventor
Stéphane LOUISE
Vincent David
Raphaël DAVID
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Commissariat à l'Énergie Atomique et aux Énergies Alternatives (CEA)
Original Assignee
Commissariat à l'Énergie Atomique (CEA)
Commissariat à l'Énergie Atomique et aux Énergies Alternatives (CEA)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Commissariat à l'Énergie Atomique (CEA) and Commissariat à l'Énergie Atomique et aux Énergies Alternatives (CEA)
Publication of EP2401676A1
Legal status: Withdrawn


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/30101 Special purpose registers
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F9/3877 Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G06F9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores

Definitions

  • the present invention relates to an allocation and control unit for allocating the execution threads of a task to a plurality of auxiliary processing units and for controlling the parallel execution of said execution threads by said auxiliary processing units, the task being executed sequentially by a main processing unit. It applies in particular in the field of embedded systems with high computing power.
  • multicore systems on a single chip have appeared, which may contain DSPs ("Digital Signal Processors") for signal processing and GPPs ("General Purpose Processors") for ordinary processing, as well as analog input/output blocks.
  • DSP Digital Signal Processor
  • GPP General Purpose Processor
  • decoder cores dedicated to audio ("MPEG Audio Layer", "Dolby D", "DTS") or video ("MPEG", "H264") have appeared alongside the general-purpose processor.
  • US Patent 6249858 shows one of the most recent aspects of coupling capabilities between a standard processor and a coprocessor, allowing parallel execution of processing on both entities.
  • the coupling is rather close: the main processor sends the computation orders to the coprocessor by providing operands and a ROM program address.
  • this requires dedicated support software, because an interrupt must be taken on the main processor to properly handle the call to the coprocessor features, and another interrupt is generated by the coprocessor at the end of the calculation. This shows the weak coupling between the main processor and its computation accelerator.
  • the method cannot be generalized to a plurality of acceleration elements.
  • it does not make it possible to dispense with system support for controlling the coprocessor and obtaining the results of its calculations.
  • it also does not easily ensure consistency of computation dependencies. These are a priori the responsibility of the programmer, which is usually difficult on a parallel system where the processing can be highly heterogeneous. This also makes scaling extremely difficult and reserved for parallel programming specialists.
  • the graphics processing units (GPUs) of modern graphics cards, described in US Pat. No. 6,098,017, can be considered as sets of specialized auxiliary units for single-program, multiple-data vector calculation (SPMD: "Single Program, Multiple Data").
  • SPMD Single Program, Multiple Data
  • the problem treated is massively parallel. Indeed, it consists in performing the same processing on separate data sets, in order to compute pixels in a memory buffer. An error at a given moment is not critical: as long as the error rate remains low, the user is not inconvenienced.
  • there is no simply accessible means of synchronization, since the problem is intrinsically parallel. The only important synchronization is at the end of processing an image, in order to add post-processing stages or simply to display the computed pixels on the screen.
  • CMP Chip MultiProcessing
  • the present invention is intended to overcome the aforementioned drawbacks, by exploiting synchronization points set up at compilation time and by using as much as possible the resources made available by the hardware and the basic system software.
  • the subject of the invention is an allocation and control unit for allocating the execution threads of a task to a plurality of auxiliary processing units and for controlling the parallel execution of said threads by said auxiliary processing units, the task being executed sequentially by a main processing unit.
  • the allocation and control unit comprises means for managing logical processing auxiliary units, means for managing physical processing auxiliary units, each physical processing auxiliary unit corresponding to an auxiliary processing unit and means for managing the auxiliary processing units.
  • the means for managing the auxiliary processing units comprises means for allocating a logical processing auxiliary unit to an execution thread to be executed and means for managing the correspondence between the logical processing auxiliary units and the physical processing auxiliary units.
  • the auxiliary processing units execute the task's execution threads in parallel via the logical processing auxiliary units, which are allocated as late as possible and released as early as possible.
  • the unit may comprise means for executing instructions inserted into the task, these inserted instructions providing for the management of the execution threads executable by the logical processing auxiliary units.
  • These inserted instructions may include an instruction to allocate a given logical auxiliary processing unit to the task.
  • These inserted instructions may also include an instruction to execute a thread for executing the task on the given logical processing auxiliary unit. This instruction then takes as input parameters an execution context on the given logical processing auxiliary unit.
  • the execution context identifies the thread to execute, the input data to execute it, and the output data.
  • the instruction for executing a thread for executing the task on the given logical processing auxiliary unit may be executed either with a release request or with a synchronization request.
  • in the case of a release request, the given logical processing auxiliary unit is released as soon as execution of the thread is complete.
  • in the case of a synchronization request, the given logical processing auxiliary unit is not released until a synchronization instruction is encountered in the instruction flow of the task, a synchronization instruction encountered in the task's instruction flow advantageously releasing all or part of the logical processing auxiliary units that have been the subject of a synchronization request by the task.
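As a rough illustration, the allocate/execute/synchronize semantics described in the points above can be modeled in a few lines of C. All names, the fixed unit count, and the instant-completion assumption are illustrative only; this is not the patent's instruction set.

```c
#include <stdint.h>

/* Toy model of the three inserted instructions (names are assumptions). */
typedef enum { APU_EXEC_RELEASE, APU_EXEC_SYNC } apu_exec_mode_t;

#define N_LOGICAL_APUS 8
static uint8_t apu_busy[N_LOGICAL_APUS];         /* 1 = allocated to the task */
static uint8_t apu_sync_pending[N_LOGICAL_APUS]; /* 1 = awaiting a sync inst. */

/* "allocate": returns a free logical APU id, or -1 if none is available. */
int apu_allocate(void) {
    for (int i = 0; i < N_LOGICAL_APUS; i++)
        if (!apu_busy[i]) { apu_busy[i] = 1; return i; }
    return -1;
}

/* "execute": launches a thread on the given logical APU. With
 * APU_EXEC_RELEASE the logical APU is freed as soon as the thread
 * completes (instantly, in this toy model); with APU_EXEC_SYNC it stays
 * allocated until a synchronization instruction is encountered. */
void apu_execute(int id, apu_exec_mode_t mode) {
    /* (actual thread dispatch elided) */
    if (mode == APU_EXEC_RELEASE) apu_busy[id] = 0;
    else apu_sync_pending[id] = 1;
}

/* "synchronize": releases every logical APU that was the subject of a
 * synchronization request by the task. */
void apu_synchronize(void) {
    for (int i = 0; i < N_LOGICAL_APUS; i++)
        if (apu_sync_pending[i]) { apu_sync_pending[i] = 0; apu_busy[i] = 0; }
}
```

The point of the sketch is only the lifetime rule: a release-mode launch frees the logical unit at thread completion, while a sync-mode launch defers the release to the next synchronization instruction.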
  • the means for executing the instructions inserted in the task can be implemented in the form of an execution pipeline or a microprogrammed sequencer.
  • the means for managing logical processing auxiliary units may comprise means for providing a free logical processing auxiliary unit identifier and / or means for releasing a logical processing auxiliary unit and / or means for associating a logical processing auxiliary unit with an auxiliary physical processing unit.
  • the means for providing a logical processing auxiliary unit identifier may provide the identifier of the first element of a list of free logical processing auxiliary units.
  • the means for managing physical processing auxiliary units may comprise means for providing a free physical processing auxiliary unit identifier, and/or means for associating a physical processing auxiliary unit with a logical processing auxiliary unit.
  • the means for allocating a logical processing auxiliary unit to a thread for execution may comprise means for searching for a free logical processing auxiliary unit, and/or means for allocating the free logical processing auxiliary unit to a thread, and/or means for providing the identifier of the logical processing auxiliary unit allocated to a thread.
  • the unit may comprise means for managing execution contexts on the logical processing auxiliary units, and / or means for decoding interrupts coming from the auxiliary processing units.
  • the unit may comprise means for managing execution contexts on the main processing unit, an execution context on the main processing unit making it possible to identify a task executable by the main processing unit, the input data needed to execute it and the output data, so that several tasks can be executed on the main processing unit.
  • the unit may comprise a local register bank including a register for masking exceptions and/or interrupts coming from the auxiliary processing units, and/or a register indicating the physical processing auxiliary units currently executing, and/or a register indicating the logical processing auxiliary units currently executing, and/or a register indicating the logical processing auxiliary units which have not been the subject of a synchronization request by the task.
  • the main advantages of the invention are that it does not require partial pre-emption of the auxiliary processing units. Moreover, the invention does not require a strong synchronization of all the processing units, only a weak synchronization at the level of the auxiliary processing units, possibly even synchronization by groups of auxiliary processing units. The invention also makes it possible to release the system software from the management of part of the interrupts.
  • FIG. 1 shows, by a diagram, an example of architecture according to French patent FR2893156 (B1), combining a general-purpose processor and several processing units specialized in intensive computations;
  • FIG. 2 shows, by a diagram, an exemplary allocation and control unit architecture according to the invention;
  • FIG. 3 shows, by a timing diagram, an example of a sequence of processing operations.
  • FIG. 1 illustrates, by a diagram, an exemplary MPSoC architecture according to the French patent number FR2893156 (B1).
  • This architecture combines a general-purpose processor SPU ("Standard Processing Unit") located in an SPP subset of the architecture ("Standard Processing Part") and N processing units APU0, APU1, ..., APUN-1 ("Auxiliary Processing Unit") specialized in intensive computations, located in an APP subset of the architecture ("Auxiliary Processing Part"). Subsequently, APU0, APU1, ..., APUN-1 will be referred to as "the APU processor(s)" or simply "the APU(s)".
  • the APU processors can communicate via a shared memory SMS ("Shared Memory Space"), this SMS memory possibly having its own controller MSC ("Memory Space Controller"). A shared register file SRF ("Shared Register File") can also be shared by the APU processors.
  • the SPP receives data via a system bus (SB) and an SBA (System Bus Arbiter) bus controller.
  • SB system bus
  • SBA System Bus Arbiter
  • the SPP includes a control unit ESCU ("Extended Standard Control Unit"), which is responsible for reading and decoding instructions.
  • the SPP also comprises a storage unit comprising two first-level cache memories L1 D-Cache and L1 I-Cache, a second-level cache L2-Cache, and a loading unit LSU ("Load Store Unit").
  • the execution of the tasks also involves the system software.
  • the SPP is able to use the auxiliary execution units that are the APU processors, to process certain application portions requiring very high computing power.
  • the present invention concerns the way of using the auxiliary computation units that are the APU processors.
  • ACU allocation and control Unit
  • the APP may include a mass memory MM ("Main Memory”), in order to store all the data and programs handled by the APU processors.
  • This memory MM also has its own controller MMC ("Main Memory Controller"), on which the ACU allocates blocks of data transfer between the system and the intensive computing blocks.
  • MMC Main Memory Controller
  • a principle underlying the invention is to start from usual sequential codes, to which are added a few hints left by the application programmer, in order to better separate the "calculation/processing" type code portions from the "control" type code portions.
  • the control code is characterized by a rather low potential CPI, because strong control events depend as much on the data produced and consulted as on the execution history. The predictability of branches and the predictability of memory accesses are low. GPPs are well suited to control tasks.
  • the calculation code is characterized by a significant potential CPI, because there are few unpredictable control or memory access hazards.
  • the parallelism of instructions is important and easy to optimize, especially on specialized processors of the DSP/SIMD type. Loop unrolling is useful for reducing the perceived control hazards.
  • a first compilation phase is responsible, on the one hand, for isolating the control code while keeping all the necessary information on the data streams that would otherwise have disappeared, and, on the other hand, for producing the code snippets and functions for the calculations and data processing. It is then a matter of compiling, if this is not already done in libraries, the processing codes which must execute on the auxiliary computation units that are the APU processors.
  • the data flow graph is processed for the control code to be executed on the SPP. Then the allocation, loading and execution instructions for the APU processors are inserted into the control program, as well as the synchronization instructions between the APU processors.
  • the first compilation phase can be illustrated using the following example algorithm 1:
  • the first phase corresponds to the extraction of the code to be positioned on the APU processors and the associated data stream.
  • the functions fft, det and mult are marked by the programmer for execution on the APU processors, the data dependency graph being analyzed.
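Purely as an illustration (the patent's actual "algorithm 1" is not reproduced on this page), the idea is sequential C code in which the programmer marks fft, det and mult for APU execution, here with a hypothetical APU_OFFLOAD annotation; the toy function bodies below are placeholders, not the patent's kernels.

```c
#include <stddef.h>

#define APU_OFFLOAD /* hypothetical marker: "place this code on an APU" */

/* Toy stand-ins for the processing kernels the programmer marks. */
APU_OFFLOAD void fft(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[i]; /* placeholder transform */
}

APU_OFFLOAD float det(const float m[4]) { /* 2x2 determinant, toy example */
    return m[0] * m[3] - m[1] * m[2];
}

APU_OFFLOAD void mult(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++) c[i] = a[i] * b[i]; /* elementwise product */
}
```

The first compilation phase would extract these marked functions, together with the data dependencies between their arguments, and leave the control code on the SPP side.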
  • the second compilation phase can be illustrated using the following example algorithm 2 (schematic control code obtained, pseudo-assembler plus unchanged code):
    Allocate Calculation Processor APP#1
    Allocate Calculation Processor APP#2
    load_prog fft on APP#2
    load data dat2 for APP#2
    launch APP#2
    load_prog fft on APP#1
    load data dat1 for APP#1
    launch & release APP#1
  • the second phase sets up the control code, based on the dependency graph, the potential parallelism and its execution on the APPs.
  • pseudo-instructions that start with a capital letter are operations that require parallelism synchronization. They set markers at the processor level so that the base system software can know whether the task is waiting for a synchronization or not.
  • an executing task can allocate an APU processor via an assembly instruction.
  • the execution of this instruction can be blocking, for example if there is no processor left available in the APP for the application. A processor of the APP is subsequently allocated to the task until said processor has finished its execution; it then returns implicitly to the list of processors available for other processes. The other blocking instruction for a task is the synchronization instruction, which is associated with a list of processors of the APP whose completion is awaited.
  • the loading and invalidation of memory with data or programs for a particular APU, and the launch of processing on the APU processors.
  • the ACU must also be able to manage autonomously, i.e. independently of the SPP or any other hardware element, the bulk of the APU processor deallocation procedures, the termination procedures, and part of the management of the memory embedded on the chip.
  • FIG. 2 diagrammatically illustrates an exemplary architecture according to the invention for the ACU of FIG. 1, in charge of accelerating the operations for managing meso-parallelism on APU0, APU1, ..., APUN-1.
  • the internal connections are in solid lines while the external interfaces are in dashed lines.
  • the ACU is also responsible for managing part of the weak synchronization, without going through explicit management by a dedicated kernel. This is possible when the operations to be performed are unambiguous, simple to implement in hardware, and likely to significantly improve the performance of the whole. In essence, the ACU therefore has a very important system dimension. One goal is to abstract away the particularities of the operating system, when there is one, and to make the system as efficient as possible.
  • the ACU offers a virtualization of the use of the APU processors, so that the implementation of a weak-synchronization execution model is simple and efficient, the compilation tools and the system software intervening only when necessary.
  • the ACU is used to execute particular parallelism management instructions, which are inserted into the program that is running on the SPP.
  • the ACU is also used to manage the APUs and to manage the interface of the APUs with the SPP.
  • the ACU can comprise an execution pipeline EP ("Execution Pipeline"), which is responsible for executing the instructions specific to the management of parallelism on the APUs or the MM, from the instructions inserted into the instruction flow of the SPP.
  • EP Execution Pipeline
  • this unit can be implemented in different ways, such as an actual pipeline or a microprogrammed sequencer. In the present preferred embodiment, it is a pipeline.
  • APU processors are homogeneous with unified management.
  • the generalization to heterogeneous APU processors is simple: it suffices to separate the dedicated descriptors for each type of APU.
  • the impact on the execution pipeline instructions is that an APU type identifier must be added at the level of the allocation instruction.
  • for each type of APU, there is then logical APU management by a logical APU manager LAM ("Logical APU Management") and physical APU management by a physical APU manager PAM ("Physical APU Management").
  • an APU manager ("APU Manager") is then in charge of allocating logical APUs to the running context, as well as managing the correspondence between logical APUs and physical APUs.
  • the EP execution pipeline may advantageously make it possible to process the following basic instructions: allocation of an APU processor: this involves associating a logical APU with the task being executed on the SPP.
  • the instruction returns a logical APU number, obtained from the APU manager, in a global register or a memory location indexed by the instruction.
  • the instruction can take additional parameters that can be compared with the LRF local register file, which will be detailed later, for example to limit the number of allocatable APUs in a particular section.
  • LRF Local Register File
  • this instruction requires a specific context, which can be provided either by a context identifier in the assumption of operation with elaborate contexts, or by a simple triplet (program identifier, input data identifier , output data identifier).
  • this second case covers the common run of processing and allows a generic treatment: it is better to implement it in all cases.
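A minimal sketch of the "simple triplet" execution context just described (field names are illustrative assumptions, not the patent's): it identifies the program to run on the logical APU, its input data and its output data.

```c
#include <stdint.h>

/* Triplet execution context passed to the execute instruction
 * (illustrative field names). */
typedef struct {
    uint32_t prog_id;   /* program (thread) identifier */
    uint32_t input_id;  /* input data identifier       */
    uint32_t output_id; /* output data identifier      */
} apu_context_t;

/* Build a triplet context; the "elaborate context" variant of the first
 * case would instead carry a context identifier managed on the SPP side. */
apu_context_t make_context(uint32_t prog, uint32_t in, uint32_t out) {
    apu_context_t c = { prog, in, out };
    return c;
}
```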
  • the first case allows greater flexibility in the definition of the processing, but requires particular instructions for generating APU processing contexts on the SPP.
  • the physical APU is allocated for execution as soon as possible.
  • the logical APU is not freed for this execution context of the SPP as long as a synchronization instruction has not been encountered.
  • the logical APU can be reassigned to a different SPP context as soon as the physical APU has finished executing.
  • synchronization of APU processor execution: it is necessary to check that one or more logical APUs specified in the instruction have finished their execution.
  • the APU manager is directly in charge of the actual execution of this synchronization, which can cause an exception.
  • the logical APUs under a synchronization request are all released for the current context.
  • multiple APUs can be synchronized with the same instruction. A simple way to do this is to use a mask of the logical APUs to synchronize.
  • the information can be passed through one or more registers in one or more synchronization instructions, although the case where a single instruction suffices is preferred. The information can also be passed through a memory structure whose address is passed. In the following, the set of logical APUs whose synchronization has been requested will be called the synchronization mask, even though this is one particular implementation choice.
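The synchronization-mask idea can be sketched with plain bit operations (an illustration under assumed names, not the patent's logic): each bit selects one logical APU whose completion must be awaited, and all selected APUs are then released for the current context.

```c
#include <stdint.h>

typedef uint32_t apu_mask_t; /* one bit per logical APU (illustrative) */

/* Synchronization is complete when no APU selected by the mask is still
 * pending (bits still set in `pending` are APUs that have not finished). */
static inline int apu_sync_done(apu_mask_t sync_mask, apu_mask_t pending) {
    return (sync_mask & pending) == 0;
}

/* Release every logical APU selected by the synchronization mask,
 * returning the new allocation bitmap for the current context. */
static inline apu_mask_t apu_sync_release(apu_mask_t allocated,
                                          apu_mask_t sync_mask) {
    return allocated & ~sync_mask;
}
```

A mask representation also makes "synchronize several APUs with a single instruction" a one-word operand, which is presumably why the description prefers the single-instruction case.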
  • the EP execution pipeline can also be used to process the following instructions to facilitate the implementation of the system:
  • TLB Translation Lookaside Buffer
  • loading of a startup register value associated with a logical APU, for the management of APU partial contexts.
  • LRF LOCAL REGISTERS
  • the LRF local register bank may also include various other registers, for example for controlling the number of APUs allocatable in a given context, including: a maximum number of physical APUs that can be allocated in the current context;
  • the LRF local register bank provides part of the interface with the system software. As such, it may also include means for managing power on the chip, for example by providing low-power-consumption configurations for the APUs, probably with lesser performance, or even means of putting some APUs on standby.
  • the other part of the interface with the system software goes through the APU manager, as detailed below.
  • the APU manager can be considered the heart of the ACU. Its role is to allocate logical APUs to the executing context, to manage the correspondence between logical APUs and physical APUs, and to speed up certain operations usually devolved to the system software, or even to replace the intervention of the system software in some cases.
  • the APU manager can be quite simple because it is primarily a reactive system. The following table describes the processing it performs based on the signals it receives from the EP execution pipeline.
  • the ACU may also include an APUID interrupt decoder ("APU Interrupt Decoder").
  • the APU manager needs nothing more to operate the multiprocessor system of FIG. 1 in the weak-synchronization execution model according to the invention.
  • the exceptions associated with the specialized instructions executed on the APUs are advantageously masked by default. Indeed, the response of the ACU to such masking is to wait for the situation associated with the exception to resolve itself. This makes it possible to implement the execution model without having to adapt the system software.
  • however, such an execution model significantly reduces the potential performance, because the exceptions are precisely designed so that the SPP can optimize the exploitable parallelism, by intelligently prompting the system software to perform a task switch as soon as the executing task is blocked in its execution, whether for resource problems or for synchronization issues.
  • the APU manager relies on the services of association between logical APUs and physical APUs and vice versa.
  • the APU manager is the preferred interface with the system software.
  • when the system software is adapted and the automatic APU management option is implemented as detailed later, system software intervention is limited to the strict minimum. It intervenes only when it is advantageous to switch tasks at the system level to optimize parallelism.
  • at this level, the system software is the only one to have the complete information needed to put the SPP in energy-saving ("idle") mode, which could advantageously be coupled with energy management of the SPP itself.
  • the APU manager has all the information needed for APU management. Thanks to its advance knowledge of possible task switches, the APU manager can even manage APU standby and wake-up.
  • when the queue of processing waiting for an APU is empty and there are APUs without work, they can be put on standby, at least one being woken up when the ACU sends an interrupt to the SPP or when a new instruction from the SPP is executed on the ACU, in particular the SPP partial context change instruction described later.
  • frequency adaptation, if implemented, must be adapted to each processing. It is therefore the responsibility of the user code and of the system software on the SPP, and it requires special local registers to store APU configurations.
  • the APU partial context manager, which is optional, only contains the start contexts of the processing on APU0, APU1, ..., APUN-1. In the preferred implementation, it is not envisaged that it contain complete hardware contexts, let alone extended contexts, as this would take up a lot of space.
  • a basic principle is to have tasks whose execution time is relatively low, of the order of a few tens of thousands of execution cycles, so that the need for partial preemption of the APUs does not really arise.
  • the counterpart of this assumption is that the start of processing on the APUs must be very fast.
  • the simplifications made in this patent make this possible. These include storing the TLBs for access to the mass memory MM. The rest of the context, in particular the registers, is not stored.
  • the processing on the APUs is preferably not preemptible, because it is in principle relatively short. If the APUs are optimized for simple start-up and simple shutdown on well-defined code boundaries, then their "cold" start on a new processing should be possible within a few cycles.
  • the processing code may be generic, but then the logical processing addresses for the APUs are constant in the code, and the physical addresses are given by the few TLB entries of the partial context provided at startup. In this way, single-program, multiple-data (SPMD) processing can easily be performed on the APUs, even in the case of cascaded or pipelined processing.
  • SPMD single program multiple data
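The partial start context described above can be sketched as a toy model in C: only a few TLB entries mapping the code's constant logical addresses to physical addresses in the mass memory MM, with no register save. The entry count, field names and the translation helper are all assumptions for illustration.

```c
#include <stdint.h>

#define N_CTX_TLB 4 /* "a few TLB entries" -- illustrative count */

/* One mapping from a logical (virtual) page to a physical page in MM. */
typedef struct {
    uint32_t vpn; /* logical page number, constant in the generic code */
    uint32_t ppn; /* physical page number in the mass memory MM        */
} tlb_entry_t;

/* Partial start context of one APU processing: TLB entries only. */
typedef struct {
    tlb_entry_t tlb[N_CTX_TLB];
} apu_partial_context_t;

/* Translate a logical address through the partial context's TLB entries;
 * returns -1 when no entry maps the address. */
int64_t ctx_translate(const apu_partial_context_t *c, uint32_t vaddr,
                      uint32_t page_bits) {
    uint32_t vpn = vaddr >> page_bits;
    for (int i = 0; i < N_CTX_TLB; i++)
        if (c->tlb[i].vpn == vpn)
            return ((int64_t)c->tlb[i].ppn << page_bits)
                 | (vaddr & ((1u << page_bits) - 1));
    return -1; /* no mapping */
}
```

Because only these few entries (and no registers) form the context, loading it at launch stays cheap, which is what makes the few-cycle "cold" start plausible.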
  • PHYSICAL APU MANAGER
  • the management of physical APUs essentially uses storage structures to associate a physical APU with a processing allocated on the SPP. In this example, management is organized as a double queue: one queue for the free physical APUs and another for the physical APUs currently processing. In another embodiment, a single data structure can be used for both. The physical APU management functions can then be implemented using priority encoders or associative memories.
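The double-queue organization just described can be sketched as follows (illustrative C, not the patent's implementation): a FIFO of free physical APU ids from which allocations are served and to which terminated APUs return.

```c
#include <stdint.h>

#define N_PHYS_APUS 4 /* illustrative number of physical APUs */

/* Pool of physical APUs organized as a circular FIFO of free ids. */
typedef struct {
    int8_t free_q[N_PHYS_APUS];
    int head, tail, n_free;
} phys_apu_pool_t;

void pool_init(phys_apu_pool_t *p) {
    for (int i = 0; i < N_PHYS_APUS; i++) p->free_q[i] = (int8_t)i;
    p->head = 0; p->tail = 0; p->n_free = N_PHYS_APUS;
}

/* Take the first free physical APU, or -1 when all are busy (the request
 * would then go to the queue of processing waiting for a physical APU). */
int pool_acquire(phys_apu_pool_t *p) {
    if (p->n_free == 0) return -1;
    int id = p->free_q[p->head];
    p->head = (p->head + 1) % N_PHYS_APUS;
    p->n_free--;
    return id;
}

/* Return a physical APU to the free queue once its processing terminates. */
void pool_release(phys_apu_pool_t *p, int id) {
    p->free_q[p->tail] = (int8_t)id;
    p->tail = (p->tail + 1) % N_PHYS_APUS;
    p->n_free++;
}
```

The later remark about replacing queue management with a busy/free bit per APU corresponds to dropping the FIFO and scanning (or priority-encoding) that bit vector instead.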
  • the logical APU manager seeks to associate a physical APU with the logical APU whose request to execute a process is made. The procedure is therefore simple: the management of physical APUs is made up of free physical APUs.
  • the processing queue is limited to a maximum depth of NAPUIogiques-NAPUphysiques where NAPUIogiques is the number of logical APUs offered and NAPUphysiques is the number of physical APUs available in total.
  • the delivery phase of the APU in the list of free APUs can be bypassed, removing the first pending process from the list and assigning it to the list. APU released.
  • setting 0 or 1 of a data bit associated with the physical APU is sufficient to mark it as busy or free. The queue management then no longer exists, but the rest of the implementation does not vary.
  • the management of physical APUs leads to two sets 0 : the set of free physical APUs and the set of physical APUs allocated. Optionally, there may also be a queue for processes waiting for a physical APU.
  • physical APUs are associated with a logical APU number. They can also be associated with a number in an optional contexts table, this table being detailed later. Finally, they can be associated with an optional validity bit, an APU having for example a validity indicator set to "true” (1) when it is allocated, to "false” (0) otherwise.
  • the services provided by the physical APU manager MAP may advantageously be as follows:
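As an illustration only (not part of the claimed invention), the two-queue scheme described above can be sketched as follows; the class and method names are hypothetical, chosen to mirror the description of the MAP physical APU manager:

```python
from collections import deque


class PhysicalAPUManager:
    """Sketch of the MAP physical APU manager: one queue of free
    physical APUs and one set of physical APUs currently processing,
    each busy APU bound to a logical APU number."""

    def __init__(self, n_physical):
        # Physical APUs are numbered from 1, as in the example sequence.
        self.free = deque(range(1, n_physical + 1))
        self.busy = {}  # physical APU number -> logical APU number

    def allocate(self, logical_apu):
        """Take the first free physical APU and bind it to a logical APU.

        Returns None when the free list is empty; the caller (the ACU)
        must then queue the logical APU until a physical APU is released.
        """
        if not self.free:
            return None
        phys = self.free.popleft()
        self.busy[phys] = logical_apu
        return phys

    def release(self, phys):
        """Return a physical APU to the free set at end of processing."""
        logical = self.busy.pop(phys)
        self.free.append(phys)
        return logical
```

With two physical APUs, a third allocation request returns None, matching the behaviour where logical APUs are queued while awaiting a physical APU.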
  • logical APU management essentially uses a storage structure. It should be noted that logical APU management tends to become partly redundant as the number of logical APUs approaches the number of physical APUs.
  • the logical APU manager LAM may return the first available free logical APU. If no logical APU is available, the LAM sends a signal to the APU manager, which then issues an exception or waits for a logical APU to free itself, as explained above.
  • the APU with the transmitted number is returned to the set of free APUs.
  • partial context management for the SPP can be implemented. It can be related to the management of SPP contexts for physical APUs as explained above.
  • the organization of the data structures makes it possible to distinguish a set of free logical APUs and a set of allocated logical APUs.
  • Each entry is associated with a physical APU number and a validity bit.
  • the validity bit indicates whether the physical APU associated with the number is actually associated with this logical APU or not.
  • an optional exception request bit may be used, as well as a field for an SPP context number. Clearly, there is information in common between the structure describing the allocation of the physical APUs and that of the logical APUs. This leaves a choice between implementation variants: two memory structures updated in parallel, or a single associative memory structure.
  • the services provided by the logical APU manager LAM can advantageously be the following: allocation of a logical APU number, for example the first element of the list of free logical APUs, optionally together with an entry for a partial SPP context descriptor;
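As a non-limiting illustration, the logical APU table described above (each entry holding a physical APU number and a validity bit, with optional exception-request and SPP-context fields) can be sketched as follows; all names are hypothetical:

```python
class LogicalAPUManager:
    """Sketch of the LAM logical APU manager: a list of free logical
    APU numbers plus a table indexed by logical APU number, each entry
    carrying a physical APU number and a validity bit."""

    def __init__(self, n_logical):
        self.free = list(range(1, n_logical + 1))
        # valid=False means no physical APU is actually associated.
        self.table = {i: {"phys": None, "valid": False} for i in self.free}

    def allocate(self):
        """Return the first free logical APU number, or None if exhausted.

        In the described embodiment, exhaustion would instead cause the
        LAM to signal the APU manager, which raises an exception or waits.
        """
        return self.free.pop(0) if self.free else None

    def bind(self, logical, phys):
        """Associate a physical APU with a logical APU (validity bit set)."""
        self.table[logical] = {"phys": phys, "valid": True}

    def release(self, logical):
        """Clear the entry and return the number to the free list."""
        self.table[logical] = {"phys": None, "valid": False}
        self.free.append(logical)
```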
  • the interrupts originating from the APUs can be reformatted into a form that is simpler for the APU manager to process.
  • Some interrupts, such as runtime errors on an APU, are reported directly to the SPP termination unit where they are treated as global exceptions to the SPP.
  • the other signals to be reformatted for the APU manager concern the end of execution of processing in progress on the APUs, as well as any signals of potential interest, such as signals specific to program debugging operations on the APU ("debug").
  • a role of the APUID interrupt decoder is to serialize the various signals of interest to the APU manager. Once a signal has been relayed to the APU manager, the decoder also provides the interrupt acknowledgment signals to the APUs that emitted them, as the signals are processed. It should be noted that the preferred implementation includes an event buffer, which makes it possible to release the signaling line between the APU and the ACU as early as possible, and thus to allocate released APUs to new jobs as early as possible.
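Purely as an illustration of the event-buffer idea above, the serialization can be sketched as follows; the event format and names are assumptions, not taken from the patent:

```python
from collections import deque


class APUInterruptDecoder:
    """Sketch of the APUID interrupt decoder: it buffers end-of-processing
    and debug signals from the APUs, acknowledging each one immediately so
    the signaling line is freed early, then delivers the events one at a
    time to the APU manager."""

    def __init__(self):
        self.events = deque()  # buffered (apu_id, kind) events
        self.acked = []        # acknowledgments already sent back to APUs

    def raise_interrupt(self, apu_id, kind):
        """An APU raises a signal: buffer it and acknowledge at once."""
        self.events.append((apu_id, kind))
        self.acked.append(apu_id)

    def next_event(self):
        """Serialized delivery of the next event to the APU manager."""
        return self.events.popleft() if self.events else None
```

Because the acknowledgment is issued when the event is buffered rather than when it is consumed, the emitting APU can be reallocated to a new job without waiting for the APU manager.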
  • the SPP partial context manager SPPPCM is optional; it simply makes it possible to release the logical APUs allocated with a synchronization request transparently and as early as possible.
  • This functionality is normally assigned to the system software, but in this exemplary embodiment the ACU may act as a dedicated accelerator to offload some of the work of the system software. This significantly decreases context switching and processing times.
  • this mechanism can be disengaged by the system software. To activate it, the system software must enter the address of the context synchronization mask as previously explained; otherwise the mechanism is deactivated by default. Whenever this address is updated by the system software, the provided address is compared with those already present in the SPP partial contexts table.
  • the SPPPCM preselects a free entry of this table. It is fully allocated only when a logical APU is allocated, the entry number then being filled in the table described previously.
  • the allocation field of the logical APUs is updated in this same table, the bit corresponding to the allocated logical APU being set to 1.
  • the field of the unsynchronized logical APU table is updated with the current value corresponding to the logical APU number associated with the allocated physical APU.
  • the field of the physical APU table previously described is filled with the entry number in the SPP context table.
  • the synchronization field of the table is updated, as is the allocation field. The values are written to memory at the address provided. If the fields return to zero, the entry of the context table is released. In any case, the corresponding logical APU is released, its use by another task having been correctly recorded.
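The bitfield update described above can be sketched as follows, purely for illustration; the field names ("alloc", "sync") and the bit numbering (logical APU n mapped to bit n-1, consistent with the masks used in the example of FIG. 3) are assumptions:

```python
def release_logical_apu(entry, logical_apu):
    """Sketch of the SPPPCM update at end of processing: clear the bit of
    the released logical APU in both the allocation field and the
    synchronization field of an SPP partial-context entry.

    Returns True when both fields have returned to zero, i.e. when the
    context-table entry itself can be released."""
    bit = 1 << (logical_apu - 1)
    entry["alloc"] &= ~bit
    entry["sync"] &= ~bit
    return entry["alloc"] == 0 and entry["sync"] == 0
```

For example, an entry tracking logical APUs 1 and 2 (fields at 0x3) is freed only once both APUs have been released.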
  • an advantage of the SPPPCM is that logical APUs can be released on the fly, even if execution is multitasking with independent uses of the APUs inside different tasks. Logical APUs can therefore be made available to the current context without switching to the context that allocated the logical APU in question. Moreover, some interfaces can be useful for optimal functioning of the APUs in relation to the ACU, for example:
  • an interrupt and exception line APU IT can be connected to the APUID interrupt decoder.
  • This line can be coupled to a bus or a line for the transmission of ancillary information, such as an instruction pointer in the event of a runtime error on an APU. Activation of this line by an APU then blocks said APU until the ACU returns an acknowledgment signal of the exception.
  • This line may in particular be responsible for signaling the processing ends on an APU;
  • a partial context loading unit in particular for the TLBs, can advantageously be coupled to a reset mechanism of the other usage registers of the corresponding APU, if this makes sense for the APU in question;
  • units can manage the production/consumption of data, in order to coordinate the execution of the different stages of software pipelines implemented by processing on the APUs.
  • the producer then uses part of the shared mass memory MM to atomically write data production indicators. At least one consumer likewise resets the indicator atomically.
  • This functionality can easily be performed in software, but it can advantageously be implemented by a dedicated hardware accelerator. Many processing operations can thus be chained without requiring synchronization by the ACU. The number of operations to be performed by the ACU can thereby be significantly reduced, markedly improving the efficiency of parallel execution of the system.
  • This mechanism aims to take over part of the data dependency management, the other part being handled by the synchronization instructions. When the consumption management mechanisms are implemented, the line between the APU manager and the APUs can also be used to transmit the operating mode of the APU.
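The production/consumption indicator described above can be sketched as follows, for illustration only; a lock stands in for the hardware atomic operation on the shared mass memory MM, and all names are hypothetical:

```python
import threading


class ProductionIndicator:
    """Sketch of a data production/consumption indicator kept in shared
    memory: the producer atomically increments it, a consumer atomically
    reads and resets it, so pipeline stages can coordinate without the
    ACU having to synchronize each handoff."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for an atomic operation
        self._produced = 0             # number of data items signalled

    def produce(self, n=1):
        """Producer side: atomically record that n items are available."""
        with self._lock:
            self._produced += n

    def consume(self):
        """Consumer side: atomically read and reset the indicator."""
        with self._lock:
            n, self._produced = self._produced, 0
            return n
```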
  • FIG. 3 illustrates, by a timing diagram, an example of a sequence of processing operations on the architecture of FIGS. 1 and 2. This example is intended in particular to show, on the one hand, the cooperation between the ACU and the basic system software and, on the other hand, the optimization of the exploitation of the ancillary processing resources, the APUs.
  • This is an extract of an application composed of three active tasks:
  • a task T1: allocation of two APUs, execution on both APUs, then synchronization of the two APUs for the context;
  • a task T2: allocation of three APUs, execution on the three APUs, then synchronization of the three APUs for the context;
  • a task T3: allocation of an APU then execution with implicit release, followed by the allocation of an APU, execution, then synchronization.
  • T1 allocates an APU and logical number 1 (first number in the list of free logical APUs) is returned to T1.
  • T1 allocates a new APU and logical number 2 (first number in the list of free logical APUs) is returned to T1.
  • T1 requests the execution of a processing on logical APU 1; the physical APU 1 (first number in the list of free physical APUs) is reserved by the ACU.
  • T1 requests the execution of a processing on logical APU 2; the physical APU 2 (first number in the list of free physical APUs) is reserved by the ACU.
  • T1 requests synchronization on the end of processing of logical APUs 1 and 2; the corresponding physical APUs have not completed their processing, the execution mask being at 0x3, so the ACU raises an exception E1 called "synchronization requested but processing in progress".
  • the system software captures this exception E1 and switches task T1 out, putting task T2 in the foreground on the SPP instead of T1.
  • T2 allocates a logical APU; the ACU allocates logical APU 3 to it (first number in the list of free logical APUs).
  • T2 allocates a logical APU again; this time logical APU 4 is assigned to it.
  • T2 allocates a logical APU again; Logical APU 5 is assigned to task T2.
  • T2 requests the execution of a processing on logical APU 3 (the first allocated by T2); the ACU allocates the physical APU 3 (first free physical APU) and requests the start of execution.
  • T2 requests the execution of a processing on logical APU 4 (second allocated by T2); the ACU allocates the physical APU 4 (first free physical APU) and requests the start of execution.
  • T2 requests the execution of a processing on the logical APU 5 (third allocated by T2); there are no more free physical APUs (empty list), so the ACU places logical APU 5 in the queue of logical APUs waiting for a physical APU for execution.
  • T2 requests synchronization on the three allocated logical APUs, the synchronization mask being at 0x1c; none of the physical APUs has finished executing, one of the logical APUs not even being allocated to a physical APU; an exception E2 is therefore raised.
  • physical APU 1 finishes its processing; the ACU updates the execution status for task T1, the logical APU execution mask for T1 changing from 0x3 to 0x2; since the list of logical APUs waiting for a physical APU is not empty, the ACU associates physical APU 1 with logical APU 5 for T2, its execution mask remaining unchanged at 0x1c; the list of free logical APUs contains logical APU 1, since it has been released for T1, even though the synchronization instruction has not yet been executed.
  • T3 allocates a logical APU; logical APU 1 is assigned to it (the only free one); T3 then requests execution of processing on logical APU 1 with implicit release after the processing; since there is no free physical APU, the ACU places logical APU 1 in the queue of logical APUs waiting for a physical APU.
  • T3 attempts to allocate a new logical APU, but none is available; an exception E3 named "no more logical APU available" is raised and processed by the system software; T3 is switched out.
  • physical APU 2 finishes its processing; logical APU 2, associated with it, is released at the execution level for T1 and the remaining synchronization mask for T1 changes from 0x2 to 0x0; T1, which was switched out due to synchronization, can now synchronize; the system software can therefore switch execution, which had been suspended for T3, back to T1.
  • the T1 synchronization instruction is executed on return from the task switch; since the two logical APUs had been released for task T1, the synchronization instruction executes without raising an exception, which is why the system software had switched back to T1; formally, this is the moment in the execution of T1 where the user program on the SPP is certain to have released the two logical APUs.
  • T1 releases the SPP, at least for a certain time of use of the calculation results; the system software regains control and puts T3 back in the foreground, T3 having been switched out for lack of an available logical APU.
  • T3 executes the allocation instruction of a logical APU; this time, logical APU 2 is free, so it is allocated to T3.
  • physical APU 3 finishes its processing; the associated logical APU (here logical APU 3, allocated to T2) is released; the synchronization mask for T2 changes from 0x1c to 0x18.
  • T3 requests execution on logical APU 2, which it has allocated; the local execution mask is 0x3; physical APU 3, which is free, is associated with logical APU 2 for T3.
  • physical APU 4 finishes its processing; the ACU places physical APU 4 in the list of free physical APUs and logical APU 4 in the list of free logical APUs; the execution mask for T2 is set to 0x10, as is the residual synchronization mask.
  • T3 requests synchronization on logical APU 2, which is allocated on physical APU 3; since the physical APU has not finished running, an exception is raised for the synchronization, the synchronization mask being at 0x2; the basic system software takes over, but there is no unblocked task left to schedule; the system software enables interruption on APU termination and puts the SPP into power-save mode.
  • T2 resumes its execution at the synchronization instruction, which does not raise an exception since the synchronization mask is at 0; it continues until it returns to the system at a time t2/9.
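For illustration only, the mask arithmetic running through the sequence above (each logical APU n mapped to bit n-1 of a task's synchronization mask) can be sketched as follows; the function name is hypothetical:

```python
def on_apu_done(sync_mask, logical_apu):
    """Sketch of the ACU update when a logical APU finishes: clear the
    corresponding bit in the task's synchronization mask.

    Returns the new mask and whether a task blocked on synchronization
    can now resume (mask back at zero, as for T1 in the example)."""
    mask = sync_mask & ~(1 << (logical_apu - 1))
    return mask, mask == 0
```

Replaying the example: T1's mask goes from 0x3 to 0x2 when logical APU 1 finishes (T1 stays blocked), then to 0x0 when logical APU 2 finishes (T1 can synchronize); T2's mask goes from 0x1c to 0x18 when logical APU 3 finishes.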
  • This example sequence of processing operations shows how well the present invention achieves excellent occupation of the ancillary processing resources APU1, APU2, APU3 and APU4.
  • the present invention also has the main advantage of using an ordinary processor to manage medium-grain parallelism, this parallelism being, at the level of data dependence, in accordance with what has been computed by the compiler of the application.
  • optimal use of medium-grain parallelism can be achieved by the automatic management of the auxiliary processing elements, thanks to the processing of a very limited number of simple specific instructions introduced at compilation of the task. Apart from these instructions, the rest of the parallelism management is automatic.
  • the present invention proposes a high-level interface which abstracts the use of the processing units, this interface allowing allocation at the latest and release at the earliest of the auxiliary processing units, the correspondence with the real units being entirely the responsibility of the present invention.
  • the system software may be granted the right to modify this correspondence, through access to the internal registers of the present invention, in particular cases where compilation and static analysis of the execution of a task have not been performed.
  • the present invention provides the mechanisms necessary for activating the system software, in order to perform a task switch as soon as the executing task is blocked in its management of medium-grain parallelism.
  • the invention also provides the mechanisms necessary for automatically updating the task blocking indicators as soon as these blockages are removed, without requiring intervention of the system software, so as to improve the implementation of the two levels of parallelism, coarse and medium grain.
  • the invention also provides the necessary mechanisms for the system software to easily choose which tasks are activatable.
  • the invention also offers, partly at the single-task level (user code) and partly at the multitasking level (system software), the mechanisms necessary to manage the number of processing units used at a given moment.
  • the overall system is better able to maintain processing delays, despite the sharing of processing units between several running tasks.
  • the invention provides the mechanisms necessary to manage advanced energy-saving features, by facilitating the power-save mode of both the auxiliary processing units and the main processor that integrates the present invention.
  • a system according to the invention makes it possible to execute specialized instructions in the management of parallelism in heterogeneous multi-core systems. Once the parallelism management instructions are given, the management of the parallelism on the multi-core architecture becomes automatic and does not need assistance in a one-shot execution frame on the processor that integrates the invention. In the multitasking framework in particular, the invention becomes both an assistant and a specific accelerator for the management of medium grain parallelism for the system software present on the processor that integrates the invention.
  • the invention described above allows the greatest possible autonomy in the management of the auxiliary processing units, this without the system software having to intervene for the execution of the task.
  • System software only intervenes in cases where there is no other alternative, such as error cases or cases where it is necessary to wait for synchronization.
  • the invention described above enables the system software to perform task switching, in order to optimize the use of the different levels of parallelism.
  • the invention described above confers a global execution determinism which is close to that conferred by a conventional Von Neumann architecture.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Advance Control (AREA)
EP10705159A 2009-02-24 2010-02-22 Zuordnungs- und überwachungseinheit Withdrawn EP2401676A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR0900833A FR2942556B1 (fr) 2009-02-24 2009-02-24 Unite d'allocation et de controle
PCT/EP2010/052215 WO2010105889A1 (fr) 2009-02-24 2010-02-22 Unité d'allocation et de contrôle

Publications (1)

Publication Number Publication Date
EP2401676A1 true EP2401676A1 (de) 2012-01-04

Family

ID=41396266

Family Applications (1)

Application Number Title Priority Date Filing Date
EP10705159A Withdrawn EP2401676A1 (de) 2009-02-24 2010-02-22 Zuordnungs- und überwachungseinheit

Country Status (4)

Country Link
US (1) US8973009B2 (de)
EP (1) EP2401676A1 (de)
FR (1) FR2942556B1 (de)
WO (1) WO2010105889A1 (de)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725915B2 (en) 2010-06-01 2014-05-13 Qualcomm Incorporated Virtual buffer interface methods and apparatuses for use in wireless devices
US8527993B2 (en) 2010-06-01 2013-09-03 Qualcomm Incorporated Tasking system interface methods and apparatuses for use in wireless devices
KR101710910B1 (ko) 2010-09-27 2017-03-13 삼성전자 주식회사 프로세싱 유닛의 동적 자원 할당을 위한 방법 및 장치
US9129060B2 (en) 2011-10-13 2015-09-08 Cavium, Inc. QoS based dynamic execution engine selection
US9128769B2 (en) 2011-10-13 2015-09-08 Cavium, Inc. Processor with dedicated virtual functions and dynamic assignment of functional resources
US8933942B2 (en) * 2011-12-08 2015-01-13 Advanced Micro Devices, Inc. Partitioning resources of a processor
GB2507484A (en) * 2012-10-30 2014-05-07 Ibm Limiting the number of concurrent requests in a database system
US10599453B1 (en) * 2017-09-25 2020-03-24 Amazon Technologies, Inc. Dynamic content generation with on-demand code execution
US10721172B2 (en) 2018-07-06 2020-07-21 Marvell Asia Pte, Ltd. Limiting backpressure with bad actors

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2987308B2 (ja) * 1995-04-28 1999-12-06 松下電器産業株式会社 情報処理装置
US6505290B1 (en) * 1997-09-05 2003-01-07 Motorola, Inc. Method and apparatus for interfacing a processor to a coprocessor
JP3829504B2 (ja) 1998-02-16 2006-10-04 株式会社デンソー 情報処理装置
JP2002041489A (ja) * 2000-07-25 2002-02-08 Mitsubishi Electric Corp 同期信号生成回路、それを用いたプロセッサシステムおよび同期信号生成方法
GB2378271B (en) * 2001-07-30 2004-12-29 Advanced Risc Mach Ltd Handling of coprocessor instructions in a data processing apparatus
US6944747B2 (en) * 2002-12-09 2005-09-13 Gemtech Systems, Llc Apparatus and method for matrix data processing
US7039914B2 (en) * 2003-03-07 2006-05-02 Cisco Technology, Inc. Message processing in network forwarding engine by tracking order of assigned thread in order group
US6987517B1 (en) 2004-01-06 2006-01-17 Nvidia Corporation Programmable graphics processor for generalized texturing
ATE406613T1 (de) * 2004-11-30 2008-09-15 Koninkl Philips Electronics Nv Effiziente umschaltung zwischen priorisierten tasks
US20060130062A1 (en) * 2004-12-14 2006-06-15 International Business Machines Corporation Scheduling threads in a multi-threaded computer
FR2893156B1 (fr) * 2005-11-04 2008-02-15 Commissariat Energie Atomique Procede et systeme de calcul intensif multitache et multiflot en temps reel.
US20080140989A1 (en) 2006-08-13 2008-06-12 Dragos Dudau Multiprocessor Architecture With Hierarchical Processor Organization
US8271989B2 (en) * 2008-02-07 2012-09-18 International Business Machines Corporation Method and apparatus for virtual processor dispatching to a partition based on shared memory pages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO2010105889A1 *

Also Published As

Publication number Publication date
FR2942556A1 (fr) 2010-08-27
US20110314478A1 (en) 2011-12-22
WO2010105889A1 (fr) 2010-09-23
FR2942556B1 (fr) 2011-03-25
US8973009B2 (en) 2015-03-03

Similar Documents

Publication Publication Date Title
WO2010105889A1 (fr) Unité d'allocation et de contrôle
AU2019392179B2 (en) Accelerating dataflow signal processing applications across heterogeneous CPU/GPU systems
JP3771957B2 (ja) プロセッサ・アーキテクチャにおける分散制御のための装置および方法
JP4936517B2 (ja) ヘテロジニアス・マルチプロセッサシステムの制御方法及びマルチグレイン並列化コンパイラ
US7353517B2 (en) System and method for CPI load balancing in SMT processors
US20070150895A1 (en) Methods and apparatus for multi-core processing with dedicated thread management
EP3238056B1 (de) Verfahren zum organisieren von aufgaben an den knoten eines computer-clusters, zugehöriger aufgabenorganisator und cluster
US20150268956A1 (en) Sharing idled processor execution resources
WO2007051935A1 (fr) Procede et systeme de calcul intensif multitache et multiflot en temps reel
KR100985318B1 (ko) 운영 체계 서비스의 투명한 지원을 위한 방법 및 제품
FR2937439A1 (fr) Procede d'execution deterministe et de synchronisation d'un systeme de traitement de l'information comportant plusieurs coeurs de traitement executant des taches systemes.
EP2232368A1 (de) SYSTEM mit mehreren Verarbeitungseinheiten, die es unmöglich machen, Aufgaben gleichzeitig auszuführen durch mischung der Ausführungsart des Steuertyps und der Ausführungsart des Datenstromtyps
EP2350836B1 (de) Einrichtung zur verwaltung von datenpuffern in einem in mehrere speicherelemente aufgeteilten speicherraum
Vaishnav et al. Heterogeneous resource-elastic scheduling for CPU+ FPGA architectures
CA2348069A1 (fr) Systeme et methode de gestion d'une architecture multi-ressources
EP2282265A1 (de) Hardware-Ablaufsteuerung
Zheng et al. HiWayLib: A software framework for enabling high performance communications for heterogeneous pipeline computations
Strøm et al. Chip-multiprocessor hardware locks for safety-critical Java
EP2545449A1 (de) Verfahren zum konfigurieren eines it-systems, zugehöriges computerprogramm und it-system
Bechara et al. AHDAM: an asymmetric homogeneous with dynamic allocator manycore chip
CN117608532A (zh) 一种基于国产多核DSP的OpenMP实现方法
Olaya et al. Runtime Pipeline Scheduling System for Heterogeneous Architectures
Compton et al. Operating System Support for Reconfigurable Computing
Bechara Study and design of a manycore architecture with multithreaded processors for dynamic embedded applications

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20110727

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

RIN1 Information on inventor provided before grant (corrected)

Inventor name: DAVID, VINCENT

Inventor name: DAVID, RAPHAEL

Inventor name: LOUISE, STEPHANE

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20180514

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20180925