EP2438528A1 - Method and device for loading and executing instructions with deterministic cycles in a multi-core avionics system having a bus whose access time is unpredictable - Google Patents

Method and device for loading and executing instructions with deterministic cycles in a multi-core avionics system having a bus whose access time is unpredictable

Info

Publication number
EP2438528A1
Authority
EP
European Patent Office
Prior art keywords
memory
execution
core
cores
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP10734208A
Other languages
English (en)
French (fr)
Inventor
Victor Jegu
Benoît TRIQUET
Frédéric ASPRO
Claire PAGETTI
Frédéric BONIOL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Airbus Operations SAS
Original Assignee
Airbus Operations SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Airbus Operations SAS filed Critical Airbus Operations SAS
Publication of EP2438528A1
Ceased legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/167 Interprocessor communication using a common memory, e.g. mailbox
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1652 Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1663 Access to shared memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802 Instruction prefetching

Definitions

  • the present invention relates to the architecture of avionics systems and more particularly to a method and a device for loading and executing instructions with deterministic cycles in a multi-core avionics system having a bus whose access time is not predictable.
  • Modern aircraft include more and more electronic and computer systems to improve their performance and assist pilots and crew members on their missions.
  • the electric flight controls reduce the mechanical complexity of the transmission of commands to the actuators and therefore the mass associated with these commands.
  • the presentation of relevant information allows the pilot to optimize flight paths and respond quickly to any incident detected.
  • Such information includes speed, position, heading, meteorological and navigation data.
  • All of these electronic and computer systems are generally called avionics.
  • avionics has often been functionally distributed by specific modules, also called LRUs (Line Replaceable Units).
  • the flight controls are managed in a particular device while the power supply is managed in another. A specific function is thus associated with each module.
  • each module supporting a critical function is preferably redundant so that the failure of a module does not lead to the loss of the associated function.
  • the operation of an aircraft using a redundant module when the main module is faulty requires a maintenance operation.
  • avionics is now more and more integrated according to an architecture called IMA (Integrated Modular Avionics).
  • the functionalities are decoupled from the systems, that is to say the computers or computing resources in which they are implemented. Nevertheless, a segregation system makes it possible to isolate each of the functionalities so that the failure of one function has no influence on another.
  • Such systems implement different modules, in particular data processing modules, called CPMs (Core Processing Modules), data switching modules, called ASMs (Avionic Switch Modules), and power supply modules, called PSMs (Power Supply Modules).
  • the data processing modules include so-called "high performance" modules for general avionics applications, so-called "critical time" modules for avionics applications with strong temporal determinism constraints, and server-type modules for non-critical, open-world applications.
  • a data processing module is generally composed of one or more processors, also called CPUs (Central Processing Units), associated with one or more memory banks of RAM (Random Access Memory) and flash type.
  • the communications between several CPUs of a CPM are preferably provided by means of direct links to a shared memory or through an exchange memory of a communication interface, for example an AFDX (Avionics Full DupleX) interface.
  • a data processing module of the critical-time type is called a CPM TC.
  • FIG. 1 schematically illustrates a CPM implementing such an architecture.
  • the CPM 100 here comprises four "single-core" processors 105-1 to 105-4 and, associated with each processor, memories of DDRAM type (Double Data Rate RAM), generically referenced 110, and of flash type, generically referenced 115.
  • the CPM comprises a set 120 of logic circuits allowing in particular the processors 105-1 to 105-4 to exchange data with other components of the aircraft via an input / output module 125.
  • CPM TCs do not typically implement processors based on multi-core architectures using cache memories. Indeed, CPM TCs need strong determinism of their execution times, and cache memories create a variability which is difficult to determine, due to a history effect whereby, depending on past events, a piece of information may or may not still be cached; it may then be necessary to reload it without this being determinable in advance. The same is true for the pipelined instruction sequences of processor cores and memory controllers, in which instructions can be spread over several cycles, thus creating history dependencies.
  • CPM TCs must therefore exclude the mechanisms responsible for these variabilities and apply margins when determining execution times in advance, which makes the use of multi-core processors inefficient.
  • the invention solves at least one of the problems discussed above. More particularly, it makes it possible to determine in advance the use of the cache memories of multi-core systems so that memory latency is no longer a factor limiting performance.
  • the invention also makes it possible, in a multi-core, multi-processor architecture, or more generally one with a shared processor bus, to obtain the independence of the computing cores and the determination of non-pessimistic WCETs (Worst-Case Execution Times).
  • the independence from cache memory latency allows the determination of the WCET even if the memory and memory controller models are imprecise.
  • the subject of the invention is thus a method for loading and executing, with deterministic execution cycles, a plurality of instructions in an avionics system comprising at least one processor having at least two cores and at least one memory controller, each of said at least two cores having a private memory, said plurality of instructions being loaded and executed in execution slices, the method comprising the following steps:
    o during a first execution slice: authorization of access to said at least one memory controller for a first of said at least two cores, said first core transmitting to said at least one memory controller at least one previously modified data item stored in its private memory, and receiving at least one data item and at least one instruction of said plurality of instructions, said at least one data item and said at least one received instruction being stored in its private memory; and prohibition of access to said at least one memory controller for a second of said at least two cores, said second core executing at least one instruction previously stored in its private memory;
    o during a second execution slice: prohibition of access to said at least one memory controller for said first core, said first core executing at least one instruction previously stored in its private memory; and authorization of access to said at least one memory controller for said second core, said second core transmitting to said at least one memory controller at least one previously modified data item stored in its private memory, and receiving at least one data item and at least one instruction of said plurality of instructions, said at least one data item and said at least one received instruction being stored in its private memory.
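  • a minimal model of this alternation is sketched below (hypothetical Python; the role and function names are ours, not the patent's): at every slice boundary the cores swap between a transfer role, in which they alone may use the memory controller, and an execution role, in which they run entirely from private memory.

```python
# Minimal sketch (names ours) of the alternating transfer/execution
# slices: a core in the transfer role flushes and reloads its private
# memory; a core in the execution role never touches the memory controller.

def role(core_index: int, slice_index: int) -> str:
    """Roles swap at every slice boundary (rendezvous point)."""
    return "transfer" if (core_index + slice_index) % 2 == 0 else "execute"

def run_schedule(num_cores: int = 2, num_slices: int = 4) -> None:
    for s in range(num_slices):
        transferring = [c for c in range(num_cores) if role(c, s) == "transfer"]
        executing = [c for c in range(num_cores) if role(c, s) == "execute"]
        print(f"slice {s}: cores {transferring} flush+load, "
              f"cores {executing} execute from private memory")

if __name__ == "__main__":
    run_schedule()
```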
  • the method according to the invention thus makes it possible to use technologies based on multi-core processors having buses whose access time is unpredictable, for applications with strong temporal determinism constraints.
  • the method allows the use of burst-mode memories of the DDRx type, cores working at frequencies above 1 GHz, the implementation of massively parallel architectures and electronic integration as single components.
  • for the execution model to be effective, the duration of the memory access phases is advantageously less than the total time a core would otherwise spend waiting for the completion of each of these accesses.
  • Another significant advantage is the simplification and strong reduction of the pessimism of WCET calculations by static analysis, owing to the presence in private memory of the data used in the calculation phases.
  • Another advantage concerns static analysis tools based on a processor model: since the tool does not have to consider scenarios including access to shared memory in its analyses, the processor model can be reduced to the core and its private memories.
  • said at least one processor further comprises at least one second memory controller, the method further comprising the following steps,
  • the method thus allows cores to access shared memories to execute instructions using common data.
  • at least one of said at least two cores is dedicated to data transmission and reception operations to and from a network communication interface to simplify the modeling of the processor.
  • the invention also relates to a method of processing a plurality of instructions to enable the loading and execution, with deterministic execution cycles, of said plurality of instructions according to the method described above, the processing method comprising a step of cutting said plurality of instructions into execution slices, each execution slice comprising a transfer sequence and an execution sequence, said transfer sequence allowing the transmission of at least one previously stored data item and the reception and storage of at least one data item and at least one instruction, said at least one received data item being necessary for the execution of said at least one received instruction and allowing the execution of said at least one received instruction, autonomously, during the execution of said execution sequence.
  • the processing method thus makes it possible to cut the instructions into execution slices in order to optimize the loading and execution method described above, whose efficiency depends on the ability to precisely determine the information needed for a next execution phase: underestimating the amount of information needed forces accesses to the shared memory during the execution of the instructions, while overestimating it generates a loading phase longer than the time the core would otherwise spend loading each data item.
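  • as an illustration only (the patent does not prescribe any particular data layout; the names below are assumptions), an execution slice produced by this cutting step can be pictured as a record pairing a transfer sequence with an execution sequence:

```python
# Illustrative sketch: each execution slice pairs a transfer sequence
# (what to flush, what to load) with an execution sequence that must
# then run without any bus access.
from dataclasses import dataclass, field

@dataclass
class TransferSequence:
    flush_addresses: list[int] = field(default_factory=list)  # modified data to write back
    load_addresses: list[int] = field(default_factory=list)   # data needed by the next phase
    code_addresses: list[int] = field(default_factory=list)   # instructions of the slice

@dataclass
class ExecutionSlice:
    transfer: TransferSequence
    entry_point: int  # first instruction, executed autonomously from private memory
```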
  • said step of cutting is based on the resolution of a system of linear equations representing constraints of execution of the instructions of said plurality of instructions according to at least one characteristic of said at least one processor.
  • the method according to the invention thus makes it possible to optimize the organization of the execution slots and to simplify their determination.
  • the duration of said execution slots is preferably constant and predetermined. This duration is, for example, determined by the time of transmission of previously modified data and the time of receipt of data and instructions to be executed.
  • the invention also relates to a computer program comprising instructions adapted to the implementation of each of the steps of the method described above when said program is executed in a processor, to a device comprising means adapted to the implementation of each of these steps, and to an aircraft comprising such a device.
  • FIG. 1 schematically shows a data processing module comprising several single-core processors
  • FIG. 2 comprising FIGS. 2a to 2d, schematically illustrates a time diagram illustrating the activities of a processor comprising eight cores, implemented in accordance with the invention
  • FIG. 3 comprising FIGS. 3a and 3b, illustrates an exemplary multi-core architecture adapted to implement the invention
  • FIG. 4, comprising FIGS. 4a to 4d, illustrates an exemplary access mechanism, by each core in the transfer phase of a multi-core processor, to the memory controllers of this processor;
  • FIG. 5 schematically illustrates a module of an avionics system, whose architecture is based on a multi-core processor such as that shown in Figure 3b, adapted to implement the invention.
  • Multi-core processors of the latest generation, also called multi-core SoCs (Systems on Chip), offer significant computing potential.
  • this potential is difficult to exploit, especially for reasons of determinism and proof or test relating to temporal requirements.
  • the notion of real time implies a precise control of the temporal behavior of applications executed, in particular of their WCET.
  • their critical nature requires, in the field of aeronautics, providing strong evidence of this control. This issue of determinism comes partly from the execution of one or more competing applications on each of the cores, which share certain resources that are too few in number to physically segregate all the paths of all the cores, in particular the data exchange buses and the memories used.
  • each core has one or more private cache memories.
  • the cores envisaged in the CPMs have three private cache memories per core: an L1_I cache memory for the instructions, an L1_D cache for the data and a unified L2 cache memory for the instructions and the data. While it is important here that each core has individual cache memories and instructions for loading and unloading them, the number of cache levels does not matter.
  • each core can access a local memory having an address on the core network.
  • the invention can be implemented with a device internal to the SoC but external to the cores, of SoC DMA type (DMA standing for Direct Memory Access), controlled by the cores or activated at fixed dates. This device is responsible for transferring the data in both directions between the memories associated with the cores, of RAM type, and the central memories, of DDR type.
  • the principle of the system according to the invention is to create phases during which the applications run exclusively within their private cache memories, with no external request (data access or snooping) able to affect them.
  • the invention defines rendezvous points between which a core has access to each resource (for example a particular memory controller), exclusively or shared with a minimum of other cores. Outside these windows, the core cannot access these resources. It is therefore necessary to distribute the rendezvous points so that each core has equitable access to the resources.
  • these rendezvous points are placed statically and regularly.
  • FIG. 2 schematically illustrates a timing diagram illustrating the activities of a processor comprising eight cores, implemented in accordance with the invention.
  • the type of activity of each of the cores is here represented along the time axis 200.
  • FIG. 2b shows part of FIG. 2a to illustrate more precisely the roles of a particular core, here the second.
  • the marks 205-i, 205-j and 205-k define instants that represent static and regular rendezvous points at which the cores change their role.
  • the first core executes a series of instructions previously stored in its cache memory with the corresponding data (reference 210).
  • meanwhile, the second core exchanges data with a memory controller.
  • the second core thus prepares for an autonomous execution phase during which it will not need to access the memory controllers.
  • the period separating two consecutive instants at which each core changes its role defines an execution slice denoted T. Then, at instant 205-j, the first core transmits data stored in its cache memory to the memory controller (reference 225) and then receives data and instructions from the memory controller, which it stores in its cache memory (reference 230). From the same instant 205-j, the second core executes the instructions previously stored in its cache memory with the corresponding data (reference 235).
  • in this way, the first core executes previously received instructions while the second core transmits and receives data and instructions.
  • the SoC comprising the processor whose operation is illustrated in FIG. 2 also preferably comprises two memory controllers.
  • the two pairs of cores 240 and 245 of the set 250 each access a different memory controller so that, within this set, each memory controller is accessed, at a given instant, by only one core.
  • likewise, the two pairs of cores 255 and 260 of the set 265 each access a different memory controller so that, within this set, each controller is accessed, at a given instant, by only one core.
  • each memory controller is accessed by two separate cores.
  • if the SoC has several memory controllers, the access of the cores to each of the memory controllers is advantageously balanced. However, a single memory controller can be used, especially if it is sufficient to serve the performance needs of the CPM TC. In this case, the use of a single memory controller makes it possible to reduce the development costs as well as the mass and heat dissipation of the SoC, and to improve its reliability.
  • the scheduling of the transfer phases on all the cores is preferably strictly synchronous, balanced and planned.
  • the use of shared resources, including memory controllers, is also preferably strictly synchronous, balanced and planned.
  • if the SoC contains two memory controllers, half of the cores in the transfer phase access, at any time, one of the memory controllers and the other half the other memory controller. If necessary, at preset times, all or part of the cores in the transfer phase can change memory controller to maintain the correct balance.
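  • this balanced assignment with periodic controller swaps can be sketched as follows (an illustrative model only; the names and the two-sub-period choice are our assumptions, consistent with FIGS. 4a to 4d described later):

```python
# Illustrative sketch: the cores currently in the transfer phase are split
# in half between the two memory controllers, and the halves swap at
# preset times to keep the load balanced.

def controller_for(core: int, transfer_cores: list[int], sub_period: int) -> int:
    """Return 0 or 1: which controller this transfer-phase core uses
    during the given sub-period of the transfer phase."""
    half = len(transfer_cores) // 2
    first_half = core in transfer_cores[:half]
    # The two halves exchange controllers on odd sub-periods.
    return (0 if first_half else 1) ^ (sub_period % 2)

transfer_cores = [0, 1, 2, 3]          # cores currently in transfer phase
for t in range(2):                     # two sub-periods of the phase
    assignment = {c: controller_for(c, transfer_cores, t) for c in transfer_cores}
    print(f"sub-period {t}: {assignment}")
```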
  • Two strategies can be implemented:
  • for example, two cores form a pipeline of 10 requests in the memory controller, i.e. 80 data transfers in bursts of 8 data items per request. It is thus sufficient that the latency of a request be less than 40 cycles, using a double transfer rate (double data rate), so as to avoid any period of inactivity in the pipeline of the memory controller.
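  • these figures can be checked with a small calculation (assuming, as is usual for DDR memories, that the 8 data items of a burst are transferred on both clock edges, i.e. 4 controller cycles of data per request — an assumption of ours, not stated in the text):

```python
# Sanity check of the pipeline figures above.
requests_in_flight = 10                       # pipeline depth built by two cores
beats_per_request = 8                         # burst of 8 data items
print(requests_in_flight * beats_per_request)  # 80 data transfers

cycles_per_request = beats_per_request // 2   # double data rate: 2 beats/cycle
print(requests_in_flight * cycles_per_request)  # 40 cycles of latency hidden
```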
  • two worst-case times are involved: the worst-case time to execute the cached code instructions with their associated data, which depends on the nature of the application executed but is relatively constant for avionics applications; and the worst-case time to transfer the modified data from the cache memories to the memory controllers and to load, from the memory controllers, the instructions, constants and variables of an execution slice into the caches, which depends on the number of competing cores.
  • while FIGS. 2a and 2b illustrate an example of optimal placement when the duration of the unloading/loading phase is identical to that of the instruction execution phase, many other distributions are possible.
  • FIGS. 2c and 2d show examples of optimal placement when the duration of the instruction execution phase is, respectively, less than three times and greater than or equal to three times that of the unloading/loading phase, T representing the duration of an execution slice.
  • FIG. 3 comprising FIGS. 3a and 3b, illustrates an exemplary multi-core architecture adapted to implement the invention.
  • the multi-core system 300 diagrammatically shown in FIG. 3a here comprises eight cores referenced 305-1 to 305-8, each connected to a local memory with a low, invariant access time, independent of the history, that is to say of the previous activity of the computing unit to which it is connected.
  • These local memories here bear references 310-1 to 310-8. They can be local cache memories or blocks of static memory accessible by virtual or physical addressing from the calculation units.
  • Each local memory is itself connected to a bus unit, whose references are 315-1 to 315-8, connected in turn to a common bus 320 connected to a shared memory 325.
  • the cores form arithmetic, logic, floating-point or other computation units that perform complex processing. They access only the local memory to which they are connected.
  • the issue of calculating WCET of the cores forming the domain 330 is decorrelated from the multi-core characteristic and the modeling problem of the shared external memory and the interconnection network of the cores forming the domain 335.
  • the cache memories or static memory blocks are maintained coherently and powered by a multi-actor system simpler than the cores.
  • the variability due to inputs, the combinatorics due to branching decisions, all the speculative decisions that the execution units can take and all the variability due to uncertainties of synchronism between the cores are ignored in the field 335.
  • the WCET problem of domain 330 then consists only in calculating the WCET of arbitrarily complex programs, considered individually, for each of the calculation slices, independently of the complexity of domain 335.
  • This decomposition into domains 330 and 335 can be achieved on conventional single-core or multi-core processors provided with cache memories and appropriate instruction sets, by synchronizing the bus units of the cores and by making them play the role of the system implemented to maintain the coherence of memories 310-1 to 310-8.
  • FIG. 3b illustrates an exemplary architecture of a multi-core SoC adapted to implement the invention.
  • the SoC 300 'here comprises the eight cores 3O5'-1 to 3O5'-8, generically referenced 305, with which are associated private cache memories generically referenced 340, 345 and 350.
  • the cache memory L1_l, referenced 340-1 , the cache L1_D, referenced 345-1, and the cache memory L2, referenced 350-1 are associated with the core 3O5'-1.
  • the cache memory L1_1, referenced 340-8, the cache memory L1_D, referenced 345-8, and the cache memory L2, referenced 350-8 are associated with the core 305'-8. It is the same for other hearts.
  • Each system consisting of a core and the associated private cache is connected to a fast data bus, referenced 320 ', which is itself connected to memory controllers 355-1 and 355-2, generically referenced 355.
  • the heart 3O5'-8 is here dedicated to the management of physical inputs / outputs.
  • the cores 3O5'-1 to 3O5'-8 may have an internal frequency of 1.6 GHz.
  • the data bus connecting the cores to the memory controllers can also use a frequency of 1, 6 GHz.
  • the loading/unloading time is about 25 μs.
  • the execution time of the instructions, which represent about two-thirds of the data exchanged, with a ratio of one instruction per three core cycles at 1.6 GHz, is about 54 μs.
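  • these orders of magnitude can be cross-checked with a rough calculation (the 4-byte instruction size and the application of the two-thirds ratio to the exchanged volume are our assumptions, not figures from the text):

```python
# Rough consistency check of the 25 µs / 54 µs figures.
freq = 1.6e9                          # core and bus frequency (Hz)
exec_time = 54e-6                     # execution phase duration (s)
instructions = exec_time * freq / 3   # one instruction per three cycles
instr_bytes = instructions * 4        # assumed 4 bytes per instruction
total_bytes = instr_bytes / (2 / 3)   # instructions ~ 2/3 of the exchange
bandwidth = total_bytes / 25e-6       # needed to load/unload in ~25 µs
print(f"{instructions:.0f} instructions, {total_bytes/1024:.0f} KiB exchanged, "
      f"{bandwidth/1e9:.1f} GB/s effective")  # ~28800, ~169 KiB, ~6.9 GB/s
```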
  • FIG. 4, comprising FIGS. 4a to 4d, illustrates an example of the access mechanism, by each core in the transfer phase of a multi-core processor, to the memory controllers of this processor.
  • a first half of the cores in the transfer phase accesses the first controller and the second half accesses the second controller.
  • the cores 305'-1 and 305'-2 access the memory controller 355-2 while the cores 305'-3 and 305'-4 access the memory controller 355-1, and the cores 305'-5 to 305'-8 are in the execution phase and cannot access the memory controllers 355-1 and 355-2.
  • the second half of the cores in the transfer phase accesses the first controller and the first half accesses the second controller.
  • the cores 305'-1 and 305'-2 then access the memory controller 355-1 while the cores 305'-3 and 305'-4 access the memory controller 355-2, and the cores 305'-5 to 305'-8 are still in the execution phase and still cannot access the memory controllers 355-1 and 355-2.
  • the first and second times, illustrated in FIGS. 4a and 4b, are repeated so that, during a first period, the memory controllers 355-1 and 355-2 are used for data unloading and, during a second period, the memory controllers 355-1 and 355-2 are used for data loading.
  • the first and second periods here have the same duration, the duration of the first period being, like that of the second period, identical for each memory controller.
  • the sequence of operations consists of unloading all the data, crossing the links between the memory controllers and the cores in the transfer phase at a given time, and then loading the new data, again crossing the links between the memory controllers and the cores in the transfer phase at a given moment.
  • the cores then change roles.
  • the cores that were in the transfer phase go into execution phase while the cores that were in the execution phase go into the transfer phase.
  • during a third time, the cores 305'-5 and 305'-6 access the memory controller 355-2 while the cores 305'-7 and 305'-8 access the memory controller 355-1, and the cores 305'-1 to 305'-4 are in the execution phase and cannot access the memory controllers 355-1 and 355-2.
  • during a fourth time, the cores 305'-5 and 305'-6 access the memory controller 355-1 while the cores 305'-7 and 305'-8 access the memory controller 355-2, and the cores 305'-1 to 305'-4 are still in the execution phase and still cannot access the memory controllers 355-1 and 355-2.
  • the third and fourth times, shown in FIGS. 4c and 4d, are repeated so that, during a first period, the memory controllers 355-1 and 355-2 are used for data unloading and, during a second period, the memory controllers 355-1 and 355-2 are used for data loading.
  • the first and second periods here have the same duration, the duration of the first period being, like that of the second period, identical for each memory controller.
  • the sequence of operations consists of unloading all the data by crossing the links between the memory controllers and the cores in the transfer phase at a given moment and then loading the new data by crossing the links again between memory controllers and cores in the transfer phase at a given moment.
  • controlling the count of page changes within the memories used requires that two cores not access the same banks in the same transfer phase. This imposes additional constraints on two cores working at the same time for the same application. In practice, this requires that two cores do not access the memory used by an application at the same time.
  • the I/O server, described below, is a special case because, by definition, it accesses all the applications. The goal is therefore to place the applications' accesses to their I/O at different dates on the I/O server.
  • Each core permanently holds, that is to say locked in its cache memory, an instance of a supervision software whose purpose is to sequence all the slices to be executed on the core. For each execution slice it performs, for example, the following operations:
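  • a minimal sketch of such a supervisor's main loop is given below (hypothetical Python; the function names and the exact placement of the rendezvous are our assumptions):

```python
# Minimal sketch (names ours) of the supervision software locked in each
# core's cache: it sequences the slices, flushing previously modified data
# and loading the next slice's code and data during the transfer phase,
# then running the slice without any access to the memory controllers.

def supervisor(slices, wait_rendezvous, flush, load, execute):
    for s in slices:
        wait_rendezvous()         # static, regular rendezvous: transfer phase begins
        flush(s.modified_data)    # write back the data modified by the previous slice
        load(s.code, s.data)      # fill the private caches for slice s
        wait_rendezvous()         # next rendezvous: execution phase begins
        execute(s.entry_point)    # runs entirely from private memory
```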
  • the determination of the worst-case transfer can be carried out according to two approaches:
  • the cores do not have access to the memory controllers during their execution phase. In other words, cores have no access to addresses not already present in cache memories.
  • the restriction of execution to the data and instructions loaded in the caches thus has the same effect as programming the memory management unit, called MMU (Memory Management Unit), at the granularity of the cache lines, since any access outside the addresses determined by the placement result would trigger an access violation exception.
  • the SoC has a DMA capable of loading into the caches or the local memory of each core the data it needs for the next slice.
  • the caches preferably contain either data locked indefinitely, that is to say for the duration of the critical-time phase, or data locked for the duration of a slice.
  • the cache closest to the cores, reserved for instructions, is locked with the most critical code elements, for example a library of frequently called routines.
  • the most distant cache memory advantageously contains the application code and the largest tables of constants, which have the lowest usage-to-volume ratio.
  • the slice-dependent data is loaded into the cache memory from a descriptor table, itself contained in the memory accessible via a memory controller and loaded into cache memory. It is possible to build tables whose surplus, called overhead in English terminology, does not exceed one percent by volume. At the end of the execution slice, the descriptor table is also used to transmit the expected modified data (flush operation). It must also be ensured that there cannot be an edge effect due to unmodified data kept in the cache memory, for example by globally invalidating the cache memories (after backing up, if necessary, persistent locked data in another cache).
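  • by way of illustration (the descriptor layout below is assumed, not specified by the text), the same descriptor table can drive both the loads at the start of a slice and the flush of the expected modified data at its end:

```python
# Illustrative descriptor table: each entry describes an address range;
# the table drives the loads at slice start and the flushes at slice end.
from typing import Callable, NamedTuple

class Descriptor(NamedTuple):
    address: int
    length: int
    writeback: bool   # True if the range holds data to flush at end of slice

def start_of_slice(table: list[Descriptor],
                   load_range: Callable[[int, int], None]) -> None:
    for d in table:
        load_range(d.address, d.length)       # prefetch code/data into the caches

def end_of_slice(table: list[Descriptor],
                 flush_range: Callable[[int, int], None]) -> None:
    for d in table:
        if d.writeback:
            flush_range(d.address, d.length)  # transmit the modified data
```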
  • indeed, a non-LRU (Least Recently Used) cache memory does not guarantee that the data of the old slice will necessarily disappear in favor of the data of the new slice.
  • each slice should preferably satisfy the following conditions: - the execution must not produce a cache miss, that is to say that all the data required by an execution slice must be available in the caches;
  • the treatments must be reasonably divisible and not highly sequential, in order to leave a few degrees of freedom to the placement solution, and the ratio between instructions and data, that is to say the computational density, should preferably be high so that the solution is effective.
  • in other words, when the caches are loaded with instructions and data, it must be possible for the cores to execute a large number of instructions before having to return to the bus to update their cache memories. Thus, for example, it is desirable not to use a function requiring large tables of data, which would block a large part of the cache memory for only a few instructions.
  • for an application specified using SCADE (SCADE is a trademark), the scheduling of the sheets, also called nodes, is free.
  • the placement of the processing into slices is done offline, that is to say before the execution of the slices, by a tool of the software generation chain.
  • the principle is to use the various methods available for multi-objective optimization under constraints in order to statically solve a placement of instructions and data.
  • this offline placement of the processing into execution slices is essential to find as optimal a solution as possible. It makes it possible to improve the WCET, or even to obtain its minimum, for the application concerned, while benefiting from the improvement of determinism due to the data locality constraints defined previously.
  • the constraint resolution approach makes it possible to restrict the mathematical expressions to linear equations in order to solve the system of equations and to optimize a function (operations research).
  • the solutions are here preferably restricted to integer solutions.
  • Such an approach, called PLNE (Programmation Linéaire en Nombres Entiers) or Integer Linear Programming (ILP) in English terminology, aims to express a problem by a system of linear equations and/or inequalities with (partially) integer solutions.
  • a resolution of the PLNE/ILP type can be done by the simplex method offered by combinatorial optimization tools, supplemented with heuristics to keep the problem computable.
  • the constraint resolution tool is requested to choose a slice for each node.
  • the index i, varying from 1 to S, here denotes the slice numbers while the index j, varying from 1 to N, denotes the sheet numbers, also called nodes, that is to say the indivisible fractions of the application.
  • Nj,i is a "decision variable" indicating the placement decision of the node Nj: it equals one if the node Nj is placed in the slice i and zero otherwise.
  • Each node Nj is characterized by a large volume of instructions and constants, called L2j, specific to the node j, to be placed in the L2 cache, as well as by a small volume of variables and constants, called L1j, specific to the node j, to be placed in the L1_D data cache.
  • Each node Nj is also characterized by a list of variables shared with other nodes and by a worst-case execution time WCETj.
  • the constants of significant size are to be placed in the cache memory L2 so as not to exhaust the capacity of the cache L1_D.
  • the choice of the transition threshold between the cache memories L2 and L1_D is determined by the placement tool.
  • the expression of the size constraints on the cache memories L2 and L1_D is given here as an example and corresponds to a placement on two resources having different characteristics: one, fast, for scarce data, to be reserved for data critical at run time, while the other is to be used for less critical instructions and data. This principle can be adapted to other distributions of resources.
  • the size and execution time constraints are then expressed, for each slice i (the sums being taken over the nodes j = 1 to N), as:
    for all i: Σj L2j × Nj,i ≤ MAXL2
    for all i: Σj L1j × Nj,i + RESVL1D ≤ MAXL1D
    for all i: Σj WCETj × Nj,i ≤ MAXWCET
  • the cache L1_D is not only used for small constants and variables but also for variables shared between several nodes. The value RESVL1D represents this space. In a simplified approach of the problem, separating the problem of placing the nodes from that of placing the variables, it is advisable to choose a fixed value leading to a feasible and satisfactory solution. In a solution combining the optimization of the placement of the nodes and of the variables, RESVL1D is chosen as representing exactly the occupation of the variables in the cache memory L1_D.
  • When a scheduling constraint exists between two nodes, for example if Nj is to be executed before Nk, the following set of constraints is added (there is one constraint for each candidate slice for the placement): for all j, k such that j must precede k, for all i ≥ 2: Nk,i + Nk,i+1 + ... + Nk,S ≥ Nj,i
  • thus, if Nj is placed in the slice i, Nk must also be placed in the slice i or in one of the following ones. If there are also constraints prohibiting the separate placement of two nodes (non-breakable nodes), they can then share the same decision variable.
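  • purely as an illustration, the constraints above can be fed to an off-the-shelf ILP solver; the sketch below uses the PuLP library with invented toy sizes (the text does not mandate any particular solver, objective or values):

```python
# Illustrative only: the node-to-slice placement expressed with PuLP
# (pip install pulp). All sizes, budgets and the objective are invented.
import pulp

S, NB_NODES = 3, 4                                 # toy numbers of slices and nodes
L2 = [40, 25, 30, 20]                              # instruction/constant volume per node
L1 = [4, 2, 3, 1]                                  # small data volume per node
WCET = [10, 8, 12, 5]                              # worst-case execution time per node
MAXL2, MAXL1D, RESVL1D, MAXWCET = 60, 8, 2, 20     # per-slice budgets

prob = pulp.LpProblem("placement", pulp.LpMinimize)
N = [[pulp.LpVariable(f"N_{j}_{i}", cat="Binary") for i in range(S)]
     for j in range(NB_NODES)]
# Arbitrary objective: pack nodes into early slices (a real tool would
# rather minimize exchanged data or the resulting WCET).
prob += pulp.lpSum(i * N[j][i] for j in range(NB_NODES) for i in range(S))

for j in range(NB_NODES):                          # each node in exactly one slice
    prob += pulp.lpSum(N[j][i] for i in range(S)) == 1
for i in range(S):                                 # per-slice capacity constraints
    prob += pulp.lpSum(L2[j] * N[j][i] for j in range(NB_NODES)) <= MAXL2
    prob += pulp.lpSum(L1[j] * N[j][i] for j in range(NB_NODES)) + RESVL1D <= MAXL1D
    prob += pulp.lpSum(WCET[j] * N[j][i] for j in range(NB_NODES)) <= MAXWCET
# Scheduling constraint: node 0 must precede node 1 (same slice allowed).
for i in range(1, S):
    prob += pulp.lpSum(N[1][k] for k in range(i, S)) >= N[0][i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for j in range(NB_NODES):
    slice_of_j = next(i for i in range(S) if N[j][i].value() == 1)
    print(f"node {j} -> slice {slice_of_j}")
```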
  • nodes can share constants.
  • the sharing of small constants is generally not very dimensioning and does not justify complicating the problem.
  • the small constants can be duplicated, that is to say placed independently in each slice, without significant cost, by using locations not used in the distribution of variables in memory. Large constants, usually few in number, for example trigonometric interpolation tables, nevertheless justify a search for optimization.
  • for each large constant Cc, a variable Cc,i is defined as equal to one if the constant Cc is referenced in the slice i, and to zero otherwise.
  • a constraint on Cc,i is added as follows: for any slice i, for any node j referencing Cc, Cc,i ≥ Nj,i
  • thus, if at least one node referencing Cc is placed in the slice i, Cc,i is forced to 1. It should be noted that Cc,i is not really a decision variable; it is a consequence of the placement decisions of the nodes Nj.
  • a block here is a cache line or a group of cache lines.
  • the second constraint evoked above is refined by replacing the value RESVL1D by the allocation of blocks intended for the variables. It is then necessary to minimize the value USAGE (where USAGE ≤ MAXL1D) while respecting the following constraints: for all i, L1_1 × N_1,i + L1_2 × N_2,i + ... + L1_N × N_N,i + BLK_SZ × (H_1,i + ... + H_B,i − B) ≤ USAGE, where BLK_SZ represents the size of a block.
  • similarly, the function minimizing the value MAX_FLUSH is sought within the following constraints: for all i, USAGE_W_L1_i + BLK_SZ × (W_1,i + ... + W_B,i) ≤ MAX_FLUSH, where the value USAGE_W_L1_i comes from the result of the placement of the nodes and corresponds to the size of all the data modified in the slice i, known before the resolution of the variable placement constraints.
  • Some simplifications can be made to the equations described above. For example, it is possible to calculate only one placement decision for all the variables sharing exactly the same list of referenced slices. According to a particular embodiment, it is also possible to simplify the problem by cutting the nodes or the variables into several subsets. This preliminary division can be directed by the designer of the software to be placed, for example because he knows that his application is composed of three largely independent subsystems, or by the placement tool according to heuristics, for example by identifying nodes referencing the same variables. Each subproblem is then subject to an independent placement of its nodes and its own variables. A final placement of the shared variables completes the problem resolution. For example, the nodes can be divided into several subsets according to periodicities.
  • the slices are then scheduled at the periodicity of the nodes. It is also possible to split the specification used into relatively independent functional blocks. Other alternatives are possible, including expressing a prior system of constraints to distribute the nodes into a small number of subsystems rather than directly distributing the nodes into a large number of slices.
  • although the desired optimum can be degraded by the heuristics (simplification choices) put in place, non-exhaustive methods can be used to solve the combinatorial optimization problem that the placement problem represents.
  • optimization methods such as estimation of distribution algorithms, methods based on the principle of evolutionary algorithms (or genetic algorithms), neural networks or particle swarm optimizers can be used.
  • Combinatorial optimization is an active and evolving field of research, and many approaches are available, each with its advantages and disadvantages.
  • the idea here is to seek an optimization of the placement of the nodes and then of the variables, or even of the variables only, the objective functions guiding the iterative search for a better solution being notably the minimization of data exchanges between the slices and the minimization of the execution time through very fine localization of the data (minimizing the number of cache lines that a calculation sequence must load or unload at the level of an L1 cache within an execution slice).
  • the presence of constraints of different natures can lead to consider an optimum search based on several optimization methods.
  • the calculation slices have no access to physical inputs/outputs, called I/O. They can only access variables that have been cached by the supervision software.
  • a core, or several if necessary, is preferably dedicated to the management of the physical I/O.
  • This core hosts an "I / O server” type function as opposed to other cores that can be considered as "computing servers”.
  • the I/O core produces the variables corresponding to the module's inputs, once deformatted, and consumes the variables corresponding to the module's outputs, before formatting.
  • the I/O core is thus a producer and consumer of unformatted data.
  • the activities of the I/O server cover access operations to physical registers and bus controllers, for example Ethernet, PCIe or non-volatile memory controllers, as well as operations of verification and conversion of the data to data structures and types known to the applications. These operations are defined by configuration tables, loaded during the transfer slices and planned by the placement tool along with the planning of the loading of the calculation slices.
  • the I/O core holds its software and some data permanently resident, and uses its transfer phases to load and unload the values of the actual inputs and outputs as well as the configuration table elements necessary for their processing.
  • the I/O core is preferably the only core having access to buses of PCIe, Ethernet or other types. Being the only one, and provided that its accesses do not interfere with the accesses of the computing cores to the memory controllers, the I/O core has full-time use of these buses. On the other hand, being treated like any other core from the point of view of access to the memory controllers, it has strictly static slots and access windows, planned at the same time as the accesses of the computation cores.
  • a memory component must be available so that these DMA transfers can be made without affecting the memory used by the computing cores.
  • This component can be a cache memory, preferably that of the I/O core, used as a target. It can also be another cache or a memory area accessible by addressing in the SoC, or possibly even an external memory plane addressed by a dedicated memory controller.
  • the activities of the I / O server are divided into execution and transfer slices, strictly synchronous, balanced and planned, like the activities of computing cores (or application cores).
  • the I/O core uses its transfer slices to read the configuration tables, deposit the inputs in memory, and retrieve the outputs.
  • the execution slots are dedicated to controlling the bus controllers.
  • the distribution of operations by slice is carried out by the offline placement tool described above, while respecting the processing capacities of the I/O core and the bus controllers, in time consistency with the applications.
  • the SoC architecture must provide sufficient segregation of paths for the exchanges between the I/O core and the bus controllers during the execution slots, to avoid interfering with the exchanges between the memory and the computation cores in the transfer phase.
  • the physical inputs of the I / O server can be classified into two families:
  • These inputs generally consist of reading one or more registers to receive information;
  • For asynchronous inputs, the I/O server must have a resident configuration table element in its private cache memories. This element should allow it to correlate the unplanned arrival of the event with a request to access a specific memory area, and then later use a scheduled access date to this area to acquire, if necessary, additional configuration table elements and deposit the event data, reformatted or not.
  • the raw data must be kept in cache memory between the time of arrival and the opening of the memory access.
  • the arrival of the event is unplanned in the sense that the moment at which it must arrive is unknown. However, the very existence of the event is planned: addresses in memory and scheduled memory access opportunities have been assigned to it.
  • if the execution slices on the compute cores are grouped so that only one application is active at a time on all the cores, it is possible to reserve on the I/O server a prologue slice for the inputs and an epilogue slice for the outputs, so that the I/O server can be considered during all this time to be for the exclusive use of the active application.
  • This alternative, according to which all the cores are dedicated to one application for a determined duration, that is to say several slices, requires that the problems of determinism of the memory controllers due to the page changes be solved. This can be done, for example, by the use of a sufficiently precise model of the memory controllers applied to the memory transfer lists required by each slice. This alternative also requires that the applications thus distributed have sufficient scheduling freedom to be distributed efficiently across all the cores in a parallel manner.
  • the mix of applications on different computing cores can be allowed.
  • the slices of the I/O server preceding or following the calculation slices are provided with CPU time resources and static bus accesses (equivalent to micro-partitions). These resources are known to the application placement tool so that the applications do not exceed their assigned resources.
  • if the SoC has several Ethernet controllers, it is possible to perform AFDX or Erebus inputs/outputs in software. These implementations must, however, remain compatible with the static and deterministic constraints necessary for the splitting into calculation slices.
  • the Ethernet controllers should not access the central memory used by the compute cores and must work with independent memory and bus resources.
  • Bus-type resources can optionally be shared if there is an "instantaneous" priority management capable of serving the requests of the application cores without preemption, or observable delay, in the event of a conflict with the accesses of the Ethernet controllers or of the I/O server, and without invalidating the WCET analyses of the I/O server.
  • This approach implies that the accesses of the Ethernet controllers can be transparent vis-à-vis the computing cores.
  • the AFDX transmission and reception operations are preferably adapted to be performed in the I/O core, while complying with the following constraints: the I/O core must respect the concept of communication slices and processing slices; the Ethernet controllers must not disturb the memory controllers or the other cores; and,
  • the cache memories of the I/O core being too small to fully store the configuration and the variables related to the AFDX interface, these must be loaded in portions.
  • the packets received by the Ethernet controllers are stored in the memory of the I/O core. They are analyzed as they are received and then transferred to other queues.
  • a configuration table residing in the local memory of the I/O server is used to associate the identifiers of the virtual links (or VL, abbreviation of Virtual Link), called VLIDs, of the received frames with one or more memory access windows scheduled for the I/O server.
  • the configuration table residing in the local memory of the I/O server, whose size is of the order of a few kilobytes, is used for each Ethernet frame received.
  • the management of redundancy and integrity advantageously uses resources also stored in the local memory of the I / O server.
  • the elements of this table necessary for the processing of the VLs identified by the configuration table residing in the local memory of the I/O server are loaded into the memory of the I/O server during the read slices allowed for these VLs, and only the pending packets corresponding to these VLs are processed. If the capacity of the local memory of the I/O server allows it, it is preferable, for reasons of simplicity and reduced latency, to leave these tables resident in the I/O server.
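  • the role of the resident VL table can be sketched as follows (field and function names are our assumptions; the text only specifies that the VLID of each received frame is mapped to scheduled memory access windows):

```python
# Illustrative sketch of the resident VL configuration table: frames with
# unknown VLIDs are discarded, and per-VL processing is deferred to the
# read slices allowed for that VL.
from typing import NamedTuple

class VLEntry(NamedTuple):
    vlid: int
    access_windows: tuple[int, ...]   # slice numbers with scheduled memory access

resident_table = {
    100: VLEntry(100, (2, 6)),        # invented example VLs
    101: VLEntry(101, (4,)),
}

pending: dict[int, list[bytes]] = {}

def on_frame(vlid: int, frame: bytes) -> None:
    entry = resident_table.get(vlid)
    if entry is None:
        return                        # unknown VL: frame is discarded
    pending.setdefault(vlid, []).append(frame)

def on_read_slice(slice_number: int) -> list[bytes]:
    """Process only the pending packets whose VL is allowed in this slice."""
    out: list[bytes] = []
    for vlid, entry in resident_table.items():
        if slice_number in entry.access_windows:
            out.extend(pending.pop(vlid, []))
    return out
```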
  • the I/O server's transmission activities are scheduled by the placement tool used for the placement of the application processing in the slices and for the placement of the slices on the cores.
  • FIG. 5 schematically illustrates a CPM, the architecture of which is based on a multi-core processor such as that presented in FIG. 3b, adapted to implement the invention, in which the AFDX functions are managed in software in the multi-core processor.
  • the CPM 500 includes the multi-core processor 505 having here, in particular, eight cores and two memory controllers. These memory controllers are used as interface between the cores and memories 510-1 and 510-2.
  • the CPM 500 further comprises a memory 515, for example a flash memory, for storing, for example, some of the applications to be executed by the cores of the processor 505.
  • the CPM 500 further comprises a network interface for receiving and transmitting data, in particular an AFDX interface, as well as the logic necessary for the operation of the CPM.
  • the AFDX function is here performed by the multi-core processor, that is to say in software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
EP10734208A 2009-06-05 2010-06-02 Method and device for loading and executing instructions with deterministic cycles in a multi-core avionics system having a bus whose access time is unpredictable Ceased EP2438528A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR0902750A FR2946442B1 (fr) 2009-06-05 2009-06-05 Procede et dispositif de chargement et d'execution d'instructions a cycles deterministes dans un systeme avionique multi-coeurs ayant un bus dont le temps d'acces est non predictible
PCT/FR2010/051071 WO2010139896A1 (fr) 2009-06-05 2010-06-02 Procédé et dispositif de chargement et d'exécution d'instructions à cycles déterministes dans un système avionique multi-coeurs ayant un bus dont le temps d'accès est non prédictible

Publications (1)

Publication Number Publication Date
EP2438528A1 true EP2438528A1 (de) 2012-04-11

Family

ID=41667520

Family Applications (1)

Application Number Title Priority Date Filing Date
EP10734208A Ceased EP2438528A1 (de) 2009-06-05 2010-06-02 Verfahren und einrichtung zum laden und ausführen von anweisungen mit deterministischen zyklen in einem einen bus aufweisenden mehrkern-avioniksystem, dessen zugriffszeit unvorhersehbar ist

Country Status (4)

Country Link
US (1) US8694747B2 (de)
EP (1) EP2438528A1 (de)
FR (1) FR2946442B1 (de)
WO (1) WO2010139896A1 (de)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2945647A1 (fr) * 2009-05-18 2010-11-19 Airbus France Methode d'optimisation d'une plateforme avionique
US8516205B2 (en) * 2010-10-29 2013-08-20 Nokia Corporation Method and apparatus for providing efficient context classification
US8516194B2 (en) * 2010-11-22 2013-08-20 Micron Technology, Inc. Systems and methods for caching data with a nonvolatile memory cache
US9137038B1 (en) * 2012-08-30 2015-09-15 Rockwell Collins, Inc. Integrated modular avionics system with distributed processing
DE102013224702A1 (de) * 2013-12-03 2015-06-03 Robert Bosch Gmbh Steuergerät für ein Kraftfahrzeug
US10375087B2 (en) * 2014-07-21 2019-08-06 Honeywell International Inc. Security architecture for the connected aircraft
CN104202188B (zh) * 2014-09-01 2017-04-26 北京航空航天大学 一种采用遗传算法进行afdx网络路径优化的方法
US9812221B1 (en) 2015-09-09 2017-11-07 Rockwell Collins, Inc. Multi-core cache coherency built-in test
US10089233B2 (en) 2016-05-11 2018-10-02 Ge Aviation Systems, Llc Method of partitioning a set-associative cache in a computing platform
EA035760B1 (ru) * 2016-10-31 2020-08-06 ЛЕОНАРДО С.п.А. Структура по сертифицируемой системы управления с постоянными параметрами для приложений жесткого реального времени, критических с точки зрения безопасности, в системах бортового радиоэлектронного оборудования с использованием многоядерных процессоров
US10402327B2 (en) * 2016-11-22 2019-09-03 Advanced Micro Devices, Inc. Network-aware cache coherence protocol enhancement
US10162757B2 (en) 2016-12-06 2018-12-25 Advanced Micro Devices, Inc. Proactive cache coherence
FR3086780B1 (fr) * 2018-09-27 2020-11-06 Thales Sa Systeme et procede d'acces a une ressource partagee
DE102019128206B4 (de) 2019-10-18 2022-09-01 Iav Gmbh Ingenieurgesellschaft Auto Und Verkehr Verfahren und Vorrichtung zur statischen Speicherverwaltungsoptimierung bei integrierten Mehrkernprozessoren
US11409643B2 (en) 2019-11-06 2022-08-09 Honeywell International Inc Systems and methods for simulating worst-case contention to determine worst-case execution time of applications executed on a processor
CN114157615B (zh) * 2020-08-18 2024-11-01 上海航空电器有限公司 用于提高虚拟链路调度效率的afdx端系统及方法
CN113422714B (zh) * 2021-06-23 2022-07-05 中国航空无线电电子研究所 一种支持在afdx终端上支持高完整性冗余管理的模块

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5590368A (en) * 1993-03-31 1996-12-31 Intel Corporation Method and apparatus for dynamically expanding the pipeline of a microprocessor
JP2003150395A (ja) * 2001-11-15 2003-05-23 Nec Corp プロセッサとそのプログラム転送方法
EP2015174B1 (de) 2007-06-21 2018-03-14 Imsys AB Mikroprogrammierter prozessor mit vielfachen prozessorkernen und zeitgleichem zugriff auf einen mikroprogrammsteuerspeicher

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO2010139896A1 *

Also Published As

Publication number Publication date
FR2946442A1 (fr) 2010-12-10
US8694747B2 (en) 2014-04-08
WO2010139896A1 (fr) 2010-12-09
FR2946442B1 (fr) 2011-08-05
US20120084525A1 (en) 2012-04-05

Similar Documents

Publication Publication Date Title
WO2010139896A1 (fr) Procédé et dispositif de chargement et d'exécution d'instructions à cycles déterministes dans un système avionique multi-coeurs ayant un bus dont le temps d'accès est non prédictible
Harlap et al. Addressing the straggler problem for iterative convergent parallel ML
CN110580197B (zh) 大型模型深度学习的分布式计算架构
KR20220078566A (ko) 메모리기반 프로세서
US20170367023A1 (en) Method of and system for processing a transaction request in distributed data processing systems
EP3844620A1 (de) Verfahren, vorrichtung und system für eine architektur zur beschleunigung von maschinenlernen
EP1949234A1 (de) Verfahren und system zum durchführen einer intensiven multitask- und multiflow-berechnung in echtzeit
US10970118B2 (en) Shareable FPGA compute engine
US20210357732A1 (en) Neural network accelerator hardware-specific division of inference into groups of layers
US20220113915A1 (en) Systems, methods, and devices for accelerators with virtualization and tiered memory
US20220414437A1 (en) Parameter caching for neural network accelerators
EP3494475B1 (de) Verfahren und vorrichtung zur verteilung von partitionen auf einem mehrkernprozessor
JP7492555B2 (ja) 複数の入力データセットのための処理
Zhao et al. Gpu-enabled function-as-a-service for machine learning inference
CN116382599B (zh) 一种面向分布式集群的任务执行方法、装置、介质及设备
CN103093446A (zh) 基于多处理器片上系统的多源图像融合装置和方法
EP2666092B1 (de) Mehrkernsystem und daten-kohärenz verfahren
CA2887077A1 (fr) Systeme de traitement de donnees pour interface graphique et interface graphique comportant un tel systeme de traitement de donnees
FR3010201A1 (fr) Calculateur comprenant un processeur multicoeur et procede de controle d'un tel calculateur
US20230266997A1 (en) Distributed scheduling in container orchestration engines
FR2991074A1 (fr) Procede, dispositif et programme d'ordinateur de controle dynamique de distances d'acces memoire dans un systeme de type numa
FR3057127A1 (fr) Processeur adapte pour un reseau ethernet commute deterministe
Bueno et al. RapidIO for radar processing in advanced space systems
Roozbeh Toward Next-generation Data Centers: Principles of Software-Defined “Hardware” Infrastructures and Resource Disaggregation
Bueno et al. Optimizing RapidIO architectures for onboard processing

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20111107

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20180515

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20200529