EP2438528A1

EP2438528A1 - Method and device for loading and executing instructions with deterministic cycles in a multicore avionics system having a bus, the access time of which is unpredictable

Info

Publication number: EP2438528A1
Application number: EP10734208A
Authority: EP
Inventors: Victor Jegu; Benoît TRIQUET; Frédéric ASPRO; Claire PAGETTI; Frédéric BONIOL
Original assignee: Airbus Operations SAS
Current assignee: Airbus Operations SAS
Priority date: 2009-06-05
Filing date: 2010-06-02
Publication date: 2012-04-11
Also published as: WO2010139896A1; FR2946442B1; US20120084525A1; US8694747B2; FR2946442A1

Abstract

The invention particularly relates to a method and device for loading and executing a plurality of instructions in an avionics system including a processor having at least two cores and a memory controller, each of the cores including a private memory. The plurality of instructions is loaded and executed in execution slots such that, during a first execution slot, a first core has access to the memory controller for transmitting (215) at least one piece of data stored in its private memory and for receiving (220) and storing in its private memory at least one datum and at least one instruction of the plurality of instructions, while the second core does not have access to the memory controller and executes (210) at least one instruction previously stored in its private memory, and such that, during a second execution slot, the roles of the two cores are reversed.

Description

"Method and apparatus for loading and executing deterministic cycle instructions in a multi-core avionic system having a bus whose access time is unpredictable"

The present invention relates to the architecture of avionic type systems and more particularly to a method and a device for loading and executing deterministic cycle instructions in a multi-core avionic system having a bus whose access time is not predictable.

Modern aircraft include more and more electronic and computer systems to improve their performance and to assist pilots and crew members during their missions. Thus, for example, electric flight controls reduce the mechanical complexity of transmitting commands to the actuators and therefore the mass associated with these commands. Similarly, the presentation of relevant information allows the pilot to optimize flight paths and respond quickly to any incident detected. Such information includes speed, position, heading, meteorological and navigation data. All of these electronic and computer systems are generally called avionics. For reasons including reliability, simplicity and certification, avionics has often been functionally distributed across specific modules, also called LRUs (Line Replaceable Units). Thus, for example, the flight controls are managed in one particular device while the power supply is managed in another. A specific function is thus associated with each module.

Furthermore, each module supporting a critical function is preferably redundant so that the failure of a module does not lead to the loss of the associated function. Operating an aircraft on a redundant module when the main module is faulty requires a maintenance operation. To improve the functionality of aircraft, reduce the weight of electronic equipment and facilitate maintenance operations, avionics is now increasingly integrated according to an architecture called IMA (Integrated Modular Avionics). According to this architecture, the functionalities are decorrelated from the systems, that is to say the computers or computing resources, in which they are implemented. Nevertheless, a segregation system makes it possible to isolate each of the functionalities so that the failure of one function has no influence on another. Such systems implement different modules, in particular data processing modules, called CPMs (Core Processing Modules), data switching modules, called ASMs (Avionic Switch Modules), and power supply modules, called PSMs (Power Supply Modules).

The data processing modules include so-called "high performance" modules for general avionics applications, so-called "time critical" modules for avionics applications with strong temporal determinism constraints, and server-type modules for non-critical, open-world applications.

A data processing module is generally composed of one or more processors, also called CPUs (Central Processing Units), associated with one or more memory banks of RAM (Random Access Memory) and flash type.

The communications between several CPUs of a CPM are preferably provided by means of direct links to a shared memory or through an exchange memory of a communication interface, for example an AFDX (Avionics Full DupleX) interface.

In order to allow the calculation of the WCET (Worst Case Execution Time), the so-called time-critical data processing modules must use processors and memory architectures that guarantee their temporal determinism.

To build a time-critical data processing module, called CPM TC (Core Processing Module Time Critical) in the following description, a large number of relatively simple processors can be used, with code executed from static RAM or flash memory to ensure temporal determinism.

Figure 1 schematically illustrates a CPM implementing such an architecture. As illustrated, the CPM 100 here comprises four "single-core" processors 105-1 to 105-4 and, associated with each processor, memories of DDRAM type (Double Data Rate Random Access Memory), generically referenced 110, and of flash type, generically referenced 115. Furthermore, the CPM comprises a set 120 of logic circuits allowing, in particular, the processors 105-1 to 105-4 to exchange data with other components of the aircraft via an input/output module 125.

However, the use of a large number of processors increases the risk of failure, degrading the MTBF (Mean Time Between Failures), as well as the weight and development costs.

Moreover, despite the computing power required in TC CPMs, high-performance superscalar processors, which execute code from a dynamic RAM bank, are used little or not at all because of the memory refresh time, the changes of rows, columns and/or banks and, above all, the higher latency of the memory controller. In other words, TC CPMs do not typically use processors based on multi-core architectures with cache memories. Indeed, TC CPMs need strongly deterministic execution times, and cache memories create a variability that is difficult to determine because of a history effect: depending on past events, a piece of information may or may not still be in the cache. It may then be necessary to reload it, without this being known in advance. The same is true of the pipelined instruction sequences of processor cores and memory controllers, in which instructions can be spread over several cycles, thus creating dependencies on history.

Therefore, to be deterministic, TC CPMs must discard the mechanisms behind these variabilities and use margins to determine runtimes in advance, making the use of multi-core processors inefficient.

The invention solves at least one of the problems discussed above. More particularly, it is possible, according to the invention, to determine in advance the use of cache memories of multi-core systems so that the latency of the memories is no longer a factor limiting the performance. The invention also makes it possible, in a multi-core, multi-processor architecture, or more generally a shared processor bus, to obtain the independence of computing cores and the determination of non-pessimistic WCETs. In addition, cache memory latency independence allows the determination of WCET even if the memory and memory controller models are imprecise.

The subject of the invention is thus a method for loading and executing deterministic execution cycles of a plurality of instructions in an avionic system comprising at least one processor having at least two cores and at least one memory controller, each of said at least two cores having a private memory, said plurality of instructions being loaded and executed in execution slots, the method comprising the following steps,

- during a first execution slot,

o authorizing access to said at least one memory controller for a first of said at least two cores, said first core transmitting to said at least one memory controller at least one previously modified datum stored in its private memory, and receiving at least one datum and at least one instruction of said plurality of instructions, said at least one received datum and said at least one received instruction being stored in its private memory; and,

o prohibiting access to said at least one memory controller for a second of said at least two cores, said second core executing at least one instruction previously stored in its private memory;

- during a second execution slot,

o prohibiting access to said at least one memory controller for said first core, said first core executing at least one instruction previously stored in its private memory; and,

o authorizing access to said at least one memory controller for said second core, said second core transmitting to said at least one memory controller at least one previously modified datum stored in its private memory, and receiving at least one datum and at least one instruction of said plurality of instructions, said at least one received datum and said at least one received instruction being stored in its private memory.

The method according to the invention thus makes it possible to use technologies based on multi-core processors having buses whose access time is unpredictable for applications with strong temporal determinism constraints. In particular, the method allows the use of DDRx memories operating in burst mode, cores running at frequencies above 1 GHz, massively parallel architectures and integration in a single electronic component.

Although dividing the activity of the cores into long execution phases, with no access to a shared memory, and long phases of access to a shared memory, with no computation, seems inefficient at first glance, it proves effective given the avionics applications envisaged and the way these applications are divided. For the execution model to be effective, the duration of the memory access phases is advantageously less than the total time a core would otherwise spend waiting for the completion of each of these accesses.

Another significant advantage is the simplification and strong reduction of pessimism of WCET calculations by static analysis due to the presence in private memory of the data used in the calculation phases.

Another advantage concerns static analysis tools based on a processor model. Since the tool does not have to consider scenarios including access to shared memory in its analyses, the processor model can be reduced to the core and its private memories.

This approach is also compatible with, and well suited to, memory technologies that are evolving towards very high data rates without a proportional reduction in latency, the aim here being to feed increasingly large and numerous private memories.

According to a particular embodiment, said at least one processor further comprises at least one second memory controller, the method further comprising the following steps,

- during a first phase of said first execution slot, authorizing said first core to access a first of said at least two memory controllers and prohibiting said first core from accessing a second of said at least two memory controllers;

- during a second phase of said first execution slot, authorizing said first core to access said second memory controller and prohibiting said first core from accessing said first memory controller;

- during a first phase of said second execution slot, authorizing said second core to access said first memory controller and prohibiting said second core from accessing said second memory controller; and,

- during a second phase of said second execution slot, authorizing said second core to access said second memory controller and prohibiting said second core from accessing said first memory controller.
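
By way of non-limiting illustration, the access authorizations of this embodiment can be sketched as follows in Python. The sketch assumes two cores (numbered 0 and 1), two memory controllers (numbered 0 and 1) and two phases per execution slot (numbered 0 and 1); the names `Role`, `role` and `authorized_controller` and the numbering are hypothetical and are not part of the method itself.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    TRANSFER = "transfer"
    EXECUTE = "execute"


def role(core: int, slot: int) -> Role:
    """Core 0 transfers during even slots, core 1 during odd slots."""
    return Role.TRANSFER if (slot + core) % 2 == 0 else Role.EXECUTE


def authorized_controller(core: int, slot: int, phase: int) -> Optional[int]:
    """Return the only memory controller the core may access, or None.

    A core in its execution role is denied access to every controller.
    A core in its transfer role uses the first controller during the
    first phase of the slot and the second controller during the second
    phase (assumed numbering: phase 0 -> controller 0, phase 1 -> 1).
    """
    if role(core, slot) is Role.EXECUTE:
        return None
    return phase


if __name__ == "__main__":
    for slot in range(2):
        for phase in range(2):
            row = [authorized_controller(core, slot, phase) for core in range(2)]
            print(f"slot {slot}, phase {phase}: controller per core = {row}")
```

In this sketch a given controller is never granted to both cores in the same phase, which is the property the steps above are intended to provide.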

The method thus allows cores to access shared memories to execute instructions using common data.

According to a particular embodiment, at least one of said at least two cores is dedicated to data transmission and reception operations to and from a network communication interface, in order to simplify the modeling of the processor.

The invention also relates to a method of processing a plurality of instructions to enable the loading and the execution, with deterministic execution cycles, of said plurality of instructions according to the method described above, the processing method comprising a step of cutting said plurality of instructions into execution slots, each execution slot comprising a transfer sequence and an execution sequence, said transfer sequence allowing the transmission of at least one previously stored datum and the reception and storage of at least one datum and at least one instruction, said at least one received datum being necessary for the execution of said at least one received instruction and allowing said at least one received instruction to be executed autonomously during said execution sequence.

The processing method thus makes it possible to cut the instructions into execution slices in order to optimize the loading and execution method described above, whose efficiency depends on the ability to determine precisely the information needed for the next execution phase. Underestimating the amount of information needed makes access to the shared memory necessary during the execution of the instructions, while overestimating it produces a loading phase longer than the time the core would spend loading each datum individually.

According to a particular embodiment, said cutting step is based on solving a system of linear equations representing constraints on the execution of the instructions of said plurality of instructions according to at least one characteristic of said at least one processor.

The method according to the invention thus makes it possible to optimize the organization of the execution slots and to simplify their determination.

The duration of said execution slots is preferably constant and predetermined. This duration is, for example, determined by the time of transmission of previously modified data and the time of receipt of data and instructions to be executed.
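
By way of illustration only, the cutting of a sequence of processing blocks into execution slots can be sketched as follows. The sketch deliberately replaces the linear-system resolution mentioned above with a simple greedy packing, and assumes that the only constraint is the capacity of the private memory of a core; the `Block` type, the byte sizes and the function name `cut_into_slots` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """A unit of processing with its instruction and data footprint."""
    name: str
    code_bytes: int
    data_bytes: int

    @property
    def size(self) -> int:
        return self.code_bytes + self.data_bytes


def cut_into_slots(blocks: List[Block], capacity_bytes: int) -> List[List[Block]]:
    """Group consecutive blocks so that each group fits in private memory.

    Blocks are kept in program order. A new slot is opened as soon as
    the next block would exceed the assumed private memory capacity.
    """
    slots: List[List[Block]] = []
    current: List[Block] = []
    used = 0
    for block in blocks:
        if block.size > capacity_bytes:
            raise ValueError(f"{block.name} does not fit in private memory")
        if used + block.size > capacity_bytes:
            slots.append(current)
            current, used = [], 0
        current.append(block)
        used += block.size
    if current:
        slots.append(current)
    return slots


if __name__ == "__main__":
    program = [Block("a", 30, 10), Block("b", 40, 10),
               Block("c", 20, 10), Block("d", 50, 10)]
    for index, slot in enumerate(cut_into_slots(program, capacity_bytes=100)):
        print(index, [block.name for block in slot])
```

A greedy packing of this kind keeps the number of slots low but does not account for timing constraints, which is one reason a constraint-based formulation may be preferred in practice.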

The invention also relates to a computer program comprising instructions adapted to the implementation of each of the steps of the method described above when said program is executed in a processor, a device comprising means adapted to the implementation of each of the steps of the method described above and an aircraft comprising the device according to the preceding claim. The advantages provided by such a computer program and such a device are similar to those mentioned above.

Other advantages, aims and features of the present invention will emerge from the detailed description which follows, given by way of non-limiting example, with reference to the accompanying drawings in which:

- Figure 1 schematically shows a data processing module comprising several single-core processors;

- Figure 2, comprising Figures 2a to 2d, schematically illustrates a timing diagram of the activities of a processor comprising eight cores, implemented in accordance with the invention;

- Figure 3, comprising Figures 3a and 3b, illustrates an exemplary multi-core architecture adapted to implement the invention;

- Figure 4, comprising Figures 4a to 4d, illustrates an exemplary mechanism by which each core in the transfer phase of a multi-core processor accesses the memory controllers of this processor; and,

- Figure 5 schematically illustrates a module of an avionics system, whose architecture is based on a multi-core processor such as that shown in Figure 3b, adapted to implement the invention.

The latest generation of multi-core processors, also called multi-core SoCs (System on Chip), offers high potential computing power. However, in the context of critical real-time applications, this potential is difficult to exploit, especially for reasons of determinism and of proof or testing of temporal requirements. It is recalled here that the notion of real time implies precise control of the temporal behavior of the applications executed, in particular of their WCET. The term "critical" requires, in the field of aeronautics, strong evidence of this control. This issue of determinism comes partly from the execution of one or more competing applications on each of the cores, which share certain resources that are too few in number to physically segregate all the paths of all the cores, in particular the data exchange buses and the memories used. If this sharing is not controlled (ideally, controlled access is temporally exclusive access), it introduces contentions that generally cannot be counted. Alternatively, bounding them by a worst-case analysis is too pessimistic and leads to extreme under-utilization of the multi-core processor.

Another source of indeterminism comes from the intrinsic complexity of SoCs, whose set of components creates history phenomena that make the computation of a reasonably pessimistic worst-case scenario prohibitively expensive. The lack of observability inside the SoC and the lack of documentation concerning its architecture also make it impossible to create reliable temporal models suited to WCET analyses.

The system according to the invention makes it possible to circumvent these difficulties. It is first recalled that, within the SoC, each core has one or more private cache memories. Typically, the cores envisaged in CPMs have three private cache memories per core: an L1_I (or L1 I) cache memory for instructions, an L1_D (or L1 D) cache memory for data and a unified L2 cache memory for instructions and data. While it is important here that each core has an individual cache memory and instructions for loading and unloading it, the number of cache levels does not matter.

Alternatively, each core can access a local memory having an address on the core network. In the same way, the invention can be implemented with a device internal to the SoC but external to the cores, of the SoC DMA type (Direct Memory Access), controlled by the cores or activated on a fixed date. This device is responsible for transferring data in both directions between the RAM-type memories associated with the cores and the DDR-type central memories.

As long as an application runs only in these caches, there is no problem of resource contention due to the multi-core architecture. The complexity problems of the SoC are also, in this case, greatly reduced because the models necessary for determining the WCET are limited to the cores and their caches. However, cache memories are generally not large enough to store applications in their entirety. In addition, the applications executed need, by their nature, to receive and transmit data through input/output (I/O) interfaces.

The principle of the system according to the invention is to create phases during which applications run exclusively within their private cache memories, with no external request (data access or monitoring) affecting them.

This principle brings the following benefits:

- the execution of the phases is entirely independent of the activity of the other cores, and the WCET analysis of these phases can follow a traditional single-core approach; and,

- the determination of the WCET does not require any model other than that of the cores and their private cache memories. In particular, a model of the inter-core data bus and of the memory controller is not required.

It should be noted, however, that, as mentioned previously, applications cannot usually be fully contained in the private cache memories of the cores. In addition, a core is usually not dedicated to a particular application. Moreover, an application's data are not local, since an application must necessarily consume and produce data used by other applications. Therefore, it is necessary to manage access to shared memory and/or access to one or more networks in order to load and unload code instructions and application data. However, these accesses must be planned so that they are (ideally) exclusive between the cores, as well as countable and distributed so that the worst-case conditions cost as little as possible in terms of processing time.

One solution for scheduling these accesses is to define rendezvous points between which a core has access to each resource (for example a particular memory controller), exclusive or shared with a minimum of other cores. Outside these time ranges, the core cannot access these resources. It is therefore necessary to distribute the rendezvous points so that each core has equitable access to the resources. Advantageously, these rendezvous points are placed statically and regularly.

Thus, for example, for a processor having eight cores and two memory controllers, and for equivalent execution and memory access times, four cores are allowed, at any time, to access memory via the two memory controllers, this access being forbidden to the other four cores. Advantageously, of the four cores that can access the memory controllers at any moment, two and only two access each memory controller. A shorter memory access time makes it possible to dedicate more time to the execution phase, without memory access, without changing the total duration of the memory access and execution cycle. A shorter memory access time also makes it possible to limit the number of cores accessing the memory at any time.
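
The distribution described in this example can be illustrated by a minimal Python sketch. It assumes that time is counted in whole units, that the cores are staggered evenly over one cycle (execution plus transfer) and that the cores in the transfer phase are dealt to the controllers in turn; the function name `assign_controllers`, its parameters and this particular staggering are hypothetical.

```python
from typing import Dict


def assign_controllers(n_cores: int, n_controllers: int,
                       exec_units: int, transfer_units: int,
                       t: int) -> Dict[int, int]:
    """Map each core in its transfer phase at time t to a controller.

    Core c starts its transfer phase at offset (c * period) // n_cores
    within the cycle, so the start times are spread evenly. Cores in
    their execution phase are absent from the returned mapping because
    they have no access to any memory controller.
    """
    period = exec_units + transfer_units
    in_transfer = [
        core for core in range(n_cores)
        if (t - (core * period) // n_cores) % period < transfer_units
    ]
    # Deal the transferring cores to the controllers in round-robin order.
    return {core: rank % n_controllers for rank, core in enumerate(in_transfer)}


if __name__ == "__main__":
    # Equal execution and transfer durations: four cores transfer at a time.
    for t in range(4):
        print(t, assign_controllers(8, 2, exec_units=1, transfer_units=1, t=t))
    # Execution three times longer than transfer: two cores transfer at a time.
    print(assign_controllers(8, 2, exec_units=3, transfer_units=1, t=0))
```

Under these assumptions, equal durations give four transferring cores, two per controller, at every instant, while a transfer phase three times shorter than the execution phase leaves only two cores accessing memory at a time.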

Figure 2, including Figures 2a and 2b, schematically illustrates a timing diagram of the activities of a processor comprising eight cores, implemented in accordance with the invention. The type of activity of each of the cores is here represented along the time axis 200. FIG. 2b shows part of FIG. 2a to illustrate more precisely the roles of a particular core, here the second. The marks 205-i, 205-j and 205-k define instants that represent static and regular rendezvous points at which the cores change roles. Thus, for example, from time 205-i, the first core executes a series of instructions previously stored in its cache memory with the corresponding data (reference 210). From the same instant, the second core exchanges data with a memory controller. First, it transmits data stored in its cache memory to the memory controller (reference 215). Then, in a second step, it receives data and instructions from the memory controller, which it stores in its cache memory (reference 220). Thus, the second core prepares for an autonomous execution phase during which it will not need to access the memory controllers.

The period separating two consecutive instants at which each core changes its role defines an execution slot denoted T. Then, at time 205-j, the first core transmits data stored in its cache memory to the memory controller (reference 225) and then receives data and instructions from the memory controller, which it stores in its cache memory (reference 230). From the same instant 205-j, the second core executes the instructions previously stored in its cache memory with the corresponding data (reference 235).

Again, at time 205-k, the first core executes previously received instructions while the second core transmits and receives data and instructions.
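
The alternation of roles between the first and second cores can be illustrated by a small Python simulation. This is only a sketch: the `Core` class, the slice labels ("A0", "B1", and so on) and the assumption that the first core already holds a slice before instant 205-i are hypothetical, and the two phases of a slot, which are concurrent in the system described, are simply called one after the other.

```python
class Core:
    """A core with a private cache, alternating transfer and execution."""

    def __init__(self, slices):
        self.pending = list(slices)  # slices still to be loaded, in order
        self.cache = None            # slice currently held in the private cache
        self.dirty = None            # modified data awaiting write-back
        self.log = []                # role taken in each successive slot

    def transfer(self, shared_memory):
        # First step (references 215 and 225): unload modified data.
        if self.dirty is not None:
            shared_memory.append(self.dirty)
            self.dirty = None
        # Second step (references 220 and 230): load data and instructions.
        self.cache = self.pending.pop(0) if self.pending else None
        self.log.append("transfer")

    def execute(self):
        # Execution only touches the private cache (references 210 and 235).
        if self.cache is not None:
            self.dirty = f"result({self.cache})"
            self.cache = None
        self.log.append("execute")


def run(n_slots):
    """Simulate n_slots execution slots for two cores and return their state."""
    shared_memory = []
    first = Core(["A1", "A2"])
    second = Core(["B1", "B2"])
    first.cache = "A0"  # assumed to have been loaded before instant 205-i
    for slot in range(n_slots):
        # The roles are swapped at every rendezvous point.
        if slot % 2 == 0:
            executing, transferring = first, second
        else:
            executing, transferring = second, first
        transferring.transfer(shared_memory)
        executing.execute()
    return first, second, shared_memory


if __name__ == "__main__":
    first, second, shared_memory = run(4)
    print(first.log)
    print(second.log)
    print(shared_memory)
```

In this sketch only the core in its transfer role ever touches `shared_memory`, so the execution phase of the other core depends on nothing but its own private cache.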

A similar mechanism is implemented in all cores. As indicated above, the SoC comprising the processor whose operation is illustrated in FIG. 2 also preferably comprises two memory controllers. Thus, the two pairs of cores 240 and 245 of the set 250 each access a different memory controller so that, within this set, each memory controller is accessed, at a given instant, by only one core. Similarly, the two pairs of cores 255 and 260 of the set 265 each access a different memory controller so that, within this set, each memory controller is accessed, at a given instant, by only one core. Thus, at a given instant, each memory controller is accessed by two separate cores.

It should be noted here that if the SoC has several memory controllers, the access of the cores to each of the memory controllers is advantageously balanced. However, only one memory controller can be used, especially if it is sufficient to meet the performance needs of the CPM TC. In this case, the use of a single memory controller makes it possible to reduce development costs and to improve the reliability, the mass and the heat dissipation of the SoC.

The scheduling of the transfer phases on all the cores is preferably strictly synchronous, balanced and planned. The use of shared resources, including memory controllers, is also preferably strictly synchronous, balanced and planned. Thus, if the SoC contains two memory controllers, half of the cores in the transfer phase access, at any time, one of the memory controllers and the other half access the other memory controller. If necessary, at preset times, all or some of the cores in the transfer phase can switch memory controllers to maintain the correct balance. Two strategies can be implemented:

- dedicate a single memory controller to each execution slice, an execution slice representing all the instructions executed by a core between two consecutive rendezvous points. However, in this case, the execution slice cannot take part in calculation processes that implement particular functions using the other memory controller. Such a strategy leads to the creation of calculation domains specific to each memory controller, with a problem of communication between the memory controllers that can be difficult to manage, especially for I/O using a particular core; and,

- require each execution slice to communicate equitably with each memory controller. Such a balancing constraint is not difficult to meet. Data are usually private to each execution slice. In addition, they can be duplicated if necessary, just like instructions. Moreover, these data can be placed on either memory controller to balance the sharing.

Although sharing a memory controller between two cores is not optimal from the point of view of the core, this solution is nevertheless preferable from the point of view of the memory controller, because a single core usually cannot sustain a request pipeline long enough to completely hide the latency of the memory used. Indeed, when the cores operate in tandem, since each access request does not depend on the completion of the N previous requests, where N is the depth of the access pipeline of a core (that is to say the capacity of the entity called the Load Store Unit, or LSU), the pipeline formed in the memory controller has a length P×N, P being the number of cores sharing the controller, which makes it possible to reach the optimum efficiency of the memories used (often considered one of the major bottlenecks in a multi-core system).

By way of illustration, for cores having an LSU pipeline depth of 5, two cores form a pipeline of 10 requests in the memory controller, i.e. 80 data transfers for bursts of 8 data per request. With a double data rate, it is thus sufficient for the latency of a request to be less than 40 cycles to avoid any period of inactivity in the pipeline of the memory controller.
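
The arithmetic of this example can be reproduced with a few lines of Python. The function name `controller_pipeline` and its parameter names are hypothetical; the sketch simply assumes that a double data rate memory performs two data transfers per cycle.

```python
def controller_pipeline(cores: int, lsu_depth: int,
                        burst_length: int, transfers_per_cycle: int):
    """Return (requests in flight, data transfers, cycles to drain them)."""
    requests = cores * lsu_depth             # pipeline length P x N
    transfers = requests * burst_length      # data transfers queued
    cycles = transfers // transfers_per_cycle
    return requests, transfers, cycles


if __name__ == "__main__":
    requests, transfers, cycles = controller_pipeline(
        cores=2, lsu_depth=5, burst_length=8, transfers_per_cycle=2)
    print(requests, transfers, cycles)
```

With the values of the example, the ten queued requests keep the memory busy for 40 cycles, which is why a request latency below 40 cycles leaves no idle period in the pipeline of the memory controller.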

Regarding the length of the execution slots, that is to say the spacing of consecutive rendezvous points, the following time references can be identified,

- the worst-case time to execute the cached code instructions with their associated data. Although this time depends on the nature of the application executed, it is relatively constant for avionics applications; and,

- the worst-case time to transfer modified data from the cache memories to the memory controllers and to load, from the memory controllers, the instructions, constants and variables of an execution slice into the caches. This time depends on the number of cores competing.

It should be noted here that closely spaced rendezvous points are possible but increase the number of execution slots and the size of the instruction and data placement problem when cutting the processing into execution slices. This fragmentation of the processing also increases the total volume of data to be loaded into and unloaded from the caches.

While Figures 2a and 2b illustrate an example of optimal placement when the duration of the unloading / loading phase is identical to that of the execution phase of the instructions, many other distributions are possible. By way of illustration, FIGS. 2c and 2d show examples of optimal placement when the duration of the execution phase of the instructions is less than three times that of the unloading / loading phase and greater than or equal to three times that of the unloading / loading phase, respectively, Δ representing the duration of an execution slice.

FIG. 3, comprising FIGS. 3a and 3b, illustrates an exemplary multi-core architecture adapted to implement the invention. The multi-core system 300 diagrammatically shown in FIG. 3a here comprises eight cores referenced 305-1 to 305-8, each connected to a local memory with a low, invariant and history-independent access time, that is to say independent of the previous execution of the computing unit to which it is connected. These local memories here bear references 310-1 to 310-8. They can be local cache memories or blocks of static memory accessible by virtual or physical addressing from the calculation units. Each local memory is itself connected to a bus unit, referenced 315-1 to 315-8, connected in turn to a common bus 320 connected to a shared memory 325.

The cores form arithmetic, logical, floating-point or other calculation units that perform complex processing. They only access the local memory to which they are connected. The problem of calculating the WCET of the cores forming the domain 330 is decorrelated from the multi-core characteristic and from the problem of modeling the shared external memory and the interconnection network of the cores forming the domain 335.

Moreover, the cache memories or static memory blocks are kept coherent and fed by a multi-actor system simpler than the cores. In particular, the variability due to inputs, the combinatorics due to branching decisions, all the speculative decisions that the execution units can take and all the variability due to uncertainties of synchronism between the cores are unknown to the domain 335. In practice, because of this lack of variability, it can be considered that a single measurement is sufficient to determine the unique time required to load each slice. This invariability is, however, obtained only if the memory refresh operations are deactivated and it is the periodicity of the accesses of the domain 335, to each page of memory, that ensures that the shared memory is maintained.

The WCET problem of the domain 330 then consists only in calculating the WCET of arbitrarily complex programs, considered individually, for each of the calculation slices, and independently of the complexity of the domain 335.

This decomposition into domains 330 and 335 can be achieved on conventional single-core or multi-core processors provided with cache memories and appropriate instruction sets by synchronizing the bus units of the cores and by making them play the role of the system that maintains the coherence of memories 310-1 to 310-8.

FIG. 3b illustrates an exemplary architecture of a multi-core SoC adapted to implement the invention.

The SoC 300' here comprises the eight cores 305'-1 to 305'-8, generically referenced 305, with which are associated private cache memories generically referenced 340, 345 and 350. For example, the cache memory L1_I, referenced 340-1, the cache memory L1_D, referenced 345-1, and the cache memory L2, referenced 350-1, are associated with the core 305'-1. Similarly, the cache memory L1_I, referenced 340-8, the cache memory L1_D, referenced 345-8, and the cache memory L2, referenced 350-8, are associated with the core 305'-8. The same applies to the other cores.

Each system consisting of a core and the associated private cache is connected to a fast data bus, referenced 320', which is itself connected to memory controllers 355-1 and 355-2, generically referenced 355.

It should be noted that the core 305'-8 is here dedicated to the management of physical inputs/outputs. By way of illustration, the cores 305'-1 to 305'-8 may have an internal frequency of 1.6 GHz. The data bus connecting the cores to the memory controllers can also use a frequency of 1.6 GHz. Thus, if the volume of data exchanged between the memory controllers and the cache memories, including the instructions, the written data and the read data, is 192 KB, then the loading/unloading time is about 25 μs, taking into account the sharing of the channel between two cores and the memory controllers as well as the overhead linked to the configuration descriptors of the next slice.

Still according to this example, the execution time of the instructions, which represent about two-thirds of the data exchanged, at a rate of one instruction per three core cycles at 1.6 GHz, is about 54 μs.
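The order of magnitude of this execution time can be checked with a short calculation. The sketch below is illustrative only: the instruction size of 4 bytes and the instruction volume of 115,200 bytes (60% of 192,000 bytes) are assumptions that do not appear in the description; they are chosen because they reproduce the quoted figure of about 54 μs, whereas exactly two-thirds of 192,000 bytes would give 60 μs under the same assumptions.

```python
def execution_time_us(instr_bytes, bytes_per_instr=4, cycles_per_instr=3, freq_hz=1.6e9):
    """Time, in microseconds, to execute every loaded instruction once.

    bytes_per_instr is an assumed value; the description only gives the
    rate of one instruction per three cycles and the 1.6 GHz frequency.
    """
    n_instr = instr_bytes / bytes_per_instr
    return n_instr * cycles_per_instr / freq_hz * 1e6

# Assumed instruction volumes (hypothetical, see lead-in).
print(round(execution_time_us(115_200), 1))  # about 54 microseconds
print(round(execution_time_us(128_000), 1))  # about 60 microseconds
```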

Moreover, since applications generally require a memory space greater than the capacity of the cache memories specific to each core, they must be divided into several phases. Each phase is processed in an execution slice. The volumes of instructions and data involved in each slice must be compatible with the capacity of the different caches. The division must in particular produce as few slices as possible, with each slice carrying out as much processing as possible. This division is preferably performed prior to execution, by a software generation workshop.

FIG. 4, comprising FIGS. 4a to 4d, illustrates an example of the mechanism by which each core of a multi-core processor, when in the transfer phase, accesses the memory controllers of this processor.

As indicated above, in order not to specialize the cores on a part of the applications, it is necessary to split the loading and unloading phases into balanced batches on each memory controller. This splitting must also separate the loadings from the unloadings in order to reduce and simplify the access combinations obtained by combining two cores (the combinations are reduced to all the cores being in the loading phase or all the cores being in the unloading phase). An important consideration in the separation of loadings and unloadings is the ease of constructing a model of the operation of the core bus units, the core interconnection network and the memory controllers. For the cores themselves, establishing a bus unit model that interleaves arbitrary memory accesses would be very difficult; building two half-models, one for loading and one for unloading, appears easier. Thus, even if a processor is complex, it is nevertheless possible to "simplify" it by considering its behavior only on a simple program, here a loading sequence and an unloading sequence that are uncorrelated, that is to say in which the completion of an instruction does not block the following instructions.

As illustrated in FIG. 4a, in a first step, a first half of the cores in the transfer phase accesses the first controller and the second half accesses the second controller. Thus, the cores 305'-1 and 305'-2 access the memory controller 355-2 while the cores 305'-3 and 305'-4 access the memory controller 355-1, and the cores 305'-5 to 305'-8 are in the execution phase and cannot access the memory controllers 355-1 and 355-2.

In a second step, as illustrated in FIG. 4b, the second half of the cores in the transfer phase accesses the first controller and the first half accesses the second controller. Thus, the cores 305'-1 and 305'-2 access the memory controller 355-1 while the cores 305'-3 and 305'-4 access the memory controller 355-2, and the cores 305'-5 to 305'-8 are still in the execution phase and still cannot access the memory controllers 355-1 and 355-2.

The first and second steps illustrated in FIGS. 4a and 4b are repeated so that, during a first period, the memory controllers 355-1 and 355-2 are used for data unloading and, during a second period, the memory controllers 355-1 and 355-2 are used for data loading. The first and second periods here have the same duration, the duration of the first period being, like that of the second period, identical for each memory controller.

Thus, the sequence of operations consists of unloading all the data, by crossing the links between the memory controllers and the cores in the transfer phase at a given moment, and then loading the new data, by again crossing the links between the memory controllers and the cores in the transfer phase at a given moment.

Then, the cores change roles. In other words, the cores that were in the transfer phase go into the execution phase while the cores that were in the execution phase go into the transfer phase. Thus, in a third step, as illustrated in FIG. 4c, the cores 305'-5 and 305'-6 access the memory controller 355-2 while the cores 305'-7 and 305'-8 access the memory controller 355-1, and the cores 305'-1 to 305'-4 are in the execution phase and cannot access the memory controllers 355-1 and 355-2.

Then, in a fourth step, as illustrated in FIG. 4d, the cores 305'-5 and 305'-6 access the memory controller 355-1 while the cores 305'-7 and 305'-8 access the memory controller 355-2, and the cores 305'-1 to 305'-4 are still in the execution phase and still cannot access the memory controllers 355-1 and 355-2.

Again, the third and fourth steps shown in FIGS. 4c and 4d are repeated so that, during a first period, the memory controllers 355-1 and 355-2 are used for data unloading and, during a second period, the memory controllers 355-1 and 355-2 are used for data loading. The first and second periods here have the same duration, the duration of the first period being, like that of the second period, identical for each memory controller. Thus, in a similar way, the sequence of operations consists of unloading all the data, by crossing the links between the memory controllers and the cores in the transfer phase at a given moment, and then loading the new data, by again crossing the links between the memory controllers and the cores in the transfer phase at a given moment.

Controlling the count of page changes within the memories used requires that two cores not access the same banks in the same transfer phase. This imposes additional constraints on two cores working at the same time for the same application. In practice, it requires that two cores do not access the memory used for an application at the same time. The I/O server, described below, is a special case because, by definition, it accesses all applications. The goal is then to place the accesses of the applications to their I/O at different dates on the I/O server.
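The alternation described with reference to FIGS. 4a to 4d can be summarized by a small table-driven sketch. The function names and the numbering of the steps from 0 to 3 are illustrative assumptions; the core k stands for the core 305'-k and the controller labels are those of FIG. 3b.

```python
def transfer_map(step):
    """Return {core: memory controller} for step 0..3 (FIGS. 4a..4d).

    Cores that are absent from the returned mapping are in the
    execution phase and have no access to the memory controllers.
    """
    group = (1, 2, 3, 4) if step in (0, 1) else (5, 6, 7, 8)
    # The links are crossed between an even step and the following odd step.
    first, second = ("355-2", "355-1") if step % 2 == 0 else ("355-1", "355-2")
    return {group[0]: first, group[1]: first, group[2]: second, group[3]: second}

def executing_cores(step):
    """Cores in the execution phase during the given step."""
    return [core for core in range(1, 9) if core not in transfer_map(step)]

for step, fig in enumerate(("4a", "4b", "4c", "4d")):
    print(fig, transfer_map(step), executing_cores(step))
```

In every step each memory controller serves exactly two cores, which is the balance that the description requires.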

Each core has, permanently, that is to say, locked in cache memory, an instance of a supervision software that aims to sequence all the slices to be executed on the core. For example, it performs, for each slice of execution, the following operations:

- reading, in a configuration table stored in a memory accessed via a memory controller, of the information on the blocks to be loaded into the cache memories and on the information to be transmitted;

- loading instructions, constants and data into cache memories;

- execution of the contents of the slice;

- waiting for the end of the execution slice; and,

- transmission via the memory controllers of the modified data.

The determination of the worst-case transfer can be carried out according to two approaches:

- by measurement if there are few temporal configurations, it is possible to measure them and to predict, for each access sequence, the time of each access; and,

- by construction of a model of the multi-core system restricted to the sequences of instructions in the supervision software. It is then possible to know the state of the cores at any moment. This approach assumes, however, that the SoC design information needed to model the transfer process is known.
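The per-slice operations of the supervision software listed earlier can be pictured with the following minimal sketch. It is a simulation under stated assumptions, not the supervision software itself: the shared memory and the private cache are modeled as dictionaries, the descriptor layout (`load` and `flush` keys) and the name `run_slice` are hypothetical, and the wait for the end of the fixed-length slot is omitted.

```python
def run_slice(descriptor, shared_memory, execute):
    """Sequence one execution slice on one core (illustrative model)."""
    # 1. Read the configuration table entry for this slice.
    to_load, to_flush = descriptor["load"], descriptor["flush"]
    # 2. Transfer phase: load instructions, constants and data into the cache.
    cache = {addr: shared_memory[addr] for addr in to_load}
    # 3. Execution phase: the slice body only ever sees the private cache.
    #    Touching an address that was not loaded raises KeyError here,
    #    the analogue of the access violation described in the text.
    execute(cache)
    # 4. Waiting for the end of the execution slot is not modeled.
    # 5. Next transfer phase: transmit only the data declared as modified.
    for addr in to_flush:
        shared_memory[addr] = cache[addr]

memory = {"x": 2, "y": 0}

def body(cache):
    cache["y"] = cache["x"] * 21
    cache["x"] = 99  # not listed in "flush", so never written back

run_slice({"load": ["x", "y"], "flush": ["y"]}, memory, body)
print(memory)
```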

It should be recalled here that, according to the invention, the cores do not have access to the memory controllers during their execution phase. In other words, the cores have no access to addresses not already present in the cache memories. Restricting the execution to the data and instructions loaded in the cache thus has the same effect as programming the memory management unit, called MMU, at the granularity of the cache lines, since any access outside the addresses determined by the placement result would trigger an access violation exception.

If an application causes a cache miss, whether through a bug, a failure or an alteration of the SEU type (acronym for Single Event Upset, representing an alteration of the state of a bit in a memory or a register due to the passage of a cosmic ray), the core may initiate an access to the memory controllers. However, this access is denied and causes an exception that is handled by the supervision software, which disables the slice, the core or the application to which the slice belongs. Of course, it is assumed here that such a protection mechanism can be established on the multi-core system. A SoC explicitly designed for this purpose offers this possibility very simply.

Alternatively, it is possible, at the level of the bus arbitration system, to deny the requests of the cores in the execution phase. Another solution consists in triggering an interrupt on a bus access observed by a means normally dedicated to debugging. It is also possible to map, on the core side, the memory controllers to different addresses for cores accessing the memory at different times, and then to physically map the memory controllers to the addresses expected by the cores that have access to the memory at a given instant. In general, the simplest solution is for the SoC to have a DMA capable of loading into the caches or local memory of each core the data it needs for the next slice.

The caches preferably contain either data locked indefinitely, that is to say data locked for the duration of the critical time phase, or data locked for the duration of a slice. The cache closest to the cores, reserved for instructions, is locked with the most critical code elements, for example a library of frequently called routines. The most distant cache memory advantageously contains the application code and the largest tables of constants, which have the lowest usage-to-volume ratio.

The slice-dependent data is loaded into the cache memory from a descriptor table, itself contained in the memory accessible via a memory controller and loaded into cache memory. It is possible to build tables whose overhead does not exceed one percent by volume. At the end of the execution slice, the descriptor table is used again to transmit the expected modified data (flush operation). It must also be ensured that there can be no side effect due to unmodified data remaining in the cache memory, for example by globally invalidating the cache memories (after a backup, if necessary, in another cache of locked persistent data). By way of illustration, a non-LRU cache memory (LRU being the acronym for Least Recently Used) does not guarantee that the data of the old slice will necessarily disappear in favor of the data of the new slice.

An important point in implementing the invention lies in the proper division of instructions and data, so as to allow the construction of calculation slices that make the best use of the resources of the cores. Thus, each slice should preferably satisfy the following conditions:

- the execution must not produce a cache miss, that is to say that all the data required by an execution slice must be available in cache memory;

- the instruction and data volumes must respect the sizes of the caches;

- the worst-case execution time, or WCET, must be less than the duration of the execution slots; and,

- the execution must respect the scheduling constraints.

In addition, the processing must be reasonably divisible and not highly sequential, in order to leave a few degrees of freedom for the placement solution, and the ratio between instructions and data, that is to say the computational density, should preferably be high so that the solution is effective. In other words, when the caches are loaded with instructions and data, it must be possible for the cores to execute a large number of instructions before having to return to the bus to update their cache memory. Thus, for example, it is desirable not to use a function requiring large tables of data, which would block a large part of the cache memory for only a few instructions.

As it happens, many avionics applications, such as electrical flight control applications, are written in the form of boards, for example SCADE boards (SCADE is a trademark), which have such properties. Moreover, with the exception of certain temporal constraints, the scheduling of the boards is free.

The placement of the processing in slices is done offline, that is to say before the execution of the slices, by a tool of the software generation chain. The principle is to use the various methods available for multi-objective optimization under constraints in order to statically solve a placement of instructions and data. Placing the processing into execution slices offline is essential to find a solution that is as close to optimal as possible. It makes it possible to improve the WCET, or even to obtain its minimum, for the application concerned, while benefiting from the improvement in determinism due to the data locality constraints defined previously.

Advantageously, the constraint resolution application makes it possible to restrict the mathematical expressions to linear equations in order to solve the system of equations and to optimize a function (operations research). The solution here is preferably restricted to integer solutions. Such an approach, called PLNE (French abbreviation of Programmation Linéaire en Nombres Entiers) or Integer Linear Programming (ILP) in English terminology, aims to express a problem by a system of linear equations and/or inequalities with (partially) integer solutions.

A resolution of the PLNE type can be carried out by the simplex method, which combinatorial optimization tools can offer, supplemented with heuristics to keep the problem computable.

To make the task of the constraint resolution application easier, it is preferable to simplify the problem or to split it into simpler subproblems. According to a particular embodiment, the constraint resolution application is asked to choose a slice for each board. The index i, varying from 1 to S, here denotes the slice numbers, while the index j, varying from 1 to N, denotes the numbers of the boards, also called nodes, that is to say the indivisible fractions of the application. A Boolean variable N designating the state of a node is defined such that Nj,i = 1 if the node j is placed in the slice i and Nj,i = 0 if the node j is not placed in the slice i. Nj,i is called a "decision variable", indicating the placement decision for the node Nj.

Each node Nj is characterized by a large volume of instructions and constants, called L2j, specific to the node j, to be placed in the cache memory L2, as well as by a volume of variables and small constants, called L1j, specific to the node j, to be placed in the data cache memory L1_D. Each node Nj is also characterized by a list of variables shared with other nodes and by a worst-case execution time WCETj.

The constants of significant size, for example interpolation tables, are to be placed in the cache memory L2 so as not to exhaust the capacity of the cache memory L1_D. The transition threshold between the cache memories L2 and L1_D is determined by the placement tool. The expression of the size constraints on the cache memories L2 and L1_D is given here as an example and corresponds to a placement on two resources having different characteristics: one, fast but scarce, is to be reserved for data critical to the execution time, while the other is to be used for less critical instructions and data. This principle can be adapted to other distributions of resources.

It is then necessary to take into consideration the following constraints, expressed in the form of linear inequalities:

- each slice must not exceed the capacity MAXL2 of the cache memory L2: for all i,

$$L2_1 \times N_{1,i} + L2_2 \times N_{2,i} + \dots + L2_N \times N_{N,i} \le \mathrm{MAXL2}$$

that is,

$$\forall i,\quad \sum_{j=1}^{N} L2_j \times N_{j,i} \le \mathrm{MAXL2}$$

- each slice must not exceed the capacity MAXL1_D of the cache memory L1_D: for all i,

$$L1_1 \times N_{1,i} + L1_2 \times N_{2,i} + \dots + L1_N \times N_{N,i} + \mathrm{RESVL1\_D} \le \mathrm{MAXL1\_D}$$

that is,

$$\forall i,\quad \sum_{j=1}^{N} L1_j \times N_{j,i} + \mathrm{RESVL1\_D} \le \mathrm{MAXL1\_D}$$

- each slice must not exceed a maximum execution time MAXWCET: for all i,

$$WCET_1 \times N_{1,i} + WCET_2 \times N_{2,i} + \dots + WCET_N \times N_{N,i} \le \mathrm{MAXWCET}$$

that is,

$$\forall i,\quad \sum_{j=1}^{N} WCET_j \times N_{j,i} \le \mathrm{MAXWCET}$$

It is also necessary to force the placement solution to place each node once and only once, in a single slice: for all j,

$$N_{j,1} + N_{j,2} + \dots + N_{j,S} = 1$$

that is,

$$\forall j,\quad \sum_{i=1}^{S} N_{j,i} = 1$$

It should be noted here that the cache memory L1_D is used not only for small constants and variables but also for variables shared between several nodes. The value RESVL1_D represents this space. In a simplified approach to the problem, separating the problem of placing the nodes from the problem of placing the variables, it is advisable to choose a fixed value leading to a feasible and satisfactory solution. In a solution combining the optimization of the placement of the nodes and of the variables, RESVL1_D is chosen so as to represent exactly the occupation of the variables in the cache memory L1_D.
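To make the capacity and execution-time constraints concrete, the sketch below checks them on a toy instance and enumerates every feasible placement by brute force. Exhaustive enumeration stands in here for the ILP solver mentioned in the description and is only practical for very small instances; the function names and all numeric values are illustrative assumptions, and slices are numbered from 0 rather than from 1.

```python
from itertools import product

def is_feasible(assign, n_slices, l2, l1, wcet, max_l2, max_l1d, resv_l1d, max_wcet):
    """Check the L2, L1_D and WCET inequalities for one candidate placement.

    assign[j] is the slice chosen for node j, so each node is placed
    exactly once by construction (sum over i of N[j][i] equals 1).
    """
    for i in range(n_slices):
        nodes = [j for j, s in enumerate(assign) if s == i]
        if sum(l2[j] for j in nodes) > max_l2:
            return False
        if sum(l1[j] for j in nodes) + resv_l1d > max_l1d:
            return False
        if sum(wcet[j] for j in nodes) > max_wcet:
            return False
    return True

def feasible_placements(n_slices, l2, l1, wcet, **limits):
    """Enumerate all placements that satisfy every constraint."""
    return [a for a in product(range(n_slices), repeat=len(l2))
            if is_feasible(a, n_slices, l2, l1, wcet, **limits)]

# Toy instance: 3 nodes, 2 slices (hypothetical sizes and times).
placements = feasible_placements(
    2, l2=[60, 50, 30], l1=[10, 10, 10], wcet=[30, 30, 30],
    max_l2=100, max_l1d=32, resv_l1d=8, max_wcet=70)
print(placements)  # [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
```

In this instance nodes 0 and 1 can never share a slice because their combined L2 volume (110) exceeds MAXL2 (100), which leaves four feasible placements.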

When a scheduling constraint exists between two nodes, for example if Nj is to be executed before Nk, the following set of constraints is added (there is one Nk,i for each candidate slice for placement): for all j, k such that j must precede k, for all i ≥ 2,

$$N_{k,i} + N_{k,i+1} + \dots + N_{k,S} \ge N_{j,i}$$

that is,

$$\sum_{l=i}^{S} N_{k,l} \ge N_{j,i}$$

Thus, if Nj is placed in the slice i, then Nk must also be placed in the slice i or in one of the following ones. If there are also constraints prohibiting the separate placement of two nodes (non-breaking nodes), they can then share the same decision variable.
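The scheduling constraint can be read in two equivalent ways, which the sketch below compares: directly, as "the slice of Nk is not earlier than the slice of Nj", and in its linear form on the decision variables. Both function names are illustrative assumptions, slices are numbered from 0, and the exhaustive loop only serves to show that the two readings agree on a small case.

```python
from itertools import product

def respects_precedence(assign, precedences):
    """precedences holds pairs (j, k): node j must not run after node k."""
    return all(assign[k] >= assign[j] for j, k in precedences)

def linear_form_holds(assign, precedences, n_slices):
    """Same condition written as sum over l >= i of N[k][l] >= N[j][i]."""
    def n(j, i):
        return 1 if assign[j] == i else 0
    return all(
        sum(n(k, l) for l in range(i, n_slices)) >= n(j, i)
        for j, k in precedences
        for i in range(n_slices)
    )

# Node 0 before node 1, node 1 before node 2, over 3 slices.
chain = [(0, 1), (1, 2)]
agree = all(
    respects_precedence(a, chain) == linear_form_holds(a, chain, 3)
    for a in product(range(3), repeat=3)
)
print(agree)  # True
```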

In addition to sharing variables, nodes can share constants. In an exhaustive representation of the problem, it is possible to express the placement decisions for these constants precisely. However, the sharing of small constants generally has little impact on sizing and does not justify complicating the problem. The small constants can be duplicated, that is to say placed differently in each slice, at no significant cost, by using locations left unused by the distribution of variables in memory. Large constants, usually few in number, for example trigonometric interpolation tables, nevertheless justify a search for optimization.

The variable Cc,i is defined as equal to one if the constant Cc is referenced in the slice i. Otherwise it is equal to zero. A constraint on Cc,i is added as follows: for any slice i, for any node j referencing Cc,

$$C_{c,i} \ge N_{j,i}$$

Thus, from the moment when the node j using Cc is placed in the slice i, Cc, i is forced to 1. It should be noted that Cc, i is not really a decision variable, it is a consequence of the decision of placement of nodes Nj.

The large constants being, for example, placed in the cache memory L2, the constraint on the cache memory L2 is reformulated as follows: for all i,

$$L2_1 \times N_{1,i} + L2_2 \times N_{2,i} + \dots + L2_N \times N_{N,i} + \dots + \mathrm{sizeof}(C_c) \times C_{c,i} + \dots \le \mathrm{MAXL2}$$

that is,

$$\forall i,\quad \sum_{j=1}^{N} L2_j \times N_{j,i} + \sum_{c=1}^{C} \mathrm{sizeof}(C_c) \times C_{c,i} \le \mathrm{MAXL2}$$

where sizeof(Cc) represents the size of the constant Cc and C is the number of large constants.
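The effect of the shared large constants on the L2 occupation can be illustrated as follows. In this sketch the value of Cc,i is not solved for but derived from a given node placement (it is 1 as soon as one node referencing the constant sits in the slice); the function name, the constant name `sin_table` and the sizes are illustrative assumptions.

```python
def l2_usage(assign, n_slices, l2, const_size, const_refs):
    """Per-slice L2 occupation, large shared constants included.

    const_refs[c] is the set of nodes referencing constant c.
    """
    usage = []
    for i in range(n_slices):
        nodes = {j for j, s in enumerate(assign) if s == i}
        total = sum(l2[j] for j in nodes)
        # C[c][i] = 1 when at least one referencing node is in slice i.
        total += sum(size for c, size in const_size.items() if const_refs[c] & nodes)
        usage.append(total)
    return usage

sizes = {"sin_table": 16}
refs = {"sin_table": {0, 1}}
# Nodes 0 and 1 split over two slices: the table is loaded in both.
print(l2_usage((0, 1, 0), 2, [60, 50, 30], sizes, refs))  # [106, 66]
# Nodes 0 and 1 grouped: the table is loaded once, in slice 0 only.
print(l2_usage((0, 0, 1), 2, [60, 50, 30], sizes, refs))  # [126, 30]
```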

The same formalism can be applied for any shared variable Vv. In other words, Vv, i = 1 if the variable Vv is referenced in the slice i else Vv, i = 0.

A constraint is also added on Vv,i as follows: for any slice i, for any node j referencing Vv,

$$V_{v,i} \ge N_{j,i}$$

To limit the overall complexity of the placement, it is possible to subdivide the problem by first looking for a placement of the nodes with criteria for grouping the references to the variables (and constants), that is to say a solution that minimizes the sum of the Vv,i over all variables Vv and all slices i. It is thus necessary to minimize the following quantity:

$$\sum_{v} \sum_{i=1}^{S} V_{v,i}$$

It should be noted that this function does not aim to minimize the worst-case filling of the slices. In practice, minimizing the number of references to the variables consists, on the contrary, in maximizing the occupation of certain slices. It may be desirable, however, to keep a certain margin in the cache memory in each slice, in order to accept changes to the software to be placed without having to rerun the placement tool and possibly obtain a placement completely different from the previous one. This is particularly useful in the context of incremental qualification and verification, where it is not necessary to test the unmodified software parts again.
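The quantity to be minimized, the sum of the Vv,i over all shared variables and all slices, simply counts the (variable, slice) pairs for which the variable has to be present in the slice. A minimal sketch, with a hypothetical function name and toy data, is:

```python
def shared_reference_cost(assign, var_refs):
    """Sum of V[v][i]: for each variable, the number of distinct slices
    that contain at least one node referencing it.

    var_refs[v] is the set of nodes referencing variable v.
    """
    return sum(len({assign[j] for j in nodes}) for nodes in var_refs.values())

refs = {"v1": {0, 1}, "v2": {1, 2}}
print(shared_reference_cost((0, 1, 2), refs))  # 4: every variable spans two slices
print(shared_reference_cost((0, 0, 1), refs))  # 3: v1 is now local to slice 0
print(shared_reference_cost((0, 0, 0), refs))  # 2: everything grouped in one slice
```

The last line illustrates the remark made in the text: the cost is lowest when the nodes are grouped, which fills some slices as much as possible.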

For the placement of the variables, decision variables are defined in the following way: Mv,b = 1 if the variable Vv is placed in the block b, otherwise Mv,b = 0, b being a block index varying from 1 to B (a block here is a cache line or a group of cache lines). The larger the blocks, the more difficult it is to find placements that use the block space efficiently. On the other hand, the complexity of the problem is reduced (fewer decision variables) and the efficiency of the cache operations is improved.

This results in the following constraints, expressed as linear equations:

- do not allocate variables in a block beyond its capacity MAXBLOC: for any block b,

$$\mathrm{sizeof}(V_1) \times M_{1,b} + \dots + \mathrm{sizeof}(V_v) \times M_{v,b} + \dots \le \mathrm{MAXBLOC}$$

that is,

$$\forall b,\quad \sum_{v=1}^{NbVar} \mathrm{sizeof}(V_v) \times M_{v,b} \le \mathrm{MAXBLOC}$$

- allocate each variable once and only once: for any variable Vv,

$$M_{v,1} + \dots + M_{v,b} + \dots = 1$$

that is,

$$\forall v,\quad \sum_{b=1}^{B} M_{v,b} = 1$$
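A quick way to obtain an allocation that satisfies these two constraints is a first-fit heuristic, sketched below. This is a swapped-in technique for illustration, not the resolution method of the description: it only guarantees the block capacity and the "once and only once" allocation, and it ignores which slices reference each variable. The function name and the sizes are assumptions.

```python
def first_fit(var_size, n_blocks, max_bloc):
    """Place each variable in the first block with enough free space.

    Returns {variable: block index}; every variable appears exactly once
    and no block receives more than max_bloc bytes.
    """
    free = [max_bloc] * n_blocks
    placement = {}
    # Largest variables first, names as tie-breaker for a stable result.
    for v in sorted(var_size, key=lambda name: (-var_size[name], name)):
        for b in range(n_blocks):
            if var_size[v] <= free[b]:
                free[b] -= var_size[v]
                placement[v] = b
                break
        else:
            raise ValueError(f"variable {v} does not fit in any block")
    return placement

print(first_fit({"a": 24, "b": 16, "c": 8, "d": 8}, n_blocks=2, max_bloc=32))
# {'a': 0, 'b': 1, 'c': 0, 'd': 1}
```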

The loading of a block b in a slice i is constrained as follows: for any variable Vv referenced by any node Nj,

$$H_{b,i} \ge M_{v,b} + N_{j,i}$$

where:

- Hb,i = 0 implies that the slice i is empty and that the block b is also empty (which is only possible if slices and blocks have been defined in excess);

- Hb,i = 1 implies that there is no node Nj placed in the slice i and accessing variables placed in the block b, and therefore that the block b is not required by the slice i; and

- Hb,i = 2 implies that there exists at least one node Nj placed in the slice i and accessing at least one variable Vv placed in the block b, and therefore that the block b is required by the slice i.

For a joint optimization of the placement of the nodes and of the variables, it is then possible to complete the second constraint mentioned above by replacing the value RESVL1_D by the allocation of the blocks intended for the variables. It is then necessary to minimize the value USAGE (where USAGE ≤ MAXL1_D) while respecting the following constraints: for all i,

$$L1_1 \times N_{1,i} + L1_2 \times N_{2,i} + \dots + L1_N \times N_{N,i} + \mathrm{BLK\_SZ} \times (H_{1,i} + \dots + H_{B,i} - B) \le \mathrm{USAGE}$$

where BLK_SZ represents the size of a block.

Minimizing the value USAGE has the effect of seeking the placement of the variables that minimizes the worst-case filling of the cache memory L1_D by the slices. Naturally, a placement of instructions and data on a monolithic memory area would lead to different formulas, and a placement on a memory hierarchy with more levels would be different again, without invalidating the principles mentioned here.
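The worst-case L1_D filling that USAGE bounds can be evaluated for a given placement as in the sketch below. Instead of carrying the values 0, 1 and 2 of Hb,i, the code directly counts the blocks required by each slice, which corresponds to the term (H1,i + ... + HB,i - B) when every Hb,i is 1 or 2. Function names and numeric values are illustrative assumptions.

```python
def l1d_usage(assign, n_slices, l1, var_refs, var_block, blk_sz):
    """Per-slice L1_D occupation: node-private data plus required blocks.

    var_refs[v] is the set of nodes referencing shared variable v,
    var_block[v] the block in which v has been placed.
    """
    usage = []
    for i in range(n_slices):
        nodes = {j for j, s in enumerate(assign) if s == i}
        required = {var_block[v] for v, refs in var_refs.items() if refs & nodes}
        usage.append(sum(l1[j] for j in nodes) + blk_sz * len(required))
    return usage

refs = {"v1": {0, 1}, "v2": {1, 2}, "v3": {2}}
good = l1d_usage((0, 0, 1), 2, [10, 10, 10], refs, {"v1": 0, "v2": 0, "v3": 1}, 8)
poor = l1d_usage((0, 0, 1), 2, [10, 10, 10], refs, {"v1": 0, "v2": 1, "v3": 1}, 8)
print(good, max(good))  # [28, 26] 28
print(poor, max(poor))  # [36, 18] 36
```

The first variable placement gives the lower worst-case value (28 against 36) and would therefore be preferred when minimizing USAGE.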

The placement of the variables can also be optimized after the placement of the nodes, that is to say in two stages: first the nodes are placed by optimizing the references to the variables, but without optimizing the placement of the variables in cache lines; then the variables are placed, taking advantage of the result obtained for the nodes. In that case the constraints can be reformulated more simply according to the following rules:

- the variables all of whose references have been placed in the same slice can be integrated into the space allocated in the cache memory L1_D for the variables and small constants specific to the nodes of the slice; and,

- for the variables Vv shared by the slices i, for each block b, the following constraint is defined: Hb,i ≥ Mv,b, with Hb,i = 1 if there exists at least one variable Vv referenced by the slice i and placed in the block b.

It is then necessary to minimize the value USAGE (where USAGE ≤ MAXL1_D) while respecting the following constraints: for all i,

$$\mathrm{USAGE\_L1}_i + \mathrm{BLK\_SZ} \times (H_{1,i} + \dots + H_{B,i}) \le \mathrm{USAGE}$$

where USAGE_L1i is derived from the placement result of the nodes, that is to say,

$$\mathrm{USAGE\_L1}_i = L1_1 \times N_{1,i} + L1_2 \times N_{2,i} + \dots + L1_N \times N_{N,i} + \mathrm{sizeof}(\text{variables shared only in } i)$$

The node-specific variables and small constants can easily be separated into modified blocks and unmodified blocks in order to minimize the number of flushes at the end of the slice. To optimize the placement of the shared variables and ensure that the solution respects the maximum number of unloadings (flushes), additional constraints must be added. Thus, for any variable Vv written in the slice i, for any block b,

$$W_{b,i} \ge M_{v,b}$$

In addition, for any slice i, the function minimizing the value USAGE (where USAGE ≤ MAXL1_D) is sought under the following constraints: for all i,

$$\mathrm{USAGE\_W\_L1}_i + \mathrm{BLK\_SZ} \times (W_{1,i} + \dots + W_{B,i}) \le \mathrm{MAX\_FLUSH}$$

where the value USAGE_W_L1i comes from the placement result of the nodes and corresponds to the size of all the data modified in the slice i and known before the resolution of the placement constraints of the variables.
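The flush constraint can be checked for one slice with a few lines. In this sketch the set of shared variables written by the slice, the block placement and the numeric values are illustrative assumptions, as are the function names.

```python
def flush_volume(written_shared_vars, var_block, usage_w_l1, blk_sz):
    """Volume to unload at the end of a slice.

    usage_w_l1 is the size of the node-private modified data, known from
    the node placement; each block holding at least one written shared
    variable (W[b][i] = 1) adds one full block to the volume.
    """
    dirty_blocks = {var_block[v] for v in written_shared_vars}
    return usage_w_l1 + blk_sz * len(dirty_blocks)

def respects_flush_limit(written_shared_vars, var_block, usage_w_l1, blk_sz, max_flush):
    return flush_volume(written_shared_vars, var_block, usage_w_l1, blk_sz) <= max_flush

written = {"v2", "v3"}
print(flush_volume(written, {"v1": 0, "v2": 0, "v3": 1}, 16, 8))  # 32: two dirty blocks
print(flush_volume(written, {"v1": 0, "v2": 1, "v3": 1}, 16, 8))  # 24: one dirty block
```

Grouping the variables written by the same slice in the same block reduces the volume to be flushed, which is what the additional constraints encourage.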

Some simplifications can be made to the equations described above. For example, it is possible to calculate only one placement decision for all the variables sharing exactly the same list of referencing slices.

According to a particular embodiment, it is possible to simplify the problem by dividing the nodes or the variables into several subsets. This preliminary division can be directed by the designer of the software to be placed, for example because he knows that his application is composed of three largely independent subsystems, or by the placement tool according to heuristics, for example by identifying nodes referencing the same variables. Each subproblem is then subject to an independent placement of its nodes and of its own variables. A final placement of the shared variables completes the resolution of the problem. For example, the nodes can be divided into several subsets according to their periodicities. The slices are then scheduled at the periodicity of the nodes. It is also possible to split the specification used into relatively independent functional blocks. Other alternatives are possible, including expressing a prior system of constraints to distribute the nodes into a small number of subsystems rather than directly distributing the nodes into a large number of slices.

Since the desired optimum may in any case be degraded by the heuristics (simplification choices) put in place, non-exhaustive methods can be used to solve the combinatorial optimization problem that the placement problem represents.

While keeping the objective functions previously described and the constraints related to the architecture implemented, optimization methods such as estimation of distribution algorithms, methods based on the principle of evolutionary (or genetic) algorithms, neural networks or particle swarm optimizers can be used.

Combinatorial optimization is an active and evolving field of research, and many approaches are available, each with its advantages and disadvantages. In the case of an estimation of distribution algorithm, the idea here is to seek an optimization of the placement of the nodes and then of the variables, or even of the variables only, the objective functions guiding the iterative search for a better solution being notably the minimization of the data exchanges between the slices and the minimization of the execution time through a very fine localization of the data (minimizing the number of cache lines that a calculation sequence must load or unload at the level of an L1 cache within an execution slice). The presence of constraints of different natures may lead to an optimum search based on several optimization methods.
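As an illustration of the estimation of distribution principle, the sketch below uses a univariate marginal distribution algorithm (UMDA), one of the simplest members of this family. The description does not say which variant is intended, so this choice, the function name, the parameters and the toy cost function are all assumptions. A per-node probability of belonging to each slice is sampled, the best samples are kept, and the probabilities are re-estimated from them.

```python
import random

def umda_placement(n_nodes, n_slices, cost, generations=20, pop=50, elite=10, seed=0):
    """Search for a low-cost node placement; returns the best sample seen."""
    rng = random.Random(seed)
    # probs[j][i]: current estimate of the probability that node j goes in slice i.
    probs = [[1.0 / n_slices] * n_slices for _ in range(n_nodes)]
    best = None
    for _ in range(generations):
        samples = [
            tuple(rng.choices(range(n_slices), weights=probs[j])[0] for j in range(n_nodes))
            for _ in range(pop)
        ]
        samples.sort(key=cost)
        if best is None or cost(samples[0]) < cost(best):
            best = samples[0]
        # Re-estimate each node's distribution from the elite samples, with
        # add-one smoothing so that no slice ever reaches probability zero.
        for j in range(n_nodes):
            counts = [1.0] * n_slices
            for s in samples[:elite]:
                counts[s[j]] += 1.0
            total = sum(counts)
            probs[j] = [c / total for c in counts]
    return best

# Toy cost: number of distinct slices used by two nodes sharing one variable.
best = umda_placement(2, 2, cost=lambda a: len(set(a)))
print(best, len(set(best)))
```

Capacity or scheduling constraints could be added to such a search as penalties in the cost function, at the price of no longer guaranteeing that the returned placement is feasible.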

For example, concerning the flight control application, it is possible to distinguish the objectives and constraints aiming to improve the WCET through a fine localization of the data from the scheduling and sequencing constraints of sets of processing operations. Since the latter are more difficult for an estimation of distribution algorithm but do not concern all the processing operations, they can be treated differently. Here again, the state of the art in combinatorial optimization makes it possible to adopt a set of approaches giving more or less satisfactory results depending on the constraints of the application considered and on the hardware architecture envisaged, in order to obtain the calculation slices sought.

According to the system of the invention, the calculation slices have no access to physical inputs/outputs, called I/O. They can only access variables that have been placed in cache memory by the supervision software. Thus, as illustrated in FIG. 3b, a core, or several if necessary, is preferably dedicated to the management of physical I/O. This core, called the I/O core, hosts a function of the "I/O server" type, as opposed to the other cores, which can be considered as "computing servers". The I/O core produces the variables corresponding to the de-formatted inputs of the module and consumes the variables corresponding to the unformatted outputs of the module. If the computational load due to the formatting functions of the I/O core is too large, it is conceivable to assign these formatting operations to the computing cores and to leave only the data transfers on the external buses of the SoC to the I/O server. From the point of view of the computing cores, the I/O core is a producer and consumer of ordinary data.

The activities of the I/O server cover the operations of access to physical registers and bus controllers, for example Ethernet, PCIe or non-volatile memory controllers, and the operations of verifying data and converting it to the data structures and types known to the applications. These operations are defined by configuration tables, loaded during the transfer slices, planned by the placement tool together with the planning of the loadings of the calculation slices. The I/O core has its software and some data resident, and uses its transfer phases to load and unload the values of the actual inputs and outputs as well as the configuration table elements necessary for their processing.

The I/O core is preferably the only core having access to buses of the PCIe, Ethernet or other type. Being the only one, and provided that its accesses do not interfere with the accesses of the computing cores to the memory controllers, the I/O core has the use of these buses full time. On the other hand, being treated like any other core from the point of view of access to the memory controllers, it has strictly static slots and access ranges, planned at the same time as the accesses of the computing cores.

Moreover, if bus controllers have to carry out DMA-type data transfers, they must be able to reach memory targets without disturbing the computing cores that are in the transfer phase. Thus, according to a particular embodiment, a memory component must be available so that these DMA transfers can be made without affecting the memory used by the computing cores. This component can be a cache memory, preferably that of the I/O core, used as a target. It can also be another cache or memory area addressable within the SoC, possibly even an external memory bank addressed by a dedicated memory controller.

The activities of the I/O server are divided into execution and transfer slices that are strictly synchronous, balanced and planned, like the activities of the computing cores (or application cores). The I/O core uses its transfer slices to read the configuration tables, deposit the inputs in memory and retrieve the outputs. The execution slices are dedicated to controlling the bus controllers. The distribution of operations per slice is carried out by the offline placement tool described above, while respecting the processing capabilities of the I/O core and of the bus controllers, and in temporal consistency with the applications.
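The division of the I/O server activities into transfer and execution slices can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the operation names, the abstract time units and the strict alternation of the two kinds of slice are hypothetical choices made for the example, not details given in the description.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    TRANSFER = "transfer"    # the I/O core may access the memory controllers
    EXECUTION = "execution"  # the I/O core only drives the bus controllers


# Hypothetical operation names; each one is allowed in exactly one phase.
ALLOWED = {
    "read_config_table": Phase.TRANSFER,
    "deposit_inputs": Phase.TRANSFER,
    "fetch_outputs": Phase.TRANSFER,
    "drive_bus_controller": Phase.EXECUTION,
}


@dataclass
class Slice:
    phase: Phase
    ops: list = field(default_factory=list)  # (operation name, cost) pairs


def check_plan(plan, slice_duration):
    """Return the list of violations found in a static I/O core plan."""
    errors = []
    for index, sl in enumerate(plan):
        # Assumption of this sketch: the two phases alternate strictly.
        if index > 0 and sl.phase == plan[index - 1].phase:
            errors.append(f"slice {index}: phase does not alternate")
        for name, _cost in sl.ops:
            if ALLOWED[name] != sl.phase:
                errors.append(f"slice {index}: {name} not allowed in {sl.phase.value}")
        # The planned work must fit in the fixed slice duration.
        if sum(cost for _name, cost in sl.ops) > slice_duration:
            errors.append(f"slice {index}: planned cost exceeds slice duration")
    return errors
```

A check of this kind would be run offline, on the plan produced by the placement tool, rather than on the target itself.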

For these purposes, the SoC architecture must provide sufficient segregation of the paths used for exchanges between the I/O core and the bus controllers during the execution slots, so that these exchanges do not interfere with the exchanges between the memory and the computing cores in the transfer phase.

The physical inputs of the I/O server can be classified into two families:

- inputs that are synchronous with the applications, which are acquired on the initiative of the applications and can be placed in time in the I/O server slices. These inputs generally consist of reading one or more registers to receive information; and,

- inputs that are asynchronous with respect to the applications, which are acquired according to external events uncorrelated with the execution of the applications. Their acquisition cannot be planned in a completely deterministic way, as application processing or synchronous inputs can. These inputs usually consist of frames or messages received on digital buses such as Ethernet.

Only the synchronous outputs, i.e. the outputs issued or generated on the initiative of the applications, are considered here. However, for possible asynchronous outputs, for example an output of a device polled by the controller of a bus that is asynchronous with respect to the sequencing of the slices, it is possible to consider that the device has a mailbox that holds the deposited data. Depositing data in the mailbox is synchronous with the slices, while transmission on the bus is asynchronous.
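The mailbox idea can be sketched as follows. This is a minimal illustration only: the class, its method names and the freshness flag are assumptions introduced for the example and do not come from the description.

```python
class Mailbox:
    """Holds the last value deposited for an asynchronous output."""

    def __init__(self):
        self._value = None
        self._fresh = False

    def deposit(self, value):
        # Synchronous side: called by the I/O server in a planned slice.
        self._value = value
        self._fresh = True

    def poll(self):
        # Asynchronous side: called whenever the bus requests the data.
        # Returns the stored value and whether it is new since the last poll.
        value, fresh = self._value, self._fresh
        self._fresh = False
        return value, fresh
```

The deposit happens at a planned instant, whereas the poll can occur at any time without affecting the sequencing of the slices.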

Thus, apart from the asynchronous inputs, it is possible to establish a static plan, via the offline tool, that determines the accesses to the configuration tables, the input/output variables and the control activities of the I/O controllers.

For asynchronous inputs, the I/O server must have a resident configuration table element in its private cache memories. This element should allow it to correlate the unplanned arrival of the event with a request to access a specific memory area, and then to use a later scheduled access date for this area to acquire, if necessary, the additional configuration table elements and to deposit the event data, reformatted or not. The raw data must be kept in cache between the time of arrival and the opening of the memory access. The arrival of the event is unplanned in the sense that the moment at which it will arrive is unknown. However, the very existence of the event is planned: memory addresses and scheduled memory access opportunities have been assigned to it.

If the execution slices on the computing cores are grouped so that only one application is active at a time on all the cores, it is possible to reserve on the I/O server a prologue slice for the inputs and an epilogue slice for the outputs, so that the I/O server can be considered, during all this time, to be for the exclusive use of the active application. This alternative, according to which all the cores are dedicated to one application for a determined duration, that is to say several slices, requires that the problems of determinism of the memory controllers due to page changes be solved. They can be solved, for example, by applying a sufficiently precise model of the memory controllers to the lists of memory transfers required by each slice. This alternative also requires that the applications thus distributed have sufficient scheduling freedom to be distributed efficiently across all the cores in parallel.
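The handling of asynchronous inputs described in this passage can be illustrated by a minimal sketch. The table content, the source name "eth0", the area name and the use of slice indices as access windows are all hypothetical; the sketch also assumes that a later window always exists within the planned horizon.

```python
from collections import deque

# Hypothetical resident table: event source -> planned memory access
# windows (slice indices) and the memory area assigned to the event.
RESIDENT_TABLE = {
    "eth0": {"windows": [3, 7, 11], "area": "rx_area_0"},
}


class AsyncInputHandler:
    def __init__(self, table):
        self.table = table
        self.pending = deque()  # raw data held in the private cache
        self.memory = {}        # stands in for the shared memory

    def on_event(self, source, raw, current_slice):
        """Unplanned arrival: keep the raw data and pick the next window."""
        entry = self.table[source]
        # Assumes a window later than the current slice has been planned.
        window = next(w for w in entry["windows"] if w > current_slice)
        self.pending.append((window, entry["area"], raw))
        return window

    def on_transfer_slice(self, slice_index):
        """Planned access: deposit every item whose window has opened."""
        kept = deque()
        for window, area, raw in self.pending:
            if window == slice_index:
                self.memory.setdefault(area, []).append(raw)
            else:
                kept.append((window, area, raw))
        self.pending = kept
```

The arrival time is free, but the memory access itself only ever happens in a window that was planned offline.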

Alternatively, a mix of applications on different computing cores can be allowed. In this case, the slices of the I/O server preceding or following the calculation slices are allocated static CPU time and bus access resources (equivalent to micro-partitions). These resources are known to the application placement tool, so that the applications do not exceed the resources assigned to them.
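The check that applications stay within their assigned I/O server resources can be sketched as follows. The function name, the application names and the representation of a budget as a pair (CPU time, number of bus accesses) are assumptions made for the example.

```python
def over_budget(demands, budgets):
    """Return the applications whose planned I/O demand exceeds its budget.

    Both arguments map an application name to a (cpu_time, bus_accesses)
    pair expressed in arbitrary, hypothetical units.
    """
    offenders = []
    for app, (cpu, bus) in demands.items():
        max_cpu, max_bus = budgets[app]
        if cpu > max_cpu or bus > max_bus:
            offenders.append(app)
    return sorted(offenders)
```

Such a check would typically be part of the offline placement step, since the budgets are static.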

If the SoC has several Ethernet controllers, it is possible to perform AFDX or Erebus inputs/outputs in software. These implementations must, however, remain compatible with the static and deterministic constraints necessary for splitting into calculation slices.

For these purposes, Ethernet controllers should not access the main memory used by the computing cores and must work with independent memory and bus resources. Bus-type resources can optionally be shared if there is an "instantaneous" priority management capable of serving requests from the application cores without preemption or observable delay in the event of a conflict with the accesses of the Ethernet controllers or of the I/O server, and without invalidating the WCET analyses of the I/O server. This approach implies that the accesses of the Ethernet controllers can be transparent to the computing cores. For performance reasons, it is also desirable that the data written by the external buses, for example Ethernet or PCIe, be transferred to the local memory of the I/O server. This transfer can be carried out either directly by the DMA of the Ethernet controller or by a mechanism equivalent to that used for pre-loading the cache memories.

The AFDX transmission and reception operations are preferably adapted to be performed in the I/O core, while complying with the following constraints:

- the I/O core must respect the concept of communication slices and processing slices;

- the Ethernet controllers must not disturb the memory controllers or the other cores; and,

- the cache memories of the I/O core being too small to fully store the configuration and the variables related to the AFDX interface, they must be loaded in portions.

During the reception of data, the packets received by the Ethernet controllers are stored in the memory of the I/O core. They are analyzed as they are received and then transferred to other queues. A configuration table residing in the local memory of the I/O server is used to associate the virtual link identifiers (VLID, where VL stands for Virtual Link) of the received frames with one or more scheduled memory access windows for the I/O server. There is a window for depositing the application part of the frame in memory and possibly one or more other windows for reading the configuration table elements necessary for the identification and complete processing of the frame, such as the destination IP/UDP (Internet Protocol / User Datagram Protocol) addresses used to identify the destination port, the type and storage address of the port in memory, and network monitoring information. The configuration table residing in the local memory of the I/O server, whose size is of the order of a few kilobytes, is used for each Ethernet frame received. The management of redundancy and integrity advantageously uses resources also stored in the local memory of the I/O server. If the search for the ports requires a table that is too large to be stored in local memory, the elements of this table necessary for processing the VLs identified by the resident configuration table are loaded during the memory read slices of the I/O server allowed for these VLs, and only the pending packets corresponding to these VLs are processed. If the capacity of the local memory of the I/O server allows it, it is preferable, for reasons of simplicity and reduced latency, to leave these tables resident in the I/O server.
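The association between received frames and scheduled memory access windows can be sketched as follows. The table content and the VLID values are hypothetical, only the deposit window is modeled (not the additional configuration read windows), and setting aside frames with an unknown VLID is an assumption of the sketch rather than a behavior stated in the description.

```python
from collections import defaultdict

# Hypothetical resident table: VLID -> slice index of the window in which
# the application part of the frame may be deposited in memory.
VL_TABLE = {
    0x0101: {"deposit_window": 4},
    0x0202: {"deposit_window": 6},
}


def classify_frames(frames, vl_table):
    """Sort received (vlid, payload) frames into per-window queues.

    Returns the queues, keyed by deposit window, and the list of frames
    whose VLID is not present in the resident table.
    """
    queues = defaultdict(list)
    rejected = []
    for vlid, payload in frames:
        entry = vl_table.get(vlid)
        if entry is None:
            # Assumption: frames on unknown virtual links are set aside.
            rejected.append((vlid, payload))
            continue
        queues[entry["deposit_window"]].append((vlid, payload))
    return dict(queues), rejected
```

Each queue can then be emptied when its window opens, which keeps the memory accesses of the I/O server static even though frame arrival is not.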
The transmission activities of the I/O server are scheduled by the placement tool used for placing application processing in the slices and for placing slices on the cores. During transmission, the configuration associated with a VL is loaded into the local memory at the scheduled cycle, together with the state of the ports associated with it. If the transmission conditions are met, the transmission is triggered in the cycle at a time defined by the configuration. Similarly, if the local memory of the I/O server permits, it is preferable to leave there the configuration tables necessary for the transmissions.

FIG. 5 schematically illustrates a CPM whose architecture is based on a multi-core processor such as that presented in FIG. 3b, adapted to implement the invention, in which the AFDX functions are managed in software in the multi-core processor.

As illustrated, the CPM 500 includes the multi-core processor 505 having here, in particular, eight cores and two memory controllers. These memory controllers are used as an interface between the cores and the memories 510-1 and 510-2. The CPM 500 further comprises a memory 515, for example a flash memory, for storing, for example, some of the applications to be executed by the cores of the processor 505. The CPM 500 further comprises a network interface for receiving and transmitting data, in particular an AFDX interface, as well as the logic necessary for the operation of the CPM. The AFDX function is here performed by the multi-core processor, that is to say, in software.

Naturally, to meet specific needs, a person skilled in the field of the invention may apply modifications to the foregoing description.

Claims

1. A method of loading and executing, with deterministic execution cycles, a plurality of instructions in an avionic system comprising at least one processor having at least two cores (305, 305') and at least one memory controller (355), each of said at least two cores having a private memory (310, 340, 345, 350), said method being characterized in that said plurality of instructions is loaded and executed in execution slots and in that it comprises the following steps:

- during a first execution slot,

o authorizing access to said at least one memory controller for a first of said at least two cores, said first core transmitting (215) to said at least one memory controller at least one datum stored in its private memory, previously modified, and receiving (220) at least one datum and at least one instruction of said plurality of instructions, said at least one datum and said at least one instruction received being stored in its private memory;

o prohibiting access to said at least one memory controller for a second of said at least two cores, said second core executing (210) at least one instruction previously stored in its private memory;

- during a second execution slot,

o prohibiting access to said at least one memory controller for said first core, said first core executing (235) at least one instruction previously stored in its private memory; and,

o authorizing access to said at least one memory controller for said second core, said second core transmitting (225) to said at least one memory controller at least one datum stored in its private memory, previously modified, and receiving (230) at least one datum and at least one instruction of said plurality of instructions, said at least one datum and said at least one instruction received being stored in its private memory.

2. The method of claim 1 wherein said at least one processor further comprises at least one second memory controller, the method further comprising the steps of:

3. The method of claim 1 or claim 2 wherein at least one of said at least two cores is dedicated to data transmission and reception operations to and from a network communication interface.

4. A method of processing a plurality of instructions to allow said plurality of instructions to be loaded and executed with deterministic execution cycles according to any one of the preceding claims, the method of processing comprising a step of cutting said plurality of instructions into execution slots, each execution slot comprising a transfer sequence and an execution sequence, said transfer sequence allowing the transmission of at least one previously stored datum and the reception and storage of at least one datum and at least one instruction, said at least one received datum being necessary for the execution of said at least one received instruction and allowing said at least one received instruction to be executed autonomously during said execution sequence.

5. Method according to the preceding claim wherein said step of cutting is based on the resolution of a system of linear equations representing execution constraints of the instructions of said plurality of instructions according to at least one characteristic of a processor suitable for executing said execution slots.

6. The method of claim 4 or claim 5 wherein the duration of said execution slots is constant and predetermined.

7. Method according to the preceding claim wherein said duration is determined by the time for transmitting previously modified data and the time for receiving data and instructions to be executed.

8. Computer program comprising instructions adapted to the implementation of each of the steps of the method according to any one of the preceding claims when said program is executed by a processor.

9. Device comprising means adapted to the implementation of each of the steps of the method according to any one of claims 1 to 7.

10. Aircraft comprising the device according to the preceding claim.