EP3519954A1

EP3519954A1 - Method for managing computation tasks on a functionally asymmetric multi-core processor

Info

Publication number: EP3519954A1
Application number: EP17761281.9A
Authority: EP
Inventors: Karim Ben Chehida; Paul-Antoine ARRAS
Original assignee: Commissariat a lEnergie Atomique CEA; Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Current assignee: Commissariat a lEnergie Atomique et aux Energies Alternatives CEA
Priority date: 2016-09-29
Filing date: 2017-09-06
Publication date: 2019-08-07
Also published as: FR3056786A1; WO2018059897A1; FR3056786B1; US20190250919A1; US10846086B2

Abstract

Method of managing a computation task on a functionally asymmetric multi-core processor comprising a plurality of cores (PC1 – PC4) at least one of which comprises at least one hardware extension (HW1, HW2) allowing the execution of specialized instructions, comprising the following steps: a) starting the execution of the computation task on a core of the processor; b) performing a tracking of a parameter (QoS) indicative of a quality of service of the computation task, as well as of a number of specialized instructions loaded by said core; c) identifying instants splitting an applicative period of the computation task into a predetermined number of portions (PTS); d) computing costs or gains in quality of service and in energy consumption corresponding to various management options in respect of the computation task; and e) performing a choice of management as a function of the costs or gains thus computed. Computer program product, processor and computing system for the implementation of such a method.

Description

METHOD FOR MANAGING CALCULATION TASKS ON A FUNCTIONALLY ASYMMETRIC MULTI-CORE PROCESSOR

The invention relates to a computer-implemented method for managing computation tasks on a functionally asymmetric multi-core processor. It also relates to a computer program product, a multi-core processor and a computer system for implementing such a method.

A multi-core processor may include one or more hardware extensions for accelerating specific software code portions. For example, these hardware extensions may include circuits for floating point computation or vector computation.

A multi-core processor is said to be "functionally asymmetric" when not all cores have the same hardware extensions, and therefore some extensions are missing from some processor cores. Thus, a functionally asymmetric processor is characterized by an unequal distribution (or association) of the extensions among the processor cores. There is a set of instructions common to all cores, and specific instruction sets associated with the respective hardware extensions, present only in some cores. The union of the instruction sets of all the processor cores contains every instruction required to execute an application calculation task.

The management of a functionally asymmetrical multi-core processor poses several technical problems, and notably that of efficiently managing the placement of calculation tasks on the various processor cores.

Software applications use these hardware extensions in a dynamic way, that is to say, in a way that varies over time. For the same application, some calculation phases will use a given extension at almost full load (for example, calculations on floating-point data), while other calculation phases will use it little or not at all (for example, calculations on integer data). Using an extension is not always effective in terms of performance or energy ("quality" of use).

Published work concerning the placement of computational tasks (scheduling) on functionally asymmetric multi-core processors does not describe fully satisfactory solutions.

The article by H. Shen and F. Pétrot, "Novel Task Migration Framework on Configurable Heterogeneous MPSoC Platforms," Proceedings of the 2009 Asia and South Pacific Design Automation Conference, Piscataway, NJ, USA, 2009, pp. 733-738, describes an offline "affinity" scheduling technique, which consists of fixing the allocation of a task to a processor (or a type of processor) before execution, following an offline analysis of the application (performed manually, by code analysis tools or by the compiler); the online scheduler then follows these directives exactly. The main disadvantage of this approach is that no further optimization is possible online when the applications are dynamic and their resource utilization ratio varies with time and data.

The article by G. Georgakoudis, D. S. Nikolopoulos, H. Vandierendonck, and S. Lalis, "Fast Dynamic Binary Rewriting for Flexible Thread Migration on Shared-ISA Heterogeneous MPSoCs," 2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014, pp. 156-163, describes a technique, known as dynamic binary rewriting, that rewrites the binary on a fault (that is, when a specialized instruction is loaded by a core that does not support it) by emulating the unsupported instructions. It is a flexible and powerful technique that can be combined with a smart dynamic scheduler. However, the costs of rewriting and emulation can be very high.

The article by T. Li, P. Brett, R. Knauerhase, D. Koufaty, D. Reddy and S. Hahn, "Operating System Support for Overlapping-ISA Heterogeneous Multi-core Architectures", 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), 2010, pp. 1-12, describes a so-called fault migration technique. It consists in migrating the execution of a task (at the granularity of a quantum of instructions), as soon as a specialized instruction not supported by the current execution unit is encountered, to a resource having the corresponding hardware extension. Its main weakness is that it can cause untimely migrations and a load imbalance between basic and extended cores.

A. Aminot's thesis "Dynamic method to improve the placement of tasks on asymmetric processors in functionalities", University of Grenoble, France, 2015, describes a method of dynamic allocation of computing tasks in which the choice of the core on which a task is executed is based on an estimate, made at compile time, of the emulation cost of said task. This estimate is obtained by measuring the number of uses of the different specialized instructions during previous executions of the same task. Although interesting in principle, this technique is not optimized and also requires embedding one version of the task binary per extension present in the architecture; moreover, it does not make it possible to guarantee compliance with quality of service (QoS) constraints.

The invention aims to overcome, in whole or at least in part, the aforementioned drawbacks of the prior art. More specifically, it aims to allow both optimal use of computing resources (especially in terms of energy) and compliance with QoS constraints. To do this, it refines the approach proposed in A. Aminot's thesis.

In general, any parameter representing the performance of the system, considered from the point of view of the user, can be used as a QoS constraint according to various embodiments of the invention. By way of nonlimiting example, the QoS of a calculation task can be defined as the execution speed (that is to say the inverse of the execution time) of said task. To achieve this goal, the invention provides several improvements to the prior art, which can be implemented separately or in combination. In particular :

According to the invention, the management choices are made at key moments of the execution of each application, dynamically identified instead of being predefined, allowing the guarantee of QoS and minimizing the energy consumed.

In some embodiments of the invention, specific hardware means (counters ...) are provided to facilitate the determination of these key moments.

Management choices are made taking into account an "opportunity cost" in QoS and energy of each execution option. This makes it possible to exactly quantify the difference in terms of performance with respect to a minimum QoS setpoint that we want to guarantee. A new method of calculating this opportunity cost is also proposed.

The invention also proposes, in a first characterization step, a method for selecting different classes of instructions according to their emulation costs in terms of performance and energy, making it possible to minimize the error in estimating the opportunity cost of each management opportunity (or "option").

An object of the invention is a computer-implemented method for managing a computation task on a functionally asymmetric multi-core processor, the execution of said task comprising a succession of application periods, said processor comprising a plurality of cores sharing so-called basic instructions, at least one said core comprising at least one hardware extension, said or each hardware extension being adapted to allow the execution of so-called specialized instructions, different from said basic instructions, each specialized instruction being thus associated with a said hardware extension, the method comprising the following steps:

a) starting execution of the computation task on a core of said processor;

b) performing, during said execution, a tracking of a parameter indicative of a quality of service of the calculation task, as well as of at least a number of specialized instructions loaded by said core;

c) on the basis of said tracking, identifying times that split an application period of the computing task into a predetermined number of portions such that, during each of said portions, a substantially equal number of specialized instructions associated with a predefined hardware extension are loaded by said core;

d) calculating, at said times and according to said tracking, costs or gains in quality of service and in energy consumption corresponding to different management options of the calculation task; and

e) making a management decision including a decision to continue execution on the same processor core or on a different core depending on the costs or gains thus calculated.

Another object of the invention is a computer program product stored on a non-volatile computer readable medium comprising computer executable instructions for carrying out such a method.

Another object of the invention is a functionally asymmetric multi-core processor comprising a plurality of cores sharing so-called basic instructions, at least one said core comprising at least one hardware extension, said or each hardware extension being adapted to allow the execution of so-called specialized instructions, different from said basic instructions, each specialized instruction being thus associated with a said hardware extension, characterized in that it also comprises:

filter circuits, configured to sort the basic instructions and specialized instructions associated with the different hardware extensions, and to assign each specialized instruction to a family; and

for each core: a counter of the basic instructions loaded by the core;

for each hardware extension not included in said core, a counter of the specialized instructions associated with said hardware extension loaded by the core, and a counter of the number of basic instructions used for emulating the associated specialized instructions; and

for each hardware extension included in said core, and for each family of specialized instructions associated with said hardware extension, a counter of the specialized instructions associated with said hardware extension and belonging to said family loaded by the core.

Yet another object of the invention is a computer system comprising such a functionally asymmetrical multi-core processor and a non-volatile memory storing instructions executable by said processor for the implementation of a method according to the invention.

Other characteristics, details and advantages of the invention will emerge on reading the description given with reference to the accompanying drawings given by way of example and which represent, respectively:

FIG. 1, an example of a functionally asymmetric multi-core processor whose architecture is known from the prior art;

FIG. 2, a histogram of the cost of emulation of a floating-point calculation instruction;

FIG. 3, the functionally asymmetrical multi-core processor of FIG. 1, equipped with instruction hardware counters in accordance with one embodiment of the invention;

FIG. 4, a graph illustrating the division of the execution of an application into application periods, the monitoring of the quality of service for each period and the number of specialized instructions loaded for each hardware extension;

FIG. 5, a graph illustrating the division of the application periods into portions, in accordance with one embodiment of the invention; and

FIG. 6, a graph illustrating the management choices made according to one embodiment of the invention.

In the following, we mean:

"Hardware Expansion", or simply "Hardware Extension", a circuit such as a Floating Point Unit (FPU), a vector calculation unit , a cryptographic processing unit, a signal processing unit, etc. A hardware extension introduces a dedicated hardware circuit accessible or connected to a processor core, which circuit provides high performance for specific computing tasks. Such a specific circuit improves the performance and energy efficiency of a core for particular calculations, but their intensive use can lead to reduced performance in terms of Watt per unit area. These hardware extensions associated with the core processor are provided with an instruction set that extends the standard or default (ISA) set. Hardware extensions are usually well integrated into the core's "pipeline", which provides efficient access to functions through instructions added to the "basic" set.

The function of a hardware extension is to speed up the processing of a specific set of instructions: a hardware extension may not speed up the processing of another type of instruction (eg floating point versus integer).

"Extended core", a processor core comprising a "basic" core, supporting a minimum set of instructions common to all processor cores, plus one or more hardware extensions.

"Task" ("thread"), the execution of a set of instructions of the machine language of a processor. A process usually involves several calculation tasks which, from the point of view of the user, seem to proceed in parallel. The different tasks in the same process share the same virtual memory, but each has its own call stack.

As discussed above, in a functionally asymmetric multi-core architecture a processor core may be associated with zero, one or more hardware extensions. These hardware extensions are then "exclusive", that is to say, a given hardware extension cannot be accessed from another core. The processor core includes the hardware extension(s).

A processor core (or simply "core") is a set of physical circuits capable of executing programs "autonomously". A hardware extension is able to execute a part of the program but is not "autonomous": it must be associated with at least one processor core.

FIG. 1 schematically illustrates the architecture of a functionally asymmetric multi-core processor PR integrating four cores PC1, PC2, PC3, PC4 which share a common base called "basic core" BASE and which may each have one or more hardware extensions HW1, HW2, giving them more or less extensive functionalities. More precisely, the PC1 core is "complete", that is to say that it comprises the basic core and the two available hardware extensions; the PC2 and PC3 cores comprise, in addition to the basic core, only the extension HW1 or HW2, respectively; and the PC4 core is "basic", not including any hardware extension.

The reference MM designates a read-only memory storing instructions executable by the processor PR and making it possible to implement the method of the invention.

Processor cores other than PC1 do not include all available hardware extensions. Thus, when a computation task is executed by such a core, the core may encounter specialized instructions associated with hardware extensions that it does not have. It is therefore necessary to make a "management choice": continue execution on the current processor core by emulating the unsupported specialized instructions (that is, converting them into sequences of basic instructions), or migrate the task to a core having the required hardware extensions; in both cases, it is also possible to act on the voltage-frequency pair. Each of these options has a cost in terms of energy consumed and quality of service (QoS), the latter parameter being representable by the inverse of the execution time of the calculation task.

The invention makes it possible to optimize the management choices by accurately predicting the quality of service (QoS) and the dissipated energy associated with each possible management opportunity, thus facilitating the taking into account of several criteria.

This prediction requires a preliminary calibration step, implemented "offline", consisting in characterizing the time and energy costs of executing the different specialized instructions on the basic cores (by emulating these instructions) and on the extended cores (by executing them normally on these cores). According to a preferred embodiment of the invention, this step comprises the determination (generally empirical) of calibration parameters representative of the statistical distribution of the time and energy emulation costs of the specialized instructions. For example, these parameters can be the maximum cost, the minimum cost, the average cost and the standard deviation of the cost distribution, measured over several executions of these instructions, with different sequences and various data sets.
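By way of illustration only, the calibration parameters listed above can be computed from a series of measured emulation costs as in the following minimal Python sketch. The function name `calibration_parameters`, the use of the population standard deviation and the sample values are assumptions made for this example; they are not taken from the description.

```python
from statistics import mean, pstdev


def calibration_parameters(costs):
    """Summarize measured emulation costs (in equivalent basic instructions)."""
    if not costs:
        raise ValueError("at least one measurement is required")
    return {
        "min": min(costs),
        "max": max(costs),
        "mean": mean(costs),
        "std": pstdev(costs),  # population standard deviation (assumption)
    }


# Hypothetical measurements for one specialized instruction.
samples = [300, 340, 360, 380, 420]
params = calibration_parameters(samples)
```

The same summary would be produced separately for time and for energy, and per instruction or per family of instructions.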

Advantageously, these parameters are not determined for each instruction considered in isolation, but for "families" comprising specialized instructions associated with the same hardware extension and having comparable execution costs. This has a twofold advantage: firstly, a simplification of the task cost prediction operations (performed "online"); secondly, provided the decomposition into families is carried out judiciously, a minimization of the estimation error made during these prediction operations.

For example, on the RISC-V core, the emulation cost of the floating-point specialized instructions, expressed as a number of equivalent basic instructions, varies from 150 to 470 basic instructions. FIG. 2 shows the cost of emulating the floating-point square root instruction "fsqrt" in equivalent basic instructions (y-axis). It has a constant part (thick-line rectangle in FIG. 2) of more than 120 equivalent instructions (for saving and restoring the context and calling the emulation routine) and a part that depends on the instruction to emulate (thin line). The average cost of emulating this instruction is 360 basic instructions with a large standard deviation, represented by an error bar.

The solution of grouping the floating-point specialized instructions into a single family and considering an average cost of 282 basic instructions, corresponding to the offline characterization of these instructions, gives an estimation error of the order of 25-50% on the test sets considered, which is not satisfactory.

According to one embodiment of the invention, it is proposed to carry out a decomposition by exploring the number of families and the width of the cost interval of each family so as to minimize the error in estimating the emulation cost (in time and energy) of the specialized instructions considered. This exploration is based on test sets including unit tests per instruction and tests of real applications taken from benchmark suites commonly used by the scientific community. It can be conducted experimentally or empirically, or by means of a heuristic or any other operations research algorithm. For example, it is advantageous to group together the instructions having a low standard deviation or a high frequency of occurrence, to avoid degrading the final estimation error. Increasing the number of families reduces the width of the cost intervals and thus the error in estimating these costs. But if the number of families is increased too far, the opposite phenomenon is observed: the estimation error rises because of the standard deviation of the emulation cost of certain instructions. A compromise must therefore be found, all the more so because, as will be explained in detail below, the monitoring requires specific counters for each family of instructions.

A study was conducted based on the RISC-V processor core. By imposing a maximum tolerable estimation error of 5%, it was possible to delimit two families of floating-point instructions with an average error of 3% and a maximum error of 4%.
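The effect of the decomposition into families on the estimation error can be illustrated by the following minimal Python sketch. It is a simplified proxy, not the exploration actually carried out: the function name `family_errors`, the instruction labels and their costs are hypothetical, and the error is taken here as the worst relative gap between the average cost of a family and the cost of one of its members.

```python
def family_errors(costs, boundaries):
    """Group instructions into cost-interval families.

    Returns, for each non-empty family, a pair
    (average cost, worst relative error made by using that average).
    """
    families = [[] for _ in range(len(boundaries) + 1)]
    for cost in costs.values():
        # Index of the cost interval that contains this cost.
        index = sum(cost >= b for b in boundaries)
        families[index].append(cost)
    result = []
    for members in families:
        if not members:
            continue
        avg = sum(members) / len(members)
        worst = max(abs(c - avg) / c for c in members)
        result.append((avg, worst))
    return result


# Hypothetical average emulation costs (equivalent basic instructions).
costs = {"op_a": 150, "op_b": 160, "op_c": 170, "op_d": 450, "op_e": 470}
one_family = family_errors(costs, [])        # a single family
two_families = family_errors(costs, [300])   # split at a cost of 300
```

With these hypothetical values, splitting the single family into two narrows each cost interval and lowers the worst relative error, which is the trend described above.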

For each extension (HW_m), once the number of families and their average costs in terms of number of equivalent basic instructions (Cost_moy_Inst(m, i)) and of energy (Cost_moy_E(m, i)) have been determined, these offline characteristics are embedded and used online to estimate the costs or gains in performance and energy of the different management opportunities.

To obtain this estimate, it is necessary to implement - online - several steps:

1. Follow-up of QoS and the execution of specialized instructions;

2. A breakdown of each application period of the calculation task considered in "portions", performed according to this monitoring;

3. The calculation of costs or gains in quality of service and in energy consumption corresponding to different management opportunities, or options. This calculation is based on the breakdown of the application periods and on the data obtained by the monitoring.

1. Follow-up of the execution of the specialized instructions and the application QoS

1.1 Follow-up of the execution of specialized instructions

In order to better predict the future execution of specialized instructions, a follow-up (monitoring) of the past execution of these instructions and their classification according to their respective families is necessary. This monitoring consists mainly of counting the number of specialized instructions associated with each extension. More precisely, each core must count:

- the number of specialized instructions associated with each hardware extension that the core does not include (these instructions cause an exception that calls an emulation routine);

- for specialized instructions associated with hardware extensions that the core includes, the number of specialized instructions associated with each family (these instructions are executed directly); and

- the number of basic instructions.

According to a preferred embodiment of the invention, the monitoring is performed by means of digital circuits that filter part of the binary encoding of each instruction (the "opcode", short for "operation code"), used to sort the basic instructions and the specialized instructions associated with the different hardware extensions and, where applicable, to assign each specialized instruction to a family, together with hardware counters similar to the counters commonly present in embedded processors (cycle counters, floating-point instruction counters, cache miss counters, etc.), which count the occurrences of the instructions of each family. These counters can be read, and reset on command, at specific times of the method. Counting the instructions loaded by the core is preferred because instructions are always loaded whatever the type of core (they are then executed if the core supports them, or cause an exception and call an emulation routine otherwise).

In Figure 3:

The PC1 core includes:

a set of counters Nb (HW1, i) counting the number of loaded instructions belonging to each family "i" of the hardware extension "HW1";

a set of counters Nb (HW2, i) counting the number of loaded instructions belonging to each family "i" of the hardware extension "HW2"; and

an Nb_basic counter counting the number of basic instructions loaded.

The PC2 core includes:

a set of counters Nb (HW1, i) counting the number of loaded instructions belonging to each family "i" of the hardware extension "HW1";

a counter Nb (HW2) counting the number of loaded instructions associated with the hardware extension "HW2", without distinguishing the different families of instructions;

an Nb_basic counter counting the number of basic instructions loaded; and

a counter Nb_emul (HW2) counting the number of basic instructions used to emulate the specialized instructions associated with the hardware extension "HW2".

The PC3 core includes:

a counter Nb (HW1) counting the number of loaded instructions associated with the hardware extension "HW1", without distinguishing the different families of instructions;

a set of counters Nb (HW2, i) counting the number of loaded instructions belonging to each family "i" of the hardware extension "HW2";

an Nb_basic counter counting the number of basic instructions loaded; and

a counter Nb_emul (HW1) counting the number of basic instructions used to emulate the specialized instructions associated with the hardware extension "HW1".

The PC4 core includes:

an Nb_basic counter counting the number of basic instructions loaded;

a counter Nb_emul (HW1) counting the number of basic instructions used to emulate the specialized instructions associated with the hardware extension "HW1"; and

a counter Nb_emul (HW2) counting the number of basic instructions used to emulate the specialized instructions associated with the hardware extension "HW2".

The instruction filtering circuits are not shown so as not to overload the figure.
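The allocation of counters to cores can be summarized by the rule stated above for the processor: per-family counters for each extension present in the core, and a global counter plus an emulation counter for each extension absent from it. The following minimal Python sketch applies that rule to the four cores of FIG. 1. The function name `counters_for_core`, the string form of the counter names and the number of families per extension (two) are assumptions made for this example.

```python
def counters_for_core(core_extensions, all_extensions, families):
    """Return the names of the hardware counters a core would carry."""
    names = ["Nb_basic"]
    for ext in all_extensions:
        if ext in core_extensions:
            # Extension present in the core: one counter per family.
            names += [f"Nb({ext},{i})" for i in range(1, families[ext] + 1)]
        else:
            # Extension absent: one global counter and one emulation counter.
            names += [f"Nb({ext})", f"Nb_emul({ext})"]
    return names


ALL_EXTENSIONS = ["HW1", "HW2"]
FAMILIES = {"HW1": 2, "HW2": 2}  # hypothetical number of families
CORES = {"PC1": {"HW1", "HW2"}, "PC2": {"HW1"}, "PC3": {"HW2"}, "PC4": set()}

layout = {
    core: counters_for_core(exts, ALL_EXTENSIONS, FAMILIES)
    for core, exts in CORES.items()
}
```

Under this rule the complete core PC1 carries no emulation counter, whereas the basic core PC4 carries one per extension.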

Advantageously, the digital filtering circuits filter the instructions loaded by the core at instruction decoding time (opcode filtering) to identify whether the current instruction is of type "m" (that is to say, is associated with the hardware extension "m") and, if so, whether it belongs to family "i" of this extension. One possible optimization of this embodiment is to adjust the family decomposition made in the first step of the method so that the intervals correspond to similar classes of instructions (memory access, calculation, control, etc.) that share the same opcode, thus facilitating the filtering of the instructions at decoding time. The estimation error must then be recalculated with the new decomposition to check that it remains below the tolerable error.
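Opcode filtering of this kind can be modelled in software as in the following minimal Python sketch, which classifies 32-bit RISC-V instruction words by their major opcode (bits 6 to 0). The opcode values are those of the RISC-V floating-point instructions; the mapping of these opcodes to two families named "memory" and "compute", the extension label "FPU" and the function names are assumptions made for this example, and a hardware filter would of course be a circuit rather than a program.

```python
# RISC-V major opcodes (bits 6..0) of the floating-point instructions.
FP_MEMORY_OPCODES = {0b0000111, 0b0100111}   # LOAD-FP, STORE-FP
FP_COMPUTE_OPCODES = {
    0b1010011,                                # OP-FP
    0b1000011, 0b1000111,                     # FMADD, FMSUB
    0b1001011, 0b1001111,                     # FNMSUB, FNMADD
}


def classify(instruction_word):
    """Return (extension, family) for a 32-bit instruction word."""
    opcode = instruction_word & 0x7F
    if opcode in FP_MEMORY_OPCODES:
        return ("FPU", "memory")      # hypothetical family
    if opcode in FP_COMPUTE_OPCODES:
        return ("FPU", "compute")     # hypothetical family
    return ("basic", None)


def count_loaded(words):
    """Increment one counter per loaded instruction, as the filter would."""
    counters = {"Nb_basic": 0, "Nb(FPU,memory)": 0, "Nb(FPU,compute)": 0}
    for word in words:
        ext, family = classify(word)
        key = "Nb_basic" if ext == "basic" else f"Nb({ext},{family})"
        counters[key] += 1
    return counters


# add x0,x0,x0 ; flw f0,0(x0) ; fadd.s f0,f0,f0 ; addi x0,x0,0
program = [0x00000033, 0x00002007, 0x00000053, 0x00000013]
counts = count_loaded(program)
```

Because the classification depends only on the seven low-order bits, families aligned with opcodes can be separated with a very small comparator, which is the motivation for the optimization described above.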

For the counting of basic instructions (Nb_basic), it may be advantageous to filter the specialized instructions by opcode at decoding time and to disable the counter Nb_basic during the execution of these instructions. The counter Nb_basic is also disabled on entry into an exception and re-enabled on exit.

In principle, these counters could be implemented in software, but this would be costly in time and energy. An at least partially hardware implementation is therefore preferred.

The tracking of the loaded instructions, performed by means of these counters, makes it possible to calculate the utilization ratio of the set of specialized instructions associated with each hardware extension "m" (regardless of whether these instructions are actually executed by the appropriate hardware extension or emulated by a basic core):

$$T_{(m)} = \frac{Nb_{(m)}}{Nb\_basic} \qquad [1]$$

In the case where the count is done in families, Nb (m) is obtained by summing Nb (m, i) for all the values of "i".
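As a minimal Python sketch, and assuming that the utilization ratio of an extension is the number of loaded specialized instructions Nb(m) divided by the number of loaded basic instructions Nb_basic, the ratio can be obtained from the family counters as follows. The function name `utilization_ratio` and the counter values are hypothetical.

```python
def utilization_ratio(nb_basic, nb_per_family):
    """T(m) = Nb(m) / Nb_basic, where Nb(m) is the sum of the Nb(m, i)."""
    if nb_basic <= 0:
        raise ValueError("Nb_basic must be positive")
    return sum(nb_per_family) / nb_basic


# Hypothetical counter values read at the end of a period.
ratio = utilization_ratio(nb_basic=10000, nb_per_family=[300, 200])
```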

1.2 Application QoS Monitoring

This monitoring can be based on a minimally intrusive technique for instrumenting the application code, such as the "Heartbeat" technique disclosed in the publication:

H. Hoffmann, J. Eastep, D. Santambrogio, J. E. Miller, A. Agarwal, "Application Heartbeats: A Generic Interface for Specifying Program Performance and Goals in Autonomous Computing Environments," Proceedings of the 7th International Conference on Autonomic Computing, New York, NY, USA, 2010, pp. 79-88.

It allows the system to retrieve application events (called "application pulsations", or simply "pulsations" in the following) in order to calculate a quality of service and then verify that it is above a minimum QoS setpoint given by the user. These application events also serve to trigger management actions that are more or less aggressive depending on the margin on the QoS just calculated. This is illustrated in FIG. 4, which shows a timing diagram of the number of loaded specialized instructions associated with the two hardware extensions considered, HW1 and HW2; "T" designates the execution time. The pulsations PS are indicated by vertical lines, and divide the execution time into "application periods" (or simply "periods") PRD. The quality of service QoS is measured at each pulsation for the period just elapsed, and a margin MRG is calculated with respect to a minimum quality level QoS_Min. A maximum quality level QoS_Max is also indicated; indeed, it is generally not desirable to provide an unnecessarily high level of quality.
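The QoS measurement performed at each pulsation can be sketched in Python as follows, taking the QoS of a period as the inverse of its duration (the nonlimiting definition given earlier) and the margin as the difference between that QoS and QoS_Min. The function name `qos_margins`, the definition of the margin as a simple difference and the timestamps are assumptions made for this example.

```python
def qos_margins(pulsation_times, qos_min):
    """Return (QoS, margin) for each period between consecutive pulsations.

    The QoS of a period is taken as the inverse of its duration, and the
    margin as QoS - qos_min (a negative margin means the setpoint is missed).
    """
    results = []
    for start, end in zip(pulsation_times, pulsation_times[1:]):
        qos = 1.0 / (end - start)
        results.append((qos, qos - qos_min))
    return results


# Hypothetical pulsation timestamps, in seconds.
pulsations = [0.0, 0.5, 1.0, 2.0]
margins = qos_margins(pulsations, qos_min=1.5)
```

With these values the first two periods meet the setpoint with a positive margin, while the third period, twice as long, falls below it.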

2. Division of each application period of the calculation task considered into "portions"

The prediction of future executions of specialized instructions is generally based on an understanding of the past use of the extension. In the literature, the learning period is often constant, linked to events of the scheduler. To increase and refine the management opportunities within a period, the invention proposes to track and predict the use of specialized instructions on portions of the "application period" and thus adapt to profiles of use of specialized instructions changing from one period to another. A "portion" is the minimum time interval for each management decision of the proposed method.

According to the invention, each period is subdivided into N "portions" of non-constant duration, as a function of the number of loaded specialized instructions associated with the hardware extension "HW_m" whose emulation cost over this period is the most penalizing in terms of performance. N is a number that can be fixed arbitrarily from the beginning of the execution or after a calibration phase at the beginning of the execution.

To determine which hardware extension "m" is the most penalizing in terms of performance, the relative emulation cost Relative_cost(m, n) is calculated at the start of the application, following a calibration phase (one or more pulsations), at the end of the current period "n" and for each hardware extension "m":

$$\text{Relative\_cost}_{(m,n)} = \frac{Nb_{(m,n)} \times \text{Cost\_moy\_Inst}_{(m)}}{Nb\_basic} = T_{(m,n)} \times \text{Cost\_moy\_Inst}_{(m)} \qquad [2]$$

where T(m, n) represents the utilization ratio of the extension m over the period n (see equation [1]). The extension m chosen is the one that gives the maximum cost:

$$\text{Relative\_cost}_{(m,n)} = \max_{m}\left\{\text{Relative\_cost}_{(m,n)}\right\}$$
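The selection of the most penalizing extension can be sketched in Python as follows, assuming that the relative cost of an extension is its utilization ratio over the period multiplied by its average emulation cost in equivalent basic instructions. The function name `most_penalizing_extension`, the counter values and the cost assigned to HW2 are hypothetical; the cost of 282 for HW1 reuses the single-family floating-point average mentioned earlier.

```python
def most_penalizing_extension(nb, nb_basic, avg_cost):
    """Return the extension with the highest relative emulation cost.

    nb[m]       : specialized instructions of extension m loaded in the period
    nb_basic    : basic instructions loaded in the period
    avg_cost[m] : average emulation cost, in equivalent basic instructions
    """
    relative = {m: (nb[m] / nb_basic) * avg_cost[m] for m in nb}
    worst = max(relative, key=relative.get)
    return worst, relative


worst, relative = most_penalizing_extension(
    nb={"HW1": 500, "HW2": 100},
    nb_basic=10000,
    avg_cost={"HW1": 282, "HW2": 1000},  # HW2 value is hypothetical
)
```

In this example HW1 is selected: its instructions are cheaper to emulate than those of HW2, but they are loaded five times as often.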

The determination of the most penalizing hardware extension can be done offline for each application to be run on the platform. It can also be calculated online and updated each time a new application is run (thus with only one calibration phase at the beginning of the execution of each application) or within the same application (in the case of very dynamic applications) after every "P" application pulsations (the calibration phase is repeated after every P pulsations, P being a constant number predefined before execution).

The division of an application period into portions uses the knowledge, obtained by the follow-up described above, of the global number (all families combined) of specialized instructions associated with the hardware extension "HW_m" loaded by the end of the current period "n": Nb(m, n).

This knowledge makes it possible to predict the global number of specialized instructions associated with the hardware extension "HW_m" for the next application period (Nb_Pred(m, n+1)). This prediction can, for example, be obtained by calculating an exponential moving average (EMA), with a smoothing factor "α" (α < 1) that can be adapted to the dynamicity of the application. We then have:

$$\mathrm{Nb\_Pred}_{(m,n+1)} = \alpha \cdot \mathrm{Nb}_{(m,n)} + (1-\alpha) \cdot \mathrm{Nb\_Pred}_{(m,n)} \qquad [3]$$

For this hardware extension m, a portion of the current period ends when the predicted number of specialized instructions of the hardware extension "HW_m" over the current period (Nb_Pred(m, n)), divided by N, has been loaded.
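As an illustration, the prediction of equation [3] and the resulting portion boundaries can be sketched as follows. The function names are hypothetical, and the lower bound imposed on the smoothing factor (strictly positive) is an assumption of this sketch.

```python
def ema_predict(observed, previous_prediction, alpha):
    """Equation [3]: exponential moving average of the instruction count."""
    if not 0.0 < alpha < 1.0:  # lower bound is an assumption of this sketch
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * observed + (1.0 - alpha) * previous_prediction


def portion_boundaries(predicted_total, n_portions):
    """Counter values at which portions 1..N-1 end; the last portion ends
    with the application period itself."""
    per_portion = predicted_total / n_portions
    return [per_portion * k for k in range(1, n_portions)]


# Hypothetical values: 1200 instructions observed, 900 previously predicted.
prediction = ema_predict(1200, 900, alpha=0.5)        # 1050.0
boundaries = portion_boundaries(prediction, 3)        # [350.0, 700.0]
```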

This division of the application periods into portions as a function of the number of loaded specialized instructions makes it possible to estimate the intermediate QoS directly. It also makes it possible to predict the use of the specialized instructions on the portion in question regardless of the allocation made for the same portion of the previous period (portion executed on a basic or extended core).

FIG. 5 shows the advantage of decomposing the periods PRD into N portions PTS (in the example, N = 3) as a function of the number of specialized instructions loaded: the prediction remains independent of the allocation. In the example considered, the second application period PRD2 was executed on a basic core and the third, PRD3, on an extended core (just like the first period PRD1). If portions of fixed duration were used, no simple relationship could be found between the data collected for a portion of the period PRD2 and the prediction for the same portion of the period PRD3. Cutting into portions of fixed duration also requires, on the one hand, more controller interventions and, on the other hand, fairly expensive interpolations to convert the measurements collected into data usable for the prediction. By contrast, according to the invention, the portions comprise a substantially equal number of specialized instructions, which makes it possible to relate portions of the same rank in different periods.

To delimit the portions of the periods, it is necessary to count the total number of specialized instructions loaded for the extension "HW_m". To speed up this counting and make it as unintrusive as possible, it may be advantageous to provide a dedicated hardware counter which is incremented at each load of a specialized instruction of the extension "HW_m" and which triggers an interrupt on the core in question to call a monitoring and system management routine that must be triggered at each portion. In this embodiment, the portions preferably comprise exactly the same number of instructions dedicated to the "HW_m" extension (except the last portion, which may be incomplete). In degraded variants, however, a margin of tolerance, typically less than 10% or even 5%, may be allowed.

In another embodiment, it is possible to use an alarm-type counter with a fixed time quantum which raises an interrupt on the current core. The routine called by this interrupt consults the hardware counter of the number of specialized instructions of the extension "HW_m" and calls the management routine when the number of specialized instructions required per portion is reached. In this embodiment, the number of specialized instructions loaded for the extension "HW_m" in each portion is approximately the same, with a margin of error that is all the smaller as the time quantum is small. Preferably, this quantum will be chosen such that the margin of error is less than 10%, or even 5%.
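The timer-based variant can be simulated with the following sketch, in which a list of counter samples stands in for the successive readings of the hardware counter at each time quantum. The function name and the sample values are hypothetical; the use of absolute thresholds (multiples of the per-portion count), so that the overshoot does not accumulate from portion to portion, is a design choice of the sketch and not a feature stated in the description.

```python
def portion_ticks(counter_samples, per_portion):
    """Return the indices of the timer ticks at which the management routine
    would be called.

    counter_samples[t] is the cumulative number of specialized instructions
    of the tracked extension read at tick t.
    """
    if per_portion <= 0:
        raise ValueError("per_portion must be positive")
    ticks = []
    next_threshold = per_portion
    for tick, count in enumerate(counter_samples):
        if count >= next_threshold:
            ticks.append(tick)
            # Skip every threshold already passed during this quantum.
            while next_threshold <= count:
                next_threshold += per_portion
    return ticks


# Hypothetical readings, with 100 specialized instructions per portion.
calls = portion_ticks([40, 90, 130, 210, 260, 330], 100)   # [2, 3, 5]
```

The overshoot at each call (30, 10 and 30 instructions here) illustrates the margin of error, which shrinks when the time quantum is reduced.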

In general, the transition from one portion of an application period to the next occurs when the number of specialized instructions reaches or exceeds a predefined threshold value.

3 Prediction of future executions of specialized instructions

The invention provides a tracking, for each portion "k" of the period "n", of:

- the number of specialized instructions executed from each family "i" identified in the first step of the method: Nb(m, i, k, n);

- the total number of instructions executed outside emulation for basic cores and without specialized instructions for extended cores: Nb_basic(k, n);

- the actual execution time of the portion k of the period n: T(k, n); and

- the number of basic instructions related to the emulation of each absent extension "HW_m": Nb_emul(m, k, n).

Except for the execution time, the monitoring is carried out by means of the counters described above with reference to FIG.

The method provides the prediction, for the same portion k of the following application period n + 1, of:

- the number of specialized instructions of each family "i" of the "HW_m" extension:

$$\mathrm{Nb\_Pred}_{(m,i,k,n+1)} = \alpha \cdot \mathrm{Nb}_{(m,i,k,n)} + (1-\alpha) \cdot \mathrm{Nb\_Pred}_{(m,i,k,n)} \qquad [4]$$

- the number of instructions executed outside emulation and outside the specialized instructions of the "HW_m" extension:

$$\mathrm{Nb\_Pred\_basic}_{(k,n+1)} = \alpha \cdot \mathrm{Nb\_basic}_{(k,n)} + (1-\alpha) \cdot \mathrm{Nb\_Pred\_basic}_{(k,n)} \qquad [5]$$

- the number of basic instructions related to the emulation of each absent extension "HW_m":

$$\mathrm{Nb\_Pred\_emul}_{(m,k,n+1)} = \alpha \cdot \mathrm{Nb\_emul}_{(m,k,n)} + (1-\alpha) \cdot \mathrm{Nb\_Pred\_emul}_{(m,k,n)} \qquad [6]$$

These recurrence equations must be initialized. The initialization values can for example be obtained by calibration.
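The recurrences [4] to [6] share the same form and differ only in the counter they follow. A minimal sketch of a per-portion predictor initialized from calibration values is given below; the class name, the key layout (extension, family, portion) and the numbers are hypothetical.

```python
class PortionPredictor:
    """Keeps one exponential-moving-average prediction per tracked counter."""

    def __init__(self, alpha, calibration):
        self.alpha = alpha
        # Initial predictions, assumed here to come from a calibration phase.
        self.predictions = dict(calibration)

    def update(self, key, observed):
        """Apply the recurrence once the portion has been executed and
        return the prediction for the same portion of the next period."""
        previous = self.predictions[key]
        new = self.alpha * observed + (1.0 - self.alpha) * previous
        self.predictions[key] = new
        return new


# Hypothetical key: (extension, family index, portion index).
predictor = PortionPredictor(alpha=0.25, calibration={("HW1", 0, 1): 100.0})
next_count = predictor.update(("HW1", 0, 1), observed=200.0)   # 125.0
```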

Estimation of the costs / gains in performance and energy of the different management opportunities

Once the prediction of the future use of hardware extensions is made, it is possible to estimate the costs / gains in performance and energy of each of the management opportunities that are offered on the next portion of the current period. These opportunities are for example:

Migration from an extended core (with its variants Full, (base + HW1), (base + HW2)) (respectively basic) to a basic (respectively extended) core. This includes saving the current execution context on the extended (respectively basic) core, activating a basic (respectively extended) core if none is active and free at that moment, migrating the context to this new target core, putting the extended (respectively basic) core into a low-power non-functional mode, and continuing the execution on the basic (respectively extended) core.

Changing the voltage-frequency pair (DVFS) of the current core: decision to continue the execution on the current core but with a different voltage-frequency pair.

Any combination of the last two opportunities (migration + DVFS)

The costs / gains in performance and energy are estimated over an entire period. At this stage, the method estimates the QoS and the energy consumed at the end of the period from the decisions evaluated at the portion level.

The application QoS is often reduced to the inverse of an execution time over the application period considered. This time can be estimated by summing the estimates of the execution times on the different portions of this period.

The energy consumed is the sum of the contributions of the cores, estimated according to the allocation made and according to the DVFS pair chosen on each portion, then reduced over the entire period.

Cost of emulation / acceleration:

Compared with the same portion of the previous period, the additional cost in time of emulating the instructions of the hardware extension "m" on the portion "k", predicted for the period n + 1 and for an execution frequency F1, is given by:

$$K_{emul\_time}(m,k,n+1,F1) = \frac{\sum_i \left( \mathrm{Nb\_Pred}_{(m,i,k,n+1)} \times \mathrm{Cost\_moy\_Inst}_{(m,i)} \right) \times CPI}{F1}$$

where Nb_Pred(m, i, k, n+1) is the number of specialized instructions of the family "i" predicted for the portion k of the period n + 1 (see equation [4]). CPI ("cycles per instruction") is the parameter used to quantify the average cost of an instruction in terms of processor cycles. By dividing this number of cycles by the frequency (here F1) of the processor we obtain the average execution time of an instruction.
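A minimal numerical sketch of this emulation time overhead follows. The function name, the per-family dictionaries and the values are hypothetical; the formula used is the sum over families of predicted count times mean emulation cost, multiplied by CPI and divided by the frequency.

```python
def emulation_time_overhead(predicted_counts, mean_emulation_cost, cpi, freq_hz):
    """Extra execution time (seconds) caused by emulating one extension on
    one portion.

    predicted_counts[i]: predicted specialized instructions of family i.
    mean_emulation_cost[i]: mean number of basic instructions needed to
    emulate one instruction of family i.
    """
    emulation_instructions = sum(
        predicted_counts[i] * mean_emulation_cost[i] for i in predicted_counts
    )
    return emulation_instructions * cpi / freq_hz


# Hypothetical portion: two families, CPI of 1.5, core running at 1 GHz.
overhead = emulation_time_overhead(
    {0: 1000, 1: 500}, {0: 10.0, 1: 40.0}, cpi=1.5, freq_hz=1e9
)
# 30 000 emulation instructions x 1.5 cycles / 1e9 Hz, i.e. about 45 microseconds
```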

Compared with the same portion of the previous period, the extra cost (here negative) of an acceleration of the instructions of the hardware extension "m" by migrating to a core with this extension is given by:

Compared with the same portion of the previous period, the additional cost in energy of emulating the instructions of the hardware extension "m" on the portion "k", predicted for the period n + 1 and for an execution frequency F1, is given by the following equation with its two variants: K_emul_energie, the overhead due to emulation, and K_accel_energie, the overhead due to acceleration on a core with the extension "m":

Kemul / energy accel (m, k, n + 1, F1) = ^x

where P1 and P2 respectively represent the average powers of the current core and of the destination core.

P1 represents the power of the core having the extension "m" (respectively devoid of the extension "m") and P2 is that of the core devoid of this extension (respectively having the extension "m") towards which the migration takes place in the case of an emulation (respectively in the case of an acceleration). Nb_total(k, n) is the total number of instructions on the previous period "n" and on the same portion "k".

Cost of change of voltage / frequency (DVFS):

Compared with the same portion of the previous period, the additional cost in time of the change of voltage-frequency pair on the portion "k" with constant allocation, predicted for the period n + 1 and for a change from an execution frequency F1 to F2, is given by the following equation:

$$K_{DVFS\_time}(m,k,n+1,F1 \to F2) = \mathrm{Nb\_total}_{(k,n)} \times CPI \times \frac{F1 - F2}{F1 \times F2} \qquad [10]$$

Compared with the same portion of the previous period, the additional cost in energy of the change of voltage-frequency pair on the portion "k" with constant allocation, predicted for the period n + 1 and for a change from an execution frequency F1 to F2, is given by the following equation:

$$K_{DVFS\_energie}(m,k,n+1,F1 \to F2) = K_{DVFS\_time}(m,k,n+1,F1 \to F2) \times \frac{F1 \times P2 - F2 \times P1}{F1 - F2} \qquad [11]$$

where P1 (respectively P2) represents the average power of the current core at the DVFS operating point given by the frequency F1 (respectively F2).
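The two DVFS overheads can be checked numerically with the sketch below. The function names and values are hypothetical. The energy overhead is written in its simplified form Nb_total × CPI × (F1 × P2 - F2 × P1) / (F1 × F2), which is algebraically the product of equation [10] by the ratio of equation [11] and avoids a division by F1 - F2 when the two frequencies are equal.

```python
def dvfs_time_overhead(nb_total, cpi, f1, f2):
    """Extra time (s) when the portion runs at f2 instead of f1."""
    return nb_total * cpi * (f1 - f2) / (f1 * f2)


def dvfs_energy_overhead(nb_total, cpi, f1, f2, p1, p2):
    """Extra energy (J) when the portion runs at (f2, p2) instead of (f1, p1)."""
    return nb_total * cpi * (f1 * p2 - f2 * p1) / (f1 * f2)


# Hypothetical portion of one million instructions, slowed from 1 GHz
# (1.0 W) to 500 MHz (0.3 W) with a CPI of 1.
extra_time = dvfs_time_overhead(1e6, 1.0, 1e9, 5e8)                   # about +1 ms
extra_energy = dvfs_energy_overhead(1e6, 1.0, 1e9, 5e8, 1.0, 0.3)     # about -0.4 mJ
```

In this example the portion takes one millisecond longer but consumes less energy, which is the kind of trade-off the decision step arbitrates.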

Cost of a simultaneous change of voltage-frequency pair (DVFS) and of allocation (emulation / acceleration):

Compared with the same portion of the previous period, the additional cost in time of the change of voltage-frequency pair on the portion "k" with migration, predicted for the period n + 1 and for a change from an execution frequency F1 to F2, is given by the following equation with its two variants:

- K_DVFS_emul_time, the overhead due to emulation and DVFS; and

- K_DVFS_accel_time, the overhead due to acceleration on the core with the extension "m" and DVFS:

$$K_{DVFS\_emul/accel\_time}(m,k,n+1,F1 \to F2) = K_{emul/accel\_time}(m,k,n+1,F2) + K_{DVFS\_time}(m,k,n+1,F1 \to F2) \qquad [12]$$

Compared with the same portion of the previous period, the additional cost in energy, predicted for the period n + 1, of the change of voltage-frequency pair on the portion "k" for a change from an execution frequency F1 to F2 with migration, is given by the following equation with its two variants:

- K_DVFS_emul_energie, the additional cost due to emulation and DVFS; and

- K_DVFS_accel_energie, the additional cost due to acceleration on the core with the extension "m" and DVFS:

$$K_{DVFS\_emul/accel\_energie}(m,k,n+1,F1 \to F2) = \left( K_{emul/accel\_time}(m,k,n+1,F2) \times P2 \right) + K_{DVFS\_energie}(m,k,n+1,F1 \to F2) \qquad [13]$$

where P1 represents the average power of the current core at the DVFS operating point given by the frequency F1 and P2 represents the average power of the destination core at the operating point given by the frequency F2.
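Equations [12] and [13] simply compose the overheads computed separately for the allocation change and for the DVFS change. A short sketch, with hypothetical function names and input values:

```python
def combined_time_overhead(k_alloc_time_at_f2, k_dvfs_time):
    """Equation [12]: emulation (or acceleration) overhead evaluated at the
    new frequency F2, plus the DVFS time overhead F1 -> F2."""
    return k_alloc_time_at_f2 + k_dvfs_time


def combined_energy_overhead(k_alloc_time_at_f2, p2, k_dvfs_energy):
    """Equation [13]: the emulation (or acceleration) time overhead is
    converted to energy with the destination power P2, then added to the
    DVFS energy overhead."""
    return k_alloc_time_at_f2 * p2 + k_dvfs_energy


# Hypothetical inputs: 45 us of emulation at F2, +1 ms and -0.4 mJ for DVFS,
# destination core drawing 0.3 W.
k_time = combined_time_overhead(4.5e-5, 1e-3)
k_energy = combined_energy_overhead(4.5e-5, 0.3, -4e-4)
```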

Decision making

The estimated costs of the different management options feed a decision-making step that ensures a minimum QoS while minimizing the energy consumed.

Figure 6 gives an example of decision making at the granularity of a portion to maintain a quality of service above the minimum QoS and minimize energy. This figure is derived from Figure 4 by adding:

the decomposition of periods into portions;

the indication of the processor core used in each portion, and the corresponding energy consumption.

To guarantee this condition and to avoid unnecessary oscillations and management actions, two decision-making intervals are defined, [QoS_Min, QoS_bas] and [QoS_Haut, QoS_Max] (represented in Figure 6), in which management actions are triggered; the system is kept as it is if the calculated QoS falls within the intermediate interval [QoS_bas, QoS_Haut].
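The role of the intermediate interval as a dead band can be sketched as follows. The function name and the returned labels are hypothetical; the assumption that the lower interval calls for actions raising the QoS and the upper interval for actions saving energy is an interpretation of this sketch, not a statement of the description.

```python
def classify_qos(qos, qos_bas, qos_haut):
    """Decide whether a management action should be triggered.

    Returns "raise_qos" below the dead band, "save_energy" above it and
    "keep" inside [qos_bas, qos_haut] (assumed semantics).
    """
    if qos < qos_bas:
        return "raise_qos"
    if qos > qos_haut:
        return "save_energy"
    return "keep"


# Hypothetical thresholds, in arbitrary QoS units.
decision = classify_qos(1.2, qos_bas=1.0, qos_haut=1.5)   # "keep"
```

Because no action is taken inside the dead band, small fluctuations of the measured QoS do not cause repeated migrations or frequency changes.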

In one embodiment of the invention, the margin with respect to the quality of service QoS_bas (as illustrated by FIG. 6) is determined over the last period executed and distributed over the different portions of the following period, so that the decisions taken at the portion level use this share of the margin to choose a management opportunity (migration, change of DVFS pair, or a combination of the two) which reduces the energy consumed while respecting this share of the margin.

The QoS margin is converted to a time margin "μ_QoS": positive when QoS_n is greater than QoS_bas and negative otherwise (case of the margin on the third period of Figure 6). Its distribution over the N portions of the period can be uniform, or it can be chosen more intelligently, for example according to the utilization ratio of the extension "m" whose emulation is the most penalizing in time (τ_m, see equation [1]) calculated on each portion "k": τ(m, k, n). An implementation would prefer, for example, distributing the margin to the portions with the lowest utilization ratios, because they offer greater emulation opportunities.
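One possible weighting is sketched below. The description only states that portions with low utilization ratios may be favored; the choice of weights proportional to (1 - utilization ratio), the fallback to a uniform split, the function name and the values are all assumptions of this sketch.

```python
def distribute_margin(total_margin, utilization_ratios):
    """Split a time margin over the portions, giving a larger share to the
    portions with the lowest utilization ratio of the tracked extension."""
    weights = [1.0 - tau for tau in utilization_ratios]
    total_weight = sum(weights)
    if total_weight <= 0.0:
        # Every portion fully uses the extension: fall back to a uniform split.
        n = len(utilization_ratios)
        return [total_margin / n] * n
    return [total_margin * w / total_weight for w in weights]


# Hypothetical margin of 6 ms over three portions.
shares = distribute_margin(6.0, [0.5, 0.25, 0.25])   # [1.5, 2.25, 2.25]
```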

A second margin (positive or negative too) can be introduced in relation to the prediction of the evolution of the number of specialized instructions in time.

At the end of each portion k of the period n, the execution time of the portion k at the period n + 1 is predicted, with constant allocation:

$$\mathrm{T\_Pred}_{(k,n+1)} = \frac{CPI}{Freq} \times \Big[ \mathrm{Nb\_Pred\_basic}_{(k,n+1)} + \sum_{\text{emulated } m} \Big( \sum_i \mathrm{Nb\_Pred}_{(m,i,k,n+1)} \times \mathrm{Cost\_moy\_Inst}_{(m,i)} \Big) + \sum_{\text{executed } m'} \Big( \sum_i \mathrm{Nb\_Pred}_{(m',i,k,n+1)} \Big) \Big] \qquad [14]$$

This estimate takes into account the emulation of the hardware extensions not supported by the core used at the portion k of the period n (emulated m), by summing the cost of their emulation and the execution of the predicted instructions of the supported extensions (executed m').
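The prediction of equation [14] can be sketched as follows. The function name, the data layout and the values are hypothetical, and the conversion from instruction counts to time through CPI / Freq is an assumption of this sketch, made by analogy with the emulation cost formula given earlier.

```python
def predict_portion_time(nb_basic, emulated, executed, cpi, freq_hz):
    """Predicted execution time (s) of a portion at constant allocation.

    emulated: {extension: [(predicted_count, mean_emulation_cost), ...]}
              for extensions absent from the core (one tuple per family).
    executed: {extension: [predicted_count, ...]} for extensions present.
    """
    instructions = nb_basic
    instructions += sum(
        count * cost for families in emulated.values() for count, cost in families
    )
    instructions += sum(count for families in executed.values() for count in families)
    return instructions * cpi / freq_hz


# Hypothetical portion on a core that has HW1 but must emulate HW2.
t_pred = predict_portion_time(
    nb_basic=10_000,
    emulated={"HW2": [(100, 20.0), (50, 40.0)]},
    executed={"HW1": [300, 200]},
    cpi=1.0,
    freq_hz=1e9,
)
# 10 000 + 4 000 + 500 = 14 500 instruction-equivalents, about 14.5 microseconds
```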

The second usable margin can be calculated as the difference T_Pred(k, n+1) - T(k, n) [15], where T(k, n) is the effective execution time on the portion k of the previous period n.

The margin collected by the portion k is:

Advantageously, this method operates incrementally, starting from the allocation chosen for the same portion of the previous period and estimating the additional costs of the management opportunities. The choice of the most appropriate management opportunity with respect to the margin μ(k) allocated to the portion k advantageously relies on a parameter that jointly integrates performance and energy, such as the EDP (Energy Delay Product).

According to one embodiment of the invention, the difference of the additional costs (reduced to an energy quantity) is used as a parameter to be maximized in order to guide the choice of the management opportunity. For a management opportunity "opp", for a portion "k", and an average power P1 (power of the core used at the portion k of the preceding period), this difference is given by the following formula:

$$D_{opp}(k) = \left( K_{opp\_time}(k) \times P1 \right) - K_{opp\_energie}(k) \qquad [17]$$

In equation [17], K_opp_time(k) is the time cost of a management opportunity; depending on the chosen opportunity, it can be for example K_emul_time, K_accel_time or K_DVFS_time. Similarly, K_opp_energie(k) is the energy cost of a management opportunity; depending on the opportunity retained, it can be for example K_emul_energie, K_accel_energie or K_DVFS_energie. The condition to be satisfied is that:

$$K_{opp\_time}(k) < \mu(k)$$

and

$$D_{opp}(k) = \max_{\text{all } opp} \{ D_{opp}(k) \} \qquad [18]$$

For example, on the first portion of the third period of Figure 6, the decision was made to change the previous allocation (first portion of the second period), which was on a core with an extension HW1, to a basic core, because the margin allocated to this portion allowed this to be done with a lower energy consumption. For the third and last portion of period 3, the opportunity that maximized the difference of additional costs D on this portion was the one that migrated to a basic core and at the same time changed the execution frequency on this core.
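The selection rule of equations [17] and [18] can be sketched as follows: among the opportunities whose time cost stays below the margin of the portion, the one with the largest difference D is retained. The function name, the opportunity labels and the values are hypothetical.

```python
def choose_opportunity(opportunities, margin, p1):
    """Return (name, d) of the admissible opportunity maximizing
    D = K_time * P1 - K_energy, or (None, None) if none is admissible.

    opportunities: {name: (k_time, k_energy)} relative to the allocation
    used for the same portion of the previous period.
    """
    best_name, best_d = None, None
    for name, (k_time, k_energy) in opportunities.items():
        if k_time >= margin:      # violates K_opp_time(k) < mu(k)
            continue
        d = k_time * p1 - k_energy
        if best_d is None or d > best_d:
            best_name, best_d = name, d
    return best_name, best_d


# Hypothetical costs in seconds and joules, with P1 = 1 W.
candidates = {
    "keep": (0.0, 0.0),
    "migrate_to_basic": (2e-3, -1.5e-3),
    "dvfs_down": (1e-3, -4e-4),
    "migrate_and_dvfs": (5e-3, -3e-3),
}
choice, d = choose_opportunity(candidates, margin=3e-3, p1=1.0)
# "migrate_to_basic" is retained; "migrate_and_dvfs" exceeds the 3 ms margin
```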

Another possible embodiment of the decision-making phase is based on the principle of favoring allocation on a basic core for the first portions, as long as the QoS estimate at the end of the period remains above the minimum quality of service, and then speeding up the execution by migrating to an extended core for the last portions. This method tends to delay the migration to an extended core as much as possible while guaranteeing a minimum QoS at the end of the period, thus favoring the use of basic cores, which are less efficient but consume little energy.

In this embodiment, just at the end of the portion k-1 of the current period, the estimate of the QoS at the end of the period considers an allocation of the portion k on a basic core and an allocation of the following portions (k + 1 to N) on an extended core. In this method, the allocation of the first portion is estimated on a basic core with all the other portions on an extended core, and it is checked whether the QoS at the end of the period remains greater than QoS_Min. The prediction of the execution time of the portion k on an extended core (T_Extended_Pred(k)) corresponds to the calculation of T_Pred(k, n+1) of equation [14] considering that all the extensions are present. Similarly, the prediction of the execution time of the portion k on a basic core (T_Basic_Pred(k)) corresponds to the calculation of T_Pred(k, n+1) of equation [14] assuming that all the extensions are emulated.

The QoS predicted at the end of the period by estimating that the portion k is allocated on a basic core and that the rest of the portions are allocated on an extended core is calculated by the following equation:

$$\mathrm{QoS\_Pred}_{n+1} = \sum_{1 \le j < k} T_{(j,n)} + \mathrm{T\_Basic\_Pred}_{(k)} + \sum_{1 \le j \le N-k} \mathrm{T\_Extended\_Pred}_{(k+j)} \qquad [19]$$

If the condition QoS_Pred(n+1) > QoS_Min is satisfied, the predicted allocation is executed. Otherwise, the frequency of the allocation of the portion k is incrementally changed, then a change of allocation is considered at a low frequency and the frequency is gradually increased, and so on.
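The check performed at the end of each portion can be sketched as follows. Since the description states that the QoS is often reduced to the inverse of an execution time, this sketch assumes that the condition on the QoS is equivalent to comparing the predicted duration of the period with a maximum allowed duration; that parameter, the function names and the values are hypothetical.

```python
def predicted_period_time(done_times, t_basic_pred_k, t_extended_pred_rest):
    """Sum of equation [19]: portions already executed, portion k on a basic
    core, remaining portions on an extended core."""
    return sum(done_times) + t_basic_pred_k + sum(t_extended_pred_rest)


def can_stay_on_basic(done_times, t_basic_pred_k, t_extended_pred_rest,
                      max_period_time):
    """True if portion k can run on a basic core without the period exceeding
    max_period_time (assumed to play the role of 1 / QoS_Min)."""
    total = predicted_period_time(done_times, t_basic_pred_k, t_extended_pred_rest)
    return total < max_period_time


# Hypothetical period of N = 3 portions, times in milliseconds, decision for k = 2.
ok = can_stay_on_basic([2.0], 5.0, [1.5], max_period_time=10.0)      # True
late = can_stay_on_basic([2.0], 5.0, [1.5], max_period_time=8.0)     # False
```

When the check fails, the description suggests trying other frequencies and then other allocations for the portion k, which amounts to repeating the same test with different predicted times.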

The dual implementation, which favors the allocation of the portion k on an extended core by considering that the following portions (k + 1 to N) will be allocated on a basic core, is also a possible embodiment. As long as the condition QoS_Pred(n+1) > QoS_Min is not verified, the next portion continues to be allocated on the extended core. As soon as this condition is verified, the QoS at the end of the period is predicted incrementally by estimating, in the same order as in the previous method, the possible management opportunities.

A method according to the invention is implemented by means of a computer program product comprising instructions executable by a functionally asymmetric multi-core processor (processor PR of FIG. 1), stored on a non-volatile medium readable by said processor (for example a ROM memory MM, illustrated in FIG. 1). Advantageously, as has been described in particular in relation to FIG. 3, the functionally asymmetric multi-core processor may have specific hardware means: instruction filtering circuits and counters. Such a processor and a non-volatile memory storing the computer program product can advantageously form, with other components (RAMs, peripherals ...), a computer system, typically embedded.

The invention has been described in relation to a particular embodiment, but variants are possible. For example, without limitation, other cost formulas may be used; predictions may be obtained not by a moving average but, for example, by Kalman filtering; the distribution between functionalities implemented in software or hardware may vary.

Claims

1. A computer-implemented method for managing a computation task on a functionally asymmetric multi-core processor (PR), the execution of said task comprising a succession of application periods, said processor comprising a plurality of cores (PC1-PC4) sharing so-called basic instructions, at least one said core comprising at least one hardware extension (HW1, HW2), said or each hardware extension being adapted to allow the execution of so-called specialized instructions, different from said basic instructions, each specialized instruction being thus associated with a said hardware extension, the method comprising the following steps:

a) starting execution of the computation task on a core of said processor;

b) performing, during said execution, a tracking of a parameter (QoS) indicative of a quality of service of the computation task, as well as of at least a number of specialized instructions loaded by said core;

c) on the basis of said tracking, identifying instants that split an application period of the computation task into a predetermined number of portions (PTS) such that, in each of said portions, a substantially equal number of specialized instructions associated with a predefined hardware extension are loaded by said core;

d) calculating, at said instants and according to said tracking, costs or gains in quality of service and energy consumption corresponding to different management options of the computation task, one said management option consisting of continuing execution on the same processor core and at least one other management option consisting of continuing execution on a different core; and

e) making a management choice consisting in choosing among said management options based on the costs or gains thus calculated.

2. The method of claim 1 wherein said step c) comprises predicting the number of specialized instructions associated with said hardware extension loaded during each portion of a current application period from the number of said specialized instructions loaded during corresponding portions of at least one preceding application period.

3. Method according to one of the preceding claims wherein said predefined hardware extension is that adapted to allow the execution of specialized instructions whose emulation would have the highest quality of service cost.

4. Method according to one of the preceding claims wherein the management choice made in step e) includes the decision to continue the execution of the computation task on the same core or on a different core so as to minimize the energy consumption of the processor while respecting a quality of service constraint.

5. The method of claim 4 wherein said step e) comprises:

e1) determining a quality of service margin over a preceding application period;

e2) distributing this margin between the portions of the current period; and

e3) for each portion of the current period, making a said management choice to reduce the energy consumption under the constraint of respecting the quality of service margin distributed to said portion, when possible.

6. The method of claim 4 wherein:

during step a), the execution of the computation task is started on a core that does not include hardware extensions;

in said step e), a decision to continue performing the computation task on another core of the processor, comprising at least one hardware extension, is taken when necessary to ensure compliance with a quality of service constraint.

7. Method according to one of the preceding claims wherein said management choice also comprises a decision to maintain or change a clock frequency / supply voltage pair of the core.

8. Method according to one of the preceding claims wherein the specialized instructions associated with each hardware extension are grouped into a predefined number of families, this number being greater than 1 for at least one said hardware extension and at least one said family comprising several instructions; and in which, in step d), the instructions of the same family are considered as one and the same instruction for the purpose of calculating said costs or gains in quality of service and in energy consumption.

9. The method of claim 8 wherein the number of families in which the specialized instructions associated with each hardware extension are grouped together is chosen so as to minimize errors affecting the calculation of costs or gains in quality of service and energy consumption made during step d).

10. Method according to one of claims 8 and 9 wherein step b) comprises tracking:

the number of basic instructions;

the number of specialized instructions associated with each hardware extension that said core does not comprise;

the number of specialized instructions belonging to each family of specialized instructions associated with each hardware extension that said core comprises; and

the number of basic instructions used to emulate the specialized instructions associated with each hardware extension that said core does not comprise;

loaded by said core.

11. Method according to one of the preceding claims, comprising a prior step of loading into memory a set of characterization parameters representative of a statistical distribution of the time and energy emulation costs of the specialized instructions, in which step d) is implemented using said characterization parameters.

12. The method of claim 11 further comprising a prior calibration step for determining said characterization parameters.

13. Method according to one of the preceding claims wherein said parameter (QoS) indicative of a quality of service of the computation task is representative of the inverse of an execution time of said computation task.

14. Computer program product stored on a non-volatile computer-readable medium (MM), comprising computer-executable instructions for carrying out a method according to one of the preceding claims.

15. A functionally asymmetric multi-core processor (PR) comprising a plurality of cores (PC1-PC4) sharing so-called basic instructions, at least one said core comprising at least one hardware extension (HW1, HW2), said or each hardware extension being adapted to allow the execution of so-called specialized instructions, different from said basic instructions, each specialized instruction being thus associated with a said hardware extension, characterized in that it also includes:

filter circuits configured to sort the basic instructions and specialized instructions associated with the different hardware extensions, and to assign each specialized instruction to a family, where each family includes one or more specialized instructions associated with the same hardware extension and at least one hardware extension is associated with instructions included in a plurality of distinct families; and

for each core:

a counter (Nb_basic) of basic instructions loaded by the core;

for each hardware extension not included in said core, a counter (Nb (HW1), Nb (HW2)) of specialized instructions associated with said hardware extension loaded by the core, and a counter of the number of basic instructions used for the emulation of the associated specialized instructions; and

for each hardware extension included in said core, and for each family of specialized instructions associated with said hardware extension, a counter (Nb (HW1, i), Nb (HW2, i)) of specialized instructions associated with said hardware extension and belonging to said family loaded by the core.

16. A computer system comprising a functionally asymmetric multi-core processor (PR) according to claim 15 and a non-volatile memory (MM) storing instructions executable by said processor for carrying out a method according to one of claims 1 to 13.