EP3814923A1

EP3814923A1 - Asynchronous processor architecture

Info

Publication number: EP3814923A1
Application number: EP19737810.2A
Authority: EP
Inventors: Khaled Maalej; Trung-Dung NGUYEN; Julien Schmitt; Pierre-Emmanuel BERNARD
Original assignee: Vsora
Current assignee: Vsora
Priority date: 2018-06-29
Filing date: 2019-05-21
Publication date: 2021-05-05
Also published as: US20210141644A1; FR3083351A1; KR20210021588A; FR3083351B1; WO2020002783A1; CN112639760A

Abstract

The invention concerns a data processing method implemented by a computing device comprising a control unit, at least one ALU (9), a set of registers (11), a memory (13) and a memory interface (15). The method comprises: a) obtaining (101, 102) the memory addresses of the operands; b) reading (103, 104) the operands in the memory (13); c) transmitting (105) an instruction to execute calculations to the ALU (9) without an addressing instruction; d) executing all the elementary operations (106) by the ALU (9) receiving as input each of the operands from the registers (11); e) storing (107) the data forming results of the processing operation on the registers (11); f) obtaining (108) a memory address for each of the data items forming a result of the processing operation; g) writing (109) the results in the memory (13), for storage and via the memory interface (15), by means of the obtained memory addresses.

Description

Asynchronous processor architecture

The invention relates to the field of processors and their functional architecture.

Conventionally, a computing device comprises a set of one or more processors. Each processor includes one or more processing units (PUs). Each PU includes one or more calculation units called arithmetic-logic units (ALUs). To obtain a high-performance computing device, that is to say one that performs computing operations quickly, it is conventional to provide a large number of ALUs. The ALUs can then process operations in parallel, i.e. at the same time. The unit of time is then one calculation cycle. It is therefore common to quantify the computing power of the device as the number of operations it is capable of performing per calculation cycle.

However, a significant part of the computing power of a computing device is consumed in managing memory access. Such a device comprises a memory assembly, itself comprising one or more memory organs, each having a fixed number of memory locations on which computer data can be permanently stored. During computer processing, the ALUs receive input data from the memory organs and output data which, in turn, is stored on the memory organs. It is therefore understood that, in addition to the number of ALUs, the number of memory organs is another determining criterion of the computing power of the device.

Data routing between the ALUs and the memory organs, in both directions, is ensured by a bus of the device. The term "bus" is used here in its general sense of a data transfer system (or interface) including the hardware (interface circuit) and the protocols governing the exchanges. The bus transmits the data itself, addresses and control signals. Each bus also has hardware and software limits, so that data routing is constrained. In particular, the bus has a limited number of ports on the memory organ side and a limited number of ports on the ALU side. Thus, during a calculation cycle, a memory location is accessible via the bus in a single direction (in "read" or in "write"). In addition, during a calculation cycle, a memory location is accessible to a single ALU.

Between the bus and the ALUs, a computing device generally comprises a set of registers and local memory organs, which can be seen as memories distinct from the aforementioned memory organs. To facilitate understanding, a distinction is made here between "registers", intended for storing data as such, and "local memory organs", intended for storing memory addresses. Registers are assigned to the ALUs of a PU; a PU is thus assigned several registers. The storage capacity of the registers is very limited in comparison with the memory organs, but their content is directly accessible to the ALUs.

To perform the calculations, each ALU must generally first obtain the calculation input data, typically the two operands of an elementary calculation. An operation of "reading" the corresponding memory location via the bus to import each of the two operands into a register is therefore implemented. Then, the ALU performs the calculation operation by itself from data in a register and by exporting the result as data to a register. Finally, a "write" operation is implemented to save, in a memory location, the result of the calculation. During such a write operation, the result stored in the register is saved in a memory location via the bus. Each of the operations a priori consumes one or more calculation cycles.

In known computer devices, it is usual to try to execute several operations (or several instructions) during the same calculation cycle, in order to reduce the total number of calculation cycles and therefore to increase efficiency. This is referred to as parallel "processing chains" or "pipelines". However, operations often depend on one another. For example, it is impossible to carry out an elementary calculation until the operands have been read and are accessible to the ALU on a register. Implementing processing chains therefore involves checking the dependence of operations (instructions) on each other, which is complex and therefore costly.

Usually, several independent operations are implemented during the same calculation cycle. Generally, for a given ALU and during the same calculation cycle, it is possible to perform a calculation operation and a read or write operation. By contrast, for a given ALU and during the same calculation cycle, it is impossible to carry out both a read operation and a write operation (in the case of single-port memory organs). Moreover, the memory accesses (the bus) do not make it possible, during the same calculation cycle and for a given memory location, to implement read or write operations for two ALUs that are distinct from one another.

Consequently, it is known to perform an elementary calculation and to write the result obtained in memory during the same calculation cycle. The savings in terms of calculation cycles (or computing resources) remain small.

The invention improves the situation.

A data processing method is proposed, which can be broken down into a set of elementary operations to be carried out, implemented by a computer device, said device comprising:

- a control unit;

- at least a first arithmetic and logic unit;

- a set of registers capable of supplying data forming operands to the inputs of said first arithmetic and logic unit and capable of being supplied with data originating from the outputs of said first arithmetic and logic unit;

- a memory;

- a memory interface through which data is transmitted and routed between registers and memory.

The method includes:

a) obtaining the memory addresses of each of the data absent from the registers and forming an operand for at least one of said elementary operations to be carried out;

b) reading from memory, for loading into the registers via the memory interface, each of said data by means of the memory addresses obtained;

c) transmitting an instruction for executing calculations from the control unit to said first arithmetic and logic unit, said instruction being devoid of addressing instructions;

d) upon receipt of said instruction for executing calculations, and as soon as the corresponding operands are available in the registers, executing all of said elementary operations by said first arithmetic and logic unit receiving each of the operands from the registers as inputs;

e) storing the data forming processing results on the registers at the outputs of said first arithmetic and logic unit;

f) obtaining a memory address for each of the data forming a result of the processing;

g) writing in memory each of the data forming a result of the processing coming from the registers, for storage and via the memory interface, by means of the memory addresses obtained.
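Steps a) to g) can be illustrated by a minimal software sketch. It is only a toy model under stated assumptions, not the claimed hardware: the names `Device`, `alu_execute` and `process`, the two opcodes and the dictionary-based memory are all hypothetical and chosen for illustration. The point it shows is that the calculation instruction handed to the ALU names registers only, while every memory address is handled before or after the calculation phase.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy model (hypothetical): a memory, a register file, a memory-interface log."""
    memory: dict                                    # address -> value
    registers: dict = field(default_factory=dict)   # register name -> value
    bus_log: list = field(default_factory=list)     # accesses via the memory interface

    def read(self, address, register):
        # Step b): load one operand from memory into a register.
        self.registers[register] = self.memory[address]
        self.bus_log.append(("read", address))

    def write(self, register, address):
        # Step g): store one result from a register into memory.
        self.memory[address] = self.registers[register]
        self.bus_log.append(("write", address))


def alu_execute(registers, ops):
    """Steps d) and e): the ALU sees register names only, never memory addresses."""
    for opcode, src1, src2, dst in ops:
        if opcode == "add":
            registers[dst] = registers[src1] + registers[src2]
        elif opcode == "mul":
            registers[dst] = registers[src1] * registers[src2]
        else:
            raise ValueError(f"unknown opcode {opcode!r}")


def process(device, operand_addresses, ops, result_addresses):
    # Steps a) and b): the operand addresses are known, the operands are loaded.
    for register, address in operand_addresses.items():
        device.read(address, register)
    # Step c): the calculation instruction carries no addressing information.
    alu_execute(device.registers, ops)
    # Steps f) and g): the result addresses are known, the results are stored.
    for register, address in result_addresses.items():
        device.write(register, address)


dev = Device(memory={0x10: 3, 0x14: 4, 0x20: 0})
process(dev, {"r0": 0x10, "r1": 0x14}, [("add", "r0", "r1", "r2")], {"r2": 0x20})
print(dev.memory[0x20], dev.bus_log)
```

In this sketch the memory-interface log contains only the two reads and the final write; `alu_execute` never touches `memory`, which mirrors the absence of addressing instructions in step c).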

Such a method makes it possible, by dissociating in time the tasks relating to the processing of memory addresses from the calculation tasks, to exempt the ALU performing the calculations from carrying out addressing operations that would otherwise require interrupting the calculations. In doing so, the processing as a whole becomes both asynchronous and self-adapting: elementary calculations are initiated (by an instruction transmitted to an ALU) only once the memory addresses have been updated in the local memory organs. By dissociating the two types of operations (updating memory addresses in local memory organs on the one hand and calculation on the other hand), the processing time can be reduced. In other words, for a fixed quantity of resources, the sum of the time necessary to update the memory addresses in the local memory organs during a first process and the time necessary to perform the calculations during a second process is less than the time required for the same quantity of resources to perform all of the processing during a single process (with memory addresses updated in the local memory organs on the fly). The time saving is particularly significant in the case of iterative processing, which can typically be carried out by means of computer loops.

According to another aspect, a computerized data processing device is proposed, said processing being decomposable into a set of elementary operations to be carried out. The device includes:

- a control unit;

- at least a first arithmetic and logical unit among a plurality;

- a set of registers capable of supplying data forming operands to the inputs of said arithmetic and logic units and capable of being supplied with data from the outputs of said arithmetic and logic units;

- a memory;

- a memory interface through which data is transmitted and routed between registers and memory.

The computing device is configured to:

d) upon receipt of said instruction for carrying out calculations, and as soon as the operands are available in the registers, execute all of said elementary operations by said first arithmetic and logic unit receiving as inputs each of the operands from the registers;

f) obtain a memory address for each of the data forming a result of the processing;

g) write in memory each of the data forming a result of the processing coming from the registers, for storage and via the memory interface, by means of the memory addresses obtained.

According to another aspect, a set of machine instructions is proposed for the implementation of a method as defined herein when this set of instructions is executed by a processor. According to another aspect, a computer program, in particular a compilation program, is proposed, comprising instructions for the implementation of all or part of a method as defined herein when this program is executed by a processor. According to another aspect, a non-transitory recording medium, readable by a computer, is proposed on which such a program is recorded.

The following features can optionally be implemented. They can be implemented independently of each other or in combination with each other:

- The first arithmetic and logic unit performs all the elementary calculations of the processing during consecutive calculation cycles, no memory access being carried out by said first arithmetic and logic unit during said calculation cycles. This relieves the first arithmetic and logic unit of any memory access operation during the elementary calculations and therefore speeds up the implementation of said calculations.

- At least one of the following steps includes an iterative loop:

d) upon receipt of said instruction for carrying out calculations, execute all of said elementary operations by said first arithmetic and logic unit receiving as inputs each of the operands from the registers;

f) obtain a memory address for each of the data forming the result of the processing.

This makes it possible to implement particularly fast calculation processes, because they are repetitive.

- The device further comprises at least one additional arithmetic and logic unit, distinct from the first arithmetic and logic unit that executes all of said elementary operations. The additional arithmetic and logic unit implements:

a) obtaining the memory addresses of each of the data absent from the registers and forming an operand for at least one of said elementary operations to be carried out; and

b) reading from memory, for loading into the registers via the memory interface, of each of said data by means of the memory addresses obtained.

This allows the functions to be statically distributed among the ALUs and therefore improves their respective efficiency.

In parallel, the applicant also describes an approach in which, at each read operation, the number of data read is greater than the number of data strictly necessary for the implementation of the next calculation. Such an approach could, in contrast, be called "predictive memory access". It is then possible that a data item among the data read is used for a future calculation, other than the calculation implemented immediately after reading. In such cases, the necessary data was obtained during a single memory access operation (with increased memory bandwidth) while the usual approach would have required at least two separate memory accesses. Such an approach therefore has the effect, at least in certain cases, of reducing the consumption of calculation cycles for memory accesses and therefore makes it possible to improve the efficiency of the device. In the long term (several consecutive calculation cycles) the number of memory accesses (in read and / or write) is reduced.

This approach does not exclude losses: some of the data read and stored in a register can be lost (overwritten by other data then stored in the same register) even before being used in a calculation. Nevertheless, over a large number of calculations and calculation cycles, the applicant has observed an improvement in performance, including in the absence of selection of the data sets read. In other words, even in the absence of selection of the data read (or random selection), this approach makes it possible statistically to improve the efficiency of the computing device compared to the usual approach.
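The effect described in the two paragraphs above can be illustrated with a small counting sketch. It is an assumption-laden toy model rather than a measurement: the function `count_reads`, its parameters and the oldest-first overwrite policy are hypothetical. Each read fetches a whole aligned block of `block_size` data items; when `register_capacity` is set, the oldest block held in the registers is overwritten once the capacity is exceeded, which models the losses mentioned in the text.

```python
from collections import OrderedDict


def count_reads(needed_indices, block_size, register_capacity=None):
    """Count memory reads when every read fetches a whole aligned block.

    needed_indices    -- indices of the data items, in the order calculations need them
    block_size        -- number of data items fetched per read (1 = usual approach)
    register_capacity -- max number of blocks kept in registers (None = unlimited)
    """
    resident = OrderedDict()  # blocks currently held in registers, oldest first
    reads = 0
    for index in needed_indices:
        block = index // block_size
        if block in resident:
            continue  # item already fetched by an earlier read: no new access
        reads += 1
        resident[block] = True
        if register_capacity is not None and len(resident) > register_capacity:
            resident.popitem(last=False)  # oldest block is overwritten (lost)
    return reads


print(count_reads(range(16), block_size=1))  # one read per data item
print(count_reads(range(16), block_size=4))  # one read per block of four
print(count_reads([0, 4, 1, 5], block_size=4, register_capacity=1))  # losses occur
```

With sixteen consecutive items, reading blocks of four needs 4 reads instead of 16. In the last call, the access pattern alternates between two blocks while only one block fits in the registers, so every block read is overwritten before it is reused and the count goes back up to 4.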

Other characteristics, details and advantages of the invention will appear on reading the detailed description below, and on analysis of the appended drawings, in which:

- Figure 1 shows an architecture of a computer device according to the invention;

- Figure 2 is a partial representation of an architecture of a computer device according to the invention;

- Figure 3 shows an example of memory access;

- Figure 4 is a variant of the example of Figure 3;

- Figure 5 schematically shows an operating architecture according to the invention; and

- Figure 6 shows a chronological breakdown of an example of operation according to the invention.

Figure 1 shows an example of a computing device 1. The device 1 comprises a set of one or more processors 3, sometimes called central processing units (CPUs). The set of processor(s) 3 comprises at least one control unit 5 and at least one processing unit 7 (PU 7). Each PU 7 comprises one or more calculation units called arithmetic-logic units 9 (ALU 9). In the example described here, each PU 7 further comprises a set of registers 11. The device 1 comprises at least one memory 13 capable of interacting with the set of processor(s) 3. For this purpose, the device 1 further comprises a memory interface 15, or "bus".

In the present context, the memory organs are considered to be single-port, that is to say that read and write operations are implemented during different cycles, as opposed to so-called "dual-port" memories (more expensive in terms of surface area and requiring larger, separate control buses for writing and reading). Alternatively, the proposed technical solutions can be implemented with so-called "dual-port" memories. In such embodiments, reads and writes can be implemented during the same calculation cycle.

In FIG. 1, three PU 7 are represented: PU 1, PU X and PU N. Only the structure of PU X is shown in detail in order to simplify FIG. 1. Nevertheless, the structures of the PUs are similar to each other. In variants, the number of PUs is different. The device 1 can comprise a single PU, two PUs or more than three PUs.

In the example described here, the PU X comprises four ALUs: ALU X.0, ALU X.1, ALU X.2 and ALU X.3. In variants, the PUs can comprise a number of ALUs which are different from each other and / or different from four, including a single ALU. Each PU includes a set of registers 11, here at least one register 11 assigned to each ALU. In the example described here, the PU X comprises a single register 11 per ALU, that is to say four registers referenced REG X.0, REG X.1, REG X.2 and REG X.3 and allocated respectively to ALU X.0, ALU X.1, ALU X.2 and ALU X.3. In variants, each ALU is assigned a plurality of registers 11.

Each register 11 is capable of supplying operand-type data to the inputs of said ALUs 9 and is capable of being supplied with data from the outputs of said ALUs. Each register 11 is furthermore capable of storing data from memory 13 obtained via bus 15 by a so-called "read" operation. Each register 11 is, furthermore, capable of transmitting stored data, destined for the memory 13 and via the bus 15, by a so-called "write" operation. The read and write operations are managed by controlling the memory accesses from the control unit 5.

The control unit 5 imposes on each ALU 9 the manner of carrying out elementary calculations, in particular their order, and assigns to each ALU 9 the operations to be executed. In the example described here, the control unit 5 is configured to drive the ALUs 9 according to a pipelined micro-architecture so that the ALUs 9 perform calculations in parallel with each other. For example, the device 1 has an architecture with a single instruction stream and multiple data streams, called SIMD for "Single Instruction, Multiple Data", and / or an architecture with multiple instruction streams and multiple data streams, called MIMD for "Multiple Instruction, Multiple Data". In addition, the control unit 5 is arranged to control the memory accesses via the memory interface 15 and in particular, here, the read and write operations. The two types of control (calculation and memory access) are represented, in FIG. 1, by arrows in broken lines.

Reference is now made to FIG. 2, in which a single ALU Y is represented. Data transmissions are represented by arrows with solid lines. Since data is transmitted step by step, it is understood that FIG. 2 does not necessarily represent an instant t with simultaneous data transmissions. On the contrary, for a datum to be transmitted from a register 11 to an ALU 9, said datum must for example first be transmitted to said register 11 from the memory 13, here via the memory interface 15 (or bus).

In the example of FIG. 2, three registers 11, respectively referenced REG Y.0, REG Y.1 and REG Y.2, are assigned to an ALU referenced ALU Y. Each ALU 9 has at least three ports, namely two inputs and an output. For each operation, at least two operands are received, respectively by the first and the second input. The result of the calculation is issued via the output. In the example shown in Figure 2, the operands received as input come from the REG Y.0 register and the REG Y.2 register respectively. The result of the calculation is entered in the REG Y.1 register. Once entered in the REG Y.1 register, the result (in the form of data) is written in memory 13, via the memory interface 15. In variants, at least one ALU can have more than two inputs and receive more than two operands for a calculation.

Each ALU 9 can perform:

- arithmetic operations on integer data (addition, subtraction, multiplication, division, etc.);

- arithmetic operations on floating point data (addition, subtraction, multiplication, division, inversion, square root, logarithms, trigonometry, etc.);

- logical operations (two's complement, "AND", "OR", "exclusive OR", etc.).

ALUs 9 do not directly exchange data with each other. For example, if the result of a first calculation performed by a first ALU constitutes an operand for a second calculation to be performed by a second ALU, then the result of the first calculation must at least be entered in a register 11 before being usable by an ALU 9.

In embodiments, the data written to a register 11 is also systematically written to memory 13 (via the memory interface 15), even if said data is obtained only to serve as an operand and not as a result of a whole treatment process.

In embodiments, the data obtained to serve as an operand and having only short-lived relevance (intermediate results of no interest at the end of the processing as a whole) are not systematically written in memory 13 and can be stored only temporarily on a register 11. For example, if the result of a first calculation carried out by a first ALU constitutes an operand for a second calculation to be carried out by a second ALU, then the result of the first calculation must be entered in a register 11. Then, said data is transmitted to the second ALU as an operand directly from the register 11. It is then understood that the allocation of a register 11 to an ALU 9 can change over time and in particular from one calculation cycle to another. This allocation can in particular take the form of addressing data which makes it possible to locate a data item at all times, whether on a register 11 or at a location in the memory 13.

In the following, the operation of the device 1 is described for a processing applied to computer data, the processing being composed of a set of operations, including calculations performed in parallel by a plurality of ALUs 9 over a period of time consisting of a sequence of calculation cycles. It is then said that the ALUs 9 operate according to a pipelined microarchitecture. However, the processing implemented by the device 1 and discussed here can itself constitute part (or a subset) of a more global computing process. Such a more global process can include, in other parts or subsets, calculations carried out in a non-parallel manner by a plurality of ALUs, for example according to a series or cascade operation.

The operating architectures (parallel or in series) can be constant or dynamic, for example imposed (controlled) by the control unit 5. The architectural variations can for example be a function of the data to be processed and the current instructions received as input of the device 1. Such dynamic adaptation of the architectures can be implemented at the compilation stage, by adapting the machine instructions generated by the compiler according to the type of data to be processed and the instructions, when the type of data to be processed and the instructions can be derived from the source code. Such an adaptation can also be implemented only at the level of the device 1, or of a processor, when it executes a conventional machine code and when it is programmed to implement a set of configuration instructions according to the data to be processed and the current instructions received.

The memory interface 15, or "bus", transmits and routes the data between the ALUs 9 and the memory 13, in both directions. The memory interface 15 is controlled by the control unit 5. Thus, the control unit 5 controls access to the memory 13 of the device 1 via the memory interface 15. The control unit 5 coordinates the operations (calculations) implemented by the ALUs 9 and the memory accesses. Control by the control unit 5 includes the implementation of a succession of operations broken down into calculation cycles. This control includes the generation of a first cycle i and a second cycle ii. Chronologically, the first cycle i precedes the second cycle ii. As will be described in more detail in the examples below, the second cycle ii can immediately follow the first cycle i, or the first cycle i and the second cycle ii can be chronologically spaced from each other, for example by intermediate cycles.

The first cycle i includes:

- the implementation of a first calculation by at least one ALU 9; and

- downloading, from memory 13 onto at least one register 11, of a first set of data.

The second cycle ii includes the implementation of a second calculation by at least one ALU 9. The second calculation can be implemented by the same ALU 9 as the first calculation or by a separate ALU 9. At least part of the first data set downloaded during the first cycle i forms an operand for the second calculation.

Reference is now made to FIG. 3. Data, or blocks of data, are respectively referenced A0 to A15 and are stored in memory 13. In the example, it is considered that the data A0 to A15 are grouped by four as follows:

- a data set referenced AA0_3 consisting of data A0, A1, A2 and A3;

- a data set referenced AA4_7 consisting of data A4, A5, A6 and A7;

- a data set referenced AA8_11 consisting of data A8, A9, A10 and A11; and

- a data set referenced AA12_15 consisting of data A12, A13, A14 and A15.

As a variant, the data can be grouped differently, in particular by group (or "block", or "slot") of two, three or more than four. A data set can be seen as a group of data accessible on the memory 13 via a single port of the memory interface 15 during a single read operation. Similarly, the data of a data set can be written into memory 13 via a single port of the memory interface 15 during a single write operation. Thus, during a first cycle i, at least one data set AA0_3, AA4_7, AA8_11 and / or AA12_15 is downloaded to at least one register 11. In the example of the figure, each of the data sets AA0_3, AA4_7, AA8_11 and AA12_15 is downloaded to a respective register 11, that is to say four registers 11 distinct from each other. Each of the registers 11 is assigned at least temporarily to a respective ALU 9, here referenced respectively ALU 0, ALU 1, ALU 2 and ALU 3. During this same cycle i, the ALUs 9 may have implemented a calculation.

During a second cycle ii, each ALU 9 implements a calculation for which at least one of the data stored in the corresponding register 11 forms an operand. For example, ALU 0 implements a calculation, one of the operands of which is A0. A1, A2 and A3 may be unused during the second cycle ii.

In general, downloading data from memory 13 to a register 11 consumes less computation time than implementing calculations by ALUs 9. Thus, it can generally be considered that a memory access operation (here a reading) consumes a single calculation cycle, while the implementation of a calculation by an ALU 9 consumes a calculation cycle or a succession of several calculation cycles, for example four. In the example of FIG. 3, there are a plurality of registers 11 allocated to each ALU 9, represented by groups of registers 11 referenced REG A, REG B and REG C. The data downloaded from the memory 13 on the registers 11 correspond to the REG A and REG B groups. The REG C group is here intended to store data obtained by calculations implemented by the ALUs 9 (during a write operation).

The registers 11 of the REG B and REG C groups can thus contain data sets referenced in a similar way to those of REG A:

- the REG B group includes four registers 11 on which are respectively stored a data set BB0_3 consisting of data B0 to B3, a data set BB4_7 consisting of data B4 to B7, a data set BB8_11 consisting of data B8 to B11 and a data set BB12_15 consisting of data B12 to B15;

- the REG C group includes four registers 11 on which are respectively stored a data set CC0_3 consisting of data C0 to C3, a data set CC4_7 consisting of data C4 to C7, a data set CC8_11 consisting of data C8 to C11 and a data set CC12_15 consisting of data C12 to C15.

In the example of FIG. 3, the data AN and BN constitute the operands of a calculation implemented by an ALU 9 while the data CN constitutes the result, with "N" an integer between 0 and 15. For example, in the case of an addition, CN = AN + BN. In such an example, the data processing implemented by the device 1 corresponds to 16 operations. The 16 operations are independent of each other in the sense that none of the results of the 16 operations is necessary to implement one of the other 15 operations.
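The grouping into data sets and the sixteen independent additions can be written out as a short sketch. The helper `make_sets` and the placeholder operand values are hypothetical and serve only to make the naming convention (AA0_3, BB4_7, CC12_15, and so on) concrete.

```python
def make_sets(prefix, values, set_size=4):
    """Group values into named data sets, e.g. 'AA0_3' -> [A0, A1, A2, A3]."""
    sets = {}
    for start in range(0, len(values), set_size):
        chunk = values[start:start + set_size]
        end = start + len(chunk) - 1
        sets[f"{prefix}{prefix}{start}_{end}"] = chunk
    return sets


# Placeholder operand values (assumption: any numbers would do).
A = list(range(16))
B = [10 * n for n in range(16)]

# C_N = A_N + B_N for N in 0..15: no result depends on another result,
# so the sixteen additions are independent of each other.
C = [a + b for a, b in zip(A, B)]

print(list(make_sets("A", A)))
print(make_sets("C", C)["CC4_7"])
```

Because each CN depends only on AN and BN, the additions can be executed in any order, which is what allows them to be overlapped with memory accesses.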

The implementation of the processing (the 16 operations) can therefore, for example, be broken down as follows, into 18 cycles.

Example 1:

- cycle # 0: reading of AA0_3;

- cycle # 1: reading of BB0_3;

- cycle # 2: calculation of C0 (of set CC0_3) and reading of AA4_7 (forming for example a cycle i);

- cycle # 3: calculation of C1 (of set CC0_3) and reading of BB4_7 (forming for example a cycle i);

- cycle # 4: calculation of C2 (of set CC0_3);

- cycle # 5: calculation of C3 (of set CC0_3) and writing of CC0_3;

- cycle # 6: calculation of C4 (of set CC4_7) and reading of AA8_11 (forming for example a cycle ii);

- cycle # 7: calculation of C5 (of set CC4_7) and reading of BB8_11 (forming for example a cycle ii);

- cycle # 8: calculation of C6 (of set CC4_7) (forming for example a cycle ii);

- cycle # 9: calculation of C7 (of set CC4_7) and writing of CC4_7 (forming for example a cycle ii);

- cycle # 10: calculation of C8 (of set CC8_11) and reading of AA12_15;

- cycle # 11: calculation of C9 (of set CC8_11) and reading of BB12_15;

- cycle # 12: calculation of C10 (of set CC8_11);

- cycle # 13: calculation of C11 (of set CC8_11) and writing of CC8_11;

- cycle # 14: calculation of C12 (of set CC12_15);

- cycle # 15: calculation of C13 (of set CC12_15);

- cycle # 16: calculation of C14 (of set CC12_15);

- cycle # 17: calculation of C15 (of set CC12_15) and writing of CC12_15.
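The cycle-by-cycle breakdown of Example 1 can be regenerated by a short scheduling sketch. The function `schedule` is hypothetical and encodes one possible placement rule that happens to reproduce the list above: within each pattern, the first cycle also reads the next A set, the second cycle reads the next B set, and the last cycle writes the current C set. The sketch assumes two operands per calculation, one calculation per cycle, and at most one memory access per cycle (single-port memory).

```python
def schedule(n_items=16, set_size=4):
    """Return a list of cycles; each cycle is a list of actions (strings)."""
    if set_size < 3:
        # With fewer than 3 cycles per pattern, two reads and one write
        # cannot each get their own cycle under the single-port assumption.
        raise ValueError("this sketch assumes set_size >= 3")

    starts = list(range(0, n_items, set_size))

    def name(prefix, start):
        end = min(start + set_size, n_items) - 1
        return f"{prefix}{prefix}{start}_{end}"

    # Initialization: one read per operand of the first calculation.
    cycles = [[f"read {name('A', 0)}"], [f"read {name('B', 0)}"]]

    for p, start in enumerate(starts):
        size = min(set_size, n_items - start)
        nxt = starts[p + 1] if p + 1 < len(starts) else None
        for k in range(size):
            actions = [f"calc C{start + k}"]
            if nxt is not None and k == 0:
                actions.append(f"read {name('A', nxt)}")
            elif nxt is not None and k == 1:
                actions.append(f"read {name('B', nxt)}")
            elif k == size - 1:
                actions.append(f"write {name('C', start)}")
            cycles.append(actions)
    return cycles


for number, actions in enumerate(schedule()):
    print(f"cycle # {number}: " + " and ".join(actions))
```

With the default arguments the sketch yields 18 cycles, and after the two initialization cycles every memory access shares its cycle with a calculation.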

It is then understood that, with the exception of the initial cycles # 0 and # 1, the memory accesses (reads and writes) are implemented in parallel with the calculations, without consuming any additional calculation cycle. Reading sets of (several) data, or blocks of data, rather than reading a single piece of data, makes it possible to complete the import of the data from the memory 13 into the registers even before said data becomes necessary, as operand, for a calculation.

In the example of cycle # 2 above, if only the immediately necessary data (A0) had been read rather than reading the set AA0_3 = {A0; A1; A2; A3}, then it would have been necessary to implement three additional read operations later to obtain A1, A2 and A3.

To better understand, and by comparison, we reproduce below the implementation of a process in which a single data item is read each time rather than a set of (several) data items. We see that 48 cycles are necessary.

Example 0:

- cycle # 0: reading of A0;

- cycle # 1: reading of B0;

- cycle # 2: calculation of C0 and writing of C0;

- cycle # 3: reading of A1;

- cycle # 4: reading of B1;

- cycle # 5: calculation of C1 and writing of C1;

- ...;

- cycle # 45: reading of A15;

- cycle # 46: reading of B15;

- cycle # 47: calculation of C15 and writing of C15.

In example 1 (18 cycles), the first two cycles # 0 and # 1 constitute initialization cycles. The number I of initialization cycles corresponds to the number of operands per calculation. Then a pattern of four successive cycles is repeated four times. For example, cycles # 2 to # 5 together form a pattern. The number of cycles per pattern corresponds to the number D of data per data set, while the number of patterns corresponds to the number E of data sets to be processed. The total number of cycles can therefore be expressed as I + D × E, that is, 2 + 4 × 4 = 18 here.

Achieving good performance is equivalent to minimizing the total number of cycles. Under the conditions considered, that is to say 16 elementary and independent operations which can each be implemented over a cycle, the number of optimum cycles therefore seems to be equal to that number of elementary operations (16) to which is added the initialization phase (2 cycles), for a total of 18 cycles.
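As a quick check on the counts above, the two schemes can be compared with a minimal Python sketch. The function names are illustrative and do not come from the source. `cycles_single_item` assumes, as in example 0, that each operand read occupies one cycle and that the calculation and the write of its result share one cycle.

```python
def cycles_single_item(n_ops: int, operands_per_op: int = 2) -> int:
    """Example 0: one cycle per operand read, then one cycle to calculate and write."""
    return n_ops * (operands_per_op + 1)


def cycles_with_sets(init_cycles: int, set_size: int, n_sets: int) -> int:
    """Example 1: I initialization cycles plus D cycles for each of the E data sets."""
    return init_cycles + set_size * n_sets


print(cycles_single_item(16))      # 48
print(cycles_with_sets(2, 4, 4))   # 18
```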

In a variant, it is considered that the number of data accessible (in reading or writing) in a single cycle (the number D of data per data set) is equal to three (and no longer four), for example because of hardware limitations. Then the succession of cycles can, for example, be broken down as follows:

- an initialization phase of 2 cycles; then

- 5 patterns of 3 cycles for a total of 15 elementary calculations out of the 16 to be performed; then

- a final cycle to calculate and save the result of the last elementary calculation.

Example 2:

- cycle # 0: reading of AA0_2 = {A0; A1; A2};

- cycle # 1: reading of BB0_2 = {B0; B1; B2};

- cycle # 2: calculation of C0 (of set CC0_2 = {C0; C1; C2}) and reading of AA3_5 (forming for example a cycle i);

- cycle # 3: calculation of C1 (of set CC0_2) and reading of BB3_5 (forming for example a cycle i);

- cycle # 4: calculation of C2 (of set CC0_2) and writing of CC0_2;

- cycle # 5: calculation of C3 (of set CC3_5) and reading of AA6_8 (forming for example a cycle ii);

- cycle # 6: calculation of C4 (of set CC3_5) and reading of BB6_8 (forming for example a cycle ii);

- cycle # 7: calculation of C5 (of set CC3_5) and writing of CC3_5 (forming for example a cycle ii);

- cycle # 8: calculation of C6 (of set CC6_8) and reading of AA9_11;

- cycle # 9: calculation of C7 (of set CC6_8) and reading of BB9_11;

- cycle # 10: calculation of C8 (of set CC6_8) and writing of CC6_8;

- cycle # 11: calculation of C9 (of set CC9_11) and reading of AA12_14;

- cycle # 12: calculation of C10 (of set CC9_11) and reading of BB12_14;

- cycle # 13: calculation of C11 (of set CC9_11) and writing of CC9_11;

- cycle # 14: calculation of C12 (of set CC12_14) and reading of A15 (forming for example a cycle i);

- cycle # 15: calculation of C13 (of set CC12_14) and reading of B15 (forming for example a cycle i);

- cycle # 16: calculation of C14 (of set CC12_14) and writing of CC12_14;

- cycle # 17: calculation of C15 (isolated data) and writing of C15 (forming for example a cycle ii).

In example 2, it can be seen that each cycle includes a memory access operation (read or write). It is therefore understood that, if the number D of data accessible in a single cycle is strictly less than three, then additional cycles will be necessary to perform memory accesses. The optimum of 18 cycles for 16 elementary operations will therefore no longer be reached. However, even if the optimum is not reached, the number of cycles remains significantly lower than the number of cycles required in example 0. An embodiment in which each data set comprises two data items therefore still represents an improvement over existing approaches.
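The effect of the set size D on the total cycle count can be explored with a small simulation. The sketch below is only a model built on assumptions that are not stated as such in the source: one calculation and at most one memory access (a read of one set of A, a read of one set of B, or a write of one set of C) per cycle, a calculation using only operands loaded in an earlier cycle, and a simple greedy policy that gives priority to writing a completed set. The function name `count_cycles` is illustrative.

```python
def count_cycles(n_ops: int, set_size: int) -> int:
    """Count cycles for n_ops independent calculations C[x] = A[x] op B[x]."""
    loaded_a = loaded_b = computed = written = 0
    cycles = 0
    while written < n_ops:
        # Calculation slot: only operands loaded in earlier cycles are usable.
        if computed < min(loaded_a, loaded_b):
            computed += 1
        # Results that already form a complete set (or the final partial set).
        if computed == n_ops:
            writable = n_ops
        else:
            writable = (computed // set_size) * set_size
        # Memory slot: at most one access per cycle (greedy policy, an assumption).
        if written < writable:
            written = min(written + set_size, writable)
        elif loaded_a < n_ops and loaded_a <= loaded_b:
            loaded_a = min(loaded_a + set_size, n_ops)
        elif loaded_b < n_ops:
            loaded_b = min(loaded_b + set_size, n_ops)
        cycles += 1
    return cycles


for d in (1, 2, 3, 4):
    print(d, count_cycles(16, d))
```

Under these assumptions the model gives 48 cycles for D = 1 (as in example 0) and 18 cycles for D = 3 and D = 4 (as in examples 2 and 1). For D = 2 it gives a count between these two values, in line with the remark that sets of two data items are not optimal but still an improvement. With D = 4 the greedy policy reads slightly further ahead than the schedule listed in example 1, so only the totals are comparable, not the individual cycles.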

In example 1, if cycles # 2 and/or # 3 correspond for example to a cycle i as defined above, then each of cycles # 6, # 7, # 8 and # 9 corresponds to a cycle ii. Of course, this can be transposed from pattern to pattern. In example 2, if cycles # 2 and/or # 3 correspond for example to a cycle i as defined above, then each of cycles # 5, # 6 and # 7 corresponds to a cycle ii. Of course, this can likewise be transposed from pattern to pattern.

In the examples described so far, in particular examples 1 and 2, the small total number of cycles is achieved in particular because as many memory access operations as possible are implemented per set of (several) data rather than individually, and in parallel with calculation operations. Thus, for certain parts of the process (for all the parts in the optimized examples), the reading of all the necessary operands can be completed even before the preceding elementary calculation operation is completed. Preferably, enough computing capacity is retained to perform a calculation and record (write operation) the result of said calculation in a common calculation cycle (cycle # 5 of example 1 for example).

In the examples, reading the operand data in advance is implemented throughout the process (repeated from one pattern to another). The operands necessary for the calculations carried out during a pattern are systematically obtained (read) during the chronologically previous pattern. It will be noted that, in degraded embodiments, the reading in advance is implemented only partially (for two successive patterns only). Such a degraded mode, compared to the examples above, still gives better results than the existing methods.

In the examples described so far, it has been assumed that the data is read before serving as operands. In embodiments, the data read in advance are read randomly, or at least independently of the calculations to be performed in the future. Thus, at least some of the data read in advance from the data sets effectively correspond to operands for subsequent calculations while other data read are not operands for subsequent calculations. For example, at least some of the data read can be subsequently erased from the registers 11 without having been used by the ALUs 9, typically overwritten by other data subsequently recorded in the registers 11. Certain data are therefore read unnecessarily (and recorded unnecessarily on the registers 11). However, it is sufficient that at least some of the data among the data sets read are actually operands for a saving in calculation cycles to occur, and therefore for the situation to be improved compared to the existing one. Also, depending on the number of data to be processed and the number of cycles, it is likely (in the mathematical sense of the term) that at least some of the pre-read data can actually be used as an operand in a calculation performed by an ALU 9 in a following cycle.

In embodiments, the data read in advance are preselected, and depend on the calculations to be performed. This improves the relevance of the pre-read data. Indeed, in the examples with 16 elementary calculations above, each of the 16 elementary calculations requires as input a pair of operands, respectively A0 and B0; A1 and B1; ...; A15 and B15. If the data is read randomly, then the first two cycles could correspond to the reading of AA0_3 and BB4_7. In such a case, no complete pair of operands is available on the registers 11 at the end of the first two cycles. Consequently, ALUs 9 cannot carry out any elementary calculation in the following cycle. One or more additional cycles would therefore necessarily be consumed for memory access before the elementary computations could begin, which increases the total number of cycles and is therefore detrimental for efficiency. Relying on chance and probability so that the data obtained in reading is as relevant as possible is enough to improve what already exists, but is not fully satisfactory. The situation can still be improved.

The implementation of a prefetch algorithm makes it possible to obtain, as early as possible, all the operands of the next calculation to be performed. In the example above, reading AA0_3 and BB0_3 during the first two cycles makes available, on the registers 11, all the operands necessary for the implementation of the first four elementary calculations.
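The difference between uncoordinated reads and reads chosen by a prefetch algorithm can be illustrated with a minimal Python sketch. The function name `ready_calculations` is hypothetical, and the registers are modeled simply as two sets of operand indices.

```python
def ready_calculations(a_loaded: set, b_loaded: set) -> list:
    """Indices X for which both AX and BX are already on the registers."""
    return sorted(a_loaded & b_loaded)


# Uncoordinated reads: AA0_3 then BB4_7, so no complete operand pair is available.
print(ready_calculations({0, 1, 2, 3}, {4, 5, 6, 7}))  # []

# Reads chosen from the upcoming calculations: AA0_3 then BB0_3.
print(ready_calculations({0, 1, 2, 3}, {0, 1, 2, 3}))  # [0, 1, 2, 3]
```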

Such an algorithm receives as input parameters information relating to the calculations to be carried out subsequently by the ALUs 9, and in particular relating to the necessary operands. At its output, such an algorithm selects the data to be read (per set) in anticipation of the future calculations to be performed. Such an algorithm is, for example, implemented by the control unit 5 when controlling the memory accesses.

According to a first approach, the algorithm imposes an organization of the data as soon as they are recorded in the memory 13. For example, the data which are intended to form a data set together are juxtaposed and/or ordered so that the whole data set can be called by a single request. For example, if the addresses of the data A0, A1, A2 and A3 are referenced respectively @A0, @A1, @A2 and @A3, then the memory interface 15 can be configured to read automatically, in response to a read request on @A0, the data at the following three addresses @A1, @A2 and @A3.

According to a second approach, the prefetch algorithm provides at its output memory access requests adapted as a function of the calculations to be performed subsequently by the ALUs 9, and in particular of the necessary operands. In the previous examples, the algorithm identifies for example that the data to be read in priority are those of AA0_3 and BB0_3 to make possible, from the next cycle, the elementary calculations resulting in CC0_3, that is to say the calculation of C0 with the operands A0 and B0, the calculation of C1 with the operands A1 and B1, the calculation of C2 with the operands A2 and B2 and the calculation of C3 with the operands A3 and B3. The algorithm therefore provides, at its output, memory access requests constructed to trigger the reading of AA0_3 and BB0_3.

The two approaches can, optionally, be combined with each other: the algorithm identifies the data to be read and the control unit 5 deduces therefrom memory access requests addressed to the memory interface 15 to obtain said data, the requests being adapted as a function of the characteristics (structure and protocol) of the memory interface 15.
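The combination of the two approaches can be sketched as follows in Python. Everything named here is an assumption made for illustration: the class `MemoryInterface`, the function `prefetch_requests`, and the memory layout (A0 to A15 at addresses 0 to 15, B0 to B15 at addresses 16 to 31). The sketch only shows the idea that one request on the first address of a set returns the whole set, and that requests are derived from the next calculation to be performed.

```python
class MemoryInterface:
    """Toy memory interface: one request returns a whole set of adjacent data."""

    def __init__(self, memory: list, set_size: int = 4):
        self.memory = memory
        self.set_size = set_size

    def read_set(self, base_address: int) -> list:
        # A request on @A0 also returns the data at @A1, @A2 and @A3.
        return self.memory[base_address:base_address + self.set_size]


def prefetch_requests(next_op: int, addr_a: int, addr_b: int, set_size: int = 4) -> list:
    """Base addresses to request so that both operands of `next_op` get loaded."""
    first = (next_op // set_size) * set_size  # start of the set containing next_op
    return [addr_a + first, addr_b + first]


# Assumed layout: A0..A15 at addresses 0..15, B0..B15 at addresses 16..31.
memory = list(range(100, 116)) + list(range(200, 216))
interface = MemoryInterface(memory)
requests = prefetch_requests(next_op=0, addr_a=0, addr_b=16)
registers = [interface.read_set(base) for base in requests]
print(requests)    # [0, 16]
print(registers)   # [[100, 101, 102, 103], [200, 201, 202, 203]]
```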

In the previous examples, in particular examples 1 and 2 above, the number of ALUs assigned to elementary calculations is not defined. A single ALU 9 can perform all of the elementary calculations, cycle by cycle. The elementary calculations to be performed can also be distributed over a plurality of ALUs 9 of a PU, for example four. In such cases, coordinating the distribution of the calculations on the ALUs with the manner of grouping the data to be read in each read operation can make it possible to further improve efficiency. Two approaches stand out.

In a first approach, the data read in one operation form operands in calculations implemented by a single ALU 9. For example, the groups AA0_3 and BB0_3 of data A0, A1, A2, A3, B0, B1, B2 and B3 are read first and a first ALU is responsible for calculating CC0_3 (C0, C1, C2 and C3). The groups AA4_7 (A4, A5, A6 and A7) and BB4_7 (B4, B5, B6 and B7) are then read and a second ALU is responsible for calculating CC4_7 (C4, C5, C6 and C7). It is then understood that the first ALU will be able to start implementing its calculations before the second ALU can do the same, because the operands necessary for the calculations of the first ALU will be available on the registers 11 before those necessary for the calculations of the second ALU. The ALUs 9 of a PU then operate in parallel and asynchronously.

In a second approach, the data read in one operation form operands in calculations each implemented by different ALUs 9, for example four. For example, two groups of data including respectively A0, A4, A8 and A12, and B0, B4, B8 and B12, are read first. A first ALU is responsible for calculating C0, a second ALU is responsible for calculating C4, a third ALU is responsible for calculating C8 and a fourth ALU is responsible for calculating C12. It will then be understood that the four ALUs will be able to start implementing their respective calculations substantially simultaneously, because the necessary operands will become available on the registers 11 at the same time, having been loaded in a common operation. The ALUs 9 of a PU then operate in parallel and synchronously.

Depending on the types of calculations to be performed, the accessibility of the data in memory and the resources available, one or the other of the two approaches may be preferred. The two approaches can also be combined: the ALUs can be organized into subgroups, the ALUs of a subgroup operating synchronously and the subgroups operating asynchronously with respect to each other.

To impose a synchronized, asynchronous or mixed operation of the ALUs, the grouping of the data to be read by read operation must be selected in correspondence with the distribution of the assignments of the calculation operations to various ALUs.
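The correspondence between the grouping of the data read and the assignment of calculations to ALUs can be illustrated with a short Python sketch. The helper names are hypothetical. The sketch assumes 16 calculations, four ALUs, each ALU being responsible for four consecutive indices (ALU 0 for C0 to C3, ALU 1 for C4 to C7, and so on), and a number of calculations divisible by the number of ALUs.

```python
def group_by_rows(n_ops: int, set_size: int) -> list:
    """First approach: each read holds consecutive indices, e.g. 0, 1, 2, 3."""
    return [list(range(s, min(s + set_size, n_ops))) for s in range(0, n_ops, set_size)]


def group_by_columns(n_ops: int, n_alus: int) -> list:
    """Second approach: each read holds one index per ALU, e.g. 0, 4, 8, 12."""
    per_alu = n_ops // n_alus
    return [list(range(k, n_ops, per_alu)) for k in range(per_alu)]


def alus_served(group: list, ops_per_alu: int) -> list:
    """ALUs that receive at least one operand from a given read."""
    return sorted({x // ops_per_alu for x in group})


rows = group_by_rows(16, 4)
columns = group_by_columns(16, 4)
print(rows[0], alus_served(rows[0], 4))        # [0, 1, 2, 3] [0]
print(columns[0], alus_served(columns[0], 4))  # [0, 4, 8, 12] [0, 1, 2, 3]
```

With the first grouping, a single read feeds one ALU, so the ALUs start one after the other (asynchronous operation). With the second, a single read feeds all four ALUs, which can start together (synchronized operation).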

In the previous examples, the elementary calculations are independent of each other. The order in which they are carried out therefore does not a priori matter. In applications for which at least some of the calculations are dependent on each other, the scheduling of the calculations can be specific. Such a situation typically arises in the context of recursive computations. In such cases, the algorithm can be configured to identify the data to be acquired (read) in priority. For example, if:

- the result C1 is obtained by a calculation of which one of the operands is C0, C0 being itself obtained from the operands A0 and B0,

- the result C5 is obtained by a calculation of which one of the operands is C4, C4 itself being obtained from the operands A4 and B4,

- the result C9 is obtained by a calculation of which one of the operands is C8, C8 being itself obtained from the operands A8 and B8, and

- the result C13 is obtained by a calculation of which one of the operands is C12, C12 itself being obtained from the operands A12 and B12,

then the algorithm can be configured to read, during the first two cycles # 0 and # 1 of initialization, the data sets defined as follows:

- {A0; A4; A8; A12}, and

- {B0; B4; B8; B12}.

The data set thus defined is represented in FIG. 4. Pictorially, it can be said that the data are grouped "in rows" in the embodiment represented in FIG. 3 and grouped "in columns" in the embodiment represented in FIG. 4. Thus, the implementation of the algorithm makes it possible to read, and make available on the registers 11, the operands useful for the priority elementary calculations. In other words, the implementation of the algorithm makes it possible to increase the short-term relevance of the data read compared to a random reading.

The examples of processing units and methods described above are given only by way of example and should not be considered as limiting; other variants may be envisaged by those skilled in the art within the scope of the protection sought. The examples can also take the form of:
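A minimal Python sketch of how such priority data might be identified from the dependencies is given below. The function name and the representation of dependencies as a dictionary are assumptions of this sketch and are not taken from the source. A result is treated as a priority when another result depends on it and it does not itself depend on another result.

```python
def priority_operand_sets(depends_on: dict) -> tuple:
    """Return the A and B operands to read first, given result dependencies.

    depends_on maps the index of a result to the index of the result it needs,
    e.g. {1: 0} means that C1 uses C0 as an operand.
    """
    roots = sorted(set(depends_on.values()) - set(depends_on))
    return [f"A{x}" for x in roots], [f"B{x}" for x in roots]


# Dependencies from the text: C1 needs C0, C5 needs C4, C9 needs C8, C13 needs C12.
depends_on = {1: 0, 5: 4, 9: 8, 13: 12}
print(priority_operand_sets(depends_on))
# (['A0', 'A4', 'A8', 'A12'], ['B0', 'B4', 'B8', 'B12'])
```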

- a set of machine instructions that can be implemented in a processor to obtain such a computing device,

- a processor or a set of processors,

- the implementation of such a set of machine instructions on a processor,

- the processor architecture management method implemented by the processor,

- the computer program including the corresponding machine instruction set, as well as

- the recording medium on which such a set of machine instructions is recorded by computer.

Reference is now made to FIG. 5, which shows an example of the operating architecture of a device 1 in which the memory access and address processing operations are treated separately from the elementary calculation operations. Such an architecture can take the form of a computer process. It can optionally be combined with the embodiments described above. The numerical references common with those of the preceding figures designate similar elements, in particular a control unit 5, an ALU 9, registers 11, a memory 13 and a memory interface 15, or "bus".

In order to facilitate understanding, the same naming conventions are used: we consider an elementary operation, for example an addition, in which AX and BX are data forming operands to obtain a data item forming the result CX, with X an integer between 0 and N, N + 1 being the number of elementary operations to be carried out during a processing operation. The set of N + 1 operations forms the data processing as a whole. In addition, the memory addresses of each of the data are referenced by their name preceded by the character "@" (at sign). For example, the address of the data A0 is denoted "@A0".

For each addition (each value of X), a set of instructions can be implemented by the computer device 1. An example of such a set of instructions is given at the end of the description in the form of computer pseudocode. Usually, such instructions are applied one after the other during a common process implemented by an ALU 9. In the embodiments below, the instructions relating to memory accesses and the instructions relating to elementary calculation operations are processed by processes separate from each other.

In an embodiment of a computer method in accordance with FIG. 5, the method can be broken down into steps referenced respectively 101 to 109.

During steps 101 and 102, the memory addresses @A0 to @AN and @B0 to @BN, respectively, of each of the data forming operands for at least one of the elementary operations to be performed are obtained. By "obtained" is meant that at the end of operations 101 and 102, one or more local memory components store the addresses of all the data forming operands. Such memory accesses are for example triggered by the reception of instructions from the control unit 5. In certain cases, at least some of said addresses are already stored in the local memory components. No memory access is then necessary at this stage to obtain the addresses already present in the local memory components.

In the example described here, a distinction is made between step 101 concerning the first operands "A" of the addition and step 102 concerning the second operands "B" of the addition. Distinguishing the two operands then makes it possible to implement iterative loops (in the computer sense) specific to each of the two operands, and possibly different from each other.

Alternatively, in particular when the two operands preexist at the start of the process, steps 101 and 102 can be implemented at least partially in parallel with one another, independently of one another.

During steps 103 and 104, each of said data obtained, respectively A0 to AN and B0 to BN, is read from memory 13, for loading into registers 11, via the memory interface 15. Such readings are made possible by addresses obtained in steps 101 and 102. In the example described here, step 103 concerns the first operands "A" while step 104 concerns the second operands "B".

During step 105, a calculation execution instruction is transmitted from the control unit 5 to an ALU 9. The execution instruction is constructed so as to trigger the implementation of the elementary calculations of the processing by the ALU 9. The instruction here is devoid of any addressing instruction. By this it is meant that, contrary to what is usually done, the instruction for carrying out calculations transmitted by the control unit 5 is not included in a general instruction set combining both addressing instructions and instructions for performing calculations. Thus, on receipt of the instructions, the ALU 9 is able to apply them immediately by carrying out the elementary calculations, without first having to apply instructions for configuring the memory interface 15, and therefore without having to verify any mutual dependence between the various instructions received. Figuratively, the ALU 9 then behaves like a computing resource implementing the calculations (step 106 described below) independently of any complexity of interdependence between the different instructions.

By conditioning the transmission of the calculation instruction (step 105) on the prior execution of the addressing operations (steps 103 and 104), the availability of the data forming operands on the registers 11 is ensured. In practice, the registers 11 behave like first-in-first-out (FIFO) buffers: they fill and empty in the order of arrival of the data, here the operands AN (step 103) and BN (step 104). Step 106 is executed if the registers 11 are not empty: the operands AN and BN are then popped from the registers 11. As a variant, the registers 11 do not operate in FIFO mode. In this case, data can be stored there more permanently without the risk of being erased and can be reused later if necessary.

During step 106, upon receipt of the instruction for executing calculations, the ALU 9 executes all of the corresponding elementary operations, as soon as the operands are available in the registers 11. Step 106 therefore includes the reception of the ALU 9 of each of the operands from the registers 11 as inputs. Provided that the addressing operations 103, 104 have been previously and correctly carried out, step 106 can be devoid of memory access (read).
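The FIFO behaviour of the registers and the gating of step 106 on operand availability can be sketched in Python with `collections.deque`. The function names are illustrative, and the addition stands in for any elementary operation. Note that the calculation function handles no address at all: it only consumes what the loading steps have placed in the registers.

```python
from collections import deque


def load_operands(fifo: deque, values: list) -> None:
    """Steps 103 and 104: operands enter the registers in arrival order."""
    fifo.extend(values)


def execute(fifo_a: deque, fifo_b: deque) -> list:
    """Step 106: runs only while both FIFOs hold an operand; no address is used."""
    results = []
    while fifo_a and fifo_b:
        results.append(fifo_a.popleft() + fifo_b.popleft())
    return results


fifo_a, fifo_b = deque(), deque()
load_operands(fifo_a, [1, 2, 3])
load_operands(fifo_b, [10, 20])   # the third B operand has not arrived yet
print(execute(fifo_a, fifo_b))    # [11, 22]
load_operands(fifo_b, [30])
print(execute(fifo_a, fifo_b))    # [33]
```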

During step 107, the data forming results of the processing are stored on the registers 11 at the outputs of the ALU 9. Here we only mention the results of the processing and not the results of each of the elementary calculations. Indeed, in the case where some of the results of the elementary calculations are used as operands for other elementary calculations during step 107, such (intermediate) results may become useless at the end of the processing. In these cases, the intermediate results can be deleted from the registers 11 at the end of step 107 (for example overwritten by other data, FIFO mode). As a variant, all of the data forming results of elementary operations are stored on registers 11 at the end of step 107 (mode different from FIFO).

During step 108, a memory address @CX is obtained for each of the data CX forming the result of the processing. Such an addressing operation makes it possible to determine the memory location at which each of the data forming the result will be stored. In the example described here, step 108 is implemented after step 107 of recording the results in the registers 11 and before writing the results to memory 13 (step 109 described below). As a variant, step 108 can be implemented earlier during the process, in particular before step 106. In fact, especially when the shape (for example the size) of the result data is known in advance, it is possible to address the result data even before they are calculated. Obtaining the memory addresses @CX may include the transmission of addressing instructions from the control unit 5.

During step 109, each of the data forming the result of the processing is written into memory 13 from registers 11, for storage and via the memory interface 15, by means of the memory addresses obtained in step 108.
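Steps 101 to 109 can be summarized by the following Python sketch. It is only an illustration under stated assumptions: the memory 13 is modeled as a dictionary, the registers 11 as lists, the elementary operation is an addition, and the operands and results are assumed to be stored at consecutive addresses starting at `addr_a`, `addr_b` and `addr_c` (names chosen for this sketch).

```python
def process(memory: dict, addr_a: int, addr_b: int, addr_c: int, n: int) -> None:
    # Steps 101 and 102: obtain the addresses of the operands.
    addrs_a = [addr_a + x for x in range(n)]
    addrs_b = [addr_b + x for x in range(n)]

    # Steps 103 and 104: read the operands from memory into the registers.
    regs_a = [memory[addr] for addr in addrs_a]
    regs_b = [memory[addr] for addr in addrs_b]

    # Steps 105 and 106: the calculation works on registers only, with no address.
    # Step 107: the results are held in registers.
    regs_c = [a + b for a, b in zip(regs_a, regs_b)]

    # Step 108: obtain the addresses of the results.
    addrs_c = [addr_c + x for x in range(n)]

    # Step 109: write the results from the registers to memory.
    for addr, value in zip(addrs_c, regs_c):
        memory[addr] = value


memory = {}
for x in range(10):
    memory[x] = x              # A0..A9 at addresses 0..9
    memory[100 + x] = 10 * x   # B0..B9 at addresses 100..109
process(memory, addr_a=0, addr_b=100, addr_c=200, n=10)
print([memory[200 + x] for x in range(10)])
```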

In FIG. 6, examples of the implementation of steps 101, 102, 106 and 108 are given in the form of a computer pseudocode. Such non-limiting examples represent operations in the form of computer loops. The use of loops is particularly advantageous for limiting the number of calculation cycles necessary, and therefore for improving efficiency, when the operations to be implemented are substantially analogous to each other (for example all additions) and that only the input data varies. In such cases, the calculation instructions transmitted in step 105 may take the form of a repetitive loop for each operation.

The chronological ordering of steps 101, 102, 106 and 108 is represented by the arrow "t" in FIG. 6. Such an ordering constitutes a nonlimiting example. FIG. 6 illustrates that, contrary to usual practice in the technical field, the addressing of the input data (operands) is implemented separately from the calculations themselves. In other words, addressing and calculation are treated as two processes separate from each other rather than being treated indiscriminately, on the fly, upon receipt of general instructions. In particular, step 106 of executing the elementary operations can start even though not all the operands have yet been loaded from the memory 13 into the registers 11. Typically, the first cycles of step 106 can start as soon as the first corresponding operands are available on the registers 11 for the calculations. This cascading implementation of operations gives the system its asynchronous nature.
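The cascade described above can be imitated in software with two threads and a bounded queue standing in for the registers 11. This is only an analogy under assumptions of this sketch: the function names, the end-of-data marker and the queue size are not taken from the source, and Python threads do not model hardware timing. The point illustrated is that the calculation process starts consuming operands as soon as the first ones arrive, without handling any address itself.

```python
import queue
import threading


def address_process(memory, addr_a, addr_b, n, registers):
    """Steps 101 to 104: push operand pairs into the registers as they are read."""
    for x in range(n):
        registers.put((memory[addr_a + x], memory[addr_b + x]))
    registers.put(None)  # end-of-data marker (an assumption of this sketch)


def calculation_process(registers, results):
    """Step 106: starts as soon as the first operands arrive; handles no address."""
    while (pair := registers.get()) is not None:
        results.append(pair[0] + pair[1])


# Assumed layout: A0..A9 at addresses 0..9, B0..B9 at addresses 10..19.
memory = list(range(10)) + list(range(100, 110))
registers = queue.Queue(maxsize=4)  # small buffer, so both processes overlap
results = []
loader = threading.Thread(target=address_process, args=(memory, 0, 10, 10, registers))
worker = threading.Thread(target=calculation_process, args=(registers, results))
loader.start()
worker.start()
loader.join()
worker.join()
print(results)
```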

In embodiments, the ALU 9 performs (step 106) all of the elementary calculations of the processing during consecutive calculation cycles. No memory access is made by the ALU 9 during these calculation cycles. Thus, the implementation of the calculations can be particularly rapid. The ALU 9, during such cycles, is free of any memory access operation. In addition, from the point of view of the ALU 9 performing the calculations, obtaining the operands is similar to calling the memory 13, but it is faster and independent of the memory interface 15 because the operands are in practice read directly from the registers 11. The memory accesses are implemented by another ALU (separate from the one performing the calculations). At least during a process, each ALU 9 has a fixed function: either the implementation of calculations, or the implementation of memory accesses. This fixed function assignment to each ALU 9 can be modified at the end of the implementation of the process to give flexibility to the computing device. However, this often involves adapting the addressing path accordingly. In the preferred embodiments, the function of each ALU 9 is therefore frozen from one process to another: the ALUs are specialized.

The methods and variants described above can take the form of a computer device, including a processor or a set of processors, arranged to implement such a method. The invention is not limited to the examples of methods and devices described above, only by way of example, but it encompasses all the variants that a person skilled in the art may envisage within the framework of the protection sought. The invention also relates to a set of machine instructions which can be implemented in a processor for obtaining such a computing device, such as a processor or a set of processors, the implementation of such a set of machine instructions on a processor, the processor architecture management method implemented by the processor, the computer program comprising the corresponding set of machine instructions, as well as the recording medium on which such a set of machine instructions is recorded by computer.

On the following pages of the description of the original document, examples of implementation are given in the form of a computer pseudocode.

Example in the form of a computer pseudocode of a processing operation to be carried out in the form of ten additions:

int A[10], B[10], C[10];
for (i = 0; i < 10; i++)
{
    C[i] = A[i] + B[i];
}

Example in the form of a computer pseudocode of usual instructions for carrying out a processing operation consisting of ten additions:

@A, @B, @C
Addr0 = @A
Addr1 = @B
Addr2 = @C
LOOP:
    Load Addr0 reg0
    Load Addr1 reg1
    reg2 = reg0 + reg1
    Store Addr2 reg2
    Addr0 = Addr0 + 1
    Addr1 = Addr1 + 1
    Addr2 = Addr2 + 1
GOTO LOOP (10x)

Example in the form of a computer pseudocode of instructions for carrying out a processing operation consisting of ten additions according to an embodiment in which the addressing instructions and the calculation instructions are distinguished from each other:

// step 101 //
Addr = @A
LOOP
    Load Addr
    Addr = Addr + 1
GOTO LOOP (10x)

// step 102 //
Addr = @B
LOOP
    Load Addr
    Addr = Addr + 1
GOTO LOOP (10x)

// step 106 //
LOOP
    c = a + b
GOTO LOOP (10x)

// step 108 //
Addr = @C
LOOP
    Load Addr
    Addr = Addr + 1
GOTO LOOP (10x)

Claims

1. Data processing method, decomposable into a set of elementary operations to be carried out, implemented by a computer device (1), said device (1) comprising:

- a control unit (5);

- at least one arithmetic and logical unit (9);

- a set of registers (11) capable of supplying data forming operands to the inputs of said first arithmetic and logic unit (9) and capable of being supplied with data originating from the outputs of said arithmetic and logic unit (9);

- a memory (13);

- a memory interface (15) by means of which data (A0, A15) are transmitted and routed between the registers (11) and the memory (13); said method comprising:

a) obtaining (101, 102) the memory addresses (@A0, @A15) of each of the data absent from the registers (11) and forming an operand for at least one of said elementary operations to be carried out; and

b) reading (103, 104) in memory (13), for loading into the registers (11) via the memory interface (15), each of said data (A0, A15) by means of the memory addresses (@A0, @A15) obtained;

c) transmitting (105) an instruction for executing calculations from the control unit (5) to said first arithmetic and logic unit (9), said instruction being devoid of addressing instructions;

d) on receipt of said instruction for executing calculations, and as soon as the corresponding operands are available in the registers (11), executing all of said elementary operations (106) by said first arithmetic and logic unit (9) receiving as inputs each of the operands from the registers (11);

e) storing (107) the data (C0, C15) forming results of the processing on the registers (11) at the outputs of said first arithmetic and logic unit (9);

f) obtaining (108) a memory address (@C0, @C15) for each of the data forming the result of the processing;

g) writing (109) in memory (13) each of the data (C0, C15) forming the result of the processing from the registers (11), for storage and via the memory interface (15), by means of the memory addresses (@C0, @C15) obtained.

2. Method according to claim 1, in which said first arithmetic and logic unit (9) performs all of the elementary calculations of the processing during consecutive calculation cycles, no memory access being carried out by said first arithmetic and logic unit (9) during said calculation cycles.

3. Method according to one of the preceding claims, in which at least one of the following steps comprises an iterative loop:

d) upon receipt of said instruction for executing calculations, executing all of said elementary operations (106) by said first arithmetic and logic unit (9) receiving as inputs each of the operands from the registers (11);

f) obtaining (108) a memory address (@C0, @C15) for each of the data forming the result of the processing.

4. Method according to one of the preceding claims, wherein the device (1) further comprises at least one additional arithmetic and logic unit distinct from the first arithmetic and logic unit (9) performing all of said elementary operations (106), the additional arithmetic and logic unit implementing:

a) obtaining (101, 102) memory addresses (@A0, @A15) of each of the data absent from the registers (11) and forming an operand for at least one of said elementary operations to be carried out; and

b) reading (103, 104) in memory (13), for loading into the registers (11) via the memory interface (15), of each of said data (A0, A15) by means of the memory addresses (@A0, @A15) obtained.

5. Data processing device (1), said processing being decomposable into a set of elementary operations to be carried out, said device (1) comprising:

- a control unit (5);

- at least a first arithmetic and logic unit (9) among a plurality;

- a set of registers (11) capable of supplying data forming operand to the inputs of said arithmetic and logic units (9, 10) and capable of being supplied with data from the outputs of said arithmetic and logic units (9);

- a memory (13);

- a memory interface (15) by means of which data (A0, A15) are transmitted and routed between the registers (11) and the memory (13); said computer device (1) being configured for:

a) obtaining (101, 102) the memory addresses of each of the data absent from the registers (11) and forming an operand for at least one of said elementary operations to be carried out; and

b) reading (103, 104) in memory (13), for loading into the registers (11) via the memory interface (15), each of said data by means of the memory addresses obtained;

d) on receipt of said instruction for carrying out calculations, and as soon as the operands are available on the registers (11), executing all of said elementary operations (106) by said first arithmetic and logic unit (9) receiving as inputs each of the operands from the registers (11);

e) storing (107) the data forming results of the processing on the registers (11) as outputs of said first arithmetic and logic unit (9);

f) obtaining (108) a memory address for each of the data forming a result of the processing;

g) writing (109) in memory (13) each of the data forming a result of the processing coming from the registers (11), for storage and via the memory interface (15), by means of the memory addresses obtained.

6. Set of machine instructions for implementing the method according to one of claims 1 to 4 when said set of machine instructions is executed by a computer device (1) including at least one processor.

7. Computer program comprising instructions for implementing the method according to one of claims 1 to 4 when this program is executed by a computer device (1) including at least one processor.

8. Non-transitory computer-readable recording medium on which is recorded a program for implementing the method according to one of claims 1 to 4 when this program is executed by a processor.