WO2017207889A1

WO2017207889A1 - Device and method for parallel data processing

Info

Publication number: WO2017207889A1
Application number: PCT/FR2017/051210
Authority: WO
Inventors: Alessandro MARONGIU; Gaël PAUL; Daniele MODICA; Damien PRETET
Original assignee: Plda Group
Priority date: 2016-05-31
Filing date: 2017-05-18
Publication date: 2017-12-07
Also published as: FR3051933A1; FR3051933B1

Abstract

The invention relates to a method for processing data by means of a device (DV2) for parallel data processing comprising a unit (U1) configured to receive, on parallel inputs, a set (W) of binary words (wi). The method comprises a step of reading, by means of the unit (U1), all or some of the binary words of the set (W) of binary words (wi) applied to its parallel inputs, and sending a purge instruction (Nj) for purging certain binary words of the set of binary words, and a step of supplying to the unit (U1) the word or words (a5-a7) of the set of binary words which have not been purged, and one or more binary words (b0-b4) of a new set of binary words (W', W2).

Description

DEVICE AND METHOD FOR PARALLELIZED DATA PROCESSING

The present invention relates to a parallel data processing method, a parallel data processing device, and a method for producing a parallel data processing device.

In microelectronics, "parallelism" or "parallelized data processing" is a technique consisting in using processor architectures capable of processing data simultaneously, under the control of programs specifically designed to provide such treatment. This technique aims to achieve the largest number of operations in the smallest possible time.

Programmable gate array integrated circuits, or FPGAs ("Field Programmable Gate Array"), are particularly suitable for the realization of parallel processor architectures. A blank FPGA circuit is transformed into an operational integrated circuit by configuring the array of logic gates that it comprises. This configuration is first defined in a hardware description language, called HDL ("Hardware Description Language"), or one of its variants (VHDL, Verilog, etc.), to describe the behavior and architecture of the hardware circuit. The circuit description in hardware description language is then compiled to obtain an FPGA circuit configuration. For this purpose, each FPGA circuit manufacturer offers so-called logic synthesis and place-and-route tools which make it possible, from the HDL circuit, to obtain an FPGA configuration, also called a "bitstream", ready to be loaded into the configuration memory of the FPGA to obtain an operational integrated circuit.

Since describing a circuit in a hardware description language is a complex operation reserved for specialists, it is known to describe the processing function executed by the circuit using a high-level abstraction language, for example the C or C++ language, to obtain a so-called functional description. The functional description is then compiled to obtain the HDL circuit, that is, the expression of the function in the hardware description language, which is itself compiled to obtain an FPGA circuit configuration. This method puts FPGA circuits within the reach of programmers having no experience in the use of hardware description languages. The transition from the functional description to the hardware description involves a High Level Synthesis (HLS) compiler that performs this transformation from a hardware architecture model, or target architecture, defining the general characteristics of the HDL circuit. This architecture model presents a certain number of adjustable characteristics, chosen during a step of definition of the hardware constraints that takes place before the HLS compilation, such as bus clock speeds, the number of data buses arranged in parallel, and the size of these buses (number of bits they carry). The target architecture used to implement the HDL circuit is generally not perfectly adapted to the particular needs of the function to be implemented, in particular as regards the number of parallel buses that it comprises relative to the number of data that the function can process simultaneously. Thus, the number of parallel buses is generally a multiple of 2, even if the implemented function does not simultaneously process a number of data that is a multiple of 2. However, the target architecture may have a number of data buses which, if not equal, is greater than the number of data that the function can handle simultaneously, in order to exploit its parallelism as much as possible.
When several data processing functions are implemented in the same circuit, it is the function capable of simultaneously processing the largest number of data that generally determines the degree of parallelism retained in the target architecture.

These structural characteristics of the target architecture must be taken into account at the stage of the functional description of the circuit, using typed variables, that is to say variables of determined size, such as CHAR (1 byte), INT (4 bytes with a 32-bit processor), and DOUBLE (8 bytes) in C or C++, which are adapted to the number of parallelized data. For example, if the target architecture has 8 parallel data buses, the variable "DOUBLE" will be used in the functional description of the circuit. This results in a kind of "contamination" of the functional description by hardware constraints, related to an inadequacy of the target architecture relative to the real needs of the function and to the use of typed variables. This "contamination" complicates the expression of the function in a high-level abstraction language. It also complicates the process of production and development of an FPGA circuit. As an FPGA circuit is reconfigurable at will, it is usual to rework the target architecture after a period of evaluation of the circuit, returning to the stage of definition of the hardware constraints. The new definition of hardware constraints may result, in particular, in an increase in the number of parallel buses, in order to solve cases of data congestion identified during the evaluation period. In this case, the functional description of the circuit must be rewritten in order to take into account the modifications made to the hardware architecture that implements it.

There are also functions configured to simultaneously process a number of data that varies over time. This variability may depend on an external setpoint or an internal parameter which is a function of the result of previous processing steps (auto-adaptive functions called "data dependent behavior"). In this case, a change in the target architecture results in even more changes in the expression of the function in the high-level abstraction language.

It may therefore be desirable to provide a method and a parallel data processing device which simplifies the process of designing, producing and optimizing an FPGA circuit.

It may also be desired to provide a method and a parallel data processing device which simplifies the processing of a number of data different from the number of data conveyed by a set of parallel buses.

Embodiments of the invention provide a parallel data processing device, comprising: a first unit including parallel outputs for providing binary words in parallel, and a second unit including parallel inputs for receiving a set of binary words provided in parallel by the first unit, wherein the second unit is configured to read simultaneously, during a same reading step, all or part of the binary words of the set of binary words applied to its parallel inputs, and to return a purge instruction for certain binary words of the set of binary words. The device comprises a data matching unit connecting the parallel outputs of the first unit to the parallel inputs of the second unit and configured to provide the second unit with a first set of binary words provided by the first unit, and, after reading by the second unit of a part of the binary words of the first set of binary words, to supply to the second unit the word or words of the first set of binary words that have not been purged, and one or more binary words of a second set of binary words provided by the first unit.

According to one embodiment, the data matching unit is also configured to, when part of the second set of binary words has been purged, supply the second unit with the one or more binary words of the second set of binary words that have not been purged and one or more binary words of a third set of binary words provided by the first unit.

According to one embodiment, the second unit is configured, when reading a part of the binary words applied to its parallel inputs, to read binary words present on inputs with a higher read priority, and the data matching unit is configured to, after purging by the second unit of a portion of the binary words of the first set of binary words, apply to the higher read priority inputs of the second unit words of the first set of binary words previously applied to inputs of lower read priority and not purged, and apply to parallel inputs with lower read priority of the second unit one or more binary words of the second set of binary words.

According to one embodiment, the data matching unit comprises at least two input registers arranged in parallel and each having parallel inputs connected to the outputs of the first unit, an output register having outputs connected to the parallel inputs of the second unit, a data reorganization circuit connecting outputs of the input registers to parallel inputs of the output register, and a control circuit configured to: load into the input registers sets of binary words provided by the first unit, and load into the output register, via the data reorganization circuit, binary words present in the input registers, controlling the data reorganization circuit as a function of the number of binary words that have been purged, so as to load into cells of the output register with higher read priority the binary words present in the input registers and not having been purged, and load into cells of the output register with lower read priority binary words present in the input registers and not yet presented to the second unit.

According to one embodiment, the control circuit is configured to load into one of the input registers a new set of binary words provided by the first unit at least when all the binary words of the set of binary words present in the input register have been purged.

According to one embodiment, the second unit is configured to determine the number of binary words that it reads during the same reading step as a function of the previously read binary words or as a function of an external setpoint.

According to one embodiment, the first unit is a processor, a serial / parallel communication interface circuit, a circuit for matching a number of buses arranged in parallel, a clock frequency modification circuit, or a combination of these elements.

Embodiments also relate to a programmable gate array integrated circuit, comprising a device as defined above.

Embodiments also relate to a data processing method by means of a parallel data processing device comprising a first unit comprising parallel outputs for providing binary words in parallel, and a second unit comprising parallel inputs for receiving a set of binary words provided in parallel by the first unit, the method comprising the steps of: by means of the second unit, reading simultaneously, during a reading step, all or part of the binary words of the set of binary words applied to its parallel inputs, and returning, after a reading step, a purge instruction for certain binary words of the set of binary words, and after reading by the second unit of all or part of the binary words of the first set of binary words, supplying to the second unit the word or words of the first set of binary words that have not been purged, and one or more binary words of a second set of binary words provided by the first unit.

According to one embodiment, the method comprises the step of, when a part of the second set of binary words has been purged, supplying to the second unit the one or more binary words of the second set of binary words that have not been purged and one or more binary words of a third set of binary words provided by the first unit.

According to one embodiment, the method comprises the steps of: when reading all or part of the binary words applied to the parallel inputs of the second unit, reading binary words present on inputs having a higher read priority, and after purging a portion of the binary words of the first set of binary words, applying to the higher read priority inputs of the second unit words of the first set of binary words previously applied to inputs of lower read priority and not purged, and applying to parallel inputs with lower read priority of the second unit one or more binary words of the second set of binary words.

According to one embodiment, the method comprises the step of providing, between the first and second units, a data matching unit connecting the parallel outputs of the first unit to the parallel inputs of the second unit and comprising at least two input registers arranged in parallel and each having parallel inputs connected to the outputs of the first unit, an output register having outputs connected to the parallel inputs of the second unit, and a data reorganization circuit connecting outputs of the input registers to parallel inputs of the output register.

According to one embodiment, the method comprises a step of loading into one of the input registers a new set of binary words provided by the first unit, at least when all the binary words of the set of binary words present in the input register have been purged.

Embodiments also relate to a method for producing, in a programmable gate array integrated circuit, a device executing a parallel data processing function requiring steps of simultaneous reading of binary words in a set of binary words carried by parallel data buses, the method comprising: a step of functionally describing the processing function using a high-level abstraction language and using an untyped variable of unspecified size to designate the binary words to be read simultaneously, a step of defining hardware constraints to be taken into account in the production of the device and including a choice of a degree of parallelism of the device, from the functional description and the hardware constraints, a step of hardware description of the device having the degree of parallelism chosen, using as a target architecture a device as described above, wherein the second unit executes the data processing function, and from the hardware description of the device, a step of configuring the programmable gate array integrated circuit, to obtain the device with the degree of parallelism chosen.

According to one embodiment, the method comprises the steps of: testing the device, reworking the definition of the hardware constraints of the device so as to modify its degree of parallelism, without modifying the functional description of the processing function, reworking the hardware description of the device from the reworked definition of the hardware constraints, and from the reworked hardware description of the device, reconfiguring the programmable gate array integrated circuit, to obtain the device having the modified degree of parallelism.

Embodiments of the invention will be described in the following, without limitation, in relation to the appended drawings among which:

FIG. 1 represents an example of a conventional device for parallel processing of data,

FIG. 2 represents an embodiment of a parallel data processing device according to the invention,

FIG. 3 shows an embodiment of a mapping unit represented in block form in FIG. 2,

FIG. 4 is a flowchart illustrating the operation of the mapping unit,

FIG. 5 shows a data matching sequence executed by the mapping unit,

FIG. 6 shows steps of a method for producing an FPGA integrated circuit according to the invention,

FIGS. 7A, 7B, 7C illustrate the method of FIG. 6 and show various embodiments of a parallel data processing device according to the invention having different degrees of parallelism,

FIG. 8 shows another embodiment of a parallel data processing device according to the invention, and

FIG. 9 shows another embodiment of a parallel data processing device according to the invention.

An example of a conventional device DV1 for parallel processing of data is shown in FIG. 1. The device comprises a unit Ua ensuring the supply of parallel data, a unit Ub providing the parallel processing of the data provided by the unit Ua, and data buses B0 to BI-1 connecting the two units. The unit Ua comprises outputs S0 to SI-1 each providing a binary word w0 to wI-1 of a set of binary words W, and the unit Ub comprises inputs E0 to EI-1 each receiving a binary word w0 to wI-1 of the set of binary words, each output Si of rank i of the unit Ua being connected to an input Ei of corresponding rank of the unit Ub by a data bus Bi. The unit Ua sends the unit Ub a signal RDY ("Ready") when the set W of binary words wi is available on its outputs Si, and the unit Ub returns an acknowledgment signal ACK ("Acknowledge") when these binary words have been read via its inputs Ei. Upon receipt of the ACK signal, the unit Ua provides a new set W of binary words wi (w0 to wI-1) and re-transmits the RDY signal.

If the function executed by the unit Ub simultaneously processes, at each processing step, a number of binary words less than the number I of binary words wi conveyed by the data buses Bi, the unit Ub must nevertheless read all the binary words and store those to be processed during a next processing step, so that the unit Ua can present on its outputs the new set of binary words W before the next processing step. This processing step may in fact require binary words of the new set of binary words W in addition to the previously stored binary words.
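The buffering burden described above can be made concrete with a minimal Python sketch. It is only a behavioral model under simplifying assumptions (words are strings, the RDY/ACK handshake is reduced to taking the next set from an iterator), and the name conventional_consume is hypothetical, not taken from the description.

```python
from collections import deque

def conventional_consume(word_sets, requests):
    """Behavioral model of the conventional unit Ub (illustrative only).

    Each incoming set W must be read in full when it is acknowledged, so the
    words that are not needed yet are kept in a local buffer, which stands
    for the memory areas reserved in the final circuit.
    """
    pending = deque()        # words already read from the buses, not yet processed
    sets = iter(word_sets)   # successive sets W presented by the unit Ua
    processed = []
    for n in requests:       # n = number of words needed by one processing step
        while len(pending) < n:
            pending.extend(next(sets))   # ACK: the whole set is read at once
        processed.append([pending.popleft() for _ in range(n)])
    return processed, list(pending)
```

With two sets of 8 words and processing steps needing 5 and then 4 words, the model ends with 7 words left in the local buffer, which is the storage the function has to manage itself in the conventional device.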

This gap between the number of processed data and the number of data provided must be taken into account when writing the function in a high-level abstraction language, using programming artifices such as variables of the array type (for example, in the C language, the variable "TABLE [i] [j]" for an array of i * j variables), which result, in the completed operational circuit, in the reservation of memory areas dedicated to these variables and in the execution of read and write access cycles to these memory areas. As indicated above, a reworking of the architecture of the DV1 device after a period of evaluation, and in particular an increase in the number I of parallel buses, involves a reworking of the functional description of the device in a high-level abstraction language and complicates the method of production and development of this device.

An embodiment of a device DV2 for parallel processing of data according to the invention is shown in FIG. 2. The device comprises a unit U0 providing the supply of parallel data, a unit U1 providing the parallel processing of data, a set of parallel data buses Bi (B0 to BI-1) connecting the units U0 and U1, and a matching unit BDG arranged between the units U0 and U1 on the data buses Bi. It will be noted that in the present description, the term "binary word" designates a unit of data that can be processed by the unit U1 and whose size is several bits, for example 8, 16, 32, 64 bits, or any other format. Thus, the term "bus" designates a group of several conductive tracks in parallel for conveying a word.

Each data bus Bi comprises a first section Sa connecting the unit U0 to the unit BDG and a second section Sb connecting the unit BDG to the unit U1. The unit U0 comprises outputs Si (S0 to SI-1) connected to inputs Ei (E0 to EI-1) of the unit BDG by the first section Sa of the set of buses Bi, and the unit BDG comprises outputs Si (S0 to SI-1) connected to inputs Ei (E0 to EI-1) of the unit U1 by the second section Sb of the set of buses Bi.

The unit U0 sends to the BDG unit a signal RDY1 when a set W of binary words w0 to wI-1 is available on its outputs Si, and the unit BDG returns an acknowledgment signal ACK1 when the set of binary words has been read. On receiving the signal ACK1, the unit U0 provides a new set W' of binary words w0' to wI-1', and so on. Likewise, the unit BDG sends to the unit U1 a signal RDY2 when binary words wi are available on its outputs Si, and the unit U1 returns an acknowledgment signal ACK2 when these binary words have been read. Upon receipt of the ACK2 signal, the BDG unit provides a new set of binary words, and so on.

According to the invention, the unit U1 is configured to read, from among a set of binary words provided by the BDG unit, only the binary words which it immediately needs, for example to execute a step of parallel processing of data. The unit U1 then supplies the BDG unit with information on the binary words that have been read. This information is provided with the signal ACK2, or forms the signal ACK2. On receipt of this information, the BDG unit presents again to the unit U1 the binary words that were not read during the preceding reading step, as well as binary words of the new set of binary words provided by the unit U0. The unit BDG thus supplies the unit U1 with "hybrid" sets of binary words comprising binary words wi, wi' from different sets of binary words provided by the unit U0.

In one embodiment, the unit U1 is configured to read the binary words wi provided by the BDG unit according to a read hierarchy of its inputs, for example by giving its first input E0 the highest read priority and its last input EI-1 the lowest read priority. Thus, during a partial reading of the binary words presented to it by the BDG unit, the unit U1 first reads the binary words present on the inputs Ei with the highest read priority, that is, here, starting with the input E0, and returns to the BDG unit the information on the binary words that have been read in the form of a number "Nj" indicating the number of data read on its inputs, "j" being an index representing the rank of the reading step just executed. The number Nj is included in the signal ACK2 (Nj), or forms the signal ACK2 (Nj).

In this case, the unit BDG is configured, after having received the number Nj, to supply to the unit U1 the I-Nj binary words wi that have not been read by the unit U1, as well as Nj binary words wi' of a following set of binary words W' provided by the unit U0. For this purpose, the BDG unit shifts the position of the I-Nj binary words wi on its outputs so that they are presented on its outputs with the highest read priority, and presents the Nj binary words wi' of the following set of binary words on its lower priority outputs.
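The behavior just described can be sketched in Python as a sliding window over the chain of words provided by the unit U0. This is a minimal behavioral model, not a description of the circuit: the function name bdg_windows and its parameters are hypothetical, words are represented as strings, and the handshake signals are left out.

```python
from itertools import chain, islice

def bdg_windows(word_sets, purge_counts, width):
    """Yield the sets of words presented to U1 (illustrative model of BDG).

    word_sets    -- successive sets W provided by U0, each of `width` words
    purge_counts -- the numbers Nj returned by U1 after each reading step
    width        -- the number I of parallel buses
    """
    stream = chain.from_iterable(word_sets)   # sets seen as one chain of words
    window = list(islice(stream, width))      # first set presented to U1
    yield list(window)
    for n in purge_counts:
        if not 0 <= n <= width:
            raise ValueError("Nj must lie between 0 and I")
        kept = window[n:]                  # the I - Nj words not read move to the
                                           # highest priority positions
        fresh = list(islice(stream, n))    # Nj words of the following set(s);
                                           # fewer if the stream is exhausted
        window = kept + fresh
        yield list(window)
```

For instance, with I = 4, sets "p0 p1 p2 p3" and "q0 q1 q2 q3", and a single purge instruction N1 = 3, the model presents "p0 p1 p2 p3" and then "p3 q0 q1 q2".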

For the sake of clarity, an example of a data matching sequence will now be described. It is assumed here that I is equal to 8 and that the unit U0 successively supplies the BDG unit with the following sets of binary words W1, W2, W3, W4, W5:

W1 = a0 a1 a2 a3 a4 a5 a6 a7

W2 = b0 b1 b2 b3 b4 b5 b6 b7

W3 = c0 c1 c2 c3 c4 c5 c6 c7

W4 = d0 d1 d2 d3 d4 d5 d6 d7

W5 = e0 e1 e2 e3 e4 e5 e6 e7

These sets of binary words are read sequentially by the unit BDG and are considered by it as forming an uninterrupted chain of binary words, that is: a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7 c0 c1 c2 c3 c4 c5 c6 c7 d0 d1 d2 d3 d4 d5 d6 d7 e0 e1 e2 e3 e4 e5 e6 e7, in which the BDG unit can form sets of binary words different from the initial sets of binary words that make up the chain of words. At the beginning of a data reading cycle by the unit U1, the unit BDG first supplies the unit U1 with the set of binary words W1: a0 a1 a2 a3 a4 a5 a6 a7

It is assumed here that during a first reading step, the unit U1 reads the binary words a0 to a4 and returns the information N1 = 5 to the unit BDG. The latter then provides the following set of binary words, which is found on the inputs E0 to EI-1 of the unit U1: a5 a6 a7 b0 b1 b2 b3 b4

It is also assumed that during a next reading step, the unit U1 reads the binary words a5, a6, a7, b0 and returns the information N2 = 4 to the BDG unit. The latter then supplies the unit U1 with the following set of binary words: b1 b2 b3 b4 b5 b6 b7 c0

It is then assumed that during a next reading step, the unit U1 reads the binary words b1 to b7 and c0 and returns the information N3 = 8 to the unit BDG. The latter then supplies the unit U1 with the following set of binary words: c1 c2 c3 c4 c5 c6 c7 d0

It is then assumed that during a next reading step, the unit U1 reads the binary words c1 to c7 and returns the information N4 = 7 to the BDG unit. The latter then supplies the unit U1 with the following set of binary words: d0 d1 d2 d3 d4 d5 d6 d7

Finally, it is assumed that during a next reading step, the unit U1 reads the binary words d0 to d5, and returns the information N5 = 6 to the unit BDG. The latter then supplies the unit U1 with the following set of binary words: d6 d7 e0 e1 e2 e3 e4 e5

FIG. 3 represents an embodiment of the BDG unit implementing the data matching method that has just been described, with a hierarchy of the inputs and use of the number Nj to indicate the number of higher priority inputs read during a reading step of rank "j". The BDG unit comprises two input registers R1, R2, a data reorganization circuit BS, here a circular shift circuit, an output register R3 and a control circuit CCT1. Each register comprises I register cells each capable of receiving a binary word wi. Inputs E0 to EI-1 of the register R1, each corresponding to an input of a cell of the register, are connected to the outputs S0 to SI-1 of the unit U0. Similarly, inputs E0' to EI-1' of the register R2 are connected to the outputs S0 to SI-1 of the unit U0. Outputs S0 to SI-1 of the register R1, each corresponding to an output of a cell of the register, are connected to inputs E0 to EI-1 of the circular shift circuit BS. Similarly, outputs S0' to SI-1' of the register R2 are connected to inputs E0' to EI-1' of the circular shift circuit BS. The circuit BS comprises outputs S0 to SI-1 connected to inputs E0 to EI-1 of the register R3, which includes outputs S0 to SI-1 connected to the inputs E0 to EI-1 of the unit U1.
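As an illustration only, the selection performed by the circular shift circuit BS can be modeled in Python as follows. The sketch assumes that the circuit sees the 2I cells of the registers R1 and R2 as one circular buffer and that the offset applied is the running sum of the numbers Nj, modulo 2I. This accumulation is an assumption: the description states only that the signal SHIFT carries the last value Nj. The name circular_select is hypothetical.

```python
def circular_select(r1, r2, offset):
    """Return the I words loaded into R3 for a given cumulative offset.

    r1, r2 -- contents of the input registers R1 and R2 (I words each)
    offset -- assumed cumulative shift, counted in words, modulo 2 * I
    """
    cells = list(r1) + list(r2)   # the 2 * I register cells seen by the circuit BS
    size = len(cells)
    width = len(r1)
    # Output k of BS is wired to cell (offset + k), wrapping from R2 back to R1.
    return [cells[(offset + k) % size] for k in range(width)]
```

With R1 holding a0 to a7 and R2 holding b0 to b7, an offset of 5 selects a5 a6 a7 b0 b1 b2 b3 b4. Once R1 has been reloaded with c0 to c7, an offset of 9 selects b1 to b7 followed by c0, the wrap-around being what allows one register to be refilled while the other is still in use.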

The circuit CCT1 manages the data exchanges with the unit U0 and the unit U1. It receives the signal RDY1 and returns the signal ACK1 after having loaded into one of the registers R1, R2 a set of binary words provided by the unit U0. It also receives the signal ACK2 (Nj) from the unit U1 and sends it the signal RDY2 when a new set of binary words has been loaded into the register R3. For this purpose, the circuit CCT1 applies a data loading signal LD1 to the register R1 or a data loading signal LD2 to the register R2, and loads alternately, into each register, the sets of binary words received from the unit U0. The circuit CCT1 ensures that these registers do not contain obsolete data already read by the unit U1 by executing a method of updating these registers, an example of which will be described later.

The circuit CCT1 also applies a shift signal SHIFT to the circuit BS, the value of this signal being the last value Nj received from the unit U1, indicating the number of binary words read by it. After having applied the signal SHIFT to the circuit BS, the circuit CCT1 loads the binary words provided by the outputs S0 to SI-1 of the circuit BS into the output register R3, by applying a signal LD3 thereto, and sends the signal RDY2 to the unit U1.

FIG. 4 is a flowchart describing an exemplary configuration of the circuit CCT1, the latter being for example realized in the form of a finite state machine or "FSM" ("Finite State Machine"). Steps B00 to B08 and steps C00 to C04 are distinguished. The steps B00 to B08 relate to the management of the data exchanges with the units U0, U1 and the control of the shifts executed by means of the circuit BS. The steps C00 to C04 form an update loop of the registers R1, R2.

The steps B00 to B05 are intended to initialize the BDG unit at the beginning of a data transfer cycle between the units U0 and U1. In step B00, the circuit CCT1 waits for the signal RDY1 to be received from the unit U0, then loads the register R1 and sends the signal ACK1 in step B01. In step B02, the circuit CCT1 waits for the signal RDY1 to be received from the unit U0, then loads the register R2 and sends the signal ACK1 in step B03. In step B04, the circuit CCT1 applies the signal SHIFT = 0 to the circular shift circuit BS. In step B05, the circuit CCT1 loads the register R3 and sends the signal RDY2 to the unit U1.

In step B06, the circuit CCT1 waits to receive the read acknowledgment ACK2 from the unit U1, including (or formed by) the number Nj of binary words read by the unit U1. When this information is received, the circuit CCT1 goes to the step C01 to execute the update loop of the registers R1, R2, before going to the step B07 during which it applies the signal SHIFT = Nj to the circular shift circuit BS. In step B08, the circuit CCT1 loads the register R3 with the data supplied by the circuit BS and sends the signal RDY2 to the unit U1, then returns to the step B06 to wait for a new read acknowledgment from the unit U1 and the corresponding number Nj.

The register update loop includes an initialization step C00 that can be executed during the initialization of the BDG unit (steps B00 to B05), during which the circuit CCT1 sets to zero a variable "r" and a variable "ΣNj" equal to the sum of the numbers Nj received from the unit U1 since the last reset of this variable.

In step C01, the circuit CCT1 stores the last value Nj received in step B06 and refreshes the variable ΣNj by adding to its previous value the new value Nj received. In step C02, the circuit CCT1 determines whether "r + ΣNj" is greater than or equal to I, I being the number of binary words flowing in parallel on the buses Bi or the number of buses Bi. If the answer is positive, the circuit CCT1 goes to the step C03; if not, it leaves the loop and goes to the step B07. In step C03, the circuit CCT1 verifies that the signal RDY1 has been sent by the unit U0, and otherwise waits for this signal to be received. When the signal RDY1 is received, the circuit CCT1 goes to the step C04 where it performs the following operations:

- the circuit CCT1 loads into the register R1 or into the register R2 the new set of binary words provided by the unit U0. The first register updated in step C04 after the initialization steps B01 and B03 is the register R1. Then, during a new execution of the step C04, the circuit CCT1 loads the register R2, and so on, so that the registers R1, R2 are updated alternately,

- the circuit CCT1 sends the signal ACK1 to the unit U0,

- the circuit CCT1 updates the variable r by means of the relation "r = (r + ΣNj) mod I". Thus, the new value of r is equal to the remainder of the division by I of the sum of its previous value and the variable ΣNj. After this update, the variable r represents the number of offsets beyond I or a multiple of I having been applied to the data in the registers R1, R2,

- the circuit CCT1 sets the variable ΣNj to zero. The successive offsets modulo I since the last reset of this variable are indeed now included in the new value of r.
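The steps B00 to B08 and C00 to C04 can be traced with the following Python sketch, given as a behavioral model under stated assumptions rather than as the actual state machine. The handshake signals are reduced to taking the next set from an iterator, the circuit BS is assumed to accumulate the successive values Nj modulo 2I (the description states only that SHIFT carries the last value Nj), and the names run_bdg and select are hypothetical.

```python
def select(regs, offset, width):
    """Model of BS: pick I words from the 2 * I cells of R1 and R2, circularly."""
    cells = regs[0] + regs[1]
    return [cells[(offset + k) % (2 * width)] for k in range(width)]

def run_bdg(word_sets, purge_counts, width):
    """Return the successive contents of R3 and the register reloaded at each step."""
    sets = iter(word_sets)
    regs = [list(next(sets)), list(next(sets))]   # B00-B03: load R1, then R2
    next_reg = 0       # the first register updated in C04 is R1
    r = 0              # C00: variable r
    total = 0          # C00: variable "sum of Nj"
    offset = 0         # assumed cumulative shift held by BS
    outputs = [select(regs, offset, width)]       # B04-B05: SHIFT = 0, load R3
    reloads = []
    for n in purge_counts:                        # B06: ACK2(Nj) received
        total += n                                # C01
        reloaded = None
        if r + total >= width:                    # C02
            regs[next_reg] = list(next(sets))     # C03-C04: wait RDY1, load, ACK1
            reloaded = "R1" if next_reg == 0 else "R2"
            next_reg ^= 1                         # registers updated alternately
            r = (r + total) % width
            total = 0
        reloads.append(reloaded)
        offset = (offset + n) % (2 * width)       # B07: SHIFT = Nj
        outputs.append(select(regs, offset, width))   # B08: load R3, RDY2
    return outputs, reloads
```

Run with I = 8, five sets of words a0-a7 to e0-e7 and the purge instructions 5, 4, 8, 7, 6, the model reloads R1, R2 and R1 at the second, third and fourth steps and leaves d6 d7 e0 e1 e2 e3 e4 e5 in R3 after the last step.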

FIG. 4 also shows steps P00 to P02 conducted by the unit U1. During a step P00, the unit U1 waits to receive the signal RDY2. When this signal is received, the unit U1 performs a step P01 where it reads Nj binary words wi from the set of binary words provided by the register R3 of the BDG unit, and sends the signal ACK2 (Nj) to the circuit CCT1. During a step P02, the unit U1 processes the data read, then returns to step P00 to verify that the signal RDY2 has been re-transmitted by the circuit CCT1, or to wait for this signal to be re-transmitted, before repeating the step P01.

In one embodiment, the number of binary words necessary for the execution of the processing step P02 may be greater than I. In this case, the unit U1 reads its inputs E0 to EI-1 several times before processing the data read. For example, the unit U1 successively reads several sets of I data, returning the value Nj = I after each reading, then, during a last reading step, reads Nj binary words, with Nj < I, from the set of binary words provided to it. Thus, the unit U1 performs steps P00 and P01 several times before executing the processing step P02.
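As a minimal sketch of this multi-step reading, the two helper functions below compute the value Nj returned at each reading step and the number of P00/P01 iterations preceding step P02. The function names next_nj and read_steps are assumptions introduced for illustration only.

```c
#include <assert.h>

/* Value Nj returned for the next reading step when `remaining` binary
 * words are still needed and I words are presented in parallel. */
unsigned next_nj(unsigned remaining, unsigned I)
{
    assert(I > 0);
    return remaining < I ? remaining : I;  /* full read, or final partial read */
}

/* Number of P00/P01 iterations needed before the processing step P02. */
unsigned read_steps(unsigned needed, unsigned I)
{
    assert(I > 0);
    return (needed + I - 1) / I;           /* ceiling of needed / I */
}
```

For instance, a processing step requiring 20 binary words on I = 8 inputs takes three reading steps under this model, returning Nj = 8, 8 and then 4.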

In another embodiment, the unit U1 returns to the circuit CCT1 a number Nj less than the number of binary words read on its inputs, so that certain binary words already read are presented again on its inputs for the next reading step. This embodiment highlights the fact that the number Nj constitutes, in essence, an instruction to purge certain binary words of the set of binary words initially presented to the unit U1, rather than information on the number of binary words that the unit U1 has read.

In other embodiments, the unit U1 may be a circuit into which data is written. The number Nj then represents the number of binary words written into the unit U1. The term "read binary words", as used herein and in the claims to designate an operation in which the unit U1 receives and processes binary words, can therefore also correspond to an operation of receiving binary words that are written into the unit U1. The inputs with higher read priority are then inputs with higher write priority. The term "read" should therefore be understood in the broad sense as a data receiving operation for a given processing, which may correspond to writing the data into the unit U1.

Another embodiment of the unit U1 can implement a data matching method without prioritized inputs. The unit U1 then supplies the BDG unit, instead of the number Nj, with information on the binary words wi that it no longer needs (data to be purged). These binary words are not necessarily of adjacent ranks "i". The unit U1 can also indicate in which order the I-Nj binary words that have not been read (or that have been read but that it wishes to receive again) must be applied to its inputs. In such an embodiment, the circuit BS is a "crossbar" type data reorganization circuit capable of connecting any of its inputs to any of its outputs.

The mapping sequence described above is illustrated in FIG. 5, with an indication of the corresponding steps among the above-described steps B00-B08, C00-C04. As before, I is assumed to be equal to 8. FIG. 5 is described in the Annex, which forms an integral part of the description.

In yet other embodiments, the BDG mapping unit may be configured to perform step C04 as a preloading (prefetch) step before the number Nj is received, in order to reduce the number of clock cycles required to transfer the data from the unit U0 to the unit U1. This preloading step can be provided by adding a third input register in parallel with the registers R1, R2, whose outputs are connected to additional inputs of the circular shift circuit BS. New data can then be preloaded into one of the three input registers without waiting to know which of the data present in the other two registers have been read by the unit U1. In another embodiment, the outputs of the register R3 are fed back to additional inputs of the circuit BS. If it turns out, on receipt of the number Nj, that data present in the register R3 have not been read by the unit U1 but are no longer present in their original register R1, R2 because its content has been renewed, these data can be returned to the circuit BS from the register R3, to be reloaded into the register R3 after shifting and combination with new data.

FIG. 6 represents steps of a method for producing an FPGA circuit according to the invention. The method comprises:

- a step F01 of describing the target data processing function (functional description) in a high-level language, using an untyped variable,

- a step F02 of defining the hardware constraints of the circuit, in particular the number of parallel buses, the clock frequency, or the clock frequencies if the device comprises processing blocks having different clock frequencies, etc.,

- a step F03 of compiling the functional description into a target architecture using a BDG unit according to the invention, taking into account the hardware constraints, to obtain an HDL hardware description of the device, or HDL circuit,

- a step F04 of configuring the FPGA circuit by logic synthesis and placement and routing of the HDL circuit, in order to obtain a stream of configuration data (bitstream) which is loaded into the memory of the FPGA circuit to obtain an operational FPGA circuit.

During a step F05, the FPGA circuit is tested under actual operating conditions or in a simulation environment. The production method ends at a step F06 if the circuit is considered satisfactory. If not, the method returns to step F02 to redefine the hardware constraints. This step may include defining a different number of parallel buses, for example to increase the degree of parallelism of the device. Thanks to the use of a matching unit (or of a plurality of matching units upstream of several processing units that may be affected by the resizing of the circuit), this step does not require changing the functional description of the circuit. In other words, it is not necessary to return to step F01. The steps F03 of compilation and generation of a hardware description, and F04 of logic synthesis and placement and routing, are executed again on the basis of the new definition of the hardware constraints. At the end of the test step F05, a new optimization cycle comprising the steps F02, F03, F04 can optionally be conducted, until an optimal configuration of the FPGA circuit is obtained.

The production method that has just been described is illustrated in FIGS. 7A, 7B, 7C. These figures show three variants DV2a, DV2b, DV2c of the device DV2 performing the same processing function but having different degrees of parallelism, chosen in the step F02 of defining the hardware constraints. The device DV2a has a two parallel bus architecture, the device DV2b has a four parallel bus architecture and the device DV2c has an eight parallel bus architecture. The unit U1 is designated U1a, U1b, U1c and the BDG unit is designated BDGa, BDGb, BDGc in each of the variants, respectively.

In the embodiment chosen here by way of example, the unit U1 (U1a, U1b, U1c) is a circuit with eight inputs E0 to E7 that can calculate the average value of data provided on all or part of its inputs. According to an assumption taken here by way of example and forming part of the functional definition of the device DV2, the unit U1 is configured here to calculate the average value of 6 binary words provided on its inputs. The unit U1 comprises for this purpose a control circuit CCT2 (CCT2a, CCT2b, CCT2c) and a bank of 8 input registers r0, r1, r2 ... r7, the input of each register forming an input E0, E1, E2 ... E7 of the unit U1. The control circuit CCT2 controls the loading of the registers r0 to r7 by means of loading signals LD01 to LD07. The outputs of the registers are applied to an adder ADD. The output of the adder ADD is applied to a divider circuit DIV/N, configured here as a divider by 6 (DIV/6), whose output provides the result R of the average value calculation.

In the variant DV2a, FIG. 7A, the unit U1a is connected to a matching unit BDGa having two outputs S0, S1 and two inputs connected to two outputs of a data supply unit U0a. The inputs E0, E1, E2 of the registers r0, r1, r2 are connected to the output S0 and the inputs E4, E5, E6 of the registers r4, r5, r6 are connected to the output S1. The registers r3, r7 and their loading signals LD03, LD07 are not used. These registers can be removed, or be connected to the outputs S0, S1 with their contents held at 0. When the device DV2a is in operation, the unit U1a receives from the unit BDGa sets W, W', W'' of two binary words, and the circuit CCT2a returns, after each loading of two binary words, the command ACK2(Nj) with Nj = 2. The words w0, w1 of the set W are loaded into the registers r0 and r4, the words w0', w1' of the set W' are loaded into the registers r1, r5 and the words w0'', w1'' of the set W'' are loaded into the registers r2, r6. The circuit CCT2a then activates the divider DIV/6, which provides the result R of the average value.

In the variant DV2b, FIG. 7B, the unit U1b is connected to a matching unit BDGb having four outputs S0, S1, S2, S3 and four inputs connected to four outputs of a data supply unit U0b. The inputs E0, E1 of the registers r0, r1 are connected to the output S0, the inputs E2, E3 of the registers r2, r3 are connected to the output S1, the input E4 of the register r4 is connected to the output S2 and the input E6 of the register r6 is connected to the output S3. The registers r5, r7 and their loading signals LD05, LD07 are not used. These registers can be removed, or be connected to the outputs S2, S3 with their contents held at 0.

When the device DV2b is in operation, the unit U1b receives from the BDGb unit sets W, W' of four binary words. The circuit CCT2b loads the words w0, w1, w2, w3 of the set W into the registers r0, r2, r4, r6 and returns the command ACK2(Nj) with Nj = 4. The circuit CCT2b then loads the words w0', w1' of the set W' into the registers r1, r3 and returns the command ACK2(Nj) with Nj = 2. The circuit CCT2b then activates the divider DIV/6, which provides the result R of the average value. On reception of the command ACK2(2), the unit BDGb shifts to its outputs S0, S1 the two unread words w2', w3' of the second set of words W', initially presented on its outputs S2, S3, and presents on the latter outputs words of a new set of words received.

In the variant DV2c, FIG. 7C, the unit U1c is connected to a matching unit BDGc having eight outputs S0, S1, S2, S3, S4, S5, S6, S7 and eight inputs connected to eight outputs of a data supply unit U0c. The inputs E0, E1, E2, E3, E4, E5 of the registers r0, r1, r2, r3, r4, r5 are connected to the outputs S0, S1, S2, S3, S4, S5, respectively. The registers r6, r7 and their loading signals LD06, LD07 are not used. These registers can be removed, or be connected to the outputs S6, S7 with their contents held at 0. When the device DV2c is in operation, the BDGc unit supplies the unit U1c with sets W of eight binary words. The circuit CCT2c loads 6 binary words w0 to w5 into the registers r0 to r5 and returns the command ACK2(Nj) with Nj = 6. Upon reception of the command ACK2(6), the BDGc unit shifts to its outputs S0, S1 the two unread words w6, w7 of the set of words W, initially presented on its outputs S6, S7, and presents on the outputs S2 to S7 words of a new set of words received.

Thanks to the use of the BDG unit taken in its variants BDGa, BDGb, BDGc, the three corresponding variants DV2a, DV2b, DV2c of the device DV2 are generated from the same functional description. For example, with reference to FIG. 6, variant DV2a was carried out during steps F03 and F04 after having chosen, during step F02, a two parallel bus architecture. The variant DV2b was carried out during the steps F03 and F04 after the test step F05 showed that the two-bus architecture of the variant DV2a was not optimal, and after returning to the step F02 to increase the number of parallel buses, as well as possibly other variables such as the clock frequency. The DV2c variant was carried out after the F05 test step showed that the parallel four-bus architecture of the DV2b variant was not optimal, or for other reasons, for example because the test of other parts of the circuit showed that an architecture with eight parallel buses was also desirable on this part of the FPGA circuit.

In the absence of the BDG unit, the method of producing the FPGA circuit shown in FIG. 6 would have to include a return to step F01 in order to rewrite the functional description of the unit U1, taking into account the reception in groups of 2, 4 or 8 of the six binary words needed by the unit U1 to calculate an average value. As indicated above, the hardware constraints would then "contaminate" the functional definition.

According to one aspect of some embodiments of the invention, an instruction for declaring an untyped variable, i.e., a variable of unspecified size, is added to the high-level programming language used in step F01. The HLS compiler implements the untyped variable in step F03 using a target architecture that includes the BDG mapping unit, and configures the BDG unit as well as the unit U1 according to the number of parallel buses, to obtain the configurations BDGa, BDGb, BDGc and U1a, U1b, U1c described above. The hardware constraints can thus be modified after the test phase F05 without requiring the description of the function executed by the unit U1 to be rewritten, since this function is made independent of the hardware structure of the unit U1 through the use of the untyped variable.

The program in pseudo-language C below is an example of a functional description of the function for calculating an average value executed by the unit U1 of the device DV2 (DV2a, DV2b, DV2c). The instruction for declaring an untyped variable is designated "t_untyped" in this example, and "bus" is the input variable of the program, which is untyped. "StaticAverage" is a function whose return value is of type "int" (an integer). "ReadBus" is a complementary function for reading the untyped variable. The variables "sample", "sum" and "count" are integers ("int"). N is an integer of constant value ("const int").

int StaticAverage(t_untyped bus)
{
    int sample;
    int sum = 0;
    int count = 0;
    const int N = 6;            /* number of samples to average */

    while (count < N) {
        sample = ReadBus(bus);  /* read one sample from the untyped bus */
        sum = sum + sample;
        count = count + 1;
    }

    return sum / count;
}

The program executes an averaging function that reads samples until the stop condition of the "while" loop is met. At each iteration of the "while" loop, the program reads a sample ("ReadBus(bus)"), then accumulates the samples ("sum = sum + sample") and increments a count of the number of samples ("count = count + 1"). At the end of the "while" loop (stop condition met), the program calculates the average ("sum / count"). Since this is a static averaging function ("StaticAverage"), the stop condition is "the number of samples equals 6" ("while (count < N)"). Similarly, a function for writing the untyped variable can be provided. The program is written without knowing the degree of parallelism available in the hardware device running this program. It is therefore independent of the hardware architecture, which can be modified after the test phase F05 until the optimum number of parallel buses for the intended application is found. The HLS compiler of the C language implements the variable "bus" and its read instruction "ReadBus(bus)" by means of the mapping unit BDG, by configuring the BDG unit and the unit U1 appropriately for the degree of parallelism retained (for example, one of the configurations BDGa, BDGb, BDGc and U1a, U1b, U1c).
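To illustrate this independence from the degree of parallelism, the following compilable C sketch replaces the untyped variable with a hypothetical software stand-in. The struct layout of t_untyped, its fields (stream, len, width, consumed, fetched) and the passing of the bus by pointer are assumptions made so that the example runs on an ordinary C compiler; they are not part of the pseudo-language above, and the hardware BDG unit is only imitated by a counter of words transferred in sets of "width".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software stand-in for the untyped bus variable. */
typedef struct {
    const int *stream;  /* words supplied by the data source, in order    */
    size_t len;         /* number of words available in the stream        */
    size_t width;       /* degree of parallelism I (hardware constraint)  */
    size_t consumed;    /* words read (purged) so far                     */
    size_t fetched;     /* words transferred so far, in sets of `width`   */
} t_untyped;

/* Read one sample; a new set of `width` words is counted as transferred
 * whenever all previously transferred words have been consumed. */
int ReadBus(t_untyped *bus)
{
    assert(bus->width > 0 && bus->consumed < bus->len);
    if (bus->consumed == bus->fetched)
        bus->fetched += bus->width;
    return bus->stream[bus->consumed++];
}

/* Same functional description as above: `width` never appears in it. */
int StaticAverage(t_untyped *bus)
{
    int sum = 0;
    int count = 0;
    const int N = 6;

    while (count < N) {
        sum = sum + ReadBus(bus);
        count = count + 1;
    }
    return sum / count;
}
```

Running StaticAverage on the same stream with width set to 2, 4 or 8 gives the same average in this model; only the counter "fetched" differs, which mirrors the fact that the functional description does not depend on the number of parallel buses.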

FIG. 8 represents another embodiment DV2d of the device DV2 in which the unit U1 is, as before, a circuit for calculating the average value of data provided on its inputs E0 to En. The unit U1 comprises the adder ADD, a control circuit CCT2d, and a bank of I input registers r0 to rn whose inputs are connected to corresponding outputs of a matching unit BDGd and whose outputs are applied to the adder ADD. The output of the adder ADD is applied to a divider circuit DIV/N whose output provides the result R of the calculation of the average value of N binary words. The number N is here a variable supplied to the divider DIV/N by the control circuit CCT2d. The circuit CCT2d also applies to the bank of registers r0 to rn loading signals LD01-LDI-1 which are a function of N. The circuit CCT2d for example monitors the evolution of the value of the result R and determines, after each step of calculating an average value, the number N of binary words needed to calculate the next average value. In a variant, the circuit CCT2d receives the variable N from the outside, as an external setpoint value. After each reading step, the number Nj = N is supplied to the BDG unit in the command ACK2(Nj) or as the command ACK2(Nj).

The program in pseudo-language C below is an example of a functional description of the function for calculating an average value executed by the unit U1 of the device DV2d of FIG. 8:

int DynamicAverage(t_untyped bus)
{
    int sample;
    int sum = 0;
    int count = 0;
    const int max = 1000;       /* threshold on the sum of the samples */

    while (sum < max) {
        sample = ReadBus(bus);  /* read one sample from the untyped bus */
        sum = sum + sample;
        count = count + 1;
    }

    return sum / count;
}

This program differs from the previous one in that it defines a dynamic averaging function ("DynamicAverage") whose loop continues as long as the sum of the samples is less than 1000 ("while (sum < max)"), so that reading stops once the sum of the samples reaches or exceeds 1000. The number of samples read to calculate an average value thus depends on their values. Assuming that the samples are non-zero, the worst case considered is 1000 samples equal to 1. This program is therefore of the "data-dependent behavior" type.
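The data-dependent behavior can be observed with the small compilable sketch below. As before, the struct t_untyped (here a plain array with a read position) and the pointer parameter are hypothetical stand-ins for the untyped variable, introduced only so that the function can run in software.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical array-backed stand-in for the untyped bus variable. */
typedef struct {
    const int *data;  /* samples, in the order they are presented */
    size_t len;       /* number of samples available              */
    size_t pos;       /* index of the next sample to read         */
} t_untyped;

int ReadBus(t_untyped *bus)
{
    assert(bus->pos < bus->len);  /* the model has no flow control */
    return bus->data[bus->pos++];
}

int DynamicAverage(t_untyped *bus)
{
    int sum = 0;
    int count = 0;
    const int max = 1000;

    while (sum < max) {           /* continue while the sum is below 1000 */
        sum = sum + ReadBus(bus);
        count = count + 1;
    }
    return sum / count;
}
```

With samples all equal to 400, this model reads three samples before the sum reaches 1000; with samples all equal to 1, it reads 1000 samples, which is the worst case mentioned above.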

In addition to FPGA ("Field Programmable Gate Array") integrated circuits, a parallel data processing device according to the invention can be implemented in the form of an integrated circuit by means of other known technologies, in particular ASICs ("Application-Specific Integrated Circuits"). The device which has just been described is particularly advantageous when implemented in the form of an FPGA circuit, which allows the bus size to be adjusted.

FIG. 9 shows an embodiment of the device DV2 in which the unit U0 is a circuit for adapting the number of parallel buses and the clock frequencies. In addition to its outputs connected to the BDG unit, the unit U0 comprises K inputs E0 to EK-1 connected to a unit U2 which comprises K outputs S0 to SK-1, K being different from the number I of outputs of the unit U0. The unit U2 is furthermore clocked by a clock signal CK2 different from a clock signal CK1 applied to the units BDG and U1. The unit U0 makes the connection between these two clock domains having different degrees of parallelism, and transforms sets of binary words w0 to wK-1 into sets of binary words w0 to wI-1. For this purpose, the unit U0 uses a FIFO ("first in, first out") buffer for resizing the sets of binary words, receives the clock signals CK1, CK2 of each clock domain and controls the data exchange with the unit U2 by receiving a signal RDY3 therefrom and returning an acknowledgment signal ACK3 thereto.

In general, the unit U0 can be any type of interface circuit providing parallel data, in particular an input port of the device. The unit U1 may also be a memory, or another data processing unit, or a circuit for adapting clock domains and/or degrees of parallelism as shown in FIG. 9. A device according to the invention may comprise a plurality of processing units connected to the same set of data buses and each equipped with a BDG data matching unit enabling it to adjust its "consumption" of parallel data without worrying about the number of data items conveyed simultaneously by the set of buses. Although the matching unit has been described above as a block separate from the processing unit, it can in practice be integrated into the processing unit, the inputs of the matching unit then forming the inputs of the processing unit.
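The resizing of sets of K binary words into sets of I binary words performed by the unit U0 of FIG. 9 can be sketched with a simple software FIFO. This is only an illustration under stated assumptions: the type word_fifo, the capacity FIFO_CAP and the functions fifo_push_set and fifo_pop_set are hypothetical, and the two clock domains CK1, CK2 as well as the RDY3/ACK3 handshake are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 64  /* hypothetical capacity, in binary words */

typedef struct {
    int buf[FIFO_CAP];
    size_t head;   /* index of the oldest word  */
    size_t count;  /* number of words stored    */
} word_fifo;

/* Store a set of k words coming from the unit U2 side.
 * Returns 0 on success, -1 if the FIFO lacks room for the whole set. */
int fifo_push_set(word_fifo *f, const int *set, size_t k)
{
    assert(f != NULL && set != NULL);
    if (k > FIFO_CAP - f->count)
        return -1;
    for (size_t n = 0; n < k; n++)
        f->buf[(f->head + f->count + n) % FIFO_CAP] = set[n];
    f->count += k;
    return 0;
}

/* Extract a set of i words for the BDG unit side.
 * Returns 0 on success, -1 if fewer than i words are available. */
int fifo_pop_set(word_fifo *f, int *set, size_t i)
{
    assert(f != NULL && set != NULL);
    if (i > f->count)
        return -1;
    for (size_t n = 0; n < i; n++)
        set[n] = f->buf[(f->head + n) % FIFO_CAP];
    f->head = (f->head + i) % FIFO_CAP;
    f->count -= i;
    return 0;
}
```

With K = 3 and I = 4, for example, a first set of three words is not enough to deliver an output set; after a second set of three words has been pushed, one set of four words can be extracted and two words remain queued.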

Annex

Description of Figure 5

Initializing the BDG unit

The sets of binary words W1 and W2 are loaded into the registers R1, R2. The circular shift circuit BS receives on its 2*I inputs the concatenated contents of the two registers:

R1 // R2 = a0 a1 a2 a3 a4 a5 a6 a7 b0 b1 b2 b3 b4 b5 b6 b7

The circuit CCT1 sets the signal SHIFT to 0 and loads the set W1 into the register R3:

R3 = a0 a1 a2 a3 a4 a5 a6 a7

The variables r and ΣNj are set to 0 (r = 0, ΣNj = 0).

First reading step:

- the unit U1 reads the binary words a0 to a4 and returns the value N1 = 5 to the circuit CCT1 by means of the acknowledgment signal ACK2(5),

- the circuit CCT1 updates the variable ΣNj: ΣNj = ΣNj + Nj = 5,

- the circuit CCT1 performs the test on the value of r + ΣNj: r + ΣNj = 5 < 8.

Since r + ΣNj is not greater than or equal to 8, the circuit CCT1 exits the update loop of the registers R1, R2 and applies the command SHIFT = 5 to the circuit BS (step B07), then loads the register R3 with the binary words selected in the registers R1, R2 by the circuit BS (step B08), that is:

R3 = a5, a6, a7, b0, b1, b2, b3, b4

Second reading step:

- the unit U1 reads the binary words a5, a6, a7, b0 and returns the value N2 = 4 to the circuit CCT1 by means of the acknowledgment signal ACK2(4),

- the circuit CCT1 updates the variable ΣNj: ΣNj = ΣNj + Nj = 5 + 4 = 9,

- the circuit CCT1 executes the test on r + ΣNj: r + ΣNj = 0 + 9 = 9 ≥ 8.

Since r + ΣNj is greater than or equal to 8:

i) the circuit CCT1 loads the new set of binary words W3 into the register R1. The circular shift circuit BS receives the concatenated contents of the two registers:

R1 // R2 = c0 c1 c2 c3 c4 c5 c6 c7 b0 b1 b2 b3 b4 b5 b6 b7

ii) the circuit CCT1 updates the variable r: r = (r + ΣNj) mod 8 = 9 mod 8 = 1,

iii) the circuit CCT1 sets the variable ΣNj to 0.

The circuit CCT1 then leaves the update loop of the registers R1, R2 and applies the command SHIFT = 4 to the circuit BS (step B07), then loads the register R3 with the binary words selected in the registers R1, R2 by the circuit BS (step B08), that is:

R3 = b1, b2, b3, b4, b5, b6, b7, c0

It will be noted that the circuit BS executes the SHIFT command as if the content of the two registers were an uninterrupted string of binary words stored in an uninterrupted sequence of concatenated registers R1 // R2 // R3 // R4 // R5 ..., in which R3 is the register R1 having received a first new set of binary words, R4 is the register R2 having received a new set of binary words, R5 is the register R1 having received a second new set of binary words, etc. Thus, when the sum of the values of the SHIFT signals applied to the circuit BS exceeds the size of the two registers, the offset is executed by considering that the first binary word of the register R1, here the word c0, follows the last binary word of the register R2.

Third reading step:

- the unit U1 reads the binary words b1 to b7 and c0, and returns the value N3 = 8 to the circuit CCT1 by means of the acknowledgment signal ACK2(8),

- the circuit CCT1 updates the variable ΣNj: ΣNj = ΣNj + Nj = 0 + 8 = 8,

- the circuit CCT1 performs the test on r + ΣNj: r + ΣNj = 1 + 8 = 9 ≥ 8.

Since r + ΣNj is greater than or equal to 8:

i) the circuit CCT1 loads the new set of binary words W4 into the register R2. The circular shift circuit BS receives the concatenated contents of the two registers:

R1 // R2 = c0 c1 c2 c3 c4 c5 c6 c7 d0 d1 d2 d3 d4 d5 d6 d7

ii) the circuit CCT1 updates the variable r: r = (r + ΣNj) mod 8 = 9 mod 8 = 1,

iii) the circuit CCT1 sets the variable ΣNj to 0.

The circuit CCT1 then leaves the update loop of the registers R1, R2 and applies the command SHIFT = 8 to the circuit BS (step B07), then loads the register R3 with the binary words selected in the registers R1, R2 by the circuit BS (step B08), that is:

R3 = c1, c2, c3, c4, c5, c6, c7, d0

Fourth reading step:

- the unit U1 reads the binary words c1 to c7, and returns the value N4 = 7 to the circuit CCT1 by means of the acknowledgment signal ACK2(7),

- the circuit CCT1 updates the variable ΣNj: ΣNj = ΣNj + Nj = 0 + 7 = 7,

- the circuit CCT1 executes the test on r + ΣNj: r + ΣNj = 1 + 7 = 8 ≥ 8.

Since r + ΣNj is greater than or equal to 8:

i) the circuit CCT1 loads the new set of binary words W5 into the register R1. The circular shift circuit BS receives the concatenated contents of the two registers:

R1 // R2 = e0 e1 e2 e3 e4 e5 e6 e7 d0 d1 d2 d3 d4 d5 d6 d7

ii) the circuit CCT1 updates the variable r: r = (r + ΣNj) mod 8 = 8 mod 8 = 0,

iii) the circuit CCT1 sets the variable ΣNj to 0.

The circuit CCT1 then leaves the update loop of the registers R1, R2 and applies the command SHIFT = 7 to the circuit BS (step B07), then loads the register R3 with the binary words selected in the registers R1, R2 by the circuit BS (step B08), that is:

R3 = d0, d1, d2, d3, d4, d5, d6, d7

Fifth reading step:

- the unit U1 reads the binary words d0 to d5 and returns the value N5 = 6 to the circuit CCT1 by means of the acknowledgment signal ACK2(6),

- the circuit CCT1 updates the variable ΣNj: ΣNj = ΣNj + Nj = 0 + 6 = 6,

- the circuit CCT1 executes the test on r + ΣNj: r + ΣNj = 0 + 6 = 6 < 8.

Since r + ΣNj is not greater than or equal to 8, the circuit CCT1 exits the update loop of the registers R1, R2 and applies the command SHIFT = 6 to the circuit BS (step B07), then loads the register R3 with the binary words selected in the registers R1, R2 by the circuit BS (step B08), that is:

R3 = d6, d7, e0, e1, e2, e3, e4, e5
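The five reading steps of this Annex can be replayed with the following C model. It is a sketch under stated assumptions rather than a description of the actual circuit: the type bdg and the functions bdg_init and bdg_ack are hypothetical names, the binary words are represented by consecutive integers (a0 to a7 are 0 to 7, b0 to b7 are 8 to 15, and so on up to e7 = 39), the circuit BS is modelled by an accumulated shift taken modulo 2*I, and the handshake signals are omitted.

```c
#include <assert.h>

#define I_BUSES 8  /* I = 8, as assumed for FIG. 5 */

typedef struct {
    int reg[2][I_BUSES];  /* input registers R1 (index 0) and R2 (index 1) */
    int r3[I_BUSES];      /* output register R3                            */
    unsigned pos;         /* accumulated SHIFT values, modulo 2*I          */
    unsigned r;           /* variable r                                    */
    unsigned sum_nj;      /* variable "sum of Nj"                          */
    int next_reg;         /* register reloaded next: 0 = R1, 1 = R2        */
    int next_word;        /* identifier of the next word supplied by U0    */
} bdg;

/* Load a new set of I words from U0 into one input register. */
static void load_set(bdg *b, int which)
{
    for (int i = 0; i < I_BUSES; i++)
        b->reg[which][i] = b->next_word++;
}

/* Step B08: R3 receives I consecutive words of R1 // R2, circularly. */
static void load_r3(bdg *b)
{
    for (int i = 0; i < I_BUSES; i++) {
        unsigned p = (b->pos + i) % (2 * I_BUSES);
        b->r3[i] = b->reg[p / I_BUSES][p % I_BUSES];
    }
}

/* Initialization: W1 -> R1, W2 -> R2, SHIFT = 0, R3 = W1. */
void bdg_init(bdg *b)
{
    b->pos = 0;
    b->r = 0;
    b->sum_nj = 0;
    b->next_reg = 0;
    b->next_word = 0;
    load_set(b, 0);
    load_set(b, 1);
    load_r3(b);
}

/* One reading step: U1 has returned ACK2(nj). */
void bdg_ack(bdg *b, unsigned nj)
{
    assert(nj <= I_BUSES);
    b->sum_nj += nj;                            /* update the sum of Nj   */
    if (b->r + b->sum_nj >= I_BUSES) {          /* one register is spent  */
        load_set(b, b->next_reg);               /* reload it from U0      */
        b->next_reg ^= 1;                       /* R1 and R2 alternate    */
        b->r = (b->r + b->sum_nj) % I_BUSES;
        b->sum_nj = 0;
    }
    b->pos = (b->pos + nj) % (2 * I_BUSES);     /* step B07: apply SHIFT  */
    load_r3(b);                                 /* step B08               */
}
```

Calling bdg_init and then bdg_ack with 5, 4, 8, 7 and 6 leaves in r3 the identifiers 5 to 12, 9 to 16, 17 to 24, 24 to 31 and 30 to 37 in turn, which correspond to the contents of R3 listed above (a5 to b4, b1 to c0, c1 to d0, d0 to d7, d6 to e5).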

Claims

1. Device (DV2) for parallelized data processing, comprising:

- a first unit (U0) comprising parallel outputs (Si) for supplying binary words (wi) in parallel, and

- a second unit (U1) comprising parallel inputs (Ei) for receiving a set (W) of binary words (wi) supplied in parallel by the first unit (U0),

characterized in that the second unit (U1) is configured to read simultaneously, during the same reading step, all or part of the binary words of the set (W) of binary words (wi) applied to its parallel inputs, and return an instruction (Nj) for purging certain binary words from the set of binary words,

and in that it comprises a data matching unit (BDG) connecting the parallel outputs of the first unit (U0) to the parallel inputs of the second unit (U1) and configured to:

- provide the second unit (U1) with a first set of binary words (W, W1) provided by the first unit (U0),

- after reading by the second unit (U1) of a part (a0-a4) of the binary words of the first set (W, W1) of binary words, provide the second unit (U1) with the word(s) (a5-a7) of the first set of binary words that have not been purged, and one or more binary words (b0-b4) of a second set of binary words (W', W2) provided by the first unit.

2. Device according to claim 1, wherein the data matching unit (BDG) is also configured to, when part of the second set (W2) of binary words has been purged, provide the second unit (U1) with the binary word(s) (b1-b7) of the second set of binary words which have not been purged and one or more binary words (c0) of a third set (W3) of binary words provided by the first unit.

3. Device according to one of claims 1 and 2, in which:

- the second unit (U1) is configured to, when reading part of the binary words applied to its parallel inputs, read binary words (a0-a4) present on inputs with higher reading priority, and

- the data matching unit (BDG) is configured to, after purging by the second unit of a part (a0-a4) of the binary words of the first set of binary words, apply to the highest reading priority inputs of the second unit the words (a5-a7) of the first set (W1) of binary words previously applied to parallel inputs and not having been purged, and apply to parallel inputs with lower reading priority of the second unit one or more binary words (b0-b4) of the second set (W2) of binary words.

4. Device according to claim 3, wherein the data matching unit comprises:

- at least two input registers (R1, R2) arranged in parallel and each having parallel inputs connected to the outputs of the first unit (U0),

- an output register (R3) having outputs connected to the parallel inputs of the second unit (U1),

- a data reorganization circuit (BS) connecting outputs of the input registers to parallel inputs of the output register, and

- a control circuit (CCT1) configured to:

- load into the input registers sets (W1, W2, W3, W4, W5) of binary words (a0-a7, b0-b7, c0-c7, d0-d7, e0-e7) supplied by the first unit, and load into the output register, via the data reorganization circuit, binary words present in the input registers,

- control the data reorganization circuit as a function of the number of binary words (Nj) having been purged, so as to load into cells with higher reading priority of the output register the binary words (a5-a7) present in the input registers and not having been purged, and load into lower priority cells of the output register binary words (b0-b4) present in the input registers and not yet having been presented to the second unit.

5. Device according to claim 4, in which the control circuit (CCT1) is configured to load into one of the input registers (R1, R2) a new set (W3) of binary words supplied by the first unit (U0) at least when all the binary words of the set (W2) of binary words present in the input register have been purged.

6. Device according to one of claims 1 to 5, in which the second unit is configured to determine the number of binary words that it reads during the same reading step as a function of the binary words previously read or as a function of an external instruction.

7. Device according to one of claims 1 to 6, in which the first unit is a processor, a series/parallel communication interface circuit, a circuit for adapting a number of buses arranged in parallel, a clock frequency modification circuit, or a combination of these elements.

8. Integrated circuit with programmable gate array, comprising a device according to one of claims 1 to 7.

9. Method for processing data by means of a parallelized data processing device (DV2) comprising a first unit (U0) comprising parallel outputs (Si) for supplying binary words (wi) in parallel, and a second unit (U1) comprising parallel inputs (Ei) for receiving a set (W) of binary words (wi) supplied in parallel by the first unit (U0), method characterized in that it comprises the steps consisting of:

- by means of the second unit (U1), read simultaneously, during a reading step, all or part of the binary words of the set (W) of binary words (wi) applied to its parallel inputs, and return, after a reading step, an instruction (Nj) for purging certain binary words from the set of binary words, and

- after reading by the second unit (U1) of all or part (a0-a4) of the binary words of the first set (W, W1) of binary words, provide the second unit (U1) with the word(s) (a5-a7) of the first set of binary words that have not been purged, and one or more binary words (b0-b4) of a second set of binary words (W', W2) provided by the first unit.

10. Method according to claim 9, comprising the step of, when a part of the second set (W2) of binary words has been purged, providing the second unit with the binary word(s) (b1-b7) of the second set of binary words that have not been purged and one or more binary words (c0) of a third set (W3) of binary words provided by the first unit.

11. Method according to one of claims 9 and 10, comprising the steps consisting of:

- when reading all or part (a0-a4) of the binary words applied to the parallel inputs of the second unit, read binary words (a0-a4) present on inputs with higher reading priority, and

- after purging a part (a0-a4) of the binary words of the first set (W1) of binary words, apply to the inputs with the highest reading priority of the second unit the words (a5-a7) of the first set of binary words previously applied to inputs of lower reading priority and which have not been purged, and apply to parallel inputs of lower reading priority of the second unit one or more binary words (b0-b4) of the second set (W2) of binary words.

12. Method according to one of claims 9 to 11, comprising the step consisting of providing, between the first and the second unit, a data matching unit (BDG) connecting the parallel outputs of the first unit (U0 ) to the parallel inputs of the second unit (Ul) and comprising at least two input registers (RI, R2) arranged in parallel and each having parallel inputs connected to the outputs of the first unit (U0), an output register ( R3) having outputs connected to the parallel inputs of the second unit (Ul), and a data reorganization circuit (BS) connecting outputs of the input registers to parallel inputs of the output register.

13. Method according to claim 12, comprising a step consisting of loading into one of the input registers (R1, R2) a new set of binary words provided by the first unit, at least when all the binary words of the set of binary words present in that input register have been purged.
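The data matching unit of claims 12 and 13 can be modelled in software as follows. This is a behavioural sketch only, resting on several assumptions: the class name `DataMatchingUnit`, the generator `word_sets` standing in for the first unit (U0), and the use of list slicing in place of the data reorganization circuit (BS) are hypothetical and are not taken from the claims.

```python
from itertools import count
from typing import Iterator, List


def word_sets(width: int) -> Iterator[List[str]]:
    """Hypothetical stand-in for the first unit (U0): yields a0.., b0.., c0.. sets."""
    for n in count():
        letter = chr(ord("a") + n % 26)
        yield [f"{letter}{i}" for i in range(width)]


class DataMatchingUnit:
    """Behavioural model with two input registers and one output view."""

    def __init__(self, source: Iterator[List[str]], width: int) -> None:
        self.source = source
        self.width = width
        self.regs = [next(source), next(source)]  # models R1 and R2
        self.head = 0     # index of the register holding the oldest words
        self.offset = 0   # number of words already purged from that register

    def output(self) -> List[str]:
        """Words presented to the second unit (models the content of R3)."""
        older = self.regs[self.head]
        newer = self.regs[1 - self.head]
        # The slice plays the role of the reorganization circuit (BS).
        return (older + newer)[self.offset:self.offset + self.width]

    def purge(self, purged: int) -> None:
        """Discard `purged` words, reloading a register once it is fully purged."""
        if not 0 <= purged <= self.width:
            raise ValueError("purge count must be between 0 and the width")
        self.offset += purged
        if self.offset >= self.width:
            self.regs[self.head] = next(self.source)  # load a new set
            self.head = 1 - self.head
            self.offset -= self.width


if __name__ == "__main__":
    bdg = DataMatchingUnit(word_sets(8), width=8)
    bdg.purge(5)
    print(bdg.output())  # ['a5', 'a6', 'a7', 'b0', 'b1', 'b2', 'b3', 'b4']
    bdg.purge(4)
    print(bdg.output())  # ['b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'c0']
```

In this model an input register is reloaded only once every word it holds has been purged, which is the minimum condition stated in claim 13; a hardware implementation could reload earlier or use more than two input registers.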

14. Method for producing, in a programmable gate array integrated circuit (FPGA), a device (DV2) executing a parallel data processing function requiring simultaneous reading steps of binary words (wi) in a set (W) of binary words carried by parallel data buses, the method comprising:

- a step (F01) of functional description of the processing function using a language with a high level of abstraction and using an untyped variable of unspecified size (FOI) to designate the binary words (wi) to be read simultaneously,

- a step (F02) of defining hardware constraints to be taken into account in the production of the device and including a choice of a degree of parallelism of the device,

- from the functional description and the hardware constraints, a step (F03) of hardware description of the device presenting the chosen degree of parallelism, using as target architecture a device conforming to one of claims 1 to 7 in which the second unit (U1) performs the data processing function, and

- from the hardware description of the device, a step (F04) of configuring the programmable gate array integrated circuit, to obtain the device presenting the chosen degree of parallelism.

15. Method according to claim 14, comprising the steps consisting of:

- test the device (F05),

- rework (F02) the definition of the hardware constraints of the device so as to modify its degree of parallelism, without modifying the functional description (F01) of the processing function,

- rework (F03) the hardware description of the device based on the reworked definition of the hardware constraints, and

- from the reworked hardware description of the device, reconfigure (F04) the programmable gate array integrated circuit, to obtain the device presenting the modified degree of parallelism.