WO2021165075A1

WO2021165075A1 - System on a chip for the binary tree summation of floating-point numbers

Info

Publication number: WO2021165075A1
Application number: PCT/EP2021/052921
Authority: WO
Inventors: Thierry GOUBIER
Original assignee: Commissariat A L'energie Atomique Et Aux Energies Alternatives
Priority date: 2020-02-18
Filing date: 2021-02-08
Publication date: 2021-08-26
Also published as: FR3107374A1; FR3107374B1

Abstract

System on a chip for the binary tree summation of floating-point numbers, comprising: - a pushdown automaton (1) configured to perform shift, reduction and acceptance operations; - an input (2) receiving a queue of floating-point numbers to be added and an item of information (eof) representative of the end of the summation of the floating-point numbers, which are processed by the automaton (1); - a programming memory (3) for the automaton (1), comprising a transition table (3a) for the automaton (1); - a memory (4) comprising a stack of states (4a) for the automaton (1); - a memory (5) comprising a stack of results (5a) of the operations performed by the automaton (1); - a floating-point adder (6) configured to add two floating-point values and to deliver, as output, their sum as a floating-point number; and - a disambiguator (7) configured to implement a disambiguation criterion making it possible to perform the shift operation when it is false and the reduction operation when it is true.

Description

DESCRIPTION

Title of the invention: System on a chip for the binary tree summation of floating-point numbers

The invention relates to a system on a chip for the binary tree summation of floating-point numbers.

[0002] The invention lies in the field of the production of hardware components for the acceleration of calculation.

Any floating-point operation in a computer carries a risk of error or loss of precision when standard arithmetic operators (IEEE 754) are used. This is the case with a naive, or iterative, sum, performed by accumulating over a set of numbers one element at a time.

[0004] A floating-point number is written as a sign, a mantissa, and a base raised to an exponent. For example, -1.3254 = -13254 × 10^-4: the sign is negative, the mantissa is 13254, the base is 10 (although in computing the base is usually 2), and the exponent is -4.
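As a minimal illustration of this decomposition, the following Python sketch splits the example value into sign, mantissa and exponent, first in base 10 and then in base 2. It relies only on the standard `decimal` and `math` modules; the variable names are illustrative and do not come from the invention.

```python
import math
from decimal import Decimal

# Base-10 view of the example in the text: -1.3254 = -13254 x 10^-4.
# as_tuple() returns sign (1 means negative), the mantissa digits, and the exponent.
sign, digits, exponent = Decimal("-1.3254").as_tuple()
mantissa10 = int("".join(str(d) for d in digits))
print(sign, mantissa10, exponent)

# Base-2 view, closer to what the hardware stores: x = m * 2**e with 0.5 <= |m| < 1.
x = -1.3254
m, e = math.frexp(x)
print(m, e)
```

The base-2 mantissa returned by `math.frexp` is itself a float, so `math.ldexp(m, e)` rebuilds `x` exactly.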

[0005] Algorithms that reduce (or eliminate) the error exist, but they require more operations, or at least slower ones. Sorting the data before summing (adding from the smallest to the largest) limits the error, at the cost of performing the sort. A compromise proposed in the literature is the binary tree sum, which gives a reduced error regardless of the distribution of values in the set, as illustrated in Doc1: "Accuracy and Stability of Numerical Algorithms" (2nd ed.), by Nicholas Higham, SIAM (2002).
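The difference between the two kinds of sum can be observed with a short Python experiment. This is only a sketch under stated assumptions: `naive_sum` and `pairwise_sum` are hypothetical helper names, the tree is built by simple halving (not necessarily the exact tree shape used by the invention), and `math.fsum` from the standard library serves as the reference value.

```python
import math

def naive_sum(values):
    """Left-to-right accumulation: one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values, lo=0, hi=None):
    """Binary tree sum over values[lo:hi], splitting the range in half."""
    if hi is None:
        hi = len(values)
    if hi - lo == 0:
        return 0.0
    if hi - lo == 1:
        return values[lo]
    mid = (lo + hi) // 2
    return pairwise_sum(values, lo, mid) + pairwise_sum(values, mid, hi)

data = [0.1] * 1_000_000
reference = math.fsum(data)                  # correctly rounded sum
naive_err = abs(naive_sum(data) - reference)
tree_err = abs(pairwise_sum(data) - reference)
print(naive_err, tree_err)
```

On this particular data set the naive error is several orders of magnitude larger than the tree error; other data sets may behave differently.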

[0006] However, the state of the art considers that this binary tree sum requires a priori knowledge of the number of values to be added, as in document Doc1, as well as in document Doc2: "SIMDizing Pairwise Sums: A summation algorithm balancing accuracy with throughput", by Barnaby Dalton, Amy Wang and Bob Blainey, WPMVP '14: Proceedings of the 2014 Workshop on Programming models for SIMD/Vector Processing, February 2014, pages 65-70, https://doi.org/10.1145/2568058.2568070; and in document Doc3: "Implementation of a low round-off summation method", by Ole Caprani, BIT Numerical Mathematics, Sept 1971.

[0007] Documents Doc1 and Doc2 describe an iterative tree-sum algorithm with an automaton and a stack. This algorithm requires a priori knowledge of the number of elements in the set, which constitutes a problem when this size is not known at the start of the sum.

Techniques from the field of algebraic (context-free) languages are known. They are used mainly to decide whether a sentence belongs to a language and to build an abstract structure over a sentence, as for example in document Doc4: "Parsing Techniques, A Practical Guide", by Dick Grune and Ceriel JH Jacobs, 2nd Edition, 2007, Springer, and in document Doc5: "Compilers: Principles, Techniques and Tools", by A. Aho, M. Lam, R. Sethi and J. Ullman, 2nd Edition, 2007, Pearson. Such approaches are not usually used for calculation, but they provide theoretical tools for analysing calculations.

An object of the invention is to alleviate the problems mentioned above, and in particular to make it possible to sum a number of floating-point values that is not known a priori.

Accordingly, one aspect of the invention proposes a system on a chip for the binary tree summation of floating-point numbers, comprising: a pushdown automaton (i.e. an automaton with a stack of states), configured to perform shift, reduction, and acceptance operations; an input receiving a queue of floating-point numbers to be added and an item of information representative of the end of the summation of the floating-point numbers, processed by the automaton; a programming memory for the automaton comprising a transition table of the automaton; a memory comprising a stack of states of the automaton; a memory comprising a stack of results of the operations performed by the automaton; a floating-point adder configured to add two floating-point values and output their floating-point sum; and a disambiguator configured to implement a disambiguation criterion, the shift operation being performed when the criterion is false and the reduction operation when it is true.

[0012] Such a system makes it possible to add a quantity of floating-point numbers along a binary tree without knowing this quantity a priori, while performing the minimum number of additions.

[0013] According to one embodiment, the disambiguation criterion is true if the depth of the stack of results is greater than the Hamming weight of the number of floating-point values received at the input and already processed by the automaton. The use of such a disambiguation criterion makes the sum deterministic and ensures that it is evaluated along the desired binary tree.

In one embodiment, the system comprises at least one additional adder.

[0016] The use of several adders makes it possible to speed up the system by performing several additions at the same time.

[0017] According to one embodiment, the transition table is of the LR(1) type: the table of an automaton that recognizes the sequence of floating-point values, read from left to right, with one element of look-ahead.

[0018] Established tools and algorithms exist for generating such a transition table, which, in this embodiment, is very small.

In one embodiment, the automaton is configured to complete its sum when it has to process information representative of the end of the summation of the floating numbers.

[0020] Thus the system knows when to stop the summation without a priori knowledge of the number of floating-point values to be added.

According to one embodiment, the system comprises a control module that controls the disambiguator so as to force the value of the criterion to true during a summation.

[0022] The system then performs a naive (iterative) sum, and thus reproduces, bit for bit, the results of code that uses naive sums. For processors offering both types of sum (naive and tree-based) over vectors in their instruction set, the original system implements the tree sum and this improvement implements the naive sum, with the same hardware.

In one embodiment, the system comprises a tree vector adder arranged between the input and the automaton, which performs the tree sum of the floating-point numbers of a vector of successive values received at the input and delivers this tree sum to the automaton.

[0024] Thus, the system can be accelerated by a factor related to the size of this vector adder, and the size of the stacks used by the system can also be reduced.

The invention will be better understood by studying a few embodiments described by way of non-limiting examples and illustrated by the appended drawing in which:

[0026] [Fig.1] schematically illustrates a summation system on chip, according to one aspect of the invention;

[0027] [Fig.2] schematically illustrates a transition table of the automaton of the system of Figure 1, according to one aspect of the invention;

[0028] [Fig.3] schematically illustrates the operation of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0029] [Fig.4] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0030] [Fig.5] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0031] [Fig.6] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0032] [Fig.7] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0033] [Fig.8] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0034] [Fig.9] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0035] [Fig.10] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0036] [Fig.11] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0037] [Fig.12] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0038] [Fig.13] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0039] [Fig.14] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0040] [Fig.15] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0041] [Fig.16] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0042] [Fig.17] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention;

[0043] [Fig.18] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention; and

[0044] [Fig.19] schematically illustrates an operating step of the automaton of the system of Figures 1 and 2, according to one aspect of the invention.

In all of the figures, elements having identical references are similar. Figure 1 illustrates, according to one aspect of the invention, a system on a chip for the binary tree summation of floating-point numbers, comprising: a pushdown automaton 1 configured to perform shift, reduction, and acceptance operations; an input 2 receiving a queue of floating-point numbers to be added and an item of information eof representative of the end of the summation of the floating-point numbers, processed by the automaton 1; a programming memory 3 for the automaton 1 comprising a transition table 3a of the automaton 1; a memory 4 comprising a stack of states 4a of the automaton 1; a memory 5 comprising a stack of results 5a of the operations performed by the automaton 1; a floating-point adder 6 configured to add two floating-point values and output their floating-point sum; and a disambiguator 7 configured to implement a disambiguation criterion, the shift operation being performed when the criterion is false and the reduction operation when it is true.

The present invention is based on the following principle relating to a tree sum and a stream of numeric values in the form of floating-point numbers.

A transition table for a non-deterministic pushdown automaton, together with an ambiguity resolution criterion, makes it possible to construct a binary tree sum as the values are received, without knowing the size of the data set to be summed before the last item is received.

The definition of a binary addition tree is as follows: a node of the tree represents either an addition and then has two children, or a numeric value and has no children. This is the definition of a strict or locally complete binary tree.

A tree sum corresponds to a binary tree of additions of minimum depth. If the number of values in the sum is a power of 2, then this tree is a perfect binary tree. If the number of values is not a power of 2, we consider the tree of minimum depth for which the following property is respected: each left child is a perfect binary tree. The tree has n-1 nodes with 2 children, and therefore corresponds to n-1 additions to obtain the final result.
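The tree shape defined above can be made concrete with a small Python sketch that builds the tree as nested tuples instead of computing a sum. The function names `tree_shape` and `count_additions` are hypothetical and serve only this illustration.

```python
def tree_shape(leaves):
    """Nested-tuple addition tree in which each left child is a perfect binary tree."""
    n = len(leaves)
    if n == 1:
        return leaves[0]
    mid = 1 << (n.bit_length() - 1)   # largest power of two <= n
    if mid == n:
        mid = n // 2                  # n is a power of two: split in half
    return (tree_shape(leaves[:mid]), tree_shape(leaves[mid:]))

def count_additions(tree):
    """Each internal node (a 2-tuple) stands for one addition."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_additions(tree[0]) + count_additions(tree[1])

five = tree_shape(["F1", "F2", "F3", "F4", "F5"])
print(five)                    # tree for five values
print(count_additions(five))   # number of additions, n - 1
```

For five values the left child is the perfect tree over F1 to F4 and the right child is the single leaf F5, which gives four additions in total.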

Input 2 receives the values one by one or all at once, and the oldest value received can be removed (thereby shifting all the values received), on a first in, first out (FIFO) basis. An eof data input may indicate that the transmission is complete and that no more values will be added to the input. This can be achieved by a signal which, while activated, means that the transmission of values continues; once it is deactivated, the transmission is terminated and no more values will be added to the input.

The transition table 3a of the automaton 1 is accessed in two dimensions. It has 3 rows and 2 columns, numbered from 1, and each cell (the intersection of a row and a column) is wide enough to hold one of 5 different values.

The stack of states 4a, of predetermined maximum size, can hold one of 3 different values per element. It supports reading the last element, removing it, and adding a new element.

Each element of the stack of results 5a can hold one floating-point value. As with the stack of states 4a, removing the last two elements and adding an element are supported, as is access to the size of the stack.

[0055] The floating-point adder 6 conforms to the IEEE 754 standard, taking two floating-point values and returning the result of the addition of these two values.

The pushdown automaton 1 is a so-called shift/reduce automaton, capable of three operations: shift, reduction ("reduce") and acceptance ("accept").

The automaton 1 performs these three operations as follows:

• Shift S: the automaton 1 removes a value from input 2 and adds this value to the stack of results 5a. The automaton 1 reads the state value at the top of the stack of states 4a, adds 1 to it (with a maximum of 3), and adds this new state value to the stack of states 4a.

• Reduction R: the automaton 1 removes two states from the stack of states 4a, reads the value at the top of the stack of states 4a after the removal, adds 1 to it (with a maximum of 3) and places this new value on the stack of states 4a. The automaton 1 removes two values from the stack of results 5a, places those two values at the input of the adder 6, activates the adder 6 to add them, and places the result on the stack of results 5a.

• Acceptance A: the automaton 1 removes a state from the stack of states 4a and a value from the stack of results 5a. It delivers this value as an output.

FIG. 2 schematically represents the transition table 3a of the automaton 1, which is organized like the table of a shift/reduce automaton from the literature (Doc5, Fig. 4.37, page 231), with one action column for a floating-point value (Fᵢ), one action column for the end of input (eof), and no Go_to column.

The automaton 1 follows the structure given in Fig. 4.36, page 230 of Doc5, in an optimized form that takes into account the small number of operations to be carried out. The stack of states 4a is initialized by pushing the value 1 onto the empty stack of states 4a.

The stack of results 5a is initialized as an empty stack.

Let a be the first element received (either F initially, or eof when the transmission is complete).

When the transfer is complete and all data has been processed (input FIFO 2 is empty), then a is equal to eof. Otherwise a is equal to F and has as associated value the oldest element removed from input FIFO 2.

This is illustrated by the algorithm of Annex 1.

The disambiguation test of the disambiguator 7 is calculated as follows: let i be the number of floating-point values processed by the automaton 1 (i.e. the last F absorbed is the i-th element of the sequence); let p be the number of bits set to 1 in the binary representation of i (i.e. the Hamming weight of i); the test is true if the depth of the stack of results 5a is greater than p, and false otherwise.
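The test itself can be sketched in a few lines of Python. The function names are hypothetical, and the loop below tracks only the depth of the stack of results, not the values: it shifts one value, then applies reductions for as long as the criterion is true.

```python
def hamming_weight(i):
    """Number of bits set to 1 in the binary representation of i."""
    return bin(i).count("1")

def criterion(depth, i):
    """True (reduce) when the results stack is deeper than the Hamming weight of i."""
    return depth > hamming_weight(i)

def depths_after_reductions(n):
    """Depth of the results stack after each of the first n values has been
    shifted and the reductions allowed by the criterion have been applied."""
    depth, depths = 0, []
    for i in range(1, n + 1):
        depth += 1                    # shift of the i-th value
        while criterion(depth, i):    # each reduction pops two values, pushes one
            depth -= 1
        depths.append(depth)
    return depths

print(depths_after_reductions(8))
```

In this model the depth settles at exactly the Hamming weight of i after each value, so the stack of results stays shallow even for long input streams.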

The operation of the automaton on a stream of 5 values, called F₁, F₂, F₃, F₄, F₅, is illustrated in FIG. 3.

Figures 4 to 16 detail the step-by-step operation of Figure 3.

FIG. 4 represents the initial state of the system.

FIG. 5 represents the reception of the floating-point number F₁ on the input FIFO 2.

FIG. 6 represents the taking into account of F₁, with a shift action, and the reception of a second floating-point number F₂ on the input FIFO 2.

FIG. 7 represents the result of the action of FIG. 6, and the taking into account of the second floating-point number F₂, with a shift action, and the reception of a third floating-point number F₃ and of a fourth floating-point number F₄ on the input FIFO 2.

FIG. 8 represents the result of the action of FIG. 7, and the taking into account of the third floating-point number F₃, with an ambiguous shift or reduction action, resolved as a reduction because the disambiguation criterion of the disambiguator 7 is true. In addition, FIG. 8 represents the reception of a fifth floating-point number F₅.

FIG. 9 represents the result of the action of FIG. 8 (with calculation of N₁ = F₁ + F₂), and the taking into account of the third floating-point number F₃, with a shift action.

FIG. 10 represents the result of the action of FIG. 9, and the taking into account of a fourth floating-point number F₄, with an ambiguous shift or reduction action, resolved as a shift because the disambiguation criterion of the disambiguator 7 is false.

FIG. 11 represents the result of the action of FIG. 10, and the taking into account of the fifth floating-point number F₅, with an ambiguous shift or reduction action, resolved as a reduction because the disambiguation criterion of the disambiguator 7 is true. The transfer of the values Fᵢ is finished.

FIG. 12 represents the result of the action of FIG. 11 (with the calculation of N₂ = F₃ + F₄), and the taking into account of the fifth floating-point number F₅, with an ambiguous shift or reduction action, resolved as a reduction because the disambiguation criterion of the disambiguator 7 is true.

FIG. 13 represents the result of the action of FIG. 12 (and the calculation of N₃ = N₂ + N₁), and the taking into account of the fifth floating-point number F₅, with a shift action, and the presence of the information eof representative of the end of the summation of the five floating-point numbers F₁, F₂, F₃, F₄ and F₅.

FIG. 14 represents the result of the action of FIG. 13, and the end of transmission of the floating-point numbers F₁, F₂, F₃, F₄ and F₅, taking into account the information eof marking the end of transmission of the data to be summed. The input FIFO 2 is empty. The information eof is taken into account by a reduction action.

FIG. 15 represents the result of the action of FIG. 14 (and the calculation of N₄ = F₅ + N₃), the action for taking eof into account being: accept.

FIG. 16 represents the accept action, in which N₄ is delivered at the output of the automaton 1. The system has returned to its initial state and can start a new sum.
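The walk-through of Figures 4 to 16 can be replayed with a small software model of the automaton. This is a sketch under stated assumptions: the contents of `TABLE` are inferred from the text (3 states, columns F and eof, 5 possible cell values) because FIG. 2 is not reproduced here, and the names `run_automaton`, `TABLE` and `EOF` are hypothetical. The `add` parameter stands in for the adder 6, so that a symbolic pairing function can be used to display the tree.

```python
SHIFT, REDUCE, ACCEPT, AMBIGUOUS, NOTHING = (
    "shift", "reduce", "accept", "shift/reduce", "nothing")
EOF = object()   # marker standing in for the eof information

# Assumed transition table 3a: rows are states 1..3, columns are F and eof.
TABLE = {
    (1, "F"): SHIFT,     (1, "eof"): NOTHING,
    (2, "F"): SHIFT,     (2, "eof"): ACCEPT,
    (3, "F"): AMBIGUOUS, (3, "eof"): REDUCE,
}

def run_automaton(values, add=lambda x, y: x + y):
    """Return (result, list of actions) for a non-empty stream of values."""
    states, results, trace = [1], [], []
    processed = 0                      # i: values already shifted
    stream = iter(values)
    a = next(stream, EOF)              # look-ahead element
    while True:
        action = TABLE[(states[-1], "eof" if a is EOF else "F")]
        if action == AMBIGUOUS:        # disambiguator 7
            deeper = len(results) > bin(processed).count("1")
            action = REDUCE if deeper else SHIFT
        trace.append(action)
        if action == SHIFT:
            states.append(min(states[-1] + 1, 3))
            results.append(a)
            processed += 1
            a = next(stream, EOF)
        elif action == REDUCE:
            del states[-2:]
            states.append(min(states[-1] + 1, 3))
            right = results.pop()
            left = results.pop()
            results.append(add(left, right))
        elif action == ACCEPT:
            states.pop()
            return results.pop(), trace
        else:                          # assumed meaning of the fifth cell value
            raise ValueError("end of input with nothing to sum")

tree, actions = run_automaton(["F1", "F2", "F3", "F4", "F5"],
                              add=lambda x, y: (x, y))
print(actions)
print(tree)
```

With the symbolic pairing function the model yields the sequence shift, shift, reduce, shift, shift, reduce, reduce, shift, reduce, accept, which matches the order of actions described for Figures 6 to 16.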

As the floating-point values arrive, the invention builds iteratively, without knowing their number in advance, the same result as the pseudo-code in Annex 2.
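This equivalence can also be checked empirically on small inputs. The Python sketch below is an illustration, not the hardware design: `streaming_tree_sum` is a simplified software model that keeps only the stack of results and the Hamming-weight criterion (the stack of states is left out), and `pairwise_sum` and `pairwise_binary_sum` transcribe the recursive functions of Annex 2. Both take the addition as a parameter so that tree shapes, rather than rounded floats, are compared.

```python
def streaming_tree_sum(values, add):
    """One pass over a non-empty stream, with no knowledge of its length."""
    results, processed = [], 0
    for v in values:
        # v is the look-ahead: reduce while the criterion is true
        while len(results) >= 2 and len(results) > bin(processed).count("1"):
            right = results.pop()
            left = results.pop()
            results.append(add(left, right))
        results.append(v)              # shift
        processed += 1
    while len(results) >= 2:           # eof: reduce until one value is left
        right = results.pop()
        left = results.pop()
        results.append(add(left, right))
    return results.pop()               # accept

def pairwise_binary_sum(L, add):
    """Annex 2: tree sum of a list whose size is a power of two."""
    if len(L) == 1:
        return L[0]
    mid = len(L) >> 1
    return add(pairwise_binary_sum(L[:mid], add),
               pairwise_binary_sum(L[mid:], add))

def pairwise_sum(L, add):
    """Annex 2: tree sum of a list of any non-zero size."""
    if len(L) == 1:
        return L[0]
    mid = 1 << (len(L).bit_length() - 1)   # 2^(log2(size) truncated)
    if mid == len(L):
        return pairwise_binary_sum(L, add)
    return add(pairwise_binary_sum(L[:mid], add), pairwise_sum(L[mid:], add))

pair = lambda x, y: (x, y)
same = all(streaming_tree_sum(list(range(n)), pair) == pairwise_sum(list(range(n)), pair)
           for n in range(1, 200))
print(same)
```

A check of this kind covers only the sizes that are tried; it complements the argument given below and does not replace it.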

The proof of this equivalence between the two algorithms is as follows.

The expression of all binary summation trees over a set of values is the following algebraic grammar:

N → NN | F where N is defined as a sum node of its two children or a leaf F (containing a floating point value).

Compiling this grammar with a GLR parser generator (for example Bison or SmaCC) gives the transition table of a non-deterministic pushdown automaton, which generates all the possible binary sum trees from an input of floating-point values processed from left to right with a look-ahead of 1 element, the calculation ending when the last element is received.

[0084] The disambiguation criterion ensures that we build only the tree of which each left child is a perfect balanced binary tree, and this in a deterministic manner.
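To give a sense of how much ambiguity the criterion removes, the following Python sketch enumerates every addition tree that the grammar N → N N | F derives for a given sequence of leaves. The function name `all_sum_trees` is hypothetical; trees are represented as nested tuples.

```python
def all_sum_trees(leaves):
    """All binary addition trees over the leaves, kept in their input order."""
    if len(leaves) == 1:
        return [leaves[0]]
    trees = []
    for split in range(1, len(leaves)):
        for left in all_sum_trees(leaves[:split]):
            for right in all_sum_trees(leaves[split:]):
                trees.append((left, right))
    return trees

counts = [len(all_sum_trees(tuple(range(n)))) for n in range(1, 7)]
print(counts)

trees5 = all_sum_trees(("F1", "F2", "F3", "F4", "F5"))
selected = ((("F1", "F2"), ("F3", "F4")), "F5")
print(trees5.count(selected))
```

The counts grow as the Catalan numbers, so five values already admit 14 different trees, of which the criterion retains a single one.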

The literature on parsers recalls that the transition table of the non-deterministic pushdown automaton is a compilation of the top-down strategy expressed by the algebraic grammar of the tree sum (see Doc4, page vii), and therefore that the transition table of the automaton is equivalent to this grammar.

FIG. 17 represents an embodiment of the invention, making it possible to implement the FADDA and FADDV instructions of the ARMv8 SVE instruction set, as described in the document "ARM Architecture Reference Manual Supplement, The Scalable Vector Extension (SVE), for ARMv8-A ", Copyright © 2017-2018 ARM Limited or its affiliate.

FADDV is an instruction which returns the tree sum of the values of a vector, this vector containing between 2 and 64 values. The system of the invention, without modification, implements the FADDV instruction.

FADDA is an instruction which returns the naive (iterative) sum over the values of a vector, this vector containing between 2 and 64 values. The modification of the system consists in adding a control module 8 and a multiplexer 11 between the disambiguator 7 and the automaton 1. The multiplexer 11 has one input from the disambiguator 7 and one input permanently set to "true". If the corresponding input of the automaton is always "true", the system performs an iterative sum.

The resulting system works as follows: if the module 8 controls the multiplexer 11 so that it systematically chooses its "true" input and not the output of the disambiguator 7, then the system performs the FADDA instruction. Otherwise the system performs the FADDV instruction, which is a tree sum.
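The effect of the multiplexer can be modelled in software with a single flag. The sketch below is a simplified model that keeps only the stack of results; `summation` and `force_true` are hypothetical names, with `force_true=True` playing the role of the multiplexer 11 selecting its constant "true" input.

```python
def summation(values, force_true=False):
    """Sum a non-empty stream; the flag selects naive (True) or tree (False) order."""
    results, processed = [], 0
    for v in values:
        while len(results) >= 2 and (
                force_true or len(results) > bin(processed).count("1")):
            right = results.pop()
            left = results.pop()
            results.append(left + right)
        results.append(v)
        processed += 1
    while len(results) >= 2:           # end of input: finish the reductions
        right = results.pop()
        left = results.pop()
        results.append(left + right)
    return results.pop()

def naive(values):
    """Reference left-to-right accumulation."""
    total = values[0]
    for v in values[1:]:
        total += v
    return total

data = [1e16, 1.0, 1.0, 1.0]
print(summation(data, force_true=True))    # same order of additions as naive()
print(summation(data, force_true=False))   # tree order: (1e16 + 1.0) + (1.0 + 1.0)
```

On this input the two modes give different floating-point results, because each 1.0 added alone to 1e16 is lost to rounding while the pair 1.0 + 1.0 is not.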

This same extended system is able to implement the fredosum and fredsum instructions of the vector extension of the RISC-V instruction set, following the same process: if the module 8 controls the multiplexer 11 so that it systematically chooses its "true" input, the system performs the fredosum instruction; otherwise the system performs the fredsum instruction, which is a tree sum.

The fredosum and fredsum instructions of the vector extension of the RISC-V instruction set are described in the document RISC-V "V" Vector Extension, Version 0.8-draft-20191117.

A modification also makes it possible to implement the RISC-V instructions that widen the result (fwredosum and fwredsum), by taking input 2 to be single width (e.g. a 32-bit float) and the stack 5a and the adder 6 to be double width (e.g. a 64-bit double).

FIG. 18 represents an embodiment of the invention aimed at accelerating the processing operations by performing several sums in parallel, while keeping the same characteristics.

A tree vector adder 9 of size 2^k is added to the system, placed between the input 2 and the automaton 1. It takes as input a vector V (of length 2^k) and produces at its output a value Fᵢ equal to the tree sum of the values contained in V.

A multiplexer 10 chooses either the value Fᵢ at the output of the adder 9 or the value Fᵢ at the output of the input 2, and transmits it to the automaton 1. This multiplexer 10 systematically chooses the output of the adder 9 if the adder has one; otherwise it chooses the output Fᵢ of the input 2.

The following behavior is added to input 2: as long as the data transmission is not complete, input 2 waits for 2^k values.

If input 2 contains 2^k values or more, it extracts the 2^k oldest values in the form of the vector V and sends V to the tree vector adder 9, whose result Fᵢ is presented to the multiplexer 10.

If the input FIFO 2 holds fewer than 2^k values and the data transmission is finished, then it behaves as in the version without this variant: it presents the oldest value Fᵢ to the multiplexer 10.

If input 2 is empty and transmission is complete, then input 2 presents the eof information to multiplexer 10.
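The behavior of this front end can be sketched in Python as follows. The sketch works on a complete list for simplicity, whereas the hardware decides block by block as values arrive; `tree_sum_block` models the vector adder 9 and `front_end` models what input 2 and multiplexer 10 present to the automaton. Both names are hypothetical.

```python
def tree_sum_block(block, add):
    """Perfect tree sum of a block whose length is a power of two (adder 9)."""
    while len(block) > 1:
        block = [add(block[j], block[j + 1]) for j in range(0, len(block), 2)]
    return block[0]

def front_end(values, k, add):
    """Items presented to the automaton: one tree sum per full block of 2**k
    values, followed by the leftover values one by one."""
    size = 1 << k
    full = len(values) - len(values) % size
    items = [tree_sum_block(values[j:j + size], add) for j in range(0, full, size)]
    return items + list(values[full:])

pair = lambda x, y: (x, y)
print(front_end(["F1", "F2", "F3", "F4", "F5", "F6", "F7"], 1, pair))
print(front_end(list(range(8)), 3, lambda x, y: x + y))
```

For n input values the automaton then processes n // 2^k block sums plus n % 2^k single values instead of n values, which is where the speed-up and the smaller stacks mentioned earlier come from.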

[0100] As illustrated in FIG. 19, the embodiment of FIG. 18 can be combined with the embodiment of FIG. 17 with the control module 8 making it possible to constrain the value of the criterion to true during a summation.

When a naive sum is to be carried out (FADDA or fredosum), the tree vector adder 9 cannot be used to parallelize the sum. The control module 8 therefore inhibits the tree vector adder 9 (the control module 8 is configured to indicate to input 2 that the vector adder 9 must be inhibited), and forces the multiplexer 11 to systematically choose its "true" input.

On the other hand, when a tree sum is to be carried out (FADDV or fredsum), the control module 8 activates the tree vector adder 9 (the control module 8 is configured to indicate to input 2 that the vector adder 9 must be activated), and forces the multiplexer 11 to systematically choose the input delivered by the disambiguator 7.

[0103] Thus the system of Figure 19 is capable of performing the two different sums of Figure 17, depending on the state of the control module 8.

Annex

[Annex 1]

Forever
    let s be the state at the top of the stack of states 4a
    if a is empty then
        as long as FIFO 2 is empty and the transmission is not terminated, wait
        if FIFO 2 is empty and the transmission is finished then
            a = eof
        otherwise
            a = F, with as its value the oldest value of FIFO 2 (which is removed)
    let action = transition(s, a)
    if action is ambiguous (both shift and reduction) then
        if disambiguator 7 is true then
            action = reduction
        otherwise
            action = shift
    if action = shift then
        push on the stack of states 4a the value min(s + 1, 3)
        push on the stack of results 5a the value of a
    if action = reduction then
        pop two states from the stack of states 4a
        push on the stack of states 4a the state at the top of the stack of states 4a + 1 (with 3 as the maximum)
        pop two values from the stack of results 5a, and push on the stack of results 5a the sum (performed by adder 6) of these two values
    if action = accept then
        pop the value at the top of the stack of states 4a
        pop the top of the stack of results 5a and place it on the sum output (the result) of the system
        signal that the result is available
    if action = nothing then
        loop as long as there is no eof at the head of FIFO 2

[Annex 2]

function pairwiseSum (L list of floats)
    if size of L = 1
        return first and only element of L
    mid = 2^(log₂(size of L) truncated)
    if (mid = size of L)
        return pairwiseBinarySum (L)
    else
        return pairwiseBinarySum (L from 1 to mid) + pairwiseSum (L from mid + 1 to size of L)

function pairwiseBinarySum (L list of floats)
    if size of L = 1
        return first and only element of L
    mid = size of L >> 1
    return pairwiseBinarySum (L from 1 to mid) + pairwiseBinarySum (L from mid + 1 to size of L)

Claims

1. Binary tree-based floating-point summation system on chip comprising:

- a pushdown automaton (1) configured to perform shift, reduction, and acceptance operations;

- an input (2) receiving a queue of floating-point numbers to be added and an item of information (eof) representative of the end of the summation of the floating-point numbers, processed by the automaton (1);

- a memory (3) for programming the automaton (1) comprising a transition table (3a) of the automaton (1);

- a memory (4) comprising a stack of states (4a) of the automaton (1);

- a memory (5) comprising a stack of results (5a) of the operations performed by the automaton (1);

- a floating point adder (6) configured to add two floating point values and output their floating point sum; and

- a disambiguator (7) configured to implement a disambiguation criterion making it possible to perform the shift operation when it is false and to perform the reduction operation when it is true.

2. System according to claim 1, wherein the disambiguation criterion is true if the depth (pf) of the stack of results (5a) is greater than the Hamming weight (p) of the number (i) of floating-point values received at the input (2) and already processed by the automaton (1).

3. System according to one of the preceding claims, comprising at least one additional adder (6).

4. System according to one of the preceding claims, wherein the transition table (3a) is of the LR(1) type.

5. System according to one of the preceding claims, in which the automaton (1) is configured to complete its sum when it has to process information (eof) representative of the end of the summation of the floating numbers.

6. System according to one of the preceding claims, comprising a control module (8) for controlling the disambiguator (7) to force the value of the criterion to be true during a summation.

7. System according to one of the preceding claims, comprising a tree vector adder (9) arranged between the input (2) and the automaton (1), making it possible to perform the tree sum of the floating-point numbers of a vector (V) of successive floating-point numbers received at the input (2), and delivering their tree sum to the automaton (1).