GB2247328A

GB2247328A - Data processing system

Info

Publication number: GB2247328A
Application number: GB9017416A
Authority: GB
Inventors: Robert Kurt Willia Haselwimmer
Original assignee: Philips Electronic and Associated Industries Ltd
Current assignee: Philips Electronics UK Ltd
Priority date: 1990-08-08
Filing date: 1990-08-08
Publication date: 1992-02-26
Also published as: GB9017416D0

Abstract

A data processing system comprises a systolic array of data processing nodes Nij effectively arranged in rows i and columns j. Data communication paths are provided from each node to the next in the corresponding row, from each node to the next in the corresponding column, and from each node to that which lies in both the preceding row and the preceding column. Because each path gives rise to at least one data period delay the input data has to be interspersed with zeros, which means that not more than one node in any group of at least three adjacent nodes is processing useful data at any given time. The nodes of each group are implemented by a single respective data processing unit. The nodes which make up each group are chosen to be ones which belong to the same diagonal of the array, rather than to the same row or column. This facilitates the construction of systems in which each group includes more than three nodes and in which the number of nodes in each group can be varied dynamically.

Description

DATA PROCESSING SYSTEM

This invention relates to a data processing system which takes the form of a set of data processing nodes which are provided with data communication paths therebetween in such manner that said nodes are effectively arranged in rows i and columns j with each node Ni,j provided with respective data communication paths to (a) the next node (if any) Ni,j+1 in the corresponding row, (b) the next node (if any) Ni+1,j in the corresponding column, and (c) that node (if any) Ni-1,j-1 which lies in both the preceding row and the preceding column, each said data communication path including delay means for delaying data communicated via the relevant path by an integral number of data periods D, parallel output data paths being provided from the nodes lying in the first row and the first column, said system comprising a plurality of processing units for implementing the data processing required at said nodes, which units are each assigned to a respective subset of said nodes for implementing the data processing required at all the nodes of the corresponding subset.

A system of this general kind is known from US patent no. 4493048, which is incorporated herein by reference and may be used to implement, for example, the factoring of a band matrix into lower and upper triangular matrices L and U, or the formation of the product C of a pair of band matrices A and B. The disclosure in US 4493048 is mainly concerned with systems in which a processing unit is provided corresponding to each of the processing nodes, these units together forming a systolic array. However it is pointed out in that disclosure (column 5 lines 47-52 and column 6 lines 57-62) that, because blanks have to be provided in the input data in order to ensure that at any given time the data supplied to each node via the various input paths thereto is actually that which the unit requires for processing at that time, in the particular systems described only one out of every three processing units along a row or column is actually usefully employed at any given time, so that only one processing unit is actually required for every three nodes along the rows or columns.

The known systems are very special-purpose, in that efficient use is made of all the processing units only when the system processes matrices of a specific bandwidth. Moreover, at least w1w2/3 processing units are required to form the product of two matrices having bandwidths w1 and w2 respectively or to decompose a band matrix having a bandwidth w1+w2-1 into upper and lower triangular matrices having bandwidths w1 and w2 respectively. It is an object of the present invention to enable these disadvantages to be mitigated.

According to the invention a system as defined in the first paragraph is characterized in that for each said subset which contains more than one node each of the nodes constituting the relevant subset lies in a row and column which directly succeed the row and column respectively in which another of the nodes constituting the relevant subset lies and/or directly precede the row and column respectively in which another of the nodes constituting the relevant subset lies.

It has now been recognised that choosing the nodes which make up each subset in this manner, i.e. so that the values of i and j for each node of each subset which contains more than one node are both one greater than, and/or are both one less than, the values of i and j respectively for another node of the same subset, allows systems in which the number m of nodes (if available) implemented by each processing unit is greater than three to be constructed in a comparatively simple manner and, moreover, can facilitate the changing of this number in a given system as and when this is required. Each subset forms at least part of a given diagonal of the array of nodes effectively formed by the various (notional) rows and columns, the particular diagonal being determined by the value of i-j for the node(s) of the relevant subset. The complete diagonal is formed by all the nodes which have this value for i-j, i.e. by all the subsets for which the constituent nodes have this value for i-j, and, if the aforesaid integral number is appropriately chosen for each said data communication path, the active nodes step steadily along this (notional) diagonal m (notional) nodes apart towards the relevant node lying in the first column or the first row as processing proceeds. Therefore if each subset comprises m nodes (if m nodes are available for inclusion in the relevant subset) where m is the same for each subset, then for each diagonal, i.e. for each given value of i-j, a corresponding operation control line may be provided to all the processing units which implement the processing required at the relevant nodes.

Control means may then be provided for supplying operation codes successively and cyclically to each of these control lines to control the relevant processing units to implement the processing required at the corresponding m nodes successively and cyclically in step both with each other and with successive data periods. If m is greater than two the aforesaid integral number may conveniently be chosen to be equal to (m-2) for each said path from a node Ni,j to a node Ni,j+1 if j ≥ i, and for each said path from a node Ni,j to a node Ni+1,j if j ≤ i, and to be equal to unity for all the other said paths. If this is the case said control means may comprise further delay means between the lines of each pair of said operation control lines for which said given value differs by unity, for communicating operation codes from one line of the pair to the other after a delay of (m-2)D. The number m of nodes in each subset, i.e. the number of nodes implemented by each processing unit, can then be determined and/or changed in a simple manner by suitably determining and/or changing the cycle of operation codes applied to each operation control line and the delays produced by those delay devices whose delay is required to be (m-2)D. This can be done by supplying an m-determining code to said control means and to those data communication path delay means whose delay is required to be (m-2)D to thereby control m between specific ones of a plurality of values.
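The delay rule stated above can be written down as a small lookup. The following is a minimal sketch only: the function name `path_delay` and the direction labels are illustrative assumptions, and the conditions are read as j ≥ i for the row-wise paths and j ≤ i for the column-wise paths, which is how the description of Figure 5 characterises them (paths above and below the centre diagonal respectively).

```python
def path_delay(i, j, direction, m):
    """Delay, in data periods D, of the path leaving node N(i, j).

    direction is one of (labels are assumptions made for this sketch):
      "east"      : path to N(i, j+1), next node in the same row
      "south"     : path to N(i+1, j), next node in the same column
      "northwest" : path to N(i-1, j-1), preceding row and column
    """
    if m <= 2:
        raise ValueError("the rule is stated for m greater than two")
    if direction == "east":
        # Row-wise paths on or above the centre diagonal carry (m-2)D.
        return m - 2 if j >= i else 1
    if direction == "south":
        # Column-wise paths on or below the centre diagonal carry (m-2)D.
        return m - 2 if j <= i else 1
    if direction == "northwest":
        return 1
    raise ValueError(f"unknown direction: {direction!r}")
```

With m = 3 every path has a delay of one data period, and with m = 4 the selected paths have a delay of two data periods, in line with the two arrays described with reference to Figures 2a and 5.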

Although parallel input data paths can be provided to the nodes lying in the last row and the last column such input data paths are preferably provided to the nodes lying in the first row and the first column as is known per se from US patent 4493048.

This can facilitate the creation of a multi-purpose system, i.e. a system which can be switched to perform different tasks, for example the decomposition of an input band-matrix into "lower" and "upper" triangular matrices and the formation of the product of two matrices, respectively.

Embodiments of the invention will now be described, by way of example, with reference to the accompanying diagrammatic drawings, in which
Figure 1 illustrates a mathematical operation which may be performed by means of a system in accordance with the invention,
Figures 2a and 2b show a systolic array of processing nodes which may be used to perform the operation of Figure 1,
Figures 3a and 3b illustrate a given stage of processing in the array of Figure 2,
Figure 4 shows how processing units, each implementing a respective group of three processing nodes in the array of Figure 2, may be controlled, and how further such units may be added to increase the size of the array implemented,
Figure 5 shows another systolic array of processing nodes which may be used to perform the operation of Figure 1,
Figure 6 illustrates a given stage of processing in the array of Figure 5,
Figure 7 illustrates how the utilisation of the processing units in a given system can be increased compared with what would otherwise be the case when the amount of input data to be processed is smaller than that for which the system is designed,
Figure 8 shows how a plurality of processing units effectively form a substantially triangular array in a particular embodiment of the invention,
Figure 9 shows the communication paths between each of specific processing units in Figure 8 and their nearest neighbours,
Figure 10 shows the communication paths between each of further specific processing units in Figure 8 and their nearest neighbours,
Figure 11 shows the communication paths between each of yet further specific processing units in Figure 8 and their nearest neighbours,
Figure 12 shows a possible construction for specific processing units in Figure 8,
Figure 13 shows a possible construction for other specific processing units in Figure 8,
Figure 14 shows how data communication paths between the array of Figure 8 and a host computer may be arranged,
Figure 15 shows a possible substantially general-purpose construction for processing units in an embodiment of the invention,
Figure 16 shows a logic circuit for controlling various components of the construction of Figure 15,
Figure 17 shows how an array of processing units constructed in accordance with Figures 15 and 16 may be controlled by means of a control unit, and
Figure 18 shows an alternative configuration for an array of processing units.

Figure 1 illustrates a mathematical operation which may be performed by means of a system in accordance with the invention.

This operation consists in the decomposition of a banded n x n matrix A of bandwidth 2w-1 into the product of a "lower" banded triangular n x n matrix L and an "upper" banded triangular n x n matrix U, where all the points outside the band in each matrix are zero. As is well-known, such an operation can often facilitate the solution of a system of linear simultaneous equations such as arises, for example, during the solution of a partial differential system. (When solving a partial differential system, for example in order to simulate a semiconductor device, the solution space is discretised onto a finite mesh of some sort and then a set of inter-relationships is created between the variables on neighbouring mesh-points).

By definition

$$a_{i,j} = \sum_{k=1}^{i} l_{i,k}\,u_{k,j}$$

for i ≤ j (upper triangle) and

$$a_{i,j} = \sum_{k=1}^{j} l_{i,k}\,u_{k,j}$$

for i > j (lower triangle), which means that the values of li,j for i > j and ui,j for i ≤ j can be expressed as a form of recurrence relation in terms of the elements of L to the west, elements of U to the north, and ai,j.

If the last term in each dot-product summation is separated out one obtains in total ui,j = 0 for i > j and

$$u_{i,j} = a_{i,j} - \sum_{k=1}^{i-1} l_{i,k}\,u_{k,j}$$

for i ≤ j, and similarly li,j = 0 for i < j, li,j = 1 for i = j, and

$$l_{i,j} = \frac{1}{u_{j,j}}\left(a_{i,j} - \sum_{k=1}^{j-1} l_{i,k}\,u_{k,j}\right)$$

for i > j, where li,i is explicitly defined as 1 so that L and U can be determined uniquely, given A. It will be noted from Figure 1 that the bands in the matrices L and U only fill out to the same extent as the band in the matrix A, so that the operations required by the formulae above need only be carried out for those elements of A in the main band. All the elements of L and U outside what would be the main band are known to be zero implicitly.
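The recurrences can be illustrated in software. In the following minimal sketch each element of U is the corresponding element of A less a dot product of already known elements of L and U, each element of L below the diagonal is the same quantity divided by the diagonal element of U in its column, and the diagonal of L is unity. The function name `lu_banded`, the use of Python lists with zero-based indices, and the absence of pivoting (all diagonal elements of U are assumed to be non-zero) are assumptions made for this example; it is a sequential reference calculation, not a model of the systolic array.

```python
def lu_banded(a, w):
    """Decompose a banded n x n matrix of bandwidth 2w-1 into L and U.

    a is a list of rows; indices are zero-based here, unlike the text.
    Only elements inside the band are computed; the rest stay zero.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        l[i][i] = 1.0  # unit diagonal of L, defined explicitly
    for i in range(n):
        # Row i of U (upper triangle, i <= j), restricted to the band.
        for j in range(i, min(n, i + w)):
            s = sum(l[i][k] * u[k][j] for k in range(max(0, j - w + 1), i))
            u[i][j] = a[i][j] - s
        # Column i of L (lower triangle, r > i), restricted to the band.
        for r in range(i + 1, min(n, i + w)):
            s = sum(l[r][k] * u[k][i] for k in range(max(0, r - w + 1), i))
            l[r][i] = (a[r][i] - s) / u[i][i]
    return l, u
```

For a tridiagonal matrix (w = 2) the inner sums contain at most one term, which reflects the remark that the operations need only be carried out for elements in the main band.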

Figure 2 shows a systolic array which may be used to perform these operations, Figure 2a showing the actual array together with the manner in which the elements of the band of the matrix A have to be ordered for input, and Figure 2b indicating the arithmetic operations which are performed at the various processing nodes N (indicated by squares, circles or a diamond shape in accordance with their required function). Figure 2a corresponds broadly to Figure 11 of the US patent 4493048 cited previously and the array or set of nodes which has been chosen for illustration is a very small one of 4 x 4 nodes for simplicity (which means that input matrix A can have a bandwidth of 2 x 4 - 1 = 7 at the most, in this example). As will be seen from Figure 2a the various processing nodes are effectively (although not necessarily physically) arranged in rows i and columns j in that communication paths (each denoted by "D" which signifies in addition that the data delay in each path is one data period) are provided from each node Ni,j to the next (if any) Ni,j+1 in the corresponding row and from each node Ni,j to the next (if any) Ni+1,j in the corresponding column. In addition communication paths D are provided from each node Ni,j to that node (if any) Ni-1,j-1 which lies in the preceding row and the preceding column.

As will be seen from Figure 2b the nodes denoted by circles multiply together the operands received from the north and west, subtract the result from the operand received from the south-east and output the result to the north-west. The nodes denoted by squares multiply together the operands received from the south-east and the west and output the result to both the north-west and the south if they lie in the top row of the array, and multiply together the operands received from the south-east and the north and output the result to both the north-west and the east if they lie in the left-hand column of the array. The node denoted by a diamond outputs the operands received from the south-east unchanged to the north-west, their reciprocals to the south, and unity operands to the east.
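The three node functions just described can be summarised as follows. This is a minimal sketch in which the function names are illustrative assumptions; it shows only the arithmetic performed on one set of operands and does not model the delays, the onward routing of operands, or the handling of the padding zeros in the input data.

```python
def circle_node(north, west, south_east):
    """Inner node: value sent on to the north-west."""
    return south_east - north * west


def square_node(south_east, other):
    """Boundary node in the top row or left-hand column.

    `other` is the operand from the west (top row) or from the north
    (left-hand column). The product is sent to the north-west and also
    to the south (top row) or to the east (left-hand column).
    """
    return south_east * other


def diamond_node(south_east):
    """Corner node: returns (to north-west, to south, to east).

    Assumes a non-zero operand; padding zeros are not modelled here.
    """
    return south_east, 1.0 / south_east, 1.0
```

The unity operand sent east by the diamond node leaves the first row of U equal to the first row of A, while the reciprocal sent south divides the first column of A by the first diagonal element of U, which is what the decomposition requires of the first row and column.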

The band of the matrix A is inputted along the south and east sides of the array after formatting in the manner indicated in Figure 2a, and from a consideration of the arithmetic functions performed at the various nodes it will be evident that the bands of the lower and upper triangular matrices L and U are outputted from the west and north sides of the array as shown. As the elements of L and U leave the west and north sides they are formatted in the same manner as the inputted elements of the matrix A. The main diagonal of L, li,i is known to be 1 implicitly, so it is ui,i which is outputted from the top-left node in the array. The array may be extended at will by adding more nodes denoted by squares to the top row and the left-hand column and more nodes denoted by circles to the remaining rows and columns.

The various zeros shown in Figure 2a in the input data have to be provided because of the single data period delay D occurring in the communication path between each node and its neighbour. They ensure that each time a node receives non-zero input operands over its various input communication paths these operands are collectively those on which the relevant arithmetic operation is required to be performed at the relevant time. Because of the presence of these zeros, in fact along any row or column only one in every three of the various processing nodes is operating on non-zero data during any given data period, i.e. their effective useful utilisation is only 33%. Thus in theory subsets of three such nodes could each be implemented by a single processing unit, as has been recognised in the US patent 4493048 cited previously (lines 47-52 of column 5 thereof, and lines 57-62 of column 6). It has now been recognised that, with the arrangement of Figure 2, it is also the case that along any diagonal extending in a south-east to north-west direction only one in three of the various processing nodes is operating on non-zero data during any given data period.

Thus subsets of three consecutive nodes along these diagonals can themselves each be implemented by a single processing unit. Such implementation has been found to have certain advantages, as will now be elaborated upon.

Figure 3a illustrates a given stage of processing in the array of Figure 2a, the nodes which are processing useful data during the relevant data period being shown shaded, and the remainder being shown in outline only. Subsets of three nodes lying in consecutive positions along south-east to north-west diagonals are denoted by reference numerals 10, from which it will be seen that only one node of every subset is usefully operative during the relevant data period (and during every other data period). The node activity in effect steps on one position to the north-west for each successive data period. Figure 3b is merely a geometric transformation of the array of Figure 3a by rotating it anti-clockwise through 135° and shearing the central columns upwards. It illustrates the fact that the subsets 10 of nodes in Figure 3a, which as has been pointed out can each be implemented by means of a single processing unit, can be considered as a succession of rows of such subsets or units, the number of subsets or units in the bottom row being equal to the number of diagonals in the original array, i.e. to 2n-1 where n is the length of a side of the original array, and the number of nodes in the subsets of the central column and implemented by the processing units corresponding to these subsets being at least equal to the number of nodes in the central diagonal of the array, i.e. to n. It also illustrates the fact that processing units implementing the various subsets and which lie adjacent one another in a given row operate in different phases, and such processing units which lie in the same column operate in the same phase.

Thus, if desired, processing units each implementing one of the subsets 10 of nodes of Figure 3b may be controlled to operate in their respective phases, i.e. to perform the operations required in the various phases, by operation codes inputted successively and cyclically on a common operations code number bus 11 as illustrated in the full-line portion of Figure 4, the bus 11 communicating with an operation control line 60 for the units 10 of the central column directly to cause these to perform the operations required in phases 1, 2 and 3 in succession and cyclically during successive data periods, and communicating with respective similar operation control lines 60' for the units 10 lying in respective columns on either side via successive single data period delay devices 12 as shown. Figure 4 also illustrates in dashed lines how further nodes may be added to the array by, where necessary, providing further processing units 10' and further delay devices 12'. More particularly, a further processing unit 10' has been added at each end of each row, those added at the ends of the bottom row being controlled via respective further delay devices 12' and those added at the ends of the top row being controlled by the same control signal as the corresponding unit 10 in the bottom row. Moreover those units 10 which originally were not required to implement their full complement of three nodes in the original arrangement now implement one more node each as also shown in dashed lines. In effect the original "triangle" of nodes of height n = 4 nodes and base 2n - 1 = 7 nodes which implements an n x n = 4 x 4 array has been enlarged to a height of n = 5 nodes and a base of 2n - 1 = 9 nodes, to effectively implement a 5 x 5 node array. Obviously still larger arrays may be implemented in a similar way as required, i.e. by adding further processing units at the ends of each row, implementing further nodes by those units which are not fully utilised, and providing further rows of units if required, so that the resulting "triangle" of implemented nodes is always of height n and base 2n - 1, where n is the length of a side of the array to be implemented. Thus implementation of subsets of three consecutive processing nodes along the diagonals of an array of the form described with reference to Figure 2 by respective processing units can facilitate both the control of these units to operate in the required phases and the addition of further such units (or the removal of some units) if required in view of the bandwidths of the matrices which the array is required to handle.
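The grouping of nodes into units along the diagonals can be sketched as follows. The function name `assign_units` and the unit key (d, r) are illustrative assumptions, as is the choice to count positions along each diagonal from the node lying in the first row or first column, which corresponds to the bottom row of the "triangle"; under that reading any partially filled subsets lie at the top of the triangle.

```python
from collections import defaultdict


def assign_units(n, m):
    """Group the nodes N(i, j) of an n x n array (1-indexed) into units.

    Each unit is keyed by (d, r): d = i - j selects the diagonal, i.e. the
    column of the triangle, and r is the row of the triangle counted from
    its base. Each unit implements at most m consecutive diagonal nodes.
    """
    units = defaultdict(list)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            d = i - j
            # Steps along the diagonal from the first row or first column.
            pos = min(i, j) - 1
            units[(d, pos // m)].append((i, j))
    return dict(units)


if __name__ == "__main__":
    for n in (4, 5):
        units = assign_units(n, 3)
        base = sum(1 for (_, r) in units if r == 0)
        print(n, "x", n, "array:", len(units), "units,", base, "in the bottom row")
```

For m = 3 this gives 7 units in the bottom row for the 4 x 4 array and 9 for the 5 x 5 array, i.e. 2n - 1 in each case, with one further unit added at each end of each row when n is increased from 4 to 5.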

It will be appreciated that, because the nodes implemented by each given column of units 10 in Figure 4 all belong to a respective southeast-northwest diagonal of the original square array of nodes Ni,j (c.f. Figure 2) each given column of units, controlled by a respective control line 60 or 60', implements all those nodes for which i-j has a corresponding given value.

As mentioned previously, the fact that only one node in each subset of three in the array of Figure 2 is active at any given time, enabling the three nodes of each subset to be implemented by a single respective processing unit, derives from the fact that a single data period delay D is present in each communication path between nodes, this necessitating the insertion of zeros in the input data in the manner illustrated. If the delay in certain of these communication paths is increased (necessitating the insertion of yet more zeros in the input data) it can be arranged that subsets of more than three processing nodes can be implemented by single respective processing units. Figure 5 illustrates modifications which may be made to the data delays and input data format of Figure 2a in order that only one node in each subset of four in the array is active at any given time, enabling each such set of four to be implemented by a single respective processing unit. If Figure 5 is compared with Figure 2a it will be seen that the delays in the horizontal communication paths above the southeast-northwest diagonal, i.e. in each path from a node Ni,j to a node Ni,j+1 for j ≥ i, have been doubled to two data periods D, as have the delays in the vertical communication paths below this diagonal, i.e. in each path from a node Ni,j to a node Ni+1,j for j ≤ i, and further zeros have been inserted in the input data.

Figure 6 (which is analogous to Figure 3a) illustrates a given stage of processing in the array of Figure 5, subsets of four nodes lying in consecutive positions along south-east to north-west diagonals being denoted by reference numerals 13. It will be noted that one node at most of each subset is active during the relevant data period (and during every other data period) so that the nodes of each subset can be implemented by a single respective processing unit. Although when this is done the units of adjacent diagonals operate in different phases it will be noted that, in contrast to the arrangement illustrated in Figures 2 and 3 in which there is a steady progression of relative phase from each diagonal to the next on either side of the main diagonal of the array, in the arrangement of Figures 5 and 6 the processing units belonging to alternate diagonals operate in the same phase, which means that the arrangement of Figures 5 and 6 requires a different phase control circuit to that provided by the delay devices 12 of Figure 4 for the arrangement of Figures 2 and 3. Such a different phase control circuit may be obtained, for example, by increasing the delays produced by the delay devices 12 shown in Figure 4 to two data periods, control signals determining operation in each of four successive phases then being applied to the operations code number bus 11 successively and cyclically in synchronism with successive data periods.

As pointed out previously, the array of Figure 5 differs from the array of Figure 2 in that the delays in the west-east communication paths above the central southeast-northwest diagonal, and the delays in the north-south communication paths below this central diagonal have been increased from one data period D to 2D, this allowing subsets of four processing nodes to be implemented by single respective processing units instead of subsets of three (provided that the formatting of the input data is adjusted accordingly). In general, if the delays in these communication paths are each arranged to be (m-2)D (maintaining the delays in the other communication paths at D) this will allow subsets of m processing nodes to be implemented by single respective data processing units, where m = 3, 4, 5, ..., provided that the input data is appropriately formatted. Obviously the larger the value of m the more slowly the array will process a given amount of non-zero input data (because this data will have to be supplemented with more zeros), but this disadvantage may be outweighed in given cases by the smaller number of processing units required. The units may, in all cases, be arranged to process non-zero data substantially all of the time, i.e. the array can be substantially 100% efficient, whatever the value of m, the zeros supplied to the array being not actually processed thereby in the main. In all cases the processing units may be provided with a phase control circuit similar to that which includes the delay devices 12 shown in Figure 4, the delay produced by each device 12 being arranged to be (m-2)D and control signals determining operation in each of m successive phases being applied to the operations code number bus 11 successively and cyclically in synchronism with successive data periods.
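The phase control arrangement can be modelled in a few lines. In this minimal sketch the function name `operation_code` is an illustrative assumption, as are the conventions that the bus 11 carries code 1 in data period t = 0, that the codes run from 1 to m, and that the chain of delay devices has been running long enough for start-up effects to be ignored. The control line for the diagonal with i - j = d is reached through |d| delay devices of (m-2) data periods each.

```python
def operation_code(t, d, m):
    """Operation code (1..m) on the control line of diagonal d in period t.

    The bus supplies codes 1, 2, ..., m cyclically, one per data period;
    each step away from the centre diagonal adds a delay of (m - 2) periods.
    """
    return (t - abs(d) * (m - 2)) % m + 1


if __name__ == "__main__":
    for m in (3, 4):
        codes = [operation_code(0, d, m) for d in range(-3, 4)]
        print("m =", m, "codes on diagonals -3..3 at t = 0:", codes)
```

With m = 3 the code changes steadily from each diagonal to the next, whereas with m = 4 the total delay over two diagonals is four data periods, a whole cycle, so that alternate diagonals receive the same code, as noted for the arrangement of Figures 5 and 6.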

In general the real input data (neglecting the supplementing zeros) should be formatted in such a way that a data item is presented to the first processing unit implementing the nodes lying along a given diagonal of the (notional) array of nodes each time that unit is operating in its first phase (even if it should at that time be implementing a dummy node), and this in such a way that data item a1,1 is presented first to the first processing unit implementing the nodes of the centre diagonal, then data items a1,2 and a2,1 are presented to the respective first processing units implementing the nodes of the diagonals immediately on either side of the centre diagonal, then data items a1,3 and a3,1 are presented to the respective first processing units implementing the nodes of the diagonals which are spaced by one diagonal from the centre diagonal, and so on, zero input data being presented to these first processing units during all other data periods.

In some circumstances it can be an advantage if the value of m for a given processing system is adjustable, as will now be elaborated upon.

It will be evident from the description so far that an n x n array of processing nodes can decompose a matrix of bandwidth 2w-1 into the product of "lower" and "upper" triangular matrices only if w is not greater than n. If w is less than n the required decomposition can still be achieved, but in such a case only a w x w portion of the array will be usefully employed, which means that when subsets of the nodes are implemented by means of single respective processing units in accordance with the invention, not all of these processing units will be usefully employed. Figure 7 shows in outline form and in full lines an n x n array of processing nodes distorted to form a triangle 16 (c.f. Figure 3b and Figure 4). As indicated, the triangle is of height n nodes and base (2n - 1) nodes, and the array can process an input matrix of bandwidth (2n - 1) at the most. If in accordance with the invention subsets of m nodes are implemented by means of single respective processing units (c.f. units 10 in Figures 3b and 4) then if these units were superimposed on the triangle 16 of Figure 7 in a manner similar to that shown in Figures 3b and 4 the bottom row of units will have to contain (2n - 1) units and, for a given value of m, the centre column will have to contain n/m units if n/m is an integer and a number of units equal to the next integer higher than n/m if n/m is not itself an integer. When such an array of processing units is present, each unit implementing the given value of m processing nodes so that the array is capable of decomposing a matrix of maximum bandwidth (2n-1), then if the input matrix has a smaller bandwidth of (2p - 1), where p is less than n, only a p x p portion of the (notional) array of n x n processing nodes implemented with the given value of m will be usefully employed, this portion mapping to the triangular outline indicated by broken lines 14 in Figure 7. Thus the processing units which implement at least a substantial number of the nodes lying between the triangles 14 and 16 will not be usefully employed either. The situation can be substantially improved in such circumstances by reducing the number m of nodes implemented by each processing unit and hence the number of nodes implemented overall. The effect of this on the array of implemented nodes signified by the full-line triangle 16 in Figure 7 is to reduce the height of the triangle, so that it is transformed, for example, into the triangle indicated by the broken lines 15 (for which it has been assumed that m has been reduced by a factor which is at least approximately equal to p/n).

Although some implemented nodes (those lying between the triangles 14 and 15) will still not be usefully employed, it will be evident that the proportion of implemented nodes which are usefully employed has been substantially increased, entailing a corresponding increase in processing speed for a given amount of non-zero input data. As pointed out above, the smaller the value of m the faster the system can in general process a given amount of non-zero input data, because the data needs to be supplemented with fewer zeros; each processing unit only has the burden of implementing m (notional) nodes. In general, if the bandwidth of the banded input matrix is 2w-1, and R is the number of processing units provided to implement the centre column of the triangular array of processing nodes, m is preferably chosen or adjusted to be equal to w/R if w/R is an integer, and to be equal to the next integer higher than w/R if w/R is not an integer.
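Both the rule for choosing m and the count of units in the centre column are ceiling divisions. A minimal sketch follows, in which the function names `choose_m` and `centre_column_units` are illustrative assumptions; no lower limit is imposed on m here, although the delay rule for the communication paths is stated only for m greater than two.

```python
def choose_m(w, R):
    """Nodes per unit for a matrix of bandwidth 2w-1 and R centre-column units.

    Equals w/R when that is an integer, otherwise the next higher integer.
    """
    return (w + R - 1) // R


def centre_column_units(n, m):
    """Units needed in the centre column of an n x n array, m nodes per unit."""
    return (n + m - 1) // m
```

For example, with R = 4 units in the centre column, matrices with w = 12 and w = 10 would both be handled with m = 3, whereas w = 16 would call for m = 4.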

Figure 8 shows once again, but this time in fragmentary form, the triangular array 16 of processing nodes of Figure 7. Superimposed thereon are processing units 17 which implement these nodes (c.f. the units 10 of Figures 3b and 4), each unit 17 implementing m nodes. The units 17 are of three types, indicated by the suffixes "a", "b" and "c" respectively, lying in the left-hand half, the centre column, and the right-hand half of the triangle respectively. By analogy with Figures 3a and 3b it will be evident that the units 17a implement the nodes above the centre southeast-northwest diagonal of the original square array, the units 17b implement the nodes on this centre diagonal, and the units 17c implement the nodes below this centre diagonal.

Consideration of Figure 5, in which the delays 2D in certain of the communication paths between each node and its nearest neighbours are, in the general case, (m-2)D as has been pointed out, reveals that the units 17a, 17b and 17c have to communicate with their nearest neighbours A-H (if present) in the manner indicated in Figures 9, 10 and 11 respectively, the required delays in each path in terms of a number of data periods D being as indicated.

Communications along each of the diagonal communication paths in Figures 9-11 take the place of communications along the corresponding horizontal path, i.e. the corresponding horizontal path which has a single data period delay D, when the relevant processing unit is implementing a specific one of the corresponding m nodes. This specific node is in fact the last or mth node of the corresponding subset, i.e. that from which data is outputted to the first node of the next subset (implemented by the neighbouring unit C).

Figure 12 shows a possible construction for each of the units 17a of Figure 8, with the exception of those which lie in the bottom row. The unit 17a comprises a data multiplier 18, a data subtractor 19, multiplexers 20 and 21, and several single-data-period-delay latches shown as small rectangles, all these components being interconnected as shown. The requisite clock pulse inputs to the various latches and the components 18 and 19 are not shown, in the interests of clarity. (m-2) of the latches are connected in cascade to form, in effect, a delay device 22 giving a delay of (m-2) data periods. The multiplexers 20 and 21 are controlled by control signals supplied to inputs 23 and 24 respectively in such manner that, as is indicated in the drawing, multiplexer 20 connects its right-hand input to its output when the unit 17a is implementing the mth node of the corresponding subset of processing nodes and connects its left-hand input to its output otherwise, and multiplexer 21 connects its right-hand input to its output when the unit 17a is implementing the first node of the corresponding subset of processing nodes and connects its left-hand input to its output otherwise. The input 23 may be fed, for example, from the output of a comparator (not shown) which compares data derived from an operations code number bus (c.f. bus 11 in Figure 4) via an appropriate number of data period delays (c.f. devices 12 in Figure 4) with m, and the input 24 may be fed from said output via a single data period delay.

Thus, when the unit 17a is implementing the first node of the corresponding subset of processing nodes it subtracts the product of data derived from the neighbouring unit E (if present) and delayed (in device 22) data derived from the neighbouring unit A from data derived from the neighbouring unit G if present (c.f. Figure 9) and both presents the result to the neighbouring unit C and stores this result in the latch 25. When the unit is implementing each of the other nodes of the corresponding subset except the mth it subtracts the product of data derived from neighbouring unit E and delayed data derived from the neighbouring unit A from the contents of the latch 25 and both presents the result to the neighbouring unit C and stores this result in the latch 25. When the unit is implementing the mth node of the corresponding subset the data derived from the neighbouring unit E is replaced by data derived from the neighbouring unit D.

In all cases the unit also passes on the delayed data from device 22 to the neighbouring unit E (if present).
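The phase-dependent behaviour of a unit 17a not lying in the bottom row may be summarised by the following behavioural sketch. It is an illustration under stated assumptions rather than a description of the circuit of Figure 12: the class and argument names are hypothetical, the (m-2)-period delay of device 22 is assumed to have been applied already to the data from unit A, and absent neighbours are represented by zeros.

```python
class Unit17a:
    """Behavioural model of one non-bottom-row unit 17a (names are hypothetical)."""

    def __init__(self, m: int):
        self.m = m
        self.latch25 = 0.0  # running partial difference

    def step(self, phase, a_delayed, e=0.0, d=0.0, g=0.0):
        """Process one data period.

        phase     -- which of the m nodes is being implemented (1..m)
        a_delayed -- data from unit A after the delay of device 22
        e, d, g   -- data from units E, D and G (zero when absent)
        Returns (value presented to unit C, value passed on to unit E).
        """
        # First node: start from the data supplied by G; otherwise
        # continue from the stored partial difference.
        minuend = g if phase == 1 else self.latch25
        # mth node: the data from D takes the place of the data from E.
        multiplicand = d if phase == self.m else e
        result = minuend - multiplicand * a_delayed
        self.latch25 = result
        return result, a_delayed
```

With m = 3, for instance, three successive calls with phases 1, 2 and 3 reduce the value supplied by G by one product per data period, the last product using the data from D.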

Those units 17a of Figure 8 which lie in the bottom row may basically have a similar construction but, as is evident from Figure 2 and the corresponding description, a modification to their operation is required when they are implementing the mth processing node of the corresponding subset. When this is the case they are each required, in addition to passing on the delayed data from device 22 to the neighbouring unit E (if present), merely to multiply this delayed data by the current contents of the latch 25 and present the result both to the exterior via data path C - where it can be read by exterior interface equipment - and to the neighbouring units A and H. The arrangement of Figure 12 may be readily reconfigured to perform this modified operation at the appropriate times by means of further multiplexers (not shown) controlled from the input 23, as will be evident to those skilled in the art. (See also Figure 15 of the drawings and the associated description below).

Each of the units 17c of Figure 8 may be constructed in a similar way; in effect they take the form of mirror images of the corresponding units 17a, as will be evident, inter alia, from a comparison of Figure 11 with Figure 9.

Figure 13 shows a possible construction for each of the units 17b of Figure 8, with the exception of that unit which lies in the bottom row of units. The unit 17b comprises a data multiplier 26, a data subtractor 27, multiplexers 28, 29 and 30, and several single-data-period-delay latches shown as small rectangles, all these components being interconnected as shown. The requisite clock pulse inputs to the various latches and the components 26 and 27 are not shown, in the interests of clarity. The multiplexers 28, 29 and 30 are controlled by control signals supplied to inputs 31, 32 and 33 respectively in such manner that, as is indicated in the drawing, multiplexer 28 connects its right-hand input to its output when the unit 17b is implementing the first node of the corresponding subset of processing nodes and connects its left-hand input to its output otherwise, multiplexer 29 connects its right-hand input to its output when the unit 17b is implementing the mth node of the corresponding subset of processing nodes and connects its left-hand input to its output otherwise, and multiplexer 30 connects its left-hand input to its output when the unit 17b is implementing the mth node of the corresponding subset of processing nodes and connects its right-hand input to its output otherwise. The inputs 32 and 33 may be fed, for example, from the output of a comparator (not shown) which compares data derived from an operations code number bus via an appropriate number of data period delays with m, and the input 31 may be fed from said output via a single data period delay, similarly to the signals applied to the inputs 23 and 24 of Figure 12.

Thus when the unit 17b is implementing the first node of the corresponding subset of processing nodes it subtracts the product of data derived from the neighbouring units A and E (if present) from data derived from the neighbouring processing unit G if present (c.f. Figure 10) and both presents the result to the neighbouring unit C and stores this result in the latch 34. When the unit is implementing each of the other nodes of the corresponding subset except the mth it subtracts the product of data derived from the neighbouring units A and E (if present) from the contents of the latch 34 and both presents the result to the neighbouring unit C and stores this result in the latch 34. When the unit is implementing the mth node of the corresponding subset the data derived from the neighbouring units E and A is replaced by data derived from the neighbouring units D and B respectively. In all cases the unit also passes on the data derived from neighbouring unit E or D to neighbouring unit A (if present), and the data derived from neighbouring unit A or B to neighbouring unit E (if present).

The unit 17b of Figure 8 which lies in the bottom row of units may basically have the same construction but, as is evident from Figure 2 and the corresponding description, a modification to its operation is required when it is implementing the mth processing node of the corresponding subset. When this is the case it is required merely to present unity data to the neighbouring unit E, the reciprocal of the current contents of latch 34 to the neighbouring unit A, and the current contents of latch 34 to the output. Thus this particular unit will also have to be provided with a reciprocal-determining device, e.g. a suitably programmed look-up table, and means for reconfiguring it to perform these operations each time it is implementing the mth node of the corresponding set. Such means may take the form of further multiplexers controlled from the input 32 or 33, as will be evident to those skilled in the art.

Various references have been made above to the inputting of data from, and the presenting of data to, neighbouring processing units "if present". It will be evident that some such neighbouring processing units will not in fact be present for those units 17 of Figure 8 which lie along the sloping edges of the triangle 16. For those units which do not have a neighbouring unit G (c.f. Figures 9-13) the input data which would otherwise be derived from the neighbouring unit G will, of course, be constituted instead by the appropriately formatted data of the matrix to be decomposed. This is shown diagrammatically at 35 in Figure 14, where the input data is shown being supplied from a host computer 36 which also receives the output data from the base of the triangular array at an input port 37. For those units 17a which do not have a neighbouring unit E (c.f. Figure 9) zeros should be applied to the input which would otherwise be fed from the neighbouring unit E. Similarly, for those units 17c which do not have a neighbouring unit A (c.f. Figure 11) zeros should be applied to the input which would otherwise be fed from the neighbouring unit A. Similarly, for any unit 17b which does not have neighbouring units A and E, zeros should be applied to those inputs which would otherwise be fed from the neighbouring units A and E. Similar comments apply to the bottom corner units 17a and 17c in respect of the inputs which would otherwise be fed from neighbouring units D and B respectively, and to the bottom unit 17b in respect of the inputs which would otherwise be fed from the neighbouring units D and B.

These zeros ensure that when a dummy node, i.e. a node which is not actually required, is implemented, the relevant operations always result in the partial difference being returned unchanged and in a zero being passed to any dummy node implemented in a neighbouring processing unit.
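The effect of the zeros can be checked with a very small sketch. The function name and signature below are hypothetical and merely model a single multiply-subtract step in which one operand is also passed on to the next unit.

```python
def dummy_safe_step(partial, passed_operand, other_operand):
    """One multiply-subtract step.

    Returns (new partial difference, value passed on to the neighbour).
    """
    return partial - passed_operand * other_operand, passed_operand


# A zero on the passed-on operand leaves the partial difference
# unchanged and propagates a zero to the neighbouring dummy node.
unchanged, forwarded = dummy_safe_step(7.5, 0.0, 3.0)
```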

Although not shown in Figure 14 for clarity's sake, the computer 36 is also arranged to supply the array 16 with clock signals and also operation codes to an operation code number bus similar to the bus 11 described with reference to Figure 4.

As described so far the suitably formatted input matrix to be decomposed is applied to a different side or sides of the array of processing nodes/processing units to that or those from which the result matrices are outputted; see e.g. Figures 2a, 5 and 14. The processing is achieved for each output data item ui,j or li,j by successively subtracting products of the form li,k.uk,j from ai,j and dividing the eventual result by unity or uj,j respectively (in fact multiplying by unity or 1/uj,j respectively). It will be appreciated that this is not the only way the desired output data can be obtained. An alternative is, for example, to successively accumulate the relevant products of the form li,k.uk,j, add -ai,j to the eventual result, and then divide by -1 or -uj,j respectively (multiply by -1 or -1/uj,j respectively). This alternative can have advantages, in particular because when the array is modified to operate in this way it is found that a further simple modification, obtainable with some configurations by means of a signal on a single control line, results in an array which is capable of performing simple band-matrix by band-matrix multiplication. Indeed in some circumstances it may be that band-matrix by band-matrix multiplication is all that is ever required, in which case the facility for decomposing a band matrix into "upper" and "lower" triangular matrices may be dispensed with altogether.
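The equivalence of the two ways of obtaining each output item can be illustrated in software. The sketch below is a plain sequential decomposition without pivoting, not a model of the systolic array; it does not exploit the banded structure, and the function name lu_decompose and its accumulate flag are assumptions introduced for this illustration.

```python
def lu_decompose(a, accumulate=False):
    """Decompose square matrix `a` into unit-lower l and upper u.

    accumulate=False: subtract products from a[i][j], then multiply
                      by 1 or 1/u[j][j].
    accumulate=True:  accumulate products, add -a[i][j], then multiply
                      by -1 or -1/u[j][j].
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            terms = [l[i][k] * u[k][j] for k in range(min(i, j))]
            if accumulate:
                acc = sum(terms) + (-a[i][j])
                scale = -1.0 if i <= j else -1.0 / u[j][j]
            else:
                acc = a[i][j]
                for t in terms:
                    acc -= t
                scale = 1.0 if i <= j else 1.0 / u[j][j]
            if i <= j:
                u[i][j] = acc * scale
            else:
                l[i][j] = acc * scale
        l[i][i] = 1.0
    return l, u
```

Both settings of the flag yield the same factors up to rounding, which is the property on which the modified processing units rely.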

Figure 15 shows a possible general-purpose construction for each processing unit 17, with the exception of that lying in the centre of the bottom row of processing units in the triangular array, when the modified processing just referred to is adopted.

(In fact units lying in the right-hand half of the triangular array will be mirror images of that shown). The unit is configured to perform as a centre column unit, a bottom row unit or one of the other units, by means of control signals applied to points x, y and z during operation, for example by means of a logic circuit as shown in Figure 16, as will be elaborated upon below. This configuring is achieved by means of multiplexers 38-42 the inputs of which are labelled "0" and "1" respectively to signify which is connected to the multiplexer output when the relevant control signal is 0 and 1 respectively. The unit also comprises further multiplexers 43-45 the functions of which are similar to multiplexers included in the units of Figures 12 and 13, a multiplexer 57, a data adder 46, a data multiplier 47, and several single-data-period-delay latches denoted by small rectangles.

Clock signal inputs to the various components are not shown, for the sake of clarity.

Although in practice some sharing of components between various processing units is possible, it will be assumed in the following description that each processing unit 17 as shown in Figure 15 is provided with a respective control logic circuit as shown in Figure 16, the points u-z being connected to the correspondingly annotated points in the relevant unit 17. As will be seen from Figure 16, each control logic circuit comprises two AND-gates 48 and 49, a data comparator 50, and a single-data-period-delay latch 51, connected as shown. The input 52 of comparator 50 is fed with the cycling phase-controlling operations code numbers for the relevant column of the triangular array; c.f. the relevant operations control line 60 or 60' of Figure 4. The input 53 of comparator 50 is fed with the desired value of m; c.f. the discussion above of the advantages of making m adjustable. The input 54 of gate 48 is fed with logic "1" if the relevant unit 17 lies in the centre column of the triangular array, and with logic "0" otherwise. The input 55 of gate 49 is fed with logic "1" if the relevant unit lies in the bottom row of the triangular array and with logic "0" otherwise.

The desired value of m fed to input 53 (and which may also be fed to the operations code number delay devices, c.f. devices 12 in Figure 4, to control their delays to be equal to (m-2)D) controls the multiplexer 57 in the relevant unit 17 directly. The various inputs of this multiplexer are fed from respective taps on the delay device 56 formed by a plurality of latches connected in cascade, and multiplexer 57 is controlled by the desired value of m to connect that tap to its output which will result in the delay device 56 producing a delay of m-3 data periods in the signal path from multiplexer 44 to the output of multiplexer 57. (An additional single data period delay occurs at each input to multiplexer 44, due to the additional latches provided thereat).

Multiplexer 42 is controlled by the signal fed to input 54 directly so that, when the relevant unit 17 lies in the centre column of the triangular array the delay device 56 is short-circuited completely.
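A tapped delay line whose tap is selected by the value of m may be modelled as follows. This is a minimal sketch under stated assumptions: the class name, the max_delay parameter and the bypass argument (standing for the short-circuiting of the delay for centre column units) are hypothetical, and the further single-period latches at the inputs of multiplexer 44 are not modelled.

```python
from collections import deque


class TappedDelay:
    """Cascade of latches with a selectable output tap."""

    def __init__(self, max_delay: int):
        # history[k] holds the sample that entered k data periods ago
        self.history = deque([0.0] * (max_delay + 1), maxlen=max_delay + 1)

    def step(self, x, m, bypass=False):
        """Clock in sample x and return the sample delayed by m-3 periods."""
        delay = 0 if bypass else m - 3
        if not 0 <= delay < len(self.history):
            raise ValueError("m is outside the range supported by the taps")
        self.history.appendleft(x)
        return self.history[delay]
```

Changing m between calls merely selects a different tap, so that the delay can be altered without changing the cascade itself.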

It will be evident from Figure 16 that the signal applied to point v is logic "1" when the relevant operations code number is m, and is logic "0" otherwise, and the signal applied to point w is logic "1" when the operations code number is one, and is logic "0" otherwise. Thus multiplexer 43 is controlled in the same way as multiplexer 20 in Figure 12 and multiplexer 29 in Figure 13, and multiplexer 45 is controlled in the same way as multiplexer 21 in Figure 12 and multiplexer 28 in Figure 13. When the relevant unit 17 belongs to the centre column of the triangular array (logic "1" at input 54) the signal applied to point y is logic "1" when the relevant operations code number is m, and is logic "0" otherwise.

Thus multiplexer 44 is, for centre column units, controlled in the same way as multiplexer 30 in Figure 13, as required. Otherwise it inputs continuously from the neighbouring unit A (c.f. Figure 12).

When the relevant unit 17 belongs to the bottom row of the triangular array (logic "1" at input 55) the signal applied to point x is logic "1" when the relevant operations code number is m, and is logic "0" otherwise. Thus, for these bottom row units the multiplexers 38-41 are switched when the relevant operations code number is m. At other times their states are those denoted by zeros, as they are continuously for all other units 17.
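The behaviour of the signals at points v, w, x and y described above may be summarised by the following sketch. It is an interpretation of the text rather than a transcription of Figure 16: the class name is hypothetical, and it is assumed that the signal at w is obtained by delaying the comparator output by one data period, so that it is logic "1" in the period following operations code number m, i.e. when the cyclically supplied code number is one.

```python
class ControlLogic:
    """Behavioural model of the control signals v, w, x and y."""

    def __init__(self, centre_column: bool, bottom_row: bool):
        self.centre_column = centre_column  # level applied to input 54
        self.bottom_row = bottom_row        # level applied to input 55
        self.previous_v = False             # assumed role of the latch

    def step(self, opcode: int, m: int) -> dict:
        v = opcode == m                     # comparator output
        w = self.previous_v                 # comparator output, one period late
        y = v and self.centre_column        # gated by input 54
        x = v and self.bottom_row           # gated by input 55
        self.previous_v = v
        return {"v": v, "w": w, "x": x, "y": y}
```

For a bottom row unit outside the centre column and m = 3, the cyclic codes 1, 2, 3 produce x and v in the third period of each cycle and w in the first period of the following cycle, while y remains at logic "0".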

It will be appreciated from the above that the unit 17 of Figure 15, provided that it does not lie in the bottom row of the triangular array, operates in substantially the same way as the units of Figure 12 or Figure 13, as appropriate, with the exception that the multiply-subtract operations performed by the units of Figures 12 and 13 have been replaced by multiply-accumulate operations. As far as the units lying in the bottom row of the array are concerned these also operate in a similar way (performing multiply-accumulate operations) except when the relevant operations code number is m. When the operations code number is m the multiplexers 38-41 are switched over so that the data from D, instead of being multiplied by the data from A or B with the result being added to the contents of latch 58, is added to the contents of the latch 58 directly, the result being multiplied by the data from A (which is -1 for 17a units and -1/uj,j for 17c units) and the result of this outputted at C, as required. The data from D (D denoting the exterior in this case) is in fact the negative of the relevant item of input matrix data, i.e. -ai,j, which items now have to be inputted in negative form formatted in a similar way to that previously described for when the data is inputted along the sloping edges of the triangular array, but now so that an item of real data is presented to each processing unit lying along the bottom of the triangular array when it is operating in its mth phase rather than its first phase.
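The difference between the ordinary multiply-accumulate step and the switched-over step performed by a bottom row unit in its mth phase can be expressed arithmetically. The two function names below are hypothetical and the sketch models only the arithmetic, not the multiplexers or latches.

```python
def ordinary_phase(latch58, operand, multiplier):
    """Multiply-accumulate: add the product to the accumulated sum."""
    return latch58 + operand * multiplier


def bottom_row_mth_phase(latch58, d, a):
    """Switched-over step: add d directly, then multiply by a."""
    return (latch58 + d) * a


# Accumulated products 6.0 and input item ai,j = 10.0, so d = -10.0.
u_item = bottom_row_mth_phase(6.0, -10.0, -1.0)        # a = -1
l_item = bottom_row_mth_phase(6.0, -10.0, -1.0 / 2.0)  # a = -1/uj,j with uj,j = 2.0
```

In the first case the result equals ai,j minus the accumulated products, and in the second case that difference divided by uj,j, in agreement with the subtract-and-divide formulation given earlier.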

The processing unit which lies in the bottom row and centre column of the triangular array may be constructed basically as already described with reference to Figures 15 and 16. However, its operation has to be modified each time it is implementing the mth node of the corresponding set of processing nodes. More particularly, when this is the case it is required to add the data item -aj,j, inputted from the exterior at D, to the contents of latch 58, present the reciprocal of the result to the neighbouring unit A, present the negative of the result to the exterior at C, and present -1 to the neighbouring unit E. Thus this particular processing unit will have to be provided in addition with a reciprocal-determining device such as a suitably programmed look-up table and means for reconfiguring it to perform these operations each time it is implementing the mth node of the corresponding set. Such means may take the form of further multiplexers controlled from point x in Figure 16, as will be evident to those skilled in the art.

What was said above about applying zeros to the unused inputs of the processing units lying along the sloping edges of the triangular array 16 of Figure 14 applies also when input data is applied to the bottom of the triangular array as just described.

Now, however, these units will have further such unused inputs corresponding to the data paths 35 of Figure 14, and these too should be supplied with zeros.

When the various processing units are constructed as described above with reference to Figure 15 and are controlled by means of the logic circuit shown in Figure 16 the operation of those units lying in the bottom row of the triangular array is distinguished from the operation of the other units by virtue of the fact that only the units in the bottom row are periodically switched over by means of the signal from point x in Figure 16 (because it is only for these units that a logic "1" is applied to input 55 of Figure 16). If the signal applied to input 55 is reset to logic "0" then the distinction between operations will be removed; all units will then perform multiply-accumulate operations whatever nodes they are implementing at the relevant time. It will be appreciated that this is exactly what is required if the system is to perform band-matrix by band-matrix multiplication instead of band matrix decomposition, the input matrices then being inputted to the left-hand half and the right-hand half respectively of the bottom of the triangular array of processing units and the product matrix being outputted from the bottom of the array.

A substantial amount of what has been described above is illustrated in Figure 17, which shows two processing units 17 both situated in the bottom row of units which make up the effectively triangular array of units. The left-hand unit 17 in Figure 17 is that unit which lies at the left-hand end of the row and the right-hand unit is that which lies at the centre of the row. The data communication paths with neighbouring units (c.f. the description with reference to Figures 9-11 and Figure 15) are not shown, for the sake of clarity. Processing in the triangular array is controlled by means of a control unit 61, for example a suitably programmed computer.

Control unit 61 generates logic "1" at an output 62 when the system is required to perform band-matrix decomposition, and logic "0" thereat when the system is required to perform band-matrix by band-matrix multiplication. This signal is applied to the inputs 55 of all the units 17 in the bottom row; c.f. input 55 in Figure 16 and the relevant description, logic "0" being always applied to the corresponding units 17 in the other rows.

Control unit 61 generates the required value of m, i.e. the number of processing nodes which each unit 17 is required to implement, at an output 63. This signal is applied to the input 53 of all the units 17 in the array; c.f. input 53 in Figure 16 and the relevant description, and is also applied to the delay devices 12 which feed the operation control lines 60 and 60' and hence the inputs 52 of the units 17; c.f. Figure 4 and the relevant description and also input 52 in Figure 16, to control the delay produced by each device 12 to be (m-2) data periods. (Each device 12 may be formed by a succession of latches and a controlled multiplexer similar to the latches 56 and the multiplexer 57 in Figure 15). The units 17 and the devices 12 are supplied with clock signals generated by unit 61 at an output 64 to synchronise the various latches in these components and also the various arithmetic units included in the units 17. The intercoupled devices 12 are fed with m operation codes successively and cyclically from an output 65 of unit 61.

Inputs 66 of the units 17 in the bottom row of the array are fed with the relevant suitably formatted matrix input data from respective outputs 67 of unit 61; c.f. the input from "D" in Figure 15, and outputs 68 thereof supply the relevant matrix output data to respective inputs 69 of unit 61; c.f. the output to "C" in Figure 15.

The input 54 of the right-hand unit 17, and the corresponding inputs of all the other units in the corresponding, i.e. centre, column of the triangular array, are fed with logic "1", whereas the inputs 54 of all the other units are fed with logic "0", c.f. input 54 in Figure 16 and the corresponding description.

In the systems described so far, what is in effect a substantially triangular array of processing units, e.g. the units 17 in Figure 8, implements what is in effect a triangular array of processing nodes of height n and base (2n-1), the latter mapping in turn to an n x n square array of processing nodes. The systems can be used, inter alia, to decompose an input matrix of maximum bandwidth (2n-1). If the input matrix has a smaller bandwidth (2p-1), and each processing unit implements m processing nodes, the value of m may be reduced to increase the rate at which the smaller-bandwidth matrix is processed and the processing unit utilisation; c.f. Figure 7 and the associated description.

However, the amount by which m can be reduced is limited by the fact that mR must not be less than p, where R is the number of processing units which make up the centre column of the triangular array, and in any case m must be at least 3. Obviously providing more processing units in the triangular array will enable the value of m to be reduced further (although not below 3), but this will inevitably result in increased costs. These increased costs may, however, be lessened, at least in some circumstances, if the shape of the (notional) array of processing units is allowed to depart from the triangular, for example as shown in full lines 70 in Figure 18.

The array denoted by the full lines 70 in Figure 18 takes the (notional) form of two superimposed triangular arrays of base (2p-1) and (2n-1) units respectively and height r and q units respectively, the processing units lying in the area 71 being common to both arrays. If the processing units are controllable between each implementing m1 processing nodes and each implementing m2 processing nodes, where qm1 is not less than n and rm2 is not less than p, the array can satisfactorily process an input matrix of maximum bandwidth 2n-1 in the former case and maximum bandwidth 2p-1 in the latter case. Thus the array shown can process input matrices having a bandwidth of (2p-1) or very nearly this bandwidth in just as efficient a manner as could a triangular array of controllable-m processing units of base 2n-1 units and height r units, the sloping sides of which would be as indicated by the dashed lines 72, while still being capable of processing matrices having bandwidths between (2p-1) and (2n-1). It will be appreciated that the last-mentioned triangular array would have a larger area than the array indicated by the full lines 70, and hence would require more processing units. Obviously the principle just described can be extended at will, so that the array of processing units takes the (notional) form of more than two superimposed triangular arrays.
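The conditions on m1 and m2 may be evaluated as in the following sketch, which is given purely by way of example. The function names are hypothetical; the smallest admissible value in each mode is taken to be the ceiling of the relevant quotient, subject to the lower limit of 3 mentioned above.

```python
def smallest_m(half_width: int, height_units: int) -> int:
    """Smallest m with height_units * m >= half_width and m >= 3."""
    if half_width < 1 or height_units < 1:
        raise ValueError("arguments must be positive integers")
    return max(3, -(-half_width // height_units))


def two_mode_settings(n: int, p: int, q: int, r: int) -> tuple:
    """Return (m1, m2) for bandwidths 2n-1 and 2p-1 respectively.

    q and r are the heights, in processing units, of the triangles
    serving bandwidths 2n-1 and 2p-1 respectively.
    """
    return smallest_m(n, q), smallest_m(p, r)
```

For instance, with n = 24, p = 9, q = 4 and r = 3 the sketch gives m1 = 6 and m2 = 3, so that qm1 = 24 is not less than n and rm2 = 9 is not less than p.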

From reading the present disclosure, other modifications will be apparent to persons skilled in the art. Such modifications may involve other features which are already known in the design, manufacture and use of data processing systems and component parts thereof and which may be used instead of or in addition to features already described herein. Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present application also includes any novel feature or any novel combination of features disclosed herein either explicitly or implicitly or any generalisation thereof, whether or not it relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as does the present invention. The applicants hereby give notice that new claims may be formulated to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.

Claims

CLAIM(S)

1. A data processing system which takes the form of a set of data processing nodes which are provided with data communication paths therebetween in such manner that said nodes are effectively arranged in rows i and columns j with each node Ni,j provided with respective data communication paths to (a) the next node (if any) Ni,j+1 in the corresponding row, (b) the next node (if any) Ni+1,j in the corresponding column, and (c) that node (if any) Ni-1,j-1 which lies in both the preceding row and the preceding column, each said data communication path including delay means for delaying data communicated via the relevant path by an integral number of data periods D, parallel output data paths being provided from the nodes lying in the first row and the first column, said system comprising a plurality of processing units for implementing the data processing required at said nodes, which units are each assigned to a respective subset of said nodes for implementing the data processing required at all the nodes of the corresponding subset, characterized in that for each said subset which contains more than one node each of the nodes constituting the relevant subset lies in a row and column which directly succeed the row and column respectively in which another of the nodes constituting the relevant subset lies and/or directly precede the row and column respectively in which another of the nodes constituting the relevant subset lies.

2. A system as claimed in Claim 1, wherein each subset comprises m nodes if m nodes are available for inclusion in the relevant subset and, for each given value of i-j, a corresponding operation control line is provided to all the processing units which implement the processing required at the relevant nodes, control means being provided for supplying operation codes successively and cyclically to each said operation control line to control the relevant processing units to implement the processing required at the corresponding m nodes successively and cyclically in step both with each other and with successive data periods.

3. A system as claimed in Claim 2, wherein m is greater than two and said integral number is (m-2) for each said path from a node Ni,j to a node Ni,j+1 if j ≥ i and for each said path from a node Ni,j to a node Ni+1,j if j ≤ i, and is unity for all the other said paths.

4. A system as claimed in Claim 3, wherein said control means comprises further delay means between the lines of each pair of said operation control lines for which said given value differs by unity, for communicating operation codes from one line of the pair to the other after a delay of (m-2)D.

5. A system as claimed in Claim 4, including supply means for supplying an m-determining code to said control means and to those data communication path delay means whose delay is required to be (m-2)D to thereby control m between specific ones of a plurality of values.

6. A system as claimed in any preceding claim, wherein parallel input data paths are provided to the nodes lying in the first row and the first column.

7. A data processing system substantially as described herein with reference to Figures 8-14 of the drawings, to Figures 15-17 of the drawings, or to Figure 18 of the drawings.