NL2015114B1 - Scalable computation architecture in a memristor-based array - Google Patents
 Publication number
 NL2015114B1
 Authority
 NL
 Netherlands
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F15/00—Digital computers in general; Data processing equipment in general
 G06F15/76—Architectures of general purpose stored program computers
 G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
 G06F15/7867—Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
Description
Scalable computation architecture in a memristor-based array
Field of the invention
The present invention relates to a method for data processing in a memristor-based array in accordance with the preamble of claim 1.
Also, the invention relates to a memristor-based computational device. Furthermore, the invention relates to a computer architecture.
Background
In the last several decades, CMOS downscaling has been the primary driver behind computer performance improvement. However, CMOS technology is reaching its physical, if not economic, limits. Downscaling devices has led to many challenges such as leakage power, reliability, fabrication process and turnaround time, test complexity, cost for mask and design, and yield. Furthermore, the performance gain from increasing clock speed has saturated since the early 2000s; today, speedup is no longer the result of a faster clock, but rather a result of parallelization on multi-core and many-core systems. However, the number of parallel cores that can be programmed and the computation efficiency that can be extracted are tending to saturate as well. Today's computing systems are mainly built on John von Neumann's stored-program computer concept. A major drawback of this computer design is the gap between the processing units and the main memory, the so-called memory bottleneck. For data-intensive applications, the memory bottleneck is becoming even more severe and is putting major limitations on both performance and energy consumption. All of this motivates the need for a new architecture that is able to (a) eliminate the communication bottleneck and support massive parallelism to increase the overall performance, and (b) reduce the energy inefficiency to improve the computation efficiency.
Getting the memory closer to the processing unit and reducing the memory bottleneck has attracted a lot of attention. In 1969, Logic-In-Memory (LIM) was originally introduced as a memory accelerator, i.e., some processing units are added close to the main memory. In 1992, the LIM concept reappeared under the name computational RAM; it typically uses the same accelerator concept, in which the processing units are supposed to perform operations needed by the memory, such as address translations. In the late 1990s and early 2000s, Processor-In-Memory (PIM) was proposed and manufactured. PIM is based on splitting the main memory into different parts, each surrounded by computing units, to bring the computation near to the memory; the architecture has a master CPU that takes care of the overall control. In 2004, Memory-In-Logic (MIL), which provides massive addressable memory on the processor, was proposed for supercomputer systems. All of the above-mentioned efforts have tried to close the gap between processor and memory speed. However, as the computation and the storage are kept separate, they fundamentally use a von Neumann stored-program computer concept and therefore suffer from a memory bottleneck, which negatively impacts the performance.
It is an object of the present invention to overcome or mitigate one or more of the disadvantages of the prior art.
Summary of the invention
The object is achieved by a method for data processing based on arithmetic operations in a memristor-based crossbar array, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite to the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states; the method comprising:
- defining data circuit templates for holding data;
- defining computation circuit templates for holding a selected arithmetic instruction from the arithmetic operations;
- arranging data as data circuits on predetermined data locations of the array in accordance with the data circuit template by programming the contacts of the array at the predetermined data locations;
- arranging arithmetic instructions as instruction circuits on predetermined instruction locations of the array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the contacts of the array at the predetermined instruction locations,
wherein data acting as input for a first arithmetic instruction is arranged as data circuit connected to an input location of the associated first arithmetic instruction circuit, and data acting as output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit; data circuits for data to be computed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array, and instruction circuits for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array.
Advantageously, computation based on the data and instructions is executed in the memory unit without moving the data and instructions from their respective locations. By this organization of data and instructions the memory unit also functions as processing unit. The main advantage of data processing using computation-in-memory over von Neumann-type data processing is the tight integration of both computing and storing operations using the same physical crossbar. Moreover, the performance of von Neumann architectures suffers from delays caused by the required transfer of data and instructions from memory to CPU and vice versa. Since this bottleneck is removed here, in the CIM architecture massive parallelism can be achieved with minimized communication.
According to an aspect the invention provides a method as described above, comprising:
- arranging a plurality of data acting as input for the first arithmetic instruction as a plurality of input data circuits in an input data column of the array;
- arranging a plurality of data acting as output for said first arithmetic instruction as a plurality of output data circuits in an output data column of the array, and
- arranging said first arithmetic instruction as a plurality of first arithmetic instruction circuits in a first instructions column of the array, intermediate the input column and the output column.
According to an aspect the invention provides a method as described above, wherein the plurality of data acting as output are arranged as a plurality of second input data circuits for a second arithmetic instruction, the method comprising:
- arranging said second arithmetic instruction as a plurality of second arithmetic instruction circuits in a second instruction column or row of the array, with the input of each second arithmetic instruction circuit connected to the associated second input data circuit, and
- arranging a plurality of second output data circuits on the array connected to the respective output of the associated second arithmetic instruction circuit.
According to an aspect the invention provides a method as described above, comprising, for each connection between either an output of a data circuit and an input of an instruction circuit or an input of a data circuit and an output of an instruction circuit, a creation of a connection circuit in the array between said data circuit and said instruction circuit.
According to an aspect the invention provides a method as described above, wherein the arithmetic instructions are selected from operations comprising addition, multiplication, subtraction and division.
According to an aspect the invention provides a method as described above, wherein the data processing is a parallel addition of a plurality of N numbers using a binary tree addition algorithm, N being even.
According to an aspect the invention provides a method as described above, comprising:
- creating in the array N+1 columns of N/2 rows, the outer columns 1, N+1 and the central column N/2+1 arranged as data columns with data circuits for receiving external data, in which each data circuit in a data column is connected through a connection circuit in an adjacent column to an adder circuit in a subsequent arithmetic column and each adder circuit is connected through a further connection circuit to an output data circuit in a next data column, such that in the central column N/2+1 the result of the addition of the N numbers is stored.
According to an aspect the invention provides a method as described above, wherein a data circuit in a data column is either an output data circuit or an input data circuit for an addition or a combined data circuit configured to receive a result from an addition and to transmit said result as input data for a next addition.
According to an aspect the invention provides a method as described above, wherein additions in each adder column are performed in parallel.
According to an aspect the invention provides a method as described above, wherein in each connection column data are transferred in parallel.
According to an aspect the invention provides a method as described above, wherein the data processing is a matrix multiplication algorithm for an N-by-N matrix.
According to an aspect the invention provides a method as described above, wherein a first arithmetic instruction is a multiplication for a pair of matrix elements embodied as a multiplier circuit and a second arithmetic instruction is an addition for a pair of multiplication results embodied as an adder circuit, the method comprising:
- arranging in the crossbar array in at least one row:
  - two multiplier circuits, each in between an input data circuit for holding a pair of matrix elements and an output data circuit for holding the result of the multiplication, wherein the output data circuits are located adjacent to each other in between the multiplier circuits, in a manner that within the at least one row the multiplier circuits can operate simultaneously, and
- arranging in the crossbar in an adjacent row:
  - next to the output data circuits that are adjacent to each other, an adder circuit connected to said output data circuits and arranged for receiving data from the respective output data circuit and for adding the received data, and
- arranging in the crossbar in a further row a sum output storage circuit arranged for receiving result data from the adder circuit, such that the adder circuit is between the row with multiplier circuits and the row with the sum output storage circuit.
According to an aspect the invention provides a method as described above, comprising:
- arranging in the crossbar array in a second row:
  - two further multiplier circuits, each in between an input data circuit for holding a pair of matrix elements and an output data circuit for holding the result of the multiplication, wherein the output data circuits are located adjacent to each other in between the further multiplier circuits, in a manner that within the second row the further multiplier circuits can operate simultaneously, and
- arranging in the crossbar in a row adjacent to the second row:
  - next to the output data circuits that are adjacent to each other, a further adder circuit connected to said output data circuits and arranged for receiving data from the respective output data circuit and for adding the received data, and
- arranging in the crossbar in a further row next to the adjacent row a second sum output storage circuit arranged for receiving result data from the further adder circuit, such that the further adder circuit is between the second row with the further multiplier circuits and the further row with the second sum output storage circuit, and wherein the second sum output storage circuit is arranged adjacent to the sum output storage circuit.
According to an aspect the invention provides a method as described above, further comprising:
- arranging in a row direction adjacent to the sum output storage circuit and the second sum output storage circuit, a third adder circuit that is configured for adding data from the sum output storage circuit and data from the second sum output storage circuit, and adjacent in the same row as the third adder a third sum output storage circuit.
Also, the invention relates to a computational device for data processing based on arithmetic operations, comprising a memristor-based crossbar array and a controller, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite to the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states, and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines connects to a corresponding one of the parallel first bars and each of the second electrical connection lines connects to a corresponding one of the parallel second bars, and the controller is arranged to perform control operations on the memristors of the crossbar array over the connection interface in accordance with the method as described above.
According to an aspect the invention provides a device as described above, wherein the controller is arranged to execute the first arithmetic instructions of the plurality of first arithmetic instruction circuits in the first instructions column of the array, in parallel.
According to an aspect the invention provides a device as described above, wherein the controller is arranged to execute the second arithmetic instructions of the plurality of second arithmetic instruction circuits in the second instruction column or row of the array, in parallel.
Further, the invention relates to a computer architecture for computation-in-memory applications related to data processing based on arithmetic operations, comprising a memristor-based crossbar array and a controller, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite to the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states, and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines connects to a corresponding one of the parallel first bars and each of the second electrical connection lines connects to a corresponding one of the parallel second bars, and the controller is arranged to perform control operations on the memristors of the crossbar array over the connection interface; in which architecture data circuits and instruction circuits are arranged based on data circuit templates for holding data and on computation circuit templates for holding a selected arithmetic instruction from the arithmetic operations, respectively; the data circuits being arranged on predetermined data locations of the array; the instruction circuits being arranged on predetermined instructions locations of the array, wherein data acting as input for a first arithmetic instruction is arranged as data circuit connected to an input location of the associated first arithmetic instruction circuit, and data acting as output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit, data circuits for data to be computed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array, and instruction circuits for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array.
According to an aspect the invention provides a computer architecture for computation-in-memory applications comprising either a parallel adder circuit or a matrix multiplication circuit.
Advantageous embodiments are further defined by the dependent claims.
Brief description of drawings
The invention will be explained in more detail below with reference to drawings in which illustrative embodiments thereof are shown. The drawings are intended exclusively for illustrative purposes and not as a restriction of the inventive concept. The scope of the invention is only limited by the definitions presented in the appended claims.
Figure 1a shows a computation-in-memory computer architecture for use in accordance with the method of the invention;
Figure 1b shows a schematic of an adder circuit;
Figure 1c shows a layout of a binary tree addition algorithm;
Figure 1d shows a computation-in-memory parallel adder structure in accordance with the invention;
Figure 2 shows a computation-in-memory matrix multiplication and addition structure in accordance with the invention;
Figure 3 shows an arrangement of a plurality of computation-in-memory matrix multiplication and addition structures for a matrix multiplication in accordance with an embodiment of the invention.
Detailed description of embodiments
The invention relates to a computer architecture that is based on the principle of Computation-In-Memory (CIM): a memory unit comprises both data and instructions at physical locations in the memory unit, and computation based on the data and instructions is executed in the memory unit without moving the data and instructions from their respective locations. By this organization of data and instructions the memory unit also functions as processing unit. According to the invention, the memory unit is a memristor-based array or “crossbar”.
The main advantage of a computation-in-memory (CIM) architecture over von Neumann architectures is the tight integration of both computing and storing operations using the same physical crossbar. Moreover, the performance of von Neumann architectures suffers from delays caused by the required transfer of data and instructions from memory to CPU and vice versa. Since this bottleneck is removed here, in the CIM architecture massive parallelism can be achieved with minimized communication.
Fig. 1a shows a computation-in-memory computer architecture 100 for use in accordance with the method of the invention. The three main components of the CIM architecture are shown: crossbar 110, communication network 130 and controller 140.
The communication network 130 is connected to the controller 140.
The controller 140 is configured to interface with the crossbar 110 by horizontal and vertical interfaces 141, 142 (comprising voltage controllers). The crossbar 110 consists of an array of memristors 120 that are used to implement logic functions and/or storage.
In the memristor-based crossbar array, the crossbar array comprises a plurality of parallel first bars (that may be nanowires) extending in a first direction and a second plurality of parallel second bars (e.g. nanowires) extending in a second direction opposite to the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact. Each contact forms a memristor with at least two different programmable resistive states.
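The crossbar described above can be pictured as a grid with one programmable contact per crossing. The following is a minimal sketch only, not the circuit of the invention: the class name `Crossbar`, the method names and the two-state encoding (`HIGH_RESISTANCE`, `LOW_RESISTANCE`) are hypothetical, and the text only requires at least two resistive states.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical two-state encoding; the description only requires "at least two" states.
HIGH_RESISTANCE = 0
LOW_RESISTANCE = 1


@dataclass
class Crossbar:
    """Toy model: one memristor at every crossing of a first bar and a second bar."""
    n_first: int   # number of parallel first bars
    n_second: int  # number of parallel second bars
    states: List[List[int]] = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: every contact starts in the high-resistance state.
        self.states = [[HIGH_RESISTANCE] * self.n_second for _ in range(self.n_first)]

    def program(self, first: int, second: int, state: int) -> None:
        """Set the resistive state of the contact at crossing (first, second)."""
        if state not in (HIGH_RESISTANCE, LOW_RESISTANCE):
            raise ValueError("unsupported resistive state")
        self.states[first][second] = state

    def read(self, first: int, second: int) -> int:
        """Return the resistive state of the contact at crossing (first, second)."""
        return self.states[first][second]

    def contact_count(self) -> int:
        """Each first bar crosses every second bar exactly once."""
        return self.n_first * self.n_second
```

In this picture, arranging a data circuit or an instruction circuit amounts to a sequence of `program` calls on a predetermined region of the grid.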
The communication network 130 is either implemented within the crossbar 110 or by using separate metal layers. A controller 140 that may be implemented by CMOS devices handles auxiliary operations such as distributing data and controlling signals to the crossbar.
According to the invention the crossbar 110 is specialized to perform computation and storage operations in cells organized in rows and columns. Each cell can be a computational unit (such as an adder or multiplier) or storage location (such as a memory cell). The controller 140 is configured to arrange data as a data circuit (i.e., a storage location) on predetermined data locations of the crossbar array 110 in accordance with the data circuit template by programming the memristors of the crossbar array at the predetermined data locations. Additionally, the controller 140 is configured to arrange arithmetic instructions as an instruction circuit (i.e., a computational unit) on predetermined instructions locations of the crossbar array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the memristors of the array at the predetermined instruction locations.
The cells in a row or column can be configured with the same or different functionality, for example as a data circuit for a storage location or an instruction circuit for a computational unit. The communication in CIM architecture 100 has maximum flexibility. The CIM architecture 100 allows bidirectional communication in both horizontal and vertical directions. The controller may contain a router 143 and a finite state machine (FSM). The router 143 provides the FSM with a communication scheme for data distribution and data movements. The FSM fetches instructions from an instruction memory (e.g. hard disk, not shown), which contains instructions for the configuration of the crossbar array 110, in accordance with a predetermined algorithm or dataset. The FSM is further configured to convert fetched instructions to controlling signals for the row/column voltage controller 141, 142 to configure the crossbar array with computational units and storage locations as defined by the instructions from the instruction memory.
Fig. 1b shows a single CIM adder ADD. The basic computational unit CU is an n-bit adder, which is surrounded by a number of memory cells (latches) Z1 ... Z5. An n-bit adder contains three n-bit latches (two for the inputs Z1, Z2 and one for the sum Z3), a 1-bit carry-in latch Z4 and a 1-bit carry-out latch Z5.
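A behavioural sketch of such an adder cell may help to fix the roles of the five latches. It models only the arithmetic, not the memristor implementation; the class name `CimAdder` and its methods are hypothetical, while the latch names follow the paragraph above (Z1, Z2 inputs, Z3 sum, Z4 carry-in, Z5 carry-out).

```python
from dataclasses import dataclass


@dataclass
class CimAdder:
    """Behavioural model of an n-bit adder surrounded by latches Z1 ... Z5."""
    n_bits: int
    z1: int = 0  # n-bit input latch
    z2: int = 0  # n-bit input latch
    z3: int = 0  # n-bit sum latch
    z4: int = 0  # 1-bit carry-in latch
    z5: int = 0  # 1-bit carry-out latch

    def load(self, a: int, b: int, carry_in: int = 0) -> None:
        """Store the operands, truncated to n bits, and the carry-in bit."""
        mask = (1 << self.n_bits) - 1
        self.z1 = a & mask
        self.z2 = b & mask
        self.z4 = carry_in & 1

    def execute(self) -> None:
        """Add Z1 + Z2 + Z4; keep the low n bits in Z3 and the overflow bit in Z5."""
        total = self.z1 + self.z2 + self.z4
        self.z3 = total & ((1 << self.n_bits) - 1)
        self.z5 = total >> self.n_bits
```

For a 4-bit adder, loading 9 and 8 with a carry-in of 1 gives a total of 18, so Z3 holds 2 and Z5 holds 1.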
Below a CIM parallel adder circuit 200 in accordance with the present invention is described.
The CIM parallel adder circuit 200 relates to an algorithm to add a plurality of N numbers to calculate a sum, and is based on an addition scheme known as a binary tree addition scheme. According to this scheme, parallel additions on pairs of N numbers (e.g., Li1 ... Li8) are carried out in iterations that halve the number of additions in each iteration (the first additions run in parallel on N/2 pairs of these numbers, the next additions are done in parallel on the N/4 pairs of the N/2 results, and so on). An example of the binary tree scheme is shown in Figure 1c for N=8 numbers Li1 ... Li8.
In Figure 1d a CIM parallel adder 200 in accordance with the present invention is shown.
The CIM parallel adder 200 is built in the memristor-based crossbar 110 in accordance with the binary tree scheme of Figure 1c. The CIM parallel adder circuit 200 is arranged in the crossbar 110 and comprises a number of columns CE1 ... CE17 that correspond either with columns CS comprising circuits (circuit portions) acting as storage locations Li1 ... Li16 for numbers or with columns CA comprising circuits acting as arithmetic instructions (in this case: addition ADD).
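The binary tree addition scheme can be sketched in software as follows. This is an illustration of the scheme only, executed sequentially rather than in a crossbar; the function name `binary_tree_sum` is hypothetical, and the sketch assumes that N is a power of two so that every iteration pairs up all remaining values.

```python
from typing import List, Tuple


def binary_tree_sum(values: List[int]) -> Tuple[int, int]:
    """Return (sum, number of iterations) using pairwise additions per iteration.

    Assumption: len(values) is a power of two, so each level halves cleanly.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of inputs must be a power of two")
    level = list(values)
    iterations = 0
    while len(level) > 1:
        # All additions of one iteration are independent of each other,
        # which is what allows them to run in parallel in one adder column.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        iterations += 1
    return level[0], iterations
```

For eight inputs the loop performs 4, then 2, then 1 additions, i.e. three iterations.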
By arranging the circuits of the same function (thus storage or arithmetic) in the same column CS, CA next to a column with circuits of another function, parallelism can be achieved by controlling the resistive signals in the crossbar 110 in the row direction RE (perpendicular to the columns CE).
The CIM parallel adder circuit 200 arranges multiple CIM adders ADD in a binary tree network. The carry-in and carry-out registers (see Figure 1c) of an adder ADD are connected properly to generate correct addition results. In the prior art, the binary tree network is ineffective using traditional platforms due to the difference between processor and memory fabrication. A processor coupled with a large amount of memory is unrealistic with traditional CMOS technology. Using the new features of the CIM architecture 100 and its underlying memristor technology, the adder tree can be effectively mapped to reduce addition latency and increase resource utilization.
Fig. 1d shows a CIM parallel adder circuit 200 using a mapped binary adder tree with 16 inputs (see Fig. 1c for an 8-input example) on the CIM architecture 100. Each CIM adder ADD corresponds to the adder presented in Fig. 1b. Note that the output latches sum (Z5) at each adder are reused as input latches (Z1 or Z2) in the next adder stage. A first column CE1 of the crossbar gets the first half of the inputs Li1 ... Li8. Add units ADD in a second column CE3 adjacent to the first column CE1 add every two corresponding input latches and store results in corresponding output latches Ls1 ... Ls4 in a next third column CE5. Adders ADD in a fourth column CE7 add up results Ls11 ... Ls12 from output latches of the third column CE5. Another direction of computation takes place from the final column CE17 backwards to utilize as many resources of the crossbar as possible. In other words, a cell in CIM architecture 100 is configured to be an add unit ADD or a latch Li; Ls. The interconnects (dotted lines) between multiple rows and columns represent communication channels among cells; these channels are located in the even-numbered columns CE2, CE4, ... CE16.
The crossbar array 110 for N additions contains at least N/2 × log2(N) CIM adders. Due to the multidirectional characteristic of the CIM-based adder, additions can operate in two direction flows (RE from left to right and vice versa) of the array (as shown in Fig. 1d). These bidirectional operations efficiently exploit resources in the architecture. Therefore, the architecture is designed with one additional column and half the number of rows in comparison with the above-mentioned size. Hence, for N inputs the delay and array size equal log2(N)+1 and N/4 × (2 log2(N)+1) cells, respectively (each adder processes two inputs). With this design, every operation is performed on a distinct operational and storage unit; hence, there is no operation overlapping at a particular location. In addition, a maximum number of adders ADD in the CIM architecture 100 are used to avoid idle adders. With smart communication schemes, the architecture can be pipelined to increase overall performance.
Thus, the memristor-based crossbar 110 comprises a CIM parallel adder circuit 200. The CIM parallel adder circuit 200 comprises in the crossbar a columnar structure CE1 ... CE17 of storage location columns CS comprising data circuits for holding data and arithmetic instruction circuit columns CA that are arranged in alternation. In between a storage location column and an adjacent arithmetic instruction column a connecting column CE2; CE4; CE6; CE8; CE10; CE12; CE14; CE16 is provided that comprises connecting circuits to connect one or more data circuits Li; Ls in a storage location column CS with an arithmetic instruction circuit, i.e., an adder circuit ADD, in the adjacent arithmetic instruction column CA.
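As a worked check, the sizing expressions quoted above can be evaluated for a concrete N. The sketch below simply computes the three quantities as stated in the text; the function name `cim_adder_sizing` and the dictionary keys are hypothetical, and it assumes N is a power of two and at least 4 so that all three values are integers.

```python
def cim_adder_sizing(n: int) -> dict:
    """Evaluate the sizing expressions of the description for N inputs.

    Assumption: n is a power of two and n >= 4, so every result is an integer.
    """
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 4")
    levels = n.bit_length() - 1  # exact log2(n) for a power of two
    return {
        "adders": (n // 2) * levels,          # N/2 x log2(N)
        "delay": levels + 1,                  # log2(N) + 1
        "cells": (n // 4) * (2 * levels + 1), # N/4 x (2 log2(N) + 1)
    }
```

For N = 16 this gives a delay of 5 and an array size of 36 cells, which corresponds to 4 rows by 9 columns if the connection columns are not counted as cells.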
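The alternation of column types can be summarized by a small lookup. This sketch is inferred from the column numbers named in the description of Figure 1d (storage in CE1, CE5, CE9, CE13, CE17; adders in CE3, CE7, CE11, CE15; connections in the even-numbered columns); the function name `column_role` and the returned labels are hypothetical.

```python
def column_role(index: int) -> str:
    """Return the role of column CE<index> (1-based) in the layout of Figure 1d."""
    if index < 1:
        raise ValueError("column indices start at 1")
    if index % 2 == 0:
        return "connection"  # CE2, CE4, ... hold the connecting circuits
    # Odd columns alternate between storage (CS) and arithmetic (CA) columns.
    return "storage" if index % 4 == 1 else "adder"


# Roles of CE1 ... CE17 for the 16-input example.
layout = [column_role(i) for i in range(1, 18)]
```

With 17 columns this yields five storage columns, four adder columns and eight connection columns.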
In a first column CE1 a first half of the N numbers Lil ... Li8 to be added is stored in respective data circuits for holding the number value.
In an adjacent second column CE3 the adder circuits ADD for adding two numbers are arranged. On the opposite side (relative to the first column CE1) of the second column CE3 containing the adder circuits in a further third column CE5, the individual sums Lsl ... Ls4 are stored. Note that in the further third column CE5, the storage locations of the individual sums are shifted in the upward column direction to avoid empty storage locations (the additions reduce the data by half). By the shift the lower part of the column CE5 is thus free for other data.
It is noted that inbetween columns CE1 and CE3, and CE3 and C5, intermediate columns CE2, CE4 are present that handle the transfer of signals between data circuits in one column and arithmetic circuits in the adjacent column. The sequence of data circuits containing columns and arithmetic instruction circuits is repeated in the crossbar until a single pair of summable numbers Lsl 1 ... Lsl2 is reached for the first half of the N numbers to be added.
For the other half of the N numbers to be added, columns of data circuits next to arithmetic instruction circuits are arranged from the last column CE17 from the opposite direction. In the CIM parallel adder circuit 200 of Figure Id in column CE17 the second half of the N numbers Li9 ... Lil6 to be added is stored in respective data circuits for holding the number value. In an adjacent column CE15 in the direction towards column CE1 the adder circuits ADD for adding two numbers are arranged. The sum results Ls5 ... Ls8 are stored in column CE13. In adjacent column CE 11 adder circuits ADD are arranged which add the intermediate results Ls5, Ls6 and Ls7, Ls8 to sum results Lsl3 .. Lsl4. It is noted that by transfer of some addition results to storage locations unused in another column, the binary tree scheme can be “folded” which results in a more compact circuit occupying relatively less area of the crossbar, as will be appreciated by the person skilled in the art since the computation in the adder circuit can be done in both row directions. The sum Ls21 of Lsl 1, Lsl 2 is thus stored in column C13, row RE1, the sum Ls22 of Lsl3, Lsl4 is in column CE5, row RE 5. By copying Ls21 and Ls22 in this example to column CE13, rows RE3 (Ls21_copy) and RE4 (Ls22_copy), respectively, the final result Lst is calculated by the adder circuit ADD in column CE11 and stored in column CE9, row RE3.
The person skilled in the art will appreciate that alternative “folding” options may be available to optimize the CIM parallel adder circuit 200. The circuit of Figure 1d is intended only as an exemplary embodiment and does not limit the scope of the invention as claimed in the claims.
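The overall scheme, two half-trees that are reduced independently and then combined by one final addition, can be sketched in software as follows. This is a functional model only: it does not model the crossbar, the folding, or the copy operations, and the names `reduce_half` and `parallel_add` as well as the restriction to N being a power of two are assumptions for illustration.

```python
from typing import List, Tuple


def reduce_half(values: List[int]) -> Tuple[int, int]:
    """Reduce one half by repeated pairwise addition; return (sum, stages)."""
    stages = 0
    while len(values) > 1:
        if len(values) % 2 != 0:
            raise ValueError("each stage needs an even number of values")
        # All additions within one stage are independent of each other,
        # so they could be executed in parallel by one column of adders.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        stages += 1
    return values[0], stages


def parallel_add(numbers: List[int]) -> Tuple[int, int]:
    """Add N numbers as two half-trees plus one final addition."""
    half = len(numbers) // 2
    if half == 0 or len(numbers) != 2 * half:
        raise ValueError("N must be a positive even number")
    first_sum, first_stages = reduce_half(numbers[:half])     # first half
    second_sum, second_stages = reduce_half(numbers[half:])   # second half
    # Both halves proceed concurrently, so the slower half sets the count;
    # one more stage adds the two partial sums into the final result.
    return first_sum + second_sum, max(first_stages, second_stages) + 1


total, stages = parallel_add(list(range(1, 17)))
print(total, stages)  # 136 4
```

For N = 16 the model needs four addition stages, compared with fifteen sequential additions, which is the benefit a binary tree addition is meant to provide.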
Figure 2 shows a crossbar-based circuit 300 for matrix multiplication in accordance with an embodiment of the invention.
For this embodiment, the circuit for matrix multiplication relates to a multiplication of N-by-N matrices, for example an 8-by-8 matrix C calculated as the matrix product A·B. The circuit 300 shown is arranged to calculate one matrix element Cij of the product. To calculate all matrix elements the circuit is simply repeated in the crossbar structure 110.
To obtain the highest computational efficiency in the crossbar structure, multiplications of matrix elements and additions of intermediate results should be done in parallel where possible. For that reason, in the CIM architecture 100 identical arithmetic instruction circuits should be located in the same column or the same row of the crossbar (and of the respective circuit).
In the embodiment, storage locations or data circuits A1, B1 for holding data (input matrix elements Aik or Bkj) are arranged in a first column CL1 in a first data entry row R1, and in parallel secondary data circuits A2, B2 are arranged in a second data entry row R5, which is separated from the first row R1 by three intermediate rows R2, R3, R4.
In the first data entry row R1, next to the input data circuits A1, B1, a multiplier circuit M1 is arranged, configured for multiplication of the contents of data circuits A1, B1. Next to the multiplier circuit M1 in the first row, on the side opposite the input data circuits A1, B1, a storage location or output data circuit C1 is arranged for holding the result of the multiplication of A1 and B1.
In the first row R1 a same series of third input data circuits A3, B3, third multiplier circuit M3 and third output data circuit C3 is arranged in mirrored position, in such a manner that the third output data circuit C3 is located next to the first output data circuit C1.
In the second data entry row R5 a same arrangement is made with second input data circuits A2, B2, a second multiplier circuit M2 and a second output data circuit C2, and in mirrored position fourth input data circuits A4, B4, a fourth multiplier circuit M4 and a fourth output data circuit C4.
In the intermediate row R2 adjacent the first data entry row R1, a first adder circuit ADD1 is arranged in column CL3 below the first and third output data circuits for calculating the addition result C13 of the content of C1 and C3.
In the intermediate row R4 adjacent the second data entry row R5, a second adder circuit ADD2 is arranged next to the second and fourth output data circuits for calculating the addition result C24 of the content of C2 and C4.
In the central intermediate row R3, a storage location or output data circuit D1 is arranged in the same column, between the first and second adder circuits ADD1 and ADD2, for holding the addition results C13 and C24, respectively. In the central row R3 an adder ADD3 is arranged for addition of C13 and C24.
The above-described multiplier and adder structure A1, B1, C1, A2, B2, C2, A3, B3, C3, A4, B4, C4, M1, M2, M3, M4, ADD1, ADD2, ADD3, C13, C24 is repeated as an identical structure A5, B5, C5, A6, B6, C6, A7, B7, C7, A8, B8, C8, M5, M6, M7, M8, ADD5, ADD6, ADD7, C57, C68 in the row direction RD in an adjacent area of the crossbar array, with an intermediate column CL6 that is provided with storage locations for holding the intermediate results D1, D2 of the two neighbouring circuit structures and with a further adder circuit ADD8 for calculating the addition result E1 of the content of D1 and D2. The final result E1 for one matrix element Cij is then stored in a result output data circuit E1 arranged in the intermediate column CL6 in the second data entry row R5, where the result can be fetched by an operation from the controller 140.
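The staged computation of one matrix element can be summarised in a short functional sketch. The variable names follow the reference signs of Figure 2 (C1 ... C8, C13, C24, C57, C68, D1, D2, E1). The function name `matrix_element`, the assignment of the k-th input pair to multiplier Mk, and the pairing of C5 with C7 and C6 with C8 (inferred by analogy from the names C57 and C68) are assumptions for illustration.

```python
from typing import Sequence


def matrix_element(a: Sequence[float], b: Sequence[float]) -> float:
    """Compute one element of an 8-by-8 product using the staged scheme."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("this sketch is fixed to eight input pairs")
    # Stage 1: multipliers M1 ... M8 produce C1 ... C8 (index 0 is unused
    # so that c[k] matches the reference sign Ck).
    c = [0.0] + [a[k] * b[k] for k in range(8)]
    # Stage 2: four adders combine the mirrored multiplier outputs.
    c13, c24 = c[1] + c[3], c[2] + c[4]
    c57, c68 = c[5] + c[7], c[6] + c[8]
    # Stage 3: one adder per structure yields the intermediate results.
    d1, d2 = c13 + c24, c57 + c68
    # Stage 4: the adder in the intermediate column yields the result E1.
    e1 = d1 + d2
    return e1


row_of_a = [1, 2, 3, 4, 5, 6, 7, 8]
column_of_b = [8, 7, 6, 5, 4, 3, 2, 1]
print(matrix_element(row_of_a, column_of_b))  # 120
```

All operations within one stage are independent of each other, so the eight multiplications and the subsequent additions per stage can each be carried out in a single parallel step.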
Figure 3 shows a schematic layout of a memristor-based crossbar 110 arranged with multiplication and addition circuits 300 for the calculation of each matrix element. The multiplication and addition circuits 300 in an embodiment relate to the circuit as shown and described with reference to Figure 2.
The multiplication and addition circuits 300 are arranged in a row and column structure X, Y in which each multiplication and addition circuit is arranged to calculate one matrix element Cij of the matrix product A·B (see the n-by-n matrix C in insert 3a). The multiplication and addition circuits are aligned in both column and row directions to achieve parallelism in the computation of each matrix element Cij.
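In software terms, the layout of Figure 3 corresponds to evaluating Cij = Σk Aik·Bkj with one independent circuit per index pair (i, j). The sketch below is a functional model under that reading; `element_circuit` and `crossbar_matmul` are hypothetical names, and the element circuit is reduced to a plain sum of products rather than the staged adder structure of Figure 2.

```python
from typing import List

Matrix = List[List[float]]


def element_circuit(row: List[float], col: List[float]) -> float:
    """Stand-in for one circuit 300: multiply pairs, then sum the products."""
    return sum(x * y for x, y in zip(row, col))


def crossbar_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Compute the n-by-n product with one element circuit per (i, j)."""
    n = len(a)
    if len(b) != n or any(len(r) != n for r in a) or any(len(r) != n for r in b):
        raise ValueError("both inputs must be n-by-n matrices")
    # Column j of B is the second input of every circuit in grid column j.
    columns_of_b = [[b[k][j] for k in range(n)] for j in range(n)]
    # The n * n circuits do not depend on each other's results.
    return [
        [element_circuit(a[i], columns_of_b[j]) for j in range(n)]
        for i in range(n)
    ]


print(crossbar_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

Because no circuit reads another circuit's output, the number of sequential steps is set by a single element circuit, while the crossbar area grows with the n² circuits.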
The invention has been described with reference to the preferred embodiment. Obvious modifications and alterations will occur to others upon reading and understanding the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims.
Claims (19)
 1. A method of data processing based on arithmetic operations in a memristor-based crossbar array, the crossbar array having a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite the first direction, such that each first bar crosses the second plurality of second bars and forms a contact at each crossing, each contact forming a memristor with at least two different programmable resistance states; wherein the method comprises: defining data circuit templates for containing data; defining calculation circuit templates for containing a selected arithmetic instruction from the arithmetic operations; arranging data as data circuits at predetermined data locations of the array in accordance with the data circuit template by programming the memristors of the array at the predetermined data locations; arranging arithmetic instructions as instruction circuits at predetermined instruction locations of the array in accordance with the calculation circuit template relating to the respective arithmetic instruction by programming the memristors of the array at the predetermined instruction locations; wherein data functioning as input for a first arithmetic instruction is arranged as a data circuit connected to an input location of the associated first arithmetic instruction circuit, and data functioning as output of the first arithmetic instruction is connected to an output location of the first arithmetic instruction circuit; data circuits for data to be calculated in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array; and instruction circuits for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array.
 2. A method according to claim 1, comprising: arranging a plurality of data functioning as input for the first arithmetic instruction as a plurality of input data circuits in an input data column of the array; arranging a plurality of data functioning as output for the first arithmetic instruction as a plurality of output data circuits in an output data column of the array; and arranging the first arithmetic instruction as a plurality of first arithmetic instruction circuits in a first instruction column of the array, between the input data column and the output data column.
 3. The method of claim 2, wherein the plurality of data functioning as output is arranged as a plurality of second input data circuits for a second arithmetic instruction, the method comprising: arranging the second arithmetic instruction as a plurality of second arithmetic instruction circuits in a second instruction column or row of the array, with the input of each second arithmetic instruction circuit connected to the associated second input data circuit; and arranging a plurality of second output data circuits in the array, each connected to the output of the associated second arithmetic instruction circuit.
 4. A method according to any one of the preceding claims, comprising, for each connection between either an output of a data circuit and an input of an instruction circuit or an input of a data circuit and an output of an instruction circuit, forming a connection circuit in the array between the data circuit and the instruction circuit.
 5. The method of any one of claims 1-3, wherein the arithmetic instructions are selected from operations including addition, multiplication, subtraction, and division.
 6. A method according to any one of the preceding claims, wherein the data processing is a parallel addition of a plurality of N numbers using a binary tree addition algorithm, with N an even number.
 7. The method according to claim 6, comprising: forming in the array N + 1 columns of N/2 rows, wherein the outer columns 1, N + 1 and the central column N/2 + 1 are arranged as data columns with data circuits for receiving external data, wherein each data circuit in a data column is connected by a connection circuit in an adjacent column to an adder circuit in a successive arithmetic column, and each adder circuit is connected by a further connection circuit to an output data circuit in a next data column, in such a way that the result of the addition of the N numbers is stored in the central column N/2 + 1.
 8. The method of claim 7, wherein a data circuit in a data column is either an output data circuit or an input data circuit for an addition, or is a combined data circuit configured to receive a result from an addition and to send the result as input data for a subsequent addition.
 9. The method of claim 7 or claim 8, wherein additions in each adder column are performed in parallel.
 10. The method of any one of claims 7-9, wherein data is transmitted in parallel in each connection column.
 11. The method of any preceding claim, wherein the data processing is a matrix multiplication algorithm for an N-by-N matrix.
 12. The method of claim 11, wherein a first arithmetic instruction is a multiplication for a pair of matrix elements that is executed as a multiplier circuit and a second arithmetic instruction is an addition for a pair of multiplication results that is executed as an adder circuit, the method comprising: arranging in the crossbar array, in at least one row, two multiplier circuits, each between an input data circuit for containing a pair of matrix elements and an output data circuit for containing the result of the multiplication, the output data circuits being placed side by side between the multiplier circuits, such that within the at least one row the multiplier circuits can operate simultaneously; arranging in the crossbar array, in an adjacent row next to the output data circuits, an adder circuit that is connected to the output data circuits and adapted to receive data from the respective output data circuits and to sum the received data; and arranging in a further row a sum output storage circuit adapted to receive result data from the adder circuit, such that the adder circuit is located between the row with the multiplier circuits and the row with the sum output storage circuit.
 13. A method according to claim 12, comprising: arranging in the crossbar array, in a second row, two further multiplier circuits, each between an input data circuit for containing a pair of matrix elements and an output data circuit for containing the result of the multiplication, wherein the output data circuits are placed side by side between the further multiplier circuits, such that within the second row the further multiplier circuits can operate simultaneously; arranging in the crossbar array, in a row adjacent to the second row and next to the output data circuits that are adjacent to each other, a further adder circuit connected to the output data circuits and adapted to receive data from the respective output data circuits and to sum the received data; and arranging in the crossbar array, in a further row adjacent to that adjacent row, a second sum output storage circuit adapted to receive result data from the further adder circuit, such that the further adder circuit is located between the second row with the further multiplier circuits and the further row with the second sum output storage circuit, and wherein the second sum output storage circuit is arranged next to the sum output storage circuit.
 14. The method according to claim 13, further comprising: arranging, in a row direction next to the sum output storage circuit and the second sum output storage circuit, a third adder circuit configured to add data from the sum output storage circuit and data from the second sum output storage circuit, and arranging a third sum output storage circuit adjacent to the third adder circuit in the same row.
 15. An arithmetic operation data processing apparatus comprising a memristor-based crossbar array and a controller, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite the first direction, such that each first bar crosses the second plurality of second bars and a contact is formed at each crossing, each contact forming a memristor with at least two different programmable resistive states, and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines is connected to a corresponding one of the parallel first bars and each of the second electrical connection lines is connected to a corresponding one of the parallel second bars, and the controller is adapted to perform control operations on the memristors of the crossbar array over the connection interface in accordance with the method of any one of claims 1-14.
 16. The apparatus of claim 15, wherein the controller is arranged to execute the first arithmetic instructions of the plurality of first arithmetic instruction circuits in the first instruction column of the array in parallel.
 17. The apparatus of claim 16, wherein the controller is arranged to execute the second arithmetic instructions of the plurality of second arithmetic instruction circuits in the second instruction column or row of the array in parallel.
 18. A computer architecture for computation-in-memory (CIM) applications related to arithmetic operations data processing, comprising a memristor-based crossbar array and a controller, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite the first direction, such that each first bar crosses the second plurality of second bars and a contact is formed at each crossing, each contact forming a memristor with at least two different programmable resistive states, and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, each of the first electrical connection lines being connected to a corresponding one of the parallel first bars and each of the second electrical connection lines being connected to a corresponding one of the parallel second bars, and the controller being arranged to perform control operations on the memristors of the crossbar array over the connection interface; in which architecture data circuits and instruction circuits are arranged on the basis of data circuit templates for containing data or of calculation circuit templates for containing a selected arithmetic instruction from the arithmetic operations; wherein the data circuits are arranged at predetermined data locations of the array; wherein the instruction circuits are arranged at predetermined instruction locations of the array; wherein data having the function of input for a first arithmetic instruction is arranged as a data circuit connected to an input location of the associated first arithmetic instruction circuit, and data having the function of output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit; and wherein data circuits for data to be calculated in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array, and instruction circuits for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array.
 19. A computer architecture for computation-in-memory applications comprising either a parallel addition circuit or a matrix multiplier circuit.
Abstract
Method for data processing based on arithmetic operations in a memristor-based crossbar array, the crossbar array including a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction opposite to the first direction, such that each first bar crosses the second plurality of second bars and forms a contact at each crossing, each contact forming a memristor with at least two different programmable resistive states; the method including: defining data circuit templates for data; defining computation circuit templates for a selected arithmetic instruction from the arithmetic operations; arranging data circuits at predetermined data locations of the array in accordance with the data circuit template; and arranging instruction circuits at predetermined instruction locations of the array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the memristors at the predetermined instruction locations. [fig. 1A]
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title
NL2015114A NL2015114B1 (en)  2015-07-07  2015-07-07  Scalable computation architecture in a memristor-based array.
Applications Claiming Priority (2)
Application Number  Priority Date  Filing Date  Title
NL2015114A NL2015114B1 (en)  2015-07-07  2015-07-07  Scalable computation architecture in a memristor-based array.
PCT/NL2016/050493 WO2017007318A1 (en)  2015-07-07  2016-07-07  Scalable computation architecture in a memristor-based array
Publications (1)
Publication Number  Publication Date
NL2015114B1 (en)  2017-02-01
Family
ID=54705267
Family Applications (1)
Application Number  Title  Priority Date  Filing Date
NL2015114A NL2015114B1 (en)  Scalable computation architecture in a memristor-based array.  2015-07-07  2015-07-07
Country Status (2)
Country  Link
NL (1)  NL2015114B1 (en)
WO (1)  WO2017007318A1 (en)
Citations (3)
Publication number  Priority date  Publication date  Assignee  Title
EP1877927A1 (en) *  2005-04-28  2008-01-16  The University Court Of The University Of Edinburgh  Reconfigurable instruction cell array
US20080222339A1 (en) *  2000-12-19  2008-09-11  Anthony Peter John Claydon  Processor architecture with switch matrices for transferring data along buses
WO2013181664A1 (en) *  2012-06-01  2013-12-05  The Regents Of The University Of California  Programmable logic circuit architecture using resistive memory elements

2015-07-07  NL  NL2015114A  NL2015114B1 (en)  not active: IP Right Cessation
2016-07-07  WO  PCT/NL2016/050493  WO2017007318A1 (en)  active: Application Filing
Also Published As
Publication number  Publication date
WO2017007318A1 (en)  2017-01-12
Legal Events
Date  Code  Title  Description
MM  Lapsed because of non-payment of the annual fee  Effective date: 2018-08-01