WO2017007318A1

WO2017007318A1 - Scalable computation architecture in a memristor-based array

Info

Publication number: WO2017007318A1
Application number: PCT/NL2016/050493
Authority: WO
Inventors: Said Hamdioui; Koenraad Laurent Maria BERTELS; Mottaqiallah TAOUIL
Original assignee: Technische Universiteit Delft
Priority date: 2015-07-07
Filing date: 2016-07-07
Publication date: 2017-01-12
Also published as: NL2015114B1

Abstract

Method for data processing based on arithmetic operations in a memristor-based crossbar. The crossbar includes a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states. The method includes: defining data circuit templates for data in the memristor-based crossbar; defining computation circuit templates within the memristor-based crossbar for a selected arithmetic instruction from the arithmetic operations; arranging data circuits on predetermined data locations of the crossbar array in accordance with the data circuit template and arranging instruction circuits on predetermined instruction locations of the crossbar array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the memristors at the predetermined instruction locations.

Description

Scalable computation architecture in a memristor-based array

Field of the invention

The present invention relates to a method for data processing based on arithmetic operations in a memristor-based crossbar array.

Also, the invention relates to a memristor based computational device for data processing. Furthermore, the invention relates to a computer architecture.

Background

In the last several decades, CMOS down-scaling has been the primary driver behind computer performance improvement. However, CMOS technology is reaching its physical, if not economic, limits. Down-scaling devices has led to many challenges such as leakage power, reliability, fabrication process and turnaround time, test complexity, cost for mask and design, and yield. Furthermore, the performance gain from increasing clock speed has saturated since the early 2000s; today, speed-up is no longer the result of a faster clock, but rather of parallelization on multi-core and many-core systems. However, the number of parallel cores that can be programmed and the computation efficiency that can be extracted are tending to saturate as well.

Today's computing systems are mainly built on John von Neumann's stored-program computer concept. A major drawback of this computer design is the gap between the processing units and the main memory, the so-called memory bottleneck. For data-intensive applications, the memory bottleneck is becoming even more severe and is putting major limitations on both performance and energy consumption. All of this motivates the need for a new architecture that is able to (a) eliminate the communication bottleneck and support massive parallelism to increase the overall performance, and (b) reduce the energy inefficiency to improve the computation efficiency.

Bringing the memory closer to the processing unit and reducing the memory bottleneck has attracted a lot of attention. In 1969, Logic-In-Memory (LIM) was originally introduced as a memory accelerator, i.e., some processing units are added close to the main memory. In 1992, the LIM concept re-appeared under the name computational RAM; it typically uses the same accelerator concept, where the processing units are supposed to perform operations needed by the memory, such as address translations. In the late 1990s and early 2000s, Processor-In-Memory (PIM) was proposed and manufactured. PIM is based on splitting the main memory into different parts, each surrounded by computing units to bring the computation near to the memory; the architecture has a master CPU that takes care of the overall control. In 2004, Memory-In-Logic (MIL), which provides massive addressable memory on the processor, was proposed for supercomputer systems. All of the above-mentioned efforts have tried to close the gap between processor and memory speed. However, as the computation and the storage are kept separate, they fundamentally use a von Neumann stored-program computer concept and therefore suffer from a memory bottleneck, which negatively impacts the performance.

It is an object of the present invention to overcome or mitigate one or more of the disadvantages of the prior art.

Summary of the invention

The object is achieved by a method for data processing based on arithmetic operations in a memristor-based crossbar array, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from, for example perpendicular to, the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states;

the method comprising:

- defining data circuit templates for holding data in the memristor-based crossbar array;

- defining computation circuit templates for holding a selected arithmetic instruction from the arithmetic operations in the memristor-based crossbar array;

- arranging data as data circuits in the memristor-based crossbar array on predetermined data locations of the array in accordance with the data circuit template by programming the memristors of the memristor based crossbar array at the predetermined data locations;

- arranging arithmetic instructions as instruction circuits in the memristor-based crossbar array on predetermined instructions locations of the array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the memristors of the memristor based crossbar array at the predetermined instruction locations,

wherein data acting as input for a first arithmetic instruction is arranged as data circuit in the memristor-based crossbar array connected to an input location of the associated first arithmetic instruction circuit in the memristor-based crossbar array, and data acting as output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit;

data circuits in the memristor-based crossbar array for data to be computed in a parallel processing step are arranged along either a column direction or a row direction in the memristor based crossbar array, and

instruction circuits in the memristor-based crossbar array for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the memristor based crossbar array.

Advantageously, computation based on the data and instructions is executed in the memristor crossbar (acting as memory unit) without moving the data and instructions from their respective locations. By this organization of data and instructions the memory unit also functions as processing unit. The main advantage of data processing using computation-in-memory over von Neumann type data processing is the tight integration of both computing and storing operations using the same physical crossbar. Moreover, the performance of von Neumann architectures suffers from delays caused by the required transfer of data and instructions from memory to CPU and vice versa. Since this bottleneck is removed here, in the Computation-In-Memory architecture massive parallelism can be achieved with minimized communication.

According to an aspect the invention provides a method as described above, comprising:

- arranging a plurality of data acting as input for the first arithmetic instruction as a plurality of input data circuits in an input data column of the memristor based crossbar array;

- arranging a plurality of data acting as output for said first arithmetic instruction as a plurality of output data circuits in an output data column of the memristor based crossbar array, and

- arranging said first arithmetic instruction as a plurality of first arithmetic instruction circuits in a first instructions column of the memristor based crossbar array, intermediate the input column and the output column.

According to an aspect the invention provides a method as described above, wherein the plurality of data acting as output are arranged as a plurality of second input data circuits in the memristor-based crossbar array for a second arithmetic instruction, the method comprising:

- arranging said second arithmetic instruction as a plurality of second arithmetic instruction circuits in a second instruction column or row of the memristor based crossbar array, with the input of each second arithmetic instruction circuit connected to the associated second input data circuit,

- arranging a plurality of second output data circuits in the memristor-based crossbar array connected to the respective output of the associated second arithmetic instruction circuit.

According to an aspect the invention provides a method as described above, comprising for each connection between either an output of a data circuit and an input of an instruction circuit or an input of a data circuit and an output of an instruction circuit, a creation of a connection circuit in the memristor based crossbar array between said data circuit and said instruction circuit.

According to an aspect the invention provides a method as described above, wherein the arithmetic instructions are selected from operations comprising addition, multiplication, subtraction and division.

According to an aspect the invention provides a method as described above, wherein the arithmetic instructions additionally are selected from operations comprising logic functions.

Also, operations for logic functions can be implemented in an instruction circuit.

According to an aspect the invention provides a method as described above, wherein the data processing is a parallel addition of a plurality of N numbers using a binary tree addition algorithm, N being even.

According to an aspect the invention provides a method as described above, comprising:

- creating in the array N+1 columns of N/2 rows, the outer columns 1, N+1 and the central column N/2+1 arranged as data columns with data circuits for receiving external data, in which each data circuit in a data column is connected through a connection circuit in an adjacent column to an adder circuit in a subsequent arithmetic column and each adder circuit through a further connection circuit is connected to an output data circuit in a next data column, such that in the central column N/2+1 the result of the addition of the N numbers is stored.

According to an aspect the invention provides a method as described above, wherein a data circuit in a data column of the memristor based crossbar array is either an output data circuit or an input data circuit for an addition or a combined data circuit configured to receive a result from an addition and to transmit said result as input data for a next addition.

According to an aspect the invention provides a method as described above, wherein additions in each adder column are performed in parallel.

According to an aspect the invention provides a method as described above, wherein in each connection column data are transferred in parallel.

According to an aspect the invention provides a method as described above, wherein the data processing is a matrix multiplication algorithm for an N-by-N matrix.

According to an aspect the invention provides a method as described above, wherein a first arithmetic instruction is a multiplication for a pair of matrix elements embodied as a multiplier circuit in the memristor-based crossbar array and a second arithmetic instruction is an addition for a pair of multiplication results embodied as an adder circuit in the memristor-based crossbar array, the method comprising:

- arranging in the memristor based crossbar array in at least one row:

— two multiplier circuits each in between an input data circuit for holding a pair of matrix elements and an output data circuit for holding the result of the multiplication, wherein the output data circuits are located adjacent to each other in between the multiplier circuits, in a manner that within the at least one row the multiplier circuits can operate simultaneously, and

- arranging in the memristor based crossbar array in an adjacent row:

- next to the output data circuits that are adjacent to each other, an adder circuit connected to said output data circuits and arranged for receiving data from the respective output data circuit and for adding the received data, and

- arranging in the memristor based crossbar array in a further row a sum output storage circuit arranged for receiving result data from the adder circuit, such that the adder circuit is between the row with multiplier circuits and the row with the sum output storage circuit.

According to an aspect the invention provides a method as described above, comprising:

- arranging in the memristor based crossbar array in a second row:

- two further multiplier circuits each in between an input data circuit for holding a pair of matrix elements and an output data circuit for holding the result of the multiplication, wherein the output data circuits are located adjacent to each other in between the further multiplier circuits, in a manner that within the second row the further multiplier circuits can operate simultaneously, and

- arranging in the memristor based crossbar array in a row adjacent to the second row:

- next to the output data circuits that are adjacent to each other, a further adder circuit connected to said output data circuits and arranged for receiving data from the respective output data circuit and for adding the received data, and

- arranging in the memristor based crossbar array in a further row next to the adjacent row a second sum output storage circuit arranged for receiving result data from the further adder circuit, such that the further adder circuit is between the second row with the further multiplier circuits and the further row with the second sum output storage circuit, and wherein the second sum output storage circuit is arranged adjacent to the sum output storage circuit.

According to an aspect the invention provides a method as described above, further comprising:

- arranging in the memristor based crossbar array in a row direction adjacent to the sum output storage circuit and the second sum output storage circuit, a third adder circuit that is configured for adding data from the sum output storage circuit and data from the second sum output storage circuit, and adjacent in the same row as the third adder a third sum output storage circuit.
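The data flow of the multiplier and adder arrangement described in the preceding aspects can be illustrated with a minimal sketch. It models only the arithmetic, not the memristor circuits; the function names `multiply_add_row` and `combine_rows` are hypothetical and do not appear in the application.

```python
def multiply_add_row(pair_a, pair_b):
    """Model one row with two multiplier circuits and the adder next to it.

    Each argument is a pair of matrix elements held in an input data circuit.
    The two products correspond to the adjacent output data circuits; their
    sum corresponds to the sum output storage circuit.
    """
    product_a = pair_a[0] * pair_a[1]  # first multiplier circuit
    product_b = pair_b[0] * pair_b[1]  # second multiplier circuit (can run simultaneously)
    return product_a + product_b       # adder circuit in the adjacent row


def combine_rows(first_row_pairs, second_row_pairs):
    """Model the third adder that adds the two sum output storage circuits."""
    first_sum = multiply_add_row(*first_row_pairs)
    second_sum = multiply_add_row(*second_row_pairs)
    return first_sum + second_sum      # third sum output storage circuit


# Four element pairs, as needed for one entry of a 4-by-4 matrix product.
result = combine_rows([(1, 2), (3, 4)], [(5, 6), (7, 8)])
```

In this sketch the four multiplications are independent of each other, which is the property that allows the multiplier circuits to operate simultaneously.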

Also, the invention relates to a computational device for data processing based on arithmetic operations, comprising a memristor-based crossbar array and controller, the memristor based crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from, for example perpendicular to, the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states, and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines connects to a corresponding one of the parallel first bars and each of the second electrical connection lines connects to a corresponding one of the parallel second bars, and the controller is arranged to perform control operations on the memristors of the crossbar array over the connection interface in accordance with the method as described above.

According to an aspect the invention provides a device as described above, wherein the controller is arranged to execute the first arithmetic instructions of the plurality of first arithmetic instruction circuits in the first instructions column of the memristor based crossbar array, in parallel.

According to an aspect the invention provides a device as described above, wherein the controller is arranged to execute the second arithmetic instructions of the plurality of second arithmetic instruction circuits in the second instruction column or row of the memristor based crossbar array, in parallel.

Further, the invention relates to a computer architecture for computation-in-memory applications related to data processing based on arithmetic operations, comprising a memristor-based crossbar array and a controller, the memristor based crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from, for example perpendicular to, the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states, and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines connects to a corresponding one of the parallel first bars and each of the second electrical connection lines connects to a corresponding one of the parallel second bars, and the controller is arranged to perform control operations on the memristors of the crossbar array over the connection interface;

in which architecture data circuits and instruction circuits are arranged in the memristor based crossbar array based on data circuit templates for holding data and on computation circuit templates for holding a selected arithmetic instruction from the arithmetic operations, respectively; the data circuits being arranged on predetermined data locations of the memristor based crossbar array; the instruction circuits being arranged on predetermined instructions locations of the memristor based crossbar array, wherein data acting as input for a first arithmetic instruction is arranged as data circuit in the memristor based crossbar array connected to an input location of the associated first arithmetic instruction circuit in the memristor based crossbar array, and data acting as output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit, data circuits for data to be computed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array, and instruction circuits for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array.

According to an aspect the invention provides a computer architecture as defined above, which interweaves both storage for memory functions and logic or arithmetic operations, such as parallel adders and matrix multiplication, in the memristor based crossbar array.

According to an aspect the invention provides a computer architecture as defined above, in which a portion of the memristor based crossbar array is configured at the memristor level to create circuitry for storage functions and another portion of the memristor based crossbar array is configured at the memristor level to create circuitry for logic or arithmetic operations.

Advantageous embodiments are further defined by the dependent claims.

Brief description of drawings

The invention will be explained in more detail below with reference to drawings in which illustrative embodiments thereof are shown. The drawings are intended exclusively for illustrative purposes and not as a restriction of the inventive concept. The scope of the invention is only limited by the definitions presented in the appended claims.

Figure 1a shows a computation-in-memory computer architecture for use in accordance with the method of the invention;

Figure 1b schematically shows a perspective view of a memristor based crossbar array;

Figure 1c shows a schematic of an adder circuit;

Figure 1d shows a layout of a binary tree addition algorithm;

Figure 1e shows a computation-in-memory parallel adder structure in accordance with the invention;

Figure 2 shows a computation-in-memory matrix multiplication and addition structure in accordance with the invention;

Figure 3 shows an arrangement of a plurality of computation-in-memory matrix multiplication and addition structures for a matrix multiplication in accordance with an embodiment of the invention.

Detailed description of embodiments

The invention relates to a computer architecture that is based on the principle of Computation-In-Memory: a memory unit comprises both data and instructions at physical locations in the memory unit, and computation based on the data and instructions is executed in the memory unit without moving the data and instructions from their respective locations. By this organization of data and instructions the memory unit also functions as processing unit. According to the invention, the memory unit is a memristor-based array or "crossbar".

The main advantage of a computation-in-memory (CIM) architecture over von Neumann architectures is the tight integration of both computing and storing operations using the same physical crossbar. Moreover, the performance of von Neumann architectures suffers from delays caused by the required transfer of data and instructions from memory to CPU and vice versa. Since this bottleneck is removed here, in the CIM architecture massive parallelism can be achieved with minimized communication.

Fig. 1a shows a computation-in-memory computer architecture 100 for use in accordance with the method of the invention. The three main CIM architecture components are shown: crossbar 110, communication network 130 and controller 140.

The communication network 130 is connected to the controller 140.

The controller 140 is configured to interface with the crossbar 110 by horizontal and vertical interfaces 141, 142 (comprising voltage controllers). The crossbar 110 consists of an array of memristors 120 that are used to implement logic functions and/or storage.

The communication network 130 is either implemented within the crossbar 110 or by using separate metal layers. The controller 140, which may be implemented by CMOS devices, handles auxiliary operations such as distributing data and controlling signals to the crossbar.

Fig. 1b schematically shows a memristor based crossbar array 110 in perspective view.

The memristor-based crossbar array comprises a plurality of parallel first bars 122 (that may be nanowires) extending in a first direction D1 and a second plurality of parallel second bars 124 (e.g. nanowires) extending in a second direction D2 different from, for example perpendicular to, the first direction D1, such that each first bar 122 crosses the second plurality of second bars 124 and at each crossing forms a contact 126. Each contact 126 forms a memristor with at least two different programmable resistive states.
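As an illustration only, the crossbar structure can be modelled in software as a grid with one memristor per crossing. The following is a minimal sketch; the class name `Crossbar`, the state names and the choice of initial state are assumptions made for this example and are not part of the application.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    """Two programmable resistive states (names are illustrative)."""
    HIGH_RESISTANCE = 0
    LOW_RESISTANCE = 1


@dataclass
class Crossbar:
    """Hypothetical model: one memristor at each crossing of a first and a second bar."""
    n_first: int   # number of parallel first bars (direction D1)
    n_second: int  # number of parallel second bars (direction D2)
    cells: list = field(init=False)

    def __post_init__(self):
        # Assumption: every contact starts in the high-resistance state.
        self.cells = [[State.HIGH_RESISTANCE] * self.n_second
                      for _ in range(self.n_first)]

    def program(self, first_bar, second_bar, state):
        """Set the resistive state of the memristor at one crossing."""
        self.cells[first_bar][second_bar] = state

    def read(self, first_bar, second_bar):
        """Return the resistive state of the memristor at one crossing."""
        return self.cells[first_bar][second_bar]


xbar = Crossbar(n_first=4, n_second=3)
xbar.program(2, 1, State.LOW_RESISTANCE)
contact_count = sum(len(row) for row in xbar.cells)  # one contact per crossing
```

With 4 first bars and 3 second bars the model holds 12 contacts, since each first bar crosses every second bar.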

The CMOS devices are located as circuitry on a substrate 142. The crossbar array 110 is connected to the controller 140 by means of vias between the plurality of first bars 122 and the controller circuitry and by means of vias 128 between the plurality of second bars 124 and the controller circuitry.

According to the invention the crossbar 110 is specialized to perform computation and storage operations in cells organized in rows and columns. Each cell (or a combination of cells) can be a computational unit (such as an adder or multiplier) or storage location (such as a memory cell). The controller 140 is configured to arrange data as a data circuit (i.e., a storage location) on predetermined data locations of the crossbar array 110 in accordance with the data circuit template by programming the memristors of the crossbar array at the predetermined data locations. Additionally, the controller 140 is configured to arrange arithmetic instructions as an instruction circuit (i.e., a computational unit) on predetermined instructions locations of the crossbar array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the memristors of the array at the predetermined instruction locations.

The cells in a row or column can be configured with the same or different functionality, for example as a data circuit for a storage location or an instruction circuit for a computational unit. The communication in CIM architecture 100 has maximum flexibility. The CIM architecture 100 allows bi-directional communication in both horizontal and vertical directions. The controller may contain a router 143 and a finite state machine (FSM). The router 143 provides the FSM with a communication scheme for data distribution and data movements. The FSM fetches instructions from an instruction memory (e.g. hard disk, not shown), which contains instructions for the configuration of the crossbar array 110, in accordance with a predetermined algorithm or dataset. The FSM is further configured to convert fetched instructions to controlling signals for the row/column voltage controller 141, 142 to configure the crossbar array with computational units and storage locations as defined by the instructions from the instruction memory.
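The configuration step performed by the controller can be sketched as follows. This is a simplified illustration that assigns a role to each cell; the tuple format `(row, col, role)` stands in for the instructions the FSM fetches from the instruction memory and is an assumption of this example, as are the names `CellRole` and `configure_crossbar`.

```python
from enum import Enum


class CellRole(Enum):
    EMPTY = "empty"
    DATA = "data circuit"                # storage location
    INSTRUCTION = "instruction circuit"  # computational unit


def configure_crossbar(rows, cols, config_instructions):
    """Return a rows x cols grid of cell roles.

    config_instructions is a hypothetical list of (row, col, role) tuples.
    """
    grid = [[CellRole.EMPTY for _ in range(cols)] for _ in range(rows)]
    for row, col, role in config_instructions:
        if not (0 <= row < rows and 0 <= col < cols):
            raise IndexError(f"cell ({row}, {col}) is outside the array")
        grid[row][col] = role
    return grid


# Two input data circuits in the first column, an adder next to them,
# and an output data circuit on the other side of the adder.
program = [
    (0, 0, CellRole.DATA),
    (1, 0, CellRole.DATA),
    (0, 1, CellRole.INSTRUCTION),
    (0, 2, CellRole.DATA),
]
grid = configure_crossbar(2, 3, program)
```

The sketch leaves out the voltage controllers and the router; it only shows that cells in the same row or column may end up with the same or with different functionality.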

Fig. 1c shows an example of a single CIM adder ADD. The basic computational unit CU is an n-bit adder, which is surrounded by a number of memory cells (latches) Z1 ... Z5. An n-bit adder contains three n-bit latches (two for the inputs Z1, Z2 and one for the sum Z3), a 1-bit carry-in latch Z4 and a 1-bit carry-out latch Z5.
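The behaviour of such an n-bit adder with its five latches can be sketched in a few lines. This is a functional model only, assuming unsigned operands; the function name `cim_add` is hypothetical.

```python
def cim_add(z1, z2, carry_in, n_bits):
    """Functional model of one n-bit adder.

    z1, z2   : contents of the n-bit input latches Z1 and Z2
    carry_in : content of the 1-bit carry-in latch Z4
    Returns (z3, z5): the n-bit sum latch Z3 and the 1-bit carry-out latch Z5.
    """
    mask = (1 << n_bits) - 1
    total = (z1 & mask) + (z2 & mask) + (carry_in & 1)
    return total & mask, total >> n_bits


sum_latch, carry_out = cim_add(200, 100, 0, n_bits=8)  # 300 does not fit in 8 bits
```

Here 200 + 100 = 300 exceeds the 8-bit range, so the sum latch holds 44 and the carry-out latch holds 1.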

Below a CIM parallel adder circuit 200 in accordance with the present invention is described.

The CIM parallel adder circuit 200 relates to an algorithm to add a plurality of N numbers to calculate a sum, and is based on an addition scheme known as a binary tree addition scheme. According to this scheme, parallel additions on pairs of N numbers (e.g., Li1 ... Li8) are carried out in iterations that halve the number of additions in each iteration (the first additions run in parallel on N/2 pairs of these numbers, the next additions are done in parallel on the N/4 pairs of the N/2 results, and so on). An example of the binary tree scheme is shown in Figure 1d for N=8 numbers Li1 ... Li8.
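The binary tree addition scheme itself can be written down as a short sequential sketch. It assumes that N is a power of two, and the additions inside one stage, which the crossbar would perform in parallel, are simply computed one after another here; the function name `binary_tree_sum` is hypothetical.

```python
def binary_tree_sum(numbers):
    """Add N numbers pairwise, halving the number of values in every stage.

    Returns (total, stages). Assumes N is a power of two.
    """
    level = list(numbers)
    if not level or len(level) & (len(level) - 1):
        raise ValueError("the number of inputs must be a power of two")
    stages = 0
    while len(level) > 1:
        # All additions of one stage are independent of each other.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


total, stages = binary_tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

For N=8 the sketch performs 4, then 2, then 1 additions, so three stages are needed.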

In Figure 1e a CIM parallel adder 200 in accordance with the present invention is shown.

The CIM parallel adder 200 is built in the memristor-based crossbar 110 in accordance with the binary tree scheme of Figure 1d. The CIM parallel adder circuit 200 is arranged in the crossbar 110 and comprises a number of columns CE1 ... CE17 that correspond either with columns CS comprising circuits (circuit portions) acting as storage locations Li1 ... Li16 for numbers or with columns CA comprising circuits acting as arithmetic instructions (in this case: addition ADD).

By arranging the circuits of the same function (thus storage or arithmetic) in the same column CS, CA next to a column with circuits of another function, parallelism can be achieved by controlling the crossbar 110 in the row direction RE (perpendicular to the columns CE) and column direction.

The CIM parallel adder circuit 200 arranges multiple CIM adders ADD in a binary tree network. The carry-in and carry-out registers (see Figure 1d) of an adder ADD are connected properly to generate correct addition results. In the prior art, the binary tree network is ineffective using traditional platforms due to the difference between processor and memory fabrication. A processor coupled with a large amount of memory is unrealistic with traditional CMOS technology. Using the new features of the CIM architecture 100 and its underlying memristor technology, the adder tree can be effectively mapped to reduce addition latency and increase resource utilization.

Fig. 1e shows a CIM parallel adder circuit 200 using a mapped binary adder tree with 16 inputs (see Fig. 1d for an 8 input example) on the CIM architecture 100. Each CIM adder ADD corresponds to the adder presented in Fig. 1c. Note that the output latches sum (Z5) at each adder are reused as input latches (Z1 or Z2) in the next adder stage. A first column CE1 of the crossbar gets the first half of the inputs Li1 ... Li8. Add units ADD in a second column CE3 adjacent to the first column CE1 add every two corresponding input latches and store results in corresponding output latches Ls1 ... Ls4 in a next third column CE5. Adders ADD in a fourth column CE7 add up results Ls11 ... Ls12 from output latches of the third column CE5. Another direction of computation takes place from the final column CE17 backwards to utilize as many resources of the crossbar as possible. In other words, a cell in CIM architecture 100 is configured to be an add unit ADD or a latch Li; Ls. The interconnects (dotted lines) between multiple rows and columns represent communication channels among cells that are located in even numbered columns CE2, CE4, ... CE16.

The crossbar array 110 for N additions contains at least N/2 (log₂(N)) CIM adders. Due to the multi-directional characteristic of the CIM-based adder, additions can operate in two direction flows (RE from left to right and vice versa) of the array (as shown in Fig. 1e). These bi-directional operations efficiently exploit resources in the architecture. Therefore, the architecture is designed with one additional column and half the number of rows in comparison with the above-mentioned size. Hence, for N inputs the delay and array size equal log₂(N)+1 and N/4 × (2 log₂(N)+1) cells, respectively (each adder processes two inputs). With this design, every operation is performed on a distinct operational and storage unit; hence, there is no operation overlapping at a particular location. In addition, a maximum number of adders ADD in the CIM architecture 100 are used to avoid idle adders. With smart communication schemes, the architecture can be pipelined to increase overall performance.
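As a worked example, the delay and array size expressions can be evaluated for the 16-input adder. The sketch below only evaluates the two expressions stated above; it assumes N is a power of two and at least 4, and the function name `folded_adder_cost` is hypothetical.

```python
def folded_adder_cost(n):
    """Evaluate the stated delay and array size for N inputs.

    delay = log2(N) + 1
    cells = N/4 * (2*log2(N) + 1)
    Assumes N is a power of two and N >= 4.
    """
    if n < 4 or n & (n - 1):
        raise ValueError("N must be a power of two and at least 4")
    log_n = n.bit_length() - 1  # exact log2 for powers of two
    delay = log_n + 1
    cells = (n // 4) * (2 * log_n + 1)
    return delay, cells


delay_16, cells_16 = folded_adder_cost(16)
```

For N=16 this gives a delay of 5 and an array size of 4 × 9 = 36 cells. The factor 9 appears to coincide with the nine odd-numbered columns CE1, CE3, ... CE17 that hold storage or adder circuits.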

Thus, the memristor-based crossbar 110 comprises a CIM parallel adder circuit 200. The CIM parallel adder circuit 200 comprises in the crossbar a columnar structure CE1 ... CE17 of storage location columns CS comprising data circuits for holding data and arithmetic instruction circuit columns CA that are arranged in alternation.

In between a storage location column and an adjacent arithmetic instruction column a connecting column CE2; CE4; CE6; CE8; CE10; CE12; CE14; CE16 is provided that comprises connecting circuits to connect one or more data circuits Li; Ls in a storage location column CS with an arithmetic instruction circuit, i.e., an adder circuit ADD, in the adjacent arithmetic instruction column CA.
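The alternation of storage, connecting and arithmetic columns in the 17-column example can be summarised by a small lookup. This is a sketch that assumes the pattern described for Figure 1e (storage in CE1, adders in CE3, storage in CE5, and so on, with connecting circuits in every even-numbered column); the function name `column_role` and the role labels are hypothetical.

```python
def column_role(index):
    """Return the role of column CE<index> in the 17-column parallel adder."""
    if not 1 <= index <= 17:
        raise ValueError("the example layout has columns CE1 ... CE17")
    if index % 2 == 0:
        return "connection"  # CE2, CE4, ... CE16
    # Odd columns alternate between storage (CS) and arithmetic (CA).
    return "storage" if (index // 2) % 2 == 0 else "adder"


layout = [column_role(i) for i in range(1, 18)]
```

Under this assumption the layout contains five storage columns (CE1, CE5, CE9, CE13, CE17), four adder columns (CE3, CE7, CE11, CE15) and eight connecting columns.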

In a first column CE1 a first half of the N numbers Lil ... Li8 to be added is stored in respective data circuits for holding the number value.

In an adjacent second column CE3 the adder circuits ADD for adding two numbers are arranged. On the opposite side (relative to the first column CE1) of the second column CE3 containing the adder circuits, in a further third column CE5, the individual sums Ls1 ... Ls4 are stored. Note that in the further third column CE5, the storage locations of the individual sums are shifted in the upward column direction to avoid empty storage locations (the additions reduce the data by half). By the shift the lower part of the column CE5 is thus free for other data.

It is noted that in between columns CE1 and CE3, and CE3 and CE5, intermediate columns CE2, CE4 are present that handle the transfer of signals between data circuits in one column and arithmetic circuits in the adjacent column. The sequence of columns containing data circuits and columns containing arithmetic instruction circuits is repeated in the crossbar until a single pair of summable numbers Ls11 ... Ls12 is reached for the first half of the N numbers to be added.

For the other half of the N numbers to be added, columns of data circuits next to arithmetic instruction circuits are arranged starting from the last column CE17, in the opposite direction. In the CIM parallel adder circuit 200 of Figure 1e, in column CE17 the second half of the N numbers Li9 ... Li16 to be added is stored in respective data circuits for holding the number value. In an adjacent column CE15 in the direction towards column CE1 the adder circuits ADD for adding two numbers are arranged. The sum results Ls5 ... Ls8 are stored in column CE13. In adjacent column CE11 adder circuits ADD are arranged which add the intermediate results Ls5, Ls6 and Ls7, Ls8 to sum results Ls13 ... Ls14. It is noted that by transfer of some addition results to storage locations unused in another column, the binary tree scheme can be "folded", which results in a more compact circuit occupying relatively less area of the crossbar, as will be appreciated by the person skilled in the art, since the computation in the adder circuit can be done in both row directions. The sum Ls21 of Ls11, Ls12 is thus stored in column CE13, row RE1; the sum Ls22 of Ls13, Ls14 is in column CE5, row RE5. By copying Ls21 and Ls22 in this example to column CE13, rows RE3 (Ls21_copy) and RE4 (Ls22_copy), respectively, the final result Lst is calculated by the adder circuit ADD in column CE11 and stored in column CE9, row RE3.

The person skilled in the art will appreciate that alternative "folding" options may be available to optimize the CIM parallel adder circuit 200. The circuit of Figure 1e is only intended as an exemplary embodiment, without limitation of the scope of the invention as claimed in the claims.
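The data flow of the two-sided binary tree addition described above may, purely as an illustration, be modelled in software as in the following Python sketch. It models only the arithmetic order of the additions, not the memristor circuits, the connecting columns or the copy operations. The function names `pairwise_sums`, `reduce_half` and `folded_tree_add` are hypothetical, and the restriction to a power-of-two number of inputs is an assumption of the example.

```python
def pairwise_sums(values):
    """Model one adder column: add neighbouring pairs, halving the data."""
    return [values[i] + values[i + 1] for i in range(0, len(values), 2)]


def reduce_half(values):
    """Apply adder columns until one value remains.

    Returns the sum and the number of adder columns that were used.
    """
    stages = 0
    while len(values) > 1:
        values = pairwise_sums(values)
        stages += 1
    return values[0], stages


def folded_tree_add(numbers):
    """Add N numbers by reducing both halves independently, then combining.

    Assumption: N is a power of two and at least 4 (illustrative restriction).
    """
    n = len(numbers)
    if n < 4 or n & (n - 1):
        raise ValueError("number of inputs must be a power of two, at least 4")
    first_half = numbers[: n // 2]    # e.g. Li1 ... Li8, entered at column CE1
    second_half = numbers[n // 2 :]   # e.g. Li9 ... Li16, entered at column CE17
    ls21, stages_a = reduce_half(first_half)
    ls22, stages_b = reduce_half(second_half)
    lst = ls21 + ls22                 # final addition giving the result Lst
    return lst, max(stages_a, stages_b) + 1


if __name__ == "__main__":
    total, adder_stages = folded_tree_add(list(range(1, 17)))
    print(total, adder_stages)
```

In this model the two halves can be reduced at the same time because they do not share data until the final addition, which corresponds to the opposite flow directions of the folded arrangement.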

Figure 2 shows a crossbar-based circuit 300 for matrix multiplication in accordance with an embodiment of the invention.

For this embodiment, the circuit for matrix multiplication relates to a multiplication of N-by-N matrices, for example an 8-by-8 matrix C calculated from matrix product AB. The circuit 300 shown is arranged to calculate one matrix element Cij of the multiplication. To calculate all matrix elements the circuit is simply repeated in the crossbar structure 110.

To obtain the highest computational efficiency in the crossbar structure, multiplications of matrix elements and additions of intermediate results should be done in parallel where possible. For that reason, identical arithmetic instruction circuits in the CIM architecture 100 should be in the same column or the same row of the crossbar (and of the respective circuit).

In the embodiment, storage locations or data circuits A1, B1 for holding data (input matrix elements Aik or Bkj) are arranged in a first column CL1 in a first data entry row R1 and in parallel secondary data circuits A2, B2 in a second data entry row R5 which is separated from the first row R1 by three intermediate rows R2, R3, R4.

In the first data row R1, next to the input data circuits A1, B1, a multiplier circuit M1 is arranged, configured for multiplication of the contents of data circuits A1, B1. Next to the multiplier circuit M1 in the first row, on the side opposite to the input data circuits A1, B1, a storage location or output data circuit C1 is arranged for holding the result of the multiplication of A1 and B1.

In the first row R1 a same series of third input data circuits A3, B3, third multiplier circuit M3 and third output data circuit C3 is arranged in mirrored position, in such a manner that the third output data circuit C3 is located next to the first output data circuit C1.

In the second data entry row R5 a same arrangement is made with second input data circuits A2, B2, a second multiplier circuit M2 and a second output data circuit C2, and in mirrored position fourth input data circuits A4, B4, a fourth multiplier circuit M4 and a fourth output data circuit C4.

In the intermediate row R2 adjacent the first data entry row R1 a first adder circuit ADD1 is arranged in column CL3 below the first and third output data circuits for calculating the addition result C13 of the content of C1 and C3.

In the intermediate row R4 adjacent the second data entry row R5 a second adder circuit ADD2 is arranged below the first and third output data circuits for calculating the addition result C24 of the content of C2 and C4.

In the central intermediate row R3, a storage location or output data circuit D1 is arranged in the same column between the first and second adder circuits ADD1 and ADD2 for holding the addition results C13 and C24, respectively. In the central row R3 an adder ADD3 is arranged for addition of C13 and C24.

The above described multiplier and adder structure A1, B1, C1, A2, B2, C2, A3, B3, C3, A4, B4, C4, M1, M2, M3, M4, ADD1, ADD2, ADD3, C13, C24, is repeated as a same structure A5, B5, C5, A6, B6, C6, A7, B7, C7, A8, B8, C8, M5, M6, M7, M8, ADD5, ADD6, ADD7, C57, C68 in the row direction RD in an adjacent area in the crossbar array, with an intermediate column CL6 that is provided with storage locations for holding the intermediate results D1, D2 of the two neighbouring circuit structures and provided with further adder circuits ADD8 for calculating the addition result E1 of the content of D1 and D2. The final result E1 for one matrix element Cij is then to be stored in a result output data circuit E1 arranged in the intermediate column CL6 in the second data entry row R5 where the result can be fetched by an operation from the controller 140.
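As a non-limiting illustration, the sequence of multiplications and additions carried out by the circuit 300 for one matrix element of an 8-by-8 product may be modelled as in the following Python sketch. The variable names mirror the reference signs used above (C1 ... C8, C13, C24, C57, C68, D1, D2, E1), whereas the function name `matrix_element` and its arguments are hypothetical. The sketch models only the order of the arithmetic operations and not the memristor-level implementation.

```python
def matrix_element(a_row, b_col):
    """Compute one element Cij of an 8-by-8 matrix product in circuit order.

    a_row holds Ai1 ... Ai8 and b_col holds B1j ... B8j (illustrative names).
    """
    if len(a_row) != 8 or len(b_col) != 8:
        raise ValueError("this sketch models the 8-by-8 example only")

    # Multiplier circuits M1 ... M8 can operate in parallel and fill the
    # output data circuits C1 ... C8.
    c = [a * b for a, b in zip(a_row, b_col)]

    # First structure: ADD1 and ADD2 produce C13 and C24, ADD3 produces D1.
    c13 = c[0] + c[2]
    c24 = c[1] + c[3]
    d1 = c13 + c24

    # Repeated structure: same additions on C5 ... C8, producing D2.
    c57 = c[4] + c[6]
    c68 = c[5] + c[7]
    d2 = c57 + c68

    # Further adder ADD8 combines D1 and D2 into the final result E1.
    e1 = d1 + d2
    return e1


if __name__ == "__main__":
    print(matrix_element([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1, 1, 1, 1, 1]))
```

The additions that produce C13, C24, C57 and C68 are independent of each other, as are those producing D1 and D2, which is why the description places the corresponding adder circuits so that they can operate simultaneously.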

Figure 3 shows a schematic layout of a memristor based crossbar 110 arranged with multiplication and addition circuits 300 for the calculation of each matrix element. The multiplication and addition circuits 300 in an embodiment relate to the circuit as shown and described with reference to Figure 2.

The multiplication and addition circuits 300 are arranged in a row and column structure X, Y in which each multiplication and addition circuit is arranged to calculate one matrix element Cij of the matrix product AB (see the n-by-n matrix C in insert 3a). The multiplication and addition circuits are aligned in both column and row directions to achieve parallelism in the computation of each matrix element Cij.
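The replication of the circuit over the row and column structure may, by way of example only, be modelled in Python as follows. Each position (i, j) of the result is computed independently of all others, which corresponds to one multiplication and addition circuit 300 per matrix element. The function name `matmul_by_elements` is hypothetical, and the sketch uses a plain summation per element rather than the specific adder arrangement of Figure 2.

```python
def matmul_by_elements(a, b):
    """Compute C = AB for n-by-n matrices, one independent element at a time."""
    n = len(a)
    if any(len(row) != n for row in a) or len(b) != n or any(len(row) != n for row in b):
        raise ValueError("both matrices must be n-by-n")

    def element(i, j):
        # One circuit per (i, j): multiply row i of A with column j of B.
        return sum(a[i][k] * b[k][j] for k in range(n))

    # No element depends on another, so all of them could be evaluated at once.
    return [[element(i, j) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(matmul_by_elements([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```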

The invention has been described with reference to the preferred embodiment. Obvious modifications and alterations will occur to others upon reading and understanding the preceding detailed description. It is intended that the invention be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims.

Claims

1. A method for data processing based on arithmetic operations in a memristor-based crossbar array, the crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states;

the method comprising:

- arranging arithmetic instructions as instruction circuits in the memristor-based crossbar array on predetermined instructions locations of the array in accordance with the computation circuit template related to the respective arithmetic instruction by programming the memristors of the memristor based crossbar array at the predetermined instruction locations,

wherein data acting as input for a first arithmetic instruction is arranged as data circuit in the memristor-based crossbar array connected to an input location of the associated first arithmetic instruction circuit in the memristor-based crossbar array, and data acting as output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit;

2. The method according to claim 1, comprising:

- arranging a plurality of data acting as input for the first arithmetic instruction as a plurality of input data circuits in an input data column of the memristor based crossbar array;

- arranging said first arithmetic instruction as a plurality of first arithmetic instruction circuits in a first instructions column of the memristor based crossbar array, intermediate the input column and the output column.

3. The method according to claim 2, wherein the plurality of data acting as output are arranged as a plurality of second input data circuits in the memristor-based crossbar array for a second arithmetic instruction, the method comprising:

4. The method according to any one of the preceding claims comprising for each connection between either an output of a data circuit and an input of an instruction circuit or an input of a data circuit and an output of an instruction circuit, a creation of a connection circuit in the memristor based crossbar array between said data circuit and said instruction circuit.

5. The method according to any one of the claims 1 - 3, wherein the arithmetic instructions are selected from operations comprising addition, multiplication, subtraction and division.

6. The method according to claim 5, wherein the arithmetic instructions additionally are selected from operations comprising logic functions.

7. The method according to any one of the preceding claims,

wherein the data processing is a parallel addition of a plurality of N numbers using a binary tree addition algorithm, N being even.

8. The method according to claim 7, comprising:

- creating in the array N+1 columns of N/2 rows, the outer columns 1, N+1 and the central column N/2+1 arranged as data columns with data circuits for receiving external data, in which each data circuit in a data column is connected through a connection circuit in an adjacent column to an adder circuit in a subsequent arithmetic column and each adder circuit through a further connection circuit is connected to an output data circuit in a next data column, such that in the central column N/2 + 1 the result of the addition of the N numbers is stored.

9. The method according to claim 8, wherein a data circuit in a data column of the memristor based crossbar array is either an output data circuit or an input data circuit for an addition or a combined data circuit configured to receive a result from an addition and to transmit said result as input data for a next addition.

10. The method according to claim 8 or claim 9, wherein additions in each adder column are performed in parallel.

11. The method according to any one of claims 8 - 10, wherein in each connection column data are transferred in parallel.

12. The method according to any one of the preceding claims,

wherein the data processing is a matrix multiplication algorithm for an N-by-N matrix.

13. The method according to claim 12, wherein a first arithmetic instruction is a multiplication for a pair of matrix elements embodied as a multiplier circuit in the memristor-based crossbar array and a second arithmetic instruction is an addition for a pair of multiplication results embodied as an adder circuit in the memristor-based crossbar array,

the method comprising:

- arranging in the memristor based crossbar array in at least one row:

- two multiplier circuits each in between an input data circuit for holding a pair of matrix elements and an output data circuit for holding the result of the multiplication, wherein the output data circuits are located adjacent to each other in between the multiplier circuits, in a manner that within the at least one row the multiplier circuits can operate simultaneously, and

- arranging in the memristor based crossbar array in an adjacent row:

- arranging in the memristor based crossbar array in a further row a sum output storage circuit arranged for receiving result data from the adder circuit, such that the adder circuit is between the row with multiplier circuits and the row with the sum output storage circuit.

14. The method according to claim 13, comprising:

- arranging in the memristor based crossbar array in a second row:

- two further multiplier circuits each in between an input data circuit for holding a pair of matrix elements and an output data circuit for holding the result of the multiplication, wherein the output data circuits are located adjacent to each other in between the further multiplier circuits, in a manner that within the second row the further multiplier circuits can operate simultaneously, and

- arranging in the memristor based crossbar array in an adjacent one of the second row:

- next to the output data circuits that are adjacent to each other, a further adder circuit connected to said output data circuits and arranged for receiving data from the respective output data circuit and for adding the received data, and

- arranging in the memristor based crossbar array in a further row next to the adjacent one a second sum output storage circuit arranged for receiving result data from the further adder circuit, such that the further adder circuit is between the second row with the further multiplier circuits and the further row with the second sum output storage circuit, and wherein

the second sum output storage circuit is arranged adjacent to the sum output storage circuit.

15. The method according to claim 14, further comprising:

- arranging in the memristor based crossbar array in a row direction adjacent to the sum output storage circuit and the second sum output storage circuit, a third adder circuit that is configured for adding data from the sum output storage circuit and data from the second sum output storage circuit, and adjacent in the same row as the third adder a third sum output storage circuit.

16. A device for data processing based on arithmetic operations, comprising a memristor-based crossbar array and a controller, the memristor based crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states,

and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines connects to a corresponding one of the parallel first bars and each of the second electrical connection lines connects to a corresponding one of the parallel second bars, and the controller is arranged to perform control operations on the memristors of the memristor based crossbar array over the connection interface in accordance with the method according to any one of the claims 1 - 15.

17. The device according to claim 16, wherein the controller is arranged to execute the first arithmetic instructions of the plurality of first arithmetic instruction circuits in the first instructions column of the memristor based crossbar array, in parallel.

18. The device according to claim 17, wherein the controller is arranged to execute the second arithmetic instructions of the plurality of second arithmetic instruction circuits in the second instruction column or row of the memristor based crossbar array, in parallel.

19. A computer architecture for computation-in-memory applications related to data processing based on arithmetic operations, comprising a memristor-based crossbar array and a controller, the memristor based crossbar array comprising a plurality of parallel first bars extending in a first direction and a second plurality of parallel second bars extending in a second direction different from the first direction, such that each first bar crosses the second plurality of second bars and at each crossing forms a contact, each contact forming a memristor with at least two different programmable resistive states,

and the controller being arranged with a connection interface with first electrical connection lines and second electrical connection lines, wherein each of the first electrical connection lines connects to a corresponding one of the parallel first bars and each of the second electrical connection lines connects to a corresponding one of the parallel second bars, and the controller is arranged to perform control operations on the memristors of the crossbar array over the connection interface; in which architecture data circuits and instruction circuits are arranged in the memristor based crossbar array based on data circuit templates for holding data and on computation circuit templates for holding a selected arithmetic instruction from the arithmetic operations, respectively;

the data circuits being arranged on predetermined data locations of the memristor based crossbar array;

the instruction circuits being arranged on predetermined instructions locations of the memristor based crossbar array,

wherein data acting as input for a first arithmetic instruction is arranged as data circuit in the memristor based crossbar array connected to an input location of the associated first arithmetic instruction circuit in the memristor based crossbar array, and data acting as output of the first arithmetic instruction is connected to an output location of said first arithmetic instruction circuit,

data circuits for data to be computed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array, and

instruction circuits for instructions to be executed in a parallel processing step are arranged along either a column direction or a row direction in the crossbar array.

20. The computer architecture according to claim 19 for computation-in-memory which interweaves both storage for memory functions and logic or arithmetic operations, such as parallel adders and matrix multiplication, in the memristor based crossbar array.

21. The computer architecture according to claim 19 for computation-in-memory in which a portion of the memristor based crossbar array is configured at the memristor level to create circuitry for storage functions and another portion of the memristor based crossbar array is configured at the memristor level to create circuitry for logic or arithmetic operations.