CN113593622A - Memory computing device and computing device - Google Patents

Memory computing device and computing device

Info

Publication number
CN113593622A
CN113593622A (application CN202110913509.5A)
Authority
CN
China
Prior art keywords
bit line
data
memory
word line
transistor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110913509.5A
Other languages
Chinese (zh)
Other versions
CN113593622B (en)
Inventor
刘家隆
唐文骏
李学清
杨华中
刘勇攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Publication of CN113593622A
Application granted
Publication of CN113593622B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G11C13/0021 Auxiliary circuits
    • G11C13/004 Reading or sensing circuits or methods
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C5/00 Details of stores covered by group G11C11/00
    • G11C5/14 Power supply arrangements, e.g. power down, chip selection or deselection, layout of wirings or power grids, or multiple supply levels
    • G11C5/147 Voltage reference generators, voltage or current regulators; Internally lowered supply levels; Compensation for voltage drops
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/18 Bit line organisation; Bit line lay-out
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C8/00 Arrangements for selecting an address in a digital store
    • G11C8/14 Word line organisation; Word line lay-out
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Memory System (AREA)

Abstract

The present disclosure relates to an in-memory computing device and an arithmetic device. The device includes: a computing array comprising a plurality of memory computing modules, each comprising a first word line, a second word line, a third word line, a first bit line, a second bit line, a third bit line, a fourth bit line, a first storage unit, and a second storage unit; and a control module, coupled to the computing array, configured to control the voltage states of the first word line and the third word line so that the memory computing module operates in any one of a write mode, a read mode, and a hold mode. The memory computing device provided by the embodiments of the present disclosure features low circuit complexity, low power consumption, high accuracy, and high operation speed.

Description

Memory computing device and computing device
Technical Field
The present disclosure relates to the field of integrated circuit technologies, and in particular, to an in-memory computing device and an arithmetic device.
Background
Today, with the worldwide explosive growth of data volume, existing computing systems face the severe challenge of the "memory wall". In a traditional computing system based on the von Neumann architecture, the computing unit and the storage unit are physically separated, and frequent data transfer between the two causes serious losses in system power consumption and speed. Compute-in-memory is an important way to overcome the memory-wall bottleneck, and many research results on in-memory computing circuits, systems, and architectures based on conventional or emerging devices already exist.
However, the storage-and-computation-integrated solutions in the related art suffer from problems such as high complexity, large computation overhead, and low accuracy.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided an in-memory computing device, the device comprising:
the computing array comprises a plurality of memory computing modules, wherein each memory computing module comprises a first word line, a second word line, a third word line, a first bit line, a second bit line, a third bit line, a fourth bit line, a first storage unit and a second storage unit, the first storage unit and the second storage unit are connected to the first word line, the second word line and the third word line, the first storage unit is further connected to the first bit line and the second bit line, and the second storage unit is further connected to the third bit line and the fourth bit line;
a control module, coupled to the computational array, to: controlling the voltage states of the first word line and the third word line to control the working mode of the memory computing module to be any one of a writing mode, a reading mode and a holding mode,
in the write mode, the control module writes data into the memory calculation module through the first bit line and the third bit line; in the read mode, the control module reads data from the memory computation module through the second bit line, the fourth bit line, and the second word line; in the hold mode, the control module is configured to hold a state of the memory computing module.
In one possible implementation, the first storage unit includes a first transistor, a second transistor, a first storage capacitor, and the second storage unit includes a third transistor, a fourth transistor, and a second storage capacitor, wherein,
first ends of the first transistor and the third transistor are connected to the first word line, a second end of the first transistor is connected to the first bit line, a second end of the third transistor is connected to the third bit line,
the third end of the first transistor, the first end of the second transistor and the first end of the first storage capacitor are connected to form a first node, the third end of the third transistor, the first end of the fourth transistor and the first end of the second storage capacitor are connected to form a second node,
third ends of the second transistor and the fourth transistor are both connected to the second word line, a second end of the second transistor is connected to the second bit line, and a second end of the fourth transistor is connected to the fourth bit line,
second ends of the first storage capacitor and the second storage capacitor are connected to the third word line.
In one possible implementation, the writing of data to the memory computation module by the control module through the first bit line and the third bit line includes:
the control module determines a target storage unit from the first storage unit and the second storage unit according to the value of the data to be written;
and configuring the voltages of the first bit line and the third bit line according to the value of the data to be written so as to write the data to be written into the target storage unit.
In one possible embodiment, in the write mode, the control module is further configured to:
if the value of the data to be written is positive, writing the opposite number (i.e., the negation) of the data to be written into the target storage unit; or
if the value of the data to be written is negative, writing the data to be written directly into the target storage unit,
wherein the target memory cell corresponding to positive write data is different from the target memory cell corresponding to negative write data.
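The sign-dependent write scheme above can be sketched behaviorally. This is a minimal illustration, not the patent's circuit: the cell ordering, the reference level, and the convention of which cell receives positive versus negative data are assumptions; the text only specifies that the two signs target different cells.

```python
def encode_write(value):
    """Map signed write data onto the two storage cells (behavioral sketch).

    Returns (cell1_node_voltage, cell2_node_voltage). Positive data is
    stored as its opposite number, so the written node voltage is never
    positive; negative data is written directly. Which physical cell plays
    which role is an illustrative assumption.
    """
    V_REF = 0.0  # reference level held by the non-target cell (assumed)
    if value >= 0:
        # positive data: write the opposite number into cell 1
        return (-value, V_REF)
    # negative data: write it directly into cell 2
    return (V_REF, value)
```

Either way, the data-carrying node ends up at a non-positive voltage, consistent with the negative node-voltage condition described in the text.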
In one possible implementation, when one of the first memory cell and the second memory cell is used for writing data, the other is used for storing reference data; the node voltage of the memory cell used for writing data is a negative voltage, and each second transistor in the first memory cell and the second memory cell is in an off state.
In one possible embodiment, the control module is configured to:
configuring the first word line to a high voltage and the third word line to a low voltage to control the memory computing module to operate in the write mode;
configuring the first word line to be at a low voltage and the third word line to be at a high voltage to control the memory computing module to operate in a read mode;
and configuring the first word line and the third word line to be low voltage so as to control the memory computing module to work in a holding mode, wherein the node state of each memory unit is unchanged in the holding mode.
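The word-line combinations above form a small truth table, sketched below as a decoder. The function and signal names are illustrative assumptions; the mode mapping itself is as stated in the text.

```python
def select_mode(wl1_high: bool, wl3_high: bool) -> str:
    """Decode the operating mode from the first and third word-line levels:
    write = WL1 high / WL3 low, read = WL1 low / WL3 high, hold = both low.
    Both lines high is not a combination described in the text."""
    if wl1_high and not wl3_high:
        return "write"
    if wl3_high and not wl1_high:
        return "read"
    if not wl1_high and not wl3_high:
        return "hold"
    raise ValueError("WL1 and WL3 both high: undefined in the described scheme")
```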
In one possible embodiment, the control module is further configured to:
configuring a voltage of the second word line according to input data;
the control module reads data from the memory computation module through the second bit line, the fourth bit line, and the second word line, including:
reading currents of the second bit line and the fourth bit line;
and obtaining the product of the input data and the data stored by the memory computing module according to the current difference value of the second bit line and the fourth bit line.
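A behavioral sketch of this differential read follows. The assumptions are loud: the stored value w is split as magnitude-plus-reference across the two cells, and each read branch is modeled as conducting a current proportional to the input times its stored magnitude with transconductance g; the real circuit's transfer characteristic is more complex.

```python
def read_product(x, w, g=1.0):
    """Recover the signed product x * w from the branch-current difference
    of the second and fourth bit lines (toy linear branch model)."""
    # differential storage: one cell holds the magnitude, the other the reference
    mag1, mag2 = (w, 0.0) if w >= 0 else (0.0, -w)
    i_bl2 = g * x * mag1  # current read on the second bit line
    i_bl4 = g * x * mag2  # current read on the fourth bit line
    return i_bl2 - i_bl4  # equals g * x * w
```

The branch-current difference thus recovers the signed product g·x·w, which is the quantity the control module extracts.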
In a possible implementation, the computation array includes M rows and N columns, where M, N are integers greater than 0, and the first word line, the second word line, and the third word line of the memory computation modules in the same row in the computation array are respectively connected, and/or the first bit line, the second bit line, the third bit line, and the fourth bit line of the memory computation modules in the same column in the computation array are respectively connected.
In a possible implementation, each column of the memory computation module in the device comprises a plurality of readout paths, the control module is used for reading out a plurality of data from each readout path and summing the plurality of read data,
and for each row which does not need to read data, the control module is used for controlling each row of the memory computing module to work in a holding mode.
In one possible embodiment, the control module is further configured to:
controlling a first number of memory computing modules in the computing array to work in a write mode, controlling a second number of memory computing modules in the computing array to work in a read mode, and controlling the remaining memory computing modules in the computing array except the first number and the second number to work in a hold mode.
In one possible embodiment, the leakage current of the first transistor is smaller than a preset value.
According to an aspect of the present disclosure, there is provided a computing device comprising a first component and a second component, wherein,
the first assembly includes:
the sensor array is used for acquiring data to obtain a plurality of input data;
the memory computing device is used for receiving the input data, receiving a plurality of data to be written from the second component and obtaining the product sum of each input data and each data to be written;
and the second component is used for outputting the data to be written and processing the product sum to obtain a processing result.
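Functionally, the first component delivers to the second component the sum of products of sensor inputs and written weights. A minimal numeric sketch of that contract (function and parameter names are illustrative):

```python
def product_sum(inputs, weights):
    """Sum of element-wise products of the sensor input data and the
    data written from the second component (the 'product sum')."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))
```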
The memory computing device provided by the embodiments of the present disclosure includes a plurality of memory computing modules and a control module. Each memory computing module includes a first word line, a second word line, a third word line, a first bit line, a second bit line, a third bit line, a fourth bit line, a first storage unit, and a second storage unit. The control module controls the voltage states of the first word line and the third word line so that the memory computing module operates in any one of a write mode, a read mode, and a hold mode. The device thus integrates storage and computation: when performing operations, especially large-scale operations, it avoids the high system power consumption and low operation speed caused by frequent data movement between the separate computing and storage units of a conventional architecture, and it features low circuit complexity, low power consumption, high accuracy, and high operation speed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic diagram of an approximate nonvolatile memory cell in the related art.
Fig. 2 is a diagram illustrating a related art implementation of in-memory matrix vector multiplication.
FIG. 3 shows a schematic diagram of an in-memory computing device, according to an embodiment of the present disclosure.
FIG. 4 shows a schematic diagram of an in-memory computing module according to an embodiment of the present disclosure.
FIG. 5 shows a schematic diagram of an arithmetic device according to an embodiment of the present disclosure.
Fig. 6a and 6b are schematic diagrams illustrating a linearity analysis of the memory computing device according to the embodiment of the disclosure.
Fig. 7 shows a schematic diagram of a read-write simulation waveform of a memory computing module according to an embodiment of the present disclosure.
Fig. 8a and 8b are schematic diagrams illustrating the retention time analysis of the in-memory computing module and the results of the monte carlo simulation according to the embodiment of the disclosure.
FIG. 9 is a diagram illustrating performance on a data set when an arithmetic device according to an embodiment of the present disclosure is applied to a first layer of computation.
Fig. 10a and 10b are schematic diagrams illustrating comparison results between the arithmetic device according to the embodiment of the present disclosure and a conventional digital processor.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
In the description of the present disclosure, it is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like, as used herein, refer to an orientation or positional relationship indicated in the drawings, which is solely for the purpose of facilitating the description and simplifying the description, and does not indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and, therefore, should not be taken as limiting the present disclosure.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
In the present disclosure, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integral; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meaning of the above terms in the present disclosure can be understood by those of ordinary skill in the art as appropriate.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
As described in the background, the storage-and-computation-integrated solutions in the related art suffer from high complexity, large computation overhead, low accuracy, and similar problems. For example, refer to fig. 1, which shows a schematic diagram of an approximate nonvolatile memory cell in the related art.
This approximate nonvolatile memory cell is also called a Nonvolatile Oxide Semiconductor Random Access Memory (NOSRAM). As shown in fig. 1, it has a 2T1C structure, i.e., it comprises two Transistors (T) and one Capacitor (C), so the cell is also referred to as a 2T1C gain cell. The two transistors in the cell are an Oxide Semiconductor Field-Effect Transistor (OS-FET) and a silicon-based Field-Effect Transistor (Si-FET), respectively.
As shown in fig. 1, the drain of the OS-FET is connected to the gate of the Si-FET and to one end of the capacitor, forming a Storage Node (SN). The bit line WBL, the bit line RBL, and the line SL run in the column direction (i.e., they are column lines), while the word lines WWL and RWL run in the row direction (i.e., they are row lines) and control the operating states of the two transistors and the capacitor in fig. 1 for reading and writing data.
In the cell of fig. 1, the OS-FET used for the write operation is stacked on the Si-FET used for the read operation, and the charge placed on the storage node SN through the OS-FET barely leaks. Because the OS-FET has an extremely low off-state current (i.e., leakage current), the data stored at the storage node in fig. 1 can be retained for a long time, reducing the data refresh frequency. Compared with a 1T1C structure, the capacitance value (i.e., Cs) in fig. 1 can also be smaller, so reading and writing are faster. However, the scheme of fig. 1 supports neither in-memory computation nor coordination with near-sensor applications.
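The long retention follows from the OS-FET's extremely low off-state current. A back-of-envelope model (the constant-leakage assumption and all numeric values are illustrative, not from the patent): the hold time is roughly the storage capacitance times the tolerable voltage droop, divided by the leakage current.

```python
def retention_time(c_storage_f, delta_v, i_off_a):
    """Rough hold-time estimate t = C * dV / I_off (seconds), assuming the
    off-state leakage current stays constant during discharge."""
    return c_storage_f * delta_v / i_off_a

# e.g. a 1 fF node, 100 mV tolerated droop, 1e-21 A leakage
# gives on the order of 1e5 seconds of retention
t_hold = retention_time(1e-15, 0.1, 1e-21)
```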
Fig. 2 is a diagram illustrating a related art implementation of in-memory matrix vector multiplication.
As shown in fig. 2, in-memory matrix-vector multiplication can be implemented using an analog multiplier based on an OS-FET/Si-FET hybrid structure to emulate the storage of weights in a neural network. The analog multiplier is substantially the same as the 2T1C gain cell in fig. 1. Note that the Si-FET of the analog multiplier is grounded in advance, and the OS-FET in fig. 2 is an Indium-Gallium-Zinc Oxide (IGZO) OS-FET.
As shown in fig. 2, the structure of the analog multiplier is indicated by the dashed circle on the left. In fig. 2, the same analog-multiplier structure can serve as both the memory cell and the reference cell. The analog multiplier performs a write operation by letting the OS-FET control the charge written to the capacitor (i.e., node FN), and it supports a write mode, a hold mode, and a read mode. As in fig. 1, because the off-state current of the OS-FET is extremely small, the charge on node FN can be held for a long time. To ease understanding of the embodiments of the present disclosure that follow, the operating principle of the analog multiplier in fig. 2 is briefly described here.
As shown in fig. 2, in the write mode, WW is biased at a high voltage and VX is biased at 0; a voltage W0 + w is applied to BW. The OS-FET is then in a conducting state, and the voltage W0 + w is written to node FN. This operation may also be referred to simply as "writing w".
As shown in fig. 2, in the hold mode, WW is biased at a low voltage and VX is biased at 0. The OS-FET is then in the off state, the voltage W0 + w on node FN is held, and the hold time of the charge at this node can be changed by adjusting the off-state current of the OS-FET.
As shown in fig. 2, in the read mode, WW is biased at a low voltage, and a voltage X0 + x is applied to VX to provide the read current for the analog multiplier. The drain current (i.e., read current) of the Si-FET is then I(w, x), and the Si-FET operates in the saturation region. Within a certain approximation range, I(w, x) = (β/2)(W0 + w + X0 + x − Vth)^2, where β is the product of the saturation mobility, the channel width-to-length ratio, and the gate capacitance per unit area, and Vth is the threshold voltage of the Si-FET. Further, the drain current of the Si-FET also satisfies the following equation:
y = I(w, x) − I(0, x) − I(w, 0) + I(0, 0) = βwx.
That is, y can be calculated from the read currents under four conditions: the written weight being 0 or w, and the input being 0 or x. Adding and subtracting the read currents under these four conditions yields the product of the weight w and the input x. Since multiple analog multipliers can be arranged in a matrix (row-column) fashion, they can together implement matrix-vector multiplication.
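The four-condition identity can be checked numerically. The sketch below uses the saturation-region model quoted above; the values chosen for β, W0, X0, and Vth are illustrative assumptions, not parameters from the text.

```python
def drain_current(w, x, beta=1.0, w0=0.5, x0=0.5, vth=0.4):
    """Saturation-region read current: I(w, x) = (beta/2)(W0 + w + X0 + x - Vth)^2."""
    return 0.5 * beta * (w0 + w + x0 + x - vth) ** 2

def multiply(w, x, **params):
    """y = I(w,x) - I(0,x) - I(w,0) + I(0,0); all quadratic and linear terms
    cancel, leaving beta * w * x."""
    return (drain_current(w, x, **params) - drain_current(0.0, x, **params)
            - drain_current(w, 0.0, **params) + drain_current(0.0, 0.0, **params))
```

Expanding the squares shows that every term except the cross term 2wx cancels, so within this model the result is exact and independent of W0, X0, and Vth.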
As shown in fig. 2, in addition to the memory cell and the reference cell, a bias unit must be configured to control or respond to them. The bias unit includes a current source, a current mirror (left half of fig. 2), a current sink, and analog switches. The current source consists of two OS-FETs, a p-channel Si-FET, and a capacitor; the current sink consists of two OS-FETs, an n-channel Si-FET, and a capacitor. The p-channel Si-FET supplies a constant current, and the n-channel Si-FET sinks current.
In fig. 2, six row-direction control lines (near the analog switches in fig. 2, not labeled) are also provided to control the operating mode of the entire bias unit. For example, when the bias unit operates in the read mode, the read current of the memory cell is I(w, x), the read current of the reference cell is I(0, x), the current source provides the current I(w, 0), and the current sink provides the current I(0, 0). From these four currents, the value of y can be calculated, realizing the product of the weights and the inputs of the neural network.
Therefore, the related-art scheme for matrix multiplication is structurally very complex: one memory cell must be paired with three additional auxiliary units plus a large amount of intermediate wiring and control signals, which makes control complicated, adds energy cost, and lowers area utilization. In addition, the data stream of a conventional sensing-plus-digital-processor pipeline requires a large amount of analog-to-digital conversion and buffering, which consumes considerable energy. Performing data preprocessing in the analog domain immediately after the data is read from the sensors can effectively avoid this overhead and improve energy efficiency. To achieve this goal, the sensing and computing data paths should be matched to each other to improve data utilization. However, the conventional progressive-scan method reads out an entire row of data at a time, whereas two-dimensional convolution processes one block of data (one convolution kernel) at a time, which requires a large amount of data buffering and complex data scheduling. Since two-dimensional convolution underlies most image-processing algorithms, existing near-sensor in-memory computing functionality remains very limited.
The memory computing device provided by the embodiments of the present disclosure includes a plurality of memory computing modules and a control module. Each memory computing module includes a first word line, a second word line, a third word line, a first bit line, a second bit line, a third bit line, a fourth bit line, a first storage unit, and a second storage unit. The control module controls the voltage states of the first word line and the third word line so that the memory computing module operates in any one of a write mode, a read mode, and a hold mode. The device therefore features low circuit complexity, low power consumption, high accuracy, and high operation speed.
The in-memory computing apparatus of the present disclosure may be disposed in a terminal device, a server, or another type of electronic device, where the terminal device may be User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. The apparatus can be applied in many fields, such as image processing and big data, and is applicable wherever multi-valued inputs and multiple stored values need to be multiplied, i.e., wherever matrix multiplication must be computed.
Referring to fig. 3, fig. 3 is a schematic diagram of a memory computing device according to an embodiment of the disclosure.
As shown in fig. 3, the apparatus includes:
a computing array 10 comprising a plurality of memory computing modules 110, wherein each memory computing module 110 comprises a first word line, a second word line, a third word line, a first bit line, a second bit line, a third bit line, a fourth bit line, a first memory cell 1110, and a second memory cell 1120; each of the first memory cell 1110 and the second memory cell 1120 is connected to the first word line, the second word line, and the third word line; the first memory cell 1110 is further connected to the first bit line and the second bit line; and the second memory cell 1120 is further connected to the third bit line and the fourth bit line;
a control module 20 connected to the computational array 10 for: controlling the voltage states of the first word line and the third word line to control the operating mode of the memory computing module 110 to be any one of a write mode, a read mode and a hold mode, wherein in the write mode, the control module 20 writes data into the memory computing module 110 through the first bit line and the third bit line; in the read mode, the control module 20 reads data from the memory computation module 110 through the second bit line, the fourth bit line, and the second word line; in the hold mode, the control module 20 is configured to hold the state of the memory computing module 110.
The control module 20 of the disclosed embodiments may include processing components, which in one example include, but are not limited to, a single processor, or discrete components, or a combination of a processor and discrete components. The processor may comprise a controller having functionality to execute instructions in an electronic device, which may be implemented in any suitable manner, e.g., by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components. Within the processor, the executable instructions may be executed by hardware circuits such as logic gates, switches, Application Specific Integrated Circuits (ASICs), programmable logic controllers, and embedded microcontrollers. In one example, the processing component may include a voltage generating unit to generate a corresponding voltage as required, and a voltage converting unit to convert the first voltage into the second voltage or convert the input data into a corresponding voltage or convert the voltage into data, and so on.
In a possible implementation manner, the computation array 10 includes M rows and N columns, where M and N are integers greater than 0. The embodiments of the present disclosure do not limit the size of the computation array 10; those skilled in the art can set it as needed, which improves the flexibility and adaptability of the memory computing device of the embodiments of the present disclosure.
In a possible implementation manner, the first word line, the second word line, and the third word line of the plurality of memory computing modules 110 in the same row in the computing array 10 are respectively connected to implement sharing of the first word line, the second word line, and the third word line, reduce the line overhead and the control overhead, and improve the resource utilization rate.
In a possible implementation manner, the first bit line, the second bit line, the third bit line, and the fourth bit line of the plurality of memory computing modules 110 in the same column in the computing array 10 are respectively connected to realize the sharing of the first bit line, the second bit line, the third bit line, and the fourth bit line, thereby reducing the line overhead and the control overhead, and improving the resource utilization rate.
In a possible implementation, each column of in-memory computation modules 110 in the apparatus includes a plurality of readout paths, and the control module 20 is configured to read data from each readout path and add the read-out values together to implement a matrix multiplication operation.
In one example, for each row that does not need to read data, the control module 20 is configured to keep that row of memory computing modules 110 operating in the hold mode, maintaining the state of the internal nodes of the rows that do not need to operate, so as to reduce system overhead.
In one possible embodiment, the control module 20 is further configured to:
controlling a first number of the memory computing modules 110 in the computing array 10 to operate in a write mode, controlling a second number of the memory computing modules 110 in the computing array 10 to operate in a read mode, and controlling the remaining memory computing modules 110 in the computing array 10 other than the first number and the second number to operate in a hold mode.
In the embodiment of the present disclosure, partitioning the computing array 10 into regions enables effective utilization of resources. For example, if there are multiple tasks, or one large task divided into multiple small tasks, and the total resources required by the tasks are less than the maximum the computing array 10 can provide, the tasks may run concurrently: the first number of in-memory computing modules 110 may write data and serve a storage function, while the second number of in-memory computing modules 110 may perform matrix multiplication on two operands (such as input data and weight data of a neural network). The embodiments of the present disclosure can thus fully exploit the characteristics of simple structure, low circuit complexity, low power consumption, high accuracy, and fast operation, and improve overall operating efficiency by running multiple tasks together.
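As a rough illustration of the region partitioning described above, the following Python sketch assigns each module in the array one of the three operating modes. The function and enum names (`partition_modes`, `Mode`) are illustrative assumptions, not terminology from the disclosure:

```python
# Hypothetical sketch: partitioning an array of in-memory compute modules
# into write, read, and hold regions so several tasks can run concurrently.
from enum import Enum

class Mode(Enum):
    WRITE = "write"
    READ = "read"
    HOLD = "hold"

def partition_modes(total, first_number, second_number):
    """Assign a mode to each of `total` modules: the first group writes,
    the second group reads, and all remaining modules hold their state."""
    if first_number + second_number > total:
        raise ValueError("requested regions exceed array capacity")
    modes = [Mode.WRITE] * first_number
    modes += [Mode.READ] * second_number
    modes += [Mode.HOLD] * (total - first_number - second_number)
    return modes

modes = partition_modes(total=16, first_number=4, second_number=8)
print(modes.count(Mode.WRITE), modes.count(Mode.READ), modes.count(Mode.HOLD))
# 4 8 4
```

The only constraint this sketch captures is the one stated above: the write and read regions together must not exceed the array's capacity, and everything else defaults to the hold mode.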
In the computation array 10 of the embodiment of the disclosure, the first storage unit 1110 and the second storage unit 1120 of each memory computation module 110 may have the same or similar structures, and the first storage unit 1110 and the second storage unit 1120 may have various implementation manners, which are described in the following exemplary embodiments.
Referring to fig. 4, fig. 4 is a schematic diagram of a memory computing module according to an embodiment of the disclosure.
In one possible implementation, as shown in fig. 4, the first memory cell 1110 may include a first transistor Q1, a second transistor Q2, and a first storage capacitor C1, and the second memory cell 1120 may include a third transistor Q3, a fourth transistor Q4, and a second storage capacitor C2, wherein,
the first terminals 1 of the first transistor Q1 and the third transistor Q3 are both connected to the first word line, the second terminal 2 of the first transistor Q1 is connected to the first bit line, and the second terminal 2 of the third transistor Q3 is connected to the third bit line,
the third terminal 3 of the first transistor Q1, the first terminal 1 of the second transistor Q2, and the first terminal 1 of the first storage capacitor C1 are connected to form a first node N1, while the third terminal 3 of the third transistor Q3, the first terminal 1 of the fourth transistor Q4, and the first terminal 1 of the second storage capacitor C2 are connected to form a second node N2,
the third terminals 3 of the second transistor Q2 and the fourth transistor Q4 are both connected to the second word line, the second terminal 2 of the second transistor Q2 is connected to the second bit line, the second terminal 2 of the fourth transistor Q4 is connected to the fourth bit line, and the second terminals 2 of the first storage capacitor C1 and the second storage capacitor C2 are both connected to the third word line.
In one example, as shown in fig. 4, the first terminals 1 of the first transistor Q1, the second transistor Q2, the third transistor Q3, and the fourth transistor Q4 may be the gates of the transistors. The embodiment of the present disclosure does not limit which of the second terminal 2 and the third terminal 3 of each transistor is the drain and which is the source; those skilled in the art may assign them as needed.
In a possible implementation manner, the first transistor Q1 of the embodiment of the present disclosure may be a transistor with a low-leakage characteristic, achieving an effect similar to a non-volatile memory so that data can be stored in the memory computing module 110 for a longer time; for example, a transistor whose leakage current is smaller than a preset value may be used. The embodiment of the present disclosure does not limit the preset value of the leakage current, as long as the transistor satisfies the low-leakage characteristic; for example, the preset value may be 10⁻²⁰ A or another value. Transistors satisfying the low-leakage characteristic may be of various types, which the embodiment of the present disclosure does not limit, and may include, for example, an oxide semiconductor field effect transistor (OS-FET) or a Thin Film Transistor (TFT). The second transistor Q2 may include, for example, a TFT-type transistor, or may comprise a Si-FET.
Thin film transistors offer higher performance when used for in-memory computing. First, thin film transistors are widely used in sensing arrays, so a storage-computation-integrated array can be combined with sensing units, merging sensing, storage, and computation and further reducing the overhead of moving data. Second, some existing work uses oxide thin film transistors to implement memories that approach non-volatile behavior, and even in-memory computing arrays. In addition, thin film transistors have unique advantages such as transparency, flexibility, and large-area integration, giving them a wider range of applications and opening a new direction for the further development of in-memory computing and intelligent sensing. A typical IGZO (Indium Gallium Zinc Oxide) TFT can reach a leakage of 10⁻¹⁶ A/m, and even lower after special processing. Low leakage is a very desirable property in memory applications: taking DRAM as an example, a conventional CMOS DRAM structure must be refreshed continuously because of its high leakage current, whereas low-leakage TFT technology can greatly reduce the DRAM refresh frequency, save energy, and widen the access time window. Beyond volatile memories, such low currents can bring a memory close to non-volatile behavior. Thin film transistors therefore have great potential in the field of integration.
Some schemes in the related art implement in-memory computation using TFTs; however, the related art has the following problems:
1. Current TFT memories and TFT in-memory computing are not well combined. On one hand, TFT memories are progressing quickly and have advantages in many respects, but that work does not consider in-memory computing; specifically, such TFT memories support neither multi-value storage and multi-value input nor an in-memory matrix-vector multiplication mode. On the other hand, existing TFT in-memory computing work requires considerable overhead. For example, three extra auxiliary cells must be configured alongside each memory cell, and the operation result can only be obtained through a very complicated current addition-subtraction relationship; or a floating-gate structure makes writing very complicated, causing huge extra overhead.
2. Current TFT memories face a trade-off between operating precision and circuit complexity: high-accuracy computation requires a very complex circuit design, while a simple circuit design yields a non-ideal matrix-vector multiplication relationship or supports only low-bit-width operations. A TFT in-memory computing implementation that supports accurate high-precision multiplication with a relatively simple structure is still lacking.
3. The data flows supported by current TFT near-sensing in-memory computing work are very limited, and the sensor and the processor do not cooperate well. For example, the core of the related art is matrix-vector multiplication, but near-sensing applications are not considered, and because the data flow from the sensing array to the processor is not specially designed, the related art can only handle a single algorithm. For sensing and readout, these works adopt a row-by-row scanning mode and are therefore poorly suited to two-dimensional filtering operations, which are quite important in image processing. A relatively general near-sensing in-memory computing scheme is still lacking.
The memory computing module 110 of the embodiment of the present disclosure may be implemented with TFTs. A TFT process enables large-area integration, and increasing the circuit area further extends the data retention time, reducing the refresh frequency during operation, saving energy, widening the access time window, and improving the completeness and accuracy of matrix operations.
The implementation of the control functions of the control module 20 is described below by way of example.
In one possible embodiment, the control module 20 may be configured to:
the first word line is configured to have a high voltage, and the third word line is configured to have a low voltage, so as to control the memory computing module 110 to operate in the write mode.
The embodiment of the present disclosure may implement storing data in the memory computing module 110 by configuring the first word line as a high voltage and the third word line as a low voltage to control the memory computing module 110 to operate in the write mode.
In one example, taking the case where the first transistor Q1, the second transistor Q2, the third transistor Q3, and the fourth transistor Q4 are all n-channel devices, at the beginning of a write the first word line may be biased at a high voltage (e.g., 10V) and the third word line configured at a low voltage (e.g., 0V, such as ground). That is, the control signal applied to the first word line turns on both the first transistor Q1 of the first memory cell 1110 and the third transistor Q3 of the second memory cell 1120. At this time, the first node N1 of the first memory cell 1110 is shorted to the first bit line and the second node N2 of the second memory cell 1120 is shorted to the third bit line, so the voltage on the first bit line is immediately applied to the first node N1 and the voltage on the third bit line is immediately applied to the second node N2.
For example, configuring the third word line at a low voltage provides a reference voltage to the first storage capacitor C1 and the second storage capacitor C2. For example, while the memory computing module 110 writes data, the second transistor Q2 of the first memory cell 1110 and the fourth transistor Q4 of the second memory cell 1120 may both be in the off (cut-off) state.
In a possible implementation manner, the writing data into the memory calculation module 110 by the control module 20 through the first bit line and the third bit line may include:
the control module 20 determines a target storage unit from the first storage unit 1110 and the second storage unit 1120 according to a value of data to be written;
and configuring the voltages of the first bit line and the third bit line according to the value of the data to be written so as to write the data to be written into the target storage unit.
In a possible implementation, in the write mode, the control module 20 may be further configured to:
if the value of the data to be written is positive, writing the negation of the data to be written into the target storage unit; or, if the value of the data to be written is negative, writing the data to be written directly into the target storage unit, where positive and negative data correspond to different target storage units.
In one possible implementation, when one of the first memory cell 1110 and the second memory cell 1120 is used for writing data and the other stores reference data, the voltage on the node of the cell being written is negative, and the second transistor Q2 of the first memory cell 1110 and the fourth transistor Q4 of the second memory cell 1120 are both in the off state.
In one example, embodiments of the present disclosure may select the target memory cell according to the sign of the data to be written, ensuring that the voltage written on the internal node of the target memory cell is always negative. For example, when the data to be written is negative (e.g., -0.8), it may be written into the first memory cell 1110; when the data to be written is positive (e.g., 0.8), its negation (i.e., -0.8) may be written into the second memory cell 1120. Note that when the data to be written is loaded onto the internal node of one memory cell of the memory computation module 110 through the first bit line or the third bit line, the other memory cell of the memory computation module 110 may serve as a reference cell. For example, whether the data to be written is positive (e.g., 0.8) or negative (e.g., -0.8), the reference cell stores 0, and arbitrary data is represented by the difference between the target memory cell and the reference cell.
The embodiment of the present disclosure determines the target memory cell according to the sign of the data to be written, ensuring that the voltage written on the internal node of the target memory cell is always negative. As a result, when the third word line is at a low voltage, the gate-source voltages of the second transistor Q2 of the first memory cell 1110 and the fourth transistor Q4 of the second memory cell 1120 are below the threshold voltage, so both transistors remain off, and the third word line does not need to be driven to a negative voltage to keep them off. Meanwhile, selecting the target storage unit by the sign of the data to be written also reduces the required input dynamic range.
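The sign-based write rule described above can be sketched as a small Python model. The function names (`encode_write`, `decode`) and the representation of node voltages as plain floats are illustrative assumptions; only the rule itself, including the 0-valued reference cell and the differential recovery of the stored value, comes from the description above:

```python
# Hedged sketch of the sign-based write rule: negative data is written
# directly to the first storage unit, positive data is negated and written
# to the second storage unit, and the other unit holds the 0-valued
# reference. The stored value is recovered as (first unit) - (second unit).
def encode_write(value):
    """Return (first_cell_voltage, second_cell_voltage); neither is positive."""
    if value <= 0:
        return value, 0.0       # negative data: write directly to unit 1
    return 0.0, -value          # positive data: write the negation to unit 2

def decode(first_cell, second_cell):
    """Differential read-out of the stored value."""
    return first_cell - second_cell

for v in (-0.8, 0.8, 0.0):
    a, b = encode_write(v)
    assert a <= 0 and b <= 0            # internal node voltages never positive
    assert abs(decode(a, b) - v) < 1e-12
```

The invariant checked in the loop is exactly the point made above: whatever the sign of the input, neither internal node ever carries a positive voltage, which is what keeps the access transistors off without a negative third-word-line bias.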
In one possible embodiment, the control module 20 may be configured to:
configuring the first word line to a low voltage and the third word line to a high voltage to control the memory computing module 110 to operate in a read mode;
the embodiment of the present disclosure controls the memory computing module 110 to operate in a read mode by configuring the first word line to have a low voltage and configuring the third word line to have a high voltage, and in the read mode, a product of input data and data stored in the memory computing module 110 may be read to implement matrix multiplication.
In one possible embodiment, the control module 20 may be further configured to:
configuring a voltage of the second word line according to input data;
the reading of data from the memory calculation module 110 by the control module 20 through the second bit line, the fourth bit line and the second word line may include:
reading currents of the second bit line and the fourth bit line;
the product of the input data and the data stored in the memory computing module 110 is obtained according to the current difference between the second bit line and the fourth bit line.
Through the above means, the control module 20 of the embodiment of the present disclosure may read data from the memory computing module 110. Taking the first transistor Q1, the second transistor Q2, the third transistor Q3, and the fourth transistor Q4 as n-channel devices, when data is read the first word line may be configured at a low voltage (e.g., -4V) and the third word line at a high voltage (e.g., 18V) so that the memory computing module 110 enters read mode. For example, setting the third word line to a high voltage provides a higher gate voltage to all second transistors Q2 and fourth transistors Q4 of the corresponding row of the computational array 10, placing them in the linear region. The third word line also serves to select rows: for a row that does not need to be read, the third word line may be kept at a low level (e.g., 0V), so that the row enters write mode or hold mode. With the third word line kept at 0, since the internal node voltage of each memory cell in the embodiment of the present disclosure is always negative, the gate-source voltages of the second transistor Q2 and the fourth transistor Q4 of that row stay below the threshold voltage, ensuring the correctness of the read operation of the memory computation module 110. Compared with a conventional row-column switch, using the third word line for row selection avoids voltage division on the output path that would otherwise affect the multiplication result.
In one example, when reading the data of the memory computing module 110, the output current must remain linear in the product of the stored data (e.g., a weight value of a neural network) and the input voltage (e.g., the voltage corresponding to the input data of the neural network); this linear relationship holds when the second transistor Q2 and the fourth transistor Q4 operate in the linear region. The output current may be the source-drain (or drain) current of the second transistor Q2 or the fourth transistor Q4 in the first memory cell 1110 or the second memory cell 1120; the stored data may be the data stored on the internal node of the first memory cell 1110 or the second memory cell 1120; and the input voltage may be the gate-source voltage of the second transistor Q2 or the fourth transistor Q4.
In one example, when the second transistor Q2 and the fourth transistor Q4 of the first memory cell 1110 and the second memory cell 1120 are in the linear region, the relationship between the drain-source current and the port voltages of the second transistor Q2 and the fourth transistor Q4 can be approximated by Equation 1:

Ids = k[(Vgs - Vth)·Vds - Vds²/2]   (Equation 1)

where k is a constant determined by the mobility, the channel width-to-length ratio, and the gate capacitance per unit area; Vgs is the gate-source voltage of any second transistor Q2 or fourth transistor Q4 in the memory computing module 110; Vds is the drain-source voltage of any second transistor Q2 or fourth transistor Q4 in the memory computing module 110; Ids is the current between the drain and source of any second transistor Q2 or fourth transistor Q4 in the memory computing module 110; and Vth is the threshold voltage of any second transistor Q2 or fourth transistor Q4 in the memory computing module 110.
Illustratively, since the current between the source and the drain of the second transistor Q2 is equal to the current between the drain and the source of the second transistor Q2, and the current between the source and the drain of the fourth transistor Q4 is equal to the current between the drain and the source of the fourth transistor Q4, the difference of the current or voltage directions does not affect the essence of the invention.
In the embodiment of the present disclosure, the difference between the gate-source voltage of the second transistor Q2 of the first memory cell 1110 and that of the fourth transistor Q4 of the second memory cell 1120 may be the written data voltage (e.g., the weight voltage corresponding to the weight value written on the internal node). The difference between the source-drain current of the second transistor Q2 of the first memory cell 1110 and that of the fourth transistor Q4 of the second memory cell 1120 may then be expressed by Equation 2:

ΔIds = Ids(Vgs + Vweight, Vinput) - Ids(Vgs, Vinput) = k·Vweight·Vinput   (Equation 2)

where Vweight is the stored voltage on the internal node, whose magnitude depends on the written data value (weight value); Vinput is the input voltage, applied through the second word line; and ΔIds is the current difference. In one example, the source-drain current of the second transistor Q2 of the first memory cell 1110 can be read from the second bit line, and the source-drain current of the fourth transistor Q4 of the second memory cell 1120 can be read from the fourth bit line.
The product of the input data (e.g., the input to the neural network) and the stored data (e.g., the weights of the neural network) can be obtained by equation 2 above. On the basis of formula 2, multiple rows of the memory computing device can be turned on simultaneously, and the outputs of the rows can be summed to realize the operation of matrix-vector multiplication.
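To make the arithmetic concrete, the following Python sketch evaluates the linear-region current model (Equation 1) and the differential read-out (Equation 2) numerically, then extends Equation 2 to a small matrix-vector product by summing per-row differential currents on each column, as described above. The values of k, the bias voltages, and the 2x2 weight matrix are arbitrary example numbers, not from the disclosure:

```python
# Illustrative check that the differential current of the paired cells
# reduces to k * Vweight * Vinput, and that summing rows yields a
# matrix-vector product.
def ids(k, vgs, vds, vth):
    """Linear-region drain current (Equation 1)."""
    return k * ((vgs - vth) * vds - vds**2 / 2)

def delta_ids(k, vgs, vth, vweight, vinput):
    """Differential current between the paired cells (Equation 2)."""
    return ids(k, vgs + vweight, vinput, vth) - ids(k, vgs, vinput, vth)

k, vgs, vth = 2e-6, 1.0, 0.4        # arbitrary device parameters
vweight, vinput = -0.8, 0.5          # arbitrary stored and input voltages
assert abs(delta_ids(k, vgs, vth, vweight, vinput) - k * vweight * vinput) < 1e-12

# Matrix-vector multiply: activate multiple rows simultaneously and sum the
# per-row differential currents on the shared bit-line pair of each column.
weights = [[-0.8, 0.3], [0.5, -0.2]]  # stored per-cell weight voltages
inputs = [0.5, 1.0]                    # per-row input voltages
column_currents = [
    sum(delta_ids(k, vgs, vth, w_row[col], x) for w_row, x in zip(weights, inputs))
    for col in range(2)
]
expected = [k * sum(w_row[col] * x for w_row, x in zip(weights, inputs))
            for col in range(2)]
assert all(abs(a - b) < 1e-12 for a, b in zip(column_currents, expected))
```

Note how the quadratic Vds²/2 term of Equation 1 cancels in the subtraction, which is exactly why the differential pair yields a clean product term in Equation 2.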
In one possible embodiment, the control module 20 may be configured to:
the first word line and the third word line are configured to have a low voltage (e.g., -4V) to control the memory computing module 110 to operate in a retention mode, in which the node state of each memory cell is unchanged.
For example, the memory computing modules 110 in rows that are neither read nor written may be in the hold state. In the embodiment of the present disclosure, the first word line and the third word line are both configured at a low voltage to control these memory computing modules 110 to operate in hold mode; in the hold state, the internal nodes of each such memory computing module 110 remain unchanged, reducing power consumption and improving resource utilization.
The voltage configuration of each word line and bit line in each mode will be described as an example.
As shown in table 1, in one possible implementation, the memory computing module 110 may operate in three modes: write mode, read mode, and hold mode. The voltage settings of the first word line, the second word line, the third word line, the first bit line, and the second bit line differ across the operating modes. For example, when reading and writing data, one of the memory cells in the memory computing module 110 may serve as a reference cell, so the first bit line and the second bit line of the write mode in table 1 may instead be the third bit line and the fourth bit line, respectively.
TABLE 1
                  Write mode    Read mode    Hold mode
First word line   10V           -4V          -4V
First bit line    [-4V, 0V]     0V           0V
Second word line  0V            [0V, 3V]     0V
Second bit line   0V            0V           0V
Third word line   0V            18V          0V
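Table 1 can be read as a simple lookup from line and mode to voltage, as in this illustrative sketch. The dictionary layout and the `line_voltage` function are assumptions for illustration; only the voltage values come from the table (ranges are shown as (min, max) tuples):

```python
# Hypothetical control-table lookup mirroring Table 1.
TABLE_1 = {
    "first word line":  {"write": 10.0, "read": -4.0, "hold": -4.0},
    "first bit line":   {"write": (-4.0, 0.0), "read": 0.0, "hold": 0.0},
    "second word line": {"write": 0.0, "read": (0.0, 3.0), "hold": 0.0},
    "second bit line":  {"write": 0.0, "read": 0.0, "hold": 0.0},
    "third word line":  {"write": 0.0, "read": 18.0, "hold": 0.0},
}

def line_voltage(line, mode):
    """Return the Table 1 voltage (or allowed range) for a line in a mode."""
    return TABLE_1[line][mode]

assert line_voltage("third word line", "read") == 18.0
assert line_voltage("first word line", "hold") == -4.0
```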
Of course, the above voltage values for the word lines and bit lines in each mode are exemplary and should not be considered limitations of the embodiments of the present disclosure; those skilled in the art may set the corresponding values as needed. For example, the voltages of the word lines and bit lines may also take any value in the following ranges.
In the write mode, the voltage of the first word line may range from 10 to 30V; the second word line from 0 to 3V; the third word line from 18 to 30V; the first bit line from -4 to 0V; the second bit line may be 0V (ground); the third bit line from -4 to 0V; and the fourth bit line may be 0V.
In the read mode, the voltage range of the first word line may be-15V to-5V; the voltage range of the second word line can be 0-3V; the voltage range of the third word line can be-15-0V; the voltage range of the first bit line can be-15 to 15V; the voltage of the second bit line may be 0V; the voltage of the third bit line can be in the range of-15 to 15V; the voltage of the fourth bit line may be 0V.
It should be understood that the above voltage ranges are exemplary for a particular TFT process. Even for exactly the same type of TFT (e.g., metal-oxide TFTs), operating voltages may differ substantially between processes; for example, a transistor whose operating voltage is around -10V as given here could, in a different process using the same material, operate at around -3V. The supply ranges of the word lines and bit lines may therefore be determined according to the actual situation, and the embodiment of the present disclosure is not limited in this respect.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating an arithmetic device according to an embodiment of the disclosure.
As shown in fig. 5, the device includes a first component and a second component, wherein,
the first component may include:
the sensor array is used for acquiring data to obtain a plurality of input data;
the memory computing device is used for receiving the input data, receiving a plurality of data to be written from the second component and obtaining the product sum of each input data and each data to be written;
and the second component is used for outputting the data to be written and processing the product sum to obtain a processing result.
The arithmetic device of the embodiment of the present disclosure thus comprises the first component, with its sensor array and memory computing device, and the second component, which outputs the data to be written and processes the product sums into a processing result. The memory computing device integrates storage and computation, avoiding the high system power consumption and low computation speed caused by data shuttling between the computing unit and the storage unit of a split architecture during computation, especially large-scale computation. An arithmetic device built on the in-memory computing device provides a data path between the sensor array and the in-memory computing device, supports various image preprocessing algorithms, reduces the use of analog-to-digital conversion and buffering, and achieves higher processing speed and energy efficiency than a conventional digital arithmetic unit.
The sensor array of the embodiment of the present disclosure may include a plurality of sensors and may be configured to acquire images or other data. The sensors may be, for example, CCD (Charge-Coupled Device) sensors, CMOS (Complementary Metal Oxide Semiconductor) sensors, or other optical sensors; the embodiment of the present disclosure is not limited in this respect. The sensor array may be, for example, a camera, a video camera, etc. As shown in fig. 5, the sensor array may perform data collection through a scanning interface; the collected input data may be output through a sensor reading interface and queued for output to the in-memory computing device according to the queue's processing discipline (e.g., FIFO, LIFO, etc.). On the other hand, the in-memory computing device may receive the data to be written stored in the memory of the second component, write it into the in-memory computing modules 110 of the computing array 10, and compute (matrix-multiply) the written data with the input data to obtain a computation result. For example, the queued data may be fed to the memory computing device as input data, the memory computing device may receive weight data from the weight writing interface, and the computation result may be sent to the second component.
In one example, as shown in fig. 5, the second component may further include an analog unit (e.g., including a compensator, an amplifier, etc.), an analog-to-digital conversion unit (e.g., an analog-to-digital converter, ADC), and a communication processor. The analog unit may perform analog operations such as compensation and amplification on the calculation result; the data processed by the analog unit may then be converted by the analog-to-digital conversion unit into digital form (e.g., binary) and processed by the communication processor. For example, the digitized data can also be stored in a memory or internal memory.
Illustratively, the communication processor may include various general-purpose processors, such as a CPU (Central Processing Unit), or an artificial Intelligence Processing Unit (IPU) for performing artificial intelligence operations; the artificial intelligence processor may include one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing Unit), and a Field Programmable Gate Array (FPGA) chip. The communication processor may also provide a communication function for data transmission, e.g., communicating with other modules and devices over a wireless network based on a communication standard such as WiFi, 2G, 3G, or a combination thereof.
For example, the memory may include a computer-readable storage medium, which may be a tangible device that may hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a programmable read-only memory (PROM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
The arithmetic device of the embodiment of the present disclosure supports near-sensor applications, is suitable for processing a variety of algorithms, supports a variety of data streams, achieves close cooperation between the sensor array and the processor, and is particularly well suited to two-dimensional filtering operations.
The disclosed embodiments utilize the low-leakage characteristics of transistors, such as TFTs, to implement in-memory computation, and further provide a near-sensing in-memory computing architecture (the arithmetic device). The in-memory computing device and the arithmetic device can be applied to the preprocessing stage of low-power neural networks. Compared with the related art, the memory calculation module 110 and the calculation array 10 behave like a nonvolatile memory with integrated computation, and accurate matrix-vector multiplication can be realized with a simpler circuit structure. At the same time, a general near-sensing in-memory computing framework with large area and real-time data processing is realized, which, compared with existing work, supports more algorithms such as two-dimensional convolution. Furthermore, the present disclosure achieves speed and energy-efficiency improvements over conventional digital processing units.
The following describes performance tests and simulations of the in-memory computing device and the arithmetic device according to embodiments of the present disclosure.
Fig. 6a and 6b are schematic diagrams illustrating a linearity analysis of the memory computing device according to the embodiment of the disclosure.
As shown in fig. 6a and 6b, to test the accuracy of the matrix-vector multiplication, the circuit structure of the embodiment of the disclosure was simulated, obtaining ΔIds (the current difference between the second bit line and the fourth bit line in the read mode) as a function of V_input (the voltage corresponding to the input data on the second word line) and V_weight (the voltage corresponding to the data stored in the memory computing module 110). For multiple different sets of V_input and V_weight, the linear-regression R² values of all curves exceed 0.999. This result demonstrates an excellent linear relationship, i.e., a multiplicative relationship between the output current, the weight, and the input, verifying the feasibility of the in-memory matrix-vector multiplication.
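The multiplicative relationship established above can be sketched numerically. The following minimal model is an assumption for illustration (the coefficient K and all voltages are invented values, not figures from the patent): each cell pair is treated as a transconductance whose differential bit-line current is proportional to the product of the input and weight voltages, and a matrix-vector multiply accumulates these differential currents along each row.

```python
import numpy as np

# Hypothetical transconductance coefficient (A/V^2); an assumed value,
# not taken from the patent.
K = 1e-6

def delta_ids(v_input, v_weight):
    """Current difference between the second and fourth bit lines.

    Modeled as Delta_Ids = K * v_input * v_weight: one cell of the
    pair stores the weight and the other the reference, so their
    current difference encodes a signed multiply.
    """
    return K * v_input * v_weight

def mvm(v_inputs, v_weights):
    """Matrix-vector multiply: accumulate Delta_Ids along each weight row."""
    return np.array([sum(delta_ids(x, w) for x, w in zip(v_inputs, row))
                     for row in v_weights])

weights = np.array([[0.5, -0.2], [-0.4, 0.3]])  # stored weight voltages (V)
inputs = np.array([0.2, 0.1])                   # input voltages (V)
print(mvm(inputs, weights))  # proportional to weights @ inputs
```

Because each term is linear in both voltages, the accumulated current is exactly proportional to the ideal matrix-vector product, which is the property the R² > 0.999 regression verifies.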
Fig. 7 shows a schematic diagram of a read-write simulation waveform of a memory computing module according to an embodiment of the present disclosure.
As shown in fig. 7, in order to test the read and write delays of the memory computing module 110, a transient simulation was performed on its circuit structure to obtain its operating waveforms. In the simulation of array read operations, all waveforms are consistent with expectations, and the read frequency can reach up to 15 MHz. For the write operation, since the conduction currents of the selected first transistor Q1 and third transistor Q3 (such as TFTs) are small, and the capacitance is set large in order to increase the retention time, the write operation takes a relatively long time. However, in typical edge-computing problems the memory does not need to be written frequently; only the periodic refresh operation needs to trim the voltage, which takes a relatively short time, so the longer write time has little influence on the whole system. In fig. 7, Sleep represents standby.
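The relatively long write time can be understood with a back-of-the-envelope estimate: the storage capacitor, made deliberately large for retention, is charged through the small TFT on-current, so t_write ≈ C·ΔV/I_on. All numerical values below are illustrative assumptions, not figures from the patent.

```python
# Back-of-the-envelope estimate of the write time: charging the storage
# capacitor through the selected TFT. All values are illustrative
# assumptions, not figures from the patent.
C_STORE = 1e-12   # storage capacitance: 1 pF, kept large for retention
DELTA_V = 1.0     # voltage swing to be written (V)
I_ON = 1e-9       # TFT on-current: 1 nA (TFT conduction currents are small)

t_write = C_STORE * DELTA_V / I_ON  # seconds
print(f"estimated write time: {t_write * 1e3:.1f} ms")  # 1.0 ms
```

With these assumed values the write takes on the order of a millisecond, orders of magnitude slower than the 15 MHz read, which matches the qualitative trade-off described above.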
Fig. 8a and 8b are schematic diagrams illustrating the retention time analysis of the in-memory computing module and the results of the monte carlo simulation according to the embodiment of the disclosure.
As shown in fig. 8a, in order to test the retention time of this approximately nonvolatile memory, a circuit simulation of the storage-retention process was performed on the structure of the memory computing module 110. After 500 s, the error caused by leakage is only 2%, which is acceptable for most problems; that is, the refresh period of the system can reach the order of minutes, achieving a good approximately nonvolatile characteristic.
As shown in fig. 8b, in order to test the influence of device variation, a Monte Carlo simulation (MC Simulation Samples) was performed on the circuit of the embodiment of the present disclosure. Here the deviation of the threshold voltage has the largest influence on the result. Based on the device model actually selected in the simulation, the standard deviation of the threshold-voltage distribution over the entire calculation array 10 is taken as 0.3 V; two adjacent devices match better, with a mismatch standard deviation of 0.03 V; and the weight interval is taken as 0.5 V. The resulting output-current distribution is shown in fig. 8b. There is no overlap between the current distributions corresponding to the quantized weights (stored data); that is, a single memory calculation module 110 can achieve an unsigned 3-bit quantization result, and a pair of memory calculation modules 110 can achieve a signed 4-bit quantization result. This indicates that the memory calculation module 110 can support multi-valued storage with a higher number of bits.
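The separability argument can be sketched with a small Monte Carlo experiment of the same flavor. The mismatch sigma (0.03 V) and the weight interval (0.5 V) are taken from the text above; the transconductance coefficient K and the input voltage are assumed values, and this sketch only captures mismatch acting as an additive shift on the effective weight voltage, not the full device model used in the patent's simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the level-separability argument: local threshold-voltage
# mismatch shifts the effective weight voltage of each sampled cell.
K = 1e-6                # assumed transconductance coefficient (A/V^2)
V_INPUT = 1.0           # fixed input voltage for the test (V), assumed
SIGMA_MISMATCH = 0.03   # local Vth mismatch standard deviation (V)
WEIGHT_STEP = 0.5       # weight quantization interval (V)
N_SAMPLES = 10_000

levels = np.arange(8) * WEIGHT_STEP   # unsigned 3-bit weight levels
currents = {w: K * V_INPUT * (w + rng.normal(0.0, SIGMA_MISMATCH, N_SAMPLES))
            for w in levels}

# Adjacent levels are separable when their sampled distributions do not
# overlap; with step/sigma of roughly 16.7 the gap spans many sigma.
separable = all(currents[levels[i]].max() < currents[levels[i + 1]].min()
                for i in range(len(levels) - 1))
print("quantization levels separable:", separable)
```

Because the 0.5 V step is about 17 mismatch standard deviations wide, the sampled current distributions of adjacent levels stay well apart, mirroring the non-overlapping distributions of fig. 8b.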
The embodiments of the present disclosure particularly consider, in simulation, the use of the arithmetic device for neural-network preprocessing. Neural networks are increasingly widely applied in image processing, and the binarized neural networks that have emerged in recent years are especially suitable for deployment on terminal systems because of their small size and simple computation. However, the first layer of a binarized network is generally full precision, and the processing overhead of this layer remains very large: not only the computation itself, but also analog-to-digital conversion and data buffering. Moving this full-precision layer, the one closest to the sensor, into the analog domain can therefore greatly reduce the computation cost; moreover, because the input of the subsequent network stage is binarized, only a very simple comparator is needed in place of a high-precision analog-to-digital converter, greatly reducing the cost of analog-to-digital conversion. Preprocessing a binarized neural network with the near-sensing in-memory computing architecture (arithmetic device) of the embodiments of the present disclosure thus offers great advantages.
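The role of the comparator can be sketched as follows. A toy full-precision first layer (random data and illustrative shapes, not the actual XNOR-Net convolutions on CIFAR-10) is evaluated as an analog matrix-vector multiply; since the next binarized layer only consumes signs, a 1-bit threshold comparison suffices in place of a high-precision ADC.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the full-precision first layer of a binarized network.
# Shapes and data are illustrative; the patent's setup uses real
# convolutions on CIFAR-10 images with 4-bit-quantized weights.
x = rng.normal(size=16)        # analog sensor readings
W = rng.normal(size=(8, 16))   # first-layer weights (full precision here)

analog_out = W @ x             # matrix-vector multiply, done in analog

# The subsequent layer is binarized, so only the sign is needed: a
# simple comparator against a threshold replaces a high-precision ADC.
THRESHOLD = 0.0
binary_out = np.where(analog_out > THRESHOLD, 1, -1)
print(binary_out)              # eight values, each +1 or -1
```

The comparator collapses each analog accumulation to a single bit, which is why the expensive multi-bit ADC and its data buffering can be removed from the first-layer path.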
FIG. 9 is a diagram illustrating performance on a data set when an arithmetic device according to an embodiment of the present disclosure is applied to a first layer of computation.
In the analysis, a very typical binarized neural network, XNOR-Net, was selected (the approach can of course also be applied to other neural networks, and the present disclosure is not limited in this respect), and the network was used to classify the CIFAR-10 data set. In the first layer, weights with 4-bit quantization precision are used, and the output is 1 bit. Three sources of error are considered in the deployment process: 1) bias due to inaccurate matrix-vector multiplication; 2) weight decay due to leakage; 3) random noise due to device variation. The resulting performance of the neural network on the classification problem is shown in fig. 9, in which Baseline Accuracy represents the accuracy of the ideal full-precision case, and each circle represents the accuracy of repeated simulations under a different threshold-voltage deviation. It can be seen that when the threshold-voltage mismatch (V_th Mismatch) is less than 0.1 V, the average accuracy loss of the arithmetic device of the embodiment of the present disclosure is within 3%.
Fig. 10a and 10b are schematic diagrams illustrating comparison results between the arithmetic device according to the embodiment of the present disclosure and a conventional digital processor.
The arithmetic device of the embodiment of the present disclosure was compared with a conventional digital processor; the reference scheme adopted was a 32-bit matrix-vector multiplication processing unit. The processing delay and power consumption of the arithmetic device were obtained from the circuit simulation results. For the reference digital scheme, the Synopsys Design Compiler was used to simulate the delay and power-consumption data, and for the additional data accesses, the CACTI simulator was used to estimate power consumption and delay. The comparison of the two schemes is shown in fig. 10a and 10b: fig. 10a shows the delay (Delay) comparison and fig. 10b the power-consumption (Power) comparison. Because highly parallel matrix-vector multiplication is adopted and high-precision analog-to-digital converters are avoided, the system of the embodiment of the present disclosure achieves up to a 3.17x speed improvement and a 9.57x energy-efficiency improvement.
In fig. 10a, Near-Sensor CiM denotes the near-sensing in-memory computing scheme, where CiM stands for computing-in-memory and Near-Sensor for near-sensing computation; CiM Array is the computing array proposed in the embodiments of the present disclosure; Readout Circuit is the peripheral data-readout circuit; Digital MAC Core is a multiply-accumulate unit implemented in a conventional digital circuit architecture; ADC stands for analog-to-digital converter; Logic denotes the main computation portion of the aforementioned multiply-accumulate unit; Memory refers to the memory; Input Vector Length is the length of the input vector in the multiply-accumulate operation; Delay is the operation time; and Power is the power consumption.
Referring to table 2, table 2 shows a comparison between the in-memory computing device and the arithmetic device of the embodiments of the present disclosure and other existing work.
TABLE 2
(Table 2 is reproduced as an image in the original publication and is not shown here.)
In table 2, VLSI is the abbreviation of the Symposium on Very Large Scale Integration (VLSI Symposium), a top conference in the integrated-circuit field. ISSCC stands for the International Solid-State Circuits Conference, likewise a top conference in the field. JJAP denotes the Japanese Journal of Applied Physics. MAC support means support for multiply-accumulate (Multiply-and-Accumulate) operation, the most common form of operation in edge computing. Implementation 1 and implementation 2 of the in-memory computing device and the arithmetic device of the embodiments of the present disclosure use TFTs fabricated in different processes.
The system of the embodiments of the present disclosure can realize high-precision storage and computation with low circuit complexity. In addition, an estimate based on a more advanced process shows that the TFT compute-in-memory system has great potential.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. An in-memory computing device, the device comprising:
the computing array comprises a plurality of memory computing modules, wherein each memory computing module comprises a first word line, a second word line, a third word line, a first bit line, a second bit line, a third bit line, a fourth bit line, a first storage unit and a second storage unit, the first storage unit and the second storage unit are connected to the first word line, the second word line and the third word line, the first storage unit is further connected to the first bit line and the second bit line, and the second storage unit is further connected to the third bit line and the fourth bit line;
a control module, coupled to the computational array, to: controlling the voltage states of the first word line and the third word line to control the working mode of the memory computing module to be any one of a writing mode, a reading mode and a holding mode,
in the write mode, the control module writes data into the memory calculation module through the first bit line and the third bit line; in the read mode, the control module reads data from the memory computation module through the second bit line, the fourth bit line, and the second word line; in the hold mode, the control module is configured to hold a state of the memory computing module.
2. The apparatus of claim 1, wherein the first storage cell comprises a first transistor, a second transistor, a first storage capacitor, and the second storage cell comprises a third transistor, a fourth transistor, a second storage capacitor, wherein,
first ends of the first transistor and the third transistor are connected to the first word line, a second end of the first transistor is connected to the first bit line, a second end of the third transistor is connected to the third bit line,
the third end of the first transistor, the first end of the second transistor and the first end of the first storage capacitor are connected to form a first node, the third end of the third transistor, the first end of the fourth transistor and the first end of the second storage capacitor are connected to form a second node,
third ends of the second transistor and the fourth transistor are both connected to the second word line, a second end of the second transistor is connected to the second bit line, and a second end of the fourth transistor is connected to the fourth bit line,
second ends of the first storage capacitor and the second storage capacitor are connected to the third word line.
3. The apparatus of claim 1, wherein the control module writes data to the memory computation module via the first bit line and the third bit line, comprising:
the control module determines a target storage unit from the first storage unit and the second storage unit according to the value of the data to be written;
and configuring the voltages of the first bit line and the third bit line according to the value of the data to be written so as to write the data to be written into the target storage unit.
4. The apparatus of claim 3, wherein in the write mode, the control module is further configured to:
if the value of the data to be written is a positive number, writing the opposite number of the data to be written into the target storage unit; or if the value of the data to be written is negative, directly writing the data to be written into the target storage unit,
wherein the target memory cell corresponding to the positive write data is different from the target memory cell corresponding to the negative write data.
5. The apparatus of claim 4, wherein, when one of the first memory cell and the second memory cell is used for writing data and the other is used for storing reference data, and the voltage of the node of the memory cell used for writing data is a negative voltage, the second transistor of each of the first memory cell and the second memory cell is in an off state.
6. The apparatus of claim 1, wherein the control module is configured to:
configuring the first word line to a high voltage and the third word line to a low voltage to control the memory computing module to operate in the write mode;
configuring the first word line to be at a low voltage and the third word line to be at a high voltage to control the memory computing module to operate in a read mode;
and configuring the first word line and the third word line to be low voltage so as to control the memory computing module to work in a holding mode, wherein the node state of each memory unit is unchanged in the holding mode.
7. The apparatus of claim 1, wherein the control module is further configured to:
configuring a voltage of the second word line according to input data;
the control module reads data from the memory computation module through the second bit line, the fourth bit line, and the second word line, including: reading currents of the second bit line and the fourth bit line; and obtaining the product of the input data and the data stored by the memory computing module according to the current difference value of the second bit line and the fourth bit line.
8. The apparatus of claim 1, wherein the compute array comprises M rows and N columns, wherein M, N are integers greater than 0, and wherein the first word line, the second word line, and the third word line of the memory compute modules in a same row of the compute array are connected respectively, and/or wherein the first bit line, the second bit line, the third bit line, and the fourth bit line of the memory compute modules in a same column of the compute array are connected respectively.
9. The apparatus of claim 1, wherein each column of the memory computing modules comprises a plurality of readout paths, and the control module is configured to read out a plurality of data from each readout path and add the plurality of read out data, wherein for each row that does not need to read out data, the control module is configured to control each row of the memory computing modules to operate in the hold mode.
10. The apparatus of claim 1, wherein the control module is further configured to:
controlling a first number of memory computing modules in the computing array to work in a write mode, controlling a second number of memory computing modules in the computing array to work in a read mode, and controlling the remaining memory computing modules in the computing array except the first number and the second number to work in a hold mode.
11. The apparatus of claim 2, wherein a leakage current of the first transistor is less than a predetermined value.
12. A computing device comprising a first component and a second component, wherein,
the first component includes:
the sensor array is used for acquiring data to obtain a plurality of input data;
the in-memory computing device of any of claims 1-11, configured to receive the plurality of input data and a plurality of data to be written from the second component, and to obtain a product-sum of each input data and each data to be written;
and the second component is used for outputting the data to be written and processing the product sum to obtain a processing result.
CN202110913509.5A 2021-05-21 2021-08-10 In-memory computing device and computing device Active CN113593622B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110558625X 2021-05-21
CN202110558625 2021-05-21

Publications (2)

Publication Number Publication Date
CN113593622A true CN113593622A (en) 2021-11-02
CN113593622B CN113593622B (en) 2023-06-06

Family

ID=78256812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110913509.5A Active CN113593622B (en) 2021-05-21 2021-08-10 In-memory computing device and computing device

Country Status (1)

Country Link
CN (1) CN113593622B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114300014A (en) * 2021-12-30 2022-04-08 厦门半导体工业技术研发有限公司 Memory data processing circuit and resistive random access memory
CN114625691A (en) * 2022-05-17 2022-06-14 电子科技大学 Memory computing device and method based on ping-pong structure
WO2023124096A1 (en) * 2021-12-31 2023-07-06 浙江驰拓科技有限公司 Memory and read circuit thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028875A (en) * 2019-11-29 2020-04-17 中国科学院微电子研究所 Memory computing circuit
CN111816234A (en) * 2020-07-30 2020-10-23 中科院微电子研究所南京智能技术研究院 Voltage accumulation memory computing circuit based on SRAM bit line union
CN111816231A (en) * 2020-07-30 2020-10-23 中科院微电子研究所南京智能技术研究院 Memory computing device with double-6T SRAM structure
CN112133339A (en) * 2020-08-12 2020-12-25 清华大学 Memory bit-by-bit logic calculation circuit structure based on ferroelectric transistor
US20210065776A1 (en) * 2019-08-26 2021-03-04 Stmicroelectronics International N.V. Memory computing, high-density array


Also Published As

Publication number Publication date
CN113593622B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN113593622B (en) In-memory computing device and computing device
CN108763163B (en) Analog vector-matrix multiplication circuit
CN110008440B (en) Convolution operation based on analog matrix operation unit and application thereof
CN109086249B (en) Analog vector-matrix multiplication circuit
CN111652363A (en) Storage and calculation integrated circuit
CN110007895B (en) Analog multiplication circuit, analog multiplication method and application thereof
US20220157384A1 (en) Pulse-Width Modulated Multiplier
US20230132411A1 (en) Devices, chips, and electronic equipment for computing-in-memory
US11797643B2 (en) Apparatus and method for matrix multiplication using processing-in-memory
Shukla et al. Ultralow-power localization of insect-scale drones: Interplay of probabilistic filtering and compute-in-memory
Maneux et al. Modelling of vertical and ferroelectric junctionless technology for efficient 3D neural network compute cube dedicated to embedded artificial intelligence
Liu et al. Almost-nonvolatile IGZO-TFT-based near-sensor in-memory computing
Song et al. A 28 nm 16 kb bit-scalable charge-domain transpose 6T SRAM in-memory computing macro
Kang et al. Deep in-memory architectures for machine learning
Cheon et al. A 2941-TOPS/W charge-domain 10T SRAM compute-in-memory for ternary neural network
Liu et al. An energy-efficient mixed-bit cnn accelerator with column parallel readout for reram-based in-memory computing
Angizi et al. Pisa: A non-volatile processing-in-sensor accelerator for imaging systems
Zhang et al. Fast Fourier transform (FFT) using flash arrays for noise signal processing
Yang et al. Essence: Exploiting structured stochastic gradient pruning for endurance-aware reram-based in-memory training systems
CN110311676B (en) Internet of things vision system adopting switching current technology and data processing method
CN111859261A (en) Computing circuit and operating method thereof
CN112632460B (en) Source coupled drain summed analog vector-matrix multiplication circuit
Huang et al. Accuracy optimization with the framework of non-volatile computing-in-memory systems
Laleni et al. In-Memory Computing exceeding 10000 TOPS/W using Ferroelectric Field Effect Transistors for EdgeAI Applications
US20240185916A1 (en) Devices, chips, and electronic equipment for sensing-memory-computing synergy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant