CN110674462B - Matrix operation device, method, processor and computer readable storage medium - Google Patents


Publication number
CN110674462B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911223959.0A
Other languages
Chinese (zh)
Other versions
CN110674462A (en)
Inventor
郑瀚寻
杨龚轶凡
闯小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.
Original Assignee
Shenzhen Xinying Technology Co ltd
Application filed by Shenzhen Xinying Technology Co ltd
Priority to CN201911223959.0A
Publication of CN110674462A
Application granted
Publication of CN110674462B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Static Random-Access Memory (AREA)

Abstract

The embodiments of the invention disclose a matrix operation device, a matrix operation method, a processor and a computer-readable storage medium. The matrix operation device includes: a memory for storing a matrix to be operated, the memory including M rows by N columns of memory cells; M rows by K word lines, connected with the M rows by N columns of memory cells along the row direction, each memory cell in each row being connected with K word lines, where any K1 of the M rows by K word lines are used to enable the corresponding K2 rows by N columns of memory cells so as to write at least one corresponding row of elements of the matrix to be operated into the corresponding positions in the K2 rows by N columns of memory cells; and an operation circuit, connected with the memory, for receiving an externally input vector to be operated and performing a vector-matrix operation on the vector to be operated and the matrix to be operated to obtain an operation result. While integrating storage and computation, the device can write multiple rows of elements of the matrix to be operated into the corresponding memory cells simultaneously, which greatly improves data writing efficiency.

Description

Matrix operation device, method, processor and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a matrix operation device, a matrix operation method, a matrix operation processor, and a computer-readable storage medium.
Background
With the progress of process nodes, the operation speed of computers based on digital logic keeps increasing, but the transmission speed of data over transmission media (such as copper wires) has not increased correspondingly, and has even become slower as the transmission media shrink. The existing von Neumann computing architecture relies on separate devices to store data and to execute operations; as data volume and algorithm complexity grow, the large amount of time and energy consumed by data access and transfer between the storage device and the arithmetic device has become the bottleneck for further improving computing performance.
To solve this problem, many chip companies and scientists have invested a great deal of time and money in studying how to move computation from the central processing unit into the memory, so as to reduce data movement and improve operation efficiency; this approach is also called storage-computation integration. On the one hand, existing storage-computation integrated methods can indeed improve computation speed to a certain degree, save circuit resources and reduce computation power consumption. On the other hand, as regards storing data, especially when a large amount of data needs to be computed, the conventional technical solution still spends a lot of time writing the data into the storage-computation integrated device row by row, so that it is difficult to improve the overall working efficiency of such a device in practical applications.
Disclosure of Invention
The embodiment of the invention provides a matrix operation device and a matrix operation method, which can realize integration of storage and operation, further improve the operation efficiency, simultaneously write a plurality of rows of elements in a matrix to be operated into corresponding storage units, and improve the efficiency of writing the matrix into a memory, thereby saving time and circuit resources.
In one aspect, an embodiment of the present invention provides a matrix operation apparatus, where the apparatus includes:
a memory for storing a matrix to be operated, the memory comprising M rows by N columns of memory cells, each memory cell being connected with adjacent memory cells in the row direction and the column direction;
word lines, M rows by K in number, connected with the M rows by N columns of memory cells along the row direction, wherein each memory cell in each row is connected with K word lines; any K1 of the M rows by K word lines are used to simultaneously enable the corresponding K2 rows by N columns of memory cells, so that at least one corresponding row of elements of the matrix to be operated is simultaneously written into the corresponding positions in the K2 rows by N columns of memory cells, where K, K1 and K2 are integers greater than 1;
an operation circuit comprising X rows by Y columns of operation units, wherein the M rows by N columns of memory cells and the X rows by Y columns of operation units are arranged in a crossed manner, each memory cell in the M rows by N columns of memory cells is correspondingly connected with one operation unit, and the operation circuit is used for receiving an externally input vector to be operated and performing a vector-matrix operation on the vector to be operated and the matrix to be operated stored in the M rows by N columns of memory cells to obtain an operation result.
In the structure of M rows by N columns of memory cells, the multiple word lines connected to each memory cell in each row can enable memory cells in multiple rows simultaneously, so that multiple rows of data in the matrix to be operated can be written into the corresponding memory cells at the same time. Compared with the prior art, in which data is written into the memory row by row, the embodiment of the invention greatly shortens the time consumed by writing data into the memory and improves writing efficiency. In addition, the embodiment of the invention realizes in-memory computation of the vector-matrix operation through the X rows by Y columns of operation units that are arranged in the device in a crossed manner with the M rows by N columns of memory cells. In existing storage-computation integrated schemes, a monolithic, independent block of M rows by N columns of memory cells is simply connected to a monolithic, independent block of X rows by Y columns of operation units. By contrast, the invention interleaves the memory cells and the operation units within the device and couples them. For example, a row of four operation units is arranged below a row of four memory cells, followed by another row of four memory cells, and so on in a repeating pattern, with each memory cell coupled to the operation unit adjacent below it; as another example, a row of four operation units is arranged below a row of four memory cells, followed by another row of four memory cells, and so on, with each memory cell coupled to the operation unit adjacent to it in the column direction; and the like.
Each memory cell is correspondingly connected with an operation unit, so that when a write operation is performed on the memory cell, the operation unit can directly acquire the data written into the memory cell, without the memory cell first having to be gated and read and the read data then transmitted to the operation unit. This reduces the data transmission overhead in the operation process, greatly improves data operation efficiency, and reduces circuit area.
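The write-time saving described above can be illustrated with a small cycle-count model. This is only a sketch, assuming that a fixed number of rows (standing in for K2) is enabled and written in every clock cycle; the names `write_cycles`, `write_schedule` and `rows_per_cycle` are hypothetical and do not come from the patent.

```python
import math


def write_cycles(num_rows: int, rows_per_cycle: int) -> int:
    """Clock cycles needed to write num_rows matrix rows.

    Assumption: exactly rows_per_cycle rows can be enabled and written
    in the same clock cycle (rows_per_cycle = 1 models row-by-row writing).
    """
    if num_rows < 0 or rows_per_cycle < 1:
        raise ValueError("num_rows must be >= 0 and rows_per_cycle >= 1")
    return math.ceil(num_rows / rows_per_cycle)


def write_schedule(num_rows: int, rows_per_cycle: int) -> list:
    """Row indices written in each cycle, in cycle order."""
    if num_rows < 0 or rows_per_cycle < 1:
        raise ValueError("num_rows must be >= 0 and rows_per_cycle >= 1")
    return [
        list(range(start, min(start + rows_per_cycle, num_rows)))
        for start in range(0, num_rows, rows_per_cycle)
    ]
```

Under this model, a 4-row matrix written one row per cycle needs four cycles, while writing four rows per cycle needs one.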
In one possible embodiment, each of the X rows by Y columns of arithmetic units comprises an input port;
the operation unit is connected with the corresponding storage unit through the input port and is used for acquiring the elements stored in the storage unit;
the operation unit is further configured to obtain an element to be operated, and multiply the element stored in the corresponding storage unit with the element to be operated to obtain a multiplication result, where the element to be operated comes from the vector to be operated.
In a possible embodiment, the arithmetic circuit further comprises an adder circuit connected to the X row by Y column arithmetic units;
and the addition circuit is used for respectively adding the multiplication results obtained by the operation units in each row of the operation units in the X row by Y column to obtain the operation results of the vector to be operated and the matrix to be operated.
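The multiply-then-add structure described above can be sketched in software. This is a minimal illustration, not the circuit itself: each inner product term stands for one operation unit multiplying its stored element by an element of the vector to be operated, and the final sums stand for the addition circuit. The function name `vector_matrix` and the choice of which index is summed are assumptions, since the text treats "row" and "column" as relative concepts.

```python
def vector_matrix(vector, matrix):
    """Compute result[j] = sum over i of vector[i] * matrix[i][j].

    Assumption: matrix is a list of equally long rows and the vector has
    one element per matrix row.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if len(vector) != rows:
        raise ValueError("vector length must equal the number of matrix rows")
    if any(len(row) != cols for row in matrix):
        raise ValueError("all matrix rows must have the same length")

    # One product per (memory cell, operation unit) pair.
    products = [[vector[i] * matrix[i][j] for j in range(cols)] for i in range(rows)]

    # The addition step accumulates the products that share an output index.
    return [sum(products[i][j] for i in range(rows)) for j in range(cols)]
```

For instance, `vector_matrix([1, 2], [[1, 2, 3], [4, 5, 6]])` accumulates 1*1 + 2*4, 1*2 + 2*5 and 1*3 + 2*6.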
In a possible embodiment, the apparatus further comprises:
the number of the zero clearing lines is N, the zero clearing lines are connected with the M rows by N columns of storage units along the column direction, and each storage unit in each column is connected with the same zero clearing line; each zero clearing line is used for transmitting high level to each storage unit connected with the zero clearing line when the zero clearing operation is executed.
In a possible embodiment, the apparatus further comprises:
bit lines, each set of the bit lines including a first bit line and a second bit line complementary to the first bit line; within the M rows by N columns of memory cells, each of the memory cells in each column is connected to S groups of the bitlines in a column direction, thereby forming N columns by S groups of the bitlines;
when the S2 rows by N columns of memory cells corresponding to N columns by S1 groups of bit lines among the N columns by S groups of bit lines are enabled, the N columns by S1 groups of bit lines are used to write at least one row of elements of the matrix to be operated into the corresponding positions in the S2 rows by N columns of memory cells.
In one possible embodiment, each of the storage units comprises:
the first inverter and the second inverter are connected end to form a storage space, the storage space is used for storing one item of data, and the one item of data comprises corresponding elements in the matrix to be operated;
the drains of Q1 first transmission transistors are respectively connected with the output end of the same first inverter, the gates of Q1 first transmission transistors are respectively connected with corresponding word lines in the K word lines, and the sources of Q1 first transmission transistors are respectively connected with corresponding first bit lines in the S groups of bit lines;
the drains of the Q2 second transmission transistors are respectively connected with the output end of the same second inverter, the gates of the Q2 second transmission transistors are respectively connected with corresponding word lines in the K word lines, and the sources of the Q2 second transmission transistors are respectively connected with corresponding second bit lines in the S groups of bit lines;
when any one of the K word lines is at a high potential, a first transfer transistor and a second transfer transistor connected to the word line are used for conducting a circuit to write a corresponding element in the matrix to be operated into the memory cell.
In one possible implementation, each of the storage units further includes:
the drain electrode of the first zero clearing transistor is connected with the output end of the first phase inverter, and the gate electrode of the first zero clearing transistor is connected with the zero clearing line;
the drain electrode of the second zero clearing transistor is connected with the output end of the second inverter, and the gate electrode of the second zero clearing transistor is connected with the zero clearing line;
the first clear transistor and the second clear transistor are used for clearing the storage unit when the clear line is at a high potential.
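The behavior of such a memory cell can be approximated by a small behavioral model. This is a sketch under several assumptions that the text does not state: the number of word lines equals the number of bit-line groups, word line k is paired with bit-line group k, and at most one word line of a given cell is high at a time. The class name `MultiPortCell` and its methods are hypothetical.

```python
class MultiPortCell:
    """Behavioral model of one memory cell with several write ports."""

    def __init__(self, num_ports: int) -> None:
        if num_ports < 1:
            raise ValueError("num_ports must be at least 1")
        self.num_ports = num_ports
        self.value = 0  # stored bit

    def write(self, word_lines, bit_lines) -> int:
        """Apply word-line levels and (bl, bl_b) pairs; return the stored bit.

        word_lines: one 0/1 level per port.
        bit_lines: one (bl, bl_b) pair of complementary 0/1 levels per port.
        """
        if len(word_lines) != self.num_ports or len(bit_lines) != self.num_ports:
            raise ValueError("expected one word-line level and one bit-line pair per port")
        active = [k for k, level in enumerate(word_lines) if level == 1]
        if not active:
            return self.value  # no word line high: the cell holds its value
        if len(active) > 1:
            # Assumption of this sketch: a cell is driven through one port at a time.
            raise ValueError("more than one word line is high for this cell")
        bl, bl_b = bit_lines[active[0]]
        if {bl, bl_b} != {0, 1}:
            raise ValueError("bl and bl_b must be complementary 0/1 levels")
        self.value = bl
        return self.value

    def clear(self, clear_line: int) -> int:
        """A high level on the zero clearing line resets the stored bit to 0."""
        if clear_line == 1:
            self.value = 0
        return self.value
```

A cell with two ports can, for example, be written through its second port while the first word line stays low, and later be reset through the zero clearing line.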
In another aspect, an embodiment of the present invention provides a matrix operation method, where the method includes:
determining K1 word lines among the word lines according to the matrix to be operated, wherein the word lines are M rows by K in number and are connected with the M rows by N columns of memory cells in the memory along the row direction, and each memory cell in each row is connected with K word lines;
simultaneously enabling, through the K1 word lines, the corresponding K2 rows by N columns of memory cells, so as to simultaneously write at least one corresponding row of elements of the matrix to be operated into the corresponding positions in the K2 rows by N columns of memory cells; within the M rows by N columns of memory cells, each memory cell is connected to adjacent memory cells in both the row and column directions, and K, K1 and K2 are integers greater than 1;
receiving an externally input vector to be operated through an operation circuit connected with the memory, and performing a vector-matrix operation on the vector to be operated and the matrix to be operated stored in the M rows by N columns of memory cells to obtain an operation result; the operation circuit comprises X rows by Y columns of operation units, the M rows by N columns of memory cells and the X rows by Y columns of operation units are arranged in a crossed manner, and each memory cell in the M rows by N columns of memory cells is correspondingly connected with one operation unit.
In one possible embodiment, the method further comprises:
when the matrix to be operated comprises P columns of all-zero elements, determining P corresponding zero clearing lines from N zero clearing lines in total, wherein the N zero clearing lines are connected with the M rows by N columns of storage units along the column direction, and each storage unit in each column is connected with the same zero clearing line;
and transmitting a high level, through the P zero clearing lines, to the P columns of memory cells connected with them, so as to clear the P columns of memory cells.
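The selection of zero clearing lines in the two steps above can be sketched as follows. This is an illustrative model only; the function names `zero_columns` and `clear_line_levels` are hypothetical, and the matrix is assumed to be a list of equally long rows.

```python
def zero_columns(matrix):
    """Indices of the columns whose elements are all zero."""
    num_cols = len(matrix[0]) if matrix else 0
    return [j for j in range(num_cols) if all(row[j] == 0 for row in matrix)]


def clear_line_levels(matrix):
    """Level (1 = high, 0 = low) to drive on each of the N zero clearing lines."""
    num_cols = len(matrix[0]) if matrix else 0
    selected = set(zero_columns(matrix))
    return [1 if j in selected else 0 for j in range(num_cols)]
```

Here the number of selected lines, `len(zero_columns(matrix))`, plays the role of P.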
In another aspect, an embodiment of the present invention provides a processor including the apparatus in any one of the possible implementation manners of the above aspect. The processor may be formed of a chip or may include chips and other discrete devices.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, where the computer program includes program instructions, and the program instructions, when executed by a processor, may implement the method included in any one of the possible implementation manners of the above another aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic structural diagram of a memory cell in the prior art.
FIG. 2 is a block diagram illustrating an overall architecture of a data storage and operation in the prior art.
FIG. 3 is a clock cycle diagram of a prior art matrix writing to memory.
Fig. 4 is a clock cycle diagram of a prior art vector-matrix operation.
Fig. 5 is a schematic diagram of an arrangement and connection of memory cells according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a process for writing a matrix into a memory cell according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of another process for writing a matrix into a memory cell according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a memory cell according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of a matrix operation apparatus according to an embodiment of the present invention.
Fig. 10 is a diagram illustrating mathematical operations of a vector-matrix according to an embodiment of the present invention.
Fig. 11 is a circuit connection diagram of a vector-matrix operation according to an embodiment of the present invention.
Fig. 12 is a schematic diagram of an overall architecture of a storage system according to an embodiment of the present invention.
FIG. 13 is a clock cycle diagram for writing a matrix into a memory according to an embodiment of the present invention.
Fig. 14 is a schematic diagram of clock cycles for performing vector-matrix operations according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and in the claims, and in the drawings, are used for distinguishing between different objects and not necessarily for describing a particular sequential order. The term "at least one" means one or more than one, and the term "plurality" means two or more than two, unless specifically limited otherwise. Also, the description and claims of the present invention and the drawings refer to "row" and "column" as relative concepts, not absolute "row" and "column". Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. It should be noted that when an element is referred to as being "coupled" or "connected" to another element or elements, it can be directly connected or indirectly connected to the other element or elements.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following provides a detailed description of the background art related to the present invention to further illustrate the technical problems solved by the matrix operation apparatus and method according to the embodiments of the present invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a memory cell in the prior art. As shown in fig. 1, the Static Random-Access Memory (SRAM) unit with 6T structure includes: a first inverter and a second inverter. Wherein the first inverter includes a first pull-up transistor PU1 and a first pull-down transistor PD1, a drain of the first pull-up transistor PU1 is electrically connected to a source of the first pull-down transistor PD1, and as shown in fig. 1, a connection point of the drain of the first pull-up transistor PU1 and the source of the first pull-down transistor PD1 is the first storage node 11. The gate of first pull-up transistor PU1 and the gate of first pull-down transistor PD1 are electrically connected, and as shown in fig. 1, the connection point of the gate of first pull-up transistor PU1 and the gate of first pull-down transistor PD1 is the first read-write node 12. Wherein the second inverter includes a second pull-up transistor PU2 and a second pull-down transistor PD2, a drain of the second pull-up transistor PU2 is electrically connected to a source of the second pull-down transistor PD2, as shown in fig. 1, a connection point of the drain of the second pull-up transistor PU2 and the source of the second pull-down transistor PD2 is the second storage node 21. The gate of second pull-up transistor PU2 and the gate of second pull-down transistor PD2 are electrically connected, and as shown in FIG. 1, the junction of the gate of second pull-up transistor PU2 and the gate of second pull-down transistor PD2 is the second read-write node 22.
In addition, the source of the first pull-up transistor PU1 and the source of the second pull-up transistor PU2 are electrically connected to an operating voltage supply (VDD); the source of the first pull-down transistor PD1 and the drain of the second pull-down transistor PD2 are electrically connected to a common Voltage power supply (VSS), and the operating Voltage power supply VDD supplies a Voltage higher than the Voltage supplied by the common Voltage power supply VSS. Further, as shown in fig. 1, the first storage node 11 is electrically connected to the second read/write node 22, and the second storage node 21 is electrically connected to the first read/write node 12, which means that the first inverter and the second inverter are connected end to end.
In a memory cell with a 6T structure, the first pull-up transistor PU1 and the second pull-up transistor PU2 are typically P-type Metal-Oxide-Semiconductor Field-Effect Transistors (MOSFETs), and the first pull-down transistor PD1 and the second pull-down transistor PD2 are N-type MOSFETs.
The memory cell of the 6T structure shown in fig. 1 may further include pass-gate transistors, namely a first pass-gate transistor PG1 and a second pass-gate transistor PG2. The gates of PG1 and PG2 are electrically connected to a word line wl; the source of PG1 is electrically connected to the first bit line bl, and its drain is electrically connected to the first storage node 11; the source of PG2 is electrically connected to a second bit line (complementary bit line) bl_b, and its drain is electrically connected to the second storage node 21. As shown in fig. 1, the word line wl is orthogonal to the first bit line bl and the second bit line bl_b; the word line wl generally runs along a row, and the bit lines bl and bl_b generally run along a column.
In general, a memory cell with a 6T structure stores information "0" and information "1" as follows: when the first storage node 11 is at a low level and the second storage node 21 is at a high level, the information stored in the memory cell is "0"; when the first storage node 11 is at a high level and the second storage node 21 is at a low level, the information stored in the memory cell is "1". Specifically, to write information "1" into the memory cell, the word line wl is first selected and a high level is applied to it, so that the memory cell is enabled, that is, the first pass-gate transistor PG1 and the second pass-gate transistor PG2 are turned on; a high level is then applied to the first bit line bl and a low level to the second bit line bl_b, so that the first storage node 11 is at a high level and the second storage node 21 is at a low level. Information "1" is thus written into the memory cell. To write information "0", the word line wl is likewise selected and driven high so that PG1 and PG2 are turned on; a low level is applied to the first bit line bl and a high level to the second bit line bl_b, so that the first storage node 11 is at a low level and the second storage node 21 is at a high level. Information "0" is thus written into the memory cell.
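The write behavior described above can be captured in a short logic-level model. This is a sketch, not a circuit simulation: levels are represented as 0 (low) and 1 (high), the cell state is the pair (level of storage node 11, level of storage node 21), and the function names `write_6t` and `stored_bit` are hypothetical.

```python
def write_6t(wl: int, bl: int, bl_b: int, state: tuple) -> tuple:
    """Return the new (node11, node21) state after applying wl, bl and bl_b."""
    if wl != 1:
        return state  # word line low: pass-gate transistors off, cell holds its state
    if {bl, bl_b} != {0, 1}:
        raise ValueError("bl and bl_b must be complementary 0/1 levels for a write")
    # Word line high: node 11 follows bl and node 21 follows bl_b.
    return (bl, bl_b)


def stored_bit(state: tuple) -> int:
    """Decode the stored information from the storage-node levels."""
    if state == (1, 0):
        return 1
    if state == (0, 1):
        return 0
    raise ValueError("storage nodes must hold complementary levels")
```

Writing "1" corresponds to `write_6t(1, 1, 0, state)` and writing "0" to `write_6t(1, 0, 1, state)`.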
In a read operation of the memory cell, a high level is applied to the word line wl so that the first pass-gate transistor PG1 and the second pass-gate transistor PG2 are turned on, and a high level is simultaneously applied to the first bit line bl and the second bit line bl_b, so that the sources of PG1 and PG2 are both at a high level; the information stored in the memory cell can then be determined by measuring the potential difference between bl and bl_b. Specifically, when the memory cell stores "0", the first storage node 11 is at a low level and the second storage node 21 is at a high level. The drain of PG1 is therefore at a low level, a voltage difference is formed between the source and the drain of PG1, and a current flows through PG1. Correspondingly, the drain of PG2 is at a high level, no voltage difference is formed between the source and the drain of PG2, and the level of the second storage node 21 does not change. Because the second storage node 21 is connected to the first read/write node 12, the first read/write node 12 is at a high level and the first pull-down transistor PD1 is turned on, so current can flow through the first bit line bl, PG1 and PD1, and the level of the first bit line bl is lowered. If the resulting change in the voltage difference between bl and bl_b is detected to exceed a threshold, the information stored in the memory cell is determined to be "0".
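The sensing step described above can be sketched numerically. All voltages here are made-up illustrative values (a 1.0 precharge level, a 0.2 drop on the discharged bit line, a 0.1 sensing threshold), and the function name `read_6t` is hypothetical. The text only walks through the case of a stored "0"; treating a stored "1" as the mirror image, with bl_b discharging instead of bl, is an assumption of this sketch.

```python
def read_6t(state: tuple, precharge: float = 1.0, drop: float = 0.2,
            threshold: float = 0.1) -> int:
    """Return the stored bit inferred from the bit-line voltage difference.

    state is (node11, node21) with complementary 0/1 levels.
    """
    if state not in ((0, 1), (1, 0)):
        raise ValueError("storage nodes must hold complementary levels")
    node11, node21 = state
    # Both bit lines start high; the one facing a low storage node is pulled down.
    bl = precharge - (drop if node11 == 0 else 0.0)
    bl_b = precharge - (drop if node21 == 0 else 0.0)
    if abs(bl - bl_b) <= threshold:
        raise ValueError("bit-line difference is below the sensing threshold")
    # bl lower than bl_b means node 11 is low, i.e. the cell stores "0".
    return 0 if bl < bl_b else 1
```

With the default values, a cell in state (0, 1) lowers bl and is read as "0".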
Based on the internal structure of the prior-art 6T memory cell and its write and read operations, refer to fig. 2, which is a schematic diagram of an overall architecture of data storage and operation in the prior art. In this architecture only one write entry is provided, namely the single word line wl connected to each memory cell as shown in fig. 1, so that only one word line can be at a high potential in one clock cycle, and only one row of data of the matrix to be operated can be written into the corresponding location of the memory array shown in fig. 2 per cycle. The matrix to be operated may be a matrix stored in an external storage device, a matrix collected by a sampling device, a matrix set, and the like, which is not specifically limited in the embodiment of the present invention. For example, refer to fig. 3, which is a clock cycle diagram of writing a matrix into a memory in the prior art. As shown in fig. 3, if the matrix to be operated is a 4-row by 4-column matrix, four clock cycles are needed to write the whole matrix into the SRAM, which is time-consuming and inefficient. The clock cycle, also called the oscillation period, is defined as the reciprocal of the clock frequency and is the most basic and smallest unit of time in a computer. In one clock cycle, the Central Processing Unit (CPU) completes only one most basic action. The clock cycle represents the highest frequency at which a Synchronous Dynamic Random Access Memory (SDRAM) can operate, and a smaller clock cycle means a higher operating frequency.
In addition, as shown in fig. 2, since the SRAM and the ALU are independent of each other, before the operation is performed the matrix stored in the SRAM must be read out through the sense amplifier and transmitted to the ALU through the I/O interface (e.g., data_output (matrix) shown in fig. 2); the ALU receives an externally input vector and performs the vector-matrix operation through a series of steps such as a Multiply Accumulator (MAC) to obtain and output an operation result. Referring to fig. 4, which is a schematic diagram of clock cycles for performing a vector-matrix operation in the prior art, the whole operation process consumes 5 clock cycles in total, which is time-consuming and inefficient. Especially now that the amount of data to be processed is growing ever larger, this storage and operation mode greatly restricts data processing efficiency.
Based on this, the embodiments of the present invention provide a matrix arithmetic device, which can simultaneously write multiple rows of elements in a matrix to be operated into corresponding memory cells through multiple write ports, thereby improving write efficiency, and meanwhile, on the premise of realizing integration of storage and arithmetic, further improving arithmetic efficiency by cross arrangement and coupling of the memory cells and the arithmetic cells, thereby saving time and circuit resources.
Specifically, the matrix operation device comprises a memory for storing the matrix to be operated, wherein the memory comprises M rows by N columns of memory cells (bit cells), and each memory cell is connected with adjacent memory cells in the row direction and the column direction, where M and N are integers greater than 1. For example, refer to fig. 5, which is a schematic diagram of an arrangement and connection of memory cells according to an embodiment of the present invention. As shown in fig. 5, each of the M rows by N columns of memory cells is connected to adjacent memory cells in both the row direction and the column direction.
Specifically, referring to fig. 5, the matrix operation device may further include M rows by K word lines, which are connected to the M rows by N columns of memory cells along the row direction, so that each memory cell in each row is connected to K word lines, where K is an integer greater than 1. The embodiment of the present invention places no particular upper limit on K, but K is generally an integer less than or equal to M. As shown in fig. 5, wl[0][K-1:0] may represent the K word lines connected to each memory cell of the first row, namely word lines wl[0][0], wl[0][1], wl[0][2], ..., wl[0][K-1]. Likewise, wl[M-1][K-1:0] may represent the K word lines connected to each memory cell in the Mth row, namely word lines wl[M-1][0], wl[M-1][1], wl[M-1][2], ..., wl[M-1][K-1], and so on, which are not described again here. The index range [K-1:0] represents the K write ports provided by the embodiment of the present invention. For example, word line wl[0][0] represents the first write port connected to each memory cell in the first row, wl[0][1] the second write port connected to each memory cell in the first row, wl[1][0] the first write port connected to each memory cell in the second row, wl[1][1] the second write port connected to each memory cell in the second row, and so on. As shown in fig. 5, each memory cell in each row therefore corresponds to K write ports, and a write operation can be performed on a memory cell through any one of them.
By contrast, suppose that 4 write ports are provided but only one word line is connected to each memory cell in each row, so that each row corresponds to only one of the write ports: the first row of memory cells corresponds to the first write port, the second row to the second write port, the third row to the third write port, the fourth row to the fourth write port, the fifth row to the first write port again, and so on. Such an arrangement has two drawbacks. First, if the first write port fails, no data can be written to the memory cells in the first row, the fifth row, and so on. The M rows by N columns of memory cells could originally hold and operate on a matrix of up to M rows by N columns, but now only a matrix of (3/4)M rows by N columns can be written and operated on, which greatly affects write efficiency and operation efficiency and defeats the purpose of the embodiment of the present invention. Second, suppose the second, third, and fourth rows of memory cells need not be written (for example, because the second, third, and fourth rows of the matrix to be operated on are all-zero rows, whose writing can be skipped to improve write efficiency) and the first and fifth rows of memory cells are to be written. Because the first and fifth rows both correspond to the first write port, and a write port carries only one row of elements of the matrix in one clock cycle, the two rows of elements cannot be written into the first and fifth rows of memory cells simultaneously, and the write needs at least two clock cycles. That is, in the current clock cycle one row of elements is written into the first row of memory cells through the first write port, and in the next clock cycle the other row of elements is written into the fifth row of memory cells through the first write port, which greatly reduces write efficiency.
With the multiple write ports per row provided by the embodiment of the present invention, consider further the case where data related to a long-latency instruction (for example, an instruction that updates data and performs operations repeatedly over a period of more than ten clock cycles) has been written through the first write port but occupies only the third, fourth, and fifth rows of memory cells. Data can then be selectively written into the first and second rows of memory cells through the second, third, and fourth write ports to execute other instructions in parallel, achieving parallel computation and saving energy. It can be understood that while the matrix operation device is executing a computing task, all the operation units in the device are in a conducting state. When only part of the memory cells are occupied by an existing instruction, writing data through other write ports and executing other instructions does not overwrite the in-flight data of the earlier instruction and makes full use of the idle operation units.
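The comparison above between one fixed write port per row and K write ports per row can be sketched as a small cycle count. This is only an illustrative model: the function names are hypothetical, rows are numbered from 0, and the ceiling formula for the multi-port case assumes that the only limit is one row per port per clock cycle.

```python
from collections import Counter
from math import ceil

def cycles_fixed_mapping(rows_to_write, num_ports=4):
    """Cycles needed when row r can only be written through port r % num_ports.

    Rows that share a port must be written in different clock cycles,
    because a port carries only one row of data per cycle.
    """
    if not rows_to_write:
        return 0
    rows_per_port = Counter(r % num_ports for r in rows_to_write)
    return max(rows_per_port.values())

def cycles_any_port(rows_to_write, num_ports=4):
    """Cycles needed when every row is wired to all ports (assumed model)."""
    return ceil(len(rows_to_write) / num_ports)

# First and fifth rows (indices 0 and 4); the rows in between are skipped.
rows = [0, 4]
print(cycles_fixed_mapping(rows))  # 2
print(cycles_any_port(rows))       # 1
```

With the fixed mapping both rows compete for the first port and need two cycles, whereas with K word lines per row they can use two different ports in the same cycle.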
As described above, each of the K word lines connected in the row direction to each memory cell in a row may be used to enable the memory cells connected to it, so as to write a corresponding row of elements of the matrix to be operated on into those memory cells. Any K1 word lines among the M rows by K word lines may be used to enable the corresponding K2 rows by N columns of memory cells, so as to write at least one row of elements of the matrix to be operated on into the corresponding positions in the K2 rows by N columns of memory cells. Here K1 is an integer greater than 1 and less than or equal to K, K2 is an integer greater than 1 and less than or equal to K1, and K1 is usually equal to K2. Each memory cell in each row can thus be enabled by any word line connected to it, and the data presented on the write port corresponding to that word line is written into the memory cell. As shown in fig. 5, for example, word line wl[0][0] may enable each memory cell of the first row connected to it, so that the data on the first write port (e.g., the first row of elements of the matrix to be operated on) is written into the first row of memory cells. Likewise, multiple word lines (e.g., wl[0][0] and wl[1][1]) may enable multiple rows of memory cells (e.g., the first and second rows) at the same time, so that multiple rows of data (e.g., the first and second rows of elements of the matrix to be operated on) are written simultaneously into the corresponding positions in those rows, which greatly improves the efficiency of writing data into the memory cells.
In a possible implementation, refer to fig. 6, which is a schematic diagram of a process of writing a matrix into memory cells according to an embodiment of the present invention. The matrix to be operated on may be a matrix stored in an external storage device or a matrix collected by a sampling device, and may be a single matrix or a set of matrices, which is not specifically limited in this embodiment of the present invention. As shown in fig. 6, the M rows by N columns of memory cells may be embodied as 4 rows by 4 columns of memory cells, and K may be equal to 4, i.e., each memory cell in each row is connected to 4 word lines, which correspond to four different write ports. For example, each memory cell in the first row is connected to word lines wl[0][0], wl[0][1], wl[0][2], and wl[0][3]; each memory cell in the second row is connected to word lines wl[1][0], wl[1][1], wl[1][2], and wl[1][3]; and so on. As shown in fig. 6, to write the matrix to be operated on of fig. 6 into the memory cells, four word lines may first be selected and driven to a high potential at the same time (i.e., K1 equals 4), for example word lines wl[0][0] (corresponding to the first write port), wl[1][1] (corresponding to the second write port), wl[2][2] (corresponding to the third write port), and wl[3][3] (corresponding to the fourth write port). The 4 rows by 4 columns of memory cells connected to these four word lines are thereby enabled (i.e., K2 equals 4), and the data on the four write ports is written into the corresponding memory cells (e.g., as shown in fig. 6, the first row of elements of the matrix, on the first write port, is written into the first row of memory cells, the second row of elements, on the second write port, is written into the second row of memory cells, and so on). That is, the 4 rows by 4 columns of elements of the matrix to be operated on can be written into the corresponding 4 rows by 4 columns of memory cells in one clock cycle, which greatly improves the efficiency of writing data into the memory cells. It should be noted that, normally, the data on different write ports differ within one clock cycle (for example, the first write port, corresponding to word line wl[0][0], may carry the first row of elements [1 0 1 0] of the matrix to be operated on, and the second write port, corresponding to word line wl[1][1], may carry the second row of elements [1 1 1 0]). Therefore only one of the word lines connected to each row of memory cells needs to be selected, and the write ports corresponding to the word lines selected for different rows should differ. For example, selecting word lines wl[0][0] and wl[0][1], both connected to the first row of memory cells, at the same time would amount to writing the first and second rows of elements of the matrix (on the first and second write ports, respectively) into the first row of memory cells simultaneously, which is meaningless. Likewise, selecting word line wl[0][0], connected to the first row of memory cells, and word line wl[1][0], connected to the second row of memory cells, at the same time would amount to writing the first row of elements on the first write port into both the first and second rows of memory cells, which is also meaningless.
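The selection rule described above (one word line per row, and a different write port for each selected row) can be illustrated with a minimal behavioural sketch. The function `write_rows` and its arguments are hypothetical names introduced only for this illustration; the first two data rows follow the text, while the third and fourth rows are made-up values because the text does not list them.

```python
def write_rows(memory, port_data, selected):
    """Write several rows in one modelled clock cycle.

    memory    : list of rows (lists of bits), modified in place
    port_data : port_data[p] is the row of bits presented on write port p
    selected  : list of (row, port) pairs, i.e. word lines wl[row][port] driven high
    """
    rows = [r for r, _ in selected]
    ports = [p for _, p in selected]
    # One word line per row: two ports driving the same row would conflict.
    if len(set(rows)) != len(rows):
        raise ValueError("a row is enabled through more than one port")
    # One row per port: otherwise the same data would land in several rows.
    if len(set(ports)) != len(ports):
        raise ValueError("a port is used for more than one row")
    for r, p in selected:
        memory[r] = list(port_data[p])
    return memory

# Rows 1 and 2 follow the text; rows 3 and 4 are illustrative values only.
port_data = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
memory = [[0] * 4 for _ in range(4)]
# wl[0][0], wl[1][1], wl[2][2] and wl[3][3] are high in the same cycle.
write_rows(memory, port_data, [(0, 0), (1, 1), (2, 2), (3, 3)])
print(memory)
```

Selecting `[(0, 0), (0, 1)]` or `[(0, 0), (1, 0)]` raises an error in this sketch, mirroring the two meaningless selections mentioned in the text.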
Optionally, referring to fig. 5, the matrix operation device may further include bit lines, where each group of bit lines includes a first bit line and a second bit line complementary to the first bit line. Within the M rows by N columns of memory cells, each memory cell in each column is connected to S groups of bit lines in the column direction, thereby forming N columns by S groups of bit lines. When the S2 rows by N columns of memory cells corresponding to N columns by S1 groups of bit lines among the N columns by S groups of bit lines are enabled, the N columns by S1 groups of bit lines are used to write at least one row of elements of the matrix to be operated on into the corresponding positions in the S2 rows by N columns of memory cells. Here S is an integer greater than 1, S1 is an integer greater than 1 and less than or equal to S, and S2 is an integer greater than 1 and less than or equal to S1. The embodiment of the present invention does not specifically limit the values of S and S1, but usually S is equal to K, S1 is equal to K1, S2 is equal to K2, and S1 is equal to S2. As shown in fig. 5, bl[0][S-1:0] may represent the S first bit lines connected to each memory cell of the first column, namely first bit lines bl[0][0], bl[0][1], bl[0][2], ..., bl[0][S-1]. Similarly, bl_b[0][S-1:0] may represent the S second bit lines connected to each memory cell of the first column, namely second bit lines bl_b[0][0], bl_b[0][1], bl_b[0][2], ..., bl_b[0][S-1], and bl[1][S-1:0] may represent the S first bit lines connected to each memory cell of the second column, namely first bit lines bl[1][0], bl[1][1], bl[1][2], ..., bl[1][S-1], and so on, which are not described again here.
Referring also to fig. 6, in one possible embodiment S, S1, and S2 may all be equal to 4. Take as an example writing the first row of elements [1 0 1 0] of the matrix to be operated on shown in fig. 6 into the first row of the 4 rows by 4 columns of memory cells. Specifically, when each memory cell in the first row is enabled, applying a high level to the first bit line bl[0][0] and a low level to the second bit line bl_b[0][0] writes the element "1" of the first row and first column into the memory cell of the first row and first column. At the same time, applying a low level to the first bit line bl[1][0] and a high level to the second bit line bl_b[1][0] writes the element "0" of the first row and second column into the memory cell of the first row and second column; applying a high level to the first bit line bl[2][0] and a low level to the second bit line bl_b[2][0] writes the element "1" of the first row and third column into the memory cell of the first row and third column; and applying a low level to the first bit line bl[3][0] and a high level to the second bit line bl_b[3][0] writes the element "0" of the first row and fourth column into the memory cell of the first row and fourth column. The remaining second, third, and fourth rows of elements are written in the same way, which is not described again here.
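The bit-line levels in this example follow a simple rule: for each column, the first bit line carries the bit and the second bit line carries its complement. A minimal sketch, in which the helper name `bit_line_levels` is hypothetical and 1 and 0 stand for high and low levels:

```python
def bit_line_levels(row_bits, port=0):
    """Return {line_name: level} for writing row_bits through one write port.

    Level 1 is high and level 0 is low. For column c, bl[c][port] carries
    the bit itself and bl_b[c][port] carries its complement.
    """
    levels = {}
    for c, bit in enumerate(row_bits):
        levels[f"bl[{c}][{port}]"] = bit
        levels[f"bl_b[{c}][{port}]"] = 1 - bit
    return levels

# First row of elements from the example, written through the first port.
levels = bit_line_levels([1, 0, 1, 0])
for name, level in levels.items():
    print(name, "high" if level else "low")
```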
Optionally, referring to fig. 5, the matrix operation device may further include N clear lines, which are connected to the M rows by N columns of memory cells along the column direction, with every memory cell in a column connected to the same clear line. Specifically, these may be the clear lines clr[0], clr[1], clr[2], ..., clr[N-1] shown in fig. 5. When a clear operation is executed, each clear line transmits a high level to every memory cell connected to it, so that those memory cells are cleared. For example, when the clear lines clr[0], clr[1], and clr[2] shown in fig. 5 are selected, high levels are transmitted through them to the first, second, and third columns of memory cells, respectively, clearing those columns. Optionally, clearing has higher priority than writing; that is, within one clock cycle, a memory cell that is being cleared through its clear line cannot be written with "1" at the same time. For example, when a sparse matrix whose first three columns are all zero elements is written into the corresponding memory cells (a sparse matrix is a matrix in which zero elements far outnumber non-zero elements and are irregularly distributed), the controller may select the clear lines corresponding to the three all-zero columns (e.g., clr[0], clr[1], and clr[2]) and drive them to a high potential. The three columns of memory cells connected to those clear lines (e.g., the first, second, and third columns) are then cleared without having to be enabled by word lines and written with zeros through bit lines. The clear lines can therefore greatly improve the efficiency of writing a sparse matrix into the memory cells. In addition, because the clear lines and the data input and output ports are on the same side, the data flows in a consistent direction and more smoothly, which facilitates layout and wiring.
In a possible implementation, refer to fig. 7, which is a schematic diagram of another process of writing a matrix into memory cells according to an embodiment of the present invention. As shown in fig. 7, the second and third rows of the matrix to be operated on are all-zero rows. When the write operation is performed, all of the 4 rows by 4 columns of memory cells may therefore first be cleared through the clear lines connected to each column of memory cells (e.g., clear lines clr[0], clr[1], clr[2], and clr[3] shown in fig. 7). Next, word lines wl[0][0] and wl[3][1] may be selected to write the first and fourth rows of elements of the matrix to be operated on into the first and fourth rows of memory cells, respectively. The second and third rows of memory cells need not be written at all, which greatly improves the efficiency of writing a sparse matrix into the memory cells.
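The two-step procedure (clear every column, then write only the non-zero rows) can be sketched as follows. The function `write_sparse`, the rule of giving each non-zero row the next free port, and the element values of the first and fourth rows are assumptions made for illustration; only the all-zero second and third rows come from the description of fig. 7.

```python
def write_sparse(matrix, num_ports=4):
    """Clear every column, then write only the non-zero rows.

    Returns (memory, selected), where selected lists the (row, port) pairs,
    i.e. the word lines wl[row][port] that are driven high.
    """
    rows, cols = len(matrix), len(matrix[0])
    # Step 1: all clear lines clr[0..cols-1] go high, so every cell becomes 0.
    memory = [[0] * cols for _ in range(rows)]
    # Step 2: give each non-zero row the next free write port (assumed policy).
    nonzero_rows = [r for r, row in enumerate(matrix) if any(row)]
    if len(nonzero_rows) > num_ports:
        raise ValueError("more non-zero rows than write ports in one cycle")
    selected = [(r, port) for port, r in enumerate(nonzero_rows)]
    for r, _ in selected:
        memory[r] = list(matrix[r])
    return memory, selected

# Rows 2 and 3 are all zero as in fig. 7; the values in rows 1 and 4 are made up.
matrix = [
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
memory, selected = write_sparse(matrix)
print(selected)  # [(0, 0), (3, 1)], i.e. wl[0][0] and wl[3][1]
```

Only two word lines are driven for this matrix, which matches the selection of wl[0][0] and wl[3][1] in the text.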
Based on the description of the above embodiment, each of the M rows by N columns of memory cells may specifically include a first inverter and a second inverter, which are connected end to end to form a storage space. The storage space is used for storing one item of data, namely a corresponding element (for example, "1" or "0") of the matrix to be operated on. Each memory cell may further include Q1 first pass transistors and Q2 second pass transistors. Specifically, the drains of the Q1 first pass transistors are all connected to the output end of the same first inverter, their gates are respectively connected to corresponding word lines among the K word lines, and their sources are respectively connected to corresponding first bit lines among the S groups of bit lines. Similarly, the drains of the Q2 second pass transistors are all connected to the output end of the same second inverter, their gates are respectively connected to corresponding word lines among the K word lines, and their sources are respectively connected to corresponding second bit lines among the S groups of bit lines. When any one of the K word lines is at a high potential, the first pass transistor and the second pass transistor connected to that word line conduct, so that the corresponding element of the matrix to be operated on is written into the memory cell. Here Q1 and Q2 are integers greater than 1, and typically Q1 and Q2 are both equal to K. Optionally, each memory cell may further include a first clear transistor and a second clear transistor. The drain of the first clear transistor is connected to the output end of the first inverter and its gate is connected to the clear line; the drain of the second clear transistor is connected to the output end of the second inverter and its gate is connected to the clear line. The first clear transistor and the second clear transistor may be used to clear the memory cell when the clear line is at a high potential.
In a possible implementation, refer to fig. 8, which is a schematic structural diagram of a memory cell according to an embodiment of the present invention. As shown in fig. 8, the memory cell includes a first inverter, which includes a first pull-up transistor PU1' and a first pull-down transistor PD1', and a second inverter, which includes a second pull-up transistor PU2' and a second pull-down transistor PD2'. The sources of the first pull-up transistor PU1' and the second pull-up transistor PU2' are connected to the operating voltage power supply VDD'. Their gates are connected to the gates of the first pull-down transistor PD1' and the second pull-down transistor PD2', respectively, and the connection points are the first read/write node 12' and the second read/write node 22' shown in fig. 8. The drains of the first pull-up transistor PU1' and the second pull-up transistor PU2' are connected to the drains of the first pull-down transistor PD1' and the second pull-down transistor PD2', respectively, and the connection points are the first storage node 11' and the second storage node 21' shown in fig. 8. The first storage node 11' may be the output terminal of the first inverter, and the second storage node 21' may be the output terminal of the second inverter. The sources of the first pull-down transistor PD1' and the second pull-down transistor PD2' are connected to the common voltage power supply VSS'. As shown in fig. 8, the first storage node 11' is connected to the second read/write node 22', and the second storage node 21' is connected to the first read/write node 12', i.e., the first inverter and the second inverter are connected end to end and can be used to store one bit of binary data. Accordingly, when the first storage node 11' is at a high level and the second storage node 21' is at a low level, the data stored in the memory cell is "1"; when the first storage node 11' is at a low level and the second storage node 21' is at a high level, the data stored in the memory cell is "0".
Optionally, the memory cell further includes first pass transistors PG1[0], PG1[1], PG1[2], and PG1[3] (i.e., Q1 equals 4) and second pass transistors PG2[0], PG2[1], PG2[2], and PG2[3] (i.e., Q2 equals 4). As shown in fig. 8, four word lines are connected to the memory cell (i.e., K equals 4), namely word lines wl[0], wl[1], wl[2], and wl[3]. The memory cell is further connected to four groups of bit lines (i.e., S equals 4), namely four first bit lines bl[0], bl[1], bl[2], and bl[3] and four second bit lines bl_b[0], bl_b[1], bl_b[2], and bl_b[3]. The gates of first pass transistor PG1[0] and second pass transistor PG2[0] are connected to word line wl[0]; the gates of first pass transistor PG1[1] and second pass transistor PG2[1] are connected to word line wl[1]; the gates of first pass transistor PG1[2] and second pass transistor PG2[2] are connected to word line wl[2]; and the gates of first pass transistor PG1[3] and second pass transistor PG2[3] are connected to word line wl[3]. The sources of the first pass transistors PG1[0], PG1[1], PG1[2], and PG1[3] are connected to first bit lines bl[0], bl[1], bl[2], and bl[3], respectively, and their drains are connected to the first storage node 11' (i.e., the output terminal of the first inverter). The sources of the second pass transistors PG2[0], PG2[1], PG2[2], and PG2[3] are connected to second bit lines bl_b[0], bl_b[1], bl_b[2], and bl_b[3], respectively, and their drains are connected to the second storage node 21' (i.e., the output terminal of the second inverter). The write operation under this structure is described below, taking the writing of "1" into the memory cell as an example. First, any one of the word lines connected to the memory cell, for example word line wl[2], is selected and driven to a high potential. The first pass transistor PG1[2] and the second pass transistor PG2[2] connected to word line wl[2] are then turned on, that is, the memory cell is enabled. At the same time, a high level is applied to the first bit line bl[2] connected to the first pass transistor PG1[2], and a low level is applied to the second bit line bl_b[2] connected to the second pass transistor PG2[2], which completes the writing of "1" into the memory cell.
Referring to fig. 8, the memory cell may further include a first clear transistor and a second clear transistor, whose gates are both connected to the clear line clr. The source of the first clear transistor is connected to the common voltage source VSS' and its drain is connected to the first storage node 11' (i.e., the output terminal of the first inverter); the source of the second clear transistor is connected to the operating voltage source VDD' and its drain is connected to the second storage node 21' (i.e., the output terminal of the second inverter). When a clear operation is executed, the clear line clr is selected and driven to a high potential, the gates of the first and second clear transistors connected to the clear line receive the high level, and the memory cell is cleared through the two transistors; that is, the first storage node 11' is pulled to a low level and the second storage node 21' to a high level, so that the stored data becomes "0".
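The write and clear behaviour of the cell of fig. 8 can be summarised in a purely logical model that tracks the two storage nodes and ignores all electrical detail. The class `BitCell` and its method names are hypothetical, and the initial state is an arbitrary choice because the text does not specify a power-up value. The priority of clearing over writing follows the optional rule stated earlier.

```python
class BitCell:
    """Logical model of one memory cell with several write ports and a clear line."""

    def __init__(self, num_ports=4):
        self.num_ports = num_ports
        self.node1 = 0  # first storage node 11' (initial value chosen arbitrarily)
        self.node2 = 1  # second storage node 21', always the complement of node1

    def cycle(self, wl, bl, bl_b, clr=0):
        """Apply one clock cycle; wl, bl and bl_b are lists of 0/1 levels per port."""
        if clr:
            # Clearing has priority over writing: node 11' low, node 21' high.
            self.node1, self.node2 = 0, 1
            return
        active = [p for p in range(self.num_ports) if wl[p]]
        if len(active) > 1:
            raise ValueError("only one word line of a cell may be high")
        if active:
            p = active[0]
            if bl[p] == bl_b[p]:
                raise ValueError("bit lines of a group must be complementary")
            # The pass transistors of port p conduct and set both storage nodes.
            self.node1, self.node2 = bl[p], bl_b[p]

    @property
    def value(self):
        # The stored bit is read from the first storage node.
        return self.node1

cell = BitCell()
# Write "1" through word line wl[2]: bl[2] high, bl_b[2] low.
cell.cycle(wl=[0, 0, 1, 0], bl=[0, 0, 1, 0], bl_b=[1, 1, 0, 1])
print(cell.value)  # 1
# Drive the clear line high: the cell returns to "0".
cell.cycle(wl=[0, 0, 0, 0], bl=[0, 0, 0, 0], bl_b=[1, 1, 1, 1], clr=1)
print(cell.value)  # 0
```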
It should be noted that the embodiments of the present invention do not specifically limit the types of the transistors. In general, however, the first pass transistors PG1[0], PG1[1], PG1[2], and PG1[3], the second pass transistors PG2[0], PG2[1], PG2[2], and PG2[3], the first pull-down transistor PD1' and the second pull-down transistor PD2', and the first clear transistor and the second clear transistor are N-type MOSFETs, while the first pull-up transistor PU1' and the second pull-up transistor PU2' are P-type MOSFETs.
Optionally, the matrix operation device may further include an operation circuit connected to the memory. The operation circuit is configured to receive an externally input vector to be operated on and to perform a vector-matrix operation on that vector and the matrix to be operated on stored in the M rows by N columns of memory cells, so as to obtain an operation result. The operation circuit may include X rows by Y columns of operation units coupled to the M rows by N columns of memory cells, where X is an integer greater than 1 and Y is an integer greater than 1. Refer to fig. 9, which is a schematic diagram of a matrix operation device according to an embodiment of the invention. As shown in fig. 9, the coupling may take the form of alternating rows, with one row of memory cells followed by one row of operation units, repeated, where each memory cell is connected to the operation unit below it. Alternatively, the coupling may take the form of repeating groups of four rows, each group consisting of one row of memory cells, two rows of operation units, and one row of memory cells, where each memory cell is connected to the operation unit adjacent to it in the column direction. Other coupling manners may also be used, which is not specifically limited in this embodiment. It should be noted that the embodiment of the present invention is intended to subdivide the modules that implement the storage and operation functions into minimum memory cells and minimum operation units, and to connect each memory cell to its corresponding operation unit over a short distance by the above coupling manners. Therefore, compared with the prior-art approach of placing a whole memory array next to a whole operation array and connecting them, the coupling adopted by the embodiment of the present invention makes the connection between memory cells and operation units tighter, the data movement distance shorter, and the circuit area smaller, thereby saving time and circuit resources.
Referring to fig. 10, fig. 10 is a schematic diagram of the mathematical operation of a vector-matrix product according to an embodiment of the present invention. As shown in fig. 10, the vector to be operated on, A, may be a row vector including Z elements (i.e., Z columns), and the matrix to be operated on, B, may be a matrix of Z rows by T columns, where Z is an integer greater than 1 and T is an integer greater than 1. As shown in fig. 10, multiplying the vector A by the matrix B yields a result vector C including T elements (i.e., T columns). According to the rule of vector-matrix multiplication, taking the element C11 of the result vector C as an example, C11 is obtained by multiplying each element of the vector A by the corresponding element in the first column of the matrix B and adding the products, i.e., $C_{11} = \sum_{i=1}^{Z} A_{1i} B_{i1}$.
Based on the above rule of vector-matrix multiplication, each of the X rows by Y columns of operation units may optionally include an input port. The operation unit may be connected through the input port to the corresponding memory cell, specifically to a storage node of that memory cell, in order to obtain the element stored there. (Because the first and second storage nodes hold opposite information, the first storage node, which holds the non-inverted information, is generally used; this is not specifically limited in this embodiment of the present invention.) For example, the input port of the first-row, first-column operation unit is connected to the first storage node 11' of the first-row, first-column memory cell. When "1" is written into that memory cell, the operation unit obtains the "1" at the same time, which saves the time consumed by a conventional data read and improves operation efficiency. The operation unit may also be used to obtain the element to be operated on from the externally input vector to be operated on, and to multiply the element stored in the corresponding memory cell by that element to obtain a multiplication result. The vector to be operated on may be input from an external device connected to the matrix operation device, for example a computer.
Optionally, the operation circuit may further include an addition circuit connected to the X rows by Y columns of operation units. The addition circuit may be configured to add, column by column, the multiplication results obtained by the operation units in each column of the X rows by Y columns of operation units, so as to obtain and output the operation result of the vector to be operated on and the matrix to be operated on. The addition circuit may include a plurality of adders with this addition function. The adders may be two-input adders, four-input adders, a combination of various types of adders, and the like, which is not particularly limited in the embodiment of the present invention.
In a possible implementation manner, please refer to fig. 11, where fig. 11 is a circuit connection diagram of a vector-matrix operation according to an embodiment of the present invention. As shown in fig. 11, the circuit includes 4 rows by 4 columns of memory cells, 4 rows by 4 columns of arithmetic units, and four columns of adder sub-circuit groups, where each column of adder sub-circuit group includes 3 two-input adders. Each operation unit is connected to the first storage node 11' of the corresponding storage unit, and can directly acquire the element of the matrix to be operated stored in that storage unit (for example, the first row and first column operation unit is connected to the first storage node 11' of the first row and first column storage unit, and can directly acquire the first row and first column element B11 of the matrix B to be operated stored in that storage unit). In addition, as shown in fig. 11, each operation unit in the first row can receive, through the vector input a[0], the first element A11 of an externally input vector A to be operated; each operation unit in the second row can receive, through the vector input a[1], the second element A12 of the vector A; each operation unit in the third row can receive, through the vector input a[2], the third element A13 of the vector A; and each operation unit in the fourth row can receive, through the vector input a[3], the fourth element A14 of the vector A. After each operation unit receives the corresponding element of the vector to be operated, it multiplies that element by the element acquired from the corresponding storage unit to obtain a multiplication result. Next, the four multiplication results obtained in each column may be added step by step by the 3 two-input adders in that column, as shown in fig. 11, to obtain an addition result, i.e., one element of the result vector. Finally, the first element C11, the second element C12, the third element C13 and the fourth element C14 of the result vector C can be output through the outputs z[0], z[1], z[2] and z[3] shown in fig. 11, respectively, i.e., the complete result vector is output. During the operation, data moves step by step from the storage units to the operation result output ports as the operation proceeds. The matrix operation device provided by the embodiment of the invention can thus receive a vector to be operated as input and output a result vector after the vector-matrix operation.
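The data path described for fig. 11 can be sketched in software as follows. This is a minimal illustrative model, not the patented circuit: the function names (`multiply_stage`, `adder_tree`, `vector_matrix`) and the sample values are hypothetical, and the pairwise (tree) ordering of the two-input adders is an assumption, since the text only states that the products are added step by step by 3 two-input adders.

```python
# Hypothetical software model of the fig. 11 data path; all names are illustrative.

def multiply_stage(a, B):
    """Operation unit (i, j) multiplies vector element a[i] by stored element B[i][j]."""
    return [[a[i] * B[i][j] for j in range(len(B[0]))] for i in range(len(B))]

def adder_tree(values):
    """Reduce one column using two-input adders only.

    Returns (sum, number_of_two_input_adders_used).
    The pairwise ordering is an assumption; a chain would use the same adder count.
    """
    adders = 0
    level = list(values)
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level) - 1, 2):
            nxt.append(level[k] + level[k + 1])  # one two-input adder
            adders += 1
        if len(level) % 2:
            nxt.append(level[-1])  # an odd leftover value passes to the next level
        level = nxt
    return level[0], adders

def vector_matrix(a, B):
    """Multiply in every operation unit, then sum each column with its adder group."""
    products = multiply_stage(a, B)
    z, adders_per_column = [], []
    for j in range(len(B[0])):
        column = [products[i][j] for i in range(len(B))]
        total, used = adder_tree(column)
        z.append(total)
        adders_per_column.append(used)
    return z, adders_per_column

a = [1, 2, 3, 4]                      # vector inputs a[0]..a[3]
B = [[1, 0, 2, 1],
     [0, 1, 0, 2],
     [3, 0, 1, 0],
     [0, 2, 0, 1]]                    # contents of the 4 x 4 storage units
z, adders = vector_matrix(a, B)
print(z, adders)
```

With four products per column, the reduction uses three two-input adders per column, which matches the adder count stated for fig. 11.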
In a possible implementation manner, please refer to fig. 12, where fig. 12 is a schematic diagram of an overall architecture of a computer system according to an embodiment of the present invention. As shown in fig. 12, the architecture includes four write ports and, correspondingly, four x-decoders and four y-decoders. The four x-decoders and the four y-decoders can each convert an input binary code into corresponding logic levels, thereby controlling the high and low levels of the four write ports respectively; when the four write ports are all at a high level, four rows of data can be written into the corresponding four rows of storage units simultaneously. Meanwhile, the storage units are directly connected with the operation circuit to realize in-memory computation, which is not described herein again.
In a possible implementation manner, please refer to fig. 13, where fig. 13 is a schematic diagram of a clock cycle for writing a matrix into a memory according to an embodiment of the present invention. As shown in fig. 13, based on the word line driver wl_driver and the bit line driver bl_driver shown in fig. 12, four rows of memory cells can be enabled simultaneously through the 4 write ports in one clock cycle, and the four corresponding rows of elements of the matrix to be operated are written into the four rows of memory cells, which takes little time and is efficient. In addition, referring to fig. 14, fig. 14 is a schematic diagram of a clock cycle for performing a vector-matrix operation according to an embodiment of the present invention. As shown in fig. 14, a vector to be operated may be input to the matrix operation device in one clock cycle, and the operation result of the vector to be operated and the matrix to be operated can be output in the next clock cycle. The matrix to be operated does not need to be read out of the storage units and then transmitted to the operation units for operation, so that the operation efficiency is greatly improved.
Based on the description of the above embodiment of the matrix operation apparatus, the embodiment of the present invention further discloses a matrix operation method, which may include the following steps S11-S13:
Step S11, determining K1 word lines among the word lines according to the matrix to be operated.
Specifically, in the first clock cycle, K1 word lines among the word lines are determined according to the matrix to be operated. The word lines are connected with M rows × N columns of memory cells in the memory along the row direction, wherein each memory cell in each row is connected with K word lines, and the first clock cycle may be one clock cycle or multiple clock cycles.
Step S12, enabling the corresponding K2 rows by N columns of memory cells through the K1 word lines, so as to write at least one corresponding row of elements of the matrix to be operated into the corresponding positions of the K2 rows by N columns of memory cells.
Specifically, in the first clock cycle, the corresponding K2 rows by N columns of memory cells are enabled through the K1 word lines, so as to write at least one corresponding row of elements of the matrix to be operated into the corresponding positions of the K2 rows by N columns of memory cells. Within the M rows by N columns of memory cells, each memory cell is connected with adjacent memory cells in the row direction and the column direction.
Optionally, in the first clock cycle, among the N columns by S groups of bit lines connected with the memory cells along the column direction, the N columns by S1 groups of bit lines connected with the K1 rows by N columns of memory cells are determined, and at least one corresponding row of elements of the matrix to be operated is written into the corresponding positions of the K1 rows by N columns of memory cells through the N columns by S1 groups of bit lines. Each group of the N columns by S groups of bit lines includes a first bit line and a second bit line complementary to the first bit line. Specifically, reference may be made to the description of the embodiment corresponding to fig. 6, which is not repeated herein.
For example, in the case that the matrix to be operated is a matrix to be operated with 8 rows by 4 columns, the memory includes 8 rows by 4 columns of memory cells, and K is equal to 4, the first four rows of memory cells and the last four rows of memory cells can be enabled by respectively selecting the corresponding 4 word lines to be at a high potential in two clock cycles, so that the first four rows of elements and the last four rows of elements of the matrix to be operated can be written into the corresponding first four rows of memory cells and the last four rows of memory cells in two clock cycles. That is, the 8 rows by 4 columns to be operated matrix can be completely written into the 8 rows by 4 columns of memory cells in two clock cycles, where the two clock cycles can be two consecutive clock cycles.
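The two-cycle write in this example can be illustrated with a small scheduling sketch. It is a simplified model under stated assumptions: the helper `write_matrix` and its `rows_per_cycle` parameter are hypothetical, and the number of rows enabled per clock cycle is taken to equal K = 4 as in the example above; word lines, bit lines and timing are not modeled.

```python
# Hypothetical write-schedule model; names and parameters are illustrative only.

def write_matrix(matrix, rows_per_cycle):
    """Write `matrix` into a same-shaped cell array.

    Up to `rows_per_cycle` rows are enabled per clock cycle (assumed equal to K
    in the example). Returns (cells, clock_cycles_used).
    """
    m, n = len(matrix), len(matrix[0])
    cells = [[None] * n for _ in range(m)]
    cycles = 0
    for start in range(0, m, rows_per_cycle):
        enabled_rows = range(start, min(start + rows_per_cycle, m))
        for r in enabled_rows:
            # In hardware these rows would be written concurrently in one cycle.
            cells[r] = list(matrix[r])
        cycles += 1
    return cells, cycles

# 8 rows by 4 columns matrix to be operated, as in the example.
matrix = [[r * 4 + c for c in range(4)] for r in range(8)]

cells, cycles = write_matrix(matrix, rows_per_cycle=4)    # multi-row write
_, row_by_row = write_matrix(matrix, rows_per_cycle=1)    # conventional row-by-row write
print(cycles, row_by_row)
```

Under these assumptions the multi-row write completes in two clock cycles, whereas writing the same matrix one row at a time would take eight.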
Optionally, when the matrix to be operated includes P columns of all-zero elements, the P corresponding zero clearing lines are determined from the N zero clearing lines in total; the N zero clearing lines are connected with the M rows × N columns of storage units along the column direction, and each storage unit in each column is connected with the same zero clearing line. A high level is transmitted through the P zero clearing lines to the P columns of storage units connected with those zero clearing lines, so as to clear the P columns of storage units, wherein P is an integer greater than or equal to 1 and less than N.
Optionally, if the matrix to be operated includes at least one row of all-zero element rows, the M rows × N columns of memory cells may be cleared by the N zero clearing lines before the write operation is performed. Specifically, reference may be made to the description of the embodiment corresponding to fig. 7, which is not repeated herein.
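The column-wise clearing described above can be sketched as follows. This is an illustrative model only: the helpers `all_zero_columns` and `clear_columns` are hypothetical names, and driving a zero clearing line to a high level is modeled simply as setting every cell in that column to 0.

```python
# Hypothetical model of per-column zero clearing; names are illustrative only.

def all_zero_columns(matrix):
    """Return the indices of the columns of `matrix` whose elements are all zero."""
    return [j for j in range(len(matrix[0])) if all(row[j] == 0 for row in matrix)]

def clear_columns(cells, columns):
    """Model a high level on the zero clearing line of each selected column."""
    for j in columns:
        for row in cells:
            row[j] = 0  # every storage unit in column j shares the same clearing line
    return cells

matrix = [[5, 0, 7, 0],
          [1, 0, 2, 0],
          [3, 0, 4, 0]]                      # columns 1 and 3 are all zero (P = 2, N = 4)
cells = [[9] * 4 for _ in range(3)]          # stale contents left in the storage units

cols = all_zero_columns(matrix)
clear_columns(cells, cols)
print(cols, cells)
```

In this sketch only the all-zero columns are cleared; the remaining columns would still be filled by the normal write operation.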
Step S13, receiving an externally input vector to be operated through an operation circuit, and performing vector-matrix operation based on the vector to be operated and the matrix to be operated to obtain an operation result.
Specifically, in a second clock cycle, an arithmetic circuit connected to the memory receives an externally input vector to be operated, and performs vector-matrix operation based on the vector to be operated and the matrix to be operated stored in the M rows by N columns of memory cells to obtain an operation result. The second clock cycle may be one clock cycle or multiple clock cycles.
Optionally, the arithmetic circuit may include X rows by Y columns of arithmetic units, the X rows by Y columns of arithmetic units are coupled to the M rows by N columns of storage units, each arithmetic unit includes an input port, and the arithmetic unit is connected to the corresponding storage unit through the input port to obtain the element stored in the storage unit. The arithmetic circuit may further include an adder circuit connected to the X rows by Y columns of arithmetic units. Optionally, the step S13 may specifically include the following steps S21-S22:
step s21, obtaining the elements stored in the corresponding storage unit, wherein the corresponding storage unit is connected with the input port interface of the arithmetic unit; and acquiring an element to be operated, and multiplying the element stored in the corresponding storage unit by the element to be operated to obtain a multiplication result.
Step s22, adding, through the adder circuit, the multiplication results obtained by the arithmetic units in each column of the X rows by Y columns of arithmetic units, respectively, to obtain the operation result of the vector to be operated and the matrix to be operated. Specifically, reference may be made to the description of the embodiment corresponding to fig. 11, which is not repeated herein.
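Steps s21 and s22 can be written compactly as a worked formula. The notation follows the fig. 11 embodiment (vector elements A11 to A14, matrix elements such as B11, result elements C11 to C14); the intermediate symbol p_ij is introduced here only for illustration, and it is assumed that the vector has X elements, that the stored matrix has X rows by Y columns, and that the addition is performed per column as in fig. 11.

```latex
% Step s21: operation unit (i, j) multiplies its vector element by its stored element.
p_{ij} = A_{1i}\, B_{ij}, \qquad i = 1,\dots,X,\quad j = 1,\dots,Y

% Step s22: the adder circuit sums the products of each column.
C_{1j} = \sum_{i=1}^{X} p_{ij} = \sum_{i=1}^{X} A_{1i}\, B_{ij}

% For the 4 x 4 example of fig. 11:
C_{1j} = A_{11} B_{1j} + A_{12} B_{2j} + A_{13} B_{3j} + A_{14} B_{4j}, \qquad j = 1,\dots,4
```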
In the structure of the memory cells in M rows by N columns, a plurality of word lines connected with each memory cell in each row can enable a plurality of memory cells in a plurality of rows simultaneously, so that a plurality of rows of data in a matrix to be operated can be written into the corresponding memory cells simultaneously. Compared with the prior art that data is written into the memory line by line, the method and the device greatly shorten the time consumed by writing the data into the memory and improve the writing efficiency. In addition, the calculation in the memory can be realized through X rows and Y columns of operation units which are cross-coupled with M rows and N columns of storage units in the device. Meanwhile, compared with the structure that an integral independent M rows by N columns of storage units and an integral independent X rows by Y columns of operation units are simply connected in the existing integrated storage and operation scheme, the invention arranges the storage units and the operation units in the device in a crossed way and couples the storage units and the operation units (for example, four operation units are arranged in a row below four storage units in a row, then four storage units are arranged in a row, and the process is repeated, and one storage unit is closely attached to one operation unit, and the like). The connection between the storage and the operation is tighter, the data moving distance is shorter, the circuit delay and the circuit area are smaller, and the operation efficiency is higher.
Based on the description of the above matrix operation device and method embodiments, the embodiment of the present invention further discloses a processor, which includes all the contents described in the above device embodiments, and can implement the method described in the above method embodiments. The processor may be formed of a chip, or may include a chip and other discrete devices.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented using a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired connection (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless connection (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The above disclosure merely illustrates preferred embodiments of the present invention and is not intended to limit the scope of the appended claims.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in conjunction with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made thereto without departing from the spirit and scope of the invention. Accordingly, the specification and figures are merely exemplary of the invention as defined in the appended claims and are intended to cover any and all modifications, variations, combinations, or equivalents within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (11)

1. A matrix operation apparatus, comprising:
the memory is used for storing a matrix to be operated and comprises M rows by N columns of memory cells, and each memory cell is connected with adjacent memory cells in the row direction and the column direction;
word lines, wherein the number of the word lines is M rows by K, and the word lines are connected with the M rows by N columns of memory cells along the row direction, wherein each memory cell in each row is connected with K word lines; any K1 word lines of the M rows by K word lines are used for simultaneously enabling the corresponding K2 rows by N columns of memory cells, so as to simultaneously write at least one corresponding row of elements of the matrix to be operated into the corresponding positions in the K2 rows by N columns of memory cells, wherein K, K1 and K2 are integers greater than 1;
the operation circuit comprises X rows, Y columns of operation units, M rows, N columns of storage units and the X rows, Y columns of operation units are arranged in a crossed mode, each storage unit in the M rows, N columns of storage units is correspondingly connected with one operation unit, and the operation circuit is used for receiving externally input vectors to be operated and carrying out vector-matrix operation on the vectors to be operated and the matrixes to be operated stored in the M rows, N columns of storage units to obtain operation results.
2. The apparatus of claim 1, wherein each of the X rows by Y columns of arithmetic units comprises an input port;
the operation unit is connected with the corresponding storage unit through the input port and is used for acquiring the elements stored in the storage unit;
the operation unit is further configured to obtain an element to be operated, and multiply the element stored in the corresponding storage unit with the element to be operated to obtain a multiplication result, where the element to be operated comes from the vector to be operated.
3. The apparatus of claim 2, wherein the arithmetic circuitry further comprises summing circuitry coupled to the X row by Y column arithmetic units;
and the addition circuit is used for respectively adding the multiplication results obtained by the operation units in each row of the operation units in the X row by Y column to obtain the operation results of the vector to be operated and the matrix to be operated.
4. The apparatus of claim 3, further comprising:
the number of the zero clearing lines is N, the zero clearing lines are connected with the M rows by N columns of storage units along the column direction, and each storage unit in each column is connected with the same zero clearing line; each zero clearing line is used for transmitting high level to each storage unit connected with the zero clearing line when the zero clearing operation is executed.
5. The apparatus of claim 4, further comprising:
bit lines, each set of the bit lines including a first bit line and a second bit line complementary to the first bit line; within the M rows by N columns of memory cells, each of the memory cells in each column is connected to S groups of the bitlines in a column direction, thereby forming N columns by S groups of the bitlines;
when the S2 rows by N columns of memory cells corresponding to N columns by S1 groups of bit lines among the N columns by S groups of bit lines are enabled, the N columns by S1 groups of bit lines are used for writing at least one row of elements of the matrix to be operated into the corresponding positions in the S2 rows by N columns of memory cells.
6. The apparatus of claim 5, wherein each of the storage units comprises:
the first inverter and the second inverter are connected end to form a storage space, the storage space is used for storing one item of data, and the one item of data comprises corresponding elements in the matrix to be operated;
the drains of Q1 first transmission transistors are respectively connected with the output end of the same first inverter, the gates of Q1 first transmission transistors are respectively connected with corresponding word lines in the K word lines, and the sources of Q1 first transmission transistors are respectively connected with corresponding first bit lines in the S groups of bit lines;
the drains of the Q2 second transmission transistors are respectively connected with the output end of the same second inverter, the gates of the Q2 second transmission transistors are respectively connected with corresponding word lines in the K word lines, and the sources of the Q2 second transmission transistors are respectively connected with corresponding second bit lines in the S groups of bit lines;
when any one of the K word lines is at a high potential, a first transfer transistor and a second transfer transistor connected to the word line are used for conducting a circuit to write a corresponding element in the matrix to be operated into the memory cell.
7. The apparatus of claim 6, wherein each of the storage units further comprises:
the drain electrode of the first zero clearing transistor is connected with the output end of the first phase inverter, and the gate electrode of the first zero clearing transistor is connected with the zero clearing line;
the drain electrode of the second zero clearing transistor is connected with the output end of the second inverter, and the gate electrode of the second zero clearing transistor is connected with the zero clearing line;
the first clear transistor and the second clear transistor are used for clearing the storage unit when the clear line is at a high potential.
8. A processor, characterized in that it comprises a device according to any of claims 1-7.
9. A method of matrix operations, comprising:
determining K1 word lines among the word lines according to the matrix to be operated, wherein the word lines are connected with M rows by N columns of memory cells in the memory along the row direction, and each memory cell in each row is connected with K word lines;
simultaneously enabling the corresponding K2 rows by N columns of memory cells through the K1 word lines, so as to simultaneously write at least one corresponding row of elements of the matrix to be operated into the corresponding positions of the K2 rows by N columns of memory cells; within the M rows by N columns of memory cells, each of the memory cells is connected to adjacent memory cells in both the row and column directions, and K, K1 and K2 are integers greater than 1;
receiving an externally input vector to be operated through an operation circuit connected with the memory, and performing vector-matrix operation on the vector to be operated and the matrix to be operated stored in the M rows by N columns of storage units to obtain an operation result; the operation circuit comprises X rows, Y columns of operation units, M rows, N columns of storage units and X rows, Y columns of operation units, wherein the M rows, N columns of storage units are arranged in a crossed mode, and each storage unit in the M rows, N columns of storage units is correspondingly connected with one operation unit.
10. The method of claim 9, further comprising:
when the matrix to be operated comprises P columns of all-zero elements, determining P corresponding zero clearing lines from N zero clearing lines in total, wherein the N zero clearing lines are connected with the M rows by N columns of storage units along the column direction, and each storage unit in each column is connected with the same zero clearing line;
and transmitting a high level, through the P zero clearing lines, to the P columns of storage units connected with the P zero clearing lines, so as to clear the P columns of storage units.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, implement the method according to any one of claims 9-10.
CN201911223959.0A 2019-12-04 2019-12-04 Matrix operation device, method, processor and computer readable storage medium Active CN110674462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911223959.0A CN110674462B (en) 2019-12-04 2019-12-04 Matrix operation device, method, processor and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN110674462A (en) 2020-01-10
CN110674462B (en) 2020-06-02

Family

ID=69088303

Country Status (1)

Country Link
CN (1) CN110674462B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210210

Address after: 311201 No. 602-11, complex building, 1099 Qingxi 2nd Road, Hezhuang street, Qiantang New District, Hangzhou City, Zhejiang Province

Patentee after: Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.

Address before: 518 000 514, building 10, Shenzhen Bay science and technology ecological park, No.10, Gaoxin South 9th Road, high tech Zone community, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Xinying Technology Co.,Ltd.