CN114072778A - Memory processing unit architecture - Google Patents

Memory processing unit architecture

Info

Publication number
CN114072778A
CN114072778A
Authority
CN
China
Prior art keywords
memory
processing
regions
processing unit
data
Prior art date
Legal status
Pending
Application number
CN202080049322.9A
Other languages
Chinese (zh)
Inventor
M. Zidan
J. Botimer
C. Liu
Fan-Hsuan Meng
T. Wesley
Zhenya Zhang
Wei Lu
Current Assignee
MemryX
Original Assignee
MemryX
Priority date
Filing date
Publication date
Priority claimed from US 16/841,544 (US11488650B2)
Application filed by MemryX
Publication of CN114072778A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/40Bus structure
    • G06F13/4063Device-to-bus coupling
    • G06F13/4068Electrical coupling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0207Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/06Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0607Interleaved addressing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/06Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
    • G06F12/0646Configuration or reconfiguration
    • G06F12/0684Configuration or reconfiguration with feedback, e.g. presence or absence of unit detected by addressing, overflow detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A memory processing unit architecture may include a plurality of memory regions and a plurality of processing regions interleaved among the plurality of memory regions. The plurality of processing regions may be configured to perform computational functions, such as those of an artificial neural network model. Data may be transferred between computational functions within a given processing region, and the memory regions may be used to transfer data between a computational function in one processing region and a computational function in another processing region adjacent to a given memory region.

Description

Memory processing unit architecture
Background
Computing systems have made a significant contribution to the advancement of modern society and are used in many applications to achieve beneficial results. Applications such as artificial intelligence, machine learning, big data analysis, etc., perform computations on large amounts of data. In conventional computing systems, data is transferred from memory to one or more processing units, which perform computations on the data, and then the results are transferred back to memory. Transferring large amounts of data from memory to a processing unit and back to memory takes time and consumes power. Accordingly, there is a continuing need for improved computing systems that reduce processing delays, data delays, and/or power consumption.
Disclosure of Invention
The present technology may be best understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the present technology for a memory processing unit architecture. The architecture may include a plurality of memory regions (e.g., Static Random Access Memory (SRAM)) and a plurality of processing regions (including memory such as resistive random access memory (ReRAM), Magnetic Random Access Memory (MRAM), FLASH memory (FLASH), or Phase Change Random Access Memory (PCRAM)). The plurality of processing regions may be column interleaved between the plurality of memory regions. The plurality of processing regions may be configured to perform a computational function such as a model of an artificial neural network. Data may be configured to flow across multiple memory regions and processing regions in a cross-column direction.
In one embodiment, a memory processing unit may include a plurality of memory regions, a plurality of processing regions, and one or more centralized or distributed control circuits. The plurality of processing regions may be interleaved between the plurality of memory regions. One or more of the plurality of processing regions may be configured to perform one or more computational functions. The one or more control circuits may be configured to control the flow of data into each given processing region of the plurality of processing regions from a first adjacent memory region of the plurality of memory regions and out to a second adjacent memory region of the plurality of memory regions. The memory processing unit may also include one or more communication links coupled between the interleaved memory and processing regions. The one or more communication links may be configured to move data between non-adjacent ones of the plurality of memory regions and/or processing regions.
In another embodiment, a method of configuring a memory processing unit may include receiving a model. One or more of the plurality of processing regions of the memory processing unit may be configured to perform one or more computational functions of the model. One or more of a plurality of memory regions of the memory processing unit may be configured to control a flow of data from a first adjacent memory region into the one or more of the plurality of processing regions and out to a second adjacent memory region, wherein the plurality of processing regions are interleaved between the plurality of memory regions.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
Embodiments of the present technology are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 illustrates a memory processing unit in accordance with embodiments of the present technology.
FIG. 2 illustrates a memory processing unit in accordance with embodiments of the present technology.
FIG. 3 illustrates a processing core in accordance with aspects of the present technique.
FIG. 4 illustrates a processing element in accordance with aspects of the present technique.
FIG. 5 illustrates a processing element in accordance with aspects of the present technique.
FIG. 6 illustrates a memory processing method in accordance with aspects of the present technique.
FIG. 7 illustrates exemplary configuration data in accordance with aspects of the present technique.
FIGS. 8A through 8J illustrate the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 9 illustrates the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 10 illustrates the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 11 illustrates the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 12 illustrates a configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 13 illustrates the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 14 illustrates the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 15 illustrates the configuration and operation of a memory processing unit in accordance with aspects of the present technique.
FIG. 16 illustrates data flow through a set of processing cores in a processing region, in accordance with aspects of the present technique.
FIGS. 17A and 17B illustrate data flow configurations of processing cores in accordance with aspects of the present technique.
FIG. 18 illustrates a conventional computing process.
FIG. 19 illustrates a processing core in accordance with aspects of the present technique.
FIGS. 20A and 20B illustrate write-back registers in accordance with aspects of the present technique.
FIG. 21 illustrates data transfers in a memory processing unit in accordance with aspects of the present technique.
FIG. 22 illustrates data transfers in a memory processing unit in accordance with aspects of the present technique.
FIG. 23 illustrates data reuse in a memory processing unit in accordance with aspects of the present technique.
Detailed Description
Reference will now be made in detail to embodiments of the present technology, examples of which are illustrated in the accompanying drawings. While the technology will be described in conjunction with these embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present technology, numerous specific details are set forth in order to provide a thorough understanding of the present technology. However, it is understood that the present technology may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the present technology.
Some embodiments of the present technology below are presented in terms of routines, modules, logic blocks, and other symbolic representations of operations on data within one or more electronic devices. These descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art. A routine, module, logic block, and/or the like is here, and generally, conceived to be a self-consistent sequence of processes or instructions leading to a desired result. The processes are those involving physical manipulations of physical quantities. Usually, though not necessarily, these physical manipulations take the form of electrical or magnetic signals capable of being stored, transferred, compared, and otherwise manipulated in an electronic device. For convenience, and with reference to common usage, such signals are referred to as data, bits, values, elements, symbols, characters, terms, numbers, strings, and the like, with reference to embodiments of the present technology.
It should be borne in mind, however, that all of these terms are to be interpreted as referring to physical manipulations and quantities and are merely convenient labels and are to be further interpreted in view of the terminology commonly used in the art. Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the present technology, discussions utilizing terms such as "receiving" or the like, refer to the actions and processes of an electronic device, such as an electronic computing device, that manipulates and transforms data. The data is represented as physical (e.g., electronic) quantities within the electronic device's logic circuits, registers, memories, etc., and is converted to other data similarly represented as physical quantities within the electronic device.
In this application, the use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality. In particular, a reference to "the" object or "a" object is intended to denote also one of a possible plurality of such objects. It is also to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.
Referring now to FIG. 1, a memory processing unit is shown in accordance with embodiments of the present technology. The memory processing unit 100 may include a plurality of memory regions 110-130, a plurality of processing regions 135-150, one or more communication links 155, and one or more centralized or distributed control circuits 160. The plurality of memory regions 110-130 may also be referred to as active memory. The processing regions 135-150 may be interleaved between the memory regions 110-130. In one embodiment, the plurality of memory regions 110-130 and the plurality of processing regions 135-150 may have respective predetermined sizes. The plurality of processing regions 135-150 may have the same design. Similarly, the plurality of memory regions 110-130 may also have the same design. In one embodiment, the plurality of memory regions 110-130 may be static random access memory (SRAM), and the plurality of processing regions 135-150 may include one or more arrays of resistive random access memory (ReRAM), magnetic random access memory (MRAM), phase change random access memory (PCRAM), flash memory (FLASH), or the like.
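For readers tracing the reference numerals, a minimal structural sketch may help make the interleaving concrete. The Python below models only the topology of FIG. 1 (memory regions 110-130, processing regions 135-150, and links 155); the class and method names are illustrative assumptions, not part of the claimed design.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRegion:
    """Active memory (e.g., SRAM) shared by the two adjacent processing regions."""
    index: int
    data: dict = field(default_factory=dict)

@dataclass
class ProcessingRegion:
    """Compute region (e.g., ReRAM/MRAM/PCRAM/FLASH arrays) holding configured functions."""
    index: int
    functions: list = field(default_factory=list)

class MemoryProcessingUnit:
    """N+1 memory regions interleaved with N processing regions:
    M0 P0 M1 P1 M2 ...  Each P reads from the memory region on one side
    and writes to the memory region on the other side."""
    def __init__(self, num_processing_regions: int):
        self.mems = [MemoryRegion(i) for i in range(num_processing_regions + 1)]
        self.procs = [ProcessingRegion(i) for i in range(num_processing_regions)]

    def neighbors(self, proc_index: int):
        """First and second adjacent memory regions of a given processing region."""
        return self.mems[proc_index], self.mems[proc_index + 1]

    def link(self, src: int, dst: int):
        """Model of a communication link moving data between NON-adjacent
        memory regions (adjacent moves go through a processing region)."""
        assert abs(src - dst) > 1, "adjacent regions do not need the link"
        self.mems[dst].data.update(self.mems[src].data)

mpu = MemoryProcessingUnit(num_processing_regions=4)
first, second = mpu.neighbors(0)
print(first.index, second.index)  # 0 1
```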
One or more of the plurality of processing regions 135-150 may be configured to execute one or more computational functions, one or more instances of one or more computational functions, one or more segments of one or more computational functions, and/or the like. For example, the first processing region 135 may be configured to perform two computational functions and the second processing region 140 may be configured to perform a third computational function. In another example, the first processing region 135 may be configured to execute three instances of a first computational function, and the second processing region 140 may be configured to execute second and third computational functions. The one or more centralized or distributed control circuits 160 may configure the one or more computational functions of one or more of the plurality of processing regions 135-150. In yet another example, a given computational function may have a size greater than the predetermined size of one or more processing regions. In such a case, the computational function may be segmented, and the segments may be configured to execute on one or more of the plurality of processing regions 135-150. The computational functions may include, but are not limited to, vector products, matrix dot products, convolution, min/max pooling, averaging, scaling, and the like.
A main data flow direction may be utilized with the plurality of memory regions 110-130 and the plurality of processing regions 135-150. The one or more centralized or distributed control circuits 160 may control the flow of data into each given processing region of the plurality of processing regions 135-150 from a first adjacent memory region of the plurality of memory regions 110-130 and out to a second adjacent memory region. For example, the one or more control circuits 160 may configure data to flow from the first memory region 110 into the first processing region 135 and out to the second memory region 115. Similarly, the one or more control circuits 160 may configure data to flow from the second memory region 115 into the second processing region 140 and out to the third memory region 120. The control circuitry 160 may include centralized control circuitry, distributed control circuitry, or a combination thereof. If distributed, the control circuitry 160 may be local to the plurality of memory regions 110-130, the plurality of processing regions 135-150, and/or the one or more communication links 155.
In one embodiment, the plurality of memory regions 110-130 and the plurality of processing regions 135-150 may be arranged in interleaved columns. Data may be configured by the one or more centralized or distributed control circuits 160 to flow between adjacent columns of processing regions 135-150 and memory regions 110-130 in a cross-column direction. In one embodiment, data may flow in a unidirectional cross-column direction between adjacent processing regions 135-150 and memory regions 110-130. For example, data may be configured to flow from the first memory region 110 into the first processing region 135, from the first processing region 135 out to the second memory region 115, from the second memory region 115 into the second processing region 140, and so on. In another embodiment, data may flow in a bidirectional cross-column direction between adjacent processing regions 135-150 and memory regions 110-130. Additionally or alternatively, data within a respective one of the processing regions 135-150 may flow between computational functions within the same processing region. For example, for a first processing region 135 configured to perform two computational functions, data may flow directly from the first computational function to the second computational function without being written to or read from an adjacent memory region.
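The unidirectional cross-column flow described above can be sketched behaviorally as a fold over the interleaved regions. This is a model only; the function names are assumptions, and in hardware the movement would be governed by the control circuitry 160.

```python
# Behavioral sketch of unidirectional cross-column data flow: data enters
# memory region 0, passes through processing region 0 into memory region 1,
# and so on. Functions configured within one processing region are chained
# directly, without a round trip to memory.

def run_unidirectional(mpu_functions, frame):
    """mpu_functions: list of lists; mpu_functions[i] holds the
    computational functions configured into processing region i."""
    value = frame  # contents of the first adjacent memory region
    for region_functions in mpu_functions:
        for fn in region_functions:   # intra-region chaining (no memory writes)
            value = fn(value)
        # value is now the edge output written to the second adjacent region
    return value

# Example: region 0 performs two functions, region 1 performs one.
double = lambda x: [2 * v for v in x]
inc    = lambda x: [v + 1 for v in x]
total  = lambda x: [sum(x)]

print(run_unidirectional([[double, inc], [total]], [1, 2, 3]))  # [15]
```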
One or more communication links 155 may be coupled between the interleaved plurality of memory regions 110-130 and the plurality of processing regions 135-150. The one or more communication links 155 may be configured to move data between non-adjacent ones of the plurality of memory regions 110-130, between non-adjacent ones of the plurality of processing regions 135-150, or between a given memory region and a non-adjacent processing region. For example, the one or more communication links 155 may be configured to move data between the second memory region 115 and the fourth memory region 125. Additionally or alternatively, the one or more communication links 155 may be configured to move data between the first processing region 135 and the third processing region 145. Additionally or alternatively, the one or more communication links 155 may be configured to move data between the second memory region 115 and the third processing region 145, or between the second processing region 140 and the fourth memory region 125.
Generally, the plurality of memory regions 110-130 and the plurality of processing regions 135-150 are configured such that partial sums move in a given direction through a given processing region. In addition, they are generally configured such that edge outputs move in a given direction from a given processing region to an adjacent memory region. The terms partial sum and edge output are used herein to refer to results of a given computational function or a segment of a computational function.
Referring now to FIG. 2, a memory processing unit is shown in accordance with embodiments of the present technology. The memory processing unit 200 may include a plurality of memory regions 110-130, a plurality of processing regions 135-150, one or more communication links 155, and one or more centralized or distributed control circuits 160. The processing regions 135-150 may be interleaved between the memory regions 110-130. In one embodiment, the memory regions 110-130 and the processing regions 135-150 may be arranged in interleaved columns. In one embodiment, the plurality of memory regions 110-130 and the plurality of processing regions 135-150 may have respective predetermined sizes.
Each of the plurality of processing regions 135-150 may include a plurality of processing cores 205-270. In one embodiment, the plurality of processing cores 205-270 may have a predetermined size. One or more of the processing cores 205-270 of one or more of the processing regions 135-150 may be configured to execute one or more computational functions, one or more instances of one or more computational functions, one or more segments of one or more computational functions, and/or the like. For example, the first processing core 205 of the first processing region 135 may be configured to perform a first computational function, the second processing core 210 of the first processing region 135 may be configured to perform a second computational function, and the first processing core of the second processing region 140 may be configured to perform a third computational function. Again, the computational functions may include, but are not limited to, vector products, matrix dot products, convolution, min/max pooling, averaging, scaling, and the like.
The one or more centralized or distributed control circuits 160 may further configure the plurality of memory regions 110-130 and the plurality of processing regions 135-150 to control the flow of data. For example, the one or more control circuits 160 may configure data to flow from the first memory region 110 into the first processing region 135 and out to the second memory region 115. Similarly, the one or more control circuits 160 may configure data to flow from the second memory region 115 into the second processing region 140 and out to the third memory region 120. In one embodiment, the control circuitry 160 may configure the plurality of memory regions 110-130 and the plurality of processing regions 135-150 such that data flows in a single direction. For example, data may be configured to flow unidirectionally from left to right across one or more of the processing regions 135-150 and the respective adjacent memory regions of the plurality of memory regions 110-130. In another embodiment, the control circuitry 160 may configure the plurality of memory regions 110-130 and the plurality of processing regions 135-150 such that data flows bidirectionally across one or more of the processing regions 135-150 and the corresponding adjacent memory regions of the plurality of memory regions 110-130. In addition, the one or more control circuits 160 may also configure data to flow in a given direction through the one or more processing cores 205-270 in each of the plurality of processing regions 135-150. For example, data may be configured to flow from top to bottom in the first processing region 135, from the first processing core 205 through the second processing core 210 to the third processing core 215.
Referring now to FIG. 3, a processing core is shown in accordance with aspects of the present technique. The processing core 300 may include a processing element 310, one or more memory region interfaces 315, 320, one or more address translators 325, 330, one or more counters 335, one or more controllers 340, and one or more write-back registers 345, 350. The processing element 310 may be configured to perform computational functions such as vector products, matrix dot products, convolutions, and the like. The memory region interfaces 315, 320 may be configured to interface with respective adjacent memory regions. The address translators 325, 330 may be configured to translate between multidimensional data, such as a feature map, and a one-dimensional memory organization within the processing element 310. The counters 335 (such as pixel counters, bit counters, channel counters, etc.) may be configured to scan through the data. The write-back registers 345, 350 may be configured to hide memory access latency in the processing core 300, and may also perform min/max pooling, averaging, scaling, and the like. The controller 340 may configure the one or more memory region interfaces 315, 320, the one or more address translators 325, 330, the one or more counters 335, and the one or more write-back registers 345, 350. Further, each core may communicate (pass data) with its neighboring cores (the cores above and below).
Referring now to FIG. 4, a processing element 400 in accordance with aspects of the present technique is shown. The processing element 400 may include one or more memory cell arrays 410-425, one or more input registers 430-445, one or more output registers 450, 455, and one or more accumulators 460. The processing element 400 may share the one or more memory cell arrays 410-425 with one or more neighboring processing elements. The memory cell arrays 410-425 may include arrays of resistive random access memory (ReRAM), magnetic random access memory (MRAM), phase change random access memory (PCRAM), flash memory (FLASH), or the like. The memory cell arrays 410-425 may include a plurality of memory cells arranged in rows coupled to respective word lines and columns coupled to respective bit lines. In one embodiment, the memory cells may be configured to store the elements of a first matrix. Sets of input registers may be associated with respective sets of word lines. In one embodiment, the input registers may be configured to store respective elements of a second matrix. Individual word lines may be biased based on the bit values at sequential bit positions in the corresponding input registers. The corresponding bit lines may be sensed in response to the biased word lines to determine bit values. Each accumulator 460 may be associated with a respective set of bit lines. The accumulators 460 may include respective adders and shift registers configured to add the sensed bit values to the contents of the corresponding shift register, which is then loaded back into the shift register. Each shift register may be configured to shift its contents in a given direction after each sum is loaded. The shift registers of the accumulators 460 may then output results to the output registers 450, 455. In one embodiment, the output may be a partial sum computed from the contents of the input registers and the contents of the memory cell array. For example, the partial sum may be a dot product of the first matrix and the second matrix.
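The bit-serial accumulate-and-shift scheme just described can be reproduced in a few lines. The sketch below assumes unsigned integer inputs scanned MSB-first and binary weights stored in one bit-line column; all variable names are illustrative, and real sensing would happen in analog circuitry rather than Python.

```python
# Bit-serial dot product, as in FIG. 4: word lines are biased by one bit
# position of the input registers per step, bit lines are sensed, and the
# accumulator adds the sensed sum into a shift register that shifts left
# after each step (MSB-first ordering).

def bit_serial_dot(weights, inputs, bits=8):
    """weights: binary values stored in one bit-line column (0/1 per cell).
    inputs: unsigned ints held in the input registers, one per word line."""
    acc = 0  # models the accumulator's shift register
    for pos in range(bits - 1, -1, -1):          # MSB -> LSB
        sensed = sum(w for w, x in zip(weights, inputs)
                     if (x >> pos) & 1)          # word lines biased by this bit
        acc = (acc << 1) + sensed                # shift left, then accumulate
    return acc

# Cross-check against a direct dot product:
w = [1, 0, 1, 1]
x = [3, 9, 5, 1]
assert bit_serial_dot(w, x) == sum(wi * xi for wi, xi in zip(w, x))
print(bit_serial_dot(w, x))  # 9
```

A LSB-first ordering would instead shift the accumulator right each step, matching the two shift directions discussed for FIG. 5 below.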
Referring now to FIG. 5, a processing element in accordance with aspects of the present technique is shown. In general, the processing element may be a multiply-accumulate (MAC) unit. In one embodiment, the processing element 500 may include one or more sets of array units 505. Each array unit 505 may include one or more sets of input registers 510, word line drivers 515, a memory cell array 520, and sense circuitry 525. Each group of array units 505 may be coupled to a respective multi-operand accumulator 530 and a respective shift register 535. The multi-operand accumulator 530 may be a carry-save adder, a Wallace tree, or the like.
Corresponding word lines in a corresponding set of array units 505 may be activated simultaneously. Each accumulator 530 may be configured to sum the partial sums from the corresponding sense circuitry 525 of a group of array units 505 with the contents of the corresponding shift register 535. The sum may then be loaded back into the corresponding shift register 535. Each shift register 535 may be configured to shift the sum in a given direction. For example, if the word line drivers 515 bias the word lines based on input registers 510 ordered from most significant bit to least significant bit, the shift register 535 may shift its contents left by one bit after each sum from the accumulator 530 is loaded into the shift register 535. If the input registers 510 are ordered from least significant bit to most significant bit, the shift register 535 may shift the sum right instead. After iterating over the word lines and over the bit positions of the input registers of a group of array units 505, the resulting dot product may be output from the shift register 535.
The array units 505 may be tiled to increase the length or width, or both dimensions, of the stored matrix. In one embodiment, the array units 505 may be arranged horizontally to increase the width for storing a larger matrix A while having minimal impact on the hardware design of the processing element 500. In another embodiment, the array units 505 may be arranged vertically to increase the length of matrix A. In a vertical implementation, a multi-operand accumulator 530 shared between vertical compute slices may reduce the size of the accumulator 530 and shift register 535. The processing element 500 described above is but one of many possible implementations.
Referring now to FIG. 6, a memory processing method in accordance with aspects of the present technique is shown. The method may be implemented in a combination of hardware (such as one or more finite state machines) and software (such as computing device executable instructions). The memory processing method may include an initialization phase 605 and a runtime phase 610.
In the initialization phase 605, at 615, a model may be received by a memory processing unit. The memory processing unit may include a plurality of memory regions and a plurality of processing regions interleaved between the plurality of memory regions, as described above with reference to FIGS. 1-5. The model may include one or more computational functions. In one embodiment, the model may be a machine learning algorithm, an artificial neural network, a convolutional neural network, a recurrent neural network, or the like.
At 620, one or more of the plurality of processing regions of the memory processing unit may be configured to execute one or more computational functions of the model. In one embodiment, a given processing region may be configured to perform one or more computational functions. For example, a given processing region may be configured by writing a first matrix including a plurality of weights to a memory cell array of the given processing region. The corresponding inputs, counters, accumulators, shift registers, output registers, etc. of a given processing region may also be configured with initial values, states, etc. In another embodiment, a given processing core of a given processing region may be configured to perform a given computational function. For example, a given processing core of a given processing region may be configured by writing a first matrix comprising a plurality of weights to a memory cell array of the given processing core. The corresponding inputs, counters, accumulators, shift registers, output registers, etc. of a given processing core may also be configured with initial values, states, etc.
At 625, one or more of the plurality of memory regions may be configured to control data flow between the one or more configured processing regions and the one or more configured memory regions. In one embodiment, for a given processing region 140, data may be configured to flow in unidirectionally or bidirectionally from a first adjacent memory region 115 and out to a second adjacent memory region 120. In another embodiment, for a given processing core of a given processing region 140, data may be configured to flow in from a first adjacent memory region 115 or a first adjacent processing core, and flow out to a second adjacent memory region 120 or a second adjacent processing core.
In the run-time phase 610, at 630, input data may be received by the memory processing unit. At 635, output data can be calculated from the input data by the one or more configured processing regions and the one or more configured memory regions. At 640, the output data may be output from the memory processing unit.
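A thin driver-style wrapper illustrates the two phases of FIG. 6. The method names (`configure`, `run`) and the callable-based model are assumptions made for the sketch; an actual device would consume a binary configuration stream rather than Python callables.

```python
# Two-phase use of the memory processing unit, following FIG. 6:
# an initialization phase (615-625) that maps a model onto the regions,
# and a run-time phase (630-640) that streams frames through it.

class MpuDriver:
    def __init__(self):
        self.pipeline = None

    # --- initialization phase ---
    def configure(self, model):
        """model: ordered list of computational functions (e.g., the layers
        of a neural network), each to be mapped onto a processing region."""
        self.pipeline = list(model)   # 620: configure processing regions
        # 625: in hardware, the memory regions would also be configured
        # here to steer data between adjacent regions.

    # --- run-time phase ---
    def run(self, frame):
        assert self.pipeline is not None, "configure() must run first"
        value = frame                 # 630: receive input data
        for fn in self.pipeline:      # 635: compute through configured regions
            value = fn(value)
        return value                  # 640: output data

driver = MpuDriver()
driver.configure([lambda x: x * 2, lambda x: x + 1])
print(driver.run(10))  # 21
```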
Referring now to FIG. 7, exemplary configuration data in accordance with aspects of the present technique is shown. In the initialization phase 605, a configuration stream may be generated to configure the processing regions and the memory regions. The configuration stream may include data 710 for use by the one or more controllers 340 of a processing core 300, which may include one or more bits indicating whether the processing core 300 is an edge core, configured to generate an edge output for output to a corresponding adjacent memory region, or a partial-sum core, configured to generate a partial sum for output to another processing core in a given processing region. The configuration stream may also include one or more bits indicating the computational function performed by the processing core 300.
The configuration stream may also include one or more bits indicating the core width of the processing element 310. The configuration stream may also include data 720 for use by the one or more memory region interfaces, the data 720 including one or more bits indicating the adjacent memory region from which input data is read and one or more bits indicating the adjacent memory region to which output data is written. In one embodiment, a software layer of the control circuitry may be configured to receive the neural network model and generate the configuration stream.
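The fields called out above (edge/partial-sum flag, computational function, core width, and input/output memory regions) can be packed into a small configuration word. The field widths, ordering, and function codes below are arbitrary assumptions for illustration; FIG. 7 does not fix them.

```python
# Illustrative packing of one processing core's configuration word.
# Field widths are assumptions; the patent only names the fields.

FUNC_CODES = {"conv2d": 0, "dense": 1, "conv2d_pool": 2, "dw_conv": 3}

def pack_core_config(is_edge_core, func, core_width, in_region, out_region):
    word = 0
    word |= (1 if is_edge_core else 0)        # 1 bit: edge vs partial-sum core
    word |= FUNC_CODES[func] << 1             # 4 bits: computational function
    word |= (core_width & 0xFF) << 5          # 8 bits: core width
    word |= (in_region & 0xF) << 13           # 4 bits: input memory region
    word |= (out_region & 0xF) << 17          # 4 bits: output memory region
    return word

def unpack_core_config(word):
    return {
        "is_edge_core": bool(word & 1),
        "func": [k for k, v in FUNC_CODES.items() if v == (word >> 1) & 0xF][0],
        "core_width": (word >> 5) & 0xFF,
        "in_region": (word >> 13) & 0xF,
        "out_region": (word >> 17) & 0xF,
    }

cfg = pack_core_config(True, "conv2d_pool", 64, in_region=1, out_region=2)
print(unpack_core_config(cfg))
```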
Referring now to FIGS. 8A-8J, the operation of a memory processing unit in accordance with aspects of the present technique is shown. After the configuration mode, input data may be received in the first memory region 110. A first frame of data may flow from the first memory region 110 to the first processing region 135, where the first processing core 805 may perform its configured computational function on the first frame of data to generate a first instance of a first partial sum, as shown in FIG. 8A.
As shown in FIG. 8B, the first instance of the first partial sum may flow from the first processing core 805 to the second processing core 810. The second processing core 810 may perform its configured computational function on the first instance of the first partial sum to generate a first instance of a second partial sum. The first processing core 805 may process second frame data while the second processing core 810 is processing the partial sum associated with the first frame data. As shown in FIG. 8C, the first instance of the second partial sum may flow from the second processing core 810 to the third processing core 815. The third processing core 815 may perform its configured computational function on the first instance of the second partial sum to generate a first instance of a third partial sum. While the third processing core 815 is processing the partial sum associated with the first frame data, the second processing core 810 may process the partial sum associated with the second frame data, and the first processing core 805 may process third frame data. As shown in FIG. 8D, the first instance of the third partial sum may flow from the third processing core 815 to the fourth processing core 820. The fourth processing core 820 may perform its configured computational function on the first instance of the third partial sum to generate a first instance of a first edge output. While the fourth processing core 820 is processing the partial sum associated with the first frame, the first, second, and third processing cores 805-815 may process the respective partial sums associated with respective frames of data from the first memory region 110. It should be understood that data frames are passed and processed in a pipelined configuration.
As shown in FIG. 8E, the fourth processing core 820 may generate an edge output that may be passed to the adjacent second memory region 115. As shown in FIG. 8F, the second memory region 115 may in turn pass the edge output to the first and fourth processing cores 825, 830 in the second processing region 140. Thus, multiple processing cores may work on the same edge output associated with a given data frame. The one or more edge outputs written to the second memory region 115 may represent the output of a first layer of the neural network. As shown in FIG. 8G, the partial sums from the first and fourth processing cores 825, 830 in the second processing region 140 may be passed to the second and fifth processing cores 835, 840. Meanwhile, the first memory region 110, the first through fourth processing cores 805-820 of the first processing region 135, and the second memory region 115 may process corresponding frames at the same time. The data input to the first and fourth processing cores 825, 830 from the second memory region 115 may represent the input to a second layer of the neural network. As shown in FIG. 8H, the partial sums from the second and fifth processing cores 835, 840 in the second processing region 140 may be passed to the third and sixth processing cores 845, 850. Meanwhile, the first memory region 110, the first through fourth processing cores 805-820 of the first processing region 135, the second memory region 115, and the first, second, fourth, and fifth processing cores 825-840 of the second processing region 140 may process corresponding frames at the same time. As shown in FIG. 8I, the third and sixth processing cores 845, 850 of the second processing region 140 may pass their edge outputs to the adjacent third memory region 120. As shown in FIG. 8J, the third memory region 120 may pass the edge outputs to the first, third, and fifth processing cores 855-865 of the third processing region 145. Thus, the first memory region 110, the processing cores 805-820 of the first processing region 135, the second memory region 115, the processing cores 825-850 of the second processing region 140, the third memory region 120, and the processing cores of the third processing region 145 may process corresponding frames in a pipelined fashion. The configurations described above with respect to FIGS. 8A-8J are for illustration purposes and are not intended to limit aspects of the present technique. The memory processing unit may be configured to perform any of various computations.
Thus, each processing core may be configured in the configuration mode to perform a particular computational function. The processing cores may continue to execute the same computational functions until a new model is mapped onto the memory processing unit in a new configuration mode. Each processing element may be configured to perform a computational function such as Conv_2D, Dense, Conv_2D + pooling, DW_Conv, and so on. Each processing core may be configured to generate either partial sums or edge outputs. Partial sums are typically passed from one processing core to another processing core in the same processing region. Edge outputs are typically passed to a memory region.
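The frame pipelining of FIGS. 8A-8J can be modeled as a shift of partial sums down a chain of cores, one stage per cycle. The code below is a timing sketch under simplifying assumptions (one function per core, one partial sum passed per cycle); the names are illustrative.

```python
# Pipeline sketch for FIGS. 8A-8J: each cycle, every core passes its
# result to the next core; the last (edge) core writes to the adjacent
# memory region. A new frame enters the first core every cycle.

def simulate_pipeline(core_fns, frames):
    stages = [None] * len(core_fns)   # partial sum held by each core
    memory_out = []                   # edge outputs to the adjacent memory region
    stream = list(frames)
    for _ in range(len(frames) + len(core_fns)):
        if stages[-1] is not None:    # edge core drains to memory
            memory_out.append(stages[-1])
        # shift partial sums downstream (last stage first)
        for i in range(len(stages) - 1, 0, -1):
            prev = stages[i - 1]
            stages[i] = core_fns[i](prev) if prev is not None else None
        stages[0] = core_fns[0](stream.pop(0)) if stream else None
    return memory_out

fns = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]  # 3 chained cores
print(simulate_pipeline(fns, [1, 2, 3]))  # [1, 3, 5]
```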
Referring now to FIGS. 9-14, the configuration and operation of a memory processing unit in accordance with aspects of the present technique are shown. In a configuration mode, each layer of a neural network may be mapped to a respective processing region. For example, as shown in FIG. 9, a first layer 910 of the neural network may map to the first processing region 135 and a second layer 920 of the neural network may map to the second processing region 140. In such an example, the first, second, and third processing cores of the first processing region 135 may perform the computational functions of the first layer 910. Similarly, the first, second, third, and fourth processing cores of the second processing region 140 may perform the computational functions of the second layer 920.
Additionally or alternatively, multiple groups of processing cores (also referred to as workers) in the same processing region may operate on the same neural network layer. For example, as shown in fig. 10, a first set of processing cores in the first processing region 135 may operate on a first instance 1010 of the first layer, and a second set of processing cores in the first processing region 135 may operate on a second instance 1020 of the first layer. In addition, another set of processing cores in the second processing region 140 may operate on the second layer 1030.
Additionally or alternatively, multiple neural network layers may be mapped to the same processing region. For example, as shown in FIG. 11, a first set of processing cores in the first processing region 135 may operate on the first layer 1110, a second set of processing cores in the first processing region 135 may operate on the second layer 1120, and a third set of processing cores in the first processing region 135 may operate on the third layer 1130. The fourth layer 1140 may be mapped to a first set of processing cores in the second processing region 140 if the first processing region 135 is utilized as much as possible by the first, second, and third layers 1110-1130. Mapping multiple smaller layers to a single processing region may increase the utilization of the processing regions in a memory processing unit.
Additionally or alternatively, branching may be implemented locally when mapping the neural network. For example, the first layer 1210 may be mapped to a set of processing cores in the first processing region 135. As shown in FIG. 12, a first branch 1220 of the second layer may map to a first set of processing cores in the second processing region 140, and a second branch 1230 of the second layer may map to a second set of processing cores in the second processing region 140. Data from the first layer 1210 may be passed through the second memory region 115 to the appropriate branch 1220, 1230 of the second layer.
Additionally or alternatively, a relatively wide layer of the neural network may be split and mapped to multiple sets of processing cores of one or more processing regions. In a first example, as shown in FIG. 13, the first layer may be divided into three portions 1310-1330, and the second layer may be divided into two portions 1340, 1350. The first portion 1310 of the first layer may be mapped to a first set of processing cores of the first processing region 135, the second portion 1320 of the first layer may be mapped to a second set of processing cores of the first processing region 135, and the third portion 1330 of the first layer may be mapped to a third set of processing cores of the first processing region 135. Similarly, the first portion 1340 of the second layer may be mapped to a first set of processing cores of the second processing region 140, and the second portion 1350 of the second layer may be mapped to a second set of processing cores of the second processing region 140. In another example, a layer may be divided into four portions, as shown in FIG. 14. A first portion 1410 of the layer may be mapped to a first set of processing cores of the first processing region 135, a second portion 1420 may be mapped to a second set of processing cores of the first processing region 135, a third portion 1430 may be mapped to a third set of processing cores of the first processing region 135, and a fourth portion 1440 may be mapped to a first set of processing cores of the second processing region 140. The configurations described above with respect to FIGS. 9-14 are for illustration purposes and are not intended to limit aspects of the present technique. The memory processing unit may be configured to perform any of various computations.
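Splitting a wide layer across core groups amounts to partitioning its output channels. The helper below is a hypothetical mapper sketch; the patent does not prescribe a partitioning algorithm, and the channel-per-group granularity is an assumption.

```python
# Hypothetical mapper: split a layer's output channels across groups of
# cores, spilling into the next processing region when one fills up.

def map_layer(num_out_channels, channels_per_group, groups_per_region):
    """Returns a list of (region, group, channel_range) assignments."""
    assignments, region, group, start = [], 0, 0, 0
    while start < num_out_channels:
        stop = min(start + channels_per_group, num_out_channels)
        assignments.append((region, group, range(start, stop)))
        start = stop
        group += 1
        if group == groups_per_region:   # region exhausted; spill over
            region, group = region + 1, 0
    return assignments

# A layer with 256 output channels, 64 channels per core group, and
# 3 groups per processing region -> 4 groups spanning 2 regions (cf. FIG. 14).
for region, group, chans in map_layer(256, 64, 3):
    print(f"region {region}, group {group}: channels {chans.start}-{chans.stop - 1}")
```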
Referring to FIG. 15, the configuration and operation of a memory processing unit in accordance with aspects of the present technique are shown. The first, second, and third processing cores 1510-1520 of the first processing region may be configured to receive data from the first memory region 1525. The first processing core 1510 may be configured to perform a 2D convolution on data received from the first memory region 1525 and generate a partial sum that is fed to the second processing core 1515. The second processing core 1515 may be configured to perform a 2D convolution on the data received from the first memory region 1525 and the partial sum from the first processing core 1510 and generate a partial sum that is fed to the third processing core 1520. The third processing core 1520 may be configured to perform a 2D convolution on the data received from the first memory region 1525 and the partial sum received from the second processing core 1515 and generate an edge output that is output to the second memory region 1530. The data received from the first memory region 1525 may be, for example, multiple frames of image data.
The first processing core 1535 of the second processing region may be configured to receive data from the second memory region 1530 and perform a 2D convolution to generate a partial sum that is fed to the second processing core 1540. The second processing core 1540 of the second processing region may be configured to perform a 2D convolution with max pooling on the data received from the second memory region 1530 and the partial sums received from the first processing core 1535 to generate an edge output to the third memory region 1545.
The first processing core 1550 of the third processing region may be configured to receive data from the third memory region 1545 and perform a fully-connected dense matrix product to generate a partial sum that is fed to the second processing core 1555. The second processing core 1555 of the third processing region may be configured to perform a fully-connected dense matrix product on the data received from the third memory region 1545 and the partial sums from the first processing core 1550 to generate partial sums that are output to the third processing core 1560. The third processing core 1560 of the third processing region may be configured to perform a fully-connected dense matrix product on data received from the third memory region 1545 and partial sums from the second processing core 1555 to generate an edge output that is output to the fourth memory region 1565. The set of computations described above is for illustrative purposes and is not intended to limit aspects of the present technology. The memory processing unit may be configured to perform any of various computations.
Referring now to FIG. 16, data flow through a set of processing cores in a processing region is shown, in accordance with aspects of the present technique. One or more partial sum registers 1610 of the first processing core may be initialized with a given value 1615 (such as all zeros). After a given computational function of the first processing core is executed in a first cycle, the values in the partial sum registers 1610 may be passed to the second processing core. After the computational functions of the first and second processing cores are executed in a second cycle, the values in the corresponding partial sum registers 1610, 1620 may be passed to the second and third processing cores. After the computational functions of the first, second, and third processing cores are executed in a third cycle, the values in the respective partial sum registers 1610, 1620, 1625 may be passed to the second, third, and fourth processing cores. After the computational functions of the first, second, third, and fourth processing cores are executed in a fourth cycle, the values in the respective partial sum registers 1610, 1620, 1625 may again be passed to the second, third, and fourth processing cores. Further, values in the one or more edge output registers 1630 of the fourth processing core may be passed to one or more corresponding write-back registers 1635 of the fourth processing core. The write-back registers 1635 allow the values to be written out to the corresponding adjacent memory region 1640 while the next partial sums from the third processing core's registers 1625 are transferred to the fourth processing core in the fourth cycle. Optionally, the edge output registers 1630 of the fourth processing core may be set to a given value 1645, such as all zeros, when the partial sum values are passed to the corresponding write-back registers 1635, so that the given value can be passed on when the partial sums are passed in the fourth cycle. As shown in FIG. 16, the write-back registers 1635 of the fourth processing core, which generates the edge output, are activated, while the write-back registers 1650-1660 of the first, second, and third processing cores, which generate partial sums, are disabled. In a neural network or other similar application, the plurality of processing cores in a processing region may generate a very large amount of data to output to a corresponding adjacent memory region. The write-back registers 1635 coupled to the edge output registers 1630 allow this data to be written out to the corresponding adjacent memory region 1640 without the need to stop or pause processing in the plurality of processing cores. The configuration described above with reference to FIG. 16 is for illustrative purposes and is not intended to limit aspects of the present technique. The set of processing cores may be configured in any of a variety of ways to perform computations.
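The latency-hiding role of the write-back register is essentially double buffering: the edge output register is copied into the write-back register and cleared, so the core can accept the next partial sum while the copy drains to memory. A minimal sketch follows, with assumed names and a stand-in computational function.

```python
# Double-buffering sketch of the edge-output / write-back register pair
# (cf. FIG. 16). The edge register is freed immediately, so the pipeline
# never stalls on the (slower) write to the adjacent memory region.

class EdgeCore:
    def __init__(self):
        self.edge_register = 0        # accumulates the edge output
        self.write_back_register = None
        self.memory_region = []       # adjacent memory region (model)

    def cycle(self, incoming_partial_sum):
        # 1) hand the finished edge output to the write-back register
        self.write_back_register = self.edge_register
        self.edge_register = 0        # reset to the given value (all zeros)
        # 2) compute on the new partial sum in the same cycle
        self.edge_register = incoming_partial_sum + 1  # stand-in function
        # 3) the write-back register drains to memory in the background
        self.memory_region.append(self.write_back_register)

core = EdgeCore()
for ps in [10, 20, 30]:
    core.cycle(ps)
print(core.memory_region)  # [0, 11, 21]
```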
Referring now to FIGS. 17A and 17B, data flow configurations of processing cores in accordance with aspects of the present technique are shown. As shown in FIG. 17A, data may be configured to flow in from the left 1705, partial sums may flow in from the top 1710, and edge outputs may flow out to the right 1715. Alternatively, data may flow in from the left 1705, partial sums may flow in from the top 1710, and partial sums may flow out from the bottom 1720. In such examples, the data flow is unidirectional from left to right and from top to bottom. In other examples, data and edge outputs may flow unidirectionally from right to left (not shown). Data configured to flow unidirectionally through the processing cores may be used to implement a deep convolutional neural network, which involves forward data propagation.
Alternatively, the processing core may be configured so that data flows in from the left 1725 and edge outputs flow out to the right 1730, or data flows in from the right 1735 and edge outputs flow out to the left 1740, as shown in FIG. 17B. In such an example, the data flow is bidirectional, from left to right and from right to left. Data configured to flow bidirectionally through the processing cores may be used to implement a recurrent neural network, which involves forward and backward data propagation.
Referring now to FIG. 18, a conventional computing process is shown. In a conventional computer processor, the computational functions of a neural network are performed by instructions that are executed sequentially to carry out computational operations 1810. Each computational operation is performed and its result written back to memory before the next begins. In contrast, as shown in FIG. 19, a memory processing unit 1900 according to aspects of the present technology may use its combinational logic 1910-1925 to compute multiple partial sums and edge outputs in parallel with each other, without intermediate writes to memory 1930. Furthermore, the use of the write-back registers 1935-1945 may hide the latency of writing edge outputs, and optionally partial sums, to the adjacent memory region 1930.
Referring now to FIGS. 20A and 20B, write-back registers in accordance with aspects of the present technique are shown. As shown in FIG. 20A, the write-back registers 2010 may be configured to perform scaling operations, also referred to as normalization, by passing a given subset of bits 2020 from the edge output register 2030 to the corresponding write-back register 2010. Further, additional circuitry may be utilized in conjunction with the write-back registers to perform other operations. For example, combinational logic 2040 may be configured to pass the greater of the current contents of the partial sum register 2050 and the current contents of the write-back register 2060, or a portion thereof, back to the write-back register 2060 to implement max pooling, as shown in FIG. 20B. Other operations, such as min pooling, averaging, rectified linear (ReLU) activation functions, and the like, may also be implemented using the partial sum registers and write-back registers.
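Both operations shown in FIGS. 20A and 20B reduce to simple manipulation of the register pair: scaling passes a bit slice of the edge output, and max pooling keeps the larger of the stored and incoming values. A sketch under those assumptions (the slice parameters and function names are illustrative):

```python
# Write-back register operations of FIGS. 20A and 20B (illustrative).

def scale(edge_value, lsb, width):
    """FIG. 20A: pass a given subset of bits from the edge output register
    to the write-back register (a power-of-two scaling/normalization)."""
    return (edge_value >> lsb) & ((1 << width) - 1)

def max_pool_step(write_back_value, partial_sum):
    """FIG. 20B: combinational logic keeps the greater of the write-back
    register's current contents and the incoming partial sum."""
    return max(write_back_value, partial_sum)

print(scale(0b1101_0110, lsb=4, width=4))   # 0b1101 == 13
wb = 0
for ps in [3, 9, 5]:                        # stream of partial sums
    wb = max_pool_step(wb, ps)
print(wb)  # 9
```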
Referring now to FIG. 21, data transfers in a memory processing unit are shown in accordance with aspects of the present technique. The data flow between the multiple processing cores 2110-2130 and the corresponding neighboring memory regions 2140, 2150 may be controlled using handshaking rather than a global controller. For example, the second processing core 2120 may wait until it receives a signal 2160 from the first processing core 2110 indicating that the first processing core 2110 has valid data. When the second processing core 2120 receives the signal 2160 indicating that the first processing core 2110 has valid data, the second processing core 2120 may copy 2170 the data from the first processing core 2110 and begin performing its configured computational function on the copied data. The second processing core 2120 may also send a signal 2160 to the first processing core 2110 indicating that it has copied the data. In response, the first processing core 2110 may begin processing new data. The use of handshaking to control the flow of data may simplify pipelining across the plurality of processing cores 2110-2130 and the corresponding adjacent memory regions 2140, 2150. For example, with handshaking, no central control logic is required to track stalls in the various processing cores 2110-2130.
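The valid/ack exchange between pipeline neighbors can be modeled with a single flag per link. This is a behavioral sketch only; the signal and class names are assumptions.

```python
# Two-signal handshake between adjacent cores (cf. FIG. 21): the producer
# raises `valid` when its output holds fresh data; the consumer copies the
# data, clears `valid` (the acknowledgement), and only then does the
# producer overwrite the link with new data.

class Link:
    def __init__(self):
        self.valid = False
        self.data = None

class Producer:
    def __init__(self, link):
        self.link, self.counter = link, 0

    def step(self):
        if not self.link.valid:          # previous data was consumed (ack)
            self.counter += 1            # "process new data"
            self.link.data = self.counter
            self.link.valid = True       # signal 2160: data is valid

class Consumer:
    def __init__(self, link):
        self.link, self.received = link, []

    def step(self):
        if self.link.valid:              # wait until producer has valid data
            self.received.append(self.link.data)  # copy 2170
            self.link.valid = False      # ack: producer may overwrite

link = Link()
p, c = Producer(link), Consumer(link)
for _ in range(6):
    p.step()
    c.step()
print(c.received)  # [1, 2, 3, 4, 5, 6]
```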
Referring now to FIG. 22, data transfers in a memory processing unit are shown in accordance with aspects of the present technique. An arbiter mechanism 2260 may be used to control the flow of data between a corresponding neighboring memory region 2210 and the plurality of processing cores 2220-2250 to facilitate memory accesses. The arbiter 2260 may provide access to the corresponding neighboring memory region 2210 to each of the plurality of processing cores 2220-2250 in turn. In addition, the memory region 2210 may utilize a multi-bank architecture to facilitate access by the multiple processing cores 2220-2250. Each bank may support access by a corresponding processing core, such that multiple ones of the processing cores 2220-2250 may access the memory region 2210 concurrently.
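A rotating-priority arbiter over a banked memory region might look like the following; the bank count, request format, and rotation rule are assumptions made for the sketch.

```python
# Round-robin arbiter sketch (cf. FIG. 22): each cycle, cores request
# access to a shared memory region; requests to distinct banks are granted
# together, and the priority rotates so every core is served in turn.

def arbitrate(requests, num_banks, start):
    """requests: dict core_id -> bank. Returns granted core_ids,
    rotating priority so the core at `start` goes first."""
    granted, busy_banks = [], set()
    order = sorted(requests, key=lambda c: (c - start) % len(requests))
    for core in order:
        bank = requests[core] % num_banks
        if bank not in busy_banks:       # one access per bank per cycle
            busy_banks.add(bank)
            granted.append(core)
    return granted

# Four cores; cores 0 and 2 collide on bank 1, so priority rotation decides.
reqs = {0: 1, 1: 0, 2: 1, 3: 3}
print(arbitrate(reqs, num_banks=4, start=0))  # [0, 1, 3]
print(arbitrate(reqs, num_banks=4, start=2))  # [2, 3, 1]
```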
Referring now to FIG. 23, data reuse in a memory processing unit in accordance with aspects of the present technology is illustrated. Data reuse may be implemented within a processing core 2310 to reduce memory accesses. For example, if data received 2320 in the memory processing region is later needed again 2330 by the processing core 2310, the data may be retained in the memory processing region for reuse.
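A minimal software analogy of this reuse is a small retention buffer inside the core that satisfies repeated reads without touching the adjacent memory region; the ReuseBuffer name and the dictionary-backed memory are assumptions made for the sketch.

```python
class ReuseBuffer:
    """Retains recently fetched operands inside the core so a later step
    of the computation skips the memory-region access."""
    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.held = {}      # addr -> value kept inside the core
        self.fetches = 0    # count of actual memory-region accesses

    def read(self, addr):
        if addr in self.held:        # reuse: no memory access needed
            return self.held[addr]
        self.fetches += 1            # miss: fetch from the adjacent region
        value = self.memory[addr]
        self.held[addr] = value      # retain for later reuse
        return value


mem = {0: 11, 1: 22}
buf = ReuseBuffer(mem)
buf.read(0); buf.read(1); buf.read(0)  # the third read is served locally
print(buf.fetches)  # 2 memory accesses for 3 reads
```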
Embodiments of the present technology advantageously provide a reconfigurable computing platform. Memory processing units according to aspects of the present technology may advantageously perform computations directly in memory. Accordingly, aspects of the present technology may advantageously reduce processing latency, data-movement latency, and/or power consumption.
The following examples relate to specific technical embodiments and indicate specific features, elements or steps that may be used or otherwise combined in implementing the embodiments.
Example 1 includes a memory processing unit, comprising: a plurality of memory regions; a plurality of processing regions interleaved among the plurality of memory regions, wherein one or more of the plurality of processing regions are configured to perform one or more computational functions; one or more communication links coupled between the interleaved plurality of memory regions and plurality of processing regions, wherein the communication links are configured to move data between non-adjacent ones of the plurality of memory regions or the plurality of processing regions; and one or more centralized or distributed control circuits configured to control a flow of data through each given processing region of the plurality of processing regions, from a first adjacent memory region of the plurality of memory regions to a second adjacent memory region of the plurality of memory regions.
Example 2 includes the memory processing unit of example 1, wherein: the plurality of processing regions are column-interleaved among the plurality of memory regions; and the one or more control circuits are configured to control the flow of data through each given processing region of the plurality of processing regions, from adjacent memory regions of the plurality of memory regions, in a cross-column direction.
Example 3 includes the memory processing unit of example 2, wherein each of the plurality of processing regions includes a plurality of processing cores.
Example 4 includes the memory processing unit of example 3, wherein the control circuitry is further configured to control data flow between the processing cores in respective ones of the plurality of processing regions in a column direction.
Example 5 includes the memory processing unit of example 3, wherein the computational function is partitioned among the plurality of processing cores.
Example 6 includes the memory processing unit of example 1, wherein the one or more computational functions include one or more computational functions of a neural network.
Example 7 includes the memory processing unit of example 6, wherein the neural network includes a plurality of layers, wherein each layer includes one or more computational functions.
Example 8 includes the memory processing unit of example 1, wherein the control circuitry includes a software layer configured to receive a neural network model and generate a configuration flow for configuring the plurality of memory regions and the plurality of processing regions.
Example 9 includes the memory processing unit of example 3, wherein each processing core includes a processing element, one or more counters, one or more write back registers, one or more controllers, one or more address translators, and one or more memory region interfaces.
Example 10 includes the memory processing unit of example 9, wherein each processing element includes one or more memory arrays, one or more input registers, one or more accumulators, and one or more output registers.
Example 11 includes the memory processing unit of example 10, wherein the one or more memory arrays include one or more resistive random access memory (ReRAM) arrays.
Example 12 includes the memory processing unit of example 10, wherein the one or more memory arrays include one or more Magnetic Random Access Memory (MRAM) arrays.
Example 13 includes the memory processing unit of example 10, wherein the one or more memory arrays include one or more Phase Change Random Access Memory (PCRAM) arrays.
Example 14 includes the memory processing unit of example 9, wherein the plurality of memory regions comprises a plurality of Static Random Access Memories (SRAMs).
Example 15 includes a memory processing unit, comprising: a plurality of first memory regions configured to store data; a plurality of second memory regions column-interleaved between the plurality of first memory regions, wherein one or more of the plurality of second memory regions are configured to perform one or more computational functions; a communication link coupled between the plurality of first memory regions and the plurality of second memory regions that are column-interleaved, the communication link configured to move data between non-adjacent ones of the plurality of first memory regions and the plurality of second memory regions; and a centralized or distributed control circuit configured to control a flow of data through each given first memory region of the plurality of first memory regions, from a first adjacent second memory region of the plurality of second memory regions to a second adjacent second memory region of the plurality of second memory regions, in a cross-column direction, and to control a flow of data within each given first memory region of the plurality of first memory regions in a column direction.
Example 16 includes the memory processing unit of example 15, wherein the plurality of first memory regions comprises a plurality of Static Random Access Memory (SRAM) regions.
Example 17 includes the memory processing unit of example 15, wherein the plurality of second memory regions comprises a plurality of resistive random access memory (ReRAM) regions.
Example 18 includes the memory processing unit of example 15, wherein the data stream comprises a pipelined data stream.
Example 19 includes the memory processing unit of example 15, wherein each of the plurality of second memory regions includes a plurality of processing cores arranged in a column series.
Example 20 includes the memory processing unit of example 19, wherein the plurality of processing cores in one or more of the plurality of second memory regions are configured to concurrently execute respective computational functions.
Example 21 includes the memory processing unit of example 20, wherein the plurality of processing cores in one or more of the plurality of second memory regions perform respective computational functions on data of a same frame.
Example 22 includes the memory processing unit of example 15, wherein the data moved between non-adjacent ones of the plurality of first memory regions and plurality of second memory regions includes an edge output.
Example 23 includes the memory processing unit of example 15, wherein the data flowing within each given first memory region of the plurality of first memory regions in the column direction includes a partial sum.
Example 24 includes the memory processing unit of example 15, wherein the one or more neural network layers are mapped to respective second memory regions of the plurality of second memory regions.
Example 25 includes the memory processing unit of example 15, wherein a group of processing cores of a given second memory region of the plurality of second memory regions is capable of operating on a same neural network layer.
Example 26 includes the memory processing unit of example 15, wherein a plurality of neural network layers are mapped to respective second memory regions of the plurality of second memory regions.
Example 27 includes the memory processing unit of example 15, wherein the neural network layer is mapped to two or more of the plurality of second memory regions.
Example 28 includes the memory processing unit of example 15, wherein the control circuitry comprises centralized control circuitry.
Example 29 includes the memory processing unit of example 15, wherein the control circuitry is localized to one or more of the plurality of first memory regions, the plurality of second memory regions, and the communication link.
Example 30 includes the memory processing unit of example 15, wherein the control circuitry includes a centralized portion and a distributed portion that are localized to one or more of the plurality of first memory regions, the plurality of second memory regions, and the communication link.
Example 31 includes a method, comprising: receiving a model; configuring one or more processing regions of a plurality of processing regions of a memory processing unit to perform one or more computational functions of the model; and configuring one or more of a plurality of memory regions of the memory processing unit to control data flow from a first adjacent memory region of the plurality of memory regions into the one or more processing regions of the plurality of processing regions, wherein the plurality of processing regions are interleaved between the plurality of memory regions.
Example 32 includes the method of example 31, further comprising: receiving input data; and computing output data from the input data, the input data processed by the configured one or more of the plurality of processing regions and the configured one or more of the plurality of memory regions of the memory processing unit.
Example 33 includes the method of example 31, wherein the model includes a machine learning algorithm including an artificial neural network.
Example 34 includes the method of example 33, wherein the artificial neural network comprises a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN).
Example 35 includes the method of example 31, wherein the plurality of processing regions are column-interleaved between the plurality of memory regions.
Example 36 includes the method of example 35, wherein the plurality of memory regions includes a plurality of Static Random Access Memory (SRAM) regions.
Example 37 includes the method of example 35, wherein the plurality of processing regions includes one of: a plurality of resistive random access memory (ReRAM) regions, a plurality of Magnetic Random Access Memory (MRAM) regions, or a plurality of Phase Change Random Access Memory (PCRAM) regions.
Example 38 includes the method of example 31, wherein configuring the one or more of the plurality of processing regions comprises: programming one or more of the plurality of processing cores of one or more of the plurality of processing regions to perform the one or more computational functions.
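As a rough software analogy of the configuration method of examples 31-38, the sketch below receives a model, assigns its computational functions across processing regions, and marks the memory regions that feed them. Every name and the simple round-robin layer assignment are invented for illustration; the examples above do not prescribe an API or a mapping policy.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    name: str
    op: str                      # e.g. "conv", "relu", "pool"


@dataclass
class ProcessingRegion:
    assigned: list = field(default_factory=list)


def configure(processing_regions, memory_regions, model):
    # Configure processing regions with the model's computational functions.
    for i, layer in enumerate(model):
        processing_regions[i % len(processing_regions)].assigned.append(layer.op)
    # Configure memory regions to stream data into adjacent processing regions.
    for mem in memory_regions:
        mem["mode"] = "feed-adjacent-processing-region"


model = [Layer("l0", "conv"), Layer("l1", "relu"), Layer("l2", "pool")]
regions = [ProcessingRegion() for _ in range(2)]
memories = [{} for _ in range(3)]   # memory regions interleave the two regions
configure(regions, memories, model)
print([r.assigned for r in regions])  # [['conv', 'pool'], ['relu']]
```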
The foregoing descriptions of specific embodiments of the present technology have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the technology and its practical applications, to thereby enable others skilled in the art to best utilize the technology and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.

Claims (24)

1. A memory processing unit, comprising:
a plurality of memory regions;
a plurality of processing regions interleaved among the plurality of memory regions, wherein one or more of the plurality of processing regions are configured to perform one or more computational functions;
one or more communication links coupled between the interleaved plurality of memory regions and plurality of processing regions, wherein the communication links are configured to move data between non-adjacent ones of the plurality of memory regions or the plurality of processing regions; and
one or more centralized or distributed control circuits configured to control a flow of data through each given processing region of the plurality of processing regions, from a first adjacent memory region of the plurality of memory regions to a second adjacent memory region of the plurality of memory regions.
2. The memory processing unit of claim 1, wherein:
the plurality of processing regions are column-interleaved among the plurality of memory regions; and
the one or more control circuits are configured to control the flow of data through each given processing region of the plurality of processing regions, from adjacent memory regions of the plurality of memory regions, in a cross-column direction.
3. The memory processing unit of claim 2, wherein each of the plurality of processing regions comprises a plurality of processing cores.
4. The memory processing unit of claim 3, wherein the control circuitry is further configured to control data flow between the processing cores in respective ones of the plurality of processing regions in a column direction.
5. The memory processing unit of claim 3, wherein the computation function is partitioned among the plurality of processing cores.
6. The memory processing unit of claim 1, wherein the one or more computational functions comprise one or more computational functions of a neural network.
7. The memory processing unit of claim 6, wherein the neural network comprises a plurality of layers, wherein each layer comprises one or more computational functions.
8. The memory processing unit of claim 1, wherein the control circuitry comprises a software layer configured to receive a neural network model and generate a configuration flow for configuring the plurality of memory regions and the plurality of processing regions.
9. The memory processing unit of claim 3, wherein each processing core comprises a processing element, one or more counters, one or more write-back registers, one or more controllers, one or more address translators, and one or more memory region interfaces.
10. The memory processing unit of claim 9, wherein each processing element comprises one or more memory arrays, one or more input registers, one or more accumulators, and one or more output registers.
11. A memory processing unit, comprising:
a plurality of first memory regions configured to store data;
a plurality of second memory regions column-interleaved between the plurality of first memory regions, wherein one or more of the plurality of second memory regions are configured to perform one or more computational functions;
a communication link coupled between a plurality of first memory regions and a plurality of second memory regions that are column-interleaved, the communication link configured to move data between non-adjacent ones of the plurality of first memory regions and the plurality of second memory regions; and
control circuitry configured to control a flow of data through each given first memory region of the plurality of first memory regions, from a first adjacent second memory region of the plurality of second memory regions to a second adjacent second memory region of the plurality of second memory regions, in a cross-column direction, and to control a flow of data within each given first memory region of the plurality of first memory regions in a column direction.
12. The memory processing unit of claim 11, wherein each of the plurality of second memory regions comprises a plurality of processing cores arranged in a column series.
13. The memory processing unit of claim 11, wherein a plurality of processing cores in one or more of the plurality of second memory regions are configured to simultaneously perform respective computational functions.
14. The memory processing unit of claim 13, wherein the plurality of processing cores in one or more of the plurality of second memory regions perform respective computational functions on data of a same frame.
15. The memory processing unit of claim 11, wherein the data moved between non-adjacent ones of the first and second plurality of memory regions comprises an edge output.
16. The memory processing unit of claim 11, wherein data flowing within each given first memory region of the plurality of first memory regions in the column direction comprises a partial sum.
17. The memory processing unit of claim 11, wherein one or more neural network layers are mapped to respective second memory regions of the plurality of second memory regions.
18. The memory processing unit of claim 11, wherein a group of processing cores of a given second memory region of the plurality of second memory regions is capable of operating on a same neural network layer.
19. The memory processing unit of claim 11, wherein a plurality of neural network layers are mapped to respective second memory regions of the plurality of second memory regions.
20. The memory processing unit of claim 11, wherein a neural network layer is mapped to two or more of the plurality of second memory regions.
21. A method, comprising:
receiving a model;
configuring one or more processing regions of a plurality of processing regions of a memory processing unit to perform one or more computational functions of the model; and
configuring one or more of a plurality of memory regions of the memory processing unit to control data flow from a first adjacent memory region of the plurality of memory regions into the one or more processing regions of the plurality of processing regions, wherein the plurality of processing regions are interleaved between the plurality of memory regions.
22. The method of claim 21, further comprising:
receiving input data; and
computing output data from the input data, the input data processed by the configured one or more of the plurality of processing regions and the configured one or more of the plurality of memory regions of the memory processing unit.
23. The method of claim 21, wherein the model comprises a machine learning algorithm comprising an artificial neural network.
24. The method of claim 21, wherein configuring the one or more of the plurality of processing regions comprises: programming one or more of the plurality of processing cores of one or more of the plurality of processing regions to perform the one or more computational functions.
CN202080049322.9A 2019-05-07 2020-04-23 Memory processing unit architecture Pending CN114072778A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201962844644P 2019-05-07 2019-05-07
US62/844,644 2019-05-07
US16/841,544 US11488650B2 (en) 2020-04-06 2020-04-06 Memory processing unit architecture
US16/841,544 2020-04-06
PCT/US2020/029413 WO2020226903A1 (en) 2019-05-07 2020-04-23 Memory processing unit architecture

Publications (1)

Publication Number Publication Date
CN114072778A true CN114072778A (en) 2022-02-18

Family

ID=73050861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080049322.9A Pending CN114072778A (en) 2019-05-07 2020-04-23 Memory processing unit architecture

Country Status (3)

Country Link
EP (1) EP3966698A4 (en)
CN (1) CN114072778A (en)
WO (1) WO2020226903A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022047403A1 (en) * 2020-08-31 2022-03-03 Zidan Mohammed Memory processing unit architectures and configurations

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004006103A1 (en) * 2002-07-09 2004-01-15 Globespanvirata Incorporated Method and system for improving access latency of multiple bank devices
US8390325B2 (en) * 2006-06-21 2013-03-05 Element Cxi, Llc Reconfigurable integrated circuit architecture with on-chip configuration and reconfiguration
KR20100100395A (en) * 2009-03-06 2010-09-15 삼성전자주식회사 Memory system having multiple processors
US8819359B2 (en) * 2009-06-29 2014-08-26 Oracle America, Inc. Hybrid interleaving in memory modules by interleaving physical addresses for a page across ranks in a memory module
US9754056B2 (en) * 2010-06-29 2017-09-05 Exxonmobil Upstream Research Company Method and system for parallel simulation models
US10417555B2 (en) * 2015-05-29 2019-09-17 Samsung Electronics Co., Ltd. Data-optimized neural network traversal
CN113918481A (en) * 2017-07-30 2022-01-11 纽罗布拉德有限公司 Memory chip

Also Published As

Publication number Publication date
WO2020226903A1 (en) 2020-11-12
EP3966698A1 (en) 2022-03-16
EP3966698A4 (en) 2023-01-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination