US20230222315A1 - Systems and methods for energy-efficient data processing - Google Patents
Systems and methods for energy-efficient data processing
- Publication number
- US20230222315A1 (U.S. application Ser. No. 18/114,766)
- Authority
- US
- United States
- Prior art keywords
- input
- data
- neuron
- result
- locations
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims description 46
- 238000012545 processing Methods 0.000 title claims description 19
- 210000002569 neuron Anatomy 0.000 claims description 21
- 230000004913 activation Effects 0.000 claims description 18
- 230000004044 response Effects 0.000 claims description 7
- 230000006870 function Effects 0.000 description 24
- 230000008569 process Effects 0.000 description 24
- 238000010801 machine learning Methods 0.000 description 12
- 239000011159 matrix material Substances 0.000 description 6
- 238000007792 addition Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 3
- 210000004027 cell Anatomy 0.000 description 3
- 238000012163 sequencing technique Methods 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000003491 array Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000008030 elimination Effects 0.000 description 1
- 238000003379 elimination reaction Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000012827 research and development Methods 0.000 description 1
- 210000000352 storage cell Anatomy 0.000 description 1
- 230000017105 transposition Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0207—Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0875—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/065—Analogue means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1028—Power efficiency
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1041—Resource optimization
- G06F2212/1044—Space efficiency improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/45—Caching of specific data in cache memory
- G06F2212/454—Vector or matrix data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present disclosure relates generally to data processing. More particularly, the present disclosure relates to systems and methods for improving utilization of computing and memory resources when performing arithmetic operations, such as matrix multiplications.
- Machine Learning is an exciting area of research and development that enables computation of algorithms and solutions previously infeasible in “classic” computing.
- most existing implementations make use of general-purpose CPUs or graphics processing units (GPUs). While delivering correct and satisfactory results in many cases, the energy needs of such implementations oftentimes preclude the use of computationally challenging machine learning algorithms in constrained environments such as battery operated sensors, small microcontrollers, and the like.
- multipliers are scalar machines that use a CPU or GPU as their computation unit and use registers and a cache to process data stored in memory relying on a series of software and hardware matrix manipulation steps, such as address generation, transpositions, bit-by-bit addition and shifting, converting multiplications into additions and outputting the result into some internal register.
- FIG. 1 is a general illustration of a simplified prior art fully connected network.
- FIG. 2 illustrates an exemplary memory structure with inline multipliers and adder according to various embodiments of the present disclosure.
- FIG. 3 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 2
- FIG. 4 is a data flow example that illustrates the process for energy-efficient data processing shown in FIG. 3 .
- FIG. 5 is a tabular listing of exemplary contents of a memory structure according to various embodiments of the present disclosure, such as the memory structure shown in FIG. 2 .
- FIG. 6 illustrates an exemplary tabular listing for FIG. 5 after a data loading step has been performed.
- FIG. 7 illustrates an exemplary tabular listing for FIG. 5 after activated computations are stored.
- FIG. 8 shows a prior art weight distribution
- FIG. 9 illustrates exemplary discrete weights in accordance with embodiments of the present disclosure.
- FIG. 10 illustrates an exemplary tabular listing for FIG. 5 after rounding.
- FIG. 11 illustrates a simplified example of rounding results according to various embodiments of the present disclosure.
- FIG. 12 illustrates an exemplary tabular listing for weight sharing according to various embodiments of the present disclosure.
- FIG. 13 illustrates an exemplary tabular listing for combining entries according to various embodiments of the present disclosure.
- FIG. 14 illustrates the table in FIG. 13 after sorting and adding a binary representation according to various embodiments of the present disclosure.
- FIG. 15 illustrates the table in FIG. 14 after replacing Read Source address bits, according to various embodiments of the present disclosure.
- FIG. 16 illustrates an exemplary memory structure that utilizes column weights, according to various embodiments of the present disclosure.
- FIG. 17 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 16 .
- FIG. 18 illustrates a simplified system utilizing a memory structure according to various embodiments of the present disclosure.
- FIG. 19 illustrates an alternate system utilizing a memory structure that uses column weights according to various embodiments of the present disclosure.
- connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
- a service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
- FIG. 1 is a general illustration of a simplified prior art fully connected network.
- Network 100 has four inputs 102 (denoted as 0.1, 0.2, 0.3, and 0.4), three hidden layers 112-132 (having neurons denoted as 1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2, and 3.3), and three outputs 152 (denoted as 4.1, 4.2, 4.3). It is pointed out that the diagram in FIG. 1 is used only for demonstration purposes and ease of illustration; practical machine learning models may operate on anywhere from hundreds to more than millions of neurons.
- the larger the network 100, the larger the number of required multiplications; thus, the energy impact will follow O(n²), where n represents the number of neurons in the network. Therefore, reducing the energy impact of arithmetic operations, such as multiplications and additions, should be of utmost importance when designing low-power machine learning and similar applications.
- FIG. 2 illustrates an exemplary memory structure with inline multipliers and an adder according to various embodiments of the present disclosure.
- Memory structure 200 comprises memory elements S that store read sources 202, memory elements W that store weights 206, memory elements X that store data 210, memory elements T that store write targets 220 and output write enable signals, and multipliers Mi 230, all arranged in rows 250.
- memory elements comprise circuitry, such as logic circuits that control memory structure 200 .
- Memory structure 200 further comprises adder 240 that may be shared by rows 250 .
- activation function 242 and sequence number L 244 are shown outside of memory structure 200, e.g., to facilitate easy sharing of circuit resources, a person of skill will appreciate that, in embodiments, activation function 242, sequence number L 244, and any number of other circuit components may be integrated into memory structure 200.
- memory structure 200 may be controlled by a state machine (not shown) that may be implemented as a hardware state machine or a software state machine.
- multiple instances of memory structure 200 may be used and combined (e.g., in a column arrangement, using digital components, using modular components, etc.) to alleviate physical restrictions such as maximum dimensions for memory structure 200 .
- Variations may implement any number of data elements X and/or multiple weight elements W per row 250 .
- memory structure 200 may be implemented using content addressable memory cells or similar circuitry that may use logic elements in any number and arrangement to control memory structure 200 and achieve the objectives of the present disclosure.
- the content addressable memory cells may use commonly available storage cells that store the actual 0 and 1 values, but that are subject to the interconnectivity of the content addressable memory cells.
- multipliers and adders may be implemented in different ways, for example using analog circuits, digital circuits, in-line within memory structure 200 , or at the bottom of the memory array.
- An example of an analog implementation for multipliers and adders are the systems and methods disclosed in U.S. Provisional Patent Application No. 62/740,691 (Docket No. 20057-2258P), entitled “SYSTEMS AND METHODS FOR ENERGY-EFFICIENT ANALOG MATRIX MULTIPLICATION FOR MACHINE LEARNING PROCESSES,” naming as inventors Sung Ung Kwak and Robert Michael Muchsel, and filed Oct. 3, 2018, which application is hereby incorporated herein by reference as to its entire content and for all purposes.
- read source 202 may cause enable signals 204 to be activated, e.g., to enable, activate, or control a read operation.
- those memory elements may cause write targets 220 to output enable signals 205 .
- a controller (not shown in FIG. 2 ) controls multipliers 230 to compute the product of weights Wi 206 and data Xi 210 , and controls adder 240 to compute the sum of the products Wi*Xi.
- the sequencing of operations is discussed next with reference to FIG. 3 .
- FIG. 3 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 2 .
- step 304 it is determined whether a stop condition has been met. If so, process 300 may resume with step 320 where results are collected.
- a sequencer may apply a value, L, to a read sequence input. In embodiments, this causes read sources S that contain the value L to output their enable signal.
- the enabled data items X and weights W may be multiplied using multipliers Mi.
- the products may be applied to the adder A to obtain Y′.
- an activation function, g() may be applied to the output Y′ of the adder A to obtain output Y.
- the sequencer applies the calculated output Y to the data inputs.
- the value L is applied to the write target inputs, e.g., via write sequence(s). In embodiments, this may cause all write targets T that contain the value L to output their enable signal such that, consequently, Y is written to the enabled data items X.
- L may be increased and process 300 may resume with step 304 to determine whether the stop condition has been reached.
- FIG. 4 is a data flow example that illustrates the process for energy-efficient data processing shown in FIG. 3 .
- the example illustrates a sequence involving neuron 1.2 shown in FIG. 1.
- the four enabled data items depicted as X0.1, X0.2, X0.3, and X0.4 in column 454 , and weights W, depicted as W5, W6, W7, W8 in column 456 , are multiplied, e.g., by a multiplier circuit illustrated in FIG. 2 .
- the products (X*W) of the multiplication may then be input to adder 404 that computes X0.1*W5+X0.2*W6+X0.3*W7+X0.4*W8 to output an intermediary result 406 .
- output 406 of adder 404 may be provided to activation function 408 that outputs the result 410 of this calculation as Y.
- the illustrated multiplication operations may be performed in parallel and in place, such that data does not have to be moved far to be applied to adder 404 , thus, resulting in an energy-efficient arrangement.
- the sequencing process may be implemented using analog circuitry that, advantageously, further increases energy efficiency.
- FIG. 5 is a tabular listing of exemplary contents of a memory structure according to various embodiments of the present disclosure, such as the memory structure shown in FIG. 2 .
- the exemplary values are used to illustrate examples for a fully connected neural network, e.g., the network shown in FIG. 1 .
- For each row entry, table 500 in FIG. 5 shows values for (1) read source, S, 510; (2) data item, X, 520; (3) one or more weights, W, 530, or bias items 530; and (4) write target, T, 540. It is noted that for clarity of the description, random values are shown for weights 530.
- all memory elements with a matching value may be activated to enable a read operation; similarly, given a value for a write target T, all memory elements with matching write target T value may be activated.
- values listed in read source 510 and write target 540 are named in the format “layer.number,” e.g., “2.1.,” where “layer” refers to the neuron layer. It is noted that, as with other numbering schemes herein, this numbering is arbitrary.
- Entries denoted as “0.0” in write target 540 may represent memory that has not been written to or have been initialized as having a “zero” value.
- Table 500 in FIG. 5 shows an initial state of the data structure before data is loaded. In embodiments, such entries may be used for bias weights that may be preloaded into a memory structure. The data for bias entries may also be preloaded into the memory structure, here as having the value 1.000.
- the entries 0.1 through 0.4 in the write target 540 denote targets for input data, such as sensor data.
- the entries associated with the values 4.1, 4.2, and 4.3 in read source 510 i.e., entries 3.1, 3.2, and 3.3 in write target 540 may be used to collect the output data of the last fully connected layer.
- FIG. 6 illustrates an exemplary tabular listing for FIG. 5 after a data loading step has been performed.
- the activated computation result is stored in those locations that have a write target 640 of j.k.
- the data structure may now be represented as in FIG. 7 , in which the changes resulting from the activation function are highlighted.
- this process may be repeated for all remaining j.k, here, N1.2, N1.3, N2.1, N2.2, N3.1, N3.2, and N3.3.
- the data values from the last layer may then be collected, e.g., by software, and used directly or serve as input for a classification function, e.g., softmax.
- a programmable lookup table may be employed. If, for example, data values are expressed as 8-bit integers, a table with 256 entries may sufficiently describe any possible activation function.
- hardware accelerators may be used for ReLU or other activation functions.
- weight distribution 800 in FIG. 8 which illustrates a typical weight distribution, in order to derive quantized weights.
- the original weights in distribution 800 could be rounded to the nearest discrete weight, such that, after rounding, the data structure may look like that in FIG. 10 .
- FIG. 11 illustrates a simplified example in which rounding results in the elimination of much of the computations that have to be performed. While, in practice, the reduction may not be so extreme for a large network, it might still be considerable. It is noted that some optimizations may not necessarily save computation time, but rather reduce storage requirements. For example, while matching rows for a neuron may be processed in parallel, the sequence of neurons may be computed sequentially. Further, if no weight sharing (discussed next) is used, then there may be no need to encode 0.0 weights at all.
- the basic data structure shown above comprises Read Source, Write Target, and an associated Weight.
- implementation options allow for multiple weights per data structure entry. As a result, if, e.g., two weights are used, then the data structures for N k.l and N k.l+1 may be combined. It is understood that once weights are combined in this manner, it should be possible to encode a 0.0 weight. It is further understood, that inefficiencies may be introduced if the node count per layer is not evenly divisible by the number of shared weights (e.g., N 1.3 and N 3.3 in the example in FIG. 12 ).
- a plurality of entries that have the same data, (rounded) Weights, and Write Target may be combined, e.g., by expressing address bits in the Read Source as “don't care.” In this flexible approach, the Read Source may then match regardless of whether the particular applied address bit is 0 or 1.
- Several of the ternary elements may be implemented, for example, 6-bits.
- the data structure may be re-sorted to yield significant savings. For example, a partial structure (after removing 0-weights) may provide the results in FIG. 13 . It is noted that, similar to weight sharing, Ternary Read Sources do not necessarily save computation time.
- sorting this structure by Write Target-Weight-Data and adding a binary representation of the decimal Write Target may provide the results in FIG. 14 .
- replacing Read Source address bits with ‘X’ without renumbering yields a smaller structure shown in FIG. 15 .
- neurons may be renumbered (not shown), such that the first two table entries may be combined.
- yj.k = w0·Σm xm + w1·Σn xn + w2·Σp xp + …  (4)
- This embodiment comprises a plurality of memory structures (“columns”) having elements that each may comprise, e.g., (1) a read source S, (2) a data item X, and (3) a write target T. Further, each column C may be associated with a weight W C , as shown in FIG. 16 , which illustrates an exemplary memory structure that utilizes column weights, according to various embodiments of the present disclosure.
- FIG. 17 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 16 .
- step 1704 it is determined whether a stop condition has been met. If so, process 1700 may resume with step 1730 where results are collected.
- the sequencer may apply a value, L, to a read sequence input. In embodiments, this causes read sources S that contain the value L to output their enable signal.
- the enabled data items X may be summed by an adder that outputs sums that, at step 1710 , are multiplied with the column weights W, e.g., by using multipliers M, to obtain column-specific outputs.
- the column outputs are added by a column adder.
- the output of the column adder is processed through an activation module that may apply an activation function, g(), to the column adder output to obtain output Y.
- the sequencer applies the calculated output, Y, to the data inputs.
- the value L is applied to the write target inputs. In embodiments, this may cause all write targets T that contain the value L to output their enable signal such that Y is written to the enabled data items X.
- L may be increased and process 1700 may resume with step 1704 to determine whether the stop condition has been reached.
- FIG. 18 illustrates a simplified system utilizing a memory structure according to various embodiments of the present disclosure.
- Sequencer 1800 comprises read source 1802 , adder 1804 , inline multiplier 1806 , weights memory 1808 , data memory 1810 , write target 1812 , activation unit 1814 , and sequence number generator 1816 .
- read source 1802 receives sequence number 1820 from sequence number generator 1816, e.g., until a stop condition has been reached. If the sequence number 1820 matches a content of a memory element in read source 1802, then read source 1802 outputs an enable signal 1822 that enables data in weights memory 1808 and data memory 1810 to be multiplied by inline multiplier 1806 to generate products 1824 that are then added by adder 1804, which computes the sum of products 1824. In addition, memory elements in write target 1812 whose content matches sequence number 1820 may cause write target 1812 to output enable signals 1830.
- output 1836 of adder 1804 is provided to activation unit 1814 that applies an activation function to output 1836 to generate output 1832 that may then be fed back to the input of data memory 1810 to be written according to enable signals 1830 generated by write target 1812 in response to receiving sequence number 1820.
- sequence number generator 1816 may increment sequence number 1820 and provide a new sequence number to read source 1802 to close the loop.
- FIG. 19 illustrates an alternate system utilizing a memory structure that uses column weights according to various embodiments of the present disclosure.
- System 1900 comprises sequencer 1904 that is similar to sequencer 1800 in FIG. 18 .
- System 1900 in FIG. 19 further comprises column adder 1902, such as the column adders illustrated in FIG. 16, that may be coupled to any number of additional sequencers (not shown in FIG. 19) that share column adder 1902 and activation unit 1804. Similar to FIG. 16, the sequence number generator in FIG. 19 may be coupled to any number of additional sequencers.
- aspects of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed.
- the one or more non-transitory computer-readable media shall include volatile and non-volatile memory.
- alternative implementations are possible, including a hardware implementation or a software/hardware implementation.
- Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations.
- the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof.
- embodiments of the present invention may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations.
- the media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind known or available to those having skill in the relevant arts.
- Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices.
- Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter.
- Embodiments of the present invention may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device.
- Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Complex Calculations (AREA)
- Memory System (AREA)
Abstract
An energy-efficient sequencer comprising inline multipliers and adders causes a read source that contains matching values to output an enable signal to enable a data item prior to using a multiplier to multiply the data item with a weight to obtain a product for use in a matrix-multiplication in hardware. A second enable signal causes the output to be written to the data item.
Description
- The present application is a continuation application of and claims priority benefit, under 35 U.S.C. § 120, to co-pending and commonly-assigned U.S. patent application Ser. No. 16/590,265, filed on Oct. 1, 2019, which claims priority, under 35 U.S.C. §119(e), to co-pending and commonly-assigned U.S. provisional patent application No. 62/740,700, filed on Oct. 3, 2018, entitled “Systems and Methods for Energy-Efficient Data Processing,” listing as inventors Mark Alan Lovell, Robert Michael Muchsel, and Donald Wood Loomis III, which application is herein incorporated by reference as to its entire content. Each reference mentioned in this patent document is incorporated by reference herein in its entirety.
- The present disclosure relates generally to data processing. More particularly, the present disclosure relates to systems and methods for improving utilization of computing and memory resources when performing arithmetic operations, such as matrix multiplications.
- Machine Learning is an exciting area of research and development that enables computation of algorithms and solutions previously infeasible in “classic” computing. However, most existing implementations make use of general-purpose CPUs or graphics processing units (GPUs). While delivering correct and satisfactory results in many cases, the energy needs of such implementations oftentimes preclude the use of computationally challenging machine learning algorithms in constrained environments such as battery operated sensors, small microcontrollers, and the like.
- This is mainly due to the fact that arithmetic operations are typically performed in software that operates on a general-purpose computing device, such as a conventional microprocessor. This approach is very costly in terms of both power and time, and for many computationally intensive applications (e.g., real-time applications) general hardware is unable to perform the necessary operations in a timely manner as the rate of calculations is limited by the computational resources and capabilities of existing hardware designs.
- Further, using a general processor's arithmetic functions to generate intermediate results comes at the expense of computing time due to the added steps of storing and retrieving intermediate results from various memory locations to complete an operation. For example, many conventional multipliers are scalar machines that use a CPU or GPU as their computation unit and use registers and a cache to process data stored in memory relying on a series of software and hardware matrix manipulation steps, such as address generation, transpositions, bit-by-bit addition and shifting, converting multiplications into additions and outputting the result into some internal register.
- Furthermore, computationally demanding applications such as convolutions oftentimes require a software function be embedded in the microprocessor and be used to convert convolution operations into alternate matrix-multiply operations. This involves rearranging and reformatting image data and weight data into two matrices that then are raw matrix-multiplied. There exist no mechanisms that efficiently select, use, and reuse data, while avoiding generating redundant data. Software must access the same locations of a standard memory and read, re-fetch, and write the same data over and over again when performing multiplication and other operations, which is computationally very burdensome and creates a bottleneck that curbs the usability of machine learning applications.
- As the amount of data subject to matrix multiplication operations increases and the complexity of operations continues to grow, the inability to reuse much of the data coupled with the added steps of storing and retrieving intermediate results from memory to complete an arithmetic operation present only some of the shortcomings of existing designs. Therefore, conventional hardware and methods are not well-suited for the ever-increasing demands for speed and the performance that are required to perform a myriad of complex processing steps involving large amounts of data in real-time.
- Accordingly, what is needed are high-computational-throughput systems and methods that move and process data in a rapid and energy-efficient manner to drastically reduce the number of arithmetic operations and storage requirements, e.g., for relatively small computing devices that can take advantage of and integrate machine learning processes without undue energy burden or excessive hardware cost.
- References will be made to embodiments of the invention, examples of which may be illustrated in the accompanying figures. These figures are intended to be illustrative, not limiting. Although the invention is generally described in the context of these embodiments, it should be understood that it is not intended to limit the scope of the invention to these particular embodiments. Items in the figures may be not to scale.
-
FIG. 1 is a general illustration of a simplified prior art fully connected network. -
FIG. 2 illustrates an exemplary memory structure with inline multipliers and adder according to various embodiments of the present disclosure. -
FIG. 3 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 2. -
FIG. 4 is a data flow example that illustrates the process for energy-efficient data processing shown in FIG. 3. -
FIG. 5 is a tabular listing of exemplary contents of a memory structure according to various embodiments of the present disclosure, such as the memory structure shown in FIG. 2. -
FIG. 6 illustrates an exemplary tabular listing for FIG. 5 after a data loading step has been performed. -
FIG. 7 illustrates an exemplary tabular listing for FIG. 5 after activated computations are stored. -
FIG. 8 shows a prior art weight distribution. -
FIG. 9 illustrates exemplary discrete weights in accordance with embodiments of the present disclosure. -
FIG. 10 illustrates an exemplary tabular listing for FIG. 5 after rounding. -
FIG. 11 illustrates a simplified example of rounding results according to various embodiments of the present disclosure. -
FIG. 12 illustrates an exemplary tabular listing for weight sharing according to various embodiments of the present disclosure. -
FIG. 13 illustrates an exemplary tabular listing for combining entries according to various embodiments of the present disclosure. -
FIG. 14 illustrates the table in FIG. 13 after sorting and adding a binary representation according to various embodiments of the present disclosure. -
FIG. 15 illustrates the table in FIG. 14 after replacing Read Source address bits, according to various embodiments of the present disclosure. -
FIG. 16 illustrates an exemplary memory structure that utilizes column weights, according to various embodiments of the present disclosure. -
FIG. 17 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 16. -
FIG. 18 illustrates a simplified system utilizing a memory structure according to various embodiments of the present disclosure. -
FIG. 19 illustrates an alternate system utilizing a memory structure that uses column weights according to various embodiments of the present disclosure. - In the following description, for purposes of explanation, specific details are set forth in order to provide an understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these details. Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.
- Connections between components or systems within the figures are not intended to be limited to direct connections. Rather, data between these components may be modified, re-formatted, or otherwise changed by intermediary components. Also, additional or fewer connections may be used. It shall also be noted that the terms “coupled,” “connected,” or “communicatively coupled” shall be understood to include direct connections, indirect connections through one or more intermediary devices, and wireless connections.
- Reference in the specification to “one embodiment,” “preferred embodiment,” “an embodiment,” or “embodiments” means that a particular feature, structure, characteristic, or function described in connection with the embodiment is included in at least one embodiment of the invention and may be in more than one embodiment. Also, the appearances of the above-noted phrases in various places in the specification are not necessarily all referring to the same embodiment or embodiments.
- The use of certain terms in various places in the specification is for illustration and should not be construed as limiting. A service, function, or resource is not limited to a single service, function, or resource; usage of these terms may refer to a grouping of related services, functions, or resources, which may be distributed or aggregated.
- In this document, the terms “in-line,” “in place,” and “local” are used interchangeably. Furthermore, the terms “include,” “including,” “comprise,” and “comprising,” shall be understood to be open terms and any lists that follow are examples and not meant to be limited to the listed items. Any headings used herein are for organizational purposes only and shall not be used to limit the scope of the description or the claims. Each reference mentioned in this patent document is incorporated by reference herein in its entirety.
- It is noted that embodiments described herein are given in the context of machine learning, but one skilled in the art shall recognize that the teachings of the present disclosure are not limited to machine learning hardware and may be applied to various other networks and applications that involve arithmetic operations that may be used in other contexts. For example, although embodiments herein are discussed mainly in the context of convolutions, a person of skill in the art will appreciate that a deconvolution operation can also be structured as matrix-matrix type multiply operation and, thus, the principles of the present invention are equally applicable to deconvolutions. Furthermore, other types of mathematical operations may be implemented in accordance with various embodiments of this disclosure.
- Similarly, embodiments herein are discussed mainly in the context of fully connected layers. Yet, one of skill in the art will appreciate that this does not limit the invention to this particular type of neural network; rather, the teachings of the present invention may be equally applied to other types of networks, such as image processing applications that use accelerators for convolutions and deconvolutions.
- A. Fully Connected Networks
- Most machine learning processes make use of so-called “fully-connected layers” and sub-layers. Some neural networks exclusively use fully connected layers, while others make at least partial use of them.
FIG. 1 is a general illustration of a simplified prior art fully connected network. -
Network 100 has four inputs 102 (denoted as 0.1, 0.2, 0.3, and 0.4), three hidden layers 112-132 (having neurons denoted as 1.1, 1.2, 1.3, 2.1, 2.2, 3.1, 3.2, and 3.3), and three outputs 152 (denoted as 4.1, 4.2, 4.3). It is pointed out that the diagram in FIG. 1 is used only for demonstration purposes and ease of illustration; practical machine learning models may operate on anywhere from hundreds to more than millions of neurons. - As is known in the art, processing a machine learning algorithm entails a great number of matrix multiplication steps. In the example shown in
FIG. 1 , e.g., the output y1.1 of neuron 1.1 is calculated as: -
y1.1 = g(Σi xi·wi)  (1) - yielding y1.1 = g(x0.1·w0.1 + x0.2·w0.2 + x0.3·w0.3 + x0.4·w0.4 + b1),
-
- where g is the activation function, xi are data elements, wi are weights and b1 is a bias value.
- As will be understood by a person of skill in the art, the
larger the network 100, the larger the number of required multiplications; thus, the energy impact will follow O(n²), where n represents the number of neurons in the network. Therefore, reducing the energy impact of arithmetic operations, such as multiplications and additions, should be of utmost importance when designing low-power machine learning and similar applications.
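By way of illustration only, and not as part of the claimed hardware, the per-neuron computation of Eq. 1 can be sketched in a few lines of Python; the input values, weights, bias, and sigmoid activation below are placeholder assumptions chosen merely to mirror the example of FIG. 1.

```python
import math

def sigmoid(z):
    # Example activation function g(); the disclosure does not mandate a particular g().
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, g=sigmoid):
    """Eq. 1: y = g(sum_i x_i * w_i + b); one multiply-accumulate per input."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w
    return g(acc)

# Placeholder values for the four inputs feeding neuron 1.1 of FIG. 1.
x = [0.41, 0.52, 0.63, 0.74]
w = [-0.002849, -0.017828, 0.006862, -0.000359]
y_1_1 = neuron_output(x, w, bias=-0.061022)
```

For a fully connected network, every neuron repeats this loop over all of its inputs, which is what drives the O(n²) multiply count noted above.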
-
FIG. 2 illustrates an exemplary memory structure with inline multipliers and an adder according to various embodiments of the present disclosure. Memory structure 200 comprises memory elements S that store read sources 202, memory elements W that store weights 206, memory elements X that store data 210, memory elements T that store write targets 220 and output write enable signals, and multipliers Mi 230, all arranged in rows 250. - In embodiments, memory elements comprise circuitry, such as logic circuits that control
memory structure 200.Memory structure 200 further comprisesadder 240 that may be shared byrows 250. - It is noted that components, or modules, shown in diagrams are illustrative of exemplary embodiments of the invention and are meant to avoid obscuring the invention. It is also understood that throughout this document components may be described as separate functional units, which may comprise sub-units, but those skilled in the art will recognize that various components, or portions thereof, may be divided into separate components or may be integrated together, including integrated within a single system or component. For example, although
activation function 242 andsequence number L 244 are shown outside ofmemory structure 200, e.g., to facilitate easy sharing of circuit resources, person of skill will appreciate that, in embodiments,activation function 242,sequence number L 244, and any number of other circuit components may be integrated intomemory structure 200. - It is further noted that functions or operations discussed herein may be implemented as software components, hardware components, or a combination thereof. For example,
memory structure 200 may be controlled by a state machine (not shown) that may be implemented as a hardware state machine or a software state machine. - In embodiments, multiple instances of
memory structure 200 may be used and combined (e.g., in a column arrangement, using digital components, using modular components, etc.) to alleviate physical restrictions such as maximum dimensions formemory structure 200. Variations may implement any number of data elements X and/or multiple weight elements W perrow 250. - In embodiments,
memory structure 200 may be implemented using content addressable memory cells or similar circuitry that may use logic elements in any number and arrangement to controlmemory structure 200 and achieve the objectives of the present disclosure. In embodiments, the content addressable memory cells may use commonly available storage cells that store the actual 0 and 1 values, but that are subject to the interconnectivity of the content addressable memory cells. - One skilled in the art will recognize that the multipliers and adders may be implemented in different ways, for example using analog circuits, digital circuits, in-line within
memory structure 200, or at the bottom of the memory array. An example of an analog implementation for multipliers and adders are the systems and methods disclosed in U.S. Provisional Patent Application No. 62/740,691 (Docket No. 20057-2258P), entitled “SYSTEMS AND METHODS FOR ENERGY-EFFICIENT ANALOG MATRIX MULTIPLICATION FOR MACHINE LEARNING PROCESSES,” naming as inventors Sung Ung Kwak and Robert Michael Muchsel, and filed Oct. 3, 2018, which application is hereby incorporated herein by reference as to its entire content and for all purposes. - In operation, in response to a
particular sequence number 244 that matches the content of memory elements ofread source 202, readsource 202 may cause enablesignals 204 to be activated, e.g., to enable, activate, or control a read operation. Similarly, in response to the particularsequence number L 244 matching the content of memory elements ofwrite target T 220, those memory elements may cause writetargets 220 to output enable signals 205. - In embodiments, a controller (not shown in
FIG. 2 ) controlsmultipliers 230 to compute the product ofweights Wi 206 anddata Xi 210, and controls adder 240 to compute the sum of the products Wi*Xi. The sequencing of operations is discussed next with reference toFIG. 3 . - C. Sequencing
-
FIG. 3 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 2. Process 300 begins at step 302 when the value of L is initialized, e.g., to L=1.
- At step 304, it is determined whether a stop condition has been met. If so, process 300 may resume with step 320 where results are collected.
- If, at step 304, a stop condition has not been met, then at step 306 a sequencer may apply a value, L, to a read sequence input. In embodiments, this causes read sources S that contain the value L to output their enable signal.
- At step 308, the enabled data items X and weights W may be multiplied using multipliers Mi.
- At step 310, the products may be applied to the adder A to obtain Y′.
- At step 312, an activation function, g(), may be applied to the output Y′ of the adder A to obtain output Y.
- At step 314, the sequencer applies the calculated output Y to the data inputs.
- At step 316, the value L is applied to the write target inputs, e.g., via write sequence(s). In embodiments, this may cause all write targets T that contain the value L to output their enable signal such that, consequently, Y is written to the enabled data items X.
- At step 318, L may be increased and process 300 may resume with step 304 to determine whether the stop condition has been reached.
-
FIG. 4 is a data flow example that illustrates the process for energy-efficient data processing shown in FIG. 3. The example illustrates a sequence involving neuron 1.2 shown in FIG. 1. As depicted in FIG. 4, sequencer 402 may, first, apply a value L, e.g., L=1.2, to the input of read sequence 452. In embodiments, this causes those read sources that contain the value 1.2, depicted in FIG. 4 as the first four rows of column 452 in table 450, to output their enable signals. As a result, the four enabled data items, depicted as X0.1, X0.2, X0.3, and X0.4 in column 454, and weights W, depicted as W5, W6, W7, W8 in column 456, are multiplied, e.g., by a multiplier circuit illustrated in FIG. 2.
intermediary result 406. In embodiments,output 406 ofadder 404 may be provided toactivation function 408 that outputs theresult 410 of this process as of this calculation as Y. In embodiments,sequencer 402 applies 460 the calculated result 410 (e.g., Y=X1.1) to the data input, as indicated incolumn 454, and applies 470 the value L=1.2 to the write target input, as indicated incolumn 458. In embodiments, this causes write targets T that contain the value L=1.2 (shown in column 458) to output their enable signal, and consequently result 410 may be written 480 to the enabled data items X, as indicated incolumn 454. Finally, the value of L is increased, e.g., to L=1.3, and the sequence is repeated until a stop condition is met. - One of skill in the art will appreciate that the illustrated multiplication operations may be performed in parallel and in place, such that data does not have to be moved far to be applied to
adder 404, thus, resulting in an energy-efficient arrangement. In embodiments, the sequencing process may be implemented using analog circuitry that, advantageously, further increases energy efficiency. - It is noted that the following examples, values, and results are provided by way of illustration and are obtained under specific conditions using a specific embodiment or embodiments; accordingly, neither these examples nor their results shall be used to limit the scope of the current disclosure.
-
FIG. 5 is a tabular listing of exemplary contents of a memory structure according to various embodiments of the present disclosure, such as the memory structure shown in FIG. 2. The exemplary values are used to illustrate examples for a fully connected neural network, e.g., the network shown in FIG. 1. For each row entry, table 500 in FIG. 5 shows values for (1) read source, S, 510; (2) data item, X, 520; (3) one or more weights, W, 530, or bias items 530; and (4) write target, T, 540. It is noted that for clarity of the description, random values are shown for weights 530.
- In
FIG. 5, values listed in read source 510 and write target 540 are named in the format “layer.number,” e.g., “2.1,” where “layer” refers to the neuron layer. It is noted that, as with other numbering schemes herein, this numbering is arbitrary.
write target 540 may represent memory that has not been written to or have been initialized as having a “zero” value. Table 500 inFIG. 5 shows an initial state of the data structure before data is loaded. In embodiments, such entries may be used for bias weights that may be preloaded into a memory structure. The data for bias entries may also be preloaded into the memory structure, here as having the value 1.000. - The entries 0.1 through 0.4 in the
write target 540 denote targets for input data, such as sensor data. The entries associated with the values 4.1, 4.2, and 4.3 inread source 510, i.e., entries 3.1, 3.2, and 3.3 inwrite target 540 may be used to collect the output data of the last fully connected layer.FIG. 6 illustrates an exemplary tabular listing forFIG. 5 after a data loading step has been performed. - In
FIG. 6 it is assumed that input data xi, here, having the values x1=0.41, x2=0.52, x3=0.63, and x4=0.74, are loaded into locations addressed by 0.i inwrite target 640. The changed data is highlighted in table 600. After the data loading step, in embodiments, computations may commence by selecting, for each neuron Nj.k, all readsources 610 addressed by j.k to output, for each neuron Nj.k, the sum of the individual products passed through an activation function as follows: -
Yj.k = output(Nj.k) = g(Σ S=j.k data·weight)  (2)
-
Y j.k =g(0.41·−0.002849+0.52·−0.017828+0.63·0.006862+0.74·−0.000359+1.000·−0.061022) - Assuming that g() is a sigmoid function, Eq. 2 yields g(−0.06740325)=0.483156.
- In embodiments, the activated computation result is stored in those locations that have a
write target 640 of j.k. In the example above, for j.k=1.1, the data structure may now be represented as inFIG. 7 , in which the changes resulting from the activation function are highlighted. - In embodiments, this process may be repeated for all remaining j.k, here, N1.2, N1.3, N2.1, N2.2, N3.1, N3.2, and N3.3. The data values from the last layer (layer 4 in the example in
FIG. 1 ) may then be collected, e.g., by software, and used directly or serve as input for a classification function, e.g., softmax. - E. Activation Function Lookup Table
- Several known activation functions, such as sigmoid, ReLU, Leaky ReLU, and ELU, are commonly used with relatively good results. In embodiments, in order to save on compute time and to allow flexibility, a programmable lookup table may be employed. If, for example, data vales are expressed as 8-bit integers, a table with 256 entries may sufficiently describe any possible activation function. In embodiments, hardware accelerators may be used for ReLU or other activation functions.
- F. Optimizations
- In the examples above, both data and weights were displayed in floating point format. In embodiments, data may be expressed as integers, e.g., 8-bit integers, and the size of weights may be severely reduced by “quantizing” them. In embodiments, this is accomplished by applying a process to a weight distribution, such as
weight distribution 800 inFIG. 8 , which illustrates a typical weight distribution, in order to derive quantized weights. For example, givenweight distribution 800 inFIG. 8 and the following seven discrete weights in table 900 inFIG. 9 , the original weights indistribution 800 could be rounded to the nearest discrete weight, such that, after rounding, the data structure may look like that inFIG. 10 . - A person of skill in the art will appreciate that rounding is only one simple way to quantize weights. There is active research in the art that aims to improve the process and means of developing machine learning algorithms that use quantized weights. Accordingly, any known process in the art may be used to obtain or derive quantized weights. One of skill in the art will further appreciate that, in embodiments, data structure entries having a weight of 0.0 may advantageously be removed during the construction of the network such as to 1) reduces storage requirement, 2) eliminates a significant number of computations, and 3) reduce power consumption as overall data movement is reduced.
-
FIG. 11 illustrates a simplified example in which rounding results in the elimination of many of the computations that would otherwise have to be performed. While, in practice, the reduction may not be as extreme for a large network, it may still be considerable. It is noted that some optimizations may not necessarily save computation time, but rather reduce storage requirements. For example, while matching rows for a neuron may be processed in parallel, the sequence of neurons may be computed sequentially. Further, if no weight sharing (discussed next) is used, then there may be no need to encode 0.0 weights at all.
- G. Weight Sharing
- The basic data structure shown above comprises Read Source, Write Target, and an associated Weight. In embodiments, implementation options allow for multiple weights per data structure entry. As a result, if, e.g., two weights are used, then the data structures for Nk.l and Nk.l+1 may be combined. It is understood that once weights are combined in this manner, it should be possible to encode a 0.0 weight. It is further understood that inefficiencies may be introduced if the node count per layer is not evenly divisible by the number of shared weights (e.g., N1.3 and N3.3 in the example in
FIG. 12).
- H. Ternary Read Sources
- In embodiments, similar to the concept of weight sharing, a plurality of entries that have the same data, (rounded) Weights, and Write Target may be combined, e.g., by expressing address bits in the Read Source as “don't care.” In this flexible approach, the Read Source may then match regardless of whether the particular applied address bit is 0 or 1. Several of the Read Source bits may be implemented as ternary elements, for example, 6 bits.
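- A software analogue of such a ternary (“don't care”) match is sketched below; the 6-bit width and the helper name are assumptions made only for illustration:

```python
def ternary_match(pattern, address, width=6):
    # pattern is a string such as "0101XX"; 'X' bits match either 0 or 1.
    bits = format(address, f"0{width}b")
    return all(p in ("X", b) for p, b in zip(pattern, bits))

# One combined entry with two 'X' bits replaces the four entries 010100..010111.
print(ternary_match("0101XX", 0b010110))  # True
print(ternary_match("0101XX", 0b011010))  # False
```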
- Since, as mentioned previously, the numbering for Read Source and Write Target may be arbitrary, and the order of execution within a layer should not matter in most circumstances when no recurrent network is used, the data structure may be re-sorted to yield significant savings. For example, a partial structure (after removing 0-weights) may provide the results in
FIG. 13. It is noted that, similar to weight sharing, Ternary Read Sources do not necessarily save computation time. - In embodiments, sorting this structure by Write Target-Weight-Data and adding a binary representation of the decimal Write Target may provide the results in
FIG. 14. In embodiments, replacing Read Source address bits with ‘X’ without renumbering yields a smaller structure shown in FIG. 15. In embodiments, neurons may be renumbered (not shown), such that the first two table entries may be combined.
- I. Column Weights
- Assuming a limited number of discrete weights, instead of computing the sum of products as:
-
y_j.k = Σ_i x_i·w_i   (3)
- embodiments may use the following expression:
-
y_j.k = w_0·Σ_m x_m + w_1·Σ_n x_n + w_2·Σ_p x_p + …   (4)
- Advantageously, this approach allows for parallel computation of terms that share inputs, but that use different weights. This embodiment comprises a plurality of memory structures (“columns”) having elements that each may comprise, e.g., (1) a read source S, (2) a data item X, and (3) a write target T. Further, each column C may be associated with a weight W_C, as shown in
FIG. 16 , which illustrates an exemplary memory structure that utilizes column weights, according to various embodiments of the present disclosure. -
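- The equivalence of Eq. (3) and Eq. (4) may be illustrated by grouping inputs according to the discrete weight they share, so that each column requires only a single multiplication; the input and weight values below are illustrative only:

```python
from collections import defaultdict

# Illustrative inputs and per-input weights drawn from a small discrete set
x = [0.41, 0.52, 0.63, 0.74]
w = [0.25, -0.5, 0.25, -0.5]

# Eq. 3: one multiplication per input
y_eq3 = sum(xi * wi for xi, wi in zip(x, w))

# Eq. 4: sum the inputs that share a weight, then multiply once per column
columns = defaultdict(float)
for xi, wi in zip(x, w):
    columns[wi] += xi
y_eq4 = sum(wc * s for wc, s in columns.items())

print(y_eq3, y_eq4)  # both approximately -0.37; Eq. 4 needs one multiplication per column
```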
FIG. 17 is a flowchart of an illustrative process for energy-efficient data processing in accordance with various embodiments of the present disclosure by utilizing a memory structure as shown in FIG. 16. Process 1700 begins at step 1702 when the value of L is initialized, e.g., to L=1. - At
step 1704, it is determined whether a stop condition has been met. If so, process 1700 may resume with step 1730, where results are collected. - If, at
step 1704, a stop condition has not been met, then, at step 1706, the sequencer may apply a value, L, to a read sequence input. In embodiments, this causes read sources S that contain the value L to output their enable signal. - At
step 1708, the enabled data items X may be summed by an adder that outputs sums that, at step 1710, are multiplied by the column weights W, e.g., by using multipliers M, to obtain column-specific outputs. - At
step 1712, the column outputs are added by a column adder. - At
step 1714, the output of the column adder is processed through an activation module that may apply an activation function, g( ), to the column adder output to obtain output Y. - At
step 1716, the sequencer applies the calculated output, Y, to the data inputs. - At
step 1718, the value L is applied to the write target inputs. In embodiments, this may cause all write targets T that contain the value L to output their enable signal such that Y is written to the enabled data items X. - At
step 1720, L may be increased and process 1700 may resume with step 1704 to determine whether the stop condition has been reached. - It is noted that, depending on the particular embodiment, (1) certain steps may optionally be performed; (2) steps may not be limited to the specific order set forth herein; (3) certain steps may be performed in different orders; and (4) certain steps may be done concurrently. For example, for a plurality of columns, some of the steps (e.g., all
steps 2 and all steps 3) may be performed in parallel. -
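- For illustration, the loop of process 1700 may be sketched in software as follows; the entry layout, the integer sequence values, and the stop condition are assumptions made for this sketch rather than features of the disclosure:

```python
import math

def run(columns, num_neurons):
    """Sketch of process 1700. 'columns' is a list of (W, entries) pairs,
    where W is the column weight and each entry is a dict holding a read
    source 'S', a data item 'X', and a write target 'T'."""
    g = lambda v: 1.0 / (1.0 + math.exp(-v))     # activation function
    L = 1                                        # step 1702: initialize L
    while L <= num_neurons:                      # step 1704: stop condition (assumed)
        total = 0.0
        for W, entries in columns:
            enabled = [e for e in entries if e["S"] == L]    # step 1706: matching read sources
            col_sum = sum(e["X"] for e in enabled)           # step 1708: sum enabled data items
            total += W * col_sum                 # steps 1710 and 1712: weight and add columns
        Y = g(total)                             # step 1714: activation
        for _, entries in columns:               # steps 1716 and 1718: write back
            for e in entries:
                if e["T"] == L:
                    e["X"] = Y
        L += 1                                   # step 1720
    return columns                               # step 1730: collect results
```

The per-column summations in this sketch may, in embodiments, be performed in parallel, consistent with the parallelism noted above.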
FIG. 18 illustrates a simplified system utilizing a memory structure according to various embodiments of the present disclosure. Sequencer 1800 comprises read source 1802, adder 1804, inline multiplier 1806, weights memory 1808, data memory 1810, write target 1812, activation unit 1814, and sequence number generator 1816. - In operation, read
source 1802 receives sequence number 1820 from sequence number generator 1816, e.g., until a stop condition has been reached. If the sequence number 1820 matches the content of a memory element in read source 1802, then read source 1802 outputs an enable signal 1822 that enables data in weights memory 1808 and data memory 1810 to be multiplied by inline multiplier 1806 to generate products 1824 that are then added by adder 1804, which computes the sum of products 1824. In addition, memory elements in write target 1812 whose content matches sequence number 1820 may cause write target 1812 to output enable signals 1830. - In embodiments,
output 1836 of adder 1804 is provided to activation unit 1814, which applies an activation function to the sum of products 1824 to generate output 1832 that may then be fed back to the input of data memory 1810 to be written according to enable signals 1830 generated by write target 1812 in response to receiving sequence number 1820. Finally, sequence number generator 1816 may increment sequence number 1820 and provide a new sequence number to read source 1802 to close the loop. -
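- A software analogue of a single pass through this loop is sketched below; the list-based memories and names are assumptions made for illustration only:

```python
import math

def sequencer_pass(read_source, weights_mem, data_mem, write_target, seq_no):
    # Rows whose read-source content matches the sequence number are enabled
    enabled = [i for i, s in enumerate(read_source) if s == seq_no]
    # Inline multiply and accumulate over the enabled rows (the adder output)
    acc = sum(weights_mem[i] * data_mem[i] for i in enabled)
    # Activation unit produces the value that is fed back to the data memory
    y = 1.0 / (1.0 + math.exp(-acc))
    # Write y into every row whose write-target content matches seq_no
    for i, t in enumerate(write_target):
        if t == seq_no:
            data_mem[i] = y
    return y
```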
FIG. 19 illustrates an alternate system utilizing a memory structure that uses column weights according to various embodiments of the present disclosure. For clarity, components similar to those shown in FIG. 18 are labeled in the same manner and, for purposes of brevity, a description of their function is not repeated here. System 1900 comprises sequencer 1904, which is similar to sequencer 1800 in FIG. 18. System 1900 in FIG. 19 further comprises column adder 1902, e.g., a column adder such as those illustrated in FIG. 16, that may be coupled to any number of additional sequencers (not shown in FIG. 19) that share column adder 1902 and activation unit 1814. Similar to FIG. 16, the sequence number generator in FIG. 19 may be coupled to any number of additional sequencers.
- J. System Embodiments
- Aspects of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory. It shall be noted that alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like. Accordingly, the “means” terms in any claims are intended to cover both software and hardware implementations. Similarly, the term “computer-readable medium or media” as used herein includes software and/or hardware having a program of instructions embodied thereon, or a combination thereof. With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.
- It shall be noted that embodiments of the present invention may further relate to computer products with a non-transitory, tangible computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind known or available to those having skill in the relevant arts. Examples of tangible computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store or to store and execute program code, such as application specific integrated circuits (ASICs), programmable logic devices (PLDs), flash memory devices, and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher level code that are executed by a computer using an interpreter. Embodiments of the present invention may be implemented in whole or in part as machine-executable instructions that may be in program modules that are executed by a processing device. Examples of program modules include libraries, programs, routines, objects, components, and data structures. In distributed computing environments, program modules may be physically located in settings that are local, remote, or both.
- One skilled in the art will recognize no computing system or programming language is critical to the practice of the present invention. One skilled in the art will also recognize that a number of the elements described above may be physically and/or functionally separated into sub-modules or combined together.
- It will be appreciated to those skilled in the art that the preceding examples and embodiments are exemplary and not limiting to the scope of the present disclosure. It is intended that all permutations, enhancements, equivalents, combinations, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It shall also be noted that elements of any claims may be arranged differently including having multiple dependencies, configurations, and combinations.
Claims (20)
1. A method for energy-efficient data processing, the method comprising:
in response to obtaining a read command, identifying, in a memory device, a set of input locations from which to read input data, each of the input locations being associated with an address value for a neuron;
accessing the input data in the set of input locations;
using the input data to generate a result; and
writing the result back into the memory device.
2. The method according to claim 1 , further comprising associating a set of input data items with the address value for the neuron.
3. The method according to claim 1 , wherein two or more locations of the set of input locations are concurrently accessed.
4. The method according to claim 1 , wherein the result is associated with the neuron.
5. The method according to claim 1 , further wherein the neuron represents a node in a fully connected network.
6. The method according to claim 1 , further wherein the memory device comprises summing nodes and multipliers that are embedded in the memory device.
7. The method according to claim 1 , further wherein the set of input locations are accessed in a single clock cycle.
8. The method according to claim 1 , further comprising a read source that comprises the address value for the neuron, the read source outputs a first enable signal that enables a data item among the set of input data items.
9. The method according to claim 8 , further comprising applying the address value to one or more write target inputs that, in response to containing the value, output a second enable signal that causes the result to be written to the data item.
10. The method according to claim 8 , further comprising enabling at least one weight item, and multiplying one or more of the enabled data items with enabled weight items to obtain a sum of products.
11. The method according to claim 10 , further wherein the result is associated with the sum of products that is associated with the neuron.
12. The method according to claim 10 , wherein generating the result further comprises applying the sum of products to an adder to obtain an output.
13. The method according to claim 12 , further comprising applying the output to an activation function to obtain the result.
14. A system for energy-efficient data processing, the system comprising:
a processor; and
a non-transitory computer-readable medium comprising instructions that, when executed by the processor, cause steps to be performed, the steps comprising:
in response to obtaining a read command, identifying, in a memory device, a set of input locations from which to read input data, each of the input locations being associated with an address value for a neuron;
accessing the input data in the set of input locations;
using the input data to generate a result; and
writing the result back into the memory device.
15. The system according to claim 14 , wherein the two or more locations of the set of input locations are concurrently accessed.
16. The system according to claim 14 , wherein the steps further comprise associating a set of input data items with the address value for the neuron.
17. The system according to claim 16 , further comprising a read source that comprises the address value for the neuron, the read source outputs a first enable signal that enables a data item among the set of input data items.
18. The system according to claim 17 , further comprising applying the address value to one or more write target inputs that, in response to containing the value, output a second enable signal that causes the result to be written to the data item.
19. The system according to claim 17 , further comprising enabling at least one weight item, and multiplying a set of one or more of the enabled set of data items with enabled weight items to obtain a sum of products.
20. The system according to claim 19 , further wherein the result is associated with the sum of products that is associated with the neuron.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/114,766 US20230222315A1 (en) | 2018-10-03 | 2023-02-27 | Systems and methods for energy-efficient data processing |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862740700P | 2018-10-03 | 2018-10-03 | |
US16/590,265 US11610095B2 (en) | 2018-10-03 | 2019-10-01 | Systems and methods for energy-efficient data processing |
US18/114,766 US20230222315A1 (en) | 2018-10-03 | 2023-02-27 | Systems and methods for energy-efficient data processing |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/590,265 Continuation US11610095B2 (en) | 2018-10-03 | 2019-10-01 | Systems and methods for energy-efficient data processing |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230222315A1 true US20230222315A1 (en) | 2023-07-13 |
Family
ID=69886594
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/590,265 Active 2041-05-23 US11610095B2 (en) | 2018-10-03 | 2019-10-01 | Systems and methods for energy-efficient data processing |
US18/114,766 Pending US20230222315A1 (en) | 2018-10-03 | 2023-02-27 | Systems and methods for energy-efficient data processing |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/590,265 Active 2041-05-23 US11610095B2 (en) | 2018-10-03 | 2019-10-01 | Systems and methods for energy-efficient data processing |
Country Status (3)
Country | Link |
---|---|
US (2) | US11610095B2 (en) |
CN (2) | CN118503147A (en) |
DE (1) | DE102019126715A1 (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160196488A1 (en) * | 2013-08-02 | 2016-07-07 | Byungik Ahn | Neural network computing device, system and method |
US20170103304A1 (en) * | 2015-10-08 | 2017-04-13 | Via Alliance Semiconductor Co., Ltd. | Neural network unit with plurality of selectable output functions |
US20170102945A1 (en) * | 2015-10-08 | 2017-04-13 | Via Alliance Semiconductor Co., Ltd. | Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10810484B2 (en) * | 2016-08-12 | 2020-10-20 | Xilinx, Inc. | Hardware accelerator for compressed GRU on FPGA |
US10949736B2 (en) * | 2016-11-03 | 2021-03-16 | Intel Corporation | Flexible neural network accelerator and methods therefor |
CN111680789B (en) * | 2017-04-11 | 2023-04-28 | 上海兆芯集成电路有限公司 | Neural network unit |
CN107818367B (en) * | 2017-10-30 | 2020-12-29 | 中国科学院计算技术研究所 | Processing system and processing method for neural network |
-
2019
- 2019-10-01 US US16/590,265 patent/US11610095B2/en active Active
- 2019-10-02 DE DE102019126715.3A patent/DE102019126715A1/en active Pending
- 2019-10-08 CN CN202410685121.8A patent/CN118503147A/en active Pending
- 2019-10-08 CN CN201910949437.2A patent/CN110989971B/en active Active
-
2023
- 2023-02-27 US US18/114,766 patent/US20230222315A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN118503147A (en) | 2024-08-16 |
CN110989971A (en) | 2020-04-10 |
DE102019126715A1 (en) | 2020-04-09 |
CN110989971B (en) | 2024-05-28 |
US20200110979A1 (en) | 2020-04-09 |
US11610095B2 (en) | 2023-03-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MAXIM INTEGRATED PRODUCTS, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LOVELL, MARK ALAN;MUCHSEL, ROBERT MICHAEL;LOOMIS III, DONALD WOOD;SIGNING DATES FROM 20230208 TO 20230227;REEL/FRAME:062815/0118 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |