CN107615241A - Logical operation - Google Patents
Logical operation
- Publication number
- CN107615241A (CN201680031683.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- memory
- logic engine
- logic
- division
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1024—Latency reduction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Neurology (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Memory System (AREA)
- Image Processing (AREA)
Abstract
In an example, a method includes using at least one processor to identify, in a plurality of distinct data objects stored in at least one memory, data portions that are to be processed using the same logical operation. The method may further include identifying a representation of an operand stored in the at least one memory, the operand being used to provide the logical operation, and providing the operand to a logic engine. The data portions are stored in a plurality of input data buffers, wherein each of the input data buffers comprises a data portion of a different data object. The logical operation may be carried out on each of the data portions using the logic engine, and an output for each data portion is stored in a plurality of output data buffers, wherein each of the outputs comprises data derived from a different data object.
Description
Background
Architectures for processing apparatus that allow processing-in-memory (PIM) have been described. In PIM, rather than retrieving data from a remote memory for processing, the processing is carried out locally to the memory.
Brief description of the drawings
Non-limiting examples will now be described with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of an example of a method of carrying out logical operations;
Fig. 2 is a simplified schematic of an example resistive memory array apparatus;
Figs. 3 and 4 are schematic examples of processing apparatus; and
Fig. 5 is a flowchart of another example of a method of carrying out logical operations.
Detailed description
Fig. 1 shows an example of a method. In block 102, the method comprises identifying, in a plurality of distinct data objects stored in a memory, data portions of different data objects that are to be processed using the same logical operation. For example, the data objects may be associated with at least one image, and the operation may comprise a stage of object recognition. For example, particular logical operations, such as convolutions, may be used in carrying out tasks such as object recognition, for example face detection. The data objects may comprise a set of image pixels, or for example a set of feature maps derived from a set of image pixels, and the operation may comprise a stage of face detection. Such inputs may be termed "input neurons". The data objects may be entirely unrelated (for example, comprising images from diverse sources, or data derived from images from diverse sources).
Identifying the data portions may comprise determining a size of the data portions, for example based on at least one of the number of data objects and/or data portions, the data outputs, and the memory resources (which may be buffers or the like for receiving the data portions, and/or memory storing the operands, data objects and/or data portions). Identifying the data portions may comprise determining an order of the data portions, for example such that the data portions may be interleaved. As discussed further below, the order, size and number of data portions may be determined so as to provide a substantially continuous supply of data for processing using the same operand or operands.
Block 104 comprises identifying, in a memory, a representation of an operand to provide the logical operation; and, in block 106, a logic engine is provided with the operand (which may, for example, comprise a matrix). The logic engine may be a vector-matrix multiplication engine. In some examples, the logic engine may be provided as a resistive memory array. In some examples, the logic engine may comprise an arithmetic logic unit (ALU).
In block 108, the data portions are stored in a plurality of input data buffers. Each of the input data buffers comprises a data portion of a different data object. This may comprise storing the data portions according to an intended processing order (i.e. the order in which the data portions are to be subjected to the logical operation). Block 110 comprises carrying out the logical operation on each of the data portions. This may comprise carrying out a matrix-vector multiplication or the like. In examples in which the data portions are associated with an order, the order may be chosen such that utilization of the logic engine carrying out the logical operation is high, and in some examples substantially continuous (i.e. the processing pipeline is at least substantially full, and in some examples as full as possible). Block 112 comprises storing an output for each data portion, wherein each output comprises data derived from a different data object. In some examples, the outputs may be stored in a plurality of output data buffers, wherein each of the output data buffers comprises data derived from a different data object.
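The flow of blocks 102 to 112 can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the names `process_batch` and `vector_matrix_multiply`, the dictionary-based buffers and the sample values are hypothetical and do not come from the patent, and a plain software loop stands in for the logic engine.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def vector_matrix_multiply(vec: Sequence[float], mat: Matrix) -> Vector:
    """Multiply a length-R row vector by an R x C matrix (the logical operation)."""
    return [
        sum(vec[r] * mat[r][c] for r in range(len(mat)))
        for c in range(len(mat[0]))
    ]


def process_batch(
    data_objects: Dict[str, List[Vector]],
    portion_index: int,
    operand: Matrix,
) -> Dict[str, Vector]:
    # Blocks 102 and 108: take the portion that shares the operand from every
    # data object and place it in its own input buffer (one per object).
    input_buffers = {
        name: portions[portion_index] for name, portions in data_objects.items()
    }
    # Blocks 104 and 106: the operand is supplied once for the whole batch.
    # Blocks 110 and 112: apply the same operation to each buffer and keep
    # one output buffer per data object.
    return {
        name: vector_matrix_multiply(vec, operand)
        for name, vec in input_buffers.items()
    }


# Two hypothetical data objects, each split into two data portions.
images = {
    "image_a": [[1.0, 2.0], [3.0, 4.0]],
    "image_b": [[5.0, 6.0], [7.0, 8.0]],
}
kernel = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical operand (matrix)
outputs = process_batch(images, 0, kernel)
```

Here the first portion of each image is processed with the same kernel before any other kernel would be loaded, which is the batching behavior the method relies on.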
The method may be carried out using apparatus with a processing-in-memory (PIM) architecture. In such architectures, computational processing apparatus is placed within or near the memory (for example, within a memory array or subarray) to avoid long communication paths.
In some examples, such as when using resistive memory devices such as "memristors" (electrical components that can be written with a resistance in a non-volatile manner), the memory may itself provide a processing component. Arrays of resistive memory devices may be used for carrying out logical operations.
In some examples, the logic engine may be associated with dynamic random access memory (DRAM). In examples, this association may comprise physical proximity and/or integration of the logic engine with a component (for example, on a chip or a die) that also comprises DRAM. In some examples, the logic engine and the DRAM may be provided on the same die. In some examples, the logic engine may be associated with (for example, physically arranged at) a DRAM buffer, or may be provided as (or as part of) a load-reduced dual in-line memory module (LRDIMM) buffer on a dual in-line memory module (DIMM), which may also comprise at least one DRAM chip. In another example, a logic layer may be provided as part of a die that also comprises a memory portion (for example, a DRAM portion). For example, the memory may be provided on one side of a die and the logic engine may be provided on the opposite side of the die.
Some computational tasks use "deep learning" techniques. When deep learning processing techniques are used, a logical operation is carried out on input data to provide an output in a first layer of processing. Logical operations are then carried out on that output in a subsequent layer of processing, in some examples for a plurality of iterations. Deep learning may be used in fields such as big data analysis, image and speech recognition, and other computationally complex tasks. At least some of the processing layers may comprise convolutions, for example applying logical operations to input data using matrix multiplication. The convolution operand (for example, a matrix) may be referred to as a processing kernel.
To accelerate deep learning workloads, acceleration of the convolution of data may be considered in particular, as convolution may account for up to or around 80% of the execution time of some deep learning applications.
In some examples, the number of kernels held in storage may be large. For example, while a shared kernel may be used for several or all of a set of input data portions derived from a data object (which may be referred to as input neurons), in other examples different input neurons from the same data object are processed using different kernels. In some examples, each input neuron derived from a data object may be associated with a different, "private" kernel. For example, the private kernels for some applications may take up more than a gigabyte of memory space.
In examples in which each input neuron is associated with a private kernel, temporal locality (i.e. the extent to which data that is to be processed repeatedly can be held locally) may be low, which may reduce the usefulness of techniques such as caching.
To consider a particular example, face detection may be applied to a video file. In a layer of processing, a number Ni of data objects may be provided. In an example, these may be Ni images within a video file, and Nx × Ny data portions (input neurons, for example pixels or feature maps) derived from the Ni images in the video file may be convolved with kernels of size Kx × Ky to form No output feature maps. In this case, each output entry has its own kernel (i.e. the kernels are private), and may be computed based on the following equation:
where map and w denote the feature map entries and the weights in the convolution layer, respectively.
Based on the above equation, such a layer takes up:
Nx × Ny × Ni × No × Kx × Ky
of kernel space, which can readily exceed 1 GB in real-life applications. In addition, since the kernels are private, there is no temporal locality (locally cached data is not reused), which imposes high bandwidth pressure.
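The equation for each output entry does not survive in this text. One form that is consistent with the stated kernel count, offered here as an assumption rather than as the patent's own formula, gives every output entry (o, x, y) its own Kx × Ky weights for every input map i. The index names and the subscripts "in" and "out" are introduced only for this sketch.

```latex
\mathrm{map}_{\mathrm{out}}(o, x, y)
  = \sum_{i=1}^{N_i} \sum_{k_x=1}^{K_x} \sum_{k_y=1}^{K_y}
    w(o, i, x, y, k_x, k_y)\,
    \mathrm{map}_{\mathrm{in}}(i,\; x + k_x,\; y + k_y),
\qquad
\#\text{weights} = N_x \, N_y \, N_i \, N_o \, K_x \, K_y .
```

As a purely illustrative calculation with assumed dimensions Nx = Ny = 64, Ni = No = 64 and Kx = Ky = 5, the layer holds 64 × 64 × 64 × 64 × 25 = 419,430,400 weights, which at an assumed 4 bytes per weight is about 1.56 GiB.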
According to the method of Fig. 1, data portions of different data objects (for example, different images or different feature maps) that are to be processed using the same logical operation are identified and stored in data buffers. This allows batch processing, i.e. each of the data portions (which may be derived from different data objects) may be processed using the same kernel. In this way, kernels are reused and new kernels are fetched less often. For example, the same kernel may be used to apply the same convolution to the corresponding input neuron (for example, pixel or feature map) of each of a plurality of images before the kernel is changed. This provides temporal locality with respect to the kernel, even in examples in which there is no, or low, temporal locality with respect to the data objects or data portions.
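The effect of batching on kernel reuse can be illustrated by counting kernel fetches under two processing orders. The sketch below is a simplified model and not the patent's implementation: it assumes one private kernel per input neuron position, a single kernel slot at the logic engine, and hypothetical function names.

```python
from typing import Iterable, List, Optional, Tuple

WorkItem = Tuple[int, int]  # (image index, neuron index)


def count_kernel_fetches(schedule: Iterable[WorkItem]) -> int:
    """Count fetches when a new kernel is loaded each time the neuron index changes."""
    fetches = 0
    loaded: Optional[int] = None
    for _image, neuron in schedule:
        if neuron != loaded:
            fetches += 1
            loaded = neuron
    return fetches


def image_major(num_images: int, num_neurons: int) -> List[WorkItem]:
    """Unbatched order: finish one image before starting the next."""
    return [(img, n) for img in range(num_images) for n in range(num_neurons)]


def kernel_major(num_images: int, num_neurons: int) -> List[WorkItem]:
    """Batched order: apply one kernel to every image before changing kernel."""
    return [(img, n) for n in range(num_neurons) for img in range(num_images)]


unbatched_fetches = count_kernel_fetches(image_major(4, 3))
batched_fetches = count_kernel_fetches(kernel_major(4, 3))
```

With four images and three neuron positions, the image-by-image order loads a kernel for every work item, while the batched order loads each kernel once.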
Data portions may be retrieved and supplied to the data buffers in an order such that a plurality of data portions are held in the data buffers and the logical operation using a particular operand can be carried out on the data portions substantially continuously. In other words, the data portions may be retrieved so as to provide a substantially continuous pipeline of data to be processed by the logic engine. In some examples, data objects may be associated with varying latency; for example, it may take longer to retrieve some data objects than others, which may depend on the memory type, location or the like. The memory resources for different data objects or data portions may differ in size. The batching of data may take such differences into account, so that gaps in the pipeline, which could arise in the supply of data portions if they were requested in sequence and processed "just in time", can be reduced by requesting data early and holding it in a local buffer.
As noted above, in some examples, in a PIM architecture, processing capability may be provided in close proximity to, or embedded within, the memory. In some examples, the memory may itself provide the logic engine (for example, a vector-matrix multiplication engine). In one example, a kernel may be embodied as a resistive memory array comprising a two-dimensional grid of resistive memory elements, which may be a crossbar array. An example of a crossbar array 200 of resistive memory elements 202 (for example, memristors or other resistive memory elements) is shown in Fig. 2. The resistive memory array 200 comprises at least one input row to receive input data, and each resistive memory element has a bit depth. In this example, the term "bit depth" refers to the number of bits that can be represented by a single resistive memory element 202. In some examples, the elements 202 may be binary, taking one of two values, for example representing 0 or 1 (a bit depth of one); however, resistive memory elements that can take a plurality of values, for example 32 distinct levels (a bit depth of five), have been demonstrated. The array 200 may be "programmed" by subjecting the elements to voltage pulses, each voltage pulse incrementally changing the resistance of the element 202, and the resulting resistance level is then "remembered" by the element 202, even after the power supply is removed.
In some examples, such an array 200 may process an input voltage vector (which may, for example, be provided using a digital-to-analog converter (DAC) to convert digital data bits to analog voltages) to provide an output vector in which the input values are weighted by the conductance at each element 202 of the array 200. This effectively means that the array 200 performs a dot product matrix operation on the input to produce the output. The weights of the elements 202 may be individually "programmed" by subjecting the elements 202 to voltage pulses, as outlined above. Such arrays 200 may be associated with high density, low power consumption, long cycling endurance and/or fast switching speeds. Such an array may therefore carry out matrix combining operations. In some examples, the array may form part of a dot product engine for multiplying matrices together. Such a dot product engine may be used in deep learning apparatus and to carry out complex computational operations.
In an example, analog data may be supplied for processing using the resistive memory array 200. The data may, for example, represent at least one pixel of an image, or a word (or sub-word or phrase) of speech, or the results of a scientific experiment, or any other data. The input data may be provided as a vector (i.e. a one-dimensional string of data) and applied to the array as voltage values (generally voltages lower than those used to set the resistance of an array element 202, so that the resistance of the element 202 is not changed in this operation).
If such a resistive memory array 200 is used, block 106 of Fig. 1 may comprise writing the resistive memory array with resistance values.
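An idealized numerical model may help to show how a programmed array acts as a vector-matrix multiplier. The sketch below is an assumption-laden illustration, not a device model from the patent: it assumes weights normalized to the range 0 to 1, conductance levels spaced evenly up to a hypothetical maximum `g_max`, a bit depth of at least one, and ideal summation of currents on each column (Ohm's law per element, Kirchhoff's current law per column). All function names are hypothetical.

```python
from typing import List, Sequence


def quantize_level(weight: float, bit_depth: int) -> int:
    """Map a weight in [0, 1] to one of 2**bit_depth discrete levels."""
    top_level = 2 ** bit_depth - 1
    clipped = min(max(weight, 0.0), 1.0)
    return round(clipped * top_level)


def program_array(
    weights: Sequence[Sequence[float]], bit_depth: int, g_max: float = 1.0
) -> List[List[float]]:
    """'Write' the operand: store each weight as a quantized conductance."""
    top_level = 2 ** bit_depth - 1
    return [
        [quantize_level(w, bit_depth) * g_max / top_level for w in row]
        for row in weights
    ]


def read_array(
    voltages: Sequence[float], conductances: Sequence[Sequence[float]]
) -> List[float]:
    """Column currents: I[c] = sum over rows r of V[r] * G[r][c]."""
    return [
        sum(v * row[c] for v, row in zip(voltages, conductances))
        for c in range(len(conductances[0]))
    ]


# A binary (bit depth of one) 2 x 2 array programmed with an identity kernel.
conductances = program_array([[1.0, 0.0], [0.0, 1.0]], bit_depth=1)
currents = read_array([0.2, 0.5], conductances)
```

In this model, a bit depth of five gives 32 levels per element, so the stored weights are coarser approximations of the intended kernel than a digital representation would be.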
Fig. 3 shows an example of processing apparatus 300 comprising a memory 302, a logic engine 304, a plurality of input buffers 306, a plurality of output buffers 308 and a data batching module 310, the buffers 306, 308 being associated with the logic engine 304 (and in some examples local to the logic engine 304).
The memory 302 comprises at least one memory portion and holds a plurality of distinct data objects and a plurality of logical operators, wherein the logical operators are for operating on data portions of the data objects.
Memory 302 can include at least one memory portion, and some of which can be remote memory part.
In the example for which providing multiple memory portions, memory portion can include multiple different type of memory and/or
Size.Memory 302 can include at least one non-volatile memory portion and/or at least one volatile memory portion
(for example, SRAM or DRAM).In some instances, it is this that at least multiple logical operators, which are stored in relative to logic engine 304,
In the memory portion on ground.In some instances, logic engine 304 can be embedded in the memory for storing multiple logical operators
In.In other examples, logic engine 304 can be via the data/address bus (for example, silicon hole (TSV)) with relatively high bandwidth
It is connected to the memory for storing multiple logical operators.
In an example, through the provision of the buffers 306, 308 and the action of the batching module 310, the specification and capabilities of the memory 302 may be separated from the capabilities of the logic engine 304, for example such that the logic engine 304 may be designed for computational efficiency while the memory 302 may be designed for storage capability. For example, the memory 302 may comprise, for at least a portion thereof, NMOS-based memory (which may be relatively dense and slow). In other examples, the memory 302 and/or the buffers 306, 308 may be implemented bearing in mind a trade-off between speed and density.
The logic engine 304 carries out logical operations on at least one data portion, and in some examples on a plurality of data portions. The logic engine 304 may be provided by any processor. In some examples, the logic engine 304 may comprise a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a single instruction, multiple data (SIMD) processing element or the like, which may provide a component of a convolutional neural network (CNN) or a deep neural network (DNN). In some examples, the logic engine 304 may comprise a resistive memory array.
In use of the apparatus 300, the data batching module 310 identifies, in a plurality of distinct data objects, data portions (each data portion comprising all or part of a data object) to be operated on by a common operand, and sends the data portions of the plurality of different data objects to the input data buffers 306. The logic engine 304 carries out a logical operation using the common operand on each of the data portions consecutively (and in some examples at least substantially continuously), and provides an output from each operation to one of the plurality of output buffers 308.
In some examples, the data batching module 310 may be arranged to determine the size of a data portion to be stored in a buffer 306 based on at least one of the number of input data sets, the number of output sets, and the various memory resources (for example, the buffers 306, 308 or at least a portion of the memory 302). The data batching module 310 may be arranged to determine an interleaving of the data portions, for example so as to ensure that utilization of the kernel embodied by the operand is high (for example, such that the logic engine 304 carries out logical operations substantially continuously for the period over which a particular kernel is used).
As noted above, by using a common operand (which represents a common operator, or kernel) for a plurality of data objects, temporal locality of the kernel may be achieved.
In some examples, the processing apparatus 300 may have a PIM architecture, in which the operands are stored locally to the logic engine 304. By adopting a PIM architecture, off-chip bandwidth for fetching kernels may be avoided, with corresponding power and time savings. However, this may mean that the on-chip memory banks that can be used to store kernels occupy a relatively large area, and transporting the kernels to the processor consumes power, which may result in significant energy consumption.
Fig. 4 shows an example of processing apparatus comprising a 3D memory stack 400. The stack 400 comprises a plurality of memory dies 402, 404 and at least one processing die 406. Each of the memory dies 402 comprises at least one memory portion 408a-h. These memory portions 408 may be similar in at least one respect such as type or size (for example, the same memory type and/or size) or may differ. In this example, the logic engine 304 is provided next to (i.e. in close physical proximity to) a TSV 410 that connects the logic engine 304 to the first memory die 402. The physical proximity may be such that there is no wire routing between the TSV 410 and the logic engine.
In this example, the first memory die 402 stores the plurality of logical operators. In some examples, at least a portion 408 of the memory used to store the logical operators, data objects and/or data portions comprises on-chip memory, and the logic engine 304 is an on-chip processing element. In some examples, the plurality of logical operators may be distributed over a plurality of memory dies 402, 404. In this example, one TSV 410 is shown. However, in other examples, a plurality of TSVs 410 may be provided, for example associated with different dies, or one die may have a plurality of TSVs 410 to interconnect it with the processing die 406. In this example, the processing die 406 also comprises the plurality of input and output buffers 306, 308. The processing die 406 may also comprise the data batching module 310, or this may be provided elsewhere in the apparatus.
In other examples, the logic engine 304 may be provided on the same die as the memory storing the plurality of logical operators, although this may increase the footprint of the apparatus 400.
In some examples, the data objects may also be stored on the memory dies 402, 404, although the data objects may also be stored in remote memory portions, and at least a portion of the data objects may be received from such remote memory portions. As noted above, the data objects may be stored in a plurality of different memory resources, which may differ in size and may supply data with varying latency (which may be associated with at least one of the time taken to retrieve data from a particular memory resource and the time taken to transfer data to the processing die 406).
In some examples, the latency associated with supplying logical operators to the logic engine is considered. Even though a logical operator may be retrieved relatively quickly, such apparatus 400 still has a latency, which could create (or add to) gaps or "bubbles" in the processing pipeline and hence reduce performance. For example, accessing a row in a 3D memory with a 128-bit TSV bus may be associated with a memory latency of 16 ns. Where there is no temporal locality, one row of a matrix operator (for example, around 1 KB for the 3D memory) may be computed every 16 ns in such apparatus. As a result, in such an example, computation is limited to processing 1 KB of matrix operator per 16 ns per TSV bus (16 cycles if the logic engine 304 is provided in an example 1 GHz processing apparatus). However, if the data were available, the logic engine 304 could process it in 1 ns (1 cycle at 1 GHz). In other words, the computational efficiency of the digital unit drops to 1/16, since the memory latency for the matrix operator is 16 times higher than the computation latency.
In some examples, therefore, providing a plurality of input and output buffers 306, 308 allows this memory latency to be compensated for, such that the apparatus can operate at full computation speed without "bubbles" in the pipeline. In some examples, the number of buffers 306, 308 may be chosen to fully compensate for the difference. For example, the number of buffers 306, 308 may be determined using the cycle latency, which relates the memory latency to the computation latency. In the above example, the cycle latency is 16, and there may therefore be 16 input buffers 306 and 16 output buffers 308, and the same matrix operation may be carried out 16 times (assuming there are sufficient data portions to which the operation is to be applied).
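The buffer count in this example follows from simple arithmetic, which can be written out as a short calculation. The model below is a simplification under stated assumptions: it assumes that fetching the next operator row overlaps fully with computation on the current batch, and the function names are hypothetical.

```python
import math


def buffers_needed(memory_latency_ns: float, compute_latency_ns: float) -> int:
    """Number of buffered data portions needed to hide one operator fetch."""
    return math.ceil(memory_latency_ns / compute_latency_ns)


def engine_utilization(
    batch_size: int, memory_latency_ns: float, compute_latency_ns: float
) -> float:
    """Fraction of time the logic engine is busy per operator row.

    Assumes the fetch of the next operator row overlaps with computation,
    so each row occupies max(compute time for the batch, fetch time).
    """
    busy_ns = batch_size * compute_latency_ns
    period_ns = max(busy_ns, memory_latency_ns)
    return busy_ns / period_ns


# Figures from the example above: 16 ns memory latency, 1 ns per operation.
num_buffers = buffers_needed(16.0, 1.0)
unbatched = engine_utilization(1, 16.0, 1.0)
batched = engine_utilization(num_buffers, 16.0, 1.0)
```

With a single data portion per operator row the engine is busy for 1/16 of the time; with 16 buffered portions the same row is reused 16 times and the engine is kept fully busy in this model.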
Such apparatus 400 may make use of on-chip 3D capacity to store kernels (for example, as matrix operators), which may be private kernels and which may be used for computationally intensive operations such as the convolution of data for deep learning applications. By storing the kernels on a die (or dies) separate from the logic engine 304, space may be freed on the processing die 406 (which may in some examples be a relatively expensive component), for example for more computation logic. It may also be noted that the footprint of the kernel memory space on the processing die 406 is just the space occupied by the TSV 410. Providing the memory for the logic engine 304 on separate dies 402, 404 may reduce the footprint of the stack 400. In addition, the apparatus 400 makes use of batching to achieve high computational efficiency.
A 3D memory stack may have a high capacity, for example 4 GB of memory space over 4 dies. Such 3D memory may store the logical operators (for example, deep learning kernels) of more than one convolution layer for one or more multilayer processing applications. This may enable fast context switching between different convolution layers on demand.
Fig. 5 is a flowchart of an example of a method comprising, in block 502, extracting a plurality of vectors associated with a plurality of distinct data objects. The plurality of vectors may comprise, or be derived from, data portions of the data objects as outlined above. In some examples, the vectors are derived from image feature maps or from other data. The vectors may comprise strings of digital data, and may be retrieved from a local memory (for example, from memory within a 3D memory stack that also provides the logical operation) or from a different, for example more remote, memory. In block 504, the plurality of vectors are stored in distinct data buffers local to a logic engine integral to the 3D memory stack. Block 506 comprises extracting a logical operator stored in the 3D memory stack, which is provided to the logic engine in block 508.
In block 510, a plurality of consecutive vector-by-matrix multiplications are carried out according to the logical operator using the logic engine, wherein the same logical operator is multiplied with each of the plurality of vectors associated with the plurality of distinct data objects. In some examples, such as those using a resistive memory array, the vectors may undergo digital-to-analog conversion to provide a representation of each vector as analog voltages. In some examples, a distinct output vector is provided for each of the data objects. For example, such vector outputs may be held in distinct data buffers or registers.
The method may be repeated, in some examples comprising providing a new logical operator to the logic engine for each iteration.
Examples in the present disclosure can be provided as methods, systems or machine readable instructions, such as any combination of software, hardware, firmware or the like. Such machine readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc.) having computer readable program code therein or thereon.
The present disclosure is described with reference to flow charts and/or block diagrams of the method, devices, and systems according to examples of the present disclosure. Although the flow charts described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart. It should be understood that each flow and/or block in the flow charts and/or block diagrams, as well as combinations of the flows and/or blocks in the flow charts and/or block diagrams, can be realized by machine-readable instructions.
The machine-readable instructions may, for example, be executed by a general-purpose computer, a special-purpose computer, an embedded processor, or processors of other programmable data processing devices to realize the functions described in the description and drawings (for example, the functions of the processing units 300, 400). In particular, a processor or processing unit may execute the machine-readable instructions. Thus, functional modules of the apparatus and devices (for example, the batch processing module 310 or the logic engine 304) may be implemented by a processor executing machine-readable instructions stored in a memory, or by a processor operating in accordance with instructions embedded in logic circuitry. The term "processor" is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, programmable gate array, etc. The methods and functional modules may all be performed by a single processor or divided among several processors.
Such machine-readable instructions may also be stored in a computer-readable storage device (for example, memory 302) and can guide the computer or other programmable data processing devices to operate in a specific mode.
Such machine-readable instructions may also be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operations to produce computer-implemented processing. The instructions executed on the computer or other programmable devices thus realize the functions specified by the flows in the flow charts and/or the blocks in the block diagrams.
In addition, the teachings herein may be implemented in the form of a computer software product that is stored in a storage medium and comprises multiple instructions for making a computer device implement the methods described in the examples of the present disclosure.
Although the method, apparatus, and related aspects have been described with reference to certain examples, various modifications, changes, omissions, and substitutions can be made without departing from the spirit of the present disclosure. It is therefore intended that the method, apparatus, and related aspects be limited only by the scope of the following claims and their equivalents. It should be noted that the above examples illustrate rather than limit what is described herein, and that those skilled in the art will be able to design many alternative implementations without departing from the scope of the appended claims. Features described in relation to one example may be combined with features of another example.
The word "comprising" does not exclude the presence of elements other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims.
The features of any dependent claim may be combined with the features of any of the independent claims or other dependent claims.
Claims (15)
1. A method, comprising:
identifying, using at least one processor, data portions of multiple different data objects stored in at least one memory that are to be processed using the same logical operation;
identifying, using at least one processor, a representation of an operand stored in at least one memory, the operand being for providing the logical operation;
providing the operand to a logic engine;
storing the data portions in multiple input data buffers, wherein each of the input data buffers comprises a data portion of a different data object;
performing the logical operation on each of the data portions using the logic engine; and
storing an output for each data portion, each of the outputs comprising data derived from a different data object.
2. The method according to claim 1, wherein performing the logical operation comprises performing a vector-by-matrix multiplication.
3. The method according to claim 1, comprising: determining a cycle latency of the memory storing the operand, and wherein identifying data portions comprises identifying a number of data portions based on the value of the cycle latency of the memory storing the operand.
4. The method according to claim 1, wherein the data objects are stored in memory portions associated with different data retrieval latencies, the method comprising:
fetching the data portions such that multiple data portions are stored in the data buffers, and
wherein performing the logical operation on the data portions using the logic engine is carried out substantially continuously.
5. The method according to claim 1, wherein providing the operand to the logic engine comprises writing resistance values to a resistive memory array.
6. A processing unit, comprising:
a memory comprising at least one memory portion, the memory being for holding at least one of multiple different data objects and multiple logical operators, wherein the logical operators are for operating on data portions of the data objects;
a logic engine for performing a logical operation on at least one data portion;
multiple input buffers and multiple output buffers associated with the logic engine;
a data batch processing module for identifying, within multiple different data objects, data portions thereof that are to be operated on by a common logical operator, and for sending the data portions of the multiple different data objects to the input data buffers; and
wherein the logic engine is further for performing a logical operation, using the common logical operator, on each of the data portions in the input data buffers consecutively, and for providing the output from each operation to one of the multiple output buffers.
7. The processing unit according to claim 6, wherein at least one memory portion and the logic engine are provided on dies of a 3D memory stack.
8. The processing unit according to claim 7, wherein the logic engine is provided on a first die and the memory is provided on at least one other die, wherein the dies are interconnected by through-silicon vias.
9. The processing unit according to claim 7, wherein the memory comprises multiple memory portions, and at least one memory portion has a memory size or type that differs from that of at least one other memory portion.
10. The processing unit according to claim 6, wherein at least a portion of the memory is an on-chip memory, and the logic engine is an on-chip processing element.
11. The processing unit according to claim 6, wherein the data batch processing module is for providing multiple data portions to the input data buffers, the number of data portions being less than or equal to the value of a cycle latency associated with the memory storing the operand.
12. The processing unit according to claim 6, which is for performing machine learning applications.
13. A method, comprising:
extracting, from at least one memory, multiple vectors associated with multiple different data objects;
storing the multiple vectors in different data buffers local to a logic engine that is integral to a 3D memory stack;
extracting a logical operator stored in the 3D memory stack, and providing the logical operator to the logic engine;
performing multiple consecutive vector-by-matrix multiplications according to the logical operator using the logic engine, wherein the same logical operator is multiplied with each of the multiple vectors associated with the multiple different data objects.
14. The method according to claim 13, further comprising providing a new logical operator to the logic engine.
15. The method according to claim 13, wherein the vectors are image feature maps.
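Claims 3 and 11 relate the number of batched data portions to the cycle latency of the memory storing the operand. The sketch below shows one possible reading of that constraint and is not taken from the disclosure: it assumes the latency is known as a whole number of cycles and that one data portion is taken from each data object. The function name `select_batch` and the example data are hypothetical.

```python
def select_batch(data_objects, cycle_latency):
    """Pick one data portion from each of at most `cycle_latency` objects.

    data_objects: dict mapping a data-object name to a list of its portions
    cycle_latency: operand-memory latency in cycles (assumed known)
    """
    if cycle_latency < 1:
        raise ValueError("cycle_latency must be at least 1")
    batch = {}
    for name, portions in data_objects.items():
        if len(batch) == cycle_latency:
            break  # batch size is capped at the cycle latency
        if portions:
            batch[name] = portions[0]  # one portion per distinct data object
    return batch


# Three hypothetical data objects, each with a single data portion.
data_objects = {"a": [[1, 2]], "b": [[3, 4]], "c": [[5, 6]]}
batch = select_batch(data_objects, cycle_latency=2)
```

Each entry of `batch` would then be placed in its own input data buffer, so that every buffer holds a data portion of a different data object, as recited in claim 1.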
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2016/025143 WO2017171769A1 (en) | 2016-03-31 | 2016-03-31 | Logical operations |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107615241A (en) | 2018-01-19 |
Family
ID=59966290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201680031683.4A Pending CN107615241A (en) | 2016-03-31 | 2016-03-31 | Logical operation |
Country Status (4)
Country | Link |
---|---|
US (1) | US11126549B2 (en) |
EP (1) | EP3286638A4 (en) |
CN (1) | CN107615241A (en) |
WO (1) | WO2017171769A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111221748A (en) * | 2018-11-26 | 2020-06-02 | 通用汽车环球科技运作有限责任公司 | Method and apparatus for memory access management for data processing |
US11126549B2 (en) | 2016-03-31 | 2021-09-21 | Hewlett Packard Enterprise Development Lp | Processing in-memory architectures for performing logical operations |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10290327B2 (en) * | 2017-10-13 | 2019-05-14 | Nantero, Inc. | Devices and methods for accessing resistive change elements in resistive change element arrays |
US10409889B2 (en) | 2017-12-18 | 2019-09-10 | Mythic, Inc. | Systems and methods for mapping matrix calculations to a matrix multiply accelerator |
US10496374B2 (en) | 2018-03-22 | 2019-12-03 | Hewlett Packard Enterprise Development Lp | Crossbar array operations using ALU modified signals |
KR102615443B1 (en) | 2018-05-25 | 2023-12-20 | 에스케이하이닉스 주식회사 | Machine learning apparatus and machine learning system using the same |
US20200183837A1 (en) | 2018-12-07 | 2020-06-11 | Samsung Electronics Co., Ltd. | Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning |
US10534747B2 (en) * | 2019-03-29 | 2020-01-14 | Intel Corporation | Technologies for providing a scalable architecture for performing compute operations in memory |
US11769043B2 (en) | 2019-10-25 | 2023-09-26 | Samsung Electronics Co., Ltd. | Batch size pipelined PIM accelerator for vision inference on multiple images |
US11726784B2 (en) | 2020-04-09 | 2023-08-15 | Micron Technology, Inc. | Patient monitoring using edge servers having deep learning accelerator and random access memory |
US11461651B2 (en) * | 2020-04-09 | 2022-10-04 | Micron Technology, Inc. | System on a chip with deep learning accelerator and random access memory |
US11874897B2 (en) * | 2020-04-09 | 2024-01-16 | Micron Technology, Inc. | Integrated circuit device with deep learning accelerator and random access memory |
US11887647B2 (en) | 2020-04-09 | 2024-01-30 | Micron Technology, Inc. | Deep learning accelerator and random access memory with separate memory access connections |
US11355175B2 (en) | 2020-04-09 | 2022-06-07 | Micron Technology, Inc. | Deep learning accelerator and random access memory with a camera interface |
US11200948B1 (en) * | 2020-08-27 | 2021-12-14 | Hewlett Packard Enterprise Development Lp | System for a flexible conductance crossbar |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1581061A (en) * | 2003-12-05 | 2005-02-16 | 智权第一公司 | Dynamic logic register |
US20140172937A1 (en) * | 2012-12-19 | 2014-06-19 | United States Of America As Represented By The Secretary Of The Air Force | Apparatus for performing matrix vector multiplication approximation using crossbar arrays of resistive memory devices |
CN104011658A (en) * | 2011-12-16 | 2014-08-27 | 英特尔公司 | Instructions and logic to provide vector linear interpolation functionality |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6023759A (en) | 1997-09-30 | 2000-02-08 | Intel Corporation | System for observing internal processor events utilizing a pipeline data path to pipeline internally generated signals representative of the event |
US9684632B2 (en) * | 2009-06-04 | 2017-06-20 | Micron Technology, Inc. | Parallel processing and internal processors |
TW201347101A (en) | 2011-12-01 | 2013-11-16 | Mosaid Technologies Inc | CPU with stacked memory |
US20140040532A1 (en) | 2012-08-06 | 2014-02-06 | Advanced Micro Devices, Inc. | Stacked memory device with helper processor |
US9110778B2 (en) | 2012-11-08 | 2015-08-18 | International Business Machines Corporation | Address generation in an active memory device |
KR20150100042A (en) | 2014-02-24 | 2015-09-02 | 한국전자통신연구원 | An acceleration system in 3d die-stacked dram |
US9466362B2 (en) * | 2014-08-12 | 2016-10-11 | Arizona Board Of Regents On Behalf Of Arizona State University | Resistive cross-point architecture for robust data representation with arbitrary precision |
WO2017171769A1 (en) | 2016-03-31 | 2017-10-05 | Hewlett Packard Enterprise Development Lp | Logical operations |
2016
- 2016-03-31 WO PCT/US2016/025143 (WO2017171769A1): Active, Application Filing
- 2016-03-31 US US16/073,202 (US11126549B2): Active
- 2016-03-31 CN CN201680031683.4A (CN107615241A): Pending
- 2016-03-31 EP EP16897318.8A (EP3286638A4): Not active, Withdrawn
Non-Patent Citations (3)
Title |
---|
AMIR MORAD et al.: "Efficient Dense And Sparse Matrix Multiplication On GP-SIMD", POWER AND TIMING MODELING, OPTIMIZATION AND SIMULATION * |
LIFAN XU et al.: "Scaling Deep Learning On Multiple In-Memory Processors", 3RD WORKSHOP ON NEAR-DATA PROCESSING IN CONJUNCTION WITH MICRO-48 * |
TAREK M. TAHA et al.: "Exploring the Design Space of Specialized Multicore Neural Processors", PROCEEDINGS OF INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11126549B2 (en) | 2016-03-31 | 2021-09-21 | Hewlett Packard Enterprise Development Lp | Processing in-memory architectures for performing logical operations |
CN111221748A (en) * | 2018-11-26 | 2020-06-02 | 通用汽车环球科技运作有限责任公司 | Method and apparatus for memory access management for data processing |
CN111221748B (en) * | 2018-11-26 | 2023-07-25 | 通用汽车环球科技运作有限责任公司 | Method and apparatus for memory access management for data processing |
Also Published As
Publication number | Publication date |
---|---|
US20190042411A1 (en) | 2019-02-07 |
EP3286638A4 (en) | 2018-09-05 |
US11126549B2 (en) | 2021-09-21 |
WO2017171769A1 (en) | 2017-10-05 |
EP3286638A1 (en) | 2018-02-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180119 |