CN107615241A - Logical operation - Google Patents

Logical operation

Info

Publication number
CN107615241A
Authority
CN
China
Prior art keywords
data
memory
logic engine
logic
portion
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201680031683.4A
Other languages
Chinese (zh)
Inventor
N. Muralimanohar
A. Shafiee Ardestani
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hewlett Packard Enterprise Development LP
Original Assignee
Hewlett Packard Enterprise Development LP
Application filed by Hewlett Packard Enterprise Development LP
Publication of CN107615241A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 — Digital computers in general; Data processing equipment in general
    • G06F 15/76 — Architectures of general purpose stored program computers
    • G06F 15/78 — Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 — System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7821 — Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 — Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 — Addressing or allocation; Relocation
    • G06F 12/08 — Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 — Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 — Complex mathematical operations
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 — Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 — Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/57 — Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/06 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 — Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/10 — Providing a specific technical effect
    • G06F 2212/1016 — Performance improvement
    • G06F 2212/1024 — Latency reduction
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

In an example, a method comprises using at least one processor to identify, in multiple different data objects stored in at least one memory, data portions that are to be processed using the same logical operation. The method may also comprise identifying a representation of an operand stored in the at least one memory, the operand being for providing the logical operation, and providing the operand to a logic engine. The data portions are stored in a plurality of input data buffers, wherein each of the input data buffers comprises a data portion of a different data object. The logic engine may perform the logical operation on each of the data portions, and an output for each data portion is stored in a plurality of output data buffers, wherein each of the outputs comprises data derived from a different data object.

Description

Logical operation
Background
An architecture is described that allows for processing-in-memory (PIM) processing units. In PIM, rather than fetching data from a remote memory for processing, the processing is performed locally, in the memory.
Brief description of the drawings
Non-limiting examples will now be described with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of an example of a method of performing a logical operation;
Fig. 2 is a simplified schematic of an example resistive memory array apparatus;
Fig. 3 and Fig. 4 are schematic examples of processing apparatus; and
Fig. 5 is a flowchart of another example of a method of performing a logical operation.
Detailed description
Fig. 1 shows an example of a method. In block 102, the method comprises identifying, in multiple different data objects stored in a memory, data portions of the different data objects that are to be processed using the same logical operation. For example, the data objects may be associated with at least one image, and the operation may comprise a stage of object recognition. For instance, certain logical operations (such as convolutions) may be used to perform tasks such as object recognition (for example, face detection). A data object may comprise a set of image pixels, or a set of feature maps derived from a set of image pixels, and the operation may comprise a stage of face detection. Such inputs may be referred to as "input neurons". The data objects may be entirely unrelated (for example, comprising images from, or derived from images from, a variety of sources).
Identifying the data portions may comprise, for example, determining a size of the data portions based on at least one of: the number of data objects and/or data portions, the data outputs, and the memory resources (which may be, for example, buffers for receiving the data portions, and/or memory storing the operands, data objects and/or data portions). Identifying the data portions may comprise determining an order of the data portions, for example such that the data portions can be interleaved. As discussed further below, the order, size and number of the data portions may be determined so as to provide substantially continuous data availability for processing using the same operand or operands.
Block 104 comprises identifying a representation of an operand in memory for providing a logical operation; and in block 106 a logic engine is provided with the operand (which may, for example, comprise a matrix). The logic engine may be a vector-matrix multiplication engine. In some examples, the logic engine may be provided as a resistive memory array. In some examples, the logic engine may comprise an arithmetic logic unit (ALU).
In block 108, the data portions are stored in a plurality of input data buffers, each of which comprises a data portion of a different data object. This may comprise storing the data portions in the anticipated processing order (that is, the order in which the data portions are to be subjected to the logical operation). Block 110 comprises performing the logical operation on each of the data portions. This may comprise performing a matrix-vector multiplication or the like. In examples in which the data portions are order-dependent, the order may be defined such that the utilization of the logic engine performing the logical operation is high, and in some examples substantially continuous (that is, the processing pipeline is at least substantially full, and in some examples as full as possible). Block 112 comprises storing an output for each data portion, wherein each output comprises data derived from a different data object. In some examples, the outputs may be stored in a plurality of output data buffers, wherein each of the output data buffers comprises data derived from a different data object.
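The flow of blocks 102–112 may be sketched as follows. This is a simplified illustration only: the object names, the `apply_op` callback and the dot-product operation are invented placeholders, not part of the patent.

```python
from collections import deque

def batched_logical_op(data_objects, kernel, apply_op):
    """Sketch of blocks 102-112: buffer one data portion per data object,
    apply the same operation (kernel) to each, and collect one output per
    data object. The kernel is fetched once, not once per portion."""
    # Blocks 102/108: one input buffer entry per data object, queued in
    # the anticipated processing order.
    input_buffers = deque(
        (obj_id, portion) for obj_id, portion in data_objects.items()
    )
    output_buffers = {}
    # Blocks 110/112: drain the buffers through the logic engine while
    # the same operand stays resident.
    while input_buffers:
        obj_id, portion = input_buffers.popleft()
        output_buffers[obj_id] = apply_op(kernel, portion)
    return output_buffers

# Hypothetical data: three unrelated "images", each contributing a portion.
portions = {"img_a": [1, 2], "img_b": [3, 4], "img_c": [5, 6]}
dot = lambda k, v: sum(ki * vi for ki, vi in zip(k, v))
outs = batched_logical_op(portions, kernel=[0.5, 0.5], apply_op=dot)
print(outs)  # one output per data object, all produced with one kernel
```

The point of the sketch is the buffer discipline: the operand is loaded once and every buffered portion is processed against it before any new operand is fetched.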
The method may be performed using apparatus having a processing-in-memory (PIM) architecture. In such architectures, computational processing units are placed in or near the memory (for example, in a memory array or sub-array) to avoid long communication paths.
In some examples, for instance when using resistive memory devices such as "memristors" (electrical components whose resistance can be written in a non-volatile manner), the memory itself may provide the processing capability. An array of resistive memory devices may be used to perform logical operations.
In some examples, the logic engine may be associated with dynamic random access memory (DRAM). In such examples, the association may comprise the logic engine being in physical proximity to, and/or integrated with, components also comprising the DRAM (for example, on the same chip or die). In some examples, the logic engine and the DRAM may be provided on the same die. In some examples, the logic engine may be associated with a DRAM buffer (for example, physically arranged at the DRAM buffer), or the logic engine may be provided as (or as part of) the buffer of a load-reduced dual in-line memory module (LRDIMM), which may also comprise at least one DRAM chip. In another example, a logic layer may be provided as part of a die that also comprises a memory portion (for example, a DRAM portion). For instance, the memory may be provided on one side of the die, and the logic engine may be provided on the opposite side of the die.
Some computational tasks use "deep learning" techniques. When using deep learning processing techniques, a logical operation is performed on input data to provide an output in a first processing layer. Logical operations are then performed on the output in subsequent processing layers, in some examples for multiple iterations. Deep learning may be used in fields such as big data analytics, image and speech recognition, and other computationally complex tasks. At least some of the processing layers may comprise convolutions, for example applying a logical operation to the input data using matrix multiplication. A convolution operand (for example, a matrix) may be referred to as a processing kernel.
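The idea that a convolution can be applied via matrix multiplication can be sketched as follows. This is an illustrative example, not taken from the patent; the 4×4 "image" and 2×2 mean kernel are arbitrary choices.

```python
import numpy as np

def conv2d_as_matmul(image, kernel):
    """Express a 2D 'valid' convolution (correlation) as one matrix product:
    every kernel-sized patch of the image becomes a row, so the layer
    reduces to the kind of vector-matrix operation a logic engine performs."""
    kx, ky = kernel.shape
    ox = image.shape[0] - kx + 1
    oy = image.shape[1] - ky + 1
    # Gather every kx-by-ky patch as a flattened row (an "im2col" layout).
    patches = np.array([image[i:i + kx, j:j + ky].ravel()
                        for i in range(ox) for j in range(oy)])
    # One matrix-vector product computes every output entry at once.
    return (patches @ kernel.ravel()).reshape(ox, oy)

image = np.arange(16.0).reshape(4, 4)   # arbitrary 4x4 "image"
kernel = np.full((2, 2), 0.25)          # 2x2 mean filter (arbitrary choice)
out = conv2d_as_matmul(image, kernel)
print(out)  # 3x3 output feature map
```

Laying the computation out this way is what makes a vector-matrix multiplication engine a natural substrate for convolutional layers.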
In order to accelerate deep learning workloads, accelerating the convolution of data may be of particular interest, as convolution can take up to, or around, 80% of the execution time of some deep learning applications.
In some examples, the number of kernels held in a storage device can be very large. For example, although a shared kernel may be used for several or all of a set of input data portions derived from a data object (which may be referred to as input neurons), in other examples different kernels are used to process different input neurons from the same data object. In some examples, each input neuron derived from a data object may be associated with a different "private" kernel. For some applications, the private kernels may occupy more than a gigabyte of storage space.
In examples in which each input neuron is associated with a private kernel, temporal locality (that is, data being held locally when it is to be reprocessed) can be low, which can affect the utility of techniques such as caching.
To consider a specific example, face detection may be applied to a video file. In a processing layer, N_i data objects may be provided. In this example, these may be N_i images in the video file, and the N_x × N_y data portions (input neurons, for example pixels or feature-map entries) derived from each of the N_i images may be convolved with kernels of size K_x × K_y to form N_o output feature maps. In this case, each output entry has its own kernel (that is, the kernels are private), and the computation may be carried out based on an equation of the form:

    map_out[o](x, y) = Σ_i Σ_kx Σ_ky w[o, i, x, y](kx, ky) · map_in[i](x + kx, y + ky)

where map and w respectively denote the feature-map entries and the weights in the convolutional layer, and the sums run over the N_i input maps and the K_x × K_y kernel offsets.

Based on the above equation, such a layer occupies

    N_x × N_y × N_i × N_o × K_x × K_y

of kernel space, which in real-life applications can easily exceed 1 GB. Furthermore, because the kernels are private, there is no temporal locality (locally cached data is not reused), which imposes high bandwidth pressure.
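The kernel-space expression can be checked numerically. The layer dimensions below are illustrative assumptions only, chosen to show how quickly private kernels exceed a gigabyte; they are not figures from the patent.

```python
# Private-kernel storage for one convolutional layer, per the expression
# N_x * N_y * N_i * N_o * K_x * K_y (one kernel per output entry).
Nx, Ny = 32, 32        # output feature-map width and height (assumed)
Ni, No = 64, 64        # input and output feature-map counts (assumed)
Kx, Ky = 9, 9          # kernel width and height (assumed)
BYTES_PER_WEIGHT = 4   # e.g. 32-bit weights (assumed)

weights = Nx * Ny * Ni * No * Kx * Ky
size_bytes = weights * BYTES_PER_WEIGHT
print(f"{size_bytes / 2**30:.2f} GiB")  # comfortably above 1 GiB
```

Even at these modest dimensions the private kernels of a single layer exceed a gigabyte, which is why the text notes that there is no temporal locality to exploit in the kernel stream itself.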
According to the method of Fig. 1, the data portions of different data objects (for example, different images or different feature maps) that are to be processed using the same logical operation are identified and stored in data buffers. This allows batch processing: the same kernel may be used to process each of the data portions (which may be derived from different data objects). In this way, because each kernel is reused, the frequency with which new kernels are fetched is reduced. For example, the same kernel may be used to apply the same convolution to the corresponding input neurons (for example, pixels or feature-map entries) of each of a number of images before the kernel is changed. This provides temporal locality with respect to the kernels, even in examples in which there is no, or low, temporal locality with respect to the data objects or data portions.
The data portions may be retrieved and provided to the data buffers in an order such that multiple data portions are stored in the data buffers, and the logical operation using a particular operand can be performed on the data portions substantially continuously. In other words, the data portions may be fetched such that a substantially continuous pipeline of data to be processed by the logic engine can be provided. In some examples, the data objects may be associated with varying latencies: fetching some data objects may take longer than fetching others, for example depending on the memory type, location and so on. The memory resources for different data objects or data portions may differ in size. Batch processing of the data may take such differences into account, such that data can be requested early and held in a local buffer to reduce any gaps in the pipeline which could otherwise arise, for example, if the data portions were requested sequentially and processed "just in time".
As noted above, in some examples of PIM architectures, the processing capability may be provided very close to the memory, or may be embedded in the memory. In some examples, the memory itself may provide the logic engine (for example, a vector-matrix multiplication engine). In one example, a kernel may be embodied as a resistive memory array comprising a two-dimensional grid of resistive memory elements, which may be a crossbar array. Fig. 2 shows an example of a crossbar array 200 of resistive memory elements 202 (for example, memristors or other resistive memory elements). The resistive memory array 200 comprises at least one input row for receiving input data, and each resistive memory element has a bit depth. In this example, the term "bit depth" refers to the number of bits that can be represented by a single resistive memory element 202. In some examples, an element 202 may be binary, taking one of two values (for example, representing 0 or 1 — a bit depth of one), but resistive memory elements capable of taking multiple values (for example, 32 distinct levels — a bit depth of five) may be used. The array 200 may be "programmed" by subjecting the elements to voltage pulses, each of which incrementally changes the resistance of an element 202; the resistance level is then "memorised" by the element 202, even after the power supply is removed.
In some examples, such an array 200 may process an input voltage vector (which may, for example, be provided by a digital-to-analog converter (DAC) converting digital data bits to analog voltages) to provide an output vector, wherein the input values are weighted by the conductance at each element 202 of the array 200. In effect, this means that the array 200 performs a dot-product matrix operation on the input to produce the output. The weights of the elements 202 may be individually "programmed" by subjecting the elements 202 to voltage pulses, as described above. Such arrays 200 may be associated with high density, low power consumption, long cycle endurance and/or fast switching speeds. Such an array may therefore perform a matrix combination operation. In some examples, the array may form part of a dot-product engine for matrix multiplication. Such a dot-product engine may be used in deep learning apparatus, and for performing computationally complex operations.
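The weighted-summation behaviour of such an array can be sketched numerically as follows. This is an idealised model with arbitrary conductance and voltage values; a real array would include DAC/ADC stages and device non-idealities.

```python
import numpy as np

# Conductance matrix G (siemens): each column models one programmed
# column of the crossbar; each row is driven by one input voltage.
G = np.array([
    [1e-3, 2e-3],
    [3e-3, 1e-3],
    [2e-3, 2e-3],
])

# Input voltages applied to the rows (volts) — kept low, below the
# programming threshold, so the element resistances do not change.
V = np.array([0.1, 0.2, 0.3])

# By Kirchhoff's current law, the current collected on each column is
# sum_i V[i] * G[i][j] — i.e. the crossbar computes a vector-matrix
# product in the analog domain.
I = V @ G
print(I)  # one analog dot product per column (amperes)
```

The column currents are exactly the dot products of the input vector with the programmed conductance columns, which is why the array can serve as a vector-matrix multiplication engine.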
In an example, analog data may be supplied for processing using the resistive memory array 200. The data may, for example, represent at least one pixel of an image, or a word (or sub-word or phrase) of speech, or the results of a scientific experiment, or any other data. The input data may be provided as a vector (that is, a one-dimensional string of data) applied to the array as voltage values (generally lower than the voltages used to set the resistance of the array elements 202, such that the resistance of the elements 202 does not change during this operation).
If such a resistive memory array 200 is used, block 106 of Fig. 1 may comprise writing resistance values to the resistive memory array.
Fig. 3 shows an example of processing apparatus 300, which comprises a memory 302, a logic engine 304, a plurality of input buffers 306, a plurality of output buffers 308 and a data batching module 310, the buffers 306, 308 being associated with the logic engine 304 (and in some examples local to the logic engine 304).
The memory 302 comprises at least one memory portion and holds a plurality of different data objects and a plurality of logical operators, wherein the logical operators are for operating on data portions of the data objects.
The memory 302 may comprise at least one memory portion, some of which may be remote memory portions. In examples in which multiple memory portions are provided, the memory portions may comprise a number of different memory types and/or sizes. The memory 302 may comprise at least one non-volatile memory portion and/or at least one volatile memory portion (for example, SRAM or DRAM). In some examples, at least the plurality of logical operators is stored in a memory portion that is local to the logic engine 304. In some examples, the logic engine 304 may be embedded in the memory storing the plurality of logical operators. In other examples, the logic engine 304 may be connected to the memory storing the plurality of logical operators via a relatively high-bandwidth data bus (for example, through-silicon vias (TSVs)).
In this example, the provision of the buffers 306, 308 and the action of the batching module 310 decouple the specification of the memory 302 from the capabilities of the logic engine 304, for example allowing the logic engine 304 to be designed for computational efficiency while the memory 302 is designed for storage capacity. For example, at least a portion of the memory 302 may comprise NMOS-based memory (which may be relatively dense but slow). In other examples, the memory 302 and/or the buffers 306, 308 may be implemented with a balance between speed and density in mind.
The logic engine 304 performs a logical operation on at least one data portion and, in some examples, on multiple data portions. The logic engine 304 may be provided by any processor. In some examples, the logic engine 304 may comprise a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), single-instruction multiple-data (SIMD) processing elements or the like, which may provide components of a convolutional neural network (CNN) or deep neural network (DNN). In some examples, the logic engine 304 may comprise a resistive memory array.
In use of the apparatus 300, the data batching module 310 identifies, in a plurality of different data objects, data portions on which a common operand is to operate (each data portion comprising all or part of a data object), and sends the data portions of the plurality of different data objects to the input data buffers 306. The logic engine 304 performs the logical operation on each of the data portions using the common operand continuously (in some examples, at least substantially continuously), and supplies an output from each operation to one of the plurality of output buffers 308.
In some examples, the data batching module 310 may be arranged to determine the size of the data portions to be stored in the buffers 306 based on at least one of: the plurality of input data sets, the output sets, and the various memory resources (for example, the buffers 306, 308 or at least a portion of the memory 302). The data batching module 310 may be arranged to determine an interleaving of the data portions, for example so as to ensure that the utilization of the kernel embodied by the operand is high (for example, such that the logic engine 304 performs the logical operation substantially continuously for the period during which a particular kernel is in use).
As noted above, by using a common operand (which represents a common kernel) for multiple data objects, temporal locality of the kernel can be achieved.
In some examples, the processing apparatus 300 may have a PIM architecture, in which the operands are stored locally to the logic engine 304. By using a PIM architecture, the off-chip bandwidth for fetching the kernels can be avoided, with corresponding power and time savings. However, this may mean that the on-chip memory banks available for storing the kernels occupy a relatively large area, and transferring the kernels to the processor will consume power, which can result in significant energy expenditure.
Fig. 4 shows an example of processing apparatus comprising a 3D memory stack 400. The stack 400 comprises a plurality of memory dies 402, 404 and at least one processing die 406. Each of the memory dies 402 comprises at least one memory portion 408a-h. The memory portions 408 may be similar in at least one respect, such as type or size (for example, the same memory type and/or size), or may differ. In this example, the logic engine 304 is provided at the side of a TSV 410 connecting the logic engine 304 to the first memory die 402 (that is, in close physical proximity to the TSV 410). This physical proximity may mean that there is substantially no wiring between the TSV 410 and the logic engine.
In this example, the first memory die 402 stores a plurality of logical operators. In some examples, at least a portion 408 of the memory storing the logical operators, data objects and/or data portions comprises on-chip memory, and the logic engine 304 is an on-chip processing element. In some examples, the plurality of logical operators may be distributed over multiple memory dies 402, 404. In this example, one TSV 410 is shown. However, in other examples, multiple TSVs 410 may be provided, for example associated with different dies, or one die may have multiple TSVs 410 interconnecting it with the processing die 406. In this example, the processing die 406 also comprises the plurality of input and output buffers 306, 308. The processing die 406 may also comprise the data batching module 310, or this may be provided elsewhere in the apparatus.
In other examples, the logic engine 304 may be provided on the same memory die as that storing the plurality of logical operators, although this may increase the footprint of the apparatus 400.
In some examples, the data objects are also stored on the memory dies 402, 404, although the data objects may additionally be stored in remote memory portions, and at least a portion of a data object may be received from such a remote memory portion. As noted above, the data objects may be stored in a number of different memory resources, which may differ in size and may supply data with varying latency (which may be associated with at least one of: the time to retrieve the data from a particular memory resource, and the time to transfer the data to the processing die 406).
In some examples, the latency associated with supplying the logical operators to the logic engine is considered. Even if a logical operator can be fetched relatively quickly, the apparatus 400 still experiences latency, which can cause (or enlarge) gaps or "bubbles" in the processing pipeline and therefore reduce performance. For example, accessing a row in a 3D memory with a 128-TSV bus may be associated with a memory latency of 16 ns. In the absence of temporal locality, one row of a matrix operator (around 1 KB for a 3D memory) can be computed in such an apparatus in 16 ns. As a result, in such an example, the computation per 3D TSV bus is limited to processing a 1 KB matrix operator every 16 ns (16 cycles if the logic engine 304 is provided in an exemplary 1 GHz processing unit). However, if the data is available, the logic engine 304 can process it in 1 ns (1 cycle at 1 GHz). In other words, the computational efficiency of the arithmetic unit drops to 1/16, because the memory latency for a matrix row is 16 times higher than the compute latency.
In some examples, therefore, providing a plurality of input and output buffers 306, 308 allows this memory latency to be compensated for, such that the apparatus can operate at full computational speed without "bubbles" in the pipeline. In some examples, the number of buffers 306, 308 may be selected to fully compensate for the difference. For example, the cycle delay, which relates the memory latency to the compute latency, may be used to determine the number of buffers 306, 308. In the example above, the cycle delay is 16, and there may therefore be 16 input buffers 306 and 16 output buffers 308, and the same matrix operation may be performed 16 times (assuming there are enough data portions to which the operation is to be applied).
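The buffer-count reasoning in this example can be written out as a short calculation, using the 16 ns / 1 ns figures from the example above; the ceiling step is an added assumption to cover non-integer latency ratios.

```python
# Buffers needed to hide memory latency behind compute, using the figures
# from the example: a 16 ns row fetch versus a 1 ns (1-cycle at 1 GHz)
# compute per 1 KB matrix row.
memory_latency_ns = 16.0
compute_latency_ns = 1.0

cycle_delay = memory_latency_ns / compute_latency_ns
n_buffers = int(-(-cycle_delay // 1))  # ceiling, for non-integer ratios
print(n_buffers)  # 16 input buffers and 16 output buffers

# Unbuffered, the engine computes 1 cycle out of every 16; with n_buffers
# portions in flight, a fetch completes every compute cycle, so the
# pipeline stays full.
utilization_unbuffered = compute_latency_ns / memory_latency_ns
utilization_buffered = min(1.0, n_buffers * compute_latency_ns / memory_latency_ns)
print(utilization_unbuffered, utilization_buffered)
```

With 16 buffer pairs, requests issued 16 cycles ahead complete just as the engine becomes free, so the same matrix operation can be applied back-to-back with no pipeline gaps.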
Such an apparatus 400 can use the capacity of the 3D stack to store kernels on-chip (for example, as matrix operators); the kernels may be private kernels, and may be used for compute-intensive operations such as convolutions over data for deep learning applications. By storing the kernels on a separate die (or dies) from the logic engine 304, space on the processing die 406 (which in some examples may be a relatively expensive component) can be freed up (for example, for more computational logic). It may also be noted that the footprint of the kernel memory space on the processing die 406 is precisely the space occupied by the TSVs 410. Supplying the logic engine 304 from memory on the separate dies 402, 404 therefore reduces the footprint of the stack 400. In addition, the apparatus 400 uses batch processing to achieve computational efficiency.
A 3D memory stack can have a high capacity, for example 4 GB of storage space over four dies. Such a 3D memory can store logical operators (for example, deep learning kernels) for multiple convolutional layers of one or more multi-layer applications. This can enable fast context switching between different convolutional layers on demand.
Fig. 5 is a flowchart of an example of a method comprising, in block 502, fetching a plurality of vectors associated with a plurality of different data objects. The plurality of vectors may comprise, or be derived from, data portions of the data objects, as described above. In some examples, the vectors are derived from image feature maps, or from other data. A vector may comprise a string of numerical data, and may be fetched from local memory (for example, memory within a 3D memory stack that also provides the logical operation) or from a different (for example, more remote) memory. In block 504, the plurality of vectors is stored in different data buffers local to a logic engine integral to the 3D memory stack. Block 506 comprises fetching a logical operator stored in the 3D memory stack, and in block 508 the logical operator is supplied to the logic engine.
In block 510, a plurality of vector and matrix multiplications is performed using the logic engine according to the logical operator, wherein the same logical operator is multiplied with each of the plurality of vectors associated with the plurality of different data objects. In some examples, for example those using a resistive memory array, the vectors may undergo digital-to-analog conversion so as to represent the vectors as analog voltages. In some examples, a different output vector is provided for each of the data objects. For example, such vector outputs may be held in different data buffers or registers.
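The flow of blocks 502-510 can be sketched digitally as follows (a sketch under stated assumptions: `method_500` and `fetch_operator` are invented names, and the NumPy matrix product stands in for the logic engine, which in some examples would instead carry out an analog resistive-array operation):

```python
import numpy as np

def method_500(data_objects, fetch_operator):
    """Sketch of blocks 502-510: one operator, many vectors, one output each."""
    # Block 502: extract one vector per data object.
    vectors = [np.asarray(obj, dtype=float) for obj in data_objects]
    # Block 504: store the vectors in per-object buffers (one slot each).
    buffers = list(vectors)
    # Blocks 506/508: fetch the stored operator and provide it to the engine.
    operator = fetch_operator()
    # Block 510: multiply the SAME operator with every buffered vector,
    # keeping a separate output per data object.
    return [buf @ operator for buf in buffers]

objs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fetch = lambda: np.array([[1.0, 2.0], [3.0, 4.0]])
outputs = method_500(objs, fetch)
print(outputs)  # one output vector per data object
```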
The method may be repeated, in some examples including providing a new logical operator to the logic engine for each iteration.
Examples in the present disclosure can be provided as methods, systems or machine readable instructions, such as any combination of software, hardware, firmware or the like. Such machine readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc.) having computer readable program code therein or thereon.
The present disclosure is described with reference to flow charts and/or block diagrams of the method, devices and systems according to examples of the present disclosure. Although the flow charts described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart. It shall be understood that each flow and/or block in the flow charts and/or block diagrams, as well as combinations of the flows and/or blocks in the flow charts and/or block diagrams, can be realized by machine readable instructions.
The machine readable instructions may, for example, be executed by a general purpose computer, a special purpose computer, an embedded processor or processors of other programmable data processing devices to realize the functions described in the description and diagrams (for example, functions of the processing apparatus 300, 400). In particular, a processor or processing apparatus may execute the machine readable instructions. Thus functional modules of the apparatus and devices (for example, the batch processing module 310 or the logic engine 304) may be implemented by a processor executing machine readable instructions stored in a memory, or by a processor operating in accordance with instructions embedded in logic circuitry. The term "processor" is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, programmable gate array, etc. The methods and functional modules may all be performed by a single processor or divided amongst several processors.
Such machine readable instructions may also be stored in a computer readable storage device (for example, memory 302) that can guide a computer or other programmable data processing devices to operate in a specific mode.
Such machine readable instructions may also be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operations to produce computer-implemented processing; thus the instructions executed on the computer or other programmable devices realize the functions specified by the flow(s) in the flow charts and/or the block(s) in the block diagrams.
Further, the teachings herein may be implemented in the form of a computer software product, the computer software product being stored in a storage medium and comprising a plurality of instructions for making a computer device implement the methods recited in the examples of the present disclosure.
While the method, apparatus and related aspects have been described with reference to certain examples, various modifications, changes, omissions and substitutions can be made without departing from the spirit of the present disclosure. It is intended, therefore, that the method, apparatus and related aspects be limited only by the scope of the following claims and their equivalents. It should be noted that the above-mentioned examples illustrate rather than limit what is described herein, and that those skilled in the art will be able to design many alternative implementations without departing from the scope of the appended claims. Features described in relation to one example may be combined with features of another example.
The word "comprising" does not exclude the presence of elements other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims.
The features of any dependent claim may be combined with the features of any of the independent claims or other dependent claims.

Claims (15)

1. A method comprising:
identifying, using at least one processor, data portions from a plurality of different data objects stored in at least one memory which are to be processed using the same logical operation;
identifying, using at least one processor, a representation of an operand stored in at least one memory, the operand being for providing a logical operation;
providing the operand to a logic engine;
storing the data portions in a plurality of input data buffers, wherein each of the input data buffers comprises a data portion of a different data object;
performing, using the logic engine, the logical operation on each of the data portions; and
storing an output for each of the data portions, each of the outputs comprising data derived from a different data object.
2. The method of claim 1, wherein performing the logical operation comprises performing a vector and matrix multiplication.
3. The method of claim 1, comprising: determining a cycle latency of a memory storing the operand, and wherein identifying data portions comprises identifying a number of data portions based on a value of the cycle latency of the memory storing the operand.
4. The method of claim 1, wherein the data objects are stored in memory portions associated with different data retrieval latencies, the method comprising:
retrieving the data portions such that a plurality of data portions are stored in the data buffers, and
wherein performing the logical operation on the data portions using the logic engine is carried out substantially continuously.
5. The method of claim 1, wherein providing the operand to the logic engine comprises writing a resistive memory array with resistance values.
6. A processing apparatus comprising:
a memory comprising at least one memory portion, the memory to hold at least one of a plurality of different data objects and a plurality of logical operators, wherein the logical operators are to operate on data portions of the data objects;
a logic engine to perform a logical operation on at least one data portion;
a plurality of input buffers and a plurality of output buffers associated with the logic engine; and
a data batching module to identify, among a plurality of different data objects, data portions of the plurality of different data objects which are to be operated on by a common logical operator, and to send the data portions of the plurality of different data objects to the input data buffers;
wherein the logic engine is further to use the common logical operator to perform a logical operation consecutively on each of the data portions in the input data buffers, and to provide an output from each operation to one of the plurality of output buffers.
7. The processing apparatus of claim 6, wherein the at least one memory portion and the logic engine are provided on dies of a 3D memory stack.
8. The processing apparatus of claim 7, wherein the logic engine is provided on a first die and the memory is provided on at least one other die, and wherein the dies are interconnected by through-silicon vias.
9. The processing apparatus of claim 7, wherein the memory comprises a plurality of memory portions, and at least one memory portion comprises a different memory size or type to at least one other memory portion.
10. The processing apparatus of claim 6, wherein at least a portion of the memory is an on-chip memory and the logic engine is an on-chip processing element.
11. The processing apparatus of claim 6, wherein the data batching module is to provide a plurality of data portions to the input data buffers, the number of data portions being less than or equal to a value of a cycle latency associated with the memory storing the operand.
12. The processing apparatus of claim 6, which is to carry out machine learning applications.
13. A method comprising:
extracting, from at least one memory, a plurality of vectors associated with a plurality of different data objects;
storing the plurality of vectors in different data buffers local to a logic engine that is integral to a 3D memory stack;
extracting a logical operator stored in the 3D memory stack, and providing the logical operator to the logic engine; and
performing, using the logic engine, a plurality of consecutive vector and matrix multiplications according to the logical operator, wherein the same logical operator is multiplied with each of the plurality of vectors associated with the plurality of different data objects.
14. The method of claim 13, further comprising providing a new logical operator to the logic engine.
15. The method of claim 13, wherein the vectors are image feature maps.
CN201680031683.4A 2016-03-31 2016-03-31 Logical operation Pending CN107615241A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2016/025143 WO2017171769A1 (en) 2016-03-31 2016-03-31 Logical operations

Publications (1)

Publication Number Publication Date
CN107615241A 2018-01-19

Family

ID=59966290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201680031683.4A Pending CN107615241A (en) 2016-03-31 2016-03-31 Logical operation

Country Status (4)

Country Link
US (1) US11126549B2 (en)
EP (1) EP3286638A4 (en)
CN (1) CN107615241A (en)
WO (1) WO2017171769A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221748A (en) * 2018-11-26 2020-06-02 通用汽车环球科技运作有限责任公司 Method and apparatus for memory access management for data processing
US11126549B2 (en) 2016-03-31 2021-09-21 Hewlett Packard Enterprise Development Lp Processing in-memory architectures for performing logical operations

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10290327B2 (en) * 2017-10-13 2019-05-14 Nantero, Inc. Devices and methods for accessing resistive change elements in resistive change element arrays
US10409889B2 (en) 2017-12-18 2019-09-10 Mythic, Inc. Systems and methods for mapping matrix calculations to a matrix multiply accelerator
US10496374B2 (en) 2018-03-22 2019-12-03 Hewlett Packard Enterprise Development Lp Crossbar array operations using ALU modified signals
KR102615443B1 (en) 2018-05-25 2023-12-20 에스케이하이닉스 주식회사 Machine learning apparatus and machine learning system using the same
US20200183837A1 (en) 2018-12-07 2020-06-11 Samsung Electronics Co., Ltd. Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning
US10534747B2 (en) * 2019-03-29 2020-01-14 Intel Corporation Technologies for providing a scalable architecture for performing compute operations in memory
US11769043B2 (en) 2019-10-25 2023-09-26 Samsung Electronics Co., Ltd. Batch size pipelined PIM accelerator for vision inference on multiple images
US11726784B2 (en) 2020-04-09 2023-08-15 Micron Technology, Inc. Patient monitoring using edge servers having deep learning accelerator and random access memory
US11461651B2 (en) * 2020-04-09 2022-10-04 Micron Technology, Inc. System on a chip with deep learning accelerator and random access memory
US11874897B2 (en) * 2020-04-09 2024-01-16 Micron Technology, Inc. Integrated circuit device with deep learning accelerator and random access memory
US11887647B2 (en) 2020-04-09 2024-01-30 Micron Technology, Inc. Deep learning accelerator and random access memory with separate memory access connections
US11355175B2 (en) 2020-04-09 2022-06-07 Micron Technology, Inc. Deep learning accelerator and random access memory with a camera interface
US11200948B1 (en) * 2020-08-27 2021-12-14 Hewlett Packard Enterprise Development Lp System for a flexible conductance crossbar

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1581061A (en) * 2003-12-05 2005-02-16 智权第一公司 Dynamic logic register
US20140172937A1 (en) * 2012-12-19 2014-06-19 United States Of America As Represented By The Secretary Of The Air Force Apparatus for performing matrix vector multiplication approximation using crossbar arrays of resistive memory devices
CN104011658A (en) * 2011-12-16 2014-08-27 英特尔公司 Instructions and logic to provide vector linear interpolation functionality

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6023759A (en) 1997-09-30 2000-02-08 Intel Corporation System for observing internal processor events utilizing a pipeline data path to pipeline internally generated signals representative of the event
US9684632B2 (en) * 2009-06-04 2017-06-20 Micron Technology, Inc. Parallel processing and internal processors
TW201347101A (en) 2011-12-01 2013-11-16 Mosaid Technologies Inc CPU with stacked memory
US20140040532A1 (en) 2012-08-06 2014-02-06 Advanced Micro Devices, Inc. Stacked memory device with helper processor
US9110778B2 (en) 2012-11-08 2015-08-18 International Business Machines Corporation Address generation in an active memory device
KR20150100042A (en) 2014-02-24 2015-09-02 한국전자통신연구원 An acceleration system in 3d die-stacked dram
US9466362B2 (en) * 2014-08-12 2016-10-11 Arizona Board Of Regents On Behalf Of Arizona State University Resistive cross-point architecture for robust data representation with arbitrary precision
WO2017171769A1 (en) 2016-03-31 2017-10-05 Hewlett Packard Enterprise Development Lp Logical operations

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1581061A (en) * 2003-12-05 2005-02-16 智权第一公司 Dynamic logic register
CN104011658A (en) * 2011-12-16 2014-08-27 英特尔公司 Instructions and logic to provide vector linear interpolation functionality
US20140172937A1 (en) * 2012-12-19 2014-06-19 United States Of America As Represented By The Secretary Of The Air Force Apparatus for performing matrix vector multiplication approximation using crossbar arrays of resistive memory devices

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AMIR MORAD et al.: "Efficient Dense And Sparse Matrix Multiplication On GP-SIMD", Power and Timing Modeling, Optimization and Simulation *
LIFAN XU et al.: "Scaling Deep Learning On Multiple In-Memory Processors", 3rd Workshop on Near-Data Processing in Conjunction with MICRO-48 *
TAREK M. TAHA et al.: "Exploring the Design Space of Specialized Multicore Neural Processors", Proceedings of International Joint Conference on Neural Networks *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11126549B2 (en) 2016-03-31 2021-09-21 Hewlett Packard Enterprise Development Lp Processing in-memory architectures for performing logical operations
CN111221748A (en) * 2018-11-26 2020-06-02 通用汽车环球科技运作有限责任公司 Method and apparatus for memory access management for data processing
CN111221748B (en) * 2018-11-26 2023-07-25 通用汽车环球科技运作有限责任公司 Method and apparatus for memory access management for data processing

Also Published As

Publication number Publication date
US20190042411A1 (en) 2019-02-07
EP3286638A4 (en) 2018-09-05
US11126549B2 (en) 2021-09-21
WO2017171769A1 (en) 2017-10-05
EP3286638A1 (en) 2018-02-28

Similar Documents

Publication Publication Date Title
CN107615241A (en) Logical operation
JP6857286B2 (en) Improved performance of neural network arrays
EP3265907B1 (en) Data processing using resistive memory arrays
US10691996B2 (en) Hardware accelerator for compressed LSTM
US11055063B2 (en) Systems and methods for deep learning processor
Ji et al. ReCom: An efficient resistive accelerator for compressed deep neural networks
TWI759361B (en) An architecture, method, computer-readable medium, and apparatus for sparse neural network acceleration
US9886377B2 (en) Pipelined convolutional operations for processing clusters
Venkataramanaiah et al. Automatic compiler based FPGA accelerator for CNN training
US9886418B2 (en) Matrix operands for linear algebra operations
CN110352434A (en) Utilize the Processing with Neural Network that model is fixed
KR20190019081A (en) Accelerator for deep layer neural network
CN107608715A (en) For performing the device and method of artificial neural network forward operation
US11663452B2 (en) Processor array for processing sparse binary neural networks
CN111048135A (en) CNN processing device based on memristor memory calculation and working method thereof
Zhou et al. Mat: Processing in-memory acceleration for long-sequence attention
EP4009240A1 (en) Method and apparatus for performing deep learning operations
Wang et al. Reboc: Accelerating block-circulant neural networks in reram
Das et al. NZESPA: A near-3D-memory zero skipping parallel accelerator for CNNs
US10929760B1 (en) Architecture for table-based mathematical operations for inference acceleration in machine learning
Chen et al. An efficient ReRAM-based inference accelerator for convolutional neural networks via activation reuse
US20240036818A1 (en) Computational memory for sorting multiple data streams in parallel
US20230195836A1 (en) One-dimensional computational unit for an integrated circuit
US11249724B1 (en) Processing-memory architectures performing atomic read-modify-write operations in deep learning systems
US20240094988A1 (en) Method and apparatus with multi-bit accumulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180119