CN107578098B - Neural network processor based on systolic array - Google Patents

Neural network processor based on systolic array

Info

Publication number
CN107578098B
CN107578098B (application CN201710777741.4A)
Authority
CN
China
Prior art keywords
data
array
neural network
processing unit
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710777741.4A
Other languages
Chinese (zh)
Other versions
CN107578098A (en)
Inventor
Han Yinhe (韩银和)
Xu Haobo (许浩博)
Wang Ying (王颖)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201710777741.4A
Publication of CN107578098A
Application granted
Publication of CN107578098B
Legal status: Active

Landscapes

  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The invention provides a neural network processor comprising a control unit, a calculation unit, a data storage unit and a weight storage unit. Under the control of the control unit, the calculation unit acquires data and weights from the data storage unit and the weight storage unit, respectively, to carry out neural-network-related operations. The calculation unit comprises an array controller and a plurality of processing units connected as a systolic array; the data and the weights are fed into the systolic array formed by the processing units from different directions, and each processing unit processes the data passing through it simultaneously and in parallel. The neural network processor can therefore achieve a high processing speed; at the same time, because input data are reused many times, a higher operation throughput can be achieved while consuming less memory access bandwidth.

Description

Neural network processor based on systolic array
Technical Field
The present invention relates to neural network technology, and more particularly, to neural network processor architectures.
Background
Deep learning has made major breakthroughs in recent years, and neural network models trained by deep learning algorithms have achieved remarkable results in application fields such as image recognition, speech processing and intelligent robotics. A deep neural network models the neural connection structure of the human brain, describing data features through multiple layered transformation stages when processing signals such as images, sounds and text. As the complexity of neural networks keeps increasing, however, neural network technology suffers in practical applications from problems such as high resource occupation, low operation speed and high energy consumption. Using hardware accelerators in place of traditional software computation has therefore become an effective way to improve the computational efficiency of neural networks; such accelerators include implementations based on general-purpose graphics processors, special-purpose processor chips and field-programmable gate arrays (FPGAs).
However, a neural network processor is both computation-intensive and memory-access-intensive. On the one hand, a neural network model contains a large number of multiply-add operations and other nonlinear operations, requiring the processor to sustain high-load operation to meet the computational demands of the model; on the other hand, the neural network computation involves a large number of parameter iterations, and the computing unit must perform a large number of memory accesses, which greatly raises the bandwidth requirements of the processor design and increases memory access power consumption.
Therefore, there is a need for an improvement to existing neural network processors to improve the computational efficiency of the neural network processors and reduce the hardware overhead.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a systolic array-based neural network processor.
The purpose of the invention is realized by the following technical scheme:
according to an embodiment of the present invention, there is provided a neural network processor including a control unit, a calculation unit, a data storage unit, and a weight storage unit, the calculation unit acquiring data and weights from the data storage unit and the weight storage unit, respectively, under the control of the control unit, to perform neural network related operations;
the computing unit comprises an array controller and a plurality of processing units connected in a pulsating array mode, the array controller loads weights and data into the processing unit array from different directions, and each processing unit operates the received data and weights and transmits the data and weights to the next processing unit along different directions.
In the above technical solution, the processing unit array may be a one-dimensional systolic array or a two-dimensional systolic array.
In the above technical solution, the processing unit may include a data register, a weight register, a multiplier, and an accumulator;
wherein the weight register receives a weight from the preceding processing unit in the column direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in the column direction;
the data register receives data from one processing unit in the row direction of the processing unit array, sends the data to the multiplier and transmits the data to the next processing unit in the row direction;
the multiplier multiplies the input data and the weight; the output of the multiplier is fed into the accumulator, where it is either accumulated with the value already in the accumulator or added to the partial sum input signal, and the result is then provided as the partial sum output.
In the above technical solution, the array controller may load data from a row direction of the processing unit array, and load a weight from a column direction of the processing unit array.
In the above technical solution, the control unit may load the data sequence participating in the operation from the storage unit in the form of a row vector, and load the weight sequence corresponding to the data sequence in the form of a column vector.
In the above technical solution, the array controller may load the data sequence and the weight sequence into the corresponding rows and columns of the processing unit array in order of increasing row number and column number, respectively, with adjacent rows and adjacent columns entering the array 1 clock cycle apart, thereby ensuring that each weight and the data element it is to be multiplied with enter the processing unit array in the same clock cycle.
Compared with the prior art, the invention has the advantages that:
A systolic array structure is adopted in the computing unit of the neural network processor, which improves the operation efficiency of the neural network processor and relieves the bandwidth requirements of the processor design.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 shows a general topology of a neural network;
FIG. 2 shows a schematic block diagram of a neural network convolution operation;
FIG. 3 shows a schematic block diagram of a neural network processor architecture, according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computational unit of a neural network processor, according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a computing unit of a neural network processor, according to yet another embodiment of the present invention;
FIG. 6 shows a schematic diagram of a processing unit in a systolic array architecture according to one embodiment of the present invention;
FIG. 7 shows a schematic diagram of a computing process of a computing unit according to an embodiment of the invention;
FIG. 8 is a diagram illustrating a neural network processor according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A neural network is a mathematical model inspired by the structure and behavior of the human brain, and is generally divided into an input layer, one or more hidden layers and an output layer. Each layer consists of a number of neuron nodes; the output values of the neuron nodes of one layer are passed as inputs to the neuron nodes of the next layer, connecting the layers one by one. Neural networks have biomimetic characteristics: their process of multi-layer abstract iteration resembles the information-processing mode of the human brain and other perceptual organs.
Fig. 1 shows a common topology of a neural network. The first-layer input of the multilayer neural network structure is the original image (in the present invention, "original image" refers to the original data to be processed, not merely a photograph in the narrow sense). Typically, for each layer of the neural network, the node values of the next layer are obtained by calculation from the neuron node values of the current layer (also referred to herein as data) and the corresponding weights. For example, suppose x1, x2, ..., xn are neuron nodes of one layer that are all connected to a node y of the next layer, and w1, w2, ..., wn are the weights of the corresponding connections; the value of y is then defined as y = x1*w1 + x2*w2 + ... + xn*wn, i.e., y = x · w. Thus, each layer of the neural network involves a large number of multiply-add-based convolution operations. The convolution operation in a neural network generally proceeds as shown in fig. 2: a two-dimensional weight convolution kernel of size K x K scans the feature map; during the scan, the inner product of the kernel weights and the corresponding feature elements of the feature map is computed at each position, and the inner-product values are summed to obtain one feature element of the output layer. When a convolutional layer has N feature layers, N convolution kernels of size K x K are convolved with the feature maps in that layer, and the N inner-product values are summed to obtain one output-layer feature element. As the complexity of neural networks keeps increasing, such calculations will undoubtedly consume a large amount of resources; neural network computations are therefore typically implemented with a dedicated neural network processor.
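To make the multiply-add pattern concrete, the following minimal Python sketch (added here for illustration; the array shapes and the function name are assumptions, not part of the patent text) computes one output feature element by scanning K x K kernels over N feature layers and summing the inner products:

    import numpy as np

    def conv_output_element(feature_maps, kernels, row, col):
        """Compute one output-layer feature element at position (row, col).

        feature_maps: shape (N, H, W) -- N input feature layers
        kernels:      shape (N, K, K) -- one K x K weight kernel per layer
        """
        n, k, _ = kernels.shape
        total = 0.0
        for ch in range(n):  # sum the N inner-product values over the feature layers
            patch = feature_maps[ch, row:row + k, col:col + k]
            total += np.sum(patch * kernels[ch])  # inner product of kernel and patch
        return total

    # Example: 3 feature layers, 3 x 3 kernels, one output element at (0, 0)
    fm = np.random.rand(3, 8, 8)
    ker = np.random.rand(3, 3, 3)
    print(conv_output_element(fm, ker, 0, 0))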
Common neural network processors are based on a memory-control-computation architecture. The storage structure is used for storing data participating in calculation, neural network weight, operation instructions of the processor and the like; the control structure is used for analyzing the operation instruction and generating a control signal to control the scheduling and storage of data in the processor and the calculation process of the neural network; the computation fabric is responsible for neural network computation operations. The storage unit can store data (for example, raw feature map data) transmitted from the outside of the neural network processor, trained neural network weights, processing results or intermediate results generated in the calculation process, instruction information involved in the calculation, and the like.
Fig. 3 shows a schematic structural diagram of a neural network processor 300 according to an embodiment of the present invention. As shown in fig. 3, the storage unit is further subdivided into an input data storage unit 311, a weight storage unit 312, an instruction storage unit 313 and an output data storage unit 314, wherein the input data storage unit 311 is used for storing data participating in calculation, for example, data including original feature map data and data participating in intermediate layer calculation; the weight storage unit 312 is used for storing the trained neural network weights; the instruction storage unit 313 is used for storing instruction information participating in calculation, and the instruction can be analyzed into a control flow by the control unit 320 to schedule calculation of the neural network; the output data storage unit 314 is used for storing the calculated neuron response values. By subdividing the storage units, data with substantially consistent data types can be centrally stored, so that a suitable storage medium can be selected, and operations such as data addressing can be simplified. It should be understood that the input data storage unit 311 and the output data storage unit 314 may also be the same storage unit.
The control unit 320 is responsible for instruction decoding, data scheduling, process control, and the like. For example, the instructions stored in the instruction storage unit are acquired and analyzed, and then the data are scheduled according to the control signals obtained by analysis and the calculation unit is controlled to perform the related operation of the neural network. In an embodiment of the present invention, the layer data participating in the neural network operation is divided into different regions, each region being used as a matrix, so that the operation between the data and the weights is divided into a plurality of matrix operations (for example, as shown in fig. 2). In this way, the control unit loads the weight sequence and the data sequence participating in the operation from the memory unit in the form of a row vector or a column vector suitable for the matrix operation.
One or more computing units (e.g., computing units 330, 331, etc.) may be included in the neural network processor, and each computing unit may perform a corresponding neural network computation according to a control signal from the control unit 320, acquire data from each storage unit, perform the computation, and write the computation result to the storage unit. The respective calculation units may have the same configuration or different configurations, and may perform the same calculation or different calculations. In one embodiment of the present invention, a computing unit is provided that includes an array controller and a plurality of processing units organized in a systolic array, each processing unit having the same internal structure. The array controller is responsible for loading data into the systolic array, each processing unit is responsible for data calculation, the weight is input from the top of the systolic array and is propagated from top to bottom, the data is input from the left side of the systolic array and is propagated from left to right, each processing unit calculates the received data and the weight, and the result is output from the right side of the systolic array. The systolic array may be a one-dimensional or two-dimensional structure. However, it should be understood that the neural network processor may also include other computing units, and the control unit may select different computing units to process data according to actual requirements.
Fig. 4 shows a schematic structural diagram of a computing unit in a neural network processor according to an embodiment of the present invention. As shown in fig. 4, the systolic array has a one-dimensional structure in which the processing units are connected in series. For a corresponding weight sequence and data sequence to be operated on, the array controller loads each weight of the weight sequence into a different processing unit and keeps it there until the last element of the corresponding data sequence has completed its calculation with that weight, after which the next group of weights is loaded. Meanwhile, each datum of the data sequence is loaded into the systolic array from the left side in turn, and the processed data is transferred from the other side of the systolic array back to the array controller. In such a computing unit, the first datum first enters the first processing unit, is processed, and is then passed to the next processing unit while the second datum enters the first processing unit. By the time the first datum arrives at the last processing unit, it has been processed multiple times. The systolic architecture therefore reuses each input datum many times, so that a higher operation throughput can be achieved with smaller memory access bandwidth consumption.
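The reuse argument can be made concrete with a small functional sketch (written for this description and abstracting away pipeline timing; it assumes each processing unit simply multiply-accumulates every datum that flows through it): each datum is fetched from memory once, yet takes part in one multiply-accumulate per processing unit.

    def systolic_1d_pass(weights, data):
        """Functional view of the 1-D pipeline: PE i holds weights[i];
        every datum flows through all PEs and is multiply-accumulated
        once per PE it visits."""
        memory_reads = 0
        macs = 0
        outputs = []
        for x in data:                # one memory fetch per datum
            memory_reads += 1
            acc = 0
            for w in weights:         # the datum visits each PE in turn
                acc += w * x          # one multiply-accumulate per PE visit
                macs += 1
            outputs.append(acc)       # the processed datum exits the far side
        return outputs, memory_reads, macs

    _, reads, macs = systolic_1d_pass([1, 2, 3, 4], list(range(8)))
    print(reads, macs)  # 8 reads drive 32 MACs: 4-fold reuse of every datum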
Fig. 5 shows a schematic structural diagram of a computing unit in a neural network processor according to yet another embodiment of the present invention. In this embodiment, the processing units are organized as a two-dimensional array with rows and columns, and each processing unit is connected only to its adjacent processing units, i.e., processing units communicate only with their neighbors. The array controller is responsible for scheduling the data: it controls the relevant data to be input into the processing units from the upper side and the left side of the systolic array of the computing unit, with different data entering from different directions. For example, the array controller controls the weights to be input from above the processing unit array and propagated from top to bottom along the column direction, while the data are input from the left side of the processing unit array and propagated from left to right along the row direction. The present invention does not restrict the input directions or the systolic propagation directions of the various computational elements; the terms "left", "right", "up", "down" and the like used herein refer only to the directions as illustrated in the figures and should not be construed as limiting the physical implementation of the present invention.
As noted above, in embodiments of the present invention the processing units within a computing unit are homogeneous and perform the same operations. Fig. 6 shows a schematic structural diagram of a processing unit according to an embodiment of the present invention. As shown in fig. 6, the input signals of the processing unit are data, weight and partial sum, and the output signals are data output, weight output and partial sum output. Internally, the processing unit mainly comprises a data register, a weight register, a multiplier and an accumulator. The weight input signal is connected to the weight register and the multiplier, the data input signal is connected to the data register and the multiplier, and the partial sum input signal is connected to the accumulator. The weight register can send its value to the multiplier for processing, or pass it directly to the processing unit below; the data register can likewise send its value to the multiplier for processing, or pass it directly to the next unit on the right. The input data and the weight are multiplied in the multiplier; the output of the multiplier is fed into the accumulator, where it is either accumulated with the value already in the accumulator or added to the partial sum input signal, and the result is then provided as the partial sum output. These operations and transfers can be configured flexibly in response to control signals from the array controller. For example, each processing unit may perform the following operations (a minimal behavioral sketch in code follows the list):
1) receive a datum from the preceding node in the row direction and a weight from the preceding node in the column direction of the systolic flow;
2) compute the product of the two values and accumulate it with the previously registered result;
3) save the accumulated value, output the input received from the row direction to the next node in the row, and output the input received from the column direction to the next node in the column.
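The following Python sketch mirrors these three operations (a behavioral model written for illustration, not the actual circuit; register and method names are assumptions):

    class ProcessingElement:
        """Behavioral model of one PE: data register, weight register,
        multiplier and accumulator, as in fig. 6."""

        def __init__(self):
            self.data_reg = 0    # datum received from the row neighbour
            self.weight_reg = 0  # weight received from the column neighbour
            self.acc = 0         # registered accumulation result

        def step(self, data_in, weight_in, psum_in=None):
            """One clock cycle of the PE."""
            # 1) receive a datum from the row direction and a weight
            #    from the column direction
            self.data_reg = data_in
            self.weight_reg = weight_in
            # 2) multiply the two values; accumulate with the registered
            #    result, or add the incoming partial sum if one is supplied
            prod = self.data_reg * self.weight_reg
            self.acc = prod + (self.acc if psum_in is None else psum_in)
            # 3) keep the accumulated value and forward the registered
            #    inputs to the next row / column neighbours
            return self.data_reg, self.weight_reg, self.acc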
In addition, for processing units organized as a one-dimensional array, the weights do not need to be propagated downward. After the array controller has loaded each element of the weight sequence to be processed into the weight register of its processing unit, the weight register therefore produces no output; the weight is instead held in the register for a period of time, and once the calculation tasks associated with these weights are finished, the weight registers are cleared and the weights to be processed next are loaded.
The calculation process of a calculation unit using a two-dimensional array structure according to an embodiment of the present invention is described below with reference to fig. 7, taking as an example the multiplication of a 3 x 3 data matrix by a 3 x 3 weight matrix:
Data matrix

    A = [ 3 4 2
          2 5 3
          3 2 5 ]

Weight matrix

    B = [ 3 4 2
          2 5 3
          3 2 5 ]

(element values as traced cycle by cycle in fig. 7 below)
The array controller controls the data and the weights to be input into the processing units from the left and from above the processing unit array, respectively. For example, the row vectors of matrix A enter the corresponding rows of the processing unit array in order of increasing row number, with adjacent row vectors entering the array 1 clock cycle apart; that is, the element in row i, column k of matrix A enters the processing unit array at the same time as the element in row i-1, column k+1. Likewise, the column vectors of matrix B enter the corresponding columns of the processing unit array in order of increasing column number, with adjacent column vectors entering the array 1 clock cycle apart; that is, the element in row k, column j of matrix B enters the processing unit array at the same time as the element in row k-1, column j+1. The data matrix A thus enters the systolic array by rows while, in parallel in time, the weight matrix B enters the processing unit array by columns, so that the corresponding elements A(i,k) and B(k,j) of the two matrices enter the processing unit array in the same clock cycle, until all elements of matrix A and matrix B have traversed the entire rows and columns of the processing unit array. This time alignment is guaranteed by the input control of the array controller, which determines when each datum arrives at each processing unit. In this way the array controller directs data and weights into the systolic array of processing units from different directions, with the weights flowing from top to bottom and the data flowing from left to right. As the data flows, all processing units process the data flowing through them simultaneously and in parallel, so a high processing speed can be achieved. At the same time, once a datum has flowed through the processing unit array according to the predetermined dataflow pattern, all processing corresponding to that datum is finished and it need not be input again, which reduces memory access operations.
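One schedule consistent with this description (inferred from the staggering rule above and from the cycle-by-cycle trace that follows; the closed-form expression itself is not stated in the patent) is that A(i,k) and B(k,j) meet in PE(i,j) at cycle i + j + k - 2, counting from 1:

    def arrival_cycle(i, j, k):
        """Cycle (1-indexed) at which A[i][k] meets B[k][j] in PE(i, j),
        assuming row i of A and column j of B enter the array delayed by
        i - 1 and j - 1 cycles respectively."""
        return i + j + k - 2

    print(arrival_cycle(1, 1, 1))  # 1: A[1][1] meets B[1][1] in PE11 first
    print(arrival_cycle(3, 3, 3))  # 7: A[3][3] meets B[3][3] in PE33 last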
As shown in fig. 7, in the first cycle, data 3 and weight 3 enter processing unit PE11 and are multiplied inside that processing unit;
in the second cycle, the data 3 that entered PE11 from the left flows right into processing unit PE12, where weight 4 simultaneously enters from above, and the weight 3 that entered PE11 from above flows down into processing unit PE21, where data 2 simultaneously enters from the left;
in the third cycle, weight 3 flows into PE11 from above and data 2 flows into PE11 from the left; data 5 and weight 2 flow into PE21; data 4 and weight 5 flow into PE12; data 3 and weight 2 flow into PE13; data 2 and weight 4 flow into PE22; and data 3 and weight 3 flow into PE31.
In the fourth cycle, data 2 and weight 2 flow into PE12; data 4 and weight 3 flow into PE13; data 3 and weight 3 flow into PE21; data 5 and weight 5 flow into PE22; data 2 and weight 2 flow into PE23; data 2 and weight 2 flow into PE31; and data 3 and weight 4 flow into PE32.
In the fifth cycle, data 2 and weight 5 flow into PE13; data 3 and weight 2 flow into PE22; data 5 and weight 3 flow into PE23; data 5 and weight 3 flow into PE31; data 2 and weight 5 flow into PE32; and data 3 and weight 2 flow into PE33.
In the sixth cycle, data 3 and weight 5 flow into PE23; data 5 and weight 2 flow into PE32; and data 2 and weight 3 flow into PE33, while the value 5 is passed on toward PE33.
In the seventh cycle, data 5 and weight 5 flow into PE33.
The multiplication results are accumulated in the column direction; that is, the multiplication result of PE11 is transmitted to PE21 for accumulation, and the accumulated result is then transmitted to PE31 for further accumulation.
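The whole example can be reproduced with a short simulation (a behavioral sketch written for this description, with each PE accumulating its own result locally as described for fig. 6; the dataflow skew follows the schedule given above, and the result is checked against an ordinary matrix product):

    import numpy as np

    def systolic_matmul(A, B):
        """Simulate the skewed dataflow: at (0-indexed) cycle t, PE(i, j)
        receives A[i][k] and B[k][j] with k = t - i - j, multiplies them
        and accumulates the product locally."""
        n = A.shape[0]
        C = np.zeros_like(A)
        for t in range(3 * n - 2):           # 7 cycles for two 3 x 3 matrices
            for i in range(n):
                for j in range(n):
                    k = t - i - j            # which operand pair meets here now
                    if 0 <= k < n:
                        C[i, j] += A[i, k] * B[k, j]
        return C

    A = np.array([[3, 4, 2], [2, 5, 3], [3, 2, 5]])
    B = np.array([[3, 4, 2], [2, 5, 3], [3, 2, 5]])
    assert np.array_equal(systolic_matmul(A, B), A @ B)  # matches fig. 7's example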
Fig. 8 is a schematic diagram showing an execution flow of a neural network processor using the above computing unit according to an example of the present invention. In step S1, the control unit addresses the instruction storage unit and reads and parses the instruction to be executed next; in step S2, input data are acquired from the storage unit according to the storage address obtained by parsing the instruction; in step S3, data and weights are loaded from the input storage unit and the weight storage unit, respectively, into the calculation unit described above; in step S4, the calculation unit executes the operations of the neural network computation; in step S5, the neural network calculation result is stored in the output storage unit.
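Steps S1-S5 can be summarized in a small driver sketch (all names here are illustrative assumptions; the storage units are modeled as dictionaries and the instruction format is hypothetical):

    from dataclasses import dataclass

    @dataclass
    class Instruction:            # hypothetical decoded-instruction format
        data_addr: int            # where the input data lives
        weight_addr: int          # where the weights live
        out_addr: int             # where the result is written

    def run_layer(instr, input_store, weight_store, output_store, compute):
        """Mirror of the flow of fig. 8; step S1 (instruction fetch and
        decode) is assumed to have produced `instr` already."""
        data = input_store[instr.data_addr]          # S2: fetch input data
        weights = weight_store[instr.weight_addr]    # S3: load data and weights
        result = compute(data, weights)              # S4: perform the operation
        output_store[instr.out_addr] = result        # S5: store the response values
        return result

    # Usable with the systolic_matmul sketch above:
    # out = {}; run_layer(Instruction(0, 0, 0), {0: A}, {0: B}, out, systolic_matmul)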
Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims (4)

1. A neural network processor comprises a control unit, a calculation unit, a data storage unit and a weight storage unit, wherein the calculation unit respectively acquires data and weights from the data storage unit and the weight storage unit under the control of the control unit to perform neural network related operation;
the computing unit comprises an array controller and a plurality of processing units connected as a systolic array, the array controller loads weights and data into the processing unit array from different directions, and each processing unit operates on the received data and weights and passes them on to the next processing unit along their respective directions;
wherein the array of processing units is a two-dimensional systolic array;
wherein the processing unit comprises a data register, a weight register, a multiplier and an accumulator;
wherein the weight register receives a weight from the preceding processing unit in the column direction of the processing unit array, sends it to the multiplier and passes it on to the next processing unit in the column direction;
the data register receives data from the preceding processing unit in the row direction of the processing unit array, sends the data to the multiplier and passes the data on to the next processing unit in the row direction;
the multiplier multiplies the input data and the weight; the output of the multiplier is fed into the accumulator to be added to the partial sum input signal, and the result is then provided as the partial sum output.
2. The neural network processor of claim 1, wherein the array controller loads data from a row direction of the array of processing units and loads weights from a column direction of the array of processing units.
3. The neural network processor of claim 1, wherein the control unit loads the data sequence participating in the operation from the storage unit in a row vector form, and loads the weight sequence corresponding to the data sequence in a column vector form.
4. The neural network processor of claim 3, wherein the array controller loads the data sequence and the weight sequence into the corresponding rows and columns of the processing unit array in order of increasing row number and column number, respectively, with adjacent rows and adjacent columns entering the array 1 clock cycle apart, ensuring that each weight and the corresponding data to be calculated enter the processing unit array in the same clock cycle.
CN201710777741.4A 2017-09-01 2017-09-01 Neural network processor based on systolic array Active CN107578098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710777741.4A CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710777741.4A CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Publications (2)

Publication Number Publication Date
CN107578098A CN107578098A (en) 2018-01-12
CN107578098B (en) 2020-10-30

Family

ID=61030459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710777741.4A Active CN107578098B (en) 2017-09-01 2017-09-01 Neural network processor based on systolic array

Country Status (1)

Country Link
CN (1) CN107578098B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10459876B2 (en) * 2018-01-31 2019-10-29 Amazon Technologies, Inc. Performing concurrent operations in a processing element
CN108628799B (en) * 2018-04-17 2021-09-14 上海交通大学 Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
US11501140B2 (en) * 2018-06-19 2022-11-15 International Business Machines Corporation Runtime reconfigurable neural network processor core
WO2020034079A1 (en) * 2018-08-14 2020-02-20 深圳市大疆创新科技有限公司 Systolic array-based neural network processing device
CN109902795B (en) * 2019-02-01 2023-05-23 京微齐力(北京)科技有限公司 Artificial intelligent module and system chip with processing unit provided with input multiplexer
CN109919321A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Unit has the artificial intelligence module and System on Chip/SoC of local accumulation function
CN109902836A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 The failure tolerant method and System on Chip/SoC of artificial intelligence module
CN109902063B (en) * 2019-02-01 2023-08-22 京微齐力(北京)科技有限公司 System chip integrated with two-dimensional convolution array
CN109902064A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 A kind of chip circuit of two dimension systolic arrays
CN109885512B (en) * 2019-02-01 2021-01-12 京微齐力(北京)科技有限公司 System chip integrating FPGA and artificial intelligence module and design method
CN109933371A (en) * 2019-02-01 2019-06-25 京微齐力(北京)科技有限公司 Its unit may have access to the artificial intelligence module and System on Chip/SoC of local storage
CN109902835A (en) * 2019-02-01 2019-06-18 京微齐力(北京)科技有限公司 Processing unit is provided with the artificial intelligence module and System on Chip/SoC of general-purpose algorithm unit
CN109919323A (en) * 2019-02-01 2019-06-21 京微齐力(北京)科技有限公司 Edge cells have the artificial intelligence module and System on Chip/SoC of local accumulation function
CN110348564B (en) * 2019-06-11 2021-07-09 中国人民解放军国防科技大学 SCNN reasoning acceleration device based on systolic array, processor and computer equipment
CN110211618B (en) * 2019-06-12 2021-08-24 中国科学院计算技术研究所 Processing device and method for block chain
CN110543934B (en) * 2019-08-14 2022-02-01 北京航空航天大学 Pulse array computing structure and method for convolutional neural network
CN110851779B (en) * 2019-10-16 2021-09-14 北京航空航天大学 Systolic array architecture for sparse matrix operations
CN110705703B (en) * 2019-10-16 2022-05-27 北京航空航天大学 Sparse neural network processor based on systolic array
KR20210060024A (en) * 2019-11-18 2021-05-26 에스케이하이닉스 주식회사 Memory device including neural network processing circuit
US20210150311A1 (en) * 2019-11-19 2021-05-20 Alibaba Group Holding Limited Data layout conscious processing in memory architecture for executing neural network model
CN111368988B (en) * 2020-02-28 2022-12-20 北京航空航天大学 Deep learning training hardware accelerator utilizing sparsity
FR3115136A1 (en) 2020-10-12 2022-04-15 Thales METHOD AND DEVICE FOR PROCESSING DATA TO BE PROVIDED AS INPUT OF A FIRST SHIFT REGISTER OF A SYSTOLIC NEURONAL ELECTRONIC CIRCUIT
CN112632464B (en) * 2020-12-28 2022-11-29 上海壁仞智能科技有限公司 Processing device for processing data
CN112862067B (en) * 2021-01-14 2022-04-12 支付宝(杭州)信息技术有限公司 Method and device for processing business by utilizing business model based on privacy protection
CN112836813B (en) * 2021-02-09 2023-06-16 南方科技大学 Reconfigurable pulse array system for mixed-precision neural network calculation
CN113393376A (en) * 2021-05-08 2021-09-14 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113870273B (en) * 2021-12-02 2022-03-25 之江实验室 Neural network accelerator characteristic graph segmentation method based on pulse array
CN113869507B (en) * 2021-12-02 2022-04-15 之江实验室 Neural network accelerator convolution calculation device and method based on pulse array
CN114675806B (en) * 2022-05-30 2022-09-23 中科南京智能技术研究院 Pulsation matrix unit and pulsation matrix calculation device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805303B2 (en) * 2015-05-21 2017-10-31 Google Inc. Rotating data for neural network computations
CN106529670B (en) * 2016-10-27 2019-01-25 中国科学院计算技术研究所 It is a kind of based on weight compression neural network processor, design method, chip
CN106650924B (en) * 2016-10-27 2019-05-14 中国科学院计算技术研究所 A kind of processor based on time dimension and space dimension data stream compression, design method
CN107085562B (en) * 2017-03-23 2020-11-03 中国科学院计算技术研究所 Neural network processor based on efficient multiplexing data stream and design method
CN107016175B (en) * 2017-03-23 2018-08-31 中国科学院计算技术研究所 It is applicable in the Automation Design method, apparatus and optimization method of neural network processor

Also Published As

Publication number Publication date
CN107578098A (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN107578098B (en) Neural network processor based on systolic array
US11734006B2 (en) Deep vision processor
TWI639119B (en) Adaptive execution engine for convolution computing systems cross-reference to related applications
CN107169560B (en) Self-adaptive reconfigurable deep convolutional neural network computing method and device
CN107578095B (en) Neural computing device and processor comprising the computing device
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
US11194549B2 (en) Matrix multiplication system, apparatus and method
KR101788829B1 (en) Convolutional neural network computing apparatus
CN111897579A (en) Image data processing method, image data processing device, computer equipment and storage medium
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN112084038B (en) Memory allocation method and device of neural network
US11321607B2 (en) Machine learning network implemented by statically scheduled instructions, with compiler
CN107085562B (en) Neural network processor based on efficient multiplexing data stream and design method
CN108628799B (en) Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal
Meng et al. Accelerating proximal policy optimization on cpu-fpga heterogeneous platforms
KR102610842B1 (en) Processing element and operating method thereof in neural network
CN108304925B (en) Pooling computing device and method
CN110580519B (en) Convolution operation device and method thereof
CN112084037A (en) Memory allocation method and device of neural network
Chen et al. Tight compression: compressing CNN model tightly through unstructured pruning and simulated annealing based permutation
Meng et al. Ppoaccel: A high-throughput acceleration framework for proximal policy optimization
Clere et al. FPGA based reconfigurable coprocessor for deep convolutional neural network training
Tu et al. Neural approximating architecture targeting multiple application domains
CN220773595U (en) Reconfigurable processing circuit and processing core
Duranton et al. A general purpose digital architecture for neural network simulations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant