CN110321064A - Computing platform realization method and system for neural network - Google Patents
Computing platform realization method and system for neural network
- Publication number
- CN110321064A CN110321064A CN201810297916.6A CN201810297916A CN110321064A CN 110321064 A CN110321064 A CN 110321064A CN 201810297916 A CN201810297916 A CN 201810297916A CN 110321064 A CN110321064 A CN 110321064A
- Authority
- CN
- China
- Prior art keywords
- data
- external memory
- feature map
- read
- on-chip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0613—Improving I/O performance in relation to throughput
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0673—Single storage device
- G06F3/068—Hybrid storage device
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a computing platform implementation method for a neural network. The computing platform reads the required data from an external memory and caches both the read data and the intermediate results of each operation in an on-chip cache. The method includes: reading from the external memory a first part of the data of the feature map required by a first operation; executing, on that first part, the first operation and at least one other operation; storing the result for the first part back to the external memory; then reading from the external memory a second part of the feature map required by the first operation, and performing on the second part the same sequence of reading from the external memory, the first operation, the at least one other operation, and storing back to the external memory. With this scheme, the number of data transfers between external storage and the on-chip cache can be greatly reduced, thereby increasing overall processing speed.
Description
Technical field
The present invention relates to the field of deep learning, and in particular to a computing platform implementation method and related system for neural networks.
Background art
Neural networks have become a research hotspot in the field of image recognition in recent years. A trained neural network model can be applied in many areas such as image classification, object recognition, and saliency detection. Neural network models have grown in computational scale and complexity in recent years, and traditional CPU platforms can no longer satisfy their practical demands. Accelerator design on heterogeneous computing platforms such as FPGAs and GPUs has therefore become a new research hotspot. Among these, FPGAs can achieve higher energy efficiency than GPU platforms owing to their low power consumption, while their capacity for fast iteration and hardware reconfiguration better matches the demands of rapidly evolving algorithms.
When a highly parallel computing platform such as an FPGA or GPU performs neural network inference, the execution time of the computation itself is very short compared with the time cost of reading the parameters an operation requires, so memory reads become the bottleneck that limits processing speed.
Therefore, there is still a need for a scheme that can improve the overall processing efficiency of convolutional neural network computing platforms.
Summary of the invention
To solve at least one of the above problems, the invention proposes a computing platform implementation scheme for neural networks. Without changing the underlying results of the computation graph, it reduces data transfers between memory and the chip as much as possible through layer-based decomposition and fusion of the computation graph, uses hardware resources more effectively, and improves the processing efficiency of neural network computing platforms, especially convolutional neural network computing platforms.
According to one aspect of the invention, a computing platform implementation method for a neural network is provided. The computing platform reads the required data from an external memory and caches, in an on-chip cache, both the acquired data and the intermediate results of on-chip operations. The method includes: reading from the external memory a first part of the data of the feature map required by a first operation; executing, on the first part, the first operation and at least one other operation; storing the result for the first part back to the external memory; then reading from the external memory a second part of the feature map required by the first operation, and performing on the second part the same reading from the external memory, the first operation and the at least one other operation, and storing back to the external memory.
By reusing data in this way, unnecessary memory reads are reduced and overall processing efficiency is improved.
The feature map includes the first part of data and at least one second part of data, and the method includes: for each second part, performing the read from the external memory, the first operation, the at least one other operation, and the store back to the external memory described above, where the combined results for the first part and each second part are equivalent to the result of performing the first operation and the at least one other operation sequentially on the whole feature map.
Preferably, the size of the partial data read from the external memory each time is determined at least in part by the capacity of the on-chip cache.
The at least one other operation can be at least one of the following: at least one subsequent operation of the first operation, where the subsequent operation needs at least the complete result of the first operation to finish all of its own operations; and at least one lateral operation that shares the same feature-map input as the first operation.
In one embodiment, the first operation and the at least one other operation comprise successive convolution (CONV), nonlinearity (ReLU), and pooling (POOL) operations.
The computing platform implementation method of the invention can also include: reading from the external memory partial data from another feature map required by a first subsequent operation, and executing the first subsequent operation on the partial data from the other feature map together with the partial data from the feature map. This storage optimization suits neural networks with branches. Preferably, the first subsequent operation can be an ELTWISE operation.
The computing platform implementation method of the invention can also include merging a second subsequent operation into the first operation. Preferably, the second subsequent operation can be batch normalization (BatchNorm) and scaling (Scale) operations, and the first operation can be a CONV operation.
The computing platform implementation method of the invention can also include storing the operation result back to the external memory in a transformed dimension arrangement, so that operations in the neural network that only change data dimensions or arrangement can be omitted. In this way, concatenation (CONCAT) and flattening (FLATTEN) operations can be omitted outright by prescribing the storage layout.
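One way to realize this layout trick for CONCAT is to have each branch write its output directly at a precomputed channel offset in the destination buffer, so the concatenated layout exists in memory without a separate copy pass. A sketch under that assumption (the offset policy and function name are mine, not from the patent):

```python
import numpy as np

def write_concat_by_layout(outputs, ddr):
    """Omit an explicit CONCAT: each branch writes its result at its own
    channel offset in the destination buffer `ddr`, producing the
    channel-concatenated layout with no extra copy operation."""
    offset = 0
    for out in outputs:               # each out has shape (channels, H, W)
        c = out.shape[0]
        ddr[offset:offset + c] = out  # land directly in the final position
        offset += c
    return ddr
```

In hardware this corresponds to giving each branch a different base address when its result is stored back, rather than invoking a CONCAT node at all.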
The computing platform implementation method of the invention can also include layer splitting, for example decomposing the first operation and/or at least one of the other operations, where each decomposed first operation and/or other operation acts on a part of its original input. The decomposed first operation and/or at least one other operation can be split CONV operations, with each split CONV operation taking part of the original feature map as input. The decomposed first operation and/or at least one other operation can also be split POOL operations, with each split POOL operation taking the complete feature-map output of a preceding operation as input.
According to another aspect of the invention, a neural network implementation system is provided, comprising: an off-chip storage device for storing the parameters and feature maps needed for neural network computation; and a computation execution device that includes an on-chip cache. The computation execution device divides the feature map into parts based on the capacity of the on-chip cache and, for each part in turn, performs the following: reading the specific part of the required feature map from the external memory, executing at least two operations on that part, and storing the result back to the off-chip storage device. Preferably, the computation execution device is a highly parallel computing device based on an FPGA, GPU, or ASIC.
According to a further aspect of the invention, a computing platform for a neural network is provided, comprising: a data processing module for performing predetermined computations on input data and producing output data; a data storage module for caching the input data required by the data processing module and the intermediate data it outputs; and a control module for controlling the data processing module and the data storage module so as to perform any of the methods described above. The data processing module can be a convolution computation module for performing convolution on the input data.
According to another aspect of the invention, a non-transitory machine-readable storage medium is provided, on which executable code is stored; when the executable code is executed by a processor of an electronic device, the processor performs any of the methods described above.
By adopting the neural network computing implementation of the invention, the number of data transfers between external storage and the on-chip cache can be greatly reduced, removing the bottleneck that bandwidth imposes on neural network computation. The implementation of the invention can avoid unnecessary bandwidth usage by fusing, decomposing, and restructuring computational operations so as to heavily reuse shared inputs and/or intermediate data; it can omit time-consuming data rearrangement operations by arranging storage and reads sensibly; and it can merge certain subsequent operations into preceding ones. It thereby optimizes, from a process perspective, each type of operation and data access involved in neural network computation, improving overall computational efficiency.
Brief description of the drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of its exemplary embodiments in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 shows the series of ordered layers that make up a typical CNN.
Figs. 2A-2C show representative computation graph structures of existing CNN networks.
Fig. 3 shows a flowchart of a computing platform implementation method for a neural network according to an embodiment of the invention.
Fig. 4 shows the network computation graph of the Vgg basic structure, both existing and as optimized by the invention.
Fig. 5 shows the network computation graph of the ResNet basic structure, both existing and as optimized by the invention.
Fig. 6 shows an example of vertical fusion of convolutions according to an embodiment of the invention.
Fig. 7 shows an example of horizontal fusion of convolutions according to an embodiment of the invention.
Fig. 8 shows the omission of a FLATTEN (flattening) operation according to an embodiment of the invention.
Fig. 9 shows the omission of a CONCAT (concatenation) operation according to an embodiment of the invention.
Fig. 10 shows the fusion of BatchNorm (batch normalization) and Scale (scaling) operations according to an embodiment of the invention.
Fig. 11 shows an example of a splitting operation according to an embodiment of the invention.
Fig. 12 shows adjacent Inception structures in GoogLeNet v1, both existing and as optimized by the invention.
Fig. 13 shows adjacent Inception structures in GoogLeNet v2, both existing and as optimized by the invention.
Fig. 14 shows an example of an SoC that can be used to implement the computing system of the invention.
Fig. 15 shows a platform for performing neural network computation according to an embodiment of the invention.
Detailed description
Preferred embodiments of the disclosure are described more fully below with reference to the accompanying drawings. Although preferred embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
Basic concepts of the neural network processor
With the continued development of artificial intelligence, machine learning, and neural network algorithms in recent years, convolutional neural networks have achieved results surpassing humans in image classification, recognition, detection, tracking, and similar tasks. Given the enormous parameter scale and computation volume of convolutional neural networks, and their demanding requirements for hardware platform stability and computational energy efficiency, accelerator design on heterogeneous computing platforms such as FPGAs and GPUs has become a new research hotspot. Compared with GPU platforms, FPGAs can achieve higher energy efficiency thanks to their low power consumption, while their capacity for fast iteration and hardware reconfiguration better matches rapidly evolving algorithms. However, existing FPGA- and GPU-based CNN accelerator designs suffer from problems such as supporting only a single network structure, unreasonable bandwidth usage, and low utilization of computing resources, and cannot satisfy ever higher real-time requirements. Considerable room for research and discovery remains in heterogeneous accelerator design for CNNs.
In contrast with a single computing platform (one with only a host or CPU), the present invention targets neural network dedicated processors designed specifically to perform neural network computation. Those skilled in the art will appreciate that the term "neural network dedicated processor" used in this application may also simply be called a "neural network processor" or "NN processor". Since deep learning is currently the most popular technical category within neural network technology, the neural network dedicated processor can be implemented as a deep learning dedicated processor or deep learning processor. It will further be understood that neural networks have various technical branches, such as deep neural networks (DNN, Deep Neural Network) and convolutional neural networks (CNN, Convolutional Neural Network), so the neural network dedicated processor may also be implemented as a deep neural network dedicated processor or deep neural network processor (DNN processor or CNN processor). That is, neural network computing implementation techniques on heterogeneous computing platforms involving a "deep learning processor" or "deep neural network processor" also fall within the scope of the present invention.
The DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence; exploiting the high parallelism and low power consumption of FPGAs, it performs inference based on convolutional neural networks (hereinafter CNNs). Here, the DPU can be regarded as one concrete implementation of the "deep learning processor", "deep neural network processor", or "neural network processor" above. The following description is based primarily on a DPU realized on an FPGA using a CNN structure, but those skilled in the art will understand that the principles of the invention apply equally to neural network processors that perform inference for other neural networks on hardware such as GPUs.
During the mapping of algorithms to instructions on the DPU platform, it is necessary to decouple from deep learning frameworks, parse diverse CNN algorithms, and construct the corresponding DPU computation graph. Coarse-grained optimization is then performed on the graph structure, including node pruning and fusion, finally forming subgraphs and subgraph configurations that guide instruction generation by the DPU compiler.
Since the DPU is a general acceleration platform for neural network algorithms in artificial intelligence, it must support fixed-point models from different platforms; at the same time, the DPU supports configurable hardware parameters so that, as a hardware IP core, it can be deployed on different hardware platforms. Therefore, in the mapping from algorithm to instruction, it is first necessary to parse network computation graphs from different deep learning platforms and find opportunities for graph optimization, so as to maximize hardware computational efficiency as far as possible. When the DPU performs neural network inference, it executes multiple operations at each node of the computation graph. These operations must be completed on the DPU and can be understood as calls to different computing modules on the DPU. In general, compared with the time cost of reading the required data from external memory, the time the DPU spends computing is very short, which makes memory reads the bottleneck of system throughput. Finding a smaller, faster computation graph structure that does not change the graph's underlying results, minimizes data transfers between memory and the chip, and uses hardware resources more effectively is therefore a key step in mapping DPU algorithms to instructions. The computing implementation of the invention is the link between deep learning algorithms and the DPU compiler; it serves as an important inter-layer optimization of the neural network and is a core front-end algorithm of the compiler.
To better illustrate the computing scheme of the invention, the basic concepts of CNNs, network computation graphs, and their basic operations are explained first.
Basic concepts of CNNs
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help the reader understand the CNN-based computing operations analyzed in this application, we first introduce background knowledge of CNNs based on existing CNN models.
As shown in Fig. 1, a typical CNN consists of a series of ordered layers.
The parameters of a CNN model are called "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. A following layer reads the feature maps generated by the previous layer and outputs new feature maps. Finally, a classifier outputs the probabilities that the input image belongs to each class. CONV layers (convolutional layers) and FC layers (fully connected layers) are the two basic layer types in a CNN; a CONV layer is usually followed by a pooling layer.
In this application, for a given CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For CONV layers, n_in and n_out denote the numbers of input and output feature maps respectively.
For FC layers, n_in and n_out denote the lengths of the input and output feature vectors respectively.
Definition of CONV layers (convolutional layers): a CONV layer takes a series of feature maps as input and obtains output feature maps by convolution with convolution kernels. A nonlinear layer, i.e. a nonlinear activation function applied to every element of the output feature maps, is usually attached to CONV layers. The activation function used is generally the ReLU function, and this layer is also commonly called a ReLU layer.
A CONV layer can be expressed by expression (1):
f_i^out = Σ_{j=1}^{n_in} f_j^in ⊗ g_{i,j} + b_i    (1)
where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
Definition of FC layers (fully connected layers): an FC layer applies a linear transformation to the input feature vector:
f^out = W f^in + b    (2)
W is an n_out × n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Consequently, in expression (2), the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually attached to a CONV layer, it outputs the maximum or average value of each subarea in each feature map. Max pooling can be expressed by expression (3):
f_i^out(x, y) = max_{0 ≤ m, n < p} f_i^in(x·p + m, y·p + n)    (3)
where p is the size of the pooling kernel. This nonlinear "down-sampling" not only reduces the size of feature maps and the computation for the next layer, but also provides a form of translation invariance. During forward inference, a CNN can be used to perform image classification.
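Expressions (1)-(3) can be sketched as a small reference forward pass. The NumPy version below is a sketch for concreteness only: it assumes cross-correlation (as in most CNN frameworks), valid padding, and non-overlapping pooling windows, none of which the patent fixes.

```python
import numpy as np

def conv_layer(x, g, b):
    """Expression (1): x is (n_in, H, W), g is (n_out, n_in, k, k),
    b is (n_out,). Valid cross-correlation."""
    n_out, n_in, k, _ = g.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.empty((n_out, H, W))
    for i in range(n_out):
        acc = np.zeros((H, W))
        for j in range(n_in):          # sum over input feature maps
            for u in range(k):
                for v in range(k):
                    acc += g[i, j, u, v] * x[j, u:u + H, v:v + W]
        y[i] = acc + b[i]              # add the bias of the i-th output map
    return y

def relu(x):
    """The nonlinear layer: elementwise max(x, 0)."""
    return np.maximum(x, 0.0)

def max_pool(x, p):
    """Expression (3): p x p non-overlapping max pooling per channel."""
    c, H, W = x.shape
    return x[:, :H // p * p, :W // p * p].reshape(c, H // p, p, W // p, p).max(axis=(2, 4))
```

Chaining `max_pool(relu(conv_layer(x, g, b)), p)` reproduces the CONV → ReLU → POOL sequence that the rest of the document optimizes.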
Basic concepts of the network computation graph
To decouple from deep learning computation frameworks, a computation graph structure corresponding to the neural network processor needs to be constructed. Neural network algorithms from different deep learning platforms are converted into a general computation graph; the graph is optimized and restructured, and the optimized graph is then mapped to instructions and machine code of the hardware platform, completing the compilation of the algorithm on the hardware platform.
Figs. 2A-2C show representative computation graph structures of existing CNN networks. Since the invention reduces the number of data transfers between external memory and the on-chip cache through sensible data reuse, memory-access operations have been added to the conventional computation graph structures, which contain only computing operation nodes: "MEM" in the figures denotes a memory read/write operation, so as to show the bandwidth-saving advantage of the technical solution of the invention over the prior art. In the DPU implementation, "MEM" refers to data transfers between DDR (double data rate synchronous DRAM) and the chip.
Fig. 2A shows the basic structure in the Vgg network. As shown in Fig. 2A, when this branchless network computation graph executes the most basic CONV (convolution), ReLU (nonlinearity; the activation function used is generally the ReLU function), and POOL (pooling) operations, the feature map it needs to load is too large, so data must be moved repeatedly between DDR and the on-chip cache (implemented, for example, as BRAM, i.e. block RAM). Fig. 2B shows the basic structure in the ResNet network, and Fig. 2C shows adjacent Inception structures in GoogLeNet v1. As shown in Figs. 2B and 2C, branched network computation graphs additionally introduce an ELTWISE (element-wise) layer for summing the outputs of multiple convolutional layers, and a CONCAT (concatenation) layer for cascading the data of several input layers along the channel dimension into one new layer. Likewise, these computation graphs still show repeated data transfers between DDR and BRAM. It should be understood that "Vgg", "ResNet", and "GoogLeNet" listed above are CNN architectures popular in the prior art, used to explain the principles of the invention rather than to limit it.
Optimization based on data reuse
Because on-chip storage resources are limited, an entire CNN cannot be computed on chip in one pass, so the computation task must be divided into blocks. Data loaded on chip after blocking can also be used repeatedly through reuse, reducing the traffic with the external storage unit.
The data reuse scheme of the invention is described below in conjunction with Fig. 3, which shows a computing platform implementation method for a neural network according to an embodiment of the invention. A computing platform according to the invention reads the required data, for example an image to be classified, from the external memory, and caches the read data and the intermediate results of each operation in the on-chip cache.
In step S310, a first part of the data of the feature map required by a first operation is read from the external memory; the first operation and at least one other operation are executed on the first part; and the result for the first part is stored back to the external memory.
In step S320, a second part of the data of the feature map required by the first operation is read from the external memory, and the read from the external memory, the first operation, the at least one other operation, and the store back to the external memory described above are performed on the second part.
Preferably, the feature map may include the first part of data and at least one second part of data, and the method includes: for each second part, performing the read from the external memory, the first operation, the at least one other operation, and the store back to the external memory described above, where the combined results for the first part and each second part are equivalent to the result of performing the first operation and the at least one other operation sequentially on the feature map.
While the first operation and the at least one other operation are being performed on each part, the intermediate data generated is buffered directly in the on-chip cache. These multiple operations on the same partial data involve no writes of data to the external memory, and usually involve no reads of data from the external memory either (an exception is the ELTWISE operation shown in Fig. 5).
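The on-chip chaining just described can be sketched as a schedule that makes exactly one external-memory round trip per tile. The callbacks below are hypothetical stand-ins for the DDR interface; the operation list would be CONV, ReLU, POOL, and so on:

```python
def run_fused(tiles, read_ddr, write_ddr, ops):
    """Process each feature-map tile through the whole operation chain
    before touching external memory again: one read and one write per
    tile, instead of one read/write per tile *per operation*."""
    for t in tiles:
        data = read_ddr(t)          # DDR -> on-chip buffer
        for op in ops:              # all intermediates stay on chip
            data = op(data)
        write_ddr(t, data)          # on-chip buffer -> DDR
```

The key property is that the inner loop over operations never calls `read_ddr` or `write_ddr`, matching the statement above that intermediate data stays in the on-chip cache.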
The data-reuse scheme of the invention is described more concretely below in conjunction with Fig. 4. Fig. 4 shows the network computation graph of the basic Vgg structure; the left side is the existing structure and the right side is the optimized structure according to the invention.
Usually, during the execution of the whole computation graph, the way data are stored in hardware can be abstracted as a feature map, having the three dimensions of width, height and channel plus an optional batch dimension as the fourth, unrolled into a one-dimensional data tensor according to certain rules. For example, in image classification, not only can the initial image be abstracted as a three-dimensional feature map, but the output of every operation (that is, the result of each node) can still be regarded as a feature map. Since on-chip storage resources are limited (for example, the on-chip BRAM cannot cache a complete feature map at once), completing an operation on the entire feature map requires multiple reads from DDR into BRAM.
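As a rough illustration of why tiling is needed, the minimum number of tiles can be estimated from the feature-map size and the on-chip buffer capacity (a simplified sketch; the function name and the 64 KiB buffer size are illustrative assumptions, not values from the patent):

```python
import math

def num_tiles(height, width, channels, bytes_per_elem, bram_bytes):
    """Minimum number of tiles a feature map must be split into so that
    each tile (and hence each DDR -> BRAM read) fits in the on-chip buffer."""
    total = height * width * channels * bytes_per_elem
    return max(1, math.ceil(total / bram_bytes))

# A 224x224x3 feature map of 8-bit values against an assumed 64 KiB BRAM:
# 150528 bytes / 65536 bytes per tile -> at least 3 tiles.
```

In practice the split must also respect row/column boundaries and convolution halos, which this estimate ignores.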
Assume that the input feature map is a three-channel (for example, RGB) two-dimensional image, and that the on-chip BRAM can read in the data of one image block at a time. In the existing structure shown on the left of Fig. 4, to complete the CONV, ReLU and POOL operations on this input feature map, the three-channel data of, say, the upper-left image block is first read from the external memory (for example, DDR) and stored back to DDR after the CONV operation; then the three-channel data of the upper-right image block is read and stored back to DDR after the CONV operation; then the three-channel data of the lower-left image block is read and stored back to DDR after the CONV operation; and finally the three-channel data of the lower-right image block is read and stored back to DDR after the CONV operation. The CONV operation on the feature map is thereby finished. Next, the CONV-processed three-channel data of the upper-left image block is read from DDR and stored back to DDR after the ReLU operation; the CONV-processed upper-right, lower-left and lower-right image blocks are read from DDR and stored back to DDR after the ReLU operation in the same way. The ReLU operation on the feature map is thereby finished. Finally, the ReLU-processed three-channel data of the upper-left image block is read from DDR and stored back to DDR after the POOL operation; the ReLU-processed upper-right, lower-left and lower-right image blocks are read from DDR and stored back to DDR after the POOL operation in the same way. The POOL operation on the feature map is thereby finished.
In the layer-fusion-based structure of the invention shown on the right of Fig. 4, to complete the CONV, ReLU and POOL operations on the same input feature map, the three-channel data of the upper-left image block is first read from DDR; the result of the CONV operation is kept in the on-chip cache (for example, BRAM), the ReLU operation is applied directly to the cached data, the POOL operation follows directly after any necessary on-chip buffering, and the three-channel data of the upper-left image block, now processed by CONV, ReLU and POOL, is then stored back to DDR. Next, the three-channel data of the upper-right image block is read from DDR and likewise processed by CONV, ReLU and POOL directly on chip before being stored back to DDR; the lower-left and lower-right image blocks are then read from DDR and processed in the same way, each stored back to DDR only after CONV, ReLU and POOL have been performed on chip.
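The two schedules contrasted in Fig. 4 can be sketched as follows; counting the DDR transactions in each trace shows where the savings come from (a toy model with hypothetical block and operation labels, not production code):

```python
def per_layer_schedule(num_blocks, ops):
    """Existing structure (left of Fig. 4): each operation makes its own
    full pass over the feature map, loading and storing every block."""
    trace = []
    for op in ops:
        for blk in range(num_blocks):
            trace += [f"load b{blk}", f"{op} b{blk}", f"store b{blk}"]
    return trace

def fused_schedule(num_blocks, ops):
    """Layer-fused structure (right of Fig. 4): each block is loaded once,
    all operations run on the cached copy, and only the end result is stored."""
    trace = []
    for blk in range(num_blocks):
        trace.append(f"load b{blk}")
        trace += [f"{op} b{blk}" for op in ops]
        trace.append(f"store b{blk}")
    return trace

def ddr_transactions(trace):
    # every load/store in the trace is one external-memory transaction
    return sum(step.startswith(("load", "store")) for step in trace)

# 4 blocks, 3 operations: 24 DDR transactions unfused versus 8 fused.
```

The unfused count grows with the number of fused operations (2 x blocks x ops), while the fused count stays at 2 x blocks regardless of how many operations are chained.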
It can be seen that the computing-platform implementation of the invention greatly reduces the number of reads from the external memory into the on-chip cache, removing external-memory reads as the bottleneck of DPU efficiency and thereby substantially improving the task-execution efficiency of the DPU. It should be understood that, for convenience in illustrating the principle of the invention, the feature map in the above example is cut into four blocks (upper-left, upper-right, lower-left and lower-right) for processing and storage. In practical applications the feature map can be partitioned differently as needed; no restriction is imposed here.
In the example of Fig. 4, the first operation and the at least one other operation comprise successive convolution (CONV), non-linear (ReLU) and pooling (POOL) operations. This is the most common basic structure in a CNN. Here, the at least one other operation (that is, the ReLU and POOL operations) is a subsequent operation based on the first operation, a subsequent operation that needs at least the complete computation result of the first operation in order to complete all of its own operations.
In other embodiments, the other operations may have other dependency relationships with the first operation. The data-reuse scheme of the invention is described below in conjunction with several examples.
Fig. 5 shows the network computation graph of the basic Resnet structure; the left side is the existing structure and the right side is the optimized structure according to the invention.
As shown on the right of Fig. 5, besides fusing the convolution (CONV) and non-linear (ReLU) operations, the data-reuse scheme of the invention can also be applied, in network structures with branches, to nodes with two or more inputs (such as the ELTWISE operation in the figure). Accordingly, the implementation method of the invention may further include reading from the external memory the partial data from other feature maps required by a first subsequent operation, and performing the first subsequent operation on the partial data from the other feature maps together with the partial data from the feature map. Here the ELTWISE (dot-product) operation, as the first subsequent operation, not only reuses the result data of the CONV and ReLU operations on the right branch of the network, but also receives the result data of the left branch of the network as input. The degree of data reuse is thereby further improved.
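A per-tile sketch of the fused residual pattern on the right of Fig. 5 follows; the `double` and `relu` callables are hypothetical stand-ins for the real kernels, not anything defined in the patent:

```python
def fused_eltwise_tile(tile, shortcut, conv, relu):
    """The CONV/ReLU result for a tile stays in the on-chip cache; the
    ELTWISE merge reuses it directly, so only the shortcut data needs an
    extra read from external memory."""
    cached = relu(conv(tile))          # kept on chip, never written to DDR
    return [a + b for a, b in zip(cached, shortcut)]  # ELTWISE merge

double = lambda xs: [2 * x for x in xs]      # stand-in "conv"
relu = lambda xs: [max(0, x) for x in xs]
# e.g. fused_eltwise_tile([1, -2, 3], [10, 20, 30], double, relu) -> [12, 20, 36]
```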
Similarly, for the most basic CONV operations in a CNN, the reuse scheme of the invention can fuse different CONV operations. Fig. 6 shows vertical fusion of convolutions according to an embodiment of the invention. In the vertical layer fusion of two successive convolution operations, all the parameters needed can be loaded into the on-chip cache at once, and the two convolution operations are thereby fused: the intermediate result between the two operations does not need to be written back to the external memory, but is cached directly on chip and written back only after the computation has finished. Fig. 7 shows horizontal fusion of convolutions according to an embodiment of the invention. Horizontal layer fusion can be understood as several operations sharing the same input feature map: instead of loading the feature map on chip once per layer, the results are computed on chip and stored back to their respective positions in the external memory. In other words, the other operations related to the first operation need not only be sequential operations based on the output of the first operation; they may also be at least one horizontal operation sharing the same feature-map input with the first operation. Horizontal fusion means that several computing operations share the same input feature map, saving the bandwidth spent carrying data from the external memory onto the chip; vertical fusion means saving the bandwidth needed to carry inter-layer feature-map data back and forth between the external memory and the chip.
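Horizontal fusion as in Fig. 7 can be sketched by counting loads of the shared input (a toy model; the load counter and the sample operations are illustrative assumptions):

```python
loads = {"n": 0}

def load_feature_map():
    """Stand-in for a DDR -> on-chip transfer; counts how often it happens."""
    loads["n"] += 1
    return [1, 2, 3]

def run_unfused(ops):
    # without fusion, each operation loads the shared input feature map itself
    return [op(load_feature_map()) for op in ops]

def run_horizontally_fused(ops):
    # with horizontal fusion, one on-chip copy feeds every operation
    x = load_feature_map()
    return [op(x) for op in ops]

ops = [sum, max, min]   # three hypothetical operations sharing one input
```

Both variants produce identical results; only the number of external-memory transfers differs (one per operation versus one in total).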
As described above, during the execution of the whole computation graph, the way data are stored in hardware can be abstracted as a feature map unrolled into a one-dimensional data tensor according to certain rules; operations that do not change the values of the data but only change their dimensions or arrangement can therefore be pruned out of the computation graph. Unlike implementations similar to GPUs and CPUs, which perform a dedicated rearrangement of the data in hardware, the implementation of the invention can fold such processing into other associated operations. For example, in the operation of the preceding computation-graph node, the data can be stored directly into the external memory (for example, DDR) according to the arrangement of the converted dimensions; alternatively, the data can be stored in the usual way and loaded according to the new dimensions at the next computation-graph node, so that the impact of dedicated data rearrangement and dimension conversion is erased and the execution efficiency of the hardware is improved. Other operations that have no effect on the required computation result can likewise be pruned away.
Fig. 8 shows the omission of the FLATTEN (flattening) operation according to an embodiment of the invention. Fig. 9 shows the omission of the CONCAT (concatenation) operation according to an embodiment of the invention. Since the concatenation operation involves the merging of network branches, it needs to be marked in the graph; the dotted box around CONCAT indicates that, in the implementation of the invention, the dimension transformation involved in the concatenation operation is fused into the storing process of the preceding CONV operation.
Here, the FLATTEN operation "flattens" the feature map after a single convolution, i.e. makes it one-dimensional. In branched networks (such as GoogLeNet), the concatenation operation of multiple layers takes the outputs of several upper layers as the inputs of CONCAT; the CONCAT operation concatenates the data of each input layer channel-wise into a new layer, which is then output to the next layer. FLATTEN and CONCAT are both pure data-rearrangement and dimension-mapping operations, and can be omitted by specifically prescribing the way the data are stored into and/or read from memory.
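The CONCAT omission of Fig. 9 can be sketched as each branch storing its result at its own channel offset in external memory (a minimal illustration; the list-of-channels memory model is an assumption made for clarity):

```python
def store_with_concat(ddr, branch_out, channel_offset):
    """Fold CONCAT into each branch's store: the branch writes its channels
    at its own offset in external memory, so no separate concatenation pass
    over the data is ever executed."""
    for i, channel in enumerate(branch_out):
        ddr[channel_offset + i] = channel

# Two hypothetical branches with 2 and 3 output channels:
a = ["a0", "a1"]
b = ["b0", "b1", "b2"]
ddr = [None] * (len(a) + len(b))
store_with_concat(ddr, a, 0)
store_with_concat(ddr, b, len(a))
# ddr now holds the concatenated layout a + b without a CONCAT operation
```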
Fig. 10 shows the fusion of the BatchNorm (batch-normalization) operation and the Scale (scaling) operation according to an embodiment of the invention. During computation, the BatchNorm and Scale operations and their parameters can be merged directly into the preceding CONV layer.
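A sketch of this folding for a toy 1x1 "convolution" with two output channels (all numbers are made up for illustration): per output channel, conv followed by BatchNorm and Scale collapses into a single conv with adjusted weights and bias.

```python
import math

def fold_bn_scale(w, b, mean, var, gamma, beta, eps=1e-5):
    """Merge BatchNorm + Scale into the preceding conv:
        gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    becomes a conv with weights w * s and bias b * s + beta - mean * s,
    where s = gamma / sqrt(var + eps) per output channel."""
    s = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    w_f = [[wij * si for wij in row] for row, si in zip(w, s)]
    b_f = [bi * si + be - m * si for bi, si, be, m in zip(b, s, beta, mean)]
    return w_f, b_f

def matvec(w, x):
    """Toy stand-in for a 1x1 convolution (bias applied separately)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
```

Running `matvec` with the folded weights and bias gives the same output as running the conv, BatchNorm and Scale steps separately, which is why the two normalization layers disappear from the optimized graph.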
To achieve more efficient data reuse, layer splitting can also be performed in addition to the layer-fusion methods shown above. In one embodiment, the implementation method of the invention may further include decomposing the first operation and/or the at least one other operation, wherein each decomposed first operation and/or other operation is directed to a part of its original input. In other words, the decomposition may be the decomposition of a layer within a single branch, or the decomposition of a trunk node of a multi-branch network.
In one embodiment, the decomposed first operation and/or at least one other operation may be split CONV operations, and the input of each split CONV operation is a part of the original feature map. Fig. 11 shows an example of the splitting operation according to an embodiment of the invention. When the processing capacity of the hardware or the like is limited, layers can be processed in groups, with a grouping (group) parameter set to reduce the number of parameters and the amount of computation. Although, with the general growth in processing capacity, current hardware no longer needs the group parameter to be set, for the sake of compatibility the implementation of the invention can still split a convolution with a group parameter into several small convolutions that are then concatenated together, thereby extending the generality of the hardware.
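The splitting of a grouped convolution into small per-group convolutions can be sketched as follows (a toy 1x1 case with made-up weights; each group sees only its own slice of the input channels):

```python
def matvec(w, x):
    """Toy stand-in for a 1x1 convolution."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def split_grouped_conv(w_groups, x):
    """Run a convolution with a group parameter as several small
    convolutions whose outputs are concatenated back together."""
    n = len(x) // len(w_groups)          # input channels per group
    out = []
    for g, w in enumerate(w_groups):
        out += matvec(w, x[g * n:(g + 1) * n])
    return out

# group = 2: each single-output group convolves its own 2 input channels
w_groups = [[[1.0, 2.0]], [[3.0, 4.0]]]
# split_grouped_conv(w_groups, [1.0, 1.0, 1.0, 1.0]) -> [3.0, 7.0]
```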
In another embodiment, the decomposed layer can be a trunk node in the computation graph of a branched network. In other words, the layer to be decomposed originally needs to receive inputs from multiple feature maps, and the multiple layers obtained after decomposition can each be merged into their respective branches for single-branch fusion. Preferably, the decomposed first operation and/or at least one other operation are split POOL operations, and the input of each split POOL operation is the complete feature-map output of a preceding operation.
Figs. 12 and 13 show typical structures in GoogLeNet and the optimized structures after applying the data-reuse scheme of the invention: Fig. 12 shows adjacent Inception structures in GoogLeNet v1 and Fig. 13 shows adjacent Inception structures in GoogLeNet v2, with the original structure on the left of each figure and the structure optimized according to the scheme of the invention on the right. As before, the dotted box around CONCAT → MEM indicates that, in the implementation of the invention, the dimension transformation involved in the concatenation operation is fused into the storing process of the preceding CONV operation. In addition, for ease of understanding, italic and underscored formats are used in the figures to distinguish successive operations of the same name and their related operations.
Specifically, the figures involve not only the fusion of the basic structure, i.e. successive convolution (CONV), non-linear (ReLU) and pooling (POOL) operations, but also layer splitting. For example, concatenating the results of several convolution operations and then pooling is computationally equivalent to pooling directly after each convolution and then concatenating; the pooling operation that follows the concatenation can therefore be split and merged into each preceding convolution operation, achieving partial reconstruction and optimization of the computation graph.
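The equivalence that justifies this split can be checked on toy data (per-channel max pooling over flattened 2x2 windows; the shapes and values are illustrative assumptions):

```python
def max_pool(block):
    """Per-channel max over a pooling window; each channel is a flat list."""
    return [max(channel) for channel in block]

# two hypothetical branches with 2 and 3 channels of flattened 2x2 windows
a = [[0, 1, 2, 3], [4, 5, 6, 7]]
b = [[8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

lhs = max_pool(a + b)                 # CONCAT first, then POOL
rhs = max_pool(a) + max_pool(b)       # POOL per branch, then CONCAT
# both orderings yield the same channels, so the POOL can move into each branch
```

The swap holds because pooling acts on each channel independently, while CONCAT only stacks channels; the two operations commute.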
In actual use, the solution of the invention can be realized by a neural-network computing system. The system may include an off-chip memory device for storing the parameters and feature maps required for neural-network computation; and a computation executing device that includes an on-chip cache and that, for each of the different parts of the feature map divided according to the capacity the on-chip cache can hold, performs the following in turn: reading from the external memory the specific part of the feature map required by the operations, performing at least two operations on the specific part, and storing the operation result back to the off-chip memory device.
In the neural-network computing system of the invention, some or all of the functions of the computation executing device that performs the neural-network computation can be realized by digital circuitry. In one embodiment, the computing system of the invention can be realized on a system-on-chip (SoC) comprising a general-purpose processor, memory and digital circuitry. Fig. 14 shows an example of an SoC that can be used to realize the computing system of the invention.
In one embodiment, the learning network required by this system, such as a convolutional neural network, can be realized by the digital-circuit part (for example, an FPGA or GPU) on the SoC. The computation executing device can be a highly parallel computing device based on an FPGA, GPU or ASIC. Since a CNN performs parallel computation, realizing the computing function of a convolutional neural network in logic hardware, especially an FPGA, has a natural computing advantage and can achieve lower power consumption than software execution.
In one embodiment, all the parameters of the CNN obtained in prior training can be stored in the external memory. When neural-network inference is subsequently performed, data reuse between operations can be maximized according to the computation graph optimized and restructured by the layer-fusion and decomposition schemes of the invention (for example, the right-hand computation graphs of Figs. 12 and 13), reducing the data transfers between the main memory and the on-chip cache and thereby improving system efficiency.
Fig. 15 shows a computing platform for performing neural-network computation according to the invention. The computing platform 1500 may include a data processing module 1510, a data storage module 1520 and a control module 1530.
The data processing module 1510 can be used to perform predetermined computation processing on input data and generate output data. The data storage module 1520 can be used to cache the input data required by the data processing module or the intermediate data output by the data processing module. The control module 1530 can be used to control the data processing module 1510 and the data storage module 1520 so as to execute the layer fusion and decomposition schemes according to the invention. In one embodiment, the specific architecture of the computing platform of the invention can be realized by the programmable logic module shown in, for example, Fig. 14, where the data processing module 1510 corresponds to the complex computing core that performs the CNN operations, the data storage module 1520 corresponds to the input and output buffers, and the control module 1530 corresponds to the controller in the figure.
The computing-platform implementation scheme for neural networks according to the invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the invention can also be implemented as a computer program or computer program product comprising computer program code instructions for executing the steps defined in the above method of the invention.
Alternatively, the invention can also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or computing device, server, etc.), the processor is caused to execute the steps of the above method according to the invention.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show the possible architectures, functions and operations of the systems and methods of multiple embodiments according to the invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment, or a part of code, which comprises one or more executable instructions for realizing the prescribed logic function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can also occur in an order different from that marked in the drawings. For example, two consecutive boxes can in fact be executed essentially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be realized by a dedicated hardware-based system that executes the prescribed functions or operations, or by a combination of dedicated hardware and computer instructions.
In addition, "first" and "second" as used in the invention are intended to indicate different objects, not to limit the order of execution or the like; for example, "first-part data" and "second-part data" in the text indicate different parts belonging to the feature map, and "first subsequent operation" and "second subsequent operation" are only used to distinguish two different subsequent operations.
Various embodiments of the invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The choice of the terms used herein is intended to best explain the principles of the embodiments, their practical application or their improvement over technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (20)
1. A computing-platform implementation method for a neural network, the computing platform reading data required for computation from an external memory and caching the read data and the intermediate computation results of each operation in an on-chip cache, the method comprising:
reading from the external memory first-part data of a feature map required by a first operation, performing the first operation and at least one other operation on the first-part data, and storing the operation result for the first-part data back to the external memory;
reading from the external memory second-part data of the feature map required by the first operation, and performing on the second-part data the above-described reading from the external memory, the first operation and the at least one other operation, and the storing back to the external memory.
2. The method of claim 1, wherein the feature map comprises the first-part data and at least one item of second-part data, and the method comprises:
performing, for each item of second-part data, the above-described reading from the external memory, the first operation and the at least one other operation, and the storing back to the external memory, wherein the combined operation results for the first-part data and the at least one item of second-part data are equivalent to the results of sequentially performing the first operation and the at least one other operation on the feature map.
3. The method of claim 1, wherein the size of the partial data read each time from the external memory is determined at least in part based on the capacity of the on-chip cache.
4. the method for claim 1, wherein at least one other operation is at least one of following:
At least one subsequent operation of first operation, the subsequent operation at least need the complete computation of first operation
As a result all operationss are completed with it;And
There is at least one lateral operation of same characteristic features figure input with first operation.
5. The method of claim 4, wherein the first operation and the at least one other operation comprise successive convolution (CONV), non-linear (ReLU) and pooling (POOL) operations.
6. The method of claim 4, further comprising:
reading from the external memory partial data from other feature maps required by a first subsequent operation, and performing the first subsequent operation on the partial data from the other feature maps and the partial data from the feature map.
7. The method of claim 6, wherein the first subsequent operation is a dot-product (ELTWISE) operation.
8. The method of claim 4, further comprising:
merging a second subsequent operation into the first operation.
9. The method of claim 8, wherein the second subsequent operation comprises batch-normalization (BatchNorm) and scaling (Scale) operations, and the first operation is a CONV operation.
10. The method of claim 1, further comprising:
storing operation results back to the external memory in a required dimension arrangement, and/or reading preceding operation results from the external memory in a required dimension arrangement, so as to omit operations in the neural network that are only used to change the dimensions or arrangement of data.
11. The method of claim 10, wherein the omitted operation is at least one of the following:
a concatenation (CONCAT) operation; and
a flattening (FLATTEN) operation.
12. The method of claim 1, further comprising:
decomposing the first operation and/or the at least one other operation, wherein each decomposed first operation and/or other operation is directed to a part of its original input.
13. The method of claim 12, wherein the decomposed first operation and/or at least one other operation are split CONV operations, and the input of each split CONV operation is a part of the original feature map.
14. The method of claim 12, wherein the decomposed first operation and/or at least one other operation are split POOL operations, and the input of each split POOL operation is the complete feature-map output of a preceding operation.
15. A neural-network computing system, comprising:
an off-chip memory device for storing parameters and feature maps required for neural-network computation; and
a computation executing device comprising an on-chip cache, the computation executing device performing the following in turn for each of the different parts of the feature map divided according to the capacity the on-chip cache can hold:
reading from the external memory the specific part of the feature map required by the operations, performing at least two operations on the specific part, and storing the operation result back to the off-chip memory device.
16. The system of claim 15, wherein the computation executing device is a highly parallel computing device based on an FPGA, GPU or ASIC.
17. The system of claim 15, wherein the at least two operations are at least one of the following:
horizontal operations using the same feature map as input; and
sequential operations based on the operation result of a preceding operation.
18. A computing platform for a neural network, comprising:
a data processing module for performing predetermined computation processing on input data and generating output data;
a data storage module for caching the input data required by the data processing module or the intermediate data output by the data processing module; and
a control module for controlling the data processing module and the data storage module so as to execute the method of any one of claims 1-14.
19. The computing platform of claim 18, wherein the data processing module is a convolution computation module for performing convolution computation on the input data.
20. A non-transitory machine-readable storage medium on which executable code is stored, the executable code, when executed by the processor of an electronic device, causing the processor to execute the method of any one of claims 1-14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810297916.6A CN110321064A (en) | 2018-03-30 | 2018-03-30 | Computing platform realization method and system for neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110321064A true CN110321064A (en) | 2019-10-11 |
Family
ID=68112541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810297916.6A Pending CN110321064A (en) | 2018-03-30 | 2018-03-30 | Computing platform realization method and system for neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321064A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914999A (en) * | 2020-07-30 | 2020-11-10 | 云知声智能科技股份有限公司 | Method and equipment for reducing calculation bandwidth of neural network accelerator |
CN112698954A (en) * | 2021-01-14 | 2021-04-23 | 上海交通大学 | Coarse-grained reconfigurable array scheduling method based on subgraph decoupling |
CN112884123A (en) * | 2021-02-23 | 2021-06-01 | 杭州海康威视数字技术股份有限公司 | Neural network optimization method and device, electronic equipment and readable storage medium |
WO2021248941A1 (en) * | 2020-06-08 | 2021-12-16 | 深圳市九天睿芯科技有限公司 | All-on-chip storage neural network accelerator and implementation method therefor |
CN114356235A (en) * | 2021-12-31 | 2022-04-15 | Oppo广东移动通信有限公司 | Data standardization processing method and device, electronic equipment and storage medium |
CN114968602A (en) * | 2022-08-01 | 2022-08-30 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
WO2023212975A1 (en) * | 2022-05-06 | 2023-11-09 | 北京灵汐科技有限公司 | Mapping method, electronic device and computer-readable storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105892989A (en) * | 2016-03-28 | 2016-08-24 | 中国科学院计算技术研究所 | Neural network accelerator and operational method thereof |
CN106250981A (en) * | 2015-06-10 | 2016-12-21 | 三星电子株式会社 | The impulsive neural networks of bandwidth consumption in minimizing memory access and network |
CN106951962A (en) * | 2017-03-22 | 2017-07-14 | 北京地平线信息技术有限公司 | Compound operation unit, method and electronic equipment for neutral net |
CN107203807A (en) * | 2016-03-16 | 2017-09-26 | 中国科学院计算技术研究所 | The computational methods of neutral net, system and its apparatus |
CN107437110A (en) * | 2017-07-11 | 2017-12-05 | 中国科学院自动化研究所 | The piecemeal convolution optimization method and device of convolutional neural networks |
CN107451654A (en) * | 2017-07-05 | 2017-12-08 | 深圳市自行科技有限公司 | Acceleration operation method, server and the storage medium of convolutional neural networks |
CN107657581A (en) * | 2017-09-28 | 2018-02-02 | 中国人民解放军国防科技大学 | Convolutional neural network CNN hardware accelerator and acceleration method |
CN107704923A (en) * | 2017-10-19 | 2018-02-16 | 珠海格力电器股份有限公司 | Convolutional neural networks computing circuit |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021248941A1 (en) * | 2020-06-08 | 2021-12-16 | 深圳市九天睿芯科技有限公司 | All-on-chip storage neural network accelerator and implementation method therefor |
CN111914999A (en) * | 2020-07-30 | 2020-11-10 | 云知声智能科技股份有限公司 | Method and device for reducing the computation bandwidth of a neural network accelerator |
CN111914999B (en) * | 2020-07-30 | 2024-04-19 | 云知声智能科技股份有限公司 | Method and device for reducing the computation bandwidth of a neural network accelerator |
CN112698954A (en) * | 2021-01-14 | 2021-04-23 | 上海交通大学 | Coarse-grained reconfigurable array scheduling method based on subgraph decoupling |
CN112698954B (en) * | 2021-01-14 | 2022-05-10 | 上海交通大学 | Coarse-grained reconfigurable array scheduling method based on subgraph decoupling |
CN112884123A (en) * | 2021-02-23 | 2021-06-01 | 杭州海康威视数字技术股份有限公司 | Neural network optimization method and device, electronic equipment and readable storage medium |
CN112884123B (en) * | 2021-02-23 | 2024-03-01 | 杭州海康威视数字技术股份有限公司 | Neural network optimization method and device, electronic equipment and readable storage medium |
CN114356235A (en) * | 2021-12-31 | 2022-04-15 | Oppo广东移动通信有限公司 | Data standardization processing method and device, electronic equipment and storage medium |
WO2023124654A1 (en) * | 2021-12-31 | 2023-07-06 | Oppo广东移动通信有限公司 | Data standardization processing method and apparatus, electronic device, and storage medium |
WO2023212975A1 (en) * | 2022-05-06 | 2023-11-09 | 北京灵汐科技有限公司 | Mapping method, electronic device and computer-readable storage medium |
CN114968602A (en) * | 2022-08-01 | 2022-08-30 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
CN114968602B (en) * | 2022-08-01 | 2022-10-21 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321064A (en) | Computing platform realization method and system for neural network | |
CN110321999A (en) | Neural network computation graph optimization method | |
CN110175671A (en) | Neural network construction method, image processing method and device | |
Feng et al. | Computer vision algorithms and hardware implementations: A survey | |
CN106022468A (en) | Artificial neural network processor integrated circuit and design method therefor | |
US11093225B2 (en) | High parallelism computing system and instruction scheduling method thereof | |
CN108416436A (en) | Method and system for partitioning a neural network using multi-core processing modules | |
CN110458280B (en) | Convolutional neural network acceleration method and system suitable for mobile terminal | |
CN109496294A (en) | Compilation method and system for artificial intelligence processor, storage medium and terminal | |
CN105739951B (en) | GPU-based fast solution method for L1-minimization problems | |
WO2023093724A1 (en) | Neural network model processing method and device | |
CN111768004A (en) | Model self-adaptation method and system based on intelligent computing framework | |
CN109496319A (en) | Artificial intelligence processor hardware optimization method, system, storage medium and terminal | |
Wang et al. | Nonlinear tensor train format for deep neural network compression | |
US20240160689A1 (en) | Method for optimizing convolution operation of system on chip and related product | |
CN109685208B (en) | Method and device for accelerating sparsification and merging of neural network processor data | |
Dey et al. | Accelerating training of deep neural networks via sparse edge processing | |
CN114003201A (en) | Matrix transformation method and device and convolutional neural network accelerator | |
JP7363145B2 (en) | Learning device and learning method | |
WO2023071658A1 (en) | AI model processing method and apparatus, and AI model computing method and apparatus | |
Morcel et al. | FPGA-based accelerator for deep convolutional neural networks for the Spark environment | |
Sun et al. | Computation on sparse neural networks and its implications for future hardware | |
Li et al. | Memory saving method for enhanced convolution of deep neural network | |
CN107644143B (en) | High-performance urban cellular automaton (CA) model construction method based on vectorization and parallel computation | |
CN105573834A (en) | Vocabulary tree construction method for high-dimensional data based on a heterogeneous platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | |

Effective date of registration: 2020-09-04
Address after: Unit 01-19, 10/F, 101, 6/F, Building 5, Yard 5, Anding Road, Chaoyang District, Beijing 100029
Applicant after: Xilinx Electronic Technology (Beijing) Co., Ltd.
Address before: 17/F, Building 4, Zone 4, No. 1 Wangzhuang Road, Haidian District, Beijing 100083
Applicant before: BEIJING DEEPHI INTELLIGENT TECHNOLOGY Co., Ltd.

RJ01 | Rejection of invention patent application after publication | |

Application publication date: 2019-10-11