CN110321997A - High degree of parallelism computing platform, system and calculating implementation method - Google Patents
- Publication number
- CN110321997A CN110321997A CN201810277338.XA CN201810277338A CN110321997A CN 110321997 A CN110321997 A CN 110321997A CN 201810277338 A CN201810277338 A CN 201810277338A CN 110321997 A CN110321997 A CN 110321997A
- Authority
- CN
- China
- Prior art keywords
- data
- read
- computing platform
- cache
- level cache
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0893—Caches characterised by their organisation or structure
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0893—Caches characterised by their organisation or structure
- G06F12/0897—Caches characterised by their organisation or structure with two or more cache hierarchy levels
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a highly parallel computing platform, system and related implementation methods. The computing platform includes: a first-level cache for caching calculation data read from external memory; multiple read controllers, each of which reads, from any position of the first-level cache, the calculation data (or part thereof) needed for a single operation of the parallel computing module; and a parallel computing module that performs highly parallel calculation operations on the data read by the multiple read controllers. Data can thus be fully reused to improve processing efficiency. The invention further improves overall data-processing efficiency by introducing multi-level caching, and further increases processing speed through data filtering.
Description
Technical field
The present invention relates to the field of hardware architecture, and in particular to a highly parallel computing platform, system and calculation implementation method.
Background art
Neural networks have become a research hotspot in the field of image recognition in recent years. Trained neural network models can be used in numerous areas such as image classification, object recognition and saliency detection. Neural network models have shown a trend of growing computational scale and rising complexity in recent years, and traditional CPU platforms can no longer satisfy their practical demands. Therefore, designing neural network accelerators on highly parallel heterogeneous computing platforms such as FPGAs, GPUs and ASICs has become a new research hotspot. Among these, compared with GPU platforms, FPGAs can achieve a higher computational energy-efficiency ratio; at the same time, FPGAs can be iterated quickly, and their hardware reconfigurability better suits the requirements of rapidly evolving algorithms.

When a highly parallel computing platform containing FPGAs, GPUs or the like is used to perform calculations, the execution time of the parallel computation is very short compared with the time cost of accessing data in external memory. As a result, bandwidth becomes the bottleneck that constrains any improvement in processing speed. In addition, how to make full use of bandwidth to increase parallelism is a major issue that highly parallel computing platforms must consider.
Therefore, there is still a need for a scheme that can optimize highly parallel computing.
Summary of the invention
To solve at least one of the above problems, the invention proposes a new computing platform architecture. By providing a cache module that allows simultaneous parallel reads from any of its positions, together with matching read controllers, the parallel processing capability of the computing platform can be further improved. Through a further multi-level cache layout, bandwidth can be fully utilized to achieve maximum data processing speed. The introduction of a filtering mechanism can remove meaningless null values from sparse data, thereby further improving the computational efficiency of the parallel computing module itself. As a whole, the parallel processing capability of the computing platform is thus substantially improved.
According to one aspect of the invention, a highly parallel computing platform is provided, comprising: a first-level cache for caching calculation data read from external memory; multiple read controllers, each of which reads, from any position of the first-level cache, the calculation data (or part thereof) needed for a single operation of the parallel computing module; and a parallel computing module for simultaneously performing highly parallel calculation operations on the data read by the multiple read controllers. By providing a cache module that allows simultaneous parallel reads from any position, together with matching read controllers, the parallel processing capability of the computing platform can be further improved. The first-level cache can be implemented as a register file.
Preferably, the computing platform may also include a filtering module connected between the multiple read controllers and the parallel computing module, for filtering out null values in the calculation data read by the read controllers and/or non-null values whose corresponding calculation results would be null. By squeezing out the "bubbles" — data that have no effect on the result — before the calculation data (especially sparse data) enter the parallel computing module, the efficiency of the parallel computing module itself can be improved.
Preferably, the computing platform may also include a second-level cache that reads the calculation data directly from the external memory. The second-level cache stores the read calculation data sequentially and, based at least on a read-controller state parameter indicating how much of the data in the first-level cache has been consumed, performs at least one of the following operations: writing calculation data into the first-level cache; and reading and caching calculation data from the external memory. Through this further multi-level cache layout, bandwidth can be fully utilized to achieve maximum data processing speed. Preferably, the computing platform may also include: a read-controller status register for storing the read-controller state parameter; and a cache update controller that generates cache update instructions based on the read-controller state parameter, so that good cooperation between the cache levels is realized structurally.
Preferably, the computing platform may also include a calculation result cache module connected to the parallel computing module, for caching the result data of the calculation operations performed by the parallel computing module and storing the result data back to the external memory.
According to another aspect of the invention, a highly parallel computing platform for convolutional neural networks is provided, comprising: a first-level cache for caching feature map data and weight data read from external memory; multiple read controllers, each of which reads, from any position of the first-level cache, the feature map data and/or weight data needed for a single convolution calculation; and a parallel computing module for simultaneously performing highly parallel convolution calculations on the feature map data and weight data read by the multiple read controllers. The first-level cache can be implemented as a register file.
Preferably, the first-level cache may include a feature map data pool for caching the feature map data and a weight data pool for caching the weight data, and the multiple read controllers include 2M read controllers, where a first group of M read controllers simultaneously reads M feature map convolution windows and a second group of M read controllers simultaneously reads M weight convolution windows, so as to realize feature map convolution operations with parallelism M in the parallel computing module, M being an integer greater than 2.
Preferably, the computing platform may also include a filtering module connected between the multiple read controllers and the parallel computing module, for filtering out null values in the feature map data and/or weight data read by the read controllers, and/or non-null values in the feature map data and weight data whose corresponding convolution results would be null. Preferably, the filtering module keeps only the values at positions where both a feature map convolution window and its corresponding weight convolution window are non-zero, and filters out all other values.
Specifically, the filtering module includes a level-0 cache and M asynchronous FIFO queues. The level-0 cache includes, or can be divided into, 2M convolution window cache pools. The 2M read controllers store M feature map convolution windows and M weight convolution windows into the 2M convolution window cache pools of the level-0 cache; the values at positions where at least one of a feature map convolution window and its corresponding weight convolution window is zero are filtered out in a one-to-one manner, and the retained values are stored into the M asynchronous FIFO queues. The parallelism M is determined based at least on the bandwidth for reading data from the external memory and the size of the convolution window.
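As a rough illustration of how the parallelism M might be bounded by read bandwidth and window size, the following sketch uses entirely assumed numbers — the bandwidth, element width and reuse factor are hypothetical, not values from the patent:

```python
# Hypothetical sizing sketch: how the parallelism M might be bounded by
# external-memory bandwidth and convolution-window size. All numbers
# (bandwidth, clock, window shape) are illustrative assumptions.

def max_parallelism(bytes_per_cycle: int, window_elems: int,
                    bytes_per_elem: int, reuse_factor: float) -> int:
    """Upper bound on M: windows that can be supplied per compute cycle.

    reuse_factor > 1 models on-chip data reuse (overlapping windows,
    shared weights), which multiplies the effective supply rate.
    """
    window_bytes = window_elems * bytes_per_elem
    return int(bytes_per_cycle * reuse_factor // window_bytes)

# Example: 64 B/cycle from DDR, 3x3 windows of 2-byte values,
# and an assumed 4x effective reuse from overlap plus weight sharing.
M = max_parallelism(bytes_per_cycle=64, window_elems=9,
                    bytes_per_elem=2, reuse_factor=4.0)
```

With these made-up figures, reuse quadruples the sustainable M relative to streaming every window from memory in full, which is the intuition behind tying M to both bandwidth and window size.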
Preferably, the computing platform may also include a second-level cache connected to the first-level cache, which reads the feature map data and the weight data directly from the external memory. The capacity of the second-level cache is larger than that of the first-level cache; it separately stores the read feature map data and weight data and, based at least on a read-controller state parameter indicating how much of the data in the first-level cache has been consumed, performs at least one of the following operations: writing feature map data and weight data into the first-level cache; and reading new feature map data and weight data from the external memory. The data consumption parameter can be the position coordinates, on the feature map, of the feature map data read by the read controllers.
Preferably, the computing platform may also include: a read-controller status register for storing the read-controller state parameter; and a cache update controller that generates cache update instructions based on the read-controller state parameter.
Preferably, the computing platform may also include a calculation result cache module connected to the parallel computing module, for caching the convolution result data of the calculation operations performed by the parallel computing module and storing the result data back to the external memory.
According to a further aspect of the invention, a computing platform implementation method for convolutional neural network inference is provided, comprising: using any of the computing platforms described above, reading feature map data and weight data from the external memory into the first-level cache; the multiple read controllers respectively reading the feature map data and/or weight data needed for multiple single convolution calculations; and the parallel computing module simultaneously performing highly parallel convolution calculations on the multiple groups of feature map data and weight data read by the multiple read controllers.
The feature map data and weight data read by the multiple read controllers exhibit a high degree of data reuse, including at least one of the following: reuse of feature map data that arises when the convolution window is wider than the stride; and reuse of weight data that arises from performing convolution on multiple feature map positions simultaneously. By fully reusing data, the influence of the external-memory bandwidth limit on processing speed can be further reduced.
Preferably, the implementation method may also include: a second-level cache connected to the first-level cache reading the feature map data and the weight data directly from the external memory, and, based on the position coordinates on the feature map of the feature map data read by the read controllers, writing feature map data and weight data into the first-level cache and reading new feature map data and weight data from the external memory, wherein the time interval at which the second-level cache writes data into the first-level cache is less than or equal to the time interval at which the second-level cache reads data from the external memory. Both the bandwidth and the processing speed are thereby fully utilized.
The multiple read controllers respectively reading the feature map data and/or weight data needed for multiple single convolution calculations includes: reading the corresponding feature map convolution windows and weight convolution windows according to the kernel size of the convolution operation to be performed. Correspondingly, the method may also include: filtering out, in the feature map convolution windows and weight convolution windows, the values at positions that would contribute zero to the convolution result, and feeding the filtered feature map convolution windows and weight convolution windows into the parallel computing module.
The time interval at which the multiple read controllers read the feature map data and the weight data each time is identical to the period the parallel computing module needs to perform a single M-group calculation, and is less than the time interval at which the second-level cache writes data into the first-level cache each time. The overall computational efficiency of the platform can thereby be further improved.
According to a further aspect of the invention, a highly parallel computing system is provided, comprising: a computing platform as described in any of the foregoing; a mass memory external to the computing platform; and a processor connected to the computing platform and the memory, for executing any of the implementation methods described above.
In one embodiment, the parallel processing module is at least partly realized by an FPGA, GPU or ASIC.
Through improvements to the hardware architecture itself, the computing platform proposed by the invention is made more suitable for highly parallel computing, especially convolutional neural network computing, and improves the overall processing efficiency of the platform by making full use of bandwidth and raising the computational parallelism. The invention designs a multi-level cache structure according to the deterministic rules of data dependence in convolutional neural networks, and exploits the characteristics of bandwidth, multiple clock frequencies and data reuse, so that the data-supply efficiency for the convolution calculation module is greatly improved. In addition, the invention exploits the sparsity of convolutional neural network weights and feature maps, significantly increasing computational efficiency by filtering out null values. The computing platform design of the invention is simple; without the cost of complicated encoding/decoding and sparse-data bookkeeping, it can obtain maximized computational-efficiency gains from the sparsity of neural network parameters.
Brief description of the drawings
The above and other objects, features and advantages of the disclosure will become more apparent by describing its exemplary embodiments in more detail with reference to the accompanying drawings, in which identical reference labels typically represent the same components.
Fig. 1 shows the series of ordered function layers of a typical CNN.
Fig. 2 shows a schematic diagram of a computing platform according to an embodiment of the invention.
Fig. 3 shows an example of a convolution operation.
Fig. 4 shows a schematic diagram of a computing platform according to an embodiment of the invention.
Fig. 5 shows an example of filtering the data used for convolution calculation in a CNN.
Fig. 6 shows a schematic diagram of a computing platform according to an embodiment of the invention.
Fig. 7 shows a schematic diagram of the concrete operation of a highly parallel computing platform for a CNN according to an embodiment of the invention.
Fig. 8 shows a flowchart of a computing platform implementation method according to an embodiment of the invention.
Detailed description of embodiments
Preferred embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be appreciated that the disclosure may be realized in various forms and should not be limited by the embodiments illustrated here. On the contrary, these embodiments are provided so that the disclosure will be thorough and complete, and so that its scope can be completely conveyed to those skilled in the art.
Highly parallel computing has long been fully utilized in fields such as scientific computing, weather simulation, biological simulation, molecular mechanics models, aircraft manufacturing and military simulation. In recent years, with the continuing deep learning boom, highly parallel computing implementations for neural networks, especially convolutional neural networks (Convolutional Neural Network, hereafter CNN), have attracted attention from many sides.
An existing general-purpose processor (CPU) must be highly general to handle a variety of data types, and its logic decisions introduce a large number of branches, jumps and interrupts. All of this makes the internal structure of the CPU complex and ill-suited to large-scale data operations that are highly uniform in type and mutually independent. Therefore, using highly parallel heterogeneous computing platforms such as FPGAs, GPUs and ASICs to build highly parallel computing platforms, especially neural network accelerators, has become a new research hotspot. Among these, compared with GPU platforms, FPGAs can achieve a higher computational energy-efficiency ratio; at the same time, FPGAs can be iterated quickly, and their hardware reconfigurability better suits the requirements of rapidly evolving algorithms.
Here, the invention proposes a highly parallel computing platform with a new hardware architecture. The computing platform is suitable for concurrent operations on large-scale data that are relatively uniform in type and mutually independent, and is particularly suitable for convolutional neural networks with sparse input feature maps and weight parameters. The parallel computing module of the computing platform, or at least part of it, is preferably realized by an FPGA.
Although the computing platform scheme of the invention will be described below mainly in connection with parallel computation for convolutional neural networks, those skilled in the art will understand that the hardware architecture of the invention is suitable for all kinds of highly parallel computing scenarios, and is especially suitable for application scenarios with high data-reuse rates and high sparsity.
[CNN basic concepts]
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help understand the CNN-based calculation operations analyzed in this application, we first introduce the basic knowledge of CNNs based on existing CNN models.
As shown in Fig. 1, a typical CNN consists of a series of ordered function layers.
The parameters of a CNN model are referred to as "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. Each following layer reads the feature maps generated by the previous layer and outputs new feature maps. Finally, a classifier outputs the probabilities that the input image belongs to each category. CONV layers (convolutional layers) and FC layers (fully connected layers) are the two basic layer types in a CNN, and a CONV layer is usually followed by a pooling layer.

In this application, for one CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.

For CONV layers, n_in and n_out represent the numbers of input and output feature maps respectively.

For FC layers, n_in and n_out represent the lengths of the input and output feature vectors respectively.
The definition of CONV layers (Convolutional layers): a CONV layer takes a series of feature maps as input, and obtains the output feature maps by convolution with convolution kernels.

The non-linear layer usually attached to a CONV layer, i.e. a non-linear activation function, is applied to each element in the output feature maps. The activation function used is generally the ReLU function, so this layer is also commonly referred to as a ReLU layer.

A CONV layer can be expressed by formula 1:

f_i^out = sum_{j=1..n_in} f_j^in (*) g_{i,j} + b_i (1)

where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map, and (*) denotes convolution.

The definition of FC layers (Fully-connected layers): an FC layer applies a linear transformation to the input feature vector:

f_out = W f_in + b (2)

where W is an n_out x n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps, but a single feature vector. Therefore, in formula 2, the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually attached to a CONV layer, for outputting the maximum or average value of each subarea in each feature map. Max pooling can be expressed by formula 3:

f_i^out(x, y) = max_{0 <= u, v < p} f_i^in(x*p + u, y*p + v) (3)

where p is the size of the pooling kernel. This non-linear "down-sampling" not only reduces the feature map size and the computation for the next layer, but also provides a kind of translation invariance.
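The three layer definitions above can be sketched as a minimal reference model of formulas 1-3. The shapes and values are illustrative, and this models only the math, not the patent's hardware:

```python
# A minimal NumPy sketch of formulas 1-3 (CONV + ReLU, FC, max pooling).
import numpy as np

def conv_layer(fin, g, b):
    """Formula 1: f_i^out = sum_j fin[j] (*) g[i, j] + b[i] (valid convolution)."""
    n_out, n_in, k, _ = g.shape
    h, w = fin.shape[1] - k + 1, fin.shape[2] - k + 1
    out = np.zeros((n_out, h, w))
    for i in range(n_out):
        for j in range(n_in):
            for y in range(h):
                for x in range(w):
                    out[i, y, x] += np.sum(fin[j, y:y+k, x:x+k] * g[i, j])
        out[i] += b[i]
    return np.maximum(out, 0.0)  # the ReLU non-linear layer

def fc_layer(fin, W, b):
    """Formula 2: f_out = W @ f_in + b."""
    return W @ fin + b

def max_pool(fmap, p):
    """Formula 3: maximum over each p x p subarea of one feature map."""
    h, w = fmap.shape[0] // p, fmap.shape[1] // p
    return np.array([[fmap[y*p:(y+1)*p, x*p:(x+1)*p].max()
                      for x in range(w)] for y in range(h)])
```

The nested loops in `conv_layer` make the independence of the individual window computations explicit; it is exactly this independence that the parallel computing module described below exploits.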
A CNN can be used for image classification during forward inference. Since CNN computation involves a large number of mutually independent convolution operations, it is particularly suitable for realization on a highly parallel computing platform.
[Basic structure of the computing platform of the invention]
To cope with highly parallel computation, the invention proposes a brand-new hardware architecture. Fig. 2 shows a schematic diagram of a computing platform according to an embodiment of the invention. As shown in Fig. 2, the highly parallel computing platform 200 includes a first-level cache 210, multiple read controllers 220 and a parallel computing module 230.

The first-level cache 210 can cache the calculation data read from external memory, and has the ability for the cached contents at any positions to be read simultaneously. The external memory in the figure is embodied as a mass storage unit, such as DDR. In one embodiment, the first-level cache 210 consists of a register file, in order to realize the above ability of simultaneous reads from any position. When the computing platform is used to perform CNN calculations, the data read may include feature map data and weight data.
Each of Read Controller 220 Read Controller can read parallel meter from the corresponding position of level cache 210
Calculating data needed for calculating the single operation of module or part thereof.Parallel computation module 230 is then for controlling the multiple reading
The calculating data that device is read carry out the calculating operation of high degree of parallelism.
Here, assume that the parallelism of the parallel computing module 230 is M. The "single operation of the parallel computing module" above refers to a single operation within an operation of parallelism M performed by the parallel computing module 230. That is, each round of operation may include M single operations.
The number N of read controllers depends on the specific implementation and is related to M. In one embodiment, each read controller can read from the first-level cache 210 all the calculation data needed for a single operation within an operation of parallelism M performed by the parallel computing module 230; in this case, the number N of read controllers equals the parallelism M. In another embodiment, each read controller can read from the first-level cache 210 part of the calculation data needed for such a single operation. For example, for CNN operations, 2M read controllers (i.e. N = 2M) can be provided, where M read controllers read feature map convolution windows and M read controllers simultaneously read M weight convolution windows, so as to realize feature map convolution operations with parallelism M in the parallel computing module 230. In this case, the parallel computing module 230 can be implemented as M mutually independent multiply-accumulate units. In other embodiments, each read controller may also read both the feature map and the weight convolution windows needed for a convolution operation.
The hardware architecture shown in Fig. 2 is particularly suitable for highly parallel computation with high data reuse, for example the convolution calculations most common in CNNs. To facilitate understanding, Fig. 3 shows an example of a convolution operation. As shown in Fig. 3, a 3x3 convolution kernel performs convolution calculations over a 5x5 input feature map with stride 1. The left of the figure shows the first convolution calculation, the middle shows the second, and so on. After 9 convolution calculations, the convolved feature map on the right of Fig. 3 is obtained.
Since there is no dependence between these 9 convolution calculations, they can all be completed within one round of operation of the parallel computing module 230 (the parallelism M can generally reach a magnitude of thousands). In one embodiment of the invention, in order to complete these 9 convolution calculations simultaneously, 18 read controllers can be used to read the data. A first group of 9 read controllers reads the feature map convolution windows, i.e. the 9 3x3 windows sliding one by one from the top left of the 5x5 input feature map to the bottom right. Since the stride is 1, there are 6 values that can be reused between adjacent convolution windows, as shown in the left and middle diagrams of Fig. 3. A second group of 9 read controllers reads the weight convolution windows, i.e. 9 3x3 windows with identical data (because one and the same convolution kernel is used). Thus, only 34 (5x5 + 3x3) values need to be stored in the first-level cache 210 to realize the parallel computation in the computing module through the read control of the read controllers. It should be understood that the 5x5 input feature map is introduced here for ease of explanation; in practical applications, input feature maps are typically much larger than 5x5.
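The Fig. 3 example can be checked with a short sketch: enumerating the nine 3x3 windows over a 5x5 map confirms the 6-value overlap between horizontally adjacent windows and the 34-value storage requirement. The helper `window_coords` is ours, introduced purely for illustration:

```python
# Sketch of the Fig. 3 example: a 3x3 kernel sliding with stride 1 over
# a 5x5 feature map yields 9 windows; horizontally adjacent windows
# share 6 values, and L1 only ever holds 5*5 + 3*3 = 34 values.

def window_coords(h, w, k, stride=1):
    """Coordinate sets of every k x k window over an h x w map, row-major order."""
    return [{(y + dy, x + dx) for dy in range(k) for dx in range(k)}
            for y in range(0, h - k + 1, stride)
            for x in range(0, w - k + 1, stride)]

wins = window_coords(5, 5, 3)
n_windows = len(wins)             # the 9 independent convolution windows
overlap = len(wins[0] & wins[1])  # values shared by two adjacent windows
stored = 5 * 5 + 3 * 3            # feature map plus the one shared kernel
```

Nine windows of nine values each would naively require 81 reads per operand; the overlap and the shared kernel are why a 34-entry first-level cache suffices.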
In addition, for the highly sparse data frequently encountered in highly parallel computation, the invention can also improve the computational efficiency of the parallel computing module by adding a filtering module. Fig. 4 shows a schematic diagram of a computing platform according to an embodiment of the invention. In addition to a first-level cache 410, multiple read controllers 420 and a parallel computing module 430, the highly parallel computing platform 400 shown in Fig. 4 further includes a filtering module 440 connected between the multiple read controllers 420 and the parallel computing module 430, for filtering out null values in the calculation data read by the read controllers and/or non-null values whose corresponding calculation results would be null. In other words, the filtering module 440 can filter out the values that are meaningless to the subsequent calculation results in the parallel computing module 430.
In the case of CNN computation, the filtering module 440 can filter out the null values in the feature map data and/or weight data read by the read controllers, and/or the non-null values in the feature map data and weight data whose corresponding convolution calculation result would be a null value. Fig. 5 shows an example of filtering the data of a convolution calculation in a CNN. As shown, by the definition of convolution, only values that are non-zero at corresponding positions in both the feature map convolution window and the weight convolution window can influence the output value. Therefore, the filtering module 440 keeps only the values at positions that are non-zero in both the feature map convolution window and its corresponding weight convolution window, and filters out all other values.
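The filtering rule can be illustrated with a small sketch (the values are made up; only the both-non-zero predicate comes from the description above):

```python
import numpy as np

# Flattened 3x3 feature-map window and its weight window (illustrative values).
feat = np.array([0, 2, 0, 5, 1, 0, 3, 0, 4])
wght = np.array([1, 0, 7, 2, 0, 3, 6, 8, 5])

# Keep only pairs where BOTH operands are non-zero; every other product is 0
# and cannot influence the accumulated output value.
mask = (feat != 0) & (wght != 0)
kept_feat, kept_wght = feat[mask], wght[mask]

# The filtered dot product equals the full dot product.
assert np.dot(kept_feat, kept_wght) == np.dot(feat, wght)
print(mask.sum(), "of", feat.size, "pairs survive")  # -> 3 of 9 pairs survive
```

Only the 3 surviving pairs need to be multiplied and accumulated, which is the source of the efficiency gain.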
In one embodiment, the filtering module 440 may include a level-zero cache and multiple asynchronous first-in-first-out queues (FIFOs). In a preferred embodiment, the level-zero cache may include, or be divided into, 2M convolution window cache pools, and the multiple asynchronous FIFOs are implemented as M FIFOs. The first 2M read controllers can store M feature map convolution windows and M weight convolution windows into the 2M convolution window cache pools of the level-zero cache, filter out, in a one-to-one manner, the values for which at least one of the corresponding positions in a feature map convolution window and its corresponding weight convolution window is zero, and feed the resulting densely packed, sparsity-removed data (as shown in Fig. 5) into the M asynchronous FIFOs at high frequency. The sparsity-processed data is then sent from the FIFOs into the parallel computing device. In convolution applications, the parallel computation unit can be implemented as a convolution computing core comprising M mutually independent multiply-accumulate modules. Clearly, the introduction of the filtering module enables the convolution computing core to complete a single calculation pass (comprising M simultaneous convolution calculations) more quickly.
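The filter-then-FIFO-then-MAC flow can be sketched as follows; M, the window contents, and the data structures are illustrative assumptions, not the hardware design:

```python
from collections import deque

# Hypothetical parameters: M independent MAC modules, each fed by one FIFO of
# densely packed (feature, weight) pairs that survived the zero filter.
M = 4
window_pairs = [  # one (feature window, weight window) pair per MAC core
    ([0, 2, 5, 0], [3, 4, 0, 1]),
    ([1, 0, 0, 6], [2, 5, 7, 3]),
    ([0, 0, 8, 9], [1, 2, 3, 0]),
    ([7, 1, 2, 0], [0, 4, 5, 6]),
]

fifos = [deque() for _ in range(M)]
for fifo, (feat, wght) in zip(fifos, window_pairs):
    for f, w in zip(feat, wght):
        if f != 0 and w != 0:        # filter stage: drop zero-producing pairs
            fifo.append((f, w))      # compacted stream enters the async FIFO

# Each MAC core drains its own FIFO independently: M convolutions in parallel.
results = [sum(f * w for f, w in fifo) for fifo in fifos]
print(results)  # -> [8, 20, 24, 14]
```

Because each FIFO holds only surviving pairs, every MAC cycle does useful work.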
As described above, the level-one cache must allow its cached contents at any position to be read simultaneously by multiple read controllers; to realize this, the level-one cache can, for example, be built from register files. However, realizing this capability incurs considerable power consumption and logic overhead, so the capacity of the level-one cache should not be too large. On the premise of satisfying the computation parallelism requirement, the capacity of the level-one cache should be as small as possible; for example, in a neural network hardware accelerator, a level-one cache of tens or hundreds of KB already satisfies a computation parallelism requirement on the order of thousands or tens of thousands. Owing to the limits on level-one cache capacity and access mode, having the level-one cache read data directly from external memory cannot make maximal use of the available bandwidth and incurs excessive overhead. Therefore, in one embodiment of the invention, a larger-capacity level-two cache can be arranged between the external memory and the level-one cache to solve this problem.
Fig. 6 shows a schematic diagram of a computing platform according to an embodiment of the invention. Specifically, in addition to the level-one cache 610, the multiple read controllers 620, the filtering module 640, and the parallel computation module 630, the high-parallelism computing platform 600 shown in Fig. 6 further includes a level-two cache 650 connected to the level-one cache, for reading the required calculation data directly from the external memory. The level-two cache 650 is preferably implemented as a conventional cache that sequentially buffers the calculation data read from outside and supplies data to the level-one cache 610 on demand. The cache capacity of the level-two cache 650 (for example, several MB) is usually larger than that of the level-one cache 610 (for example, several hundred KB), and it serves as the transfer intermediary between the external memory (likewise embodied in the figure as a mass storage unit, such as DDR) and the level-one cache 610. By introducing the level-two cache, memory bandwidth can be maximally utilized, and the demands on level-one cache storage capacity and power consumption are reduced.
The level-two cache 650 can write data to the level-one cache 610 according to the consumption of the data in the level-one cache, and/or read data from the external memory. In one embodiment, the data consumption situation in the level-one cache can be indicated by read controller state parameters.
In a CNN implementation, the level-two cache 650 can separately cache the feature map data and the weight data it reads, and, based at least on the data consumption situation in the level-one cache, perform at least one of the following operations: writing feature map data and weight data to the level-one cache; and reading new feature map data and weight data from the external memory. Preferably, the data consumption situation can be characterized by the position coordinates, on the feature map, of the feature map data read by the read controllers.
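A consumption-driven refill of this kind can be sketched as follows; the class names, the row-band organization, and the threshold are assumptions made for illustration, not the patented mechanism:

```python
class L1Cache:
    """Holds a sliding band of feature-map row indices; readers consume by row."""
    def __init__(self):
        self.rows = []

class L2Cache:
    def __init__(self, total_rows, band=5):
        self.fetched = 0              # next row index to fetch from external memory
        self.total_rows = total_rows
        self.band = band              # how many rows L1 can hold

    def update(self, l1, reader_row):
        # Refill when the read controllers' row coordinate shows the buffered
        # band is nearly consumed (a proxy for "data about to be exhausted").
        while self.fetched < self.total_rows and (
                not l1.rows or l1.rows[-1] - reader_row < 3):
            l1.rows.append(self.fetched)    # write one new row into L1 ...
            self.fetched += 1               # ... and fetch the next from DDR
            l1.rows = l1.rows[-self.band:]  # evict consumed rows, keep the band

l1, l2 = L1Cache(), L2Cache(total_rows=20)
for reader_row in range(18):   # read controllers advance down the feature map
    l2.update(l1, reader_row)
print(l1.rows)  # -> [15, 16, 17, 18, 19]
```

The reader coordinate alone drives the refill, so no explicit handshake between the cache levels is needed.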
The introduction of the above multi-level cache structure improves system efficiency and reduces system overhead. Furthermore, the data update frequencies between these cache levels can be designed according to the calculation speed of the parallel computation module and the memory bandwidth between the computing platform and the external memory, thereby maximally improving system performance.
In the ideal case, the interval between successive reads of the feature map data and weight data by the multiple read controllers equals the period the parallel computation module needs to execute a single M-group calculation. In other words, the update from the level-one cache to the level-zero cache is a high-frequency, unrolled data update, which can be executed in the same time the convolution computing core needs for a single calculation pass (comprising M parallel convolution window calculations), so that the computing capability of the parallel computation module is maximally utilized.
The frequency of data updates in the level-one cache is clearly lower than the update frequency of the level-zero cache. In other words, the interval between successive writes of data from the level-two cache to the level-one cache is greater than the interval between successive writes of data from the level-one cache to the level-zero cache. When, for example, the position coordinates on the feature map of the feature map data read by the read controllers indicate that the data in the level-one cache is about to be exhausted, the level-two cache can write new data into the level-one cache.
By contrast, the rate at which data is updated in the level-two cache can be set flexibly as needed. The interval at which the level-two cache reads data from the external memory can be equal to or greater than the interval at which it writes data to the level-one cache. In one embodiment, whenever the data in the level-one cache is exhausted, the level-two cache replenishes data into the level-one cache and, at the same time, reads new data from the external memory. For example, when the data in a 500 KB level-one cache is exhausted, a 5 MB level-two cache replenishes 500 KB of data into the level-one cache and simultaneously reads a new 500 KB of data from the external memory. In other embodiments, the level-two cache can replenish itself less frequently, for example, fetching a corresponding amount of new data only after the level-one cache data has been updated two, three, or more times; the invention is not limited in this respect.
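The resulting hierarchy of update frequencies can be modeled with a simple counter sketch; the ratios R0 and R1 are assumed for illustration, not derived from the patent:

```python
# Illustrative counter model of the three update frequencies: the level-zero
# cache updates every compute pass, the level-one cache every R0 passes, and
# the level-two cache every R0 * R1 passes (values assumed).
R0, R1 = 5, 10          # passes per L1 refill; L1 refills per L2 refill
updates = {"L0": 0, "L1": 0, "L2": 0}
for p in range(1, 101):                # 100 compute passes
    updates["L0"] += 1                 # L1 -> L0: every pass (high frequency)
    if p % R0 == 0:
        updates["L1"] += 1             # L2 -> L1: when L1 data is consumed
    if p % (R0 * R1) == 0:
        updates["L2"] += 1             # DDR -> L2: least frequent
print(updates)  # -> {'L0': 100, 'L1': 20, 'L2': 2}
```

The counts show why the register-file level-one cache can stay small: it is refreshed far less often than the level-zero cache but far more often than external memory is touched.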
The hardware architecture of the invention and its preferred embodiments have been described above in conjunction with Figs. 2-6. A concrete application example of the invention and its operation will be described below in conjunction with Fig. 7.
Fig. 7 shows a schematic diagram of the concrete operation of a high-parallelism computing platform for CNN according to an embodiment of the invention. As shown in Fig. 7, the computing platform 700 includes the above-mentioned level-two cache (L2 cache) 750, a level-one cache (L1 cache) 710, multiple read controllers 720, a level-zero cache (L0 cache) 741 and an asynchronous FIFO 742 serving as the filtering module, and a parallel computation module 730 implemented as a convolution computing module.
In addition, the computing platform 700 further includes a calculation result cache module 760 optionally connected to the convolution computing module 730. This module caches the convolution result data of the calculation operations performed by the parallel computation module 730 and stores the result data back into the external mass storage unit. The calculation result cache module 760 can, for example, perform the write to external memory after a certain amount of data has accumulated.
In one embodiment, the computing platform 700 optionally further includes a read controller status register 770 for storing the read controller state parameters, and a cache update controller 780 that generates cache update instructions based on the read controller state parameters. The read controller status register 770 and the cache update controller 780 can be used to update the level-one and level-two caches according to the data consumption state in the level-one cache 710.
In the following, a computing platform implementation method for convolutional neural networks according to the invention will be described in conjunction with Fig. 8. Fig. 8 shows a flowchart of a computing platform implementation method according to an embodiment of the invention. The method can be realized using the computing platform described above and its preferred embodiments, for example the computing platform 700 shown in Fig. 7.
In step S810, feature map data and weight data are read from the external memory into the level-one cache.
In one embodiment, this reading can be realized via the level-two cache 750 shown in Fig. 7. The level-two cache 750 connected to the level-one cache 710 reads feature map data and weight data directly from the mass storage unit, and, based on the position coordinates on the feature map of the feature map data read by the read controllers, writes feature map data and weight data to the level-one cache and reads new feature map data and weight data from the external memory, wherein the interval at which the level-two cache writes data to the level-one cache is less than or equal to the interval at which the level-two cache reads data from the external memory.
Then, in step S820, the multiple read controllers respectively read the feature map data and/or weight data required by multiple single convolution calculation operations. Here, the multiple read controllers can, as shown in Fig. 3, read the corresponding feature map convolution windows and weight convolution windows according to the kernel size of the convolution operation to be executed. There is a high degree of data reuse in the feature map data and weight data read by the multiple read controllers, the data reuse comprising at least one of the following: reuse of feature map data caused by the side length of the convolution window being greater than the stride; and reuse of weight data caused by performing convolution on multiple positions of the feature map simultaneously.
In step S830, the parallel computation module simultaneously performs high-parallelism convolution calculation operations on the multiple groups of feature map data and weight data read by the multiple read controllers. Before the multiple groups of feature map data and weight data are fed into the convolution computing core, they may also undergo a filtering operation. Accordingly, the implementation method of the invention may preferably further include a step S825 of filtering out, from the feature map convolution windows and weight convolution windows, the values at corresponding positions that contribute zero to the convolution result, and feeding the filtered feature map convolution windows and weight convolution windows into the parallel computation module.
To maximally utilize the computing capability of the convolution computing core, the interval at which the multiple read controllers read the feature map data and weight data is identical to the period the convolution computing core needs for a single M-group calculation, and smaller than the interval at which the level-two cache writes data to the level-one cache.
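The steps S810-S830 (including the optional filtering step S825) can be walked through end to end with a toy sketch; the arrays and the dictionary standing in for the level-one cache are illustrative, not the hardware:

```python
import numpy as np

# Toy data: a sparse-ish 5x5 feature map and a sparse 3x3 kernel.
fmap = np.random.default_rng(0).integers(0, 3, (5, 5))
kernel = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3]])

# S810: feature map and weights are read into the level-one cache.
l1_cache = {"fmap": fmap, "kernel": kernel}

# S820: each read controller extracts one convolution window (stride 1).
windows = [l1_cache["fmap"][r:r+3, c:c+3] for r in range(3) for c in range(3)]

results = []
for win in windows:
    # S825: filter out positions where either operand is zero.
    mask = (win != 0) & (l1_cache["kernel"] != 0)
    # S830: the MAC core accumulates only the surviving pairs.
    results.append(int(np.sum(win[mask] * l1_cache["kernel"][mask])))

# The filtered results match a direct dense convolution.
dense = [int(np.sum(w * kernel)) for w in windows]
assert results == dense
print(results)
```

The equality check at the end confirms that step S825 changes only the amount of work performed, never the convolution result.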
An application example of the calculation implementation method of the invention will be further described below in conjunction with Fig. 7.
First, the cache update controller 780 of the level-two cache 750 obtains data from the mass storage unit according to the state of the read controllers 720 of the current level-one cache 710, and stores the feature map data and weight data separately. Here, the state of a read controller 720 of the level-one cache 710 can be the position coordinates of a neural network convolution window on the feature map; from these coordinates, the consumption level of the data in the data pools of the level-one cache 710 can be deduced, so that the consumed data can be replaced with new data.
The data is then stored into the data pools of the level-one cache 710 in a data-dependent manner; here, "data dependence" refers to the dependency between the distribution orders of the data streams. The data pools may likewise include a data pool #0 for feature map caching and a data pool #1 for weight caching.
Then, two groups of read controllers respectively update the feature maps and weights from the level-one cache 710 into the level-zero cache 741 (the convolution window cache pools). The feature maps and weights are stored in the level-zero cache 741 in sparse form; a filtering operation then removes every pair of corresponding values in which either the feature map or the weight convolution window position is 0, yielding densely packed, sparsity-removed data that is fed into the asynchronous FIFO 742 at high frequency. The sparsity-processed data is then sent from the FIFO to the convolution core 730 (M mutually independent multiply-accumulate modules in total) for calculation, and the results are stored in the calculation result cache module 760 upon completion. Finally, after enough data has accumulated, the calculation result cache module 760 stores the results into the mass storage module.
It is worth noting that the multi-level cache of the invention is not a conventional processor cache but is specially designed for parallel computation, in particular for the demands of CNN convolution calculation. The data from the mass storage unit (such as DDR) to the level-two cache, and from the level-two cache to the level-one cache, are complete feature maps and weights without sparsity processing. The level-zero cache is updated at high frequency (its update rate can even match the operating speed of the convolution core), while the level-two and level-one caches can be updated at lower frequencies. In addition, the update from the level-one cache to the level-zero cache is a high-frequency unrolling of the data: during the unrolling, weights can be shared by multiple feature map data, and the feature map data themselves are highly reused when the convolution kernel is larger than the stride. The transfer update from the level-two cache to the level-one cache can be at the same frequency. Thus, based on the multi-level cache hardware architecture of the invention, the processing speed of the convolution core can be maximally utilized, and memory bandwidth no longer limits the processing efficiency of the computing platform.
In addition, although feature map and weight cache modules and data pools of the same size are shown in Fig. 7, it should be understood that, in practical applications, the proportions of feature map and weight data in the level-one and level-two caches can be reasonably adjusted according to the specific parallel computation scheme. In general, since the reuse rate of convolution kernels is high, the feature map cache module and data pool are significantly larger than the weight cache module and data pool.
The computing platform of the invention can be implemented as a neural network processor. Compared with a single computing platform (i.e., one with only a host or CPU), the invention is directed to a neural network dedicated processor specially designed to execute neural network calculations. Those skilled in the art will appreciate that the term "neural network dedicated processor" used in this application may also be called a "neural network processor" or "NN processor". Since deep learning is currently the most popular technical branch of neural network technology, the neural network dedicated processor may be implemented as a deep learning dedicated processor or deep learning processor. But those skilled in the art will understand that neural networks have various technical branches, such as deep neural networks (DNN, Deep Neural Network) and CNN, so the neural network dedicated processor may also be implemented as a deep neural network dedicated processor (DNN processor) or a convolutional neural network dedicated processor (CNN processor). That is, the neural network calculation implementation techniques of the "deep learning processor", "deep neural network processor", or "convolutional neural network processor" in heterogeneous computing platforms are also within the scope of the invention.
A DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence; it exploits the high parallelism and low power consumption of FPGAs to perform inference based on convolutional neural networks (Convolutional Neural Network, hereinafter CNN). Herein, a DPU may be considered one specific implementation of the above "deep learning processor", "deep neural network processor", "convolutional neural network processor", or "neural network processor". The description herein is primarily based on a DPU realized with an FPGA and using the CNN structure, but those skilled in the art should understand that the principles of the invention apply equally to neural network processors that perform inference for other neural networks on hardware such as GPUs.
The computing platform of the invention can be realized in a highly concurrent computing system, in which some or all of the high-parallelism calculation functions, such as neural network calculations, can be performed by digital circuits. In one embodiment, the computing system of the invention can be realized as a system on chip (SoC) including a general-purpose processor, mass storage, and digital circuits. In one embodiment, the high-parallelism computing platform required by the system, such as the computing platform for convolutional neural networks, can be realized by the digital circuit portion (for example, an FPGA, GPU, or ASIC) on the SoC. The computing platform, or the parallel computation module therein, can be a hardware device based on an FPGA, GPU, ASIC, or similar implementation. Since a CNN performs parallel computation, realizing the convolutional neural network computing function with logic hardware, especially an FPGA, offers a natural computational advantage and achieves lower power consumption than software execution.
In one embodiment, all parameters of the CNN obtained from prior training and, for example, the feature maps to be classified can be stored in the external memory; when neural network calculation is then performed, the general-purpose processor can cause the computing platform to execute the method described above in conjunction with Fig. 8, thereby realizing high-performance parallel computation.
The high-parallelism computing platform, system, and calculation implementation method according to the invention have been described above in detail with reference to the accompanying drawings.
The invention devises a multi-level cache structure that, according to the deterministic rules of convolutional neural network data dependence, exploits bandwidth, multiple update frequencies, and data reuse characteristics, thereby greatly improving the data supply efficiency to the convolution calculation module. In addition, the invention exploits the sparsity of convolutional neural network weights and feature maps, significantly improving operation efficiency by filtering out null values. The computing platform design of the invention is simple and can obtain maximal computational efficiency gains from the sparsity of neural network parameters without the cost of complicated encoding/decoding and sparse data information storage.
It is emphasized here that, although the invention presents the optimized hardware architecture mainly in conjunction with convolutional neural network calculation, those skilled in the art will understand that the computing platform of the invention is also applicable to other high-parallelism calculations, and is especially suited to parallel computation application scenarios in which the input data types are relatively uniform and/or the sparsity is high.
In addition, although the names "level-two cache", "level-one cache", and "level-zero cache" are used herein, "level two", "level one", and "level zero" are merely intended to illustrate, in embodiments including more than two caches, the relationships between the caches, and do not limit the caches themselves.
In addition, the method according to the invention may also be implemented as a computer program or computer program product, the computer program or computer program product comprising computer program code instructions for executing the steps defined in the above method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or computing device, server, etc.), the processor is caused to execute the steps of the above method according to the invention.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show possible implementation architectures, functions, and operations of systems and methods according to multiple embodiments of the invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a part of code, which comprises one or more executable instructions for realizing the specified logic functions. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two consecutive boxes may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be realized by dedicated hardware-based systems that execute the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the invention have been described above; the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (25)
1. A high-parallelism computing platform, comprising:
a level-one cache, for caching calculation data read from an external memory;
multiple read controllers, each read controller being configured to read, from any position of the level-one cache, the calculation data, or a part thereof, required for a single operation in a parallel computation module; and
a parallel computation module, for performing high-parallelism calculation operations on the calculation data read by the multiple read controllers.
2. The computing platform of claim 1, further comprising:
a filtering module connected between the multiple read controllers and the parallel computation module, for filtering out null-value data in the calculation data read by the read controllers and/or non-null data in the calculation data whose corresponding calculation result is a null value.
3. The computing platform of claim 1, further comprising:
a level-two cache that reads calculation data directly from the external memory, the level-two cache sequentially buffering the calculation data it reads, and, based at least on the data consumption situation in the level-one cache, performing at least one of the following operations:
writing calculation data to the level-one cache; and
reading calculation data from the external memory and caching it.
4. The computing platform of claim 1, wherein the data consumption situation in the level-one cache is indicated by read controller state parameters, and the computing platform further comprises:
a read controller status register for storing the read controller state parameters; and
a cache update controller that generates cache update instructions based on the read controller state parameters.
5. The computing platform of claim 1, further comprising:
a calculation result cache module connected to the parallel computation module, for caching the calculation result data of the calculation operations performed by the parallel computation module, and storing the calculation result data back to the external memory.
6. The computing platform of claim 1, wherein the level-one cache is implemented as a register file.
7. A high-parallelism computing platform for convolutional neural networks, comprising:
a level-one cache, for caching feature map data and weight data read from an external memory;
multiple read controllers, each read controller being configured to read, from any position of the level-one cache, the feature map data and/or weight data required for a single convolution calculation operation; and
a parallel computation module, for performing high-parallelism convolution calculation operations on the feature map data and weight data read by the multiple read controllers.
8. The computing platform of claim 7, wherein the level-one cache comprises a feature map data pool for caching the feature map data and a weight data pool for caching the weight data, and the multiple read controllers comprise 2M read controllers, wherein a first group of M read controllers simultaneously reads M feature map convolution windows and a second group of M read controllers simultaneously reads M weight convolution windows, so as to realize a parallelism of M feature map convolution operations in the parallel computation module, where M is an integer greater than 2.
9. The computing platform of claim 8, further comprising:
a filtering module connected between the multiple read controllers and the parallel computation module, for filtering out the null values in the feature map data and/or weight data read by the read controllers, and/or the non-null values in the feature map data and weight data whose corresponding convolution calculation result is a null value.
10. The computing platform of claim 9, wherein the filtering module keeps the values at positions that are non-zero in both a feature map convolution window and its corresponding weight convolution window, and filters out all other values.
11. The computing platform of claim 10, wherein the filtering module comprises a level-zero cache and M asynchronous first-in-first-out queues, the level-zero cache comprising, or being divisible into, 2M convolution window cache pools, and the 2M read controllers store M feature map convolution windows and M weight convolution windows into the 2M convolution window cache pools of the level-zero cache, filter out, in a one-to-one manner, the values for which at least one of the corresponding positions in a feature map convolution window and its corresponding weight convolution window is zero, and store the retained values into the M asynchronous first-in-first-out queues respectively.
12. The computing platform of claim 8, further comprising:
a level-two cache that reads the feature map data and the weight data directly from the external memory and is connected to the level-one cache, the cache capacity of the level-two cache being greater than that of the level-one cache, for separately caching the feature map data and the weight data read, and, based at least on the data consumption situation in the level-one cache, performing at least one of the following operations:
writing feature map data and weight data to the level-one cache; and
reading new feature map data and weight data from the external memory.
13. The computing platform of claim 12, wherein the data consumption situation is the position coordinates, on the feature map, of the feature map data read by the read controllers.
14. The computing platform of claim 12, further comprising:
a read controller status register for storing the read controller state parameters; and
a cache update controller that generates cache update instructions based on the read controller state parameters.
15. The computing platform of claim 8, wherein M is determined based at least on the bandwidth for reading data from the external memory and the size of the convolution windows.
16. The computing platform of claim 7, further comprising:
a calculation result cache module connected to the parallel computation module, for caching the convolution calculation result data of the calculation operations performed by the parallel computation module, and storing the calculation result data back to the external memory.
17. The computing platform of claim 7, wherein the first-level cache is implemented as a register file.
18. A computing implementation method for convolutional neural networks, comprising:
reading, using the computing platform of any one of claims 1-17, feature map data and weight data from the external memory into the first-level cache;
reading, by each of the multiple read controllers, the feature map data and/or weight data required for a single convolution calculation operation; and
performing, by the parallel computation module, highly parallel convolution calculation operations simultaneously on the multiple groups of feature map data and weight data read by the multiple read controllers.
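A minimal functional sketch of this method (plain Python, illustrative only; the window values are assumed) reads one window pair per "read controller" and evaluates all groups in a single pass:

```python
# Each read controller supplies one (feature window, weight window) pair;
# the parallel computation module evaluates all pairs in one pass.
def conv_window(fmap_win, weight_win):
    """Dot product of one feature-map window with one weight window."""
    return sum(f * w for f, w in zip(fmap_win, weight_win))

def parallel_pass(window_pairs):
    """One compute pass over the M groups fetched by the read controllers."""
    return [conv_window(f, w) for f, w in window_pairs]

pairs = [([1, 2, 3], [1, 0, 1]),   # lane 0
         ([0, 1, 0], [2, 2, 2])]   # lane 1
results = parallel_pass(pairs)     # both lanes evaluated in the same pass
```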
19. The method of claim 18, wherein the feature map data and weight data read by the multiple read controllers exhibit a high degree of data reuse, the data reuse comprising at least one of the following:
reuse of feature map data caused by the side length of the convolution window being greater than the stride; and
reuse of weight data caused by performing convolution on multiple positions of the feature map simultaneously.
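The two reuse patterns can be quantified with a small sketch (the kernel size, stride, and output size below are assumed examples, not values from the patent):

```python
import math

# Feature-map reuse: with a KxK window sliding at stride S < K, an
# interior pixel is covered by roughly ceil(K/S) windows along each axis.
def feature_reuse(k, stride):
    per_axis = math.ceil(k / stride)
    return per_axis ** 2

# Weight reuse: the same weight window is applied at every output position.
def weight_reuse(out_h, out_w):
    return out_h * out_w

f = feature_reuse(3, 1)    # 3x3 kernel, stride 1: 9 windows touch a pixel
w = weight_reuse(26, 26)   # 26x26 output: each weight used 676 times
```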
20. The method of claim 18, further comprising:
reading, by the second-level cache connected to the first-level cache, the feature map data and the weight data directly from the external memory, and, based on the position coordinates on the feature map of the feature map data read by the read controllers, writing feature map data and weight data into the first-level cache and reading new feature map data and weight data from the external memory,
wherein the interval at which the second-level cache writes data into the first-level cache is less than or equal to the interval at which the second-level cache reads data from the external memory.
21. The method of claim 20, wherein reading, by each of the multiple read controllers, the feature map data and/or weight data required for a single convolution calculation operation comprises:
reading the corresponding feature map convolution window and weight convolution window according to the convolution kernel size of the convolution operation to be performed.
22. The method of claim 20, further comprising:
filtering out of the feature map convolution window and the weight convolution window those values whose positions contribute zero to the convolution result, and feeding the filtered feature map convolution window and weight convolution window into the parallel computation module.
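This zero-value filter exploits the fact that a position where either operand is zero contributes nothing to the dot product, so the pair can be dropped before it reaches the multipliers. A minimal sketch (the window contents are assumed):

```python
# Drop every position where the feature value or the weight is zero;
# the remaining pairs produce the same dot product as the full windows.
def filter_zero_pairs(fmap_win, weight_win):
    return [(f, w) for f, w in zip(fmap_win, weight_win)
            if f != 0 and w != 0]

fmap = [5, 0, 3, 7]
weights = [1, 2, 0, 4]
kept = filter_zero_pairs(fmap, weights)  # only positions 0 and 3 survive
dot = sum(f * w for f, w in kept)        # equals the unfiltered dot product
```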
23. The method of claim 20, wherein the interval at which the multiple read controllers read the feature map data and the weight data each time equals the period the parallel computation module needs to perform one round of M-group calculations, and is less than the interval at which the second-level cache writes data into the first-level cache each time.
24. A highly parallel computing system, comprising:
the computing platform of any one of claims 1-17;
a mass memory external to the computing platform; and
a processor, connected to the computing platform and the memory, for executing the implementation method of any one of claims 18-23.
25. The system of claim 24, wherein the parallel computation module is at least partly implemented by an FPGA, a GPU, or an ASIC.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810277338.XA CN110321997B (en) | 2018-03-31 | 2018-03-31 | High-parallelism computing platform, system and computing implementation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110321997A true CN110321997A (en) | 2019-10-11 |
CN110321997B CN110321997B (en) | 2021-10-19 |
Family
ID=68111837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810277338.XA Active CN110321997B (en) | 2018-03-31 | 2018-03-31 | High-parallelism computing platform, system and computing implementation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321997B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101034375A (en) * | 2007-02-12 | 2007-09-12 | 忆正存储技术(深圳)有限公司 | Computer memory system |
CN101499875A (en) * | 2008-02-02 | 2009-08-05 | 三星电子株式会社 | Variant processing rate supporting apparatus for LTE rate de-matching and de-interleaving |
WO2010064728A1 (en) * | 2008-12-04 | 2010-06-10 | Canon Kabushiki Kaisha | Convolution operation circuit and object recognition apparatus |
CN102667737A (en) * | 2009-12-21 | 2012-09-12 | 索尼公司 | Cache memory and cache memory control device |
US8442927B2 (en) * | 2009-07-30 | 2013-05-14 | Nec Laboratories America, Inc. | Dynamically configurable, multi-ported co-processor for convolutional neural networks |
CN103729449A (en) * | 2013-12-31 | 2014-04-16 | 上海富瀚微电子有限公司 | Reference data access management method and device |
CN105119768A (en) * | 2015-06-26 | 2015-12-02 | 华为技术有限公司 | Field-programmable gate array FPGA and data storage method |
CN105611234A (en) * | 2015-12-21 | 2016-05-25 | 中国科学院长春光学精密机械与物理研究所 | Embedded system used analog display method for digital images of arbitrary frame rate |
CN106649143A (en) * | 2015-10-29 | 2017-05-10 | 阿里巴巴集团控股有限公司 | Method and device for cache access, and electronic equipment |
CN106875011A (en) * | 2017-01-12 | 2017-06-20 | 南京大学 | The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator |
CN107392309A (en) * | 2017-09-11 | 2017-11-24 | 东南大学—无锡集成电路技术研究所 | A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA |
CN107451659A (en) * | 2017-07-27 | 2017-12-08 | 清华大学 | Neutral net accelerator and its implementation for bit wide subregion |
CN107633297A (en) * | 2017-03-10 | 2018-01-26 | 南京大学 | A kind of convolutional neural networks hardware accelerator based on parallel quick FIR filter algorithm |
US20180089562A1 (en) * | 2016-09-28 | 2018-03-29 | SK Hynix Inc. | Operation apparatus and method for convolutional neural network |
Non-Patent Citations (2)
Title |
---|
KAIYUAN GUO et al.: "Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware", ISVLSI *
YANG Ning: "Multi-GPU Parallel Framework for Deep Convolutional Neural Networks", Computer and Modernization *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112749782A (en) * | 2019-10-31 | 2021-05-04 | 上海商汤智能科技有限公司 | Data processing method and related product |
CN113111995A (en) * | 2020-01-09 | 2021-07-13 | 北京君正集成电路股份有限公司 | Method for shortening model reasoning and model post-processing operation time |
CN111275194A (en) * | 2020-02-16 | 2020-06-12 | 苏州浪潮智能科技有限公司 | NLP reasoning acceleration system based on FPGA |
WO2021249192A1 (en) * | 2020-06-12 | 2021-12-16 | 中兴通讯股份有限公司 | Image processing method and apparatus, machine vision device, electronic device and computer-readable storage medium |
CN111984189A (en) * | 2020-07-22 | 2020-11-24 | 深圳云天励飞技术有限公司 | Neural network computing device, data reading method, data storage method and related equipment |
CN111984189B (en) * | 2020-07-22 | 2022-05-17 | 深圳云天励飞技术股份有限公司 | Neural network computing device, data reading method, data storage method and related equipment |
CN112990157A (en) * | 2021-05-13 | 2021-06-18 | 南京广捷智能科技有限公司 | Image target identification acceleration system based on FPGA |
CN112990157B (en) * | 2021-05-13 | 2021-08-20 | 南京广捷智能科技有限公司 | Image target identification acceleration system based on FPGA |
CN113642724A (en) * | 2021-08-11 | 2021-11-12 | 西安微电子技术研究所 | CNN accelerator with high bandwidth storage |
Also Published As
Publication number | Publication date |
---|---|
CN110321997B (en) | 2021-10-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321997A | High-parallelism computing platform, system and computing implementation method |
CN106951395 | Parallel convolution operation method and device for compressed convolutional neural networks |
WO2021004366A1 | Neural network accelerator based on structured pruning and low-bit quantization, and method |
CN107169563 | Processing system and method applied to binary-weight convolutional networks |
CN105681628 | Convolutional network arithmetic unit, reconfigurable convolutional neural network processor, and method for image denoising |
CN108090565 | Parallelized training acceleration method for convolutional neural networks |
CN107239824 | Apparatus and method for implementing a sparse convolutional neural network accelerator |
CN107578098 | Neural network processor based on systolic arrays |
CN110175671 | Neural network construction method, image processing method and device |
CN109903221 | Image super-resolution method and device |
CN106228240 | Deep convolutional neural network implementation method based on FPGA |
CN107506828 | Computing device and method |
CN106951926 | Deep learning system method and device with hybrid architecture |
CN110050267 | System and method for data management |
CN107273936 | GAN image processing method and system |
CN108256628 | Convolutional neural network hardware accelerator based on multicast network-on-chip and its working method |
CN108665063 | Bidirectional parallel processing convolution acceleration system for BNN hardware accelerators |
CN108334945 | Acceleration and compression method and device for deep neural networks |
CN107256424 | Ternary-weight convolutional network processing system and method |
CN109086802 | Image classification method based on biquaternion convolutional neural networks |
CN110163354 | Computing device and method |
CN107085562 | Neural network processor based on an efficient data-reuse dataflow, and design method |
CN110580519 | Convolution operation device and method |
CN111985597 | Model compression method and device |
CN109840585 | Operation method and system for sparse two-dimensional convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
TA01 | Transfer of patent application right | |
Effective date of registration: 2019-09-25. Address after: 2100 San Jose Rojack Avenue, California, USA. Applicant after: XILINX INC. Address before: 17th floor, Building 4, No. 1 Wangzhuang Road, Haidian District, Beijing 100083. Applicant before: Beijing Shenjian Intelligent Technology Co., Ltd.
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |