CN110321997A - High-parallelism computing platform, system, and computation implementation method - Google Patents

High-parallelism computing platform, system, and computation implementation method

Info

Publication number
CN110321997A
CN110321997A · Application CN201810277338.XA · Granted publication CN110321997B
Authority
CN
China
Prior art keywords
data
read
computing platform
cache
level cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810277338.XA
Other languages
Chinese (zh)
Other versions
CN110321997B (en)
Inventor
王俊斌
隋凌志
方绍峡
于谦
单羿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xilinx Inc
Original Assignee
Beijing Shenjian Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenjian Intelligent Technology Co Ltd
Priority to CN201810277338.XA
Publication of CN110321997A
Application granted
Publication of CN110321997B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a high-parallelism computing platform, system, and related implementation method. The computing platform comprises: an L1 cache for buffering computation data read from an external memory; multiple read controllers, each of which reads, from any position of the L1 cache, the computation data (or a part thereof) required by a single operation of a parallel computing module; and a parallel computing module that performs highly parallel computation on the data read by the multiple read controllers. Data can thus be fully reused to improve processing efficiency. The invention further improves overall data-processing efficiency by introducing a multi-level cache, and further increases processing speed through data filtering.

Description

High-parallelism computing platform, system, and computation implementation method
Technical Field
The present invention relates to the field of hardware architecture, and more particularly to a high-parallelism computing platform, system, and computation implementation method.
Background Art
Neural networks have in recent years become a research focus in the field of image recognition. Trained neural network models can be used in numerous areas such as image classification, object recognition, and saliency detection. Recent neural network models show a trend of growing computation scale and rising complexity, and traditional CPU platforms can no longer meet their practical demands. Therefore, neural network accelerator designs based on highly parallel heterogeneous computing platforms such as FPGAs, GPUs, and ASICs have become a new research focus. Among these, compared with GPU platforms, FPGAs can achieve higher computational energy efficiency; at the same time, FPGAs allow fast iteration, and their hardware reconfigurability is better suited to the rapid evolution of algorithms.
When a high-parallelism computing platform comprising an FPGA, GPU, or the like performs computation, the execution time of the parallel computation is very short compared with the time cost of accessing data in external memory; as a result, memory bandwidth becomes the bottleneck that constrains processing speed. In addition, how to make full use of the available bandwidth is a major issue that must be considered when raising the parallelism of a high-parallelism computing platform.
Therefore, there remains a need for a scheme that can optimize high-parallelism computation.
Summary of the invention
In order to solve the problems, such as it is above-mentioned at least one, the invention proposes a kind of new computing platform frameworks.By the way that energy is arranged The cache module and Read Controller matched with its read parallel while enough realizing any position, can further promote meter Calculate the parallel processing capability of platform.By the layout of further multi-level buffer, it can make full use of bandwidth and realize maximum number According to processing speed.Meaningless null value in sparse data can be removed by being introduced into for strobe utility, to further be promoted simultaneously The computational efficiency of row computing module itself.The parallel processing capability of computing platform is substantially improved on the whole as a result,.
According to one aspect of the invention, a high-parallelism computing platform is provided, comprising: an L1 cache, for buffering computation data read from an external memory; multiple read controllers, each of which reads, from any position of the L1 cache, the computation data (or a part thereof) required by a single operation of a parallel computing module; and a parallel computing module, for simultaneously performing highly parallel computation on the data read by the multiple read controllers. By providing a cache whose contents can be read in parallel from arbitrary positions, together with matching read controllers, the parallel processing capability of the platform is further improved. The L1 cache may be implemented as a register file.
Preferably, the computing platform may further comprise a filtering module connected between the multiple read controllers and the parallel computing module, for filtering out zero values in the computation data read by the read controllers and/or non-zero values in the computation data whose corresponding computation results are zero. By squeezing the "bubbles" that have no influence on the result out of the computation data (especially sparse data) before it enters the parallel computing module, the efficiency of the parallel computing module itself is improved.
Preferably, the computing platform may further comprise an L2 cache that reads the computation data directly from the external memory and sequentially stores the computation data it reads, and that, based at least on a read-controller state parameter indicating how far the data in the L1 cache has been consumed, performs at least one of the following operations: writing computation data into the L1 cache; and reading and caching computation data from the external memory. Through the additional arrangement of a multi-level cache, the bandwidth can be fully utilized to achieve maximum data-processing speed. Preferably, the computing platform may further comprise a read-controller status register for storing the read-controller state parameter, and a cache-update controller that generates cache-update instructions based on the read-controller state parameter, so that the cooperation between cache levels is realized structurally.
Preferably, the computing platform may further comprise a result cache module connected to the parallel computing module, for buffering the result data of the computation performed by the parallel computing module and writing the result data back to the external memory.
According to another aspect of the invention, a high-parallelism computing platform for convolutional neural networks is provided, comprising: an L1 cache, for buffering feature map data and weight data read from an external memory; multiple read controllers, each of which reads, from any position of the L1 cache, the feature map data and/or weight data required by a single convolution operation; and a parallel computing module, for simultaneously performing highly parallel convolution on the feature map data and weight data read by the multiple read controllers. The L1 cache may be implemented as a register file.
Preferably, the L1 cache may comprise a feature map data pool for buffering the feature map data and a weight data pool for buffering the weight data, and the multiple read controllers comprise 2M read controllers, a first group of M read controllers simultaneously reading M feature map convolution windows and a second group of M read controllers simultaneously reading M weight convolution windows, so as to realize feature map convolution with parallelism M in the parallel computing module, where M is an integer greater than 2.
Preferably, the computing platform may further comprise a filtering module connected between the multiple read controllers and the parallel computing module, for filtering out zero values in the feature map data and/or weight data read by the read controllers, and/or non-zero values in the feature map data and weight data whose corresponding convolution results are zero. Preferably, the filtering module keeps only those values at positions where both a feature map convolution window and its corresponding weight convolution window are non-zero, and filters out all other values.
Specifically, the filtering module comprises an L0 cache and M asynchronous FIFO queues. The L0 cache comprises, or can be divided into, 2M convolution-window cache pools. The 2M read controllers store M feature map convolution windows and M weight convolution windows into the 2M convolution-window cache pools of the L0 cache; in a pairwise manner, the values at positions where at least one of a feature map convolution window and its corresponding weight convolution window is zero are filtered out, and the retained values are stored into the M asynchronous FIFO queues. The parallelism M is determined based at least on the bandwidth for reading data from the external memory and the size of the convolution windows.
Preferably, the computing platform may further comprise an L2 cache that reads the feature map data and the weight data directly from the external memory and is connected to the L1 cache. The capacity of the L2 cache is larger than that of the L1 cache; it stores the feature map data and the weight data separately and, based at least on a read-controller state parameter indicating how far the data in the L1 cache has been consumed, performs at least one of the following operations: writing feature map data and weight data into the L1 cache; and reading new feature map data and weight data from the external memory. The data-consumption parameter may be the position coordinates, on the feature map, of the feature map data being read by the read controllers.
Preferably, the computing platform may further comprise a read-controller status register for storing the read-controller state parameter, and a cache-update controller that generates cache-update instructions based on the read-controller state parameter.
Preferably, the computing platform may further comprise a result cache module connected to the parallel computing module, for buffering the convolution result data of the computation performed by the parallel computing module and writing the result data back to the external memory.
According to a further aspect of the invention, a computing-platform implementation method for convolutional neural network inference is provided, comprising: using the computing platform of any of the above, reading feature map data and weight data from the external memory into the L1 cache; the multiple read controllers respectively reading the feature map data and/or weight data required by multiple single convolution operations; and the parallel computing module simultaneously performing highly parallel convolution on the multiple groups of feature map data and weight data read by the multiple read controllers.
The feature map data and weight data read by the multiple read controllers exhibit a high degree of data reuse, which includes at least one of the following: reuse of feature map data caused by the side length of the convolution window being larger than the stride; and reuse of weight data caused by convolving multiple positions of a feature map simultaneously. By fully reusing data, the impact of the external-memory bandwidth limit on processing speed can be further reduced.
Preferably, the implementation method may further comprise: an L2 cache connected to the L1 cache reading the feature map data and the weight data directly from the external memory, and, based on the position coordinates, on the feature map, of the feature map data being read by the read controllers, writing feature map data and weight data into the L1 cache and reading new feature map data and weight data from the external memory, wherein the interval at which the L2 cache writes data into the L1 cache is less than or equal to the interval at which the L2 cache reads data from the external memory. Both bandwidth and processing speed are thereby fully utilized.
The multiple read controllers respectively reading the feature map data and/or weight data required by multiple single convolution operations comprises: reading the corresponding feature map convolution windows and weight convolution windows according to the kernel size of the convolution to be performed. Correspondingly, the method may further comprise: filtering out, from the feature map convolution windows and weight convolution windows, the values at positions that contribute only zeros to the convolution result, and feeding the filtered feature map convolution windows and weight convolution windows into the parallel computing module.
The interval at which the multiple read controllers read the feature map data and the weight data equals the period the parallel computing module needs to perform a single computation of M groups, and is smaller than the interval at which the L2 cache writes data into the L1 cache. The overall computational efficiency of the platform can thereby be further improved.
According to a further aspect of the invention, a highly parallel computing system is provided, comprising: the computing platform of any of the above; a mass memory external to the computing platform; and a processor connected to the computing platform and the memory, for executing the implementation method of any of the above.
In one embodiment, the parallel processing module is at least partly implemented by an FPGA, GPU, or ASIC.
Through improvements to the hardware architecture itself, the computing platform proposed by the present invention is better suited to highly parallel computation, especially convolutional neural network computation, and improves the overall processing efficiency of the platform by making full use of bandwidth and raising the computation parallelism. Based on the deterministic data-dependence pattern of convolutional neural networks, the present invention designs a multi-level cache structure that exploits bandwidth, multiple clock frequencies, and data reuse, so that the computational efficiency of the convolution module is greatly improved. In addition, the present invention exploits the sparsity of convolutional neural network weights and feature maps, significantly improving computational efficiency by filtering out zero values. The design of the computing platform is simple: without the cost of complex encoding/decoding or of storing sparse-data metadata, it extracts the maximum computational-efficiency gain from the sparsity of neural network parameters.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure in conjunction with the accompanying drawings, in which identical reference labels generally denote identical parts.
Fig. 1 shows the ordered function layers of a typical CNN.
Fig. 2 is a schematic diagram of a computing platform according to an embodiment of the invention.
Fig. 3 shows an example of a convolution operation.
Fig. 4 is a schematic diagram of a computing platform according to an embodiment of the invention.
Fig. 5 shows an example of filtering data for convolution computation in a CNN.
Fig. 6 is a schematic diagram of a computing platform according to an embodiment of the invention.
Fig. 7 is a schematic diagram of the concrete operation of a high-parallelism computing platform for CNNs according to an embodiment of the invention.
Fig. 8 is a flowchart of a computing-platform implementation method according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth here. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
Highly parallel computation has long been put to full use in fields such as scientific computing, weather simulation, biological simulation, molecular mechanics models, aircraft manufacturing, and military simulation. In recent years, with the continuing rise of deep learning, highly parallel computing implementations for neural networks, and in particular convolutional neural networks (CNNs), have attracted wide attention.
Existing general-purpose processors (CPUs) require high generality to handle a variety of data types, and their logic decisions introduce large numbers of branches and interrupts. All of this makes the internal structure of the CPU complex and ill-suited to large-scale data operations of a highly uniform type with no mutual dependences. Therefore, research on high-parallelism computing platforms, and especially neural network accelerator designs, based on highly parallel heterogeneous computing platforms such as FPGAs, GPUs, and ASICs has become a new focus. Among these, compared with GPU platforms, FPGAs can achieve higher computational energy efficiency; at the same time, FPGAs allow fast iteration, and their hardware reconfigurability is better suited to the rapid evolution of algorithms.
To this end, the present invention proposes a highly parallel computing platform with a new hardware architecture. The platform is suited to parallel operations on large-scale data of relatively uniform type with no mutual dependences, and is particularly suited to convolutional neural networks with sparse input feature maps and weight parameters. The parallel computing module of the platform, or at least part of it, is preferably implemented by an FPGA.
Although the computing-platform scheme of the invention is described below mainly in connection with parallel computation for convolutional neural networks, those skilled in the art will understand that the hardware architecture of the invention is applicable to all kinds of high-parallelism computing scenarios, and is especially suited to application scenarios with high data-reuse rates and high sparsity.
[CNN Basics]
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help in understanding the CNN-based computation analyzed in this application, we first introduce the basics of CNNs based on existing CNN models.
As shown in Fig. 1, a typical CNN consists of a series of ordered function layers.
The parameters of a CNN model are called "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. A following layer reads the feature maps generated by the preceding layer and outputs new feature maps. Finally, a classifier outputs the probabilities of the input image belonging to each category. CONV layers (convolutional layers) and FC layers (fully connected layers) are the two basic layer types in a CNN. A CONV layer is usually followed by a pooling layer.
In this application, for a given CNN layer, x_j^in denotes the j-th input feature map, y_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For CONV layers, n_in and n_out denote the numbers of input and output feature maps, respectively.
For FC layers, n_in and n_out denote the lengths of the input and output feature vectors, respectively.
Definition of CONV layers (convolutional layers): a CONV layer takes a series of feature maps as input and obtains output feature maps by convolution with convolution kernels.
A nonlinear layer, i.e., a nonlinear activation function, usually attached to a CONV layer, is applied to every element of the output feature maps. The activation function used is generally the ReLU function, so this layer is also commonly called a ReLU layer.
A CONV layer can be expressed by formula 1:

    y_i^out = Σ_{j=1..n_in} g_{i,j} ⊗ x_j^in + b_i    (1)

where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map, and ⊗ denotes convolution. Definition of FC layers (fully connected layers): an FC layer applies a linear transformation to the input feature vector:

    f^out = W f^in + b    (2)

W is an n_out × n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Hence, in formula 2, the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layers: usually attached to CONV layers, they output the maximum or average value of each subarea of each feature map. Max pooling can be expressed by formula 3:

    y_{i,(j,k)}^out = max_{0 ≤ m,n < p} x_{i,(j·p+m, k·p+n)}^in    (3)

where p is the size of the pooling kernel. This nonlinear "down-sampling" not only reduces the size of the feature maps and the computation for the next layer, but also provides a form of translation invariance.
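As a concrete check of formula 1 above, the short Python sketch below (an illustration with assumed shapes and random values, not part of the patent) computes a CONV layer's output maps as the sum over input maps of sliding-window correlations — the "convolution" of CNN practice — plus a bias.

```python
import numpy as np

def conv2d(x, g):
    """Plain 2-D 'valid' sliding-window correlation of map x with kernel g."""
    k = g.shape[0]
    out = x.shape[0] - k + 1
    return np.array([[np.sum(x[r:r + k, c:c + k] * g)
                      for c in range(out)] for r in range(out)])

def conv_layer(x_in, g, b):
    """Formula 1: y_i = sum_j g[i][j] (*) x_j + b_i for each output map i."""
    n_out, n_in = len(g), len(x_in)
    return [sum(conv2d(x_in[j], g[i][j]) for j in range(n_in)) + b[i]
            for i in range(n_out)]

rng = np.random.default_rng(1)
x_in = [rng.standard_normal((5, 5)) for _ in range(3)]        # n_in = 3
g = [[rng.standard_normal((3, 3)) for _ in range(3)]
     for _ in range(2)]                                       # n_out = 2
b = [0.1, -0.2]
y_out = conv_layer(x_in, g, b)
print(y_out[0].shape)                                         # (3, 3)
```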
A CNN can be used for image classification during forward inference. CNN computation involves large numbers of mutually independent convolution operations, which makes it particularly suitable for implementation on a high-parallelism computing platform.
[Basic Structure of the Computing Platform of the Invention]
To cope with highly parallel computation, the invention proposes a completely new hardware architecture. Fig. 2 is a schematic diagram of a computing platform according to an embodiment of the invention. As shown in Fig. 2, the high-parallelism computing platform 200 comprises an L1 cache 210, multiple read controllers 220, and a parallel computing module 230.
The L1 cache 210 buffers the computation data read from the external memory, and its cached contents at arbitrary positions can be read simultaneously. The external memory in the figure is implemented as a mass storage unit, e.g., DDR. In one embodiment, the L1 cache 210 is composed of a register file, realizing the above capability of simultaneous reads at arbitrary positions. When the computing platform is used to perform CNN computation, the data read may include feature map data and weight data.
Each of the read controllers 220 can read, in parallel, from the corresponding position of the L1 cache 210, the computation data (or a part thereof) required by a single operation of the parallel computing module. The parallel computing module 230 then performs highly parallel computation on the data read by the multiple read controllers.
Here, assume the parallelism of the parallel computing module 230 is M. The "single operation of the parallel computing module" above refers to one individual operation within a batch of operations that the module 230 executes with parallelism M. That is, each batch may comprise M single operations.
The number N of read controllers depends on the specific implementation and is related to M. In one embodiment, each read controller reads from the L1 cache 210 all the computation data required by one single operation of the batch of parallelism M executed by the parallel computing module 230; in this case, the number N of read controllers equals the parallelism M. In another embodiment, each read controller reads from the L1 cache 210 part of the computation data required by one single operation of the batch of parallelism M. For example, for CNN operation, 2M read controllers (i.e., N = 2M) can be provided, in which M read controllers read feature map convolution windows and M read controllers simultaneously read M weight convolution windows, so as to realize feature map convolution with parallelism M in the parallel computing module 230. In this case, the parallel computing module 230 can be implemented as M mutually independent multiply-accumulate units. In other embodiments, each read controller may also read both the feature map convolution window and the weight convolution window to be convolved.
The hardware architecture shown in Fig. 2 is particularly suited to highly parallel computation with high data reuse, for example the convolution computation ubiquitous in CNNs. For ease of understanding, Fig. 3 shows an example of a convolution operation. As shown in Fig. 3, a 3x3 convolution kernel is convolved with a 5x5 input feature map at stride 1. The left of the figure shows the first convolution, the middle shows the second, and so on. After 9 convolutions, the convolved feature map on the right of Fig. 3 is obtained.
Since there is no dependence among these 9 convolutions, they can all be executed in a single batch of the parallel computing module 230 (the parallelism M can generally reach the order of thousands). In one embodiment of the invention, to complete these 9 convolutions simultaneously, 18 read controllers can be used to read the data. A first group of 9 read controllers reads the feature map convolution windows: the 9 3x3 windows sliding one by one from the top left of the 5x5 input feature map to the bottom right. Since the stride is 1, 6 data values can be reused between adjacent windows, as shown by the left and middle diagrams in Fig. 3. A second group of 9 read controllers reads the weight convolution windows, i.e., 9 convolution windows with identical data (because the same convolution kernel is used throughout). Consequently, only 34 (5x5 + 3x3) values need to be stored in the L1 cache 210, and the read controllers' read control suffices to realize the parallel computation in the computing module. It should be understood that the 5x5 input feature map is introduced here for ease of explanation; in practical applications, input feature maps are typically much larger than 5x5.
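To make the window extraction and the degree of reuse concrete, the following minimal Python sketch (an illustration only, not the patent's hardware) enumerates the nine 3x3 windows over a 5x5 feature map at stride 1 and counts how often each cached value is re-read — exactly the feature map reuse the read controllers exploit.

```python
import numpy as np

def conv_windows(fmap, k=3, stride=1):
    """Enumerate all k x k convolution windows of a 2-D feature map."""
    h, w = fmap.shape
    return [fmap[r:r + k, c:c + k]
            for r in range(0, h - k + 1, stride)
            for c in range(0, w - k + 1, stride)]

fmap = np.arange(25).reshape(5, 5)      # the 5x5 input feature map of Fig. 3
kernel = np.ones((3, 3))                # one shared 3x3 kernel

windows = conv_windows(fmap)            # 9 windows -> parallelism M = 9
out = np.array([np.sum(w * kernel) for w in windows]).reshape(3, 3)

# Each element is read once per window covering it; summing window sizes
# shows the reuse: 9 windows x 9 elements = 81 reads from only 25 values.
total_reads = sum(w.size for w in windows)
print(out)
print(f"{total_reads} reads served from {fmap.size} cached values")
```

This is why storing just the 25 feature values and 9 weights (34 values in total) in the L1 cache suffices to feed 9 parallel multiply-accumulate units.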
In addition, for the highly sparse data frequently encountered in highly parallel computation, the invention can further improve the computational efficiency of the parallel computing module by adding a filtering module. Fig. 4 is a schematic diagram of a computing platform according to an embodiment of the invention. The high-parallelism computing platform 400 shown in Fig. 4, besides the L1 cache 410, the multiple read controllers 420, and the parallel computing module 430, further comprises a filtering module 440 connected between the multiple read controllers 420 and the parallel computing module 430, for filtering out zero values in the computation data read by the read controllers and/or non-zero values whose corresponding computation results are zero. In other words, the filtering module 440 filters out the values that are meaningless to the result of the subsequent parallel computing module 430.
In the case of CNN computation, the filtering module 440 can filter out zero values in the feature map data and/or weight data read by the read controllers, and/or non-zero values in the feature map data and weight data whose corresponding convolution results are zero. Fig. 5 shows an example of filtering data for convolution computation in a CNN. As shown, by the definition of convolution, only the values at positions where both the feature map convolution window and the weight convolution window are non-zero can influence the output value. Therefore, the filtering module 440 keeps only those values at positions where both a feature map convolution window and its corresponding weight convolution window are non-zero, and filters out all other values.
In one embodiment, the filtering module 440 may comprise an L0 cache and multiple asynchronous FIFO queues (FIFOs). In a preferred embodiment, the L0 cache comprises, or can be divided into, 2M convolution-window cache pools, and the asynchronous FIFOs are implemented as M FIFOs. The 2M read controllers store M feature map convolution windows and M weight convolution windows into the 2M convolution-window cache pools of the L0 cache; in a pairwise manner, the values at positions where at least one of a feature map convolution window and its corresponding weight convolution window is zero are filtered out, and the resulting densely packed, sparsity-free data (as shown in Fig. 5) is fed at high frequency into the M asynchronous FIFOs. The sparsity-processed data is then sent from the FIFOs into the parallel computing device. In a convolution application, the parallel computing unit can be implemented as a convolution core comprising M mutually independent multiply-accumulate modules. Clearly, the introduction of the filtering module enables the convolution core to finish one batch (comprising M simultaneous convolutions) more quickly.
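The pairwise zero-filtering can be illustrated with the following Python sketch (a behavioral illustration under assumed toy windows, not the hardware design): for one feature/weight window pair, it keeps only the positions where both operands are non-zero and emits the densely packed operand pairs that would be pushed into a FIFO.

```python
import numpy as np

def filter_window_pair(fwin, wwin):
    """Keep only positions where feature AND weight are both non-zero."""
    mask = (fwin != 0) & (wwin != 0)
    return fwin[mask], wwin[mask]     # densely packed operand pairs

fwin = np.array([[0, 2, 0], [1, 0, 3], [0, 0, 5]])   # sparse feature window
wwin = np.array([[4, 0, 1], [2, 6, 0], [0, 7, 8]])   # sparse weight window

f, w = filter_window_pair(fwin, wwin)
# Only the pairs (1, 2) and (5, 8) survive the filter.
assert np.dot(f, w) == np.sum(fwin * wwin)           # result unchanged: 42
print(f, w)                                          # 2 MACs instead of 9
```

Because only 2 of the 9 positions survive, the multiply-accumulate pipeline spends 2 cycles instead of 9 on this window pair while producing the identical partial sum — the "bubbles" have been squeezed out before the compute stage.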
As described above, the L1 cache must have the capability that its cached contents at arbitrary positions can be read simultaneously by multiple read controllers, and it can, for example, be composed of a register file to realize this function. However, realizing this function incurs considerable power and logic overhead, so the capacity of the L1 cache should not be too large. Under the premise of meeting the parallelism demand of the computation, the L1 cache should be as small as possible; for example, in a neural network hardware accelerator, an L1 cache of tens or hundreds of KB already meets a parallelism demand on the order of thousands or tens of thousands. Because of the limits on the L1 cache's capacity and access mode, having the L1 cache read data directly from external memory could not maximize the use of the available bandwidth and would incur excessive overhead. In one embodiment of the invention, therefore, an L2 cache of larger capacity is placed between the external memory and the L1 cache to solve these problems.
Fig. 6 is a schematic diagram of a computing platform according to an embodiment of the invention. The high-parallelism computing platform 600 shown in Fig. 6, besides the L1 cache 610, the multiple read controllers 620, the filtering module 640, and the parallel computing module 630, further comprises an L2 cache 650 connected to the L1 cache, for reading the required computation data directly from the external memory. The L2 cache 650 is preferably implemented as a conventional cache that sequentially buffers the computation data read from outside and supplies data to the L1 cache 610 on demand. The capacity of the L2 cache 650 (e.g., several MB) is usually larger than that of the L1 cache 610 (e.g., several hundred KB), and it can act as the transfer intermediary between the external memory (likewise implemented in the figure as a mass storage unit, e.g., DDR) and the L1 cache 610. Introducing the L2 cache maximizes the use of memory bandwidth while reducing the demands on L1-cache capacity and power.
The L2 cache 650 can write data into the L1 cache 610, and/or read data from the external memory, according to how far the data in the L1 cache has been consumed. In one embodiment, the data-consumption state of the L1 cache is indicated by the read-controller state parameter.
In a CNN implementation, the L2 cache 650 can cache the feature map data and the weight data it reads separately and, based at least on the data-consumption state of the L1 cache, perform at least one of the following operations: writing feature map data and weight data into the L1 cache; and reading new feature map data and weight data from the external memory. Preferably, the data-consumption state can be characterized by the position coordinates, on the feature map, of the feature map data being read by the read controllers.
The introduction of the above multi-level cache structure improves system efficiency and reduces overhead. Furthermore, the data-update frequencies between cache levels can be designed according to the computation speed of the parallel computing module and the memory bandwidth between the computing platform and the external memory, maximizing system performance.
In the ideal case, the interval at which the multiple read controllers read the feature map data and the weight data equals the period the parallel computing module needs to perform a single computation of M groups. In other words, updates from the L1 cache to the L0 cache are high-frequency unrolled data updates and can be performed in the same time the convolution core needs for one batch (comprising M parallel convolution-window computations), so that the computing capability of the parallel computing module is maximally utilized.
The data in the L1 cache is obviously updated less frequently than in the L0 cache. In other words, the interval at which the L2 cache writes data into the L1 cache is larger than the interval at which the L1 cache writes data into the L0 cache. When, for example, the position coordinates on the feature map of the feature map data being read by the read controllers indicate that the data in the L1 cache is about to be exhausted, the L2 cache can write new data into the L1 cache.
By contrast, the rate at which the data in the L2 cache is updated can be set flexibly as needed. The interval at which the L2 cache reads data from the external memory can be equal to or larger than the interval at which it writes data into the L1 cache. In one embodiment, whenever the data in the L1 cache is exhausted, the L2 cache replenishes the L1 cache and at the same time reads new data from the external memory. For example, when the data in a 500 KB L1 cache is exhausted, a 5 MB L2 cache writes 500 KB of data into the L1 cache and simultaneously reads a new 500 KB of data from the external memory. In other embodiments, the L2 cache can replenish itself less often, for example fetching a corresponding amount of new data only after every two, three, or more L1-cache updates; the invention places no restriction here.
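The following Python sketch illustrates, under assumed toy capacities, the update policy just described: the L0 cache refills every compute batch, the L1 cache refills from L2 only when its read position indicates exhaustion, and L2 fetches from DDR at the same (or a lower) frequency. All sizes and the cycle loop are illustrative assumptions, not figures from the patent.

```python
L1_CAPACITY = 500      # KB, assumed
WINDOW_KB   = 5        # data consumed per parallel batch, assumed

class CacheHierarchy:
    def __init__(self):
        self.l1_remaining = L1_CAPACITY   # unconsumed data left in L1
        self.ddr_reads = self.l1_refills = 0

    def step(self):
        """One compute batch: L0 is refilled from L1 every batch."""
        self.l1_remaining -= WINDOW_KB    # read controllers consume L1 data
        if self.l1_remaining <= 0:        # state parameter says: exhausted
            self.l1_refills += 1          # L2 -> L1 (low frequency)
            self.l1_remaining = L1_CAPACITY
            self.ddr_reads += 1           # DDR -> L2, same frequency here

h = CacheHierarchy()
for _ in range(1000):                     # 1000 compute batches
    h.step()
print(h.l1_refills, h.ddr_reads)          # 10 refills, 10 DDR reads
```

The key invariant is that the high-frequency L1-to-L0 path runs at the compute clock while the L2-to-L1 and DDR-to-L2 paths run orders of magnitude more slowly, so that external bandwidth never stalls the multiply-accumulate array.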
Hardware architectures of the invention and their preferred embodiments have been described above with reference to Figs. 2-6. A concrete application example of the invention and its operation is described below with reference to Fig. 7.
Fig. 7 is a schematic diagram of the concrete operation of a high-parallelism computing platform for CNNs according to an embodiment of the invention. As shown in Fig. 7, the computing platform 700 comprises the aforementioned L2 cache 750, L1 cache 710, multiple read controllers 720, an L0 cache 741 and asynchronous FIFOs 742 serving as the filtering module, and a parallel computing module 730 implemented as a convolution computation module.
In addition, the computing platform 700 further comprises a result cache module 760 optionally connected to the convolution computation module 730. This module buffers the convolution result data produced by the parallel computing module 730 and writes the result data back to the external mass storage unit. The result cache module 760 can, for example, perform the write to external memory after a certain amount of data has accumulated.
In one embodiment, the computing platform 700 optionally further comprises a read-controller status register 770 storing the read-controller state parameter, and a cache-update controller 780 that generates cache-update instructions based on the read-controller state parameter. The read-controller status register 770 and the cache-update controller 780 can be used to update the L1 and L2 caches according to the data-consumption state of the L1 cache 710.
A computing-platform implementation method for convolutional neural networks according to the invention is described below with reference to Fig. 8. Fig. 8 is a flowchart of a computing-platform implementation method according to an embodiment of the invention. The method can be realized with the aforementioned computing platform and its preferred embodiments, such as the computing platform 700 shown in Fig. 7.
In step S810, feature map data and weight data are read from the external memory into the L1 cache.
In one embodiment, this reading can be performed via the L2 cache 750 shown in Fig. 7. The L2 cache 750 connected to the L1 cache 710 reads the feature map data and weight data directly from the mass storage unit and, based on the position coordinates on the feature map of the feature map data being read by the read controllers, writes feature map data and weight data into the L1 cache and reads new feature map data and weight data from the external memory, wherein the interval at which the L2 cache writes data into the L1 cache is less than or equal to the interval at which the L2 cache reads data from the external memory.
Then, in step S820, the multiple read controllers respectively read the feature map data and/or weight data required by multiple single convolution operations. Here, as shown in Fig. 3, the multiple read controllers can each read the corresponding feature map convolution window and weight convolution window according to the kernel size of the convolution to be performed. The feature map data and weight data read by the multiple read controllers exhibit a high degree of data reuse, which includes at least one of the following: reuse of feature map data caused by the side length of the convolution window being larger than the stride; and reuse of weight data caused by convolving multiple positions of a feature map simultaneously.
In step S830, the parallel computing module simultaneously performs highly parallel convolution on the multiple groups of feature map data and weight data read by the multiple read controllers. Before the feature map data and weight data are fed into the convolution core, they may also pass through a filtering operation. Therefore, the implementation method of the invention may preferably further include a step S825 of filtering out, from the feature map convolution windows and weight convolution windows, the values at positions that contribute only zeros to the convolution result, and feeding the filtered feature map convolution windows and weight convolution windows into the parallel computing module.
To maximize the use of the convolution core's computing capability, the interval at which the multiple read controllers read the feature map data and the weight data equals the period the convolution core needs for a single computation of M groups, and is smaller than the interval at which the L2 cache writes data into the L1 cache.
An application example of the computation implementation method of the invention is further described below with reference to Fig. 7, followed by a behavioral sketch that chains the steps together.
First, the cache-update controller 780 of the L2 cache 750 obtains data from the mass storage unit according to the current state of the read controllers 720 of the L1 cache 710, and stores the feature map data and the weight data separately. Here, the state of the read controllers 720 of the L1 cache 710 can be the position coordinates of the neural network's convolution windows on the feature map; from these coordinates the consumption state of the data in the data pools of the L1 cache 710 can be deduced, and the consumed data can then be replaced with new data.
Next, the data is stored into the data pools of the L1 cache 710 according to the data dependences. Here, "data dependence" refers to the dependences in the distribution order of the data flow. The data pools can likewise comprise a data pool #0 for feature map caching and a data pool #1 for weight caching.
Then, two groups of read controllers update the feature maps and weights, respectively, from the L1 cache 710 into the L0 cache 741 (the convolution-window cache pools). The feature maps and weights are stored in the L0 cache 741 in sparse form; a filtering operation then removes every pair of corresponding values in which either the feature map or the weight convolution window holds a 0 at that position, producing densely packed, sparsity-free data, which is fed at high frequency into the asynchronous FIFOs 742. The sparsity-processed data is subsequently sent from the FIFOs to the convolution cores 730 (altogether M mutually independent multiply-accumulate modules) for computation, and the results are stored in the result cache module 760 on completion. Finally, after enough data has accumulated, the result cache module 760 stores the results back into the mass storage module.
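To tie the stages of Fig. 7 together, the following Python sketch chains steps S810-S830 for a single output feature map: a staging step standing in for DDR -> L2 -> L1, window extraction by the "read controllers", pairwise zero-filtering, and the multiply-accumulate loop. It is a behavioral model under assumed shapes (5x5 map, 3x3 kernel, stride 1), not a description of the actual hardware.

```python
import numpy as np

def run_layer(fmap, kernel, stride=1):
    k = kernel.shape[0]
    out_dim = (fmap.shape[0] - k) // stride + 1

    # S810: stage data (stands in for DDR -> L2 -> L1)
    l1_fmap, l1_weights = fmap.copy(), kernel.copy()

    results = []
    for r in range(out_dim):
        for c in range(out_dim):
            # S820: one read controller extracts a convolution window
            fwin = l1_fmap[r*stride:r*stride + k, c*stride:c*stride + k]
            # S825: pairwise zero-filter into a dense operand stream
            mask = (fwin != 0) & (l1_weights != 0)
            f, w = fwin[mask], l1_weights[mask]
            # S830: the MAC unit consumes only the surviving pairs
            results.append(np.dot(f, w))
    return np.array(results).reshape(out_dim, out_dim)

def conv_reference(fmap, kernel, stride=1):
    """Unfiltered reference convolution for checking the filtered pipeline."""
    k = kernel.shape[0]
    out_dim = (fmap.shape[0] - k) // stride + 1
    return np.array([[np.sum(fmap[r:r + k, c:c + k] * kernel)
                      for c in range(out_dim)] for r in range(out_dim)])

rng = np.random.default_rng(0)
fmap = rng.integers(0, 3, (5, 5)) * rng.integers(0, 2, (5, 5))  # ~50% zeros
kernel = rng.integers(-1, 2, (3, 3))

assert np.array_equal(run_layer(fmap, kernel), conv_reference(fmap, kernel))
print(run_layer(fmap, kernel))
```

The assertion confirms the design point of the filtering stage: dropping zero pairs changes the number of multiply-accumulate cycles, never the numerical result.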
It is worth noting that the multi-level cache of the invention is not a traditional processor cache but is specially designed for parallel computation, and in particular for the demands of CNN convolution. The data from the mass storage unit (e.g., DDR) to the L2 cache, and from the L2 cache to the L1 cache, consists of complete feature maps and weights without sparsity processing. The L0 cache is updated at high frequency (it can even match the operating speed of the convolution cores), while the L2 and L1 caches can be updated at lower frequencies. Moreover, the update from the L1 cache to the L0 cache is a high-frequency unrolling of the data: during unrolling, weights can be shared by multiple feature map data, and the feature map data itself is reused when the convolution kernel is larger than the stride. The update from the L2 cache to the L1 cache can run at the same frequency as the L2 cache's own refill from external memory. As a result, the hardware architecture based on the multi-level cache of the invention can maximally utilize the processing speed of the convolution cores, so that memory bandwidth no longer limits the processing efficiency of the computing platform.
In addition, although Fig. 7 shows feature map and weight cache modules and data pools of equal size, it should be understood that in practical applications the proportions of feature map and weight data in the L1 and L2 caches can be adjusted appropriately according to the specific parallel computation scheme. In general, since the reuse rate of convolution kernels is high, the feature map cache module and data pool are significantly larger than the weight cache module and data pool.
The computing platform of the invention can be implemented as a neural network processor. Compared with a plain computing platform (i.e., one with only a host or CPU), the invention is directed to a dedicated neural network processor specially designed to execute neural network computation. Those skilled in the art will understand that the term "dedicated neural network processor" used in this application may also be called a "neural network processor" or "NN processor". Since deep learning is currently the most popular technique class within neural network technology, the dedicated neural network processor may be implemented as a dedicated deep learning processor or deep learning processing unit. But those skilled in the art will understand that neural networks have various technical branches, such as deep neural networks (DNNs) and CNNs, so the dedicated neural network processor may also be implemented as a dedicated deep neural network processor (DNN processor) or a dedicated convolutional neural network processor (CNN processor). That is, neural network computing implementations in heterogeneous computing platforms involving a "deep learning processor", "deep neural network processor", or "convolutional neural network processor" also fall within the scope of the invention.
A DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence: exploiting the high parallelism and low power consumption of FPGAs, it performs inference based on convolutional neural networks (CNNs). Here, a DPU can be considered one concrete implementation of the above "deep learning processor", "deep neural network processor", "convolutional neural network processor", or "neural network processor". The description here is based mainly on a DPU realized with an FPGA and using a CNN structure, but those skilled in the art should understand that the principles of the invention apply equally to neural network processors that perform inference for other neural networks on hardware such as GPUs.
The computing platform of the invention can be realized in a highly parallel computing system, in which some or all of the functions for performing high-parallelism computation such as neural network computation can be carried out by digital circuits. In one embodiment, the computing system of the invention can be realized as a system-on-chip (SoC) comprising a general-purpose processor, a mass memory, and digital circuits.
In one embodiment, the highly parallel computing platform required by the system, for example the computing platform for convolutional neural networks, can be realized by the digital-circuit portion (e.g., FPGA, GPU, or ASIC) on the SoC. The computing platform, or the parallel computing module within it, can be a hardware device based on an FPGA, GPU, ASIC, or the like. Since CNNs perform parallel computation, realizing the convolutional neural network computing function by logic hardware, especially an FPGA, has a natural computational advantage and achieves lower power consumption than software execution.
In one embodiment, all the parameters of the CNN obtained through prior training, together with, for example, the feature maps to be classified, can be stored in the external memory; when neural network computation is then performed, the general-purpose processor can cause the computing platform to execute the method described above in connection with Fig. 8, realizing high-performance parallel computation.
The high-parallelism computing platform, system, and computation implementation method according to the invention have been described above in detail with reference to the accompanying drawings.
Based on the deterministic data-dependence pattern of convolutional neural networks, the invention designs a multi-level cache structure that exploits bandwidth, multiple clock frequencies, and data reuse, so that the computational efficiency of the convolution module is greatly improved. In addition, the invention exploits the sparsity of convolutional neural network weights and feature maps, significantly improving computational efficiency by filtering out zero values. The design of the computing platform is simple: without the cost of complex encoding/decoding or of storing sparse-data metadata, it extracts the maximum computational-efficiency gain from the sparsity of neural network parameters.
It should be emphasized here that although the invention presents its optimized hardware architecture mainly in connection with convolutional neural network computation, those skilled in the art will understand that the computing platform of the invention is also applicable to other highly parallel computation, and is especially suited to parallel application scenarios where the input data type is relatively uniform and/or the sparsity is high.
Furthermore, although the names "L2 cache", "L1 cache", and "L0 cache" are used herein, "L2", "L1", and "L0" merely indicate, in embodiments comprising more than two caches, the relationships among the caches, and do not impose any restriction on the caches themselves.
In addition, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the steps defined in the above method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by a processor of an electronic device (or computing device, server, etc.), the processor is caused to carry out the steps of the above method according to the invention.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show possible architectures, functions, and operations of systems and methods according to multiple embodiments of the invention. In this regard, each box in a flowchart or block diagram may represent a module, program segment, or part of code that comprises one or more executable instructions for realizing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two consecutive boxes may in fact be executed substantially in parallel, or sometimes in the opposite order, depending on the functions involved. It should also be noted that each box of the block diagrams and/or flowcharts, and combinations of boxes, can be realized by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the invention have been described above; the description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (25)

1. A high-parallelism computing platform, comprising:
a first-level cache for caching computation data read from an external memory;
a plurality of read controllers, each read controller being used to read, from any position of the first-level cache, the computation data, or a portion thereof, required by a single operation of a parallel computing module; and
a parallel computing module for performing highly parallel computing operations on the computation data read by the plurality of read controllers.
2. The computing platform of claim 1, further comprising:
a filtering module, connected between the plurality of read controllers and the parallel computing module, for filtering out zero values in the computation data read by the read controllers and/or non-zero values in the computation data whose corresponding computation results are zero.
3. The computing platform of claim 1, further comprising:
a second-level cache that reads the computation data directly from the external memory, the second-level cache caching the read computation data in order and, based at least on the data-consumption status in the first-level cache, performing at least one of the following operations:
writing computation data to the first-level cache; and
reading computation data from the external memory and caching it.
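A minimal software model of the refill behavior in claim 3 may help fix ideas (an editorial sketch under assumptions: Python deques stand in for the caches, data is consumed in order, and the consumption signal is a simple entry count, whereas claim 4 derives it from read-controller state; all names are invented):

```python
from collections import deque

class TwoLevelCache:
    """Toy model of claim 3: the second-level cache (L2) streams data in
    order from external memory and tops up the first-level cache (L1)
    as the read controllers consume entries."""

    def __init__(self, external_data, l2_size=1024, l1_size=64):
        self.mem = deque(external_data)   # stand-in for external memory
        self.l2 = deque()                 # order-preserving L2 contents
        self.l1 = deque()
        self.l2_size, self.l1_size = l2_size, l1_size

    def tick(self):
        # Operation 1: write data to L1 -- refill whatever L1 has consumed.
        while self.l2 and len(self.l1) < self.l1_size:
            self.l1.append(self.l2.popleft())
        # Operation 2: read ahead from external memory to keep L2 full.
        while self.mem and len(self.l2) < self.l2_size:
            self.l2.append(self.mem.popleft())

    def consume(self, n):
        # Read controllers draining L1 (the consumption the update reacts to).
        return [self.l1.popleft() for _ in range(min(n, len(self.l1)))]
```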
4. The computing platform of claim 1, wherein the data-consumption status in the first-level cache is indicated by a read-controller state parameter, and the computing platform further comprises:
a read-controller status register for storing the read-controller state parameter; and
a cache-update controller that generates cache-update instructions based on the read-controller state parameter.
5. The computing platform of claim 1, further comprising:
a computation-result cache module, connected to the parallel computing module, for caching the computation-result data of the computing operations performed by the parallel computing module and storing the computation-result data back to the external memory.
6. The computing platform of claim 1, wherein the first-level cache is implemented as a register file.
7. A high-parallelism computing platform for convolutional neural networks, comprising:
a first-level cache for caching feature-map data and weight data read from an external memory;
a plurality of read controllers, each read controller being used to read, from any position of the first-level cache, the feature-map data and/or weight data required by a single convolution computing operation; and
a parallel computing module for performing highly parallel convolution computing operations on the feature-map data and weight data read by the plurality of read controllers.
8. The computing platform of claim 7, wherein the first-level cache includes a feature-map data pool for caching the feature-map data and a weight data pool for caching the weight data, and the plurality of read controllers comprise 2M read controllers, a first group of M read controllers being used to read M feature-map convolution windows simultaneously and a second group of M read controllers being used to read M weight convolution windows simultaneously, so that a parallelism of M feature-map convolution operations is achieved in the parallel computing module, where M is an integer greater than 2.
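Claim 8's fetch pattern can be sketched as follows (an editorial Python model, not RTL; the toy array shapes, and the choice of reusing a single kernel across the M positions, which the reuse described in claim 19 permits, are assumptions of the example):

```python
import numpy as np

def fetch_windows(feature_pool, weight_pool, positions, k=3):
    """Sketch of the 2M read controllers in claim 8: for each of the M
    window positions, one controller slices a KxK feature window from the
    feature-map pool and a paired controller supplies the KxK weight
    window, so M convolutions can start in the same cycle."""
    feat_windows = [feature_pool[r:r + k, c:c + k] for r, c in positions]
    wght_windows = [weight_pool] * len(positions)  # one kernel reused at M positions
    return feat_windows, wght_windows

fmap = np.arange(36).reshape(6, 6)       # toy 6x6 feature-map data pool
kernel = np.ones((3, 3), dtype=int)      # toy 3x3 weight data pool
feats, wghts = fetch_windows(fmap, kernel, positions=[(0, 0), (0, 1), (0, 2)])
results = [int((f * w).sum()) for f, w in zip(feats, wghts)]  # M parallel MACs
print(results)                           # [63, 72, 81]
```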
9. The computing platform of claim 8, further comprising:
a filtering module, connected between the plurality of read controllers and the parallel computing module, for filtering out zero values in the feature-map data and/or weight data read by the read controllers, and/or non-zero values in the feature-map data and weight data whose corresponding convolution computation results are zero.
10. The computing platform of claim 9, wherein the filtering module retains the values at positions where a feature-map convolution window and its corresponding weight convolution window are both non-zero, and filters out all other values.
11. The computing platform of claim 10, wherein the filtering module includes a level-0 cache and M asynchronous first-in-first-out (FIFO) queues, the level-0 cache including, or being divisible into, 2M convolution-window cache pools; the 2M read controllers store M feature-map convolution windows and M weight convolution windows into the 2M convolution-window cache pools of the level-0 cache, values at positions where at least one of a feature-map convolution window and its corresponding weight convolution window is zero are filtered out in a one-to-one manner, and the retained values are stored into the M asynchronous FIFO queues, respectively.
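Building on the filtering sketch given after the description above, the lane organization of claim 11 might be modeled as follows (an editorial sketch; plain Python deques stand in for the asynchronous FIFOs, which in hardware additionally cross clock domains):

```python
from collections import deque

def filter_into_fifos(feat_windows, wght_windows):
    """Level-0 stage of claim 11: pair each feature window with its weight
    window, drop positions where either value is zero, and push each
    lane's surviving operand pairs into that lane's own FIFO."""
    fifos = [deque() for _ in feat_windows]   # M asynchronous FIFO queues
    for fifo, feat, wght in zip(fifos, feat_windows, wght_windows):
        for f, w in zip(feat, wght):
            if f != 0 and w != 0:             # pairwise null filtering
                fifo.append((f, w))           # retained value pair
    return fifos

lanes = filter_into_fifos([[0, 3, 1], [2, 0, 5]], [[1, 0, 4], [0, 6, 7]])
print([list(q) for q in lanes])               # [[(1, 4)], [(5, 7)]]
```

The per-lane FIFOs let each compute lane drain operand pairs at a steady rate even though the filter discards a data-dependent number of values from every window.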
12. The computing platform of claim 8, further comprising:
a second-level cache, connected to the first-level cache, that reads the feature-map data and the weight data directly from the external memory, the cache capacity of the second-level cache being greater than that of the first-level cache, the second-level cache separately caching the feature-map data and the weight data and, based at least on the data-consumption status in the first-level cache, performing at least one of the following operations:
writing feature-map data and weight data to the first-level cache; and
reading new feature-map data and weight data from the external memory.
13. The computing platform of claim 12, wherein the data-consumption status is the position coordinates, on the feature map, of the feature-map data read by the read controllers.
14. The computing platform of claim 12, further comprising:
a read-controller status register for storing the read-controller state parameter; and
a cache-update controller that generates cache-update instructions based on the read-controller state parameter.
15. The computing platform of claim 8, wherein M is determined based at least on the bandwidth for reading data from the external memory and the size of the convolution window.
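As a purely illustrative sizing example for claim 15 (the numbers are invented, not from the patent): if the external memory delivers 64 bytes per cycle and each 3×3 convolution window of 8-bit values spans 9 bytes, raw bandwidth alone sustains roughly M = 64 / 9 ≈ 7 new windows per cycle; since overlapping windows re-read most of their data from the first-level cache rather than from external memory (claim 19), the attainable M is in practice higher. Both the read bandwidth and the window size therefore enter the choice of M.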
16. The computing platform of claim 7, further comprising:
a computation-result cache module, connected to the parallel computing module, for caching the convolution computation-result data of the computing operations performed by the parallel computing module and storing the computation-result data back to the external memory.
17. The computing platform of claim 7, wherein the first-level cache is implemented as a register file.
18. A computing implementation method for convolutional neural networks, comprising:
reading, using the computing platform of any one of claims 1-17, feature-map data and weight data from the external memory into the first-level cache;
the plurality of read controllers respectively reading the feature-map data and/or weight data required by a plurality of single convolution computing operations; and
the parallel computing module simultaneously performing highly parallel convolution computing operations on the plurality of groups of feature-map data and weight data read by the plurality of read controllers.
19. The method of claim 18, wherein there is a high degree of data reuse in the feature-map data and weight data read by the plurality of read controllers, the data reuse including at least one of the following:
reuse of feature-map data arising when the side length of the convolution window is greater than the stride; and
reuse of weight data arising from performing convolution on multiple positions in the feature map simultaneously.
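As an illustrative instance of both reuse types in claim 19: with a 3×3 convolution window and a stride of 1, two horizontally adjacent feature-map windows share 6 of their 9 values, so two thirds of the feature data for the second window is already on chip; and when the same weight window is convolved with M feature-map positions at once, a single weight read is amortized over M operations, dividing the weight-read bandwidth by M.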
20. The method of claim 18, further comprising:
a second-level cache connected to the first-level cache directly reading the feature-map data and the weight data from the external memory and, based on the position coordinates, on the feature map, of the feature-map data read by the read controllers, writing feature-map data and weight data to the first-level cache and reading new feature-map data and weight data from the external memory, wherein the time interval at which the second-level cache writes data to the first-level cache is less than or equal to the time interval at which the second-level cache reads data from the external memory.
21. The method of claim 20, wherein the plurality of read controllers respectively reading the feature-map data and/or weight data required by a plurality of single convolution computing operations comprises:
reading the corresponding feature-map convolution windows and weight convolution windows according to the convolution-kernel size of the convolution operation to be performed.
22. The method of claim 20, further comprising:
filtering out, from the feature-map convolution windows and weight convolution windows, the values at positions whose corresponding contribution to the convolution operation result is zero, and feeding the filtered feature-map convolution windows and weight convolution windows into the parallel computing module.
23. The method of claim 20, wherein the time interval at which the plurality of read controllers read the feature-map data and the weight data each time is equal to the period required by the parallel computing module to perform one M-group computation, and both are less than the time interval at which the second-level cache writes data to the first-level cache each time.
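Stated as a rate balance for claims 20 and 23 (an editorial restatement, not claim language): T_read = T_compute < T_refill ≤ T_ext, where T_read is the read controllers' per-fetch interval, T_compute is the parallel computing module's M-group period, T_refill is the interval at which the second-level cache tops up the first-level cache, and T_ext is the interval at which the second-level cache reads from external memory. Each higher level of the hierarchy transfers less often but in larger blocks, so the compute pipeline is never starved.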
24. A high-parallelism computing system, comprising:
the computing platform of any one of claims 1-17;
a mass memory located outside the computing platform; and
a processor, connected to the computing platform and the memory, for executing the implementation method of any one of claims 18-23.
25. The system of claim 24, wherein the parallel computing module is at least partly implemented by an FPGA, a GPU, or an ASIC.
CN201810277338.XA 2018-03-31 2018-03-31 High-parallelism computing platform, system and computing implementation method Active CN110321997B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810277338.XA CN110321997B (en) 2018-03-31 2018-03-31 High-parallelism computing platform, system and computing implementation method

Publications (2)

Publication Number Publication Date
CN110321997A (en) 2019-10-11
CN110321997B CN110321997B (en) 2021-10-19

Family

ID=68111837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810277338.XA Active CN110321997B (en) 2018-03-31 2018-03-31 High-parallelism computing platform, system and computing implementation method

Country Status (1)

Country Link
CN (1) CN110321997B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101034375A (en) * 2007-02-12 2007-09-12 忆正存储技术(深圳)有限公司 Computer memory system
CN101499875A (en) * 2008-02-02 2009-08-05 三星电子株式会社 Variant processing rate supporting apparatus for LTE rate de-matching and de-interleaving
WO2010064728A1 (en) * 2008-12-04 2010-06-10 Canon Kabushiki Kaisha Convolution operation circuit and object recognition apparatus
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN102667737A (en) * 2009-12-21 2012-09-12 索尼公司 Cache memory and cache memory control device
CN103729449A (en) * 2013-12-31 2014-04-16 上海富瀚微电子有限公司 Reference data access management method and device
CN105119768A (en) * 2015-06-26 2015-12-02 华为技术有限公司 Field-programmable gate array FPGA and data storage method
CN106649143A (en) * 2015-10-29 2017-05-10 阿里巴巴集团控股有限公司 Method and device for cache access, and electronic equipment
CN105611234A (en) * 2015-12-21 2016-05-25 中国科学院长春光学精密机械与物理研究所 Embedded system used analog display method for digital images of arbitrary frame rate
US20180089562A1 (en) * 2016-09-28 2018-03-29 SK Hynix Inc. Operation apparatus and method for convolutional neural network
CN106875011A (en) * 2017-01-12 2017-06-20 南京大学 The hardware structure and its calculation process of two-value weight convolutional neural networks accelerator
CN107633297A (en) * 2017-03-10 2018-01-26 南京大学 A kind of convolutional neural networks hardware accelerator based on parallel quick FIR filter algorithm
CN107451659A (en) * 2017-07-27 2017-12-08 清华大学 Neutral net accelerator and its implementation for bit wide subregion
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAIYUAN GUO et al.: "Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware", ISVLSI *
YANG Ning: "A Multi-GPU Parallel Framework for Deep Convolutional Neural Networks" (in Chinese), Computer and Modernization *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749782A (en) * 2019-10-31 2021-05-04 上海商汤智能科技有限公司 Data processing method and related product
CN113111995A (en) * 2020-01-09 2021-07-13 北京君正集成电路股份有限公司 Method for shortening model reasoning and model post-processing operation time
CN111275194A (en) * 2020-02-16 2020-06-12 苏州浪潮智能科技有限公司 NLP reasoning acceleration system based on FPGA
WO2021249192A1 (en) * 2020-06-12 2021-12-16 中兴通讯股份有限公司 Image processing method and apparatus, machine vision device, electronic device and computer-readable storage medium
CN111984189A (en) * 2020-07-22 2020-11-24 深圳云天励飞技术有限公司 Neural network computing device, data reading method, data storage method and related equipment
CN111984189B (en) * 2020-07-22 2022-05-17 深圳云天励飞技术股份有限公司 Neural network computing device, data reading method, data storage method and related equipment
CN112990157A (en) * 2021-05-13 2021-06-18 南京广捷智能科技有限公司 Image target identification acceleration system based on FPGA
CN112990157B (en) * 2021-05-13 2021-08-20 南京广捷智能科技有限公司 Image target identification acceleration system based on FPGA
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage

Also Published As

Publication number Publication date
CN110321997B (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN110321997A (en) High degree of parallelism computing platform, system and calculating implementation method
CN106951395B (en) Parallel convolution operations method and device towards compression convolutional neural networks
WO2021004366A1 (en) Neural network accelerator based on structured pruning and low-bit quantization, and method
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN105681628B (en) A kind of convolutional network arithmetic element and restructural convolutional neural networks processor and the method for realizing image denoising processing
CN108090565A (en) Accelerated method is trained in a kind of convolutional neural networks parallelization
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN107578098A (en) Neural network processor based on systolic arrays
CN110175671A (en) Construction method, image processing method and the device of neural network
CN109903221A (en) Image oversubscription method and device
CN106228240A (en) Degree of depth convolutional neural networks implementation method based on FPGA
CN107506828A (en) Computing device and method
CN106951926A (en) The deep learning systems approach and device of a kind of mixed architecture
CN110050267A (en) System and method for data management
CN107273936A (en) A kind of GAN image processing methods and system
CN108256628A (en) Convolutional neural networks hardware accelerator and its working method based on multicast network-on-chip
CN108665063A (en) Two-way simultaneous for BNN hardware accelerators handles convolution acceleration system
CN108334945A (en) The acceleration of deep neural network and compression method and device
CN107256424A (en) Three value weight convolutional network processing systems and method
CN109086802A (en) A kind of image classification method based on biquaternion convolutional neural networks
CN110163354A (en) A kind of computing device and method
CN107085562A (en) A kind of neural network processor and design method based on efficient multiplexing data flow
CN110580519B (en) Convolution operation device and method thereof
CN111985597B (en) Model compression method and device
CN109840585A (en) A kind of operation method and system towards sparse two-dimensional convolution

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right
Effective date of registration: 20190925
Address after: 2100 Rojack Avenue, San Jose, California, USA
Applicant after: XILINX INC
Address before: 17th floor, Building 4, Courtyard 1, Wangzhuang Road, Haidian District, Beijing 100083
Applicant before: Beijing Shenjian Intelligent Technology Co., Ltd.
SE01 Entry into force of request for substantive examination
GR01 Patent grant