CN110321064A - Computing platform realization method and system for neural network - Google Patents
Computing platform realization method and system for neural network
- Publication number
- CN110321064A CN110321064A CN201810297916.6A CN201810297916A CN110321064A CN 110321064 A CN110321064 A CN 110321064A CN 201810297916 A CN201810297916 A CN 201810297916A CN 110321064 A CN110321064 A CN 110321064A
- Authority
- CN
- China
- Prior art keywords
- data
- external memory
- feature map
- read
- on-chip
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0613—Improving I/O performance in relation to throughput
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0673—Single storage device
- G06F3/068—Hybrid storage device
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a computing platform implementation method for a neural network. The computing platform reads the required data from an external memory and caches both the read data and the intermediate results of each operation in an on-chip cache. The method includes: reading from the external memory a first part of the data of the feature map required by a first operation; executing, on that first part, the first operation and at least one other operation; storing the result for the first part back to the external memory; then reading from the external memory a second part of the feature map required by the first operation, and performing on the second part the same sequence of reading from the external memory, the first operation, the at least one other operation, and storing back to the external memory. With this scheme, the number of data transfers between external storage and the on-chip cache can be greatly reduced, thereby increasing overall processing speed.
Description
Technical field
The present invention relates to the field of deep learning, and in particular to a computing platform implementation method and related system for neural networks.
Background art
Neural networks have become a research hotspot in the field of image recognition in recent years. A trained neural network model can be applied in many areas such as image classification, object recognition, and saliency detection. Neural network models have grown in computational scale and complexity in recent years, and traditional CPU platforms can no longer satisfy their practical demands. Accelerator design on heterogeneous computing platforms such as FPGAs and GPUs has therefore become a new research hotspot. Among these, FPGAs can achieve higher energy efficiency than GPU platforms owing to their low power consumption, while their capacity for fast iteration and hardware reconfiguration better matches the demands of rapidly evolving algorithms.
When a highly parallel computing platform such as an FPGA or GPU performs neural network inference, the execution time of the computation itself is very short compared with the time cost of reading the parameters an operation requires, so memory reads become the bottleneck that limits processing speed.
Therefore, there is still a need for a scheme that can improve the overall processing efficiency of convolutional neural network computing platforms.
Summary of the invention
To solve at least one of the above problems, the invention proposes a computing platform implementation scheme for neural networks. Without changing the underlying results of the computation graph, it reduces data transfers between memory and the chip as much as possible through layer-based decomposition and fusion of the computation graph, uses hardware resources more effectively, and improves the processing efficiency of neural network computing platforms, especially convolutional neural network computing platforms.
According to one aspect of the invention, a computing platform implementation method for a neural network is provided. The computing platform reads the required data from an external memory and caches, in an on-chip cache, both the acquired data and the intermediate results of on-chip operations. The method includes: reading from the external memory a first part of the data of the feature map required by a first operation; executing, on the first part, the first operation and at least one other operation; storing the result for the first part back to the external memory; then reading from the external memory a second part of the feature map required by the first operation, and performing on the second part the same reading from the external memory, the first operation and the at least one other operation, and storing back to the external memory.
By reusing data in this way, unnecessary memory reads are reduced and overall processing efficiency is improved.
The feature map includes the first part of data and at least one second part of data, and the method includes: for each second part, performing the read from the external memory, the first operation, the at least one other operation, and the store back to the external memory described above, where the combined results for the first part and each second part are equivalent to the result of performing the first operation and the at least one other operation sequentially on the whole feature map.
Preferably, the size of the partial data read from the external memory each time is determined at least in part by the capacity of the on-chip cache.
The at least one other operation can be at least one of the following: at least one subsequent operation of the first operation, where the subsequent operation needs at least the complete result of the first operation to finish all of its own operations; and at least one lateral operation that shares the same feature-map input as the first operation.
In one embodiment, the first operation and the at least one other operation comprise successive convolution (CONV), nonlinearity (ReLU), and pooling (POOL) operations.
The computing platform implementation method of the invention can also include: reading from the external memory partial data from another feature map required by a first subsequent operation, and executing the first subsequent operation on the partial data from the other feature map together with the partial data from the feature map. This storage optimization suits neural networks with branches. Preferably, the first subsequent operation can be an ELTWISE operation.
The computing platform implementation method of the invention can also include merging a second subsequent operation into the first operation. Preferably, the second subsequent operation can be batch normalization (BatchNorm) and scaling (Scale) operations, and the first operation can be a CONV operation.
The computing platform implementation method of the invention can also include storing the operation result back to the external memory in a transformed dimension arrangement, so that operations in the neural network that only change data dimensions or arrangement can be omitted. In this way, concatenation (CONCAT) and flattening (FLATTEN) operations can be omitted outright by prescribing the storage layout.
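One way to realize this layout trick for CONCAT is to have each branch write its output directly at a precomputed channel offset in the destination buffer, so the concatenated layout exists in memory without a separate copy pass. A sketch under that assumption (the offset policy and function name are mine, not from the patent):

```python
import numpy as np

def write_concat_by_layout(outputs, ddr):
    """Omit an explicit CONCAT: each branch writes its result at its own
    channel offset in the destination buffer `ddr`, producing the
    channel-concatenated layout with no extra copy operation."""
    offset = 0
    for out in outputs:               # each out has shape (channels, H, W)
        c = out.shape[0]
        ddr[offset:offset + c] = out  # land directly in the final position
        offset += c
    return ddr
```

In hardware this corresponds to giving each branch a different base address when its result is stored back, rather than invoking a CONCAT node at all.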
The computing platform implementation method of the invention can also include layer splitting, for example decomposing the first operation and/or at least one of the other operations, where each decomposed first operation and/or other operation acts on a part of its original input. The decomposed first operation and/or at least one other operation can be split CONV operations, with each split CONV operation taking part of the original feature map as input. The decomposed first operation and/or at least one other operation can also be split POOL operations, with each split POOL operation taking the complete feature-map output of a preceding operation as input.
According to another aspect of the invention, a neural network implementation system is provided, comprising: an off-chip storage device for storing the parameters and feature maps needed for neural network computation; and a computation execution device that includes an on-chip cache. The computation execution device divides the feature map into parts based on the capacity of the on-chip cache and, for each part in turn, performs the following: reading the specific part of the required feature map from the external memory, executing at least two operations on that part, and storing the result back to the off-chip storage device. Preferably, the computation execution device is a highly parallel computing device based on an FPGA, GPU, or ASIC.
According to a further aspect of the invention, a computing platform for a neural network is provided, comprising: a data processing module for performing predetermined computations on input data and producing output data; a data storage module for caching the input data required by the data processing module and the intermediate data it outputs; and a control module for controlling the data processing module and the data storage module so as to perform any of the methods described above. The data processing module can be a convolution computation module for performing convolution on the input data.
According to another aspect of the invention, a non-transitory machine-readable storage medium is provided, on which executable code is stored; when the executable code is executed by a processor of an electronic device, the processor performs any of the methods described above.
By adopting the neural network computing implementation of the invention, the number of data transfers between external storage and the on-chip cache can be greatly reduced, removing the bottleneck that bandwidth imposes on neural network computation. The implementation of the invention can avoid unnecessary bandwidth usage by fusing, decomposing, and restructuring computational operations so as to heavily reuse shared inputs and/or intermediate data; it can omit time-consuming data rearrangement operations by arranging storage and reads sensibly; and it can merge certain subsequent operations into preceding ones. It thereby optimizes, from a process perspective, each type of operation and data access involved in neural network computation, improving overall computational efficiency.
Brief description of the drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the more detailed description of its exemplary embodiments in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 shows the series of ordered layers that make up a typical CNN.
Figs. 2A-2C show representative computation graph structures of existing CNN networks.
Fig. 3 shows a flowchart of a computing platform implementation method for a neural network according to an embodiment of the invention.
Fig. 4 shows the network computation graph of the Vgg basic structure, both existing and as optimized by the invention.
Fig. 5 shows the network computation graph of the ResNet basic structure, both existing and as optimized by the invention.
Fig. 6 shows an example of vertical fusion of convolutions according to an embodiment of the invention.
Fig. 7 shows an example of horizontal fusion of convolutions according to an embodiment of the invention.
Fig. 8 shows the omission of a FLATTEN (flattening) operation according to an embodiment of the invention.
Fig. 9 shows the omission of a CONCAT (concatenation) operation according to an embodiment of the invention.
Fig. 10 shows the fusion of BatchNorm (batch normalization) and Scale (scaling) operations according to an embodiment of the invention.
Fig. 11 shows an example of a splitting operation according to an embodiment of the invention.
Fig. 12 shows adjacent Inception structures in GoogLeNet v1, both existing and as optimized by the invention.
Fig. 13 shows adjacent Inception structures in GoogLeNet v2, both existing and as optimized by the invention.
Fig. 14 shows an example of an SoC that can be used to implement the computing system of the invention.
Fig. 15 shows a platform for performing neural network computation according to an embodiment of the invention.
Detailed description
Preferred embodiments of the disclosure are described more fully below with reference to the accompanying drawings. Although preferred embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
Basic concepts of the neural network processor
With the continued development of artificial intelligence, machine learning, and neural network algorithms in recent years, convolutional neural networks have achieved results surpassing humans in image classification, recognition, detection, tracking, and similar tasks. Given the enormous parameter scale and computation volume of convolutional neural networks, and their demanding requirements for hardware platform stability and computational energy efficiency, accelerator design on heterogeneous computing platforms such as FPGAs and GPUs has become a new research hotspot. Compared with GPU platforms, FPGAs can achieve higher energy efficiency thanks to their low power consumption, while their capacity for fast iteration and hardware reconfiguration better matches rapidly evolving algorithms. However, existing FPGA- and GPU-based CNN accelerator designs suffer from problems such as supporting only a single network structure, unreasonable bandwidth usage, and low utilization of computing resources, and cannot satisfy ever higher real-time requirements. Considerable room for research and discovery remains in heterogeneous accelerator design for CNNs.
In contrast with a single computing platform (one with only a host or CPU), the present invention targets neural network dedicated processors designed specifically to perform neural network computation. Those skilled in the art will appreciate that the term "neural network dedicated processor" used in this application may also simply be called a "neural network processor" or "NN processor". Since deep learning is currently the most popular technical category within neural network technology, the neural network dedicated processor can be implemented as a deep learning dedicated processor or deep learning processor. It will further be understood that neural networks have various technical branches, such as deep neural networks (DNN, Deep Neural Network) and convolutional neural networks (CNN, Convolutional Neural Network), so the neural network dedicated processor may also be implemented as a deep neural network dedicated processor or deep neural network processor (DNN processor or CNN processor). That is, neural network computing implementation techniques on heterogeneous computing platforms involving a "deep learning processor" or "deep neural network processor" also fall within the scope of the present invention.
The DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence; exploiting the high parallelism and low power consumption of FPGAs, it performs inference based on convolutional neural networks (hereinafter CNNs). Here, the DPU can be regarded as one concrete implementation of the "deep learning processor", "deep neural network processor", or "neural network processor" above. The following description is based primarily on a DPU realized on an FPGA using a CNN structure, but those skilled in the art will understand that the principles of the invention apply equally to neural network processors that perform inference for other neural networks on hardware such as GPUs.
During the mapping of algorithms to instructions on the DPU platform, it is necessary to decouple from deep learning frameworks, parse diverse CNN algorithms, and construct the corresponding DPU computation graph. Coarse-grained optimization is then performed on the graph structure, including node pruning and fusion, finally forming subgraphs and subgraph configurations that guide instruction generation by the DPU compiler.
Since the DPU is a general acceleration platform for neural network algorithms in artificial intelligence, it must support fixed-point models from different platforms; at the same time, the DPU supports configurable hardware parameters so that, as a hardware IP core, it can be deployed on different hardware platforms. Therefore, in the mapping from algorithm to instruction, it is first necessary to parse network computation graphs from different deep learning platforms and find opportunities for graph optimization, so as to maximize hardware computational efficiency as far as possible. When the DPU performs neural network inference, it executes multiple operations at each node of the computation graph. These operations must be completed on the DPU and can be understood as calls to different computing modules on the DPU. In general, compared with the time cost of reading the required data from external memory, the time the DPU spends computing is very short, which makes memory reads the bottleneck of system throughput. Finding a smaller, faster computation graph structure that does not change the graph's underlying results, minimizes data transfers between memory and the chip, and uses hardware resources more effectively is therefore a key step in mapping DPU algorithms to instructions. The computing implementation of the invention is the link between deep learning algorithms and the DPU compiler; it serves as an important inter-layer optimization of the neural network and is a core front-end algorithm of the compiler.
To better illustrate the computing scheme of the invention, the basic concepts of CNNs, network computation graphs, and their basic operations are explained first.
Basic concepts of CNNs
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help the reader understand the CNN-based computing operations analyzed in this application, we first introduce background knowledge of CNNs based on existing CNN models.
As shown in Fig. 1, a typical CNN consists of a series of ordered layers.
The parameters of a CNN model are called "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. A following layer reads the feature maps generated by the previous layer and outputs new feature maps. Finally, a classifier outputs the probabilities that the input image belongs to each class. CONV layers (convolutional layers) and FC layers (fully connected layers) are the two basic layer types in a CNN; a CONV layer is usually followed by a pooling layer.
In this application, for a given CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For CONV layers, n_in and n_out denote the numbers of input and output feature maps respectively.
For FC layers, n_in and n_out denote the lengths of the input and output feature vectors respectively.
Definition of CONV layers (convolutional layers): a CONV layer takes a series of feature maps as input and obtains output feature maps by convolution with convolution kernels. A nonlinear layer, i.e. a nonlinear activation function applied to every element of the output feature maps, is usually attached to CONV layers. The activation function used is generally the ReLU function, and this layer is also commonly called a ReLU layer.
A CONV layer can be expressed by expression (1):
f_i^out = Σ_{j=1}^{n_in} f_j^in ⊗ g_{i,j} + b_i    (1)
where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
Definition of FC layers (fully connected layers): an FC layer applies a linear transformation to the input feature vector:
f^out = W f^in + b    (2)
W is an n_out × n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Consequently, in expression (2), the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually attached to a CONV layer, it outputs the maximum or average value of each subarea in each feature map. Max pooling can be expressed by expression (3):
f_i^out(x, y) = max_{0 ≤ m, n < p} f_i^in(x·p + m, y·p + n)    (3)
where p is the size of the pooling kernel. This nonlinear "down-sampling" not only reduces the size of feature maps and the computation for the next layer, but also provides a form of translation invariance. During forward inference, a CNN can be used to perform image classification.
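Expressions (1)-(3) can be sketched as a small reference forward pass. The NumPy version below is a sketch for concreteness only: it assumes cross-correlation (as in most CNN frameworks), valid padding, and non-overlapping pooling windows, none of which the patent fixes.

```python
import numpy as np

def conv_layer(x, g, b):
    """Expression (1): x is (n_in, H, W), g is (n_out, n_in, k, k),
    b is (n_out,). Valid cross-correlation."""
    n_out, n_in, k, _ = g.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.empty((n_out, H, W))
    for i in range(n_out):
        acc = np.zeros((H, W))
        for j in range(n_in):          # sum over input feature maps
            for u in range(k):
                for v in range(k):
                    acc += g[i, j, u, v] * x[j, u:u + H, v:v + W]
        y[i] = acc + b[i]              # add the bias of the i-th output map
    return y

def relu(x):
    """The nonlinear layer: elementwise max(x, 0)."""
    return np.maximum(x, 0.0)

def max_pool(x, p):
    """Expression (3): p x p non-overlapping max pooling per channel."""
    c, H, W = x.shape
    return x[:, :H // p * p, :W // p * p].reshape(c, H // p, p, W // p, p).max(axis=(2, 4))
```

Chaining `max_pool(relu(conv_layer(x, g, b)), p)` reproduces the CONV → ReLU → POOL sequence that the rest of the document optimizes.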
Basic concepts of the network computation graph
To decouple from deep learning computation frameworks, a computation graph structure corresponding to the neural network processor needs to be constructed. Neural network algorithms from different deep learning platforms are converted into a general computation graph; the graph is optimized and restructured, and the optimized graph is then mapped to instructions and machine code of the hardware platform, completing the compilation of the algorithm on the hardware platform.
Figs. 2A-2C show representative computation graph structures of existing CNN networks. Since the invention reduces the number of data transfers between external memory and the on-chip cache through sensible data reuse, memory-access operations have been added to the conventional computation graph structures, which contain only computing operation nodes: "MEM" in the figures denotes a memory read/write operation, so as to show the bandwidth-saving advantage of the technical solution of the invention over the prior art. In the DPU implementation, "MEM" refers to data transfers between DDR (double data rate synchronous DRAM) and the chip.
Fig. 2A shows the basic structure in the Vgg network. As shown in Fig. 2A, when this branchless network computation graph executes the most basic CONV (convolution), ReLU (nonlinearity; the activation function used is generally the ReLU function), and POOL (pooling) operations, the feature map it needs to load is too large, so data must be moved repeatedly between DDR and the on-chip cache (implemented, for example, as BRAM, i.e. block RAM). Fig. 2B shows the basic structure in the ResNet network, and Fig. 2C shows adjacent Inception structures in GoogLeNet v1. As shown in Figs. 2B and 2C, branched network computation graphs additionally introduce an ELTWISE (element-wise) layer for summing the outputs of multiple convolutional layers, and a CONCAT (concatenation) layer for cascading the data of several input layers along the channel dimension into one new layer. Likewise, these computation graphs still show repeated data transfers between DDR and BRAM. It should be understood that "Vgg", "ResNet", and "GoogLeNet" listed above are CNN architectures popular in the prior art, used to explain the principles of the invention rather than to limit it.
Optimization based on data reuse
Because on-chip storage resources are limited, an entire CNN cannot be computed on chip in one pass, so the computation task must be divided into blocks. Data loaded on chip after blocking can also be used repeatedly through reuse, reducing the traffic with the external storage unit.
The data reuse scheme of the invention is described below in conjunction with Fig. 3, which shows a computing platform implementation method for a neural network according to an embodiment of the invention. A computing platform according to the invention reads the required data, for example an image to be classified, from the external memory, and caches the read data and the intermediate results of each operation in the on-chip cache.
In step S310, a first part of the data of the feature map required by a first operation is read from the external memory; the first operation and at least one other operation are executed on the first part; and the result for the first part is stored back to the external memory.
In step S320, a second part of the data of the feature map required by the first operation is read from the external memory, and the read from the external memory, the first operation, the at least one other operation, and the store back to the external memory described above are performed on the second part.
Preferably, the feature map may include the first part of data and at least one second part of data, and the method includes: for each second part, performing the read from the external memory, the first operation, the at least one other operation, and the store back to the external memory described above, where the combined results for the first part and each second part are equivalent to the result of performing the first operation and the at least one other operation sequentially on the feature map.
While the first operation and the at least one other operation are being performed on each part, the intermediate data generated is buffered directly in the on-chip cache. These multiple operations on the same partial data involve no writes of data to the external memory, and usually involve no reads of data from the external memory either (an exception is the ELTWISE operation shown in Fig. 5).
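The on-chip chaining just described can be sketched as a schedule that makes exactly one external-memory round trip per tile. The callbacks below are hypothetical stand-ins for the DDR interface; the operation list would be CONV, ReLU, POOL, and so on:

```python
def run_fused(tiles, read_ddr, write_ddr, ops):
    """Process each feature-map tile through the whole operation chain
    before touching external memory again: one read and one write per
    tile, instead of one read/write per tile *per operation*."""
    for t in tiles:
        data = read_ddr(t)          # DDR -> on-chip buffer
        for op in ops:              # all intermediates stay on chip
            data = op(data)
        write_ddr(t, data)          # on-chip buffer -> DDR
```

The key property is that the inner loop over operations never calls `read_ddr` or `write_ddr`, matching the statement above that intermediate data stays in the on-chip cache.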
The data-reuse scheme of the invention is described more concretely below in conjunction with Fig. 4. Fig. 4 shows the network computation graph of the basic Vgg structure; the left side is the existing structure and the right side is the optimized structure according to the invention.
Usually, during the execution of the whole computation graph, the way data are stored in hardware can be abstracted as a feature map, having the three dimensions of width, height and channel plus an optional batch dimension as the fourth, unrolled into a one-dimensional data tensor according to certain rules. For example, in image classification, not only can the initial image be abstracted as a three-dimensional feature map, but the output of every operation (that is, the result of each node) can still be regarded as a feature map. Since on-chip storage resources are limited (for example, the on-chip BRAM cannot cache a complete feature map at once), completing an operation on the entire feature map requires multiple reads from DDR into BRAM.
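As a rough illustration of why tiling is needed, the minimum number of tiles can be estimated from the feature-map size and the on-chip buffer capacity (a simplified sketch; the function name and the 64 KiB buffer size are illustrative assumptions, not values from the patent):

```python
import math

def num_tiles(height, width, channels, bytes_per_elem, bram_bytes):
    """Minimum number of tiles a feature map must be split into so that
    each tile (and hence each DDR -> BRAM read) fits in the on-chip buffer."""
    total = height * width * channels * bytes_per_elem
    return max(1, math.ceil(total / bram_bytes))

# A 224x224x3 feature map of 8-bit values against an assumed 64 KiB BRAM:
# 150528 bytes / 65536 bytes per tile -> at least 3 tiles.
```

In practice the split must also respect row/column boundaries and convolution halos, which this estimate ignores.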
Assume that the input feature map is a three-channel (for example, RGB) two-dimensional image, and that the on-chip BRAM can read in the data of one image block at a time. In the existing structure shown on the left of Fig. 4, to complete the CONV, ReLU and POOL operations on this input feature map, the three-channel data of, say, the upper-left image block is first read from the external memory (for example, DDR) and stored back to DDR after the CONV operation; then the three-channel data of the upper-right image block is read and stored back to DDR after the CONV operation; then the three-channel data of the lower-left image block is read and stored back to DDR after the CONV operation; and finally the three-channel data of the lower-right image block is read and stored back to DDR after the CONV operation. The CONV operation on the feature map is thereby finished. Next, the CONV-processed three-channel data of the upper-left image block is read from DDR and stored back to DDR after the ReLU operation; the CONV-processed upper-right, lower-left and lower-right image blocks are read from DDR and stored back to DDR after the ReLU operation in the same way. The ReLU operation on the feature map is thereby finished. Finally, the ReLU-processed three-channel data of the upper-left image block is read from DDR and stored back to DDR after the POOL operation; the ReLU-processed upper-right, lower-left and lower-right image blocks are read from DDR and stored back to DDR after the POOL operation in the same way. The POOL operation on the feature map is thereby finished.
In the layer-fusion-based structure of the invention shown on the right of Fig. 4, to complete the CONV, ReLU and POOL operations on the same input feature map, the three-channel data of the upper-left image block is first read from DDR; the result of the CONV operation is kept in the on-chip cache (for example, BRAM), the ReLU operation is applied directly to the cached data, the POOL operation follows directly after any necessary on-chip buffering, and the three-channel data of the upper-left image block, now processed by CONV, ReLU and POOL, is then stored back to DDR. Next, the three-channel data of the upper-right image block is read from DDR and likewise processed by CONV, ReLU and POOL directly on chip before being stored back to DDR; the lower-left and lower-right image blocks are then read from DDR and processed in the same way, each stored back to DDR only after CONV, ReLU and POOL have been performed on chip.
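The two schedules contrasted in Fig. 4 can be sketched as follows; counting the DDR transactions in each trace shows where the savings come from (a toy model with hypothetical block and operation labels, not production code):

```python
def per_layer_schedule(num_blocks, ops):
    """Existing structure (left of Fig. 4): each operation makes its own
    full pass over the feature map, loading and storing every block."""
    trace = []
    for op in ops:
        for blk in range(num_blocks):
            trace += [f"load b{blk}", f"{op} b{blk}", f"store b{blk}"]
    return trace

def fused_schedule(num_blocks, ops):
    """Layer-fused structure (right of Fig. 4): each block is loaded once,
    all operations run on the cached copy, and only the end result is stored."""
    trace = []
    for blk in range(num_blocks):
        trace.append(f"load b{blk}")
        trace += [f"{op} b{blk}" for op in ops]
        trace.append(f"store b{blk}")
    return trace

def ddr_transactions(trace):
    # every load/store in the trace is one external-memory transaction
    return sum(step.startswith(("load", "store")) for step in trace)

# 4 blocks, 3 operations: 24 DDR transactions unfused versus 8 fused.
```

The unfused count grows with the number of fused operations (2 x blocks x ops), while the fused count stays at 2 x blocks regardless of how many operations are chained.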
It can be seen that the computing-platform implementation of the invention greatly reduces the number of reads from the external memory into the on-chip cache, removing external-memory reads as the bottleneck of DPU efficiency and thereby substantially improving the task-execution efficiency of the DPU. It should be understood that, for convenience in illustrating the principle of the invention, the feature map in the above example is cut into four blocks (upper-left, upper-right, lower-left and lower-right) for processing and storage. In practical applications the feature map can be partitioned differently as needed; no restriction is imposed here.
In the example of Fig. 4, the first operation and the at least one other operation comprise successive convolution (CONV), non-linear (ReLU) and pooling (POOL) operations. This is the most common basic structure in a CNN. Here, the at least one other operation (that is, the ReLU and POOL operations) is a subsequent operation based on the first operation, a subsequent operation that needs at least the complete computation result of the first operation in order to complete all of its own operations.
In other embodiments, the other operations may have other dependency relationships with the first operation. The data-reuse scheme of the invention is described below in conjunction with several examples.
Fig. 5 shows the network computation graph of the basic Resnet structure; the left side is the existing structure and the right side is the optimized structure according to the invention.
As shown on the right of Fig. 5, besides fusing the convolution (CONV) and non-linear (ReLU) operations, the data-reuse scheme of the invention can also be applied, in network structures with branches, to nodes with two or more inputs (such as the ELTWISE operation in the figure). Accordingly, the implementation method of the invention may further include reading from the external memory the partial data from other feature maps required by a first subsequent operation, and performing the first subsequent operation on the partial data from the other feature maps together with the partial data from the feature map. Here the ELTWISE (dot-product) operation, as the first subsequent operation, not only reuses the result data of the CONV and ReLU operations on the right branch of the network, but also receives the result data of the left branch of the network as input. The degree of data reuse is thereby further improved.
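A per-tile sketch of the fused residual pattern on the right of Fig. 5 follows; the `double` and `relu` callables are hypothetical stand-ins for the real kernels, not anything defined in the patent:

```python
def fused_eltwise_tile(tile, shortcut, conv, relu):
    """The CONV/ReLU result for a tile stays in the on-chip cache; the
    ELTWISE merge reuses it directly, so only the shortcut data needs an
    extra read from external memory."""
    cached = relu(conv(tile))          # kept on chip, never written to DDR
    return [a + b for a, b in zip(cached, shortcut)]  # ELTWISE merge

double = lambda xs: [2 * x for x in xs]      # stand-in "conv"
relu = lambda xs: [max(0, x) for x in xs]
# e.g. fused_eltwise_tile([1, -2, 3], [10, 20, 30], double, relu) -> [12, 20, 36]
```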
Similarly, for the most basic CONV operations in a CNN, the reuse scheme of the invention can fuse different CONV operations. Fig. 6 shows vertical fusion of convolutions according to an embodiment of the invention. In the vertical layer fusion of two successive convolution operations, all the parameters needed can be loaded into the on-chip cache at once, and the two convolution operations are thereby fused: the intermediate result between the two operations does not need to be written back to the external memory, but is cached directly on chip and written back only after the computation has finished. Fig. 7 shows horizontal fusion of convolutions according to an embodiment of the invention. Horizontal layer fusion can be understood as several operations sharing the same input feature map: instead of loading the feature map on chip once per layer, the results are computed on chip and stored back to their respective positions in the external memory. In other words, the other operations related to the first operation need not only be sequential operations based on the output of the first operation; they may also be at least one horizontal operation sharing the same feature-map input with the first operation. Horizontal fusion means that several computing operations share the same input feature map, saving the bandwidth spent carrying data from the external memory onto the chip; vertical fusion means saving the bandwidth needed to carry inter-layer feature-map data back and forth between the external memory and the chip.
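Horizontal fusion as in Fig. 7 can be sketched by counting loads of the shared input (a toy model; the load counter and the sample operations are illustrative assumptions):

```python
loads = {"n": 0}

def load_feature_map():
    """Stand-in for a DDR -> on-chip transfer; counts how often it happens."""
    loads["n"] += 1
    return [1, 2, 3]

def run_unfused(ops):
    # without fusion, each operation loads the shared input feature map itself
    return [op(load_feature_map()) for op in ops]

def run_horizontally_fused(ops):
    # with horizontal fusion, one on-chip copy feeds every operation
    x = load_feature_map()
    return [op(x) for op in ops]

ops = [sum, max, min]   # three hypothetical operations sharing one input
```

Both variants produce identical results; only the number of external-memory transfers differs (one per operation versus one in total).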
As described above, during the execution of the whole computation graph, the way data are stored in hardware can be abstracted as a feature map unrolled into a one-dimensional data tensor according to certain rules; operations that do not change the values of the data but only change their dimensions or arrangement can therefore be pruned out of the computation graph. Unlike implementations similar to GPUs and CPUs, which perform a dedicated rearrangement of the data in hardware, the implementation of the invention can fold such processing into other associated operations. For example, in the operation of the preceding computation-graph node, the data can be stored directly into the external memory (for example, DDR) according to the arrangement of the converted dimensions; alternatively, the data can be stored in the usual way and loaded according to the new dimensions at the next computation-graph node, so that the impact of dedicated data rearrangement and dimension conversion is erased and the execution efficiency of the hardware is improved. Other operations that have no effect on the required computation result can likewise be pruned away.
Fig. 8 shows the omission of the FLATTEN (flattening) operation according to an embodiment of the invention. Fig. 9 shows the omission of the CONCAT (concatenation) operation according to an embodiment of the invention. Since the concatenation operation involves the merging of network branches, it needs to be marked in the graph; the dotted box around CONCAT indicates that, in the implementation of the invention, the dimension transformation involved in the concatenation operation is fused into the storing process of the preceding CONV operation.
Here, the FLATTEN operation "flattens" the feature map after a single convolution, i.e. makes it one-dimensional. In branched networks (such as GoogLeNet), the concatenation operation of multiple layers takes the outputs of several upper layers as the inputs of CONCAT; the CONCAT operation concatenates the data of each input layer channel-wise into a new layer, which is then output to the next layer. FLATTEN and CONCAT are both pure data-rearrangement and dimension-mapping operations, and can be omitted by specifically prescribing the way the data are stored into and/or read from memory.
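The CONCAT omission of Fig. 9 can be sketched as each branch storing its result at its own channel offset in external memory (a minimal illustration; the list-of-channels memory model is an assumption made for clarity):

```python
def store_with_concat(ddr, branch_out, channel_offset):
    """Fold CONCAT into each branch's store: the branch writes its channels
    at its own offset in external memory, so no separate concatenation pass
    over the data is ever executed."""
    for i, channel in enumerate(branch_out):
        ddr[channel_offset + i] = channel

# Two hypothetical branches with 2 and 3 output channels:
a = ["a0", "a1"]
b = ["b0", "b1", "b2"]
ddr = [None] * (len(a) + len(b))
store_with_concat(ddr, a, 0)
store_with_concat(ddr, b, len(a))
# ddr now holds the concatenated layout a + b without a CONCAT operation
```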
Fig. 10 shows the fusion of the BatchNorm (batch-normalization) operation and the Scale (scaling) operation according to an embodiment of the invention. During computation, the BatchNorm and Scale operations and their parameters can be merged directly into the preceding CONV layer.
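A sketch of this folding for a toy 1x1 "convolution" with two output channels (all numbers are made up for illustration): per output channel, conv followed by BatchNorm and Scale collapses into a single conv with adjusted weights and bias.

```python
import math

def fold_bn_scale(w, b, mean, var, gamma, beta, eps=1e-5):
    """Merge BatchNorm + Scale into the preceding conv:
        gamma * (conv(x) - mean) / sqrt(var + eps) + beta
    becomes a conv with weights w * s and bias b * s + beta - mean * s,
    where s = gamma / sqrt(var + eps) per output channel."""
    s = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    w_f = [[wij * si for wij in row] for row, si in zip(w, s)]
    b_f = [bi * si + be - m * si for bi, si, be, m in zip(b, s, beta, mean)]
    return w_f, b_f

def matvec(w, x):
    """Toy stand-in for a 1x1 convolution (bias applied separately)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
```

Running `matvec` with the folded weights and bias gives the same output as running the conv, BatchNorm and Scale steps separately, which is why the two normalization layers disappear from the optimized graph.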
To achieve more efficient data reuse, layer splitting can also be performed in addition to the layer-fusion methods shown above. In one embodiment, the implementation method of the invention may further include decomposing the first operation and/or the at least one other operation, wherein each decomposed first operation and/or other operation is directed to a part of its original input. In other words, the decomposition may be the decomposition of a layer within a single branch, or the decomposition of a trunk node of a multi-branch network.
In one embodiment, the decomposed first operation and/or at least one other operation may be split CONV operations, and the input of each split CONV operation is a part of the original feature map. Fig. 11 shows an example of the splitting operation according to an embodiment of the invention. When the processing capacity of the hardware or the like is limited, layers can be processed in groups, with a grouping (group) parameter set to reduce the number of parameters and the amount of computation. Although, with the general growth in processing capacity, current hardware no longer needs the group parameter to be set, for the sake of compatibility the implementation of the invention can still split a convolution with a group parameter into several small convolutions that are then concatenated together, thereby extending the generality of the hardware.
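The splitting of a grouped convolution into small per-group convolutions can be sketched as follows (a toy 1x1 case with made-up weights; each group sees only its own slice of the input channels):

```python
def matvec(w, x):
    """Toy stand-in for a 1x1 convolution."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def split_grouped_conv(w_groups, x):
    """Run a convolution with a group parameter as several small
    convolutions whose outputs are concatenated back together."""
    n = len(x) // len(w_groups)          # input channels per group
    out = []
    for g, w in enumerate(w_groups):
        out += matvec(w, x[g * n:(g + 1) * n])
    return out

# group = 2: each single-output group convolves its own 2 input channels
w_groups = [[[1.0, 2.0]], [[3.0, 4.0]]]
# split_grouped_conv(w_groups, [1.0, 1.0, 1.0, 1.0]) -> [3.0, 7.0]
```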
In another embodiment, the decomposed layer can be a trunk node in the computation graph of a branched network. In other words, the layer to be decomposed originally needs to receive inputs from multiple feature maps, and the multiple layers obtained after decomposition can each be merged into their respective branches for single-branch fusion. Preferably, the decomposed first operation and/or at least one other operation are split POOL operations, and the input of each split POOL operation is the complete feature-map output of a preceding operation.
Figs. 12 and 13 show typical structures in GoogLeNet and the optimized structures after applying the data-reuse scheme of the invention: Fig. 12 shows adjacent Inception structures in GoogLeNet v1 and Fig. 13 shows adjacent Inception structures in GoogLeNet v2, with the original structure on the left of each figure and the structure optimized according to the scheme of the invention on the right. As before, the dotted box around CONCAT → MEM indicates that, in the implementation of the invention, the dimension transformation involved in the concatenation operation is fused into the storing process of the preceding CONV operation. In addition, for ease of understanding, italic and underscored formats are used in the figures to distinguish successive operations of the same name and their related operations.
Specifically, the figures involve not only the fusion of the basic structure, i.e. successive convolution (CONV), non-linear (ReLU) and pooling (POOL) operations, but also layer splitting. For example, concatenating the results of several convolution operations and then pooling is computationally equivalent to pooling directly after each convolution and then concatenating; the pooling operation that follows the concatenation can therefore be split and merged into each preceding convolution operation, achieving partial reconstruction and optimization of the computation graph.
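The equivalence that justifies this split can be checked on toy data (per-channel max pooling over flattened 2x2 windows; the shapes and values are illustrative assumptions):

```python
def max_pool(block):
    """Per-channel max over a pooling window; each channel is a flat list."""
    return [max(channel) for channel in block]

# two hypothetical branches with 2 and 3 channels of flattened 2x2 windows
a = [[0, 1, 2, 3], [4, 5, 6, 7]]
b = [[8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

lhs = max_pool(a + b)                 # CONCAT first, then POOL
rhs = max_pool(a) + max_pool(b)       # POOL per branch, then CONCAT
# both orderings yield the same channels, so the POOL can move into each branch
```

The swap holds because pooling acts on each channel independently, while CONCAT only stacks channels; the two operations commute.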
In actual use, the solution of the invention can be realized by a neural-network computing system. The system may include an off-chip memory device for storing the parameters and feature maps required for neural-network computation; and a computation executing device that includes an on-chip cache and that, for each of the different parts of the feature map divided according to the capacity the on-chip cache can hold, performs the following in turn: reading from the external memory the specific part of the feature map required by the operations, performing at least two operations on the specific part, and storing the operation result back to the off-chip memory device.
In the neural-network computing system of the invention, some or all of the functions of the computation executing device that performs the neural-network computation can be realized by digital circuitry. In one embodiment, the computing system of the invention can be realized on a system-on-chip (SoC) comprising a general-purpose processor, memory and digital circuitry. Fig. 14 shows an example of an SoC that can be used to realize the computing system of the invention.
In one embodiment, the learning network required by this system, such as a convolutional neural network, can be realized by the digital-circuit part (for example, an FPGA or GPU) on the SoC. The computation executing device can be a highly parallel computing device based on an FPGA, GPU or ASIC. Since a CNN performs parallel computation, realizing the computing function of a convolutional neural network in logic hardware, especially an FPGA, has a natural computing advantage and can achieve lower power consumption than software execution.
In one embodiment, all the parameters of the CNN obtained in prior training can be stored in the external memory. When neural-network inference is subsequently performed, data reuse between operations can be maximized according to the computation graph optimized and restructured by the layer-fusion and decomposition schemes of the invention (for example, the right-hand computation graphs of Figs. 12 and 13), reducing the data transfers between the main memory and the on-chip cache and thereby improving system efficiency.
Fig. 15 shows a computing platform for performing neural-network computation according to the invention. The computing platform 1500 may include a data processing module 1510, a data storage module 1520 and a control module 1530.
The data processing module 1510 can be used to perform predetermined computation processing on input data and generate output data. The data storage module 1520 can be used to cache the input data required by the data processing module or the intermediate data output by the data processing module. The control module 1530 can be used to control the data processing module 1510 and the data storage module 1520 so as to execute the layer fusion and decomposition schemes according to the invention. In one embodiment, the specific architecture of the computing platform of the invention can be realized by the programmable logic module shown in, for example, Fig. 14, where the data processing module 1510 corresponds to the complex computing core that performs the CNN operations, the data storage module 1520 corresponds to the input and output buffers, and the control module 1530 corresponds to the controller in the figure.
The computing-platform implementation scheme for neural networks according to the invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the invention can also be implemented as a computer program or computer program product comprising computer program code instructions for executing the steps defined in the above method of the invention.
Alternatively, the invention can also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored; when the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or computing device, server, etc.), the processor is caused to execute the steps of the above method according to the invention.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show the possible architectures, functions and operations of the systems and methods of multiple embodiments according to the invention. In this regard, each box in a flowchart or block diagram can represent a module, a program segment, or a part of code, which comprises one or more executable instructions for realizing the prescribed logic function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can also occur in an order different from that marked in the drawings. For example, two consecutive boxes can in fact be executed essentially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be realized by a dedicated hardware-based system that executes the prescribed functions or operations, or by a combination of dedicated hardware and computer instructions.
In addition, "first" and "second" as used in the invention are intended to indicate different objects, not to limit the order of execution or the like; for example, "first-part data" and "second-part data" in the text indicate different parts belonging to the feature map, and "first subsequent operation" and "second subsequent operation" are only used to distinguish two different subsequent operations.
Various embodiments of the invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be obvious to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The choice of the terms used herein is intended to best explain the principles of the embodiments, their practical application or their improvement over technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (20)
1. A computing-platform implementation method for a neural network, the computing platform reading data required for computation from an external memory and caching the read data and the intermediate computation results of each operation in an on-chip cache, the method comprising:
reading from the external memory first-part data of a feature map required by a first operation, performing the first operation and at least one other operation on the first-part data, and storing the operation result for the first-part data back to the external memory;
reading from the external memory second-part data of the feature map required by the first operation, and performing on the second-part data the above-described reading from the external memory, the first operation and the at least one other operation, and the storing back to the external memory.
2. The method of claim 1, wherein the feature map comprises the first-part data and at least one item of second-part data, and the method comprises:
performing, for each item of second-part data, the above-described reading from the external memory, the first operation and the at least one other operation, and the storing back to the external memory, wherein the combined operation results for the first-part data and the at least one item of second-part data are equivalent to the results of sequentially performing the first operation and the at least one other operation on the feature map.
3. The method of claim 1, wherein the size of the partial data read each time from the external memory is determined at least in part based on the capacity of the on-chip cache.
4. the method for claim 1, wherein at least one other operation is at least one of following:
At least one subsequent operation of first operation, the subsequent operation at least need the complete computation of first operation
As a result all operationss are completed with it;And
There is at least one lateral operation of same characteristic features figure input with first operation.
5. The method of claim 4, wherein the first operation and the at least one other operation comprise successive convolution (CONV), non-linear (ReLU) and pooling (POOL) operations.
6. The method of claim 4, further comprising:
reading from the external memory partial data from other feature maps required by a first subsequent operation, and performing the first subsequent operation on the partial data from the other feature maps and the partial data from the feature map.
7. The method of claim 6, wherein the first subsequent operation is a dot-product (ELTWISE) operation.
8. The method of claim 4, further comprising:
merging a second subsequent operation into the first operation.
9. The method of claim 8, wherein the second subsequent operation comprises batch-normalization (BatchNorm) and scaling (Scale) operations, and the first operation is a CONV operation.
10. The method of claim 1, further comprising:
storing operation results back to the external memory in a required dimension arrangement, and/or reading preceding operation results from the external memory in a required dimension arrangement, so as to omit operations in the neural network that are only used to change the dimensions or arrangement of data.
11. The method of claim 10, wherein the omitted operation is at least one of the following:
a concatenation (CONCAT) operation; and
a flattening (FLATTEN) operation.
12. The method of claim 1, further comprising:
decomposing the first operation and/or the at least one other operation, wherein each decomposed first operation and/or other operation is directed to a part of its original input.
13. The method of claim 12, wherein the decomposed first operation and/or at least one other operation are split CONV operations, and the input of each split CONV operation is a part of the original feature map.
14. The method of claim 12, wherein the decomposed first operation and/or at least one other operation are split POOL operations, and the input of each split POOL operation is the complete feature-map output of a preceding operation.
15. A neural-network computing system, comprising:
an off-chip memory device for storing parameters and feature maps required for neural-network computation; and
a computation executing device comprising an on-chip cache, the computation executing device performing the following in turn for each of the different parts of the feature map divided according to the capacity the on-chip cache can hold:
reading from the external memory the specific part of the feature map required by the operations, performing at least two operations on the specific part, and storing the operation result back to the off-chip memory device.
16. The system of claim 15, wherein the computation executing device is a highly parallel computing device based on an FPGA, GPU or ASIC.
17. The system of claim 15, wherein the at least two operations are at least one of the following:
horizontal operations using the same feature map as input; and
sequential operations based on the operation result of a preceding operation.
18. A computing platform for a neural network, comprising:
a data processing module for performing predetermined computation processing on input data and generating output data;
a data storage module for caching the input data required by the data processing module or the intermediate data output by the data processing module; and
a control module for controlling the data processing module and the data storage module so as to execute the method of any one of claims 1-14.
19. The computing platform of claim 18, wherein the data processing module is a convolution computation module for performing convolution computation on the input data.
20. A non-transitory machine-readable storage medium on which executable code is stored, the executable code, when executed by the processor of an electronic device, causing the processor to execute the method of any one of claims 1-14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810297916.6A CN110321064A (en) | 2018-03-30 | 2018-03-30 | Computing platform realization method and system for neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110321064A true CN110321064A (en) | 2019-10-11 |
Family
ID=68112541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810297916.6A Pending CN110321064A (en) | 2018-03-30 | 2018-03-30 | Computing platform realization method and system for neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110321064A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914999A (en) * | 2020-07-30 | 2020-11-10 | 云知声智能科技股份有限公司 | Method and equipment for reducing calculation bandwidth of neural network accelerator |
CN112698954A (en) * | 2021-01-14 | 2021-04-23 | 上海交通大学 | Coarse-grained reconfigurable array scheduling method based on subgraph decoupling |
CN112884123A (en) * | 2021-02-23 | 2021-06-01 | 杭州海康威视数字技术股份有限公司 | Neural network optimization method and device, electronic equipment and readable storage medium |
WO2021248941A1 (en) * | 2020-06-08 | 2021-12-16 | 深圳市九天睿芯科技有限公司 | All-on-chip storage neural network accelerator and implementation method therefor |
CN114356235A (en) * | 2021-12-31 | 2022-04-15 | Oppo广东移动通信有限公司 | Data standardization processing method and device, electronic equipment and storage medium |
CN114968602A (en) * | 2022-08-01 | 2022-08-30 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
WO2023212975A1 (en) * | 2022-05-06 | 2023-11-09 | 北京灵汐科技有限公司 | Mapping method, electronic device and computer-readable storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105892989A (en) * | 2016-03-28 | 2016-08-24 | 中国科学院计算技术研究所 | Neural network accelerator and operational method thereof |
CN106250981A (en) * | 2015-06-10 | 2016-12-21 | 三星电子株式会社 | The impulsive neural networks of bandwidth consumption in minimizing memory access and network |
CN106951962A (en) * | 2017-03-22 | 2017-07-14 | 北京地平线信息技术有限公司 | Compound operation unit, method and electronic equipment for neutral net |
CN107203807A (en) * | 2016-03-16 | 2017-09-26 | 中国科学院计算技术研究所 | The computational methods of neutral net, system and its apparatus |
CN107437110A (en) * | 2017-07-11 | 2017-12-05 | 中国科学院自动化研究所 | The piecemeal convolution optimization method and device of convolutional neural networks |
CN107451654A (en) * | 2017-07-05 | 2017-12-08 | 深圳市自行科技有限公司 | Acceleration operation method, server and the storage medium of convolutional neural networks |
CN107657581A (en) * | 2017-09-28 | 2018-02-02 | 中国人民解放军国防科技大学 | Convolutional neural network CNN hardware accelerator and acceleration method |
CN107704923A (en) * | 2017-10-19 | 2018-02-16 | 珠海格力电器股份有限公司 | Convolutional neural networks computing circuit |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021248941A1 (en) * | 2020-06-08 | 2021-12-16 | 深圳市九天睿芯科技有限公司 | All-on-chip storage neural network accelerator and implementation method therefor |
CN111914999A (en) * | 2020-07-30 | 2020-11-10 | 云知声智能科技股份有限公司 | Method and device for reducing the computation bandwidth of a neural network accelerator |
CN111914999B (en) * | 2020-07-30 | 2024-04-19 | 云知声智能科技股份有限公司 | Method and device for reducing the computation bandwidth of a neural network accelerator |
CN112698954A (en) * | 2021-01-14 | 2021-04-23 | 上海交通大学 | Coarse-grained reconfigurable array scheduling method based on subgraph decoupling |
CN112698954B (en) * | 2021-01-14 | 2022-05-10 | 上海交通大学 | Coarse-grained reconfigurable array scheduling method based on subgraph decoupling |
CN112884123A (en) * | 2021-02-23 | 2021-06-01 | 杭州海康威视数字技术股份有限公司 | Neural network optimization method and device, electronic equipment and readable storage medium |
CN112884123B (en) * | 2021-02-23 | 2024-03-01 | 杭州海康威视数字技术股份有限公司 | Neural network optimization method and device, electronic equipment and readable storage medium |
CN114356235A (en) * | 2021-12-31 | 2022-04-15 | Oppo广东移动通信有限公司 | Data standardization processing method and device, electronic equipment and storage medium |
WO2023124654A1 (en) * | 2021-12-31 | 2023-07-06 | Oppo广东移动通信有限公司 | Data standardization processing method and apparatus, electronic device, and storage medium |
WO2023212975A1 (en) * | 2022-05-06 | 2023-11-09 | 北京灵汐科技有限公司 | Mapping method, electronic device and computer-readable storage medium |
CN114968602A (en) * | 2022-08-01 | 2022-08-30 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
CN114968602B (en) * | 2022-08-01 | 2022-10-21 | 成都图影视讯科技有限公司 | Architecture, method and apparatus for a dynamically resource-allocated neural network chip |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321064A (en) | Computing platform realization method and system for neural network | |
CN110321999A (en) | Neural network computation graph optimization method | |
CN110175671A (en) | Neural network construction method, image processing method and device | |
Feng et al. | Computer vision algorithms and hardware implementations: A survey | |
CN106022468A (en) | Artificial neural network processor integrated circuit and design method therefor | |
US11093225B2 (en) | High parallelism computing system and instruction scheduling method thereof | |
CN108416436A (en) | Method and system for partitioning a neural network using multi-core processing modules | |
CN110458280B (en) | Convolutional neural network acceleration method and system suitable for mobile terminal | |
CN109496294A (en) | Compilation method and system for artificial intelligence processor, storage medium and terminal | |
CN105739951B (en) | GPU-based fast solution method for L1-minimization problems | |
WO2023093724A1 (en) | Neural network model processing method and device | |
CN111768004A (en) | Model self-adaptation method and system based on intelligent computing framework | |
CN109496319A (en) | Artificial intelligence processor hardware optimization method, system, storage medium and terminal | |
Wang et al. | Nonlinear tensor train format for deep neural network compression | |
US20240160689A1 (en) | Method for optimizing convolution operation of system on chip and related product | |
CN109685208B (en) | Method and device for accelerating sparsification and merging of neural network processor data | |
Dey et al. | Accelerating training of deep neural networks via sparse edge processing | |
CN114003201A (en) | Matrix transformation method and device and convolutional neural network accelerator | |
JP7363145B2 (en) | Learning device and learning method | |
WO2023071658A1 (en) | AI model processing method and apparatus, and AI model computing method and apparatus | |
Morcel et al. | FPGA-based accelerator for deep convolutional neural networks for the Spark environment | |
Sun et al. | Computation on sparse neural networks and its implications for future hardware | |
Li et al. | Memory saving method for enhanced convolution of deep neural network | |
CN107644143B (en) | High-performance urban cellular automaton (CA) model construction method based on vectorization and parallel computation | |
CN105573834A (en) | Vocabulary tree construction method for high-dimensional data based on a heterogeneous platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | |

Effective date of registration: 2020-09-04
Address after: Unit 01-19, 10/F, 101, 6/F, Building 5, Yard 5, Anding Road, Chaoyang District, Beijing 100029
Applicant after: Xilinx Electronic Technology (Beijing) Co., Ltd.
Address before: 17/F, Building 4, Zone 4, No. 1 Wangzhuang Road, Haidian District, Beijing 100083
Applicant before: BEIJING DEEPHI INTELLIGENT TECHNOLOGY Co., Ltd.

RJ01 | Rejection of invention patent application after publication | |

Application publication date: 2019-10-11