CN110321064A - Computing platform implementation method and system for neural network - Google Patents

Computing platform implementation method and system for neural network

Info

Publication number
CN110321064A
Authority
CN
China
Prior art keywords
data
external memory
feature map
read
on-chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810297916.6A
Other languages
Chinese (zh)
Inventor
隋凌志
王雨顺
刘鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xilinx Technology Beijing Ltd
Original Assignee
Beijing Shenjian Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenjian Intelligent Technology Co Ltd
Priority to CN201810297916.6A
Publication of CN110321064A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0613Improving I/O performance in relation to throughput
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/068Hybrid storage device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a computing platform implementation method for a neural network. The computing platform reads required data from an external memory and caches the read data and the intermediate computation results of each operation in an on-chip cache. The method includes: reading from the external memory first partial data of the feature map required by a first operation; performing on the first partial data operations including the first operation and at least one further operation; storing the operation result for the first partial data back to the external memory; then reading from the external memory second partial data of the feature map required by the first operation, and performing on the second partial data the same read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory. With this scheme, the number of data transfers between off-chip storage and the on-chip cache can be greatly reduced, improving overall processing speed.

Description

Computing platform implementation method and system for neural network
Technical field
The present invention relates to the field of deep learning, and in particular to a computing platform implementation method for a neural network and a related system.
Background art
Neural networks have in recent years become a research focus in the field of image recognition. Trained neural network models can be used in numerous areas such as image classification, object recognition and saliency detection. Neural network models have shown a trend of growing computation scale and rising complexity, and traditional CPU platforms can no longer satisfy their practical demands. Accelerator design for neural networks based on heterogeneous computing platforms such as FPGAs and GPUs has therefore become a new research focus. Among these, FPGAs achieve higher energy efficiency than GPU platforms due to their low power consumption, and their fast iteration and hardware reconfigurability better match the rapid development of algorithms.
When a highly parallel computing platform such as one based on FPGAs or GPUs performs neural network inference, the execution time required for the computation itself is very short compared with the time cost of reading the required parameters, which makes memory reads the bottleneck that limits processing speed.
Therefore, there remains a need for a scheme that can improve the overall processing efficiency of convolutional neural network computing platforms.
Summary of the invention
In order to solve at least one of the above problems, the invention proposes a computing platform implementation scheme for a neural network that, without changing the underlying computation results of the computation graph, reduces data transfers between memory and the chip as much as possible through layer-based decomposition and fusion of the computation graph, makes more effective use of hardware resources, and improves the overall processing efficiency of neural network computing platforms, in particular convolutional neural network computing platforms.
According to one aspect of the invention, a computing platform implementation method for a neural network is provided. The computing platform reads required data from an external memory and caches the acquired data and the intermediate results of on-chip operations in an on-chip cache. The method includes: reading from the external memory first partial data of the feature map required by a first operation; performing on the first partial data operations including the first operation and at least one further operation; storing the operation result for the first partial data back to the external memory; reading from the external memory second partial data of the feature map required by the first operation; and performing on the second partial data the above read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory.
Unnecessary memory reads are thereby reduced through data reuse, improving overall processing efficiency.
The feature map may include the first partial data and at least one item of second partial data, and the method includes: for each item of second partial data, respectively performing the above read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory, wherein the combined operation results for the first partial data and the at least one item of second partial data are equivalent to the result of performing the first operation and the at least one further operation on the feature map as a whole.
Preferably, the size of the partial data read each time from the external memory is determined at least in part by the capacity of the on-chip cache.
The at least one further operation may be at least one of the following: at least one subsequent operation of the first operation, where the subsequent operation requires at least the complete computation result of the first operation to complete all of its operations; and at least one lateral operation that shares the same feature-map input with the first operation.
In one embodiment, the first operation and the at least one further operation include successive convolution (CONV), nonlinear (ReLU) and pooling (POOL) operations.
The computing platform implementation method of the invention may also include: reading from the external memory partial data from other feature maps required by a first subsequent operation, and performing the first subsequent operation on the partial data from the other feature maps together with the partial data from the feature map. The storage optimization is thus adapted to neural networks with branches. Preferably, the first subsequent operation may be an ELTWISE operation.
The computing platform implementation method of the invention may also include merging a second subsequent operation into the first operation. Preferably, the second subsequent operation may be batch normalization (BatchNorm) and scaling (Scale) operations, and the first operation may be a CONV operation.
The computing platform implementation method of the invention may also include storing operation results back to the external memory in a transformed dimension arrangement, so as to omit operations in the neural network that merely change data dimensions or arrangement. Concatenation (CONCAT) and flattening (FLATTEN) operations can thus be omitted outright by prescribing the storage layout.
The computing platform implementation method of the invention may also include layer splitting, for example decomposing the first operation and/or at least one of the further operations, where each decomposed first operation and/or further operation acts on a part of its original input. The decomposed first operation and/or further operations may be split CONV operations, the input of each split CONV operation being a part of the original feature map. The decomposed first operation and/or further operations may also be split POOL operations, the input of each split POOL operation being the complete feature-map output of a preceding operation.
According to another aspect of the invention, a neural network implementation system is provided, comprising: an off-chip memory device for storing the parameters and feature maps required for neural network computation; and a computation execution device including an on-chip cache. For the different parts of the feature map, divided according to the capacity that the on-chip cache can hold, the computation execution device performs the following in turn: reading a specific part of the feature map required by an operation from the external memory, performing at least two operations on the specific part, and storing the operation result back to the off-chip memory device.
Preferably, the computation execution device is a highly parallel computing device based on an FPGA, GPU or ASIC.
According to a further aspect of the invention, a computing platform for a neural network is provided, comprising: a data processing module for performing predetermined computation on input data and generating output data; a data storage module for caching the input data required by the data processing module or the intermediate data output by the data processing module; and a control module for controlling the data processing module and the data storage module so as to perform the method of any one of the above.
The data processing module may be a convolution computation module for performing convolution on the input data.
According to yet another aspect of the invention, a non-transitory machine-readable storage medium is provided, on which executable code is stored. When the executable code is executed by the processor of an electronic device, the processor performs the method of any one of the above.
By adopting the neural network computing implementation scheme of the invention, the number of data transfers between off-chip storage and the on-chip cache can be greatly reduced, eliminating the bottleneck effect of bandwidth on neural network computation. The implementation scheme of the invention exploits extensive reuse of shared inputs and/or intermediate data, through the fusion, decomposition and reconstruction of computing operations, to avoid unnecessary bandwidth consumption; it omits time-consuming data rearrangement operations by arranging storage and/or reads appropriately; and it can merge certain subsequent operations into preceding operations. Each class of operation and data access involved in neural network computation is thus optimized at the process level, improving overall computational efficiency.
Brief description of the drawings
The above and other objects, features and advantages of the disclosure will become more apparent from the more detailed description of exemplary embodiments of the disclosure in conjunction with the accompanying drawings, in which identical reference labels generally represent identical parts.
Fig. 1 shows the series of ordered layers that make up a typical CNN.
Figs. 2A-2C show the computation graph structures of representative existing CNN networks.
Fig. 3 shows a flowchart of a computing platform implementation method for a neural network according to an embodiment of the invention.
Fig. 4 shows the existing network computation graph for the Vgg basic structure and the graph optimized according to the invention.
Fig. 5 shows the existing network computation graph for the ResNet basic structure and the graph optimized according to the invention.
Fig. 6 shows an example of vertical convolution fusion according to an embodiment of the invention.
Fig. 7 shows an example of lateral convolution fusion according to an embodiment of the invention.
Fig. 8 shows the omission of the FLATTEN (flattening) operation according to an embodiment of the invention.
Fig. 9 shows the omission of the CONCAT (concatenation) operation according to an embodiment of the invention.
Fig. 10 shows the fusion of the BatchNorm (batch normalization) and Scale (scaling) operations according to an embodiment of the invention.
Fig. 11 shows an example of a splitting operation according to an embodiment of the invention.
Fig. 12 shows adjacent Inception structures in GoogLeNet v1, both existing and optimized according to the invention.
Fig. 13 shows adjacent Inception structures in GoogLeNet v2, both existing and optimized according to the invention.
Fig. 14 shows an example of an SoC that can be used to implement the computing system of the invention.
Fig. 15 shows a platform for performing neural network computation according to an embodiment of the invention.
Detailed description of embodiments
Preferred embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Basic concepts of the neural network processor
With the continuous development of artificial intelligence, machine learning and neural network algorithms in recent years, convolutional neural networks have achieved performance surpassing humans in image classification, recognition, detection, tracking and similar tasks. Because convolutional neural networks have enormous parameter scales and computation volumes, and place high demands on hardware platform stability and the computation-to-energy ratio, accelerator design based on heterogeneous computing platforms such as FPGAs and GPUs has become a new research focus. Compared with GPU platforms, FPGAs achieve higher energy efficiency due to their low power consumption, and their fast iteration and hardware reconfigurability better match the rapid development of algorithms. However, existing FPGA- and GPU-based CNN accelerator designs suffer from problems such as supporting only a single network structure, unreasonable bandwidth usage and low utilization of computing resources, and cannot satisfy ever higher real-time requirements. Considerable room for research and discovery remains in heterogeneous accelerator design for CNNs.
In contrast with a general computing platform (one containing only a host or CPU), the invention is directed at a neural network dedicated processor specially designed for performing neural network computation. Those skilled in the art will appreciate that the term "neural network dedicated processor" as used in this application may also be referred to simply as a "neural network processor" or "NN processor". Since deep learning is currently the most popular technical branch of neural network technology, the neural network dedicated processor may be implemented as a deep learning dedicated processor or deep learning processor. Those skilled in the art will understand, however, that neural networks have various technical branches, such as deep neural networks (DNN, Deep Neural Network) and convolutional neural networks (CNN, Convolutional Neural Network), so the neural network dedicated processor may also be implemented as a deep neural network dedicated processor or deep neural network processor (DNN processor or CNN processor). That is, neural network computing implementation techniques for a "deep learning processor" or "deep neural network processor" on heterogeneous computing platforms also fall within the scope of the invention.
A DPU (Deep-learning Processing Unit) is a general acceleration platform for neural network algorithms in artificial intelligence. It exploits the high parallelism and low power consumption of FPGAs to perform inference based on convolutional neural networks (hereinafter CNN). Herein, the DPU may be regarded as one concrete implementation of the above "deep learning processor", "deep neural network processor" or "neural network processor". The following description is based primarily on a DPU implemented on an FPGA using a CNN structure, but those skilled in the art will understand that the principle of the invention applies equally to neural network processors that perform inference for other neural networks on hardware structures such as GPUs.
In the mapping of algorithms to instructions on the DPU platform, it is necessary to decouple from the deep learning framework, parse the diverse CNN algorithms, and construct the computation graph structure corresponding to the DPU. Coarse-grained optimization is then performed on the graph structure, including node pruning and fusion, finally forming subgraphs and subgraph configurations, which guide instruction generation by the DPU compiler.
Since the DPU is a general acceleration platform for neural network algorithms in artificial intelligence, it must support fixed-point models from different platforms; at the same time, the DPU supports configurable hardware parameters so that, as a hardware IP core, it can be deployed on different hardware platforms. Therefore, the algorithm-to-instruction mapping process must first parse network computation graphs from different deep learning platforms and find opportunities for graph optimization, so as to maximize hardware computational efficiency as far as possible. When the DPU performs neural network inference, multiple operations may be executed at each node of the computation graph as the graph is executed. These operations are completed on the DPU and can be understood as calls to different computing modules on the DPU. In general, the time the DPU spends on computation is very short compared with the time cost of reading the required data from external memory, which makes memory reads the bottleneck of system processing capacity. Finding a smaller and faster computation graph structure that does not change the underlying computation results of the graph, minimizes the transfers between memory and on-chip data, and makes more effective use of hardware resources: such a computation graph optimization method is a key link in the DPU algorithm-to-instruction mapping process. The computing implementation scheme of the invention is the bridge between deep learning algorithms and the DPU compiler, serves the important role of inter-layer optimization of the neural network, and is a core front-end algorithm of the compiler.
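To make the coarse-grained graph optimization concrete, the following is a minimal sketch of a fusion pass over a toy computation graph IR, written in Python for exposition. All node, field and function names here are assumptions made for illustration; they are not the DPU compiler's actual data structures or API.

```python
# Toy fusion pass: merge CONV -> RELU -> POOL chains into one fused node so
# that the intermediates never make a round trip to external memory.
# Assumes `nodes` is topologically ordered; illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    op: str                                # e.g. "CONV", "RELU", "POOL"
    inputs: list = field(default_factory=list)

def fuse_conv_relu_pool(nodes):
    consumers = {}
    for n in nodes:
        for src in n.inputs:
            consumers.setdefault(src.name, []).append(n)

    out, absorbed = [], set()
    for n in nodes:
        if n.name in absorbed:
            continue
        chain = [n]
        if n.op == "CONV":
            cur = n
            for want in ("RELU", "POOL"):
                nxt = consumers.get(cur.name, [])
                # fuse only when the intermediate has exactly one consumer
                if len(nxt) == 1 and nxt[0].op == want:
                    cur = nxt[0]
                    chain.append(cur)
                else:
                    break
        if len(chain) > 1:
            absorbed.update(m.name for m in chain[1:])
            out.append(Node("+".join(m.op for m in chain), "FUSED", n.inputs))
        else:
            out.append(n)
    return out
```

A pass of this kind is what turns the unfused graphs discussed below (left-hand sides of Figs. 4 and 5) into their fused counterparts.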
To better illustrate the computing scheme of the invention, basic CNN concepts, network computation graphs and their basic operations are explained first.
Basic CNN concepts
CNNs achieve state-of-the-art performance in a wide range of vision-related tasks. To help in understanding the CNN-based computing operations analyzed in this application, the basics of CNNs are first introduced on the basis of existing CNN models.
As shown in Fig. 1, a typical CNN consists of a series of ordered layers.
The parameters of a CNN model are called "weights". The first layer of a CNN reads an input image and outputs a series of feature maps. A following layer reads the feature maps generated by the previous layer and outputs new feature maps. Finally, a classifier outputs the probability that the input image belongs to a certain class. CONV (convolutional) layers and FC (fully connected) layers are the two basic layer types in a CNN. A CONV layer is usually followed by a pooling layer.
In this application, for a given CNN layer, f_j^in denotes the j-th input feature map, f_i^out denotes the i-th output feature map, and b_i denotes the bias term of the i-th output map.
For a CONV layer, n_in and n_out represent the numbers of input and output feature maps, respectively.
For an FC layer, n_in and n_out represent the lengths of the input and output feature vectors, respectively.
Definition of CONV (convolutional) layers: a CONV layer takes a series of feature maps as input and obtains the output feature maps by convolution with convolution kernels.
The nonlinear layer usually attached to a CONV layer, that is, a nonlinear activation function, is applied to each element of the output feature maps. The activation function used is generally the ReLU function, so this layer is also commonly called the ReLU layer.
A CONV layer can be expressed by expression (1):

f_i^out = Σ_{j=1}^{n_in} f_j^in ⊛ g_{i,j} + b_i,  1 ≤ i ≤ n_out   (1)

where g_{i,j} is the convolution kernel applied to the j-th input feature map and the i-th output feature map.
Definition of FC (fully connected) layers: an FC layer applies a linear transformation to the input feature vector:
f^out = W f^in + b   (2)
W is an n_out × n_in transformation matrix and b is the bias term. Note that for an FC layer the input is not a combination of several two-dimensional feature maps but a single feature vector. Therefore, in expression (2), the parameters n_in and n_out actually correspond to the lengths of the input and output feature vectors.
Pooling layer: usually attached to a CONV layer, it outputs the maximum or average value of each subarea of each feature map. Max pooling can be expressed by expression (3):

f_{i,(x,y)}^out = max_{0 ≤ m,n < p} f_{i,(x·p+m, y·p+n)}^in   (3)

where p is the size of the pooling kernel. This nonlinear "down-sampling" not only reduces the feature map size and the computation for the next layer, but also provides a form of translation invariance. During forward inference, a CNN can be used to perform image classification.
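For reference, the three layer types defined by expressions (1) to (3) can be written out directly. The following NumPy sketch is illustrative only and is not part of the patent; it uses the correlation form of convolution (no kernel flipping), as CNN frameworks commonly do, with "valid" boundaries.

```python
# Slow reference implementations of expressions (1)-(3).
import numpy as np

def conv_layer(f_in, g, b):
    """(1): f_out[i] = sum_j f_in[j] (*) g[i, j] + b[i], valid correlation."""
    n_out, n_in, k, _ = g.shape
    h, w = f_in.shape[1] - k + 1, f_in.shape[2] - k + 1
    f_out = np.zeros((n_out, h, w))
    for i in range(n_out):
        for j in range(n_in):
            for y in range(h):
                for x in range(w):
                    f_out[i, y, x] += np.sum(f_in[j, y:y + k, x:x + k] * g[i, j])
        f_out[i] += b[i]
    return f_out

def fc_layer(f_in, W, b):
    """(2): f_out = W @ f_in + b, with f_in a flat feature vector."""
    return W @ f_in + b

def max_pool(f_in, p):
    """(3): maximum over each non-overlapping p x p subarea."""
    n, h, w = f_in.shape
    f = f_in[:, :h - h % p, :w - w % p]
    return f.reshape(n, f.shape[1] // p, p, f.shape[2] // p, p).max(axis=(2, 4))
```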
Basic concepts of the network computation graph
To decouple from deep learning computation frameworks, a computation graph structure corresponding to the neural network processor must be constructed. Neural network algorithms from different deep learning platforms are converted into a general computation graph; the computation graph is optimized and reconstructed; and the optimized graph is then mapped to the instructions and machine code of the hardware platform, completing the compilation of the algorithm for the hardware platform.
Figs. 2A-2C show the computation graph structures of representative existing CNN networks. Since the invention reduces the number of data transfers between external memory and the on-chip cache through judicious data reuse, memory access operations have been added to the conventional computation graph structures, which contain only computing operation nodes; "MEM" in the figures refers to a memory read or write operation. This makes visible the bandwidth savings of the technical scheme of the invention compared with the prior art. In a DPU implementation, "MEM" refers to the transfer operations between DDR (double data rate synchronous DRAM) and on-chip data.
Fig. 2A shows the basic structure in the Vgg network. As shown in Fig. 2A, when a branchless network computation graph executes the most basic CONV (convolution), ReLU (nonlinear; the activation function used is generally the ReLU function) and POOL (pooling) operations, the feature map it needs to load is too large, so the data must be transferred repeatedly between DDR and the on-chip cache (implemented, for example, as BRAM, i.e. block RAM). Fig. 2B shows the basic structure in the ResNet network, and Fig. 2C shows adjacent Inception structures in GoogLeNet v1. As shown in Figs. 2B and 2C, branched network computation graphs additionally introduce an ELTWISE (element-wise) layer for adding together the outputs of multiple convolutional layers, and a CONCAT (concatenation) layer for cascading the data of each input layer by channel into a new layer. Likewise, the computation graphs in these figures still show that repeated data transfers between DDR and BRAM are required. It should be understood that "Vgg", "ResNet" and "GoogLeNet" listed above are CNN architectures popular in the prior art, used to explain the principle of the invention rather than to limit it.
Optimization based on data reuse
Since on-chip storage resources are limited, the entire CNN cannot be computed on chip in one pass, and the computation task must be divided into tiles. Once loaded onto the chip, the tiled data can also be used repeatedly through reuse, reducing the traffic with the external memory unit.
The data reuse scheme of the invention is described below in conjunction with Fig. 3. Fig. 3 shows a computing platform implementation method for a neural network according to an embodiment of the invention. The computing platform according to the invention reads the required data (for example, an image to be classified) from the external memory, and caches the read data and the intermediate computation results of each operation in the on-chip cache.
In step S310, first partial data of the feature map required by a first operation is read from the external memory; operations including the first operation and at least one further operation are performed on the first partial data; and the operation result for the first partial data is stored back to the external memory.
In step S320, second partial data of the feature map required by the first operation is read from the external memory, and the above read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory are performed on the second partial data.
Preferably, the feature map may include the first partial data and at least one item of second partial data, and the method includes: for each item of second partial data, respectively performing the above read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory, wherein the combined operation results for the first partial data and the at least one item of second partial data are equivalent to the result of performing the first operation and the at least one further operation on the feature map as a whole.
During the first operation and the at least one further operation performed on each part, the intermediate data generated is cached directly in the on-chip cache. Within these multiple operations on the same partial data, no data is written to the external memory, and usually no data is read from the external memory either (an exception is the ELTWISE operation shown in Fig. 5).
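The equivalence asserted above can be checked with a small sketch. One detail the text leaves to the implementation: for a convolution, adjacent parts must overlap by k - 1 rows or columns (a "halo") so that the stitched per-part results match the whole-map result. The tile boundaries below are assumptions chosen for illustration.

```python
# Check: convolving two overlapping column tiles and concatenating the
# results equals convolving the whole feature map (valid mode, k = 3).
import numpy as np
from scipy.signal import correlate2d

img = np.random.rand(8, 8)
ker = np.random.rand(3, 3)

full = correlate2d(img, ker, mode="valid")          # 6 x 6 result
left = correlate2d(img[:, :5], ker, mode="valid")   # yields output columns 0..2
right = correlate2d(img[:, 3:], ker, mode="valid")  # yields output columns 3..5
tiled = np.concatenate([left, right], axis=1)       # tiles share a 2-column halo

assert np.allclose(full, tiled)                     # per-part results stitch exactly
```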
The data reuse scheme of the invention is described more vividly below in conjunction with Fig. 4. Fig. 4 shows the network computation graph of the Vgg basic structure: the left side is the existing structure, and the right side is the optimized structure based on the invention.
Over the execution of the whole computation graph, the storage of data in hardware can usually be abstracted as a feature map with three dimensions (width, height and channel), plus an optional batch dimension as the fourth, unrolled into a one-dimensional data tensor according to certain rules. For example, in image classification, not only can the initial image be abstracted as a three-dimensional feature map, but the output of every operation (that is, the computation result of each node) can still be regarded as a feature map. Since on-chip storage resources are limited (for example, the on-chip BRAM cannot cache a complete feature map at once), completing an operation on the entire feature map requires multiple reads from DDR to BRAM.
Assume the input feature map is a three-channel (for example, RGB) two-dimensional image and that the on-chip BRAM can hold the data of one image block at a time. In the existing structure on the left of Fig. 4, to complete the CONV, ReLU and POOL operations on this feature map, the three-channel data of, say, the top-left image block is first read from the external memory (for example, DDR), the CONV operation is performed and the result stored back to DDR; then the three-channel data of the top-right image block is read, convolved and stored back to DDR; then the three-channel data of the bottom-left image block is read, convolved and stored back to DDR; and finally the three-channel data of the bottom-right image block is read, convolved and stored back to DDR. The CONV operation on the feature map is now finished. Next, the three-channel data of the CONV-processed top-left block is read from DDR, the ReLU operation performed and the result stored back to DDR, and likewise for the top-right, bottom-left and bottom-right blocks in turn, finishing the ReLU operation on the feature map. Finally, the ReLU-processed top-left, top-right, bottom-left and bottom-right blocks are each in turn read from DDR, pooled, and stored back to DDR, finishing the POOL operation on the feature map.
In the layer-fusion-based structure of the invention on the right of Fig. 4, to complete the CONV, ReLU and POOL operations on the same input feature map, the three-channel data of the top-left image block is first read from DDR; after the CONV operation the result is kept in the on-chip cache (for example, BRAM); the ReLU operation is applied directly to the cached data; after any necessary on-chip caching, the POOL operation is applied directly; and only then is the three-channel data of the top-left image block, now processed by CONV, ReLU and POOL, stored back to DDR. The top-right, bottom-left and bottom-right image blocks are then each read from DDR in turn and likewise processed by CONV, ReLU and POOL entirely on chip before being stored back to DDR.
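The right-hand schedule of Fig. 4 can be summarized by the following sketch, in which every tile makes exactly one round trip to external memory. To keep the tiling exact without halo handling, the convolution here is 1×1; the buffer layout and slicing are assumptions for illustration, not the DPU's actual memory interface.

```python
# Fused CONV + ReLU + POOL over two row tiles: one load and one store per
# tile; the CONV and ReLU intermediates never leave the on-chip buffer.
# Assumes h is divisible by 2 * p so the tiles stitch cleanly.
import numpy as np

def fused_tile_pipeline(feat, w, b, p):
    # feat: (c_in, h, wd) in "DDR"; w: (c_out, c_in) 1x1 kernels; p: pool size
    c_out, (h, wd) = w.shape[0], feat.shape[1:]
    out = np.zeros((c_out, h // p, wd // p))             # result region in "DDR"
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        tile = feat[:, rows, :]                          # single load: DDR -> chip
        x = np.tensordot(w, tile, axes=(1, 0)) + b[:, None, None]   # CONV (1x1)
        x = np.maximum(x, 0.0)                           # ReLU, stays on chip
        th = x.shape[1]
        x = x.reshape(c_out, th // p, p, wd // p, p).max(axis=(2, 4))  # POOL
        out[:, rows.start // p : rows.start // p + th // p, :] = x    # single store
    return out
```

Under this schedule each tile costs one read and one write, versus one read and one write per operation (three of each here) in the unfused schedule on the left of Fig. 4.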
It can be seen that the computing platform implementation scheme of the invention greatly reduces the number of reads from the external memory into the on-chip cache, removing external memory reads as the DPU's efficiency bottleneck and thereby substantially improving the DPU's task execution efficiency. It should be understood that, for convenience of explaining the principle of the invention, the feature map in the example above is cut into four blocks (top-left, top-right, bottom-left and bottom-right) for processing and storage. In practical applications, the feature map can be cut differently as needed, without restriction here.
In the example of Fig. 4, the first operation and the at least one further operation include successive convolution (CONV), nonlinear (ReLU) and pooling (POOL) operations. This is the most common basic structure in CNNs. Here, the at least one further operation (that is, the ReLU and POOL operations) is a subsequent operation based on the first operation; the subsequent operation requires at least the complete computation result of the first operation to complete all of its operations.
In other embodiments, the further operations may have other relationships to the first operation. The data reuse scheme of the invention is described below in conjunction with several examples.
Fig. 5 shows the network computation graph of the ResNet basic structure: the left side is the existing structure, and the right side is the optimized structure based on the invention.
As shown on the right of Fig. 5, besides fusing the convolution (CONV) and nonlinear (ReLU) operations, the data reuse scheme of the invention can also be applied, in branched network structures, to nodes with two or more inputs (such as the ELTWISE operation in the figure). The implementation method of the invention may thus also include reading from the external memory partial data from other feature maps required by a first subsequent operation, and performing the first subsequent operation on the partial data from the other feature maps together with the partial data from the feature map. Here, the ELTWISE (element-wise) operation, as the first subsequent operation, not only reuses the result data of the CONV operation on the right branch of the network but also receives the result data from the left branch as input. The degree of data reuse is thereby further increased.
Similarly, for the CONV operation, the most basic operation in a CNN, the reuse scheme of the invention can fuse different CONV operations. Fig. 6 shows vertical convolution fusion according to an embodiment of the invention. In the vertical layer fusion of two successive convolution operations, all the required parameters can be loaded into the on-chip cache at once, so the two convolution operations can be fused: the intermediate result between the two operations is not written back to the external memory but cached directly on chip, and only after the computation finishes is the result written back. Fig. 7 shows lateral convolution fusion according to an embodiment of the invention. Lateral layer fusion can be understood as follows: several operations sharing the same input feature map need not each load the feature map onto the chip; instead, the results are computed on chip and stored back to their respective locations in the external memory. In other words, a further operation related to the first operation need not be a sequential operation based on the first operation's output; it may also be at least one lateral operation sharing the same feature-map input with the first operation. Lateral fusion means that several computing operations share the same input feature map, saving the bandwidth needed to carry data from the external memory onto the chip; vertical fusion means saving the bandwidth needed to carry inter-layer feature map data back and forth between the external memory and the chip.
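The two fusion directions can be contrasted in a compact sketch. Here conv1, conv2 and the list convs are placeholders for arbitrary on-chip operations; the sketch shows only the scheduling difference and is not the patent's actual code.

```python
# Vertical vs. lateral fusion as schedules (illustrative only).
def vertical_fusion(tile, conv1, conv2):
    # Two successive convolutions: the intermediate y is kept on chip
    # instead of being written back to and re-read from external memory.
    y = conv1(tile)               # NOT stored to DDR
    return conv2(y)               # only the final result goes back to DDR

def lateral_fusion(tile, convs):
    # Several operations share one input feature map: load the tile once,
    # run every branch, and store each result to its own DDR region.
    return [c(tile) for c in convs]   # one load serves all branches
```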
As described above, over the execution of the whole computation graph, the storage of data in hardware can be abstracted as feature maps unrolled into one-dimensional data tensors according to certain rules. Operations that do not change data values but only change data dimensions or arrangement can therefore be pruned out of the computation graph. Unlike GPUs and CPUs, which perform dedicated data rearrangement in hardware, the implementation scheme of the invention can absorb such processing into other related operations: for example, in the operation of the preceding graph node, the data is stored into the external memory (for example, DDR) directly in the arrangement of the transformed dimensions; alternatively, the data can be stored in a generic layout and loaded according to the new dimensions at the next graph node. The cost of dedicated data rearrangement and dimension transformation is thereby erased, and the execution efficiency of the hardware improved. Other operations that have no effect on the required computation results can likewise be removed by pruning.
Fig. 8 shows the omission of the FLATTEN (flattening) operation according to an embodiment of the invention. Fig. 9 shows the omission of the CONCAT (concatenation) operation according to an embodiment of the invention. Since the concatenation operation involves the merging of network branches, it needs to be marked in the graph: the dotted box around CONCAT indicates that, in the implementation scheme of the invention, the dimension transformation involved in the concatenation operation is fused into the storing process of the preceding CONV operation.
Here, the FLATTEN operation "flattens" the feature map after a single convolution, that is, turns it one-dimensional. In branched networks (such as GoogLeNet), for a multi-layer concatenation, the outputs of several upper layers are connected to the input of CONCAT. The CONCAT operation concatenates the data of each input layer by channel into a new layer, which is then output to the next layer. FLATTEN and CONCAT are both pure data rearrangement and dimension transformation operations, and can be omitted by prescribing how the data is stored and/or read.
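How CONCAT is absorbed into the preceding stores can be sketched as follows: each branch writes its output directly at its channel offset in the destination buffer, so the concatenation exists by construction and no separate rearrangement pass runs. The buffer layout is an assumption for illustration.

```python
# CONCAT folded into the stores: writing each branch at its channel offset
# reproduces np.concatenate without a separate CONCAT operation.
import numpy as np

def store_with_concat_folded(branch_outputs, h, w):
    total_c = sum(o.shape[0] for o in branch_outputs)
    ddr = np.empty((total_c, h, w))          # destination region in "DDR"
    off = 0
    for o in branch_outputs:                 # folded into each branch's store
        ddr[off:off + o.shape[0]] = o
        off += o.shape[0]
    return ddr

a, b = np.random.rand(2, 4, 4), np.random.rand(3, 4, 4)
assert np.allclose(store_with_concat_folded([a, b], 4, 4),
                   np.concatenate([a, b], axis=0))
```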
Fig. 10 shows the fusion of the BatchNorm (batch normalization) and Scale (scaling) operations according to an embodiment of the invention. During computation, the BatchNorm and Scale operations and their parameters can be directly merged into the preceding CONV layer.
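The merge is a standard compile-time rewrite: with BatchNorm statistics (mean, var) and Scale parameters (gamma, beta), the CONV weights and bias are rescaled once so that inference runs the convolution alone. The sketch below shows the usual derivation; the Caffe-style parameter names are assumptions for illustration.

```python
# Fold BatchNorm + Scale into the preceding CONV layer at compile time.
import numpy as np

def fold_bn_scale(w, b, mean, var, gamma, beta, eps=1e-5):
    """Returns (w', b') with conv(x, w') + b' == scale(bn(conv(x, w) + b))."""
    s = gamma / np.sqrt(var + eps)           # per-output-channel factor
    w_folded = w * s[:, None, None, None]    # w: (n_out, n_in, k, k)
    b_folded = (b - mean) * s + beta
    return w_folded, b_folded
```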
To achieve more efficient data reuse, layer splitting can also be performed besides the layer fusion methods shown above. In one embodiment, the implementation method of the invention may also include decomposing the first operation and/or at least one of the further operations, where each decomposed first operation and/or further operation acts on a part of its original input. In other words, the decomposition may be of a layer within a single branch, or of a trunk node of a multi-branch network.
In one embodiment, the decomposed first operation and/or at least one further operation may be split CONV operations, the input of each split CONV operation being a part of the original feature map. Fig. 11 shows an example of a splitting operation according to an embodiment of the invention. Where hardware processing capability is limited, layers can be processed in groups, reducing the parameter and computation volume by setting a grouping (group) parameter. Although, with improved processing capability, current hardware no longer needs the group parameter, for backward compatibility the implementation scheme of the invention can still split a convolution with a group parameter into several small convolutions that are then concatenated, expanding the versatility of the hardware.
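The split can be sketched as follows; conv2d stands for any dense convolution primitive, and the shapes follow the usual grouped-convolution convention (both are assumptions made for illustration).

```python
# A convolution with a `group` parameter rewritten as `group` small dense
# convolutions whose outputs are concatenated channel-wise.
import numpy as np

def grouped_as_split(x, w, group, conv2d):
    # x: (c_in, h, w); w: (c_out, c_in // group, k, k)
    cin_g = x.shape[0] // group
    cout_g = w.shape[0] // group
    outs = [conv2d(x[g * cin_g:(g + 1) * cin_g],
                   w[g * cout_g:(g + 1) * cout_g])
            for g in range(group)]
    return np.concatenate(outs, axis=0)      # channel-wise CONCAT
```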
In another embodiment, the decomposed layer may be a trunk node in a branched computation graph. In other words, the layer to be decomposed originally receives input from multiple feature maps, and the multiple layers after decomposition can each be merged into their respective branches for single-branch fusion. Preferably, the decomposed first operation and/or at least one further operation are split POOL operations, and the input of each split POOL operation is the complete feature-map output of a preceding operation.
Figs. 12 and 13 show typical structures in GoogLeNet and their optimized structures after applying the data reuse scheme of the invention: Fig. 12 shows adjacent Inception structures in GoogLeNet v1, and Fig. 13 shows adjacent Inception structures in GoogLeNet v2; the left side of each figure shows the original structure and the right side the structure optimized according to the invention. As before, the dotted box around CONCAT → MEM indicates that, in the implementation scheme of the invention, the dimension transformation involved in the concatenation operation is fused into the storing process of the preceding CONV operation. In addition, for ease of understanding, italics and underscores are used in the figures to distinguish consecutive operations of the same name and their related operations.
Specifically, the figures involve not only the fusion of the basic structure, that is, of successive convolution (CONV), nonlinear (ReLU) and pooling (POOL) operations, but also layer splitting. For example, concatenating the results of several convolution operations and then pooling is equivalent to pooling directly after each convolution and then concatenating. The pooling operation can therefore be split and merged into each convolution operation preceding the concatenation, achieving partial reconstruction and optimization of the computation graph.
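The equivalence relied on here, namely that pooling after concatenation equals concatenating per-branch pooling results, can be verified directly with a small check (illustrative only):

```python
# POOL(CONCAT(a, b)) == CONCAT(POOL(a), POOL(b)) for channel-wise CONCAT,
# which is what permits pushing the pooling into each branch.
import numpy as np

def max_pool(f, p):
    c, h, w = f.shape
    return f.reshape(c, h // p, p, w // p, p).max(axis=(2, 4))

a, b = np.random.rand(3, 8, 8), np.random.rand(5, 8, 8)
lhs = max_pool(np.concatenate([a, b], axis=0), 2)
rhs = np.concatenate([max_pool(a, 2), max_pool(b, 2)], axis=0)
assert np.allclose(lhs, rhs)
```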
In actual use, the scheme of the invention can be realized by a neural network computing system. The system may include an off-chip memory device for storing the parameters and feature maps required for neural network computation, and a computation execution device that includes an on-chip cache. For the different parts of the feature map, divided according to the capacity that the on-chip cache can hold, the computation execution device performs the following in turn: reading a specific part of the required feature map from the external memory, performing at least two operations on the specific part, and storing the operation result back to the off-chip memory device.
In the computing system for a neural network of the invention, some or all of the functions of the computation execution device that performs neural network computation may be implemented by digital circuitry. In one embodiment, the computing system of the invention may be realized in a system on chip (SoC) that includes a general-purpose processor, memory and digital circuitry. Fig. 14 shows an example of an SoC that can be used to implement the computing system of the invention.
In one embodiment, the learning network required by this system, such as a convolutional neural network, may be implemented by the digital circuit portion (for example, an FPGA or GPU) on the SoC. The computation execution device may be a highly parallel computing device based on an FPGA, GPU or ASIC. Since CNN computation is parallel, implementing the convolutional neural network computing function with logic hardware, especially an FPGA, has a natural computational advantage and achieves lower power consumption than software execution.
In one embodiment, all the parameters of the CNN obtained in prior training may be stored in the external memory. When neural network inference is subsequently performed, the computation graph optimized and reconstructed according to the layer fusion and decomposition scheme of the invention (for example, the right-hand computation graphs of Figs. 12 and 13) maximizes data reuse between operations and reduces data transfers between main memory and the on-chip cache, improving system efficiency.
Fig. 15 shows a computing platform for performing neural network computation according to the invention. The computing platform 1500 may include a data processing module 1510, a data storage module 1520 and a control module 1530.
The data processing module 1510 may be used to perform predetermined computation on input data and generate output data. The data storage module 1520 may be used to cache the input data required by the data processing module or the intermediate data output by the data processing module. The control module 1530 may be used to control the data processing module 1510 and the data storage module 1520 so as to perform the layer fusion and decomposition scheme according to the invention. In one embodiment, the specific architecture of the computing platform of the invention may be realized as the programmed logic module shown in Fig. 14, where the data processing module 1510 corresponds to the complex computation core performing the CNN operations, the data storage module 1520 corresponds to the input and output buffers, and the control module 1530 corresponds to the controller in the figure.
The computing platform implementation scheme for a neural network according to the invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) on which executable code (or a computer program, or computer instruction code) is stored. When the executable code (or computer program, or computer instruction code) is executed by the processor of an electronic device (or computing device, server, etc.), the processor is caused to perform the steps of the above method according to the invention.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software or a combination of both.
The flowcharts and block diagrams in the drawings show possible architectures, functions and operations of systems and methods according to multiple embodiments of the invention. In this regard, each box in a flowchart or block diagram may represent a module, program segment or part of code that contains one or more executable instructions for realizing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two consecutive boxes may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each box in the block diagrams and/or flowcharts, and combinations of such boxes, may be realized by dedicated hardware-based systems that perform the specified functions or operations, or by combinations of dedicated hardware and computer instructions.
In addition, "first" and "second" as used in the invention are intended to indicate different objects rather than to restrict execution order or the like. For example, the "first partial data" and "second partial data" referred to in the text are intended to indicate different parts of the feature map, and the "first subsequent operation" and "second subsequent operation" are used only to distinguish two different subsequent operations.
The embodiments of the invention have been described above. The description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application or their improvement over technology in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A computing platform implementation method for a neural network, the computing platform reading required data from an external memory and caching the read data and the intermediate computation results of each operation in an on-chip cache, the method comprising:
reading from the external memory first partial data of a feature map required by a first operation, performing on the first partial data operations comprising the first operation and at least one further operation, and storing the operation result for the first partial data back to the external memory;
reading from the external memory second partial data of the feature map required by the first operation, and performing on the second partial data the above read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory.
2. The method of claim 1, wherein the feature map comprises the first partial data and at least one item of second partial data, and the method comprises:
for each item of second partial data, respectively performing the above read operation from the external memory, the first operation and the at least one further operation, and the operation of storing back to the external memory, wherein the operation results for the first partial data and the at least one item of second partial data are together equivalent to the result of performing the first operation and the at least one further operation on the feature map as a whole.
3. The method of claim 1, wherein the size of the partial data read each time from the external memory is determined at least in part by the capacity of the on-chip cache.
4. The method of claim 1, wherein the at least one further operation is at least one of the following:
at least one subsequent operation of the first operation, the subsequent operation requiring at least the complete computation result of the first operation to complete all of its operations; and
at least one lateral operation sharing the same feature-map input with the first operation.
5. The method of claim 4, wherein the first operation and the at least one other operation comprise successive convolution (CONV), non-linearity (ReLU) and pooling (POOL) operations.
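A minimal sketch, under assumed shapes and a 1-D stand-in for convolution, of the fused CONV-ReLU-POOL chain of claim 5 applied to a single cached tile; the function and parameter names are hypothetical, not the claimed hardware pipeline.

    import numpy as np

    def conv_relu_pool_tile(tile, kernel, pool=2):
        # CONV: 1-D valid convolution over the cached tile.
        k = len(kernel)
        conv = np.array([np.dot(tile[i:i + k], kernel)
                         for i in range(len(tile) - k + 1)])
        relu = np.maximum(conv, 0.0)                   # ReLU on the on-chip result
        trimmed = relu[:len(relu) - len(relu) % pool]  # drop the ragged tail
        return trimmed.reshape(-1, pool).max(axis=1)   # POOL before storing back

    out = conv_relu_pool_tile(np.arange(10.0), np.array([1.0, -1.0]))

Because ReLU and POOL consume the convolution output element by element, only the pooled result needs to travel back to the external memory.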
6. The method of claim 4, further comprising:
reading, from the external memory, partial data from another feature map required by the first subsequent operation, and performing the first subsequent operation on the partial data from the other feature map together with the partial data from the feature map.
7. The method of claim 6, wherein the first subsequent operation is an element-wise (ELTWISE) operation.
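Claims 6-7 add a second input stream to the same per-tile loop; a short sketch under assumed shapes (all names hypothetical):

    import numpy as np

    # Matching tiles of two feature maps are read into the on-chip cache and
    # combined element-wise (ELTWISE), e.g. a residual-style addition; only
    # the combined tile is stored back to the external memory.
    tile_a = np.ones((8, 8, 16), dtype=np.float32)       # tile of this feature map
    tile_b = np.full((8, 8, 16), 2.0, dtype=np.float32)  # same region of the other map
    fused = tile_a + tile_b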
8. The method of claim 4, further comprising:
merging a second subsequent operation into the first operation.
9. The method of claim 8, wherein the second subsequent operation is a batch normalization (BatchNorm) and scaling (Scale) operation, and the first operation is a CONV operation.
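Claims 8-9 describe merging BatchNorm and Scale into the preceding convolution. The standard folding identity for y = gamma * (x - mean) / sqrt(var + eps) + beta yields a sketch like the following; the parameterization is the conventional one and is assumed here rather than quoted from the specification.

    import numpy as np

    def fold_bn_scale_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
        # w: (out_channels, ...) conv weights; b: (out_channels,) conv bias.
        s = gamma / np.sqrt(var + eps)                      # per-channel scale
        w_folded = w * s.reshape(-1, *([1] * (w.ndim - 1)))
        b_folded = (b - mean) * s + beta                    # shift absorbed into bias
        return w_folded, b_folded

    # Usage: the network then runs a single CONV with w_folded and b_folded.
    w, b = np.random.randn(4, 3, 3, 3), np.zeros(4)
    wf, bf = fold_bn_scale_into_conv(w, b, np.ones(4), np.zeros(4),
                                     np.zeros(4), np.ones(4))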
10. The method of claim 1, further comprising:
storing operation results back to the external memory in a required dimension arrangement, and/or reading previous operation results from the external memory in a required dimension arrangement, so as to omit operations in the neural network that serve only to change data dimensions or arrangement.
11. The method of claim 10, wherein the omitted operation is at least one of the following:
a concatenation (CONCAT) operation; and
a flattening (FLATTEN) operation.
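One way to realize claims 10-11 for CONCAT is to let each branch store its result into a pre-arranged channel range of a shared output buffer, so that no separate concatenation pass is ever executed; the buffer layout below is an illustrative assumption.

    import numpy as np

    h, w, c1, c2 = 4, 4, 3, 5
    out = np.empty((h, w, c1 + c2), dtype=np.float32)  # shared destination buffer

    branch_a = np.ones((h, w, c1), dtype=np.float32)   # stand-ins for branch outputs
    branch_b = np.zeros((h, w, c2), dtype=np.float32)

    out[..., :c1] = branch_a   # branch A stores into channels [0, c1)
    out[..., c1:] = branch_b   # branch B stores into channels [c1, c1 + c2)
    # 'out' is already the concatenated tensor; the CONCAT operation is omitted.

FLATTEN can be elided the same way, since it only reinterprets how the stored elements are addressed.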
12. The method of claim 1, further comprising:
decomposing the first operation and/or the at least one other operation, wherein the decomposed first operation and/or at least one other operation each operate on a part of their respective original inputs.
13. The method of claim 12, wherein the decomposed first operation and/or at least one other operation are split CONV operations, and the input of each split CONV operation is a part of the original feature map.
14. The method of claim 12, wherein the decomposed first operation and/or at least one other operation are split POOL operations, and the input of each split POOL operation is a complete feature map output by a previous operation.
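A sketch of the splitting of claim 13: one valid 1-D convolution decomposed into split convolutions, each fed a slice of the original input. The overlap of k-1 samples between slices is an assumption needed for the split outputs to match the unsplit result; all names are illustrative.

    import numpy as np

    def conv1d_valid(x, kernel):
        k = len(kernel)
        return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

    def split_conv1d(x, kernel, parts=2):
        # Each split CONV sees only a part of the original input (claim 13).
        k, n, step = len(kernel), len(x), len(x) // parts
        outs = []
        for p in range(parts):
            lo = p * step
            hi = n if p == parts - 1 else (p + 1) * step + (k - 1)  # halo overlap
            outs.append(conv1d_valid(x[lo:hi], kernel))
        return np.concatenate(outs)

    x, kern = np.arange(12.0), np.array([1.0, 0.5, -1.0])
    assert np.allclose(split_conv1d(x, kern), conv1d_valid(x, kern))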
15. A neural network computing system, comprising:
an off-chip memory device for storing the parameters and feature maps required for performing neural network computation; and
a computation execution device comprising an on-chip cache, the computation execution device successively performing the following operations for each of the different portions into which the feature map is divided based on the capacity that the on-chip cache can hold:
reading a specific portion of the required feature map from the off-chip memory device, performing at least two operations on the specific portion, and storing the operation results back to the off-chip memory device.
16. The system of claim 15, wherein the computation execution device is a highly parallel computing device based on an FPGA, GPU or ASIC.
17. The system of claim 15, wherein the at least two operations are at least one of the following:
lateral operations that use the same feature map as input; and
sequential operations based on the operation result of a preceding operation.
18. A computing platform for a neural network, comprising:
a data processing module for performing predetermined computation processing on input data and generating output data;
a data storage module for caching input data required by the data processing module or intermediate data output by the data processing module; and
a control module for controlling the data processing module and the data storage module so as to perform the method according to any one of claims 1-14.
19. The computing platform of claim 18, wherein
the data processing module is a convolution computation module for performing convolution computation on input data.
20. A non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the method according to any one of claims 1-14.
CN201810297916.6A 2018-03-30 2018-03-30 Computing platform realization method and system for neural network Pending CN110321064A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810297916.6A CN110321064A (en) 2018-03-30 2018-03-30 Computing platform realization method and system for neural network

Publications (1)

Publication Number Publication Date
CN110321064A (en) 2019-10-11

Family

ID=68112541

Country Status (1)

Country Link
CN (1) CN110321064A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250981A (en) * 2015-06-10 2016-12-21 三星电子株式会社 The impulsive neural networks of bandwidth consumption in minimizing memory access and network
CN107203807A (en) * 2016-03-16 2017-09-26 中国科学院计算技术研究所 The computational methods of neutral net, system and its apparatus
CN105892989A (en) * 2016-03-28 2016-08-24 中国科学院计算技术研究所 Neural network accelerator and operational method thereof
CN106951962A (en) * 2017-03-22 2017-07-14 北京地平线信息技术有限公司 Compound operation unit, method and electronic equipment for neutral net
CN107451654A (en) * 2017-07-05 2017-12-08 深圳市自行科技有限公司 Acceleration operation method, server and the storage medium of convolutional neural networks
CN107437110A (en) * 2017-07-11 2017-12-05 中国科学院自动化研究所 The piecemeal convolution optimization method and device of convolutional neural networks
CN107657581A (en) * 2017-09-28 2018-02-02 中国人民解放军国防科技大学 Convolutional neural network CNN hardware accelerator and acceleration method
CN107704923A (en) * 2017-10-19 2018-02-16 珠海格力电器股份有限公司 Convolutional neural networks computing circuit

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021248941A1 (en) * 2020-06-08 2021-12-16 深圳市九天睿芯科技有限公司 All-on-chip storage neural network accelerator and implementation method therefor
CN111914999A (en) * 2020-07-30 2020-11-10 云知声智能科技股份有限公司 Method and equipment for reducing calculation bandwidth of neural network accelerator
CN111914999B (en) * 2020-07-30 2024-04-19 云知声智能科技股份有限公司 Method and equipment for reducing calculation bandwidth of neural network accelerator
CN112698954A (en) * 2021-01-14 2021-04-23 上海交通大学 Coarse-grained reconfigurable array scheduling method based on subgraph decoupling
CN112698954B (en) * 2021-01-14 2022-05-10 上海交通大学 Coarse-grained reconfigurable array scheduling method based on subgraph decoupling
CN112884123A (en) * 2021-02-23 2021-06-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN112884123B (en) * 2021-02-23 2024-03-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN114356235A (en) * 2021-12-31 2022-04-15 Oppo广东移动通信有限公司 Data standardization processing method and device, electronic equipment and storage medium
WO2023124654A1 (en) * 2021-12-31 2023-07-06 Oppo广东移动通信有限公司 Data standardization processing method and apparatus, electronic device, and storage medium
WO2023212975A1 (en) * 2022-05-06 2023-11-09 北京灵汐科技有限公司 Mapping method, electronic device and computer-readable storage medium
CN114968602A (en) * 2022-08-01 2022-08-30 成都图影视讯科技有限公司 Architecture, method and apparatus for a dynamically resource-allocated neural network chip
CN114968602B (en) * 2022-08-01 2022-10-21 成都图影视讯科技有限公司 Architecture, method and apparatus for a dynamically resource-allocated neural network chip

Similar Documents

Publication Publication Date Title
CN110321064A (en) Computing platform realization method and system for neural network
CN110321999A Neural network computation graph optimization method
CN110175671A Neural network construction method, image processing method, and device
Feng et al. Computer vision algorithms and hardware implementations: A survey
CN106022468A (en) Artificial neural network processor integrated circuit and design method therefor
US11093225B2 (en) High parallelism computing system and instruction scheduling method thereof
CN108416436A Method and system for neural network partitioning using multi-core processing modules
CN110458280B Convolutional neural network acceleration method and system suitable for mobile terminals
CN109496294A Compilation method and system for an artificial intelligence processor, storage medium and terminal
CN105739951B A GPU-based fast solution method for L1 minimization problems
WO2023093724A1 (en) Neural network model processing method and device
CN111768004A Model adaptation method and system based on an intelligent computing framework
CN109496319A Hardware optimization method for an artificial intelligence processor, system, storage medium, terminal
Wang et al. Nonlinear tensor train format for deep neural network compression
US20240160689A1 (en) Method for optimizing convolution operation of system on chip and related product
CN109685208B Method and device for sparsification and merging acceleration of neural network processor data
Dey et al. Accelerating training of deep neural networks via sparse edge processing
CN114003201A (en) Matrix transformation method and device and convolutional neural network accelerator
JP7363145B2 (en) Learning device and learning method
WO2023071658A1 (en) Ai model processing method and apparatus, and ai model computing method and apparatus
Morcel et al. Fpga-based accelerator for deep convolutional neural networks for the spark environment
Sun et al. Computation on sparse neural networks and its implications for future hardware
Li et al. Memory saving method for enhanced convolution of deep neural network
CN107644143B A high-performance urban cellular automata (CA) model construction method based on vectorization and parallel computation
CN105573834A Vocabulary tree construction method for high-dimensional data based on heterogeneous platforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200904

Address after: Unit 01-19, 10/F, 101, 6/F, Building 5, Yard 5, Anding Road, Chaoyang District, Beijing 100029

Applicant after: Xilinx Electronic Technology (Beijing) Co.,Ltd.

Address before: Floor 17, Building 4, Yard 4, No. 1 Wangzhuang Road, Haidian District, Beijing 100083

Applicant before: BEIJING DEEPHI INTELLIGENT TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20191011