CN108133270A - Convolutional neural networks accelerating method and device - Google Patents
- Publication number
- CN108133270A CN108133270A CN201810028998.4A CN201810028998A CN108133270A CN 108133270 A CN108133270 A CN 108133270A CN 201810028998 A CN201810028998 A CN 201810028998A CN 108133270 A CN108133270 A CN 108133270A
- Authority
- CN
- China
- Prior art keywords
- output
- input
- convolution
- convolutional layer
- fully connected layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
This disclosure relates to a convolutional neural network acceleration method and device. The method includes: reading the input feature maps of a convolutional layer; inputting the input feature maps into the processing array groups of the convolutional layer, and performing the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups; obtaining the output feature maps of the convolutional layer according to the output results of the processing array groups; writing the output feature maps of the last convolutional layer into the input buffer of the fully connected layer; the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer; and outputting the output feature vector of the last fully connected layer into the fourth partition of the first memory. The disclosure effectively reduces hardware resources and power consumption and improves the processing speed of convolutional neural networks.
Description
Technical field
This disclosure relates to the field of neural network technology, and in particular to a convolutional neural network acceleration method and device.
Background technology
Deep learning has shown excellent performance on many problems such as video recognition, speech recognition, and natural language processing. Among the different types of neural networks, convolutional neural networks have been studied most deeply. The basic structure of a convolutional neural network comprises two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to a local region of the previous layer, from which it extracts local features; this is called a convolutional layer. The second is the feature combination layer, generally a fully connected neural network that classifies the image features extracted by the preceding feature extraction layers; this is called a fully connected layer. A network typically contains multiple feature extraction layers and feature combination layers, realizing complicated nonlinear classification of large-scale inputs. As network depth increases, the parameter scale of a convolutional neural network grows greatly, and the amount of computation for forward feature extraction, classification, and back-propagation of errors is usually on the order of millions to hundreds of millions of operations. Accelerating convolutional neural networks is therefore the key to improving the computational efficiency of convolutional neural network models.
Summary of the invention
In view of this, the present disclosure proposes a convolutional neural network acceleration method and device, to solve the problem of slow convolutional neural network computation.
According to one aspect of the disclosure, a convolutional neural network acceleration method is provided. The method includes:
reading the input feature maps of a convolutional layer;
inputting the input feature maps into the processing array groups of the convolutional layer, and performing the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups; wherein each processing array group includes multiple processing arrays, each processing array includes three row processing arrays, the number of rows of the convolution window is M, the first reference pixel vector takes the first M−N rows of the convolution window, the second reference pixel vector takes the last N rows of the convolution window, M and N are positive integers, and M>N;
obtaining the output feature maps of the convolutional layer according to the output results of the processing array groups;
writing the output feature maps of the last convolutional layer into the input buffer of the fully connected layer;
the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer; and
outputting the output feature vector of the last fully connected layer into the fourth partition of the first memory.
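Functionally, the claimed pipeline computes an ordinary CNN forward pass. The following minimal NumPy sketch is illustrative only — the layer sizes, function names, and single fully connected layer are assumptions, and the hardware parallelism, partitioned memories, and PP-MAC structure are deliberately not modeled; it shows only the arithmetic the accelerator performs:

```python
import numpy as np

def conv_layer(feat_maps, kernels, bias):
    """Naive 3x3 valid convolution: partial sums accumulated per input channel,
    accumulator initialized with the bias, ReLU activation at the end."""
    c_in, h, w = feat_maps.shape
    c_out = kernels.shape[0]                      # kernels: (c_out, c_in, 3, 3)
    out = np.zeros((c_out, h - 2, w - 2)) + bias[:, None, None]
    for o in range(c_out):
        for i in range(c_in):                     # accumulate over input channels
            for y in range(h - 2):
                for x in range(w - 2):
                    out[o, y, x] += np.sum(feat_maps[i, y:y+3, x:x+3] * kernels[o, i])
    return np.maximum(out, 0)                     # ReLU

def fc_layer(vec, weights, bias):
    """One fully connected layer: a plain multiply-add over the flattened input."""
    return weights @ vec + bias

# Tiny end-to-end run: one convolutional layer feeding one fully connected layer.
img = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
conv_out = conv_layer(img, np.ones((3, 2, 3, 3)), np.zeros(3))
fc_out = fc_layer(conv_out.ravel(), np.ones((5, conv_out.size)), np.zeros(5))
```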
In a possible implementation, reading the input feature maps of the convolutional layer includes:
when the input feature map of the current input channel read by one of the first convolution window and the second convolution window is being used for the convolution multiply-add operations, the other convolution window reading the input feature map of the next input channel;
when the first convolution window or the second convolution window needs to be filled, reading input feature map data from the top-row buffer and the right buffer, respectively; and
when the convolution weights of the current group of output channels read by one of the first weight register and the second weight register are being used for the convolution multiply-add operations, the other register reading the convolution weights of the next group of output channels.
In a possible implementation, the method further includes:
writing the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In a possible implementation, the method further includes:
writing the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory; and
writing the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory;
the parameters including weights and biases.
In a possible implementation, obtaining the output feature maps of the convolutional layer according to the output results of the processing array groups includes:
when the convolutional layer is not the last layer, storing the output feature maps of even-numbered convolutional layers in the third partition of the first memory, and storing the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In a possible implementation, the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer to obtain the output feature vector of the fully connected layer includes:
obtaining the input feature vector of the fully connected layer according to the output feature maps of the last convolutional layer;
inputting the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit to obtain the output feature vector of the fully connected layer, the fully connected processing unit including a multiplier-accumulator and L registers connected in series, where L is an integer greater than 2; and
in each clock cycle, transferring the data of the previous clock cycle stored in the first through (L−1)-th of the L series-connected registers to the next-stage register, and inputting the data of the previous clock cycle stored in the L-th register, via the multiplier-accumulator, into the first memory.
In a possible implementation, the fully connected layer performing multiply-add operations according to the output feature maps of the last convolutional layer to obtain the output feature vector of the fully connected layer further includes:
when the input feature vector of the fully connected layer or the fully connected weights are invalid data, padding the input of the multiplier-accumulator with zeros and continuing to accumulate until valid data arrives.
According to another aspect of the disclosure, a convolutional neural network acceleration device is provided, including:
an input feature map reading module, configured to read the input feature maps of a convolutional layer;
a processing array group module, configured to input the input feature maps into the processing array groups of the convolutional layer, and perform the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups; wherein each processing array group includes multiple processing arrays, each processing array includes three row processing arrays, the number of rows of the convolution window is M, the first reference pixel vector takes the first M−N rows of the convolution window, the second reference pixel vector takes the last N rows, M and N are positive integers, and M>N;
a convolutional layer output feature map determining module, configured to obtain the output feature maps of the convolutional layer according to the output results of the processing array groups;
a fully connected layer input buffer module, configured to write the output feature maps of the last convolutional layer into the input buffer of the fully connected layer;
a fully connected layer output feature vector acquisition module, configured to have the fully connected layer perform multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer; and
a fully connected layer output feature vector output module, configured to output the output feature vector of the last fully connected layer into the fourth partition of the first memory.
In a possible implementation, the input feature map reading module includes:
a convolution window reading submodule, configured such that when the input feature map of the current input channel read by one of the first convolution window and the second convolution window is being used for the convolution multiply-add operations, the other convolution window reads the input feature map of the next input channel;
a convolution window filling submodule, configured to read input feature map data from the top-row buffer and the right buffer, respectively, when the first convolution window or the second convolution window needs to be filled; and
a weight register reading submodule, configured such that when the convolution weights of the current group of output channels read by one of the first weight register and the second weight register are being used for the convolution multiply-add operations, the other register reads the convolution weights of the next group of output channels.
In a possible implementation, the device further includes:
an input image writing module, configured to write the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In a possible implementation, the device further includes:
a parameter writing module, configured to write the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory, and write the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory; the parameters including weights and biases.
In a possible implementation, the convolutional layer output feature map determining module includes:
an output submodule, configured to store, when the convolutional layer is not the last layer, the output feature maps of even-numbered convolutional layers in the third partition of the first memory, and the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In a possible implementation, the fully connected layer output feature vector acquisition module includes:
an input feature vector acquisition submodule, configured to obtain the input feature vector of the fully connected layer according to the output feature maps of the last convolutional layer; and
a fully connected processing submodule, configured to input the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit to obtain the output feature vector of the fully connected layer, the fully connected processing unit including a multiplier-accumulator and L registers connected in series, where L is an integer greater than 2; in each clock cycle, the data of the previous clock cycle stored in the first through (L−1)-th of the L series-connected registers is transferred to the next-stage register, and the data of the previous clock cycle stored in the L-th register is input, via the multiplier-accumulator, into the first memory.
In a possible implementation, the fully connected layer output feature vector acquisition module further includes:
a zero-padding submodule, configured to pad the input of the multiplier-accumulator with zeros and continue to accumulate until valid data arrives, when the input feature vector of the fully connected layer or the fully connected weights are invalid data.
According to another aspect of the disclosure, a convolutional neural network acceleration device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the convolutional neural network acceleration method described above.
According to another aspect of the disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored; when the computer program instructions are executed by a processor, the convolutional neural network acceleration method described above is performed.
By providing multiple processing array groups in the convolutional layer and, when the convolution windows read the input feature maps, using different reference pixel vectors for different rows to perform the convolution calculation, the disclosure effectively reduces hardware resources and power consumption and improves the processing speed of convolutional neural networks.
Other features and aspects of the disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure together with the specification, and serve to explain the principles of the disclosure.
Fig. 1 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 2 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 3 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 4 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 5 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 6 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 7 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure;
Fig. 8 shows a block diagram of a convolutional neural network hardware structure according to an embodiment of the disclosure;
Fig. 9 shows a flowchart of the ping-pong operation of single-layer convolution according to an embodiment of the disclosure;
Fig. 10 shows a schematic diagram of the on-chip buffering mechanism of the convolution windows according to an embodiment of the disclosure;
Fig. 11 shows a structural diagram of the convolution multiply-add unit according to an embodiment of the disclosure;
Fig. 12 shows a table of the structural parameters of each layer of the VGG16 network;
Fig. 13 shows a block diagram of a convolutional neural network acceleration device according to an embodiment of the disclosure;
Fig. 14 shows a block diagram of a convolutional neural network acceleration device according to an embodiment of the disclosure;
Fig. 15 is a block diagram of a device for convolutional neural network acceleration according to an exemplary embodiment.
Detailed description of the embodiments
Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" herein means "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" should not be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are given in the following detailed description in order to better illustrate the disclosure. Those skilled in the art will appreciate that the disclosure can equally be implemented without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, in order to highlight the gist of the disclosure.
Convolutional neural network acceleration is broadly divided into algorithmic acceleration and hardware acceleration. Algorithmic acceleration is generally used in the classification stage of a network, involves only forward propagation, and cannot effectively accelerate the training process of the network. Realizing convolutional neural networks through hardware acceleration is an effective approach; hardware acceleration schemes include convolutional neural network acceleration architectures based on FPGAs (Field-Programmable Gate Arrays) and ASICs (Application-Specific Integrated Circuits). An FPGA is a programmable logic gate array with outstanding parallel computing capability; a specially designed FPGA features low power consumption, high speed, and reconfigurability. In addition, an FPGA breaks the sequential execution pattern and can complete more processing tasks in each clock cycle, and an FPGA can focus deterministically on a given task without an operating system, reducing the possibility of errors.
Fig. 1 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 1, the method includes the following steps:
Step S10: read the input feature maps of a convolutional layer.
Step S20: input the input feature maps into the processing array groups of the convolutional layer, and perform the convolution multiply-add operations with propagate-partial multiplier-accumulators based on the first reference pixel vector, the second reference pixel vector, the convolution kernel weights, and the data of the completed input channels, to obtain the output results of the processing array groups. Each processing array group includes multiple processing arrays, and each processing array includes multiple row processing arrays; for example, a processing array may include three row processing arrays. The number of rows of the convolution window is M; the first reference pixel vector takes the first M−N rows of the convolution window, the second reference pixel vector takes the last N rows, M and N are positive integers, and M>N.
In a possible implementation, an input channel refers to one of the different input features extracted from an input image according to a specific extracted feature, and an output channel refers to the result output by each processing array in a processing array group. The convolution calculation uses the PP-MAC structure (Propagate Partial Multiplier-Accumulator, a partial-sum-propagating multiply-add architecture). To improve the processing rate, this embodiment uses multiple processing array groups — for example, every 32 processing arrays form one processing array group — to realize parallel processing of different output channels.
The calculation results of the processing array groups are stored in the convolutional layer output buffer, waiting to be accumulated with the calculation results of the other input channels. In the initial stage of the convolution operation, the given biases are first stored in the convolutional layer output buffer as the initial values of the accumulators. The partial sums of different convolution positions of the same output channel are stored in different output buffers, which facilitates reading the partial sums and results during subsequent operations.
When the convolution windows read the input feature map, the first reference pixel vector r0 and the second reference pixel vector r1 are configured units of feature map data. For example, if the convolution window has 16 rows in total, r0 reads the first 14 rows and r1 reads the last two rows; r1 is only valid when the convolution position changes in the column direction, and all other positions are calculated using r0. The feature map is input into the processing arrays in broadcast form for the convolution calculation. One processing array is composed of three one-dimensional row processing arrays, and each row processing array is composed of three multiplication units and an addition unit with three operands. In the three row processing arrays, 1×3 input feature map data is multiplied respectively with the three 1×3 rows of weights in a 3×3 convolution kernel, performing the convolution of three adjacent pixel positions. The processing delays of the three row processing arrays are different.
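What the three row processing arrays compute can be sketched functionally as follows (illustrative Python/NumPy, not the hardware itself: the broadcast of each 1×3 input slice to the three kernel rows is modeled, while the differing hardware delays are collapsed into ordinary accumulation):

```python
import numpy as np

def conv3x3_row_broadcast(feat, kernel):
    """Each 1x3 slice of an input row is broadcast to three row arrays; row
    array r multiplies it with kernel row r, producing a partial sum for one
    of three vertically adjacent output positions."""
    h, w = feat.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h):                       # input rows arrive one at a time
        for x in range(w - 2):
            slice_1x3 = feat[y, x:x+3]
            for r in range(3):               # row array r holds kernel row r
                oy = y - r                   # output row this partial sum feeds
                if 0 <= oy < h - 2:
                    out[oy, x] += np.dot(slice_1x3, kernel[r])
    return out

# Small demonstration on a 4x4 feature map with an all-ones 3x3 kernel.
out = conv3x3_row_broadcast(np.arange(16, dtype=float).reshape(4, 4), np.ones((3, 3)))
```

Summed over the three row arrays, this reproduces a standard 3×3 valid convolution, since out[oy, x] collects one 1×3 dot product from each kernel row.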
Step S30: obtain the output feature maps of the convolutional layer according to the output results of the processing array groups.
In a possible implementation, the calculation results of the convolutional layer need to be activated with the ReLU (Rectified Linear Unit) function. The output results of the processing array groups yield the output feature map of each convolutional layer, which is written into the convolutional layer output buffer; the process then returns to step S20 and repeats.
Step S40: write the output feature maps of the last convolutional layer into the input buffer of the fully connected layer.
In a possible implementation, the calculation results of the last convolutional layer are written directly into the fully connected layer input buffer, without being written to off-chip memory.
Step S50: the fully connected layer performs multiply-add operations according to the output feature maps of the last convolutional layer, to obtain the output feature vector of the fully connected layer.
In a possible implementation, the output feature maps of the last convolutional layer serve as the input of the fully connected layer; after the multiply-add operations in the fully connected layer are performed, the output feature vector of the fully connected layer is obtained.
Step S60: output the output feature vector of the last fully connected layer into the fourth partition of the first memory.
In a possible implementation, after the first memory is partitioned, the output feature vector of the fully connected layer is stored in one of its partitions, and the other partitions of the first memory are used to store the input image, the parameters of each layer, and so on. Partitioned storage facilitates data extraction and improves the overall computational efficiency of the system.
In this embodiment, the processing array groups introduced for the convolution calculation break the sequential execution pattern and can complete more processing tasks in each clock cycle, improving the calculation rate of the convolutional neural network.
Fig. 2 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 2, the difference from the above embodiment is that step S10 of the method includes:
Step S11: when the input feature map of the current input channel read by one of the first convolution window and the second convolution window is being used for the convolution multiply-add operations, the other convolution window reads the input feature map of the next input channel.
Step S12: when the first convolution window or the second convolution window needs to be filled, read input feature map data from the top-row buffer and the right buffer, respectively.
Step S13: when the convolution weights of the current group of output channels read by one of the first weight register and the second weight register are being used for the convolution multiply-add operations, the other register reads the convolution weights of the next group of output channels.
In a possible implementation, due to the limitation of FPGA on-chip resources, the on-chip memory banks cannot all be used for storing input feature maps. The input feature map is convolved according to the configured convolution unit; after the calculation is completed, another block of pixels is read in and calculated, until the entire input feature map has been read. The weights are read in the same pattern. The input buffer of the convolutional layer includes two convolution windows, two weight registers, a top-row buffer, and a right buffer. The two convolution windows and the two weight registers are used to realize ping-pong operation, carrying out convolution calculation and data reading at the same time.
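The ping-pong (double-buffering) scheme can be sketched as follows (illustrative Python; `channel_loader` and `compute` are hypothetical stand-ins for the memory read and the convolution multiply-add, which in hardware run concurrently rather than sequentially):

```python
def ping_pong(channel_loader, compute, n_channels):
    """Double-buffered pipeline: while one buffer's contents are used for the
    convolution multiply-add, the other buffer is filled with the next input
    channel; the two buffers then swap roles."""
    buffers = [channel_loader(0), None]   # buffer 0 pre-filled with channel 0
    active = 0
    results = []
    for ch in range(n_channels):
        if ch + 1 < n_channels:
            buffers[1 - active] = channel_loader(ch + 1)  # prefetch next channel
        results.append(compute(buffers[active]))           # compute on active buffer
        active = 1 - active                                # swap roles
    return results

# Toy demonstration: "loading" channel ch yields ch * 10, "computing" adds 1.
res = ping_pong(lambda ch: ch * 10, lambda b: b + 1, 3)
```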
Fig. 3 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 3, the difference from the above embodiment is that the method further includes:
Step S70: write the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In a possible implementation, the input image can also be written into the first partition of the second memory. Storing the input image and the input feature vector of the fully connected layer in different partitions of the first memory facilitates the extraction of different data and improves the overall efficiency of the system.
Fig. 4 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 4, the difference from the above embodiment is that the method further includes:
Step S80: write the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory, and write the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory; the parameters include weights and biases.
In a possible implementation, storing the parameters of the convolutional and fully connected layers in partitions separate from the input image and the input feature vector of the fully connected layer facilitates the extraction of different data and improves the overall efficiency of the system. Storing the parameters of odd- and even-numbered layers separately allows each odd- or even-numbered convolutional or fully connected layer to quickly obtain the corresponding parameters for calculation.
Fig. 5 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 5, the difference from the above embodiment is that step S30 includes:
Step S31: when the convolutional layer is not the last layer, store the output feature maps of even-numbered convolutional layers in the third partition of the first memory, and store the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In a possible implementation, the output feature maps of odd-numbered convolutional layers are stored in the third partition of the second memory, and the output feature maps of even-numbered layers are stored in the third partition of the first memory. During the convolution calculation of any one layer, one memory only performs read operations while the other memory only performs write operations.
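The alternating read/write roles of the two memories can be sketched as follows (illustrative Python; the dictionary keys and function names are assumptions, standing in for the third partitions of the two memories):

```python
def run_layers(layers, mem_a, mem_b):
    """Alternate two memories so that within any single layer one memory is
    only read and the other only written. Layer 1 reads mem_a and writes
    mem_b; the roles swap for each subsequent layer."""
    src, dst = mem_a, mem_b
    data = src["feat"]                # input feature map of the first layer
    for layer in layers:
        data = layer(data)
        dst["feat"] = data            # write result to the other memory
        src, dst = dst, src           # swap read/write roles for the next layer
    return data

# Toy demonstration with two "layers": add one, then double.
mem_a, mem_b = {"feat": 1}, {}
final = run_layers([lambda x: x + 1, lambda x: x * 2], mem_a, mem_b)
```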
Fig. 6 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 6, the difference from the above embodiment is that step S50 includes:
Step S51: obtain the input feature vector of the fully connected layer from the output feature maps of the last convolutional layer.
Step S52: input the feature vector of the fully connected layer together with its weights into the fully connected processing unit for processing, obtaining the output feature vector of the fully connected layer; the fully connected processing unit includes a multiply-accumulate (MAC) unit and L serially connected registers, where L is an integer greater than 2.
Step S53: in each clock cycle, the first through (L−1)-th of the L serially connected registers transfer the data stored in the previous clock cycle to the next-stage register, while the data of the previous clock cycle stored in the L-th register is fed through the MAC unit into the first memory.
In one possible implementation, if the multiply-accumulate process of the fully connected layer used only a single register, the result would easily become inaccurate. In this embodiment the partial results are distributed over multiple registers and accumulated progressively as the data is passed back along the chain, avoiding repeated accumulation into a single register and reducing the computation error.
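The effect can be illustrated numerically. The sketch below is an illustration, not the hardware design: it accumulates half-precision values round-robin into a configurable number of registers. With a single register the running sum stalls once new terms fall below the floating-point spacing; distributing the sum over several registers keeps the result closer to the true value.

```python
import numpy as np

def accumulate_fp16(values, n_regs):
    """Accumulate half-precision values round-robin into n_regs registers,
    then combine the partial sums at higher precision."""
    regs = [np.float16(0.0)] * n_regs
    for i, v in enumerate(values):
        r = i % n_regs                       # rotate through the registers
        regs[r] = np.float16(regs[r] + np.float16(v))
    return float(np.sum(np.array(regs, dtype=np.float32)))

# 25088 additions of 1.0: a lone fp16 register stalls at 2048, because
# 2048 + 1 rounds back to 2048 in half precision; four registers each
# carry a quarter of the terms and together land closer to 25088.
one_reg = accumulate_fp16([1.0] * 25088, 1)
four_regs = accumulate_fp16([1.0] * 25088, 4)
```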
Fig. 7 shows a flowchart of a convolutional neural network acceleration method according to an embodiment of the disclosure. As shown in Fig. 7, the difference from the above embodiment is that step S50 further includes:
Step S54: when the input feature vector of the fully connected layer or the fully connected weights are invalid data, feed zeros into the MAC unit and keep accumulating until valid data arrives.
In one possible implementation, if in some cycle the input feature map or weight data is invalid, zeros are fed in and accumulated until valid data arrives.
Implementation Example 1:
The convolutional neural network model provided by the disclosure uses the VGG16 network model, comprising 13 convolutional layers and 3 fully connected layers (each of the first two fully connected layers is split into two layers). Fig. 8 shows a block diagram of the convolutional neural network apparatus according to an embodiment of the disclosure, comprising a convolutional-layer accelerator, a convolutional-layer input buffer, convolutional-layer output buffers, a fully-connected-layer accelerator and a fully-connected-layer input buffer.
The acceleration method of the VGG16 convolutional neural network accelerator proposed in the disclosure comprises the following steps:
S1: write the parameters of the convolutional and fully connected layers (including biases and weights) into the DDR3 memories (#0 and #1) through a PCIe (Peripheral Component Interconnect Express) interface; DDR3 is double-data-rate synchronous dynamic RAM.
The parameters of the convolutional and fully connected layers are stored, in layer order, into bank0 of the DDR3 memories over a PCIe 3.0 ×8 bus. The parameters of odd-numbered layers are stored in the first DDR3 (#0) and those of even-numbered layers in the second DDR3 (#1). Within each layer, the biases are arranged before the weights. The weights are arranged in units of 3×3 convolution kernels, and the kernel weights of the same input channel are placed in contiguous memory, so they can be read continuously without address computation.
S2: write the input image into DDR3 (#0) through the PCIe interface.
The input image data is written into bank1 of the DDR3 over the PCIe 3.0 ×8 bus. Its organization in DDR3 is determined by the convolution method and the DDR3 read pattern. The convolutional-layer feature maps must be partitioned into blocks; since the side length of every convolutional-layer feature map is a multiple of 7, feature map data is organized at a minimum granularity of 7×7 pixels. Taking the pooling operation into account, the smallest unit of convolutional computation is 14×14 pixels. The feature maps are stored in DDR3 as 7×7 pixel blocks. Because one DDR3 burst transfers 512 bits of data while a 7×7 pixel block is 784 bits long, the data length of each block is extended to 1024 bits and aligned to the burst start address, reducing wasted data during reads.
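The byte arithmetic behind this padding can be sketched as follows (a back-of-the-envelope illustration; the constant names are ours):

```python
BURST_BITS = 512        # one DDR3 burst transfers 512 bits
BLOCK_PIXELS = 7 * 7    # minimum feature map granularity
PIXEL_BITS = 16         # half-precision pixels

def padded_block_bits() -> int:
    """Round the 784-bit 7x7 block up to a whole number of bursts (1024 bits)."""
    raw = BLOCK_PIXELS * PIXEL_BITS        # 784 bits of real data
    bursts = -(-raw // BURST_BITS)         # ceiling division -> 2 bursts
    return bursts * BURST_BITS

def block_byte_offset(block_index: int) -> int:
    """Burst-aligned starting byte address of the n-th 7x7 block."""
    return block_index * padded_block_bits() // 8
```

Two bursts then fetch a whole block, with no mid-block address computation.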
S3: read the required data from DDR3 into the convolutional-layer input buffer.
Owing to limited FPGA on-chip resources, the on-chip memory cannot hold an entire input feature map. Convolution is therefore computed on the input feature map in 14×14 units: after one unit is finished, the next 14×14 pixel block is read in, until the whole input feature map has been consumed. The weights are read in the same pattern. The convolutional-layer input buffer comprises two 16×16-pixel convolution windows, two 32×3×3 weight registers, a 1024×16-pixel top-row cache and a 512×15×8-pixel right cache. The two convolution windows and weight registers implement ping-pong operation, so that convolution and data reading proceed simultaneously. Fig. 9 shows the ping-pong operation flow of single-layer convolution according to an embodiment of the disclosure.
Figure 10 is a schematic diagram of the on-chip caching mechanism of the convolution window according to an embodiment of the disclosure. As shown in Figure 10, the thick-framed block represents the 16×16-pixel convolution window. When the convolution window needs to be filled, its first row (1×16) is read from the top-row cache and the 15×8 pixel block of the left shaded area is read from the right cache; the pixel blocks B(i,j), B(i,j+1), B(i+1,j), B(i+1,j+1) and the first rows of B(i+2,j) and B(i+2,j+1) must be read from the off-chip DDR3. After the convolution window has been filled, the 1×16 pixel block of the first region (the vertical-line region) replaces the grid region in the top-row cache, and the 15×8 pixel block of the second region (the dotted region) replaces the third region (the shaded region) in the right cache.
S4: perform the convolution operation, repeating S3 until the current convolutional layer is finished.
The convolution uses the PP-MAC (Propagate Partial Multiplier-Accumulator) structure. Fig. 11 shows the structure of the convolution MAC unit according to an embodiment of the disclosure. To raise the processing speed, 32 processing arrays form a processing array group that processes different output channels in parallel.
Let I_i(R, C) denote the input pixel of the i-th feature map at position (R, C), and let r denote the current round number. The task completed by the j-th processing array is:

O_{32r+j}(R, C) = Σ_i I_i(R, C) ⊛ k_{i,32r+j}

where O_{32r+j}(R, C) is the pixel of the (32r+j)-th output feature map at position (R, C), ⊛ denotes convolution, and k_{i,32r+j} denotes the weights.
The reference pixel vectors r0 and r1 are 1×3 feature map data, p0 denotes the partial-sum data read from the convolutional-layer output buffer, and kij (i, j ∈ {0, 1, 2}) denotes the weights of the convolution kernel. r0 is taken from the first 14 rows of the convolution window and r1 from the last two rows; r1 is valid only when the convolution position moves in the column direction. The feature map is broadcast into the processing arrays for convolution. One processing array consists of three one-dimensional row arrays, each composed of three multiplier units and one 3-operand adder unit. In the three row arrays, the 1×3 input feature map data is multiplied by the three 1×3 weight rows of the 3×3 convolution kernel, performing the convolution of three adjacent pixel positions. Because the three row arrays have different processing delays, scratch registers are used to keep the partial sums synchronized.
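Functionally, the work of one processing array at one output position can be pictured as three 1×3 row products combined by the 3-operand adders. The sketch below is a behavioral illustration only; it ignores the partial-sum propagation and the differing row delays:

```python
def conv3x3_pixel(feature, kernel, R, C):
    """One output pixel at (R, C): each of the three row arrays multiplies a
    1x3 slice of the feature map by one 1x3 row of the 3x3 kernel, and the
    three row sums are added together (no padding; caller keeps indices valid)."""
    total = 0.0
    for dr in range(3):                        # the three one-dimensional row arrays
        pixels = feature[R + dr][C:C + 3]      # broadcast 1x3 input feature vector
        row = kernel[dr]
        total += sum(p * w for p, w in zip(pixels, row))
    return total
```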
Figure 12 tabulates the structural parameters of each layer of the VGG16 network. As the table in Figure 12 shows, the output results of the conv1-2, conv2-2, conv3-3, conv4-3 and conv5-3 layers additionally require pooling. The pooling window is 2×2 with stride 2, implemented as max down-sampling. The pooling operation is embedded into the convolution: within a 14×14 convolution unit, after the convolutions of two vertically adjacent pixel positions are finished, only the larger result is kept; when the convolutions of the other two pixel positions of the 2×2 block are performed, the results are compared with the retained one and only the maximum convolution result is kept, yielding a 7×7 pooled output matrix.
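The fused max-pooling step can be sketched on precomputed convolution results (an illustration of the data flow, not of the hardware):

```python
def fuse_maxpool_2x2(conv_out):
    """Collapse a 14x14 block of convolution results into a 7x7 pooled output.

    Mirrors the embedded pooling: the larger of each vertical pair is kept
    first, then compared with the retained result of the other column of the
    2x2 block, so only the maximum of each 2x2 window survives.
    """
    pooled = [[0.0] * 7 for _ in range(7)]
    for pr in range(7):
        for pc in range(7):
            left = max(conv_out[2*pr][2*pc], conv_out[2*pr + 1][2*pc])
            right = max(conv_out[2*pr][2*pc + 1], conv_out[2*pr + 1][2*pc + 1])
            pooled[pr][pc] = max(left, right)
    return pooled
```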
S5: if this is not the last convolutional layer, write the convolution results to DDR3 and repeat S3 and S4.
The convolutional-layer results are activated with the ReLU function, f(x) = max(0, x). The output feature maps of odd-numbered convolutional layers are stored in DDR3 (#1) and those of even-numbered layers in DDR3 (#0). During the convolution of any single layer, one DDR3 is only read while the other is only written.
S6: write the final convolutional-layer results into the fully-connected-layer input buffer.
The results of the last convolutional layer are written directly into the fully-connected-layer input buffer, with no need to write them to the off-chip DDR3. The last convolutional layer has 512 output channels; matching the number of processing arrays in a group, the data of every 32 output channels is gathered into one sector, for 16 sectors in total. Within each sector, every 32 consecutively stored pixels come from the same position of the 32 output channels.
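The resulting buffer layout can be expressed as a simple index computation. This is a sketch under our own assumptions (position-major ordering inside each sector, a 7×7 output map per channel); the names are illustrative:

```python
CHANNELS_PER_SECTOR = 32     # matches the 32 processing arrays of a group
PIXELS_PER_CHANNEL = 7 * 7   # output map of the last convolutional layer

def fc_buffer_index(channel: int, position: int) -> int:
    """Linear index in the fully-connected input buffer for the pixel of
    `channel` (0..511) at flattened spatial `position` (0..48)."""
    sector = channel // CHANNELS_PER_SECTOR          # which group of 32 channels
    lane = channel % CHANNELS_PER_SECTOR             # channel within the sector
    sector_size = CHANNELS_PER_SECTOR * PIXELS_PER_CHANNEL
    return sector * sector_size + position * CHANNELS_PER_SECTOR + lane
```

Indices 0..31 then hold position 0 of channels 0..31, i.e. 32 consecutive pixels from the same position, as described.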
S7: perform the fully connected operation, concurrently with S3–S6.
The multiply-accumulate process of the fully connected layer involves up to 25088 accumulations. Using only a single register would easily cause the "large numbers swallowing small numbers" problem. In the disclosure the partial results are distributed over 4 registers, avoiding repeated accumulation into one register and reducing the computation error.
The fully connected MAC unit needs 4 clock cycles from data input to result output; the 4 pipeline stages are simplified into 4 registers, and 1 additional outer register is added, forming a 5-stage pipeline. During the computation, the MAC unit never stops receiving the next cycle's data, whether or not it has produced a result. Each register stores the partial multiply-accumulate result of the data and weights of a different position. If in some cycle the input feature map or weight data is invalid, zeros are fed in and accumulated until valid data arrives.
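The never-stalling behavior with zero insertion can be sketched as a cycle-by-cycle simulation (illustrative only; the real unit is a 5-stage hardware pipeline):

```python
def mac_pipeline_sum(products, valid, n_regs=4):
    """Each cycle the MAC accepts a product unconditionally; when the input
    feature or weight is flagged invalid, zero is accumulated instead, so
    the pipeline never stops. Partial sums rotate over n_regs registers."""
    regs = [0.0] * n_regs
    for cycle, (p, ok) in enumerate(zip(products, valid)):
        regs[cycle % n_regs] += p if ok else 0.0   # zero-fill on invalid cycles
    return sum(regs)
```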
S8: output the fully connected results to DDR3.
The fully connected layer finally produces 1000 half-precision floating-point results, which are written into bank4 of the DDR3.
S9: if this is not the last image, repeat S7 and S8.
S10: read the computation results of the entire convolutional neural network from DDR3.
Figure 13 shows a block diagram of a convolutional neural network acceleration apparatus according to an embodiment of the disclosure. As shown in Figure 13, the apparatus includes:
an input feature map reading module 41, configured to read the input feature maps of a convolutional layer;
a processing array group module 42, configured to input the input feature maps into the processing array group of the convolutional layer and, according to the first reference pixel vector, the second reference pixel vector, the convolution kernel weights and the data of completed input channels, perform the convolution multiply-accumulate operations with the partial-sum-propagating MAC units to obtain the output results of the processing array group, wherein the processing array group includes multiple processing arrays, each processing array includes three row arrays, the convolution window has M rows, the first reference pixel vector is taken from the first M−N rows of the convolution window and the second reference pixel vector from the last N rows, M and N being positive integers with M > N;
a convolutional-layer output feature map determining module 43, configured to obtain the output feature maps of the convolutional layer from the output results of the processing array group;
a fully-connected-layer input buffering module 44, configured to write the output feature maps of the last convolutional layer into the fully-connected-layer input buffer;
a fully-connected-layer output feature vector obtaining module 45, configured to have the fully connected layer perform multiply-accumulate operations according to the output feature maps of the last convolutional layer to obtain the output feature vector of the fully connected layer;
a fully-connected-layer output feature vector output module 46, configured to output the output feature vector of the last fully connected layer into the fourth partition of the first memory.
Figure 14 shows a block diagram of a convolutional neural network acceleration apparatus according to an embodiment of the disclosure. As shown in Figure 14, the difference from the above implementation lies in the following.
In one possible implementation, the input feature map reading module 41 includes:
a convolution window reading submodule 411, configured so that while the input feature map of the current input channel read by one of the first and second convolution windows is used for the convolution multiply-accumulate operations, the other convolution window reads the input feature map of the next input channel;
a convolution window filling submodule 412, configured to read input feature map data from the top-row cache and the right cache when the first or second convolution window needs to be filled;
a weight register reading submodule 413, configured so that while the convolution weights of the current group of output channels read by one of the first and second weight registers are used for the convolution multiply-accumulate operations, the other register reads the convolution weights of the next group of output channels.
In one possible implementation, the apparatus further includes:
an input image writing module 47, configured to write the input image into the first partition of the first memory as the input feature map of the first convolutional layer.
In one possible implementation, the apparatus further includes:
a parameter writing module 48, configured to write the parameters of the odd-numbered convolutional and fully connected layers into the second partition of the first memory, and the parameters of the even-numbered convolutional and fully connected layers into the second partition of the second memory; the parameters include weights and biases.
In one possible implementation, the convolutional-layer output feature map determining module 43 includes:
an output submodule 431, configured to, when the convolutional layer is not the last layer, store the output feature maps of even-numbered convolutional layers in the third partition of the first memory and the output feature maps of odd-numbered convolutional layers in the third partition of the second memory.
In one possible implementation, the fully-connected-layer output feature vector obtaining module 45 includes:
an input feature vector obtaining submodule 451, configured to obtain the input feature vector of the fully connected layer from the output feature maps of the last convolutional layer;
a fully connected processing submodule 452, configured to input the input feature vector of the fully connected layer and the weights of the fully connected layer into the fully connected processing unit for processing to obtain the output feature vector of the fully connected layer, the fully connected processing unit including a MAC unit and L serially connected registers, L being an integer greater than 2; in each clock cycle, the first through (L−1)-th of the L serially connected registers transfer the data stored in the previous clock cycle to the next-stage register, while the data of the previous clock cycle stored in the L-th register is fed through the MAC unit into the first memory.
In one possible implementation, the fully-connected-layer output feature vector obtaining module 45 further includes:
a zero-padding submodule 453, configured to, when the input feature vector of the fully connected layer or the fully connected weights are invalid data, feed zeros into the MAC unit and keep accumulating until valid data arrives.
Implementation Example 2:
Fig. 8 shows a block diagram of the convolutional neural network hardware structure according to an embodiment of the disclosure. As shown in Fig. 8, the hardware structure of the FPGA-based VGG16 convolutional neural network accelerator proposed in the disclosure includes:
(1) a convolutional-layer input buffer group, including two convolution windows, two weight registers, a top-row cache and a right cache;
(2) 32 convolutional-layer processing arrays, which use the PP-MAC (Propagate Partial Multiplier-Accumulator) structure and complete the convolution multiply-accumulate operations according to control instructions;
(3) 32 convolutional-layer output buffers, each 3136 pixels in size;
(4) a fully-connected-layer input buffer, 512×7×7 pixels in size;
(5) a fully-connected-layer processing array, which completes the fully connected multiply-accumulate operations according to control instructions;
(6) two 4 GB DDR3 memories.
The input data, the parameters and the multiply-accumulate process of the convolution all use half-precision floating-point numbers.
The processing arrays of the convolutional and fully connected layers use different computation structures and can simultaneously execute the convolutional-layer and fully-connected-layer computations of different images.
The convolutional-layer feature maps are partitioned into blocks; since the side length of every convolutional-layer feature map is a multiple of 7, feature map data is organized at a minimum granularity of 7×7 pixels. The feature maps are stored in DDR3 as 7×7 pixel blocks, aligned to the DDR3 burst (512 bits). Taking the pooling operation into account, the basic unit of convolutional computation is 14×14 pixels, and the convolution kernel size is 3×3 pixels.
The weight registers are 32×3×3 pixels in size; to fit the convolution kernel, the convolution window must be extended to 16×16 pixels; the top-row cache is 1024×16 pixels and the right cache 512×15×8 pixels.
The two convolution windows and weight registers implement ping-pong operation: while one set of input feature map or weight parameter data participates in the computation, the data of the other set is being read.
The convolutional-layer processing arrays use the PP-MAC (Propagate Partial Multiplier-Accumulator) structure; 32 processing arrays form a processing array group that processes different output channels in parallel.
Let I_i(R, C) denote the input pixel of the i-th feature map at position (R, C), and let r denote the current round number. The task completed by the j-th processing array is:

O_{32r+j}(R, C) = Σ_i I_i(R, C) ⊛ k_{i,32r+j}

where O_{32r+j}(R, C) is the pixel of the (32r+j)-th output feature map at position (R, C), ⊛ denotes convolution, and k_{i,32r+j} denotes the weights.
The reference pixel vectors r0 and r1 are 1×3 feature map data; r0 is taken from the first 14 rows of the convolution window and r1 only from the last two rows, r1 being valid only when the convolution position switches in the column direction. The feature map is broadcast into the processing arrays for convolution. One processing array consists of 3 one-dimensional row arrays, each composed of 3 multiplier units and one 3-operand adder unit. In the three row arrays, the 1×3 input feature map data is multiplied by the three 1×3 weight rows of the 3×3 convolution kernel, performing the convolution of three adjacent pixel positions. Because the three row arrays have different processing delays, scratch registers are used to keep the partial sums synchronized.
For the convolutional layers that require pooling, the pooling operation is embedded into the convolution: within a 14×14 convolution unit, after the convolutions of two vertically adjacent pixel positions are finished, only the larger result is kept; when the convolutions of the other two pixel positions of the 2×2 block are performed, the results are compared with the previously retained one and only the maximum convolution result is kept, yielding a 7×7 pooled output matrix.
The results of a processing array group are stored in the convolutional-layer output buffer, waiting to be accumulated with the results of the other input channels; at the start of a convolutional layer, the bias is first stored in the output buffer as the initial value of the partial sum. The partial sums of different convolution positions of the same output channel are stored in different output buffers, which facilitates reading the partial-sum results in subsequent operations.
In the fully-connected-layer input buffer, the feature maps of every 32 output channels are stored together, forming one sector. Within each sector, every 32 consecutively stored pixels come from the same position of the 32 output channels.
The multiply-accumulate results of the fully connected layer are distributed over 4 registers, avoiding repeated accumulation into one register. The fully connected MAC unit needs 4 clock cycles from data input to result output; the 4 pipeline stages are simplified into 4 registers, and 1 additional outer register is added, forming a 5-stage pipeline. During the computation, the MAC unit never stops receiving the next cycle's data, whether or not it has produced a result; each register stores the partial multiply-accumulate result of the data and weights of a different position. If in some cycle the input feature map or weight data is invalid, zeros are fed in and accumulated until valid data arrives.
The input feature maps, biases and weight parameters of odd-numbered layers are stored in the first DDR3 (#0); the input feature maps of even-numbered layers (i.e., the output feature maps of odd-numbered layers), together with the biases and weight parameters, are stored in the second DDR3 (#1). During the convolution of any given layer, one DDR3 is only read while the other is only written; the input feature maps and biases are stored in DDR3 banks different from those of the weight parameters.
Exploiting the configurability and the low-power, high-performance computing characteristics of FPGAs, the disclosure realizes a VGG16 convolutional neural network hardware acceleration structure with the following advantages:
(1) Half-precision (16-bit) floating-point computation replaces conventional single-precision (32-bit) computation, effectively reducing hardware resources and power consumption while preserving the accuracy of the VGG16 network algorithm.
(2) With the ping-pong operation of the double convolution windows and weight registers, convolution and data reading proceed simultaneously, reducing accesses to the off-chip memory and improving the processing speed.
(3) With the ping-pong operation of the double DDR3 memories, during a convolution operation one DDR3 is only read and the other only written, reducing read/write turnarounds and effectively lowering the power consumption of the hardware structure. Within one DDR3, the feature maps and parameters are stored in different banks, avoiding the high latency of row switches when reading weight parameters and feature map data.
(4) According to the computational characteristics of the convolutional and fully connected layers, different processing structures are designed separately, so that the convolutional-layer and fully-connected-layer computations of different images can be executed simultaneously, and the MAC units of the convolutional and fully connected layers are pipeline-optimized, improving the processing speed of the accelerator.
Figure 15 is a block diagram of an apparatus 1900 for convolutional neural network acceleration according to an exemplary embodiment. For example, the apparatus 1900 may be provided as a server. Referring to Figure 15, the apparatus 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. An application program stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. The processing component 1922 is configured to execute the instructions to perform the above method.
The apparatus 1900 may also include a power supply component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the apparatus 1900 to complete the above method.
The disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction-executing device. The computer-readable storage medium may be, for example, but is not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission media (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.
Computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical-fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out operations of the disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In scenarios involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA) or a programmable logic array (PLA), is personalized by utilizing state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions, thereby implementing aspects of the disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices, causing a series of operational steps to be performed on the computer, other programmable apparatus, or other devices so as to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other devices implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which comprises one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by combinations of dedicated hardware and computer instructions.
Embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (16)
1. A convolutional neural network acceleration method, characterized in that the method comprises:
reading an input feature map of a convolutional layer;
inputting the input feature map into a processing array group of the convolutional layer, and performing convolution multiply-accumulate operations with partial-sum-propagating multiply-accumulators according to a first reference pixel vector, a second reference pixel vector, convolution kernel weights, and data of completed input channels, to obtain an output result of the processing array group, wherein the processing array group comprises a plurality of processing arrays, each processing array comprises three processing rows, the number of rows of a convolution window is M, the first reference pixel vector is the first M-N rows of the convolution window, the second reference pixel vector is the last N rows of the convolution window, M and N are positive integers, and M>N;
obtaining an output feature map of the convolutional layer according to the output result of the processing array group;
writing the output feature map of the last convolutional layer into an input cache of a fully connected layer;
performing, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer, to obtain an output feature vector of the fully connected layer;
outputting the output feature vector of the last fully connected layer into a fourth partition of a first memory.
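Stripped of the hardware specifics (the reference pixel vectors and the M/N row split of the convolution window), the multiply-accumulate flow of claim 1 reduces to carrying a partial sum across input channels for each output pixel. A minimal software sketch of that channel-wise accumulation, with all function and variable names chosen for illustration rather than taken from the patent:

```python
import numpy as np

def conv_layer_partial_sums(ifmaps, kernels):
    """Sliding-window convolution that builds each output pixel by
    propagating a partial sum across completed input channels, as in
    claim 1 (stride 1, no padding, no bias)."""
    C, H, W = ifmaps.shape          # input channels, height, width
    K, _, M, _ = kernels.shape      # output channels, C, M x M window
    oh, ow = H - M + 1, W - M + 1
    out = np.zeros((K, oh, ow))
    for k in range(K):              # one output channel at a time
        for y in range(oh):
            for x in range(ow):
                psum = 0.0          # partial sum carried between channels
                for c in range(C):  # "data of completed input channels"
                    window = ifmaps[c, y:y + M, x:x + M]
                    psum += float(np.sum(window * kernels[k, c]))
                out[k, y, x] = psum
    return out
```

In the patent's hardware this inner accumulation is done by the partial-sum-propagating multiply-accumulators of the processing array group; the sketch only mirrors the dataflow order, not the parallel array layout.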
2. The method according to claim 1, characterized in that reading the input feature map of the convolutional layer comprises:
when one of a first convolution window and a second convolution window has read the input feature map of a current input channel and is being used to perform convolution multiply-accumulate operations, reading, with the other convolution window, the input feature map of the input channel following the current input channel;
when the first convolution window or the second convolution window needs to be filled, reading the input feature map from a top-row cache and from a right cache, respectively;
when the convolution weights of a current group of output channels, read by one of a first weight register and a second weight register, are being used to perform convolution multiply-accumulate operations, reading, with the other register, the convolution weights of the group of output channels following the current group.
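Claim 2 applies the same ping-pong (double-buffering) pattern twice: to the pair of convolution windows and to the pair of weight registers — while one buffer is consumed by the multiply-accumulate stage, the other is refilled with the next channel or weight group. A sequential sketch of that pattern (names assumed for illustration; in hardware the refill and the compute run concurrently rather than back-to-back):

```python
def double_buffered(reader, compute):
    """Ping-pong scheme of claim 2: alternate two buffers so that the
    buffer being computed on is never the one being refilled."""
    buffers = [None, None]
    results = []
    buffers[0] = next(reader, None)   # prefill buffer 0
    active = 0
    while buffers[active] is not None:
        # In hardware these two steps overlap in time:
        buffers[1 - active] = next(reader, None)   # refill the idle buffer
        results.append(compute(buffers[active]))   # consume the active one
        active = 1 - active                        # swap roles
    return results
```

For example, `double_buffered(iter(channel_maps), mac_stage)` would process channel 0 while channel 1 is loaded, and so on; the same skeleton applies to the two weight registers.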
3. The method according to claim 1, characterized in that the method further comprises:
writing an input image into a first partition of the first memory as the input feature map of the first convolutional layer.
4. The method according to claim 1, characterized in that the method further comprises:
writing the parameters of the odd-numbered layers among the convolutional layers and fully connected layers into a second partition of the first memory;
writing the parameters of the even-numbered layers among the convolutional layers and fully connected layers into a second partition of a second memory;
wherein the parameters include weights and biases.
5. The method according to claim 1, characterized in that obtaining the output feature map of the convolutional layer according to the output result of the processing array group comprises:
when the convolutional layer is not the last layer, storing the output feature maps of even-numbered convolutional layers in a third partition of the first memory, and storing the output feature maps of odd-numbered convolutional layers in a third partition of the second memory.
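Claims 3 to 5 together imply a memory map across the two memories: partition 1 holds the input image, partition 2 the layer parameters, partition 3 the intermediate feature maps (ping-ponged by layer parity, so a layer always reads from one memory while writing the other), and partition 4 the final output. A sketch of the routing decision, with the memory and partition names invented for illustration:

```python
def feature_map_destination(layer_index, is_last_layer):
    """Where a layer's output feature map is written, following the
    partition scheme of claims 1, 3 and 5. Layer numbering here is
    illustrative; the claims only fix the parity alternation."""
    if is_last_layer:
        # The final output feature vector goes to the first memory's
        # fourth partition (claim 1).
        return ("memory1", "partition4")
    if layer_index % 2 == 0:
        # Even-numbered layer: third partition of the first memory.
        return ("memory1", "partition3")
    # Odd-numbered layer: third partition of the second memory.
    return ("memory2", "partition3")
```

The parity split means consecutive layers never contend for the same memory port: layer k reads its input from the memory layer k-1 wrote to, and writes its own output to the other one.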
6. The method according to claim 1, characterized in that performing, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer to obtain the output feature vector of the fully connected layer comprises:
obtaining an input feature vector of the fully connected layer according to the output feature map of the last convolutional layer;
inputting the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit for processing, to obtain the output feature vector of the fully connected layer, wherein the fully connected processing unit comprises multiply-accumulators and L registers connected in series, L being an integer greater than 2;
in each clock cycle, transferring the data of the previous clock cycle stored in the first through (L-1)-th of the L series-connected registers to the next-stage register, and inputting the data of the previous clock cycle stored in the L-th register into the first memory through a multiply-accumulator.
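The fully connected processing unit of claim 6 behaves like a shift-register chain feeding a multiply-accumulator: each clock cycle every register passes last cycle's value downstream, and the value leaving the L-th register is accumulated toward the first memory. A cycle-level software sketch, under the assumption (not fixed by the claim) that each stage carries one input-weight product:

```python
def fc_pipeline(inputs, weights, L=4):
    """Simulate the L series-connected registers of claim 6. Each cycle
    the chain shifts by one stage; the value exiting register L is
    accumulated. Flush cycles drain the chain at the end."""
    assert L > 2                     # per the claim, L is an integer > 2
    regs = [0.0] * L                 # register chain; regs[0] is stage 1
    acc = 0.0
    # Feed one (input, weight) pair per cycle, then L zero cycles to flush.
    stream = list(zip(inputs, weights)) + [(0.0, 1.0)] * L
    for x, w in stream:
        out = regs[-1]               # data stored in the L-th register
        acc += out                   # accumulate toward the first memory
        regs = [x * w] + regs[:-1]   # shift: stage i -> stage i+1
    return acc
```

After the flush, `acc` equals the dot product of the input feature vector with the weight vector, which is the fully connected layer's contribution for one output neuron.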
7. The method according to claim 6, characterized in that performing, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer to obtain the output feature vector of the fully connected layer further comprises:
when the input feature vector of the fully connected layer or the fully connected weights are invalid data, padding the input of the multiply-accumulator with zeros and continuing to accumulate until valid data arrives.
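Claim 7's zero-padding rule keeps the accumulator valid across pipeline bubbles: an invalid operand contributes zero to the running sum instead of stalling the pipeline or corrupting the result. A sketch of the rule, modelling invalid data as `None` (the representation is an assumption; hardware would use a valid flag):

```python
def mac_with_invalid(pairs):
    """Multiply-accumulate over (input, weight) pairs where either
    operand may be invalid (None). Invalid cycles inject zero, as in
    claim 7, so accumulation continues until valid data arrives."""
    acc = 0.0
    for x, w in pairs:
        if x is None or w is None:
            acc += 0.0               # zero-pad the MAC input
        else:
            acc += x * w             # normal multiply-accumulate
    return acc
```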
8. A convolutional neural network acceleration device, characterized by comprising:
an input feature map reading module, configured to read an input feature map of a convolutional layer;
a processing array group module, configured to input the input feature map into a processing array group of the convolutional layer and to perform convolution multiply-accumulate operations with partial-sum-propagating multiply-accumulators according to a first reference pixel vector, a second reference pixel vector, convolution kernel weights, and data of completed input channels, to obtain an output result of the processing array group, wherein the processing array group comprises a plurality of processing arrays, each processing array comprises three processing rows, the number of rows of a convolution window is M, the first reference pixel vector is the first M-N rows of the convolution window, the second reference pixel vector is the last N rows of the convolution window, M and N are positive integers, and M>N;
a convolutional layer output feature map determining module, configured to obtain an output feature map of the convolutional layer according to the output result of the processing array group;
a fully connected layer input cache module, configured to write the output feature map of the last convolutional layer into an input cache of a fully connected layer;
a fully connected layer output feature vector obtaining module, configured to perform, in the fully connected layer, multiply-accumulate operations according to the output feature map of the last convolutional layer to obtain an output feature vector of the fully connected layer;
a fully connected layer output feature vector output module, configured to output the output feature vector of the last fully connected layer into a fourth partition of a first memory.
9. The device according to claim 8, characterized in that the input feature map reading module comprises:
a convolution window reading submodule, configured to, when one of a first convolution window and a second convolution window has read the input feature map of a current input channel and is being used to perform convolution multiply-accumulate operations, read, with the other convolution window, the input feature map of the input channel following the current input channel;
a convolution window filling submodule, configured to, when the first convolution window or the second convolution window needs to be filled, read the input feature map from a top-row cache and from a right cache, respectively;
a weight register reading submodule, configured to, when the convolution weights of a current group of output channels, read by one of a first weight register and a second weight register, are being used to perform convolution multiply-accumulate operations, read, with the other register, the convolution weights of the group of output channels following the current group.
10. The device according to claim 8, characterized by further comprising:
an input image writing module, configured to write an input image into a first partition of the first memory as the input feature map of the first convolutional layer.
11. The device according to claim 8, characterized by further comprising:
a parameter writing module, configured to write the parameters of the odd-numbered layers among the convolutional layers and fully connected layers into a second partition of the first memory, and to write the parameters of the even-numbered layers among the convolutional layers and fully connected layers into a second partition of a second memory, wherein the parameters include weights and biases.
12. The device according to claim 8, characterized in that the convolutional layer output feature map determining module comprises:
an output submodule, configured to, when the convolutional layer is not the last layer, store the output feature maps of even-numbered convolutional layers in a third partition of the first memory, and store the output feature maps of odd-numbered convolutional layers in a third partition of the second memory.
13. The device according to claim 8, characterized in that the fully connected layer output feature vector obtaining module comprises:
an input feature vector obtaining submodule, configured to obtain an input feature vector of the fully connected layer according to the output feature map of the last convolutional layer;
a fully connected processing submodule, configured to input the input feature vector of the fully connected layer and the weights of the fully connected layer into a fully connected processing unit for processing, to obtain the output feature vector of the fully connected layer, wherein the fully connected processing unit comprises multiply-accumulators and L registers connected in series, L being an integer greater than 2; in each clock cycle, the data of the previous clock cycle stored in the first through (L-1)-th of the L series-connected registers is transferred to the next-stage register, and the data of the previous clock cycle stored in the L-th register is input into the first memory through a multiply-accumulator.
14. The device according to claim 13, characterized in that the fully connected layer output feature vector obtaining module further comprises:
a zero-padding submodule, configured to, when the input feature vector of the fully connected layer or the fully connected weights are invalid data, pad the input of the multiply-accumulator with zeros and continue to accumulate until valid data arrives.
15. A convolutional neural network acceleration device, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 7.
16. A non-volatile computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810028998.4A CN108133270B (en) | 2018-01-12 | 2018-01-12 | Convolutional neural network acceleration method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810028998.4A CN108133270B (en) | 2018-01-12 | 2018-01-12 | Convolutional neural network acceleration method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108133270A true CN108133270A (en) | 2018-06-08 |
CN108133270B CN108133270B (en) | 2020-08-04 |
Family
ID=62400444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810028998.4A Active CN108133270B (en) | 2018-01-12 | 2018-01-12 | Convolutional neural network acceleration method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108133270B (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109032781A (en) * | 2018-07-13 | 2018-12-18 | 重庆邮电大学 | A kind of FPGA parallel system of convolutional neural networks algorithm |
CN109063825A (en) * | 2018-08-01 | 2018-12-21 | 清华大学 | Convolutional neural networks accelerator |
CN109272113A (en) * | 2018-09-13 | 2019-01-25 | 深思考人工智能机器人科技(北京)有限公司 | A kind of convolutional neural networks establish device and method |
CN109409514A (en) * | 2018-11-02 | 2019-03-01 | 广州市百果园信息技术有限公司 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
CN109597965A (en) * | 2018-11-19 | 2019-04-09 | 深圳力维智联技术有限公司 | Data processing method, system, terminal and medium based on deep neural network |
CN109740731A (en) * | 2018-12-15 | 2019-05-10 | 华南理工大学 | A kind of adaptive convolutional layer hardware accelerator design method |
CN109740732A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural networks data multiplexing method and relevant device |
CN109919312A (en) * | 2019-03-29 | 2019-06-21 | 北京智芯微电子科技有限公司 | Operation method, device and the DPU of convolutional neural networks |
CN109993303A (en) * | 2019-03-29 | 2019-07-09 | 河南九乾电子科技有限公司 | Computer accelerator for neural network and deep learning |
CN110059797A (en) * | 2018-10-10 | 2019-07-26 | 北京中科寒武纪科技有限公司 | A kind of computing device and Related product |
CN110096993A (en) * | 2019-04-28 | 2019-08-06 | 深兰科技(上海)有限公司 | The object detection apparatus and method of binocular stereo vision |
CN110377781A (en) * | 2019-06-06 | 2019-10-25 | 福建讯网网络科技股份有限公司 | A kind of matched innovatory algorithm of application sole search |
CN110533177A (en) * | 2019-08-22 | 2019-12-03 | 安谋科技(中国)有限公司 | A kind of data read-write equipment, method, equipment, medium and convolution accelerator |
CN110766128A (en) * | 2018-07-26 | 2020-02-07 | 北京深鉴智能科技有限公司 | Convolution calculation unit, calculation method and neural network calculation platform |
CN110826707A (en) * | 2018-08-10 | 2020-02-21 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
CN110888824A (en) * | 2018-09-07 | 2020-03-17 | 黑芝麻智能科技(上海)有限公司 | Multilevel memory hierarchy |
WO2020062284A1 (en) * | 2018-09-30 | 2020-04-02 | 深圳市大疆创新科技有限公司 | Convolutional neural network-based image processing method and device, and unmanned aerial vehicle |
CN110989920A (en) * | 2018-10-03 | 2020-04-10 | 马克西姆综合产品公司 | Energy efficient memory system and method |
CN111340201A (en) * | 2018-12-19 | 2020-06-26 | 北京地平线机器人技术研发有限公司 | Convolutional neural network accelerator and method for performing convolutional operation thereof |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111931918A (en) * | 2020-09-24 | 2020-11-13 | 深圳佑驾创新科技有限公司 | Neural network accelerator |
CN112099737A (en) * | 2020-09-29 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, device and equipment for storing data and storage medium |
CN112132274A (en) * | 2020-09-22 | 2020-12-25 | 地平线(上海)人工智能技术有限公司 | Full-connection convolution method and device for feature graph, readable storage medium and electronic equipment |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | 中国电子科技集团公司第十五研究所 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
CN112862725A (en) * | 2021-03-12 | 2021-05-28 | 上海壁仞智能科技有限公司 | Method for computing, computing device and computer-readable storage medium |
CN113052291A (en) * | 2019-12-27 | 2021-06-29 | 上海商汤智能科技有限公司 | Data processing method and device |
CN113887720A (en) * | 2021-09-29 | 2022-01-04 | 杭州电子科技大学 | Up-sampling reverse blocking mapping method |
CN112513885B (en) * | 2018-06-22 | 2024-02-27 | 三星电子株式会社 | Neural processor |
KR102659202B1 (en) * | 2021-09-15 | 2024-04-22 | 한국항공대학교산학협력단 | System and method for recognizing CNN-based human behavior using Wi-Fi signals |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170102950A1 (en) * | 2003-05-23 | 2017-04-13 | Ip Reservoir, Llc | Intelligent Data Storage and Processing Using FPGA Devices |
CN106844294A (en) * | 2016-12-29 | 2017-06-13 | 华为机器有限公司 | Convolution algorithm chip and communication equipment |
CN106874219A (en) * | 2016-12-23 | 2017-06-20 | 深圳云天励飞技术有限公司 | A kind of data dispatching method of convolutional neural networks, system and computer equipment |
CN106940815A (en) * | 2017-02-13 | 2017-07-11 | 西安交通大学 | A kind of programmable convolutional neural networks Crypto Coprocessor IP Core |
US9721203B1 (en) * | 2016-11-10 | 2017-08-01 | Google Inc. | Performing kernel striding in hardware |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107229967A (en) * | 2016-08-22 | 2017-10-03 | 北京深鉴智能科技有限公司 | FPGA-based hardware accelerator and method for sparse GRU neural networks |
CN107341544A (en) * | 2017-06-30 | 2017-11-10 | 清华大学 | A kind of reconfigurable accelerator and its implementation based on divisible array |
CN107463990A (en) * | 2016-06-02 | 2017-12-12 | 国家计算机网络与信息安全管理中心 | A kind of FPGA parallel acceleration methods of convolutional neural networks |
-
2018
- 2018-01-12 CN CN201810028998.4A patent/CN108133270B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170102950A1 (en) * | 2003-05-23 | 2017-04-13 | Ip Reservoir, Llc | Intelligent Data Storage and Processing Using FPGA Devices |
CN107463990A (en) * | 2016-06-02 | 2017-12-12 | 国家计算机网络与信息安全管理中心 | A kind of FPGA parallel acceleration methods of convolutional neural networks |
CN107229967A (en) * | 2016-08-22 | 2017-10-03 | 北京深鉴智能科技有限公司 | FPGA-based hardware accelerator and method for sparse GRU neural networks |
US9721203B1 (en) * | 2016-11-10 | 2017-08-01 | Google Inc. | Performing kernel striding in hardware |
CN106874219A (en) * | 2016-12-23 | 2017-06-20 | 深圳云天励飞技术有限公司 | A kind of data dispatching method of convolutional neural networks, system and computer equipment |
CN106844294A (en) * | 2016-12-29 | 2017-06-13 | 华为机器有限公司 | Convolution algorithm chip and communication equipment |
CN106940815A (en) * | 2017-02-13 | 2017-07-11 | 西安交通大学 | A kind of programmable convolutional neural networks Crypto Coprocessor IP Core |
CN107066239A (en) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | A kind of hardware configuration for realizing convolutional neural networks forward calculation |
CN107341544A (en) * | 2017-06-30 | 2017-11-10 | 清华大学 | A kind of reconfigurable accelerator and its implementation based on divisible array |
Non-Patent Citations (3)
Title |
---|
KAIYUAN GUO et al.: "Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems * |
WAN Guochun et al.: Digital System Design Methods and Practice, Tongji University Press, 31 October 2015 * |
YU Zijian: "FPGA-based Convolutional Neural Network Accelerator", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112513885B (en) * | 2018-06-22 | 2024-02-27 | 三星电子株式会社 | Neural processor |
CN109032781A (en) * | 2018-07-13 | 2018-12-18 | 重庆邮电大学 | A kind of FPGA parallel system of convolutional neural networks algorithm |
CN110766128A (en) * | 2018-07-26 | 2020-02-07 | 北京深鉴智能科技有限公司 | Convolution calculation unit, calculation method and neural network calculation platform |
CN109063825A (en) * | 2018-08-01 | 2018-12-21 | 清华大学 | Convolutional neural networks accelerator |
CN109063825B (en) * | 2018-08-01 | 2020-12-29 | 清华大学 | Convolutional neural network accelerator |
CN110826707A (en) * | 2018-08-10 | 2020-02-21 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
CN110826707B (en) * | 2018-08-10 | 2023-10-31 | 北京百度网讯科技有限公司 | Acceleration method and hardware accelerator applied to convolutional neural network |
CN110888824A (en) * | 2018-09-07 | 2020-03-17 | 黑芝麻智能科技(上海)有限公司 | Multilevel memory hierarchy |
CN109272113A (en) * | 2018-09-13 | 2019-01-25 | 深思考人工智能机器人科技(北京)有限公司 | A kind of convolutional neural networks establish device and method |
CN109272113B (en) * | 2018-09-13 | 2022-04-19 | 深思考人工智能机器人科技(北京)有限公司 | Convolutional neural network establishing device and method based on channel |
WO2020062284A1 (en) * | 2018-09-30 | 2020-04-02 | 深圳市大疆创新科技有限公司 | Convolutional neural network-based image processing method and device, and unmanned aerial vehicle |
CN110989920B (en) * | 2018-10-03 | 2024-02-06 | 马克西姆综合产品公司 | Energy efficient memory system and method |
CN110989920A (en) * | 2018-10-03 | 2020-04-10 | 马克西姆综合产品公司 | Energy efficient memory system and method |
CN110059797A (en) * | 2018-10-10 | 2019-07-26 | 北京中科寒武纪科技有限公司 | A kind of computing device and Related product |
CN109409514A (en) * | 2018-11-02 | 2019-03-01 | 广州市百果园信息技术有限公司 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
CN109597965A (en) * | 2018-11-19 | 2019-04-09 | 深圳力维智联技术有限公司 | Data processing method, system, terminal and medium based on deep neural network |
CN109597965B (en) * | 2018-11-19 | 2023-04-18 | 深圳力维智联技术有限公司 | Data processing method, system, terminal and medium based on deep neural network |
CN109740731A (en) * | 2018-12-15 | 2019-05-10 | 华南理工大学 | A kind of adaptive convolutional layer hardware accelerator design method |
CN111340201A (en) * | 2018-12-19 | 2020-06-26 | 北京地平线机器人技术研发有限公司 | Convolutional neural network accelerator and method for performing convolutional operation thereof |
WO2020134546A1 (en) * | 2018-12-27 | 2020-07-02 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural network data multiplexing method and related device |
CN109740732A (en) * | 2018-12-27 | 2019-05-10 | 深圳云天励飞技术有限公司 | Neural network processor, convolutional neural networks data multiplexing method and relevant device |
CN109919312B (en) * | 2019-03-29 | 2021-04-23 | 北京智芯微电子科技有限公司 | Operation method and device of convolutional neural network and DPU |
CN109993303A (en) * | 2019-03-29 | 2019-07-09 | 河南九乾电子科技有限公司 | Computer accelerator for neural network and deep learning |
CN109919312A (en) * | 2019-03-29 | 2019-06-21 | 北京智芯微电子科技有限公司 | Operation method, device and the DPU of convolutional neural networks |
CN109993303B (en) * | 2019-03-29 | 2022-09-23 | 河南九乾电子科技有限公司 | Computer accelerator for neural network and deep learning |
CN110096993A (en) * | 2019-04-28 | 2019-08-06 | 深兰科技(上海)有限公司 | The object detection apparatus and method of binocular stereo vision |
CN110377781A (en) * | 2019-06-06 | 2019-10-25 | 福建讯网网络科技股份有限公司 | A kind of matched innovatory algorithm of application sole search |
CN110533177A (en) * | 2019-08-22 | 2019-12-03 | 安谋科技(中国)有限公司 | A kind of data read-write equipment, method, equipment, medium and convolution accelerator |
CN110533177B (en) * | 2019-08-22 | 2023-12-26 | 安谋科技(中国)有限公司 | Data read-write device, method, equipment, medium and convolution accelerator |
CN113052291B (en) * | 2019-12-27 | 2024-04-16 | 上海商汤智能科技有限公司 | Data processing method and device |
CN113052291A (en) * | 2019-12-27 | 2021-06-29 | 上海商汤智能科技有限公司 | Data processing method and device |
CN111898733B (en) * | 2020-07-02 | 2022-10-25 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN111898733A (en) * | 2020-07-02 | 2020-11-06 | 西安交通大学 | Deep separable convolutional neural network accelerator architecture |
CN112132274A (en) * | 2020-09-22 | 2020-12-25 | 地平线(上海)人工智能技术有限公司 | Full-connection convolution method and device for feature graph, readable storage medium and electronic equipment |
CN111931918B (en) * | 2020-09-24 | 2021-02-12 | 深圳佑驾创新科技有限公司 | Neural network accelerator |
CN111931918A (en) * | 2020-09-24 | 2020-11-13 | 深圳佑驾创新科技有限公司 | Neural network accelerator |
CN112099737B (en) * | 2020-09-29 | 2023-09-01 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for storing data |
CN112099737A (en) * | 2020-09-29 | 2020-12-18 | 北京百度网讯科技有限公司 | Method, device and equipment for storing data and storage medium |
CN112734020A (en) * | 2020-12-28 | 2021-04-30 | 中国电子科技集团公司第十五研究所 | Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network |
CN112862725A (en) * | 2021-03-12 | 2021-05-28 | 上海壁仞智能科技有限公司 | Method for computing, computing device and computer-readable storage medium |
CN112862725B (en) * | 2021-03-12 | 2023-10-27 | 上海壁仞智能科技有限公司 | Method for computing, computing device, and computer-readable storage medium |
KR102659202B1 (en) * | 2021-09-15 | 2024-04-22 | 한국항공대학교산학협력단 | System and method for recognizing CNN-based human behavior using Wi-Fi signals |
CN113887720A (en) * | 2021-09-29 | 2022-01-04 | 杭州电子科技大学 | Up-sampling reverse blocking mapping method |
CN113887720B (en) * | 2021-09-29 | 2024-04-26 | 杭州电子科技大学 | Upsampling reverse blocking mapping method |
Also Published As
Publication number | Publication date |
---|---|
CN108133270B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108133270A (en) | Convolutional neural networks accelerating method and device | |
US10657306B1 (en) | Deep learning testability analysis with graph convolutional networks | |
CN108268424A (en) | Heterogeneous hardware accelerator architecture for processing sparse matrix data with skewed non-zero distribution | |
CN107169563B (en) | Processing system and method applied to two-value weight convolutional network | |
WO2018227800A1 (en) | Neural network training method and device | |
CN108268422A (en) | Hardware accelerator architecture for processing very sparse and hypersparse matrix data | |
CN109032781A (en) | A kind of FPGA parallel system of convolutional neural networks algorithm | |
CN108268320A (en) | Hardware accelerator architecture and template for web-scale k-means clustering | |
CN106951962A (en) | Compound operation unit, method and electronic device for neural networks | |
CN110334357A (en) | A kind of method, apparatus, storage medium and electronic equipment for naming Entity recognition | |
CN107945204A (en) | A kind of Pixel-level portrait based on generation confrontation network scratches drawing method | |
CN108268423A (en) | Microarchitecture enabling enhanced parallelism for sparse linear algebra operations with write-to-read dependencies | |
CN109492666A (en) | Image recognition model training method, device and storage medium | |
CN108629406B (en) | Arithmetic device for convolutional neural network | |
US11651194B2 (en) | Layout parasitics and device parameter prediction using graph neural networks | |
WO2020156508A1 (en) | Method and device for operating on basis of chip with operation array, and chip | |
US11436017B2 (en) | Data temporary storage apparatus, data temporary storage method and operation method | |
CN108268931A (en) | The methods, devices and systems of data processing | |
Miyamoto et al. | Fast calculation of Haralick texture features | |
Solovyev et al. | Fixed-point convolutional neural network for real-time video processing in FPGA | |
US20220083857A1 (en) | Convolutional neural network operation method and device | |
CN109446996B (en) | Face recognition data processing device and method based on FPGA | |
CN110163363A (en) | A kind of computing device and method | |
CN112906865B (en) | Neural network architecture searching method and device, electronic equipment and storage medium | |
US20210350230A1 (en) | Data dividing method and processor for convolution operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |